In this tutorial, we will learn how to find a correlation (more precisely, an association) between categorical variables in Python. This is particularly useful when you want to understand the relationship between two qualitative variables in your data.
We will use the chi-square test of independence to determine whether there is a statistically significant association between two categorical variables.
Prerequisites
To follow this tutorial, you should have a basic knowledge of Python programming and the Pandas library, as well as some familiarity with statistics.
Step 1: Import the necessary libraries
Let’s start by importing the required libraries:
    import pandas as pd
    import numpy as np
    from scipy.stats import chi2_contingency
Step 2: Load the dataset
Load your dataset using the Pandas function pd.read_csv(). Replace the filename with the path to your dataset.
    data = pd.read_csv('your_dataset.csv')
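Before building a contingency table from a real file, it can help to drop rows with missing values and mark the two columns as categorical. The sketch below shows one possible way to do that. The CSV content and its column names are made up for illustration, and an in-memory buffer stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for 'your_dataset.csv'.
csv_text = """Occupation,Marital_status
Doctor,Married
Engineer,Single
Teacher,
Engineer,Married
"""

# pd.read_csv accepts a file path or any file-like object.
raw = pd.read_csv(io.StringIO(csv_text))

# Drop rows where either variable is missing, then mark both columns as categorical.
cols = ["Occupation", "Marital_status"]
clean = raw.dropna(subset=cols).astype({c: "category" for c in cols})

print(clean.dtypes)
print(clean["Occupation"].value_counts())
```

Dropping incomplete rows is only one option. Depending on the data, treating missing values as their own category may be more appropriate.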
For this tutorial’s purpose, let’s create a sample dataset consisting of two categorical variables – occupation and marital status.
    data = pd.DataFrame({
        'Occupation': ['Doctor', 'Engineer', 'Teacher', 'Engineer', 'Teacher', 'Doctor', 'Engineer'],
        'Marital_status': ['Married', 'Single', 'Married', 'Single', 'Single', 'Married', 'Married'],
    })
Step 3: Create a contingency table
A contingency table is a matrix that counts how often each combination of categories of two variables occurs. In this case, we will create a contingency table of the counts of each combination of ‘Occupation’ and ‘Marital_status’.
    contingency_table = pd.crosstab(data['Occupation'], data['Marital_status'])
    print(contingency_table)
    Marital_status  Married  Single
    Occupation
    Doctor                2       0
    Engineer              1       2
    Teacher               1       1
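Raw counts can be hard to compare when the groups differ in size. As a sketch, pd.crosstab() also accepts margins=True to append row and column totals, and normalize='index' to convert each row to proportions. The variable names counts_with_totals and row_proportions are only illustrative.

```python
import pandas as pd

data = pd.DataFrame({
    'Occupation': ['Doctor', 'Engineer', 'Teacher', 'Engineer', 'Teacher', 'Doctor', 'Engineer'],
    'Marital_status': ['Married', 'Single', 'Married', 'Single', 'Single', 'Married', 'Married'],
})

# Counts with an extra "All" row and column holding the totals.
counts_with_totals = pd.crosstab(data['Occupation'], data['Marital_status'], margins=True)
print(counts_with_totals)

# Share of each marital status within each occupation (every row sums to 1).
row_proportions = pd.crosstab(data['Occupation'], data['Marital_status'], normalize='index')
print(row_proportions.round(2))
```

If you add margins for display, keep a separate table without the totals for the chi-square test, since the totals are not observations.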
Step 4: Compute the chi-square test of independence
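Now, let’s use the chi2_contingency() function from the scipy.stats module to compute the chi-square test of independence. It returns the test statistic, the p-value, the degrees of freedom, and the table of expected frequencies.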
    chi2, p_value, dof, expected_freqs = chi2_contingency(contingency_table)
    print("Chi-square:", chi2)
    print("P-value:", p_value)
    print("Degrees of freedom:", dof)
    print("Expected frequencies:")
    print(expected_freqs)
Rounded to four decimal places, the output for the sample dataset is:
    Chi-square: 2.2361
    P-value: 0.3269
    Degrees of freedom: 2
    Expected frequencies:
    [[1.1429 0.8571]
     [1.7143 1.2857]
     [1.1429 0.8571]]
Step 5: Interpret the results
To interpret the results, we look at the p-value. The null hypothesis is that the two variables are independent. If the p-value is less than the chosen significance level (commonly 0.05), we reject the null hypothesis and conclude that there is a statistically significant association between the two categorical variables. If the p-value is greater than or equal to the significance level, we fail to reject the null hypothesis.
In this example, the p-value is approximately 0.3269, which is greater than the significance level of 0.05. Therefore, we cannot reject the null hypothesis: the data provide no evidence of a significant association between occupation and marital status. Note that with only seven observations, every expected frequency is below 5, so the chi-square approximation is unreliable here; the sample dataset only illustrates the workflow.
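The decision rule described in this step can be written as a small helper. This is a minimal sketch: the function name interpret_chi_square, the default alpha of 0.05, and the minimum expected count of 5 (a common rule of thumb for when the chi-square approximation can be trusted) are choices made for this illustration and are not part of SciPy.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def interpret_chi_square(table, alpha=0.05, min_expected=5):
    """Run the test and summarize the decision (hypothetical helper)."""
    chi2, p_value, dof, expected = chi2_contingency(table)
    return {
        "chi2": chi2,
        "p_value": p_value,
        "dof": dof,
        # True when the null hypothesis of independence is rejected at level alpha.
        "reject_null": bool(p_value < alpha),
        # True when some expected count is small, so the p-value needs caution.
        "small_expected_counts": bool(np.any(expected < min_expected)),
    }


data = pd.DataFrame({
    'Occupation': ['Doctor', 'Engineer', 'Teacher', 'Engineer', 'Teacher', 'Doctor', 'Engineer'],
    'Marital_status': ['Married', 'Single', 'Married', 'Single', 'Single', 'Married', 'Married'],
})
table = pd.crosstab(data['Occupation'], data['Marital_status'])

result = interpret_chi_square(table)
print(result)
```

Returning a dictionary keeps the numeric results and the decision together, which makes it easier to log or compare several pairs of variables.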
Full code
    import pandas as pd
    import numpy as np
    from scipy.stats import chi2_contingency

    data = pd.DataFrame({
        'Occupation': ['Doctor', 'Engineer', 'Teacher', 'Engineer', 'Teacher', 'Doctor', 'Engineer'],
        'Marital_status': ['Married', 'Single', 'Married', 'Single', 'Single', 'Married', 'Married'],
    })

    contingency_table = pd.crosstab(data['Occupation'], data['Marital_status'])

    chi2, p_value, dof, expected_freqs = chi2_contingency(contingency_table)
    print("Chi-square:", chi2)
    print("P-value:", p_value)
    print("Degrees of freedom:", dof)
    print("Expected frequencies:")
    print(expected_freqs)
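To see where the expected frequencies and the test statistic come from, the sketch below recomputes them with NumPy and compares them with the SciPy result. Under independence, the expected count of a cell is its row total times its column total divided by the number of observations, and the statistic sums (observed - expected)^2 / expected over all cells. The variable names are illustrative.

```python
import numpy as np
import pandas as pd
from scipy import stats

data = pd.DataFrame({
    'Occupation': ['Doctor', 'Engineer', 'Teacher', 'Engineer', 'Teacher', 'Doctor', 'Engineer'],
    'Marital_status': ['Married', 'Single', 'Married', 'Single', 'Single', 'Married', 'Married'],
})
observed = pd.crosstab(data['Occupation'], data['Marital_status']).to_numpy()

# Marginal totals and sample size.
row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
n = observed.sum()

# Expected counts under independence: row total * column total / n.
expected_manual = np.outer(row_totals, col_totals) / n

# Pearson chi-square statistic and its degrees of freedom.
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Upper-tail probability of the chi-square distribution gives the p-value.
p_manual = stats.chi2.sf(chi2_manual, dof_manual)

# correction=False disables Yates' continuity correction, which SciPy
# otherwise applies by default when there is one degree of freedom.
chi2_scipy, p_scipy, dof_scipy, expected_scipy = stats.chi2_contingency(observed, correction=False)

print("Expected counts match:", np.allclose(expected_manual, expected_scipy))
print("Statistic matches:", np.isclose(chi2_manual, chi2_scipy))
print("P-value matches:", np.isclose(p_manual, p_scipy))
```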
Conclusion
In this tutorial, we learned how to perform a chi-square test of independence to check for an association between two categorical variables in Python.
We discussed the role of the p-value in deciding whether that association is statistically significant.
This technique is useful whenever you need to analyze relationships between categorical variables in a dataset.