How To Find A Correlation Between Categorical Variables In Python

In this tutorial, we will learn how to find a correlation between categorical variables in Python. This is particularly useful when you’re trying to understand the relationship between two qualitative or categorical variables in your data.

We will use the chi-square test of independence, whose null hypothesis is that the two variables are independent, to determine whether there is a statistically significant association between them.

Prerequisites

To follow this tutorial, you should have basic knowledge of Python programming and the Pandas library, some familiarity with statistics, and the pandas and scipy packages installed.

Step 1: Import the necessary libraries

Let’s start by importing the required libraries:
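A minimal set of imports for the steps that follow, assuming pandas and SciPy are installed:

```python
import pandas as pd                        # data loading and contingency tables
from scipy.stats import chi2_contingency   # chi-square test of independence
```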

Step 2: Load the dataset

Load your dataset using the Pandas function pd.read_csv(). Replace the filename with the path to your dataset.
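As a rough sketch of loading, the snippet below reads CSV text from an in-memory buffer so that it runs as-is. The column names and the three rows are assumptions made for illustration; with a real dataset you would pass the file path to pd.read_csv() instead.

```python
import io
import pandas as pd

# Stand-in for a file on disk (illustrative rows only). With a real dataset,
# pass its path instead, e.g. pd.read_csv("your_dataset.csv") (hypothetical name).
csv_text = """Occupation,Marital_status
Engineer,Married
Teacher,Single
Doctor,Married
"""

data = pd.read_csv(io.StringIO(csv_text))
print(data.head())
```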

For this tutorial’s purpose, let’s create a sample dataset consisting of two categorical variables – occupation and marital status.
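One way to build such a sample is sketched below. The individual rows are an assumption: they are chosen so that the counts match the contingency table shown in Step 3.

```python
import pandas as pd

# Seven illustrative observations (assumed rows, not real data)
data = pd.DataFrame({
    "Occupation": ["Doctor", "Doctor",
                   "Engineer", "Engineer", "Engineer",
                   "Teacher", "Teacher"],
    "Marital_status": ["Married", "Married",
                       "Married", "Married", "Single",
                       "Married", "Single"],
})

print(data)
```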

Step 3: Create a contingency table

A contingency table shows how often each combination of categories from the two variables occurs. In this case, we will create a contingency table containing the count of each combination of ‘Occupation’ and ‘Marital_status’.
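A sketch of this step using pd.crosstab(), with the sample rows repeated (they are assumed for illustration) so the snippet runs on its own:

```python
import pandas as pd

# Assumed sample rows for illustration
data = pd.DataFrame({
    "Occupation": ["Doctor", "Doctor",
                   "Engineer", "Engineer", "Engineer",
                   "Teacher", "Teacher"],
    "Marital_status": ["Married", "Married",
                       "Married", "Married", "Single",
                       "Married", "Single"],
})

# Rows are occupations, columns are marital statuses, cells are counts
contingency_table = pd.crosstab(data["Occupation"], data["Marital_status"])
print(contingency_table)
```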

Marital_status  Married  Single
Occupation
Doctor                 2       0
Engineer               2       1
Teacher                1       1

Step 4: Compute the chi-square test of independence

Now, let’s use the chi2_contingency() function from the scipy.stats module to compute the chi-square test of independence.
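A minimal sketch of the call is shown here. To keep it self-contained, the contingency table is entered directly from the counts in Step 3 rather than rebuilt from raw rows; the variable names are illustrative.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Observed counts taken from the contingency table in Step 3
contingency_table = pd.DataFrame(
    [[2, 0], [2, 1], [1, 1]],
    index=pd.Index(["Doctor", "Engineer", "Teacher"], name="Occupation"),
    columns=pd.Index(["Married", "Single"], name="Marital_status"),
)

# Returns the test statistic, the p-value, the degrees of freedom,
# and the expected frequencies under independence
chi2, p_value, dof, expected = chi2_contingency(contingency_table)

print(f"Chi-square: {chi2:.4f}")
print(f"P-value: {p_value:.4f}")
print(f"Degrees of freedom: {dof}")
print("Expected frequencies:")
print(expected)
```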

Chi-square: 1.2833
P-value: 0.5264
Degrees of freedom: 2
Expected frequencies:
[[1.42857143 0.57142857]
 [2.14285714 0.85714286]
 [1.42857143 0.57142857]]

Step 5: Interpret the results

To interpret the results, we look at the p-value. If the p-value is less than the significance level (usually 0.05), we reject the null hypothesis of independence and conclude that there is a significant association between the two categorical variables. Otherwise, we fail to reject the null hypothesis.

In this example, the p-value is approximately 0.5264, which is greater than the significance level of 0.05. Therefore, we cannot reject the null hypothesis: the data provide no evidence of a significant association between occupation and marital status. Note that with only seven observations every expected frequency is below 5, so the chi-square approximation is unreliable here and the example should be read as an illustration only.
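The decision rule can be sketched in code as follows. The significance level of 0.05 comes from the text; the variable names and the small-cell check are illustrative additions.

```python
from scipy.stats import chi2_contingency

# Observed counts from the contingency table in Step 3
observed = [[2, 0], [2, 1], [1, 1]]
alpha = 0.05  # significance level

chi2, p_value, dof, expected = chi2_contingency(observed)

if p_value < alpha:
    conclusion = "Reject the null hypothesis: the variables appear to be associated."
else:
    conclusion = "Fail to reject the null hypothesis: no evidence of an association."
print(conclusion)

# Count cells whose expected frequency is below 5; many such cells
# mean the chi-square approximation may be unreliable
small_cells = int((expected < 5).sum())
print(f"Cells with expected frequency below 5: {small_cells}")
```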

Full code
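Putting the steps together, a minimal end-to-end sketch might look like this. The sample rows are assumed for illustration and chosen to match the contingency table in Step 3.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Step 2: sample data (assumed rows, not real data)
data = pd.DataFrame({
    "Occupation": ["Doctor", "Doctor",
                   "Engineer", "Engineer", "Engineer",
                   "Teacher", "Teacher"],
    "Marital_status": ["Married", "Married",
                       "Married", "Married", "Single",
                       "Married", "Single"],
})

# Step 3: contingency table of counts
contingency_table = pd.crosstab(data["Occupation"], data["Marital_status"])
print(contingency_table)

# Step 4: chi-square test of independence
chi2, p_value, dof, expected = chi2_contingency(contingency_table)
print(f"Chi-square: {chi2:.4f}")
print(f"P-value: {p_value:.4f}")
print(f"Degrees of freedom: {dof}")
print("Expected frequencies:")
print(expected)

# Step 5: compare the p-value with the significance level
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: the variables appear to be associated.")
else:
    print("Fail to reject the null hypothesis: no evidence of an association.")
```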

Conclusion

In this tutorial, we learned how to perform a chi-square test of independence to check for an association between two categorical variables in Python.

We discussed how the p-value is compared with the significance level to decide whether the association is statistically significant.

This technique is useful whenever you need to analyze the relationship between two categorical variables in a dataset.