How To Find A Correlation Between Categorical Variables In Python

In this tutorial, we will learn how to find a correlation between categorical variables in Python. For categorical data, "correlation" means a statistical association: whether knowing the category of one variable tells you something about the category of the other.

We will use the chi-square test of independence to determine if there is a significant association between two categorical variables.

Prerequisites

To follow this tutorial, you should have a basic knowledge of Python and the Pandas library, plus some familiarity with statistics.

Step 1: Import the necessary libraries

Let’s start by importing the required libraries:

import pandas as pd

import numpy as np

from scipy.stats import chi2_contingency

Step 2: Load the dataset

Load your dataset using the Pandas function pd.read_csv(). Replace the filename with the path to your dataset.

data = pd.read_csv('your_dataset.csv')

For this tutorial’s purpose, let’s create a sample dataset consisting of two categorical variables – occupation and marital status.

data = pd.DataFrame({

'Occupation': ['Doctor', 'Engineer', 'Teacher', 'Engineer', 'Teacher', 'Doctor', 'Engineer'],

'Marital_status': ['Married', 'Single', 'Married', 'Single', 'Single', 'Married', 'Married'],

})
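Before cross-tabulating, it can help to confirm which categories each column contains and that no values are missing. The following is a minimal sketch of such a check on the sample data; the variable names occupation_counts, status_counts, and missing_per_column are illustrative and not part of the tutorial's code.

```python
import pandas as pd

# Same sample data as in the tutorial
data = pd.DataFrame({
    "Occupation": ["Doctor", "Engineer", "Teacher", "Engineer", "Teacher", "Doctor", "Engineer"],
    "Marital_status": ["Married", "Single", "Married", "Single", "Single", "Married", "Married"],
})

# How many rows fall into each category of each variable
occupation_counts = data["Occupation"].value_counts()
status_counts = data["Marital_status"].value_counts()
print(occupation_counts)
print(status_counts)

# Number of missing values per column; pd.crosstab leaves out rows with
# missing values by default, so it is worth knowing if there are any
missing_per_column = data.isna().sum()
print(missing_per_column)
```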

Step 3: Create a contingency table

A contingency table counts how often each combination of categories from the two variables occurs. In this case, each cell holds the number of rows with a given combination of ‘Occupation’ and ‘Marital_status’.

contingency_table = pd.crosstab(data['Occupation'], data['Marital_status'])

print(contingency_table)

Marital_status  Married  Single
Occupation
Doctor                2       0
Engineer              1       2
Teacher               1       1
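Raw counts are easier to read next to their totals and row-wise proportions. The sketch below uses the margins and normalize parameters of pd.crosstab to add both views; the names table_with_totals and row_shares are illustrative.

```python
import pandas as pd

# Same sample data as in the tutorial
data = pd.DataFrame({
    "Occupation": ["Doctor", "Engineer", "Teacher", "Engineer", "Teacher", "Doctor", "Engineer"],
    "Marital_status": ["Married", "Single", "Married", "Single", "Single", "Married", "Married"],
})

# Counts with row and column totals appended under the label "All"
table_with_totals = pd.crosstab(
    data["Occupation"], data["Marital_status"], margins=True
)
print(table_with_totals)

# Share of each marital status within each occupation (each row sums to 1)
row_shares = pd.crosstab(
    data["Occupation"], data["Marital_status"], normalize="index"
)
print(row_shares.round(2))
```

The totals-augmented table is for reading only. Pass the plain table of counts, without the "All" row and column, to the chi-square test.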

Step 4: Compute the chi-square test of independence

Now, let’s use the chi2_contingency() function from the scipy.stats module to compute the chi-square test of independence.

chi2, p_value, dof, expected_freqs = chi2_contingency(contingency_table)

print("Chi-square:", chi2)

print("P-value:", p_value)

print("Degrees of freedom:", dof)

print("Expected frequencies:")

print(expected_freqs)

Running this on the sample data prints the following (the chi-square statistic and p-value are shown rounded to four decimal places):

Chi-square: 2.2361
P-value: 0.3269
Degrees of freedom: 2
Expected frequencies:
[[1.14285714 0.85714286]
 [1.71428571 1.28571429]
 [1.14285714 0.85714286]]

Step 5: Interpret the results

The null hypothesis of the test is that the two variables are independent. To interpret the results, we compare the p-value with a chosen significance level (usually 0.05). If the p-value is less than the significance level, we reject the null hypothesis and conclude that there is a statistically significant association between the two categorical variables. If it is greater, we cannot reject the null hypothesis.

In this example, the p-value is approximately 0.3269, which is greater than the significance level of 0.05. Therefore, we cannot reject the null hypothesis: the test finds no statistically significant association between occupation and marital status in the given dataset.
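To make the interpretation concrete, the sketch below recomputes the expected frequencies and the statistic by hand, applies the decision rule, and adds two checks that go beyond the tutorial's own steps: the smallest expected count (a common rule of thumb treats the chi-square approximation as unreliable when expected counts fall below 5, as they do for a sample this small) and Cramér's V as a measure of association strength. The 0.05 threshold and all variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Same sample data as in the tutorial
data = pd.DataFrame({
    "Occupation": ["Doctor", "Engineer", "Teacher", "Engineer", "Teacher", "Doctor", "Engineer"],
    "Marital_status": ["Married", "Single", "Married", "Single", "Single", "Married", "Married"],
})
observed = pd.crosstab(data["Occupation"], data["Marital_status"])
chi2_stat, p_value, dof, expected = chi2_contingency(observed)

# Expected count under independence: row total * column total / n
counts = observed.to_numpy()
n = counts.sum()
expected_by_hand = np.outer(counts.sum(axis=1), counts.sum(axis=0)) / n

# Pearson statistic: sum of (observed - expected)^2 / expected over all cells.
# This matches chi2_contingency here because the table is 3x2; for 2x2 tables
# SciPy applies Yates' continuity correction unless correction=False is passed.
chi2_by_hand = ((counts - expected_by_hand) ** 2 / expected_by_hand).sum()

alpha = 0.05  # assumed significance level
reject_null = p_value < alpha

# Rule-of-thumb check on the validity of the approximation
min_expected = expected.min()

# Cramér's V: 0 means no association, 1 means perfect association
cramers_v = np.sqrt(chi2_stat / (n * (min(counts.shape) - 1)))

print("Statistic by hand:", round(chi2_by_hand, 4))
print("Reject independence at alpha =", alpha, "->", bool(reject_null))
print("Smallest expected count:", round(min_expected, 4))
print("Cramér's V:", round(cramers_v, 4))
```

With so few observations, a fairly large Cramér's V can coexist with a non-significant p-value, so both numbers are best read as a demonstration of the procedure rather than as evidence about occupations and marital status.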

Full code