Dealing with missing data is part of nearly every data science project. Missing categorical data poses its own challenge, because categories cannot be filled in with a mean or median the way numeric values can.
Missing data can introduce substantial bias, complicate the handling and analysis of the data, and lead to unreliable results.
In this tutorial, we will learn how to handle missing categorical data in Python. We will use the pandas and scikit-learn libraries.
Step 1: Import Necessary Libraries
We will first import the necessary libraries: pandas, NumPy, and scikit-learn. Before running the Python code, make sure these libraries are installed.
The snippet below imports the libraries we need:
```python
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
```
Step 2: Load the Dataset
Before we can handle any missing values, we need to load the dataset. For this tutorial, we will use a made-up dataset that contains some missing categorical values.
```python
# Toy dataset with one missing value (np.nan) in each column
data = {'Name': ['John', 'Anna', 'Peter', 'Linda', np.nan],
        'Gender': ['M', np.nan, 'M', 'F', 'F'],
        'Profession': ['Doctor', 'Engineer', 'Engineer', 'Doctor', np.nan]}
df = pd.DataFrame(data)
```
Here is what the data looks like:
```
    Name Gender Profession
0   John      M     Doctor
1   Anna    NaN   Engineer
2  Peter      M   Engineer
3  Linda      F     Doctor
4    NaN      F        NaN
```
Step 3: Identify the Missing Data
Next, we will identify the missing data in our dataset. pandas provides the isnull() and info() methods for this.
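```python
# Count the missing values in each column
print(df.isnull().sum())
# info() prints its summary itself and returns None, so no print() is needed
df.info()
```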
The output will show you the number of missing entries for each column.
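Beyond raw counts, it can help to see what share of each column is missing and which rows are affected. The following is a minimal sketch, assuming the same toy dataset as above; the variable names missing_share and rows_with_missing are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda', np.nan],
    'Gender': ['M', np.nan, 'M', 'F', 'F'],
    'Profession': ['Doctor', 'Engineer', 'Engineer', 'Doctor', np.nan],
})

# Fraction of missing entries per column (0.2 means 20% of the rows)
missing_share = df.isna().mean()
print(missing_share)

# Rows that contain at least one missing value
rows_with_missing = df[df.isna().any(axis=1)]
print(rows_with_missing)
```

Looking at the affected rows before imputing makes it easier to judge whether filling in a value is reasonable for a given column.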
Step 4: Handling the Missing Data
Now we get to the main part of this tutorial: handling the missing data. We will use the SimpleImputer class from the scikit-learn library.
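```python
# Replace each np.nan with the most frequent value of its column
imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
# SimpleImputer expects 2-D input, so reshape the column and take it back as 1-D
df['Profession'] = imputer.fit_transform(df['Profession'].values.reshape(-1,1))[:,0]
df['Gender'] = imputer.fit_transform(df['Gender'].values.reshape(-1,1))[:,0]
```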
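With strategy='most_frequent', the imputer replaces the missing values in each column with that column's most frequent value. In this small dataset both columns contain a tie (two 'M' and two 'F', two 'Doctor' and two 'Engineer'). scikit-learn resolves a tie by picking the smallest value, so Gender is filled with 'F' and Profession with 'Doctor'. The Name column is not imputed here.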
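The code above fits the imputer once per column. As a minimal sketch, assuming the same toy dataset, both columns can also be passed to a single fit_transform call. The variable name cat_cols is an assumption made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda', np.nan],
    'Gender': ['M', np.nan, 'M', 'F', 'F'],
    'Profession': ['Doctor', 'Engineer', 'Engineer', 'Doctor', np.nan],
})

# Illustrative name: the categorical columns we want to impute
cat_cols = ['Gender', 'Profession']

imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
# fit_transform learns one fill value per column and returns a 2-D array
df[cat_cols] = imputer.fit_transform(df[cat_cols])

# Fill values the imputer learned, in the same order as cat_cols
print(imputer.statistics_)
print(df)
```

After fitting, the statistics_ attribute holds the value chosen for each column, which is a quick way to check what the imputer will insert before trusting the result.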
Full Code
```python
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

# Toy dataset with one missing value (np.nan) in each column
data = {'Name': ['John', 'Anna', 'Peter', 'Linda', np.nan],
        'Gender': ['M', np.nan, 'M', 'F', 'F'],
        'Profession': ['Doctor', 'Engineer', 'Engineer', 'Doctor', np.nan]}
df = pd.DataFrame(data)

# Inspect the missing values; info() prints its own summary
print(df.isnull().sum())
df.info()

# Fill the missing categories with the most frequent value of each column
imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
df['Profession'] = imputer.fit_transform(df['Profession'].values.reshape(-1,1))[:,0]
df['Gender'] = imputer.fit_transform(df['Gender'].values.reshape(-1,1))[:,0]
```
Running the script prints the following:
```
Name          1
Gender        1
Profession    1
dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Name        4 non-null      object
 1   Gender      4 non-null      object
 2   Profession  4 non-null      object
dtypes: object(3)
memory usage: 248.0+ bytes
```
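The full code relies on scikit-learn for the imputation step. As a rough sketch of two pandas-only alternatives, assuming the same toy dataset, a column can be filled with its mode via fillna(), or the gaps can be kept visible with a placeholder category. The label 'Unknown' and the variable names are assumptions for illustration, not part of the tutorial's code.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda', np.nan],
    'Gender': ['M', np.nan, 'M', 'F', 'F'],
    'Profession': ['Doctor', 'Engineer', 'Engineer', 'Doctor', np.nan],
})

# Option A: fill with the most frequent value using pandas only.
# Series.mode() ignores NaN and returns all tied values in sorted order,
# so [0] takes the first of them.
gender_mode = df['Gender'].mode()[0]
filled_mode = df['Gender'].fillna(gender_mode)
print(filled_mode)

# Option B: treat "missing" as its own category (hypothetical label)
filled_unknown = df['Profession'].fillna('Unknown')
print(filled_unknown)
```

Filling with the mode keeps the set of categories unchanged but can inflate the most common one. A placeholder category preserves the fact that the value was missing, which may itself be informative for later analysis.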
Conclusion
Handling missing categorical data, although challenging, is crucial for any data analysis. In Python, using libraries like pandas and scikit-learn, this process becomes much more manageable.
After following this tutorial, you should be able to effectively handle missing categorical data for your data science projects.