Dealing with messy data in Excel can be a real headache. Python’s pandas library provides a simple and powerful way to clean and analyze that data quickly.
pandas offers the DataFrame, a table-like structure similar to an Excel sheet, which makes it easy to read, explore, and modify your data.
In this tutorial, you’ll learn how to clean Excel data in Python using pandas. We’ll cover importing the data, removing unwanted rows and columns, converting data to the correct types, and handling missing data in a variety of ways.
Setup: Create an Excel file
Create example.xlsx with the following content:
Date | Column1 | Column2 | Column3 | Amount |
---|---|---|---|---|
2023-01-01 | A | B | C | 100 |
2023-01-02 | D | E | F | |
2023-01-03 | G | H | I | 200 |
2023-01-04 | J | K | L | 300 |
2023-01-05 | M | N | O | 400 |
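If you would rather generate the sample data than type it by hand, the table above can be built as a DataFrame. This is only a sketch: the helper names `build_example_frame` and `write_example_file` are illustrative and not part of pandas, and writing an `.xlsx` file assumes `openpyxl` is installed. The dates are stored as plain text here, which is an assumption about how the cells were typed.

```python
import pandas as pd


def build_example_frame() -> pd.DataFrame:
    """Return the sample table from this step, with one missing Amount."""
    return pd.DataFrame(
        {
            "Date": ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04", "2023-01-05"],
            "Column1": ["A", "D", "G", "J", "M"],
            "Column2": ["B", "E", "H", "K", "N"],
            "Column3": ["C", "F", "I", "L", "O"],
            # None becomes NaN, which pandas writes to Excel as an empty cell
            "Amount": [100, None, 200, 300, 400],
        }
    )


def write_example_file(path: str = "example.xlsx") -> None:
    """Write the sample table to an Excel file (hypothetical helper)."""
    build_example_frame().to_excel(path, index=False)


example_df = build_example_frame()
print(example_df)
```

Calling `write_example_file()` afterwards creates example.xlsx in the current working directory.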
Step 1: Import data from Excel using Pandas
First, we’ll install the necessary libraries for this tutorial using the following command:

    pip install pandas openpyxl

The first step is to read and import the data from an Excel file into a pandas DataFrame. You can do this using the pandas read_excel function:

    import pandas as pd

    # Read the Excel file
    file_name = 'example.xlsx'
    df = pd.read_excel(file_name)

    # Display the first few rows of data
    print(df.head())

This code reads the data from the ‘example.xlsx’ file and loads it into a pandas DataFrame called df. You can change ‘example.xlsx’ to the name of your Excel file.
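Before changing anything, it helps to check what pandas actually loaded. The sketch below uses a small in-memory DataFrame as a stand-in for the result of `pd.read_excel` (an assumption, so the snippet runs without the file) and reports the inferred data types and the missing values.

```python
import pandas as pd

# Stand-in for df = pd.read_excel('example.xlsx'); values copied from the sample table
df = pd.DataFrame(
    {
        "Date": ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04", "2023-01-05"],
        "Column3": ["C", "F", "I", "L", "O"],
        "Amount": [100, None, 200, 300, 400],
    }
)

# Data type pandas inferred for each column
print(df.dtypes)

# Number of missing values in each column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Rows that contain at least one missing value
rows_with_gaps = df[df.isna().any(axis=1)]
print(rows_with_gaps)
```

A quick check like this shows which columns need a type conversion and which need their gaps handled.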
Step 2: Remove unnecessary columns
Often there are unnecessary columns in our data that we don’t want in the final output. We’ll use the df.drop() function to remove the unwanted columns:

    # Columns to be removed
    columns_to_remove = ['Column1', 'Column2']

    # Drop the specified columns
    df = df.drop(columns=columns_to_remove)

Replace ‘Column1’ and ‘Column2’ with the names of the columns you want to remove.
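Unwanted rows can be removed in a similar way. As a sketch on an in-memory stand-in for the sample data (with one duplicated row added purely for illustration), `drop_duplicates()` removes repeated rows and `dropna(subset=...)` removes rows that lack a value in the listed columns.

```python
import pandas as pd

# Illustrative data: the last row duplicates the one before it
df = pd.DataFrame(
    {
        "Date": ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-03"],
        "Column3": ["C", "F", "I", "I"],
        "Amount": [100, None, 200, 200],
    }
)

# Remove rows that are exact duplicates of an earlier row
deduplicated = df.drop_duplicates()

# Remove rows that have no value in 'Amount'
complete_rows = deduplicated.dropna(subset=["Amount"])

# Renumber the index so it runs 0, 1, 2, ... again
complete_rows = complete_rows.reset_index(drop=True)
print(complete_rows)
```

Dropping rows with a missing ‘Amount’ is an alternative to filling them in; which one is appropriate depends on your data.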
Step 3: Rename columns
To make our data more readable, we can rename the columns using the df.rename() function. For example, if we want to change ‘Column3’ to ‘NewColumnName’, we can use the following code:

    df = df.rename(columns={'Column3': 'NewColumnName'})

You can change ‘Column3’ and ‘NewColumnName’ to the appropriate values for your data.
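When many headers need tidying, renaming them one by one gets tedious. One common approach, sketched here with made-up header names, is to normalize every column name at once through the string methods available on `df.columns`.

```python
import pandas as pd

# Hypothetical headers with stray spaces and inconsistent capitalization
df = pd.DataFrame(columns=[" Order Date ", "Customer Name", "AMOUNT "])

# Trim whitespace, lowercase, and replace inner spaces with underscores
df.columns = (
    df.columns.str.strip()
    .str.lower()
    .str.replace(" ", "_", regex=False)
)

normalized_names = list(df.columns)
print(normalized_names)
```

Consistent lowercase names without spaces are easier to type and less error-prone in later steps.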
Step 4: Convert data to the correct types
It’s essential to ensure that the data type of each column is correct. For instance, if a ‘Date’ column is read as an object (text) column, you’ll want to convert it to a datetime data type:

    # Convert the 'Date' column to a datetime data type
    df['Date'] = pd.to_datetime(df['Date'])

You can change ‘Date’ to the name of the relevant column in your data.
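Real spreadsheets often contain a few cells that cannot be parsed. As a sketch, assuming hypothetical bad values such as ‘n/a’, passing `errors="coerce"` to `pd.to_datetime` or `pd.to_numeric` turns unparseable entries into missing values instead of raising an error, so they can be handled together with the other gaps.

```python
import pandas as pd

# Illustrative data with one unparseable value in each column
df = pd.DataFrame(
    {
        "Date": ["2023-01-01", "not a date", "2023-01-03"],
        "Amount": ["100", "n/a", "200"],
    }
)

# Unparseable dates become NaT (the datetime equivalent of NaN)
df["Date"] = pd.to_datetime(df["Date"], errors="coerce")

# Unparseable numbers become NaN
df["Amount"] = pd.to_numeric(df["Amount"], errors="coerce")

print(df.dtypes)
print(df)
```

Without `errors="coerce"`, both functions raise an exception on the first value they cannot parse, which can be preferable when bad values should stop the script.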
Step 5: Fill in the missing data
Dealing with missing data is an essential part of the data-cleaning process. In pandas, you can use several methods to fill in missing data, like forward filling, backward filling, or filling them with a specific value. For example, let’s use the fillna() method to replace any missing data in the ‘Amount’ column with the mean of the column:

    # Fill missing data in the 'Amount' column with the column's mean
    df['Amount'] = df['Amount'].fillna(df['Amount'].mean())

You can change ‘Amount’ to the name of the relevant column with missing data in your DataFrame.
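The other filling strategies mentioned above can be compared side by side. This sketch uses the ‘Amount’ values from the sample table as an in-memory Series rather than reading the file.

```python
import pandas as pd

# 'Amount' values from the sample table; None marks the empty cell
amount = pd.Series([100, None, 200, 300, 400], name="Amount")

# Forward fill: copy the previous valid value downward
filled_forward = amount.ffill()

# Backward fill: copy the next valid value upward
filled_backward = amount.bfill()

# Constant fill: replace gaps with a fixed value
filled_constant = amount.fillna(0)

# Mean fill: replace gaps with the mean of the known values
filled_mean = amount.fillna(amount.mean())

comparison = pd.DataFrame(
    {
        "original": amount,
        "ffill": filled_forward,
        "bfill": filled_backward,
        "constant": filled_constant,
        "mean": filled_mean,
    }
)
print(comparison)
```

Forward and backward filling suit ordered data such as daily readings, while a constant or the mean is a simpler choice when row order carries no meaning.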
Step 6: Export the cleaned data to a new Excel file
Now that our data is clean, we can export it to a new Excel file using the df.to_excel() method:

    # Save the cleaned data to a new Excel file
    cleaned_file_name = 'cleaned_example.xlsx'
    df.to_excel(cleaned_file_name, index=False)

This code saves the cleaned data to a new Excel file called ‘cleaned_example.xlsx’. You can replace ‘cleaned_example.xlsx’ with the desired name for your output file.
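If you want to keep the original data next to the cleaned version, both can go into one workbook on separate sheets. The following is a sketch using `pd.ExcelWriter`; the function name `export_raw_and_cleaned` and the sheet names ‘Raw’ and ‘Cleaned’ are illustrative choices, and `openpyxl` is assumed to be installed.

```python
import pandas as pd


def export_raw_and_cleaned(raw: pd.DataFrame, cleaned: pd.DataFrame, path: str) -> None:
    """Write both versions of the data to separate sheets of one workbook."""
    # The context manager saves and closes the file when the block ends
    with pd.ExcelWriter(path) as writer:
        raw.to_excel(writer, sheet_name="Raw", index=False)
        cleaned.to_excel(writer, sheet_name="Cleaned", index=False)
```

A call such as `export_raw_and_cleaned(raw_df, df, 'cleaned_example.xlsx')` would then produce a single file with two sheets, where `raw_df` is a copy of the data taken before cleaning.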
Here is the complete code:

    import pandas as pd

    # Read the Excel file
    file_name = 'example.xlsx'
    df = pd.read_excel(file_name)

    # Remove unnecessary columns
    columns_to_remove = ['Column1', 'Column2']
    df = df.drop(columns=columns_to_remove)

    # Rename columns
    df = df.rename(columns={'Column3': 'NewColumnName'})

    # Convert data types
    df['Date'] = pd.to_datetime(df['Date'])

    # Fill missing data
    df['Amount'] = df['Amount'].fillna(df['Amount'].mean())

    # Save the cleaned data to a new Excel file
    cleaned_file_name = 'cleaned_example.xlsx'
    df.to_excel(cleaned_file_name, index=False)

Make sure to change the file names and column names to suit your specific data.
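To reuse these steps on several workbooks, the cleaning logic can be separated from the file reading and writing. The function below is one possible arrangement, not the only one; the name `clean_frame` is illustrative, and the in-memory DataFrame stands in for the result of `pd.read_excel`.

```python
import pandas as pd


def clean_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the tutorial's cleaning steps and return a new DataFrame."""
    cleaned = df.drop(columns=["Column1", "Column2"])
    cleaned = cleaned.rename(columns={"Column3": "NewColumnName"})
    cleaned["Date"] = pd.to_datetime(cleaned["Date"])
    cleaned["Amount"] = cleaned["Amount"].fillna(cleaned["Amount"].mean())
    return cleaned


# Stand-in for pd.read_excel('example.xlsx'); values copied from the sample table
raw = pd.DataFrame(
    {
        "Date": ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04", "2023-01-05"],
        "Column1": ["A", "D", "G", "J", "M"],
        "Column2": ["B", "E", "H", "K", "N"],
        "Column3": ["C", "F", "I", "L", "O"],
        "Amount": [100, None, 200, 300, 400],
    }
)

cleaned = clean_frame(raw)
print(cleaned)
```

Because `clean_frame` returns a new DataFrame, the raw data stays untouched and can be compared with the cleaned result.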
Output (cleaned_example.xlsx)
Date | NewColumnName | Amount |
---|---|---|
2023-01-01 | C | 100 |
2023-01-02 | F | 250 |
2023-01-03 | I | 200 |
2023-01-04 | L | 300 |
2023-01-05 | O | 400 |
Conclusion
In this tutorial, you learned how to clean Excel data in Python using pandas. The process involved reading the data and creating a pandas DataFrame, removing unnecessary columns, renaming columns, converting data to the correct types, and handling missing data.
Once the data is cleaned, you can export it to a new Excel file. You can now apply this knowledge to clean and analyze your own datasets using Python and pandas effectively.