How To Clean Excel Data In Python

Dealing with messy data in Excel can be a real headache. However, the pandas library for Python gives you a simple and powerful way to clean and analyze that data quickly.

pandas provides the DataFrame, a tabular data structure that makes it easy to read, explore, and modify data in a format similar to an Excel sheet.

In this tutorial, you’ll learn how to clean Excel data in Python using pandas. We’ll cover importing the data, removing unwanted columns, renaming columns, converting data to the correct types, filling in missing values, and exporting the result to a new Excel file.

Before you start: Create an Excel file

Create example.xlsx with the following content:

Date        Column1  Column2  Column3  Amount
2023-01-01  A        B        C        100
2023-01-02  D        E        F
2023-01-03  G        H        I        200
2023-01-04  J        K        L        300
2023-01-05  M        N        O        400

Step 1: Import data from Excel using Pandas

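Before reading any data, install the libraries this tutorial needs: pandas, plus openpyxl, which pandas uses to read and write .xlsx files. With pip, the command is pip install pandas openpyxl.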

The first step is to read and import the data from an Excel file into a pandas DataFrame. You can do this using the pandas read_excel function:
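A minimal sketch of the import, assuming example.xlsx sits in the current working directory and openpyxl is installed. The existence check is an addition for illustration, so the snippet prints a hint instead of raising an error when the file is missing.

```python
from pathlib import Path

import pandas as pd

# Assumed location: the workbook created earlier, in the current directory.
path = Path("example.xlsx")

if path.exists():
    # read_excel loads the first worksheet by default.
    df = pd.read_excel(path)
    print(df.head())    # first rows, to check the import visually
    print(df.dtypes)    # column types as pandas inferred them
else:
    print(f"{path} not found; create it before running this step")
```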

This code reads the data from the ‘example.xlsx’ file and loads it into a pandas DataFrame called df. You can change ‘example.xlsx’ to the name of your Excel file.

Step 2: Remove unnecessary columns

Data often contains columns that we don’t want in the final output. We’ll use the df.drop() method to remove the unwanted columns:
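A minimal sketch of this step. So that it runs on its own, the DataFrame is built in memory as a stand-in for the one loaded from example.xlsx; that stand-in is an assumption, not part of the original workflow.

```python
import pandas as pd

# Stand-in for the DataFrame read from example.xlsx (assumption for this sketch).
df = pd.DataFrame({
    "Date": ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04", "2023-01-05"],
    "Column1": ["A", "D", "G", "J", "M"],
    "Column2": ["B", "E", "H", "K", "N"],
    "Column3": ["C", "F", "I", "L", "O"],
    "Amount": [100, None, 200, 300, 400],
})

# drop() returns a new DataFrame, so assign the result back to df.
df = df.drop(columns=["Column1", "Column2"])

print(df.columns.tolist())  # ['Date', 'Column3', 'Amount']
```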

Replace ‘Column1’ and ‘Column2’ with the names of the columns you want to remove.

Step 3: Rename columns

To make our data more readable, we can rename the columns using the df.rename() method. For example, if we want to change ‘Column3’ to ‘NewColumnName’, we can use the following code:
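A minimal sketch of the rename, again using a small in-memory DataFrame as an assumed stand-in for the data left after the previous step.

```python
import pandas as pd

# Stand-in for the DataFrame after the unwanted columns were dropped.
df = pd.DataFrame({
    "Date": ["2023-01-01", "2023-01-02", "2023-01-03"],
    "Column3": ["C", "F", "I"],
    "Amount": [100, None, 200],
})

# The mapping is {old_name: new_name}; columns not listed keep their names.
df = df.rename(columns={"Column3": "NewColumnName"})

print(df.columns.tolist())  # ['Date', 'NewColumnName', 'Amount']
```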

You can change ‘Column3’ and ‘NewColumnName’ to the appropriate values for your data.

Step 4: Convert data to the correct types

It’s essential to ensure that the data types for each column are correct. For instance, if you have a column ‘Date’ that is read as an object, you want to convert it to a datetime data type:
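A minimal sketch of the conversion using pd.to_datetime, assuming the dates arrived as text. The in-memory DataFrame is a stand-in so the snippet runs on its own.

```python
import pandas as pd

# Stand-in data: dates stored as plain strings.
df = pd.DataFrame({
    "Date": ["2023-01-01", "2023-01-02", "2023-01-03"],
    "Amount": [100, None, 200],
})

print(df["Date"].dtype)  # a text dtype before conversion

# Parse the strings into real timestamps.
df["Date"] = pd.to_datetime(df["Date"])

print(df["Date"].dtype)  # a datetime64 dtype after conversion
```

If some cells may hold text that is not a valid date, passing errors="coerce" to pd.to_datetime turns those cells into missing values (NaT) instead of raising an error.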

You can change ‘Date’ to the name of the relevant column in your data.

Step 5: Fill in the missing data

Dealing with missing data is an essential part of the data-cleaning process. In pandas, you can use several methods to fill in missing data, like forward filling, backward filling, or filling them with a specific value. For example, let’s use the fillna() method to replace any missing data in the ‘Amount’ column with the mean of the column:
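A minimal sketch of the mean fill, with the forward-fill, backward-fill, and fixed-value alternatives shown alongside for comparison. The in-memory DataFrame is an assumed stand-in that mirrors the Amount column of example.xlsx.

```python
import pandas as pd

# Stand-in data: the Amount for 2023-01-02 is missing.
df = pd.DataFrame({
    "Date": ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04", "2023-01-05"],
    "Amount": [100, None, 200, 300, 400],
})

# Alternatives, computed without modifying df:
forward_filled = df["Amount"].ffill()    # copy the previous value down
backward_filled = df["Amount"].bfill()   # copy the next value up
zero_filled = df["Amount"].fillna(0)     # use a fixed value

# mean() skips missing values by default: (100 + 200 + 300 + 400) / 4 = 250.
mean_amount = df["Amount"].mean()
df["Amount"] = df["Amount"].fillna(mean_amount)

print(df["Amount"].tolist())  # [100.0, 250.0, 200.0, 300.0, 400.0]
```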

You can change ‘Amount’ to the name of the relevant column with missing data in your DataFrame.

Step 6: Export the cleaned data to a new Excel file

Now that our data is clean, we can export it to a new Excel file using the df.to_excel() method:
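A minimal sketch of the export, assuming openpyxl is installed. The helper name export_cleaned is hypothetical, and the DataFrame is an in-memory stand-in for the cleaned data; calling export_cleaned(df) writes cleaned_example.xlsx to the current working directory.

```python
import pandas as pd

# Stand-in for the cleaned DataFrame produced by the earlier steps.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03"]),
    "NewColumnName": ["C", "F", "I"],
    "Amount": [100.0, 250.0, 200.0],
})


def export_cleaned(frame, path="cleaned_example.xlsx"):
    """Write frame to an .xlsx file (hypothetical helper for this sketch)."""
    # index=False keeps the DataFrame's row index out of the spreadsheet.
    frame.to_excel(path, index=False)
```

Without index=False, pandas writes the row index as an extra first column, which is rarely wanted in a cleaned file.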

This code saves the cleaned data to a new Excel file called ‘cleaned_example.xlsx’. You can replace ‘cleaned_example.xlsx’ with the desired name for your output file.

Here is the complete code:
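The steps above can be sketched as one script. The function names clean and main are hypothetical, and the existence check is an addition so the script does nothing (rather than failing) when example.xlsx is absent; it assumes openpyxl is installed.

```python
from pathlib import Path

import pandas as pd

SOURCE = "example.xlsx"
TARGET = "cleaned_example.xlsx"


def clean(df):
    """Apply the cleaning steps from this tutorial and return a new DataFrame."""
    # Step 2: remove unnecessary columns.
    df = df.drop(columns=["Column1", "Column2"])
    # Step 3: rename columns.
    df = df.rename(columns={"Column3": "NewColumnName"})
    # Step 4: convert the date column to a datetime type.
    df["Date"] = pd.to_datetime(df["Date"])
    # Step 5: fill missing amounts with the column mean.
    df["Amount"] = df["Amount"].fillna(df["Amount"].mean())
    return df


def main(source=SOURCE, target=TARGET):
    # Step 1: import the data from Excel.
    df = pd.read_excel(source)
    # Step 6: export the cleaned data to a new Excel file.
    clean(df).to_excel(target, index=False)


if Path(SOURCE).exists():
    main()
```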

Make sure to change the file names and column names to suit your specific data.

Output (cleaned_example.xlsx)

Date        NewColumnName  Amount
2023-01-01  C              100
2023-01-02  F              250
2023-01-03  I              200
2023-01-04  L              300
2023-01-05  O              400

Conclusion

In this tutorial, you learned how to clean Excel data in Python using pandas. The process involved reading the data and creating a pandas DataFrame, removing unnecessary columns, renaming columns, converting data to the correct types, and handling missing data.

Once the data is cleaned, you can export it to a new Excel file. You can now apply this knowledge to clean and analyze your own datasets using Python and pandas effectively.