How To Remove Duplicates In Excel Using Python

In this tutorial, we will learn how to remove duplicates in an Excel file using Python. Python is an excellent tool for data cleansing, and with its powerful libraries like pandas and openpyxl, we can easily manipulate the data stored in Excel files.

Removing duplicates from an Excel file is an essential data preprocessing step for many projects, as duplicate entries can skew counts and averages and lead to incorrect conclusions. Using Python to automate this process will save you time and help ensure consistent results.

Requirements

To follow this tutorial, you need to have Python installed on your computer and the following packages:

  1. pandas: A powerful library for data manipulation and analysis.
  2. openpyxl: A library for reading and writing Excel files.

You can install these packages using pip:
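
The install command is not shown in the text; a typical invocation, assuming pip is on your PATH and points at the Python environment you intend to use, would be:

```shell
# Install both libraries in one step
pip install pandas openpyxl
```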

Step 1: Read the Excel file

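First, we need to read the Excel file into a pandas DataFrame using the read_excel() function. We will also pass engine="openpyxl" to name the library that parses the file. For .xlsx files pandas already selects openpyxl by default, so the parameter is optional, but it makes the dependency explicit.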

For example, let’s assume that our Excel file, sample_data.xlsx, has the following content:

ID  Name  Age  Country
1   John  28   USA
2   Jane  32   Canada
3   Nina  26   India
4   John  28   USA
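
A minimal sketch of this step follows. Because sample_data.xlsx may not exist on your machine yet, the snippet first creates it from the table above. That setup part is an addition for illustration only and can be skipped if you already have the file. The variable names sample and df are arbitrary choices.

```python
import pandas as pd

# Illustrative setup (an assumption, not part of the tutorial's workflow):
# build sample_data.xlsx from the table shown above.
sample = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Name": ["John", "Jane", "Nina", "John"],
    "Age": [28, 32, 26, 28],
    "Country": ["USA", "Canada", "India", "USA"],
})
sample.to_excel("sample_data.xlsx", index=False, engine="openpyxl")

# Step 1: read the first sheet of the workbook into a DataFrame.
df = pd.read_excel("sample_data.xlsx", engine="openpyxl")
print(df)
```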

Step 2: Remove duplicates

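To remove duplicates, we will use the pandas drop_duplicates() method. This method finds all duplicate rows based on the columns given in its subset parameter and, by default, retains only the first occurrence of each duplicated row. If no columns are specified, it compares all columns. In our sample data the two John rows have different ID values (1 and 4), so comparing all columns would not treat them as duplicates. We therefore compare only the Name, Age, and Country columns.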
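
The following sketch shows one way to apply drop_duplicates() to the sample data. To keep it runnable on its own, it rebuilds the table in memory instead of reading the file. The choice of subset columns is an assumption based on the sample table: ID is left out because it is unique for every row.

```python
import pandas as pd

# The sample table, rebuilt in memory so this snippet runs on its own.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Name": ["John", "Jane", "Nina", "John"],
    "Age": [28, 32, 26, 28],
    "Country": ["USA", "Canada", "India", "USA"],
})

# Compare rows on Name, Age, and Country only; keep the first occurrence.
# keep="first" is the default and is spelled out here for clarity.
df_no_duplicates = df.drop_duplicates(
    subset=["Name", "Age", "Country"], keep="first"
)
print(df_no_duplicates)
```

Calling df.drop_duplicates() with no arguments would return all four rows here, because the ID values differ.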

For our example, dropping duplicates on the Name, Age, and Country columns removes the second John row, leaving us with the following data:

ID  Name  Age  Country
1   John  28   USA
2   Jane  32   Canada
3   Nina  26   India

Step 3: Write the data to a new Excel file

Finally, we will write the cleaned data to a new Excel file using the DataFrame's to_excel() method. Passing engine="openpyxl" makes sure openpyxl is the library used to write the file.
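
A sketch of the write step, again self-contained: it starts from the already deduplicated table rather than from the earlier steps. Passing index=False is an addition not mentioned in the text; without it, pandas writes the DataFrame index as an extra first column in the sheet.

```python
import pandas as pd

# The deduplicated table from Step 2, rebuilt here so the snippet runs alone.
df_no_duplicates = pd.DataFrame({
    "ID": [1, 2, 3],
    "Name": ["John", "Jane", "Nina"],
    "Age": [28, 32, 26],
    "Country": ["USA", "Canada", "India"],
})

# Step 3: write the result to a new workbook, leaving the original untouched.
df_no_duplicates.to_excel(
    "sample_data_no_duplicates.xlsx", index=False, engine="openpyxl"
)
```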

After running this code, a new Excel file named sample_data_no_duplicates.xlsx will be created without the duplicate entries.

Full Code
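
The three steps can be combined into one script along the following lines. This is a sketch under stated assumptions, not a definitive implementation. The constant names are arbitrary, the setup block that creates the sample workbook is included only so the script runs as-is, and the subset columns match the sample table rather than any general rule.

```python
import pandas as pd

INPUT_FILE = "sample_data.xlsx"
OUTPUT_FILE = "sample_data_no_duplicates.xlsx"

# Illustrative setup (assumption): create the sample workbook.
# Remove this block when working with your own file.
pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Name": ["John", "Jane", "Nina", "John"],
    "Age": [28, 32, 26, 28],
    "Country": ["USA", "Canada", "India", "USA"],
}).to_excel(INPUT_FILE, index=False, engine="openpyxl")

# Step 1: read the Excel file.
df = pd.read_excel(INPUT_FILE, engine="openpyxl")

# Step 2: drop rows that repeat the same Name, Age, and Country,
# keeping the first occurrence of each.
df_no_duplicates = df.drop_duplicates(subset=["Name", "Age", "Country"])

# Step 3: write the cleaned data to a new Excel file.
df_no_duplicates.to_excel(OUTPUT_FILE, index=False, engine="openpyxl")

print(f"Removed {len(df) - len(df_no_duplicates)} duplicate row(s).")
```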

Conclusion

This tutorial showed you how to easily remove duplicates from an Excel file using Python and the pandas and openpyxl libraries. With these tools, you can automate data cleansing tasks and ensure that your data is free of duplicate entries before analysis.