In this tutorial, we are going to learn how to validate an Excel file in Python. As Python developers, you may come across tasks where you need to validate the data in the Excel files before processing it further.
Validations can include checking if required columns exist, if no null values are present in certain columns and more. This tutorial will guide you through easy steps to achieve this using Python.
Step 1: Install Important Libraries
To validate an Excel file in Python, we will use two important libraries – pandas and openpyxl.
Pandas is an open-source data analysis and manipulation tool, while openpyxl enables Python to read/write Excel 2010 xlsx/xlsm/xltx/xltm files.
You can install them using the pip package installer:
1 |
pip install pandas openpyxl |
Step 2: Read the Excel File
Let’s say the Excel file (‘example.xlsx’) you wish to validate has the following content:
Name | Age | |
John | 34 | [email protected] |
Anna | null | [email protected] |
Tom | null | [email protected] |
The first step is to read the Excel file and load it as a pandas DataFrame.
1 2 3 |
import pandas as pd data = pd.read_excel('example.xlsx') |
Step 3: Validate Data in Excel File
The next step is to validate the data in the Excel file. Let’s start by checking if there are null values in the ‘Age’ column and if the ‘Email’ column exists.
1 2 3 4 5 6 7 |
# Check if 'Age' column has null values if data['Age'].isnull().any(): print("Null value found in 'Age' column") # Check if 'Email' column exists if 'Email' not in data.columns: print("'Email' column not found") |
Step 4: Validate More Cases
You can add as many validations as required for your task. For example, you may wish to check if data in a specific column follows a certain pattern.
Full Code
Below is the full code combining the above steps together.
1 2 3 4 5 6 7 8 9 10 11 |
import pandas as pd data = pd.read_excel('example.xlsx') # Check if 'Age' column has null values if data['Age'].isnull().any(): print("Null value found in 'Age' column") # Check if 'Email' column exists if 'Email' not in data.columns: print("'Email' column not found") |
Null value found in 'Age' column
Conclusion
As shown in this tutorial, Python’s pandas and openpyxl libraries make it easy to validate an Excel file. Being able to validate Excel sheets can be incredibly handy when dealing with datasets where the quality and correctness of the data are crucial.
The steps shown above can be extended and modified as per your requirements to create a robust Excel file checker for your projects.