In this tutorial, we will learn how to compare two columns in Python using two popular data manipulation libraries, pandas and NumPy. Comparing columns is a common step in data analysis, and both libraries handle it with short, vectorized expressions.
Step 1: Import libraries and load data
First, let’s import the necessary libraries and read our data into a pandas DataFrame. For this tutorial, we will use a small dataset stored in a CSV file called sample_data.csv, which contains information about students, their grades, and their ages.
Here is the content of our sample CSV file:
    Name,Age,Grade1,Grade2
    Alice,21,85,90
    Bob,19,80,75
    Charlie,22,92,88
    David,20,78,82
    Eva,23,95,98
Now, let’s import the libraries and load the data:
    import pandas as pd
    import numpy as np

    # Read the CSV file
    data = pd.read_csv('sample_data.csv')
    print(data)
Output:
          Name  Age  Grade1  Grade2
    0    Alice   21      85      90
    1      Bob   19      80      75
    2  Charlie   22      92      88
    3    David   20      78      82
    4      Eva   23      95      98
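If you do not have sample_data.csv on disk, the same DataFrame can be built from an in-memory string. This is a minimal sketch that assumes the CSV content shown above; the variable name csv_text is only illustrative.

```python
import io

import pandas as pd

# The sample rows from this tutorial, kept in a string instead of a file
csv_text = """Name,Age,Grade1,Grade2
Alice,21,85,90
Bob,19,80,75
Charlie,22,92,88
David,20,78,82
Eva,23,95,98
"""

# read_csv accepts any file-like object, so wrap the string in StringIO
data = pd.read_csv(io.StringIO(csv_text))

# Check that the grade columns were parsed as integers before comparing them
print(data.dtypes)
```

Checking the dtypes first is worthwhile because comparisons between a numeric column and a column that was read as text will not behave as expected.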
Step 2: Compare two columns
We can compare two columns in various ways, such as checking whether one is equal to, greater than, or less than the other. Let’s start by comparing the ‘Grade1’ and ‘Grade2’ columns to check whether each student’s Grade1 score is equal to their Grade2 score.
    # Compare Grade1 and Grade2 columns for equality
    result = data['Grade1'] == data['Grade2']
    print(result)
Output:
    0    False
    1    False
    2    False
    3    False
    4    False
    dtype: bool
The result variable contains a pandas Series of boolean values. Each value corresponds to one row and indicates whether the two columns are equal in that row. In this case, no student has the same score in both grade columns.
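The greater-than and less-than comparisons mentioned earlier follow the same pattern. The sketch below rebuilds the sample data inline so it runs on its own; the variable names higher_first, higher_second, any_equal, and count_higher_first are illustrative, not part of the original tutorial.

```python
import pandas as pd

# Same sample data as sample_data.csv, built inline for a self-contained example
data = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Age": [21, 19, 22, 20, 23],
    "Grade1": [85, 80, 92, 78, 95],
    "Grade2": [90, 75, 88, 82, 98],
})

# Element-wise comparisons return a boolean Series, one value per row
higher_first = data["Grade1"] > data["Grade2"]
higher_second = data["Grade1"] < data["Grade2"]
print(higher_first)

# Summarize a boolean Series: any() tests for at least one True,
# sum() counts the True values
any_equal = (data["Grade1"] == data["Grade2"]).any()
count_higher_first = higher_first.sum()
print(any_equal, count_higher_first)
```

Summaries like any() and sum() are often more convenient than reading the full boolean Series when the DataFrame has many rows.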
Step 3: Calculate the difference between two columns
Now let’s calculate the absolute difference between the Grade1 and Grade2 columns. We will use the abs() function from the numpy library to get the absolute values.
    # Calculate the absolute difference between Grade1 and Grade2 columns
    difference = np.abs(data['Grade1'] - data['Grade2'])
    print(difference)
Output:
    0    5
    1    5
    2    4
    3    4
    4    3
    dtype: int64
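As an alternative, pandas Series have their own abs() method, so the same result can be obtained without NumPy. The sketch below also keeps the signed difference, which shows the direction of the change; it uses only the two grade columns from the sample data, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Grade columns from the sample data, built inline for a self-contained example
data = pd.DataFrame({
    "Grade1": [85, 80, 92, 78, 95],
    "Grade2": [90, 75, 88, 82, 98],
})

# Signed difference: negative means Grade2 is higher than Grade1
signed = data["Grade1"] - data["Grade2"]

# Two equivalent ways to take the absolute value
abs_numpy = np.abs(signed)
abs_pandas = signed.abs()

print(signed.tolist())
print(abs_pandas.tolist())
```

Which form to use is largely a matter of style; the pandas method avoids an extra import when NumPy is not otherwise needed.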
Step 4: Add the comparison result to the DataFrame
We can add the result of the comparison as a new column in the DataFrame. This is helpful when we want to store the result for further analysis. In this example, we’ll add the ‘Difference’ column containing the absolute difference between Grade1 and Grade2.
    # Add the 'Difference' column to the DataFrame
    data['Difference'] = difference
    print(data)
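Once a comparison is stored as a column, it can drive further analysis such as filtering. The following is a minimal sketch, assuming the sample data from this tutorial; the ‘Improved’ column name and the threshold of 5 points are hypothetical choices made for illustration.

```python
import numpy as np
import pandas as pd

# Same sample data as sample_data.csv, built inline for a self-contained example
data = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Age": [21, 19, 22, 20, 23],
    "Grade1": [85, 80, 92, 78, 95],
    "Grade2": [90, 75, 88, 82, 98],
})

# Store the absolute difference, as in Step 4
data["Difference"] = np.abs(data["Grade1"] - data["Grade2"])

# Store a boolean comparison as a column too (hypothetical column name)
data["Improved"] = data["Grade2"] > data["Grade1"]

# Use a stored column as a filter: keep rows with a gap of 5 points or more
large_gap = data[data["Difference"] >= 5]
print(large_gap[["Name", "Difference", "Improved"]])
```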
Full code:
    import pandas as pd
    import numpy as np

    # Read the CSV file
    data = pd.read_csv('sample_data.csv')

    # Compare Grade1 and Grade2 columns for equality
    result = data['Grade1'] == data['Grade2']

    # Calculate the absolute difference between Grade1 and Grade2 columns
    difference = np.abs(data['Grade1'] - data['Grade2'])

    # Add the 'Difference' column to the DataFrame
    data['Difference'] = difference
    print(data)
Output:
          Name  Age  Grade1  Grade2  Difference
    0    Alice   21      85      90           5
    1      Bob   19      80      75           5
    2  Charlie   22      92      88           4
    3    David   20      78      82           4
    4      Eva   23      95      98           3
Conclusion
In this tutorial, we learned how to compare two columns in Python using the pandas and NumPy libraries. We checked two columns for equality, calculated the absolute difference between them, and added the result to the DataFrame as a new column. The same techniques apply to many other data manipulation tasks in data analysis.