In this tutorial, we will learn how to detect outliers in Python with a Box Plot and remove them with the Interquartile Range (IQR) rule. Outliers are data points that differ greatly from the other observations and can distort your analysis.
Therefore, it’s often necessary to detect and deal with these outliers. One of the most common ways to detect outliers visually is to create a Box Plot, also known as a Box and Whisker Plot.
We will use Python’s Pandas and Matplotlib libraries to accomplish this task.
Step 1: Importing necessary libraries
The first step is to import the necessary Python libraries. We will be using Matplotlib for creating Box plots and Pandas for managing data.
    import pandas as pd
    import matplotlib.pyplot as plt
Step 2: Loading and inspecting the data
Here, we will load the dataset into a DataFrame using Pandas and view the first few rows. Let’s use a small sample dataset as an example.
    data = {'Values': [25, 26, 28, 29, 30, 39, 40, 42, 48, 49, 500]}
    df = pd.DataFrame(data)
    print(df.head())
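Because `head()` prints only the first five rows, the extreme value 500 in the last row does not appear in that output. As a rough complement, summary statistics can hint at the problem before any plotting. The sketch below uses the same sample data and compares the mean with the median; the variable names `mean_value` and `median_value` are introduced here only for illustration.

```python
import pandas as pd

# Same sample data as in the tutorial.
df = pd.DataFrame({'Values': [25, 26, 28, 29, 30, 39, 40, 42, 48, 49, 500]})

# describe() reports count, mean, std, min, quartiles and max in one call.
print(df['Values'].describe())

mean_value = df['Values'].mean()
median_value = df['Values'].median()

# A mean far above the median suggests a few very large values are pulling it up.
print(f"mean = {mean_value:.2f}, median = {median_value}")
```

A large gap between the two is only a hint, not proof of an outlier, which is why the box plot in the next step is still worth drawing.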
Step 3: Creating a Box Plot
Using matplotlib, we will create a Box plot. This is done using the boxplot function from pyplot.
    plt.boxplot(df["Values"])
    plt.show()
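The points drawn beyond the whiskers can also be read programmatically. `boxplot` returns a dictionary of the artists it drew, and its `'fliers'` entry holds the outlier markers. The following is a minimal sketch on the same sample data; the `Agg` backend line is an assumption added so that it runs without a display, and it can be dropped in an interactive session.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window is opened
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'Values': [25, 26, 28, 29, 30, 39, 40, 42, 48, 49, 500]})

fig, ax = plt.subplots()
# boxplot returns a dict with keys such as 'boxes', 'whiskers', 'medians', 'fliers'.
bp = ax.boxplot(df["Values"])

# One Line2D per plotted box; its y-data are the values drawn as outlier markers.
flier_values = bp["fliers"][0].get_ydata()
print(flier_values)

plt.close(fig)
```

Matplotlib places the whiskers at 1.5 times the IQR by default (the `whis` parameter), which matches the rule used in the next step.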
Step 4: Removing the Outliers
Finally, we will remove the outliers. From the box plot, we can see that the value 500 is an outlier. We will use the IQR (Interquartile Range) method to remove it. The IQR is the difference between the third quartile (Q3) and the first quartile (Q1), and any value less than Q1 - 1.5 * IQR or greater than Q3 + 1.5 * IQR is considered an outlier.
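    # First and third quartiles of the column
    Q1 = df['Values'].quantile(0.25)
    Q3 = df['Values'].quantile(0.75)
    IQR = Q3 - Q1
    # Boolean mask that keeps only the values inside the IQR fences
    # (named mask to avoid shadowing Python's built-in filter)
    mask = (df['Values'] >= Q1 - 1.5 * IQR) & (df['Values'] <= Q3 + 1.5 * IQR)
    df = df.loc[mask]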
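When the same rule has to be applied to several columns, it can help to wrap it in a function. The sketch below is one possible way to do that; `iqr_bounds` and `remove_outliers_iqr` are hypothetical helper names introduced here, not part of Pandas, and the multiplier `k` is exposed as a parameter on the assumption that 1.5 may not suit every dataset.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers_iqr(frame, column, k=1.5):
    """Return the rows of frame whose column value lies within the fences."""
    lower, upper = iqr_bounds(frame[column], k)
    # between() is inclusive on both ends by default, matching >= and <=.
    return frame[frame[column].between(lower, upper)]


df = pd.DataFrame({'Values': [25, 26, 28, 29, 30, 39, 40, 42, 48, 49, 500]})

lower, upper = iqr_bounds(df['Values'])
cleaned = remove_outliers_iqr(df, 'Values')

print(lower, upper)
print(cleaned)
```

For the sample data, Q1 is 28.5 and Q3 is 45, so the IQR is 16.5 and the fences are 3.75 and 69.75. Only the value 500 falls outside them.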
The full code:
    import pandas as pd
    import matplotlib.pyplot as plt

    data = {'Values': [25, 26, 28, 29, 30, 39, 40, 42, 48, 49, 500]}
    df = pd.DataFrame(data)
    print(df.head())

    plt.boxplot(df["Values"])
    plt.show()

    Q1 = df['Values'].quantile(0.25)
    Q3 = df['Values'].quantile(0.75)
    IQR = Q3 - Q1

    # Keep only the values inside the IQR fences
    mask = (df['Values'] >= Q1 - 1.5 * IQR) & (df['Values'] <= Q3 + 1.5 * IQR)
    df = df.loc[mask]
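One way to confirm that the removal worked is to draw the box plot again on the filtered data and check that no points remain beyond the whiskers. This is a minimal sketch under the same assumptions as the tutorial's sample data; the names `cleaned` and `remaining_fliers` are illustrative, and the `Agg` backend is used only so that the check runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window is opened
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'Values': [25, 26, 28, 29, 30, 39, 40, 42, 48, 49, 500]})

# Apply the IQR rule exactly as in the steps above.
q1 = df['Values'].quantile(0.25)
q3 = df['Values'].quantile(0.75)
iqr = q3 - q1
mask = (df['Values'] >= q1 - 1.5 * iqr) & (df['Values'] <= q3 + 1.5 * iqr)
cleaned = df.loc[mask]

# Redraw the box plot on the filtered values and read back the outlier markers.
fig, ax = plt.subplots()
bp = ax.boxplot(cleaned['Values'].to_numpy())
remaining_fliers = bp['fliers'][0].get_ydata()

print(f"rows kept: {len(cleaned)}, outliers still flagged: {len(remaining_fliers)}")
plt.close(fig)
```

Removing points changes the quartiles, so a fresh box plot can flag values that were previously inside the fences. For this sample none appear, but on real data it is worth checking.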
Conclusion
Outliers can distort summary statistics and, in turn, hurt a model’s accuracy when predicting data. Using a box plot, we can visualize outliers, and with the help of the Interquartile Range (IQR), we can remove them.
The Python libraries Pandas and Matplotlib are powerful tools for this kind of data analysis.