In this tutorial, we will learn how to create class intervals in Python. Class intervals are used in statistics to group data into specific ranges. They help in analyzing and representing the data in a more meaningful way. In Python, we can use libraries such as NumPy and Pandas for this purpose.
For this tutorial, we’ll be using a dataset containing the marks of students in a subject.
Example:
We will use the following sample data. Save it as marks.csv:
    ,Marks
    0,45
    1,67
    2,80
    3,35
    4,90
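If you would rather generate the file from Python than type it by hand, a minimal sketch follows. It assumes the current working directory is writable; the names sample_marks and csv_text are ours and are not used elsewhere in this tutorial.

```python
import pandas as pd

# The five sample marks from the example above.
sample_marks = pd.DataFrame({"Marks": [45, 67, 80, 35, 90]})

# to_csv writes the row index as an unnamed first column by default,
# which is where the leading comma in the header line (",Marks") comes from.
sample_marks.to_csv("marks.csv")

# Read the raw text back to confirm the layout.
with open("marks.csv") as f:
    csv_text = f.read()
print(csv_text)
```

The unnamed first column matters later: when the file is read back, that column holds row labels rather than data.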
Step 1: Importing Required Libraries
Before we begin, let’s ensure that we have both NumPy and Pandas libraries installed. Then, import the required libraries by adding the following lines of code:
    import numpy as np
    import pandas as pd
Step 2: Load the Data
Next, load the data into a Pandas DataFrame, which is a 2-dimensional labeled data structure with columns of potentially different types. For this tutorial, we’ll be using a CSV file containing the students’ marks, named “marks.csv”.
    # The first CSV column holds the row labels, so use it as the index
    df = pd.read_csv("marks.csv", index_col=0)
    print(df.head())
Output:
       Marks
    0     45
    1     67
    2     80
    3     35
    4     90
Step 3: Define Class Intervals
This step involves defining the class intervals (or bins) for your data. This can be done either manually or with the numpy.histogram_bin_edges() function. We will demonstrate both methods below.
Manual Method:
    class_intervals = [0, 40, 60, 80, 100]
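To see which ranges these manual edges actually define, the sketch below turns them into explicit intervals. pd.IntervalIndex.from_breaks is a Pandas helper that this tutorial does not otherwise use; it appears here only as an illustration.

```python
import pandas as pd

class_intervals = [0, 40, 60, 80, 100]

# Five edges define four intervals. from_breaks closes each interval on the
# right by default, which is also the default convention of pd.cut.
intervals = pd.IntervalIndex.from_breaks(class_intervals)

for interval in intervals:
    print(interval)

# A mark of exactly 40 belongs to the first interval, not the second.
print(40 in intervals[0], 40 in intervals[1])
```

Note that the number of intervals is always one less than the number of edges, and that a mark of exactly 0 would fall outside the first interval under this convention.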
Method Using NumPy:
The following code computes equal-width bin edges that span the dataset's minimum and maximum values for a specified bin count. In this example, we use five bins, which gives six edges.
    bin_count = 5
    class_intervals = np.histogram_bin_edges(df['Marks'], bin_count)
    print(class_intervals)
Output:
    [35. 46. 57. 68. 79. 90.]
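The arithmetic behind equal-width edges can be reproduced by hand. The sketch below does so for the five sample marks; the names width and manual_edges are ours, and np.linspace is used only to show what an integer bin count amounts to.

```python
import numpy as np

marks = np.array([45, 67, 80, 35, 90])
bin_count = 5

# Equal-width binning: each bin spans (max - min) / bin_count.
width = (marks.max() - marks.min()) / bin_count  # (90 - 35) / 5

# bin_count bins need bin_count + 1 edges, evenly spaced from min to max.
manual_edges = np.linspace(marks.min(), marks.max(), bin_count + 1)

numpy_edges = np.histogram_bin_edges(marks, bins=bin_count)

print(width)
print(np.allclose(manual_edges, numpy_edges))
```

Because the edges depend only on the minimum, the maximum, and the bin count, a single outlier can stretch every class interval.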
Step 4: Categorize the Data into Classes
Now that we have defined our class intervals, it’s time to categorize the data into the specified classes. We can use the Pandas cut() function to achieve this.
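    # include_lowest=True keeps the minimum mark, which sits exactly on the first edge
    df['Class'] = pd.cut(df['Marks'], bins=class_intervals, include_lowest=True)
    print(df.head())
Output:
       Marks           Class
    0     45  (34.999, 46.0]
    1     67    (57.0, 68.0]
    2     80    (79.0, 90.0]
    3     35  (34.999, 46.0]
    4     90    (79.0, 90.0]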
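One boundary detail is worth checking. By default, pd.cut builds intervals that are open on the left and closed on the right, so a value equal to the lowest edge is not assigned to any class. When the edges are derived from the data's own minimum, as with np.histogram_bin_edges, that value is the smallest mark. The sketch below contrasts the default with include_lowest=True on the five sample marks; the variable names are ours.

```python
import numpy as np
import pandas as pd

marks = pd.Series([45, 67, 80, 35, 90])
edges = np.histogram_bin_edges(marks, bins=5)

# Default: the minimum (35) sits exactly on the first edge of a left-open
# interval, so it is left unassigned (NaN).
default_cut = pd.cut(marks, bins=edges)

# include_lowest=True closes the first interval on the left as well.
inclusive_cut = pd.cut(marks, bins=edges, include_lowest=True)

print(default_cut.isna().sum())
print(inclusive_cut.isna().sum())
```

With manually chosen edges that start below the smallest value, such as 0 for marks that are always positive, the default behaviour is usually sufficient.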
Step 5: Count the Data in Each Class
Finally, we will count the number of data points in each class. We can use the DataFrame groupby() and size() methods to achieve this.
    # observed=False also lists classes that contain no data points
    class_counts = df.groupby('Class', observed=False).size()
    print(class_counts)
Output:
    Class
    (34.999, 46.0]    2
    (46.0, 57.0]      0
    (57.0, 68.0]      1
    (68.0, 79.0]      0
    (79.0, 90.0]      2
    dtype: int64
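As a cross-check, NumPy can compute the edges and the counts in a single call. This is a sketch of an alternative, not the approach used in this tutorial. Note that np.histogram treats every bin as closed on the left and open on the right, except the last bin, which is closed on both sides, so its counts can differ from pd.cut for values that fall exactly on an edge.

```python
import numpy as np

marks = np.array([45, 67, 80, 35, 90])

# np.histogram returns the per-bin counts and the bin edges together.
counts, edges = np.histogram(marks, bins=5)

# Print each class interval next to its count.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(left, right, count)
```

Every mark between the minimum and the maximum is counted, so the counts always sum to the number of data points.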
Full Code:
    import numpy as np
    import pandas as pd

    # Load data (the first CSV column holds the row labels)
    df = pd.read_csv("marks.csv", index_col=0)

    # Define class intervals (using any method)
    class_intervals = np.histogram_bin_edges(df['Marks'], 5)

    # Assign data to classes, keeping the minimum mark in the first class
    df['Class'] = pd.cut(df['Marks'], bins=class_intervals, include_lowest=True)

    # Count the number of data points in each class, including empty classes
    class_counts = df.groupby('Class', observed=False).size()
    print(class_counts)
Conclusion:
In this tutorial, we have learned how to create class intervals in Python using the NumPy and Pandas libraries. We demonstrated how to define class intervals, categorize data into classes, and count the number of data points in each class. You can now apply this knowledge to your data analysis and statistical projects in Python.