In this tutorial, we look at one of the most fundamental steps in data preprocessing: data normalization.
Data normalization is a technique used in machine learning to bring features with different ranges onto a common scale. Here we use min-max scaling, which rescales the values of each feature to a fixed range (0 to 1 in our case).
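To make the idea concrete, min-max scaling maps each value x to (x - min) / (max - min). The sketch below applies that formula by hand to a small list. The function name `min_max_scale` and the sample values are illustrative assumptions, not part of any library, and returning zeros for a constant list is just one reasonable convention.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1] with min-max scaling."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no range to scale; return zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


sample = [10, 5, 8, 15]  # illustrative values
scaled = min_max_scale(sample)
print(scaled)  # [0.5, 0.0, 0.3, 1.0]
```

The smallest value maps to 0, the largest to 1, and everything else lands proportionally in between.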
The main reason we normalize data is to prevent features with larger numeric ranges from dominating those with smaller ranges. Another reason is to reduce numerical instability. In addition, gradient descent often converges faster when the features are scaled.
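The "dominating" effect is easiest to see with a distance calculation. The following is a minimal sketch with made-up data: the age and income columns, the sample values, and the helper names `min_max` and `distance` are all assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical samples: column 0 is age in years, column 1 is income in dollars.
X = np.array([
    [25.0, 50_000.0],
    [60.0, 51_000.0],
    [26.0, 90_000.0],
])


def min_max(X):
    """Scale every column of X to the range [0, 1]."""
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    return (X - col_min) / (col_max - col_min)


def distance(a, b):
    """Euclidean distance between two samples."""
    return float(np.linalg.norm(a - b))


X_scaled = min_max(X)

# Raw data: the income column decides almost everything.
raw_01 = distance(X[0], X[1])  # 35-year age gap, small income gap
raw_02 = distance(X[0], X[2])  # 1-year age gap, large income gap

# Scaled data: both columns contribute on the same 0-to-1 scale.
scaled_01 = distance(X_scaled[0], X_scaled[1])
scaled_02 = distance(X_scaled[0], X_scaled[2])

print(f"raw:    {raw_01:.1f} vs {raw_02:.1f}")
print(f"scaled: {scaled_01:.3f} vs {scaled_02:.3f}")
```

On the raw data the second distance is roughly 40 times the first, driven almost entirely by income. After scaling, the two distances are nearly equal, because the large age gap now counts as much as the large income gap.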
Step 1: Save the Data to a File
Save the following values in a file named data.csv:
Feature1,Feature2,Feature3
10,20,30
5,15,25
8,12,18
15,25,35
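If you would rather create the file from Python than from a text editor, a short sketch like the one below should work. It uses only the standard library and writes data.csv to the current working directory; the variable name `csv_text` is an illustrative choice.

```python
from pathlib import Path

# The sample dataset, one row per line, with a header row.
csv_text = """Feature1,Feature2,Feature3
10,20,30
5,15,25
8,12,18
15,25,35
"""

# Write the file next to wherever the script is run from.
Path("data.csv").write_text(csv_text)
```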
Step 2: Import the Libraries and Load Your Dataset
Once pandas and scikit-learn are installed (for example with pip install pandas scikit-learn), import them into your Python script by adding the following lines:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
Next, load your dataset and store it in a pandas DataFrame. Here we use the data.csv file from Step 1. This is how we load it:
df = pd.read_csv('data.csv')
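Before scaling, it can help to confirm that the data loaded as expected, since MinMaxScaler works on numeric columns. The sketch below is one possible sanity check. It reads the same rows from an in-memory string instead of data.csv so that it runs on its own; the names `csv_text`, `missing`, and `all_numeric` are illustrative.

```python
import io

import pandas as pd

# Stand-in for data.csv so the snippet runs on its own.
csv_text = """Feature1,Feature2,Feature3
10,20,30
5,15,25
8,12,18
15,25,35
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (rows, columns)
print(df.dtypes)  # one dtype per column

# Count missing cells and check that every column holds numbers.
missing = int(df.isna().sum().sum())
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in df.dtypes)
print(missing, all_numeric)
```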
Step 3: Initialize the Scaler and Transform the Data
After preparing our data, we can start the normalization process. MinMaxScaler scales each feature to a given range, which is 0 to 1 by default. Its fit_transform method returns a NumPy array, so we wrap the result in a new DataFrame to keep the column names.
scaler = MinMaxScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
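The 0 to 1 range is only the default. As a minimal sketch, the `feature_range` parameter of MinMaxScaler can be used to pick a different target range, here -1 to 1. The DataFrame is rebuilt inline so the snippet is self-contained, and the names `scaler_pm1` and `df_pm1` are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Same values as data.csv, built inline so the snippet runs on its own.
df = pd.DataFrame({
    "Feature1": [10, 5, 8, 15],
    "Feature2": [20, 15, 12, 25],
    "Feature3": [30, 25, 18, 35],
})

# Scale each feature to the range [-1, 1] instead of the default [0, 1].
scaler_pm1 = MinMaxScaler(feature_range=(-1, 1))
df_pm1 = pd.DataFrame(scaler_pm1.fit_transform(df), columns=df.columns)

print(df_pm1)
```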
Step 4: Verify the Results
Finally, verify the transformed data by viewing the first few rows of your DataFrame:
print(df_scaled.head())
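Looking at the rows is a good first check. A slightly stricter one, sketched below, confirms that every column now spans exactly 0 to 1 and that the scaler's output agrees with the min-max formula (x - min) / (max - min) computed directly in pandas. The DataFrame is rebuilt inline, and the names `manual` and `matches` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Same values as data.csv, built inline so the snippet runs on its own.
df = pd.DataFrame({
    "Feature1": [10, 5, 8, 15],
    "Feature2": [20, 15, 12, 25],
    "Feature3": [30, 25, 18, 35],
})
df_scaled = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)

# Every column should now start at 0 and end at 1.
print(df_scaled.min())
print(df_scaled.max())

# The same result computed directly from the min-max formula.
manual = (df - df.min()) / (df.max() - df.min())
matches = np.allclose(df_scaled.to_numpy(), manual.to_numpy())
print(matches)
```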
Full Code
Here is the entire code put together:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv('data.csv')

scaler = MinMaxScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

print(df_scaled.head())
Output:
   Feature1  Feature2  Feature3
0       0.5  0.615385  0.705882
1       0.0  0.230769  0.411765
2       0.3  0.000000  0.000000
3       1.0  1.000000  1.000000
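Because the fitted scaler remembers each column's minimum and maximum, the scaled values can be mapped back to the original units. The sketch below uses the `inverse_transform` method of MinMaxScaler for this. The DataFrame is rebuilt inline, and the names `scaled`, `restored`, and `recovered` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Same values as data.csv, built inline so the snippet runs on its own.
df = pd.DataFrame({
    "Feature1": [10, 5, 8, 15],
    "Feature2": [20, 15, 12, 25],
    "Feature3": [30, 25, 18, 35],
})

scaler = MinMaxScaler()
scaled = scaler.fit_transform(df)

# Undo the scaling to get back values in the original units.
restored = pd.DataFrame(scaler.inverse_transform(scaled), columns=df.columns)
recovered = np.allclose(restored.to_numpy(), df.to_numpy())

print(restored)
print(recovered)
```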
Conclusion
That’s it! You have successfully normalized your dataset to the range 0 to 1 in Python. As you can see, it’s a straightforward process thanks to Python’s libraries. Remember that proper data preprocessing, including normalization, is key to building a good machine learning model.