In this tutorial, we will learn how to perform **predictive analysis** using Python. Predictive analysis refers to the use of data, machine learning techniques, and statistical algorithms to predict future outcomes based on historical data.

The main aim of predictive analysis is to forecast what might happen in the future and to test assumptions against data, helping businesses make informed decisions. Python provides libraries such as **NumPy, pandas, matplotlib,** and **scikit-learn** for implementing predictive analysis.

### Step 1: Generate random values

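The following code generates a synthetic dataset of house prices and living areas and saves it as `house_prices.csv`. It requires NumPy and pandas (see Step 2 if they are not installed yet):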

```python
import numpy as np
import pandas as pd

# Generate a random dataset of house prices and living areas
np.random.seed(42)
n_samples = 1000
living_area = np.random.normal(loc=2000, scale=500, size=n_samples)
noise = np.random.normal(loc=0, scale=10000, size=n_samples)

# Add some random noise to the sale price
sale_price = 200000 + living_area * 100 + noise
data = pd.DataFrame({'LivingArea': living_area, 'SalePrice': sale_price})

# Save the dataset to a CSV file
data.to_csv('house_prices.csv', index=False)
```
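As a quick sanity check, the sketch below regenerates the same data in memory and confirms its basic properties. It reuses the seed and parameters from this step; the use of `np.polyfit` to recover the slope and intercept is an addition for illustration only.

```python
import numpy as np
import pandas as pd

# Recreate the synthetic dataset with the same seed and parameters
np.random.seed(42)
n_samples = 1000
living_area = np.random.normal(loc=2000, scale=500, size=n_samples)
noise = np.random.normal(loc=0, scale=10000, size=n_samples)
sale_price = 200000 + living_area * 100 + noise
data = pd.DataFrame({'LivingArea': living_area, 'SalePrice': sale_price})

# Summary statistics: LivingArea should be centered near 2000
print(data.describe())

# A degree-1 least-squares fit should recover a slope near 100 and an
# intercept near 200000, the values used to build sale_price
slope, intercept = np.polyfit(
    data['LivingArea'].to_numpy(), data['SalePrice'].to_numpy(), deg=1
)
print("Estimated slope:", slope)
print("Estimated intercept:", intercept)
```

Because the noise is small relative to the spread in living area, the estimates should land close to the true values, which is why a linear model is a reasonable choice for this dataset.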

### Step 2: Install the required libraries

To perform predictive analysis in Python, we will be using the pandas, NumPy, matplotlib, and scikit-learn libraries. If you don’t have these libraries installed, you can install them with the following pip command:

```bash
pip install numpy pandas matplotlib scikit-learn
```

### Step 3: Import required libraries

After installing the required libraries, we can import them into our Python script as follows:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
```

### Step 4: Load and preprocess the dataset

For this tutorial, we’ll use the sample dataset generated in Step 1, which contains **house prices** and their respective **living area** in square feet. If you prefer real data, you can download a house-price dataset from Kaggle and adapt the column names accordingly. Load the dataset using pandas:

```python
data = pd.read_csv('house_prices.csv')

# print() makes the preview visible when running as a script, not only in a notebook
print(data.head())
```

After loading the dataset, we will preprocess it by removing any missing values and selecting relevant features for our analysis:

```python
data = data[['LivingArea', 'SalePrice']].dropna()
X = data['LivingArea'].values.reshape(-1, 1)
y = data['SalePrice'].values
```
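To see what this preprocessing does, here is a minimal sketch on a tiny, hypothetical DataFrame (the `toy` frame, its values, and the `Bedrooms` column are invented for illustration). It shows that `dropna()` removes the row with a missing price and that `reshape(-1, 1)` turns the feature into the 2-D array scikit-learn expects.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame with one missing price and an extra column
toy = pd.DataFrame({
    'LivingArea': [1500.0, 2000.0, 2500.0, 3000.0],
    'SalePrice': [350000.0, np.nan, 450000.0, 500000.0],
    'Bedrooms': [3, 4, 4, 5],
})

# Keep the two relevant columns, then drop rows with missing values
toy = toy[['LivingArea', 'SalePrice']].dropna()

# scikit-learn expects features shaped (n_samples, n_features)
X_toy = toy['LivingArea'].values.reshape(-1, 1)
y_toy = toy['SalePrice'].values

print(X_toy.shape)  # (3, 1)
print(y_toy.shape)  # (3,)
```

The target `y` can stay one-dimensional; only the feature matrix needs the extra axis.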

### Step 5: Split the dataset into a training and testing set

Before moving ahead, we’ll split our dataset into a training set and a testing set. This will help us evaluate the performance of our predictive model:

```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
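The effect of `test_size=0.2` and `random_state=42` can be checked with a small sketch. The arrays `X_demo` and `y_demo` are placeholders standing in for the 1000-row dataset, not the tutorial's actual data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays standing in for a 1000-row dataset
X_demo = np.arange(1000, dtype=float).reshape(-1, 1)
y_demo = np.arange(1000, dtype=float)

# test_size=0.2 holds out 20% of the rows for testing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)  # (800, 1) (200, 1)

# Reusing the same random_state reproduces the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(np.array_equal(X_te, X_te2))  # True
```

Fixing `random_state` keeps the evaluation reproducible from run to run, which makes it easier to compare results after changing the model.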

### Step 6: Train and evaluate the linear regression model

Now that we have our training and testing sets ready, we can train our linear regression model and evaluate its performance:

```python
model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean squared error: ", mse)
print("R² Score: ", r2)
```

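After evaluating the model on the synthetic dataset from Step 1, the output should resemble the following (exact values may differ slightly between library versions):

```
Mean squared error: 87284861.23745254
R² Score: 0.9434939781511769
```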
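To make the two metrics concrete, the sketch below recomputes them by hand with NumPy and compares the results with scikit-learn's functions. The four prices in `y_true` and `y_hat` are made-up values for illustration, and the root mean squared error (`rmse`) is an extra not used elsewhere in this tutorial.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted prices, for illustration only
y_true = np.array([300000.0, 400000.0, 500000.0, 600000.0])
y_hat = np.array([310000.0, 390000.0, 520000.0, 580000.0])

# MSE: the average of the squared prediction errors
mse_manual = np.mean((y_true - y_hat) ** 2)

# R²: 1 minus the ratio of residual variation to total variation
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# RMSE puts the error back in the units of the target (price)
rmse = np.sqrt(mse_manual)

print("Manual MSE:", mse_manual, "| sklearn:", mean_squared_error(y_true, y_hat))
print("Manual R²:", r2_manual, "| sklearn:", r2_score(y_true, y_hat))
print("RMSE:", rmse)
```

An R² close to 1 means the model explains most of the variation in price, while the MSE is expressed in squared price units, which is why it looks so large.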

### Step 7: Visualize the results

To visualize the results, we can plot the original data points along with the fitted line generated by our linear regression model:

```python
plt.scatter(X_test, y_test, color='blue', label='Actual')
plt.plot(X_test, y_pred, color='red', label='Predicted')

plt.xlabel('Living Area (sq. ft.)')
plt.ylabel('Sale Price')

plt.legend()
plt.show()
```

This will display a scatter plot showing the relationship between house prices and living area, with the fitted line from our linear regression model.
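The fitted line can also be inspected numerically. The following sketch rebuilds the synthetic data in memory, fits the same kind of model, reads its slope and intercept from `coef_` and `intercept_`, and predicts the price of a hypothetical 2,500 sq. ft. house. The 2,500 figure and the variable names `new_area` and `predicted_price` are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Rebuild the synthetic data from Step 1 in memory
np.random.seed(42)
n_samples = 1000
living_area = np.random.normal(loc=2000, scale=500, size=n_samples)
noise = np.random.normal(loc=0, scale=10000, size=n_samples)
sale_price = 200000 + living_area * 100 + noise

X = living_area.reshape(-1, 1)
y = sale_price
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)

# The fitted line is: price = intercept + slope * living_area
slope = model.coef_[0]
intercept = model.intercept_
print("Slope (price per sq. ft.):", slope)
print("Intercept:", intercept)

# Predict the price of a hypothetical 2,500 sq. ft. house
new_area = np.array([[2500.0]])
predicted_price = model.predict(new_area)[0]
print("Predicted price:", predicted_price)
```

Since the data was generated with a slope of 100 and an intercept of 200000, the learned values should be close to those numbers.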

## Full code

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Load and preprocess dataset
data = pd.read_csv('house_prices.csv')
data = data[['LivingArea', 'SalePrice']].dropna()
X = data['LivingArea'].values.reshape(-1, 1)
y = data['SalePrice'].values

# Split dataset into training and testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train and evaluate the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean squared error: ", mse)
print("R² Score: ", r2)

# Visualize the results
plt.scatter(X_test, y_test, color='blue', label='Actual')
plt.plot(X_test, y_pred, color='red', label='Predicted')

plt.xlabel('Living Area (sq. ft.)')
plt.ylabel('Sale Price')

plt.legend()
plt.show()
```

## Output:

```
Mean squared error: 87284861.23745254
R² Score: 0.9434939781511769
```

## Conclusion

In this tutorial, we have learned how to perform predictive analysis using Python libraries such as pandas, NumPy, matplotlib, and scikit-learn.

We have gone through the steps of generating, loading, and preprocessing the dataset, splitting it into training and testing sets, training and evaluating a linear regression model, and visualizing the results.

With these tools, you can now adapt this approach to analyze other types of datasets and make informed predictions based on your data.