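In this tutorial, we’ll learn how to determine the P, D, and Q values in an ARIMA model using Python. ARIMA stands for Autoregressive Integrated Moving Average, a forecasting model that predicts future values of a time series from its own past values and past forecast errors.
The ARIMA model takes three main parameters, denoted as P, D, and Q: P is the number of autoregressive (lagged) terms, D is the number of times the series is differenced to make it stationary, and Q is the number of moving-average (lagged forecast error) terms. Many references write these in lowercase (p, d, q) and reserve uppercase letters for the seasonal part of a seasonal ARIMA model; this tutorial uses uppercase for the non-seasonal orders. Choosing them well is crucial for the performance of the model.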
Requirements
To follow this tutorial, you need to install the following Python libraries:
- pandas
- matplotlib
- numpy
- statsmodels
- pmdarima
You can install these libraries using pip:
```
pip install pandas matplotlib numpy statsmodels pmdarima
```
Step 1: Importing necessary libraries and loading the dataset
First, let’s import the required libraries and load the dataset, which will be a simple time series dataset. For this tutorial, we will use the Air Passengers dataset, which is a well-known time series dataset representing the total number of airline passengers per month from 1949 to 1960.
The dataset can be downloaded from Kaggle and it looks like this:
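| Month | #Passengers |
| --- | --- |
| Jan-49 | 112 |
| Feb-49 | 118 |
| Mar-49 | 132 |
| Apr-49 | 129 |
| May-49 | 121 |
| Jun-49 | 135 |
| Jul-49 | 148 |
| Aug-49 | 148 |
| Sep-49 | 136 |
| Oct-49 | 119 |
| Nov-49 | 104 |
| Dec-49 | 118 |
| Jan-50 | 115 |
| Feb-50 | 126 |
| Mar-50 | 141 |
| Apr-50 | 135 |
| ... | ... |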
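```python
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Load the dataset, using the Month column as a datetime index
data = pd.read_csv('AirPassengers.csv', parse_dates=['Month'], index_col='Month')

# print() so the preview also appears when run as a script, not only in a notebook
print(data.head())
```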
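If your copy of the file stores months in the two-digit-year layout shown in the sample above (`Jan-49`), date parsing deserves a quick check, because two-digit years below 69 are normally read as 20xx. The following is a minimal sketch under that assumption; the inline CSV text is a stand-in for the real file, and the century shift is one possible fix rather than something the dataset requires.

```python
import io

import pandas as pd

# A few rows copied from the sample above. The "Jan-49" layout is an assumption
# about how the Month column is stored in your copy of the file.
sample_csv = io.StringIO(
    "Month,#Passengers\n"
    "Jan-49,112\n"
    "Feb-49,118\n"
    "Mar-49,132\n"
    "Apr-49,129\n"
)

data = pd.read_csv(sample_csv)

# Parse "Jan-49" with an explicit format. Two-digit years 00-68 are read as 20xx,
# so any date that lands in 2000 or later is shifted back by one century.
months = pd.to_datetime(data["Month"], format="%b-%y")
months = months.where(months.dt.year < 2000, months - pd.DateOffset(years=100))

# Use the parsed dates as the index and declare a month-start frequency.
data = data.drop(columns="Month").set_index(pd.DatetimeIndex(months, name="Month"))
data = data.asfreq("MS")

print(data.index.min(), data.index.max())
```

If the file already uses four-digit years, the `parse_dates` argument in the loading code is enough and this step can be skipped.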
Step 2: Plotting the dataset
Before determining the P, D, and Q values, it is a good idea to visualize the dataset to identify any trends or seasonality.
```python
plt.figure(figsize=(10, 6))
plt.plot(data)
plt.xlabel('Month')
plt.ylabel('Number of Passengers')
plt.title('Air Passengers (1949-1960)')
plt.show()
```
From the plot, we can observe that there is an upward trend and seasonality in the data.
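The visual impression of trend and seasonality can also be checked numerically. The sketch below is one way to do that, assuming a synthetic monthly series (a rising trend with a 12-month cycle) in place of the CSV; the names `series`, `rolling_mean`, and `seasonal_profile` are illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series (an assumption, standing in for the CSV):
# a linear trend plus a 12-month cycle whose size grows with the level.
idx = pd.date_range("1949-01-01", periods=144, freq="MS")
t = np.arange(144)
level = 100 + 2.5 * t
series = pd.Series(level * (1 + 0.2 * np.sin(2 * np.pi * t / 12)), index=idx)

# A 12-month rolling mean averages over one full seasonal cycle,
# so what remains is mostly the trend.
rolling_mean = series.rolling(window=12).mean()

# Ratio of each value to the rolling mean, averaged per calendar month,
# gives a rough seasonal profile.
ratio = (series / rolling_mean).dropna()
seasonal_profile = ratio.groupby(ratio.index.month).mean()

print(rolling_mean.dropna().iloc[[0, -1]])
print(seasonal_profile.round(2))
```

A rolling mean that keeps rising points to a trend, and a seasonal profile that differs clearly between calendar months points to seasonality. The trailing window lags the true trend slightly, so treat the numbers as a rough guide.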
Step 3: Differencing the dataset
D is the number of times the series must be differenced to become stationary, that is, to have a roughly constant mean and variance over time. Differencing replaces each value with its change from the previous value, which removes a trend. We can apply it using the .diff() function provided by Pandas.
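As a small illustration of what .diff() computes, here is a sketch on a hypothetical toy series with a straight-line trend (the values are made up for the example).

```python
import pandas as pd

# Hypothetical toy series with a straight-line trend: 10, 13, 16, ...
trend = pd.Series([10 + 3 * i for i in range(8)])

first_diff = trend.diff().dropna()          # y[t] - y[t-1]
second_diff = trend.diff().diff().dropna()  # difference of the differences

print(first_diff.tolist())   # a constant step of 3: the trend is gone
print(second_diff.tolist())  # all zeros: a second difference adds nothing here
```

One round of differencing is enough for a linear trend, which is why D is usually kept as small as possible. Passing `periods=12` to .diff() would instead subtract the value from 12 steps earlier, which is how a seasonal difference is taken for monthly data.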
```python
# Differencing the dataset
data_diff = data.diff().dropna()

plt.figure(figsize=(10, 6))
plt.plot(data_diff)
plt.xlabel('Month')
plt.ylabel('Number of Passengers')
plt.title('Air Passengers (Differenced)')
plt.show()
```
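In the differenced plot, the upward trend is no longer visible, which suggests that D = 1 is a reasonable choice. The seasonal pattern is still present, since a first difference does not remove it.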
Step 4: Determine P and Q using ACF and PACF plots
Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots can be used to identify candidate values for P and Q. Let’s plot ACF and PACF on the differenced data.
```python
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# ACF plot
plot_acf(data_diff, lags=20)
plt.show()

# PACF plot
plot_pacf(data_diff, lags=20)
plt.show()
```
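In the ACF plot, the autocorrelation drops inside the confidence band after lag 1, which suggests Q = 1, because the ACF of a moving-average process cuts off after its order. The PACF likewise cuts off after lag 1, which suggests P = 1, because the PACF of an autoregressive process cuts off after its order. Treat these as starting values: any spike that remains around lag 12 reflects the yearly seasonality, which a plain first difference does not remove.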
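To make the "cuts off after lag 1" rule concrete, the following sketch simulates a moving-average process of order 1 and computes its sample ACF with plain numpy. The helper `sample_acf`, the coefficient 0.8, and the series length are assumptions chosen for the illustration.

```python
import numpy as np


def sample_acf(x, max_lag):
    """Sample autocorrelation of x for lags 0..max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array(
        [np.dot(x[: len(x) - k], x[k:]) / denom for k in range(max_lag + 1)]
    )


rng = np.random.default_rng(42)
n = 2000
noise = rng.normal(size=n + 1)

# Simulated MA(1) process: y[t] = e[t] + 0.8 * e[t-1]
y = noise[1:] + 0.8 * noise[:-1]

acf = sample_acf(y, max_lag=5)
band = 1.96 / np.sqrt(n)  # approximate 95% band for a white-noise series

for lag, value in enumerate(acf):
    flag = "significant" if lag > 0 and abs(value) > band else ""
    print(f"lag {lag}: {value:+.3f} {flag}")
```

For this process the lag-1 autocorrelation is clearly outside the band, while higher lags should mostly fall inside it, which is the pattern that points to Q = 1. An occasional higher lag can still cross the band by chance, so read the plot as a whole.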
Step 5: Auto ARIMA
We can also determine the P, D, and Q values using an automatic approach provided by the pmdarima library’s auto_arima function.
```python
from pmdarima import auto_arima

# Determine P, D, Q values using auto_arima
model = auto_arima(data, seasonal=True, m=12, trace=True)
print("\nBest Model Parameters:", model.get_params())
```
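With trace=True, auto_arima prints each candidate model it tries, and the final line of the trace names the best one. Because seasonal=True and m=12 are set, the selected model has both a non-seasonal and a seasonal part: in the dictionary returned by get_params(), the order entry holds the P, D, and Q values, and the seasonal_order entry holds the seasonal orders together with the period m. The result may differ from the values read off the ACF and PACF plots.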
Full code
```python
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from pmdarima import auto_arima

# Load the dataset
data = pd.read_csv('AirPassengers.csv', parse_dates=['Month'], index_col='Month')

# Plot the dataset
plt.figure(figsize=(10, 6))
plt.plot(data)
plt.xlabel('Month')
plt.ylabel('Number of Passengers')
plt.title('Air Passengers (1949-1960)')
plt.show()

# Differencing the dataset
data_diff = data.diff().dropna()

plt.figure(figsize=(10, 6))
plt.plot(data_diff)
plt.xlabel('Month')
plt.ylabel('Number of Passengers')
plt.title('Air Passengers (Differenced)')
plt.show()

# ACF plot
plot_acf(data_diff, lags=20)
plt.show()

# PACF plot
plot_pacf(data_diff, lags=20)
plt.show()

# Determine P, D, Q values using auto_arima
model = auto_arima(data, seasonal=True, m=12, trace=True)
print("\nBest Model Parameters:", model.get_params())
```
Output (P, D, and Q values):
Conclusion
In this tutorial, we learned how to determine the P, D, and Q values for an ARIMA model using Python. We used ACF and PACF plots, as well as the auto_arima function provided by the pmdarima library to automatically determine the optimal values for our model’s parameters.
Now, you can use these values to build an ARIMA model for your time series data and generate forecasts. Check the forecasts against held-out data before relying on them.