Introduction to Time Series Forecasting — Part 2 (ARIMA Models)

Most time series forecasting methods assume that the data is ‘stationary’, but real-world data often needs certain transformations before it can be modelled.

Shweta
Towards Data Science



In the first article, we looked at Simple Moving Average and Exponential Smoothing methods. In this article we will look at more complex methods like ARIMA and its extensions.

Method 3 : Seasonal Auto Regressive Integrated Moving Average (SARIMA)

Seasonal ARIMA is a variation of the ARIMA model: an extension that supports seasonality in the data. Let us first understand how the ARIMA model works.

ARIMA is a statistical model used for forecasting time series data. The ARIMA equation is a regression type equation in which the independent variables are lags of the dependent variable and/or lags of the forecast errors. The equation of the ARIMA model is given as :

y′(t) = c + ϕ1*y′(t−1) + ⋯ + ϕp*y′(t−p) + θ1*ε(t−1) + ⋯ + θq*ε(t−q) + ε(t)

There are three terms in the equation :

AR : Auto Regression : The time series is regressed on its own previous values i.e. y(t-1), y(t-2) etc. The order of the AR lag is denoted as p.

I : Integration : The time series is differenced to make it stationary. The order of differencing is denoted as d.

MA : Moving Average : The time series is regressed on the residuals of past observations i.e. error ε(t-1), error ε(t-2) etc. The order of the error lag is denoted as q.

In the above equation, y′(t) is the differenced series, ϕ1 is the coefficient of the first AR term, p is the order of the AR term, θ1 is the coefficient of the first MA term, q is the order of the MA term and ε(t) is the error.
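To make the equation concrete, here is a minimal sketch of fitting a non-seasonal ARIMA with statsmodels on a synthetic series; the series and the (1, 1, 1) order are illustrative assumptions, not the data used later in this article.

# Fitting a non-seasonal ARIMA(p, d, q) with statsmodels on a synthetic series
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

np.random.seed(42)
y = pd.Series(np.cumsum(np.random.normal(size=200)))   # synthetic non-stationary series

model = ARIMA(y, order=(1, 1, 1))    # p = 1 AR term, d = 1 difference, q = 1 MA term
fitted = model.fit()
print(fitted.summary())              # ar.L1 corresponds to phi1, ma.L1 to theta1 above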

ARIMA does not support seasonal data. For time series that has a significant seasonal pattern, seasonal ARIMA model is used.

How does it differ from ARIMA? In addition to the three parameters in ARIMA i.e. p, d, q, SARIMA has three more seasonal parameters (P, D, Q). The additional three parameters account for Autoregressive component (P), Differencing component (D) and Moving Average Component (Q) at the seasonal level.

It can be expressed as follows: ARIMA (p, d, q) (P, D, Q)m
Here m is the number of observations per season (in our case, m = 12). The seasonal components of the model are written in upper case and the non seasonal components in lower case.
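As an illustration of how this notation maps onto code, here is a small sketch using statsmodels’ SARIMAX on a synthetic monthly series; the orders and the data are assumptions for illustration only.

# Mapping ARIMA (p, d, q)(P, D, Q)m onto statsmodels' SARIMAX (illustrative orders and data)
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

idx = pd.date_range('2000-01-01', periods=120, freq='MS')   # monthly data, so m = 12
series = pd.Series(10 + 3 * np.sin(2 * np.pi * idx.month / 12)
                   + np.random.normal(scale=0.5, size=120), index=idx)

sarima = SARIMAX(series,
                 order=(1, 1, 1),                # non-seasonal (p, d, q)
                 seasonal_order=(1, 1, 1, 12))   # seasonal (P, D, Q, m)
result = sarima.fit(disp=False)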

Before we get into the details of SARIMA, let’s understand the terms ‘Autocorrelation’ and ‘Partial Autocorrelation’.

Auto-Correlation Function (ACF) : Auto-correlation of lag k is the correlation between Y(t) and Y(t-k), measured at different k lags. For lag 1, Auto correlation is measured between Y(t) and Y(t-1), similarly for lag 2, Auto correlation is measured between Y(t) and Y(t-2) values. A plot of auto-correlation for different values of k is called an auto-correlation plot or correlogram.

Partial Auto-Correlation (PACF) : Partial autocorrelation of lag k is the correlation between Y(t) and Y(t-k) after the effect of all intermediate values (Y(t-1), Y(t-2), …, Y(t-k+1)) has been removed from both Y(t) and Y(t-k). For example, the partial autocorrelation between Y(t) and Y(t-1) is the same as their autocorrelation, because there are no intermediate terms between them. The partial autocorrelation between Y(t) and Y(t-2) removes the effect of Y(t-1) from both Y(t) and Y(t-2). A plot of the partial autocorrelation for different values of k is called a partial autocorrelation plot.
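These definitions can be made concrete with a few lines of code. The sketch below computes a lag-k autocorrelation by hand and compares it with the acf and pacf estimates from statsmodels; the series is synthetic, for illustration only.

# Lag-k autocorrelation "by hand" versus the statsmodels estimators (synthetic data)
import numpy as np
from statsmodels.tsa.stattools import acf, pacf

np.random.seed(0)
y = np.cumsum(np.random.normal(size=300))            # an example series

k = 2
manual = np.corrcoef(y[k:], y[:-k])[0, 1]            # correlation between Y(t) and Y(t-k)
print(manual, acf(y, nlags=5)[k])                    # close, up to the estimator's normalisation
print(pacf(y, nlags=5)[k])                           # lag-k PACF: intermediate lags partialled out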

Before we use ARIMA or any of its extension, we need to ensure that the time series is stationary.

What do you mean by stationarity of a time series?

A time series is said to be stationary if it has the following three properties:

  1. It has a constant mean i.e. mean remains constant over time
  2. It has a constant variance i.e. variance remains constant over time
  3. It has a constant covariance structure i.e. the covariance between two time periods depends only on the lag between them and not on the time at which it is computed. This means that the covariance between the series at t = 2 and t = 4 should be roughly the same as the covariance between the series at t = 7 and t = 9.

A stationary time series will have the same statistical properties i.e. the same mean, variance and covariance no matter at what point we measure them; these properties are therefore time invariant. The series will have no predictable patterns in the long term, and its time plot will be roughly horizontal with constant mean and variance.
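A quick, informal way to eyeball these properties is to plot rolling statistics; below is a minimal sketch on a synthetic series (this is not a formal test).

# Rolling mean and standard deviation of a (stationary) synthetic series
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(1)
stationary = pd.Series(np.random.normal(size=500))   # white-noise-like series

plt.figure(figsize=(15, 4))
plt.plot(stationary, alpha=0.4, label='series')
plt.plot(stationary.rolling(window=50).mean(), label='rolling mean')
plt.plot(stationary.rolling(window=50).std(), label='rolling std')
plt.legend();
plt.title('Rolling statistics stay roughly flat for a stationary series');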

Why do we need a stationary time series? If the series is not stationary, we can study its behavior only for the period under observation: each period will have its own distinct behavior, and it is not possible to predict or generalize for future time periods.

A stationary time series will tend to return to its mean value and the fluctuations around this mean will have a constant magnitude. Thus, a stationary time series will not drift too much from its mean value because of the finite variance.

One example of a stationary time series is ‘White Noise’. Because of its inherent stationarity, it has no predictable pattern in the long term; it is memoryless.

Here is what a White Noise series looks like:

# Plot for White Noise with mean 0 and standard deviation 0.8
import numpy as np
import matplotlib.pyplot as plt

wnoise = np.random.normal(loc=0, scale=0.8, size=800)
plt.figure(figsize=(15, 4))
plt.plot(wnoise);
plt.title('Distribution of White Noise');

As you can see from the plot above, the series fluctuates randomly around a constant mean. It is difficult to predict the next movement of the series. If we plot the autocorrelation of this series, we will observe essentially zero autocorrelation. This means that the correlation between the series at any time t and its lagged values is zero.

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

acr = plot_acf(wnoise, lags=50)

In the ACF (Autocorrelation) plot above, you can see that all lags lie within the blue shaded band. This indicates that there is almost zero correlation between observations at different lags. If one or more large spikes fall outside this band, or more than 5% of spikes lie outside it, then we can infer that the series is not ‘White Noise’.

How do we check for stationarity?

Two popular tests used to check for stationarity are Augmented Dickey Fuller Test (ADF) and Kwiatkowski-Phillips-Schmidt-Shin Test (KPSS). Let us look at each test in detail.

1. Augmented Dickey Fuller Test :

This statistical test belongs to the category called the Unit Root test. It checks whether the time series has a unit root and is thus non stationary. When we conduct this test to check for stationarity, we check for unit root.

Null Hypothesis Ho : There is a unit root i.e. the time series is non stationary
Alternate Hypothesis Ha : There is no unit root i.e. the time series is stationary

What do you mean by Unit Root?

There are two kinds of non stationary stochastic time series that are observed :

  1. Random Walk Model without Drift
  2. Random Walk Model with Drift

In Random Walk Without Drift, the mean of the series is constant and equal to its initial or starting value, but the variance increases with time; thus the series is non stationary. Stock prices are usually considered to follow a Random Walk. Data exhibiting a Random Walk shows long periods of upward or downward trend and sudden, unpredictable changes in direction.
Thus, the forecast for this model is simply the last observation, as future movements are unpredictable and equally likely to go up or down.

The value of the time series Y at time t is equal to its value at time t-1 plus a random shock i.e. Y(t) = Y(t-1) + u(t), where u(t) is the error term.
The first difference of the Random Walk Without Drift is a stationary time series. Why? Because from the above equation, Y(t) - Y(t-1) = u(t), and the error term is white noise: a stationary series with mean 0 and constant variance.

In Random Walk With Drift, both the mean of the series and the variance increase with time. Thus, the series is not stationary. Here the value of time series at time t is equal to its value at time (t-1) plus a drift parameter alpha plus a random shock i.e. Y(t) = Y(t-1) + alpha + u(t). The drift parameter alpha is the average increase in the series from one period to the next.
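The two random walk variants are easy to simulate. The sketch below generates both and also shows that first-differencing the driftless random walk recovers the white-noise shocks (synthetic data, for illustration).

# Simulate the two non-stationary processes described above (synthetic data)
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(7)
n, alpha = 500, 0.2
shocks = np.random.normal(size=n)            # white-noise error terms u(t)

rw = np.cumsum(shocks)                       # Y(t) = Y(t-1) + u(t): random walk without drift
rw_drift = np.cumsum(alpha + shocks)         # Y(t) = Y(t-1) + alpha + u(t): random walk with drift

plt.figure(figsize=(15, 4))
plt.plot(rw, label='Random walk (no drift)')
plt.plot(rw_drift, label='Random walk with drift')
plt.legend();

first_diff = np.diff(rw)                     # recovers the white-noise shocks u(t)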

A Random Walk Model is known as a unit root process.

It is written as : Y(t) = delta * Y(t-1) + u(t)

If delta equals 1, the series has a unit root (a Random Walk) and is non stationary. If delta is greater than 1, the series is explosive and again non stationary. If the absolute value of delta is less than 1, we conclude that the series is stationary.

Please note that non stationarity could also be due to a deterministic trend rather than a unit root. I have not covered that case in this article.

Let us now look at the KPSS test.

2. Kwiatkowski-Phillips-Schmidt-Shin (KPSS Test) :

This is another, somewhat less popular, test for stationarity. Here the Null and Alternate hypotheses are the opposite of those in the ADF test.

Null Hypothesis: The process is trend stationary.

Alternate Hypothesis: The series has a unit root (series is not stationary).

If we use both the tests to confirm stationarity, one of the four scenarios are possible :

Scenario I — Both tests conclude stationarity. Therefore, the series is stationary.

Scenario II — Both tests conclude non stationarity. Therefore, the series is non stationary.

Scenario III — ADF test concludes stationarity but KPSS test concludes non stationarity. This indicates that one needs to use differencing to make the series stationary. Here, a new series is created where the value of the observation at time t is given by the difference between the actual value at time t and the value at time t-1. Thus,

Value(t) of the differenced series = observation(t)-observation(t-1)

Scenario IV — ADF test concludes non stationarity but KPSS test confirms stationarity. This indicates that one needs to use detrending to make the series stationary. Common methods include using a log transformation or a square root transformation of the original series.
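A minimal sketch of these two remedies on a synthetic trending series: differencing (Scenario III) and a log transformation as mentioned for Scenario IV.

# Differencing and a log transform on a synthetic trending series
import numpy as np
import pandas as pd

np.random.seed(2)
y = pd.Series(np.arange(10, 110) + np.random.normal(scale=2, size=100))   # trending example

diffed = y.diff()      # value(t) = observation(t) - observation(t-1); the first value is NaN
logged = np.log(y)     # a common transformation before detrending (Scenario IV)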

Once the series has been made stationary, the next step is to build the ARIMA model. Two approaches are used to build the ARIMA Model.

Approach 1: Select model order manually by looking at the ACF and PACF plots of the differenced data and determine the optimal values of p and q. Build a number of models with different values of p and q and select the final model i.e. the model with the lowest AIC. (AIC or Akaike’s Information Criterion is a useful model selection tool. Given a set of models for time series data, the best model is the one that has the lowest AIC. It balances both the goodness of fit and overfitting by penalizing models that have higher number of parameters.)
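Approach 1 ends by comparing candidate models on AIC. Below is a minimal sketch of that comparison step using statsmodels’ SARIMAX; the synthetic stand-in series and the searched orders are illustrative assumptions, not the article’s data.

# A small AIC-based grid search over candidate (p, q) orders with d fixed at 1
import itertools
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

np.random.seed(3)
series = pd.Series(np.cumsum(np.random.normal(size=200)))   # stand-in for the real series

best_aic, best_order = float('inf'), None
for p, q in itertools.product(range(3), range(3)):          # a small range suggested by ACF/PACF
    try:
        res = SARIMAX(series, order=(p, 1, q)).fit(disp=False)
    except Exception:
        continue                                            # skip orders that fail to converge
    if res.aic < best_aic:
        best_aic, best_order = res.aic, (p, 1, q)

print(best_order, best_aic)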

Approach 2: Use Auto ARIMA (auto.arima() in R, or auto_arima() from the pmdarima package in Python) to find the best model for the given time series.

In reality, a mix of both approaches is followed. Using approach 1, a range of p and q is determined. In the next step, the Auto Arima function is used with the ranges identified in step 1.

There are certain guidelines used to determine the optimal values of p, d and q in the ARIMA model as mentioned in Approach 1. These are as follows:

Guideline 1: Identifying the order of differencing in an ARIMA model:

The first and most important step in fitting an ARIMA model is to decide on the order of differencing needed to make the series stationary. The right approach is to start with the lowest order of differencing, i.e. d = 1. Usually this will make the series stationary. However, if there is still a significant trend, or if the autocorrelations are significantly positive out to a high number of lags (10 or more), then the series needs a second order of differencing. Differencing of order greater than 2 is never preferred.

If the autocorrelation of lag 1 of the differenced series is zero or negative, or if the autocorrelations are all small and without pattern, then the series does not require more differencing. If the lag-1 autocorrelation is more negative than -0.5, it indicates the series may have been over differenced.
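As a quick illustration of this rule of thumb, the sketch below compares the lag-1 autocorrelation after one and after two rounds of differencing, using statsmodels’ acf on a synthetic random-walk-like series (an assumption for illustration).

# Lag-1 autocorrelation of a once- and twice-differenced synthetic random walk
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import acf

np.random.seed(5)
y = pd.Series(np.cumsum(np.random.normal(size=300)))        # random-walk-like series

lag1_d1 = acf(y.diff().dropna(), nlags=1)[1]                # close to zero: d = 1 is enough
lag1_d2 = acf(y.diff().diff().dropna(), nlags=1)[1]         # typically near -0.5: over-differenced
print(lag1_d1, lag1_d2)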

Guideline 2 : Identifying the AR or MA terms in the ARIMA model:

Once we have a stationary series, the next step is to determine the AR and MA terms in the ARIMA model. The intuition is as follows:

  1. If the PACF of the differenced series shows a sharp cut-off and/or the lag-1 autocorrelation is positive (this indicates an ‘under-differenced’ series) while the ACF decays more slowly, then consider adding an AR term to the model. The number of AR terms will depend on the lag at which the PACF cuts off.
  2. If the ACF of the differenced series shows a sharp cut-off and/or the lag-1 autocorrelation is negative (this indicates an ‘over-differenced’ series) while the PACF decays more slowly, then consider adding an MA term to the model. Here, the autocorrelation pattern is better explained by the MA terms. The number of MA terms will depend on the lag at which the ACF cuts off.
  3. An AR term is associated with under-differencing or positive autocorrelation at lag 1, while an MA term is associated with over-differencing or negative autocorrelation at lag 1.

Guideline 3 : Which ARIMA model to finalize ?

Choose a model with a lower level of differencing and with either AR or MA terms. However, mixed models with both AR and MA terms can sometimes be the best fit.

If the sum of the AR coefficients equals 1 or is almost 1 (i.e. there is a unit root), then increase the order of differencing and reduce the number of AR terms.

If the sum of the MA coefficients equals 1 or is almost 1, then decrease the order of differencing and increase the number of MA terms.

A model in which the sum of the AR coefficients or the sum of the MA coefficients equals 1 is said to have a ‘unit root’. The residuals of such models are not white noise.

In addition to p, d, q in the ARIMA models, seasonal models like SARIMA have additional seasonal terms P, D, Q. These can also be determined from the ACF and PACF plots.

Guidelines for Seasonal ARIMA (SARIMA) :

Guideline 4: Identify the seasonal and non seasonal order of differencing:

The most important step in SARIMA model is to decide whether or not a seasonal differencing is required. If the seasonal pattern is very strong, one should use an order of seasonal differencing. If the series is still not stationary after applying the seasonal difference, we can consider applying non seasonal difference. However, never use more than one order of seasonal differencing and the sum of the seasonal and non seasonal differencing should not exceed 2.

The order in which we apply seasonal and non seasonal differencing does not matter. However, if the data is highly seasonal, it is preferable to perform seasonal differencing first, followed by non seasonal differencing.

Guideline 5: Identify the seasonal and non seasonal AR and MA terms :

If the autocorrelation at the seasonal lag k is positive (in our case k=12,24,36) add a seasonal AR term (P) to the model. If the autocorrelation at the seasonal lag is negative, add a seasonal MA term (Q) to the model. Care should be taken to not include both seasonal AR and MA terms in the same model and the order of each should not be more than 1.

These are the guidelines advised by Robert Nau, Duke University. Please refer to the following link to get detailed understanding of the above guidelines. https://people.duke.edu/~rnau/411arim.htm

Let me summarize the key steps in ARIMA/SARIMA modelling:

  1. Plot the time series data and observe the trend and seasonality
  2. Use statistical tests like ADF and KPSS to check for stationarity
  3. If the series is not stationary, use transformations like differencing or detrending. For seasonal data, one may require seasonal differencing in addition to non seasonal differencing.
  4. Once the order of differencing is determined, we need to decide on the optimal AR and MA terms.
  5. Use the ACF and PACF plots to get a range within which the parameters p,P,q,Q should lie. Use auto.arima and specify the range of these parameters.
  6. Check for the assumptions of the regression model i.e. normality of errors and no autocorrelation between the errors.

Let us now go through these steps one at a time.

The plot of the data is as follows:

# Plot the full Energy_Production series
plt.figure(figsize=(15,5))
plt.ylabel('Energy Production')
plt.title('Trend of Energy Production')
plt.plot(df['Energy_Production'],'b-');

Here we will drop a section of the data, i.e. the observations from 1940 to 1964, because this very old data may introduce unnecessary noise.

df1 = df.copy()
df2 = df1['Energy_Production'][312:]
df2 = pd.DataFrame(df2)
df2.head()

Date         Energy_Production
1965-01-01   29.4598
1965-02-01   28.7989
1965-03-01   28.8497
1965-04-01   28.1889
1965-05-01   27.7059

We can now go ahead and run the statistical tests to check for stationarity.

Test 1: Augmented Dickey Fuller Test:

from statsmodels.tsa.stattools import adfuller

X = df2['Energy_Production'].values
result = adfuller(X)
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
    print('\t%s: %.3f' % (key, value))

ADF Statistic: -2.231551
p-value: 0.194994
Critical Values:
	1%: -3.440
	5%: -2.866
	10%: -2.569

Here, the p value is greater than 0.05, thus we fail to reject the Null Hypothesis. Therefore, the series has a unit root and is non stationary.

Test 2 : KPSS Test

from statsmodels.tsa.stattools import kpss

def kpss_test(timeseries):
    print("Results of KPSS Test:")
    kpsstest = kpss(timeseries, regression="c")
    kpss_output = pd.Series(
        kpsstest[0:3], index=["Test Statistic", "p-value", "Lags Used"]
    )
    for key, value in kpsstest[3].items():
        kpss_output["Critical Value (%s)" % key] = value
    print(kpss_output)

kpss_test(df2['Energy_Production'].values)

Results of KPSS Test:
Test Statistic 3.273962
p-value 0.010000
Lags Used 20.000000
Critical Value (10%) 0.347000
Critical Value (5%) 0.463000
Critical Value (2.5%) 0.574000
Critical Value (1%) 0.739000
dtype: float64

Here, the p value is less than 0.05. However, the Null Hypothesis of the KPSS test is the opposite of that of the ADF test. Thus, here we reject the Null Hypothesis of stationarity and conclude that the series is non stationary.

This corresponds to Scenario II, where both tests conclude that the given series is non stationary. We will now take steps to make the series stationary.

Given the presence of very high seasonality, we will first use seasonal differencing. Given m=12, we need to difference the observation at time t with its lag at t-12. If we do this, the first 12 observations of the differenced series will take NaN values. Here, the order of seasonal differencing D is 1.

df2['Energy_Production_diff1'] = df2['Energy_Production'] - df2['Energy_Production'].shift(12)
df2[['Energy_Production','Energy_Production_diff1']].head(13)

Date         Energy_Production   Energy_Production_diff1
1965-01-01   29.4598             NaN
1965-02-01   28.7989             NaN
1965-03-01   28.8497             NaN
1965-04-01   28.1889             NaN
1965-05-01   27.7059             NaN
1965-06-01   28.6972             NaN
1965-07-01   29.9936             NaN
1965-08-01   31.0103             NaN
1965-09-01   30.8578             NaN
1965-10-01   29.3581             NaN
1965-11-01   28.5701             NaN
1965-12-01   30.0190             NaN
1966-01-01   31.2391             1.7793

We will remove the NaN observations and plot the data series to check if differencing made the series stationary.

plt.figure(figsize=(18,5));
plt.xlabel('Time');
plt.ylabel('Energy Production');
plt.title('Differenced Time Series of Energy Production');
plt.plot(df2['Energy_Production_diff1']);

The series looks nearly stationary, though the variance is not constant. We will run the two tests again to confirm stationarity.

# ADF Test:
X = df2['Energy_Production_diff1'].dropna().values
result = adfuller(X)
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
    print('\t%s: %.3f' % (key, value))

ADF Statistic: -7.448126
p-value: 0.000000
Critical Values:
	1%: -3.440
	5%: -2.866
	10%: -2.569

Here, the p value is less than 0.05. Thus, we will reject the Null Hypothesis and conclude that the series is stationary.

# KPSS Test 
kpss_test(X)
Results of KPSS Test:
Test Statistic 0.575462
p-value 0.024867

Lags Used 20.000000
Critical Value (10%) 0.347000
Critical Value (5%) 0.463000
Critical Value (2.5%) 0.574000
Critical Value (1%) 0.739000
dtype: float64

The p value of the KPSS test is less than 0.05. Thus, we will reject the null hypothesis that the series is stationary. The KPSS test concludes that the series is non stationary.

This outcome corresponds to Scenario III mentioned above: the two tests give contradictory results, with ADF concluding that the series is stationary and KPSS concluding otherwise. This indicates that the series still needs differencing. If we plot the ACF and PACF of the differenced series, we can see strong positive autocorrelation at the initial lags.

acf_plot = plot_acf(df2['Energy_Production_diff1'].dropna(), lags= 30)
pacf_plot = plot_pacf(df2['Energy_Production_diff1'].dropna(), lags= 30)

We will thus go ahead and difference the series once again to make it stationary. This will be non seasonal difference of order 1 i.e. d = 1

df2['Energy_Production_diff121'] = df2['Energy_Production_diff1'] - df2['Energy_Production_diff1'].shift(1)

We will use the two statistical tests to check and confirm stationarity.

# ADF Test
X = df2['Energy_Production_diff121'].dropna().values
result = adfuller(X)
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
    print('\t%s: %.3f' % (key, value))

ADF Statistic: -8.939993
p-value: 0.000000
Critical Values:
	1%: -3.441
	5%: -2.866
	10%: -2.569
# KPSS Test:
kpss_test(X)
Results of KPSS Test:
Test Statistic 0.017791
p-value 0.100000

Lags Used 20.000000
Critical Value (10%) 0.347000
Critical Value (5%) 0.463000
Critical Value (2.5%) 0.574000
Critical Value (1%) 0.739000
dtype: float64

At this step, both the tests confirm that the series is stationary. Let us plot the ACF and PACF plots and also the plot for the overall differenced series.

acf_plot = plot_acf(df2['Energy_Production_diff121'].dropna(), lags= 30)
pacf_plot = plot_pacf(df2['Energy_Production_diff121'].dropna(), lags= 30)

Both plots still show some significant lags, but we will go ahead and treat the series as stationary, as confirmed by the ADF and KPSS tests. The plot of the differenced series is as follows:

plt.figure(figsize=(18,5));
plt.xlabel('Time');
plt.ylabel('Energy Production');
plt.title('Differenced Time Series of Energy Production');
plt.plot(df2['Energy_Production_diff121']);

This plot looks very similar to the White Noise series plotted earlier. We can infer from it that the series is now stationary.

So far, we have used differencing twice i.e. one seasonal differencing of order D=1, and one non seasonal differencing of order d=1. The next step is to determine the AR and MA parameters. In the ACF plot, there are significant negative spikes at lag 1 and at lags 12 and 24. Thus, the MA parameter q should be 1 and the seasonal MA parameter Q should be 2 (Guidelines 2 and 5). Given that the PACF plot also shows negative spikes at lags 1 and 12, as per the guidelines we will not include AR or seasonal AR terms.

Thus, the final order of the model seems to be (0,1,1)(0,1,2)12.
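Before moving on, one could sanity-check this manually identified order by fitting it directly and noting its AIC; a minimal sketch, assuming df2 from above and statsmodels’ SARIMAX (which we formally import later in the article).

# Fit the manually identified candidate order and note its AIC (sketch)
from statsmodels.tsa.statespace.sarimax import SARIMAX

candidate = SARIMAX(df2['Energy_Production'],
                    order=(0, 1, 1),
                    seasonal_order=(0, 1, 2, 12)).fit(disp=False)
print(candidate.aic)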

It is extremely difficult to arrive at the best model on the first attempt by following Approach 1 and reading the ACF and PACF plots, because these plots are easy to misread. Therefore, the general methodology is to combine both approaches: use the first approach to determine a range for the non seasonal parameters p, d, q and the seasonal parameters P, D, Q, and then pass these ranges as inputs to the auto_arima function.

Let us see how it is implemented.

We will first install the library ‘pmdarima’ and then import it.

import pmdarima as pm
from pmdarima import auto_arima

# Ignore harmless warnings
import warnings
warnings.filterwarnings("ignore")

df3 = df2['Energy_Production']

We will define a range for the values of p, q, P, Q derived from the ACF and PACF plots. Here, we have taken d = D = 1 (both the ADF and KPSS tests are in agreement for this case). We initialize the parameters p, P, q, Q at 0, with maximum values of 2. As mentioned in the guidelines, we should not build a model with a higher order of AR, MA, SAR, SMA parameters. Given that the data is seasonal, m is 12 and seasonal=True.

model = auto_arima(df3, start_p=0, start_q=0,
                   max_p=2, max_q=2, m=12,
                   start_P=0, start_Q=0,
                   max_P=2, max_Q=2,
                   seasonal=True,
                   d=1, D=1, trace=True,
                   error_action='ignore',
                   suppress_warnings=True,
                   stepwise=True)

The above code will take some time to run as it evaluates a number of combinations of model orders. The output of the code is the model with the lowest AIC.

Performing stepwise search to minimize aic
ARIMA(0,1,0)(0,1,0)[12] : AIC=3281.487, Time=0.08 sec
ARIMA(1,1,0)(1,1,0)[12] : AIC=3171.654, Time=0.23 sec
ARIMA(0,1,1)(0,1,1)[12] : AIC=3018.258, Time=0.58 sec
ARIMA(0,1,1)(0,1,0)[12] : AIC=3218.537, Time=0.08 sec
ARIMA(0,1,1)(1,1,1)[12] : AIC=3015.997, Time=0.92 sec
ARIMA(0,1,1)(1,1,0)[12] : AIC=3141.548, Time=0.28 sec
ARIMA(0,1,1)(2,1,1)[12] : AIC=2994.828, Time=1.94 sec
ARIMA(0,1,1)(2,1,0)[12] : AIC=3058.224, Time=0.81 sec
ARIMA(0,1,1)(2,1,2)[12] : AIC=2977.369, Time=5.76 sec
ARIMA(0,1,1)(1,1,2)[12] : AIC=3010.587, Time=3.48 sec
ARIMA(0,1,0)(2,1,2)[12] : AIC=3047.276, Time=4.01 sec
ARIMA(1,1,1)(2,1,2)[12] : AIC=2912.252, Time=7.25 sec
ARIMA(1,1,1)(1,1,2)[12] : AIC=2931.644, Time=4.72 sec
ARIMA(1,1,1)(2,1,1)[12] : AIC=2918.937, Time=3.15 sec
ARIMA(1,1,1)(1,1,1)[12] : AIC=2937.865, Time=1.59 sec
ARIMA(1,1,0)(2,1,2)[12] : AIC=3010.240, Time=6.52 sec
ARIMA(2,1,1)(2,1,2)[12] : AIC=2914.225, Time=9.10 sec
ARIMA(1,1,2)(2,1,2)[12] : AIC=2914.220, Time=7.80 sec
ARIMA(0,1,2)(2,1,2)[12] : AIC=2930.445, Time=5.22 sec
ARIMA(2,1,0)(2,1,2)[12] : AIC=2980.502, Time=6.62 sec
ARIMA(2,1,2)(2,1,2)[12] : AIC=2916.073, Time=16.42 sec
ARIMA(1,1,1)(2,1,2)[12] intercept : AIC=2914.035, Time=21.91 sec

Best model: ARIMA(1,1,1)(2,1,2)[12]
Total fit time: 108.511 seconds

Let us look at the summary of the model. Auto Arima gives (1,1,1)(2,1,2)12 as the best model i.e. the model with the least AIC.

print(model.summary())

Statespace Model Results
====================================================================
Dep. Variable: y                        No. Observations: 677
Model: SARIMAX(1, 1, 1)x(2, 1, 2, 12)   Log Likelihood: -1449.126
Date: Wed, 28 Jul 2021                  AIC: 2912.252
Time: 17:14:03                          BIC: 2943.740
Sample: 0 - 677                         HQIC: 2924.454
Covariance Type: opg
====================================================================
coef std err z P>|z|
ar.L1 0.5133 0.034 15.039 0.000
ma.L1 -0.9304 0.018 -52.457 0.000
ar.S.L12 0.6594 0.090 7.340 0.000
ar.S.L24 -0.3230 0.041 -7.930 0.000
ma.S.L12 -1.3218 0.093 -14.195 0.000
ma.S.L24 0.5492 0.075 7.350 0.000
sigma2 4.5143 0.186 24.224 0.000
====================================================================
Ljung-Box (Q): 52.32 Jarque-Bera (JB): 112.23
Prob(Q): 0.09 Prob(JB):0.00
Heteroskedasticity (H):4.73 Skew: 0.07
Prob(H) (two-sided): 0.00 Kurtosis: 5.01
====================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

Certain checks need to be done: the AR and MA coefficients are less than 1 in absolute value, the sum of the coefficients of the two seasonal AR terms is less than one, and the sum of the coefficients of the two seasonal MA terms is also less than one. Each of these terms is significant, with p values less than 0.05. This seems to be a good model. Thus, the SARIMA model (1,1,1)(2,1,2)12 is the final selected model. (This is close to the (0,1,1)(0,1,2)12 model that we identified from the plots.)

We will now divide the data into train and test and calculate predictions for the test data.

# hold out the last 65 of the 677 observations as the test set
train = df3[:612]
test = df3[612:]
from statsmodels.tsa.statespace.sarimax import SARIMAX
final_model = SARIMAX(train,order=(1,1,1),seasonal_order=(2,1,2,12))
result = final_model.fit()
print(result.summary())

The summary is as follows:

Statespace Model Results                                 
====================================================================
Dep. Variable: Energy_Production        No. Observations: 612
Model: SARIMAX(1, 1, 1)x(2, 1, 2, 12)   Log Likelihood: -1247.862
Date: Wed, 28 Jul 2021                  AIC: 2509.724
Time: 17:14:32                          BIC: 2540.491
Sample: 01-01-1965 - 12-01-2015         HQIC: 2521.702

====================================================================
coef std err z P>|z|
--------------------------------------------------------------------
ar.L1 0.5832 0.038 15.368 0.000
ma.L1 -0.9411 0.017 -55.676 0.000
ar.S.L12 0.6723 0.167 4.020 0.000
ar.S.L24 -0.2318 0.047 -4.908 0.000
ma.S.L12 -1.3065 0.169 -7.738 0.000
ma.S.L24 0.5211 0.125 4.175 0.000
sigma2 3.7119 0.161 23.044 0.000
===================================================================================
Ljung-Box (Q): 61.31 Jarque-Bera (JB): 109.28
Prob(Q): 0.02 Prob(JB): 0.00
Heteroskedasticity (H): 4.62 Skew: -0.11
Prob(H) (two-sided): 0.00 Kurtosis: 5.08
====================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

We will also run the model diagnostics to check the assumptions about the residuals. If the errors are normally distributed and uncorrelated with each other, then we have a good model.

result.plot_diagnostics(figsize=(15, 12));

We can see here that the residuals are white noise and they are also normally distributed. We can thus go ahead with this model.
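Beyond the diagnostic plots, the lack of autocorrelation in the residuals can also be checked numerically with a Ljung-Box test; a small sketch using statsmodels’ acorr_ljungbox on the fitted model’s residuals:

# Ljung-Box test on the residuals: large p-values suggest no remaining autocorrelation
from statsmodels.stats.diagnostic import acorr_ljungbox

print(acorr_ljungbox(result.resid, lags=[12, 24]))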

We will now predict for the test data and then check for the accuracy.

# Obtain predicted values
start = len(train)
end = len(train) + len(test) - 1
predictions = result.predict(start=start, end=end, dynamic=False, typ='levels').rename('SARIMA(1,1,1)(2,1,2,12) Predictions')

# Plot predictions against known values
title = 'Energy Production - Actual vs. Predicted'
ylabel = 'Production'
xlabel = 'Date'

ax = test.plot(legend=True, figsize=(12,6), title=title)
predictions.plot(legend=True)
ax.autoscale(axis='x', tight=True)
ax.set(xlabel=xlabel, ylabel=ylabel);

We can see that the predictions closely follow the actual values. If we calculate the MAPE and RMSE of the model, we get values of 3.09 and 4 respectively. We can compare these values with the average value of the test series to check whether the magnitude of the error is acceptable; in this case, it is. (We cannot compare these metrics with those of the averaging and smoothing methods, as the underlying data is different.)
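For reference, a minimal sketch of how MAPE and RMSE can be computed from the test series and the predictions above (assuming the two share the same date index):

# Compute MAPE and RMSE of the predictions against the test set
import numpy as np

mape = np.mean(np.abs((test - predictions) / test)) * 100   # mean absolute percentage error (%)
rmse = np.sqrt(np.mean((test - predictions) ** 2))          # root mean squared error
print(f'MAPE: {mape:.2f}%, RMSE: {rmse:.2f}')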

This brings us to the end of Part 2 of Time Series Forecasting. This is a pretty long article and there is a lot to absorb here. Time series forecasting is a science, and this is just the tip of the iceberg; there is much more to explore.

Below are the key takeaways from this article in brief.

  1. ARIMA models are basically regression models where the independent variables are lagged values of the series and/or the lagged forecast errors.
  2. One of the necessary conditions for using an ARIMA model for forecasting is that the series should be stationary.
  3. A stationary series is one which has a constant mean, constant variance and the covariance is roughly the same for observations at equal lags.
  4. The stationarity of time series is checked using Augmented Dickey Fuller Test and KPSS test. In case the series is not stationary, differencing is used. There could be only one differencing (i.e. either D or d is used) or both. If we use both the seasonal and non seasonal differencing, it is advised to use seasonal differencing first if the data series is highly seasonal.
  5. The parameters AR(p), MA(q), SAR(P), SMA(Q) are identified using the ACF and PACF plots. There are certain guidelines that help in roughly estimating the order of these parameters. However, the final model is built using the auto_arima() function.
  6. The final model should satisfy certain assumptions like normality of errors and zero autocorrelation between the residuals i.e. the residuals should be white noise. The p values of all the parameters in the final model should be significant.

There are a number of links that are extremely helpful in understanding this topic. Below are few of them.

  1. https://otexts.com/fpp2/index.html — This is the link to the well-known online book ‘Forecasting: Principles and Practice’ by Rob J Hyndman and George Athanasopoulos
  2. https://people.duke.edu/~rnau/411home.htm — Another extremely helpful source for understanding forecasting. I referred to these pages a number of times.
  3. https://online.stat.psu.edu/stat510/lesson/4/4.1

Hope you find this article useful. As always, let me know if there are any questions or corrections or suggestions.
