From: https://github.com/ksatola
Version: 0.1.0
%load_ext autoreload
%autoreload 2
import sys
sys.path.insert(0, '../src')
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from model import (
get_pm25_data_for_modelling,
)
from stats import (
adfuller_test
)
from plot import (
plot_stl,
plot_ts_corr
)
dfh = get_pm25_data_for_modelling('ts', 'h')
dfh.head()
#dfh.tail(24)
train_range_from_h = '2008-01-01 02:00:00'
train_range_to_h = '2018-12-30 23:00:00'
test_range_from_h = '2018-12-31 01:00:00'
test_range_to_h = None
dfd = get_pm25_data_for_modelling('ts', 'd')
dfd.head()
#dfd.tail(7)
train_range_from_d = '2008-01-01'
train_range_to_d = '2018-12-25'
test_range_from_d = '2018-12-26'
test_range_to_d = None
df = dfd.copy()
series = df['pm25'].resample(rule='D').mean()
series.index.freq
# STL for Daily Data
result = plot_stl(data=dfd['pm25'], period=365, low_pass=367)
series = result.resid
adfuller_test(series)
Time series modeling assumes a relationship between an observation and the observations that precede it. Previous observations in a time series are called lags, with the observation at the previous time step called lag=1, the observation two time steps ago lag=2, and so on. A useful way to explore the relationship between each observation and a lag of that observation is a scatter plot. Pandas has a built-in function for exactly this, lag_plot(). It plots the observation at time t on the x-axis and the observation at the next time step (t+1) on the y-axis.
Points clustered tightly around the diagonal suggest a stronger relationship, and more spread from the diagonal suggests a weaker one. A ball in the middle or points scattered across the whole plot suggest a weak relationship or none at all.
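As a minimal sketch of what a lag plot encodes, the snippet below pairs each value with its lag-1 value using shift() and measures how tightly the pairs line up. The series is synthetic (an AR(1)-like process with an assumed coefficient of 0.8), not the PM2.5 data, and the names demo and pairs are illustrative only.

```python
import numpy as np
import pandas as pd

# Synthetic AR(1)-like series (illustrative only, not the PM2.5 data)
rng = np.random.default_rng(0)
noise = rng.normal(size=500)
values = np.zeros(500)
for t in range(1, 500):
    values[t] = 0.8 * values[t - 1] + noise[t]
demo = pd.Series(values)

# Pair each observation with its lag-1 value; the first row has no lag and is dropped
pairs = pd.concat([demo.shift(1), demo], axis=1, keys=['t-1', 't']).dropna()

# The Pearson correlation of the pairs quantifies what the lag plot shows visually
lag1_corr = pairs['t-1'].corr(pairs['t'])
print(round(lag1_corr, 2))
```

A value near 1 corresponds to points hugging the diagonal, while a value near 0 corresponds to the shapeless ball described above.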
# Create a Lag scatter plot
from pandas.plotting import lag_plot
#series = result.observed
series = result.resid
plt.figure(figsize=(20, 20))
lag_plot(series, lag=1)
plt.show();
The plot created from running the example shows a relatively strong positive correlation between observations and their lag1 values.
series = result.resid
plt.figure(figsize=(20, 20))
lag_plot(series, lag=365)
plt.show();
The plot created from running the example shows a relatively strong negative correlation between observations and their lag365 values.
# Create multiple lag scatter plots
series = result.resid
values = pd.DataFrame(series.values)
lags = 8

# Build a frame holding the series and its first `lags` shifted copies
columns = [values]
for i in range(1, lags + 1):
    columns.append(values.shift(i))
dataframe = pd.concat(columns, axis=1)
dataframe.columns = ['t'] + ['t-' + str(i) for i in range(1, lags + 1)]
dataframe.head()

# Draw all panels on a single figure laid out as a 2x4 matrix
plt.figure(figsize=(40, 20))
for i in range(1, lags + 1):
    ax = plt.subplot(2, 4, i)
    ax.set_title('t vs t-' + str(i))
    plt.scatter(x=dataframe['t'].values, y=dataframe['t-' + str(i)].values)
plt.show();
# Create multiple lag scatter plots (around 365)
series = result.resid
values = pd.DataFrame(series.values)
start = 362
lags = 8

# Build a frame holding the series and its copies shifted by start..start+lags-1
columns = [values]
for i in range(start, start + lags):
    columns.append(values.shift(i))
dataframe = pd.concat(columns, axis=1)
dataframe.columns = ['t'] + ['t-' + str(i) for i in range(start, start + lags)]
dataframe.tail()

# Draw all panels on a single figure laid out as a 2x4 matrix (lags 362 to 369)
plt.figure(figsize=(40, 20))
for panel, i in enumerate(range(start, start + lags), start=1):
    ax = plt.subplot(2, 4, panel)
    ax.set_title('t vs t-' + str(i))
    plt.scatter(x=dataframe['t'].values, y=dataframe['t-' + str(i)].values)
plt.show();
We can quantify the strength and type of relationship between observations and their lags. In statistics, this is called correlation, and when calculated against lag values in time series, it is called autocorrelation (self-correlation). A correlation value calculated between two groups of numbers, such as observations and their lag=1 values, results in a number between -1 and 1. The sign of this number indicates a negative or positive correlation respectively. A value close to zero suggests a weak correlation, whereas a value closer to -1 or 1 indicates a strong correlation. Correlation values, called correlation coefficients, can be calculated for each observation and different lag values.
Once calculated, a plot can be created to help better understand how this relationship changes over the lag. This type of plot is called an autocorrelation plot, and Pandas provides it built in as the autocorrelation_plot() function. The plot shows lag along the x-axis and the correlation on the y-axis. Horizontal lines mark confidence bands: correlation values that fall outside those lines are statistically significant (meaningful).
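To make the coefficient and the significance band concrete, here is a small sketch using Series.autocorr() on two synthetic series: white noise and its cumulative sum (a random walk). The bound 1.96 / sqrt(n) is the usual approximate 95% band under a white-noise assumption; the plotting functions may compute their bands differently, so treat it as a rule of thumb.

```python
import numpy as np
import pandas as pd

# Two synthetic series (illustrative only, not the PM2.5 data)
rng = np.random.default_rng(1)
white = pd.Series(rng.normal(size=1000))  # no serial dependence
walk = white.cumsum()                     # strongly dependent on its own past

# Approximate 95% band for the autocorrelation of white noise
bound = 1.96 / np.sqrt(len(white))
print('approximate 95% bound:', round(bound, 3))

# Autocorrelation coefficient at a few lags for both series
for lag in (1, 2, 3):
    print(lag, round(white.autocorr(lag=lag), 3), round(walk.autocorr(lag=lag), 3))
```

The white-noise coefficients should mostly stay inside the band, while the random-walk coefficients sit far outside it at every lag shown.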
Identification of an MA model is often best done with the ACF rather than the PACF. For an MA model, the theoretical PACF does not shut off, but instead tapers toward 0 in some manner. A clearer pattern for an MA model is in the ACF. The ACF will have non-zero autocorrelations only at lags involved in the model.
More: https://online.stat.psu.edu/stat510/book/export/html/662
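The cut-off behaviour of the ACF described above can be checked on simulated data. The sketch below generates an MA(2) process with assumed coefficients 0.6 and 0.4 and computes the sample ACF with a small hand-written helper (sample_acf is a hypothetical name, not a library function), so that only NumPy is needed.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation for lags 0..max_lag (hypothetical helper)."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array([np.dot(x[:len(x) - k], x[k:]) / denom
                     for k in range(max_lag + 1)])

rng = np.random.default_rng(42)
n = 5000
e = rng.normal(size=n + 2)

# MA(2): x_t = e_t + 0.6 * e_{t-1} + 0.4 * e_{t-2}
x = e[2:] + 0.6 * e[1:-1] + 0.4 * e[:-2]

acf_vals = sample_acf(x, max_lag=6)
print(np.round(acf_vals, 2))
```

Lags 1 and 2 should show clearly non-zero autocorrelation, while lags 3 and beyond should hover near zero, which is the pattern that suggests q=2.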
from pandas.plotting import autocorrelation_plot
plt.figure(figsize=(40, 20))
autocorrelation_plot(series)
from statsmodels.graphics import tsaplots
fig = tsaplots.plot_acf(series, lags=40)
plt.show();
# MA
#q=0,1,2,3
fig = tsaplots.plot_acf(series, lags=366)
plt.show();
fig = tsaplots.plot_acf(series, lags=366*8)
plt.show();
Three lags stand out with stronger autocorrelation: lag 1, lag 2 and lag 3.
An autocorrelation (ACF) plot represents the autocorrelation of the series with lags of itself. A partial autocorrelation (PACF) plot represents the amount of correlation between a series and a lag of itself that is not explained by correlations at all lower-order lags. Ideally, we want no correlation between the series and lags of itself. Graphically speaking, we would like all the spikes to fall in the blue region.
Identification of an AR model is often best done with the PACF. For an AR model, the theoretical PACF “shuts off” past the order of the model. The phrase “shuts off” means that in theory the partial autocorrelations are equal to 0 beyond that point. Put another way, the number of non-zero partial autocorrelations gives the order of the AR model. By the “order of the model” we mean the most extreme lag of x that is used as a predictor.
More: https://online.stat.psu.edu/stat510/book/export/html/662
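The "shuts off" behaviour can likewise be illustrated on simulated data. The sketch below generates an AR(2) process with assumed coefficients 0.5 and 0.3 and estimates the PACF by fitting successive least-squares autoregressions, taking the last coefficient of each fit. The helper sample_pacf is a hypothetical name written for this illustration; library routines may use a different estimator.

```python
import numpy as np

def sample_pacf(x, max_lag):
    """PACF for lags 1..max_lag via successive AR(k) least-squares fits."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    out = []
    for k in range(1, max_lag + 1):
        # Columns are x_{t-1}, ..., x_{t-k}; the target is x_t
        X = np.column_stack([x[k - j:len(x) - j] for j in range(1, k + 1)])
        coef, *_ = np.linalg.lstsq(X, x[k:], rcond=None)
        out.append(coef[-1])  # last coefficient = partial autocorrelation at lag k
    return np.array(out)

rng = np.random.default_rng(7)
n = 5000
e = rng.normal(size=n)

# AR(2): x_t = 0.5 * x_{t-1} + 0.3 * x_{t-2} + e_t
x = np.zeros(n)
for t in range(2, n):
    x[t] = 0.5 * x[t - 1] + 0.3 * x[t - 2] + e[t]

pacf_vals = sample_pacf(x, max_lag=6)
print(np.round(pacf_vals, 2))
```

The first two partial autocorrelations should be clearly non-zero and the rest close to zero, which is the pattern that suggests p=2.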
fig = tsaplots.plot_pacf(series, lags=40)
plt.show();
# AR
#p=0,1,2
fig = tsaplots.plot_pacf(series, lags=366)
plt.show();
plot_ts_corr(series)