From: https://github.com/ksatola
Version: 0.1.0
%load_ext autoreload
%autoreload 2
import sys
sys.path.insert(0, '../src')
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from model import (
get_pm25_data_for_modelling,
)
from stats import (
adfuller_test
)
from plot import (
plot_stl,
plot_decompose,
plot_before_after,
fit_theoretical_dist_and_plot
)
Time Series Decomposition
Time series decomposition involves thinking of a series as a combination of level, trend, seasonality, and noise components. Decomposition provides a useful abstract model for thinking about time series generally and for better understanding problems during time series analysis and forecasting.
A useful abstraction for selecting forecasting methods is to break a time series down into systematic and unsystematic components:
A given time series is thought to consist of three systematic components (level, trend, and seasonality) and one non-systematic component called noise. These components are defined as follows:
A series is thought to be an aggregate or combination of these four components. All series have a level and noise. The trend and seasonality components are optional.
There are two types of time series decomposition model, linear (additive) and nonlinear (multiplicative):
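In the linear (additive) case the components are summed, y(t) = level + trend + seasonality + noise; in the nonlinear (multiplicative) case they are multiplied, y(t) = level * trend * seasonality * noise. As a minimal sketch (assuming statsmodels' seasonal_decompose, which the plot_decompose helper used below may wrap), both models can be fitted to a daily PM2.5 series:
from statsmodels.tsa.seasonal import seasonal_decompose
# Daily PM2.5 series loaded with the project helper imported above
daily_pm25 = get_pm25_data_for_modelling('ts', 'd')['pm25']
# Additive model: y(t) = level + trend + seasonality + noise
additive = seasonal_decompose(daily_pm25, model='additive', period=365)
# Multiplicative model: y(t) = level * trend * seasonality * noise (requires strictly positive values)
multiplicative = seasonal_decompose(daily_pm25, model='multiplicative', period=365)
additive.plot();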
dfh = get_pm25_data_for_modelling('ts', 'h')
dfh.head()
#dfh.tail(24)
train_range_from_h = '2008-01-01 02:00:00'
train_range_to_h = '2018-12-30 23:00:00'
test_range_from_h = '2018-12-31 01:00:00'
test_range_to_h = None
dfd = get_pm25_data_for_modelling('ts', 'd')
dfd.head()
#dfd.tail(7)
train_range_from_d = '2008-01-01'
train_range_to_d = '2018-12-25'
test_range_from_d = '2018-12-26'
test_range_to_d = None
df = dfh.copy()
# Work with different scope of PM2.5 observed (time series) data
series = df['pm25'] # hourly
#series = df['pm25'].resample(rule='D').mean() # daily
#series = df['pm25'].resample(rule='W').mean() # weekly
#series = df['pm25'].resample(rule='M').mean() # monthly
series.head()
series.name
series.index.freq
# Different rolling window would be appropriate for different resamplings
rolling_window = 4 # for D
#rolling_window = 6 # for W
#rolling_window = 3 # for M
plot_decompose(data=series)
df = dfd.copy()
# Work with different scope of PM2.5 observed (time series) data
#series = df['pm25'] # hourly
series = df['pm25'].resample(rule='D').mean() # daily
#series = df['pm25'].resample(rule='W').mean() # weekly
#series = df['pm25'].resample(rule='M').mean() # monthly
series.head()
series.index.freq
# Different rolling window would be appropriate for different resamplings
rolling_window = 4 # for D
#rolling_window = 6 # for W
#rolling_window = 3 # for M
plot_decompose(data=series)
STL uses LOESS (locally estimated scatterplot smoothing) to extract smooth estimates of the three components. The key inputs into STL are the period and the lengths of the seasonal, trend, and low-pass smoothers; see:
https://www.statsmodels.org/dev/examples/notebooks/generated/stl_decomposition.html
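plot_stl is a project helper; a rough sketch of the equivalent direct statsmodels call (an assumption about what the helper wraps) looks like this:
from statsmodels.tsa.seasonal import STL
# Fit STL directly on the daily PM2.5 series loaded above
stl = STL(dfd['pm25'], period=365)
res = stl.fit()
res.plot();
# The extracted components are available as res.trend, res.seasonal and res.resid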
df = dfh.copy()
# Work with different scope of PM2.5 observed (time series) data
series = df['pm25'] # hourly
#series = df['pm25'].resample(rule='D').mean() # daily
#series = df['pm25'].resample(rule='W').mean() # weekly
#series = df['pm25'].resample(rule='M').mean() # monthly
series.head()
# Different rolling window would be appropriate for different resamplings
rolling_window = 4 # for D
#rolling_window = 6 # for W
#rolling_window = 3 # for M
plot_stl(data=series, period=365, low_pass=367)
df = dfd.copy()
# Work with different scope of PM2.5 observed (time series) data
#series = df['pm25'] # hourly
series = df['pm25'].resample(rule='D').mean() # daily
#series = df['pm25'].resample(rule='W').mean() # weekly
#series = df['pm25'].resample(rule='M').mean() # monthly
series.head()
# Different rolling window would be appropriate for different resamplings
rolling_window = 4 # for D
#rolling_window = 6 # for W
#rolling_window = 3 # for M
plot_stl(data=series, period=365)
Here, choose the decomposition method and the dataset (hourly or daily) to move forward with. For the classical modelling, we will focus on the residuals, as they should be stationary (still to be confirmed below).
# STL for Daily Data
result = plot_stl(data=dfd['pm25'], period=365, low_pass=367)
result.observed[:5]
Our time series may contain a trend. A trend is a continued increase or decrease in the series over time. Identifying and understanding trend information can aid in improving model performance.
A time series with a trend is called trend non-stationary. An identified trend can be modeled. Once modeled, it can be removed from the time series dataset. This is called detrending the time series. If a dataset does not have a trend, or we successfully remove the trend, the dataset is said to be trend stationary.
Specifically, a trend can be removed from your time series data (and data in the future) as a data preparation and cleaning exercise. This is common when using statistical methods for time series forecasting, but does not always improve results when using machine learning models. Alternately, a trend can be added, either directly or as a summary, as a new input variable to the supervised learning problem to predict the output variable.
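For illustration only, a hedged sketch of the second option, adding a simple time-index trend feature (and its square) to a supervised learning frame; the column names are hypothetical:
import numpy as np
# Hypothetical feature frame: the time index itself (and its square) as candidate trend inputs
supervised = dfd[['pm25']].copy()
supervised['t'] = np.arange(len(supervised))     # linear trend feature
supervised['t_squared'] = supervised['t'] ** 2   # lets a model capture a quadratic trend
supervised.head()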
Perhaps the simplest method to detrend a time series is by differencing. Specifically, a new series is constructed where the value at the current time step is calculated as the difference between the original observation and the observation at the previous time step.
result.trend[:5]
# Detrend a time series using differencing
series = result.observed
X = series.values
diff = list()
for i in range(1, len(X)):
value = X[i] - X[i - 1]
diff.append(value)
diff = pd.Series(diff, index=result.observed.index[1:])
Because no difference value can be created for the first observation (there is nothing for it to be subtracted from), the new dataset contains one less record. This approach works well for data with a linear trend. If the trend is quadratic (the change in the trend also increases or decreases), then a difference of the already-differenced dataset can be taken, a second level of differencing.
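For reference, the same first difference can be obtained with pandas, and a second level of differencing is simply the difference of the already-differenced series (a sketch, not used further below):
# First difference with pandas (equivalent to the loop above, minus the first observation)
diff1 = result.observed.diff().dropna()
# Second level of differencing, useful when the trend is quadratic
diff2 = result.observed.diff().diff().dropna()
diff1.head(), diff2.head()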
plot_before_after(result.observed, diff, "With trend", "Detrended")
# Use a linear model to detrend a time series
from sklearn.linear_model import LinearRegression
import numpy as np
series = result.observed
# Create integer-indexed observations
X = [i for i in range(0, len(series))]
X = np.reshape(X, (len(X), 1))
print(X[:5])
y = series.values
print(y[:5])
# Fit linear model
model = LinearRegression()
model.fit(X, y)
# Calculate trend
trend1 = model.predict(X)
print(trend1[:5])
# Detrend
detrended = [y[i] - trend1[i] for i in range(0, len(series))]
print(detrended[:5])
# Plot trend
plt.figure(figsize=(20, 10))
plt.plot(y)
plt.plot(trend1, linewidth=3)
plt.show();
# Check what is the trend line now
model.fit(X, detrended)
# Calculate trend (should be 0 now, flat horizontal line)
trend2 = model.predict(X)
print(trend2[:5])
# Plot detrended
plt.figure(figsize=(20, 10))
plt.plot(detrended)
plt.plot(trend2, linewidth=3)
plt.show();
# Check what is the trend line now
model.fit(X, detrended+trend1)
# Calculate trend (should be back as at the beginning)
trend3 = model.predict(X)
print(trend3[:5])
# Plot detrended
plt.figure(figsize=(20, 10))
plt.plot(y)
plt.plot(trend3, linewidth=3)
plt.show();
y = pd.Series(y, index=result.observed.index)
detrended = pd.Series(detrended, index=result.observed.index)
plot_before_after(y, detrended, "With trend", "Detrended")
Time series datasets can contain a seasonal component. This is a cycle that repeats over time, such as monthly or yearly. This repeating cycle may obscure the signal that we wish to model when forecasting, and in turn may provide a strong signal to our predictive models.
Seasonal variation, or seasonality, refers to cycles that repeat regularly over time. A cycle structure in a time series may or may not be seasonal. If it consistently repeats at the same frequency, it is seasonal; otherwise it is not seasonal and is called a cycle.
Understanding the seasonal component in time series can improve the performance of modeling with machine learning. This can happen in two main ways:
Modeling seasonality and removing it from the time series may occur during data cleaning and preparation. Extracting seasonal information and providing it as input features, either directly or in summary form, may occur during feature extraction and feature engineering activities.
The simplest approach to determining if there is an aspect of seasonality is to plot and review your data, perhaps at different scales and with the addition of trend lines.
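One way to do that here is to plot the daily PM2.5 series resampled to coarser scales (a quick sketch consistent with the resampling options used above):
# Eyeball seasonality at different time scales: daily values, monthly and yearly means
fig, axes = plt.subplots(3, 1, figsize=(20, 12))
dfd['pm25'].plot(ax=axes[0], title='Daily')
dfd['pm25'].resample('M').mean().plot(ax=axes[1], title='Monthly mean')
dfd['pm25'].resample('Y').mean().plot(ax=axes[2], title='Yearly mean')
plt.tight_layout();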
Once seasonality is identified, it can be modeled. The model of seasonality can then be removed from the time series. This process is called Seasonal Adjustment, or Deseasonalizing. A time series where the seasonal component has been removed is called seasonal stationary. A time series with a clear seasonal component is referred to as seasonal non-stationary.
As we are primarily interested in predictive modeling and time series forecasting, we are limited to methods that can be developed on historical data and available when making predictions on new data.
A simple way to correct for a seasonal component is to use differencing. If there is a seasonal component at the level of one week, then we can remove it on an observation today by subtracting the value from last week.
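A sketch of that weekly correction with pandas (for illustration; the code below applies the same idea with a yearly lag instead):
# Remove a weekly seasonal component by subtracting the value observed 7 days earlier
weekly_adjusted = dfd['pm25'].diff(7).dropna()
weekly_adjusted.head()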
It looks like we have a seasonal component each year, showing a swing from summer to winter. We can subtract the daily PM2.5 value from the same day last year to correct for seasonality. This would require special handling of February 29th in leap years.
result.seasonal[:5]
# Deseasonalize a time series using differencing
series = result.observed
X = series.values
print(X[:5])
diff = []
days_in_year = 365
for i in range(days_in_year, len(X)):
value = X[i] - X[i - days_in_year]
diff.append(value)
diff = pd.Series(diff, index=result.observed.index[365:])
plot_before_after(result.observed, diff, "With seasonality", "Deseasonalized")
We can model the seasonal component directly, then subtract it from the observations. The seasonal component in a given time series is likely a sine wave over a generally fixed period and amplitude. This can be approximated easily using a curve-fitting method.
# Model seasonality with a polynomial model
from numpy import polyfit
series = result.observed
X = [i%365 for i in range(0, len(series))]
print(X[:5])
y = series.values
print(y[:5])
# Fit polynomial of the given degree: b1*x^degree + b2*x^(degree-1) + ... + bn
degree = 4
coef = polyfit(X, y, degree)
print('Coefficients: %s' % coef)
# Create curve
curve = []
for i in range(len(X)):
value = coef[-1]
for d in range(degree):
value += X[i]**(degree-d) * coef[d]
curve.append(value)
# Plot curve over original data
plt.figure(figsize=(20, 10))
plt.plot(series.values)
plt.plot(curve, color='red', linewidth=3)
plt.show();
One limitation of this model is that it does not take leap days into account, adding a small offset noise that could easily be corrected with an update to the approach. For example, we could simply remove the February 29 observations from the dataset when creating the seasonal model.
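A minimal sketch of that correction, dropping the February 29 observations before fitting the seasonal model (assuming a DatetimeIndex):
# Drop leap-day observations so that every year aligns on 365 days
is_leap_day = (series.index.month == 2) & (series.index.day == 29)
series_no_leap = series[~is_leap_day]
len(series), len(series_no_leap)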
Time series data is different from more traditional classification and regression predictive modeling problems. The temporal structure adds an order to the observations, which means that important assumptions about the consistency of those observations need to be handled explicitly. For example, when modeling, there is an assumption that the summary statistics of observations are consistent. In time series terminology, we refer to this expectation as the time series being stationary. These assumptions can be easily violated in time series by the addition of a trend, seasonality, and other time-dependent structures.
The observations in a stationary time series are not dependent on time. Time series are stationary if they do not have trend or seasonal effects. Summary statistics calculated on the time series are consistent over time, like the mean or the variance of the observations. When a time series is stationary, it can be easier to model. Statistical modeling methods assume or require the time series to be stationary to be effective.
Observations from a non-stationary time series show seasonal effects, trends, and other structures that depend on the time index. Summary statistics like the mean and variance do change over time, providing a drift in the concepts a model may try to capture. Classical time series analysis and forecasting methods are concerned with making non-stationary time series data stationary by identifying and removing trends and removing seasonal effects.
There are some finer-grained notions of stationarity that you may come across if you dive deeper into this topic. They are:
There are many methods to check whether a time series (direct observations, residuals, otherwise) is stationary or non-stationary:
You should make your time series stationary. If you have clear trend and seasonality in your time series, then model these components, remove them from observations, then train models on the residuals.
Statistical time series methods and even modern machine learning methods will benefit from the clearer signal in the data. We should turn to machine learning methods when the classical methods fail, or when we want more or better results. We cannot know in advance how to best model unknown nonlinear relationships in time series data, and some methods may result in better performance when working with non-stationary observations or with some mixture of stationary and non-stationary views of the problem.
result.resid[:5]
series = result.resid
# Plot
plt.figure(figsize=(20, 10))
plt.plot(series.values)
plt.show();
# Fit theoretical distribution
fit_theoretical_dist_and_plot(series)
Skewed data is common in data science; skew is the degree of distortion from a normal distribution.
Why do we care if the data is skewed? If the response variable is skewed, the model will be trained on a much larger number of moderate values, and will be less likely to successfully predict the value for the most extreme values. The concept is the same as training a model on imbalanced categorical classes. If the values of a certain independent variable (feature) are skewed, then, depending on the model, skewness may violate model assumptions (e.g. logistic regression) or may impair the interpretation of feature importance.
Our residuals are not normally distributed, and this is not a problem. More importantly, they are not strongly skewed, which is good for modelling.
# select only continuous variables (for multicolumn dfs)
#cols = df_observed.dtypes[df_observed.dtypes != 'object'].index
#skewed = df_observed[cols].skew().sort_values(ascending=False)
skewed_features = series.to_frame().skew().sort_values(ascending=False)
skewness = pd.DataFrame({'Skew':skewed_features})
skewness
A quick and dirty check to see if your time series is non-stationary is to review summary statistics. You can split your time series into two (or more) partitions and compare the mean and variance of each group. If they differ and the difference is statistically significant, the time series is likely non-stationary.
Because we are looking at the mean and variance, we are assuming that the data conforms to a Gaussian (also called the bell curve or normal) distribution. We can also quickly check this by eyeballing a histogram of our observations.
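A quick histogram of the residuals serves as that eyeball check (sketch):
# Eyeball whether the residuals look roughly Gaussian
plt.figure(figsize=(20, 10))
plt.hist(result.resid.dropna().values, bins=100)
plt.show();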
# Calculate the mean and variance of each group of numbers
# and compare the values
# Keep the time order when splitting (shuffling would hide any drift over time)
one, two, three, four, five = np.split(df, [int(.20*len(df)), int(.35*len(df)), int(.60*len(df)), int(.70*len(df))])
one.shape, two.shape, three.shape, four.shape, five.shape
mean1, mean2, mean3, mean4, mean5 = one.mean(), two.mean(), three.mean(), four.mean(), five.mean()
var1, var2, var3, var4, var5 = one.var(), two.var(), three.var(), four.var(), five.var()
print(f'Means: {mean1}, {mean2}, {mean3}, {mean4}, {mean5}')
print(f'Variances: {var1}, {var2}, {var3}, {var4}, {var5}')
# It is better to divide into more than two (possibly unequal) ranges for better visibility (see above)
series = result.resid
X = series.values
split = int(len(X) / 2)
X1, X2 = X[0:split], X[split:]
# Calculate statistics of partitioned time series data
mean1, mean2 = X1.mean(), X2.mean()
var1, var2 = X1.var(), X2.var()
print(f'mean1={mean1}, mean2={mean2}')
print(f'variance1={var1}, variance2={var2}')
Mean and variance values for each group are different (second case) or very similar (first case). Based on the summary statistics we may say that the time series is stationary; the statistics confirm what we can see when reviewing the line plot.
This is a quick and dirty method that may be easily fooled. We can use a statistical test to check if the difference between two samples of Gaussian random variables is real or a statistical fluke. We could explore statistical significance tests, like the Student’s t-test, but things get tricky because of the serial correlation between values.
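For completeness, a hedged sketch of such a check with SciPy's two-sample t-test; because of the serial correlation the independence assumption is violated, so the p-value should be treated as indicative only:
from scipy import stats
# Welch's t-test on the two halves of the residual series
t_stat, p_value = stats.ttest_ind(X1, X2, equal_var=False)
print(f't={t_stat:.3f}, p={p_value:.3f}')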
Statistical tests make strong assumptions about your data. They can only be used to inform the degree to which a null hypothesis can be rejected (or fail to be rejected). The result must be interpreted for a given problem to be meaningful. Nevertheless, they can provide a quick check and confirmatory evidence that your time series is stationary or non-stationary.
ADF is a parametric test and KPSS is a non-parametric test of unit root:
Augmented Dickey-Fuller test (ADF): a statistical test designed to explicitly comment on whether a univariate time series is stationary.
The Augmented Dickey-Fuller test is a type of statistical test called a unit root test
. The intuition behind a unit root test is that it determines how strongly a time series is defined by a trend. There are a number of unit root tests and the Augmented Dickey-Fuller may be one of the more widely used. It uses an autoregressive model and optimizes an information criterion across multiple different lag values. The null hypothesis of the test is that the time series can be represented by a unit root, that it is not stationary (has some time-dependent structure). The alternate hypothesis (rejecting the null hypothesis) is that the time series is stationary.
Kwiatkowski–Phillips–Schmidt–Shin test (KPSS): its null hypothesis is the opposite of ADF's, namely that the series is stationary, so a small p-value indicates non-stationarity. See https://www.statisticshowto.com/kpss-test/
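For reference, both tests are available in statsmodels (adfuller_test below is a project helper; the direct calls would look roughly like this):
from statsmodels.tsa.stattools import adfuller, kpss
# ADF: null hypothesis = unit root (non-stationary); a small p-value suggests stationarity
adf_stat, adf_pvalue = adfuller(series.values)[:2]
# KPSS: null hypothesis = stationarity; a small p-value suggests non-stationarity
kpss_stat, kpss_pvalue = kpss(series.values, regression='c', nlags='auto')[:2]
print(f'ADF p-value: {adf_pvalue:.4f}, KPSS p-value: {kpss_pvalue:.4f}')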
adfuller_test(series)
Based on this test, the PM2.5 residual series is stationary at the 99% confidence level.