From: https://github.com/ksatola
Version: 0.1.0

Model - PM2.5 - Time Series Decomposition¶

Contents¶

Time series decomposition
Load hourly data
Load daily data
Time Series Decomposition
- The classical decomposition method
- Seasonal-Trend decomposition using LOESS (STL)
- Decomposition process for modelling
  - Detrending
  - Deseasonliting
  - Stationarity checking

%load_ext autoreload

%autoreload 2

import sys
sys.path.insert(0, '../src')

import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')

import pandas as pd 

import matplotlib.pyplot as plt
%matplotlib inline

from model import (
    get_pm25_data_for_modelling,
)

from stats import (
    adfuller_test
)

from plot import (
    plot_stl,
    plot_decompose,
    plot_before_after,
    fit_theoretical_dist_and_plot
)

Time Series Decomposition¶

Time Series Decomposition involves thinking of a series as a combination of level, trend, seasonality, and noise components. Decomposition provides a useful abstract model for thinking about time series generally and for better understanding problems during time series analysis and forecasting.

A useful abstraction for selecting forecasting methods is to break a time series down into systematic and unsystematic components:

Systematic: Components of the time series that have consistency or recurrence and can be described and modeled.
Non-Systematic: Components of the time series that cannot be directly modeled.

A given time series is thought to consist of three systematic components including level, trend, seasonality, and one non-systematic component called noise. These components are defined as follows:

Level: The average value in the series.
Trend: The increasing or decreasing value in the series.
Seasonality: The repeating short-term cycle in the series.
Noise: The random variation in the series.

A series is thought to be an aggregate or combination of these four components. All series have a level and noise. The trend and seasonality components are optional.

There are two types of time series composition, linear and nonlinear:

An additive model is linear where changes over time are consistently made by the same amount. A linear trend is a straight line. A linear seasonality has the same frequency (width of cycles) and amplitude (height of cycles).
A multiplicative model is nonlinear, such as quadratic or exponential. Changes increase or decrease over time. A nonlinear trend is a curved line. A nonlinear seasonality has an increasing or decreasing frequency and/or amplitude over time.

Load hourly data¶

dfh = get_pm25_data_for_modelling('ts', 'h')
dfh.head()

common.py | 42 | get_pm25_data_for_modelling | 08-Jun-20 18:05:08 | INFO: Dataframe loaded: /Users/ksatola/Documents/git/air-pollution/agh/data/dfpm25_2008-2018_hourly.hdf
common.py | 43 | get_pm25_data_for_modelling | 08-Jun-20 18:05:08 | INFO: Dataframe size: (96388, 1)

#dfh.tail(24)

train_range_from_h = '2008-01-01 02:00:00'
train_range_to_h = '2018-12-30 23:00:00'
test_range_from_h = '2018-12-31 01:00:00'
test_range_to_h = None

Load daily data¶

dfd = get_pm25_data_for_modelling('ts', 'd')
dfd.head()

common.py | 42 | get_pm25_data_for_modelling | 08-Jun-20 18:05:10 | INFO: Dataframe loaded: /Users/ksatola/Documents/git/air-pollution/agh/data/dfpm25_2008-2018_daily.hdf
common.py | 43 | get_pm25_data_for_modelling | 08-Jun-20 18:05:10 | INFO: Dataframe size: (4019, 1)

#dfd.tail(7)

train_range_from_d = '2008-01-01'
train_range_to_d = '2018-12-25'
test_range_from_d = '2018-12-26'
test_range_to_d = None

The classical decomposition method¶

Hourly Frequency¶

df = dfh.copy()

# Work with different scope of PM2.5 observed (time series) data
series = df['pm25'] # hourly
#series = df['pm25'].resample(rule='D').mean() # daily
#series = df['pm25'].resample(rule='W').mean() # weekly
#series = df['pm25'].resample(rule='M').mean() # monthly
series.head()

Datetime
2008-01-01 01:00:00    92.0
2008-01-01 02:00:00    81.0
2008-01-01 03:00:00    73.0
2008-01-01 04:00:00    60.5
2008-01-01 05:00:00    61.0
Name: pm25, dtype: float64

series.name

'pm25'

series.index.freq

# Different rolling window would be appropriate for different resamplings
rolling_window = 4 # for D
#rolling_window = 6 # for W
#rolling_window = 3 # for M

plot_decompose(data=series)

<statsmodels.tsa.seasonal.DecomposeResult at 0x12bae45d0>

Daily Frequency¶

df = dfd.copy()

# Work with different scope of PM2.5 observed (time series) data
#series = df['pm25'] # hourly
series = df['pm25'].resample(rule='D').mean() # daily
#series = df['pm25'].resample(rule='W').mean() # weekly
#series = df['pm25'].resample(rule='M').mean() # monthly
series.head()

Datetime
2008-01-01    53.586957
2008-01-02    30.958333
2008-01-03    46.104167
2008-01-04    42.979167
2008-01-05    57.312500
Freq: D, Name: pm25, dtype: float64

series.index.freq

<Day>

# Different rolling window would be appropriate for different resamplings
rolling_window = 4 # for D
#rolling_window = 6 # for W
#rolling_window = 3 # for M

plot_decompose(data=series)

<statsmodels.tsa.seasonal.DecomposeResult at 0x12cd506d0>

Seasonal-Trend decomposition using LOESS (STL)¶

STL uses LOESS (locally estimated scatterplot smoothing) to extract smooths estimates of the three components. The key inputs into STL are:

seasonal - The length of the seasonal smoother. Must be odd.
trend - The length of the trend smoother, usually around 150% of seasonal. Must be odd and larger than seasonal.
low_pass - The length of the low-pass estimation window, usually the smallest odd number larger than the periodicity of the data.

https://www.statsmodels.org/dev/examples/notebooks/generated/stl_decomposition.html

Hourly Frequency¶

df = dfh.copy()

# Work with different scope of PM2.5 observed (time series) data
series = df['pm25'] # hourly
#series = df['pm25'].resample(rule='D').mean() # daily
#series = df['pm25'].resample(rule='W').mean() # weekly
#series = df['pm25'].resample(rule='M').mean() # monthly
series.head()

Datetime
2008-01-01 01:00:00    92.0
2008-01-01 02:00:00    81.0
2008-01-01 03:00:00    73.0
2008-01-01 04:00:00    60.5
2008-01-01 05:00:00    61.0
Name: pm25, dtype: float64

# Different rolling window would be appropriate for different resamplings
rolling_window = 4 # for D
#rolling_window = 6 # for W
#rolling_window = 3 # for M

plot_stl(data=series, period=365, low_pass=367)

<statsmodels.tsa.seasonal.DecomposeResult at 0x1317aeb50>

Daily Frequency¶

df = dfd.copy()

# Work with different scope of PM2.5 observed (time series) data
#series = df['pm25'] # hourly
series = df['pm25'].resample(rule='D').mean() # daily
#series = df['pm25'].resample(rule='W').mean() # weekly
#series = df['pm25'].resample(rule='M').mean() # monthly
series.head()

Datetime
2008-01-01    53.586957
2008-01-02    30.958333
2008-01-03    46.104167
2008-01-04    42.979167
2008-01-05    57.312500
Freq: D, Name: pm25, dtype: float64

# Different rolling window would be appropriate for different resamplings
rolling_window = 4 # for D
#rolling_window = 6 # for W
#rolling_window = 3 # for M

plot_stl(data=series, period=365)

<statsmodels.tsa.seasonal.DecomposeResult at 0x1362c83d0>

Decomposition process for modelling¶

Here, choose the decomposition method to move forward and the dataset (hourly, daily).

For the clasical modelling, we will focus on residuals, as they should be stationary (still to be confirmed below).

# STL for Daily Data
result = plot_stl(data=dfd['pm25'], period=365, low_pass=367)

Observed data¶

result.observed[:5]

Datetime
2008-01-01    53.586957
2008-01-02    30.958333
2008-01-03    46.104167
2008-01-04    42.979167
2008-01-05    57.312500
Freq: D, Name: pm25, dtype: float64

Use and remove trend¶

Our time series may contain a trend. A trend is a continued increase or decrease in the series over time.

Identifying and understanding trend information can aid in improving model performance; below are a few reasons:

Faster Modeling: Perhaps the knowledge of a trend or lack of a trend can suggest methods and make model selection and evaluation more efficient.
Simpler Problem: Perhaps we can correct or remove the trend to simplify modeling and improve model performance.
More Data: Perhaps we can use trend information, directly or as a summary, to provide additional information to the model and improve model performance.

A time series with a trend is called trend non-stationary. An identified trend can be modeled. Once modeled, it can be removed from the time series dataset. This is called detrending the time series. If a dataset does not have a trend or we successfully remove the trend, the dataset is said to be trend stationary.

Specifically, a trend can be removed from your time series data (and data in the future) as a data preparation and cleaning exercise. This is common when using statistical methods for time series forecasting, but does not always improve results when using machine learning models. Alternately, a trend can be added, either directly or as a summary, as a new input variable to the supervised learning problem to predict the output variable.

Perhaps the simplest method to detrend a time series is by differencing. Specifically, a new series is constructed where the value at the current time step is calculated as the difference between the original observation and the observation at the previous time step.

result.trend[:5]

Datetime
2008-01-01    27.904629
2008-01-02    27.947183
2008-01-03    27.989668
2008-01-04    28.032084
2008-01-05    28.074429
Freq: D, Name: trend, dtype: float64

# Detrend a time series using differencing
series = result.observed
X = series.values
diff = list()
for i in range(1, len(X)):
    value = X[i] - X[i - 1]
    diff.append(value)
diff = pd.Series(diff, index=result.observed.index[1:])

Because no difference value can be created for the first observation (there is nothing for it to be subtracted from), the new dataset contains one less record. This approach works well for data with a linear trend. If the trend is quadratic (the change in the trend also increases or decreases), then a difference of the already-differenced dataset can be taken, a second level of differencing.

plot_before_after(result.observed, diff, "With trend", "Detrended")

# Use a linear model to detrend a time series
from sklearn.linear_model import LinearRegression
import numpy as np

series = result.observed

# Create integer-indexed observations
X = [i for i in range(0, len(series))]
X = np.reshape(X, (len(X), 1))
print(X[:5])

y = series.values
print(y[:5])

# Fit linear model
model = LinearRegression()
model.fit(X, y)

# Calculate trend
trend1 = model.predict(X)
print(trend1[:5])

# Detrend
detrended = [y[i] - trend1[i] for i in range(0, len(series))]
print(detrended[:5])

[[0]
 [1]
 [2]
 [3]
 [4]]
[53.58695652 30.95833333 46.10416667 42.97916667 57.3125    ]
[43.97819481 43.97481936 43.97144391 43.96806846 43.96469301]
[9.608761707709753, -13.016486028819504, 2.132722756390365, -0.9889017917330918, 13.347806993476787]

# Plot trend
plt.figure(figsize=(20, 10))
plt.plot(y)
plt.plot(trend1, linewidth=3)
plt.show();

# Check what is the trend line now
model.fit(X, detrended)

# Calculate trend (should be 0 now, flat horizontal line)
trend2 = model.predict(X)
print(trend2[:5])

[-4.50655520e-16 -4.52346124e-16 -4.54036728e-16 -4.55727332e-16
 -4.57417936e-16]

# Plot detrended
plt.figure(figsize=(20, 10))
plt.plot(detrended)
plt.plot(trend2, linewidth=3)
plt.show();

# Check what is the trend line now
model.fit(X, detrended+trend1)

# Calculate trend (should be back as at the beginning)
trend3 = model.predict(X)
print(trend3[:5])

# Plot detrended
plt.figure(figsize=(20, 10))
plt.plot(y)
plt.plot(trend3, linewidth=3)
plt.show();

[43.97819481 43.97481936 43.97144391 43.96806846 43.96469301]

y = pd.Series(y, index=result.observed.index)
detrended = pd.Series(detrended, index=result.observed.index)
plot_before_after(y, detrended, "With trend", "Detrended")

Use and remove seasonality¶

Time series datasets can contain a seasonal component. This is a cycle that repeats over time, such as monthly or yearly. This repeating cycle may obscure the signal that we wish to model when forecasting, and in turn may provide a strong signal to our predictive models.

Seasonal variation, or seasonality, are cycles that repeat regularly over time. A cycle structure in a time series may or may not be seasonal. If it consistently repeats at the same frequency, it is seasonal, otherwise it is not seasonal and is called a cycle.

Understanding the seasonal component in time series can improve the performance of modeling with machine learning. This can happen in two main ways:

Clearer Signal: Identifying and removing the seasonal component from the time series can result in a clearer relationship between input and output variables.
More Information: Additional information about the seasonal component of the time series can provide new information to improve model performance.

Modeling seasonality and removing it from the time series may occur during data cleaning and preparation. Extracting seasonal information and providing it as input features, either directly or in summary form, may occur during feature extraction and feature engineering activities.

The simplest approach to determining if there is an aspect of seasonality is to plot and review your data, perhaps at different scales and with the addition of trend lines.

Once seasonality is identified, it can be modeled. The model of seasonality can be removed from the time series. This process is called Seasonal Adjustment, or Deseasonalizing. A time series where the seasonal component has been removed is called seasonal stationary. A time series with a clear seasonal component is referred to as seasonal non-stationary.

As we are primarily interested in predictive modeling and time series forecasting, we are limited to methods that can be developed on historical data and available when making predictions on new data.

A simple way to correct for a seasonal component is to use differencing. If there is a seasonal component at the level of one week, then we can remove it on an observation today by subtracting the value from last week.

It looks like we have a seasonal component each year showing swing from summer to winter. We can subtract the daily minimum temperature from the same day last year to correct for seasonality. This would require special handling of February 29th in leap years.

result.seasonal[:5]

Datetime
2008-01-01    94.496364
2008-01-02    14.806637
2008-01-03    16.350060
2008-01-04    11.232492
2008-01-05    27.397082
Freq: D, Name: season, dtype: float64

# Deseasonalize a time series using differencing

series = result.observed
X = series.values
print(X[:5])

diff = []
days_in_year = 365

for i in range(days_in_year, len(X)):
    value = X[i] - X[i - days_in_year]
    diff.append(value)
diff = pd.Series(diff, index=result.observed.index[365:])

[53.58695652 30.95833333 46.10416667 42.97916667 57.3125    ]

plot_before_after(result.observed, diff, "With seasonality", "Deseasonalized")

We can model the seasonal component directly, then subtract it from the observations. The seasonal component in a given time series is likely a sine wave over a generally fixed period and amplitude. This can be approximated easily using a curve-fitting method.

# Model seasonality with a polynomial model
from numpy import polyfit

series = result.observed

X = [i%365 for i in range(0, len(series))] 
print(X[:5])

y = series.values
print(y[:5])

# Fit polynomial: x^2*b1 + x*b2 + ... + bn
degree = 4
coef = polyfit(X, y, degree) 
print('Coefficients: %s' % coef)

# Create curve
curve = []
for i in range(len(X)):
    value = coef[-1]
    for d in range(degree):
        value += X[i]**(degree-d) * coef[d]
    curve.append(value)

[0, 1, 2, 3, 4]
[53.58695652 30.95833333 46.10416667 42.97916667 57.3125    ]
Coefficients: [-5.24747614e-08  3.88570661e-05 -7.53229315e-03  1.02827324e-01
  6.67842463e+01]

# Plot curve over original data
plt.figure(figsize=(20, 10))
plt.plot(series.values) 
plt.plot(curve, color='red', linewidth=3) 
plt.show();

One limitation of this model is that it does not take into account of leap days, adding small offset noise that could easily be corrected with an update to the approach. For example, we could just remove the February 29 observations from the dataset when creating the seasonal model.

Stationarity¶

Time series is different from more traditional classification and regression predictive modeling problems. The temporal structure adds an order to the observations. This imposes that important assumptions about the consistency of those observations needs to be handled specifically. For example, when modeling, there are assumptions that the summary statistics of observations are consistent. In time series terminology, we refer to this expectation as the time series being stationary. These assumptions can be easily violated in time series by the addition of a trend, seasonality, and other time-dependent structures.

The observations in a stationary time series are not dependent on time. Time series are stationary if they do not have trend or seasonal effects. Summary statistics calculated on the time series are consistent over time, like the mean or the variance of the observations. When a time series is stationary, it can be easier to model. Statistical modeling methods assume or require the time series to be stationary to be effective.

Observations from a non-stationary time series show seasonal effects, trends, and other structures that depend on the time index. Summary statistics like the mean and variance do change over time, providing a drift in the concepts a model may try to capture. Classical time series analysis and forecasting methods are concerned with making non-stationary time series data stationary by identifying and removing trends and removing seasonal effects.

There are some finer-grained notions of stationarity that you may come across if you dive deeper into this topic. They are:

Stationary Process: A process that generates a stationary series of observations.
Stationary Model: A model that describes a stationary series of observations.
Trend Stationary: A time series that does not exhibit a trend.
Seasonal Stationary: A time series that does not exhibit seasonality.
Strictly Stationary: A mathematical definition of a stationary process, specifically that the joint distribution of observations is invariant to time shift.

There are many methods to check whether a time series (direct observations, residuals, otherwise) is stationary or non-stationary:

Look at plots: You can review a time series plot of your data and visually check if there are any obvious trends or seasonality.
Summary statistics: You can review the summary statistics for your data for seasons or random partitions and check for obvious or significant differences.
Statistical tests: You can use statistical tests to check if the expectations of stationarity are met or have been violated.

You should make your time series stationary. If you have clear trend and seasonality in your time series, then model these components, remove them from observations, then train models on the residuals.

Statistical time series methods and even modern machine learning methods will benefit from the clearer signal in the data. But we should turn to machine learning methods when the classical methods fail. When we want more or better results. We cannot know how to best model unknown nonlinear relationships in time series data and some methods may result in better performance when working with non-stationary observations or some mixture of stationary and non-stationary views of the problem.

result.resid[:5]

Datetime
2008-01-01   -68.814036
2008-01-02   -11.795487
2008-01-03     1.764438
2008-01-04     3.714591
2008-01-05     1.840989
Freq: D, Name: resid, dtype: float64

Look at Plots¶

series = result.resid

# Plot
plt.figure(figsize=(20, 10))
plt.plot(series.values)
plt.show();

# Fit theoretical distribution
fit_theoretical_dist_and_plot(series)

Skewed data is common in data science; skew is the degree of distortion from a normal distribution.

Why do we care if the data is skewed? If the response variable is skewed, the model will be trained on a much larger number of moderately values, and will be less likely to successfully predict the value for the most extreme values. The concept is the same as training a model on imbalanced categorical classes. If the values of a certain independent variable (feature) are skewed, depending on the model, skewness may violate model assumptions (e.g. logistic regression) or may impair the interpretation of feature importance.

Our residuals are not normally distributed, and this is not a problem. More importantly, it is NOT very much skewed, which is good for modelling.

# select only continuous variables (for multicolumn dfs)
#cols = df_observed.dtypes[df_observed.dtypes != 'object'].index
#skewed = df_observed[cols].skew().sort_values(ascending=False)

skewed_features = series.to_frame().skew().sort_values(ascending=False)
skewness = pd.DataFrame({'Skew':skewed_features})
skewness

Summary statistics¶

A quick and dirty check to see if your time series is non-stationary is to review summary statistics. You can split your time series into two (or more) partitions and compare the mean and variance of each group. If they differ and the difference is statistically significant, the time series is likely non-stationary.

Because we are looking at the mean and variance, we are assuming that the data conforms to a Gaussian (also called the bell curve or normal) distribution. We can also quickly check this by eyeballing a histogram of our observations.

# Calculate the mean and variance of each group of numbers 
# and compare the values

one, two, three, four, five = np.split(df.sample(frac=1), [int(.20*len(df)), int(.35*len(df)), int(.60*len(df)), int(.70*len(df))])

one.shape, two.shape, three.shape, four.shape, five.shape

((803, 1), (603, 1), (1005, 1), (402, 1), (1206, 1))

mean1, mean2, mean3, mean4, mean5 = one.mean(), two.mean(), three.mean(), four.mean(), five.mean()
var1, var2, var3, var4, var5 = one.var(), two.var(), three.var(), four.var(), five.var()

print(f'Means: {mean1}, {mean2}, {mean3}, {mean4}, {mean5}')
print(f'Variances: {var1}, {var2}, {var3}, {var4}, {var5}')

Means: pm25    36.627173
dtype: float64, pm25    36.997177
dtype: float64, pm25    37.064811
dtype: float64, pm25    38.40876
dtype: float64, pm25    37.382268
dtype: float64
Variances: pm25    1126.689772
dtype: float64, pm25    1022.654757
dtype: float64, pm25    1076.143007
dtype: float64, pm25    1139.448064
dtype: float64, pm25    941.11929
dtype: float64

# !!! it is better to divide into nonequal ranges and more than 2 for better visibility (see above)
series = result.resid
X = series.values

split = int(len(X) / 2)
X1, X2 = X[0:split], X[split:]

# Calculate statistics of partitioned time series data
mean1, mean2 = X1.mean(), X2.mean()
var1, var2 = X1.var(), X2.var()

print(f'mean1={mean1}, mean2={mean2}') 
print(f'variance1={var1}, variance2={var2}')

mean1=0.21749741755438073, mean2=-0.17020154968547843
variance1=567.1676580974554, variance2=331.7557961398194

Mean and standard deviation values for each group are different (second case) and very similar (first case). Based on the summary statistics we may say that the time series is stationary, the statistics confirm what we can see reviewing the line plot.

This is a quick and dirty method that may be easily fooled. We can use a statistical test to check if the difference between two samples of Gaussian random variables is real or a statistical fluke. We could explore statistical significance tests, like the Student’s t-test, but things get tricky because of the serial correlation between values.

# ???
# python difference between two samples of Gaussian test

Statistical tests¶

Statistical tests make strong assumptions about your data. They can only be used to inform the degree to which a null hypothesis can be rejected (or fail to be rejected). The result must be interpreted for a given problem to be meaningful. Nevertheless, they can provide a quick check and confirmatory evidence that your time series is stationary or non-stationary.

ADF is a parametric test and KPSS is a non-parametric test of unit root:

Augmented Dickey-Fuller test (ADF): a statistical test designed to explicitly comment on whether a univariate time series is stationary.

The Augmented Dickey-Fuller test is a type of statistical test called a unit root test. The intuition behind a unit root test is that it determines how strongly a time series is defined by a trend. There are a number of unit root tests and the Augmented Dickey-Fuller may be one of the more widely used. It uses an autoregressive model and optimizes an information criterion across multiple different lag values. The null hypothesis of the test is that the time series can be represented by a unit root, that it is not stationary (has some time-dependent structure). The alternate hypothesis (rejecting the null hypothesis) is that the time series is stationary.
- Null Hypothesis (H0): Fail to reject, it suggests the time series has a unit root, meaning it is non-stationary. It has some time dependent structure.
- Alternate Hypothesis (H1): The null hypothesis is rejected; it suggests the time series does not have a unit root, meaning it is stationary. It does not have time-dependent structure. We interpret this result using the p-value from the test. A p-value below a threshold (such as 5% or 1%) suggests we reject the null hypothesis (stationary), otherwise a p-value above the threshold suggests we fail to reject the null hypothesis (non-stationary).
Kwiatkowski–Phillips–Schmidt–Shin test (KPSS): https://www.statisticshowto.com/kpss-test/

adfuller_test(series)

Test statistic: -12.04972808762017
P-value: 2.614368859594535e-22 -> 0.0000000000000000
Critical values: {'1%': -3.431989155785145, '5%': -2.8622642893621704, '10%': -2.5671555470083454}
The time series does not have a unit root, so it is stationary (reject H0)

Based on this test, PM25 time series is stationary with 99% of confidence.

	pm25
Datetime
2008-01-01 01:00:00	92.0
2008-01-01 02:00:00	81.0
2008-01-01 03:00:00	73.0
2008-01-01 04:00:00	60.5
2008-01-01 05:00:00	61.0

	pm25
Datetime
2008-01-01	53.586957
2008-01-02	30.958333
2008-01-03	46.104167
2008-01-04	42.979167
2008-01-05	57.312500