From: https://github.com/ksatola
Version: 0.1.0

Model - PM2.5 - Moving Average (MA)

In [1]:
%load_ext autoreload
In [2]:
%autoreload 2
In [3]:
import sys
sys.path.insert(0, '../src')
In [4]:
import warnings
warnings.filterwarnings('ignore')

from statsmodels.tools.sm_exceptions import ConvergenceWarning
warnings.simplefilter('ignore', ConvergenceWarning)
In [5]:
import pandas as pd 
import numpy as np

from statsmodels.tsa.statespace.sarimax import SARIMAX

import matplotlib.pyplot as plt
%matplotlib inline
In [6]:
from model import (
    get_pm25_data_for_modelling,
    get_best_arima_params_for_time_series,
    split_df_for_ts_modelling_offset,
    predict_ts,
    fit_model
)

from measure import (
    get_rmse,
    walk_forward_ts_model_validation2,
    get_mean_folds_rmse_for_n_prediction_points,
    prepare_data_for_visualization
)

from plot import (
    visualize_results,
    plot_ts_corr
)

from utils import (
    get_datetime_identifier
)

from logger import logger
In [7]:
model_name = 'MA'

Moving Average (MA) modelling

The moving average (MA) model is a linear regression of the current value on the lagged residual (forecast) errors.
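
In an MA(q) model the current value is therefore a weighted sum of the q most recent white-noise shocks: y_t = c + e_t + theta_1*e_{t-1} + ... + theta_q*e_{t-q}. The cell below is an illustration added for clarity (not part of the original analysis; the theta values are arbitrary): it simulates an MA(2) process and recovers its coefficients with SARIMAX.

In [ ]:
# Illustration only: simulate an MA(2) process and recover its coefficients.
# The chosen theta values are arbitrary.
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(0)
eps = rng.normal(size=1000)          # white-noise shocks
theta = [0.6, 0.3]                   # assumed MA(2) coefficients
y = eps.copy()
y[1:] += theta[0] * eps[:-1]         # add theta_1 * eps_{t-1}
y[2:] += theta[1] * eps[:-2]         # add theta_2 * eps_{t-2}

toy_fit = SARIMAX(y, order=(0, 0, 2)).fit(disp=False)
print(toy_fit.params)                # first two values (ma.L1, ma.L2) should be close to theta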


Load hourly data

In [8]:
dfh = get_pm25_data_for_modelling('ts', 'h')
dfh.head()
common.py | 42 | get_pm25_data_for_modelling | 10-Jun-20 13:04:39 | INFO: Dataframe loaded: /Users/ksatola/Documents/git/air-pollution/agh/data/dfpm25_2008-2018_hourly.hdf
common.py | 43 | get_pm25_data_for_modelling | 10-Jun-20 13:04:39 | INFO: Dataframe size: (96388, 1)
Out[8]:
pm25
Datetime
2008-01-01 01:00:00 92.0
2008-01-01 02:00:00 81.0
2008-01-01 03:00:00 73.0
2008-01-01 04:00:00 60.5
2008-01-01 05:00:00 61.0
In [9]:
df = dfh.copy()
In [10]:
# Define the first past/future cut-off point as a time offset (1 year of data)
cut_off_offset = 365*24 # for hourly data
#cut_off_offset = 365 # for daily data

# Predict for X points
n_pred_points = 24 # for hourly data
#n_pred_points = 7 # for daily data

# https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases
period = 'H' # for hourly data
#period = 'D' # for daily data

Train test split

In [11]:
# Create train / test datasets (with the offset of cut_off_offset datapoints from the end)
# period=None because we need the index to be a DatetimeIndex, not a PeriodIndex, for SARIMAX
df_train, df_test = split_df_for_ts_modelling_offset(data=df, cut_off_offset=cut_off_offset, period=None)
common.py | 159 | split_df_for_ts_modelling_offset | 10-Jun-20 13:04:42 | INFO: Observations: 96388
common.py | 160 | split_df_for_ts_modelling_offset | 10-Jun-20 13:04:42 | INFO: Training Observations: 87627
common.py | 161 | split_df_for_ts_modelling_offset | 10-Jun-20 13:04:42 | INFO: Testing Observations: 8760
common.py | 163 | split_df_for_ts_modelling_offset | 10-Jun-20 13:04:42 | INFO: (96388, 1), (87627, 1), (8760, 1), 96387
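
split_df_for_ts_modelling_offset is a project helper that reserves the last cut_off_offset observations for testing. A rough sketch of such an offset split (an assumption for illustration; it does not reproduce the helper's exact boundary handling or logging):

In [ ]:
# Rough sketch of an offset-based split (illustration, not the project helper):
# keep the last cut_off_offset rows for testing and the rest for training.
df_train_sketch = df.iloc[:-cut_off_offset]
df_test_sketch = df.iloc[-cut_off_offset:]
print(df_train_sketch.shape, df_test_sketch.shape)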

Modelling (train, predict/validate)

In statistical time series models, fitting the model means estimating its parameters. In the case of the MA model, the only order parameter to choose is q, the number of lagged forecast errors; the corresponding MA coefficients are then estimated from the training data.

In [12]:
plot_ts_corr(df_train['pm25'])
In [ ]:
%%time
# Find best parameters (grid-search)
best_results = get_best_arima_params_for_time_series(data=df_train, 
                                                     seasonal=False, 
                                                     max_param_range_p=0, 
                                                     max_param_range_d=0, 
                                                     max_param_range_q=5) # MA model
In [ ]:
SARIMAX(0, 0, 0) - AIC:950035.119919657
SARIMAX(0, 0, 1) - AIC:843344.0396508672
SARIMAX(0, 0, 2) - AIC:772685.05307077
SARIMAX(0, 0, 3) - AIC:726435.4649042541
SARIMAX(0, 0, 4) - AIC:699267.5669061132
SARIMAX(0, 0, 5) - AIC:679220.6801305263
Best model is ARIMA(0, 0, 5) with AIC of 679220.6801305263
CPU times: user 3min 39s, sys: 24.4 s, total: 4min 4s
Wall time: 1min 7s
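
The helper above scores candidate MA orders by AIC and keeps the lowest. A minimal sketch of that selection loop (an illustration of the idea, not the helper's actual implementation):

In [ ]:
# Illustration of AIC-based selection of the MA order q (not the project helper).
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

def pick_ma_order(series, max_q=5):
    best_q, best_aic = None, np.inf
    for q in range(max_q + 1):
        aic = SARIMAX(series, order=(0, 0, q)).fit(disp=False).aic
        if aic < best_aic:
            best_q, best_aic = q, aic
    return best_q, best_aic

# Example (slow on the full hourly series):
# best_q, best_aic = pick_ma_order(df_train['pm25'], max_q=5)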
In [13]:
%%time
# Fit the model using the best parameters found above
p = 0 # (AR) 
d = 0 # differencing
q = 5 # (MA)
model = SARIMAX(endog=df_train, order=(p, d, q))
model_fitted = model.fit()
/usr/local/lib/python3.7/site-packages/statsmodels/tsa/base/tsa_model.py:218: ValueWarning: A date index has been provided, but it has no associated frequency information and so will be ignored when e.g. forecasting.
  ' ignored when e.g. forecasting.', ValueWarning)
CPU times: user 1min 30s, sys: 13.2 s, total: 1min 44s
Wall time: 34.3 s
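
The ValueWarning above only reports that the DatetimeIndex carries no explicit frequency, so statsmodels falls back to an integer index when forecasting. One possible way to avoid it (a sketch, assuming a regular hourly index; any gaps become NaNs, which the state-space model treats as missing observations):

In [ ]:
# Optional sketch: declare an explicit hourly frequency before fitting so that
# SARIMAX keeps date-based forecasting and the ValueWarning disappears.
df_train_h = df_train.asfreq('H')
model_h = SARIMAX(endog=df_train_h, order=(p, d, q))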
In [14]:
model_fitted
Out[14]:
<statsmodels.tsa.statespace.sarimax.SARIMAXResultsWrapper at 0x132bef850>
In [15]:
# Estimated parameters
print(model_fitted.summary())
                               SARIMAX Results                                
==============================================================================
Dep. Variable:                   pm25   No. Observations:                87627
Model:               SARIMAX(0, 0, 5)   Log Likelihood             -339604.340
Date:                Wed, 10 Jun 2020   AIC                         679220.680
Time:                        13:05:28   BIC                         679276.965
Sample:                             0   HQIC                        679237.863
                              - 87627                                         
Covariance Type:                  opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ma.L1          1.5570      0.001   1099.445      0.000       1.554       1.560
ma.L2          1.7406      0.003    690.894      0.000       1.736       1.745
ma.L3          1.5317      0.003    511.143      0.000       1.526       1.538
ma.L4          1.0168      0.003    387.983      0.000       1.012       1.022
ma.L5          0.4352      0.002    280.099      0.000       0.432       0.438
sigma2       136.0769      0.269    505.185      0.000     135.549     136.605
===================================================================================
Ljung-Box (Q):                    81794.62   Jarque-Bera (JB):            741889.61
Prob(Q):                              0.00   Prob(JB):                         0.00
Heteroskedasticity (H):               0.60   Skew:                             1.33
Prob(H) (two-sided):                  0.00   Kurtosis:                        17.01
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
In [16]:
# Model coefficients
print(f'The coefficients of the model are:\n {model_fitted.params}')
The coefficients of the model are:
 ma.L1       1.557027
ma.L2       1.740553
ma.L3       1.531730
ma.L4       1.016828
ma.L5       0.435159
sigma2    136.076879
dtype: float64
In [17]:
print(f'The residual errors during training of the model are:\n {model_fitted.resid}')
The residual errors during training of the model are:
 Datetime
2008-01-01 02:00:00    81.000000
2008-01-01 03:00:00     0.810018
2008-01-01 04:00:00     6.260586
2008-01-01 05:00:00    18.664692
2008-01-01 06:00:00    12.606700
                         ...    
2017-12-31 20:00:00    18.025052
2017-12-31 21:00:00    -1.104700
2017-12-31 22:00:00    24.985000
2017-12-31 23:00:00   -24.342307
2018-01-01 00:00:00     4.016120
Length: 87627, dtype: float64
In [18]:
# Evaluate model quality
import statsmodels.api as sm
res = model_fitted.resid
fig,ax = plt.subplots(2,1,figsize=(15,8))
fig = sm.graphics.tsa.plot_acf(res, lags=50, ax=ax[0])
fig = sm.graphics.tsa.plot_pacf(res, lags=50, ax=ax[1])
plt.show();
In [19]:
# Evaluate model quality
model_fitted.plot_diagnostics(figsize=(20, 10))
plt.show();

The MA model alone does not fit the hourly data well, as it does not capture the multilayer seasonality (yearly, weekly, daily) in the data. Increasing the MA order does not help either.

Making a prediction requires retrieving the MA coefficients from the fitted model, applying them to the lagged residual error values, and calling the custom predict() function.
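
For intuition, the one-step-ahead MA(q) forecast can be reproduced by hand from the fitted coefficients and the last q training residuals. The cell below is a sketch added for illustration (the walk-forward validation further down relies on the project's own helpers, not on this code):

In [ ]:
# Illustration: one-step-ahead MA(q) forecast computed by hand from the fitted
# coefficients and the last q training residuals (zero-mean model, no constant).
import numpy as np

theta = model_fitted.params[[f'ma.L{i}' for i in range(1, q + 1)]].values
last_resid = model_fitted.resid.iloc[-q:].values[::-1]   # eps_t, eps_{t-1}, ...
manual_one_step = np.dot(theta, last_resid)

print(manual_one_step)
print(model_fitted.forecast(1))   # should be close to the manual value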

In [ ]:
%%time
# Validate result on test
# Creates 365*24*24 models for hourly data, or 365*7 models for daily data

fold_results = walk_forward_ts_model_validation2(data=df,
                                                 col_name='pm25',
                                                 model_type='MA',
                                                 p=p,
                                                 d=d,
                                                 q=q,
                                                 cut_off_offset=cut_off_offset,
                                                 n_pred_points=n_pred_points,
                                                 n_folds=-1,
                                                 period=period)
print(len(fold_results))
print(fold_results[0])
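
walk_forward_ts_model_validation2 is a project helper; conceptually, each fold refits the MA model on all data up to the fold's start and forecasts the next n_pred_points observations. A minimal sketch of that expanding-window pattern (an assumption about its general shape, not its implementation; the real helper also adds error columns and logging):

In [ ]:
# Illustration of expanding-window walk-forward validation (not the project helper).
def walk_forward_sketch(series, cut_off_offset, n_pred_points, order):
    folds = []
    start = len(series) - cut_off_offset
    for i in range(start, len(series) - n_pred_points):
        fitted = SARIMAX(series.iloc[:i], order=order).fit(disp=False)
        pred = fitted.forecast(n_pred_points)
        obs = series.iloc[i:i + n_pred_points]
        folds.append(pd.DataFrame({'observed': obs.values,
                                   'predicted': np.asarray(pred)},
                                  index=obs.index))
    return folds

# Example (very slow for hourly data):
# folds = walk_forward_sketch(df['pm25'], cut_off_offset, n_pred_points, (p, d, q))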

Serialize output data

In [ ]:
from joblib import dump, load

timestamp = get_datetime_identifier("%Y-%m-%d_%H-%M-%S")

path = f'results/pm25_ts_{model_name}_results_h_{timestamp}.joblib'

#dump(fold_results, path) 
fold_results = load(path)
print(len(fold_results))
print(fold_results[0])

Calculate and visualize results

In [ ]:
%%time
# Returns a list of mean fold RMSEs, one per forecast horizon up to n_pred_points (starting with the 1-point-ahead forecast)
res = get_mean_folds_rmse_for_n_prediction_points(fold_results=fold_results, n_pred_points=n_pred_points)
res
In [ ]:
print(res)
In [ ]:
# Show forecasts for n-th point in the future
show_n_points_of_forecasts = [1, 12, 24] # for hourly data
#show_n_points_of_forecasts = [1, 3, 7] # for daily data

# Used to zoom the plots (date ranges shown in the plots)
# for hourly data
start_end_dates = [('2018-01-01', '2019-01-01'), ('2018-02-01', '2018-03-01'), ('2018-06-01', '2018-07-01')]
# for daily data
#start_end_dates = [('2018-01-01', '2019-01-01'), ('2018-02-01', '2018-04-01'), ('2018-06-01', '2018-08-01')]

# Type of plot
# 0 -> plot_observed_vs_predicted
# 1 -> plot_observed_vs_predicted_with_error
plot_types = [0, 1, 1]

# File names for plots (format png will be used, do not add .png extension)
base_file_path = f'images/pm25_obs_vs_pred_365_h_ts_{model_name}' # for hourly data
#base_file_path = f'images/pm25_obs_vs_pred_365_d_ts_{model_name}' # for daily data
In [ ]:
fold_results[0].index
In [ ]:
# We need to convert to datetime index for plotting
# https://stackoverflow.com/questions/29394730/converting-periodindex-to-datetimeindex
for i in range(0, len(fold_results)):
    if not isinstance(fold_results[i].index, pd.DatetimeIndex):
        fold_results[i].index = fold_results[i].index.to_timestamp()
In [ ]:
fold_results[0].index
In [ ]:
visualize_results(show_n_points_of_forecasts=show_n_points_of_forecasts,
                   start_end_dates=start_end_dates,
                   plot_types=plot_types,
                   base_file_path=base_file_path,
                   fold_results=fold_results, 
                   n_pred_points=n_pred_points, 
                   cut_off_offset=cut_off_offset, 
                   model_name=model_name,
                timestamp=timestamp)
In [ ]:
 

Load daily data

In [20]:
dfd = get_pm25_data_for_modelling('ts', 'd')
dfd.head()
common.py | 42 | get_pm25_data_for_modelling | 10-Jun-20 13:05:59 | INFO: Dataframe loaded: /Users/ksatola/Documents/git/air-pollution/agh/data/dfpm25_2008-2018_daily.hdf
common.py | 43 | get_pm25_data_for_modelling | 10-Jun-20 13:05:59 | INFO: Dataframe size: (4019, 1)
Out[20]:
pm25
Datetime
2008-01-01 53.586957
2008-01-02 30.958333
2008-01-03 46.104167
2008-01-04 42.979167
2008-01-05 57.312500
In [21]:
df = dfd.copy()
In [22]:
# Define the first past/future cut-off point as a time offset (1 year of data)
#cut_off_offset = 365*24 # for hourly data
cut_off_offset = 365 # for daily data

# Predict for X points
#n_pred_points = 24 # for hourly data
n_pred_points = 7 # for daily data

# https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases
#period = 'H' # for hourly data
period = 'D' # for daily data

Train test split

In [23]:
# Create train / test datasets (with the offset of cut_off_offset datapoints from the end)
# period=None because we need the index to be a DatetimeIndex, not a PeriodIndex, for SARIMAX
df_train, df_test = split_df_for_ts_modelling_offset(data=df, cut_off_offset=cut_off_offset, period=None)
common.py | 159 | split_df_for_ts_modelling_offset | 10-Jun-20 13:06:03 | INFO: Observations: 4019
common.py | 160 | split_df_for_ts_modelling_offset | 10-Jun-20 13:06:03 | INFO: Training Observations: 3653
common.py | 161 | split_df_for_ts_modelling_offset | 10-Jun-20 13:06:03 | INFO: Testing Observations: 365
common.py | 163 | split_df_for_ts_modelling_offset | 10-Jun-20 13:06:03 | INFO: (4019, 1), (3653, 1), (365, 1), 4018

Modelling (train, predict/validate)

In [24]:
plot_ts_corr(df_train['pm25'])
In [ ]:
%%time
# Find best parameters (grid-search)
best_results = get_best_arima_params_for_time_series(data=df_train, 
                                                     seasonal=False, 
                                                     max_param_range_p=0, 
                                                     max_param_range_d=0, 
                                                     max_param_range_q=5) # MA model
In [ ]:
SARIMAX(0, 0, 0) - AIC:38983.00813026588
SARIMAX(0, 0, 1) - AIC:36059.690659735024
SARIMAX(0, 0, 2) - AIC:34918.45008233713
SARIMAX(0, 0, 3) - AIC:34409.81571434598
SARIMAX(0, 0, 4) - AIC:34096.090358520945
SARIMAX(0, 0, 5) - AIC:33892.384328226864
Best model is ARIMA(0, 0, 5) with AIC of 33892.384328226864
CPU times: user 7.08 s, sys: 1.28 s, total: 8.36 s
Wall time: 2.07 s
In [25]:
%%time
# Fit the model using the best parameters found above
p = 0 # (AR) 
d = 0 # differencing
q = 5 # (MA)
model = SARIMAX(endog=df_train['pm25'], order=(p, d, q))
model_fitted = model.fit()
CPU times: user 1.87 s, sys: 404 ms, total: 2.27 s
Wall time: 535 ms
In [26]:
model_fitted
Out[26]:
<statsmodels.tsa.statespace.sarimax.SARIMAXResultsWrapper at 0x13501f850>
In [27]:
# Estimated parameters
print(model_fitted.summary())
                               SARIMAX Results                                
==============================================================================
Dep. Variable:                   pm25   No. Observations:                 3653
Model:               SARIMAX(0, 0, 5)   Log Likelihood              -16940.192
Date:                Wed, 10 Jun 2020   AIC                          33892.384
Time:                        13:06:13   BIC                          33929.604
Sample:                    01-02-2008   HQIC                         33905.639
                         - 01-01-2018                                         
Covariance Type:                  opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ma.L1          0.9441      0.009    108.407      0.000       0.927       0.961
ma.L2          0.7285      0.012     60.408      0.000       0.705       0.752
ma.L3          0.5963      0.012     48.111      0.000       0.572       0.621
ma.L4          0.4354      0.013     33.959      0.000       0.410       0.461
ma.L5          0.2040      0.010     19.615      0.000       0.184       0.224
sigma2       624.1822      6.815     91.583      0.000     610.824     637.540
===================================================================================
Ljung-Box (Q):                      535.31   Jarque-Bera (JB):             12374.42
Prob(Q):                              0.00   Prob(JB):                         0.00
Heteroskedasticity (H):               0.63   Skew:                             0.87
Prob(H) (two-sided):                  0.00   Kurtosis:                        11.85
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
In [28]:
# Model coefficients
print(f'The coefficients of the model are:\n {model_fitted.params}')
The coefficients of the model are:
 ma.L1       0.944070
ma.L2       0.728543
ma.L3       0.596271
ma.L4       0.435404
ma.L5       0.203968
sigma2    624.182183
dtype: float64
In [29]:
print(f'The residual errors during training of the model are:\n {model_fitted.resid}')
The residual errors during training of the model are:
 Datetime
2008-01-02    30.958333
2008-01-03    21.258348
2008-01-04     4.806204
2008-01-05    24.910633
2008-01-06   -10.435173
                ...    
2017-12-28   -16.069751
2017-12-29     4.699001
2017-12-30    19.078982
2017-12-31    -4.544337
2018-01-01    50.745771
Freq: D, Length: 3653, dtype: float64
In [30]:
# Evaluate model quality
import statsmodels.api as sm
res = model_fitted.resid
fig,ax = plt.subplots(2,1,figsize=(15,8))
fig = sm.graphics.tsa.plot_acf(res, lags=50, ax=ax[0])
fig = sm.graphics.tsa.plot_pacf(res, lags=50, ax=ax[1])
plt.show();
In [31]:
# Evaluate model quality
model_fitted.plot_diagnostics(figsize=(20, 10))
plt.show();
In [34]:
%%time
# Validate result on test
# Creates 365*24*24 models for hourly data, or 365*7 models for daily data

fold_results = walk_forward_ts_model_validation2(data=df,
                                                 col_name='pm25',
                                                 model_type='MA',
                                                 p=p,
                                                 d=d,
                                                 q=q,
                                                 cut_off_offset=cut_off_offset,
                                                 n_pred_points=n_pred_points,
                                                 n_folds=-1,
                                                 period=period)
print(len(fold_results))
print(fold_results[0])
validation.py | 67 | walk_forward_ts_model_validation2 | 10-Jun-20 13:13:44 | INFO: (0, 0, 5)
validation.py | 71 | walk_forward_ts_model_validation2 | 10-Jun-20 13:13:44 | INFO: MA model validation started
Started fold 000000/000365 - 2020-06-10_13-13-44
Started fold 000010/000365 - 2020-06-10_13-14-45
Started fold 000020/000365 - 2020-06-10_13-15-50
Started fold 000030/000365 - 2020-06-10_13-16-47
Started fold 000040/000365 - 2020-06-10_13-17-49
Started fold 000050/000365 - 2020-06-10_13-18-45
Started fold 000060/000365 - 2020-06-10_13-19-47
Started fold 000070/000365 - 2020-06-10_13-20-41
Started fold 000080/000365 - 2020-06-10_13-21-38
Started fold 000090/000365 - 2020-06-10_13-22-34
Started fold 000100/000365 - 2020-06-10_13-23-33
Started fold 000110/000365 - 2020-06-10_13-24-30
Started fold 000120/000365 - 2020-06-10_13-25-41
Started fold 000130/000365 - 2020-06-10_13-26-43
Started fold 000140/000365 - 2020-06-10_13-27-43
Started fold 000150/000365 - 2020-06-10_13-28-42
Started fold 000160/000365 - 2020-06-10_13-29-40
Started fold 000170/000365 - 2020-06-10_13-30-38
Started fold 000180/000365 - 2020-06-10_13-31-30
Started fold 000190/000365 - 2020-06-10_13-32-24
Started fold 000200/000365 - 2020-06-10_13-33-22
Started fold 000210/000365 - 2020-06-10_13-34-21
Started fold 000220/000365 - 2020-06-10_13-35-16
Started fold 000230/000365 - 2020-06-10_13-36-15
Started fold 000240/000365 - 2020-06-10_13-37-10
Started fold 000250/000365 - 2020-06-10_13-38-05
Started fold 000260/000365 - 2020-06-10_13-39-02
Started fold 000270/000365 - 2020-06-10_13-40-02
Started fold 000280/000365 - 2020-06-10_13-41-05
Started fold 000290/000365 - 2020-06-10_13-42-07
Started fold 000300/000365 - 2020-06-10_13-43-06
Started fold 000310/000365 - 2020-06-10_13-43-59
Started fold 000320/000365 - 2020-06-10_13-44-55
Started fold 000330/000365 - 2020-06-10_13-45-49
Started fold 000340/000365 - 2020-06-10_13-46-46
Started fold 000350/000365 - 2020-06-10_13-47-46
Started fold 000360/000365 - 2020-06-10_13-48-41
365
             observed  predicted      error  abs_error
Datetime                                              
2018-01-02  67.991848  54.709948  13.281900  13.281900
2018-01-03  16.026950  43.506800  27.479850  27.479850
2018-01-04  14.590020  32.143511  17.553490  17.553490
2018-01-05  22.094854  21.183389   0.911465   0.911465
2018-01-06  62.504217  10.350621  52.153596  52.153596
2018-01-07  43.929804   0.000209  43.929595  43.929595
2018-01-08  22.088192   0.000101  22.088090  22.088090
CPU times: user 2h 5min 53s, sys: 21min 12s, total: 2h 27min 6s
Wall time: 35min 10s

Serialize output data

In [35]:
from joblib import dump, load

timestamp = get_datetime_identifier("%Y-%m-%d_%H-%M-%S")

path = f'results/pm25_ts_{model_name}_results_d_{timestamp}.joblib'

dump(fold_results, path) 
fold_results = load(path)
print(len(fold_results))
print(fold_results[0])
365
             observed  predicted      error  abs_error
Datetime                                              
2018-01-02  67.991848  54.709948  13.281900  13.281900
2018-01-03  16.026950  43.506800  27.479850  27.479850
2018-01-04  14.590020  32.143511  17.553490  17.553490
2018-01-05  22.094854  21.183389   0.911465   0.911465
2018-01-06  62.504217  10.350621  52.153596  52.153596
2018-01-07  43.929804   0.000209  43.929595  43.929595
2018-01-08  22.088192   0.000101  22.088090  22.088090

Calculate and visualize results

In [36]:
%%time
# Returns a list of mean fold RMSEs, one per forecast horizon up to n_pred_points (starting with the 1-point-ahead forecast)
res = get_mean_folds_rmse_for_n_prediction_points(fold_results=fold_results, n_pred_points=n_pred_points)
res
CPU times: user 2.49 s, sys: 21.2 ms, total: 2.51 s
Wall time: 2.54 s
Out[36]:
[11.755884916201119,
 18.54064776536313,
 22.45893463687151,
 26.14217932960894,
 29.310016480446926,
 30.748041340782123,
 30.660515642458105]
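
get_mean_folds_rmse_for_n_prediction_points is a project helper; the list above can be read as one aggregated error per forecast horizon, averaged across the 365 folds. A minimal sketch of such an aggregation (an assumption for illustration; the helper's exact formula may differ):

In [ ]:
# Illustration: aggregate the k-step-ahead errors across folds into one RMSE per horizon.
import numpy as np

def rmse_per_horizon(fold_results, n_pred_points):
    out = []
    for k in range(n_pred_points):
        errs = np.array([f['abs_error'].iloc[k] for f in fold_results if len(f) > k])
        out.append(np.sqrt(np.mean(errs ** 2)))
    return out

# rmse_per_horizon(fold_results, n_pred_points)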
In [37]:
print(res)
[11.755884916201119, 18.54064776536313, 22.45893463687151, 26.14217932960894, 29.310016480446926, 30.748041340782123, 30.660515642458105]

In [38]:
# Show forecasts for n-th point in the future
#show_n_points_of_forecasts = [1, 12, 24] # for hourly data
show_n_points_of_forecasts = [1, 3, 7] # for daily data

# Used to zoom the plots (date ranges shown in the plots)
# for hourly data
#start_end_dates = [('2018-01-01', '2019-01-01'), ('2018-02-01', '2018-03-01'), ('2018-06-01', '2019-07-01')]
# for daily data
start_end_dates = [('2018-01-01', '2019-01-01'), ('2018-02-01', '2018-04-01'), ('2018-06-01', '2019-08-01')]

# Type of plot
# 0 -> plot_observed_vs_predicted
# 1 -> plot_observed_vs_predicted_with_error
plot_types = [0, 1, 1]

# File names for plots (format png will be used, do not add .png extension)
#base_file_path = f'images/pm25_obs_vs_pred_365_h_ts_{model_name}' # for hourly data
base_file_path = f'images/pm25_obs_vs_pred_365_d_ts_{model_name}' # for daily data
In [39]:
fold_results[0].index
Out[39]:
DatetimeIndex(['2018-01-02', '2018-01-03', '2018-01-04', '2018-01-05',
               '2018-01-06', '2018-01-07', '2018-01-08'],
              dtype='datetime64[ns]', name='Datetime', freq='D')
In [40]:
# We need to convert to datetime index for plotting
# https://stackoverflow.com/questions/29394730/converting-periodindex-to-datetimeindex
for i in range(0, len(fold_results)):
    if not isinstance(fold_results[i].index, pd.DatetimeIndex):
        fold_results[i].index = fold_results[i].index.to_timestamp()
In [41]:
fold_results[0].index
Out[41]:
DatetimeIndex(['2018-01-02', '2018-01-03', '2018-01-04', '2018-01-05',
               '2018-01-06', '2018-01-07', '2018-01-08'],
              dtype='datetime64[ns]', name='Datetime', freq='D')
In [42]:
visualize_results(show_n_points_of_forecasts=show_n_points_of_forecasts,
                   start_end_dates=start_end_dates,
                   plot_types=plot_types,
                   base_file_path=base_file_path,
                   fold_results=fold_results, 
                   n_pred_points=n_pred_points, 
                   cut_off_offset=cut_off_offset, 
                   model_name=model_name,
                timestamp=timestamp)


results.py | 92 | visualize_results | 10-Jun-20 13:50:38 | INFO: images/pm25_obs_vs_pred_365_d_ts_MA_01_lag-01_2020-06-10_13-50-05.png


results.py | 92 | visualize_results | 10-Jun-20 13:50:39 | INFO: images/pm25_obs_vs_pred_365_d_ts_MA_01_lag-03_2020-06-10_13-50-05.png


results.py | 92 | visualize_results | 10-Jun-20 13:50:40 | INFO: images/pm25_obs_vs_pred_365_d_ts_MA_01_lag-07_2020-06-10_13-50-05.png


results.py | 92 | visualize_results | 10-Jun-20 13:50:41 | INFO: images/pm25_obs_vs_pred_365_d_ts_MA_02_lag-01_2020-06-10_13-50-05.png


results.py | 92 | visualize_results | 10-Jun-20 13:50:42 | INFO: images/pm25_obs_vs_pred_365_d_ts_MA_02_lag-03_2020-06-10_13-50-05.png


results.py | 92 | visualize_results | 10-Jun-20 13:50:42 | INFO: images/pm25_obs_vs_pred_365_d_ts_MA_02_lag-07_2020-06-10_13-50-05.png


results.py | 92 | visualize_results | 10-Jun-20 13:50:44 | INFO: images/pm25_obs_vs_pred_365_d_ts_MA_03_lag-01_2020-06-10_13-50-05.png


results.py | 92 | visualize_results | 10-Jun-20 13:50:44 | INFO: images/pm25_obs_vs_pred_365_d_ts_MA_03_lag-03_2020-06-10_13-50-05.png


results.py | 92 | visualize_results | 10-Jun-20 13:50:45 | INFO: images/pm25_obs_vs_pred_365_d_ts_MA_03_lag-07_2020-06-10_13-50-05.png
In [ ]: