From: https://github.com/ksatola
Version: 0.1.0
%load_ext autoreload
%autoreload 2
import sys
sys.path.insert(0, '../src')
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from model import (
get_pm25_data_for_modelling
)
from measure import (
get_rmse,
walk_forward_ref_model_validation,
get_mean_folds_rmse_for_n_prediction_points,
prepare_data_for_visualization
)
from plot import (
visualize_results
)
from utils import (
get_datetime_identifier
)
from logger import logger
Establishing a baseline is essential for any time series forecasting problem. Baseline performance gives you an idea of how well all other models will actually perform on your problem.
A baseline in forecast performance provides a point of comparison. It is a point of reference for all other modeling techniques on your problem. If a model achieves performance at or below the baseline, the technique should be fixed or abandoned.
The technique used to generate the baseline forecast must be easy to implement and naive of problem-specific details. Three properties of a good technique for making a baseline forecast are: it is simple (requires little or no training or intelligence), fast (to implement and compute), and repeatable (deterministic, producing the same result every time).
Baseline forecasts quickly reveal whether you can do significantly better. If you can't, you're probably working with a random walk.
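The comparison described above can be sketched in a few lines. The series, split, and `rmse` helper below are made up for illustration (the notebook uses its own helpers from `measure`); the point is that any candidate model's error should beat the baseline's number:

```python
import numpy as np

def rmse(observed, predicted):
    """Root mean squared error of two equal-length sequences."""
    observed, predicted = np.asarray(observed), np.asarray(predicted)
    return float(np.sqrt(np.mean((observed - predicted) ** 2)))

# Made-up series: train on the past, evaluate on the last 3 points
series = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0, 13.0]
train, test = series[:-3], series[-3:]

# Persistence baseline: repeat the last known training value
baseline_pred = [train[-1]] * len(test)
baseline_rmse = rmse(test, baseline_pred)

# Any candidate model should achieve an RMSE below this value
print(f'baseline RMSE: {baseline_rmse:.3f}')
```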
There are many naive forecast methods; here we examine four: the Simple Average Forecast, the Persistence Model, the Simple Moving Average, and the Exponential Moving Average.
With the moving average approach we take the n previously known values, calculate their average, and take it as the next value. This new value is then added to the training set, and a new average of n lags is calculated to predict the next point, and so on. As a forecasting method, there are situations where this technique works best.
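The recursive procedure just described can be sketched as follows; the series, window size, and horizon are made up for illustration:

```python
def moving_average_forecast(history, n_lags, n_pred_points):
    """Recursive moving-average forecast: each prediction is the mean of the
    last n_lags values and is appended to the history, so later predictions
    are partly based on earlier predicted values."""
    history = list(history)
    predictions = []
    for _ in range(n_pred_points):
        pred = sum(history[-n_lags:]) / n_lags
        predictions.append(pred)
        history.append(pred)
    return predictions

# 3-lag moving average, forecasting 2 steps ahead
print(moving_average_forecast([20.0, 22.0, 24.0, 26.0], n_lags=3, n_pred_points=2))
```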
Moving average smoothing is a naive and effective technique in time series forecasting. It can be used for data preparation, feature engineering, and even directly for making predictions.
Smoothing is a technique applied to time series to remove the fine-grained variation between time steps; the hope is to remove noise and better expose the signal of the underlying causal processes. Moving averages are a simple and common type of smoothing used in time series analysis and time series forecasting.
There are various types of moving averages:
The Simple Moving Average (SMA) drops the oldest observation as a new one gets added, while the Cumulative Moving Average (CMA) considers all prior observations. The CMA averages out all of the previous data up until the current data point, which makes it very similar to the Simple Average Forecast technique.
dfh = get_pm25_data_for_modelling('ts', 'h')
dfh.head()
# This is done in the walk forward validation function.
dfd = get_pm25_data_for_modelling('ts', 'd')
dfd.head()
# This is done in the walk forward validation function.
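The SMA/CMA contrast described above can be sketched with pandas, using a made-up series; `rolling` gives the fixed-window SMA, `expanding` the all-history CMA:

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

# Simple moving average: fixed window, drops the oldest value as a new one arrives
sma = s.rolling(window=3).mean()

# Cumulative moving average: expanding window over all prior observations
cma = s.expanding().mean()

print(pd.DataFrame({'value': s, 'SMA(3)': sma, 'CMA': cma}))
```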
The most common baseline method for supervised machine learning is the Zero Rule Algorithm. It predicts the majority class in the case of classification, or the average outcome in the case of regression.
With the Simple Average Forecast approach we take all the previously known values, calculate their average, and use it as every next value.
This method is better than the Persistence Model (below) because it does not depend on the value of the last observation in the train dataset; instead it uses the mean of the entire train dataset, calculated once. Both the simple average and the naive forecast predict a constant value.
For the seasonal data we have here, this approach can be fairer on average, but because of existing outliers it will be constantly wrong in most cases (for most prediction points).
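As a sketch with made-up numbers, the Simple Average Forecast reduces to broadcasting the train mean over the whole horizon:

```python
import pandas as pd

train = pd.Series([30.0, 50.0, 20.0, 60.0, 40.0])

# Simple Average Forecast: one constant value, the train mean, for every horizon
saf_prediction = train.mean()
horizon = 3
forecast = [saf_prediction] * horizon
print(forecast)  # the same value repeated for each prediction point
```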
model_name = 'SAF'
df = dfh.copy()
df.columns = ['pm25']
df.head()
# Define first past/future cutoff point in time offset (1 year of data)
cut_off_offset = 365*24 # for hourly data
#cut_off_offset = 365 # for daily data
# Set datetime format for index
dt_format = "%Y-%m-%d %H:%M:%S" # for hourly data
#dt_format = "%Y-%m-%d" # for daily data
# Create train and validate sets
train_test_split_position = int(len(df)-cut_off_offset)
# Create as many folds as remains till the end of known data
n_folds = len(df) #train_test_split_position+3
# Predict for X points
n_pred_points = 24 # for hourly data
#n_pred_points = 7 # for daily data
%%time
# Validate result on test
# Creates 365*24*24 models for hourly data, or 365*7 models for daily data
fold_results = walk_forward_ref_model_validation(data=df,
col_name='pm25',
model_type='SAF',
cut_off_offset=cut_off_offset,
n_pred_points=n_pred_points,
n_folds=-1,
period='')
print(len(fold_results))
print(fold_results[0])
8736
observed predicted error abs_error
Datetime
2018-01-01 01:00:00 84.90085 37.831378 47.069472 47.069472
2018-01-01 02:00:00 67.44355 37.831378 29.612172 29.612172
2018-01-01 03:00:00 76.66860 37.831378 38.837222 38.837222
2018-01-01 04:00:00 64.96090 37.831378 27.129522 27.129522
2018-01-01 05:00:00 64.14875 37.831378 26.317372 26.317372
2018-01-01 06:00:00 76.06410 37.831378 38.232722 38.232722
2018-01-01 07:00:00 69.19180 37.831378 31.360422 31.360422
2018-01-01 08:00:00 48.51735 37.831378 10.685972 10.685972
2018-01-01 09:00:00 45.92715 37.831378 8.095772 8.095772
2018-01-01 10:00:00 44.19595 37.831378 6.364572 6.364572
2018-01-01 11:00:00 39.27865 37.831378 1.447272 1.447272
2018-01-01 12:00:00 32.61625 37.831378 5.215128 5.215128
2018-01-01 13:00:00 34.09440 37.831378 3.736978 3.736978
2018-01-01 14:00:00 33.51795 37.831378 4.313428 4.313428
2018-01-01 15:00:00 41.24420 37.831378 3.412822 3.412822
2018-01-01 16:00:00 49.08765 37.831378 11.256272 11.256272
2018-01-01 17:00:00 51.24645 37.831378 13.415072 13.415072
2018-01-01 18:00:00 41.64520 37.831378 3.813822 3.813822
2018-01-01 19:00:00 40.98405 37.831378 3.152672 3.152672
2018-01-01 20:00:00 45.36865 37.831378 7.537272 7.537272
2018-01-01 21:00:00 58.24830 37.831378 20.416922 20.416922
2018-01-01 22:00:00 63.21335 37.831378 25.381972 25.381972
2018-01-01 23:00:00 78.28435 37.831378 40.452972 40.452972
2018-01-02 00:00:00 91.30400 37.831378 53.472622 53.472622
CPU times: user 12min 53s, sys: 2 s, total: 12min 55s
Wall time: 12min 56s
from joblib import dump, load
timestamp = get_datetime_identifier("%Y-%m-%d_%H-%M-%S")
path = f'results/pm25_ts_{model_name}_results_h_{timestamp}.joblib'
dump(fold_results, path)
fold_results = load(path)
print(len(fold_results))
print(fold_results[0])
%%time
# Returns a list of mean folds RMSE for n_pred_points (starting at 1 point forecast)
res = get_mean_folds_rmse_for_n_prediction_points(fold_results=fold_results, n_pred_points=n_pred_points)
res
print(res)
[22.219594478879706, 22.216437235996327, 22.215180865472913, 22.213085755280073, 22.21252481634527, 22.212170948117542, 22.209960468319558, 22.208431600091828, 22.20969613177227, 22.210756737832877, 22.212124873737373, 22.214644696969696, 22.216666207529844, 22.218942275022957, 22.22145361570248, 22.22403221992654, 22.22595305325987, 22.227605819559233, 22.22990950413223, 22.232791609274567, 22.234595775941226, 22.235836512855833, 22.23636769972452, 22.23493732782369]
# Show forecasts for n-th point in the future
show_n_points_of_forecasts = [1, 12, 24] # for hourly data
#show_n_points_of_forecasts = [1, 3, 7] # for daily data
# Used to zoom the plots (date ranges shown in the plots)
# for hourly data
start_end_dates = [('2018-01-01', '2019-01-01'), ('2018-02-01', '2018-03-01'), ('2018-06-01', '2018-07-01')]
# for daily data
#start_end_dates = [('2018-01-01', '2019-01-01'), ('2018-02-01', '2018-04-01'), ('2018-06-01', '2018-08-01')]
# Type of plot
# 0 -> plot_observed_vs_predicted
# 1 -> plot_observed_vs_predicted_with_error
plot_types = [0, 1, 1]
# File names for plots (format png will be used, do not add .png extension)
base_file_path = f'images/pm25_obs_vs_pred_365_h_ts_{model_name}' # for hourly data
#base_file_path = f'images/pm25_obs_vs_pred_365_d_ts_{model_name}' # for daily data
visualize_results(show_n_points_of_forecasts=show_n_points_of_forecasts,
start_end_dates=start_end_dates,
plot_types=plot_types,
base_file_path=base_file_path,
fold_results=fold_results,
n_pred_points=n_pred_points,
cut_off_offset=cut_off_offset,
model_name=model_name,
timestamp=timestamp)
df = dfd.copy()
df.columns = ['pm25']
df.head()
# Define first past/future cutoff point in time offset (1 year of data)
#cut_off_offset = 365*24 # for hourly data
cut_off_offset = 365 # for daily data
# Set datetime format for index
#dt_format = "%Y-%m-%d %H:%M:%S" # for hourly data
dt_format = "%Y-%m-%d" # for daily data
# Create train and validate sets
train_test_split_position = int(len(df)-cut_off_offset)
# Create as many folds as remains till the end of known data
n_folds = len(df) #train_test_split_position+3
# Predict for X points
#n_pred_points = 24 # for hourly data
n_pred_points = 7 # for daily data
%%time
# Validate result on test
# Creates 365*24*24 models for hourly data, or 365*7 models for daily data
fold_results = walk_forward_ref_model_validation(data=df,
col_name='pm25',
model_type='SAF',
cut_off_offset=cut_off_offset,
n_pred_points=n_pred_points,
n_folds=-1,
period='')
print(len(fold_results))
print(fold_results[0])
358
observed predicted error abs_error
Datetime
2018-01-02 67.991848 37.836392 30.155456 30.155456
2018-01-03 16.026950 37.836392 21.809442 21.809442
2018-01-04 14.590020 37.836392 23.246371 23.246371
2018-01-05 22.094854 37.836392 15.741538 15.741538
2018-01-06 62.504217 37.836392 24.667825 24.667825
2018-01-07 43.929804 37.836392 6.093412 6.093412
2018-01-08 22.088192 37.836392 15.748200 15.748200
CPU times: user 5.06 s, sys: 42.6 ms, total: 5.1 s
Wall time: 5.11 s
from joblib import dump, load
timestamp = get_datetime_identifier("%Y-%m-%d_%H-%M-%S")
path = f'results/pm25_ts_{model_name}_results_d_{timestamp}.joblib'
dump(fold_results, path)
fold_results = load(path)
print(len(fold_results))
print(fold_results[0])
%%time
# Returns a list of mean folds RMSE for n_pred_points (starting at 1 point forecast)
res = get_mean_folds_rmse_for_n_prediction_points(fold_results=fold_results, n_pred_points=n_pred_points)
res
print(res)
[19.635213105413108, 19.601542735042734, 19.632443589743588, 19.589547863247862, 19.633233903133906, 19.623368376068374, 19.691608831908837]
# Show forecasts for n-th point in the future
#show_n_points_of_forecasts = [1, 12, 24] # for hourly data
show_n_points_of_forecasts = [1, 3, 7] # for daily data
# Used to zoom the plots (date ranges shown in the plots)
# for hourly data
#start_end_dates = [('2018-01-01', '2019-01-01'), ('2018-02-01', '2018-03-01'), ('2018-06-01', '2018-07-01')]
# for daily data
start_end_dates = [('2018-01-01', '2019-01-01'), ('2018-02-01', '2018-04-01'), ('2018-06-01', '2018-08-01')]
# Type of plot
# 0 -> plot_observed_vs_predicted
# 1 -> plot_observed_vs_predicted_with_error
plot_types = [0, 1, 1]
# File names for plots (format png will be used, do not add .png extension)
#base_file_path = f'images/pm25_obs_vs_pred_365_h_ts_{model_name}' # for hourly data
base_file_path = f'images/pm25_obs_vs_pred_365_d_ts_{model_name}' # for daily data
visualize_results(show_n_points_of_forecasts=show_n_points_of_forecasts,
start_end_dates=start_end_dates,
plot_types=plot_types,
base_file_path=base_file_path,
fold_results=fold_results,
n_pred_points=n_pred_points,
cut_off_offset=cut_off_offset,
model_name=model_name,
timestamp=timestamp)
The Persistence Model naive forecasting technique assumes that the next expected point is equal to the last observed point.
The Persistence Algorithm uses the value at the current time step (t) to predict the expected outcome at the next time step (t+1). This satisfies the three above conditions for a baseline forecast.
The Persistence Algorithm is naive; it is often called the Naive Forecast. It assumes nothing about the specifics of the time series problem to which it is applied, which is what makes it so easy to understand and so quick to implement and evaluate.
The actual prediction and other performance measures depend on the value of the last train data point, and will differ for different dataset samples and train/test split points. The naive method is not suited for datasets with high variability.
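A minimal sketch of one-step persistence on a made-up series; `Series.shift(1)` gives the previous observation at each step, which is exactly the naive prediction:

```python
import pandas as pd

s = pd.Series([18.0, 25.0, 21.0, 30.0])

# One-step persistence: the prediction at time t is the observed value at t-1
pred = s.shift(1)
errors = (s - pred).abs()
print(pd.DataFrame({'observed': s, 'persistence': pred, 'abs_error': errors}))
```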
model_name = 'PER'
df = dfh.copy()
df.columns = ['pm25']
df.head()
# Define first past/future cutoff point in time offset (1 year of data)
cut_off_offset = 365*24 # for hourly data
#cut_off_offset = 365 # for daily data
# Set datetime format for index
dt_format = "%Y-%m-%d %H:%M:%S" # for hourly data
#dt_format = "%Y-%m-%d" # for daily data
# Create train and validate sets
train_test_split_position = int(len(df)-cut_off_offset)
# Create as many folds as remains till the end of known data
n_folds = len(df) #train_test_split_position+3
# Predict for X points
n_pred_points = 24 # for hourly data
#n_pred_points = 7 # for daily data
%%time
# Validate result on test
# Creates 365*24*24 models for hourly data, or 365*7 models for daily data
fold_results = walk_forward_ref_model_validation(data=df,
col_name='pm25',
model_type='PER',
cut_off_offset=cut_off_offset,
n_pred_points=n_pred_points,
n_folds=-1,
period='')
print(len(fold_results))
print(fold_results[0])
8736
observed predicted error abs_error
Datetime
2018-01-01 01:00:00 84.90085 21.3458 63.55505 63.55505
2018-01-01 02:00:00 67.44355 21.3458 46.09775 46.09775
2018-01-01 03:00:00 76.66860 21.3458 55.32280 55.32280
2018-01-01 04:00:00 64.96090 21.3458 43.61510 43.61510
2018-01-01 05:00:00 64.14875 21.3458 42.80295 42.80295
2018-01-01 06:00:00 76.06410 21.3458 54.71830 54.71830
2018-01-01 07:00:00 69.19180 21.3458 47.84600 47.84600
2018-01-01 08:00:00 48.51735 21.3458 27.17155 27.17155
2018-01-01 09:00:00 45.92715 21.3458 24.58135 24.58135
2018-01-01 10:00:00 44.19595 21.3458 22.85015 22.85015
2018-01-01 11:00:00 39.27865 21.3458 17.93285 17.93285
2018-01-01 12:00:00 32.61625 21.3458 11.27045 11.27045
2018-01-01 13:00:00 34.09440 21.3458 12.74860 12.74860
2018-01-01 14:00:00 33.51795 21.3458 12.17215 12.17215
2018-01-01 15:00:00 41.24420 21.3458 19.89840 19.89840
2018-01-01 16:00:00 49.08765 21.3458 27.74185 27.74185
2018-01-01 17:00:00 51.24645 21.3458 29.90065 29.90065
2018-01-01 18:00:00 41.64520 21.3458 20.29940 20.29940
2018-01-01 19:00:00 40.98405 21.3458 19.63825 19.63825
2018-01-01 20:00:00 45.36865 21.3458 24.02285 24.02285
2018-01-01 21:00:00 58.24830 21.3458 36.90250 36.90250
2018-01-01 22:00:00 63.21335 21.3458 41.86755 41.86755
2018-01-01 23:00:00 78.28435 21.3458 56.93855 56.93855
2018-01-02 00:00:00 91.30400 21.3458 69.95820 69.95820
CPU times: user 12min 13s, sys: 1.28 s, total: 12min 14s
Wall time: 12min 15s
from joblib import dump, load
timestamp = get_datetime_identifier("%Y-%m-%d_%H-%M-%S")
path = f'results/pm25_ts_{model_name}_results_h_{timestamp}.joblib'
dump(fold_results, path)
fold_results = load(path)
print(len(fold_results))
print(fold_results[0])
%%time
# Returns a list of mean folds RMSE for n_pred_points (starting at 1 point forecast)
res = get_mean_folds_rmse_for_n_prediction_points(fold_results=fold_results, n_pred_points=n_pred_points)
res
print(res)
[4.736325355831037, 7.053169960973371, 8.956491999540862, 10.574964657943067, 11.980865300734619, 13.22309616620753, 14.234124506427918, 15.063469019742884, 15.818647486225895, 16.437228615702477, 16.915838096877867, 17.294015518824608, 17.48212400137741, 17.558690449954085, 17.534685365013775, 17.42358526170799, 17.222373507805326, 17.018195661157026, 16.706883551423324, 16.26723811983471, 15.87100703627181, 15.53065098714417, 15.336227892561983, 15.33666230486685]
# Show forecasts for n-th point in the future
show_n_points_of_forecasts = [1, 12, 24] # for hourly data
#show_n_points_of_forecasts = [1, 3, 7] # for daily data
# Used to zoom the plots (date ranges shown in the plots)
# for hourly data
start_end_dates = [('2018-01-01', '2019-01-01'), ('2018-02-01', '2018-03-01'), ('2018-06-01', '2018-07-01')]
# for daily data
#start_end_dates = [('2018-01-01', '2019-01-01'), ('2018-02-01', '2018-04-01'), ('2018-06-01', '2018-08-01')]
# Type of plot
# 0 -> plot_observed_vs_predicted
# 1 -> plot_observed_vs_predicted_with_error
plot_types = [0, 1, 1]
# File names for plots (format png will be used, do not add .png extension)
base_file_path = f'images/pm25_obs_vs_pred_365_h_ts_{model_name}' # for hourly data
#base_file_path = f'images/pm25_obs_vs_pred_365_d_ts_{model_name}' # for daily data
visualize_results(show_n_points_of_forecasts=show_n_points_of_forecasts,
start_end_dates=start_end_dates,
plot_types=plot_types,
base_file_path=base_file_path,
fold_results=fold_results,
n_pred_points=n_pred_points,
cut_off_offset=cut_off_offset,
model_name=model_name,
timestamp=timestamp)
df = dfd.copy()
df.columns = ['pm25']
df.head()
# Define first past/future cutoff point in time offset (1 year of data)
#cut_off_offset = 365*24 # for hourly data
cut_off_offset = 365 # for daily data
# Set datetime format for index
#dt_format = "%Y-%m-%d %H:%M:%S" # for hourly data
dt_format = "%Y-%m-%d" # for daily data
# Create train and validate sets
train_test_split_position = int(len(df)-cut_off_offset)
# Create as many folds as remains till the end of known data
n_folds = len(df) #train_test_split_position+3
# Predict for X points
#n_pred_points = 24 # for hourly data
n_pred_points = 7 # for daily data
%%time
# Validate result on test
# Creates 365*24*24 models for hourly data, or 365*7 models for daily data
fold_results = walk_forward_ref_model_validation(data=df,
col_name='pm25',
model_type='PER',
cut_off_offset=cut_off_offset,
n_pred_points=n_pred_points,
n_folds=-1,
period='')
print(len(fold_results))
print(fold_results[0])
358
observed predicted error abs_error
Datetime
2018-01-02 67.991848 53.008094 14.983754 14.983754
2018-01-03 16.026950 53.008094 36.981144 36.981144
2018-01-04 14.590020 53.008094 38.418073 38.418073
2018-01-05 22.094854 53.008094 30.913240 30.913240
2018-01-06 62.504217 53.008094 9.496123 9.496123
2018-01-07 43.929804 53.008094 9.078290 9.078290
2018-01-08 22.088192 53.008094 30.919902 30.919902
CPU times: user 5.32 s, sys: 46.9 ms, total: 5.37 s
Wall time: 5.38 s
from joblib import dump, load
timestamp = get_datetime_identifier("%Y-%m-%d_%H-%M-%S")
path = f'results/pm25_ts_{model_name}_results_d_{timestamp}.joblib'
dump(fold_results, path)
fold_results = load(path)
print(len(fold_results))
print(fold_results[0])
%%time
# Returns a list of mean folds RMSE for n_pred_points (starting at 1 point forecast)
res = get_mean_folds_rmse_for_n_prediction_points(fold_results=fold_results, n_pred_points=n_pred_points)
res
print(res)
[9.682181481481482, 14.066957549857548, 14.978520227920228, 15.17680683760684, 15.717147008547009, 17.067744444444447, 17.992249857549858]
# Show forecasts for n-th point in the future
#show_n_points_of_forecasts = [1, 12, 24] # for hourly data
show_n_points_of_forecasts = [1, 3, 7] # for daily data
# Used to zoom the plots (date ranges shown in the plots)
# for hourly data
#start_end_dates = [('2018-01-01', '2019-01-01'), ('2018-02-01', '2018-03-01'), ('2018-06-01', '2018-07-01')]
# for daily data
start_end_dates = [('2018-01-01', '2019-01-01'), ('2018-02-01', '2018-04-01'), ('2018-06-01', '2018-08-01')]
# Type of plot
# 0 -> plot_observed_vs_predicted
# 1 -> plot_observed_vs_predicted_with_error
plot_types = [0, 1, 1]
# File names for plots (format png will be used, do not add .png extension)
#base_file_path = f'images/pm25_obs_vs_pred_365_h_ts_{model_name}' # for hourly data
base_file_path = f'images/pm25_obs_vs_pred_365_d_ts_{model_name}' # for daily data
visualize_results(show_n_points_of_forecasts=show_n_points_of_forecasts,
start_end_dates=start_end_dates,
plot_types=plot_types,
base_file_path=base_file_path,
fold_results=fold_results,
n_pred_points=n_pred_points,
cut_off_offset=cut_off_offset,
model_name=model_name,
timestamp=timestamp)
model_name = 'SMA'
df = dfh.copy()
df.columns = ['pm25']
df.head()
# Define first past/future cutoff point in time offset (1 year of data)
cut_off_offset = 365*24 # for hourly data
#cut_off_offset = 365 # for daily data
# Set datetime format for index
dt_format = "%Y-%m-%d %H:%M:%S" # for hourly data
#dt_format = "%Y-%m-%d" # for daily data
# Create train and validate sets
train_test_split_position = int(len(df)-cut_off_offset)
# Create as many folds as remains till the end of known data
n_folds = len(df) #train_test_split_position+3
# Predict for X points
n_pred_points = 24 # for hourly data
#n_pred_points = 7 # for daily data
%%time
# Validate result on test
# Creates 365*24*24 models for hourly data, or 365*7 models for daily data
fold_results = walk_forward_ref_model_validation(data=df,
col_name='pm25',
model_type='SMA',
cut_off_offset=cut_off_offset,
n_pred_points=n_pred_points,
n_folds=-1,
period='')
print(len(fold_results))
print(fold_results[0])
8736
observed predicted error abs_error
Datetime
2018-01-01 01:00:00 84.90085 32.356375 52.544475 52.544475
2018-01-01 02:00:00 67.44355 33.695469 33.748081 33.748081
2018-01-01 03:00:00 76.66860 30.248536 46.420064 46.420064
2018-01-01 04:00:00 64.96090 29.411545 35.549355 35.549355
2018-01-01 05:00:00 64.14875 31.427981 32.720769 32.720769
2018-01-01 06:00:00 76.06410 31.195883 44.868217 44.868217
2018-01-01 07:00:00 69.19180 30.570986 38.620814 38.620814
2018-01-01 08:00:00 48.51735 30.651599 17.865751 17.865751
2018-01-01 09:00:00 45.92715 30.961612 14.965538 14.965538
2018-01-01 10:00:00 44.19595 30.845020 13.350930 13.350930
2018-01-01 11:00:00 39.27865 30.757304 8.521346 8.521346
2018-01-01 12:00:00 32.61625 30.803884 1.812366 1.812366
2018-01-01 13:00:00 34.09440 30.841955 3.252445 3.252445
2018-01-01 14:00:00 33.51795 30.812041 2.705909 2.705909
2018-01-01 15:00:00 41.24420 30.803796 10.440404 10.440404
2018-01-01 16:00:00 49.08765 30.815419 18.272231 18.272231
2018-01-01 17:00:00 51.24645 30.818303 20.428147 20.428147
2018-01-01 18:00:00 41.64520 30.812390 10.832810 10.832810
2018-01-01 19:00:00 40.98405 30.812477 10.171573 10.171573
2018-01-01 20:00:00 45.36865 30.814647 14.554003 14.554003
2018-01-01 21:00:00 58.24830 30.814454 27.433846 27.433846
2018-01-01 22:00:00 63.21335 30.813492 32.399858 32.399858
2018-01-01 23:00:00 78.28435 30.813767 47.470583 47.470583
2018-01-02 00:00:00 91.30400 30.814090 60.489910 60.489910
CPU times: user 18min 9s, sys: 2.94 s, total: 18min 12s
Wall time: 18min 13s
from joblib import dump, load
timestamp = get_datetime_identifier("%Y-%m-%d_%H-%M-%S")
path = f'results/pm25_ts_{model_name}_results_h_{timestamp}.joblib'
dump(fold_results, path)
fold_results = load(path)
print(len(fold_results))
print(fold_results[0])
%%time
# Returns a list of mean folds RMSE for n_pred_points (starting at 1 point forecast)
res = get_mean_folds_rmse_for_n_prediction_points(fold_results=fold_results, n_pred_points=n_pred_points)
res
print(res)
[7.146073473370064, 8.508867665289257, 9.847301721763085, 11.215512121212122, 12.709579889807163, 13.664001698806244, 14.504790140036732, 15.262237959136824, 15.907483918732781, 16.393955383379247, 16.74894462809917, 16.967410560146924, 17.0615229912764, 17.040708000459137, 16.931553018824605, 16.742444249311298, 16.497563636363637, 16.19251631083563, 15.801181577134988, 15.423336340679521, 15.106160066574839, 14.923811191460054, 14.947423083103766, 15.229386306244264]
# Show forecasts for n-th point in the future
show_n_points_of_forecasts = [1, 12, 24] # for hourly data
#show_n_points_of_forecasts = [1, 3, 7] # for daily data
# Used to zoom the plots (date ranges shown in the plots)
# for hourly data
start_end_dates = [('2018-01-01', '2019-01-01'), ('2018-02-01', '2018-03-01'), ('2018-06-01', '2018-07-01')]
# for daily data
#start_end_dates = [('2018-01-01', '2019-01-01'), ('2018-02-01', '2018-04-01'), ('2018-06-01', '2018-08-01')]
# Type of plot
# 0 -> plot_observed_vs_predicted
# 1 -> plot_observed_vs_predicted_with_error
plot_types = [0, 1, 1]
# File names for plots (format png will be used, do not add .png extension)
base_file_path = f'images/pm25_obs_vs_pred_365_h_ts_{model_name}' # for hourly data
#base_file_path = f'images/pm25_obs_vs_pred_365_d_ts_{model_name}' # for daily data
visualize_results(show_n_points_of_forecasts=show_n_points_of_forecasts,
start_end_dates=start_end_dates,
plot_types=plot_types,
base_file_path=base_file_path,
fold_results=fold_results,
n_pred_points=n_pred_points,
cut_off_offset=cut_off_offset,
model_name=model_name,
timestamp=timestamp)
df = dfd.copy()
df.columns = ['pm25']
df.head()
# Define first past/future cutoff point in time offset (1 year of data)
#cut_off_offset = 365*24 # for hourly data
cut_off_offset = 365 # for daily data
# Set datetime format for index
#dt_format = "%Y-%m-%d %H:%M:%S" # for hourly data
dt_format = "%Y-%m-%d" # for daily data
# Create train and validate sets
train_test_split_position = int(len(df)-cut_off_offset)
# Create as many folds as remains till the end of known data
n_folds = len(df) #train_test_split_position+3
# Predict for X points
#n_pred_points = 24 # for hourly data
n_pred_points = 7 # for daily data
%%time
# Validate result on test
# Creates 365*24*24 models for hourly data, or 365*7 models for daily data
fold_results = walk_forward_ref_model_validation(data=df,
col_name='pm25',
model_type='SMA',
cut_off_offset=cut_off_offset,
n_pred_points=n_pred_points,
n_folds=-1,
period='',
rolling_window=4)
print(len(fold_results))
print(fold_results[0])
358
observed predicted error abs_error
Datetime
2018-01-02 67.991848 32.029379 35.962469 35.962469
2018-01-03 16.026950 33.779889 17.752939 17.752939
2018-01-04 14.590020 33.403785 18.813765 18.813765
2018-01-05 22.094854 38.055287 15.960432 15.960432
2018-01-06 62.504217 34.317085 28.187132 28.187132
2018-01-07 43.929804 34.889011 9.040793 9.040793
2018-01-08 22.088192 35.166292 13.078100 13.078100
CPU times: user 6.4 s, sys: 49.6 ms, total: 6.45 s
Wall time: 6.46 s
from joblib import dump, load
timestamp = get_datetime_identifier("%Y-%m-%d_%H-%M-%S")
path = f'results/pm25_ts_{model_name}_results_d_{timestamp}.joblib'
dump(fold_results, path)
fold_results = load(path)
print(len(fold_results))
print(fold_results[0])
%%time
# Returns a list of mean folds RMSE for n_pred_points (starting at 1 point forecast)
res = get_mean_folds_rmse_for_n_prediction_points(fold_results=fold_results, n_pred_points=n_pred_points)
res
print(res)
[11.28434188034188, 12.888569800569801, 13.431560683760683, 14.104865242165241, 15.333503703703705, 16.172740740740746, 16.877947293447296]
# Show forecasts for n-th point in the future
#show_n_points_of_forecasts = [1, 12, 24] # for hourly data
show_n_points_of_forecasts = [1, 3, 7] # for daily data
# Used to zoom the plots (date ranges shown in the plots)
# for hourly data
#start_end_dates = [('2018-01-01', '2019-01-01'), ('2018-02-01', '2018-03-01'), ('2018-06-01', '2018-07-01')]
# for daily data
start_end_dates = [('2018-01-01', '2019-01-01'), ('2018-02-01', '2018-04-01'), ('2018-06-01', '2018-08-01')]
# Type of plot
# 0 -> plot_observed_vs_predicted
# 1 -> plot_observed_vs_predicted_with_error
plot_types = [0, 1, 1]
# File names for plots (format png will be used, do not add .png extension)
#base_file_path = f'images/pm25_obs_vs_pred_365_h_ts_{model_name}' # for hourly data
base_file_path = f'images/pm25_obs_vs_pred_365_d_ts_{model_name}' # for daily data
visualize_results(show_n_points_of_forecasts=show_n_points_of_forecasts,
start_end_dates=start_end_dates,
plot_types=plot_types,
base_file_path=base_file_path,
fold_results=fold_results,
n_pred_points=n_pred_points,
cut_off_offset=cut_off_offset,
model_name=model_name,
timestamp=timestamp)
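The last baseline is the Exponential Moving Average (EMA). Unlike the SMA, which weights the last n observations equally, the EMA weights observations with geometrically decaying factors, so recent values dominate. A minimal sketch on a made-up series, using pandas' `ewm` (with `adjust=False` the recursion is y_t = alpha*x_t + (1-alpha)*y_{t-1}, where alpha = 2/(span+1)):

```python
import pandas as pd

s = pd.Series([20.0, 22.0, 24.0, 30.0])

# Exponential moving average: weights decay geometrically with age, so recent
# observations dominate; span controls how fast old values are discounted
ema = s.ewm(span=3, adjust=False).mean()
print(ema)

# The last EMA value can serve as a one-step-ahead forecast
print(ema.iloc[-1])
```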
model_name = 'EMA'
df = dfh.copy()
df.columns = ['pm25']
df.head()
# Define first past/future cutoff point in time offset (1 year of data)
cut_off_offset = 365*24 # for hourly data
#cut_off_offset = 365 # for daily data
# Set datetime format for index
dt_format = "%Y-%m-%d %H:%M:%S" # for hourly data
#dt_format = "%Y-%m-%d" # for daily data
# Create train and validate sets
train_test_split_position = int(len(df)-cut_off_offset)
# Create as many folds as remains till the end of known data
n_folds = len(df) #train_test_split_position+3
# Predict for X points
n_pred_points = 24 # for hourly data
#n_pred_points = 7 # for daily data
%%time
# Validate result on test
# Creates 365*24*24 models for hourly data, or 365*7 models for daily data
fold_results = walk_forward_ref_model_validation(data=df,
col_name='pm25',
model_type='EMA',
cut_off_offset=cut_off_offset,
n_pred_points=n_pred_points,
n_folds=-1,
period='')
print(len(fold_results))
print(fold_results[0])
8736
observed predicted error abs_error
Datetime
2018-01-01 01:00:00 84.90085 21.221522 63.679328 63.679328
2018-01-01 02:00:00 67.44355 21.221522 46.222028 46.222028
2018-01-01 03:00:00 76.66860 21.221522 55.447078 55.447078
2018-01-01 04:00:00 64.96090 21.221522 43.739378 43.739378
2018-01-01 05:00:00 64.14875 21.221522 42.927228 42.927228
2018-01-01 06:00:00 76.06410 21.221522 54.842578 54.842578
2018-01-01 07:00:00 69.19180 21.221522 47.970278 47.970278
2018-01-01 08:00:00 48.51735 21.221522 27.295828 27.295828
2018-01-01 09:00:00 45.92715 21.221522 24.705628 24.705628
2018-01-01 10:00:00 44.19595 21.221522 22.974428 22.974428
2018-01-01 11:00:00 39.27865 21.221522 18.057128 18.057128
2018-01-01 12:00:00 32.61625 21.221522 11.394728 11.394728
2018-01-01 13:00:00 34.09440 21.221522 12.872878 12.872878
2018-01-01 14:00:00 33.51795 21.221522 12.296428 12.296428
2018-01-01 15:00:00 41.24420 21.221522 20.022678 20.022678
2018-01-01 16:00:00 49.08765 21.221522 27.866128 27.866128
2018-01-01 17:00:00 51.24645 21.221522 30.024928 30.024928
2018-01-01 18:00:00 41.64520 21.221522 20.423678 20.423678
2018-01-01 19:00:00 40.98405 21.221522 19.762528 19.762528
2018-01-01 20:00:00 45.36865 21.221522 24.147128 24.147128
2018-01-01 21:00:00 58.24830 21.221522 37.026778 37.026778
2018-01-01 22:00:00 63.21335 21.221522 41.991828 41.991828
2018-01-01 23:00:00 78.28435 21.221522 57.062828 57.062828
2018-01-02 00:00:00 91.30400 21.221522 70.082478 70.082478
CPU times: user 33min 38s, sys: 3.58 s, total: 33min 41s
Wall time: 33min 44s
from joblib import dump, load
timestamp = get_datetime_identifier("%Y-%m-%d_%H-%M-%S")
path = f'results/pm25_ts_{model_name}_results_h_{timestamp}.joblib'
dump(fold_results, path)
fold_results = load(path)
print(len(fold_results))
print(fold_results[0])
%%time
# Returns a list of mean folds RMSE for n_pred_points (starting at 1 point forecast)
res = get_mean_folds_rmse_for_n_prediction_points(fold_results=fold_results, n_pred_points=n_pred_points)
res
print(res)
[11.218553363177227, 11.84939532828283, 12.403505142332413, 12.875550573921029, 13.278597761707989, 13.614746889348025, 13.88626672405877, 14.100883448117537, 14.270553340220387, 14.400224242424244, 14.484374104683196, 14.538612316345272, 14.576462442607896, 14.603670569329662, 14.623023266758494, 14.638892011019285, 14.658972050045914, 14.696223071625344, 14.76162760560147, 14.864599735996327, 15.005145741505968, 15.188601779155189, 15.409990231864095, 15.65693489439853]
# Show forecasts for n-th point in the future
show_n_points_of_forecasts = [1, 12, 24] # for hourly data
#show_n_points_of_forecasts = [1, 3, 7] # for daily data
# Used to zoom the plots (date ranges shown in the plots)
# for hourly data
start_end_dates = [('2018-01-01', '2019-01-01'), ('2018-02-01', '2018-03-01'), ('2018-06-01', '2018-07-01')]
# for daily data
#start_end_dates = [('2018-01-01', '2019-01-01'), ('2018-02-01', '2018-04-01'), ('2018-06-01', '2018-08-01')]
# Type of plot
# 0 -> plot_observed_vs_predicted
# 1 -> plot_observed_vs_predicted_with_error
plot_types = [0, 1, 1]
# File names for plots (format png will be used, do not add .png extension)
base_file_path = f'images/pm25_obs_vs_pred_365_h_ts_{model_name}' # for hourly data
#base_file_path = f'images/pm25_obs_vs_pred_365_d_ts_{model_name}' # for daily data
visualize_results(show_n_points_of_forecasts=show_n_points_of_forecasts,
start_end_dates=start_end_dates,
plot_types=plot_types,
base_file_path=base_file_path,
fold_results=fold_results,
n_pred_points=n_pred_points,
cut_off_offset=cut_off_offset,
model_name=model_name,
timestamp=timestamp)
df = dfd.copy()
df.columns = ['pm25']
df.head()
# Define first past/future cutoff point in time offset (1 year of data)
#cut_off_offset = 365*24 # for hourly data
cut_off_offset = 365 # for daily data
# Set datetime format for index
#dt_format = "%Y-%m-%d %H:%M:%S" # for hourly data
dt_format = "%Y-%m-%d" # for daily data
# Create train and validate sets
train_test_split_position = int(len(df)-cut_off_offset)
# Create as many folds as remains till the end of known data
n_folds = len(df) #train_test_split_position+3
# Predict for X points
#n_pred_points = 24 # for hourly data
n_pred_points = 7 # for daily data
%%time
# Validate result on test
# Creates 365*24*24 models for hourly data, or 365*7 models for daily data
fold_results = walk_forward_ref_model_validation(data=df,
col_name='pm25',
model_type='EMA',
cut_off_offset=cut_off_offset,
n_pred_points=n_pred_points,
n_folds=-1,
period='',
rolling_window=4)
print(len(fold_results))
print(fold_results[0])
358
observed predicted error abs_error
Datetime
2018-01-02 67.991848 36.725819 31.266029 31.266029
2018-01-03 16.026950 36.725819 20.698869 20.698869
2018-01-04 14.590020 36.725819 22.135798 22.135798
2018-01-05 22.094854 36.725819 14.630964 14.630964
2018-01-06 62.504217 36.725819 25.778398 25.778398
2018-01-07 43.929804 36.725819 7.203986 7.203986
2018-01-08 22.088192 36.725819 14.637627 14.637627
CPU times: user 6.63 s, sys: 34.7 ms, total: 6.66 s
Wall time: 6.66 s
from joblib import dump, load
timestamp = get_datetime_identifier("%Y-%m-%d_%H-%M-%S")
path = f'results/pm25_ts_{model_name}_results_d_{timestamp}.joblib'
dump(fold_results, path)
fold_results = load(path)
print(len(fold_results))
print(fold_results[0])
%%time
# Returns a list of mean folds RMSE for n_pred_points (starting at 1 point forecast)
res = get_mean_folds_rmse_for_n_prediction_points(fold_results=fold_results, n_pred_points=n_pred_points)
res
print(res)
[12.471266666666667, 12.96436923076923, 13.290245584045586, 13.465590313390313, 13.678582905982907, 13.811899430199428, 13.8896660968661]
# Show forecasts for n-th point in the future
#show_n_points_of_forecasts = [1, 12, 24] # for hourly data
show_n_points_of_forecasts = [1, 3, 7] # for daily data
# Used to zoom the plots (date ranges shown in the plots)
# for hourly data
#start_end_dates = [('2018-01-01', '2019-01-01'), ('2018-02-01', '2018-03-01'), ('2018-06-01', '2018-07-01')]
# for daily data
start_end_dates = [('2018-01-01', '2019-01-01'), ('2018-02-01', '2018-04-01'), ('2018-06-01', '2018-08-01')]
# Type of plot
# 0 -> plot_observed_vs_predicted
# 1 -> plot_observed_vs_predicted_with_error
plot_types = [0, 1, 1]
# File names for plots (format png will be used, do not add .png extension)
#base_file_path = f'images/pm25_obs_vs_pred_365_h_ts_{model_name}' # for hourly data
base_file_path = f'images/pm25_obs_vs_pred_365_d_ts_{model_name}' # for daily data
visualize_results(show_n_points_of_forecasts=show_n_points_of_forecasts,
start_end_dates=start_end_dates,
plot_types=plot_types,
base_file_path=base_file_path,
fold_results=fold_results,
n_pred_points=n_pred_points,
cut_off_offset=cut_off_offset,
model_name=model_name,
timestamp=timestamp)