From: https://github.com/ksatola
Version: 0.1.0
This Exploratory Data Analysis (EDA) covers particulate matter (PM) air pollutants, with a special focus on fine particles (PM2.5), which are considered the most harmful of all air pollutants. The measurements analyzed were taken in the Krakow area in the years 2008-2018.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Load the dataset
data_path = '../data/final/'
data_file = data_path + 'dfpm2008_2018.csv'
df = pd.read_csv(data_file, encoding='utf-8', sep=",", index_col="Datetime")
df.head()
df.info()
df.isnull().sum()
The data set contains hourly measurements of particulate matter (PM10) and fine particles (PM2.5) taken in the Krakow area in the years 2008-2018. There are 96,388 observations and no missing values.
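Note that `isnull()` only reports empty cells in rows that exist; it says nothing about hours that are absent from the index altogether. A complete hourly grid for eleven calendar years (2008-2018) would have 96,432 rows, so, assuming the data spans that full period, a few dozen hours may be missing. A minimal sketch of how such gaps could be located, using a small synthetic frame (the `toy` data and its values are made up for illustration, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy data: 6 hourly timestamps, with the 03:00 row removed
idx = pd.date_range('2018-01-01 00:00', periods=6, freq='h')
toy = pd.DataFrame({'pm10': [30, 32, 35, 31, 29, 28],
                    'pm25': [20, 22, 25, 21, 19, 18]}, index=idx)
toy = toy.drop(idx[3])

# Full hourly grid between the first and the last observation
full_index = pd.date_range(toy.index.min(), toy.index.max(), freq='h')

# Timestamps that are in the grid but not in the data
missing_hours = full_index.difference(toy.index)
print(len(missing_hours), 'missing hour(s):', list(missing_hours))
```

The same three lines (`date_range`, `difference`, `len`) could be applied to `df` once its index has been converted to datetime.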
df.index
# Convert the index from the generic object dtype to datetime
df.index = pd.to_datetime(df.index)
df.index
# Descriptive summary statistics
df.describe()
df['pm10'].idxmax(), df['pm25'].idxmax()
The maximum values observed are 546 [µg/m3] for PM10 and 445 [µg/m3] for PM2.5, both on 2010-01-27 at 6am. The mean values are respectively 53 and 37 [µg/m3]. 75% of all observations are below 65 and 45 [µg/m3]. Almost 75% of PM10 and almost 50% of PM2.5 observations exceeded WHO air quality guidelines (20 [µg/m3] for PM10 and 25 [µg/m3] for PM2.5).
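The exceedance shares quoted above can be read off the quartiles, but they can also be computed directly. A minimal sketch, assuming the guideline values given in the text (20 for PM10 and 25 for PM2.5); the `toy` frame is hypothetical and only illustrates the mechanics:

```python
import pandas as pd

# Hypothetical toy measurements [µg/m3], not the real data
toy = pd.DataFrame({'pm10': [10, 25, 40, 60, 15, 80, 30, 22],
                    'pm25': [8, 18, 30, 45, 10, 60, 26, 12]})

# Guideline values as quoted in the text above
who_limits = pd.Series({'pm10': 20, 'pm25': 25})

# Compare each column with its own limit; the mean of a boolean
# column is the share of observations above the limit
share_above = toy.gt(who_limits, axis='columns').mean()
print(share_above)
```

Replacing `toy` with `df` would give the exact shares for the full dataset rather than the approximate "almost 75%" and "almost 50%" read from `describe()`.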
df.boxplot(grid=True, figsize=(10, 8))
plt.title('Measured PM10 and PM2.5 distributions')
plt.ylabel('Observed values [µg/m3]')
plt.savefig('images/eda_pm_dists.png')
plt.show();
df.plot(kind='hist', bins=100, grid=True, figsize=(10, 8), alpha=0.5)
plt.title('PM10 and PM2.5 measured values frequency')
plt.xlabel('Observed value [µg/m3]')
plt.ylabel('Frequency')
plt.savefig('images/eda_pm_freq.png')
plt.show();
The PM10 and PM2.5 distributions are similar (right-skewed), with many outliers at the high end that exceed WHO and EU air quality guidelines many times over.
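The right skew and the high-end outliers seen in the plots can also be expressed as numbers. The sketch below uses a synthetic log-normal sample as a stand-in for PM2.5 (an assumption made purely for illustration) and applies the same 1.5 x IQR rule that matplotlib's boxplot uses by default for its whiskers:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample standing in for PM2.5 (not the real data)
rng = np.random.default_rng(0)
toy = pd.Series(rng.lognormal(mean=3.5, sigma=0.6, size=5000), name='pm25')

# Positive skewness and mean > median both indicate a right-skewed distribution
skewness = toy.skew()
mean_minus_median = toy.mean() - toy.median()

# Upper boxplot fence: Q3 + 1.5 * IQR; points above it are drawn as outliers
q1, q3 = toy.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
n_high_outliers = int((toy > upper_fence).sum())

print(f'skewness: {skewness:.2f}, mean - median: {mean_minus_median:.2f}')
print(f'upper fence: {upper_fence:.1f}, outliers above it: {n_high_outliers}')
```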
ax = df.plot(grid=True, figsize=(20, 8), alpha=1)
ax.axhline(25, color='orange', linestyle='--')
ax.axhline(20, color='blue', linestyle='-.')
plt.title('PM10 and PM2.5 Data')
plt.ylabel('Observed value [µg/m3]')
plt.xlabel('')
plt.savefig('images/eda_pm_data_representation.png')
plt.show();
The data show yearly seasonality, with values increasing from October to April. PM10 and PM2.5 levels follow a similar pattern and are positively correlated. The highest values were recorded between 2009 and 2011.
# Mean of PM10, PM2.5 in 2008
# Use .loc for partial-string row selection (df['2008'] no longer works in pandas 2.x)
m2008 = df.loc['2008'].mean()
m2008
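# Mean daily measurements in 2008
# resample('D').mean() averages all hours of a day; asfreq('D') would keep only one reading per day
ax = df.loc['2008'].resample('D').mean().plot(grid=True, figsize=(20, 8))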
ax.set_ylim(0, 350)
ax.axhline(25, color='orange', linestyle='--')
ax.axhline(20, color='blue', linestyle='-.')
plt.title('Mean daily measurements in 2008')
plt.ylabel('Observed value [µg/m3]')
plt.xlabel('')
plt.savefig('images/eda_pm_mean_daily_2008.png')
plt.show();
# Mean PM10, PM2.5 in 2018
# Use .loc for partial-string row selection (df['2018'] no longer works in pandas 2.x)
m2018 = df.loc['2018'].mean()
m2018
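# Mean daily measurements in 2018
# resample('D').mean() averages all hours of a day; asfreq('D') would keep only one reading per day
ax = df.loc['2018'].resample('D').mean().plot(grid=True, figsize=(20, 8))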
ax.set_ylim(0, 350)
ax.axhline(25, color='orange', linestyle='--')
ax.axhline(20, color='blue', linestyle='-.')
plt.title('Mean daily measurements in 2018')
plt.ylabel('Observed value [µg/m3]')
plt.xlabel('')
plt.savefig('images/eda_pm_mean_daily_2018.png')
plt.show();
The measurements look similar in character in all years (the edge cases, 2008 and 2018, are checked above). In autumn and winter there are more pollutants in the air than during the rest of the year. With the vertical axes set to the same scale, we can clearly see that the situation has improved: the maximum values observed in 2018 are lower than in 2008, but still not at a satisfactory level.
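The comparison between years above is visual. A minimal sketch of how it could be put into numbers, assuming daily means are a reasonable basis for comparison. The `toy` frame is synthetic (a cosine-shaped season plus noise), and 25 µg/m3 is the PM2.5 reference level drawn in the plots:

```python
import numpy as np
import pandas as pd

# Synthetic hourly PM2.5 for two years: higher in winter, lower in summer
idx = pd.date_range('2017-01-01', '2018-12-31 23:00', freq='h')
rng = np.random.default_rng(1)
season = 30 + 20 * np.cos(2 * np.pi * idx.dayofyear.to_numpy() / 365.25)
toy = pd.DataFrame({'pm25': season + rng.normal(0, 5, len(idx))}, index=idx)

# Daily means, then one summary row per calendar year
daily = toy['pm25'].resample('D').mean()
summary = daily.groupby(daily.index.year).agg(['mean', 'max'])

# Number of days per year whose mean exceeded the reference level
summary['days_above_25'] = (daily > 25).groupby(daily.index.year).sum()
print(summary)
```

Applied to `df`, a table like this would show for every year, not only 2008 and 2018, whether the peaks and the number of high-pollution days are falling.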
# Is there a linear relationship between PM10 and PM2.5?
plt.figure(figsize=(10, 8))
plt.title("Correlation between PM10 and PM2.5")
plt.xlabel('PM10')
plt.ylabel('PM2.5')
plt.scatter(df['pm10'], df['pm25'])
# Save after plotting, otherwise an empty figure is written to disk
plt.savefig('images/eda_pm_corr.png');
There is indeed a strong, positive linear correlation between PM10 and PM2.5.
corr = df['pm10'].corr(df['pm25'])
print("PM10 and PM2.5 correlation coefficient: ", corr)
The Pearson's correlation coefficient is quite high (the maximum value is 1.0). This should not be surprising as PM2.5 is contained in PM10.
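For reference, Pearson's coefficient is the covariance of the two series divided by the product of their standard deviations. The sketch below computes it by hand on a few made-up value pairs (hypothetical numbers, chosen so that PM2.5 never exceeds PM10) and checks the result against `Series.corr`:

```python
import numpy as np
import pandas as pd

# Hypothetical paired measurements [µg/m3]; PM2.5 is a subset of PM10,
# so each pm25 value is below the matching pm10 value
toy = pd.DataFrame({'pm10': [20.0, 35.0, 50.0, 80.0, 120.0],
                    'pm25': [12.0, 22.0, 30.0, 55.0, 90.0]})

# Deviations from the mean
x = toy['pm10'] - toy['pm10'].mean()
y = toy['pm25'] - toy['pm25'].mean()

# r = sum(x*y) / sqrt(sum(x^2) * sum(y^2))
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

# pandas uses Pearson's method by default
r_pandas = toy['pm10'].corr(toy['pm25'])

print(f'manual: {r_manual:.4f}, pandas: {r_pandas:.4f}')
```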
What were the mean yearly particulate matter values observed in Krakow?
# Yearly mean
dfts = df.resample(rule='A').mean()
dfts.plot(grid=True, figsize=(10, 8), marker='o')
plt.title('Mean yearly particulate matter level observed values in Krakow')
plt.ylabel('Mean observed value [µg/m3]')
plt.xlabel('Years')
plt.savefig('images/eda_pm_mean_yearly.png')
plt.show();
# Another method
index_year = df.index.year
mean_by_year = df.groupby(index_year).mean()
mean_by_year
mean_by_year.plot(grid=True, figsize=(10, 8), marker='o')
plt.title('Mean yearly particulate matter level observed values in Krakow')
plt.ylabel('Mean observed value [µg/m3]')
plt.xlabel('Years')
plt.show();
The highest mean yearly particulate matter levels in Krakow were observed in 2010. In the subsequent years the observed levels were lower.
What were the characteristics of the maximum, minimum and median PM2.5 values between 2008 and 2018?
dfts = df['pm25'].resample(rule='D').median().to_frame()
rolling = dfts.pm25.rolling(360)
dfts['q10'] = rolling.quantile(0.1)
dfts['q50'] = rolling.quantile(0.5)
dfts['q90'] = rolling.quantile(0.9)
dfts.head()
dfts.plot(grid=True, figsize=(20, 8))
plt.title('Maximum, minimum and median of PM2.5 values')
plt.ylabel('Observed value [µg/m3]')
plt.xlabel('Years')
plt.savefig('images/eda_pm_min_max_median_yearly.png')
plt.show();
Using rolling quantiles (with a window of 360 days) at 10%, 50% (median) and 90%, we can visualize the trends of low, median and high observed levels of PM2.5. These quantiles serve as proxies for the minimum and maximum that are less sensitive to single extreme readings. We can see that the trend line of the high fine-particle levels goes down, whereas the low and median values stay at roughly the same level.
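Two properties of this technique are worth keeping in mind: a fixed-size window of 360 rows produces no value until 360 observations are available, and the 10% and 90% quantiles always lie inside the true rolling minimum and maximum. A minimal sketch on a synthetic daily series (gamma-distributed values, an assumption made only for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic daily PM2.5-like series covering three years (not the real data)
idx = pd.date_range('2015-01-01', periods=3 * 365, freq='D')
rng = np.random.default_rng(2)
toy = pd.Series(rng.gamma(shape=2.0, scale=15.0, size=len(idx)),
                index=idx, name='pm25')

window = 360
rolling = toy.rolling(window)

bands = pd.DataFrame({'min': rolling.min(),
                      'q10': rolling.quantile(0.1),
                      'q50': rolling.quantile(0.5),
                      'q90': rolling.quantile(0.9),
                      'max': rolling.max()})

# The first window - 1 rows are NaN because the window is not yet full
n_warmup = int(bands['q50'].isna().sum())

# Rows where all statistics are defined
valid = bands.dropna()
print(n_warmup, 'warm-up rows;', len(valid), 'rows with values')
```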
Which of the winter months (October 2017 to April 2018) were the best and the worst in terms of observed particulate matter levels?
dfts = df['2017-10':'2018-4'].resample(rule='M').mean()
dfts.plot(grid=True, figsize=(10, 8), marker='o')
plt.title('Mean monthly PM10 and PM2.5 levels')
plt.ylabel('Mean monthly observed value [µg/m3]')
plt.xlabel('')
plt.savefig('images/eda_pm_mean_monthly_winter.png')
plt.show();
What is the mean hourly distribution of particulate matter?
index_hour = df.index.hour
mean_by_hour = df.groupby(index_hour).mean()
mean_by_hour
mean_by_hour.plot(grid=True, figsize=(10, 6), marker='o')
plt.xticks(mean_by_hour.index)
plt.title('Hourly mean particulate matter distribution')
plt.ylabel('Mean observed value [µg/m3]')
plt.xlabel('')
plt.savefig('images/eda_pm_mean_hourly_winter.png')
plt.show();
The lowest PM levels were observed around 2-3pm, the highest during the night (6pm-8am).
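The cleanest and the most polluted hours can be read from the plot, or extracted with `idxmin` and `idxmax` on the grouped means. A minimal sketch with a made-up daily cycle (a cosine that peaks at 2am and bottoms out at 2pm; the shape is an assumption for illustration only):

```python
import numpy as np
import pandas as pd

# One week of synthetic hourly values with a fixed day/night cycle
idx = pd.date_range('2018-01-01', periods=7 * 24, freq='h')
hours = idx.hour.to_numpy()
toy = pd.Series(40 + 10 * np.cos(2 * np.pi * (hours - 2) / 24),
                index=idx, name='pm25')

# Average over all days for each hour of the day (0-23)
mean_by_hour = toy.groupby(toy.index.hour).mean()

cleanest_hour = int(mean_by_hour.idxmin())
dirtiest_hour = int(mean_by_hour.idxmax())
print('cleanest hour:', cleanest_hour, 'most polluted hour:', dirtiest_hour)
```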
What are the average particulate matter levels on each day of the week?
# Monday = 0
index_dayofweek = df.index.dayofweek
mean_by_dayofweek = df.groupby(index_dayofweek).mean()
mean_by_dayofweek
mean_by_dayofweek.plot(grid=True, figsize=(10, 6), marker='o')
plt.title('Average particulate matter levels of PM10 and PM2.5 by day of week in 2008-2018')
plt.ylabel('Mean observed value [µg/m3]')
plt.xlabel('')
plt.savefig('images/eda_pm_mean_weekdays.png')
plt.show();
The average particulate matter level is highest on Tuesdays (1) and then decreases towards Sundays (6).
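The integer day codes (Monday = 0) are easy to misread on a plot axis. One possible way to make the result self-explanatory is to relabel the grouped index with day names. The values below are hypothetical, and the `day_names` list is an illustrative helper, not part of the original notebook:

```python
import pandas as pd

# Two weeks of hypothetical daily values; 2018-01-01 was a Monday
idx = pd.date_range('2018-01-01', periods=14, freq='D')
toy = pd.Series([50, 55, 52, 51, 49, 45, 40] * 2,
                index=idx, name='pm25', dtype=float)

# Group by day of week: Monday = 0 ... Sunday = 6
mean_by_dayofweek = toy.groupby(toy.index.dayofweek).mean()

# Replace the integer codes with readable labels
day_names = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
mean_by_dayofweek.index = [day_names[i] for i in mean_by_dayofweek.index]

print(mean_by_dayofweek)
print('highest:', mean_by_dayofweek.idxmax(), 'lowest:', mean_by_dayofweek.idxmin())
```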
Which months are the cleanest on average?
index_month = df.index.month
mean_by_month = df.groupby(index_month).mean()
mean_by_month
mean_by_month.plot(grid=True, figsize=(10, 6), marker='o')
plt.xticks(mean_by_month.index)
plt.title('Monthly average particulate matter levels of PM10 and PM2.5 in 2008-2018')
plt.ylabel('Mean observed value [µg/m3]')
plt.xlabel('')
plt.savefig('images/eda_pm_mean_months.png')
plt.show();
The cleanest air in Krakow is observed from May to September.
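Averaging each month across all years can hide year-to-year differences. A year-by-month table is one way to check whether the same months are the cleanest in every year. The sketch below builds such a table from synthetic daily data (a cosine season with its low point in early July plus noise, an assumption for illustration only):

```python
import numpy as np
import pandas as pd

# Three years of synthetic daily PM2.5 with a summer minimum (not the real data)
idx = pd.date_range('2016-01-01', '2018-12-31', freq='D')
doy = idx.dayofyear.to_numpy()
rng = np.random.default_rng(3)
toy = pd.Series(35 + 20 * np.cos(2 * np.pi * doy / 365.25)
                + rng.normal(0, 3, len(idx)),
                index=idx, name='pm25')

# Mean per (year, month), reshaped so that rows are years and columns are months
table = toy.groupby([toy.index.year, toy.index.month]).mean().unstack()
table.index.name, table.columns.name = 'year', 'month'

# Month with the lowest mean in each year
cleanest_month_per_year = table.idxmin(axis=1)
print(table.round(1))
print(cleanest_month_per_year)
```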
What did the fine particle (PM2.5) levels look like in different years?
years = range(2008, 2019)
df_years = pd.DataFrame()
for year in years:
    # One column per year, re-indexed from 0 so that the years line up
    df_year = df.loc[str(year), ['pm25']].reset_index(drop=True)
    df_year.rename(columns={'pm25': year}, inplace=True)
    df_years = pd.concat([df_years, df_year], axis=1)
df_years.head()
plots = df_years.plot(subplots=True, grid=True, figsize=(20, 20))
for plot in plots:
    plot.set_ylim(0, 500)
    plot.xaxis.set_major_locator(plt.LinearLocator(13))
    plot.set_xticklabels(range(1, 13), rotation=0)
plt.savefig('images/eda_pm_mean_years.png')
plt.show();
Is there any seasonality in PM2.5 data?
dfts = df.pm25.copy().to_frame()
dfts['30D'] = dfts.pm25.rolling(window='30D').mean()
dfts['90D'] = dfts.pm25.rolling(window='90D').mean()
dfts.head()
dfts.plot(grid=True, figsize=(20, 16))
plt.title('PM2.5 Seasonality')
plt.ylabel('Mean observed value [µg/m3]')
plt.xlabel('')
plt.savefig('images/eda_pm_mean_seasonality.png')
plt.show();
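The rolling means above make the seasonal pattern visible. A complementary numeric check is the autocorrelation at seasonal lags: for a yearly cycle it should be clearly positive at a lag of one year and clearly negative at half a year. A minimal sketch on a synthetic daily series (cosine season plus noise, assumed for illustration):

```python
import numpy as np
import pandas as pd

# Four years of synthetic daily PM2.5 with a 365-day cycle (not the real data)
idx = pd.date_range('2014-01-01', periods=4 * 365, freq='D')
rng = np.random.default_rng(4)
t = np.arange(len(idx))
toy = pd.Series(35 + 20 * np.cos(2 * np.pi * t / 365)
                + rng.normal(0, 5, len(idx)),
                index=idx, name='pm25')

# Correlation of the series with itself shifted by half a year and by a full year
acf_half_year = toy.autocorr(lag=182)
acf_full_year = toy.autocorr(lag=365)

print(f'lag 182 days: {acf_half_year:.2f}, lag 365 days: {acf_full_year:.2f}')
```

For the hourly `df`, the series would first need to be resampled to daily means (or the lags multiplied by 24) for the lags to correspond to the same periods.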