🎓 Predicting Air Particulate Matter at Scale ⛅️
🛠️ 1. Data Understanding & Data Quality Profiling
🎓 This notebook demonstrates various DataFrame methods that can be used to analyse and cleanse a dataset, making it ready for reuse in the next CRISP-DM for Data Science steps (Time Series, Machine Learning, Deep Learning).
Workflow steps:
- Import the required teradataml modules.
- Connect to a Vantage system.
- 📈 Data Understanding & Visualization.
- ⚙️ Data Preprocessing.
- Cleanup.
🎯 Libraries and Reusable Functions
🎓 This section executes all of the cells in `Data_Loading_and_Descriptive_Statistics.ipynb`.
In [1]:
import logging

## TODO: read a .env file to determine the environment and output format programmatically
## Flags describing the runtime environment; these affect how visualizations are rendered or interacted with.
IS_JUPYTERLAB = True ## True if running in JupyterLab; False if running as a plain Python .py script
IS_TERADATA_VANTAGE = False ## True if executing on a Teradata Vantage system; False if on a laptop/virtual machine
IS_DATA_IN_TERADATA_VANTAGE = False ## True if the data lives in Teradata Vantage; False if loaded from *.csv/*.xls files
IS_DEBUG = True ## Whether to plot and display additional diagnostic information

# if not IS_DEBUG:
## Set the logging level to WARNING to suppress info messages --> turn off Prophet/cmdstanpy logs
logging.getLogger('cmdstanpy').setLevel(logging.WARNING)
# logging.getLogger('cmdstanpy').setLevel(logging.CRITICAL)
%run -i ./Data_Loading_and_Descriptive_Statistics.ipynb
data/raw/Takapuna23-07May2020-to-30Apr2022.csv data/raw/Penrose7-07May2020-to-30Apr2022.csv
ℹ️ Load Data from *.csv files --> rawdata
ℹ️ The Shape of the Dataframe: (37382, 15)
ℹ️ The Shape of the Dataframe rawdata_site1 (Penrose) and rawdata_site2 (Takapuna): (20023, 12) (17359, 11)
🎓 Describing the types of each attribute as numerical_columns (Continuous), ordinal_columns (Ordinal), or nominal_columns (Nominal) ...
ℹ️ Numerical Variables/Features: ['AQI', 'PM10', 'PM2.5', 'SO2', 'NO', 'NO2', 'NOx', 'Wind_Speed', 'Wind_Dir', 'Air_Temp', 'Rel_Humidity']
ℹ️ Ordinal Variables/Features: ['Timestamp']
ℹ️ Nominal Variables/Features: Index(['Site', 'Site_Class', 'Country'], dtype='object')
🎓 1. [Site 1 - Penrose][numerical_columns_S1, nominal_columns_S1] Summary Statistics of the Dataframe such as the mean, maximum and minimum values ...
ℹ️ Numerical Variables/Features: ['AQI', 'PM10', 'PM2.5', 'SO2', 'NO', 'NO2', 'NOx', 'Wind_Speed', 'Wind_Dir', 'Air_Temp', 'Rel_Humidity']
ℹ️ Ordinal Variables/Features: ['Timestamp']
🎓 2. [Site 2 - Takapuna][numerical_columns_S2, nominal_columns_S2] Summary Statistics of the Dataframe such as the mean, maximum and minimum values ...
ℹ️ Numerical Variables/Features: ['AQI', 'PM10', 'PM2.5', 'NO', 'NO2', 'NOx', 'Wind_Speed', 'Wind_Dir', 'Air_Temp', 'Rel_Humidity']
ℹ️ Ordinal Variables/Features: ['Timestamp']
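The TODO in the cell above suggests deriving the environment flags programmatically instead of hard-coding them. A minimal sketch of how `IS_JUPYTERLAB` could be detected (the helper name `running_in_jupyter` is illustrative, not part of the notebook):

```python
def running_in_jupyter() -> bool:
    """Best-effort check: True when executing inside an IPython/Jupyter kernel,
    False when running as a plain Python .py script."""
    try:
        from IPython import get_ipython  # only meaningful inside IPython sessions
        return get_ipython() is not None
    except ImportError:
        return False  # IPython not installed --> definitely a plain script

IS_JUPYTERLAB = running_in_jupyter()
```

This keeps the flag accurate when the notebook is exported and re-run as a script.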
In [2]:
print("\n🎓 [Site1 - Penrose] Summary Statistics of the rawdata_site1 DataFrame, such as the mean, maximum and minimum values ...")
rawdata_site1.describe()
🎓 [Site1 - Penrose] Summary Statistics of the rawdata_site1 DataFrame, such as the mean, maximum and minimum values ...
Out[2]:
| | Timestamp | AQI | PM10 | PM2.5 | SO2 | NO | NO2 | NOx | Wind_Speed | Wind_Dir | Air_Temp | Rel_Humidity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 20023 | 17019 | 16818 | 16102 | 16623 | 16260 | 16260 | 16259 | 17182 | 17181 | 17206 | 17338 |
| mean | 2021-05-06 21:58:02 | 28.23 | 13.88 | 5.35 | 0.85 | 10.04 | 14.12 | 24.15 | 2.81 | 173.23 | 16.75 | 69.43 |
| min | 2020-05-07 00:00:00 | 0.00 | -6.40 | -12.30 | -4.40 | -1.90 | -1.40 | -1.30 | 0.20 | 1.00 | 3.00 | 26.90 |
| 25% | 2020-11-07 02:30:00 | 22.00 | 8.40 | 2.10 | 0.10 | 0.80 | 3.90 | 5.10 | 1.60 | 73.00 | 14.00 | 60.20 |
| 50% | 2021-05-08 19:00:00 | 27.00 | 13.10 | 4.70 | 0.50 | 3.60 | 10.60 | 14.90 | 2.60 | 211.00 | 17.00 | 70.40 |
| 75% | 2021-11-04 15:30:00 | 33.00 | 18.40 | 7.70 | 1.00 | 10.90 | 20.70 | 31.65 | 3.80 | 235.00 | 20.00 | 79.50 |
| max | 2022-04-30 23:00:00 | 121.00 | 196.70 | 138.40 | 25.20 | 378.40 | 83.10 | 455.10 | 8.20 | 357.00 | 28.00 | 92.00 |
| std | NaN | 9.69 | 8.04 | 5.26 | 1.51 | 21.07 | 12.75 | 30.96 | 1.54 | 94.04 | 4.01 | 12.29 |
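Several minima above are negative (e.g. PM10 at -6.40 µg/m³ and PM2.5 at -12.30 µg/m³), which is physically impossible and typically instrument noise near the detection limit. Before choosing a cleansing rule, it helps to quantify how widespread these readings are. A sketch (the helper name `count_negatives` is illustrative; in the notebook it would be applied to `rawdata_site1`):

```python
import pandas as pd

def count_negatives(df: pd.DataFrame) -> pd.Series:
    """Count negative readings per numeric column, worst offenders first."""
    numeric = df.select_dtypes("number")
    return (numeric < 0).sum().sort_values(ascending=False)

## Tiny stand-in frame mimicking the raw sensor data:
demo = pd.DataFrame({"PM10": [-6.4, 8.4, 13.1], "SO2": [0.1, -4.4, 0.5]})
print(count_negatives(demo))
```

Columns with many negatives may warrant clipping at zero or masking as missing, depending on the downstream model.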
In [3]:
print("\n🎓 [Site2 - Takapuna] Summary Statistics of the rawdata_site2 DataFrame, such as the mean, maximum and minimum values ...")
rawdata_site2.describe()
🎓 [Site2 - Takapuna] Summary Statistics of the rawdata_site2 DataFrame, such as the mean, maximum and minimum values ...
Out[3]:
| | Timestamp | AQI | PM10 | PM2.5 | NO | NO2 | NOx | Wind_Speed | Wind_Dir | Air_Temp | Rel_Humidity |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 17359 | 17359 | 16579 | 15807 | 16187 | 16185 | 16187 | 16773 | 16771 | 16785 | 16921 |
| mean | 2021-05-04 08:00:00 | 25.15 | 11.59 | 6.25 | 12.01 | 0.0124 | 24.20 | 2.50 | 176.34 | 16.28 | 69.91 |
| min | 2020-05-07 17:00:00 | -13.00 | -3.80 | 0.45 | -0.80 | -0.0005 | -0.70 | 0.20 | 5.50 | 3.50 | 30.40 |
| 25% | 2020-11-04 12:30:00 | 19.00 | 7.40 | 4.25 | 2.45 | 0.0040 | 7.05 | 1.25 | 75.50 | 13.50 | 61.25 |
| 50% | 2021-05-04 08:00:00 | 24.00 | 10.65 | 5.55 | 5.85 | 0.0091 | 15.10 | 2.30 | 205.00 | 16.00 | 71.05 |
| 75% | 2021-11-01 03:30:00 | 30.00 | 14.50 | 7.35 | 12.20 | 0.0177 | 29.40 | 3.50 | 261.00 | 19.50 | 79.85 |
| max | 2022-04-30 23:00:00 | 318.00 | 437.25 | 52.30 | 352.45 | 0.0721 | 401.95 | 7.80 | 346.50 | 27.00 | 90.35 |
| std | NaN | 10.64 | 7.91 | 3.51 | 22.59 | 0.0110 | 30.66 | 1.47 | 93.20 | 3.96 | 11.80 |
Python Reusable Functions
🎓 This section examines the key statistics and missing-data percentages from the reports to gain a better understanding of the data characteristics and time-series features.
📈 Data Understanding & Visualization
🎓 This section identifies and visualises missing data values.
Identify and Visualise Missing Data Values
In [4]:
import missingno as msno
import matplotlib.dates as mdates
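The missing-data percentages referenced above can also be computed directly in pandas, without any plotting. A sketch (the helper name `missing_percent` is illustrative; in the notebook it would be applied to `rawdata_site1` and `rawdata_site2`):

```python
import pandas as pd

def missing_percent(df: pd.DataFrame) -> pd.Series:
    """Percentage of missing values per column, sorted worst-first."""
    return (df.isna().mean() * 100).round(2).sort_values(ascending=False)

## Tiny stand-in frame: PM10 is half missing, AQI is complete
demo = pd.DataFrame({"PM10": [1.0, None, 3.0, None], "AQI": [10, 20, 30, 40]})
print(missing_percent(demo))
```

This gives the exact numbers behind the `missingno` bar chart drawn below.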
In [5]:
def display_side_by_side_missing_data_visualizations(rawdata, site1, site2, visualization_type='matrix', feature_column='PM10'):
    """
    Displays side-by-side missing-data visualizations for two specified sites using the missingno library,
    adjusting parameters based on the visualization type.
    * 'matrix': Each matrix cell is coloured based on whether the data exists (dark/green) or not (white).
        The matrix plot is a great tool when working with depth-related or time-series data.
    * 'bar': Displays the number of non-missing values for each column as a bar chart.
        It is similar to the matrix plot but simpler to interpret.
    * 'heatmap': Displays a heat map of the nullity correlation between variables. What differentiates it
        from a regular heat map is that it measures nullity correlation: an indicator of how strongly the
        presence or absence of one variable affects the presence of another.
        Values close to +1 indicate that null values in one column are correlated with null values in another column.
        Values close to 0 indicate little to no relationship between the presence of null values in the two columns.
    * 'dendrogram': Shows how similar variables are to one another in terms of missing values.
        It correlates variable completion more fully, revealing trends deeper than the pairwise ones
        visible in the correlation heat map. A hierarchical clustering algorithm calculates the similarity
        between variables: the leaves represent variables and the branches connect them to one another.

    Parameters:
        rawdata (DataFrame): The complete dataset containing multiple sites.
        site1 (str): The first site to display.
        site2 (str): The second site to display.
        visualization_type (str): Type of visualization. Options: 'matrix', 'bar', 'heatmap', 'dendrogram'.
    """
    ## Filter rawdata for the specified sites (the preloaded per-site frames are used directly)
    # data_site1 = rawdata[rawdata['Site'] == site1]
    # data_site2 = rawdata[rawdata['Site'] == site2]
    data_site1 = rawdata_site1.copy()
    data_site2 = rawdata_site2.copy()

    ## Set up the figure to hold two subplots side by side
    fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(20, 8), sharey=True)
    fig.suptitle(f'[{visualization_type.capitalize()}] Missing Values Plot', fontsize=18)

    ## Generate the visualization for each site with the correct parameters
    if visualization_type == 'matrix':
        msno.matrix(data_site1, ax=ax[0], sparkline=False, color=(0.24, 0.77, 0.77))
        msno.matrix(data_site2, ax=ax[1], sparkline=False, color=(0.24, 0.77, 0.77))
    elif visualization_type == 'bar':
        msno.bar(data_site1, ax=ax[0], color=(0.24, 0.77, 0.77))
        msno.bar(data_site2, ax=ax[1], color=(0.24, 0.77, 0.77))
    elif visualization_type == 'heatmap':
        if data_site1.isnull().sum().sum() > 0:  ## Ensure there is at least one missing value
            msno.heatmap(data_site1, ax=ax[0])
        if data_site2.isnull().sum().sum() > 0:
            msno.heatmap(data_site2, ax=ax[1])
    elif visualization_type == 'dendrogram':
        msno.dendrogram(data_site1, ax=ax[0], orientation='right')
        msno.dendrogram(data_site2, ax=ax[1], orientation='right')
    else:
        ## Ensure the visualization type is valid
        raise ValueError("Invalid visualization_type. Choose from 'matrix', 'bar', 'heatmap', 'dendrogram'.")

    ## Adjust the layout and display the plot
    plt.tight_layout()
    plt.show()
In [6]:
# Display side by side the missing data matrices for two specific sites, "Penrose" and "Takapuna"
display_side_by_side_missing_data_visualizations(rawdata, 'Penrose', 'Takapuna', visualization_type='matrix')
print("The columns CO and Solar_Rad show large portions of missing data. This was identified in the bar plot, but the added benefit is you can view how that missing data is distributed in the dataframe.")
display_side_by_side_missing_data_visualizations(rawdata, 'Penrose', 'Takapuna', visualization_type='bar')
# display_side_by_side_missing_data_visualizations(rawdata, 'Takapuna', 'Takapuna', visualization_type='heatmap')
display_side_by_side_missing_data_visualizations(rawdata, 'Penrose', 'Takapuna', visualization_type='dendrogram')
The columns CO and Solar_Rad show large portions of missing data. This was identified in the bar plot, but the added benefit is you can view how that missing data is distributed in the dataframe.
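The nullity correlation that `msno.heatmap` visualises can also be inspected numerically: it is simply the Pearson correlation between the per-column missingness indicators. A minimal sketch using plain pandas (the helper name `nullity_correlation` and the demo frame are illustrative):

```python
import pandas as pd

def nullity_correlation(df: pd.DataFrame) -> pd.DataFrame:
    """Pearson correlation between the boolean is-missing indicators of each column."""
    return df.isna().corr()

## Tiny stand-in frame: NO and NOx are missing on exactly the same rows
demo = pd.DataFrame({
    "NO":  [1.0, None, 3.0, None],
    "NOx": [2.0, None, 6.0, None],
    "AQI": [10.0, 20.0, None, 40.0],
})
corr = nullity_correlation(demo)
print(corr.loc["NO", "NOx"])  ## 1.0: the two columns are perfectly co-missing
```

A value near +1 (as for NO/NOx here) suggests the gaps share a common cause, such as a shared analyser going offline, which matters when choosing an imputation strategy.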
Time-Series Data Visualizations
In [7]:
def display_side_by_side_timeseries_data_visualizations(rawdata, site1, site2, visualization_type='plot', feature_column='PM10'):
    """
    Displays side-by-side time-series visualizations for two specified sites, adjusting parameters based on the visualization type.
    * 'plot': Plots all numerical columns for both sites.
    * 'plot_feature_column': Plots a single feature column for both sites.
    * 'plot_feature_column_1': Plots a single feature column for the first site only.

    Parameters:
        rawdata (DataFrame): The complete dataset containing multiple sites.
        site1 (str): The first site to display.
        site2 (str): The second site to display.
        visualization_type (str): Type of visualization. Options: 'plot', 'plot_feature_column', 'plot_feature_column_1'.
        feature_column (str): The feature column to plot for the single-feature visualizations.
    """
    ## Copy the rawdata for safety so the original data is not modified
    series = rawdata.copy()
    # ## Set 'Site' and 'Timestamp' as a multi-level index
    # series.set_index(['Site', 'Timestamp'], inplace=True)
    # ## TODO: log-scale ?
    # # data_site1 = rawdata.loc[(site1, slice(None)), numerical_columns]
    # data_site1 = series.xs(site1, level='Site')[numerical_columns]
    # data_site2 = series.xs(site2, level='Site')[numerical_columns]

    ## Set up the figure to hold two subplots side by side
    fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(20, 8), sharey=True)
    # major_locator = mdates.YearLocator()
    # major_formatter = mdates.DateFormatter('%Y')
    # minor_locator = mdates.MonthLocator(bymonth=(1, 4, 7, 10))  ## Ticks for every quarter
    # minor_formatter = mdates.DateFormatter('%m')  ## [DEBUG] '%b'

    ## Generate the visualization for each site with the correct parameters
    if visualization_type == 'plot':
        ## Only include the numerical columns for each site
        data_site1 = rawdata_site1.copy()[numerical_columns_S1]
        data_site2 = rawdata_site2.copy()[numerical_columns_S2]
        data_site1.plot(ax=ax[0], title=f'[{visualization_type.capitalize()}] Data Plot for {site1}', fontsize=18)
        data_site2.plot(ax=ax[1], title=f'[{visualization_type.capitalize()}] Data Plot for {site2}', fontsize=18)
    elif visualization_type == 'plot_feature_column':
        ## Only plot the requested feature column for both sites
        # data_site1 = series.xs(site1, level='Site')[feature_column]
        # data_site2 = series.xs(site2, level='Site')[feature_column]
        data_site1 = rawdata_site1.copy()[feature_column]
        data_site2 = rawdata_site2.copy()[feature_column]
        data_site1.plot(ax=ax[0], title=f'[{visualization_type.capitalize()}] Data Plot of {feature_column} at {site1}', fontsize=18, ylabel=f'{feature_column}', marker='o', linestyle='-', markersize=1)
        data_site2.plot(ax=ax[1], title=f'[{visualization_type.capitalize()}] Data Plot of {feature_column} at {site2}', fontsize=18, ylabel=f'{feature_column}', marker='o', linestyle='-', markersize=1)
    elif visualization_type == 'plot_feature_column_1':
        ## Only plot the requested feature column for the first site
        data_site1 = rawdata_site1.copy()[feature_column]
        data_site1.plot(ax=ax[0], title=f'[{visualization_type.capitalize()}] Data Plot of {feature_column} at {site1}', fontsize=18, ylabel=f'{feature_column}', marker='o', linestyle='-', markersize=1)
    else:
        ## Ensure the visualization type is valid
        raise ValueError("Invalid visualization_type. Choose from 'plot', 'plot_feature_column', 'plot_feature_column_1'.")

    ## Adjust the layout and display the plot
    plt.tight_layout()
    plt.show()
In [8]:
display_side_by_side_timeseries_data_visualizations(rawdata, 'Penrose', 'Takapuna', visualization_type='plot')
display_side_by_side_timeseries_data_visualizations(rawdata, 'Penrose', 'Takapuna', visualization_type='plot_feature_column')
display_side_by_side_timeseries_data_visualizations(rawdata, 'Penrose', 'Takapuna', visualization_type='plot_feature_column', feature_column='PM2.5')
display_side_by_side_timeseries_data_visualizations(rawdata, 'Penrose', 'Takapuna', visualization_type='plot_feature_column', feature_column='AQI')
display_side_by_side_timeseries_data_visualizations(rawdata, 'Penrose', 'Takapuna', visualization_type='plot_feature_column_1', feature_column='SO2')
display_side_by_side_timeseries_data_visualizations(rawdata, 'Penrose', 'Takapuna', visualization_type='plot_feature_column', feature_column='NO')
display_side_by_side_timeseries_data_visualizations(rawdata, 'Penrose', 'Takapuna', visualization_type='plot_feature_column', feature_column='NO2')
display_side_by_side_timeseries_data_visualizations(rawdata, 'Penrose', 'Takapuna', visualization_type='plot_feature_column', feature_column='NOx')
display_side_by_side_timeseries_data_visualizations(rawdata, 'Penrose', 'Takapuna', visualization_type='plot_feature_column', feature_column='Wind_Speed')
display_side_by_side_timeseries_data_visualizations(rawdata, 'Penrose', 'Takapuna', visualization_type='plot_feature_column', feature_column='Air_Temp')
display_side_by_side_timeseries_data_visualizations(rawdata, 'Penrose', 'Takapuna', visualization_type='plot_feature_column', feature_column='Rel_Humidity')
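The hourly traces plotted above are noisy; resampling to daily means often makes seasonal structure easier to see. A sketch of how the per-site frames could be downsampled before plotting (the helper name `daily_means` is illustrative; the demo frame stands in for `rawdata_site1`):

```python
import pandas as pd

def daily_means(df: pd.DataFrame, timestamp_col: str = "Timestamp") -> pd.DataFrame:
    """Resample hourly observations to daily means, keyed on the timestamp column."""
    return (df.set_index(timestamp_col)
              .select_dtypes("number")   # resample only the numeric sensor columns
              .resample("D")
              .mean())

## Tiny stand-in frame: two full days of hourly PM10 readings
demo = pd.DataFrame({
    "Timestamp": pd.date_range("2020-05-07", periods=48, freq="h"),
    "PM10": [10.0] * 24 + [20.0] * 24,
})
dm = daily_means(demo)
print(dm)  ## one row per day: 10.0 for 2020-05-07, 20.0 for 2020-05-08
```

The resampled frame can then be passed to the same `.plot(...)` calls used in the function above.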
Data Quality Profiling
In [20]:
# %%capture
# !pip install llvmlite --ignore installed
# !pip install -U ydata-profiling
In [22]:
from ydata_profiling import ProfileReport

## Note: Data profiling takes more than a minute for each site !!!
## Declare which variables are time series: common to both sites, except those unique to one
type_schema = {
    # "Timestamp": "ordinal", ## Explicitly marking the Timestamp as ordinal due to its role in indexing the data over time
    # "Timestamp": "timeseries",
    "AQI": "timeseries",
    "PM10": "timeseries",
    "PM2.5": "timeseries",
    # "SO2": "timeseries",
    "NO": "timeseries",
    "NO2": "timeseries",
    "NOx": "timeseries",
    # "CO": "timeseries",
    "Wind_Speed": "timeseries",
    "Wind_Dir": "timeseries",
    "Air_Temp": "timeseries",
    "Rel_Humidity": "timeseries",
    "PM2.5_Lag1": "timeseries", ## Include lagged variables, as they also represent time-series data
    "PM2.5_Lag2": "timeseries",
    "PM10_Lag1": "timeseries",
    "PM10_Lag2": "timeseries"
}
## Variables unique to Penrose and Takapuna
unique_penrose = {"SO2": "timeseries"} ## 'SO2' is measured only at Penrose
# unique_takapuna = {"CO": "timeseries"} ## 'CO' is measured only at Takapuna
## Combine the common schema with the unique variables for each site to create site-specific schemas
type_schema_penrose = {**type_schema, **unique_penrose}
# type_schema_takapuna = {**type_schema, **unique_takapuna}
type_schema_takapuna = {**type_schema}

## Generate the Data Quality Profiling report per monitoring site
# for site_name, site_data in rawdata.groupby(level='Site'):
for site_name, site_data in rawdata.groupby("Site"):
    ## Preprocess site_data if necessary HERE ...
    # site_data = site_data.reset_index(level='Site', drop=True) ## Drop the 'Site' level for profiling
    ## Skip a site if its data is empty or entirely NaN; otherwise proceed with profiling
    if site_data.dropna(how='all').empty:
        print(f"Skipping profiling for {site_name} due to insufficient data.")
        continue
    ## Apply the site-specific schema
    type_schema = type_schema_penrose if site_name == 'Penrose' else type_schema_takapuna
    ## Run one profile per station
    profile = ProfileReport(
        site_data,
        title=f"Air Quality Profiling - Site: {site_name}",
        explorative=True,
        # vars={"cat": {"check_completeness": False}}, ## Optional: adjust based on profiling needs
        tsmode=True,
        type_schema=type_schema,
        sortby='Timestamp', ## Ensure the profile respects the temporal ordering
        correlations={"auto": {"calculate": False}},
        missing_diagrams={"Heatmap": False},
        dataset={
            "description": "Air Quality Data Profiling for the Research Project.",
            "author": "Nhat Thanh, Nguyen. 🌐 Linkedin: https://www.linkedin.com/in/nnthanh",
            "copyright_holder": "🎓 AUT 🛠️ Teradata's ClearScape Analytics™ ⚡",
            "copyright_year": 2024,
            "url": "https://analytics-experience.pages.dev",
        },
        variables={
            "descriptions": {
                "PM10": "Particulate Matter (PM10) Hourly Aggregate: µg/m³ (Micrograms per cubic metre).",
                "PM2.5": "Particulate Matter (PM2.5) Hourly Aggregate: µg/m³ (Micrograms per cubic metre).",
                "SO2": "Sulfur Dioxide (SO2) Hourly Aggregate: µg/m³ (Micrograms per cubic metre).",
                "NO": "Nitric Oxide (NO) Hourly Aggregate: µg/m³ (Micrograms per cubic metre).",
                "NO2": "Nitrogen Dioxide (NO2) Hourly Aggregate: µg/m³ (Micrograms per cubic metre).",
                "NOx": "Nitrogen Oxides (NOx) Hourly Aggregate: µg/m³ (Micrograms per cubic metre).",
                "CO": "Carbon Monoxide (CO) Hourly Aggregate: µg/m³ (Micrograms per cubic metre).",
                "AQI": "Air Quality Index (AQI).",
                "Wind_Speed": "Wind Speed Hourly Aggregate: m/s (Metres per second).",
                "Wind_Dir": "Wind Direction Hourly Aggregate: ° (Degrees).",
                "Air_Temp": "Air Temperature Hourly Aggregate: °C (Celsius).",
                "Rel_Humidity": "Relative Humidity Hourly Aggregate: % (Percent).",
                "Solar_Rad": "Solar Radiation Hourly Aggregate: W/m² (Watts per square metre).",
            }
        },
    )
    ## Save each profile report to an HTML file
    profile.to_file(f"data/report/Data_Profiling_for_{site_name}.html")

## ONLY display 1 site (the last one profiled) to reduce the size of the .html file (< 25 MiB)
## Display the profile report in the Jupyter notebook
print(f"Displaying profile for {site_name}:")
profile.to_notebook_iframe()
Summarize dataset: 0%| | 0/5 [00:00<?, ?it/s]
Generate report structure: 0%| | 0/1 [00:00<?, ?it/s]
Render HTML: 0%| | 0/1 [00:00<?, ?it/s]
Export report to file: 0%| | 0/1 [00:00<?, ?it/s]
Summarize dataset: 0%| | 0/5 [00:00<?, ?it/s]
Generate report structure: 0%| | 0/1 [00:00<?, ?it/s]
Render HTML: 0%| | 0/1 [00:00<?, ?it/s]
Export report to file: 0%| | 0/1 [00:00<?, ?it/s]
Displaying profile for Takapuna: