Teradata

🎓 Predicting Air Particulate Matter at Scale ⛅️
🛠️ 1. Data Understanding & Data Quality Profiling

🦅 Overview

🎓 This notebook demonstrates DataFrame methods for analysing and cleansing the dataset so that it is ready for reuse in the next steps (Time Series, Machine Learning, Deep Learning) of the CRISP-DM process for Data Science.

  • Workflow steps:

    1. Import the required teradataml modules and connect to a Vantage system.
    2. 📈 Data Understanding & Visualization
    3. ⚙️ Data Preprocessing
    4. Cleanup

🎯 Libraries and Reusable Functions

🎓 This section executes all of the cells in `Data_Loading_and_Descriptive_Statistics.ipynb`.
In [1]:
import logging

## TODO: .env --> determines the environment for output format programmatically
## Check for the JupyterLab environment, which might affect how visualizations are rendered or interacted with.
IS_JUPYTERLAB               = True  ## True if JupyterLab; False if Python .py
IS_TERADATA_VANTAGE         = False ## True if running on a Teradata Vantage system; False if Laptop/Virtual-Machine
IS_DATA_IN_TERADATA_VANTAGE = False ## True if Data in Teradata Vantage; False if Data from *.csv/*.xls files
IS_DEBUG                    = True  ## Whether to plot and display additional debug information
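## [Hedged sketch, not original code] The .env TODO above could be resolved with
## python-dotenv (an assumed dependency), reading the flags instead of hard-coding them:
# from dotenv import load_dotenv
# import os
# load_dotenv()  ## loads key=value pairs from a local .env file, if present
# IS_JUPYTERLAB       = os.getenv("IS_JUPYTERLAB", "True") == "True"
# IS_TERADATA_VANTAGE = os.getenv("IS_TERADATA_VANTAGE", "False") == "True"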

# if not IS_DEBUG:
## Set logging level to WARNING to suppress info messages --> turn-off Prophet logs
logging.getLogger('cmdstanpy').setLevel(logging.WARNING)
# logging.getLogger('cmdstanpy').setLevel(logging.CRITICAL)

%run -i ./Data_Loading_and_Descriptive_Statistics.ipynb
data/raw/Takapuna23-07May2020-to-30Apr2022.csv
data/raw/Penrose7-07May2020-to-30Apr2022.csv

ℹ️ Load Data from *.csv files --> rawdata 

ℹ️ The Shape of the Dataframe: (37382, 15)
ℹ️ The Shape of the Dataframe rawdata_site1 (Penrose) and rawdata_site2 (Takapuna): (20023, 12) (17359, 11)

🎓 Describing the types of each attribute as numerical_columns (Continuous), ordinal_columns (Ordinal), or nominal_columns (Nominal) ...

ℹ️ Numerical Variables/Features: 
 ['AQI', 'PM10', 'PM2.5', 'SO2', 'NO', 'NO2', 'NOx', 'Wind_Speed', 'Wind_Dir', 'Air_Temp', 'Rel_Humidity']

ℹ️ Ordinal Variables/Features: 
 ['Timestamp']

ℹ️ Nominal Variables/Features: 
 Index(['Site', 'Site_Class', 'Country'], dtype='object')

🎓 1. [Site 1 - Penrose][numerical_columns_S1, nominal_columns_S1] Summary Statistics of the Dataframe such as the mean, maximum and minimum values ...

ℹ️ Numerical Variables/Features: 
 ['AQI', 'PM10', 'PM2.5', 'SO2', 'NO', 'NO2', 'NOx', 'Wind_Speed', 'Wind_Dir', 'Air_Temp', 'Rel_Humidity']

ℹ️ Ordinal Variables/Features: 
 ['Timestamp']

🎓 2. [Site 2 - Takapuna][numerical_columns_S2, nominal_columns_S2] Summary Statistics of the Takapuna Dataframe such as the mean, maximum and minimum values ...

ℹ️ Numerical Variables/Features: 
 ['AQI', 'PM10', 'PM2.5', 'NO', 'NO2', 'NOx', 'Wind_Speed', 'Wind_Dir', 'Air_Temp', 'Rel_Humidity']

ℹ️ Ordinal Variables/Features: 
 ['Timestamp']
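🎓 The per-site frames and column groups above are produced inside Data_Loading_and_Descriptive_Statistics.ipynb. As a rough illustration (a hedged sketch, not the data-loading notebook's actual code), they could be derived from rawdata as follows:

## Hedged sketch: split the combined frame by monitoring site and drop columns that are
## entirely missing for a site (e.g. SO2 is reported at Penrose but not at Takapuna).
rawdata_site1_sketch = rawdata[rawdata['Site'] == 'Penrose'].dropna(axis=1, how='all')
rawdata_site2_sketch = rawdata[rawdata['Site'] == 'Takapuna'].dropna(axis=1, how='all')

## Hedged sketch: group columns by role using pandas dtypes plus domain knowledge.
numerical_columns_sketch = rawdata_site1_sketch.select_dtypes(include='number').columns.tolist()
ordinal_columns_sketch   = ['Timestamp']                       ## ordered by nature (time index)
nominal_columns_sketch   = ['Site', 'Site_Class', 'Country']   ## categorical metadata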
In [2]:
print("\n🎓 [Site1 - Penrose]  Summary Statistics of the {site1} rawdata_site1 Dataframe such as the mean, max/minimum values ...")
rawdata_site1.describe()
🎓 [Site1 - Penrose]  Summary Statistics of the {site1} rawdata_site1 Dataframe such as the mean, max/minimum values ...
Out[2]:
Timestamp AQI PM10 PM2.5 SO2 NO NO2 NOx Wind_Speed Wind_Dir Air_Temp Rel_Humidity
count 20023 17019.000000 16818.000000 16102.000000 16623.000000 16260.000000 16260.000000 16259.000000 17182.000000 17181.000000 17206.000000 17338.000000
mean 2021-05-06 21:58:02.954601984 28.233974 13.881787 5.348180 0.851639 10.041562 14.123253 24.151350 2.807368 173.234387 16.748925 69.426087
min 2020-05-07 00:00:00 0.000000 -6.400000 -12.300000 -4.400000 -1.900000 -1.400000 -1.300000 0.200000 1.000000 3.000000 26.900000
25% 2020-11-07 02:30:00 22.000000 8.400000 2.100000 0.100000 0.800000 3.900000 5.100000 1.600000 73.000000 14.000000 60.200000
50% 2021-05-08 19:00:00 27.000000 13.100000 4.700000 0.500000 3.600000 10.600000 14.900000 2.600000 211.000000 17.000000 70.400000
75% 2021-11-04 15:30:00 33.000000 18.400000 7.700000 1.000000 10.900000 20.700000 31.650000 3.800000 235.000000 20.000000 79.500000
max 2022-04-30 23:00:00 121.000000 196.700000 138.400000 25.200000 378.400000 83.100000 455.100000 8.200000 357.000000 28.000000 92.000000
std NaN 9.693312 8.044288 5.259335 1.507654 21.065762 12.748169 30.956155 1.538615 94.038611 4.013776 12.290750
In [3]:
print("\n🎓 [Site2 - Takapuna]  Summary Statistics of the {site2} rawdata_site2 Dataframe such as the mean, maximum and minimum values ...")
rawdata_site2.describe()
🎓 [Site2 - Takapuna]  Summary Statistics of the {site2} rawdata_site2 Dataframe such as the mean, maximum and minimum values ...
Out[3]:
Timestamp AQI PM10 PM2.5 NO NO2 NOx Wind_Speed Wind_Dir Air_Temp Rel_Humidity
count 17359 17359.000000 16579.000000 15807.000000 16187.000000 16185.000000 16187.000000 16773.000000 16771.000000 16785.000000 16921.000000
mean 2021-05-04 08:00:00 25.152947 11.590983 6.251275 12.005912 0.012416 24.203614 2.501792 176.335132 16.280548 69.909931
min 2020-05-07 17:00:00 -13.000000 -3.800000 0.450000 -0.800000 -0.000500 -0.700000 0.200000 5.500000 3.500000 30.400000
25% 2020-11-04 12:30:00 19.000000 7.400000 4.250000 2.450000 0.004000 7.050000 1.250000 75.500000 13.500000 61.250000
50% 2021-05-04 08:00:00 24.000000 10.650000 5.550000 5.850000 0.009100 15.100000 2.300000 205.000000 16.000000 71.050000
75% 2021-11-01 03:30:00 30.000000 14.500000 7.350000 12.200000 0.017700 29.400000 3.500000 261.000000 19.500000 79.850000
max 2022-04-30 23:00:00 318.000000 437.250000 52.300000 352.450000 0.072100 401.950000 7.800000 346.500000 27.000000 90.350000
std NaN 10.637321 7.911251 3.507078 22.592626 0.010951 30.662867 1.468465 93.196025 3.959962 11.804317
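🎓 The minimum values in both tables include negative concentrations (e.g. PM10 -6.4 and PM2.5 -12.3 at Penrose), which are physically impossible and typically indicate instrument noise or invalid readings. A minimal hedged check, left for the preprocessing step to handle:

## Hedged sketch: count physically impossible negative readings per pollutant column.
pollutant_columns = ['PM10', 'PM2.5', 'SO2', 'NO', 'NO2', 'NOx']
print(rawdata_site1[pollutant_columns].lt(0).sum())
print(rawdata_site2[[c for c in pollutant_columns if c in rawdata_site2.columns]].lt(0).sum())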

Python Reusable Functions

🎓 This section examines the key statistics and missing-data percentages from the reports to gain a better understanding of the data characteristics and time-series features.
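🎓 A minimal hedged sketch of how those missing-data percentages can be computed per column (using the frames loaded above; this helper is not defined in the original notebook):

## Hedged sketch: share of missing values per column, sorted descending.
def missing_percentages(df):
    """Return the percentage of missing values in each column, highest first."""
    return (df.isnull().mean() * 100).round(2).sort_values(ascending=False)

if IS_DEBUG:
    print(missing_percentages(rawdata_site1))
    print(missing_percentages(rawdata_site2))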

📈 Data Understanding & Visualization¶

🎓 This section identify and visualise Missing Data values.

Identify and Visualise Missing Data Values¶

In [4]:
import missingno as msno
import matplotlib.dates as mdates
In [5]:
def display_side_by_side_missing_data_visualizations(rawdata, site1, site2, visualization_type='matrix', feature_column='PM10'):
    """
    Displays side by side missing data visualizations for two specified sites using missingno library,
    adjusting parameters based on the visualization type.
    
    * 'matrix':  Each matrix cell is coloured according to whether the value exists (dark/green) or is missing (white).
                 The matrix plot is a great tool when working with depth-related or time-series data.
    * 'bar':     Displays the count of non-missing values in each column as a bar chart.
                 It is similar to the matrix plot but simpler to interpret.

    * 'heatmap': Displays a heat map of the nullity correlation between variables. It differs from a regular heatmap
                 in that it measures nullity correlation: how strongly the presence or absence of one variable is
                 associated with the presence or absence of another.
                 Values close to +1 indicate that missing values in one column tend to co-occur with missing values in another column.
                 Values close to 0 indicate little or no relationship between the missingness of one column and another.

    * 'dendrogram': Shows how similar variables are to one another in terms of missing values.
                 It correlates variable completion more fully, revealing trends deeper than the pairwise ones visible in the correlation heatmap.
                 A hierarchical clustering algorithm calculates the similarity between variables; the leaves represent
                 variables and the branches connect them to one another.
                 
    Parameters:
        rawdata (DataFrame): The complete dataset containing multiple sites.
        site1 (str): The first site to display.
        site2 (str): The second site to display.
        visualization_type (str): Type of visualization. Options: 'matrix', 'bar', 'heatmap', 'dendrogram'.
        feature_column (str): Not used by this function; kept for a consistent interface with the time-series helper.
    """
    
    ## Use the per-site frames prepared by Data_Loading_and_Descriptive_Statistics.ipynb;
    ## the commented lines show how rawdata could be filtered by the site1/site2 arguments instead.
    # data_site1 = rawdata[rawdata['Site'] == site1]
    # data_site2 = rawdata[rawdata['Site'] == site2]
    data_site1 = rawdata_site1.copy()
    data_site2 = rawdata_site2.copy()

    ## Setting up the figure to hold two subplots side by side
    fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(20, 8), sharey=True)
    fig.suptitle(f'[{visualization_type.capitalize()}] Missing Values Plot', fontsize=18)

    ## Generate visualizations for each site & Apply visualization with correct parameters.
    if visualization_type == 'matrix':
        msno.matrix(data_site1, ax=ax[0], sparkline=False, color=(0.24, 0.77, 0.77))
        msno.matrix(data_site2, ax=ax[1], sparkline=False, color=(0.24, 0.77, 0.77))
    elif visualization_type == 'bar':
        msno.bar(data_site1, ax=ax[0], color=(0.24, 0.77, 0.77))
        msno.bar(data_site2, ax=ax[1], color=(0.24, 0.77, 0.77))
    elif visualization_type == 'heatmap':
        if data_site1.isnull().sum().sum() > 0:  ## Ensure there's at least one missing value
            msno.heatmap(data_site1, ax=ax[0])
        if data_site2.isnull().sum().sum() > 0:
            msno.heatmap(data_site2, ax=ax[1])
    elif visualization_type == 'dendrogram':
        msno.dendrogram(data_site1, ax=ax[0], orientation='right')
        msno.dendrogram(data_site2, ax=ax[1], orientation='right')
    else:
        ## Ensure the visualization type is valid
        raise ValueError("Invalid visualization_type. Choose from 'matrix', 'bar', 'heatmap', 'dendrogram'.")

    ## Adjust the layout and display the plot
    plt.tight_layout()
    plt.show()
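🎓 The nullity correlation described in the docstring above can also be inspected numerically with plain pandas, as a quick cross-check of the msno.heatmap view (a minimal sketch, not part of the original notebook):

## Hedged sketch: correlate the boolean missingness indicators of the columns that actually
## contain missing values; values near +1 mean the columns tend to be missing together.
cols_with_nulls = rawdata_site1.columns[rawdata_site1.isnull().any()]
nullity_corr_site1 = rawdata_site1[cols_with_nulls].isnull().astype(int).corr()
if IS_DEBUG:
    print(nullity_corr_site1.round(2))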
In [6]:
# Display side by side the missing data matrices for two specific sites, "Penrose" and "Takapuna"
display_side_by_side_missing_data_visualizations(rawdata, 'Penrose', 'Takapuna', visualization_type='matrix')
print("The columns CO and Solar_Rad show large portions of missing data. This was identified in the bar plot, but the added benefit is you can view how that missing data is distributed in the dataframe.")

display_side_by_side_missing_data_visualizations(rawdata, 'Penrose', 'Takapuna', visualization_type='bar')
# display_side_by_side_missing_data_visualizations(rawdata, 'Takapuna', 'Takapuna', visualization_type='heatmap')
display_side_by_side_missing_data_visualizations(rawdata, 'Penrose', 'Takapuna', visualization_type='dendrogram')
[Figure: missing-data matrix plots for Penrose and Takapuna]
The columns CO and Solar_Rad show large portions of missing data. This is also visible in the bar plot below, but the matrix plot adds the benefit of showing how that missing data is distributed across the dataframe.
[Figure: missing-data bar charts for Penrose and Takapuna]
[Figure: missing-data dendrograms for Penrose and Takapuna]

Time-Series Data Visualizations

In [7]:
def display_side_by_side_timeseries_data_visualizations(rawdata, site1, site2, visualization_type='plot', feature_column='PM10'):
    """
    Displays side-by-side time-series visualizations for two specified sites, adjusting parameters based on the visualization type.

    * 'plot':                  Plots all numerical columns of each site over time.
    * 'plot_feature_column':   Plots a single feature column for both sites.
    * 'plot_feature_column_1': Plots a single feature column for the first site only (e.g. SO2, which is measured only at Penrose).

    Parameters:
        rawdata (DataFrame): The complete dataset containing multiple sites.
        site1 (str): The first site to display.
        site2 (str): The second site to display.
        visualization_type (str): Type of visualization. Options: 'plot', 'plot_feature_column', 'plot_feature_column_1'.
        feature_column (str): The feature to plot for the feature-column visualization types.
    """
    
    ## Filter rawdata for the specified sites
    ## Copy the rawdata for safety so the original data is not modified
    series = rawdata.copy()
    # ## Set 'Site' and 'Timestamp' as a multi-level index
    # series.set_index(['Site', 'Timestamp'], inplace=True)
    # ## TODO: log-scale ?
    # # data_site1 = rawdata.loc[(site1, slice(None)), numerical_columns]
    # data_site1 = series.xs(site1, level='Site')[numerical_columns]
    # data_site2 = series.xs(site2, level='Site')[numerical_columns]
    
    ## Setting up the figure to hold two subplots side by side
    fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(20, 8), sharey=True)
    
    # major_locator = mdates.YearLocator()
    # major_formatter = mdates.DateFormatter('%Y')
    # minor_locator = mdates.MonthLocator(bymonth=(1, 4, 7, 10))  ## Ticks for every quarter
    # minor_formatter = mdates.DateFormatter('%m')                ## [DEBUG] '%b'

    ## Generate visualizations for each site & Apply visualization with correct parameters.
    if visualization_type == 'plot':
        data_site1 = rawdata_site1.copy()[numerical_columns_S1]
        data_site2 = rawdata_site2.copy()[numerical_columns_S2]
        ## Filter rawdata for the specified sites and only include numerical columns
        data_site1.plot(ax=ax[0], title=f'[{visualization_type.capitalize()}] Data Plot for {site1}', fontsize = 18)
        data_site2.plot(ax=ax[1], title=f'[{visualization_type.capitalize()}] Data Plot for {site2}', fontsize = 18)
    elif visualization_type == 'plot_feature_column':
        ## Filter rawdata for the specified sites and only plotting the feature column
        # data_site1 = series.loc[(site1, ), feature_column]
        # data_site2 = series.loc[(site2, ), feature_column]
        # data_site1 = series.xs(site1, level='Site')[feature_column]
        # data_site2 = series.xs(site2, level='Site')[feature_column]
        data_site1 = rawdata_site1.copy()[feature_column]
        data_site2 = rawdata_site2.copy()[feature_column]
        
        data_site1.plot(ax=ax[0], title=f'[{visualization_type.capitalize()}] Data Plot of {feature_column} at {site1}', fontsize = 18, ylabel=f'{feature_column}', marker='o', linestyle='-', markersize=1)
        data_site2.plot(ax=ax[1], title=f'[{visualization_type.capitalize()}] Data Plot of {feature_column} at {site2}', fontsize = 18, ylabel=f'{feature_column}', marker='o', linestyle='-', markersize=1)
    elif visualization_type == 'plot_feature_column_1':
        ## Filter rawdata for the specified sites and only plotting the feature column
        data_site1 = rawdata_site1.copy()[feature_column]
        
        data_site1.plot(ax=ax[0], title=f'[{visualization_type.capitalize()}] Data Plot of {feature_column} at {site1}', fontsize = 18, ylabel=f'{feature_column}', marker='o', linestyle='-', markersize=1)
    else:
        ## Ensure the visualization type is valid
        raise ValueError("Invalid visualization_type. Choose from 'plot', 'plot_feature_column', 'plot_feature_column_1'.")

    ## Adjust the layout and display the plot
    plt.tight_layout()
    plt.show()
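🎓 The commented-out mdates locators inside the function hint at cleaner date axes. A minimal hedged sketch of how they could be applied to a subplot, assuming its x-axis carries the Timestamp values:

## Hedged sketch: yearly major ticks and quarterly minor ticks on a matplotlib date axis.
def format_date_axis(ax):
    """Apply year/quarter tick locators and formatters to a date-based x-axis."""
    ax.xaxis.set_major_locator(mdates.YearLocator())
    ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y'))
    ax.xaxis.set_minor_locator(mdates.MonthLocator(bymonth=(1, 4, 7, 10)))
    ax.xaxis.set_minor_formatter(mdates.DateFormatter('%b'))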
In [8]:
display_side_by_side_timeseries_data_visualizations(rawdata, 'Penrose', 'Takapuna', visualization_type='plot')

display_side_by_side_timeseries_data_visualizations(rawdata, 'Penrose', 'Takapuna', visualization_type='plot_feature_column')
display_side_by_side_timeseries_data_visualizations(rawdata, 'Penrose', 'Takapuna', visualization_type='plot_feature_column', feature_column='PM2.5')
display_side_by_side_timeseries_data_visualizations(rawdata, 'Penrose', 'Takapuna', visualization_type='plot_feature_column', feature_column='AQI')
display_side_by_side_timeseries_data_visualizations(rawdata, 'Penrose', 'Takapuna', visualization_type='plot_feature_column_1', feature_column='SO2')
display_side_by_side_timeseries_data_visualizations(rawdata, 'Penrose', 'Takapuna', visualization_type='plot_feature_column', feature_column='NO')
display_side_by_side_timeseries_data_visualizations(rawdata, 'Penrose', 'Takapuna', visualization_type='plot_feature_column', feature_column='NO2')
display_side_by_side_timeseries_data_visualizations(rawdata, 'Penrose', 'Takapuna', visualization_type='plot_feature_column', feature_column='NOx')
display_side_by_side_timeseries_data_visualizations(rawdata, 'Penrose', 'Takapuna', visualization_type='plot_feature_column', feature_column='Wind_Speed')
display_side_by_side_timeseries_data_visualizations(rawdata, 'Penrose', 'Takapuna', visualization_type='plot_feature_column', feature_column='Air_Temp')
display_side_by_side_timeseries_data_visualizations(rawdata, 'Penrose', 'Takapuna', visualization_type='plot_feature_column', feature_column='Rel_Humidity')
[Figures: side-by-side time-series plots for all numerical features, then per feature: PM10, PM2.5, AQI, SO2 (Penrose only), NO, NO2, NOx, Wind_Speed, Air_Temp, Rel_Humidity]

Data Quality Profiling

In [20]:
# %%capture
# !pip install llvmlite --ignore-installed
# !pip install -U ydata-profiling
In [22]:
from ydata_profiling import ProfileReport

## Note: Data profiling takes more than a minute for each site !!!
## Setting what variables are time series: common to both sites, except those unique to one
type_schema = {
    # "Timestamp": "ordinal",   ## Explicitly marking the Timestamp as ordinal due to its role in indexing the data over time
    # "Timestamp": "timeseries",
    "AQI": "timeseries",
    "PM10": "timeseries",
    "PM2.5": "timeseries",
    # "SO2": "timeseries",
    "NO": "timeseries",
    "NO2": "timeseries",
    "NOx": "timeseries",
    # "CO": "timeseries",
    "Wind_Speed": "timeseries",
    "Wind_Dir": "timeseries",
    "Air_Temp": "timeseries",
    "Rel_Humidity": "timeseries",
    "PM2.5_Lag1": "timeseries",  ## Including lagged variables as they also represent time series data
    "PM2.5_Lag2": "timeseries",
    "PM10_Lag1": "timeseries",
    "PM10_Lag2": "timeseries"
}

## Unique variables for Penrose and Takapuna
unique_penrose = {"SO2": "timeseries"}   ## 'SO2' is only for Penrose
# unique_takapuna = {"CO": "timeseries"} ## 'CO' is only for Takapuna

## Combine the common schema with the unique variables for each site to create site-specific schemas
type_schema_penrose = {**type_schema, **unique_penrose}
# type_schema_takapuna = {**common_schema, **unique_takapuna}
type_schema_takapuna = {**type_schema}

## Generate the Data Quality Profiling per monitoring Site
# for site_name, site_data in rawdata.groupby(level='Site'):
for site_name, site_data in rawdata.groupby("Site"):

    ## Preprocess site_data if necessary HERE ...
    # site_data = site_data.reset_index(level='Site', drop=True)  ## Drop the 'Site' level for profiling

    ## Check if any series within site_data is empty or has all NaN values, then Proceed with profiling if data is valid
    if not site_data.dropna(how='all').empty:
        ## Apply the site-specific schema
        type_schema = type_schema_penrose if site_name == 'Penrose' else type_schema_takapuna
        
        ## Running 1 profile per station
        profile = ProfileReport(
            site_data,
            title=f"Air Quality Profiling - Site: {site_name}",
            explorative=True,
            # vars={"cat": {"check_completeness": False}},  ## Optional: Adjust based on profiling needs
            tsmode=True,
            type_schema=type_schema,
            sortby='Timestamp',      ## Ensure the profile respects the temporal ordering
            correlations={"auto": {"calculate": False}}, 
            missing_diagrams={"Heatmap": False},         
            dataset={
                "description": "Air Quality Data Profiling for the Research Project.",
                "author": "Nhat Thanh, Nguyen. 🌐 Linkedin: https://www.linkedin.com/in/nnthanh",
                "copyright_holder": "🎓 AUT 🛠️ Teradata's ClearScape Analytics™ ⚡",
                "copyright_year": 2024,
                "url": "https://analytics-experience.pages.dev",
            },
            variables={
                "descriptions": {
                    "PM10": "Particulate Matter (PM10) Hourly Aggregate: µg/m³ (Micrograms per cubic metre).",
                    "PM2.5": "Particulate Matter (PM2.5) Hourly Aggregate: µg/m³ (Micrograms per cubic metre).",
                    "SO2": "Sulfur Dioxide (SO2) Hourly Aggregate: µg/m³ (Micrograms per cubic metre)",
                    "NO": "Nitric Oxide (NO) Hourly Aggregate: µg/m³ (Micrograms per cubic metre)",
                    "NO2": "Nitrogen Dioxide (NO2) Hourly Aggregate: µg/m³ (Micrograms per cubic metre)",
                    "NOx": "Nitrogen Oxide (NOx) Hourly Aggregate: µg/m³ (Micrograms per cubic metre)",
                    "CO": "Carbon Monoxide (CO) Hourly Aggregate: µg/m³ (Micrograms per cubic metre)",
                    "AQI": "AQI.Air Quality Index (AQI)",
                    "Wind_Speed": "Wind Speed Hourly Aggregate: m/s (Metres per second)",
                    "Wind_Dir": "Wind Direction Hourly Aggregate: ° (Degrees)",
                    "Air_Temp": "Air Temperature Hourly Aggregate: °C (Celsius)",
                    "Rel_Humidity": "Relative Humidity Hourly Aggregate: % (Percent)",
                    "Solar_Rad": "Solar Radiation Hourly Aggregate: W/m² (Watts per square metre)",
                }
            },
        )
    else:
        print(f"Skipping profiling for {site_name} due to insufficient data.")
        continue  ## No profile was generated for this site, so skip saving it

    ## Save each profile report to an HTML file
    profile.to_file(f"data/report/Data_Profiling_for_{site_name}.html")

## ONLY display 1 site to reduce the size of the .html file (< 25 MiB)
## Display the profile report in the Jupyter notebook
print(f"Displaying profile for {site_name}:")
profile.to_notebook_iframe()
Displaying profile for Takapuna:
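🎓 Besides the HTML export, the same profile can be serialised for programmatic comparison between the two sites (a hedged sketch using ydata-profiling's JSON export; the output path is an assumption mirroring the HTML one):

## Hedged sketch: export the last generated profile as JSON so key statistics can be
## compared programmatically between the Penrose and Takapuna reports.
with open(f"data/report/Data_Profiling_for_{site_name}.json", "w") as f:
    f.write(profile.to_json())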

References

  • Clemente, F., Ribeiro, G. M., Quemy, A., Santos, M. S., Pereira, R. C., & Barros, A. (2023). ydata-profiling: Accelerating data-centric AI with high-quality data. Neurocomputing, 554, 126585. https://doi.org/10.1016/j.neucom.2023.126585
🎓 Predicting Air Particulate Matter at Scale ⛅️
🧑‍🎓 Auckland University of Technology (AUT)