🎓 Predicting Air Particulate Matter at Scale ⛅️
🛠️ 1. Data Understanding & Data Quality Profiling
🎓 This notebook demonstrates various DataFrame methods that can be used to analyse and cleanse a dataset, making it ready for reuse in the next CRISP-DM for Data Science steps (Time Series, Machine Learning, Deep Learning).
Workflow steps:
- Import the required teradataml modules.
- Connect to a Vantage system.
- 📈 Data Understanding & Visualization.
- ⚙️ Data Preprocessing.
- Cleanup.
🎯 Libraries and Reusable Functions
🎓 This section executes all of the cells in `Data_Loading_and_Descriptive_Statistics.ipynb`.
In [1]:
import logging

## TODO: read a .env file to determine the environment and output format programmatically
## Flags describing the runtime environment; these affect how visualizations are rendered or interacted with.
IS_JUPYTERLAB = True ## True if running in JupyterLab; False if running as a plain Python .py script
IS_TERADATA_VANTAGE = False ## True if executing on a Teradata Vantage system; False if on a laptop/virtual machine
IS_DATA_IN_TERADATA_VANTAGE = False ## True if the data lives in Teradata Vantage; False if loaded from *.csv/*.xls files
IS_DEBUG = True ## Whether to plot and display additional diagnostic information

# if not IS_DEBUG:
## Set the logging level to WARNING to suppress info messages --> turn off Prophet/cmdstanpy logs
logging.getLogger('cmdstanpy').setLevel(logging.WARNING)
# logging.getLogger('cmdstanpy').setLevel(logging.CRITICAL)
%run -i ./Data_Loading_and_Descriptive_Statistics.ipynb
data/raw/Takapuna23-07May2020-to-30Apr2022.csv data/raw/Penrose7-07May2020-to-30Apr2022.csv
ℹ️ Load Data from *.csv files --> rawdata
ℹ️ The Shape of the Dataframe: (37382, 15)
ℹ️ The Shape of the Dataframe rawdata_site1 (Penrose) and rawdata_site2 (Takapuna): (20023, 12) (17359, 11)
🎓 Describing the types of each attribute as numerical_columns (Continuous), ordinal_columns (Ordinal), or nominal_columns (Nominal) ...
ℹ️ Numerical Variables/Features: ['AQI', 'PM10', 'PM2.5', 'SO2', 'NO', 'NO2', 'NOx', 'Wind_Speed', 'Wind_Dir', 'Air_Temp', 'Rel_Humidity']
ℹ️ Ordinal Variables/Features: ['Timestamp']
ℹ️ Nominal Variables/Features: Index(['Site', 'Site_Class', 'Country'], dtype='object')
🎓 1. [Site 1 - Penrose][numerical_columns_S1, nominal_columns_S1] Summary Statistics of the Dataframe such as the mean, maximum and minimum values ...
ℹ️ Numerical Variables/Features: ['AQI', 'PM10', 'PM2.5', 'SO2', 'NO', 'NO2', 'NOx', 'Wind_Speed', 'Wind_Dir', 'Air_Temp', 'Rel_Humidity']
ℹ️ Ordinal Variables/Features: ['Timestamp']
🎓 2. [Site 2 - Takapuna][numerical_columns_S2, nominal_columns_S2] Summary Statistics of the Dataframe such as the mean, maximum and minimum values ...
ℹ️ Numerical Variables/Features: ['AQI', 'PM10', 'PM2.5', 'NO', 'NO2', 'NOx', 'Wind_Speed', 'Wind_Dir', 'Air_Temp', 'Rel_Humidity']
ℹ️ Ordinal Variables/Features: ['Timestamp']
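The TODO in the cell above suggests deriving the environment flags programmatically instead of hard-coding them. A minimal sketch of how `IS_JUPYTERLAB` could be detected (the helper name `running_in_jupyter` is illustrative, not part of the notebook):

```python
def running_in_jupyter() -> bool:
    """Best-effort check: True when executing inside an IPython/Jupyter kernel,
    False when running as a plain Python .py script."""
    try:
        from IPython import get_ipython  # only meaningful inside IPython sessions
        return get_ipython() is not None
    except ImportError:
        return False  # IPython not installed --> definitely a plain script

IS_JUPYTERLAB = running_in_jupyter()
```

This keeps the flag accurate when the notebook is exported and re-run as a script.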
In [2]:
print("\n🎓 [Site1 - Penrose] Summary Statistics of the rawdata_site1 DataFrame, such as the mean, maximum and minimum values ...")
rawdata_site1.describe()
🎓 [Site1 - Penrose] Summary Statistics of the rawdata_site1 DataFrame, such as the mean, maximum and minimum values ...
Out[2]:
| | Timestamp | AQI | PM10 | PM2.5 | SO2 | NO | NO2 | NOx | Wind_Speed | Wind_Dir | Air_Temp | Rel_Humidity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 20023 | 17019 | 16818 | 16102 | 16623 | 16260 | 16260 | 16259 | 17182 | 17181 | 17206 | 17338 |
| mean | 2021-05-06 21:58:02 | 28.23 | 13.88 | 5.35 | 0.85 | 10.04 | 14.12 | 24.15 | 2.81 | 173.23 | 16.75 | 69.43 |
| min | 2020-05-07 00:00:00 | 0.00 | -6.40 | -12.30 | -4.40 | -1.90 | -1.40 | -1.30 | 0.20 | 1.00 | 3.00 | 26.90 |
| 25% | 2020-11-07 02:30:00 | 22.00 | 8.40 | 2.10 | 0.10 | 0.80 | 3.90 | 5.10 | 1.60 | 73.00 | 14.00 | 60.20 |
| 50% | 2021-05-08 19:00:00 | 27.00 | 13.10 | 4.70 | 0.50 | 3.60 | 10.60 | 14.90 | 2.60 | 211.00 | 17.00 | 70.40 |
| 75% | 2021-11-04 15:30:00 | 33.00 | 18.40 | 7.70 | 1.00 | 10.90 | 20.70 | 31.65 | 3.80 | 235.00 | 20.00 | 79.50 |
| max | 2022-04-30 23:00:00 | 121.00 | 196.70 | 138.40 | 25.20 | 378.40 | 83.10 | 455.10 | 8.20 | 357.00 | 28.00 | 92.00 |
| std | NaN | 9.69 | 8.04 | 5.26 | 1.51 | 21.07 | 12.75 | 30.96 | 1.54 | 94.04 | 4.01 | 12.29 |
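Several minima above are negative (e.g. PM10 at -6.40 µg/m³ and PM2.5 at -12.30 µg/m³), which is physically impossible and typically instrument noise near the detection limit. Before choosing a cleansing rule, it helps to quantify how widespread these readings are. A sketch (the helper name `count_negatives` is illustrative; in the notebook it would be applied to `rawdata_site1`):

```python
import pandas as pd

def count_negatives(df: pd.DataFrame) -> pd.Series:
    """Count negative readings per numeric column, worst offenders first."""
    numeric = df.select_dtypes("number")
    return (numeric < 0).sum().sort_values(ascending=False)

## Tiny stand-in frame mimicking the raw sensor data:
demo = pd.DataFrame({"PM10": [-6.4, 8.4, 13.1], "SO2": [0.1, -4.4, 0.5]})
print(count_negatives(demo))
```

Columns with many negatives may warrant clipping at zero or masking as missing, depending on the downstream model.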
In [3]:
print("\n🎓 [Site2 - Takapuna] Summary Statistics of the rawdata_site2 DataFrame, such as the mean, maximum and minimum values ...")
rawdata_site2.describe()
🎓 [Site2 - Takapuna] Summary Statistics of the rawdata_site2 DataFrame, such as the mean, maximum and minimum values ...
Out[3]:
| | Timestamp | AQI | PM10 | PM2.5 | NO | NO2 | NOx | Wind_Speed | Wind_Dir | Air_Temp | Rel_Humidity |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 17359 | 17359 | 16579 | 15807 | 16187 | 16185 | 16187 | 16773 | 16771 | 16785 | 16921 |
| mean | 2021-05-04 08:00:00 | 25.15 | 11.59 | 6.25 | 12.01 | 0.0124 | 24.20 | 2.50 | 176.34 | 16.28 | 69.91 |
| min | 2020-05-07 17:00:00 | -13.00 | -3.80 | 0.45 | -0.80 | -0.0005 | -0.70 | 0.20 | 5.50 | 3.50 | 30.40 |
| 25% | 2020-11-04 12:30:00 | 19.00 | 7.40 | 4.25 | 2.45 | 0.0040 | 7.05 | 1.25 | 75.50 | 13.50 | 61.25 |
| 50% | 2021-05-04 08:00:00 | 24.00 | 10.65 | 5.55 | 5.85 | 0.0091 | 15.10 | 2.30 | 205.00 | 16.00 | 71.05 |
| 75% | 2021-11-01 03:30:00 | 30.00 | 14.50 | 7.35 | 12.20 | 0.0177 | 29.40 | 3.50 | 261.00 | 19.50 | 79.85 |
| max | 2022-04-30 23:00:00 | 318.00 | 437.25 | 52.30 | 352.45 | 0.0721 | 401.95 | 7.80 | 346.50 | 27.00 | 90.35 |
| std | NaN | 10.64 | 7.91 | 3.51 | 22.59 | 0.0110 | 30.66 | 1.47 | 93.20 | 3.96 | 11.80 |
Python Reusable Functions
🎓 This section examines the key statistics and missing-data percentages from the reports to gain a better understanding of the data characteristics and time-series features.
📈 Data Understanding & Visualization
🎓 This section identifies and visualises missing data values.
Identify and Visualise Missing Data Values
In [4]:
import missingno as msno
import matplotlib.dates as mdates
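The missing-data percentages referenced above can also be computed directly in pandas, without any plotting. A sketch (the helper name `missing_percent` is illustrative; in the notebook it would be applied to `rawdata_site1` and `rawdata_site2`):

```python
import pandas as pd

def missing_percent(df: pd.DataFrame) -> pd.Series:
    """Percentage of missing values per column, sorted worst-first."""
    return (df.isna().mean() * 100).round(2).sort_values(ascending=False)

## Tiny stand-in frame: PM10 is half missing, AQI is complete
demo = pd.DataFrame({"PM10": [1.0, None, 3.0, None], "AQI": [10, 20, 30, 40]})
print(missing_percent(demo))
```

This gives the exact numbers behind the `missingno` bar chart drawn below.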
In [5]:
def display_side_by_side_missing_data_visualizations(rawdata, site1, site2, visualization_type='matrix', feature_column='PM10'):
    """
    Displays side-by-side missing-data visualizations for two specified sites using the missingno library,
    adjusting parameters based on the visualization type.
    * 'matrix': Each matrix cell is coloured based on whether the data exists (dark/green) or not (white).
        The matrix plot is a great tool when working with depth-related or time-series data.
    * 'bar': Displays the number of non-missing values for each column as a bar chart.
        It is similar to the matrix plot but simpler to interpret.
    * 'heatmap': Displays a heat map of the nullity correlation between variables. What differentiates it
        from a regular heat map is that it measures nullity correlation: an indicator of how strongly the
        presence or absence of one variable affects the presence of another.
        Values close to +1 indicate that null values in one column are correlated with null values in another column.
        Values close to 0 indicate little to no relationship between the presence of null values in the two columns.
    * 'dendrogram': Shows how similar variables are to one another in terms of missing values.
        It correlates variable completion more fully, revealing trends deeper than the pairwise ones
        visible in the correlation heat map. A hierarchical clustering algorithm calculates the similarity
        between variables: the leaves represent variables and the branches connect them to one another.

    Parameters:
        rawdata (DataFrame): The complete dataset containing multiple sites.
        site1 (str): The first site to display.
        site2 (str): The second site to display.
        visualization_type (str): Type of visualization. Options: 'matrix', 'bar', 'heatmap', 'dendrogram'.
    """
    ## Filter rawdata for the specified sites (the preloaded per-site frames are used directly)
    # data_site1 = rawdata[rawdata['Site'] == site1]
    # data_site2 = rawdata[rawdata['Site'] == site2]
    data_site1 = rawdata_site1.copy()
    data_site2 = rawdata_site2.copy()

    ## Set up the figure to hold two subplots side by side
    fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(20, 8), sharey=True)
    fig.suptitle(f'[{visualization_type.capitalize()}] Missing Values Plot', fontsize=18)

    ## Generate the visualization for each site with the correct parameters
    if visualization_type == 'matrix':
        msno.matrix(data_site1, ax=ax[0], sparkline=False, color=(0.24, 0.77, 0.77))
        msno.matrix(data_site2, ax=ax[1], sparkline=False, color=(0.24, 0.77, 0.77))
    elif visualization_type == 'bar':
        msno.bar(data_site1, ax=ax[0], color=(0.24, 0.77, 0.77))
        msno.bar(data_site2, ax=ax[1], color=(0.24, 0.77, 0.77))
    elif visualization_type == 'heatmap':
        if data_site1.isnull().sum().sum() > 0:  ## Ensure there is at least one missing value
            msno.heatmap(data_site1, ax=ax[0])
        if data_site2.isnull().sum().sum() > 0:
            msno.heatmap(data_site2, ax=ax[1])
    elif visualization_type == 'dendrogram':
        msno.dendrogram(data_site1, ax=ax[0], orientation='right')
        msno.dendrogram(data_site2, ax=ax[1], orientation='right')
    else:
        ## Ensure the visualization type is valid
        raise ValueError("Invalid visualization_type. Choose from 'matrix', 'bar', 'heatmap', 'dendrogram'.")

    ## Adjust the layout and display the plot
    plt.tight_layout()
    plt.show()
In [6]:
# Display side by side the missing data matrices for two specific sites, "Penrose" and "Takapuna"
display_side_by_side_missing_data_visualizations(rawdata, 'Penrose', 'Takapuna', visualization_type='matrix')
print("The columns CO and Solar_Rad show large portions of missing data. This was identified in the bar plot, but the added benefit is you can view how that missing data is distributed in the dataframe.")
display_side_by_side_missing_data_visualizations(rawdata, 'Penrose', 'Takapuna', visualization_type='bar')
# display_side_by_side_missing_data_visualizations(rawdata, 'Takapuna', 'Takapuna', visualization_type='heatmap')
display_side_by_side_missing_data_visualizations(rawdata, 'Penrose', 'Takapuna', visualization_type='dendrogram')
The columns CO and Solar_Rad show large portions of missing data. This was identified in the bar plot, but the added benefit is you can view how that missing data is distributed in the dataframe.
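The nullity correlation that `msno.heatmap` visualises can also be inspected numerically: it is simply the Pearson correlation between the per-column missingness indicators. A minimal sketch using plain pandas (the helper name `nullity_correlation` and the demo frame are illustrative):

```python
import pandas as pd

def nullity_correlation(df: pd.DataFrame) -> pd.DataFrame:
    """Pearson correlation between the boolean is-missing indicators of each column."""
    return df.isna().corr()

## Tiny stand-in frame: NO and NOx are missing on exactly the same rows
demo = pd.DataFrame({
    "NO":  [1.0, None, 3.0, None],
    "NOx": [2.0, None, 6.0, None],
    "AQI": [10.0, 20.0, None, 40.0],
})
corr = nullity_correlation(demo)
print(corr.loc["NO", "NOx"])  ## 1.0: the two columns are perfectly co-missing
```

A value near +1 (as for NO/NOx here) suggests the gaps share a common cause, such as a shared analyser going offline, which matters when choosing an imputation strategy.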
Time-Series Data Visualizations
In [7]:
def display_side_by_side_timeseries_data_visualizations(rawdata, site1, site2, visualization_type='plot', feature_column='PM10'):
    """
    Displays side-by-side time-series visualizations for two specified sites, adjusting parameters based on the visualization type.
    * 'plot': Plots all numerical columns for both sites.
    * 'plot_feature_column': Plots a single feature column for both sites.
    * 'plot_feature_column_1': Plots a single feature column for the first site only.

    Parameters:
        rawdata (DataFrame): The complete dataset containing multiple sites.
        site1 (str): The first site to display.
        site2 (str): The second site to display.
        visualization_type (str): Type of visualization. Options: 'plot', 'plot_feature_column', 'plot_feature_column_1'.
        feature_column (str): The feature column to plot for the single-feature visualizations.
    """
    ## Copy the rawdata for safety so the original data is not modified
    series = rawdata.copy()
    # ## Set 'Site' and 'Timestamp' as a multi-level index
    # series.set_index(['Site', 'Timestamp'], inplace=True)
    # ## TODO: log-scale ?
    # # data_site1 = rawdata.loc[(site1, slice(None)), numerical_columns]
    # data_site1 = series.xs(site1, level='Site')[numerical_columns]
    # data_site2 = series.xs(site2, level='Site')[numerical_columns]

    ## Set up the figure to hold two subplots side by side
    fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(20, 8), sharey=True)
    # major_locator = mdates.YearLocator()
    # major_formatter = mdates.DateFormatter('%Y')
    # minor_locator = mdates.MonthLocator(bymonth=(1, 4, 7, 10))  ## Ticks for every quarter
    # minor_formatter = mdates.DateFormatter('%m')  ## [DEBUG] '%b'

    ## Generate the visualization for each site with the correct parameters
    if visualization_type == 'plot':
        ## Only include the numerical columns for each site
        data_site1 = rawdata_site1.copy()[numerical_columns_S1]
        data_site2 = rawdata_site2.copy()[numerical_columns_S2]
        data_site1.plot(ax=ax[0], title=f'[{visualization_type.capitalize()}] Data Plot for {site1}', fontsize=18)
        data_site2.plot(ax=ax[1], title=f'[{visualization_type.capitalize()}] Data Plot for {site2}', fontsize=18)
    elif visualization_type == 'plot_feature_column':
        ## Only plot the requested feature column for both sites
        # data_site1 = series.xs(site1, level='Site')[feature_column]
        # data_site2 = series.xs(site2, level='Site')[feature_column]
        data_site1 = rawdata_site1.copy()[feature_column]
        data_site2 = rawdata_site2.copy()[feature_column]
        data_site1.plot(ax=ax[0], title=f'[{visualization_type.capitalize()}] Data Plot of {feature_column} at {site1}', fontsize=18, ylabel=f'{feature_column}', marker='o', linestyle='-', markersize=1)
        data_site2.plot(ax=ax[1], title=f'[{visualization_type.capitalize()}] Data Plot of {feature_column} at {site2}', fontsize=18, ylabel=f'{feature_column}', marker='o', linestyle='-', markersize=1)
    elif visualization_type == 'plot_feature_column_1':
        ## Only plot the requested feature column for the first site
        data_site1 = rawdata_site1.copy()[feature_column]
        data_site1.plot(ax=ax[0], title=f'[{visualization_type.capitalize()}] Data Plot of {feature_column} at {site1}', fontsize=18, ylabel=f'{feature_column}', marker='o', linestyle='-', markersize=1)
    else:
        ## Ensure the visualization type is valid
        raise ValueError("Invalid visualization_type. Choose from 'plot', 'plot_feature_column', 'plot_feature_column_1'.")

    ## Adjust the layout and display the plot
    plt.tight_layout()
    plt.show()
In [8]:
display_side_by_side_timeseries_data_visualizations(rawdata, 'Penrose', 'Takapuna', visualization_type='plot')
display_side_by_side_timeseries_data_visualizations(rawdata, 'Penrose', 'Takapuna', visualization_type='plot_feature_column')
display_side_by_side_timeseries_data_visualizations(rawdata, 'Penrose', 'Takapuna', visualization_type='plot_feature_column', feature_column='PM2.5')
display_side_by_side_timeseries_data_visualizations(rawdata, 'Penrose', 'Takapuna', visualization_type='plot_feature_column', feature_column='AQI')
display_side_by_side_timeseries_data_visualizations(rawdata, 'Penrose', 'Takapuna', visualization_type='plot_feature_column_1', feature_column='SO2')
display_side_by_side_timeseries_data_visualizations(rawdata, 'Penrose', 'Takapuna', visualization_type='plot_feature_column', feature_column='NO')
display_side_by_side_timeseries_data_visualizations(rawdata, 'Penrose', 'Takapuna', visualization_type='plot_feature_column', feature_column='NO2')
display_side_by_side_timeseries_data_visualizations(rawdata, 'Penrose', 'Takapuna', visualization_type='plot_feature_column', feature_column='NOx')
display_side_by_side_timeseries_data_visualizations(rawdata, 'Penrose', 'Takapuna', visualization_type='plot_feature_column', feature_column='Wind_Speed')
display_side_by_side_timeseries_data_visualizations(rawdata, 'Penrose', 'Takapuna', visualization_type='plot_feature_column', feature_column='Air_Temp')
display_side_by_side_timeseries_data_visualizations(rawdata, 'Penrose', 'Takapuna', visualization_type='plot_feature_column', feature_column='Rel_Humidity')
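The hourly traces plotted above are noisy; resampling to daily means often makes seasonal structure easier to see. A sketch of how the per-site frames could be downsampled before plotting (the helper name `daily_means` is illustrative; the demo frame stands in for `rawdata_site1`):

```python
import pandas as pd

def daily_means(df: pd.DataFrame, timestamp_col: str = "Timestamp") -> pd.DataFrame:
    """Resample hourly observations to daily means, keyed on the timestamp column."""
    return (df.set_index(timestamp_col)
              .select_dtypes("number")   # resample only the numeric sensor columns
              .resample("D")
              .mean())

## Tiny stand-in frame: two full days of hourly PM10 readings
demo = pd.DataFrame({
    "Timestamp": pd.date_range("2020-05-07", periods=48, freq="h"),
    "PM10": [10.0] * 24 + [20.0] * 24,
})
dm = daily_means(demo)
print(dm)  ## one row per day: 10.0 for 2020-05-07, 20.0 for 2020-05-08
```

The resampled frame can then be passed to the same `.plot(...)` calls used in the function above.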
Data Quality Profiling
In [20]:
# %%capture
# !pip install llvmlite --ignore installed
# !pip install -U ydata-profiling
In [22]:
from ydata_profiling import ProfileReport

## Note: Data profiling takes more than a minute for each site !!!
## Declare which variables are time series: common to both sites, except those unique to one
type_schema = {
    # "Timestamp": "ordinal", ## Explicitly marking the Timestamp as ordinal due to its role in indexing the data over time
    # "Timestamp": "timeseries",
    "AQI": "timeseries",
    "PM10": "timeseries",
    "PM2.5": "timeseries",
    # "SO2": "timeseries",
    "NO": "timeseries",
    "NO2": "timeseries",
    "NOx": "timeseries",
    # "CO": "timeseries",
    "Wind_Speed": "timeseries",
    "Wind_Dir": "timeseries",
    "Air_Temp": "timeseries",
    "Rel_Humidity": "timeseries",
    "PM2.5_Lag1": "timeseries", ## Include lagged variables, as they also represent time-series data
    "PM2.5_Lag2": "timeseries",
    "PM10_Lag1": "timeseries",
    "PM10_Lag2": "timeseries"
}
## Variables unique to Penrose and Takapuna
unique_penrose = {"SO2": "timeseries"} ## 'SO2' is measured only at Penrose
# unique_takapuna = {"CO": "timeseries"} ## 'CO' is measured only at Takapuna
## Combine the common schema with the unique variables for each site to create site-specific schemas
type_schema_penrose = {**type_schema, **unique_penrose}
# type_schema_takapuna = {**type_schema, **unique_takapuna}
type_schema_takapuna = {**type_schema}

## Generate the Data Quality Profiling report per monitoring site
# for site_name, site_data in rawdata.groupby(level='Site'):
for site_name, site_data in rawdata.groupby("Site"):
    ## Preprocess site_data if necessary HERE ...
    # site_data = site_data.reset_index(level='Site', drop=True) ## Drop the 'Site' level for profiling
    ## Skip a site if its data is empty or entirely NaN; otherwise proceed with profiling
    if site_data.dropna(how='all').empty:
        print(f"Skipping profiling for {site_name} due to insufficient data.")
        continue
    ## Apply the site-specific schema
    type_schema = type_schema_penrose if site_name == 'Penrose' else type_schema_takapuna
    ## Run one profile per station
    profile = ProfileReport(
        site_data,
        title=f"Air Quality Profiling - Site: {site_name}",
        explorative=True,
        # vars={"cat": {"check_completeness": False}}, ## Optional: adjust based on profiling needs
        tsmode=True,
        type_schema=type_schema,
        sortby='Timestamp', ## Ensure the profile respects the temporal ordering
        correlations={"auto": {"calculate": False}},
        missing_diagrams={"Heatmap": False},
        dataset={
            "description": "Air Quality Data Profiling for the Research Project.",
            "author": "Nhat Thanh, Nguyen. 🌐 Linkedin: https://www.linkedin.com/in/nnthanh",
            "copyright_holder": "🎓 AUT 🛠️ Teradata's ClearScape Analytics™ ⚡",
            "copyright_year": 2024,
            "url": "https://analytics-experience.pages.dev",
        },
        variables={
            "descriptions": {
                "PM10": "Particulate Matter (PM10) Hourly Aggregate: µg/m³ (Micrograms per cubic metre).",
                "PM2.5": "Particulate Matter (PM2.5) Hourly Aggregate: µg/m³ (Micrograms per cubic metre).",
                "SO2": "Sulfur Dioxide (SO2) Hourly Aggregate: µg/m³ (Micrograms per cubic metre).",
                "NO": "Nitric Oxide (NO) Hourly Aggregate: µg/m³ (Micrograms per cubic metre).",
                "NO2": "Nitrogen Dioxide (NO2) Hourly Aggregate: µg/m³ (Micrograms per cubic metre).",
                "NOx": "Nitrogen Oxides (NOx) Hourly Aggregate: µg/m³ (Micrograms per cubic metre).",
                "CO": "Carbon Monoxide (CO) Hourly Aggregate: µg/m³ (Micrograms per cubic metre).",
                "AQI": "Air Quality Index (AQI).",
                "Wind_Speed": "Wind Speed Hourly Aggregate: m/s (Metres per second).",
                "Wind_Dir": "Wind Direction Hourly Aggregate: ° (Degrees).",
                "Air_Temp": "Air Temperature Hourly Aggregate: °C (Celsius).",
                "Rel_Humidity": "Relative Humidity Hourly Aggregate: % (Percent).",
                "Solar_Rad": "Solar Radiation Hourly Aggregate: W/m² (Watts per square metre).",
            }
        },
    )
    ## Save each profile report to an HTML file
    profile.to_file(f"data/report/Data_Profiling_for_{site_name}.html")

## ONLY display 1 site (the last one profiled) to reduce the size of the .html file (< 25 MiB)
## Display the profile report in the Jupyter notebook
print(f"Displaying profile for {site_name}:")
profile.to_notebook_iframe()
Summarize dataset: 0%| | 0/5 [00:00<?, ?it/s]
Generate report structure: 0%| | 0/1 [00:00<?, ?it/s]
Render HTML: 0%| | 0/1 [00:00<?, ?it/s]
Export report to file: 0%| | 0/1 [00:00<?, ?it/s]
Summarize dataset: 0%| | 0/5 [00:00<?, ?it/s]
Generate report structure: 0%| | 0/1 [00:00<?, ?it/s]
Render HTML: 0%| | 0/1 [00:00<?, ?it/s]
Export report to file: 0%| | 0/1 [00:00<?, ?it/s]
Displaying profile for Takapuna: