Teradata 🎓 Predicting Air Particulate Matter at Scale ⛅️
🛠️ 2. Data Preparation & Exploratory Data Analysis
🦅 Overview
🎓 This notebook demonstrates various DataFrame methods for analysing and cleansing the dataset so that it is ready for reuse in the next steps (Time Series, Machine Learning, Deep Learning) of the CRISP-DM process for Data Science.
  • Workflow steps:

    1. Import the required teradataml modules.
    2. Connect to a Vantage system.
    3. Load and visualize the data.
    4. Data analysis & preparation, e.g. using various DataFrame functions to inspect the data (shape, null values, etc.) and applying variable transformations to fill NULL values.
    5. Cleanup.

🎯 Libraries and Reusable Functions

🎓 Run all of the cells in `Data_Loading_and_Descriptive_Statistics.ipynb` to load the required libraries and reusable functions.
In [1]:
import os, logging

## .env --> Setting the environment variable for output format programmatically
os.environ['ENV_PATH'] = 'raw_pandas.env'  ## Set to `cleaned_pandas.env` or `raw_pandas.env` as needed

%run -i ./Data_Loading_and_Descriptive_Statistics.ipynb
# %run -i ./DataFrameAdapter.ipynb
The specified .env file: raw_pandas.env
+-----------------------------+--------------------------------------------------------------------------------------------+-------------------+
| Variable                    | Description                                                                                | Value             |
+-----------------------------+--------------------------------------------------------------------------------------------+-------------------+
| IS_TERADATA_VANTAGE         | Using scalable Teradata Vantage vs. local machine (Laptop)                                 | True              |
| IS_DATA_IN_TERADATA_VANTAGE | Using TeradataDataFrame in scalable Vantage vs. PandasDataFrame with local *.csv/xls files | False             |
| SCHEMA_NAME                 | [Teradata Vantage] Schema Name                                                             | Air_Pollution     |
| TABLE_NAME                  | [Teradata Vantage] Table Name                                                              | Air_Pollution_Raw |
| IS_JUPYTERLAB               | Running in JupyterLab vs Python Dash/Vizro Dashboard                                       | True              |
| IS_DEBUG                    | Debug mode is active or not                                                                | False             |
| USE_DATA_PREFIX             | Prefix to use for data files: 'raw' | 'cleaned'                                            | raw               |
+-----------------------------+--------------------------------------------------------------------------------------------+-------------------+
Performing setup ...
Setup complete
... Logon successful
Connected as: xxxxxsql://demo_user:xxxxx@host.docker.internal/dbc
Engine(teradatasql://demo_user:***@host.docker.internal)

ℹ️ Load Data from data/raw

data/raw/raw_Penrose7-07May2020-to-30Apr2022.csv
data/raw/raw_Takapuna23-07May2020-to-30Apr2022.csv

ℹ️ Load Data from data/raw/raw_Penrose7-07May2020-to-30Apr2022.csv file --> rawdata DataFrame 📂

ℹ️ Load Data from data/raw/raw_Takapuna23-07May2020-to-30Apr2022.csv file --> rawdata DataFrame 📂

ℹ️ The Shape of the rawdata Dataframe: (37382, 15)
ℹ️ The Shape of the Dataframe rawdata_site1 (Penrose) and rawdata_site2 (Takapuna): (20023, 12) (17359, 11)

🎓 Describing the types of each attribute as numerical_columns (Continuous), ordinal_columns (Ordinal), or nominal_columns (Nominal) ...

ℹ️ Numerical Variables/Features: 
 ['AQI', 'PM10', 'PM2.5', 'SO2', 'NO', 'NO2', 'NOx', 'Wind_Speed', 'Wind_Dir', 'Air_Temp', 'Rel_Humidity']

ℹ️ Ordinal Variables/Features: 
 ['Timestamp']

ℹ️ Nominal Variables/Features: 
 Index(['Site', 'Site_Class', 'Country'], dtype='object')

🎓 1. [Site 1 - Penrose][numerical_columns_S1, nominal_columns_S1] Summary Statistics of the Dataframe such as the mean, maximum and minimum values ...

ℹ️ Numerical Variables/Features: 
 ['AQI', 'PM10', 'PM2.5', 'SO2', 'NO', 'NO2', 'NOx', 'Wind_Speed', 'Wind_Dir', 'Air_Temp', 'Rel_Humidity']

ℹ️ Ordinal Variables/Features: 
 ['Timestamp']

🎓 2. [Site 2 - Takapuna][numerical_columns_S2, nominal_columns_S2]  Summary Statistics of the Takapuna Dataframe such as the mean, maximum and minimum values ...

ℹ️ Numerical Variables/Features: 
 ['AQI', 'PM10', 'PM2.5', 'NO', 'NO2', 'NOx', 'Wind_Speed', 'Wind_Dir', 'Air_Temp', 'Rel_Humidity']

ℹ️ Ordinal Variables/Features: 
 ['Timestamp']

🎓 Listing variables with description...
+-------------------------+----------------------------------------------------------------+-----------+---------+----------+
| Variable Name           | Description                                                    | All Sites | Penrose | Takapuna |
+-------------------------+----------------------------------------------------------------+-----------+---------+----------+
| rawdata                 | Complete dataset containing all observations across all sites. |    [x]    |   [x]   |   [x]    |
| numerical_columns_site1 | Numerical columns specific to Site 1.                          |    [ ]    |   [x]   |   [ ]    |
| nominal_columns_site1   | Nominal columns specific to Site 1.                            |    [ ]    |   [x]   |   [ ]    |
| numerical_columns_site2 | Numerical columns specific to Site 2.                          |    [ ]    |   [ ]   |   [x]    |
| nominal_columns_site2   | Nominal columns specific to Site 2.                            |    [ ]    |   [ ]   |   [x]    |
| rawdata_site1           | Subset of raw data for Site 1.                                 |    [ ]    |   [x]   |   [ ]    |
| rawdata_site2           | Subset of raw data for Site 2.                                 |    [ ]    |   [ ]   |   [x]    |
+-------------------------+----------------------------------------------------------------+-----------+---------+----------+
In [2]:
rawdata_site1
Out[2]:
Timestamp AQI PM10 PM2.5 SO2 NO NO2 NOx Wind_Speed Wind_Dir Air_Temp Rel_Humidity
0 2020-05-07 00:00:00 26.0 16.5 16.1 1.8 60.3 40.9 101.2 0.6 316.0 8.0 78.0
1 2020-05-07 01:00:00 28.0 17.7 10.1 1.0 NaN NaN NaN 0.7 269.0 8.0 76.8
2 2020-05-07 02:00:00 28.0 15.0 10.3 0.1 16.0 29.2 45.3 1.0 180.0 8.0 78.4
3 2020-05-07 03:00:00 29.0 14.3 11.4 0.0 11.2 27.5 38.7 0.8 232.0 8.0 77.5
4 2020-05-07 04:00:00 30.0 8.8 10.6 -0.1 12.0 28.5 40.5 0.8 274.0 7.0 80.1
... ... ... ... ... ... ... ... ... ... ... ... ...
20018 2022-04-30 20:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
20019 2022-04-30 21:00:00 24.0 NaN 1.9 0.6 5.8 25.4 31.3 0.8 124.0 15.0 85.4
20020 2022-04-30 21:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
20021 2022-04-30 22:00:00 24.0 8.2 6.0 0.5 1.8 17.9 19.7 0.8 108.0 14.0 86.0
20022 2022-04-30 23:00:00 24.0 2.3 5.4 0.5 1.0 7.3 8.3 0.9 93.0 14.0 84.4

20023 rows × 12 columns

In [3]:
print("\n🎓 [Site1 - Penrose]  Summary Statistics of the {site1} cleaned_data1 Dataframe such as the mean, max/minimum values ...")
rawdata_site1.describe()
🎓 [Site 1 - Penrose] Summary Statistics of the rawdata_site1 Dataframe such as the mean, maximum and minimum values ...
Out[3]:
Timestamp AQI PM10 PM2.5 SO2 NO NO2 NOx Wind_Speed Wind_Dir Air_Temp Rel_Humidity
count 20023 17019.000000 16818.000000 16102.000000 16623.000000 16260.000000 16260.000000 16259.000000 17182.000000 17181.000000 17206.000000 17338.000000
mean 2021-05-06 21:58:02.954601984 28.233974 13.881787 5.348180 0.851639 10.041562 14.123253 24.151350 2.807368 173.234387 16.748925 69.426087
min 2020-05-07 00:00:00 0.000000 -6.400000 -12.300000 -4.400000 -1.900000 -1.400000 -1.300000 0.200000 1.000000 3.000000 26.900000
25% 2020-11-07 02:30:00 22.000000 8.400000 2.100000 0.100000 0.800000 3.900000 5.100000 1.600000 73.000000 14.000000 60.200000
50% 2021-05-08 19:00:00 27.000000 13.100000 4.700000 0.500000 3.600000 10.600000 14.900000 2.600000 211.000000 17.000000 70.400000
75% 2021-11-04 15:30:00 33.000000 18.400000 7.700000 1.000000 10.900000 20.700000 31.650000 3.800000 235.000000 20.000000 79.500000
max 2022-04-30 23:00:00 121.000000 196.700000 138.400000 25.200000 378.400000 83.100000 455.100000 8.200000 357.000000 28.000000 92.000000
std NaN 9.693312 8.044288 5.259335 1.507654 21.065762 12.748169 30.956155 1.538615 94.038611 4.013776 12.290750
In [4]:
rawdata_site2
Out[4]:
Timestamp AQI PM10 PM2.5 NO NO2 NOx Wind_Speed Wind_Dir Air_Temp Rel_Humidity
20023 2020-05-07 17:00:00 21.0 5.95 4.15 10.90 0.01715 28.00 2.50 242.0 15.5 68.15
20024 2020-05-07 18:00:00 21.0 5.65 5.10 8.20 0.01655 24.70 2.20 239.5 15.5 61.75
20025 2020-05-07 19:00:00 21.0 7.70 5.45 5.75 0.01325 19.00 2.10 244.0 15.0 59.45
20026 2020-05-07 20:00:00 21.0 8.20 5.45 3.50 0.00870 12.20 2.25 251.0 15.0 59.25
20027 2020-05-07 21:00:00 21.0 11.80 5.80 3.55 0.00930 12.90 2.10 261.0 15.0 61.75
... ... ... ... ... ... ... ... ... ... ... ...
37377 2022-04-30 19:00:00 14.0 4.75 3.30 0.60 0.00440 5.00 2.55 109.5 16.0 76.05
37378 2022-04-30 20:00:00 14.0 6.35 3.15 0.50 0.00365 4.15 2.45 105.5 16.0 74.50
37379 2022-04-30 21:00:00 14.0 6.05 2.80 0.40 0.00480 5.20 2.35 115.5 16.0 74.15
37380 2022-04-30 22:00:00 13.0 4.20 2.60 0.40 0.00555 5.90 1.95 122.5 16.0 75.95
37381 2022-04-30 23:00:00 13.0 5.00 2.80 0.35 0.00405 4.30 1.95 119.0 16.0 78.00

17359 rows × 11 columns

In [5]:
print("\n🎓 [Site2 - Takapuna]  Summary Statistics of the {site2} cleaned_data2 Dataframe such as the mean, maximum and minimum values ...")
rawdata_site2.describe()
🎓 [Site 2 - Takapuna] Summary Statistics of the rawdata_site2 Dataframe such as the mean, maximum and minimum values ...
Out[5]:
Timestamp AQI PM10 PM2.5 NO NO2 NOx Wind_Speed Wind_Dir Air_Temp Rel_Humidity
count 17359 17359.000000 16579.000000 15807.000000 16187.000000 16185.000000 16187.000000 16773.000000 16771.000000 16785.000000 16921.000000
mean 2021-05-04 08:00:00 25.152947 11.590983 6.251275 12.005912 0.012416 24.203614 2.501792 176.335132 16.280548 69.909931
min 2020-05-07 17:00:00 -13.000000 -3.800000 0.450000 -0.800000 -0.000500 -0.700000 0.200000 5.500000 3.500000 30.400000
25% 2020-11-04 12:30:00 19.000000 7.400000 4.250000 2.450000 0.004000 7.050000 1.250000 75.500000 13.500000 61.250000
50% 2021-05-04 08:00:00 24.000000 10.650000 5.550000 5.850000 0.009100 15.100000 2.300000 205.000000 16.000000 71.050000
75% 2021-11-01 03:30:00 30.000000 14.500000 7.350000 12.200000 0.017700 29.400000 3.500000 261.000000 19.500000 79.850000
max 2022-04-30 23:00:00 318.000000 437.250000 52.300000 352.450000 0.072100 401.950000 7.800000 346.500000 27.000000 90.350000
std NaN 10.637321 7.911251 3.507078 22.592626 0.010951 30.662867 1.468465 93.196025 3.959962 11.804317

Python Reusable Functions

🎓 This section examines the key statistics and missing-data percentages from the reports above to gain a better understanding of the data characteristics and time-series features.
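
🎓 As a reference, here is a minimal sketch (assuming the `rawdata_site1`/`rawdata_site2` DataFrames loaded above) of how the per-column missing-data percentages mentioned here can be computed:

import pandas as pd

def missing_data_report(df: pd.DataFrame, name: str) -> pd.Series:
    """Print and return the percentage of missing values per column, sorted descending."""
    pct_missing = df.isna().mean().mul(100).round(2).sort_values(ascending=False)
    print(f"\nMissing data (%) for {name}:")
    print(pct_missing)
    return pct_missing

# missing_data_report(rawdata_site1, 'Penrose (Site 1)')
# missing_data_report(rawdata_site2, 'Takapuna (Site 2)')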

⚙️ Data Preprocessing & Preparation¶

🎓 This section handles missing data by using function wrappers (also known as decorators) to modify and extend an existing imputation function. It also filters and displays potential outliers for further investigation and analysis.
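
🎓 For clarity, here is a minimal, hypothetical sketch of the "function wrapper" (decorator) idea described above; the cells below implement the imputation methods as static methods of a `DataImputer` class instead, but the wrapping principle is the same:

import functools
import pandas as pd

def log_imputation(impute_func):
    """Hypothetical decorator: wraps an imputation function and reports how many NaNs it filled."""
    @functools.wraps(impute_func)
    def wrapper(df: pd.DataFrame, *args, **kwargs) -> pd.DataFrame:
        nans_before = df.isna().sum().sum()
        result = impute_func(df, *args, **kwargs)   ## The wrapped function returns a DataFrame
        print(f"{impute_func.__name__}: filled {nans_before - result.isna().sum().sum()} NaNs")
        return result                               ## ... and so does the wrapper
    return wrapper

@log_imputation
def ffill_impute(df: pd.DataFrame) -> pd.DataFrame:
    """Toy imputation function used only to illustrate the wrapper."""
    return df.ffill()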

Clean the Time Series Dataset

In [6]:
def ensure_datetime_format(df, column_name='Timestamp'):
    """
    Ensures the specified column is in datetime format.
    Parameters:
    - df (pd.DataFrame): The dataframe to process.
    - column_name (str): The name of the column to format as datetime.
    Returns:
    - pd.DataFrame: DataFrame with the column in datetime format.
    """
    try:
        df[column_name] = pd.to_datetime(df[column_name], errors='coerce')
    except Exception as e:
        raise ValueError(f"Error converting {column_name} to datetime: {e}")
    return df

def clean_timeseries_dataset(df, time_col='Timestamp'):
    """
    Clean the dataset by ensuring datetime format for 'Timestamp', 
    removing rows where all values are NaN (except time_col),
    and removing duplicate rows based on 'Timestamp'.
    
    Parameters:
    - df (pd.DataFrame): Input dataset.
    - time_col (str): The column name for timestamp data.
    
    Returns:
    - pd.DataFrame: Cleaned dataset.
    """
    print(f'\n🎓 Cleans the dataset by removing rows where all values are NaN and removing duplicate rows based on {time_col}')
    if time_col not in df.columns:
        raise ValueError(f"Column '{time_col}' not found in DataFrame.")
    
    ## Convert 'Timestamp' column to datetime format
    df = ensure_datetime_format(df, column_name=time_col)

    ## Remove rows where all values are NaN, excluding 'Timestamp'
    cols_except_time = [col for col in df.columns if col != time_col]
    if IS_DEBUG:
        rows_all_nans = df[df[cols_except_time].isna().all(axis=1)]
        if not rows_all_nans.empty:
            print("Rows where all values are NaN, excluding 'Timestamp':\n", rows_all_nans)
    ## Remove rows where all values are NaN, excluding 'Timestamp'
    df = df.dropna(how='all', subset=cols_except_time)

    if IS_DEBUG:
        ## Show duplicate rows based on 'Timestamp'
        duplicates = df[df.duplicated(subset=[time_col], keep=False)]
        if not duplicates.empty:
            print("Duplicate rows based on 'Timestamp':\n", duplicates)
    
    ## Remove duplicate rows based on 'Timestamp'
    df = df.drop_duplicates(subset=[time_col], keep='first')

    ## Next Steps:
    # - Handling outliers in numeric columns
    # - Filling missing values with appropriate imputation techniques
    # - Normalizing or scaling numeric columns
    # - Encoding categorical variables if necessary

    ## Optional: sort by 'Timestamp' for chronological order
    # cleaned_df.sort_values(by=time_col, inplace=True)

    return df
In [7]:
cleaned_data_site1 = clean_timeseries_dataset(rawdata_site1)
print(f"☑️ Original rawdata_site1 dataset shape: {rawdata_site1.shape}")
print(f"✅ Cleaned  rawdata_site1 dataset shape: {cleaned_data_site1.shape}")

cleaned_data_site2 = clean_timeseries_dataset(rawdata_site2)
print(f"☑️ Original rawdata_site2 dataset shape: {rawdata_site2.shape}")
print(f"✅ Cleaned  rawdata_site2 dataset shape: {cleaned_data_site2.shape}")
🎓 Cleans the dataset by removing rows where all values are NaN and removing duplicate rows based on Timestamp
☑️ Original rawdata_site1 dataset shape: (20023, 12)
✅ Cleaned  rawdata_site1 dataset shape: (17375, 12)

🎓 Cleans the dataset by removing rows where all values are NaN and removing duplicate rows based on Timestamp
☑️ Original rawdata_site2 dataset shape: (17359, 11)
✅ Cleaned  rawdata_site2 dataset shape: (17359, 11)

Missing Data Imputation in Time-Series

🎓 Using **function wrappers** (also known as **decorators**) to modify and extend an existing imputation function. Decorators can indeed be an effective tool for modifying and extending the behavior of functions, including for tasks like data imputation. Each decorator does the following: (1) accepts a function as an input argument (which, when called, returns a DataFrame); and (2) returns a new function that, when called, also returns a DataFrame, but with imputed data depending on the specified imputation approach. A small toy comparison of these options is sketched below the list.

0. ~~**Option 0**: Fill NaN with Outlier or Zero~~ In this specific example, filling the missing value with an outlier value such as np.inf or 0 seems very naive. However, using values like -999 is sometimes a good idea.
1. **Option 1**: Fill NaN with Mean or Mode Value. Filling NaNs with the mean value is also naive and insufficient, and doesn't seem to be a good option.
2. **Option 2**: Fill NaN with Last Value with .ffill(). Filling NaNs with the last value can be a bit better.
3. **Option 3**: Fill NaN with Linearly Interpolated Value with .interpolate(). Filling NaNs with interpolated values is the best option in this small example, but it requires knowledge of the neighbouring values.
4. **Option 4**: Fill NaN with Time-Series Moving Average. Filling NaNs with a 1-Day or 1-Week Moving Average.
5. ~~**Option 5**: Fill NaN with Time-Series Model (Facebook/Meta Prophet)~~
6. ~~**Option 6**: Fill NaN with Machine Learning Model (XGBoost)~~
7. ~~**Option N**: Filling NaNs with something else (TBD)~~
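
🎓 A toy comparison of Options 1-4 (illustration only, using a short synthetic hourly series rather than the monitoring data):

import numpy as np
import pandas as pd

## Toy hourly series with a single and a double gap (synthetic values for illustration)
idx = pd.date_range('2022-01-01', periods=8, freq='H')
s = pd.Series([10.0, np.nan, 12.0, 13.0, np.nan, np.nan, 16.0, 17.0], index=idx)

comparison = pd.DataFrame({
    'original': s,
    'mean_fill': s.fillna(s.mean()),                                 ## Option 1: mean value
    'ffill': s.ffill(),                                              ## Option 2: last value
    'interpolated': s.interpolate(method='time'),                    ## Option 3: linear interpolation
    'rolling_3H': s.fillna(s.rolling('3H', min_periods=1).mean()),   ## Option 4: moving average
})
print(comparison)
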
🎓 The proposed approach for improving the accuracy of missing data imputation uses the average of data points from the exact same hour and day of the week in previous weeks. This approach may be ideal for hourly environmental data, where comparable conditions can recur on a weekly basis. The implementation below looks back up to four weeks and averages the available data points for imputation.
In [8]:
import holidays
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer, SimpleImputer

def before_and_after_values_visualization(data_site, imputed_data, method_description, numerical_columns, site, feature_column='PM10'):
    """
    This function plots the comparison of a site's data before and after imputation.
    It overlays the original and imputed values of the selected feature column in a single figure.
    
    Parameters:
    - data_site: pd.DataFrame - The original (cleaned) dataset for the site.
    - imputed_data: pd.DataFrame - The imputed dataset.
    - method_description: str - Short description of the imputation method (shown in the plot title).
    - numerical_columns: list - Numerical columns of the site (kept for interface consistency).
    - site: str - The name of the site/location for comparison.
    - feature_column: str - The feature column on which to focus the visualization.
    """
    if IS_DEBUG:
        print(f"before_and_after_values_visualization {data_site.shape}:", imputed_data.shape, site, feature_column)
    if not isinstance(data_site, pd.DataFrame):  ## Ensure data is a DataFrame before proceeding
        raise ValueError("Data must be a pandas DataFrame.")

    ## Plotting using Plotly: Setting up the figure to hold two subplots side by side
    fig = make_subplots(rows=1, cols=1, shared_yaxes=True, subplot_titles=[f'Data Plot of {feature_column} at {site}'])

    ## Original and Modified Data for Site1: darkorange vs lightblue | dodgerblue; dash='dot' vs 'solid'
    fig.add_trace(go.Scatter(x=imputed_data['Timestamp'], y=imputed_data[feature_column], mode='markers', name='Modified Site', 
                             line=dict(color='darkorange', width=0.3, dash='dot')), row=1, col=1)
    # fig.add_trace(go.Scatter(x=data_site['Timestamp'], y=data_site[feature_column], mode='lines+markers', name='Original Site',
    fig.add_trace(go.Scatter(x=data_site['Timestamp'], y=data_site[feature_column], mode='markers', name='Original Site',
                             line=dict(color='dodgerblue', width=0.3, dash='dot'), opacity=0.8), row=1, col=1)

    ## Adjust the layout for a better/consistent look and display the plot
    fig.update_layout(height=800, title_text=f'[{method_description}] Original vs Imputed Data Plot of {feature_column} at {site}',
                        xaxis_title="Timestamp", yaxis_title=feature_column, 
                        xaxis=dict(
                            title="Timestamp",
                            tickmode='auto',
                            tickformat="%Y-Q%q",  ## Format: Year - Q<Quarter>
                            dtick="M3",           ## Set ticks every 3 months to indicate quarters
                        ), legend=dict(orientation="h",  ## Horizontal legend
                            yanchor="bottom", y=1.02,   ## Position legend above the plot
                            xanchor="right", x=1,       ## Align legend to the right
                            title='Legend:',            ## Optional legend title
                            bgcolor='rgba(255,255,255,0.3)',  ## Semi-transparent white background
                            bordercolor="lightgrey", borderwidth=1))
    fig.show()
In [9]:
class DataImputer:
    """
    Class to perform various imputation methods on a DataFrame.
    Note: Imputation methods should directly accept a DataFrame and return the imputed DataFrame
    """
    def __init__(self):
        # if not isinstance(data, pd.DataFrame):
        #     raise ValueError("Data must be a pandas DataFrame")
        pass

    @staticmethod
    def mean_mode_imputation(data, columns=None):
        """
        Impute with Mean and Mode:
        Replaces missing numerical values with the MEAN and missing categorical values with the MODE.

        Args:
            data (DataFrame): DataFrame containing the data.
            columns (list, optional): Columns to impute. If None, applies to all columns.

        Returns:
            DataFrame: DataFrame with imputed values.
        """
        if columns is None:
            # Exclude 'Timestamp' column or any non-numeric columns
            columns = [col for col in data.columns if data[col].dtype in ['float64', 'int64'] and col != 'Timestamp']
        
        # print("Applying Mean/Mode Imputation...")
        for col in data.columns:
            if col == 'Timestamp':  # Skip the 'Timestamp' column
                continue

            if data[col].dtype in ['float64', 'int64']:
                imputer = SimpleImputer(strategy='mean')
            else:  ## Assuming categorical data: return_value[col].dtype.name == 'category'
                imputer = SimpleImputer(strategy='most_frequent')
            ## The fit_transform function expects a 2D array, hence the double square brackets
            data[col] = imputer.fit_transform(data[[col]])
            
        return data

    @staticmethod
    def forward_backward_imputation(data):
        """
        Impute with Forward and Backward propagation
        Using ffill and bfill as a naive method, to complete the data.

        Parameters:
            data (pd.DataFrame): DataFrame to impute.
        
        Returns:
            pd.DataFrame: DataFrame with imputed values.
        """
        # print("Applying Forward and Backward Imputation across the entire DataFrame ...")
        ## Forward fill (ffill) missing values, then backward fill (bfill) any remaining leading NaNs
        data = data.ffill().bfill()
        return data

    @staticmethod
    def interpolation_imputation(data, method='linear'):
        """
        Impute with Interpolation
        A linear interpolation method works by assuming a linear relationship between the observed points and drawing a straight line accordingly.
        Polynomial curves or splines.
        Interpolation is an effective approach to impute missing values in time series. 
        Polynomial interpolation fits a polynomial function to the observed data points and estimates the missing values based on this function.
        It works best if the time series is reasonably smooth. In case there are sudden changes or outliers, a simpler approach such as forward filling might be a better option.

        Parameters:
            data (pd.DataFrame): DataFrame to impute.
            method (str): Interpolation technique, default is 'linear'.
        
        Returns:
            pd.DataFrame: DataFrame with imputed values.
        """
        # print("Applying Interpolation Imputation...")
        ## Note: polynomial/spline interpolation (which requires an `order`) is handled by polynomial_imputation below
        data.interpolate(method=method, inplace=True, limit_direction='both')
        return data

    @staticmethod
    def polynomial_imputation(data, method='polynomial', order=2):
        """
        Impute with Interpolation
        A linear interpolation method works by assuming a linear relationship between the observed points and drawing a straight line accordingly.
        Polynomial curves or splines.
        Interpolation is an effective approach to impute missing values in time series. 
        Polynomial interpolation fits a polynomial function to the observed data points and estimates the missing values based on this function.
        It works best if the time series is reasonably smooth. In case there are sudden changes or outliers, a simpler approach such as forward filling might be a better option.
        """
        # print("Applying Polynomial Imputation...")
        if method == 'polynomial' and order is None:
            raise ValueError("Order must be specified for polynomial interpolation")
        # data.interpolate(method=method, order=order if order else None, inplace=True, limit_direction='both')
        ## Apply polynomial interpolation across columns
        for column in data.select_dtypes(include=['float', 'int']).columns:
            ## Only interpolate if there are at least two non-NA values
            if data[column].count() > 1:
                data[column] = data[column].interpolate(method='polynomial', order=order, limit_direction='both')
        # return_value = return_value.interpolate(method='spline')
        return data
    

    @staticmethod
    def moving_average_imputation(data, columns=None, window='168H', aggressive_iterative=False, min_periods=1, verbose=True):
    # def moving_average_imputation(data, columns=None, window='168H', min_periods=1, verbose=False, inplace=True):
        """
        Impute missing values using a moving average over a specified window. 
        This method is particularly useful for time series data where patterns such as seasonality are expected to play a significant role.

        Only NaNs are filled with the rolling mean, preserving the integrity of original non-NaN values. 
        A moving average is better at adapting to changes by considering a few nearby data points to compute the mean.
        Yet, it can still lead to biased results if the data is not missing at random.
        
        Examples:
            .rolling(window='52W' ## A year has 52 weeks (52 weeks * 7 days per week) approximately.
            .rolling(window='168H', min_periods=1).mean()  ## 168 hours = 7 days

        Parameters:
            data (DataFrame): The pandas DataFrame containing the data.
            columns (list of str, optional): Specific columns to impute. If None, all numeric columns are used.
            window (str): The size of the moving window (default is '168H' for 168 hours = 24 * 7 = 7 days).
            min_periods (int): Minimum number of observations in window required to have a value (default is 1).
            verbose (bool): If True, print additional information about the imputation process.
            # inplace (bool): If True, modify the DataFrame in place. Otherwise, return a modified copy.

        Returns:
            DataFrame: The DataFrame with imputed values.

        Notes:
            - A rolling window imputation can adapt to changes by considering nearby data points. However, it may
              introduce biases if the data is not missing at random.
            - This method currently supports only numeric data for rolling calculations.
        """
        ## Ensure the 'Timestamp' column is in datetime format and set as index only if it's not already the index
        if 'Timestamp' in data.columns:
            data['Timestamp'] = pd.to_datetime(data['Timestamp'], dayfirst=True, errors='coerce')
            data.set_index('Timestamp', inplace=True)

        if not pd.api.types.is_datetime64_any_dtype(data.index):
            raise ValueError("Data index must be a datetime type for time-based rolling windows.")

        ## Select columns if not provided
        if columns is None:
            ## Select only numeric columns for rolling operation
            columns = data.select_dtypes(include=[np.number]).columns.tolist()
    
        if verbose:
            print(f"\nApplying Moving Average Imputation over a specified {window} window on columns:", columns)

        ## Apply rolling mean to specified numeric columns
        for column in columns:
            ## Check if the column is of a numeric type: boolean (b), integer (i), floating (f), or complex number (c)
            if data[column].dtype.kind in 'bifc':  
                # if verbose:
                #     print(f"Processing column: {column}")
                ## Calculate the rolling mean and update NaNs only
                rolling_mean = data[column].rolling(window=window, min_periods=min_periods).mean()
                data[column] = data[column].fillna(rolling_mean)

        ## [Option] Aggressive Iterative pass to fill remaining NaNs
        if aggressive_iterative:
            ## Perform multiple passes over the data. In each pass, it tries to fill the NaNs with the rolling mean. 
            max_iterations = 10
            for iteration in range(max_iterations):
                change_made = False
                for column in columns:
                    if data[column].isna().sum() > 0:  ## Check if there are still NaNs to fill
                        # if verbose:
                        #     print(f"Iterative imputation for column: {column}, iteration {iteration+1}")
                        before_fill = data[column].isna().sum()
                        data[column] = data[column].fillna(
                            data[column].rolling(window=window, min_periods=min_periods).mean()
                        )
                        after_fill = data[column].isna().sum()
                        if after_fill < before_fill:
                            change_made = True

                if not change_made:  ## No more NaNs can be filled
                    if verbose:
                        print(f"Processing column: {column}. Iterative pass to fill remaining NaNs in {iteration+1} iteration(s).")
                        # print(f"No more changes made in iteration {iteration+1}. Ending iterative imputation.")
                    break
        
        ## Check for any remaining NaNs and attempt further imputation if necessary
        if not aggressive_iterative:
            if data.isna().any().any():
                if verbose:
                    print("Second pass for any remaining NaNs after initial imputation using bfill & ffill.")
                for column in columns:
                    data[column] = data[column].bfill().ffill()

        ## Reset index to turn 'Timestamp' back into a column
        data.reset_index(inplace=True)
        
        return data


## TODO: Future Work!!!
#
# !pip install --upgrade prophet xgboost --user
# import functools
# 
# from xgboost import XGBRegressor
# from prophet import Prophet
# from sklearn.ensemble import RandomForestRegressor

#     ## TODO
#     @staticmethod
#     def prophet_imputation(data, date_column, target_columns, holidays=None):
#         """
#         Decorator for imputing missing values using Facebook/Meta Prophet time series forecasting model.
#         
#         Parameters:
#         - dates_column: Name of the column containing the dates.
#         - columns_to_impute: Columns to impute. If 'all', all columns except the dates column are imputed.
#         
#         Returns:
#         - A wrapper function for the imputation.
#         """
#         # print("Applying Time-Series Prophet Imputation...")    
#         if date_column not in data.columns:
#             raise ValueError(f"{date_column} is not a column in the DataFrame")
# 
#         for column in target_columns:
#             ## Prepare DataFrame for Prophet
#             df_prophet = pd.DataFrame({
#                 'ds': data[date_column],
#                 'y': data[column],
#                 # 'y': data[column].interpolate()  ## Improved handling of missing values ?
#             }).dropna()
#             
#             ## Initialize and fit Prophet model
#             model = Prophet(
#                 daily_seasonality=True,
#                 weekly_seasonality=True,
#                 yearly_seasonality='auto',
#                 holidays=holidays
#             )
#             model.fit(df_prophet)
#             ## TODO: Make future dataframe and predict
#             # future = model.make_future_dataframe(periods=0, freq='H')
#             # forecast = model.predict(future)
#             forecast = model.predict(df_prophet[['ds']])
#             ## Fill missing values in original data
#             forecast.set_index('ds', inplace=True)
#             data.set_index(date_column, inplace=True)
#             # data[column].update(forecast)
#             ## 'yhat' from Prophet forecast can be used as imputed values
#             data[column].fillna(forecast['yhat'], inplace=True)
#             data.reset_index(inplace=True)
#         return data
# 
# 
#     ## TODO
#     @staticmethod
#     def xgboost_imputation(data, target_columns):
#         """
#         Impute missing values in a DataFrame using XGBoost with Scikit-learn's IterativeImputer.
#         
#         Parameters:
#         - data: DataFrame to be imputed.
#         - target_columns: Columns to impute, could be 'all' or a list of column names.
#         
#         Returns:
#         - The DataFrame with imputed values.
#         """
#         # print("Applying Machine-Learning XGBoost Interpolation Imputation...")
#         ## TODO: Set default estimator if none provided -> Directly use XGBRegressor as the estimator
#         # chosen_estimator = estimator if estimator is not None else XGBRegressor(n_estimators=100, random_state=0)
#         # chosen_estimator = XGBRegressor(n_estimators=100, random_state=0)
#         ## Perform imputation on numerical data
#         # imputer = IterativeImputer(estimator=chosen_estimator, max_iter=10, random_state=0)
#         imputer = IterativeImputer(
#                     estimator=XGBRegressor(n_estimators=100, random_state=0),
#                     max_iter=10,
#                     random_state=0
#                 )
#         ## TODO: Apply imputer only on specified target columns
#         # data[target_columns] = imputer.fit_transform(data[target_columns])
#         # for column in target_columns:
#         #     if column not in data.columns:
#         #         raise ValueError(f"{column} is not a column in the DataFrame")
#         #     data[[column]] = imputer.fit_transform(data[[column]])
#         # Perform imputation on specified target columns or all numerical columns if 'all'
#         if target_columns == 'all':
#             numerical_data = data.select_dtypes(include=[np.number])
#             imputed_numerical_data = pd.DataFrame(IterativeImputer(estimator=chosen_estimator, max_iter=10, random_state=0).fit_transform(numerical_data), columns=numerical_data.columns, index=data.index)
#             data.update(imputed_numerical_data)
#         else:
#             for column in target_columns:
#                 if column in data.columns:
#                     column_data = data[[column]].select_dtypes(include=[np.number])
#                     imputed_column_data = IterativeImputer(estimator=chosen_estimator, max_iter=10, random_state=0).fit_transform(column_data)
#                     data[column] = imputed_column_data
#         return data
In [10]:
## Instantiate the class with rawdata
imputer = DataImputer()

nz_holidays = holidays.country_holidays('NZ')
# nz_holidays = holidays.country_holidays('NZ', subdiv='AUK')

## List of imputation methods to apply, directly referencing the methods of DataImputer as tuples of (method_function, method_description)
# (lambda data: DataImputer.mean_mode_imputation(data, columns=numerical_columns), "SimpleImputer Mean/Mode Imputation"),
# (lambda data: DataImputer.xgboost_imputation(data, target_columns=['PM2.5', 'PM10', 'NO2']), "Machine-Learning XGBoost Imputation"),
# (lambda data: DataImputer.prophet_imputation(data, date_column='Timestamp', target_columns=['PM2.5', 'PM10', 'NO2'], holidays=nz_holidays), "Time-Series Prophet Imputation"),
# (lambda data: DataImputer.moving_average_imputation(data, columns=timestamp_and_numerical_columns, window='168H', min_periods=1, verbose=True), "Moving Average Imputation"),
imputation_methods = [
    (lambda data: DataImputer.mean_mode_imputation(data), "SimpleImputer Mean/Mode Imputation"),
    (DataImputer.forward_backward_imputation, "Forward/Backward Imputation"),
    (lambda data: DataImputer.interpolation_imputation(data, method='linear'), "Linear Interpolation"),
    (lambda data: DataImputer.polynomial_imputation(data, method='polynomial', order=2), "Polynomial Interpolation (Order 2)"),
    (lambda data: DataImputer.moving_average_imputation(data, window='24H',  min_periods=1, verbose=True), "1-DAY Moving Average Imputation"),
    (lambda data: DataImputer.moving_average_imputation(data, window='168H', min_periods=1, verbose=True), "1-WEEK Moving Average Imputation"),   
]


## Iterate through each imputation method, apply it, and visualize the results (data vs imputed data visualization)
for method_function, method_description in imputation_methods:
    # print(f"Applying imputation method function: {method_function}")
    # print(f"Applying imputation method description: {method_description}")
    print(f'✅ [DataImputer] {method_description}: Original vs Imputed Data Plots of PM2.5/PM10 at Penrose/Takapuna')

    if IS_DEBUG:
        print(f"{method_description} --------------Before Imputation--------------")
        print(cleaned_data_site1.isnull().sum(axis = 0))
        print(cleaned_data_site2.isnull().sum(axis = 0))
    ## Step 1. Apply the imputation function to a copy of the data
    # imputed_cleaned_data_site1 = method_function(cleaned_data_site1.copy())  ## Calls the DataImputer static method (or lambda wrapper)
    imputed_cleaned_data_site2 = method_function(cleaned_data_site2.copy())  ## Calls the DataImputer static method (or lambda wrapper)
    if IS_DEBUG:
        # print(type(imputed_cleaned_data_site1))
        print(f"{method_description} --------------After Imputation--------------")
        # print(imputed_cleaned_data_site1.isnull().sum(axis = 0))
        print(imputed_cleaned_data_site2.isnull().sum(axis = 0))
    
    ## Introduce a pause (60 seconds) to ensure plots are fully rendered before moving to the next item
    time.sleep(60)
    
    ## Step 2. After imputation, we can now proceed with visualization or further processing
    # before_and_after_values_visualization(cleaned_data_site1.copy(), imputed_cleaned_data_site1.copy(), method_description, numerical_columns_site1, 'Penrose', feature_column='PM2.5')
    # before_and_after_values_visualization(cleaned_data_site1.copy(), imputed_cleaned_data_site1.copy(), method_description, numerical_columns_site1, 'Penrose', feature_column='PM10')
    
    before_and_after_values_visualization(cleaned_data_site2.copy(), imputed_cleaned_data_site2.copy(), method_description, numerical_columns_site2, 'Takapuna', feature_column='PM2.5')
    before_and_after_values_visualization(cleaned_data_site2.copy(), imputed_cleaned_data_site2.copy(), method_description, numerical_columns_site2, 'Takapuna', feature_column='PM10')
✅ [DataImputer] SimpleImputer Mean/Mode Imputation: Original vs Imputed Data Plots of PM2.5/PM10 at Penrose/Takapuna
✅ [DataImputer] Forward/Backward Imputation: Original vs Imputed Data Plots of PM2.5/PM10 at Penrose/Takapuna
✅ [DataImputer] Linear Interpolation: Original vs Imputed Data Plots of PM2.5/PM10 at Penrose/Takapuna
✅ [DataImputer] Polynomial Interpolation (Order 2): Original vs Imputed Data Plots of PM2.5/PM10 at Penrose/Takapuna
✅ [DataImputer] 1-DAY Moving Average Imputation: Original vs Imputed Data Plots of PM2.5/PM10 at Penrose/Takapuna

Applying Moving Average Imputation over a specified 24H window on columns: ['AQI', 'PM10', 'PM2.5', 'NO', 'NO2', 'NOx', 'Wind_Speed', 'Wind_Dir', 'Air_Temp', 'Rel_Humidity']
Second pass for any remaining NaNs after initial imputation using bfill & ffill.
✅ [DataImputer] 1-WEEK Moving Average Imputation: Original vs Imputed Data Plots of PM2.5/PM10 at Penrose/Takapuna

Applying Moving Average Imputation over a specified 168H window on columns: ['AQI', 'PM10', 'PM2.5', 'NO', 'NO2', 'NOx', 'Wind_Speed', 'Wind_Dir', 'Air_Temp', 'Rel_Humidity']
Second pass for any remaining NaNs after initial imputation using bfill & ffill.

[Advanced] Historical Average Imputation 🦅

🎓 Imputes missing values in a DataFrame by calculating the average of similar (the same weekday and hour) historical data points in previous (max_lookback_periods=4) weeks.
In [11]:
def weekday_from_index(day_index):
    """
    Convert a day index (0-6) to its corresponding weekday name.
    
    Parameters:
    - day_index (int): The index of the day where Monday is 0 and Sunday is 6.
    
    Returns:
    - str: The name of the weekday corresponding to the input index.
    """
    weekdays = ["0.Monday", "1.Tuesday", "2.Wednesday", "3.Thursday", "4.Friday", "5.Saturday", "6.Sunday"]
    if 0 <= day_index < len(weekdays):
        return weekdays[day_index]
    else:
        raise ValueError("Day index must be in the range 0 to 6.")

def impute_with_historical_averages(data, column, max_lookback_periods, verbose=True):
    """
    Helper function to impute missing values for a specific column in the dataset using historical averages.
    Historical averages are calculated using data from the same hour and weekday from previous weeks.
    
    Parameters:
    - data (pd.DataFrame): DataFrame with a datetime index.
    - column (str): The column in the DataFrame to impute.
    - max_lookback_periods (int): Maximum number of weeks to look back for historical data.
    - verbose (bool): If set to True, prints out additional debugging information.
    
    Returns:
        pd.Series: A series with imputed values.
    """
    imputed_values = data[column].copy()
    missing_indices = imputed_values[imputed_values.isna()].index
    if verbose:
        # print(f"Missing indices for column '{column}': {missing_indices}")
        print(f"Imputing column: {column} & number of missing entries: {len(missing_indices)}")
    
    ## Loop over all missing data points
    for current_missing_index in missing_indices:
        historical_values = []
        valid_data_found = False
        debug_info = []
        
        ## Collect historical data points from previous weeks
        for weeks_back in range(1, max_lookback_periods + 1):
            target_hist_time = current_missing_index - pd.DateOffset(weeks=weeks_back)
            if target_hist_time < data.index.min():
                ## Skip this lookback week: no historical data available that far back
                if verbose:
                    print(f"No historical data beyond: {data.index.min()}")
                continue
            if target_hist_time in data.index:
                ## Retrieve the historical value if it exists and is not NaN
                historical_value = data.at[target_hist_time, column]
                if pd.notna(historical_value):
                    historical_values.append(historical_value)
                    valid_data_found = True
                    ## Collecting additional debug information
                    debug_info.append((target_hist_time, target_hist_time.weekday(), target_hist_time.hour, historical_value))

        # ## TODO: In case if data is sparse
        # ## while (count < max_lookback_periods) and still have previous data points (the same weekday and hour in previous weeks):
        # count = 0
        # while count < max_lookback_periods:
        #     target_hist_time = current_missing_index - pd.DateOffset(weeks=count)
        #     if target_hist_time < data.index.min():
        #         break  ## Break if historical data point goes beyond the available data range
        #     if target_hist_time in data.index and pd.notna(data.at[target_hist_time, column]):
        #         ## process the data
        #     ## Perform calculations or data retrieval ...
        #     count += 1
            
        ## Compute the mean of the historical values if any are available
        # if pd.notna(historical_value):
        if valid_data_found:
            mean_value = np.mean(historical_values)
            imputed_values.at[current_missing_index] = mean_value
            if verbose:
                print(f"Imputed {current_missing_index} (Day: {weekday_from_index(current_missing_index.weekday())}, Hour: {current_missing_index.hour}): {mean_value}")
                print("Historical data used:")
                for info in debug_info:
                    print(f"  - Date: {info[0]} (Day: {info[1]}, Hour: {info[2]}), Value: {info[3]}")
        else:
            if verbose:
                print(f"No valid historical data found for {current_missing_index} (Day: {weekday_from_index(current_missing_index.weekday())}, Hour: {current_missing_index.hour}). Cannot impute.")
            ## Optional: handle cases with no available historical data
            ## Could assign a default value or use another imputation method HERE
            ## [ ] Assign a global median, mean, or mode (if precomputed)
            ## [ ] Use another method like interpolation
            ## [x] Leave as NaN if downstream processes can handle it
            continue

    # data[column] = imputed_values
    return imputed_values
In [12]:
def historical_average_imputation(data, datetime_column_name='Timestamp', columns=None, max_lookback_periods=4, verbose=True):
    """
    Imputes missing values in a DataFrame by calculating the average of similar (the same weekday and hour) historical data points in previous (max_lookback_periods=4) weeks.

    Parameters:
    - data (pd.DataFrame): DataFrame with a datetime index and column for values.
    - columns (list of str, optional): Specific columns to impute. Defaults to all numeric columns if None.
    - max_lookback_periods (int): Number of weeks/periods to look back for similar historical data points.
    - verbose (bool): If True, print additional information about the imputation process.

    Returns:
    - pd.DataFrame: The DataFrame with imputed values.
    """
    ## Ensure the index is a DatetimeIndex
    if not isinstance(data.index, pd.DatetimeIndex):
        raise ValueError("Index must be of type DatetimeIndex for proper timedelta calculations.")

    if columns is None:
        columns = data.select_dtypes(include=[np.number]).columns.tolist()
    
    start_time = time.time()
    imputed_series = data.copy()
    ## Iterate over each column to apply imputation (once per column)
    for target_column in columns:
        if verbose:
            print(f"Starting imputation for {target_column}.")
        imputed_series[target_column] = impute_with_historical_averages(data, target_column, max_lookback_periods, verbose=verbose)
        
    ## Optional: handle cases with no available historical data
    if imputed_series.isna().any().any():
        if verbose:
            print("Second pass for any remaining NaNs after initial imputation using bfill & ffill.")
        for column in columns:
            imputed_series[column] = imputed_series[column].bfill().ffill()

    # if imputed_series.isna().any().any():
    #     ## 1-WEEK Moving Average Imputation  
    #     imputed_series = DataImputer.moving_average_imputation(imputed_series.copy(), window='168H', min_periods=1, verbose=True)

    # if verbose:
    elapsed_time = time.time() - start_time
    print(f"Completed imputation for all columns in {elapsed_time:.2f} seconds.")

    return imputed_series
🎓 Penrose: imputed_cleaned_data_site1
In [13]:
if cleaned_data_site1.index.name != 'Timestamp':
    ## Check if 'Timestamp' is not the index and set it as the index if needed
    ## Set 'Timestamp' as the index of the DataFrame
    cleaned_data_site1.set_index('Timestamp', inplace=True)

imputed_cleaned_data_site1 = historical_average_imputation(cleaned_data_site1.copy(), columns=numerical_columns_site1, verbose=False)

## Check if there are any NaNs left in the simulated data
# remaining_nans = imputed_cleaned_data_site1.isna().any().any()
total_nans = imputed_cleaned_data_site1.isna().sum().sum()
print(f"Total NaNs in the DataFrame: {total_nans}")
Completed imputation for all columns in 11.09 seconds.
Total NaNs in the DataFrame: 0
In [14]:
## Check if the index name of imputed_cleaned_data_site1 is 'Timestamp'
if imputed_cleaned_data_site1.index.name == 'Timestamp':
    ## Reset index to turn 'Timestamp' back into a column in imputed_cleaned_data_site1
    imputed_cleaned_data_site1.reset_index(inplace=True)
    print("Index 'Timestamp' has been reset to a column: imputed_cleaned_data_site1")
else:
    print("Index name is not 'Timestamp'. No changes made.")
    
## Check if the index name of cleaned_data_site1 is 'Timestamp'
if cleaned_data_site1.index.name == 'Timestamp':
    ## Reset index to turn 'Timestamp' back into a column in imputed_cleaned_data_site1
    cleaned_data_site1.reset_index(inplace=True)
    print("Index 'Timestamp' has been reset to a column: cleaned_data_site1")
else:
    print("Index name is not 'Timestamp'. No changes made.")
Index 'Timestamp' has been reset to a column: imputed_cleaned_data_site1
Index 'Timestamp' has been reset to a column: cleaned_data_site1
In [15]:
print(f"☑️ [Historical Average Imputation] cleaned_data_site1 shape: {cleaned_data_site1.shape}")
before_and_after_values_visualization(cleaned_data_site1.copy(), imputed_cleaned_data_site1.copy(), 'historical_average_imputation', numerical_columns_site1, 'Penrose', feature_column='PM2.5')
before_and_after_values_visualization(cleaned_data_site1.copy(), imputed_cleaned_data_site1.copy(), 'historical_average_imputation', numerical_columns_site1, 'Penrose', feature_column='PM10')

print(f"✅ [Historical Average Imputation] imputed_cleaned_data_site1 imputed shape: {imputed_cleaned_data_site1.shape}")
☑️ [Historical Average Imputation] cleaned_data_site1 shape: (17375, 12)
✅ [Historical Average Imputation] imputed_cleaned_data_site1 imputed shape: (17375, 12)
🎓 Takapuna: imputed_cleaned_data_site2
In [16]:
if cleaned_data_site2.index.name != 'Timestamp':
    ## Check if 'Timestamp' is not the index and set it as the index if needed
    ## Set 'Timestamp' as the index of the DataFrame
    cleaned_data_site2.set_index('Timestamp', inplace=True)

imputed_cleaned_data_site2 = historical_average_imputation(cleaned_data_site2.copy(), columns=numerical_columns_site2, verbose=False)

## Check if there are any NaNs left in the simulated data
# remaining_nans = imputed_cleaned_data_site2.isna().any().any()
total_nans = imputed_cleaned_data_site2.isna().sum().sum()
print(f"Total NaNs in the DataFrame: {total_nans}")

# if imputed_cleaned_data_site2.isna().any().any():
#     ## 1-WEEK Moving Average Imputation  
#     imputed_cleaned_data_site2 = DataImputer.moving_average_imputation(imputed_cleaned_data_site2.copy(), window='168H', min_periods=1, verbose=True)
Completed imputation for all columns in 13.37 seconds.
Total NaNs in the DataFrame: 0
In [17]:
## Check if the index name of imputed_cleaned_data_site2 is 'Timestamp'
if imputed_cleaned_data_site2.index.name == 'Timestamp':
    ## Reset index to turn 'Timestamp' back into a column in imputed_cleaned_data_site2
    imputed_cleaned_data_site2.reset_index(inplace=True)
    print("Index 'Timestamp' has been reset to a column: imputed_cleaned_data_site2")
else:
    print("Index name is not 'Timestamp'. No changes made.")
    
## Check if the index name of cleaned_data_site2 is 'Timestamp'
if cleaned_data_site2.index.name == 'Timestamp':
    ## Reset index to turn 'Timestamp' back into a column in cleaned_data_site2
    cleaned_data_site2.reset_index(inplace=True)
    print("Index 'Timestamp' has been reset to a column: cleaned_data_site2")
else:
    print("Index name is not 'Timestamp'. No changes made.")
Index 'Timestamp' has been reset to a column: imputed_cleaned_data_site2
Index 'Timestamp' has been reset to a column: cleaned_data_site2
In [18]:
print(f"☑️ [Historical Average Imputation] cleaned_data_site2 shape: {cleaned_data_site2.shape}")

# before_and_after_values_visualization(cleaned_data_site2.copy(), imputed_cleaned_data_site2, method_description, numerical_columns_site2, 'Takapuna', feature_column='PM10')
before_and_after_values_visualization(cleaned_data_site2.copy(), imputed_cleaned_data_site2, 'historical_average_imputation', numerical_columns_site2, 'Takapuna', feature_column='PM2.5')
before_and_after_values_visualization(cleaned_data_site2.copy(), imputed_cleaned_data_site2, 'historical_average_imputation', numerical_columns_site2, 'Takapuna', feature_column='PM10')

print(f"✅ [Historical Average Imputation] imputed_cleaned_data_site2 imputed shape: {imputed_cleaned_data_site2.shape}")
☑️ [Historical Average Imputation] cleaned_data_site2 shape: (17359, 11)
✅ [Historical Average Imputation] imputed_cleaned_data_site2 imputed shape: (17359, 11)
🎓 End of Missing Data Imputation in Time-Series

Filter Outliers

🎓 Proportional Winsorization Based on Data Distribution

This approach first identifies the actual minimum and maximum values within the acceptable range (beyond the IQR bounds) and then calculates the proportion of data points beyond these winsorized limits. It's particularly useful when you want to apply winsorization directly based on the distribution of your data, adjusting only the extreme values that fall outside the calculated bounds. As a result, this approach is more precise and aligns better with the principle of winsorization—limiting the influence of extreme outliers without removing them from the dataset.
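
🎓 A small worked example of the calculation (toy values, not the Penrose/Takapuna data), assuming only pandas and `scipy.stats.mstats.winsorize`:

import pandas as pd
from scipy.stats.mstats import winsorize

## Toy column with one extreme high value
s = pd.Series([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 40.0])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower_bound, upper_bound = q1 - 1.5 * iqr, q3 + 1.5 * iqr   ## acceptable range

## Proportion of points beyond the acceptable range -> Winsorization limits
lower_limit = (s < lower_bound).mean()   ## here 0.0 (no low outliers)
upper_limit = (s > upper_bound).mean()   ## here 0.1 (one of ten points)

print(f"Bounds: [{lower_bound:.2f}, {upper_bound:.2f}], limits: ({lower_limit:.2f}, {upper_limit:.2f})")
print(winsorize(s.to_numpy(), limits=(lower_limit, upper_limit)))   ## 40.0 is capped at 5.5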

In [19]:
from scipy.stats.mstats import winsorize
IS_WINSORIZING_OUTLIERS = True

def calculate_iqr_bounds(data, column):
    """
    Calculate the Interquartile Range (IQR) bounds for a given column in a DataFrame.

    Args:
    - data (pd.DataFrame): The DataFrame containing the data.
    - column (str): The column name for which to calculate the IQR bounds.

    Returns:
    - tuple: (lower_bound, upper_bound, Q1, Q3, IQR).
    """
    ## Calculate IQR
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    ## Determine bounds for Winsorization based on IQR: the limits for potential outliers
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return lower_bound, upper_bound, Q1, Q3, IQR


# def detect_and_winsorize_outliers(data, numerical_columns):
def detect_and_handle_outliers(data, numerical_columns):
    """
    Detects outliers in numerical columns of a DataFrame using the IQR method and handles them using Winsorization.
    e.g. Apply the IQR method to detect outliers & Winsorization to handle/normalize outliers

    Limiting the influence of extreme outliers without removing them from the dataset.

    Args:
    - data (pd.DataFrame): The DataFrame to process.
    - numerical_columns (list): A list of column names to check for and handle outliers.

    Returns:
    - pd.DataFrame: A DataFrame with the outliers handled.
    """
    ## Initialize an empty list to store summary statistics
    outlier_summaries = []
    for col in numerical_columns:
        lower_bound, upper_bound, Q1, Q3, IQR = calculate_iqr_bounds(data, col)

        ## Calculate percentiles for Winsorization (initial estimate based on the IQR bounds)
        lower_winsorize_limit = data[col][data[col] < lower_bound].count() / data[col].count()
        upper_winsorize_limit = data[col][data[col] > upper_bound].count() / data[col].count()

        ## Collecting and calculating outlier metrics for summary
        # outlier_mask = (data[col] < lower_bound) | (data[col] > upper_bound)
        # num_outliers = outlier_mask.sum()
        ## Consider only non-NaN values for total records
        # num_outliers = outlier_mask.notnull().sum()

        ## Determine the percentages of the data that are outliers; this is for Winsorization limits
        num_lower_outliers = data[col][data[col] < lower_bound].count()
        num_upper_outliers = data[col][data[col] > upper_bound].count()
        num_outliers = num_lower_outliers + num_upper_outliers
        lower_outlier_percentage = num_lower_outliers / num_outliers if num_outliers else 0.0
        upper_outlier_percentage = num_upper_outliers / num_outliers if num_outliers else 0.0

        percentage_outliers = num_outliers / len(data) * 100

        if IS_WINSORIZING_OUTLIERS:
            ## Actual min and max observed within the acceptable (IQR-based) range
            valid_min = data[col][(data[col] >= lower_bound) & (data[col] <= upper_bound)].min()
            valid_max = data[col][(data[col] >= lower_bound) & (data[col] <= upper_bound)].max()

            ## Fractions of observations beyond those observed limits --> Winsorization limits
            lower_winsorize_limit = data[col][data[col] < valid_min].count() / len(data[col])
            upper_winsorize_limit = data[col][data[col] > valid_max].count() / len(data[col])

            ## Apply Winsorization with the per-column limits calculated above.
            ## Note: winsorize caps values beyond the limits instead of removing them.
            data[col] = winsorize(data[col], limits=(lower_winsorize_limit, upper_winsorize_limit))

        ## Collect Outlier Summary statistics
        outlier_summaries.append({
            'Column': col,
            'Lower Winsorization Limit': lower_winsorize_limit,
            'Upper Winsorization Limit': upper_winsorize_limit,
            'Q1': Q1,
            'Q3': Q3,
            'IQR': IQR,
            'Lower Bound': lower_bound,
            'Upper Bound': upper_bound,
            'Number of Outliers': num_outliers,
            'Percentage of Outliers': percentage_outliers,
            'No. of Lower Outliers': num_lower_outliers,
            'No. of Upper Outliers': num_upper_outliers,
            'Lower Outlier %': num_lower_outliers / data[col].notnull().sum(),
            'Upper Outlier %': num_upper_outliers / data[col].notnull().sum()
        })

    outlier_summary_df = pd.DataFrame(outlier_summaries)

    return data, outlier_summary_df
In [20]:
## Detect and Handle/Retrieve Outliers
winsorized_imputed_cleaned_data_site1, outlier_summary_df1 = detect_and_handle_outliers(imputed_cleaned_data_site1.copy(), numerical_columns_site1)
winsorized_imputed_cleaned_data_site2, outlier_summary_df2 = detect_and_handle_outliers(imputed_cleaned_data_site2.copy(), numerical_columns_site2)

## Display the outlier summary
print("Outlier Summary for Site 1 (Penrose):\n")
outlier_summary_df1

## The result is a DataFrame 'rawdata' with outliers normalized/adjusted based on IQR and Winsorization
# winsorized_imputed_cleaned_data_site1.head()
# winsorized_imputed_cleaned_data_site2.head()

## TODO
## After the IQR-based outlier detection and Winsorization, the winsorized DataFrames are ready for further analysis such as PCA
Outlier Summary for Site 1 (Penrose):

Out[20]:
|    | Column       | Lower Winsorization Limit | Upper Winsorization Limit | Q1   | Q3    | IQR   | Lower Bound | Upper Bound | Number of Outliers | Percentage of Outliers | No. of Lower Outliers | No. of Upper Outliers | Lower Outlier % | Upper Outlier % |
|----|--------------|---------------------------|---------------------------|------|-------|-------|-------------|-------------|--------------------|------------------------|-----------------------|-----------------------|-----------------|-----------------|
| 0  | AQI          | 0.003453                  | 0.022964                  | 22.0 | 33.0  | 11.0  | 5.50        | 49.50       | 459                | 2.641727               | 60                    | 399                   | 0.003453        | 0.022964        |
| 1  | PM10         | 0.000058                  | 0.016403                  | 8.5  | 18.3  | 9.8   | -6.20       | 33.00       | 286                | 1.646043               | 1                     | 285                   | 0.000058        | 0.016403        |
| 2  | PM2.5        | 0.001439                  | 0.036777                  | 2.2  | 7.7   | 5.5   | -6.05       | 15.95       | 664                | 3.821583               | 25                    | 639                   | 0.001439        | 0.036777        |
| 3  | SO2          | 0.000058                  | 0.071252                  | 0.1  | 1.0   | 0.9   | -1.25       | 2.35        | 1239               | 7.130935               | 1                     | 1238                  | 0.000058        | 0.071252        |
| 4  | NO           | 0.000000                  | 0.084259                  | 0.9  | 10.7  | 9.8   | -13.80      | 25.40       | 1464               | 8.425899               | 0                     | 1464                  | 0.000000        | 0.084259        |
| 5  | NO2          | 0.000000                  | 0.028892                  | 4.1  | 20.3  | 16.2  | -20.20      | 44.60       | 502                | 2.889209               | 0                     | 502                   | 0.000000        | 0.028892        |
| 6  | NOx          | 0.000000                  | 0.057784                  | 5.3  | 31.1  | 25.8  | -33.40      | 69.80       | 1004               | 5.778417               | 0                     | 1004                  | 0.000000        | 0.057784        |
| 7  | Wind_Speed   | 0.000000                  | 0.002935                  | 1.6  | 3.8   | 2.2   | -1.70       | 7.10        | 51                 | 0.293525               | 0                     | 51                    | 0.000000        | 0.002935        |
| 8  | Wind_Dir     | 0.000000                  | 0.000000                  | 74.0 | 235.0 | 161.0 | -167.50     | 476.50      | 0                  | 0.000000               | 0                     | 0                     | 0.000000        | 0.000000        |
| 9  | Air_Temp     | 0.001151                  | 0.000000                  | 14.0 | 20.0  | 6.0   | 5.00        | 29.00       | 20                 | 0.115108               | 20                    | 0                     | 0.001151        | 0.000000        |
| 10 | Rel_Humidity | 0.000403                  | 0.000000                  | 60.2 | 79.5  | 19.3  | 31.25       | 108.45      | 7                  | 0.040288               | 7                     | 0                     | 0.000403        | 0.000000        |
In [21]:
## Display the outlier summary
print("Outlier Summary or Site 2 (Takapuna):\n")
outlier_summary_df2
Outlier Summary for Site 2 (Takapuna):

Out[21]:
|   | Column       | Lower Winsorization Limit | Upper Winsorization Limit | Q1       | Q3       | IQR       | Lower Bound | Upper Bound | Number of Outliers | Percentage of Outliers | No. of Lower Outliers | No. of Upper Outliers | Lower Outlier % | Upper Outlier % |
|---|--------------|---------------------------|---------------------------|----------|----------|-----------|-------------|-------------|--------------------|------------------------|-----------------------|-----------------------|-----------------|-----------------|
| 0 | AQI          | 0.001498                  | 0.036350                  | 19.00000 | 30.0000  | 11.00000  | 2.500000    | 46.500000   | 657                | 3.784780               | 26                    | 631                   | 0.001498        | 0.036350        |
| 1 | PM10         | 0.000115                  | 0.026787                  | 7.46875  | 14.4000  | 6.93125   | -2.928125   | 24.796875   | 467                | 2.690247               | 2                     | 465                   | 0.000115        | 0.026787        |
| 2 | PM2.5        | 0.000000                  | 0.045740                  | 4.30000  | 7.3000   | 3.00000   | -0.200000   | 11.800000   | 794                | 4.573996               | 0                     | 794                   | 0.000000        | 0.045740        |
| 3 | NO           | 0.000000                  | 0.088945                  | 2.35000  | 11.9000  | 9.55000   | -11.975000  | 26.225000   | 1544               | 8.894522               | 0                     | 1544                  | 0.000000        | 0.088945        |
| 4 | NO2          | 0.000000                  | 0.039864                  | 0.00395  | 0.0171   | 0.01315   | -0.015775   | 0.036825    | 692                | 3.986405               | 0                     | 692                   | 0.000000        | 0.039864        |
| 5 | NOx          | 0.000000                  | 0.072873                  | 6.91250  | 28.5000  | 21.58750  | -25.468750  | 60.881250   | 1265               | 7.287286               | 0                     | 1265                  | 0.000000        | 0.072873        |
| 6 | Wind_Speed   | 0.000000                  | 0.003456                  | 1.25000  | 3.5000   | 2.25000   | -2.125000   | 6.875000    | 60                 | 0.345642               | 0                     | 60                    | 0.000000        | 0.003456        |
| 7 | Wind_Dir     | 0.000000                  | 0.000000                  | 76.50000 | 260.0000 | 183.50000 | -198.750000 | 535.250000  | 0                  | 0.000000               | 0                     | 0                     | 0.000000        | 0.000000        |
| 8 | Air_Temp     | 0.001325                  | 0.000000                  | 13.50000 | 19.5000  | 6.00000   | 4.500000    | 28.500000   | 23                 | 0.132496               | 23                    | 0                     | 0.001325        | 0.000000        |
| 9 | Rel_Humidity | 0.000288                  | 0.000000                  | 61.20000 | 79.7000  | 18.50000  | 33.450000   | 107.450000  | 5                  | 0.028804               | 5                     | 0                     | 0.000288        | 0.000000        |
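The TODO in cell In [20] notes that the winsorized data can feed later steps such as PCA. Purely as a hedged sketch of what that next step might look like (assuming scikit-learn is available; the component count of 3 is an arbitrary illustrative choice, not a project decision):

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

## Standardize the winsorized numerical columns first, since PCA is scale-sensitive
X_scaled = StandardScaler().fit_transform(
    winsorized_imputed_cleaned_data_site1[numerical_columns_site1])

## Project onto a few principal components (illustrative n_components)
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)
print('Explained variance ratio:', pca.explained_variance_ratio_)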
In [22]:
## TODO
## Identify Outliers with extreme values
## Let's assume Air_Temp > 40°C, Rel_Humidity > 100, and PM2.5 > 100 are extreme outliers
extreme_outliers = rawdata[(rawdata['Air_Temp'] > 40) | (rawdata['Rel_Humidity'] > 100) | (rawdata['PM2.5'] > 100)]

## Print information about missing data and outliers
extreme_outliers

## FIXME: winsorize(cleaned_data[FEATURE], (lower_outlier_percentage,upper_outlier_percentage))

# ## Winsorization to handle outliers
# cleaned_data['Air_Temp'] = winsorize(cleaned_data['Air_Temp'], (0.05, 0.05))
# cleaned_data['Rel_Humidity'] = winsorize(cleaned_data['Rel_Humidity'], (0.05, 0.05))
# cleaned_data['PM2.5'] = winsorize(cleaned_data['PM2.5'], (0.05, 0.05))
# cleaned_data['PM10'] = winsorize(cleaned_data['PM10'], (0.05, 0.05))

# ## Show first 5 rows of cleaned_data
# cleaned_data.head()
Out[22]:
|       | Timestamp           | AQI  | PM10 | PM2.5 | SO2 | NO  | NO2  | NOx  | Wind_Speed | Wind_Dir | Air_Temp | Rel_Humidity | Site    | Site_Class           | Country     |
|-------|---------------------|------|------|-------|-----|-----|------|------|------------|----------|----------|--------------|---------|----------------------|-------------|
| 11344 | 2021-06-25 21:00:00 | 56.0 | 8.0  | 138.4 | 0.5 | 1.4 | 26.2 | 27.6 | 2.6        | 338.0    | 14.0     | 63.9         | Penrose | Industrial / Traffic | New Zealand |
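If those hard physical limits are to be treated separately from the IQR-based Winsorization, a simple option is to clip the affected columns before winsorizing. This is only a sketch, using the thresholds assumed in the cell above rather than a treatment the notebook actually applies:

## Cap physically implausible readings at the assumed hard limits (sketch only)
capped_site1 = imputed_cleaned_data_site1.copy()
capped_site1['Air_Temp'] = capped_site1['Air_Temp'].clip(upper=40)
capped_site1['Rel_Humidity'] = capped_site1['Rel_Humidity'].clip(upper=100)
capped_site1['PM2.5'] = capped_site1['PM2.5'].clip(upper=100)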
In [23]:
# imputed_cleaned_data_site1.shape

## [What If] Visualize the difference between before and after the Winsorization
before_and_after_values_visualization(imputed_cleaned_data_site1.copy(), winsorized_imputed_cleaned_data_site1, '[What-If] Winsorization', numerical_columns_site1, 'Penrose', feature_column='PM2.5')
before_and_after_values_visualization(imputed_cleaned_data_site1.copy(), winsorized_imputed_cleaned_data_site1, '[What-If] Winsorization', numerical_columns_site1, 'Penrose', feature_column='PM10')
In [24]:
# imputed_cleaned_data_site2.shape

## [What If] Visualize the difference between before and after the Winsorization
before_and_after_values_visualization(imputed_cleaned_data_site2.copy(), winsorized_imputed_cleaned_data_site2, '[What-If] Winsorization', numerical_columns_site2, 'Takapuna', feature_column='PM2.5')
before_and_after_values_visualization(imputed_cleaned_data_site2.copy(), winsorized_imputed_cleaned_data_site2, '[What-If] Winsorization', numerical_columns_site2, 'Takapuna', feature_column='PM10')
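`before_and_after_values_visualization` is defined in the reusable notebook loaded at the top. Purely to illustrate the kind of comparison these calls produce, here is a minimal, hypothetical matplotlib sketch (not the project's actual implementation):

import matplotlib.pyplot as plt

def plot_before_after(before_df, after_df, feature, site_name):
    """Hypothetical sketch: compare a feature's distribution before/after Winsorization."""
    fig, axes = plt.subplots(1, 2, figsize=(12, 4), sharey=True)
    axes[0].hist(before_df[feature].dropna(), bins=50)
    axes[0].set_title(f'{site_name} - {feature} (before)')
    axes[1].hist(after_df[feature].dropna(), bins=50, color='tab:orange')
    axes[1].set_title(f'{site_name} - {feature} (after Winsorization)')
    for ax in axes:
        ax.set_xlabel(feature)
        ax.set_ylabel('Count')
    plt.tight_layout()
    plt.show()

## e.g. plot_before_after(imputed_cleaned_data_site1, winsorized_imputed_cleaned_data_site1, 'PM2.5', 'Penrose')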
In [25]:
from prettytable import PrettyTable

def display_site_comparison():
    """
    This function creates and displays a table that lists various data attributes along with descriptions and 
    their applicability to different datasets across sites using the PrettyTable library.
    """
    
    ## Define the PrettyTable table columns and set column alignments
    table = PrettyTable()
    table.field_names = ["Variable Name", "Description", "All Sites", "Penrose", "Takapuna"]
    table.align["Variable Name"] = "l"
    table.align["Description"] = "l"
    table.align["All Sites"] = "c"
    table.align["Penrose"] = "c"
    table.align["Takapuna"] = "c"

    ## Define the data dictionary containing information about the datasets, with placeholders for potential future expansion
    data_dictionary = [
        ("rawdata", "Complete dataset containing all observations across all sites.", "[x]", "[x]", "[x]"),
        ("numerical_columns_site1", "Numerical columns specific to Site 1.", "[ ]", "[x]", "[ ]"),
        ("nominal_columns_site1", "Nominal columns specific to Site 1.", "[ ]", "[x]", "[ ]"),
        ("numerical_columns_site2", "Numerical columns specific to Site 2.", "[ ]", "[ ]", "[x]"),
        ("nominal_columns_site2", "Nominal columns specific to Site 2.", "[ ]", "[ ]", "[x]"),
        ("rawdata_site1", "Subset of raw data for Site 1.", "[ ]", "[x]", "[ ]"),
        ("rawdata_site2", "Subset of raw data for Site 2.", "[ ]", "[ ]", "[x]"),
        ("---------------------------", "---------------------------------------------------------------------", "---------", "-------", "--------"),  ## Blank line for separation
        ("cleaned_data", "Cleaned dataset with preprocessing applied.", "[x]", "[x]", "[x]"),
        ("cleaned_ordinal_columns", "Ordinal columns in the cleaned dataset.", "[x]", "[x]", "[x]"),
        ("cleaned_numerical_columns", "Numerical columns in the cleaned dataset.", "[x]", "[x]", "[x]"),
        ("cleaned_nominal_columns", "Nominal columns in the cleaned dataset.", "[x]", "[x]", "[x]"),
        ("cleaned_data1", "Cleaned data for Site 1.", "[ ]", "[x]", "[ ]"),
        ("cleaned_data2", "Cleaned data for Site 2.", "[ ]", "[ ]", "[x]"),
        ("---------------------------", "---------------------------------------------------------------------", "---------", "-------", "--------"),  ## Blank line for separation
        ("imputed_cleaned_data_site1", "Imputed Cleaned data for Site 1.", "[ ]", "[x]", "[ ]"),
        ("imputed_cleaned_data_site2", "Imputed Cleaned data for Site 2.", "[ ]", "[ ]", "[x]"),
        ("---------------------------", "---------------------------------------------------------------------", "---------", "-------", "--------"),  ## Blank line for separation
        ("winsorized_imputed_cleaned_data_site1", "Winsorized Imputed Cleaned data for Site 1.", "[ ]", "[x]", "[ ]"),
        ("winsorized_imputed_cleaned_data_site2", "Winsorized Imputed Cleaned data for Site 2.", "[ ]", "[ ]", "[x]"),
        # ("---------------------------", "---------------------------------------------------------------------", "---------", "-------", "--------"),  ## Blank line for separation
        # ("top_features_PM25_site1", "Top features correlated with PM2.5 at Site 1 (Penrose).", "[ ]", "[x]", "[ ]"),
        # ("top_features_PM25_site2", "Top features correlated with PM2.5 at Site 2 (Takapuna).", "[ ]", "[ ]", "[x]"),
        # ("top_features_PM10_site1", "Top features correlated with PM10 at Site 1.", "[ ]", "[x]", "[ ]"),
        # ("top_features_PM10_site2", "Top features correlated with PM10 at Site 2.", "[ ]", "[ ]", "[x]"),
        # ("summary_stats_PM25_penrose", "[PM2.5] Summary statistics for the Penrose site after preprocessing.", "[ ]", "[x]", "[ ]"),
        # ("summary_stats_PM25_takapuna", "[PM2.5] Summary statistics for the Takapuna site after preprocessing.", "[ ]", "[ ]", "[x]"),
        # ("summary_stats_PM10_penrose", "[PM10] Summary statistics for the Penrose site after preprocessing.", "[ ]", "[x]", "[ ]"),
        # ("summary_stats_PM10_takapuna", "[PM10] Summary statistics for the Takapuna site after preprocessing.", "[ ]", "[ ]", "[x]"),
        # ("selected_features_PM25_penrose", "[PM2.5] Target + Selected Features for the Penrose site.", "[ ]", "[x]", "[ ]"),
        # ("selected_features_PM25_takapuna", "[PM2.5] Target + Selected Features for the Takapuna site.", "[ ]", "[ ]", "[x]"),
        # ("selected_features_PM10_penrose", "[PM10] Target + Selected Features for the Penrose site.", "[ ]", "[x]", "[ ]"),
        # ("selected_features_PM10_takapuna", "[PM10] Target + Selected Features for the Takapuna site.", "[ ]", "[ ]", "[x]"),
    ]

    ## Add each entry as a row of the table
    for entry in data_dictionary:
        table.add_row(entry)

    ## Print the table in an organized format
    print(table)

## Call the function to display the table
print('\n🎓 Listing variables with description...')
display_site_comparison()
🎓 Listing variables with description...
+---------------------------------------+-----------------------------------------------------------------------+-----------+---------+----------+
| Variable Name                         | Description                                                           | All Sites | Penrose | Takapuna |
+---------------------------------------+-----------------------------------------------------------------------+-----------+---------+----------+
| rawdata                               | Complete dataset containing all observations across all sites.        |    [x]    |   [x]   |   [x]    |
| numerical_columns_site1               | Numerical columns specific to Site 1.                                 |    [ ]    |   [x]   |   [ ]    |
| nominal_columns_site1                 | Nominal columns specific to Site 1.                                   |    [ ]    |   [x]   |   [ ]    |
| numerical_columns_site2               | Numerical columns specific to Site 2.                                 |    [ ]    |   [ ]   |   [x]    |
| nominal_columns_site2                 | Nominal columns specific to Site 2.                                   |    [ ]    |   [ ]   |   [x]    |
| rawdata_site1                         | Subset of raw data for Site 1.                                        |    [ ]    |   [x]   |   [ ]    |
| rawdata_site2                         | Subset of raw data for Site 2.                                        |    [ ]    |   [ ]   |   [x]    |
| ---------------------------           | --------------------------------------------------------------------- | --------- | ------- | -------- |
| cleaned_data                          | Cleaned dataset with preprocessing applied.                           |    [x]    |   [x]   |   [x]    |
| cleaned_ordinal_columns               | Ordinal columns in the cleaned dataset.                               |    [x]    |   [x]   |   [x]    |
| cleaned_numerical_columns             | Numerical columns in the cleaned dataset.                             |    [x]    |   [x]   |   [x]    |
| cleaned_nominal_columns               | Nominal columns in the cleaned dataset.                               |    [x]    |   [x]   |   [x]    |
| cleaned_data1                         | Cleaned data for Site 1.                                              |    [ ]    |   [x]   |   [ ]    |
| cleaned_data2                         | Cleaned data for Site 2.                                              |    [ ]    |   [ ]   |   [x]    |
| ---------------------------           | --------------------------------------------------------------------- | --------- | ------- | -------- |
| imputed_cleaned_data_site1            | Imputed Cleaned data for Site 1.                                      |    [ ]    |   [x]   |   [ ]    |
| imputed_cleaned_data_site2            | Imputed Cleaned data for Site 2.                                      |    [ ]    |   [ ]   |   [x]    |
| ---------------------------           | --------------------------------------------------------------------- | --------- | ------- | -------- |
| winsorized_imputed_cleaned_data_site1 | Winsorized Imputed Cleaned data for Site 1.                           |    [ ]    |   [x]   |   [ ]    |
| winsorized_imputed_cleaned_data_site2 | Winsorized Imputed Cleaned data for Site 2.                           |    [ ]    |   [ ]   |   [x]    |
+---------------------------------------+-----------------------------------------------------------------------+-----------+---------+----------+

💾 Save Imputed_Cleaned-Data to *.csv¶

In [ ]:
## Filter cleaned data for the specified sites
# imputed_cleaned_data_site1 = cleaned_data[cleaned_data['Site'] == 'Penrose']
# imputed_cleaned_data_site2 = cleaned_data[cleaned_data['Site'] == 'Takapuna']
# 
## Remove redundant columns before saving to *.csv
# imputed_cleaned_data_site1 = cleaned_data_site1.drop(['SO2', 'Site', 'Site_Class'], axis=1)
# imputed_cleaned_data_site2 = cleaned_data_site2.drop(['Site', 'Site_Class'], axis=1)

imputed_cleaned_data_site1['Timestamp'] = imputed_cleaned_data_site1['Timestamp'].dt.strftime('%d/%m/%Y %H:%M')
imputed_cleaned_data_site2['Timestamp'] = imputed_cleaned_data_site2['Timestamp'].dt.strftime('%d/%m/%Y %H:%M')

CLEAN_DATA_PATH = 'data/cleaned'
os.makedirs(CLEAN_DATA_PATH, exist_ok=True)  ## Ensure the output directory exists
imputed_cleaned_data_site1.to_csv(f"{CLEAN_DATA_PATH}/cleaned_Penrose7-07May2020-to-30Apr2022.csv", index=False)
imputed_cleaned_data_site2.to_csv(f"{CLEAN_DATA_PATH}/cleaned_Takapuna23-07May2020-to-30Apr2022.csv", index=False)
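The cell above persists the imputed (pre-Winsorization) data. If the winsorized variants should also go to disk, a sketch along the same lines (the winsorized_* file names are an assumption, not an existing convention in this project):

## Persist the winsorized variants as well (hypothetical file names)
winsorized_imputed_cleaned_data_site1.to_csv(
    f"{CLEAN_DATA_PATH}/winsorized_Penrose7-07May2020-to-30Apr2022.csv", index=False)
winsorized_imputed_cleaned_data_site2.to_csv(
    f"{CLEAN_DATA_PATH}/winsorized_Takapuna23-07May2020-to-30Apr2022.csv", index=False)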
🎓 Predicting Air Particulate Matter at Scale ⛅️
🧑‍🎓 Auckland University of Technology (AUT)