import os, logging

## .env --> Setting the environment variable for output format programmatically
os.environ['ENV_PATH'] = 'raw_pandas.env'  ## Set to `cleaned_pandas.env` or `raw_pandas.env` as needed

%run -i ./Data_Loading_and_Descriptive_Statistics.ipynb
# %run -i ./DataFrameAdapter.ipynb

The specified .env file: raw_pandas.env
+-----------------------------+--------------------------------------------------------------------------------------------+-------------------+
| Variable                    | Description                                                                                | Value             |
+-----------------------------+--------------------------------------------------------------------------------------------+-------------------+
| IS_TERADATA_VANTAGE         | Using scable Teradata Vantage vs. local machine (Laptop)                                   | True              |
| IS_DATA_IN_TERADATA_VANTAGE | Using TeradataDataFrame in scalable Vantage vs. PandasDataFrame with local *.csv/xls files | False             |
| SCHEMA_NAME                 | [Teradata Vantage] Schema Name                                                             | Air_Pollution     |
| TABLE_NAME                  | [Teradata Vantage] Table Name                                                              | Air_Pollution_Raw |
| IS_JUPYTERLAB               | Running in JupyterLab vs Python Dash/Vizro Dashboard                                       | True              |
| IS_DEBUG                    | Debug mode is active or not                                                                | False             |
| USE_DATA_PREFIX             | Prefix to use for data files: 'raw' | 'cleaned'                                            | raw               |
+-----------------------------+--------------------------------------------------------------------------------------------+-------------------+
... Logon successful
Connected as: xxxxxsql://demo_user:xxxxx@host.docker.internal/dbc
Engine(teradatasql://demo_user:***@host.docker.internal)

ℹ️ Load Data from data/raw

data/raw/raw_Penrose7-07May2020-to-30Apr2022.csv
data/raw/raw_Takapuna23-07May2020-to-30Apr2022.csv

ℹ️ Load Data from data/raw/raw_Penrose7-07May2020-to-30Apr2022.csv file --> rawdata DataFrame 📂

ℹ️ Load Data from data/raw/raw_Takapuna23-07May2020-to-30Apr2022.csv file --> rawdata DataFrame 📂

ℹ️ The Shape of the rawdata Dataframe: (37382, 15)
ℹ️ The Shape of the Dataframe rawdata_site1 (Penrose) and rawdata_site2 (Takapuna): (20023, 12) (17359, 11)

🎓 Describing the types of each attribute as numerical_columns (Continuous), ordinal_columns (Ordinal), or nominal_columns (Nominal) ...

ℹ️ Numerical Variables/Features: 
 ['AQI', 'PM10', 'PM2.5', 'SO2', 'NO', 'NO2', 'NOx', 'Wind_Speed', 'Wind_Dir', 'Air_Temp', 'Rel_Humidity']

ℹ️ Ordinal Variables/Features: 
 ['Timestamp']

ℹ️ Nominal Variables/Features: 
 Index(['Site', 'Site_Class', 'Country'], dtype='object')

🎓 1. [Site 1 - Penrose][numerical_columns_S1, nominal_columns_S1] Summary Statistics of the Dataframe such as the mean, maximum and minimum values ...

ℹ️ Numerical Variables/Features: 
 ['AQI', 'PM10', 'PM2.5', 'SO2', 'NO', 'NO2', 'NOx', 'Wind_Speed', 'Wind_Dir', 'Air_Temp', 'Rel_Humidity']

ℹ️ Ordinal Variables/Features: 
 ['Timestamp']

🎓 2. [Site 2 - Takapuna][numerical_columns_S2, nominal_columns_S2]  Summary Statistics of the {site2} Dataframe such as the mean, maximum and minimum values ...

ℹ️ Numerical Variables/Features: 
 ['AQI', 'PM10', 'PM2.5', 'NO', 'NO2', 'NOx', 'Wind_Speed', 'Wind_Dir', 'Air_Temp', 'Rel_Humidity']

ℹ️ Ordinal Variables/Features: 
 ['Timestamp']

🎓 Listing variables with description...
+-------------------------+----------------------------------------------------------------+-----------+---------+----------+
| Variable Name           | Description                                                    | All Sites | Penrose | Takapuna |
+-------------------------+----------------------------------------------------------------+-----------+---------+----------+
| rawdata                 | Complete dataset containing all observations across all sites. |    [x]    |   [x]   |   [x]    |
| numerical_columns_site1 | Numerical columns specific to Site 1.                          |    [ ]    |   [x]   |   [ ]    |
| nominal_columns_site1   | Nominal columns specific to Site 1.                            |    [ ]    |   [x]   |   [ ]    |
| numerical_columns_site2 | Numerical columns specific to Site 2.                          |    [ ]    |   [ ]   |   [x]    |
| nominal_columns_site2   | Nominal columns specific to Site 2.                            |    [ ]    |   [ ]   |   [x]    |
| rawdata_site1           | Subset of raw data for Site 1.                                 |    [ ]    |   [x]   |   [ ]    |
| rawdata_site2           | Subset of raw data for Site 2.                                 |    [ ]    |   [ ]   |   [x]    |
+-------------------------+----------------------------------------------------------------+-----------+---------+----------+


rawdata_site1


print("\n🎓 [Site1 - Penrose]  Summary Statistics of the {site1} cleaned_data1 Dataframe such as the mean, max/minimum values ...")
rawdata_site1.describe()

🎓 [Site1 - Penrose]  Summary Statistics of the {site1} cleaned_data1 Dataframe such as the mean, max/minimum values ...


rawdata_site2


print("\n🎓 [Site2 - Takapuna]  Summary Statistics of the {site2} cleaned_data2 Dataframe such as the mean, maximum and minimum values ...")
rawdata_site2.describe()

🎓 [Site2 - Takapuna]  Summary Statistics of the {site2} cleaned_data2 Dataframe such as the mean, maximum and minimum values ...


def ensure_datetime_format(df, column_name='Timestamp'):
    """
    Ensures the specified column is in datetime format.
    Parameters:
    - df (pd.DataFrame): The dataframe to process.
    - column_name (str): The name of the column to format as datetime.
    Returns:
    - pd.DataFrame: DataFrame with the column in datetime format.
    """
    try:
        df[column_name] = pd.to_datetime(df[column_name], errors='coerce')
    except Exception as e:
        raise ValueError(f"Error converting {column_name} to datetime: {e}")
    return df

def clean_timeseries_dataset(df, time_col='Timestamp'):
    """
    Clean the dataset by ensuring datetime format for 'Timestamp', 
    removing rows where all values are NaN (except time_col),
    and removing duplicate rows based on 'Timestamp'.
    
    Parameters:
    - df (pd.DataFrame): Input dataset.
    - time_col (str): The column name for timestamp data.
    
    Returns:
    - pd.DataFrame: Cleaned dataset.
    """
    print(f'\n🎓 Cleans the dataset by removing rows where all values are NaN and removing duplicate rows based on {time_col}')
    if time_col not in df.columns:
        raise ValueError(f"Column '{time_col}' not found in DataFrame.")
    
    ## Convert 'Timestamp' column to datetime format
    df = ensure_datetime_format(df, column_name=time_col)

    ## Remove rows where all values are NaN, excluding 'Timestamp'
    cols_except_time = [col for col in df.columns if col != time_col]
    if IS_DEBUG:
        rows_all_nans = df[df[cols_except_time].isna().all(axis=1)]
        if not rows_all_nans.empty:
            print("Rows where all values are NaN, excluding 'Timestamp':\n", rows_all_nans)
    ## Remove rows where all values are NaN, excluding 'Timestamp'
    df = df.dropna(how='all', subset=cols_except_time)

    if IS_DEBUG:
        ## Show duplicate rows based on 'Timestamp'
        duplicates = df[df.duplicated(subset=[time_col], keep=False)]
        if not duplicates.empty:
            print("Duplicate rows based on 'Timestamp':\n", duplicates)
    
    ## Remove duplicate rows based on 'Timestamp'
    df = df.drop_duplicates(subset=[time_col], keep='first')

    ## Next Steps:
    # - Handling outliers in numeric columns
    # - Filling missing values with appropriate imputation techniques
    # - Normalizing or scaling numeric columns
    # - Encoding categorical variables if necessary

    ## Optional: sort by 'Timestamp' for chronological order
    # cleaned_df.sort_values(by=time_col, inplace=True)

    return df


cleaned_data_site1 = clean_timeseries_dataset(rawdata_site1)
print(f"☑️ Original rawdata_site1 dataset shape: {rawdata_site1.shape}")
print(f"✅ Cleaned  rawdata_site1 dataset shape: {cleaned_data_site1.shape}")

cleaned_data_site2 = clean_timeseries_dataset(rawdata_site2)
print(f"☑️ Original rawdata_site2 dataset shape: {rawdata_site2.shape}")
print(f"✅ Cleaned  rawdata_site2 dataset shape: {cleaned_data_site2.shape}")

🎓 Cleans the dataset by removing rows where all values are NaN and removing duplicate rows based on Timestamp
☑️ Original rawdata_site1 dataset shape: (20023, 12)
✅ Cleaned  rawdata_site1 dataset shape: (17375, 12)

🎓 Cleans the dataset by removing rows where all values are NaN and removing duplicate rows based on Timestamp
☑️ Original rawdata_site2 dataset shape: (17359, 11)
✅ Cleaned  rawdata_site2 dataset shape: (17359, 11)


import holidays
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer, SimpleImputer

def before_and_after_values_visualization(data_site, imputed_data, method_description, numerical_columns, site, feature_column='PM10'):
    """
    This function plot the comparison between the two sites before and after imputation, adjusting parameters based on the visualization type.
    Applies the specified imputation function to the data, then visualizes the results using the specified visualization function.
    
    Parameters:
    - data: pd.DataFrame - The dataset to impute and visualize.
    - imputed_data: pd.DataFrame - The imputed dataset.
    - site: str - The name of the site/location for comparison.
    - feature_column: str - The feature column on which to focus the visualization.
    """
    if IS_DEBUG:
        print(f"before_and_after_values_visualization {data_site.shape}:", imputed_data.shape, site, visualization_type, feature_column)
    if not isinstance(data_site, pd.DataFrame):  ## Ensure data is a DataFrame before proceeding
        raise ValueError("Data must be a pandas DataFrame.")

    ## Plotting using Plotly: Setting up the figure to hold two subplots side by side
    fig = make_subplots(rows=1, cols=1, shared_yaxes=True, subplot_titles=(f'Data Plot of {feature_column} at {site}'))

    ## Original and Modified Data for Site1: darkorange vs lightblue | dodgerblue; dash='dot' vs 'solid'
    fig.add_trace(go.Scatter(x=imputed_data['Timestamp'], y=imputed_data[feature_column], mode='markers', name='Modified Site', 
                             line=dict(color='darkorange', width=0.3, dash='dot')), row=1, col=1)
    # fig.add_trace(go.Scatter(x=data_site['Timestamp'], y=data_site[feature_column], mode='lines+markers', name='Original Site',
    fig.add_trace(go.Scatter(x=data_site['Timestamp'], y=data_site[feature_column], mode='markers', name='Original Site',
                             line=dict(color='dodgerblue', width=0.3, dash='dot'), opacity=0.8), row=1, col=1)

    ## Adjust the layout for a better/consistent look and display the plot
    fig.update_layout(height=800, title_text=f'[{method_description}] Original vs Imputed Data Plot of {feature_column} at {site}',
                        xaxis_title="Timestamp", yaxis_title=feature_column, 
                        xaxis=dict(
                            title="Timestamp",
                            tickmode='auto',
                            tickformat="%Y-Q%q",  ## Format: Year - Q<Quarter>
                            dtick="M3",           ## Set ticks every 3 months to indicate quarters
                        ), legend=dict(orientation="h",  ## Horizontal legend
                            yanchor="bottom", y=1.02,   ## Position legend above the plot
                            xanchor="right", x=1,       ## Align legend to the right
                            title='Legend:',            ## Optional legend title
                            bgcolor='rgba(255,255,255,0.3)',  ## Semi-transparent white background
                            bordercolor="lightgrey", borderwidth=1))
    fig.show()


class DataImputer:
    """
    Class to perform various imputation methods on a DataFrame.
    Note: Imputation methods should directly accept a DataFrame and return the imputed DataFrame
    """
    def __init__(self):
        # if not isinstance(data, pd.DataFrame):
        #     raise ValueError("Data must be a pandas DataFrame")
        pass

    @staticmethod
    def mean_mode_imputation(data, columns=None):
        """
        Impute with Mean and Mode:
        Replaces missing numerical values with the MEAN and missing categorical values with the MODE.

        Args:
            data (DataFrame): DataFrame containing the data.
            columns (list, optional): Columns to impute. If None, applies to all columns.

        Returns:
            DataFrame: DataFrame with imputed values.
        """
        if columns is None:
            # Exclude 'Timestamp' column or any non-numeric columns
            columns = [col for col in data.columns if data[col].dtype in ['float64', 'int64'] and col != 'Timestamp']
        
        # print("Applying Mean/Mode Imputation...")
        for col in data.columns:
            if col == 'Timestamp':  # Skip the 'Timestamp' column
                continue

            if data[col].dtype in ['float64', 'int64']:
                imputer = SimpleImputer(strategy='mean')
            else:  ## Assuming categorical data: return_value[col].dtype.name == 'category'
                imputer = SimpleImputer(strategy='most_frequent')
            ## The fit_transform function expects a 2D array, hence the double square brackets
            data[col] = imputer.fit_transform(data[[col]])
            
        return data

    @staticmethod
    def forward_backward_imputation(data):
        """
        Impute with Forward and Backward propagation
        Using ffill and bfill as a naive method, to complete the data.

        Parameters:
            data (pd.DataFrame): DataFrame to impute.
            inplace (bool): If False, returns a new DataFrame, otherwise modifies the original DataFrame.
        
        Returns:
            pd.DataFrame: DataFrame with imputed values.
        """
        # print("Applying Forward and Backward Imputation across the entire DataFrame ...")
        ## Forward fill (ffill) missing values
        data.fillna(method='ffill', inplace=True)
        ## Backward fill (bfill) remaining missing values
        data.fillna(method='bfill', inplace=True)
        return data

    @staticmethod
    def interpolation_imputation(data, method='linear'):
        """
        Impute with Interpolation
        A linear interpolation method works by assuming a linear relationship between the observed points and drawing a straight line accordingly.
        Polynomial curves or splines.
        Interpolation is an effective approach to impute missing values in time series. 
        Polynomial interpolation fits a polynomial function to the observed data points and estimates the missing values based on this function.
        It works best if the time series is reasonably smooth. In case there are sudden changes or outliers, a simpler approach such as forward filling might be a better option.

        Parameters:
            data (pd.DataFrame): DataFrame to impute.
            method (str): Interpolation technique, default is 'linear'.
            order (int, optional): The order of the polynomial for polynomial interpolation.
            inplace (bool): If False, returns a new DataFrame, otherwise modifies the original DataFrame.
        
        Returns:
            pd.DataFrame: DataFrame with imputed values.
        """
        # print("Applying Interpolation Imputation...")
        if method == 'polynomial' and order is None:
            raise ValueError("Order must be specified for polynomial interpolation")
        data.interpolate(method=method, inplace=True, limit_direction='both')
        return data

    @staticmethod
    def polynomial_imputation(data, method='polynomial', order=2):
        """
        Impute with Interpolation
        A linear interpolation method works by assuming a linear relationship between the observed points and drawing a straight line accordingly.
        Polynomial curves or splines.
        Interpolation is an effective approach to impute missing values in time series. 
        Polynomial interpolation fits a polynomial function to the observed data points and estimates the missing values based on this function.
        It works best if the time series is reasonably smooth. In case there are sudden changes or outliers, a simpler approach such as forward filling might be a better option.
        """
        # print("Applying Polynomial Imputation...")
        if method == 'polynomial' and order is None:
            raise ValueError("Order must be specified for polynomial interpolation")
        # data.interpolate(method=method, order=order if order else None, inplace=True, limit_direction='both')
        ## Apply polynomial interpolation across columns
        for column in data.select_dtypes(include=['float', 'int']).columns:
            ## Only interpolate if there are at least two non-NA values
            if data[column].count() > 1:
                data[column].interpolate(method='polynomial', order=order, inplace=True, limit_direction='both')
        # return_value = return_value.interpolate(method='spline')
        return data
    

    @staticmethod
    def moving_average_imputation(data, columns=None, window='168H', aggressive_iterative=False, min_periods=1, verbose=True):
    # def moving_average_imputation(data, columns=None, window='168H', min_periods=1, verbose=False, inplace=True):
        """
        Impute missing values using a moving average over a specified window. 
        This method is particularly useful for time series data where patterns such as seasonality are expected to play a significant role.

        Only NaNs are filled with the rolling mean, preserving the integrity of original non-NaN values. 
        A moving average is better at adapting to changes by considering a few nearby data points to compute the mean.
        Yet, it can still lead to biased results if the data is not missing at random.
        
        Examples:
            .rolling(window='52W' ## A year has 52 weeks (52 weeks * 7 days per week) approximately.
            .rolling(window='168H', min_periods=1).mean()  ## 168 hours = 7 days

        Parameters:
            data (DataFrame): The pandas DataFrame containing the data.
            columns (list of str, optional): Specific columns to impute. If None, all numeric columns are used.
            window (str): The size of the moving window (default is '168H' for 168 hours = 24 * 7 = 7 days).
            min_periods (int): Minimum number of observations in window required to have a value (default is 1).
            verbose (bool): If True, print additional information about the imputation process.
            # inplace (bool): If True, modify the DataFrame in place. Otherwise, return a modified copy.

        Returns:
            DataFrame: The DataFrame with imputed values if inplace is False. Otherwise, None.

        Notes:
            - A rolling window imputation can adapt to changes by considering nearby data points. However, it may
              introduce biases if the data is not missing at random.
            - This method currently supports only numeric data for rolling calculations.
        """
        ## Ensure the 'Timestamp' column is in datetime format and set as index only if it's not already the index
        if 'Timestamp' in data.columns:
            data['Timestamp'] = pd.to_datetime(data['Timestamp'], dayfirst=True, errors='coerce')
            data.set_index('Timestamp', inplace=True)

        if not pd.api.types.is_datetime64_any_dtype(data.index):
            raise ValueError("Data index must be a datetime type for time-based rolling windows.")

        ## Select columns if not provided
        if columns is None:
            ## Select only numeric columns for rolling operation
            columns = data.select_dtypes(include=[np.number]).columns.tolist()
    
        if verbose:
            print(f"\nApplying Moving Average Imputation over a specified {window} window on columns:", columns)

        ## Apply rolling mean to specified numeric columns
        for column in columns:
            ## Check if the column is of a numeric type: boolean (b), integer (i), floating (f), or complex number (c)
            if data[column].dtype.kind in 'bifc':  
                # if verbose:
                #     print(f"Processing column: {column}")
                ## Calculate the rolling mean and update NaNs only
                rolling_mean = data[column].rolling(window=window, min_periods=min_periods).mean()
                data[column].fillna(rolling_mean, inplace=True)

        ## [Option] Aggressive Iterative pass to fill remaining NaNs
        if aggressive_iterative:
            ## Perform multiple passes over the data. In each pass, it tries to fill the NaNs with the rolling mean. 
            max_iterations = 10
            for iteration in range(max_iterations):
                change_made = False
                for column in columns:
                    if data[column].isna().sum() > 0:  ## Check if there are still NaNs to fill
                        # if verbose:
                        #     print(f"Iterative imputation for column: {column}, iteration {iteration+1}")
                        before_fill = data[column].isna().sum()
                        data[column].fillna(
                            data[column].rolling(window=window, min_periods=min_periods).mean(),
                            inplace=True
                        )
                        after_fill = data[column].isna().sum()
                        if after_fill < before_fill:
                            change_made = True

                if not change_made:  ## No more NaNs can be filled
                    if verbose:
                        print(f"Processing column: {column}. Iterative pass to fill remaining NaNs in {iteration+1} iteration(s).")
                        # print(f"No more changes made in iteration {iteration+1}. Ending iterative imputation.")
                    break
        
        ## Check for any remaining NaNs and attempt further imputation if necessary
        if not aggressive_iterative:
            if data.isna().any().any():
                if verbose:
                    print("Second pass for any remaining NaNs after initial imputation using bfill & ffill.")
                for column in columns:
                    data[column] = data[column].fillna(method='bfill').fillna(method='ffill')

        ## Reset index to turn 'Timestamp' back into a column
        data.reset_index(inplace=True)
        
        return data


## TODO: Future Work!!!
#
# !pip install --upgrade prophet xgboost --user
# import functools
# 
# from xgboost import XGBRegressor
# from prophet import Prophet
# from sklearn.ensemble import RandomForestRegressor

#     ## TODO
#     @staticmethod
#     def prophet_imputation(data, date_column, target_columns, holidays=None):
#         """
#         Decorator for imputing missing values using Facebook/Meta Prophet time series forecasting model.
#         
#         Parameters:
#         - dates_column: Name of the column containing the dates.
#         - columns_to_impute: Columns to impute. If 'all', all columns except the dates column are imputed.
#         
#         Returns:
#         - A wrapper function for the imputation.
#         """
#         # print("Applying Time-Series Prophet Imputation...")    
#         if date_column not in data.columns:
#             raise ValueError(f"{date_column} is not a column in the DataFrame")
# 
#         for column in target_columns:
#             ## Prepare DataFrame for Prophet
#             df_prophet = pd.DataFrame({
#                 'ds': data[date_column],
#                 'y': data[column],
#                 # 'y': data[column].interpolate()  ## Improved handling of missing values ?
#             }).dropna()
#             
#             ## Initialize and fit Prophet model
#             model = Prophet(
#                 daily_seasonality=True,
#                 weekly_seasonality=True,
#                 yearly_seasonality='auto',
#                 holidays=holidays
#             )
#             model.fit(df_prophet)
#             ## TODO: Make future dataframe and predict
#             # future = model.make_future_dataframe(periods=0, freq='H')
#             # forecast = model.predict(future)
#             forecast = model.predict(df_prophet[['ds']])
#             ## Fill missing values in original data
#             forecast.set_index('ds', inplace=True)
#             data.set_index(date_column, inplace=True)
#             # data[column].update(forecast)
#             ## 'yhat' from Prophet forecast can be used as imputed values
#             data[column].fillna(forecast['yhat'], inplace=True)
#             data.reset_index(inplace=True)
#         return data
# 
# 
#     ## TODO
#     @staticmethod
#     def xgboost_imputation(data, target_columns):
#         """
#         Impute missing values in a DataFrame using XGBoost with Scikit-learn's IterativeImputer.
#         
#         Parameters:
#         - data: DataFrame to be imputed.
#         - target_columns: Columns to impute, could be 'all' or a list of column names.
#         
#         Returns:
#         - The DataFrame with imputed values.
#         """
#         # print("Applying Machine-Learning XGBoost Interpolation Imputation...")
#         ## TODO: Set default estimator if none provided -> Directly use XGBRegressor as the estimator
#         # chosen_estimator = estimator if estimator is not None else XGBRegressor(n_estimators=100, random_state=0)
#         # chosen_estimator = XGBRegressor(n_estimators=100, random_state=0)
#         ## Perform imputation on numerical data
#         # imputer = IterativeImputer(estimator=chosen_estimator, max_iter=10, random_state=0)
#         imputer = IterativeImputer(
#                     estimator=XGBRegressor(n_estimators=100, random_state=0),
#                     max_iter=10,
#                     random_state=0
#                 )
#         ## TODO: Apply imputer only on specified target columns
#         # data[target_columns] = imputer.fit_transform(data[target_columns])
#         # for column in target_columns:
#         #     if column not in data.columns:
#         #         raise ValueError(f"{column} is not a column in the DataFrame")
#         #     data[[column]] = imputer.fit_transform(data[[column]])
#         # Perform imputation on specified target columns or all numerical columns if 'all'
#         if target_columns == 'all':
#             numerical_data = data.select_dtypes(include=[np.number])
#             imputed_numerical_data = pd.DataFrame(IterativeImputer(estimator=chosen_estimator, max_iter=10, random_state=0).fit_transform(numerical_data), columns=numerical_data.columns, index=data.index)
#             data.update(imputed_numerical_data)
#         else:
#             for column in target_columns:
#                 if column in data.columns:
#                     column_data = data[[column]].select_dtypes(include=[np.number])
#                     imputed_column_data = IterativeImputer(estimator=chosen_estimator, max_iter=10, random_state=0).fit_transform(column_data)
#                     data[column] = imputed_column_data
#         return data


## Instantiate the class with rawdata
imputer = DataImputer()

nz_holidays = holidays.country_holidays('NZ')
# nz_holidays = holidays.country_holidays('NZ', subdiv='AUK')

## List of imputation methods to apply, directly referencing the methods of DataImputer as tuples of (method_function, method_description)
# (lambda data: DataImputer.mean_mode_imputation(data, columns=numerical_columns), "SimpleImputer Mean/Mode Imputation"),
# (lambda data: DataImputer.xgboost_imputation(data, target_columns=['PM2.5', 'PM10', 'NO2']), "Machine-Learning XGBoost Imputation"),
# (lambda data: DataImputer.prophet_imputation(data, date_column='Timestamp', target_columns=['PM2.5', 'PM10', 'NO2'], holidays=nz_holidays), "Time-Series Prophet Imputation"),
# (lambda data: DataImputer.moving_average_imputation(data, columns=timestamp_and_numerical_columns, window='168H', min_periods=1, verbose=True), "Moving Average Imputation"),
imputation_methods = [
    (lambda data: DataImputer.mean_mode_imputation(data), "SimpleImputer Mean/Mode Imputation"),
    (DataImputer.forward_backward_imputation, "Forward/Backward Imputation"),
    (lambda data: DataImputer.interpolation_imputation(data, method='linear'), "Linear Interpolation"),
    (lambda data: DataImputer.polynomial_imputation(data, method='polynomial', order=2), "Polynomial Interpolation (Order 2)"),
    (lambda data: DataImputer.moving_average_imputation(data, window='24H',  min_periods=1, verbose=True), "1-DAY Moving Average Imputation"),
    (lambda data: DataImputer.moving_average_imputation(data, window='168H', min_periods=1, verbose=True), "1-WEEK Moving Average Imputation"),   
]


## Iterate through each imputation method, apply it, and visualize the results (data vs imputed data visualization)
for method_function, method_description in imputation_methods:
    # print(f"Applying imputation method function: {method_function}")
    # print(f"Applying imputation method description: {method_description}")
    print(f'✅ [DataImputer] {method_description}: Original vs Imputed Data Plots of PM2.5/PM10 at Penrose/Takapuna')

    if IS_DEBUG:
        print(f"{method_description} --------------Before Imputation--------------")
        print(cleaned_data_site1.isnull().sum(axis = 0))
        print(cleaned_data_site2.isnull().sum(axis = 0))
    ## Step 1. Apply the imputation function
    imputed_cleaned_data_site1 = method_function(cleaned_data_site1.copy())  ## This calls the method on the DataImputer instance
    imputed_cleaned_data_site2 = method_function(cleaned_data_site2.copy())  ## This calls the method on the DataImputer instance
    if IS_DEBUG:
        # print(type(imputed_cleaned_data_site1))
        print(f"{method_description} --------------After Imputation--------------")
        print(imputed_cleaned_data_site1.isnull().sum(axis = 0))
        print(imputed_cleaned_data_site2.isnull().sum(axis = 0))
    
    ## Introduce a short pause (e.g., 30 seconds) to ensure plots are fully rendered before moving to the next item
    time.sleep(60)
    
    ## Step 2. After imputation, we can now proceed with visualization or further processing
    before_and_after_values_visualization(cleaned_data_site1.copy(), imputed_cleaned_data_site1.copy(), method_description, numerical_columns_site1, 'Penrose', feature_column='PM2.5')
    before_and_after_values_visualization(cleaned_data_site1.copy(), imputed_cleaned_data_site1.copy(), method_description, numerical_columns_site1, 'Penrose', feature_column='PM10')
    
    # before_and_after_values_visualization(cleaned_data_site2.copy(), imputed_cleaned_data_site2.copy(), method_description, numerical_columns_site2, 'Takapuna', feature_column='PM2.5')
    # before_and_after_values_visualization(cleaned_data_site2.copy(), imputed_cleaned_data_site2.copy(), method_description, numerical_columns_site2, 'Takapuna', feature_column='PM10')

✅ [DataImputer] SimpleImputer Mean/Mode Imputation: Original vs Imputed Data Plots of PM2.5/PM10 at Penrose/Takapuna

✅ [DataImputer] Forward/Backward Imputation: Original vs Imputed Data Plots of PM2.5/PM10 at Penrose/Takapuna

✅ [DataImputer] Linear Interpolation: Original vs Imputed Data Plots of PM2.5/PM10 at Penrose/Takapuna

✅ [DataImputer] Polynomial Interpolation (Order 2): Original vs Imputed Data Plots of PM2.5/PM10 at Penrose/Takapuna

✅ [DataImputer] 1-DAY Moving Average Imputation: Original vs Imputed Data Plots of PM2.5/PM10 at Penrose/Takapuna

Applying Moving Average Imputation over a specified 24H window on columns: ['AQI', 'PM10', 'PM2.5', 'SO2', 'NO', 'NO2', 'NOx', 'Wind_Speed', 'Wind_Dir', 'Air_Temp', 'Rel_Humidity']
Second pass for any remaining NaNs after initial imputation using bfill & ffill.

Applying Moving Average Imputation over a specified 24H window on columns: ['AQI', 'PM10', 'PM2.5', 'NO', 'NO2', 'NOx', 'Wind_Speed', 'Wind_Dir', 'Air_Temp', 'Rel_Humidity']
Second pass for any remaining NaNs after initial imputation using bfill & ffill.

✅ [DataImputer] 1-WEEK Moving Average Imputation: Original vs Imputed Data Plots of PM2.5/PM10 at Penrose/Takapuna

Applying Moving Average Imputation over a specified 168H window on columns: ['AQI', 'PM10', 'PM2.5', 'SO2', 'NO', 'NO2', 'NOx', 'Wind_Speed', 'Wind_Dir', 'Air_Temp', 'Rel_Humidity']
Second pass for any remaining NaNs after initial imputation using bfill & ffill.

Applying Moving Average Imputation over a specified 168H window on columns: ['AQI', 'PM10', 'PM2.5', 'NO', 'NO2', 'NOx', 'Wind_Speed', 'Wind_Dir', 'Air_Temp', 'Rel_Humidity']
Second pass for any remaining NaNs after initial imputation using bfill & ffill.


def weekday_from_index(day_index):
    """
    Convert a day index (0-6) to its corresponding weekday name.
    
    Parameters:
    - day_index (int): The index of the day where Monday is 0 and Sunday is 6.
    
    Returns:
    - str: The name of the weekday corresponding to the input index.
    """
    weekdays = ["0.Monday", "1.Tuesday", "2.Wednesday", "3.Thursday", "4.Friday", "5.Saturday", "6.Sunday"]
    if 0 <= day_index < len(weekdays):
        return weekdays[day_index]
    else:
        raise ValueError("Day index must be in the range 0 to 6.")

def impute_with_historical_averages(data, column, max_lookback_periods, verbose=True):
    """
    Helper function to impute missing values for a specific column in the dataset using historical averages.
    Historical averages are calculated using data from the same hour and weekday from previous weeks.
    
    Parameters:
    - data (pd.DataFrame): DataFrame with a datetime index.
    - column (str): The column in the DataFrame to impute.
    - max_lookback_periods (int): Maximum number of weeks to look back for historical data.
    - verbose (bool): If set to True, prints out additional debugging information.
    
    Returns:
        pd.Series: A series with imputed values.
    """
    imputed_values = data[column].copy()
    missing_indices = imputed_values[imputed_values.isna()].index
    if verbose:
        # print(f"Missing indices for column '{column}': {missing_indices}")
        print(f"Imputing column: {column} & number of missing entries: {len(missing_indices)}")
    
    ## Loop over all missing data points
    for current_missing_index in missing_indices:
        historical_values = []
        valid_data_found = False
        debug_info = []
        
        ## Collect historical data points from previous weeks
        for weeks_back in range(1, max_lookback_periods + 1):
            target_hist_time = current_missing_index - pd.DateOffset(weeks=weeks_back)
            if target_hist_time < data.index.min():
                ## Stop if there is no more historical data available
                if verbose:
                    print(f"No historical data beyond: {data.index.min()}")
                continue
            if target_hist_time in data.index:
                ## Retrieve the historical value if it exists and is not NaN
                historical_value = data.at[target_hist_time, column]
                if pd.notna(historical_value):
                    historical_values.append(historical_value)
                    valid_data_found = True
                    ## Collecting additional debug information
                    debug_info.append((target_hist_time, target_hist_time.weekday(), target_hist_time.hour, historical_value))

        # ## TODO: In case if data is sparse
        # ## while (count < max_lookback_periods) and still have previous data points (the same weekday and hour in previous weeks):
        # count = 0
        # while count < max_lookback_periods:
        #     target_hist_time = current_missing_index - pd.DateOffset(weeks=count)
        #     if target_hist_time < data.index.min():
        #         break  ## Break if historical data point goes beyond the available data range
        #     if target_hist_time in data.index and pd.notna(data.at[target_hist_time, column]):
        #         ## process the data
        #     ## Perform calculations or data retrieval ...
        #     count += 1
            
        ## Compute the mean of the historical values if any are available
        # if pd.notna(historical_value):
        if valid_data_found:
            mean_value = np.mean(historical_values)
            imputed_values.at[current_missing_index] = mean_value
            if verbose:
                print(f"Imputed {current_missing_index} (Day: {weekday_from_index(current_missing_index.weekday())}, Hour: {current_missing_index.hour}): {mean_value}")
                print("Historical data used:")
                for info in debug_info:
                    print(f"  - Date: {info[0]} (Day: {info[1]}, Hour: {info[2]}), Value: {info[3]}")
        else:
            if verbose:
                print(f"No valid historical data found for {current_missing_index} (Day: {weekday_from_index(current_missing_index.weekday())}, Hour: {current_missing_index.hour}). Cannot impute.")
            ## Optional: handle cases with no available historical data
            ## Could assign a default value or use another imputation method HERE
            ## [ ] Assign a global median, mean, or mode (if precomputed)
            ## [ ] Use another method like interpolation
            ## [x] Leave as NaN if downstream processes can handle it
            continue

    # data[column] = imputed_values
    return imputed_values


def historical_average_imputation(data, datetime_column_name='Timestamp', columns=None, max_lookback_periods=4, verbose=True):
    """
    Imputes missing values in a DataFrame by calculating the average of similar (the same weekday and hour) historical data points in previous (max_lookback_periods=4) weeks.

    Parameters:
    - data (pd.DataFrame): DataFrame with a datetime index and column for values.
    - columns (list of str, optional): Specific columns to impute. Defaults to all numeric columns if None.
    - max_lookback_periods (int): Number of weeks/periods to look back for similar historical data points.
    - verbose (bool): If True, print additional information about the imputation process.

    Returns:
    - pd.DataFrame: The DataFrame with imputed values.
    """
    ## Ensure the index is a DatetimeIndex
    if not isinstance(data.index, pd.DatetimeIndex):
        raise ValueError("Index must be of type DatetimeIndex for proper timedelta calculations.")

    if columns is None:
        columns = data.select_dtypes(include=[np.number]).columns.tolist()
    
    start_time = time.time()
    imputed_series = data.copy()
    ## Iterate over each column to apply imputation
    for target_column in columns:  
        imputed_series[target_column] = impute_with_historical_averages(data, target_column, max_lookback_periods, verbose=verbose)
        if verbose:
            print(f"Starting imputation for {target_column}.")
        data[target_column] = impute_with_historical_averages(data, target_column, max_lookback_periods, verbose)
        
    ## Optional: handle cases with no available historical data
    if imputed_series.isna().any().any():
        if verbose:
            print("Second pass for any remaining NaNs after initial imputation using bfill & ffill.")
        for column in columns:
            imputed_series[column] = imputed_series[column].fillna(method='bfill').fillna(method='ffill')

    # if imputed_series.isna().any().any():
    #     ## 1-WEEK Moving Average Imputation  
    #     imputed_series = DataImputer.moving_average_imputation(imputed_series.copy(), window='168H', min_periods=1, verbose=True)

    # if verbose:
    elapsed_time = time.time() - start_time
    print(f"Completed imputation for all columns in {elapsed_time:.2f} seconds.")

    return imputed_series


if cleaned_data_site1.index.name != 'Timestamp':
    ## Check if 'Timestamp' is not the index and set it as the index if needed
    ## Set 'Timestamp' as the index of the DataFrame
    cleaned_data_site1.set_index('Timestamp', inplace=True)

imputed_cleaned_data_site1 = historical_average_imputation(cleaned_data_site1.copy(), numerical_columns_site1, verbose=False)

## Check if there are any NaNs left in the simulated data
# remaining_nans = imputed_cleaned_data_site1.isna().any().any()
total_nans = imputed_cleaned_data_site1.isna().sum().sum()
print(f"Total NaNs in the DataFrame: {total_nans}")

Completed imputation for all columns in 10.92 seconds.
Total NaNs in the DataFrame: 0


## Check if the index name of cleaned_data_site1 is 'Timestamp'
if imputed_cleaned_data_site1.index.name == 'Timestamp':
    ## Reset index to turn 'Timestamp' back into a column in imputed_cleaned_data_site1
    imputed_cleaned_data_site1.reset_index(inplace=True)
    print("Index 'Timestamp' has been reset to a column: imputed_cleaned_data_site1")
else:
    print("Index name is not 'Timestamp'. No changes made.")
    
## Check if the index name of cleaned_data_site1 is 'Timestamp'
if cleaned_data_site1.index.name == 'Timestamp':
    ## Reset index to turn 'Timestamp' back into a column in imputed_cleaned_data_site1
    cleaned_data_site1.reset_index(inplace=True)
    print("Index 'Timestamp' has been reset to a column: cleaned_data_site1")
else:
    print("Index name is not 'Timestamp'. No changes made.")

Index 'Timestamp' has been reset to a column: imputed_cleaned_data_site1
Index 'Timestamp' has been reset to a column: cleaned_data_site1


print(f"☑️ [Historical Average Imputation] cleaned_data_site1 shape: {cleaned_data_site1.shape}")
before_and_after_values_visualization(cleaned_data_site1.copy(), imputed_cleaned_data_site1.copy(), 'historical_average_imputation', numerical_columns_site1, 'Penrose', feature_column='PM2.5')
before_and_after_values_visualization(cleaned_data_site1.copy(), imputed_cleaned_data_site1.copy(), 'historical_average_imputation', numerical_columns_site1, 'Penrose', feature_column='PM10')

print(f"✅ [Historical Average Imputation] imputed_cleaned_data_site1 imputed shape: {imputed_cleaned_data_site1.shape}")

☑️ [Historical Average Imputation] cleaned_data_site1 shape: (17375, 12)

✅ [Historical Average Imputation] imputed_cleaned_data_site1 imputed shape: (17375, 12)


if cleaned_data_site2.index.name != 'Timestamp':
    ## Check if 'Timestamp' is not the index and set it as the index if needed
    ## Set 'Timestamp' as the index of the DataFrame
    cleaned_data_site2.set_index('Timestamp', inplace=True)

imputed_cleaned_data_site2 = historical_average_imputation(cleaned_data_site2.copy(), numerical_columns_site2, verbose=False)

## Check if there are any NaNs left in the simulated data
# remaining_nans = imputed_cleaned_data_site2.isna().any().any()
total_nans = imputed_cleaned_data_site2.isna().sum().sum()
print(f"Total NaNs in the DataFrame: {total_nans}")

# if imputed_cleaned_data_site2.isna().any().any():
#     ## 1-WEEK Moving Average Imputation  
#     imputed_cleaned_data_site2 = DataImputer.moving_average_imputation(imputed_cleaned_data_site2.copy(), window='168H', min_periods=1, verbose=True)

Completed imputation for all columns in 13.55 seconds.
Total NaNs in the DataFrame: 0


## Check if the index name of cleaned_data_site1 is 'Timestamp'
if imputed_cleaned_data_site2.index.name == 'Timestamp':
    ## Reset index to turn 'Timestamp' back into a column in imputed_cleaned_data_site2
    imputed_cleaned_data_site2.reset_index(inplace=True)
    print("Index 'Timestamp' has been reset to a column: imputed_cleaned_data_site2")
else:
    print("Index name is not 'Timestamp'. No changes made.")
    
## Check if the index name of cleaned_data_site1 is 'Timestamp'
if cleaned_data_site2.index.name == 'Timestamp':
    ## Reset index to turn 'Timestamp' back into a column in cleaned_data_site2
    cleaned_data_site2.reset_index(inplace=True)
    print("Index 'Timestamp' has been reset to a column: cleaned_data_site2")
else:
    print("Index name is not 'Timestamp'. No changes made.")

Index 'Timestamp' has been reset to a column: imputed_cleaned_data_site2
Index 'Timestamp' has been reset to a column: cleaned_data_site2


print(f"☑️ [Historical Average Imputation] cleaned_data_site2 shape: {cleaned_data_site2.shape}")

# before_and_after_values_visualization(cleaned_data_site2.copy(), imputed_cleaned_data_site2, method_description, numerical_columns_site2, 'Takapuna', feature_column='PM10')
before_and_after_values_visualization(cleaned_data_site2.copy(), imputed_cleaned_data_site2, 'historical_average_imputation', numerical_columns_site1, 'Takapuna', feature_column='PM2.5')
before_and_after_values_visualization(cleaned_data_site2.copy(), imputed_cleaned_data_site2, 'historical_average_imputation', numerical_columns_site1, 'Takapuna', feature_column='PM10')

print(f"✅ [Historical Average Imputation] imputed_cleaned_data_site2 imputed shape: {imputed_cleaned_data_site2.shape}")

☑️ [Historical Average Imputation] cleaned_data_site2 shape: (17359, 11)

✅ [Historical Average Imputation] imputed_cleaned_data_site2 imputed shape: (17359, 11)


from scipy.stats.mstats import winsorize
IS_WINSORIZING_OUTLIERS = True

def calculate_iqr_bounds(data, column):
    """
    Calculate the Interquartile Range (IQR) bounds for a given column in a DataFrame.

    Args:
    - data (pd.DataFrame): The DataFrame containing the data.
    - column (str): The column name for which to calculate the IQR bounds.

    Returns:
    - tuple: A tuple containing the lower and upper bounds.
    """
    ## Calculate IQR
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    ## Determine bounds for Winsorization based on IQR: the limits for potential outliers
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return lower_bound, upper_bound, Q1, Q3, IQR


# def detect_and_winsorize_outliers(data, numerical_columns):
def detect_and_handle_outliers(data, numerical_columns):
    """
    Detects outliers in numerical columns of a DataFrame using the IQR method and handles them using Winsorization.
    e.g. Apply the IQR method to detect outliers & Winsorization to handle/normalize outliers

    Limiting the influence of extreme outliers without removing them from the dataset.

    Args:
    - data (pd.DataFrame): The DataFrame to process.
    - numerical_columns (list): A list of column names to check for and handle outliers.

    Returns:
    - pd.DataFrame: A DataFrame with the outliers handled.
    """
    ## Initialize an empty list to store summary statistics
    outlier_summaries = []
    for col in numerical_columns:
        lower_bound, upper_bound, Q1, Q3, IQR = calculate_iqr_bounds(data, col)

        ## Calculate_percentiles_for_winsorization
        lower_winsorize_limit = data[col][data[col] < lower_bound].count() / data[col].count()
        upper_winsorize_limit = data[col][data[col] > upper_bound].count() / data[col].count()

        ## Collecting and calculating outlier metrics for summary
        # outlier_mask = (data[col] < lower_bound) | (data[col] > upper_bound)
        # num_outliers = outlier_mask.sum()
        ## Consider only non-NaN values for total records
        # num_outliers = outlier_mask.notnull().sum()

        ## Determine the percentages of the data that are outliers; this is for Winsorization limits
        num_lower_outliers = data[col][data[col] < lower_bound].count()
        num_upper_outliers = data[col][data[col] > upper_bound].count()
        num_outliers = num_lower_outliers + num_upper_outliers
        lower_outlier_percentage = num_lower_outliers / num_outliers
        upper_outlier_percentage = num_upper_outliers / num_outliers

        percentage_outliers = num_outliers / len(data) * 100

        if IS_WINSORIZING_OUTLIERS:
            # Calculate the actual min and max within the acceptable range
            valid_min = data[col][(data[col] >= lower_bound) & (data[col] <= upper_bound)].min()
            valid_max = data[col][(data[col] >= lower_bound) & (data[col] <= upper_bound)].max()
    
            ## Calculate the fractions for Winsorization
            lower_winsorize_limit = data[col][data[col] < valid_min].count() / len(data[col])
            upper_winsorize_limit = data[col][data[col] > valid_max].count() / len(data[col])
            
            ## Apply Winsorization
            # data[col] = winsorize(data[col], limits=(lower_winsorize_limit, upper_winsorize_limit))
            data[col] = winsorize(data[col], limits=(0.013653, 0.030128))

        # if IS_WINSORIZING_OUTLIERS:
            ## Instead of removing the outliers, we apply Winsorization (Winsorize column data)
            ## Winsorizing the data such that values beyond the limits are capped: Winsorize data points beyond the bounds
            # data[col] = winsorize(data[col], limits=(lower_outlier_percentage, upper_outlier_percentage))
            # data[col] = winsorize(data[col], limits=[lower_bound, upper_bound])
            # data[col] = winsorize(data[col], limits=(lower_winsorize_limit, upper_winsorize_limit))
    
            ## Applying winsorization to the column
            ## Note: winsorize operates on arrays, hence .values is used. It also modifies data in-place.
            # winsorized_data = winsorize(data[col].values, limits=(lower_winsorize_limit, upper_winsorize_limit))
            # ## Updating the column in the DataFrame with the winsorized data
            # data[col] = winsorized_data
    
            ## Apply Winsorization using calculated limits
            ## Note: It's crucial to ensure the limits are not NaN, indicating no outliers were found
            # if pd.notnull(lower_winsorize_limit) and pd.notnull(upper_winsorize_limit):
            #     data[col] = winsorize(data[col], limits=[lower_winsorize_limit, upper_winsorize_limit])

        ## Collect Outlier Summary statistics
        outlier_summaries.append({
            'Column': col,
            'Lower Winsorization Limit': lower_winsorize_limit,
            'Upper Winsorization Limit': upper_winsorize_limit,
            'Q1': Q1,
            'Q3': Q3,
            'IQR': IQR,
            'Lower Bound': lower_bound,
            'Upper Bound': upper_bound,
            'Number of Outliers': num_outliers,
            'Percentage of Outliers': percentage_outliers,
            'No. of Lower Outliers': num_lower_outliers,
            'No. of Upper Outliers': num_upper_outliers,
            'Lower Outlier %': num_lower_outliers / data[col].notnull().sum(),
            'Upper Outlier %': num_upper_outliers / data[col].notnull().sum()
            # 'Lower Outlier %': lower_outlier_percentage * 100, ## Convert to percentage
            # 'Upper Outlier %': upper_outlier_percentage * 100  ## Convert to percentage
        })

    outlier_summary_df = pd.DataFrame(outlier_summaries)

    return data, outlier_summary_df


## Detect and Handle/Retrieve Outliers
winsorized_imputed_cleaned_data_site1, outlier_summary_df1 = detect_and_handle_outliers(imputed_cleaned_data_site1.copy(), numerical_columns_site1)
winsorized_imputed_cleaned_data_site2, outlier_summary_df2 = detect_and_handle_outliers(imputed_cleaned_data_site2.copy(), numerical_columns_site2)

## Display the outlier summary
print("Outlier Summary for Site 1 (Penrose):\n")
outlier_summary_df1

## The result is a DataFrame 'rawdata' with outliers normalized/adjusted based on IQR and Winsorization
# winsorized_imputed_cleaned_data_site1.head()
# winsorized_imputed_cleaned_data_site2.head()

## TODO
## After applying IQR and Winsorization, the data is now ready for further processing such as PCA
## The `preprocessed_data` is ready for further analysis like PCA

Outlier Summary for Site 1 (Penrose):


## Display the outlier summary
print("Outlier Summary or Site 2 (Takapuna):\n")
outlier_summary_df2

Outlier Summary or Site 2 (Takapuna):


## TODO
## Identify Outliers with extreme values
## Let's assume Air_temp > 40°C, Rel_humidity > 100, and PM2.5 > 100 are outliers
extreme_outliers = rawdata[(rawdata['Air_Temp'] > 40) | (rawdata['Rel_Humidity'] > 100) | (rawdata['PM2.5'] > 100)]

## Print information about missing data and outliers
extreme_outliers

## FIXME: winsorize(cleaned_data[FEATURE], (lower_outlier_percentage,upper_outlier_percentage))

# ## Winsorization to handle outliers
# cleaned_data['Air_Temp'] = winsorize(cleaned_data['Air_Temp'], (0.05, 0.05))
# cleaned_data['Rel_Humidity'] = winsorize(cleaned_data['Rel_Humidity'], (0.05, 0.05))
# cleaned_data['PM2.5'] = winsorize(cleaned_data['PM2.5'], (0.05, 0.05))
# cleaned_data['PM10'] = winsorize(cleaned_data['PM10'], (0.05, 0.05))

# ## Show first 5 rows of cleaned_data
# cleaned_data.head()


# imputed_cleaned_data_site1.shape

## [What If] Visualize the difference between before and after the Winsorization
before_and_after_values_visualization(imputed_cleaned_data_site1.copy(), winsorized_imputed_cleaned_data_site1, '[What-If] Winsorization', numerical_columns_site1, 'Penrose', feature_column='PM2.5')
before_and_after_values_visualization(imputed_cleaned_data_site1.copy(), winsorized_imputed_cleaned_data_site1, '[What-If] Winsorization', numerical_columns_site1, 'Penrose', feature_column='PM10')


# imputed_cleaned_data_site2.shape

## [What If] Visualize the difference between before and after the Winsorization
before_and_after_values_visualization(imputed_cleaned_data_site2.copy(), winsorized_imputed_cleaned_data_site2, '[What-If] Winsorization', numerical_columns_site2, 'Takapuna', feature_column='PM2.5')
before_and_after_values_visualization(imputed_cleaned_data_site2.copy(), winsorized_imputed_cleaned_data_site2, '[What-If] Winsorization', numerical_columns_site2, 'Takapuna', feature_column='PM10')


from prettytable import PrettyTable

def display_site_comparison():
    """
    This function creates and displays a table that lists various data attributes along with descriptions and 
    their applicability to different datasets across sites using the PrettyTable library.
    """
    
    ## Define the PrettyTable table columns and set column alignments
    table = PrettyTable()
    table.field_names = ["Variable Name", "Description", "All Sites", "Penrose", "Takapuna"]
    table.align["Variable Name"] = "l"
    table.align["Description"] = "l"
    table.align["All Sites"] = "c"
    table.align["Penrose"] = "c"
    table.align["Takapuna"] = "c"

    ## Define the data dictionary with clear structure that containing information about the datasets and placeholders for potential future expansion
    data_dictionary = [
        ("rawdata", "Complete dataset containing all observations across all sites.", "[x]", "[x]", "[x]"),
        ("numerical_columns_site1", "Numerical columns specific to Site 1.", "[ ]", "[x]", "[ ]"),
        ("nominal_columns_site1", "Nominal columns specific to Site 1.", "[ ]", "[x]", "[ ]"),
        ("numerical_columns_site2", "Numerical columns specific to Site 2.", "[ ]", "[ ]", "[x]"),
        ("nominal_columns_site2", "Nominal columns specific to Site 2.", "[ ]", "[ ]", "[x]"),
        ("rawdata_site1", "Subset of raw data for Site 1.", "[ ]", "[x]", "[ ]"),
        ("rawdata_site2", "Subset of raw data for Site 2.", "[ ]", "[ ]", "[x]"),
        ("---------------------------", "---------------------------------------------------------------------", "---------", "-------", "--------"),  ## Blank line for separation
        ("cleaned_data", "Cleaned dataset with preprocessing applied.", "[x]", "[x]", "[x]"),
        ("cleaned_ordinal_columns", "Ordinal columns in the cleaned dataset.", "[x]", "[x]", "[x]"),
        ("cleaned_numerical_columns", "Numerical columns in the cleaned dataset.", "[x]", "[x]", "[x]"),
        ("cleaned_nominal_columns", "Nominal columns in the cleaned dataset.", "[x]", "[x]", "[x]"),
        ("cleaned_data1", "Cleaned data for Site 1.", "[ ]", "[x]", "[ ]"),
        ("cleaned_data2", "Cleaned data for Site 2.", "[ ]", "[ ]", "[x]"),
        ("---------------------------", "---------------------------------------------------------------------", "---------", "-------", "--------"),  ## Blank line for separation
        ("imputed_cleaned_data_site1", "Imputed Cleaned data for Site 1.", "[ ]", "[x]", "[ ]"),
        ("imputed_cleaned_data_site2", "Imputed Cleaned data for Site 2.", "[ ]", "[ ]", "[x]"),
        ("---------------------------", "---------------------------------------------------------------------", "---------", "-------", "--------"),  ## Blank line for separation
        ("winsorized_imputed_cleaned_data_site1", "Winsorized Imputed Cleaned data for Site 1.", "[ ]", "[x]", "[ ]"),
        ("winsorized_imputed_cleaned_data_site2", "Winsorized Imputed Cleaned data for Site 2.", "[ ]", "[ ]", "[x]"),
        # ("---------------------------", "---------------------------------------------------------------------", "---------", "-------", "--------"),  ## Blank line for separation
        # ("top_features_PM25_site1", "Top features correlated with PM2.5 at Site 1 (Penrose).", "[ ]", "[x]", "[ ]"),
        # ("top_features_PM25_site2", "Top features correlated with PM2.5 at Site 2 (Takapuna).", "[ ]", "[ ]", "[x]"),
        # ("top_features_PM10_site1", "Top features correlated with PM10 at Site 1.", "[ ]", "[x]", "[ ]"),
        # ("top_features_PM10_site2", "Top features correlated with PM10 at Site 2.", "[ ]", "[ ]", "[x]"),
        # ("summary_stats_PM25_penrose", "[PM2.5] Summary statistics for the Penrose site after preprocessing.", "[ ]", "[x]", "[ ]"),
        # ("summary_stats_PM25_takapuna", "[PM2.5] Summary statistics for the Takapuna site after preprocessing.", "[ ]", "[ ]", "[x]"),
        # ("summary_stats_PM10_penrose", "[PM10] Summary statistics for the Penrose site after preprocessing.", "[ ]", "[x]", "[ ]"),
        # ("summary_stats_PM10_takapuna", "[PM10] Summary statistics for the Takapuna site after preprocessing.", "[ ]", "[ ]", "[x]"),
        # ("selected_features_PM25_penrose", "[PM2.5] Target + Selected Features for the Penrose site.", "[ ]", "[x]", "[ ]"),
        # ("selected_features_PM25_takapuna", "[PM2.5] Target + Selected Features for the Takapuna site.", "[ ]", "[ ]", "[x]"),
        # ("selected_features_PM10_penrose", "[PM10] Target + Selected Features for the Penrose site.", "[ ]", "[x]", "[ ]"),
        # ("selected_features_PM10_takapuna", "[PM10] Target + Selected Features for the Takapuna site.", "[ ]", "[ ]", "[x]"),
    ]

    ## Format for better readability
    for entry in data_dictionary:
        ## Adding rows to the table
        table.add_row(entry)

    ## Print the table in an organized format
    print(table)

## Call the function to display the table
print('\n🎓 Listing variables with description...')
display_site_comparison()

🎓 Listing variables with description...
+---------------------------------------+-----------------------------------------------------------------------+-----------+---------+----------+
| Variable Name                         | Description                                                           | All Sites | Penrose | Takapuna |
+---------------------------------------+-----------------------------------------------------------------------+-----------+---------+----------+
| rawdata                               | Complete dataset containing all observations across all sites.        |    [x]    |   [x]   |   [x]    |
| numerical_columns_site1               | Numerical columns specific to Site 1.                                 |    [ ]    |   [x]   |   [ ]    |
| nominal_columns_site1                 | Nominal columns specific to Site 1.                                   |    [ ]    |   [x]   |   [ ]    |
| numerical_columns_site2               | Numerical columns specific to Site 2.                                 |    [ ]    |   [ ]   |   [x]    |
| nominal_columns_site2                 | Nominal columns specific to Site 2.                                   |    [ ]    |   [ ]   |   [x]    |
| rawdata_site1                         | Subset of raw data for Site 1.                                        |    [ ]    |   [x]   |   [ ]    |
| rawdata_site2                         | Subset of raw data for Site 2.                                        |    [ ]    |   [ ]   |   [x]    |
| ---------------------------           | --------------------------------------------------------------------- | --------- | ------- | -------- |
| cleaned_data                          | Cleaned dataset with preprocessing applied.                           |    [x]    |   [x]   |   [x]    |
| cleaned_ordinal_columns               | Ordinal columns in the cleaned dataset.                               |    [x]    |   [x]   |   [x]    |
| cleaned_numerical_columns             | Numerical columns in the cleaned dataset.                             |    [x]    |   [x]   |   [x]    |
| cleaned_nominal_columns               | Nominal columns in the cleaned dataset.                               |    [x]    |   [x]   |   [x]    |
| cleaned_data1                         | Cleaned data for Site 1.                                              |    [ ]    |   [x]   |   [ ]    |
| cleaned_data2                         | Cleaned data for Site 2.                                              |    [ ]    |   [ ]   |   [x]    |
| ---------------------------           | --------------------------------------------------------------------- | --------- | ------- | -------- |
| imputed_cleaned_data_site1            | Imputed Cleaned data for Site 1.                                      |    [ ]    |   [x]   |   [ ]    |
| imputed_cleaned_data_site2            | Imputed Cleaned data for Site 2.                                      |    [ ]    |   [ ]   |   [x]    |
| ---------------------------           | --------------------------------------------------------------------- | --------- | ------- | -------- |
| winsorized_imputed_cleaned_data_site1 | Winsorized Imputed Cleaned data for Site 1.                           |    [ ]    |   [x]   |   [ ]    |
| winsorized_imputed_cleaned_data_site2 | Winsorized Imputed Cleaned data for Site 2.                           |    [ ]    |   [ ]   |   [x]    |
+---------------------------------------+-----------------------------------------------------------------------+-----------+---------+----------+


## Filter cleaned data for the specified sites
# imputed_cleaned_data_site1 = cleaned_data[cleaned_data['Site'] == 'Penrose']
# imputed_cleaned_data_site2 = cleaned_data[cleaned_data['Site'] == 'Takapuna']
# 
## Remove redundant columns before saving to *.csv
# imputed_cleaned_data_site1 = cleaned_data_site1.drop(['SO2', 'Site', 'Site_Class'], axis=1)
# imputed_cleaned_data_site2 = cleaned_data_site2.drop(['Site', 'Site_Class'], axis=1)

imputed_cleaned_data_site1['Timestamp'] = imputed_cleaned_data_site1['Timestamp'].dt.strftime('%d/%m/%Y %H:%M')
imputed_cleaned_data_site2['Timestamp'] = imputed_cleaned_data_site2['Timestamp'].dt.strftime('%d/%m/%Y %H:%M')

CLEAN_DATA_PATH = 'data/cleaned'
imputed_cleaned_data_site1.to_csv(f"{CLEAN_DATA_PATH}/cleaned_Penrose7-07May2020-to-30Apr2022.csv", index=False)
imputed_cleaned_data_site2.to_csv(f"{CLEAN_DATA_PATH}/cleaned_Takapuna23-07May2020-to-30Apr2022.csv", index=False)

	Timestamp	AQI	PM10	PM2.5	SO2	NO	NO2	NOx	Wind_Speed	Wind_Dir	Air_Temp	Rel_Humidity
0	2020-05-07 00:00:00	26.0	16.5	16.1	1.8	60.3	40.9	101.2	0.6	316.0	8.0	78.0
1	2020-05-07 01:00:00	28.0	17.7	10.1	1.0	NaN	NaN	NaN	0.7	269.0	8.0	76.8
2	2020-05-07 02:00:00	28.0	15.0	10.3	0.1	16.0	29.2	45.3	1.0	180.0	8.0	78.4
3	2020-05-07 03:00:00	29.0	14.3	11.4	0.0	11.2	27.5	38.7	0.8	232.0	8.0	77.5
4	2020-05-07 04:00:00	30.0	8.8	10.6	-0.1	12.0	28.5	40.5	0.8	274.0	7.0	80.1
...	...	...	...	...	...	...	...	...	...	...	...	...
20018	2022-04-30 20:00:00	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
20019	2022-04-30 21:00:00	24.0	NaN	1.9	0.6	5.8	25.4	31.3	0.8	124.0	15.0	85.4
20020	2022-04-30 21:00:00	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
20021	2022-04-30 22:00:00	24.0	8.2	6.0	0.5	1.8	17.9	19.7	0.8	108.0	14.0	86.0
20022	2022-04-30 23:00:00	24.0	2.3	5.4	0.5	1.0	7.3	8.3	0.9	93.0	14.0	84.4

	Timestamp	AQI	PM10	PM2.5	SO2	NO	NO2	NOx	Wind_Speed	Wind_Dir	Air_Temp	Rel_Humidity
count	20023	17019.000000	16818.000000	16102.000000	16623.000000	16260.000000	16260.000000	16259.000000	17182.000000	17181.000000	17206.000000	17338.000000
mean	2021-05-06 21:58:02.954601984	28.233974	13.881787	5.348180	0.851639	10.041562	14.123253	24.151350	2.807368	173.234387	16.748925	69.426087
min	2020-05-07 00:00:00	0.000000	-6.400000	-12.300000	-4.400000	-1.900000	-1.400000	-1.300000	0.200000	1.000000	3.000000	26.900000
25%	2020-11-07 02:30:00	22.000000	8.400000	2.100000	0.100000	0.800000	3.900000	5.100000	1.600000	73.000000	14.000000	60.200000
50%	2021-05-08 19:00:00	27.000000	13.100000	4.700000	0.500000	3.600000	10.600000	14.900000	2.600000	211.000000	17.000000	70.400000
75%	2021-11-04 15:30:00	33.000000	18.400000	7.700000	1.000000	10.900000	20.700000	31.650000	3.800000	235.000000	20.000000	79.500000
max	2022-04-30 23:00:00	121.000000	196.700000	138.400000	25.200000	378.400000	83.100000	455.100000	8.200000	357.000000	28.000000	92.000000
std	NaN	9.693312	8.044288	5.259335	1.507654	21.065762	12.748169	30.956155	1.538615	94.038611	4.013776	12.290750

	Timestamp	AQI	PM10	PM2.5	NO	NO2	NOx	Wind_Speed	Wind_Dir	Air_Temp	Rel_Humidity
20023	2020-05-07 17:00:00	21.0	5.95	4.15	10.90	0.01715	28.00	2.50	242.0	15.5	68.15
20024	2020-05-07 18:00:00	21.0	5.65	5.10	8.20	0.01655	24.70	2.20	239.5	15.5	61.75
20025	2020-05-07 19:00:00	21.0	7.70	5.45	5.75	0.01325	19.00	2.10	244.0	15.0	59.45
20026	2020-05-07 20:00:00	21.0	8.20	5.45	3.50	0.00870	12.20	2.25	251.0	15.0	59.25
20027	2020-05-07 21:00:00	21.0	11.80	5.80	3.55	0.00930	12.90	2.10	261.0	15.0	61.75
...	...	...	...	...	...	...	...	...	...	...	...
37377	2022-04-30 19:00:00	14.0	4.75	3.30	0.60	0.00440	5.00	2.55	109.5	16.0	76.05
37378	2022-04-30 20:00:00	14.0	6.35	3.15	0.50	0.00365	4.15	2.45	105.5	16.0	74.50
37379	2022-04-30 21:00:00	14.0	6.05	2.80	0.40	0.00480	5.20	2.35	115.5	16.0	74.15
37380	2022-04-30 22:00:00	13.0	4.20	2.60	0.40	0.00555	5.90	1.95	122.5	16.0	75.95
37381	2022-04-30 23:00:00	13.0	5.00	2.80	0.35	0.00405	4.30	1.95	119.0	16.0	78.00

	Timestamp	AQI	PM10	PM2.5	NO	NO2	NOx	Wind_Speed	Wind_Dir	Air_Temp	Rel_Humidity
count	17359	17359.000000	16579.000000	15807.000000	16187.000000	16185.000000	16187.000000	16773.000000	16771.000000	16785.000000	16921.000000
mean	2021-05-04 08:00:00	25.152947	11.590983	6.251275	12.005912	0.012416	24.203614	2.501792	176.335132	16.280548	69.909931
min	2020-05-07 17:00:00	-13.000000	-3.800000	0.450000	-0.800000	-0.000500	-0.700000	0.200000	5.500000	3.500000	30.400000
25%	2020-11-04 12:30:00	19.000000	7.400000	4.250000	2.450000	0.004000	7.050000	1.250000	75.500000	13.500000	61.250000
50%	2021-05-04 08:00:00	24.000000	10.650000	5.550000	5.850000	0.009100	15.100000	2.300000	205.000000	16.000000	71.050000
75%	2021-11-01 03:30:00	30.000000	14.500000	7.350000	12.200000	0.017700	29.400000	3.500000	261.000000	19.500000	79.850000
max	2022-04-30 23:00:00	318.000000	437.250000	52.300000	352.450000	0.072100	401.950000	7.800000	346.500000	27.000000	90.350000
std	NaN	10.637321	7.911251	3.507078	22.592626	0.010951	30.662867	1.468465	93.196025	3.959962	11.804317

	Column	Lower Winsorization Limit	Upper Winsorization Limit	Q1	Q3	IQR	Lower Bound	Upper Bound	Number of Outliers	Percentage of Outliers	No. of Lower Outliers	No. of Upper Outliers	Lower Outlier %	Upper Outlier %
0	AQI	0.003453	0.022964	22.0	33.0	11.0	5.50	49.50	459	2.641727	60	399	0.003453	0.022964
1	PM10	0.000058	0.016403	8.5	18.3	9.8	-6.20	33.00	286	1.646043	1	285	0.000058	0.016403
2	PM2.5	0.001439	0.036777	2.2	7.7	5.5	-6.05	15.95	664	3.821583	25	639	0.001439	0.036777
3	SO2	0.000058	0.071252	0.1	1.0	0.9	-1.25	2.35	1239	7.130935	1	1238	0.000058	0.071252
4	NO	0.000000	0.084259	0.9	10.7	9.8	-13.80	25.40	1464	8.425899	0	1464	0.000000	0.084259
5	NO2	0.000000	0.028892	4.1	20.3	16.2	-20.20	44.60	502	2.889209	0	502	0.000000	0.028892
6	NOx	0.000000	0.057784	5.3	31.1	25.8	-33.40	69.80	1004	5.778417	0	1004	0.000000	0.057784
7	Wind_Speed	0.000000	0.002935	1.6	3.8	2.2	-1.70	7.10	51	0.293525	0	51	0.000000	0.002935
8	Wind_Dir	0.000000	0.000000	74.0	235.0	161.0	-167.50	476.50	0	0.000000	0	0	0.000000	0.000000
9	Air_Temp	0.001151	0.000000	14.0	20.0	6.0	5.00	29.00	20	0.115108	20	0	0.001151	0.000000
10	Rel_Humidity	0.000403	0.000000	60.2	79.5	19.3	31.25	108.45	7	0.040288	7	0	0.000403	0.000000

🎯 Libraries and Reusable Functions¶

Python Reusable Functions¶

⚙️ Data Preprocessing & Preparation¶

Clean the Time Series Dataset¶

Missing Data Imputation in Time-Series¶

[Advanced] Historical Average Imputation 🦅¶

Filter Outliers¶

💾 Save Imputed_Cleaned-Data to *.csv¶

	Column	Lower Winsorization Limit	Upper Winsorization Limit	Q1	Q3	IQR	Lower Bound	Upper Bound	Number of Outliers	Percentage of Outliers	No. of Lower Outliers	No. of Upper Outliers	Lower Outlier %	Upper Outlier %
0	AQI	0.001498	0.036350	19.00000	30.0000	11.00000	2.500000	46.500000	657	3.784780	26	631	0.001498	0.036350
1	PM10	0.000115	0.026787	7.46875	14.4000	6.93125	-2.928125	24.796875	467	2.690247	2	465	0.000115	0.026787
2	PM2.5	0.000000	0.045740	4.30000	7.3000	3.00000	-0.200000	11.800000	794	4.573996	0	794	0.000000	0.045740
3	NO	0.000000	0.088945	2.35000	11.9000	9.55000	-11.975000	26.225000	1544	8.894522	0	1544	0.000000	0.088945
4	NO2	0.000000	0.039864	0.00395	0.0171	0.01315	-0.015775	0.036825	692	3.986405	0	692	0.000000	0.039864
5	NOx	0.000000	0.072873	6.91250	28.5000	21.58750	-25.468750	60.881250	1265	7.287286	0	1265	0.000000	0.072873
6	Wind_Speed	0.000000	0.003456	1.25000	3.5000	2.25000	-2.125000	6.875000	60	0.345642	0	60	0.000000	0.003456
7	Wind_Dir	0.000000	0.000000	76.50000	260.0000	183.50000	-198.750000	535.250000	0	0.000000	0	0	0.000000	0.000000
8	Air_Temp	0.001325	0.000000	13.50000	19.5000	6.00000	4.500000	28.500000	23	0.132496	23	0	0.001325	0.000000
9	Rel_Humidity	0.000288	0.000000	61.20000	79.7000	18.50000	33.450000	107.450000	5	0.028804	5	0	0.000288	0.000000