import os, logging

import logging
logging.getLogger('prophet').setLevel(logging.WARNING)
logging.getLogger("cmdstanpy").disabled=True

## .env --> Setting the environment variable for output format programmatically
os.environ['ENV_PATH'] = 'cleaned_vantage.env'  ## Set to `.env_cleaned` or `.env_raw` as needed

# %run -i ./Data_Loading_and_Descriptive_Statistics.ipynb
%run -i ./DataFrameAdapter.ipynb

2024-05-21 14:55:55,439 - INFO - data/cleaned/cleaned_Penrose7-07May2020-to-30Apr2022.csv
2024-05-21 14:55:55,439 - INFO - data/cleaned/cleaned_Takapuna23-07May2020-to-30Apr2022.csv
2024-05-21 14:55:55,448 - INFO - 
ℹ️ Load Data from data/cleaned/cleaned_Penrose7-07May2020-to-30Apr2022.csv file --> rawdata DataFrame 📂
2024-05-21 14:55:55,489 - INFO - 
ℹ️ Load Data from data/cleaned/cleaned_Takapuna23-07May2020-to-30Apr2022.csv file --> rawdata DataFrame 📂

🎓 The specified .env file: cleaned_vantage.env
+-----------------------------+---------------------------------------------------------------------------------------------+-----------------------+
| Variable                    | Description                                                                                 | Value                 |
+-----------------------------+---------------------------------------------------------------------------------------------+-----------------------+
| IS_LOADING_FROM_FILES       | True if loading data from *.csv/xls files; False if using imported data in Teradata Vantage | True                  |
| IS_TERADATA_VANTAGE         | Using scable Teradata Vantage vs. local machine (Laptop)                                    | False                 |
| IS_DATA_IN_TERADATA_VANTAGE | Using TeradataDataFrame in scalable Vantage vs. PandasDataFrame with local *.csv/xls files  | False                 |
| SCHEMA_NAME                 | [Teradata Vantage] Schema Name                                                              | Air_Pollution         |
| TABLE_NAME                  | [Teradata Vantage] Table Name                                                               | Air_Pollution_cleaned |
| IS_JUPYTERLAB               | Running in JupyterLab vs Python Dash/Vizro Dashboard                                        | True                  |
| IS_TEST_DEV                 | Is Test/Dev mode is active or not (in Production)                                           | False                 |
| DATA_PATH                   | *.csv/xls Data PATH                                                                         | Not set or not found  |
| USE_DATA_PREFIX             | Prefix to use for data files: 'raw' | 'cleaned'                                             | cleaned               |
+-----------------------------+---------------------------------------------------------------------------------------------+-----------------------+

ℹ️ Load Data from data/cleaned folder


ℹ️ Combined Data Shape: (34734, 13)
ℹ️ The Shape of the Dataframe rawdata_site1 (Penrose) and rawdata_site2 (Takapuna): (17375, 12) (17359, 11)

🎓 Describing the types of each attribute as numerical_columns (Continuous), ordinal_columns (Ordinal), or nominal_columns (Nominal) ...

ℹ️ Numerical Variables/Features: 
 ['AQI', 'PM10', 'PM2.5', 'SO2', 'NO', 'NO2', 'NOx', 'Wind_Speed', 'Wind_Dir', 'Air_Temp', 'Rel_Humidity']

ℹ️ Ordinal Variables/Features: 
 ['Timestamp']

ℹ️ Nominal Variables/Features: 
 Index(['Site'], dtype='object')

🎓 1. [Site 1 - Penrose][numerical_columns_S1, nominal_columns_S1] Summary Statistics of the Dataframe such as the mean, maximum and minimum values ...

ℹ️ Numerical Variables/Features: 
 ['AQI', 'PM10', 'PM2.5', 'SO2', 'NO', 'NO2', 'NOx', 'Wind_Speed', 'Wind_Dir', 'Air_Temp', 'Rel_Humidity']

ℹ️ Ordinal Variables/Features: 
 ['Timestamp']

🎓 2. [Site 2 - Takapuna][numerical_columns_S2, nominal_columns_S2]  Summary Statistics of the {site2} Dataframe such as the mean, maximum and minimum values ...

ℹ️ Numerical Variables/Features: 
 ['AQI', 'PM10', 'PM2.5', 'NO', 'NO2', 'NOx', 'Wind_Speed', 'Wind_Dir', 'Air_Temp', 'Rel_Humidity']

ℹ️ Ordinal Variables/Features: 
 ['Timestamp']

🎓 [Data_Loading_and_Descriptive_Statistics.ipynb] Listing variables with description...
+-------------------------+----------------------------------------------------------------+-----------+---------+----------+
| Variable Name           | Description                                                    | All Sites | Penrose | Takapuna |
+-------------------------+----------------------------------------------------------------+-----------+---------+----------+
| rawdata                 | Complete dataset containing all observations across all sites. |    [x]    |   [x]   |   [x]    |
| ordinal_columns         | Ordinal columns specific to Site 1.                            |    [x]    |   [x]   |   [x]    |
| numerical_columns_site1 | Numerical columns specific to Site 1.                          |    [ ]    |   [x]   |   [ ]    |
| nominal_columns_site1   | Nominal columns specific to Site 1.                            |    [ ]    |   [x]   |   [ ]    |
| numerical_columns_site2 | Numerical columns specific to Site 2.                          |    [ ]    |   [ ]   |   [x]    |
| nominal_columns_site2   | Nominal columns specific to Site 2.                            |    [ ]    |   [ ]   |   [x]    |
| rawdata_site1           | Subset of raw data for Site 1.                                 |    [ ]    |   [x]   |   [ ]    |
| rawdata_site2           | Subset of raw data for Site 2.                                 |    [ ]    |   [ ]   |   [x]    |
+-------------------------+----------------------------------------------------------------+-----------+---------+----------+

🎓 Describing the types of each attribute as cleaned_numerical_columns (Continuous), cleaned_ordinal_columns (Ordinal), or cleaned_nominal_columns (Nominal) ...

ℹ️ Numerical Variables/Features: 
 ['AQI', 'PM10', 'PM2.5', 'SO2', 'NO', 'NO2', 'NOx', 'Wind_Speed', 'Wind_Dir', 'Air_Temp', 'Rel_Humidity', 'Hour', 'Day', 'DayOfWeek', 'Month', 'Quarter', 'Year', 'WeekOfYear', 'Season', 'PM2.5_Lag1', 'PM2.5_Lag2', 'PM10_Lag1', 'PM10_Lag2']

ℹ️ Ordinal Variables/Features: 
 ['Timestamp']

ℹ️ Nominal Variables/Features: 
 Index(['Site'], dtype='object')

🎓 [Site1 - Penrose]  Summary Statistics of the {site1} cleaned_data_site1 Dataframe such as the mean, max/minimum values ...

🎓 [Site2 - Takapuna]  Summary Statistics of the {site2} cleaned_data_site2 Dataframe such as the mean, max/minimum values ...

🎓 [DataFrameAdapter.ipynb] Listing variables with description...
+-----------------------------+-----------------------------------------------------------------------+-----------+---------+----------+
| Variable Name               | Description                                                           | All Sites | Penrose | Takapuna |
+-----------------------------+-----------------------------------------------------------------------+-----------+---------+----------+
| rawdata                     | Complete dataset containing all observations across all sites.        |    [x]    |   [x]   |   [x]    |
| numerical_columns_site1     | Numerical columns specific to Site 1.                                 |    [ ]    |   [x]   |   [ ]    |
| nominal_columns_site1       | Nominal columns specific to Site 1.                                   |    [ ]    |   [x]   |   [ ]    |
| numerical_columns_site2     | Numerical columns specific to Site 2.                                 |    [ ]    |   [ ]   |   [x]    |
| nominal_columns_site2       | Nominal columns specific to Site 2.                                   |    [ ]    |   [ ]   |   [x]    |
| rawdata_site1               | Subset of raw data for Site 1.                                        |    [ ]    |   [x]   |   [ ]    |
| rawdata_site2               | Subset of raw data for Site 2.                                        |    [ ]    |   [ ]   |   [x]    |
| --------------------------- | --------------------------------------------------------------------- | --------- | ------- | -------- |
| cleaned_data                | Cleaned dataset with preprocessing applied.                           |    [x]    |   [x]   |   [x]    |
| cleaned_ordinal_columns     | Ordinal columns in the cleaned dataset.                               |    [x]    |   [x]   |   [x]    |
| cleaned_numerical_columns   | Numerical columns in the cleaned dataset.                             |    [x]    |   [x]   |   [x]    |
| cleaned_nominal_columns     | Nominal columns in the cleaned dataset.                               |    [x]    |   [x]   |   [x]    |
| cleaned_data_site1          | Cleaned data for Site 1.                                              |    [ ]    |   [x]   |   [ ]    |
| cleaned_data_site2          | Cleaned data for Site 2.                                              |    [ ]    |   [ ]   |   [x]    |
+-----------------------------+-----------------------------------------------------------------------+-----------+---------+----------+

# if IS_JUPYTERLAB:
target_variables = {'PM2.5': 'Particulate Matter <2.5 µm', 'PM10': 'Particulate Matter <10 µm'}
data_series_df1 = PandasDataLoader.load_data(rawdata_site1)
data_series_df2 = PandasDataLoader.load_data(rawdata_site2)

data_series_df1

# cleaned_data_site2.index
# data_series_df2.index
data_series_df2

class TimeSeriesAnalyzer(ABC):
    """
    Abstract base class for a Time Series Analyzer.
    """

    def __init__(self, data, target_col):
        """ ✅ Step 1: Data Loading & Splitting """
        # self.data = data
        self.target_col = target_col   
        ## Placeholder for transformed data, if necessary
        # self.transformed_data = None
        ## Placeholder for differenced data, if necessary
        # self.differenced_data = None
        ## Placeholder for the selected model
        self.model = None

    @abstractmethod
    def timeseries_visualization(self, lags=15, max_diff_order=3):
        """
        ✅ Step 2: Visualizes the time series data with original time-series and histogram plots; also its ACF and PACF plots.
        Also considers Box-Cox transformation for variance stabilisationif the data is skewed.

        Original Series and Histogram plots; also ACF and PACF plots are now side by side for better comparative analysis.

        - ACF and PACF plots help identify the order of the AR and MA components for ARIMA modeling.
        - Histogram analysis after transformation helps understand data normality and variance stability.
        -
        - If the histogram of the transformed data is more symmetrical, a Box-Cox transformation is helpful.
        - Significant spikes in the ACF/PACF plots indicate potential AR or MA terms for ARIMA modeling.
        - Check if transformation parameters need adjustment based on skewness and kurtosis metrics.

        - [x] 2.1. Plot the original time series: visual sense of trends, seasonality, or irregular patterns.
        - [ ] 2.1. Plot the histogram of the series values: After applying a Box-Cox transformation, the histogram helps to check if the data distribution has become more symmetrical, which is often required for effective modeling in time series analysis.
        - [x] 2.2. Plot the Auto-Correlation Function (ACF): visualize how the time series data correlates with its lagged values, which is crucial for identifying the MA (Moving Average) components of an ARIMA model.
        - [x] 2.2. Plot the PACF: provides insights into the AR (AutoRegressive) components of the ARIMA model by isolating the influence of intervening lags.
        """
        pass

    @abstractmethod
    def transform_data(data, transformer, lag_max=15):
        """
        ✅ Step 3: Apply transformation to the time series data and display diagnostics.
        
        Parameters:
        - data (np.array or pd.Series): The original time series data.
        - transformer (object): An instance of a transformer from pmdarima.preprocessing.
        - lag_max (int): Maximum lag to which the autocorrelation function is computed.
        
        Returns:
        - Transformed data
        """
        pass


    @abstractmethod
    def model_selection(self):
        """
        Step 5: iteratively select models based on the AIC: try the chosen model(s) and use the AIC to search for a better model.
        """
        pass

    @abstractmethod
    def multi_step_forecast(self):
        """
        Step 6: Check the residuals from the chosen model by plotting the ACF of the residual, and doing a portmanteau test of the residual.
        Do the residuals look like white noise?
        """
        pass

    # @abstractmethod
    # def calculate_forecast(self):
    #     """
    #     Step 7: Calculate forecasts & plot 
    #     """
    #     pass

from pmdarima.preprocessing import BoxCoxEndogTransformer, LogEndogTransformer
from pmdarima.model_selection import train_test_split
from pmdarima import tsdisplay

class ArimaTimeSeriesAnalyzer(TimeSeriesAnalyzer):
    def __init__(self, data: pd.DataFrame, monitoring_site:str, target_col: str, is_auto_arima=False):
        """ 
        Initializes the analysis with training and testing data, ensuring that both datasets contain the specified target variable. 
        """
        super().__init__(data, target_col)
        print(f'✅ Step 1: Data Loading & Splitting 80/20')
        self.data = data
        self.site = monitoring_site   ## Penrose, Takapuna
        self.target_col = target_col  ## PM2.5, PM10
        # self.train_data = None
        # self.test_data = None
        # self.train_test_split(train_size=0.8) ## 80/20 Train/Test split
        self.is_auto_arima = is_auto_arima
        # self.transformed_data = None
        # self.lambda_value = None
        # self.is_stationary = True
        # self.sarima_model = None
        # self.arima_model = None
        self.model = None

    def train_test_split(self, train_size=0.8):
        """
        ☑️ [1] Train-Test split: Splits the data into training and testing sets.
        Use 80% as the in-time training data and the rest 20% as the out-of-time test data.
        :param train_size: float - proportion of the dataset to include in the train split.
        """
        train_len = int(len(self.data) * train_size)
        self.train_data = self.data.iloc[:train_len]
        self.test_data = self.data.iloc[train_len:]
        print(f'🎓 Training & Testing Data split (80/20): {len(self.train_data):,} train samples and {len(self.test_data):,} test samples.')
        self.y = self.data[self.target_col]   ## Compatible with Prophet
        self.y_train = self.train_data[self.target_col]
        self.y_test  = self.test_data[self.target_col]

    def plot_line_chart_and_histogram(self, y_train):
        """
        ☑️ [2] Plot line chart and histogram
        """
        print(f'\n🎓 [2.1] Plot the original time series: visual sense of trends, seasonality, or irregular patterns.') 
        plt.figure(figsize=(14, 7))
        
        ## 1.1. Original Series Plot
        plt.subplot(121) ## 1 row, 2 columns, 1st subplot
        # plt.plot(self.data.index, self.y, color='blue', label='Original Series')
        plt.plot(y_train.index, y_train, color='steelblue', label='Original Series')
        plt.title(f"Original Time Series of {self.target_col} at {self.site}")
        plt.xlabel("Timestamp")
        plt.ylabel(self.target_col)
        plt.legend()

        ## 1.2 Plot the histogram of the series values
        ## Histogram displays the distribution of data to evaluate skewness.
        ## Check for skewness and perform Box-Cox Transformation if necessary
        # transformed_data, _ = preprocessing.BoxCoxEndogTransformer(lmbda2=1e-6).fit_transform(self.data[[self.target_col]].dropna())
        plt.subplot(122) ## 1 row, 2 columns, 2nd subplot
        # plt.hist(transformed_data, bins=30, color='blue', edgecolor='k', alpha=0.7)
        # plt.hist(self.y, bins=90, color='blue', edgecolor='k', alpha=0.7)
        plt.hist(y_train, bins=60, color='steelblue', edgecolor='gray', alpha=0.7)
        # plt.title(f"Histogram of Transformed {self.target_col}")
        plt.title(f"Histogram of Train Data: {self.target_col} at {self.site}")
        plt.xlabel(f'{self.target_col} Value')
        plt.ylabel("Frequency")

        plt.tight_layout()
        plt.show()

    ## Step 2.2: Differencing the series to achieve stationarity
    def plot_differencing(self, series, max_diff_order=3):
        """
        ☑️ [2].1 Apply differencing to the time series to achieve stationarity and plot each differencing level in a grid format,
        and perform Dickey-Fuller tests to check stationarity after each differencing.
        This function plots the original series and its differenced versions up to the specified order, 
        along with the p-values from the Dickey-Fuller test to assess stationarity.

        Note: pmdarima's auto_arima will automatically determine the appropriate differencing term for you by default.
        
        Parameters:
        - series (pd.Series): Time series data.
        - max_diff_order (int): The maximum differencing order to apply.
        
        Outputs:
        - Plots of original and differenced series with Dickey-Fuller test results.
        """
        print(f'\n🎓 [2.2] Apply {max_diff_order} Order Differencing to the time series and visualizes results.')
        ordering = ['Original', '1st', '2nd', '3rd', '4th', '5th', '6th']
        plt.rcParams.update({'figure.figsize': (12, 8), 'figure.dpi': 100})
        num_plots = max_diff_order + 1
        num_rows = (num_plots + 1) // 2   ## Calculate rows needed for plots (2 plots per row)
        # fig, axes = plt.subplots(num_rows, 2, figsize=(12, 3 * num_rows), sharex=True, sharey=True)
        fig, axes = plt.subplots(num_rows, 2, figsize=(12, 3 * num_rows), sharex=False, sharey=True)
        axes = axes.flatten()  ## Flatten to simplify indexing
        
        for i in range(num_plots):
            ax = axes[i]       ## This handles both single and multiple rows of subplots 
            if i == 0:         ## Plot original series in the first plot
                current_series = series
                title = 'Original Series'
            else:              ## Generate and plot differenced series up to the max_diff_order
                current_series = series.diff(i).dropna()
                title = f'[{self.site}] {self.target_col}: {ordering[i]} Order Differencing'
            ax.plot(current_series, color='steelblue' if i == 0 else 'orange')  
            ax.set_title(f'[{self.site}] {self.target_col}: {ordering[i]} Order Differencing')
            ax.grid(True, which='both', linestyle='--', linewidth=0.5, color='lightgray')
            ax.tick_params(axis='x', rotation=45)  ## Rotate x-axis labels

            ## Perform Dickey-Fuller test on the current series and display p-value
            if not current_series.empty:
                df_result = adfuller(current_series)
                p_value_msg = f'p-value: {df_result[1]:.4f}'
                stationarity = " (Stationary: < 0.05)" if df_result[1] < 0.05 else " (Not stationary: >= 0.05)"
                ax.set_xlabel(p_value_msg + stationarity)
            else:
                ax.set_xlabel('No data available for testing')
            ## Move the xlabel to the top left
            ax.xaxis.label.set_color('steelblue')
            ax.xaxis.set_label_position('top')
            ax.xaxis.set_label_coords(0.5, 1.02)

        ## Adjust the number of plots and layout: If the total number of plots is odd
        ## Hide any unused axes if the number of plots is less than the number of subplot cells
        if num_plots < len(axes):
            for j in range(num_plots, len(axes)):
                fig.delaxes(axes[j])
                
        ## Adjust layout and display the plot
        plt.tight_layout() 
        plt.show()

        
    def plot_acf_ma_and_pacf_ar(self, series, max_diff_order=2, max_lags=15, use_pacf=False):
        """
        ☑️ [2].2. Use the ACF to suggest the orders of MA; and Use the PACF to suggest the orders of AR
        
        Calculate and plots the Autocorrelation Function (ACF) for the original time series and {max_diff_order} differenced versions 
        to suggest possible MA orders, including statistical tests for stationarity.
        
        This function uses the series that have been differenced and stored in a dictionary to plot their ACF. 
        This helps in identifying the order of the MA component by looking at the significant lags in the ACF plots.
        
        Parameters:
        - series (pd.Series): The time series data.
        - max_diff_order (int): Maximum number of differencing orders to plot.
        - max_lags (int): specifically focus on the first max_lags lags in the ACF plot.
    
        This function performs the following:
        - Differencing the series up to the specified order.
        - Plotting the ACF for each differenced series including the original series.
        - Performing the Dickey-Fuller test to annotate plots with p-values indicating stationarity.
        """
        # print(f'\n🎓 Plotting the ACF for the original and differenced series to suggest possible MA orders.')
        print(f'\n🎓 [2.3] Plotting the {"PACF" if use_pacf else "ACF"} to suggest possible {"AR" if use_pacf else "MA"} orders.')
        ordering = ['Original', '1st', '2nd', '3rd', '4th', '5th', '6th']

        ## Prepare the subplot grid based on the number of differencing levels
        plt.rcParams.update({'figure.figsize': (12, 8), 'figure.dpi': 100}) ## Adjust plot settings
        num_plots = max_diff_order + 1
        num_rows = (num_plots + 1) // 2   ## Calculate rows needed for plots (2 plots per row)
        fig, axes = plt.subplots(num_rows, 2, figsize=(12, 3 * num_rows), sharex=True, sharey=True)
        axes = axes.flatten()  ## Flatten to simplify indexing
               
        ## Iterate through each differencing level
        for i in range(num_plots):
            ax = axes[i]
            ## Handle the original series differently
            current_series = series if i == 0 else series.diff(i).dropna()
            title_suffix = 'Original Series' if i == 0 else f'{ordering[i]} Order Differencing'

            # ## Check for non-empty series to avoid plotting errors
            # if current_series.empty:
            #     ax.text(0.5, 0.5, 'No data available for plotting', horizontalalignment='center', verticalalignment='center',
            #             transform=ax.transAxes)
            #     ax.set_title('Empty Series')
            #     continue
    
            ## Perform ACF plotting: The alpha=0.05 setting specifies a 95% confidence level.
            lags = min(max_lags, len(current_series) - 1)
            # plot_acf(current_series, ax=ax, lags=lags, alpha=0.05, title=f'[{self.site}] {self.target_col}: ACF - {title_suffix}')
            plot_func = plot_pacf if use_pacf else plot_acf
            plot_func(current_series.head(lags), ax=ax, alpha=0.05, title=f'[{self.site}] {self.target_col}: ACF - {title_suffix}')
            ## Adding labels for x and y axes
            ax.set_xlabel('Lags')
            ax.set_ylabel(f'{"PACF" if use_pacf else "ACF"}')
            
            # plot_acf(current_series, ax=ax, lags=lags, title=f'[{self.site}] {self.target_col}: ACF - {ordering[i]} Order Differencing')
            ## Setting plot titles and labels
            ax.grid(True, which='both', linestyle='--', linewidth=0.5, color='lightgray')
            # ax.set_title(f'{title_suffix} - ADF p-value: {p_value:.4f} {stationarity_status}', fontsize=12)

            # ## Highlight significant lags manually if needed: marks significant lags based on ACF values outside the confidence bounds.
            # significance_limit = 1.96 / np.sqrt(len(current_series))
            # ax.axhline(y=significance_limit, linestyle='--', color='red', linewidth=0.7)
            # ax.axhline(y=-significance_limit, linestyle='--', color='red', linewidth=0.7)

            ## Perform Dickey-Fuller test & Adding stationarity test results for context
            df_result = adfuller(current_series)
            p_value_msg = f'p-value: {df_result[1]:.4f}'
            stationarity_status = ' (Stationary: < 0.05)' if df_result[1] < 0.05 else ' (Not stationary: >= 0.05)'
            ax.set_xlabel(p_value_msg + stationarity_status)
            ## Move the xlabel to the top left
            ax.xaxis.label.set_color('steelblue')
            ax.xaxis.set_label_position('top')
            ax.xaxis.set_label_coords(0.5, 1.02)

        ## Adjust the number of plots and layout: If the total number of plots is odd
        ## Hide any unused axes if the number of plots is less than the number of subplot cells
        if num_plots < len(axes):
            for j in range(num_plots, len(axes)):
                fig.delaxes(axes[j])
        
        ## Layout adjustment
        plt.tight_layout()
        plt.show()

        
    def timeseries_visualization(self, lags=15, max_diff_order=3):
        """
        ✅ Step 2: Visualizes the time series data with original time-series and histogram plots; also its ACF and PACF plots.
        Also considers Box-Cox transformation for variance stabilisationif the data is skewed.

        Original Series and Histogram plots; also ACF and PACF plots are now side by side for better comparative analysis.

        - ACF and PACF plots help identify the order of the AR and MA components for ARIMA modeling.
        - Histogram analysis after transformation helps understand data normality and variance stability.
        -
        - If the histogram of the transformed data is more symmetrical, a Box-Cox transformation is helpful.
        - Significant spikes in the ACF/PACF plots indicate potential AR or MA terms for ARIMA modeling.
        - Check if transformation parameters need adjustment based on skewness and kurtosis metrics.

        - [x] 2.1. Plot the original time series: visual sense of trends, seasonality, or irregular patterns.
        - [ ] 2.1. Plot the histogram of the series values: After applying a Box-Cox transformation, the histogram helps to check if the data distribution has become more symmetrical, which is often required for effective modeling in time series analysis.
        - [x] 2.2. Plot the Auto-Correlation Function (ACF): visualize how the time series data correlates with its lagged values, which is crucial for identifying the MA (Moving Average) components of an ARIMA model.
        - [x] 2.2. Plot the PACF: provides insights into the AR (AutoRegressive) components of the ARIMA model by isolating the influence of intervening lags.
        """
        print(f'✅ Step 2: Time Series Visualization')
        ## Visualize the time series and its ACF and PACF plots
        print(f'\n🎓 [2.1] [Default] Time Series Display ...')
        pm.tsdisplay(self.y_train, lag_max=lags, title=f'ACF plot and histogram for {self.target_col} at {self.site}', show=True)

        ## Row 1: Plotting the original series and its histogram
        self.plot_line_chart_and_histogram(self.y_train)
        
        ## Row 2: Perform stationarity checks to see if the given series is stationary. 
        ## To perform differencing and unit-root tests (Dickey-Fuller test), and others if necessary. 
        self.plot_differencing(self.train_data[self.target_col], max_diff_order=1)

        # df_result = adfuller(self.y)
        # p_value_msg = f'p-value: {df_result[1]:.4f}'
        # stationarity = " (Stationary)" if df_result[1] < 0.05 else " (Not stationary)"
        # print(f'🎓 ADF Statistic adf_test_result[0]: {df_result[0]}; and adf_test_result[1]: ' + p_value_msg + stationarity)

        ## Row 3: ACF and PACF Plots
        ## Use the ACF to suggest the orders of MA
        self.plot_acf_ma_and_pacf_ar(self.train_data[self.target_col], max_diff_order=max_diff_order, max_lags=lags, use_pacf=False)

        ## Use the PACF to suggest the orders of AR
        self.plot_acf_ma_and_pacf_ar(self.train_data[self.target_col], max_diff_order=max_diff_order, max_lags=lags, use_pacf=True)


    ## Function to apply transformation and display time series diagnostics
    def transform_data(data, transformer, lag_max=15):
        """
        ✅ Step 3: Apply transformation to the time series data and display diagnostics.
        
        Parameters:
        - data (np.array or pd.Series): The original time series data.
        - transformer (object): An instance of a transformer from pmdarima.preprocessing.
        - lag_max (int): Maximum lag to which the autocorrelation function is computed.
        
        Returns:
        - Transformed data
        """
        print(f'✅ Step 3: If data is skewed, transform the data (using a Log or Box-Cox transformation) to stabilise the variance.')
        
        ## Fit and transform the data
        transformed_data, _ = transformer.fit_transform(data)
    
        ## Display time series diagnostics
        tsdisplay(transformed_data, lag_max=lag_max, title=f"Transformed Data with {transformer.__class__.__name__}")
    
        return transformed_data


    def auto_arima_model(self, y_train):
        """
        ☑️ [4] Uses auto_arima from the pmdarima library to find the optimal/best ARIMA model.
        auto_arima can be quite memory-intensive, especially with large datasets

        ARIMA(p,d,q) model:
        - p is the order of the AR term: Use the PACF to suggest the orders of auto-regressive (AR)
        - d is the number of differencing required to make the time series stationary: 
          The degree of differencing, is typically only used in the case of non-stationary data
        - q is the order of the MA term: Use the ACF to suggest the orders of moving average (MA)
        
        Parameters:
        - train_data: A pandas Series containing the training data for the model.
        - order: A tuple representing the order of the ARIMA model (p, d, q).
        Returns:
        - model: The fitted ARIMA model.
        """
        if self.is_auto_arima:
            logging.info('🎓 [4.1] The ARIMA Models ...')
            # self.model = pm.auto_arima(y_train,
            model = pm.auto_arima(y_train,
                        d=None,
                        seasonal=False,
                        m=24,                   ## Daily seasonality; or m=168 for weekly seasonality, or m=8760 for annual seasonality.
                        stepwise=True,
                        max_p=None,
                        max_order=None,
                        suppress_warnings=True,
                        error_action="ignore",
                        trace=True)
            logging.debug(model.summary())
        else:
            logging.info('[ArimaTimeSeriesAnalyzer] Set is_auto_arima=true or use the predefined models.')

        return model

    def auto_sarima_model(self, y_train):
        """
        ☑️ [4] Uses auto_arima from the pmdarima library to find the optimal/best SARIMA model.
        auto_arima can be quite memory-intensive, especially with large datasets

        SARIMA(p,d,q)(P,D,Q)[S] where
        - p is the order of the seasonal AR term: i.e., the number of Lag observations
        - q is the order of the seasonal MA term: essentially the size of the “window” function over your time series data
        - Q is the order of seasonal differencing
        - S is the seasonal cycle such as 24 for 24 hours.
        
        Parameters:
        - train_data: A pandas Series containing the training data for the model.
        - order: A tuple representing the order of the ARIMA model (p, d, q).
        Returns:
        - model: The fitted SARIMA model.
        """
        if self.is_auto_arima:
            logging.info('🎓 [4.1] The ARIMA Models ...')
            try:
                print('🎓 [4.2] The SARIMA Models ... Note: Preliminary check for stationarity, such as an ADF test if needed!')
                # self.sarima_model = pm.auto_arima(y_train,
                #         seasonal=True, ## turn the seasonal to "True"
                #         m=24,          ## Daily seasonality; m=168 for weekly seasonality, or m=8760 for annual seasonality.
                #         ## Use a sane value for max_p, max_q, max_P and max_Q. You don’t need to search too high
                #         # max_p=3, max_q=3, max_P=3, max_Q=3, ## define the maximum order of the seasonal and non-seasonal AR and MA components
                # self.sarima_model = pm.auto_arima(y_train, 
                #                     d=None,                 ## Let the model determine 'd' (non-seasonal differencing)
                #                     seasonal=True,          ## Turn the seasonal to "True", Enable seasonal differencing
                #                     m=24,                   ## 24h Daily seasonality; or m=168 for weekly seasonality, or m=8760 for annual seasonality.
                #                     stepwise=True,          ## Use the stepwise algorithm to find the best model parameters
                #                     # max_p=None,
                #                     # max_order=None,
                #                     max_p=1,                ## Set the maximum 'p' (autoregressive terms) to 2
                #                     max_order=2,            ## Set the maximum sum of p, d, and q to 3
                #                     suppress_warnings=True,
                #                     error_action='ignore',  ## Change to 'ignore' | 'warn' to keep running despite errors | warning
                #                     trace=True)
                self.sarima_model = pm.auto_arima(
                    y_train,
                    d=None,                 ## Let the model determine 'd' (non-seasonal differencing)
                    seasonal=True,          ## Enable seasonal differencing
                    m=24,                   ## 24h Daily seasonality; adjust as needed for your data
                    stepwise=True,          ## Use stepwise algorithm to find the best model parameters
                    # start_p=0,              ## Start value for p
                    # start_q=0,              ## Start value for q
                    # start_P=0,              ## Start value for seasonal P
                    # start_Q=0,              ## Start value for seasonal Q
                    max_p=2,                ## Set the maximum p (autoregressive terms)
                    max_q=2,                ## Set the maximum q (moving average terms)
                    max_P=2,                ## Set the maximum seasonal P (autoregressive terms)
                    max_Q=2,                ## Set the maximum seasonal Q (moving average terms)
                    max_order=None,         ## Maximum order of p+q+P+Q+d+D
                    suppress_warnings=True,
                    error_action='ignore',  ## Change to 'ignore' to keep running despite errors
                    trace=True
                )
                                                  
                print(self.sarima_model.summary())
            except Exception as e:
                print(f"Failed to fit SARIMA model: {e}")
        else:
            logging.info('[ArimaTimeSeriesAnalyzer] Set is_auto_arima=true or use the predefined models.')


    def model_selection(self, models, train_data, test_data, preference='RMSE'):
        """
        ✅ Step 5: Compare multiple ARIMA and SARIMA models based on configurable preference metrics including RMSE, MAE, MAPE, AIC, and BIC.
        
        Args:
        - models (list of tuples): Each tuple contains the model description, and its orders & seasonal orders.
          * order (tuple): Non-seasonal ARIMA order (p, d, q).
          * seasonal_order (tuple, optional): Seasonal order (P, D, Q, m), if the model is seasonal.
        - train_data (pd.Series): The training data for the time series.
        - test_data (pd.Series): The testing data for evaluating model performance.
        - preference (str): Preferred metric for model evaluation.
        
        Returns:
        dict: The best model based on the preferred metric or an informative message if no model fits.
        """
        results = []
        best_model = None
        best_metric = float('inf')
        best_model_instance = None
    
        for name, order, seasonal_order in models:
            try:
                logging.info(f"🎓 Fitting the ARIMA/SARIMA {name} to the training data...")
                ## Initializing the model with or without seasonal order
                # if seasonal_order:
                #     model = pm.ARIMA(order=order, seasonal_order=seasonal_order, suppress_warnings=True)
                # else:
                #     model = pm.ARIMA(order=order, suppress_warnings=True)
                model = pm.ARIMA(order=order, seasonal_order=seasonal_order, suppress_warnings=True) if seasonal_order else pm.ARIMA(order=order, suppress_warnings=True)
                ## Fit the model to the training data
                model_fit = model.fit(train_data)
                ## Predict using the fitted model on the test data size
                predictions = model_fit.predict(n_periods=len(test_data))
                ## Calculate metrics (RMSE, MAE, and MAPE) for the predictions
                metrics = CommmonUtils.calculate_metrics(test_data, predictions)
                metrics.update({'AIC': model_fit.aic(), 'BIC': model_fit.bic()})
    
                ## Append results to a list for final display: metrics + Model's AIC/BIC
                results.append({**{'Model': name}, **metrics})
                current_metric = metrics[preference]
    
                ## Determine best model based on the best/lowest preferred metric:
                if current_metric < best_metric:
                    best_metric = current_metric
                    best_model = {**{'Model': name}, **metrics}
                    best_model_instance = model_fit
                    logging.debug(f"Updated model found: {best_model} with combined metric: {best_metric:.2f}")
    
            except Exception as e:
                logging.error(f"Error fitting model {name}: {e}")
    
        if not best_model:
            print("No valid models were fitted successfully.")
            return None
    
        ## Output results sorted by the preferred metric
        results.sort(key=lambda x: x[preference])
        for result in results:
            logging.info(f"Model: {result['Model']} - RMSE: {result['RMSE']:.2f}, MAE: {result['MAE']:.2f}, "
                  f"MAPE: {result['MAPE']:.2f}%, AIC: {result['AIC']:.2f}, BIC: {result['BIC']:.2f}")
    
        logging.info(f"\nThe best model based on {preference}: {best_model['Model']} with {preference}: {best_model[preference]:.2f}")
        
        # self.model = best_model_instance
        # logging.info(self.model.summary())
        return best_model_instance
       

    def multi_step_forecast(self, n_periods: int):
        prediction, conf_int = self.model.predict(n_periods=n_periods, return_conf_int=True)
        plt.figure(figsize=(12, 5))
        plt.plot(self.data.index[-100:], self.data[self.target_col].dropna().tail(100), label='Historical')
        plt.plot(pd.date_range(self.data.index[-1], periods=n_periods, freq='H'), prediction, label='Forecast')
        plt.fill_between(pd.date_range(self.data.index[-1], periods=n_periods, freq='H'),
                         conf_int[:, 0], conf_int[:, 1], color='pink', alpha=0.3, label='Confidence Interval')
        plt.legend()
        plt.show()

def plot_residuals(residuals, figsize=(12, 10), dpi=100, color_scheme='skyblue'):
    """
    Plot diagnostic charts for residuals of a time series model to check for normality,
    homoscedasticity, and absence of autocorrelation in residuals.
    
    Parameters:
    - residuals (pd.Series or np.array): Residual values from a time series model.
    - figsize (tuple): Figure size.
    - dpi (int): Dots per inch for the plot resolution.
    - color_scheme (str): Color for the plots.
    """
    ## Check if residuals is a method or a callable, and call it to get the data
    if callable(residuals):
        residuals = residuals()
    
    ## Ensure residuals are a numpy array for compatibility with plotting functions
    if not isinstance(residuals, np.ndarray):
        residuals = np.array(residuals)

    fig, axes = plt.subplots(2, 2, figsize=figsize, dpi=dpi)
    plt.subplots_adjust(hspace=0.4, wspace=0.3)
    
    ## Time series plot of residuals
    axes[0, 0].plot(residuals, color=color_scheme)
    axes[0, 0].set_title('Time-series plot of Residuals over-time')
    axes[0, 0].set_ylabel('Residuals')
    axes[0, 0].axhline(y=0, color='black', linestyle='--')
    
    ## ACF plot for residuals
    sm.graphics.tsa.plot_acf(residuals, ax=axes[0, 1], alpha=0.05, color=color_scheme)
    axes[0, 1].set_title('Autocorrelation/ACF for Residuals')

    ## Histogram with KDE for residuals
    sns.histplot(residuals, kde=True, ax=axes[1, 0], bins=30, color=color_scheme)
    axes[1, 0].set_title('Histogram with KDE for Residuals')
    
    ## Q-Q plot for checking normality
    qqplot(residuals, line='s', ax=axes[1, 1])
    axes[1, 1].set_title('Q-Q Plot for checking Normality')
    
    ## Statistical Summary
    # stats_text = f'Mean: {np.mean(residuals):.2f}\nSD: {np.std(residuals):.2f}\nSkew: {skew(residuals):.2f}\nKurtosis: {kurtosis(residuals):.2f}'
    mean, std_dev = np.mean(residuals), np.std(residuals)
    skewness, kurt = skew(residuals), kurtosis(residuals)
    stats_text = f'Mean: {mean:.2f}\nSD: {std_dev:.2f}\nSkew: {skewness:.2f}\nKurtosis: {kurt:.2f}'
    axes[1, 0].annotate(stats_text, xy=(0.05, 0.95), xycoords='axes fraction', 
                        horizontalalignment='left', verticalalignment='top', 
                        fontsize=10, bbox=dict(boxstyle="round4,pad=0.6", fc="white", ec="black", lw=1))

    plt.show()

def apply_auto_arima_all_sites(df1, df2, target_cols):
    """
    Apply Auto ARIMA Model Selection across multiple sites and pollutants.
    
    Parameters:
    - df1: DataFrame for Site 1
    - df2: DataFrame for Site 2
    - target_cols: List of target columns (pollutants) for analysis
    
    Returns:
    - models: Dictionary containing fitted models for each site and pollutant
    """
    ## Dictionary to store models for each site and target column
    models = {}

    ## Iterate over each pollutant
    for target_col in target_cols:
        ## Iterate over each site
        for df, site_name in zip([df1, df2], ["Site1-Penrose", "Site2-Takapuna"]):
            print(f'\n🔍 Applying Auto ARIMA Model Selection for {site_name} ({target_col})')

            ## Initialize the ARIMA analyzer
            analyzer = ArimaTimeSeriesAnalyzer(df, site_name, target_col, is_auto_arima=True)

            ## Prepare data by performing a train-test split
            analyzer.train_test_split(train_size=0.8)

            # ## Visualize time series and lag plots
            # analyzer.timeseries_visualization(lags=15, max_diff_order=1)

            ## Applying Auto ARIMA to find the best model
            model = analyzer.auto_arima_model(analyzer.y_train)

            ## Storing model results
            models[(site_name, target_col)] = model

            ## Print model summary
            if model:
                logging.info(f"✅ Best model for {site_name} ({target_col}): ARIMA{model.order}")
                logging.info(model.summary())
            else:
                logging.info('❌ Failed to determine a suitable model.')

    return models


target_cols = ['PM2.5', 'PM10']  # List of pollutants to analyze
data_penrose = data_series_df1.copy()  ## the DataFrame for Site1-Penrose
data_takapuna = data_series_df2.copy() ## the DataFrame for Site2-Takapuna

## Call the function with data and target columns
models = apply_auto_arima_all_sites(data_penrose, data_takapuna, target_cols)

2024-05-14 08:56:12,636 - INFO - 🎓 [4.1] The ARIMA Models ...

🔍 Applying Auto ARIMA Model Selection for Site1-Penrose (PM2.5)
✅ Step 1: Data Loading & Splitting 80/20
🎓 Training & Testing Data split (80/20): 13,900 train samples and 3,475 test samples.
Performing stepwise search to minimize aic
 ARIMA(2,1,2)(0,0,0)[0] intercept   : AIC=inf, Time=6.93 sec
 ARIMA(0,1,0)(0,0,0)[0] intercept   : AIC=85246.817, Time=0.18 sec
 ARIMA(1,1,0)(0,0,0)[0] intercept   : AIC=84200.017, Time=0.26 sec
 ARIMA(0,1,1)(0,0,0)[0] intercept   : AIC=83375.633, Time=0.65 sec
 ARIMA(0,1,0)(0,0,0)[0]             : AIC=85244.818, Time=0.09 sec
 ARIMA(1,1,1)(0,0,0)[0] intercept   : AIC=inf, Time=3.08 sec
 ARIMA(0,1,2)(0,0,0)[0] intercept   : AIC=82177.054, Time=2.68 sec
 ARIMA(1,1,2)(0,0,0)[0] intercept   : AIC=inf, Time=8.12 sec
 ARIMA(0,1,3)(0,0,0)[0] intercept   : AIC=81745.972, Time=4.33 sec
 ARIMA(1,1,3)(0,0,0)[0] intercept   : AIC=inf, Time=14.72 sec
 ARIMA(0,1,4)(0,0,0)[0] intercept   : AIC=81522.712, Time=7.43 sec
 ARIMA(1,1,4)(0,0,0)[0] intercept   : AIC=inf, Time=21.19 sec
 ARIMA(0,1,5)(0,0,0)[0] intercept   : AIC=inf, Time=15.27 sec
 ARIMA(1,1,5)(0,0,0)[0] intercept   : AIC=inf, Time=36.91 sec
 ARIMA(0,1,4)(0,0,0)[0]             : AIC=81520.713, Time=1.22 sec
 ARIMA(0,1,3)(0,0,0)[0]             : AIC=81743.973, Time=0.79 sec
 ARIMA(1,1,4)(0,0,0)[0]             : AIC=inf, Time=1.83 sec
 ARIMA(0,1,5)(0,0,0)[0]             : AIC=inf, Time=1.57 sec
 ARIMA(1,1,3)(0,0,0)[0]             : AIC=inf, Time=1.26 sec

2024-05-14 08:58:24,589 - INFO - ✅ Best model for Site1-Penrose (PM2.5): ARIMA(0, 1, 4)

 ARIMA(1,1,5)(0,0,0)[0]             : AIC=inf, Time=3.26 sec

Best model:  ARIMA(0,1,4)(0,0,0)[0]          
Total fit time: 131.811 seconds

2024-05-14 08:58:24,686 - INFO -                                SARIMAX Results                                
==============================================================================
Dep. Variable:                      y   No. Observations:                13900
Model:               SARIMAX(0, 1, 4)   Log Likelihood              -40755.357
Date:                Tue, 14 May 2024   AIC                          81520.713
Time:                        08:58:24   BIC                          81558.411
Sample:                    05-07-2020   HQIC                         81533.268
                         - 12-07-2021                                         
Covariance Type:                  opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ma.L1         -0.5052      0.004   -130.751      0.000      -0.513      -0.498
ma.L2         -0.2455      0.006    -44.035      0.000      -0.256      -0.235
ma.L3         -0.0998      0.009    -11.591      0.000      -0.117      -0.083
ma.L4         -0.1261      0.007    -18.049      0.000      -0.140      -0.112
sigma2        20.6235      0.069    297.137      0.000      20.487      20.760
===================================================================================
Ljung-Box (L1) (Q):                   1.26   Jarque-Bera (JB):            840496.71
Prob(Q):                              0.26   Prob(JB):                         0.00
Heteroskedasticity (H):               1.51   Skew:                             2.37
Prob(H) (two-sided):                  0.00   Kurtosis:                        40.80
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
2024-05-14 08:58:24,689 - INFO - 🎓 [4.1] The ARIMA Models ...

🔍 Applying Auto ARIMA Model Selection for Site2-Takapuna (PM2.5)
✅ Step 1: Data Loading & Splitting 80/20
🎓 Training & Testing Data split (80/20): 13,887 train samples and 3,472 test samples.
Performing stepwise search to minimize aic
 ARIMA(2,1,2)(0,0,0)[0] intercept   : AIC=inf, Time=24.16 sec
 ARIMA(0,1,0)(0,0,0)[0] intercept   : AIC=45019.870, Time=0.18 sec
 ARIMA(1,1,0)(0,0,0)[0] intercept   : AIC=40348.147, Time=0.44 sec
 ARIMA(0,1,1)(0,0,0)[0] intercept   : AIC=37707.821, Time=1.45 sec
 ARIMA(0,1,0)(0,0,0)[0]             : AIC=45017.871, Time=0.19 sec
 ARIMA(1,1,1)(0,0,0)[0] intercept   : AIC=37596.087, Time=2.16 sec
 ARIMA(2,1,1)(0,0,0)[0] intercept   : AIC=37596.544, Time=1.83 sec
 ARIMA(1,1,2)(0,0,0)[0] intercept   : AIC=inf, Time=10.19 sec
 ARIMA(0,1,2)(0,0,0)[0] intercept   : AIC=37595.386, Time=2.19 sec
 ARIMA(0,1,3)(0,0,0)[0] intercept   : AIC=37596.990, Time=4.48 sec
 ARIMA(1,1,3)(0,0,0)[0] intercept   : AIC=inf, Time=22.42 sec
 ARIMA(0,1,2)(0,0,0)[0]             : AIC=37593.386, Time=0.51 sec
 ARIMA(0,1,1)(0,0,0)[0]             : AIC=37705.822, Time=0.40 sec
 ARIMA(1,1,2)(0,0,0)[0]             : AIC=inf, Time=0.94 sec
 ARIMA(0,1,3)(0,0,0)[0]             : AIC=37594.991, Time=0.60 sec
 ARIMA(1,1,1)(0,0,0)[0]             : AIC=37594.087, Time=0.52 sec

2024-05-14 08:59:39,394 - INFO - ✅ Best model for Site2-Takapuna (PM2.5): ARIMA(0, 1, 2)
2024-05-14 08:59:39,489 - INFO -                                SARIMAX Results                                
==============================================================================
Dep. Variable:                      y   No. Observations:                13887
Model:               SARIMAX(0, 1, 2)   Log Likelihood              -18793.693
Date:                Tue, 14 May 2024   AIC                          37593.386
Time:                        08:59:39   BIC                          37616.002
Sample:                    05-07-2020   HQIC                         37600.919
                         - 12-07-2021                                         
Covariance Type:                  opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ma.L1          0.8354      0.002    401.874      0.000       0.831       0.839
ma.L2          0.0906      0.002     36.543      0.000       0.086       0.096
sigma2         0.8771      0.003    347.622      0.000       0.872       0.882
===================================================================================
Ljung-Box (L1) (Q):                   0.00   Jarque-Bera (JB):            792515.25
Prob(Q):                              0.95   Prob(JB):                         0.00
Heteroskedasticity (H):               1.14   Skew:                             0.35
Prob(H) (two-sided):                  0.00   Kurtosis:                        40.00
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
2024-05-14 08:59:39,491 - INFO - 🎓 [4.1] The ARIMA Models ...

 ARIMA(1,1,3)(0,0,0)[0]             : AIC=inf, Time=1.90 sec

Best model:  ARIMA(0,1,2)(0,0,0)[0]          
Total fit time: 74.588 seconds

🔍 Applying Auto ARIMA Model Selection for Site1-Penrose (PM10)
✅ Step 1: Data Loading & Splitting 80/20
🎓 Training & Testing Data split (80/20): 13,900 train samples and 3,475 test samples.
Performing stepwise search to minimize aic
 ARIMA(2,0,2)(0,0,0)[0]             : AIC=inf, Time=2.29 sec
 ARIMA(0,0,0)(0,0,0)[0]             : AIC=115738.588, Time=0.05 sec
 ARIMA(1,0,0)(0,0,0)[0]             : AIC=91231.210, Time=0.14 sec
 ARIMA(0,0,1)(0,0,0)[0]             : AIC=105427.979, Time=0.38 sec
 ARIMA(2,0,0)(0,0,0)[0]             : AIC=89279.111, Time=0.27 sec
 ARIMA(3,0,0)(0,0,0)[0]             : AIC=88829.508, Time=0.38 sec
 ARIMA(4,0,0)(0,0,0)[0]             : AIC=88717.131, Time=0.39 sec
 ARIMA(5,0,0)(0,0,0)[0]             : AIC=88648.671, Time=0.53 sec
 ARIMA(6,0,0)(0,0,0)[0]             : AIC=88583.442, Time=0.60 sec
 ARIMA(7,0,0)(0,0,0)[0]             : AIC=88521.984, Time=0.82 sec
 ARIMA(8,0,0)(0,0,0)[0]             : AIC=88480.799, Time=1.14 sec
 ARIMA(9,0,0)(0,0,0)[0]             : AIC=88449.433, Time=1.19 sec
 ARIMA(10,0,0)(0,0,0)[0]             : AIC=88413.667, Time=1.31 sec
 ARIMA(11,0,0)(0,0,0)[0]             : AIC=inf, Time=1.81 sec
 ARIMA(10,0,1)(0,0,0)[0]             : AIC=inf, Time=10.12 sec
 ARIMA(9,0,1)(0,0,0)[0]             : AIC=inf, Time=8.68 sec
 ARIMA(11,0,1)(0,0,0)[0]             : AIC=inf, Time=11.78 sec
 ARIMA(10,0,0)(0,0,0)[0] intercept   : AIC=87884.782, Time=8.19 sec
 ARIMA(9,0,0)(0,0,0)[0] intercept   : AIC=87885.059, Time=7.37 sec
 ARIMA(11,0,0)(0,0,0)[0] intercept   : AIC=87884.909, Time=11.73 sec
 ARIMA(10,0,1)(0,0,0)[0] intercept   : AIC=87886.886, Time=24.70 sec
 ARIMA(9,0,1)(0,0,0)[0] intercept   : AIC=87887.126, Time=22.25 sec

2024-05-14 09:02:20,067 - INFO - ✅ Best model for Site1-Penrose (PM10): ARIMA(10, 0, 0)

 ARIMA(11,0,1)(0,0,0)[0] intercept   : AIC=87887.532, Time=44.24 sec

Best model:  ARIMA(10,0,0)(0,0,0)[0] intercept
Total fit time: 160.375 seconds

def apply_auto_sarima_all_sites(df1, df2, target_cols):
    """
    Apply Auto SARIMA Model Selection across multiple sites and pollutants.
    
    Parameters:
    - df1: DataFrame for Site 1
    - df2: DataFrame for Site 2
    - target_cols: List of target columns (pollutants) for analysis
    
    Returns:
    - models: Dictionary containing fitted models for each site and pollutant
    """
    ## Dictionary to store models for each site and target column
    models = {}

    ## Iterate over each pollutant
    for target_col in target_cols:
        # Iterate over each site
        for df, site_name in zip([df1, df2], ["Site1-Penrose", "Site2-Takapuna"]):
            print(f'\n🔍 Applying Auto SARIMA Model Selection for {site_name} ({target_col})')

            ## Initialize the ARIMA analyzer
            analyzer = ArimaTimeSeriesAnalyzer(df, site_name, target_col, is_auto_arima=True)

            ## Prepare data by performing a train-test split
            analyzer.train_test_split(train_size=0.8)

            # ## Visualize time series and lag plots
            # analyzer.timeseries_visualization(lags=15, max_diff_order=1)

            ## Applying Auto ARIMA to find the best model
            model = analyzer.auto_sarima_model(analyzer.y_train)

            ## Storing model results
            models[(site_name, target_col)] = model

            ## Print model summary
            if model:
                logging.info(f"✅ Best model for {site_name} ({target_col}): SARIMA{model.order}")
                logging.info(model.summary())
            else:
                logging.info('❌ Failed to determine a suitable model.')

    return models


target_cols = ['PM2.5', 'PM10']  # List of pollutants to analyze
data_penrose = data_series_df1.copy()  ## the DataFrame for Site1-Penrose
data_takapuna = data_series_df2.copy() ## the DataFrame for Site2-Takapuna

## Call the function with data and target columns
models = apply_auto_sarima_all_sites(data_penrose, data_takapuna, target_cols)

2024-05-21 14:56:10,920 - INFO - 🎓 [4.1] The ARIMA Models ...

🔍 Applying Auto SARIMA Model Selection for Site1-Penrose (PM2.5)
✅ Step 1: Data Loading & Splitting 80/20
🎓 Training & Testing Data split (80/20): 13,900 train samples and 3,475 test samples.
🎓 [4.2] The SARIMA Models ... Note: Preliminary check for stationarity, such as an ADF test if needed!
Performing stepwise search to minimize aic
 ARIMA(2,1,2)(1,0,1)[24] intercept   : AIC=inf, Time=149.59 sec
 ARIMA(0,1,0)(0,0,0)[24] intercept   : AIC=85246.817, Time=0.24 sec
 ARIMA(1,1,0)(1,0,0)[24] intercept   : AIC=84058.923, Time=18.85 sec
 ARIMA(0,1,1)(0,0,1)[24] intercept   : AIC=83140.927, Time=17.42 sec
 ARIMA(0,1,0)(0,0,0)[24]             : AIC=85244.818, Time=0.12 sec
 ARIMA(0,1,1)(0,0,0)[24] intercept   : AIC=83375.633, Time=0.88 sec
 ARIMA(0,1,1)(1,0,1)[24] intercept   : AIC=inf, Time=104.13 sec
 ARIMA(0,1,1)(0,0,2)[24] intercept   : AIC=82960.450, Time=60.51 sec
 ARIMA(0,1,1)(1,0,2)[24] intercept   : AIC=inf, Time=211.91 sec
 ARIMA(0,1,0)(0,0,2)[24] intercept   : AIC=85106.953, Time=42.82 sec
 ARIMA(1,1,1)(0,0,2)[24] intercept   : AIC=inf, Time=136.34 sec
 ARIMA(0,1,2)(0,0,2)[24] intercept   : AIC=81772.007, Time=86.94 sec
 ARIMA(0,1,2)(0,0,1)[24] intercept   : AIC=81969.054, Time=34.85 sec
 ARIMA(0,1,2)(1,0,2)[24] intercept   : AIC=inf, Time=293.02 sec
 ARIMA(0,1,2)(1,0,1)[24] intercept   : AIC=inf, Time=116.07 sec
 ARIMA(1,1,2)(0,0,2)[24] intercept   : AIC=inf, Time=306.30 sec
 ARIMA(0,1,2)(0,0,2)[24]             : AIC=81770.007, Time=17.64 sec
 ARIMA(0,1,2)(0,0,1)[24]             : AIC=81967.055, Time=4.62 sec

from sklearn.metrics import mean_absolute_percentage_error

def one_step_forecast(analyzer, test):
    """
    Create multi-step forecasts and plots them alongside the test data.
    """
    test_len = len(test)
    ## Create predictions for the future, evaluate on test
    fcast, conf_int = analyzer.model.predict(n_periods=test_len, return_conf_int=True, alpha=0.05)
    # fcast = analyzer.model.predict(n_periods=test_len, return_conf_int=True, alpha=0.05)
    # forecasts = fcast[0]
    # confidence_intervals = fcast[1]
    # confidence_intervals = np.asarray(fcast[1])

    ## Plotting
    plot_it(analyzer.train_data[analyzer.target_col], analyzer.test_data[analyzer.target_col], fcast, conf_int)

def one_period_forecast(model, n_periods=168):
    """
    Forecast multiple periods ahead with confidence intervals.

    Parameters:
    model (ARIMA): The ARIMA model used for forecasting.
    n_periods (int): Number of periods to forecast.

    Returns:
    tuple: A tuple containing forecasts and their confidence intervals.
    """
    ## Create predictions for the future, evaluate on test
    fcast, conf_int = model.predict(n_periods=n_periods, return_conf_int=True, alpha=0.05)
    return fcast.tolist(), np.asarray(conf_int).tolist()
    
    # fcast = model.predict(n_periods=n_periods, return_conf_int=True, alpha=0.05)
    ## fcast is a list of two lists.
    ## The first list is the forecast
    # forecasts = fcast[0].tolist()
    ## The second list is the confidence interval
    # confidence_intervals = fcast[1]
    # return ( forecasts, 
    #          np.asarray(confidence_intervals).tolist()[0])
    
def multi_step_forecast(analyzer, test, period='week'):
    """
    Generate multi-step forecasts and update the model with actual observations for each period/forecast.
    Also plots them alongside the test data.

    Parameters:
    - analyzer (TimeSeriesAnalyzer): An instance containing the ARIMA model and data.
    - test (pd.Series): Test data for the target column.
    - period (str): The period for grouping test data, 'week' for hourly data equals 168 hours.

    Uses dynamic forecasting where the model is updated after each new observation.
    Prints progress and debugging information.
    """
    ## Initialize lists to hold predictions and confidence intervals
    forecasts = []
    confidence_intervals = []

    ## Calculate the number of periods based on the test data length and the defined period length: One week in hours = 24 * 7
    # period_length = 168 if period == 'week' else len(test)  
    ## One week in hours = 24 * 7 = 168; also 24 hours per day
    period_length = 168 if period == 'week' else 24 if period == 'day' else len(test)

    ## Generate predictions and update the model for each period: Loop over the test data in chunks
    for start_idx in range(0, len(test), period_length):
        end_idx = min(start_idx + period_length, len(test))
        current_test = test.iloc[start_idx:end_idx]
        
        ## Perform forecasting for the current period
        fc, conf = one_period_forecast(analyzer.model, n_periods=len(current_test))
        forecasts.extend(fc)
        confidence_intervals.extend(conf)
        # confidence_intervals.append(conf)

        ## Adaptive Forecasting: Update the model with actual observations from the current period
        analyzer.model.update(current_test.values)
        logging.debug(f'Iteration from {start_idx} to {end_idx}: Model updated with actual data with Forecast={fc}, CI={conf}.')

    ## Plot the results: Plotting the forecasts against the actual data
    plot_it(analyzer.train_data[analyzer.target_col], test, forecasts, np.array(confidence_intervals))
 
    ## Plotting the forecasts against the actual data
    # plot_it(analyzer.train_data[analyzer.target_col], analyzer.test_data[analyzer.target_col], fcast, conf_int)
    # plot_it(analyzer.train_data[analyzer.target_col], analyzer.test_data[analyzer.target_col], np.array(forecasts), np.array(confidence_intervals))
    
    
# @staticmethod
def plot_it(train, test, forecasts, confidence_intervals):
    """
    Plot training data, test data, forecasts, and confidence intervals.
    
    Parameters:
    - train (pd.Series): Training data series.
    - test (pd.Series): Test data series.
    - forecasts (list): List of forecasted values.
    - confidence_intervals (array): Array of confidence interval bounds.
    """
    fig, ax = plt.subplots(figsize=(18, 10))

    ## Plot training data
    ax.plot(train.index, train, color='blue', label='Training Data')

    ## Plot actual test data
    ax.plot(test.index, test, color='green', label='Actual Test Data')
    
    ## Plotting Confidence Intervals
    # lower_bounds = [ci[0] for ci in confidence_intervals]
    # upper_bounds = [ci[1] for ci in confidence_intervals]
    ax.fill_between(test.index, confidence_intervals[:, 0], confidence_intervals[:, 1], color='orange', alpha=0.3, label='95% Confidence Interval')

    ## Plot forecasts
    ax.plot(test.index, forecasts, color='red', label='Forecasted Data', marker='o')
    
    ax.set_title(f'Multi-Step Forecast Evaluation: Forecast vs Actuals')
    ax.set_xlabel('Date')
    ax.set_ylabel(f'{train.name}')
    ax.legend()
    plt.grid(True)
    plt.show()

    ## Evaluate and print forecast accuracy
    mape = mean_absolute_percentage_error(test, forecasts)
    # print(f"MAPE: {mape:.2%}")
    # ## Evaluate and print forecast accuracy 
    # mape = mean_absolute_percentage_error(analyzer.test_data, forecasts)
    logging.info(f"MAPE for the forecast: {mape:.2%}")

def Time_Series_EDA(data: pd.DataFrame, site: str, target_col: str, train_size: float = 0.8, is_auto_arima=False, models=None):
    """
    Conduct Time Series EDA, Model Selection, and Forecasting.
    Utilize the ArimaTimeSeriesAnalyzer class to apply analysis to any site and target pollutant.
    """
    analyzer = ArimaTimeSeriesAnalyzer(data, site, target_col, is_auto_arima=is_auto_arima)
    
    logging.info(f'✅ Step 1: Preparing Data for {site} ({target_col})')
    analyzer.train_test_split(train_size)
    
    logging.info(f'✅ Step 2: Visualizing Time Series and Lag Plots for {site} ({target_col})')
    analyzer.timeseries_visualization(lags=15, max_diff_order=1)

    logging.info(f'✅ Step 2*: If data is skewed, transform the data (using a Log or Box-Cox transformation) to stabilise the variance.')

    if is_auto_arima:
        logging.info(f'✅ Step 3: Applying Auto ARIMA Model Selection for {site} ({target_col})')
        analyzer.auto_arima_model(analyzer.y_train)
    else:
        logging.info(f'✅ Step 4: Selecting the Best Model Based on AIC')
        analyzer.model = analyzer.model_selection(models, analyzer.y_train, analyzer.y_test, preference='AIC')

    if analyzer.model:
        logging.info(f"Best model based on AIC = {analyzer.model.aic():.2f} \n {analyzer.model.summary()}")
    else:
        logging.info(f"Best model based on AIC = {analyzer.model.aic():.2f} \n {analyzer.model.summary()} ")

    logging.info(f'✅ Step 5: Checking Model Diagnostics: Generates and Displays Diagnostic Plots for the Fitted Model.')
    logging.info(f'🎓 5.1. [Default] 4 Model Diagnostics plots: The Standardized Residual, Histogram plus KDE estimate (N(0,1): the normal distribution), Normal Q-Q, and the Correlogram. ...')
    # analyzer.model.plot_diagnostics(figsize=(15,12))
    analyzer.model.plot_diagnostics(figsize=(10, 8))
    plt.show()
    
    logging.info(f'🎓 5.2. [Manual] Check the Residuals by plotting the ACF of the Residuals ...')
    if hasattr(analyzer.model, 'resid'):
        residuals = analyzer.model.resid  ## Get the residuals from an instance of a fitted ARIMA model
        plot_residuals(residuals)         ## Plotting the residuals 

    logging.info(f'✅ Step 6: Conducting One-Step Forecast')
    one_step_forecast(analyzer, analyzer.test_data[analyzer.target_col])

    logging.info(f'✅ Step 7.1: Conducting Multi-Step Forecast for Next Weeks')
    multi_step_forecast(analyzer, analyzer.test_data[analyzer.target_col], period='week')
    logging.info(f'✅ Step 7.2: Conducting Multi-Step Forecast for Next Days')
    multi_step_forecast(analyzer, analyzer.test_data[analyzer.target_col], period='day')

    return analyzer

## ✅ Best ARIMA model for Site1-Penrose (PM2.5): ARIMA(0, 1, 4)
## ✅ Best ARIMA model for Site2-Takapuna (PM2.5): ARIMA(0, 1, 2)
## ✅ Best ARIMA model for Site1-Penrose (PM10): ARIMA(10, 0, 0)
## ✅ Best ARIMA model for Site2-Takapuna (PM10): ARIMA(2, 0, 3)
models_penrose_pm25  = [('ARIMA(0,1,4)', (0, 1, 4), None)]
models_penrose_pm10  = [('ARIMA(0,1,2)', (0, 1, 2), None)]
models_takapuna_pm25 = [('ARIMA(10,0,0)', (10, 0, 0), None)]
models_takapuna_pm10 = [('ARIMA(2,0,3)', (2, 0, 3), None)]

## Penrose Time Series EDA
logging.info(f"📈📊 Time Series EDA: Utilize the ArimaTimeSeriesAnalyzer class to apply analysis to 'Penrose' site and 'PM2.5' target pollutant. ")
analyzer_penrose_pm25 = Time_Series_EDA(data_series_df1, 'Penrose', 'PM2.5', is_auto_arima=False, models=models_penrose_pm25)
logging.info(f"📈📊 Time Series EDA: Utilize the ArimaTimeSeriesAnalyzer class to apply analysis to 'Penrose' site and 'PM2.5' target pollutant. ")
analyzer_penrose_pm10 = Time_Series_EDA(data_series_df1, 'Penrose', 'PM10', is_auto_arima=False, models=models_penrose_pm10)

## Takapuna Time Series EDA
logging.info(f"📈📊 Time Series EDA: Utilize the ArimaTimeSeriesAnalyzer class to apply analysis to 'Penrose' site and 'PM2.5' target pollutant. ")
analyzer_takapuna_pm25 = Time_Series_EDA(data_series_df2, 'Takapuna', 'PM2.5', is_auto_arima=False, models=models_takapuna_pm25)
logging.info(f"📈📊 Time Series EDA: Utilize the ArimaTimeSeriesAnalyzer class to apply analysis to 'Penrose' site and 'PM2.5' target pollutant. ")
analyzer_takapuna_pm10 = Time_Series_EDA(data_series_df2, 'Takapuna', 'PM10', is_auto_arima=False, models=models_takapuna_pm10)

2024-05-14 09:13:48,120 - INFO - 📈📊 Time Series EDA: Utilize the ArimaTimeSeriesAnalyzer class to apply analysis to 'Penrose' site and 'PM2.5' target pollutant. 
2024-05-14 09:13:48,125 - INFO - ✅ Step 1: Preparing Data for Penrose (PM2.5)
2024-05-14 09:13:48,128 - INFO - ✅ Step 2: Visualizing Time Series and Lag Plots for Penrose (PM2.5)

✅ Step 1: Data Loading & Splitting 80/20
🎓 Training & Testing Data split (80/20): 13,900 train samples and 3,475 test samples.
✅ Step 2: Time Series Visualization

🎓 [2.1] [Default] Time Series Display ...

🎓 [2.1] Plot the original time series: visual sense of trends, seasonality, or irregular patterns.

🎓 [2.2] Apply 1 Order Differencing to the time series and visualizes results.

🎓 [2.3] Plotting the ACF to suggest possible MA orders.

🎓 [2.3] Plotting the PACF to suggest possible AR orders.

2024-05-14 09:13:55,786 - INFO - ✅ Step 2*: If data is skewed, transform the data (using a Log or Box-Cox transformation) to stabilise the variance.
2024-05-14 09:13:55,788 - INFO - ✅ Step 4: Selecting the Best Model Based on AIC
2024-05-14 09:13:55,806 - INFO - 🎓 Fitting the ARIMA/SARIMA ARIMA(0,1,4) to the training data...
2024-05-14 09:14:06,662 - INFO - Model: ARIMA(0,1,4) - RMSE: 4.58, MAE: 3.53, MAPE: inf%, AIC: 81522.71, BIC: 81567.95
2024-05-14 09:14:06,667 - INFO - 
The best model based on AIC: ARIMA(0,1,4) with AIC: 81522.71
2024-05-14 09:14:06,811 - INFO - Best model based on AIC = 81522.71 
                                SARIMAX Results                                
==============================================================================
Dep. Variable:                      y   No. Observations:                13900
Model:               SARIMAX(0, 1, 4)   Log Likelihood              -40755.356
Date:                Tue, 14 May 2024   AIC                          81522.712
Time:                        09:14:06   BIC                          81567.949
Sample:                    05-07-2020   HQIC                         81537.777
                         - 12-07-2021                                         
Covariance Type:                  opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
intercept  -3.539e-05      0.001     -0.036      0.972      -0.002       0.002
ma.L1         -0.5052      0.004   -130.166      0.000      -0.513      -0.498
ma.L2         -0.2455      0.006    -43.965      0.000      -0.256      -0.235
ma.L3         -0.0997      0.009    -11.585      0.000      -0.117      -0.083
ma.L4         -0.1261      0.007    -18.013      0.000      -0.140      -0.112
sigma2        20.6245      0.071    288.519      0.000      20.484      20.765
===================================================================================
Ljung-Box (L1) (Q):                   1.27   Jarque-Bera (JB):            840531.64
Prob(Q):                              0.26   Prob(JB):                         0.00
Heteroskedasticity (H):               1.51   Skew:                             2.37
Prob(H) (two-sided):                  0.00   Kurtosis:                        40.80
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
2024-05-14 09:14:06,826 - INFO - ✅ Step 5: Checking Model Diagnostics: Generates and Displays Diagnostic Plots for the Fitted Model.
2024-05-14 09:14:06,826 - INFO - 🎓 5.1. [Default] 4 Model Diagnostics plots: The Standardized Residual, Histogram plus KDE estimate (N(0,1): the normal distribution), Normal Q-Q, and the Correlogram. ...

2024-05-14 09:14:07,340 - INFO - 🎓 5.2. [Manual] Check the Residuals by plotting the ACF of the Residuals ...

2024-05-14 09:14:07,938 - INFO - ✅ Step 6: Conducting One-Step Forecast

2024-05-14 09:14:08,251 - INFO - MAPE for the forecast: 8479870924121942.00%
2024-05-14 09:14:08,251 - INFO - ✅ Step 7.1: Conducting Multi-Step Forecast for Next Weeks

## Import Prophet class from prophet (fbprophet) library
from prophet import Prophet
from prophet.plot import add_changepoints_to_plot

class AdvancedTimeSeriesForecastingProphet:
    
    def __init__(self, training_data, testing_data, target_variable, changepoint_prior_scale=0.01, interval_width=0.95, daily_seasonality=True, weekly_seasonality=True, yearly_seasonality=True, include_holidays=True, country_code='NZ'):
        """
        Initializes the class with training and testing data, target variable, and configurations for the Prophet model.
        Ensures data includes 'ds' and target_variable correctly named 'y'.

        Args:
            training_data (DataFrame): The training dataset with 'ds' and target_variable columns.
            testing_data (DataFrame): The testing dataset with 'ds' and target_variable columns.
            target_variable (str): The name of the column to forecast.
            changepoint_prior_scale (float): The flexibility of the automatic changepoint selection.
            daily_seasonality (bool): Whether to model daily seasonality.
            yearly_seasonality (bool): Whether to model yearly seasonality.
            interval_width (float): Uncertainty interval width. Set the uncertainty interval to 95% (the Prophet default is 80%)
            
            periods (int): Number of periods to forecast.
            freq (str): Frequency of the forecast (e.g., 'D' for daily).
            include_holidays (bool): If True, includes holiday effects based on the specified country.
            country_code (str): The ISO country code ('US') for including holidays.

        Returns:
            model (Prophet object): The fitted Prophet model.        
        """
        # self.validate_data(training_data, testing_data, target_variable)
        # self.training_data = training_data.rename(columns={target_variable: 'y', 'ds': 'ds'})
        # self.testing_data = testing_data.rename(columns={target_variable: 'y', 'ds': 'ds'})
        self.training_data = training_data.reset_index()[['ds', target_variable]].rename(columns={target_variable: 'y'}) ## target_variable --> 'y'
        self.testing_data = testing_data.reset_index()[['ds', target_variable]].rename(columns={target_variable: 'y'})
        self.target_variable = target_variable
        ## Initialize and configure the model in Prophet with the default parameters
        ## Set the uncertainty interval to 95% (the Prophet default is 80%)
        # self.model = Prophet(n_changepoints=20, yearly_seasonality=True, changepoint_prior_scale=0.001)
        # self.model = Prophet(changepoint_prior_scale=0.004, seasonality_prior_scale=20, seasonality_mode='multiplicative', changepoint_range=0.9)
        self.model = Prophet(
            changepoint_prior_scale=changepoint_prior_scale, 
            daily_seasonality=daily_seasonality, 
            weekly_seasonality=weekly_seasonality,
            yearly_seasonality=yearly_seasonality,
            interval_width=interval_width)
        if include_holidays:
            self.model.add_country_holidays(country_name=country_code)  ## 'US' | 'NZ'
    
        self.forecast_df = None


    # def validate_data(self, training_data, testing_data, target_variable):
    #     """
    #     Validates the input data to ensure required columns are present.
    #     """
    #     required_columns = ['ds', target_variable]
    #     for data in [training_data, testing_data]:
    #         if not all(col in data.columns for col in required_columns):
    #             raise ValueError(f"Data must include the following columns: {required_columns}")

    @staticmethod
    def train_test_split(data, target_col='PM2.5', train_size=0.8):
        """
        ☑️ [1] Train-Test split: Splits the data into training and testing sets.
        Use 80% as the in-time training data and the rest 20% as the out-of-time test data.
        :param train_size: float - proportion of the dataset to include in the train split.
        """
        train_len  = int(len(data) * train_size)
        train_data = data.iloc[:train_len]
        test_data  = data.iloc[train_len:]
        print(f'🎓 [{target_col}] Training & Testing Data split (80/20): {len(train_data):,} train samples and {len(test_data):,} test samples.')
        # y = data[target_col]           ## Compatible with Prophet: ds & y columns
        y_train = train_data[target_col]
        y_test  = test_data[target_col]

        return y_train, y_test
    
    
    ## Create a future dataframe for predictions, 24 hours a day | 168 hours a week into the future
    ## freq='MS' | 'D' | 'H'
    def fit(self, periods=7*24, freq='H'):
        """
        Fits the Prophet model using the training data.
        Generates predictions over the forecast horizon.

        Returns:
            forecast (DataFrame): The forecast results from Prophet.
            periods: We need to specify the number of days in future
        """
        ## Fit the model to the data
        self.model.fit(self.training_data)

        ## Making Future Predictions
        future = self.model.make_future_dataframe(periods=len(self.testing_data), freq=freq)
        ## Predicting the next quarter 120 days | 365 days (1 year) | period = 24 for daily seasonality; periods=24*7, freq='H')
        # future = self.model.make_future_dataframe(periods=periods, freq=freq)  
        
        self.forecast_df = self.model.predict(future)

        ## DEBUG
        ## Display the most critical output columns from the forecast
        # self.forecast_df[['ds','yhat','yhat_lower','yhat_upper']].tail()
        
        return self.forecast_df


    def validate(self, site_name='Penrose'):
        """
        Plots the forecasted values with uncertainty intervals alongside the actual testing data. 
        
        Validates the Forecasting Model by plotting the forecast with confidence intervals (highlights changepoints and the training data end) against actual values for a specified site. 
        Also shows components of the forecast, calculates MAE.
        """
        if self.forecast_df is None:
            raise ValueError("Forecast data is not available for validation.")

        ## DEBUG
        # sns.boxplot(x=self.training_data.ds, y=self.training_data.y)

        ## Extract the predicted and actual values for the visualization
        ## Ensure that forecasted data and actual data share the same index for proper comparison
        # forecast = self.forecast_df.set_index('ds')['yhat']
        forecast = self.forecast_df
        # actual = self.testing_data.set_index('ds')['y']
        actual_testing_data = self.testing_data
        last_training_date = self.training_data['ds'].max()

        ## Calculate Mean Absolute Error (MAE) only for overlapping dates
        # valid_indices = ~forecasted.isna()
        # mae = mean_absolute_error(actual[valid_indices], forecasted[valid_indices])
        # print(f"Mean Absolute Error (MAE): {mae:.2f}")

        ## [Plot 1] Plot the forecast: Shows the actual testing values, forecasted values from the fitted Model with the uncertainty intervals.
        # forecast_plot = self.model.plot(forecast, uncertainty=False, plot_cap=True)
        forecast_plot = self.model.plot(forecast, uncertainty=True)
        
        ## Adding changepoints and training end visual markers
        a = add_changepoints_to_plot(forecast_plot.gca(), self.model, forecast, cp_color='red')
        ## add a vertical line at the end of the training period
        axes = forecast_plot.gca()
        # last_training_date = pd.to_datetime('2021-12-31')
        axes.axvline(x=last_training_date, color='darkred', linestyle='solid', label='Training End') ## linestyle='--' | 'solid'
        
        ## plot true test data for the period after the red line
        actual_testing_data['ds'] = pd.to_datetime(actual_testing_data['ds'])
        plt.plot(actual_testing_data['ds'], actual_testing_data['y'],'ro', markersize=0.2, label='Actual Test Data')
        # actual = actual_testing_data.set_index('ds')['y']

        axes.set_title(f'[Prophet][{site_name}] Forecasting vs Actual Testing Data for {self.target_variable}')
        axes.set_xlabel('Date (ds)')
        axes.set_ylabel(f'{self.target_variable} (y)')
        axes.legend()
        axes.grid(True)
        plt.show()

        ## [Plot 2] Decomposes the forecast into trend, yearly seasonality, and weekly seasonality components.
        if hasattr(self.model, 'plot_components'):
            # fig2 = self.model.plot_components(self.forecast_df, uncertainty=False, plot_cap=True).show()
            fig2 = self.model.plot_components(self.forecast_df, uncertainty=True).show()
        
        ## Future Work:
        ## Change points: By default, Prophet automatically detects change points in the data. You can manually adjust their sensitivity or specify them.
        ## Seasonalities: Apart from built-in daily, weekly, and yearly seasonalities, you can add custom seasonalities (e.g., monthly).
        ## Holidays: You can add holidays and special events to improve the model’s accuracy on irregular occurrences.

## DataFrame must have 'Site', 'ds', and other columns like 'PM2.5'
## Initialize instances for Penrose and Takapuna for PM2.5 prediction using Meta/Facebook Prophet

## Prepare data by performing a train-test split
training_data1, testing_data1 = AdvancedTimeSeriesForecastingProphet.train_test_split(data=data_series_df1, target_col='PM2.5', train_size=0.8)
forecasting_prophet_penrose   = AdvancedTimeSeriesForecastingProphet(training_data=training_data1, testing_data=testing_data1, target_variable='PM2.5')

## Fitting models
logging.info("🛠️ Fitting Prophet models for Penrose ...")
forecast = forecasting_prophet_penrose.fit()

## Validating models
logging.info("🔬 Validating Prophet models for Penrose ...")
forecasting_prophet_penrose.validate(site_name='Penrose')

2024-05-14 11:44:15,004 - INFO - 🛠️ Fitting Prophet models for Penrose ...

🎓 [PM2.5] Training & Testing Data split (80/20): 13,900 train samples and 3,475 test samples.

2024-05-14 11:44:18,105 - INFO - 🔬 Validating Prophet models for Penrose ...

## Prepare data by performing a train-test split
training_data2, testing_data2 = AdvancedTimeSeriesForecastingProphet.train_test_split(data=data_series_df2, target_col='PM2.5', train_size=0.8)
forecasting_prophet_takapuna  = AdvancedTimeSeriesForecastingProphet(training_data2, testing_data2, 'PM2.5')

## Fitting models
logging.info("🛠️ Fitting Prophet models for Takapuna ...")
forecasting_prophet_takapuna.fit()

logging.info("🔬 Validating Prophet models for Takapuna ...")
forecasting_prophet_takapuna.validate(site_name='Takapuna')

2024-05-14 11:44:26,085 - INFO - 🛠️ Fitting Prophet models for Takapuna ...

🎓 [PM2.5] Training & Testing Data split (80/20): 13,887 train samples and 3,472 test samples.

2024-05-14 11:44:29,337 - INFO - 🔬 Validating Prophet models for Takapuna ...

## DataFrame must have 'Site', 'ds', and other columns like 'PM10'
## Creating instances for Penrose and Takapuna for PM10 prediction
training_data1, testing_data1 = AdvancedTimeSeriesForecastingProphet.train_test_split(data=data_series_df1, target_col='PM10', train_size=0.8)
forecasting_prophet_penrose   = AdvancedTimeSeriesForecastingProphet(training_data=training_data1, testing_data=testing_data1, target_variable='PM10')

## DEBUG
# forecasting_prophet_penrose.forecast_df[['ds','yhat','yhat_lower','yhat_upper']].head()

## Fitting models
logging.info("🛠️ Fitting Prophet models for Penrose ...")
forecasting_prophet_penrose.fit()

## Validating models
logging.info("🔬 Validating Prophet models for Penrose ...")
forecasting_prophet_penrose.validate(site_name='Penrose')

2024-05-14 11:48:21,695 - INFO - 🛠️ Fitting Prophet models for Penrose ...

🎓 [PM10] Training & Testing Data split (80/20): 13,900 train samples and 3,475 test samples.

2024-05-14 11:48:25,018 - INFO - 🔬 Validating Prophet models for Penrose ...

## DataFrame must have 'Site', 'ds', and other columns like 'PM10'
## Creating instances for Penrose and Takapuna for PM10 prediction
training_data2, testing_data2 = AdvancedTimeSeriesForecastingProphet.train_test_split(data=data_series_df2, target_col='PM10', train_size=0.8)
forecasting_prophet_takapuna  = AdvancedTimeSeriesForecastingProphet(training_data=training_data2, testing_data=testing_data2, target_variable='PM10')

## Fitting models
logging.info("🛠️ Fitting Prophet models for Takapuna ...")
forecasting_prophet_takapuna.fit()

## Validating models
logging.info("🔬 Validating Prophet models for Takapuna ...")
forecasting_prophet_takapuna.validate(site_name='Takapuna')

2024-05-14 11:48:32,781 - INFO - 🛠️ Fitting Prophet models for Takapuna ...

🎓 [PM10] Training & Testing Data split (80/20): 13,887 train samples and 3,472 test samples.

2024-05-14 11:48:36,309 - INFO - 🔬 Validating Prophet models for Takapuna ...

def engineer_features(rawdata):
    """
    Adds time-based features to the dataframe to enhance the analysis capabilities, adjusting for the multi-level index.
    This includes extracting various temporal components and creating lag features for PM2.5 and PM10 variables to analyze time-based dependencies.
    """
    ## Extracting date-time components
    # timestamp_index = rawdata.index.get_level_values('Timestamp') ## 'Timestamp' <-- 'ds'
    timestamp_index = rawdata['Timestamp']
    # rawdata['Hour']       = timestamp_index.dt.hour
    rawdata['Day']        = timestamp_index.dt.day
    rawdata['DayOfWeek']  = timestamp_index.dt.dayofweek
    rawdata['Month']      = timestamp_index.dt.month
    rawdata['Quarter']    = timestamp_index.dt.quarter
    rawdata['Year']       = timestamp_index.dt.year

    # rawdata['DayOfYear']  = timestamp_index.dt.dayofyear
    ## Extract week of year for each timestamp
    rawdata['WeekOfYear'] = [d.isocalendar()[1] for d in timestamp_index]
    
    ## Calculating the season based on the month --> accurately reflects the local climate and seasonal cycles
    ## Season encoding: 1 (Summer): December, January, February; 2 (Autumn): March, April, May
    ##                  3 (Winter): June, July, August         ; 4 (Spring): September, October, November
    ## Adjusted for meteorological seasons in Auckland, New Zealand:  
    # rawdata['Season'] = (((timestamp_index.dt.month % 12) + 1) // 3) % 4 + 1 
    ## TODO: Southern Hemisphere like New Zealand and Australia vs Northern Hemisphere like England
    ## Correctly mapping the month to meteorological seasons for Southern Hemisphere (Auckland)
    rawdata['Season'] = rawdata['Month'].apply(
        lambda x: 1 if 9 <= x <= 11 else       ## Spring: Sep, Oct, Nov
                  2 if 12 <= x or x <= 2 else  ## Summer: Dec, Jan, Feb
                  3 if 3 <= x <= 5 else        ## Autumn: Mar, Apr, May
                  4                            ## Winter: Jun, Jul, Aug
    )

    ## Adding lag features for PM2.5 and PM10 to capture previous time steps' influence --> to compare the correlation with the other variables.
    rawdata['PM2.5_Lag1'] = rawdata.groupby('Site')['PM2.5'].shift(1)
    rawdata['PM2.5_Lag2'] = rawdata.groupby('Site')['PM2.5'].shift(2)
    rawdata['PM10_Lag1']  = rawdata.groupby('Site')['PM10'].shift(1)
    rawdata['PM10_Lag2']  = rawdata.groupby('Site')['PM10'].shift(2)
    ## Fill NaN with the next value in the column (backward fill)
    rawdata.fillna(method='bfill', inplace=True)
    ## Alternatively, fill NaN with the previous value in the column (forward fill)
    rawdata.fillna(method='ffill', inplace=True)
    
    return rawdata

## Apply the cleaning and querying function
# cleaned_data = rawdata.copy()
cleaned_data = engineer_features(rawdata.copy())

# if IS_TEST_DEV:
cleaned_data[['Timestamp', 'Year', 'Quarter','Season', 'Month', 'WeekOfYear', 'Day', 'DayOfWeek']].head()

# cleaned_data.columns

## 'Timestamp' as the only Ordinal Attribute/Column given its nature order in time-series data
ordinal_columns = ['Timestamp']

print("\n🎓 Describing the types of each attribute as numerical_columns (Continuous), ordinal_columns (Ordinal), or nominal_columns (Nominal) ...")
cleaned_numerical_columns, cleaned_nominal_columns = DescriptiveStatistics.describe_data(cleaned_data, ordinal_columns)

# targets = ['PM2.5', 'PM10', 'SO2']
# features = [feature for feature in df.columns if feature not in targets]

🎓 Describing the types of each attribute as numerical_columns (Continuous), ordinal_columns (Ordinal), or nominal_columns (Nominal) ...

ℹ️ Numerical Variables/Features: 
 ['AQI', 'PM10', 'PM2.5', 'SO2', 'NO', 'NO2', 'NOx', 'Wind_Speed', 'Wind_Dir', 'Air_Temp', 'Rel_Humidity', 'Day', 'DayOfWeek', 'Month', 'Quarter', 'Year', 'WeekOfYear', 'Season', 'PM2.5_Lag1', 'PM2.5_Lag2', 'PM10_Lag1', 'PM10_Lag2']

ℹ️ Ordinal Variables/Features: 
 ['Timestamp']

ℹ️ Nominal Variables/Features: 
 Index(['Site'], dtype='object')

cleaned_data_site1 = cleaned_data[cleaned_data['Site'] == 'Penrose']
cleaned_data_site2 = cleaned_data[cleaned_data['Site'] == 'Takapuna']

# if IS_DEBUG:
print("\n🎓 [Site1 - Penrose]  Summary Statistics of the {site1} cleaned_data_site1 Dataframe such as the mean, max/minimum values ...")
cleaned_data_site1.describe()

🎓 [Site1 - Penrose]  Summary Statistics of the {site1} cleaned_data_site1 Dataframe such as the mean, max/minimum values ...

# if IS_DEBUG:
print("\n🎓 [Site2 - Takapuna]  Summary Statistics of the {site2} cleaned_data_site2 Dataframe such as the mean, max/minimum values ...")
cleaned_data_site2.describe()

🎓 [Site2 - Takapuna]  Summary Statistics of the {site2} cleaned_data_site2 Dataframe such as the mean, max/minimum values ...

def get_top_correlated_features(data, target, num_features=12):
    """
    Calculates and returns the top N features most correlated with the target variable.
    
    Parameters:
    - data (DataFrame): The dataset containing numerical features.
    - target (str): The target variable for which correlations are calculated.
    - num_features (int): The number of top features to return.
    
    Returns:
    - List of top N correlated features.
    """
    ## Ensure data does not contain infinite or NaN values
    data = data.select_dtypes(include=[np.number]).dropna()
    
    ## Calculate the Pearson correlation matrix
    correlation_matrix = data.corr(method='pearson')
    
    ## Calculate absolute correlation with the target variable
    abs_corr_with_target = abs(correlation_matrix[target])
    
    ## Exclude the target variable and lag variables from the series
    features_to_exclude = [target] + [f"{target}_Lag1", f"{target}_Lag2"]
    abs_corr_with_target = abs_corr_with_target.drop(features_to_exclude, errors='ignore')
    
    ## Get the top N highly correlated features
    top_features = abs_corr_with_target.nlargest(num_features).index.tolist()
    
    return top_features

## Analyzing for both sites
top_features_data1 = get_top_correlated_features(cleaned_data_site1, 'PM2.5')
top_features_data2 = get_top_correlated_features(cleaned_data_site2, 'PM2.5')

logging.info("\n🌟 Top 5 features highly correlated with PM2.5 in Penrose: %s\n", top_features_data1)
logging.info("\n🌟 Top 5 features highly correlated with PM2.5 in Takapuna: %s\n", top_features_data2)

2024-05-14 12:04:07,853 - INFO - 
🌟 Top 5 features highly correlated with PM2.5 in Penrose: ['PM10', 'PM10_Lag1', 'NOx', 'NO', 'PM10_Lag2', 'AQI', 'NO2', 'SO2', 'Wind_Dir', 'Air_Temp', 'Season', 'Wind_Speed']

2024-05-14 12:04:07,853 - INFO - 
🌟 Top 5 features highly correlated with PM2.5 in Takapuna: ['PM10', 'PM10_Lag1', 'AQI', 'PM10_Lag2', 'NO', 'NOx', 'Air_Temp', 'NO2', 'Rel_Humidity', 'Wind_Dir', 'Quarter', 'Month']

## Analyzing for both sites
top_features_data21 = get_top_correlated_features(cleaned_data_site1, 'PM10')
top_features_data22 = get_top_correlated_features(cleaned_data_site2, 'PM10')

logging.info("\n🌟 Top 5 features highly correlated with PM10 in Penrose: %s\n", top_features_data21)
logging.info("\n🌟 Top 5 features highly correlated with PM10 in Takapuna: %s\n", top_features_data22)

2024-05-14 12:04:22,621 - INFO - 
🌟 Top 5 features highly correlated with PM10 in Penrose: ['AQI', 'PM2.5', 'PM2.5_Lag1', 'PM2.5_Lag2', 'NO', 'NOx', 'Wind_Speed', 'Rel_Humidity', 'SO2', 'NO2', 'Air_Temp', 'Wind_Dir']

2024-05-14 12:04:22,621 - INFO - 
🌟 Top 5 features highly correlated with PM10 in Takapuna: ['PM2.5', 'PM2.5_Lag1', 'PM2.5_Lag2', 'AQI', 'NO', 'NOx', 'Wind_Speed', 'NO2', 'Wind_Dir', 'Day', 'Rel_Humidity', 'Quarter']

# import pandas as pd
# import matplotlib.pyplot as plt
# 
# # Assuming rawdata and cleaned_data are pandas DataFrames loaded appropriately
# # Convert 'Timestamp' to datetime and set as index
# rawdata['Timestamp'] = pd.to_datetime(rawdata['Timestamp'])
# cleaned_data['Timestamp'] = pd.to_datetime(cleaned_data['Timestamp'])
# 
# rawdata.set_index('Timestamp', inplace=True)
# cleaned_data.set_index('Timestamp', inplace=True)

def plot_pm_variation(data, feature, title, ax):
    """
    Helper function to plot PM2.5/PM10 variations on a given Axes object.
    
    Parameters:
    - data (DataFrame): The dataset to plot.
    - title (str): The title of the plot.
    - ax (Axes): The matplotlib Axes object where the plot will be drawn.
    """
    ax.plot(data[feature], label=f'{feature}')
    ax.set_xlabel('Date')
    ax.set_ylabel(f'{feature} Concentration')
    ax.set_title(title)
    ax.legend()

fig, axs = plt.subplots(2, 2, figsize=(20, 12))  # Create a grid of 2x2 for Penrose and Takapuna

## Plotting
plot_pm_variation(rawdata_site1, 'PM2.5','Penrose - PM2.5 Variation over Time (Raw Data)', axs[0, 0])
plot_pm_variation(cleaned_data_site1, 'PM2.5','Penrose - PM2.5 Variation over Time (Cleaned Data)', axs[0, 1])
plot_pm_variation(rawdata_site2, 'PM2.5','Takapuna - PM2.5 Variation over Time (Raw Data)', axs[1, 0])
plot_pm_variation(cleaned_data_site2, 'PM2.5','Takapuna - PM2.5 Variation over Time (Cleaned Data)', axs[1, 1])

plt.tight_layout()
plt.show()

# ## 'cleaned_data_site1' and 'cleaned_data_site2' are preloaded DataFrames for Penrose and Takapuna, respectively
# ## Convert 'Timestamp' to datetime and set as index
# cleaned_data_site1['Timestamp'] = pd.to_datetime(cleaned_data_site1['Timestamp'])
# cleaned_data_site2['Timestamp'] = pd.to_datetime(cleaned_data_site2['Timestamp'])
# 
# cleaned_data_site1.set_index('Timestamp', inplace=True)
# cleaned_data_site2.set_index('Timestamp', inplace=True)

## [DEBUG] Top 5 features - this should be dynamically determined from a correlation analysis
# top_features_data1 = ['AQI', 'PM10', 'NOx', 'Wind_Speed', 'Air_Temp']
# top_features_data2 = ['PM2.5', 'SO2', 'NO', 'Wind_Dir', 'Rel_Humidity']

def plot_features(data, features, site_name):
    fig, axs = plt.subplots(len(features), 1, figsize=(15, 15))

    for ax, feature in zip(axs, features):
        if feature in data.columns:
            ax.plot(data.index, data[feature], label=f'{feature} (Cleaned Data)')
            ax.set_title(f'Variation of {feature} in {site_name}')
            ax.set_xlabel('Date')
            ax.set_ylabel(feature)
            ax.grid(True)
        else:
            fig.delaxes(ax)  ## Remove the axis if the feature is not in the data

    plt.tight_layout()
    plt.show()

## Call the function for each site
logging.info("\n🎓 [Penrose] Visualizes the variation of the top N highly correlated features with PM2.5 across all times.\n")
plot_features(cleaned_data_site1, top_features_data1, 'Penrose')
logging.info("\n🎓 [Takapuna] Visualizes the variation of the top N highly correlated features with PM2.5 across all times.\n")
plot_features(cleaned_data_site2, top_features_data2, 'Takapuna')

2024-05-14 12:04:38,696 - INFO - 
🎓 [Penrose] Visualizes the variation of the top N highly correlated features with PM2.5 across all times.

2024-05-14 12:04:39,572 - INFO - 
🎓 [Takapuna] Visualizes the variation of the top N highly correlated features with PM2.5 across all times.

## Call the function for each site
logging.info("\n🎓 [Penrose] Visualizes the variation of the top N highly correlated features with PM10 across all times.\n")
plot_features(cleaned_data_site1, top_features_data21, 'Penrose')
logging.info("\n🎓 [Takapuna] Visualizes the variation of the top N highly correlated features with PM10 across all times.\n")
plot_features(cleaned_data_site2, top_features_data22, 'Takapuna')

2024-05-14 12:04:51,396 - INFO - 
🎓 [Penrose] Visualizes the variation of the top N highly correlated features with PM10 across all times.

2024-05-14 12:04:52,148 - INFO - 
🎓 [Takapuna] Visualizes the variation of the top N highly correlated features with PM10 across all times.

def compute_summary_statistics(data, top_features):
    """
    Computes and prints summary statistics for selected features in a given dataset.
    
    Args:
    data (pd.DataFrame): The cleaned data for a specific site.
    top_features (list): List of features for which to compute statistics, determined from prior analysis.
    
    Returns:
    pd.DataFrame: Summary statistics of the selected features.
    """
    # Ensure all top features are in the data columns
    valid_features = [feature for feature in top_features if feature in data.columns]
    selected_features = ['PM2.5'] + valid_features

    # Get summary statistics
    summary_stats = data[selected_features].describe()
    return summary_stats

## 'cleaned_data_site1' and 'cleaned_data_site2' are your cleaned datasets for Penrose and Takapuna respectively.
## 'top_features_data1' and 'top_features_data2' should be lists of top features determined from correlation analysis.

## [DEBUG] Check if the top features list is defined for each site
# top_features_data1 = ['NO2', 'Wind_Speed', 'Air_Temp', 'PM10', 'NOx']  # Example top features for Penrose
# top_features_data2 = ['SO2', 'NO', 'Rel_Humidity', 'Wind_Dir', 'NOx']  # Example top features for Takapuna

summary_stats_penrose = compute_summary_statistics(cleaned_data_site1, top_features_data1)
summary_stats_takapuna = compute_summary_statistics(cleaned_data_site2, top_features_data2)

logging.info("🎓 Summary Statistics for Penrose:")
summary_stats_penrose

2024-05-14 12:05:50,736 - INFO - 🎓 Summary Statistics for Penrose:

logging.info("\n🎓 Summary Statistics for Takapuna:")
summary_stats_takapuna

2024-05-14 12:05:56,070 - INFO - 
🎓 Summary Statistics for Takapuna:

	AQI	PM10	PM2.5	SO2	NO	NO2	NOx	Wind_Speed	Wind_Dir	Air_Temp	Rel_Humidity
ds
2020-05-07 00:00:00	26.0	16.500	16.1	1.8	60.3	40.9	101.2	0.6	316.0	8.0	78.0
2020-05-07 01:00:00	28.0	17.700	10.1	1.0	16.0	29.2	45.3	0.7	269.0	8.0	76.8
2020-05-07 02:00:00	28.0	15.000	10.3	0.1	16.0	29.2	45.3	1.0	180.0	8.0	78.4
2020-05-07 03:00:00	29.0	14.300	11.4	0.0	11.2	27.5	38.7	0.8	232.0	8.0	77.5
2020-05-07 04:00:00	30.0	8.800	10.6	-0.1	12.0	28.5	40.5	0.8	274.0	7.0	80.1
...	...	...	...	...	...	...	...	...	...	...	...
2022-04-30 19:00:00	25.0	12.800	-0.2	0.3	0.1	4.8	4.9	2.1	95.0	16.0	80.5
2022-04-30 20:00:00	25.0	2.200	2.2	0.2	0.5	9.9	10.3	1.0	116.0	15.0	84.2
2022-04-30 21:00:00	24.0	16.225	1.9	0.6	5.8	25.4	31.3	0.8	124.0	15.0	85.4
2022-04-30 22:00:00	24.0	8.200	6.0	0.5	1.8	17.9	19.7	0.8	108.0	14.0	86.0
2022-04-30 23:00:00	24.0	2.300	5.4	0.5	1.0	7.3	8.3	0.9	93.0	14.0	84.4

	AQI	PM10	PM2.5	NO	NO2	NOx	Wind_Speed	Wind_Dir	Air_Temp	Rel_Humidity
ds
2020-05-07 17:00:00	21.0	5.95	4.15	10.90	0.01715	28.00	2.50	242.0	15.5	68.15
2020-05-07 18:00:00	21.0	5.65	5.10	8.20	0.01655	24.70	2.20	239.5	15.5	61.75
2020-05-07 19:00:00	21.0	7.70	5.45	5.75	0.01325	19.00	2.10	244.0	15.0	59.45
2020-05-07 20:00:00	21.0	8.20	5.45	3.50	0.00870	12.20	2.25	251.0	15.0	59.25
2020-05-07 21:00:00	21.0	11.80	5.80	3.55	0.00930	12.90	2.10	261.0	15.0	61.75
...	...	...	...	...	...	...	...	...	...	...
2022-04-30 19:00:00	14.0	4.75	3.30	0.60	0.00440	5.00	2.55	109.5	16.0	76.05
2022-04-30 20:00:00	14.0	6.35	3.15	0.50	0.00365	4.15	2.45	105.5	16.0	74.50
2022-04-30 21:00:00	14.0	6.05	2.80	0.40	0.00480	5.20	2.35	115.5	16.0	74.15
2022-04-30 22:00:00	13.0	4.20	2.60	0.40	0.00555	5.90	1.95	122.5	16.0	75.95
2022-04-30 23:00:00	13.0	5.00	2.80	0.35	0.00405	4.30	1.95	119.0	16.0	78.00

	Timestamp	AQI	PM10	PM2.5	SO2	NO	NO2	NOx	Wind_Speed	Wind_Dir	...	DayOfWeek	Month	Quarter	Year	WeekOfYear	Season	PM2.5_Lag1	PM2.5_Lag2	PM10_Lag1	PM10_Lag2
count	17375	17375.000000	17375.000000	17375.000000	17375.000000	17375.000000	17375.000000	17375.000000	17375.000000	17375.000000	...	17375.000000	17375.000000	17375.000000	17375.000000	17375.000000	17375.000000	17375.000000	17375.000000	17375.000000	17375.000000
mean	2021-05-03 23:07:19.873381376	28.233712	13.886201	5.400065	0.846799	9.871754	13.973526	23.837744	2.805975	173.193482	...	3.004259	6.538935	2.513899	2020.835568	26.883856	2.502791	5.400681	5.401262	13.887018	13.887496
min	2020-05-07 00:00:00	0.000000	-6.400000	-12.300000	-4.400000	-1.900000	-1.400000	-1.300000	0.200000	1.000000	...	0.000000	1.000000	1.000000	2020.000000	1.000000	1.000000	-12.300000	-12.300000	-6.400000	-6.400000
25%	2020-11-03 23:30:00	22.000000	8.500000	2.200000	0.100000	0.900000	4.100000	5.300000	1.600000	74.000000	...	1.000000	4.000000	2.000000	2020.000000	13.000000	1.000000	2.200000	2.200000	8.500000	8.500000
50%	2021-05-03 23:00:00	27.000000	13.100000	4.700000	0.500000	3.700000	10.500000	14.800000	2.600000	210.000000	...	3.000000	7.000000	3.000000	2021.000000	27.000000	3.000000	4.700000	4.700000	13.100000	13.100000
75%	2021-10-31 22:30:00	33.000000	18.300000	7.700000	1.000000	10.700000	20.300000	31.100000	3.800000	235.000000	...	5.000000	10.000000	4.000000	2021.000000	40.000000	4.000000	7.700000	7.700000	18.300000	18.300000
max	2022-04-30 23:00:00	121.000000	196.700000	138.400000	25.200000	378.400000	83.100000	455.100000	8.200000	357.000000	...	6.000000	12.000000	4.000000	2022.000000	53.000000	4.000000	138.400000	138.400000	196.700000	196.700000
std	NaN	9.608861	7.958464	5.204060	1.482960	20.526086	12.520003	30.254179	1.533803	93.604420	...	1.997634	3.459325	1.120423	0.684702	15.234194	1.123019	5.204694	5.205324	7.958003	7.957911

	Timestamp	AQI	PM10	PM2.5	SO2	NO	NO2	NOx	Wind_Speed	Wind_Dir	...	DayOfWeek	Month	Quarter	Year	WeekOfYear	Season	PM2.5_Lag1	PM2.5_Lag2	PM10_Lag1	PM10_Lag2
count	17359	17359.000000	17359.000000	17359.000000	17359.0	17359.000000	17359.000000	17359.000000	17359.000000	17359.000000	...	17359.000000	17359.000000	17359.000000	17359.000000	17359.000000	17359.000000	17359.000000	17359.000000	17359.000000	17359.000000
mean	2021-05-04 08:00:00	25.152947	11.562504	6.237181	0.5	11.637256	0.012103	23.521597	2.497209	176.424542	...	3.004148	6.540181	2.514315	2020.836454	26.890316	2.502275	6.237259	6.237348	11.562559	11.562660
min	2020-05-07 17:00:00	-13.000000	-3.800000	0.450000	0.5	-0.800000	-0.000500	-0.700000	0.200000	5.500000	...	0.000000	1.000000	1.000000	2020.000000	1.000000	1.000000	0.450000	0.450000	-3.800000	-3.800000
25%	2020-11-04 12:30:00	19.000000	7.468750	4.300000	0.5	2.350000	0.003950	6.912500	1.250000	76.500000	...	1.000000	4.000000	2.000000	2020.000000	13.000000	1.000000	4.300000	4.300000	7.468750	7.468750
50%	2021-05-04 08:00:00	24.000000	10.650000	5.550000	0.5	5.700000	0.008800	14.700000	2.300000	203.500000	...	3.000000	7.000000	3.000000	2021.000000	27.000000	2.000000	5.550000	5.550000	10.650000	10.650000
75%	2021-11-01 03:30:00	30.000000	14.400000	7.300000	0.5	11.900000	0.017100	28.500000	3.500000	260.000000	...	5.000000	10.000000	4.000000	2021.000000	40.000000	4.000000	7.300000	7.300000	14.400000	14.400000
max	2022-04-30 23:00:00	318.000000	437.250000	52.300000	0.5	352.450000	0.072100	401.950000	7.800000	346.500000	...	6.000000	12.000000	4.000000	2022.000000	53.000000	4.000000	52.300000	52.300000	437.250000	437.250000
std	NaN	10.637321	7.778661	3.401947	0.0	21.970397	0.010751	29.903734	1.455767	92.268417	...	1.998613	3.460755	1.120882	0.684574	15.240120	1.123436	3.401884	3.401809	7.778618	7.778534

	PM2.5	PM10	PM10_Lag1	NOx	NO	PM10_Lag2	AQI	NO2	SO2	Wind_Dir	Air_Temp	Season	Wind_Speed
count	17375.000000	17375.000000	17375.000000	17375.000000	17375.000000	17375.000000	17375.000000	17375.000000	17375.000000	17375.000000	17375.000000	17375.000000	17375.000000
mean	5.400065	13.886201	13.887018	23.837744	9.871754	13.887496	28.233712	13.973526	0.846799	173.193482	16.744259	2.502791	2.805975
std	5.204060	7.958464	7.958003	30.254179	20.526086	7.957911	9.608861	12.520003	1.482960	93.604420	4.005293	1.123019	1.533803
min	-12.300000	-6.400000	-6.400000	-1.300000	-1.900000	-6.400000	0.000000	-1.400000	-4.400000	1.000000	3.000000	1.000000	0.200000
25%	2.200000	8.500000	8.500000	5.300000	0.900000	8.500000	22.000000	4.100000	0.100000	74.000000	14.000000	1.000000	1.600000
50%	4.700000	13.100000	13.100000	14.800000	3.700000	13.100000	27.000000	10.500000	0.500000	210.000000	17.000000	3.000000	2.600000
75%	7.700000	18.300000	18.300000	31.100000	10.700000	18.300000	33.000000	20.300000	1.000000	235.000000	20.000000	4.000000	3.800000
max	138.400000	196.700000	196.700000	455.100000	378.400000	196.700000	121.000000	83.100000	25.200000	357.000000	28.000000	4.000000	8.200000

🎯 Libraries and Reusable Functions¶

📈 Time-Series Analysis & Forecasting using ARIMA¶

Automatic Model Selection¶

Multi-Steps Forecasting¶

⚙️ Time-Series Analysis & Forecasting with Meta/Facebook's Prophet¶

🔧 Feature Selection¶

References¶

	Timestamp	Year	Quarter	Season	Month	WeekOfYear	Day	DayOfWeek
0	2020-05-07 00:00:00	2020	2	3	5	19	7	3
1	2020-05-07 01:00:00	2020	2	3	5	19	7	3
2	2020-05-07 02:00:00	2020	2	3	5	19	7	3
3	2020-05-07 03:00:00	2020	2	3	5	19	7	3
4	2020-05-07 04:00:00	2020	2	3	5	19	7	3