Teradata

πŸŽ“ Predicting Air Particulate Matter at Scale ⛅️ πŸ› οΈ 3. Time Series Analysis and Forecasting

πŸ¦… Overview

πŸŽ“ This notebook shows various methods that can be used to analyse and predict a Time Series Dataset. It should be ready for reuse in the next steps (Machine Learning / Deep Learning Modelling and Intelligent Dashboard) in CRISP-DM for Data Science.

  • Workflow steps:

    1. Import the required teradataml modules.
      • Connect to a Vantage system.
    2. πŸ“ˆ Time-Series Analysis
    3. βš™οΈ Time-Series Forecasting
    4. Cleanup.

🎯 Libraries and Reusable Functions¢

πŸŽ“ This section executes all of the cells in `Data_Loading_and_Descriptive_Statistics.ipynb`.
InΒ [1]:
import os, logging

import logging
logging.getLogger('prophet').setLevel(logging.WARNING)
logging.getLogger("cmdstanpy").disabled=True

## .env --> Setting the environment variable for output format programmatically
os.environ['ENV_PATH'] = 'cleaned_vantage.env'  ## Set to `.env_cleaned` or `.env_raw` as needed

# %run -i ./Data_Loading_and_Descriptive_Statistics.ipynb
%run -i ./DataFrameAdapter.ipynb
2024-05-21 14:55:55,439 - INFO - data/cleaned/cleaned_Penrose7-07May2020-to-30Apr2022.csv
2024-05-21 14:55:55,439 - INFO - data/cleaned/cleaned_Takapuna23-07May2020-to-30Apr2022.csv
2024-05-21 14:55:55,448 - INFO - 
ℹ️ Load Data from data/cleaned/cleaned_Penrose7-07May2020-to-30Apr2022.csv file --> rawdata DataFrame πŸ“‚
2024-05-21 14:55:55,489 - INFO - 
ℹ️ Load Data from data/cleaned/cleaned_Takapuna23-07May2020-to-30Apr2022.csv file --> rawdata DataFrame πŸ“‚
πŸŽ“ The specified .env file: cleaned_vantage.env
+-----------------------------+---------------------------------------------------------------------------------------------+-----------------------+
| Variable                    | Description                                                                                 | Value                 |
+-----------------------------+---------------------------------------------------------------------------------------------+-----------------------+
| IS_LOADING_FROM_FILES       | True if loading data from *.csv/xls files; False if using imported data in Teradata Vantage | True                  |
| IS_TERADATA_VANTAGE         | Using scable Teradata Vantage vs. local machine (Laptop)                                    | False                 |
| IS_DATA_IN_TERADATA_VANTAGE | Using TeradataDataFrame in scalable Vantage vs. PandasDataFrame with local *.csv/xls files  | False                 |
| SCHEMA_NAME                 | [Teradata Vantage] Schema Name                                                              | Air_Pollution         |
| TABLE_NAME                  | [Teradata Vantage] Table Name                                                               | Air_Pollution_cleaned |
| IS_JUPYTERLAB               | Running in JupyterLab vs Python Dash/Vizro Dashboard                                        | True                  |
| IS_TEST_DEV                 | Is Test/Dev mode is active or not (in Production)                                           | False                 |
| DATA_PATH                   | *.csv/xls Data PATH                                                                         | Not set or not found  |
| USE_DATA_PREFIX             | Prefix to use for data files: 'raw' | 'cleaned'                                             | cleaned               |
+-----------------------------+---------------------------------------------------------------------------------------------+-----------------------+

ℹ️ Load Data from data/cleaned folder


ℹ️ Combined Data Shape: (34734, 13)
ℹ️ The Shape of the Dataframe rawdata_site1 (Penrose) and rawdata_site2 (Takapuna): (17375, 12) (17359, 11)

πŸŽ“ Describing the types of each attribute as numerical_columns (Continuous), ordinal_columns (Ordinal), or nominal_columns (Nominal) ...

ℹ️ Numerical Variables/Features: 
 ['AQI', 'PM10', 'PM2.5', 'SO2', 'NO', 'NO2', 'NOx', 'Wind_Speed', 'Wind_Dir', 'Air_Temp', 'Rel_Humidity']

ℹ️ Ordinal Variables/Features: 
 ['Timestamp']

ℹ️ Nominal Variables/Features: 
 Index(['Site'], dtype='object')

πŸŽ“ 1. [Site 1 - Penrose][numerical_columns_S1, nominal_columns_S1] Summary Statistics of the Dataframe such as the mean, maximum and minimum values ...

ℹ️ Numerical Variables/Features: 
 ['AQI', 'PM10', 'PM2.5', 'SO2', 'NO', 'NO2', 'NOx', 'Wind_Speed', 'Wind_Dir', 'Air_Temp', 'Rel_Humidity']

ℹ️ Ordinal Variables/Features: 
 ['Timestamp']

πŸŽ“ 2. [Site 2 - Takapuna][numerical_columns_S2, nominal_columns_S2]  Summary Statistics of the {site2} Dataframe such as the mean, maximum and minimum values ...

ℹ️ Numerical Variables/Features: 
 ['AQI', 'PM10', 'PM2.5', 'NO', 'NO2', 'NOx', 'Wind_Speed', 'Wind_Dir', 'Air_Temp', 'Rel_Humidity']

ℹ️ Ordinal Variables/Features: 
 ['Timestamp']

πŸŽ“ [Data_Loading_and_Descriptive_Statistics.ipynb] Listing variables with description...
+-------------------------+----------------------------------------------------------------+-----------+---------+----------+
| Variable Name           | Description                                                    | All Sites | Penrose | Takapuna |
+-------------------------+----------------------------------------------------------------+-----------+---------+----------+
| rawdata                 | Complete dataset containing all observations across all sites. |    [x]    |   [x]   |   [x]    |
| ordinal_columns         | Ordinal columns specific to Site 1.                            |    [x]    |   [x]   |   [x]    |
| numerical_columns_site1 | Numerical columns specific to Site 1.                          |    [ ]    |   [x]   |   [ ]    |
| nominal_columns_site1   | Nominal columns specific to Site 1.                            |    [ ]    |   [x]   |   [ ]    |
| numerical_columns_site2 | Numerical columns specific to Site 2.                          |    [ ]    |   [ ]   |   [x]    |
| nominal_columns_site2   | Nominal columns specific to Site 2.                            |    [ ]    |   [ ]   |   [x]    |
| rawdata_site1           | Subset of raw data for Site 1.                                 |    [ ]    |   [x]   |   [ ]    |
| rawdata_site2           | Subset of raw data for Site 2.                                 |    [ ]    |   [ ]   |   [x]    |
+-------------------------+----------------------------------------------------------------+-----------+---------+----------+

πŸŽ“ Describing the types of each attribute as cleaned_numerical_columns (Continuous), cleaned_ordinal_columns (Ordinal), or cleaned_nominal_columns (Nominal) ...

ℹ️ Numerical Variables/Features: 
 ['AQI', 'PM10', 'PM2.5', 'SO2', 'NO', 'NO2', 'NOx', 'Wind_Speed', 'Wind_Dir', 'Air_Temp', 'Rel_Humidity', 'Hour', 'Day', 'DayOfWeek', 'Month', 'Quarter', 'Year', 'WeekOfYear', 'Season', 'PM2.5_Lag1', 'PM2.5_Lag2', 'PM10_Lag1', 'PM10_Lag2']

ℹ️ Ordinal Variables/Features: 
 ['Timestamp']

ℹ️ Nominal Variables/Features: 
 Index(['Site'], dtype='object')

πŸŽ“ [Site1 - Penrose]  Summary Statistics of the {site1} cleaned_data_site1 Dataframe such as the mean, max/minimum values ...

πŸŽ“ [Site2 - Takapuna]  Summary Statistics of the {site2} cleaned_data_site2 Dataframe such as the mean, max/minimum values ...

πŸŽ“ [DataFrameAdapter.ipynb] Listing variables with description...
+-----------------------------+-----------------------------------------------------------------------+-----------+---------+----------+
| Variable Name               | Description                                                           | All Sites | Penrose | Takapuna |
+-----------------------------+-----------------------------------------------------------------------+-----------+---------+----------+
| rawdata                     | Complete dataset containing all observations across all sites.        |    [x]    |   [x]   |   [x]    |
| numerical_columns_site1     | Numerical columns specific to Site 1.                                 |    [ ]    |   [x]   |   [ ]    |
| nominal_columns_site1       | Nominal columns specific to Site 1.                                   |    [ ]    |   [x]   |   [ ]    |
| numerical_columns_site2     | Numerical columns specific to Site 2.                                 |    [ ]    |   [ ]   |   [x]    |
| nominal_columns_site2       | Nominal columns specific to Site 2.                                   |    [ ]    |   [ ]   |   [x]    |
| rawdata_site1               | Subset of raw data for Site 1.                                        |    [ ]    |   [x]   |   [ ]    |
| rawdata_site2               | Subset of raw data for Site 2.                                        |    [ ]    |   [ ]   |   [x]    |
| --------------------------- | --------------------------------------------------------------------- | --------- | ------- | -------- |
| cleaned_data                | Cleaned dataset with preprocessing applied.                           |    [x]    |   [x]   |   [x]    |
| cleaned_ordinal_columns     | Ordinal columns in the cleaned dataset.                               |    [x]    |   [x]   |   [x]    |
| cleaned_numerical_columns   | Numerical columns in the cleaned dataset.                             |    [x]    |   [x]   |   [x]    |
| cleaned_nominal_columns     | Nominal columns in the cleaned dataset.                               |    [x]    |   [x]   |   [x]    |
| cleaned_data_site1          | Cleaned data for Site 1.                                              |    [ ]    |   [x]   |   [ ]    |
| cleaned_data_site2          | Cleaned data for Site 2.                                              |    [ ]    |   [ ]   |   [x]    |
+-----------------------------+-----------------------------------------------------------------------+-----------+---------+----------+
InΒ [2]:
# if IS_JUPYTERLAB:
target_variables = {'PM2.5': 'Particulate Matter <2.5 Β΅m', 'PM10': 'Particulate Matter <10 Β΅m'}
data_series_df1 = PandasDataLoader.load_data(rawdata_site1)
data_series_df2 = PandasDataLoader.load_data(rawdata_site2)

data_series_df1
Out[2]:
AQI PM10 PM2.5 SO2 NO NO2 NOx Wind_Speed Wind_Dir Air_Temp Rel_Humidity
ds
2020-05-07 00:00:00 26.0 16.500 16.1 1.8 60.3 40.9 101.2 0.6 316.0 8.0 78.0
2020-05-07 01:00:00 28.0 17.700 10.1 1.0 16.0 29.2 45.3 0.7 269.0 8.0 76.8
2020-05-07 02:00:00 28.0 15.000 10.3 0.1 16.0 29.2 45.3 1.0 180.0 8.0 78.4
2020-05-07 03:00:00 29.0 14.300 11.4 0.0 11.2 27.5 38.7 0.8 232.0 8.0 77.5
2020-05-07 04:00:00 30.0 8.800 10.6 -0.1 12.0 28.5 40.5 0.8 274.0 7.0 80.1
... ... ... ... ... ... ... ... ... ... ... ...
2022-04-30 19:00:00 25.0 12.800 -0.2 0.3 0.1 4.8 4.9 2.1 95.0 16.0 80.5
2022-04-30 20:00:00 25.0 2.200 2.2 0.2 0.5 9.9 10.3 1.0 116.0 15.0 84.2
2022-04-30 21:00:00 24.0 16.225 1.9 0.6 5.8 25.4 31.3 0.8 124.0 15.0 85.4
2022-04-30 22:00:00 24.0 8.200 6.0 0.5 1.8 17.9 19.7 0.8 108.0 14.0 86.0
2022-04-30 23:00:00 24.0 2.300 5.4 0.5 1.0 7.3 8.3 0.9 93.0 14.0 84.4

17375 rows Γ— 11 columns

InΒ [3]:
# cleaned_data_site2.index
# data_series_df2.index
data_series_df2
Out[3]:
AQI PM10 PM2.5 NO NO2 NOx Wind_Speed Wind_Dir Air_Temp Rel_Humidity
ds
2020-05-07 17:00:00 21.0 5.95 4.15 10.90 0.01715 28.00 2.50 242.0 15.5 68.15
2020-05-07 18:00:00 21.0 5.65 5.10 8.20 0.01655 24.70 2.20 239.5 15.5 61.75
2020-05-07 19:00:00 21.0 7.70 5.45 5.75 0.01325 19.00 2.10 244.0 15.0 59.45
2020-05-07 20:00:00 21.0 8.20 5.45 3.50 0.00870 12.20 2.25 251.0 15.0 59.25
2020-05-07 21:00:00 21.0 11.80 5.80 3.55 0.00930 12.90 2.10 261.0 15.0 61.75
... ... ... ... ... ... ... ... ... ... ...
2022-04-30 19:00:00 14.0 4.75 3.30 0.60 0.00440 5.00 2.55 109.5 16.0 76.05
2022-04-30 20:00:00 14.0 6.35 3.15 0.50 0.00365 4.15 2.45 105.5 16.0 74.50
2022-04-30 21:00:00 14.0 6.05 2.80 0.40 0.00480 5.20 2.35 115.5 16.0 74.15
2022-04-30 22:00:00 13.0 4.20 2.60 0.40 0.00555 5.90 1.95 122.5 16.0 75.95
2022-04-30 23:00:00 13.0 5.00 2.80 0.35 0.00405 4.30 1.95 119.0 16.0 78.00

17359 rows Γ— 10 columns

πŸ“ˆ Time-Series Analysis & Forecasting using ARIMAΒΆ

πŸŽ“ Time series analysis prepares and analyzes time series datasets for time series forecasting using ARIMA (AutoRegressive Integrated Moving Average) model. The results of time series analysis will extract useful information from time series data, such as trends, cyclic and seasonal deviations, correlations, etc.

No description has been provided for this image

  1. check_stationarity: checking for Stationarity of Time Series using the ADF Test

    Checking for stationarity of Time Series using the Augmented Dickey-Fuller (ADF) Test

    No description has been provided for this image

    • Time-Series Stationarity: Some time-series models, such as such as ARIMA, assume that the underlying data is stationary. Stationarity describes that the time-series has

      • Constant mean and mean is not time-dependent
      • Constant variance and variance is not time-dependent
      • Constant covariance and covariance is not time-dependent
    • The check for stationarity can be done via three different approaches:

      • Visually: plot time series and check for trends or seasonality
      • Basic statistics: split time series and compare the mean and variance of each partition
      • Statistical test: Augmented Dickey Fuller test
    • Apply the ADF Test to verify if the time series data is stationary, which is a requirement for certain time series forecasting methods like ARIMA.

    • Augmented Dickey-Fuller (ADF) test is a type of statistical test called a unit root test. Unit roots are a cause for non-stationarity.

      • Null Hypothesis (H0): Time series has a unit root. (Time series is not stationary).
      • Alternate Hypothesis (H1): Time series has no unit root (Time series is stationary).

      • If the null hypothesis can be rejected, we can conclude that the time series is stationary. There are two ways to rejects the null hypothesis:
        • On the one hand, the null hypothesis can be rejected if the p-value is below a set significance level. The defaults significance level is 5%

          • p-value > significance level (default: 0.05): Fail to reject the null hypothesis (H0), the data has a unit root and is non-stationary.
          • p-value <= significance level (default: 0.05): Reject the null hypothesis (H0), the data does not have a unit root and is stationary.
        • On the other hand, the null hypothesis can be rejects if the test statistic is less than the critical value.

          • ADF statistic > critical value: Fail to reject the null hypothesis (H0), the data has a unit root and is non-stationary.
          • ADF statistic < critical value: Reject the null hypothesis (H0), the data does not have a unit root and is stationary.
    Checking for stationarity of Time Series using the Kwiatkowski β€” Phillips β€” Schmidt β€” Shin (KPSS) Test
     * The KPSS test can also be used to detect stochastic trends.
     * The test hypotheses are opposite relative to ADF:
         * Null hypothesis: the time series is trend-stationary.

  2. determine_acf_pacf: determine ACF and PACF for Autocorrelation and Partial Autocorrelation of the Time Series.

    • ACF calculates the autocorrelation or autocovariance of a time series. The autocorrelation and autocovariance show how the time series correlates or covaries with itself when delayed by a lag in time or space.

Fitting and Validating ARIMA models parameters (p, d, q) based on AIC for Penrose and Takapuna. It involves determining the best-fit model and predicting future demand forecasts.

  1. fit: Estimate/Fit the ARIMA model parameters for forecasting pollutant levels.
  2. validate: Model Validation and Scoring: Validate the ARIMA model against a portion of the data to determine the model’s accuracy.
  3. forecast: Forecast future pollutant levels using the validated model.
  4. visualize actual versus predicted values and forecasted values to assess model performance.
    • plot_diagnostics: plot diagnostic charts
    • generate_report: generate a report of the model fit and forecast
InΒ [4]:
class TimeSeriesAnalyzer(ABC):
    """
    Abstract base class for a Time Series Analyzer.
    """

    def __init__(self, data, target_col):
        """ βœ… Step 1: Data Loading & Splitting """
        # self.data = data
        self.target_col = target_col   
        ## Placeholder for transformed data, if necessary
        # self.transformed_data = None
        ## Placeholder for differenced data, if necessary
        # self.differenced_data = None
        ## Placeholder for the selected model
        self.model = None

    @abstractmethod
    def timeseries_visualization(self, lags=15, max_diff_order=3):
        """
        βœ… Step 2: Visualizes the time series data with original time-series and histogram plots; also its ACF and PACF plots.
        Also considers Box-Cox transformation for variance stabilisationif the data is skewed.

        Original Series and Histogram plots; also ACF and PACF plots are now side by side for better comparative analysis.

        - ACF and PACF plots help identify the order of the AR and MA components for ARIMA modeling.
        - Histogram analysis after transformation helps understand data normality and variance stability.
        -
        - If the histogram of the transformed data is more symmetrical, a Box-Cox transformation is helpful.
        - Significant spikes in the ACF/PACF plots indicate potential AR or MA terms for ARIMA modeling.
        - Check if transformation parameters need adjustment based on skewness and kurtosis metrics.

        - [x] 2.1. Plot the original time series: visual sense of trends, seasonality, or irregular patterns.
        - [ ] 2.1. Plot the histogram of the series values: After applying a Box-Cox transformation, the histogram helps to check if the data distribution has become more symmetrical, which is often required for effective modeling in time series analysis.
        - [x] 2.2. Plot the Auto-Correlation Function (ACF): visualize how the time series data correlates with its lagged values, which is crucial for identifying the MA (Moving Average) components of an ARIMA model.
        - [x] 2.2. Plot the PACF: provides insights into the AR (AutoRegressive) components of the ARIMA model by isolating the influence of intervening lags.
        """
        pass

    @abstractmethod
    def transform_data(data, transformer, lag_max=15):
        """
        βœ… Step 3: Apply transformation to the time series data and display diagnostics.
        
        Parameters:
        - data (np.array or pd.Series): The original time series data.
        - transformer (object): An instance of a transformer from pmdarima.preprocessing.
        - lag_max (int): Maximum lag to which the autocorrelation function is computed.
        
        Returns:
        - Transformed data
        """
        pass


    @abstractmethod
    def model_selection(self):
        """
        Step 5: iteratively select models based on the AIC: try the chosen model(s) and use the AIC to search for a better model.
        """
        pass

    @abstractmethod
    def multi_step_forecast(self):
        """
        Step 6: Check the residuals from the chosen model by plotting the ACF of the residual, and doing a portmanteau test of the residual.
        Do the residuals look like white noise?
        """
        pass

    # @abstractmethod
    # def calculate_forecast(self):
    #     """
    #     Step 7: Calculate forecasts & plot 
    #     """
    #     pass
InΒ [5]:
from pmdarima.preprocessing import BoxCoxEndogTransformer, LogEndogTransformer
from pmdarima.model_selection import train_test_split
from pmdarima import tsdisplay

class ArimaTimeSeriesAnalyzer(TimeSeriesAnalyzer):
    def __init__(self, data: pd.DataFrame, monitoring_site:str, target_col: str, is_auto_arima=False):
        """ 
        Initializes the analysis with training and testing data, ensuring that both datasets contain the specified target variable. 
        """
        super().__init__(data, target_col)
        print(f'βœ… Step 1: Data Loading & Splitting 80/20')
        self.data = data
        self.site = monitoring_site   ## Penrose, Takapuna
        self.target_col = target_col  ## PM2.5, PM10
        # self.train_data = None
        # self.test_data = None
        # self.train_test_split(train_size=0.8) ## 80/20 Train/Test split
        self.is_auto_arima = is_auto_arima
        # self.transformed_data = None
        # self.lambda_value = None
        # self.is_stationary = True
        # self.sarima_model = None
        # self.arima_model = None
        self.model = None

    def train_test_split(self, train_size=0.8):
        """
        β˜‘οΈ [1] Train-Test split: Splits the data into training and testing sets.
        Use 80% as the in-time training data and the rest 20% as the out-of-time test data.
        :param train_size: float - proportion of the dataset to include in the train split.
        """
        train_len = int(len(self.data) * train_size)
        self.train_data = self.data.iloc[:train_len]
        self.test_data = self.data.iloc[train_len:]
        print(f'πŸŽ“ Training & Testing Data split (80/20): {len(self.train_data):,} train samples and {len(self.test_data):,} test samples.')
        self.y = self.data[self.target_col]   ## Compatible with Prophet
        self.y_train = self.train_data[self.target_col]
        self.y_test  = self.test_data[self.target_col]

    def plot_line_chart_and_histogram(self, y_train):
        """
        β˜‘οΈ [2] Plot line chart and histogram
        """
        print(f'\nπŸŽ“ [2.1] Plot the original time series: visual sense of trends, seasonality, or irregular patterns.') 
        plt.figure(figsize=(14, 7))
        
        ## 1.1. Original Series Plot
        plt.subplot(121) ## 1 row, 2 columns, 1st subplot
        # plt.plot(self.data.index, self.y, color='blue', label='Original Series')
        plt.plot(y_train.index, y_train, color='steelblue', label='Original Series')
        plt.title(f"Original Time Series of {self.target_col} at {self.site}")
        plt.xlabel("Timestamp")
        plt.ylabel(self.target_col)
        plt.legend()

        ## 1.2 Plot the histogram of the series values
        ## Histogram displays the distribution of data to evaluate skewness.
        ## Check for skewness and perform Box-Cox Transformation if necessary
        # transformed_data, _ = preprocessing.BoxCoxEndogTransformer(lmbda2=1e-6).fit_transform(self.data[[self.target_col]].dropna())
        plt.subplot(122) ## 1 row, 2 columns, 2nd subplot
        # plt.hist(transformed_data, bins=30, color='blue', edgecolor='k', alpha=0.7)
        # plt.hist(self.y, bins=90, color='blue', edgecolor='k', alpha=0.7)
        plt.hist(y_train, bins=60, color='steelblue', edgecolor='gray', alpha=0.7)
        # plt.title(f"Histogram of Transformed {self.target_col}")
        plt.title(f"Histogram of Train Data: {self.target_col} at {self.site}")
        plt.xlabel(f'{self.target_col} Value')
        plt.ylabel("Frequency")

        plt.tight_layout()
        plt.show()

    ## Step 2.2: Differencing the series to achieve stationarity
    def plot_differencing(self, series, max_diff_order=3):
        """
        β˜‘οΈ [2].1 Apply differencing to the time series to achieve stationarity and plot each differencing level in a grid format,
        and perform Dickey-Fuller tests to check stationarity after each differencing.
        This function plots the original series and its differenced versions up to the specified order, 
        along with the p-values from the Dickey-Fuller test to assess stationarity.

        Note: pmdarima's auto_arima will automatically determine the appropriate differencing term for you by default.
        
        Parameters:
        - series (pd.Series): Time series data.
        - max_diff_order (int): The maximum differencing order to apply.
        
        Outputs:
        - Plots of original and differenced series with Dickey-Fuller test results.
        """
        print(f'\nπŸŽ“ [2.2] Apply {max_diff_order} Order Differencing to the time series and visualizes results.')
        ordering = ['Original', '1st', '2nd', '3rd', '4th', '5th', '6th']
        plt.rcParams.update({'figure.figsize': (12, 8), 'figure.dpi': 100})
        num_plots = max_diff_order + 1
        num_rows = (num_plots + 1) // 2   ## Calculate rows needed for plots (2 plots per row)
        # fig, axes = plt.subplots(num_rows, 2, figsize=(12, 3 * num_rows), sharex=True, sharey=True)
        fig, axes = plt.subplots(num_rows, 2, figsize=(12, 3 * num_rows), sharex=False, sharey=True)
        axes = axes.flatten()  ## Flatten to simplify indexing
        
        for i in range(num_plots):
            ax = axes[i]       ## This handles both single and multiple rows of subplots 
            if i == 0:         ## Plot original series in the first plot
                current_series = series
                title = 'Original Series'
            else:              ## Generate and plot differenced series up to the max_diff_order
                current_series = series.diff(i).dropna()
                title = f'[{self.site}] {self.target_col}: {ordering[i]} Order Differencing'
            ax.plot(current_series, color='steelblue' if i == 0 else 'orange')  
            ax.set_title(f'[{self.site}] {self.target_col}: {ordering[i]} Order Differencing')
            ax.grid(True, which='both', linestyle='--', linewidth=0.5, color='lightgray')
            ax.tick_params(axis='x', rotation=45)  ## Rotate x-axis labels

            ## Perform Dickey-Fuller test on the current series and display p-value
            if not current_series.empty:
                df_result = adfuller(current_series)
                p_value_msg = f'p-value: {df_result[1]:.4f}'
                stationarity = " (Stationary: < 0.05)" if df_result[1] < 0.05 else " (Not stationary: >= 0.05)"
                ax.set_xlabel(p_value_msg + stationarity)
            else:
                ax.set_xlabel('No data available for testing')
            ## Move the xlabel to the top left
            ax.xaxis.label.set_color('steelblue')
            ax.xaxis.set_label_position('top')
            ax.xaxis.set_label_coords(0.5, 1.02)

        ## Adjust the number of plots and layout: If the total number of plots is odd
        ## Hide any unused axes if the number of plots is less than the number of subplot cells
        if num_plots < len(axes):
            for j in range(num_plots, len(axes)):
                fig.delaxes(axes[j])
                
        ## Adjust layout and display the plot
        plt.tight_layout() 
        plt.show()

        
    def plot_acf_ma_and_pacf_ar(self, series, max_diff_order=2, max_lags=15, use_pacf=False):
        """
        β˜‘οΈ [2].2. Use the ACF to suggest the orders of MA; and Use the PACF to suggest the orders of AR
        
        Calculate and plots the Autocorrelation Function (ACF) for the original time series and {max_diff_order} differenced versions 
        to suggest possible MA orders, including statistical tests for stationarity.
        
        This function uses the series that have been differenced and stored in a dictionary to plot their ACF. 
        This helps in identifying the order of the MA component by looking at the significant lags in the ACF plots.
        
        Parameters:
        - series (pd.Series): The time series data.
        - max_diff_order (int): Maximum number of differencing orders to plot.
        - max_lags (int): specifically focus on the first max_lags lags in the ACF plot.
    
        This function performs the following:
        - Differencing the series up to the specified order.
        - Plotting the ACF for each differenced series including the original series.
        - Performing the Dickey-Fuller test to annotate plots with p-values indicating stationarity.
        """
        # print(f'\nπŸŽ“ Plotting the ACF for the original and differenced series to suggest possible MA orders.')
        print(f'\nπŸŽ“ [2.3] Plotting the {"PACF" if use_pacf else "ACF"} to suggest possible {"AR" if use_pacf else "MA"} orders.')
        ordering = ['Original', '1st', '2nd', '3rd', '4th', '5th', '6th']

        ## Prepare the subplot grid based on the number of differencing levels
        plt.rcParams.update({'figure.figsize': (12, 8), 'figure.dpi': 100}) ## Adjust plot settings
        num_plots = max_diff_order + 1
        num_rows = (num_plots + 1) // 2   ## Calculate rows needed for plots (2 plots per row)
        fig, axes = plt.subplots(num_rows, 2, figsize=(12, 3 * num_rows), sharex=True, sharey=True)
        axes = axes.flatten()  ## Flatten to simplify indexing
               
        ## Iterate through each differencing level
        for i in range(num_plots):
            ax = axes[i]
            ## Handle the original series differently
            current_series = series if i == 0 else series.diff(i).dropna()
            title_suffix = 'Original Series' if i == 0 else f'{ordering[i]} Order Differencing'

            # ## Check for non-empty series to avoid plotting errors
            # if current_series.empty:
            #     ax.text(0.5, 0.5, 'No data available for plotting', horizontalalignment='center', verticalalignment='center',
            #             transform=ax.transAxes)
            #     ax.set_title('Empty Series')
            #     continue
    
            ## Perform ACF plotting: The alpha=0.05 setting specifies a 95% confidence level.
            lags = min(max_lags, len(current_series) - 1)
            # plot_acf(current_series, ax=ax, lags=lags, alpha=0.05, title=f'[{self.site}] {self.target_col}: ACF - {title_suffix}')
            plot_func = plot_pacf if use_pacf else plot_acf
            plot_func(current_series.head(lags), ax=ax, alpha=0.05, title=f'[{self.site}] {self.target_col}: ACF - {title_suffix}')
            ## Adding labels for x and y axes
            ax.set_xlabel('Lags')
            ax.set_ylabel(f'{"PACF" if use_pacf else "ACF"}')
            
            # plot_acf(current_series, ax=ax, lags=lags, title=f'[{self.site}] {self.target_col}: ACF - {ordering[i]} Order Differencing')
            ## Setting plot titles and labels
            ax.grid(True, which='both', linestyle='--', linewidth=0.5, color='lightgray')
            # ax.set_title(f'{title_suffix} - ADF p-value: {p_value:.4f} {stationarity_status}', fontsize=12)

            # ## Highlight significant lags manually if needed: marks significant lags based on ACF values outside the confidence bounds.
            # significance_limit = 1.96 / np.sqrt(len(current_series))
            # ax.axhline(y=significance_limit, linestyle='--', color='red', linewidth=0.7)
            # ax.axhline(y=-significance_limit, linestyle='--', color='red', linewidth=0.7)

            ## Perform Dickey-Fuller test & Adding stationarity test results for context
            df_result = adfuller(current_series)
            p_value_msg = f'p-value: {df_result[1]:.4f}'
            stationarity_status = ' (Stationary: < 0.05)' if df_result[1] < 0.05 else ' (Not stationary: >= 0.05)'
            ax.set_xlabel(p_value_msg + stationarity_status)
            ## Move the xlabel to the top left
            ax.xaxis.label.set_color('steelblue')
            ax.xaxis.set_label_position('top')
            ax.xaxis.set_label_coords(0.5, 1.02)

        ## Adjust the number of plots and layout: If the total number of plots is odd
        ## Hide any unused axes if the number of plots is less than the number of subplot cells
        if num_plots < len(axes):
            for j in range(num_plots, len(axes)):
                fig.delaxes(axes[j])
        
        ## Layout adjustment
        plt.tight_layout()
        plt.show()

        
    def timeseries_visualization(self, lags=15, max_diff_order=3):
        """
        βœ… Step 2: Visualizes the time series data with original time-series and histogram plots; also its ACF and PACF plots.
        Also considers Box-Cox transformation for variance stabilisationif the data is skewed.

        Original Series and Histogram plots; also ACF and PACF plots are now side by side for better comparative analysis.

        - ACF and PACF plots help identify the order of the AR and MA components for ARIMA modeling.
        - Histogram analysis after transformation helps understand data normality and variance stability.
        -
        - If the histogram of the transformed data is more symmetrical, a Box-Cox transformation is helpful.
        - Significant spikes in the ACF/PACF plots indicate potential AR or MA terms for ARIMA modeling.
        - Check if transformation parameters need adjustment based on skewness and kurtosis metrics.

        - [x] 2.1. Plot the original time series: visual sense of trends, seasonality, or irregular patterns.
        - [ ] 2.1. Plot the histogram of the series values: After applying a Box-Cox transformation, the histogram helps to check if the data distribution has become more symmetrical, which is often required for effective modeling in time series analysis.
        - [x] 2.2. Plot the Auto-Correlation Function (ACF): visualize how the time series data correlates with its lagged values, which is crucial for identifying the MA (Moving Average) components of an ARIMA model.
        - [x] 2.2. Plot the PACF: provides insights into the AR (AutoRegressive) components of the ARIMA model by isolating the influence of intervening lags.
        """
        print(f'βœ… Step 2: Time Series Visualization')
        ## Visualize the time series and its ACF and PACF plots
        print(f'\nπŸŽ“ [2.1] [Default] Time Series Display ...')
        pm.tsdisplay(self.y_train, lag_max=lags, title=f'ACF plot and histogram for {self.target_col} at {self.site}', show=True)

        ## Row 1: Plotting the original series and its histogram
        self.plot_line_chart_and_histogram(self.y_train)
        
        ## Row 2: Perform stationarity checks to see if the given series is stationary. 
        ## To perform differencing and unit-root tests (Dickey-Fuller test), and others if necessary. 
        self.plot_differencing(self.train_data[self.target_col], max_diff_order=1)

        # df_result = adfuller(self.y)
        # p_value_msg = f'p-value: {df_result[1]:.4f}'
        # stationarity = " (Stationary)" if df_result[1] < 0.05 else " (Not stationary)"
        # print(f'πŸŽ“ ADF Statistic adf_test_result[0]: {df_result[0]}; and adf_test_result[1]: ' + p_value_msg + stationarity)

        ## Row 3: ACF and PACF Plots
        ## Use the ACF to suggest the orders of MA
        self.plot_acf_ma_and_pacf_ar(self.train_data[self.target_col], max_diff_order=max_diff_order, max_lags=lags, use_pacf=False)

        ## Use the PACF to suggest the orders of AR
        self.plot_acf_ma_and_pacf_ar(self.train_data[self.target_col], max_diff_order=max_diff_order, max_lags=lags, use_pacf=True)


    ## Function to apply transformation and display time series diagnostics
    def transform_data(data, transformer, lag_max=15):
        """
        βœ… Step 3: Apply transformation to the time series data and display diagnostics.
        
        Parameters:
        - data (np.array or pd.Series): The original time series data.
        - transformer (object): An instance of a transformer from pmdarima.preprocessing.
        - lag_max (int): Maximum lag to which the autocorrelation function is computed.
        
        Returns:
        - Transformed data
        """
        print(f'βœ… Step 3: If data is skewed, transform the data (using a Log or Box-Cox transformation) to stabilise the variance.')
        
        ## Fit and transform the data
        transformed_data, _ = transformer.fit_transform(data)
    
        ## Display time series diagnostics
        tsdisplay(transformed_data, lag_max=lag_max, title=f"Transformed Data with {transformer.__class__.__name__}")
    
        return transformed_data


    def auto_arima_model(self, y_train):
        """
        β˜‘οΈ [4] Uses auto_arima from the pmdarima library to find the optimal/best ARIMA model.
        auto_arima can be quite memory-intensive, especially with large datasets

        ARIMA(p,d,q) model:
        - p is the order of the AR term: Use the PACF to suggest the orders of auto-regressive (AR)
        - d is the number of differencing required to make the time series stationary: 
          The degree of differencing, is typically only used in the case of non-stationary data
        - q is the order of the MA term: Use the ACF to suggest the orders of moving average (MA)
        
        Parameters:
        - train_data: A pandas Series containing the training data for the model.
        - order: A tuple representing the order of the ARIMA model (p, d, q).
        Returns:
        - model: The fitted ARIMA model.
        """
        if self.is_auto_arima:
            logging.info('πŸŽ“ [4.1] The ARIMA Models ...')
            # self.model = pm.auto_arima(y_train,
            model = pm.auto_arima(y_train,
                        d=None,
                        seasonal=False,
                        m=24,                   ## Daily seasonality; or m=168 for weekly seasonality, or m=8760 for annual seasonality.
                        stepwise=True,
                        max_p=None,
                        max_order=None,
                        suppress_warnings=True,
                        error_action="ignore",
                        trace=True)
            logging.debug(model.summary())
        else:
            logging.info('[ArimaTimeSeriesAnalyzer] Set is_auto_arima=true or use the predefined models.')

        return model

    def auto_sarima_model(self, y_train):
        """
        β˜‘οΈ [4] Uses auto_arima from the pmdarima library to find the optimal/best SARIMA model.
        auto_arima can be quite memory-intensive, especially with large datasets

        SARIMA(p,d,q)(P,D,Q)[S] where
        - p is the order of the seasonal AR term: i.e., the number of Lag observations
        - q is the order of the seasonal MA term: essentially the size of the β€œwindow” function over your time series data
        - Q is the order of seasonal differencing
        - S is the seasonal cycle such as 24 for 24 hours.
        
        Parameters:
        - train_data: A pandas Series containing the training data for the model.
        - order: A tuple representing the order of the ARIMA model (p, d, q).
        Returns:
        - model: The fitted SARIMA model.
        """
        if self.is_auto_arima:
            logging.info('πŸŽ“ [4.1] The ARIMA Models ...')
            try:
                print('πŸŽ“ [4.2] The SARIMA Models ... Note: Preliminary check for stationarity, such as an ADF test if needed!')
                # self.sarima_model = pm.auto_arima(y_train,
                #         seasonal=True, ## turn the seasonal to "True"
                #         m=24,          ## Daily seasonality; m=168 for weekly seasonality, or m=8760 for annual seasonality.
                #         ## Use a sane value for max_p, max_q, max_P and max_Q. You don’t need to search too high
                #         # max_p=3, max_q=3, max_P=3, max_Q=3, ## define the maximum order of the seasonal and non-seasonal AR and MA components
                # self.sarima_model = pm.auto_arima(y_train, 
                #                     d=None,                 ## Let the model determine 'd' (non-seasonal differencing)
                #                     seasonal=True,          ## Turn the seasonal to "True", Enable seasonal differencing
                #                     m=24,                   ## 24h Daily seasonality; or m=168 for weekly seasonality, or m=8760 for annual seasonality.
                #                     stepwise=True,          ## Use the stepwise algorithm to find the best model parameters
                #                     # max_p=None,
                #                     # max_order=None,
                #                     max_p=1,                ## Set the maximum 'p' (autoregressive terms) to 2
                #                     max_order=2,            ## Set the maximum sum of p, d, and q to 3
                #                     suppress_warnings=True,
                #                     error_action='ignore',  ## Change to 'ignore' | 'warn' to keep running despite errors | warning
                #                     trace=True)
                self.sarima_model = pm.auto_arima(
                    y_train,
                    d=None,                 ## Let the model determine 'd' (non-seasonal differencing)
                    seasonal=True,          ## Enable seasonal differencing
                    m=24,                   ## 24h Daily seasonality; adjust as needed for your data
                    stepwise=True,          ## Use stepwise algorithm to find the best model parameters
                    # start_p=0,              ## Start value for p
                    # start_q=0,              ## Start value for q
                    # start_P=0,              ## Start value for seasonal P
                    # start_Q=0,              ## Start value for seasonal Q
                    max_p=2,                ## Set the maximum p (autoregressive terms)
                    max_q=2,                ## Set the maximum q (moving average terms)
                    max_P=2,                ## Set the maximum seasonal P (autoregressive terms)
                    max_Q=2,                ## Set the maximum seasonal Q (moving average terms)
                    max_order=None,         ## Maximum order of p+q+P+Q+d+D
                    suppress_warnings=True,
                    error_action='ignore',  ## Change to 'ignore' to keep running despite errors
                    trace=True
                )
                                                  
                print(self.sarima_model.summary())
            except Exception as e:
                print(f"Failed to fit SARIMA model: {e}")
        else:
            logging.info('[ArimaTimeSeriesAnalyzer] Set is_auto_arima=true or use the predefined models.')


    def model_selection(self, models, train_data, test_data, preference='RMSE'):
        """
        βœ… Step 5: Compare multiple ARIMA and SARIMA models based on configurable preference metrics including RMSE, MAE, MAPE, AIC, and BIC.
        
        Args:
        - models (list of tuples): Each tuple contains the model description, and its orders & seasonal orders.
          * order (tuple): Non-seasonal ARIMA order (p, d, q).
          * seasonal_order (tuple, optional): Seasonal order (P, D, Q, m), if the model is seasonal.
        - train_data (pd.Series): The training data for the time series.
        - test_data (pd.Series): The testing data for evaluating model performance.
        - preference (str): Preferred metric for model evaluation.
        
        Returns:
        dict: The best model based on the preferred metric or an informative message if no model fits.
        """
        results = []
        best_model = None
        best_metric = float('inf')
        best_model_instance = None
    
        for name, order, seasonal_order in models:
            try:
                logging.info(f"πŸŽ“ Fitting the ARIMA/SARIMA {name} to the training data...")
                ## Initializing the model with or without seasonal order
                # if seasonal_order:
                #     model = pm.ARIMA(order=order, seasonal_order=seasonal_order, suppress_warnings=True)
                # else:
                #     model = pm.ARIMA(order=order, suppress_warnings=True)
                model = pm.ARIMA(order=order, seasonal_order=seasonal_order, suppress_warnings=True) if seasonal_order else pm.ARIMA(order=order, suppress_warnings=True)
                ## Fit the model to the training data
                model_fit = model.fit(train_data)
                ## Predict using the fitted model on the test data size
                predictions = model_fit.predict(n_periods=len(test_data))
                ## Calculate metrics (RMSE, MAE, and MAPE) for the predictions
                metrics = CommmonUtils.calculate_metrics(test_data, predictions)
                metrics.update({'AIC': model_fit.aic(), 'BIC': model_fit.bic()})
    
                ## Append results to a list for final display: metrics + Model's AIC/BIC
                results.append({**{'Model': name}, **metrics})
                current_metric = metrics[preference]
    
                ## Determine best model based on the best/lowest preferred metric:
                if current_metric < best_metric:
                    best_metric = current_metric
                    best_model = {**{'Model': name}, **metrics}
                    best_model_instance = model_fit
                    logging.debug(f"Updated model found: {best_model} with combined metric: {best_metric:.2f}")
    
            except Exception as e:
                logging.error(f"Error fitting model {name}: {e}")
    
        if not best_model:
            print("No valid models were fitted successfully.")
            return None
    
        ## Output results sorted by the preferred metric
        results.sort(key=lambda x: x[preference])
        for result in results:
            logging.info(f"Model: {result['Model']} - RMSE: {result['RMSE']:.2f}, MAE: {result['MAE']:.2f}, "
                  f"MAPE: {result['MAPE']:.2f}%, AIC: {result['AIC']:.2f}, BIC: {result['BIC']:.2f}")
    
        logging.info(f"\nThe best model based on {preference}: {best_model['Model']} with {preference}: {best_model[preference]:.2f}")
        
        # self.model = best_model_instance
        # logging.info(self.model.summary())
        return best_model_instance
       

    def multi_step_forecast(self, n_periods: int):
        prediction, conf_int = self.model.predict(n_periods=n_periods, return_conf_int=True)
        plt.figure(figsize=(12, 5))
        plt.plot(self.data.index[-100:], self.data[self.target_col].dropna().tail(100), label='Historical')
        plt.plot(pd.date_range(self.data.index[-1], periods=n_periods, freq='H'), prediction, label='Forecast')
        plt.fill_between(pd.date_range(self.data.index[-1], periods=n_periods, freq='H'),
                         conf_int[:, 0], conf_int[:, 1], color='pink', alpha=0.3, label='Confidence Interval')
        plt.legend()
        plt.show()
InΒ [6]:
def plot_residuals(residuals, figsize=(12, 10), dpi=100, color_scheme='skyblue'):
    """
    Plot diagnostic charts for residuals of a time series model to check for normality,
    homoscedasticity, and absence of autocorrelation in residuals.
    
    Parameters:
    - residuals (pd.Series or np.array): Residual values from a time series model.
    - figsize (tuple): Figure size.
    - dpi (int): Dots per inch for the plot resolution.
    - color_scheme (str): Color for the plots.
    """
    ## Check if residuals is a method or a callable, and call it to get the data
    if callable(residuals):
        residuals = residuals()
    
    ## Ensure residuals are a numpy array for compatibility with plotting functions
    if not isinstance(residuals, np.ndarray):
        residuals = np.array(residuals)

    fig, axes = plt.subplots(2, 2, figsize=figsize, dpi=dpi)
    plt.subplots_adjust(hspace=0.4, wspace=0.3)
    
    ## Time series plot of residuals
    axes[0, 0].plot(residuals, color=color_scheme)
    axes[0, 0].set_title('Time-series plot of Residuals over-time')
    axes[0, 0].set_ylabel('Residuals')
    axes[0, 0].axhline(y=0, color='black', linestyle='--')
    
    ## ACF plot for residuals
    sm.graphics.tsa.plot_acf(residuals, ax=axes[0, 1], alpha=0.05, color=color_scheme)
    axes[0, 1].set_title('Autocorrelation/ACF for Residuals')

    ## Histogram with KDE for residuals
    sns.histplot(residuals, kde=True, ax=axes[1, 0], bins=30, color=color_scheme)
    axes[1, 0].set_title('Histogram with KDE for Residuals')
    
    ## Q-Q plot for checking normality
    qqplot(residuals, line='s', ax=axes[1, 1])
    axes[1, 1].set_title('Q-Q Plot for checking Normality')
    
    ## Statistical Summary
    # stats_text = f'Mean: {np.mean(residuals):.2f}\nSD: {np.std(residuals):.2f}\nSkew: {skew(residuals):.2f}\nKurtosis: {kurtosis(residuals):.2f}'
    mean, std_dev = np.mean(residuals), np.std(residuals)
    skewness, kurt = skew(residuals), kurtosis(residuals)
    stats_text = f'Mean: {mean:.2f}\nSD: {std_dev:.2f}\nSkew: {skewness:.2f}\nKurtosis: {kurt:.2f}'
    axes[1, 0].annotate(stats_text, xy=(0.05, 0.95), xycoords='axes fraction', 
                        horizontalalignment='left', verticalalignment='top', 
                        fontsize=10, bbox=dict(boxstyle="round4,pad=0.6", fc="white", ec="black", lw=1))

    plt.show()

Automatic Model SelectionΒΆ

InΒ [7]:
def apply_auto_arima_all_sites(df1, df2, target_cols):
    """
    Apply Auto ARIMA Model Selection across multiple sites and pollutants.
    
    Parameters:
    - df1: DataFrame for Site 1
    - df2: DataFrame for Site 2
    - target_cols: List of target columns (pollutants) for analysis
    
    Returns:
    - models: Dictionary containing fitted models for each site and pollutant
    """
    ## Dictionary to store models for each site and target column
    models = {}

    ## Iterate over each pollutant
    for target_col in target_cols:
        ## Iterate over each site
        for df, site_name in zip([df1, df2], ["Site1-Penrose", "Site2-Takapuna"]):
            print(f'\nπŸ” Applying Auto ARIMA Model Selection for {site_name} ({target_col})')

            ## Initialize the ARIMA analyzer
            analyzer = ArimaTimeSeriesAnalyzer(df, site_name, target_col, is_auto_arima=True)

            ## Prepare data by performing a train-test split
            analyzer.train_test_split(train_size=0.8)

            # ## Visualize time series and lag plots
            # analyzer.timeseries_visualization(lags=15, max_diff_order=1)

            ## Applying Auto ARIMA to find the best model
            model = analyzer.auto_arima_model(analyzer.y_train)

            ## Storing model results
            models[(site_name, target_col)] = model

            ## Print model summary
            if model:
                logging.info(f"βœ… Best model for {site_name} ({target_col}): ARIMA{model.order}")
                logging.info(model.summary())
            else:
                logging.info('❌ Failed to determine a suitable model.')

    return models


target_cols = ['PM2.5', 'PM10']  # List of pollutants to analyze
data_penrose = data_series_df1.copy()  ## the DataFrame for Site1-Penrose
data_takapuna = data_series_df2.copy() ## the DataFrame for Site2-Takapuna

## Call the function with data and target columns
models = apply_auto_arima_all_sites(data_penrose, data_takapuna, target_cols)
2024-05-14 08:56:12,636 - INFO - πŸŽ“ [4.1] The ARIMA Models ...
πŸ” Applying Auto ARIMA Model Selection for Site1-Penrose (PM2.5)
βœ… Step 1: Data Loading & Splitting 80/20
πŸŽ“ Training & Testing Data split (80/20): 13,900 train samples and 3,475 test samples.
Performing stepwise search to minimize aic
 ARIMA(2,1,2)(0,0,0)[0] intercept   : AIC=inf, Time=6.93 sec
 ARIMA(0,1,0)(0,0,0)[0] intercept   : AIC=85246.817, Time=0.18 sec
 ARIMA(1,1,0)(0,0,0)[0] intercept   : AIC=84200.017, Time=0.26 sec
 ARIMA(0,1,1)(0,0,0)[0] intercept   : AIC=83375.633, Time=0.65 sec
 ARIMA(0,1,0)(0,0,0)[0]             : AIC=85244.818, Time=0.09 sec
 ARIMA(1,1,1)(0,0,0)[0] intercept   : AIC=inf, Time=3.08 sec
 ARIMA(0,1,2)(0,0,0)[0] intercept   : AIC=82177.054, Time=2.68 sec
 ARIMA(1,1,2)(0,0,0)[0] intercept   : AIC=inf, Time=8.12 sec
 ARIMA(0,1,3)(0,0,0)[0] intercept   : AIC=81745.972, Time=4.33 sec
 ARIMA(1,1,3)(0,0,0)[0] intercept   : AIC=inf, Time=14.72 sec
 ARIMA(0,1,4)(0,0,0)[0] intercept   : AIC=81522.712, Time=7.43 sec
 ARIMA(1,1,4)(0,0,0)[0] intercept   : AIC=inf, Time=21.19 sec
 ARIMA(0,1,5)(0,0,0)[0] intercept   : AIC=inf, Time=15.27 sec
 ARIMA(1,1,5)(0,0,0)[0] intercept   : AIC=inf, Time=36.91 sec
 ARIMA(0,1,4)(0,0,0)[0]             : AIC=81520.713, Time=1.22 sec
 ARIMA(0,1,3)(0,0,0)[0]             : AIC=81743.973, Time=0.79 sec
 ARIMA(1,1,4)(0,0,0)[0]             : AIC=inf, Time=1.83 sec
 ARIMA(0,1,5)(0,0,0)[0]             : AIC=inf, Time=1.57 sec
 ARIMA(1,1,3)(0,0,0)[0]             : AIC=inf, Time=1.26 sec
2024-05-14 08:58:24,589 - INFO - βœ… Best model for Site1-Penrose (PM2.5): ARIMA(0, 1, 4)
 ARIMA(1,1,5)(0,0,0)[0]             : AIC=inf, Time=3.26 sec

Best model:  ARIMA(0,1,4)(0,0,0)[0]          
Total fit time: 131.811 seconds
2024-05-14 08:58:24,686 - INFO -                                SARIMAX Results                                
==============================================================================
Dep. Variable:                      y   No. Observations:                13900
Model:               SARIMAX(0, 1, 4)   Log Likelihood              -40755.357
Date:                Tue, 14 May 2024   AIC                          81520.713
Time:                        08:58:24   BIC                          81558.411
Sample:                    05-07-2020   HQIC                         81533.268
                         - 12-07-2021                                         
Covariance Type:                  opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ma.L1         -0.5052      0.004   -130.751      0.000      -0.513      -0.498
ma.L2         -0.2455      0.006    -44.035      0.000      -0.256      -0.235
ma.L3         -0.0998      0.009    -11.591      0.000      -0.117      -0.083
ma.L4         -0.1261      0.007    -18.049      0.000      -0.140      -0.112
sigma2        20.6235      0.069    297.137      0.000      20.487      20.760
===================================================================================
Ljung-Box (L1) (Q):                   1.26   Jarque-Bera (JB):            840496.71
Prob(Q):                              0.26   Prob(JB):                         0.00
Heteroskedasticity (H):               1.51   Skew:                             2.37
Prob(H) (two-sided):                  0.00   Kurtosis:                        40.80
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
2024-05-14 08:58:24,689 - INFO - πŸŽ“ [4.1] The ARIMA Models ...
πŸ” Applying Auto ARIMA Model Selection for Site2-Takapuna (PM2.5)
βœ… Step 1: Data Loading & Splitting 80/20
πŸŽ“ Training & Testing Data split (80/20): 13,887 train samples and 3,472 test samples.
Performing stepwise search to minimize aic
 ARIMA(2,1,2)(0,0,0)[0] intercept   : AIC=inf, Time=24.16 sec
 ARIMA(0,1,0)(0,0,0)[0] intercept   : AIC=45019.870, Time=0.18 sec
 ARIMA(1,1,0)(0,0,0)[0] intercept   : AIC=40348.147, Time=0.44 sec
 ARIMA(0,1,1)(0,0,0)[0] intercept   : AIC=37707.821, Time=1.45 sec
 ARIMA(0,1,0)(0,0,0)[0]             : AIC=45017.871, Time=0.19 sec
 ARIMA(1,1,1)(0,0,0)[0] intercept   : AIC=37596.087, Time=2.16 sec
 ARIMA(2,1,1)(0,0,0)[0] intercept   : AIC=37596.544, Time=1.83 sec
 ARIMA(1,1,2)(0,0,0)[0] intercept   : AIC=inf, Time=10.19 sec
 ARIMA(0,1,2)(0,0,0)[0] intercept   : AIC=37595.386, Time=2.19 sec
 ARIMA(0,1,3)(0,0,0)[0] intercept   : AIC=37596.990, Time=4.48 sec
 ARIMA(1,1,3)(0,0,0)[0] intercept   : AIC=inf, Time=22.42 sec
 ARIMA(0,1,2)(0,0,0)[0]             : AIC=37593.386, Time=0.51 sec
 ARIMA(0,1,1)(0,0,0)[0]             : AIC=37705.822, Time=0.40 sec
 ARIMA(1,1,2)(0,0,0)[0]             : AIC=inf, Time=0.94 sec
 ARIMA(0,1,3)(0,0,0)[0]             : AIC=37594.991, Time=0.60 sec
 ARIMA(1,1,1)(0,0,0)[0]             : AIC=37594.087, Time=0.52 sec
2024-05-14 08:59:39,394 - INFO - βœ… Best model for Site2-Takapuna (PM2.5): ARIMA(0, 1, 2)
2024-05-14 08:59:39,489 - INFO -                                SARIMAX Results                                
==============================================================================
Dep. Variable:                      y   No. Observations:                13887
Model:               SARIMAX(0, 1, 2)   Log Likelihood              -18793.693
Date:                Tue, 14 May 2024   AIC                          37593.386
Time:                        08:59:39   BIC                          37616.002
Sample:                    05-07-2020   HQIC                         37600.919
                         - 12-07-2021                                         
Covariance Type:                  opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ma.L1          0.8354      0.002    401.874      0.000       0.831       0.839
ma.L2          0.0906      0.002     36.543      0.000       0.086       0.096
sigma2         0.8771      0.003    347.622      0.000       0.872       0.882
===================================================================================
Ljung-Box (L1) (Q):                   0.00   Jarque-Bera (JB):            792515.25
Prob(Q):                              0.95   Prob(JB):                         0.00
Heteroskedasticity (H):               1.14   Skew:                             0.35
Prob(H) (two-sided):                  0.00   Kurtosis:                        40.00
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
2024-05-14 08:59:39,491 - INFO - πŸŽ“ [4.1] The ARIMA Models ...
 ARIMA(1,1,3)(0,0,0)[0]             : AIC=inf, Time=1.90 sec

Best model:  ARIMA(0,1,2)(0,0,0)[0]          
Total fit time: 74.588 seconds

πŸ” Applying Auto ARIMA Model Selection for Site1-Penrose (PM10)
βœ… Step 1: Data Loading & Splitting 80/20
πŸŽ“ Training & Testing Data split (80/20): 13,900 train samples and 3,475 test samples.
Performing stepwise search to minimize aic
 ARIMA(2,0,2)(0,0,0)[0]             : AIC=inf, Time=2.29 sec
 ARIMA(0,0,0)(0,0,0)[0]             : AIC=115738.588, Time=0.05 sec
 ARIMA(1,0,0)(0,0,0)[0]             : AIC=91231.210, Time=0.14 sec
 ARIMA(0,0,1)(0,0,0)[0]             : AIC=105427.979, Time=0.38 sec
 ARIMA(2,0,0)(0,0,0)[0]             : AIC=89279.111, Time=0.27 sec
 ARIMA(3,0,0)(0,0,0)[0]             : AIC=88829.508, Time=0.38 sec
 ARIMA(4,0,0)(0,0,0)[0]             : AIC=88717.131, Time=0.39 sec
 ARIMA(5,0,0)(0,0,0)[0]             : AIC=88648.671, Time=0.53 sec
 ARIMA(6,0,0)(0,0,0)[0]             : AIC=88583.442, Time=0.60 sec
 ARIMA(7,0,0)(0,0,0)[0]             : AIC=88521.984, Time=0.82 sec
 ARIMA(8,0,0)(0,0,0)[0]             : AIC=88480.799, Time=1.14 sec
 ARIMA(9,0,0)(0,0,0)[0]             : AIC=88449.433, Time=1.19 sec
 ARIMA(10,0,0)(0,0,0)[0]             : AIC=88413.667, Time=1.31 sec
 ARIMA(11,0,0)(0,0,0)[0]             : AIC=inf, Time=1.81 sec
 ARIMA(10,0,1)(0,0,0)[0]             : AIC=inf, Time=10.12 sec
 ARIMA(9,0,1)(0,0,0)[0]             : AIC=inf, Time=8.68 sec
 ARIMA(11,0,1)(0,0,0)[0]             : AIC=inf, Time=11.78 sec
 ARIMA(10,0,0)(0,0,0)[0] intercept   : AIC=87884.782, Time=8.19 sec
 ARIMA(9,0,0)(0,0,0)[0] intercept   : AIC=87885.059, Time=7.37 sec
 ARIMA(11,0,0)(0,0,0)[0] intercept   : AIC=87884.909, Time=11.73 sec
 ARIMA(10,0,1)(0,0,0)[0] intercept   : AIC=87886.886, Time=24.70 sec
 ARIMA(9,0,1)(0,0,0)[0] intercept   : AIC=87887.126, Time=22.25 sec
2024-05-14 09:02:20,067 - INFO - βœ… Best model for Site1-Penrose (PM10): ARIMA(10, 0, 0)
 ARIMA(11,0,1)(0,0,0)[0] intercept   : AIC=87887.532, Time=44.24 sec

Best model:  ARIMA(10,0,0)(0,0,0)[0] intercept
Total fit time: 160.375 seconds
2024-05-14 09:02:20,171 - INFO -                                SARIMAX Results                                
==============================================================================
Dep. Variable:                      y   No. Observations:                13900
Model:              SARIMAX(10, 0, 0)   Log Likelihood              -43930.391
Date:                Tue, 14 May 2024   AIC                          87884.782
Time:                        09:02:20   BIC                          87975.258
Sample:                    05-07-2020   HQIC                         87914.913
                         - 12-07-2021                                         
Covariance Type:                  opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
intercept      3.1601      0.138     22.916      0.000       2.890       3.430
ar.L1          0.4353      0.006     70.585      0.000       0.423       0.447
ar.L2          0.1911      0.008     24.366      0.000       0.176       0.206
ar.L3          0.0855      0.009      9.878      0.000       0.069       0.102
ar.L4          0.0161      0.009      1.717      0.086      -0.002       0.035
ar.L5          0.0018      0.009      0.191      0.849      -0.017       0.020
ar.L6          0.0057      0.009      0.610      0.542      -0.013       0.024
ar.L7          0.0122      0.009      1.302      0.193      -0.006       0.030
ar.L8          0.0057      0.008      0.672      0.501      -0.011       0.022
ar.L9          0.0016      0.009      0.177      0.859      -0.016       0.020
ar.L10         0.0128      0.008      1.530      0.126      -0.004       0.029
sigma2        32.5718      0.241    135.006      0.000      32.099      33.045
===================================================================================
Ljung-Box (L1) (Q):                   0.00   Jarque-Bera (JB):             12022.16
Prob(Q):                              0.99   Prob(JB):                         0.00
Heteroskedasticity (H):               1.21   Skew:                             0.70
Prob(H) (two-sided):                  0.00   Kurtosis:                         7.34
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
2024-05-14 09:02:20,174 - INFO - πŸŽ“ [4.1] The ARIMA Models ...
πŸ” Applying Auto ARIMA Model Selection for Site2-Takapuna (PM10)
βœ… Step 1: Data Loading & Splitting 80/20
πŸŽ“ Training & Testing Data split (80/20): 13,887 train samples and 3,472 test samples.
Performing stepwise search to minimize aic
 ARIMA(2,0,2)(0,0,0)[0]             : AIC=78822.388, Time=1.23 sec
 ARIMA(0,0,0)(0,0,0)[0]             : AIC=112846.654, Time=0.05 sec
 ARIMA(1,0,0)(0,0,0)[0]             : AIC=85707.873, Time=0.15 sec
 ARIMA(0,0,1)(0,0,0)[0]             : AIC=95901.076, Time=0.36 sec
 ARIMA(1,0,2)(0,0,0)[0]             : AIC=80273.049, Time=0.57 sec
 ARIMA(2,0,1)(0,0,0)[0]             : AIC=80774.015, Time=0.57 sec
 ARIMA(3,0,2)(0,0,0)[0]             : AIC=78824.229, Time=1.63 sec
 ARIMA(2,0,3)(0,0,0)[0]             : AIC=78824.188, Time=1.92 sec
 ARIMA(1,0,1)(0,0,0)[0]             : AIC=81175.004, Time=0.40 sec
 ARIMA(1,0,3)(0,0,0)[0]             : AIC=79249.268, Time=1.11 sec
 ARIMA(3,0,1)(0,0,0)[0]             : AIC=80278.950, Time=1.02 sec
 ARIMA(3,0,3)(0,0,0)[0]             : AIC=78825.089, Time=2.04 sec
 ARIMA(2,0,2)(0,0,0)[0] intercept   : AIC=78736.171, Time=12.06 sec
 ARIMA(1,0,2)(0,0,0)[0] intercept   : AIC=78889.262, Time=3.09 sec
 ARIMA(2,0,1)(0,0,0)[0] intercept   : AIC=78892.435, Time=2.96 sec
 ARIMA(3,0,2)(0,0,0)[0] intercept   : AIC=78735.840, Time=15.04 sec
 ARIMA(3,0,1)(0,0,0)[0] intercept   : AIC=78833.654, Time=4.73 sec
 ARIMA(4,0,2)(0,0,0)[0] intercept   : AIC=78757.841, Time=21.56 sec
 ARIMA(3,0,3)(0,0,0)[0] intercept   : AIC=78736.360, Time=22.63 sec
 ARIMA(2,0,3)(0,0,0)[0] intercept   : AIC=78734.785, Time=19.21 sec
 ARIMA(1,0,3)(0,0,0)[0] intercept   : AIC=78786.317, Time=5.41 sec
 ARIMA(2,0,4)(0,0,0)[0] intercept   : AIC=78746.403, Time=27.94 sec
 ARIMA(1,0,4)(0,0,0)[0] intercept   : AIC=78754.753, Time=11.18 sec
 ARIMA(3,0,4)(0,0,0)[0] intercept   : AIC=78737.181, Time=30.55 sec

Best model:  ARIMA(2,0,3)(0,0,0)[0] intercept
Total fit time: 187.483 seconds
2024-05-14 09:05:27,878 - INFO - βœ… Best model for Site2-Takapuna (PM10): ARIMA(2, 0, 3)
2024-05-14 09:05:27,986 - INFO -                                SARIMAX Results                                
==============================================================================
Dep. Variable:                      y   No. Observations:                13887
Model:               SARIMAX(2, 0, 3)   Log Likelihood              -39360.392
Date:                Tue, 14 May 2024   AIC                          78734.785
Time:                        09:05:27   BIC                          78787.556
Sample:                    05-07-2020   HQIC                         78752.360
                         - 12-07-2021                                         
Covariance Type:                  opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
intercept      0.2385      0.043      5.583      0.000       0.155       0.322
ar.L1          1.3873      0.014    102.426      0.000       1.361       1.414
ar.L2         -0.4080      0.011    -38.577      0.000      -0.429      -0.387
ma.L1         -0.0155      0.014     -1.144      0.253      -0.042       0.011
ma.L2         -0.7895      0.008    -95.881      0.000      -0.806      -0.773
ma.L3         -0.0482      0.007     -7.264      0.000      -0.061      -0.035
sigma2        16.9697      0.014   1241.426      0.000      16.943      16.997
===================================================================================
Ljung-Box (L1) (Q):                   0.02   Jarque-Bera (JB):         362991240.28
Prob(Q):                              0.88   Prob(JB):                         0.00
Heteroskedasticity (H):               0.26   Skew:                            16.64
Prob(H) (two-sided):                  0.00   Kurtosis:                       794.34
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
InΒ [Β ]:
def apply_auto_sarima_all_sites(df1, df2, target_cols):
    """
    Apply Auto SARIMA Model Selection across multiple sites and pollutants.
    
    Parameters:
    - df1: DataFrame for Site 1
    - df2: DataFrame for Site 2
    - target_cols: List of target columns (pollutants) for analysis
    
    Returns:
    - models: Dictionary containing fitted models for each site and pollutant
    """
    ## Dictionary to store models for each site and target column
    models = {}

    ## Iterate over each pollutant
    for target_col in target_cols:
        # Iterate over each site
        for df, site_name in zip([df1, df2], ["Site1-Penrose", "Site2-Takapuna"]):
            print(f'\nπŸ” Applying Auto SARIMA Model Selection for {site_name} ({target_col})')

            ## Initialize the ARIMA analyzer
            analyzer = ArimaTimeSeriesAnalyzer(df, site_name, target_col, is_auto_arima=True)

            ## Prepare data by performing a train-test split
            analyzer.train_test_split(train_size=0.8)

            # ## Visualize time series and lag plots
            # analyzer.timeseries_visualization(lags=15, max_diff_order=1)

            ## Applying Auto ARIMA to find the best model
            model = analyzer.auto_sarima_model(analyzer.y_train)

            ## Storing model results
            models[(site_name, target_col)] = model

            ## Print model summary
            if model:
                logging.info(f"βœ… Best model for {site_name} ({target_col}): SARIMA{model.order}")
                logging.info(model.summary())
            else:
                logging.info('❌ Failed to determine a suitable model.')

    return models


target_cols = ['PM2.5', 'PM10']  # List of pollutants to analyze
data_penrose = data_series_df1.copy()  ## the DataFrame for Site1-Penrose
data_takapuna = data_series_df2.copy() ## the DataFrame for Site2-Takapuna

## Call the function with data and target columns
models = apply_auto_sarima_all_sites(data_penrose, data_takapuna, target_cols)
2024-05-21 14:56:10,920 - INFO - πŸŽ“ [4.1] The ARIMA Models ...
πŸ” Applying Auto SARIMA Model Selection for Site1-Penrose (PM2.5)
βœ… Step 1: Data Loading & Splitting 80/20
πŸŽ“ Training & Testing Data split (80/20): 13,900 train samples and 3,475 test samples.
πŸŽ“ [4.2] The SARIMA Models ... Note: Preliminary check for stationarity, such as an ADF test if needed!
Performing stepwise search to minimize aic
 ARIMA(2,1,2)(1,0,1)[24] intercept   : AIC=inf, Time=149.59 sec
 ARIMA(0,1,0)(0,0,0)[24] intercept   : AIC=85246.817, Time=0.24 sec
 ARIMA(1,1,0)(1,0,0)[24] intercept   : AIC=84058.923, Time=18.85 sec
 ARIMA(0,1,1)(0,0,1)[24] intercept   : AIC=83140.927, Time=17.42 sec
 ARIMA(0,1,0)(0,0,0)[24]             : AIC=85244.818, Time=0.12 sec
 ARIMA(0,1,1)(0,0,0)[24] intercept   : AIC=83375.633, Time=0.88 sec
 ARIMA(0,1,1)(1,0,1)[24] intercept   : AIC=inf, Time=104.13 sec
 ARIMA(0,1,1)(0,0,2)[24] intercept   : AIC=82960.450, Time=60.51 sec
 ARIMA(0,1,1)(1,0,2)[24] intercept   : AIC=inf, Time=211.91 sec
 ARIMA(0,1,0)(0,0,2)[24] intercept   : AIC=85106.953, Time=42.82 sec
 ARIMA(1,1,1)(0,0,2)[24] intercept   : AIC=inf, Time=136.34 sec
 ARIMA(0,1,2)(0,0,2)[24] intercept   : AIC=81772.007, Time=86.94 sec
 ARIMA(0,1,2)(0,0,1)[24] intercept   : AIC=81969.054, Time=34.85 sec
 ARIMA(0,1,2)(1,0,2)[24] intercept   : AIC=inf, Time=293.02 sec
 ARIMA(0,1,2)(1,0,1)[24] intercept   : AIC=inf, Time=116.07 sec
 ARIMA(1,1,2)(0,0,2)[24] intercept   : AIC=inf, Time=306.30 sec
 ARIMA(0,1,2)(0,0,2)[24]             : AIC=81770.007, Time=17.64 sec
 ARIMA(0,1,2)(0,0,1)[24]             : AIC=81967.055, Time=4.62 sec

Multi-Steps ForecastingΒΆ

InΒ [8]:
from sklearn.metrics import mean_absolute_percentage_error

def one_step_forecast(analyzer, test):
    """
    Create multi-step forecasts and plots them alongside the test data.
    """
    test_len = len(test)
    ## Create predictions for the future, evaluate on test
    fcast, conf_int = analyzer.model.predict(n_periods=test_len, return_conf_int=True, alpha=0.05)
    # fcast = analyzer.model.predict(n_periods=test_len, return_conf_int=True, alpha=0.05)
    # forecasts = fcast[0]
    # confidence_intervals = fcast[1]
    # confidence_intervals = np.asarray(fcast[1])

    ## Plotting
    plot_it(analyzer.train_data[analyzer.target_col], analyzer.test_data[analyzer.target_col], fcast, conf_int)

def one_period_forecast(model, n_periods=168):
    """
    Forecast multiple periods ahead with confidence intervals.

    Parameters:
    model (ARIMA): The ARIMA model used for forecasting.
    n_periods (int): Number of periods to forecast.

    Returns:
    tuple: A tuple containing forecasts and their confidence intervals.
    """
    ## Create predictions for the future, evaluate on test
    fcast, conf_int = model.predict(n_periods=n_periods, return_conf_int=True, alpha=0.05)
    return fcast.tolist(), np.asarray(conf_int).tolist()
    
    # fcast = model.predict(n_periods=n_periods, return_conf_int=True, alpha=0.05)
    ## fcast is a list of two lists.
    ## The first list is the forecast
    # forecasts = fcast[0].tolist()
    ## The second list is the confidence interval
    # confidence_intervals = fcast[1]
    # return ( forecasts, 
    #          np.asarray(confidence_intervals).tolist()[0])
    
def multi_step_forecast(analyzer, test, period='week'):
    """
    Generate multi-step forecasts and update the model with actual observations for each period/forecast.
    Also plots them alongside the test data.

    Parameters:
    - analyzer (TimeSeriesAnalyzer): An instance containing the ARIMA model and data.
    - test (pd.Series): Test data for the target column.
    - period (str): The period for grouping test data, 'week' for hourly data equals 168 hours.

    Uses dynamic forecasting where the model is updated after each new observation.
    Prints progress and debugging information.
    """
    ## Initialize lists to hold predictions and confidence intervals
    forecasts = []
    confidence_intervals = []

    ## Calculate the number of periods based on the test data length and the defined period length: One week in hours = 24 * 7
    # period_length = 168 if period == 'week' else len(test)  
    ## One week in hours = 24 * 7 = 168; also 24 hours per day
    period_length = 168 if period == 'week' else 24 if period == 'day' else len(test)

    ## Generate predictions and update the model for each period: Loop over the test data in chunks
    for start_idx in range(0, len(test), period_length):
        end_idx = min(start_idx + period_length, len(test))
        current_test = test.iloc[start_idx:end_idx]
        
        ## Perform forecasting for the current period
        fc, conf = one_period_forecast(analyzer.model, n_periods=len(current_test))
        forecasts.extend(fc)
        confidence_intervals.extend(conf)
        # confidence_intervals.append(conf)

        ## Adaptive Forecasting: Update the model with actual observations from the current period
        analyzer.model.update(current_test.values)
        logging.debug(f'Iteration from {start_idx} to {end_idx}: Model updated with actual data with Forecast={fc}, CI={conf}.')

    ## Plot the results: Plotting the forecasts against the actual data
    plot_it(analyzer.train_data[analyzer.target_col], test, forecasts, np.array(confidence_intervals))
 
    ## Plotting the forecasts against the actual data
    # plot_it(analyzer.train_data[analyzer.target_col], analyzer.test_data[analyzer.target_col], fcast, conf_int)
    # plot_it(analyzer.train_data[analyzer.target_col], analyzer.test_data[analyzer.target_col], np.array(forecasts), np.array(confidence_intervals))
    
    
# @staticmethod
def plot_it(train, test, forecasts, confidence_intervals):
    """
    Plot training data, test data, forecasts, and confidence intervals.
    
    Parameters:
    - train (pd.Series): Training data series.
    - test (pd.Series): Test data series.
    - forecasts (list): List of forecasted values.
    - confidence_intervals (array): Array of confidence interval bounds.
    """
    fig, ax = plt.subplots(figsize=(18, 10))

    ## Plot training data
    ax.plot(train.index, train, color='blue', label='Training Data')

    ## Plot actual test data
    ax.plot(test.index, test, color='green', label='Actual Test Data')
    
    ## Plotting Confidence Intervals
    # lower_bounds = [ci[0] for ci in confidence_intervals]
    # upper_bounds = [ci[1] for ci in confidence_intervals]
    ax.fill_between(test.index, confidence_intervals[:, 0], confidence_intervals[:, 1], color='orange', alpha=0.3, label='95% Confidence Interval')

    ## Plot forecasts
    ax.plot(test.index, forecasts, color='red', label='Forecasted Data', marker='o')
    
    ax.set_title(f'Multi-Step Forecast Evaluation: Forecast vs Actuals')
    ax.set_xlabel('Date')
    ax.set_ylabel(f'{train.name}')
    ax.legend()
    plt.grid(True)
    plt.show()

    ## Evaluate and print forecast accuracy
    mape = mean_absolute_percentage_error(test, forecasts)
    # print(f"MAPE: {mape:.2%}")
    # ## Evaluate and print forecast accuracy 
    # mape = mean_absolute_percentage_error(analyzer.test_data, forecasts)
    logging.info(f"MAPE for the forecast: {mape:.2%}")
πŸŽ“ Time Series EDA ...
InΒ [9]:
def Time_Series_EDA(data: pd.DataFrame, site: str, target_col: str, train_size: float = 0.8, is_auto_arima=False, models=None):
    """
    Conduct Time Series EDA, Model Selection, and Forecasting.
    Utilize the ArimaTimeSeriesAnalyzer class to apply analysis to any site and target pollutant.
    """
    analyzer = ArimaTimeSeriesAnalyzer(data, site, target_col, is_auto_arima=is_auto_arima)
    
    logging.info(f'βœ… Step 1: Preparing Data for {site} ({target_col})')
    analyzer.train_test_split(train_size)
    
    logging.info(f'βœ… Step 2: Visualizing Time Series and Lag Plots for {site} ({target_col})')
    analyzer.timeseries_visualization(lags=15, max_diff_order=1)

    logging.info(f'βœ… Step 2*: If data is skewed, transform the data (using a Log or Box-Cox transformation) to stabilise the variance.')

    if is_auto_arima:
        logging.info(f'βœ… Step 3: Applying Auto ARIMA Model Selection for {site} ({target_col})')
        analyzer.auto_arima_model(analyzer.y_train)
    else:
        logging.info(f'βœ… Step 4: Selecting the Best Model Based on AIC')
        analyzer.model = analyzer.model_selection(models, analyzer.y_train, analyzer.y_test, preference='AIC')

    if analyzer.model:
        logging.info(f"Best model based on AIC = {analyzer.model.aic():.2f} \n {analyzer.model.summary()}")
    else:
        logging.info(f"Best model based on AIC = {analyzer.model.aic():.2f} \n {analyzer.model.summary()} ")

    logging.info(f'βœ… Step 5: Checking Model Diagnostics: Generates and Displays Diagnostic Plots for the Fitted Model.')
    logging.info(f'πŸŽ“ 5.1. [Default] 4 Model Diagnostics plots: The Standardized Residual, Histogram plus KDE estimate (N(0,1): the normal distribution), Normal Q-Q, and the Correlogram. ...')
    # analyzer.model.plot_diagnostics(figsize=(15,12))
    analyzer.model.plot_diagnostics(figsize=(10, 8))
    plt.show()
    
    logging.info(f'πŸŽ“ 5.2. [Manual] Check the Residuals by plotting the ACF of the Residuals ...')
    if hasattr(analyzer.model, 'resid'):
        residuals = analyzer.model.resid  ## Get the residuals from an instance of a fitted ARIMA model
        plot_residuals(residuals)         ## Plotting the residuals 

    logging.info(f'βœ… Step 6: Conducting One-Step Forecast')
    one_step_forecast(analyzer, analyzer.test_data[analyzer.target_col])

    logging.info(f'βœ… Step 7.1: Conducting Multi-Step Forecast for Next Weeks')
    multi_step_forecast(analyzer, analyzer.test_data[analyzer.target_col], period='week')
    logging.info(f'βœ… Step 7.2: Conducting Multi-Step Forecast for Next Days')
    multi_step_forecast(analyzer, analyzer.test_data[analyzer.target_col], period='day')

    return analyzer
InΒ [10]:
## βœ… Best ARIMA model for Site1-Penrose (PM2.5): ARIMA(0, 1, 4)
## βœ… Best ARIMA model for Site2-Takapuna (PM2.5): ARIMA(0, 1, 2)
## βœ… Best ARIMA model for Site1-Penrose (PM10): ARIMA(10, 0, 0)
## βœ… Best ARIMA model for Site2-Takapuna (PM10): ARIMA(2, 0, 3)
models_penrose_pm25  = [('ARIMA(0,1,4)', (0, 1, 4), None)]
models_penrose_pm10  = [('ARIMA(0,1,2)', (0, 1, 2), None)]
models_takapuna_pm25 = [('ARIMA(10,0,0)', (10, 0, 0), None)]
models_takapuna_pm10 = [('ARIMA(2,0,3)', (2, 0, 3), None)]

## Penrose Time Series EDA
logging.info(f"πŸ“ˆπŸ“Š Time Series EDA: Utilize the ArimaTimeSeriesAnalyzer class to apply analysis to 'Penrose' site and 'PM2.5' target pollutant. ")
analyzer_penrose_pm25 = Time_Series_EDA(data_series_df1, 'Penrose', 'PM2.5', is_auto_arima=False, models=models_penrose_pm25)
logging.info(f"πŸ“ˆπŸ“Š Time Series EDA: Utilize the ArimaTimeSeriesAnalyzer class to apply analysis to 'Penrose' site and 'PM2.5' target pollutant. ")
analyzer_penrose_pm10 = Time_Series_EDA(data_series_df1, 'Penrose', 'PM10', is_auto_arima=False, models=models_penrose_pm10)

## Takapuna Time Series EDA
logging.info(f"πŸ“ˆπŸ“Š Time Series EDA: Utilize the ArimaTimeSeriesAnalyzer class to apply analysis to 'Penrose' site and 'PM2.5' target pollutant. ")
analyzer_takapuna_pm25 = Time_Series_EDA(data_series_df2, 'Takapuna', 'PM2.5', is_auto_arima=False, models=models_takapuna_pm25)
logging.info(f"πŸ“ˆπŸ“Š Time Series EDA: Utilize the ArimaTimeSeriesAnalyzer class to apply analysis to 'Penrose' site and 'PM2.5' target pollutant. ")
analyzer_takapuna_pm10 = Time_Series_EDA(data_series_df2, 'Takapuna', 'PM10', is_auto_arima=False, models=models_takapuna_pm10)
2024-05-14 09:13:48,120 - INFO - πŸ“ˆπŸ“Š Time Series EDA: Utilize the ArimaTimeSeriesAnalyzer class to apply analysis to 'Penrose' site and 'PM2.5' target pollutant. 
2024-05-14 09:13:48,125 - INFO - βœ… Step 1: Preparing Data for Penrose (PM2.5)
2024-05-14 09:13:48,128 - INFO - βœ… Step 2: Visualizing Time Series and Lag Plots for Penrose (PM2.5)
βœ… Step 1: Data Loading & Splitting 80/20
πŸŽ“ Training & Testing Data split (80/20): 13,900 train samples and 3,475 test samples.
βœ… Step 2: Time Series Visualization

πŸŽ“ [2.1] [Default] Time Series Display ...
No description has been provided for this image
πŸŽ“ [2.1] Plot the original time series: visual sense of trends, seasonality, or irregular patterns.
No description has been provided for this image
πŸŽ“ [2.2] Apply 1 Order Differencing to the time series and visualizes results.
No description has been provided for this image
πŸŽ“ [2.3] Plotting the ACF to suggest possible MA orders.
No description has been provided for this image
πŸŽ“ [2.3] Plotting the PACF to suggest possible AR orders.
No description has been provided for this image
2024-05-14 09:13:55,786 - INFO - βœ… Step 2*: If data is skewed, transform the data (using a Log or Box-Cox transformation) to stabilise the variance.
2024-05-14 09:13:55,788 - INFO - βœ… Step 4: Selecting the Best Model Based on AIC
2024-05-14 09:13:55,806 - INFO - πŸŽ“ Fitting the ARIMA/SARIMA ARIMA(0,1,4) to the training data...
2024-05-14 09:14:06,662 - INFO - Model: ARIMA(0,1,4) - RMSE: 4.58, MAE: 3.53, MAPE: inf%, AIC: 81522.71, BIC: 81567.95
2024-05-14 09:14:06,667 - INFO - 
The best model based on AIC: ARIMA(0,1,4) with AIC: 81522.71
2024-05-14 09:14:06,811 - INFO - Best model based on AIC = 81522.71 
                                SARIMAX Results                                
==============================================================================
Dep. Variable:                      y   No. Observations:                13900
Model:               SARIMAX(0, 1, 4)   Log Likelihood              -40755.356
Date:                Tue, 14 May 2024   AIC                          81522.712
Time:                        09:14:06   BIC                          81567.949
Sample:                    05-07-2020   HQIC                         81537.777
                         - 12-07-2021                                         
Covariance Type:                  opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
intercept  -3.539e-05      0.001     -0.036      0.972      -0.002       0.002
ma.L1         -0.5052      0.004   -130.166      0.000      -0.513      -0.498
ma.L2         -0.2455      0.006    -43.965      0.000      -0.256      -0.235
ma.L3         -0.0997      0.009    -11.585      0.000      -0.117      -0.083
ma.L4         -0.1261      0.007    -18.013      0.000      -0.140      -0.112
sigma2        20.6245      0.071    288.519      0.000      20.484      20.765
===================================================================================
Ljung-Box (L1) (Q):                   1.27   Jarque-Bera (JB):            840531.64
Prob(Q):                              0.26   Prob(JB):                         0.00
Heteroskedasticity (H):               1.51   Skew:                             2.37
Prob(H) (two-sided):                  0.00   Kurtosis:                        40.80
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
2024-05-14 09:14:06,826 - INFO - βœ… Step 5: Checking Model Diagnostics: Generates and Displays Diagnostic Plots for the Fitted Model.
2024-05-14 09:14:06,826 - INFO - πŸŽ“ 5.1. [Default] 4 Model Diagnostics plots: The Standardized Residual, Histogram plus KDE estimate (N(0,1): the normal distribution), Normal Q-Q, and the Correlogram. ...
No description has been provided for this image
2024-05-14 09:14:07,340 - INFO - πŸŽ“ 5.2. [Manual] Check the Residuals by plotting the ACF of the Residuals ...
No description has been provided for this image
2024-05-14 09:14:07,938 - INFO - βœ… Step 6: Conducting One-Step Forecast
No description has been provided for this image
2024-05-14 09:14:08,251 - INFO - MAPE for the forecast: 8479870924121942.00%
2024-05-14 09:14:08,251 - INFO - βœ… Step 7.1: Conducting Multi-Step Forecast for Next Weeks
No description has been provided for this image
2024-05-14 09:16:51,827 - INFO - MAPE for the forecast: 6881842348181038.00%
2024-05-14 09:16:51,827 - INFO - βœ… Step 7.2: Conducting Multi-Step Forecast for Next Days
No description has been provided for this image
2024-05-14 09:28:33,919 - INFO - MAPE for the forecast: 6337679726063511.00%
2024-05-14 09:28:33,919 - INFO - πŸ“ˆπŸ“Š Time Series EDA: Utilize the ArimaTimeSeriesAnalyzer class to apply analysis to 'Penrose' site and 'PM2.5' target pollutant. 
2024-05-14 09:28:33,920 - INFO - βœ… Step 1: Preparing Data for Penrose (PM10)
2024-05-14 09:28:33,920 - INFO - βœ… Step 2: Visualizing Time Series and Lag Plots for Penrose (PM10)
βœ… Step 1: Data Loading & Splitting 80/20
πŸŽ“ Training & Testing Data split (80/20): 13,900 train samples and 3,475 test samples.
βœ… Step 2: Time Series Visualization

πŸŽ“ [2.1] [Default] Time Series Display ...
No description has been provided for this image
πŸŽ“ [2.1] Plot the original time series: visual sense of trends, seasonality, or irregular patterns.
No description has been provided for this image
πŸŽ“ [2.2] Apply 1 Order Differencing to the time series and visualizes results.
No description has been provided for this image
πŸŽ“ [2.3] Plotting the ACF to suggest possible MA orders.
No description has been provided for this image
πŸŽ“ [2.3] Plotting the PACF to suggest possible AR orders.
No description has been provided for this image
2024-05-14 09:28:43,398 - INFO - βœ… Step 2*: If data is skewed, transform the data (using a Log or Box-Cox transformation) to stabilise the variance.
2024-05-14 09:28:43,400 - INFO - βœ… Step 4: Selecting the Best Model Based on AIC
2024-05-14 09:28:43,402 - INFO - πŸŽ“ Fitting the ARIMA/SARIMA ARIMA(0,1,2) to the training data...
2024-05-14 09:28:45,151 - INFO - Model: ARIMA(0,1,2) - RMSE: 11.24, MAE: 8.77, MAPE: inf%, AIC: 88598.79, BIC: 88628.95
2024-05-14 09:28:45,153 - INFO - 
The best model based on AIC: ARIMA(0,1,2) with AIC: 88598.79
2024-05-14 09:28:45,298 - INFO - Best model based on AIC = 88598.79 
                                SARIMAX Results                                
==============================================================================
Dep. Variable:                      y   No. Observations:                13900
Model:               SARIMAX(0, 1, 2)   Log Likelihood              -44295.396
Date:                Tue, 14 May 2024   AIC                          88598.793
Time:                        09:28:45   BIC                          88628.951
Sample:                    05-07-2020   HQIC                         88608.837
                         - 12-07-2021                                         
Covariance Type:                  opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
intercept      0.0004      0.020      0.019      0.985      -0.039       0.040
ma.L1         -0.5236      0.006    -86.317      0.000      -0.535      -0.512
ma.L2         -0.0896      0.007    -12.927      0.000      -0.103      -0.076
sigma2        34.3288      0.244    140.409      0.000      33.850      34.808
===================================================================================
Ljung-Box (L1) (Q):                   0.79   Jarque-Bera (JB):             10409.50
Prob(Q):                              0.38   Prob(JB):                         0.00
Heteroskedasticity (H):               1.24   Skew:                             0.54
Prob(H) (two-sided):                  0.00   Kurtosis:                         7.10
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
2024-05-14 09:28:45,309 - INFO - βœ… Step 5: Checking Model Diagnostics: Generates and Displays Diagnostic Plots for the Fitted Model.
2024-05-14 09:28:45,309 - INFO - πŸŽ“ 5.1. [Default] 4 Model Diagnostics plots: The Standardized Residual, Histogram plus KDE estimate (N(0,1): the normal distribution), Normal Q-Q, and the Correlogram. ...
No description has been provided for this image
2024-05-14 09:28:45,951 - INFO - πŸŽ“ 5.2. [Manual] Check the Residuals by plotting the ACF of the Residuals ...
No description has been provided for this image
2024-05-14 09:28:47,393 - INFO - βœ… Step 6: Conducting One-Step Forecast
No description has been provided for this image
2024-05-14 09:28:47,722 - INFO - MAPE for the forecast: 5461844894654976.00%
2024-05-14 09:28:47,722 - INFO - βœ… Step 7.1: Conducting Multi-Step Forecast for Next Weeks
No description has been provided for this image
2024-05-14 09:29:22,342 - INFO - MAPE for the forecast: 3868016845709311.00%
2024-05-14 09:29:22,342 - INFO - βœ… Step 7.2: Conducting Multi-Step Forecast for Next Days
No description has been provided for this image
2024-05-14 09:33:02,316 - INFO - MAPE for the forecast: 3045492723406247.50%
2024-05-14 09:33:02,316 - INFO - πŸ“ˆπŸ“Š Time Series EDA: Utilize the ArimaTimeSeriesAnalyzer class to apply analysis to 'Penrose' site and 'PM2.5' target pollutant. 
2024-05-14 09:33:02,317 - INFO - βœ… Step 1: Preparing Data for Takapuna (PM2.5)
2024-05-14 09:33:02,317 - INFO - βœ… Step 2: Visualizing Time Series and Lag Plots for Takapuna (PM2.5)
βœ… Step 1: Data Loading & Splitting 80/20
πŸŽ“ Training & Testing Data split (80/20): 13,887 train samples and 3,472 test samples.
βœ… Step 2: Time Series Visualization

πŸŽ“ [2.1] [Default] Time Series Display ...
No description has been provided for this image
πŸŽ“ [2.1] Plot the original time series: visual sense of trends, seasonality, or irregular patterns.
No description has been provided for this image
πŸŽ“ [2.2] Apply 1 Order Differencing to the time series and visualizes results.
No description has been provided for this image
πŸŽ“ [2.3] Plotting the ACF to suggest possible MA orders.
No description has been provided for this image
πŸŽ“ [2.3] Plotting the PACF to suggest possible AR orders.
No description has been provided for this image
2024-05-14 09:33:09,591 - INFO - βœ… Step 2*: If data is skewed, transform the data (using a Log or Box-Cox transformation) to stabilise the variance.
2024-05-14 09:33:09,592 - INFO - βœ… Step 4: Selecting the Best Model Based on AIC
2024-05-14 09:33:09,593 - INFO - πŸŽ“ Fitting the ARIMA/SARIMA ARIMA(10,0,0) to the training data...
2024-05-14 09:33:24,327 - INFO - Model: ARIMA(10,0,0) - RMSE: 2.12, MAE: 1.68, MAPE: 40.74%, AIC: 36614.49, BIC: 36704.95
2024-05-14 09:33:24,328 - INFO - 
The best model based on AIC: ARIMA(10,0,0) with AIC: 36614.49
2024-05-14 09:33:24,470 - INFO - Best model based on AIC = 36614.49 
                                SARIMAX Results                                
==============================================================================
Dep. Variable:                      y   No. Observations:                13887
Model:              SARIMAX(10, 0, 0)   Log Likelihood              -18295.243
Date:                Tue, 14 May 2024   AIC                          36614.485
Time:                        09:33:24   BIC                          36704.950
Sample:                    05-07-2020   HQIC                         36644.614
                         - 12-07-2021                                         
Covariance Type:                  opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
intercept      0.4089      0.018     22.875      0.000       0.374       0.444
ar.L1          1.7644      0.003    703.954      0.000       1.759       1.769
ar.L2         -1.3783      0.005   -262.277      0.000      -1.389      -1.368
ar.L3          0.9730      0.006    150.706      0.000       0.960       0.986
ar.L4         -0.7195      0.007   -104.882      0.000      -0.733      -0.706
ar.L5          0.4857      0.008     64.575      0.000       0.471       0.500
ar.L6         -0.3417      0.009    -39.574      0.000      -0.359      -0.325
ar.L7          0.2382      0.011     22.446      0.000       0.217       0.259
ar.L8         -0.1689      0.011    -14.980      0.000      -0.191      -0.147
ar.L9          0.0983      0.010     10.068      0.000       0.079       0.117
ar.L10        -0.0153      0.005     -2.868      0.004      -0.026      -0.005
sigma2         0.8160      0.003    300.204      0.000       0.811       0.821
===================================================================================
Ljung-Box (L1) (Q):                   0.00   Jarque-Bera (JB):            833488.81
Prob(Q):                              0.97   Prob(JB):                         0.00
Heteroskedasticity (H):               1.17   Skew:                             1.28
Prob(H) (two-sided):                  0.00   Kurtosis:                        40.87
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
2024-05-14 09:33:24,471 - INFO - βœ… Step 5: Checking Model Diagnostics: Generates and Displays Diagnostic Plots for the Fitted Model.
2024-05-14 09:33:24,471 - INFO - πŸŽ“ 5.1. [Default] 4 Model Diagnostics plots: The Standardized Residual, Histogram plus KDE estimate (N(0,1): the normal distribution), Normal Q-Q, and the Correlogram. ...
No description has been provided for this image
2024-05-14 09:33:25,175 - INFO - πŸŽ“ 5.2. [Manual] Check the Residuals by plotting the ACF of the Residuals ...
No description has been provided for this image
2024-05-14 09:33:25,719 - INFO - βœ… Step 6: Conducting One-Step Forecast
No description has been provided for this image
2024-05-14 09:33:26,013 - INFO - MAPE for the forecast: 40.74%
2024-05-14 09:33:26,013 - INFO - βœ… Step 7.1: Conducting Multi-Step Forecast for Next Weeks
No description has been provided for this image
2024-05-14 09:42:56,092 - INFO - MAPE for the forecast: 39.18%
2024-05-14 09:42:56,093 - INFO - βœ… Step 7.2: Conducting Multi-Step Forecast for Next Days
No description has been provided for this image
2024-05-14 10:18:49,871 - INFO - MAPE for the forecast: 31.65%
2024-05-14 10:18:49,872 - INFO - πŸ“ˆπŸ“Š Time Series EDA: Utilize the ArimaTimeSeriesAnalyzer class to apply analysis to 'Penrose' site and 'PM2.5' target pollutant. 
2024-05-14 10:18:49,872 - INFO - βœ… Step 1: Preparing Data for Takapuna (PM10)
2024-05-14 10:18:49,873 - INFO - βœ… Step 2: Visualizing Time Series and Lag Plots for Takapuna (PM10)
βœ… Step 1: Data Loading & Splitting 80/20
πŸŽ“ Training & Testing Data split (80/20): 13,887 train samples and 3,472 test samples.
βœ… Step 2: Time Series Visualization

πŸŽ“ [2.1] [Default] Time Series Display ...
No description has been provided for this image
πŸŽ“ [2.1] Plot the original time series: visual sense of trends, seasonality, or irregular patterns.
No description has been provided for this image
πŸŽ“ [2.2] Apply 1 Order Differencing to the time series and visualizes results.
No description has been provided for this image
πŸŽ“ [2.3] Plotting the ACF to suggest possible MA orders.
No description has been provided for this image
πŸŽ“ [2.3] Plotting the PACF to suggest possible AR orders.
No description has been provided for this image
2024-05-14 10:18:54,773 - INFO - βœ… Step 2*: If data is skewed, transform the data (using a Log or Box-Cox transformation) to stabilise the variance.
2024-05-14 10:18:54,775 - INFO - βœ… Step 4: Selecting the Best Model Based on AIC
2024-05-14 10:18:54,778 - INFO - πŸŽ“ Fitting the ARIMA/SARIMA ARIMA(2,0,3) to the training data...
2024-05-14 10:19:17,273 - INFO - Model: ARIMA(2,0,3) - RMSE: 6.62, MAE: 4.21, MAPE: 87.72%, AIC: 78734.78, BIC: 78787.56
2024-05-14 10:19:17,292 - INFO - 
The best model based on AIC: ARIMA(2,0,3) with AIC: 78734.78
2024-05-14 10:19:17,424 - INFO - Best model based on AIC = 78734.78 
                                SARIMAX Results                                
==============================================================================
Dep. Variable:                      y   No. Observations:                13887
Model:               SARIMAX(2, 0, 3)   Log Likelihood              -39360.392
Date:                Tue, 14 May 2024   AIC                          78734.785
Time:                        10:19:17   BIC                          78787.556
Sample:                    05-07-2020   HQIC                         78752.360
                         - 12-07-2021                                         
Covariance Type:                  opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
intercept      0.2385      0.043      5.583      0.000       0.155       0.322
ar.L1          1.3873      0.014    102.426      0.000       1.361       1.414
ar.L2         -0.4080      0.011    -38.577      0.000      -0.429      -0.387
ma.L1         -0.0155      0.014     -1.144      0.253      -0.042       0.011
ma.L2         -0.7895      0.008    -95.881      0.000      -0.806      -0.773
ma.L3         -0.0482      0.007     -7.264      0.000      -0.061      -0.035
sigma2        16.9697      0.014   1241.426      0.000      16.943      16.997
===================================================================================
Ljung-Box (L1) (Q):                   0.02   Jarque-Bera (JB):         362991240.28
Prob(Q):                              0.88   Prob(JB):                         0.00
Heteroskedasticity (H):               0.26   Skew:                            16.64
Prob(H) (two-sided):                  0.00   Kurtosis:                       794.34
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
2024-05-14 10:19:17,425 - INFO - βœ… Step 5: Checking Model Diagnostics: Generates and Displays Diagnostic Plots for the Fitted Model.
2024-05-14 10:19:17,433 - INFO - πŸŽ“ 5.1. [Default] 4 Model Diagnostics plots: The Standardized Residual, Histogram plus KDE estimate (N(0,1): the normal distribution), Normal Q-Q, and the Correlogram. ...
No description has been provided for this image
2024-05-14 10:19:17,948 - INFO - πŸŽ“ 5.2. [Manual] Check the Residuals by plotting the ACF of the Residuals ...
No description has been provided for this image
2024-05-14 10:19:18,484 - INFO - βœ… Step 6: Conducting One-Step Forecast
No description has been provided for this image
2024-05-14 10:19:18,758 - INFO - MAPE for the forecast: 87.72%
2024-05-14 10:19:18,758 - INFO - βœ… Step 7.1: Conducting Multi-Step Forecast for Next Weeks
No description has been provided for this image
2024-05-14 10:21:15,576 - INFO - MAPE for the forecast: 86.08%
2024-05-14 10:21:15,576 - INFO - βœ… Step 7.2: Conducting Multi-Step Forecast for Next Days
No description has been provided for this image
2024-05-14 10:29:19,279 - INFO - MAPE for the forecast: 76.37%

βš™οΈ Time-Series Analysis & Forecasting with Meta/Facebook's ProphetΒΆ

πŸŽ“ Time Series Forecasting with Meta/Facebook's Prophet
  • Prophet works by decomposing the time series into three main components: trend, seasonality, and holidays.
  • It uses an additive model where non-linear trends fit with yearly and weekly seasonalities, plus holidays.
  • It accommodates seasonality with Fourier series and models holidays as indicator variables.
InΒ [11]:
## Import Prophet class from prophet (fbprophet) library
from prophet import Prophet
from prophet.plot import add_changepoints_to_plot
InΒ [28]:
class AdvancedTimeSeriesForecastingProphet:
    
    def __init__(self, training_data, testing_data, target_variable, changepoint_prior_scale=0.01, interval_width=0.95, daily_seasonality=True, weekly_seasonality=True, yearly_seasonality=True, include_holidays=True, country_code='NZ'):
        """
        Initializes the class with training and testing data, target variable, and configurations for the Prophet model.
        Ensures data includes 'ds' and target_variable correctly named 'y'.

        Args:
            training_data (DataFrame): The training dataset with 'ds' and target_variable columns.
            testing_data (DataFrame): The testing dataset with 'ds' and target_variable columns.
            target_variable (str): The name of the column to forecast.
            changepoint_prior_scale (float): The flexibility of the automatic changepoint selection.
            daily_seasonality (bool): Whether to model daily seasonality.
            yearly_seasonality (bool): Whether to model yearly seasonality.
            interval_width (float): Uncertainty interval width. Set the uncertainty interval to 95% (the Prophet default is 80%)
            
            periods (int): Number of periods to forecast.
            freq (str): Frequency of the forecast (e.g., 'D' for daily).
            include_holidays (bool): If True, includes holiday effects based on the specified country.
            country_code (str): The ISO country code ('US') for including holidays.

        Returns:
            model (Prophet object): The fitted Prophet model.        
        """
        # self.validate_data(training_data, testing_data, target_variable)
        # self.training_data = training_data.rename(columns={target_variable: 'y', 'ds': 'ds'})
        # self.testing_data = testing_data.rename(columns={target_variable: 'y', 'ds': 'ds'})
        self.training_data = training_data.reset_index()[['ds', target_variable]].rename(columns={target_variable: 'y'}) ## target_variable --> 'y'
        self.testing_data = testing_data.reset_index()[['ds', target_variable]].rename(columns={target_variable: 'y'})
        self.target_variable = target_variable
        ## Initialize and configure the model in Prophet with the default parameters
        ## Set the uncertainty interval to 95% (the Prophet default is 80%)
        # self.model = Prophet(n_changepoints=20, yearly_seasonality=True, changepoint_prior_scale=0.001)
        # self.model = Prophet(changepoint_prior_scale=0.004, seasonality_prior_scale=20, seasonality_mode='multiplicative', changepoint_range=0.9)
        self.model = Prophet(
            changepoint_prior_scale=changepoint_prior_scale, 
            daily_seasonality=daily_seasonality, 
            weekly_seasonality=weekly_seasonality,
            yearly_seasonality=yearly_seasonality,
            interval_width=interval_width)
        if include_holidays:
            self.model.add_country_holidays(country_name=country_code)  ## 'US' | 'NZ'
    
        self.forecast_df = None


    # def validate_data(self, training_data, testing_data, target_variable):
    #     """
    #     Validates the input data to ensure required columns are present.
    #     """
    #     required_columns = ['ds', target_variable]
    #     for data in [training_data, testing_data]:
    #         if not all(col in data.columns for col in required_columns):
    #             raise ValueError(f"Data must include the following columns: {required_columns}")

    @staticmethod
    def train_test_split(data, target_col='PM2.5', train_size=0.8):
        """
        β˜‘οΈ [1] Train-Test split: Splits the data into training and testing sets.
        Use 80% as the in-time training data and the rest 20% as the out-of-time test data.
        :param train_size: float - proportion of the dataset to include in the train split.
        """
        train_len  = int(len(data) * train_size)
        train_data = data.iloc[:train_len]
        test_data  = data.iloc[train_len:]
        print(f'πŸŽ“ [{target_col}] Training & Testing Data split (80/20): {len(train_data):,} train samples and {len(test_data):,} test samples.')
        # y = data[target_col]           ## Compatible with Prophet: ds & y columns
        y_train = train_data[target_col]
        y_test  = test_data[target_col]

        return y_train, y_test
    
    
    ## Create a future dataframe for predictions, 24 hours a day | 168 hours a week into the future
    ## freq='MS' | 'D' | 'H'
    def fit(self, periods=7*24, freq='H'):
        """
        Fits the Prophet model using the training data.
        Generates predictions over the forecast horizon.

        Returns:
            forecast (DataFrame): The forecast results from Prophet.
            periods: We need to specify the number of days in future
        """
        ## Fit the model to the data
        self.model.fit(self.training_data)

        ## Making Future Predictions
        future = self.model.make_future_dataframe(periods=len(self.testing_data), freq=freq)
        ## Predicting the next quarter 120 days | 365 days (1 year) | period = 24 for daily seasonality; periods=24*7, freq='H')
        # future = self.model.make_future_dataframe(periods=periods, freq=freq)  
        
        self.forecast_df = self.model.predict(future)

        ## DEBUG
        ## Display the most critical output columns from the forecast
        # self.forecast_df[['ds','yhat','yhat_lower','yhat_upper']].tail()
        
        return self.forecast_df


    def validate(self, site_name='Penrose'):
        """
        Plots the forecasted values with uncertainty intervals alongside the actual testing data. 
        
        Validates the Forecasting Model by plotting the forecast with confidence intervals (highlights changepoints and the training data end) against actual values for a specified site. 
        Also shows components of the forecast, calculates MAE.
        """
        if self.forecast_df is None:
            raise ValueError("Forecast data is not available for validation.")

        ## DEBUG
        # sns.boxplot(x=self.training_data.ds, y=self.training_data.y)

        ## Extract the predicted and actual values for the visualization
        ## Ensure that forecasted data and actual data share the same index for proper comparison
        # forecast = self.forecast_df.set_index('ds')['yhat']
        forecast = self.forecast_df
        # actual = self.testing_data.set_index('ds')['y']
        actual_testing_data = self.testing_data
        last_training_date = self.training_data['ds'].max()

        ## Calculate Mean Absolute Error (MAE) only for overlapping dates
        # valid_indices = ~forecasted.isna()
        # mae = mean_absolute_error(actual[valid_indices], forecasted[valid_indices])
        # print(f"Mean Absolute Error (MAE): {mae:.2f}")

        ## [Plot 1] Plot the forecast: Shows the actual testing values, forecasted values from the fitted Model with the uncertainty intervals.
        # forecast_plot = self.model.plot(forecast, uncertainty=False, plot_cap=True)
        forecast_plot = self.model.plot(forecast, uncertainty=True)
        
        ## Adding changepoints and training end visual markers
        a = add_changepoints_to_plot(forecast_plot.gca(), self.model, forecast, cp_color='red')
        ## add a vertical line at the end of the training period
        axes = forecast_plot.gca()
        # last_training_date = pd.to_datetime('2021-12-31')
        axes.axvline(x=last_training_date, color='darkred', linestyle='solid', label='Training End') ## linestyle='--' | 'solid'
        
        ## plot true test data for the period after the red line
        actual_testing_data['ds'] = pd.to_datetime(actual_testing_data['ds'])
        plt.plot(actual_testing_data['ds'], actual_testing_data['y'],'ro', markersize=0.2, label='Actual Test Data')
        # actual = actual_testing_data.set_index('ds')['y']

        axes.set_title(f'[Prophet][{site_name}] Forecasting vs Actual Testing Data for {self.target_variable}')
        axes.set_xlabel('Date (ds)')
        axes.set_ylabel(f'{self.target_variable} (y)')
        axes.legend()
        axes.grid(True)
        plt.show()

        ## [Plot 2] Decomposes the forecast into trend, yearly seasonality, and weekly seasonality components.
        if hasattr(self.model, 'plot_components'):
            # fig2 = self.model.plot_components(self.forecast_df, uncertainty=False, plot_cap=True).show()
            fig2 = self.model.plot_components(self.forecast_df, uncertainty=True).show()
        
        ## Future Work:
        ## Change points: By default, Prophet automatically detects change points in the data. You can manually adjust their sensitivity or specify them.
        ## Seasonalities: Apart from built-in daily, weekly, and yearly seasonalities, you can add custom seasonalities (e.g., monthly).
        ## Holidays: You can add holidays and special events to improve the model’s accuracy on irregular occurrences.
InΒ [29]:
## DataFrame must have 'Site', 'ds', and other columns like 'PM2.5'
## Initialize instances for Penrose and Takapuna for PM2.5 prediction using Meta/Facebook Prophet

## Prepare data by performing a train-test split
training_data1, testing_data1 = AdvancedTimeSeriesForecastingProphet.train_test_split(data=data_series_df1, target_col='PM2.5', train_size=0.8)
forecasting_prophet_penrose   = AdvancedTimeSeriesForecastingProphet(training_data=training_data1, testing_data=testing_data1, target_variable='PM2.5')

## Fitting models
logging.info("πŸ› οΈ Fitting Prophet models for Penrose ...")
forecast = forecasting_prophet_penrose.fit()

## Validating models
logging.info("πŸ”¬ Validating Prophet models for Penrose ...")
forecasting_prophet_penrose.validate(site_name='Penrose')
2024-05-14 11:44:15,004 - INFO - πŸ› οΈ Fitting Prophet models for Penrose ...
πŸŽ“ [PM2.5] Training & Testing Data split (80/20): 13,900 train samples and 3,475 test samples.
2024-05-14 11:44:18,105 - INFO - πŸ”¬ Validating Prophet models for Penrose ...
No description has been provided for this image
No description has been provided for this image
InΒ [30]:
## Prepare data by performing a train-test split
training_data2, testing_data2 = AdvancedTimeSeriesForecastingProphet.train_test_split(data=data_series_df2, target_col='PM2.5', train_size=0.8)
forecasting_prophet_takapuna  = AdvancedTimeSeriesForecastingProphet(training_data2, testing_data2, 'PM2.5')

## Fitting models
logging.info("πŸ› οΈ Fitting Prophet models for Takapuna ...")
forecasting_prophet_takapuna.fit()

logging.info("πŸ”¬ Validating Prophet models for Takapuna ...")
forecasting_prophet_takapuna.validate(site_name='Takapuna')
2024-05-14 11:44:26,085 - INFO - πŸ› οΈ Fitting Prophet models for Takapuna ...
πŸŽ“ [PM2.5] Training & Testing Data split (80/20): 13,887 train samples and 3,472 test samples.
2024-05-14 11:44:29,337 - INFO - πŸ”¬ Validating Prophet models for Takapuna ...
No description has been provided for this image
No description has been provided for this image
InΒ [32]:
## DataFrame must have 'Site', 'ds', and other columns like 'PM10'
## Creating instances for Penrose and Takapuna for PM10 prediction
training_data1, testing_data1 = AdvancedTimeSeriesForecastingProphet.train_test_split(data=data_series_df1, target_col='PM10', train_size=0.8)
forecasting_prophet_penrose   = AdvancedTimeSeriesForecastingProphet(training_data=training_data1, testing_data=testing_data1, target_variable='PM10')

## DEBUG
# forecasting_prophet_penrose.forecast_df[['ds','yhat','yhat_lower','yhat_upper']].head()

## Fitting models
logging.info("πŸ› οΈ Fitting Prophet models for Penrose ...")
forecasting_prophet_penrose.fit()

## Validating models
logging.info("πŸ”¬ Validating Prophet models for Penrose ...")
forecasting_prophet_penrose.validate(site_name='Penrose')
2024-05-14 11:48:21,695 - INFO - πŸ› οΈ Fitting Prophet models for Penrose ...
πŸŽ“ [PM10] Training & Testing Data split (80/20): 13,900 train samples and 3,475 test samples.
2024-05-14 11:48:25,018 - INFO - πŸ”¬ Validating Prophet models for Penrose ...
No description has been provided for this image
No description has been provided for this image
InΒ [33]:
## DataFrame must have 'Site', 'ds', and other columns like 'PM10'
## Creating instances for Penrose and Takapuna for PM10 prediction
training_data2, testing_data2 = AdvancedTimeSeriesForecastingProphet.train_test_split(data=data_series_df2, target_col='PM10', train_size=0.8)
forecasting_prophet_takapuna  = AdvancedTimeSeriesForecastingProphet(training_data=training_data2, testing_data=testing_data2, target_variable='PM10')

## Fitting models
logging.info("πŸ› οΈ Fitting Prophet models for Takapuna ...")
forecasting_prophet_takapuna.fit()

## Validating models
logging.info("πŸ”¬ Validating Prophet models for Takapuna ...")
forecasting_prophet_takapuna.validate(site_name='Takapuna')
2024-05-14 11:48:32,781 - INFO - πŸ› οΈ Fitting Prophet models for Takapuna ...
πŸŽ“ [PM10] Training & Testing Data split (80/20): 13,887 train samples and 3,472 test samples.
2024-05-14 11:48:36,309 - INFO - πŸ”¬ Validating Prophet models for Takapuna ...
No description has been provided for this image
No description has been provided for this image
TODO: πŸŽ“ Time Series Forecasting with Facebook's Prophet - Executive Summary HERE ...

πŸ”§ Feature SelectionΒΆ

πŸŽ“ Feature Engineering & Feature Selection
InΒ [39]:
def engineer_features(rawdata):
    """
    Adds time-based features to the dataframe to enhance the analysis capabilities, adjusting for the multi-level index.
    This includes extracting various temporal components and creating lag features for PM2.5 and PM10 variables to analyze time-based dependencies.
    """
    ## Extracting date-time components
    # timestamp_index = rawdata.index.get_level_values('Timestamp') ## 'Timestamp' <-- 'ds'
    timestamp_index = rawdata['Timestamp']
    # rawdata['Hour']       = timestamp_index.dt.hour
    rawdata['Day']        = timestamp_index.dt.day
    rawdata['DayOfWeek']  = timestamp_index.dt.dayofweek
    rawdata['Month']      = timestamp_index.dt.month
    rawdata['Quarter']    = timestamp_index.dt.quarter
    rawdata['Year']       = timestamp_index.dt.year

    # rawdata['DayOfYear']  = timestamp_index.dt.dayofyear
    ## Extract week of year for each timestamp
    rawdata['WeekOfYear'] = [d.isocalendar()[1] for d in timestamp_index]
    
    ## Calculating the season based on the month --> accurately reflects the local climate and seasonal cycles
    ## Season encoding: 1 (Summer): December, January, February; 2 (Autumn): March, April, May
    ##                  3 (Winter): June, July, August         ; 4 (Spring): September, October, November
    ## Adjusted for meteorological seasons in Auckland, New Zealand:  
    # rawdata['Season'] = (((timestamp_index.dt.month % 12) + 1) // 3) % 4 + 1 
    ## TODO: Southern Hemisphere like New Zealand and Australia vs Northern Hemisphere like England
    ## Correctly mapping the month to meteorological seasons for Southern Hemisphere (Auckland)
    rawdata['Season'] = rawdata['Month'].apply(
        lambda x: 1 if 9 <= x <= 11 else       ## Spring: Sep, Oct, Nov
                  2 if 12 <= x or x <= 2 else  ## Summer: Dec, Jan, Feb
                  3 if 3 <= x <= 5 else        ## Autumn: Mar, Apr, May
                  4                            ## Winter: Jun, Jul, Aug
    )

    ## Adding lag features for PM2.5 and PM10 to capture previous time steps' influence --> to compare the correlation with the other variables.
    rawdata['PM2.5_Lag1'] = rawdata.groupby('Site')['PM2.5'].shift(1)
    rawdata['PM2.5_Lag2'] = rawdata.groupby('Site')['PM2.5'].shift(2)
    rawdata['PM10_Lag1']  = rawdata.groupby('Site')['PM10'].shift(1)
    rawdata['PM10_Lag2']  = rawdata.groupby('Site')['PM10'].shift(2)
    ## Fill NaN with the next value in the column (backward fill)
    rawdata.fillna(method='bfill', inplace=True)
    ## Alternatively, fill NaN with the previous value in the column (forward fill)
    rawdata.fillna(method='ffill', inplace=True)
    
    return rawdata
InΒ [45]:
## Apply the cleaning and querying function
# cleaned_data = rawdata.copy()
cleaned_data = engineer_features(rawdata.copy())

# if IS_TEST_DEV:
cleaned_data[['Timestamp', 'Year', 'Quarter','Season', 'Month', 'WeekOfYear', 'Day', 'DayOfWeek']].head()

# cleaned_data.columns
Out[45]:
Timestamp Year Quarter Season Month WeekOfYear Day DayOfWeek
0 2020-05-07 00:00:00 2020 2 3 5 19 7 3
1 2020-05-07 01:00:00 2020 2 3 5 19 7 3
2 2020-05-07 02:00:00 2020 2 3 5 19 7 3
3 2020-05-07 03:00:00 2020 2 3 5 19 7 3
4 2020-05-07 04:00:00 2020 2 3 5 19 7 3
InΒ [46]:
## 'Timestamp' as the only Ordinal Attribute/Column given its nature order in time-series data
ordinal_columns = ['Timestamp']

print("\nπŸŽ“ Describing the types of each attribute as numerical_columns (Continuous), ordinal_columns (Ordinal), or nominal_columns (Nominal) ...")
cleaned_numerical_columns, cleaned_nominal_columns = DescriptiveStatistics.describe_data(cleaned_data, ordinal_columns)

# targets = ['PM2.5', 'PM10', 'SO2']
# features = [feature for feature in df.columns if feature not in targets]
πŸŽ“ Describing the types of each attribute as numerical_columns (Continuous), ordinal_columns (Ordinal), or nominal_columns (Nominal) ...

ℹ️ Numerical Variables/Features: 
 ['AQI', 'PM10', 'PM2.5', 'SO2', 'NO', 'NO2', 'NOx', 'Wind_Speed', 'Wind_Dir', 'Air_Temp', 'Rel_Humidity', 'Day', 'DayOfWeek', 'Month', 'Quarter', 'Year', 'WeekOfYear', 'Season', 'PM2.5_Lag1', 'PM2.5_Lag2', 'PM10_Lag1', 'PM10_Lag2']

ℹ️ Ordinal Variables/Features: 
 ['Timestamp']

ℹ️ Nominal Variables/Features: 
 Index(['Site'], dtype='object')
InΒ [47]:
cleaned_data_site1 = cleaned_data[cleaned_data['Site'] == 'Penrose']
cleaned_data_site2 = cleaned_data[cleaned_data['Site'] == 'Takapuna']

# if IS_DEBUG:
print("\nπŸŽ“ [Site1 - Penrose]  Summary Statistics of the {site1} cleaned_data_site1 Dataframe such as the mean, max/minimum values ...")
cleaned_data_site1.describe()    
πŸŽ“ [Site1 - Penrose]  Summary Statistics of the {site1} cleaned_data_site1 Dataframe such as the mean, max/minimum values ...
Out[47]:
Timestamp AQI PM10 PM2.5 SO2 NO NO2 NOx Wind_Speed Wind_Dir ... DayOfWeek Month Quarter Year WeekOfYear Season PM2.5_Lag1 PM2.5_Lag2 PM10_Lag1 PM10_Lag2
count 17375 17375.000000 17375.000000 17375.000000 17375.000000 17375.000000 17375.000000 17375.000000 17375.000000 17375.000000 ... 17375.000000 17375.000000 17375.000000 17375.000000 17375.000000 17375.000000 17375.000000 17375.000000 17375.000000 17375.000000
mean 2021-05-03 23:07:19.873381376 28.233712 13.886201 5.400065 0.846799 9.871754 13.973526 23.837744 2.805975 173.193482 ... 3.004259 6.538935 2.513899 2020.835568 26.883856 2.502791 5.400681 5.401262 13.887018 13.887496
min 2020-05-07 00:00:00 0.000000 -6.400000 -12.300000 -4.400000 -1.900000 -1.400000 -1.300000 0.200000 1.000000 ... 0.000000 1.000000 1.000000 2020.000000 1.000000 1.000000 -12.300000 -12.300000 -6.400000 -6.400000
25% 2020-11-03 23:30:00 22.000000 8.500000 2.200000 0.100000 0.900000 4.100000 5.300000 1.600000 74.000000 ... 1.000000 4.000000 2.000000 2020.000000 13.000000 1.000000 2.200000 2.200000 8.500000 8.500000
50% 2021-05-03 23:00:00 27.000000 13.100000 4.700000 0.500000 3.700000 10.500000 14.800000 2.600000 210.000000 ... 3.000000 7.000000 3.000000 2021.000000 27.000000 3.000000 4.700000 4.700000 13.100000 13.100000
75% 2021-10-31 22:30:00 33.000000 18.300000 7.700000 1.000000 10.700000 20.300000 31.100000 3.800000 235.000000 ... 5.000000 10.000000 4.000000 2021.000000 40.000000 4.000000 7.700000 7.700000 18.300000 18.300000
max 2022-04-30 23:00:00 121.000000 196.700000 138.400000 25.200000 378.400000 83.100000 455.100000 8.200000 357.000000 ... 6.000000 12.000000 4.000000 2022.000000 53.000000 4.000000 138.400000 138.400000 196.700000 196.700000
std NaN 9.608861 7.958464 5.204060 1.482960 20.526086 12.520003 30.254179 1.533803 93.604420 ... 1.997634 3.459325 1.120423 0.684702 15.234194 1.123019 5.204694 5.205324 7.958003 7.957911

8 rows Γ— 23 columns

InΒ [48]:
# if IS_DEBUG:
print("\nπŸŽ“ [Site2 - Takapuna]  Summary Statistics of the {site2} cleaned_data_site2 Dataframe such as the mean, max/minimum values ...")
cleaned_data_site2.describe()
πŸŽ“ [Site2 - Takapuna]  Summary Statistics of the {site2} cleaned_data_site2 Dataframe such as the mean, max/minimum values ...
Out[48]:
Timestamp AQI PM10 PM2.5 SO2 NO NO2 NOx Wind_Speed Wind_Dir ... DayOfWeek Month Quarter Year WeekOfYear Season PM2.5_Lag1 PM2.5_Lag2 PM10_Lag1 PM10_Lag2
count 17359 17359.000000 17359.000000 17359.000000 17359.0 17359.000000 17359.000000 17359.000000 17359.000000 17359.000000 ... 17359.000000 17359.000000 17359.000000 17359.000000 17359.000000 17359.000000 17359.000000 17359.000000 17359.000000 17359.000000
mean 2021-05-04 08:00:00 25.152947 11.562504 6.237181 0.5 11.637256 0.012103 23.521597 2.497209 176.424542 ... 3.004148 6.540181 2.514315 2020.836454 26.890316 2.502275 6.237259 6.237348 11.562559 11.562660
min 2020-05-07 17:00:00 -13.000000 -3.800000 0.450000 0.5 -0.800000 -0.000500 -0.700000 0.200000 5.500000 ... 0.000000 1.000000 1.000000 2020.000000 1.000000 1.000000 0.450000 0.450000 -3.800000 -3.800000
25% 2020-11-04 12:30:00 19.000000 7.468750 4.300000 0.5 2.350000 0.003950 6.912500 1.250000 76.500000 ... 1.000000 4.000000 2.000000 2020.000000 13.000000 1.000000 4.300000 4.300000 7.468750 7.468750
50% 2021-05-04 08:00:00 24.000000 10.650000 5.550000 0.5 5.700000 0.008800 14.700000 2.300000 203.500000 ... 3.000000 7.000000 3.000000 2021.000000 27.000000 2.000000 5.550000 5.550000 10.650000 10.650000
75% 2021-11-01 03:30:00 30.000000 14.400000 7.300000 0.5 11.900000 0.017100 28.500000 3.500000 260.000000 ... 5.000000 10.000000 4.000000 2021.000000 40.000000 4.000000 7.300000 7.300000 14.400000 14.400000
max 2022-04-30 23:00:00 318.000000 437.250000 52.300000 0.5 352.450000 0.072100 401.950000 7.800000 346.500000 ... 6.000000 12.000000 4.000000 2022.000000 53.000000 4.000000 52.300000 52.300000 437.250000 437.250000
std NaN 10.637321 7.778661 3.401947 0.0 21.970397 0.010751 29.903734 1.455767 92.268417 ... 1.998613 3.460755 1.120882 0.684574 15.240120 1.123436 3.401884 3.401809 7.778618 7.778534

8 rows Γ— 23 columns

InΒ [49]:
def get_top_correlated_features(data, target, num_features=12):
    """
    Calculates and returns the top N features most correlated with the target variable.
    
    Parameters:
    - data (DataFrame): The dataset containing numerical features.
    - target (str): The target variable for which correlations are calculated.
    - num_features (int): The number of top features to return.
    
    Returns:
    - List of top N correlated features.
    """
    ## Ensure data does not contain infinite or NaN values
    data = data.select_dtypes(include=[np.number]).dropna()
    
    ## Calculate the Pearson correlation matrix
    correlation_matrix = data.corr(method='pearson')
    
    ## Calculate absolute correlation with the target variable
    abs_corr_with_target = abs(correlation_matrix[target])
    
    ## Exclude the target variable and lag variables from the series
    features_to_exclude = [target] + [f"{target}_Lag1", f"{target}_Lag2"]
    abs_corr_with_target = abs_corr_with_target.drop(features_to_exclude, errors='ignore')
    
    ## Get the top N highly correlated features
    top_features = abs_corr_with_target.nlargest(num_features).index.tolist()
    
    return top_features
InΒ [63]:
## Analyzing for both sites
top_features_data1 = get_top_correlated_features(cleaned_data_site1, 'PM2.5')
top_features_data2 = get_top_correlated_features(cleaned_data_site2, 'PM2.5')

logging.info("\n🌟 Top 5 features highly correlated with PM2.5 in Penrose: %s\n", top_features_data1)
logging.info("\n🌟 Top 5 features highly correlated with PM2.5 in Takapuna: %s\n", top_features_data2)
2024-05-14 12:04:07,853 - INFO - 
🌟 Top 5 features highly correlated with PM2.5 in Penrose: ['PM10', 'PM10_Lag1', 'NOx', 'NO', 'PM10_Lag2', 'AQI', 'NO2', 'SO2', 'Wind_Dir', 'Air_Temp', 'Season', 'Wind_Speed']

2024-05-14 12:04:07,853 - INFO - 
🌟 Top 5 features highly correlated with PM2.5 in Takapuna: ['PM10', 'PM10_Lag1', 'AQI', 'PM10_Lag2', 'NO', 'NOx', 'Air_Temp', 'NO2', 'Rel_Humidity', 'Wind_Dir', 'Quarter', 'Month']

InΒ [64]:
## Analyzing for both sites
top_features_data21 = get_top_correlated_features(cleaned_data_site1, 'PM10')
top_features_data22 = get_top_correlated_features(cleaned_data_site2, 'PM10')

logging.info("\n🌟 Top 5 features highly correlated with PM10 in Penrose: %s\n", top_features_data21)
logging.info("\n🌟 Top 5 features highly correlated with PM10 in Takapuna: %s\n", top_features_data22)
2024-05-14 12:04:22,621 - INFO - 
🌟 Top 5 features highly correlated with PM10 in Penrose: ['AQI', 'PM2.5', 'PM2.5_Lag1', 'PM2.5_Lag2', 'NO', 'NOx', 'Wind_Speed', 'Rel_Humidity', 'SO2', 'NO2', 'Air_Temp', 'Wind_Dir']

2024-05-14 12:04:22,621 - INFO - 
🌟 Top 5 features highly correlated with PM10 in Takapuna: ['PM2.5', 'PM2.5_Lag1', 'PM2.5_Lag2', 'AQI', 'NO', 'NOx', 'Wind_Speed', 'NO2', 'Wind_Dir', 'Day', 'Rel_Humidity', 'Quarter']

InΒ [65]:
# import pandas as pd
# import matplotlib.pyplot as plt
# 
# # Assuming rawdata and cleaned_data are pandas DataFrames loaded appropriately
# # Convert 'Timestamp' to datetime and set as index
# rawdata['Timestamp'] = pd.to_datetime(rawdata['Timestamp'])
# cleaned_data['Timestamp'] = pd.to_datetime(cleaned_data['Timestamp'])
# 
# rawdata.set_index('Timestamp', inplace=True)
# cleaned_data.set_index('Timestamp', inplace=True)

def plot_pm_variation(data, feature, title, ax):
    """
    Helper function to plot PM2.5/PM10 variations on a given Axes object.
    
    Parameters:
    - data (DataFrame): The dataset to plot.
    - title (str): The title of the plot.
    - ax (Axes): The matplotlib Axes object where the plot will be drawn.
    """
    ax.plot(data[feature], label=f'{feature}')
    ax.set_xlabel('Date')
    ax.set_ylabel(f'{feature} Concentration')
    ax.set_title(title)
    ax.legend()
InΒ [66]:
fig, axs = plt.subplots(2, 2, figsize=(20, 12))  # Create a grid of 2x2 for Penrose and Takapuna

## Plotting
plot_pm_variation(rawdata_site1, 'PM2.5','Penrose - PM2.5 Variation over Time (Raw Data)', axs[0, 0])
plot_pm_variation(cleaned_data_site1, 'PM2.5','Penrose - PM2.5 Variation over Time (Cleaned Data)', axs[0, 1])
plot_pm_variation(rawdata_site2, 'PM2.5','Takapuna - PM2.5 Variation over Time (Raw Data)', axs[1, 0])
plot_pm_variation(cleaned_data_site2, 'PM2.5','Takapuna - PM2.5 Variation over Time (Cleaned Data)', axs[1, 1])

plt.tight_layout()
plt.show()
No description has been provided for this image
InΒ [67]:
# ## 'cleaned_data_site1' and 'cleaned_data_site2' are preloaded DataFrames for Penrose and Takapuna, respectively
# ## Convert 'Timestamp' to datetime and set as index
# cleaned_data_site1['Timestamp'] = pd.to_datetime(cleaned_data_site1['Timestamp'])
# cleaned_data_site2['Timestamp'] = pd.to_datetime(cleaned_data_site2['Timestamp'])
# 
# cleaned_data_site1.set_index('Timestamp', inplace=True)
# cleaned_data_site2.set_index('Timestamp', inplace=True)

## [DEBUG] Top 5 features - this should be dynamically determined from a correlation analysis
# top_features_data1 = ['AQI', 'PM10', 'NOx', 'Wind_Speed', 'Air_Temp']
# top_features_data2 = ['PM2.5', 'SO2', 'NO', 'Wind_Dir', 'Rel_Humidity']

def plot_features(data, features, site_name):
    fig, axs = plt.subplots(len(features), 1, figsize=(15, 15))

    for ax, feature in zip(axs, features):
        if feature in data.columns:
            ax.plot(data.index, data[feature], label=f'{feature} (Cleaned Data)')
            ax.set_title(f'Variation of {feature} in {site_name}')
            ax.set_xlabel('Date')
            ax.set_ylabel(feature)
            ax.grid(True)
        else:
            fig.delaxes(ax)  ## Remove the axis if the feature is not in the data

    plt.tight_layout()
    plt.show()
InΒ [68]:
## Call the function for each site
logging.info("\nπŸŽ“ [Penrose] Visualizes the variation of the top N highly correlated features with PM2.5 across all times.\n")
plot_features(cleaned_data_site1, top_features_data1, 'Penrose')
logging.info("\nπŸŽ“ [Takapuna] Visualizes the variation of the top N highly correlated features with PM2.5 across all times.\n")
plot_features(cleaned_data_site2, top_features_data2, 'Takapuna')
2024-05-14 12:04:38,696 - INFO - 
πŸŽ“ [Penrose] Visualizes the variation of the top N highly correlated features with PM2.5 across all times.

No description has been provided for this image
2024-05-14 12:04:39,572 - INFO - 
πŸŽ“ [Takapuna] Visualizes the variation of the top N highly correlated features with PM2.5 across all times.

No description has been provided for this image
InΒ [69]:
## Call the function for each site
logging.info("\nπŸŽ“ [Penrose] Visualizes the variation of the top N highly correlated features with PM10 across all times.\n")
plot_features(cleaned_data_site1, top_features_data21, 'Penrose')
logging.info("\nπŸŽ“ [Takapuna] Visualizes the variation of the top N highly correlated features with PM10 across all times.\n")
plot_features(cleaned_data_site2, top_features_data22, 'Takapuna')
2024-05-14 12:04:51,396 - INFO - 
πŸŽ“ [Penrose] Visualizes the variation of the top N highly correlated features with PM10 across all times.

No description has been provided for this image
2024-05-14 12:04:52,148 - INFO - 
πŸŽ“ [Takapuna] Visualizes the variation of the top N highly correlated features with PM10 across all times.

No description has been provided for this image
InΒ [70]:
def compute_summary_statistics(data, top_features):
    """
    Computes and prints summary statistics for selected features in a given dataset.
    
    Args:
    data (pd.DataFrame): The cleaned data for a specific site.
    top_features (list): List of features for which to compute statistics, determined from prior analysis.
    
    Returns:
    pd.DataFrame: Summary statistics of the selected features.
    """
    # Ensure all top features are in the data columns
    valid_features = [feature for feature in top_features if feature in data.columns]
    selected_features = ['PM2.5'] + valid_features

    # Get summary statistics
    summary_stats = data[selected_features].describe()
    return summary_stats
InΒ [75]:
## 'cleaned_data_site1' and 'cleaned_data_site2' are your cleaned datasets for Penrose and Takapuna respectively.
## 'top_features_data1' and 'top_features_data2' should be lists of top features determined from correlation analysis.

## [DEBUG] Check if the top features list is defined for each site
# top_features_data1 = ['NO2', 'Wind_Speed', 'Air_Temp', 'PM10', 'NOx']  # Example top features for Penrose
# top_features_data2 = ['SO2', 'NO', 'Rel_Humidity', 'Wind_Dir', 'NOx']  # Example top features for Takapuna

summary_stats_penrose = compute_summary_statistics(cleaned_data_site1, top_features_data1)
summary_stats_takapuna = compute_summary_statistics(cleaned_data_site2, top_features_data2)

logging.info("πŸŽ“ Summary Statistics for Penrose:")
summary_stats_penrose
2024-05-14 12:05:50,736 - INFO - πŸŽ“ Summary Statistics for Penrose:
Out[75]:
PM2.5 PM10 PM10_Lag1 NOx NO PM10_Lag2 AQI NO2 SO2 Wind_Dir Air_Temp Season Wind_Speed
count 17375.000000 17375.000000 17375.000000 17375.000000 17375.000000 17375.000000 17375.000000 17375.000000 17375.000000 17375.000000 17375.000000 17375.000000 17375.000000
mean 5.400065 13.886201 13.887018 23.837744 9.871754 13.887496 28.233712 13.973526 0.846799 173.193482 16.744259 2.502791 2.805975
std 5.204060 7.958464 7.958003 30.254179 20.526086 7.957911 9.608861 12.520003 1.482960 93.604420 4.005293 1.123019 1.533803
min -12.300000 -6.400000 -6.400000 -1.300000 -1.900000 -6.400000 0.000000 -1.400000 -4.400000 1.000000 3.000000 1.000000 0.200000
25% 2.200000 8.500000 8.500000 5.300000 0.900000 8.500000 22.000000 4.100000 0.100000 74.000000 14.000000 1.000000 1.600000
50% 4.700000 13.100000 13.100000 14.800000 3.700000 13.100000 27.000000 10.500000 0.500000 210.000000 17.000000 3.000000 2.600000
75% 7.700000 18.300000 18.300000 31.100000 10.700000 18.300000 33.000000 20.300000 1.000000 235.000000 20.000000 4.000000 3.800000
max 138.400000 196.700000 196.700000 455.100000 378.400000 196.700000 121.000000 83.100000 25.200000 357.000000 28.000000 4.000000 8.200000
InΒ [76]:
logging.info("\nπŸŽ“ Summary Statistics for Takapuna:")
summary_stats_takapuna
2024-05-14 12:05:56,070 - INFO - 
πŸŽ“ Summary Statistics for Takapuna:
Out[76]:
PM2.5 PM10 PM10_Lag1 AQI PM10_Lag2 NO NOx Air_Temp NO2 Rel_Humidity Wind_Dir Quarter Month
count 17359.000000 17359.000000 17359.000000 17359.000000 17359.000000 17359.000000 17359.000000 17359.000000 17359.000000 17359.000000 17359.000000 17359.000000 17359.000000
mean 6.237181 11.562504 11.562559 25.152947 11.562660 11.637256 23.521597 16.342980 0.012103 69.842651 176.424542 2.514315 6.540181
std 3.401947 7.778661 7.778618 10.637321 7.778534 21.970397 29.903734 3.952319 0.010751 11.760956 92.268417 1.120882 3.460755
min 0.450000 -3.800000 -3.800000 -13.000000 -3.800000 -0.800000 -0.700000 3.500000 -0.000500 30.400000 5.500000 1.000000 1.000000
25% 4.300000 7.468750 7.468750 19.000000 7.468750 2.350000 6.912500 13.500000 0.003950 61.200000 76.500000 2.000000 4.000000
50% 5.550000 10.650000 10.650000 24.000000 10.650000 5.700000 14.700000 16.000000 0.008800 70.950000 203.500000 3.000000 7.000000
75% 7.300000 14.400000 14.400000 30.000000 14.400000 11.900000 28.500000 19.500000 0.017100 79.700000 260.000000 4.000000 10.000000
max 52.300000 437.250000 437.250000 318.000000 437.250000 352.450000 401.950000 27.000000 0.072100 90.350000 346.500000 4.000000 12.000000
πŸŽ“ Interpretation ....
  • Target:

  • Features:

    • NO2

ReferencesΒΆ

Forecasting: Principles and Practice (3rd ed) : Rob J Hyndman and George Athanasopoulos; Monash University, Australia

Predicting Air Particulate Matter at Scale ⛅️
Auckland University of Technology (AUT) πŸŽ“