Predicting Air Particulate Matter at Scale: 3. Time Series Analysis and Forecasting
This notebook shows various methods for analysing and forecasting a time-series dataset. Its outputs are intended for reuse in the next CRISP-DM for Data Science steps (Machine Learning / Deep Learning modelling and an intelligent dashboard).
Workflow steps:
- Import the required teradataml modules.
- Connect to a Vantage system.
- Time-series analysis.
- Time-series forecasting.
- Cleanup.
Libraries and Reusable Functions
import os
import logging

logging.getLogger('prophet').setLevel(logging.WARNING)
logging.getLogger('cmdstanpy').disabled = True

## .env --> Set the environment variable that selects the configuration file programmatically
os.environ['ENV_PATH'] = 'cleaned_vantage.env'  ## Point at the raw or cleaned .env configuration as needed
# %run -i ./Data_Loading_and_Descriptive_Statistics.ipynb
%run -i ./DataFrameAdapter.ipynb
2024-05-21 14:55:55,439 - INFO - data/cleaned/cleaned_Penrose7-07May2020-to-30Apr2022.csv
2024-05-21 14:55:55,439 - INFO - data/cleaned/cleaned_Takapuna23-07May2020-to-30Apr2022.csv
2024-05-21 14:55:55,448 - INFO - ℹ️ Load Data from data/cleaned/cleaned_Penrose7-07May2020-to-30Apr2022.csv file --> rawdata DataFrame
2024-05-21 14:55:55,489 - INFO - ℹ️ Load Data from data/cleaned/cleaned_Takapuna23-07May2020-to-30Apr2022.csv file --> rawdata DataFrame
The specified .env file: cleaned_vantage.env
+-----------------------------+----------------------------------------------------------------------------------------------+-----------------------+
| Variable                    | Description                                                                                  | Value                 |
+-----------------------------+----------------------------------------------------------------------------------------------+-----------------------+
| IS_LOADING_FROM_FILES       | True if loading data from *.csv/xls files; False if using imported data in Teradata Vantage | True                  |
| IS_TERADATA_VANTAGE         | Using scalable Teradata Vantage vs. local machine (Laptop)                                   | False                 |
| IS_DATA_IN_TERADATA_VANTAGE | Using TeradataDataFrame in scalable Vantage vs. PandasDataFrame with local *.csv/xls files   | False                 |
| SCHEMA_NAME                 | [Teradata Vantage] Schema Name                                                               | Air_Pollution         |
| TABLE_NAME                  | [Teradata Vantage] Table Name                                                                | Air_Pollution_cleaned |
| IS_JUPYTERLAB               | Running in JupyterLab vs. Python Dash/Vizro Dashboard                                        | True                  |
| IS_TEST_DEV                 | Whether Test/Dev mode is active (vs. Production)                                             | False                 |
| DATA_PATH                   | *.csv/xls Data PATH                                                                          | Not set or not found  |
| USE_DATA_PREFIX             | Prefix to use for data files: 'raw' | 'cleaned'                                              | cleaned               |
+-----------------------------+----------------------------------------------------------------------------------------------+-----------------------+
ℹ️ Load Data from data/cleaned folder
ℹ️ Combined Data Shape: (34734, 13)
ℹ️ The Shape of the Dataframe rawdata_site1 (Penrose) and rawdata_site2 (Takapuna): (17375, 12) (17359, 11)
Describing the types of each attribute as numerical_columns (Continuous), ordinal_columns (Ordinal), or nominal_columns (Nominal) ...
ℹ️ Numerical Variables/Features: ['AQI', 'PM10', 'PM2.5', 'SO2', 'NO', 'NO2', 'NOx', 'Wind_Speed', 'Wind_Dir', 'Air_Temp', 'Rel_Humidity']
ℹ️ Ordinal Variables/Features: ['Timestamp']
ℹ️ Nominal Variables/Features: Index(['Site'], dtype='object')
1. [Site 1 - Penrose][numerical_columns_S1, nominal_columns_S1] Summary Statistics of the Dataframe such as the mean, maximum and minimum values ...
ℹ️ Numerical Variables/Features: ['AQI', 'PM10', 'PM2.5', 'SO2', 'NO', 'NO2', 'NOx', 'Wind_Speed', 'Wind_Dir', 'Air_Temp', 'Rel_Humidity']
ℹ️ Ordinal Variables/Features: ['Timestamp']
2. [Site 2 - Takapuna][numerical_columns_S2, nominal_columns_S2] Summary Statistics of the {site2} Dataframe such as the mean, maximum and minimum values ...
ℹ️ Numerical Variables/Features: ['AQI', 'PM10', 'PM2.5', 'NO', 'NO2', 'NOx', 'Wind_Speed', 'Wind_Dir', 'Air_Temp', 'Rel_Humidity']
ℹ️ Ordinal Variables/Features: ['Timestamp']
[Data_Loading_and_Descriptive_Statistics.ipynb] Listing variables with description...
+-------------------------+-----------------------------------------------------------------+-----------+---------+----------+
| Variable Name           | Description                                                     | All Sites | Penrose | Takapuna |
+-------------------------+-----------------------------------------------------------------+-----------+---------+----------+
| rawdata                 | Complete dataset containing all observations across all sites.  | [x]       | [x]     | [x]      |
| ordinal_columns         | Ordinal columns specific to Site 1.                             | [x]       | [x]     | [x]      |
| numerical_columns_site1 | Numerical columns specific to Site 1.                           | [ ]       | [x]     | [ ]      |
| nominal_columns_site1   | Nominal columns specific to Site 1.                             | [ ]       | [x]     | [ ]      |
| numerical_columns_site2 | Numerical columns specific to Site 2.                           | [ ]       | [ ]     | [x]      |
| nominal_columns_site2   | Nominal columns specific to Site 2.                             | [ ]       | [ ]     | [x]      |
| rawdata_site1           | Subset of raw data for Site 1.                                  | [ ]       | [x]     | [ ]      |
| rawdata_site2           | Subset of raw data for Site 2.                                  | [ ]       | [ ]     | [x]      |
+-------------------------+-----------------------------------------------------------------+-----------+---------+----------+
Describing the types of each attribute as cleaned_numerical_columns (Continuous), cleaned_ordinal_columns (Ordinal), or cleaned_nominal_columns (Nominal) ...
ℹ️ Numerical Variables/Features: ['AQI', 'PM10', 'PM2.5', 'SO2', 'NO', 'NO2', 'NOx', 'Wind_Speed', 'Wind_Dir', 'Air_Temp', 'Rel_Humidity', 'Hour', 'Day', 'DayOfWeek', 'Month', 'Quarter', 'Year', 'WeekOfYear', 'Season', 'PM2.5_Lag1', 'PM2.5_Lag2', 'PM10_Lag1', 'PM10_Lag2']
ℹ️ Ordinal Variables/Features: ['Timestamp']
ℹ️ Nominal Variables/Features: Index(['Site'], dtype='object')
[Site1 - Penrose] Summary Statistics of the {site1} cleaned_data_site1 Dataframe such as the mean, max/minimum values ...
[Site2 - Takapuna] Summary Statistics of the {site2} cleaned_data_site2 Dataframe such as the mean, max/minimum values ...
[DataFrameAdapter.ipynb] Listing variables with description...
+---------------------------+-----------------------------------------------------------------+-----------+---------+----------+
| Variable Name             | Description                                                     | All Sites | Penrose | Takapuna |
+---------------------------+-----------------------------------------------------------------+-----------+---------+----------+
| rawdata                   | Complete dataset containing all observations across all sites.  | [x]       | [x]     | [x]      |
| numerical_columns_site1   | Numerical columns specific to Site 1.                           | [ ]       | [x]     | [ ]      |
| nominal_columns_site1     | Nominal columns specific to Site 1.                             | [ ]       | [x]     | [ ]      |
| numerical_columns_site2   | Numerical columns specific to Site 2.                           | [ ]       | [ ]     | [x]      |
| nominal_columns_site2     | Nominal columns specific to Site 2.                             | [ ]       | [ ]     | [x]      |
| rawdata_site1             | Subset of raw data for Site 1.                                  | [ ]       | [x]     | [ ]      |
| rawdata_site2             | Subset of raw data for Site 2.                                  | [ ]       | [ ]     | [x]      |
+---------------------------+-----------------------------------------------------------------+-----------+---------+----------+
| cleaned_data              | Cleaned dataset with preprocessing applied.                     | [x]       | [x]     | [x]      |
| cleaned_ordinal_columns   | Ordinal columns in the cleaned dataset.                         | [x]       | [x]     | [x]      |
| cleaned_numerical_columns | Numerical columns in the cleaned dataset.                       | [x]       | [x]     | [x]      |
| cleaned_nominal_columns   | Nominal columns in the cleaned dataset.                         | [x]       | [x]     | [x]      |
| cleaned_data_site1        | Cleaned data for Site 1.                                        | [ ]       | [x]     | [ ]      |
| cleaned_data_site2        | Cleaned data for Site 2.                                        | [ ]       | [ ]     | [x]      |
+---------------------------+-----------------------------------------------------------------+-----------+---------+----------+
# if IS_JUPYTERLAB:
target_variables = {'PM2.5': 'Particulate Matter <2.5 µm', 'PM10': 'Particulate Matter <10 µm'}
data_series_df1 = PandasDataLoader.load_data(rawdata_site1)
data_series_df2 = PandasDataLoader.load_data(rawdata_site2)
data_series_df1
| ds | AQI | PM10 | PM2.5 | SO2 | NO | NO2 | NOx | Wind_Speed | Wind_Dir | Air_Temp | Rel_Humidity |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2020-05-07 00:00:00 | 26.0 | 16.500 | 16.1 | 1.8 | 60.3 | 40.9 | 101.2 | 0.6 | 316.0 | 8.0 | 78.0 |
| 2020-05-07 01:00:00 | 28.0 | 17.700 | 10.1 | 1.0 | 16.0 | 29.2 | 45.3 | 0.7 | 269.0 | 8.0 | 76.8 |
| 2020-05-07 02:00:00 | 28.0 | 15.000 | 10.3 | 0.1 | 16.0 | 29.2 | 45.3 | 1.0 | 180.0 | 8.0 | 78.4 |
| 2020-05-07 03:00:00 | 29.0 | 14.300 | 11.4 | 0.0 | 11.2 | 27.5 | 38.7 | 0.8 | 232.0 | 8.0 | 77.5 |
| 2020-05-07 04:00:00 | 30.0 | 8.800 | 10.6 | -0.1 | 12.0 | 28.5 | 40.5 | 0.8 | 274.0 | 7.0 | 80.1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2022-04-30 19:00:00 | 25.0 | 12.800 | -0.2 | 0.3 | 0.1 | 4.8 | 4.9 | 2.1 | 95.0 | 16.0 | 80.5 |
| 2022-04-30 20:00:00 | 25.0 | 2.200 | 2.2 | 0.2 | 0.5 | 9.9 | 10.3 | 1.0 | 116.0 | 15.0 | 84.2 |
| 2022-04-30 21:00:00 | 24.0 | 16.225 | 1.9 | 0.6 | 5.8 | 25.4 | 31.3 | 0.8 | 124.0 | 15.0 | 85.4 |
| 2022-04-30 22:00:00 | 24.0 | 8.200 | 6.0 | 0.5 | 1.8 | 17.9 | 19.7 | 0.8 | 108.0 | 14.0 | 86.0 |
| 2022-04-30 23:00:00 | 24.0 | 2.300 | 5.4 | 0.5 | 1.0 | 7.3 | 8.3 | 0.9 | 93.0 | 14.0 | 84.4 |

17375 rows × 11 columns
# cleaned_data_site2.index
# data_series_df2.index
data_series_df2
| ds | AQI | PM10 | PM2.5 | NO | NO2 | NOx | Wind_Speed | Wind_Dir | Air_Temp | Rel_Humidity |
|---|---|---|---|---|---|---|---|---|---|---|
| 2020-05-07 17:00:00 | 21.0 | 5.95 | 4.15 | 10.90 | 0.01715 | 28.00 | 2.50 | 242.0 | 15.5 | 68.15 |
| 2020-05-07 18:00:00 | 21.0 | 5.65 | 5.10 | 8.20 | 0.01655 | 24.70 | 2.20 | 239.5 | 15.5 | 61.75 |
| 2020-05-07 19:00:00 | 21.0 | 7.70 | 5.45 | 5.75 | 0.01325 | 19.00 | 2.10 | 244.0 | 15.0 | 59.45 |
| 2020-05-07 20:00:00 | 21.0 | 8.20 | 5.45 | 3.50 | 0.00870 | 12.20 | 2.25 | 251.0 | 15.0 | 59.25 |
| 2020-05-07 21:00:00 | 21.0 | 11.80 | 5.80 | 3.55 | 0.00930 | 12.90 | 2.10 | 261.0 | 15.0 | 61.75 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2022-04-30 19:00:00 | 14.0 | 4.75 | 3.30 | 0.60 | 0.00440 | 5.00 | 2.55 | 109.5 | 16.0 | 76.05 |
| 2022-04-30 20:00:00 | 14.0 | 6.35 | 3.15 | 0.50 | 0.00365 | 4.15 | 2.45 | 105.5 | 16.0 | 74.50 |
| 2022-04-30 21:00:00 | 14.0 | 6.05 | 2.80 | 0.40 | 0.00480 | 5.20 | 2.35 | 115.5 | 16.0 | 74.15 |
| 2022-04-30 22:00:00 | 13.0 | 4.20 | 2.60 | 0.40 | 0.00555 | 5.90 | 1.95 | 122.5 | 16.0 | 75.95 |
| 2022-04-30 23:00:00 | 13.0 | 5.00 | 2.80 | 0.35 | 0.00405 | 4.30 | 1.95 | 119.0 | 16.0 | 78.00 |

17359 rows × 10 columns
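For context, a minimal sketch of what a loader such as PandasDataLoader.load_data (defined in DataFrameAdapter.ipynb, not shown here) is assumed to produce: a copy of the raw site data indexed by a datetime column named ds, which is the indexing convention visible in the previews above. The helper name load_data_sketch and the exact column handling are assumptions for illustration.

import pandas as pd

def load_data_sketch(rawdata_site: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for PandasDataLoader.load_data: index hourly records by a 'ds' timestamp."""
    df = rawdata_site.copy()
    df['ds'] = pd.to_datetime(df['Timestamp'])  ## 'Timestamp' is the ordinal column listed above
    return df.drop(columns=['Timestamp', 'Site'], errors='ignore').set_index('ds').sort_index()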
Time-Series Analysis & Forecasting using ARIMA
check_stationarity: check the stationarity of the time series using the Augmented Dickey-Fuller (ADF) test.
Time-series stationarity: some time-series models, such as ARIMA, assume that the underlying data is stationary. Stationarity means that the time series has:
- a constant mean that is not time-dependent,
- a constant variance that is not time-dependent, and
- a constant covariance that is not time-dependent.
The check for stationarity can be done via three different approaches:
- Visually: plot time series and check for trends or seasonality
- Basic statistics: split time series and compare the mean and variance of each partition
- Statistical test: Augmented Dickey Fuller test
Apply the ADF test to verify whether the time-series data is stationary, which is a requirement for certain time-series forecasting methods such as ARIMA.
The Augmented Dickey-Fuller (ADF) test is a type of statistical test called a unit-root test; unit roots are a cause of non-stationarity.
- Null hypothesis (H0): the time series has a unit root (the time series is not stationary).
- Alternate hypothesis (H1): the time series has no unit root (the time series is stationary).
- If the null hypothesis can be rejected, we can conclude that the time series is stationary. There are two ways to reject the null hypothesis, as the sketch below illustrates:
On the one hand, the null hypothesis can be rejected if the p-value is below a set significance level. The default significance level is 5% (0.05).
- p-value > significance level (default: 0.05): fail to reject the null hypothesis (H0); the data has a unit root and is non-stationary.
- p-value <= significance level (default: 0.05): reject the null hypothesis (H0); the data does not have a unit root and is stationary.
On the other hand, the null hypothesis can be rejected if the test statistic is less than the critical value.
- ADF statistic > critical value: fail to reject the null hypothesis (H0); the data has a unit root and is non-stationary.
- ADF statistic < critical value: reject the null hypothesis (H0); the data does not have a unit root and is stationary.
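A minimal sketch of these two decision rules, assuming statsmodels is available and y is a pandas Series such as the training series used later (the helper name check_stationarity mirrors the description above):

from statsmodels.tsa.stattools import adfuller

def check_stationarity(y, significance=0.05):
    """Run the ADF unit-root test and report both decision rules."""
    adf_stat, p_value, _, _, critical_values, _ = adfuller(y.dropna())
    print(f'ADF statistic: {adf_stat:.4f}, p-value: {p_value:.4f}')
    for level, crit in critical_values.items():  ## e.g. '1%', '5%', '10%'
        verdict = 'stationary' if adf_stat < crit else 'non-stationary'
        print(f'  Critical value ({level}): {crit:.4f} -> {verdict}')
    return p_value <= significance  ## True -> reject H0 -> stationary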
Checking for stationarity of a time series using the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test
- The KPSS test can also be used to detect stochastic trends.
- The test hypotheses are the opposite of those of the ADF test:
- Null hypothesis (H0): the time series is trend-stationary.
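A complementary sketch for the KPSS test (note the reversed hypotheses: here a small p-value rejects trend-stationarity); kpss from statsmodels is assumed:

from statsmodels.tsa.stattools import kpss

def check_trend_stationarity(y, significance=0.05):
    """KPSS test: H0 is trend-stationarity, so rejecting H0 suggests non-stationarity."""
    kpss_stat, p_value, _, critical_values = kpss(y.dropna(), regression='ct')  ## 'ct': test around a deterministic trend
    print(f'KPSS statistic: {kpss_stat:.4f}, p-value: {p_value:.4f}')
    return p_value > significance  ## True -> fail to reject H0 -> trend-stationary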
determine_acf_pacf: determine the ACF and PACF for the autocorrelation and partial autocorrelation of the time series (a minimal sketch follows below).
- The ACF calculates the autocorrelation or autocovariance of a time series; these show how the series correlates or covaries with itself when delayed by a lag in time or space.
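A minimal sketch of such a helper, assuming statsmodels (acf/pacf return the numeric values, while plot_acf/plot_pacf draw them):

from statsmodels.tsa.stattools import acf, pacf
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt

def determine_acf_pacf(y, lags=15):
    """Return ACF/PACF values and draw the corresponding plots side by side."""
    y = y.dropna()
    acf_vals, pacf_vals = acf(y, nlags=lags), pacf(y, nlags=lags)
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
    plot_acf(y, lags=lags, ax=ax1)   ## significant spikes suggest the MA order q
    plot_pacf(y, lags=lags, ax=ax2)  ## significant spikes suggest the AR order p
    plt.show()
    return acf_vals, pacf_vals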
Fitting and validating ARIMA model parameters (p, d, q) based on the AIC for Penrose and Takapuna. This involves determining the best-fit model and forecasting future pollutant levels (a minimal end-to-end sketch follows this list).
- fit: estimate/fit the ARIMA model parameters for forecasting pollutant levels.
- validate: model validation and scoring: validate the ARIMA model against a held-out portion of the data to determine the model's accuracy.
- forecast: forecast future pollutant levels using the validated model.
- visualize: plot actual versus predicted values and forecasted values to assess model performance.
- plot_diagnostics: plot diagnostic charts.
- generate_report: generate a report of the model fit and forecast.
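A minimal end-to-end sketch of this fit/validate/forecast loop with pmdarima; the order (1, 1, 1) and the helper name fit_validate_forecast are illustrative assumptions, not the tuned results derived below:

import numpy as np
import pmdarima as pm
from pmdarima.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

def fit_validate_forecast(y, order=(1, 1, 1), train_size=0.8, horizon=24):
    """Fit an ARIMA on the training split, score it on the held-out split, then forecast ahead."""
    train, test = train_test_split(y, train_size=train_size)
    model = pm.ARIMA(order=order, suppress_warnings=True).fit(train)  ## fit
    preds = model.predict(n_periods=len(test))                        ## validate
    rmse = np.sqrt(mean_squared_error(test, preds))
    print(f'Validation RMSE: {rmse:.2f}, AIC: {model.aic():.1f}')
    model.update(test)  ## absorb the test observations before forecasting beyond them
    forecast, conf_int = model.predict(n_periods=horizon, return_conf_int=True)  ## forecast
    return model, forecast, conf_int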
from abc import ABC, abstractmethod

class TimeSeriesAnalyzer(ABC):
    """
    Abstract base class for a Time Series Analyzer.
    """
    def __init__(self, data, target_col):
        """✅ Step 1: Data Loading & Splitting"""
        # self.data = data
        self.target_col = target_col
        ## Placeholder for transformed data, if necessary
        # self.transformed_data = None
        ## Placeholder for differenced data, if necessary
        # self.differenced_data = None
        ## Placeholder for the selected model
        self.model = None
    @abstractmethod
    def timeseries_visualization(self, lags=15, max_diff_order=3):
        """
        ✅ Step 2: Visualize the time series data with original time-series and histogram plots, plus its ACF and PACF plots.
        Also considers a Box-Cox transformation for variance stabilisation if the data is skewed.
        Original Series and Histogram plots, as well as ACF and PACF plots, are placed side by side for better comparative analysis.
        - ACF and PACF plots help identify the order of the AR and MA components for ARIMA modeling.
        - Histogram analysis after transformation helps understand data normality and variance stability.
        - If the histogram of the transformed data is more symmetrical, a Box-Cox transformation is helpful.
        - Significant spikes in the ACF/PACF plots indicate potential AR or MA terms for ARIMA modeling.
        - Check whether transformation parameters need adjustment based on skewness and kurtosis metrics.
        - [x] 2.1. Plot the original time series: a visual sense of trends, seasonality, or irregular patterns.
        - [ ] 2.1. Plot the histogram of the series values: after applying a Box-Cox transformation, the histogram helps check whether the data distribution has become more symmetrical, which is often required for effective time-series modeling.
        - [x] 2.2. Plot the Auto-Correlation Function (ACF): visualize how the series correlates with its lagged values, which is crucial for identifying the MA (Moving Average) components of an ARIMA model.
        - [x] 2.2. Plot the PACF: provides insight into the AR (AutoRegressive) components of the ARIMA model by isolating the influence of intervening lags.
        """
        pass
    @abstractmethod
    def transform_data(self, data, transformer, lag_max=15):
        """
        ✅ Step 3: Apply a transformation to the time series data and display diagnostics.
        Parameters:
        - data (np.array or pd.Series): The original time series data.
        - transformer (object): An instance of a transformer from pmdarima.preprocessing.
        - lag_max (int): Maximum lag to which the autocorrelation function is computed.
        Returns:
        - Transformed data
        """
        pass
@abstractmethod
def model_selection(self):
"""
Step 5: iteratively select models based on the AIC: try the chosen model(s) and use the AIC to search for a better model.
"""
pass
@abstractmethod
def multi_step_forecast(self):
"""
Step 6: Check the residuals from the chosen model by plotting the ACF of the residual, and doing a portmanteau test of the residual.
Do the residuals look like white noise?
"""
pass
# @abstractmethod
# def calculate_forecast(self):
# """
# Step 7: Calculate forecasts & plot
# """
# pass
import pmdarima as pm
from pmdarima.preprocessing import BoxCoxEndogTransformer, LogEndogTransformer
from pmdarima.model_selection import train_test_split
from pmdarima import tsdisplay
class ArimaTimeSeriesAnalyzer(TimeSeriesAnalyzer):
    def __init__(self, data: pd.DataFrame, monitoring_site: str, target_col: str, is_auto_arima=False):
        """
        Initializes the analysis with training and testing data, ensuring that both datasets contain the specified target variable.
        """
        super().__init__(data, target_col)
        print('✅ Step 1: Data Loading & Splitting 80/20')
        self.data = data
        self.site = monitoring_site  ## Penrose, Takapuna
        self.target_col = target_col  ## PM2.5, PM10
        # self.train_data = None
        # self.test_data = None
        # self.train_test_split(train_size=0.8)  ## 80/20 Train/Test split
        self.is_auto_arima = is_auto_arima
        # self.transformed_data = None
        # self.lambda_value = None
        # self.is_stationary = True
        # self.sarima_model = None
        # self.arima_model = None
        self.model = None
def train_test_split(self, train_size=0.8):
"""
βοΈ [1] Train-Test split: Splits the data into training and testing sets.
Use 80% as the in-time training data and the rest 20% as the out-of-time test data.
:param train_size: float - proportion of the dataset to include in the train split.
"""
train_len = int(len(self.data) * train_size)
self.train_data = self.data.iloc[:train_len]
self.test_data = self.data.iloc[train_len:]
        print(f'Training & Testing Data split (80/20): {len(self.train_data):,} train samples and {len(self.test_data):,} test samples.')
self.y = self.data[self.target_col] ## Compatible with Prophet
self.y_train = self.train_data[self.target_col]
self.y_test = self.test_data[self.target_col]
    def plot_line_chart_and_histogram(self, y_train):
        """
        ✔️ [2] Plot line chart and histogram
        """
        print(f'\n[2.1] Plot the original time series: a visual sense of trends, seasonality, or irregular patterns.')
plt.figure(figsize=(14, 7))
## 1.1. Original Series Plot
plt.subplot(121) ## 1 row, 2 columns, 1st subplot
# plt.plot(self.data.index, self.y, color='blue', label='Original Series')
plt.plot(y_train.index, y_train, color='steelblue', label='Original Series')
plt.title(f"Original Time Series of {self.target_col} at {self.site}")
plt.xlabel("Timestamp")
plt.ylabel(self.target_col)
plt.legend()
## 1.2 Plot the histogram of the series values
## Histogram displays the distribution of data to evaluate skewness.
## Check for skewness and perform Box-Cox Transformation if necessary
# transformed_data, _ = preprocessing.BoxCoxEndogTransformer(lmbda2=1e-6).fit_transform(self.data[[self.target_col]].dropna())
plt.subplot(122) ## 1 row, 2 columns, 2nd subplot
# plt.hist(transformed_data, bins=30, color='blue', edgecolor='k', alpha=0.7)
# plt.hist(self.y, bins=90, color='blue', edgecolor='k', alpha=0.7)
plt.hist(y_train, bins=60, color='steelblue', edgecolor='gray', alpha=0.7)
# plt.title(f"Histogram of Transformed {self.target_col}")
plt.title(f"Histogram of Train Data: {self.target_col} at {self.site}")
plt.xlabel(f'{self.target_col} Value')
plt.ylabel("Frequency")
plt.tight_layout()
plt.show()
    ## Step 2.2: Differencing the series to achieve stationarity
    def plot_differencing(self, series, max_diff_order=3):
        """
        ✔️ [2.2] Apply differencing to the time series to achieve stationarity, plot each differencing level in a grid,
        and perform Dickey-Fuller tests to check stationarity after each differencing.
        This function plots the original series and its differenced versions up to the specified order,
        along with the p-values from the Dickey-Fuller test to assess stationarity.
        Note: pmdarima's auto_arima will automatically determine the appropriate differencing term by default.
        Parameters:
        - series (pd.Series): Time series data.
        - max_diff_order (int): The maximum differencing order to apply.
        Outputs:
        - Plots of the original and differenced series with Dickey-Fuller test results.
        """
        print(f'\n[2.2] Apply differencing up to order {max_diff_order} to the time series and visualize the results.')
ordering = ['Original', '1st', '2nd', '3rd', '4th', '5th', '6th']
plt.rcParams.update({'figure.figsize': (12, 8), 'figure.dpi': 100})
num_plots = max_diff_order + 1
num_rows = (num_plots + 1) // 2 ## Calculate rows needed for plots (2 plots per row)
# fig, axes = plt.subplots(num_rows, 2, figsize=(12, 3 * num_rows), sharex=True, sharey=True)
fig, axes = plt.subplots(num_rows, 2, figsize=(12, 3 * num_rows), sharex=False, sharey=True)
axes = axes.flatten() ## Flatten to simplify indexing
        for i in range(num_plots):
            ax = axes[i]  ## This handles both single and multiple rows of subplots
            if i == 0:  ## Plot original series in the first plot
                current_series = series
                title = f'[{self.site}] {self.target_col}: Original Series'
            else:  ## Generate and plot differenced series up to max_diff_order
                ## Note: series.diff(i) would take a single lag-i difference; i-th ORDER differencing
                ## means applying a first difference i times.
                current_series = series
                for _ in range(i):
                    current_series = current_series.diff()
                current_series = current_series.dropna()
                title = f'[{self.site}] {self.target_col}: {ordering[i]} Order Differencing'
            ax.plot(current_series, color='steelblue' if i == 0 else 'orange')
            ax.set_title(title)
ax.grid(True, which='both', linestyle='--', linewidth=0.5, color='lightgray')
ax.tick_params(axis='x', rotation=45) ## Rotate x-axis labels
## Perform Dickey-Fuller test on the current series and display p-value
if not current_series.empty:
df_result = adfuller(current_series)
p_value_msg = f'p-value: {df_result[1]:.4f}'
stationarity = " (Stationary: < 0.05)" if df_result[1] < 0.05 else " (Not stationary: >= 0.05)"
ax.set_xlabel(p_value_msg + stationarity)
else:
ax.set_xlabel('No data available for testing')
## Move the xlabel to the top left
ax.xaxis.label.set_color('steelblue')
ax.xaxis.set_label_position('top')
ax.xaxis.set_label_coords(0.5, 1.02)
## Adjust the number of plots and layout: If the total number of plots is odd
## Hide any unused axes if the number of plots is less than the number of subplot cells
if num_plots < len(axes):
for j in range(num_plots, len(axes)):
fig.delaxes(axes[j])
## Adjust layout and display the plot
plt.tight_layout()
plt.show()
    def plot_acf_ma_and_pacf_ar(self, series, max_diff_order=2, max_lags=15, use_pacf=False):
        """
        ✔️ [2.3] Use the ACF to suggest the MA orders, and the PACF to suggest the AR orders.
        Calculates and plots the Autocorrelation Function (ACF) for the original time series and up to {max_diff_order}
        differenced versions to suggest possible MA orders, including statistical tests for stationarity.
        This helps in identifying the order of the MA component by looking at the significant lags in the ACF plots.
        Parameters:
        - series (pd.Series): The time series data.
        - max_diff_order (int): Maximum number of differencing orders to plot.
        - max_lags (int): Focus on the first max_lags lags in the ACF plot.
        This function performs the following:
        - Differencing the series up to the specified order.
        - Plotting the ACF (or PACF) for each differenced series, including the original series.
        - Performing the Dickey-Fuller test to annotate plots with p-values indicating stationarity.
        """
        print(f'\n[2.3] Plotting the {"PACF" if use_pacf else "ACF"} to suggest possible {"AR" if use_pacf else "MA"} orders.')
ordering = ['Original', '1st', '2nd', '3rd', '4th', '5th', '6th']
## Prepare the subplot grid based on the number of differencing levels
plt.rcParams.update({'figure.figsize': (12, 8), 'figure.dpi': 100}) ## Adjust plot settings
num_plots = max_diff_order + 1
num_rows = (num_plots + 1) // 2 ## Calculate rows needed for plots (2 plots per row)
fig, axes = plt.subplots(num_rows, 2, figsize=(12, 3 * num_rows), sharex=True, sharey=True)
axes = axes.flatten() ## Flatten to simplify indexing
## Iterate through each differencing level
        for i in range(num_plots):
            ax = axes[i]
            ## Handle the original series differently; the i-th order difference is a first difference applied i times
            if i == 0:
                current_series = series
            else:
                current_series = series
                for _ in range(i):
                    current_series = current_series.diff()
                current_series = current_series.dropna()
            title_suffix = 'Original Series' if i == 0 else f'{ordering[i]} Order Differencing'
            ## Perform ACF/PACF plotting: the alpha=0.05 setting specifies a 95% confidence level.
            lags = min(max_lags, len(current_series) - 1)
            plot_func = plot_pacf if use_pacf else plot_acf
            ## Limit the lags shown via the lags= argument rather than truncating the series with head()
            plot_func(current_series, ax=ax, lags=lags, alpha=0.05,
                      title=f'[{self.site}] {self.target_col}: {"PACF" if use_pacf else "ACF"} - {title_suffix}')
            ## Adding labels for x and y axes
            ax.set_xlabel('Lags')
            ax.set_ylabel('PACF' if use_pacf else 'ACF')
## Setting plot titles and labels
ax.grid(True, which='both', linestyle='--', linewidth=0.5, color='lightgray')
# ax.set_title(f'{title_suffix} - ADF p-value: {p_value:.4f} {stationarity_status}', fontsize=12)
# ## Highlight significant lags manually if needed: marks significant lags based on ACF values outside the confidence bounds.
# significance_limit = 1.96 / np.sqrt(len(current_series))
# ax.axhline(y=significance_limit, linestyle='--', color='red', linewidth=0.7)
# ax.axhline(y=-significance_limit, linestyle='--', color='red', linewidth=0.7)
## Perform Dickey-Fuller test & Adding stationarity test results for context
df_result = adfuller(current_series)
p_value_msg = f'p-value: {df_result[1]:.4f}'
stationarity_status = ' (Stationary: < 0.05)' if df_result[1] < 0.05 else ' (Not stationary: >= 0.05)'
ax.set_xlabel(p_value_msg + stationarity_status)
## Move the xlabel to the top left
ax.xaxis.label.set_color('steelblue')
ax.xaxis.set_label_position('top')
ax.xaxis.set_label_coords(0.5, 1.02)
## Adjust the number of plots and layout: If the total number of plots is odd
## Hide any unused axes if the number of plots is less than the number of subplot cells
if num_plots < len(axes):
for j in range(num_plots, len(axes)):
fig.delaxes(axes[j])
## Layout adjustment
plt.tight_layout()
plt.show()
    def timeseries_visualization(self, lags=15, max_diff_order=3):
        """
        ✅ Step 2: Visualize the time series data with original time-series and histogram plots, plus its ACF and PACF plots.
        Also considers a Box-Cox transformation for variance stabilisation if the data is skewed.
        Original Series and Histogram plots, as well as ACF and PACF plots, are placed side by side for better comparative analysis.
        - ACF and PACF plots help identify the order of the AR and MA components for ARIMA modeling.
        - Histogram analysis after transformation helps understand data normality and variance stability.
        - If the histogram of the transformed data is more symmetrical, a Box-Cox transformation is helpful.
        - Significant spikes in the ACF/PACF plots indicate potential AR or MA terms for ARIMA modeling.
        - Check whether transformation parameters need adjustment based on skewness and kurtosis metrics.
        - [x] 2.1. Plot the original time series: a visual sense of trends, seasonality, or irregular patterns.
        - [ ] 2.1. Plot the histogram of the series values: after a Box-Cox transformation, the histogram helps check whether the distribution has become more symmetrical, which is often required for effective modeling.
        - [x] 2.2. Plot the Auto-Correlation Function (ACF): visualize how the series correlates with its lagged values, which is crucial for identifying the MA (Moving Average) components of an ARIMA model.
        - [x] 2.2. Plot the PACF: provides insight into the AR (AutoRegressive) components of the ARIMA model by isolating the influence of intervening lags.
        """
        print('✅ Step 2: Time Series Visualization')
        ## Visualize the time series and its ACF and PACF plots
        print(f'\n[2.1] [Default] Time Series Display ...')
        pm.tsdisplay(self.y_train, lag_max=lags, title=f'ACF plot and histogram for {self.target_col} at {self.site}', show=True)
## Row 1: Plotting the original series and its histogram
self.plot_line_chart_and_histogram(self.y_train)
## Row 2: Perform stationarity checks to see if the given series is stationary.
## To perform differencing and unit-root tests (Dickey-Fuller test), and others if necessary.
self.plot_differencing(self.train_data[self.target_col], max_diff_order=1)
# df_result = adfuller(self.y)
# p_value_msg = f'p-value: {df_result[1]:.4f}'
# stationarity = " (Stationary)" if df_result[1] < 0.05 else " (Not stationary)"
# print(f'π ADF Statistic adf_test_result[0]: {df_result[0]}; and adf_test_result[1]: ' + p_value_msg + stationarity)
## Row 3: ACF and PACF Plots
## Use the ACF to suggest the orders of MA
self.plot_acf_ma_and_pacf_ar(self.train_data[self.target_col], max_diff_order=max_diff_order, max_lags=lags, use_pacf=False)
## Use the PACF to suggest the orders of AR
self.plot_acf_ma_and_pacf_ar(self.train_data[self.target_col], max_diff_order=max_diff_order, max_lags=lags, use_pacf=True)
    ## Function to apply transformation and display time series diagnostics
    def transform_data(self, data, transformer, lag_max=15):
        """
        ✅ Step 3: Apply a transformation to the time series data and display diagnostics.
        Parameters:
        - data (np.array or pd.Series): The original time series data.
        - transformer (object): An instance of a transformer from pmdarima.preprocessing.
        - lag_max (int): Maximum lag to which the autocorrelation function is computed.
        Returns:
        - Transformed data
        """
        print('✅ Step 3: If the data is skewed, transform it (using a Log or Box-Cox transformation) to stabilise the variance.')
        ## Fit and transform the data
        transformed_data, _ = transformer.fit_transform(data)
        ## Display time series diagnostics
        tsdisplay(transformed_data, lag_max=lag_max, title=f"Transformed Data with {transformer.__class__.__name__}")
        return transformed_data
    def auto_arima_model(self, y_train):
        """
        ✔️ [4] Uses auto_arima from the pmdarima library to find the optimal/best ARIMA model.
        auto_arima can be quite memory-intensive, especially with large datasets.
        ARIMA(p,d,q) model:
        - p is the order of the AR term: use the PACF to suggest the auto-regressive (AR) orders.
        - d is the number of differences required to make the time series stationary:
          the degree of differencing, typically only used in the case of non-stationary data.
        - q is the order of the MA term: use the ACF to suggest the moving-average (MA) orders.
        Parameters:
        - y_train: A pandas Series containing the training data for the model.
        Returns:
        - model: The fitted ARIMA model, or None if is_auto_arima is disabled.
        """
        model = None  ## Avoid returning an unbound name when is_auto_arima is False
        if self.is_auto_arima:
            logging.info('[4.1] The ARIMA Models ...')
            model = pm.auto_arima(y_train,
                                  d=None,
                                  seasonal=False,
                                  m=24,  ## Ignored when seasonal=False; 24 would mean daily seasonality for hourly data
                                  stepwise=True,
                                  max_p=None,
                                  max_order=None,
                                  suppress_warnings=True,
                                  error_action="ignore",
                                  trace=True)
            logging.debug(model.summary())
        else:
            logging.info('[ArimaTimeSeriesAnalyzer] Set is_auto_arima=True or use the predefined models.')
        return model
    def auto_sarima_model(self, y_train):
        """
        ✔️ [4] Uses auto_arima from the pmdarima library to find the optimal/best SARIMA model.
        auto_arima can be quite memory-intensive, especially with large datasets.
        SARIMA(p,d,q)(P,D,Q)[S] where
        - p, d, q are the non-seasonal AR order, differencing order, and MA order.
        - P and Q are the seasonal AR and MA orders.
        - D is the order of seasonal differencing.
        - S is the seasonal cycle, e.g. 24 for a 24-hour cycle.
        Parameters:
        - y_train: A pandas Series containing the training data for the model.
        Returns:
        - model: The fitted SARIMA model, or None if fitting failed or is_auto_arima is disabled.
        """
        if self.is_auto_arima:
            logging.info('[4.1] The ARIMA Models ...')
            try:
                print('[4.2] The SARIMA Models ... Note: do a preliminary check for stationarity, such as an ADF test, if needed!')
                self.sarima_model = pm.auto_arima(
                    y_train,
                    d=None,            ## Let the model determine 'd' (non-seasonal differencing)
                    seasonal=True,     ## Enable seasonal terms/differencing
                    m=24,              ## 24h daily seasonality; or m=168 for weekly, m=8760 for annual seasonality
                    stepwise=True,     ## Use the stepwise algorithm to find the best model parameters
                    max_p=2,           ## Maximum p (non-seasonal AR terms)
                    max_q=2,           ## Maximum q (non-seasonal MA terms)
                    max_P=2,           ## Maximum seasonal P (AR terms)
                    max_Q=2,           ## Maximum seasonal Q (MA terms)
                    max_order=None,    ## No cap on p+q+P+Q+d+D
                    suppress_warnings=True,
                    error_action='ignore',  ## Keep running despite errors
                    trace=True
                )
                print(self.sarima_model.summary())
                return self.sarima_model  ## Return the fitted model so callers can store and inspect it
            except Exception as e:
                print(f"Failed to fit SARIMA model: {e}")
        else:
            logging.info('[ArimaTimeSeriesAnalyzer] Set is_auto_arima=True or use the predefined models.')
        return None
def model_selection(self, models, train_data, test_data, preference='RMSE'):
"""
        ✅ Step 5: Compare multiple ARIMA and SARIMA models based on configurable preference metrics, including RMSE, MAE, MAPE, AIC, and BIC.
Args:
- models (list of tuples): Each tuple contains the model description, and its orders & seasonal orders.
* order (tuple): Non-seasonal ARIMA order (p, d, q).
* seasonal_order (tuple, optional): Seasonal order (P, D, Q, m), if the model is seasonal.
- train_data (pd.Series): The training data for the time series.
- test_data (pd.Series): The testing data for evaluating model performance.
- preference (str): Preferred metric for model evaluation.
Returns:
dict: The best model based on the preferred metric or an informative message if no model fits.
"""
results = []
best_model = None
best_metric = float('inf')
best_model_instance = None
for name, order, seasonal_order in models:
try:
logging.info(f"π Fitting the ARIMA/SARIMA {name} to the training data...")
## Initializing the model with or without seasonal order
# if seasonal_order:
# model = pm.ARIMA(order=order, seasonal_order=seasonal_order, suppress_warnings=True)
# else:
# model = pm.ARIMA(order=order, suppress_warnings=True)
model = pm.ARIMA(order=order, seasonal_order=seasonal_order, suppress_warnings=True) if seasonal_order else pm.ARIMA(order=order, suppress_warnings=True)
## Fit the model to the training data
model_fit = model.fit(train_data)
## Predict using the fitted model on the test data size
predictions = model_fit.predict(n_periods=len(test_data))
## Calculate metrics (RMSE, MAE, and MAPE) for the predictions
metrics = CommmonUtils.calculate_metrics(test_data, predictions)
metrics.update({'AIC': model_fit.aic(), 'BIC': model_fit.bic()})
## Append results to a list for final display: metrics + Model's AIC/BIC
results.append({**{'Model': name}, **metrics})
current_metric = metrics[preference]
## Determine best model based on the best/lowest preferred metric:
if current_metric < best_metric:
best_metric = current_metric
best_model = {**{'Model': name}, **metrics}
best_model_instance = model_fit
                    logging.debug(f"Updated best model: {best_model} with {preference} = {best_metric:.2f}")
except Exception as e:
logging.error(f"Error fitting model {name}: {e}")
if not best_model:
print("No valid models were fitted successfully.")
return None
## Output results sorted by the preferred metric
results.sort(key=lambda x: x[preference])
for result in results:
logging.info(f"Model: {result['Model']} - RMSE: {result['RMSE']:.2f}, MAE: {result['MAE']:.2f}, "
f"MAPE: {result['MAPE']:.2f}%, AIC: {result['AIC']:.2f}, BIC: {result['BIC']:.2f}")
logging.info(f"\nThe best model based on {preference}: {best_model['Model']} with {preference}: {best_model[preference]:.2f}")
# self.model = best_model_instance
# logging.info(self.model.summary())
return best_model_instance
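    ## A possible usage sketch for model_selection; the candidate orders below are illustrative
    ## assumptions (e.g. taken from the ACF/PACF inspection above), not tuned values.
    ## Each entry is (name, non-seasonal order, seasonal order or None):
    # candidate_models = [
    #     ('ARIMA(0,1,4)', (0, 1, 4), None),
    #     ('ARIMA(1,1,1)', (1, 1, 1), None),
    #     ('SARIMA(0,1,2)(1,0,1)[24]', (0, 1, 2), (1, 0, 1, 24)),
    # ]
    # best_fit = analyzer.model_selection(candidate_models, analyzer.y_train, analyzer.y_test, preference='RMSE')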
    def multi_step_forecast(self, n_periods: int):
        prediction, conf_int = self.model.predict(n_periods=n_periods, return_conf_int=True)
        ## Plot the last 100 observed values against their own index so x and y lengths always match
        history = self.data[self.target_col].dropna().tail(100)
        ## Start the forecast index one step after the last observed timestamp
        future_index = pd.date_range(self.data.index[-1] + pd.Timedelta(hours=1), periods=n_periods, freq='H')
        plt.figure(figsize=(12, 5))
        plt.plot(history.index, history, label='Historical')
        plt.plot(future_index, prediction, label='Forecast')
        plt.fill_between(future_index, conf_int[:, 0], conf_int[:, 1], color='pink', alpha=0.3, label='Confidence Interval')
        plt.legend()
        plt.show()
## Residual diagnostics (module-level helper); imports added for self-containment
import numpy as np
import seaborn as sns
import statsmodels.api as sm
from scipy.stats import skew, kurtosis
from statsmodels.graphics.gofplots import qqplot

def plot_residuals(residuals, figsize=(12, 10), dpi=100, color_scheme='skyblue'):
    """
    Plot diagnostic charts for the residuals of a time series model to check for normality,
    homoscedasticity, and absence of autocorrelation in the residuals.
    Parameters:
    - residuals (pd.Series or np.array): Residual values from a time series model.
    - figsize (tuple): Figure size.
    - dpi (int): Dots per inch for the plot resolution.
    - color_scheme (str): Color for the plots.
    """
## Check if residuals is a method or a callable, and call it to get the data
if callable(residuals):
residuals = residuals()
## Ensure residuals are a numpy array for compatibility with plotting functions
if not isinstance(residuals, np.ndarray):
residuals = np.array(residuals)
fig, axes = plt.subplots(2, 2, figsize=figsize, dpi=dpi)
plt.subplots_adjust(hspace=0.4, wspace=0.3)
## Time series plot of residuals
axes[0, 0].plot(residuals, color=color_scheme)
axes[0, 0].set_title('Time-series plot of Residuals over-time')
axes[0, 0].set_ylabel('Residuals')
axes[0, 0].axhline(y=0, color='black', linestyle='--')
## ACF plot for residuals
sm.graphics.tsa.plot_acf(residuals, ax=axes[0, 1], alpha=0.05, color=color_scheme)
axes[0, 1].set_title('Autocorrelation/ACF for Residuals')
## Histogram with KDE for residuals
sns.histplot(residuals, kde=True, ax=axes[1, 0], bins=30, color=color_scheme)
axes[1, 0].set_title('Histogram with KDE for Residuals')
## Q-Q plot for checking normality
qqplot(residuals, line='s', ax=axes[1, 1])
axes[1, 1].set_title('Q-Q Plot for checking Normality')
## Statistical Summary
# stats_text = f'Mean: {np.mean(residuals):.2f}\nSD: {np.std(residuals):.2f}\nSkew: {skew(residuals):.2f}\nKurtosis: {kurtosis(residuals):.2f}'
mean, std_dev = np.mean(residuals), np.std(residuals)
skewness, kurt = skew(residuals), kurtosis(residuals)
stats_text = f'Mean: {mean:.2f}\nSD: {std_dev:.2f}\nSkew: {skewness:.2f}\nKurtosis: {kurt:.2f}'
axes[1, 0].annotate(stats_text, xy=(0.05, 0.95), xycoords='axes fraction',
horizontalalignment='left', verticalalignment='top',
fontsize=10, bbox=dict(boxstyle="round4,pad=0.6", fc="white", ec="black", lw=1))
plt.show()
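A possible usage sketch, assuming a fitted pmdarima model such as one returned by the auto-ARIMA selection below (pmdarima's ARIMA exposes its residuals via the resid() method):

## Hypothetical usage: inspect residual diagnostics of a fitted model from the selection below
# best_fit = models[('Site1-Penrose', 'PM2.5')]
# if best_fit is not None:
#     plot_residuals(best_fit.resid())  ## plot_residuals also accepts the bound method itself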
Automatic Model Selection
def apply_auto_arima_all_sites(df1, df2, target_cols):
"""
Apply Auto ARIMA Model Selection across multiple sites and pollutants.
Parameters:
- df1: DataFrame for Site 1
- df2: DataFrame for Site 2
- target_cols: List of target columns (pollutants) for analysis
Returns:
- models: Dictionary containing fitted models for each site and pollutant
"""
## Dictionary to store models for each site and target column
models = {}
## Iterate over each pollutant
for target_col in target_cols:
## Iterate over each site
for df, site_name in zip([df1, df2], ["Site1-Penrose", "Site2-Takapuna"]):
            print(f'\nApplying Auto ARIMA Model Selection for {site_name} ({target_col})')
## Initialize the ARIMA analyzer
analyzer = ArimaTimeSeriesAnalyzer(df, site_name, target_col, is_auto_arima=True)
## Prepare data by performing a train-test split
analyzer.train_test_split(train_size=0.8)
# ## Visualize time series and lag plots
# analyzer.timeseries_visualization(lags=15, max_diff_order=1)
## Applying Auto ARIMA to find the best model
model = analyzer.auto_arima_model(analyzer.y_train)
## Storing model results
models[(site_name, target_col)] = model
            ## Print model summary
            if model:
                logging.info(f"✅ Best model for {site_name} ({target_col}): ARIMA{model.order}")
                logging.info(model.summary())
            else:
                logging.info('❌ Failed to determine a suitable model.')
    return models
target_cols = ['PM2.5', 'PM10'] # List of pollutants to analyze
data_penrose = data_series_df1.copy() ## the DataFrame for Site1-Penrose
data_takapuna = data_series_df2.copy() ## the DataFrame for Site2-Takapuna
## Call the function with data and target columns
models = apply_auto_arima_all_sites(data_penrose, data_takapuna, target_cols)
2024-05-14 08:56:12,636 - INFO - [4.1] The ARIMA Models ...

Applying Auto ARIMA Model Selection for Site1-Penrose (PM2.5)
✅ Step 1: Data Loading & Splitting 80/20
Training & Testing Data split (80/20): 13,900 train samples and 3,475 test samples.
Performing stepwise search to minimize aic
 ARIMA(2,1,2)(0,0,0)[0] intercept   : AIC=inf, Time=6.93 sec
 ARIMA(0,1,0)(0,0,0)[0] intercept   : AIC=85246.817, Time=0.18 sec
 ARIMA(1,1,0)(0,0,0)[0] intercept   : AIC=84200.017, Time=0.26 sec
 ARIMA(0,1,1)(0,0,0)[0] intercept   : AIC=83375.633, Time=0.65 sec
 ARIMA(0,1,0)(0,0,0)[0]             : AIC=85244.818, Time=0.09 sec
 ARIMA(1,1,1)(0,0,0)[0] intercept   : AIC=inf, Time=3.08 sec
 ARIMA(0,1,2)(0,0,0)[0] intercept   : AIC=82177.054, Time=2.68 sec
 ARIMA(1,1,2)(0,0,0)[0] intercept   : AIC=inf, Time=8.12 sec
 ARIMA(0,1,3)(0,0,0)[0] intercept   : AIC=81745.972, Time=4.33 sec
 ARIMA(1,1,3)(0,0,0)[0] intercept   : AIC=inf, Time=14.72 sec
 ARIMA(0,1,4)(0,0,0)[0] intercept   : AIC=81522.712, Time=7.43 sec
 ARIMA(1,1,4)(0,0,0)[0] intercept   : AIC=inf, Time=21.19 sec
 ARIMA(0,1,5)(0,0,0)[0] intercept   : AIC=inf, Time=15.27 sec
 ARIMA(1,1,5)(0,0,0)[0] intercept   : AIC=inf, Time=36.91 sec
 ARIMA(0,1,4)(0,0,0)[0]             : AIC=81520.713, Time=1.22 sec
 ARIMA(0,1,3)(0,0,0)[0]             : AIC=81743.973, Time=0.79 sec
 ARIMA(1,1,4)(0,0,0)[0]             : AIC=inf, Time=1.83 sec
 ARIMA(0,1,5)(0,0,0)[0]             : AIC=inf, Time=1.57 sec
 ARIMA(1,1,3)(0,0,0)[0]             : AIC=inf, Time=1.26 sec
 ARIMA(1,1,5)(0,0,0)[0]             : AIC=inf, Time=3.26 sec

Best model:  ARIMA(0,1,4)(0,0,0)[0]
Total fit time: 131.811 seconds

2024-05-14 08:58:24,589 - INFO - ✅ Best model for Site1-Penrose (PM2.5): ARIMA(0, 1, 4)
2024-05-14 08:58:24,686 - INFO - SARIMAX Results
==============================================================================
Dep. Variable:                      y   No. Observations:                13900
Model:               SARIMAX(0, 1, 4)   Log Likelihood              -40755.357
Date:                Tue, 14 May 2024   AIC                          81520.713
Time:                        08:58:24   BIC                          81558.411
Sample:                    05-07-2020   HQIC                         81533.268
                         - 12-07-2021
Covariance Type:                  opg
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ma.L1         -0.5052      0.004   -130.751      0.000      -0.513      -0.498
ma.L2         -0.2455      0.006    -44.035      0.000      -0.256      -0.235
ma.L3         -0.0998      0.009    -11.591      0.000      -0.117      -0.083
ma.L4         -0.1261      0.007    -18.049      0.000      -0.140      -0.112
sigma2        20.6235      0.069    297.137      0.000      20.487      20.760
===================================================================================
Ljung-Box (L1) (Q):                   1.26   Jarque-Bera (JB):            840496.71
Prob(Q):                              0.26   Prob(JB):                         0.00
Heteroskedasticity (H):               1.51   Skew:                             2.37
Prob(H) (two-sided):                  0.00   Kurtosis:                        40.80
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
2024-05-14 08:58:24,689 - INFO - [4.1] The ARIMA Models ...
Applying Auto ARIMA Model Selection for Site2-Takapuna (PM2.5)
✅ Step 1: Data Loading & Splitting 80/20
Training & Testing Data split (80/20): 13,887 train samples and 3,472 test samples.
Performing stepwise search to minimize aic
 ARIMA(2,1,2)(0,0,0)[0] intercept   : AIC=inf, Time=24.16 sec
 ARIMA(0,1,0)(0,0,0)[0] intercept   : AIC=45019.870, Time=0.18 sec
 ARIMA(1,1,0)(0,0,0)[0] intercept   : AIC=40348.147, Time=0.44 sec
 ARIMA(0,1,1)(0,0,0)[0] intercept   : AIC=37707.821, Time=1.45 sec
 ARIMA(0,1,0)(0,0,0)[0]             : AIC=45017.871, Time=0.19 sec
 ARIMA(1,1,1)(0,0,0)[0] intercept   : AIC=37596.087, Time=2.16 sec
 ARIMA(2,1,1)(0,0,0)[0] intercept   : AIC=37596.544, Time=1.83 sec
 ARIMA(1,1,2)(0,0,0)[0] intercept   : AIC=inf, Time=10.19 sec
 ARIMA(0,1,2)(0,0,0)[0] intercept   : AIC=37595.386, Time=2.19 sec
 ARIMA(0,1,3)(0,0,0)[0] intercept   : AIC=37596.990, Time=4.48 sec
 ARIMA(1,1,3)(0,0,0)[0] intercept   : AIC=inf, Time=22.42 sec
 ARIMA(0,1,2)(0,0,0)[0]             : AIC=37593.386, Time=0.51 sec
 ARIMA(0,1,1)(0,0,0)[0]             : AIC=37705.822, Time=0.40 sec
 ARIMA(1,1,2)(0,0,0)[0]             : AIC=inf, Time=0.94 sec
 ARIMA(0,1,3)(0,0,0)[0]             : AIC=37594.991, Time=0.60 sec
 ARIMA(1,1,1)(0,0,0)[0]             : AIC=37594.087, Time=0.52 sec
 ARIMA(1,1,3)(0,0,0)[0]             : AIC=inf, Time=1.90 sec

Best model:  ARIMA(0,1,2)(0,0,0)[0]
Total fit time: 74.588 seconds

2024-05-14 08:59:39,394 - INFO - ✅ Best model for Site2-Takapuna (PM2.5): ARIMA(0, 1, 2)
2024-05-14 08:59:39,489 - INFO - SARIMAX Results
==============================================================================
Dep. Variable:                      y   No. Observations:                13887
Model:               SARIMAX(0, 1, 2)   Log Likelihood              -18793.693
Date:                Tue, 14 May 2024   AIC                          37593.386
Time:                        08:59:39   BIC                          37616.002
Sample:                    05-07-2020   HQIC                         37600.919
                         - 12-07-2021
Covariance Type:                  opg
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ma.L1          0.8354      0.002    401.874      0.000       0.831       0.839
ma.L2          0.0906      0.002     36.543      0.000       0.086       0.096
sigma2         0.8771      0.003    347.622      0.000       0.872       0.882
===================================================================================
Ljung-Box (L1) (Q):                   0.00   Jarque-Bera (JB):            792515.25
Prob(Q):                              0.95   Prob(JB):                         0.00
Heteroskedasticity (H):               1.14   Skew:                             0.35
Prob(H) (two-sided):                  0.00   Kurtosis:                        40.00
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
2024-05-14 08:59:39,491 - INFO - [4.1] The ARIMA Models ...

Applying Auto ARIMA Model Selection for Site1-Penrose (PM10)
✅ Step 1: Data Loading & Splitting 80/20
Training & Testing Data split (80/20): 13,900 train samples and 3,475 test samples.
Performing stepwise search to minimize aic
 ARIMA(2,0,2)(0,0,0)[0]             : AIC=inf, Time=2.29 sec
 ARIMA(0,0,0)(0,0,0)[0]             : AIC=115738.588, Time=0.05 sec
 ARIMA(1,0,0)(0,0,0)[0]             : AIC=91231.210, Time=0.14 sec
 ARIMA(0,0,1)(0,0,0)[0]             : AIC=105427.979, Time=0.38 sec
 ARIMA(2,0,0)(0,0,0)[0]             : AIC=89279.111, Time=0.27 sec
 ARIMA(3,0,0)(0,0,0)[0]             : AIC=88829.508, Time=0.38 sec
 ARIMA(4,0,0)(0,0,0)[0]             : AIC=88717.131, Time=0.39 sec
 ARIMA(5,0,0)(0,0,0)[0]             : AIC=88648.671, Time=0.53 sec
 ARIMA(6,0,0)(0,0,0)[0]             : AIC=88583.442, Time=0.60 sec
 ARIMA(7,0,0)(0,0,0)[0]             : AIC=88521.984, Time=0.82 sec
 ARIMA(8,0,0)(0,0,0)[0]             : AIC=88480.799, Time=1.14 sec
 ARIMA(9,0,0)(0,0,0)[0]             : AIC=88449.433, Time=1.19 sec
 ARIMA(10,0,0)(0,0,0)[0]            : AIC=88413.667, Time=1.31 sec
 ARIMA(11,0,0)(0,0,0)[0]            : AIC=inf, Time=1.81 sec
 ARIMA(10,0,1)(0,0,0)[0]            : AIC=inf, Time=10.12 sec
 ARIMA(9,0,1)(0,0,0)[0]             : AIC=inf, Time=8.68 sec
 ARIMA(11,0,1)(0,0,0)[0]            : AIC=inf, Time=11.78 sec
 ARIMA(10,0,0)(0,0,0)[0] intercept  : AIC=87884.782, Time=8.19 sec
 ARIMA(9,0,0)(0,0,0)[0] intercept   : AIC=87885.059, Time=7.37 sec
 ARIMA(11,0,0)(0,0,0)[0] intercept  : AIC=87884.909, Time=11.73 sec
 ARIMA(10,0,1)(0,0,0)[0] intercept  : AIC=87886.886, Time=24.70 sec
 ARIMA(9,0,1)(0,0,0)[0] intercept   : AIC=87887.126, Time=22.25 sec
 ARIMA(11,0,1)(0,0,0)[0] intercept  : AIC=87887.532, Time=44.24 sec

Best model:  ARIMA(10,0,0)(0,0,0)[0] intercept
Total fit time: 160.375 seconds
2024-05-14 09:02:20,067 - INFO - ✅ Best model for Site1-Penrose (PM10): ARIMA(10, 0, 0)
2024-05-14 09:02:20,171 - INFO - SARIMAX Results
==============================================================================
Dep. Variable:                      y   No. Observations:                13900
Model:              SARIMAX(10, 0, 0)   Log Likelihood              -43930.391
Date:                Tue, 14 May 2024   AIC                          87884.782
Time:                        09:02:20   BIC                          87975.258
Sample:                    05-07-2020   HQIC                         87914.913
                         - 12-07-2021
Covariance Type:                  opg
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
intercept      3.1601      0.138     22.916      0.000       2.890       3.430
ar.L1          0.4353      0.006     70.585      0.000       0.423       0.447
ar.L2          0.1911      0.008     24.366      0.000       0.176       0.206
ar.L3          0.0855      0.009      9.878      0.000       0.069       0.102
ar.L4          0.0161      0.009      1.717      0.086      -0.002       0.035
ar.L5          0.0018      0.009      0.191      0.849      -0.017       0.020
ar.L6          0.0057      0.009      0.610      0.542      -0.013       0.024
ar.L7          0.0122      0.009      1.302      0.193      -0.006       0.030
ar.L8          0.0057      0.008      0.672      0.501      -0.011       0.022
ar.L9          0.0016      0.009      0.177      0.859      -0.016       0.020
ar.L10         0.0128      0.008      1.530      0.126      -0.004       0.029
sigma2        32.5718      0.241    135.006      0.000      32.099      33.045
===================================================================================
Ljung-Box (L1) (Q):                   0.00   Jarque-Bera (JB):             12022.16
Prob(Q):                              0.99   Prob(JB):                         0.00
Heteroskedasticity (H):               1.21   Skew:                             0.70
Prob(H) (two-sided):                  0.00   Kurtosis:                         7.34
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
2024-05-14 09:02:20,174 - INFO - [4.1] The ARIMA Models ...
Applying Auto ARIMA Model Selection for Site2-Takapuna (PM10)
✅ Step 1: Data Loading & Splitting 80/20
Training & Testing Data split (80/20): 13,887 train samples and 3,472 test samples.
Performing stepwise search to minimize aic
 ARIMA(2,0,2)(0,0,0)[0]             : AIC=78822.388, Time=1.23 sec
 ARIMA(0,0,0)(0,0,0)[0]             : AIC=112846.654, Time=0.05 sec
 ARIMA(1,0,0)(0,0,0)[0]             : AIC=85707.873, Time=0.15 sec
 ARIMA(0,0,1)(0,0,0)[0]             : AIC=95901.076, Time=0.36 sec
 ARIMA(1,0,2)(0,0,0)[0]             : AIC=80273.049, Time=0.57 sec
 ARIMA(2,0,1)(0,0,0)[0]             : AIC=80774.015, Time=0.57 sec
 ARIMA(3,0,2)(0,0,0)[0]             : AIC=78824.229, Time=1.63 sec
 ARIMA(2,0,3)(0,0,0)[0]             : AIC=78824.188, Time=1.92 sec
 ARIMA(1,0,1)(0,0,0)[0]             : AIC=81175.004, Time=0.40 sec
 ARIMA(1,0,3)(0,0,0)[0]             : AIC=79249.268, Time=1.11 sec
 ARIMA(3,0,1)(0,0,0)[0]             : AIC=80278.950, Time=1.02 sec
 ARIMA(3,0,3)(0,0,0)[0]             : AIC=78825.089, Time=2.04 sec
 ARIMA(2,0,2)(0,0,0)[0] intercept   : AIC=78736.171, Time=12.06 sec
 ARIMA(1,0,2)(0,0,0)[0] intercept   : AIC=78889.262, Time=3.09 sec
 ARIMA(2,0,1)(0,0,0)[0] intercept   : AIC=78892.435, Time=2.96 sec
 ARIMA(3,0,2)(0,0,0)[0] intercept   : AIC=78735.840, Time=15.04 sec
 ARIMA(3,0,1)(0,0,0)[0] intercept   : AIC=78833.654, Time=4.73 sec
 ARIMA(4,0,2)(0,0,0)[0] intercept   : AIC=78757.841, Time=21.56 sec
 ARIMA(3,0,3)(0,0,0)[0] intercept   : AIC=78736.360, Time=22.63 sec
 ARIMA(2,0,3)(0,0,0)[0] intercept   : AIC=78734.785, Time=19.21 sec
 ARIMA(1,0,3)(0,0,0)[0] intercept   : AIC=78786.317, Time=5.41 sec
 ARIMA(2,0,4)(0,0,0)[0] intercept   : AIC=78746.403, Time=27.94 sec
 ARIMA(1,0,4)(0,0,0)[0] intercept   : AIC=78754.753, Time=11.18 sec
 ARIMA(3,0,4)(0,0,0)[0] intercept   : AIC=78737.181, Time=30.55 sec

Best model:  ARIMA(2,0,3)(0,0,0)[0] intercept
Total fit time: 187.483 seconds

2024-05-14 09:05:27,878 - INFO - ✅ Best model for Site2-Takapuna (PM10): ARIMA(2, 0, 3)
2024-05-14 09:05:27,986 - INFO - SARIMAX Results
==============================================================================
Dep. Variable:                      y   No. Observations:                13887
Model:               SARIMAX(2, 0, 3)   Log Likelihood              -39360.392
Date:                Tue, 14 May 2024   AIC                          78734.785
Time:                        09:05:27   BIC                          78787.556
Sample:                    05-07-2020   HQIC                         78752.360
                         - 12-07-2021
Covariance Type:                  opg
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
intercept      0.2385      0.043      5.583      0.000       0.155       0.322
ar.L1          1.3873      0.014    102.426      0.000       1.361       1.414
ar.L2         -0.4080      0.011    -38.577      0.000      -0.429      -0.387
ma.L1         -0.0155      0.014     -1.144      0.253      -0.042       0.011
ma.L2         -0.7895      0.008    -95.881      0.000      -0.806      -0.773
ma.L3         -0.0482      0.007     -7.264      0.000      -0.061      -0.035
sigma2        16.9697      0.014   1241.426      0.000      16.943      16.997
===================================================================================
Ljung-Box (L1) (Q):                   0.02   Jarque-Bera (JB):         362991240.28
Prob(Q):                              0.88   Prob(JB):                         0.00
Heteroskedasticity (H):               0.26   Skew:                            16.64
Prob(H) (two-sided):                  0.00   Kurtosis:                       794.34
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
def apply_auto_sarima_all_sites(df1, df2, target_cols):
"""
Apply Auto SARIMA Model Selection across multiple sites and pollutants.
Parameters:
- df1: DataFrame for Site 1
- df2: DataFrame for Site 2
- target_cols: List of target columns (pollutants) for analysis
Returns:
- models: Dictionary containing fitted models for each site and pollutant
"""
## Dictionary to store models for each site and target column
models = {}
## Iterate over each pollutant
for target_col in target_cols:
# Iterate over each site
for df, site_name in zip([df1, df2], ["Site1-Penrose", "Site2-Takapuna"]):
            print(f'\nApplying Auto SARIMA Model Selection for {site_name} ({target_col})')
## Initialize the ARIMA analyzer
analyzer = ArimaTimeSeriesAnalyzer(df, site_name, target_col, is_auto_arima=True)
## Prepare data by performing a train-test split
analyzer.train_test_split(train_size=0.8)
# ## Visualize time series and lag plots
# analyzer.timeseries_visualization(lags=15, max_diff_order=1)
            ## Applying Auto SARIMA to find the best model (auto_sarima_model returns the fitted model)
            model = analyzer.auto_sarima_model(analyzer.y_train)
            ## Storing model results
            models[(site_name, target_col)] = model
            ## Print model summary
            if model:
                logging.info(f"✅ Best model for {site_name} ({target_col}): SARIMA{model.order}x{model.seasonal_order}")
                logging.info(model.summary())
            else:
                logging.info('❌ Failed to determine a suitable model.')
    return models
target_cols = ['PM2.5', 'PM10'] # List of pollutants to analyze
data_penrose = data_series_df1.copy() ## the DataFrame for Site1-Penrose
data_takapuna = data_series_df2.copy() ## the DataFrame for Site2-Takapuna
## Call the function with data and target columns
models = apply_auto_sarima_all_sites(data_penrose, data_takapuna, target_cols)
2024-05-21 14:56:10,920 - INFO - [4.1] The ARIMA Models ...

Applying Auto SARIMA Model Selection for Site1-Penrose (PM2.5)
✅ Step 1: Data Loading & Splitting 80/20
Training & Testing Data split (80/20): 13,900 train samples and 3,475 test samples.
[4.2] The SARIMA Models ... Note: do a preliminary check for stationarity, such as an ADF test, if needed!
Performing stepwise search to minimize aic
 ARIMA(2,1,2)(1,0,1)[24] intercept   : AIC=inf, Time=149.59 sec
 ARIMA(0,1,0)(0,0,0)[24] intercept   : AIC=85246.817, Time=0.24 sec
 ARIMA(1,1,0)(1,0,0)[24] intercept   : AIC=84058.923, Time=18.85 sec
 ARIMA(0,1,1)(0,0,1)[24] intercept   : AIC=83140.927, Time=17.42 sec
 ARIMA(0,1,0)(0,0,0)[24]             : AIC=85244.818, Time=0.12 sec
 ARIMA(0,1,1)(0,0,0)[24] intercept   : AIC=83375.633, Time=0.88 sec
 ARIMA(0,1,1)(1,0,1)[24] intercept   : AIC=inf, Time=104.13 sec
 ARIMA(0,1,1)(0,0,2)[24] intercept   : AIC=82960.450, Time=60.51 sec
 ARIMA(0,1,1)(1,0,2)[24] intercept   : AIC=inf, Time=211.91 sec
 ARIMA(0,1,0)(0,0,2)[24] intercept   : AIC=85106.953, Time=42.82 sec
 ARIMA(1,1,1)(0,0,2)[24] intercept   : AIC=inf, Time=136.34 sec
 ARIMA(0,1,2)(0,0,2)[24] intercept   : AIC=81772.007, Time=86.94 sec
 ARIMA(0,1,2)(0,0,1)[24] intercept   : AIC=81969.054, Time=34.85 sec
 ARIMA(0,1,2)(1,0,2)[24] intercept   : AIC=inf, Time=293.02 sec
 ARIMA(0,1,2)(1,0,1)[24] intercept   : AIC=inf, Time=116.07 sec
 ARIMA(1,1,2)(0,0,2)[24] intercept   : AIC=inf, Time=306.30 sec
 ARIMA(0,1,2)(0,0,2)[24]             : AIC=81770.007, Time=17.64 sec
 ARIMA(0,1,2)(0,0,1)[24]             : AIC=81967.055, Time=4.62 sec
Multi-Step Forecasting
from sklearn.metrics import mean_absolute_percentage_error
def one_step_forecast(analyzer, test):
"""
Create multi-step forecasts and plots them alongside the test data.
"""
test_len = len(test)
## Create predictions for the future, evaluate on test
fcast, conf_int = analyzer.model.predict(n_periods=test_len, return_conf_int=True, alpha=0.05)
## Plotting
plot_it(analyzer.train_data[analyzer.target_col], analyzer.test_data[analyzer.target_col], fcast, conf_int)
def one_period_forecast(model, n_periods=168):
"""
Forecast multiple periods ahead with confidence intervals.
Parameters:
model (ARIMA): The ARIMA model used for forecasting.
n_periods (int): Number of periods to forecast.
Returns:
tuple: A tuple containing forecasts and their confidence intervals.
"""
## Create predictions for the future, evaluate on test
fcast, conf_int = model.predict(n_periods=n_periods, return_conf_int=True, alpha=0.05)
return fcast.tolist(), np.asarray(conf_int).tolist()
def multi_step_forecast(analyzer, test, period='week'):
"""
Generate multi-step forecasts and update the model with actual observations for each period/forecast.
Also plots them alongside the test data.
Parameters:
- analyzer (TimeSeriesAnalyzer): An instance containing the ARIMA model and data.
- test (pd.Series): Test data for the target column.
- period (str): Chunk size for grouping the test data: 'week' (168 hours of hourly data) or 'day' (24 hours); anything else forecasts the full test set at once.
Uses dynamic forecasting where the model is updated after each new observation.
Prints progress and debugging information.
"""
## Initialize lists to hold predictions and confidence intervals
forecasts = []
confidence_intervals = []
## Determine the chunk size from the period: one week = 24 * 7 = 168 hours, one day = 24 hours;
## otherwise forecast the whole test set in a single chunk
period_length = 168 if period == 'week' else 24 if period == 'day' else len(test)
## Generate predictions and update the model for each period: Loop over the test data in chunks
for start_idx in range(0, len(test), period_length):
end_idx = min(start_idx + period_length, len(test))
current_test = test.iloc[start_idx:end_idx]
## Perform forecasting for the current period
fc, conf = one_period_forecast(analyzer.model, n_periods=len(current_test))
forecasts.extend(fc)
confidence_intervals.extend(conf)
## Adaptive Forecasting: Update the model with actual observations from the current period
analyzer.model.update(current_test.values)
logging.debug(f'Iteration from {start_idx} to {end_idx}: Model updated with actual data with Forecast={fc}, CI={conf}.')
## Plot the results: Plotting the forecasts against the actual data
plot_it(analyzer.train_data[analyzer.target_col], test, forecasts, np.array(confidence_intervals))
def plot_it(train, test, forecasts, confidence_intervals):
"""
Plot training data, test data, forecasts, and confidence intervals.
Parameters:
- train (pd.Series): Training data series.
- test (pd.Series): Test data series.
- forecasts (list): List of forecasted values.
- confidence_intervals (array): Array of confidence interval bounds.
"""
fig, ax = plt.subplots(figsize=(18, 10))
## Plot training data
ax.plot(train.index, train, color='blue', label='Training Data')
## Plot actual test data
ax.plot(test.index, test, color='green', label='Actual Test Data')
## Plotting Confidence Intervals
ax.fill_between(test.index, confidence_intervals[:, 0], confidence_intervals[:, 1], color='orange', alpha=0.3, label='95% Confidence Interval')
## Plot forecasts
ax.plot(test.index, forecasts, color='red', label='Forecasted Data', marker='o')
ax.set_title('Forecast Evaluation: Forecast vs Actual Test Data')
ax.set_xlabel('Date')
ax.set_ylabel(f'{train.name}')
ax.legend()
plt.grid(True)
plt.show()
## Evaluate and log forecast accuracy (see the note on near-zero actuals and MAPE after this function)
mape = mean_absolute_percentage_error(test, forecasts)
logging.info(f"MAPE for the forecast: {mape:.2%}")
def Time_Series_EDA(data: pd.DataFrame, site: str, target_col: str, train_size: float = 0.8, is_auto_arima=False, models=None):
"""
Conduct Time Series EDA, Model Selection, and Forecasting.
Utilize the ArimaTimeSeriesAnalyzer class to apply analysis to any site and target pollutant.
"""
analyzer = ArimaTimeSeriesAnalyzer(data, site, target_col, is_auto_arima=is_auto_arima)
logging.info(f'β Step 1: Preparing Data for {site} ({target_col})')
analyzer.train_test_split(train_size)
logging.info(f'β Step 2: Visualizing Time Series and Lag Plots for {site} ({target_col})')
analyzer.timeseries_visualization(lags=15, max_diff_order=1)
logging.info('β Step 2*: If the data is skewed, transform it (using a log or Box-Cox transformation) to stabilise the variance.')
if is_auto_arima:
logging.info(f'β Step 3: Applying Auto ARIMA Model Selection for {site} ({target_col})')
analyzer.auto_arima_model(analyzer.y_train)
else:
logging.info('β Step 4: Selecting the Best Model Based on AIC')
analyzer.model = analyzer.model_selection(models, analyzer.y_train, analyzer.y_test, preference='AIC')
if analyzer.model:
logging.info(f"Best model based on AIC = {analyzer.model.aic():.2f} \n {analyzer.model.summary()}")
else:
logging.info('β Failed to determine a suitable model.')
logging.info('β Step 5: Checking Model Diagnostics: Generate and Display Diagnostic Plots for the Fitted Model.')
logging.info(f'π 5.1. [Default] 4 Model Diagnostics plots: The Standardized Residual, Histogram plus KDE estimate (N(0,1): the normal distribution), Normal Q-Q, and the Correlogram. ...')
# analyzer.model.plot_diagnostics(figsize=(15,12))
analyzer.model.plot_diagnostics(figsize=(10, 8))
plt.show()
logging.info(f'π 5.2. [Manual] Check the Residuals by plotting the ACF of the Residuals ...')
if hasattr(analyzer.model, 'resid'):
residuals = analyzer.model.resid ## Get the residuals from an instance of a fitted ARIMA model
plot_residuals(residuals) ## Plotting the residuals
logging.info('β Step 6: Conducting One-Step Forecast')
one_step_forecast(analyzer, analyzer.test_data[analyzer.target_col])
logging.info('β Step 7.1: Conducting Multi-Step Forecast for the Next Weeks')
multi_step_forecast(analyzer, analyzer.test_data[analyzer.target_col], period='week')
logging.info('β Step 7.2: Conducting Multi-Step Forecast for the Next Days')
multi_step_forecast(analyzer, analyzer.test_data[analyzer.target_col], period='day')
return analyzer
## β Best ARIMA model for Site1-Penrose (PM2.5): ARIMA(0, 1, 4)
## β Best ARIMA model for Site2-Takapuna (PM2.5): ARIMA(0, 1, 2)
## β Best ARIMA model for Site1-Penrose (PM10): ARIMA(10, 0, 0)
## β Best ARIMA model for Site2-Takapuna (PM10): ARIMA(2, 0, 3)
## Fixed (p, d, q) orders taken from the auto-ARIMA results recorded above
models_penrose_pm25 = [('ARIMA(0,1,4)', (0, 1, 4), None)]
models_penrose_pm10 = [('ARIMA(10,0,0)', (10, 0, 0), None)]
models_takapuna_pm25 = [('ARIMA(0,1,2)', (0, 1, 2), None)]
models_takapuna_pm10 = [('ARIMA(2,0,3)', (2, 0, 3), None)]
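For reference, a (name, order, seasonal_order) tuple like those above can be fitted directly with pmdarima. This is a minimal sketch assuming a pd.Series y_train of hourly PM2.5 values; the actual model_selection logic inside ArimaTimeSeriesAnalyzer may differ.

from pmdarima.arima import ARIMA

## Minimal sketch: fit a fixed-order model from a (name, order, seasonal_order) tuple
name, order, seasonal_order = models_penrose_pm25[0]
fixed_model = ARIMA(order=order,
                    seasonal_order=seasonal_order or (0, 0, 0, 0),  # non-seasonal if None
                    suppress_warnings=True)
# fixed_model.fit(y_train)   # y_train: hourly PM2.5 training series
# print(fixed_model.aic())   # compare candidates on AIC, as model_selection does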
## Penrose Time Series EDA
logging.info("ππ Time Series EDA: Utilize the ArimaTimeSeriesAnalyzer class to apply analysis to the 'Penrose' site and the 'PM2.5' target pollutant.")
analyzer_penrose_pm25 = Time_Series_EDA(data_series_df1, 'Penrose', 'PM2.5', is_auto_arima=False, models=models_penrose_pm25)
logging.info("ππ Time Series EDA: Utilize the ArimaTimeSeriesAnalyzer class to apply analysis to the 'Penrose' site and the 'PM10' target pollutant.")
analyzer_penrose_pm10 = Time_Series_EDA(data_series_df1, 'Penrose', 'PM10', is_auto_arima=False, models=models_penrose_pm10)
## Takapuna Time Series EDA
logging.info("ππ Time Series EDA: Utilize the ArimaTimeSeriesAnalyzer class to apply analysis to the 'Takapuna' site and the 'PM2.5' target pollutant.")
analyzer_takapuna_pm25 = Time_Series_EDA(data_series_df2, 'Takapuna', 'PM2.5', is_auto_arima=False, models=models_takapuna_pm25)
logging.info("ππ Time Series EDA: Utilize the ArimaTimeSeriesAnalyzer class to apply analysis to the 'Takapuna' site and the 'PM10' target pollutant.")
analyzer_takapuna_pm10 = Time_Series_EDA(data_series_df2, 'Takapuna', 'PM10', is_auto_arima=False, models=models_takapuna_pm10)
2024-05-14 09:13:48,120 - INFO - ππ Time Series EDA: Utilize the ArimaTimeSeriesAnalyzer class to apply analysis to 'Penrose' site and 'PM2.5' target pollutant. 2024-05-14 09:13:48,125 - INFO - β Step 1: Preparing Data for Penrose (PM2.5) 2024-05-14 09:13:48,128 - INFO - β Step 2: Visualizing Time Series and Lag Plots for Penrose (PM2.5)
β Step 1: Data Loading & Splitting 80/20 π Training & Testing Data split (80/20): 13,900 train samples and 3,475 test samples. β Step 2: Time Series Visualization π [2.1] [Default] Time Series Display ...
π [2.1] Plot the original time series: visual sense of trends, seasonality, or irregular patterns.
π [2.2] Apply 1 Order Differencing to the time series and visualizes results.
π [2.3] Plotting the ACF to suggest possible MA orders.
π [2.3] Plotting the PACF to suggest possible AR orders.
2024-05-14 09:13:55,786 - INFO - β Step 2*: If data is skewed, transform the data (using a Log or Box-Cox transformation) to stabilise the variance. 2024-05-14 09:13:55,788 - INFO - β Step 4: Selecting the Best Model Based on AIC 2024-05-14 09:13:55,806 - INFO - π Fitting the ARIMA/SARIMA ARIMA(0,1,4) to the training data... 2024-05-14 09:14:06,662 - INFO - Model: ARIMA(0,1,4) - RMSE: 4.58, MAE: 3.53, MAPE: inf%, AIC: 81522.71, BIC: 81567.95 2024-05-14 09:14:06,667 - INFO - The best model based on AIC: ARIMA(0,1,4) with AIC: 81522.71 2024-05-14 09:14:06,811 - INFO - Best model based on AIC = 81522.71 SARIMAX Results ============================================================================== Dep. Variable: y No. Observations: 13900 Model: SARIMAX(0, 1, 4) Log Likelihood -40755.356 Date: Tue, 14 May 2024 AIC 81522.712 Time: 09:14:06 BIC 81567.949 Sample: 05-07-2020 HQIC 81537.777 - 12-07-2021 Covariance Type: opg ============================================================================== coef std err z P>|z| [0.025 0.975] ------------------------------------------------------------------------------ intercept -3.539e-05 0.001 -0.036 0.972 -0.002 0.002 ma.L1 -0.5052 0.004 -130.166 0.000 -0.513 -0.498 ma.L2 -0.2455 0.006 -43.965 0.000 -0.256 -0.235 ma.L3 -0.0997 0.009 -11.585 0.000 -0.117 -0.083 ma.L4 -0.1261 0.007 -18.013 0.000 -0.140 -0.112 sigma2 20.6245 0.071 288.519 0.000 20.484 20.765 =================================================================================== Ljung-Box (L1) (Q): 1.27 Jarque-Bera (JB): 840531.64 Prob(Q): 0.26 Prob(JB): 0.00 Heteroskedasticity (H): 1.51 Skew: 2.37 Prob(H) (two-sided): 0.00 Kurtosis: 40.80 =================================================================================== Warnings: [1] Covariance matrix calculated using the outer product of gradients (complex-step). 2024-05-14 09:14:06,826 - INFO - β Step 5: Checking Model Diagnostics: Generates and Displays Diagnostic Plots for the Fitted Model. 2024-05-14 09:14:06,826 - INFO - π 5.1. [Default] 4 Model Diagnostics plots: The Standardized Residual, Histogram plus KDE estimate (N(0,1): the normal distribution), Normal Q-Q, and the Correlogram. ...
2024-05-14 09:14:07,340 - INFO - π 5.2. [Manual] Check the Residuals by plotting the ACF of the Residuals ...
2024-05-14 09:14:07,938 - INFO - β Step 6: Conducting One-Step Forecast
2024-05-14 09:14:08,251 - INFO - MAPE for the forecast: 8479870924121942.00% 2024-05-14 09:14:08,251 - INFO - β Step 7.1: Conducting Multi-Step Forecast for Next Weeks
2024-05-14 09:16:51,827 - INFO - MAPE for the forecast: 6881842348181038.00% 2024-05-14 09:16:51,827 - INFO - β Step 7.2: Conducting Multi-Step Forecast for Next Days
2024-05-14 09:28:33,919 - INFO - MAPE for the forecast: 6337679726063511.00% 2024-05-14 09:28:33,919 - INFO - ππ Time Series EDA: Utilize the ArimaTimeSeriesAnalyzer class to apply analysis to 'Penrose' site and 'PM2.5' target pollutant. 2024-05-14 09:28:33,920 - INFO - β Step 1: Preparing Data for Penrose (PM10) 2024-05-14 09:28:33,920 - INFO - β Step 2: Visualizing Time Series and Lag Plots for Penrose (PM10)
β Step 1: Data Loading & Splitting 80/20 π Training & Testing Data split (80/20): 13,900 train samples and 3,475 test samples. β Step 2: Time Series Visualization π [2.1] [Default] Time Series Display ...
π [2.1] Plot the original time series: visual sense of trends, seasonality, or irregular patterns.
π [2.2] Apply 1 Order Differencing to the time series and visualizes results.
π [2.3] Plotting the ACF to suggest possible MA orders.
π [2.3] Plotting the PACF to suggest possible AR orders.
2024-05-14 09:28:43,398 - INFO - β Step 2*: If data is skewed, transform the data (using a Log or Box-Cox transformation) to stabilise the variance. 2024-05-14 09:28:43,400 - INFO - β Step 4: Selecting the Best Model Based on AIC 2024-05-14 09:28:43,402 - INFO - π Fitting the ARIMA/SARIMA ARIMA(0,1,2) to the training data... 2024-05-14 09:28:45,151 - INFO - Model: ARIMA(0,1,2) - RMSE: 11.24, MAE: 8.77, MAPE: inf%, AIC: 88598.79, BIC: 88628.95 2024-05-14 09:28:45,153 - INFO - The best model based on AIC: ARIMA(0,1,2) with AIC: 88598.79 2024-05-14 09:28:45,298 - INFO - Best model based on AIC = 88598.79 SARIMAX Results ============================================================================== Dep. Variable: y No. Observations: 13900 Model: SARIMAX(0, 1, 2) Log Likelihood -44295.396 Date: Tue, 14 May 2024 AIC 88598.793 Time: 09:28:45 BIC 88628.951 Sample: 05-07-2020 HQIC 88608.837 - 12-07-2021 Covariance Type: opg ============================================================================== coef std err z P>|z| [0.025 0.975] ------------------------------------------------------------------------------ intercept 0.0004 0.020 0.019 0.985 -0.039 0.040 ma.L1 -0.5236 0.006 -86.317 0.000 -0.535 -0.512 ma.L2 -0.0896 0.007 -12.927 0.000 -0.103 -0.076 sigma2 34.3288 0.244 140.409 0.000 33.850 34.808 =================================================================================== Ljung-Box (L1) (Q): 0.79 Jarque-Bera (JB): 10409.50 Prob(Q): 0.38 Prob(JB): 0.00 Heteroskedasticity (H): 1.24 Skew: 0.54 Prob(H) (two-sided): 0.00 Kurtosis: 7.10 =================================================================================== Warnings: [1] Covariance matrix calculated using the outer product of gradients (complex-step). 2024-05-14 09:28:45,309 - INFO - β Step 5: Checking Model Diagnostics: Generates and Displays Diagnostic Plots for the Fitted Model. 2024-05-14 09:28:45,309 - INFO - π 5.1. [Default] 4 Model Diagnostics plots: The Standardized Residual, Histogram plus KDE estimate (N(0,1): the normal distribution), Normal Q-Q, and the Correlogram. ...
2024-05-14 09:28:45,951 - INFO - π 5.2. [Manual] Check the Residuals by plotting the ACF of the Residuals ...
2024-05-14 09:28:47,393 - INFO - β Step 6: Conducting One-Step Forecast
2024-05-14 09:28:47,722 - INFO - MAPE for the forecast: 5461844894654976.00% 2024-05-14 09:28:47,722 - INFO - β Step 7.1: Conducting Multi-Step Forecast for Next Weeks
2024-05-14 09:29:22,342 - INFO - MAPE for the forecast: 3868016845709311.00% 2024-05-14 09:29:22,342 - INFO - β Step 7.2: Conducting Multi-Step Forecast for Next Days
2024-05-14 09:33:02,316 - INFO - MAPE for the forecast: 3045492723406247.50% 2024-05-14 09:33:02,316 - INFO - ππ Time Series EDA: Utilize the ArimaTimeSeriesAnalyzer class to apply analysis to 'Penrose' site and 'PM2.5' target pollutant. 2024-05-14 09:33:02,317 - INFO - β Step 1: Preparing Data for Takapuna (PM2.5) 2024-05-14 09:33:02,317 - INFO - β Step 2: Visualizing Time Series and Lag Plots for Takapuna (PM2.5)
β Step 1: Data Loading & Splitting 80/20 π Training & Testing Data split (80/20): 13,887 train samples and 3,472 test samples. β Step 2: Time Series Visualization π [2.1] [Default] Time Series Display ...
π [2.1] Plot the original time series: visual sense of trends, seasonality, or irregular patterns.
π [2.2] Apply 1 Order Differencing to the time series and visualizes results.
π [2.3] Plotting the ACF to suggest possible MA orders.
π [2.3] Plotting the PACF to suggest possible AR orders.
2024-05-14 09:33:09,591 - INFO - β Step 2*: If data is skewed, transform the data (using a Log or Box-Cox transformation) to stabilise the variance. 2024-05-14 09:33:09,592 - INFO - β Step 4: Selecting the Best Model Based on AIC 2024-05-14 09:33:09,593 - INFO - π Fitting the ARIMA/SARIMA ARIMA(10,0,0) to the training data... 2024-05-14 09:33:24,327 - INFO - Model: ARIMA(10,0,0) - RMSE: 2.12, MAE: 1.68, MAPE: 40.74%, AIC: 36614.49, BIC: 36704.95 2024-05-14 09:33:24,328 - INFO - The best model based on AIC: ARIMA(10,0,0) with AIC: 36614.49 2024-05-14 09:33:24,470 - INFO - Best model based on AIC = 36614.49 SARIMAX Results ============================================================================== Dep. Variable: y No. Observations: 13887 Model: SARIMAX(10, 0, 0) Log Likelihood -18295.243 Date: Tue, 14 May 2024 AIC 36614.485 Time: 09:33:24 BIC 36704.950 Sample: 05-07-2020 HQIC 36644.614 - 12-07-2021 Covariance Type: opg ============================================================================== coef std err z P>|z| [0.025 0.975] ------------------------------------------------------------------------------ intercept 0.4089 0.018 22.875 0.000 0.374 0.444 ar.L1 1.7644 0.003 703.954 0.000 1.759 1.769 ar.L2 -1.3783 0.005 -262.277 0.000 -1.389 -1.368 ar.L3 0.9730 0.006 150.706 0.000 0.960 0.986 ar.L4 -0.7195 0.007 -104.882 0.000 -0.733 -0.706 ar.L5 0.4857 0.008 64.575 0.000 0.471 0.500 ar.L6 -0.3417 0.009 -39.574 0.000 -0.359 -0.325 ar.L7 0.2382 0.011 22.446 0.000 0.217 0.259 ar.L8 -0.1689 0.011 -14.980 0.000 -0.191 -0.147 ar.L9 0.0983 0.010 10.068 0.000 0.079 0.117 ar.L10 -0.0153 0.005 -2.868 0.004 -0.026 -0.005 sigma2 0.8160 0.003 300.204 0.000 0.811 0.821 =================================================================================== Ljung-Box (L1) (Q): 0.00 Jarque-Bera (JB): 833488.81 Prob(Q): 0.97 Prob(JB): 0.00 Heteroskedasticity (H): 1.17 Skew: 1.28 Prob(H) (two-sided): 0.00 Kurtosis: 40.87 =================================================================================== Warnings: [1] Covariance matrix calculated using the outer product of gradients (complex-step). 2024-05-14 09:33:24,471 - INFO - β Step 5: Checking Model Diagnostics: Generates and Displays Diagnostic Plots for the Fitted Model. 2024-05-14 09:33:24,471 - INFO - π 5.1. [Default] 4 Model Diagnostics plots: The Standardized Residual, Histogram plus KDE estimate (N(0,1): the normal distribution), Normal Q-Q, and the Correlogram. ...
2024-05-14 09:33:25,175 - INFO - π 5.2. [Manual] Check the Residuals by plotting the ACF of the Residuals ...
2024-05-14 09:33:25,719 - INFO - β Step 6: Conducting One-Step Forecast
2024-05-14 09:33:26,013 - INFO - MAPE for the forecast: 40.74% 2024-05-14 09:33:26,013 - INFO - β Step 7.1: Conducting Multi-Step Forecast for Next Weeks
2024-05-14 09:42:56,092 - INFO - MAPE for the forecast: 39.18% 2024-05-14 09:42:56,093 - INFO - β Step 7.2: Conducting Multi-Step Forecast for Next Days
2024-05-14 10:18:49,871 - INFO - MAPE for the forecast: 31.65% 2024-05-14 10:18:49,872 - INFO - ππ Time Series EDA: Utilize the ArimaTimeSeriesAnalyzer class to apply analysis to 'Penrose' site and 'PM2.5' target pollutant. 2024-05-14 10:18:49,872 - INFO - β Step 1: Preparing Data for Takapuna (PM10) 2024-05-14 10:18:49,873 - INFO - β Step 2: Visualizing Time Series and Lag Plots for Takapuna (PM10)
β Step 1: Data Loading & Splitting 80/20 π Training & Testing Data split (80/20): 13,887 train samples and 3,472 test samples. β Step 2: Time Series Visualization π [2.1] [Default] Time Series Display ...
π [2.1] Plot the original time series: visual sense of trends, seasonality, or irregular patterns.
π [2.2] Apply 1 Order Differencing to the time series and visualizes results.
π [2.3] Plotting the ACF to suggest possible MA orders.
π [2.3] Plotting the PACF to suggest possible AR orders.
2024-05-14 10:18:54,773 - INFO - β Step 2*: If data is skewed, transform the data (using a Log or Box-Cox transformation) to stabilise the variance. 2024-05-14 10:18:54,775 - INFO - β Step 4: Selecting the Best Model Based on AIC 2024-05-14 10:18:54,778 - INFO - π Fitting the ARIMA/SARIMA ARIMA(2,0,3) to the training data... 2024-05-14 10:19:17,273 - INFO - Model: ARIMA(2,0,3) - RMSE: 6.62, MAE: 4.21, MAPE: 87.72%, AIC: 78734.78, BIC: 78787.56 2024-05-14 10:19:17,292 - INFO - The best model based on AIC: ARIMA(2,0,3) with AIC: 78734.78 2024-05-14 10:19:17,424 - INFO - Best model based on AIC = 78734.78 SARIMAX Results ============================================================================== Dep. Variable: y No. Observations: 13887 Model: SARIMAX(2, 0, 3) Log Likelihood -39360.392 Date: Tue, 14 May 2024 AIC 78734.785 Time: 10:19:17 BIC 78787.556 Sample: 05-07-2020 HQIC 78752.360 - 12-07-2021 Covariance Type: opg ============================================================================== coef std err z P>|z| [0.025 0.975] ------------------------------------------------------------------------------ intercept 0.2385 0.043 5.583 0.000 0.155 0.322 ar.L1 1.3873 0.014 102.426 0.000 1.361 1.414 ar.L2 -0.4080 0.011 -38.577 0.000 -0.429 -0.387 ma.L1 -0.0155 0.014 -1.144 0.253 -0.042 0.011 ma.L2 -0.7895 0.008 -95.881 0.000 -0.806 -0.773 ma.L3 -0.0482 0.007 -7.264 0.000 -0.061 -0.035 sigma2 16.9697 0.014 1241.426 0.000 16.943 16.997 =================================================================================== Ljung-Box (L1) (Q): 0.02 Jarque-Bera (JB): 362991240.28 Prob(Q): 0.88 Prob(JB): 0.00 Heteroskedasticity (H): 0.26 Skew: 16.64 Prob(H) (two-sided): 0.00 Kurtosis: 794.34 =================================================================================== Warnings: [1] Covariance matrix calculated using the outer product of gradients (complex-step). 2024-05-14 10:19:17,425 - INFO - β Step 5: Checking Model Diagnostics: Generates and Displays Diagnostic Plots for the Fitted Model. 2024-05-14 10:19:17,433 - INFO - π 5.1. [Default] 4 Model Diagnostics plots: The Standardized Residual, Histogram plus KDE estimate (N(0,1): the normal distribution), Normal Q-Q, and the Correlogram. ...
2024-05-14 10:19:17,948 - INFO - π 5.2. [Manual] Check the Residuals by plotting the ACF of the Residuals ...
2024-05-14 10:19:18,484 - INFO - β Step 6: Conducting One-Step Forecast
2024-05-14 10:19:18,758 - INFO - MAPE for the forecast: 87.72% 2024-05-14 10:19:18,758 - INFO - β Step 7.1: Conducting Multi-Step Forecast for Next Weeks
2024-05-14 10:21:15,576 - INFO - MAPE for the forecast: 86.08% 2024-05-14 10:21:15,576 - INFO - β Step 7.2: Conducting Multi-Step Forecast for Next Days
2024-05-14 10:29:19,279 - INFO - MAPE for the forecast: 76.37%
βοΈ Time-Series Analysis & Forecasting with Meta/Facebook's Prophet
- Prophet works by decomposing the time series into three main components: trend, seasonality, and holidays.
- It uses an additive model in which a non-linear trend is combined with yearly, weekly, and daily seasonalities plus holiday effects (see the formula below).
- It models seasonality with Fourier series and holidays as indicator variables.
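Concretely, the additive decomposition Prophet fits can be written as (notation ours, following the Prophet paper by Taylor & Letham):

$$ y(t) = g(t) + s(t) + h(t) + \varepsilon_t, \qquad s(t) = \sum_{n=1}^{N} \left( a_n \cos\frac{2\pi n t}{P} + b_n \sin\frac{2\pi n t}{P} \right) $$

where $g(t)$ is the (piecewise) trend, $h(t)$ the holiday effects, $\varepsilon_t$ the noise term, $P$ the seasonal period (e.g. 24 hours for the daily cycle in this hourly data), and $N$ the Fourier order of the seasonality.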
## Import Prophet class from prophet (fbprophet) library
from prophet import Prophet
from prophet.plot import add_changepoints_to_plot
class AdvancedTimeSeriesForecastingProphet:
def __init__(self, training_data, testing_data, target_variable, changepoint_prior_scale=0.01, interval_width=0.95, daily_seasonality=True, weekly_seasonality=True, yearly_seasonality=True, include_holidays=True, country_code='NZ'):
"""
Initializes the class with training and testing data, target variable, and configurations for the Prophet model.
Ensures the data includes a 'ds' column and the target_variable renamed to 'y'.
Args:
training_data (DataFrame): The training dataset with 'ds' and target_variable columns.
testing_data (DataFrame): The testing dataset with 'ds' and target_variable columns.
target_variable (str): The name of the column to forecast (renamed to 'y' internally).
changepoint_prior_scale (float): The flexibility of the automatic changepoint selection.
interval_width (float): Uncertainty interval width; 95% here (the Prophet default is 80%).
daily_seasonality (bool): Whether to model daily seasonality.
weekly_seasonality (bool): Whether to model weekly seasonality.
yearly_seasonality (bool): Whether to model yearly seasonality.
include_holidays (bool): If True, includes holiday effects for the specified country.
country_code (str): ISO country code for the holiday calendar, e.g. 'NZ' or 'US'.
"""
self.training_data = training_data.reset_index()[['ds', target_variable]].rename(columns={target_variable: 'y'}) ## target_variable --> 'y'
self.testing_data = testing_data.reset_index()[['ds', target_variable]].rename(columns={target_variable: 'y'})
self.target_variable = target_variable
## Initialize and configure the model in Prophet with the default parameters
## Set the uncertainty interval to 95% (the Prophet default is 80%)
self.model = Prophet(
changepoint_prior_scale=changepoint_prior_scale,
daily_seasonality=daily_seasonality,
weekly_seasonality=weekly_seasonality,
yearly_seasonality=yearly_seasonality,
interval_width=interval_width)
if include_holidays:
self.model.add_country_holidays(country_name=country_code) ## 'US' | 'NZ'
self.forecast_df = None
@staticmethod
def train_test_split(data, target_col='PM2.5', train_size=0.8):
"""
βοΈ [1] Train-Test split: Splits the data into training and testing sets.
Uses the first `train_size` fraction as in-time training data and the remainder (20% by default) as out-of-time test data.
:param train_size: float - proportion of the dataset to include in the train split.
"""
train_len = int(len(data) * train_size)
train_data = data.iloc[:train_len]
test_data = data.iloc[train_len:]
print(f'π [{target_col}] Training & Testing Data split ({train_size:.0%}/{1 - train_size:.0%}): {len(train_data):,} train samples and {len(test_data):,} test samples.')
y_train = train_data[target_col]
y_test = test_data[target_col]
return y_train, y_test
## Create a future dataframe for predictions, 24 hours a day | 168 hours a week into the future
## freq='MS' | 'D' | 'H'
def fit(self, periods=7*24, freq='H'):
"""
Fits the Prophet model using the training data.
Generates predictions over the forecast horizon.
Returns:
forecast (DataFrame): The forecast results from Prophet.
periods: We need to specify the number of days in future
"""
## Fit the model to the data
self.model.fit(self.training_data)
## Making Future Predictions
future = self.model.make_future_dataframe(periods=len(self.testing_data), freq=freq)
self.forecast_df = self.model.predict(future)
return self.forecast_df
def validate(self, site_name='Penrose'):
"""
Plots the forecasted values with uncertainty intervals alongside the actual testing data.
Highlights the changepoints and the end of the training period for the specified site.
Also decomposes the forecast into its trend and seasonality components.
"""
if self.forecast_df is None:
raise ValueError("Forecast data is not available for validation.")
## Extract the predicted and actual values for the visualization
forecast = self.forecast_df
actual_testing_data = self.testing_data
last_training_date = self.training_data['ds'].max()
## [Plot 1] Plot the forecast: Shows the actual testing values, forecasted values from the fitted Model with the uncertainty intervals.
forecast_plot = self.model.plot(forecast, uncertainty=True)
## Adding changepoints and training end visual markers
add_changepoints_to_plot(forecast_plot.gca(), self.model, forecast, cp_color='red')
## Add a vertical line marking the end of the training period
axes = forecast_plot.gca()
axes.axvline(x=last_training_date, color='darkred', linestyle='solid', label='Training End')
## plot true test data for the period after the red line
actual_testing_data['ds'] = pd.to_datetime(actual_testing_data['ds'])
plt.plot(actual_testing_data['ds'], actual_testing_data['y'],'ro', markersize=0.2, label='Actual Test Data')
axes.set_title(f'[Prophet][{site_name}] Forecasting vs Actual Testing Data for {self.target_variable}')
axes.set_xlabel('Date (ds)')
axes.set_ylabel(f'{self.target_variable} (y)')
axes.legend()
axes.grid(True)
plt.show()
## [Plot 2] Decomposes the forecast into trend, yearly seasonality, and weekly seasonality components.
if hasattr(self.model, 'plot_components'):
self.model.plot_components(self.forecast_df, uncertainty=True)
## Future Work:
## Changepoints: By default, Prophet automatically detects changepoints in the data; their sensitivity can be adjusted or they can be specified manually.
## Seasonalities: Apart from the built-in daily, weekly, and yearly seasonalities, custom seasonalities (e.g., monthly) can be added.
## Holidays: Holidays and special events can be added to improve the model's accuracy on irregular occurrences. A sketch of these customisations follows.
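A minimal sketch of those customisations, assuming a training frame train_df with 'ds'/'y' columns; the changepoint dates and parameter values are illustrative, not tuned.

from prophet import Prophet
import pandas as pd

## Illustrative only: manual changepoints, a custom monthly seasonality, and NZ holidays
m = Prophet(
    changepoints=pd.to_datetime(['2020-08-12', '2021-02-14']),  # hypothetical dates
    changepoint_prior_scale=0.05,   # larger = more flexible trend
    interval_width=0.95,
)
m.add_seasonality(name='monthly', period=30.5, fourier_order=5)  # custom seasonality
m.add_country_holidays(country_name='NZ')                        # holiday indicator terms
# m.fit(train_df)  # train_df: DataFrame with columns 'ds' (datetime) and 'y'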
## DataFrame must have 'Site', 'ds', and other columns like 'PM2.5'
## Initialize instances for Penrose and Takapuna for PM2.5 prediction using Meta/Facebook Prophet
## Prepare data by performing a train-test split
training_data1, testing_data1 = AdvancedTimeSeriesForecastingProphet.train_test_split(data=data_series_df1, target_col='PM2.5', train_size=0.8)
forecasting_prophet_penrose = AdvancedTimeSeriesForecastingProphet(training_data=training_data1, testing_data=testing_data1, target_variable='PM2.5')
## Fitting models
logging.info("π οΈ Fitting Prophet models for Penrose ...")
forecast = forecasting_prophet_penrose.fit()
## Validating models
logging.info("π¬ Validating Prophet models for Penrose ...")
forecasting_prophet_penrose.validate(site_name='Penrose')
2024-05-14 11:44:15,004 - INFO - π οΈ Fitting Prophet models for Penrose ...
π [PM2.5] Training & Testing Data split (80/20): 13,900 train samples and 3,475 test samples.
2024-05-14 11:44:18,105 - INFO - π¬ Validating Prophet models for Penrose ...
## Prepare data by performing a train-test split
training_data2, testing_data2 = AdvancedTimeSeriesForecastingProphet.train_test_split(data=data_series_df2, target_col='PM2.5', train_size=0.8)
forecasting_prophet_takapuna = AdvancedTimeSeriesForecastingProphet(training_data2, testing_data2, 'PM2.5')
## Fitting models
logging.info("π οΈ Fitting Prophet models for Takapuna ...")
forecasting_prophet_takapuna.fit()
logging.info("π¬ Validating Prophet models for Takapuna ...")
forecasting_prophet_takapuna.validate(site_name='Takapuna')
2024-05-14 11:44:26,085 - INFO - π οΈ Fitting Prophet models for Takapuna ...
π [PM2.5] Training & Testing Data split (80/20): 13,887 train samples and 3,472 test samples.
2024-05-14 11:44:29,337 - INFO - π¬ Validating Prophet models for Takapuna ...
## DataFrame must have 'Site', 'ds', and other columns like 'PM10'
## Creating an instance for Penrose for PM10 prediction
training_data1, testing_data1 = AdvancedTimeSeriesForecastingProphet.train_test_split(data=data_series_df1, target_col='PM10', train_size=0.8)
forecasting_prophet_penrose = AdvancedTimeSeriesForecastingProphet(training_data=training_data1, testing_data=testing_data1, target_variable='PM10')
## Fitting models
logging.info("π οΈ Fitting Prophet models for Penrose ...")
forecasting_prophet_penrose.fit()
## Validating models
logging.info("π¬ Validating Prophet models for Penrose ...")
forecasting_prophet_penrose.validate(site_name='Penrose')
2024-05-14 11:48:21,695 - INFO - π οΈ Fitting Prophet models for Penrose ...
π [PM10] Training & Testing Data split (80/20): 13,900 train samples and 3,475 test samples.
2024-05-14 11:48:25,018 - INFO - π¬ Validating Prophet models for Penrose ...
## DataFrame must have 'Site', 'ds', and other columns like 'PM10'
## Creating an instance for Takapuna for PM10 prediction
training_data2, testing_data2 = AdvancedTimeSeriesForecastingProphet.train_test_split(data=data_series_df2, target_col='PM10', train_size=0.8)
forecasting_prophet_takapuna = AdvancedTimeSeriesForecastingProphet(training_data=training_data2, testing_data=testing_data2, target_variable='PM10')
## Fitting models
logging.info("π οΈ Fitting Prophet models for Takapuna ...")
forecasting_prophet_takapuna.fit()
## Validating models
logging.info("π¬ Validating Prophet models for Takapuna ...")
forecasting_prophet_takapuna.validate(site_name='Takapuna')
2024-05-14 11:48:32,781 - INFO - π οΈ Fitting Prophet models for Takapuna ...
π [PM10] Training & Testing Data split (80/20): 13,887 train samples and 3,472 test samples.
2024-05-14 11:48:36,309 - INFO - π¬ Validating Prophet models for Takapuna ...
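Beyond these visual checks, Prophet ships quantitative diagnostics. The sketch below applies rolling-origin cross-validation to the fitted Takapuna model; the initial/period/horizon windows are illustrative choices, not tuned values from this study.

from prophet.diagnostics import cross_validation, performance_metrics

## Rolling-origin evaluation: train on at least `initial`, forecast `horizon`
## ahead, and advance the cutoff by `period` each fold (illustrative windows).
df_cv = cross_validation(forecasting_prophet_takapuna.model,
                         initial='365 days', period='30 days', horizon='7 days')
print(performance_metrics(df_cv)[['horizon', 'rmse', 'mae', 'mape']].head())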
π§ Feature Selection
def engineer_features(rawdata):
"""
Adds time-based features to the dataframe to enhance the analysis capabilities, adjusting for the multi-level index.
This includes extracting various temporal components and creating lag features for PM2.5 and PM10 variables to analyze time-based dependencies.
"""
## Extracting date-time components
timestamp_index = rawdata['Timestamp']
rawdata['Day'] = timestamp_index.dt.day
rawdata['DayOfWeek'] = timestamp_index.dt.dayofweek
rawdata['Month'] = timestamp_index.dt.month
rawdata['Quarter'] = timestamp_index.dt.quarter
rawdata['Year'] = timestamp_index.dt.year
## Extract ISO week of year for each timestamp
rawdata['WeekOfYear'] = timestamp_index.dt.isocalendar().week.astype(int)
## Calculating the season from the month so it accurately reflects the local climate and seasonal cycles.
## Meteorological seasons for the Southern Hemisphere (Auckland, New Zealand), unlike Northern Hemisphere countries such as England:
## 1 (Spring): September, October, November; 2 (Summer): December, January, February
## 3 (Autumn): March, April, May; 4 (Winter): June, July, August
rawdata['Season'] = rawdata['Month'].apply(
lambda x: 1 if 9 <= x <= 11 else ## Spring: Sep, Oct, Nov
2 if 12 <= x or x <= 2 else ## Summer: Dec, Jan, Feb
3 if 3 <= x <= 5 else ## Autumn: Mar, Apr, May
4 ## Winter: Jun, Jul, Aug
)
## Adding lag features for PM2.5 and PM10 to capture previous time steps' influence --> to compare the correlation with the other variables.
rawdata['PM2.5_Lag1'] = rawdata.groupby('Site')['PM2.5'].shift(1)
rawdata['PM2.5_Lag2'] = rawdata.groupby('Site')['PM2.5'].shift(2)
rawdata['PM10_Lag1'] = rawdata.groupby('Site')['PM10'].shift(1)
rawdata['PM10_Lag2'] = rawdata.groupby('Site')['PM10'].shift(2)
## Fill NaNs introduced by the lag features: backward fill first (covers the leading rows),
## then forward fill any remaining gaps at the end of each series
rawdata.bfill(inplace=True)
rawdata.ffill(inplace=True)
return rawdata
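A quick throwaway check of this season encoding (not part of the pipeline), confirming the Southern Hemisphere mapping used above:

import pandas as pd

## Sanity check: 1 = Spring (Sep-Nov), 2 = Summer (Dec-Feb), 3 = Autumn (Mar-May), 4 = Winter (Jun-Aug)
months = pd.Series([1, 5, 7, 10, 12])
seasons = months.apply(lambda x: 1 if 9 <= x <= 11 else 2 if 12 <= x or x <= 2 else 3 if 3 <= x <= 5 else 4)
print(dict(zip(months, seasons)))  # expected: {1: 2, 5: 3, 7: 4, 10: 1, 12: 2}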
## Apply the cleaning and querying function
# cleaned_data = rawdata.copy()
cleaned_data = engineer_features(rawdata.copy())
cleaned_data[['Timestamp', 'Year', 'Quarter', 'Season', 'Month', 'WeekOfYear', 'Day', 'DayOfWeek']].head()
| | Timestamp | Year | Quarter | Season | Month | WeekOfYear | Day | DayOfWeek |
|---|---|---|---|---|---|---|---|---|
0 | 2020-05-07 00:00:00 | 2020 | 2 | 3 | 5 | 19 | 7 | 3 |
1 | 2020-05-07 01:00:00 | 2020 | 2 | 3 | 5 | 19 | 7 | 3 |
2 | 2020-05-07 02:00:00 | 2020 | 2 | 3 | 5 | 19 | 7 | 3 |
3 | 2020-05-07 03:00:00 | 2020 | 2 | 3 | 5 | 19 | 7 | 3 |
4 | 2020-05-07 04:00:00 | 2020 | 2 | 3 | 5 | 19 | 7 | 3 |
## 'Timestamp' is the only ordinal attribute/column, given its natural ordering in time-series data
ordinal_columns = ['Timestamp']
print("\nπ Describing the types of each attribute as numerical_columns (Continuous), ordinal_columns (Ordinal), or nominal_columns (Nominal) ...")
cleaned_numerical_columns, cleaned_nominal_columns = DescriptiveStatistics.describe_data(cleaned_data, ordinal_columns)
π Describing the types of each attribute as numerical_columns (Continuous), ordinal_columns (Ordinal), or nominal_columns (Nominal) ... βΉοΈ Numerical Variables/Features: ['AQI', 'PM10', 'PM2.5', 'SO2', 'NO', 'NO2', 'NOx', 'Wind_Speed', 'Wind_Dir', 'Air_Temp', 'Rel_Humidity', 'Day', 'DayOfWeek', 'Month', 'Quarter', 'Year', 'WeekOfYear', 'Season', 'PM2.5_Lag1', 'PM2.5_Lag2', 'PM10_Lag1', 'PM10_Lag2'] βΉοΈ Ordinal Variables/Features: ['Timestamp'] βΉοΈ Nominal Variables/Features: Index(['Site'], dtype='object')
cleaned_data_site1 = cleaned_data[cleaned_data['Site'] == 'Penrose']
cleaned_data_site2 = cleaned_data[cleaned_data['Site'] == 'Takapuna']
print("\nπ [Site1 - Penrose] Summary Statistics of the cleaned_data_site1 DataFrame, such as the mean, maximum and minimum values ...")
cleaned_data_site1.describe()
π [Site1 - Penrose] Summary Statistics of the {site1} cleaned_data_site1 Dataframe such as the mean, max/minimum values ...
| | Timestamp | AQI | PM10 | PM2.5 | SO2 | NO | NO2 | NOx | Wind_Speed | Wind_Dir | ... | DayOfWeek | Month | Quarter | Year | WeekOfYear | Season | PM2.5_Lag1 | PM2.5_Lag2 | PM10_Lag1 | PM10_Lag2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 17375 | 17375.000000 | 17375.000000 | 17375.000000 | 17375.000000 | 17375.000000 | 17375.000000 | 17375.000000 | 17375.000000 | 17375.000000 | ... | 17375.000000 | 17375.000000 | 17375.000000 | 17375.000000 | 17375.000000 | 17375.000000 | 17375.000000 | 17375.000000 | 17375.000000 | 17375.000000 |
mean | 2021-05-03 23:07:19.873381376 | 28.233712 | 13.886201 | 5.400065 | 0.846799 | 9.871754 | 13.973526 | 23.837744 | 2.805975 | 173.193482 | ... | 3.004259 | 6.538935 | 2.513899 | 2020.835568 | 26.883856 | 2.502791 | 5.400681 | 5.401262 | 13.887018 | 13.887496 |
min | 2020-05-07 00:00:00 | 0.000000 | -6.400000 | -12.300000 | -4.400000 | -1.900000 | -1.400000 | -1.300000 | 0.200000 | 1.000000 | ... | 0.000000 | 1.000000 | 1.000000 | 2020.000000 | 1.000000 | 1.000000 | -12.300000 | -12.300000 | -6.400000 | -6.400000 |
25% | 2020-11-03 23:30:00 | 22.000000 | 8.500000 | 2.200000 | 0.100000 | 0.900000 | 4.100000 | 5.300000 | 1.600000 | 74.000000 | ... | 1.000000 | 4.000000 | 2.000000 | 2020.000000 | 13.000000 | 1.000000 | 2.200000 | 2.200000 | 8.500000 | 8.500000 |
50% | 2021-05-03 23:00:00 | 27.000000 | 13.100000 | 4.700000 | 0.500000 | 3.700000 | 10.500000 | 14.800000 | 2.600000 | 210.000000 | ... | 3.000000 | 7.000000 | 3.000000 | 2021.000000 | 27.000000 | 3.000000 | 4.700000 | 4.700000 | 13.100000 | 13.100000 |
75% | 2021-10-31 22:30:00 | 33.000000 | 18.300000 | 7.700000 | 1.000000 | 10.700000 | 20.300000 | 31.100000 | 3.800000 | 235.000000 | ... | 5.000000 | 10.000000 | 4.000000 | 2021.000000 | 40.000000 | 4.000000 | 7.700000 | 7.700000 | 18.300000 | 18.300000 |
max | 2022-04-30 23:00:00 | 121.000000 | 196.700000 | 138.400000 | 25.200000 | 378.400000 | 83.100000 | 455.100000 | 8.200000 | 357.000000 | ... | 6.000000 | 12.000000 | 4.000000 | 2022.000000 | 53.000000 | 4.000000 | 138.400000 | 138.400000 | 196.700000 | 196.700000 |
std | NaN | 9.608861 | 7.958464 | 5.204060 | 1.482960 | 20.526086 | 12.520003 | 30.254179 | 1.533803 | 93.604420 | ... | 1.997634 | 3.459325 | 1.120423 | 0.684702 | 15.234194 | 1.123019 | 5.204694 | 5.205324 | 7.958003 | 7.957911 |
8 rows Γ 23 columns
print("\nπ [Site2 - Takapuna] Summary Statistics of the cleaned_data_site2 DataFrame, such as the mean, maximum and minimum values ...")
cleaned_data_site2.describe()
π [Site2 - Takapuna] Summary Statistics of the {site2} cleaned_data_site2 Dataframe such as the mean, max/minimum values ...
| | Timestamp | AQI | PM10 | PM2.5 | SO2 | NO | NO2 | NOx | Wind_Speed | Wind_Dir | ... | DayOfWeek | Month | Quarter | Year | WeekOfYear | Season | PM2.5_Lag1 | PM2.5_Lag2 | PM10_Lag1 | PM10_Lag2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 17359 | 17359.000000 | 17359.000000 | 17359.000000 | 17359.0 | 17359.000000 | 17359.000000 | 17359.000000 | 17359.000000 | 17359.000000 | ... | 17359.000000 | 17359.000000 | 17359.000000 | 17359.000000 | 17359.000000 | 17359.000000 | 17359.000000 | 17359.000000 | 17359.000000 | 17359.000000 |
mean | 2021-05-04 08:00:00 | 25.152947 | 11.562504 | 6.237181 | 0.5 | 11.637256 | 0.012103 | 23.521597 | 2.497209 | 176.424542 | ... | 3.004148 | 6.540181 | 2.514315 | 2020.836454 | 26.890316 | 2.502275 | 6.237259 | 6.237348 | 11.562559 | 11.562660 |
min | 2020-05-07 17:00:00 | -13.000000 | -3.800000 | 0.450000 | 0.5 | -0.800000 | -0.000500 | -0.700000 | 0.200000 | 5.500000 | ... | 0.000000 | 1.000000 | 1.000000 | 2020.000000 | 1.000000 | 1.000000 | 0.450000 | 0.450000 | -3.800000 | -3.800000 |
25% | 2020-11-04 12:30:00 | 19.000000 | 7.468750 | 4.300000 | 0.5 | 2.350000 | 0.003950 | 6.912500 | 1.250000 | 76.500000 | ... | 1.000000 | 4.000000 | 2.000000 | 2020.000000 | 13.000000 | 1.000000 | 4.300000 | 4.300000 | 7.468750 | 7.468750 |
50% | 2021-05-04 08:00:00 | 24.000000 | 10.650000 | 5.550000 | 0.5 | 5.700000 | 0.008800 | 14.700000 | 2.300000 | 203.500000 | ... | 3.000000 | 7.000000 | 3.000000 | 2021.000000 | 27.000000 | 2.000000 | 5.550000 | 5.550000 | 10.650000 | 10.650000 |
75% | 2021-11-01 03:30:00 | 30.000000 | 14.400000 | 7.300000 | 0.5 | 11.900000 | 0.017100 | 28.500000 | 3.500000 | 260.000000 | ... | 5.000000 | 10.000000 | 4.000000 | 2021.000000 | 40.000000 | 4.000000 | 7.300000 | 7.300000 | 14.400000 | 14.400000 |
max | 2022-04-30 23:00:00 | 318.000000 | 437.250000 | 52.300000 | 0.5 | 352.450000 | 0.072100 | 401.950000 | 7.800000 | 346.500000 | ... | 6.000000 | 12.000000 | 4.000000 | 2022.000000 | 53.000000 | 4.000000 | 52.300000 | 52.300000 | 437.250000 | 437.250000 |
std | NaN | 10.637321 | 7.778661 | 3.401947 | 0.0 | 21.970397 | 0.010751 | 29.903734 | 1.455767 | 92.268417 | ... | 1.998613 | 3.460755 | 1.120882 | 0.684574 | 15.240120 | 1.123436 | 3.401884 | 3.401809 | 7.778618 | 7.778534 |
8 rows Γ 23 columns
def get_top_correlated_features(data, target, num_features=12):
"""
Calculates and returns the top N features most correlated with the target variable.
Parameters:
- data (DataFrame): The dataset containing numerical features.
- target (str): The target variable for which correlations are calculated.
- num_features (int): The number of top features to return.
Returns:
- List of top N correlated features.
"""
## Ensure data does not contain infinite or NaN values
data = data.select_dtypes(include=[np.number]).dropna()
## Calculate the Pearson correlation matrix
correlation_matrix = data.corr(method='pearson')
## Calculate absolute correlation with the target variable
abs_corr_with_target = abs(correlation_matrix[target])
## Exclude the target variable and lag variables from the series
features_to_exclude = [target] + [f"{target}_Lag1", f"{target}_Lag2"]
abs_corr_with_target = abs_corr_with_target.drop(features_to_exclude, errors='ignore')
## Get the top N highly correlated features
top_features = abs_corr_with_target.nlargest(num_features).index.tolist()
return top_features
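Pearson correlation only captures linear association. A rank-based variant is essentially a one-line swap and is sketched below as a suggestion; it is not what produced the results that follow.

import pandas as pd

def top_spearman_features(data: pd.DataFrame, target: str, num_features: int = 12) -> list:
    """Rank-based variant of get_top_correlated_features: Spearman correlation
    is more robust to outliers and captures monotonic (not just linear) links."""
    numeric = data.select_dtypes(include='number').dropna()
    corr = numeric.corr(method='spearman')[target].abs().drop(target, errors='ignore')
    return corr.nlargest(num_features).index.tolist()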
## Analyzing for both sites
top_features_data1 = get_top_correlated_features(cleaned_data_site1, 'PM2.5')
top_features_data2 = get_top_correlated_features(cleaned_data_site2, 'PM2.5')
logging.info("\nπ Top 5 features highly correlated with PM2.5 in Penrose: %s\n", top_features_data1)
logging.info("\nπ Top 5 features highly correlated with PM2.5 in Takapuna: %s\n", top_features_data2)
2024-05-14 12:04:07,853 - INFO - π Top 5 features highly correlated with PM2.5 in Penrose: ['PM10', 'PM10_Lag1', 'NOx', 'NO', 'PM10_Lag2', 'AQI', 'NO2', 'SO2', 'Wind_Dir', 'Air_Temp', 'Season', 'Wind_Speed'] 2024-05-14 12:04:07,853 - INFO - π Top 5 features highly correlated with PM2.5 in Takapuna: ['PM10', 'PM10_Lag1', 'AQI', 'PM10_Lag2', 'NO', 'NOx', 'Air_Temp', 'NO2', 'Rel_Humidity', 'Wind_Dir', 'Quarter', 'Month']
## Analyzing for both sites
top_features_data21 = get_top_correlated_features(cleaned_data_site1, 'PM10')
top_features_data22 = get_top_correlated_features(cleaned_data_site2, 'PM10')
logging.info("\nπ Top 5 features highly correlated with PM10 in Penrose: %s\n", top_features_data21)
logging.info("\nπ Top 5 features highly correlated with PM10 in Takapuna: %s\n", top_features_data22)
2024-05-14 12:04:22,621 - INFO - π Top 5 features highly correlated with PM10 in Penrose: ['AQI', 'PM2.5', 'PM2.5_Lag1', 'PM2.5_Lag2', 'NO', 'NOx', 'Wind_Speed', 'Rel_Humidity', 'SO2', 'NO2', 'Air_Temp', 'Wind_Dir'] 2024-05-14 12:04:22,621 - INFO - π Top 5 features highly correlated with PM10 in Takapuna: ['PM2.5', 'PM2.5_Lag1', 'PM2.5_Lag2', 'AQI', 'NO', 'NOx', 'Wind_Speed', 'NO2', 'Wind_Dir', 'Day', 'Rel_Humidity', 'Quarter']
def plot_pm_variation(data, feature, title, ax):
"""
Helper function to plot PM2.5/PM10 variations on a given Axes object.
Parameters:
- data (DataFrame): The dataset to plot.
- feature (str): The column to plot (e.g. 'PM2.5' or 'PM10').
- title (str): The title of the plot.
- ax (Axes): The matplotlib Axes object where the plot will be drawn.
"""
ax.plot(data[feature], label=f'{feature}')
ax.set_xlabel('Date')
ax.set_ylabel(f'{feature} Concentration')
ax.set_title(title)
ax.legend()
fig, axs = plt.subplots(2, 2, figsize=(20, 12)) # Create a grid of 2x2 for Penrose and Takapuna
## Plotting
plot_pm_variation(rawdata_site1, 'PM2.5','Penrose - PM2.5 Variation over Time (Raw Data)', axs[0, 0])
plot_pm_variation(cleaned_data_site1, 'PM2.5','Penrose - PM2.5 Variation over Time (Cleaned Data)', axs[0, 1])
plot_pm_variation(rawdata_site2, 'PM2.5','Takapuna - PM2.5 Variation over Time (Raw Data)', axs[1, 0])
plot_pm_variation(cleaned_data_site2, 'PM2.5','Takapuna - PM2.5 Variation over Time (Cleaned Data)', axs[1, 1])
plt.tight_layout()
plt.show()
def plot_features(data, features, site_name):
"""Plot the variation of each selected feature over time for the given site."""
fig, axs = plt.subplots(len(features), 1, figsize=(15, 15))
for ax, feature in zip(axs, features):
if feature in data.columns:
ax.plot(data.index, data[feature], label=f'{feature} (Cleaned Data)')
ax.set_title(f'Variation of {feature} in {site_name}')
ax.set_xlabel('Date')
ax.set_ylabel(feature)
ax.grid(True)
else:
fig.delaxes(ax) ## Remove the axis if the feature is not in the data
plt.tight_layout()
plt.show()
## Call the function for each site
logging.info("\nπ [Penrose] Visualizes the variation of the top N highly correlated features with PM2.5 across all times.\n")
plot_features(cleaned_data_site1, top_features_data1, 'Penrose')
logging.info("\nπ [Takapuna] Visualizes the variation of the top N highly correlated features with PM2.5 across all times.\n")
plot_features(cleaned_data_site2, top_features_data2, 'Takapuna')
2024-05-14 12:04:38,696 - INFO - π [Penrose] Visualizes the variation of the top N highly correlated features with PM2.5 across all times.
2024-05-14 12:04:39,572 - INFO - π [Takapuna] Visualizes the variation of the top N highly correlated features with PM2.5 across all times.
## Call the function for each site
logging.info("\nπ [Penrose] Visualizes the variation of the top N highly correlated features with PM10 across all times.\n")
plot_features(cleaned_data_site1, top_features_data21, 'Penrose')
logging.info("\nπ [Takapuna] Visualizes the variation of the top N highly correlated features with PM10 across all times.\n")
plot_features(cleaned_data_site2, top_features_data22, 'Takapuna')
2024-05-14 12:04:51,396 - INFO - π [Penrose] Visualizes the variation of the top N highly correlated features with PM10 across all times.
2024-05-14 12:04:52,148 - INFO - π [Takapuna] Visualizes the variation of the top N highly correlated features with PM10 across all times.
def compute_summary_statistics(data, top_features, target='PM2.5'):
"""
Computes and returns summary statistics for the target and its selected features in a given dataset.
Args:
data (pd.DataFrame): The cleaned data for a specific site.
top_features (list): List of features for which to compute statistics, determined from prior analysis.
target (str): The target pollutant to include alongside the features.
Returns:
pd.DataFrame: Summary statistics of the selected features.
"""
# Ensure all top features are present in the data columns
valid_features = [feature for feature in top_features if feature in data.columns]
selected_features = [target] + valid_features
# Get summary statistics
summary_stats = data[selected_features].describe()
return summary_stats
## 'cleaned_data_site1' and 'cleaned_data_site2' are the cleaned datasets for Penrose and Takapuna, respectively.
## 'top_features_data1' and 'top_features_data2' are the lists of top features determined from the correlation analysis above.
summary_stats_penrose = compute_summary_statistics(cleaned_data_site1, top_features_data1)
summary_stats_takapuna = compute_summary_statistics(cleaned_data_site2, top_features_data2)
logging.info("π Summary Statistics for Penrose:")
summary_stats_penrose
2024-05-14 12:05:50,736 - INFO - π Summary Statistics for Penrose:
| | PM2.5 | PM10 | PM10_Lag1 | NOx | NO | PM10_Lag2 | AQI | NO2 | SO2 | Wind_Dir | Air_Temp | Season | Wind_Speed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 17375.000000 | 17375.000000 | 17375.000000 | 17375.000000 | 17375.000000 | 17375.000000 | 17375.000000 | 17375.000000 | 17375.000000 | 17375.000000 | 17375.000000 | 17375.000000 | 17375.000000 |
mean | 5.400065 | 13.886201 | 13.887018 | 23.837744 | 9.871754 | 13.887496 | 28.233712 | 13.973526 | 0.846799 | 173.193482 | 16.744259 | 2.502791 | 2.805975 |
std | 5.204060 | 7.958464 | 7.958003 | 30.254179 | 20.526086 | 7.957911 | 9.608861 | 12.520003 | 1.482960 | 93.604420 | 4.005293 | 1.123019 | 1.533803 |
min | -12.300000 | -6.400000 | -6.400000 | -1.300000 | -1.900000 | -6.400000 | 0.000000 | -1.400000 | -4.400000 | 1.000000 | 3.000000 | 1.000000 | 0.200000 |
25% | 2.200000 | 8.500000 | 8.500000 | 5.300000 | 0.900000 | 8.500000 | 22.000000 | 4.100000 | 0.100000 | 74.000000 | 14.000000 | 1.000000 | 1.600000 |
50% | 4.700000 | 13.100000 | 13.100000 | 14.800000 | 3.700000 | 13.100000 | 27.000000 | 10.500000 | 0.500000 | 210.000000 | 17.000000 | 3.000000 | 2.600000 |
75% | 7.700000 | 18.300000 | 18.300000 | 31.100000 | 10.700000 | 18.300000 | 33.000000 | 20.300000 | 1.000000 | 235.000000 | 20.000000 | 4.000000 | 3.800000 |
max | 138.400000 | 196.700000 | 196.700000 | 455.100000 | 378.400000 | 196.700000 | 121.000000 | 83.100000 | 25.200000 | 357.000000 | 28.000000 | 4.000000 | 8.200000 |
logging.info("\nπ Summary Statistics for Takapuna:")
summary_stats_takapuna
2024-05-14 12:05:56,070 - INFO - π Summary Statistics for Takapuna:
| | PM2.5 | PM10 | PM10_Lag1 | AQI | PM10_Lag2 | NO | NOx | Air_Temp | NO2 | Rel_Humidity | Wind_Dir | Quarter | Month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 17359.000000 | 17359.000000 | 17359.000000 | 17359.000000 | 17359.000000 | 17359.000000 | 17359.000000 | 17359.000000 | 17359.000000 | 17359.000000 | 17359.000000 | 17359.000000 | 17359.000000 |
mean | 6.237181 | 11.562504 | 11.562559 | 25.152947 | 11.562660 | 11.637256 | 23.521597 | 16.342980 | 0.012103 | 69.842651 | 176.424542 | 2.514315 | 6.540181 |
std | 3.401947 | 7.778661 | 7.778618 | 10.637321 | 7.778534 | 21.970397 | 29.903734 | 3.952319 | 0.010751 | 11.760956 | 92.268417 | 1.120882 | 3.460755 |
min | 0.450000 | -3.800000 | -3.800000 | -13.000000 | -3.800000 | -0.800000 | -0.700000 | 3.500000 | -0.000500 | 30.400000 | 5.500000 | 1.000000 | 1.000000 |
25% | 4.300000 | 7.468750 | 7.468750 | 19.000000 | 7.468750 | 2.350000 | 6.912500 | 13.500000 | 0.003950 | 61.200000 | 76.500000 | 2.000000 | 4.000000 |
50% | 5.550000 | 10.650000 | 10.650000 | 24.000000 | 10.650000 | 5.700000 | 14.700000 | 16.000000 | 0.008800 | 70.950000 | 203.500000 | 3.000000 | 7.000000 |
75% | 7.300000 | 14.400000 | 14.400000 | 30.000000 | 14.400000 | 11.900000 | 28.500000 | 19.500000 | 0.017100 | 79.700000 | 260.000000 | 4.000000 | 10.000000 |
max | 52.300000 | 437.250000 | 437.250000 | 318.000000 | 437.250000 | 352.450000 | 401.950000 | 27.000000 | 0.072100 | 90.350000 | 346.500000 | 4.000000 | 12.000000 |
Target:
- PM2.5 (and PM10), the particulate-matter concentrations to forecast.
Features:
- The top correlated variables identified above, e.g. PM10/PM2.5 and their lags, AQI, NOx, NO, NO2, SO2, Wind_Speed, Wind_Dir, Air_Temp, and Rel_Humidity.
References
- Hyndman, R.J., & Athanasopoulos, G. (2021). Forecasting: Principles and Practice (3rd ed.). Monash University, Australia. Available at OTexts.com/fpp3.