🎓 Predicting Air Particulate Matter at Scale ⛅️ 🛠️ 5. Machine Learning Modelling
🎓 This notebook shows several methods for making predictions on an environmental dataset using Machine Learning. The results should be ready for reuse in the next CRISP-DM step for Data Science (Intelligent Dashboard).
Workflow steps:
- Import the required teradataml modules.
- Connect to a Vantage system.
- ⚙️ Machine Learning Modelling
- Cleanup.
🎯 Libraries and Reusable Functions¶
🎓 This section executes all of the cells in DataFrameAdapter.ipynb.
In [1]:
import os, logging
# logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
## .env --> Setting the environment variable for output format programmatically
os.environ['ENV_PATH'] = 'cleaned_vantage.env' ## Set to `.env_cleaned` or `.env_raw` as needed
# %run -i ./Data_Loading_and_Descriptive_Statistics.ipynb
%run -i ./DataFrameAdapter.ipynb
2024-05-27 12:36:31,137 - INFO - data/cleaned/cleaned_Penrose7-07May2020-to-30Apr2022.csv
2024-05-27 12:36:31,137 - INFO - data/cleaned/cleaned_Takapuna23-07May2020-to-30Apr2022.csv
2024-05-27 12:36:31,145 - INFO - ℹ️ Load Data from data/cleaned/cleaned_Penrose7-07May2020-to-30Apr2022.csv file --> rawdata DataFrame 📂
2024-05-27 12:36:31,183 - INFO - ℹ️ Load Data from data/cleaned/cleaned_Takapuna23-07May2020-to-30Apr2022.csv file --> rawdata DataFrame 📂
🎓 The specified .env file: cleaned_vantage.env
+-----------------------------+---------------------------------------------------------------------------------------------+-----------------------+
| Variable | Description | Value |
+-----------------------------+---------------------------------------------------------------------------------------------+-----------------------+
| IS_LOADING_FROM_FILES | True if loading data from *.csv/xls files; False if using imported data in Teradata Vantage | True |
| IS_TERADATA_VANTAGE | Using scable Teradata Vantage vs. local machine (Laptop) | False |
| IS_DATA_IN_TERADATA_VANTAGE | Using TeradataDataFrame in scalable Vantage vs. PandasDataFrame with local *.csv/xls files | False |
| SCHEMA_NAME | [Teradata Vantage] Schema Name | Air_Pollution |
| TABLE_NAME | [Teradata Vantage] Table Name | Air_Pollution_cleaned |
| IS_JUPYTERLAB | Running in JupyterLab vs Python Dash/Vizro Dashboard | True |
| IS_TEST_DEV | Is Test/Dev mode is active or not (in Production) | False |
| DATA_PATH | *.csv/xls Data PATH | Not set or not found |
| USE_DATA_PREFIX | Prefix to use for data files: 'raw' | 'cleaned' | cleaned |
+-----------------------------+---------------------------------------------------------------------------------------------+-----------------------+
ℹ️ Load Data from data/cleaned folder
ℹ️ Combined Data Shape: (34734, 13)
ℹ️ The Shape of the Dataframe rawdata_site1 (Penrose) and rawdata_site2 (Takapuna): (17375, 12) (17359, 11)
🎓 Describing the types of each attribute as numerical_columns (Continuous), ordinal_columns (Ordinal), or nominal_columns (Nominal) ...
ℹ️ Numerical Variables/Features: ['AQI', 'PM10', 'PM2.5', 'SO2', 'NO', 'NO2', 'NOx', 'Wind_Speed', 'Wind_Dir', 'Air_Temp', 'Rel_Humidity']
ℹ️ Ordinal Variables/Features: ['Timestamp']
ℹ️ Nominal Variables/Features: Index(['Site'], dtype='object')
🎓 1. [Site 1 - Penrose][numerical_columns_S1, nominal_columns_S1] Summary Statistics of the Dataframe such as the mean, maximum and minimum values ...
ℹ️ Numerical Variables/Features: ['AQI', 'PM10', 'PM2.5', 'SO2', 'NO', 'NO2', 'NOx', 'Wind_Speed', 'Wind_Dir', 'Air_Temp', 'Rel_Humidity']
ℹ️ Ordinal Variables/Features: ['Timestamp']
🎓 2. [Site 2 - Takapuna][numerical_columns_S2, nominal_columns_S2] Summary Statistics of the {site2} Dataframe such as the mean, maximum and minimum values ...
ℹ️ Numerical Variables/Features: ['AQI', 'PM10', 'PM2.5', 'NO', 'NO2', 'NOx', 'Wind_Speed', 'Wind_Dir', 'Air_Temp', 'Rel_Humidity']
ℹ️ Ordinal Variables/Features: ['Timestamp']
🎓 [Data_Loading_and_Descriptive_Statistics.ipynb] Listing variables with description...
+-------------------------+----------------------------------------------------------------+-----------+---------+----------+
| Variable Name | Description | All Sites | Penrose | Takapuna |
+-------------------------+----------------------------------------------------------------+-----------+---------+----------+
| rawdata | Complete dataset containing all observations across all sites. | [x] | [x] | [x] |
| ordinal_columns | Ordinal columns specific to Site 1. | [x] | [x] | [x] |
| numerical_columns_site1 | Numerical columns specific to Site 1. | [ ] | [x] | [ ] |
| nominal_columns_site1 | Nominal columns specific to Site 1. | [ ] | [x] | [ ] |
| numerical_columns_site2 | Numerical columns specific to Site 2. | [ ] | [ ] | [x] |
| nominal_columns_site2 | Nominal columns specific to Site 2. | [ ] | [ ] | [x] |
| rawdata_site1 | Subset of raw data for Site 1. | [ ] | [x] | [ ] |
| rawdata_site2 | Subset of raw data for Site 2. | [ ] | [ ] | [x] |
+-------------------------+----------------------------------------------------------------+-----------+---------+----------+
🎓 Describing the types of each attribute as cleaned_numerical_columns (Continuous), cleaned_ordinal_columns (Ordinal), or cleaned_nominal_columns (Nominal) ...
ℹ️ Numerical Variables/Features: ['AQI', 'PM10', 'PM2.5', 'SO2', 'NO', 'NO2', 'NOx', 'Wind_Speed', 'Wind_Dir', 'Air_Temp', 'Rel_Humidity', 'Hour', 'Day', 'DayOfWeek', 'Month', 'Quarter', 'Year', 'WeekOfYear', 'Season', 'PM2.5_Lag1', 'PM2.5_Lag2', 'PM10_Lag1', 'PM10_Lag2']
ℹ️ Ordinal Variables/Features: ['Timestamp']
ℹ️ Nominal Variables/Features: Index(['Site'], dtype='object')
🎓 [Site1 - Penrose] Summary Statistics of the {site1} cleaned_data_site1 Dataframe such as the mean, max/minimum values ...
🎓 [Site2 - Takapuna] Summary Statistics of the {site2} cleaned_data_site2 Dataframe such as the mean, max/minimum values ...
🎓 [DataFrameAdapter.ipynb] Listing variables with description...
+-----------------------------+-----------------------------------------------------------------------+-----------+---------+----------+
| Variable Name | Description | All Sites | Penrose | Takapuna |
+-----------------------------+-----------------------------------------------------------------------+-----------+---------+----------+
| rawdata | Complete dataset containing all observations across all sites. | [x] | [x] | [x] |
| numerical_columns_site1 | Numerical columns specific to Site 1. | [ ] | [x] | [ ] |
| nominal_columns_site1 | Nominal columns specific to Site 1. | [ ] | [x] | [ ] |
| numerical_columns_site2 | Numerical columns specific to Site 2. | [ ] | [ ] | [x] |
| nominal_columns_site2 | Nominal columns specific to Site 2. | [ ] | [ ] | [x] |
| rawdata_site1 | Subset of raw data for Site 1. | [ ] | [x] | [ ] |
| rawdata_site2 | Subset of raw data for Site 2. | [ ] | [ ] | [x] |
| --------------------------- | --------------------------------------------------------------------- | --------- | ------- | -------- |
| cleaned_data | Cleaned dataset with preprocessing applied. | [x] | [x] | [x] |
| cleaned_ordinal_columns | Ordinal columns in the cleaned dataset. | [x] | [x] | [x] |
| cleaned_numerical_columns | Numerical columns in the cleaned dataset. | [x] | [x] | [x] |
| cleaned_nominal_columns | Nominal columns in the cleaned dataset. | [x] | [x] | [x] |
| cleaned_data_site1 | Cleaned data for Site 1. | [ ] | [x] | [ ] |
| cleaned_data_site2 | Cleaned data for Site 2. | [ ] | [ ] | [x] |
+-----------------------------+-----------------------------------------------------------------------+-----------+---------+----------+
In [2]:
# cleaned_data_site1
cleaned_data_site2
Out[2]:
Timestamp | AQI | PM10 | PM2.5 | SO2 | NO | NO2 | NOx | Wind_Speed | Wind_Dir | ... | DayOfWeek | Month | Quarter | Year | WeekOfYear | Season | PM2.5_Lag1 | PM2.5_Lag2 | PM10_Lag1 | PM10_Lag2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
17375 | 2020-05-07 17:00:00 | 21.0 | 5.95 | 4.15 | 0.5 | 10.90 | 0.01715 | 28.00 | 2.50 | 242.0 | ... | 3 | 5 | 2 | 2020 | 19 | 3 | 4.15 | 4.15 | 5.95 | 5.95 |
17376 | 2020-05-07 18:00:00 | 21.0 | 5.65 | 5.10 | 0.5 | 8.20 | 0.01655 | 24.70 | 2.20 | 239.5 | ... | 3 | 5 | 2 | 2020 | 19 | 3 | 4.15 | 4.15 | 5.95 | 5.95 |
17377 | 2020-05-07 19:00:00 | 21.0 | 7.70 | 5.45 | 0.5 | 5.75 | 0.01325 | 19.00 | 2.10 | 244.0 | ... | 3 | 5 | 2 | 2020 | 19 | 3 | 5.10 | 4.15 | 5.65 | 5.95 |
17378 | 2020-05-07 20:00:00 | 21.0 | 8.20 | 5.45 | 0.5 | 3.50 | 0.00870 | 12.20 | 2.25 | 251.0 | ... | 3 | 5 | 2 | 2020 | 19 | 3 | 5.45 | 5.10 | 7.70 | 5.65 |
17379 | 2020-05-07 21:00:00 | 21.0 | 11.80 | 5.80 | 0.5 | 3.55 | 0.00930 | 12.90 | 2.10 | 261.0 | ... | 3 | 5 | 2 | 2020 | 19 | 3 | 5.45 | 5.45 | 8.20 | 7.70 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
34729 | 2022-04-30 19:00:00 | 14.0 | 4.75 | 3.30 | 0.5 | 0.60 | 0.00440 | 5.00 | 2.55 | 109.5 | ... | 5 | 4 | 2 | 2022 | 17 | 3 | 3.25 | 2.85 | 5.90 | 5.85 |
34730 | 2022-04-30 20:00:00 | 14.0 | 6.35 | 3.15 | 0.5 | 0.50 | 0.00365 | 4.15 | 2.45 | 105.5 | ... | 5 | 4 | 2 | 2022 | 17 | 3 | 3.30 | 3.25 | 4.75 | 5.90 |
34731 | 2022-04-30 21:00:00 | 14.0 | 6.05 | 2.80 | 0.5 | 0.40 | 0.00480 | 5.20 | 2.35 | 115.5 | ... | 5 | 4 | 2 | 2022 | 17 | 3 | 3.15 | 3.30 | 6.35 | 4.75 |
34732 | 2022-04-30 22:00:00 | 13.0 | 4.20 | 2.60 | 0.5 | 0.40 | 0.00555 | 5.90 | 1.95 | 122.5 | ... | 5 | 4 | 2 | 2022 | 17 | 3 | 2.80 | 3.15 | 6.05 | 6.35 |
34733 | 2022-04-30 23:00:00 | 13.0 | 5.00 | 2.80 | 0.5 | 0.35 | 0.00405 | 4.30 | 1.95 | 119.0 | ... | 5 | 4 | 2 | 2022 | 17 | 3 | 2.60 | 2.80 | 4.20 | 6.05 |
17359 rows × 25 columns
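🎓 The `PM2.5_Lag1`, `PM2.5_Lag2`, `PM10_Lag1` and `PM10_Lag2` columns in the table above look like shifted copies of the pollutant columns, with the first rows back-filled. The code that builds them lives in DataFrameAdapter.ipynb and is not shown here, so the following is only a minimal sketch of one way such columns could be produced. The helper name `add_lag_features` and the back-fill choice are assumptions made for illustration.

```python
import pandas as pd

def add_lag_features(df, columns, lags=(1, 2)):
    """Return a copy of df with <col>_Lag<k> columns (hypothetical helper, not from the notebook)."""
    out = df.copy()
    for col in columns:
        for k in lags:
            # shift(k) moves each value k rows down, so row t holds the value from row t-k;
            # bfill() fills the first k rows, which have no history, with the earliest available value
            out[f"{col}_Lag{k}"] = out[col].shift(k).bfill()
    return out

# Toy hourly series reusing the first five Takapuna PM2.5 readings shown above
toy = pd.DataFrame(
    {"PM2.5": [4.15, 5.10, 5.45, 5.45, 5.80]},
    index=pd.date_range("2020-05-07 17:00", periods=5, freq=pd.Timedelta(hours=1)),
)
lagged = add_lag_features(toy, ["PM2.5"])
print(lagged)
```

On these five readings the sketch gives the same `PM2.5_Lag1` and `PM2.5_Lag2` values as the first five rows of the table, which is consistent with (but does not prove) a shift-and-back-fill approach.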
In [3]:
# def ensure_numeric(data):
# """
# Ensure all columns in the DataFrame are numeric. Convert non-numeric columns to numeric where possible.
# Args:
# data (pd.DataFrame): The DataFrame to be processed.
# Returns:
# pd.DataFrame: DataFrame with all columns converted to numeric.
# """
# if 'Site' in data.columns:
# data.drop(columns=['Site'], inplace=True)
# for col in data.columns:
# if data[col].dtype == 'object':
# data[col] = pd.to_numeric(data[col], errors='coerce')
# return data
# ## Convert non-numeric columns to numeric
# cleaned_data_site1 = ensure_numeric(cleaned_data_site1)
# cleaned_data_site2 = ensure_numeric(cleaned_data_site2)
# logging.info("Non-numeric columns converted to numeric where possible.")
In [4]:
# ## Extract and select features for both sites and multiple pollutants
# ## 'PM2.5' Analyzing for both sites
# top_features_data11 = DataFrameAdapter.extract_featuretools_features(data=cleaned_data_site1, target_column='PM2.5', entity_id='site1')
# top_features_data12 = DataFrameAdapter.extract_featuretools_features(data=cleaned_data_site2, target_column='PM2.5', entity_id='site2')
# logging.info("\n🌟 top_features_data11: Top Featuretools features highly correlated with PM2.5 in Penrose: %s\n", top_features_data11.head())
# logging.info("\n🌟 top_features_data12: Top Featuretools features highly correlated with PM2.5 in Takapuna: %s\n", top_features_data12.head())
# ## 'PM10' Analyzing for both sites
# top_features_data21 = DataFrameAdapter.extract_featuretools_features(data=cleaned_data_site1, target_column='PM10', entity_id='site1')
# top_features_data22 = DataFrameAdapter.extract_featuretools_features(data=cleaned_data_site2, target_column='PM10', entity_id='site2')
# logging.info("\n🌟 top_features_data21: Top Featuretools features highly correlated with PM10 in Penrose: %s\n", top_features_data21.head())
# logging.info("\n🌟 top_features_data22: Top Featuretools features highly correlated with PM10 in Takapuna: %s\n", top_features_data22.head())
In [5]:
# ## Extract and select features for both sites and multiple pollutants
# ## 'PM2.5' Analyzing for both sites
# top_features_data11 = DataFrameAdapter.extract_tsfresh_features(data=cleaned_data_site1, target_column='PM2.5')
# top_features_data12 = DataFrameAdapter.extract_tsfresh_features(cleaned_data_site2, target_column='PM2.5')
# logging.info("\n🌟 top_features_data11: Top Tsfresh features highly correlated with PM2.5 in Penrose: %s\n", top_features_data11)
# logging.info("\n🌟 top_features_data12: Top Tsfresh features highly correlated with PM2.5 in Takapuna: %s\n", top_features_data12)
# ## 'PM10' Analyzing for both sites
# top_features_data21 = DataFrameAdapter.extract_tsfresh_features(data=cleaned_data_site1, target_column='PM10')
# top_features_data22 = DataFrameAdapter.extract_tsfresh_features(cleaned_data_site2, target_column='PM10')
# logging.info("\n🌟 top_features_data21: Top Tsfresh features highly correlated with PM10 in Penrose: %s\n", top_features_data21)
# logging.info("\n🌟 top_features_data22: Top Tsfresh features highly correlated with PM10 in Takapuna: %s\n", top_features_data22)
In [6]:
## 'PM2.5' Analyzing for both sites
top_features_data11 = DataFrameAdapter.get_top_correlated_features(data=cleaned_data_site1, target='PM2.5', num_features=10)
top_features_data12 = DataFrameAdapter.get_top_correlated_features(data=cleaned_data_site2, target='PM2.5', num_features=10)
logging.info("\n🌟 top_features_data11: Top features for PM2.5 in Penrose: %s\n", top_features_data11)
logging.info("\n🌟 top_features_data12: Top features for PM2.5 in Takapuna: %s\n", top_features_data12)
## 'PM10' Analyzing for both sites
top_features_data21 = DataFrameAdapter.get_top_correlated_features(data=cleaned_data_site1, target='PM10', num_features=10)
top_features_data22 = DataFrameAdapter.get_top_correlated_features(data=cleaned_data_site2, target='PM10', num_features=10)
logging.info("\n🌟 top_features_data21: Top features for PM10 in Penrose: %s\n", top_features_data21)
logging.info("\n🌟 top_features_data22: Top features for PM10 in Takapuna: %s\n", top_features_data22)
2024-05-27 12:36:31,950 - INFO - Excluded columns: ['AQI']
2024-05-27 12:36:31,965 - INFO - Highly correlated features consider to drop: ['Quarter', 'WeekOfYear']
2024-05-27 12:36:31,966 - INFO - Shape after removing highly correlated features: (17375, 20)
2024-05-27 12:36:32,469 - INFO - Multicollinear features consider to drop: ['PM10', 'NO', 'NO2', 'NOx', 'Wind_Speed', 'Wind_Dir', 'Air_Temp', 'Rel_Humidity', 'Month', 'Quarter', 'Year', 'WeekOfYear', 'Season', 'PM10_Lag1', 'PM10_Lag2']
2024-05-27 12:36:32,472 - INFO - Shape after removing multicollinear features: (17375, 7)
2024-05-27 12:36:32,488 - INFO - Low information features consider to drop: Index([], dtype='object')
2024-05-27 12:36:32,492 - INFO - Shape after removing low information features: (17375, 22)
2024-05-27 12:36:32,509 - INFO - Excluded columns: ['AQI']
2024-05-27 12:36:32,533 - INFO - Highly correlated features consider to drop: ['NOx', 'Quarter', 'WeekOfYear']
2024-05-27 12:36:32,537 - INFO - Shape after removing highly correlated features: (17359, 19)
2024-05-27 12:36:32,837 - INFO - Multicollinear features consider to drop: ['PM2.5', 'SO2', 'NO', 'NO2', 'NOx', 'Month', 'Quarter', 'WeekOfYear', 'PM2.5_Lag1', 'PM2.5_Lag2', 'PM10_Lag1']
2024-05-27 12:36:32,839 - INFO - Shape after removing multicollinear features: (17359, 11)
2024-05-27 12:36:32,850 - INFO - Low information features consider to drop: Index(['SO2', 'NO2'], dtype='object')
2024-05-27 12:36:32,851 - INFO - Shape after removing low information features: (17359, 20)
2024-05-27 12:36:32,908 - INFO - 🌟 top_features_data11: Top features for PM2.5 in Penrose: [('PM10', 0.3316327673471662), ('PM10_Lag1', 0.2872995212028439), ('NOx', 0.26010067200800124), ('NO', 0.2438951539146822), ('PM10_Lag2', 0.2299781048934229), ('NO2', 0.22818076257545034), ('SO2', 0.1502897204253645), ('Wind_Dir', 0.09195585488183262), ('Air_Temp', 0.08956546185718811), ('Season', 0.055691287162079973)]
2024-05-27 12:36:32,909 - INFO - 🌟 top_features_data12: Top features for PM2.5 in Takapuna: [('PM10', 0.4783495824743127), ('PM10_Lag1', 0.45707224953829717), ('PM10_Lag2', 0.4157380573329951), ('NO', 0.28893379890807963), ('NOx', 0.2858649990967947), ('Air_Temp', 0.26145559434074866), ('NO2', 0.20908323226602676), ('Rel_Humidity', 0.19778481090123154), ('Wind_Dir', 0.1627087661651437), ('Quarter', 0.14799817939240387)]
2024-05-27 12:36:32,912 - INFO - Excluded columns: ['AQI', 'PM2.5', 'PM2.5_Lag1', 'PM2.5_Lag2']
2024-05-27 12:36:32,935 - INFO - Highly correlated features consider to drop: ['Quarter', 'WeekOfYear']
2024-05-27 12:36:32,939 - INFO - Shape after removing highly correlated features: (17375, 17)
2024-05-27 12:36:33,314 - INFO - Multicollinear features consider to drop: ['PM10', 'NO', 'NO2', 'NOx', 'Wind_Speed', 'Wind_Dir', 'Air_Temp', 'Rel_Humidity', 'Month', 'Quarter', 'Year', 'WeekOfYear', 'Season', 'PM10_Lag1', 'PM10_Lag2']
2024-05-27 12:36:33,315 - INFO - Shape after removing multicollinear features: (17375, 4)
2024-05-27 12:36:33,323 - INFO - Low information features consider to drop: Index([], dtype='object')
2024-05-27 12:36:33,334 - INFO - Shape after removing low information features: (17375, 19)
2024-05-27 12:36:33,352 - INFO - Excluded columns: ['AQI', 'PM2.5', 'PM2.5_Lag1', 'PM2.5_Lag2']
2024-05-27 12:36:33,380 - INFO - Highly correlated features consider to drop: ['NOx', 'Quarter', 'WeekOfYear']
2024-05-27 12:36:33,383 - INFO - Shape after removing highly correlated features: (17359, 16)
2024-05-27 12:36:33,688 - INFO - Multicollinear features consider to drop: ['SO2', 'NO', 'NO2', 'NOx', 'Month', 'Quarter', 'WeekOfYear', 'PM10_Lag1']
2024-05-27 12:36:33,691 - INFO - Shape after removing multicollinear features: (17359, 11)
2024-05-27 12:36:33,700 - INFO - Low information features consider to drop: Index(['SO2', 'NO2'], dtype='object')
2024-05-27 12:36:33,707 - INFO - Shape after removing low information features: (17359, 17)
2024-05-27 12:36:33,721 - INFO - 🌟 top_features_data21: Top features for PM10 in Penrose: [('NO', 0.22116679446709958), ('NOx', 0.19946161043703808), ('Wind_Speed', 0.1978398765827993), ('Rel_Humidity', 0.1968068762131793), ('SO2', 0.12497675418692601), ('NO2', 0.11929589880666419), ('Air_Temp', 0.09524315612393984), ('Wind_Dir', 0.06029375360018743), ('Year', 0.0475386031822095), ('Hour', 0.0401390588174396)]
2024-05-27 12:36:33,723 - INFO - 🌟 top_features_data22: Top features for PM10 in Takapuna: [('NO', 0.13616338528505534), ('NOx', 0.13137213923079702), ('Hour', 0.12187039938117586), ('Wind_Speed', 0.10867029970541756), ('NO2', 0.08402766918868398), ('Wind_Dir', 0.06940053549902547), ('Day', 0.035391068035720216), ('Rel_Humidity', 0.035055730794877324), ('Quarter', 0.025599675518559562), ('Month', 0.02467156095150525)]
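🎓 The log output above is a list of `(feature, score)` pairs sorted in descending order for each site and pollutant. The body of `DataFrameAdapter.get_top_correlated_features` is not shown in this notebook, so the sketch below is only an assumption about the ranking step: it scores each numeric column by its absolute Pearson correlation with the target. The function name `top_correlated_features`, the `exclude` argument and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd

def top_correlated_features(data, target, num_features=10, exclude=()):
    """Rank numeric columns by |Pearson correlation| with the target (illustrative sketch only)."""
    numeric = data.select_dtypes(include="number").drop(columns=list(exclude), errors="ignore")
    # Correlation of every remaining column with the target, without the target itself
    scores = numeric.corr()[target].drop(target).abs()
    ranked = scores.sort_values(ascending=False).head(num_features)
    return list(ranked.items())

# Synthetic data: PM2.5 is built to depend on PM10, Wind_Speed is independent noise
rng = np.random.default_rng(42)
pm10 = rng.normal(12, 4, 500)
toy = pd.DataFrame({
    "PM10": pm10,
    "PM2.5": 0.4 * pm10 + rng.normal(0, 1, 500),
    "Wind_Speed": rng.normal(3, 1, 500),
    "AQI": rng.normal(20, 5, 500),
})
ranked = top_correlated_features(toy, target="PM2.5", num_features=2, exclude=["AQI"])
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

A correlation ranking like this only captures linear, pairwise relationships, which may be why the logged pipeline also reports multicollinear and low-information features separately.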
🛠️ [Predictive Analytics] Predictive Models¶
Adaptive Cross Validation: TimeSeriesSplit with Time Series Data vs. KFold/Standard Cross Validation
[Figures: TimeSeriesSplit Adaptive Cross Validation | KFold/Standard Cross Validation]
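🎓 To make the comparison concrete, the sketch below prints the train and test indices that scikit-learn's `TimeSeriesSplit` and `KFold` produce on twelve time-ordered toy observations. The toy array and the helper `describe_splits` are illustrative assumptions and are not part of this notebook's pipeline.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 time-ordered observations (toy data)

def describe_splits(splitter, X):
    """Return (train_indices, test_indices) as plain lists for each fold."""
    return [(train.tolist(), test.tolist()) for train, test in splitter.split(X)]

ts_splits = describe_splits(TimeSeriesSplit(n_splits=3), X)
kf_splits = describe_splits(KFold(n_splits=3, shuffle=False), X)

for name, splits in [("TimeSeriesSplit", ts_splits), ("KFold", kf_splits)]:
    print(name)
    for fold, (train, test) in enumerate(splits, start=1):
        # A fold "looks ahead" if any training row comes after the earliest test row
        looks_ahead = max(train) > min(test)
        print(f"  fold {fold}: train={train} test={test} looks_ahead={looks_ahead}")
```

With `TimeSeriesSplit`, every fold trains only on rows that precede its test rows, and the training window grows from fold to fold. With `KFold`, some folds train on observations that come after the test block, which leaks future information when the rows are ordered in time.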
In [63]:
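import numpy as np ## Used by evaluate() (np.sqrt, np.mean)
import pandas as pd ## Used by fit() to build the Prophet/NeuralProphet training frames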
# from sklearn.model_selection import KFold
from sklearn.model_selection import TimeSeriesSplit
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from keras.models import Sequential
from keras.layers import Dense, LSTM
from keras.optimizers import Adam
# from statsmodels.tsa.arima.model import ARIMA
import pmdarima as pm
from prophet import Prophet
from xgboost import XGBRegressor
import joblib, pickle
In [9]:
# !pip install neuralprophet
from neuralprophet import NeuralProphet
2024-05-27 12:36:42,389 - ERROR - Importing plotly failed. Interactive plots will not work.
2024-05-27 12:36:42,397 - ERROR - Importing plotly failed. Interactive plots will not work.
In [10]:
class PredictiveModels:
"""
A class for creating and managing various predictive models.
* [x] The adaptive forecasting approach to enhance long-term predictive accuracy for RQ2:
* Use TimeSeriesSplit for cross-validation.
* Continuously update the model with new data during each n_splits.
"""
# def __init__(self, training_data, testing_data, target_variable, features):
def __init__(self, training_data, target_variable, features):
"""
* [x] Initialize the predictor with training data.
* [x] Removed test_data, since test data is derived correctly during cross-validation.
:param training_data: DataFrame containing the training data.
:param target_variable: The target variable for prediction (e.g., 'PM2.5' or 'PM10').
:param features: List of feature columns to be used for training.
"""
## Store training data with only the relevant features and target variable; also refer to adaptive_cross_validate()
# self.training_data = training_data
self.training_data = training_data[features + [target_variable]]
# self.testing_data = testing_data
self.target_variable = target_variable
self.features = features
## Extract features and target from the training data
self.X_train = self.training_data[self.features]
self.y_train = self.training_data[self.target_variable]
# self.X_test = self.testing_data[self.features]
# self.y_test = self.testing_data[self.target_variable]
## Standardize the features
self.scaler = StandardScaler()
self.X_train_scaled = self.scaler.fit_transform(self.X_train)
# self.X_test_scaled = self.scaler.transform(self.X_test)
## Initialize dictionaries to store models and evaluation results
self.models = {} ## Dictionary to store models
self.evaluation_results = {} ## Dictionary to store evaluation results
logging.debug(f"Initialized PredictiveModels with target: {target_variable}")
#### Step 2: Model Creation
def _create_model(self, model_name, arima_model=None):
"""
Create and configure a model instance based on the model name.
:param model_name: Name of the model to be created.
:param arima_model: ARIMA model order string if applicable (e.g., 'ARIMA(1,1,1)').
:return: Configured model instance.
"""
if model_name == 'LinearRegression':
return LinearRegression()
elif model_name == 'Ridge':
## Consider using 'Ridge', 'Lasso' regression for regularization to prevent overfitting
return Ridge(alpha=1.0) ## Adjust alpha for regularization strength
elif model_name == 'Lasso':
return Lasso(alpha=0.1) ## Regularization strength
elif model_name == 'RandomForest':
# return RandomForestRegressor(n_estimators=100, random_state=42)
return RandomForestRegressor(
n_estimators=200, ## Increase number of trees
max_depth=10, ## Limit depth of trees
min_samples_split=5, ## Minimum number of samples required to split a node
min_samples_leaf=2, ## Minimum number of samples required at each leaf node
random_state=42,
bootstrap=True
)
elif model_name == 'SVR':
## Support Vector Regression (SVR) using linear kernel vs non-linear Radial Basis Function (RBF) kernel: 'linear' | 'poly' | 'rbf'
# return SVR(kernel='rbf')
# if kernel == 'linear':
# return SVR(kernel='linear', C=1.0)
# elif kernel == 'poly':
# return SVR(kernel='poly', C=1.0, degree=3)
return SVR(kernel='rbf', C=1.0, gamma='scale') ## Experiment with different kernels and hyperparameters like C and gamma
elif model_name == 'XGBoost':
# return XGBRegressor(objective='reg:squarederror', n_estimators=100)
return XGBRegressor(
objective='reg:squarederror',
n_estimators=200, ## Increase number of trees
learning_rate=0.05, ## Lower learning rate
max_depth=6, ## Limit depth of trees
subsample=0.8, ## Subsample ratio of training instances
colsample_bytree=0.8, ## Subsample ratio of columns
random_state=42
)
elif model_name == 'ARIMA' and arima_model is not None:
try:
order = tuple(map(int, arima_model.split('(')[1].strip(')').split(',')))
model = pm.ARIMA(order=order, suppress_warnings=True)
logging.debug(f"ARIMA model created with order: {order}")
return model
except Exception as e:
logging.error(f"Error in creating ARIMA model: {e}")
raise ValueError(f"Error in creating ARIMA model: {e}")
elif model_name == 'Prophet':
return self._create_prophet_model()
elif model_name == 'NeuralProphet':
return self._create_neural_prophet_model()
elif model_name == 'LSTM':
model = Sequential()
model.add(LSTM(50, activation='relu', input_shape=(1, len(self.features))))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')
return model
elif model_name == 'MLP':
model = Sequential()
model.add(Dense(100, activation='relu', input_dim=len(self.features)))
model.add(Dense(50, activation='relu'))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')
return model
else:
raise ValueError(f"Model '{model_name}' is not supported.")
def _create_prophet_model(self, changepoint_prior_scale=0.01, interval_width=0.95,
daily_seasonality=True, weekly_seasonality=True,
yearly_seasonality=True, include_holidays=False, country_code=None):
"""
Configures and returns a Prophet model with the specified parameters.
:param changepoint_prior_scale: Flexibility of the trend
:param interval_width: Uncertainty interval width
:param daily_seasonality: Whether to include daily seasonality
:param weekly_seasonality: Whether to include weekly seasonality
:param yearly_seasonality: Whether to include yearly seasonality
:param include_holidays: Whether to include country-specific holidays
:param country_code: Country code for holidays
:return: Configured Prophet model
"""
# logging.info(f"Configuring Prophet model with changepoint_prior_scale={changepoint_prior_scale}, interval_width={interval_width}, "
# f"daily_seasonality={daily_seasonality}, weekly_seasonality={weekly_seasonality}, yearly_seasonality={yearly_seasonality}, "
# f"include_holidays={include_holidays}, country_code={country_code}")
model = Prophet(
# changepoint_prior_scale=changepoint_prior_scale,
# interval_width=interval_width,
daily_seasonality=daily_seasonality,
weekly_seasonality=weekly_seasonality,
yearly_seasonality=yearly_seasonality)
if include_holidays and country_code:
model.add_country_holidays(country_name=country_code)
logging.debug("Prophet model configured successfully")
return model
def _create_neural_prophet_model(self, periods=7*24, n_forecasts=1):
"""
Configures and returns a NeuralProphet model. The model is not fitted here; fitting happens in fit().
:param periods: Number of periods for future predictions (only used in the log message here).
:param n_forecasts: Number of steps ahead to forecast.
:return: Configured NeuralProphet model.
"""
model = NeuralProphet(
n_changepoints=0, ## Disable trend changepoints
yearly_seasonality=False, ## Disable yearly seasonality
weekly_seasonality=True, ## Enable weekly seasonality
daily_seasonality=True, ## Enable daily seasonality
n_lags=24, ## Use 24 lags (one day of hourly data)
n_forecasts=n_forecasts ## Forecast 1 step ahead
)
logging.info(f"NeuralProphet model generated for {periods} periods.")
return model
#### Step 3: Fitting the Model
def fit(self, model_name, model=None, param_grid=None):
"""
Fit a model to the training data.
The fit() method is used within the adaptive_cross_validate() function to adapt the model with new test data included in the training.
This method needs to ensure that the model is continuously updated with new data in each fold of the cross-validation.
:param model_name: Name of the model (e.g., 'LinearRegression', 'RandomForest', 'SVR', 'LSTM', 'MLP').
:param model: The instantiated model to be trained (optional for custom models).
"""
if model is None:
model = self._create_model(model_name)
logging.debug(f"Fitting Model: {model_name}")
if model_name in ['LSTM', 'MLP']:
self._fit_deep_learning_model(model_name, model)
elif model_name == 'Prophet':
df_train = pd.DataFrame({'ds': self.training_data.index, 'y': self.y_train}) ## Use the index as 'ds'; indexing it with X_train would fail
df_train['ds'] = pd.to_datetime(df_train['ds'], errors='coerce')
df_train = df_train.dropna()
if len(df_train) > 0: ## Ensure non-empty DataFrame
model.fit(df_train)
elif model_name == 'NeuralProphet':
df_train = pd.DataFrame({'ds': self.training_data.index, 'y': self.y_train})
# df_train['ds'] = pd.to_datetime(df_train['ds'], errors='coerce')
model.fit(df_train)
elif model_name == 'ARIMA':
model.fit(self.y_train)
else:
## Standard fitting process for other models
# if param_grid:
# grid_search = GridSearchCV(model, param_grid, cv=5, scoring='neg_mean_squared_error')
# grid_search.fit(self.X_train_scaled, self.y_train)
# model = grid_search.best_estimator_
# else:
model.fit(self.X_train_scaled, self.y_train) ## Fit traditional ML models
self.models[model_name] = model ## Store the model in the models dictionary
## Fit the model only once per fold and save only after cross-validation is complete: f"data/models/{model_name}.pkl": save_format='joblib' | 'pickle'
# self.save_model(model_name, model, save_format='joblib')
logging.info(f'Model {model_name} fitted successfully')
# def prophet_fit(self, periods=7*24, freq='H'):
# """
# Fits the Prophet model using the training data and generates predictions over the forecast horizon.
# :param periods: Number of periods for future predictions.
# :param freq: Frequency of the data ('H' for hourly).
# :return: The forecast results from Prophet.
# """
# self.model.fit(self.training_data)
# future = self.model.make_future_dataframe(periods=periods, freq=freq)
# self.forecast_df = self.model.predict(future)
# logging.info(f"Prophet model fit and forecast generated for {periods} periods with frequency {freq}.")
# return self.forecast_df
def _fit_deep_learning_model(self, model_name, model):
"""
Fit a deep learning model (LSTM and MLP).
:param model_name: Name of the model (e.g., 'LSTM', 'MLP').
:param model: The instantiated deep learning model to be trained.
"""
if model_name == 'LSTM': ## Reshape data for LSTM
X_train_reshaped = self.X_train_scaled.reshape((self.X_train_scaled.shape[0], 1, self.X_train_scaled.shape[1]))
model.fit(X_train_reshaped, self.y_train, epochs=50, batch_size=72, verbose=2, shuffle=False)
else: ## Fit MLP model
model.fit(self.X_train_scaled, self.y_train, epochs=50, batch_size=10, verbose=2)
# def predict(self, model_name):
# """
# Make predictions using a trained model.
# :param model_name: Name of the model to be used for prediction.
# :return: Array of predictions.
# """
# model = self.models.get(model_name)
# if not model:
# # raise ValueError(f"Model '{model_name}' has not been trained.")
# # model = joblib.load(f"data/models/{model_name}.pkl") ## Load the model from disk if not found in memory
# model = self.load_model(model_name) ## Load the model from disk
# if model_name == 'LSTM':
# X_test_reshaped = self.X_test_scaled.reshape((self.X_test_scaled.shape[0], 1, self.X_test_scaled.shape[1]))
# return model.predict(X_test_reshaped).flatten()
# return model.predict(self.X_test_scaled)
def evaluate(self, model_name):
"""
Evaluate the model using several metrics.
:param model_name: Name of the model to be evaluated.
:return: Dictionary of evaluation metrics.
"""
## Predictions using the specified model
# predictions = self.predict(model_name)
# predictions = self.all_predictions if hasattr(self, 'all_predictions') else self.predict(model_name)
# y_test = self.all_y_test if hasattr(self, 'all_y_test') else self.y_train
if hasattr(self, 'all_predictions') and hasattr(self, 'all_y_test'):
predictions = self.all_predictions
y_test = self.all_y_test
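else:
## predict() and y_test are no longer defined on this class, so evaluation needs the cross-validation results
raise ValueError("No cross-validation results found: run adaptive_cross_validate() before evaluate().")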
## Ensure the length of predictions and y_test are the same
if len(predictions) != len(y_test):
raise ValueError("Inconsistent number of samples between predictions and true values")
## Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, predictions)
## Calculate Root Mean Squared Error (RMSE)
rmse = np.sqrt(mse)
## Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, predictions)
## Calculate Mean Absolute Percentage Error (MAPE)
mape = np.mean(np.abs((y_test - predictions) / y_test)) * 100
## Calculate Symmetric Mean Absolute Percentage Error (SMAPE)
# smape = np.mean(2 * np.abs(y_test - predictions) / (np.abs(y_test) + np.abs(predictions))) * 100
# ## Calculate Median Absolute Percentage Error (MDAPE)
# mdape = np.median(np.abs((self.y_test - predictions) / self.y_test)) * 100
# ## Calculate Geometric Mean Relative Absolute Error (GMRAE)
# gmrae = np.exp(np.mean(np.log(np.abs((self.y_test - predictions) / self.y_test)))) * 100
## Calculate R-squared (R2)
r2 = r2_score(y_test, predictions)
## Calculate Adjusted R-squared (Adjusted R2)
adj_r2 = 1 - (1-r2)*(len(y_test)-1)/(len(y_test)-self.X_test_scaled.shape[1]-1)
## Store evaluation metrics in a dictionary
self.evaluation_results[model_name] = {
'MSE': mse,
'RMSE': rmse,
'MAE': mae,
'MAPE': mape,
# 'SMAPE': smape,
# 'MDAPE': mdape,
# 'GMRAE': gmrae,
'R2': r2,
'Adjusted R2': adj_r2
}
## Log results sorted by the preferred metric (RMSE)
results = sorted(self.evaluation_results.items(), key=lambda x: x[1]['RMSE'])
for model, metrics in results:
logging.info(
f"Model: {model}\n"
f"RMSE: {metrics['RMSE']:.2f}, MSE: {metrics['MSE']:.2f}, "
f"MAE: {metrics['MAE']:.2f}, MAPE: {metrics['MAPE']:.2f}%, "
# f"SMAPE: {smape:.2f}%, MDAPE: {mdape:.2f}%, GMRAE: {gmrae:.2f}%\n"
f"R2: {metrics['R2']:.2f}, Adjusted R2: {metrics['Adjusted R2']:.2f}"
)
best_model = results[0][0]
logging.debug(f"\nModel with the best (lowest) RMSE in evaluation_results: {best_model}")
return self.evaluation_results[model_name]
#### Step 4: Cross-Validation
def adaptive_cross_validate(self, model_name, arima_model=None, site='Penrose', pollutant='PM2.5', n_splits=5, n_forecasts=24):
"""
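[x] Perform adaptive cross-validation using TimeSeriesSplit.
[ ] Perform K-Fold Cross-Validation.
:param model_name: Name of the model to be cross-validated.
:param arima_model: Pre-selected ARIMA specification passed to _create_model (used for 'ARIMA' only).
:param site: Site name used when saving the model (e.g., 'Penrose' or 'Takapuna').
:param pollutant: Pollutant name used when saving the model (e.g., 'PM2.5' or 'PM10').
:param n_splits: Number of splits (folds) in TimeSeriesSplit.
:param n_forecasts: Number of steps ahead to forecast for NeuralProphet (default: 24 hours ahead).
:return: Dictionary mapping each metric name to a list of per-fold scores.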
"""
# kf = KFold(n_splits=n_splits, shuffle=False, random_state=42)
ts_cross_validate = TimeSeriesSplit(n_splits=n_splits)
# X = self.training_data[self.features].values
X = self.training_data[self.features] ## Retain DataFrame with column names
# y = self.training_data[self.target_variable].values
y = self.training_data[self.target_variable] ## Retain DataFrame with column names
n_predictors = X.shape[1] ## Number of predictors
# scores = {'RMSE': [], 'MSE': [], 'MAE': [], 'MAPE': [], 'R2': [], 'Adjusted R2': []}
all_metrics = {
'RMSE': [], 'MSE': [], 'MAE': [], 'MAPE': [], 'R2': [], 'Adjusted R2': []
}
all_predictions = []
all_y_test = []
## Create the Prophet model once
# if model_name == 'Prophet':
# prophet_model = self._create_prophet_model(include_holidays=True, country_code='NZ')
## Split & print out results
for fold, (train_index, test_index) in enumerate(ts_cross_validate.split(X)):
logging.info(f"Processing fold {fold + 1}/{n_splits} for {model_name}")
## Splitting data into training and testing sets for current fold
# X_train_ts, X_test_ts = X[train_index], X[test_index]
# y_train_ts, y_test_ts = y[train_index], y[test_index]
X_train_ts, X_test_ts = X.iloc[train_index], X.iloc[test_index] ## Retain DataFrame with column names
y_train_ts, y_test_ts = y.iloc[train_index], y.iloc[test_index] ## Retain DataFrame with column names
## Scaling the data
X_train_ts_scaled = self.scaler.fit_transform(X_train_ts)
X_test_ts_scaled = self.scaler.transform(X_test_ts)
model = self._create_model(model_name, arima_model=arima_model)
if model_name in ['LSTM', 'MLP']:
model = self._reinitialize_model(model_name) ## Reinitialize the model for each split/fold starts with an untrained model
if model_name == 'LSTM':
X_train_ts_reshaped = X_train_ts_scaled.reshape((X_train_ts_scaled.shape[0], 1, X_train_ts_scaled.shape[1]))
X_test_ts_reshaped = X_test_ts_scaled.reshape((X_test_ts_scaled.shape[0], 1, X_test_ts_scaled.shape[1]))
model.fit(X_train_ts_reshaped, y_train_ts, epochs=50, batch_size=72, verbose=0, shuffle=False)
predictions_ts = model.predict(X_test_ts_reshaped).flatten()
else:
model.fit(X_train_ts_scaled, y_train_ts, epochs=50, batch_size=10, verbose=0)
predictions_ts = model.predict(X_test_ts_scaled).flatten()
elif model_name == 'ARIMA':
# logging.info(f"Training fold {fold + 1} for model {model_name}")
model.fit(y_train_ts) # Fitting the model with training data for the current fold
# logging.debug(f"Completed training fold {fold + 1} for model {model_name}")
predictions_ts = model.predict(n_periods=len(y_test_ts)) ## Use predict method of pmdarima
elif model_name == 'Prophet':
try:
## Create Prophet model once and reuse if possible
# if 'prophet_model' not in self.models:
# self.models['prophet_model'] = self._create_prophet_model(include_holidays=True, country_code='NZ')
# model = self.models['prophet_model']
model = self._create_prophet_model(include_holidays=True, country_code='NZ')
## Prepare training data for Prophet
df_train = pd.DataFrame({'ds': self.training_data.index[train_index], 'y': y_train_ts})
df_train['ds'] = pd.to_datetime(df_train['ds'], errors='coerce')
## Fit the model
model.fit(df_train)
logging.debug(f"Prophet model fitted for fold {fold + 1}")
## Prepare test data for Prophet
df_test = pd.DataFrame({'ds': self.training_data.index[test_index]})
df_test['ds'] = pd.to_datetime(df_test['ds'], errors='coerce') ## Handle invalid dates
## Predict using the fitted model
predictions_ts = model.predict(df_test)['yhat'].values ## prediction method
logging.debug(f"Predictions made for fold {fold + 1} using Prophet model")
except Exception as e:
logging.error(f"Error in fitting/predicting with Prophet model for fold {fold + 1}: {e}")
predictions_ts = np.zeros(len(test_index))
self.models[model_name] = model ## Ensure model is added to the dictionary
logging.debug(f"Completed training fold {fold + 1} for model {model_name}")
elif model_name == 'NeuralProphet':
model = self._create_neural_prophet_model(periods=len(test_index), n_forecasts=n_forecasts)
## Prepare training data for NeuralProphet
df_train = pd.DataFrame({'ds': self.training_data.index[train_index], 'y': y_train_ts})
df_train['ds'] = pd.to_datetime(df_train['ds'], errors='coerce')
## Fit the model
model.fit(df_train)
## Prepare test data for NeuralProphet
df_test = pd.DataFrame({'ds': self.training_data.index[test_index], 'y': y_test_ts}) ## Add 'y' column to df_test
df_test['ds'] = pd.to_datetime(df_test['ds'], errors='coerce')
## Predict using the fitted model
# future = model.make_future_dataframe(df_test, n_historic_predictions=True, periods=len(test_index), freq='D')
future = model.make_future_dataframe(df_test, n_historic_predictions=True, periods=len(test_index))
# predictions_ts = model.predict(future)['yhat1'].values: yhat, yhat1 ...
predictions_ts = model.predict(future)[f'yhat{n_forecasts}'].values
if np.isnan(predictions_ts).any():
logging.warning(f"NaN values found in predictions for fold {fold + 1}. Replacing NaNs with mean value.")
predictions_ts = np.nan_to_num(predictions_ts, nan=np.nanmean(predictions_ts))
## Ensure the lengths of y_test_ts and predictions_ts are consistent
predictions_ts = predictions_ts[:len(y_test_ts)]
self.models[model_name] = model ## Store the model in the models dictionary
logging.debug(f"Predictions made for fold {fold + 1} using NeuralProphet model")
else:
# if param_grid:
# grid_search = GridSearchCV(model, param_grid, cv=5, scoring='neg_mean_squared_error')
# grid_search.fit(X_train_ts_scaled, y_train_ts)
# model = grid_search.best_estimator_
# else:
logging.debug(f"Training fold {fold + 1} for model {model_name}")
model.fit(X_train_ts_scaled, y_train_ts)
logging.debug(f"Completed training fold {fold + 1} for model {model_name}")
predictions_ts = model.predict(X_test_ts_scaled)
## Evaluate the model
# scores.append(r2_score(y_test_ts, predictions_ts))
y_test_ts = y_test_ts[:len(predictions_ts)]
# metrics = CommonUtils.calculate_metrics(y_test_ts, predictions_ts)
metrics = CommonUtils.calculate_metrics_adj_r2(y_test_ts, predictions_ts, n_predictors)
for metric_name, metric_value in metrics.items():
all_metrics[metric_name].append(metric_value)
logging.debug(f"Fold results for {model_name}: all_metrics = {all_metrics}")
all_predictions.extend(predictions_ts)
all_y_test.extend(y_test_ts)
## Adapt the model with the new test data included in training
self.X_train_scaled = np.vstack([self.X_train_scaled, X_test_ts_scaled])
self.y_train = np.concatenate([self.y_train, y_test_ts])
self.X_test_scaled = X_test_ts_scaled
self.y_test = y_test_ts
# self.fit(model_name, model) ## Call fit method with the current fold's model and data
if model_name != 'Prophet' and model_name != 'NeuralProphet': ## Avoid redundant fitting for Prophet
self.fit(model_name, model)
self.all_predictions = np.array(all_predictions)
self.all_y_test = np.array(all_y_test)
logging.info(f"Adaptive cross-validation scores for {model_name}: {all_metrics}")
## Save the model after all folds are completed
logging.debug(f"Saving model {model_name} after all folds")
self.save_model(model_name, model, save_format='joblib', site=site, pollutant=pollutant)
# return scores
return all_metrics
def _reinitialize_model(self, model_name):
"""
Reinitialize a deep learning model.
:param model_name: Name of the model to be reinitialized.
:return: A new instance of the deep learning model.
"""
if model_name == 'LSTM':
model = Sequential()
model.add(LSTM(50, activation='relu', input_shape=(1, self.X_train.shape[1])))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')
return model
elif model_name == 'MLP':
model = Sequential()
model.add(Dense(100, activation='relu', input_dim=self.X_train.shape[1]))
model.add(Dense(50, activation='relu'))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')
return model
def model_selection(self, models, preference='RMSE'):
"""
Compare multiple models on a configurable error metric (e.g. RMSE, MSE, MAE, or MAPE), where lower is better.
:param models: List of tuples. Each tuple contains the model name and instantiated model.
:param preference: Preferred metric for model evaluation; must be a key returned by evaluate().
:return: Tuple of (name of the best model, list of (model name, evaluation metrics) pairs).
"""
results = []
best_model = None
best_metric = float('inf')
for name, model in models:
self.fit(name, model)
evaluation = self.evaluate(name)
results.append((name, evaluation))
if evaluation[preference] < best_metric:
best_metric = evaluation[preference]
best_model = name
return best_model, results
# def hyperparameter_tuning(self, model_name, param_grid):
# model = self._create_model(model_name)
# tscv = TimeSeriesSplit(n_splits=5)
# grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=tscv, scoring='neg_mean_squared_error')
# grid_search.fit(self.X_train_scaled, self.y_train)
# best_params = grid_search.best_params_
# logging.info(f"Best parameters for {model_name}: {best_params}")
# self.fit(model_name, best_params)
# return best_params
#### Step 5: Save and Load Models
def save_model(self, model_name, model, save_format='joblib', site='Penrose', pollutant='PM2.5'):
"""
Save the fitted model to disk.
:param model_name: Name of the model.
:param model: The trained model instance.
:param save_format: Format to save the model ('joblib', 'pickle', or 'pmml').
:param site: Site information (e.g., 'Penrose' or 'Takapuna').
:param pollutant: Pollutant information (e.g., 'PM2.5' or 'PM10').
"""
logging.info(f'Saving model {model_name} for site {site} and pollutant {pollutant}')
file_extension = {'pickle': 'pkl', 'pmml': 'pmml'}.get(save_format, 'joblib') ## Unsupported formats are rejected below
file_path = f"data/models/{site}_{pollutant}_{model_name}.{file_extension}"
if save_format == 'joblib':
joblib.dump(model, file_path)
elif save_format == 'pickle':
with open(file_path, 'wb') as file:
pickle.dump(model, file)
elif save_format == 'pmml':
from nyoka import skl_to_pmml ## for sklearn models
skl_to_pmml(model, file_path) ## This line depends on the specific implementation and libraries used
else:
raise ValueError("Unsupported save format. Use 'joblib', 'pickle', or 'pmml'.")
logging.info(f'Model saved at {file_path}')
def load_model(self, model_name, load_format='joblib', site='Penrose', pollutant='PM2.5'):
"""
Load a saved model from disk.
:param model_name: Name of the model.
:param load_format: Format to load the model ('joblib' or 'pickle').
:param site: Site information (e.g., 'Penrose' or 'Takapuna').
:param pollutant: Pollutant information (e.g., 'PM2.5' or 'PM10').
:return: Loaded model instance.
"""
logging.info(f'Loading model {model_name} for site {site} and pollutant {pollutant}')
file_extension = 'pkl' if load_format == 'pickle' else 'joblib'
file_path = f"data/models/{site}_{pollutant}_{model_name}.{file_extension}"
if load_format == 'joblib':
model = joblib.load(file_path)
elif load_format == 'pickle':
with open(file_path, 'rb') as file:
model = pickle.load(file)
elif load_format == 'pmml':
raise NotImplementedError("PMML loading not implemented. Use a compatible library for your model type.")
else:
raise ValueError("Unsupported load format. Use 'joblib' or 'pickle'.")
logging.info(f'Model loaded from {file_path}')
return model
# def generate_markdown_table(self, all_metrics, model_name, target_variable):
# """
# Generate a markdown table for cross-validation metrics.
# :param all_metrics: Dictionary of metrics with lists of scores for each fold.
# :param model_name: Name of the model.
# :param target_variable: Name of the target variable.
# :return: Markdown table as a string.
# """
# metrics = ['R2', 'RMSE', 'MSE', 'MAE', 'MAPE', 'Adjusted R2']
# header = f"| {target_variable} | {model_name} | Metric | Fold1 | Fold2 | Fold3 | Fold4 | Fold5 |\n"
# header += "|---|---|---|---|---|---|---|---|\n"
# rows = []
# for metric in metrics:
# row = [f"{target_variable}", f"{model_name}", f"{metric}"]
# row.extend([f"{score:.2f}" for score in all_metrics[metric]])
# rows.append(" | ".join(row))
# return header + "\n".join(rows)
def generate_markdown_table(self, target_variable, evaluation_results):
"""
Generate a markdown table for cross-validation metrics for all models of a target variable.
:param target_variable: Name of the target variable.
:param evaluation_results: Dictionary of evaluation results for all models.
:return: Markdown table as a string.
"""
metrics = ['RMSE', 'MSE', 'MAE', 'MAPE', 'R2', 'Adjusted R2']
header = "| Target | Model | Metric | Fold1 | Fold2 | Fold3 | Fold4 | Fold5 | Training Time |\n"
header += "|---|---|---|---|---|---|---|---|---|\n"
rows = []
for model_name, model_metrics in evaluation_results.items():
for metric in metrics:
row = [target_variable, model_name, metric]
row.extend([f"{score:.2f}" for score in model_metrics[metric]])
row.append(f"{model_metrics['Training Time']:.2f}") ## Append total training time
rows.append("| " + " | ".join(row) + " |") ## Match the leading/trailing pipes used in the header
return header + "\n".join(rows)
## Adjust the train_and_evaluate_models method to log markdown tables
#### Step 6: Train and Evaluate Models
## Train and evaluate models for each target variable
def train_and_evaluate_models(data_dict, model_names):
"""
Train and evaluate models for each target variable in the data dictionary.
:param data_dict: Dictionary mapping each target name (e.g. 'Penrose_PM2.5') to a tuple of (training data, target variable, feature list, ARIMA specification).
:param model_names: List of model names to train and evaluate.
:return: Tuple of (dictionary of trained models, dictionary of evaluation results).
"""
## Initialize a dictionary to store trained models and evaluation results
trained_models = {}
evaluation_results = {}
## Train and evaluate models for each target variable
for target_var, (train_data, target, features, arima_model) in data_dict.items():
logging.info(f"\n🛠️ Training models for {target_var} ... \n")
logging.info(f"Selected ARIMA model for {target_var}: {arima_model}")
## Initialize the PredictiveModels class for each target variable
pm = PredictiveModels(train_data, target, features)
## Initialize dictionary for evaluation results of current target variable
target_evaluation_results = {}
for model_name in model_names:
start_time = time.time()
logging.debug(f"Fitting Model: {model_name} at {start_time}")
if model_name == 'ARIMA':
all_metrics = pm.adaptive_cross_validate(model_name, arima_model=arima_model, site=f'{target}', pollutant=f'{target_var}')
else:
all_metrics = pm.adaptive_cross_validate(model_name, site=f'{target}', pollutant=f'{target_var}')
end_time = time.time()
training_time = end_time - start_time
# evaluation = pm.evaluate(model_name)
# # pm.save_model(model_name, pm.models[model_name], site=target_var.split('_')[0], pollutant=target)
# trained_models[f"{target_var}_{model_name}"] = pm.models[model_name]
# target_evaluation_results[model_name] = evaluation
all_metrics['Training Time'] = training_time
target_evaluation_results[model_name] = all_metrics
trained_models[f"{target_var}_{model_name}"] = pm.models[model_name]
## Store results for the current target variable
evaluation_results[target_var] = target_evaluation_results
## Identify the best model for each target variable based on RMSE
# best_model_name = min(target_evaluation_results.items(), key=lambda x: x[1]['RMSE'])[0]
# best_model = trained_models[f"{target_var}_{best_model_name}"] ## Key to access trained models
## select the model with the best average RMSE across all folds
best_model_name = min(target_evaluation_results.items(), key=lambda x: np.mean(x[1]['RMSE']))[0]
best_model = pm.models[best_model_name]
logging.info(f"\nThe best model based on average RMSE across all folds for {target_var}: {best_model_name} \n")
## Generate and log the markdown table for all models and target variable
markdown_table = pm.generate_markdown_table(target_var, target_evaluation_results)
logging.info(f"\nMarkdown Table for {target_var}:\n{markdown_table}\n")
## Save the best model (lowest mean RMSE across folds) for each target variable
pm.save_model(best_model_name, best_model, site=target_var.split('_')[0], pollutant=target)
return trained_models, evaluation_results
def generate_markdown_table(evaluation_results, folds=5):
    """Build a markdown table of per-fold metrics for every target and model."""
    metrics = ['RMSE', 'MSE', 'MAE', 'MAPE', 'R2', 'Adjusted R2']
    header = "| Target | Model | Metric | " + " | ".join([f"Fold{i+1}" for i in range(folds)]) + " |"
    separator = "|---" * (folds + 3) + "|"
    rows = []
    for target, models in evaluation_results.items():
        for model, results in models.items():
            for metric in metrics:
                row = [target, model, metric]
                ## Each metric holds a list with one score per fold
                row.extend([f"{score:.2f}" for score in results[metric][:folds]])
                rows.append("| " + " | ".join(row) + " |")
    table = "\n".join([header, separator] + rows)
    return table
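🎓 The `adaptive_cross_validate` method above relies on `TimeSeriesSplit`, which always trains on earlier observations and tests on the block that follows, so the training window grows from fold to fold. The sketch below is a hand-rolled, pure-Python illustration of that expanding-window idea. The helper name `expanding_window_splits` is hypothetical (it is not part of the notebook), and it is only intended to mirror the default behaviour of scikit-learn's splitter (no gap, no cap on training size); the notebook itself uses `TimeSeriesSplit` directly.

```python
def expanding_window_splits(n_samples, n_splits=5):
    """Yield (train_indices, test_indices) with an expanding training window.

    Hypothetical helper for illustration only; meant to mirror the default
    behaviour of sklearn's TimeSeriesSplit (gap=0, max_train_size=None).
    """
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("Too few samples for the requested number of splits")
    first_test_start = n_samples - n_splits * test_size
    for test_start in range(first_test_start, n_samples, test_size):
        # Everything before the test block is used for training.
        train_idx = list(range(0, test_start))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx


# Toy example with 12 time-ordered samples (not the notebook's data).
folds = list(expanding_window_splits(n_samples=12, n_splits=5))
for fold, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"fold {fold}: train={train_idx} test={test_idx}")
```

Because every training index precedes every test index, no future observation leaks into training, which is why the method fits the scaler on the training part of each fold only.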
🧩 Predictive Models Development¶
🛠️ Data Preparation¶
In [11]:
def extract_feature_names(feature_list):
return [feature[0] for feature in feature_list]
## Extracting feature names: target_variables = ['PM2.5', 'PM10'] across Penrose & Takapuna
top_features_data11_names = extract_feature_names(top_features_data11)
top_features_data12_names = extract_feature_names(top_features_data12)
top_features_data21_names = extract_feature_names(top_features_data21)
top_features_data22_names = extract_feature_names(top_features_data22)
## Combining data for easier access
data_dict = {
'Penrose_PM2.5': (cleaned_data_site1, 'PM2.5', top_features_data11_names, 'ARIMA(0, 1, 4)'),
'Takapuna_PM2.5': (cleaned_data_site2, 'PM2.5', top_features_data12_names, 'ARIMA(0, 1, 2)'),
'Penrose_PM10': (cleaned_data_site1, 'PM10', top_features_data21_names, 'ARIMA(10, 0, 0)'),
'Takapuna_PM10': (cleaned_data_site2, 'PM10', top_features_data22_names, 'ARIMA(2, 0, 3)')
}
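🎓 Each entry of `data_dict` carries its pre-selected ARIMA specification as a plain string such as `'ARIMA(0, 1, 4)'`. How `_create_model` consumes that string is not shown in this section, so the following is only a sketch of one way to turn it into a numeric `(p, d, q)` tuple. The helper `parse_arima_order` is hypothetical and assumes the non-seasonal `ARIMA(p, d, q)` format used above.

```python
import re


def parse_arima_order(spec):
    """Turn a string such as 'ARIMA(0, 1, 4)' into the tuple (0, 1, 4).

    Hypothetical helper; assumes a non-seasonal 'ARIMA(p, d, q)' string.
    """
    pattern = r"\s*ARIMA\(\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\)\s*"
    match = re.fullmatch(pattern, spec)
    if match is None:
        raise ValueError(f"Unrecognised ARIMA specification: {spec!r}")
    p, d, q = (int(group) for group in match.groups())
    return p, d, q


# The four specifications listed in data_dict.
specs = {
    'Penrose_PM2.5': 'ARIMA(0, 1, 4)',
    'Takapuna_PM2.5': 'ARIMA(0, 1, 2)',
    'Penrose_PM10': 'ARIMA(10, 0, 0)',
    'Takapuna_PM10': 'ARIMA(2, 0, 3)',
}
orders = {name: parse_arima_order(spec) for name, spec in specs.items()}
print(orders)
```

Failing loudly on an unexpected string is a deliberate choice here: a silently mis-parsed order would train a different model from the one that was selected.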
🛠️ Model Development¶
In [12]:
logging.getLogger('prophet').setLevel(logging.INFO) ## The installed package is `prophet` (formerly `fbprophet`)
## Define Models to Train
# model_names = ['ARIMA', 'Prophet', 'NeuralProphet', 'LinearRegression', 'RandomForest', 'SVR', 'XGBoost', 'LSTM', 'MLP']
model_names = ['ARIMA', 'Prophet', 'NeuralProphet', 'LinearRegression', 'Ridge', 'Lasso', 'RandomForest', 'SVR', 'XGBoost']
## Train and Evaluate Models
trained_models, evaluation_results = train_and_evaluate_models(data_dict, model_names)
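🎓 A note on reading the MAPE values logged by this cell: MAPE divides each error by the actual value, so a single actual value of zero makes the whole metric infinite. That is the most likely reason for the `inf` entries in the output, although the metric helper in `CommonUtils` is not shown here. The sketch below uses made-up numbers and three hypothetical helpers (`mape`, `masked_mape`, `smape`) to show the effect and two common workarounds; the SMAPE formula follows the one commented out in `evaluate`.

```python
import math


def mape(y_true, y_pred):
    """Plain MAPE in percent; treated as infinite if any actual value is zero."""
    total = 0.0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 0:
            return math.inf  # simplified stand-in for a division by zero
        total += abs((actual - predicted) / actual)
    return 100.0 * total / len(y_true)


def masked_mape(y_true, y_pred):
    """MAPE in percent computed only over samples whose actual value is non-zero."""
    pairs = [(a, p) for a, p in zip(y_true, y_pred) if a != 0]
    if not pairs:
        return math.nan
    return 100.0 * sum(abs((a - p) / a) for a, p in pairs) / len(pairs)


def smape(y_true, y_pred):
    """Symmetric MAPE in percent; bounded between 0 and 200."""
    terms = []
    for a, p in zip(y_true, y_pred):
        denom = abs(a) + abs(p)
        terms.append(0.0 if denom == 0 else 2.0 * abs(a - p) / denom)
    return 100.0 * sum(terms) / len(terms)


# Illustrative values only: one actual reading is exactly zero.
y_true = [0.0, 2.0, 4.0]
y_pred = [1.0, 3.0, 3.0]
print(mape(y_true, y_pred), masked_mape(y_true, y_pred), smape(y_true, y_pred))
```

Masking changes which samples are scored, so it should be reported alongside the number of excluded readings; SMAPE keeps every sample but is no longer directly comparable to MAPE.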
2024-05-27 12:36:52,616 - INFO - 🛠️ Training models for Penrose_PM2.5 ... 2024-05-27 12:36:52,619 - INFO - Selected ARIMA model for Penrose_PM2.5: ARIMA(0, 1, 4) 2024-05-27 12:36:52,630 - INFO - Processing fold 1/5 for ARIMA 2024-05-27 12:37:09,963 - INFO - Model ARIMA fitted successfully 2024-05-27 12:37:09,971 - INFO - Processing fold 2/5 for ARIMA 2024-05-27 12:37:27,871 - INFO - Model ARIMA fitted successfully 2024-05-27 12:37:27,872 - INFO - Processing fold 3/5 for ARIMA 2024-05-27 12:37:53,354 - INFO - Model ARIMA fitted successfully 2024-05-27 12:37:53,356 - INFO - Processing fold 4/5 for ARIMA 2024-05-27 12:38:25,838 - INFO - Model ARIMA fitted successfully 2024-05-27 12:38:25,839 - INFO - Processing fold 5/5 for ARIMA 2024-05-27 12:39:04,050 - INFO - Model ARIMA fitted successfully 2024-05-27 12:39:04,096 - INFO - Adaptive cross-validation scores for ARIMA: {'RMSE': [4.232457602542978, 4.594882574005848, 7.315938614136523, 6.56074928876857, 4.390077646647869], 'MSE': [17.91369735732386, 21.11294586890261, 53.522957805813824, 43.0434312300773, 19.272781743597292], 'MAE': [3.1658379474338894, 3.3103680166439893, 5.4907753466556395, 4.7369575135266855, 3.25949572353243], 'MAPE': [inf, inf, inf, inf, inf], 'R2': [-0.018067047113852208, -0.10992895285583315, -0.3257055016767856, -0.14617622456834267, -0.014362429458957049], 'Adjusted R2': [-0.021597099288310773, -0.11377752758834303, -0.33030226139133756, -0.15015048332204706, -0.017879636218523398]} 2024-05-27 12:39:04,099 - INFO - Saving model ARIMA for site PM2.5 and pollutant Penrose_PM2.5 2024-05-27 12:39:04,231 - INFO - Model saved at data/models/PM2.5_Penrose_PM2.5_ARIMA.joblib 2024-05-27 12:39:04,240 - INFO - Processing fold 1/5 for Prophet 2024-05-27 12:39:04,380 - DEBUG - input tempfile: /var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/5kuwbw07.json 2024-05-27 12:39:04,453 - DEBUG - input tempfile: /var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/1alh6mbo.json 2024-05-27 
12:39:04,454 - DEBUG - idx 0 2024-05-27 12:39:04,454 - DEBUG - running CmdStan, num_threads: None 2024-05-27 12:39:04,455 - DEBUG - CmdStan args: ['/Users/nnthanh/.pyenv/versions/3.11.7/lib/python3.11/site-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=23028', 'data', 'file=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/5kuwbw07.json', 'init=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/1alh6mbo.json', 'output', 'file=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/prophet_modelxbzuidwm/prophet_model-20240527123904.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000'] 12:39:04 - cmdstanpy - INFO - Chain [1] start processing 2024-05-27 12:39:04,455 - INFO - Chain [1] start processing 12:39:04 - cmdstanpy - INFO - Chain [1] done processing 2024-05-27 12:39:04,587 - INFO - Chain [1] done processing 2024-05-27 12:39:05,047 - INFO - Processing fold 2/5 for Prophet 2024-05-27 12:39:05,075 - DEBUG - input tempfile: /var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/z4ovh0gq.json 2024-05-27 12:39:05,222 - DEBUG - input tempfile: /var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/omq0wkbb.json 2024-05-27 12:39:05,223 - DEBUG - idx 0 2024-05-27 12:39:05,223 - DEBUG - running CmdStan, num_threads: None 2024-05-27 12:39:05,224 - DEBUG - CmdStan args: ['/Users/nnthanh/.pyenv/versions/3.11.7/lib/python3.11/site-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=40747', 'data', 'file=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/z4ovh0gq.json', 'init=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/omq0wkbb.json', 'output', 'file=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/prophet_modelk8k7l1ok/prophet_model-20240527123905.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000'] 12:39:05 - cmdstanpy - INFO - Chain [1] start processing 2024-05-27 12:39:05,224 - INFO - Chain [1] start processing 12:39:05 - cmdstanpy - INFO - Chain [1] done 
processing 2024-05-27 12:39:05,633 - INFO - Chain [1] done processing 2024-05-27 12:39:06,096 - INFO - Processing fold 3/5 for Prophet 2024-05-27 12:39:06,134 - DEBUG - input tempfile: /var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/s1pqp9wm.json 2024-05-27 12:39:06,515 - DEBUG - input tempfile: /var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/la6jmccz.json 2024-05-27 12:39:06,516 - DEBUG - idx 0 2024-05-27 12:39:06,516 - DEBUG - running CmdStan, num_threads: None 2024-05-27 12:39:06,517 - DEBUG - CmdStan args: ['/Users/nnthanh/.pyenv/versions/3.11.7/lib/python3.11/site-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=44180', 'data', 'file=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/s1pqp9wm.json', 'init=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/la6jmccz.json', 'output', 'file=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/prophet_modelzgxwi1zs/prophet_model-20240527123906.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000'] 12:39:06 - cmdstanpy - INFO - Chain [1] start processing 2024-05-27 12:39:06,517 - INFO - Chain [1] start processing 12:39:07 - cmdstanpy - INFO - Chain [1] done processing 2024-05-27 12:39:07,515 - INFO - Chain [1] done processing 2024-05-27 12:39:07,950 - INFO - Processing fold 4/5 for Prophet 2024-05-27 12:39:07,992 - DEBUG - input tempfile: /var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/hfzfwsz9.json 2024-05-27 12:39:08,283 - DEBUG - input tempfile: /var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/9vis3qyr.json 2024-05-27 12:39:08,284 - DEBUG - idx 0 2024-05-27 12:39:08,284 - DEBUG - running CmdStan, num_threads: None 2024-05-27 12:39:08,284 - DEBUG - CmdStan args: ['/Users/nnthanh/.pyenv/versions/3.11.7/lib/python3.11/site-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=30629', 'data', 'file=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/hfzfwsz9.json', 
'init=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/9vis3qyr.json', 'output', 'file=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/prophet_model89y2a__g/prophet_model-20240527123908.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000'] 12:39:08 - cmdstanpy - INFO - Chain [1] start processing 2024-05-27 12:39:08,285 - INFO - Chain [1] start processing 12:39:08 - cmdstanpy - INFO - Chain [1] done processing 2024-05-27 12:39:08,987 - INFO - Chain [1] done processing 2024-05-27 12:39:09,439 - INFO - Processing fold 5/5 for Prophet 2024-05-27 12:39:09,487 - DEBUG - input tempfile: /var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/wk1gaa66.json 2024-05-27 12:39:09,852 - DEBUG - input tempfile: /var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/2rjcx9a1.json 2024-05-27 12:39:09,853 - DEBUG - idx 0 2024-05-27 12:39:09,853 - DEBUG - running CmdStan, num_threads: None 2024-05-27 12:39:09,853 - DEBUG - CmdStan args: ['/Users/nnthanh/.pyenv/versions/3.11.7/lib/python3.11/site-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=98744', 'data', 'file=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/wk1gaa66.json', 'init=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/2rjcx9a1.json', 'output', 'file=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/prophet_modeljlc_t8f8/prophet_model-20240527123909.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000'] 12:39:09 - cmdstanpy - INFO - Chain [1] start processing 2024-05-27 12:39:09,853 - INFO - Chain [1] start processing 12:39:11 - cmdstanpy - INFO - Chain [1] done processing 2024-05-27 12:39:11,723 - INFO - Chain [1] done processing 2024-05-27 12:39:12,185 - INFO - Adaptive cross-validation scores for Prophet: {'RMSE': [4.455758818748366, 4.7123277191035475, 6.403859557015526, 6.142045629728678, 4.707933007393088], 'MSE': [19.853786650853838, 22.20603253223164, 41.009417225979085, 37.724724517669145, 22.16463320210132], 'MAE': 
[3.489835708392768, 3.765752675667963, 4.028461801742639, 4.659402754552279, 3.743869550148356], 'MAPE': [inf, inf, inf, inf, inf], 'R2': [-0.12832574685646847, -0.16739362610147746, -0.015758699926216924, -0.004547757575811717, -0.1665659624054705], 'Adjusted R2': [-0.13223811074986802, -0.17144145420862533, -0.019280748122909808, -0.008030932879472541, -0.1706109206662385]} 2024-05-27 12:39:12,185 - INFO - Saving model Prophet for site PM2.5 and pollutant Penrose_PM2.5 2024-05-27 12:39:12,203 - INFO - Model saved at data/models/PM2.5_Penrose_PM2.5_Prophet.joblib 2024-05-27 12:39:12,205 - INFO - Processing fold 1/5 for NeuralProphet 2024-05-27 12:39:12,208 - INFO - NeuralProphet model generated for 168 periods. 2024-05-27 12:39:12,208 - INFO - NeuralProphet model generated for 2895 periods. WARNING - (NP.forecaster.fit) - When Global modeling with local normalization, metrics are displayed in normalized scale. 2024-05-27 12:39:12,209 - WARNING - When Global modeling with local normalization, metrics are displayed in normalized scale. INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.966% of the data. 2024-05-27 12:39:12,215 - INFO - Major frequency ns corresponds to 99.966% of the data. INFO - (NP.df_utils._infer_frequency) - Dataframe freq automatically defined as ns 2024-05-27 12:39:12,216 - INFO - Dataframe freq automatically defined as ns INFO - (NP.config.init_data_params) - Setting normalization to global as only one dataframe provided for training. 2024-05-27 12:39:12,219 - INFO - Setting normalization to global as only one dataframe provided for training. 
INFO - (NP.config.set_auto_batch_epoch) - Auto-set batch_size to 64 2024-05-27 12:39:12,251 - INFO - Auto-set batch_size to 64 INFO - (NP.config.set_auto_batch_epoch) - Auto-set epochs to 80 2024-05-27 12:39:12,251 - INFO - Auto-set epochs to 80 WARNING - (NP.config.set_lr_finder_args) - Learning rate finder: The number of batches (45) is too small than the required number for the learning rate finder (236). The results might not be optimal. 2024-05-27 12:39:12,291 - WARNING - Learning rate finder: The number of batches (45) is too small than the required number for the learning rate finder (236). The results might not be optimal.
INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.965% of the data. 2024-05-27 12:39:19,609 - INFO - Major frequency ns corresponds to 99.965% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 12:39:19,610 - INFO - Defined frequency is equal to major frequency - ns WARNING - (NP.data.splitting._make_future_dataframe) - Number of forecast steps is defined by n_forecasts. Adjusted to 24. 2024-05-27 12:39:19,615 - WARNING - Number of forecast steps is defined by n_forecasts. Adjusted to 24. INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column 2024-05-27 12:39:19,617 - INFO - Returning df with no ID column INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.966% of the data. 2024-05-27 12:39:19,621 - INFO - Major frequency ns corresponds to 99.966% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 12:39:19,622 - INFO - Defined frequency is equal to major frequency - ns INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.966% of the data. 2024-05-27 12:39:19,626 - INFO - Major frequency ns corresponds to 99.966% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 12:39:19,628 - INFO - Defined frequency is equal to major frequency - ns INFO - (NP.data.processing._handle_missing_data) - Dropped 24 rows at the end with NaNs in 'y' column. 2024-05-27 12:39:19,635 - INFO - Dropped 24 rows at the end with NaNs in 'y' column.
INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column 2024-05-27 12:39:19,696 - INFO - Returning df with no ID column 2024-05-27 12:39:19,697 - WARNING - NaN values found in predictions for fold 1. Replacing NaNs with mean value. 2024-05-27 12:39:19,700 - INFO - Processing fold 2/5 for NeuralProphet 2024-05-27 12:39:19,703 - INFO - NeuralProphet model generated for 168 periods. 2024-05-27 12:39:19,704 - INFO - NeuralProphet model generated for 2895 periods. WARNING - (NP.forecaster.fit) - When Global modeling with local normalization, metrics are displayed in normalized scale. 2024-05-27 12:39:19,705 - WARNING - When Global modeling with local normalization, metrics are displayed in normalized scale. INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.983% of the data. 2024-05-27 12:39:19,713 - INFO - Major frequency ns corresponds to 99.983% of the data. INFO - (NP.df_utils._infer_frequency) - Dataframe freq automatically defined as ns 2024-05-27 12:39:19,714 - INFO - Dataframe freq automatically defined as ns INFO - (NP.config.init_data_params) - Setting normalization to global as only one dataframe provided for training. 2024-05-27 12:39:19,718 - INFO - Setting normalization to global as only one dataframe provided for training. INFO - (NP.config.set_auto_batch_epoch) - Auto-set batch_size to 64 2024-05-27 12:39:19,922 - INFO - Auto-set batch_size to 64 INFO - (NP.config.set_auto_batch_epoch) - Auto-set epochs to 70 2024-05-27 12:39:19,922 - INFO - Auto-set epochs to 70 WARNING - (NP.config.set_lr_finder_args) - Learning rate finder: The number of batches (90) is too small than the required number for the learning rate finder (244). The results might not be optimal. 2024-05-27 12:39:19,934 - WARNING - Learning rate finder: The number of batches (90) is too small than the required number for the learning rate finder (244). The results might not be optimal.
INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.965% of the data. 2024-05-27 12:39:31,841 - INFO - Major frequency ns corresponds to 99.965% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 12:39:31,842 - INFO - Defined frequency is equal to major frequency - ns WARNING - (NP.data.splitting._make_future_dataframe) - Number of forecast steps is defined by n_forecasts. Adjusted to 24. 2024-05-27 12:39:31,847 - WARNING - Number of forecast steps is defined by n_forecasts. Adjusted to 24. INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column 2024-05-27 12:39:31,849 - INFO - Returning df with no ID column INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.966% of the data. 2024-05-27 12:39:31,852 - INFO - Major frequency ns corresponds to 99.966% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 12:39:31,853 - INFO - Defined frequency is equal to major frequency - ns INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.966% of the data. 2024-05-27 12:39:31,857 - INFO - Major frequency ns corresponds to 99.966% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 12:39:31,858 - INFO - Defined frequency is equal to major frequency - ns INFO - (NP.data.processing._handle_missing_data) - Dropped 24 rows at the end with NaNs in 'y' column. 2024-05-27 12:39:31,865 - INFO - Dropped 24 rows at the end with NaNs in 'y' column.
INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column 2024-05-27 12:39:31,926 - INFO - Returning df with no ID column 2024-05-27 12:39:31,928 - WARNING - NaN values found in predictions for fold 2. Replacing NaNs with mean value. 2024-05-27 12:39:31,930 - INFO - Processing fold 3/5 for NeuralProphet 2024-05-27 12:39:31,934 - INFO - NeuralProphet model generated for 168 periods. 2024-05-27 12:39:31,934 - INFO - NeuralProphet model generated for 2895 periods. WARNING - (NP.forecaster.fit) - When Global modeling with local normalization, metrics are displayed in normalized scale. 2024-05-27 12:39:31,936 - WARNING - When Global modeling with local normalization, metrics are displayed in normalized scale. INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.988% of the data. 2024-05-27 12:39:31,945 - INFO - Major frequency ns corresponds to 99.988% of the data. INFO - (NP.df_utils._infer_frequency) - Dataframe freq automatically defined as ns 2024-05-27 12:39:31,945 - INFO - Dataframe freq automatically defined as ns INFO - (NP.config.init_data_params) - Setting normalization to global as only one dataframe provided for training. 2024-05-27 12:39:31,985 - INFO - Setting normalization to global as only one dataframe provided for training. INFO - (NP.config.set_auto_batch_epoch) - Auto-set batch_size to 64 2024-05-27 12:39:32,059 - INFO - Auto-set batch_size to 64 INFO - (NP.config.set_auto_batch_epoch) - Auto-set epochs to 60 2024-05-27 12:39:32,060 - INFO - Auto-set epochs to 60 WARNING - (NP.config.set_lr_finder_args) - Learning rate finder: The number of batches (136) is too small than the required number for the learning rate finder (248). The results might not be optimal. 2024-05-27 12:39:32,072 - WARNING - Learning rate finder: The number of batches (136) is too small than the required number for the learning rate finder (248). The results might not be optimal.
INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.965% of the data. 2024-05-27 12:39:47,082 - INFO - Major frequency ns corresponds to 99.965% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 12:39:47,083 - INFO - Defined frequency is equal to major frequency - ns WARNING - (NP.data.splitting._make_future_dataframe) - Number of forecast steps is defined by n_forecasts. Adjusted to 24. 2024-05-27 12:39:47,087 - WARNING - Number of forecast steps is defined by n_forecasts. Adjusted to 24. INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column 2024-05-27 12:39:47,089 - INFO - Returning df with no ID column INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.966% of the data. 2024-05-27 12:39:47,093 - INFO - Major frequency ns corresponds to 99.966% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 12:39:47,094 - INFO - Defined frequency is equal to major frequency - ns INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.966% of the data. 2024-05-27 12:39:47,099 - INFO - Major frequency ns corresponds to 99.966% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 12:39:47,100 - INFO - Defined frequency is equal to major frequency - ns INFO - (NP.data.processing._handle_missing_data) - Dropped 24 rows at the end with NaNs in 'y' column. 2024-05-27 12:39:47,107 - INFO - Dropped 24 rows at the end with NaNs in 'y' column.
INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column 2024-05-27 12:39:47,172 - INFO - Returning df with no ID column 2024-05-27 12:39:47,173 - WARNING - NaN values found in predictions for fold 3. Replacing NaNs with mean value. 2024-05-27 12:39:47,175 - INFO - Processing fold 4/5 for NeuralProphet 2024-05-27 12:39:47,179 - INFO - NeuralProphet model generated for 168 periods. 2024-05-27 12:39:47,179 - INFO - NeuralProphet model generated for 2895 periods. WARNING - (NP.forecaster.fit) - When Global modeling with local normalization, metrics are displayed in normalized scale. 2024-05-27 12:39:47,182 - WARNING - When Global modeling with local normalization, metrics are displayed in normalized scale. INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.991% of the data. 2024-05-27 12:39:47,192 - INFO - Major frequency ns corresponds to 99.991% of the data. INFO - (NP.df_utils._infer_frequency) - Dataframe freq automatically defined as ns 2024-05-27 12:39:47,192 - INFO - Dataframe freq automatically defined as ns INFO - (NP.config.init_data_params) - Setting normalization to global as only one dataframe provided for training. 2024-05-27 12:39:47,197 - INFO - Setting normalization to global as only one dataframe provided for training. INFO - (NP.config.set_auto_batch_epoch) - Auto-set batch_size to 128 2024-05-27 12:39:47,448 - INFO - Auto-set batch_size to 128 INFO - (NP.config.set_auto_batch_epoch) - Auto-set epochs to 50 2024-05-27 12:39:47,448 - INFO - Auto-set epochs to 50 WARNING - (NP.config.set_lr_finder_args) - Learning rate finder: The number of batches (91) is too small than the required number for the learning rate finder (251). The results might not be optimal. 2024-05-27 12:39:47,459 - WARNING - Learning rate finder: The number of batches (91) is too small than the required number for the learning rate finder (251). The results might not be optimal.
INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.965% of the data. 2024-05-27 12:39:58,133 - INFO - Major frequency ns corresponds to 99.965% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 12:39:58,135 - INFO - Defined frequency is equal to major frequency - ns WARNING - (NP.data.splitting._make_future_dataframe) - Number of forecast steps is defined by n_forecasts. Adjusted to 24. 2024-05-27 12:39:58,139 - WARNING - Number of forecast steps is defined by n_forecasts. Adjusted to 24. INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column 2024-05-27 12:39:58,140 - INFO - Returning df with no ID column INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.966% of the data. 2024-05-27 12:39:58,144 - INFO - Major frequency ns corresponds to 99.966% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 12:39:58,145 - INFO - Defined frequency is equal to major frequency - ns INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.966% of the data. 2024-05-27 12:39:58,149 - INFO - Major frequency ns corresponds to 99.966% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 12:39:58,151 - INFO - Defined frequency is equal to major frequency - ns INFO - (NP.data.processing._handle_missing_data) - Dropped 24 rows at the end with NaNs in 'y' column. 2024-05-27 12:39:58,157 - INFO - Dropped 24 rows at the end with NaNs in 'y' column.
INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column 2024-05-27 12:39:58,225 - INFO - Returning df with no ID column 2024-05-27 12:39:58,226 - WARNING - NaN values found in predictions for fold 4. Replacing NaNs with mean value. 2024-05-27 12:39:58,228 - INFO - Processing fold 5/5 for NeuralProphet 2024-05-27 12:39:58,231 - INFO - NeuralProphet model generated for 168 periods. 2024-05-27 12:39:58,232 - INFO - NeuralProphet model generated for 2895 periods. WARNING - (NP.forecaster.fit) - When Global modeling with local normalization, metrics are displayed in normalized scale. 2024-05-27 12:39:58,235 - WARNING - When Global modeling with local normalization, metrics are displayed in normalized scale. INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.993% of the data. 2024-05-27 12:39:58,246 - INFO - Major frequency ns corresponds to 99.993% of the data. INFO - (NP.df_utils._infer_frequency) - Dataframe freq automatically defined as ns 2024-05-27 12:39:58,247 - INFO - Dataframe freq automatically defined as ns INFO - (NP.config.init_data_params) - Setting normalization to global as only one dataframe provided for training. 2024-05-27 12:39:58,253 - INFO - Setting normalization to global as only one dataframe provided for training. INFO - (NP.config.set_auto_batch_epoch) - Auto-set batch_size to 128 2024-05-27 12:39:58,360 - INFO - Auto-set batch_size to 128 INFO - (NP.config.set_auto_batch_epoch) - Auto-set epochs to 50 2024-05-27 12:39:58,361 - INFO - Auto-set epochs to 50 WARNING - (NP.config.set_lr_finder_args) - Learning rate finder: The number of batches (113) is too small than the required number for the learning rate finder (254). The results might not be optimal. 2024-05-27 12:39:58,372 - WARNING - Learning rate finder: The number of batches (113) is too small than the required number for the learning rate finder (254). The results might not be optimal.
INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.965% of the data. 2024-05-27 12:40:12,117 - INFO - Major frequency ns corresponds to 99.965% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 12:40:12,119 - INFO - Defined frequency is equal to major frequency - ns WARNING - (NP.data.splitting._make_future_dataframe) - Number of forecast steps is defined by n_forecasts. Adjusted to 24. 2024-05-27 12:40:12,123 - WARNING - Number of forecast steps is defined by n_forecasts. Adjusted to 24. INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column 2024-05-27 12:40:12,125 - INFO - Returning df with no ID column INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.966% of the data. 2024-05-27 12:40:12,129 - INFO - Major frequency ns corresponds to 99.966% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 12:40:12,130 - INFO - Defined frequency is equal to major frequency - ns INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.966% of the data. 2024-05-27 12:40:12,134 - INFO - Major frequency ns corresponds to 99.966% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 12:40:12,135 - INFO - Defined frequency is equal to major frequency - ns INFO - (NP.data.processing._handle_missing_data) - Dropped 24 rows at the end with NaNs in 'y' column. 2024-05-27 12:40:12,141 - INFO - Dropped 24 rows at the end with NaNs in 'y' column.
INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column 2024-05-27 12:40:12,215 - INFO - Returning df with no ID column 2024-05-27 12:40:12,216 - WARNING - NaN values found in predictions for fold 5. Replacing NaNs with mean value. 2024-05-27 12:40:12,219 - INFO - Adaptive cross-validation scores for NeuralProphet: {'RMSE': [4.16604803159401, 4.500280205110116, 6.4352090651101115, 5.70500630167239, 4.4949376841003925], 'MSE': [17.355956201548327, 20.252521924505945, 41.411915711675356, 32.54709690212169, 20.204464783945802], 'MAE': [2.9942574291635538, 3.486873194431531, 3.8383841676992043, 4.334763471222363, 3.420706232715491], 'MAPE': [inf, inf, inf, inf, inf], 'R2': [0.013630367450434333, -0.06469559444406747, -0.025728150023501373, 0.1333240036434047, -0.0633986446180681], 'Adjusted R2': [0.01021022309346642, -0.06838732674103021, -0.02928476635506705, 0.1303188857642209, -0.06708587986292969]} 2024-05-27 12:40:12,219 - INFO - Saving model NeuralProphet for site PM2.5 and pollutant Penrose_PM2.5 2024-05-27 12:40:14,970 - INFO - Model saved at data/models/PM2.5_Penrose_PM2.5_NeuralProphet.joblib 2024-05-27 12:40:14,971 - INFO - Processing fold 1/5 for LinearRegression 2024-05-27 12:40:15,041 - INFO - Model LinearRegression fitted successfully 2024-05-27 12:40:15,062 - INFO - Processing fold 2/5 for LinearRegression 2024-05-27 12:40:15,127 - INFO - Model LinearRegression fitted successfully 2024-05-27 12:40:15,129 - INFO - Processing fold 3/5 for LinearRegression 2024-05-27 12:40:15,248 - INFO - Model LinearRegression fitted successfully 2024-05-27 12:40:15,264 - INFO - Processing fold 4/5 for LinearRegression 2024-05-27 12:40:15,297 - INFO - Model LinearRegression fitted successfully 2024-05-27 12:40:15,299 - INFO - Processing fold 5/5 for LinearRegression 2024-05-27 12:40:15,348 - INFO - Model LinearRegression fitted successfully 2024-05-27 12:40:15,352 - INFO - Adaptive cross-validation scores for LinearRegression: {'RMSE': 
[4.264941713124821, 4.0631139202032225, 5.86860206164965, 5.685358901744446, 4.450470382359049], 'MSE': [18.189727816352082, 16.508894728549198, 34.440490157998525, 32.32330584164481, 19.806686624255097], 'MAE': [3.0138510243440586, 2.9909757858639225, 3.5512434174468686, 4.120879206414382, 3.2510609314735537], 'MAPE': [inf, inf, inf, inf, inf], 'R2': [-0.03375434542703126, 0.13211067973916013, 0.1469464851222343, 0.13928319382548993, -0.04246283857732247], 'Adjusted R2': [-0.03733879183974631, 0.12910135477292972, 0.14398860192224205, 0.13629873888036326, -0.04607748087474728]} 2024-05-27 12:40:15,354 - INFO - Saving model LinearRegression for site PM2.5 and pollutant Penrose_PM2.5 2024-05-27 12:40:15,356 - INFO - Model saved at data/models/PM2.5_Penrose_PM2.5_LinearRegression.joblib 2024-05-27 12:40:15,363 - INFO - Processing fold 1/5 for Ridge 2024-05-27 12:40:15,382 - INFO - Model Ridge fitted successfully 2024-05-27 12:40:15,382 - INFO - Processing fold 2/5 for Ridge 2024-05-27 12:40:15,404 - INFO - Model Ridge fitted successfully 2024-05-27 12:40:15,405 - INFO - Processing fold 3/5 for Ridge 2024-05-27 12:40:15,421 - INFO - Model Ridge fitted successfully 2024-05-27 12:40:15,421 - INFO - Processing fold 4/5 for Ridge 2024-05-27 12:40:15,448 - INFO - Model Ridge fitted successfully 2024-05-27 12:40:15,453 - INFO - Processing fold 5/5 for Ridge 2024-05-27 12:40:15,485 - INFO - Model Ridge fitted successfully 2024-05-27 12:40:15,493 - INFO - Adaptive cross-validation scores for Ridge: {'RMSE': [4.264655804709437, 4.054815405258433, 5.868777222116133, 5.685426264308851, 4.450499130796663], 'MSE': [18.187289132641894, 16.441527970721108, 34.44254608282916, 32.324071806892896, 19.806942513221852], 'MAE': [3.0136303657948393, 2.9880968774821044, 3.551381793007182, 4.1209634448047066, 3.251052777174162], 'MAPE': [inf, inf, inf, inf, inf], 'R2': [-0.03361575072633083, 0.13565221844426145, 0.14689556209834442, 0.13926279742280734, -0.042476306490644156], 'Adjusted R2': 
[-0.03719971657489651, 0.13265517343193234, 0.14393750232753422, 0.1362782717550639, -0.046090995486797626]} 2024-05-27 12:40:15,499 - INFO - Saving model Ridge for site PM2.5 and pollutant Penrose_PM2.5 2024-05-27 12:40:15,504 - INFO - Model saved at data/models/PM2.5_Penrose_PM2.5_Ridge.joblib 2024-05-27 12:40:15,507 - INFO - Processing fold 1/5 for Lasso 2024-05-27 12:40:15,824 - INFO - Model Lasso fitted successfully 2024-05-27 12:40:15,825 - INFO - Processing fold 2/5 for Lasso 2024-05-27 12:40:16,152 - INFO - Model Lasso fitted successfully 2024-05-27 12:40:16,153 - INFO - Processing fold 3/5 for Lasso 2024-05-27 12:40:16,564 - INFO - Model Lasso fitted successfully 2024-05-27 12:40:16,567 - INFO - Processing fold 4/5 for Lasso 2024-05-27 12:40:16,909 - INFO - Model Lasso fitted successfully 2024-05-27 12:40:16,920 - INFO - Processing fold 5/5 for Lasso 2024-05-27 12:40:17,258 - INFO - Model Lasso fitted successfully 2024-05-27 12:40:17,266 - INFO - Adaptive cross-validation scores for Lasso: {'RMSE': [4.238445765560864, 4.031302612994142, 5.886924434281475, 5.658263924292697, 4.412605692378953], 'MSE': [17.96442250760082, 16.251400757533393, 34.65587929494026, 32.01595063695219, 19.471088996415144], 'MAE': [2.962093048769905, 2.9703235432265123, 3.5405717893306927, 4.127870572663973, 3.2331815772055603], 'MAPE': [inf, inf, inf, inf, inf], 'R2': [-0.020949846958394636, 0.1456473986504334, 0.14161153026265694, 0.14746756059292954, -0.024799709838291717], 'Adjusted R2': [-0.024489894971426507, 0.14268501098972064, 0.13863514860614745, 0.1445114841733488, -0.028353106890435686]} 2024-05-27 12:40:17,268 - INFO - Saving model Lasso for site PM2.5 and pollutant Penrose_PM2.5 2024-05-27 12:40:17,277 - INFO - Model saved at data/models/PM2.5_Penrose_PM2.5_Lasso.joblib 2024-05-27 12:40:17,283 - INFO - Processing fold 1/5 for RandomForest 2024-05-27 12:41:09,891 - INFO - Model RandomForest fitted successfully 2024-05-27 12:41:09,891 - INFO - Processing fold 2/5 for 
RandomForest 2024-05-27 12:42:04,733 - INFO - Model RandomForest fitted successfully 2024-05-27 12:42:04,733 - INFO - Processing fold 3/5 for RandomForest 2024-05-27 12:43:02,054 - INFO - Model RandomForest fitted successfully 2024-05-27 12:43:02,054 - INFO - Processing fold 4/5 for RandomForest 2024-05-27 12:44:02,178 - INFO - Model RandomForest fitted successfully 2024-05-27 12:44:02,179 - INFO - Processing fold 5/5 for RandomForest 2024-05-27 12:45:03,990 - INFO - Model RandomForest fitted successfully 2024-05-27 12:45:03,991 - INFO - Adaptive cross-validation scores for RandomForest: {'RMSE': [4.106881912360715, 4.068852068247997, 5.863189715556987, 5.7531665591526275, 4.28344698000073], 'MSE': [16.866479042075603, 16.555557153286003, 34.37699364061322, 33.09892545735208, 18.347918030477373], 'MAE': [2.9045941127300763, 3.037117105859453, 3.555912972398239, 4.235583074653316, 3.218200652760503], 'MAPE': [inf, inf, inf, inf, inf], 'R2': [0.041448218586025765, 0.12965758879924927, 0.14851922485646973, 0.11862971111216691, 0.034314871796240154], 'Adjusted R2': [0.038124530023563974, 0.12663975796984306, 0.14556679498426617, 0.11557364214931032, 0.03096644902160861]} 2024-05-27 12:45:03,992 - INFO - Saving model RandomForest for site PM2.5 and pollutant Penrose_PM2.5 2024-05-27 12:45:04,018 - INFO - Model saved at data/models/PM2.5_Penrose_PM2.5_RandomForest.joblib 2024-05-27 12:45:04,019 - INFO - Processing fold 1/5 for SVR 2024-05-27 12:49:39,961 - INFO - Model SVR fitted successfully 2024-05-27 12:49:39,961 - INFO - Processing fold 2/5 for SVR 2024-05-27 12:54:30,754 - INFO - Model SVR fitted successfully 2024-05-27 12:54:30,754 - INFO - Processing fold 3/5 for SVR 2024-05-27 13:05:40,052 - INFO - Model SVR fitted successfully 2024-05-27 13:05:40,053 - INFO - Processing fold 4/5 for SVR 2024-05-27 13:29:53,244 - INFO - Model SVR fitted successfully 2024-05-27 13:29:53,245 - INFO - Processing fold 5/5 for SVR 2024-05-27 13:55:14,810 - INFO - Model SVR fitted 
successfully 2024-05-27 13:55:14,811 - INFO - Adaptive cross-validation scores for SVR: {'RMSE': [4.07057649944199, 4.086977172481168, 6.067353167722934, 5.805619192351314, 4.196858353896223], 'MSE': [16.569593037809405, 16.70338240838216, 36.812774461877524, 33.70521420659792, 17.61362004266852], 'MAE': [2.926687225727233, 3.0006880678252124, 3.6547611432751586, 4.206975782252438, 3.0798758357821736], 'MAPE': [inf, inf, inf, inf, inf], 'R2': [0.05832077435515337, 0.12188626538406455, 0.08818758086653089, 0.10248523262265286, 0.07296234369572674], 'Adjusted R2': [0.055055589800212856, 0.11884148821826723, 0.08502595666703894, 0.09937318419207952, 0.0697479274117313]} 2024-05-27 13:55:14,811 - INFO - Saving model SVR for site PM2.5 and pollutant Penrose_PM2.5 2024-05-27 13:55:14,823 - INFO - Model saved at data/models/PM2.5_Penrose_PM2.5_SVR.joblib 2024-05-27 13:55:14,826 - INFO - Processing fold 1/5 for XGBoost 2024-05-27 13:55:15,483 - INFO - Model XGBoost fitted successfully 2024-05-27 13:55:15,483 - INFO - Processing fold 2/5 for XGBoost 2024-05-27 13:55:16,177 - INFO - Model XGBoost fitted successfully 2024-05-27 13:55:16,178 - INFO - Processing fold 3/5 for XGBoost 2024-05-27 13:55:17,458 - INFO - Model XGBoost fitted successfully 2024-05-27 13:55:17,459 - INFO - Processing fold 4/5 for XGBoost 2024-05-27 13:55:18,915 - INFO - Model XGBoost fitted successfully 2024-05-27 13:55:18,916 - INFO - Processing fold 5/5 for XGBoost 2024-05-27 13:55:20,337 - INFO - Model XGBoost fitted successfully 2024-05-27 13:55:20,338 - INFO - Adaptive cross-validation scores for XGBoost: {'RMSE': [4.24115952200403, 4.1478764236556485, 5.93354484512397, 5.769046787121375, 4.24800992183389], 'MSE': [17.98743409108545, 17.20487882591837, 35.20695442909724, 33.28190083199545, 18.04558829599917], 'MAE': [3.0095871698414762, 3.1000705558250328, 3.6296526451890987, 4.236839074147162, 3.171607509025844], 'MAPE': [inf, inf, inf, inf, inf], 'R2': [-0.022257635874346038, 0.09552209067186723, 
0.12796199804062025, 0.11375737593570967, 0.05022704929313282], 'Adjusted R2': [-0.025802218523008902, 0.09238589819846876, 0.12493828790899952, 0.11068441260677664, 0.04693380050427398]} 2024-05-27 13:55:20,339 - INFO - Saving model XGBoost for site PM2.5 and pollutant Penrose_PM2.5 2024-05-27 13:55:20,353 - INFO - Model saved at data/models/PM2.5_Penrose_PM2.5_XGBoost.joblib 2024-05-27 13:55:20,355 - INFO - The best model based on average RMSE across all folds for Penrose_PM2.5: RandomForest 2024-05-27 13:55:20,355 - INFO - Markdown Table for Penrose_PM2.5:

| Target | Model | Metric | Fold1 | Fold2 | Fold3 | Fold4 | Fold5 | Training Time |
|---|---|---|---|---|---|---|---|---|
| Penrose_PM2.5 | ARIMA | RMSE | 4.23 | 4.59 | 7.32 | 6.56 | 4.39 | 131.61 |
| Penrose_PM2.5 | ARIMA | MSE | 17.91 | 21.11 | 53.52 | 43.04 | 19.27 | 131.61 |
| Penrose_PM2.5 | ARIMA | MAE | 3.17 | 3.31 | 5.49 | 4.74 | 3.26 | 131.61 |
| Penrose_PM2.5 | ARIMA | MAPE | inf | inf | inf | inf | inf | 131.61 |
| Penrose_PM2.5 | ARIMA | R2 | -0.02 | -0.11 | -0.33 | -0.15 | -0.01 | 131.61 |
| Penrose_PM2.5 | ARIMA | Adjusted R2 | -0.02 | -0.11 | -0.33 | -0.15 | -0.02 | 131.61 |
| Penrose_PM2.5 | Prophet | RMSE | 4.46 | 4.71 | 6.40 | 6.14 | 4.71 | 7.97 |
| Penrose_PM2.5 | Prophet | MSE | 19.85 | 22.21 | 41.01 | 37.72 | 22.16 | 7.97 |
| Penrose_PM2.5 | Prophet | MAE | 3.49 | 3.77 | 4.03 | 4.66 | 3.74 | 7.97 |
| Penrose_PM2.5 | Prophet | MAPE | inf | inf | inf | inf | inf | 7.97 |
| Penrose_PM2.5 | Prophet | R2 | -0.13 | -0.17 | -0.02 | -0.00 | -0.17 | 7.97 |
| Penrose_PM2.5 | Prophet | Adjusted R2 | -0.13 | -0.17 | -0.02 | -0.01 | -0.17 | 7.97 |
| Penrose_PM2.5 | NeuralProphet | RMSE | 4.17 | 4.50 | 6.44 | 5.71 | 4.49 | 62.77 |
| Penrose_PM2.5 | NeuralProphet | MSE | 17.36 | 20.25 | 41.41 | 32.55 | 20.20 | 62.77 |
| Penrose_PM2.5 | NeuralProphet | MAE | 2.99 | 3.49 | 3.84 | 4.33 | 3.42 | 62.77 |
| Penrose_PM2.5 | NeuralProphet | MAPE | inf | inf | inf | inf | inf | 62.77 |
| Penrose_PM2.5 | NeuralProphet | R2 | 0.01 | -0.06 | -0.03 | 0.13 | -0.06 | 62.77 |
| Penrose_PM2.5 | NeuralProphet | Adjusted R2 | 0.01 | -0.07 | -0.03 | 0.13 | -0.07 | 62.77 |
| Penrose_PM2.5 | LinearRegression | RMSE | 4.26 | 4.06 | 5.87 | 5.69 | 4.45 | 0.39 |
| Penrose_PM2.5 | LinearRegression | MSE | 18.19 | 16.51 | 34.44 | 32.32 | 19.81 | 0.39 |
| Penrose_PM2.5 | LinearRegression | MAE | 3.01 | 2.99 | 3.55 | 4.12 | 3.25 | 0.39 |
| Penrose_PM2.5 | LinearRegression | MAPE | inf | inf | inf | inf | inf | 0.39 |
| Penrose_PM2.5 | LinearRegression | R2 | -0.03 | 0.13 | 0.15 | 0.14 | -0.04 | 0.39 |
| Penrose_PM2.5 | LinearRegression | Adjusted R2 | -0.04 | 0.13 | 0.14 | 0.14 | -0.05 | 0.39 |
| Penrose_PM2.5 | Ridge | RMSE | 4.26 | 4.05 | 5.87 | 5.69 | 4.45 | 0.14 |
| Penrose_PM2.5 | Ridge | MSE | 18.19 | 16.44 | 34.44 | 32.32 | 19.81 | 0.14 |
| Penrose_PM2.5 | Ridge | MAE | 3.01 | 2.99 | 3.55 | 4.12 | 3.25 | 0.14 |
| Penrose_PM2.5 | Ridge | MAPE | inf | inf | inf | inf | inf | 0.14 |
| Penrose_PM2.5 | Ridge | R2 | -0.03 | 0.14 | 0.15 | 0.14 | -0.04 | 0.14 |
| Penrose_PM2.5 | Ridge | Adjusted R2 | -0.04 | 0.13 | 0.14 | 0.14 | -0.05 | 0.14 |
| Penrose_PM2.5 | Lasso | RMSE | 4.24 | 4.03 | 5.89 | 5.66 | 4.41 | 1.78 |
| Penrose_PM2.5 | Lasso | MSE | 17.96 | 16.25 | 34.66 | 32.02 | 19.47 | 1.78 |
| Penrose_PM2.5 | Lasso | MAE | 2.96 | 2.97 | 3.54 | 4.13 | 3.23 | 1.78 |
| Penrose_PM2.5 | Lasso | MAPE | inf | inf | inf | inf | inf | 1.78 |
| Penrose_PM2.5 | Lasso | R2 | -0.02 | 0.15 | 0.14 | 0.15 | -0.02 | 1.78 |
| Penrose_PM2.5 | Lasso | Adjusted R2 | -0.02 | 0.14 | 0.14 | 0.14 | -0.03 | 1.78 |
| Penrose_PM2.5 | RandomForest | RMSE | 4.11 | 4.07 | 5.86 | 5.75 | 4.28 | 286.74 |
| Penrose_PM2.5 | RandomForest | MSE | 16.87 | 16.56 | 34.38 | 33.10 | 18.35 | 286.74 |
| Penrose_PM2.5 | RandomForest | MAE | 2.90 | 3.04 | 3.56 | 4.24 | 3.22 | 286.74 |
| Penrose_PM2.5 | RandomForest | MAPE | inf | inf | inf | inf | inf | 286.74 |
| Penrose_PM2.5 | RandomForest | R2 | 0.04 | 0.13 | 0.15 | 0.12 | 0.03 | 286.74 |
| Penrose_PM2.5 | RandomForest | Adjusted R2 | 0.04 | 0.13 | 0.15 | 0.12 | 0.03 | 286.74 |
| Penrose_PM2.5 | SVR | RMSE | 4.07 | 4.09 | 6.07 | 5.81 | 4.20 | 4210.81 |
| Penrose_PM2.5 | SVR | MSE | 16.57 | 16.70 | 36.81 | 33.71 | 17.61 | 4210.81 |
| Penrose_PM2.5 | SVR | MAE | 2.93 | 3.00 | 3.65 | 4.21 | 3.08 | 4210.81 |
| Penrose_PM2.5 | SVR | MAPE | inf | inf | inf | inf | inf | 4210.81 |
| Penrose_PM2.5 | SVR | R2 | 0.06 | 0.12 | 0.09 | 0.10 | 0.07 | 4210.81 |
| Penrose_PM2.5 | SVR | Adjusted R2 | 0.06 | 0.12 | 0.09 | 0.10 | 0.07 | 4210.81 |
| Penrose_PM2.5 | XGBoost | RMSE | 4.24 | 4.15 | 5.93 | 5.77 | 4.25 | 5.53 |
| Penrose_PM2.5 | XGBoost | MSE | 17.99 | 17.20 | 35.21 | 33.28 | 18.05 | 5.53 |
| Penrose_PM2.5 | XGBoost | MAE | 3.01 | 3.10 | 3.63 | 4.24 | 3.17 | 5.53 |
| Penrose_PM2.5 | XGBoost | MAPE | inf | inf | inf | inf | inf | 5.53 |
| Penrose_PM2.5 | XGBoost | R2 | -0.02 | 0.10 | 0.13 | 0.11 | 0.05 | 5.53 |
| Penrose_PM2.5 | XGBoost | Adjusted R2 | -0.03 | 0.09 | 0.12 | 0.11 | 0.05 | 5.53 |

2024-05-27 13:55:20,356 - INFO - Saving model RandomForest for site Penrose and pollutant PM2.5 2024-05-27 13:55:20,387 - INFO - Model saved at data/models/Penrose_PM2.5_RandomForest.joblib 2024-05-27 13:55:20,387 - INFO - 🛠️ Training models for Takapuna_PM2.5 ... 
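🎓 The Penrose_PM2.5 log above reports RandomForest as the best model "based on average RMSE across all folds". The notebook's own selection code is not shown in this output, so the following is only a minimal sketch of how such a choice could be reproduced. The helper name `best_model_by_mean_rmse` is hypothetical, and the per-fold RMSE values are the two-decimal figures from the Penrose_PM2.5 table rather than the full-precision scores.

```python
from statistics import mean

# Per-fold RMSE for Penrose_PM2.5, copied from the logged table (rounded to 2 decimals).
fold_rmse = {
    "ARIMA": [4.23, 4.59, 7.32, 6.56, 4.39],
    "Prophet": [4.46, 4.71, 6.40, 6.14, 4.71],
    "NeuralProphet": [4.17, 4.50, 6.44, 5.71, 4.49],
    "LinearRegression": [4.26, 4.06, 5.87, 5.69, 4.45],
    "Ridge": [4.26, 4.05, 5.87, 5.69, 4.45],
    "Lasso": [4.24, 4.03, 5.89, 5.66, 4.41],
    "RandomForest": [4.11, 4.07, 5.86, 5.75, 4.28],
    "SVR": [4.07, 4.09, 6.07, 5.81, 4.20],
    "XGBoost": [4.24, 4.15, 5.93, 5.77, 4.25],
}


def best_model_by_mean_rmse(scores):
    """Hypothetical helper: return (model_name, mean_rmse) with the lowest mean RMSE."""
    means = {name: mean(values) for name, values in scores.items()}
    best = min(means, key=means.get)
    return best, means[best]


best_name, best_rmse = best_model_by_mean_rmse(fold_rmse)
print(f"Best model by mean RMSE: {best_name} ({best_rmse:.3f})")
```

On these rounded values the sketch also selects RandomForest, but the margin over Lasso and SVR is only a few hundredths of an RMSE unit, while the logged training times differ by orders of magnitude (286.74 for RandomForest versus 1.78 for Lasso), so training cost may be worth weighing alongside the error metric.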
2024-05-27 13:55:20,388 - INFO - Selected ARIMA model for Takapuna_PM2.5: ARIMA(0, 1, 2) 2024-05-27 13:55:20,396 - INFO - Processing fold 1/5 for ARIMA 2024-05-27 13:55:22,985 - INFO - Model ARIMA fitted successfully 2024-05-27 13:55:23,004 - INFO - Processing fold 2/5 for ARIMA 2024-05-27 13:55:25,851 - INFO - Model ARIMA fitted successfully 2024-05-27 13:55:25,866 - INFO - Processing fold 3/5 for ARIMA 2024-05-27 13:55:29,191 - INFO - Model ARIMA fitted successfully 2024-05-27 13:55:29,212 - INFO - Processing fold 4/5 for ARIMA 2024-05-27 13:55:33,599 - INFO - Model ARIMA fitted successfully 2024-05-27 13:55:33,600 - INFO - Processing fold 5/5 for ARIMA 2024-05-27 13:55:38,055 - INFO - Model ARIMA fitted successfully 2024-05-27 13:55:38,076 - INFO - Adaptive cross-validation scores for ARIMA: {'RMSE': [5.113970838931715, 2.0434081974164413, 5.172576083461726, 2.9840609373997693, 2.40844279157668], 'MSE': [26.152697741443948, 4.17551706126871, 26.755543339200248, 8.904619678115191, 5.800596680297671], 'MAE': [4.548714184935662, 1.5071745450916292, 2.8306527943657493, 2.094165039837185, 1.7670581841048587], 'MAPE': [nan, nan, nan, nan, nan], 'R2': [-3.751206405915955, -0.3877170449402183, -0.045742530752174604, -0.0293823468814709, -0.4010793822978016], 'Adjusted R2': [-3.7676922019114993, -0.3925321630697818, -0.04937106139322989, -0.032954110749900734, -0.4059408652342964]} 2024-05-27 13:55:38,077 - INFO - Saving model ARIMA for site PM2.5 and pollutant Takapuna_PM2.5 2024-05-27 13:55:38,165 - INFO - Model saved at data/models/PM2.5_Takapuna_PM2.5_ARIMA.joblib 2024-05-27 13:55:38,169 - INFO - Processing fold 1/5 for Prophet 2024-05-27 13:55:38,196 - DEBUG - input tempfile: /var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/7ef594sr.json 2024-05-27 13:55:38,320 - DEBUG - input tempfile: /var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/0fr1yo9o.json 2024-05-27 13:55:38,321 - DEBUG - idx 0 2024-05-27 13:55:38,321 - DEBUG - running CmdStan, 
num_threads: None 2024-05-27 13:55:38,321 - DEBUG - CmdStan args: ['/Users/nnthanh/.pyenv/versions/3.11.7/lib/python3.11/site-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=92545', 'data', 'file=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/7ef594sr.json', 'init=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/0fr1yo9o.json', 'output', 'file=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/prophet_model0qsjgo09/prophet_model-20240527135538.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000'] 13:55:38 - cmdstanpy - INFO - Chain [1] start processing 2024-05-27 13:55:38,322 - INFO - Chain [1] start processing 13:55:38 - cmdstanpy - INFO - Chain [1] done processing 2024-05-27 13:55:38,512 - INFO - Chain [1] done processing 2024-05-27 13:55:38,986 - INFO - Processing fold 2/5 for Prophet 2024-05-27 13:55:39,013 - DEBUG - input tempfile: /var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/07yr8tfm.json 2024-05-27 13:55:39,161 - DEBUG - input tempfile: /var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/a__j9cqe.json 2024-05-27 13:55:39,162 - DEBUG - idx 0 2024-05-27 13:55:39,162 - DEBUG - running CmdStan, num_threads: None 2024-05-27 13:55:39,163 - DEBUG - CmdStan args: ['/Users/nnthanh/.pyenv/versions/3.11.7/lib/python3.11/site-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=70874', 'data', 'file=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/07yr8tfm.json', 'init=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/a__j9cqe.json', 'output', 'file=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/prophet_modelbjkmo_et/prophet_model-20240527135539.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000'] 13:55:39 - cmdstanpy - INFO - Chain [1] start processing 2024-05-27 13:55:39,163 - INFO - Chain [1] start processing 13:55:39 - cmdstanpy - INFO - Chain [1] done processing 2024-05-27 13:55:39,729 - INFO - Chain [1] done processing 2024-05-27 
13:55:40,185 - INFO - Processing fold 3/5 for Prophet 2024-05-27 13:55:40,223 - DEBUG - input tempfile: /var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/6jbfiob2.json 2024-05-27 13:55:40,445 - DEBUG - input tempfile: /var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/_tlvc4k0.json 2024-05-27 13:55:40,446 - DEBUG - idx 0 2024-05-27 13:55:40,447 - DEBUG - running CmdStan, num_threads: None 2024-05-27 13:55:40,447 - DEBUG - CmdStan args: ['/Users/nnthanh/.pyenv/versions/3.11.7/lib/python3.11/site-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=48110', 'data', 'file=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/6jbfiob2.json', 'init=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/_tlvc4k0.json', 'output', 'file=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/prophet_model2u4htzmd/prophet_model-20240527135540.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000'] 13:55:40 - cmdstanpy - INFO - Chain [1] start processing 2024-05-27 13:55:40,447 - INFO - Chain [1] start processing 13:55:41 - cmdstanpy - INFO - Chain [1] done processing 2024-05-27 13:55:41,929 - INFO - Chain [1] done processing 2024-05-27 13:55:42,382 - INFO - Processing fold 4/5 for Prophet 2024-05-27 13:55:42,595 - DEBUG - input tempfile: /var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/vruvzb0f.json 2024-05-27 13:55:42,888 - DEBUG - input tempfile: /var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/t0g17krs.json 2024-05-27 13:55:42,889 - DEBUG - idx 0 2024-05-27 13:55:42,889 - DEBUG - running CmdStan, num_threads: None 2024-05-27 13:55:42,889 - DEBUG - CmdStan args: ['/Users/nnthanh/.pyenv/versions/3.11.7/lib/python3.11/site-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=68186', 'data', 'file=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/vruvzb0f.json', 'init=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/t0g17krs.json', 'output', 
'file=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/prophet_model5m7y4vy6/prophet_model-20240527135542.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000'] 13:55:42 - cmdstanpy - INFO - Chain [1] start processing 2024-05-27 13:55:42,890 - INFO - Chain [1] start processing 13:55:44 - cmdstanpy - INFO - Chain [1] done processing 2024-05-27 13:55:44,966 - INFO - Chain [1] done processing 2024-05-27 13:55:45,428 - INFO - Processing fold 5/5 for Prophet 2024-05-27 13:55:45,479 - DEBUG - input tempfile: /var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/v_vio2p0.json 2024-05-27 13:55:45,844 - DEBUG - input tempfile: /var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/b_8_hzv1.json 2024-05-27 13:55:45,844 - DEBUG - idx 0 2024-05-27 13:55:45,845 - DEBUG - running CmdStan, num_threads: None 2024-05-27 13:55:45,845 - DEBUG - CmdStan args: ['/Users/nnthanh/.pyenv/versions/3.11.7/lib/python3.11/site-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=36455', 'data', 'file=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/v_vio2p0.json', 'init=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/b_8_hzv1.json', 'output', 'file=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/prophet_modelhi89o0sk/prophet_model-20240527135545.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000'] 13:55:45 - cmdstanpy - INFO - Chain [1] start processing 2024-05-27 13:55:45,845 - INFO - Chain [1] start processing 13:55:47 - cmdstanpy - INFO - Chain [1] done processing 2024-05-27 13:55:47,606 - INFO - Chain [1] done processing 2024-05-27 13:55:48,058 - INFO - Adaptive cross-validation scores for Prophet: {'RMSE': [3.771174779437052, 2.0099272345317454, 5.093479750364655, 3.0261651912476104, 3.5897676528769757], 'MSE': [14.221759217062097, 4.03980748811243, 25.943535967374785, 9.157675764718686, 12.886431801641873], 'MAE': [3.3416412621843605, 1.6633538179689926, 2.896152564801436, 2.322302780721624, 
3.1665684706081234], 'MAPE': [85.5330476169517, 43.71248987953567, 43.88515888076979, 47.872756403666, 79.0371386793694], 'R2': [-1.5836919067978719, -0.3426144899590853, -0.014005158304272625, -0.0586358667101452, -2.1125959799950462], 'Adjusted R2': [-1.5926568336084128, -0.3472731106737248, -0.017523566209561636, -0.06230913481115197, -2.1233961048388874]} 2024-05-27 13:55:48,059 - INFO - Saving model Prophet for site PM2.5 and pollutant Takapuna_PM2.5 2024-05-27 13:55:48,077 - INFO - Model saved at data/models/PM2.5_Takapuna_PM2.5_Prophet.joblib 2024-05-27 13:55:48,079 - INFO - Processing fold 1/5 for NeuralProphet 2024-05-27 13:55:48,082 - INFO - NeuralProphet model generated for 168 periods. 2024-05-27 13:55:48,082 - INFO - NeuralProphet model generated for 2893 periods. WARNING - (NP.forecaster.fit) - When Global modeling with local normalization, metrics are displayed in normalized scale. 2024-05-27 13:55:48,083 - WARNING - When Global modeling with local normalization, metrics are displayed in normalized scale. INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.965% of the data. 2024-05-27 13:55:48,088 - INFO - Major frequency ns corresponds to 99.965% of the data. INFO - (NP.df_utils._infer_frequency) - Dataframe freq automatically defined as ns 2024-05-27 13:55:48,089 - INFO - Dataframe freq automatically defined as ns INFO - (NP.config.init_data_params) - Setting normalization to global as only one dataframe provided for training. 2024-05-27 13:55:48,092 - INFO - Setting normalization to global as only one dataframe provided for training. 
INFO - (NP.config.set_auto_batch_epoch) - Auto-set batch_size to 64 2024-05-27 13:55:48,122 - INFO - Auto-set batch_size to 64 INFO - (NP.config.set_auto_batch_epoch) - Auto-set epochs to 80 2024-05-27 13:55:48,123 - INFO - Auto-set epochs to 80 WARNING - (NP.config.set_lr_finder_args) - Learning rate finder: The number of batches (45) is too small than the required number for the learning rate finder (236). The results might not be optimal. 2024-05-27 13:55:48,228 - WARNING - Learning rate finder: The number of batches (45) is too small than the required number for the learning rate finder (236). The results might not be optimal.
INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.965% of the data. 2024-05-27 13:55:55,158 - INFO - Major frequency ns corresponds to 99.965% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 13:55:55,159 - INFO - Defined frequency is equal to major frequency - ns WARNING - (NP.data.splitting._make_future_dataframe) - Number of forecast steps is defined by n_forecasts. Adjusted to 24. 2024-05-27 13:55:55,164 - WARNING - Number of forecast steps is defined by n_forecasts. Adjusted to 24. INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column 2024-05-27 13:55:55,165 - INFO - Returning df with no ID column INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.966% of the data. 2024-05-27 13:55:55,169 - INFO - Major frequency ns corresponds to 99.966% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 13:55:55,170 - INFO - Defined frequency is equal to major frequency - ns INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.966% of the data. 2024-05-27 13:55:55,174 - INFO - Major frequency ns corresponds to 99.966% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 13:55:55,175 - INFO - Defined frequency is equal to major frequency - ns INFO - (NP.data.processing._handle_missing_data) - Dropped 24 rows at the end with NaNs in 'y' column. 2024-05-27 13:55:55,181 - INFO - Dropped 24 rows at the end with NaNs in 'y' column.
INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column 2024-05-27 13:55:55,242 - INFO - Returning df with no ID column 2024-05-27 13:55:55,244 - WARNING - NaN values found in predictions for fold 1. Replacing NaNs with mean value. 2024-05-27 13:55:55,247 - INFO - Processing fold 2/5 for NeuralProphet 2024-05-27 13:55:55,249 - INFO - NeuralProphet model generated for 168 periods. 2024-05-27 13:55:55,249 - INFO - NeuralProphet model generated for 2893 periods. WARNING - (NP.forecaster.fit) - When Global modeling with local normalization, metrics are displayed in normalized scale. 2024-05-27 13:55:55,251 - WARNING - When Global modeling with local normalization, metrics are displayed in normalized scale. INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.983% of the data. 2024-05-27 13:55:55,257 - INFO - Major frequency ns corresponds to 99.983% of the data. INFO - (NP.df_utils._infer_frequency) - Dataframe freq automatically defined as ns 2024-05-27 13:55:55,258 - INFO - Dataframe freq automatically defined as ns INFO - (NP.config.init_data_params) - Setting normalization to global as only one dataframe provided for training. 2024-05-27 13:55:55,262 - INFO - Setting normalization to global as only one dataframe provided for training. INFO - (NP.config.set_auto_batch_epoch) - Auto-set batch_size to 64 2024-05-27 13:55:55,312 - INFO - Auto-set batch_size to 64 INFO - (NP.config.set_auto_batch_epoch) - Auto-set epochs to 70 2024-05-27 13:55:55,312 - INFO - Auto-set epochs to 70 WARNING - (NP.config.set_lr_finder_args) - Learning rate finder: The number of batches (90) is too small than the required number for the learning rate finder (244). The results might not be optimal. 2024-05-27 13:55:55,323 - WARNING - Learning rate finder: The number of batches (90) is too small than the required number for the learning rate finder (244). The results might not be optimal.
INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.965% of the data. 2024-05-27 13:56:06,896 - INFO - Major frequency ns corresponds to 99.965% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 13:56:06,897 - INFO - Defined frequency is equal to major frequency - ns WARNING - (NP.data.splitting._make_future_dataframe) - Number of forecast steps is defined by n_forecasts. Adjusted to 24. 2024-05-27 13:56:06,903 - WARNING - Number of forecast steps is defined by n_forecasts. Adjusted to 24. INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column 2024-05-27 13:56:06,904 - INFO - Returning df with no ID column INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.966% of the data. 2024-05-27 13:56:06,908 - INFO - Major frequency ns corresponds to 99.966% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 13:56:06,909 - INFO - Defined frequency is equal to major frequency - ns INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.966% of the data. 2024-05-27 13:56:06,914 - INFO - Major frequency ns corresponds to 99.966% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 13:56:06,915 - INFO - Defined frequency is equal to major frequency - ns INFO - (NP.data.processing._handle_missing_data) - Dropped 24 rows at the end with NaNs in 'y' column. 2024-05-27 13:56:06,921 - INFO - Dropped 24 rows at the end with NaNs in 'y' column.
INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column 2024-05-27 13:56:06,981 - INFO - Returning df with no ID column 2024-05-27 13:56:06,983 - WARNING - NaN values found in predictions for fold 2. Replacing NaNs with mean value. 2024-05-27 13:56:06,987 - INFO - Processing fold 3/5 for NeuralProphet 2024-05-27 13:56:06,990 - INFO - NeuralProphet model generated for 168 periods. 2024-05-27 13:56:06,990 - INFO - NeuralProphet model generated for 2893 periods. WARNING - (NP.forecaster.fit) - When Global modeling with local normalization, metrics are displayed in normalized scale. 2024-05-27 13:56:06,992 - WARNING - When Global modeling with local normalization, metrics are displayed in normalized scale. INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.988% of the data. 2024-05-27 13:56:07,001 - INFO - Major frequency ns corresponds to 99.988% of the data. INFO - (NP.df_utils._infer_frequency) - Dataframe freq automatically defined as ns 2024-05-27 13:56:07,001 - INFO - Dataframe freq automatically defined as ns INFO - (NP.config.init_data_params) - Setting normalization to global as only one dataframe provided for training. 2024-05-27 13:56:07,006 - INFO - Setting normalization to global as only one dataframe provided for training. INFO - (NP.config.set_auto_batch_epoch) - Auto-set batch_size to 64 2024-05-27 13:56:07,245 - INFO - Auto-set batch_size to 64 INFO - (NP.config.set_auto_batch_epoch) - Auto-set epochs to 60 2024-05-27 13:56:07,246 - INFO - Auto-set epochs to 60 WARNING - (NP.config.set_lr_finder_args) - Learning rate finder: The number of batches (135) is too small than the required number for the learning rate finder (248). The results might not be optimal. 2024-05-27 13:56:07,257 - WARNING - Learning rate finder: The number of batches (135) is too small than the required number for the learning rate finder (248). The results might not be optimal.
INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.965% of the data. 2024-05-27 13:56:21,968 - INFO - Major frequency ns corresponds to 99.965% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 13:56:21,969 - INFO - Defined frequency is equal to major frequency - ns WARNING - (NP.data.splitting._make_future_dataframe) - Number of forecast steps is defined by n_forecasts. Adjusted to 24. 2024-05-27 13:56:21,974 - WARNING - Number of forecast steps is defined by n_forecasts. Adjusted to 24. INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column 2024-05-27 13:56:21,976 - INFO - Returning df with no ID column INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.966% of the data. 2024-05-27 13:56:21,980 - INFO - Major frequency ns corresponds to 99.966% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 13:56:21,981 - INFO - Defined frequency is equal to major frequency - ns INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.966% of the data. 2024-05-27 13:56:21,985 - INFO - Major frequency ns corresponds to 99.966% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 13:56:21,987 - INFO - Defined frequency is equal to major frequency - ns INFO - (NP.data.processing._handle_missing_data) - Dropped 24 rows at the end with NaNs in 'y' column. 2024-05-27 13:56:21,993 - INFO - Dropped 24 rows at the end with NaNs in 'y' column.
INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column 2024-05-27 13:56:22,060 - INFO - Returning df with no ID column 2024-05-27 13:56:22,060 - WARNING - NaN values found in predictions for fold 3. Replacing NaNs with mean value. 2024-05-27 13:56:22,065 - INFO - Processing fold 4/5 for NeuralProphet 2024-05-27 13:56:22,069 - INFO - NeuralProphet model generated for 168 periods. 2024-05-27 13:56:22,070 - INFO - NeuralProphet model generated for 2893 periods. WARNING - (NP.forecaster.fit) - When Global modeling with local normalization, metrics are displayed in normalized scale. 2024-05-27 13:56:22,073 - WARNING - When Global modeling with local normalization, metrics are displayed in normalized scale. INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.991% of the data. 2024-05-27 13:56:22,082 - INFO - Major frequency ns corresponds to 99.991% of the data. INFO - (NP.df_utils._infer_frequency) - Dataframe freq automatically defined as ns 2024-05-27 13:56:22,083 - INFO - Dataframe freq automatically defined as ns INFO - (NP.config.init_data_params) - Setting normalization to global as only one dataframe provided for training. 2024-05-27 13:56:22,088 - INFO - Setting normalization to global as only one dataframe provided for training. INFO - (NP.config.set_auto_batch_epoch) - Auto-set batch_size to 128 2024-05-27 13:56:22,213 - INFO - Auto-set batch_size to 128 INFO - (NP.config.set_auto_batch_epoch) - Auto-set epochs to 50 2024-05-27 13:56:22,214 - INFO - Auto-set epochs to 50 WARNING - (NP.config.set_lr_finder_args) - Learning rate finder: The number of batches (91) is too small than the required number for the learning rate finder (251). The results might not be optimal. 2024-05-27 13:56:22,225 - WARNING - Learning rate finder: The number of batches (91) is too small than the required number for the learning rate finder (251). The results might not be optimal.
INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.965% of the data. 2024-05-27 13:56:32,835 - INFO - Major frequency ns corresponds to 99.965% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 13:56:32,836 - INFO - Defined frequency is equal to major frequency - ns WARNING - (NP.data.splitting._make_future_dataframe) - Number of forecast steps is defined by n_forecasts. Adjusted to 24. 2024-05-27 13:56:32,840 - WARNING - Number of forecast steps is defined by n_forecasts. Adjusted to 24. INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column 2024-05-27 13:56:32,842 - INFO - Returning df with no ID column INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.966% of the data. 2024-05-27 13:56:32,846 - INFO - Major frequency ns corresponds to 99.966% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 13:56:32,847 - INFO - Defined frequency is equal to major frequency - ns INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.966% of the data. 2024-05-27 13:56:32,852 - INFO - Major frequency ns corresponds to 99.966% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 13:56:32,854 - INFO - Defined frequency is equal to major frequency - ns INFO - (NP.data.processing._handle_missing_data) - Dropped 24 rows at the end with NaNs in 'y' column. 2024-05-27 13:56:32,860 - INFO - Dropped 24 rows at the end with NaNs in 'y' column.
INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column 2024-05-27 13:56:33,107 - INFO - Returning df with no ID column 2024-05-27 13:56:33,108 - WARNING - NaN values found in predictions for fold 4. Replacing NaNs with mean value. 2024-05-27 13:56:33,112 - INFO - Processing fold 5/5 for NeuralProphet 2024-05-27 13:56:33,115 - INFO - NeuralProphet model generated for 168 periods. 2024-05-27 13:56:33,116 - INFO - NeuralProphet model generated for 2893 periods. WARNING - (NP.forecaster.fit) - When Global modeling with local normalization, metrics are displayed in normalized scale. 2024-05-27 13:56:33,119 - WARNING - When Global modeling with local normalization, metrics are displayed in normalized scale. INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.993% of the data. 2024-05-27 13:56:33,130 - INFO - Major frequency ns corresponds to 99.993% of the data. INFO - (NP.df_utils._infer_frequency) - Dataframe freq automatically defined as ns 2024-05-27 13:56:33,131 - INFO - Dataframe freq automatically defined as ns INFO - (NP.config.init_data_params) - Setting normalization to global as only one dataframe provided for training. 2024-05-27 13:56:33,136 - INFO - Setting normalization to global as only one dataframe provided for training. INFO - (NP.config.set_auto_batch_epoch) - Auto-set batch_size to 128 2024-05-27 13:56:33,246 - INFO - Auto-set batch_size to 128 INFO - (NP.config.set_auto_batch_epoch) - Auto-set epochs to 50 2024-05-27 13:56:33,246 - INFO - Auto-set epochs to 50 WARNING - (NP.config.set_lr_finder_args) - Learning rate finder: The number of batches (113) is too small than the required number for the learning rate finder (254). The results might not be optimal. 2024-05-27 13:56:33,257 - WARNING - Learning rate finder: The number of batches (113) is too small than the required number for the learning rate finder (254). The results might not be optimal.
INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.965% of the data. 2024-05-27 13:56:46,233 - INFO - Major frequency ns corresponds to 99.965% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 13:56:46,234 - INFO - Defined frequency is equal to major frequency - ns WARNING - (NP.data.splitting._make_future_dataframe) - Number of forecast steps is defined by n_forecasts. Adjusted to 24. 2024-05-27 13:56:46,238 - WARNING - Number of forecast steps is defined by n_forecasts. Adjusted to 24. INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column 2024-05-27 13:56:46,240 - INFO - Returning df with no ID column INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.966% of the data. 2024-05-27 13:56:46,243 - INFO - Major frequency ns corresponds to 99.966% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 13:56:46,244 - INFO - Defined frequency is equal to major frequency - ns INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.966% of the data. 2024-05-27 13:56:46,249 - INFO - Major frequency ns corresponds to 99.966% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 13:56:46,250 - INFO - Defined frequency is equal to major frequency - ns INFO - (NP.data.processing._handle_missing_data) - Dropped 24 rows at the end with NaNs in 'y' column. 2024-05-27 13:56:46,257 - INFO - Dropped 24 rows at the end with NaNs in 'y' column.
INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column 2024-05-27 13:56:46,333 - INFO - Returning df with no ID column 2024-05-27 13:56:46,334 - WARNING - NaN values found in predictions for fold 5. Replacing NaNs with mean value. 2024-05-27 13:56:46,337 - INFO - Adaptive cross-validation scores for NeuralProphet: {'RMSE': [3.4202234712817914, 1.669857212115358, 4.6637210996416645, 2.872185128796025, 2.0494991329713503], 'MSE': [11.697928593506868, 2.788423108853676, 21.750294495242855, 8.249447414077038, 4.200446696050316], 'MAE': [2.7866233911639546, 1.261717739439269, 2.7069349141587367, 1.980755137737919, 1.5875092015671337], 'MAPE': [48.42540248102986, 31.07454472576423, 35.32018086750407, 31.94150299501628, 38.346812652701054], 'R2': [-1.1251831768522016, 0.07327829826045562, 0.14988801678582397, 0.04635617841734252, -0.014578255762303316], 'Adjusted R2': [-1.1325571642805574, 0.0700627475951554, 0.1469382874894528, 0.04304721304058101, -0.018098652208390353]} 2024-05-27 13:56:46,337 - INFO - Saving model NeuralProphet for site PM2.5 and pollutant Takapuna_PM2.5 2024-05-27 13:56:49,077 - INFO - Model saved at data/models/PM2.5_Takapuna_PM2.5_NeuralProphet.joblib 2024-05-27 13:56:49,079 - INFO - Processing fold 1/5 for LinearRegression 2024-05-27 13:56:49,152 - INFO - Model LinearRegression fitted successfully 2024-05-27 13:56:49,212 - INFO - Processing fold 2/5 for LinearRegression 2024-05-27 13:56:49,413 - INFO - Model LinearRegression fitted successfully 2024-05-27 13:56:49,425 - INFO - Processing fold 3/5 for LinearRegression 2024-05-27 13:56:49,459 - INFO - Model LinearRegression fitted successfully 2024-05-27 13:56:49,471 - INFO - Processing fold 4/5 for LinearRegression 2024-05-27 13:56:49,504 - INFO - Model LinearRegression fitted successfully 2024-05-27 13:56:49,509 - INFO - Processing fold 5/5 for LinearRegression 2024-05-27 13:56:49,544 - INFO - Model LinearRegression fitted successfully 2024-05-27 13:56:49,549 - INFO - 
Adaptive cross-validation scores for LinearRegression: {'RMSE': [4.061980559629283, 1.5234210681239604, 3.812710498767556, 2.9447666055915263, 1.8019220480918687], 'MSE': [16.49968606680622, 2.3208117508039483, 14.536761347412346, 8.671650361407039, 3.2469230673995946], 'MAE': [1.6227842138480366, 1.1509320898892812, 2.0571447170901522, 2.152650211708724, 1.3080080202426883], 'MAPE': [38.07659012542245, 25.681741793941608, 29.938084579698675, 32.82197287724934, 24.57738550289326], 'R2': [-1.9975268674475055, 0.22868713564551524, 0.4318295312616225, -0.002450876402860791, 0.21573637741570162], 'Adjusted R2': [-2.0079277240312927, 0.2260108245270055, 0.42985808619313404, -0.005929193114876163, 0.2130151295927165]} 2024-05-27 13:56:49,550 - INFO - Saving model LinearRegression for site PM2.5 and pollutant Takapuna_PM2.5 2024-05-27 13:56:49,553 - INFO - Model saved at data/models/PM2.5_Takapuna_PM2.5_LinearRegression.joblib 2024-05-27 13:56:49,559 - INFO - Processing fold 1/5 for Ridge 2024-05-27 13:56:49,575 - INFO - Model Ridge fitted successfully 2024-05-27 13:56:49,577 - INFO - Processing fold 2/5 for Ridge 2024-05-27 13:56:49,595 - INFO - Model Ridge fitted successfully 2024-05-27 13:56:49,601 - INFO - Processing fold 3/5 for Ridge 2024-05-27 13:56:49,626 - INFO - Model Ridge fitted successfully 2024-05-27 13:56:49,634 - INFO - Processing fold 4/5 for Ridge 2024-05-27 13:56:49,660 - INFO - Model Ridge fitted successfully 2024-05-27 13:56:49,674 - INFO - Processing fold 5/5 for Ridge 2024-05-27 13:56:49,700 - INFO - Model Ridge fitted successfully 2024-05-27 13:56:49,703 - INFO - Adaptive cross-validation scores for Ridge: {'RMSE': [4.0613448050013625, 1.5230448296805557, 3.812389262323064, 2.944637010404716, 1.8019695906565074], 'MSE': [16.494521625111553, 2.3196655532166726, 14.534311887476196, 8.670887123045224, 3.2470944056507807], 'MAE': [1.6208627516985894, 1.1505820949938719, 2.048866644614823, 2.1524804309574956, 1.308095155465682], 'MAPE': 
[38.03614177489798, 25.681927653315945, 29.6504147555802, 32.81655446156088, 24.578740099341147], 'R2': [-1.9965886342791621, 0.22906807000774942, 0.43192526859725155, -0.0023626453357841193, 0.21569499227821198], 'Adjusted R2': [-2.006986235369652, 0.2263930806601011, 0.42995415571937934, -0.00584065590252858, 0.21297360085655415]} 2024-05-27 13:56:49,704 - INFO - Saving model Ridge for site PM2.5 and pollutant Takapuna_PM2.5 2024-05-27 13:56:49,707 - INFO - Model saved at data/models/PM2.5_Takapuna_PM2.5_Ridge.joblib 2024-05-27 13:56:49,713 - INFO - Processing fold 1/5 for Lasso 2024-05-27 13:56:50,008 - INFO - Model Lasso fitted successfully 2024-05-27 13:56:50,009 - INFO - Processing fold 2/5 for Lasso 2024-05-27 13:56:50,308 - INFO - Model Lasso fitted successfully 2024-05-27 13:56:50,309 - INFO - Processing fold 3/5 for Lasso 2024-05-27 13:56:50,638 - INFO - Model Lasso fitted successfully 2024-05-27 13:56:50,639 - INFO - Processing fold 4/5 for Lasso 2024-05-27 13:56:50,952 - INFO - Model Lasso fitted successfully 2024-05-27 13:56:50,953 - INFO - Processing fold 5/5 for Lasso 2024-05-27 13:56:51,285 - INFO - Model Lasso fitted successfully 2024-05-27 13:56:51,287 - INFO - Adaptive cross-validation scores for Lasso: {'RMSE': [3.9321247476652754, 1.4738039514015235, 3.9516452474294264, 2.918387491705076, 1.679850501501327], 'MSE': [15.461605031201707, 2.1720980871667446, 15.615500161531573, 8.516985551740644, 2.82189770739426], 'MAE': [1.5330864461483857, 1.110196398644809, 2.1132507531036557, 2.1217751766332023, 1.2061355302180363], 'MAPE': [36.18975371280023, 25.819545739247562, 30.50805271872101, 32.19090985075741, 23.214684619637165], 'R2': [-1.8089368674794293, 0.2781115501112411, 0.3896669392635427, 0.015428519967728205, 0.3183972417505436], 'Adjusted R2': [-1.8186833520994137, 0.2756067324502809, 0.38754919790082076, 0.012012241411058389, 0.3160322078912464]} 2024-05-27 13:56:51,288 - INFO - Saving model Lasso for site PM2.5 and pollutant Takapuna_PM2.5 
2024-05-27 13:56:51,289 - INFO - Model saved at data/models/PM2.5_Takapuna_PM2.5_Lasso.joblib 2024-05-27 13:56:51,293 - INFO - Processing fold 1/5 for RandomForest 2024-05-27 13:57:47,920 - INFO - Model RandomForest fitted successfully 2024-05-27 13:57:47,920 - INFO - Processing fold 2/5 for RandomForest 2024-05-27 13:58:47,036 - INFO - Model RandomForest fitted successfully 2024-05-27 13:58:47,036 - INFO - Processing fold 3/5 for RandomForest 2024-05-27 13:59:48,807 - INFO - Model RandomForest fitted successfully 2024-05-27 13:59:48,808 - INFO - Processing fold 4/5 for RandomForest 2024-05-27 14:00:53,116 - INFO - Model RandomForest fitted successfully 2024-05-27 14:00:53,116 - INFO - Processing fold 5/5 for RandomForest 2024-05-27 14:01:59,958 - INFO - Model RandomForest fitted successfully 2024-05-27 14:01:59,959 - INFO - Adaptive cross-validation scores for RandomForest: {'RMSE': [1.4775920472061237, 1.3453089946162449, 2.0357281522417017, 2.6849534886808857, 1.3888216843334749], 'MSE': [2.1832782579667835, 1.8098562909953715, 4.144189109829413, 7.208975236379658, 1.9288256708748701], 'MAE': [1.0269844911542954, 0.9799949042889671, 1.2701365942390312, 1.8793785664825255, 1.0340768871243118], 'MAPE': [25.435091576028807, 23.824821465149455, 18.897743980279035, 27.29556170877043, 20.60210239735174], 'R2': [0.6033600147983784, 0.3985012186386725, 0.8380240403759951, 0.16663573339657423, 0.5341103633892326], 'Adjusted R2': [0.6019837483681161, 0.39641413057010444, 0.8374620141455162, 0.16374411553882462, 0.5324938136438795]} 2024-05-27 14:01:59,960 - INFO - Saving model RandomForest for site PM2.5 and pollutant Takapuna_PM2.5 2024-05-27 14:02:00,024 - INFO - Model saved at data/models/PM2.5_Takapuna_PM2.5_RandomForest.joblib 2024-05-27 14:02:00,026 - INFO - Processing fold 1/5 for SVR 2024-05-27 14:17:45,532 - INFO - Model SVR fitted successfully 2024-05-27 14:17:45,533 - INFO - Processing fold 2/5 for SVR 2024-05-27 14:40:07,499 - INFO - Model SVR fitted 
successfully 2024-05-27 14:40:07,499 - INFO - Processing fold 3/5 for SVR 2024-05-27 15:03:37,165 - INFO - Model SVR fitted successfully 2024-05-27 15:03:37,166 - INFO - Processing fold 4/5 for SVR 2024-05-27 15:21:42,968 - INFO - Model SVR fitted successfully 2024-05-27 15:21:42,969 - INFO - Processing fold 5/5 for SVR 2024-05-27 15:40:57,476 - INFO - Model SVR fitted successfully 2024-05-27 15:40:57,478 - INFO - Adaptive cross-validation scores for SVR: {'RMSE': [2.3980751822423554, 1.3202001696905143, 3.128868360357376, 2.7575160033351835, 1.4068616647836736], 'MSE': [5.750764579686706, 1.7429284880508629, 9.789817216445456, 7.603894508649644, 1.9792597438378894], 'MAE': [1.977054213696566, 0.9834424637362961, 1.445714815756344, 1.9185963624445772, 1.043502287818774], 'MAPE': [52.405830763804, 23.56210241567084, 18.660275259191348, 27.72484993548117, 20.391200536652633], 'R2': [-0.04475147382686351, 0.4207444166818568, 0.617364218631753, 0.12098269688148211, 0.5219284890600278], 'Adjusted R2': [-0.048376565686082396, 0.41873450834279313, 0.6160365441648263, 0.11793267154102927, 0.5202696704932686]} 2024-05-27 15:40:57,478 - INFO - Saving model SVR for site PM2.5 and pollutant Takapuna_PM2.5 2024-05-27 15:40:57,485 - INFO - Model saved at data/models/PM2.5_Takapuna_PM2.5_SVR.joblib 2024-05-27 15:40:57,487 - INFO - Processing fold 1/5 for XGBoost 2024-05-27 15:40:58,638 - INFO - Model XGBoost fitted successfully 2024-05-27 15:40:58,638 - INFO - Processing fold 2/5 for XGBoost 2024-05-27 15:40:59,675 - INFO - Model XGBoost fitted successfully 2024-05-27 15:40:59,676 - INFO - Processing fold 3/5 for XGBoost 2024-05-27 15:41:00,881 - INFO - Model XGBoost fitted successfully 2024-05-27 15:41:00,881 - INFO - Processing fold 4/5 for XGBoost 2024-05-27 15:41:01,910 - INFO - Model XGBoost fitted successfully 2024-05-27 15:41:01,911 - INFO - Processing fold 5/5 for XGBoost 2024-05-27 15:41:03,312 - INFO - Model XGBoost fitted successfully 2024-05-27 15:41:03,313 - INFO - 
Adaptive cross-validation scores for XGBoost: {'RMSE': [1.51337684773268, 1.3489503134069614, 1.986465967351033, 2.706373251040524, 1.4275647942490766], 'MSE': [2.2903094832533033, 1.819666948040739, 3.946047039443875, 7.324456173947655, 2.0379412417794085], 'MAE': [1.05980058709437, 0.9859179929401869, 1.2461940848782034, 1.8880597637413639, 1.0523506612433775], 'MAPE': [25.580890078589597, 22.68547392725986, 18.68657486441091, 27.415165941799657, 20.412053187085704], 'R2': [0.5839154646322025, 0.39524068448106975, 0.8457684388920912, 0.15328602921707135, 0.5077545270661739], 'Adjusted R2': [0.5824717292561865, 0.3931422829699007, 0.8452332842733962, 0.15034809038715147, 0.5060465275070696]}
2024-05-27 15:41:03,313 - INFO - Saving model XGBoost for site PM2.5 and pollutant Takapuna_PM2.5
2024-05-27 15:41:03,319 - INFO - Model saved at data/models/PM2.5_Takapuna_PM2.5_XGBoost.joblib
2024-05-27 15:41:03,319 - INFO - The best model based on average RMSE across all folds for Takapuna_PM2.5: RandomForest
2024-05-27 15:41:03,320 - INFO - Markdown Table for Takapuna_PM2.5:

| Target | Model | Metric | Fold1 | Fold2 | Fold3 | Fold4 | Fold5 | Training Time (s) |
|---|---|---|---|---|---|---|---|---|
| Takapuna_PM2.5 | ARIMA | RMSE | 5.11 | 2.04 | 5.17 | 2.98 | 2.41 | 17.77 |
| Takapuna_PM2.5 | ARIMA | MSE | 26.15 | 4.18 | 26.76 | 8.90 | 5.80 | 17.77 |
| Takapuna_PM2.5 | ARIMA | MAE | 4.55 | 1.51 | 2.83 | 2.09 | 1.77 | 17.77 |
| Takapuna_PM2.5 | ARIMA | MAPE | nan | nan | nan | nan | nan | 17.77 |
| Takapuna_PM2.5 | ARIMA | R2 | -3.75 | -0.39 | -0.05 | -0.03 | -0.40 | 17.77 |
| Takapuna_PM2.5 | ARIMA | Adjusted R2 | -3.77 | -0.39 | -0.05 | -0.03 | -0.41 | 17.77 |
| Takapuna_PM2.5 | Prophet | RMSE | 3.77 | 2.01 | 5.09 | 3.03 | 3.59 | 9.91 |
| Takapuna_PM2.5 | Prophet | MSE | 14.22 | 4.04 | 25.94 | 9.16 | 12.89 | 9.91 |
| Takapuna_PM2.5 | Prophet | MAE | 3.34 | 1.66 | 2.90 | 2.32 | 3.17 | 9.91 |
| Takapuna_PM2.5 | Prophet | MAPE | 85.53 | 43.71 | 43.89 | 47.87 | 79.04 | 9.91 |
| Takapuna_PM2.5 | Prophet | R2 | -1.58 | -0.34 | -0.01 | -0.06 | -2.11 | 9.91 |
| Takapuna_PM2.5 | Prophet | Adjusted R2 | -1.59 | -0.35 | -0.02 | -0.06 | -2.12 | 9.91 |
| Takapuna_PM2.5 | NeuralProphet | RMSE | 3.42 | 1.67 | 4.66 | 2.87 | 2.05 | 61.00 |
| Takapuna_PM2.5 | NeuralProphet | MSE | 11.70 | 2.79 | 21.75 | 8.25 | 4.20 | 61.00 |
| Takapuna_PM2.5 | NeuralProphet | MAE | 2.79 | 1.26 | 2.71 | 1.98 | 1.59 | 61.00 |
| Takapuna_PM2.5 | NeuralProphet | MAPE | 48.43 | 31.07 | 35.32 | 31.94 | 38.35 | 61.00 |
| Takapuna_PM2.5 | NeuralProphet | R2 | -1.13 | 0.07 | 0.15 | 0.05 | -0.01 | 61.00 |
| Takapuna_PM2.5 | NeuralProphet | Adjusted R2 | -1.13 | 0.07 | 0.15 | 0.04 | -0.02 | 61.00 |
| Takapuna_PM2.5 | LinearRegression | RMSE | 4.06 | 1.52 | 3.81 | 2.94 | 1.80 | 0.48 |
| Takapuna_PM2.5 | LinearRegression | MSE | 16.50 | 2.32 | 14.54 | 8.67 | 3.25 | 0.48 |
| Takapuna_PM2.5 | LinearRegression | MAE | 1.62 | 1.15 | 2.06 | 2.15 | 1.31 | 0.48 |
| Takapuna_PM2.5 | LinearRegression | MAPE | 38.08 | 25.68 | 29.94 | 32.82 | 24.58 | 0.48 |
| Takapuna_PM2.5 | LinearRegression | R2 | -2.00 | 0.23 | 0.43 | -0.00 | 0.22 | 0.48 |
| Takapuna_PM2.5 | LinearRegression | Adjusted R2 | -2.01 | 0.23 | 0.43 | -0.01 | 0.21 | 0.48 |
| Takapuna_PM2.5 | Ridge | RMSE | 4.06 | 1.52 | 3.81 | 2.94 | 1.80 | 0.15 |
| Takapuna_PM2.5 | Ridge | MSE | 16.49 | 2.32 | 14.53 | 8.67 | 3.25 | 0.15 |
| Takapuna_PM2.5 | Ridge | MAE | 1.62 | 1.15 | 2.05 | 2.15 | 1.31 | 0.15 |
| Takapuna_PM2.5 | Ridge | MAPE | 38.04 | 25.68 | 29.65 | 32.82 | 24.58 | 0.15 |
| Takapuna_PM2.5 | Ridge | R2 | -2.00 | 0.23 | 0.43 | -0.00 | 0.22 | 0.15 |
| Takapuna_PM2.5 | Ridge | Adjusted R2 | -2.01 | 0.23 | 0.43 | -0.01 | 0.21 | 0.15 |
| Takapuna_PM2.5 | Lasso | RMSE | 3.93 | 1.47 | 3.95 | 2.92 | 1.68 | 1.58 |
| Takapuna_PM2.5 | Lasso | MSE | 15.46 | 2.17 | 15.62 | 8.52 | 2.82 | 1.58 |
| Takapuna_PM2.5 | Lasso | MAE | 1.53 | 1.11 | 2.11 | 2.12 | 1.21 | 1.58 |
| Takapuna_PM2.5 | Lasso | MAPE | 36.19 | 25.82 | 30.51 | 32.19 | 23.21 | 1.58 |
| Takapuna_PM2.5 | Lasso | R2 | -1.81 | 0.28 | 0.39 | 0.02 | 0.32 | 1.58 |
| Takapuna_PM2.5 | Lasso | Adjusted R2 | -1.82 | 0.28 | 0.39 | 0.01 | 0.32 | 1.58 |
| Takapuna_PM2.5 | RandomForest | RMSE | 1.48 | 1.35 | 2.04 | 2.68 | 1.39 | 308.73 |
| Takapuna_PM2.5 | RandomForest | MSE | 2.18 | 1.81 | 4.14 | 7.21 | 1.93 | 308.73 |
| Takapuna_PM2.5 | RandomForest | MAE | 1.03 | 0.98 | 1.27 | 1.88 | 1.03 | 308.73 |
| Takapuna_PM2.5 | RandomForest | MAPE | 25.44 | 23.82 | 18.90 | 27.30 | 20.60 | 308.73 |
| Takapuna_PM2.5 | RandomForest | R2 | 0.60 | 0.40 | 0.84 | 0.17 | 0.53 | 308.73 |
| Takapuna_PM2.5 | RandomForest | Adjusted R2 | 0.60 | 0.40 | 0.84 | 0.16 | 0.53 | 308.73 |
| Takapuna_PM2.5 | SVR | RMSE | 2.40 | 1.32 | 3.13 | 2.76 | 1.41 | 5937.46 |
| Takapuna_PM2.5 | SVR | MSE | 5.75 | 1.74 | 9.79 | 7.60 | 1.98 | 5937.46 |
| Takapuna_PM2.5 | SVR | MAE | 1.98 | 0.98 | 1.45 | 1.92 | 1.04 | 5937.46 |
| Takapuna_PM2.5 | SVR | MAPE | 52.41 | 23.56 | 18.66 | 27.72 | 20.39 | 5937.46 |
| Takapuna_PM2.5 | SVR | R2 | -0.04 | 0.42 | 0.62 | 0.12 | 0.52 | 5937.46 |
| Takapuna_PM2.5 | SVR | Adjusted R2 | -0.05 | 0.42 | 0.62 | 0.12 | 0.52 | 5937.46 |
| Takapuna_PM2.5 | XGBoost | RMSE | 1.51 | 1.35 | 1.99 | 2.71 | 1.43 | 5.83 |
| Takapuna_PM2.5 | XGBoost | MSE | 2.29 | 1.82 | 3.95 | 7.32 | 2.04 | 5.83 |
| Takapuna_PM2.5 | XGBoost | MAE | 1.06 | 0.99 | 1.25 | 1.89 | 1.05 | 5.83 |
| Takapuna_PM2.5 | XGBoost | MAPE | 25.58 | 22.69 | 18.69 | 27.42 | 20.41 | 5.83 |
| Takapuna_PM2.5 | XGBoost | R2 | 0.58 | 0.40 | 0.85 | 0.15 | 0.51 | 5.83 |
| Takapuna_PM2.5 | XGBoost | Adjusted R2 | 0.58 | 0.39 | 0.85 | 0.15 | 0.51 | 5.83 |

2024-05-27 15:41:03,320 - INFO - Saving model RandomForest for site Takapuna and pollutant PM2.5
2024-05-27 15:41:03,395 - INFO - Model saved at data/models/Takapuna_PM2.5_RandomForest.joblib
2024-05-27 15:41:03,396 - INFO - 🛠️ Training models for Penrose_PM10 ...
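🎓 The log above reports RandomForest as the best model for Takapuna_PM2.5 "based on average RMSE across all folds". The selection code itself is not shown in this output, so the following is only a minimal sketch of how such a choice could be reproduced from the per-fold RMSE values (rounded, copied from the logged table). The names `fold_rmse` and `best_model_by_mean_rmse` are hypothetical and do not come from the notebook.

```python
from statistics import mean

# Per-fold RMSE values (rounded) copied from the Takapuna_PM2.5 results logged above.
fold_rmse = {
    "ARIMA": [5.11, 2.04, 5.17, 2.98, 2.41],
    "Prophet": [3.77, 2.01, 5.09, 3.03, 3.59],
    "NeuralProphet": [3.42, 1.67, 4.66, 2.87, 2.05],
    "LinearRegression": [4.06, 1.52, 3.81, 2.94, 1.80],
    "Ridge": [4.06, 1.52, 3.81, 2.94, 1.80],
    "Lasso": [3.93, 1.47, 3.95, 2.92, 1.68],
    "RandomForest": [1.48, 1.35, 2.04, 2.68, 1.39],
    "SVR": [2.40, 1.32, 3.13, 2.76, 1.41],
    "XGBoost": [1.51, 1.35, 1.99, 2.71, 1.43],
}


def best_model_by_mean_rmse(scores):
    """Return (model_name, mean_rmse) for the model with the lowest mean RMSE.

    Hypothetical helper: assumes every model has the same number of folds.
    """
    averages = {name: mean(values) for name, values in scores.items()}
    best = min(averages, key=averages.get)
    return best, averages[best]


best_name, best_rmse = best_model_by_mean_rmse(fold_rmse)
print(f"Best model: {best_name} (mean RMSE = {best_rmse:.3f})")
```

On these rounded values RandomForest and XGBoost are very close in mean RMSE, while the logged training times differ widely (308.73 vs 5.83), so training cost may be worth weighing alongside the error metric.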
2024-05-27 15:41:03,396 - INFO - Selected ARIMA model for Penrose_PM10: ARIMA(10, 0, 0) 2024-05-27 15:41:03,402 - INFO - Processing fold 1/5 for ARIMA 2024-05-27 15:41:13,488 - INFO - Model ARIMA fitted successfully 2024-05-27 15:41:13,489 - INFO - Processing fold 2/5 for ARIMA 2024-05-27 15:41:25,908 - INFO - Model ARIMA fitted successfully 2024-05-27 15:41:25,942 - INFO - Processing fold 3/5 for ARIMA 2024-05-27 15:41:39,106 - INFO - Model ARIMA fitted successfully 2024-05-27 15:41:39,127 - INFO - Processing fold 4/5 for ARIMA 2024-05-27 15:41:59,309 - INFO - Model ARIMA fitted successfully 2024-05-27 15:41:59,329 - INFO - Processing fold 5/5 for ARIMA 2024-05-27 15:42:20,312 - INFO - Model ARIMA fitted successfully 2024-05-27 15:42:20,315 - INFO - Adaptive cross-validation scores for ARIMA: {'RMSE': [7.410416747806064, 6.698254621110223, 7.582223257788919, 8.746122102870272, 9.74461006608308], 'MSE': [54.9142763761646, 44.866614969224464, 57.49010953095521, 76.49465183831592, 94.95742534000767], 'MAE': [5.694949926829244, 5.21703165629189, 5.656579195925595, 7.0750233987787725, 6.495624872211012], 'MAPE': [inf, inf, inf, inf, inf], 'R2': [-0.013737072709760989, -0.0009034163058365685, -0.01649833798592848, -0.0010108780598307998, -0.014683726986427503], 'Adjusted R2': [-0.017252111103345458, -0.004373955197326973, -0.020022950808348483, -0.0044817895648927575, -0.018202047815090605]} 2024-05-27 15:42:20,316 - INFO - Saving model ARIMA for site PM10 and pollutant Penrose_PM10 2024-05-27 15:42:20,583 - INFO - Model saved at data/models/PM10_Penrose_PM10_ARIMA.joblib 2024-05-27 15:42:20,585 - INFO - Processing fold 1/5 for Prophet 2024-05-27 15:42:20,613 - DEBUG - input tempfile: /var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/4lji_djm.json 2024-05-27 15:42:20,691 - DEBUG - input tempfile: /var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/b21e3fgv.json 2024-05-27 15:42:20,692 - DEBUG - idx 0 2024-05-27 15:42:20,692 - DEBUG - running CmdStan, 
num_threads: None 2024-05-27 15:42:20,692 - DEBUG - CmdStan args: ['/Users/nnthanh/.pyenv/versions/3.11.7/lib/python3.11/site-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=50943', 'data', 'file=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/4lji_djm.json', 'init=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/b21e3fgv.json', 'output', 'file=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/prophet_modeldueisjer/prophet_model-20240527154220.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000'] 15:42:20 - cmdstanpy - INFO - Chain [1] start processing 2024-05-27 15:42:20,693 - INFO - Chain [1] start processing 15:42:20 - cmdstanpy - INFO - Chain [1] done processing 2024-05-27 15:42:20,901 - INFO - Chain [1] done processing 2024-05-27 15:42:21,425 - INFO - Processing fold 2/5 for Prophet 2024-05-27 15:42:21,455 - DEBUG - input tempfile: /var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/ajpmgfbq.json 2024-05-27 15:42:21,605 - DEBUG - input tempfile: /var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/wznedc1y.json 2024-05-27 15:42:21,606 - DEBUG - idx 0 2024-05-27 15:42:21,607 - DEBUG - running CmdStan, num_threads: None 2024-05-27 15:42:21,607 - DEBUG - CmdStan args: ['/Users/nnthanh/.pyenv/versions/3.11.7/lib/python3.11/site-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=53164', 'data', 'file=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/ajpmgfbq.json', 'init=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/wznedc1y.json', 'output', 'file=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/prophet_modelo93ihvuu/prophet_model-20240527154221.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000'] 15:42:21 - cmdstanpy - INFO - Chain [1] start processing 2024-05-27 15:42:21,607 - INFO - Chain [1] start processing 15:42:22 - cmdstanpy - INFO - Chain [1] done processing 2024-05-27 15:42:22,355 - INFO - Chain [1] done processing 2024-05-27 
15:42:22,821 - INFO - Processing fold 3/5 for Prophet 2024-05-27 15:42:22,857 - DEBUG - input tempfile: /var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/m3vvtc1i.json 2024-05-27 15:42:23,083 - DEBUG - input tempfile: /var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/npgh90_j.json 2024-05-27 15:42:23,083 - DEBUG - idx 0 2024-05-27 15:42:23,084 - DEBUG - running CmdStan, num_threads: None 2024-05-27 15:42:23,084 - DEBUG - CmdStan args: ['/Users/nnthanh/.pyenv/versions/3.11.7/lib/python3.11/site-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=80139', 'data', 'file=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/m3vvtc1i.json', 'init=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/npgh90_j.json', 'output', 'file=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/prophet_modele0k4_lim/prophet_model-20240527154223.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000'] 15:42:23 - cmdstanpy - INFO - Chain [1] start processing 2024-05-27 15:42:23,084 - INFO - Chain [1] start processing 15:42:24 - cmdstanpy - INFO - Chain [1] done processing 2024-05-27 15:42:24,485 - INFO - Chain [1] done processing 2024-05-27 15:42:24,937 - INFO - Processing fold 4/5 for Prophet 2024-05-27 15:42:24,978 - DEBUG - input tempfile: /var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/o3umgxvv.json 2024-05-27 15:42:25,278 - DEBUG - input tempfile: /var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/hiirr1i3.json 2024-05-27 15:42:25,279 - DEBUG - idx 0 2024-05-27 15:42:25,279 - DEBUG - running CmdStan, num_threads: None 2024-05-27 15:42:25,279 - DEBUG - CmdStan args: ['/Users/nnthanh/.pyenv/versions/3.11.7/lib/python3.11/site-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=87594', 'data', 'file=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/o3umgxvv.json', 'init=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/hiirr1i3.json', 'output', 
'file=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/prophet_modelau1p6cvc/prophet_model-20240527154225.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000'] 15:42:25 - cmdstanpy - INFO - Chain [1] start processing 2024-05-27 15:42:25,279 - INFO - Chain [1] start processing 15:42:27 - cmdstanpy - INFO - Chain [1] done processing 2024-05-27 15:42:27,258 - INFO - Chain [1] done processing 2024-05-27 15:42:27,746 - INFO - Processing fold 5/5 for Prophet 2024-05-27 15:42:27,795 - DEBUG - input tempfile: /var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/6282v6zj.json 2024-05-27 15:42:28,166 - DEBUG - input tempfile: /var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/_6g3hsg6.json 2024-05-27 15:42:28,167 - DEBUG - idx 0 2024-05-27 15:42:28,167 - DEBUG - running CmdStan, num_threads: None 2024-05-27 15:42:28,167 - DEBUG - CmdStan args: ['/Users/nnthanh/.pyenv/versions/3.11.7/lib/python3.11/site-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=77458', 'data', 'file=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/6282v6zj.json', 'init=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/_6g3hsg6.json', 'output', 'file=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/prophet_modelfuf1yyw7/prophet_model-20240527154228.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000'] 15:42:28 - cmdstanpy - INFO - Chain [1] start processing 2024-05-27 15:42:28,167 - INFO - Chain [1] start processing 15:42:30 - cmdstanpy - INFO - Chain [1] done processing 2024-05-27 15:42:30,791 - INFO - Chain [1] done processing 2024-05-27 15:42:31,252 - INFO - Adaptive cross-validation scores for Prophet: {'RMSE': [8.30343967405925, 8.23009298015674, 7.593614338654195, 8.867956200270823, 9.866111376069139], 'MSE': [68.94711042074118, 67.73443046202526, 57.66297872421458, 78.64064716992173, 97.34015368500089], 'MAE': [6.8277965066714685, 6.80432632146319, 5.786215676460357, 7.118737292562632, 6.932631993985451], 
'MAPE': [inf, inf, inf, inf, inf], 'R2': [-0.2727881800889258, -0.5110483128150969, -0.019554885435059788, -0.029093425265137807, -0.04014477617596657], 'Adjusted R2': [-0.2772014539449901, -0.5162877313754821, -0.0230900965496057, -0.032661710373546704, -0.04375138080903174]} 2024-05-27 15:42:31,252 - INFO - Saving model Prophet for site PM10 and pollutant Penrose_PM10 2024-05-27 15:42:31,270 - INFO - Model saved at data/models/PM10_Penrose_PM10_Prophet.joblib 2024-05-27 15:42:31,272 - INFO - Processing fold 1/5 for NeuralProphet 2024-05-27 15:42:31,275 - INFO - NeuralProphet model generated for 168 periods. 2024-05-27 15:42:31,276 - INFO - NeuralProphet model generated for 2895 periods. WARNING - (NP.forecaster.fit) - When Global modeling with local normalization, metrics are displayed in normalized scale. 2024-05-27 15:42:31,277 - WARNING - When Global modeling with local normalization, metrics are displayed in normalized scale. INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.966% of the data. 2024-05-27 15:42:31,283 - INFO - Major frequency ns corresponds to 99.966% of the data. INFO - (NP.df_utils._infer_frequency) - Dataframe freq automatically defined as ns 2024-05-27 15:42:31,284 - INFO - Dataframe freq automatically defined as ns INFO - (NP.config.init_data_params) - Setting normalization to global as only one dataframe provided for training. 2024-05-27 15:42:31,288 - INFO - Setting normalization to global as only one dataframe provided for training. INFO - (NP.config.set_auto_batch_epoch) - Auto-set batch_size to 64 2024-05-27 15:42:31,318 - INFO - Auto-set batch_size to 64 INFO - (NP.config.set_auto_batch_epoch) - Auto-set epochs to 80 2024-05-27 15:42:31,318 - INFO - Auto-set epochs to 80 WARNING - (NP.config.set_lr_finder_args) - Learning rate finder: The number of batches (45) is too small than the required number for the learning rate finder (236). The results might not be optimal. 
2024-05-27 15:42:31,357 - WARNING - Learning rate finder: The number of batches (45) is too small than the required number for the learning rate finder (236). The results might not be optimal.
INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.965% of the data. 2024-05-27 15:42:38,519 - INFO - Major frequency ns corresponds to 99.965% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 15:42:38,521 - INFO - Defined frequency is equal to major frequency - ns WARNING - (NP.data.splitting._make_future_dataframe) - Number of forecast steps is defined by n_forecasts. Adjusted to 24. 2024-05-27 15:42:38,526 - WARNING - Number of forecast steps is defined by n_forecasts. Adjusted to 24. INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column 2024-05-27 15:42:38,528 - INFO - Returning df with no ID column INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.966% of the data. 2024-05-27 15:42:38,532 - INFO - Major frequency ns corresponds to 99.966% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 15:42:38,533 - INFO - Defined frequency is equal to major frequency - ns INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.966% of the data. 2024-05-27 15:42:38,537 - INFO - Major frequency ns corresponds to 99.966% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 15:42:38,539 - INFO - Defined frequency is equal to major frequency - ns INFO - (NP.data.processing._handle_missing_data) - Dropped 24 rows at the end with NaNs in 'y' column. 2024-05-27 15:42:38,547 - INFO - Dropped 24 rows at the end with NaNs in 'y' column.
INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column 2024-05-27 15:42:38,612 - INFO - Returning df with no ID column 2024-05-27 15:42:38,614 - WARNING - NaN values found in predictions for fold 1. Replacing NaNs with mean value. 2024-05-27 15:42:38,616 - INFO - Processing fold 2/5 for NeuralProphet 2024-05-27 15:42:38,619 - INFO - NeuralProphet model generated for 168 periods. 2024-05-27 15:42:38,620 - INFO - NeuralProphet model generated for 2895 periods. WARNING - (NP.forecaster.fit) - When Global modeling with local normalization, metrics are displayed in normalized scale. 2024-05-27 15:42:38,621 - WARNING - When Global modeling with local normalization, metrics are displayed in normalized scale. INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.983% of the data. 2024-05-27 15:42:38,855 - INFO - Major frequency ns corresponds to 99.983% of the data. INFO - (NP.df_utils._infer_frequency) - Dataframe freq automatically defined as ns 2024-05-27 15:42:38,855 - INFO - Dataframe freq automatically defined as ns INFO - (NP.config.init_data_params) - Setting normalization to global as only one dataframe provided for training. 2024-05-27 15:42:38,859 - INFO - Setting normalization to global as only one dataframe provided for training. INFO - (NP.config.set_auto_batch_epoch) - Auto-set batch_size to 64 2024-05-27 15:42:38,909 - INFO - Auto-set batch_size to 64 INFO - (NP.config.set_auto_batch_epoch) - Auto-set epochs to 70 2024-05-27 15:42:38,910 - INFO - Auto-set epochs to 70 WARNING - (NP.config.set_lr_finder_args) - Learning rate finder: The number of batches (90) is too small than the required number for the learning rate finder (244). The results might not be optimal. 2024-05-27 15:42:38,921 - WARNING - Learning rate finder: The number of batches (90) is too small than the required number for the learning rate finder (244). The results might not be optimal.
INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.965% of the data. 2024-05-27 15:42:50,855 - INFO - Major frequency ns corresponds to 99.965% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 15:42:50,856 - INFO - Defined frequency is equal to major frequency - ns WARNING - (NP.data.splitting._make_future_dataframe) - Number of forecast steps is defined by n_forecasts. Adjusted to 24. 2024-05-27 15:42:50,862 - WARNING - Number of forecast steps is defined by n_forecasts. Adjusted to 24. INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column 2024-05-27 15:42:50,864 - INFO - Returning df with no ID column INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.966% of the data. 2024-05-27 15:42:50,868 - INFO - Major frequency ns corresponds to 99.966% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 15:42:50,869 - INFO - Defined frequency is equal to major frequency - ns INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.966% of the data. 2024-05-27 15:42:50,874 - INFO - Major frequency ns corresponds to 99.966% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 15:42:50,875 - INFO - Defined frequency is equal to major frequency - ns INFO - (NP.data.processing._handle_missing_data) - Dropped 24 rows at the end with NaNs in 'y' column. 2024-05-27 15:42:50,882 - INFO - Dropped 24 rows at the end with NaNs in 'y' column.
INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column 2024-05-27 15:42:50,949 - INFO - Returning df with no ID column 2024-05-27 15:42:50,951 - WARNING - NaN values found in predictions for fold 2. Replacing NaNs with mean value. 2024-05-27 15:42:50,954 - INFO - Processing fold 3/5 for NeuralProphet 2024-05-27 15:42:50,957 - INFO - NeuralProphet model generated for 168 periods. 2024-05-27 15:42:50,957 - INFO - NeuralProphet model generated for 2895 periods. WARNING - (NP.forecaster.fit) - When Global modeling with local normalization, metrics are displayed in normalized scale. 2024-05-27 15:42:50,959 - WARNING - When Global modeling with local normalization, metrics are displayed in normalized scale. INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.988% of the data. 2024-05-27 15:42:50,968 - INFO - Major frequency ns corresponds to 99.988% of the data. INFO - (NP.df_utils._infer_frequency) - Dataframe freq automatically defined as ns 2024-05-27 15:42:50,969 - INFO - Dataframe freq automatically defined as ns INFO - (NP.config.init_data_params) - Setting normalization to global as only one dataframe provided for training. 2024-05-27 15:42:50,973 - INFO - Setting normalization to global as only one dataframe provided for training. INFO - (NP.config.set_auto_batch_epoch) - Auto-set batch_size to 64 2024-05-27 15:42:51,045 - INFO - Auto-set batch_size to 64 INFO - (NP.config.set_auto_batch_epoch) - Auto-set epochs to 60 2024-05-27 15:42:51,046 - INFO - Auto-set epochs to 60 WARNING - (NP.config.set_lr_finder_args) - Learning rate finder: The number of batches (136) is too small than the required number for the learning rate finder (248). The results might not be optimal. 2024-05-27 15:42:51,058 - WARNING - Learning rate finder: The number of batches (136) is too small than the required number for the learning rate finder (248). The results might not be optimal.
INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.965% of the data. 2024-05-27 15:43:06,401 - INFO - Major frequency ns corresponds to 99.965% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 15:43:06,403 - INFO - Defined frequency is equal to major frequency - ns WARNING - (NP.data.splitting._make_future_dataframe) - Number of forecast steps is defined by n_forecasts. Adjusted to 24. 2024-05-27 15:43:06,407 - WARNING - Number of forecast steps is defined by n_forecasts. Adjusted to 24. INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column 2024-05-27 15:43:06,409 - INFO - Returning df with no ID column INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.966% of the data. 2024-05-27 15:43:06,413 - INFO - Major frequency ns corresponds to 99.966% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 15:43:06,414 - INFO - Defined frequency is equal to major frequency - ns INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.966% of the data. 2024-05-27 15:43:06,419 - INFO - Major frequency ns corresponds to 99.966% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 15:43:06,420 - INFO - Defined frequency is equal to major frequency - ns INFO - (NP.data.processing._handle_missing_data) - Dropped 24 rows at the end with NaNs in 'y' column. 2024-05-27 15:43:06,428 - INFO - Dropped 24 rows at the end with NaNs in 'y' column.
INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column 2024-05-27 15:43:06,497 - INFO - Returning df with no ID column 2024-05-27 15:43:06,498 - WARNING - NaN values found in predictions for fold 3. Replacing NaNs with mean value. 2024-05-27 15:43:06,501 - INFO - Processing fold 4/5 for NeuralProphet 2024-05-27 15:43:06,504 - INFO - NeuralProphet model generated for 168 periods. 2024-05-27 15:43:06,504 - INFO - NeuralProphet model generated for 2895 periods. WARNING - (NP.forecaster.fit) - When Global modeling with local normalization, metrics are displayed in normalized scale. 2024-05-27 15:43:06,507 - WARNING - When Global modeling with local normalization, metrics are displayed in normalized scale. INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.991% of the data. 2024-05-27 15:43:06,517 - INFO - Major frequency ns corresponds to 99.991% of the data. INFO - (NP.df_utils._infer_frequency) - Dataframe freq automatically defined as ns 2024-05-27 15:43:06,517 - INFO - Dataframe freq automatically defined as ns INFO - (NP.config.init_data_params) - Setting normalization to global as only one dataframe provided for training. 2024-05-27 15:43:06,523 - INFO - Setting normalization to global as only one dataframe provided for training. INFO - (NP.config.set_auto_batch_epoch) - Auto-set batch_size to 128 2024-05-27 15:43:06,792 - INFO - Auto-set batch_size to 128 INFO - (NP.config.set_auto_batch_epoch) - Auto-set epochs to 50 2024-05-27 15:43:06,793 - INFO - Auto-set epochs to 50 WARNING - (NP.config.set_lr_finder_args) - Learning rate finder: The number of batches (91) is too small than the required number for the learning rate finder (251). The results might not be optimal. 2024-05-27 15:43:06,806 - WARNING - Learning rate finder: The number of batches (91) is too small than the required number for the learning rate finder (251). The results might not be optimal.
INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.965% of the data. 2024-05-27 15:43:17,625 - INFO - Major frequency ns corresponds to 99.965% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 15:43:17,627 - INFO - Defined frequency is equal to major frequency - ns WARNING - (NP.data.splitting._make_future_dataframe) - Number of forecast steps is defined by n_forecasts. Adjusted to 24. 2024-05-27 15:43:17,632 - WARNING - Number of forecast steps is defined by n_forecasts. Adjusted to 24. INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column 2024-05-27 15:43:17,634 - INFO - Returning df with no ID column INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.966% of the data. 2024-05-27 15:43:17,638 - INFO - Major frequency ns corresponds to 99.966% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 15:43:17,640 - INFO - Defined frequency is equal to major frequency - ns INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.966% of the data. 2024-05-27 15:43:17,645 - INFO - Major frequency ns corresponds to 99.966% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 15:43:17,646 - INFO - Defined frequency is equal to major frequency - ns INFO - (NP.data.processing._handle_missing_data) - Dropped 24 rows at the end with NaNs in 'y' column. 2024-05-27 15:43:17,654 - INFO - Dropped 24 rows at the end with NaNs in 'y' column.
INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column 2024-05-27 15:43:17,725 - INFO - Returning df with no ID column 2024-05-27 15:43:17,726 - WARNING - NaN values found in predictions for fold 4. Replacing NaNs with mean value. 2024-05-27 15:43:17,730 - INFO - Processing fold 5/5 for NeuralProphet 2024-05-27 15:43:17,733 - INFO - NeuralProphet model generated for 168 periods. 2024-05-27 15:43:17,734 - INFO - NeuralProphet model generated for 2895 periods. WARNING - (NP.forecaster.fit) - When Global modeling with local normalization, metrics are displayed in normalized scale. 2024-05-27 15:43:17,737 - WARNING - When Global modeling with local normalization, metrics are displayed in normalized scale. INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.993% of the data. 2024-05-27 15:43:17,749 - INFO - Major frequency ns corresponds to 99.993% of the data. INFO - (NP.df_utils._infer_frequency) - Dataframe freq automatically defined as ns 2024-05-27 15:43:17,749 - INFO - Dataframe freq automatically defined as ns INFO - (NP.config.init_data_params) - Setting normalization to global as only one dataframe provided for training. 2024-05-27 15:43:17,756 - INFO - Setting normalization to global as only one dataframe provided for training. INFO - (NP.config.set_auto_batch_epoch) - Auto-set batch_size to 128 2024-05-27 15:43:18,046 - INFO - Auto-set batch_size to 128 INFO - (NP.config.set_auto_batch_epoch) - Auto-set epochs to 50 2024-05-27 15:43:18,047 - INFO - Auto-set epochs to 50 WARNING - (NP.config.set_lr_finder_args) - Learning rate finder: The number of batches (113) is too small than the required number for the learning rate finder (254). The results might not be optimal. 2024-05-27 15:43:18,058 - WARNING - Learning rate finder: The number of batches (113) is too small than the required number for the learning rate finder (254). The results might not be optimal.
INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.965% of the data. 2024-05-27 15:43:31,426 - INFO - Major frequency ns corresponds to 99.965% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 15:43:31,428 - INFO - Defined frequency is equal to major frequency - ns WARNING - (NP.data.splitting._make_future_dataframe) - Number of forecast steps is defined by n_forecasts. Adjusted to 24. 2024-05-27 15:43:31,432 - WARNING - Number of forecast steps is defined by n_forecasts. Adjusted to 24. INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column 2024-05-27 15:43:31,434 - INFO - Returning df with no ID column INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.966% of the data. 2024-05-27 15:43:31,437 - INFO - Major frequency ns corresponds to 99.966% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 15:43:31,439 - INFO - Defined frequency is equal to major frequency - ns INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.966% of the data. 2024-05-27 15:43:31,444 - INFO - Major frequency ns corresponds to 99.966% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 15:43:31,445 - INFO - Defined frequency is equal to major frequency - ns INFO - (NP.data.processing._handle_missing_data) - Dropped 24 rows at the end with NaNs in 'y' column. 2024-05-27 15:43:31,452 - INFO - Dropped 24 rows at the end with NaNs in 'y' column.
INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column 2024-05-27 15:43:31,531 - INFO - Returning df with no ID column 2024-05-27 15:43:31,532 - WARNING - NaN values found in predictions for fold 5. Replacing NaNs with mean value. 2024-05-27 15:43:31,535 - INFO - Adaptive cross-validation scores for NeuralProphet: {'RMSE': [8.570032171338521, 6.692839599941133, 7.437067945592089, 8.03364261804773, 9.605015479564647], 'MSE': [73.44545141777725, 44.794101910540185, 55.30997962735333, 64.53941371451279, 92.2563223626765], 'MAE': [6.661288236882677, 5.265659993248008, 5.5638740254272046, 6.409849611083598, 6.274725305703293], 'MAPE': [inf, inf, inf, inf, inf], 'R2': [-0.3558291547736976, 0.0007142356657612314, 0.022049134643454704, 0.155435659366308, 0.014179368513464685], 'Adjusted R2': [-0.36053036543518746, -0.00275069416896212, 0.018658181573563803, 0.15250721158325076, 0.01076112776628535]} 2024-05-27 15:43:31,535 - INFO - Saving model NeuralProphet for site PM10 and pollutant Penrose_PM10 2024-05-27 15:43:34,276 - INFO - Model saved at data/models/PM10_Penrose_PM10_NeuralProphet.joblib 2024-05-27 15:43:34,281 - INFO - Processing fold 1/5 for LinearRegression 2024-05-27 15:43:34,333 - INFO - Model LinearRegression fitted successfully 2024-05-27 15:43:34,334 - INFO - Processing fold 2/5 for LinearRegression 2024-05-27 15:43:34,430 - INFO - Model LinearRegression fitted successfully 2024-05-27 15:43:34,440 - INFO - Processing fold 3/5 for LinearRegression 2024-05-27 15:43:34,472 - INFO - Model LinearRegression fitted successfully 2024-05-27 15:43:34,473 - INFO - Processing fold 4/5 for LinearRegression 2024-05-27 15:43:34,537 - INFO - Model LinearRegression fitted successfully 2024-05-27 15:43:34,573 - INFO - Processing fold 5/5 for LinearRegression 2024-05-27 15:43:34,613 - INFO - Model LinearRegression fitted successfully 2024-05-27 15:43:34,617 - INFO - Adaptive cross-validation scores for LinearRegression: {'RMSE': [7.311149825661832, 
6.663508909574445, 7.117379139063047, 8.131121931920168, 9.08953188186281], 'MSE': [53.452911773275034, 44.40235098797802, 50.65708580916984, 66.11514387175318, 82.61958983140048], 'MAE': [5.600498533031222, 5.065769462536602, 5.262899806778296, 6.516682697640107, 5.884226542160715], 'MAPE': [inf, inf, inf, inf, inf], 'R2': [0.01324023760836135, 0.009453580878284962, 0.10431822182377837, 0.1348156161295796, 0.11715431381968211], 'Adjusted R2': [0.009818740512689828, 0.006018953904908586, 0.10121252911165546, 0.131815670277047, 0.11409312905484048]} 2024-05-27 15:43:34,618 - INFO - Saving model LinearRegression for site PM10 and pollutant Penrose_PM10 2024-05-27 15:43:34,622 - INFO - Model saved at data/models/PM10_Penrose_PM10_LinearRegression.joblib 2024-05-27 15:43:34,626 - INFO - Processing fold 1/5 for Ridge 2024-05-27 15:43:34,648 - INFO - Model Ridge fitted successfully 2024-05-27 15:43:34,650 - INFO - Processing fold 2/5 for Ridge 2024-05-27 15:43:34,672 - INFO - Model Ridge fitted successfully 2024-05-27 15:43:34,673 - INFO - Processing fold 3/5 for Ridge 2024-05-27 15:43:34,690 - INFO - Model Ridge fitted successfully 2024-05-27 15:43:34,691 - INFO - Processing fold 4/5 for Ridge 2024-05-27 15:43:34,704 - INFO - Model Ridge fitted successfully 2024-05-27 15:43:34,715 - INFO - Processing fold 5/5 for Ridge 2024-05-27 15:43:34,740 - INFO - Model Ridge fitted successfully 2024-05-27 15:43:34,742 - INFO - Adaptive cross-validation scores for Ridge: {'RMSE': [7.310226858063027, 6.662223514938412, 7.117570727054326, 8.131014034537479, 9.089548296408442], 'MSE': [53.43941671634604, 44.385222162998325, 50.65981305462064, 66.11338922984547, 82.61988823274159], 'MAE': [5.59817088444169, 5.065362218139482, 5.26300848274681, 6.516598076453972, 5.8841988741752544], 'MAPE': [inf, inf, inf, inf, inf], 'R2': [0.013489361158921187, 0.009835697947986777, 0.10427000065163883, 0.13483857741634298, 0.11715112520137383], 'Adjusted R2': [0.010068727875838457, 
0.006402395929775873, 0.1011641407371161, 0.13183871117992263, 0.11408992938029672]} 2024-05-27 15:43:34,744 - INFO - Saving model Ridge for site PM10 and pollutant Penrose_PM10 2024-05-27 15:43:34,749 - INFO - Model saved at data/models/PM10_Penrose_PM10_Ridge.joblib 2024-05-27 15:43:34,758 - INFO - Processing fold 1/5 for Lasso 2024-05-27 15:43:35,582 - INFO - Model Lasso fitted successfully 2024-05-27 15:43:35,583 - INFO - Processing fold 2/5 for Lasso 2024-05-27 15:43:36,244 - INFO - Model Lasso fitted successfully 2024-05-27 15:43:36,251 - INFO - Processing fold 3/5 for Lasso 2024-05-27 15:43:36,944 - INFO - Model Lasso fitted successfully 2024-05-27 15:43:36,946 - INFO - Processing fold 4/5 for Lasso 2024-05-27 15:43:37,777 - INFO - Model Lasso fitted successfully 2024-05-27 15:43:37,780 - INFO - Processing fold 5/5 for Lasso 2024-05-27 15:43:38,573 - INFO - Model Lasso fitted successfully 2024-05-27 15:43:38,576 - INFO - Adaptive cross-validation scores for Lasso: {'RMSE': [7.318229287456691, 6.403264553969968, 7.0146850013807205, 8.183326272179542, 9.135699532609902], 'MSE': [53.55647990378887, 41.00179694812821, 49.205805668595644, 66.96682887694394, 83.4610059501288], 'MAE': [5.599630955280757, 4.895255125972466, 5.23056881729309, 6.558601573848904, 5.912001295490395], 'MAPE': [inf, inf, inf, inf, inf], 'R2': [0.011328333084045683, 0.08531457815104448, 0.12997870260702227, 0.12367044539688066, 0.1081632187026792], 'Adjusted R2': [0.007900206638428608, 0.08214299208360709, 0.12696198520968183, 0.12063185470824289, 0.10507085815726558]} 2024-05-27 15:43:38,589 - INFO - Saving model Lasso for site PM10 and pollutant Penrose_PM10 2024-05-27 15:43:38,592 - INFO - Model saved at data/models/PM10_Penrose_PM10_Lasso.joblib 2024-05-27 15:43:38,606 - INFO - Processing fold 1/5 for RandomForest 2024-05-27 15:44:26,983 - INFO - Model RandomForest fitted successfully 2024-05-27 15:44:26,983 - INFO - Processing fold 2/5 for RandomForest 2024-05-27 15:45:17,361 - INFO - 
Model RandomForest fitted successfully 2024-05-27 15:45:17,362 - INFO - Processing fold 3/5 for RandomForest 2024-05-27 15:46:10,070 - INFO - Model RandomForest fitted successfully 2024-05-27 15:46:10,071 - INFO - Processing fold 4/5 for RandomForest 2024-05-27 15:47:05,934 - INFO - Model RandomForest fitted successfully 2024-05-27 15:47:05,934 - INFO - Processing fold 5/5 for RandomForest 2024-05-27 15:48:02,755 - INFO - Model RandomForest fitted successfully 2024-05-27 15:48:02,757 - INFO - Adaptive cross-validation scores for RandomForest: {'RMSE': [7.179473848300742, 6.166179691735669, 6.796646031878665, 8.026547947680184, 8.973503815847373], 'MSE': [51.54484473843427, 38.021771990773395, 46.194397282652, 64.42547195640897, 80.52377073302736], 'MAE': [5.54544916997377, 4.837310936600146, 5.0954494929153595, 6.435742794941803, 5.758020937364656], 'MAPE': [inf, inf, inf, inf, inf], 'R2': [0.04846383369444285, 0.15179423485211352, 0.18322423726129833, 0.15692670398950925, 0.1395495454322917], 'Adjusted R2': [0.04516447112056776, 0.1488531607704635, 0.18039214377052604, 0.15400342626409147, 0.13656601403642588]} 2024-05-27 15:48:02,757 - INFO - Saving model RandomForest for site PM10 and pollutant Penrose_PM10 2024-05-27 15:48:02,785 - INFO - Model saved at data/models/PM10_Penrose_PM10_RandomForest.joblib 2024-05-27 15:48:02,786 - INFO - Processing fold 1/5 for SVR 2024-05-27 15:52:38,770 - INFO - Model SVR fitted successfully 2024-05-27 15:52:38,771 - INFO - Processing fold 2/5 for SVR 2024-05-27 15:57:30,069 - INFO - Model SVR fitted successfully 2024-05-27 15:57:30,070 - INFO - Processing fold 3/5 for SVR 2024-05-27 16:02:36,015 - INFO - Model SVR fitted successfully 2024-05-27 16:02:36,015 - INFO - Processing fold 4/5 for SVR 2024-05-27 16:07:55,319 - INFO - Model SVR fitted successfully 2024-05-27 16:07:55,319 - INFO - Processing fold 5/5 for SVR 2024-05-27 16:13:28,022 - INFO - Model SVR fitted successfully 2024-05-27 16:13:28,023 - INFO - Adaptive 
cross-validation scores for SVR: {'RMSE': [7.090641699007668, 6.840042627882339, 7.226652887422557, 7.947068238327459, 9.006920830341643], 'MSE': [50.27719970370636, 46.786183151247535, 52.22451195529278, 63.155893584633105, 81.12462284404221], 'MAE': [5.46992090611325, 5.207019103276628, 5.314695181910045, 6.361521008930304, 5.889439612961629], 'MAPE': [inf, inf, inf, inf, inf], 'R2': [0.0718650118860199, -0.043725954902446906, 0.07660413177509329, 0.17354043750101422, 0.13312903795550213], 'Adjusted R2': [0.06864679070670654, -0.04734497693747608, 0.0734023430503189, 0.1706747663411703, 0.13012324405104825]} 2024-05-27 16:13:28,023 - INFO - Saving model SVR for site PM10 and pollutant Penrose_PM10 2024-05-27 16:13:28,030 - INFO - Model saved at data/models/PM10_Penrose_PM10_SVR.joblib 2024-05-27 16:13:28,031 - INFO - Processing fold 1/5 for XGBoost 2024-05-27 16:13:28,858 - INFO - Model XGBoost fitted successfully 2024-05-27 16:13:28,858 - INFO - Processing fold 2/5 for XGBoost 2024-05-27 16:13:29,666 - INFO - Model XGBoost fitted successfully 2024-05-27 16:13:29,666 - INFO - Processing fold 3/5 for XGBoost 2024-05-27 16:13:30,532 - INFO - Model XGBoost fitted successfully 2024-05-27 16:13:30,532 - INFO - Processing fold 4/5 for XGBoost 2024-05-27 16:13:31,418 - INFO - Model XGBoost fitted successfully 2024-05-27 16:13:31,418 - INFO - Processing fold 5/5 for XGBoost 2024-05-27 16:13:32,255 - INFO - Model XGBoost fitted successfully 2024-05-27 16:13:32,257 - INFO - Adaptive cross-validation scores for XGBoost: {'RMSE': [7.2851153510150235, 6.335226641549502, 6.868805046733421, 7.961225013969376, 8.769431024428629], 'MSE': [53.072905677594754, 40.13509659979858, 47.180482770030515, 63.38110372305169, 76.90292049221134], 'MAE': [5.656639694253398, 4.922500473339803, 5.141089522230536, 6.4055195906248725, 5.643279331297318], 'MAPE': [inf, inf, inf, inf, inf], 'R2': [0.020255285287547387, 0.10464929596186257, 0.1657889902734213, 0.17059333214150663, 
0.17824075682577256], 'Adjusted R2': [0.016858112212954968, 0.10154475121831841, 0.16289644169600592, 0.1677174421697365, 0.17539138358314355]} 2024-05-27 16:13:32,258 - INFO - Saving model XGBoost for site PM10 and pollutant Penrose_PM10 2024-05-27 16:13:32,261 - INFO - Model saved at data/models/PM10_Penrose_PM10_XGBoost.joblib 2024-05-27 16:13:32,262 - INFO - The best model based on average RMSE across all folds for Penrose_PM10: RandomForest 2024-05-27 16:13:32,262 - INFO - Markdown Table for Penrose_PM10:

| Target | Model | Metric | Fold1 | Fold2 | Fold3 | Fold4 | Fold5 | Training Time (s) |
|---|---|---|---|---|---|---|---|---|
| Penrose_PM10 | ARIMA | RMSE | 7.41 | 6.70 | 7.58 | 8.75 | 9.74 | 77.18 |
| Penrose_PM10 | ARIMA | MSE | 54.91 | 44.87 | 57.49 | 76.49 | 94.96 | 77.18 |
| Penrose_PM10 | ARIMA | MAE | 5.69 | 5.22 | 5.66 | 7.08 | 6.50 | 77.18 |
| Penrose_PM10 | ARIMA | MAPE | inf | inf | inf | inf | inf | 77.18 |
| Penrose_PM10 | ARIMA | R2 | -0.01 | -0.00 | -0.02 | -0.00 | -0.01 | 77.18 |
| Penrose_PM10 | ARIMA | Adjusted R2 | -0.02 | -0.00 | -0.02 | -0.00 | -0.02 | 77.18 |
| Penrose_PM10 | Prophet | RMSE | 8.30 | 8.23 | 7.59 | 8.87 | 9.87 | 10.69 |
| Penrose_PM10 | Prophet | MSE | 68.95 | 67.73 | 57.66 | 78.64 | 97.34 | 10.69 |
| Penrose_PM10 | Prophet | MAE | 6.83 | 6.80 | 5.79 | 7.12 | 6.93 | 10.69 |
| Penrose_PM10 | Prophet | MAPE | inf | inf | inf | inf | inf | 10.69 |
| Penrose_PM10 | Prophet | R2 | -0.27 | -0.51 | -0.02 | -0.03 | -0.04 | 10.69 |
| Penrose_PM10 | Prophet | Adjusted R2 | -0.28 | -0.52 | -0.02 | -0.03 | -0.04 | 10.69 |
| Penrose_PM10 | NeuralProphet | RMSE | 8.57 | 6.69 | 7.44 | 8.03 | 9.61 | 63.01 |
| Penrose_PM10 | NeuralProphet | MSE | 73.45 | 44.79 | 55.31 | 64.54 | 92.26 | 63.01 |
| Penrose_PM10 | NeuralProphet | MAE | 6.66 | 5.27 | 5.56 | 6.41 | 6.27 | 63.01 |
| Penrose_PM10 | NeuralProphet | MAPE | inf | inf | inf | inf | inf | 63.01 |
| Penrose_PM10 | NeuralProphet | R2 | -0.36 | 0.00 | 0.02 | 0.16 | 0.01 | 63.01 |
| Penrose_PM10 | NeuralProphet | Adjusted R2 | -0.36 | -0.00 | 0.02 | 0.15 | 0.01 | 63.01 |
| Penrose_PM10 | LinearRegression | RMSE | 7.31 | 6.66 | 7.12 | 8.13 | 9.09 | 0.35 |
| Penrose_PM10 | LinearRegression | MSE | 53.45 | 44.40 | 50.66 | 66.12 | 82.62 | 0.35 |
| Penrose_PM10 | LinearRegression | MAE | 5.60 | 5.07 | 5.26 | 6.52 | 5.88 | 0.35 |
| Penrose_PM10 | LinearRegression | MAPE | inf | inf | inf | inf | inf | 0.35 |
| Penrose_PM10 | LinearRegression | R2 | 0.01 | 0.01 | 0.10 | 0.13 | 0.12 | 0.35 |
| Penrose_PM10 | LinearRegression | Adjusted R2 | 0.01 | 0.01 | 0.10 | 0.13 | 0.11 | 0.35 |
| Penrose_PM10 | Ridge | RMSE | 7.31 | 6.66 | 7.12 | 8.13 | 9.09 | 0.13 |
| Penrose_PM10 | Ridge | MSE | 53.44 | 44.39 | 50.66 | 66.11 | 82.62 | 0.13 |
| Penrose_PM10 | Ridge | MAE | 5.60 | 5.07 | 5.26 | 6.52 | 5.88 | 0.13 |
| Penrose_PM10 | Ridge | MAPE | inf | inf | inf | inf | inf | 0.13 |
| Penrose_PM10 | Ridge | R2 | 0.01 | 0.01 | 0.10 | 0.13 | 0.12 | 0.13 |
| Penrose_PM10 | Ridge | Adjusted R2 | 0.01 | 0.01 | 0.10 | 0.13 | 0.11 | 0.13 |
| Penrose_PM10 | Lasso | RMSE | 7.32 | 6.40 | 7.01 | 8.18 | 9.14 | 3.84 |
| Penrose_PM10 | Lasso | MSE | 53.56 | 41.00 | 49.21 | 66.97 | 83.46 | 3.84 |
| Penrose_PM10 | Lasso | MAE | 5.60 | 4.90 | 5.23 | 6.56 | 5.91 | 3.84 |
| Penrose_PM10 | Lasso | MAPE | inf | inf | inf | inf | inf | 3.84 |
| Penrose_PM10 | Lasso | R2 | 0.01 | 0.09 | 0.13 | 0.12 | 0.11 | 3.84 |
| Penrose_PM10 | Lasso | Adjusted R2 | 0.01 | 0.08 | 0.13 | 0.12 | 0.11 | 3.84 |
| Penrose_PM10 | RandomForest | RMSE | 7.18 | 6.17 | 6.80 | 8.03 | 8.97 | 264.18 |
| Penrose_PM10 | RandomForest | MSE | 51.54 | 38.02 | 46.19 | 64.43 | 80.52 | 264.18 |
| Penrose_PM10 | RandomForest | MAE | 5.55 | 4.84 | 5.10 | 6.44 | 5.76 | 264.18 |
| Penrose_PM10 | RandomForest | MAPE | inf | inf | inf | inf | inf | 264.18 |
| Penrose_PM10 | RandomForest | R2 | 0.05 | 0.15 | 0.18 | 0.16 | 0.14 | 264.18 |
| Penrose_PM10 | RandomForest | Adjusted R2 | 0.05 | 0.15 | 0.18 | 0.15 | 0.14 | 264.18 |
| Penrose_PM10 | SVR | RMSE | 7.09 | 6.84 | 7.23 | 7.95 | 9.01 | 1525.25 |
| Penrose_PM10 | SVR | MSE | 50.28 | 46.79 | 52.22 | 63.16 | 81.12 | 1525.25 |
| Penrose_PM10 | SVR | MAE | 5.47 | 5.21 | 5.31 | 6.36 | 5.89 | 1525.25 |
| Penrose_PM10 | SVR | MAPE | inf | inf | inf | inf | inf | 1525.25 |
| Penrose_PM10 | SVR | R2 | 0.07 | -0.04 | 0.08 | 0.17 | 0.13 | 1525.25 |
| Penrose_PM10 | SVR | Adjusted R2 | 0.07 | -0.05 | 0.07 | 0.17 | 0.13 | 1525.25 |
| Penrose_PM10 | XGBoost | RMSE | 7.29 | 6.34 | 6.87 | 7.96 | 8.77 | 4.23 |
| Penrose_PM10 | XGBoost | MSE | 53.07 | 40.14 | 47.18 | 63.38 | 76.90 | 4.23 |
| Penrose_PM10 | XGBoost | MAE | 5.66 | 4.92 | 5.14 | 6.41 | 5.64 | 4.23 |
| Penrose_PM10 | XGBoost | MAPE | inf | inf | inf | inf | inf | 4.23 |
| Penrose_PM10 | XGBoost | R2 | 0.02 | 0.10 | 0.17 | 0.17 | 0.18 | 4.23 |
| Penrose_PM10 | XGBoost | Adjusted R2 | 0.02 | 0.10 | 0.16 | 0.17 | 0.18 | 4.23 |

2024-05-27 16:13:32,262 - INFO - Saving model RandomForest for site Penrose and pollutant PM10 2024-05-27 16:13:32,291 - INFO - Model saved at data/models/Penrose_PM10_RandomForest.joblib 2024-05-27 16:13:32,292 - INFO - 🛠️ Training models for Takapuna_PM10 ... 2024-05-27 16:13:32,292 - INFO - Selected ARIMA model for Takapuna_PM10: ARIMA(2, 0, 3) 2024-05-27 16:13:32,297 - INFO - Processing fold 1/5 for ARIMA 2024-05-27 16:13:58,693 - INFO - Model ARIMA fitted successfully 2024-05-27 16:13:58,730 - INFO - Processing fold 2/5 for ARIMA 2024-05-27 16:14:26,924 - INFO - Model ARIMA fitted successfully 2024-05-27 16:14:26,947 - INFO - Processing fold 3/5 for ARIMA 2024-05-27 16:15:00,762 - INFO - Model ARIMA fitted successfully 2024-05-27 16:15:00,776 - INFO - Processing fold 4/5 for ARIMA 2024-05-27 16:15:52,876 - INFO - Model ARIMA fitted successfully 2024-05-27 16:15:52,888 - INFO - Processing fold 5/5 for ARIMA 2024-05-27 16:16:52,054 - INFO - Model ARIMA fitted successfully 2024-05-27 16:16:52,079 - INFO - Adaptive cross-validation scores for ARIMA: {'RMSE': [12.108850055693233, 6.2813542043900545, 6.1645177208607596, 6.023449620969445, 6.807699043403266], 'MSE': [146.62424967126202, 39.45541064100862, 38.001278730806334, 36.28194533635695, 46.34476626555375], 'MAE': [4.667970942104083, 
4.1797524661169225, 4.57409957086901, 4.665973353141614, 4.241429849288792], 'MAPE': [nan, nan, nan, nan, nan], 'R2': [-0.004844925670708822, 0.0026077127141483913, -0.005444832281021439, -0.004997089854240633, -0.0005114301666717669], 'Adjusted R2': [-0.00833154928511104, -0.0008530516414582134, -0.008933537458956975, -0.008484241449848717, -0.003983017363641483]} 2024-05-27 16:16:52,088 - INFO - Saving model ARIMA for site PM10 and pollutant Takapuna_PM10 2024-05-27 16:16:52,150 - INFO - Model saved at data/models/PM10_Takapuna_PM10_ARIMA.joblib 2024-05-27 16:16:52,156 - INFO - Processing fold 1/5 for Prophet 2024-05-27 16:16:52,190 - DEBUG - input tempfile: /var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/h_kx6k8d.json 2024-05-27 16:16:52,330 - DEBUG - input tempfile: /var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/4nxw8uew.json 2024-05-27 16:16:52,331 - DEBUG - idx 0 2024-05-27 16:16:52,331 - DEBUG - running CmdStan, num_threads: None 2024-05-27 16:16:52,331 - DEBUG - CmdStan args: ['/Users/nnthanh/.pyenv/versions/3.11.7/lib/python3.11/site-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=96953', 'data', 'file=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/h_kx6k8d.json', 'init=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/4nxw8uew.json', 'output', 'file=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/prophet_modelfolrnnu8/prophet_model-20240527161652.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000'] 16:16:52 - cmdstanpy - INFO - Chain [1] start processing 2024-05-27 16:16:52,332 - INFO - Chain [1] start processing 16:16:52 - cmdstanpy - INFO - Chain [1] done processing 2024-05-27 16:16:52,688 - INFO - Chain [1] done processing 2024-05-27 16:16:53,177 - INFO - Processing fold 2/5 for Prophet 2024-05-27 16:16:53,206 - DEBUG - input tempfile: /var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/i_1qlpyc.json 2024-05-27 16:16:53,354 - DEBUG - input tempfile: 
/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/o5z4c9m2.json 2024-05-27 16:16:53,355 - DEBUG - idx 0 2024-05-27 16:16:53,355 - DEBUG - running CmdStan, num_threads: None 2024-05-27 16:16:53,355 - DEBUG - CmdStan args: ['/Users/nnthanh/.pyenv/versions/3.11.7/lib/python3.11/site-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=76458', 'data', 'file=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/i_1qlpyc.json', 'init=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/o5z4c9m2.json', 'output', 'file=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/prophet_modelc79eu_wq/prophet_model-20240527161653.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000'] 16:16:53 - cmdstanpy - INFO - Chain [1] start processing 2024-05-27 16:16:53,355 - INFO - Chain [1] start processing 16:16:53 - cmdstanpy - INFO - Chain [1] done processing 2024-05-27 16:16:53,678 - INFO - Chain [1] done processing 2024-05-27 16:16:54,150 - INFO - Processing fold 3/5 for Prophet 2024-05-27 16:16:54,185 - DEBUG - input tempfile: /var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/uaew8165.json 2024-05-27 16:16:54,408 - DEBUG - input tempfile: /var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/y660z5tv.json 2024-05-27 16:16:54,409 - DEBUG - idx 0 2024-05-27 16:16:54,409 - DEBUG - running CmdStan, num_threads: None 2024-05-27 16:16:54,409 - DEBUG - CmdStan args: ['/Users/nnthanh/.pyenv/versions/3.11.7/lib/python3.11/site-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=10936', 'data', 'file=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/uaew8165.json', 'init=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/y660z5tv.json', 'output', 'file=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/prophet_modelb9cceift/prophet_model-20240527161654.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000'] 16:16:54 - cmdstanpy - INFO - Chain [1] start processing 2024-05-27 
16:16:54,409 - INFO - Chain [1] start processing 16:16:55 - cmdstanpy - INFO - Chain [1] done processing 2024-05-27 16:16:55,467 - INFO - Chain [1] done processing 2024-05-27 16:16:55,983 - INFO - Processing fold 4/5 for Prophet 2024-05-27 16:16:56,026 - DEBUG - input tempfile: /var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/u8e5ep0p.json 2024-05-27 16:16:56,515 - DEBUG - input tempfile: /var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/4jumbwqt.json 2024-05-27 16:16:56,516 - DEBUG - idx 0 2024-05-27 16:16:56,516 - DEBUG - running CmdStan, num_threads: None 2024-05-27 16:16:56,516 - DEBUG - CmdStan args: ['/Users/nnthanh/.pyenv/versions/3.11.7/lib/python3.11/site-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=19283', 'data', 'file=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/u8e5ep0p.json', 'init=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/4jumbwqt.json', 'output', 'file=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/prophet_modelsqdnxeo1/prophet_model-20240527161656.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000'] 16:16:56 - cmdstanpy - INFO - Chain [1] start processing 2024-05-27 16:16:56,517 - INFO - Chain [1] start processing 16:16:57 - cmdstanpy - INFO - Chain [1] done processing 2024-05-27 16:16:57,432 - INFO - Chain [1] done processing 2024-05-27 16:16:57,900 - INFO - Processing fold 5/5 for Prophet 2024-05-27 16:16:57,948 - DEBUG - input tempfile: /var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/vq4k2d1c.json 2024-05-27 16:16:58,320 - DEBUG - input tempfile: /var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/nx0d3usq.json 2024-05-27 16:16:58,321 - DEBUG - idx 0 2024-05-27 16:16:58,321 - DEBUG - running CmdStan, num_threads: None 2024-05-27 16:16:58,321 - DEBUG - CmdStan args: ['/Users/nnthanh/.pyenv/versions/3.11.7/lib/python3.11/site-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=32882', 'data', 
'file=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/vq4k2d1c.json', 'init=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/nx0d3usq.json', 'output', 'file=/var/folders/zd/rybc_92d26z7jm9x3nmqcc4c0000gn/T/tmpi4acxo0v/prophet_modeldwcoigg4/prophet_model-20240527161658.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000'] 16:16:58 - cmdstanpy - INFO - Chain [1] start processing 2024-05-27 16:16:58,321 - INFO - Chain [1] start processing 16:17:00 - cmdstanpy - INFO - Chain [1] done processing 2024-05-27 16:17:00,093 - INFO - Chain [1] done processing 2024-05-27 16:17:00,569 - INFO - Adaptive cross-validation scores for Prophet: {'RMSE': [12.133532696830281, 6.384497005003979, 6.35512646905706, 6.157677365184172, 7.527638073606464], 'MSE': [147.22261570504952, 40.76180200690478, 40.387632437709655, 37.91699053370149, 56.66533496720963], 'MAE': [4.738515664540226, 4.203944967226124, 4.980142010826765, 4.6723635136700254, 5.354097130533008], 'MAPE': [53.12102905509746, 54.742575263145014, 82.77189540730757, 114.69192647107374, 125.18062310963558], 'R2': [-0.00894564621381888, -0.030416520245358125, -0.06858341821117153, -0.05028726517085458, -0.22331645830426994], 'Adjusted R2': [-0.012446498560154007, -0.03399187250158775, -0.07229120245201526, -0.053931565188796604, -0.22756113720192528]} 2024-05-27 16:17:00,570 - INFO - Saving model Prophet for site PM10 and pollutant Takapuna_PM10 2024-05-27 16:17:00,588 - INFO - Model saved at data/models/PM10_Takapuna_PM10_Prophet.joblib 2024-05-27 16:17:00,590 - INFO - Processing fold 1/5 for NeuralProphet 2024-05-27 16:17:00,593 - INFO - NeuralProphet model generated for 168 periods. 2024-05-27 16:17:00,593 - INFO - NeuralProphet model generated for 2893 periods. WARNING - (NP.forecaster.fit) - When Global modeling with local normalization, metrics are displayed in normalized scale. 
2024-05-27 16:17:00,594 - WARNING - When Global modeling with local normalization, metrics are displayed in normalized scale. INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.965% of the data. 2024-05-27 16:17:00,600 - INFO - Major frequency ns corresponds to 99.965% of the data. INFO - (NP.df_utils._infer_frequency) - Dataframe freq automatically defined as ns 2024-05-27 16:17:00,600 - INFO - Dataframe freq automatically defined as ns INFO - (NP.config.init_data_params) - Setting normalization to global as only one dataframe provided for training. 2024-05-27 16:17:00,603 - INFO - Setting normalization to global as only one dataframe provided for training. INFO - (NP.config.set_auto_batch_epoch) - Auto-set batch_size to 64 2024-05-27 16:17:00,682 - INFO - Auto-set batch_size to 64 INFO - (NP.config.set_auto_batch_epoch) - Auto-set epochs to 80 2024-05-27 16:17:00,682 - INFO - Auto-set epochs to 80 WARNING - (NP.config.set_lr_finder_args) - Learning rate finder: The number of batches (45) is too small than the required number for the learning rate finder (236). The results might not be optimal. 2024-05-27 16:17:00,709 - WARNING - Learning rate finder: The number of batches (45) is too small than the required number for the learning rate finder (236). The results might not be optimal.
INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.965% of the data. 2024-05-27 16:17:07,735 - INFO - Major frequency ns corresponds to 99.965% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 16:17:07,736 - INFO - Defined frequency is equal to major frequency - ns WARNING - (NP.data.splitting._make_future_dataframe) - Number of forecast steps is defined by n_forecasts. Adjusted to 24. 2024-05-27 16:17:07,742 - WARNING - Number of forecast steps is defined by n_forecasts. Adjusted to 24. INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column 2024-05-27 16:17:07,743 - INFO - Returning df with no ID column INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.966% of the data. 2024-05-27 16:17:07,747 - INFO - Major frequency ns corresponds to 99.966% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 16:17:07,748 - INFO - Defined frequency is equal to major frequency - ns INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.966% of the data. 2024-05-27 16:17:07,752 - INFO - Major frequency ns corresponds to 99.966% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 16:17:07,753 - INFO - Defined frequency is equal to major frequency - ns INFO - (NP.data.processing._handle_missing_data) - Dropped 24 rows at the end with NaNs in 'y' column. 2024-05-27 16:17:07,760 - INFO - Dropped 24 rows at the end with NaNs in 'y' column.
INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column 2024-05-27 16:17:07,817 - INFO - Returning df with no ID column 2024-05-27 16:17:07,818 - WARNING - NaN values found in predictions for fold 1. Replacing NaNs with mean value. 2024-05-27 16:17:07,820 - INFO - Processing fold 2/5 for NeuralProphet 2024-05-27 16:17:07,822 - INFO - NeuralProphet model generated for 168 periods. 2024-05-27 16:17:07,823 - INFO - NeuralProphet model generated for 2893 periods. WARNING - (NP.forecaster.fit) - When Global modeling with local normalization, metrics are displayed in normalized scale. 2024-05-27 16:17:07,825 - WARNING - When Global modeling with local normalization, metrics are displayed in normalized scale. INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.983% of the data. 2024-05-27 16:17:07,832 - INFO - Major frequency ns corresponds to 99.983% of the data. INFO - (NP.df_utils._infer_frequency) - Dataframe freq automatically defined as ns 2024-05-27 16:17:07,832 - INFO - Dataframe freq automatically defined as ns INFO - (NP.config.init_data_params) - Setting normalization to global as only one dataframe provided for training. 2024-05-27 16:17:07,836 - INFO - Setting normalization to global as only one dataframe provided for training. INFO - (NP.config.set_auto_batch_epoch) - Auto-set batch_size to 64 2024-05-27 16:17:07,885 - INFO - Auto-set batch_size to 64 INFO - (NP.config.set_auto_batch_epoch) - Auto-set epochs to 70 2024-05-27 16:17:07,886 - INFO - Auto-set epochs to 70 WARNING - (NP.config.set_lr_finder_args) - Learning rate finder: The number of batches (90) is too small than the required number for the learning rate finder (244). The results might not be optimal. 2024-05-27 16:17:07,897 - WARNING - Learning rate finder: The number of batches (90) is too small than the required number for the learning rate finder (244). The results might not be optimal.
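🎓 The warning above reports that NaN predictions for a fold are replaced with a mean value before scoring. The helper that does this is not shown in this log, so the following is only a minimal sketch of such a fallback. It assumes NumPy arrays and that the mean is taken over the fold's non-NaN predictions; the function name `fill_nan_with_mean` is hypothetical.

```python
import numpy as np


def fill_nan_with_mean(predictions):
    """Replace NaN entries with the mean of the non-NaN predictions.

    Hypothetical helper: the notebook's real implementation may use a
    different mean (for example, the training-target mean).
    """
    preds = np.asarray(predictions, dtype=float).copy()
    mask = np.isnan(preds)
    if mask.all():
        # No valid value to average; leave the array untouched.
        return preds
    if mask.any():
        preds[mask] = np.nanmean(preds)
    return preds


filled = fill_nan_with_mean([4.0, np.nan, 6.0])
print(filled)
```

Note that mean-filled predictions carry no information about the target, so a fold with many replaced values will tend to show R2 close to zero regardless of how well the model fits elsewhere.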
INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.965% of the data. 2024-05-27 16:17:19,737 - INFO - Major frequency ns corresponds to 99.965% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 16:17:19,738 - INFO - Defined frequency is equal to major frequency - ns WARNING - (NP.data.splitting._make_future_dataframe) - Number of forecast steps is defined by n_forecasts. Adjusted to 24. 2024-05-27 16:17:19,742 - WARNING - Number of forecast steps is defined by n_forecasts. Adjusted to 24. INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column 2024-05-27 16:17:19,744 - INFO - Returning df with no ID column INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.966% of the data. 2024-05-27 16:17:19,748 - INFO - Major frequency ns corresponds to 99.966% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 16:17:19,749 - INFO - Defined frequency is equal to major frequency - ns INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.966% of the data. 2024-05-27 16:17:19,753 - INFO - Major frequency ns corresponds to 99.966% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 16:17:19,755 - INFO - Defined frequency is equal to major frequency - ns INFO - (NP.data.processing._handle_missing_data) - Dropped 24 rows at the end with NaNs in 'y' column. 2024-05-27 16:17:19,761 - INFO - Dropped 24 rows at the end with NaNs in 'y' column.
INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column 2024-05-27 16:17:19,826 - INFO - Returning df with no ID column 2024-05-27 16:17:19,828 - WARNING - NaN values found in predictions for fold 2. Replacing NaNs with mean value. 2024-05-27 16:17:19,830 - INFO - Processing fold 3/5 for NeuralProphet 2024-05-27 16:17:19,833 - INFO - NeuralProphet model generated for 168 periods. 2024-05-27 16:17:19,833 - INFO - NeuralProphet model generated for 2893 periods. WARNING - (NP.forecaster.fit) - When Global modeling with local normalization, metrics are displayed in normalized scale. 2024-05-27 16:17:19,835 - WARNING - When Global modeling with local normalization, metrics are displayed in normalized scale. INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.988% of the data. 2024-05-27 16:17:19,844 - INFO - Major frequency ns corresponds to 99.988% of the data. INFO - (NP.df_utils._infer_frequency) - Dataframe freq automatically defined as ns 2024-05-27 16:17:19,844 - INFO - Dataframe freq automatically defined as ns INFO - (NP.config.init_data_params) - Setting normalization to global as only one dataframe provided for training. 2024-05-27 16:17:19,848 - INFO - Setting normalization to global as only one dataframe provided for training. INFO - (NP.config.set_auto_batch_epoch) - Auto-set batch_size to 64 2024-05-27 16:17:20,103 - INFO - Auto-set batch_size to 64 INFO - (NP.config.set_auto_batch_epoch) - Auto-set epochs to 60 2024-05-27 16:17:20,103 - INFO - Auto-set epochs to 60 WARNING - (NP.config.set_lr_finder_args) - Learning rate finder: The number of batches (135) is too small than the required number for the learning rate finder (248). The results might not be optimal. 2024-05-27 16:17:20,115 - WARNING - Learning rate finder: The number of batches (135) is too small than the required number for the learning rate finder (248). The results might not be optimal.
INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.965% of the data. 2024-05-27 16:17:35,222 - INFO - Major frequency ns corresponds to 99.965% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 16:17:35,223 - INFO - Defined frequency is equal to major frequency - ns WARNING - (NP.data.splitting._make_future_dataframe) - Number of forecast steps is defined by n_forecasts. Adjusted to 24. 2024-05-27 16:17:35,227 - WARNING - Number of forecast steps is defined by n_forecasts. Adjusted to 24. INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column 2024-05-27 16:17:35,229 - INFO - Returning df with no ID column INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.966% of the data. 2024-05-27 16:17:35,232 - INFO - Major frequency ns corresponds to 99.966% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 16:17:35,234 - INFO - Defined frequency is equal to major frequency - ns INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.966% of the data. 2024-05-27 16:17:35,238 - INFO - Major frequency ns corresponds to 99.966% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 16:17:35,239 - INFO - Defined frequency is equal to major frequency - ns INFO - (NP.data.processing._handle_missing_data) - Dropped 24 rows at the end with NaNs in 'y' column. 2024-05-27 16:17:35,246 - INFO - Dropped 24 rows at the end with NaNs in 'y' column.
INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column 2024-05-27 16:17:35,311 - INFO - Returning df with no ID column 2024-05-27 16:17:35,311 - WARNING - NaN values found in predictions for fold 3. Replacing NaNs with mean value. 2024-05-27 16:17:35,313 - INFO - Processing fold 4/5 for NeuralProphet 2024-05-27 16:17:35,316 - INFO - NeuralProphet model generated for 168 periods. 2024-05-27 16:17:35,316 - INFO - NeuralProphet model generated for 2893 periods. WARNING - (NP.forecaster.fit) - When Global modeling with local normalization, metrics are displayed in normalized scale. 2024-05-27 16:17:35,319 - WARNING - When Global modeling with local normalization, metrics are displayed in normalized scale. INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.991% of the data. 2024-05-27 16:17:35,329 - INFO - Major frequency ns corresponds to 99.991% of the data. INFO - (NP.df_utils._infer_frequency) - Dataframe freq automatically defined as ns 2024-05-27 16:17:35,329 - INFO - Dataframe freq automatically defined as ns INFO - (NP.config.init_data_params) - Setting normalization to global as only one dataframe provided for training. 2024-05-27 16:17:35,333 - INFO - Setting normalization to global as only one dataframe provided for training. INFO - (NP.config.set_auto_batch_epoch) - Auto-set batch_size to 128 2024-05-27 16:17:35,422 - INFO - Auto-set batch_size to 128 INFO - (NP.config.set_auto_batch_epoch) - Auto-set epochs to 50 2024-05-27 16:17:35,423 - INFO - Auto-set epochs to 50 WARNING - (NP.config.set_lr_finder_args) - Learning rate finder: The number of batches (91) is too small than the required number for the learning rate finder (251). The results might not be optimal. 2024-05-27 16:17:35,435 - WARNING - Learning rate finder: The number of batches (91) is too small than the required number for the learning rate finder (251). The results might not be optimal.
INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.965% of the data. 2024-05-27 16:17:46,213 - INFO - Major frequency ns corresponds to 99.965% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 16:17:46,214 - INFO - Defined frequency is equal to major frequency - ns WARNING - (NP.data.splitting._make_future_dataframe) - Number of forecast steps is defined by n_forecasts. Adjusted to 24. 2024-05-27 16:17:46,218 - WARNING - Number of forecast steps is defined by n_forecasts. Adjusted to 24. INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column 2024-05-27 16:17:46,219 - INFO - Returning df with no ID column INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.966% of the data. 2024-05-27 16:17:46,224 - INFO - Major frequency ns corresponds to 99.966% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 16:17:46,225 - INFO - Defined frequency is equal to major frequency - ns INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.966% of the data. 2024-05-27 16:17:46,229 - INFO - Major frequency ns corresponds to 99.966% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 16:17:46,230 - INFO - Defined frequency is equal to major frequency - ns INFO - (NP.data.processing._handle_missing_data) - Dropped 24 rows at the end with NaNs in 'y' column. 2024-05-27 16:17:46,237 - INFO - Dropped 24 rows at the end with NaNs in 'y' column.
INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column 2024-05-27 16:17:46,306 - INFO - Returning df with no ID column 2024-05-27 16:17:46,307 - WARNING - NaN values found in predictions for fold 4. Replacing NaNs with mean value. 2024-05-27 16:17:46,311 - INFO - Processing fold 5/5 for NeuralProphet 2024-05-27 16:17:46,315 - INFO - NeuralProphet model generated for 168 periods. 2024-05-27 16:17:46,315 - INFO - NeuralProphet model generated for 2893 periods. WARNING - (NP.forecaster.fit) - When Global modeling with local normalization, metrics are displayed in normalized scale. 2024-05-27 16:17:46,318 - WARNING - When Global modeling with local normalization, metrics are displayed in normalized scale. INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.993% of the data. 2024-05-27 16:17:46,328 - INFO - Major frequency ns corresponds to 99.993% of the data. INFO - (NP.df_utils._infer_frequency) - Dataframe freq automatically defined as ns 2024-05-27 16:17:46,329 - INFO - Dataframe freq automatically defined as ns INFO - (NP.config.init_data_params) - Setting normalization to global as only one dataframe provided for training. 2024-05-27 16:17:46,334 - INFO - Setting normalization to global as only one dataframe provided for training. INFO - (NP.config.set_auto_batch_epoch) - Auto-set batch_size to 128 2024-05-27 16:17:46,623 - INFO - Auto-set batch_size to 128 INFO - (NP.config.set_auto_batch_epoch) - Auto-set epochs to 50 2024-05-27 16:17:46,624 - INFO - Auto-set epochs to 50 WARNING - (NP.config.set_lr_finder_args) - Learning rate finder: The number of batches (113) is too small than the required number for the learning rate finder (254). The results might not be optimal. 2024-05-27 16:17:46,635 - WARNING - Learning rate finder: The number of batches (113) is too small than the required number for the learning rate finder (254). The results might not be optimal.
INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.965% of the data. 2024-05-27 16:17:59,853 - INFO - Major frequency ns corresponds to 99.965% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 16:17:59,854 - INFO - Defined frequency is equal to major frequency - ns WARNING - (NP.data.splitting._make_future_dataframe) - Number of forecast steps is defined by n_forecasts. Adjusted to 24. 2024-05-27 16:17:59,858 - WARNING - Number of forecast steps is defined by n_forecasts. Adjusted to 24. INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column 2024-05-27 16:17:59,860 - INFO - Returning df with no ID column INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.966% of the data. 2024-05-27 16:17:59,863 - INFO - Major frequency ns corresponds to 99.966% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 16:17:59,864 - INFO - Defined frequency is equal to major frequency - ns INFO - (NP.df_utils._infer_frequency) - Major frequency ns corresponds to 99.966% of the data. 2024-05-27 16:17:59,869 - INFO - Major frequency ns corresponds to 99.966% of the data. INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - ns 2024-05-27 16:17:59,870 - INFO - Defined frequency is equal to major frequency - ns INFO - (NP.data.processing._handle_missing_data) - Dropped 24 rows at the end with NaNs in 'y' column. 2024-05-27 16:17:59,876 - INFO - Dropped 24 rows at the end with NaNs in 'y' column.
INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column 2024-05-27 16:17:59,948 - INFO - Returning df with no ID column 2024-05-27 16:17:59,949 - WARNING - NaN values found in predictions for fold 5. Replacing NaNs with mean value. 2024-05-27 16:17:59,952 - INFO - Adaptive cross-validation scores for NeuralProphet: {'RMSE': [12.199001372150322, 6.624936913503251, 5.870707275150766, 5.706169023627084, 6.794263979303493], 'MSE': [148.81563447772544, 43.88978910789798, 34.46520391050813, 32.560364926201274, 46.16202302046094], 'MAE': [4.798916708811841, 4.7121500250628126, 4.391863566360272, 4.343161025971968, 4.138523135386877], 'MAPE': [57.78581650374562, 68.78271369289723, 62.5262445439582, 112.82794784132864, 86.7026156414427], 'R2': [-0.019862918314515143, -0.10948882385528713, 0.08811328663947937, 0.09808937497533665, 0.0034337122999760217], 'Adjusted R2': [-0.023401651549471936, -0.11333854218927497, 0.08494921060422422, 0.09495991409738846, -2.4185991835290466e-05]} 2024-05-27 16:17:59,952 - INFO - Saving model NeuralProphet for site PM10 and pollutant Takapuna_PM10 2024-05-27 16:18:02,937 - INFO - Model saved at data/models/PM10_Takapuna_PM10_NeuralProphet.joblib 2024-05-27 16:18:02,939 - INFO - Processing fold 1/5 for LinearRegression 2024-05-27 16:18:02,980 - INFO - Model LinearRegression fitted successfully 2024-05-27 16:18:03,082 - INFO - Processing fold 2/5 for LinearRegression 2024-05-27 16:18:03,179 - INFO - Model LinearRegression fitted successfully 2024-05-27 16:18:03,180 - INFO - Processing fold 3/5 for LinearRegression 2024-05-27 16:18:03,237 - INFO - Model LinearRegression fitted successfully 2024-05-27 16:18:03,238 - INFO - Processing fold 4/5 for LinearRegression 2024-05-27 16:18:03,263 - INFO - Model LinearRegression fitted successfully 2024-05-27 16:18:03,264 - INFO - Processing fold 5/5 for LinearRegression 2024-05-27 16:18:03,304 - INFO - Model LinearRegression fitted successfully 2024-05-27 16:18:03,315 - INFO - 
Adaptive cross-validation scores for LinearRegression: {'RMSE': [12.002261885907796, 6.931788409559008, 5.831531118308314, 5.833399977697803, 6.60269138771953], 'MSE': [144.05429037791495, 48.0496905548966, 34.00675518379822, 34.02855529980472, 43.59553356146566], 'MAE': [4.801525038860148, 4.495592172500508, 4.500608832483857, 4.486716822112309, 4.029077705512591], 'MAPE': [56.2681463668313, 45.92248237542253, 67.52815717083507, 119.78370655648072, 81.25366287793379], 'R2': [0.012767512659853653, -0.21464686306203884, 0.10024300749444137, 0.057421019429778286, 0.05884022841904879], 'Adjusted R2': [0.009342000906418013, -0.21886146008862473, 0.09712101931780859, 0.054150446978112, 0.05557458035665819]} 2024-05-27 16:18:03,316 - INFO - Saving model LinearRegression for site PM10 and pollutant Takapuna_PM10 2024-05-27 16:18:03,319 - INFO - Model saved at data/models/PM10_Takapuna_PM10_LinearRegression.joblib 2024-05-27 16:18:03,323 - INFO - Processing fold 1/5 for Ridge 2024-05-27 16:18:03,334 - INFO - Model Ridge fitted successfully 2024-05-27 16:18:03,336 - INFO - Processing fold 2/5 for Ridge 2024-05-27 16:18:03,363 - INFO - Model Ridge fitted successfully 2024-05-27 16:18:03,364 - INFO - Processing fold 3/5 for Ridge 2024-05-27 16:18:03,385 - INFO - Model Ridge fitted successfully 2024-05-27 16:18:03,386 - INFO - Processing fold 4/5 for Ridge 2024-05-27 16:18:03,429 - INFO - Model Ridge fitted successfully 2024-05-27 16:18:03,430 - INFO - Processing fold 5/5 for Ridge 2024-05-27 16:18:03,445 - INFO - Model Ridge fitted successfully 2024-05-27 16:18:03,453 - INFO - Adaptive cross-validation scores for Ridge: {'RMSE': [12.000721474766424, 6.929219493986933, 5.811674829968524, 5.833460434538396, 6.602598694107764], 'MSE': [144.01731591492, 48.01408279584852, 33.77556432928967, 34.029260641324896, 43.594309515433544], 'MAE': [4.794616598955739, 4.493098036769788, 4.477697068428124, 4.486814442545501, 4.0289635932245105], 'MAPE': [56.12030217115753, 45.91482319094319, 
66.77925005472729, 119.78731130077485, 81.25211705601542], 'R2': [0.013020905953276651, -0.21374673545811573, 0.10635989770708087, 0.05740148171843018, 0.05886665366929522], 'Adjusted R2': [0.009596273427091018, -0.21795820921057274, 0.10325913399336506, 0.05413084147456626, 0.05560109729757179]} 2024-05-27 16:18:03,454 - INFO - Saving model Ridge for site PM10 and pollutant Takapuna_PM10 2024-05-27 16:18:03,457 - INFO - Model saved at data/models/PM10_Takapuna_PM10_Ridge.joblib 2024-05-27 16:18:03,458 - INFO - Processing fold 1/5 for Lasso 2024-05-27 16:18:03,931 - INFO - Model Lasso fitted successfully 2024-05-27 16:18:03,934 - INFO - Processing fold 2/5 for Lasso 2024-05-27 16:18:04,401 - INFO - Model Lasso fitted successfully 2024-05-27 16:18:04,402 - INFO - Processing fold 3/5 for Lasso 2024-05-27 16:18:04,944 - INFO - Model Lasso fitted successfully 2024-05-27 16:18:04,945 - INFO - Processing fold 4/5 for Lasso 2024-05-27 16:18:05,602 - INFO - Model Lasso fitted successfully 2024-05-27 16:18:05,603 - INFO - Processing fold 5/5 for Lasso 2024-05-27 16:18:06,162 - INFO - Model Lasso fitted successfully 2024-05-27 16:18:06,170 - INFO - Adaptive cross-validation scores for Lasso: {'RMSE': [11.997702784817195, 6.59551711622235, 5.793897158582508, 5.83229170523209, 6.5858619496639985], 'MSE': [143.94487211281026, 43.500846030381986, 33.569244284230464, 34.01562653491904, 43.37357762003209], 'MAE': [4.762029967862066, 4.174074197058208, 4.447814541693742, 4.48614365201833, 4.0113707343539], 'MAPE': [55.672223401409745, 45.64815622459241, 66.22984811021696, 119.28604408736705, 82.66333495243691], 'R2': [0.013517377629070237, -0.09965674203418229, 0.11181875146224673, 0.057779141657378186, 0.0636319120176938], 'Adjusted R2': [0.010094467766575743, -0.10347234488648693, 0.1087369289482365, 0.054509811822740395, 0.060382890199573325]} 2024-05-27 16:18:06,177 - INFO - Saving model Lasso for site PM10 and pollutant Takapuna_PM10 2024-05-27 16:18:06,180 - INFO - Model 
saved at data/models/PM10_Takapuna_PM10_Lasso.joblib 2024-05-27 16:18:06,203 - INFO - Processing fold 1/5 for RandomForest 2024-05-27 16:18:59,281 - INFO - Model RandomForest fitted successfully 2024-05-27 16:18:59,281 - INFO - Processing fold 2/5 for RandomForest 2024-05-27 16:19:54,436 - INFO - Model RandomForest fitted successfully 2024-05-27 16:19:54,436 - INFO - Processing fold 3/5 for RandomForest 2024-05-27 16:20:52,087 - INFO - Model RandomForest fitted successfully 2024-05-27 16:20:52,088 - INFO - Processing fold 4/5 for RandomForest 2024-05-27 16:21:52,062 - INFO - Model RandomForest fitted successfully 2024-05-27 16:21:52,063 - INFO - Processing fold 5/5 for RandomForest 2024-05-27 16:22:59,041 - INFO - Model RandomForest fitted successfully 2024-05-27 16:22:59,043 - INFO - Adaptive cross-validation scores for RandomForest: {'RMSE': [12.233679188867635, 6.7719723518047665, 5.397060061224332, 6.339233415644871, 6.578151333445769], 'MSE': [149.66290649613308, 45.859609533608186, 29.128257304462792, 40.185880298028536, 43.272074965714346], 'MAE': [4.9955637837229, 4.45676997150433, 3.9086405522089693, 4.4864789129149765, 4.070019800630096], 'MAPE': [56.59643136129198, 55.16208234760819, 56.09501127005251, 111.83407389751085, 86.90330485618014], 'R2': [-0.025669440702650315, -0.15928386255904003, 0.2293192029777098, -0.11313471144777232, 0.06582319647158985], 'Adjusted R2': [-0.029228321482326347, -0.16330636034723933, 0.2266450850144126, -0.11699708032857647, 0.0625817779999438]} 2024-05-27 16:22:59,043 - INFO - Saving model RandomForest for site PM10 and pollutant Takapuna_PM10 2024-05-27 16:22:59,073 - INFO - Model saved at data/models/PM10_Takapuna_PM10_RandomForest.joblib 2024-05-27 16:22:59,076 - INFO - Processing fold 1/5 for SVR 2024-05-27 16:28:08,305 - INFO - Model SVR fitted successfully 2024-05-27 16:28:08,305 - INFO - Processing fold 2/5 for SVR 2024-05-27 16:33:15,162 - INFO - Model SVR fitted successfully 2024-05-27 16:33:15,162 - INFO - 
Processing fold 3/5 for SVR 2024-05-27 16:38:31,019 - INFO - Model SVR fitted successfully 2024-05-27 16:38:31,020 - INFO - Processing fold 4/5 for SVR 2024-05-27 16:44:00,082 - INFO - Model SVR fitted successfully 2024-05-27 16:44:00,082 - INFO - Processing fold 5/5 for SVR 2024-05-27 16:49:43,898 - INFO - Model SVR fitted successfully 2024-05-27 16:49:43,899 - INFO - Adaptive cross-validation scores for SVR: {'RMSE': [12.176460355507574, 6.305417502485727, 5.387322859517003, 5.394595549200109, 6.598607984086225], 'MSE': [148.26618678924766, 39.758289880653344, 29.02324759267445, 29.10166113944963, 43.54162732764648], 'MAE': [5.253452201562217, 3.9750577761576174, 3.935181552166319, 4.05087451887574, 4.020962139568618], 'MAPE': [66.50786131330248, 47.19145223395686, 47.67829882081735, 99.70806864702669, 76.31746295366867], 'R2': [-0.016097444847975595, -0.005048763614263541, 0.23209756927440284, 0.19389425004828575, 0.060003980174409555], 'Adjusted R2': [-0.019623112595539727, -0.008536094508136749, 0.2294330917215729, 0.19109721413589253, 0.056742370112558116]} 2024-05-27 16:49:43,900 - INFO - Saving model SVR for site PM10 and pollutant Takapuna_PM10 2024-05-27 16:49:43,906 - INFO - Model saved at data/models/PM10_Takapuna_PM10_SVR.joblib 2024-05-27 16:49:43,908 - INFO - Processing fold 1/5 for XGBoost 2024-05-27 16:49:44,612 - INFO - Model XGBoost fitted successfully 2024-05-27 16:49:44,613 - INFO - Processing fold 2/5 for XGBoost 2024-05-27 16:49:45,278 - INFO - Model XGBoost fitted successfully 2024-05-27 16:49:45,278 - INFO - Processing fold 3/5 for XGBoost 2024-05-27 16:49:45,931 - INFO - Model XGBoost fitted successfully 2024-05-27 16:49:45,931 - INFO - Processing fold 4/5 for XGBoost 2024-05-27 16:49:46,662 - INFO - Model XGBoost fitted successfully 2024-05-27 16:49:46,663 - INFO - Processing fold 5/5 for XGBoost 2024-05-27 16:49:47,384 - INFO - Model XGBoost fitted successfully 2024-05-27 16:49:47,385 - INFO - Adaptive cross-validation scores for 
XGBoost: {'RMSE': [12.584823529636495, 6.961840994122229, 5.061416466118547, 6.5884861437967865, 6.605392648374692], 'MSE': [158.37778327209236, 48.46723002744078, 25.617936643495955, 43.408149667002256, 43.631212039202424], 'MAE': [5.310938520109562, 4.585897446502833, 3.8500744459939202, 4.468445263168259, 4.133103406971838], 'MAPE': [57.62567583274149, 54.95112525038972, 54.9510792383809, 105.49447048022773, 83.73772926242341], 'R2': [-0.0853942115083115, -0.22520183240051828, 0.3221959136068502, -0.2023904365342657, 0.058069985570437965], 'Adjusted R2': [-0.08916032605205992, -0.22945305319302522, 0.31984406042713764, -0.20656250605728532, 0.05480166490968319]}
2024-05-27 16:49:47,385 - INFO - Saving model XGBoost for site PM10 and pollutant Takapuna_PM10
2024-05-27 16:49:47,389 - INFO - Model saved at data/models/PM10_Takapuna_PM10_XGBoost.joblib
2024-05-27 16:49:47,389 - INFO - The best model based on average RMSE across all folds for Takapuna_PM10: SVR
2024-05-27 16:49:47,390 - INFO - Markdown Table for Takapuna_PM10:

| Target | Model | Metric | Fold1 | Fold2 | Fold3 | Fold4 | Fold5 | Training Time |
|---|---|---|---|---|---|---|---|---|
| Takapuna_PM10 | ARIMA | RMSE | 12.11 | 6.28 | 6.16 | 6.02 | 6.81 | 199.86 |
| Takapuna_PM10 | ARIMA | MSE | 146.62 | 39.46 | 38.00 | 36.28 | 46.34 | 199.86 |
| Takapuna_PM10 | ARIMA | MAE | 4.67 | 4.18 | 4.57 | 4.67 | 4.24 | 199.86 |
| Takapuna_PM10 | ARIMA | MAPE | nan | nan | nan | nan | nan | 199.86 |
| Takapuna_PM10 | ARIMA | R2 | -0.00 | 0.00 | -0.01 | -0.00 | -0.00 | 199.86 |
| Takapuna_PM10 | ARIMA | Adjusted R2 | -0.01 | -0.00 | -0.01 | -0.01 | -0.00 | 199.86 |
| Takapuna_PM10 | Prophet | RMSE | 12.13 | 6.38 | 6.36 | 6.16 | 7.53 | 8.43 |
| Takapuna_PM10 | Prophet | MSE | 147.22 | 40.76 | 40.39 | 37.92 | 56.67 | 8.43 |
| Takapuna_PM10 | Prophet | MAE | 4.74 | 4.20 | 4.98 | 4.67 | 5.35 | 8.43 |
| Takapuna_PM10 | Prophet | MAPE | 53.12 | 54.74 | 82.77 | 114.69 | 125.18 | 8.43 |
| Takapuna_PM10 | Prophet | R2 | -0.01 | -0.03 | -0.07 | -0.05 | -0.22 | 8.43 |
| Takapuna_PM10 | Prophet | Adjusted R2 | -0.01 | -0.03 | -0.07 | -0.05 | -0.23 | 8.43 |
| Takapuna_PM10 | NeuralProphet | RMSE | 12.20 | 6.62 | 5.87 | 5.71 | 6.79 | 62.35 |
| Takapuna_PM10 | NeuralProphet | MSE | 148.82 | 43.89 | 34.47 | 32.56 | 46.16 | 62.35 |
| Takapuna_PM10 | NeuralProphet | MAE | 4.80 | 4.71 | 4.39 | 4.34 | 4.14 | 62.35 |
| Takapuna_PM10 | NeuralProphet | MAPE | 57.79 | 68.78 | 62.53 | 112.83 | 86.70 | 62.35 |
| Takapuna_PM10 | NeuralProphet | R2 | -0.02 | -0.11 | 0.09 | 0.10 | 0.00 | 62.35 |
| Takapuna_PM10 | NeuralProphet | Adjusted R2 | -0.02 | -0.11 | 0.08 | 0.09 | -0.00 | 62.35 |
| Takapuna_PM10 | LinearRegression | RMSE | 12.00 | 6.93 | 5.83 | 5.83 | 6.60 | 0.38 |
| Takapuna_PM10 | LinearRegression | MSE | 144.05 | 48.05 | 34.01 | 34.03 | 43.60 | 0.38 |
| Takapuna_PM10 | LinearRegression | MAE | 4.80 | 4.50 | 4.50 | 4.49 | 4.03 | 0.38 |
| Takapuna_PM10 | LinearRegression | MAPE | 56.27 | 45.92 | 67.53 | 119.78 | 81.25 | 0.38 |
| Takapuna_PM10 | LinearRegression | R2 | 0.01 | -0.21 | 0.10 | 0.06 | 0.06 | 0.38 |
| Takapuna_PM10 | LinearRegression | Adjusted R2 | 0.01 | -0.22 | 0.10 | 0.05 | 0.06 | 0.38 |
| Takapuna_PM10 | Ridge | RMSE | 12.00 | 6.93 | 5.81 | 5.83 | 6.60 | 0.14 |
| Takapuna_PM10 | Ridge | MSE | 144.02 | 48.01 | 33.78 | 34.03 | 43.59 | 0.14 |
| Takapuna_PM10 | Ridge | MAE | 4.79 | 4.49 | 4.48 | 4.49 | 4.03 | 0.14 |
| Takapuna_PM10 | Ridge | MAPE | 56.12 | 45.91 | 66.78 | 119.79 | 81.25 | 0.14 |
| Takapuna_PM10 | Ridge | R2 | 0.01 | -0.21 | 0.11 | 0.06 | 0.06 | 0.14 |
| Takapuna_PM10 | Ridge | Adjusted R2 | 0.01 | -0.22 | 0.10 | 0.05 | 0.06 | 0.14 |
| Takapuna_PM10 | Lasso | RMSE | 12.00 | 6.60 | 5.79 | 5.83 | 6.59 | 2.73 |
| Takapuna_PM10 | Lasso | MSE | 143.94 | 43.50 | 33.57 | 34.02 | 43.37 | 2.73 |
| Takapuna_PM10 | Lasso | MAE | 4.76 | 4.17 | 4.45 | 4.49 | 4.01 | 2.73 |
| Takapuna_PM10 | Lasso | MAPE | 55.67 | 45.65 | 66.23 | 119.29 | 82.66 | 2.73 |
| Takapuna_PM10 | Lasso | R2 | 0.01 | -0.10 | 0.11 | 0.06 | 0.06 | 2.73 |
| Takapuna_PM10 | Lasso | Adjusted R2 | 0.01 | -0.10 | 0.11 | 0.05 | 0.06 | 2.73 |
| Takapuna_PM10 | RandomForest | RMSE | 12.23 | 6.77 | 5.40 | 6.34 | 6.58 | 292.89 |
| Takapuna_PM10 | RandomForest | MSE | 149.66 | 45.86 | 29.13 | 40.19 | 43.27 | 292.89 |
| Takapuna_PM10 | RandomForest | MAE | 5.00 | 4.46 | 3.91 | 4.49 | 4.07 | 292.89 |
| Takapuna_PM10 | RandomForest | MAPE | 56.60 | 55.16 | 56.10 | 111.83 | 86.90 | 292.89 |
| Takapuna_PM10 | RandomForest | R2 | -0.03 | -0.16 | 0.23 | -0.11 | 0.07 | 292.89 |
| Takapuna_PM10 | RandomForest | Adjusted R2 | -0.03 | -0.16 | 0.23 | -0.12 | 0.06 | 292.89 |
| Takapuna_PM10 | SVR | RMSE | 12.18 | 6.31 | 5.39 | 5.39 | 6.60 | 1604.83 |
| Takapuna_PM10 | SVR | MSE | 148.27 | 39.76 | 29.02 | 29.10 | 43.54 | 1604.83 |
| Takapuna_PM10 | SVR | MAE | 5.25 | 3.98 | 3.94 | 4.05 | 4.02 | 1604.83 |
| Takapuna_PM10 | SVR | MAPE | 66.51 | 47.19 | 47.68 | 99.71 | 76.32 | 1604.83 |
| Takapuna_PM10 | SVR | R2 | -0.02 | -0.01 | 0.23 | 0.19 | 0.06 | 1604.83 |
| Takapuna_PM10 | SVR | Adjusted R2 | -0.02 | -0.01 | 0.23 | 0.19 | 0.06 | 1604.83 |
| Takapuna_PM10 | XGBoost | RMSE | 12.58 | 6.96 | 5.06 | 6.59 | 6.61 | 3.48 |
| Takapuna_PM10 | XGBoost | MSE | 158.38 | 48.47 | 25.62 | 43.41 | 43.63 | 3.48 |
| Takapuna_PM10 | XGBoost | MAE | 5.31 | 4.59 | 3.85 | 4.47 | 4.13 | 3.48 |
| Takapuna_PM10 | XGBoost | MAPE | 57.63 | 54.95 | 54.95 | 105.49 | 83.74 | 3.48 |
| Takapuna_PM10 | XGBoost | R2 | -0.09 | -0.23 | 0.32 | -0.20 | 0.06 | 3.48 |
| Takapuna_PM10 | XGBoost | Adjusted R2 | -0.09 | -0.23 | 0.32 | -0.21 | 0.05 | 3.48 |

2024-05-27 16:49:47,390 - INFO - Saving model SVR for site Takapuna and pollutant PM10
2024-05-27 16:49:47,395 - INFO - Model saved at data/models/Takapuna_PM10_SVR.joblib
🛠️ Save Models & Evaluation Results
- evaluation_results --> 'data/source/evaluation_results.json'
- trained_models --> select_best_model() --> best_models
- best_models
In [46]:
trained_models
# evaluation_results
Out[46]:
{'Penrose_PM2.5_ARIMA': ARIMA(order=(0, 1, 4), suppress_warnings=True), 'Penrose_PM2.5_Prophet': <prophet.forecaster.Prophet at 0x328947d50>, 'Penrose_PM2.5_NeuralProphet': <neuralprophet.forecaster.NeuralProphet at 0x38275bbd0>, 'Penrose_PM2.5_LinearRegression': LinearRegression(), 'Penrose_PM2.5_Ridge': Ridge(), 'Penrose_PM2.5_Lasso': Lasso(alpha=0.1), 'Penrose_PM2.5_RandomForest': RandomForestRegressor(max_depth=10, min_samples_leaf=2, min_samples_split=5, n_estimators=200, random_state=42), 'Penrose_PM2.5_SVR': SVR(), 'Penrose_PM2.5_XGBoost': XGBRegressor(base_score=None, booster=None, callbacks=None, colsample_bylevel=None, colsample_bynode=None, colsample_bytree=0.8, device=None, early_stopping_rounds=None, enable_categorical=False, eval_metric=None, feature_types=None, gamma=None, grow_policy=None, importance_type=None, interaction_constraints=None, learning_rate=0.05, max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None, max_delta_step=None, max_depth=6, max_leaves=None, min_child_weight=None, missing=nan, monotone_constraints=None, multi_strategy=None, n_estimators=200, n_jobs=None, num_parallel_tree=None, random_state=42, ...), 'Takapuna_PM2.5_ARIMA': ARIMA(order=(0, 1, 2), suppress_warnings=True), 'Takapuna_PM2.5_Prophet': <prophet.forecaster.Prophet at 0x335f321d0>, 'Takapuna_PM2.5_NeuralProphet': <neuralprophet.forecaster.NeuralProphet at 0x383a98e50>, 'Takapuna_PM2.5_LinearRegression': LinearRegression(), 'Takapuna_PM2.5_Ridge': Ridge(), 'Takapuna_PM2.5_Lasso': Lasso(alpha=0.1), 'Takapuna_PM2.5_RandomForest': RandomForestRegressor(max_depth=10, min_samples_leaf=2, min_samples_split=5, n_estimators=200, random_state=42), 'Takapuna_PM2.5_SVR': SVR(), 'Takapuna_PM2.5_XGBoost': XGBRegressor(base_score=None, booster=None, callbacks=None, colsample_bylevel=None, colsample_bynode=None, colsample_bytree=0.8, device=None, early_stopping_rounds=None, enable_categorical=False, eval_metric=None, feature_types=None, gamma=None, grow_policy=None, 
importance_type=None, interaction_constraints=None, learning_rate=0.05, max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None, max_delta_step=None, max_depth=6, max_leaves=None, min_child_weight=None, missing=nan, monotone_constraints=None, multi_strategy=None, n_estimators=200, n_jobs=None, num_parallel_tree=None, random_state=42, ...), 'Penrose_PM10_ARIMA': ARIMA(order=(10, 0, 0), suppress_warnings=True), 'Penrose_PM10_Prophet': <prophet.forecaster.Prophet at 0x382a261d0>, 'Penrose_PM10_NeuralProphet': <neuralprophet.forecaster.NeuralProphet at 0x38122a350>, 'Penrose_PM10_LinearRegression': LinearRegression(), 'Penrose_PM10_Ridge': Ridge(), 'Penrose_PM10_Lasso': Lasso(alpha=0.1), 'Penrose_PM10_RandomForest': RandomForestRegressor(max_depth=10, min_samples_leaf=2, min_samples_split=5, n_estimators=200, random_state=42), 'Penrose_PM10_SVR': SVR(), 'Penrose_PM10_XGBoost': XGBRegressor(base_score=None, booster=None, callbacks=None, colsample_bylevel=None, colsample_bynode=None, colsample_bytree=0.8, device=None, early_stopping_rounds=None, enable_categorical=False, eval_metric=None, feature_types=None, gamma=None, grow_policy=None, importance_type=None, interaction_constraints=None, learning_rate=0.05, max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None, max_delta_step=None, max_depth=6, max_leaves=None, min_child_weight=None, missing=nan, monotone_constraints=None, multi_strategy=None, n_estimators=200, n_jobs=None, num_parallel_tree=None, random_state=42, ...), 'Takapuna_PM10_ARIMA': ARIMA(order=(2, 0, 3), suppress_warnings=True), 'Takapuna_PM10_Prophet': <prophet.forecaster.Prophet at 0x381292f10>, 'Takapuna_PM10_NeuralProphet': <neuralprophet.forecaster.NeuralProphet at 0x383709e10>, 'Takapuna_PM10_LinearRegression': LinearRegression(), 'Takapuna_PM10_Ridge': Ridge(), 'Takapuna_PM10_Lasso': Lasso(alpha=0.1), 'Takapuna_PM10_RandomForest': RandomForestRegressor(max_depth=10, min_samples_leaf=2, min_samples_split=5, n_estimators=200, 
random_state=42), 'Takapuna_PM10_SVR': SVR(), 'Takapuna_PM10_XGBoost': XGBRegressor(base_score=None, booster=None, callbacks=None, colsample_bylevel=None, colsample_bynode=None, colsample_bytree=0.8, device=None, early_stopping_rounds=None, enable_categorical=False, eval_metric=None, feature_types=None, gamma=None, grow_policy=None, importance_type=None, interaction_constraints=None, learning_rate=0.05, max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None, max_delta_step=None, max_depth=6, max_leaves=None, min_child_weight=None, missing=nan, monotone_constraints=None, multi_strategy=None, n_estimators=200, n_jobs=None, num_parallel_tree=None, random_state=42, ...)}
In [45]:
import json
## Save the evaluation_results to a JSON file
with open('data/source/evaluation_results.json', 'w') as f:
    json.dump(evaluation_results, f)
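The save step above can be checked with a quick round trip. The sketch below is illustrative only: `sample_results` is a hypothetical two-fold miniature of `evaluation_results` (values rounded from the Takapuna_PM10 table), and it writes to a temporary directory rather than `data/source/`. It also shows that Python's `json` module writes NaN scores, such as the ARIMA MAPE values, as the non-standard token `NaN`, which Python reads back fine but stricter JSON parsers may reject.

```python
import json
import math
import os
import tempfile

# Hypothetical miniature of evaluation_results: target -> model -> metric -> per-fold scores.
# The NaN entries mirror the ARIMA MAPE values reported in the Takapuna_PM10 table.
sample_results = {
    "Takapuna_PM10": {
        "ARIMA": {"RMSE": [12.11, 6.28], "MAPE": [float("nan"), float("nan")]},
        "SVR": {"RMSE": [12.18, 6.31], "MAPE": [66.51, 47.19]},
    }
}

# Use a temporary directory so the sketch does not overwrite the real results file.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "evaluation_results.json")

    # Save: by default (allow_nan=True) NaN is written as the bare token NaN.
    with open(path, "w") as f:
        json.dump(sample_results, f, indent=2)

    # Load the raw text back so the NaN token can be inspected.
    with open(path) as f:
        raw_text = f.read()

# Parse the text: Python turns the NaN token back into float('nan').
reloaded = json.loads(raw_text)

print("NaN token present in file:", "NaN" in raw_text)
print("Reloaded ARIMA MAPE is NaN:", math.isnan(reloaded["Takapuna_PM10"]["ARIMA"]["MAPE"][0]))
```

If the file has to be consumed outside Python, one option is to replace NaN with `None` before dumping so it is serialised as `null`.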
In [51]:
#### Step 7: Select the Best Model based on a key metric (RMSE by default)
import logging
import numpy as np

def select_best_model(evaluation_results, key_metric='RMSE'):
    """
    Select the best model for each target variable based on a specified key metric (by default, RMSE).

    Parameters:
        evaluation_results (dict): Dictionary containing evaluation metrics for each model and target variable.
        key_metric (str): The metric to be used as the key metric for selecting the best model.

    Returns:
        dict: Dictionary containing the best model for each target variable.
    """
    ## Lower is better for error and time metrics; higher is better for the rest (e.g. R2)
    lower_is_better = key_metric in ['RMSE', 'MSE', 'MAE', 'MAPE', 'Training Time']
    ## A model without the metric gets the worst possible score, so it is never selected
    worst_score = float('inf') if lower_is_better else float('-inf')
    ## Identify the best model for each target variable
    best_models = {}
    for target_var, models in evaluation_results.items():
        try:
            logging.debug(f"Selecting best model for {target_var} based on {key_metric} from {models.items()}")
            ## Calculate the mean of the metric values across folds
            metric_values = {model_name: np.mean(metrics.get(key_metric, [worst_score])) for model_name, metrics in models.items()}
            if lower_is_better:
                best_model = min(metric_values, key=metric_values.get)
            else:
                best_model = max(metric_values, key=metric_values.get)
            best_models[target_var] = best_model
            logging.info(f"The best model for {target_var} based on {key_metric} is {best_model} with {key_metric}: {metric_values[best_model]}")
        except (IndexError, ValueError, KeyError) as e:
            logging.error(f"Error selecting best model for {target_var}: {e}")
    return best_models
## Define key metrics to evaluate
# key_metrics = ['RMSE', 'MSE', 'MAE', 'R2', 'Adjusted R2', 'Training Time']
key_metrics = ['RMSE', 'MSE', 'MAE']
## Get best models for each key metric
best_models_per_metric = {metric: select_best_model(evaluation_results, key_metric=metric) for metric in key_metrics}
# ## [DEBUG] Save the best models to disk
# for target_var, model_name in best_models.items():
# model_filename = f"data/models/{target_var}_{model_name}.joblib"
# ## Save the model (assuming the model object is available in the current scope)
# # pickle.dump(models[model_name], open(model_filename, 'wb')) ## Uncomment this line when model objects are available
# logging.info(f"Based on RMSE, Model {model_name} for {target_var} saved at {model_filename}")
2024-05-27 18:28:14,988 - INFO - The best model for Penrose_PM2.5 based on RMSE is RandomForest with RMSE: 4.815107447063811
2024-05-27 18:28:14,992 - INFO - The best model for Takapuna_PM2.5 based on RMSE is RandomForest with RMSE: 1.7864808734156863
2024-05-27 18:28:14,994 - INFO - The best model for Penrose_PM10 based on RMSE is RandomForest with RMSE: 7.428470267088526
2024-05-27 18:28:14,996 - INFO - The best model for Takapuna_PM10 based on RMSE is SVR with RMSE: 7.172480850159327
2024-05-27 18:28:14,997 - INFO - The best model for Penrose_PM2.5 based on MSE is RandomForest with MSE: 23.84917466476086
2024-05-27 18:28:14,998 - INFO - The best model for Takapuna_PM2.5 based on MSE is RandomForest with MSE: 3.4550249132092192
2024-05-27 18:28:14,999 - INFO - The best model for Penrose_PM10 based on MSE is XGBoost with MSE: 56.13450185253737
2024-05-27 18:28:14,999 - INFO - The best model for Takapuna_PM10 based on MSE is SVR with MSE: 57.93820254593432
2024-05-27 18:28:15,000 - INFO - The best model for Penrose_PM2.5 based on MAE is Lasso with MAE: 3.3668081062393282
2024-05-27 18:28:15,001 - INFO - The best model for Takapuna_PM2.5 based on MAE is RandomForest with MAE: 1.2381142886578262
2024-05-27 18:28:15,002 - INFO - The best model for Penrose_PM10 based on MAE is RandomForest with MAE: 5.534394666359146
2024-05-27 18:28:15,002 - INFO - The best model for Takapuna_PM10 based on MAE is SVR with MAE: 4.2471056376661025
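The selection step above boils down to averaging each model's per-fold scores and taking the minimum for error metrics. As a worked check, the sketch below recomputes the mean RMSE for three of the nine Takapuna_PM10 models from the fold values printed in the cross-validation logs. It uses `statistics.fmean` as a stand-in for `np.mean` to stay dependency-free, and the variable names are illustrative rather than taken from the notebook.

```python
from statistics import fmean

# Per-fold RMSE values copied from the Takapuna_PM10 cross-validation logs
# (only three of the nine models, to keep the example short).
fold_rmse = {
    "Lasso": [11.997702784817195, 6.59551711622235, 5.793897158582508,
              5.83229170523209, 6.5858619496639985],
    "SVR": [12.176460355507574, 6.305417502485727, 5.387322859517003,
            5.394595549200109, 6.598607984086225],
    "XGBoost": [12.584823529636495, 6.961840994122229, 5.061416466118547,
                6.5884861437967865, 6.605392648374692],
}

# Average across folds, then pick the model with the lowest mean (lower RMSE is better).
mean_rmse = {model: fmean(scores) for model, scores in fold_rmse.items()}
best_model = min(mean_rmse, key=mean_rmse.get)

# Print the models from best to worst mean RMSE.
for model, value in sorted(mean_rmse.items(), key=lambda item: item[1]):
    print(f"{model}: mean RMSE = {value:.4f}")
print(f"Best of these three: {best_model}")
```

The SVR mean comes out at about 7.17, in line with the logged value for Takapuna_PM10. Fold 1 has an RMSE of roughly 12 for every model, about double the other folds, so it weighs heavily on each mean.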
In [53]:
## Print the results
for metric, best_models in best_models_per_metric.items():
    logging.info(f"\n\nBest Models based on {metric}:\n")
    for target_var, model_name in best_models.items():
        logging.info(f"Target Variable: {target_var}, Best Model: {model_name}")
2024-05-27 18:29:11,121 - INFO - Best Models based on RMSE:
2024-05-27 18:29:11,123 - INFO - Target Variable: Penrose_PM2.5, Best Model: RandomForest
2024-05-27 18:29:11,126 - INFO - Target Variable: Takapuna_PM2.5, Best Model: RandomForest
2024-05-27 18:29:11,126 - INFO - Target Variable: Penrose_PM10, Best Model: RandomForest
2024-05-27 18:29:11,127 - INFO - Target Variable: Takapuna_PM10, Best Model: SVR
2024-05-27 18:29:11,128 - INFO - Best Models based on MSE:
2024-05-27 18:29:11,129 - INFO - Target Variable: Penrose_PM2.5, Best Model: RandomForest
2024-05-27 18:29:11,130 - INFO - Target Variable: Takapuna_PM2.5, Best Model: RandomForest
2024-05-27 18:29:11,130 - INFO - Target Variable: Penrose_PM10, Best Model: XGBoost
2024-05-27 18:29:11,131 - INFO - Target Variable: Takapuna_PM10, Best Model: SVR
2024-05-27 18:29:11,132 - INFO - Best Models based on MAE:
2024-05-27 18:29:11,132 - INFO - Target Variable: Penrose_PM2.5, Best Model: Lasso
2024-05-27 18:29:11,133 - INFO - Target Variable: Takapuna_PM2.5, Best Model: RandomForest
2024-05-27 18:29:11,133 - INFO - Target Variable: Penrose_PM10, Best Model: RandomForest
2024-05-27 18:29:11,133 - INFO - Target Variable: Takapuna_PM10, Best Model: SVR
In [54]:
def generate_markdown_table(evaluation_results):
    """
    Generate a markdown table summarizing the best model results across multiple metrics for each target variable.
    Parameters:
        evaluation_results (dict): Dictionary containing evaluation metrics for each model and target variable.
    Returns:
        str: Markdown table as a string.
    """
    table = "| Target Variable | Best Model | RMSE | MSE | MAE | R2 | Adjusted R2 | Training Time |\n"
    table += "|-----------------|------------|------|-----|-----|----|-------------|---------------|\n"
    # best_models = select_best_model(evaluation_results)
    best_models = best_models_per_metric['RMSE']
    for target_var, best_model in best_models.items():
        metrics = evaluation_results[target_var][best_model]
        table += (f"| {target_var} | {best_model} | {np.mean(metrics['RMSE']):.3f} | {np.mean(metrics['MSE']):.3f} | "
                  f"{np.mean(metrics['MAE']):.3f} | {np.mean(metrics['R2']):.3f} | {np.mean(metrics['Adjusted R2']):.3f} | "
                  f"{np.mean(metrics['Training Time']):.3f} |\n")
    return table
# Generate and log the markdown table
markdown_table = generate_markdown_table(evaluation_results)
logging.info(f"\nMarkdown Table of Best Models:\n{markdown_table}\n")
2024-05-27 18:29:32,865 - INFO - Markdown Table of Best Models:
| Target Variable | Best Model | RMSE | MSE | MAE | R2 | Adjusted R2 | Training Time |
|-----------------|------------|------|-----|-----|----|-------------|---------------|
| Penrose_PM2.5 | RandomForest | 4.815 | 23.849 | 3.390 | 0.095 | 0.091 | 286.736 |
| Takapuna_PM2.5 | RandomForest | 1.786 | 3.455 | 1.238 | 0.508 | 0.506 | 308.734 |
| Penrose_PM10 | RandomForest | 7.428 | 56.142 | 5.534 | 0.136 | 0.133 | 264.185 |
| Takapuna_PM10 | SVR | 7.172 | 57.938 | 4.247 | 0.093 | 0.090 | 1604.832 |
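🎓 The RMSE and MSE rankings above disagree for Penrose_PM10 (RandomForest by RMSE, XGBoost by MSE). A likely reason, assuming `select_best_model` (defined earlier in the notebook and not shown in this section) averages each metric over the cross-validation folds as `generate_markdown_table` does, is that the mean of per-fold RMSE values is not the square root of the mean of per-fold MSE values. The sketch below illustrates the effect with made-up fold scores; `model_a`, `model_b`, and `best_by_mean` are hypothetical names used only for illustration.

```python
import math
from statistics import mean

# Hypothetical per-fold MSE values for two models (illustrative, not from the notebook)
fold_mse = {
    "model_a": [1.0, 9.0],    # uneven errors across folds
    "model_b": [4.41, 4.41],  # steady errors across folds
}

def best_by_mean(scores):
    """Return the model name with the lowest mean score (lower is better)."""
    return min(scores, key=lambda name: mean(scores[name]))

# Per-fold RMSE is the square root of the per-fold MSE
fold_rmse = {name: [math.sqrt(v) for v in values] for name, values in fold_mse.items()}

best_rmse = best_by_mean(fold_rmse)  # model_a: mean RMSE 2.0 vs about 2.1
best_mse = best_by_mean(fold_mse)    # model_b: mean MSE 4.41 vs 5.0
print(best_rmse, best_mse)
```

Averaging MSE penalizes a single bad fold more heavily than averaging RMSE, so a model with uneven fold errors can win on one metric and lose on the other.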
🛠️ Model Performance Visualization¶
In [58]:
def parse_evaluation_results(evaluation_results):
    """
    Parses the evaluation results into a DataFrame.
    Args:
        evaluation_results (dict): Dictionary containing model evaluation results.
    Returns:
        pd.DataFrame: Parsed data in a DataFrame.
    """
    logging.info("Parsing evaluation results into a DataFrame.")
    rows = []
    for target, models in evaluation_results.items():
        for model, metrics in models.items():
            for metric, values in metrics.items():
                if isinstance(values, list):
                    for fold, value in enumerate(values, start=1):
                        rows.append({"Target": target, "Model": model, "Metric": metric, "Fold": f"Fold{fold}", "Value": value})
                else:
                    rows.append({"Target": target, "Model": model, "Metric": metric, "Fold": "Training Time", "Value": values})
    df = pd.DataFrame(rows)
    logging.info("Finished parsing evaluation results.")
    return df
In [59]:
## @deprecated: Save evaluation results directly to a .csv file --> instead, parse the evaluation results into a DataFrame first (next cell)
# evaluation_df = pd.DataFrame(evaluation_results).transpose()
# evaluation_df.to_csv('data/source/evaluation_results.csv', index=True)
# logging.info(f"Evaluation results saved to evaluation_results.csv")
In [60]:
# trained_models
# evaluation_results
## Parse the evaluation results into a DataFrame
df = parse_evaluation_results(evaluation_results)
# df
## Save the DataFrame to a CSV or Parquet file or a pickle file
df.to_csv('data/source/evaluation_results.csv', index=False)
# df.to_pickle('evaluation_results_df.pkl')
# df.to_parquet('evaluation_results_df.parquet')
2024-05-27 18:30:56,982 - INFO - Parsing evaluation results into a DataFrame.
2024-05-27 18:30:56,988 - INFO - Finished parsing evaluation results.
In [61]:
## If using pickle file or Parquet
# df = pd.read_pickle('evaluation_results_df.pkl')
# df = pd.read_parquet('evaluation_results_df.parquet')
## Load the DataFrame from the *.csv file
df = pd.read_csv('data/source/evaluation_results.csv')
## Proceed with EDA
print(df.head())
          Target  Model Metric   Fold     Value
0  Penrose_PM2.5  ARIMA   RMSE  Fold1  4.232458
1  Penrose_PM2.5  ARIMA   RMSE  Fold2  4.594883
2  Penrose_PM2.5  ARIMA   RMSE  Fold3  7.315939
3  Penrose_PM2.5  ARIMA   RMSE  Fold4  6.560749
4  Penrose_PM2.5  ARIMA   RMSE  Fold5  4.390078
In [62]:
import plotly.express as px
import plotly.graph_objects as go
def create_visualization(df, default_metric='RMSE'):
    """
    Creates a polar bar plot visualization for model performance comparison.
    Args:
        df (pd.DataFrame): DataFrame containing parsed evaluation results.
        default_metric (str): The default metric to be displayed in the polar plot.
    """
    logging.debug(f"Creating the polar bar plot visualization for metric: {default_metric}")
    targets = df['Target'].unique()
    models = df['Model'].unique()
    metrics = df['Metric'].unique()
    colorscale = px.colors.sequential.Plasma
    fig = go.Figure()
    ## Iterate over each target: Add traces for the default metric
    for target in targets:
        ## Filter DataFrame for the current target and default metric
        target_df = df[(df['Target'] == target) & (df['Metric'] == default_metric)]
        for model in models:
            model_df = target_df[target_df['Model'] == model]
            mean_value = model_df['Value'].mean()
            # mean_value = model_df.groupby('Model')['Value'].mean().values
            if len(model_df) > 0:
                # hovertext = [f"Mean {default_metric}: {mean_value:.2f}"] + model_df.apply(lambda row: f"Fold {row['Fold']}: {row['Value']:.2f}", axis=1).tolist()
                # logging.debug(hovertext)
                fig.add_trace(
                    go.Barpolar(
                        r=[mean_value], ## Average/Mean value for the metric
                        theta=[model], ## Display the model names around the polar chart
                        name=f"{target} - {model}", ## Trace name; legend entries are grouped per target via legendgroup
                        legendgroup=target,
                        showlegend=bool(model == models[0]), ## Only show legend for the first model to avoid repetition
                        # text=hovertext, ## Hover text for additional info
                        # hoverinfo='text',
                        # text=model_df.apply(lambda row: f"{row['Model']} ({row['Fold']}): {row['Value']}", axis=1),
                        text=[
                            f"Model: {model}<br>"
                            f"Metric: {default_metric}<br>"
                            f"Average {default_metric}: {mean_value:.2f}<br>" +
                            "<br>".join([f"Fold {i+1}: {model_df.iloc[i]['Value']:.2f}" for i in range(len(model_df))])
                        ],
                        hoverinfo='text+r',
                    )
                )
    ## Set up the layout with 2/3 for the polar chart and 1/3 for the dropdown and legend
    fig.update_layout(
        title="Comparative Model Performance Across Multiple Metrics for Penrose and Takapuna PM2.5 and PM10",
        polar=dict(
            radialaxis=dict(visible=True, range=[0, df[df['Metric'] == default_metric]['Value'].max()])
        ),
        showlegend=True,
        # template="plotly_dark",
        # legend=dict(yanchor="top", y=1, xanchor="left", x=1.35),
        legend=dict(
            title="Targets",
            itemsizing='constant',
            yanchor="top",
            y=1,
            xanchor="left",
            x=1.2,
            font=dict(size=10), # Adjust font size for better readability
            bgcolor="rgba(255,255,255,0.7)" # Add a semi-transparent background for clarity
        ),
        margin=dict(l=60, r=30, t=40, b=30),
        width=1200, ## Adjust width to allow space for dropdown and legend
        height=800, ## Adjust height for better layout
        ## NOTE: the dropdown buttons only update the title; the bars always show `default_metric`
        updatemenus=[
            {
                "buttons": [
                    {
                        "label": metric,
                        "method": "update",
                        "args": [
                            {
                                # "visible": [
                                #     # trace.name.split(' - ')[0] == target and trace.name.split(' - ')[1] in models
                                #     # for trace in fig.data
                                #     # for target in targets
                                #     True ## Ensure all traces remain visible
                                #     for trace in fig.data
                                # ]
                                "visible": [True for _ in fig.data]
                            },
                            ## Layout updates go in the second argument of the "update" method
                            {"title": f"Comparative Model Performance for {metric}", "showlegend": True},
                        ]
                    }
                    for metric in metrics
                ],
                "direction": "down",
                "showactive": True,
                "xanchor": "left",
                "x": 0.01,
                "y": 1.2,
            }
        ],
        # autosize=False, ## Ensure layout respects the specified width and height
    )
    ## FIXME: Set visibility of traces: also show the remaining 3 target variables but deselect them
    # for trace in fig.data:
    #     trace.visible = (trace.name.split(' - ')[0] == targets[0])
    ## Ensure colors are unique by rounding values to zero/two decimal places --> converting rounded values to distinct integers
    decimal_place = 0
    rounded_values = sorted({round(v, decimal_place) for v in df[df['Metric'] == default_metric]['Value']})
    unique_colors = {v: i for i, v in enumerate(rounded_values)}
    logging.debug(f"Unique colors mapping: {unique_colors}")
    ## Map the value to a color in the Plasma colorscale
    for trace in fig.data:
        value = round(trace.r[0], decimal_place)
        if value in unique_colors:
            color_index = unique_colors[value]
        else:
            ## Fall back to the index of the nearest rounded value
            nearest_value = min(unique_colors, key=lambda k: abs(k - value))
            color_index = unique_colors[nearest_value]
        ## Clamp the index to the available colors
        color_index = max(0, min(color_index, len(colorscale) - 1))
        trace.marker.color = colorscale[color_index]
        trace.marker.colorscale = colorscale ## Apply gradient scale
        trace.marker.showscale = True ## Ensure gradient scale is shown
        # trace.marker.colorbar = dict(title='Value')
        trace.marker.colorbar = dict(title=f'{default_metric} Value')
        logging.debug(f"Value: {value}, Color Index: {color_index}, Color: {colorscale[color_index]}")
    ## Using a color scale to set colors properly
    # fig.update_traces(marker=dict(colorscale='Plasma', showscale=True))
    ## Update traces with color scale and color bar
    fig.update_traces(
        marker=dict(
            colorbar=dict(title=f'{default_metric} Value')
        )
    )
    fig.show()
    logging.debug("Visualization created successfully.")
create_visualization(df)
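🎓 The color-mapping step in `create_visualization` uses the rank of each rounded value as the color index and clamps it to the palette length, so once there are more distinct values than colors, the highest values all share the last color. One possible alternative, sketched here under the assumption that a linear mapping is acceptable, normalizes each value to the range of the metric before picking a color. The helper `value_to_color` and the four-color `palette` are hypothetical stand-ins for illustration, not part of the notebook or of Plotly.

```python
def value_to_color(value, vmin, vmax, palette):
    """Map a value in [vmin, vmax] linearly onto a color from the palette."""
    if vmax <= vmin:
        return palette[0]  # degenerate range: fall back to the first color
    fraction = (value - vmin) / (vmax - vmin)
    fraction = max(0.0, min(1.0, fraction))  # clamp out-of-range values
    index = round(fraction * (len(palette) - 1))
    return palette[index]

# Hypothetical short palette (stand-in for a sequential colorscale)
palette = ["#0d0887", "#9c179e", "#ed7953", "#f0f921"]

low = value_to_color(1.8, 1.0, 8.0, palette)
mid = value_to_color(5.0, 1.0, 8.0, palette)
high = value_to_color(8.0, 1.0, 8.0, palette)
print(low, mid, high)
```

Plotly can also perform this mapping itself when `marker.color` is given numeric values together with `marker.colorscale`, `marker.cmin`, and `marker.cmax`, which additionally keeps the color bar consistent with the bar colors.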