Machine Learning-Based Air Quality Forecasting

We developed an ML-based air quality forecast modeling framework, consisting of two independent ML models, to predict O₃ at Kennewick, WA (Fan et al., 2022). The first ML model (ML1; Supplementary Figure 1a) consists of a random forest classifier and a multiple linear regression model, built with the RandomForestClassifier and RFE functions in the Python library scikit-learn (Pedregosa et al., 2011). The second ML model (ML2; Supplementary Figure 1b) is based on a two-phase random forest regression model, built with the RandomForestRegressor function in scikit-learn (Pedregosa et al., 2011). More details of the ML modeling framework are available in the Dataset and Modeling Framework section of Fan et al. (2022).
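As a rough illustration of how a classifier-plus-regression model such as ML1 could be assembled from the scikit-learn pieces named above, the sketch below selects five features with RFE, classifies each sample as "high" or "low", and then applies a linear regression fitted for that class. The synthetic data, the 80th-percentile threshold, the hyperparameters, and the way the classifier and regressions are chained are all assumptions made for illustration; they are not taken from Fan et al. (2022).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for hourly predictors and an observed pollutant (illustrative only).
n_samples, n_features = 500, 12
X = rng.normal(size=(n_samples, n_features))
y = 40 + 8 * X[:, 0] - 5 * X[:, 1] + 3 * X[:, 2] + rng.normal(scale=2, size=n_samples)

# Hypothetical threshold separating "high" (1) from "low" (0) hours.
threshold = np.percentile(y, 80)
labels = (y > threshold).astype(int)

# Step 1: recursive feature elimination around a random forest keeps five features.
selector = RFE(
    RandomForestClassifier(n_estimators=50, random_state=0),
    n_features_to_select=5,
)
selector.fit(X, labels)
X_sel = selector.transform(X)

# Step 2: a random forest classifier assigns each sample to a class.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_sel, labels)
pred_class = clf.predict(X_sel)

# Step 3: one multiple linear regression per class, applied to the predicted class.
regressors = {
    c: LinearRegression().fit(X_sel[labels == c], y[labels == c]) for c in (0, 1)
}
y_hat = np.empty(n_samples)
for c, reg in regressors.items():
    mask = pred_class == c
    if mask.any():
        y_hat[mask] = reg.predict(X_sel[mask])

print("selected feature indices:", np.flatnonzero(selector.support_))
```

The predictions here are made on the training data only to keep the sketch short; a real evaluation would use held-out data.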
In this study, we use the same ML models to predict O₃ and PM2.5 at various AQS sites in the PNW. To better fit local conditions, the models are trained separately at each site. Hourly O₃ and PM2.5 predictions are used to compute the maximum daily 8-h running average (MDA8) O₃ mixing ratio and the 24-h averaged PM2.5 concentration, because these are the metrics used by the National Ambient Air Quality Standards (NAAQS). Because the sources of PM2.5 differ between the wildfire and cold seasons in the PNW, the model is trained separately for the two seasons at each site. The feature-selection modules of the functions listed above are used to select the features for training the models at each site. For ML2, the weighting factors vary by site and are computed from the local input data.
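The aggregation from hourly predictions to MDA8 O₃ and 24-h averaged PM2.5 can be sketched with pandas as follows. The hourly series are synthetic, and the 8-h window here is a simple trailing window that requires all eight hours to be present; regulatory calculations apply their own windowing and data-completeness rules, which this sketch does not attempt to reproduce.

```python
import numpy as np
import pandas as pd

# Three days of synthetic hourly predictions (illustrative values only).
hours = pd.date_range("2020-07-01", periods=72, freq=pd.Timedelta(hours=1))
hour_of_day = hours.hour.to_numpy()
o3 = pd.Series(30 + 20 * np.sin((hour_of_day - 9) / 24 * 2 * np.pi), index=hours)
pm25 = pd.Series(np.full(72, 12.0), index=hours)
pm25.iloc[24:48] = 30.0  # a hypothetical smoky second day

# 8-h running mean of hourly O3 (trailing window, all 8 hours required).
running_8h = o3.rolling(window=8, min_periods=8).mean()

# MDA8: the largest 8-h running mean within each calendar day.
mda8_o3 = running_8h.resample("D").max()

# 24-h average PM2.5: the mean of the hourly values within each calendar day.
pm25_24h = pm25.resample("D").mean()

print(mda8_o3.round(2))
print(pm25_24h.round(2))
```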
Because ML models can overfit and can be sensitive to problems in the training dataset, we account for both in our modeling setup. To avoid overfitting, we limit the model training to five features and evaluate the model with 10-fold cross-validation repeated 10 times. Our training datasets are air quality observations, which are generally imbalanced: highly polluted and extremely clean events are both rare. Haixiang et al. (2017) show that imbalanced training data may lead to a bias toward commonly observed events. To alleviate the imbalance problem, we apply several methods, such as turning on the balanced_subsample option of the random forest model and using the multiple linear regression and the second-phase random forest regression in the modeling system.
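A minimal sketch of this evaluation setup, assuming that "10-time 10-fold" means 10-fold cross-validation repeated 10 times with different shuffles, is shown below. It uses scikit-learn's `class_weight="balanced_subsample"` setting on an imbalanced synthetic dataset with five features. The data, the 10% share of rare events, the stratified splitter, and the balanced-accuracy score are assumptions chosen for illustration, not the authors' configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

rng = np.random.default_rng(42)

# Imbalanced synthetic data: five features, about 10% "polluted" samples.
n_samples = 400
X = rng.normal(size=(n_samples, 5))
signal = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n_samples)
y = (signal > np.percentile(signal, 90)).astype(int)

# balanced_subsample reweights classes within each tree's bootstrap sample.
clf = RandomForestClassifier(
    n_estimators=20,
    class_weight="balanced_subsample",
    random_state=0,
)

# 10 repeats of 10-fold CV; stratification keeps rare events in every fold.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="balanced_accuracy")

print(f"balanced accuracy: {scores.mean():.3f} +/- {scores.std():.3f} "
      f"over {len(scores)} folds")
```

Repeating the 10-fold split gives 100 scores, so the spread of the scores indicates how much the estimate depends on the particular partition of the data.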


Fan K., Dhammapala R., Harrington K., Lamb B, & Lee Y. (2023). Machine learning-based ozone and PM2.5 forecasting: Application to multiple AQS sites in the Pacific Northwest. Frontiers in Big Data, 6, 1124148.

Publication 2023


Corresponding Organization : Center for Advanced Systems Understanding

Other organizations : Bay Area Air Quality Management District, Max Delbrück Center


Variable analysis

independent variables

Air quality observation data
Local input data

dependent variables

O₃ mixing ratio
PM2.5 concentrations

control variables

balanced_subsample option in the random forest model
Multiple linear regression
Second phase random forest regression
