The solubility prediction problem was addressed with Python code developed for this study, which performed hyperparameter tuning of 36 regression models built on a variety of algorithms, including linear models, boosting, ensembles, nearest neighbors, neural networks, and several other regressor types. The search for optimal parameters was carried out using an Optuna study; Optuna is a freely available Python package for hyperparameter optimization [61]. The collection of tuned models was assembled after 5000 minimization trials using TPE (Tree-structured Parzen Estimator) as the sampler of the search algorithm. TPE is a computationally efficient, model-based optimization algorithm that uses probability density functions to model the relationship between hyperparameters and the performance metric. To evaluate the performance of each regression model, a custom score function was developed that combines multiple metrics to account for both the model's accuracy and its ability to generalize.
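As an illustration, the following minimal sketch shows how such a study can be set up with Optuna's TPE sampler. The choice of GradientBoostingRegressor, its search space, and the synthetic placeholder data are assumptions made for demonstration only; custom_loss denotes the scoring function of Equation (1), a sketch of which follows the equation below.

```python
# Minimal sketch (not the authors' code): tuning one regressor type with
# Optuna's TPE sampler over 5000 minimization trials, as described above.
import optuna
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Placeholder data; the study used molecular descriptors and solubilities
# expressed as logarithms of mole fractions.
X_train, y_train = make_regression(n_samples=300, n_features=10,
                                   noise=0.2, random_state=0)

def objective(trial):
    # Illustrative search space for one of the 36 tuned regressor types.
    model = GradientBoostingRegressor(
        n_estimators=trial.suggest_int("n_estimators", 50, 1000),
        learning_rate=trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        max_depth=trial.suggest_int("max_depth", 2, 10),
        subsample=trial.suggest_float("subsample", 0.5, 1.0),
    )
    # custom_loss implements Equation (1); see the sketch below.
    return custom_loss(model, X_train, y_train)

study = optuna.create_study(direction="minimize",
                            sampler=optuna.samplers.TPESampler(seed=42))
study.optimize(objective, n_trials=5000)  # 5000 minimization trials
```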
The loss was computed according to the following formula:

$$\mathrm{loss}_{\mathrm{train}} = MSE_{\mathrm{train}}^{LC,\mathrm{train}} + \frac{MSE_{\mathrm{train}}^{LC,\mathrm{train}}}{MSE_{\mathrm{train}}^{LC,\mathrm{test}}} + MSE_{\mathrm{train}} \cdot \left(1 + 100 \cdot N_{\mathrm{train}}^{\mathrm{pos}} + 10 \cdot N_{\mathrm{train}}^{\mathrm{out}}\right) \tag{1}$$

where all terms were computed on the training dataset. The last term comprises the mean squared error ($MSE_{\mathrm{train}}$) between the predicted and actual values of the target variable, scaled by two penalties on the number of positive predictions ($N_{\mathrm{train}}^{\mathrm{pos}}$) and the number of outliers ($N_{\mathrm{train}}^{\mathrm{out}}$). The first penalty enforces formally acceptable predicted values: since the models were trained against solubility expressed as the logarithm of the mole fraction, the predictions should always be negative. The second penalty favors models with as few outlying data points as possible, an outlier being defined as a prediction whose residual exceeds three standard deviations. The first two terms in Equation (1) were obtained from learning curve analysis (LCA), performed using the sklearn.model_selection.learning_curve function of the scikit-learn 1.2.2 library [51]; they characterize the model's performance for different training set sizes and thus its ability to generalize to new, unseen data. LCA utilizes cross-validation (CV), which was set here to a 5-fold CV of the training dataset. Since LCA can be computationally expensive, only two-point computations, using 50% and 100% of the training data, were performed during tuning; the final model assessments via LCA were conducted using 20-point computations. The $MSE_{\mathrm{train}}^{LC,\mathrm{train}}$ and $MSE_{\mathrm{train}}^{LC,\mathrm{test}}$ values included in the custom loss correspond to the mean MSE values obtained at the largest training set size. Hence, the custom loss function combines two types of components, one quantifying the model's accuracy and the other its ability to generalize to new, unseen data. Overall, this approach yields a robust and reliable solubility prediction framework that can be used for various applications, including screening for new solvents.
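A minimal sketch of how Equation (1) can be evaluated with scikit-learn is given below; the variable names and the application of the 3-sigma rule to the prediction residuals are assumptions, not a reproduction of the authors' implementation.

```python
# Minimal sketch of the custom loss of Equation (1): two-point learning-curve
# MSEs plus a penalized training MSE.
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import learning_curve

def custom_loss(model, X, y):
    # Two-point learning curve (50% and 100% of the training data), 5-fold CV.
    # scikit-learn reports negated scores for "neg_mean_squared_error".
    _, train_scores, test_scores = learning_curve(
        model, X, y, train_sizes=[0.5, 1.0], cv=5,
        scoring="neg_mean_squared_error",
    )
    # Mean CV values at the largest training set size (last row).
    mse_lc_train = -train_scores[-1].mean()
    mse_lc_test = -test_scores[-1].mean()

    # Training-set MSE and the two penalty counts.
    y_pred = model.fit(X, y).predict(X)
    mse_train = mean_squared_error(y, y_pred)
    n_pos = int(np.sum(y_pred > 0))  # log mole fractions should be negative
    residuals = y_pred - y
    n_out = int(np.sum(np.abs(residuals) > 3 * residuals.std()))  # 3-sigma rule (assumed)

    return (mse_lc_train
            + mse_lc_train / mse_lc_test
            + mse_train * (1 + 100 * n_pos + 10 * n_out))
```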
The final performance of all models was evaluated using the loss values obtained for the test and validation subsets. The ensemble model (EM) was defined as the subset of regression models with the lowest values of both criteria, and the final predictions were averaged over the selected models.
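A minimal sketch of this averaging step, assuming a hypothetical list selected_models of already fitted regressors:

```python
# Minimal sketch: ensemble model (EM) predictions as the mean over the
# selected regressors; `selected_models` is a hypothetical list of fitted models.
import numpy as np

def ensemble_predict(selected_models, X):
    # Stack per-model predictions column-wise and average across models.
    predictions = np.column_stack([m.predict(X) for m in selected_models])
    return predictions.mean(axis=1)
```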