The solubility prediction problem was addressed with Python code developed for this study, which performed hyperparameter tuning of 36 regression models built on a variety of algorithms, including linear models, boosting, ensembles, nearest neighbors, neural networks, and several other regressor types. The search for optimal parameters was carried out using an Optuna study; Optuna is a freely available Python package for hyperparameter optimization [61]. The collection of tuned models was assembled after 5000 minimization trials using TPE (Tree-structured Parzen Estimator) as the sampler of the search algorithm. TPE is a computationally efficient, model-based optimization algorithm that uses probability density functions to model the relationship between hyperparameters and the performance metric. To evaluate the performance of each regression model, a custom score function was developed that combines multiple metrics to account for both the model's accuracy and its ability to generalize.
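As an illustration, the following minimal sketch shows how such a study can be set up with Optuna's TPE sampler. The choice of GradientBoostingRegressor, its search space, and the synthetic placeholder data are assumptions made for demonstration only; custom_loss denotes the scoring function of Equation (1), a sketch of which follows the equation below.

```python
# Minimal sketch (not the authors' code): tuning one regressor type with
# Optuna's TPE sampler over 5000 minimization trials, as described above.
import optuna
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Placeholder data; the study used molecular descriptors and solubilities
# expressed as logarithms of mole fractions.
X_train, y_train = make_regression(n_samples=300, n_features=10,
                                   noise=0.2, random_state=0)

def objective(trial):
    # Illustrative search space for one of the 36 tuned regressor types.
    model = GradientBoostingRegressor(
        n_estimators=trial.suggest_int("n_estimators", 50, 1000),
        learning_rate=trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        max_depth=trial.suggest_int("max_depth", 2, 10),
        subsample=trial.suggest_float("subsample", 0.5, 1.0),
    )
    # custom_loss implements Equation (1); see the sketch below.
    return custom_loss(model, X_train, y_train)

study = optuna.create_study(direction="minimize",
                            sampler=optuna.samplers.TPESampler(seed=42))
study.optimize(objective, n_trials=5000)  # 5000 minimization trials
```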
The loss was computed according to the following formula:

$$\mathrm{loss}_{\mathrm{train}} = MSE_{\mathrm{train}}^{LC,\mathrm{train}} + \frac{MSE_{\mathrm{train}}^{LC,\mathrm{train}}}{MSE_{\mathrm{train}}^{LC,\mathrm{test}}} + MSE_{\mathrm{train}} \cdot \left(1 + 100 \cdot N_{\mathrm{train}}^{\mathrm{pos}} + 10 \cdot N_{\mathrm{train}}^{\mathrm{out}}\right) \tag{1}$$

where all terms were computed on the training dataset. The last term comprises the mean squared error ($MSE_{\mathrm{train}}$) between the predicted and actual values of the target variable, scaled by two penalties on the number of positive predictions ($N_{\mathrm{train}}^{\mathrm{pos}}$) and the number of outliers ($N_{\mathrm{train}}^{\mathrm{out}}$). The first penalty enforces formally acceptable predicted values: since the models were trained against solubility expressed as the logarithm of the mole fraction, the predictions should always be negative. The second penalty favors models with as few outlying data points as possible, an outlier being defined as a prediction whose residual exceeds three standard deviations. The first two terms in Equation (1) were obtained from learning curve analysis (LCA), performed using the sklearn.model_selection.learning_curve function of the scikit-learn 1.2.2 library [51]; they characterize the model's performance for different training set sizes and thus its ability to generalize to new, unseen data. LCA utilizes cross-validation (CV), which was set here to a 5-fold CV of the training dataset. Since LCA can be computationally expensive, only two-point computations, using 50% and 100% of the training data, were performed during tuning; the final model assessments via LCA were conducted using 20-point computations. The $MSE_{\mathrm{train}}^{LC,\mathrm{train}}$ and $MSE_{\mathrm{train}}^{LC,\mathrm{test}}$ values included in the custom loss correspond to the mean MSE values obtained at the largest training set size. Hence, the custom loss function combines two types of components, one quantifying the model's accuracy and the other its ability to generalize to new, unseen data. Overall, this approach yields a robust and reliable solubility prediction framework that can be used for various applications, including screening for new solvents.
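A minimal sketch of how Equation (1) can be evaluated with scikit-learn is given below; the variable names and the application of the 3-sigma rule to the prediction residuals are assumptions, not a reproduction of the authors' implementation.

```python
# Minimal sketch of the custom loss of Equation (1): two-point learning-curve
# MSEs plus a penalized training MSE.
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import learning_curve

def custom_loss(model, X, y):
    # Two-point learning curve (50% and 100% of the training data), 5-fold CV.
    # scikit-learn reports negated scores for "neg_mean_squared_error".
    _, train_scores, test_scores = learning_curve(
        model, X, y, train_sizes=[0.5, 1.0], cv=5,
        scoring="neg_mean_squared_error",
    )
    # Mean CV values at the largest training set size (last row).
    mse_lc_train = -train_scores[-1].mean()
    mse_lc_test = -test_scores[-1].mean()

    # Training-set MSE and the two penalty counts.
    y_pred = model.fit(X, y).predict(X)
    mse_train = mean_squared_error(y, y_pred)
    n_pos = int(np.sum(y_pred > 0))  # log mole fractions should be negative
    residuals = y_pred - y
    n_out = int(np.sum(np.abs(residuals) > 3 * residuals.std()))  # 3-sigma rule (assumed)

    return (mse_lc_train
            + mse_lc_train / mse_lc_test
            + mse_train * (1 + 100 * n_pos + 10 * n_out))
```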
The final performance of all models was evaluated using the loss values obtained for the test and validation subsets. The ensemble model (EM) was defined as the subset of regression models with the lowest values of both criteria, and the final predictions were averaged over the selected models.
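A minimal sketch of this averaging step, assuming a hypothetical list selected_models of already fitted regressors:

```python
# Minimal sketch: ensemble model (EM) predictions as the mean over the
# selected regressors; `selected_models` is a hypothetical list of fitted models.
import numpy as np

def ensemble_predict(selected_models, X):
    # Stack per-model predictions column-wise and average across models.
    predictions = np.column_stack([m.predict(X) for m in selected_models])
    return predictions.mean(axis=1)
```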