where all terms were computed on the training dataset. The last three terms comprise the mean squared error (MSE) between the predicted and actual values of the target variable and two penalties: one on the number of positive predicted values and one on the number of outliers. The first penalty enforces formally acceptable predictions: since the models were trained on solubility values expressed as the logarithm of the mole fraction, the predicted values should always be negative, and positive predictions are therefore penalized. The second penalty directs the selection toward models with as few outlying data points as possible, an outlier being defined as a residual exceeding three times the standard deviation. The first two terms in Equation (1), the training and cross-validation errors, were obtained from learning curve analysis (LCA) performed with the sklearn.model_selection.learning_curve function of the scikit-learn 1.2.2 library [51]. LCA characterizes the model's performance at different training set sizes and, through cross-validation (set here to a 5-fold CV of the training dataset), its ability to generalize to new, unseen data. Since LCA can be computationally expensive, only two-point computations were performed here, using 50% and 100% of the total data; the final model assessments via LCA were conducted using 20-point computations. The values included in the custom loss correspond to the mean MAE values obtained at the largest training set size. Hence, the custom loss function combines components quantifying both the model's accuracy and its ability to generalize to new, unseen data. Overall, this approach yields a robust and reliable solubility prediction model that can be used for various applications and for screening new solvents.
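The loss described above can be sketched as follows. This is a minimal illustration only: the function name, the equal (unit) weighting of the five terms, and the use of a scikit-learn-style estimator are assumptions, not the authors' exact implementation.

```python
# Illustrative sketch of the custom loss: two learning-curve MAE terms
# (training and cross-validation error at the largest training size),
# the training MSE, and two count penalties. Term weights (here all 1)
# are an assumption, not the authors' values.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import learning_curve

def custom_loss(model, X_train, y_train):
    # Two-point learning curve (50% and 100% of the data), 5-fold CV,
    # scored with negative MAE following scikit-learn's convention.
    sizes, train_scores, cv_scores = learning_curve(
        model, X_train, y_train,
        train_sizes=[0.5, 1.0], cv=5,
        scoring="neg_mean_absolute_error",
    )
    # Mean MAE at the largest training-set size (last row of each grid).
    mae_train = -train_scores[-1].mean()
    mae_cv = -cv_scores[-1].mean()

    # Refit on the full training set for the remaining terms.
    model.fit(X_train, y_train)
    y_pred = model.predict(X_train)
    resid = y_train - y_pred
    mse = np.mean(resid ** 2)

    # Penalties: log(x) solubilities should be negative, so positive
    # predictions are counted; outliers are residuals beyond 3 sigma.
    n_pos = np.sum(y_pred > 0)
    n_out = np.sum(np.abs(resid) > 3 * resid.std())

    return mae_train + mae_cv + mse + n_pos + n_out
```

Because the penalties are simple counts, a model that predicts even a few physically impossible (positive) log-solubility values is strongly disfavored relative to one with a comparable MSE but admissible predictions.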
The final performance of all models was evaluated using the loss values obtained on the test and validation subsets. The ensemble model (EM) was defined by including the subset of regression models with the lowest values of both criteria, and the final predictions were averaged over the selected models.
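The selection-and-averaging step can be sketched as below. The top-k cutoff, the fallback rule for an empty intersection, and the function names are hypothetical; the source specifies only that models ranking best on both criteria are kept and their predictions averaged.

```python
# Hypothetical sketch of ensemble model (EM) construction: keep models
# that rank among the best on BOTH the test and validation losses,
# then average their predictions. The top_k cutoff is an assumption.
import numpy as np

def build_ensemble(models, losses_test, losses_val, top_k=3):
    order_test = np.argsort(losses_test)[:top_k]
    order_val = np.argsort(losses_val)[:top_k]
    # Intersection of the two rankings, in test-loss order.
    selected = [i for i in order_test if i in order_val]
    # Fall back to the test ranking if the intersection is empty.
    if not selected:
        selected = list(order_test)

    def predict(X):
        # Average the predictions of the selected models.
        preds = np.stack([models[i].predict(X) for i in selected])
        return preds.mean(axis=0)

    return predict, selected
```

Requiring a low loss on both subsets simultaneously filters out models that happen to fit one subset well by chance, which is the stated motivation for the dual criterion.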