Example 1

The MCA-miner method disclosed herein in FIGS. 2A-2C, when used together with BRL, offers the power of rule list interpretability while maintaining the predictive capabilities of already established machine learning methods.

The performance and computational efficiency of the new MCA-miner are benchmarked on the “Titanic” dataset, as well as on the following five (5) datasets available in the UCI Machine Learning Repository: “Adult,” “Autism Screening Adult,” “Breast Cancer Wisconsin (Diagnostic),” “Heart Disease,” and “HIV-1 protease cleavage,” which are designated as Adult, ASD, Cancer, Heart, and HIV, respectively. These datasets represent a wide variety of real-world experiments and observations, thus enabling the improvements described herein to be compared against the original BRL implementation using the FP-Growth miner.

All six benchmark datasets correspond to binary classification tasks. The experiments are conducted using the same setup for each benchmark. First, the dataset is transformed into a format that is compatible with the disclosed BRL implementation. Second, all continuous attributes are quantized into either two (2) or three (3) categories, while the original categories of all other variables are kept. Depending on the dataset and how its data was originally collected, existing taxonomy and expert domain knowledge are prioritized in some instances to generate the continuous variable quantization; a balanced quantization is generated when no such information is available. Third, a model is trained and tested using 5-fold cross-validation, and the average accuracy and Area Under the ROC Curve (AUC) are reported as model performance measurements.
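As a minimal sketch of the second and third steps, assuming that a balanced quantization means equal-frequency (quantile) binning and that folds are drawn by a seeded shuffle, the preprocessing could look as follows in Python. The function names balanced_quantize and k_fold_indices are hypothetical and are not part of the disclosed implementation.

```python
import random
from statistics import quantiles


def balanced_quantize(values, n_bins=3):
    """Map each continuous value to a bin index in [0, n_bins).

    Assumption: "balanced" is read as equal-frequency binning, so the cut
    points are the empirical quantiles of the attribute.
    """
    if n_bins < 2:
        raise ValueError("n_bins must be at least 2")
    # quantiles() returns n_bins - 1 cut points.
    cuts = quantiles(values, n=n_bins, method="inclusive")
    # The bin index is the number of cut points strictly below the value.
    return [sum(v > c for c in cuts) for v in values]


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [idx[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


# Hypothetical usage: quantize an "age" attribute into three categories.
ages = [22, 25, 31, 38, 41, 47, 52, 60, 67]
age_bins = balanced_quantize(ages, n_bins=3)
splits = list(k_fold_indices(len(ages), k=5))
```

Accuracy and AUC would then be computed on each test fold and averaged over the five folds; where domain knowledge supplies meaningful thresholds, fixed cut points would replace the quantiles.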

Table 1 presents the empirical results of comparing both implementations. The notation in the table follows the definitions above. To strive for a fair comparison, the parameters r_max = 2 and s_min = 0.3 are fixed for both methods, and for MCA-miner in particular μ_min = 0.5 and M = 70 are also set. The multi-core implementations of both the new MCA-miner and BRL were executed on six parallel processes, and were stopped when the Gelman-Rubin convergence statistic satisfied R̂ ≤ 1.05. All experiments were run on a single AWS EC2 c5.18xlarge instance with 72 cores.
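The stopping rule can be illustrated with the standard Gelman-Rubin potential scale reduction factor, computed from the between-chain and within-chain variances of a scalar summary traced by each parallel chain. The sketch below is an illustration only: the choice of scalar (for example, each chain's log-posterior) and the names gelman_rubin and has_converged are assumptions, not details taken from the disclosed implementation.

```python
from math import sqrt
from statistics import mean, variance


def gelman_rubin(chains):
    """Potential scale reduction factor for m equal-length scalar chains."""
    m = len(chains)
    n = len(chains[0])
    if m < 2 or n < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least 2 chains of equal length >= 2")
    chain_means = [mean(c) for c in chains]
    # W: average of the within-chain sample variances.
    w = mean([variance(c) for c in chains])
    if w == 0:
        raise ValueError("within-chain variance is zero")
    # B: between-chain variance, scaled by the chain length.
    b = n * variance(chain_means)
    # Pooled estimate of the marginal posterior variance.
    var_plus = (n - 1) / n * w + b / n
    return sqrt(var_plus / w)


def has_converged(chains, threshold=1.05):
    """Stopping test using the threshold quoted in the text."""
    return gelman_rubin(chains) <= threshold
```

With six parallel processes, chains would hold six traces of equal length, and sampling would continue until has_converged returns True.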

TABLE 1
Performance evaluation of FP-Growth against MCA-miner when used with BRL on benchmark datasets. t_train is the full training wall time.

                                     FP-GROWTH + BRL               MCA-MINER + BRL
DATASET  n       p   Σ_{i=1}^{p}|·|  ACCURACY  AUC   t_train [s]   ACCURACY  AUC   t_train [s]
Adult    45,222  14  111             0.81      0.85  512           0.81      0.85  115
ASD      248     21  89              0.87      0.90  198           0.87      0.90  16
Cancer   569     32  150             0.92      0.97  168           0.92      0.94  22
Heart    303     13  49              0.82      0.86  117           0.82      0.86  15
HIV      5,840   8   160             0.87      0.88  449           0.87      0.88  36
Titanic  2,201   3   8               0.79      0.76  118           0.79      0.75  10

It is clear from the experiments in Table 1 that the new MCA-miner matches the accuracy of FP-Growth on every dataset, with AUC equal or within 0.03, while reducing the wall time required to mine rules and train a BRL model by a factor of approximately 4.5 to 12.5.
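The reduction in training time can be quantified directly from the t_train columns of Table 1. The short Python sketch below computes the per-dataset speedup as the ratio of the FP-Growth + BRL wall time to the MCA-miner + BRL wall time; the names TRAIN_TIMES and speedups are illustrative only.

```python
# (dataset, FP-Growth + BRL t_train [s], MCA-miner + BRL t_train [s]) from Table 1
TRAIN_TIMES = [
    ("Adult", 512, 115),
    ("ASD", 198, 16),
    ("Cancer", 168, 22),
    ("Heart", 117, 15),
    ("HIV", 449, 36),
    ("Titanic", 118, 10),
]


def speedups(rows):
    """Return {dataset: FP-Growth wall time / MCA-miner wall time}."""
    return {name: fp_growth / mca for name, fp_growth, mca in rows}


for name, factor in speedups(TRAIN_TIMES).items():
    print(f"{name}: {factor:.1f}x faster")
```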
