1 Introduction
New computational capabilities and high volumes of data have propelled the development and adoption of new
artificial intelligence (
AI) algorithms. The main protagonist of the
machine learning (
ML) domain has undoubtedly been the
deep learning (
DL) approach, which has provided impressive performance in a wide range of sectors. Indeed, these approaches are now extensively applied for task automation, yielding performance that compares favourably to human operation. However, many challenges exist in sustaining the performance of DL approaches. These models necessitate extensive datasets for proper training, imposing significant demands on storage and computational power. The associated costs are estimated to be in the millions [
3,
25,
39,
40]. Their appetite for data also makes it difficult to apply these models to small datasets and to deploy them in a practical and cost-effective manner. The opacity of “black box” DL models makes them hard to trust, especially in sensitive fields like medicine. These aspects contrast with the “right to explainability” ambition defined in recital 71 of the
General Data Protection Regulation (
GDPR) of the European Union and in the Algorithmic Accountability Act proposed in the US Congress (H.R. 6580 of 2022) [
14,
29].
The rising interest in
Explainable AI (
XAI) is formulated in terms of
interpretability and
explainability [
15]. Drawing from multi-agent systems, interpretability is defined as the process of assigning a subjective meaning to a feature or aspect, while explainability is concerned with transforming an object into easily or effectively interpretable objects [
8]. Some methods aim at achieving the latter by attempting to peer into the black box [
17] through post-hoc models [
They try to produce explanations by applying explainable-by-design models, such as decision trees, to neural networks in order to approximate their behaviour and prove their
robustness either locally or globally. Other techniques, as seen in [
26,
27,
45], perform feature selection to highlight the most important information for classifying or separating two or more groups. Angelov et al. [
1] proposed new DL architectures to augment their explainability or perform meta-analyses on their functioning [
2]. The tradeoff between explainability and accuracy has been widely discussed in [
28], but one point in favour of emphasising the former is the greater potential for improvement and testing in methods that are easier to understand.
In our previous work on a
Small and iNcomplete Dataset Analyser (
SaNDA) [
19], we addressed these challenges by employing abstractions to: (i) enhance privacy; (ii) facilitate the deployment of statistical approaches; (iii) share anonymous data among different research centres; (iv) construct knowledge graphs to represent the information extracted from the data. A key contribution was the use of abstractions to partition the set of values of an attribute into
Up and
Down flips, representing them in a more intuitive, probabilistic, and anonymous format. In this article, we further exploit this concept by creating the
Comprehensive Abstraction and Classification Tool for Uncovering Structures (
CACTUS). CACTUS supports additional flips for categorical attributes whilst preserving their original meaning, optimising memory usage, accelerating computation through parallelisation, and providing explanations of its internal reasoning. The architecture of CACTUS incorporates the original concept of SaNDA in a broader manner, enabling it to automatically stratify and binarise for multiple configurations simultaneously. This allows the creation of binary decision trees and correlation matrices alongside the utilisation of abstractions, and facilitates the storage of intermediate results for easy revision and integration with other tools. Finally, the output structure of SaNDA has been improved and rationalised to transform it into a research tool that can be easily and broadly used by the wider scientific community.
In order to demonstrate and evaluate the work, we use several datasets to gauge and showcase the performance and ease of use of CACTUS. They include the
Wisconsin Diagnostic Breast Cancer (
WDBC) [
12,
44], Thyroid0387 [
37], the Mushroom primary dataset [43], Cleveland Heart Disease (hereafter referred to simply as Heart Disease) [21], and the Adult Income datasets [5], each available at archive.ics.uci.edu/ml/. This work makes a number of contributions:
—
Presents a faster and lighter data abstraction and knowledge graph creation process, facilitating quicker computations and generating more outputs. This, in turn, provides additional resources for users to explore the dataset.
—
Addresses categorical features, which are treated as continuous values by traditional ML models despite the semantic information they contain. To our knowledge, this is the first model to preserve and use the semantic meaning of categorical information together with continuous variables.
—
Introduces additional metrics to evaluate the classification process, including a notion of balanced accuracy (BA) for multi-class classification. This allows for fairer comparisons with other models.
—
Conducts auxiliary analyses of the dataset to identify patterns and contrast them with insights provided by the classification process. This includes ranking each feature and quantifying its importance, which is fundamental for assessing the model’s classification and providing insights to the user.
The article is organised as follows: Section
2 covers the mechanism used for abstracting values. Section
3 demonstrates how CACTUS abstracts and extracts information from the considered datasets. In Section
4, we present a meta-analysis of its internal reasoning, along with a comparison of the approach to the existing literature.
2 Methods
CACTUS has been implemented in Python3 [
41], using scientific libraries such as numpy [
18] and pandas [
32] to achieve high performance while remaining easy to debug and understand. It offers users various options for handling a given dataset, such as replacing values, dropping certain columns, and specifying which values represent a missing value (or a
not-a-number,
NaN). This information is then encapsulated in a
YAML configuration file, enabling customisation of computations and facilitating the execution of different analyses simultaneously. CACTUS can be compiled in Cython, enabling better parallelisation and speed. It also provides a command line interface with a convenient flag to switch from multi-class to binary classification, by specifying how to
binarise the target label. Both the binarisation and the stratification functionality can be specified multiple times and switched seamlessly, performing several runs in a single execution. Attributes used for stratification must be integers and must have fewer than 5 unique values.
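As an illustration of this mechanism, the following minimal sketch shows how such a configuration could be loaded; the key names (dataset, target, missing_values, drop_columns, categorical, binarise, stratify) are hypothetical and do not reflect CACTUS's actual schema.

# Hypothetical example of a CACTUS-style YAML configuration; the key names are
# illustrative only and are not taken from the actual CACTUS configuration schema.
import yaml

CONFIG_TEXT = """
dataset: data/heart_disease.csv
target: num
missing_values: ["?", -9]          # values to be treated as NaN
drop_columns: [id]
categorical: [cp, thal]            # attributes forced to be categorical
binarise:
  num: {0: [0], 1: [1, 2, 3, 4]}   # condense the multi-class target to binary
stratify: [sex]
"""

config = yaml.safe_load(CONFIG_TEXT)
print(config["binarise"]["num"])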
The pipeline implemented in CACTUS is organised into three independent functional modules: decision tree, abstraction, and correlation, as described in Figure
1.
The decision tree module computes the binary decision tree on the dataset. It eases the correlation by using scikit-learn [
33] to assess the combinations of columns for which the correlation can be computed. Performing this pre-selection can speed up the correlation phase if the dataset presents particular patterns of missing values; it may also make the correlation matrix less sparse. The user can choose the criteria used for splitting a node of the tree between
gini coefficient [
11],
Shannon’s entropy [
30], and
log-loss [
42].
The abstraction module is detailed in Figure
2. The first step is the partitioning of continuous values into discrete categories using the
receiver operating characteristic (
ROC) curve. If there are multiple classes, the user has to set a threshold to divide them into two
populations to build the ROC curve. This assumes that the classes are ordered and related, such as the gravity of a disease or the income range of an individual. Let
x be a continuous feature that we want to partition,
\(V(x)\) be the set of unique values of the feature
x, and
\(v_{max} = \underset{v \in V(x)}{\mathrm{argmax}}\ BA(x, v)\) be the value of
x yielding the highest BA in separating the two populations. The feature,
x, will then be partitioned as
\[
x \mapsto \begin{cases} x\_U, & x \ge v_{max} \\ x\_D, & x < v_{max}. \end{cases}
\]
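A minimal sketch of this partitioning is given below; the function name and the direction of the inequality are our assumptions for illustration.

# Sketch: scan the unique values of a continuous feature, keep the threshold
# maximising balanced accuracy between the two populations, and abstract the
# values into U/D flips. The >= direction is an assumption for illustration.
import numpy as np
from sklearn.metrics import balanced_accuracy_score

def abstract_feature(x: np.ndarray, y: np.ndarray):
    """x: continuous feature (may contain NaNs); y: binary population labels."""
    valid = ~np.isnan(x)
    candidates = np.unique(x[valid])
    scores = [balanced_accuracy_score(y[valid], (x[valid] >= v).astype(int))
              for v in candidates]
    best = int(np.argmax(scores))
    v_max, ba = candidates[best], scores[best]
    flips = np.full(x.shape, None, dtype=object)   # missing values receive no flip
    flips[valid] = np.where(x[valid] >= v_max, "U", "D")
    return v_max, ba, flips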
Unlike SaNDA [
19], CACTUS does not alter the categorical features, thus preserving their original meaning. If
A is a categorical attribute assuming values
\(\lbrace 1, 2, \dots i\rbrace\), then its states will be coded in the
flips \(\lbrace A\_1, A\_2, \dots A\_i\rbrace\). An attribute is recognised automatically as categorical if it has fewer than 5 unique values, or if it is explicitly marked as such in the configuration file. This allows users to define as categorical an attribute that does not spontaneously fit our definition, which can be useful for attributes that assume few non-integer values (for example, when they contain high percentages of NaNs) or that have more than 5 unique values, permitting a finer granularity and a more effective stratification. In SaNDA, all attributes were simply labelled as
\(\_U\) and
\(\_D\), and thus their semantic meaning (e.g., Genetic variables) was lost.
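For illustration, the automatic recognition of categorical attributes and the naming of their flips could be sketched as follows; this is a simplification under the stated rule, not CACTUS's internal logic.

# Sketch of the categorical handling: attributes with fewer than five unique
# values, or explicitly marked in the configuration, keep their states as
# named flips such as A_1, A_2, ...; everything else is treated as continuous.
import pandas as pd

def categorical_flips(series: pd.Series, forced: bool = False, max_unique: int = 5):
    if forced or series.nunique(dropna=True) < max_unique:
        return [f"{series.name}_{v}" for v in sorted(series.dropna().unique())]
    return None   # to be partitioned via the ROC-based procedure instead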
CACTUS then performs a filtration to discard features with an excessive error margin (
\(\mathrm{EM}\)). Statistics calculated on a sample can differ from those of the population it represents. Estimating this difference is important for selecting what is reliable and applicable beyond the sample. The notion of margin of error provides an estimate of this difference at a given confidence level. If
Z is the z-score representing the confidence interval that we want to use for estimating
\(\mathrm{EM}\), and
\(p_x\) is the BA achieved by the
\(v_{max}\) value of feature
x, and
N is the sample size, then the error margin is computed as
\[
\mathrm{EM} = Z \sqrt{\frac{p_x (1 - p_x)}{N}}.
\]
By default, CACTUS applies a z-score of 2.33, corresponding to a confidence interval of
\(99\%\), and removes features for which the error margin exceeds
\(v_{max}\). Given the lower accuracy of categorical features with multiple flips, this filtration is applied only to the previously-continuous features that were partitioned into
\(\_U\) and
\(\_D\). The z-score can be customised by the user to make the filtration more or less strict.
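Assuming the standard margin-of-error formula reconstructed above, the computation can be sketched as follows; the worked value in the final comment is only a numerical illustration.

# Sketch of the error-margin computation used for filtering previously-continuous
# features; z = 2.33 is the default z-score mentioned above.
import math

def error_margin(p_x: float, n: int, z: float = 2.33) -> float:
    return z * math.sqrt(p_x * (1.0 - p_x) / n)

# e.g. a feature reaching BA = 0.82 on 569 records has EM ≈ 0.038
print(round(error_margin(0.82, 569), 3))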
Having partitioned all the continuous features into discrete objects, we can link all the
flips into a
knowledge graph representing the interactions inside each class. The nodes represent flips of each feature, and the connections represent conditional probabilities. A flip,
f, is associated with its frequency in a given class
c,
\(P(f|c)\), which indicates whether it is more or less likely to be present than its siblings. For example, an old person (
\(Age\_U\)) could be more likely to be sick than a young one (
\(Age\_D\)). We then connect two flips,
\(f_i\) and
\(f_j\), in a connectivity matrix
M by setting \(M_{i,j} = P(f_j|f_i, c)\), the conditional probability of observing \(f_j\) given \(f_i\) within the class.
This connection is directional, meaning that
\(M_{j, i}\) will not necessarily have the same weight. This allows for exploiting
PageRank [
31] to identify central flips in each class as the elements that are most intensively referred to by other nodes. CACTUS considers the significance of a flip
f in a class
c either as its frequency,
\(P(f|c)\), alone, or as its frequency multiplied by the PageRank score
\(\mathrm{PR}(f,c)\); these two definitions of significance are used in parallel and independently in a
Probabilistic (
PB) and
PageRank (
PR) approach to classify the records of the dataset.
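As an illustration of the two significance definitions, the following sketch computes both for the flips of a single class; networkx is used here only for brevity (CACTUS builds its knowledge graphs with graph-tool), and the edge semantics follow our reconstruction above.

# Sketch: compute the PB significance (frequency alone) and the PR significance
# (frequency multiplied by the PageRank score) of each flip within one class.
import networkx as nx

def significances(freq: dict, edges: dict):
    """freq: {flip: P(f|c)}; edges: {(f_i, f_j): weight} for a single class c."""
    g = nx.DiGraph()
    g.add_nodes_from(freq)
    g.add_weighted_edges_from((i, j, w) for (i, j), w in edges.items())
    pr = nx.pagerank(g, weight="weight")
    pb_sig = dict(freq)                            # PB: P(f|c) alone
    pr_sig = {f: freq[f] * pr[f] for f in freq}    # PR: P(f|c) * PageRank(f, c)
    return pb_sig, pr_sig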
The graph-tool library [
35], already present in SaNDA, has been retained for building the knowledge graphs. However, the stochastic community detection algorithm, previously employed in [
34], has been substituted with the greedy [
9], label propagation [
38], and Louvain [
6] deterministic approaches offered by the networkX library [
16]. This alteration aims to achieve more stable structures that can be evaluated using the notions of modularity, coverage, and performance [
9,
13]. These metrics are automatically computed, compared, and stored during the process.
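The community detection and quality metrics mentioned above could be reproduced on an undirected copy of a class graph roughly as follows; this is a sketch using the corresponding networkX routines, and the exact options CACTUS passes are not shown.

# Sketch: run the greedy, label propagation, and Louvain algorithms on an
# undirected graph and report modularity, coverage, and performance for each.
import networkx as nx
from networkx.algorithms import community

def communities_report(g: nx.Graph) -> dict:
    partitions = {
        "greedy": list(community.greedy_modularity_communities(g)),
        "label_propagation": list(community.label_propagation_communities(g)),
        "louvain": community.louvain_communities(g, seed=0),  # seed fixed for determinism
    }
    report = {}
    for name, parts in partitions.items():
        coverage, performance = community.partition_quality(g, parts)
        report[name] = {"modularity": community.modularity(g, parts),
                        "coverage": coverage,
                        "performance": performance}
    return report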
Previously in SaNDA [
19], adjacencies in the knowledge graphs were stored and computed using adjacency lists. However, these proved cumbersome and memory-consuming, particularly for large datasets. We addressed this limitation by computing these connections on-the-fly, thereby saving disk and memory space. This also reduces the total runtime, as the connections of each graph are computed only once and in parallel; the connection weights are then saved in a
comma-separated value (
CSV) file, which is a much more convenient representation than adjacency lists. The graphs are stored in graphml format, which allows for additional analyses using different software, such as Gephi [
4].
After building the knowledge graphs, CACTUS classifies each record individually and compares the two available significance metrics, PR and PB, against the ground truth. Using two classification modalities allows for consideration of the contribution of an attribute alone (PB) as well as its interaction with the other elements (PR). The function,
\(\Xi\), computes the similarity between the considered entry and a class
c, and is given as
\[
\Xi (c, \boldsymbol {F}) = \sum _{f \in \boldsymbol {F}} \sigma (f|c),
\]
where
\(\boldsymbol {F}\) is the set of flips of an entry,
c is one of the considered classes, and
\(\sigma (f|c)\) is a function returning the significance of the flip
f for the class
c, following the PR or PB approaches previously defined. Notably, a record's missing values are excluded from the set
\(\boldsymbol {F}\). The classification assigns the entry to the class to which it is the most similar; this is computed as:
\(\underset{c}{\mathrm{argmax}}\ \Xi := \underset{c\in \boldsymbol {C}}{\mathrm{argmax}}\ \Xi (c, \boldsymbol {F})\), where
\(\boldsymbol {C}\) is the set of all considered classes. All the records are classified in parallel for both modalities to speed up the computation. Using both the PR and PB approaches helps validate the knowledge graphs as an effective representation of the classes, as the connections modify the significance of each flip and drive the classification method.
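Assuming the summation form of \(\Xi\) reconstructed above, the classification of a single record could be sketched as follows; the function and dictionary layout are ours, not CACTUS's internal data structures.

# Sketch: score a record's flips against each class with the chosen significance
# function (PB or PR) and assign the class with the highest similarity.
def classify(record_flips: set, significance_by_class: dict):
    """record_flips: flips present in the record (missing values excluded);
    significance_by_class: {class: {flip: sigma(f|c)}} from the knowledge graphs."""
    def xi(c):
        sig = significance_by_class[c]
        return sum(sig.get(f, 0.0) for f in record_flips)
    return max(significance_by_class, key=xi)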
Another addition to SaNDA is a BA measure for multi-class classification, described below:
\[
\mathrm{BA} = \frac{1}{|\boldsymbol {C}|} \sum _{c \in \boldsymbol {C}} \frac{\omega _c}{|\boldsymbol {E}_c|},
\]
where
\(\omega _c\) is the number of correctly classified instances, and
\(\boldsymbol {E}_c\) is the set containing all the instances of class
c. CACTUS automatically plots how the probability distribution of each feature changes across the classes, giving insights on the variables driving the classification towards particular outcomes. This is reflected in the knowledge graphs, where a flip (e.g.,
\(Age\_U\)) might be much more influential than its twin (e.g.,
\(Age\_D\)) in one class and surrender its influence to it in another; this will cause very different scores when the classification function is applied to the graphs. Using the following equation,
\[
R_{x_f} = \frac{1}{\binom{N}{2}} \sum _{i=1}^{N-1} \sum _{j=i+1}^{N} \left|P(f|c_i) - P(f|c_j)\right|,
\]
we model the rank
\(R_{x_f}\) for each flip
f of a feature,
x, by considering how the conditional probability
\(P(f|c_i)\) of the flip
f given the class
\(c_i\) will change across the
N considered classes. By considering the unordered pairs of states
\((c_i, c_j)\), we do not assume the classes to be ordered, and we average the resulting differences over the number of pairs obtained, given by the combination
\(\binom{N}{2}\). Intuitively, the more the probability distribution of the flips of a marker changes, the more useful that marker is for assigning a class. The rank can then be averaged over the
\(\boldsymbol {F}_x\) flips of the feature
x as
\[
\bar{R}_x = \frac{1}{|\boldsymbol {F}_x|} \sum _{f \in \boldsymbol {F}_x} R_{x_f}.
\]
The average rank describes the effectiveness of the change in the distribution of a feature in differentiating the available classes. For instance, a feature whose distribution does not change significantly across classes is not useful in differentiating between the classes, while features with a strong gap will provide stronger support to one class over the others. If a feature
x has only two flips and no missing values, the average rank,
\(\bar{R}_x\), will correspond to the rank,
\(R_{x_f}\), of any of its flips. When there are categorical variables with more than 2 possible values, the average rank is more informative as some flips might have a constant frequency across all classes, whilst others might change more radically.
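Following the two equations above, the ranking can be sketched as below; the data layout chosen for the probabilities is our own.

# Sketch: the rank of a flip averages the absolute change of P(f|c) over all
# unordered pairs of classes; the feature rank averages over the feature's flips.
from itertools import combinations

def flip_rank(probs: dict) -> float:
    """probs: {class: P(f|c)} for a single flip."""
    pairs = list(combinations(probs.values(), 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)

def feature_rank(flip_probs: dict) -> float:
    """flip_probs: {flip: {class: P(f|c)}} for one feature."""
    ranks = [flip_rank(p) for p in flip_probs.values()]
    return sum(ranks) / len(ranks)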
When enabled, the correlation module computes the correlation for the set of columns retained after preprocessing; otherwise, it is calculated for the whole dataset. In both cases, the original values (and not the abstracted data) are used. A warning is issued if attribute combinations have a correlation of
\(\pm 1\), because they might indicate either redundant or very interesting pairs of features. A correlation graph is also computed, using graph-tool [
35], and the PR and Laplacian scores are computed, along with the same community detection algorithms applied to the knowledge graphs. Self-connections are removed by default from the correlation graph to unclutter it, as they always correspond to 1, and the graph is saved in the graphml format as well. A minimum spanning tree [
36] is computed on the correlation graph to highlight which attributes are the most strongly connected.
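For illustration, the correlation step could be approximated as follows; Pearson correlation and networkx are used for brevity, and the edge weight \(1 - |r|\) chosen for the spanning tree is our assumption rather than CACTUS's exact choice.

# Sketch: compute pairwise correlations on the original (non-abstracted) values,
# warn on |r| = 1 pairs, and extract a minimum spanning tree over 1 - |r|.
import networkx as nx
import pandas as pd

def correlation_graph(df: pd.DataFrame):
    corr = df.corr(numeric_only=True)
    g = nx.Graph()
    cols = list(corr.columns)
    for a in range(len(cols)):
        for b in range(a + 1, len(cols)):        # skip self-connections
            r = corr.iloc[a, b]
            if pd.notna(r):
                if abs(r) == 1.0:
                    print(f"warning: correlation of ±1 for ({cols[a]}, {cols[b]})")
                g.add_edge(cols[a], cols[b], weight=1.0 - abs(r))
    mst = nx.minimum_spanning_tree(g, weight="weight")
    return corr, g, mst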
All these modules produce intermediate outputs that can be integrated, accessed, or processed beyond what is done automatically. This allows further investigation or even interaction with CACTUS. For instance, the accuracies of the flips may be modified by an expert to boost or dampen their significance, and CACTUS can be set to automatically use these values to build the knowledge graphs.
2.1 Experimental Setup
The WDBC dataset contains numerical features extracted from digitised images of fine needle aspirates of breast masses for a total of 569 patients. For the exact description of the features and of the whole dataset, we refer the reader to the original documentation at archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.names. It presents a natural binary classification scenario, where breast cancer is labelled as benign or malignant and characterised by 30 real-valued attributes. For each measured characteristic, the dataset reports its mean value, standard error, and worst value. None of the attributes contains NaNs.
The Thyroid dataset consists of 9172 records with 29 attributes, of which 21 are binary, 7 are continuous, and the last one is the referral source. The Diagnosis column is a string encoding several aspects of the patients' conditions, namely hyperthyroid and hypothyroid status, increased/decreased binding protein, general health, and treatments undertaken. To perform a classification task, we consider patients with a “concurrent non-thyroidal illness” (letter K), a hyperthyroid condition (letters A, B, C, D), or a hypothyroid condition (letters E, F, G, H), and we condense them into the healthy (0), hyperthyroid (1), and hypothyroid (2) classes. This method for reducing the number of classes was taken from the previous experiments listed in the dataset description, and it shrinks the dataset to 1371 records. The Thyroid dataset also contains dual diagnoses in the form \(X|Y\), which is interpreted as “consistent with X, but Y is more likely”. Therefore, when this ambiguity was present, we assigned Y if it was one of the considered letters (A, B, C, D, E, F, G, H, K); otherwise, we assigned X if X was valid; otherwise, we ignored the record.
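A minimal sketch of this condensation is given below; the parsing of the Diagnosis string is simplified and the helper names are ours.

# Sketch: map hyperthyroid letters (A-D) to 1, hypothyroid letters (E-H) to 2,
# concurrent non-thyroidal illness (K) to 0; for a dual diagnosis "X|Y", prefer
# Y when valid, then X, otherwise discard the record (return None).
HYPER, HYPO, HEALTHY = set("ABCD"), set("EFGH"), {"K"}
VALID = HYPER | HYPO | HEALTHY

def condense(diagnosis: str):
    parts = diagnosis.split("|")
    for part in reversed(parts):                 # Y takes precedence over X
        letters = [c for c in part if c in VALID]
        if letters:
            c = letters[0]
            return 0 if c in HEALTHY else 1 if c in HYPER else 2
    return None                                  # record ignored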
The Mushroom primary dataset [
43] includes 173 species of mushrooms described by 17 nominal and 3 metrical variables. The nominal variables can contain multiple values per record, for instance several cap colours. To address this, they were split into multiple binary columns, for example cap_colour_red and cap_colour_brown. We therefore obtained a total of 101 attributes, 98 of which are categorical. The assigned class for each mushroom indicates whether it is edible, not edible, or of unknown edibility.
The Cleveland Heart Disease dataset [
21] contains 13 attributes, including 7 categorical variables, associated with a diagnosed heart disease severity ranging from 0 (healthy) to 4. This dataset comes from one of the four centres that cooperated in a study on heart diseases. We specifically chose to focus on the Cleveland dataset as it is reported to be the only one used in ML research to date; moreover, it was already preprocessed and ready to use.
The Adult Income dataset [
5] is derived from the 1994 American census and is used to classify individuals into those earning more or less than fifty thousand USD per year. It contains eight categorical and six continuous features.
The experimental setup considered the tolerance of both traditional ML models and CACTUS to missing values. For each dataset, we removed from every feature a percentage of its existing values. Therefore, if a feature already contained missing values, we considered only the existing ones for removal. The percentages of additional missing values we considered were
\(20\%\),
\(40\%\),
\(60\%,\) and
\(80\%\). This process resulted in 25 datasets used to compare traditional ML methods and CACTUS. The classifiers chosen for the comparison were Ridge [
22],
Logistic Regression (
LR) [
23],
Support Vector Machine (
SVM) [
10],
Stochastic Gradient Descent (
SGD) classifier [
20], and
Random Forest (
RF) [
7]. Since these models are inherently stochastic, they were trained using a 10-fold cross-validation technique on a 60/40% train/test split. CACTUS, being deterministic in how it computes the significance of features, does not present variability across multiple runs. In addition, CACTUS does not perform learning by fitting the data to a mathematical function, and therefore it does not need a test phase. Imputation tools, such as
IterativeImputer and
KNNImputer from scikit-learn, were not considered, as incorporating them could make the results harder to generalise and prone to inaccuracies if the imputing algorithm fails to extract meaningful patterns. Instances where imputation may not be reliable include medical datasets, where variables such as smoking status are clearly independent of genetic data, or sets of independent measures collected at different time points or through different modalities.
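The injection of additional missing values can be sketched as follows; the fixed random seed and the exclusion of the class column are our assumptions for illustration.

# Sketch: for each feature, replace a fixed fraction of its *existing* values
# with NaN, leaving already-missing entries untouched.
import numpy as np
import pandas as pd

def inject_missing(df: pd.DataFrame, fraction: float, target: str, seed: int = 0):
    rng = np.random.default_rng(seed)
    out = df.copy()
    for col in out.columns.drop(target):         # never corrupt the class column
        present = out.index[out[col].notna()].to_numpy()
        k = int(len(present) * fraction)
        out.loc[rng.choice(present, size=k, replace=False), col] = np.nan
    return out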
3 Results
In this section, we present the results and the knowledge graphs for each dataset. For representation purposes, the knowledge graphs display only nodes with at least one valid connection to the rest of the network and only the four strongest connections from a flip to the others. However, the PR and community computations consider all connections.
Table
2 lists the BA obtained by all the considered ML techniques and CACTUS methods. The additional metrics of precision, sensitivity, and specificity are shown in Tables
1,
3, and
4, respectively.
Some of the high-dimensional categorical features in the Adult Income dataset, namely the native country, occupation, education, and marital status of an individual, caused a marked decrease in PR accuracy, as shown in Table
2. If these features were not treated as categorical, the PR and PB accuracies would be more balanced, as shown in Table
5. The reasons for this decrease are further discussed in the Discussion section.
Figures
3 and
4 display the strongest connections in the knowledge graphs for the benign and the malignant breast cancers in the WDBC dataset. The communities obtained by the Greedy and Louvain algorithms are shown in Figures S1 and S2, respectively. The label propagation algorithm did not yield multiple communities and was subsequently omitted. In Figure
5, the probabilities of each marker across the benign (0) and malignant (1) breast cancers enable the computation of each variable’s importance in discriminating the available classes, by applying Equation (
7). Figure S4 illustrates the correlation among the original features and the
Diagnosis column, and Figure S3 displays the decision tree. Due to space constraints, the knowledge graphs of the other datasets are showcased in the supplementary material.
Figure S5 presents how the connections change across the healthy, hypothyroid, and hyperthyroid classes. This is also visible in the communities created by the Greedy, Label Propagation, and Louvain algorithms, depicted in Figures S6, S8, and S7, respectively. The individual changes in the probability distributions of the features are shown in Figure
6. Figure S10 emphasises the correlation patterns between the attributes, while Figure S9 provides the decision tree.
The representation of edible and poisonous mushrooms in the Mushroom dataset obtained from CACTUS is displayed in Figure S11. The communities derived from them using the Greedy, Label Propagation, and Louvain algorithms are presented in Figures S12, S14, and S13, respectively, with specific alterations between them outlined in Figure
7. The auxiliary correlation matrix and decision tree are illustrated in Figures S16 and S15.
The knowledge graphs from the Heart Disease dataset (Figure S17) are much harder to compare due to the volume of data. The communities generated by the Greedy and Louvain algorithms are listed in Figures S18 and S19. Label Propagation did not return different structures and was therefore ignored. Figure
8 shows the rank of each feature and helps in the recognition of the important transitions and differences between them. Figures S21 and S20 portray the correlation and decision tree.
The knowledge graphs for the Adult Income dataset are illustrated in Figure S22 and describe the main characteristics of the two classes.