research-article

Open access

CACTUS: A Comprehensive Abstraction and Classification Tool for Uncovering Structures

Authors:

Jose SousaAuthors Info & Claims

ACM Transactions on Intelligent Systems and Technology, Volume 15, Issue 3

Article No.: 46, Pages 1 - 23

https://doi.org/10.1145/3649459

Published: 15 April 2024 Publication History

PDF eReader

Abstract

The availability of large datasets is providing the impetus for driving many current artificial intelligent developments. However, specific challenges arise in developing solutions that exploit small datasets, mainly due to practical and cost-effective deployment issues, as well as the opacity of deep learning models. To address this, the Comprehensive Abstraction and Classification Tool for Uncovering Structures (CACTUS) is presented as a means of improving secure analytics by effectively employing explainable artificial intelligence. CACTUS achieves this by providing additional support for categorical attributes, preserving their original meaning, optimising memory usage, and speeding up the computation through parallelisation. It exposes to the user the frequency of the attributes in each class and ranks them by their discriminative power. Performance is assessed by applying it to various domains, including Wisconsin Diagnostic Breast Cancer, Thyroid0387, Mushroom, Cleveland Heart Disease, and Adult Income datasets.

1 Introduction

New computational capabilities and high volumes of data have propelled the development and adoption of new artificial intelligence (AI) algorithms. The main protagonist of the machine learning (ML) domain has undoubtedly been the deep learning (DL) approach, which has provided impressive performance in a wide range of sectors. Indeed, these approaches are now extensively applied for task automation, yielding performance that favourably compares to human operation. However, many challenges exist in sustaining the performance of DL approaches. These models necessitate extensive datasets for proper training, imposing significant demands on storage and computational power. The associated costs are estimated to be in millions [3, 25, 39, 40]. They are also extremely hungry for data, making it challenging to apply these models to small datasets due to practical and cost-effective deployment issues. The opacity of “black box” DL models makes them hard to trust, especially in sensitive fields like medicine. These aspects contrast with the “right to explainability” ambition defined in the recital 71 of the General Data Protection Regulation (GDPR) of the European Union and in the algorithmic accountability act decreed by the US congress (H.R. 6580 of 2022) [14, 29].

The rising interest in Explainable AI (XAI) is formulated in terms of interpretability and explainability [15]. Drawing from multi-agent systems, interpretability is defined as the process of assigning a subjective meaning to a feature or aspect, while explainability is concerned with transforming an object into easily or effectively interpretable objects [8]. Some methods aim at achieving the latter by attempting to peer into the black box [17] through post-hoc models [24]. They try to produce explanations by applying explainable-by-design models, such as decision trees on neural networks, to approximate their behaviour and prove their robustness either locally or globally. Other techniques, as seen in [26, 27, 45], perform feature selection to highlight the most important information for classifying or separating two or more groups. Angelov et al. [1] proposed new DL architectures to augment their explainability or perform meta-analyses on their functioning [2]. The tradeoff between explainability and accuracy has been widely discussed in [28], but one point in favour of emphasising the former is the greater potential for improvement and testing in methods that are easier to understand.

In our previous work on a Small and iNcomplete Dataset Analyser (SaNDA) [19], we addressed these challenges by employing abstractions to: (i) enhance privacy; (ii) facilitate the deployment of statistical approaches; (iii) share anonymous data among different research centres; (iv) construct knowledge graphs to represent the information extracted from the data. A key contribution was the use of abstractions to partition the set of values of an attribute into Up and Down flips, representing them in a more intuitive, probabilistic, and anonymous format. In this article, we further exploit this concept by creating the Comprehensive Abstraction and Classification Tool for Uncovering Structures (CACTUS). CACTUS supports additional flips for categorical attributes whilst preserving their original meaning, optimising memory usage, accelerating computation through parallelisation, and providing explanations of its internal reasoning. The architecture of CACTUS incorporates the original concept of SaNDA but in a broader manner, enabling it to automatically stratify and binarise for multiple configurations simulaneously. This allows the creation of binary decision trees and correlation matrices along with the utilisation of abstractions. Additionally, it facilitates the storage of intermediate results for easy revision and integration with other tools. In addition, the output structure of SaNDA has been improved and rationalised to transform it into a research tool that can be easily and broadly used by the wider scientific community.

In order to demonstrate and evaluate the work, we use several datasets to gauge and showcase the performance and ease of use of CACTUS. They include the Wisconsin Diagnostic Breast Cancer (WDBC) [12, 44], Thyroid0387 [37], Cleveland Heart Disease (hereafter referred to simply as Heart Disease) [21], and the Adult Income datasets [5], each available at archive.ics.uci.edu/ml/. This work makes a number of contributions:

—

Presents a faster and lighter data abstraction and knowledge graph creation process, facilitating quicker computations and generating more outputs. This, in turn, provides additional resources for users to explore the dataset.

—

Addresses categorical features, which are treated as continuous values by traditional ML models despite the semantic information they contain. To our knowledge, this is the first model to preserve and use the semantic meaning of categorical information together with continuous variables.

—

Introduces additional metrics to evaluate the classification process, which provides a notion of balanced accuracy (BA) for multi-class classifications. This allows for fairer comparisons with other models.

—

Conducts auxiliary analyses of the dataset to identify patterns and contrast them with insights provided by the classification process. This includes ranking each feature and quantifying their importance, fundamental for assessing the model’s classification and providing insights to the user.

The article is organised as follows: Section 2 covers the mechanism used for abstracting values. Section 3 demonstrates how CACTUS abstracts and extracts information form the considered datasets. In Section 4, we present the operation of a meta-analysis to interpret its internal reasoning, along with a comparison of the approach to existing literature.

2 Methods

CACTUS has been implemented in Python3 [41], using scientific libraries such as numpy [18] and pandas [32] to achieve high performance while remaining easy to debug and understand. It offers users various options for handling a given dataset, such as replacing values, dropping certain columns, and considering what values represent a missing value (or a not-a-number, NaN). This information is then encapsulated in a YAML configuration file, enabling customisation of computations and facilitating the execution of different analyses simultaneously. CACTUS can be compiled in Cython, enabling better parallelisation and speed. It also provides a command line interface with a convenient flag to switch from multi-class to binary classification, by specifying how to binarise the target label. Both the binarisation and the stratification functionality can be specified multiple times and switched in a seamless manner, performing several runs in a single execution. Attributes for stratification have to be integer and must have less than 5 unique values.

The pipeline implemented in CACTUS is organised into three independent functional modules: decision tree, abstraction, and correlation, as described in Figure 1.

Fig. 1.

The decision tree module computes the binary decision tree on the dataset. It eases the correlation by using scikit-learn [33] to assess the combinations of columns for which the correlation can be computed. Performing this pre-selection can speed up the correlation phase if the dataset presents particular patterns of missing values; it may also make the correlation matrix less sparse. The user can choose the criteria used for splitting a node of the tree between gini coefficient [11], Shannon’s entropy [30], and log-loss [42].

The abstraction module is detailed in Figure 2. The first step is the partitioning of continuous values into discrete categories using the receiver operating characteristic (ROC) curve. If there are multiple classes, the user has to set a threshold to divide them into two populations to build the ROC curve. This assumes that the classes are ordered and related, such as the gravity of a disease or the income range of an individual. Let x be a continuous feature that we want to partition, \(V(x)\) be the set of unique values of the feature x, and \(v_{max} = \mathrm{argmax}\underset{v \in V(x)}{\ }BA(x, v)\) be the value of x yielding the highest BA in separating the two populations. The feature, x, will then be partitioned as

\begin{equation} v = {\left\lbrace \begin{array}{ll} x\_U\ \mathrm{if}\ v \gt v_{max} \\ x\_D\ \mathrm{if}\ v \le v_{max}. \end{array}\right.} \end{equation}

(1)

Fig. 2.

Unlike SaNDA [19], CACTUS does not alter the categorical features, thus preserving their original meaning. If A is a categorical attribute assuming values \(\lbrace 1, 2, \dots i\rbrace\), then its states will be coded in the flips \(\lbrace A\_1, A\_2, \dots A\_i\rbrace\). An attribute is recognised automatically as categorical, if it has less than 5 unique values or is explicitly marked as such in the configuration file. This allows users to define an attribute as categorical if it does not spontaneously fit into our definition. This can be useful for attributes that assume few non-integer values, for example when they contain high percentages of NaNs, or more than 5 unique values, permitting a finer granularity and a more effective stratification. In SaNDA, all attributes were simply labeled as \(\_U\) and \(\_D\), and thus their semantic meaning (e.g., Genetic variables) was lost.

CACTUS then performs a filtration to discard features with an excessive error margin (\(\mathrm{EM}\)). Statistics calculated on a sample can differ when applied to the population it represents. Estimating this difference is important for selecting what is reliable and applicable beyond the sample. The notion of margin of error allows for an estimation of the difference for the specific confidence interval. If Z is the z-score representing the confidence interval that we want to use for estimating \(\mathrm{EM}\), and \(p_x\) is the BA achieved by the \(v_{max}\) value of feature x, and N is the sample size, then the error margin is then computed as

\begin{equation} \mathrm{EM}(x) = Z \times \sqrt {\frac{p_x(1-p_x)}{N}}. \end{equation}

(2)

By default, CACTUS applies a z-score of 2.33, corresponding to a confidence interval of \(99\%\), and removes features for which the error margin exceeds \(v_{max}\). Given the lower accuracy of categorical features having multiple flips, this filtration is applied only to the previously-continuous features that were partitioned in \(\_U\) and \(\_D\). The z-score can be customised by the user to make it more or less relaxed.

Having partitioned all the continuous features into discrete objects, we can link all the flips into a knowledge graph representing the interactions inside each class. The nodes represent flips of each feature, and the connections represent conditional probabilities. A flip, f, is equipped by a frequency in a given class c, \(P(f|c)\), which indicates whether it is more or less likely to be present than its siblings. For example, an old person (\(Age\_U\)) could be more likely to be sick than a young one (\(Age\_D\)). We then connect two flips, \(f_i\) and \(f_j\), in a connectivity matrix M by

\begin{equation} M_{i,j} = \frac{P(f_i|c) \cdot P(f_j|c)}{P(f_j|c)}. \end{equation}

(3)

This connection is directional, meaning that \(M_{j, i}\) will not necessarily have the same weight. This allows for exploiting PageRank [31] to identify central flips in each class as the elements which are more intensively referred to by other nodes. CACTUS considers the significance of a flip f in a class c as either considering its frequency, \(P(f|c)\), alone, or combining it with the PageRank score \(\mathrm{PR}(f,c)\) through multiplication; these two definitions of significance are used in parallel and independently in a Probabilistic (PB) and PageRank (PR) approach to classify the records of the dataset.

The graph-tool library [35], already present in SaNDA, has been retained for building the knowledge graphs. However, the stochastic community detection algorithm, previously employed in [34], has been substituted with the greedy [9], label propagation [38], and Louvain [6] deterministic approaches offered by the networkX library [16]. This alteration aims to achieve more stable structures that can be evaluated using the notions of modularity, coverage, and performance [9, 13]. These metrics are automatically computed, compared, and stored during the process.

Previously in SaNDA [19], adjacencies in the knowledge graphs were stored and computed using adjacency lists. However, these proved to be cumbersome and memory-consuming, particularly for large datasets. We addressed this limitation by computing these connections on-the-fly, thereby saving disk and memory space. This also reduces the total runtime, as connections are computed for each graph, only once and in parallel; the connection weights are saved then in a comma-separated value (CSV) file, which is a much more convenient representation than adjacency lists. The graphs are stored in graphml format, which allows for additional analyses using different software, such as Gephi [4].

After building the knowledge graphs, CACTUS classifies each record individually and compares the two available significance metrics, PR and PB, against the ground truth. Using two classification mediums allows for consideration of the contribution of an attribute alone (PB) as well as its interaction with the other elements (PR). The function, \(\Xi\), computes the similarity between the considered entry and a class c, and is given as

\begin{equation} \Xi (c, \boldsymbol {F}) = \prod _{f\in \boldsymbol {F}} \sigma (f|c), \end{equation}

(4)

where \(\boldsymbol {F}\) is the set of flips of an entry, c is one of the considered classes, and \(\sigma (f|c)\) is a function returning the significance of the flip f for the class c, following the PR or PB approaches previously defined. Notably, the patient’s missing entries are excluded by the set \(\boldsymbol {F}\). The classification assigns the entry to the class to which it is the most similar; this is computed as: \(\mathrm{argmax}\underset{c}{\ }\Xi := \mathrm{argmax}\underset{c\in \boldsymbol {C}}{\ }\Xi (c, \boldsymbol {F})\), where \(\boldsymbol {C}\) is defined as the set of all classes that are considered. All the patients are classified in parallel for both modalities to speed up the computation. Using both the PR and PB approaches helps validate the knowledge graphs as an effective representation of the classes, as connections modify the significance of each flip and drive the classification method.

Another addition to SaNDA is a BA measure for multi-class classification, which is described below,

\begin{equation} \mathrm{BA} = \frac{1}{|\boldsymbol {C}|} \sum _{c \in \boldsymbol {C}} \frac{\omega _c}{|\boldsymbol {E}_c|}. \end{equation}

(5)

where \(\omega _c\) is the number of correctly classified instances, and \(\boldsymbol {E}_c\) is the set containing all the instances of class c. CACTUS automatically plots how the probability distribution of each feature changes across the classes, giving insights on the variables driving the classification towards particular outcomes. This is reflected in the knowledge graphs, where a flip (e.g., \(Age\_U\)) might be much more influential than its twin (e.g., \(Age\_D\)) in one class and surrender its influence to it in the next one; this will cause very different scores when the classification function is applied to the graphs. Using the following equation:

\begin{equation} R_{x_f} = \frac{\sum _{i=1}^{N} \sum _{j\gt i}^{N} |P(x_f|c_i) - P(x_f|c_j)|}{_NC_2}, \end{equation}

(6)

we model the rank \(R_{x_f}\) for each flip f of a feature, x, by considering how the conditional probability \(P(f|c_i)\) of the flip f given the class \(c_i\) will change across the N considered classes. By considering the unordered pairs of states \((c_i, c_j)\), we do not assume the classes to be ordered, and can average them using the number of pairs obtained, given by the combination, \(_NC_2\). Intuitively, the more the probability distribution of the flips of a marker changes, the more useful the marker for assigning a class. The rank can then be averaged over the \(\boldsymbol {F}_x\) flips of the feature x as

\begin{equation} \bar{R}_x = \frac{1}{|\boldsymbol {F}_x|}\sum _{f \in \boldsymbol {F}_x} R_{x_f}. \end{equation}

(7)

The average rank describes the effectiveness of the change in the distribution of a feature in differentiating the available classes. For instance, a feature whose distribution does not change significantly across classes is not useful in differentiating between the classes, while features with a strong gap will provide stronger support to one class over the others. If a feature x has only two flips and no missing values, the average rank, \(\bar{R}_x\), will correspond to the rank, \(R_{x_f}\), of any of its flips. When there are categorical variables with more than 2 possible values, the average rank is more informative as some flips might have a constant frequency across all classes, whilst others might change more radically.

When enabled, the correlation module computes the correlation for the set of columns retained after preprocessing; otherwise, it is calculated for the whole dataset. In both cases, the original values (and not abstracted data), are used. A warning is issued if attribute combinations have a correlation of \(\pm 1\), because they might indicate either redundant or very interesting pairs of features. A correlation graph is also computed, using graph-tool [35], and the PR and Laplacian scores are computed along with the community detection algorithms applied to the knowledge graphs. The self-connections will be removed by default from the correlation graph to unclutter it, as they will always correspond to 1, and the graph will be saved in the graphml format as well. A minimum spanning tree [36] is computed on the correlation graph to highlight which attributes are better connected.

All these modules produce intermediate outputs that can be integrated, accessed, or elaborated differently from what is automatically done. This allows further investigations or even interaction with CACTUS. For instance, the accuracies of the flips may be modified by an expert to boost or dampen their significance and CACTUS can be set to automatically use these values to build the knowledge graphs.

2.1 Experimental Setup

The WDBC dataset contains numerical features extracted from digitalised images of fine needle aspirates of breast mass for a total of 569 patients. Enquiries about the exact description of the features and of the whole dataset are readdressed in its original documentation on archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.names. It presents a natural, binary classification scenario, where breast cancer is labelled as benign or malignant and characterised by 30 real-value attributes. These attributes can either be actual measures, or their standard error and worst value. None of the attributes presents NaNs.

The Thyroid dataset consists of 9172 records composed of 29 attributes, 21 of which are binary, 7 are continuous, and the last one is the referral source. The Diagnosis column consists of a string containing several features regarding the patients’ conditions, namely hyperthyroid and hypothyroid status, increased/decreased binding protein, general health, and treatments that have been undertaken. To perform a classification task, we consider patients with a “concurrent non-thyroidal illness” (K), hyperthyroid (letters A, B, C, D), or hypothyroid (letters E, F, G, H) conditions, and we condense them into healthy (0), hyperthyroid (1), and hypothyroid (2) classes. This method for reducing the amount of classes was taken from the previous experiments listed in the dataset description. This shrinks the records to 1371. The Thyroid dataset also contains dual diagnosis in the form \(X|Y\), which is interpreted as “consistent to X, but Y is more likely”. Therefore, when this ambiguity was present, we assigned Y if it was one of the considered letters (A, B, C, D, E, F, G, H, K). We assigned X if X was valid. Otherwise, we ignored the record.

The Mushroom primary dataset [43] includes 173 species of mushrooms described by 17 nominal and 3 metrical variables. The nominal variables contain multiple values, for instance, the colours of the cap. To address this, they were split into multiple binary columns, for example cap_colour_red and cap_colour_brown. We therefore obtained a total of 101 attributes, with 98 being categorical. The assigned class for each mushroom indicates whether it is edible, not edible, or unknown.

The Cleveland Heart Disease dataset [21] contains 13 attributes, including 7 categorical variables associated with diagnosed heart disease ranging from 0 (healthy) to 4. This dataset is one of the four centres that cooperated in a study on heart diseases. We specifically chose to focus on the Cleveland dataset as it is reported to be the only one used in ML research to date. Moreover, it was already preprocessed and ready to use.

The Adult Income dataset [5] is derived from an American census conducted in 1994 to classify individuals into those earning more and less than fifty thousand USD per year. It contains eight categorical and six continuous features.

The experimental setup considered the tolerance of both traditional ML models and CACTUS to missing values. For each dataset, we removed from every feature a percentage of its existing values. Therefore, if a feature already contained missing values, we considered only the existing ones for removal. The percentages of additional missing values we considered were \(20\%\), \(40\%\), \(60\%,\) and \(80\%\). This process resulted in 25 datasets used to compare traditional ML methods and CACTUS. The classifiers chosen for the comparison were Ridge [22], Logistic Regression (LR) [23], Support Vector Machine (SVM) [10], Stochastic Gradient Descent (SGD) classifier [20], and Random Forest (RF) [7]. Since these models are inherently stochastic, they were trained using a 10-fold cross-validation technique on a 60/40% train/test split. CACTUS, being deterministic in how it computes the significance of features, did not present variability on multiple runs. In addition, CACTUS does not perform learning by fitting the data to a mathematical function, therefore it does not need to have a test phase. Imputation tools, such as IterativeImputer and KNNImputer from scikit-learn, were not considered. Furthermore, incorporating imputation tools could introduce results harder to generalise and be prone to inaccuracies if the imputing algorithm fails to extract meaningful patterns. Instances of when an imputation could not be reliable include medical datasets, where variables such as smoking status are obviously independent of genetic data, or sets of independent measures collected in different time points or modalities.

3 Results

In this section, we present the results and the knowledge graphs for each dataset. For representation purposes, the knowledge graphs display only nodes with at least one valid connection to the rest of the network and only the four strongest connections from a flip to the others. However, the PR and community computations consider all connections.

Table 2 lists the BA obtained by all the considered ML techniques and CACTUS methods. The additional metrics of precision, sensitivity, and specificity are shown in Tables 1, 3, and 4, respectively.

Table 1.

	Ridge	LR	SVM	SGD	RF	CPR	CPB
WDBC	0.99\(\pm\)0.01	0.97\(\pm\)0.02	0.97\(\pm\)0.01	0.95\(\pm\)0.03	0.96\(\pm\)0.01	0.97	0.97
20%	0.95\(\pm\)0.01	0.9\(\pm\)0.03	0.93\(\pm\)0.02	0.87\(\pm\)0.05	0.94\(\pm\)0.02	0.97	0.97
40%	0.94\(\pm\)0.02	0.85\(\pm\)0.04	0.91\(\pm\)0.03	0.84\(\pm\)0.04	0.91\(\pm\)0.03	0.97	0.96
60%	0.91\(\pm\)0.03	0.84\(\pm\)0.04	0.88\(\pm\)0.04	0.74\(\pm\)0.04	0.89\(\pm\)0.03	0.96	0.96
80%	0.82\(\pm\)0.05	0.72\(\pm\)0.05	0.83\(\pm\)0.05	0.65\(\pm\)0.07	0.86\(\pm\)0.03	0.95	0.95
Mushroom	0.52\(\pm\)0.19	0.52\(\pm\)0.06	0.52\(\pm\)0.06	0.58\(\pm\)0.11	0.55\(\pm\)0.09	0.8	0.8
20%	0.28\(\pm\)0.32	0.45\(\pm\)0.08	0.53\(\pm\)0.11	0.45\(\pm\)0.08	0.53\(\pm\)0.08	0.76	0.76
40%	0.46\(\pm\)0.18	0.52\(\pm\)0.06	0.52\(\pm\)0.09	0.46\(\pm\)0.1	0.57\(\pm\)0.08	0.76	0.76
60%	0.15\(\pm\)0.19	0.48\(\pm\)0.11	0.39\(\pm\)0.24	0.45\(\pm\)0.07	0.4\(\pm\)0.06	0.85	0.85
80%	0.11\(\pm\)0.17	0.5\(\pm\)0.07	0.6\(\pm\)0.17	0.48\(\pm\)0.08	0.47\(\pm\)0.08	0.94	0.94
Adult Income	0.74\(\pm\)0.01	0.72\(\pm\)0.01	0.77\(\pm\)0.01	0.7\(\pm\)0.02	0.74\(\pm\)0.01	0.77	0.95
20%	0.76\(\pm\)0.01	0.76\(\pm\)0.01	0.76\(\pm\)0.01	0.74\(\pm\)0.04	0.74\(\pm\)0.01	0.77	0.94
40%	0.72\(\pm\)0.02	0.76\(\pm\)0.01	0.76\(\pm\)0.01	0.74\(\pm\)0.05	0.71\(\pm\)0.01	0.77	0.93
60%	0.7\(\pm\)0.02	0.75\(\pm\)0.02	0.78\(\pm\)0.01	0.7\(\pm\)0.06	0.6\(\pm\)0.01	0.78	0.91
80%	0.7\(\pm\)0.05	0.75\(\pm\)0.03	0.8\(\pm\)0.03	0.7\(\pm\)0.07	0.42\(\pm\)0.01	0.79	0.88

Table 1. Precision Achieved by All the Considered Models for the Original and Perturbed Datasets \(\pm\) Their Standard Deviation

The Thyroid and Heart Disease datasets have been omitted, since they contain more than two classes and the precision cannot be computed on them. The highest values for each dataset have been highlighted in bold. The PR classification in CACTUS is defined as CPR, while the PB as CPB.

Table 2.

Table 3.

	Ridge	LR	SVM	SGD	RF	CPR	CPB
WDBC	0.88\(\pm\)0.03	0.95\(\pm\)0.02	0.96\(\pm\)0.02	0.95\(\pm\)0.03	0.94\(\pm\)0.02	0.96	0.96
20%	0.79\(\pm\)0.04	0.86\(\pm\)0.04	0.9\(\pm\)0.04	0.87\(\pm\)0.04	0.92\(\pm\)0.03	0.94	0.94
40%	0.7\(\pm\)0.05	0.82\(\pm\)0.05	0.84\(\pm\)0.04	0.81\(\pm\)0.05	0.91\(\pm\)0.03	0.93	0.93
60%	0.62\(\pm\)0.05	0.74\(\pm\)0.06	0.72\(\pm\)0.03	0.68\(\pm\)0.06	0.89\(\pm\)0.03	0.94	0.94
80%	0.52\(\pm\)0.03	0.6\(\pm\)0.03	0.71\(\pm\)0.03	0.61\(\pm\)0.05	0.85\(\pm\)0.04	0.94	0.94
Mushroom	0.35\(\pm\)0.17	0.48\(\pm\)0.12	0.29\(\pm\)0.13	0.47\(\pm\)0.1	0.5\(\pm\)0.14	0.8	0.8
20%	0.22\(\pm\)0.29	0.5\(\pm\)0.1	0.37\(\pm\)0.17	0.49\(\pm\)0.09	0.53\(\pm\)0.16	0.84	0.84
40%	0.28\(\pm\)0.24	0.58\(\pm\)0.07	0.32\(\pm\)0.2	0.43\(\pm\)0.13	0.4\(\pm\)0.15	0.96	0.96
60%	0.13\(\pm\)0.2	0.42\(\pm\)0.1	0.13\(\pm\)0.09	0.44\(\pm\)0.09	0.3\(\pm\)0.08	0.84	0.84
80%	0.08\(\pm\)0.18	0.51\(\pm\)0.09	0.14\(\pm\)0.07	0.46\(\pm\)0.06	0.46\(\pm\)0.08	0.62	0.71
Adult Income	0.34\(\pm\)0.0	0.46\(\pm\)0.01	0.53\(\pm\)0.01	0.47\(\pm\)0.02	0.62\(\pm\)0.01	1.0	0.76
20%	0.12\(\pm\)0.01	0.24\(\pm\)0.01	0.41\(\pm\)0.01	0.23\(\pm\)0.03	0.54\(\pm\)0.01	1.0	0.75
40%	0.07\(\pm\)0.0	0.16\(\pm\)0.01	0.27\(\pm\)0.01	0.17\(\pm\)0.02	0.46\(\pm\)0.01	1.0	0.74
60%	0.04\(\pm\)0.0	0.1\(\pm\)0.0	0.16\(\pm\)0.01	0.11\(\pm\)0.02	0.38\(\pm\)0.01	0.98	0.72
80%	0.02\(\pm\)0.0	0.05\(\pm\)0.0	0.06\(\pm\)0.0	0.06\(\pm\)0.01	0.33\(\pm\)0.01	0.94	0.7

Table 3. Sensitivity Achieved by All the Considered Models for the Original and Perturbed Datasets \(\pm\) Their Standard Deviation

The Thyroid and Heart Disease datasets have been excluded from the analysis due to containing more than two classes, making it impossible to compute sensitivity on them. The highest values for each dataset have been highlighted in bold.

Table 4.

	Ridge	LR	SVM	SGD	RF	CPR	CPB
WDBC	0.99\(\pm\)0.01	0.98\(\pm\)0.01	0.98\(\pm\)0.01	0.97\(\pm\)0.01	0.97\(\pm\)0.01	0.95	0.96
20%	0.97\(\pm\)0.01	0.94\(\pm\)0.02	0.96\(\pm\)0.01	0.92\(\pm\)0.03	0.96\(\pm\)0.02	0.94	0.94
40%	0.97\(\pm\)0.01	0.91\(\pm\)0.03	0.95\(\pm\)0.02	0.9\(\pm\)0.02	0.95\(\pm\)0.02	0.94	0.94
60%	0.96\(\pm\)0.01	0.91\(\pm\)0.03	0.94\(\pm\)0.02	0.85\(\pm\)0.04	0.94\(\pm\)0.02	0.93	0.93
80%	0.93\(\pm\)0.02	0.86\(\pm\)0.03	0.91\(\pm\)0.03	0.81\(\pm\)0.05	0.92\(\pm\)0.02	0.92	0.92
Mushroom	0.79\(\pm\)0.14	0.64\(\pm\)0.08	0.77\(\pm\)0.11	0.7\(\pm\)0.11	0.66\(\pm\)0.11	0.75	0.75
20%	0.83\(\pm\)0.26	0.56\(\pm\)0.09	0.75\(\pm\)0.13	0.56\(\pm\)0.1	0.65\(\pm\)0.12	0.66	0.66
40%	0.79\(\pm\)0.19	0.57\(\pm\)0.07	0.74\(\pm\)0.17	0.63\(\pm\)0.07	0.75\(\pm\)0.09	0.62	0.62
60%	0.87\(\pm\)0.19	0.62\(\pm\)0.11	0.82\(\pm\)0.15	0.56\(\pm\)0.09	0.65\(\pm\)0.1	0.82	0.82
80%	0.92\(\pm\)0.18	0.57\(\pm\)0.08	0.91\(\pm\)0.06	0.59\(\pm\)0.09	0.59\(\pm\)0.11	0.95	0.95
Adult Income	0.96\(\pm\)0.0	0.94\(\pm\)0.0	0.95\(\pm\)0.0	0.94\(\pm\)0.01	0.93\(\pm\)0.0	0.03	0.86
20%	0.99\(\pm\)0.0	0.98\(\pm\)0.0	0.96\(\pm\)0.0	0.97\(\pm\)0.01	0.94\(\pm\)0.0	0.06	0.84
40%	0.99\(\pm\)0.0	0.98\(\pm\)0.0	0.97\(\pm\)0.0	0.98\(\pm\)0.01	0.94\(\pm\)0.0	0.08	0.82
60%	0.99\(\pm\)0.0	0.99\(\pm\)0.0	0.99\(\pm\)0.0	0.98\(\pm\)0.01	0.92\(\pm\)0.0	0.12	0.79
80%	1.0\(\pm\)0.0	0.99\(\pm\)0.0	1.0\(\pm\)0.0	0.99\(\pm\)0.0	0.86\(\pm\)0.0	0.2	0.69

Table 4. Specificity Achieved by All the Considered Models for the Original and Perturbed Datasets \(\pm\) Their Standard Deviation

The Thyroid and Heart Disease datasets have been omitted since they contain more than two classes and the specificity cannot be computed on them. The highest values for each dataset have been highlighted in bold.

Some of the high dimensional categorical features in the Adult Income dataset, namely the native country, occupation, education, and marital status of an individual, caused a strong decrement of PR accuracy, as shown in Table 2. If these features were not treated as categorical, the PR and PB accuracies would be more balanced, as shown in Table 5. The reasons for this decrease are further discussed in the Discussion section.

Table 5.

	CPR	CPB
Adult Income	0.77	0.77
20%	0.76	0.76
40%	0.74	0.74
60%	0.71	0.71
80%	0.65	0.65

Table 5. BA on the Adult Income Dataset When not Forcing the Education, Marital Status, Occupation, and Native Country Information to be Processed as Categorical

PR CACTUS has better performance than in Table 2, but PB CACTUS loses some points and the outputs are less explainable. Other ML methods applied to Table 2 remain unchanged as they never consider a feature as categorical.

Figures 3 and 4 display the strongest connections in the knowledge graphs for the benign and the malignant breast cancers in the WDBC dataset. The communities obtained by the Greedy and Louvain algorithms are shown in Figures S1 and S2, respectively. The label propagation algorithm did not yield multiple communities and was subsequently omitted. In Figure 5, the probabilities of each marker across the benign (0) and malignant (1) breast cancers enable the computation of each variable’s importance and the discrimination of the available classes, by applying Equation (7). Figure S4 illustrates the correlation among the original features and the Diagnosis column, and Figure S3 displays the decision tree. Due to space constraints, the knowledge graphs of the other datasets are showcased in the supplementary material.

Fig. 3.

Fig. 4.

Fig. 5.

Figure S5 presents how the connections change across healthy, hypothyroidism, and hyperthyroidism, respectively. This is also visible in the communities created by the Greedy, Label Propagation, and Louvain algorithms, depicted in Figures S6, S8, and S7, respectively. The individual changes in the probability distributions of the features are shown in Figure 6. Figure S10 emphasises the correlation patterns between the attributes, while Figure S9 provides the decision tree.

Fig. 6.

The representation of edible and poisonous mushrooms in the Mushroom dataset obtained from CACTUS is displayed in Figure S11. The communities derived from them using the Greedy, Label Propagation, and Louvain algorithms are presented in Figures S12, S14, and S13, respectively, with specific alterations between them outlined in Figure 7. The auxiliary correlation matrix and decision tree are illustrated in Figures S16 and S15.

Fig. 7.

The knowledge graphs from the Heart Disease dataset (Figure S17) are much harder to compare due to the volume of data. The communities generated by the Greedy and Louvain algorithms are listed in Figures S18 and S19. Label Propagation did not return different structures and was therefore ignored. Figure 8 shows the rank of each feature and helps in the recognition of the important transitions and differences between them. Figures S21 and S20 portray the correlation and decision tree.

Fig. 8.

The knowledge graphs for the Adult Income dataset are illustrated in Figure S22 and describe the main characteristics of the two classes.

4 Discussion

4.1 WDBC Dataset

Classification of breast cancer types in the WDBC dataset using CACTUS allows the user to visualise the most influential elements in the decision process. The most important nodes in the benign knowledge graph in Figure 3 are Radius_D, Area_D, and Worst Radius_D, which lose significance in the malignant breast cancer, portrayed in Figure 4. The changes in the composition of the graphs indicate different connectivity patterns, as demonstrated by the distinct communities in Figures S1 and S2. They illustrate how communities have a strong subset resistant to change and a subset more weakly bound where elements are inclined to change membership across cancer typology. Additionally, they demonstrate how flips of the same node can play the same role in the two knowledge graphs, facilitating the formation of the same structure. For this dataset, the label propagation algorithm returned a single community, and was therefore omitted. In Figure S3, the auxiliary decision tree shows the number of linear separations, chosen using the entropy criterion, to obtain pure classes (coloured leaves), where the first splits have been on Worst perimeter, Worst concave points, Worst texture, Worst smoothness, Area SE, and Fractal dimension SE. Despite the high number of linear separation, the decision tree fails to achieve pure partitions, even though there are only three classes.

CACTUS identifies the attributes Worst perimeter and Worst concave points as the most significant for splitting the dataset between the two classes, a finding consistent with the binary decision tree, as illustrated in Figure 5.

For example, Smoothness (not shown for space constraints) has a very balanced distribution for the benign cancer (\(0.52\%\) Up and \(0.48\%\) Down) and a more skewed distribution (\(0.86\%\) Up and \(0.14\%\) Down) for the malignant breast cancer (where malignant cancers are generally not very smooth), and therefore its average rank \(\bar{R}_x\) is not outstanding (0.3394). Conversely, Worst perimeter is much more informative due to the inverted distributions in the benign and the malignant types, and therefore gets a very high \(\bar{R}_x\) (0.8367).

The differences in the connections and the communities across the knowledge graphs are based and justified in Figure 5, where the variations in distributions drive such changes by seemingly altering the centrality of each flip.

Comparing the strongest correlations of the Diagnosis column with the distributions in Figure 5, we can observe that Area, Concave points, Concavity, and Perimeter have the most noticeable changes in the distributions of their flips, while other markers such as Smoothness, have both a less impressive change in the distributions and no significant correlation with the Diagnosis column. This is useful to confirm that the markers used by CACTUS to operate the classification task are actually bound to the label to predict, and therefore the building of the knowledge graph can be validated and supported. Other well correlated variables, namely Worst concave points, Worst area, Worst perimeter, Worst radius, and Radius, show important differences in their distributions as well, but have been omitted in Figure 5 due to space constraints.

4.2 Thyroid Dataset

The classification of the Thyroid dataset tested CACTUS’s capacity to handle multiple classes. Figure S5 shows that the connectivity is different, and Figure 6 helps in describing these differences by considering each feature. Figure S9 shows the linear separations to obtain partitions of the original dataset, the first splits being on TSH, T3, FTI, and TT4. Considering Figure 6, the same markers show significant changes in their distributions across the three original classes, and Sex_1 (Female sex) is more likely to have thyroid-related pathologies. Most of the influential markers are specific to a particular thyroidal condition, such as TT4 and FTI, but others seem to be more common between hyper- and hypo-thyroidism, such as T4U and T3, even though they still show a preference for hypothyroidism and hyperthyroidism, respectively. Interestingly, the Greedy algorithm shows more communities in the healthy class than in the others, but Label Propagation and Louvain do the opposite, creating more communities in the thyroidal classes than in the healthy one.

In Figure S10, we can describe how Diagnosis correlates with the other attributes. It is interesting to observe how the preferred attributes for both CACTUS and the binary decision tree, namely TSH, T3, FTI, and TT4, exhibit either no correlation or negative correlation.

4.3 Mushroom Dataset

Some features in the Mushroom dataset, as stem-color-white and cap-diameter-min, lose significance from the edible to the poisonous graphs in Figure S11 and become significant in differentiating between them, as illustrated in both Figure 7 and S15. It appears that edible mushrooms are less likely to: (i) have a white stem; (ii) have less variable cap diameter (as the poisonous ones are more likely to have higher maximum and lower minimum cap diameter); (iii) grow in winter; (iv) have a brown rather than a white gill. The correlations related to the class given in Figure S16 indicate that stem-color-white and cap-diameter-min are positive, and their importance for the classification is confirmed by the ranks computed by CACTUS, shown in Figure 7. Interestingly, stronger correlations within the class, such as stem-root-bulbous, are considered by neither the decision tree nor CACTUS.

4.4 Heart Disease Dataset

The Heart Disease dataset gave a great variability in the community compositions and number, as shown in Figures S18, and S19, ranging from a minimum of 2 to a maximum of 5 for the Louvain algorithm. The ranks in Figure 8 contain steady transitions for the Old_Peak, Slope, and CA features. Others, such as Age, show an abrupt change in the polarity of the distribution between the healthy class and any heart disease label, showing how being old is an important factor for any heart condition, even if it appears in different proportions. The rank of Sex also indicates a bias in the selection process towards men (coded as 1), as they are the most prevalent in all classes. However, there is still a noticeable increase in the occurrence of heart diseases among males. Compared to the decision tree in Figure S20, Thal, CA, and Old_Peak are also the three most important linear separators for differentiation in heart diseases. The decision tree produced many leaves, but only the smallest contained pure partitions belonging to only one class. The correlation matrix, given in Figure S21, mostly contains anti-correlations; for example, the Thalach feature in particular, which has a stronger bound with the Diagnosis feature. Other significant correlations with the Diagnosis are CA, Old_Peak, and Thal, which show good agreement between both CACTUS and the decision tree.

4.5 Adult Income Dataset

The Adult Income dataset in Figure 9 reveals how having studied more, being older, or working more hours per week brings, on average, a higher income. It also reveals substantial biases in our society, like higher salaries for men.

Fig. 9.

4.6 General Comparison

By examining Table 2, CACTUS achieves a higher BA than the considered ML methods for three datasets out of five, namely Mushroom, Heart Disease, and Adult Income. In the remaining two datasets, WDBC and Thyroid, CACTUS performs better in the perturbed datasets for more than \(20\%\) and \(40\%\) of removed values, respectively, winning in 22 out of 25 total comparisons. As shown by this comparison, CACTUS can be adapted to many different domains with a high degree of accuracy and interpretability. This makes it a more suitable tool for a wide range of situations where the available data or computational resources are limited, or the understanding of the role of each element and their interactions is paramount. The built-in transparency can enable specialists with providing feedback, and help with identifying biases in the datasets, thus improving model performance and generalisation. The abstraction and ranking of features can also be useful in summarizing and presenting the information in an easily consumable format to decision makers and the general public, be they patients at a hospital or defendants in a trial. Furthermore, having a reduced representation can allow for slimmer and faster decisions.

The gap between PR and PB in the Adult Income dataset is due to the design of the PR algorithm: with high dimensional categorical features, graph connectivity becomes homogeneous and redundant, thus creating more interconnection between the features and flattening the PR centrality. Furthermore, the individual probabilities \(P(f|c)\) of a categorical feature are tendentially lower than those of abstracted values computed on the same feature, as they both have to sum up to one (if there are no missing values). By incorporating a multiplication between the PR score and \(P(f|c)\) to create the Significance used by PR, these two aspects are effectively combined. This penalises the already lower probabilities of categorical features by multiplying them with low PR centralities. Additionally, a categorical feature that is treated as a continuous variable (such as the education level achieved by an individual) might still preserve some meaning if it is ordered. Conversely, other categorical features, such as ethnic background, nationality, or religion, obviously cannot be represented in a logical order, causing problems in traditional ML models and in the interpretation of their results. The PB approach has the advantage of richer information preserved by using categorical methods. This becomes evident from Table 5, where the most complex categorical features have not been forced to be categorical, and both the PR and PB approaches show the same levels of BA. The reason for this abrupt change in the BA is reflected in the sensitivity and specificity (Tables 3 and 4, respectively), where CPR has the maximum value for the former and a very low value for the latter, for each version of the Adult Income dataset. None of the considered ML methods recognise features as categorical, therefore their approach is always less sensitive than CACTUS’s default behaviour.

The Adult Income is the only dataset with exceptionally high heterogeneity in feature values; neglecting the most articulated features does not excessively penalize the BA obtained by PB, making CACTUS perform better even in this second scenario. The main difference is in how explainable the classification becomes, as Figure 9 points to the individual categories and their variation across the classes.

Most notably, RF is the least interpretable of the chosen models; the tradeoff between interpretability and performance plays a central role in driving the transition towards more transparent models for critical sectors.

5 Conclusions

The new architecture presented in this work has been shown to outperform different forms of classification tasks with a high degree of flexibility and transparency. It automatically provides users with several internal representations to describe its functionality and the elements considered fundamental when assigning a specific class over another. These insights are fundamental in receiving validation or correction from specialists and ensuring a correct classification without biases. It also helps to justify the classification to the final user, which is an important aspect of the decision-making in practical fields such as healthcare, law, economy, and so on. CACTUS provides the differences in the distribution of each individual feature and quantifies their contribution through the concept of Rank. On the other hand, the knowledge graphs capture more complex structures by modelling the interactions between each pair of elements, resulting in different communities and graph properties for each class, as contained in the supplementary material. The inclusion of categorical features is an important addition to allow the full use and interpretation of these variables, especially for medical data where medications, genetic, or habitual variables are included. The integration of external methods, namely binary decision trees and correlation matrices, gives a reference against which to compare the composition of knowledge graph. Using abstractions for representing information is a novel avenue yet to be explored. Future directions will explore exploiting it in different domains, as a fundamental block for models different from the current panorama in AI.

Authors’ Contributions

LG wrote the manuscript, conceptualised and implemented the tool, conceptualised and conducted the experiments. VRV conceptualised and implemented the tool, conceptualised the experimental design. KC conceptualised the tool and the experimental design. RW contributed to the manuscript. JS conceptualised the tool. All authors reviewed the final manuscript.

Acknowledgments

We gratefully acknowledge Patrycja Podsiedlińska for her careful proofreading and style suggestions.

Supplementary Material

3649459.supp (3649459.supp.pdf)

Supplementary material

Download
25.78 MB

References

[1]

Plamen Angelov and Eduardo Soares. 2020. Towards explainable deep neural networks (xDNN). Neural Networks 130, 130 (2020), 185–194. DOI:

Abstract

1 Introduction

2 Methods

2.1 Experimental Setup

3 Results

4 Discussion

4.1 WDBC Dataset

4.2 Thyroid Dataset

4.3 Mushroom Dataset

4.4 Heart Disease Dataset

4.5 Adult Income Dataset

4.6 General Comparison

5 Conclusions

Authors’ Contributions

Acknowledgments

Supplementary Material

References

Cited By

Index Terms

Recommendations

A Comprehensive Survey of Explainable Artificial Intelligence (XAI) Methods: Exploring Transparency and Interpretability

A unified and practical user-centric framework for explainable artificial intelligence

Towards Transparent AI: How will the AI Act Shape the Future?

Comments

Information

Published In

Publisher

Publication History

Check for updates

Author Tags

Qualifiers

Funding Sources

Contributors

Other Metrics

Bibliometrics

Article Metrics

Other Metrics

Citations

Cited By

View options

PDF

eReader

Login options

Full Access

Share

Share this Publication link

Share on social media

Affiliations