0% found this document useful (0 votes)

75 views10 pages

Sabol Et Al 2020

The document describes an explainable classifier that was developed to support medical diagnoses from histopathological images of colorectal cancer tissue. The proposed model provides explanations for its predictions in human-friendly ways like showing similar training samples or the possibility of misclassification. It is intended to support pathologists rather than fully automate diagnoses. The accuracy and explainability of the model were evaluated using real medical data and feedback from pathologists.

Uploaded by

Mais Diaz

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

75 views10 pages

Sabol Et Al 2020

Uploaded by

Mais Diaz

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 10

Journal of Biomedical Informatics 109 (2020) 103523

Contents lists available at ScienceDirect

Journal of Biomedical Informatics

journal homepage: www.elsevier.com/locate/yjbin

Original research

Explainable classifier for improving the accountability in decision-making

for colorectal cancer diagnosis from histopathological images
Patrik Sabol a ,∗, Peter Sinčák a , Pitoyo Hartono b , Pavel Kočan c , Zuzana Benetinová c ,
Alžbeta Blichárová c , Ľudmila Verbóová c , Erika Štammová c , Antónia Sabolová-Fabianová d ,
Anna Jašková d
a
Department of Cybernetics and Artificial Intelligence, Technical University of Košice, Košice, Slovakia
b
School of Engineering, Chukyo University, Nagoya, Japan
c
Department of Pathology, Pavol Jozef Šafárik University in Košice, Košice, Slovakia
d
The Faculty of Arts of Prešov University, Prešov, Slovakia

ARTICLE INFO ABSTRACT

Keywords: Pathologists are responsible for cancer type diagnoses from histopathological cancer tissues. However, it is
Explainable artificial intelligence known that microscopic examination is tedious and time-consuming. In recent years, a long list of machine
Explainable machine learning learning approaches to image classification and whole-slide segmentation has been developed to support
Uncertainty measure
pathologists. Although many showed exceptional performances, the majority of them are not able to rationalize
Digital pathology
their decisions. In this study, we developed an explainable classifier to support decision making for medical
Colorectal cancer
diagnoses. The proposed model does not provide an explanation about the causality between the input and
the decisions, but offers a human-friendly explanation about the plausibility of the decision. Cumulative Fuzzy
Class Membership Criterion (CFCMC) explains its decisions in three ways: through a semantical explanation
about the possibilities of misclassification, showing the training sample responsible for a certain prediction and
showing training samples from conflicting classes. In this paper, we explain about the mathematical structure
of the classifier, which is not designed to be used as a fully automated diagnosis tool but as a support system
for medical experts. We also report on the accuracy of the classifier against real world histopathological data
for colorectal cancer. We also tested the acceptability of the system through clinical trials by 14 pathologists.
We show that the proposed classifier is comparable to state of the art neural networks in accuracy, but more
importantly it is more acceptable to be used by human experts as a diagnosis tool in the medical domain.

1. Introduction A long list of machine learning approaches to image classification

and whole-slide segmentation has been developed to support pathol-
Histopathological images of cancer tissue are routinely analysed by ogist in interpreting histopathological images [4,5]. Especially, in the
pathologists who are responsible for the cancer type diagnosis and recent years, conventional classification approaches, which mainly rely
prognosis [1]. Different types of tissues can be distinguished from on manually-engineered features, were outperformed by Deep Neural
histopathological evaluations of Hematoxylin and Eosin (H&E) stained Networks (DNN) and Convolutional Neural Networks (CNN) [6]. The
tissue sections. In colorectal cancer (CRC), the tumour architecture high performances of deep models is due to their ability to automat-
changes during tumour progression and is related to patient progno- ically extract representations that are strongly relevant to the their
sis [2]. Microscopic examination of tissue sections is known to be predictions from the learning data. However, their performance is
tedious and time-consuming [3]. Furthermore, the outcome of the not supported by their ability to explain their decisions and hence,
analysis may be affected by the levels of experience of the patholo- may prevent their applicability in real world clinical settings [7]. Our
gists involved. Therefore, with the advancement of digital pathology, objective in this study is to build a reliable classifier that is also able
computer-aided analysis of the histopathological images and machine to provide explanations about its decision in human-friendly forms. We
learning-based diagnostic systems, the fidelity and efficacy of medical believe that in real world clinical settings, which require accountability,
diagnoses can be significantly improved.
the accuracy of a classifier should be paired with its explainability [8].

∗ Corresponding author.
E-mail addresses: patrik.sabol@tuke.sk (P. Sabol), peter.sincak@tuke.sk (P. Sinčák).

https://doi.org/10.1016/j.jbi.2020.103523
Received 12 April 2020; Received in revised form 22 July 2020; Accepted 27 July 2020
Available online 3 August 2020
1532-0464/© 2020 Elsevier Inc. All rights reserved.
P. Sabol et al. Journal of Biomedical Informatics 109 (2020) 103523

1.1. Relevant studies of training image and the most similar images that belong to clusters
of the conflicting class with a different confidence degree. In this
In [9], Holzinger et al. distinguished two types of explainable AI: paper, we extend the explainability of the CFCMC classifier by defining
Ante-hoc systems which incorporate explainability directly into the the factor of miclassification (FoM) and the certainty threshold. While
structure of an AI-model; these are systems that are interpretable by the FoM is a value that describes the possibility of the input sample
design. Typical examples include linear regression, decision trees and being misclassified to the one particular conflicting class, the certainty
fuzzy inference systems. They are commonly referred to as white-boxes threshold is a value of the FoM, under which it is a certain that the
or, currently, glass-boxes [10]. Posthoc systems, on the other hand, input sample will not be misclassified. Compared to the concept of
aim to explain and interpret black-box classifiers which provide local the uncertainty measure proposed in [16], in the case of uncertain
explanations for their specific decision. The majority of the explanation prediction, our approach is additionally able to suggest the classes in
approaches seek to link a particular output of the classifier to input which the input sample could be misclassified. Thus, it offers relevant
variables to see the impact of features on the final decision outcome. classes to be further examined.
For instance, in [11], G. R. Vásquez-Morales et al. used neural network- Different approach to interpret the decision of the classifier is based
based classifier to predict whether a person is at risk of developing on generating instances that are close to an observation. In [20] the
chronic kidney disease. Here, a black-box machine-learning method influence function was used to trace a model’s prediction through the
was complemented by Case-Based Reasoning, a white-box method learning algorithm and back to its training data, thereby identifying
that is able to find explanatory cases for an explanation-by-example training points most responsible for a given prediction. This approach
justification of a neural network’s prediction. In [12], Mullenbach et al. was used to explain the prediction of a black-box, deep neural network
presented an attentional convolutional network that predicted medical model. Moreover, the paper [21] investigates the effects of presentation
codes from clinical texts. Using an attention mechanism, the most of influence of training data points on machine learning predictions
relevant segments of the clinical text for each of the medical codes to boost user trust by measuring psychological signals (Galvanic Skin
were selected and used as an explanation mechanism. Through an Response and Blood Volume Pulse). It showed that these features
interpretability evaluation by a physician, they showed that the atten- correlate to user trust. Such reference-based explanations are needed
tion mechanism identified meaningful explanations. In [13], Lundberg in medicine, where, for instance, they could help to diagnose the type
et al. presented an ensemble-model-based machine learning method of the cancer from histopathological images.
using deep learning that predicts the near-term risk of hypoxaemia In this study the explainability of our classifier refers to its abil-
during anaesthesia care and explains the patient and surgery-specific ity in providing a degree of confidence for each of its prediction
factors that led to that risk. The system improved the performance of and in expressing the information in intuitive and human-friendly
anesthesiologists by providing interpretable hypoxaemia risk and the manners. The information is expressed through visualizations of train-
contributing factors. In [7], Hagele et al. utilized Layer-wise Relevance ing examples that are responsible for the prediction outcome paired
Propagation (LRP) to provide pixel-level explanation heatmaps for the with semantical explanations regarding likelihood for misclassifica-
classification decision of the CNN in digital histopathology analyses of tions. Here, while our method does not provide explainability for the
tumour tissue. These explanations were used to improve the general- causality between the input and the prediction, we believe that the
ization of the classifier by detecting and removing the effects of hidden explainability improves the accountability of the proposed classifier,
biases in used datasets. A similar approach to visualize parts of the and thus significantly contributes in supporting decision making in
input image responsible for the prediction was used in [14], where time-crucial medical domains. The objective of this article is to apply
LIME (Local Interpretable Model-agnostic Explanations) was utilized the explainable CFCMC classifier for the classification of histopatholog-
to provide a global understanding for the CNN model by providing ical images of colorectal cancer. We used a publicly available dataset
explanations for individual instances in the context of in-vivo gastral that was released in [2] by Kather et al.. It consists of a training set
image analysis. comprised of 5000 small tiles, each of them annotated with one of eight
It is natural that in the delicate medical domain, prediction models tissue classes and 10 non-annotated whole slide images (WSI) of the
should not only be accurate, but also accountable; they should state tissue.
uncertainty in their predictions, indicating difficult cases for which In [16], it was shown that CNN outperformed the approach in [2],
further human expert inspections are necessary. Therefore, another where features derived from images using texture descriptors served as
approach to probe and interpret machine learning algorithm is to a basis for a support vector machine model to classify colorectal cancer.
measure the uncertainty of the prediction for one particular example, Moreover, in [22], CNN achieved an exceptional level of performance,
the predictive uncertainty [9]. In [15], a transparent neural network, 98,7% accuracy, in nine tissue types classifications of colorectal cancer,
S-rRBF, was proposed and applied to DNA microarray data sets. It using the VGG19 model [23], which was pretrained on the ImageNet
provides an intuitive explanation through a visualization of its decision database [24]. Therefore, to enhance the accuracy of the CFCMC, we
process and on the given problem. It allows the users to understand why employ a Convolutional Neural Network as a feature extractor. We are
a certain problem is easy or difficult. Moreover, it makes it possible aware of the problem of losing explainability of the CNN model by com-
to see whether a new input is hard to classify or unlikely to be mis- pressing of the data from the feature space to the latent space, which
classified. However, the visual information still needs to be interpreted causes that it is hard to track the decision back to the features in the
and thus is prone to subjective inconsistencies. For the field of digital feature space. This problem is not relevant for our study, because we
pathology, in [16], Raczkowski et al. proposed an accurate, reliable and do not provide an explanation about the causality between the features
active (ARA) image classification framework using a Bayesian Convo- and the decisions but we provide explanation about the classifiability of
lutional Neural Network (ARA-CNN) for classifying histopathological the data, which is significantly improved by the CNN model by mapping
images of colorectal cancer. The model is able to achieve reliability the data from the feature space to the latent space.
by measuring the uncertainty of each prediction. This capability was Finally, we developed an explanation interface, which provided a
used to identify mislabelled training samples. In [17], the recently semantical and visual explanation that was extracted from the CFCMC
proposed semantically explainable fuzzy classifier called Cumulative classifier that was used to classify the WSIs of the colorectal can-
Fuzzy Class Membership Criterion (CFCMC) [18,19] was used to clas- cer tissue. We evaluated our XAI (eXplainable Artificial Intelligence)
sify histopathology images for breast cancer and to generate additional system using common within-subject experimental design [25]; the
information about classification reliability in human-friendly terms, in outcomes from our explanation interface (XAI system) were compared
the form of a semantic explanation. It provides a confidence measure with outcomes from a stand alone CNN (AI system with no explanation)
for the classification result of a test image followed by a visualization by 14 pathologists at clinical trials through questionnaires.

2
P. Sabol et al. Journal of Biomedical Informatics 109 (2020) 103523

2. Proposed explainable model class membership criterion 𝜅𝑝̃𝑖𝑗𝑘 of the 𝑗th cluster of the 𝑖th class shares
the same value of parameters 𝑎 and 𝐾. 𝑎𝑖𝑛𝑖𝑡 is initialized as follows:
In this section, the mathematical description of the Cumulative 𝑛
1
∑𝑝̃
‖ ‖
Fuzzy Class Membership Criterion classifier is explained and followed 𝑎𝑖𝑛𝑖𝑡 = min ‖𝑝̃𝑖 − 𝑝̃𝑗 ‖, (4)
𝑛𝑝̃ 𝑗≠𝑖 ‖ ‖
by the definition of the factor of misclassification and the certainty 𝑖𝑗=1
threshold. where, 𝑛𝑝̃ is the number of training patterns. 𝐾𝑖𝑛𝑖𝑡 value is initialized
( )
from the interval 1; 𝑚𝑖𝑗 . The value of threshold 𝜃 is set from the
2.1. Cumulative Fuzzy Class Membership Criterion decision based classifier interval (0; 1). If the value of CFCMC 𝜒(𝑥) of the input pattern 𝑥 is below
the threshold 𝜃, thus 𝜒(𝑥) < 𝜃, the pattern is ‘‘not classified’’. Finally,
The proposed method is based on the assumption that 𝑑-dimensional the CFCMC surface is computed using Eq. (1) and Eq. (2).
data in the feature space are split into 𝑛𝑐 classes, where 𝐶𝑖 (𝑖 = 1, … , 𝑛𝑐 ), During the learning phase, adjustment of the CFCMC surface’s shape
the 𝑖th class, is divided into 𝑛𝑖𝑐𝑙 clusters, where 𝐶𝑙𝑖𝑗 (𝑗 = 1, … , 𝑛𝑖𝑐𝑙 ) is the occurs in order to obtain the highest classification accuracy. An assump-
𝑗th cluster of the 𝑖th class. Each cluster 𝐶𝑙𝑖𝑗 comprises training data tion of dividing training set into 𝑛𝑐 classes and each class 𝐶𝑖 into 𝑛𝑖𝑐𝑙
𝑝̃𝑖𝑗𝑘 ∈ R𝑑 , (𝑘 = 1, … , 𝑚𝑖𝑗 ), and 𝑚𝑖𝑗 is the number of training patterns clusters 𝐶𝑙𝑖,𝑗 generates a set of vectors
of cluster 𝐶𝑖𝑗 . Each training pattern 𝑝̃ defines a fuzzy class membership
criterion 𝜅𝑝̃(𝑥), which is considered as a triangular function as follows: 𝑝𝑖 = [𝑎𝑖1 , 𝐾𝑖1 ; ⋯ ; 𝑎𝑖𝑗 , 𝑘𝑖𝑗 ; ⋯ ; 𝑎𝑖𝑛𝑖 , 𝑘𝑖𝑛𝑖 ], (5)
𝑐𝑙 𝑐𝑙

⎧ ‖ ‖ where, 𝑖 = 1, … , 𝑛𝑐 . Optimizing of 𝑝𝑖 , the CFCMC surface’s shape is

‖𝑝̃ − 𝑥‖
⎪1 − ‖ 𝑖𝑗𝑘 ‖ ‖ ‖
‖𝑝̃𝑖𝑗𝑘 − 𝑥‖ < 𝑎𝑖𝑗 , adjusted. Any optimizing algorithm can be used, for instance, sim-
𝜅𝑝̃ (𝑥) = ⎨
𝑖𝑗𝑘 𝑎𝑖𝑗 ‖ ‖ (1)
⎪ ulated annealing or hill-climbing methods. We decided to employ a
⎩ 0 𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒
well-known genetic algorithm [29]. The fitness function is defined as
where, 𝑥 ∈ R𝑑 is an input vector, 𝑎𝑖𝑗 is the width of the triangular follows:
function for the 𝑗th cluster of 𝑖th class. ( )
Let (𝜅𝑝̃𝑖𝑗1 , … , 𝜅𝑝̃𝑖𝑗𝐾 , … , 𝜅𝑝̃𝑖𝑗𝑚 ) be a descending order of the fuzzy 𝑀𝑖𝑛𝑖𝑚𝑖𝑧𝑒 ∶ 𝑒𝑟𝑟 𝑆𝑣𝑎𝑙𝑖𝑑 , (6)
𝑖𝑗 𝑖𝑗
class membership criterion 𝜅𝑝̃ for the 𝑗th cluster of 𝑖th class such where, 𝑆𝑣𝑎𝑙𝑖𝑑 is the validation set and 𝑒𝑟𝑟 (𝑆) is the error rate evaluated
𝑖𝑗
that 𝜅𝑝̃𝑖𝑗1 ≥ ⋯ ≥ 𝜅𝑝̃𝑖𝑗𝑚 , where 𝐾𝑖𝑗 is the number of first values in the from the data set 𝑆 as follows:
𝑖𝑗
reordering.
#of incorrectly classified samples
Then, the Cumulative Fuzzy Class Membership Criterion (CFCMC) 𝑒𝑟𝑟(𝑆) = . (7)
#of all samples
for class 𝐶𝑖 is defined as follows:

( ) ⎛ 𝐾𝑖𝑗
∑ ( )⎞ 2.3. Factor of misclassification
𝜒𝐶𝑖 𝑥 = max ⎜ 𝐾1 𝜅𝑝̃𝑖𝑗𝑘 𝑥 ⎟, (2)
𝑗 ⎜ 𝑖𝑗 ⎟
⎝ 𝑘=1 ⎠
( ) The term factor of misclassification (FoM) is described as ‘‘the
where, 𝜒𝐶𝑖 𝑥 is the value of CFCMC for an unknown pattern 𝑥 to the likelihood of the input sample, which is assigned to the cluster 𝐶𝑙 belonging
class 𝐶𝑖 . to the class 𝐶, to be misclassified to one of the rest of the classes’’ i.e. the
Then, the decision rule for winner class 𝐶𝐿 for the input pattern 𝑥 possibility that in reality the observation belongs to another class. The
is as follows: factor of misclassification of the input sample, assigned to the cluster
𝐶𝐿(𝑥) = 𝐶 ( ( )) . (3) 𝐶𝑙𝐴𝑖 , to the reference cluster 𝐶𝑙𝐵𝑗 is defined as follows:
argmax 𝜒𝐶𝑖 𝑥
𝑖 𝜒𝐶𝑙𝐵𝑗 (𝑥)
𝐹 𝑜𝑀(𝑥, 𝐶𝑙𝐵𝑗 ) = + 𝑠𝑖𝑚𝐶𝑙 (𝐶𝑙𝐴𝑖 , 𝐶𝑙𝐵𝑗 ), (8)
𝜒𝐶𝑙𝐴𝑖 (𝑥)
2.2. Algorithm description
where the first term on the right hand side describes the local similarity
The algorithm consists of two phases: the initialization and the as the ratio between memberships of the input sample 𝜒(𝑥) to the
learning phase. The initialization phase consists of three processes: reference cluster 𝐶𝑙𝐵𝑗 and the winner cluster 𝐶𝑙𝐴𝑖 . The second term
data splitting, clustering, and parameters initialization. First, input data on the left hand side describes global similarity, which is based on the
are divided into three sets: training sets, validation sets, and testing relationship between the data’s clusters.
sets. Training patterns are used in Eq. (1) to create a CFCMC decision The similarity between the two clusters 𝐶𝑙𝐴𝑖 and 𝐶𝑙𝐵𝑗 is defined as
surface. During the learning phase, the decision surface is optimized follows:
(parameters 𝑎𝑖𝑗 Eq. (1) and 𝐾𝑖,𝑗 in Eq. (2) are adaptively optimized) 𝐴𝑖𝑛𝑡𝑒𝑟𝑠𝑒𝑐𝑡𝑖𝑜𝑛(𝐶𝑙𝐴𝑖 ,𝐶𝑙𝐵𝑗 )
in order to cover all validation patterns. The testing set is used for the 𝑠𝑖𝑚𝐶𝑙 (𝐶𝑙𝐴𝑖 , 𝐶𝑙𝐵𝑗 ) = , (9)
𝐴𝐶𝑙𝐴𝑖
final evaluation of the created decision surface.
Afterwards, the training data of each class 𝐶𝑖 are independently where 𝐴𝐶𝑙 is the area of a hypersphere describing cluster 𝐶𝑙 and
clustered in order to find 𝑛𝑖𝑐𝑙 clusters for each class in feature space 𝐴𝑖𝑛𝑡𝑒𝑟𝑠𝑒𝑐𝑡𝑖𝑜𝑛 is the area of intersection of the two clusters. Here, for
using the well-known K-means algorithm. The number of clusters, 𝑘, simplification, the clusters are described with n-dimensional hyper-
is estimated via a gap statistic [26]. This technique uses the output sphere, where n is the data dimensionality. For the centre and the
of any clustering algorithm, comparing the change of the within- radius of a hypersphere, the coordinates of a cluster’s centroid 𝑐𝐶𝑙𝑖𝑗 and
cluster dispersion with that which is expected under an appropriate the estimated variance 𝜎̂ 𝐶𝑙𝑖𝑗 of a cluster’s data, respectively, are used.
reference null distribution. Any other techniques for the estimation of For computational purposes, 𝑛-hypersphere is transferred into a two
the number of clusters can be used, such as Silhouette analysis [27] or dimensional circle. The area of intersection between the two clusters
Davies–Bouldin clustering criterion [28]. is computed using simple two dimensional trigonometry by using the
Next, the parameters 𝑎𝑖𝑗 and 𝐾𝑖𝑗 in Eq. (1) and in Eq. (2), re- distance between the centres and radiuses of the circles, and thus, the
‖ ‖
spectively, are initialized. These parameters affect the shape of the Euclidean distance ‖𝑐𝐶𝑙𝐴𝑖 − 𝑐𝐶𝑙𝐵𝑗 ‖ between the centroids of the clusters
‖ ‖
boundary created by fuzzy class membership criterion 𝜅𝑝̃. Every fuzzy 𝐶𝑙𝐴𝑖 and 𝐶𝑙𝐵𝑗 and their estimated variances 𝜎̂ 𝐶𝑙𝐴𝑖 and 𝜎̂ 𝐶𝑙𝐵𝑗 .

3
P. Sabol et al. Journal of Biomedical Informatics 109 (2020) 103523

The equation for the estimation of the cluster’s variance value 𝜎̂ 𝐶𝑙𝐴,𝑖 3. Colorectal cancer detection explanation interface
was derived in [19] and it is calculated as follows:
( )𝑚 { This section describes the explanations generated to human experts
𝑘 𝑚 = 0.7 𝜒𝐶𝑙𝑚𝑎𝑥 ≤ 𝑘
𝑖𝑗
𝜎̂ 𝐶𝑙𝑖𝑗 = 𝑎𝐶𝑙𝑖𝑗 , by the proposed system in colorectal cancer detection tasks. To evaluate
𝜒𝐶𝑙𝑚𝑎𝑥 𝑚 = 2.5 𝜒𝐶𝑙𝑚𝑎𝑥 > 𝑘
𝑖𝑗 𝑖𝑗 the usefulness of the proposed explanations, we designed two systems;
𝑘 = 𝑝1 ∗ 𝐾 + 𝑝2 first a plain CNN model that only generates decisions without any
(10)
𝑝𝑙 = 𝑎𝑙 ∗ 𝑑𝑖𝑚𝑏𝑙 + 𝑐𝑙 (𝑙 = 1, 2) explanation, second, an X-CFCMC (eXplainable CFCMC) model that
complements its decisions with explanations.
𝑎1 = −0.7621, 𝑏1 = −0.2799, 𝑐1 = 0.0746
For both systems, we developed similar user interfaces; both provide
𝑎2 = 0.8372, 𝑏2 = −0.3729, 𝑐2 = 0.1758, classification results of the whole-slide images (WSI) of colorectal can-
where 𝑑𝑖𝑚 is the dimensionality of the data. cer tissue, showing the original image of the WSI and a corresponding
It should be noted that during the variance estimation, the Eu- label map with a colour code of eight different tissue types. Pathologist
clidean distance was replaced with the following distance measure 𝑑: can examine an arbitrary area of the WSI by clicking on the desired
Let us have two vectors 𝑥𝐴 and 𝑥𝐵 with 𝑛-dimensionality. Then the area. Subsequently, the interfaces show their predictions. Finally, a
distance 𝑑 is defined as follows: pathologist can provide the final decision by selecting one of the eight
buttons representing the eight different tissue types.
1∑
𝑛
𝑑= |𝑥 − 𝑥𝐵𝑖 |, (11) The non-explainable AI system interface that uses the plain CNN
𝑛 𝑖=1 𝐴𝑖 model is shown in Fig. 1. It provides predicted type of tissue and
a probability distribution of the prediction of all tissue types, which
The value of the factor of misclassification of the input 𝑥 to the 𝑖th
was computed in the output layer of the CNN model with the softmax
class 𝐶𝑖 is computed as follows:
activation function.
𝐹 𝑜𝑀(𝑥, 𝐶𝑖 ) = max 𝐹 𝑜𝑀(𝑥; 𝐶𝑙𝑖𝑗 ) (12) The explainable AI system interface that uses the X-CFCMC model
𝑗
is shown in Fig. 2. It complements its decision with three types of
The factor of misclassification can also be expressed semantically. information:
It exhibits the values as follows:
⎧ 𝑛𝑜 𝑝𝑜𝑠𝑠𝑖𝑏𝑖𝑙𝑖𝑡𝑦 ( ) 3.0.1. Semantical explanation
𝐹 𝑜𝑀(𝑥, 𝐶𝐴 ) ∈ 0, 𝑐𝜃
⎪ ( ] To provide user-friendly explanations, prediction results and infor-
𝐷𝐹 𝑜𝑀 ⎨ 𝑙𝑜𝑤 𝑝𝑜𝑠𝑠𝑖𝑏𝑖𝑙𝑖𝑡𝑦 𝐹 𝑜𝑀(𝑥, 𝐶𝐴 ) ∈ 𝑐𝜃 , 𝜃𝑚𝑖𝑑 (13)
( ]
⎪ℎ𝑖𝑔ℎ 𝑝𝑜𝑠𝑠𝑖𝑏𝑖𝑙𝑖𝑡𝑦 𝐹 𝑜𝑀(𝑥, 𝐶𝐴 ) ∈ 𝜃𝑚𝑖𝑑 , 𝐹 𝑜𝑀𝑚𝑎𝑥 mation regarding the possibility of the misclassification of the exam-
⎩ ined area of the WSI are semantically explained, for example, with the
where 𝑐𝜃 is the certainty threshold and 𝜃𝑚𝑖𝑑 = (𝐹 𝑜𝑀𝑚𝑎𝑥 − 𝑐𝜃 )∕2 + 𝑐𝜃 . phrase ‘‘there is a low possibility that this classification is wrong’’. This
The 𝐹 𝑜𝑀𝑚𝑎𝑥 is the maximum value of the FoM computed using the semantical explanation is generated based on the value of the FoM.
validation samples as follows:
3.0.2. Visualization of the training image most responsible for a given
𝐹 𝑜𝑀𝑚𝑎𝑥 = max max 𝐹 𝑜𝑀(𝑥𝑗 , 𝐶𝑖 ), (14)
𝑗 𝑖 prediction
where 𝑥 ∈ 𝑆𝑣𝑎𝑙𝑖𝑑 . To justify the prediction result, the most responsible training image
for a given prediction is displayed to the user. A similar approach to
understand the predictions was introduced in [20], where the influence
2.4. Certainty threshold and certain prediction function was used to trace the CNN’s prediction through the learning
algorithm and back to its training data. In our approach, the most
The certainty threshold is the value of the FoM, below which it is responsible training image for a prediction is a representative training
certain (i.e. there is no possibility) that the input sample 𝑥 is assigned sample 𝑝̃𝑟 (𝑥𝑤 ) for the input image 𝑥𝑤 assigned to the winner tissue type
to the class 𝐶𝐴 and will not be misclassified to any other classes in the 𝑤. It follows that if the training image has very similar context with the
feature space. input image, it should gain the trust of the pathologist in prediction.
Let 𝑥 be samples from the validation set 𝑆𝑣𝑎𝑙𝑖𝑑 that were misclassi- Otherwise, the pathologist could consider the certain prediction as not
fied and 𝐶𝐺𝑇𝑘 be the ground truth label of the 𝑘th sample 𝑥 . Then the being reliable.
𝑘
certainty threshold 𝑐𝜃 is calculated as follows:
3.0.3. Visualization of training images of other types of tissue
𝑘
𝑐𝜃 = min 𝐹 𝑜𝑀(𝑥𝑘 ; 𝐶𝐺𝑇 ). (15) The third means of explanation shows the representative training
𝑘
images for the input image to the other tissue types, into which the
It follows that if 𝐹 𝑜𝑀(𝑥, 𝐶𝑖 ) < 𝑐𝐹 𝑜𝑀 holds, it is unlikely that input
input image could be misclassified with a high or low possibility. It
sample 𝑥 belongs to the class 𝐶𝑖 .
should visually explain to the pathologist why the input sample could
Therefore, if it holds that ∀𝑖∈1,…,𝑛𝑐 ; 𝐹 𝑜𝑀(𝑥, 𝐶𝑖 ) < 𝑐𝜃 , the prediction be misclassified to a particular tissue type. In the case of similar context,
of the input sample 𝑥 is certain, otherwise it is uncertain. a pathologist could consider the particular tissue type as the true type
of tissue.
2.5. Representative training sample
4. Experiments
Representative training sample 𝑝̃𝑟
is the training sample that is the
most responsible for assigning input sample 𝑥 to the 𝑗th cluster 𝐶𝑙𝑖𝑗 of To choose the best performing explainable model to classify colorec-
the 𝑖th class 𝐶𝑖 . It is computed as follows: tal cancer image data, the model has to be accurate and reliable with
its explanations. Therefore, in this section, first, we present the results
‖ ‖
𝑝̃𝑟 (𝑥) = argmin ‖𝑝̃𝑖𝑗𝑘 − 𝑥‖ (16) from the task of boosting the performance of the CFCMC classifier.
𝑘 ‖ ‖
Then, we describe the results of validating the certainty threshold.
‖ ‖
where ‖𝑝̃𝑖𝑗𝑘 − 𝑥‖ is the Euclidean distance between 𝑥 and 𝑝̃𝑖𝑗𝑘 . Finally, we show some examples of generated explanations.
‖ ‖

4
P. Sabol et al. Journal of Biomedical Informatics 109 (2020) 103523

Fig. 1. The interface for the plain CNN, without any explanation, for presenting prediction on histopathological WSI. From the left hand side, it shows a prediction result and
the probability distribution of the prediction. Next, eight buttons for making a final decision by a pathologist are located there. Finally, the original image of the WSI and the
corresponding label map are visualized.

Fig. 2. The interface for the explainable system (X-CFCMC system) for presenting predictions with additional explanations of the CFCMC. From the left hand side, it shows a
semantical explanation of the results alongside the visualization of the training sample responsible for the prediction, below which there is a visualization of the other tissue types,
which could potentially be the true tissue type. Next, the original image of the WSI and the corresponding label map are visualized, below which eight buttons for making a final
decision by a pathologist are located.

4.1. Histopathological data description 4.2. Boosting the CFCMC’s performance

We used a publicly available dataset released in [2] by Kather et al.. The focus of the first part of experiments was to find the architecture
It consists of Hematoxylin and Eosin (H&E) tissue slides, which were of the CNN with the best performance as a feature extractor to train the
cut into 5000 small tiles of the size 150 × 150 pixels (equivalent to 74 explainable CFCMC classifier.
μm × 74 μm), each of them annotated with one of eight tissue classes,
namely tumour epithelium, simple stroma (homogeneous composition, 4.2.1. Experimental setup
includes tumour stroma, extra-tumoural stroma and smooth muscle), We utilized eight well-known CNN models, pre-trained on Ima-
complex stroma (containing single tumour cells and/or few immune geNet [24] dataset, specifically, AlexNet [30], VGG-16 [23], Inception-
cells), immune cells (including immune-cell conglomerates and sub- v3 [31], ResNet-50 [32], Xception [33], DenseNet121 [34], Inception-
mucosal lymphoid follicles), debris (including necrosis, haemorrhage ResNet-V2 [35] and EfficientNet0 [36]. For all of them, the fully con-
and mucus), normal mucosal glands, adipose tissue, background (no nected layers were cut and replaced with a dense layer containing
tissue). The data are class-balanced, each of the classes consists of 625 1024 neurons with ReLU activation functions and with an output layer
tiles. containing 8 neurons with the softmax activation function.

5
P. Sabol et al. Journal of Biomedical Informatics 109 (2020) 103523

Table 1 The certainty rate 𝑐𝑟 is defined as the ratio between the number of
The performance results of different CNN models and the corresponding certain predictions 𝑦∗𝑐 and the number of all predictions 𝑦∗ .
CFCMC models.
CNN model CFCMC model #𝑦∗𝑐
𝑐𝑟 = (17)
Raw image – 59.35(3.43) #𝑦∗
AlexNet 91.43(2.42) 85.32(3.26) The certainty error is defined as follows:
VGG-16 93.61(1.75) 92.42(2.02)
𝑦∗𝑐
#̃
Inception-v3 92.76(1.44) 90.79(2.04) 𝑐𝑒 = (18)
ResNet-50 93.80(1.08) 91.28(1.64) #𝑦∗𝑐
Xception 93.58(1.25) 92.78(1.74)
DenseNet121 92.76(1.29) 92.06(1.75)
where 𝑦̃∗𝑐 are certain predictions that were misclassified.
Inception-ResNetV2 92.76(1.02) 91.44(1.95) The ground truth label certainty error 𝑐𝑒𝐺𝑇 is defined as follows:
EfficientNet0 90.97(1.39) 85.84(5.83)
𝑦∗𝐺𝑇
#̃
𝑐
VGG-like 80.88(1.14) 80.21(2.14) 𝑐𝑒𝐺𝑇 = (19)
Inception-like 85.25(2.82) 83.97(3.04) 𝑦∗
#̃
ResNet20 90.14(1.24) 84.01(5.0) where 𝑦̃∗ are misclassified predicitions and 𝑦̃∗𝐺𝑇 are misclassified pre-
𝑐
dictions, which for their ground truth label 𝐶𝐺𝑇 holds that 𝜙(𝑥; 𝐶𝐺𝑇 ) <
𝑐𝜃 , i.e. their ground truth label is unlikely to be true label.
Moreover, we created three lighter architectures that were trained Table 2 provides the results from the three previously defined
from scratch, specifically a VGG-like model with 12 convolutional layers metrics computed on predictions from joined testing sets from 10 folds
and 2 fully-connected layers, an Inception-like model with 3 inception for each CFCMC model. For each metric, the best performing models
layers and a ResNet20 model with a depth 20. All models were trained are highlighted in bold.
using the Adam [37] optimizer to minimize cross-entropy for 100 CFCMC models trained on features that were extracted from pre-
epochs with the learning rate set to the value 0.0001. trained CNN models generally outperformed models trained from
scratch with a certainty rate 𝑐𝑟 (13.99% against 3.41% in average). All
Finally, to train the explainable CFCMC classifier, we extracted
of the models, however, achieved very low certainty error 𝑐𝑒 ; 7 models
the features from the last dense layer of all the CNN models. The
achieved zero error, while the highest value was 2.17%. It follows that
experimental setup for the CFCMC algorithm is as follows: number
when a classifier labels its prediction as certain, it is unlikely that this
of clusters for each of the classes was set to 1. The value of the
prediction will be incorrect.
threshold 𝜃 was set to value 𝜃 = 0.01. For the optimization of the
Moreover, all of the models likewise reached a low value of ground
CFCMC, MATLAB implementation of the genetic algorithm was used
truth label certainty error 𝑐𝑒𝐺𝑇 (lower than 5%). This implies that when a
with a population size of 50 individuals. The mutation rate was set classifier labels its prediction as uncertain, it is very likely that ground
to 0.2. Arithmetic crossover and adaptive feasible mutation operators truth label will appear among the potentially true labels.
were used for reproduction. Stochastic uniform selection was used to
choose parents for the next generation. The algorithm stops at the 30th 4.4. Performance on imbalanced data set
generation.
For the clinical trials we used a balanced dataset, in which each
4.2.2. Experimental results class is uniformly distributed. Typically, this is not the case in many
Table 1 provides the performance results of different CNN models real world conditions; in most cases, classifiers are required to deal with
and the corresponding CFCMC models. The classification results are imbalances. Therefore, to examine the performance of our classifier on
evaluated with a 10-folds cross validation test. Because of the class- imbalanced data set, we generated two data sets from histopathological
balanced dataset, the accuracy metric was chosen to evaluate the images: balanced data set with the same ratio and imbalanced data set
performance. with different ratio for each of the class (See Table 3).
From Table 1, it can be observed that pre-trained and fine-tuned Table 4 provides the performance results of CNN models with
fine-tuned Xception architecture and corresponding CFCMC models for
models outperform the ones trained from scratch. Moreover, it can be
both data sets. Because we deal with imbalanced data set, beside the
seen that features extracted from the CNN models significantly boost
accuracy, the classification results are also evaluated with other metrics
the performance of the explainable CFCMC models.
(precision, recall and F1). The result shows that the differences of the
CNN and CFCMC models between balanced and imbalanced data set
4.3. Validation of the certainty threshold are not significant, hence the models are able to deal with imbalanced
data set.
To validate the certainty threshold, three metrics were defined: Moreover, Table 5 presents the results from the metrics for val-
certainty rate, certainty error and ground truth label certainty error. idation of the certainty threshold (defined in 4.3) for balanced and

Table 2
The results from three metrics for validation of the certainty threshold, certainty rate 𝑐𝑟 , certainty error 𝑐𝑒 , ground truth label error 𝑐𝑒𝐺𝑇 with
corresponding occurrences.
CFCMC models #𝑦∗ 𝑦∗
#̃ 𝑒𝑟𝑟 # 𝑦∗𝑐 𝑐𝑟 𝑦∗𝑐
#̃ 𝑐𝑒 𝑦∗𝐺𝑇
#̃ 𝑐𝑒𝐺𝑇
𝑐

AlexNet 5000 734 14.68% 538 10.76% 0 0.00% 6 0.82%

VGG16 5000 379 7.58% 646 12.92% 1 0.15% 5 1.32%
Inception-v3 5000 460 9.20% 513 10.26% 0 0.00% 7 1.52%
ResNet-50 5000 436 8.72% 589 11.78% 0 0.00% 13 2.98%
Xception 5000 361 7.22% 1368 27.36% 2 0.20% 15 4.16%
DenseNet121 5000 397 7.94% 648 12.96% 0 0.00% 4 1.01%
Inception-ResNetV2 5000 428 8.56% 612 12.24% 0 0.00% 8 1.87%
EfficientNet0 5000 707 14.14% 680 13.60% 3 0.44% 4 0.57%
VGG-like 5000 990 19.80% 46 0.92% 1 2.17% 7 0.71%
Inception-like 5000 801 16.02% 291 5.82% 0 0.00% 6 0.75%
ResNet20 5000 800 16.00% 175 3.50% 0 0.00% 4 0.50%

6
P. Sabol et al. Journal of Biomedical Informatics 109 (2020) 103523

Table 3
Fraction ratios of balanced and imbalanced generated data sets.
Class Training set Validation set Testing set

Balanced Imbalanced Balanced Imbalanced Balanced Imbalanced

Tumour 0.75 0.30 0.10 0.10 0.15 0.60
Simple stroma 0.75 0.20 0.10 0.10 0.15 0.70
Complex stroma 0.75 0.70 0.10 0.10 0.15 0.20
Immune cells 0.75 0.60 0.10 0.10 0.15 0.30
Debris 0.75 0.70 0.10 0.10 0.15 0.20
Mucosal glands 0.75 0.80 0.10 0.10 0.15 0.10
Adipose 0.75 0.50 0.10 0.10 0.15 0.40
Background 0.75 0.40 0.10 0.10 0.15 0.50

Table 4 Fig. 3(d) illustrates a prediction explanation for an input image that
The performance results of CNN model with Xception architecture and corresponding
is hard to classify, because in addition to three tissue type with a low
CFCMC models for balanced and imbalanced data sets. The results of the precision,
recall and F1 metrics are averages of all classes. probability for misclassification, it offers three highly probable tissue
Metric CNN CFCMC types. Therefore, the expert should investigate the input image more
deeply. The following semantic explanation was extracted: The input
Balanced Unbalanced Balanced Unbalanced image is Immune cells tissue type. However, there is a high possibility
Accuracy 92.74% 90.06% 91.08% 87.23% that in reality it could be Tumour epithelium or Complex stroma or
Precision 92.50% 89.93% 91.44% 90.33% Mucosal glands. Moreover, there is a low possibility that it could be
Recall 92.76% 90.20% 91.04% 89.24%
Simple stroma or Debris or mucus or Adipose tissue.
F1 92.64% 90.08% 91.26% 89.78%
Fig. 3(e) shows a prediction explanation for an input image that
is hard to classify and was misclassified. It was predicted as Simple
stroma tissue type. The explanation offers three tissue types with low
unbalanced data sets. As can be seen, the difference in the results of and one with high probability of misclassification. It can be seen that
all metrics are insignificant, which indicates that the FoM is unaffected in this case, input image is the most similar to the high probable tissue
by imbalance in data set.
type, Complex stroma, which is also the true tissue type. The following
semantic explanation was extracted: The input image is Simple stroma
4.5. Explanations examples
tissue type. However, there is a high possibility that in reality it could
Fig. 3 shows five examples of explanations with different difficulty be Complex stroma. Moreover, there is a low possibility that it could be
levels of classification of the input image. If an explanation offers tissue Immune cells or Debris or mucus or Mucosal glands.
types with a high probability of misclassification, the input image is
hard to classify. In case of only low probable misclassification offer,
5. Clinical trials results
the input image is not easy to classify. Finally, in cases that it provides
no offer, the input image is easy to classify, thus, the prediction is
certain. To generate explanations, the CFCMC model trained of features To evaluate the influence of the explanation generated from X-
extracted from DenseNet121 CNN architecture was used. CFCMC for human pathologists, we ran an acceptability test against the
Fig. 3(a) illustrates an example of a prediction explanation for the plain CNN. The objective of this experiment was to evaluate the accept-
input image, which is easy to classify because it offers no other tissue ability of the explanation-generating X-CFCMC for human pathologists.
types. Moreover, it can be seen that the input image is very similar to
We used within-subject experimental design, thus at the clinical trials
the training image responsible for the prediction (TRP). Therefore, the
both systems were shown to 14 pathologists (3 men and 11 women),
prediction is certain. The following semantic explanation was extracted:
with an average age of 40.7 years and an average length of service
The input image is for sure Tumour epithelium, because it could no be
misclassified to any other tissue types. of 14.9 years in which the shortest length of service was 4 years and
Fig. 3(b) shows a prediction explanation for an input image that is the longest was 45 years. At the end of the session, feedback from the
not so easy to classify but was correctly classified. The input image pathologists was collected in the form of questionnaires.
was classified as Immune cells tissue type. Although it offers three
tissue types with a low probability of misclassification, namely Tumour
5.1. Experiment setting
epithelium, Simple stroma and Complex stroma, the input image is very
similar to a TRP image. Therefore, the TRP image could gain trust
in this prediction. The following semantic explanation was extracted: Prior to the experiments, each participant was informed about the
The input image is Immune cells tissue type. However, there is a low dataset. None of them were familiar with machine learning concepts.
possibility that in reality it could be Tumour epithelium or Simple stroma Therefore, each of the participant was informed about the automatic
or Complex stroma. classification of histopathological samples by machine learning. After-
Fig. 3(c) shows a prediction explanation for an input image that
wards, both interfaces were explained to the participants, including the
is not easy to classify and was misclassified. The input image was
controls and the means of presenting the predictions. Finally, it was
predicted as Adipose tissue type. The explanation offers two tissue types
explained to the participants that with the exception of asking for help
with low probability, namely Simple stroma and Debris or mucus. It can
be seen that the input image is more similar to low probable tissue types with controls, dialogue with the interviewers was discouraged. One dif-
than the TRP image. This could lower trust in this particular decision. ferent classified WSI from dataset was shown to every participant, who
However, it offers the true tissue type of the input image, which is they were asked to examine 20 arbitrary area for both interfaces and
Debris or mucus. The following semantic explanation was extracted: The evaluate each prediction outcome. This takes approximately 30 min on
input image is Adipose tissue type. However, there is a low possibility average. At the end of the experiment session, every participant was
that in reality it could be Simple stroma or Debris or mucus. asked to fill out questionnaire.

7
P. Sabol et al. Journal of Biomedical Informatics 109 (2020) 103523

Fig. 3. Examples of the explanations generated for the testing dataset. The tissue types are displayed above the image.

5.2. Evaluation on users experiences questionnaire reached value of the Cronbach’s alpha, 𝛼 = 0.89. There-
fore, we can state that the participants sufficiently understood the
The users’ experiences in using the stand alone CNN and X-CFCMC objectives of the experiment. The questionnaire was divided into three
were evaluated using a questionnaire. The internal consistency of the parts.

8
P. Sabol et al. Journal of Biomedical Informatics 109 (2020) 103523

Table 5
The results from three metrics for validation of the certainty threshold, certainty rate 𝑐𝑟 , certainty error 𝑐𝑒 , ground
truth label error 𝑐𝑒𝐺𝑇 with corresponding occurrences for both balanced and unbalanced data set.
Data set #𝑦∗ 𝑦∗
#̃ 𝑒𝑟𝑟 # 𝑦∗𝑐 𝑐𝑟 𝑦∗𝑐
#̃ 𝑐𝑒 𝑦∗𝐺𝑇
#̃ 𝑐𝑒𝐺𝑇
𝑐

Balanced 750 67 8.74% 113 15.06% 0 0% 2 2.98%

Unbalanced 1872 243 12.76% 289 15.43% 0 0% 8 3.29%

The first and the second parts use a semantic differential scale, Table 6
A comparison of the average scores for each of the four evaluation ares for both
which presents respondents with a set of bipolar scales (useful/useless,
systems, the stand alone CNN model without providing explanations and the X-CFCMC
reliable/unreliable). Respondents were asked to choose a number (from model that explains its decision.
1 to 6) that indicates the extent to which the adjectives relate to a Evaluation parameters Average score
characteristic evaluation of the stand alone CNN and X-CFCMC systems.
Plain CNN X-CFCMC
While 1 represents a positive adjective (f.e. useful), 6 represents a
Objectivity 1.75 1.65
negative adjective (f.e. useless). The adjectives were selected to cover
Level of details 2.16 1.85
four evaluation parameters of the systems’ influences on trust and Reliability 2.81 2.57
reliance to the pathologists: Quality 2.10 1.99
Total average 2.21 2.01
1. objectivity - objective/subjective, useful/useless, relevant/
irrelevant, serious/unserious, ethically/unethically
2. details - precise/imprecise, consistent/inconsistent,
complicated/uncomplicated, complete/incomplete 5.3. Discussion
3. reliability - accurate/inaccurate, faultless/faulty, straight/
misleading, certain/uncertain, reliable/unreliable From the results above, key findings emerge. The comparison of
4. quality - systematic/unsystematic, time saving/time consuming, the characteristic of both systems revealed that in the level of details,
clear/unclear, expert/inexpert, good quality/bad quality the pathologist consider the X-CFCMC system as more rigorous, more
precise, more consistent, more complete. Moreover, the X-CFCMC sys-
Table 6 shows the average scores of the four evaluation parameters tem is considered as more accurate, reliable and confident regarding its
for both systems, which are computed as the arithmetic mean of the predictions.
chosen number of each corresponding bipolar scale, while the total The statement evaluation indicates that the most useful means of
average is computed from the average scores of each evaluation pa- explanations are semantic explanation and a visualization of the training
rameter. Better values are highlighted in bold. Because the lower the image responsible for the prediction. Visualization of the other types of tissue
score means the better the evaluation of a certain characteristic, the was only appreciated by half of the pathologists. A direct comparison
results reveal that the X-CFCMC system obtained better average scores of both systems indicates that the X-CFCMC system is more acceptable
in all parameters. The most significant differences between the scores than the plain CNN system.
were in the level of details (0.31) and the reliability (0.24).
The third part of the questionnaire used dichotomous scale. It 6. Conclusion
consisted of closed-ended items that covered a subjective evaluation
of both interfaces so that the participants express their agreement In this study, we extend the explainability of the explainable Cumu-
or disagreement with the statements. The statements were created to lative Fuzzy Class Membership Criterion (CFCMC) classifier and used
be focused on the evaluation the truthfulness and usefulness of both it for classification of eight tissue type from histopathological cancer
systems. image samples.
Analysing the dichotomous scaled items of the third part of the First, we improved the performance of the CFCMC classifier on
questionnaires, we came to the following findings: colorectal image data using a fine-tuned Convolutional Neural Net-
work (CNN) as a feature extractor, which was pre-trained on different
X-CFCMC system. The semantic explanation influenced the trust of 10
dataset.
pathologists, while 9 of them increased and 1 decreased their trust
Next, we defined the factor of misclassification (FoM), which is able
in the prediction. The visualization of the training image responsible for
to estimate the possibility of the input sample being misclassified
the prediction increases trust of 10 of the pathologists while for the
to a particular conflicting class. Moreover, we defined the certainty
rest there was no influence. The visualization of the other types of tissue
threshold, thanks to which we are able to say, whether the prediction is
influenced only half of the pathologists, however, it increased their
certain or uncertain. The proposed uncertainty measure is significantly
trust of the system.
different from many other uncertainty measures of models, e.g. neural
Stand alone CNN system. The probability distribution of the prediction had networks model, firstly, because it is based on the classifiability of the
no influence to 10 of the pathologists. For the rest, it increased their data in the space, where the classifier makes its decision, and secondly,
trust in the prediction. in the case of uncertain prediction, it is able to suggest the classes in
which the input sample could be misclassified. Thus, it offers relevant
Comparison of the systems. Analysing five items devoted to a direct classes to be further examined. The experiments clearly supported this
comparison of the usefulness of both systems, the X-CFCMC system ability.
achieved a cumulative score of 50, while the plain CNN system achieved Finally, we developed two systems for segmentation of the whole-
a cumulative score of 20. slide images of histopathological cancer tissue. The first system used
Credibility of the systems. Comparing two items about the credibility of stand alone CNN and the second used an X-CFCMC classifier, which
both systems, the plain CNN system achieved a cumulative score of 23, provides three means of explanations: a semantical explanation about
while the X-CFCMC system achieved score of 21. the prediction and possible misclassification, a visualization of the
training image responsible for the prediction and a visualization of the
Usefulness of the whole-slide segmentation. Two items showed that, in other types of tissue.
general, all pathologists consider an automatic whole-slide segmenta- At the clinical trials with 14 pathologists, we measured the accept-
tion of the histopathological samples useful. ability and the trust of the pathologists for the proposed system. The

9
P. Sabol et al. Journal of Biomedical Informatics 109 (2020) 103523

results indicate that the X-CFCMC system is more useful and more [10] A. Holzinger, M. Plass, M. Kickmeier-Rust, K. Holzinger, G.C. Crişan, C.-M.
reliable than the plain CNN. Pintea, V. Palade, Interactive machine learning: experimental evidence for the
human in the algorithmic loop, Appl. Intell. 49 (7) (2019) 2401–2414.
In conclusion, this paper discussed about the usability and the reli-
[11] G.R. Vásquez-Morales, S.M. Martínez-Monterrubio, P. Moreno-Ger, J.A. Recio-
ability of a explainable classifier in real world medical settings through García, Explainable prediction of chronic renal disease in the Colombian
clinical trials. We believe that our proposed system can contribute to population using neural networks and case-based reasoning, IEEE Access 7 (2019)
the use of AI, especially by improving the usability and acceptability 152900–152910.
[12] J. Mullenbach, S. Wiegreffe, J. Duke, J. Sun, J. Eisenstein, Explainable prediction
of AI systems in medical domains, where speed in making decisions,
of medical codes from clinical text, 2018, arXiv preprint arXiv:1802.05695.
reliability and accountability are crucial. We are aware that the scale of [13] S.M. Lundberg, B. Nair, M.S. Vavilala, M. Horibe, M.J. Eisses, T. Adams, D.E.
our preliminary experiment was limited. The expansion of clinical trials Liston, D.K.-W. Low, S.-F. Newman, J. Kim, et al., Explainable machine-learning
to include more pathologists from various fields are of our immediate predictions for the prevention of hypoxaemia during surgery, Nat. Biomed. Eng.
2 (10) (2018) 749.
future interest. Moreover, in the medical settings we will be confronted
[14] A. Malhi, T. Kampik, H. Pannu, M. Madhikermi, K. Främling, Explaining machine
with imbalanced, heterogeneous and inaccurate data sets. Therefore, learning-based classifications of in-vivo gastral images, in: 2019 Digital Image
our next challenge in our research is also to examine how our classifier Computing: Techniques and Applications, DICTA, IEEE, 2019, pp. 1–7.
will perform on imperfect data. [15] P. Hartono, A transparent cancer classifier, Health Inf. J. 26 (1) (2020) 190–204,
http://dx.doi.org/10.1177/1460458218817800, PMID: 30596318.
[16] Ł. Rączkowski, M. Możejko, J. Zambonelli, E. Szczurek, ARA: accurate, reliable
CRediT authorship contribution statement and active histopathological image classification framework with Bayesian deep
learning, bioRxiv (2019) 658138.
Patrik Sabol: Conceptualization, Methodology, Software, Investi- [17] P. Sabol, P. Sinčák, K. Ogawa, P. Hartono, Explainable classifier supporting
decision-making for breast cancer diagnosis from histopathological images, in:
gation, Formal analysis, Writing - original draft, Writing - review
2019 International Joint Conference on Neural Networks, IJCNN, IEEE, 2019,
& editing, Data curation. Peter Sinčák: Conceptualization, Supervi- pp. 1–8.
sion, Funding acquisition, Project administration. Pitoyo Hartono: [18] P. Sabol, P. Sinčák, J. Buša, P. Hartono, Cumulative fuzzy class membership
Conceptualization, Methodology, Supervision, Writing - original criterion decision-based classifier, in: 2017 IEEE International Conference on
Systems, Man, and Cybernetics, SMC, 2017, pp. 334–339, http://dx.doi.org/10.
draft. Pavel Kočan: Conceptualization, Investigation, Resources.
1109/SMC.2017.8122625.
Zuzana Benetinová: Investigation, Resources, Data curation. Alž- [19] P. Sabol, P. Sinčák, J. Magyar, P. Hartono, Semantically explainable fuzzy
beta Blichárová: Investigation, Resources. Ľudmila Verbóová: In- classifier, Int. J. Pattern Recognit. Artif. Intell. 33 (12) (2019) 2051006,
vestigation, Resources, Writing - review & editing. Erika Štammová: http://dx.doi.org/10.1142/S0218001420510064.
[20] P.W. Koh, P. Liang, Understanding black-box predictions via influence func-
Investigation, Resources, Validation. Antónia Sabolová-Fabianová:
tions, in: Proceedings of the 34th International Conference on Machine
Formal analysis, Methodology, Visualization, Data curation. Anna Learning-Volume 70, JMLR. org, 2017, pp. 1885–1894.
Jašková: Formal analysis, Methodology, Visualization. [21] J. Zhou, H. Hu, Z. Li, K. Yu, F. Chen, Physiological indicators for user trust in
machine learning with influence enhanced fact-checking, in: International Cross-
Declaration of competing interest Domain Conference for Machine Learning and Knowledge Extraction, Springer,
2019, pp. 94–113.
[22] J.N. Kather, J. Krisam, P. Charoentong, T. Luedde, E. Herpel, C.-A. Weis, T.
The authors declare that they have no known competing financial Gaiser, A. Marx, N.A. Valous, D. Ferber, et al., Predicting survival from colorectal
interests or personal relationships that could have appeared to cancer histology slides using deep learning: A retrospective multicenter study,
influence the work reported in this paper. PLoS Med. 16 (1) (2019) e1002730.
[23] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale
image recognition, 2014, arXiv preprint arXiv:1409.1556.
Acknowledgements [24] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, Imagenet: A large-scale
hierarchical image database, in: 2009 IEEE Conference on Computer Vision and
This research is supported by AI4EU project from the European Pattern Recognition, IEEE, 2009, pp. 248–255.
[25] S. Mohseni, N. Zarei, E.D. Ragan, A survey of evaluation methods and measures
Union’s Horizon 2020 research and innovation programme under grant for interpretable machine learning, 2018, arXiv preprint arXiv:1811.11839.
agreement 825619 2019–2021, Maria Currie RISE LIFEBOTS Exchange [26] R. Tibshirani, G. Walther, T. Hastie, Estimating the number of clusters in a data
Grant Agreement ID : 824047 2019–2021 and EU FlagEra Joint project set via the gap statistic, J. R. Stat. Soc. Ser. B Stat. Methodol. 63 (2) (2001)
Robocom++, 2017–2021. 411–423.
[27] P.J. Rousseeuw, Silhouettes: a graphical aid to the interpretation and validation
of cluster analysis, J. Comput. Appl. Math. 20 (1987) 53–65.
References [28] D.L. Davies, D.W. Bouldin, A cluster separation measure, IEEE Trans. Pattern
Anal. Mach. Intell. (2) (1979) 224–227.
[1] J. Xie, R. Liu, I. Luttrell, C. Zhang, et al., Deep learning based analysis of [29] J.H. Holland, Adaptation in Natural and Artificial Systems: an Introductory
histopathological images of breast cancer, Front. Genet. 10 (2019) 80. Analysis with Applications to Biology, Control, and Artificial Intelligence, MIT
[2] J.N. Kather, C.-A. Weis, F. Bianconi, S.M. Melchers, L.R. Schad, T. Gaiser, A. Press, 1992.
Marx, F.G. Zöllner, Multi-class texture analysis in colorectal cancer histology, [30] A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep
Sci. Rep. 6 (2016) 27988. convolutional neural networks, in: Advances in Neural Information Processing
[3] B.E. Bejnordi, G. Zuidhof, M. Balkenhol, M. Hermsen, P. Bult, B. van Ginneken, Systems, 2012, pp. 1097–1105.
N. Karssemeijer, G. Litjens, J. van der Laak, Context-aware stacked convolu- [31] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the inception
tional neural networks for classification of breast carcinomas in whole-slide architecture for computer vision, in: Proceedings of the IEEE Conference on
histopathology images, J. Med. Imaging 4 (4) (2017) 044504. Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
[32] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in:
[4] J. de Matos, A.d.S. Britto Jr., L.E. Oliveira, A.L. Koerich, Histopathologic image
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
processing: A review, 2019, arXiv preprint arXiv:1904.07900.
2016, pp. 770–778.
[5] D. Komura, S. Ishikawa, Machine learning methods for histopathological image
[33] F. Chollet, Xception: Deep learning with depthwise separable convolutions, in:
analysis, Comput. Struct. Biotechnol. J. 16 (2018) 34–42.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
[6] T. Araújo, G. Aresta, E. Castro, J. Rouco, P. Aguiar, C. Eloy, A. Polónia, A.
2017, pp. 1251–1258.
Campilho, Classification of breast cancer histology images using convolutional
[34] G. Huang, Z. Liu, L. Van Der Maaten, K.Q. Weinberger, Densely connected
neural networks, PLoS One 12 (6) (2017) e0177544.
convolutional networks, in: Proceedings of the IEEE Conference on Computer
[7] M. Hägele, P. Seegerer, S. Lapuschkin, M. Bockmayr, W. Samek, F. Klauschen,
Vision and Pattern Recognition, 2017, pp. 4700–4708.
K.-R. Müller, A. Binder, Resolving challenges in deep learning-based analyses [35] C. Szegedy, S. Ioffe, V. Vanhoucke, A.A. Alemi, Inception-v4, inception-resnet and
of histopathological images using explanation methods, 2019, arXiv preprint the impact of residual connections on learning, in: Thirty-First AAAI Conference
arXiv:1908.06943. on Artificial Intelligence, 2017.
[8] A. Holzinger, C. Biemann, C.S. Pattichis, D.B. Kell, What do we need to build [36] M. Tan, Q.V. Le, Efficientnet: Rethinking model scaling for convolutional neural
explainable AI systems for the medical domain? 2017, arXiv preprint arXiv: networks, 2019, arXiv preprint arXiv:1905.11946.
1712.09923. [37] D.P. Kingma, J. Ba, Adam: A method for stochastic optimization, 2014, arXiv
[9] A. Holzinger, G. Langs, H. Denk, K. Zatloukal, H. Mueller, Causability and preprint arXiv:1412.6980.
explainability of AI in medicine, Data Min. Knowl. Discov. 10 (2019).

ARA Accurate, Reliable and Active Histopathological Image Classification Framework With Bayesian Deep Learning
No ratings yet
ARA Accurate, Reliable and Active Histopathological Image Classification Framework With Bayesian Deep Learning
12 pages
Histopathologic Cancer Detection Using Convolutional Neural Networks
No ratings yet
Histopathologic Cancer Detection Using Convolutional Neural Networks
4 pages
Data and Knowledge Co-Driving For Cancer Subtype Classification On
No ratings yet
Data and Knowledge Co-Driving For Cancer Subtype Classification On
12 pages
Duplichecker Plagiarism Report
No ratings yet
Duplichecker Plagiarism Report
3 pages
Enhancing Diagnostic Precision in Oncology
No ratings yet
Enhancing Diagnostic Precision in Oncology
6 pages
CCL Template.v3
No ratings yet
CCL Template.v3
17 pages
Enhancing Diagnostic Precision in Oncology - 1
No ratings yet
Enhancing Diagnostic Precision in Oncology - 1
6 pages
Technologies
No ratings yet
Technologies
24 pages
Plagiarism Report
No ratings yet
Plagiarism Report
2 pages
PG 18 51207
No ratings yet
PG 18 51207
15 pages
SSRN 4853237
No ratings yet
SSRN 4853237
17 pages
Assisting Pathologists in Detecting Cancer With Deep Learning
No ratings yet
Assisting Pathologists in Detecting Cancer With Deep Learning
2 pages
Automated Cancer Diagnosis Based On Histopathologi
No ratings yet
Automated Cancer Diagnosis Based On Histopathologi
17 pages
IJCNN 2016 16315 With Cover Page v2
No ratings yet
IJCNN 2016 16315 With Cover Page v2
9 pages
1 s2.0 S1746809424006372 Main
No ratings yet
1 s2.0 S1746809424006372 Main
12 pages
Histopathology Foundation Models Benchmark
No ratings yet
Histopathology Foundation Models Benchmark
59 pages
Lung and Colon Cancer Classification Using Multiscale Deep Features Integration of Compact Convolutional Neural Networks and Feature Selection
No ratings yet
Lung and Colon Cancer Classification Using Multiscale Deep Features Integration of Compact Convolutional Neural Networks and Feature Selection
28 pages
Assessment of Deep Learning Algorithms To Predict
No ratings yet
Assessment of Deep Learning Algorithms To Predict
13 pages
Deep Convolutional Neural Networks Using An Active Learning Strategy For Cervical Cancer Screening and Diagnosis
No ratings yet
Deep Convolutional Neural Networks Using An Active Learning Strategy For Cervical Cancer Screening and Diagnosis
9 pages
Colorectal Cancer Detection Based On Deep Learning
No ratings yet
Colorectal Cancer Detection Based On Deep Learning
5 pages
Yang 2019
No ratings yet
Yang 2019
9 pages
Diagnose Like A Radiologist Hybrid Neuro-Probabilistic Reasoning For Attribute-Based Medical Image Diagnosis
No ratings yet
Diagnose Like A Radiologist Hybrid Neuro-Probabilistic Reasoning For Attribute-Based Medical Image Diagnosis
17 pages
Advancing Cancer Classification With Hybrid Deep Learning: Image Analysis For Lung and Colon Cancer Detection
No ratings yet
Advancing Cancer Classification With Hybrid Deep Learning: Image Analysis For Lung and Colon Cancer Detection
15 pages
Breast Cancer Detection by Leveraging Machine Learning - 2020 - ICT Express
No ratings yet
Breast Cancer Detection by Leveraging Machine Learning - 2020 - ICT Express
5 pages
Improving Cancer Metastasis Detection Via Effectiv
No ratings yet
Improving Cancer Metastasis Detection Via Effectiv
13 pages
Paper F
No ratings yet
Paper F
15 pages
Applsci 123
No ratings yet
Applsci 123
19 pages
Breast Cancer Image Dataset Creation
No ratings yet
Breast Cancer Image Dataset Creation
68 pages
Colon Cancer Diagnostics-13-02939
No ratings yet
Colon Cancer Diagnostics-13-02939
19 pages
Enabling Non-Invasive Diagnosis of Thyroid Nodules With High Specificity and Sensitivity
No ratings yet
Enabling Non-Invasive Diagnosis of Thyroid Nodules With High Specificity and Sensitivity
9 pages
Cancers 15 01492
No ratings yet
Cancers 15 01492
20 pages
Decision Support Systems For Disease Detection and
No ratings yet
Decision Support Systems For Disease Detection and
4 pages
Folmsbee Et Al. - 2018 - Active Deep Learning Improved Training Efficiency of Convolutional Neural Networks For Tissue Classification in
No ratings yet
Folmsbee Et Al. - 2018 - Active Deep Learning Improved Training Efficiency of Convolutional Neural Networks For Tissue Classification in
4 pages
Chrons Disease Documentation
No ratings yet
Chrons Disease Documentation
390 pages
CNN Enhances Breast Cancer Detection
No ratings yet
CNN Enhances Breast Cancer Detection
11 pages
Automated Histological Classification For Digital Pathology Images of Colonoscopy Specimen Via Deep Learning
No ratings yet
Automated Histological Classification For Digital Pathology Images of Colonoscopy Specimen Via Deep Learning
8 pages
Batch 27
No ratings yet
Batch 27
28 pages
Cancer Detection and Segmentation in Pathological Whole Slide Images 1
No ratings yet
Cancer Detection and Segmentation in Pathological Whole Slide Images 1
20 pages
CNN Model for Colon Cancer Classification
No ratings yet
CNN Model for Colon Cancer Classification
16 pages
Colorectal Cancer Image Recognition Algorithm Based On Improved
No ratings yet
Colorectal Cancer Image Recognition Algorithm Based On Improved
11 pages
Interpretable Deep Transfer Learning For Breast Ultrasound Cancer Detection: A Multi-Dataset Study
No ratings yet
Interpretable Deep Transfer Learning For Breast Ultrasound Cancer Detection: A Multi-Dataset Study
6 pages
Tumour Detect
No ratings yet
Tumour Detect
34 pages
An Intelligent System For Automated Breast Cancer Diagnosis and Prognosis Using SVM Based Classifiers
No ratings yet
An Intelligent System For Automated Breast Cancer Diagnosis and Prognosis Using SVM Based Classifiers
2 pages
Deep Learning Predictive Model For Colon Cancer
No ratings yet
Deep Learning Predictive Model For Colon Cancer
10 pages
Explainable Ensemble Machine Learning
No ratings yet
Explainable Ensemble Machine Learning
13 pages
An Optimal Deep Learning Approach For Breast Cancer
No ratings yet
An Optimal Deep Learning Approach For Breast Cancer
17 pages
Histoapathological Imaging Using Deep Learning
No ratings yet
Histoapathological Imaging Using Deep Learning
13 pages
Deep Learning Applied For Histological Diagnosis of Breast Cancer
No ratings yet
Deep Learning Applied For Histological Diagnosis of Breast Cancer
17 pages
Prognosis CRC
No ratings yet
Prognosis CRC
22 pages
Vander Laak 2021
No ratings yet
Vander Laak 2021
10 pages
Accuracy Improvement in Binary and Multi-Class Classification of Breast Histopathology Images
No ratings yet
Accuracy Improvement in Binary and Multi-Class Classification of Breast Histopathology Images
6 pages
Exploiting Histopathological Imaging For Early Detection of Lung and Colon Cancer Via Ensemble Deep Learning Model
No ratings yet
Exploiting Histopathological Imaging For Early Detection of Lung and Colon Cancer Via Ensemble Deep Learning Model
19 pages
Improved Accuracy in Colorectal Cancer Tissue Decomposition Through Refinement of Established Deep Learning Solutions
No ratings yet
Improved Accuracy in Colorectal Cancer Tissue Decomposition Through Refinement of Established Deep Learning Solutions
14 pages
1 s2.0 S095741742202471X Main
No ratings yet
1 s2.0 S095741742202471X Main
11 pages
Thesis
No ratings yet
Thesis
38 pages
A Novel Deep-Learning Model For Automatic Detection and Classification of Breast Cancer Using The Transfer-Learning Technique
No ratings yet
A Novel Deep-Learning Model For Automatic Detection and Classification of Breast Cancer Using The Transfer-Learning Technique
16 pages
Plagiarism Checker X - Report: Originality Assessment
No ratings yet
Plagiarism Checker X - Report: Originality Assessment
12 pages
Digital Detox Reality British English Student Ver2
No ratings yet
Digital Detox Reality British English Student Ver2
4 pages
Rheumatic Fever Nursing Care Guide
No ratings yet
Rheumatic Fever Nursing Care Guide
3 pages
Katombora Reformatory Child Centre
No ratings yet
Katombora Reformatory Child Centre
17 pages
Medika Plaza Extension Directory
No ratings yet
Medika Plaza Extension Directory
1 page
Section 5
No ratings yet
Section 5
65 pages
SO3 A2P PC 6C Howto
No ratings yet
SO3 A2P PC 6C Howto
1 page
Understanding Insurance Fraud in India - Penalties and Legal Consequences
No ratings yet
Understanding Insurance Fraud in India - Penalties and Legal Consequences
10 pages
081-017-118 ACRAGLAS (R) GEL HARDENER, 8 OZ. - 081 1357 SDS US English
No ratings yet
081-017-118 ACRAGLAS (R) GEL HARDENER, 8 OZ. - 081 1357 SDS US English
9 pages
Deutsche Bank Statement On Human Rights
No ratings yet
Deutsche Bank Statement On Human Rights
4 pages
Data Item Apotek Cf2
No ratings yet
Data Item Apotek Cf2
189 pages
Opalescence Boost Whitening IFU US 1006467AR07
No ratings yet
Opalescence Boost Whitening IFU US 1006467AR07
2 pages
Industrial Safety Management Guide
No ratings yet
Industrial Safety Management Guide
132 pages
FSR Strategy Malta Fa PB Ii Version 0ctober 2023
No ratings yet
FSR Strategy Malta Fa PB Ii Version 0ctober 2023
22 pages
Psychotherapy Appears To Improve Symptoms of Functional Dyspepsia and Anxiety Systematic Review With Meta-Analysis
No ratings yet
Psychotherapy Appears To Improve Symptoms of Functional Dyspepsia and Anxiety Systematic Review With Meta-Analysis
28 pages
Iphex List
No ratings yet
Iphex List
11 pages
Oet 1
No ratings yet
Oet 1
7 pages
RPD Design for Dental Professionals
100% (1)
RPD Design for Dental Professionals
133 pages
Inclusive Cognitive Science
No ratings yet
Inclusive Cognitive Science
4 pages
Community Midwifery
No ratings yet
Community Midwifery
22 pages
Atopic Dermatitis A Case Report
No ratings yet
Atopic Dermatitis A Case Report
3 pages
Senior Safety Engineer SHADI SHADDAD AMER
No ratings yet
Senior Safety Engineer SHADI SHADDAD AMER
4 pages
Louis G. Castonguay, Clara E. Hill (Eds.) - Transformation in Psychotherapy_ Corrective Experiences Across Cognitive Behavioral, Humanistic, And Psychodynamic Approaches-American Psychological Associa
100% (2)
Louis G. Castonguay, Clara E. Hill (Eds.) - Transformation in Psychotherapy_ Corrective Experiences Across Cognitive Behavioral, Humanistic, And Psychodynamic Approaches-American Psychological Associa
405 pages
Fitcog SSP Rec July 2022 Final
No ratings yet
Fitcog SSP Rec July 2022 Final
61 pages
ST Periodical Test in Pe 11 and
No ratings yet
ST Periodical Test in Pe 11 and
6 pages
4 Arrival Procedures Timeline
100% (1)
4 Arrival Procedures Timeline
1 page
Ayushman Bharat: Documentation of Process For Customization of Standard Treatment Guidelines
No ratings yet
Ayushman Bharat: Documentation of Process For Customization of Standard Treatment Guidelines
35 pages
How To... : Why Measure Fertility and Early Deads Levels?
No ratings yet
How To... : Why Measure Fertility and Early Deads Levels?
8 pages
Abstract of COVID Article
No ratings yet
Abstract of COVID Article
2 pages
Privat Facility in DHIS 2
No ratings yet
Privat Facility in DHIS 2
6 pages
Experiment With An Air Pump
No ratings yet
Experiment With An Air Pump
11 pages

Sabol Et Al 2020

Uploaded by

Sabol Et Al 2020

Uploaded by

Journal of Biomedical Informatics 109 (2020) 103523

Contents lists available at ScienceDirect

Journal of Biomedical Informatics

Explainable classifier for improving the accountability in decision-making

ARTICLE INFO ABSTRACT

1. Introduction A long list of machine learning approaches to image classification

⎧ ‖ ‖ where, 𝑖 = 1, … , 𝑛𝑐 . Optimizing of 𝑝𝑖 , the CFCMC surface’s shape is

4.1. Histopathological data description 4.2. Boosting the CFCMC’s performance

AlexNet 5000 734 14.68% 538 10.76% 0 0.00% 6 0.82%

Balanced Imbalanced Balanced Imbalanced Balanced Imbalanced

Balanced 750 67 8.74% 113 15.06% 0 0% 2 2.98%

You might also like