Original research
Keywords: Explainable artificial intelligence; Explainable machine learning; Uncertainty measure; Digital pathology; Colorectal cancer

Pathologists are responsible for diagnosing cancer types from histopathological cancer tissues. However, microscopic examination is known to be tedious and time-consuming. In recent years, a long list of machine learning approaches to image classification and whole-slide segmentation has been developed to support pathologists. Although many show exceptional performance, the majority of them are not able to rationalize their decisions. In this study, we developed an explainable classifier to support decision making for medical diagnoses. The proposed model does not provide an explanation of the causality between the input and the decisions, but offers a human-friendly explanation of the plausibility of the decision. The Cumulative Fuzzy Class Membership Criterion (CFCMC) explains its decisions in three ways: through a semantical explanation of the possibility of misclassification, by showing the training sample responsible for a given prediction, and by showing training samples from conflicting classes. In this paper, we explain the mathematical structure of the classifier, which is not designed to be used as a fully automated diagnosis tool but as a support system for medical experts. We also report the accuracy of the classifier on real-world histopathological data for colorectal cancer, and we tested the acceptability of the system in clinical trials with 14 pathologists. We show that the proposed classifier is comparable to state-of-the-art neural networks in accuracy, but, more importantly, it is more acceptable to human experts as a diagnosis support tool in the medical domain.
    ∗ Corresponding author.
      E-mail addresses: patrik.sabol@tuke.sk (P. Sabol), peter.sincak@tuke.sk (P. Sinčák).
https://doi.org/10.1016/j.jbi.2020.103523
Received 12 April 2020; Received in revised form 22 July 2020; Accepted 27 July 2020
Available online 3 August 2020
1532-0464/© 2020 Elsevier Inc. All rights reserved.
1.1. Relevant studies

    In [9], Holzinger et al. distinguished two types of explainable AI. Ante-hoc systems incorporate explainability directly into the structure of an AI model; these are systems that are interpretable by design. Typical examples include linear regression, decision trees and fuzzy inference systems. They are commonly referred to as white-boxes or, more recently, glass-boxes [10]. Post-hoc systems, on the other hand, aim to explain and interpret black-box classifiers by providing local explanations for their specific decisions. The majority of explanation approaches seek to link a particular output of the classifier to the input variables in order to show the impact of the features on the final decision. For instance, in [11], G. R. Vásquez-Morales et al. used a neural-network-based classifier to predict whether a person is at risk of developing chronic kidney disease. Here, a black-box machine-learning method was complemented by Case-Based Reasoning, a white-box method that is able to find explanatory cases for an explanation-by-example justification of a neural network's prediction. In [12], Mullenbach et al. presented an attentional convolutional network that predicted medical codes from clinical texts. Using an attention mechanism, the most relevant segments of the clinical text for each of the medical codes were selected and used as an explanation mechanism. Through an interpretability evaluation by a physician, they showed that the attention mechanism identified meaningful explanations. In [13], Lundberg et al. presented an ensemble-model-based machine learning method using deep learning that predicts the near-term risk of hypoxaemia during anaesthesia care and explains the patient- and surgery-specific factors that led to that risk. The system improved the performance of anesthesiologists by providing an interpretable hypoxaemia risk together with the contributing factors. In [7], Hagele et al. utilized Layer-wise Relevance Propagation (LRP) to provide pixel-level explanation heatmaps for the classification decision of a CNN in digital histopathology analyses of tumour tissue. These explanations were used to improve the generalization of the classifier by detecting and removing the effects of hidden biases in the datasets used. A similar approach to visualizing the parts of the input image responsible for the prediction was used in [14], where LIME (Local Interpretable Model-agnostic Explanations) was utilized to provide a global understanding of the CNN model by providing explanations for individual instances in the context of in-vivo gastral image analysis.

    It is natural that in the delicate medical domain, prediction models should not only be accurate, but also accountable; they should state the uncertainty in their predictions, indicating difficult cases for which further inspection by human experts is necessary. Therefore, another approach to probing and interpreting a machine learning algorithm is to measure the uncertainty of the prediction for one particular example, the predictive uncertainty [9]. In [15], a transparent neural network, S-rRBF, was proposed and applied to DNA microarray data sets. It provides an intuitive explanation through a visualization of its decision process on the given problem, allowing the users to understand why a certain problem is easy or difficult. Moreover, it makes it possible to see whether a new input is hard to classify or unlikely to be misclassified. However, the visual information still needs to be interpreted and is thus prone to subjective inconsistencies. For the field of digital pathology, in [16], Raczkowski et al. proposed an accurate, reliable and active (ARA) image classification framework using a Bayesian Convolutional Neural Network (ARA-CNN) for classifying histopathological images of colorectal cancer. The model achieves reliability by measuring the uncertainty of each prediction, a capability that was also used to identify mislabelled training samples. In [17], the recently proposed semantically explainable fuzzy classifier called the Cumulative Fuzzy Class Membership Criterion (CFCMC) [18,19] was used to classify histopathology images for breast cancer and to generate additional information about classification reliability in human-friendly terms, in the form of a semantic explanation. It provides a confidence measure for the classification result of a test image, followed by a visualization of the training image and of the most similar images that belong to clusters of the conflicting class with a different confidence degree. In this paper, we extend the explainability of the CFCMC classifier by defining the factor of misclassification (FoM) and the certainty threshold. While the FoM is a value that describes the possibility of the input sample being misclassified to one particular conflicting class, the certainty threshold is the value of the FoM under which it is certain that the input sample will not be misclassified. Compared to the concept of the uncertainty measure proposed in [16], in the case of an uncertain prediction our approach is additionally able to suggest the classes into which the input sample could be misclassified; it thus offers relevant classes to be further examined.

    A different approach to interpreting the decision of a classifier is based on generating instances that are close to an observation. In [20], the influence function was used to trace a model's prediction through the learning algorithm and back to its training data, thereby identifying the training points most responsible for a given prediction. This approach was used to explain the prediction of a black-box, deep neural network model. Moreover, the paper [21] investigates the effects of presenting the influence of training data points on machine learning predictions to boost user trust, measured through psychological signals (Galvanic Skin Response and Blood Volume Pulse); it showed that these features correlate with user trust. Such reference-based explanations are needed in medicine, where, for instance, they could help to diagnose the type of cancer from histopathological images.

    In this study, the explainability of our classifier refers to its ability to provide a degree of confidence for each of its predictions and to express that information in an intuitive and human-friendly manner. The information is expressed through visualizations of the training examples that are responsible for the prediction outcome, paired with semantical explanations regarding the likelihood of misclassifications. While our method does not provide explainability for the causality between the input and the prediction, we believe that this explainability improves the accountability of the proposed classifier and thus significantly contributes to supporting decision making in time-crucial medical domains. The objective of this article is to apply the explainable CFCMC classifier to the classification of histopathological images of colorectal cancer. We used a publicly available dataset that was released in [2] by Kather et al. It consists of a training set comprising 5000 small tiles, each of them annotated with one of eight tissue classes, and 10 non-annotated whole-slide images (WSI) of the tissue.

    In [16], it was shown that a CNN outperformed the approach in [2], where features derived from images using texture descriptors served as a basis for a support vector machine model to classify colorectal cancer. Moreover, in [22], a CNN achieved an exceptional level of performance, 98.7% accuracy, on a nine-tissue-type classification of colorectal cancer, using the VGG19 model [23] pretrained on the ImageNet database [24]. Therefore, to enhance the accuracy of the CFCMC, we employ a Convolutional Neural Network as a feature extractor. We are aware that the CNN model loses explainability by compressing the data from the feature space into the latent space, which makes it hard to trace a decision back to the features in the feature space. This problem is not relevant for our study, because we do not provide an explanation of the causality between the features and the decisions; we provide an explanation of the classifiability of the data, which is significantly improved by the CNN mapping the data from the feature space into the latent space.

    Finally, we developed an explanation interface, which provides the semantical and visual explanations extracted from the CFCMC classifier used to classify the WSIs of the colorectal cancer tissue. We evaluated our XAI (eXplainable Artificial Intelligence) system using a common within-subject experimental design [25]; the outcomes from our explanation interface (XAI system) were compared with the outcomes from a stand-alone CNN (AI system with no explanation) by 14 pathologists in clinical trials through questionnaires.
2. Proposed explainable model

    In this section, the mathematical description of the Cumulative Fuzzy Class Membership Criterion classifier is given, followed by the definition of the factor of misclassification and the certainty threshold.

2.1. Cumulative Fuzzy Class Membership Criterion decision based classifier

    The proposed method is based on the assumption that the $d$-dimensional data in the feature space are split into $n_c$ classes, where $C_i$ $(i = 1, \dots, n_c)$, the $i$th class, is divided into $n^i_{cl}$ clusters, where $Cl_{ij}$ $(j = 1, \dots, n^i_{cl})$ is the $j$th cluster of the $i$th class. Each cluster $Cl_{ij}$ comprises training data $\tilde p_{ijk} \in \mathbb{R}^d$ $(k = 1, \dots, m_{ij})$, where $m_{ij}$ is the number of training patterns of the cluster $Cl_{ij}$. Each training pattern $\tilde p$ defines a fuzzy class membership criterion $\kappa_{\tilde p}(x)$, modelled as a triangular function (Eq. (1)). The CFCMC value of an unknown pattern $x$ for the class $C_i$ is then
\[
\chi_{C_i}(x) = \max_j \Bigg( \frac{1}{K_{ij}} \sum_{k=1}^{K_{ij}} \kappa_{\tilde p_{ijk}}(x) \Bigg), \tag{2}
\]
where $\chi_{C_i}(x)$ is the value of the CFCMC for an unknown pattern $x$ with respect to the class $C_i$.
    The decision rule for the winner class $CL$ for the input pattern $x$ is then
\[
CL(x) = C_{\operatorname{arg\,max}_i \left( \chi_{C_i}(x) \right)}. \tag{3}
\]
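To make the decision rule concrete, the following minimal NumPy sketch computes the class scores of Eq. (2) and the decision of Eq. (3), including the "not classified" threshold $\theta$ described in Section 2.2. Since Eq. (1) is not reproduced above, the triangular membership is assumed here to be a simple hat function of the distance with width $a$, and the $K_{ij}$ largest membership values of a cluster are averaged; both are illustrative assumptions rather than the authors' exact formulation.

```python
import numpy as np

def kappa(x, p, a):
    # Assumed stand-in for Eq. (1): triangular (hat) membership that is 1 at the
    # training pattern p and decays linearly to 0 at distance a.
    return max(0.0, 1.0 - np.linalg.norm(x - p) / a)

def chi_class(x, clusters):
    # Eq. (2): for each cluster (patterns, a_ij, K_ij) of one class, average the
    # K_ij largest memberships; the class score is the maximum over its clusters.
    scores = []
    for patterns, a_ij, K_ij in clusters:
        memberships = sorted((kappa(x, p, a_ij) for p in patterns), reverse=True)
        scores.append(sum(memberships[:K_ij]) / K_ij)
    return max(scores)

def classify(x, classes, theta=0.01):
    # Eq. (3) plus the threshold rule; classes maps a label to its list of clusters.
    chi = {label: chi_class(x, cl) for label, cl in classes.items()}
    winner = max(chi, key=chi.get)
    return (winner if chi[winner] >= theta else "not classified"), chi
```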
2.2. Algorithm description

    The algorithm consists of two phases: the initialization phase and the learning phase. The initialization phase consists of three processes: data splitting, clustering, and parameter initialization. First, the input data are divided into three sets: a training set, a validation set, and a testing set. The training patterns are used in Eq. (1) to create the CFCMC decision surface. During the learning phase, the decision surface is optimized (the parameters $a_{ij}$ in Eq. (1) and $K_{ij}$ in Eq. (2) are adaptively optimized) in order to cover all validation patterns. The testing set is used for the final evaluation of the created decision surface.

    Afterwards, the training data of each class $C_i$ are independently clustered in order to find $n^i_{cl}$ clusters for each class in the feature space using the well-known K-means algorithm. The number of clusters, $k$, is estimated via the gap statistic [26]. This technique uses the output of any clustering algorithm, comparing the change in the within-cluster dispersion with that expected under an appropriate reference null distribution. Any other technique for estimating the number of clusters can be used, such as Silhouette analysis [27] or the Davies–Bouldin clustering criterion [28].

    Next, the parameters $a_{ij}$ and $K_{ij}$ in Eq. (1) and Eq. (2), respectively, are initialized. These parameters affect the shape of the boundary created by the fuzzy class membership criterion $\kappa_{\tilde p}$. Every fuzzy class membership criterion $\kappa_{\tilde p_{ijk}}$ of the $j$th cluster of the $i$th class shares the same values of the parameters $a$ and $K$. $a_{init}$ is initialized as follows:
\[
a_{init} = \frac{1}{n_{\tilde p}} \sum_{i=1}^{n_{\tilde p}} \min_{j \ne i} \left\| \tilde p_i - \tilde p_j \right\|, \tag{4}
\]
where $n_{\tilde p}$ is the number of training patterns. The $K_{init}$ value is initialized from the interval $(1; m_{ij})$. The value of the threshold $\theta$ is set from the interval $(0; 1)$. If the value of the CFCMC $\chi(x)$ of the input pattern $x$ is below the threshold $\theta$, i.e. $\chi(x) < \theta$, the pattern is "not classified". Finally, the CFCMC surface is computed using Eq. (1) and Eq. (2).
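As a quick illustration of Eq. (4), the following sketch computes $a_{init}$ as the mean nearest-neighbour distance over all training patterns; the vectorized pairwise-distance computation is ordinary NumPy and not part of the original formulation.

```python
import numpy as np

def a_init(patterns):
    # patterns: array of shape (n_p, d) holding all training patterns.
    # Eq. (4): average, over every pattern, of the distance to its nearest other pattern.
    dists = np.linalg.norm(patterns[:, None, :] - patterns[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)  # exclude the pattern itself (j != i)
    return dists.min(axis=1).mean()
```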
    During the learning phase, the shape of the CFCMC surface is adjusted in order to obtain the highest classification accuracy. The assumption of dividing the training set into $n_c$ classes and each class $C_i$ into $n^i_{cl}$ clusters $Cl_{ij}$ generates a set of vectors
\[
p_i = [\,a_{i1}, K_{i1};\; \dots;\; a_{ij}, K_{ij};\; \dots;\; a_{i n^i_{cl}}, K_{i n^i_{cl}}\,]. \tag{5}
\]

2.3. Factor of misclassification

    The term factor of misclassification (FoM) is described as "the likelihood of the input sample, which is assigned to the cluster $Cl$ belonging to the class $C$, to be misclassified to one of the rest of the classes", i.e. the possibility that in reality the observation belongs to another class. The factor of misclassification of the input sample, assigned to the cluster $Cl_{Ai}$, with respect to the reference cluster $Cl_{Bj}$ is defined as follows:
\[
FoM(x, Cl_{Bj}) = \frac{\chi_{Cl_{Bj}}(x)}{\chi_{Cl_{Ai}}(x)} + sim_{Cl}(Cl_{Ai}, Cl_{Bj}), \tag{8}
\]
where the first term on the right-hand side describes the local similarity as the ratio between the memberships $\chi(x)$ of the input sample to the reference cluster $Cl_{Bj}$ and to the winner cluster $Cl_{Ai}$. The second term on the right-hand side describes the global similarity, which is based on the relationship between the data's clusters.
    The similarity between the two clusters $Cl_{Ai}$ and $Cl_{Bj}$ is defined as follows:
\[
sim_{Cl}(Cl_{Ai}, Cl_{Bj}) = \frac{A_{intersection}(Cl_{Ai}, Cl_{Bj})}{A_{Cl_{Ai}}}, \tag{9}
\]
where $A_{Cl}$ is the area of the hypersphere describing the cluster $Cl$ and $A_{intersection}$ is the area of the intersection of the two clusters. Here, for simplification, the clusters are described by $n$-dimensional hyperspheres, where $n$ is the data dimensionality. For the centre and the radius of a hypersphere, the coordinates of the cluster's centroid $c_{Cl_{ij}}$ and the estimated variance $\hat\sigma_{Cl_{ij}}$ of the cluster's data, respectively, are used. For computational purposes, the $n$-hypersphere is transferred into a two-dimensional circle. The area of intersection between the two clusters is then computed using simple two-dimensional trigonometry from the distance between the centres and the radii of the circles, i.e. the Euclidean distance $\left\| c_{Cl_{Ai}} - c_{Cl_{Bj}} \right\|$ between the centroids of the clusters $Cl_{Ai}$ and $Cl_{Bj}$ and their estimated variances $\hat\sigma_{Cl_{Ai}}$ and $\hat\sigma_{Cl_{Bj}}$.
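The global-similarity term of Eq. (9) can be sketched with the standard two-circle (lens) intersection formula, taking the cluster centroids as centres and the estimated variances as radii, as described above; this is our reading of the simplification, not the authors' code.

```python
import numpy as np

def circle_intersection_area(d, r1, r2):
    # Standard lens area of two intersecting circles whose centres are d apart.
    if d >= r1 + r2:
        return 0.0
    if d <= abs(r1 - r2):
        return np.pi * min(r1, r2) ** 2
    a1 = r1 ** 2 * np.arccos((d ** 2 + r1 ** 2 - r2 ** 2) / (2 * d * r1))
    a2 = r2 ** 2 * np.arccos((d ** 2 + r2 ** 2 - r1 ** 2) / (2 * d * r2))
    a3 = 0.5 * np.sqrt((-d + r1 + r2) * (d + r1 - r2) * (d - r1 + r2) * (d + r1 + r2))
    return a1 + a2 - a3

def sim_cl(centroid_a, sigma_a, centroid_b, sigma_b):
    # Eq. (9): intersection area of the two cluster circles divided by the
    # area of the winner cluster Cl_Ai (radius = estimated variance).
    d = np.linalg.norm(centroid_a - centroid_b)
    return circle_intersection_area(d, sigma_a, sigma_b) / (np.pi * sigma_a ** 2)
```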
    The equation for the estimation of the cluster's variance value $\hat\sigma_{Cl_{ij}}$ was derived in [19] and is calculated as follows:
\[
\hat\sigma_{Cl_{ij}} = a_{Cl_{ij}} \left( \frac{k}{\chi^{max}_{Cl_{ij}}} \right)^{m},
\qquad
m = \begin{cases} 0.7 & \chi^{max}_{Cl_{ij}} \le k \\ 2.5 & \chi^{max}_{Cl_{ij}} > k \end{cases},
\qquad
k = p_1 K + p_2,
\qquad
p_l = a_l\, dim^{\,b_l} + c_l \ \ (l = 1, 2), \tag{10}
\]
\[
a_1 = -0.7621,\; b_1 = -0.2799,\; c_1 = 0.0746, \qquad
a_2 = 0.8372,\; b_2 = -0.3729,\; c_2 = 0.1758,
\]
where $dim$ is the dimensionality of the data.
    It should be noted that during the variance estimation, the Euclidean distance was replaced with the following distance measure $d$: let $x_A$ and $x_B$ be two $n$-dimensional vectors; then the distance $d$ is defined as
\[
d = \frac{1}{n} \sum_{i=1}^{n} \left| x_{Ai} - x_{Bi} \right|. \tag{11}
\]
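Eq. (10) with its fitted constants, together with the mean-absolute-difference distance of Eq. (11), translates directly into code; the sketch below is a literal transcription under the definitions given above.

```python
import numpy as np

def distance_mad(x_a, x_b):
    # Eq. (11): mean absolute difference, used instead of the Euclidean
    # distance during the variance estimation.
    return np.mean(np.abs(x_a - x_b))

def estimate_sigma(a_cl, chi_max, K, dim):
    # Eq. (10): empirical estimate of a cluster's radius from its width a_cl,
    # its maximal CFCMC value chi_max, the parameter K and the dimensionality dim.
    p1 = -0.7621 * dim ** -0.2799 + 0.0746
    p2 = 0.8372 * dim ** -0.3729 + 0.1758
    k = p1 * K + p2
    m = 0.7 if chi_max <= k else 2.5
    return a_cl * (k / chi_max) ** m
```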
    The value of the factor of misclassification of the input $x$ with respect to the $i$th class $C_i$ is computed as follows:
\[
FoM(x, C_i) = \max_j FoM(x, Cl_{ij}). \tag{12}
\]
    The factor of misclassification can also be expressed semantically. It takes the following values:
\[
D_{FoM} =
\begin{cases}
\text{no possibility} & FoM(x, C_A) \in \left( 0, c_\theta \right) \\
\text{low possibility} & FoM(x, C_A) \in \left( c_\theta, \theta_{mid} \right] \\
\text{high possibility} & FoM(x, C_A) \in \left( \theta_{mid}, FoM_{max} \right]
\end{cases} \tag{13}
\]
where $c_\theta$ is the certainty threshold and $\theta_{mid} = (FoM_{max} - c_\theta)/2 + c_\theta$. $FoM_{max}$ is the maximum value of the FoM computed over the validation samples as follows:
\[
FoM_{max} = \max_j \max_i FoM(x_j, C_i), \tag{14}
\]
where $x_j \in S_{valid}$.
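The semantic levels of Eq. (13) amount to a simple thresholding of the FoM value; a minimal sketch, assuming the boundary cases are resolved as written above:

```python
def fom_level(fom, c_theta, fom_max):
    # Eq. (13): map a FoM value onto the three semantic levels used in the explanations.
    theta_mid = (fom_max - c_theta) / 2 + c_theta
    if fom < c_theta:
        return "no possibility"
    if fom <= theta_mid:
        return "low possibility"
    return "high possibility"
```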
2.4. Certainty threshold and certain prediction

    The certainty threshold is the value of the FoM below which it is certain (i.e. there is no possibility of error) that the input sample $x$ assigned to the class $C_A$ will not be misclassified to any other class in the feature space.

    Let $x_k$ be the samples from the validation set $S_{valid}$ that were misclassified and let $C^k_{GT}$ be the ground truth label of the $k$th sample $x_k$. Then the certainty threshold $c_\theta$ is calculated as follows:
\[
c_\theta = \min_k FoM(x_k, C^k_{GT}). \tag{15}
\]
    It follows that if $FoM(x, C_i) < c_\theta$ holds, it is unlikely that the input sample $x$ belongs to the class $C_i$.

    Therefore, if it holds that $\forall i \in \{1, \dots, n_c\}:\; FoM(x, C_i) < c_\theta$, the prediction for the input sample $x$ is certain; otherwise it is uncertain.
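A direct transcription of Eq. (15) and the certainty rule, assuming the per-class FoM values have already been computed:

```python
def certainty_threshold(fom_gt_of_misclassified):
    # Eq. (15): c_theta is the smallest FoM that any misclassified validation
    # sample reaches for its ground-truth class.
    return min(fom_gt_of_misclassified)

def is_certain(fom_per_class, c_theta):
    # A prediction is certain when the FoM of every class stays below c_theta.
    return all(f < c_theta for f in fom_per_class)
```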
2.5. Representative training sample

    The representative training sample $\tilde p_r$ is the training sample that is the most responsible for assigning the input sample $x$ to the $j$th cluster $Cl_{ij}$ of the $i$th class $C_i$. It is computed as follows:
\[
\tilde p_r(x) = \operatorname{arg\,min}_k \left\| \tilde p_{ijk} - x \right\|, \tag{16}
\]
where $\left\| \tilde p_{ijk} - x \right\|$ is the Euclidean distance between $x$ and $\tilde p_{ijk}$.
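Eq. (16) is a nearest-neighbour lookup within the assigned cluster; a minimal sketch:

```python
import numpy as np

def representative_sample(x, cluster_patterns):
    # Eq. (16): return the training pattern of the assigned cluster that is
    # closest to x in Euclidean distance (the image shown to the pathologist).
    idx = int(np.argmin(np.linalg.norm(cluster_patterns - x, axis=1)))
    return idx, cluster_patterns[idx]
```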
3. Colorectal cancer detection explanation interface

    This section describes the explanations generated for human experts by the proposed system in colorectal cancer detection tasks. To evaluate the usefulness of the proposed explanations, we designed two systems: first, a plain CNN model that only generates decisions without any explanation; second, an X-CFCMC (eXplainable CFCMC) model that complements its decisions with explanations.

    For both systems, we developed similar user interfaces; both provide classification results for whole-slide images (WSI) of colorectal cancer tissue, showing the original image of the WSI and a corresponding label map with a colour code for eight different tissue types. A pathologist can examine an arbitrary area of the WSI by clicking on the desired area. Subsequently, the interfaces show their predictions. Finally, the pathologist can provide the final decision by selecting one of the eight buttons representing the eight different tissue types.

    The non-explainable AI system interface that uses the plain CNN model is shown in Fig. 1. It provides the predicted type of tissue and a probability distribution of the prediction over all tissue types, computed in the output layer of the CNN model with the softmax activation function.

    The explainable AI system interface that uses the X-CFCMC model is shown in Fig. 2. It complements its decision with three types of information:

3.0.1. Semantical explanation
    To provide user-friendly explanations, the prediction results and the information regarding the possibility of misclassification of the examined area of the WSI are semantically explained, for example with the phrase "there is a low possibility that this classification is wrong". This semantical explanation is generated based on the value of the FoM (a sketch of such sentence assembly follows Section 3.0.3).

3.0.2. Visualization of the training image most responsible for a given prediction
    To justify the prediction result, the training image most responsible for a given prediction is displayed to the user. A similar approach to understanding predictions was introduced in [20], where the influence function was used to trace the CNN's prediction through the learning algorithm and back to its training data. In our approach, the most responsible training image for a prediction is the representative training sample $\tilde p_r(x_w)$ for the input image $x_w$ assigned to the winner tissue type $w$. It follows that if the training image has a very similar context to the input image, it should gain the trust of the pathologist in the prediction. Otherwise, the pathologist could consider the certain prediction as not being reliable.

3.0.3. Visualization of training images of other types of tissue
    The third means of explanation shows the representative training images, for the input image, of the other tissue types into which the input image could be misclassified with a high or low possibility. It should visually explain to the pathologist why the input sample could be misclassified to a particular tissue type. In the case of a similar context, a pathologist could consider that particular tissue type to be the true type of tissue.
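As referenced in Section 3.0.1, the semantic explanation is a sentence assembled from the predicted tissue type and the classes falling into the "high" and "low possibility" levels of Eq. (13). The sketch below is illustrative; the wording mirrors the examples reported later in Section 4.5, not the authors' exact template engine.

```python
def semantic_explanation(predicted, high_risk, low_risk):
    # predicted: winner tissue type; high_risk / low_risk: tissue types whose
    # FoM falls into the "high" / "low possibility" levels of Eq. (13).
    if not high_risk and not low_risk:
        return (f"The input image is for sure {predicted}, because it could not "
                "be misclassified to any other tissue types.")
    sentence = f"The input image is {predicted} tissue type."
    connector = "However"
    if high_risk:
        sentence += (f" {connector}, there is a high possibility that in reality "
                     f"it could be {' or '.join(high_risk)}.")
        connector = "Moreover"
    if low_risk:
        sentence += (f" {connector}, there is a low possibility that it could be "
                     f"{' or '.join(low_risk)}.")
    return sentence
```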
4. Experiments

    To choose the best performing explainable model for classifying colorectal cancer image data, the model has to be accurate and reliable in its explanations. Therefore, in this section, we first present the results of the task of boosting the performance of the CFCMC classifier. Then, we describe the results of validating the certainty threshold. Finally, we show some examples of the generated explanations.
Fig. 1. The interface for the plain CNN, without any explanation, for presenting prediction on histopathological WSI. From the left hand side, it shows a prediction result and
the probability distribution of the prediction. Next, eight buttons for making a final decision by a pathologist are located there. Finally, the original image of the WSI and the
corresponding label map are visualized.
Fig. 2. The interface for the explainable system (X-CFCMC system) for presenting predictions with additional explanations of the CFCMC. From the left hand side, it shows a
semantical explanation of the results alongside the visualization of the training sample responsible for the prediction, below which there is a visualization of the other tissue types,
which could potentially be the true tissue type. Next, the original image of the WSI and the corresponding label map are visualized, below which eight buttons for making a final
decision by a pathologist are located.
    We used a publicly available dataset released in [2] by Kather et al. It consists of Hematoxylin and Eosin (H&E) tissue slides, which were cut into 5000 small tiles of size 150 × 150 pixels (equivalent to 74 μm × 74 μm), each of them annotated with one of eight tissue classes, namely tumour epithelium, simple stroma (homogeneous composition; includes tumour stroma, extra-tumoural stroma and smooth muscle), complex stroma (containing single tumour cells and/or few immune cells), immune cells (including immune-cell conglomerates and sub-mucosal lymphoid follicles), debris (including necrosis, haemorrhage and mucus), normal mucosal glands, adipose tissue, and background (no tissue). The data are class-balanced; each of the classes consists of 625 tiles.

    The focus of the first part of the experiments was to find the CNN architecture with the best performance as a feature extractor for training the explainable CFCMC classifier.

4.2.1. Experimental setup
    We utilized eight well-known CNN models, pre-trained on the ImageNet [24] dataset, specifically AlexNet [30], VGG-16 [23], Inception-v3 [31], ResNet-50 [32], Xception [33], DenseNet121 [34], Inception-ResNet-V2 [35] and EfficientNet0 [36]. For all of them, the fully connected layers were cut off and replaced with a dense layer containing 1024 neurons with ReLU activation functions and an output layer containing 8 neurons with the softmax activation function.
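A minimal Keras sketch of this transfer-learning head (our reconstruction of the stated setup, not the authors' code), using Xception as an example backbone; the layer name `penultimate` is our own label, and the commented training call follows the hyperparameters described below.

```python
import tensorflow as tf

# ImageNet-pretrained backbone with its fully connected layers removed,
# followed by a 1024-unit ReLU layer and an 8-way softmax output.
backbone = tf.keras.applications.Xception(
    include_top=False, weights="imagenet", pooling="avg", input_shape=(150, 150, 3))
hidden = tf.keras.layers.Dense(1024, activation="relu", name="penultimate")(backbone.output)
outputs = tf.keras.layers.Dense(8, activation="softmax")(hidden)
model = tf.keras.Model(backbone.input, outputs)

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_tiles, train_labels, epochs=100, validation_data=...)
```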
    Moreover, we created three lighter architectures that were trained from scratch, specifically a VGG-like model with 12 convolutional layers and 2 fully connected layers, an Inception-like model with 3 inception layers, and a ResNet20 model with a depth of 20. All models were trained using the Adam [37] optimizer to minimize cross-entropy for 100 epochs with the learning rate set to 0.0001.

    Finally, to train the explainable CFCMC classifier, we extracted the features from the last dense layer of all the CNN models. The experimental setup for the CFCMC algorithm was as follows: the number of clusters for each of the classes was set to 1, and the value of the threshold $\theta$ was set to $\theta = 0.01$. For the optimization of the CFCMC, the MATLAB implementation of the genetic algorithm was used with a population size of 50 individuals. The mutation rate was set to 0.2. Arithmetic crossover and adaptive feasible mutation operators were used for reproduction, and stochastic uniform selection was used to choose parents for the next generation. The algorithm stops at the 30th generation.
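The feature extraction described above amounts to reading out the penultimate dense layer of the trained network; a short sketch continuing the Keras example from Section 4.2.1 (the layer name `penultimate` and the variable `model` come from that illustrative sketch, not from the paper):

```python
import tensorflow as tf

# Sub-model that outputs the 1024-dimensional activations of the last dense
# layer; these vectors are the patterns fed to the CFCMC classifier.
feature_extractor = tf.keras.Model(
    inputs=model.input, outputs=model.get_layer("penultimate").output)
# features = feature_extractor.predict(tiles)   # shape: (num_tiles, 1024)
```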
                                                                                                        For the clinical trials we used a balanced dataset, in which each
4.2.2. Experimental results                                                                         class is uniformly distributed. Typically, this is not the case in many
   Table 1 provides the performance results of different CNN models                                 real world conditions; in most cases, classifiers are required to deal with
and the corresponding CFCMC models. The classification results are                                  imbalances. Therefore, to examine the performance of our classifier on
evaluated with a 10-folds cross validation test. Because of the class-                              imbalanced data set, we generated two data sets from histopathological
balanced dataset, the accuracy metric was chosen to evaluate the                                    images: balanced data set with the same ratio and imbalanced data set
performance.                                                                                        with different ratio for each of the class (See Table 3).
   From Table 1, it can be observed that pre-trained and fine-tuned                                     Table 4 provides the performance results of CNN models with
                                                                                                    fine-tuned Xception architecture and corresponding CFCMC models for
models outperform the ones trained from scratch. Moreover, it can be
                                                                                                    both data sets. Because we deal with imbalanced data set, beside the
seen that features extracted from the CNN models significantly boost
                                                                                                    accuracy, the classification results are also evaluated with other metrics
the performance of the explainable CFCMC models.
                                                                                                    (precision, recall and F1). The result shows that the differences of the
                                                                                                    CNN and CFCMC models between balanced and imbalanced data set
4.3. Validation of the certainty threshold                                                          are not significant, hence the models are able to deal with imbalanced
                                                                                                    data set.
   To validate the certainty threshold, three metrics were defined:                                     Moreover, Table 5 presents the results from the metrics for val-
certainty rate, certainty error and ground truth label certainty error.                             idation of the certainty threshold (defined in 4.3) for balanced and
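Given arrays of prediction outcomes and certainty flags, the three metrics of Eqs. (17)-(19) can be computed directly; a NumPy sketch under the definitions above:

```python
import numpy as np

def certainty_metrics(correct, certain, gt_offered):
    # correct:    boolean array, prediction matches the ground truth label
    # certain:    boolean array, prediction was flagged as certain
    # gt_offered: boolean array, the ground truth label remains a potentially
    #             true label, i.e. FoM(x, C_GT) >= c_theta
    correct, certain, gt_offered = map(np.asarray, (correct, certain, gt_offered))
    c_r = certain.mean()                                        # Eq. (17)
    c_e = (certain & ~correct).sum() / max(certain.sum(), 1)    # Eq. (18)
    wrong = ~correct
    c_e_gt = (wrong & ~gt_offered).sum() / max(wrong.sum(), 1)  # Eq. (19)
    return c_r, c_e, c_e_gt
```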
Table 2
The results from the three metrics for validation of the certainty threshold (certainty rate $c_r$, certainty error $c_e$, ground truth label certainty error $c_e^{GT}$) with the corresponding occurrences for each CFCMC model.

    CFCMC models    #y*    #ỹ*    err    #y*_c    c_r    #ỹ*_c    c_e    #ỹ*GT_c    c_e^GT

    Table 2 provides the results for the three previously defined metrics, computed on the predictions from the joined testing sets of the 10 folds for each CFCMC model. For each metric, the best performing models are highlighted in bold.
    CFCMC models trained on features extracted from pre-trained CNN models generally outperformed the models trained from scratch in terms of the certainty rate $c_r$ (13.99% against 3.41% on average). All of the models, however, achieved a very low certainty error $c_e$; 7 models achieved zero error, while the highest value was 2.17%. It follows that when the classifier labels its prediction as certain, it is unlikely that this prediction will be incorrect.
    Moreover, all of the models likewise reached a low value of the ground truth label certainty error $c_e^{GT}$ (lower than 5%). This implies that when the classifier labels its prediction as uncertain, it is very likely that the ground truth label will appear among the potentially true labels.

4.4. Performance on imbalanced data set

    For the clinical trials we used a balanced dataset, in which each class is uniformly distributed. Typically, this is not the case in many real-world conditions; in most cases, classifiers are required to deal with imbalances. Therefore, to examine the performance of our classifier on an imbalanced data set, we generated two data sets from the histopathological images: a balanced data set with the same ratio for each class and an imbalanced data set with a different ratio for each class (see Table 3).
Table 3
Fraction ratios of the balanced and imbalanced generated data sets.

    Class    Training set    Validation set    Testing set
    Table 4 provides the performance results of the CNN model with the fine-tuned Xception architecture and the corresponding CFCMC models for both data sets. Because we deal with an imbalanced data set, besides the accuracy, the classification results are also evaluated with other metrics (precision, recall and F1). The results show that the differences between the balanced and imbalanced data sets are not significant for either the CNN or the CFCMC models; hence, the models are able to deal with an imbalanced data set.

Table 4
The performance results of the CNN model with the Xception architecture and the corresponding CFCMC models for the balanced and imbalanced data sets. The results of the precision, recall and F1 metrics are averages over all classes.

    Metric       CNN                            CFCMC
                 Balanced      Unbalanced       Balanced      Unbalanced
    Accuracy     92.74%        90.06%           91.08%        87.23%
    Precision    92.50%        89.93%           91.44%        90.33%
    Recall       92.76%        90.20%           91.04%        89.24%
    F1           92.64%        90.08%           91.26%        89.78%

    Moreover, Table 5 presents the results for the metrics for validation of the certainty threshold (defined in Section 4.3) for the balanced and unbalanced data sets. As can be seen, the differences in the results of all metrics are insignificant, which indicates that the FoM is unaffected by imbalance in the data set.

4.5. Explanations examples

    Fig. 3 shows five examples of explanations with different difficulty levels of classification of the input image. If an explanation offers tissue types with a high probability of misclassification, the input image is hard to classify. If it offers only low-probability misclassifications, the input image is not easy to classify. Finally, if it offers no other tissue types, the input image is easy to classify and the prediction is therefore certain. To generate the explanations, the CFCMC model trained on features extracted from the DenseNet121 CNN architecture was used.

    Fig. 3(a) illustrates an example of a prediction explanation for an input image that is easy to classify, because the explanation offers no other tissue types. Moreover, it can be seen that the input image is very similar to the training image responsible for the prediction (TRP). Therefore, the prediction is certain. The following semantic explanation was extracted: The input image is for sure Tumour epithelium, because it could not be misclassified to any other tissue types.

    Fig. 3(b) shows a prediction explanation for an input image that is not so easy to classify but was correctly classified. The input image was classified as the Immune cells tissue type. Although the explanation offers three tissue types with a low probability of misclassification, namely Tumour epithelium, Simple stroma and Complex stroma, the input image is very similar to the TRP image. Therefore, the TRP image could gain trust in this prediction. The following semantic explanation was extracted: The input image is Immune cells tissue type. However, there is a low possibility that in reality it could be Tumour epithelium or Simple stroma or Complex stroma.

    Fig. 3(c) shows a prediction explanation for an input image that is not easy to classify and was misclassified. The input image was predicted as the Adipose tissue type. The explanation offers two tissue types with a low probability, namely Simple stroma and Debris or mucus. It can be seen that the input image is more similar to the low-probability tissue types than to the TRP image, which could lower trust in this particular decision. However, the explanation offers the true tissue type of the input image, which is Debris or mucus. The following semantic explanation was extracted: The input image is Adipose tissue type. However, there is a low possibility that in reality it could be Simple stroma or Debris or mucus.

    Fig. 3(d) illustrates a prediction explanation for an input image that is hard to classify, because in addition to three tissue types with a low probability of misclassification, it offers three highly probable tissue types. Therefore, the expert should investigate the input image more deeply. The following semantic explanation was extracted: The input image is Immune cells tissue type. However, there is a high possibility that in reality it could be Tumour epithelium or Complex stroma or Mucosal glands. Moreover, there is a low possibility that it could be Simple stroma or Debris or mucus or Adipose tissue.

    Fig. 3(e) shows a prediction explanation for an input image that is hard to classify and was misclassified. It was predicted as the Simple stroma tissue type. The explanation offers three tissue types with a low and one with a high probability of misclassification. It can be seen that in this case the input image is most similar to the highly probable tissue type, Complex stroma, which is also the true tissue type. The following semantic explanation was extracted: The input image is Simple stroma tissue type. However, there is a high possibility that in reality it could be Complex stroma. Moreover, there is a low possibility that it could be Immune cells or Debris or mucus or Mucosal glands.

5. Clinical trials results

    To evaluate the influence of the explanations generated by the X-CFCMC on human pathologists, we ran an acceptability test against the plain CNN. The objective of this experiment was to evaluate the acceptability of the explanation-generating X-CFCMC for human pathologists. We used a within-subject experimental design; thus, at the clinical trials, both systems were shown to 14 pathologists (3 men and 11 women), with an average age of 40.7 years and an average length of service of 14.9 years, in which the shortest length of service was 4 years and the longest was 45 years. At the end of the session, feedback from the pathologists was collected in the form of questionnaires.

5.1. Experiment setting

    Prior to the experiments, each participant was informed about the dataset. None of them were familiar with machine learning concepts. Therefore, each participant was informed about the automatic classification of histopathological samples by machine learning. Afterwards, both interfaces were explained to the participants, including the controls and the means of presenting the predictions. Finally, it was explained to the participants that, with the exception of asking for help with the controls, dialogue with the interviewers was discouraged. A different classified WSI from the dataset was shown to each participant, who was asked to examine 20 arbitrary areas with both interfaces and to evaluate each prediction outcome. This took approximately 30 min on average. At the end of the experiment session, every participant was asked to fill out a questionnaire.
Fig. 3. Examples of the explanations generated for the testing dataset. The tissue types are displayed above the image.
5.2. Evaluation of the users' experiences

    The users' experiences in using the stand-alone CNN and the X-CFCMC were evaluated using a questionnaire. The internal consistency of the questionnaire reached a Cronbach's alpha of $\alpha = 0.89$. Therefore, we can state that the participants sufficiently understood the objectives of the experiment. The questionnaire was divided into three parts.
Table 5
The results from the three metrics for validation of the certainty threshold (certainty rate $c_r$, certainty error $c_e$, ground truth label certainty error $c_e^{GT}$) with the corresponding occurrences for both the balanced and unbalanced data sets.

    Data set    #y*    #ỹ*    err    #y*_c    c_r    #ỹ*_c    c_e    #ỹ*GT_c    c_e^GT
The first and the second parts use a semantic differential scale, which presents respondents with a set of bipolar scales (useful/useless, reliable/unreliable). Respondents were asked to choose a number (from 1 to 6) that indicates the extent to which the adjectives relate to a characteristic evaluation of the stand-alone CNN and X-CFCMC systems. While 1 represents a positive adjective (e.g. useful), 6 represents a negative adjective (e.g. useless). The adjectives were selected to cover four evaluation parameters of the systems' influence on the trust and reliance of the pathologists:

    1. objectivity - objective/subjective, useful/useless, relevant/irrelevant, serious/unserious, ethical/unethical
    2. details - precise/imprecise, consistent/inconsistent, complicated/uncomplicated, complete/incomplete
    3. reliability - accurate/inaccurate, faultless/faulty, straight/misleading, certain/uncertain, reliable/unreliable
    4. quality - systematic/unsystematic, time-saving/time-consuming, clear/unclear, expert/inexpert, good quality/bad quality

Table 6
A comparison of the average scores for each of the four evaluation areas for both systems: the stand-alone CNN model, which provides no explanations, and the X-CFCMC model, which explains its decisions.

    Evaluation parameter      Average score
                              Plain CNN    X-CFCMC
    Objectivity               1.75         1.65
    Level of details          2.16         1.85
    Reliability               2.81         2.57
    Quality                   2.10         1.99
    Total average             2.21         2.01
Table 6 shows the average scores of the four evaluation parameters for both systems, which are computed as the arithmetic mean of the numbers chosen on each corresponding bipolar scale, while the total average is computed from the average scores of the individual evaluation parameters. Because a lower score means a better evaluation of a certain characteristic, the results reveal that the X-CFCMC system obtained better average scores in all parameters. The most significant differences between the scores were in the level of details (0.31) and the reliability (0.24).
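As a small illustration of this aggregation, the sketch below computes the per-parameter averages and the total average; the grouping of items follows the list above, and the response values are fabricated for the example:

```python
import numpy as np

# Hypothetical responses: one row per pathologist, one column per bipolar item,
# values from 1 (positive adjective) to 6 (negative adjective).
parameters = {
    "objectivity": np.array([[2, 1, 2, 1, 2], [1, 2, 2, 2, 1]]),
    "details":     np.array([[2, 2, 3, 2],    [1, 2, 2, 2]]),
}

# Average score per evaluation parameter = mean over respondents and items.
param_scores = {name: float(items.mean()) for name, items in parameters.items()}

# Total average = mean of the per-parameter averages, not of all raw items.
total_average = float(np.mean(list(param_scores.values())))
print(param_scores, round(total_average, 2))
```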
The third part of the questionnaire used a dichotomous scale. It consisted of closed-ended items that covered a subjective evaluation of both interfaces, so that the participants could express their agreement or disagreement with the statements. The statements were created to focus on evaluating the truthfulness and usefulness of both systems.

Analysing the dichotomously scaled items of the third part of the questionnaires, we came to the following findings:
X-CFCMC system. The semantic explanation influenced the trust of 10 pathologists: 9 of them increased and 1 decreased their trust in the prediction. The visualization of the training image responsible for the prediction increased the trust of 10 of the pathologists, while for the rest there was no influence. The visualization of the other types of tissue influenced only half of the pathologists; however, for those it increased their trust in the system.

Stand-alone CNN system. The probability distribution of the prediction had no influence on 10 of the pathologists. For the rest, it increased their trust in the prediction.

Comparison of the systems. Analysing five items devoted to a direct comparison of the usefulness of both systems, the X-CFCMC system achieved a cumulative score of 50, while the plain CNN system achieved a cumulative score of 20.

Credibility of the systems. Comparing two items about the credibility of both systems, the plain CNN system achieved a cumulative score of 23, while the X-CFCMC system achieved a score of 21.

Usefulness of the whole-slide segmentation. Two items showed that, in general, all pathologists consider an automatic whole-slide segmentation of the histopathological samples useful.
5.3. Discussion

From the results above, key findings emerge. The comparison of the characteristics of both systems revealed that, in the level of details, the pathologists consider the X-CFCMC system to be more rigorous, more precise, more consistent and more complete. Moreover, the X-CFCMC system is considered more accurate, reliable and confident regarding its predictions.

The statement evaluation indicates that the most useful means of explanation are the semantic explanation and the visualization of the training image responsible for the prediction. The visualization of the other types of tissue was appreciated by only half of the pathologists. A direct comparison of both systems indicates that the X-CFCMC system is more acceptable than the plain CNN system.

6. Conclusion

In this study, we extended the explainability of the explainable Cumulative Fuzzy Class Membership Criterion (CFCMC) classifier and used it for the classification of eight tissue types from histopathological cancer image samples.

First, we improved the performance of the CFCMC classifier on colorectal image data using a fine-tuned Convolutional Neural Network (CNN) as a feature extractor, which was pre-trained on a different dataset.
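As an illustration of this step, the sketch below treats an ImageNet-pre-trained VGG16 [23,24] as a feature extractor for tissue tiles. The choice of backbone, input size and pooling, as well as the omission of the fine-tuning stage, are assumptions made for the example rather than the authors' configuration:

```python
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input

# ImageNet-pre-trained backbone used as a feature extractor
# (illustrative choice of network, input size and pooling).
extractor = VGG16(weights="imagenet", include_top=False, pooling="avg",
                  input_shape=(150, 150, 3))

# tiles: a batch of histopathological image tiles resized to the input size
# (random placeholders here); the resulting vectors would feed the CFCMC classifier.
tiles = np.random.rand(4, 150, 150, 3) * 255.0
features = extractor.predict(preprocess_input(tiles))  # shape (4, 512)
```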
Next, we defined the factor of misclassification (FoM), which is able to estimate the possibility of the input sample being misclassified into a particular conflicting class. Moreover, we defined the certainty threshold, thanks to which we are able to say whether a prediction is certain or uncertain. The proposed uncertainty measure differs significantly from many other model uncertainty measures, e.g. those of neural network models: firstly, because it is based on the classifiability of the data in the space where the classifier makes its decision, and secondly, because in the case of an uncertain prediction it is able to suggest the classes into which the input sample could be misclassified. Thus, it offers relevant classes to be further examined. The experiments clearly supported this ability.
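The sketch below shows how such a certainty check could be applied at prediction time, under the assumption that each conflicting class receives a FoM score and that a score above the certainty threshold marks the prediction as uncertain; the names, values and decision logic are illustrative placeholders, since the actual definitions are given earlier in the paper:

```python
def certainty_report(predicted: str, fom: dict[str, float], threshold: float) -> str:
    """Flag a prediction as certain or uncertain and, when uncertain,
    suggest the conflicting classes worth a closer look (illustrative logic)."""
    suspects = {cls: score for cls, score in fom.items() if score >= threshold}
    if not suspects:
        return f"Prediction '{predicted}' is certain."
    ranked = sorted(suspects, key=suspects.get, reverse=True)
    return (f"Prediction '{predicted}' is uncertain; "
            f"it could be misclassified as: {', '.join(ranked)}.")

# Hypothetical FoM values for the conflicting classes of one input sample.
print(certainty_report("Adipose",
                       {"Simple stroma": 0.12, "Debris or mucus": 0.47},
                       threshold=0.35))
```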
Finally, we developed two systems for the segmentation of whole-slide images of histopathological cancer tissue. The first system used a stand-alone CNN and the second used the X-CFCMC classifier, which provides three means of explanation: a semantic explanation about the prediction and possible misclassification, a visualization of the training image responsible for the prediction and a visualization of the other types of tissue.

In the clinical trials with 14 pathologists, we measured the acceptability of the proposed system and the trust of the pathologists in it. The results indicate that the X-CFCMC system is more useful and more reliable than the plain CNN.

In conclusion, this paper discussed the usability and the reliability of an explainable classifier in real-world medical settings through clinical trials. We believe that our proposed system can contribute to the use of AI, especially by improving the usability and acceptability of AI systems in medical domains, where speed of decision making, reliability and accountability are crucial. We are aware that the scale of our preliminary experiment was limited. The expansion of the clinical trials to include more pathologists from various fields is of immediate interest for our future work. Moreover, in medical settings we will be confronted with imbalanced, heterogeneous and inaccurate data sets. Therefore, our next research challenge is also to examine how our classifier will perform on imperfect data.
CRediT authorship contribution statement

Patrik Sabol: Conceptualization, Methodology, Software, Investigation, Formal analysis, Writing - original draft, Writing - review & editing, Data curation. Peter Sinčák: Conceptualization, Supervision, Funding acquisition, Project administration. Pitoyo Hartono: Conceptualization, Methodology, Supervision, Writing - original draft. Pavel Kočan: Conceptualization, Investigation, Resources. Zuzana Benetinová: Investigation, Resources, Data curation. Alžbeta Blichárová: Investigation, Resources. Ľudmila Verbóová: Investigation, Resources, Writing - review & editing. Erika Štammová: Investigation, Resources, Validation. Antónia Sabolová-Fabianová: Formal analysis, Methodology, Visualization, Data curation. Anna Jašková: Formal analysis, Methodology, Visualization.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This research is supported by the AI4EU project, funded by the European Union's Horizon 2020 research and innovation programme under grant agreement 825619 (2019–2021), by the Marie Skłodowska-Curie RISE LIFEBOTS Exchange project (Grant Agreement ID: 824047, 2019–2021) and by the EU FlagEra joint project Robocom++ (2017–2021).
References

[1] J. Xie, R. Liu, I. Luttrell, C. Zhang, et al., Deep learning based analysis of histopathological images of breast cancer, Front. Genet. 10 (2019) 80.
[2] J.N. Kather, C.-A. Weis, F. Bianconi, S.M. Melchers, L.R. Schad, T. Gaiser, A. Marx, F.G. Zöllner, Multi-class texture analysis in colorectal cancer histology, Sci. Rep. 6 (2016) 27988.
[3] B.E. Bejnordi, G. Zuidhof, M. Balkenhol, M. Hermsen, P. Bult, B. van Ginneken, N. Karssemeijer, G. Litjens, J. van der Laak, Context-aware stacked convolutional neural networks for classification of breast carcinomas in whole-slide histopathology images, J. Med. Imaging 4 (4) (2017) 044504.
[4] J. de Matos, A.d.S. Britto Jr., L.E. Oliveira, A.L. Koerich, Histopathologic image processing: A review, 2019, arXiv preprint arXiv:1904.07900.
[5] D. Komura, S. Ishikawa, Machine learning methods for histopathological image analysis, Comput. Struct. Biotechnol. J. 16 (2018) 34–42.
[6] T. Araújo, G. Aresta, E. Castro, J. Rouco, P. Aguiar, C. Eloy, A. Polónia, A. Campilho, Classification of breast cancer histology images using convolutional neural networks, PLoS One 12 (6) (2017) e0177544.
[7] M. Hägele, P. Seegerer, S. Lapuschkin, M. Bockmayr, W. Samek, F. Klauschen, K.-R. Müller, A. Binder, Resolving challenges in deep learning-based analyses of histopathological images using explanation methods, 2019, arXiv preprint arXiv:1908.06943.
[8] A. Holzinger, C. Biemann, C.S. Pattichis, D.B. Kell, What do we need to build explainable AI systems for the medical domain? 2017, arXiv preprint arXiv:1712.09923.
[9] A. Holzinger, G. Langs, H. Denk, K. Zatloukal, H. Mueller, Causability and explainability of AI in medicine, Data Min. Knowl. Discov. 10 (2019).
[10] A. Holzinger, M. Plass, M. Kickmeier-Rust, K. Holzinger, G.C. Crişan, C.-M. Pintea, V. Palade, Interactive machine learning: experimental evidence for the human in the algorithmic loop, Appl. Intell. 49 (7) (2019) 2401–2414.
[11] G.R. Vásquez-Morales, S.M. Martínez-Monterrubio, P. Moreno-Ger, J.A. Recio-García, Explainable prediction of chronic renal disease in the Colombian population using neural networks and case-based reasoning, IEEE Access 7 (2019) 152900–152910.
[12] J. Mullenbach, S. Wiegreffe, J. Duke, J. Sun, J. Eisenstein, Explainable prediction of medical codes from clinical text, 2018, arXiv preprint arXiv:1802.05695.
[13] S.M. Lundberg, B. Nair, M.S. Vavilala, M. Horibe, M.J. Eisses, T. Adams, D.E. Liston, D.K.-W. Low, S.-F. Newman, J. Kim, et al., Explainable machine-learning predictions for the prevention of hypoxaemia during surgery, Nat. Biomed. Eng. 2 (10) (2018) 749.
[14] A. Malhi, T. Kampik, H. Pannu, M. Madhikermi, K. Främling, Explaining machine learning-based classifications of in-vivo gastral images, in: 2019 Digital Image Computing: Techniques and Applications, DICTA, IEEE, 2019, pp. 1–7.
[15] P. Hartono, A transparent cancer classifier, Health Inf. J. 26 (1) (2020) 190–204, http://dx.doi.org/10.1177/1460458218817800, PMID: 30596318.
[16] Ł. Rączkowski, M. Możejko, J. Zambonelli, E. Szczurek, ARA: accurate, reliable and active histopathological image classification framework with Bayesian deep learning, bioRxiv (2019) 658138.
[17] P. Sabol, P. Sinčák, K. Ogawa, P. Hartono, Explainable classifier supporting decision-making for breast cancer diagnosis from histopathological images, in: 2019 International Joint Conference on Neural Networks, IJCNN, IEEE, 2019, pp. 1–8.
[18] P. Sabol, P. Sinčák, J. Buša, P. Hartono, Cumulative fuzzy class membership criterion decision-based classifier, in: 2017 IEEE International Conference on Systems, Man, and Cybernetics, SMC, 2017, pp. 334–339, http://dx.doi.org/10.1109/SMC.2017.8122625.
[19] P. Sabol, P. Sinčák, J. Magyar, P. Hartono, Semantically explainable fuzzy classifier, Int. J. Pattern Recognit. Artif. Intell. 33 (12) (2019) 2051006, http://dx.doi.org/10.1142/S0218001420510064.
[20] P.W. Koh, P. Liang, Understanding black-box predictions via influence functions, in: Proceedings of the 34th International Conference on Machine Learning - Volume 70, JMLR.org, 2017, pp. 1885–1894.
[21] J. Zhou, H. Hu, Z. Li, K. Yu, F. Chen, Physiological indicators for user trust in machine learning with influence enhanced fact-checking, in: International Cross-Domain Conference for Machine Learning and Knowledge Extraction, Springer, 2019, pp. 94–113.
[22] J.N. Kather, J. Krisam, P. Charoentong, T. Luedde, E. Herpel, C.-A. Weis, T. Gaiser, A. Marx, N.A. Valous, D. Ferber, et al., Predicting survival from colorectal cancer histology slides using deep learning: A retrospective multicenter study, PLoS Med. 16 (1) (2019) e1002730.
[23] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, 2014, arXiv preprint arXiv:1409.1556.
[24] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, Imagenet: A large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009, pp. 248–255.
[25] S. Mohseni, N. Zarei, E.D. Ragan, A survey of evaluation methods and measures for interpretable machine learning, 2018, arXiv preprint arXiv:1811.11839.
[26] R. Tibshirani, G. Walther, T. Hastie, Estimating the number of clusters in a data set via the gap statistic, J. R. Stat. Soc. Ser. B Stat. Methodol. 63 (2) (2001) 411–423.
[27] P.J. Rousseeuw, Silhouettes: a graphical aid to the interpretation and validation of cluster analysis, J. Comput. Appl. Math. 20 (1987) 53–65.
[28] D.L. Davies, D.W. Bouldin, A cluster separation measure, IEEE Trans. Pattern Anal. Mach. Intell. (2) (1979) 224–227.
[29] J.H. Holland, Adaptation in Natural and Artificial Systems: an Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence, MIT Press, 1992.
[30] A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[31] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the inception architecture for computer vision, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
[32] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[33] F. Chollet, Xception: Deep learning with depthwise separable convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1251–1258.
[34] G. Huang, Z. Liu, L. Van Der Maaten, K.Q. Weinberger, Densely connected convolutional networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
[35] C. Szegedy, S. Ioffe, V. Vanhoucke, A.A. Alemi, Inception-v4, inception-resnet and the impact of residual connections on learning, in: Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[36] M. Tan, Q.V. Le, Efficientnet: Rethinking model scaling for convolutional neural networks, 2019, arXiv preprint arXiv:1905.11946.
[37] D.P. Kingma, J. Ba, Adam: A method for stochastic optimization, 2014, arXiv preprint arXiv:1412.6980.