Skin Cancer Classification
https://doi.org/10.1007/s11042-020-09388-2
Abstract
Skin cancer accounts for one-third of all diagnosed cancers worldwide, and its prevalence has been rising over the past decades. In recent years, the use of dermoscopy has enhanced the diagnostic capability for skin cancer. Accurate diagnosis of skin cancer is challenging for dermatologists, as multiple skin cancer types may appear similar; dermatologists achieve an average accuracy of 62% to 80% in skin cancer diagnosis. The research community has made significant progress in developing automated tools to assist dermatologists in decision making. In this work, we propose an automated computer-aided diagnosis system for multi-class skin (MCS) cancer classification with exceptionally high accuracy. The proposed method outperformed both expert dermatologists and contemporary deep learning methods for MCS cancer classification. We performed fine-tuning over the seven classes of the HAM10000 dataset and conducted a comparative study to analyse the performance of five pre-trained convolutional neural networks (CNNs) and four ensemble models. A maximum accuracy of 93.20% for an individual model and a maximum accuracy of 92.83% for an ensemble model are reported in this paper. We propose the use of ResNeXt101 for MCS cancer classification owing to its optimized architecture and ability to attain higher accuracy.
* Jitendra V. Tembhurne
  jitendra.tembhurne@cse.iiitn.ac.in
    Saket S. Chaturvedi
    saketschaturvedi@gmail.com
    Tausif Diwan
    tausif.diwan@cse.iiitn.ac.in
1 Introduction
The epidermis is the superficial layer of skin and consists mainly of three cell types: squamous cells, basal cells, and melanocytes, as shown in Fig. 1. Squamous cells form the outermost layer of the epidermis, while basal cells form its lowermost layer. Melanocytes protect the deeper layers of skin from sun exposure by producing melanin, a brown pigment [10]. When these cells experience excessive ultraviolet light exposure, the induced DNA mutations affect the growth of skin cells and eventually develop into skin cancer [38, 57]. Squamous cell carcinoma, basal cell carcinoma, and melanoma are the principal categories of skin cancer, usually associated with squamous cells, basal cells, and melanocytes, respectively.
The World Health Organization estimates that skin cancer constitutes one-third of all diagnosed cancer cases globally [76]. Skin cancer is a global public health issue, with approximately 5.4 million newly identified cases in the United States each year [63]. Melanomas are responsible for approximately three-fourths of all skin cancer-related deaths, amounting to over 10,000 deaths each year in the United States alone. In Europe, over 100,000 new melanoma cases are diagnosed annually [3], and Australia records nearly 15,229 new melanoma cases annually [8]. Moreover, the past decades have recorded a rise in skin cancer incidence rates: melanoma incidence has risen by 119% since the 1990s in the United Kingdom, and by 250% in the United States (from 27,600 cases in 1990 to 96,480 in 2019) [67, 68]. This trend is explained not only by the depletion of the ozone layer but also by the use of solariums and tanning beds [81].
Skin cancer is traditionally diagnosed by physical examination and biopsy. Although biopsy is one of the simplest methods to diagnose skin cancer, the process is arduous and unreliable. In recent years, the most popular non-invasive instruments that can assist dermatologists in skin cancer diagnosis are macroscopic and dermoscopic images [25]. Macroscopic images usually suffer from low quality and resolution, as they are acquired with cameras or mobile phones [55]. Dermoscopy images are high-resolution skin images derived from the visualization of deeper skin structures, which enhances the diagnostic capability for skin cancer [73].
The accurate diagnosis of skin cancer is challenging for dermatologists even with dermoscopy images, as multiple skin cancer types may appear similar in their initial appearance. Moreover, even expert dermatologists are limited by their studies and experience, since they are only exposed to a subset of all possible appearances of skin cancer during their lifetime. Dermatologists have an average accuracy of 62% to 80% in skin cancer diagnosis [37, 50]. Reports on the diagnostic accuracy of clinical dermatologists claim 62% accuracy for dermatologists with 3 to 5 years of experience, whereas dermatologists with more than ten years of experience achieve 80% accuracy; the performance drops further for less experienced dermatologists [50]. Also, dermoscopy in the hands of inexperienced dermatologists may reduce the accuracy of skin cancer identification [5, 37, 59].
Fig. 1 Skin layers - epidermis, dermis, and subcutis (epidermis sub-layers: squamous cells, basal cells, and melanocytes)
The major drawback of dermoscopy is the requirement of extensive training. The research community has made significant progress in developing computer-aided diagnosis tools to overcome the issues faced by dermatologists [39, 55, 58]. Computer-aided diagnosis improves as more data surfaces: retraining the system with new data is trivial, and the underlying model can be extended to integrate a plethora of other medical information into its prediction pipeline. The excellent performance of deep convolutional neural networks (DCNNs) in image classification has motivated their use in the medical domain, including skin cancer classification [42].
The studies [27, 31, 48, 51, 80] fail to extend their analysis to multiple classes of skin cancer. Additionally, previous investigations are limited to a small number of pre-trained networks [29, 40, 66, 69] or use only particular layers of a network for classification. In this paper, an automated computer-aided diagnostic system for MCS cancer classification with exceptionally high accuracy is proposed. The proposed method outperformed both expert dermatologists and previously proposed deep learning methods for MCS cancer classification. We conducted a comparative study analysing the performance of five pre-trained convolutional neural networks and four ensemble models to determine the best method for skin cancer classification on the HAM10000 dataset. We performed extensive experiments to determine the best set-up of hyper-parameters for five models pre-trained on ImageNet [17], namely Xception [15], InceptionV3 [71], InceptionResNetV2 [70], NASNetLarge [83], and ResNeXt101 [77], and their ensembles InceptionV3 + Xception, InceptionResNetV2 + Xception, InceptionResNetV2 + ResNeXt101, and InceptionResNetV2 + ResNeXt101 + Xception. These models are further fine-tuned on the HAM10000 dataset [72] using transfer learning [56] to learn domain-specific features of skin cancers. We preferred not to perform extensive pre-processing in this work; we also did not employ hand-crafted feature engineering or lesion segmentation, to keep the approach generic and reliable.
The paper is structured as follows. The literature review is covered in Section 2. Section 3 discusses the methods used in this research, including the dataset, pre-processing, classification models, fine-tuning, feature extraction, and performance metrics. The results and discussion are presented in Section 4, and the conclusion in Section 5.
2 Related works
In the early 1990s, computer-aided diagnosis systems were introduced to overcome the challenges faced by dermatologists in skin cancer classification [75]. The initial efforts using dermoscopy images were restricted to the classification of benign versus melanoma skin lesions [61]. Since then, numerous methods have been published to address this challenging task. Several studies [1, 33, 51, 61] follow the commonly used manual evaluation methods based on the ABCD rules proposed by Nachbar et al. [53]. Moreover, traditional machine learning classifiers such as Support Vector Machines [12], the Naive Bayes classifier [43], K-Nearest Neighbours [4], Logistic Regression [7], Decision Trees [11], and Artificial Neural Networks [30] were also explored for skin cancer classification in search of a more accurate and reliable method. Due to high intra-class and low inter-class variations in melanoma, the diagnostic performance of handcrafted features was found to be unsatisfactory [79].
Convolutional neural networks brought a key breakthrough to these problems and quickly became the preferred choice for skin cancer classification [16]. CNNs not only provide high classification accuracy but also alleviate the machine learning expert's burden of "feature engineering" by automatically discovering high-level abstractions from the datasets [41]. As CNNs need a large dataset to learn the problem [47], the current literature [18, 35, 54] mostly employs transfer learning, a technique where a model trained for a given source task is partially "recycled" for a new target task, to mitigate the need for large training datasets.
An early melanoma detection method classifying dermoscopy skin cancer images as malignant or benign was the particular focus of [42]. The proposed solution uses transfer learning with the VGGNet convolutional neural network and achieved an accuracy of 81.3%, precision of 79.74%, and recall of 78.66% evaluated on the ISIC archive dataset. However, this method was restricted to binary classification of skin cancer. Harangi et al. [27] analysed how an ensemble of deep CNNs can enhance the accuracy of individual models for three-class skin cancer classification. Accuracies of 84.2%, 84.8%, 82.8%, and 81.3% were achieved for the GoogleNet, AlexNet, ResNet, and VGGNet models, respectively, with recalls of 59.2%, 51.8%, 52.0%, and 43.4% for the individual models. Further, the best accuracy of 83.8% was achieved with the ensemble of the GoogleNet, AlexNet, and VGGNet models.
Kawahara et al. [34] demonstrated a linear classifier over features extracted from a CNN pre-trained on a dataset of 1300 natural images, which can distinguish up to ten skin lesion classes with high accuracy. The proposed method requires neither lesion segmentation nor complex preprocessing. This approach achieved accuracies of 85.8% and 81.9% over 5 classes and 10 classes, respectively. However, the number of images used for training was insufficient to extract useful features from the dataset. The authors of [35] proposed a novel CNN architecture composed of multiple tracts for skin lesion classification. They converted a CNN pre-trained on a single resolution to work with multi-resolution input. Moreover, the entire network was fine-tuned over a public lesion dataset to achieve an accuracy of 79.15% for ten classes.
Esteva et al. [18] utilized the InceptionV3 architecture pre-trained on ImageNet [17] for fine-tuning on a dataset of 129,450 clinical images, including 3374 dermoscopic images. The authors showed that a deep neural network-based method, given a large dataset, can outperform clinical experts in the classification accuracy of dermoscopy images.
Nyíri and Kiss [54] investigated multiple novel techniques for ensembling deep neural networks with different hyper-parameters and differently pre-processed data for skin lesion classification. Ensembling can be surprisingly useful not only for combining different machine learning models but also for combining different hyper-parameter choices for these models. An accuracy of 90.1% was achieved for the Xception model evaluated on the seven classes of the ISIC2017 and ISIC2018 datasets, whereas an accuracy of 80.1% for seven classes using the VGG16 model is reported in [46]. A deep neural network-based framework [65] that follows an ensemble approach combining the ResNet-50 and InceptionV3 architectures is presented to classify the seven types of skin cancers; an accuracy of 89.9% is reported.
3 Methods
We propose a generalized architecture for the multi-class classification of skin cancer, as represented in Fig. 2. Initially, preprocessing is conducted on the dermoscopic skin cancer images to reconcile each image with the input dimensions of the architectures used in this work. The processed images are then fed to the architecture for feature extraction and fine-tuning. Finally, the extracted features are combined and the image is classified among the seven classes of skin cancer, i.e. Melanocytic Nevi, Melanoma, Benign Keratosis, Actinic Keratosis, Vascular Lesions, Dermatofibroma, and Basal Cell Carcinoma. This method is explored for five different architectures, namely InceptionV3, ResNeXt101, InceptionResNetV2, Xception, and NASNetLarge, and their four ensembles, i.e. InceptionV3 + Xception, InceptionResNetV2 + Xception, InceptionResNetV2 + ResNeXt101, and InceptionResNetV2 + ResNeXt101 + Xception. The architectures of the pre-trained models are shown in Fig. 4.
3.1 Dataset
This research focuses on dermoscopy images of skin cancer owing to the high impact of dermoscopy worldwide [6, 44, 45]. We utilized the HAM10000 dataset [72], a large collection of multi-source dermoscopy images of common pigmented skin lesions. The dataset contains 10,015 dermoscopy images of seven skin cancer types: Melanocytic nevi (6705 images), Melanoma (1113 images), Benign keratosis (1099 images), Basal cell carcinoma (514 images), Actinic keratosis (327 images), Vascular lesions (142 images), and Dermatofibroma (115 images). Sample images of the skin cancer types from HAM10000 are shown in Fig. 3.
The dataset of 10,015 images was split into a training set (8912 images) and a validation set (1103 images). The validation data contains only unique cases of the dataset; i.e. cases where multiple images are associated with the same lesion id were eliminated from the validation set, so that the training and validation sets contain different sets of images for an unbiased evaluation of the models' performance.
Fig. 3 Sample skin cancer images from HAM10000 dataset (a) Actinic keratosis (b) Basal cell carcinoma (c) Benign keratosis-like lesions (d) Dermatofibroma (e) Melanocytic nevi (f) Melanoma (g) Vascular lesions
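As a minimal sketch of such a lesion-aware split (assuming the standard HAM10000 metadata file with image_id and lesion_id columns; this is an illustration, not the authors' exact code):

import pandas as pd

# Standard HAM10000 metadata release; file name and columns assumed.
df = pd.read_csv("HAM10000_metadata.csv")

# A lesion appearing exactly once has no duplicate images, so only
# such "unique" cases are eligible for the validation set.
counts = df["lesion_id"].value_counts()
unique_cases = df[df["lesion_id"].map(counts) == 1]

# Draw the 1103 validation images from the unique cases; the
# remaining 8912 images form the training set.
val_df = unique_cases.sample(n=1103, random_state=42)
train_df = df.drop(val_df.index)

# No lesion id is shared between training and validation.
assert not set(val_df["lesion_id"]) & set(train_df["lesion_id"])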
3.2 Preprocessing
To ensure better generalization, the pre-processing steps were kept minimal for the proposed method. We performed a basic pre-processing step using the built-in pre-processing function of the Keras ImageDataGenerator. As the dermoscopy images in the dataset have a resolution of 450 × 600 pixels, we downscaled the images to 299 × 299 pixels or 331 × 331 pixels to match the input dimensions of the models: Xception, InceptionV3, InceptionResNetV2, ResNeXt101, and NASNetLarge.
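A minimal sketch of this step with the Keras ImageDataGenerator, continuing the hypothetical train_df from the split above (the image_path column, batch size, and use of the model-specific preprocess_input are assumptions):

from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.applications.inception_v3 import preprocess_input

# Only the built-in Keras pre-processing function is applied.
datagen = ImageDataGenerator(preprocessing_function=preprocess_input)

# Downscale the 450 x 600 dermoscopy images to 299 x 299
# (use 331 x 331 instead for NASNetLarge).
train_gen = datagen.flow_from_dataframe(
    train_df,
    x_col="image_path",      # assumed column with image file paths
    y_col="dx",              # diagnosis label column of HAM10000
    target_size=(299, 299),
    class_mode="categorical",
    batch_size=32)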
3.3 Classification models and fine-tuning
Recent years have seen the development of increasingly advanced convolutional neural networks for computer vision problems. A CNN typically consists of convolutional layers, subsampling layers (max pooling or average pooling), and optionally fully connected layers. The output of a convolution layer is given by Eq. 1:
$$A_j^l = f\left(\sum_{i=1}^{M^{l-1}} A_i^{l-1} * \omega_{ij}^{l} + b_{ij}\right) \qquad (1)$$

where $M^{l-1}$ is the number of feature maps in layer $(l-1)$, $A_j^l$ is the activation output of feature map $j$ at layer $l$, $\omega_{ij}^{l}$ denotes the kernel weights connecting feature map $i$ at layer $(l-1)$ to feature map $j$ at layer $l$, and $b_{ij}$ is the additional bias parameter.
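As a toy illustration of Eq. 1 (not the paper's code), the activation of one output feature map j can be computed from the feature maps of the previous layer:

import numpy as np
from scipy.signal import correlate2d

def conv_output(prev_maps, kernels, bias):
    """Eq. 1 with f = ReLU: A_j^l = f(sum_i A_i^{l-1} * w_ij^l + b_ij)."""
    summed = sum(correlate2d(a, w, mode="valid")
                 for a, w in zip(prev_maps, kernels))
    return np.maximum(summed + bias, 0.0)

# Two 5x5 feature maps from layer (l-1), one 3x3 kernel per map.
prev_maps = [np.random.rand(5, 5) for _ in range(2)]
kernels = [np.random.rand(3, 3) for _ in range(2)]
A_j = conv_output(prev_maps, kernels, bias=0.1)  # 3x3 activation map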
We employ stochastic gradient descent with momentum (SGDM) [6, 52] and adaptive moment estimation (Adam) [36] as optimizers of the loss function when fine-tuning the models in this work. In each iteration, the SGDM optimizer updates the weights and biases of the network to minimize the loss function; the momentum term averts oscillations along the steepest descent path. The SGDM update is given by Eq. 2:

$$\theta_{i+1} = \theta_i - \alpha \nabla E_R(\theta_i) + \gamma(\theta_i - \theta_{i-1}) \qquad (2)$$

where $\theta$ is the network's parameter vector, $i$ is the iteration number, and $\alpha$ is the learning rate, which we set to 0.0001 or 0.001 depending on the network. $E_R$ denotes the loss function, and $\gamma$ is the momentum term, set to 0.9. We use the categorical cross-entropy loss during the optimisation process, given by Eq. 3:

$$E_R(\theta) = -\ln\left(\frac{e^{\theta_p}}{\sum_{j=1}^{C} e^{\theta_j}}\right) \qquad (3)$$

where $\theta_p$ is the CNN score for the positive class, $j$ is the class iterator, and $C$ is the number of classes. The minimization of the loss function using Adam is given by Eq. 4:

$$\theta_{i+1} = \theta_i - \frac{\alpha \nabla E(\theta_i)}{\sqrt{v_i} + \epsilon} \qquad (4)$$

where $v_i$ is an exponential moving average of the squared gradients, $\beta_2$ is its decay rate, set to 0.999, and $\epsilon$ is a very small number that prevents a zero denominator, set to 0.001.
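In Keras, the optimizer settings above translate roughly as follows (a sketch assuming the TensorFlow backend; `model` stands for any of the fine-tuned networks described in the following subsections):

from tensorflow.keras.optimizers import SGD, Adam

# Eq. 2: SGD with momentum gamma = 0.9 and learning rate alpha
# of 0.0001 (0.001 for some networks).
sgdm = SGD(learning_rate=0.0001, momentum=0.9)

# Eq. 4: Adam with decay rate beta_2 = 0.999 and epsilon = 0.001.
adam = Adam(learning_rate=0.0001, beta_2=0.999, epsilon=0.001)

# Either optimizer minimizes the categorical cross-entropy of Eq. 3.
model.compile(optimizer=sgdm,
              loss="categorical_crossentropy",
              metrics=["categorical_accuracy"])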
To enhance the performance of the deep learning architectures in skin cancer classification, we modified the architectures, i.e. Xception, InceptionV3, InceptionResNetV2, ResNeXt101, and NASNetLarge. The customizations include: 1) dense layers with ReLU activation, 2) dropout and softmax layers at the bottom of the architecture, and 3) improved parameter values. All these customizations are applied to the architectures to improve their performance for skin cancer classification. Further, we fine-tuned the five different CNNs and four ensemble models using SGD and Adam to validate the impact of ensemble methods over the seven classes of the HAM10000 dataset for MCS cancer classification. The architectures of the various models used in this work are represented in Fig. 4.
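A hedged sketch of these customizations, using InceptionV3 from tf.keras.applications for illustration (the dense-layer width and dropout rate are not specified in the text, so the values below are assumptions):

from tensorflow.keras import Model, layers
from tensorflow.keras.applications import InceptionV3

# Pre-trained ImageNet backbone without its original classifier head.
base = InceptionV3(weights="imagenet", include_top=False,
                   input_shape=(299, 299, 3), pooling="avg")

# Customizations 1) and 2): dense layer with ReLU activation,
# then dropout and a seven-way softmax output layer.
x = layers.Dense(256, activation="relu")(base.output)  # width assumed
x = layers.Dropout(0.5)(x)                             # rate assumed
outputs = layers.Dense(7, activation="softmax")(x)

model = Model(inputs=base.input, outputs=outputs)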
3.3.1 InceptionV3
3.3.2 ResNeXt101
ResNet introduced the idea of residual connections as a solution to the problems of accuracy saturation and degradation when increasing the network depth. ResNet comes in different variants such as ResNet-50, ResNet-101, and ResNet-152. The residual learning framework used in the ResNeXt101 architecture eases the training of deeper networks by reformulating the layers to learn residual functions with reference to the layer inputs. This makes the ResNeXt101 model easier to converge, and the model can gain accuracy from considerably increased depth. We included a dense layer with ReLU activation and dropout and softmax layers with seven outputs as modifications to ResNeXt101 to improve its performance. The modified ResNeXt101 is then fine-tuned on 8912 images (for 30 epochs) with a learning rate of 0.0001 and the SGD optimizer with momentum 0.9.
3.3.3 InceptionResNetV2
Fig. 4 Architecture of (I) InceptionV3, (II) ResNeXt101, (III) InceptionResNetV2, (IV) Xception, and (V) NASNetLarge
Residual connections allow training much deeper neural networks, which leads to even better performance. The study [70] showed that residual connections significantly accelerate the training of Inception networks. Hence, we included InceptionResNetV2 as a model after performing certain modifications, i.e. adding a dense layer with ReLU activation and dropout and softmax layers with seven outputs. The modified architecture is then fine-tuned on 8912 images for 30 epochs with a learning rate of 0.0001 and the SGD optimizer with momentum 0.9.
3.3.4 Xception
3.3.5 NASNetLarge
NASNet architectures introduce the concept of normal cells and reduction cells, which can be tuned using a reinforcement learning search method. The NASNetLarge architecture is specifically designed to train over very large datasets. As training over a large dataset is expensive, the search for an architectural building block is conducted on a small dataset, and the block is then transferred to a larger dataset using the NASNet search space. A key aspect of NASNetLarge is the ScheduledDropPath regularization technique, which significantly improves generalization in NASNet models. We modified the pre-trained NASNetLarge architecture by adding a dense layer with ReLU activation and dropout and softmax layers with seven outputs. Finally, this modified architecture is fine-tuned over 8912 images for 25 epochs with a learning rate of 0.0001 and the SGD optimizer with momentum 0.9.
3.4 Feature extraction
In this work, we used an Integrated Feature Extractor for feature extraction with the five different pre-trained models. This approach is effective and saves significant time in developing and training a deep convolutional neural network model. Detailed descriptions of the five pre-trained models are given in Section 3.3. Figure 5 highlights the layer-wise processing in the Xception model from an input image to the output, for the identification of the type of skin cancer. Different features are extracted at different layers: layer by layer, deeper features are extracted to accumulate more features for accurate prediction. The other models extract features to identify the type of skin cancer in a similar fashion.
The Integrated Feature Extractor uses the concept of transfer learning for effective feature extraction, where a model trained on a particular problem is used on a different problem after fine-tuning. We found it safe to use pre-trained networks for this work, as the convolutional layers closer to the input learn low-level features such as lines and borders, which transfer efficiently to another problem. We integrated the pre-trained model output with a few sets of layers at the end. The weights of the pre-trained models were used as the starting point for the training process and fine-tuned to our problem; however, the weights of the pre-trained backbone were frozen during training so that they are not modified while the new model is trained.
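Continuing the sketch from Section 3.3, the freezing step described above amounts to the following (val_gen being a validation generator built analogously to train_gen):

# Freeze the pre-trained backbone so its ImageNet weights are not
# modified while the newly added head is trained.
for layer in base.layers:
    layer.trainable = False

model.compile(optimizer=sgdm,
              loss="categorical_crossentropy",
              metrics=["categorical_accuracy"])

# e.g. 30 epochs on the 8912 training images, as in Section 3.3.
model.fit(train_gen, validation_data=val_gen, epochs=30)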
Each hidden layer of a convolutional neural network maps its input to an internal representation that captures a higher level of abstraction. These learned features become increasingly informative as they pass through the layers of the network. Ultimately, the individual features of each layer are stored in an image for the classification task. The feature extraction process at different layers of the network is illustrated in Fig. 5, produced using the modified Xception model. In [42, 55], feature extraction is done by simply training the images using pre-trained networks and taking the output of the fully connected (FC) layers. However, we hypothesize that fine-tuning pre-trained networks on the relevant dataset yields higher-quality features, which can boost the performance of the pre-trained models.
3.5 Performance metrics

The performance of the models is evaluated in terms of accuracy, precision, recall, and F1-score, where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (6)$$

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (7)$$

$$\text{Recall} = \frac{TP}{TP + FN} \qquad (8)$$

$$\text{F1-score} = 2\left(\frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}\right) \qquad (9)$$
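These weighted-average metrics can be computed, for instance, with scikit-learn (a sketch; y_true and y_pred denote the validation labels and the argmax of the model's softmax outputs, both assumed to be available):

from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

acc = accuracy_score(y_true, y_pred)                        # Eq. 6
prec = precision_score(y_true, y_pred, average="weighted")  # Eq. 7
rec = recall_score(y_true, y_pred, average="weighted")      # Eq. 8
f1 = f1_score(y_true, y_pred, average="weighted")           # Eq. 9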
4 Results and discussion

The results are derived from the validation data, which consists of 1103 images across the seven classes of skin cancer from the HAM10000 dataset. We used the Keras library [14] for implementing the deep models in this research work; Keras can run on top of other deep learning libraries such as TensorFlow or Theano. The models were trained on the Kaggle [32] server with 13 GB RAM, using a Tesla P100-PCIE-16GB and 6 minor GPUs.
We evaluated the performance of five different models, viz. InceptionV3, ResNeXt101, InceptionResNetV2, Xception, and NASNetLarge, for the classification of skin cancer among seven classes: Melanocytic nevi, Melanoma, Benign keratosis, Basal cell carcinoma, Actinic keratosis, Vascular lesions, and Dermatofibroma. The categorical accuracies for InceptionV3, ResNeXt101, InceptionResNetV2, Xception, and NASNetLarge were found to be 91.56%, 93.20%, 93.20%, 91.47%, and 91.11%, respectively. The best accuracy is recorded for ResNeXt101 and InceptionResNetV2.
The training-validation accuracy and loss curves are shown in Fig. 6 for each of the five models. In the initial epochs of training, the validation accuracy is higher than the training accuracy (equivalently, the validation loss is lower than the training loss); this can be justified in several ways. Firstly, we used Dropout layers in the architecture during fine-tuning to make the system less prone to over-fitting; these layers disable neurons during training to reduce the complexity of the model. In Keras, dropout layers are disabled during testing, giving the network its full computational power for prediction, which can lead to higher validation accuracy than training accuracy for a few epochs [20]. Secondly, the training loss is the average of the losses over each batch of training data. As the model is evolving with time, the loss over the first batches of an epoch is generally higher than over the last batches. The validation loss, by contrast, is computed at the end of an epoch with the fully updated model, resulting in a lower loss. This can contribute to a lower validation loss than training loss.
The weighted averages of recall, precision, and F1-score were also evaluated to assess the performance of the models with respect to the number of images in each class of the validation data. The weighted averages of recall, precision, and F1-score for InceptionV3 are 89%, 89%, and 89%, respectively; the corresponding values for the ResNeXt101, InceptionResNetV2, Xception, and NASNetLarge models were evaluated similarly. The accuracy and the weighted averages of recall, precision, and F1-score for the five models used in this paper are summarized in Table 1.
We also experimented with four ensemble models: InceptionV3 + Xception, InceptionResNetV2 + Xception, InceptionResNetV2 + ResNeXt101, and InceptionResNetV2 + ResNeXt101 + Xception. The outputs of the individual models were averaged to form the ensemble models. The accuracy and the weighted averages of precision, recall, and F1-score for the four ensemble models are shown in Table 2, and their training-validation accuracy and loss curves are represented in Fig. 7.
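A minimal sketch of this output averaging for a two-model ensemble (the model names and X_val, a batch of preprocessed validation images, are assumptions):

import numpy as np

# Average the seven-way softmax outputs of the individual models,
# then take the argmax as the ensemble prediction.
probs_a = model_inception.predict(X_val)  # shape (N, 7)
probs_b = model_xception.predict(X_val)   # shape (N, 7)

ensemble_probs = (probs_a + probs_b) / 2.0
ensemble_pred = np.argmax(ensemble_probs, axis=1)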
The categorical accuracy was found to be 91.56% for the ensemble 'InceptionV3 + Xception', 88.66% for 'InceptionResNetV2 + Xception', 92.83% for 'InceptionResNetV2 + ResNeXt101', and 89.66% for 'InceptionResNetV2 + ResNeXt101 + Xception'. The best results were achieved for the 'InceptionResNetV2 + ResNeXt101' and 'InceptionV3 + Xception' ensembles.
We observed a trend in the literature that as the number of classification classes increases, the performance of a model deteriorates: since the model must predict among more classes, there is a greater probability of an incorrect prediction.
Fig. 6 Training-validation accuracy and loss curves for (a) InceptionV3 (b) ResNeXt101 (c) InceptionResNetV2 (d) Xception (e) NASNetLarge
Table 1 Accuracy and weighted averages of precision, recall, and F1-score on the HAM10000 dataset for individual models

Model               Accuracy (%)   Precision (%)   Recall (%)   F1-score (%)
InceptionV3         91.56          89              89           89
ResNeXt101          93.20          88              88           88
InceptionResNetV2   93.20          87              88           88
Xception            91.47          89              88           88
NASNetLarge         91.11          86              86           86
Thus, model performance tends to decrease as the number of classification classes grows. The previous works [27, 42], [18, 35, 46, 54, 65], and [49] lag in performance compared to the work proposed in this paper.
Although [18, 27, 42] perform classification on only two or three classes, their classification accuracy still varies from 69.4% to 84.8%. In [18, 70] and [46, 54, 65], classification is performed on more than five classes without achieving high accuracy, precision, and recall: the reported accuracy, precision, and recall vary between 48.9% and 90.1%, 78.6% and 84.9%, and 51.8% and 80.0%, respectively. Table 3 and Table 4 compare the proposed work with the existing models; 'N/A' indicates that the respective performance metric was not reported in the corresponding research work.
We outperformed both dermatologists and current deep learning methods in multi-class skin cancer classification with the seven architectures used in this work, i.e. InceptionV3, ResNeXt101, InceptionResNetV2, Xception, NASNetLarge, and the ensemble models 'InceptionV3 + Xception' and 'InceptionResNetV2 + ResNeXt101', while avoiding extensive pre-processing and data augmentation.
We observed that the ResNeXt101 model emerges as an optimized architecture that makes training easier and gains higher accuracy for skin cancer classification. As ResNeXt101 achieves the best results, we propose the use of ResNeXt101 for MCS cancer classification.
Table 2 Accuracy and weighted averages of precision, recall, and F1-score on the HAM10000 dataset for ensemble models

Ensemble model                              Accuracy (%)   Precision (%)   Recall (%)   F1-score (%)
InceptionV3 + Xception                      91.56          82              84           83
InceptionResNetV2 + Xception                88.66          80              82           81
InceptionResNetV2 + ResNeXt101              92.83          83              84           84
InceptionResNetV2 + ResNeXt101 + Xception   89.66          83              85           84
Fig. 7 Training-validation Accuracy and Loss Curves for (a) InceptionV3 + Xception (b) InceptionResnetV2 +
Xception (c) InceptionResnetV2 + ResNeXt101 (d) InceptionResnetV2 + ResNeXt101 + Xception
Additionally, we noted better results without using ensemble methods for skin cancer classification on the HAM10000 dataset, as listed in Table 1. Although ensembling is widely used to increase accuracy in classification tasks, it drastically increases the architectural complexity, leading to much longer training times for the model.
[Tables 3 and 4: comparison of the proposed work with existing models; columns: Ref., Method, Number of Classes, Accuracy (%), Precision (%), Recall (%)]
5 Conclusions
As the incidence rates of skin cancer have been rising over the past decades, there is an urgent need to address this global public health issue. The excellent performance of deep CNNs in medical image classification has motivated their use for skin cancer classification. Although various studies on skin cancer classification exist, they failed to extend to multiple classes of skin cancer with high performance. In this paper, we outperformed both dermatologists and current deep learning methods for MCS cancer classification. The performance was analyzed for five pre-trained CNNs and four ensemble models to determine the best method for skin cancer classification on the HAM10000 dataset. We performed extensive experiments to determine the best set-up of hyper-parameters for five models pre-trained on ImageNet, namely Xception, InceptionV3, InceptionResNetV2, NASNetLarge, and ResNeXt101, and their ensembles InceptionV3 + Xception, InceptionResNetV2 + Xception, InceptionResNetV2 + ResNeXt101, and InceptionResNetV2 + ResNeXt101 + Xception. ResNeXt101 shows a significant improvement in performance compared to previously proposed deep learning models; hence, we propose the use of ResNeXt101 for MCS cancer classification. Moreover, we conclude that deep learning models trained with the best set-up of hyper-parameters can perform better than even ensemble models. Although ensemble methods are widely used to increase accuracy in classification tasks, they not only drastically increase the architectural complexity of the model but may also play no significant role in improving the performance of deep learning models tuned with the best hyper-parameters.
Future work may develop more robust deep learning computer-aided systems for skin cancer diagnosis by including clinical images as additional inputs alongside dermoscopy images, extending the concept of salient object or feature detection [19, 21-24, 74, 82], which has been effectively utilized in the past for skin cancer diagnosis. The combination of the clinical and dermoscopy modalities can provide complementary visual features, enabling highly accurate and efficient computer-aided systems for skin cancer classification.
References
 1. Abbas Q, Emre Celebi M, Garcia IF, Ahmad W (2013) Melanoma recognition framework based on expert
    definition of ABCD for dermoscopic images. Skin Research And Technology 19(1):e93–e102. https://doi.
    org/10.1111/j.1600-0846.2012.00614.x
 2. Alom, MZ, Aspiras, T, Taha, TM, & Asari, VK (2020). Skin cancer segmentation and classification with
    improved deep convolutional neural network. In: Medical Imaging 2020: Imaging informatics for
    healthcare, research, and applications, vol. 11318, pp. 1131814. International Society for Optics and
    Photonics. doi: https://doi.org/10.1117/12.2550146.
 3. Australian Government (2018). Melanoma of the skin statistics. https://melanoma.canceraustralia.gov.
    au/statistics. Accessed 19 June 2019.
 4. Ballerini L, Fisher RB, Aldridge B, Rees J (2013) A color and texture based hierarchical K-NN approach to
    the classification of non-melanoma skin lesions. In: Color medical image analysis. Springer, Dordrecht, pp
    63–86
 5. Binder M, Schwarz M, Winkler A, Steiner A, Kaider A, Wolff K, Pehamberger H (1995) Epiluminescence
    microscopy: a useful tool for the diagnosis of pigmented skin lesions for formally trained dermatologists.
    Arch Dermatol 131(3):286–291
 6. Bishop CM (2006) Pattern recognition and machine learning. Springer
 7. Blum A, Luedtke H, Ellwanger U, Schwabe R, Rassner G, Garbe C (2004) Digital image analysis for
    diagnosis of cutaneous melanoma. Development of a highly effective computer algorithm based on analysis
    of 837 melanocytic lesions. Br J Dermatol 151(5):1029–1038. https://doi.org/10.1111/j.1365-
    2133.2004.06210.x
 8. Bray F, Ferlay J, Soerjomataram I, Siegel RL, Torre LA, Jemal A (2018) Global cancer statistics 2018:
    GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J
    Clin 68(6):394–424. https://doi.org/10.3322/caac.21492
 9. Burroni M, Corona R, Dell’Eva G, Sera F, Bono R, Puddu P, Rubegni P (2004) Melanoma computer-aided
    diagnosis: reliability and feasibility study. Clin Cancer Res 10(6):1881–1886. https://doi.org/10.1158/1078-
    0432.CCR-03-0039
10. Cancer Facts and Figures 2016 - American Cancer Society. https://www.cancer.org/research/cancer-facts-
    statistics/all-cancer-facts-figures/cancer-facts-figures-2016.html. Accessed 31 March 2019.
11. Celebi ME, Iyatomi H, Stoecker WV, Moss RH, Rabinovitz HS, Argenziano G, Soyer HP (2008)
    Automatic detection of blue-white veil and related structures in dermoscopy images. Comput Med
    Imaging Graph 32(8):670–677. https://doi.org/10.1016/j.compmedimag.2008.08.003
12. Celebi ME, Kingravi HA, Uddin B, Iyatomi H, Aslandogan YA, Stoecker WV, Moss RH (2007) A
    methodological approach to the classification of dermoscopy images. Comput Med Imaging Graph 31(6):
    362–373. https://doi.org/10.1016/j.compmedimag.2007.01.003
13. Chaturvedi, SS, Gupta, K, Prasad, P (2019). Skin lesion analyser: an efficient seven-way multi-class skin
    cancer classification using MobileNet. arXiv preprint arXiv:1907.03220.
14. Chollet, F. (2015). GitHub - keras-team/keras: Deep Learning for humans. https://github.com/keras-
    team/keras. Accessed 24 June 2019.
15. Chollet, F (2017). Xception: deep learning with depthwise separable convolutions. In: IEEE conference on
    computer vision and pattern recognition, pp. 1251–1258.
16. Codella N, Cai J, Abedini M, Garnavi R, Halpern A, Smith JR (2015, October) Deep learning, sparse
    coding, and SVM for melanoma recognition in dermoscopy images. In: International workshop on machine
    learning in medical imaging. Springer, Cham, pp 118–126
17. Deng, J, Dong, W, Socher, R, Li, LJ, Li, K, Fei-Fei, L (2009). Imagenet: a large-scale hierarchical image
    database. In: IEEE conference on computer vision and pattern recognition, pp. 248–255. doi: https://doi.
    org/10.1109/CVPRW.2009.5206848.
18. Esteva A, Kuprel B, Novoa RA, Ko J, Swetter SM, Blau HM, Thrun S (2017) Dermatologist-level
    classification of skin cancer with deep neural networks. Nature 542(7639):115–118. https://doi.
    org/10.1038/nature21056
19. Fan, DP, Cheng, MM, Liu, JJ, Gao, SH, Hou, Q, Borji, A (2018). Salient objects in clutter: bringing salient
    object detection to the foreground. In: proceedings of the European conference on computer vision (ECCV),
    pp. 186-202. Springer, Cham. Doi: https://doi.org/10.1007/978-3-030-01267-0_12.
20. FAQ - Keras Documentation (2019). https://keras.io/getting-started/faq/#why-is-the-training-loss-much-
    higher-than-the-testing-loss. Accessed 29 June 2019.
21. Fu, K, Fan, DP, Ji, GP, Zhao, Q (2020). JL-DCF: joint learning and densely-cooperative fusion framework
    for RGB-D salient object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
    Pattern Recognition, pp. 3052–3062.
22. Fu K, Zhao Q, Gu IYH, Yang J (2019) Deepside: a general deep framework for salient object detection.
    Neurocomputing 356:69–82
23. Ge Z, Demyanov S, Chakravorty R, Bowling A, Garnavi R (2017) Skin disease recognition using deep
    saliency features and multimodal learning of dermoscopy and clinical images. In: International conference
    on medical image computing and computer-assisted intervention. Springer, Cham, pp 250–258. https://doi.
    org/10.1007/978-3-319-66179-7
24. Gong, C, Tao, D, Liu, W, Maybank, SJ, Fang, M, Fu, K, Yang, J (2015). Saliency propagation from simple
    to difficult. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.
    2531–2539.
25. Goodson AG, Grossman D (2009) Strategies for early melanoma detection: approaches to the patient with
    nevi. J Am Acad Dermatol 60(5):719–735. https://doi.org/10.1016/j.jaad.2008.10.065
51. Moura N, Veras R, Aires K, Machado V, Silva R, Araújo F, Claro M (2019) ABCD rule and pre-trained
    CNNs for melanoma diagnosis. Multimed Tools Appl 78(6):6869–6888. https://doi.org/10.1007/s11042-
    018-6404-8
52. Murphy KP (2012) Machine learning: a probabilistic perspective. MIT press
53. Nachbar F, Stolz W, Merkle T, Cognetta AB, Vogt T, Landthaler M, Plewig G (1994) The ABCD rule of
    dermatoscopy: high prospective value in the diagnosis of doubtful melanocytic skin lesions. J Am Acad
    Dermatol 30(4):551–559
54. Nyíri T, Kiss A (2018) Novel Ensembling methods for dermatological image classification. In: International
    conference on theory and practice of natural computing. Springer, Cham, pp 438–448
55. Oliveira RB, Papa JP, Pereira AS, Tavares JMR (2018) Computational methods for pigmented skin lesion
    classification in images: review and future trends. Neural Comput & Applic 29(3):613–636. https://doi.
    org/10.1007/s00521-016-2482-6
56. Pan SJ, Yang Q (2010) A survey on transfer learning. IEEE Trans Knowl Data Eng 22(10):1345–1359.
    https://doi.org/10.1109/TKDE.2009.191
57. Parkin DM, Mesher D, Sasieni P (2011) 13. Cancers attributable to solar (ultraviolet) radiation exposure in
    the UK in 2010. Br J Cancer 105(2):S66–S69. https://doi.org/10.1038/bjc.2011.486
58. Pathan S, Prabhu KG, Siddalingaswamy PC (2018) Techniques and algorithms for computer aided
    diagnosis of pigmented skin lesions-a review. Biomedical Signal Processing and Control 39:237–262.
    https://doi.org/10.1016/j.bspc.2017.07.010
59. Piccolo D, Ferrari A, Peris KETTY, Daidone R, Ruggeri B, Chimenti S (2002) Dermoscopic diagnosis by a
    trained clinician vs. a clinician with minimal dermoscopy training vs. computer-aided diagnosis of 341
    pigmented skin lesions: a comparative study. Br J Dermatol 147(3):481–486. https://doi.org/10.1046
    /j.1365-2133.2002.04978.x
60. Polat K, Koc KO (2020) Detection of skin diseases from Dermoscopy image using the combination of
    convolutional neural network and one-versus-all. Journal of Artificial Intelligence And Systems 2(1):80–97.
    https://doi.org/10.33969/ais.2020.21006.
61. Ramteke NS, Jain SV (2013) ABCD rule based automatic computer-aided skin cancer detection using
    MATLAB. International Journal of Computer Technology and Applications 4(4):691
62. Ratul AR, Mozaffari MH, Lee WS, Parimbelli E (2019) Skin Lesions Classification Using Deep Learning
    Based on Dilated Convolution bioRxiv:860700. https://doi.org/10.1101/860700
63. Rogers HW, Weinstock MA, Feldman SR, Coldiron BM (2015) Incidence estimate of nonmelanoma skin
    cancer (keratinocyte carcinomas) in the US population, 2012. JAMA dermatology 151(10):1081–1086.
    https://doi.org/10.1001/jamadermatol.2015.1187
64. Rosado B, Menzies S, Harbauer A, Pehamberger H, Wolff K, Binder M, Kittler H (2003) Accuracy of
    computer diagnosis of melanoma: a quantitative meta-analysis. Arch Dermatol 139(3):361–367
65. Shahin, AH, Kamal, A, Elattar, MA (2018). Deep ensemble learning for skin lesion classification from
    dermoscopic images. In: IEEE 9th Cairo international biomedical engineering conference - CIBEC’2018,
    pp 150-153. doi: https://doi.org/10.1109/CIBEC.2018.8641815.
66. Sharif Razavian, A, Azizpour, H, Sullivan, J, & Carlsson, S (2014). CNN features off-the-shelf: an
    astounding baseline for recognition. In: IEEE conference on computer vision and pattern recognition
    workshops, pp. 806–813.
67. Siegel RL, Miller KD, Jemal A (2019) Cancer statistics, 2019. CA Cancer J Clin 69(1):7–34. https://doi.
    org/10.3322/caac.21551
68. Silverberg E, Boring CC, Squires TS (1990) Cancer statistics, 1990. CA Cancer J Clin 40(1):9–26
69. Simonyan, K, Zisserman, A (2014). Very deep convolutional networks for large-scale image recognition.
    arXiv preprint arXiv:1409.1556.
70. Szegedy, C, Ioffe, S, Vanhoucke, V, & Alemi, AA (2017). Inception-v4, inception-resnet and the impact of
    residual connections on learning. In: Thirty-first AAAI conference on artificial intelligence, pp. 4278–4284.
71. Szegedy, C, Vanhoucke, V, Ioffe, S, Shlens, J, Wojna, Z (2016). Rethinking the inception architecture for
    computer vision. In: IEEE conference on computer vision and pattern recognition, pp. 2818–2826.
72. Tschandl P, Rosendahl C, Kittler H (2018) The HAM10000 dataset, a large collection of multi-source
    dermatoscopic images of common pigmented skin lesions. Scientific data 5:180161. https://doi.org/10.1038
    /sdata.2018.161
73. Vestergaard ME, Macaskill PHPM, Holt PE, Menzies SW (2008) Dermoscopy compared with naked eye
    examination for the diagnosis of primary melanoma: a meta-analysis of studies performed in a clinical
    setting. Br J Dermatol 159(3):669–676. https://doi.org/10.1111/j.1365-2133.2008.08713.x
74. Wei, J, Wang, S, & Huang, Q (2019). F3Net: fusion, feedback and focus for salient object detection. arXiv
    preprint arXiv:1911.11445.
75. White R, Rigel DS, Friedman RJ (1991) Computer applications in the diagnosis and prognosis of malignant
    melanoma. Dermatol Clin 9(4):695–702