
2021 2nd International Conference on Artificial Intelligence and Data Sciences (AiDAS)

Acute Lymphoblastic Leukemia Classification in Nucleus Microscopic Images using Convolutional Neural Networks and Transfer Learning
Dheannisa Ramadhani Putri, Ade Jamal, and Ali Akbar Septiandri
Faculty of Science and Technology, University of Al Azhar Indonesia, Jakarta, Indonesia
dheannisarp@if.uai.ac.id, adja@if.uai.ac.id, aliakbar@if.uai.ac.id
978-1-6654-1726-6/21/$31.00 ©2021 IEEE | DOI: 10.1109/AIDAS53897.2021.9574176

Abstract—Leukemia is a disease caused by the uncontrolled production of abnormal blood cells. In Acute Lymphoblastic Leukemia (ALL), lymphoblast cells do not develop into lymphocytes. To diagnose the disease, we need to differentiate between lymphocytes and lymphoblasts; however, the two have similar morphologies. Several computer vision approaches have been developed to distinguish lymphocytes from lymphoblasts. This study compares deep and wide deep-learning architectures for classifying segmented blood cell images. The "deep" architecture employed in this study was DenseNet201, whereas the "wide" architecture was Wide-ResNet-50-2; we also employed ResNet50 as a baseline. This study utilizes transfer learning to reduce the training steps needed. In addition, we measured the impact of preprocessing the dataset with histogram equalization on classifier performance. We found that the DenseNet201 model has the best performance, with an AUC score of 86.73%, and that histogram equalization makes the performance worse.

Keywords—leukemia, computer vision, transfer learning, convolutional neural networks, image classification

I. INTRODUCTION

Blood cancer, or leukemia, is a disease caused by the uncontrolled duplication of abnormal blood cells. Among the several types of leukemia, Acute Lymphoblastic Leukemia (ALL) is caused by the uncontrolled growth of immature lymphocytes, known as lymphoblasts. This type of leukemia mostly affects adults over 50 years old and children aged 2-5 years; studies show that 80% of leukemia cases in children are ALL [1], [2]. Because ALL progresses quickly, early diagnosis is crucial for getting immediate treatment.

Cell counting from microscopic images by a pathologist is still used to diagnose ALL because it is easy to implement, despite being subjective and time-consuming [3], [4]. The main difficulty comes from the similar morphology of lymphocytes and lymphoblasts.

Previous studies have focused on automating this test with computer-aided diagnosis to speed up the process [1], [3]. More recent studies on ALL diagnosis using convolutional neural networks (CNNs) show promising results [1], [5]–[9]. Our study compares the performance of deep and wide convolutional neural network architectures in distinguishing lymphoblast from lymphocyte cells. We use DenseNet-201 [10] as the deep CNN and Wide-ResNet-50-2 [11] as the wide CNN base models for transfer learning. As a benchmark, we include the result of transfer learning with ResNet50 as the base deep residual network [12]. Moreover, we investigate the effect of histogram equalization, applied as a preprocessing step, on the resulting performance.

II. DATASET

In this study, the dataset was obtained from The Cancer Imaging Archive [13]. This dataset was used in the IEEE ISBI 2019 challenge on Classification of Normal vs. Malignant Cells in B-ALL White Blood Cancer Microscopic Images. Each image is 450x450 px, stored as an RGB .bmp file; the data was segmented from the ALL-IDB dataset in [14]–[16]. Fig. 1 shows a sample from each of the two classes. In this study, we used 12,528 images: 8,491 lymphoblast (positive) images and 4,037 lymphocyte (negative) images. We split the data into training, validation, and test sets.

Fig. 1. Sample images. (a) Lymphocyte cell. (b) Lymphoblast cell.
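The paper itself does not include code. As a rough sketch of how the dataset described above could be loaded, the snippet below uses torchvision's ImageFolder; the root folder name and the per-class subfolder names are assumptions, since the paper does not describe how the files are organised on disk.

```python
# Sketch of loading the segmented-cell dataset; "c_nmc_cells" and the class
# subfolder names are hypothetical, not taken from the paper.
from collections import Counter
from torchvision import datasets, transforms

dataset = datasets.ImageFolder(
    "c_nmc_cells",                          # hypothetical root folder
    transform=transforms.ToTensor())        # each .bmp is a 450x450 RGB image

print(dataset.classes)                      # e.g. ['lymphoblast', 'lymphocyte']
print(Counter(label for _, label in dataset.samples))   # expect 8,491 vs 4,037
```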


III. METHOD

A. Histogram Equalization

Histogram equalization flattens the grey-level distribution of an image and maps the result back onto the image. This method enhances the contrast of images and is mostly used on medical images [9], [17]. It starts by computing the probability of each grey level, as shown in Eq. 1, where x represents the image, n_i represents the number of pixels with grey level i, n is the total number of pixels, and L is the maximum grey level.

p_x(i) = p(x = i) = \frac{n_i}{n}, \quad 0 \leq i \leq L    (1)

The next step is to determine the cumulative distribution function (CDF) of this distribution for all possible values of i, as described in Eq. 2.

\mathrm{cdf}_x(i) = \sum_{j=0}^{i} p_x(x = j)    (2)

Eq. 3 shows how the new value of each grey level is determined, where M x N is the image size and v is the grey level. The last step is to map every pixel to its new grey level.

h(v) = \mathrm{round}\left( \frac{\mathrm{cdf}(v) - 1}{(M \times N) - 1} \times 255 \right)    (3)

The result of this method cannot be inverted back into the original image [18].
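A small NumPy sketch of Eqs. 1-3 for a single-channel 8-bit image is shown below. How the authors applied equalization to the RGB cell images (per channel or on a luminance channel) is not stated, so only the grey-level case is illustrated.

```python
import numpy as np

def histogram_equalize(img: np.ndarray) -> np.ndarray:
    """img: 2-D uint8 array of grey levels; returns the equalized image."""
    m, n = img.shape
    counts = np.bincount(img.ravel(), minlength=256)   # Eq. 1 numerator n_i
    cdf = np.cumsum(counts)                             # Eq. 2, kept in pixel counts
    # Eq. 3 as written in the paper; a common variant replaces the two 1s
    # with the smallest non-zero value of the CDF.
    new_levels = np.round((cdf - 1) / (m * n - 1) * 255)
    new_levels = np.clip(new_levels, 0, 255).astype(np.uint8)
    return new_levels[img]                              # remap every pixel
```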
B. Transfer Learning Method

Transfer learning is a method of training a model by transferring knowledge learned from a previous task [19]. It makes machine learning more efficient because the model reuses information from previous problems, saving time compared to training from scratch. The features learned on the previous task are saved and used as prior knowledge [20]. One popular task for pretraining models is image classification on ImageNet [21].
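As an illustration of this transfer-learning setup, the sketch below loads an ImageNet-pretrained backbone from torchvision, freezes its features, and replaces the classification head. The single linear layer is an assumption; the actual head used in this study is the one shown later in Fig. 4.

```python
# Minimal transfer-learning sketch: pretrained backbone, new head.
import torch.nn as nn
from torchvision import models

def build_transfer_model(num_classes: int = 2) -> nn.Module:
    model = models.densenet201(pretrained=True)    # or resnet50 / wide_resnet50_2
    for param in model.parameters():                # keep the ImageNet features
        param.requires_grad = False
    # DenseNet exposes its head as `classifier`; the ResNet variants use `fc`.
    model.classifier = nn.Linear(model.classifier.in_features, num_classes)
    return model
```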
C. Residual Networks (ResNet)

Residual networks (ResNet) [12] are among the most popular deep neural network architectures, especially for transfer learning. The architecture is characterized by its residual layers, which help the model avoid degradation of the features. The residual idea is that it is easier to learn the desired output of a transformation when the identity mapping is also provided. The formulation is given in Eq. 4.

y = F(x, \{W_i\}) + x    (4)

F(x, \{W_i\}) is the function that maps the residual from the previous layer. Note that its output must have the same dimension as x.
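A minimal residual block that realises Eq. 4 might look like the following sketch; it is a simplified illustration, not the bottleneck block actually used inside ResNet50.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.residual = nn.Sequential(              # F(x, {W_i})
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.residual(x) + x)     # y = F(x, {W_i}) + x
```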
D. Wide Residual Networks (WRN)

Wide Residual Networks (Wide-ResNet/WRN) [11] are the wide variation of ResNet. Instead of building a deeper architecture out of residual blocks, this architecture optimizes feature learning with a wider network. It uses as few parameters as possible by creating a shallow but wide network to tackle diminishing feature reuse, and the authors added a drop-out layer to regularize the features. As a result, WRN optimizes the residual layers of ResNet and is faster and computationally cheaper than the original ResNet.

E. Densely Connected Convolutional Networks (DenseNet)

Densely Connected Convolutional Networks (DenseNet) [10] are another family of deep neural networks. The architecture mitigates the vanishing gradient problem by concatenating the feature maps within each dense block. Moreover, the concatenation helps the network collect knowledge from the previous layers.

Fig. 2. Dense block on DenseNet [10]

Fig. 2 shows the dense block of the DenseNet architecture. The connector between the layers is the concatenation operation that transfers the collective knowledge. Between dense blocks there is a transition layer that down-samples the features to avoid redundant feature maps and make the computation faster. The concatenation also helps in the back-propagation phase because it allows the network to send the error directly to earlier layers. The reuse of features in these networks results in a dense architecture.

F. Optimizer

To train the models, we used two gradient descent optimization algorithms, namely SGD and Adam.

1) Stochastic Gradient Descent (SGD): The stochastic gradient descent algorithm minimizes the loss function to find the optimal weights by iteratively descending the loss surface instead of computing an analytical solution. The stochastic part of the algorithm means that the optimization is approximated by taking a random sample from the training set at each iteration, which reduces the memory and time needed. The update made at each iteration is formulated in Eq. 5.

\theta_{t+1} = \theta_t - \alpha \cdot \nabla_\theta J(\theta; x^{(i)}, y^{(i)})    (5)
where \theta is the weight, t is the iteration, \alpha is the learning rate, and J(\theta; x^{(i)}, y^{(i)}) is the loss function that we want to minimize.
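Written out as code, one SGD step of Eq. 5 on a random mini-batch might look like this sketch; the gradient function is generic and not tied to any particular model.

```python
import numpy as np

def sgd_step(theta, grad_fn, X, y, lr=1e-3, batch_size=32, rng=np.random):
    """One update of Eq. 5: theta <- theta - lr * grad J(theta; x_i, y_i)."""
    idx = rng.choice(len(X), size=batch_size, replace=False)  # random mini-batch
    grad = grad_fn(theta, X[idx], y[idx])                     # gradient of the loss
    return theta - lr * grad
```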
2) Adaptive Moment Estimation (Adam): While the learning rate in SGD remains constant throughout the learning process, Adam adapts the learning rate and typically requires minimal tuning by using lower-order moments [22]. Eq. 6 shows the formulation used to update the weights at each time step,

\theta_t = \theta_{t-1} - \alpha_t \frac{m_t}{\sqrt{v_t} + \hat{\epsilon}}    (6)

where m_t is the biased first moment estimate, v_t is the biased second raw moment estimate, \epsilon is a small number to prevent division by zero, and \alpha_t is the learning rate adapted according to the decay rates \beta_1 and \beta_2, following Eq. 7.

\alpha_t = \alpha \cdot \frac{\sqrt{1 - \beta_2^t}}{1 - \beta_1^t}    (7)
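The update of Eqs. 6 and 7 can be written compactly as the following sketch, which follows the formulation in [22]; the caller is responsible for carrying the moment estimates m and v between steps (with t starting at 1).

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad             # biased first moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2        # biased second raw moment estimate
    lr_t = lr * np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)   # Eq. 7
    theta = theta - lr_t * m / (np.sqrt(v) + eps)            # Eq. 6
    return theta, m, v
```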
IV. RESULTS

This section elaborates the results of the study. The first step was preprocessing the dataset with histogram equalization. Fig. 3a shows samples of the original images and their histograms; the top of the figure is a lymphoblast (cancer) sample and the bottom is a lymphocyte (healthy) sample. The maximum grey level in the original images is 150. Fig. 3b shows the histograms of the preprocessed images. The intensity is now better distributed, ranging from 0 to 255, and in the images the nucleus looks brighter than in the original.

Fig. 3. Comparing the effect of preprocessing the images on the grey-level distributions. (a) Histograms of the original images. (b) Histograms of the preprocessed images.

After creating the enhanced dataset, the next step was hyperparameter tuning. Table I shows the hyperparameters tested in this study. Each model was trained for up to 50 epochs with early stopping. Fig. 4 shows the classification layer we added and trained on top of each base model. The selected learning rates are based on experiments with the following values: 10^-4, 10^-3, 10^-2, and 10^-1; we found that most of the models using 10^-1 did not learn at all.

Fig. 4. Classification layer on top of each architecture

TABLE I. HYPERPARAMETER VALUES TO TEST

Hyperparameter | Values
Optimizer      | Adam, SGD
Learning rate  | 10^-4, 10^-3, 10^-2
Scheduler      | True, False
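A hedged sketch of this training configuration is given below. The paper does not name the learning-rate scheduler or the early-stopping patience, so ReduceLROnPlateau and a patience of 5 epochs are assumptions.

```python
# Sketch only: one optimizer, an optional scheduler, and simple early stopping.
import torch

def train(model, train_loader, val_loader, lr=1e-4, use_scheduler=True,
          max_epochs=50, patience=5, device="cpu"):
    model.to(device)
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)   # or torch.optim.SGD
    scheduler = (torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)
                 if use_scheduler else None)                   # assumed scheduler

    best_loss, epochs_without_improvement = float("inf"), 0
    for epoch in range(max_epochs):
        model.train()
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()

        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(x.to(device)), y.to(device)).item()
                           for x, y in val_loader) / len(val_loader)
        if scheduler is not None:
            scheduler.step(val_loss)

        if val_loss < best_loss:            # early stopping on validation loss
            best_loss, epochs_without_improvement = val_loss, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
```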
We split the data set into 70% for training and 30% for validation. Since the data set is imbalanced, we under-sampled the majority class. We cropped the center of each image to 300x300 px to reduce the black pixels and resized it to match the input size of the model.
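The data preparation just described could be sketched as follows; the 224x224 target size after resizing is an assumption based on the ImageNet-pretrained base models, as the paper only states that images were resized to match the model input.

```python
import random
from collections import defaultdict
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.CenterCrop(300),     # remove black pixels around the segmented cell
    transforms.Resize(224),         # assumed input size of the pretrained models
    transforms.ToTensor(),
])

def undersample_and_split(samples, train_frac=0.7, seed=0):
    """samples: list of (path, label) pairs; returns balanced train/val lists."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for path, label in samples:
        by_class[label].append((path, label))
    smallest = min(len(items) for items in by_class.values())
    balanced = []
    for items in by_class.values():
        rng.shuffle(items)
        balanced.extend(items[:smallest])   # drop the extra majority-class images
    rng.shuffle(balanced)
    cut = int(train_frac * len(balanced))
    return balanced[:cut], balanced[cut:]
```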
Fig. 5 shows the learning curves of each model. Fig. 5a shows that most of the models trained without a scheduler have fluctuating curves even after several epochs, compared to the ones trained with a scheduler. In addition, the models with a scheduler converged faster. For example, the WRN model with the Adam optimizer and a learning rate of 10^-4 only improved its accuracy after 20 epochs, while the same model with a scheduler improved its accuracy before 10 epochs; however, it could not reach the highest accuracy of the model without a scheduler.

Two WRN models with the Adam optimizer show an interesting result: they stopped improving after several epochs. We argue that these models could not find the optimum because the learning rate was too large, so descending the error curve took a long time and triggered early stopping. The only model not affected was the one with a 10^-4 learning rate, both with and without a scheduler. The same behaviour also appeared in the ResNet50 baseline and in the model trained with the SGD optimizer and a learning rate of 10^-4. For the DenseNet models, only the model using the Adam optimizer with a 10^-2 learning rate and a scheduler showed no improvement during learning.

Fig. 5b shows the learning curves obtained with the preprocessed data. The models without a scheduler were more stable than those using the original data, but they took a long time to converge.
Using data preprocessing increased the number of models that did not improve during learning, for example the DenseNet model with the Adam optimizer and a 10^-3 learning rate. However, ResNet with Adam optimization and a 10^-3 learning rate indicated better learning than with the original data. An interesting pattern of the preprocessed data is that the validation results could not reach, or even come close to, the training results for almost all models. The right-hand plots show the learning curves of the models using a scheduler. Compared with the models without a scheduler, these models have more stable learning curves and converge faster; however, compared with the models using the original data, they take more epochs to converge or to trigger early stopping.

Fig. 5. Learning curves from the training phase. (a) Models using original data. (b) Models using preprocessed data.
Table II shows the best models from the training phase. For the original data, the best hyperparameters are the Adam optimizer with a scheduler and a learning rate of 10^-4. Meanwhile, the best models trained with the preprocessed data use the SGD optimizer without a scheduler and a learning rate of 10^-3.

Table III shows the results on the test set. We can observe that the models had higher specificity than sensitivity on the original data, except for the specificity of the DenseNet-201 model. The differences in performance are more apparent in the AUC-ROC curves shown in Fig. 6.

Fig. 6. AUC-ROC curve from the evaluation results
TABLE II. TRAINING RESULTS

Model            | Optimizer | Scheduler | lr    | Acc    | Loss   | Specificity | Sensitivity
Original data
ResNet-50        | Adam      | True      | 10^-4 | 0.9616 | 0.3521 | 0.9607      | 0.9640
DenseNet-201     | Adam      | True      | 10^-4 | 0.9685 | 0.3457 | 0.9706      | 0.9623
Wide-ResNet-50-2 | Adam      | True      | 10^-4 | 0.9677 | 0.3465 | 0.9657      | 0.9732
Preprocessed data
ResNet-50        | SGD       | False     | 10^-3 | 0.9449 | 0.3674 | 0.9450      | 0.9444
DenseNet-201     | SGD       | False     | 10^-3 | 0.9479 | 0.3646 | 0.9480      | 0.9479
Wide-ResNet-50-2 | SGD       | False     | 10^-3 | 0.9472 | 0.3647 | 0.9511      | 0.9356
TABLE III. EVALUATION RESULTS

Model            | Acc    | Loss   | Sensitivity | Specificity | AUC Score
Original data
ResNet-50        | 0.8112 | 0.4991 | 0.7655      | 0.9349      | 0.8605
DenseNet-201     | 0.8061 | 0.5039 | 0.7625      | 0.9235      | 0.8715
Wide-ResNet-50-2 | 0.8056 | 0.5023 | 0.7597      | 0.9335      | 0.8676
Preprocessed data
ResNet-50        | 0.6431 | 0.6646 | 0.6205      | 0.8765      | 0.6549
DenseNet-201     | 0.6290 | 0.6742 | 0.6092      | 0.9457      | 0.6774
Wide-ResNet-50-2 | 0.6221 | 0.6818 | 0.6057      | 0.8767      | 0.6584
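For reference, the metrics reported in Table III can be computed from the model outputs as in the sketch below; this is an illustrative snippet, not the authors' evaluation code.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def evaluate(y_true, y_score, threshold=0.5):
    """y_true: 1 for lymphoblast (positive); y_score: predicted probability."""
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),      # recall on the positive class
        "specificity": tn / (tn + fp),
        "auc": roc_auc_score(y_true, y_score),
    }
```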

V. DISCUSSION

The three models have their own formulations for optimizing the features used to classify the images. ResNet50 and Wide-ResNet-50-2 give similar results, probably because they use the same method to optimize the features. However, in this case Wide-ResNet-50-2 converged earlier than ResNet50, although its accuracy could not reach ResNet50's. This might be caused by the additional dropout in the residual blocks, which makes the model generalize the features and may dismiss important ones [11]. DenseNet201 gave results closest to the ResNet50 baseline, and in the training phase DenseNet201 reached the highest accuracy. The connections within the dense blocks helped this model learn the unique features of each class [10]: a feature observed in the input can be seen by later layers, and the reused features help overcome diminishing feature reuse. Based on this study, a deep architecture can optimize the reused features and ensure that the features describe the data of each class, which also helps to classify data with similar forms.

From this study, we can also observe that a scheduler helps the models converge and use the information from training optimally. However, the scheduler might also limit the model's ability to reach higher accuracy. In this study, the preprocessed data could not improve the learning; preprocessing might remove features of the images and make the classes harder to differentiate. Moreover, the under-sampling method might reduce the variance of the image data, because lymphoblasts vary in their forms.

VI. CONCLUSION

We have compared three models to find the best model for classifying nuclei for ALL. DenseNet-201, as the model with a deep architecture, achieved the best overall performance on the test set with an AUC of 86.73%. While the model achieved a specificity of 93.49%, its sensitivity is still 76.55%. In a field like healthcare, we are more concerned with finding the positive cases; thus, we need to tune the model further or penalize it more on false negatives to improve the sensitivity. On the other hand, we can see that preprocessing the data actually made the performance worse.

In the future, we also think it is necessary to be able to interpret the model itself, i.e., to get visual explanations for the decisions that the model makes. Employing a method like Grad-CAM [23] can therefore be beneficial. Further work also includes using the model to count the cells and comparing the result of this method with the gold standard.

REFERENCES

[1] V. Singhal and P. Singh, "Local binary pattern for automatic detection of acute lymphoblastic leukemia," Feb. 2014, pp. 1–5.
[2] S. Mohapatra, D. Patra, and S. Satpathy, "An ensemble classifier system for early diagnosis of acute lymphoblastic leukemia in blood microscopic images," Neural Computing and Applications, vol. 24, no. 7-8, pp. 1887–1904, 2014.
[3] M. Mohamed, B. Far, and A. Guaily, "An efficient technique for white blood cells nuclei automatic segmentation," in 2012 IEEE International Conference on Systems, Man, and Cybernetics (SMC). IEEE, 2012, pp. 220–225.
[4] Y. Liu and F. Long, "Acute lymphoblastic leukemia cells image analysis with deep bagging ensemble learning," in ISBI 2019 C-NMC Challenge: Classification in Cancer Cell Imaging. Springer, 2019, pp. 113–121.
[5] A. M. Abdeldaim, A. T. Sahlol, M. Elhoseny, and A. E. Hassanien, "Computer-aided acute lymphoblastic leukemia diagnosis system based on image analysis," in Advances in Soft Computing and Machine Learning in Image Processing. Springer, 2018, pp. 131–147.
[6] L. H. S. Vogado, R. D. M. S. Veras, A. R. Andrade, F. H. D. De Araujo, R. R. V. e Silva, and K. R. T. Aires, "Diagnosing leukemia in blood smear images using an ensemble of classifiers and pre-trained convolutional neural networks," in 2017 30th SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI). IEEE, 2017, pp. 367–373.
[7] W. Yu, J. Chang, C. Yang, L. Zhang, H. Shen, Y. Xia, and J. Sha, "Automatic classification of leukocytes using deep neural network," in 2017 IEEE 12th International Conference on ASIC (ASICON). IEEE, 2017, pp. 1041–1044.
[8] S. Mourya, S. Kant, P. Kumar, A. Gupta, and R. Gupta, "LeukoNet: DCT-based CNN architecture for the classification of normal versus leukemic blasts in B-ALL cancer," arXiv preprint arXiv:1810.07961, 2018.
[9] T. Thanh, C. Vununu, S. Atoev, S.-H. Lee, and K.-R. Kwon, "Leukemia blood cell image classification using convolutional neural network," International Journal of Computer Theory and Engineering, vol. 10, no. 2, pp. 54–58, 2018.
[10] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
[11] S. Zagoruyko and N. Komodakis, "Wide residual networks," arXiv preprint arXiv:1605.07146, 2016.
[12] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[13] "Wiki - The Cancer Imaging Archive (TCIA) public access," https://wiki.cancerimagingarchive.net/, accessed 12/24/2019.
[14] R. Duggal, A. Gupta, R. Gupta, M. Wadhwa, and C. Ahuja, "Overlapping cell nuclei segmentation in microscopic images using deep belief networks," in Proceedings of the Tenth Indian Conference on Computer Vision, Graphics and Image Processing. ACM, 2016, p. 82.
[15] R. Duggal, A. Gupta, and R. Gupta, "Segmentation of overlapping/touching white blood cell nuclei using artificial neural networks," CME Series on Hemato-Oncopathology, All India Institute of Medical Sciences (AIIMS), New Delhi, India, 2016.
[16] R. Duggal, A. Gupta, R. Gupta, and P. Mallick, "SD-Layer: stain deconvolutional layer for CNNs in medical microscopic imaging," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2017, pp. 435–443.
[17] S. M. Pizer, E. P. Amburn, J. D. Austin, R. Cromartie, A. Geselowitz, T. Greer, B. ter Haar Romeny, J. B. Zimmerman, and K. Zuiderveld, "Adaptive histogram equalization and its variations," Computer Vision, Graphics, and Image Processing, vol. 39, no. 3, pp. 355–368, 1987.
[18] P. Garg and T. Jain, "A comparative study on histogram equalization and cumulative histogram equalization," International Journal of New Technology and Research, vol. 3, no. 9, 2017.
[19] L. Torrey and J. Shavlik, "Transfer learning," in Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques. IGI Global, 2010, pp. 242–264.
[20] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, "How transferable are features in deep neural networks?" in Advances in Neural Information Processing Systems, 2014, pp. 3320–3328.
[21] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 248–255.
[22] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[23] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, "Grad-CAM: Visual explanations from deep networks via gradient-based localization," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), Oct. 2017.

