Electrocardiogram Delineation
A dissertation submitted to the
University of Cincinnati
by
Hedayat Abrishami
Abstract

Analysis of ECG and its constituent cardiac waves is of high significance in cardiology. To support clinicians in large-scale ECG screening, such as drug test phases or population-wide studies, an automated ECG segmentation approach that can segment cardiac waves accurately is of high importance. This dissertation studies Deep Learning (DL) based automated ECG delineation.

Due to the various shapes and abnormalities present in the ECG signal, traditional approaches generalize poorly. A Convolutional Neural Network (ConvNet) applies multilayer feature filters on the input to extract complex features from the signal. Thus, ConvNets can be utilized to extract hierarchical features from ECG signals. Two ConvNet architectures are studied and used for cardiac wave localization, and compared with other studies in the literature as well. The results show that ConvNets are good feature extractors for ECG signals.

Various temporal patterns exist in ECG signals due to arrhythmia and other heart conditions. Generally speaking, short-term memory may not be able to capture the temporal dependencies in ECG sequences. A new DL architecture capable of capturing both long-term and short-term dependencies is therefore studied; this method has improved the identification of temporal patterns. Results show that it performs generally well for raw ECG signal segmentation.
Acknowledgements
I would like to express my sincere gratitude to my advisor, Professor Xuefu Zhou, for his guidance throughout the research and writing of this dissertation. He has shown the attitude and substance of a true scholar, and during the most difficult times when writing this dissertation, he gave me the moral support and the freedom I needed to move on. Without his supervision and constant help, this dissertation would not have been possible. I am also very grateful to Professor Chia Han, for his aspiring guidance and insightful comments, and to Professor Richard Czosek, with whom I worked as a Graduate Assistant on this interdisciplinary topic.

I would also like to thank Professor Wee and Professor Anca Ralescu for their priceless time and insightful comments. My sincere thanks also goes to Mr. Fred Murrell, who provided me the opportunity to work at the Simulation Center during my Ph.D. studies. Also, I thank my mentors at the Simulation Center, Dr. Jue Wang, Dr. Taoran Dong, and Dr. Matthew Barker, for many great discussions and helpful advice.

Finally, I thank my parents, … Abrishami and Mitra Nafarieh, for their love and support throughout life, and my wife, Anna Abrishami, whose patience and kindness made this road enjoyable to ride. Thank you all for your countless self-sacrifice and the chances you provided.
Contents

1 Introduction
2 Related Works
    2.3 Data
3 Cardiac Wave Localization Using a Convolutional Neural Network
    3.5 Results
    3.6 Conclusion
    … Network Review
Bibliography
List of Figures

1.2 Position of electrodes in the six chest leads and angle of Louis position [3].
3.2 (a) normalized ECG segment, (b) local area under the magnified second-order derivative signal.
3.7 Training and validation error curves for the best result, ConvNet …
List of Tables

3.5 Results for every architecture and their related learning rates.
Acronyms

AF Atrial Fibrillation.
AI Artificial Intelligence.
BO Bayesian Optimization.
DL Deep Learning.
E2E End-to-End.
ECG Electrocardiogram.
FN False Negative.
FP False Positive.
FT Fourier Transform.
ML Machine Learning.
mV milliVolts.
NN Neural Network.
QTDB QT Database.
TN True Negative.
TP True Positive.
List of Symbols

: Slicing operation
α Learning rate
C̄ Cell state
ζ_1 A smoothing kernel
ζ'_1 A smoothing kernel
ζ_2 A derivative kernel
ζ'_2 A derivative kernel
A A hypothetical category
B A hypothetical category
D_{X×Y} Input-label distribution
F Forget gate
I Input gate
O Model output
Q Output gate
X Input instance
x Input vector
C Encoded vector
A A member of set A
A A set of functions
B A member of set B
B A set of functions
X Input space
Y Label space
⊗ Convolution operation
τ Time index
a Neuron activation
b Bias connection
freq Frequency
i General indexing
j General indexing
K Number of categories
m Vector dimension
MI A misclassification metric
n Vector dimension
S Stride
SE A sensitivity metric
T Sequence length
t Time index
x Input sample
Chapter 1
Introduction
Controlled by electrical currents through the heart muscle, the heart contracts and pumps blood through the body. The electrocardiogram (ECG, or EKG) is the recording of the heart's electrical activity, obtained to gather critical information about the heart's condition. The electrical activity of the heart varies over the cardiac cycle, and an ECG plots this electrical activity in milliVolts (mV) on the vertical axis against time on the horizontal axis. The ECG is widely utilized to diagnose various cardiovascular diseases. For example, by measuring time intervals on the ECG, a cardiologist can determine how long the electrical wave takes to pass through the heart. Measuring the time duration of a cardiac wave traveling from one part of the heart to the next shows whether the electrical activity is normal, slow, fast or irregular.
The conventional 12-lead ECG is the current worldwide standard for diagnostic electrocardiography. Each ECG lead monitors the electrical activity from a particular position of the body with respect to the heart muscle [3]. Generally, these twelve leads are divided into two groups: the first group consists of six limb leads, and the second group includes six chest leads. The chest electrodes are placed at angles with respect to a slight horizontal ridge on the chest called the angle of Louis, which is located where the manubrium joins the body of the sternum [3]. The electrodes of the limb leads are placed on the left arm, right arm and left leg. While the six limb leads are labeled as lead I, lead II, lead III, aVR, aVL and aVF, the six chest leads are denoted as V1, V2, V3, V4, V5 and V6. Fig. 1.2 shows the placement of the six chest leads and the angle of Louis.

Fig. 1.2: Position of electrodes in the six chest leads and angle of Louis position [3].
A cardiac complex is composed of several cardiac waves. Among those waves, three are of high significance. The P-wave represents atrial depolarization; a typical P-wave duration is less than or equal to 0.11 seconds and the amplitude is less than 0.25 mV. The QRS-complex represents ventricular depolarization; a typical QRS-complex lasts … seconds and the amplitude is over 0.25 mV [3]. Further, the T-wave represents the repolarization of the ventricles; a normal T-wave duration is from 0.10 to 0.25 seconds and the amplitude is less than 0.25 mV [3]. Other intervals and segments in the ECG signal are derived from these key waves, including the PR-interval (i.e., from the beginning of the P-wave to the beginning of the QRS-complex), the PR-segment (i.e., from the end of the P-wave to the beginning of the QRS-complex), the ST-segment (i.e., from the end of the S-wave to the beginning of the T-wave) and the QT-interval (i.e., from the beginning of the QRS-complex to the end of the T-wave). Each interval and segment conveys information for cardiac diagnoses [4]. However, locating the three key cardiac waves (P-wave, QRS-complex and T-wave) is essential for finding the rest of the intervals and segments. Fig. 1.3 shows a typical lead II ECG cardiac complex with cardiac intervals, segments and peak annotations. During depolarization, the P-wave and QRS-complex appear in the cardiac complex; during the repolarization, the T-wave appears.
While Fig. 1.3 represents one typical and normal cardiac cycle, due to arrhythmia and many other heart conditions, various abnormalities may be present in ECG signals, making them challenging to interpret. Fig. 1.4 shows five typical cardiac wave formations, and four atypical cardiac wave formations are shown in Fig. 1.5.

Fig. 1.5: Various atypical QRS cardiac wave formations [5].
As shown in Fig. 1.4, a capital letter indicates a peak with high amplitude and a small letter indicates a peak with low amplitude. The five typical QRS-complex types are the QRS type (including Q-peak, R-peak and S-peak), QR type (including Q-peak and R-peak), Q type (including only Q-peak), RS type (including R-peak and S-peak) and R type (including only R-peak). Fig. 1.5 shows four atypical QRS cardiac wave formations: the QRSR'S' type, QRSR' type, RSR'S' type and RSR' type. In the abnormal ECG waveforms, R' is the second positive or upward wave and S' is the second negative or downward wave. Waveforms such as RSR', RSS' or RSR'S' complexes are elongated QRS waveforms, and they are signs of a symptom called bundle branch block [5]. Bundle branch block is a delay in the contraction of the ventricles (chambers of the heart), and it reduces the pumping efficiency of the heart muscle [6]. Moreover, the presence of RSR' waveforms can be an indicator of myocardial infarction (heart attack) [5]. The prime waveforms (R' and S') are abnormal waveforms. Therefore, one of the main challenges of ECG delineation is the identification of this variety of cardiac wave formations. Furthermore, cardiac waves can have a variety of long-term or short-term temporal relations to each other.
Fig. 1.6 shows different temporal relationships between cardiac waves. While Fig. 1.6 (a) shows a QRS-complex connected to the T-wave through an elevated ST-segment, Fig. 1.6 (b) shows a QRS-complex connected to the T-wave through a depressed ST-segment. Both are abnormal ECG signals and are related to Sudden Cardiac Death (SCD) and sudden cardiac arrest. Furthermore, Fig. 1.6 (c) shows T-wave inversion, where the QRS-complex is connected to an inverted T-wave. Here, unlike in Fig. 1.6 (a), the T-wave has an abnormal, inverted formation, which is among the symptoms of SCD and eventual death [7]. These are a few examples of the abnormalities in cardiac wave formations and temporal relationships that lead to SCD.
The research idea of computer-aided ECG analysis dates back to 1950 [8]. The objective is to automate ECG analysis, reduce the need for expert clinician resources and reduce the cost of interpreting cardiac waves.

Fig. 1.6: Various temporal relationships between cardiac waves.

Over the years, many digital signal processing algorithms have been developed, as ECG analysis and localization of cardiac waves are crucial to both cardiovascular research and clinical practices [8]. Based on accurate ECG delineation, one can identify cardiac wave formations and cardiac wave temporal relations. Thus, ECG segmentation is an important and long-standing problem.
Due to the variety of cardiac waves, the task of delineating the ECG signal and its constituent cardiac waves remains challenging, as does the task of identifying the temporal relations between the cardiac waves. This dissertation studies novel ECG delineation methods (i.e., ECG segmentation, cardiac wave localization and cardiac wave identification) based on Deep Learning (DL) methods. Results show that DL-based approaches for extracting spatial and temporal features perform well in ECG delineation.

An End-to-End (E2E) model takes raw ECG signals as its inputs and maps the inputs to the desired outputs, learning models for ECG analysis from expert-annotated ECG data. There are three advantages to this approach. First, no hand-designed features are required. Second, the identification of cardiac waves can be maximized. Finally, both long-term and short-term temporal relationships in ECG data can be captured by employing models with the capability of capturing such dependencies.

The rest of this chapter discusses the motivation of this research and potential applications based on automated ECG delineation. Section 1.2 presents the scopes of computer-aided ECG analysis and the contributions of this dissertation. Finally, Section 1.3 outlines the organization of the dissertation.
As discussed above, numerous diagnoses and findings can be made based upon ECG delineation, from clinical diagnosis to population screening for heart diseases, including the leading cause of SCD, Hypertrophic Cardiomyopathy (HCM). Over the years, ECG delineation has generally been addressed by two different approaches. One approach is to localize the peaks of the cardiac waves and determine how other samples lie relative to the identified peaks. The other approach is to classify every ECG sample into one of the cardiac wave categories, i.e., ECG segmentation [11]. The performances of these two approaches depend on the feature filters used in the ECG analysis and the kernels designed for feature extraction [12–15].
Applications of the ECG delineation research include but are not limited to the following.

• The first application is ECG monitoring systems, which are widely adopted due to their economical cost [7]. However, current ECG monitoring systems are cited for their high false-positive rates and are thus unable to operate without clinician oversight.

• The second application is mobile cardiac telemetry. Remote medical diagnosis is particularly helpful for patients in isolated or remote areas who lack quality medical services [18]. Besides, the increasing demand for predictive and preventive care boosts mobile cardiac telemetry.

• The third application is screening heart activities during drug trials. ECG abnormalities and cardiac events can be caused by reactions to drugs, and this has led to a variety of regulations: any risk associated with the heart activities of the subjects under the medication should be monitored and reported.

An incorrect computer-assisted reading may lead to wrong treatments or harm to the patients. In addition, patients may not receive necessary and critical treatments due to the failure to identify specific symptoms [21]. Therefore, verification and validation are required for any computer-assisted diagnosis.
During our research on HCM screening based on the Seattle Criteria [4], problems related to the limitations of current ECG signal analysis approaches have been identified. HCM symptoms are a subset of SCD symptoms that are common among athletes, and this subset of SCD symptoms is introduced in the Seattle Criteria [4]. In our conducted research, the first step of the experiment was implementing a cardiac wave localization method, for which three different approaches were implemented. Even though the results of these approaches were acceptable on the standard dataset [22], the localization was not performing to expectations when SCD symptoms were occurring. By studying these methods, it became clear that the feature acquisition proposed in these works is not adapted to the HCM symptoms. For example, a low R-peak amplitude in the QRS-complex could mislead the algorithms [9].

A turning point came with image classification using a DL model that can tune feature filters capable of extracting essential spatial features for a variety of image classes from raw pixel data [23]. By that time, a method that could extract the essential features from various cardiac wave formations and heart-condition symptoms was missing from ECG signal processing studies. From that moment, attempts began to define the problem statement and to study an E2E ECG delineation system without any predefined features.
1.2 Research Scope and Contributions
Computer-aided ECG analysis has two broad scopes. The first scope is defining diagnostic criteria: cardiologists define criteria for cardiac waves and publish their results, so that the automation algorithms can be developed with respect to the published criteria [4, 8, 24]. Willems et al. defined cardiac waves and cardiac intervals and reviewed the approaches to visual determination of the onsets and offsets of the P-wave, the QRS-complex and the T-wave [24]. Drezner et al. have defined several criteria for cardiac waves to diagnose SCD [4]. The second scope is studying the feasibility and accuracy of these definitions and criteria using signal processing and machine learning algorithms.
Different aspects of the algorithmic computer-aided ECG analysis are discussed further. The first aspect is the acquisition of the ECG signal: converting the ECG signal from analog to digital and removing baseline noise and artifacts of the ECG signal [25, 26]. A public dataset [22] has already acquired ECG signals, so this work does not deal with acquiring ECG signals; however, a baseline wander drift removal methodology is applied to the provided ECG signals. The second aspect is cardiac wave onset, offset and peak localization, cardiac wave recognition and cardiac wave segmentation. The third aspect is the interpretation of ECG cardiac waves using diagnostics [8]; this aspect of ECG analysis delves into diagnostic criteria. The fourth aspect is developing computationally efficient methods for wearable devices with limited power consumption [8, 25]. This research does not focus on the third and fourth aspects of ECG signal processing.
The main contributions of this dissertation include a localization method for the P-wave, QRS-complex and T-wave from ECG signals by utilizing a DL-based ConvNet model, and a new approach to improve the ECG temporal pattern recognition task. A new DL model architecture is proposed; unlike the Hidden Markov Model (HMM), this new model is capable of capturing both long-term and short-term temporal dependencies. Based on this architecture, an E2E ECG segmentation model using the raw ECG signal as its input is developed.

The rest of the dissertation is organized as follows. Chapter 4 introduces research on the capability of DL methods in temporal pattern recognition, demonstrated in P-wave segmentation. Chapter 5 studies an E2E model for identifying cardiac wave formations and capturing ECG temporal relationships from the raw ECG signal.
Chapter 2
Related Works
The question of whether machines can have intelligence has been studied since the early days of computing. In the early days of the invention, programmable computers were used to solve problems that were too difficult for human beings to solve but rather straightforward for computers, i.e., problems that required formal mathematics or intense memory bookkeeping [27]. One of the most important AI successes in the early stages was IBM's Deep Blue defeating Garry Kasparov in the game of chess [27, 28]. Computers, however, have fallen short on the tasks that are hard to describe but easy to perform for human beings, such as recognizing objects and interactions and understanding human behaviors. The reason for this shortcoming is that computers perform extraordinarily well when the set of rules is known and can be formulated; however, they struggle if the task is more intuitive, abstract and less describable [27].
Machine learning (ML) approaches instead allow computers to acquire their own knowledge by extracting patterns from the data. These approaches can handle intuitive tasks, but the performance of an ML algorithm heavily depends on the data presented to it. Data presentation is the set of features provided to the ML algorithm; an ML algorithm can then find the correlation between the features and perform the designated tasks (i.e., decision making, regression, etc.). Generally speaking, there are two types of features: ML-obtained features and features designed by experts to extract meaningful information from the raw data (i.e., designed feature filters). DL models consist of multiple feature-extraction layers cascaded on top of each other. Due to its hierarchical nature, DL is capable of extracting more complicated and abstract features at the top of the hierarchies by utilizing the simple and shallow concepts obtained from the data in the lower hierarchies [27].

Fig. 2.1 summarizes different classes of approaches in AI. Here, ML modules are indicated by the grey blocks. As shown in Fig. 2.1, rule-based methods do not learn from data, and classic ML methods rely on hand-designed features extracted by another algorithm. In contrast to rule-based systems, DL systems learn their features.
ECG delineation is exactly the kind of problem such methods have tried to solve. Not only does specific pattern identification in ECG signals require in-depth expert medical knowledge, but it is also very challenging to translate this knowledge into formal rules. There are a plethora of ECG analysis studies using rule-based methods and classic ML methods [29]; these studies are referred to as traditional ECG signal processing approaches. On the other hand, there is still a lack of studies in ECG analysis based on DL, and DL can be a great contribution to the ECG analysis field. The novelty of this dissertation is the application of DL models to ECG delineation and the analysis of their performances.

The rest of this chapter is organized as follows. Section 2.1 reviews DL fundamentals and their applications. Section 2.2 reviews the traditional feature extraction and classification methods. Section 2.3 introduces the dataset used in this dissertation research. Section 2.4 presents the metrics by which the performance of the algorithms has been investigated and compared to related works in the literature. Finally, Section 2.5 discusses the perspective of the dissertation research.
ML algorithms learn attributes from the raw data through an optimization procedure called training [27]. However, there is room for interpretation in defining the raw data: e.g., in handwriting recognition, local features of a pixel are considered to be raw data [30], while in image classification, the RGB values of the pixels are considered to be raw data [23].
In supervised learning, a set of input-label pairs, called labeled data, is provided to the algorithm for training. The algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. Unlike supervised learning, unsupervised learning performs without labeled data, i.e., the input data does not have any label associated with it. The goal of unsupervised learning is to find the underlying structure of the data, learn relationships between elements in the data and identify common characteristics among them. The typical applications for unsupervised learning are clustering problems and dimensionality reduction.

For supervised learning, there are three sets of input-label data pairs: the training set, the validation set and the test set. For every input-label pair (X, Y), X is the input instance and Y is its label. The instances in the training set and test set are mutually exclusive; however, the validation set can be a subset of the training set or mutually exclusive from the training set and test set. The goal of training is to minimize a task-specific error measure; for example, in a regression task, the error measure is usually the Euclidean distance between the algorithm outputs and the provided labels. The purpose of the validation set is to validate the model during training; in particular, the validation set is used to determine when to stop training and avoid over-fitting. The test set is used to evaluate the performance of the trained model on unseen data.
2.1.1 Artificial Neural Networks
DL methods are based on Artificial Neural Networks (ANNs), which have been used in supervised learning and unsupervised learning. An ANN is a collection of connected units called artificial neurons [31]. Each weighted connection, like the synapses in a biological brain, can transmit a signal from one artificial neuron to another. ANNs are a well-established approach in the ML field [30]. Generally, there are two types of ANNs: acyclical ANNs and cyclical ANNs. In acyclical ANNs, there is no weighted connection between any two neurons that can create a loop between them; these are known as feedforward networks. In contrast, cyclical ANNs are RNNs and have extra weighted connections that create loops in the network graph to maintain the temporal interstate of the network. For a single neuron, the weighted summation is

    z = X_{1\times n} \theta_{n\times 1} + b = \sum_{i=1}^{n} x_i \theta_i + b ,    (2.1)

and

    a = g(z)    (2.2)
Fig. 2.2: Single neuron model.
where X_{1×n} is the input to the neuron, θ_{n×1} is the weighted connection vector and b is the bias connection. A bias unit, or intercept, is necessary for data fitting, and every neuron except the input neurons has one independent bias unit connected to it through the bias connection. A bias unit is not connected to any previous-layer neurons, and it is set to the numerical value of one. z is the weighted summation input to the neuron, and g(.) is called the activation function. Common activation functions, shown in Fig. 2.3, include the sigmoid function, Tanh function, Rectified Linear Unit (ReLU) function and piecewise linear functions. The sigmoid function is defined as

    g(z) = \frac{1}{1 + e^{-z}} .    (2.3)

Fig. 2.3: Activation functions.
The sigmoid function is widely used, especially for output neurons. Since its output is in the range of (0, 1), it is suitable for modeling probability. Due to its nonlinearity, ANNs with sigmoid activations are capable of modeling nonlinear functions. Therefore, the sigmoid function is more suitable for nonlinear problems than linear ANN models, which are capable of modeling only linear equations. Moreover, any combination of linear operators is still a linear operator, i.e., any ANN with multiple linear layers is exactly equivalent to another ANN with a single linear layer [30]. Linear networks are in contrast to nonlinear networks, which can gain more expressive power from the input data [30, 32]. Another key property of the sigmoid function is its differentiability, which allows the network to be trained with gradient descent [30]. The Tanh function,

    Tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} ,    (2.4)

is another nonlinear sigmoidal activation function, with the range of (−1, 1).
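To make these definitions concrete, the following is a minimal NumPy sketch of the three activation functions discussed above; it is illustrative only and not code from the original work.

```python
import numpy as np

def sigmoid(z):
    # Eq. 2.3: maps any real input into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Eq. 2.4: sigmoidal activation with range (-1, 1)
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

def relu(z):
    # Rectified Linear Unit: zero for negative inputs, identity otherwise
    return np.maximum(0.0, z)

z = np.linspace(-5.0, 5.0, 5)
print(sigmoid(z))   # approaches 0 on the left, 1 on the right
print(tanh(z))      # approaches -1 on the left, 1 on the right
print(relu(z))      # [0. 0. 0. 2.5 5.]
```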
Multilayer Perceptron
The neurons in a Multilayer Perceptron (MLP) model are arranged in three different kinds of interconnected layers: one input layer, one or more hidden layers and one output layer. Fig. 2.4 shows an MLP example with one input layer consisting of four input neurons activated by the input vector X, two hidden layers each consisting of three neurons whose outputs are a_h^{(l)} (where h is the index of the hidden neuron and l is the hidden layer index), and finally one output layer consisting of two output neurons whose outputs are o_k, where k is the index of the output neuron.
All MLP layers except the input layer are fully-connected layers (dense layers), i.e., each neuron receives input from every neuron of the previous layer. This allows a particular layer in an MLP to learn patterns from all the combinations of the features provided in the previous layer. One common application of MLP models is classification: by providing a feature set to an MLP, this type of model can categorize the input into one of a set of classes. It has been shown that an MLP with a single hidden layer containing a sufficient number of hidden neurons can approximate continuous functions arbitrarily well. However, since the output of an MLP depends only on the current input (i.e., it is memoryless) and does not depend on any past or future inputs, MLPs are more suitable for static pattern classification than for sequential data. In the feedforward process, the input is passed through the intermediate layers in one direction (i.e., no feedback), activating the neurons of the hidden layers and finally the neurons of the output layer.
In an MLP with an input X, each neuron in the first hidden layer calculates a weighted summation over the input neurons. For hidden neuron h in the first hidden layer,

    z_h^{(1)} = \sum_{i=1}^{n} \theta_{i,h}^{(1)} x_i + b_h^{(1)}    (2.5)

where n is the number of input neurons, \theta_{i,h}^{(1)} denotes the weighted connection from input neuron i to hidden neuron h, and b_h^{(1)} is the bias for hidden neuron h. Afterward, the activation function g_h^{(1)}(.) for neuron h is applied to the weighted sum z_h^{(1)} and the final activation a_h^{(1)} is obtained as

    a_h^{(1)} = g_h^{(1)}(z_h^{(1)}) .    (2.6)

After calculating the activation for each neuron in the first hidden layer, the weighted summation and activation are repeated for the rest of the hidden layers, e.g., for neuron h in the l-th hidden layer H_l, the summation and activation are given by [30]

    z_h^{(l)} = \sum_{h' \in H_{l-1}} \theta_{h',h}^{(l)} a_{h'}^{(l-1)} + b_h^{(l)} ,    (2.7)

and

    a_h^{(l)} = g_h^{(l)}(z_h^{(l)})    (2.8)

where H_{l-1} is the set of hidden neurons in the (l-1)-th hidden layer, \theta_{h',h}^{(l)} denotes the weighted connection from neuron h' in the (l-1)-th hidden layer to neuron h in the l-th hidden layer, and b_h^{(l)} is the bias for hidden neuron h in the l-th hidden layer.

For the MLP example shown in Fig. 2.4, by activating the input layer using the input vector X, the forward pass for the first and second hidden layers calculates the output of the first and second hidden layers, respectively. Utilizing the activation of the second hidden layer, the output layer activation O, which is the output of the model, is obtained.
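As a concrete illustration of Eqs. 2.5–2.8, the sketch below runs a forward pass through an MLP with the shape of Fig. 2.4 (four inputs, two hidden layers of three neurons, two outputs). The weight values are random placeholders, not parameters from the dissertation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dense(a_prev, theta, b, g):
    # Eq. 2.7 (weighted summation) followed by Eq. 2.8 (activation)
    return g(a_prev @ theta + b)

rng = np.random.default_rng(0)
X = rng.normal(size=4)                                   # input vector
theta1, b1 = rng.normal(0, 0.1, (4, 3)), np.zeros(3)     # input -> hidden 1
theta2, b2 = rng.normal(0, 0.1, (3, 3)), np.zeros(3)     # hidden 1 -> hidden 2
theta3, b3 = rng.normal(0, 0.1, (3, 2)), np.zeros(2)     # hidden 2 -> output

a1 = dense(X, theta1, b1, sigmoid)       # first hidden layer activations
a2 = dense(a1, theta2, b2, sigmoid)      # second hidden layer activations
O = dense(a2, theta3, b3, lambda z: z)   # linear output layer activation O
print(O)
```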
Output Layer
In an NN model with K output neurons, the output vector is given by the activation functions of the neurons in the output layer. The weighted summation z_k for output neuron k is

    z_k = \sum_{h \in H_L} \theta_{h,k} a_h^{(L)} + b_k    (2.9)

where L is the index of the last hidden layer, θ_{h,k} denotes the weighted connection between hidden neuron h and output neuron k, a_h^{(L)} is the activation of neuron h in the last hidden layer, and b_k represents the bias connection for output neuron k. For a classification task with K classes, the output layer should have K output neurons, and the output activations are normalized with the softmax function to obtain the class probabilities as

    p(c_k | X) = o_k = \frac{e^{z_k}}{\sum_{k'=1}^{K} e^{z_{k'}}}    (2.10)
where o_k is the activation of the k-th output neuron. Label data in a classification task is typically one-hot coded, representing the label class as a binary vector with all elements equal to zero except for the element of the correct class, which is one. For example, if there are five classes and the correct class is c_2, the label data Y is represented by (0, 1, 0, 0, 0) [30]. Using this coding obtains the following convenient form for the label probabilities:

    p(Y | X) = \prod_{k=1}^{K} o_k^{y_k}    (2.11)

which expresses the probability of the model finding the correct class. If all the output neuron activations are close to the label data, the probability is close to the numerical value of one (i.e., the model finds the correct class). Otherwise, if any of the output neuron activations is not close to its corresponding label, the probability will be closer to zero (i.e., the model is incapable of finding the correct class).

Given the above definitions for pattern classification in MLPs, an input vector activates the model, and the most activated output neuron corresponds to the predicted class.
Loss Functions
The loss function for multiclass classification problems is simply obtained from the negative logarithm of Eq. 2.11 [30]:

    L(X, Y) = -\sum_{k=1}^{K} y_k \ln o_k    (2.12)

where L(X, Y) is the loss function. The equation is differentiable, which makes it suitable for computing the error gradients. This loss value is small (close to zero) if the output neuron activations and the labels for those neurons are close to each other. For example, if the k-th output neuron activation is close to zero and the label data for that output neuron is also zero (the instance does not belong to the k-th class), then the loss value related to the k-th class will be approximately zero because the model has classified it correctly. On the other hand, the loss value will be large if a label element and its output neuron activation are far from each other. For example, if the k-th output neuron activation is close to zero but the label data for that output neuron is one (the instance belongs to the k-th class), the loss value will be very large because the model has classified the k-th class incorrectly. Given the above definition, the range of the loss function for a single instance is [0, ∞).
class incorrectly. Given the above definition, the range of the loss function for a
By using the differentiable loss function detailed in Eq. 2.12, MLP can be trained
to minimize the loss function utilizing gradient descent methods. The most pop-
33
ular approach to calculate the gradient descent is known as backpropagation
[30, 34]. Backpropagation is also referred to as the backward pass of the net-
derivatives. The first step is to calculate the derivatives of the loss function with
∂L(X, Y) y
=− k. (2.13)
∂ok ok
The activation of each neuron in a softmax layer depends on the network inputs of all the neurons in that layer. Therefore, by the chain rule,

    \frac{\partial L(X, Y)}{\partial z_k} = \sum_{k'=1}^{K} \frac{\partial L(X, Y)}{\partial o_{k'}} \frac{\partial o_{k'}}{\partial z_k} .    (2.14)

The derivative of the softmax activation is

    \frac{\partial o_{k'}}{\partial z_k} = o_k \delta_{k,k'} - o_k o_{k'}    (2.15)

where \sum_{k=1}^{K} o_k = 1, because the softmax normalizes over the activations of all the neurons in the layer, and

    \delta_{k,k'} = \begin{cases} 1 & k = k' \\ 0 & k \neq k' \end{cases} .    (2.16)
By substituting Eq. 2.13 and Eq. 2.15 into Eq. 2.14, we obtain

    \frac{\partial L(X, Y)}{\partial z_k} = o_k - y_k .    (2.17)
Backpropagation proceeds the same way for all the hidden layers. It is helpful to introduce the derivative of the loss function with respect to the neuron input,

    \delta_j = \frac{\partial L(X, Y)}{\partial z_j}    (2.18)

where j is any neuron in the network and z_j is the weighted summation of any neuron except the input neurons [30]. For the neurons in the last hidden layer, we have [30]

    \delta_h = \frac{\partial L(X, Y)}{\partial a_h} \frac{\partial a_h}{\partial z_h} = \frac{\partial a_h}{\partial z_h} \sum_{k=1}^{K} \frac{\partial L(X, Y)}{\partial z_k} \frac{\partial z_k}{\partial a_h}    (2.19)

where L(X, Y) depends on each hidden neuron h only through its influence on the output neurons. By differentiating Eq. 2.6 and Eq. 2.9 and then substituting into Eq. 2.19, we obtain

    \delta_h = g'(z_h) \sum_{k=1}^{K} \delta_k \theta_{h,k}    (2.20)
where g'(.) is the first-order derivative of the activation function with respect to its input z. In addition, the δ terms for each hidden layer H_l below the last one are obtained recursively:

    \delta_h^{(l)} = g'^{(l)}(z_h^{(l)}) \sum_{h' \in H_{l+1}} \delta_{h'}^{(l+1)} \theta_{h,h'}^{(l+1)}    (2.21)

where h is the index of the hidden neuron (h ∈ H_l) and h' is the index of the neuron in the hidden layer above (h' ∈ H_{l+1}). Once the δ terms are obtained for all the hidden neurons, utilizing Eq. 2.5, the derivative with respect to each weighted connection is

    \frac{\partial L(X, Y)}{\partial \theta_{i,j}} = \frac{\partial L(X, Y)}{\partial z_j} \frac{\partial z_j}{\partial \theta_{i,j}} = \delta_j a_i    (2.22)

where θ_{i,j} is the weighted connection between neuron i and neuron j, and z_j is the weighted summation input to neuron j. Each weight is then updated as

    \Delta \theta_{i,j} = (1 - \alpha) \theta_{i,j} - \alpha \frac{\partial L(X, Y)}{\partial \theta_{i,j}}    (2.23)
where α ∈ [0, 1] is the learning rate. This process is repeated until a stopping criterion is met (e.g., the loss function stops decreasing as the process is repeated). More details on ANN fundamentals can be found in [30].
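The derivation above can be checked with a few lines of NumPy. The sketch below implements Eqs. 2.17, 2.20 and 2.22 for a single-hidden-layer network with sigmoid activations and a softmax output; the update applied is the plain gradient-descent step (Eq. 2.23 as printed additionally rescales the existing weight, so the simpler conventional form is used here). All sizes and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

X = rng.normal(size=3)                  # one input instance
Y = np.array([0.0, 1.0])                # one-hot label, two classes
W1, b1 = rng.normal(0, 0.1, (3, 4)), np.zeros(4)
W2, b2 = rng.normal(0, 0.1, (4, 2)), np.zeros(2)

# forward pass (Eqs. 2.5-2.10)
a1 = sigmoid(X @ W1 + b1)
z2 = a1 @ W2 + b2
o = np.exp(z2 - z2.max()); o /= o.sum()

# backward pass
delta_out = o - Y                             # Eq. 2.17
delta_hid = a1 * (1 - a1) * (W2 @ delta_out)  # Eq. 2.20; g'(z) = a(1-a) for sigmoid
grad_W2 = np.outer(a1, delta_out)             # Eq. 2.22
grad_W1 = np.outer(X, delta_hid)

alpha = 0.1                                   # learning rate
W2 -= alpha * grad_W2; b2 -= alpha * delta_out
W1 -= alpha * grad_W1; b1 -= alpha * delta_hid
```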
Batch learning, mini-batch learning and stochastic learning are weight update policies. Batch learning updates the network weights by calculating the average of the gradients of the loss function over the entire set of training instances. Since the average gradient over the entire training set is known, batch learning has the advantage of fast convergence to a minimum, but it carries the risk of falling into a local minimum [30, 35]. Unlike batch learning, in stochastic learning the network weights are updated after calculating the gradients of the loss function for each training instance. Stochastic learning has the advantage of escaping from local minima, since every training instance has a different loss value, but it carries the risk of missing the global minimum as well [35]. Due to its one-instance-at-a-time nature, stochastic learning is also slower to converge. As a compromise between batch and stochastic learning, mini-batch learning divides the training set into multiple subsets and updates the weights after calculating the average of the gradients of the loss function for each subset. Therefore, it has the advantages of escaping local minima and fast convergence; however, the mini-batch size becomes an additional choice to make. A sketch of this policy is given below.
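In the sketch, `update_fn` is a hypothetical callback that computes the averaged gradient of a subset and updates the weights. Setting `batch_size` to the dataset size recovers batch learning, and setting it to one recovers stochastic learning.

```python
import numpy as np

def minibatch_training(X, Y, batch_size, update_fn, epochs=10, seed=0):
    # Divide the training set into subsets and update after each subset.
    rng = np.random.default_rng(seed)
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)                # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            update_fn(X[idx], Y[idx])             # averaged gradient inside
```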
Weight Initialization
Gradient descent algorithms for NNs require small, random, initial values for the weights. A simple approach is to draw samples from a Gaussian distribution with mean zero and a standard deviation of 0.1 to initialize the weights [30]. Xavier initialization instead scales the initial weights according to the number of neurons connected to the layer. The idea behind this approach is to keep the variance of a layer's output the same as the variance of the layer's input. It has been shown that for deep models with a large number of hidden layers, this approach can increase the convergence speed [37]. Another approach resembles the Xavier solution, but the weight ranges depend only on the number of neurons feeding the layer.
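The sketch below contrasts the fixed-scale Gaussian initialization described in the text with one common form of Xavier initialization, in which the range is derived from the number of neurons on both sides of the layer; the exact formula used in the cited works may differ.

```python
import numpy as np

def gaussian_init(n_in, n_out, rng, std=0.1):
    # fixed scale: mean zero, standard deviation 0.1, as in the text
    return rng.normal(0.0, std, size=(n_in, n_out))

def xavier_init(n_in, n_out, rng):
    # scale tied to layer width so output variance tracks input variance
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

rng = np.random.default_rng(0)
print(gaussian_init(256, 128, rng).std(), xavier_init(256, 128, rng).std())
```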
Hyperparameters are the variables that define NN models, such as the number of hidden layers, the number of neurons in each hidden layer, etc. It is a challenging and critical task to select the values of these variables, as they determine the performance of the model.
One approach to selecting hyperparameters is grid search, in which the model is trained repeatedly while all hyperparameters are held constant except the one that is targeted for the grid search. For example, to identify the best learning rate, the NN model should be trained on a range of learning rates while the rest of the hyperparameters remain constant. The second approach is based on Bayesian Optimization (BO) [39]. BO divides some of the inputs into different sets called surrogates; using a slightly different set of hyperparameters for each surrogate, it evaluates the model and tries to find the best combination of hyperparameters by repeating this procedure. A third approach is to select hyperparameter values based on similar research studies and evaluate the performance and the convergence.

The learning rate itself can be annealed during training. In a learning-rate-annealing method called step decay, at the beginning of the training procedure the learning rate starts from a high value, and the learning rate gradually decreases with every epoch of training [40]. Annealing the learning rate through the training process leads to faster learning at the beginning of the training process, and decreasing the learning rate through each epoch leads to smaller, more careful weight updates toward the end. A sketch of such a schedule follows.
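This is a minimal sketch of a step-decay schedule; the drop factor and drop interval are illustrative values, not settings taken from the dissertation.

```python
def step_decay(initial_rate, drop=0.5, epochs_per_drop=10):
    # Return the annealed learning rate for a given epoch:
    # start high, then shrink by `drop` every `epochs_per_drop` epochs.
    def rate_at(epoch):
        return initial_rate * (drop ** (epoch // epochs_per_drop))
    return rate_at

rate = step_decay(1e-3)
print([rate(e) for e in (0, 9, 10, 25)])  # [0.001, 0.001, 0.0005, 0.00025]
```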
2.1.2 Convolutional Neural Networks (ConvNets)
ConvNet models were introduced by LeCun [42], and they became increasingly popular after a large ConvNet model achieved state-of-the-art results on the ImageNet dataset [43, 44] in 2014. ConvNets have been applied to a wide range of pattern recognition tasks. For instance, Longpre et al. used ConvNets to detect facial features (eyes, lips, eyebrows, etc.) with key point locations [45]. Zeiler et al. used a ConvNet to classify and localize objects in the ImageNet dataset [23, 44]. Moreover, ConvNets have been used for image segmentation and creating super-resolutions [46, 47].

The core component of a ConvNet is a trainable kernel called a feature filter that is applied to adjacent local neurons of the input data, its receptive field, to extract local features from adjacent regions of the input data [42]. The output of a feature filter applied to its receptive fields is called a feature map [42, 43]. Fig. 2.5 shows an example with an input X_{1×9} with nine samples. Feature filter C^{(1)} is activated in different regions of X, each region acting as a feature filter receptive field, to produce a feature map. Thus, the feature filter extracts local features from various regions of the input (various receptive fields). Generally, a ConvNet that can extract a variety of features requires multiple feature filters applied over multiple regions. The max-pooling layer is also known as the zoom-out layer because it allows the model to look at a larger patch of data and choose the maximum value of its receptive field as its output. Thus, this type of layer provides computational efficiency, since it reduces the size of its receptive field and passes the size-reduced output to the next layer, and it also helps avoid over-fitting. In Fig. 2.5, a max-pooling layer follows the first convolutional layer.
Stride, defined as the distance between the centers of two adjacent receptive fields, refers to the relative offset applied to feature filters (or any other NN operation) [48]. A stride S determines by how many samples the center of a receptive field is horizontally and vertically translated. The stride operation reduces the overlap of receptive fields and the spatial dimensions [48]. It has been argued that this reduction can even take over the down-sampling role of pooling layers. Fig. 2.6 illustrates max-pooling, stride and receptive fields; the max-pooling function is displayed with empty circles, and stride and receptive fields on an input X_{1×12} are shown. In this example, the stride is S = (1 × 3) and the size of the receptive field is also (1 × 3). Thus, the centers of the receptive fields are (2, 5, 8, 11), where each center has two neighboring samples on each side. The max-pooling operation outputs the maximum value over every adjacent receptive field of size (1 × 3).
Fig. 2.6: Max-pooling, stride and receptive field example.
Convolutional layers can be cascaded, as shown in Fig. 2.5, where feature filter C^{(2)} is cascaded on top of the max-pooling layer of the prior convolutional layer. This feature allows the ConvNet to build hierarchical features layer by layer. For classification purposes, the last layers in ConvNet models are fully-connected layers, shown at the top of Fig. 2.5. Feature filter sizes, receptive field sizes and strides are additional hyperparameters needed to construct a ConvNet model.
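The sketch below implements one-dimensional convolution, stride and max-pooling for a twelve-sample input like the X_{1×12} example above; the filter values are hypothetical.

```python
import numpy as np

def conv1d(x, kernel, stride=1):
    # Slide the feature filter across the input; every window of `len(kernel)`
    # samples is one receptive field, and the outputs form the feature map.
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel)
                     for i in range(0, len(x) - k + 1, stride)])

def max_pool1d(x, size, stride):
    # Output the maximum value of each receptive field (the zoom-out layer).
    return np.array([x[i:i + size].max()
                     for i in range(0, len(x) - size + 1, stride)])

x = np.arange(12, dtype=float)        # an input X with twelve samples
f = np.array([1.0, 0.0, -1.0])        # hypothetical 1x3 feature filter
fmap = conv1d(x, f)                   # feature map of length 10
print(max_pool1d(fmap, size=3, stride=3))
```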
2.1.3 Long Short-Term Memory Neural Networks
RNNs have been applied to sequential data such as natural language processing, handwriting recognition, etc. Due to the ability to keep the model's internal state over time, RNNs can capture the temporal dependency in the data [30, 31]. Fig. 2.7 shows an RNN scheme and an unrolled RNN scheme [50]. The RNN scheme shows the weighted loop connection in an RNN model, and the unrolled RNN scheme shows that every timestamp depends on the states of the previous timestamps. The timestamp difference between τ and t varies across applications; however, 100 timestamps can be considered a long-term gap between the data [52, 53]. Traditional RNNs are able to capture only short-term dependency; they are unable to capture long-term dependency because of the problem of vanishing-exploding gradients [51], i.e., the derivative of the loss function with respect to the weights approaches zero or infinity after a short period of time. This issue makes it impossible to train the networks for long-term dependency. To address this issue, the LSTM RNN was developed by Hochreiter and Schmidhuber [53]. LSTM RNNs use trainable memory cells called LSTM cells instead of simple neurons. These memory cells have three trainable gates: the input, output and forget gates. The cells remember values over long-term periods, and the gates regulate the flow of information into and out of the cells. A large number of applications have performed better than their competitors using LSTM networks [54]. Details of the LSTM cells will be discussed in Chapter 4.

For instance, LSTM models have been used to capture the temporal dependency between the handwriting pixels [55]. Later on, Graves et al. used a similar approach for other sequence recognition tasks.
Generally speaking, there are two technical research directions in ECG delineation. While one focuses on the extraction of features which can best represent the structure of cardiac waves [12, 13, 56, 57], the other is to classify the obtained features into a specific set of classes, i.e., cardiac wave classes or heart-condition symptom classes [14, 15, 58]. An extensive literature review on ECG signal delineation can be found in [29]. Recently, researchers have applied DL methods to ECG processing as well [59, 60]. In this section, a background review on ECG analysis is presented. The first approach is to apply filters such as smoothing filters [12], first-order derivative filters [61, 62], second-order derivative filters [13], etc. The second approach is to transform the signal, e.g., with the Discrete Wavelet Transform (DWT) [57, 63–65], the Fourier Transform (FT) [56], etc. The third approach combines such filters with rule-based logic; one example uses low-pass differentiators to locate the QRS-complex onset and T-wave offset. Low-pass differentiators are first-order derivative filters with a cut-off frequency [12]. In that work, a cut-off frequency is defined to minimize the impact of noise in ECG signals. Based on the first-order derivative of the ECG signal and defined rule-based methods, the QRS-complex onset and T-wave offset are located.
Pan and Tompkins developed one of the well-known derivative-based methods. The idea is to differentiate the ECG signal and then identify ECG segments with higher variations in the slope of the signal. However, assigning a cardiac wave to the highest slope variation is not always possible, especially for abnormal ECGs. For abnormal ECG signals, it is not uncommon that either the P-wave or the T-wave may show a higher change in the slope than the QRS-complex; therefore, these algorithms may result in misdetections. Another line of derivative-based work locates the QRS interval by analyzing the change in the second-order derivative of the ECG signal and the amplitude of the ECG signal under a small window size. Transform-based methods have also been applied to delineate cardiac waves [56, 63, 64]. These methods can be comprehensive if various basis functions and frequency patterns are defined for a variety of cardiac wave formations.
Murthy et al. used the Discrete Fourier Transform (DFT) to obtain a filter bank for the delineation of cardiac waves [56]. In their work, a set of cardiac wave instances is transformed into distinct frequency bands through a filter bank, where the filter characteristics are determined from the time signal. Multiplication of the transformed signal with a complex sinusoidal function allows the use of a bank of low-pass filters for the delineation of cardiac waves [56]. Threshold-based delineation methods, in turn, are sensitive to the threshold that is defined for the cardiac waves: if there is a minor variation in cardiac wave formations, then the threshold should be adjusted. However, if there is a major change between heartbeats, the algorithm will not perform on the heartbeats with major variations; instead, it relies on the information from neighboring heartbeats.
Over the years, various rule-based and ML methods have been developed for ECG analysis, including NN, Random Forest, Support Vector Machine (SVM), Naive Bayes, HMM, linear discriminants and logistic regression. Most of these methods focus on localizing the peak point of the cardiac waves. However, there is also research on ECG classification in which the inputs are features of the P-wave, QRS-complex, T-wave, etc. and the outputs are four diagnostic categories.
Kheidorov et al. used an NN to classify ECG segments into cardiac wave categories. In their work, ECG features are obtained in two different ways: first, from local properties of the ECG data, and second, using the wavelet transform; the NN is then trained on the obtained ECG features. Rahman et al. used SVM and Random Forest to classify ECG signals into two categories, HCM and non-HCM ECGs. In their work, 36 temporal features and six spatial features are extracted from each of the 12 ECG leads; thus, in total, 504 features are extracted for the classification task [70]. Random Forest is an ensemble decision tree algorithm and is mainly used for classification problems [69, 70]. Further, SVM finds the optimal hyperplane between the categories of the data.
Warner et al. used logistic regression for detecting a heart condition called left ventricular hypertrophy, classifying the extracted features with a logistic function [71]. De Chazal et al. used linear discriminants to find cardiac complex intervals; a linear discriminant finds the mean and variance for each class of the data and classifies new examples according to the obtained statistics. Kaiser et al. used a rule-based method that detects heart conditions such as myocardial infarction and left ventricular hypertrophy by extracting parameter values and classifying the extracted parameter values with rule sets aligned with the symptoms; it therefore generates a set of rules for every symptom. A related approach is to adjust the parameters of the detection algorithm so that the detected peaks and wave boundaries are nearly the same as those manually annotated by cardiologists [67].
HMMs can model the temporal dependency between the features [11]. Hughes et al. segmented ECG cardiac waves by utilizing an HMM. In their research, three approaches were adopted for ECG segmentation. First, an HMM is trained on the raw ECG signal; the result of this experiment has the lowest performance of the three approaches. Second, the results are improved by training the HMM on wavelet-encoded ECG. In the third approach, the results improve further by utilizing a specific variant called the Hidden Semi-Markov Model (HSMM). Unlike the HMM, the HSMM governs the duration of staying in one hidden state [11]. From this work, two conclusions can be drawn. First, the automated ECG delineation field is still struggling to provide an E2E ECG delineation method using the raw ECG signal: the first experiment, utilizing raw ECG signals, has the lowest performance compared to the wavelet-encoded ECG experiments. Second, the HSMM outperforms the HMM; this observation shows that the HMM requires a policy for staying in a particular hidden state for the duration of a cardiac wave.
Traditional shallow models cannot adapt to the variety of patterns typically shown in ECG signals [30]. On the other hand, deep models learn hierarchical features directly from the data. With the success of DL in pattern matching and data analysis, DL methods have been applied to ECG pattern recognition as well. In terms of the ECG signal analysis field, Kiranyaz et al. developed the first DL-based method to analyze ECG signals [59]; the DL method was used to classify ECG signals into normal and abnormal ECG. Later work used DL methods to identify heart cardiac condition symptoms and could exceed the average cardiologist performance in both recall and precision [60]. Acharya et al. used DL methods to detect arrhythmias [73]. In these works, the focus was on extracting discriminative features for classification. Other research studied the feature extraction capability of these models for Paroxysmal Atrial Fibrillation (PAF), which is a life-threatening cardiac arrhythmia. That work shows that ConvNets are good feature extractors; however, the proposed E2E ConvNet model underperforms another proposed model that uses the ConvNet only as its feature extractor. In another work, the ECG signal is converted to its time-frequency domain, and the time-frequency representation is used as the input to a ConvNet model to classify it into AF or non-AF categories. ConvNets and LSTMs have also been combined to identify various types of heart arrhythmia patterns: one convolutional layer and three LSTM layers are cascaded on top of each other to classify segments of ECG into one of the heart arrhythmia patterns [76].
Mostayed et al. used a specific type of LSTM model to classify 12-lead ECG signals into nine classes of heart conditions. In their work, they used two hidden RNN layers and a fully-connected layer cascaded on top of the second hidden RNN layer. All of these works classify ECG signals into one of the categories of heart disease symptoms. However, there is another aspect of ECG signal analysis, which is the delineation of the cardiac wave structure within the ECG signal. While this topic is of high importance, only a few recent DL-based studies in the area of identifying cardiac waves have been published. Identifying cardiac waves under any heart-condition symptom, such as identifying the T-wave under various abnormalities, matters because, by locating cardiac waves, a cardiologist can relate the reading to a symptom of a heart condition [7].
2.3 Data
The dataset used in this dissertation is the QT Database (QTDB), provided by PhysioNet [22]. PhysioNet grew out of collaborations among groups at Harvard Medical School, Boston University and McGill University, all working together with the MIT group. In the 1970s, the group realized the usefulness of establishing shared collections of physiological recordings and published the MIT-BIH Arrhythmia Database in 1980, which soon became the standard reference collection of its type, used by over 500 academic, hospital and industry researchers and developers worldwide during the 1980s and 1990s. PhysioNet has become one of the main resources for research and education, offering the public free access to large collections of physiological data and related open-source software. PhysioNet also hosts an annual series of challenges on problems in cardiology and physiology [22].

QTDB contains ECG recordings with a broad variety of cardiac wave morphologies. An automated system annotated the cardiac waveforms, and experts made corrections when the automated system failed to perform the annotation [22]. This dataset provides researchers one of the best resources for a variety of research topics in ECG signal processing. The QTDB ECG records are chosen from multiple existing collections, including the MIT-BIH databases, the European Society of Cardiology (ESC) ST-T Database and additional ECG recordings collected at Boston's BIDMC. The additional records are added so that examples of sudden cardiac death are included [22]. Table 2.1 shows the included heart conditions from the various database sources. Mainly, this dissertation has adopted QTDB for three reasons. First, it has a large amount of ECG data. Second, it includes a variety of normal cardiac wave formations as well as abnormalities. Third, it is widely used, so it is easier to compare this work with other research.
Table 2.1: Heart conditions included from the various database sources.

    Database        Heart condition
    MIT-BIH         Arrhythmia
                    ST-segment abnormalities
                    Supraventricular ectopic beats
                    Long-term QT segment
                    Normal sinus rhythm
    ESC             ST-segment abnormalities
                    T-wave abnormalities
    BIDMC records   SCD
2.4 Evaluation Metrics

Consider a categorization algorithm that classifies instances into two categories: (1) category A and (2) category B. Its outputs fall into four states.

• True Positive (TP): These outputs are the correct positive predictions, i.e., if the actual class for an instance is A, the algorithm also classifies the instance into A; e.g., the correct category is the P-wave category and the algorithm's predicted category is the P-wave category for that instance.

• True Negative (TN): These outputs are the correct negative predictions, i.e., if the actual class of an instance is not A, then the algorithm does not classify it into A; e.g., the actual class is not P-wave and the algorithm does not predict P-wave.

• False Positive (FP): These outputs are the incorrect positive predictions, i.e., the actual class of an instance is not P-wave but the algorithm predicts the P-wave class for it.

• False Negative (FN): These outputs are the incorrect negative predictions, i.e., the instance belongs to A but the algorithm did not categorize it into category A; e.g., the instance belongs to the P-wave category but the categorizer predicts otherwise.
Fig. 2.8 illustrates these states. The red and green circles are the instances in the dataset. On the left side, the green circles represent hypothetical P-wave instances; on the right side, the red circles represent the hypothetical non-P-wave instances. The data instances within the inner circle are those that the categorizer classifies as the P-wave class, and the data instances outside the inner circle are those that the categorizer does not classify as the P-wave class.

Fig. 2.8: Categorization states.

Table 2.2 shows the categorization states. A good categorizer attains a high number of TP and TN; on the other hand, it should avoid FP and FN.

Table 2.2: Categorization states.

                          Predicted Class
                          A     B
    Actual Class    A     TP    FN
                    B     FP    TN

The evaluation of categorization algorithms relies on four metrics that combine the categorization states. These four metrics are accuracy, precision, recall and f1-score.
These metrics are used to measure the performance of the segmentation algorithms as well. The following defines the metrics and discusses their importance.

Accuracy is the ratio of correct predictions over all predictions:

    Accuracy = \frac{TP + TN}{TP + FP + FN + TN} .    (2.24)

Precision is the ratio of correct positive predictions over all positive predictions:

    Precision = \frac{TP}{TP + FP} .    (2.25)

High precision translates to low FP predictions, e.g., of all the predicted P-wave instances, most actually belong to the P-wave class. Recall is the ratio of correctly predicted positive observations to all observations in the actual class. The recall is expressed as

    Recall = \frac{TP}{TP + FN} .    (2.26)

High recall translates to low FN predictions, e.g., of all the instances in the P-wave class, most are retrieved by the algorithm. Finally, the f1-score is the harmonic mean of precision and recall; therefore, it takes false positives and false negatives into account simultaneously:

    F1\text{-}score = \frac{2 \times Recall \times Precision}{Recall + Precision} .    (2.27)
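The four metrics can be computed directly from the categorization states, as in the short sketch below; the counts are hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    # Eqs. 2.24-2.27 computed from the four categorization states
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * recall * precision / (recall + precision)
    return accuracy, precision, recall, f1

# hypothetical counts for a P-wave / non-P-wave categorizer
print(classification_metrics(tp=90, fp=10, fn=5, tn=95))
```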
These four criteria are used in Chapter 4 and Chapter 5. Chapter 3 uses different metrics because its output is not a categorization; metrics that can measure the performance of the localization approach are introduced in Chapter 3.
This dissertation research utilizes the concepts of each of the previous sections. There is still a lack of a raw-signal, E2E approach to ECG segmentation in the literature. Our first DL-based research in ECG analysis localized cardiac waves with ConvNets; a second study used the temporal attributes of ECG signals, employing local derivative-based features and LSTM networks for ECG segmentation [79]. However, this approach relies on local features, which motivates the E2E approach taken in this dissertation toward ECG segmentation.
Chapter 3

Cardiac Wave Localization Using a Convolutional Neural Network
Traditional cardiac wave localization approaches require in-depth medical knowledge about cardiac wave spatial formations. A ConvNet, in contrast, can learn progressively more complex features from ECG signals. As a result, it can identify a variety of cardiac wave formations. Furthermore, with labeled data and training, ConvNets are capable of tuning the feature filter parameters to capture the spatial patterns that exist in the training set. Therefore, by tuning the feature filter parameters, they can extract patterns that represent the labeled data well. This chapter studies capturing the hierarchical structure of ECG cardiac waves by localizing cardiac waves with a ConvNet model which takes one cardiac complex as an input and predicts the locations of the cardiac waves.
First, the ECG recordings are preprocessed to remove ECG wander drift baselines and to normalize the data. In addition, since the input to the proposed NN models is only one complete cardiac complex, the ECG recordings must be divided into separate individual cardiac complexes. Section 3.1 discusses the importance and methods of this preprocessing; unlike traditional analysis approaches, no cardiac wave location is assigned at this stage, and only the slope variations of the signal are used. Section 3.2 explains the cardiac complex dataset distribution for the training, validation and test sets. NN architectures and the training procedure are presented in Section 3.3. A test loss function to measure the performance of the models on the unseen data is introduced in Section 3.4. Section 3.5 presents performance analysis and comparisons with other research. Finally, Section 3.6 concludes the contribution of the research. Fig. 3.1 shows the overall process for the proposed method.
ECG’s wander drift baseline noises which compromise feature extraction meth-
for general ECG signal analysis. Since the model’s weight gradients are depen-
dent on the input values, unnormalized input data causes the gradients over
some input samples to become significantly larger than other samples. This
the loss function [82]. Thus, following the wander drift removal, the ECG sig-
63
nals are normalized since this makes NN models to converge faster. However,
One of the successful approaches to remove wander drift baseline from the
where E(WF) is wander free ECG, E( R) is the raw ECG from QTDB and M(.) op-
eration is the result of applying a median filter to a signal. The size of the median
filter equals half of the sampling frequency, f req. Since ECGs from QTDB ave
been sampled at 250Hz ( f req = 250Hz), the window size for the median filter is
E(WF) − Avg(E(WF) )
E(norm) = (3.2)
Std(E(WF) )
where Avg(.) is the average function, Std(.) is the standard deviation function
and E(norm) is the normalized ECG values. Therefore, even though the majority
of normalized ECG will be in the range of [−1, 1], there will be some samples
that exceed this range. However, because these samples are rare, it will not effect
on the training process. The regions with spikes have been excluded from the
dataset.
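A minimal sketch of Eqs. 3.1 and 3.2, assuming the baseline M(E^{(R)}) is subtracted from the raw signal, is given below; the synthetic input merely stands in for a QTDB recording.

```python
import numpy as np
from scipy.signal import medfilt

def preprocess_ecg(raw, freq=250):
    # Eq. 3.1: subtract the median-filtered baseline (window = freq / 2,
    # forced odd because medfilt requires an odd kernel size).
    win = int(freq / 2) | 1
    wander_free = raw - medfilt(raw, kernel_size=win)
    # Eq. 3.2: zero-mean, unit-variance normalization.
    return (wander_free - wander_free.mean()) / wander_free.std()

t = np.arange(2000) / 250.0                     # eight seconds at 250 Hz
raw = np.sin(2 * np.pi * 1.2 * t) + 0.5 * t     # toy beat plus linear drift
e_norm = preprocess_ecg(raw)
print(round(e_norm.mean(), 6), round(e_norm.std(), 6))  # ~0.0 and ~1.0
```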
After wander drift baseline removal and normalization, the ECG signals are ready for cardiac complex extraction. Following the methods introduced in [13], [61] and [62], a derivative-based method has been developed to extract cardiac complexes. Unlike those methods, this work extracts complete cardiac complexes instead of individual cardiac waves, i.e., while other research uses slope variations of the cardiac wave to detect the cardiac waves, this research utilizes slope variations to identify the regions with electrical impulse activity. Thus, this method finds cardiac complexes by analyzing the high and low slope variations of the ECG signal, as shown in Fig. 3.2. The extracted cardiac complexes and their associated cardiac wave location annotations from the QTDB annotations form the input-label pairs.

Fig. 3.2: (a) normalized ECG segment, (b) local area under the magnified second-order derivative signal.
High variations in the slope of the signal (changes in the direction of the signal) imply a heart-muscle activity such as a QRS-complex [62]. The lowest slope-variation activity between two high slope-variation activity locations represents the end of one cardiac complex and the beginning of another one [26, 62]. ECG slope-variation activities are measured as described below.

This algorithm relies on finding at least two slope-variation maxima associated with cardiac complexes; therefore, the input to the algorithm should include more than one cardiac complex. Since the typical duration of a cardiac complex is about 0.83 seconds, eight seconds of ECG signal recording (i.e., 2,000 samples of ECG signal sampled at 250 Hz) is given as the input to the algorithm. As a result, there will be more than one cardiac complex (approximately nine to ten) in every input.
The following operations are performed on the normalized ECG signal E^{(norm)} to obtain E^{(γ)}. First, the signal is smoothed:

    E^{(SM)} = E^{(norm)} ⊗ ζ_1    (3.3)

where ⊗ and |.| are the convolution operation and the absolute value operation, respectively. Absolute values are used since they can capture crossings of the baseline in the ECG signal [61]. The first-order derivative of the absolute smoothed signal is

    E^{(FD)} = |E^{(SM)}| ⊗ ζ_2    (3.4)

where ζ_2 = [-1, -1, -1, 0, 1, 1, 1] is a derivative kernel. Followed by another derivative kernel, the second-order derivative of the absolute ECG signal can be expressed as

    E^{(SD)} = E^{(FD)} ⊗ ζ'_2 .    (3.5)

Thus, the local area under the magnified second-order derivative curve is obtained by [61]

    E_i^{(γ)} = \sum_{j=i-4}^{i+4} (E_j^{(SD)})^6    (3.6)

where E_j^{(SD)} is the j-th sample of E^{(SD)} and E^{(γ)} is the local area under the magnified (power of six) second-order derivative of the E^{(norm)} curve (the second-order derivative is magnified to accentuate the higher absolute amplitudes [26]); i.e., E^{(γ)} sums the values of the magnified second-order derivative under windows of size nine (four points before, the center point and four points after), which corresponds to 36 milliseconds of ECG signal [61], and E_i^{(γ)} is the i-th sample of E^{(γ)}. The local maxima in E^{(γ)} indicate high slope variations in the ECG signal.
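The slope-variation measure of Eqs. 3.3–3.6 can be sketched as below. The derivative kernel ζ_2 is taken from the text; the smoothing kernel ζ_1 is assumed to be a short moving average, and the same derivative kernel stands in for ζ'_2, so those constants are illustrative rather than the dissertation's exact choices.

```python
import numpy as np

def slope_variation(e_norm):
    zeta1 = np.ones(5) / 5.0                             # assumed smoothing kernel
    zeta2 = np.array([-1, -1, -1, 0, 1, 1, 1], float)    # derivative kernel (text)
    e_sm = np.convolve(e_norm, zeta1, mode="same")       # Eq. 3.3
    e_fd = np.convolve(np.abs(e_sm), zeta2, mode="same") # Eq. 3.4
    e_sd = np.convolve(e_fd, zeta2, mode="same")         # Eq. 3.5 (zeta2' assumed = zeta2)
    magnified = e_sd ** 6                                # accentuate large amplitudes
    window = np.ones(9)                                  # 4 + center + 4 samples (36 ms)
    return np.convolve(magnified, window, mode="same")   # Eq. 3.6

x = np.sin(np.linspace(0.0, 8.0 * np.pi, 2000))          # toy periodic signal
e_gamma = slope_variation(x)
print(int(e_gamma.argmax()))                             # location of strongest activity
```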
A set called Φ contains all the high slope-variation locations, which indicate the locations of cardiac complex activities. Initially, Φ has only one element, the location of the maximum value in E^{(γ)}; thus, Φ = {Idx(Max(E^{(γ)}))}, where Max(.) and Idx(.) are the maximum function and indexing function, respectively. After the initialization, other local-maximum locations are added to this set in descending order of local-maximum value. Since the Φ set is initialized by the location of the maximum value of E^{(γ)} and the rest of the local-maximum locations are added in descending order of value, any potential local maximum that has not yet been considered for the set has a lower value than the ones already in Φ. This ordering guarantees that no potential cardiac complex activity is missed. Moreover, any of the cardiac waves in a cardiac complex can have high variations in the slope; thus, a cardiac complex can produce more than one local maximum close to another. As a result, to make sure not to mistake nearby local maxima for two different cardiac complexes, any potential candidate for the Φ set must keep a margin of 300 samples from the existing elements. This margin is chosen because there should be at least the length of a typical cardiac complex between two activity locations. If a location is a local maximum in E^{(γ)} and it has the margin from all Φ elements, it is added to the Φ set. Each sample in Φ represents the location of a high slope variation in a local neighborhood of the ECG, which indicates that a cardiac complex activity is within that region. Fig. 3.2 (a) shows a segment of E^{(norm)} and Fig. 3.2 (b) shows E^{(γ)} applied to the segment of E^{(norm)}. The Φ set is marked with red-filled circles in Fig. 3.2, and red-filled stars mark the beginnings and ends of the cardiac complexes.
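A sketch of the Φ construction follows. For simplicity, every sample is treated as a candidate in descending order of E^{(γ)} value; the 300-sample margin automatically rejects the neighbors of an already-accepted maximum, which approximates the local-maxima bookkeeping described above.

```python
import numpy as np

def build_phi(e_gamma, margin=300):
    # Visit locations from the highest E^(gamma) value downward and keep a
    # location only if it is at least `margin` samples from every kept one.
    phi = []
    for idx in np.argsort(e_gamma)[::-1]:
        if all(abs(int(idx) - kept) >= margin for kept in phi):
            phi.append(int(idx))
    return sorted(phi)

# e.g., phi = build_phi(slope_variation(e_norm)) using the sketches above
```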
The next step after identifying the cardiac complex activity is finding the starting and ending points of cardiac complexes. The lowest value of E(γ) between every two consecutive Φ elements has the lowest slope variation between these two maxima, and it is considered the boundary of a cardiac complex, marked with red-filled stars in Fig. 3.2 (b). A cardiac complex occurs between two such points of lowest slope variation, and the points between them with high slope variation are part of a heart-muscle activity. A regular heart-muscle resting period is 200 milliseconds. To make sure that the beginning and the ending of a cardiac complex are chosen at the end of the heart-muscle resting period, a small margin of 20 milliseconds, which is less than the regular resting period, is kept between the low slope-variation points and the high slope-variation points. In this way, the boundaries of the cardiac complexes are identified. The sketch below illustrates these steps.
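The following NumPy sketch illustrates the localization procedure under stated assumptions: ζ1 is taken to be a simple moving-average kernel (its exact definition is not repeated in this section), and no stopping threshold is applied when filling Φ, so every sufficiently spaced local maximum is admitted. It is an illustrative sketch, not the dissertation's implementation.

# Illustrative sketch of cardiac-complex localization (Eqs. 3.3-3.6 and the
# greedy construction of the Phi set). Assumption: zeta1 is a moving-average
# kernel; no amplitude threshold is used to stop adding maxima.
import numpy as np

MARGIN = 300                                       # 1.2 s at 250 Hz
zeta1 = np.ones(7) / 7.0                           # assumed smoothing kernel
zeta2 = np.array([-1, -1, -1, 0, 1, 1, 1], float)  # derivative kernel

def gamma_signal(e_norm):
    """E^(gamma): local area under the magnified second-order derivative."""
    e_sm = np.convolve(e_norm, zeta1, mode="same")           # Eq. 3.3
    e_fd = np.abs(np.convolve(e_sm, zeta2, mode="same"))     # Eq. 3.4
    e_sd = np.abs(np.convolve(e_fd, zeta2, mode="same"))     # Eq. 3.5
    window = np.ones(9)                                      # j = i-4 .. i+4
    return np.convolve(e_sd ** 6, window, mode="same")       # Eq. 3.6

def locate_complexes(e_norm):
    """Greedily add local maxima of E^(gamma), largest first, keeping a
    300-sample margin between accepted locations."""
    e_gamma = gamma_signal(e_norm)
    interior = e_gamma[1:-1]
    is_max = (interior > e_gamma[:-2]) & (interior >= e_gamma[2:])
    candidates = np.where(is_max)[0] + 1                     # local maxima
    order = candidates[np.argsort(e_gamma[candidates])[::-1]]
    phi = []
    for idx in order:
        if all(abs(int(idx) - p) >= MARGIN for p in phi):
            phi.append(int(idx))
    return sorted(phi)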
Using this procedure, cardiac complexes are extracted from the QTDB ECG recordings. Cardiac complexes are extracted from two leads of fifteen-minute ECGs, except for two instances where there were no sampling data and the signals were blank, and for regions with abnormal spikes. Since QTDB consists of different datasets with different annotations, the cardiac wave location annotations in some cases include the beginning and end of the cardiac waves instead of the regular one point per cardiac wave. Therefore, in all cases the mid-point of the wave is taken to represent the location of the cardiac wave, or the point of maximum absolute amplitude if the wave duration was not available.
Since the extracted cardiac complexes have different lengths and fixed-length inputs are required, the complexes extracted from E(norm) are padded to a length of 300 samples, which represents 1.2 seconds in time. Cardiac complexes shorter than 300 samples are padded with repetitions of the signal's last value. If the network always finds the padding at the same locations, it learns the padding rather than the waves, i.e., it becomes overfitted for those regions. To prevent over-fitting, the padding locations are randomly selected at the beginning, the ending or both sides of the cardiac complexes, as sketched below. Thus, the network's output will not have any preconception about where the waves can appear.
Upon the completion of preprocessing, training, validation and test sets are
ready to be fed to the neural networks. Table 3.1 shows the distribution of ex-
tracted cardiac complexes in the training, validation and test sets. The cardiac
complexes in these three sets are mutually exclusive, indicating there is no iden-
tical cardiac complex from one recording to another. Three locations, the P-wave, the QRS-complex and the T-wave, serve as the labels for each cardiac complex.
TABLE 3.1: Dataset distribution.
Three different DL-based NN models including an MLP network and two Con-
vNets have been developed for cardiac wave localization. Their performances
have been investigated and compared to similar research. In the proposed NNs,
samples drawn from a Gaussian distribution with mean zero and a standard deviation of 0.1 initialize the networks' weights, and the ReLU function has been adopted as the activation function,

a = max(0, z)    (3.7)

where z is the input to the ReLU and a is the output. The ReLU function outputs zero when the input is less than zero and outputs the value of the input otherwise. The networks are trained with the Root Mean Square Error (RMSE) loss,

R = sqrt( (1/K) ∑_{k=1}^{K} (y_k − o_k)^2 )    (3.8)
where K is the number of outputs, ok is the predicted cardiac wave location
of the kth neuron and yk is the kth element in the label vector Y. For example,
Y = (70, 120, 160) means the P-wave, QRS-complex and T-wave are at loca-
tions 70, 120 and 160 of the input respectively. Thus, RMSE measures the dis-
tance between the network predictions and labels in terms of sample location
differences.
In our research, the Adam optimization algorithm has been adopted for training [83]. Since it has the advantages offered by both the Adaptive Gradient algorithm (AdaGrad) and Root Mean Square Propagation (RMSProp), it maintains a learning rate for each parameter in the network separately; it combines the average of the first moments of the gradients and the average of the second-order moments of the gradients to achieve a better learning rate for each parameter [83].
In our experiments, three different learning rates, 10^−3, 10^−4 and 10^−5, are annealed to 10^−7 through 500 epochs using step-decay annealing, and the best result through the 500 epochs is reported. These three different learning rates enable the model to perform a grid search for the best learning rate; a fixed batch size is used across these experiments. A sketch of such a schedule is given below.
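The following Keras sketch shows one way to realize this schedule. Only the start rates (10^−3, 10^−4, 10^−5), the floor of 10^−7 and the 500-epoch budget come from the text; the decay factor and step length used here are assumptions.

# Sketch of the learning-rate grid with step-decay annealing. The decay
# factor and step length are assumptions; only the start rates, the floor
# and the epoch budget are given in the text.
import tensorflow as tf

def step_decay(lr0, floor=1e-7, drop=0.5, epochs_per_drop=50):
    def schedule(epoch):
        return max(lr0 * drop ** (epoch // epochs_per_drop), floor)
    return schedule

for lr0 in (1e-3, 1e-4, 1e-5):                      # grid over start rates
    scheduler = tf.keras.callbacks.LearningRateScheduler(step_decay(lr0))
    # model.fit(..., epochs=500, callbacks=[scheduler])  # keep the best epoch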
The first proposed NN model is an MLP model. This MLP model establishes a baseline to determine whether ConvNets can perform better than fully-connected NNs and whether hierarchical feature extraction possesses advantages over the MLP model. The MLP model architecture is as follows. The first layer is the input layer with 300 input neurons for
an ECG segment with 300 samples. Inspired by the fully-connected layers in [42]
and monitoring the model’s convergence, the next layer is a fully-connected hid-
den layer with 150 neurons utilizing a ReLU activation function. Fig. 3.3 shows
the proposed MLP model architecture, and Table 3.2 describes the architecture in detail. To reduce over-fitting during training, a dropout layer randomly selects some of the weights and prevents them from getting updated by the learning algorithm. This causes the model to update the weights independently of the effect of the other weights, i.e., it discourages co-adaptation between neurons [84]. The hyperparameters for the ConvNet models are inspired by [23], except for the learning rates and the size of the fully-connected layers. The learning rate has been selected by grid searching for the best learning rate, and the size of the fully-connected layer is inspired by the MLP model above.
Table 3.3 describes the proposed ConvNet with no dropout layer. The input to the network is a 1-D cardiac complex signal with 300 samples. All the convolutional layers have (1 × 5) feature filters with a stride of (1 × 3), and all the max-pooling layers have receptive fields of (1 × 2). The first and second convolutional layers include 16 and 32 feature filters, respectively. The first fully-connected layer has 150 neurons. The second fully-connected layer, the output layer of size three, indicates the three cardiac wave positions within a cardiac complex. Fig. 3.4 shows the ConvNet model architecture.
The output of a neuron in a convolutional layer can be expressed as

a^{(l)}_{j,v,c} = g( ∑_{i=0}^{w} a^{(l−1)}_{c+i−⌊w/2⌋} × θ^{(l)}_{i,j,v} + b^{(l)}_{j,v} )    (3.9)

where c is the center of the receptive field position and w is the length of the receptive field. θ^{(l)}_{i,j,v} is the weighted connection between receptive field location i and neuron j for feature filter v, a^{(l−1)}_{c+i−⌊w/2⌋} is the input from layer (l − 1) at location c + i − ⌊w/2⌋, and b^{(l)}_{j,v} is the bias for neuron j. Here, w = 5 and c ∈ {1, ..., 300} for the first layer with stride (1 × 3), and v ∈ {1, 2, ..., 16} since there are 16 feature filters. A max-pooling layer takes its input and selects the maximum value. The max-pooling function is described as

a^{(l)}_c = max( a^{(l−1)}_{c−⌊w/2⌋ : c+⌊w/2⌋} )    (3.10)

where : is the slicing operator (the slicing operator selects a range from a vector) and a^{(l)}_c is the result of the max-pooling operation on the receptive field centered at position c in layer l − 1.
The same operations are performed for the second convolutional layer and its max-pooling layer. The output is then flattened and the redundant dimensions are eliminated, turning it into a 288-element feature vector. The 288-element feature vector is followed by a fully-connected layer with 150 neurons. The next layer is the output layer, which is the last fully-connected layer.
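A hedged Keras sketch of this architecture is given below. The padding modes are assumptions (the text does not specify them), chosen so that the flattened vector has the 288 elements reported above; the dropout variant described next simply adds Dropout(0.5) after each pooling layer.

# Hedged Keras sketch of the ConvNet without dropout (later named ECGNet).
# Padding modes are assumptions that reproduce the 288-element flattened
# feature vector reported in the text.
import tensorflow as tf
from tensorflow.keras import layers

def build_ecgnet():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(300, 1)),                        # cardiac complex
        layers.Conv1D(16, 5, strides=3, padding="same", activation="relu"),
        layers.MaxPooling1D(2),                                # -> (50, 16)
        layers.Conv1D(32, 5, strides=3, padding="same", activation="relu"),
        layers.MaxPooling1D(2, padding="same"),                # -> (9, 32) = 288
        layers.Flatten(),
        layers.Dense(150, activation="relu"),
        layers.Dense(3),                                       # P, QRS, T locations
    ])
    rmse = lambda y, o: tf.sqrt(tf.reduce_mean(tf.square(y - o), axis=-1))
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss=rmse)  # Eq. 3.8
    return model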
The next ConvNet architecture for cardiac complexes is the same as the pre-
vious one except that a dropout layer with dropout rate hyperparameter of 0.5
is added after each max-pooling layer, as illustrated in Table 3.4. A dropout rate of 0.5 is a commonly used value [84].
Fig. 3.5 shows the proposed ConvNet with dropout layers model architec-
ture.
In this chapter, the focus is not on finding cardiac fiducial points such as the P-peak, QRS-on, QRS-off and T-peak points, but on identifying cardiac waves. Therefore, since the annotations are not fiducial cardiac points, the middle point of each cardiac wave is annotated in the data preparation step. However, finding fiducial points could be approached in a similar manner.
Fig. 3.5: The proposed ConvNet model with dropout layers archi-
tecture.
Whether the algorithm identified the kth cardiac wave of the jth test instance, β^j_k, is expressed as

β^j_k = 1 if |o^j_k − y^j_k| ≤ 120 milliseconds, and 0 otherwise,  k ∈ {1, 2, 3} and j ∈ {1, ..., N}    (3.11)
where y^j_k is the label for the kth cardiac wave of the jth test instance in the test set, o^j_k is the network output for the kth cardiac wave of the jth test instance, and N is the number of instances in the test set. If β^j_k is equal to one, the algorithm identified the cardiac wave; if it is zero, the algorithm missed the cardiac wave. In other words, the distance between the model output and the annotation should be less than 120 milliseconds for the cardiac wave to be considered identified. To quantify performance, two criteria are used: the Sensitivity (SE) of finding the cardiac waves,

SE_k = ( ∑_{j=1}^{N} β^j_k / N ) × 100    (3.12)

and the overall accuracy (Acc), computed analogously over all cardiac waves and all test instances. An algorithm identifies a cardiac wave if it outputs any sample within the cardiac wave duration; since a correct prediction can fall anywhere within the cardiac wave duration, the test loss function should be able to tolerate such displacements.
Another aspect of the performance is the prediction error relative to the la-
bel. The RMSE of the predicted values by the network is measured by Eq. 3.8.
Since a cardiac wave spans a duration rather than a single point, a displacement tolerance is required. A new output prediction, o′_k, introduces displacement tolerance based on a vicinity constant e:

d_k = o_k − y_k,  k ∈ {1, 2, 3}

d′_k = e if d_k ≥ e,  −e if d_k ≤ −e,  d_k otherwise    (3.14)

o′_k = o_k − d′_k

where o_k is the NN prediction for the kth cardiac wave location, y_k is the annotated mark in the QTDB and o′_k is the network's new predicted position. e is a constant indicating the vicinity that the loss function can tolerate for every wave. The vicinity tolerance range, e, has been tried for 0, 5 and 10 samples, which correspond to 0, 20 and 40 milliseconds of tolerance. The test loss function then measures the RMSE between the adjusted network prediction O′ and its label Y. The interpretation of Eq. 3.14 is that if the predicted value, o_k, is within the range e of the target, the new predicted value is assigned the label value y_k and, consequently, the error is zero. If the predicted value, o_k, is outside the range e from the target, y_k, the output prediction is brought closer to the target by the value e. A NumPy sketch of this computation follows.
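The piecewise definition of d′_k is exactly a clipping operation, so a short NumPy sketch suffices:

# NumPy sketch of the vicinity-tolerant evaluation of Eq. 3.14: np.clip
# realizes the piecewise definition of d'_k exactly.
import numpy as np

def tolerant_rmse(o, y, e):
    d = o - y                         # d_k = o_k - y_k
    d_prime = np.clip(d, -e, e)       # e / -e / d_k, per Eq. 3.14
    o_prime = o - d_prime             # o'_k equals y_k when |d_k| <= e
    return np.sqrt(np.mean((y - o_prime) ** 2))   # RMSE of Eq. 3.8

For example, tolerant_rmse(np.array([72., 121., 168.]), np.array([70., 120., 160.]), 5) penalizes only the third prediction, and only by the amount exceeding the five-sample vicinity.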
3.5 Results
A total of 167,301 complexes are extracted by our cardiac wave extraction algorithm from QTDB, and they are divided into three different sets with no sample repetition: 60% are used for training, and 10% and 30% are used for validation and test, respectively. Table 3.5 shows the results of the proposed NN models.
TABLE 3.5: Result for every architecture and their related learning
rates.
The best result for the baseline MLP model is achieved with the learning rate 10^−3, with an error of 12.50 samples (i.e., 50 milliseconds) on the test set. This error indicates that on average the predicted output for every cardiac wave is 50 milliseconds off from the label.
The best result for the ConvNet model using dropout layers is achieved with the learning rate 10^−3, with an error of 8.92 samples (i.e., 35 milliseconds) on the test set. This error indicates the predicted output for every cardiac wave is 35 milliseconds off from the label.
The ConvNet without dropout layers with a learning rate of 10^−3 is the best architecture in our experiment. As shown in Table 3.5, the values of RMSE for training, validation and testing are 5.76, 7.04 and 6.86, respectively. The test set error of 6.86 samples (27 milliseconds) means that on average the predicted output for every cardiac wave is 27 milliseconds off from the label. We named this NN model ECGNet. Fig. 3.6 shows the result of ECGNet cardiac wave identification. The ECGNet RMSE curves through 500 epochs for the training and validation sets are shown in Fig. 3.7.
Fig. 3.7: Training and validation error curves for the best result,
ConvNet without dropout layer and learning rate of 10−3 .
The curves show that training is stable. Considering the input has 300 samples, if the locations of the cardiac waves were predicted randomly, the average RMSE would be 150 samples. Therefore, the average test RMSE, Avg(R_{1:N}), i.e., the average of the RMSE for cardiac wave predictions over all test instances, is far below this chance level.
Table 3.6 shows the results of the ECGNet with three different vicinity tolerance ranges (0, 5 and 10 samples), using the test loss function introduced in Eq. 3.14. When e equals zero, the computation has no effect and the prediction is left unchanged; the ECGNet Acc is then 95.43%. With a larger vicinity tolerance, the accuracy increases further, as shown in Table 3.6.
TABLE 3.6: ECGNet with vicinity tolerance.
Table 3.7 shows the performance of the ECGNet for both individual and all cardiac waves. The first row reports the detection rates for the individual waves, including the QRS-complexes, and for all waves; overall, ECGNet identified 99.19% of the test set cardiac waves. The second row lists the percentage of undetected waves. The highest missed-detection rate belongs to the T-wave, showing that it is more difficult for ECGNet to find T-waves than the other waves. The third row is the mean error of the prediction output relative to the annotation in milliseconds. The fourth row is the standard deviation of the error, and the fifth row summarizes the RMSE of the predictions.
To compare the ECGNet results with other similar research, it is worth noting that other research has focused on finding fiducial cardiac points such as cardiac wave peak points and the beginning and ending of a wave. Such methods, like ours, adopt a displacement tolerance for these points. Therefore, it is logical to pick the maximum detection rate on a particular cardiac wave to compare with the ECGNet result. For example, if a study reports detection rates for T-on, T-peak and T-off, the highest detection rate is picked, because if a method found a particular point in a wave, it is sound to claim that the method found the wave. Table 3.8 compares our cardiac wave detection rates (sensitivity) to the other methods. A more detailed report can be found in [29]. ECGNet shows competitive performance.
3.6 Conclusion
This chapter set out to study ConvNet models for cardiac wave localization and to explore a novel localization method. The results, as shown in Table 3.8, demonstrate that the proposed ConvNet models are capable of tuning their feature filter parameters to extract features that are the best representatives of the spatial complex features necessary for detecting the positions of various ECG cardiac waves.
However, FNNs have limitations for ECG cardiac wave localization. Since FNNs cannot keep the internal state of the network, the temporal relationship between the cardiac waves and cardiac complexes has not been addressed, and the DL model does not focus on capturing the relationship between the cardiac waves. This issue makes the approach depend on a separate preprocessing stage (i.e., finding cardiac complexes), which is one of the areas that could be improved. ConvNets are powerful at hierarchical structure extraction; however, they are not capable of capturing the temporal dependencies in ECG signals, which motivates the sequence models studied in the next chapter.
Chapter 4

ECG Segmentation Using a Long Short-Term Memory Recurrent Neural Network
Cardiologists identify cardiac waves in relation to each other. For example, following a P-wave a QRS-complex is expected, and each individual wave formation also affects other waves. Due to this coherent temporal dependency in ECG signals, NN models such as RNNs can be used to capture it. Basically, an RNN can pass its hidden state from one timestamp to the next and thereby learn time series attributes [86].

Short-term temporal dependency alone may not capture all the temporal attributes of the ECG signal. Unlike the traditional RNN, the LSTM RNN is capable of capturing both long-term and short-term dependencies in the data, which makes it suitable for ECG segmentation [55]. In this chapter, an LSTM RNN model is developed to segment ECG cardiac waves. The LSTM RNN model is designed for sample-level classification. The novelty of this work is to use LSTM RNNs to classify each sample of the ECG signal into one of four categories, namely, the P-wave, the QRS-complex, the T-wave and neutral. Thus, as a result of the classification of every ECG sample, ECG segmentation can be achieved. Results show that, given ECG signals and a few local features, the proposed model segments cardiac waves accurately.
This chapter is organized as follows. Section 4.1 describes the local features
extracted from ECG signals to train the LSTM RNN model. Section 4.2 intro-
duces the data sets including training, validation and test sets. Section 4.3 in-
troduces the Bidirectional Long Short-Term Memory (BLSTM) RNN. Section 4.4
proposes the novel architecture of ECG-SegNet and its application for ECG seg-
mentation. Section 4.5 presents the training experiment and the convergence analysis of ECG-SegNet. The results are demonstrated in Section 4.6; they show that the proposed model outperforms sequence learners such as HMM for the same task of ECG cardiac wave segmentation. Finally, Section 4.7 concludes the contributions of the research. Fig. 4.1 shows the overall process for
the proposed cardiac wave segmentation approach.
Similar to the previous research, baseline wander drift removal and data normalization have been applied to the ECG recordings. Unlike the data prepared for the previous research, the ECG recordings are divided into 500-sample segments. With a sampling frequency of 250 Hz, every 2-second ECG segment includes more than one cardiac complex; the inputs contain at least one complete cardiac complex and a partial cardiac complex. As a result, the model trains on both complete and partial cardiac complexes; thus, the model can find temporal relationships between the waves, and partial cardiac complexes can be analyzed as well. In total, 93,490 500-sample segments have been extracted.
In this research, in addition to the E(norm) samples, three local features are extracted for each sample by using different filtering kernels: a local average and first- and second-order derivatives. The smoothed signal, E(SM)′, is obtained by convolving E(norm) with a smoothing kernel, and the derivatives E(FD)′ and E(SD)′ are obtained with derivative kernels, where E(SD)′ is the second-order derivative of E(norm) [61]. Therefore, every input timestamp is represented by a feature vector with four elements. Since every input timestamp has four features and the ECG segment sequence length is T = 500, the input to the LSTM RNN model is the matrix X_{4×T} = (x_1, x_2, ..., x_T). The label matrix for the ECG segment is Y_T = (y_1, y_2, ..., y_T), where y_t is the label for timestamp t corresponding to the input timestamp x_t. Every x_t belongs to one of four categories: neutral (category one), P-wave (category two), QRS-complex (category three) and T-wave (category four). The labels are one-hot encoded; for example, if a sample belongs to the P-wave (second category), then the label vector is expressed as y_t = (0, 1, 0, 0). A sketch of this data preparation follows.
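The following sketch assembles one input-label pair. The smoothing and derivative kernels follow Chapter 3 (the smoothing kernel is again an assumption), and waves is a hypothetical per-sample integer annotation (0 = neutral, 1 = P-wave, 2 = QRS-complex, 3 = T-wave) derived from the QTDB marks.

# Sketch of building one input-label pair: X of shape 4 x T and one-hot
# Y of shape T x 4. `waves` is a hypothetical integer annotation array.
import numpy as np

zeta1 = np.ones(7) / 7.0                           # assumed smoothing kernel
zeta2 = np.array([-1, -1, -1, 0, 1, 1, 1], float)  # derivative kernel

def make_example(e_norm, waves, T=500):
    e_sm = np.convolve(e_norm, zeta1, mode="same")   # smoothed E^(norm)
    e_fd = np.convolve(e_sm, zeta2, mode="same")     # first-order derivative
    e_sd = np.convolve(e_fd, zeta2, mode="same")     # second-order derivative
    X = np.stack([e_norm, e_sm, e_fd, e_sd])[:, :T]  # 4 x T feature matrix
    Y = np.eye(4)[np.asarray(waves)[:T]]             # T x 4 one-hot labels
    return X, Y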
Similar to the previous research, three different sets including training, valida-
tion and test sets have been prepared from the extracted ECG segments. These
three sets are mutually exclusive, indicating there are no identical segments from
one recording to another. In addition, the sets are patient independent, i.e., there
are no ECG segments from one patient in two different sets. Table 4.1 lists the distribution of the three sets.
4.3 Bidirectional Long Short-Term Memory Recurrent Neural Networks

In an RNN, every timestamp produces a hidden activation that describes the raw input at that timestamp. Thus, in a conventional RNN with one hidden layer, given the input matrix X = (x_1, x_2, ..., x_T), the hidden layer activation matrix sequence a = (a_1, a_2, ..., a_T) and the output matrix sequence O = (o_1, o_2, ..., o_T) are computed by iterating

a_t = G( θ_{x,h} x_t + θ_{h,h} a_{t−1} + b_h )    (4.4)

and

o_t = θ_{h,o} a_t + b_o    (4.5)
where θ_{x,h} denotes the weighted connection matrix between the input vector and the hidden neurons, θ_{h,h} denotes the recurrent (loop) weight matrix between hidden neurons, b_h and b_o are bias vectors, and G is the activation function. By cascading RNN hidden layers on top of each other and treating the output of one hidden sequence as the input of the cascaded hidden layer, deep RNNs can be achieved [54]. The same operations, defined in Eq. 4.4 and
Eq. 4.5, can be performed for the cascaded hidden layers as well. Fig. 4.2 shows
a traditional deep RNN model example. The first layer is the input layer X and
it is followed by two hidden layers, a^(1) and a^(2), and an output layer O. As shown, each hidden layer in the traditional RNN receives the output of the same hidden layer from the previous timestamp and passes its output both to the layer above and to its own next timestamp.
Another limitation of conventional RNNs is that only prior data is used. However, in many cases future data is available and can be used as an informational source. This motivates the Bidirectional RNN (BRNN) [87], which uses both directions of the data, prior and future samples, in two separate hidden layers whose activations are passed to the layer above (another hidden layer or an output layer) [54]. Using the input x_t at time t, the
forward hidden layer activation vector, →a_t, and the backward hidden layer activation vector, ←a_t, are computed as

→a_t = G( θ_{x,→h} x_t + θ_{→h,→h} →a_{t−1} + b_{→h} )    (4.6)

and

←a_t = G( θ_{x,←h} x_t + θ_{←h,←h} ←a_{t+1} + b_{←h} )    (4.7)

where →h refers to the forward layer neurons and ←h refers to the backward layer neurons. The activations of the two hidden layers are combined to form the output layer [54],

o_t = θ_{→h,k} →a_t + θ_{←h,k} ←a_t + b_k    (4.8)

where k refers to the output neurons. Fig. 4.3 shows an example of a BRNN with only one hidden layer, with its backward and forward hidden layers.
Unlike conventional RNNs, LSTM RNNs are capable of storing long-term and short-term information, and they are resistant to noise (input fluctuations that are random and irrelevant to the predictions) [51]. Thus, this extension of the RNN is ideal for learning from ECG sequences.
Fig. 4.3: Bidirectional RNN layer.
An LSTM RNN uses trainable memory cells called LSTM cells. As a reminder, these memory cells have three trainable gates, including input, output and forget gates, and the gates have the ability to add, remove and control the flow of information [30]. The gates, the cell state (the stored state of the LSTM cell), the cell activation and their interactions are expressed as follows [88]. The forget gate is

F_t = σ( θ_{x,F} x_t + θ_{h,F} a_{t−1} + b_F )    (4.9)

where σ is the sigmoid function, θ_{x,F} is the weighted connection vector between the forget gate and the input layer, θ_{h,F} is the weighted con-
nection vector between the LSTM cell activation and the forget gate, at−1 is the
activation vector of the LSTM cell of the previous timestamp and b F is the forget
gate bias vector. Briefly, the forget gate takes the current input and the activation vec-
tor of the previous timestamp and decides whether the previous timestamp cell state should be kept or forgotten. The input gate is expressed as [88]

I_t = σ( θ_{x,I} x_t + θ_{h,I} a_{t−1} + b_I )    (4.10)

where θ_{h,I} is the weighted connection vector between the input gate and the
LSTM cell activation, θx,I is the weighted connection between the input gate
and the input layer and b I is the input gate bias vector. Briefly, the input gate
takes the input and decides which parts of the new input's information are
good candidates to be stored in the LSTM cell state. The LSTM cell state is expressed as [88]

C̄_t = F_t ⊙ C̄_{t−1} + I_t ⊙ Tanh( θ_{x,C̄} x_t + θ_{h,C̄} a_{t−1} + b_C̄ )    (4.11)

where ⊙ is the elementwise product, θ_{x,C̄} is the weighted connection vector between the LSTM cell state and the input layer and θ_{h,C̄} is the
weighted connection vector between the LSTM cell activation and the LSTM cell state, and b_C̄ is the cell state bias vector. Briefly, the candidate information
obtained from the input gate and the previous LSTM cell state obtained from the
forget gate together determine the new information stored in the LSTM cell state. The output gate and the LSTM cell activation are expressed as [88]

Q_t = σ( θ_{x,Q} x_t + θ_{h,Q} a_{t−1} + b_Q )    (4.12)

and

a_t = Q_t ⊙ Tanh(C̄_t)    (4.13)

where θ_{x,Q}, θ_{h,Q}, θ_{C̄,I}, θ_{C̄,F} and θ_{C̄,Q} are connection weight vectors (the θ_{C̄,·} terms are optional peephole connections between the cell state and the gates, omitted in the equations above) and b_Q is the output gate bias vector. Briefly, the output gate controls the flow of information for the LSTM cell activation vector. Finally, a_t is the activation vector of the LSTM cell. A NumPy sketch of a single LSTM step follows.
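The sketch below implements one forward step of Eqs. 4.9-4.13 (peephole terms omitted); p is a hypothetical dictionary holding the θ weight matrices and bias vectors.

# NumPy sketch of one LSTM step following Eqs. 4.9-4.13 (no peepholes).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, a_prev, c_prev, p):
    f = sigmoid(p["Wxf"] @ x_t + p["Whf"] @ a_prev + p["bf"])  # forget, Eq. 4.9
    i = sigmoid(p["Wxi"] @ x_t + p["Whi"] @ a_prev + p["bi"])  # input,  Eq. 4.10
    c = f * c_prev + i * np.tanh(
        p["Wxc"] @ x_t + p["Whc"] @ a_prev + p["bc"])          # state,  Eq. 4.11
    q = sigmoid(p["Wxq"] @ x_t + p["Whq"] @ a_prev + p["bq"])  # output, Eq. 4.12
    a = q * np.tanh(c)                                         # activation, Eq. 4.13
    return a, c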
Similar to traditional RNNs, LSTM RNNs have access to only one direction of the data. Researchers introduced Bidirectional Long Short-Term Memory (BLSTM), a solution similar to BRNNs, for the LSTM RNN to benefit from both directions of the information [89]. As in BRNNs, the input sequence is presented to the model in a forward and a backward direction through two separate backward and forward hidden layers. The activations of these two hidden layers are concatenated to form the final output [90]. However, in contrast to the BRNN, the BLSTM utilizes LSTM cells instead of simple neurons in the backward and forward hidden layers. As in the BRNN, the backward hidden layer can capture future information and the forward hidden layer can capture past information. ECG cardiac waves are related to both prior and future samples, e.g., finding QRS-complex samples using prior samples such as P-wave samples and future samples such as T-wave samples. Thus, the BLSTM suits ECG segmentation well. The proposed architecture is shown in Fig. 4.5, and every block in the BLSTM layers is an LSTM layer.
This model contains two BLSTM layers and a fully-connected time distributed output layer. The hyperparameters, such as the number of hidden layers and the number of LSTM cells, are chosen based on monitoring the convergence of the BLSTM RNN model and are inspired by [54].
Fig. 4.5: The proposed BLSTM RNN architecture.
Table 4.2 describes the details of each layer of ECG-SegNet.
The first layer is the input layer, which takes the time series X_{4×500} as explained
in Section 4.1. The next layer is a hidden BLSTM layer. In this layer, each back-
ward hidden layer and forward hidden layer has 250 LSTM cells.
TABLE 4.2: Deep BLSTM RNN for ECG segmentation.
That is, the layer has a total of 500 LSTM cells. This is followed by another BLSTM layer of size 250 and then by the output layer, which is a fully-connected time distributed layer of size K that categorizes every sample into one of the K categories.
Calculating the loss for sequential models is similar to non-sequential models, except that the loss in sequential models is summed over all the timestamps and then backpropagated to the weights. For an ECG segment X of length T, the network output at timestamp t, o_t, is a probability distribution over the K possible categories, where k ∈ {1, 2, 3, 4} and o_{tk} (the kth element of o_t) is the network's estimate of the probability of observing category k at timestamp t. The loss is the negative log-probability of the label sequence, obtained using a softmax output layer for multi-class classification. A hedged Keras sketch of the full model follows.
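The sketch below follows Table 4.2; the per-direction size of the second BLSTM layer is interpreted as 250 as well, and note that Keras expects inputs of shape (T, features), i.e., the transpose of X_{4×T}.

# Hedged Keras sketch of ECG-SegNet: two BLSTM layers with 250 LSTM cells
# per direction and a time-distributed softmax over K = 4 categories.
import tensorflow as tf
from tensorflow.keras import layers

def build_ecg_segnet(T=500, n_features=4, K=4):
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(T, n_features)),
        layers.Bidirectional(layers.LSTM(250, return_sequences=True)),
        layers.Bidirectional(layers.LSTM(250, return_sequences=True)),
        layers.TimeDistributed(layers.Dense(K, activation="softmax")),
    ])
    model.compile(optimizer="adam",                  # Adam, as in Section 4.5
                  loss="categorical_crossentropy",   # per-timestamp log loss
                  metrics=["accuracy"])
    return model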
4.5 Training Experiment
Like the previous research, weight initialization has been done using a Gaussian distribution with mean zero and a standard deviation of 0.1. The ECG-SegNet is trained with the Adam optimizer [91] through 68 epochs using mini-batch training. The training stopped after 68 epochs because the training set error kept getting smaller while the validation set error was getting larger; this growing gap is a sign of over-fitting. After training, the results showed 94.6% accuracy for the training set, 93.8% accuracy for the validation set and 93.7% accuracy for the test set. Fig. 4.6 shows the accuracy rates and Fig. 4.7 shows the loss rates through the training epochs.
Fig. 4.7: Loss curve.
4.6 Results
The precision, recall and f1-score for each cardiac wave category and for neutral samples are reported in Table 4.3. The highest precision among the cardiac waves belongs to the QRS-complex at 0.94 and the lowest belongs to the T-wave with 0.90. The highest recall rate belongs to the QRS-complex with 0.95 and the lowest belongs to the P-wave with 0.90. Further, the highest f1-score belongs to the QRS-complex at 0.94 and the lowest is tied between the P-wave and the T-wave (see Table 4.3).
TABLE 4.3: ECG segmentation results.
For comparison purposes, two other approaches are used. Hughes et al. used HMMs to solve ECG segmentation with two approaches [11]. The first approach used raw ECG signals and the second used wavelet-encoded ECG. The comparison is given in Table 4.4, which shows that ECG-SegNet performs better in terms of accuracy. The HMM approach gave better accuracy on the T-wave; however, in all other cases and also in the overall results, ECG-SegNet performs better.

The majority of other research focuses on finding the cardiac complex fiducial points rather than segmenting every single sample of the ECG independently. Even though the ECG-SegNet task is different from ECG cardiac wave
identification, it provides competitive accuracy in identifying cardiac waves. Table 4.5 shows the accuracy of finding waves, regardless of segmentation, using ECG-SegNet. Fig. 5.5 shows a sample from the test set and its related result, illustrating the accuracy of ECG-SegNet; P-wave, QRS-complex and T-wave areas are represented by red, blue and green regions.
4.7 Conclusion
To the best of our knowledge, there had not been a DL-based method for ECG signal segmentation before this research. Our work demonstrates that ECG-SegNet can segment the ECG signal using only a few local features and yield very competitive results. This research provides the knowledge for the second main component of the E2E hybrid DL method for ECG segmentation in Chapter 5. In the next chapter, the inputs to the LSTM network are more comprehensive feature vector sequences.
Chapter 5

Hybrid Deep Learning Method for End-to-End ECG Segmentation

The previous studies provide the two main components of an E2E ECG segmentation method: the complex feature extractor component and the sequence learner component. This chapter proposes a novel hybrid DL method to combine these two components and introduces a single hybrid model for E2E ECG segmentation.
Results show that this hybrid DL method achieves better ECG segmentation performance. A convolutional autoencoder has the potential to extract the complex hierarchical structure of a signal and has been proven effective in segmentation applications [46, 47, 92]. Autoencoders have the ability to extract or compress the essential information from the input and reconstruct the desired output from it.
As shown in Fig. 5.1, there are four components in the proposed new model.
The first component is the input layer for the normalized ECG signals. The second component is a convolutional autoencoder acting as the feature extractor. The third component is a sequence learner consisting of two BLSTM layers. The
feature vectors generated by the convolutional autoencoder are the input to the
sequence learner. The fourth and last component is the output layer. In the
output layer, every timestamp sample is classified into one of the categories, i.e.,
neutral, P-wave, QRS-complex and T-wave. Fig. 5.2 shows the overall process
for the proposed approach. As shown, the flow starts by dividing the nor-
malized ECG data into smaller segments. The ECG segments and their related
annotations are distributed into training, validation and test sets. Following the data set preparation, the autoencoder and the sequence learner are trained as a single model.

This chapter is organized as follows. Section 5.1 introduces the data sets, including the training, validation and test sets. Section 5.2 explains the idea behind autoencoders. The architecture and details of Hybrid-ECG-SegNet are introduced in Section 5.3. Finally, Section 5.4 describes experiments and results, showing that the hybrid model outperforms sequence learners such as HMM for the same task of ECG cardiac wave segmentation.
Unlike the data prepared for the previous research, the ECG recordings are divided into 1,000-sample segments, i.e., four seconds of E(norm), which contain approximately four to five cardiac complexes (one or two of them can be partial cardiac complexes). Having multiple cardiac complexes can help the model find the relationships between the cardiac waves. These ECG segments are the inputs and their sample-wise annotations are the labels. In total, 46,690 segments are extracted from the QTDB to be used as input-label pairs. In this research, there is no traditional feature extraction method; an ML approach is responsible for extracting spatial features from the ECG segments.
In our experiment, three different sets, including training, validation and test sets, have been extracted from QTDB. Similar to the previous research, these sets are mutually exclusive, indicating there are no identical segments from one recording to another, and they are subject independent. Table 5.1 illustrates the training, validation and test set distribution.

5.2 Autoencoders

An autoencoder consists of an encoder and a decoder: the encoder compresses the input into a lower-dimensional representation, and the decoder is tuned either to reconstruct the initial input from the compressed data with minimum loss or to produce a desired representation of the input. Autoencoders that reconstruct the initial inputs are called auto-associative autoencoders, and autoencoders that transform the input into another representation are called non-auto-associative autoencoders [93]. In this research, the desired
output is a feature vector for every timestamp of the input data; thus, a non-auto-associative autoencoder is used. Long et al. used a fully convolutional network with a bottleneck architecture for an E2E image segmentation task [46]. Also, Meng et al. used autoencoders to extract viable features for image segmentation [47]. Fig. 5.3 illustrates the operation of an autoencoder. The
objective of a non-auto-associative autoencoder is to find A ∈ 𝒜 and B ∈ ℬ such that

min_{A,B} L(A, B) = min_{A,B} ∑_{i=1}^{m} L(x_i, y_i) = min_{A,B} ∑_{i=1}^{m} Δ( B ∘ A(x_i), y_i )    (5.1)

where ∘ is the application (composition) operator and L(·, ·) is the loss function [93]. Therefore, an autoencoder finds a function A that lowers the dimension of the input from n and extracts the essential features from the input, and a function B that constructs the desired output from the essential features extracted by A.
The proposed hybrid model combines convolutional autoencoders and BLSTM networks to classify ECG cardiac wave samples. The proposed autoencoder for feature extraction has three different types of layers: convolutional layers for extracting features, max-pooling layers for reducing dimensions and upsampling layers for increasing dimensions. While convolutional and max-pooling layers are used for encoding, convolutional and upsampling layers are used for decoding. The decoder is inspired by the Super-Resolution Convolutional Neural Network
(SRCNN). SRCNN upsamples a low-resolution signal to a higher resolution without any obvious artifacts for images [92]. In this research, the upsampling function repeats the rows and columns of the data. The hyperparameters adopted for this research are based on our previous research and on monitoring the convergence progression. Fig. 5.4 shows the details of the proposed deep convolutional autoencoder. Since the first layer has 1,000 timestamps, the sequence length is T = 1,000. With max-pooling layers of receptive field (1 × 2) and stride (1 × 2), the encoder compresses the output of the first layer through its bottleneck architecture to 32 × T/4 dimensions, i.e., each max-pooling halves the size of its input and thereby compresses it. The decoder consists of pairs of convolutional and upsampling layers. An upsampling layer repeats the input's rows and columns (1 × 2), i.e., it doubles the size of its input. By applying upsampling twice, feature maps of the same size as those of the first convolutional layer are obtained.
Fig. 5.4: The proposed autoencoder feature extractor.
This bottleneck structure allows extracting the hierarchical structure of the ECG signal that represents the
essential information of the ECG signal. As a result of applying the convolutional autoencoder, a feature vector of size 16 is produced for every timestamp, and these feature vectors are the input to the sequence learner.
The feature vectors are fed into the sequence learner architecture through a time distributed layer. The sequence learner architecture contains two BLSTM layers with 2 × 150 and 2 × 75 LSTM cells, followed by a fully-connected softmax output layer. The output layer is an O_{K×T} matrix which produces a probability distribution over the K categories for each timestamp t. By utilizing the loss function explained in Eq. 2.12 and using backpropagation through time, the weight gradients are measured and the weights are updated [54, 86]. After training, the model classifies every sample of the input ECG segment. The overall architecture is shown in Fig. 5.5.
The category of the layers, the description of the layers, the size of each layer and further details are listed in Table 5.2. Weight initialization has been done similarly to the previous research.
Fig. 5.5: The proposed hybrid architecture.
The Hybrid-ECG-SegNet is trained with the RMSProp optimizer, with hyperparameters chosen based on our previous research and by monitoring the convergence progression.
TABLE 5.2: Hybrid-ECG-SegNet architecture.

Layer    | Category         | Description of the layer             | Size        | Receptive Field | Stride
Layer 0  | Input            | ECG signal                           | (1 x 1000)  | -               | -
Layer 1  | Autoencoder      | Convolutional layer with 16 filters  | (16 x 1000) | (1 x 5)         | (1 x 1)
Layer 2  | Autoencoder      | Max-pooling of size (1 x 2)          | (16 x 500)  | (1 x 2)         | (1 x 2)
Layer 3  | Autoencoder      | Convolutional layer with 32 filters  | (32 x 500)  | (1 x 5)         | (1 x 1)
Layer 4  | Autoencoder      | Max-pooling of size (1 x 2)          | (32 x 250)  | (1 x 2)         | (1 x 2)
Layer 5  | Autoencoder      | Convolutional layer with 32 filters  | (32 x 250)  | (1 x 5)         | (1 x 1)
Layer 6  | Autoencoder      | Upsampling of size (1 x 2)           | (32 x 500)  | (1 x 1)         | (1 x 1)
Layer 7  | Autoencoder      | Convolutional layer with 16 filters  | (16 x 500)  | (1 x 5)         | (1 x 1)
Layer 8  | Autoencoder      | Upsampling of size (1 x 2)           | (16 x 1000) | (1 x 1)         | (1 x 1)
Layer 9  | Autoencoder      | Time distributed dense layer         | (16 x 1000) | -               | -
Layer 10 | Sequence Learner | BLSTM layer                          | (2 x 150)   | -               | -
Layer 11 | Sequence Learner | BLSTM layer                          | (2 x 75)    | -               | -
Layer 12 | Output           | Time distributed dense layer         | (4 x 1000)  | -               | -
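A hedged Keras sketch of this layer stack follows. Padding modes are assumptions chosen to reproduce the layer sizes listed in Table 5.2.

# Hedged Keras sketch of the Hybrid-ECG-SegNet layers in Table 5.2.
import tensorflow as tf
from tensorflow.keras import layers

def build_hybrid_ecg_segnet(T=1000, K=4):
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(T, 1)),                                   # Layer 0
        layers.Conv1D(16, 5, padding="same", activation="relu"),        # Layer 1
        layers.MaxPooling1D(2),                                         # Layer 2
        layers.Conv1D(32, 5, padding="same", activation="relu"),        # Layer 3
        layers.MaxPooling1D(2),                                         # Layer 4
        layers.Conv1D(32, 5, padding="same", activation="relu"),        # Layer 5
        layers.UpSampling1D(2),                                         # Layer 6
        layers.Conv1D(16, 5, padding="same", activation="relu"),        # Layer 7
        layers.UpSampling1D(2),                                         # Layer 8
        layers.TimeDistributed(layers.Dense(16)),                       # Layer 9
        layers.Bidirectional(layers.LSTM(150, return_sequences=True)),  # Layer 10
        layers.Bidirectional(layers.LSTM(75, return_sequences=True)),   # Layer 11
        layers.TimeDistributed(layers.Dense(K, activation="softmax")),  # Layer 12
    ])
    model.compile(optimizer=tf.keras.optimizers.RMSprop(),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model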
The training stopped after 47 epochs because the validation set error did not improve for 10 epochs, which is the early-stopping policy. RMSProp maintains a per-parameter learning rate based on the average of the recent magnitudes of the gradients for each parameter. After training, the results showed 94.73% accuracy for the training set, 93.58% accuracy for the validation set and 93.99% accuracy for the test set. Fig. 5.6 shows the accuracy rates through 47 epochs for the validation and training sets. Fig. 5.7
Fig. 5.6: Accuracy curve.
shows the error rates through 47 epochs for validation and training sets.
5.5 Results
The detailed results of this segmentation are reported in Table 5.3. The highest precision among the cardiac waves belongs to the T-wave at 0.92, and the lowest is tied between the P-wave and the QRS-complex with 0.90. The highest recall rate belongs to the QRS-complex with 0.95, and the lowest belongs to the P-wave category with 0.92. The highest f1-score belongs to the QRS-complex at 0.94 and the lowest belongs to the P-wave category with 0.91. The average scores are also listed in Table 5.3.
For comparison purposes, two other approaches and ECG-SegNet are used. Hughes et al. used HMMs to solve ECG segmentation with two approaches [11]. The first approach used the raw ECG signal and the second used wavelet-encoded ECG. Table 5.4 compares the results. The HMM approach on the QRS-complex and T-wave gave better accuracy; however, in all other cases and in the overall results, Hybrid-ECG-SegNet performs better.
TABLE 5.4: Segmentation accuracy comparison.
The HMM approaches rely on a hand-crafted feature extraction process. On the other hand, ECG-SegNet uses derivative filters and a local average as its features, while the feature extraction in Hybrid-ECG-SegNet comes from the ML-based feature extraction process. These results demonstrate that merely using a sequence learner is not enough for ECG segmentation; even a simple feature extractor component, such as the traditional derivative filters, improves on segmentation methods that use raw ECG signals and only a sequence learner. Although Hybrid-ECG-SegNet also receives raw ECG signals, a feature-extracting autoencoder is built into the model. This is the reason that Hybrid-ECG-SegNet achieves the best overall performance.
Even though the task differs from identifying cardiac waves, it provides competitive accuracy in this field as well. Table 5.5 compares the cardiac wave identification accuracy. The conclusion from this observation is that the hybrid model provides synergy for the ECG segmentation task. Fig. 5.8 shows two samples from the test set and their related results; P-wave, QRS-complex and T-wave areas are represented by red, blue and green regions. Fig. 5.9 shows a noisy sample from the test set and its related result; it shows that the segmentation remains accurate in the presence of noise.
Fig. 5.9: Noisy sample result.
Chapter 6

Conclusions and Future Work

This dissertation studied DL-based ECG delineation, using ConvNets for spatial feature extraction and BLSTM NNs for capturing the temporal attributes of the ECG signal. To address the various cardiac wave formations, trainable feature filters have been used to extract features from the ECG signal instead of hand-crafted features; such feature extractors can learn the morphology and the hierarchical structure of ECG cardiac waves. The proposed convolutional autoencoder is able to extract the hierarchical ECG structure, and a BLSTM sequence learner is adopted to capture the relationship between cardiac waves. Our results show that the LSTM sequence learner outperforms other types of sequence learners based on HMMs.
The materials in this dissertation research can provide a baseline for future studies. In this dissertation, a new DL architecture has been introduced for successful E2E ECG segmentation. It is still an open-ended question whether DL can improve ECG analysis further, and several directions remain.

First, our dissertation research is limited to the cardiac wave formations in QTDB, and more datasets could achieve better generalization, especially datasets collected for diagnostic purposes in ECG analysis. All the reported results are based on finding the cardiac waves and segmenting ECG samples; the step from segmentation to diagnosis remains to be studied. Second, the studied models are limited to ConvNets, LSTMs and dropout layers. With the continuation of DL research, other models and topologies can be good candidates for the ECG delineation problem statement; attention models [95] and capsule networks [96] are excellent candidates. Third, several ECG applications were introduced in Section 1.1, and using the proposed model for those applications can be a future exploration. Finally, the computational cost should be considered, especially for wearable devices; one of the concerns over portable medical devices is their limited computational resources. This concern has not been addressed in this dissertation and should be investigated further.
Bibliography
topics/heart-attack/diagnosing-a-heart-attack/electrocardiogram-
[2] Belt with textile electrodes. http://www.e-projects.ubi.pt/smart-
letes: the ‘Seattle Criteria’”. In: British Journal of Sports Medicine 47.3 (2013),
https://bjsm.bmj.com/content/47/3/122.
[5] R. Tafreshi, A. Jaleel, J. Lim, and L. Tafreshi. “Automated analysis of ECG
nal Processing and Control 10 (2014), pp. 41–49. ISSN: 17468094. DOI : 10 .
retrieve/pii/S1746809413001821.
ity in Mitral Stenosis”. In: The Tohoku Journal of Experimental Medicine 220.4
(US). In: Heart Rhythm 14.6 (June 2017), pp. 848–852. ISSN: 1547-5271. DOI:
10.1016/j.hrthm.2017.02.011.
In: Journal of the American College of Cardiology 70.9 (2017), pp. 1183–1192.
onlinejacc.org/content/70/9/1183.
(), pp. 944–950. DOI: 10.1111/pace.12913. URL: https://onlinelibrary.
wiley.com/doi/abs/10.1111/pace.12913.
mated ECG Interval Analysis”. In: Proceedings of the 16th International Con-
org/citation.cfm?id=2981345.2981422.
ing and Computing 28.1 (1990), pp. 67–73. ISSN: 1741-0444. DOI : 10.1007/
ISSN: 0022-0736.
[15] R. V. Andreao, B. Dorizzi, and J. Boudy. “ECG signal analysis through
53.8 (2006), pp. 1541–1549. ISSN: 0018-9294. DOI: 10.1109/TBME.2006.
877103.
for Hypertrophic Cardiomyopathy”. In: Transl Pediatr 6.3 (), pp. 199–206.
[17] A. Alattar and N. Maffulli. “The Validity of Adding ECG to the Prepartic-
2018-12-05.
Safety: Workshop Summary. Ed. by Steve Olson, Sally Robinson, and Robert
with 8518 ECGs”. In: Annals of Oncology 23.11 (2012), pp. 2960–2963. DOI :
10.1093/annonc/mds130. URL : http://dx.doi.org/10.1093/annonc/
mds130.
EnfermerÃa ClÃnica (English Edition) 27.2 (2017), pp. 136 –137. ISSN: 2445-
//www.sciencedirect.com/science/article/pii/S2445147916300224.
e215–e220.
http://arxiv.org/abs/1311.2901.
with the use of a reference data base.” In: Circulation 71 3 (1985), pp. 523–
34.
Rautaharju, and G. S. Wagner. “Recommendations for the Standardization
[26] A. K. Dohare, V. Kumar, and R. Kumar. “An efficient new method for
compeleceng.2013.11.004.
[28] M. Campbell, A. Joseph Hoane Jr., and F. Hsu. “Deep Blue”. In: Artif. In-
tell. 134.1-2 (Jan. 2002), pp. 57–83. ISSN: 0004-3702. DOI : 10.1016/S0004-
00129-1.
[29] I. Beraza and I. Romero. “Comparative study of algorithms for ECG seg-
mentation”. In: Biomed. Signal Proc. and Control 34 (2017), pp. 166–173.
//cds.cern.ch/record/1503877.
Deep Belief Nets”. In: Neural Comput. 18.7 (July 2006), pp. 1527–1554. ISSN:
0899-7667. DOI: 10.1162/neco.2006.18.7.1527. URL: http://dx.doi.org/10.1162/neco.2006.18.7.1527.
works are universal approximators”. In: Neural Networks 2.5 (1989), pp. 359
0893608089900208.
Research Group. Cambridge, MA, USA: MIT Press, 1986. Chap. Learning
[35] D.Randall Wilson and Tony R. Martinez. “The general inefficiency of batch
training for gradient descent learning”. In: Neural Networks 16.10 (2003),
article/pii/S0893608003001382.
for Stochastic Optimization”. In: Proceedings of the 20th ACM SIGKDD In-
New York, New York, USA: ACM, 2014, pp. 661–670. ISBN: 978-1-4503-
2956-9. DOI: 10.1145/2623330.2623612. URL: http://doi.acm.org/10.
1145/2623330.2623612.
[38] K. He, X. Zhang, S. Ren, and J. Sun. “Delving Deep into Rectifiers: Sur-
Lake Tahoe, Nevada: Curran Associates Inc., 2012, pp. 2951–2959. URL :
http://dl.acm.org/citation.cfm?id=2999325.2999464.
[40] B. Y. Hsueh, W. Li, and I-Chen Wu. “Stochastic Gradient Descent with
Systems 1”. In: ed. by David S. Touretzky. San Francisco, CA, USA: Mor-
gan Kaufmann Publishers Inc., 1989. Chap. Comparing Biases for Minimal
[42] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. “Gradient-based learning
Large Scale Visual Recognition Challenge”. In: Int. J. Comput. Vision 115.3
URL : http://dx.doi.org/10.1007/s11263-015-0816-y.
University (2016).
URL : http://arxiv.org/abs/1411.4038.
7965877.
[48] C. Kong and S. Lucey. “Take it in your stride: Do we need striding in
//arxiv.org/abs/1712.02502.
networks”. In: Neural Networks 71 (2015), pp. 1 –10. ISSN: 0893-6080. DOI :
sciencedirect.com/science/article/pii/S0893608015001446.
Neural Networks ICANN 99. (Conf. Publ. No. 470). Vol. 2. 1999, 850–855 vol.2.
DOI : 10.1049/cp:19991218.
Comput. 9.8 (Nov. 1997), pp. 1735–1780. ISSN: 0899-7667. DOI: 10.1162/neco.1997.9.8.1735.
[54] A. Graves, N. Jaitly, and A. R. Mohamed. “Hybrid speech recognition with
2013.6707742.
10.1109/IJCNN.2005.1556215.
and Computing 30.2 (1992), pp. 169–176. ISSN: 1741-0444. DOI : 10 . 1007 /
[57] C. Li, C. Zheng, and C. Tai. “Detection of ECG characteristic points using
for ECG beat segmentation”. In: Physiological Measurement 30.3 (2009), p. 335.
URL : http://stacks.iop.org/0967-3334/30/i=3/a=008.
DOI : 10.1109/TBME.2015.2468589.
[60] P. Rajpurkar, A. Y. Hannun, M. Haghpanahi, C. Bourn, and A. Y. Ng.
pubmed/21181267.
ing: ECIFMBE 2008 23–27 November 2008 Antwerp, Belgium. Ed. by S. J. Van-
1007/978-3-540-89208-3_290.
[64] S. K. Mukhopadhyay, M. Mitra, and S. Mitra. “Time plane ECG feature
DOI : 10.1109/TBME.2003.821031.
science/article/pii/S0022073603001274.
rithm”. In: Computers in Cardiology, 2005. 2005, pp. 707–710. DOI: 10.1109/
CIC.2005.1588202.
[69] W. Shi and I. Kheidorov. “Hybrid hidden Markov models for ECG seg-
[70] Q. A. Rahman, L. G. Tereshchenko, M. Kongkatong, T. Abraham, M. R.
ing ECG morphology and heartbeat interval features”. In: IEEE Transac-
DOI : 10.1109/TBME.2006.883802.
science/article/pii/S0020025517306539.
ysmal atrial fibrillation”. In: 2016 International Joint Conference on Neural
plement (2018), S18–S21. ISSN: 0022-0736. DOI: https://doi.org/10.
com/science/article/pii/S0022073618303315.
161-460.
localization in ECG using deep learning”. In: 2018 IEEE EMBS International
10.1109/BHI.2018.8333406.
ECG Interval Segmentation Using LSTM Neural Network”. In: 2018 Int’l
77.
[80] J. Meng, Y. Xiang, and Z. Lin. “Automatic QRS complex detection using two-level convolutional neural network”. In: Biomed Eng Online (2018). eprint:
08520.
entificWorldJournal. 2013.
[82] S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry. “How Does Batch Nor-
ting”. In: J. Mach. Learn. Res. 15.1 (Jan. 2014), pp. 1929–1958. ISSN: 1532-
rithm for 32-bit integer online processing”. In: BioMedical Engineering On-
URL : https://doi.org/10.1186/1475-925X-10-23.
[86] X. Li and X. Wu. “Long Short-Term Memory based Convolutional Re-
In: IEEE Transactions on Signal Processing 45.11 (1997), pp. 2673–2681. ISSN:
org/abs/1505.08075.
URL : http://arxiv.org/abs/1603.01354.
[91] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Opti-
and Machine Intelligence 38.2 (2016), pp. 295–307. ISSN: 0162-8828. DOI: 10.
1109/TPAMI.2015.2439281.
3045801.
org/abs/1609.04747.
sules”. In: CoRR abs/1710.09829 (2017). arXiv: 1710 . 09829. URL : http :
//arxiv.org/abs/1710.09829.