RESEARCH ARTICLE

A Hybrid Spatial-Temporal Deep Learning Architecture For Lane Detection
techniques, for example, Inverse Perspective Mapping (Aly, 2008; Wang et al., 2014), Hough transform (Berriel et al., 2017; Jiao et al., 2019; Zheng et al., 2018), Gaussian filters (Aly, 2008; Sivaraman and Trivedi, 2013; Wang et al., 2012), and Random Sample Consensus (RANSAC) (Aly, 2008; Choi et al., 2018; Du et al., 2018; Guo et al., 2015; Lu et al., 2019), are usually adopted in the 4-step procedure. The problems of traditional methods are: (a) hand-crafted features are cumbersome to design and not always useful, suitable, or powerful; and (b) the detection is always based on one single image. Thus, the detection accuracy is relatively low.

During the last decade, with the advancements in deep learning algorithms and computational power, many deep neural network based methods have been developed for lane detection with good performance. There are generally two dominant approaches (Tabelini et al., 2020b): (1) the segmentation-based pipeline (Kim and Park, 2017; Ko et al., 2020; T. Liu et al., 2020; Pan et al., 2018; Zhang et al., 2021; Zou et al., 2020), in which predictions are made on a per-pixel basis, classifying each pixel as either lane or not; and (2) the row-based prediction pipeline (Hou et al., 2020; Qin et al., 2020; Yoo et al., 2020), in which the image is split into a (horizontal) grid and the model predicts the most probable location to contain a part of a lane marking in each row. Recently, Liu et al. (2021) summarized two additional categories of deep learning based lane detection methods: the anchor-based approach (Chen et al., 2019; Li et al., 2020; Tabelini et al., 2020b; Xu et al., 2020), which focuses on optimizing the line shape by regressing the relative coordinates with the help of predefined anchors, and the parametric prediction based method, which directly outputs parametric lines expressed by curve equations (R. Liu et al., 2020; Tabelini et al., 2020a). Apart from these dominant approaches, some other less common methods were proposed recently. For instance, Lin et al. (2020) fused an adaptive anchor scheme (designed by formulating a bilinear interpolation algorithm) that aids informative feature extraction and object detection into a single deep convolutional neural network for lane detection from a top-view perspective. Philion (2019) developed a novel learning-based approach with a fully convolutional model that decodes the lane structures directly rather than delegating structure inference to post-processing, plus an effective approach to adapt the model to new contexts by unsupervised transfer learning.

Similar to traditional vision-based lane-detection methods, most available deep learning models utilize only the current image frame to perform the detection. Only very recently have a few studies explored the combination of convolutional neural networks (CNN) and recurrent neural networks (RNN) to detect lane markings or simulate autonomous driving using continuous driving scenes (Chen et al., 2020; Zhang et al., 2021; Zou et al., 2020). However, the available methods do not take full advantage of the essential property of lanes being long continuous solid or dashed line structures, nor do they make the most of the spatial-temporal information, correlations, and dependencies in the continuous driving frames. Thus, for certain extremely challenging driving scenes, their detection results are still unsatisfactory.

In this paper, lane detection is treated as a segmentation task, and a novel hybrid spatial-temporal sequence-to-one deep learning architecture is developed for lane detection from a continuous sequence of images in an end-to-end approach. To cope with challenging driving situations, the hybrid model takes multiple continuous frames of an image sequence as inputs and integrates a single image feature extraction module, a spatial-temporal feature integration module, and an encoder-decoder structure to make full use of the spatial-temporal information in the image sequence. The single image feature extraction module utilizes modified common backbone networks with embedded spatial convolutional neural network (SCNN) (Pan et al., 2018) layers to extract the features in every single image throughout the continuous driving scene. SCNN is powerful in extracting spatial features and relationships in one single image, especially for long continuous shape structures. Next, the extracted features are fed into spatial-temporal recurrent neural network (ST-RNN) layers to capture the spatial-temporal dependencies and correlations among the continuous frames. An encoder-decoder structure is adopted, with the encoder consisting of SCNN and several fully-convolutional layers to downsample the input image and abstract the features, while the decoder, constructed by CNNs, upsamples the abstracted outputs of previous layers to the same size as the input image. With the labelled ground truth of the very last image in the continuous frames, the model is trained in an end-to-end way as a supervised learning approach. To train and validate the proposed model on two large-scale open-sourced datasets, i.e., tvtLANE (Zou et al., 2020) and TuSimple, a corresponding training strategy has also been developed. To summarize, the main contributions of this paper lie in:

• A hybrid spatial-temporal sequence-to-one deep neural network architecture integrating the advantages of the encoder-decoder structure, the SCNN-embedded single image feature extraction module, and the ST-RNN module, is proposed;
• The proposed model architecture is the first attempt to strengthen both the spatial relation feature extraction in every single image frame and the spatial-temporal correlations and dependencies among continuous image frames for lane detection;
• The implementation utilized two widely used neural network backbones, i.e., UNet (Ronneberger et al., 2015) and SegNet (Badrinarayanan et al., 2017), and included extensive evaluation experiments on commonly used datasets, demonstrating the effectiveness and strength of the proposed model architecture;
• The proposed model can tackle lane detection in challenging scenes such as curves, dirty roads, and serious vehicle occlusions, and outperforms all the available state-of-the-art baseline models in most cases with a large margin;
• Under the proposed architecture, the light version model variant can achieve beyond state-of-the-art performance while using fewer parameters.
[FIGURE 1. The proposed hybrid spatial-temporal sequence-to-one architecture: continuous input images I_t0 ... I_tn are processed by the single image feature extraction module (SCNN-based encoder, producing Encoded(X_t0) ... Encoded(X_tn)), integrated by the spatio-temporal feature integration module (ST-RNN layers), and decoded to predict the lanes of the last frame I_tn. Skip connections link encoder and decoder: concatenation for the UNet-based backbone, pooling indices reuse for the SegNet-based backbone. The bottom part shows the SCNN structure on a C × W × H tensor with the four propagation directions SCNN_DOWN, SCNN_UP, SCNN_RIGHT, and SCNN_LEFT.]
2 PROPOSED METHOD

Although many sophisticated methods have been proposed for lane detection, most of the available methods use only one single image, resulting in unsatisfactory performance under some extremely challenging scenarios, e.g., dazzle lighting and serious occlusion. This study proposes a novel hybrid spatial-temporal sequence-to-one deep neural network architecture for lane detection. The architecture was inspired by: (a) the successful precedents of hybrid deep neural network architectures which fuse CNN and RNN to make use of information in continuous multiple frames (Zhang et al., 2021; Zou et al., 2020); and (b) the domain prior knowledge that traffic lanes are long continuous shape line structures with strong spatial relationships. The architecture integrates two modules utilizing two distinctive neural networks with complementary merits, i.e., SCNN and the convolutional Long Short Term Memory (ConvLSTM) neural network, under an end-to-end encoder-decoder structure, to tackle lane detection in challenging driving scenes.

2.1 Overview of the proposed model architecture

The proposed deep neural network architecture adopts a sequence-to-one end-to-end encoder-decoder structure as shown in Figure 1. Here "sequence-to-one" means that the model takes a sequence of multiple images as input and outputs the detection result of the last image (please note that essentially the model is still utilizing sequence-to-sequence neural networks); "end-to-end" means that the learning algorithm goes directly from the input to the desired output, which refers to the lane detection result in this paper, bypassing the intermediate states (Levinson et al., 2011; Neven et al., 2017); the encoder-decoder structure is a modular structure that consists of an encoder network and a decoder network, and is often employed in sequence-to-sequence tasks such as language translation (e.g., Sutskever et al., 2014) and speech recognition (e.g., Wu et al., 2017). Here, the proposed model adopts an encoder CNN with SCNN layers and a decoder CNN using fully convolutional layers. The encoder takes a sequence of continuous image frames, i.e., time-series images, as input and abstracts the feature map(s) in smaller sizes. To make use of the prior knowledge that traffic lanes are solid- or dashed-line structures with a continuous shape, one special kind of CNN, i.e., SCNN, is adopted after the first CNN hidden layer. With the help of SCNN, spatial features and relationships in every single image will be better extracted. Following this, the extracted feature maps of the continuous frames, constructed in a time-series manner, will be fed to ST-RNN blocks for sequential feature extraction and spatial-temporal information integration. Finally, the decoder network upsamples the abstracted feature maps obtained from the ST-RNN and decodes the content to the original input image size with the detection results. The proposed model architecture is implemented with two backbones, UNet (Ronneberger et al., 2015) and SegNet (Badrinarayanan et al., 2017). Note that, in the UNet-based architecture, similar to (Ronneberger et al., 2015), the proposed model employs skip connections between the encoder and decoder phases through a concatenating operation to reuse features and retain information from previous encoder layers for more accurate predictions; while in the SegNet-based networks, at the decoder stage, similar to (Badrinarayanan et al., 2017), the proposed model reuses the pooling indices to capture, store, and make use of the vital boundary information in the encoder feature maps. The detailed network implementation is elaborated in the remaining parts of Section 2.
2.2 Network design

1) End-to-end encoder-decoder: Regarding lane detection as an image segmentation problem, the encoder-decoder structure based neural network can be implemented and trained in an end-to-end way. Inspired by the excellent performance of CNN-based encoder-decoders for image semantic-segmentation tasks in various domains (Badrinarayanan et al., 2017; Wang et al., 2020; Yasrab et al., 2017), this study also adopts the "symmetrical" encoder-decoder as the main backbone structure. Convolution and pooling operations are employed to extract and abstract the features of every image in the encoder stage, while in the decoder part, inverted convolution and upsampling operations are adopted to grasp the extracted high-order features and construct the outputs layer by layer with regard to the targets. By setting the output target size the same as the input image size, the whole network can work in an end-to-end approach. In the implementation, two widely used backbones, U-Net and Seg-Net, are adopted. To better extract and make use of the spatial relations in every image frame, the SCNN layer is introduced in the encoder part of the single image feature extraction module. Furthermore, to excavate and make use of the spatial-temporal correlations and dependencies among the input continuous image frames, ST-RNN blocks are embedded in the middle of the encoder-decoder networks.

2) SCNN: The Spatial Convolutional Neural Network (SCNN) was first proposed by Pan et al. (2018). The "spatial" here means that the specially designed CNN can propagate spatial information via slice-by-slice message passing. The detailed structure of SCNN is demonstrated in the bottom part of Figure 1.

SCNN can propagate the spatial information in one image through four directions, indicated by the suffixes "DOWN", "UP", "RIGHT", and "LEFT" in Figure 1, which denote downward, upward, rightward, and leftward, respectively. Take the "SCNN_DOWN" module as an example, and consider that SCNN is applied to a three-dimensional tensor of size C × W × H, where, in the lane detection task, C, W, and H denote the number of channels, the image (or feature map) width, and its height, respectively. For SCNN_DOWN, the input tensor is split into H slices, and the first slice is sent into a convolution layer with C kernels of size C × w, in which w is the kernel width. Different from the traditional CNN, in which the output of one convolution layer is fed into the next layer directly, in SCNN_DOWN the output is added to the next adjacent slice to produce a new slice, which is convolved in turn, continuing until the last slice in the selected direction is updated. The convolution kernel weights are shared throughout all slices, and the same mechanism works for the other directions of SCNN. With the above properties, SCNN has demonstrated its strength in extracting spatial relationships in the image, which makes it suitable for detecting long continuous shape structures, e.g., traffic lanes, poles, and walls (Pan et al., 2018). However, using only one image to do the detection, SCNN still could not produce satisfying performance under extremely challenging conditions. That is why a sequence-to-one architecture, with continuous image frames as inputs and ST-RNN blocks to capture the spatial-temporal correlations in the continuous frames, is proposed in this paper.
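To make the slice-by-slice message passing concrete, the following is a minimal PyTorch sketch of the downward direction only; the module name, the kernel width, and the ReLU nonlinearity are illustrative assumptions rather than the exact implementation used in this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SCNNDown(nn.Module):
    """Minimal sketch of SCNN_DOWN message passing (assumed names and defaults).

    A feature map of size (N, C, H, W) is split into H row slices; each slice is
    convolved with a shared 1 x w kernel bank and the result is added to the next
    slice below, so information propagates downward through the image.
    """
    def __init__(self, channels: int, kernel_width: int = 9):
        super().__init__()
        # one shared convolution applied to every slice (weights shared across slices)
        self.conv = nn.Conv2d(channels, channels,
                              kernel_size=(1, kernel_width),
                              padding=(0, kernel_width // 2), bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, H, W); propagate top -> bottom, slice by slice
        slices = list(torch.split(x, 1, dim=2))      # H slices of shape (N, C, 1, W)
        for i in range(1, len(slices)):
            # the updated slice above sends a message that is added to the current slice
            slices[i] = slices[i] + F.relu(self.conv(slices[i - 1]))
        return torch.cat(slices, dim=2)

# example: a 128-channel feature map of spatial size 16 x 32 keeps its shape
feats = torch.randn(2, 128, 16, 32)
print(SCNNDown(128)(feats).shape)   # torch.Size([2, 128, 16, 32])
```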
3) ST-RNN module: In the proposed framework, the multiple continuous frames of images are modelled as "image-time-series" inputs. To capture the spatial-temporal dependencies and correlations among the image-time-series, the ST-RNN module is embedded in the middle of the encoder-decoder structure: it takes the extracted features of the encoder as its input and outputs the integrated spatial-temporal information to the decoder.

Various versions of RNNs have been proposed, e.g., Long Short Term Memory (LSTM) together with its multivariate version, i.e., fully connected LSTM (FC-LSTM), and the Gated Recurrent Unit (GRU), to tackle time-series data in different application domains. In this paper, two state-of-the-art RNN networks, i.e., ConvLSTM (Shi et al., 2015) and the Convolutional Gated Recurrent Unit (ConvGRU) (Ballas et al., 2016), are employed. These models, considering their abilities in spatial-temporal feature extraction, generally outperform other traditional RNN models.

A critical problem for the vanilla RNN model is the vanishing gradient (Hochreiter and Schmidhuber, 1997; Pascanu et al., 2013; Ribeiro, 2020). To address this, LSTM introduces memory cells and gates to control the information flow and trap the gradient, preventing it from vanishing during back-propagation. In LSTM, the information of the new
time-series inputs will be accumulated to the memory cell $\mathcal{C}_t$ if the input gate $i_t$ is on. In contrast, if the information is not "important", the past cell status $\mathcal{C}_{t-1}$ can be "forgotten" by activating the forget gate $f_t$. In addition, the output gate $o_t$ decides whether the latest cell output $\mathcal{C}_t$ will be propagated to the final state $\mathcal{H}_t$. The traditional FC-LSTM contains too much redundancy for spatial information, which makes it time-consuming and computationally expensive. To address this, ConvLSTM (Shi et al., 2015) is selected to build the ST-RNN block of the proposed framework. In ConvLSTM, convolutional structures and operations are introduced in both the input-to-state and state-to-state transitions to encode spatial information, which also alleviates the time and computation burden. The key formulation of ConvLSTM is shown in equations (1)-(5), where $\odot$ denotes the Hadamard product, $*$ denotes the convolution operation, $\sigma(\cdot)$ represents the sigmoid function, and $\tanh(\cdot)$ represents the hyperbolic tangent function; $X_t$, $\mathcal{C}_t$, and $\mathcal{H}_t$ are the input (i.e., the extracted features from the encoder in the proposed framework), the memory cell status, and the output at time $t$; $i_t$, $f_t$, and $o_t$ are the function values of the input gate, forget gate, and output gate, respectively; $W$ denotes the weight matrices, whose subscripts indicate the two corresponding variables connected by this matrix. For instance, $W_{xc}$ is the weight matrix between the input extracted features $X_t$ and the memory cell $\mathcal{C}_t$; the $b$'s are the biases of the gates, e.g., $b_i$ is the input gate's bias.

$$i_t = \sigma(W_{xi} * X_t + W_{hi} * \mathcal{H}_{t-1} + W_{ci} \odot \mathcal{C}_{t-1} + b_i) \quad (1)$$
$$f_t = \sigma(W_{xf} * X_t + W_{hf} * \mathcal{H}_{t-1} + W_{cf} \odot \mathcal{C}_{t-1} + b_f) \quad (2)$$
$$\mathcal{C}_t = f_t \odot \mathcal{C}_{t-1} + i_t \odot \tanh(W_{xc} * X_t + W_{hc} * \mathcal{H}_{t-1} + b_c) \quad (3)$$
$$o_t = \sigma(W_{xo} * X_t + W_{ho} * \mathcal{H}_{t-1} + W_{co} \odot \mathcal{C}_t + b_o) \quad (4)$$
$$\mathcal{H}_t = o_t \odot \tanh(\mathcal{C}_t) \quad (5)$$
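As a concrete illustration of equations (1)-(5), below is a minimal ConvLSTM cell sketch in PyTorch. For simplicity it merges the input-to-state and state-to-state convolutions of the four gates into one convolution and omits the peephole terms $W_{ci}$, $W_{cf}$, $W_{co}$, so it is a simplified illustration rather than the exact layer used in the proposed network.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Sketch of a ConvLSTM cell in the spirit of equations (1)-(5), without peepholes."""
    def __init__(self, in_ch: int, hid_ch: int, kernel: int = 3):
        super().__init__()
        self.hid_ch = hid_ch
        # one convolution over [X_t, H_{t-1}] produces all four gate pre-activations
        self.conv = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch,
                              kernel_size=kernel, padding=kernel // 2)

    def forward(self, x, state):
        h_prev, c_prev = state                           # H_{t-1}, C_{t-1}
        gates = self.conv(torch.cat([x, h_prev], dim=1))
        i, f, o, g = torch.split(gates, self.hid_ch, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c_prev + i * torch.tanh(g)               # cf. eq. (3)
        h = o * torch.tanh(c)                            # cf. eq. (5)
        return h, c

# run the cell over a toy sequence of K = 5 encoded feature maps of size 4 x 8
cell = ConvLSTMCell(in_ch=512, hid_ch=512)
h = torch.zeros(1, 512, 4, 8)
c = torch.zeros(1, 512, 4, 8)
for x_t in torch.randn(5, 1, 512, 4, 8):
    h, c = cell(x_t, (h, c))
print(h.shape)  # torch.Size([1, 512, 4, 8]); the last hidden state goes to the decoder
```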
ConvGRU (Ballas et al., 2016) further lightens the computational complexity by removing one gate structure, yet can perform similarly to, or slightly better than, traditional RNNs or even ConvLSTM. The procedure of computing the different gates and hidden states/outputs of ConvGRU is given by equations (6)-(9), in which the symbols have the same meaning as described before, while the additional $z_t$ and $r_t$ denote the update gate and the reset gate, respectively, and $\tilde{\mathcal{H}}_t$ represents the current candidate hidden representation.

$$z_t = \sigma(W_{zx} * X_t + W_{zh} * \mathcal{H}_{t-1} + b_z) \quad (6)$$
$$r_t = \sigma(W_{rx} * X_t + W_{rh} * \mathcal{H}_{t-1} + b_r) \quad (7)$$
$$\tilde{\mathcal{H}}_t = \tanh(W_{ox} * X_t + W_{oh} * (r_t \odot \mathcal{H}_{t-1}) + b_o) \quad (8)$$
$$\mathcal{H}_t = z_t \odot \tilde{\mathcal{H}}_t + (1 - z_t) \odot \mathcal{H}_{t-1} \quad (9)$$
In ConvGRU, there are only two gate structures, i.e., the update gate $z_t$ and the reset gate $r_t$. The update gate $z_t$ decides how to update the hidden representation when generating the ultimate result $\mathcal{H}_t$ at the current layer, as shown in equation (9), while the reset gate $r_t$ controls to what extent the feature information captured in the previous hidden state should be forgotten, through an element-wise multiplication when computing the current candidate hidden representation. From the equations, it can be concluded that the information of $\mathcal{H}_t$ mainly comes from $\tilde{\mathcal{H}}_t$, while $\mathcal{H}_{t-1}$, the previous hidden-state representation, also contributes to computing the final representation of $\mathcal{H}_t$; thus the temporal dependencies are captured.
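A matching minimal sketch of a ConvGRU cell following equations (6)-(9) is given below; the layer names and sizes are illustrative assumptions and do not mirror the exact implementation.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Sketch of a ConvGRU cell following equations (6)-(9)."""
    def __init__(self, in_ch: int, hid_ch: int, kernel: int = 3):
        super().__init__()
        pad = kernel // 2
        self.conv_zr = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, kernel, padding=pad)
        self.conv_h = nn.Conv2d(in_ch + hid_ch, hid_ch, kernel, padding=pad)
        self.hid_ch = hid_ch

    def forward(self, x, h_prev):
        zr = torch.sigmoid(self.conv_zr(torch.cat([x, h_prev], dim=1)))
        z, r = torch.split(zr, self.hid_ch, dim=1)                        # eqs. (6), (7)
        h_cand = torch.tanh(self.conv_h(torch.cat([x, r * h_prev], dim=1)))  # eq. (8)
        return z * h_cand + (1 - z) * h_prev                              # eq. (9)

# aggregate K = 5 encoded feature maps into one spatial-temporal representation
cell = ConvGRUCell(in_ch=512, hid_ch=512)
h = torch.zeros(1, 512, 4, 8)
for x_t in torch.randn(5, 1, 512, 4, 8):
    h = cell(x_t, h)
print(h.shape)  # torch.Size([1, 512, 4, 8])
```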
In practice, both ConvLSTM and ConvGRU with different numbers of hidden layers were employed to serve as the ST-RNN module in the proposed architecture, and the corresponding performances were evaluated. To be specific, in the proposed network, the input and output sizes of the ST-RNN block are equal to the feature map size produced by the encoder, which is 8 × 16 for the UNet-based backbone and 4 × 8 for the SegNet-based backbone. The convolutional kernel size in ConvLSTM and ConvGRU is 3 × 3, and the dimension of each hidden layer is 512. The detailed implementations are described in the following section.

2.3 Detailed implementation

1) Network Design Details: The proposed spatial-temporal sequence-to-one neural network was developed for the lane detection task with K (in this paper K = 5, if not specified otherwise) continuous image frames as inputs. The image frames are first fed into the encoder for feature extraction and abstraction. Different from a normal CNN-based encoder, an SCNN layer is utilized to effectively extract the spatial relationships within every image. Different locations of the SCNN layer were tested, i.e., embedding the SCNN layer after the first hidden convolutional layer or at the very beginning. The outputs of the encoder network are modelled in a time-series manner and fed into the ST-RNN blocks (i.e., ConvLSTM or ConvGRU layers) to further extract more useful and accurate features, especially the spatial-temporal dependencies and correlations among the different image frames. In short, the encoder network is primarily responsible for spatial feature extraction and abstraction, transforming input images into specified feature maps, while the ST-RNN blocks accept the extracted features from the continuous image frames in a time-series manner to capture the spatial-temporal dependencies.

The outputs of the ST-RNN blocks are then transferred into the decoder network, which adopts deconvolution and upsampling operations to highlight and make full use of the features and to rebuild the target to the original size of the input image. Note that there is a skip concatenation connection (for the UNet-based architecture) or pooling indices reuse (for the SegNet-based architecture) between the encoder and decoder to reuse the retained features from previous encoder layers for more accurate predictions at the decoder phase. After the decoder phase, the lane detection result is obtained as an image of the same size as the input image frame. With the labelled ground truth and the help of the encoder-decoder structure, the proposed model can be trained and implemented in an end-to-end way. The detailed input and output sizes, together with the parameters of the layers in the entire neural network, are listed in Appendix Table A1 and Table A2.
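The overall sequence-to-one data flow can be summarized in the hedged PyTorch sketch below: a shared encoder processes each of the K frames, an ST-RNN block (here a ConvGRU-style update written inline, cf. equations (6)-(9)) integrates the frame features over time, and a decoder maps the final hidden state back to a full-resolution lane mask for the last frame. The toy encoder/decoder, the layer sizes, and the omission of the SCNN layers and skip connections are simplifications for illustration, not the authors' exact network.

```python
import torch
import torch.nn as nn

class SeqToOneLaneNet(nn.Module):
    """Toy sequence-to-one encoder -> ST-RNN -> decoder wiring (illustrative only)."""
    def __init__(self, hid_ch: int = 64):
        super().__init__()
        # shared per-frame encoder: 3 x 128 x 256 input -> hid_ch x 4 x 8 feature map
        self.encoder = nn.Sequential(
            nn.Conv2d(3, hid_ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(hid_ch, hid_ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(hid_ch, hid_ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(hid_ch, hid_ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(hid_ch, hid_ch, 3, stride=2, padding=1), nn.ReLU(),
        )
        # inline ConvGRU-style ST-RNN block
        self.conv_zr = nn.Conv2d(2 * hid_ch, 2 * hid_ch, 3, padding=1)
        self.conv_h = nn.Conv2d(2 * hid_ch, hid_ch, 3, padding=1)
        # decoder: upsample the integrated features back to 128 x 256 with 2 classes
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=32, mode='bilinear', align_corners=False),
            nn.Conv2d(hid_ch, 2, 3, padding=1),
        )
        self.hid_ch = hid_ch

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (N, K, 3, 128, 256); only the last frame's lanes are predicted
        n, k = frames.shape[:2]
        h = None
        for t in range(k):
            x = self.encoder(frames[:, t])                       # (N, C, 4, 8)
            if h is None:
                h = torch.zeros_like(x)
            zr = torch.sigmoid(self.conv_zr(torch.cat([x, h], 1)))
            z, r = torch.split(zr, self.hid_ch, dim=1)
            h_cand = torch.tanh(self.conv_h(torch.cat([x, r * h], 1)))
            h = z * h_cand + (1 - z) * h
        return self.decoder(h)                                   # (N, 2, 128, 256)

model = SeqToOneLaneNet()
logits = model(torch.randn(2, 5, 3, 128, 256))
print(logits.shape)  # torch.Size([2, 2, 128, 256])
```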
For both the SegNet-based and UNet-based implementations, two types of RNN layers, i.e., ConvLSTM and ConvGRU, were tested to serve as the ST-RNN block. Besides, the ST-RNN blocks were tested with 1 hidden layer and 2 hidden layers, respectively. There are therefore four variants of the proposed SegNet-based models, i.e., SCNN_SegNet_ConvGRU1, SCNN_SegNet_ConvGRU2, SCNN_SegNet_ConvLSTM1, and SCNN_SegNet_ConvLSTM2. SCNN_SegNet_ConvGRU1 means the model uses SegNet as the backbone with an SCNN-layer-embedded encoder, and 1 hidden layer of ConvGRU as the ST-RNN block. This naming rule applies to the other 3 variants. Likewise, there are four variants of the proposed UNet-based models, with a similar naming rule.

In the proposed models with U-Net as the backbone, the number of kernels used in the last convolutional block of the encoder differs from the original U-Net settings. Here, the number of output kernels (channels) of the last convolutional block in the proposed encoder does not double its input kernels, as is the case in all the previous convolutional blocks. This is done, similar to (Zou et al., 2020), to better connect the output of the encoder with the ST-RNN block (ConvLSTM or ConvGRU layers). To do so, the parameters of the full-connection layer are designed to be quadrupled while the side lengths of the feature maps are reduced to half; at the same time, the number of kernels remains unchanged. This strategy also contributes somewhat to reducing the parameter size of the whole network.

A modified light version of UNet (UNetLight) was also tested to serve as the network backbone to reduce the total parameter size, increase the model's ability to operate in real time, and further verify the proposed network architecture's effectiveness. UNetLight has a network design similar to the demonstration in Table A2. The only difference is that all the numbers of kernels in the ConvBlocks are reduced to half, except for the Input in In_ConvBlock (with the input channel of 3 unchanged) and the Output in Out_ConvBlock (with the output channel of 2 unchanged). To save space, the parameter settings of the UNetLight based implementation are not illustrated.

2) Loss function: Since lane detection is modeled as a segmentation task and a pixel-wise binary classification problem, cross-entropy is a suitable candidate to serve as the loss function. However, because the pixels classified as lanes are always far fewer than those classified as background (meaning that it is an imbalanced binary classification and discriminative segmentation task), in the implementation the loss was built upon the weighted cross-entropy. The adopted loss function, the standard weighted binary cross-entropy, is given in equation (10),

$$Loss = -\frac{1}{S}\sum_{i=1}^{S}\left[w \cdot y_i \cdot \log\left(h_\theta(x_i)\right) + (1-y_i) \cdot \log\left(1-h_\theta(x_i)\right)\right] \quad (10)$$

where $S$ is the number of training examples, $w$ stands for the weight, which is set according to the ratio between the total lane pixel quantity and non-lane pixel quantity throughout the whole training set, $y_i$ is the true target label for training example $i$, $x_i$ is the input for training example $i$, and $h_\theta$ stands for the model with neural network weights $\theta$.
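A minimal sketch of this weighted binary cross-entropy, written as a pixel-wise PyTorch loss with the lane/background weight estimated from the training labels, is shown below; the function names and the way the weight is passed are illustrative assumptions.

```python
import torch

def lane_weight(masks: torch.Tensor) -> float:
    """Estimate w as (# non-lane pixels) / (# lane pixels) over the training labels."""
    lane = masks.sum()
    return float((masks.numel() - lane) / lane.clamp(min=1))

def weighted_bce(prob_lane: torch.Tensor, target: torch.Tensor, w: float) -> torch.Tensor:
    """Pixel-wise weighted binary cross-entropy in the spirit of equation (10)."""
    eps = 1e-7
    prob_lane = prob_lane.clamp(eps, 1 - eps)
    loss = -(w * target * torch.log(prob_lane)
             + (1 - target) * torch.log(1 - prob_lane))
    return loss.mean()

# toy usage: predicted lane probabilities and binary ground-truth masks of shape (N, 1, H, W)
probs = torch.rand(4, 1, 128, 256)
labels = (torch.rand(4, 1, 128, 256) > 0.95).float()   # sparse lane pixels
print(weighted_bce(probs, labels, w=lane_weight(labels)))
```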
3) Training details: The proposed neural networks with their different variants, together with the baseline models, were trained on the Dutch high-performance supercomputer clusters Cartesius and Lisa, using 4 Titan RTX GPUs with the data-parallel mechanism in PyTorch. The input image size was set to 128 × 256 to reduce the computational load. The batch size was set as large as possible (e.g., 64 for the UNet-based network architecture, 100 for the SegNet-based ones, and 136 for the UNetLight-based ones), and the learning rate was initially set to 0.03. The RAdam optimizer (Liu et al., 2019) was used at the beginning of training. At the later stage, when the training accuracy was beyond 95%, the optimizer was switched to the Stochastic Gradient Descent (SGD) (Bottou, 2010) optimizer with decay. With the labelled ground truth, the models were trained by iteratively updating the parameters in the weight matrices based on the deviation between the outputs of the proposed neural network and the ground truth, using the backpropagation mechanism. To speed up the training process, the pre-trained weights of SegNet and U-Net on ImageNet (Deng et al., 2009) were adopted.
unchanged. This strategy also somewhat contributes to 3 EXPERIMENTS AND RESULTS
reducing the parameter size of the whole network.
Extensive experiments were carried out to inspect and
A modified light version of UNet (UNetLight) was also
verify the accuracy, effectiveness, and robustness of the
tested to serve as the network backbone to reduce the total
proposed lane detection model using two large-scale open-
parameter size, increase the model’s ability to operate in real-
sourced datasets. The proposed models were evaluated on
time, and also further verify the proposed network
different driving scenes and were compared with several state-
architecture’s effectiveness. The UNetLight has a similar
of-the-art baseline lane detection methods which also employ
network design to the demonstration in Table A2. The only
deep learning, e.g., U-Net (Ronneberger et al., 2015), Seg-Net
difference is that all the numbers of kernels in the ConvBlocks
(Badrinarayanan et al., 2017), SCNN (Pan et al., 2018),
are reduced to half except for the Input in In_ConvBlock (with
LaneNet (Neven et al., 2018), UNet_ConvLSTM (Zou et al.,
the input channel of 3 unchanged) and Output in
2020), and SegNet_ConvLSTM (Zou et al., 2020).
Out_ConvBlock (with the output channel of 2 unchanged). To
save space, the parameter settings of UNetLight based 3.1 Datasets
implementation will not be illustrated. 1) tvtLANE training set: To verify the proposed model
2) Loss function: Since the lane detection is modeled as a performance, the tvtLANE dataset (Zou et al., 2020) based
segmentation task and a pixel-wise binary classification upon the TuSimple lane marking challenge dataset, was first
problem, cross-entropy is a suitable candidate to serve as the utilized for training, validating, and testing. The original
loss function. However, because the pixels classified to be dataset of the TuSimple lane marking challenge includes 3,626
lanes are always quite less than those classified to be the clips of training and 2,782 clips of testing which are collected
background (meaning that it is an imbalanced binary under various weather conditions and during different periods.
classification and discriminative segmentation task), in the In each clip, there are 20 continuous frames saved in the same
implementation, the loss was built upon the weighted cross- folder. In each clip, only the lane marking lines of the very last
entropy. The adopted loss function as the standard weighted frame, i.e., the 20th frame, are labelled with the ground truth
binary cross-entropy function is given as in equation (10), officially. Zou et al. (2020) additionally labelled every 13th
1 image in each clip and added their own collected lane dataset
𝐿𝑜𝑠𝑠 = − ∑𝑆𝑖=1[𝑤 ∗ 𝑦𝑖 ∗ 𝑙𝑜𝑔(ℎ𝜃 (𝑥𝑖 )) + (1-𝑦𝑖 ) ∗ 𝑙𝑜𝑔(1 −
𝑆
DONG ET AL. 7
which includes 1,148 sequences of rural driving scenes TABLE 1. Trainset and testset in tvtLANE.
collected in China. This immensely expanded the variety of the
road and driving conditions since the original TuSimple Trainset
dataset only covers the highway driving conditions. K Subset Labled Images Num
continuous frames of each clip are used as the inputs with the Original TuSimple Dataset (Highway) 7,252
ground truth of the labelled 13th or 20th frame to train the Zou et al. (2020) added (Rural Road) 2,296
models. Sample Methods
To further augment the training dataset, crop, flip, and Sample
Labled Ground Truth Train Sample Frames
Stride
rotation operations were employed, thus a total number of
3 1st, 4th, 7th, 10th, 13th
(3,626 + 1,148) × 4 = 19,096 continuous sequences were th
13 2 5th, 7th, 9th, 11th, 13th
produced, in which 38,192 images are labelled with ground
1 9th, 10th, 11th, 12th, 13th
truth. To adapt to different driving speeds, the input image
3 8th, 11th, 14th, 17th, 20th
sequences were sampled at 3 strides with a frame interval of 1,
20th 2 12th, 14th, 16th, 18th,20th
2, or 3, respectively. Then, 3 sampling methods were employed
1 16th, 17th, 18th, 19th,20th
to construct the training samples regarding the labelled 13th
Testset
and 20th frames in each sequence, as demonstrated in Table 1. Labled Labled
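The stride-based construction of 5-frame inputs ending at a labelled frame (Table 1) can be expressed as a small helper; the function name and the representation of a clip as a list of frame indices are assumptions for illustration.

```python
def sample_frames(labelled_idx: int, stride: int, length: int = 5) -> list[int]:
    """Return `length` frame indices ending at the labelled frame, spaced by `stride`."""
    return [labelled_idx - stride * i for i in range(length - 1, -1, -1)]

# the three sampling methods for a clip whose 20th frame is labelled (cf. Table 1)
for s in (3, 2, 1):
    print(sample_frames(labelled_idx=20, stride=s))
# [8, 11, 14, 17, 20]
# [12, 14, 16, 18, 20]
# [16, 17, 18, 19, 20]
```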
2) tvtLANE testing set: Two different test sets were used, i.e., Testset #1 (normal) and Testset #2 (challenging), which are also formatted with 5 continuous images as the input to detect the lane markings in the very last frame with the labelled ground truth. To be specific, Testset #1 is built upon the original TuSimple test set for normal driving scene testing, while Testset #2 is constructed from 12 challenging driving situations and is especially used for robustness evaluation. The detailed descriptions of the trainset and testset in tvtLANE are illustrated in Table 1, with examples shown in Figure 2.

TABLE 1. Trainset and testset in tvtLANE.

Trainset
Subset | Labelled Images Num
Original TuSimple Dataset (Highway) | 7,252
Zou et al. (2020) added (Rural Road) | 2,296

Sample Methods
Labelled Ground Truth | Sample Stride | Train Sample Frames
13th | 3 | 1st, 4th, 7th, 10th, 13th
13th | 2 | 5th, 7th, 9th, 11th, 13th
13th | 1 | 9th, 10th, 11th, 12th, 13th
20th | 3 | 8th, 11th, 14th, 17th, 20th
20th | 2 | 12th, 14th, 16th, 18th, 20th
20th | 1 | 16th, 17th, 18th, 19th, 20th

Testset
Subset | Labelled Images Num | Labelled Ground Truth | Sample Stride | Test Sample Frames
Testset #1 Normal | 540 | 13th | 1 | 9th, 10th, 11th, 12th, 13th
Testset #1 Normal | 540 | 20th | 1 | 16th, 17th, 18th, 19th, 20th
Testset #2 Challenging | 728 | All | 1 | 1st, 2nd, 3rd, 4th, 5th; 2nd, 3rd, 4th, 5th, 6th; 3rd, 4th, 5th, 6th, 7th; ⋯

[FIGURE 3. Qualitative evaluation: visualization of the lane-detection results on (1) tvtLANE Testset #1 and (2) tvtLANE Testset #2 (challenging situations). Rows: (a) input images, (b) ground truth, baseline models (c) SegNet, (d) UNet, (e) SegNet_ConvLSTM, (f) UNet_ConvLSTM, and (g)-(r) the proposed model variants.]
2) tvtLANE Testset #2: 12 challenging driving cases

Figure 3(2) shows the comparison of the proposed models with the baseline models under some extremely challenging driving scenes in tvtLANE Testset #2. None of the results are post-processed. These challenging scenes cover a wide range of situations, including serious vehicle occlusion, bad lighting conditions (e.g., shadow, dim light), tunnels, and dirt road conditions. In some extremely challenging cases, the lanes are totally occluded by vehicles, other objects, and/or shadows, which could be very difficult even for humans to detect.

As can be observed in Figure 3(2), although all the baseline models fail in these challenging cases, the proposed models, especially the one named SCNN_SegNet_ConvLSTM2 illustrated in row (k), could still deliver good predictions in almost every situation listed in Figure 3(2). The only flaw is that in the 3rd column, where vehicle occlusion and a blurry road surface occur simultaneously, the proposed models also find it hard to predict precisely. The results in the 4th, 7th, and 8th columns further verify the robustness of SCNN_SegNet_ConvLSTM2 in detecting the correct number of lane lines; in particular, one can observe in the 4th column, where almost all the other models are defeated, that SCNN_SegNet_ConvLSTM2 can still predict the correct number of lanes.

Furthermore, it should be noted that correct lane location predictions in these challenging situations are of vital importance for safe driving. For example, regarding the situation in the last column, where a heavy vehicle totally shadows the field of vision on the left side, it would be very dangerous if the automated vehicle were driving according to the lane detection results demonstrated in the 3rd to 5th rows.

3.2 Quantitative evaluation

1) Evaluation metrics: This subsection examines the proposed models' properties with quantitative evaluation metrics, i.e., the testing Accuracy, Precision, Recall, and F-measure, which are computed from the pixel-wise true positives, false positives, and false negatives. The F-measure is defined as in equation (14):

$$F\text{-measure} = (1+\beta^2)\,\frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}} \quad (14)$$

Here, the true positives are the number of image pixels that are lane markings and are correctly identified; the false positives are the number of image pixels that are background but are wrongly classified as lane markings; and the false negatives are the number of image pixels that are lane markings but are wrongly classified as background. Specifically, this study chooses β = 1, which corresponds to the F1-measure (harmonic mean) shown in equation (15).

$$F1\text{-measure} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \quad (15)$$

The F1-measure, which balances Precision and Recall, is always selected as the main benchmark for model evaluation, e.g., (Liu et al., 2021; Pan et al., 2018; Xu et al., 2020; Zhang et al., 2021; Zou et al., 2020).

Furthermore, the model parameter size, i.e., Params (M), together with the multiply-accumulate (MAC) operations, i.e., MACs (G), are provided as indicators of model complexity. The two indicators are commonly used to estimate models' computational complexities and real-time capabilities.
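The pixel-wise metrics can be computed from a predicted binary lane mask and the ground truth as in the short sketch below (β = 1); the helper name and return format are illustrative, and the accuracy shown here is plain pixel accuracy, which may differ from the exact accuracy definition used in the paper.

```python
import torch

def lane_metrics(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-9) -> dict:
    """Pixel-wise Accuracy, Precision, Recall, and F1 (equations (14)-(15) with beta = 1)."""
    pred, target = pred.bool(), target.bool()
    tp = (pred & target).sum().item()
    fp = (pred & ~target).sum().item()
    fn = (~pred & target).sum().item()
    tn = (~pred & ~target).sum().item()
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# toy usage with binary masks of shape (N, H, W)
pred = torch.randint(0, 2, (2, 128, 256))
gt = torch.randint(0, 2, (2, 128, 256))
print(lane_metrics(pred, gt))
```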
2) Performance and comparisons on tvtLANE Testset #1 (normal situations)

As shown in Table 2, the proposed model SCNN_UNet_ConvLSTM2 performs the best when evaluated on tvtLANE Testset #1, with the highest Accuracy and F1-Measure, while the proposed model SCNN_SegNet_ConvLSTM2 delivers the best Precision.

TABLE 2. Model performance comparison on tvtLANE Testset #1 (normal situations)

Models | Test_Acc (%) | Precision | Recall | F1-Measure | MACs (G) | Params (M)
Baseline models using a single image as input:
U-Net | 96.54 | 0.790 | 0.985 | 0.877 | 15.5 | 13.4
SegNet | 96.93 | 0.796 | 0.962 | 0.871 | 50.2 | 29.4
SCNN* | 96.79 | 0.654 | 0.808 | 0.722 | 77.7 | 19.2
LaneNet* | 97.94 | 0.875 | 0.927 | 0.901 | 44.5 | 19.7
Baseline models using continuous images as input:
SegNet_ConvLSTM** | 97.92 | 0.874 | 0.931 | 0.901 | 217.0 | 67.2
UNet_ConvLSTM** | 98.00 | 0.857 | 0.958 | 0.904 | 69.0 | 51.1

* Results reported in (Zhang et al., 2021).
** There are two hidden layers of ConvLSTM in SegNet_ConvLSTM and UNet_ConvLSTM.
TABLE 3. Model performance comparison on tvtLANE Testset #2 (12 types of challenging scenes)

PRECISION
Models | 1-curve&occlude | 2-shadow-bright | 3-bright | 4-occlude | 5-curve | 6-dirty&occlude | 7-urban | 8-blur&curve | 9-blur | 10-shadow-dark | 11-tunnel | 12-dim&occlude | overall
U-Net | 0.7018 | 0.7441 | 0.6717 | 0.6517 | 0.7443 | 0.3994 | 0.4422 | 0.7612 | 0.8523 | 0.7881 | 0.7009 | 0.5968 | 0.6754
SegNet | 0.6810 | 0.7067 | 0.5987 | 0.5132 | 0.7738 | 0.2431 | 0.3195 | 0.6642 | 0.7091 | 0.7499 | 0.6225 | 0.6463 | 0.6080
UNet_ConvLSTM | 0.7591 | 0.8292 | 0.7971 | 0.6509 | 0.8845 | 0.4513 | 0.5148 | 0.8290 | 0.9484 | 0.9358 | 0.7926 | 0.8402 | 0.7784
SegNet_ConvLSTM | 0.8176 | 0.8020 | 0.7200 | 0.6688 | 0.8645 | 0.5724 | 0.4861 | 0.7988 | 0.8378 | 0.8832 | 0.7733 | 0.8052 | 0.7563
SCNN_SegNet_ConvGRU1 | 0.8107 | 0.7951 | 0.7225 | 0.6830 | 0.8503 | 0.4640 | 0.5071 | 0.6699 | 0.8481 | 0.8994 | 0.7804 | 0.8429 | 0.7477
SCNN_SegNet_ConvGRU2 | 0.7952 | 0.8087 | 0.7770 | 0.6444 | 0.8689 | 0.5067 | 0.5171 | 0.7147 | 0.8423 | 0.8744 | 0.7979 | 0.8757 | 0.7572
SCNN_SegNet_ConvLSTM1 | 0.7945 | 0.8078 | 0.7600 | 0.6417 | 0.8525 | 0.5252 | 0.3686 | 0.7582 | 0.7715 | 0.8702 | 0.7778 | 0.8517 | 0.7348
SCNN_SegNet_ConvLSTM2 | 0.8326 | 0.7497 | 0.7470 | 0.7369 | 0.8647 | 0.6196 | 0.4333 | 0.7371 | 0.8566 | 0.9125 | 0.8153 | 0.8466 | 0.7673
SCNN_UNet_ConvGRU1 | 0.8492 | 0.8306 | 0.8163 | 0.7845 | 0.8819 | 0.4025 | 0.4493 | 0.7378 | 0.8291 | 0.8928 | 0.8198 | 0.8040 | 0.7639
SCNN_UNet_ConvGRU2 | 0.8678 | 0.7873 | 0.8548 | 0.7654 | 0.8805 | 0.5319 | 0.4735 | 0.8064 | 0.8765 | 0.8431 | 0.7112 | 0.7388 | 0.7640
SCNN_UNet_ConvLSTM1 | 0.8602 | 0.7844 | 0.8119 | 0.7807 | 0.8871 | 0.4066 | 0.4652 | 0.7445 | 0.8321 | 0.8972 | 0.7507 | 0.7068 | 0.7531
SCNN_UNet_ConvLSTM2 | 0.8182 | 0.8362 | 0.8189 | 0.7359 | 0.8365 | 0.5872 | 0.5377 | 0.8046 | 0.8770 | 0.8722 | 0.7952 | 0.7817 | 0.7784
SCNN_UNetLight_ConvGRU1 | 0.8212 | 0.7454 | 0.7189 | 0.6996 | 0.8521 | 0.3499 | 0.3999 | 0.7851 | 0.7282 | 0.8686 | 0.6940 | 0.6289 | 0.7011
SCNN_UNetLight_ConvGRU2 | 0.8147 | 0.8349 | 0.7390 | 0.7004 | 0.8591 | 0.4039 | 0.3360 | 0.6811 | 0.8300 | 0.8533 | 0.8125 | 0.7996 | 0.7238
SCNN_UNetLight_ConvLSTM1 | 0.7222 | 0.7450 | 0.6533 | 0.6203 | 0.8039 | 0.2635 | 0.2716 | 0.7341 | 0.7546 | 0.7319 | 0.6298 | 0.7406 | 0.6377
SCNN_UNetLight_ConvLSTM2 | 0.7618 | 0.7416 | 0.7067 | 0.6537 | 0.8096 | 0.1921 | 0.2639 | 0.6857 | 0.6830 | 0.6931 | 0.6391 | 0.6022 | 0.6190

F1-MEASURE
Models | 1-curve&occlude | 2-shadow-bright | 3-bright | 4-occlude | 5-curve | 6-dirty&occlude | 7-urban | 8-blur&curve | 9-blur | 10-shadow-dark | 11-tunnel | 12-dim&occlude | overall
U-Net | 0.8200 | 0.8408 | 0.7946 | 0.7337 | 0.7827 | 0.3698 | 0.5658 | 0.8147 | 0.7715 | 0.6619 | 0.5740 | 0.4646 | 0.6985
SegNet | 0.8042 | 0.7900 | 0.7023 | 0.6127 | 0.8639 | 0.2110 | 0.4267 | 0.7396 | 0.7286 | 0.7675 | 0.6935 | 0.5822 | 0.6727
UNet_ConvLSTM | 0.8465 | 0.8891 | 0.8411 | 0.7245 | 0.8662 | 0.2417 | 0.5682 | 0.8323 | 0.7852 | 0.6404 | 0.4741 | 0.5718 | 0.7143
SegNet_ConvLSTM | 0.8852 | 0.8544 | 0.7688 | 0.6878 | 0.9069 | 0.4128 | 0.5317 | 0.7873 | 0.7575 | 0.8503 | 0.7865 | 0.7947 | 0.7609
SCNN_SegNet_ConvGRU1 | 0.8821 | 0.8626 | 0.7734 | 0.7185 | 0.9039 | 0.3027 | 0.5288 | 0.7229 | 0.7866 | 0.8658 | 0.7759 | 0.7763 | 0.7547
SCNN_SegNet_ConvGRU2 | 0.8710 | 0.8630 | 0.8094 | 0.6989 | 0.9005 | 0.3963 | 0.5497 | 0.7470 | 0.7637 | 0.8525 | 0.7798 | 0.7396 | 0.7591
SCNN_SegNet_ConvLSTM1 | 0.8768 | 0.8801 | 0.8185 | 0.7166 | 0.9083 | 0.3750 | 0.4516 | 0.7806 | 0.7320 | 0.8622 | 0.8029 | 0.8245 | 0.7629
SCNN_SegNet_ConvLSTM2 | 0.8956 | 0.8237 | 0.7909 | 0.7468 | 0.9108 | 0.4398 | 0.4858 | 0.7379 | 0.7546 | 0.8729 | 0.7963 | 0.8074 | 0.7666
SCNN_UNet_ConvGRU1 | 0.8608 | 0.8745 | 0.8393 | 0.7802 | 0.9005 | 0.3181 | 0.5143 | 0.7833 | 0.7567 | 0.5554 | 0.3503 | 0.3703 | 0.6839
SCNN_UNet_ConvGRU2 | 0.8706 | 0.8556 | 0.8304 | 0.7647 | 0.8532 | 0.3515 | 0.5253 | 0.8345 | 0.7399 | 0.5405 | 0.3567 | 0.2855 | 0.6722
SCNN_UNet_ConvLSTM1 | 0.8971 | 0.8493 | 0.8234 | 0.7633 | 0.8997 | 0.3054 | 0.5307 | 0.7424 | 0.7436 | 0.6243 | 0.5568 | 0.5366 | 0.6992
SCNN_UNet_ConvLSTM2 | 0.8670 | 0.8866 | 0.8405 | 0.7565 | 0.7955 | 0.4179 | 0.5933 | 0.7880 | 0.7285 | 0.6296 | 0.4747 | 0.4134 | 0.7024
SCNN_UNetLight_ConvGRU1 | 0.8896 | 0.8212 | 0.7819 | 0.7517 | 0.8913 | 0.3043 | 0.4961 | 0.8133 | 0.7000 | 0.5635 | 0.3086 | 0.2733 | 0.6637
SCNN_UNetLight_ConvGRU2 | 0.8593 | 0.8730 | 0.7878 | 0.7406 | 0.8889 | 0.3335 | 0.4266 | 0.7263 | 0.7782 | 0.6498 | 0.5280 | 0.5257 | 0.6910
SCNN_UNetLight_ConvLSTM1 | 0.8115 | 0.8056 | 0.7168 | 0.6882 | 0.8179 | 0.2613 | 0.3681 | 0.7834 | 0.7576 | 0.5701 | 0.5281 | 0.5081 | 0.6418
SCNN_UNetLight_ConvLSTM2 | 0.8377 | 0.8158 | 0.7620 | 0.6971 | 0.8365 | 0.2209 | 0.3577 | 0.7551 | 0.6594 | 0.4597 | 0.3545 | 0.3559 | 0.6079
Incorporating the quantitative evaluation with the qualitative evaluation, it can be readily interpreted that the highest Precision, Accuracy, and F1-Measure are mainly derived from (i) the correct lane number, (ii) the accurate lane position, (iii) the sound continuity of the detected lanes, and (iv) the thinness of the predicted lanes with less blurriness, which accords with (ii). Correct predictions directly reduce the number of False Positives, and a good Precision contributes to better Accuracy and F1-Measure. Considering the structure of the proposed model architecture, a further explanation of the high F1-Measure, Accuracy, and Precision is as follows:

Firstly, the SCNN layer embedded in the encoder equips the proposed model with a better ability to extract the low-level features and spatial relations in each image.

Secondly, the ST-RNN blocks, i.e., the ConvLSTM / ConvGRU layers, can effectively capture the temporal dependencies among the continuous image frames, which is very helpful for challenging situations where the lanes are shadowed or covered by other objects in the current frame.

Finally, the proposed architecture can make the best of the spatial-temporal information among the processed K continuous frames by regulating the weights of the convolutional kernels within the SCNN and ConvLSTM / ConvGRU layers.

All in all, with the proposed architecture the model strengthens not only the feature extraction regarding spatial relations in one image frame but also the spatial-temporal correlations and dependencies among image frames for lane detection.

The comparatively high Recall of U-Net and SegNet can be speculated from the qualitative evaluation, where one can find that U-Net and SegNet tend to produce thicker lane lines. With thicker lines and blurry areas, the two models can somewhat reduce the False Negatives, which contributes to better Recall. This also demonstrates that Recall and Precision antagonize each other, which further supports that F1-Measure is a more reasonable evaluation measure compared with Precision and Recall alone.

3) Performance and comparisons on tvtLANE Testset #2 (challenging situations)

To further evaluate the proposed models' performance and verify their robustness, the models were evaluated on a brand-new dataset, i.e., tvtLANE Testset #2. As introduced in 3.1 Datasets, tvtLANE Testset #2 includes 728 images of highway, urban, and rural driving scenes. These challenging driving scenes were recorded by data recorders at various locations, outside and inside the car front windshield, under different road and weather conditions. Testset #2 is thus a challenging and comprehensive dataset for model evaluation, in which some cases would be difficult even for humans to detect correctly.

Table 3 demonstrates the model performance comparison on the 12 types of challenging scenes in tvtLANE Testset #2. Following the results and discussions in 2) Performance and comparisons on tvtLANE Testset #1 (normal situations), Table 3 provides the Precision and F1-Measure for the evaluation reference.

As indicated by the best values in Table 3, the proposed model SCNN_SegNet_ConvLSTM2 results in the best F1-Measure at the overall level and in more situations, while UNet_ConvLSTM results in the best Precision at the overall level and in more situations. Incorporating the qualitative evaluation in Figure 3(2), it is shown that UNet_ConvLSTM tends not to classify pixels into lane lines in uncertain areas under some challenging situations (e.g., the 2nd and 7th columns in Figure 3(2)). This might be the reason for its better Precision. To further confirm this speculation, Figure 4 compares the lane detection results of SCNN_SegNet_ConvLSTM2 and UNet_ConvLSTM under the challenging situations 8-blur&curve and 10-shadow-dark, where UNet_ConvLSTM delivers very good Precision.

[FIGURE 4. Visual comparison of the lane-detection results on challenging driving situations for UNet_ConvLSTM and the proposed model SCNN_SegNet_ConvLSTM2. All the results are not post-processed. (a) Input images. (b) Ground truth. (c) Detection results of UNet_ConvLSTM. (d) Detection results of UNet_ConvLSTM overlapping on the original images. (e) Detection results of SCNN_SegNet_ConvLSTM2. (f) Detection results of SCNN_SegNet_ConvLSTM2 overlapping on the original images. The upper part (1) is for challenging situation 8-blur&curve, while the lower part (2) is for situation 10-shadow-dark.]

As illustrated in Figure 4, UNet_ConvLSTM indeed avoids classifying pixels into lane lines in uncertain areas as much as possible. This leads to fewer False Positives, which helps raise its Precision. However, in real application scenarios this is neither wise nor acceptable. On the contrary, the proposed model SCNN_SegNet_ConvLSTM2 tries to make tough but valuable detections, classifying candidate points into lane lines in the challenging uncertain areas with dirt, dark road conditions, and/or vehicle occlusions. This may lead to more False Positives and a worse Precision, but it is praiseworthy. These analyses further demonstrate that F1-Measure is a better measure compared with Precision. Finally, it can be concluded that the proposed model, SCNN_SegNet_ConvLSTM2, delivers the best performance under these challenging driving scenes.
Meanwhile, the UNet-based counterpart delivers quite good Precision and Accuracy but worse Recall, which means there are fewer False Positives but more False Negatives; this should be related to the properties of the UNet-style neural network. These results further confirm the effectiveness of the proposed model architecture.

3) Type and number of ST-RNN layers

As described in Section 2, in the proposed model architecture two types of RNNs, i.e., ConvLSTM and ConvGRU, are employed to serve in the ST-RNN block, to capture and make use of the spatial-temporal dependencies and correlations among the continuous image sequences. The number of hidden ConvLSTM and ConvGRU layers was also tested, from 1 to 2. The quantitative results are demonstrated in Table 2 and Table 3, while some intuitive qualitative insights can be drawn from Figure 3 and Figure 4.

From Table 2, it is illustrated that, in general, models adopting ConvLSTM layers in the ST-RNN block perform better than those adopting ConvGRU layers, with improved F1-measure, except for the UNetLight-based ones. This can be explained by ConvLSTM's better properties in extracting spatial-temporal features and capturing time dependencies through more control gates, and thus more parameters, compared with ConvGRU. Furthermore, from Table 2 and Table 3, it is observed that models with two hidden ST-RNN layers, for both ConvLSTM and ConvGRU, generally perform better than those with only one hidden ST-RNN layer. A possible explanation is that, with two hidden ST-RNN layers, one layer can serve for sequential feature extraction and the other can achieve spatial-temporal feature integration. The improvements of two ST-RNN layers over one are not that significant, which might be because (a) models employing one ST-RNN layer already obtain good results; and (b) since the length of the continuous image frames is only five, one ST-RNN layer might already be enough to do the spatial-temporal feature extraction, so when incorporating longer image sequences the superiority of two ST-RNN layers could be promoted. However, longer image sequences require more computational resources and longer training time, which could not be afforded at the present stage of this study. This could be a future research direction.

4) Number of parameters and real-time capability

As shown in Table 2, the two proposed candidate models, i.e., SCNN_SegNet_ConvLSTM2 and SCNN_UNet_ConvLSTM2, possess slightly more parameters compared with the baselines SegNet_ConvLSTM and UNet_ConvLSTM, respectively. However, almost all of the proposed model variants with different types and numbers of ST-RNN layers outperform the baselines, and some of them even have lower parameter sizes, e.g., SCNN_SegNet_ConvGRU1, SCNN_SegNet_ConvLSTM1, SCNN_UNet_ConvGRU1, and SCNN_UNet_ConvLSTM1. Generally speaking, a lower number of model parameters means better real-time capability.

In addition, four model variants were implemented with a modified light version of UNet, i.e., UNetLight, serving as the network backbone to reduce the total parameter size and improve the model's ability to operate in real time. The UNetLight backbone has a network design similar to UNet, whose parameter settings are demonstrated in Table A2; the only difference is that all the numbers of kernels in the ConvBlocks are reduced to half, except for the Input in In_ConvBlock (with the input channel of 3 unchanged) and the Output in Out_ConvBlock (with the output channel of 2 unchanged). From the testing results in Table 2, it is shown that the model named SCNN_UNetLight_ConvGRU2, with fewer parameters than all the baseline models, beats the baselines, exhibiting better performance regarding both Accuracy and F1-Measure. To be specific, compared with the best baseline model, i.e., UNet_ConvLSTM, SCNN_UNetLight_ConvGRU2 uses less than one-fifth of the parameter size but delivers better evaluation metrics in testing Accuracy, Precision, and F1-Measure.

Regarding the UNetLight-based models, those using ConvGRU layers in the ST-RNN block perform better than those adopting ConvLSTM. The reason could be that the light version of UNet cannot implement high-quality feature extraction and thus does not feed enough information to ConvLSTM, while ConvGRU, with fewer control gates, is more robust when the low-level features are not that fully extracted.

All these results further verify the proposed network architecture's effectiveness and strength.

4 CONCLUSION

In this paper, a novel spatial-temporal sequence-to-one model framework with a hybrid neural network architecture is proposed for robust lane detection under various normal and challenging driving scenes. This architecture integrates the single image feature extraction module with SCNN, the spatial-temporal feature integration module with ST-RNN, together with the encoder-decoder structure. The proposed architecture achieved significantly better results in comparison to baseline models that use a single frame (e.g., U-Net, SegNet, and LaneNet), as well as the state-of-the-art models adopting "CNN+RNN" structures (e.g., UNet_ConvLSTM, SegNet_ConvLSTM), with the best testing Accuracy, Precision, and F1-measure on the normal driving dataset (i.e., tvtLANE Testset #1) and the best F1-measure on the dataset of 12 challenging driving scenarios (tvtLANE Testset #2). The results demonstrate the effectiveness of strengthening spatial relation abstraction in every single image with the SCNN layer, plus the employment of multiple continuous image sequences as inputs. The results also demonstrate the proposed model architecture's ability to make the best of the spatial-temporal information in continuous image frames. Extensive experimental results show the superiority of the sequence-to-one "SCNN + ConvLSTM" over "SCNN + ConvGRU" and ordinary "CNN + ConvLSTM" regarding sequential spatial-temporal feature extraction and learning, together with target-information classification for robust lane detection. In addition, testing results of the model variants with the modified light version of
UNet (i.e., UNetLight) as the backbone demonstrate the proposed model architecture's potential regarding real-time capability.

To the best of the authors' knowledge, the proposed model is the first attempt to strengthen both the spatial relations regarding feature extraction in every image frame and the spatial-temporal correlations and dependencies among image frames for lane detection, and the extensive evaluation experiments demonstrate the strength of this proposed architecture. Therefore, it is recommended in future research to incorporate both aspects to obtain better performance.

In this paper, the challenging cases do not include night driving or rainy/wet road conditions, nor do they include situations in which the input images are defective (e.g., partly masked or blurred). There is a need to build larger test sets with comprehensive challenging situations to further validate the model's robustness. Since a large amount of unlabeled driving scene data involving various challenging cases was collected within the research group, a future research direction might be to develop semi-supervised learning methods and employ domain adaptation to label the collected data, and then open-source them to boost research in the field of robust lane detection. Furthermore, to further enhance the lane detection model, customized loss functions, pre-training techniques adopted in the image-inpainting task (e.g., masked autoencoders), plus sequential attention mechanisms could be introduced and integrated into the proposed framework.

ACKNOWLEDGMENT

This work was supported by the Applied and Technical Sciences (TTW), a subdomain of the Dutch Institute for Scientific Research (NWO), through the Project Safe and Efficient Operation of Automated and Human-Driven Vehicles in Mixed Traffic (SAMEN) under Contract 17187. The authors thank Dr. Qin Zou, Hanwen Jiang, and Qiyu Dai from Wuhan University, as well as Jiyong Zhang from Southwest Jiaotong University, for their tips in using the tvtLANE dataset.

REFERENCES

Aly, M., 2008. Real time detection of lane markers in urban streets. IEEE Intell. Veh. Symp. Proc. 7–12. https://doi.org/10.1109/IVS.2008.4621152
Andrade, D.C., Bueno, F., Franco, F.R., Silva, R.A., Neme, J.H.Z., Margraf, E., Omoto, W.T., Farinelli, F.A., Tusset, A.M., Okida, S., Santos, M.M.D., Ventura, A., Carvalho, S., Amaral, R.D.S., 2019. A Novel Strategy for Road Lane Detection and Tracking Based on a Vehicle's Forward Monocular Camera. IEEE Trans. Intell. Transp. Syst. 20, 1497–1507. https://doi.org/10.1109/TITS.2018.2856361
Badrinarayanan, V., Kendall, A., Cipolla, R., 2017. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39, 2481–2495. https://doi.org/10.1109/TPAMI.2016.2644615
Ballas, N., Yao, L., Pal, C., Courville, A., 2016. Delving deeper into convolutional networks for learning video representations, in: 4th International Conference on Learning Representations, ICLR 2016 - Conference Track Proceedings.
Bar Hillel, A., Lerner, R., Levi, D., Raz, G., 2014. Recent progress in road and lane detection: A survey. Mach. Vis. Appl. 25, 727–745. https://doi.org/10.1007/s00138-011-0404-2
Berriel, R.F., de Aguiar, E., de Souza, A.F., Oliveira-Santos, T., 2017. Ego-Lane Analysis System (ELAS): Dataset and algorithms. Image Vis. Comput. https://doi.org/10.1016/j.imavis.2017.07.005
Bottou, L., 2010. Large-scale machine learning with stochastic gradient descent, in: Proceedings of COMPSTAT 2010 - 19th International Conference on Computational Statistics, Keynote, Invited and Contributed Papers. https://doi.org/10.1007/978-3-7908-2604-3_16
Chen, S., Leng, Y., Labi, S., 2020. A deep learning algorithm for simulating autonomous driving considering prior knowledge and temporal information. Comput. Civ. Infrastruct. Eng. 35, 305–321. https://doi.org/10.1111/mice.12495
Chen, W., Wang, W., Wang, K., Li, Z., Li, H., Liu, S., 2020. Lane departure warning systems and lane line detection methods based on image processing and semantic segmentation–a review. J. Traffic Transp. Eng. (English Ed.). https://doi.org/10.1016/j.jtte.2020.10.002
Chen, Z., Liu, Q., Lian, C., 2019. PointLaneNet: Efficient end-to-end CNNs for accurate real-time lane detection. IEEE Intell. Veh. Symp. Proc. 2019-June, 2563–2568. https://doi.org/10.1109/IVS.2019.8813778
Choi, Y., Park, J.H., Jung, H., 2018. Lane Detection Using Labeling Based RANSAC Algorithm. International Journal of Computer and Information Engineering, 12(4), 245–248.
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L., 2009. ImageNet: A large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. https://doi.org/10.1109/cvprw.2009.5206848
Du, H., Xu, Z., Ding, Y., 2018. The fast lane detection of road using RANSAC algorithm, in: Advances in Intelligent Systems and Computing. https://doi.org/10.1007/978-3-319-67071-3_1
Guo, J., Wei, Z., Miao, D., 2015. Lane Detection Method Based on Improved RANSAC Algorithm, in: Proceedings - 2015 IEEE 12th International Symposium on Autonomous Decentralized Systems, ISADS 2015. https://doi.org/10.1109/ISADS.2015.24
Haris, M., Glowacz, A., 2021. Lane line detection based on object feature distillation. Electronics, 10(9), 1102. https://doi.org/10.3390/electronics10091102
Hochreiter, S., Schmidhuber, J., 1997. Long Short-Term Memory. Neural Computation, 9(8), 1735–1780.
Hou, Y., Ma, Z., Liu, C., Hui, T.W., Loy, C.C., 2020. Inter-Region Affinity Distillation for Road Marking Segmentation. Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. 12483–12492. https://doi.org/10.1109/CVPR42600.2020.01250
Jiao, X., Yang, D., Jiang, K., Yu, C., Wen, T., Yan, R., 2019. Real-time lane detection and tracking for autonomous vehicle applications. Proc. Inst. Mech. Eng. Part D J. Automob. Eng. https://doi.org/10.1177/0954407019866989
Kim, J., Park, C., 2017. End-To-End Ego Lane Estimation Based on Sequential Transfer Learning for Self-Driving Cars, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. https://doi.org/10.1109/CVPRW.2017.158
Ko, Y., Lee, Y., Azam, S., Munir, F., Jeon, M., Pedrycz, W., 2020. Key Points Estimation and Point Instance Segmentation Approach for Lane Detection, 1–10. https://doi.org/10.1109/tits.2021.3088488
Levinson, J., Askeland, J., Becker, J., Dolson, J., Held, D., Kammel, S., Kolter, J.Z., Langer, D., Pink, O., Pratt, V., Sokolsky, M., Stanek, G., Stavens, D., Teichman, A., Werling, M., Thrun, S., 2011. Towards fully autonomous driving: Systems and algorithms, in: IEEE Intelligent Vehicles Symposium, Proceedings. https://doi.org/10.1109/IVS.2011.5940562
Li, X., Li, J., Hu, X., Yang, J., 2020. Line-CNN: End-to-End Traffic Line Detection with Line Proposal Unit. IEEE Trans. Intell. Transp. Syst. 21, 248–258. https://doi.org/10.1109/TITS.2019.2890870
Liang, D., Guo, Y.C., Zhang, S.K., Mu, T.J., Huang, X., 2020. Lane Detection: A Survey with New Results. J. Comput. Sci. Technol. 35, 493–505. https://doi.org/10.1007/s11390-020-0476-4
Lin, C., Li, L., Cai, Z., Wang, K.C.P., Xiao, D., Luo, W., Guo, J.G., 2020. Deep Learning-Based Lane Marking Detection using A2-LMDet. Transp. Res. Rec. https://doi.org/10.1177/0361198120948508
Liu, L., Chen, X., Zhu, S., Tan, P., 2021. CondLaneNet: a Top-to-down Lane Detection Framework Based on Conditional Convolution. arXiv preprint arXiv:2105.05003.
Liu, L., Jiang, H., He, P., Chen, W., Liu, X., Gao, J., Han, J., 2019. On the Variance of the Adaptive Learning Rate and Beyond. arXiv preprint arXiv:1908.03265.
Liu, R., Yuan, Z., Liu, T., Xiong, Z., 2020. End-to-end Lane Shape Prediction with Transformers, 3694–3702. https://doi.org/10.1109/WACV48630.2021.00374
Liu, T., Chen, Z., Yang, Y., Wu, Z., Li, H., 2020. Lane Detection in Low-light Conditions Using an Efficient Data Enhancement: Light Conditions Style Transfer. IEEE Intell. Veh. Symp. Proc. 2020-May, 1394–1399. https://doi.org/10.1109/IV47402.2020.9304613
Lu, Z., Xu, Y., Shan, X., Liu, L., Wang, X., Shen, J., 2019. A lane detection method based on a ridge detector and regional G-RANSAC. Sensors (Switzerland). https://doi.org/10.3390/s19184028
Neven, D., De Brabandere, B., Georgoulis, S., Proesmans, M., Van Gool, L., 2018. Towards End-to-End Lane Detection: An Instance Segmentation Approach. IEEE Intell. Veh. Symp. Proc. 2018-June, 286–291. https://doi.org/10.1109/IVS.2018.8500547
Neven, D., De Brabandere, B., Georgoulis, S., Proesmans, M., Van Gool, L., 2017. Fast Scene Understanding for Autonomous Driving.
Pan, X., Shi, J., Luo, P., Wang, X., Tang, X., 2018. Spatial as deep: Spatial CNN for traffic scene understanding, in: 32nd AAAI Conference on Artificial Intelligence, AAAI 2018. AAAI Press, pp. 7276–7283.
Pascanu, R., Mikolov, T., Bengio, Y., 2013. On the difficulty of training recurrent neural networks, in: 30th International Conference on Machine Learning, ICML 2013.
Philion, J., 2019. FastDraw: Addressing the long tail of lane detection by adapting a sequential prediction network. Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. 2019-June, 11574–11583. https://doi.org/10.1109/CVPR.2019.01185
Qin, Z., Wang, H., Li, X., 2020. Ultra Fast Structure-aware Deep Lane Detection, in: Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIV. Springer International Publishing, pp. 276–291.
Ribeiro, A.H., Tiels, K., Aguirre, L.A., Schön, T., 2020. Beyond exploding and vanishing gradients: analysing RNN training using attractors and smoothness, in: International Conference on Artificial Intelligence and Statistics. PMLR, pp. 2370–2380.
Ronneberger, O., Fischer, P., Brox, T., 2015. U-Net: Convolutional networks for biomedical image segmentation. Lect. Notes Comput. Sci. 9351, 234–241. https://doi.org/10.1007/978-3-319-24574-4_28
Shi, X., Chen, Z., Wang, H., Yeung, D.Y., Wong, W.K., Woo, W.C., 2015. Convolutional LSTM network: A machine learning approach for precipitation nowcasting, in: Advances in Neural Information Processing Systems.
Sivaraman, S., Trivedi, M.M., 2013. Integrated lane and vehicle detection, localization, and tracking: A synergistic approach. IEEE Trans. Intell. Transp. Syst. 14, 906–917. https://doi.org/10.1109/TITS.2013.2246835
Sutskever, I., Vinyals, O., Le, Q.V., 2014. Sequence to sequence learning with neural networks, in: Advances in Neural Information Processing Systems, pp. 3104–3112.
Tabelini, L., Berriel, R., Paixão, T.M., Badue, C., de Souza, A.F., Oliveira-Santos, T., 2020a. PolyLaneNet: Lane estimation via deep polynomial regression. arXiv preprint arXiv:2004.10924.
Tabelini, L., Berriel, R., Paixão, T.M., Badue, C., De Souza, A.F., Oliveira-Santos, T., 2020b. Keep your Eyes on the Lane: Attention-guided Lane Detection. arXiv e-prints, arXiv-2010.
Wang, B.F., Qi, Z.Q., Ma, G.C., 2014. Robust lane recognition for structured road based on monocular vision. J. Beijing Inst. Technol. (English Ed.) 23, 345–351.
Wang, S., Hou, X., Zhao, X., 2020. Automatic Building Extraction from High-Resolution Aerial Imagery via Fully Convolutional Encoder-Decoder Network with Non-Local Block. IEEE Access. https://doi.org/10.1109/ACCESS.2020.2964043
Wang, Y., Dahnoun, N., Achim, A., 2012. A novel system for robust lane detection and tracking. Signal Processing. https://doi.org/10.1016/j.sigpro.2011.07.019
Wu, B., Li, K., Ge, F., Huang, Z., Yang, M., Siniscalchi, S.M., Lee, C.H.L., 2017. An end-to-end deep learning approach to simultaneous speech dereverberation and acoustic modeling for robust speech recognition. IEEE J. Sel. Top. Signal Process. https://doi.org/10.1109/JSTSP.2017.2756439
Xing, Y., Lv, C., Chen, L., Wang, Huaji, Wang, Hong, Cao, D., Velenis, E., Wang, F.Y., 2018. Advances in Vision-Based Lane Detection: Algorithms, Integration, Assessment, and Perspectives on ACP-Based Parallel Vision. IEEE/CAA J. Autom. Sin. 5, 645–661. https://doi.org/10.1109/JAS.2018.7511063
Xu, H., Wang, S., Cai, X., Zhang, W., Liang, X., Li, Z., 2020. CurveLane-NAS: Unifying Lane-Sensitive Architecture Search and Adaptive Point Blending. arXiv preprint arXiv:2007.12147.
Yasrab, R., Gu, N., Zhang, X., 2017. An encoder-decoder based Convolution Neural Network (CNN) for future Advanced Driver Assistance System (ADAS). Appl. Sci. https://doi.org/10.3390/app7040312
Yoo, S., Seok Lee, H., Myeong, H., Yun, S., Park, H., Cho, J., Hoon Kim, D., 2020. End-to-end lane marker detection via row-wise classification. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Work. 2020-June, 4335–4343. https://doi.org/10.1109/CVPRW50498.2020.00511
Zhang, J., Deng, T., Yan, F., Liu, W., 2021. Lane Detection Model Based on Spatio-Temporal Network With Double Convolutional Gated Recurrent Units. IEEE Trans. Intell. Transp. Syst. https://doi.org/10.1109/TITS.2021.3060258
Zheng, F., Luo, S., Song, K., Yan, C.W., Wang, M.C., 2018. Improved Lane Line Detection Algorithm Based on Hough Transform. Pattern Recognit. Image Anal. 28, 254–260. https://doi.org/10.1134/S1054661818020049
Zou, Q., Jiang, H., Dai, Q., Yue, Y., Chen, L., Wang, Q., 2020. Robust lane detection from continuous driving scenes using deep neural networks. IEEE Trans. Veh. Technol. 69, 41–54. https://doi.org/10.1109/TVT.2019.2949603
Zou, Q., Ni, L., Wang, Q., Li, Q., Wang, S., 2017. Robust Gait Recognition by Integrating Inertial and RGBD Sensors. IEEE Trans. Cybern. https://doi.org/10.1109/TCYB.2017.2682280
APPENDIX
See Table A1 and Table A2.
TABLE A1. Parameter settings for each layer of the SegNet-based neural network.
Block | Layer | Input (channel×height×width) | Output (channel×height×width) | Kernel | Padding | Stride | Activation
Down_ConvBlock_1 | Conv_1_1 | 3×128×256 | 64×128×256 | 3×3 | (1,1) | 1 | ReLU
Down_ConvBlock_1 | Conv_1_2 | 64×128×256 | 64×128×256 | 3×3 | (1,1) | 1 | ReLU
Down_ConvBlock_1 | Maxpool1 | 64×128×256 | 64×64×128 | 2×2 | (0,0) | 2 | ---
SCNN | SCNN_Down | 64×1×128 | 64×1×128 | 1×9 | (0,4) | 1 | ReLU
SCNN | SCNN_Up | 64×1×128 | 64×1×128 | 1×9 | (0,4) | 1 | ReLU
SCNN | SCNN_Right | 64×64×1 | 64×64×1 | 9×1 | (4,0) | 1 | ReLU
SCNN | SCNN_Left | 64×64×1 | 64×64×1 | 9×1 | (4,0) | 1 | ReLU
Down_ConvBlock_2 | Conv_2_1 | 64×64×128 | 128×64×128 | 3×3 | (1,1) | 1 | ReLU
Down_ConvBlock_2 | Conv_2_2 | 128×64×128 | 128×64×128 | 3×3 | (1,1) | 1 | ReLU
Down_ConvBlock_2 | Maxpool2 | 128×64×128 | 128×32×64 | 2×2 | (0,0) | 2 | ---
Down_ConvBlock_3 | Conv_3_1 | 128×32×64 | 256×32×64 | 3×3 | (1,1) | 1 | ReLU
Down_ConvBlock_3 | Conv_3_2 | 256×32×64 | 256×32×64 | 3×3 | (1,1) | 1 | ReLU
Down_ConvBlock_3 | Conv_3_3 | 256×32×64 | 256×32×64 | 3×3 | (1,1) | 1 | ReLU
Down_ConvBlock_3 | Maxpool3 | 256×32×64 | 256×16×32 | 2×2 | (0,0) | 2 | ---
Down_ConvBlock_4 | Conv_4_1 | 256×16×32 | 512×16×32 | 3×3 | (1,1) | 1 | ReLU
Down_ConvBlock_4 | Conv_4_2 | 512×16×32 | 512×16×32 | 3×3 | (1,1) | 1 | ReLU
Down_ConvBlock_4 | Conv_4_3 | 512×16×32 | 512×16×32 | 3×3 | (1,1) | 1 | ReLU
Down_ConvBlock_4 | Maxpool4 | 512×16×32 | 512×8×16 | 2×2 | (0,0) | 2 | ---
Down_ConvBlock_5 | Conv_5_1 | 512×8×16 | 512×8×16 | 3×3 | (1,1) | 1 | ReLU
Down_ConvBlock_5 | Conv_5_2 | 512×8×16 | 512×8×16 | 3×3 | (1,1) | 1 | ReLU
Down_ConvBlock_5 | Conv_5_3 | 512×8×16 | 512×8×16 | 3×3 | (1,1) | 1 | ReLU
Down_ConvBlock_5 | Maxpool5 | 512×8×16 | 512×4×8 | 2×2 | (0,0) | 2 | ---
ST-RNN Layer1* | 5 * ConvLSTMCell(input=(512×4×8), kernel=(3,3), stride=(1,1), padding=(1,1)) or 5 * ConvGRUCell(input=(512×4×8), kernel=(3,3), stride=(1,1), padding=(1,1), dropout(0.5))
ST-RNN Layer2** | 5 * ConvLSTMCell(input=(512×4×8), kernel=(3,3), stride=(1,1), padding=(1,1)) or 5 * ConvGRUCell(input=(512×4×8), kernel=(3,3), stride=(1,1), padding=(1,1), dropout(0.5))
Up_ConvBlock_5 | MaxUnpool1 | 512×4×8 | 512×8×16 | 2×2 | (0,0) | 2 | ---
Up_ConvBlock_5 | Up_Conv_5_1 | 512×8×16 | 512×8×16 | 3×3 | (1,1) | 1 | ReLU
Up_ConvBlock_5 | Up_Conv_5_2 | 512×8×16 | 512×8×16 | 3×3 | (1,1) | 1 | ReLU
Up_ConvBlock_5 | Up_Conv_5_3 | 512×8×16 | 512×8×16 | 3×3 | (1,1) | 1 | ReLU
Up_ConvBlock_4 | MaxUnpool2 | 512×8×16 | 512×16×32 | 2×2 | (0,0) | 2 | ---
Up_ConvBlock_4 | Up_Conv_4_1 | 512×16×32 | 512×16×32 | 3×3 | (1,1) | 1 | ReLU
Up_ConvBlock_4 | Up_Conv_4_2 | 512×16×32 | 512×16×32 | 3×3 | (1,1) | 1 | ReLU
Up_ConvBlock_4 | Up_Conv_4_3 | 512×16×32 | 256×16×32 | 3×3 | (1,1) | 1 | ReLU
Up_ConvBlock_3 | MaxUnpool3 | 256×16×32 | 256×32×64 | 2×2 | (0,0) | 2 | ---
Up_ConvBlock_3 | Up_Conv_3_1 | 256×32×64 | 256×32×64 | 3×3 | (1,1) | 1 | ReLU
Up_ConvBlock_3 | Up_Conv_3_2 | 256×32×64 | 256×32×64 | 3×3 | (1,1) | 1 | ReLU
Up_ConvBlock_3 | Up_Conv_3_3 | 256×32×64 | 128×32×64 | 3×3 | (1,1) | 1 | ReLU
Up_ConvBlock_2 | MaxUnpool4 | 128×32×64 | 128×64×128 | 2×2 | (0,0) | 2 | ---
Up_ConvBlock_2 | Up_Conv_2_1 | 128×64×128 | 128×64×128 | 3×3 | (1,1) | 1 | ReLU
Up_ConvBlock_2 | Up_Conv_2_2 | 128×64×128 | 64×64×128 | 3×3 | (1,1) | 1 | ReLU
Up_ConvBlock_1 | MaxUnpool5 | 64×64×128 | 64×128×256 | 2×2 | (0,0) | 2 | ---
Up_ConvBlock_1 | Up_Conv_1_1 | 64×128×256 | 64×128×256 | 3×3 | (1,1) | 1 | ReLU
Up_ConvBlock_1 | Up_Conv_1_2 | 64×128×256 | 2×128×256 | 3×3 | (1,1) | 1 | LogSoftmax
Abbreviations: ConvGRU, convolutional gated recurrent unit; ConvLSTM, convolutional long short-term memory; SCNN, spatial convolutional neural network; ST-RNN, spatial-temporal recurrent neural network; ReLU, Rectified Linear Unit.
* Two types of ST-RNN, i.e., ConvLSTM and ConvGRU, are tested;
** ST-RNN blocks are tested with one hidden layer or two hidden layers.
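To make the SCNN rows of Table A1 concrete, the following is a minimal PyTorch-style sketch of the four directional slice convolutions (SCNN_Down/SCNN_Up with 1×9 kernels, SCNN_Right/SCNN_Left with 9×1 kernels) applied to the 64×64×128 feature map that leaves Down_ConvBlock_1. It follows the general message-passing recipe of Pan et al. (2018); the class name, the shared per-direction convolutions, and the slice-update loop are assumptions of this sketch, not the authors' exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class SCNN(nn.Module):
    """Directional message passing over 1-pixel slices (cf. the SCNN rows of Table A1)."""

    def __init__(self, channels: int = 64, kernel: int = 9):
        super().__init__()
        pad = kernel // 2
        # SCNN_Down / SCNN_Up: 1x9 convolutions shared by all row slices
        self.conv_down = nn.Conv2d(channels, channels, (1, kernel), padding=(0, pad), bias=False)
        self.conv_up = nn.Conv2d(channels, channels, (1, kernel), padding=(0, pad), bias=False)
        # SCNN_Right / SCNN_Left: 9x1 convolutions shared by all column slices
        self.conv_right = nn.Conv2d(channels, channels, (kernel, 1), padding=(pad, 0), bias=False)
        self.conv_left = nn.Conv2d(channels, channels, (kernel, 1), padding=(pad, 0), bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width), e.g. (N, 64, 64, 128) after Down_ConvBlock_1
        x = self._pass(x, self.conv_down, dim=2, reverse=False)   # top -> bottom
        x = self._pass(x, self.conv_up, dim=2, reverse=True)      # bottom -> top
        x = self._pass(x, self.conv_right, dim=3, reverse=False)  # left -> right
        x = self._pass(x, self.conv_left, dim=3, reverse=True)    # right -> left
        return x

    @staticmethod
    def _pass(x, conv, dim, reverse):
        # Split into 1-pixel slices along `dim`; each slice receives a ReLU-activated
        # message from its already-updated neighbour before passing one on itself.
        slices = list(torch.split(x, 1, dim=dim))
        order = list(range(len(slices)))
        if reverse:
            order.reverse()
        for i in range(1, len(order)):
            prev, cur = order[i - 1], order[i]
            slices[cur] = slices[cur] + F.relu(conv(slices[prev]))
        return torch.cat(slices, dim=dim)


if __name__ == "__main__":
    feat = torch.randn(1, 64, 64, 128)   # feature map entering the SCNN block of Table A1
    print(SCNN(64)(feat).shape)          # torch.Size([1, 64, 64, 128])

The sequential update is what lets information travel across the whole image along thin, elongated structures such as lane markings, which a stack of ordinary 3×3 convolutions would need many layers to achieve.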
TABLE A2. Parameter settings for each layer of the UNet-based neural network.
Block | Layer | Input (channel×height×width) | Output (channel×height×width) | Kernel | Padding | Stride | Activation
In_ConvBlock | In_Conv_1 | 3×128×256 | 64×128×256 | 3×3 | (1,1) | 1 | ReLU
In_ConvBlock | In_Conv_2 | 64×128×256 | 64×128×256 | 3×3 | (1,1) | 1 | ReLU
SCNN | SCNN_Down | 64×1×256 | 64×1×256 | 1×9 | (0,4) | 1 | ReLU
SCNN | SCNN_Up | 64×1×256 | 64×1×256 | 1×9 | (0,4) | 1 | ReLU
SCNN | SCNN_Right | 64×128×1 | 64×128×1 | 9×1 | (4,0) | 1 | ReLU
SCNN | SCNN_Left | 64×128×1 | 64×128×1 | 9×1 | (4,0) | 1 | ReLU
Down_ConvBlock_1 | Maxpool1 | 64×128×256 | 64×64×128 | 2×2 | (0,0) | 2 | ---
Down_ConvBlock_1 | Conv_1_1 | 64×64×128 | 128×64×128 | 3×3 | (1,1) | 1 | ReLU
Down_ConvBlock_1 | Conv_1_2 | 128×64×128 | 128×64×128 | 3×3 | (1,1) | 1 | ReLU
Down_ConvBlock_2 | Maxpool2 | 128×64×128 | 128×32×64 | 2×2 | (0,0) | 2 | ---
Down_ConvBlock_2 | Conv_2_1 | 128×32×64 | 256×32×64 | 3×3 | (1,1) | 1 | ReLU
Down_ConvBlock_2 | Conv_2_2 | 256×32×64 | 256×32×64 | 3×3 | (1,1) | 1 | ReLU
Down_ConvBlock_3 | Maxpool3 | 256×32×64 | 256×16×32 | 2×2 | (0,0) | 2 | ---
Down_ConvBlock_3 | Conv_3_1 | 256×16×32 | 512×16×32 | 3×3 | (1,1) | 1 | ReLU
Down_ConvBlock_3 | Conv_3_2 | 512×16×32 | 512×16×32 | 3×3 | (1,1) | 1 | ReLU
Down_ConvBlock_4 | Maxpool4 | 512×16×32 | 512×8×16 | 2×2 | (0,0) | 2 | ---
Down_ConvBlock_4 | Conv_4_1 | 512×8×16 | 512×8×16 | 3×3 | (1,1) | 1 | ReLU
Down_ConvBlock_4 | Conv_4_2 | 512×8×16 | 512×8×16 | 3×3 | (1,1) | 1 | ReLU
ST-RNN Layer1* | 5 * ConvLSTMCell(input=(512×8×16), kernel=(3,3), stride=(1,1), padding=(1,1)) or 5 * ConvGRUCell(input=(512×8×16), kernel=(3,3), stride=(1,1), padding=(1,1), dropout(0.5))
ST-RNN Layer2** | 5 * ConvLSTMCell(input=(512×8×16), kernel=(3,3), stride=(1,1), padding=(1,1)) or 5 * ConvGRUCell(input=(512×8×16), kernel=(3,3), stride=(1,1), padding=(1,1), dropout(0.5))
Up_ConvBlock_4 | UpsamplingBilinear2D_1 | 512×8×16 | 512×16×32 | 2×2 | (0,0) | 2 | ---
Up_ConvBlock_4 | Up_Conv_4_1 | 1024×16×32 | 256×16×32 | 3×3 | (1,1) | 1 | ReLU
Up_ConvBlock_4 | Up_Conv_4_2 | 256×16×32 | 256×16×32 | 3×3 | (1,1) | 1 | ReLU
Up_ConvBlock_3 | UpsamplingBilinear2D_2 | 256×16×32 | 256×32×64 | 2×2 | (0,0) | 2 | ---
Up_ConvBlock_3 | Up_Conv_3_1 | 512×32×64 | 128×32×64 | 3×3 | (1,1) | 1 | ReLU
Up_ConvBlock_3 | Up_Conv_3_2 | 128×32×64 | 128×32×64 | 3×3 | (1,1) | 1 | ReLU
Up_ConvBlock_2 | UpsamplingBilinear2D_3 | 128×32×64 | 128×64×128 | 2×2 | (0,0) | 2 | ---
Up_ConvBlock_2 | Up_Conv_2_1 | 256×64×128 | 64×64×128 | 3×3 | (1,1) | 1 | ReLU
Up_ConvBlock_2 | Up_Conv_2_2 | 64×64×128 | 64×64×128 | 3×3 | (1,1) | 1 | ReLU
Up_ConvBlock_1 | UpsamplingBilinear2D_4 | 64×64×128 | 64×128×256 | 2×2 | (0,0) | 2 | ---
Up_ConvBlock_1 | Up_Conv_1_1 | 128×128×256 | 64×128×256 | 3×3 | (1,1) | 1 | ReLU
Up_ConvBlock_1 | Up_Conv_1_2 | 64×128×256 | 64×128×256 | 3×3 | (1,1) | 1 | ReLU
Out_ConvBlock | Out_Conv | 64×128×256 | 2×128×256 | 1×1 | (0,0) | 1 | ---
Abbreviations: ConvGRU, convolutional gated recurrent unit; ConvLSTM, convolutional long short-term memory; SCNN, spatial convolutional neural network; ST-RNN, spatial-temporal recurrent neural network; ReLU, Rectified Linear Unit.
* Similar to the SegNet-based network architecture, two types of ST-RNN, i.e., ConvLSTM and ConvGRU, are tested;
** ST-RNN blocks are tested with one hidden layer or two hidden layers.
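The ST-RNN rows in both tables unroll a convolutional recurrent cell over the five encoder feature maps of an input sequence and hand only the final hidden state to the decoder (sequence-to-one). The sketch below illustrates that wiring in PyTorch, assuming a single ConvLSTM layer with 3×3 gate convolutions and the 512×8×16 bottleneck of the UNet-based variant; the cell and wrapper names are hypothetical, and the ConvGRU option with dropout listed in the tables would simply replace the cell.

import torch
import torch.nn as nn


class ConvLSTMCell(nn.Module):
    """One convolutional LSTM step; gates are computed by a single 3x3 convolution."""

    def __init__(self, in_ch: int, hid_ch: int, kernel: int = 3):
        super().__init__()
        # input, forget, output, and candidate gates produced in one pass
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, kernel, padding=kernel // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        g = torch.tanh(g)
        c = f * c + i * g          # update the cell memory
        h = o * torch.tanh(c)      # hidden state keeps the spatial resolution of x
        return h, c


class STRNNBottleneck(nn.Module):
    """Sequence-to-one: unroll the cell over T frames and return the last hidden state."""

    def __init__(self, channels: int = 512, kernel: int = 3):
        super().__init__()
        self.cell = ConvLSTMCell(channels, channels, kernel)

    def forward(self, feats):
        # feats: (batch, T, channels, height, width) -- encoder outputs of T continuous frames
        b, t, c, hgt, wid = feats.shape
        h = feats.new_zeros(b, c, hgt, wid)
        mem = feats.new_zeros(b, c, hgt, wid)
        for step in range(t):
            h, mem = self.cell(feats[:, step], (h, mem))
        return h  # passed to the first decoder block in place of a single-frame feature map


if __name__ == "__main__":
    seq = torch.randn(2, 5, 512, 8, 16)      # five 512x8x16 bottleneck features (Table A2)
    print(STRNNBottleneck(512)(seq).shape)   # torch.Size([2, 512, 8, 16])

Because the gates are convolutions rather than fully connected layers, the recurrence preserves the spatial layout of the bottleneck features while accumulating temporal evidence across the frame sequence; stacking a second cell corresponds to the two-hidden-layer ST-RNN configuration tested in the tables.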