2023TOU30321
Colin DECOURT
École doctorale
EDMITT - Ecole Doctorale Mathématiques, Informatique et Télécommunications de Toulouse
Spécialité
Informatique et Télécommunications
Unité de recherche
CERCO - Centre de Recherche Cerveau et Cognition
Thèse dirigée par
Rufin VANRULLEN et Thomas OBERLIN
Composition du jury
M. Yassine RUICHEK, Rapporteur, Université de Technologie de Belfort Montbéliard
M. Martin BOUCHARD, Rapporteur, University of Ottawa
M. Didier SALLE, Examinateur, NXP Semiconductors, France
M. Rufin VANRULLEN, Directeur de thèse, CNRS Toulouse - CerCo
M. Thomas OBERLIN, Co-directeur de thèse, ISAE-SUPAERO
Mme Veronique BERGE-CHERFAOUI, Présidente, Université de Technologie de Compiègne
Contents

Glossary
Résumé en français
Introduction
2 Related work
  2.1 Deep learning background
  2.2 Computer vision background
    2.2.1 Image classification
    2.2.2 Object detection
    2.2.3 Image segmentation
  2.3 Automotive radar datasets
    2.3.1 Point clouds datasets
    2.3.2 Raw datasets
  2.4 Automotive radar perception on radar point clouds
  2.5 Automotive radar perception on raw data
    2.5.1 Object classification
    2.5.2 Object detection and segmentation
    2.5.3 Data augmentation for radar
Conclusion
Bibliography
List of Figures

1 Chaîne de traitement du signal d’un radar FMCW
2 Exemples de spectres RD et RA
3 Architecture de Faster R-CNN et du modèle DAROD proposé
4 Architectures des modèles RECORD et MV-RECORD
5 Vue d’ensemble de RICL
6 Levels of ADAS and their meaning
7 Type of sensors found in ADAS systems
8 Example of radar point clouds
BEV Bird-Eye-View
CW Continuous Wave
EM Electromagnetic Waves
FP False Positive
GT Ground-Truth
IR Inverted Residual
OVA One-vs-All
OVO One-vs-One
RA Range-Angle
RAD Range-Angle-Doppler
RD Range-Doppler
RF Radio Frequency
SD Self-Distillation
SOTA State-of-the-art
TP True Positive
V2X Vehicle-to-Everything
Introduction
Ces dernières années, poussée par le besoin de systèmes de transport plus sûrs
et plus autonomes, l’industrie automobile a connu un changement de paradigme
vers l’intégration d’un nombre croissant de systèmes avancés d’aide à la conduite
(ADAS). À mesure que les niveaux d’aides à la conduite augmentent, allant de
l’aide au freinage à des niveaux plus élevés d’automatisation (dits de niveaux 4 et
5), il est désormais primordial de développer des systèmes de perception robustes.
Dans cette thèse, on appelle perception la capacité d’un système à modéliser son
environnement à l’aide de multiples capteurs.
La plupart des systèmes d’aides à la conduite reposent sur des capteurs de types
caméras, LiDAR et radar pour créer une représentation de l’environnement, chacun
de ces capteurs présentant des avantages et des inconvénients. Par exemple, la haute
résolution des caméras est indispensable pour lire les panneaux de signalisation ou
pour reconnaître des objets. D’un autre côté, le LiDAR apparaît comme un cap-
teur adapté pour cartographier en 3D l’environnement de par sa haute résolution
angulaire. Cependant, en cas de mauvaises conditions météorologiques (brouillard,
pluie) et lumineuses (nuit, contre-jour), l’efficacité des caméras et des LiDAR est
limitée. Également, bien qu’il soit possible d’obtenir des informations de vitesse et
de distance à l’aide de caméras stéréo par exemple, la capacité des caméras à estimer
ces grandeurs reste extrêmement limitée, et il en est de même pour le LiDAR.
D’un autre côté, le radar s’est imposé comme un concurrent de choix pour
compléter les caméras et le LiDAR en raison de ses capacités uniques et sa robustesse
pour détecter des objets et estimer leur vitesse dans des conditions météorologiques
défavorables ou des scénarios à faible luminosité. En émettant des ondes radio et en
mesurant leurs réflexions, le radar permet de mesurer la vitesse et la distance avec
une grande précision. Alors que les faisceaux laser émis par le LiDAR peuvent être
diffractés par des gouttelettes d’eau, créant ainsi des fausses détections, les ondes
émises par le radar les traversent et n’entravent pas le fonctionnement du radar.
Combinés, les caméras, le LiDAR et le radar garantissent un cocon de sécu-
rité à 360° autour du véhicule. Si la fusion des capteurs est apparue comme une
approche essentielle pour accroître la précision, la sécurité et la redondance des sys-
tèmes ADAS, cette efficacité dépend grandement de la capacité de chaque capteur
à fournir une représentation adéquate de l’environnement. Grâce à l’émergence
de l’apprentissage profond, des algorithmes de vision par ordinateur, ainsi qu’au
grand nombre de jeux de données pour des applications de conduite autonome
[Ettinger 2021, Urtasun 2012, Caesar 2019], des progrès considérables ont été faits
[Figure 1 : Chaîne de traitement du signal d’un radar FMCW — traitement en distance (FFT temps rapide, spectre de distance), traitement en vitesse (FFT temps lent, spectre distance-Doppler), traitement en angle / MIMO (spectre distance-angle), puis détection CFAR.]
Ces étapes de traitement filtrent le signal brut réfléchi par les objets, ce qui peut affecter
les performances des algorithmes d’intelligence artificielle.
Une alternative aux nuages de points consiste à représenter le signal réfléchi par
les objets sous la forme d’un spectre (données brutes) qui représente l’environnement
en distance et en vitesse (distance-Doppler, RD), en distance et en angle (distance-
angle, RA), ou en distance, en angle et en vitesse (distance-angle-Doppler,
RAD). La Figure 2 montre un spectre RD et RA avec l’image caméra associée.
Ces trois dernières années, la parution de bases de données comme CARRADA
[Ouaknine 2021b], RADDet [Zhang 2021] ou encore CRUW [Wang 2021c] a per-
mis d’accélérer la recherche vers le développement de modèles d’IA pour la dé-
tection et la classification d’objets à partir de données radar brutes. Alors que
certains travaux se concentrent sur la classification d’objets à partir de données
brutes [Akita 2019, Palffy 2020, Khalid 2019], d’autres aspirent à réduire le nombre
d’étapes de post-traitement du radar pour détecter et classifier simultanément des
objets [Ouaknine 2021a, Wang 2021b, Gao 2021, Giroux 2023].
Inspirée de ces travaux, cette thèse propose d’exploiter les spectres radar pour
détecter et identifier des usagers de la route dans des environnements complexes.
Essentiellement, cette thèse vise à proposer des algorithmes d’apprentissage profond
conçus explicitement pour les données radar et à étudier si ces algorithmes peuvent
se substituer à certaines étapes dans la chaîne de traitement radar. Cette thèse a
été réalisée au sein de l’institut d’intelligence artificielle ANITI, en collaboration
avec la société NXP, un leader mondial dans le domaine des émetteurs-récepteurs
et des microcontrôleurs pour radars automobiles. L’entreprise est activement
impliquée dans le développement de la prochaine génération de radars visant à
renforcer la sécurité routière et le confort de conduite.
[Figure 3 : Architecture de Faster R-CNN et du modèle DAROD proposé — extracteur de caractéristiques (convolutions 2D 3x3, normalisation de groupe, Leaky ReLU, max-pooling 2x2 et 2x1), réseau de proposition de régions (RPN), RoI pooling, puis têtes de classification, de régression et de Doppler.]
La seconde étude de cette thèse est dédiée à la détection d’objets à partir de données
radar en temps réel. Cette étude, un peu plus générale, vise à exploiter l’information
temporelle pour améliorer les performances de détection des détecteurs d’objets radar
basés sur de l’apprentissage profond. Les modèles utilisés dans la première étude
ont montré de bonnes performances de détection et de classification. Cependant,
leurs capacités à différencier des objets de classes similaires (comme des piétons et
des cyclistes) sont limitées. En radar, exploiter l’information temporelle est crucial
car la signature d’un objet évolue au cours du temps et varie selon plusieurs fac-
teurs tels que sa distance par rapport au radar, son orientation et sa classe. Ainsi,
l’exploitation de l’information temporelle, c’est-à-dire l’utilisation de plusieurs spec-
tres radar à des pas de temps successifs, pourrait permettre d’apprendre des infor-
mations comme la dynamique de l’objet et donc de limiter la confusion entre classes.
Récemment, différents travaux ont vu le jour dans le but d’apprendre des dépen-
dances temporelles entre différents spectres radar ou entre objets. Principale-
ment, ces approches reposent sur des convolutions temporelles [Ouaknine 2021a,
Wang 2021b, Ju 2021], sur des réseaux de neurones récurrents convolutionnels
[Major 2019] ou sur des modèles d’attention [Li 2022]. Cependant, la plupart de ces
méthodes ont du mal à capturer des dépendances à long terme et sont souvent non-
causales (elles utilisent des informations du passé et du futur) et donc impossibles
à utiliser en temps réel.
Dans le but d’extraire des dépendances spatio-temporelles entre objets, un nouveau
modèle, appelé RECORD, combinant des convolutions et des réseaux de neurones
récurrents convolutionnels, est proposé.
[Figure 4 : Architectures des modèles RECORD (a) et MV-RECORD (b). (a) Le modèle RECORD empile des blocs convolutionnels, des blocs Inverted Residual (IR), des couches Bottleneck LSTM et des convolutions transposées ; les flèches arrondies représentent des couches récurrentes, le signe plus l’opération de concaténation, et le nombre de filtres ainsi que la taille de sortie sont indiqués pour chaque couche. (b) Le modèle multi-vues MV-RECORD combine des encodeurs RD, RA et AD, une concaténation des caractéristiques, puis des décodeurs RA et RD suivis d’une convolution 2D 1x1.]
Comme pour les expériences sur la base de données CRUW, les modèles RECORD
et MV-RECORD sont plus efficients que TMVA-Net. Bien qu’intéressantes pour
la recherche, les approches multi-vues sont longues et difficiles à optimiser et à
intégrer dans un système radar. À mesure que la résolution des radars augmente, la
quantité de mémoire nécessaire à la production des spectres RA et RAD augmente.
En conclusion, cette étude suggère que les modèles simple-vue semblent plus
appropriés pour traiter des données radar brutes et pour détecter des objets.
Appliqués sur des spectres RD et couplés à un algorithme d’estimation d’angle
d’arrivée, ils devraient permettre d’améliorer les performances de détection et de
classification des radars.
Figure 5: Vue d’ensemble de RICL. Deux réseaux sont utilisés pour encoder les car-
actéristiques de chaque spectre (un réseau online et un réseau target). L’opération
RoIAlign [He 2017] est utilisée pour extraire les caractéristiques de chaque objet dé-
tecté par CFAR. Une fonction de coût contrastive est appliquée pour chaque paire
d’objets.
La dernière étude de cette thèse porte sur la réduction du nombre d’annotations
nécessaires pour la détection d’objets. L’annotation des données radar étant com-
pliquée, la plupart des auteurs de bases de données radar annotent les jeux de
données de manière semi-automatique [Ouaknine 2021b, Wang 2021c, Zhang 2021,
Rebut 2022]. Cependant, ces annotations reposent sur la fusion des détections à
l’aide d’une caméra [He 2017] et des détections du radar (obtenues avec des méth-
odes classiques). Une telle méthode peut mener à de mauvaises détections ou à des
objets manqués. Cette dernière étude vise donc à réduire la quantité de
labels nécessaires à l’entraînement des modèles de détection d’objets utilisant des
données radar, en les pré-entraînant de manière auto-supervisée et en les spécialisant
sur une tâche de détection avec des annotations manuelles.
En utilisant un apprentissage contrastif, une extension de la méthode SoCo
[Wei 2021] est proposée pour apprendre des représentations de ce qu’est un objet
dans un spectre RD, sans utiliser de labels (appelée RICL). Une vue d’ensemble de la
méthode est présentée en Figure 5. L’idée consiste à extraire la position d’un même
objet à deux pas de temps successifs dans un spectre RD (à l’aide d’un algorithme
de type CFAR), d’encoder cette représentation à l’aide d’un réseau de neurones
convolutionnel et de maximiser la similarité entre ces objets pour en extraire des
informations relatives à leurs classes (inconnues au moment de l’entraînement). Une
fois le modèle pré-entraîné (les représentations des objets apprises), le modèle est
spécialisé sur une tâche donnée, ici la détection d’objets.
Dans cette étude, le réseau de neurones convolutionnels choisi est un ResNet-50
[He 2016]. Ce modèle est pré-entraîné et spécialisé sur la base de données
CARRADA [Ouaknine 2021b], et des spectres RD sont utilisés. Le modèle de
détection choisi est le même que pour la première étude, à savoir un Faster
R-CNN. Pour tester l’efficacité de la méthode, le modèle est spécialisé pour
de la détection d’objets avec différentes quantités de labels, allant de 100% à
5%. Des comparaisons en utilisant un pré-entraînement différent (supervisé) sur
Conclusion
Pour conclure, cette thèse montre le potentiel de l’utilisation de modèles d’IA pour
améliorer les capacités de perception des radars automobiles en utilisant des données
brutes: spectres distance-vitesse, distance-angle et distance-angle-vitesse. Ce travail
a montré que des algorithmes d’IA utilisant ces données brutes peuvent se substituer
aux traitements basés sur les nuages de points, plus coûteux et nécessitant une
chaîne de pré- et post-traitement plus lourde. Il a également permis d’évaluer et
de mieux comprendre les avantages et inconvénients des différents modèles, tâches
de détection, et types de données, d’un point de vue des performances de détection
mais aussi en vue de l’intégration dans une chaîne temps réel et embarquée. Ce
travail souligne l’évolution constante de la synergie entre IA et radar, ouvrant la
voie à des transports plus sûrs et plus intelligents.
Introduction
In recent years, driven by the need for safer and more autonomous transport sys-
tems, the automotive industry has undergone a paradigm shift towards the inte-
gration of a growing number of advanced driver assistance systems (ADAS, Figure
7). As we navigate the journey from low ADAS levels (driver assistance, partial
and conditional automation) toward higher levels of driving automation (levels 4
and 5, see Figure 6), robust perception systems have become paramount. Percep-
tion forms the cornerstone of ADAS systems, allowing vehicles to represent their
surroundings through multiple sensors, enabling informed decision-making for safer
and more efficient driving scenarios.
Among the array of sensors employed in perception, the primary sensing tech-
nologies for ADAS systems are cameras, LiDAR and radars. Ultimately, there is no
one-size-fits-all sensor solution. Each sensor has unique strengths and weaknesses
and can complement or provide redundancy to the other sensor types [Gu 2022].
High-resolution camera sensors appear indispensable for reading traffic signs or de-
tecting and classifying objects. The fine angular and range resolutions of LiDAR
sensors make LiDAR well-suited for high-resolution 3D envi-
ronment mapping. However, cameras and LiDAR technologies’ effectiveness and
reliability become compromised in varying lighting and harsh weather conditions.
Despite the speed and depth information that can be obtained using stereo cameras,
cameras’ ability to measure distance and speed remains extremely limited. Also,
LiDAR’s ability to estimate velocity and detect objects far ahead remains limited.
Figure 7: Type of sensors found in ADAS systems. Vehicles with ADAS are
equipped with various cameras and sensors for 360-degree visibility. Source:
https://dewesoft.com/blog/types-of-adas-sensors

On the other hand, radar has emerged as a formidable contender due to its
unique capabilities in adverse weather conditions or low-light scenarios, and its ro-
bustness in maintaining consistent performance across diverse environments. By
emitting radio waves and measuring their reflections, radar allows for highly accu-
rate speed and distance measurements. While LiDAR illuminates the target scene
with sparsely placed laser beams, radar illuminates the scene seamlessly. LiDAR
may miss smaller targets at greater distances if the targets are situated between
the sharply defined laser beams. As a result, radar is a much more reliable sensor
for longer-range operation [Gu 2022]. Moreover, environmental debris and water
drop refraction introduced by adverse weather conditions will not impair radar op-
erations.
Combined, cameras, LiDAR and radar guarantee a 360° safety cocoon
around the vehicle as shown in Figure 7. While sensor fusion has emerged as a crit-
ical approach for enhancing the perception accuracy and the safety of ADAS systems
(by adding sensor redundancy), the efficacy of fusion hinges upon the robustness
and the performance of individual sensor processing. Driven by the recent surge in
deep learning and computer vision, and the large number of automotive datasets
[Ettinger 2021, Urtasun 2012, Caesar 2019] with cameras and LiDAR data, the re-
search community has seen significant strides in the development of sensor-specific
perception and sensor fusion involving cameras and LiDARs. However, despite its
distinctive strengths, radar has been sidelined for artificial intelligence (AI)-
driven perception tasks. This dearth of exploration is attributed to several factors,
including the limited availability of radar datasets, the inability of radar to capture
colour information, its limited angular resolution compared to camera and LiDAR
sensors and the inherent challenges of processing such data using AI.
Focusing on the radar sensors, this thesis aims to bridge the gap between au-
tomotive radar technology and AI-driven perception. In its current form, radar
data consists of a list of targets (also known as radar point clouds), which contains
information about the position of the target, its velocity and a notion of radar cross
section (RCS) characterising the target (see Figure 8). However, radar point clouds
require significant pre and post-processing steps before being used by AI models.
Also, these processing steps filter the raw signal reflected by objects, which can
affect the performance of artificial intelligence algorithms. One alternative to point
clouds consists of representing the signal reflected by the objects as a spectrum
which represents the environment in distance and velocity (range-Doppler, RD),
distance and angle (range-angle, RA), or range, angle and velocity (range-angle-
Doppler, RAD).
This thesis proposes leveraging radar spectrum representations to detect and
identify road-user objects in complex environments. In essence, this thesis aims
to propose deep learning algorithms tailored explicitly for radar data and study if
those algorithms can substitute conventional radar processing steps. This thesis
was conducted at the ANITI artificial intelligence institute, in collaboration with
the semiconductor company NXP, a world leader in automotive radar transceivers
and microcontrollers. The company is actively involved in building next-generation
radar for enhancing road safety and increasing driver convenience. In this context,
this thesis also serves as a pioneering step in designing AI-enabled radar transceivers
and microcontrollers. The algorithms proposed in this thesis will have to meet the
constraints of the automotive environment: low energy consumption, low complex-
ity and fast reaction time.
Thesis overview
First, we give an introduction to automotive radar, its role in ADAS systems and
its limitations in Chapter 1. Then Chapter 2 gives an overview of prior work
to this thesis. In particular, we review the literature on AI for automotive radar
perception and the limits of current methods.
Second, we address object detection on range-Doppler spectra and introduce DAROD,
an adaptation of the Faster R-CNN architecture tailored to radar data.
Third, we tackle the problem of online object detection for radar. Time is crucial
information for perception. For example, it allows for exploiting correlations be-
tween objects in successive frames. In radar, exploiting the time is crucial as an
object’s signature evolves depending on many factors like the distance, the angle
of arrival and the object’s class. In Chapter 4, we propose a model leveraging
convolutions and convolutional recurrent neural networks for online radar object
detection. The proposed model learns spatio-temporal features from several types
of radar data (range-Doppler, range-azimuth or range-azimuth-Doppler) and can
perform different perception tasks, ranging from object detection to semantic seg-
mentation. Finally, as our model aims to operate in a computationally constrained
environment, we propose an efficient model with few parameters and operations.
For each of the works presented in this thesis, aiming to bridge the gap between au-
tomotive radar technology and AI-driven perception, we present the limitations and
perspectives.
Publications
Decourt, Colin, Rufin VanRullen, Didier Salle, and Thomas Oberlin. "DAROD:
A Deep Automotive Radar Object Detector on Range-Doppler Maps."
In 2022 IEEE Intelligent Vehicles Symposium (IV), pp. 112-118. IEEE, 2022.
Decourt, Colin, Rufin VanRullen, Didier Salle, and Thomas Oberlin. "A
Recurrent CNN for Online Object Detection on Raw Radar
Frames." In arXiv preprint arXiv:2212.11172 (2022). (Under review,
IEEE Transactions on Intelligent Transportation Systems)
Chapter 1
Introduction to automotive radar
Contents
1.1 Radar in ADAS systems
1.2 Radar principle
  1.2.1 The radar equation
  1.2.2 Automotive radar classification and waveform
1.3 FMCW automotive radar
  1.3.1 FMCW radar system
  1.3.2 Radar signal processing chain
  1.3.3 Range and velocity estimation
  1.3.4 Target detection
  1.3.5 Direction of arrival estimation
  1.3.6 Post-processing steps
1.4 Limits of current radar systems
Figure 1.1: Type of radars found in ADAS systems. Different types of radars
(short range, long range, corner), with different functions are arranged all around
the vehicle. Source: NXP Semiconductors
ADAS rely on various sensors to collect data about the vehicle’s surroundings. As most road crashes come from human
error, ADAS enables proactive actions to enhance safety and improve the driving
experience [Brookhuis 2001].
Alongside camera and LiDAR sensors, radar sensors are good candidates for ADAS
applications as they bring complementary information to other sensors. In particu-
lar, because radar sensors emit electromagnetic waves, they can operate in difficult
weather conditions (night, fog, snow, dust, and intense light). Also, they allow
more accurate distance and velocity estimation in a single capture compared to
camera and LiDAR sensors. Therefore, radar sensors are appropriate for, but not
limited to, a wide range of ADAS functions.
When arranged all around the vehicle, and combined with other sensors, radars
allow creating a 360-degree safety cocoon as shown in Figure 1.1.
Figure 1.2: Principle of a radar. A radar emits electromagnetic waves and uses the
reflected waves from objects to estimate the distance, velocity, and azimuth and
elevation angles.
Considering Pt, the nominal transmit power, and a target at a distance R, the
received power Pr is related to the transmit power of the radar by:

$$ P_r = \frac{P_t G \sigma \lambda^2}{(4\pi)^3 R^4} \qquad (1.1) $$
where G is the antenna gain, σ is the RCS of the target, and λ is the wavelength
of the emitted EM wave. Equation 1.1 is known as the radar equation
[Richards 2005]. For automotive radar, G, Pt and λ vary little. Hence the power
received back from a target depends on its RCS σ and decreases proportionally to
$R^4$. The radar equation determines the maximum range $R_{max}$ (in meters) of the
radar for a given target RCS:

$$ R_{max} = \sqrt[4]{\frac{P_t G \sigma \lambda^2}{P_{r,min} (4\pi)^3}} \qquad (1.2) $$
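As a quick numerical illustration of Equations 1.1 and 1.2, the following Python sketch computes the received power and the maximum range. The parameter values (transmit power, gain, RCS, minimum detectable power) are purely illustrative assumptions and are not taken from the thesis.

```python
import numpy as np

def received_power(p_t, g, sigma, wavelength, r):
    """Radar equation (Eq. 1.1): power received from a target at range r."""
    return p_t * g * sigma * wavelength**2 / ((4 * np.pi)**3 * r**4)

def max_range(p_t, g, sigma, wavelength, p_r_min):
    """Eq. 1.2: maximum range for the smallest detectable received power."""
    return (p_t * g * sigma * wavelength**2 / (p_r_min * (4 * np.pi)**3))**0.25

# Illustrative 77 GHz automotive values (assumed for the example only).
wavelength = 3e8 / 77e9          # ~3.9 mm
r_max = max_range(p_t=1e-2, g=100.0, sigma=10.0,
                  wavelength=wavelength, p_r_min=1e-14)
print(f"Rmax = {r_max:.1f} m")
```

Because of the $R^4$ dependence, doubling the maximum range requires roughly sixteen times more transmit power or antenna gain.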
| | LRR (front radar) | MRR (corner radar) | SRR |
|---|---|---|---|
| Distance range ($R_{min}$ - $R_{max}$) | 10-250 m | 1-100 m | 0.15-30 m |
| Range resolution ($\delta_r$) | 0.5 m | 0.5 m | 0.1 m |
| Range accuracy ($\Delta_r$) | 0.1 m | 0.1 m | 0.02 m |
| Azimuth field of view | ± 15° | ± 40° | ± 80° |
| Angular accuracy ($\Delta_\phi$) | 0.1° | 0.5° | 1° |
| Bandwidth | 600 MHz | 600 MHz | 4 GHz |

Table 1.1: Automotive radar sensors classification and their associated characteristics [Hasch 2012]
• Maximum range (Rmax ): the maximum range at which a target can be de-
tected.
• Maximum speed (vmax ): the maximum non-ambiguous speed the radar can
detect. Targets with a relative speed vr higher than vmax will be detected but
their speed will be incorrect.
• Range resolution (δr ): how close in range can two objects of equal strength
be and theoretically still be detected as two objects.
• Speed resolution (δv ): how close in velocity can two objects of equal strength
be and theoretically still be detected as two objects.
System overview Figure 1.3 illustrates a 1Tx-1Rx, FMCW radar system, i.e. a
system with one emitting (Tx) and one receiving (Rx) antenna. Chirps are emitted
through a Tx antenna. The receiving Rx antenna receives the signal reflected
by targets in the radar’s field of view. Then, a mixer multiplies the sent and the
received signals to produce a low-frequency beat signal, whose frequency gives the
target range. A low-pass filter is used to filter out unwanted high frequencies. The
ADC digitises the signal at periodic sampling intervals during each chirp. Finally,
signal processing is performed to obtain radar point clouds.
Figure 1.3: FMCW radar system overview. Chirps are emitted through an antenna.
The receiving antenna receives the signal reflected by targets in the radar’s field
of view. Then, a mixer multiplies the sent and the received signals to produce a
low-frequency beat signal, whose frequency gives the target range. A low-pass filter
is used to filter out unwanted high frequencies. The ADC digitises the signal at pe-
riodic sampling intervals during each chirp. Finally, signal processing is performed
to obtain radar point clouds.
the targets. For the nth antenna, we express the emitted signal as:

$$ s(t) = A_t \exp\left(j 2\pi \left(f_c + \frac{B}{2T} t\right) t\right) \qquad (1.4) $$
$$ \phantom{s(t)} = A_t \exp\left(j \left(2\pi f_c t + \pi K t^2\right)\right), \quad 0 \le t \le T_0 \qquad (1.5) $$
$$ \phantom{s(t)} = A_t \exp\left(j \Phi(t)\right) \qquad (1.6) $$

where $K = \frac{B}{2T}$, fc is the carrier frequency, B is the bandwidth of the signal, T0
is the duration of the chirp (fast time), T is the pulse period, and At is the am-
plitude related to the transmit power. Figure 1.4 depicts one chirp profile and its
parameters. As shown in Figure 1.4, different parameters define the FMCW signal:
• Tsettle : the time for the ramp to be linear. During this phase, there is no
acquisition by the ADC.
• TFFT : the time during which the ADC acquires the data (tens of µs)
• Treset : the time needed for the ramp generator to reset before the next chirp
(few µs)
• Tchirp : Tramp + Treset + Tdwell , the total time of the chirp. Also referred to as
T in Equation 1.5.
Received signal The signal r(t) received at time t from a single reflector at radar
range r = cτ is related to the signal transmitted at time t − 2τ :

$$ r(t) = A_r \exp\left(j\,\Phi(t - 2\tau)\right) \qquad (1.7) $$

where c is the speed of light, and Ar is the amplitude of the received signal.
After mixing (IF block, Figure 1.3), the mixed signal for a single chirp duration
(but this can be generalised to all chirps) has components at the sum and difference
frequencies of the two signals. After filtering out the sum frequencies (which lie
outside the receiver’s bandwidth), we obtain the beat (IF) signal:

$$ x(t) = A \exp\left(j\left(2\pi f_b(r)\, t + \psi(r)\right)\right) \qquad (1.12) $$

with the beat frequency:

$$ f_b(r) = \frac{2K}{c}\, r \qquad (1.13) $$
and the phase of the IF signal:
$$ \psi(r) = \frac{4\pi f_c}{c}\, r = \frac{4\pi}{\lambda}\, r \qquad (1.14) $$
We can see that the phase and the beat frequency of the IF signal are range de-
pendent. While the beat frequency allows distance measurement, we will see in
Section 1.3.3 that the phase variation provides an exquisitely sensitive measure of range
variation which is used in Doppler processing over multiple chirps in the frame.
Finally, we can express the sampled ADC output x[j] at ADC sample j within a
chirp from a target at range r by setting $t = \frac{j}{f_{ADC}}$:

$$ x[j] = A \exp\left(j\left(2\pi f_b(r)\, \frac{j}{f_{ADC}} + \psi(r)\right)\right) \qquad (1.15) $$
where fADC is the sampling frequency of the ADC. For each chirp, the ADC samples
the signal at periodic intervals to obtain a grid-like representation of the signal as
shown in Figure 1.6.
Figure 1.5: Radar signal processing chain. First, the received signal is converted
from the time domain to the frequency domain to extract distance and velocity
information. An object detector is applied to find the range and velocity bins where
there are objects. Then, the azimuth (and the elevation) of objects is estimated.
Finally, some post-processing steps are applied to output the final target list.
$$ r = \frac{c f_b(r)}{2K} \qquad (1.16) $$
The beat frequency is the difference in frequency between the emitted and the
received signals. Because of the linearity of the frequency variation of the chirp,
this difference in frequency is constant and proportional to the delay τ between the
emitted and received signals as shown in Figure 1.6(1). This difference in frequency
Figure 1.6: Range and Doppler radar processing. (1) A spectrogram of an FMCW
waveform with carrier frequency fc and bandwidth B. The emitted signal is in
blue, and the received signal is in green. Orange points correspond to the points
sampled by the ADC. (2) The ADC matrix after sampling and the corresponding
Range-Doppler spectrum. Range FFT is first applied for every chirp. Doppler FFT
is then applied for each chirp index (or sample). (3) Illustration of the phase of the
range FFT that evolves according to the relative velocity.
can be precisely measured using an FFT for every chirp (fast-time index) to obtain
the range spectrum. We call this operation the range FFT (Figure 1.6(2)). The
beat frequency fb [m] (in Hz) corresponding to a peak in bin m in the M point range
FFT sampled at fADC is:
$$ f_b[m] = \frac{m f_{ADC}}{M}, \quad \text{for } 0 \le m \le \frac{M}{2} \qquad (1.17) $$
The range of individual objects r[m] can be computed from the beat frequencies
fb [m] present in peaks of the range spectrum:

$$ r[m] = \frac{c f_b[m]}{2K} = \frac{m\, c\, f_{ADC}}{2 M K} \qquad (1.18) $$
Range limit and resolution For a given ADC sampling frequency fADC , the
maximum range of the radar is inversely proportional to the slope of the chirp and
proportional to the ADC sampling frequency:

$$ R_{max,ADC} = \frac{c f_{ADC}}{4K} \qquad (1.19) $$

The bin spacing of the range FFT gives the range resolution:

$$ \delta_r = \frac{c T_{ramp}}{2 B T_{chirp}} = \frac{c f_{ADC}}{2 M K} \qquad (1.20) $$
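The range processing above can be illustrated with a minimal NumPy sketch: it synthesises the sampled IF signal of Equation 1.15 for a single target, applies the range FFT, and converts the peak bin back to metres with Equations 1.18 and 1.20. All radar parameters below are illustrative assumptions, not those of a specific device.

```python
import numpy as np

c = 3e8
f_adc = 20e6          # ADC sampling frequency (illustrative)
M = 512               # samples per chirp = range FFT size (illustrative)
B = 600e6             # chirp bandwidth (illustrative, LRR-like)
T_ramp = M / f_adc    # acquisition time of one chirp
K = B / T_ramp        # chirp slope in Hz/s, so that f_b = 2*K*r/c (Eq. 1.13)

r_true = 42.0                                   # target range in metres
f_b = 2 * K * r_true / c                        # beat frequency (Eq. 1.13)
j = np.arange(M)
x = np.exp(1j * 2 * np.pi * f_b * j / f_adc)    # sampled IF signal (Eq. 1.15, unit amplitude)

spectrum = np.abs(np.fft.fft(x, M))[: M // 2]   # range FFT, keep positive frequencies
m = int(np.argmax(spectrum))                    # peak bin
r_est = m * c * f_adc / (2 * M * K)             # bin-to-range conversion (Eq. 1.18)
delta_r = c * f_adc / (2 * M * K)               # range resolution (Eq. 1.20)
print(r_est, delta_r)                           # ~42.0 m, 0.25 m
```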
$$ f_d = \frac{\pm 2 v f_c}{c} \qquad (1.21) $$
The beat signal x(t) in Equation 1.12 now includes the Doppler shift.
The beat frequency fb now depends on the target range r and the target’s relative
speed $v = \frac{\partial r}{\partial t}$. The two cannot be separated using the beat frequency of a single
pulse. Instead, multiple chirps are used to estimate the velocity. For the same
sample, the target’s small variation in distance will slightly change the phase ψ(r)
between two chirps. This phase appears in the phase of the range FFT bin as shown
in Figure 1.6(3). Differentiating Equation 1.14 with respect to time gives:

$$ \frac{\partial \psi}{\partial t} = \frac{4\pi}{\lambda}\, \frac{\partial r}{\partial t} = \frac{4\pi v}{\lambda} \qquad (1.23) $$
Applying a second FFT across the chirps (the Doppler FFT) converts this phase
variation into a velocity estimate for each Doppler bin p:

$$ v[p] = \frac{\left(p - \frac{P}{2}\right) c\, f_{PRF}}{2 N f_c}, \quad \text{for } \frac{P}{2} \le p \le N \qquad (1.24) $$

where $f_{PRF} = \frac{1}{T_{chirp}}$ is the pulse repetition frequency (PRF). If there is a target at
range bin m and Doppler bin p, then a peak in the RD map at position (m, p) can
be seen, as shown in Figure 1.6.
Velocity limit and resolution The maximum target velocity (in m/s) is a func-
tion of the PRF:
$$ v_{max,PRF} = \frac{\lambda}{4 T_{chirp}} = \frac{c f_{PRF}}{4 f_c} \qquad (1.25) $$

The bin spacing in an $N_{chirp}$-point Doppler FFT is an estimate of the velocity resolution
δv :

$$ \delta_v = \frac{c}{2 f_c N_{chirp} T_{chirp}} = \frac{c f_{PRF}}{2 f_c N_{chirp}} \qquad (1.26) $$
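A short sketch, with illustrative 77 GHz parameters (assumed for the example, not taken from the thesis), shows how Equations 1.25 and 1.26 translate chirp timing into velocity limits:

```python
c = 3e8
f_c = 77e9                 # carrier frequency (illustrative)
T_chirp = 60e-6            # chirp repetition period (illustrative)
N_chirp = 128              # chirps per frame (illustrative)
wavelength = c / f_c
f_prf = 1.0 / T_chirp

v_max = wavelength / (4 * T_chirp)            # Eq. 1.25: unambiguous velocity (+/- v_max)
delta_v = c / (2 * f_c * N_chirp * T_chirp)   # Eq. 1.26: velocity resolution
print(v_max, delta_v)                         # ~16.2 m/s, ~0.25 m/s
```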
Figure 1.8 shows the procedure to compute the DoA of targets after range and
velocity estimation. Before estimating the DoA of targets, one must first detect
potential targets in the radar’s field of view. Recall that a potential target cor-
responds to a peak in the range-Doppler spectrum. Most FMCW radars apply a
Constant False Alarm Rate (CFAR) [Blake 1988] algorithm to detect peaks in the RD spec-
trum. CFAR automatically adapts the threshold to keep the false alarm rate at the
desired level. Therefore, it will also adapt the probability of detection.
The most common CFAR detector is the cell-averaging detector (CA-CFAR)
[Rohling 1983]. First, the RD map is divided into a grid of cells; each cell contains
information about the radar reflections at a specific range and velocity. For each
cell in the map (cell under test, CUT), the noise is estimated using a 2D sliding
window:
$$ P_n = \frac{1}{N + M} \sum_{j=1}^{N} \sum_{k=1}^{M} x_{jk} \qquad (1.27) $$

where N + M is the number of training cells in the 2D window and xjk are the cells
in the window. Figure 1.7 shows the 2D window of a CFAR algorithm. Generally,
guard cells are placed adjacent to the CUT to prevent signal components from
leaking into the training cells, which could affect the noise estimate. Then, the
threshold factor can be written as [Richards 2005]:
$$ \alpha = (N + M)\left(P_{fa}^{-\frac{1}{N+M}} - 1\right) \qquad (1.28) $$
where Pfa is the desired false alarm rate (set empirically). The detection threshold is
set as:
$$ T = \alpha P_n \qquad (1.29) $$
If the value of the CUT is higher than T , there is a potential object at the cell
coordinate. The coordinate is kept in memory for further processing. It is also
possible to represent the output of the threshold operation as a binary detection
mask.
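The following is a minimal, unoptimised 2D CA-CFAR sketch implementing Equations 1.27-1.29. Window sizes and the desired false alarm rate are illustrative assumptions; production implementations are typically vectorised or run in dedicated hardware.

```python
import numpy as np

def ca_cfar_2d(rd_map, num_train=(8, 8), num_guard=(2, 2), p_fa=1e-3):
    """Minimal 2D cell-averaging CFAR over a range-Doppler magnitude map.

    For each cell under test (CUT), the noise level is the mean of the
    training cells around the CUT, excluding the guard cells (Eq. 1.27);
    the threshold is alpha * noise (Eqs. 1.28-1.29). Returns a boolean
    detection mask. Border cells are skipped for simplicity.
    """
    n_r, n_d = rd_map.shape
    t_r, t_d = num_train
    g_r, g_d = num_guard
    mask = np.zeros_like(rd_map, dtype=bool)

    for i in range(t_r + g_r, n_r - t_r - g_r):
        for j in range(t_d + g_d, n_d - t_d - g_d):
            window = rd_map[i - t_r - g_r:i + t_r + g_r + 1,
                            j - t_d - g_d:j + t_d + g_d + 1]
            guard = rd_map[i - g_r:i + g_r + 1, j - g_d:j + g_d + 1]
            n_train = window.size - guard.size                  # training cells only
            noise = (window.sum() - guard.sum()) / n_train      # Eq. 1.27
            alpha = n_train * (p_fa ** (-1.0 / n_train) - 1.0)  # Eq. 1.28
            mask[i, j] = rd_map[i, j] > alpha * noise           # Eq. 1.29
    return mask

# Toy example: exponential noise floor plus one strong cell.
rd = np.random.exponential(1.0, size=(64, 64))
rd[32, 20] += 30.0
print(np.argwhere(ca_cfar_2d(rd)))
```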
We summarise the processing in Figure 1.10. Summing along the Doppler axis results in a
Range-Angle (or range-azimuth, RA) spectrum.
In practice, the angle FFT is computed across the NRx antennas only on the
range and the Doppler bins where an object is detected, as shown in Figure 1.8.
To enhance the spatial resolution of the radar, super-resolution algorithms (MUSIC
[Schmidt 1986], ESPRIT [Paulraj 1985]) can be used. Super-resolution algorithms
use multiple radar measurements of an object and fuse them to estimate its position
accurately.
MIMO modulation All Rx antennas must be able to separate the signals cor-
responding to different Tx antennas. One way of doing this is to have the Tx antennas
transmit orthogonal signals, where orthogonality can be achieved in different domains:

• Time: Each Tx transmits signals one at a time, using the same spectrum
• Frequency: Tx antennas transmit at the same time, but with a central frequency
shift large enough that their spectra do not overlap.

• Coding: Tx antennas transmit at the same time and frequency, using orthogonal
sequences.
Figure 1.12: MIMO modulation strategies. (a) Orthogonality domains for Tx sig-
nals. (b) Time Division Multiple Access (TDMA) modulation. (c) Doppler Division
Multiple Access (DDMA) modulation. (d) Code Division Multiple Access (CDMA)
modulation.
Following the target detection (range and velocity estimation, peak detection and
DoA estimation), targets are ego-motion compensated, clustered and tracked. For
tracking, Kalman filters [Kalman 1960] are typically used. The tracking might be
constrained to moving targets only; this is why targets are first classified as moving
or static. Ego-velocity of the radar can be estimated based on the detections and
used as a filter to separate still-standing and moving targets. Finally, the final
target list consists of the target’s position in Cartesian coordinates (x, y), the radial
velocity vr , the RCS σ and the angle of arrival θ of the target.
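The thesis does not detail the tracker here; as an illustration of the predict/update cycle of a Kalman filter [Kalman 1960], the sketch below runs a constant-velocity filter on a few hypothetical (x, y) target positions. The state-transition, process-noise and measurement-noise settings are assumptions made only for this example.

```python
import numpy as np

dt = 0.05                                  # frame period (illustrative)
F = np.array([[1, 0, dt, 0],               # constant-velocity state transition
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
H = np.array([[1, 0, 0, 0],                # we only measure position (x, y)
              [0, 1, 0, 0]], dtype=float)
Q = 0.01 * np.eye(4)                       # process noise (hypothetical)
R = 0.25 * np.eye(2)                       # measurement noise (hypothetical)

x = np.zeros(4)                            # state: [x, y, vx, vy]
P = np.eye(4)

def kalman_step(x, P, z):
    # Predict.
    x = F @ x
    P = F @ P @ F.T + Q
    # Update with measurement z = (x, y) of the detected target.
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (z - H @ x)
    P = (np.eye(4) - K @ H) @ P
    return x, P

for z in [np.array([10.0, 2.0]), np.array([10.2, 2.1]), np.array([10.4, 2.2])]:
    x, P = kalman_step(x, P, z)
print(x)   # estimated position and velocity
```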
Chapter 2
Related work
Contents
2.1 Deep learning background
2.2 Computer vision background
  2.2.1 Image classification
  2.2.2 Object detection
  2.2.3 Image segmentation
2.3 Automotive radar datasets
  2.3.1 Point clouds datasets
  2.3.2 Raw datasets
2.4 Automotive radar perception on radar point clouds
2.5 Automotive radar perception on raw data
  2.5.1 Object classification
  2.5.2 Object detection and segmentation
  2.5.3 Data augmentation for radar
This chapter gives an overview of the prior work to this thesis. First, we present
deep learning and computer vision fundamentals. Second, we review the literature
for automotive radar perception. In this thesis, we use the term perception for
computer vision applied to radar because it acknowledges the broader process of
acquiring and interpreting sensory information, regardless of the sensing modality
involved, while computer vision traditionally focuses on visual data analysis.
In Sections 2.1 and 2.2, we present deep learning background and some models
for object detection and segmentation in computer vision that are useful for this
thesis. In Section 2.3, we present available automotive datasets for radar perception.
In Sections 2.4 and 2.5, we give an overview of the literature on radar perception
for radar point clouds and raw data, respectively.
A machine learning algorithm is an algorithm that can learn from data. According to [Mitchell 1997]:
"A computer program is said to learn from experience E with respect to some
class of tasks T and performance measure P, if its performance at tasks in T, as
measured by P, improves with experience E.". While traditional machine learning
learns a mapping from hand-crafted representation (features) to a specific output,
deep learning learns not only a mapping from representation to output but also the
representation itself.
In this thesis, task T consists of image classification, segmentation, and object
detection. The performance measure P is specific to the task T. It is used to
measure how well our algorithm performs on unseen data. The experience E is a
set of data points with associated features, also known as a dataset. Each data
point is associated with a label or target in a supervised setting. The target is used
to teach the deep learning algorithm what to do.
An artificial neuron computes a weighted sum of its inputs followed by an activation
function:

$$ y = \sigma\left(\sum_{i=1}^{n} w_i x_i + b\right) \qquad (2.1) $$

where wi and b are the weights and the bias of the neuron and σ is an activation
function. Typically σ is non-linear to allow the neuron to learn complex non-linear
functions. The most popular activation functions are the sigmoid function
$\sigma(a) = \frac{1}{1 + \exp(-a)}$, the rectified linear unit (ReLU) σ(a) = max(0, a) and the hyperbolic
tangent function σ(a) = tanh(a). Figure 2.1 depicts the computation of a neuron.
Figure 2.1: Artificial neural neuron. A neuron is a function that takes a set of
input signals x = (x1 , x2 , ..., xn ), computes a weighted sum of the inputs and applies
an activation function to it to produce the output.
Figure 2.2: The multi-layer perceptron. Neurons are stacked together to form
layers. The first layer is called the input layer. The last layer is called the output
layer. Other layers are called hidden layers.
In an ANN, neurons are stacked together to form layers, and the network be-
comes a composition of layers. Since an artificial neural network is a composition
of functions, it is itself a function. Because the information flows from the inputs,
through the intermediate functions and finally to the output, we also refer to ANNs
as feed-forward neural networks.
A common ANN is the multi-layer perceptron (MLP, Figure 2.2). The MLP
stacks multiple layers of neurons together, where neurons of each layer are connected
to other neurons belonging to different layers through edges with associated weights.
The output of the ith layer of an MLP is computed as:

$$ h^{(i)} = \sigma^{(i)}\left(W^{(i)} h^{(i-1)} + b^{(i)}\right) \qquad (2.2) $$

where $W^{(i)}$, $b^{(i)}$ and $\sigma^{(i)}$ are the weight matrix, the bias vector and the activation
of the ith layer respectively. Thus, the MLP is a θ-parameterised function defining a
mapping y = f(x; θ), where $\theta = ((W^{(i)}, b^{(i)}), ..., (W^{(1)}, b^{(1)}))$ are the parameters of
the network, x = (x1 , x2 , ..., xn ) are the input values and y = (y1 , y2 , ..., ym ) are the
output values of the network.
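A minimal NumPy sketch of the forward pass of Equation 2.2, with hypothetical layer sizes and random weights, makes the layer-by-layer composition explicit:

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def mlp_forward(x, params, activations):
    """Forward pass of an MLP: h^(i) = sigma^(i)(W^(i) h^(i-1) + b^(i))."""
    h = x
    for (W, b), sigma in zip(params, activations):
        h = sigma(W @ h + b)
    return h

rng = np.random.default_rng(0)
# A toy 3-4-2 network with hypothetical random weights.
params = [(rng.normal(size=(4, 3)), np.zeros(4)),
          (rng.normal(size=(2, 4)), np.zeros(2))]
activations = [relu, lambda a: 1.0 / (1.0 + np.exp(-a))]   # hidden ReLU, sigmoid output
y = mlp_forward(np.array([0.5, -1.0, 2.0]), params, activations)
print(y)
```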
MLPs are not the only type of ANNs. In the following paragraphs we choose
to present two types of neural networks, namely the convolutional neural networks
and the recurrent neural networks. In particular, we leave aside the well-known
Transformer architecture [Vaswani 2017a].
In a supervised training setting, for a given loss function L, we minimise the empirical
risk:

$$ J(\theta) = \mathbb{E}_{x,y \sim \hat{p}_{data}(x,y)}\left[L(f(x; \theta), y)\right] = \frac{1}{N} \sum_{i=1}^{N} L(f(x^{(i)}; \theta), y^{(i)}) \qquad (2.3) $$
Where N is the number of training examples, and y is the target output. The
minimisation process, the optimisation step, uses backpropagation and a gradient
descent algorithm such as the Stochastic Gradient Descent (SGD) [Kiefer 1952,
Robbins 1951].
Where ŷ (t) is the prediction of the RNN at timestep t and h(t) is the hidden state at
timestep t. RNNs are trained using backpropagation through time (BPTT). BPTT
is a computational technique that allows for the efficient calculation of gradients by
unfolding the network across time and propagating the errors backwards. However,
traditional RNNs cannot handle long-term dependencies and suffer from the van-
Figure 2.3: Left: a recurrent neural network. Right: Unfolded recurrent neural
network.
2.2. Computer vision background 39
ishing gradient problem. The vanishing gradient problem refers to the issue where
the gradients become very small during backpropagation, hindering the training
process and resulting in slow or ineffective convergence. Variants of RNNs have
been proposed to solve this, such as Long-Short-Term-Memory networks (LSTMs)
[Hochreiter 1997] and Gated Recurrent Unit (GRU) [Cho 2014]. LSTMs and GRUs
add additional cells (called gates) to allow the gradient to flow through the network
without vanishing.
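For illustration, the sketch below unrolls a vanilla (Elman-style) RNN over a toy sequence using the standard update $h^{(t)} = \tanh(W_{xh} x^{(t)} + W_{hh} h^{(t-1)} + b_h)$ and $\hat{y}^{(t)} = W_{hy} h^{(t)} + b_y$. The exact formulation and the dimensions are assumptions for the example; BPTT corresponds to backpropagating through this unrolled computation graph.

```python
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, W_hy, b_h, b_y):
    """Unrolled vanilla RNN: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h),
    y_t = W_hy h_t + b_y, applied step by step over the input sequence."""
    h = np.zeros(W_hh.shape[0])
    outputs = []
    for x_t in x_seq:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        outputs.append(W_hy @ h + b_y)
    return np.stack(outputs), h

rng = np.random.default_rng(0)
d_in, d_h, d_out, T = 3, 5, 2, 4          # toy dimensions (assumed)
y_seq, h_T = rnn_forward(rng.normal(size=(T, d_in)),
                         rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h)),
                         rng.normal(size=(d_out, d_h)), np.zeros(d_h), np.zeros(d_out))
print(y_seq.shape, h_T.shape)             # (4, 2) (5,)
```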
$$ (K * I)(h, w) = \sum_{i} \sum_{j} \sum_{c} I(i, j, c)\, K(h - i, w - j, c) \qquad (2.6) $$

Where $h, w \in \mathbb{N}$ define the coordinates in the image (or the feature maps). CNNs
show interesting properties for grid-like data. The kernel weights are shared for the
entire image, reducing the number of parameters of the network. Also, convolutions
are equivariant to translation. This means that if we shift an object in the input
image I, its representation in the output feature map is shifted by the same amount.
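A naive NumPy version of Equation 2.6 over a multi-channel input clarifies the sliding-window computation. As in most deep learning libraries, the sketch computes a cross-correlation (no kernel flip), which is equivalent to Equation 2.6 up to flipping the kernel; shapes and values are illustrative assumptions.

```python
import numpy as np

def conv2d_single_filter(image, kernel):
    """Naive 'valid' 2D convolution of a multi-channel image with one filter.

    image: (H, W, C), kernel: (kh, kw, C) -> output: (H-kh+1, W-kw+1).
    The same kernel weights slide over the whole image (weight sharing).
    """
    H, W, C = image.shape
    kh, kw, _ = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for h in range(out.shape[0]):
        for w in range(out.shape[1]):
            out[h, w] = np.sum(image[h:h + kh, w:w + kw, :] * kernel)
    return out

rng = np.random.default_rng(0)
feature_map = conv2d_single_filter(rng.normal(size=(8, 8, 3)),
                                   rng.normal(size=(3, 3, 3)))
print(feature_map.shape)   # (6, 6)
```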
AlexNet
Krizhevsky et al. [Krizhevsky 2012] leveraged the power of GPUs (Graphics Processing
Units) and introduced AlexNet, a CNN with a similar architecture to LeNet-5
[Lecun 1989]. AlexNet introduced better non-linearity in the network with the
ReLU activation function and proved that ReLU is more efficient for gradient propa-
gation. Moreover, the paper introduced two major deep learning concepts: dropout
as a regularisation method and data augmentation to reduce overfitting. Finally,
Krizhevsky et al. showed that deeper networks are better.
The more convolutional layers there are, the more fine-grained features the network
learns for classification. Although now outdated, AlexNet was the forerunner of the
current use and craze for deep learning.
Krizhevsky et al. suggest that the depth of CNNs allows finer feature extraction.
Simonyan and Zisserman [Simonyan 2015] explored this point by stacking several
convolutional layers with small kernels (3 × 3) together. VGG networks build upon
the following configuration: a stack of convolutional layers (which have different
depths in different architectures), three fully connected layers and a softmax layer.
The depth of the networks ranges from 11 layers to 19 layers. The deepest architec-
ture (VGG19) reaches a 7.5% top-5 validation error, outperforming AlexNet.
AlexNet [Krizhevsky 2012], VGG [Simonyan 2015] or GoogLeNet [Szegedy 2015] all
follow the same trend: going deeper. However, stacking more and more layers does
not necessarily lead to better accuracy. When the depth of the network increases, a
degradation problem appears. Accuracy gets saturated, and adding layers leads to
higher training errors. In other words, the networks are more challenging to train
when they are deeper because it is more difficult to backpropagate the gradient.
The ResNet network family [He 2016] introduces the concept of residual connections.
As shown in Figure 2.5, identity mapping is added via an element-wise addition
between the input and the output of the layer. This helps the gradient propagation
and avoids the problem of vanishing gradient. Also, residual connections help to
combine different levels of features at each network step. He et al. [He 2016] were
able to stack up to 152 layers, thus reaching a top-5 validation error of 5.71%.
Efficient CNNs
The general trend in deep learning is to build bigger and deeper networks to extract
more fine-grained features. Despite increasing the accuracy, these networks are not
computationally efficient in terms of size and speed. In many real-world applications,
including the automotive applications in this thesis, the recognition and detection tasks
must be carried out at the edge, on computationally limited accelerators. MobileNet
[Howard 2017, Sandler 2018, Howard 2019] family introduces a new kind of efficient
architecture in order "to build very small and low latency models that can be easily
matched to the design requirements for mobile and embedded vision application"
[Howard 2017]. In this thesis, we build an efficient network upon these requirements;
the contribution is presented in Chapter 4.
MobileNetV1 [Howard 2017] is one of the first CNN architectures built for mo-
bile and embedded vision applications. MobileNetV1 is based on a simple architec-
ture (similar to VGG [Simonyan 2015]) and uses depthwise separable convolutions
instead of plain convolutions to build a lightweight deep neural network. Depthwise
Figure 2.6: In MobileNetV1, the standard convolutional filters in (a) are replaced
by two layers: depthwise convolution in (b) and pointwise convolution in (c) to
build a depthwise separable filter with M input channels, N output channels and
a kernel size DK .
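The computational saving of depthwise separable convolutions can be checked with simple multiply-add counts; the layer dimensions below are illustrative assumptions, not taken from MobileNetV1.

```python
def standard_conv_cost(d_k, m, n, d_f):
    """Multiply-adds of a standard convolution with a d_k x d_k kernel,
    M input channels, N output channels, on a d_f x d_f feature map."""
    return d_k * d_k * m * n * d_f * d_f

def depthwise_separable_cost(d_k, m, n, d_f):
    """Depthwise (d_k x d_k per input channel) followed by pointwise (1x1)."""
    depthwise = d_k * d_k * m * d_f * d_f
    pointwise = m * n * d_f * d_f
    return depthwise + pointwise

# Illustrative layer: 3x3 kernels, 64 -> 128 channels, 56x56 feature map.
std = standard_conv_cost(3, 64, 128, 56)
sep = depthwise_separable_cost(3, 64, 128, 56)
print(std / sep)   # roughly 8-9x fewer operations
```

For 3 × 3 kernels the reduction is roughly 1/N + 1/D_K², i.e. about 8-9 times fewer operations and parameters for this layer.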
Figure 2.7: Inverted Residual block. The + symbol corresponds to the addition
operation.
Object detection is a computer vision task that aims to detect and locate objects
of interest in an image or video. The task involves identifying the position and the
boundaries (bounding boxes) of objects and classifying them into different categories
(see Figure 2.4).
The most popular benchmarks and datasets for object detection are the Pascal VOC
(Visual Object Classes) [Everingham 2012] and the MSCOCO [Lin 2014] datasets.
For autonomous driving applications, other benchmarks exist, such as the nuScenes
dataset [Caesar 2019], the KITTI Vision Benchmark [Urtasun 2012], or the Waymo
Open Dataset challenge [Ettinger 2021].
The Pascal VOC dataset [Everingham 2012] is one of the first object detection and
segmentation datasets. The first version was released in 2005, but the 2012 version is
the most popular. Pascal VOC dataset contains around 10K images over 20 object
categories (vehicles, animals, bicycles). Each image has pixel-level segmentation,
bounding box, and object class annotations. The Pascal VOC has been widely used
for object detection, semantic segmentation and classification tasks but remains a
small dataset.
MSCOCO dataset [Lin 2014] is a large-scale object detection, segmentation,
keypoint recognition, and captioning dataset. To this day, this is the benchmark
of reference for computer vision tasks. The dataset comprises 328k images over
80 categories (91 for the latest version). The dataset has various annotations,
including bounding boxes, semantic and panoptic segmentation, and keypoint
detection annotations.
Metrics
Intersection Over Union IoU is a measure based on the Jaccard index that evalu-
ates the overlap between two bounding boxes (or two segmentation masks for image
segmentation). It requires a ground-truth box Bgt and a predicted bounding box
Bp . The IoU ranges from 0 to 1. A perfect object localisation would have an IoU
of 1. By setting a threshold, we can tell if a detection is valid (true positive, TP)
or not (false positive, FP). The IoU is given by the overlapping area between Bgt
and Bp , divided by the area of their union:

$$ IoU = \frac{area(B_p \cap B_{gt})}{area(B_p \cup B_{gt})} \qquad (2.8) $$
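A direct implementation of Equation 2.8 for axis-aligned boxes, given as (x1, y1, x2, y2) corners (an assumed convention), is only a few lines:

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2), per Eq. 2.8."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # 25 / 175 ~ 0.143
```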
Precision and Recall The precision is the ability of the model to identify only
the relevant objects. It is the percentage of correct positive predictions (IoU >
threshold):
$$ \text{Precision} = \frac{TP}{TP + FP} = \frac{TP}{\text{all detections}} \qquad (2.9) $$
The recall is the ability of a model to find all the relevant cases (all ground
truth bounding boxes). It is the percentage of true positives detected among all
relevant ground truths. For critical automotive scenarios, a high recall is desirable,
indicating we do not miss any objects.
$$ \text{Recall} = \frac{TP}{TP + FN} = \frac{TP}{\text{all ground truths}} \qquad (2.10) $$
Mean Average Precision The mAP evaluates the average precision for all
classes in the dataset. In practice, the AP is the area under the curve (AUC)
of the precision vs recall curve. The COCO mAP [Lin 2014] consists of computing
the AP for each class at different IoU thresholds, ranging from 0.5 to 0.95 with
a 0.05 step, and averaging them.
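As a sketch of how AP is obtained in practice, the function below ranks the detections of one class by confidence, accumulates TP/FP counts into a precision-recall curve (Equations 2.9-2.10), and integrates it with the usual monotone (all-point) interpolation. Matching each detection to a ground truth at a given IoU threshold is assumed to have been done beforehand; the toy inputs are illustrative. The COCO mAP then averages this value over classes and IoU thresholds.

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """Single-class AP at one IoU threshold.

    scores: confidence of each detection; is_tp: 1 if it matches a ground
    truth with IoU above the threshold, else 0 (FP); num_gt: number of
    ground-truth boxes. AP is the area under the precision-recall curve.
    """
    order = np.argsort(-np.asarray(scores))
    tp = np.asarray(is_tp, dtype=float)[order]
    fp = 1.0 - tp
    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(fp)
    recall = cum_tp / num_gt                      # Eq. 2.10
    precision = cum_tp / (cum_tp + cum_fp)        # Eq. 2.9
    # Make precision monotonically decreasing, then integrate over recall.
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    recall = np.concatenate(([0.0], recall))
    precision = np.concatenate(([precision[0]], precision))
    return np.sum(np.diff(recall) * precision[1:])

print(average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 1], num_gt=4))
```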
Models
Fast R-CNN In R-CNN, each proposal goes through a CNN for classification
and regression, which is inefficient and not adapted to real-time applications. In
[Girshick 2015], Girshick et al. reduce the computational cost of R-CNN by
performing a single CNN forward pass2 on the image. Then, they extract Regions
Of Interest (RoI) using the selective search algorithm on the produced feature maps.
Each RoI is reduced to a fixed size using a pooling layer and a small shared two-
layer MLP extracts features. Finally, the vector created by the two-layer MLP is
used to predict the object’s class with a softmax classifier and its position with a
linear regressor. Fast R-CNN allows nine times faster training speed and processes
images 146 times faster than R-CNN.
Faster R-CNN The R-CNN [Girshick 2014] and the Fast R-CNN [Girshick 2015]
depend on heuristic-based region proposal algorithms (selective search) to hypoth-
esise object locations. However, region proposal algorithms are slow compared to
neural networks on GPUs. For example, in Fast R-CNN, the selective search algo-
rithm takes up to 2 seconds at inference time to produce proposals. Therefore, Ren et al.
[Ren 2017] propose to use GPUs to compute the proposals with deep convolutional
neural networks. They introduce a novel network in the Fast R-CNN framework,
the region proposal network (RPN). The RPN takes a set of feature maps (produced
by a backbone CNN) as input and outputs a set of object proposals with associated
objectness scores.
¹ R-CNN uses the AlexNet [Krizhevsky 2012] architecture to extract features.
² Fast R-CNN uses a VGG [Simonyan 2015] backbone.
Figure 2.9: Left: Faster R-CNN overview. First a CNN extracts features from an
input image to produce the feature maps. Then the RPN generates proposals that
are used as input to a Fast R-CNN head to predict the position and the class of
objects in the input image. Right: The RPN and its anchors at a single location.
For each position in the feature map, a 3 × 3 convolution is applied. Then two
MLPs are used to predict a set of k proposals relative to k reference boxes called
the anchors. Source: [Ren 2017]
Figure 2.10: YOLO overview. YOLO divides the input image into a S × S grid,
and for each cell predicts B bounding boxes, confidence for those boxes and C class
probabilities. The predictions are then encoded as an S ×S ×(B ∗5+C) tensor. The
S × S grid corresponds to the output feature maps of a CNN, and the B bounding
boxes are similar to the anchors in Faster R-CNN. Source: [Redmon 2016]
Mask R-CNN replaces the classic RoI pooling operation with a new pooling operation called
RoIAlign (Region of Interest Align). RoIAlign removes the quantisation of RoI
pooling and computes the exact coordinates of objects. Additionally, Mask R-CNN
adopts a ResNeXt-101 backbone [Xie 2017] with Feature Pyramid Networks (FPN)
[Lin 2017a]. FPN uses a top-down architecture with lateral connections to extract
features according to their scale.
Figure 2.11: SSD model vs. YOLO model. SSD detects objects at multiple scales
while YOLO uses single-scale feature maps [Liu 2016]
high-resolution features. Moreover, the authors add pass-through layers which con-
catenate high-resolution features with low-resolution features to obtain fine-grained
feature maps.
Conclusion In this section we presented the main frameworks for object detec-
tion in computer vision. Most of object detectors are built upon YOLO, SSD or
Faster R-CNN. To further improve the performance of these object detectors, re-
searchers focus their works on improving features extraction [Liu 2022, Liu 2021],
training strategy [Caron 2021] or new paradigms such as anchor-free object detec-
tors [Tian 2019, Carion 2020a, Zhou 2019]. In this thesis, we will use the Faster
R-CNN framework and study how well this architecture is suited to radar object
detection. Two-stage detectors are generally more accurate but slower than single-
stage detectors. Thus we will optimise the feature extraction stage to extract
As for object detection, the PASCAL VOC [Everingham 2012] and the MSCOCO
[Lin 2014] datasets are famous benchmarks for semantic segmentation. Addition-
ally, the Cityscapes [Cordts 2016] dataset is widely used for semantic segmentation.
We refer the reader to Section 2.2.2 for details about Pascal VOC and MSCOCO.
The Cityscapes dataset [Cordts 2016] is a large-scale dataset for semantic un-
derstanding of urban street scenes. It provides semantic, instance-wise, and dense
pixel annotations for 30 classes grouped into eight categories (flat surfaces, humans,
vehicles, constructions, objects, nature, sky, and void). The dataset is small com-
pared to MSCOCO. Indeed, Cityscapes contains only 5000 finely annotated images
and 20000 coarsely annotated ones.
Metrics
Because semantic segmentation models predict masks, the mIoU metric we defined
in Section 2.2.2 and the pixel accuracy are used. The IoU is computed between a
ground truth mask and the prediction for each class. Then, by averaging the IoU
of each class, we compute the mIoU.
The pixel accuracy is the percentage of pixels in the image which are correctly
classified. Generally, pixel accuracy is reported for each class separately and by
averaging across classes. One issue with pixel accuracy is that it can provide mis-
leading results when the class representation is small within the image (e.g. mostly
background).
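Both metrics are straightforward to compute from integer label masks, as in the minimal sketch below (the toy masks are illustrative):

```python
import numpy as np

def miou_and_pixel_accuracy(pred, target, num_classes):
    """Mean IoU and overall pixel accuracy from integer label masks."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:                      # ignore classes absent from both masks
            ious.append(inter / union)
    pixel_acc = (pred == target).mean()
    return float(np.mean(ious)), float(pixel_acc)

pred = np.array([[0, 0, 1], [1, 1, 2], [2, 2, 2]])
target = np.array([[0, 0, 1], [0, 1, 2], [2, 2, 1]])
print(miou_and_pixel_accuracy(pred, target, num_classes=3))
```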
Models
Semantic segmentation models aim to label each pixel of an image with a corre-
sponding class. Thus, such models require the same input and output size. A naive
approach is to design an architecture with convolutional layers without decreasing
the input size. Then, apply a softmax function to the last feature maps. Neverthe-
less, this is computationally expensive. Deep CNNs for image classification generally
downsample the size of the input multiple times to learn deeper representations.
However, we must produce a full-resolution segmentation mask the same size as the
input image for semantic segmentation. One popular image segmentation approach
follows an encoder-decoder architecture, where the encoder downsamples the spa-
tial resolution, developing lower-resolution feature maps, and where the decoder
upsamples the feature representations learned by the encoder into a segmentation
mask.
Fully Convolutional Networks (FCN) Long et al. [Long 2015] were the first
to propose a fully convolutional network trained end-to-end for image semantic seg-
mentation. The authors proposed to adapt existing image classification networks
(e.g. AlexNet [Krizhevsky 2012]) as an encoder and use transpose convolution (or
deconvolution) layers on the top of the feature maps to upsample low-resolution
features into a full-resolution segmentation map. However, FCN struggles to pro-
duce fine-grained segmentation masks. Indeed, the input’s resolution is reduced by a
factor of 32, and the authors use a single deconvolution layer. To address this issue, the
authors propose slowly upsampling the encoded representation at different stages,
adding skip connections from earlier layers and summing feature maps together. It
allows fine layers (where) to be combined with coarse layers (what), improving the
segmentation of object boundaries.
U-Net Later on, Ronneberger et al. [Ronneberger 2015] improved the FCN ar-
chitecture by expanding the capacity of the decoder. Instead of using a single
deconvolution, they propose a symmetric encoder-decoder architecture for image
semantic segmentation. The encoder (referred to as contracting path in the original
paper) captures context. The decoder (referred to as expanding path) is symmetric
to the encoder and enables precise localisation. Also, U-Net adds skip connections
between the encoder and the decoder to combine low-level features (where) with
high-level features (what). The U-Net architecture has become popular and has been modified and adapted to various segmentation problems. Today, it can be considered the reference encoder-decoder architecture. Figure 2.12 depicts the U-Net architecture.
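As an illustration of this encoder-decoder principle, the following PyTorch sketch implements a deliberately tiny U-Net-style network with a single skip connection. It is not the original U-Net of [Ronneberger 2015] (which uses more scales and a deeper contracting path), only a minimal example of the idea.

import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    """Two-scale U-Net-style encoder-decoder with a single skip connection."""
    def __init__(self, in_ch, num_classes):
        super().__init__()
        self.enc1 = conv_block(in_ch, 32)          # high-resolution features ("where")
        self.enc2 = conv_block(32, 64)             # low-resolution features ("what")
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec = conv_block(64, 32)              # 64 = 32 (skip) + 32 (upsampled)
        self.head = nn.Conv2d(32, num_classes, 1)  # per-pixel class scores

    def forward(self, x):
        f1 = self.enc1(x)
        f2 = self.enc2(self.pool(f1))
        up = self.up(f2)
        out = self.dec(torch.cat([up, f1], dim=1))  # skip connection: combine where + what
        return self.head(out)                       # logits of shape (B, num_classes, H, W)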
Figure 2.13: Overview of point cloud radar datasets. (a) nuScenes [Caesar 2019] dataset sample. We notice few points per object, and the radar has no elevation capabilities. (b) Astyx [Meyer 2019] dataset sample. Compared to nuScenes data, the radar has elevation capabilities, and more points per object are available. (c) RadarScenes [Schumann 2021] dataset sample. The point cloud is denser than the nuScenes point cloud, but no elevation measurement exists. (d) VoD dataset [Palffy 2022] sample. We can see that the radar (big points) has a lower resolution than the LiDAR, but a better resolution than the nuScenes data, and provides elevation measurements.
azimuth θ, the elevation α and the radial velocity vr . Point clouds can be used for
object classification, segmentation (clustering objects of the same class) and sensor
fusion. Because radar point clouds are sparse, they are particularly appropriate for
embedded applications. This sparsity is also a drawback because models need more
information to generalise well.
nuScenes [Caesar 2019], released in 2019, is the first public large-scale dataset for
autonomous driving. It contains 2+1D radar point clouds from 5 radars alongside
cameras, LiDARs and IMUs. Although data from several radars is available, these
radars have low resolution, resulting in very sparse (few points per object) point
clouds as shown in Figure 2.13. Some work tried using this dataset for object
detection [Niederlöhner 2022, Svenningsson 2021], but the resolution is too low to
obtain enough detection accuracy. As a result, nuScenes’ radar data is mainly used
for sensor-fusion applications rather than radar-only perception.
CRUW dataset, and annotations are in the form of points to remain compliant
with conventional radar outputs. Unfortunately, the CRUW dataset is not publicly
available, but the authors released a subset of it, named the ROD2021 dataset. The ROD2021 dataset contains about 50k frames from a single RGB camera and an FMCW radar.
In this thesis, we use these datasets as they were the only ones available during most of the thesis. We also chose to work on these datasets because the low-resolution aspect of the radars used in CARRADA, RADDet or CRUW is challenging. We hypothesise that with enough data and good annotations, a deep learning algorithm could overcome the low resolution of the radar. We show samples
of these datasets in Figure 2.14.
Low-resolution radar datasets are very useful for detecting objects in range and
velocity. However, the low angular resolution makes detecting and classifying objects in the azimuth domain difficult. More recently, high-resolution radar datasets have
started emerging. The goals of high-resolution datasets are multiple: enabling
accurate 3D detection and classification for radar sensors [Paek 2022, Rebut 2022,
Madani 2022] and avoiding the computationally expensive generation of RA radar
maps [Rebut 2022].
RADIal (for Radar, LiDAR et al.) is a raw high-resolution radar dataset in-
cluding other sensor modalities like cameras and LiDAR. RADIal aims to motivate
research on automotive high-resolution radar and camera-lidar-radar sensor fusion.
RADIal contains around 25k synchronised frames, of which roughly 8k are labelled with vehicles and free-space driving masks. Data are provided as ADC samples that can be used directly to detect and classify vehicles, avoiding time-consuming RA map generation. However, the RADIal dataset mainly contains data recorded on highways or in the countryside and only car labels, making it challenging to use for more general applications (city roads).
K-Radar [Paek 2022] is a 3+1D radar dataset of a similar size to RADIal [Rebut 2022], collected under various scenarios (e.g. urban, suburban, highways), times (e.g. day, night) and weather conditions (e.g. clear, fog, rain, snow). It contains around 35k manually annotated 4D radar tensors (range, Doppler, azimuth, elevation). Unlike the ADC data of the RADIal dataset, K-Radar tensors are heavy, and the dataset cannot be downloaded because of its massive size (16 TB).
Finally, Madani et al. introduce the Radatron dataset [Madani 2022]. The
Radatron dataset is a high-resolution radar dataset using a cascaded MIMO radar.
As for the K-Radar dataset [Paek 2022], radar data is in the form of 4D tensors. The
7 https://www.cruwdataset.org/rod2021
Figure 2.15: (a) RADIal dataset sample. From left to right: camera image with
projected laser point cloud in red and radar point cloud in indigo, vehicle annotation
in orange and free-driving space annotation in green ; radar power spectrum (MIMO
RD) with bounding box annotation ; free-driving space annotation in bird-eye view,
with vehicles annotated with orange bounding boxes, radar point cloud in yellow and
LiDAR point cloud in red ; range-azimuth map in Cartesian coordiates, overlayed
with radar point cloud and LiDAR point cloud [Rebut 2022]. (b) K-Radar dataset
sample in various weather conditions. (c) Radatron dataset sample. Ground truth
are marked in green.
dataset is collected under clear weather conditions, and out of the 152k frames, 16k
vehicles were annotated with 2D bounding boxes on RA maps. Radatron presents
several limitations:
2. Radatron does not leverage the 4D nature of the data because annotations
are only provided for 2D RA maps.
However, learning to detect vulnerable road users like pedestrians is essential for
automotive applications.
Figure 2.16: Overview of scanning radar datasets. Scanning radar allows a 360°
view around the car. (a) RADIATE dataset sample with different driving scenarios
under several weather conditions. (b) Oxford RobotCar dataset sample.
Scanning radar data (see Figure 2.16) is another type of radar data available. One advantage of scanning radars is that they provide a 360° representation of the environment. Because they measure each azimuth using a moving antenna, scanning radars provide a better azimuth resolution than low-resolution radars (around 0.9° azimuth resolution). However, they do not provide Doppler information, a significant advantage of radar sensors and crucial for automotive applications. During this thesis, we will not use these datasets because we consider the Doppler information a core radar component and because scanning radars are not used in practice.
The Oxford RobotCar [Barnes 2020], MulRan and RADIATE [Sheeny 2021]
datasets provide radar data using scanning radars. The Oxford RobotCar dataset
contains around 240k radar scans collected in various traffic, weather and lighting conditions in Oxford, UK. Data from sensors such as LiDAR, GPS or cameras are
also available. However, the authors of the Oxford RobotCar dataset do not provide
annotations.
Danzer 2019]. In contrast, object segmentation methods attempt to classify each re-
flection to create clusters automatically [Schumann 2018, Danzer 2019, Feng 2019].
Object detection and segmentation models for radar point clouds use all the
reflections available in the scene (or accumulated over a short period to increase
the resolution) as input. The most common approaches are grid-based or point-
based.
Grid-based approaches usually render the radar point cloud to a 2D bird-
eye-view (BEV) representation or 3D cartesian grid and apply a CNN on it
[Dreher 2020, Niederlöhner 2022, Xu 2021]. In [Xu 2021], the authors render the point cloud onto pillars and use a self-attention mechanism to solve the problem of orientation estimation in a grid-based approach. [Dreher 2020] exploits the YOLOv3 architecture on a grid-map representation of the point cloud. However, the sparsity of the data does not lead to encouraging results. Niederlöhner et al. [Niederlöhner 2022] accumulate point clouds over time to reduce the sparsity of the data and apply an FPN architecture on the rendered point clouds for object detection and cartesian velocity estimation. As in [Dreher 2020], the results were not encouraging due to the high sparsity of the radar point cloud.
Point-based approaches are appropriate for sparse point-cloud object detection.
Indeed, they do not pad the data with zeros when there is no measurement. Instead,
they learn the relationship between each point in a local neighbourhood. Point-
based CNNs create a pseudo-image of the point cloud inside the object detection model. Well-known architectures are PointNet [Charles 2017], PointNet++ [Qi 2017], VoxelNet [Zhou 2018] and PointPillars [Lang 2019]. In the literature, researchers successfully modify these architectures for object detection or segmentation on radar point clouds [Schumann 2018, Feng 2019, Tilly 2020, Palffy 2022, Xiong 2022]. In particular, Xiong et al. [Xiong 2022] show that contrastive learning on radar clusters helps improve overall detection performance using less training data. Ulrich et
al. [Ulrich 2022] take advantage of both methods. They mix point-based and grid-
based approaches to improve object detection and orientation estimation on radar
point clouds. [Fent 2023] employs a graph neural network (GNN) instead of a CNN
for object detection and segmentation on radar reflections.
Finally, other works on radar point clouds exist for ghost target detection
[Kraus 2020] or scene-flow estimation [Ding 2022, Ding 2023].
Figure 2.17: Raw data object classification data flow. The dotted lines represent
an optional operation.
Features-based methods
ant features from the image directly. They show that using image-based features
as input of an SVM classifier achieves an 88% accuracy on pedestrians.
Spectrum-based methods
The work of Prophet et al. [Prophet 2018a] shows the potential of using the spectrum to classify objects in radar. This work led to many studies on object classification using convolutional neural networks [Kim 2018, Pérez 2018, Patel 2019,
Khalid 2019, Cai 2019, Akita 2019, Lee 2019, Palffy 2020, Gao 2019b, Gao 2019a,
Patel 2022, Saini 2023, Angelov 2018].
Classification with RA, RD or RAD tensors Pérez et al. [Pérez 2018] use a tiny two-layer CNN to classify moving targets such as pedestrians, cyclists or cars based on their RoI in the range-angle-Doppler power spectrum. They show that such a model can achieve 97.3% accuracy in classifying objects in single-target scenarios. Kim et al. [Kim 2018], Khalid et al. [Khalid 2019] and Akita et al. [Akita 2019] propose to learn the temporal dynamics of moving objects using recurrent and convolutional neural networks. [Kim 2018] classifies sequences of range-Doppler spectra with single moving objects, while [Khalid 2019] and [Akita 2019] extract RoIs from the RD and RA spectrum respectively. The authors of [Akita 2019] show that using raw data benefits object classification. They compare the performance of their classifier on raw data (raw reflection intensity RoI) versus radar features such as the maximum intensity of the reflection or a set of features (average reflection intensity, maximum reflection intensity, roundness). Patel et al.
[Patel 2022] notice that deep radar classifiers maintain high confidence for ambigu-
ous, complex samples under domain shift and signal corruptions. Indeed, according
to the radar equation (Equation 1.1), the same target at different ranges produces
Figure 2.18: RTC-Net model overview. RTC-Net extracts RoIs from the radar
cube using a list of detections. A combination of CNNs is used to extract features
for each extracted RoI. Ensemble classifiers use the features to perform the target
classification. Source: [Palffy 2020]
elevation and azimuth prediction. However, their data comes from an anechoic
chamber with some corner reflectors inside, which is unrealistic. Franceschi and
Rachkov [Franceschi 2022] extend this work to simulated radar data. They use
the same network as [Brodeski 2019] and show a higher accuracy, recall and Dice score than conventional methods. However, this work highlights the difficulty deep neural networks have in estimating azimuth and elevation in complex scenarios, despite good generalisation results on real data for object detection. Moreover, the simulated data is unrealistic and looks closer to LiDAR data than to radar data.
Similarly, Fang et al. [Fang 2022] introduce ERASE-Net, a two-stage detect-then-segment detector. From a RAD tensor, they first detect object centres in RAD space, then extract and separate regions of interest from the background to form sparse point clouds. Lastly, they segment objects in the RA and RD views using a sparse segmentation network for efficiency. In [Zhang 2021], Zhang et al. adapt the well-known YOLO architecture [Redmon 2016] for 3D object detection on
the RAD cube. They propose a backbone named RadarResNet that learns to
extract velocity information in the channel dimension without 3D convolutions.
Their model predicts object position in polar and cartesian coordinates, the latter
providing the best detection result. However, the Doppler information is encoded
as an extra channel. In computer vision, increasing the number of channels as we
go deeper into the network is a good practice. Encoding the velocity in such a way
might lead to a wrong estimation of the object’s velocity.
To avoid this, multi-view models have been proposed [Major 2019, Ouaknine 2021a, Gao 2021]. They use one encoder per view to extract information separately before merging it into a single latent space.
Figure 2.19: MVRSS framework. At a given instant, radar signals take the form
of a range-angle-Doppler (RAD) tensor. Sequences of q + 1 2D views of this
data cube are formed and mapped to a common latent space by the proposed
multi-view architectures. Two heads with distinct decoders produce a semantic
segmentation of the range-angle (RA) and range-Doppler (RD) views respectively.
Source: [Ouaknine 2021a]
Figure 2.20: RADDet model. Features are extracted from the RAD cube with a custom ResNet model adapted to radar. Two YOLO heads are used to detect objects in the RAD and in the RA cartesian views. Source: [Zhang 2021]
For example, RAMP-CNN [Gao 2021] predicts
barycentres in the RA domain using multiple views (RA, RD, AD) as input. The
model comprises three different 3D convolutional autoencoders learning across mul-
tiple timesteps and domains. However, RAMP-CNN is huge (around 104 million
parameters) and cannot be considered for real-time applications. Ouaknine et al.
[Ouaknine 2021a] also introduce multi-view radar semantic segmentation (MVRSS)
architectures to detect and classify objects in range-azimuth and range-Doppler do-
mains (TMVA-Net and MVA-Net). As in RAMP-CNN, they use one encoder per view and concatenate features from each view in a latent space. To handle the variability of radar objects' signatures, an Atrous Spatial Pyramid Pooling module [Chen 2018a] is used. The latent space then feeds two decoders in charge of segmenting objects in the RD and RA views. The models learn from past frames using 3D convolutions but only predict the positions of objects for the last timestep, making them more efficient than RAMP-CNN. Finally, Major et al. [Major 2019] perform bird-eye-view object detection in the RA domain using a multi-view model. Instead of using 3D convolutions to learn temporal information, they propose to add an LSTM cell on top of the detection head. One takeaway of their work is that predicting the position in cartesian coordinates instead of polar coordinates leads to higher detection accuracy. Indeed, it accounts for the increase in distance between adjacent bins as the range increases.
Alongside the CRUW dataset [Wang 2021c], Wang et al. launch the ROD2021
challenge. The ROD2021 challenge came with the ROD2021 dataset, a subset
of CRUW. This competition motivates research on new models for object detec-
tion using the RA modality. RODNet [Wang 2021b] has paved the way for a new
radar object detection paradigm. To overcome the low resolution of radar, they
propose to detect objects as points in RA view instead of using bounding boxes
[Major 2019, Zhang 2021] or segmentation masks [Ouaknine 2021a, Kaul 2020].
That makes the detection task easier and better posed when the boxes are not well defined, but it reduces each object to a single point, which is not always an accurate representation. RODNet
[Wang 2021b] consists of an hourglass [Newell 2016] 3D encoder-decoder model that
predicts object location at multiple successive timesteps. However, as RAMP-CNN
[Gao 2021], the models proposed by Wang et al. are huge (more than 100 million
parameters). Ju et al. [Ju 2021] then introduce a lightweight module called Di-
mension Apart Module (DAM), which separately learns range, azimuth and time
information to save computations. Zheng et al. [Zheng 2021] replace the 3D con-
volutions of RODNet with (2+1)D convolutions [Tran 2018]. They use ensemble
learning to detect objects in either static or moving scenarios. Recently, Dalbah et al. [Dalbah 2023] exploit the power of the Transformer architecture [Vaswani 2017b] to solve the ROD2021 challenge. However, all these models process and predict data in batches of N frames and require a buffer of N frames to be stored in memory. Since 3D convolutions are used to learn spatio-temporal information, the learned temporal context is not reused by the network from one batch to the next. Moreover, because frames are processed and predicted in batches, the methods presented above are non-causal: the convolutional kernel is applied to both past and future frames. In real-time scenarios, one only needs to predict the position of
and future frames. In real-time scenarios, one only needs to predict the position of
objects for the last timestep, not for all the frames. In Chapter 4, we propose an
alternative by predicting only the object position for the last frame based on the
previous ones with recurrent neural networks to handle long-term dependencies.
Apart from the ROD2021 challenge, Dong et al. [Dong 2020] propose a proba-
bilistic and class-agnostic object detector. Based upon the CenterNet [Zhou 2019]
architecture, they model the uncertainty by predicting variances for bounding box orientation, size and offset. They also experiment with different types of RA inputs: polar or cartesian coordinates, with or without the MUSIC [Schmidt 1986] super-resolution algorithm. Kaul et al. [Kaul 2020] present a weakly-supervised semantic segmentation method for scanning radar data using camera and LiDAR supervision. As in many works [Wang 2021b, Major 2019, Ouaknine 2021a], they use temporal information and store it in the channel dimension. Using the same type of
data, Li et al. use a Transformer-like module and computer vision backbones to
Figure 2.21: RODNet models. The authors propose three encoder-decoder architectures to predict the positions of objects in radar snippets. The M-Net model allows merging RA maps from multiple chirps. Source: [Wang 2021b]
The past year has been marked by the release of high-resolution radar datasets [Rebut 2022, Madani 2022, Paek 2022]. The higher the resolution, the more data to process and to store. As a result, it becomes unfeasible to use RA or RAD data for object detection. The range-Doppler spectrum is one of the most efficient representations available in radar. Indeed, it contains information about the distance and the velocity, and, last but not least, it contains angle information through the antenna dimension. For efficiency reasons, and before high-resolution radars, some prior work on RD maps using low-resolution radar had been done [Fatseas 2019, Dubey 2020, Guo 2022, Fatseas 2022]. In [Fatseas 2019], the authors use the YOLO [Redmon 2016] object detector and Kalman filtering to detect and track
pedestrians and bicyclists in the range-Doppler domain. Dubey et al. [Dubey 2020]
propose to use generative adversarial networks (GAN) [Goodfellow 2014] to detect
the presence of targets in a scene. The generator is a U-Net [Ronneberger 2015]
model, taking as input a RD spectrum, and the discriminator is an autoencoder
which predicts whether the input is a detection mask or the ground truth mask. Us-
ing computer vision object detectors (YOLOX, D-DETR, SSD, RetinaNet, Faster R-CNN), Guo et al. [Guo 2022] first detect objects in the RD view using a single frame. Then, they use Kalman filtering and the Deep SORT algorithm to correct wrong detections made by the detection model based on historical information. For low-resolution radars, because of the lack of annotated datasets and the low angular resolution, the methods mentioned above were mainly used as alternatives to CFAR algorithms.
earlier, have advantages and drawbacks. Regarding the type of data, because of the
size of high-resolution radar data, the use of complex MIMO RD spectra or ADC
data seems to be the most promising and realistic research direction. Some works
[Major 2019, Dong 2020, Rebut 2022] prefer to learn or to predict the position of
objects in cartesian coordinates directly and show a slight performance improve-
ment when using this representation. Other works [Wang 2021b, Ouaknine 2021a]
directly predict objects’ position in polar coordinates and map the prediction in
cartesian coordinates afterwards without losing accuracy.
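As a concrete illustration of this polar-to-cartesian mapping, the short numpy sketch below converts range-azimuth detections to bird-eye-view coordinates; the axis convention (x lateral, y forward from the radar) is an assumption for the example, not necessarily the one used in the cited works.

import numpy as np

def polar_to_cartesian(ranges_m, azimuths_rad):
    """Map range-azimuth detections to bird-eye-view cartesian coordinates."""
    x = ranges_m * np.sin(azimuths_rad)  # lateral offset (assumed convention)
    y = ranges_m * np.cos(azimuths_rad)  # forward distance (assumed convention)
    return np.stack([x, y], axis=-1)

# e.g. a target at 20 m and -10 degrees lands at roughly (-3.5 m, 19.7 m)
print(polar_to_cartesian(np.array([20.0]), np.deg2rad(np.array([-10.0]))))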
Figure 2.22: Translation in range and angle data augmentation. Source: [Gao 2021]
Contents
3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.3 Experiments and results . . . . . . . . . . . . . . . . . . . 78
3.3.1 Datasets and competing methods . . . . . . . . . . . . . . 78
3.3.2 Training setting and evaluation metrics . . . . . . . . . . 79
3.3.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
3.3.4 Ablations studies . . . . . . . . . . . . . . . . . . . . . . . 81
3.4 Comparison with conventional object detectors . . . . . 84
3.4.1 Binary object detection . . . . . . . . . . . . . . . . . . . 84
3.4.2 Multi-class object detection . . . . . . . . . . . . . . . . . 85
3.4.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
3.5 Conclusion and perspectives . . . . . . . . . . . . . . . . . 88
3.1 Motivation
This chapter is dedicated to multiple road-user detection on range-Doppler spectra. Based upon the Faster R-CNN architecture [Ren 2017], we propose a new object detection and classification model to resolve road-user targets in distance and velocity. This chapter introduces a lightweight backbone for Faster R-CNN adapted to RD data. We design our model to handle the complexity of RD maps and the small size of radar objects while keeping the processing pipeline as efficient as possible.
Chapter 2 indicates that radar point clouds are sparse and that the filtering techniques applied to the radar signal to obtain them reduce the information available for target classification. Hence, the reflection list might hamper classification performance. Radar data can also be represented as raw data tensors (RD, RA or
RAD maps). Contrary to radar point clouds, such tensors benefit radar object
detection because they represent the unfiltered signal. Prior works to this con-
tribution show that deep learning models, and particularly CNNs, enable accu-
rate object classification [Akita 2019], segmentation [Ouaknine 2021a] or detection
[Meyer 2021, Wang 2021b, Zhang 2021] on raw data tensors. However, most of
these works exploit the RA or the RAD views. Instead, we propose to build a
model for object detection on RD maps in this chapter.
The use of RD maps instead of RAD tensors is motivated by the fact that RAD
tensors are more computationally demanding to produce for radar microcontroller
units (MCUs) and heavy in memory. Also, the RA view might not be an adequate
representation for object detection since it does not account for Doppler, which is
crucial information, as we will see in this chapter. Besides, the RA map usually suffers from poor angular resolution caused by the small number of antennas in FMCW radars.
In this chapter, we hypothesise that the RD spectrum contains enough information
for detection and classification tasks in automotive radar. Angular information
can be computed for each target afterwards in a post-processing step, either using
standard techniques or with AI as done by Brodeski et al. in [Brodeski 2019].
In the computer vision literature, one can detect an object by drawing bound-
ing boxes around it (object detection) or attributing a class to every pixel in the
image (image segmentation). Today, there has yet to be a consensus in the radar
community about which task to use for radar object detection. Many automo-
tive radar datasets (CARRADA [Ouaknine 2021b], RADDet [Zhang 2021], CRUW
[Wang 2021c]) are annotated semi-automatically because radar data is difficult to
annotate. Usually, an object detection model (Mask R-CNN [He 2017]) first detects objects in the camera image. Then, the detections from the radar (the target list) and from the object detection model are merged to keep only objects of interest. Finally, valid points are projected onto the radar view. Bounding boxes or segmentation masks are then created from those points. However, this process can lead to missed targets if the object detection model misses objects. Also, the points projected onto the radar might not truly represent the targets because of the filtering operations in the radar signal processing chain.
This is why this work focuses on learning to represent targets as boxes instead of
segmenting the RD map. According to the radar equation (Equation 1.1), the power received by the radar, and thus the signature in the RD spectrum, decreases proportionally to the fourth power of the distance. The same car will therefore have a different signature at five metres than at 40 metres. While an image segmentation model learns regular shapes and pixel values, an object detection approach might be more robust to shape and intensity variations and less prone to overfitting. Indeed, the RD spectrum contains mostly noise, creating an imbalance in the dataset. In contrast, object detection operates on a higher level by identifying and localising specific targets, allowing it
3.2 Methodology
This section presents a lightweight Faster R-CNN architecture for object detection
on Range-Doppler spectra. Given an RD map as input, we use a convolutional
neural network to learn relevant features, as in Faster R-CNN. Following the feature
extraction, we use a region proposal network (RPN) to propose spectrum regions
containing potential targets. A small network is slid over the learned convolutional
feature map to generate region proposals. For each point in the feature maps, the
RPN learns whether an object is present in the input image at its corresponding location and estimates its size. A set of anchors is placed on the input image at each location of the output feature maps. These anchors indicate possible objects of various sizes and aspect ratios at this location. We refer to Section 2.2.2 for more detailed information about the RPN and anchors.
1 The code of this work was made publicly available at: https://github.com/colindecourt/darod/
Figure 3.1: Road-user signatures in range-Doppler view. We show two RD maps from the RADDet dataset, along with the bounding boxes around objects and zoomed views of them.
Next, the bounding box proposals from the RPN are used to pool features from the backbone feature maps. This study uses a pooling size of 4×4. These features are used to classify each proposal as background or object and to predict a bounding box using two sibling fully connected layers. This second part is named Fast R-CNN [Girshick 2015]. We depict this pipeline in Figure 3.2b.
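The following numpy sketch illustrates how such an anchor grid can be generated; the function name, anchor sizes and aspect ratios are illustrative defaults, not the anchor configuration actually used in DAROD.

import numpy as np

def generate_anchors(feat_h, feat_w, stride, sizes=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Place anchors of several sizes/aspect ratios at every feature-map location.

    Returns (feat_h * feat_w * len(sizes) * len(ratios), 4) boxes as (x1, y1, x2, y2),
    expressed in input-map coordinates. Ratios are width/height.
    """
    base = []
    for s in sizes:
        for r in ratios:
            w, h = s * np.sqrt(r), s / np.sqrt(r)
            base.append([-w / 2, -h / 2, w / 2, h / 2])
    base = np.array(base)                                           # (A, 4) anchors centred on the origin

    # Centre of every feature-map cell, mapped back to the input map.
    xs = (np.arange(feat_w) + 0.5) * stride
    ys = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(xs, ys)
    shifts = np.stack([cx, cy, cx, cy], axis=-1).reshape(-1, 1, 4)  # (H*W, 1, 4)
    return (shifts + base).reshape(-1, 4)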
We show in Figure 3.1 two RD maps with radar signatures of some objects in
the captured scene of the RADDet dataset. Even though these RD maps seem complex, their information content remains of low complexity, contrary to camera images, which are larger and more diverse in textures, orientations, geometry and lighting. Although noisier, RD maps have a fixed orientation, and their objects exhibit more similar patterns and shapes.
To account for those differences, we modify Faster R-CNN to include a lighter
backbone and a modified RPN. Our backbone is derived from the VGG architecture
[Simonyan 2015] and contains seven convolutional layers. Figure 3.2a depicts this
lightweight backbone architecture. To keep the processing pipeline as simple and
(Figure 3.2 schematic: 3×3 2D convolutions with group normalisation and Leaky ReLU activations, 2×2 and 2×1 2D max-pooling, flatten and fully connected layers; feature extractor, region proposal network, RoI pooling over the feature maps, and classification and regression heads.)
Figure 3.2: DAROD overview. (a) DAROD backbone. We propose a simple feature extractor derived from the VGG architecture, which contains seven convolutional layers. (b) Overview of the Faster R-CNN architecture. First, we extract features from an RD map. Then, the RPN makes proposals using DAROD's feature maps. For each proposal, we extract an RoI from the feature maps and classify it as an object or not.
We assign a negative label to an anchor if its IoU overlap is lower than 0.3 with all GT boxes. Anchors
that are neither positive nor negative do not contribute to the training objective.
According to [Ren 2017], we minimise the following multi-task loss to train the
RPN:
L_{rpn} = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*),     (3.1)
where i is the index of an anchor in a mini-batch and p_i is the predicted probability of anchor i being an object. p_i^* is set to 1 if the anchor is positive and 0 if the anchor is negative. t_i is a vector representing the coordinate transformation between the predicted box and an anchor, and t_i^* represents the coordinate transformation between an anchor and the GT box. The classification loss L_cls is a binary cross-entropy. The regression loss L_reg is the Huber loss defined in [Huber 1964]. N_cls is the mini-batch size, i.e. the number of proposals (positive and negative) used to train the RPN; here we set N_cls to 32 as there are few objects in our RD data. N_reg is the number of positive anchor locations.
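For reference, a minimal PyTorch sketch of this multi-task loss is given below. It assumes the anchors have already been sampled (ignored anchors excluded) and uses smooth-L1 as the Huber loss; the function name and tensor layouts are ours.

import torch
import torch.nn.functional as F

def rpn_loss(cls_logits, box_deltas, labels, target_deltas, n_cls=32):
    """Multi-task RPN loss of Eq. (3.1): objectness + box regression on positive anchors.

    cls_logits:    (N,) objectness logits for the N sampled anchors
    box_deltas:    (N, 4) predicted box transformations t_i
    labels:        (N,) 1 for positive anchors, 0 for negative ones
    target_deltas: (N, 4) ground-truth transformations t_i^*
    """
    # Classification term, normalised by the mini-batch size N_cls.
    l_cls = F.binary_cross_entropy_with_logits(cls_logits, labels.float(), reduction="sum") / n_cls

    # Regression term (Huber / smooth-L1), only on positive anchors, normalised by N_reg.
    pos = labels == 1
    n_reg = pos.sum().clamp(min=1)
    l_reg = F.smooth_l1_loss(box_deltas[pos], target_deltas[pos], reduction="sum") / n_reg
    return l_cls + l_reg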
For training the second part of the network (the detection head, L_det), we use a multi-task loss similar to that of the RPN. We train the regression head using the same loss function as the RPN and replace the binary cross-entropy with a multi-class cross-entropy. As a result, we optimise the following loss function:
Table 3.1: Results of different models on the CARRADA dataset.
DAROD (ours) 68.26 ± 0.08 79.84 48.37 58.20 ± 0.03 74.31 44.19 3.4 25.31
Faster R-CNN (pretrained) 71.08 ± 0.12 51.70 72.97 64.56 ± 0.09 47.86 67.21 41.3 37.19
Faster R-CNN 64.21 ± 0.07 45.90 74.17 52.93 ± 0.06 41.59 67.40 41.3 37.19
RADDet RD 48.59 ± 0.05 61.31 42.56 18.57 ± 0.08 36.73 25.50 7.8 74.03
RD maps as input instead of RAD tensors. We also consider the variant RADDet RAD, corresponding to the original RADDet model (trained on RAD tensors) evaluated only on the range and Doppler dimensions, using the pre-trained weights provided in [Zhang 2021]. As a second baseline, we consider the state of the art in computer vision by selecting the Torchvision Faster R-CNN implementation with the default hyper-parameters, namely a resizing of the input from 256 × 64 to 800 × 800 and a ResNet50+FPN backbone pre-trained on ImageNet. In addition, we train the Torchvision Faster R-CNN without the ImageNet pre-training to evaluate the impact of this pre-training on the results.
3.3.3 Results
Tables 3.1 and 3.2 show the performance of our model on the CARRADA and RADDet datasets. Our DAROD model outperforms the RADDet method on both datasets while remaining competitive with Faster R-CNN. When pre-trained on ImageNet, Faster R-CNN leads to the best mAP in three of the four cases, with DAROD being second best; the positions are inverted in the last experiment (RADDet dataset at IoU 0.3).
2 https://github.com/pytorch/vision
3 https://www.tensorflow.org/
4 We train all the models ten times, and we show the mean results for each in Tables 3.1 and 3.2.
DAROD (ours) 65.56 ± 0.83 82.31 47.78 46.57 ± 0.7 68.23 38.74 3.4 25.31
Faster R-CNN (pretrained) 58.47 ± 0.67 52.17 56.92 49.55 ± 0.72 47.78 51.77 41.3 37.19
Faster R-CNN 49.16 ± 0.56 32.33 61.46 40.84 ± 0.61 29.37 55.29 41.3 37.19
RADDet RD 38.42 ± 1.12 78.20 29.77 22.87 ± 1.45 60.41 20.55 7.8 74.03
RADDet RAD [Zhang 2021] 38.32 68.80 26.83 17.13 46.55 16.99 8 75.2
Table 3.2: Results of different models on the RADDet dataset. We do not report the mean and standard deviation for RADDet RAD as we report the results from the paper.
Generally, we observe that DAROD achieves good precision scores but medium recall. This suggests that our model accurately classifies targets when they are detected but misses some objects present in the scene. The confusion matrix in Figure 3.3a confirms this interpretation. For each class, we notice that 20% of the time, targets are detected while there are no objects in the image (last line of the confusion matrix). The confusion matrix's last column also shows that DAROD tends to miss objects in the scene, which can be problematic for critical applications.
We attribute this behaviour to the fact that we aimed to optimise mAP, which measures the global performance of object detectors. We might be able to improve the recall by reducing the selectivity of our model during training and in the post-processing step, or by decreasing the penalty on classification errors. Finally, because of their similarities (velocity, RCS), we notice confusion between pedestrians and bicyclists: mainly, pedestrians are classified as bicyclists. On the contrary, cars are either
correctly classified or missed. Examples in Appendix A show some failure cases of
DAROD (missed targets and confusion between similar classes).
We draw the same conclusion for the RADDet model, which obtains decent
precision scores but low recall, impacting mAP@0.3 and mAP@0.5. The confusion
matrix in Figure 3.3b shows many pedestrians and bicyclists false positives and
mostly missed cars and pedestrians. Under-represented classes (bicyclists, motor-
cycles, buses) are rarely missed. As for the CARRADA dataset, we notice confusion between similar classes (bus and truck here), which raises the question of whether such classes need to be labelled separately.
The original version of the Faster R-CNN model achieves sufficient precision
scores and good recall, resulting in more false positives but fewer missed targets,
which may be better for critical applications. In this implementation, because the input spectrum is upsampled, targets are bigger and therefore match more anchors than in our implementation. The number of positive labels used to train the RPN and the Fast R-CNN part is also higher. This is why the recall of Faster R-CNN is better than ours. However, upsampling the input might change the radar signature, which can affect the classification results.
Figure 3.3: (a) DAROD confusion matrix (CARRADA dataset). (b) DAROD confusion matrix (RADDet dataset).
Also, because of its size, Faster R-CNN is more subject to overfitting than DAROD. Finally, pre-training the Faster R-CNN backbone on the ImageNet dataset helps improve the detection performance. It drastically improves the precision score but does not impact the recall score. This is interesting because the features in ImageNet are highly different from radar data. This suggests the network uses the shapes and patterns learned on ImageNet to find objects in the spectrum. We discuss further pre-training strategies in Section 3.5.
A critical point in automotive radar is the computational load of the different models. We compute the number of floating-point operations (FLOPS) of the different models and plot it as a function of performance in Figure 3.4. Not surprisingly, radar-based approaches are far more efficient than Faster R-CNN, which uses upsampling and a deeper backbone. The RADDet model has the lowest number of FLOPS, as it is inspired by the single-stage detector YOLO [Redmon 2016]. The number of FLOPS required by DAROD is slightly higher than RADDet's, but remains reasonable for running on microcontrollers.
We add the range and/or the velocity of each detected target to the feature vector used for classification. We try to add this information in different ways:
Figure 3.4: Number of FLOPS vs. mAP@0.5 for DAROD, Faster R-CNN (pre-
trained or trained from scratch), RADDet and RADDet RAD.
1. We extract the range and the velocity values from the centre of the RoI.
2. We extract the top-k maximum intensities from the RoI and extract the top-k
range and velocity values from it, with k = 3.
3. We compute range and velocity grids, then use these grids as additional input channels.
Figure 3.5 summarises the different methods.
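A minimal PyTorch sketch of the second method (top-k maximum intensities, which works best in our experiments) is shown below; the RoI format, axis layout and helper name are hypothetical and only serve to illustrate the idea.

import torch

def topk_range_doppler_features(rd_map, range_axis, doppler_axis, roi, k=3):
    """Extract the range/velocity values of the k highest-intensity cells inside an RoI.

    rd_map:      (R, D) range-Doppler spectrum (intensities)
    range_axis:  (R,) range value of each row; doppler_axis: (D,) velocity of each column
    roi:         (r1, r2, d1, d2) RoI indices in the RD map (hypothetical format)
    """
    r1, r2, d1, d2 = roi
    patch = rd_map[r1:r2, d1:d2]
    flat_idx = torch.topk(patch.flatten(), k).indices                    # top-k intensity cells
    rows = torch.div(flat_idx, patch.shape[1], rounding_mode="floor")
    cols = flat_idx % patch.shape[1]
    ranges = range_axis[r1 + rows]                                       # range of the strongest cells
    velocities = doppler_axis[d1 + cols]                                 # velocity of the strongest cells
    return torch.cat([ranges, velocities])                               # appended to the RoI feature vector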
We report in Table 3.3 the detection results on the CARRADA dataset when adding the following features: the range, the velocity, and both the range and the velocity values. Overall, adding features about the position and the velocity of targets boosts the mAP. We notice that extracting the top-3 maximum intensities from the RoI leads to the best results. Indeed, the successive downsampling stages and the RoI pooling operation result in an approximate location of the target. Selecting the centre of the RoI as a reference might therefore extract features from the background (i.e. noise) instead of from the target.
On the contrary, choosing the top-k maximum intensities helps correct localisation errors by putting more weight on high-intensity values (i.e. foreground). Finally, using the range and Doppler grids as additional input channels does not improve the results. We suspect the distance and Doppler information are lost and do not flow through the successive layers. However, we see a slight improvement when using a Doppler grid, which suggests the targets' velocity is helpful.
We study the impact of the size of the feature maps on the mAP. We do not consider
feature maps bigger than 32 × 32 to save computations. We experiment with square
feature maps (16 × 16 and 8 × 8) and rectangular feature maps that keep the aspect ratio of the input (32 × 8 and 16 × 8). Table 3.4 shows that 32 × 32 feature maps provide the best results. The smaller the feature maps, the less accurate the model, and the advantage of larger feature maps grows as the IoU threshold increases. We experimented with smaller pooling sizes in Faster R-CNN for the smaller feature maps and did not notice any improvement. We use the CARRADA dataset for these experiments.
Table 3.3: mAP@0.5 for DAROD when adding additional features (range, Doppler
or range and Doppler) to the feature vector of detected targets. The last line reports
the mAP@0.5 for a model without additional features.
Table 3.4: DAROD mAP for different feature map sizes. We report the mAP at different thresholds to show how the size of the feature maps affects localisation accuracy. Experiments are on the CARRADA dataset.
To compare DAROD with conventional object detectors, we project the DoA points onto the RD spectrum, as our model detects objects in the RD view. Unfortunately, we cannot compute the DoA for the CARRADA dataset because we do not have the ADC data. As our model outputs boxes, we cluster the radar detections using DBSCAN and draw a bounding box around each cluster following the same method as in [Ouaknine 2021b]. Figure 3.6 summarises the process, and Table 3.5 shows the mAP at different IoU thresholds.
Table 3.5 shows that our approach outperforms conventional radar object de-
tectors. We consider another baseline with an AI-based classification stage (binary
and multi-class). We describe this approach in detail in Section 3.4.2. We note that
using a classifier which learns to remove background objects increases the overall
detection performance. The results highlight the relevance of using deep-learning-based object detectors for the detection task.
Figure 3.6: CARRADA Point Clouds dataset generation. After projecting the point clouds onto the RD map (pink points), we cluster the points using the DBSCAN algorithm (green and orange points). For each cluster, we create a segmentation mask using the same procedure as [Ouaknine 2021b]. From the segmentation masks, we draw bounding boxes around each cluster to obtain boxes for binary detection (green and orange boxes). For multi-class detection, we then compute the IoU between the ground truth box (blue box) and the boxes from CFAR+DBSCAN. If the IoU is greater than a threshold, we label the box from CFAR+DBSCAN with the ground truth label (a car here) and set the label of the remaining boxes to background.
Indeed, DAROD exhibits better localisation precision, resulting in a better AP for object detection.
If the IoU between the cluster bounding box and a ground truth box is higher than
a threshold, we label the cluster with the ground truth bounding box label. The
ground truth box is removed from the label list, and we repeat the operation for
all the clusters. Non-matched clusters are then labelled as background. Figure 3.6
illustrates the process.
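The sketch below illustrates this CFAR+DBSCAN baseline construction with scikit-learn and numpy: detections are clustered, each cluster is boxed, and boxes are matched to ground truth with an IoU test. The eps/min_samples values and the box format are illustrative assumptions, not the exact parameters used in [Ouaknine 2021b].

import numpy as np
from sklearn.cluster import DBSCAN

def boxes_from_detections(points, eps=2.0, min_samples=2):
    """Cluster CFAR detections (range/Doppler bins) with DBSCAN and box each cluster."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
    boxes = []
    for lab in set(labels) - {-1}:                 # -1 is DBSCAN noise
        cluster = points[labels == lab]
        (r1, d1), (r2, d2) = cluster.min(0), cluster.max(0)
        boxes.append([r1, d1, r2, d2])
    return np.array(boxes)

def iou(a, b):
    """IoU between two boxes given as (r1, d1, r2, d2)."""
    inter_r = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_d = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_r * inter_d
    area = lambda x: (x[2] - x[0]) * (x[3] - x[1])
    return inter / max(area(a) + area(b) - inter, 1e-9)

A cluster whose IoU with a remaining ground-truth box exceeds the threshold inherits that box's label; unmatched clusters are labelled as background, as described above.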
Model Using our CARRADA Point Clouds dataset, we train the DeepReflecs model for object classification. DeepReflecs [Ulrich 2021] takes as input a list of M reflections with five features each. In the original paper, the authors use the final detection list, which contains the position of the object in cartesian coordinates, the velocity, the RCS and the range of the target. In this study, we work in the RD domain and thus do not aim to estimate the object's DoA. We slightly change the features used for classification accordingly: the range, the velocity and the RCS of the reflection, and the mean range and velocity values of the cluster.
We train the model over 500 epochs using the Adam optimiser and a learning rate of 1 × 10−4. The batch size is set to 512. Figure 3.7 shows the DeepReflecs model.
Figure 3.8: Train and validation metrics (loss and accuracy) for the DeepReflecs model on the CARRADA Point Clouds dataset. (a) DeepReflecs train and validation loss. (b) DeepReflecs train and validation accuracy.
Figure 3.9: DeepReflecs confusion matrix on the CARRADA Point Clouds dataset
Table 3.6: Comparison between deep learning object detectors and a conventional
approach with DeepReflecs [Ulrich 2021] model.
DBSCAN approaches lead to lower localisation accuracy than deep learning object detectors. Table 3.6 confirms that deep learning models (DAROD or RADDet) achieve a higher mAP than conventional approaches with point cloud classification networks. Nevertheless, such an approach reaches a better mAP@0.5 than RADDet RD [Zhang 2021]. We think the mAP of the CFAR+DBSCAN+DeepReflecs model could be improved by improving the accuracy of DeepReflecs (i.e. by reducing the confusion between cyclists and pedestrians).
3.4.3 Discussion
Although DAROD outperforms traditional methods (with and without AI) in binary and multi-class object detection tasks, the comparison could be unfair. First,
the CARRADA dataset is made to detect only pedestrians, bicyclists and cars.
However, many other targets or reflections in the scene affect the detection results
(binary task). We noticed that using a classification network helps to increase the
overall performance of the approach, but the mAP at high IoU remains low. Second,
because we formulate the detection problem as bounding boxes, we evaluate the con-
ventional approach using the mAP metric. We created the bounding boxes from the
point clouds of each cluster using the same code as the authors of the CARRADA
dataset. However, in the CARRADA dataset, the authors correct DBSCAN's clusters using the predictions of a Mask R-CNN model [He 2017] projected onto the RD spectra. As a result, the IoU between boxes from CFAR+DBSCAN and the ground truth might be low for objects corrected in the annotation process, hence decreasing the mAP at higher IoU thresholds (greater than 0.5). Figure 3.6, which shows how we build
the dataset, illustrates this issue. We think the mAP@0.3 is a good starting point
to compare different deep-learning and non-deep-learning methods. Although less
accurate than our approach, the CFAR+DBSCAN+DeepReflecs approach achieves
decent results at this threshold. For multi-class object detection, this method has a low mAP@0.3 compared to DAROD and RADDet [Zhang 2021]. Decreasing the class imbalance in the dataset and performing a grid search over hyper-parameters could improve the accuracy of the classifier and the overall mAP. We did not conduct such experiments as this is not the purpose of this thesis.
done, using ten times fewer parameters. Also, we show that our model outperforms
the radar-based model RADDet [Zhang 2021] on the CARRADA and the RADDet
dataset.
We designed our backbone to deal specifically with range-Doppler data. We experimented with different pooling and feature map sizes, showing that the distance and, above all, the velocity information are crucial to improving detection and classification results. We introduced three methods to add the distance and velocity information into the Faster R-CNN framework and found that sampling the top-k highest intensity values from the detected RoI provides the best results.
Finally, we compared our model with conventional radar object detectors (i.e.
CFAR + DBSCAN) with and without object classification. Experiments show
that deep-learning object detectors improve localisation precision and yield fewer
false positives. This comparison confirms the promise of deep learning applied to
automotive radar.
Comparison with the RADDet model In Section 3.3.3, we show that our model achieves a much better mAP than the RADDet model on the CARRADA and RADDet datasets. We want to emphasise that the RADDet model was specifically
designed to process RAD tensors instead of RD spectra. Therefore, it might be
inefficient on RD spectra, and the results might suffer from this difference. For this
reason, we evaluated the RADDet model on the range and the Doppler dimensions
only, using the pre-trained model provided by the authors. Results are given in Ta-
ble 3.2 as the RADDet RAD model. Results show that an RD-only model achieves
better mAP than the RAD model. Because RADDet uses the Doppler information
in the channel dimension, we conclude that the model is better at detecting objects
in the RA or cartesian space than in the RD view.
Pre-training the backbone Although camera images are very different from RD maps, pre-training the weights of Faster R-CNN leads to an improvement of 7 to 9 mAP points, outperforming DAROD in three of the four cases. Pre-training the backbone of DAROD might also lead to a significant increase in performance. However, this is not trivial since it requires a well-suited dataset in terms of shape and complexity. Experiments were conducted to pre-train the DAROD backbone on the ImageNet [Russakovsky 2015] dataset, but we found DAROD to be undersized to be pre-trained adequately on ImageNet. A good pre-training of object detectors' backbones could help reduce the annotation effort for radar data and the number of labels required to train models. Exploring self-supervised pre-training methods such as SimCLR [Chen 2020a], MoCo [He 2020], BYOL [Grill 2020] or DINO [Caron 2021] while exploiting the specificity of radar data could help reduce the number of labels required for radar object detectors. We begin investigating the pre-training of radar networks in Chapter 5.
be very different in the RD view (this remains true in the RA and RAD views). Exploiting temporal information (e.g. multiple frames) has been shown to help better capture the dynamics of objects and, therefore, the intra-class variation of objects [Ouaknine 2021a, Major 2019, Wang 2021b]. The next chapter introduces a new approach to learning temporal features for radar. It shows the relevance of such an approach for improving radar object detectors' detection and classification accuracy.
Chapter 4
Contents
4.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.2.1 Sequential object detection in computer vision . . . . . . 95
4.2.2 Sequential object detection in radar . . . . . . . . . . . . 96
4.3 Problem formulation . . . . . . . . . . . . . . . . . . . . . 97
4.4 Model description . . . . . . . . . . . . . . . . . . . . . . . 98
4.4.1 Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.4.2 Decoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.4.3 Multi-view spatio-temporal object detector . . . . . . . . 100
4.4.4 Training procedure . . . . . . . . . . . . . . . . . . . . . . 101
4.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.5.1 Single-view object detection . . . . . . . . . . . . . . . . . 102
4.5.2 Multi-view semantic segmentation . . . . . . . . . . . . . 108
4.6 Conclusion and perspectives . . . . . . . . . . . . . . . . . 111
4.1 Motivation
For automotive applications, time is key information which can be exploited to
learn temporal patterns between successive frames in videos for example. The
models tested in Chapter 3 worked independently on the different frames. While
they showed decent detection and classification results, their ability to distinguish
between close classes (e.g. pedestrians and bikes) was limited (see Figures 3.3a,
3.3b and 3.9). Indeed, radar object signatures vary a lot for a same object located
at different distance, angle of arrival and with a different speed, resulting in a lot
of variance in the class distribution. According to the radar equation (Equation
1.1), the signature of an object vary with the distance between the object and the
Figure 4.1: Fluctuation of the radar signature of a pedestrian over time. The power reflected by the pedestrian decreases with the distance, according to the radar equation.
This variation, also referred to as the dynamics of the object, is characteristic. Therefore, using time in radar makes it possible to learn the dynamics of the objects contained in the radar signal, handle the variation of the object's shape over time, and reduce the noise between successive frames (induced by the movement of the surrounding objects and of the vehicle itself). We show in Figure 4.1 the fluctuation over time of the signature of a pedestrian moving away from the radar. In this example, we can see that the farther the pedestrian, the lower the reflected power. The figure also illustrates the micro-Doppler effects that appear when the pedestrian is close to the radar (due to the higher velocity of the arms compared to the body of the pedestrian).
Recent efforts have been made to exploit temporal relationships between raw radar frames using multiple frames for detection or segmentation tasks. The most common approach, as in [Ouaknine 2021a, Wang 2021b, Ju 2021], is to use temporal convolutions (3D convolutions). Conversely, in [Major 2019], Major et al. use a ConvLSTM to detect cars in the RA view, and [Li 2022] processes sequences of two successive radar frames to learn the temporal relationship between objects. Most temporal radar object detectors use temporal convolutions to learn spatial and temporal information. However, these methods are often non-causal, struggle to capture long-term dependencies and are unsuitable for real-time applications. Indeed, temporal convolutions require large kernels, and therefore more parameters and computation, to capture long-term dependencies. Moreover, because the convolutional kernel is applied over past and future frames, some models based on 3D convolutions are not causal [Ju 2021, Wang 2021b].
This chapter presents a new convolutional and recurrent neural network (CRNN)
for radar spectra. Unlike most multi-frame radar object detectors, our model is
causal, which means we only use past frames to detect objects. This characteristic
is crucial for real-time ADAS applications because such systems do not have access
to future frames. To learn spatial and temporal dependencies, we introduce a
use single scale feature maps to learn temporal relationships. In [Ventura 2019],
authors present a recurrent model for one-shot and zero-shot video object instance
segmentation. Contrary to previous methods and ours, they use a fully recurrent
decoder composed of upsampling ConvLSTM layers to predict instance segmenta-
tion masks. Another approach, proposed by Sainte Fare Garnot and Landrieu in [Fare Garnot 2021], consists in using temporal self-attention to extract multi-scale spatio-temporal features for panoptic segmentation of satellite image time series. Major et al. [Major 2019] follow [Jones 2018] by using a ConvLSTM over the features of a multi-view convolutional encoder. Even though this model is similar to ours, the LSTM cell is applied only to the learned cartesian output before the detection head, and the proposed model only detects cars. Additionally, this model is not end-to-end trainable and requires pre-training a non-recurrent version of it beforehand.
Because our encoder recurrently encodes the past N frames to predict the position of objects at time step k, our decoder is a fully convolutional decoder that takes as input the encoder's updated hidden states H_k (the memory) and the set of feature maps F_k (spatio-temporal feature maps), such that:
D(F_k, H_k) = P_k.     (4.2)
(Figure 4.2 schematic: the encoder stacks a Conv block and IR blocks with 32, 64 and 128 output channels at resolutions H×W down to H/8×W/8, with Bottleneck LSTMs inserted at the H/2×W/2 and H/4×W/4 scales; the decoder uses transposed convolutions to upsample back to H×W, followed by a head predicting Nclasses output channels.)
Figure 4.2: Model architecture (RECORD). Our encoder mixes efficient 2D convolutions (IR blocks) and efficient ConvLSTMs (Bottleneck LSTMs) to learn spatio-temporal dependencies at multiple scales. The decoder is a 2D convolutional decoder, taking as input the last feature maps of the encoder and a set of two hidden states. It predicts either a confidence map or a segmentation mask. The rounded arrows on the Bottleneck LSTMs indicate a recurrent layer, and the plus sign indicates the concatenation operation. We report the output size (left) and the number of output channels (right) for each layer.
4.4.1 Encoder
where M and N are the numbers of input and output channels, ^{j}W^{k} ⋆ X denotes a depthwise-separable convolution with weights W, input X, j input channels and k output channels, and ◦ denotes the Hadamard product. Contrary to [Zhu 2018], we do not use the bottleneck gate b_t the authors propose. This gate aims to reduce the number of input channels of the LSTM in order to lower the computational cost of the recurrent layer. It is not useful for this work because we use few channels in our ConvLSTMs; as a result, M = N. σ and ϕ denote the sigmoid and the LeakyReLU activation functions, respectively. In this work, we use two bottleneck LSTMs; as a result, I = 2. Such a layout enhances spatial features with temporal features and vice versa.
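For clarity, a minimal ConvLSTM-style cell with depthwise-separable convolutions and LeakyReLU activations is sketched below. It loosely follows the description above, but it is not the exact Bottleneck LSTM of [Zhu 2018] nor our final layer: the gate layout and the normalisation placement are assumptions and may differ.

import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise convolution followed by a 1x1 pointwise convolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class BottleneckLSTMCell(nn.Module):
    """ConvLSTM-style cell with depthwise-separable convolutions (M = N = channels)."""
    def __init__(self, channels):
        super().__init__()
        # A single convolution produces the four gates from the concatenated [input, hidden] tensor.
        self.gates = DepthwiseSeparableConv(2 * channels, 4 * channels)
        self.norm = nn.GroupNorm(1, 3 * channels)  # layer-norm-like normalisation of i, f, o
        self.act = nn.LeakyReLU(0.1)               # phi in the description above

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(self.norm(torch.cat([i, f, o], dim=1))).chunk(3, dim=1)
        g = self.act(g)
        c = f * c + i * g        # cell state update (Hadamard products)
        h = o * self.act(c)      # new hidden state
        return h, (h, c)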
We follow the MobileNetV2 [Sandler 2018] structure by first applying a full convolution to increase the number of channels, followed by a single IR bottleneck block. Except for the first IR bottleneck block, we set the expansion rate γ to four. Next,
we apply two blocks composed of three IR bottleneck blocks followed by a bot-
tleneck LSTM to learn spatio-temporal dependencies. Because the computational
cost of bottleneck LSTMs is proportional to the input size, we use a stride of two
in the first IR bottleneck block to reduce the input dimension. Finally, we refine
the spatio-temporal feature maps obtained from the bottleneck LSTMs by adding
three additional IR bottleneck blocks.
Because we process data sequences, it is desirable to compute normalisation statis-
tics across all features and all elements of each instance independently, rather than
over a batch of data (a batch can be composed of sequences from different scenes).
As a result, we add layer normalisation before the sigmoid activation on the gates
$o_t$, $i_t$ and $f_t$ in the bottleneck LSTM, and we adopt layer normalisation for all the
layers in the model.
4.4.2 Decoder
As described in Section 4.3, our decoder is a 2D convolutional decoder which takes
as input the last feature maps of the encoder (denoted $F_k$) and a set of two hidden
states $\mathcal{H}_k = \{h_k^0, h_k^1\}$. Our decoder is composed of three 2D transposed convolutions
followed by a single IR block with an expansion factor γ set to one, and a layer
normalisation layer. Each transposed convolution block upsamples the input feature
map by two. Finally, we use two 2D convolutions as a classification/segmentation
head (depending on the task) which projects the upsampled feature map onto the
desired output.
The U-Net architecture [Ronneberger 2015] has popularised skip connections
between the encoder and decoder. It allows precise localisation by combining high-
resolution and low-resolution features. We, therefore, adopt skip connections be-
tween our encoder and our decoder to improve the localisation precision. To prevent
the loss of temporal information in the decoding stage, we use the hidden states of
each bottleneck LSTM (denoted by h0k and h1k in Figure 4.2) and concatenate them
with the output of a transposed convolution operation to propagate in the decoder
the temporal relationship learned by the encoder.
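The sketch below illustrates one possible decoder stage under these design choices: a transposed convolution that upsamples by two, concatenation with the corresponding bottleneck LSTM hidden state (temporal skip connection), and a refinement convolution standing in for the IR block. Names, channel sizes and layer details are illustrative, not the exact implementation.

```python
import torch
import torch.nn as nn

class DecoderStage(nn.Module):
    """One RECORD-style decoder stage (sketch): upsample by two, concatenate
    the encoder hidden state, then refine with a simple convolution that
    stands in for the IR block with expansion factor one."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.refine = nn.Sequential(
            nn.Conv2d(out_ch + skip_ch, out_ch, kernel_size=3, padding=1),
            nn.LeakyReLU(),
        )

    def forward(self, x, hidden_state):
        x = self.up(x)                           # e.g. H/4 x W/4 -> H/2 x W/2
        x = torch.cat([x, hidden_state], dim=1)  # temporal skip connection
        return self.refine(x)

# Usage (channel sizes illustrative): f_k are the encoder's last feature maps,
# h1_k and h0_k the hidden states of the two bottleneck LSTMs (coarse to fine).
# x = DecoderStage(128, 64, 64)(f_k, h1_k)
# x = DecoderStage(64, 32, 32)(x, h0_k)
```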
In Section 4.4.2, we use the hidden states of each bottleneck LSTM for the skip
connections to add temporal information to the decoding part. For the multi-view
approach, we want to take advantage of both the multi-view and the spatio-temporal
approaches in the skip connections, to supplement the decoders with data from other
views (e.g. add velocity information to the RA view). Similarly to the multi-view
latent space, we concatenate the hidden states from the RD, RA and AD views. This
concatenation results in a set of concatenated hidden states $\mathcal{H}_k^{skip} = \{h_k^{0,skip}, h_k^{1,skip}\}$.
We describe the operation to obtain $\mathcal{H}_k^{skip}$ in Figure 4.3b. We concatenate
$\mathcal{H}_k^{skip}$ with the decoder feature maps in the same way as in the single-view
approach. We call this operation Temporal Multi-View Skip Connections (TMVSC).
Figure 4.3 illustrates the multi-view architecture we propose.
With the buffer (many-to-one) training procedure, the loss is computed only on the
last prediction of the input window:
\[
\mathcal{L}(\hat{p}, p) = \mathcal{L}(\hat{p}_k, p_k), \tag{4.4}
\]
where k is the last time step of the sequence. Buffer training forces the model to
focus on a specific time window and to learn a global representation of the scene.
However, at inference, the model must process N frames sequentially to make a
prediction. Therefore, we propose to train the model differently, using a many-to-
many paradigm, to improve the model's efficiency at inference:
\[
\mathcal{L}(\hat{p}, p) = \sum_{k=1}^{N} \mathcal{L}(\hat{p}_k, p_k). \tag{4.5}
\]
Figure 4.4: Training procedures with N = 3. (a) Buffer training procedure (many-
to-one). (b) Online training procedure (many-to-many).
Online training pushes the model to use previous objects' positions to make a
new prediction. It encourages the model to keep only relevant information from
the previous frames. Online training requires training with longer sequences but
allows frames to be processed one by one (no buffer) at inference. In contrast to
the buffer approach, the hidden states are not reset at inference.
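The two procedures can be contrasted with the following pseudo-training steps. They assume a hypothetical model interface model(frame, state) -> (prediction, state); the optimiser and the exact loss are omitted for clarity.

```python
def buffer_step(model, frames, targets, criterion):
    """Many-to-one (buffer) training: process the whole window of N frames
    and supervise only the prediction at the last time step (Eq. 4.4)."""
    state = None
    for x in frames:                      # frames: list of N radar frames
        pred, state = model(x, state)     # hidden states carried within the window
    return criterion(pred, targets[-1])

def online_step(model, frames, targets, criterion):
    """Many-to-many (online) training: supervise every time step (Eq. 4.5).
    At inference, frames are then processed one by one and the state is never reset."""
    state, loss = None, 0.0
    for x, y in zip(frames, targets):
        pred, state = model(x, state)
        loss = loss + criterion(pred, y)
    return loss
```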
4.5 Experiments
4.5.1 Single-view object detection
Dataset. We prototype and train our model, RECORD, on the ROD2021 challenge
dataset (https://www.cruwdataset.org/rod2021), a subset of the CRUW dataset
[Wang 2021c]. Due to its high frame rate (30 fps), this dataset is well suited to
evaluating temporal models. Also, this dataset uses a different detection paradigm
than the CARRADA dataset: in the ROD2021 dataset, objects are represented as
points (like the output of a classical radar detector). We think this representation
is better adapted to radar than bounding boxes or segmentation masks. Therefore,
we prototype our model to deal with such outputs. The ROD2021 dataset contains
50 sequences (40 for training and 10 for testing) of synchronised camera and raw
radar frames. Each sequence contains around 800-1700 frames in four different
driving scenarios (parking lot (PL), campus road (CR), city street (CS), and
highway (HW)).
The provided data of the ROD2021 challenge dataset are pre-processed se-
quences of RA spectra (or maps). Annotations are confidence maps (ConfMaps) in
range-azimuth coordinates that represent object locations (see Figure 4.2). Accord-
ing to [Wang 2021b], one set of ConfMaps has multiple channels, each representing
one specific class label (car, pedestrian, and cyclist). The pixel value in the cls-
th channel represents the probability of an object with class cls occurring at that
range-azimuth location. We refer the reader to [Wang 2021b] for more information
about ConfMaps generation and post-processing. RA spectra and ConfMaps have
dimensions 128 × 128.
\[
\mathrm{OLS} = \exp\left(\frac{-d^2}{2(s\kappa_{cls})^2}\right), \tag{4.6}
\]
where d is the distance (in meters) between the two points in the RA spectrum,
s is the object distance from the radar sensor (representing object scale informa-
tion) and $\kappa_{cls}$ is a per-class constant that describes the error tolerance for class cls
(the average object size of the corresponding class). First, OLS is computed between
GT and detections. Then the average precision (AP) and the average recall (AR)
are calculated using different OLS thresholds ranging from 0.5 to 0.9 with a step of
0.05, representing different localisation error tolerances for the detection results. In
the rest of this section, AP and AR denote the average precision and recall over all
the thresholds.
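A minimal sketch of the OLS computation of Equation 4.6 is given below; the variable names and the conversion of RA coordinates to Cartesian coordinates are assumptions, and the full AP/AR evaluation protocol follows [Wang 2021b].

```python
import numpy as np

def object_location_similarity(det_xy, gt_xy, gt_range, kappa_cls):
    """OLS between a detection and a ground-truth point (Eq. 4.6).
    det_xy and gt_xy are Cartesian coordinates (metres) derived from the RA
    spectrum, gt_range is the object distance from the radar (scale s), and
    kappa_cls the per-class error tolerance."""
    d = np.linalg.norm(np.asarray(det_xy) - np.asarray(gt_xy))
    return np.exp(-d ** 2 / (2 * (gt_range * kappa_cls) ** 2))

# AP and AR are then obtained by thresholding OLS from 0.5 to 0.9 (step 0.05)
# and averaging over the thresholds, analogous to COCO's IoU-based AP.
thresholds = np.arange(0.5, 0.95, 0.05)
```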
Baselines. We compare our model with DANet [Ju 2021], T-RODNet [Jiang 2023]
and the attention-based model UTAE [Fare Garnot 2021], which is lighter than
image-based approaches and which we can use causally. We found that decreasing
the number of channels of UTAE and changing the positional encoding improved
performance (see Table B.1). We also consider two variants of our model without
LSTMs, one using time along the channel dimension (no lstm, multi) and one
using a single frame (no lstm, single).
Model | AP Mean | AP PL | AP CR | AP CS | AP HW | AR Mean | AR PL | AR CR | AR CS | AR HW | Params (M)
RECORD (buffer, ours) | 72.8 ± 2.2 | 95.0 | 67.7 | 48.3 | 77.4 | 82.8 ± 1.5 | 96.7 | 73.9 | 72.8 | 81.7 | 0.69
RECORD (online, ours) | 73.5 ± 3.5 | 96.4 | 72.5 | 49.9 | 72.5 | 81.2 ± 2.0 | 96.4 | 78.1 | 68.8 | 77.6 | 0.69
RECORD (no lstm, multi) | 65.5 ± 4.6 | 89.9 | 57.3 | 43.1 | 68.9 | 78.9 ± 1.4 | 93.1 | 68.2 | 71.5 | 75.7 | 0.47
RECORD (no lstm, single) | 59.5 ± 2.9 | 85.7 | 48.5 | 39.11 | 64.4 | 75.1 ± 2.1 | 90.8 | 62.4 | 68.9 | 69.6 | 0.44
DANet [Ju 2021] | 71.9 ± 2.3 | 94.7 | 65.7 | 51.9 | 70.0 | 80.7 ± 2.3 | 96.2 | 75.1 | 72.8 | 73.0 | 0.74
UTAE [Fare Garnot 2021] | 68.4 ± 4.6 | 92.1 | 67.4 | 51.4 | 65.5 | 78.4 ± 2.2 | 94.6 | 74.0 | 69.7 | 70.0 | 0.79
T-RODNet [Jiang 2023] | 69.9 ± 3.4 | 95.6 | 72.5 | 48.2 | 63.7 | 79.5 ± 1.9 | 97.2 | 79.1 | 70.2 | 67.2 | 159.7
Table 4.1: Results obtained on the test set of the ROD2021 challenge for different
driving scenarios (PL: Parking Lot, CR: Campus Road, CS: City Street and HW:
Highway). Overall, our recurrent models outperform baselines. The model that
does not use time gets the worst performance. We report the best results over five
different seeds with standard deviation. The best results are in bold, and the second
best are underlined.
Figure 4.5: Runtime vs. GMACS on the ROD2021 dataset. RECORD (online) is
one of the most efficient models among the baselines (low GMACS, low runtime).
The T-RODNet model, which uses Transformers, is the slowest and the one requiring
the most operations. Overall, most of the models have a runtime lower than 20 ms
(on GPU).
We train all the models with five different seeds and report the mean and standard
deviation of the results in the next paragraph.
Results Table 4.1 presents the results of our model and the baselines on the test
set of the ROD2021 challenge. Our recurrent approaches generally outperform the
baselines for both the AP and AR metrics; this remains true for most scenarios.
Overall, most of the targets are detected, as shown in Figure 4.6, which confirms
the high recall of the proposed models. Despite some misclassified targets, the
online version of RECORD obtains the best trade-off between performance and
computational complexity (parameters, number of multiplications and additions,
and runtime, see Figure 4.5). Despite having fewer GMACs than UTAE and DANet,
the buffer version of RECORD is the slowest among all the models: for each new
frame, it must process the 11 previous ones, which is inefficient. These results show
that the online version should be preferred for real-time applications. Additionally,
RECORD methods exceed 3D and attention-based methods on static scenarios such
as parking lot (PL) and campus road (CR). In the PL and CR scenarios, the radar is
static and the velocity of targets varies a lot; our recurrent models seem to learn
variations of the targets' speed better than other approaches. Surprisingly, the
attention-based method UTAE, initially designed for the segmentation of satellite
Figure 4.6: RECORD (online) and RECORD (buffer) qualitative results on the
ROD2021 dataset. We show two samples per scenario. From left to right: highway,
city street, campus road and parking lot. As shown in Table 4.1, the parking lot
scenario is the easiest one. More results are available in Appendix B.
images, obtains results that are very competitive with our method and the DANet
model. The T-RODNet model shows good results on static scenarios but is
unsuitable for real-time applications. As shown in Figure 4.5, it is the slowest and
the one requiring the most operations. We notice that the approach using time in
the channel dimension reaches lower AP and AR than its counterpart, which
explicitly uses time as a new dimension. Finally, training our model without
temporal information and using only a 2D backbone (no lstm, single) obtains the
lowest performance on the test set. Qualitative results in Appendix B confirm the
conclusions we draw here.
Table 4.2: Comparison of different types of ConvRNN. We train all the models with
the same loss and hyperparameters. Bottleneck LSTM achieves the best AP while
having fewer parameters and GMACS.
Table 4.3: Comparison of different types of skip connections. Results are averaged
over 5 different seeds on the ROD2021 test set. Concatenation is the RECORD
model, addition stands for a model where we add the output of the transposed
convolutions to $h_k^i$, and no skip connection stands for a model without skip connections.
Data augmentation study Table 4.4 shows the impact of different types of
data augmentation and their combination on the performance. The experiments
were conducted on the validation set. Among all the data augmentations available,
horizontal flipping and Gaussian noise appear to be the most useful ones. Tem-
poral flipping reduces the overall performance when used alone or with Gaussian
noise. However, when combined with horizontal flipping or Gaussian noise, this
Table 4.4: Impact of different types of data augmentation and their combination
on the performance. Experiments were done on the validation set using the same
seed.
\[
\mathrm{IoU} = \frac{T \cap P}{T \cup P}. \tag{4.7}
\]
We then average this metric over all classes to compute the mean IoU (mIoU).
We exponentially decay the learning rate every 20 epochs with a factor of 0.9.
We use a combination of a weighted cross-entropy loss and a dice loss with the
recommended parameters described in [Ouaknine 2021a] to train our model, as we
find it provides the best results. To avoid overfitting, we apply horizontal and
vertical flipping data augmentation. We also use an early stopping strategy to stop
training if the model's performance does not improve for 15 epochs. Training multi-
view models is computationally expensive (around six days for TMVA-Net and five
days for ours). As a result, we train models using the same seed as the baseline for
a fair comparison. We use the pre-trained weights of TMVA-Net and MV-Net to
evaluate the baselines.
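For illustration, a possible implementation of the combined weighted cross-entropy and soft Dice loss is sketched below. The class weights and the mixing coefficient are placeholders; the values actually used follow [Ouaknine 2021a].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CombinedSegLoss(nn.Module):
    """Weighted cross-entropy + soft Dice loss (sketch). class_weights and
    lambda_dice are placeholders for the values recommended in [Ouaknine 2021a]."""
    def __init__(self, class_weights, lambda_dice=1.0, eps=1e-6):
        super().__init__()
        self.ce = nn.CrossEntropyLoss(weight=class_weights)
        self.lambda_dice = lambda_dice
        self.eps = eps

    def forward(self, logits, target):
        # logits: (B, C, H, W); target: (B, H, W) with class indices.
        ce = self.ce(logits, target)
        probs = F.softmax(logits, dim=1)
        one_hot = F.one_hot(target, logits.shape[1]).permute(0, 3, 1, 2).float()
        inter = (probs * one_hot).sum(dim=(2, 3))
        union = probs.sum(dim=(2, 3)) + one_hot.sum(dim=(2, 3))
        dice = 1 - ((2 * inter + self.eps) / (union + self.eps)).mean()
        return ce + self.lambda_dice * dice
```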
Results Table 4.5 shows the results we obtain on the CARRADA dataset. Our
multi-view approaches beat the state-of-the-art model TMVA-Net on the multi-
view radar semantic segmentation task while using two times fewer parameters
and requiring significantly fewer GMACS (see Figure 4.7). Moreover, as shown in
Figures 4.8 and 4.9, both approaches succeed in detecting a car which was not
annotated. Our approach seems to correctly learn the variety of objects' shapes
without complex operations such as the atrous spatial pyramid pooling (ASPP) used
in TMVA-Net. We notice that using recurrent units instead of 3D convolutions in a
multi-view approach significantly helps to improve the classification of bicyclists and
cars, especially on the RA view, where we double the IoU for bicyclists compared
to TMVA-Net. However, bicyclists and pedestrians are very similar classes, and
Table 4.5: Semantic segmentation results (IoU, %) on the CARRADA test set for the RA and RD views.
View | Model | mIoU | IoU Bg | IoU Ped | IoU Cycl | IoU Car | Params (M)
RA | MV-RECORD (buffer, ours) | 44.5 | 99.8 | 24.2 | 20.1 | 34.1 | 1.9
RA | MV-RECORD (online, ours) | 42.4 | 99.8 | 22.1 | 11.1 | 36.4 | 1.9
RA | RECORD* (buffer, ours) | 34.8 | 99.7 | 10.3 | 1.4 | 27.7 | 0.69
RA | RECORD* (online, ours) | 36.3 | 99.8 | 12.1 | 3.1 | 30.4 | 0.69
RA | TMVA-Net [Ouaknine 2021a] | 41.3 | 99.8 | 26.0 | 8.6 | 30.7 | 5.6
RA | MV-Net [Ouaknine 2021a] | 26.8 | 99.8 | 0.1 | 1.1 | 6.2 | 2.4
RD | MV-RECORD (buffer, ours) | 63.2 | 99.6 | 54.9 | 39.3 | 58.9 | 1.9
RD | MV-RECORD (online, ours) | 58.5 | 99.7 | 49.4 | 26.3 | 58.6 | 1.9
RD | RECORD* (buffer, ours) | 58.1 | 99.6 | 46.6 | 28.6 | 57.5 | 0.69
RD | RECORD* (online, ours) | 61.7 | 99.7 | 52.1 | 33.6 | 61.4 | 0.69
RD | TMVA-Net [Ouaknine 2021a] | 58.7 | 99.7 | 52.6 | 29.0 | 53.4 | 5.6
RD | MV-Net [Ouaknine 2021a] | 29.0 | 98.0 | 0.0 | 3.8 | 14.1 | 2.4
Figure 4.7: Runtime vs. GMACS on the CARRADA dataset. Multi-view methods
have higher runtimes and require more GMACS than single-view models. Our MV-
RECORD (buffer) requires few GMACS compared to TMVA-Net but needs more
than 250 ms for a single forward pass, while the online one can perform a single
forward pass in 20 ms. RAD stands for the multi-view approach.
when using a single view on both the ROD2021 (Table 4.1) and the CARRADA
datasets. Especially on the RD view, our single-view, online model outperforms
TMVA-Net without using the angle information, with eight times fewer parameters
and fewer computations. This confirms that the low frame rate of the CARRADA
dataset limits the motion information that the recurrent layers can learn. Finally,
despite having fewer GMACS and parameters than TMVA-Net, our multi-view
model (buffer) is much slower at inference than TMVA-Net and is unsuitable for
real-time applications. The online version is faster and should be preferred for real-
time applications. Decreasing the size of the feature maps in the early layers of the
network might help to increase the inference speed of the model. Also, using a
profiler, we notice that the LayerNorm operation takes up to 90% of the inference
time for the multi-view models and up to 70% of the inference time for the single-view
models. Replacing layer normalisation with batch normalisation should speed up
our approaches. Given the good results of the single-view approach (especially for
the RD view), we recommend using our model with single-view inputs, as RECORD
was originally designed for single-view object detection.
The difference with the results in the DANet paper Experiments in Section
4.5.1 show that DANet reaches 71.9 AP and 79.5 AR, which differs from the
results announced in the original paper. The code of DANet being unavailable,
we implemented it according to the authors' guidelines. Although we obtained the
same number of parameters announced for the DAM blocks, our implementation
has 740k parameters instead of the 460k announced in the paper. Beyond the
implementation, the training and evaluation procedure in our paper is different
from the one in the DANet [Ju 2021] paper. While DANet is trained on the entire
training set, we trained it on 36 carefully chosen sequences for a fair comparison
with the other models. Also, the DANet authors use the following techniques when
testing the model to improve the performance: test-time augmentation (TTA),
ensemble models and frame averaging. Because DANet predicts frames by batches
of 16 with a stride of four, the authors average the overlapping frames (12 in total)
at inference. Together, those techniques boost the performance of DANet by around
ten points, according to the ablation studies in DANet's paper, which is coherent
with the gap between our scores and the ones from the DANet paper. While applying TTA,
ensemble models and training all the models on all the sequences would certainly
also improve the global performance of all the models in Table 4.1, we preferred
comparing the architectures on a simpler but fair evaluation.
Why recurrent neural networks are suitable for radar Radar data differs
from LiDAR and images. The most critical differences are that 1) the data is simpler
in terms of variety, size, and complexity of the patterns; and 2) the datasets are
smaller. We thus believe that our lighter architectures are flexible enough, while
being less sensitive than huge backbones and less prone to overfitting on radar
data. This mainly explains why Bottleneck LSTMs perform better than ConvL-
STMs/ConvGRUs (see Table 4.2). Also, we think convolutional LSTMs are more
adapted to radar sequences because 1) ConvLSTMs learn long-term spatio-temporal
dependencies at multiple scales, which 3D convolutions cannot do because of the
limited size of the temporal kernel; 2) LSTMs can learn to weigh the contributions of
different frames, which can be seen as an adaptive frame rate depending on the
scenario and the speed of the vehicles; 3) hidden states keep the position/velocity of
objects in previous frames in memory and use it to predict their position at the next
time steps. Indeed, we show that, except for MV-RECORD, which is hard to opti-
mise, online methods generally perform better than buffer ones while having a lower
computational cost (GMACs and inference time).
Vision Transformers and radar One alternative to RNNs is the use of Vision
Transformers (ViT) [Dosovitskiy 2021]. ViTs have been proven to be a solid alter-
native to CNNs and to solve some challenges of CNNs (pixel weighting, shared
concepts across images, spatially distant concepts). In [Naseer 2021], Naseer et al.
show that Transformers are more robust to occlusions, perturbations and domain shift
and less biased towards local texture. The work of Caron et al. [Caron 2021] shows
that self-supervised ViTs can automatically segment foreground objects, which is an
interesting property for radar. Additionally, ViTs mainly rely on dense connections
and embeddings, lowering the number of FLOPS compared to CNNs but increas-
ing the number of parameters. However, training ViTs requires large-scale datasets
and sometimes self-supervised pre-training. As large-scale radar datasets become
available, it is worth considering ViTs as backbones to learn spatial, temporal
or spatio-temporal radar features. Giroux et al. [Giroux 2023] and Jiang et al.
[Jiang 2023] propose to use Transformer blocks (Swin Transformers) as a backbone.
Both works show that ViTs perform similarly to CNNs for radar object detection.
While [Giroux 2023] confirms that ViTs can lower the number of GFLOPS compared
to CNNs, the authors of [Jiang 2023] mix 3D convolutions with Transformer blocks.
As shown in Table 4.1 and Figure 4.5, such a combination is too computationally
expensive and does not help to reduce the FLOPS compared to RECORD, DANet
[Ju 2021] or UTAE
[Fare Garnot 2021]. For multi-frame object detection, one future work could be
to replace the embedding of [Giroux 2023] with the tubelet embedding proposed in
[Arnab 2021]. In this way, the Transformer could learn global information in different
parts of the spectrum at different time steps.
Contents
5.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.2 What is self-supervised learning? . . . . . . . . . . . . . . 116
5.3 A review of SSL frameworks for computer vision . . . . 117
5.3.1 Deep Metric Learning . . . . . . . . . . . . . . . . . . . . 117
5.3.2 Self-Distillation . . . . . . . . . . . . . . . . . . . . . . . . 119
5.3.3 Canonical Correlation . . . . . . . . . . . . . . . . . . . . 120
5.3.4 Masked Image Modelling . . . . . . . . . . . . . . . . . . 121
5.4 Pre-training models for object localisation . . . . . . . . 122
5.5 Limits of image-based pre-training strategies for radar . 124
5.6 Radar Instance Contrastive Learning (RICL) . . . . . . 126
5.6.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . 126
5.6.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . 130
5.7 Conclusion and perspectives . . . . . . . . . . . . . . . . . 132
5.1 Motivation
In deep learning, data annotation is a crucial ingredient for learning meaningful
representations for object detection, semantic segmentation or classification. Since
raw radar data is complex and time-consuming to annotate, the authors of radar
datasets generally generate the annotations semi-automatically [Ouaknine 2021b,
Zhang 2021, Rebut 2022, Wang 2021c]. The authors of the CARRADA, the RADDet,
and the CRUW datasets first detect objects in the camera (using a pre-trained
Mask R-CNN [He 2017]) and in the radar views (using CFAR, DBSCAN and DoA
estimation techniques). Then, they project the Mask R-CNN detections onto the
radar views and combine them to create the labels. The authors of the RADIal
dataset adopt a similar approach, but they also use the detections from a LiDAR.
trains a network to make the embeddings of two samples close to or far from each
other. Generally, because labels are unavailable, different views of the same image
are created using image transformations. These views, referred to as positive pairs,
are expected to be made similar. The samples we want to make dissimilar are called
negatives. To keep negatives far from positives, a margin m is imposed so that
images from different classes must have a distance larger than m. A variant of the
contrastive loss is the Triplet loss [Weinberger 2009, Schroff 2015], which considers
a query, a positive and a negative sample. With the Triplet loss, we aim to minimise
the distance between the embeddings of the query and the positive sample and to
maximise the distance between the query and the negative sample.
SimCLR We now present one of the most prominent DML approaches, termed
SimCLR [Chen 2020a]. The idea of SimCLR is simple. Two views of the same
image are created using a combination of image transformations (random resizing,
cropping, colour jittering) and are encoded using a CNN. After the views are en-
coded, a MLP is used to map the features from the CNN to another space where
the contrastive loss is applied to encourage the similarity between the two views. In
SimCLR, the negative samples are the other images in the batch; thereby, SimCLR
requires large batches to work. Figure 5.1 summarises the SimCLR method.
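For reference, a compact sketch of the NT-Xent loss used by SimCLR is given below, assuming z1 and z2 are the projected embeddings of the two views of a batch; the temperature value is illustrative.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent contrastive loss (sketch). Every other sample in the batch
    acts as a negative; the positive of a sample is its other view."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)          # (2N, D)
    sim = z @ z.t() / temperature                               # cosine similarities
    n = z1.shape[0]
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim.masked_fill_(mask, float('-inf'))                       # drop self-similarity
    # Positive index of sample i is i + n (first view) or i - n (second view).
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```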
Apart from SimCLR, other DML approaches exist in the literature. For ex-
ample, Sermanet et al. [Sermanet 2018] use a triplet loss in video frames where
positive pairs come from nearby frames.
Figure 5.1: SimCLR overview. Two views of the same image are created using
a combination of image transformations and are encoded using a CNN. After the
views are encoded, a MLP is used to map the features from the CNN to another
space where the contrastive loss is applied to encourage the similarity between the
two views. Source: [Chen 2020a]
5.3.2 Self-Distillation
As for the DML family, the self-distillation family relies on the following mechanism:
feeding two views of the same image to two encoders (a CNN or a ViT) and mapping
one view to the other using a predictor (a MLP). However, this approach (using two
identical encoders) can lead to dimensional collapse. Dimensional collapse is
a phenomenon that appears in SSL when the information encoded across different
dimensions of the representation is redundant [Balestriero 2023]. An extreme example
of collapse is when the two encoders consistently predict a constant value
for any input. A solution to dimensional collapse is to update the weights of one of
the two encoders with a running average of the other encoder's weights. One advantage
of self-distillation methods is that, compared to DML approaches, they do not
necessarily require negative samples. The most famous SD approaches are BYOL
[Grill 2020] and DINO [Caron 2021], which we explain below.
BYOL (Bootstrap Your Own Latent) BYOL [Grill 2020] first introduced
self-distillation as a means to avoid dimensional collapse [Balestriero 2023]. BYOL
uses two networks (the online or student network and the target or teacher network)
along with a predictor to map the outputs of one network to the other. The online and
the target networks are two identical CNNs with different weights. The student network
predicts the output, while the teacher network produces the target. As for most
SSL methods, each network receives a different view of the same image. BYOL uses
image transformations including random resizing, cropping, colour jittering, and
brightness alterations. Each view is encoded using a CNN and then projected into
a new space using a MLP. The particularity of BYOL is that the student network
uses an additional MLP (the predictor) to map the student network's outputs to
the target network's outputs. The student network is updated using SGD, and
the teacher is slowly updated using an exponential moving average (EMA) of the
student's weights. Figure 5.2 illustrates the BYOL method.
Figure 5.2: BYOL overview. BYOL uses two networks (CNNs), the teacher and
the student networks, along with a predictor (a MLP) to map the outputs of one
network to the other. The student network is updated using SGD, and the teacher
is slowly updated using an EMA of the student’s weights. Source: [Grill 2020]
Figure 5.3: DINO overview. Compared to BYOL, DINO does not use a predictor
but performs a centring of the student network output and applies a softmax func-
tion. Source: [Caron 2021]
Figure 5.4: VicReg overview. VicReg proposes to minimise the distance between
the embeddings of two views of the same image while maintaining the variance of
each embedding above a threshold and pushing the covariance between embedding
variables of a batch towards zero. Source: [Bardes 2022]
zero. Maintaining the variance above a threshold prevents collapse, minimising the
distance ensures views are encoded similarly, and pushing the covariance towards zero
encourages different dimensions of the representation to capture different features.
Figure 5.6: SoCo overview. SoCo randomly selects proposals from the selective
search algorithm as object priors and constructs three views of an image where the
scales and locations of the same objects are different. RoIs are extracted using the
RoIAlign operator. Views are encoded using a CNN. The model is trained using
the BYOL framework. Source: [Wei 2021]
object priors to learn localised features. Several methods propose to modify SSL
frameworks with such priors to enhance localisation in their features [Wei 2021,
Wang 2021a, Dai 2021, Yang 2021, Bar 2022, Tong 2022, Zhao 2021, Carion 2020b].
It is worth noting that improving localisation features comes at the price of a lower
accuracy when transferring features for classification. One example of an unsupervised
object prior is to modify the training loss to enforce the relationship between features
extracted from different locations within a single image [Wang 2021a, Yun 2022]. In
[Yun 2022], Yun et al. encourage adjacent patches within an image to produce
similar features by computing a contrastive loss between adjacent patches. Instead
of modifying the loss function, some works explicitly add a prior on object loca-
tion [Wei 2021, Yang 2021]. [Wei 2021] and [Yang 2021] both leverage RoIAlign
[He 2017] to pre-train the backbone and the detection head of a CNN in a self-
supervised manner to improve localisation on downstream tasks. Instance Locali-
sation [Yang 2021] pastes a randomly chosen patch cut from the foreground of one
image onto two other images and extracts features corresponding to only the pasted
foreground patch. They use a contrastive loss to force the model to produce similar
features regardless of the background and the location of the foreground patch in
the image. SoCo [Wei 2021] randomly selects proposals from the selective search al-
gorithm as object priors and constructs three views of an image where the scales and
locations of the same objects are different. They train their model using the BYOL
[Grill 2020] framework. Figure 5.6 gives an overview of SoCo. Finally, although
ViT naturally encodes the position of objects, methods for pre-training DETR
[Carion 2020a] family detectors were proposed in [Dai 2021, Bar 2022]. Most of the
methods mentioned above enhance the localisation performance of self-supervised
learners on object detection and semantic segmentation downstream tasks compared
The data augmentation problem First, SSL frameworks belonging to the self-
distillation, deep metric learning, and canonical correlation analysis families rely
on several image transformations to encourage similarities between two views of
the same image. However, as explained in Section 2.5.3, most of the image
transformations used in SSL frameworks cannot be applied to radar data.
Radar data differs from camera images in several ways: for example, it is complex-
valued, the energy decreases with range (see the radar equation 1.1), and the
resolution is non-uniform in the angular domain. Except for horizontal and vertical
flipping, we cannot transform radar data much without altering its physical meaning.
Therefore, directly applying methods such as SimCLR [Chen 2020a] or BYOL
[Grill 2020] is not possible. One alternative to encourage similarities between similar
objects in radar is to use successive frames. Assuming objects do not move too much
between two successive frames, we can consider two temporally close frames as
positives and distant frames as negatives, and then train a model using a Triplet
loss or a contrastive loss as in [Pathak 2017, Zhang 2019].
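A minimal sketch of this temporal alternative is given below, assuming a hypothetical encoder that maps a radar spectrum to an embedding vector; the margin and the frame-sampling strategy are illustrative.

```python
import torch.nn as nn
import torch.nn.functional as F

triplet = nn.TripletMarginLoss(margin=1.0)

def temporal_triplet_loss(encoder, anchor_frame, nearby_frame, distant_frame):
    """Temporal positives/negatives for radar SSL (sketch): a frame close in
    time to the anchor is the positive, a temporally distant frame the negative."""
    za = F.normalize(encoder(anchor_frame), dim=1)
    zp = F.normalize(encoder(nearby_frame), dim=1)
    zn = F.normalize(encoder(distant_frame), dim=1)
    return triplet(za, zp, zn)
```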
The localisation problem Second, raw radar data contains multiple objects at
different distances, velocities or angles. As discussed in Section 5.4, SSL frameworks
are designed on datasets with a single object centred in the image. Hence, to
efficiently pre-train object localisation models on radar, adding a prior about object
location is essential. Fortunately, obtaining such a prior in radar is straightforward
using CFAR-like object detectors. CFAR is an alternative to the selective search
algorithm used in [Bar 2022] and [Wei 2021] for pseudo-labelling the data.
Experiments in Section 5.6.2 provide encouraging results using such an approach.
is about 60%. Unlike images, radar data mainly contains noise and tiny targets.
Masking 60% of a spectrum thus gives only a small probability of masking the targets.
Hence, the network mostly learns the noise rather than the probability distribution of
the targets, as shown by experiments with the FCMAE framework [Woo 2023] on the
CARRADA dataset in Figure 5.7. Also, in the MIM framework, the patches have
a 32 × 32 size, which is larger than the size of an object in radar (8 × 8 on average),
resulting in a low signal-to-noise ratio inside the patch. However, our experiments
did not show better reconstructions when decreasing the patch size. This suggests that
MIM frameworks in their current form (randomly masking square patches in an
image) are unsuitable for radar.
The amount of data Finally, one key ingredient of SSL is the amount of data.
Generally, the more data available to pre-train the model, the more accurate the
model on downstream tasks. Most SSL frameworks are pre-trained on large un-
labelled image datasets (up to one billion images). However, such large datasets
do not yet exist for radar, which might hamper the benefits of pre-training
strategies compared to a fully supervised approach. As a result, in this chapter, we
aim to analyse the advantages of pre-training radar object detection models to
reduce the amount of labelled data needed during training, rather than to achieve
higher performance than fully supervised learning.
5.6.1 Methodology
We propose an approach for pre-training the backbone and the detection head of an
object detection model without labels, named RICL for Radar Instance Contrastive
Learning. Figure 5.8 displays an overview of RICL. This section details the pre-
training strategy.
Figure 5.8: RICL framework overview. We use two networks (an online and a target
network) to encode features from each view in parallel. We use RoIAlign to extract
object-level features from the CFAR output (each object has a specific id in the
figure). The contrastive loss is applied object-wise; no negative samples are required.
Overview Given two successive RD maps, we encode features from each view
using two identical CNNs (an online and a target network). From the CFAR de-
tections, we then use RoIAlign [He 2017] to extract object-level features. Following
the RoIAlign operation, the object-level features are passed to a detection head
(a MLP) before being mapped onto another space using a projector (online and
target networks) and a predictor (online network only). As for SoCo, we use the
BYOL [Grill 2020] framework to learn the representations.
Figure 5.9: RICL object proposal generation and matching. First, CFAR is applied
to the frames at t and t − 1. We then cluster the detections to retrieve all the objects
in the spectra. Finally, we match objects between the two frames and return a
bounding box for each valid object.
Two detections are matched if $|d_t - d_{t-1}| \leq \varepsilon_d$ and $|v_{r_t} - v_{r_{t-1}}| \leq \varepsilon_v$,
where $d_t$, $d_{t-1}$, $v_{r_t}$, $v_{r_{t-1}}$ are the ranges and the radial velocities of the objects at time t and
t − 1, respectively, and $\varepsilon_d$ and $\varepsilon_v$ are uncertainty constants for the range and the
velocity. Indeed, because we compute the range and the velocity using the centre
of the box, these constants are necessary to avoid matching errors.
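A simple greedy implementation of this matching step could look as follows; the greedy assignment and the dictionary-based detection format are assumptions, and only the gating on range and velocity follows the description above.

```python
def match_detections(dets_t, dets_t_minus_1, eps_d=1.0, eps_v=0.6):
    """Match CFAR detections between consecutive RD maps (sketch). Each
    detection is a dict with the box centre's range 'd' (m) and radial
    velocity 'v' (m/s). Two detections are matched when both differences
    fall within the uncertainty constants eps_d and eps_v."""
    matches, used = [], set()
    for i, cur in enumerate(dets_t):
        for j, prev in enumerate(dets_t_minus_1):
            if j in used:
                continue
            if abs(cur['d'] - prev['d']) <= eps_d and abs(cur['v'] - prev['v']) <= eps_v:
                matches.append((i, j))   # valid object pair across frames
                used.add(j)
                break
    return matches
```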
Given the bounding box representation b, we apply RoIAlign [He 2017] to extract the fore-
ground features from the last feature map of the backbone. An R-CNN head $f^H$
is introduced into pre-training to align the pre-training with object detection. For a
RD map R and a bounding box b, the object-level feature representation is:
\[
h = f_{\theta}^{H}\!\left(\mathrm{RoIAlign}(f_{\theta}^{S}(R), b)\right). \tag{5.3}
\]
SoCo follows the BYOL [Grill 2020] learning framework. Therefore, two neural net-
works are used for learning: the online and target networks. They share the same
architecture, but they have different weights. The weights of the target network
$f_{\xi}^{S}$, $f_{\xi}^{H}$ are updated using an EMA of the student's weights $f_{\theta}^{S}$, $f_{\theta}^{H}$ with a momen-
tum coefficient $\tau$. $\tau$ controls how fast the target's weights are updated. Extending
Equation 5.3 to the online and target networks and multiple objects, the object-
level feature representations $h_i$ and $h'_i$ of a set of possible objects $\{b_i\}$ in views $R_t$ and $R_{t-1}$
are, respectively:
\[
h_i = f_{\theta}^{H}\!\left(\mathrm{RoIAlign}(f_{\theta}^{S}(R_t), b_i)\right), \tag{5.4}
\]
\[
h'_i = f_{\xi}^{H}\!\left(\mathrm{RoIAlign}(f_{\xi}^{S}(R_{t-1}), b_i)\right). \tag{5.5}
\]
The contrastive loss for the i-th possible object is defined as:
\[
\mathcal{L}_i = -2 \cdot \frac{\langle v_i, v'_i \rangle}{\lVert v_i \rVert_2 \cdot \lVert v'_i \rVert_2}, \tag{5.7}
\]
\[
\mathcal{L} = \frac{1}{K} \sum_{i=1}^{K} \mathcal{L}_i, \tag{5.8}
\]
where K is the number of possible objects in the RD maps. Finally, as in SoCo and
BYOL, the loss is symmetrised by separately feeding $R_t$ to the target network and
$R_{t-1}$ to the online network to compute $\tilde{\mathcal{L}}$. At each training iteration, a stochastic
optimisation step is performed to minimise $\mathcal{L}_{RICL} = \mathcal{L} + \tilde{\mathcal{L}}$.
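The object-level feature extraction (Equations 5.4 and 5.5) and the loss (Equations 5.7 and 5.8) can be sketched as follows with torchvision's RoIAlign; the backbone, head and box format are placeholders for the actual Faster R-CNN components.

```python
import torch.nn.functional as F
from torchvision.ops import roi_align

def object_level_features(backbone, head, rd_map, boxes, output_size=7, spatial_scale=1.0):
    """Eq. 5.4/5.5 (sketch): encode the RD map, pool each proposal with RoIAlign
    and pass it through the R-CNN head (assumed here to be an MLP on flattened
    RoI features). `boxes` is a list with one (K, 4) tensor of (x1, y1, x2, y2)
    proposals; spatial_scale maps RD-map coordinates to feature-map coordinates."""
    feats = backbone(rd_map)                                     # (1, C, H', W')
    rois = roi_align(feats, boxes, output_size, spatial_scale)   # (K, C, 7, 7)
    return head(rois.flatten(1))                                 # (K, D)

def ricl_loss(v_online, v_target):
    """Eq. 5.7/5.8 (sketch): negative cosine similarity between online
    predictions and target projections, averaged over the K proposals."""
    v_online = F.normalize(v_online, dim=1)
    v_target = F.normalize(v_target, dim=1)
    return (-2 * (v_online * v_target).sum(dim=1)).mean()
```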
5.6.2 Experiments
Dataset and data augmentation We pre-train Faster R-CNN on the CAR-
RADA [Ouaknine 2021b] dataset, without labels. We apply horizontal and vertical
flipping to avoid dimensional collapse. Note that we always apply the same flipping
to both frames.
Object proposal generation and matching We use the following settings for
CA-CFAR. The range and Doppler guard lengths are set to four and two, respec-
tively. The range and Doppler training lengths are set to 20 and 10, respectively. We
set $\varepsilon_v$ to 0.6 and $\varepsilon_d$ to 1. We find these values provide the best matching between
frames.
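For completeness, a basic 2D cell-averaging CFAR over a range-Doppler magnitude map can be sketched as follows; the threshold factor is a placeholder, and only the guard and training lengths correspond to the settings above.

```python
import numpy as np
from scipy.signal import convolve2d

def ca_cfar_2d(rd_map, guard=(4, 2), train=(20, 10), scale=3.0):
    """2D cell-averaging CFAR on a range-Doppler magnitude map (sketch).
    guard and train are the (range, Doppler) guard and training lengths on
    each side of the cell under test; `scale` stands in for the usual
    Pfa-derived threshold factor."""
    gr, gd = guard
    tr, td = train
    # Kernel covering the training cells (ones), with the guard region and
    # the cell under test zeroed out.
    kernel = np.ones((2 * (gr + tr) + 1, 2 * (gd + td) + 1))
    kernel[tr:tr + 2 * gr + 1, td:td + 2 * gd + 1] = 0.0
    noise = convolve2d(rd_map, kernel, mode='same', boundary='symm') / kernel.sum()
    return rd_map > scale * noise   # boolean detection map
```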
For ease of development, we use the Detectron API
(https://github.com/facebookresearch/Detectron) to fine-tune the Faster R-CNN
framework.
Table 5.1: Comparison of RICL with a supervised approach and ImageNet pre-
training for different amounts of data. AP stands for the COCO mAP. AP@0.5 is
the AP at IoU threshold 0.5. Best results are in bold, second best are underlined.
Dataset This chapter aims to learn with less labelled data. To evaluate the pre-training
strategy, we create subsets of the training set of the CARRADA dataset. We train
and fine-tune the same model five times with 100%, 50%, 20%, 10% and 5% of the
training dataset. We create these splits randomly and use the same subsets for the
supervised, ImageNet and RICL trainings. We validate and test the model with
100% of the validation and testing sets.
Metrics We report the bounding-box COCO AP and the AP@0.5. COCO AP
corresponds to the mean of all AP@IoU values, where IoU ranges from 0.5 to 0.95
with a 0.05 step.
5.6.2.3 Results
We report in Table 5.1 the preliminary results of the RICL pre-training strategy
compared to a supervised approach with random weight initialisation and to Ima-
geNet pre-training. Overall, we notice that RICL speeds up convergence and outperforms
supervised training when trained with less data, but remains less effective than
pre-training on ImageNet. Although performing similarly with 100% and 50% of
the training set, RICL boosts the performance of Faster R-CNN when training with
20% and 10% of the data. Indeed, pre-training Faster R-CNN with RICL allows an
improvement of five and seven points when 20% and 10% of the dataset are used,
respectively.
Experiments using ImageNet pre-trained weights confirm the findings in Chapter
3. No matter the amount of data, ImageNet pre-training outperforms RICL and the
training from scratch by far. We notice that the model trained with the ImageNet
weights slightly outperforms the model trained from scratch while being trained
with only 50% of the training set. Also, this model reaches the same AP@0.5 using
5% and 20% of the training set. However, this should be treated cautiously since
we construct the training subsets randomly. For all training strategies, the AP
and the AP@0.5 are higher using 10% of the data than using 20%. To avoid this
phenomenon, cross-validation should be used.
Finally, though RICL pre-training generally improves performance compared to
training from scratch, there is still a gap between a pre-training on ImageNet and
our method.
The model First, for fast prototyping, we use a ResNet-50 backbone, which is not
well suited to radar. This choice was motivated by the availability of weights pre-
trained on the ImageNet dataset, to which we compare our approach. Indeed,
the results we obtain with this backbone are far from those presented in Chapter 3.
Though the aim of this chapter is more to present preliminary results than to obtain
a high-performance model, experiments using the DAROD (see Chapter 3) or the
RECORD (see Chapter 4) models should be conducted: those backbones
have been proven to be better adapted to radar data than the ResNet-50 model.
Since the ResNet-50 model is larger than DAROD and RECORD, it requires more
data to reach good performance. As mentioned throughout this thesis, the size of
Conclusion
The search for safer and more robust perception systems led to the massive use of
AI models to detect and identify objects in complex urban environments. Presently,
most perception systems use cameras and LiDAR sensors to build a representation
of the scene. The use of radar sensors remains sporadic and dedicated to tasks re-
quiring speed estimation, and the use of AI for radar processing is limited to point
cloud classification. In this thesis, we successfully demonstrated the potential of
AI models in enhancing automotive radar perception using raw data. By exploring
radar spectra such as range-Doppler, range-angle, and range-angle-Doppler, this
work has established their effectiveness in substituting radar point clouds and vari-
ous aspects of the radar signal processing chain for object detection. In the different
chapters of this thesis, we endeavoured to find the best representation of the radar
signal to use and the most appropriate formulation to learn to detect and identify
objects using raw data.
Exploiting data acquired through unannotated radar data collection campaigns presents an
avenue for future research. Owing to the complexity of annotating radar data, self-supervised
learning is a promising path to improve AI-driven radar object detection.
Limitations
Comparison with point cloud based approaches This thesis aimed to build
deep neural networks for radar object detection using raw data. We demonstrated
that strong performance can be achieved using deep learning on radar spectra,
particularly on range-Doppler maps. However, we did not compare our work
with AI models operating on radar point clouds, such as [Palffy 2022, Saini 2023].
We did not perform this comparison because of the unavailability of radar datasets
containing both ADC data (or spectra) and corresponding annotated point clouds.
Pretraining strategy The results presented in Chapter 5 remain far from those
obtained using the weights of a model pre-trained on ImageNet. This chapter
aimed to show that a good pre-training strategy allows for a reduction in the amount
of labelled data without losing performance. Nevertheless, Chapter 5's work is
preliminary, and we agree that it can be improved in many ways. First, using a
ResNet-50 backbone instead of DAROD, RECORD or other radar-based backbones
[Zhang 2021, Rebut 2022, Giroux 2023] might prevent the model from learning rel-
evant features from the data. This choice was motivated by the availability of pre-
trained ImageNet weights for ResNet-50. Second, knowing that the key ingredient of
self-supervised learning is the amount of data available, we used the smallest radar
dataset available, which certainly hindered the learning of representations. Last,
we compared our model with a supervised pre-training strategy on the ImageNet
dataset. As our pre-training strategy is unsupervised, we believe we could also have
compared it with other image-based pre-training strategies such as SoCo
[Wei 2021], BYOL [Grill 2020] or MoCo [He 2020].
Future works
Deploying our models in the real world In the short-term
perspective, this thesis contributes to dimensioning future genera-
tions of radar hardware accelerators by providing key performance indicators such as
the number of parameters, the GMACS and the runtime required to reach a certain
level of performance. Deep learning models are often over-parameterised; therefore,
pruning and quantisation will be pivotal to optimising the models proposed in this
thesis for real-world deployment. Beyond pruning the models, the models proposed
in this thesis can serve as base models for NAS, as proposed by Boot et al. in
[Boot 2023]. In addition to enhancing the efficiency of the models, their integration
into the system must be considered. From Chapter 4, we saw that with the in-
creasing resolution of radar sensors, using RAD tensors and, therefore, multi-view
models appears difficult for computational reasons. A first step towards integrat-
ing deep learning models into radar systems is to use single-view models instead of
CFAR detectors to detect objects in the RD view before calculating their direction
of arrival. In this way, we can obtain a list of targets containing the position (dis-
tance, azimuth, elevation), the speed and the class of the objects. Besides allowing
us to get a representation of the environment, it will enable us to compare raw
data-based methods with point cloud-based methods.
Encoder channels | Decoder channels | Pos. enc. | AP | AR | Params (M)
16, 32, 64, 128 | 16, 32, 64, 128 | Yes | 68.4 | 78.4 | 0.79
16, 32, 64, 128 | 16, 32, 64, 128 | No | 46.9 | 64.3 | 0.79
64, 64, 64, 128 | 32, 32, 64, 128 | Yes | 60.8 | 77.9 | 1.1
Table B.1: Performance improvement of the UTAE model with and without positional
encoding, compared with the default architecture (last row). Results are obtained
on the test set with a single seed.
Figure B.2: Qualitative results for RECORD (online) on CARRADA dataset (RD
view). From left to right: camera image, RD spectrum, predicted mask and ground
truth mask. Legend: pedestrians, bicyclists, cars.
Figure B.3: Qualitative results for RECORD (online) on CARRADA dataset (RA
view). From left to right: camera image, RA spectrum, predicted mask and ground
truth mask. Legend: pedestrians, bicyclists, cars.
Figure B.5: RECORD (no LSTM, multi) qualitative results on the ROD2021
dataset. We sample two examples per scenario. From left to right: highway, city
street, campus road and parking lot.
Figure B.6: RECORD (no LSTM, single) qualitative results on the ROD2021
dataset. We sample two examples per scenario. From left to right: highway, city
street, campus road and parking lot.
B.2.5 DANet
Figure B.7: DANet [Ju 2021] qualitative results on the ROD2021 dataset. We
sample two examples per scenario. From left to right: highway, city street, campus
road and parking lot.
B.2.6 UTAE
Figure B.8: UTAE [Fare Garnot 2021] qualitative results on the ROD2021 dataset.
We sample two examples per scenario. From left to right: highway, city street,
campus road and parking lot.
B.2.7 T-RODNet
Figure B.9: T-RODNet [Jiang 2023] qualitative results on the ROD2021 dataset.
We sample two examples per scenario. From left to right: highway, city street,
campus road and parking lot.
B.3.3 TMVA-Net
B.3.4 MV-Net
Figure B.16: Qualitative results for MV-Net on CARRADA dataset (RD view).
From left to right: camera image, RD spectrum, predicted mask and ground truth
mask. Legend: pedestrians, bicyclists, cars.
Figure B.17: Qualitative results for MV-Net on CARRADA dataset (RA view).
From left to right: camera image, RA spectrum, predicted mask and ground truth
mask. Legend: pedestrians, bicyclists, cars.
Bibliography
[Akita 2019] Tokihiko Akita and Seiichi Mita. Object Tracking and Classification
Using Millimeter-Wave Radar Based on LSTM. In 2019 IEEE Intelligent
Transportation Systems Conference (ITSC), pages 1110–1115, 2019. (Cited
in pages 3, 33, 62, and 74.)
[Arnab 2021] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario
Lučić and Cordelia Schmid. ViViT: A Video Vision Transformer. In 2021
IEEE/CVF International Conference on Computer Vision (ICCV), pages
6816–6826, 2021. (Cited in page 114.)
[Balestriero 2023] Randall Balestriero, Mark Ibrahim, Vlad Sobal, Ari Morcos,
Shashank Shekhar, Tom Goldstein, Florian Bordes, Adrien Bardes, Gre-
goire Mialon, Yuandong Tian et al. A cookbook of self-supervised learning.
arXiv preprint arXiv:2304.12210, 2023. (Cited in pages 117 and 119.)
[Ballas 2016] Nicolas Ballas, Li Yao, Chris Pal and Aaron C. Courville. Delving
Deeper into Convolutional Networks for Learning Video Representations.
In 4th International Conference on Learning Representations, ICLR 2016,
San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016.
(Cited in page 107.)
[Bao 2022] Hangbo Bao, Li Dong, Songhao Piao and Furu Wei. BEiT: BERT
Pre-Training of Image Transformers. In International Conference on Learn-
ing Representations, 2022. (Cited in pages 121 and 124.)
[Bar 2022] Amir Bar, Xin Wang, Vadim Kantorov, Colorado J Reed, Roei Herzig,
Gal Chechik, Anna Rohrbach, Trevor Darrell and Amir Globerson. DETReg:
Unsupervised Pretraining with Region Priors for Object Detection. In
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), pages 14585–14595, 2022. (Cited in pages 123 and 124.)
[Bardes 2022] Adrien Bardes, Jean Ponce and Yann LeCun. VICReg:
Variance-Invariance-Covariance Regularization for Self-Supervised Learning.
In International Conference on Learning Representations, 2022. (Cited in
pages 120, 121, and 126.)
[Barnes 2020] Dan Barnes, Matthew Gadd, Paul Murcutt, Paul Newman and In-
gmar Posner. The Oxford Radar RobotCar Dataset: A Radar Extension
to the Oxford RobotCar Dataset. In 2020 IEEE International Conference
on Robotics and Automation (ICRA), pages 6433–6438, 2020. (Cited in
page 58.)
[Bay 2006] Herbert Bay, Tinne Tuytelaars and Luc Van Gool. SURF: Speeded Up
Robust Features. In Aleš Leonardis, Horst Bischof and Axel Pinz, editors,
Computer Vision – ECCV 2006, pages 404–417, Berlin, Heidelberg, 2006.
Springer Berlin Heidelberg. (Cited in page 61.)
[Bertasius 2018] Gedas Bertasius, Lorenzo Torresani and Jianbo Shi. Object
Detection in Video with Spatiotemporal Sampling Networks. In Vittorio
Ferrari, Martial Hebert, Cristian Sminchisescu and Yair Weiss, editors, Com-
puter Vision – ECCV 2018, pages 342–357, Cham, 2018. Springer Interna-
tional Publishing. (Cited in page 95.)
[Blake 1988] Stephen Blake. OS-CFAR theory for multiple targets and nonuniform
clutter. IEEE transactions on aerospace and electronic systems, vol. 24,
no. 6, pages 785–790, 1988. (Cited in pages 4, 6, 25, 28, and 61.)
[Boot 2023] Thomas Boot, Nicolas Cazin, Willem Sanberg and Joaquin Van-
schoren. Efficient-DASH: Automated Radar Neural Network Design Across
Tasks and Datasets. In 2023 IEEE Intelligent Vehicles Symposium (IV),
pages 1–7, 2023. (Cited in page 138.)
[Brodeski 2019] Daniel Brodeski, Igal Bilik and Raja Giryes. Deep Radar Detector.
In 2019 IEEE Radar Conference (RadarConf), pages 1–6, 2019. (Cited in
pages 64, 65, 70, and 74.)
[Bromley 1993] Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger and
Roopak Shah. Signature Verification using a "Siamese" Time Delay Neural
Network. In J. Cowan, G. Tesauro and J. Alspector, editors, Advances in
Neural Information Processing Systems, volume 6. Morgan-Kaufmann, 1993.
(Cited in page 117.)
[Brown 2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D
Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish
Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen
Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jef-
frey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz
Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam
McCandlish, Alec Radford, Ilya Sutskever and Dario Amodei. Language
Models are Few-Shot Learners. In H. Larochelle, M. Ranzato, R. Hadsell,
M.F. Balcan and H. Lin, editors, Advances in Neural Information Processing
Systems, volume 33, pages 1877–1901. Curran Associates, Inc., 2020. (Cited
in pages 116 and 121.)
[Caesar 2019] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora,
Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Bal-
dan and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous
driving. arXiv preprint arXiv:1903.11027, 2019. (Cited in pages 1, 12, 43,
and 53.)
[Cai 2019] Xiuzhang Cai and Kamal Sarabandi. A Machine Learning Based 77
GHz Radar Target Classification for Autonomous Vehicles. In 2019 IEEE
International Symposium on Antennas and Propagation and USNC-URSI
Radio Science Meeting, pages 371–372, 2019. (Cited in page 62.)
[Caron 2018] Mathilde Caron, Piotr Bojanowski, Armand Joulin and Matthijs
Douze. Deep Clustering for Unsupervised Learning of Visual Features. In
Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu and Yair Weiss, edi-
tors, Computer Vision – ECCV 2018, pages 139–156, Cham, 2018. Springer
International Publishing. (Cited in page 117.)
[Caron 2020] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bo-
janowski and Armand Joulin. Unsupervised Learning of Visual Features by
Contrasting Cluster Assignments. In H. Larochelle, M. Ranzato, R. Hadsell,
M.F. Balcan and H. Lin, editors, Advances in Neural Information Processing
Systems, volume 33, pages 9912–9924. Curran Associates, Inc., 2020. (Cited
in page 120.)
[Caron 2021] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jegou, Julien
Mairal, Piotr Bojanowski and Armand Joulin. Emerging Properties in
Self-Supervised Vision Transformers. In 2021 IEEE/CVF International Con-
ference on Computer Vision (ICCV), pages 9630–9640, 2021. (Cited in
pages 48, 90, 113, 119, and 120.)
[Chen 2006] V.C. Chen, F. Li, S.-S. Ho and H. Wechsler. Micro-Doppler effect in
radar: phenomenon, model, and simulation study. IEEE Transactions on
Aerospace and Electronic Systems, vol. 42, no. 1, pages 2–21, 2006. (Cited
in page 62.)
[Chen 2017] Liang-Chieh Chen, George Papandreou, Florian Schroff and Hartwig
Adam. Rethinking Atrous Convolution for Semantic Image Segmentation.
ArXiv, vol. abs/1706.05587, 2017. (Cited in page 52.)
[Chen 2018b] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff
and Hartwig Adam. Encoder-Decoder with Atrous Separable Convolution
for Semantic Image Segmentation. In Vittorio Ferrari, Martial Hebert, Cris-
tian Sminchisescu and Yair Weiss, editors, Computer Vision – ECCV 2018,
pages 833–851, Cham, 2018. Springer International Publishing. (Cited in
pages 52 and 95.)
[Chen 2020a] Ting Chen, Simon Kornblith, Mohammad Norouzi and Geof-
frey Hinton. A Simple Framework for Contrastive Learning of Visual
Representations. In Hal Daumé III and Aarti Singh, editors, Proceedings
of the 37th International Conference on Machine Learning, volume 119 of
Proceedings of Machine Learning Research, pages 1597–1607. PMLR, 13–18
Jul 2020. (Cited in pages 90, 117, 118, 124, and 126.)
[Chen 2020b] Yihong Chen, Yue Cao, Han Hu and Liwei Wang. Memory Enhanced
Global-Local Aggregation for Video Object Detection. In 2020 IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), pages
10334–10343, 2020. (Cited in page 95.)
[Chen 2021] Xinlei Chen and Kaiming He. Exploring Simple Siamese
Representation Learning. In 2021 IEEE/CVF Conference on Computer Vi-
sion and Pattern Recognition (CVPR), pages 15745–15753, 2021. (Cited in
page 120.)
[Cho 2014] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry
Bahdanau, Fethi Bougares, Holger Schwenk and Yoshua Bengio. Learning
Phrase Representations using RNN Encoder–Decoder for Statistical Machine
Translation. In Proceedings of the 2014 Conference on Empirical Methods
in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar,
October 2014. Association for Computational Linguistics. (Cited in page 39.)
[Cordts 2016] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld,
Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth and Bernt
Schiele. The Cityscapes Dataset for Semantic Urban Scene Understanding.
In 2016 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), pages 3213–3223, 2016. (Cited in page 49.)
[Dai 2021] Zhigang Dai, Bolun Cai, Yugeng Lin and Junying Chen. UP-DETR:
Unsupervised Pre-training for Object Detection with Transformers. In
[Dalbah 2023] Yahia Dalbah, Jean Lahoud and Hisham Cholakkal. RadarFormer:
Lightweight and Accurate Real-Time Radar Object Detection Model. In
Image Analysis: 23rd Scandinavian Conference, SCIA 2023, Sirkka, Finland,
April 18–21, 2023, Proceedings, Part I, pages 341–358. Springer, 2023. (Cited
in page 67.)
[Danzer 2019] Andreas Danzer, Thomas Griebel, Martin Bach and Klaus Diet-
mayer. 2D Car Detection in Radar Data with PointNets. In 2019 IEEE
Intelligent Transportation Systems Conference (ITSC), pages 61–66, 2019.
(Cited in page 60.)
[Decourt 2022a] Colin Decourt, Rufin VanRullen, Didier Salle and Thomas Oberlin.
DAROD: A Deep Automotive Radar Object Detector on Range-Doppler
maps. In 2022 IEEE Intelligent Vehicles Symposium (IV), pages 112–118,
2022. (Cited in pages 75 and 96.)
[Decourt 2022b] Colin Decourt, Rufin VanRullen, Didier Salle and Thomas Oberlin.
A recurrent CNN for online object detection on raw radar frames. arXiv
preprint arXiv:2212.11172, 2022. (Cited in page 95.)
[Devlin 2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.
BERT: Pre-training of Deep Bidirectional Transformers for Language
Understanding. In Proceedings of the 2019 Conference of the North Ameri-
can Chapter of the Association for Computational Linguistics: Human Lan-
guage Technologies, Volume 1 (Long and Short Papers), pages 4171–4186,
Minneapolis, Minnesota, June 2019. Association for Computational Linguis-
tics. (Cited in pages 116 and 121.)
[Ding 2022] Fangqiang Ding, Zhijun Pan, Yimin Deng, Jianning Deng and Chris Xi-
aoxuan Lu. Self-Supervised Scene Flow Estimation With 4-D Automotive
Radar. IEEE Robotics and Automation Letters, pages 1–8, 2022. (Cited in
page 60.)
[Ding 2023] Fangqiang Ding, Andras Palffy, Dariu M. Gavrila and Chris Xiaox-
uan Lu. Hidden Gems: 4D Radar Scene Flow Learning Using Cross-Modal
Supervision. In Proceedings of the IEEE/CVF Conference on Computer Vi-
sion and Pattern Recognition (CVPR), pages 9340–9349, June 2023. (Cited
in page 60.)
[Dong 2020] Xu Dong, Pengluo Wang, Pengyue Zhang and Langechuan Liu.
Probabilistic Oriented Object Detection in Automotive Radar. In 2020
IEEE/CVF Conference on Computer Vision and Pattern Recognition Work-
shops (CVPRW), 2020.
[Dreher 2020] Maria Dreher, Emeç Erçelik, Timo Bänziger and Alois Knoll.
Radar-based 2D Car Detection Using Deep Neural Networks. In 2020
IEEE 23rd International Conference on Intelligent Transportation Systems
(ITSC), pages 1–8, 2020. (Cited in page 60.)
[Dubey 2020] Anand Dubey, Jonas Fuchs, Maximilian Lübke, Robert Weigel
and Fabian Lurz. Generative Adversial Network based Extended Target
Detection for Automotive MIMO Radar. In 2020 IEEE International Radar
Conference (RADAR), pages 220–225, 2020. (Cited in pages 68 and 69.)
[Duke 2021] Brendan Duke, Abdalla Ahmed, Christian Wolf, Parham Aarabi and
Graham W. Taylor. SSTVOS: Sparse Spatiotemporal Transformers for
Video Object Segmentation. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR), pages 5912–5921,
June 2021. (Cited in page 95.)
[Ericsson 2021] Linus Ericsson, Henry Gouk and Timothy M. Hospedales. How
Well Do Self-Supervised Models Transfer? In 2021 IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR), pages 5410–5419,
2021. (Cited in page 122.)
[Ester 1996] Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu.
A Density-Based Algorithm for Discovering Clusters in Large Spatial
Databases with Noise. In Proceedings of the Second International Con-
ference on Knowledge Discovery and Data Mining, KDD’96, page 226–231.
AAAI Press, 1996. (Cited in page 25.)
[Ettinger 2021] Scott Ettinger, Shuyang Cheng, Benjamin Caine, Chenxi Liu,
Hang Zhao, Sabeek Pradhan, Yuning Chai, Ben Sapp, Charles R. Qi, Yin
Zhou et al. Large Scale Interactive Motion Forecasting for Autonomous Driving:
The Waymo Open Motion Dataset. In Proceedings of the IEEE/CVF Inter-
national Conference on Computer Vision, pages 9710–9719, 2021. (Cited in
pages 1, 12, and 43.)
[Fang 2022] Shihong Fang, Haoran Zhu, Devansh Bisla, Anna Choromanska,
Satish Ravindran, Dongyin Ren and Ryan Wu. ERASE-Net: Efficient
Segmentation Networks for Automotive Radar Signals. arXiv preprint
arXiv:2209.12940, 2022. (Cited in pages 64 and 65.)
[Fare Garnot 2021] Vivien Sainte Fare Garnot and Loic Landrieu. Panoptic
Segmentation of Satellite Image Time Series with Convolutional Temporal
Attention Networks. In 2021 IEEE/CVF International Conference on Com-
puter Vision (ICCV), pages 4852–4861, 2021. (Cited in pages 7, 95, 96, 104,
114, 143, and 150.)
[Fatseas 2019] Konstantinos Fatseas and Marco J.G. Bekooij. Neural Network
Based Multiple Object Tracking for Automotive FMCW Radar. In 2019
International Radar Conference (RADAR), pages 1–5, 2019. (Cited in
pages 33, 64, and 68.)
[Fatseas 2022] Konstantinos Fatseas and Marco J.G. Bekooij. Weakly Supervised
Semantic Segmentation for Range-Doppler Maps. In 2021 18th European
Radar Conference (EuRAD), pages 70–73, 2022. (Cited in page 68.)
[Fel 2022] Thomas Fel, Lucas Hervier, David Vigouroux, Antonin Poche, Justin
Plakoo, Remi Cadene, Mathieu Chalvidal, Julien Colin, Thibaut Boissin,
Louis Bethune, Agustin Picard, Claire Nicodeme, Laurent Gardes, Gre-
gory Flandin and Thomas Serre. Xplique: A Deep Learning Explainability
Toolbox. Workshop on Explainable Artificial Intelligence for Computer Vi-
sion (CVPR), 2022. (Cited in page 140.)
[Feng 2019] Zhaofei Feng, Shuo Zhang, Martin Kunert and Werner Wiesbeck. Point
Cloud Segmentation with a High-Resolution Automotive Radar. In AmE
2019 - Automotive meets Electronics; 10th GMM-Symposium, pages 1–5,
2019. (Cited in page 60.)
[Fent 2023] Felix Fent, Philipp Bauerschmidt and Markus Lienkamp. RadarGNN:
Transformation Invariant Graph Neural Network for Radar-based
Perception. arXiv preprint arXiv:2304.06547, 2023. (Cited in page 60.)
[Galvani 2019] Marco Galvani. History and future of driver assistance. IEEE In-
strumentation and Measurement Magazine, vol. 22, no. 1, pages 11–16, 2019.
(Cited in page 17.)
[Gao 2019a] Teng Gao, Zhichao Lai, Zengyang Mei and Qisong Wu. Hybrid
SVM-CNN Classification Technique for Moving Targets in Automotive
FMCW Radar System. In 2019 11th International Conference on Wireless
Communications and Signal Processing (WCSP), pages 1–6, 2019. (Cited
in pages 62 and 63.)
[Gao 2019b] Xiangyu Gao, Guanbin Xing, Sumit Roy and Hui Liu. Experiments
with mmWave Automotive Radar Test-bed. In 2019 53rd Asilomar Con-
ference on Signals, Systems, and Computers, pages 1–6, 2019. (Cited in
pages 33, 52, 62, and 64.)
[Gao 2021] Xiangyu Gao, Guanbin Xing, Sumit Roy and Hui Liu. RAMP-CNN: A
Novel Neural Network for Enhanced Automotive Radar Object Recognition.
IEEE Sensors Journal, vol. 21, no. 4, pages 5119–5132, 2021. (Cited in
pages 3, 4, 64, 65, 66, 67, 70, 71, and 96.)
[Girshick 2014] Ross Girshick, Jeff Donahue, Trevor Darrell and Jitendra Ma-
lik. Rich Feature Hierarchies for Accurate Object Detection and Semantic
Segmentation. In 2014 IEEE Conference on Computer Vision and Pattern
Recognition, pages 580–587, 2014. (Cited in page 45.)
[Girshick 2015] Ross Girshick. Fast R-CNN. In 2015 IEEE International Conference
on Computer Vision (ICCV), pages 1440–1448, 2015. (Cited in pages 45
and 76.)
[Goodfellow 2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu,
David Warde-Farley, Sherjil Ozair, Aaron Courville and Yoshua Bengio.
Generative Adversarial Nets. In Z. Ghahramani, M. Welling, C. Cortes,
N. Lawrence and K.Q. Weinberger, editors, Advances in Neural Information
Processing Systems, volume 27. Curran Associates, Inc., 2014. (Cited in
page 69.)
[Goodfellow 2016] Ian Goodfellow, Yoshua Bengio and Aaron Courville. Deep
learning. MIT Press, 2016. http://www.deeplearningbook.org. (Cited
in pages 36, 38, and 117.)
[Goyal 2017] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz
Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia and Kaiming He.
Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint
arXiv:1706.02677, 2017. (Cited in page 131.)
[Goyal 2021] Priya Goyal, Mathilde Caron, Benjamin Lefaudeux, Min Xu,
Pengchao Wang, Vivek Pai, Mannat Singh, Vitaliy Liptchinsky, Ishan Misra,
Armand Joulin et al. Self-supervised pretraining of visual features in the wild.
arXiv preprint arXiv:2103.01988, 2021. (Cited in page 116.)
[Goyal 2022] Priya Goyal, Quentin Duval, Isaac Seessel, Mathilde Caron, Ishan
Misra, Levent Sagun, Armand Joulin and Piotr Bojanowski. Vision models
are more robust and fair when pretrained on uncurated images without
supervision. arXiv preprint arXiv:2202.08360, 2022. (Cited in page 117.)
[Grill 2020] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec,
Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires,
Zhaohan Guo, Mohammad Gheshlaghi Azar, Bilal Piot, koray kavukcuoglu,
Remi Munos and Michal Valko. Bootstrap Your Own Latent - A New
Approach to Self-Supervised Learning. In H. Larochelle, M. Ranzato,
R. Hadsell, M.F. Balcan and H. Lin, editors, Advances in Neural Informa-
tion Processing Systems, volume 33, pages 21271–21284. Curran Associates,
Inc., 2020. (Cited in pages 90, 117, 119, 123, 124, 127, 129, 132, and 138.)
[Gu 2022] Huanyu Gu. The Importance of Imaging Radar, February 2022. https:
//www.nxp.com/company/blog/the-importance-of-imaging-radar:
BL-THE-IMPORTANCE-OF-IMAGING-RADAR. (Cited in pages 11, 12, and 13.)
[Guo 2022] Zuyuan Guo, Haoran Wang, Wei Yi and Jiahao Zhang. Efficient Radar
Deep Temporal Detection in Urban Traffic Scenes. In 2022 IEEE Intelligent
Vehicles Symposium (IV), pages 498–503, 2022. (Cited in pages 68 and 69.)
[Hameed 2022] Syed Waqar Hameed. Peaks Detector Algorithm after CFAR for
Multiple Targets Detection. EAI Endorsed Transactions on AI and Robotics,
vol. 1, no. 1, July 2022. (Cited in page 29.)
[Hasch 2012] Jürgen Hasch, Eray Topak, Raik Schnabel, Thomas Zwick, Robert
Weigel and Christian Waldschmidt. Millimeter-Wave Technology for
Automotive Radar Sensors in the 77 GHz Frequency Band. IEEE Transac-
tions on Microwave Theory and Techniques, vol. 60, no. 3, pages 845–860,
2012. (Cited in page 20.)
[He 2014] Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun. Spatial
Pyramid Pooling in Deep Convolutional Networks for Visual Recognition.
In David Fleet, Tomas Pajdla, Bernt Schiele and Tinne Tuytelaars, edi-
tors, Computer Vision – ECCV 2014, pages 346–361, Cham, 2014. Springer
International Publishing. (Cited in page 51.)
[He 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun. Deep Residual
Learning for Image Recognition. In 2016 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), pages 770–778, 2016. (Cited in
pages 6, 9, 41, 42, 51, and 130.)
[He 2017] Kaiming He, Georgia Gkioxari, Piotr Dollár and Ross Girshick. Mask
R-CNN. In 2017 IEEE International Conference on Computer Vision
(ICCV), pages 2980–2988, 2017. (Cited in pages 9, 45, 46, 74, 88, 115,
123, 127, and 129.)
[He 2020] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie and Ross Girshick.
Momentum Contrast for Unsupervised Visual Representation Learning. In
2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), pages 9726–9735, 2020. (Cited in pages 90, 120, 126, and 138.)
[He 2022] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár and
Ross Girshick. Masked Autoencoders Are Scalable Vision Learners. In
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), pages 15979–15988, 2022. (Cited in pages 117, 121, 122, 124,
and 126.)
[Hendrycks* 2020] Dan Hendrycks*, Norman Mu*, Ekin Dogus Cubuk, Barret
Zoph, Justin Gilmer and Balaji Lakshminarayanan. AugMix: A Simple
Method to Improve Robustness and Uncertainty under Data Shift. In Inter-
national Conference on Learning Representations, 2020. (Cited in page 70.)
[Howard 2019] Andrew Howard, Mark Sandler, Bo Chen, Weijun Wang, Liang-
Chieh Chen, Mingxing Tan, Grace Chu, Vijay Vasudevan, Yukun Zhu,
Ruoming Pang, Hartwig Adam and Quoc Le. Searching for MobileNetV3.
In 2019 IEEE/CVF International Conference on Computer Vision (ICCV),
pages 1314–1324, 2019. (Cited in pages 41 and 42.)
[Hu 2018] Jie Hu, Li Shen and Gang Sun. Squeeze-and-Excitation Networks. In
2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition,
pages 7132–7141, 2018. (Cited in page 42.)
[Jiang 2023] Tiezhen Jiang, Long Zhuang, Qi An, Jianhua Wang, Kai Xiao and
Anqi Wang. T-RODNet: Transformer for Vehicular Millimeter-Wave Radar
Object Detection. IEEE Transactions on Instrumentation and Measurement,
vol. 72, pages 1–12, 2023. (Cited in pages 7, 104, 113, and 151.)
[Ju 2021] Bo Ju, Wei Yang, Jinrang Jia, Xiaoqing Ye, Qu Chen, Xiao Tan, Hao Sun,
Yifeng Shi and Errui Ding. DANet: Dimension Apart Network for Radar
Object Detection. In Proceedings of the 2021 International Conference on
Multimedia Retrieval, ICMR ’21, page 533–539, New York, NY, USA, 2021.
Association for Computing Machinery. (Cited in pages 4, 6, 7, 67, 94, 96,
103, 104, 112, 113, and 149.)
[Kaul 2020] Prannay Kaul, Daniele de Martini, Matthew Gadd and Paul New-
man. RSS-Net: Weakly-Supervised Multi-Class Semantic Segmentation with
FMCW Radar. In 2020 IEEE Intelligent Vehicles Symposium (IV), pages
431–436, 2020. (Cited in pages 64, 67, and 96.)
[Kim 2018] Sangtae Kim, Seunghwan Lee, Seungho Doo and Byonghyo Shim.
Moving Target Classification in Automotive Radar Systems Using
Convolutional Recurrent Neural Networks. In 2018 26th European Sig-
nal Processing Conference (EUSIPCO), pages 1482–1486, 2018. (Cited in
page 62.)
[Kraus 2020] Florian Kraus, Nicolas Scheiner, Werner Ritter and Klaus Dietmayer.
Using Machine Learning to Detect Ghost Images in Automotive Radar. In
2020 IEEE 23rd International Conference on Intelligent Transportation Sys-
tems (ITSC), pages 1–7, 2020. (Cited in page 60.)
[Lang 2019] Alex H. Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang
and Oscar Beijbom. PointPillars: Fast Encoders for Object Detection From
Point Clouds. In 2019 IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), pages 12689–12697, 2019. (Cited in page 60.)
[Lecun 1989] Yann Lecun. Generalization and network design strategies. Elsevier,
1989. (Cited in pages 39 and 40.)
[Lee 2017] Seongwook Lee, Young-Jun Yoon, Jae-Eun Lee and Seong-Cheol
Kim. Human–vehicle classification using feature-based SVM in 77-GHz
automotive FMCW radar. IET Radar, Sonar & Navigation, vol. 11, no. 10,
pages 1589–1596, 2017. (Cited in page 61.)
[Lee 2019] Dajung Lee, Colman Cheung and Dan Pritsker. Radar-based Object
Classification Using An Artificial Neural Network. In 2019 IEEE National
Aerospace and Electronics Conference (NAECON), pages 305–310, 2019.
(Cited in page 62.)
[Li 2018] Xiaobo Li, Haohua Zhao and Liqing Zhang. Recurrent RetinaNet: A
Video Object Detection Model based on Focal Loss. In International con-
ference on neural information processing, pages 499–508. Springer, 2018.
(Cited in page 95.)
[Li 2021] Zhaowen Li, Zhiyang Chen, Fan Yang, Wei Li, Yousong Zhu, Chaoyang
Zhao, Rui Deng, Liwei Wu, Rui Zhao, Ming Tang and Jinqiao Wang. MST:
Masked Self-Supervised Transformer for Visual Representation. In M. Ran-
zato, A. Beygelzimer, Y. Dauphin, P.S. Liang and J. Wortman Vaughan,
editors, Advances in Neural Information Processing Systems, volume 34,
pages 13165–13176. Curran Associates, Inc., 2021. (Cited in page 122.)
[Li 2022] Peizhao Li, Pu Wang, Karl Berntorp and Hongfu Liu. Exploiting
Temporal Relations on Radar Perception for Autonomous Driving. In
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), pages 17050–17059, 2022. (Cited in pages 6, 64, 94, and 96.)
[Lin 2014] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays,
Pietro Perona, Deva Ramanan, Piotr Dollár and C. Lawrence Zitnick.
Microsoft COCO: Common Objects in Context. In ECCV, 2014. (Cited
in pages 43, 44, and 49.)
[Lin 2017a] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hari-
haran and Serge Belongie. Feature Pyramid Networks for Object Detection.
In 2017 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), pages 936–944, 2017. (Cited in pages 45, 47, and 51.)
[Lin 2017b] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He and Piotr Dol-
lár. Focal Loss for Dense Object Detection. In 2017 IEEE International
Conference on Computer Vision (ICCV), pages 2999–3007, 2017. (Cited in
page 45.)
[Liu 2016] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott
Reed, Cheng-Yang Fu and Alexander C. Berg. SSD: Single Shot MultiBox
Detector. In Bastian Leibe, Jiri Matas, Nicu Sebe and Max Welling, edi-
tors, Computer Vision – ECCV 2016, pages 21–37, Cham, 2016. Springer
International Publishing. (Cited in pages 45, 48, and 51.)
[Liu 2021] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang,
Stephen Lin and Baining Guo. Swin Transformer: Hierarchical Vision
Transformer using Shifted Windows. In 2021 IEEE/CVF International Con-
ference on Computer Vision (ICCV), pages 9992–10002, 2021. (Cited in
page 48.)
[Liu 2022] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer,
Trevor Darrell and Saining Xie. A ConvNet for the 2020s. In 2022
IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), pages 11966–11976, 2022. (Cited in page 48.)
[Long 2015] Jonathan Long, Evan Shelhamer and Trevor Darrell. Fully
convolutional networks for semantic segmentation. In 2015 IEEE Confer-
ence on Computer Vision and Pattern Recognition (CVPR), pages 3431–
3440, 2015. (Cited in page 50.)
[Madani 2022] Sohrab Madani, Jayden Guan, Waleed Ahmed, Saurabh Gupta and
Haitham Hassanieh. Radatron: Accurate Detection Using Multi-resolution
Cascaded MIMO Radar. In Shai Avidan, Gabriel Brostow, Moustapha
Cissé, Giovanni Maria Farinella and Tal Hassner, editors, Computer Vision
– ECCV 2022, pages 160–178, Cham, 2022. Springer Nature Switzerland.
(Cited in pages 56 and 68.)
[Major 2019] Bence Major, Daniel Fontijne, Amin Ansari, Ravi Teja Sukhavasi,
Radhika Gowaikar, Michael Hamilton, Sean Lee, Slawomir Grzechnik and
Sundar Subramanian. Vehicle Detection With Automotive Radar Using
Deep Learning on Range-Azimuth-Doppler Tensors. In 2019 IEEE/CVF
International Conference on Computer Vision Workshop (ICCVW), pages
924–932, 2019. (Cited in pages 6, 64, 65, 66, 67, 70, 91, 94, and 97.)
[Meyer 2019] Michael Meyer and Georg Kuschk. Automotive Radar Dataset for
Deep Learning Based 3D Object Detection. In 2019 16th European Radar
Conference (EuRAD), pages 129–132, 2019. (Cited in pages 53 and 54.)
[Meyer 2021] Michael Meyer, Georg Kuschk and Sven Tomforde. Graph
Convolutional Networks for 3D Object Detection on Radar Data. In 2021
IEEE/CVF International Conference on Computer Vision Workshops (IC-
CVW), pages 3053–3062, 2021. (Cited in pages 68 and 74.)
[Mitchell 1997] Tom Mitchell. Machine learning. McGraw Hill, 1997. http://www.
cs.cmu.edu/afs/cs.cmu.edu/user/mitchell/ftp/mlbook.html. (Cited
in page 36.)
[Newell 2016] Alejandro Newell, Kaiyu Yang and Jia Deng. Stacked Hourglass
Networks for Human Pose Estimation. In Bastian Leibe, Jiri Matas, Nicu
Sebe and Max Welling, editors, Computer Vision – ECCV 2016, pages 483–
499, Cham, 2016. Springer International Publishing. (Cited in page 67.)
[Noroozi 2016] Mehdi Noroozi and Paolo Favaro. Unsupervised Learning of Visual
Representations by Solving Jigsaw Puzzles. In Bastian Leibe, Jiri Matas,
Nicu Sebe and Max Welling, editors, Computer Vision – ECCV 2016, pages
69–84, Cham, 2016. Springer International Publishing. (Cited in page 121.)
[Paek 2022] Dong-Hee Paek, Seung-Hyun Kong and Kevin Tirta Wijaya.
K-Radar: 4D Radar Object Detection for Autonomous Driving in Various
Weather Conditions. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave,
K. Cho and A. Oh, editors, Advances in Neural Information Processing Sys-
tems, volume 35, pages 3819–3829. Curran Associates, Inc., 2022. (Cited in
pages 56, 59, 68, and 69.)
[Palffy 2020] Andras Palffy, Jiaao Dong, Julian F. P. Kooij and Dariu M. Gavrila.
CNN Based Road User Detection Using the 3D Radar Cube. IEEE Robotics
and Automation Letters, vol. 5, no. 2, pages 1263–1270, 2020. (Cited in
pages 3, 62, 63, 139, and 140.)
[Patel 2019] Kanil Patel, Kilian Rambach, Tristan Visentin, Daniel Rusev, Michael
Pfeiffer and Bin Yang. Deep Learning-based Object Classification on
Automotive Radar Spectra. In 2019 IEEE Radar Conference (RadarConf),
pages 1–6, 2019. (Cited in pages 62 and 63.)
[Patel 2022] Kanil Patel, William Beluch, Kilian Rambach, Michael Pfeiffer and Bin
Yang. Improving Uncertainty of Deep Learning-based Object Classification
on Radar Spectra using Label Smoothing. In 2022 IEEE Radar Conference
(RadarConf22), pages 1–6, 2022. (Cited in page 62.)
[Pathak 2016] Deepak Pathak, Philipp Krähenbühl, Jeff Donahue, Trevor Darrell
and Alexei A. Efros. Context Encoders: Feature Learning by Inpainting.
In 2016 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), pages 2536–2544, 2016. (Cited in pages 117 and 121.)
[Pathak 2017] Deepak Pathak, Ross Girshick, Piotr Dollár, Trevor Darrell and
Bharath Hariharan. Learning Features by Watching Objects Move. In 2017
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[Patole 2017] Sujeet Milind Patole, Murat Torlak, Dan Wang and Murtaza Ali.
Automotive Radars: A Review of Signal Processing Techniques. IEEE Sig-
nal Processing Magazine, vol. 34, no. 2, pages 22–35, 2017. (Cited in pages 18
and 21.)
[Prophet 2018a] Robert Prophet, Marcel Hoffmann, Alicja Ossowska, Waqas Malik,
Christian Sturm and Martin Vossiek. Image-Based Pedestrian Classification
for 79 GHz Automotive Radar. In 2018 15th European Radar Conference
(EuRAD), pages 75–78, 2018. (Cited in pages 61 and 62.)
[Prophet 2018b] Robert Prophet, Marcel Hoffmann, Alicja Ossowska, Waqas Ma-
lik, Christian Sturm and Martin Vossiek. Pedestrian Classification for 79
GHz Automotive Radar Systems. In 2018 IEEE Intelligent Vehicles Sympo-
sium (IV), pages 1265–1270, 2018. (Cited in pages 61 and 63.)
[Pérez 2018] Rodrigo Pérez, Falk Schubert, Ralph Rasshofer and Erwin Biebl.
Single-Frame Vulnerable Road Users Classification with a 77 GHz FMCW
Radar Sensor and a Convolutional Neural Network. In 2018 19th Interna-
tional Radar Symposium (IRS), pages 1–10, 2018. (Cited in page 62.)
[Qi 2017] Charles Ruizhongtai Qi, Li Yi, Hao Su and Leonidas J Guibas.
PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric
Space. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus,
S. Vishwanathan and R. Garnett, editors, Advances in Neural Information
Processing Systems, volume 30. Curran Associates, Inc., 2017. (Cited in
page 60.)
[Rao 2018] Sandeep Rao. MIMO Radar. Texas Instruments Application Report,
2018. (Cited in pages 29, 30, and 31.)
[Rebut 2022] Julien Rebut, Arthur Ouaknine, Waqas Malik and Patrick Pérez.
Raw High-Definition Radar for Multi-Task Learning. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
June 2022.
[Redmon 2016] Joseph Redmon, Santosh Divvala, Ross Girshick and Ali Farhadi.
You Only Look Once: Unified, Real-Time Object Detection. In 2016 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), pages
779–788, 2016. (Cited in pages 45, 47, 48, 65, 68, and 81.)
[Redmon 2017] Joseph Redmon and Ali Farhadi. YOLO9000: Better, Faster,
Stronger. In 2017 IEEE Conference on Computer Vision and Pattern Recog-
nition (CVPR), pages 6517–6525, 2017. (Cited in pages 47 and 48.)
[Ren 2017] Shaoqing Ren, Kaiming He, Ross Girshick and Jian Sun. Faster R-CNN:
Towards Real-Time Object Detection with Region Proposal Networks. IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6,
pages 1137–1149, 2017. (Cited in pages 4, 14, 45, 46, 51, 73, 77, 78, 128,
and 130.)
[Rohling 2010] Hermann Rohling, Steffen Heuel and Henning Ritter. Pedestrian
detection procedure integrated into an 24 GHz automotive radar. In 2010
IEEE Radar Conference, pages 1229–1232, 2010. (Cited in page 25.)
[Ronneberger 2015] Olaf Ronneberger, Philipp Fischer and Thomas Brox. U-Net:
Convolutional Networks for Biomedical Image Segmentation. In Nassir
Navab, Joachim Hornegger, William M. Wells and Alejandro F. Frangi,
editors, Medical Image Computing and Computer-Assisted Intervention –
MICCAI 2015, pages 234–241, Cham, 2015. Springer International Publish-
ing. (Cited in pages 50, 51, 64, 69, and 100.)
[Russakovsky 2015] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause,
Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy,
Aditya Khosla, Michael Bernstein, Alexander C. Berg and Li Fei-Fei.
ImageNet Large Scale Visual Recognition Challenge. International Journal
of Computer Vision (IJCV), vol. 115, no. 3, pages 211–252, 2015. (Cited in
pages 10, 40, 90, and 116.)
[Saini 2023] Loveneet Saini, Axel Acosta and Gor Hakobyan. Graph Neural
Networks for Object Type Classification Based on Automotive Radar Point
Clouds and Spectra. In ICASSP 2023 - 2023 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2023.
(Cited in pages 62, 63, and 138.)
[Sandler 2018] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmogi-
nov and Liang-Chieh Chen. MobileNetV2: Inverted Residuals and Linear
Bottlenecks. In 2018 IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition, pages 4510–4520, 2018. (Cited in pages 7, 41, 42, 95, 98,
and 99.)
[Scheiner 2018] Nicolas Scheiner, Nils Appenrodt, Jürgen Dickmann and Bernhard
Sick. Radar-based Feature Design and Multiclass Classification for Road
User Recognition. In 2018 IEEE Intelligent Vehicles Symposium (IV), pages
779–786, 2018. (Cited in pages 2 and 59.)
[Scheiner 2019] Nicolas Scheiner, Nils Appenrodt, Jürgen Dickmann and Bernhard
Sick. Radar-based Road User Classification and Novelty Detection with
Recurrent Neural Network Ensembles. In 2019 IEEE Intelligent Vehicles
Symposium (IV), pages 722–729, 2019. (Cited in page 59.)
[Scheiner 2020] Nicolas Scheiner, Ole Schumann, Florian Kraus, Nils Appenrodt,
Jürgen Dickmann and Bernhard Sick. Off-the-shelf sensor vs. experimental
radar - How much resolution is necessary in automotive radar classification?
In 2020 IEEE 23rd International Conference on Information Fusion (FU-
SION), pages 1–8, 2020. (Cited in pages 2 and 59.)
[Schroff 2015] Florian Schroff, Dmitry Kalenichenko and James Philbin. FaceNet:
A unified embedding for face recognition and clustering. In 2015 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), pages
815–823, 2015. (Cited in page 118.)
[Schumann 2018] Ole Schumann, Markus Hahn, Jürgen Dickmann and Christian
Wöhler. Semantic Segmentation on Radar Point Clouds. In 2018 21st In-
ternational Conference on Information Fusion (FUSION), pages 2179–2186,
2018. (Cited in pages 60 and 96.)
[Schumann 2021] Ole Schumann, Markus Hahn, Nicolas Scheiner, Fabio Weishaupt,
Julius F. Tilly, Jürgen Dickmann and Christian Wöhler. RadarScenes: A
Real-World Radar Point Cloud Data Set for Automotive Applications. In
2021 IEEE 24th International Conference on Information Fusion (FUSION),
pages 1–8, 2021. (Cited in pages 53 and 54.)
[Sermanet 2018] Pierre Sermanet, Corey Lynch, Yevgen Chebotar, Jasmine Hsu,
Eric Jang, Stefan Schaal, Sergey Levine and Google Brain. Time-Contrastive
Networks: Self-Supervised Learning from Video. In 2018 IEEE International
Conference on Robotics and Automation (ICRA), pages 1134–1141, 2018.
(Cited in page 118.)
[Sheeny 2020] Marcel Sheeny, Andrew Wallace and Sen Wang. RADIO:
Parameterized Generative Radar Data Augmentation for Small Datasets.
Applied Sciences, vol. 10, no. 11, 2020. (Cited in page 70.)
[Shi 2015] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-kin Wong
and Wang-chun Woo. Convolutional LSTM Network: A Machine Learning
Approach for Precipitation Nowcasting. In C. Cortes, N. Lawrence, D. Lee,
M. Sugiyama and R. Garnett, editors, Advances in Neural Information Pro-
cessing Systems, volume 28. Curran Associates, Inc., 2015. (Cited in pages 7,
98, and 107.)
[Szegedy 2015] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott
Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke and Andrew
Rabinovich. Going deeper with convolutions. In 2015 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), pages 1–9, 2015. (Cited
in page 41.)
[Tian 2019] Zhi Tian, Chunhua Shen, Hao Chen and Tong He. FCOS: Fully
Convolutional One-Stage Object Detection. In 2019 IEEE/CVF Interna-
tional Conference on Computer Vision (ICCV), pages 9626–9635, 2019.
(Cited in pages 45 and 48.)
[Tilly 2020] Julius F. Tilly, Stefan Haag, Ole Schumann, Fabio Weishaupt, Bha-
ranidhar Duraisamy, Jürgen Dickmann and Martin Fritzsche. Detection
and Tracking on Automotive Radar Data with Deep Learning. In 2020 IEEE
23rd International Conference on Information Fusion (FUSION), pages 1–7,
2020. (Cited in page 60.)
[Tong 2022] Zhan Tong, Yibing Song, Jue Wang and Limin Wang. VideoMAE:
Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video
Pre-Training. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho
and A. Oh, editors, Advances in Neural Information Processing Systems,
volume 35, pages 10078–10093. Curran Associates, Inc., 2022. (Cited in
pages 123 and 126.)
[Tran 2018] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun
and Manohar Paluri. A Closer Look at Spatiotemporal Convolutions for
Action Recognition. In 2018 IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pages 6450–6459, 2018. (Cited in page 67.)
[Ulrich 2021] Michael Ulrich, Claudius Gläser and Fabian Timm. DeepReflecs:
Deep Learning for Automotive Object Classification with Radar Reflections.
In 2021 IEEE Radar Conference (RadarConf21), pages 1–6, 2021. (Cited in
pages 2, 59, 60, 85, 86, 87, and 89.)
[Ulrich 2022] Michael Ulrich, Sascha Braun, Daniel Köhler, Daniel Niederlöhner,
Florian Faion, Claudius Gläser and Holger Blume. Improved Orientation
Estimation and Detection with Hybrid Object Detection Networks for
Automotive Radar. In 2022 IEEE 25th International Conference on In-
telligent Transportation Systems (ITSC), pages 111–117, 2022. (Cited in
pages 60 and 61.)
[Vaswani 2017a] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit,
Llion Jones, Aidan N. Gomez, Łukasz Kaiser and Illia Polosukhin. Attention
is All you Need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach,
R. Fergus, S. Vishwanathan and R. Garnett, editors, Advances in Neural
Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
(Cited in page 37.)
[Vaswani 2017b] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit,
Llion Jones, Aidan N. Gomez, Łukasz Kaiser and Illia Polosukhin. Attention
is All You Need. In Proceedings of the 31st International Conference on Neu-
ral Information Processing Systems, NIPS’17, page 6000–6010, Red Hook,
NY, USA, 2017. Curran Associates Inc. (Cited in page 67.)
[Ventura 2019] Carles Ventura, Miriam Bellver, Andreu Girbau, Amaia Salvador,
Ferran Marques and Xavier Giro-i Nieto. RVOS: End-To-End Recurrent
Network for Video Object Segmentation. In 2019 IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR), pages 5272–5281, 2019.
(Cited in pages 95 and 96.)
[Vincent 2008] Pascal Vincent, Hugo Larochelle, Yoshua Bengio and Pierre-Antoine
Manzagol. Extracting and composing robust features with denoising
autoencoders. In William W. Cohen, Andrew McCallum and Sam T. Roweis,
editors, Machine Learning, Proceedings of the Twenty-Fifth International
Conference (ICML 2008), Helsinki, Finland, June 5-9, 2008, volume 307 of
ACM International Conference Proceeding Series, pages 1096–1103. ACM,
2008. (Cited in pages 117 and 121.)
[Wang 2015] Xiaolong Wang and Abhinav Gupta. Unsupervised Learning of Visual
Representations Using Videos. In 2015 IEEE International Conference on
Computer Vision (ICCV), pages 2794–2802, 2015. (Cited in page 117.)
[Wang 2021a] Xinlong Wang, Rufeng Zhang, Chunhua Shen, Tao Kong and Lei
Li. Dense Contrastive Learning for Self-Supervised Visual Pre-Training. In
2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), pages 3023–3032, 2021. (Cited in page 123.)
[Wang 2021b] Yizhou Wang, Zhongyu Jiang, Xiangyu Gao, Jenq-Neng Hwang,
Guanbin Xing and Hui Liu. RODNet: Radar Object Detection using Cross-
Modal Supervision. In Proceedings of the IEEE/CVF Winter Conference on
Applications of Computer Vision (WACV), 2021.
[Wang 2021c] Yizhou Wang, Gaoang Wang, Hung-Min Hsu, Hui Liu and Jenq-
Neng Hwang. Rethinking of Radar’s Role: A Camera-Radar Dataset
and Systematic Annotator via Coordinate Alignment. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR) Workshops, pages 2815–2824, June 2021. (Cited in pages 3, 7, 9,
55, 67, 74, 95, 102, 115, and 137.)
[Wei 2021] Fangyun Wei, Yue Gao, Zhirong Wu, Han Hu and Stephen Lin. Aligning
Pretraining for Detection via Object-Level Contrastive Learning. In M. Ran-
zato, A. Beygelzimer, Y. Dauphin, P.S. Liang and J. Wortman Vaughan,
editors, Advances in Neural Information Processing Systems, volume 34,
pages 22682–22694. Curran Associates, Inc., 2021. (Cited in pages 9, 116,
123, 124, 126, 128, 132, 133, and 138.)
[Woo 2023] Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen,
Zhuang Liu, In So Kweon and Saining Xie. ConvNeXt V2: Co-Designing
and Scaling ConvNets With Masked Autoencoders. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), pages 16133–16142, June 2023. (Cited in pages 122, 124, and 125.)
[Xiao 2018] Fanyi Xiao and Yong Jae Lee. Video Object Detection with an Aligned
Spatial-Temporal Memory. In Vittorio Ferrari, Martial Hebert, Cristian
Sminchisescu and Yair Weiss, editors, Computer Vision – ECCV 2018, pages
494–510, Cham, 2018. Springer International Publishing. (Cited in page 95.)
[Xie 2017] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu and Kaiming He.
Aggregated Residual Transformations for Deep Neural Networks. In 2017
IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
pages 5987–5995, 2017. (Cited in page 47.)
[Xie 2022] Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Jianmin Bao, Zhuliang
Yao, Qi Dai and Han Hu. SimMIM: a Simple Framework for Masked Image
Modeling. In 2022 IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), pages 9643–9653, 2022. (Cited in pages 121, 124,
and 126.)
[Xiong 2022] Weiyi Xiong, Jianan Liu, Yuxuan Xia, Tao Huang, Bing Zhu and
Wei Xiang. Contrastive Learning for Automotive mmWave Radar Detection
Points Based Instance Segmentation. In 2022 IEEE 25th International Con-
ference on Intelligent Transportation Systems (ITSC), pages 1255–1261,
2022. (Cited in page 60.)
[Xu 2021] Baowei Xu, Xinyu Zhang, Li Wang, Xiaomei Hu, Zhiwei Li, Shuyue
Pan, Jun Li and Yongqiang Deng. RPFA-Net: a 4D RaDAR Pillar Feature
Attention Network for 3D Object Detection. In 2021 IEEE International
Intelligent Transportation Systems Conference (ITSC), pages 3061–3066,
2021. (Cited in page 60.)
[Xu 2022] Li Xu, Yueqi Li and Jin Li. Improved Regularization of Convolutional
Neural Networks with Point Mask. In Xingming Sun, Xiaorui Zhang, Zhihua
Xia and Elisa Bertino, editors, Advances in Artificial Intelligence and Secu-
rity, pages 16–25, Cham, 2022. Springer International Publishing. (Cited in
page 70.)
[Yang 2021] Ceyuan Yang, Zhirong Wu, Bolei Zhou and Stephen Lin. Instance
Localization for Self-supervised Detection Pretraining. In 2021 IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), pages
3986–3995, 2021. (Cited in page 123.)
[Yang 2023] Bo Yang, Ishan Khatri, Michael Happold and Chulong Chen.
ADCNet: End-to-end perception with raw radar ADC data. arXiv preprint
arXiv:2303.11420, 2023. (Cited in page 69.)
[Yu 2022] Ye Yu, Jialin Yuan, Gaurav Mittal, Li Fuxin and Mei Chen. BATMAN:
Bilateral Attention Transformer in Motion-Appearance Neighboring Space
for Video Object Segmentation. In Shai Avidan, Gabriel Brostow,
Moustapha Cissé, Giovanni Maria Farinella and Tal Hassner, editors, Com-
puter Vision – ECCV 2022, pages 612–629, Cham, 2022. Springer Nature
Switzerland. (Cited in page 95.)
[Yun 2019] Sangdoo Yun, Dongyoon Han, Sanghyuk Chun, Seong Joon Oh,
Youngjoon Yoo and Junsuk Choe. CutMix: Regularization Strategy to
Train Strong Classifiers With Localizable Features. In 2019 IEEE/CVF In-
ternational Conference on Computer Vision (ICCV), pages 6022–6031, 2019.
(Cited in page 70.)
[Yun 2020] Sangdoo Yun, Seong Joon Oh, Byeongho Heo, Dongyoon Han and
Jinhyung Kim. VideoMix: Rethinking data augmentation for video
classification. arXiv preprint arXiv:2012.03457, 2020. (Cited in page 71.)
[Yun 2022] Sukmin Yun, Hankook Lee, Jaehyung Kim and Jinwoo Shin. Patch-level
Representation Learning for Self-supervised Vision Transformers. In 2022
IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), pages 8344–8353, 2022. (Cited in page 123.)
[Zbontar 2021] Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun and Stephane
Deny. Barlow Twins: Self-Supervised Learning via Redundancy Reduction.
In Marina Meila and Tong Zhang, editors, Proceedings of the 38th Inter-
national Conference on Machine Learning, volume 139 of Proceedings of
Machine Learning Research, pages 12310–12320. PMLR, 18–24 Jul 2021.
(Cited in page 120.)
[Zeiler 2014] Matthew D. Zeiler and Rob Fergus. Visualizing and Understanding
Convolutional Networks. In David Fleet, Tomas Pajdla, Bernt Schiele and
Tinne Tuytelaars, editors, Computer Vision – ECCV 2014, pages 818–833,
Cham, 2014. Springer International Publishing. (Cited in page 46.)
[Zhang 2018] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin and David Lopez-
Paz. mixup: Beyond Empirical Risk Minimization. In International Confer-
ence on Learning Representations, 2018. (Cited in page 70.)
[Zhang 2019] Chen Zhang and Joohee Kim. Modeling Long-and Short-Term
Temporal Context for Video Object Detection. In 2019 IEEE international
conference on image processing (ICIP), pages 71–75. IEEE, 2019. (Cited in
pages 87, 88, 95, and 124.)
[Zhang 2021] Ao Zhang, Farzan Erlik Nowruzi and Robert Laganiere. RADDet:
Range-Azimuth-Doppler based Radar Object Detection for Dynamic Road
Users. In 2021 18th Conference on Robots and Vision (CRV), pages 95–102,
2021. (Cited in pages 3, 5, 6, 9, 55, 64, 65, 66, 67, 74, 78, 79, 80, 88, 89, 96,
115, 138, and 139.)
[Zhao 2021] Nanxuan Zhao, Zhirong Wu, Rynson W.H. Lau and Stephen Lin.
Distilling Localization for Self-Supervised Representation Learning. Pro-
ceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 12,
pages 10990–10998, May 2021. (Cited in pages 122 and 123.)
[Zhao 2023] Peijun Zhao, Chris Xiaoxuan Lu, Bing Wang, Niki Trigoni and Andrew
Markham. CubeLearn: End-to-end Learning for Human Motion Recognition
from Raw mmWave Radar Signals. IEEE Internet of Things Journal, pages
1–1, 2023. (Cited in pages 69 and 140.)
[Zheng 2021] Zangwei Zheng, Xiangyu Yue, Kurt Keutzer and Alberto San-
giovanni Vincentelli. Scene-Aware Learning Network for Radar Object
Detection. In Proceedings of the 2021 International Conference on Mul-
timedia Retrieval, ICMR ’21, page 573–579, New York, NY, USA, 2021.
Association for Computing Machinery. (Cited in pages 67, 70, 71, and 96.)
[Zhou 2018] Yin Zhou and Oncel Tuzel. VoxelNet: End-to-End Learning for Point
Cloud Based 3D Object Detection. In 2018 IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition, pages 4490–4499, 2018. (Cited in
page 60.)
[Zhou 2019] Xingyi Zhou, Dequan Wang and Philipp Krähenbühl. Objects as
Points. ArXiv, vol. abs/1904.07850, 2019. (Cited in pages 45, 48, and 67.)
[Zhu 2017] Xizhou Zhu, Yujie Wang, Jifeng Dai, Lu Yuan and Yichen Wei.
Flow-Guided Feature Aggregation for Video Object Detection. In 2017 IEEE
International Conference on Computer Vision (ICCV), pages 408–417, 2017.
(Cited in page 95.)
[Zhu 2018] Menglong Zhu and Mason Liu. Mobile Video Object Detection with
Temporally-Aware Feature Maps. In 2018 IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition, pages 5686–5695, 2018. (Cited in
pages 7, 95, 98, 99, and 107.)
Titre : Extraction et identification de cibles multiples pour radar automobile à l'aide d'intelligence artificielle
Mots clés : Radar FMCW, Intelligence artificielle, Détection, Identification, Suivi
Résumé : Ces dernières années, l'apparition de véhicules de plus en plus connectés a ouvert la voie à des modes de transport plus sûrs et plus
autonomes. Ces véhicules s'appuient sur des systèmes avancés d'aide à la conduite (ADAS) et utilisent divers capteurs comme le radar, la caméra, le
LiDAR et le V2X pour créer un cocon de sécurité à 360° autour du véhicule. Si l'intelligence artificielle et l'apprentissage profond ont permis la
détection et l'identification d'objets en temps réel à l'aide de caméras et de LiDAR, l'utilisation de ces algorithmes sur des données radar est encore
limitée. Pourtant, les radars présentent des avantages, notamment celui de fonctionner dans des conditions météorologiques difficiles et d'offrir de
bonnes performances en termes de résolution en distance, en angle et en vitesse, à un coût inférieur à celui du LiDAR. Cependant, les données
renvoyées par les radars actuels contiennent peu d'informations concernant les cibles détectées, et plusieurs étapes de pré-traitement et de post-
traitement sont nécessaires pour les obtenir. Ces étapes de traitement dénaturent le signal brut réfléchi par les objets, ce qui peut affecter les
performances des algorithmes d'intelligence artificielle. Ce doctorat vise à développer de nouveaux algorithmes d'apprentissage profond
spécifiquement adaptés aux données radar, destinés à être incorporés dans des systèmes automobiles. Ces algorithmes auront pour but de détecter
et d'identifier les objets autour d'un véhicule dans des environnements complexes. Outre les algorithmes, cette thèse étudiera quels types de données
radar, et donc quelle quantité de pré-traitement, permettent d'obtenir les meilleures performances. Les algorithmes proposés dans cette thèse
devront satisfaire aux contraintes des environnements automobiles : faible puissance, faible complexité et temps de réaction rapide.
Title: Multiple target extraction and identification for automotive radar with A.I.
Key words: FMCW Radar, Artificial intelligence, Detection, Classification, Tracking
Abstract: In recent years, connected vehicles have paved the way for safer and more automated transportation systems. These vehicles rely heavily
on Advanced Driving Assistance Systems (ADAS) and use various sensors such as radar, camera, LiDAR and V2X to create a 360° safety cocoon
around the vehicle. While artificial intelligence and deep learning have enabled real-time object detection and identification using cameras and LiDAR,
the use of such algorithms on radar data is still limited. Yet radar sensors offer advantages, such as operating in challenging weather conditions and
providing good range, angular and velocity resolution at a lower cost than LiDAR. However, current radars output relatively little information about
the detected targets, and several pre- and post-processing steps are required to obtain it. Because these processing steps filter the raw signal
reflected by objects, they can degrade the performance of AI algorithms. This PhD aims to develop new deep learning algorithms explicitly tailored
to raw radar data, intended to be integrated into automotive systems. These algorithms will detect and identify objects around the vehicle in complex
environments. Additionally, this thesis will explore which types of radar data, and thus how much pre-processing, yield the best performance. The
algorithms will have to meet automotive constraints, including low power consumption, low complexity and fast response times.