MegaLoc
1. Introduction

This paper tackles the task of retrieving images from a large database that represent the same place as a given query image. But what does it mean for two images to be “from the same place”? Depending on who you ask, you’ll get different answers:
1. Landmark Retrieval (LR) folks will tell you that two photos are from the same place if they depict the same landmark, regardless of how close to each other the two photos were taken [40];
2. Visual Place Recognition (VPR) people set a camera pose distance of 25 meters to define whether two images are positives (i.e. from the same place) [4];
3. Visual Localization (VL) / 3D Vision researchers will tell you that two images need to have their poses as close as possible to be considered the same place.

Even though image retrieval is a core component in all three tasks, their different definitions and requirements have inevitably led to the development of ad-hoc image retrieval solutions for each of them. As these three tasks continued to diverge, over the years papers have avoided showing results of their methods on more than one of these tasks: VPR papers don’t show results on LR, and LR papers don’t show results on VPR. In the meantime, 3D vision pipelines like COLMAP [30], Hierarchical Localization [28] and GLOMAP [22] keep using outdated retrieval methods, like RootSIFT with bag-of-words [3, 10, 32] and NetVLAD [4]. In this paper we aim to put an end to this, by training a single model that achieves SOTA (or almost) on all of these tasks, showcasing robustness across diverse domains. To train this model we do not propose any “technical novelty”; instead, we use the lessons learned from all three tasks, putting together a combination of good samplers, datasets, and general training techniques.
“Why does it matter?”, you may ask. Imagine you are doing 3D reconstruction, where image retrieval is a fundamental component, on a collection of diverse scenes (e.g. to create datasets like MegaDepth [18] and MegaScenes [37], or for the evergreen Image Matching Challenge [6]). In some cases there would be small scenes (e.g. the reconstruction of a fountain), requiring a retrieval model that is able to retrieve nearby images (a few meters away), which is something VPR models excel at but LR models underperform on (see Tab. 14 of [8]). In other cases the scene might be large (e.g. a big landmark like a church), with images hundreds of meters apart: while LR models are designed for this, VPR models achieve poor results in these situations (see Sec. 3). Given these considerations, we note how neither VPR nor LR provides models for the diverse cases of 3D reconstruction, creating a gap in the literature that is filled by MegaLoc. As another example where a model like MegaLoc is necessary, consider Visual Place Recognition (which is also the first step of Visual Localization), where models are evaluated using a 25 meter threshold (and queries in popular datasets always have at least one positive within 25 meters). However, in the real world the nearest image to a given query might be 100 meters away, and while ideally we would still want to retrieve it, a VPR model is unlikely to work in such a case, as it has been trained to ignore anything further away from the camera.
In this paper we demonstrate that, by leveraging a diverse set of data sources and best practices from LR, VPR and VL, we obtain a single image retrieval model that works well across all these tasks. Our model is called MegaLoc and it is released at https://github.com/gmberton/MegaLoc.
2. Method

The core idea of this paper is to fuse data from multiple datasets and train a single model. We use five datasets, containing both outdoor and indoor images and catering to different image localization tasks: GSV-Cities [1], Mapillary Street-Level Sequences (MSLS) [39], MegaScenes [37], ScanNet [13] and San Francisco eXtra Large (SF-XL) [7]. At each training iteration we extract six sub-batches of data, one for each dataset (except SF-XL, from which two sub-batches are sampled), and use a multi-similarity loss [38] computed over each sub-batch. Each sub-batch is made of 128 images, containing 4 images (called quadruplets) from each of 32 different places/classes. Given that these datasets have diverse formats, they require different sampling techniques. In the following paragraphs we explain how data is sampled from each dataset.
San Francisco eXtra Large (SF-XL) is sampled using the EigenPlaces [9] method. This method assures that each class contains images that represent a given place from diverse perspectives, while ensuring that no visual overlap exists between two different places. EigenPlaces provides two sub-batches, one made of frontal-facing images (i.e. with the camera facing straight along the street) and one of lateral-facing images.
Google Street View Cities (GSV-Cities) is a dataset of 530k images split into 62k places/classes from 40 cities, where each class contains at least 4 images with the same orientation and is at least 100 meters from any other class. Given that GSV-Cities is already split into non-overlapping classes, it is not strictly necessary to apply a particular sampling technique. We therefore directly feed the GSV-Cities dataset to the multi-similarity loss, as in the original GSV-Cities paper [1].
Mapillary Street-Level Sequences (MSLS) is a dataset of 1.6M images split into contiguous sequences, across 30 different cities over 9 years. To sample data from the MSLS dataset, we use the mining technique described in the CliqueMining paper [33]. This method ensures that the places selected for each batch are visually similar but geographically different (i.e. hard negatives), so that the loss can be as high as possible and effectively teach the model to disambiguate between similar-looking places.
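Purely to illustrate this idea (and not the actual CliqueMining algorithm, for which we refer to [33]), a toy sampler could greedily pick places whose descriptors are mutually similar while their positions are far apart; the descriptor matrix, metric coordinates and thresholds below are placeholder inputs.

```python
import numpy as np

def sample_hard_negative_places(descriptors, coords_m, batch_places=32,
                                min_geo_dist=100.0):
    """Greedy toy sampler: pick places that look alike but are far apart.

    descriptors: (P, D) L2-normalized descriptor per place (placeholder input).
    coords_m:    (P, 2) metric coordinates (e.g. UTM) per place.
    """
    sims = descriptors @ descriptors.T          # cosine similarity between places
    seed = np.random.randint(len(descriptors))  # start from a random place
    chosen = [seed]
    # Rank the remaining places by visual similarity to the seed place.
    for cand in np.argsort(-sims[seed]):
        if len(chosen) == batch_places:
            break
        if cand in chosen:
            continue
        # Keep only candidates that are geographically far from all chosen places,
        # so that visually similar pairs in the batch are true hard negatives.
        dists = np.linalg.norm(coords_m[chosen] - coords_m[cand], axis=1)
        if dists.min() >= min_geo_dist:
            chosen.append(int(cand))
    return chosen  # indices of places forming one hard-negative sub-batch
```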
MegaScenes is a collection of 100k 3D structure-from-motion reconstructions, composed of 2M images from Wikimedia Commons. Simply using each reconstruction as a class, and sampling random images from such a class, could lead to images that do not have any visual overlap: e.g. two images could show opposite facades of a building, therefore having no visual overlap while belonging to the same 3D reconstruction. Therefore we make sure that, when we sample a set of four images from a given reconstruction, each of these four images has visual overlap with each of the others (we define visual overlap as having at least 1% of 3D points in common in the 3D reconstruction).
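As a rough sketch of this quadruplet selection (the covis_fraction lookup is a hypothetical helper standing in for a query into the SfM model), one could rejection-sample quadruplets until every pair clears the 1% co-visibility threshold.

```python
import itertools
import random

def sample_covisible_quadruplet(image_ids, covis_fraction, min_covis=0.01,
                                max_tries=1000):
    """Pick 4 images of one reconstruction such that every pair co-observes
    at least `min_covis` of the reconstruction's 3D points.

    covis_fraction(a, b) -> float is a placeholder lookup into the SfM model.
    """
    for _ in range(max_tries):
        quad = random.sample(image_ids, 4)
        # Check the co-visibility constraint for all 6 pairs in the quadruplet.
        if all(covis_fraction(a, b) >= min_covis
               for a, b in itertools.combinations(quad, 2)):
            return quad
    return None  # no valid quadruplet found (e.g. a very sparse reconstruction)
```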
ScanNet is a dataset of 2.5M views from 1500 scans of 707 indoor places. To train on ScanNet we use each scene as a class, and select quadruplets so that each pair of images within a quadruplet has visual overlap (i.e. the cameras are less than 10 meters and 30° apart); simultaneously, we ensure that no two images from different quadruplets have visual overlap.
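A minimal sketch of this pose-based overlap test, assuming each view comes with a 3D camera center and a unit viewing direction (the exact pose parametrization is an assumption, not specified above):

```python
import numpy as np

def has_visual_overlap(center_a, dir_a, center_b, dir_b,
                       max_dist_m=10.0, max_angle_deg=30.0):
    """Proxy for visual overlap between two ScanNet views: the cameras must be
    close in space and looking in a similar direction."""
    dist = np.linalg.norm(np.asarray(center_a) - np.asarray(center_b))
    cos_angle = np.clip(np.dot(dir_a, dir_b), -1.0, 1.0)  # unit direction vectors
    angle = np.degrees(np.arccos(cos_angle))
    return dist < max_dist_m and angle < max_angle_deg
```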
 Method             Desc.   Baidu [34]    Eynsham [8, 12]   MSLS val [39]   Pitts250k [4, 14]   Pitts30k [4, 14]   SF-XL v1 [7]   SF-XL v2 [7]   SF-XL night [5]   SF-XL occlusion [5]   Tokyo 24/7 [36]
                    Dim.    R1     R10    R1      R10       R1      R10     R1       R10        R1       R10       R1      R10    R1      R10    R1       R10      R1        R10          R1      R10
 NetVLAD [4]          4096   69.0   95.0   77.7    90.5          54.5   70.4      85.9      95.0      85.0     94.4         40.1   57.7       76.9   91.1     6.7     14.2      9.2      22.4           69.8    82.9
 AP-GeM [27]          2048   59.8   90.8   68.3    84.0          56.0   72.9      80.0      93.5      80.7     94.1         37.9   54.1       66.4   84.6     7.5     16.7      5.3      14.5           57.5    77.5
 CosPlace [7]         2048   52.0   80.4   90.0    94.9          85.0   92.6      92.3      98.4      90.9     96.7         76.6   85.5       88.8   96.8    23.6     32.8     30.3      44.7           87.3    95.6
 MixVPR [2]           4096   71.9   94.7   89.6    94.4          83.2   91.9      94.3      98.9      91.6     96.4         72.5   80.9       88.6   95.0    19.5     30.5     30.3      38.2           87.0    94.0
 EigenPlaces [9]      2048   69.1   91.9   90.7    95.4          85.9   93.1      94.1      98.7      92.5     97.6         84.0   90.7       90.8   96.7    23.6     34.5     32.9      52.6           93.0    97.5
 AnyLoc [17]         49152   75.6   95.2   85.0    94.1          58.7   74.5      89.4      98.0      86.3     96.7          -      -          -      -        -       -         -        -             87.6    97.5
 Salad [16]          8448    72.7   93.6   91.6    95.9          88.2   95.0      95.0      99.2      92.3     97.4         88.7   94.4       94.6   98.2    46.1     62.4     50.0      68.4           94.6    98.1
 CricaVPR [20]       10752   65.6   93.2   88.0    94.3          76.7   87.2      92.6      98.3      90.0     96.7         62.6   78.9       86.3   96.0    25.8     40.6     27.6      47.4           82.9    93.7
 CliqueMining [33]    8448   72.9   92.7   91.9    96.2          91.6   95.9      95.3      99.2      92.6     97.8         85.5   92.6       94.5   98.3    46.1     60.9     44.7      64.5           96.8    97.8
 MegaLoc (Ours)      8448    87.7   98.0   92.6    96.8          91.0   95.8      96.4      99.3      94.1     98.2         95.3   98.0       94.8   98.5    52.8     73.8     51.3      75.0           96.5    99.4
Table 1. Recall@1 and Recall@10 on multiple VPR datasets. Best overall results on each dataset are in bold, second best results
underlined. Results marked with a “-” did not fit in 480GB of RAM (2.8M features of 49k dimensions require 560GB for a float32-based
kNN).
 Method                       CAB (Phone)           HGE (Phone)           LIN (Phone)           CAB (HoloLens)        HGE (HoloLens)        LIN (HoloLens)
                              (1°, 0.1m) (5°, 1m)   (1°, 0.1m) (5°, 1m)   (1°, 0.1m) (5°, 1m)   (1°, 0.1m) (5°, 1m)   (1°, 0.1m) (5°, 1m)   (1°, 0.1m) (5°, 1m)
 NetVLAD                                    43.4          54.0             54.8          80.0          74.4              87.8              63.1      81.4            57.9      71.6              76.1          83.0
 AP-GeM                                     39.4          52.0             58.0          81.3          69.1              82.0              62.9      82.5            65.6      76.6              80.7          91.1
 Fusion (NetVLAD+AP-GeM)                    41.4          53.8             56.3          82.4          76.0              89.4              63.2      83.1            63.1      75.1              78.5          87.0
 CosPlace                                   29.0          37.4             54.4          81.3          63.3              75.7              56.4      77.8            55.6      69.8              80.6          91.4
 MixVPR                                     40.9          50.8             59.2          83.8          77.5              89.8              65.2      84.7            63.3      74.7              83.6          92.2
 EigenPlaces                                32.3          44.7             56.3          81.3          70.2              82.6              63.9      81.8            60.2      72.5              84.8          93.1
 AnyLoc                                     48.0          59.8             58.8          83.0          77.2              92.4              69.7      88.5            70.1      81.0              81.4          90.4
 Salad                                      44.2          55.6             65.3          92.2          81.7              94.0              71.5      90.7            75.3      85.2              91.3          99.4
 CricaVPR                                   40.4          52.0             63.7          89.3          80.7              93.1              73.9      90.7            72.5      81.6              89.1          98.4
 CliqueMining                               44.2          55.6             66.0          91.4          80.5              93.1              74.2      90.9            77.3      86.3              92.0          98.8
 MegaLoc (Ours)                             47.0          60.4             67.2          92.9          83.3              94.9              77.4      93.4            72.9      83.5              92.2          99.0
Table 2. Results on LaMAR’s datasets, computed on each of the three locations, for both types of queries (HoloLens and Phone), which include both indoor and outdoor imagery. For each location we report the recall at (1°, 10cm) and (5°, 1m), following the LaMAR paper [29].
We use RandAugment [11] for data augmentation, as in [1], and AdamW [19] as optimizer. Training is performed for 40k iterations. The loss is simply computed as $L = L_1 + L_2 + L_3 + L_4 + L_5 + L_6$, where each $L_n$ is the multi-similarity loss computed on one of the six sub-batches; computing the loss one sub-batch at a time reduces the VRAM requirement of training MegaLoc from (roughly) 300GB to 60GB.
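A minimal sketch of this combined objective, using the MultiSimilarityLoss from the pytorch-metric-learning package (hyper-parameters left at their defaults, which are not specified here) and placeholder sub-batches; running the backward pass per sub-batch, as below, is one way to keep peak memory low.

```python
from pytorch_metric_learning.losses import MultiSimilarityLoss

criterion = MultiSimilarityLoss()  # default hyper-parameters; the paper's values are not assumed

def training_step(model, optimizer, sub_batches):
    """sub_batches: list of 6 (images, labels) pairs, each with 128 images
    from 32 places (4 images per place)."""
    optimizer.zero_grad()
    total = 0.0
    for images, labels in sub_batches:
        descriptors = model(images)              # (128, 8448) L2-normalized descriptors
        loss_n = criterion(descriptors, labels)  # multi-similarity loss on this sub-batch
        loss_n.backward()                        # accumulate gradients, one sub-batch at a time
        total += loss_n.item()
    optimizer.step()
    return total                                 # L = L1 + ... + L6
```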
The architecture consists of a DINO-v2-base backbone [21] followed by a SALAD [16] aggregation layer, which has shown state-of-the-art performance over multiple VPR datasets [16, 33]. The SALAD layer is computed with 64 clusters, 256 channels per cluster, a global token of 256 and an MLP dimension of 512. The SALAD layer is followed by a linear projection (from a dimension of 16640 to 8448) and an L2 normalization.
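The following is only a shape-level sketch of this head (the SALAD optimal-transport aggregation is replaced by a simple soft-assignment placeholder); it reproduces the 64 × 256 + 256 = 16640 → 8448 dimensionality described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MegaLocHeadSketch(nn.Module):
    """Shape-level sketch: 64 clusters x 256 channels + a 256-d global token
    = 16640 dims, projected to 8448 and L2-normalized."""
    def __init__(self, in_dim=768, clusters=64, channels=256, token_dim=256, out_dim=8448):
        super().__init__()
        self.to_channels = nn.Linear(in_dim, channels)   # per-patch feature reduction
        self.to_clusters = nn.Linear(in_dim, clusters)   # per-patch cluster scores (placeholder for SALAD's assignment)
        self.to_token = nn.Linear(in_dim, token_dim)     # global token from the CLS feature
        self.proj = nn.Linear(clusters * channels + token_dim, out_dim)

    def forward(self, patch_feats, cls_feat):
        # patch_feats: (B, N, 768) DINOv2 patch tokens, cls_feat: (B, 768)
        assign = self.to_clusters(patch_feats).softmax(dim=1)          # (B, N, 64)
        feats = self.to_channels(patch_feats)                          # (B, N, 256)
        agg = torch.einsum("bnk,bnc->bkc", assign, feats).flatten(1)   # (B, 16384)
        desc = torch.cat([agg, self.to_token(cls_feat)], dim=1)        # (B, 16640)
        return F.normalize(self.proj(desc), dim=1)                     # (B, 8448)

# Example with 256 patch tokens from a DINOv2-base backbone
head = MegaLocHeadSketch()
print(head(torch.randn(2, 256, 768), torch.randn(2, 768)).shape)  # torch.Size([2, 8448])
```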
3.2. Results

We perform experiments on three different types of tasks:
• Visual Place Recognition, where the task is to retrieve images that are within 25 meters from the query (Sec. 3.2.1; the evaluation metric is sketched below);
• Visual Localization, where retrieval is part of a bigger pipeline that aims at finding the precise pose of the query given a set of posed images (Sec. 3.2.2);
• Landmark Retrieval, i.e. retrieving images that depict the same landmark as the query (Sec. 3.2.3).
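For reference, a sketch of the Recall@N metric used for the VPR experiments (Tab. 1), with the standard 25 meter positive threshold and a brute-force similarity search; array names and the metric coordinate frame are assumptions.

```python
import numpy as np

def recall_at_n(q_desc, db_desc, q_pos_m, db_pos_m, ns=(1, 10), threshold_m=25.0):
    """Recall@N for VPR: a query is correct if any of its top-N retrieved
    database images lies within `threshold_m` of the query position.

    Descriptors are assumed L2-normalized; positions are metric (e.g. UTM) arrays.
    """
    sims = q_desc @ db_desc.T                          # (Q, DB) cosine similarities
    top = np.argsort(-sims, axis=1)[:, :max(ns)]       # indices of the top-N database images
    dists = np.linalg.norm(db_pos_m[top] - q_pos_m[:, None, :], axis=-1)  # (Q, max N)
    return {n: float((dists[:, :n] < threshold_m).any(axis=1).mean()) for n in ns}
```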
Figure 2. Failure cases, grouped into 4 categories. Each of the 4 columns represents a category of failure cases: for each category we show 5 examples, each made of 3 images, namely the query and its top-2 predictions with MegaLoc, which are shown in red or green depending on whether the prediction is correct (i.e. within 25 meters). The 4 categories that we identified are (1) very difficult cases, which are unlikely to be solved any time soon; (2) difficult cases, which can probably be solved by slightly better models than the current ones or by simple post-processing; (3) incorrect GPS labels, which, surprisingly, exist also in Mapillary and Google StreetView data; (4) predictions just outside the 25m threshold, which despite being considered negatives in VPR, are actually useful predictions for real-world applications.
 Method            R-Oxford              R-Paris
                   E      M      H       E      M      H
 NetVLAD           24.1   16.1   4.7     61.2   46.3   22.0
 AP-GeM            49.6   37.6   19.3    82.5   69.5   45.5
 CosPlace          32.1   23.4   10.3    57.6   45.0   22.3
 MixVPR            38.2   28.4   10.8    61.9   48.3   25.0
 EigenPlaces       29.4   22.9   11.8    60.9   47.3   23.6
 AnyLoc            64.2   45.5   18.9    82.8   68.5   48.8
 Salad             55.2   42.3   21.4    76.6   66.2   44.8
 CricaVPR          57.0   39.2   15.3    80.0   68.9   48.9
 CliqueMining      52.2   41.0   22.1    71.8   60.5   41.2
 MegaLoc (Ours)    91.0   79.0   62.1    95.3   89.6   77.1

Table 3. Results on Landmark Retrieval datasets, namely Revisited Oxford 5k [24, 26] and Revisited Paris 6k [25, 26], on the Easy (E), Medium (M) and Hard (H) splits.
3.2.2. Visual Localization

Image retrieval is a core tool to solve 3D vision tasks, in pipelines like visual localization (e.g. Hierarchical Localization [28] and InLoc [35]) and 3D reconstruction (e.g. COLMAP [30, 31] and GLOMAP [22]). To understand if our method can help this use case, we compute results on the three datasets of LaMAR [29], which comprise various challenges, including plenty of visual aliasing from both indoor and outdoor imagery. To do this, we relied on the official LaMAR codebase¹, simply replacing the retrieval method. Results are reported in Tab. 2.

¹ https://github.com/microsoft/lamar-benchmark

3.2.3. Landmark Retrieval

For the task of Landmark Retrieval we compute results on the most used datasets in the literature, namely the revisited versions [26] of Oxford5k [24] and Paris6k [25]. To do this we relied on the official codebase for the datasets², simply swapping the retrieval method. Results, reported in Tab. 3, show a large gap between MegaLoc and previous VPR models on this task, which can be simply explained by the fact that previous models were only optimized for the standard VPR metric of retrieving images within 25 meters from the query.

² https://github.com/filipradenovic/revisitop

3.2.4. Failure Cases

We identified 4 main categories of “failure cases” that prevent the results from reaching 100% recall, and we present them in Fig. 2. We note however that, from a practical perspective, the only real failure cases are those depicted in the second category/column of Fig. 2; furthermore, in most similar cases SOTA models (i.e. not only MegaLoc, but also other recent ones) can actually retrieve precise predictions, meaning that these failure cases can likely be solved by some simple post-processing techniques (e.g. re-ranking with image matchers, or majority voting).
Finally, another failure case that we noted is when database images do not properly cover the search area: this is very common in the Mapillary (MSLS) dataset, where database images only show one direction (e.g. photos along a road taken from north to south), while the queries are photos facing the other direction. We note however that, in the real world, this can be easily solved by collecting database images in multiple directions, which is also common in most test datasets, like Eynsham, Pitts30k, Tokyo 24/7 and SF-XL.
4. Conclusion and limitations
So, is image retrieval for localization solved? Well, almost. While some datasets still show some room for improvement, we note that this is often due to arguably unsolvable failure cases, wrong labels, or a very small number of cases that could be solved by better models. We emphasize however that this has been the case for some time, as previous DINO-v2-based models, like SALAD and CliqueMining, already show very high results on classic VPR datasets. What has still been missing from the literature are models like MegaLoc that achieve good results across a variety of diverse tasks and domains.

Should you always use MegaLoc? Well, almost: there are at least 3 use-cases where it is not the best choice. MegaLoc has shown great results on a variety of related tasks and, unlike other VPR models, achieves good results on landmark retrieval, which makes it a great option also for retrieval in 3D reconstruction tasks, besides standard VPR and visual localization tasks. However, experiments show that MegaLoc is outperformed by CliqueMining on MSLS, which is a dataset made (almost entirely) of forward-facing images (i.e. photos where the camera faces in the same direction as the street, instead of facing sideways towards the side of the street). Another use case where MegaLoc is likely to be suboptimal is very unusual natural environments, like forests or caves, where AnyLoc has instead been shown to work well [17]. A third and final use case where other models might be preferred to MegaLoc is embedded systems, where one might opt for more lightweight models, like the ResNet-18 [15] versions of CosPlace [7], which have 11M parameters instead of MegaLoc's 228M.
References

[1] Amar Ali-bey, Brahim Chaib-draa, and Philippe Giguère. GSV-Cities: Toward appropriate supervised visual place recognition. Neurocomputing, 513:194–203, 2022.
[2] Amar Ali-bey, Brahim Chaib-draa, and Philippe Giguère. MixVPR: Feature mixing for visual place recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2998–3007, 2023.
[3] R. Arandjelović and Andrew Zisserman. Three things everyone should know to improve object retrieval. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2911–2918, 2012.
[4] Relja Arandjelović, Petr Gronat, Akihiko Torii, Tomas Pajdla, and Josef Sivic. NetVLAD: CNN architecture for weakly supervised place recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6):1437–1451, 2018.
[5] Giovanni Barbarani, Mohamad Mostafa, Hajali Bayramov, Gabriele Trivigno, Gabriele Berton, Carlo Masone, and Barbara Caputo. Are local features all you need for cross-domain visual place recognition? In CVPRW, pages 6155–6165, 2023.
[6] Fabio Bellavia, Jiri Matas, Dmytro Mishkin, Luca Morelli, Fabio Remondino, Weiwei Sun, Amy Tabb, Eduard Trulls, Kwang Moo Yi, Sohier Dane, and Ashley Chow. Image matching challenge 2024 - hexathlon. https://kaggle.com/competitions/image-matching-challenge-2024, 2024. Kaggle.
[7] Gabriele Berton, Carlo Masone, and Barbara Caputo. Rethinking visual geo-localization for large-scale applications. In IEEE Conference on Computer Vision and Pattern Recognition, pages 4868–4878, 2022.
[8] Gabriele Berton, Riccardo Mereu, Gabriele Trivigno, Carlo Masone, Gabriela Csurka, Torsten Sattler, and Barbara Caputo. Deep visual geo-localization benchmark, 2023.
[9] Gabriele Berton, Gabriele Trivigno, Barbara Caputo, and Carlo Masone. EigenPlaces: Training viewpoint robust models for visual place recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 11080–11090, 2023.
[10] Gabriela Csurka, Christopher Dance, Lixin Fan, Jutta Willamowski, and Cédric Bray. Visual categorization with bags of keypoints. In European Conference on Computer Vision, 2004.
[11] Ekin Dogus Cubuk, Barret Zoph, Jon Shlens, and Quoc Le. RandAugment: Practical automated data augmentation with a reduced search space. In Advances in Neural Information Processing Systems, pages 18613–18624. Curran Associates, Inc., 2020.
[12] M. Cummins and P. Newman. Highly scalable appearance-only SLAM - FAB-MAP 2.0. In Robotics: Science and Systems, 2009.
[13] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2017.
[14] Petr Gronát, Guillaume Obozinski, Josef Sivic, and Tomáš Pajdla. Learning and calibrating per-location classifiers for visual place recognition. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, pages 907–914, 2013.
[15] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[16] Sergio Izquierdo and Javier Civera. Optimal transport aggregation for visual place recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2024.
[17] Nikhil Keetha, Avneesh Mishra, Jay Karhade, Krishna Murthy Jatavallabhula, Sebastian Scherer, Madhava Krishna, and Sourav Garg. AnyLoc: Towards universal visual place recognition. arXiv, 2023.
[18] Zhengqi Li and Noah Snavely. MegaDepth: Learning single-view depth prediction from internet photos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2041–2050, 2018.
[19] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.
[20] Feng Lu, Xiangyuan Lan, Lijun Zhang, Dongmei Jiang, Yaowei Wang, and Chun Yuan. CricaVPR: Cross-image correlation-aware representation learning for visual place recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
[21] Maxime Oquab, Timothée Darcet, Theo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Russell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang-Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nicolas Ballas, Gabriel Synnaeve, Ishan Misra, Herve Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, and Piotr Bojanowski. DINOv2: Learning robust visual features without supervision, 2023.
[22] Linfei Pan, Daniel Barath, Marc Pollefeys, and Johannes Lutz Schönberger. Global Structure-from-Motion Revisited. In European Conference on Computer Vision (ECCV), 2024.
[23] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
[24] James Philbin, Ondrej Chum, Michael Isard, Josef Sivic, and Andrew Zisserman. Object retrieval with large vocabularies and fast spatial matching. In IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 2007.
[25] James Philbin, Ondrej Chum, Michael Isard, Josef Sivic, and Andrew Zisserman. Lost in quantization: Improving particular object retrieval in large scale image databases. In IEEE Conference on Computer Vision and Pattern Recognition, 2008.
[26] F. Radenović, A. Iscen, G. Tolias, Y. Avrithis, and O. Chum. Revisiting Oxford and Paris: Large-scale image retrieval benchmarking. In CVPR, 2018.
[27] Jérôme Revaud, Jon Almazán, R. S. Rezende, and César Roberto de Souza. Learning with average precision: Training image retrieval with a listwise loss. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 5106–5115, 2019.
[28] Paul-Edouard Sarlin, Cesar Cadena, Roland Siegwart, and Marcin Dymczyk. From coarse to fine: Robust hierarchical localization at large scale. In CVPR, 2019.
[29] Paul-Edouard Sarlin, Mihai Dusmanu, Johannes L. Schönberger, Pablo Speciale, Lukas Gruber, Viktor Larsson, Ondrej Miksik, and Marc Pollefeys. LaMAR: Benchmarking Localization and Mapping for Augmented Reality. In ECCV, 2022.
[30] Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In CVPR, 2016.
[31] Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. In ECCV, 2016.
[32] Johannes L. Schönberger, True Price, Torsten Sattler, Jan-Michael Frahm, and Marc Pollefeys. A vote-and-verify strategy for fast spatial verification in image retrieval. In Computer Vision – ACCV 2016, pages 321–337, Cham, 2017. Springer International Publishing.
[33] Sergio Izquierdo and Javier Civera. Close, but not there: Boosting geographic distance sensitivity in visual place recognition. In European Conference on Computer Vision (ECCV), 2024.
[34] Xun Sun, Yuanfan Xie, Peiwen Luo, and Liang Wang. A dataset for benchmarking image-based localization. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5641–5649, 2017.
[35] Hajime Taira, Masatoshi Okutomi, Torsten Sattler, Mircea Cimpoi, Marc Pollefeys, Josef Sivic, Tomas Pajdla, and Akihiko Torii. InLoc: Indoor visual localization with dense matching and view synthesis. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[36] A. Torii, R. Arandjelović, J. Sivic, M. Okutomi, and T. Pajdla. 24/7 place recognition by view synthesis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(2):257–271, 2018.
[37] Joseph Tung, Gene Chou, Ruojin Cai, Guandao Yang, Kai Zhang, Gordon Wetzstein, Bharath Hariharan, and Noah Snavely. MegaScenes: Scene-level view synthesis at scale. In ECCV, 2024.
[38] Xun Wang, Xintong Han, Weilin Huang, Dengke Dong, and Matthew R. Scott. Multi-similarity loss with general pair weighting for deep metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5022–5030, 2019.
[39] Frederik Warburg, Søren Hauberg, Manuel López-Antequera, Pau Gargallo, Yubin Kuang, and Javier Civera. Mapillary street-level sequences: A dataset for lifelong place recognition. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2623–2632, 2020.
[40] Tobias Weyand, A. Araújo, Bingyi Cao, and Jack Sim. Google Landmarks Dataset v2 – a large-scale benchmark for instance-level recognition and retrieval. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2572–2581, 2020.