[Figure 6: two panels plotting Success Rate (left) and SPL (right) against the number of sequential goals (1, 2, 3, 4, 5-10), comparing GOAT, COW, and GOAT w/o Memory.]
Fig. 6. Navigation performance based on sequential goal count. GOAT performance improves with experience in the environment:
from a 60% success rate (0.2 SPL) at the first goal to 90% (0.8 SPL) for goals 5-10 after thorough exploration. Conversely, GOAT without
memory shows no improvement from experience, while COW benefits but plateaus at much lower performance.
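For reference, SPL is the standard Success weighted by Path Length metric of Anderson et al. [2], restated here for convenience (not part of the original caption):

\[
\mathrm{SPL} = \frac{1}{N} \sum_{i=1}^{N} S_i \, \frac{\ell_i}{\max(p_i, \ell_i)},
\]

where, for episode i, S_i indicates success, \ell_i is the shortest-path distance from the start to the goal, and p_i is the length of the path actually taken.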
we collected an additional 5 trajectories (5-10 navigation goals each) during which people continuously moved throughout the scene. Either one or two people (chosen randomly) were instructed to walk to a randomly selected sequence of objects in the scene (waiting briefly at each object) while the robot was navigating. They were instructed to treat the robot as they would another human (i.e., not walking directly into the robot). In this setting, GOAT preserves an 81% success rate.

We further evaluate the ability of GOAT to search for and follow people by introducing people as image goals in 5 additional trajectories following the same protocol. GOAT can localize and follow people with 83% success, close to the 86% success rate for static image instance goals. A demo video and visualizations can be found in the supplementary material.

Fig. 7. Online evaluation qualitative trajectories. We compare methods on the same sequence of 5 goals (top) in the same environment. GOAT localizes all goals and navigates efficiently (with an SPL of 0.78). CLIP on Wheels localizes only 1 out of 5 goals, illustrating the superiority of GOAT's Object Instance Memory for matching. GOAT without memory is able to localize 4 out of 5 goals, but with an SPL of only 0.40, as it has to re-explore the environment with every goal. See Section V for details.

VII. DISCUSSION

a) Modularity allows GOAT to Achieve Robust General-Purpose Navigation in the Real World: The GOAT system as a whole is a robust navigation platform, achieving a success rate of 83% across image, language, and category goals in the wild, and up to 90% once the environment is fully explored (Table I, Figure 6). This is possible in part due to the modular nature of the system. A modular system allows learning to be applied in the components where it is required (i.e., object detection and image/language matching), while still leveraging strong classical methods (i.e., mapping and planning). Furthermore, for learning-based components, we can use models trained on large datasets (e.g., CLIP, Mask RCNN) or on specialized tasks (monocular depth estimation) to full effect, where a task-specific end-to-end learned approach would be limited by the data available for this specific task. GOAT is able to tie all of these components together using our Object Instance Memory to achieve state-of-the-art performance for lifelong real-world navigation.

Furthermore, the modular design of GOAT allows it to be easily adapted to different robot embodiments and a variety of downstream applications. GOAT can be deployed on any robot with an RGB-D camera, a pose sensor (onboard SLAM), and the ability to execute low-level locomotion commands (move forward, turn left, turn right). GOAT's modularity eliminates the need for new data collection or training when deployed on a new robot platform. This stands in contrast to end-to-end methods, which would require new data collection and retraining for every different embodiment.

Consequently, new goal modalities can easily be added to the system as long as a matching mechanism exists and the robot is equipped with the correct sensors. For example, if goals are specified by 3D models, all that is required is a module for estimating the 3D shape of detected objects and for matching it against the specified goals.
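To make this pluggability concrete, here is a minimal illustrative sketch, not the authors' implementation; the names (InstanceView, GoalMatcher, match_score) and the CLIP-based scoring shown are assumptions about how such an interface could look:

```python
# Illustrative sketch only: the class and function names here (InstanceView,
# GoalMatcher, match_score) are hypothetical and not taken from the GOAT codebase.
from abc import ABC, abstractmethod
from dataclasses import dataclass
import numpy as np

@dataclass
class InstanceView:
    """One stored observation of a detected object instance."""
    category: str                 # detected object category, e.g. "chair"
    clip_feature: np.ndarray      # cached CLIP embedding of the image crop

class GoalMatcher(ABC):
    """Interface a goal modality must implement to plug into the instance memory."""
    @abstractmethod
    def match_score(self, view: InstanceView) -> float:
        """Similarity between this goal and a stored instance view (higher is better)."""

class LanguageGoal(GoalMatcher):
    """Language goals score views by CLIP text-to-image cosine similarity."""
    def __init__(self, text_feature: np.ndarray):
        self.text_feature = text_feature / np.linalg.norm(text_feature)

    def match_score(self, view: InstanceView) -> float:
        crop_feature = view.clip_feature / np.linalg.norm(view.clip_feature)
        return float(self.text_feature @ crop_feature)

# A new modality (e.g., goals given as 3D models) would only need its own
# GoalMatcher subclass plus the perception module that produces the features
# it compares against.
```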
b) Matching Performance During Exploration Lags Behind Performance After Exploration: Using a predefined threshold on goal-to-object matching scores during exploration can be error-prone (see the visualization in the supplementary material). On the other hand, once the scene has been explored, the agent has the privilege of selecting the best-matching instance across all observed objects. This is reflected in the improved performance of the agent post-exploration (Figure 6). When the environment is fully explored, failures are almost exclusively due to failures in matching the correct goal. The most common failure is a language goal being matched against an object of the correct class but the wrong instance (e.g., the language specifies a bed, but the system matches against a different bed). Examples of these failures can be seen in supplementary Figure S2, and additional details of matching performance can be found in the supplementary material.
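A short sketch of the two regimes described above (illustrative only; MATCH_THRESHOLD and the surrounding structure are hypothetical placeholders, not values or code from GOAT):

```python
# Illustrative sketch of the two matching regimes; MATCH_THRESHOLD and the list
# of stored views are hypothetical placeholders, not values or code from GOAT.
from typing import List, Optional

MATCH_THRESHOLD = 0.5  # assumed score cutoff used while the scene is still being explored

def match_during_exploration(goal, views_seen_so_far: List) -> Optional[int]:
    """Commit to the first instance whose score clears a fixed threshold.
    This is error-prone: an early, mediocre match can pre-empt a better
    instance that simply has not been observed yet."""
    for idx, view in enumerate(views_seen_so_far):
        if goal.match_score(view) >= MATCH_THRESHOLD:
            return idx
    return None  # no match yet: keep exploring

def match_after_exploration(goal, all_views: List) -> Optional[int]:
    """With the scene fully mapped, simply pick the best-scoring instance."""
    if not all_views:
        return None
    scores = [goal.match_score(view) for view in all_views]
    return max(range(len(all_views)), key=scores.__getitem__)
```

The threshold regime trades accuracy for latency: the agent can act on a match as soon as one is found, at the risk of committing to the wrong instance.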
c) Image Goal Matching is More Reliable than Language Goal Matching: We observe that image-to-image goal matching is more successful at identifying goal instances than matching instance views against semantic features of language descriptions of the goal. This is expected, because SuperGlue-based image keypoint matching can leverage geometric correspondences between predicted instances and goal objects, whereas the semantic feature encodings from CLIP can be incapable of capturing the fine-grained instance properties that are often crucial for goal matching. As a result, navigation with image goals is significantly more successful (Table I, SR 86.4 vs. 68.2).
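The gap can be seen in the shape of the two scoring signals; the sketch below is illustrative, with superglue_match, clip_text_encoder, and clip_image_encoder standing in for the actual models rather than naming real GOAT APIs:

```python
# Sketch contrasting the two goal-matching signals. superglue_match,
# clip_text_encoder, and clip_image_encoder are assumed stand-ins for the
# real models, not actual GOAT APIs.
import numpy as np

def image_goal_score(goal_image, instance_crop, superglue_match) -> float:
    """Keypoint matching sees instance-level geometry: the score is driven by how
    many keypoints in the goal image find confident correspondences in the crop."""
    keypoint_pairs = superglue_match(goal_image, instance_crop)
    return float(len(keypoint_pairs))

def language_goal_score(description, instance_crop,
                        clip_text_encoder, clip_image_encoder) -> float:
    """CLIP similarity captures semantics ("a bed") but can miss the fine-grained
    attributes that single out one instance ("the bed with the striped duvet")."""
    text_feat = clip_text_encoder(description)
    crop_feat = clip_image_encoder(instance_crop)
    return float(text_feat @ crop_feat /
                 (np.linalg.norm(text_feat) * np.linalg.norm(crop_feat)))
```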
d) Real-World Open-Vocabulary Detection: Limitations and Opportunities: An interesting and noteworthy observation is that, despite the rapid advances in open-vocabulary (or large-vocabulary) vision-and-language models (VLMs) [37, 43], we find their performance to be significantly worse than that of a Mask RCNN model from 2017. We attribute this observation to two possible hypotheses: (i) open-vocabulary models trade off robustness for being more versatile and supporting more queries, and (ii) the internet-scale, weakly labeled data sources used to train modern VLMs under-represent the kind of embodied interaction data that would benefit robots operating in real-world environments with humans. The latter represents a challenging opportunity: developing large-scale models that are simultaneously versatile and robust for embodied applications in real-world environments.
e) Generalization to New Environments: While end-to-end learning-based solutions may suffer from overfitting on a few training scenes, the modular design of GOAT avoids this issue. The generalization of GOAT is limited only by the robustness of its components, many of which have been trained on large internet-scale data. In real-world experiments in 9 visually diverse homes, we found no generalization issues in any of the components of GOAT. That being said, GOAT was designed for indoor navigation and consequently was not tested in outdoor settings, where low-level locomotion is far more challenging. While utilizing large-scale models such as CLIP improves generalization, GOAT also inherits the limitations and biases of these models. For example, if the majority of the objects used to train the object detector originated from North America, the system's performance may be diminished when operating in other regions.
f) Computational Constraints: While the memory utilization of GOAT grows steadily throughout an episode, with proper steps taken to optimize performance this was not a hindrance in our experiments. Storing compressed images at 480 × 640 resolution requires only 6 MB in total for all images by the end of an episode on average. Similarly, storing only CLIP features for language matching requires minimal memory (only 257 KB on average for an entire trajectory) and allows for fast vectorized comparison (7 ms on average and 29 ms at most on a single GPU). The computational cost of image-to-image comparisons also remains under control, as we only match against instances belonging to the category of interest. Matching a single image pair takes 45 ms on a single GPU, and the matching takes 0.9 s on average (and 2.6 s at most) after the environment is fully explored; these matching times were more than fast enough for our experiments. However, for extremely long trajectories, a mechanism to increase parallelism or cull duplicate images would be necessary to keep matching fast.
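The fast vectorized language comparison mentioned above amounts to a single matrix-vector product over the cached features; the following is a rough sketch under assumed sizes (a 512-dimensional float32 CLIP embedding), not code or constants from the paper:

```python
# Sketch of the vectorized language-goal comparison; the 512-dimensional float32
# CLIP embedding and the variable names are assumptions, not reported values.
import numpy as np

def rank_instances_by_language_goal(text_feature: np.ndarray,
                                    stored_features: np.ndarray) -> np.ndarray:
    """stored_features: (num_views, d) matrix of cached CLIP crop embeddings.
    One normalized matrix-vector product scores every stored view at once,
    which is why language matching stays in the millisecond range."""
    t = text_feature / np.linalg.norm(text_feature)
    v = stored_features / np.linalg.norm(stored_features, axis=1, keepdims=True)
    scores = v @ t                      # (num_views,) cosine similarities
    return np.argsort(-scores)          # indices of best-matching views first

# Rough memory check under the assumed sizes: 512-dim float32 features are about
# 2 KB per stored view, so the reported ~257 KB per trajectory corresponds to on
# the order of a hundred stored views.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cached = rng.standard_normal((128, 512)).astype(np.float32)
    goal = rng.standard_normal(512).astype(np.float32)
    print(rank_instances_by_language_goal(goal, cached)[:5])
```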
g) Additional Limitations: To achieve robust image-matching results, GOAT's memory system stores all images in which objects have been detected. For very small or compute-constrained robots this may be too memory-intensive, and images should be sub-sampled. Like all systems that rely on 2D mapping, GOAT is designed to handle only a single story of a building. While this could be remedied by detecting when a floor change has happened and maintaining a separate map per floor, we leave this to future work.

ACKNOWLEDGMENTS

Saurabh Gupta's effort was supported by the NSF CAREER Award (IIS2143873).
REFERENCES

[1] Ziad Al-Halah, Santhosh Kumar Ramakrishnan, and Kristen Grauman. Zero experience required: Plug & play modular transfer learning for semantic visual navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17031–17041, 2022.
[2] Peter Anderson, Angel Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, et al. On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757, 2018.
[3] Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton Van Den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3674–3683, 2018.
[4] Dhruv Batra, Angel X Chang, Sonia Chernova, Andrew J Davison, Jia Deng, Vladlen Koltun, Sergey Levine, Jitendra Malik, Igor Mordatch, Roozbeh Mottaghi, et al. Rearrangement: A challenge for embodied AI. arXiv preprint arXiv:2011.01975, 2020.
[5] Benjamin Bolte, Austin Wang, Jimmy Yang, Mustafa Mukadam, Mrinal Kalakrishnan, and Chris Paxton. USA-Net: Unified semantic and affordance representations for robot memory. arXiv preprint arXiv:2304.12164, 2023.
[6] Matthew Chang, Arjun Gupta, and Saurabh Gupta. Semantic visual navigation by watching YouTube videos. In Advances in Neural Information Processing Systems, 2020.
[7] Devendra Singh Chaplot, Dhiraj Gandhi, Abhinav Gupta, and Ruslan Salakhutdinov. Object goal navigation using goal-oriented semantic exploration. In Neural Information Processing Systems (NeurIPS), 2020.
[8] Devendra Singh Chaplot, Dhiraj Gandhi, Saurabh Gupta, Abhinav Gupta, and Ruslan Salakhutdinov. Learning to explore using active neural SLAM. In International Conference on Learning Representations, 2020. URL https://openreview.net/pdf?id=HklXn1BKDH.
[9] Devendra Singh Chaplot, Ruslan Salakhutdinov, Abhinav Gupta, and Saurabh Gupta. Neural topological SLAM for visual navigation. In Computer Vision and Pattern Recognition (CVPR), 2020.
[10] Devendra Singh Chaplot, Murtaza Dalal, Saurabh Gupta, Jitendra Malik, and Russ R Salakhutdinov. SEAL: Self-supervised embodied active learning using exploration and 3D consistency. Advances in Neural Information Processing Systems, 34:13086–13098, 2021.
[11] Boyuan Chen, Fei Xia, Brian Ichter, Kanishka Rao, Keerthana Gopalakrishnan, Michael S Ryoo, Austin Stone, and Daniel Kappler. Open-vocabulary queryable scene representations for real world planning. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 11509–11522. IEEE, 2023.
[12] Howie Choset and Keiji Nagatani. Topological simultaneous localization and mapping (SLAM): toward exact localization without explicit localization. IEEE Transactions on Robotics and Automation, 17(2):125–137, 2001.
[13] Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. PaLM-E: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023.
[14] Alberto Elfes. Using occupancy grids for mobile robot perception and navigation. Computer, 22(6):46–57, 1989.
[15] Hao Fang*, Saurabh Gupta*, Forrest Iandola*, Rupesh K Srivastava*, Li Deng, Piotr Dollár, Jianfeng Gao, Xiaodong He, Margaret Mitchell, John C Platt, C Lawrence Zitnick, and Geoffrey Zweig. From captions to visual concepts and back. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1473–1482, 2015.
[16] Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth. Every picture tells a story: Generating sentences from images. In Computer Vision–ECCV 2010: 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5-11, 2010, Proceedings, Part IV 11, pages 15–29. Springer, 2010.
[17] Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc'Aurelio Ranzato, and Tomas Mikolov. DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems, 26, 2013.
[18] Samir Yitzhak Gadre, Mitchell Wortsman, Gabriel Ilharco, Ludwig Schmidt, and Shuran Song. CoWs on pasture: Baselines and benchmarks for language-driven zero-shot object navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23171–23181, 2023.
[19] Theophile Gervet, Soumith Chintala, Dhruv Batra, Jitendra Malik, and Devendra Singh Chaplot. Navigating to objects in the real world. Science Robotics, 8(79):eadf6991, 2023.
[20] Jiayuan Gu, Devendra Singh Chaplot, Hao Su, and Jitendra Malik. Multi-skill mobile manipulation for object rearrangement. arXiv preprint arXiv:2209.02778, 2022.
[21] Saurabh Gupta, James Davidson, Sergey Levine, Rahul Sukthankar, and Jitendra Malik. Cognitive mapping and planning for visual navigation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[22] Meera Hahn, Devendra Singh Chaplot, Shubham Tulsiani, Mustafa Mukadam, James M Rehg, and Abhinav Gupta. No RL, no simulation: Learning to navigate without navigating. Advances in Neural Information Processing Systems, 34:26661–26673, 2021.
[23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[24] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
[25] Evan Hernandez, Sarah Schwettmann, David Bau, Teona Bagashvili, Antonio Torralba, and Jacob Andreas. Natural language descriptions of deep visual features. In International Conference on Learning Representations, 2021.
[26] Chenguang Huang, Oier Mees, Andy Zeng, and Wolfram Burgard. Visual language maps for robot navigation. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 10608–10615. IEEE, 2023.
[27] Daniel P Huttenlocher and Shimon Ullman. Recognizing solid objects by alignment with an image. International Journal of Computer Vision, 5(2):195–212, 1990.
[28] Brian Ichter, Anthony Brohan, Yevgen Chebotar, Chelsea Finn, Karol Hausman, Alexander Herzog, Daniel Ho, Julian Ibarz, Alex Irpan, Eric Jang, Ryan Julian, Dmitry Kalashnikov, Sergey Levine, Yao Lu, Carolina Parada, Kanishka Rao, Pierre Sermanet, Alexander T Toshev, Vincent Vanhoucke, Fei Xia, Ted Xiao, Peng Xu, Mengyuan Yan, Noah Brown, Michael Ahn, Omar Cortes, Nicolas Sievers, Clayton Tan, Sichun Xu, Diego Reyes, Jarek Rettinghouse, Jornell Quiambao, Peter Pastor, Linda Luu, Kuang-Huei Lee, Yuheng Kuang, Sally Jesmonth, Nikhil J. Joshi, Kyle Jeffrey, Rosario Jauregui Ruano, Jasmine Hsu, Keerthana Gopalakrishnan, Byron David, Andy Zeng, and Chuyuan Kelly Fu. Do as I can, not as I say: Grounding language in robotic affordances. In Karen Liu, Dana Kulic, and Jeff Ichnowski, editors, Proceedings of The 6th Conference on Robot Learning, volume 205 of Proceedings of Machine Learning Research, pages 287–318. PMLR, 14–18 Dec 2023. URL https://proceedings.mlr.press/v205/ichter23a.html.
[29] Krishna Murthy Jatavallabhula, Alihusein Kuwajerwala, Qiao Gu, Mohd Omama, Tao Chen, Shuang Li, Ganesh Iyer, Soroush Saryazdi, Nikhil Keetha, Ayush Tewari, et al. ConceptFusion: Open-set multimodal 3D mapping. arXiv preprint arXiv:2302.07241, 2023.
[30] Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.
[31] Justin Johnson, Andrej Karpathy, and Li Fei-Fei. DenseCap: Fully convolutional localization networks for dense captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4565–4574, 2016.
[32] Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Batra, and Stefan Lee. Beyond the nav-graph: Vision-and-language navigation in continuous environments. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVIII 16, pages 104–120. Springer, 2020.
[33] Jacob Krantz, Stefan Lee, Jitendra Malik, Dhruv Batra, and Devendra Singh Chaplot. Instance-specific image goal navigation: Training embodied agents to find object instances. arXiv preprint arXiv:2211.15876, 2022.
[34] Jacob Krantz, Theophile Gervet, Karmesh Yadav, Austin Wang, Chris Paxton, Roozbeh Mottaghi, Dhruv Batra, Jitendra Malik, Stefan Lee, and Devendra Singh Chaplot. Navigating to objects specified by images. In ICCV, 2023.
[35] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123:32–73, 2017.
[36] Benjamin Kuipers and Yung-Tai Byun. A robot exploration and mapping strategy based on a semantic hierarchy of spatial representations. Robotics and Autonomous Systems, 8(1-2):47–63, 1991.
[37] Boyi Li, Kilian Q Weinberger, Serge Belongie, Vladlen Koltun, and Rene Ranftl. Language-driven semantic segmentation. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=RriDjddCLN.
[38] David G Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60:91–110, 2004.
[39] Matthias Luber, Luciano Spinello, Jens Silva, and Kai O Arras. Socially-aware robot navigation: A learning approach. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 902–907. IEEE, 2012.
[40] Arjun Majumdar, Gunjan Aggarwal, Bhavika Devnani, Judy Hoffman, and Dhruv Batra. ZSON: Zero-shot object-goal navigation using multimodal goal embeddings. Advances in Neural Information Processing Systems, 35:32340–32352, 2022.
[41] Pierre Marza, Laetitia Matignon, Olivier Simonin, Dhruv Batra, Christian Wolf, and Devendra Singh Chaplot. AutoNeRF: Training implicit scene representations with autonomous agents. arXiv preprint arXiv:2304.11241, 2023.
[42] So Yeon Min, Devendra Singh Chaplot, Pradeep Ravikumar, Yonatan Bisk, and Ruslan Salakhutdinov. FILM: Following instructions in language with modular methods. arXiv preprint arXiv:2110.07342, 2021.
[43] Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, et al. Simple open-vocabulary object detection. In European Conference on Computer Vision, pages 728–755. Springer, 2022.
[44] Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE International Conference on Computer Vision, pages 2641–2649, 2015.
[45] Xavier Puig, Eric Undersander, Andrew Szot, Mikael Dallaire Cote, Tsung-Yen Yang, Ruslan Partsey, Ruta Desai, Alexander William Clegg, Michal Hlavac, So Yeon Min, et al. Habitat 3.0: A co-habitat for humans, avatars and robots. arXiv preprint arXiv:2310.13724, 2023.
[46] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[47] Santhosh Kumar Ramakrishnan, Devendra Singh Chaplot, Ziad Al-Halah, Jitendra Malik, and Kristen Grauman. PONI: Potential functions for ObjectGoal navigation with interaction-free learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18890–18900, 2022.
[48] René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3), 2022.
[49] Renato F Salas-Moreno, Richard A Newcombe, Hauke Strasdat, Paul HJ Kelly, and Andrew J Davison. SLAM++: Simultaneous localisation and mapping at the level of objects. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1352–1359, 2013.
[50] Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperGlue: Learning feature matching with graph neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4938–4947, 2020.
[51] Nikolay Savinov, Alexey Dosovitskiy, and Vladlen Koltun. Semi-parametric topological memory for navigation. arXiv preprint arXiv:1803.00653, 2018.
[52] Nur Muhammad Mahi Shafiullah, Chris Paxton, Lerrel Pinto, Soumith Chintala, and Arthur Szlam. CLIP-Fields: Weakly supervised semantic fields for robotic memory. arXiv preprint arXiv:2210.05663, 2022.
[53] Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, and Dieter Fox. ALFRED: A benchmark for interpreting grounded instructions for everyday tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10740–10749, 2020.
[54] Mohit Shridhar, Lucas Manuelli, and Dieter Fox. CLIPort: What and where pathways for robotic manipulation. In Conference on Robot Learning, pages 894–906. PMLR, 2022.
[55] Anthony Simeonov, Yilun Du, Andrea Tagliasacchi, Joshua B Tenenbaum, Alberto Rodriguez, Pulkit Agrawal, and Vincent Sitzmann. Neural descriptor fields: SE(3)-equivariant object representations for manipulation. In 2022 International Conference on Robotics and Automation (ICRA), pages 6394–6400. IEEE, 2022.
[56] Chan Hee Song, Jiaman Wu, Clayton Washington, Brian M Sadler, Wei-Lun Chao, and Yu Su. LLM-Planner: Few-shot grounded planning for embodied agents with large language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2998–3009, 2023.
[57] Sebastian Thrun, Maren Bennewitz, Wolfram Burgard, Armin B Cremers, Frank Dellaert, Dieter Fox, Dirk Hahnel, Charles Rosenberg, Nicholas Roy, Jamieson Schulte, et al. MINERVA: A second-generation museum tour-guide robot. In Proceedings 1999 IEEE International Conference on Robotics and Automation (Cat. No. 99CH36288C), volume 3. IEEE, 1999.
[58] Sebastian Thrun, Wolfram Burgard, and Dieter Fox. Probabilistic Robotics (Intelligent Robotics and Autonomous Agents). The MIT Press, 2005. ISBN 0262201623.
[59] Saim Wani, Shivansh Patel, Unnat Jain, Angel Chang, and Manolis Savva. MultiON: Benchmarking semantic map memory using multi-object navigation. Advances in Neural Information Processing Systems, 33:9700–9712, 2020.
[60] Brian Yamauchi. Frontier-based exploration using multiple robots. In Proceedings of the Second International Conference on Autonomous Agents, pages 47–53, 1998.
[61] Sriram Yenamandra, Arun Ramachandran, Karmesh Yadav, Austin Wang, Mukul Khanna, Theophile Gervet, Tsung-Yen Yang, Vidhi Jain, Alexander William Clegg, John Turner, Zsolt Kira, Manolis Savva, Angel Chang, Devendra Singh Chaplot, Dhruv Batra, Roozbeh Mottaghi, Yonatan Bisk, and Chris Paxton. HomeRobot: Open vocabulary mobile manipulation. 2023. URL https://github.com/facebookresearch/home-robot.
[62] Xingyi Zhou, Rohit Girdhar, Armand Joulin, Philipp Krähenbühl, and Ishan Misra. Detecting twenty-thousand classes using image-level supervision. In ECCV, 2022.
[63] Yuke Zhu, Roozbeh Mottaghi, Eric Kolve, Joseph J Lim, Abhinav Gupta, Li Fei-Fei, and Ali Farhadi. Target-driven visual navigation in indoor scenes using deep reinforcement learning. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 3357–3364. IEEE, 2017.