🚀 Open World Agents

Everything you need to build a state-of-the-art foundation multimodal desktop agent, end-to-end.

Documentation | License: MIT | Python 3.11+

⚠️ Active Development Notice: This codebase is under active development. APIs and components may change, and some may be moved to separate repositories. Documentation may be incomplete or reference features still in development.

📄 Research Paper: This project was first introduced and developed for the D2E project. For more details, see D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI. If you find this work useful, please cite our paper.

Quick Start

💡 This is a conceptual overview. See the Quick Start Guide for detailed instructions.

# 1. Record desktop interaction
$ ocap my-session.mcap

# 2. Process to training format
$ python scripts/01_raw_to_event.py --train-dir ./

# 3. Train your model (coming soon)
$ python train.py --dataset ./event-dataset
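
Step 1 writes a standard MCAP container, so a recording can be read back for a quick sanity check. The sketch below is illustrative, not official: it assumes the mcap-owa-support package exposes OWAMcapReader with an iter_messages() interface; verify the exact names against the Full Documentation.

# inspect_recording.py -- a minimal sketch for peeking at a recorded session.
# Assumption: OWAMcapReader and iter_messages() exist as named here; the
# underlying file is plain MCAP, so the generic `mcap` reader also works.
from mcap_owa.highlevel import OWAMcapReader

with OWAMcapReader("my-session.mcap") as reader:
    for msg in reader.iter_messages():
        # Each entry is one recorded event (keyboard, mouse, screen, ...).
        print(msg.topic, msg.timestamp)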

Installation

# For video recording, install GStreamer first. Skip if you only need data processing.
$ conda install open-world-agents::gstreamer-bundle

# Install OWA
$ pip install owa
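
A quick way to confirm the install is an import check. This is a minimal sketch, assuming `pip install owa` pulls in the `owa.core` module; the package layout may differ across versions.

# check_install.py -- a minimal post-install sanity check, not an official script.
import importlib

for name in ("owa", "owa.core"):  # assumed module names; adjust if the layout differs
    module = importlib.import_module(name)  # raises ImportError if missing
    print(f"{name}: import OK (version: {getattr(module, '__version__', 'unknown')})")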

Documentation

  • 🏠 Full Documentation: Complete docs with all guides and references
  • 📖 Quick Start Guide: Complete tutorial (Record → Process → Train)
  • 🤗 Community Datasets: Browse and share datasets

Core Components

  • 🌍 Environment Framework: a universal interface for native desktop automation (the "USB-C of desktop agents"), with pre-built plugins for desktop control, high-performance screen capture, and a zero-configuration plugin system; see the sketch after this list
  • 📊 Data Infrastructure: a complete desktop agent data pipeline from recording to training, built on the OWAMcap format, a universal standard powered by MCAP
  • 🛠️ CLI Tools: command-line utilities (owl) for recording, analyzing, and managing agent data
  • 🤖 Examples: complete implementations and training pipelines for multimodal agents
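
The registry pattern is the core of the Environment Framework: plugins contribute functions and event listeners under string keys, and agent code looks them up at runtime. The sketch below is a hedged illustration, not a verified API listing; the CALLABLES/LISTENERS names and the "std/..." keys are assumptions drawn from the documented pattern, so check the Full Documentation for the exact identifiers.

# registry_sketch.py -- an illustrative sketch of the plugin-registry pattern.
# Assumptions (verify against the docs): `owa.core` exposes CALLABLES and
# LISTENERS registries, and the standard plugin registers the keys used below.
from owa.core import CALLABLES, LISTENERS

# Look up and call a registered function by its string key.
now_ns = CALLABLES["std/time_ns"]()
print(f"current time (ns): {now_ns}")

# Attach a callback to a registered event source, then start it.
def on_tick(event):
    print("tick:", event)

tick = LISTENERS["std/tick"]().configure(callback=on_tick, interval=1)
tick.start()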

Contributing

We welcome contributions! See our Contributing Guide.

License

MIT License. See LICENSE.

Citation

@article{choi2025d2e,
  title={D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI},
  author={Choi, Suhwan and Jung, Jaeyoon and Seong, Haebin and Kim, Minchan and Kim, Minyeong and Cho, Yongjun and Kim, Yoonshik and Park, Yubeen and Yu, Youngjae and Lee, Yunsung},
  journal={arXiv preprint arXiv:2510.05684},
  year={2025}
}
