
🚀 Open World Agents

Everything you need to build a state-of-the-art foundation multimodal desktop agent, end-to-end.


⚠️ Active Development Notice: This codebase is under active development. APIs and components may change, and some may be moved to separate repositories. Documentation may be incomplete or reference features still in development.

📄 Research Paper: This project was first introduced and developed as part of D2E. For more details, see D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI. If you find this work useful, please cite our paper.

Quick Start

💡 This is a conceptual overview. See the Quick Start Guide for detailed instructions.

# 1. Record desktop interaction
$ ocap my-session.mcap

# 2. Process to training format
$ python scripts/01_raw_to_event.py --train-dir ./

# 3. Train your model (coming soon)
$ python train.py --dataset ./event-dataset
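
As a quick sanity check between steps 1 and 2, the recording can be opened with the generic `mcap` Python package, since OWAMcap files are standard MCAP containers. A minimal sketch (the topic names in your file depend on what was recorded; this is not the OWA-specific reader):

```python
# Count messages per topic in a freshly recorded OWAMcap session.
# OWAMcap files are standard MCAP containers, so the generic reader works here.
from collections import Counter

from mcap.reader import make_reader  # pip install mcap

with open("my-session.mcap", "rb") as f:
    reader = make_reader(f)
    topic_counts = Counter()
    for schema, channel, message in reader.iter_messages():
        topic_counts[channel.topic] += 1

for topic, count in topic_counts.most_common():
    print(f"{topic}: {count} messages")
```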

Installation

# For video recording, install GStreamer first. Skip if you only need data processing.
$ conda install open-world-agents::gstreamer-bundle

# Install OWA
$ pip install owa

Documentation

| Resource | Description |
| --- | --- |
| 🏠 Full Documentation | Complete docs with all guides and references |
| 📖 Quick Start Guide | Complete tutorial: Record → Process → Train |
| 🤗 Community Datasets | Browse and share datasets |

Core Components

  • 🌍 Environment Framework: The "USB-C of desktop agents" - a universal interface for native desktop automation, with pre-built plugins for desktop control, high-performance screen capture, and a zero-configuration plugin system
  • 📊 Data Infrastructure: A complete desktop agent data pipeline from recording to training, built on the OWAMcap format - a universal standard powered by MCAP (see the download sketch after this list)
  • 🛠️ CLI Tools: Command-line utilities (owl) for recording, analyzing, and managing agent data
  • 🤖 Examples: Complete implementations and training pipelines for multimodal agents
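
Because OWAMcap datasets are shared as ordinary files on the Hugging Face Hub, a recording from a community dataset can be fetched with `huggingface_hub` and then inspected or processed like the local recording above. A minimal sketch, assuming a hypothetical repository and file name (browse the Community Datasets page for real ones):

```python
# Download one recording from a community OWAMcap dataset on the Hugging Face Hub.
# NOTE: repo_id and filename below are placeholders, not a real dataset.
from huggingface_hub import hf_hub_download  # pip install huggingface_hub

local_path = hf_hub_download(
    repo_id="some-user/some-owamcap-dataset",  # hypothetical dataset repository
    filename="example-session.mcap",           # hypothetical file within the repo
    repo_type="dataset",
)
print("Downloaded to", local_path)
```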

Contributing

We welcome contributions! See our Contributing Guide.

License

MIT License. See LICENSE.

Citation

@article{choi2025d2e,
  title={D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI},
  author={Choi, Suhwan and Jung, Jaeyoon and Seong, Haebin and Kim, Minchan and Kim, Minyeong and Cho, Yongjun and Kim, Yoonshik and Park, Yubeen and Yu, Youngjae and Lee, Yunsung},
  journal={arXiv preprint arXiv:2510.05684},
  year={2025}
}
