WorldMM is a novel dynamic multimodal memory agent designed for long video reasoning. It constructs multimodal, multi-scale memories that capture both textual and visual information, and employs adaptive retrieval across multiple memories with reasoning.
To set up the environment, we recommend using uv for fast and deterministic setup. All dependencies are specified in pyproject.toml and pinned in uv.lock.
git clone https://github.com/wgcyeo/WorldMM.git
cd WorldMM
The setup script will:
- Install uv (if not already installed)
- Install all project dependencies
- Download required datasets
bash script/1_setup.sh
To use GPT-family models for preprocessing or evaluation, set your OpenAI API key:
export OPENAI_API_KEY="your_openai_api_key"
Before memory construction and evaluation, preprocess the EgoLife dataset:
bash script/2_preprocess.sh
After preprocessing, the dataset directory is organized as follows:
data/EgoLife/
├── A1_JAKE/
│   ├── DAY1/ # Video files
│   ├── DAY2/
│   └── ...
├── EgoLifeCap/
│   ├── DenseCaption/ # Fine-grained video captions (in Chinese)
│   │   └── translated/ # Machine-translated English captions
│   ├── Sync/ # Synchronized transcripts + captions
│   └── Transcript/ # Audio transcripts
└── EgoLifeQA/
    └── EgoLifeQA_A1_JAKE.json # QA annotations
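Before moving on to memory construction, it can help to confirm that preprocessing produced the layout above. The following is a minimal sketch and not part of the WorldMM scripts: the helper name `missing_paths` is hypothetical, and the list of expected entries is simply copied from the tree shown here.

```python
from pathlib import Path

# Entries copied from the directory tree above, relative to data/EgoLife/.
EXPECTED = [
    "A1_JAKE",
    "EgoLifeCap/DenseCaption/translated",
    "EgoLifeCap/Sync",
    "EgoLifeCap/Transcript",
    "EgoLifeQA/EgoLifeQA_A1_JAKE.json",
]


def missing_paths(root, expected=EXPECTED):
    """Return the expected entries that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    # Assumes the script is run from the repository root.
    missing = missing_paths("data/EgoLife")
    if missing:
        print("Missing after preprocessing:", ", ".join(missing))
    else:
        print("Dataset layout looks complete.")
```

An empty result only means the listed directories and the QA file exist; it does not verify their contents.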
WorldMM builds three memory modules (episodic, semantic, and visual) to support long-term reasoning. Build all three with:
bash script/3_build_memory.sh
To run a specific module only:
bash script/3_build_memory.sh --step [episodic|semantic|visual]
Available options:
--step <type> # Memory type: episodic, semantic, visual, all
--gpu <ids> # GPU IDs to use (default: 0,1,2,3)
--model <name> # LLM model for memory construction (default: gpt-5-mini)
Tip: To skip memory construction and use the prebuilt WorldMM EgoLife memory metadata, download it from wgcyeo/WorldMM-EgoLife into output/metadata.
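When driving memory construction from Python rather than the shell, the command line can be assembled from the documented options. This is a sketch under assumptions: `build_memory_command` is a hypothetical helper, the defaults mirror the ones listed above, and it is assumed (not confirmed by the script itself) that the three flags can be combined in a single call.

```python
import shlex

# Values accepted by --step, as listed in the options above.
VALID_STEPS = ("episodic", "semantic", "visual", "all")


def build_memory_command(step="all", gpu="0,1,2,3", model="gpt-5-mini"):
    """Assemble the argv list for script/3_build_memory.sh."""
    if step not in VALID_STEPS:
        raise ValueError(f"unknown step {step!r}; expected one of {VALID_STEPS}")
    return ["bash", "script/3_build_memory.sh",
            "--step", step, "--gpu", gpu, "--model", model]


if __name__ == "__main__":
    # Print one command per memory module without executing anything.
    for step in ("episodic", "semantic", "visual"):
        print(shlex.join(build_memory_command(step=step, gpu="0")))
```

To actually run a step, the returned list can be passed to `subprocess.run(..., check=True)` from the repository root.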
Run evaluation on EgoLifeQA with:
bash script/4_eval.sh --retriever-model gpt-5-mini --respond-model gpt-5
Available options:
--retriever-model <m> # Model for retrieval process (default: gpt-5-mini)
--respond-model <m> # Model for iterative reasoning and generating answers (default: gpt-5)
--max-rounds <n> # Max retrieval rounds (default: 5)
WorldMM supports a variety of backbone models for retrieval and reasoning, including gpt-5 and qwen3vl-8b.
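To make the roles of the retriever model, the responding model, and `--max-rounds` more concrete, here is a toy sketch of a bounded retrieval loop over several memories. It is not WorldMM's implementation: the function `answer_with_retrieval`, the callables `choose` and `respond`, and the stub memories are all hypothetical stand-ins, and the real agent's stopping rule and prompts may differ.

```python
def answer_with_retrieval(question, memories, choose, respond, max_rounds=5):
    """Toy loop: pick a memory each round, gather evidence, then answer.

    memories: dict mapping a memory name to a function query -> list of str
    choose:   function (question, evidence) -> (memory_name, query),
              or None to stop retrieving (stands in for the retriever model)
    respond:  function (question, evidence) -> answer string
              (stands in for the responding model)
    """
    evidence = []
    for _ in range(max_rounds):
        decision = choose(question, evidence)
        if decision is None:  # retriever judges the evidence sufficient
            break
        name, query = decision
        evidence.extend(memories[name](query))
    return respond(question, evidence)


if __name__ == "__main__":
    # Stub memories that return one placeholder string per query.
    memories = {
        "episodic": lambda q: [f"episodic hit for {q!r}"],
        "semantic": lambda q: [f"semantic hit for {q!r}"],
        "visual": lambda q: [f"visual hit for {q!r}"],
    }
    order = ["episodic", "semantic", "visual"]

    def choose(question, evidence):
        # Stub policy: query each memory once, then stop.
        if len(evidence) < len(order):
            return order[len(evidence)], question
        return None

    def respond(question, evidence):
        return f"answer based on {len(evidence)} evidence items"

    print(answer_with_retrieval("Who cooked dinner?", memories, choose, respond))
```

In this sketch a lower `max_rounds` caps how much evidence is gathered even if the retriever would otherwise keep querying.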
Beyond evaluation on week-long videos, WorldMM also supports evaluation on general video benchmarks. We provide an example pipeline for Video-MME:
bash script/videomme/1_setup.sh
bash script/videomme/2_preprocess.sh
bash script/videomme/3_build_memory.sh --model gpt-5-mini
bash script/videomme/4_eval.sh --retriever-model gpt-5-mini --respond-model gpt-5
For detailed information about each step, please refer to the scripts located in script/videomme.
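The four Video-MME commands above depend on one another, so a driver that stops at the first failure avoids running evaluation on incomplete memory. The sketch below is one way to do that; `run_pipeline` and its `runner` parameter are hypothetical names introduced for illustration, and the commands are copied verbatim from the list above.

```python
import subprocess

# Commands copied from the Video-MME example pipeline, in order.
VIDEOMME_STEPS = [
    ["bash", "script/videomme/1_setup.sh"],
    ["bash", "script/videomme/2_preprocess.sh"],
    ["bash", "script/videomme/3_build_memory.sh", "--model", "gpt-5-mini"],
    ["bash", "script/videomme/4_eval.sh",
     "--retriever-model", "gpt-5-mini", "--respond-model", "gpt-5"],
]


def run_pipeline(steps=VIDEOMME_STEPS, runner=subprocess.run):
    """Run each step in order and stop at the first non-zero exit code.

    Returns the number of steps that completed successfully.
    """
    for done, argv in enumerate(steps):
        result = runner(argv)
        if result.returncode != 0:
            print(f"Step failed ({result.returncode}): {' '.join(argv)}")
            return done
    return len(steps)
```

Calling `run_pipeline()` from the repository root runs all four steps; the `runner` argument exists so the function can be exercised with a stub instead of launching the real scripts.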
If you find WorldMM helpful, please consider citing our paper:
@inproceedings{yeo2026worldmm,
title = {WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning},
author = {Yeo, Woongyeong and Kim, Kangsan and Yoon, Jaehong and Hwang, Sung Ju},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2026},
pages = {25599-25609}
}
Our implementation is built upon EgoLife, HippoRAG, and VLM2Vec. We thank the authors for open-sourcing their code and dataset.