Introducing Molmo 2 🎥: State-of-the-art video understanding, pointing, and tracking

Last year, Molmo helped push image understanding forward with pointing: grounded answers you can verify. Now, Molmo 2 brings those capabilities to video, so the model doesn't just answer questions, it can show you where & when something is happening.

On major industry benchmarks, Molmo 2 surpasses most open multimodal models & even rivals closed peers like Gemini 3 Pro and Claude Sonnet 4.5.

Molmo 2 returns pixel coordinates + timestamps over videos & coordinates over images (see the parsing sketch after the links below), enabling:
◘ Video + image QA
◘ Counting-by-pointing
◘ Dense captioning
◘ Artifact detection
◘ Subtitle-aware analysis
…and more!

Three variants depending on your needs:
🔹 Molmo 2 (8B): Qwen 3 backbone, best overall performance
🔹 Molmo 2 (4B): Qwen 3 backbone, fast + efficient
🔹 Molmo 2-O (7B): Olmo backbone, fully open model

Demos:
🎯 Counting objects & actions ("How many times does the ball hit the ground?"). Returns the count plus space-time pointers for each event: https://lnkd.in/eAg8nNWP
❓ Ask-it-anything long-video QA ("Why does the player change strategy here?"). Points to the moments supporting the answer: https://lnkd.in/eXSf5dYb
📍 Object tracking ("Follow the red race car."). Tracks it across frames with coordinates over time: https://lnkd.in/ezDy38cR

We've also significantly upgraded the Ai2 Playground 🛠️ You can now upload a video or multiple images to try summarization, tracking, and counting, while seeing exactly where the model is looking.

Try it and learn more:
▶️ Playground: https://lnkd.in/gk3Q49a5
⬇️ Models: https://lnkd.in/eJXDvZ_m
📝 Blog: https://lnkd.in/emNwBqrH
📑 Report: https://lnkd.in/enZG-Z5Q
💻 API coming soon
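
For anyone who wants to consume those grounded outputs programmatically, here is a minimal parsing sketch. It assumes Molmo 2 keeps Molmo 1's convention of emitting XML-like <point x="…" y="…"> tags with coordinates expressed as percentages of the frame; the exact output format, including how video timestamps are attached, should be confirmed against the model cards and report linked above.

```python
# Minimal sketch: converting Molmo-style point outputs into pixel coordinates.
# Assumption: points arrive as tags like <point x="61.5" y="40.2" alt="ball">ball</point>,
# with x/y given as percentages of the frame (the Molmo 1 convention). Check the
# Molmo 2 model cards for the authoritative output spec, including timestamps.
import re

POINT_RE = re.compile(r'<point\s+x="([\d.]+)"\s+y="([\d.]+)"[^>]*>')

def parse_points(response_text: str, frame_width: int, frame_height: int):
    """Convert percentage-based point tags into (x_px, y_px) pixel coordinates."""
    points = []
    for x_pct, y_pct in POINT_RE.findall(response_text):
        x_px = float(x_pct) / 100.0 * frame_width
        y_px = float(y_pct) / 100.0 * frame_height
        points.append((round(x_px, 1), round(y_px, 1)))
    return points

# Hypothetical model response for illustration:
sample = 'The ball hits the ground here: <point x="61.5" y="40.2" alt="ball">ball</point>'
print(parse_points(sample, frame_width=1280, frame_height=720))  # [(787.2, 289.4)]
```

Keeping coordinates as percentages makes the raw responses resolution-independent; converting to pixels only at display time lets the same answer be overlaid on any rescaled copy of the frame.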