Stars
JavaScript in-page GUI agent. Control web interfaces with natural language.
Official repository of the article "CrystaL: Spontaneous Emergence of Visual Latents in MLLMs"
ASID-Caption: Attribute-Structured and Quality-Verified Audiovisual Instruction Dataset and Training Pipeline for Fine-Grained Video Understanding.
Your own personal AI assistant. Any OS. Any Platform. The lobster way. 🦞
Visionary: The World Model Carrier Built on WebGPU-Powered Gaussian Splatting Platform
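⭐AI-driven public opinion & trend monitor with multi-platform aggregation, RSS, and smart alerts. 🎯 Say goodbye to information overload: your AI public-opinion monitoring assistant and trending-topic filter! Aggregates trending topics from multiple platforms plus RSS subscriptions, with precise keyword filtering. AI-powered news filtering + AI translation + AI analysis briefings pushed straight to your phone; also supports integration with the MCP architecture…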
[Remote Sensing 2026] Co-Training Vision Language Models for Remote Sensing Multi-task Learning
PyTorch implementation of JiT https://arxiv.org/abs/2511.13720
A reproduction of the DeepSeek-OCR model, including training
Official implementation of "Visual Instruction Pretraining for Domain-Specific Foundation Models"
[CVPR2026] Detect Anything via Next Point Prediction
[NeurIPS 2025] "Enhancing Visual Grounding for GUI Agents via Self-Evolutionary Reinforcement Learning"
Reference PyTorch implementation and models for DINOv3
Pix2Seq codebase: multi-tasks with generative modeling (autoregressive and diffusion)
AI for remote sensing, remote sense, object detection, oriented object detection, computer vision, cv
[ICML 2025 Spotlight] MODA: MOdular Duplex Attention for Multimodal Perception, Cognition, and Emotion Understanding
[NeurIPS 2025 Oral] Representation Entanglement for Generation: Training Diffusion Transformers Is Much Easier Than You Think
The official code for the paper: LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs
MMaDA - Open-Sourced Multimodal Large Diffusion Language Models (dLLMs with block diffusion, mixed-CoT, unified RL)
Official Repo For Pixel-LLM Codebase: Sa2VA (Arxiv-25), SAMTok (CVPR-26), VRT, SaSaSa2VA (1st-place solution for LSVOS)
Official PyTorch Implementation of SARLANG-1M: A Benchmark for Vision-Language Modeling in SAR Image Understanding [IEEE TGRS 2026].
[NIPS 2025 DB Oral] Official Repository of paper: Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing
Train InternViT-6B in MMSegmentation and MMDetection with DeepSpeed
[CVPR 2025] Mr. DETR: Instructive Multi-Route Training for Detection Transformers
Qwen3-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.