Stars
Visionary: The World Model Carrier Built on a WebGPU-Powered Gaussian Splatting Platform
🎯 Say goodbye to information overload: AI helps you make sense of trending news, with simple public-opinion monitoring and analysis. Multi-platform hot-topic aggregation plus MCP-based AI analysis tools. Monitors 35 platforms (Douyin, Zhihu, Bilibili, Wallstreetcn, Cailian Press, etc.), with smart filtering, automatic push, and conversational AI analysis (mine news in depth using natural language: 13 tools including trend tracking, sentiment analysis, and similarity retrieval). Supports push to WeCom / personal WeChat / Feishu / DingTalk / Telegram / email / ntfy / bark / Slack; phone notifications within 1 minute, no need…
[ArXiv 2025] Co-Training Vision Language Models for Remote Sensing Multi-task Learning
PyTorch implementation of JiT (https://arxiv.org/abs/2511.13720)
A reproduction of the DeepSeek-OCR model, including training
Official implementation of "Visual Instruction Pretraining for Domain-Specific Foundation Models"
Detect Anything via Next Point Prediction (Based on Qwen2.5-VL-3B)
[NeurIPS 2025]"Enhancing Visual Grounding for GUI Agents via Self-Evolutionary Reinforcement Learning"
Reference PyTorch implementation and models for DINOv3
Pix2Seq codebase: multi-task learning with generative modeling (autoregressive and diffusion)
AI for remote sensing: object detection, oriented object detection, computer vision (CV)
[ICML 2025 Spotlight] MODA: MOdular Duplex Attention for Multimodal Perception, Cognition, and Emotion Understanding
[NeurIPS 2025 Oral] Representation Entanglement for Generation: Training Diffusion Transformers Is Much Easier Than You Think
Official code for the paper "LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs"
MMaDA - Open-Source Multimodal Large Diffusion Language Models
Official Repo For "Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos"
SARLANG-1M is a large-scale benchmark tailored for multimodal SAR image understanding, with a primary focus on integrating SAR imagery with the textual modality.
[NeurIPS 2025 DB Oral] Official repository of the paper "Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing"
Train InternViT-6B in MMSegmentation and MMDetection with DeepSpeed
[CVPR 2025] Mr. DETR: Instructive Multi-Route Training for Detection Transformers
Qwen3-VL is the multimodal large language model series developed by the Qwen team at Alibaba Cloud.
Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)
Official repository of 'Visual-RFT: Visual Reinforcement Fine-Tuning' & 'Visual-ARFT: Visual Agentic Reinforcement Fine-Tuning'
Fully open reproduction of DeepSeek-R1
The first large-scale multimodal dialogue dataset focusing on Synthetic Aperture Radar (SAR) imagery.
Solve Visual Understanding with Reinforced VLMs
【TMM 2025🔥】 Mixture-of-Experts for Large Vision-Language Models
[NeurIPS 2024 Spotlight] Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning