Sichuan University
- Shenzhen ⇌ Chengdu, China
https://xuyang-liu16.github.io/
- @xuyang_liu16
Lists (13)
CoT
Diffusion Acceleration
🌟 KV Cache Compression
😵‍💫 Mitigating Hallucination
🔥 MLLMs
Foundation MLLMs, including image LLMs and video LLMs.
✨ Token Compression
🚀 Token Compression for MLLM
Token compression methods for MLLMs, including both training-based and training-free approaches.
📹 Training-free VideoLLMs
Training-free video LLMs extend image LLMs to video understanding without requiring additional fine-tuning on video data.
Stars
Paper list of streaming video understanding
🔥 OneThinker: All-in-one Reasoning Model for Image and Video
OmniVinci is an omni-modal LLM for joint understanding of vision, audio, and language.
Code and data for VTCBench, a vision-text compression benchmark for Vision Language Models.
Fast, memory-efficient attention column reduction (e.g., sum, mean, max)
Use PEFT or Full-parameter to CPT/SFT/DPO/GRPO 600+ LLMs (Qwen3, Qwen3-MoE, DeepSeek-R1, GLM4.5, InternLM3, Llama4, ...) and 300+ MLLMs (Qwen3-VL, Qwen3-Omni, InternVL3.5, Ovis2.5, GLM4.5v, Llava, …
Qwen3-omni is a natively end-to-end, omni-modal LLM developed by the Qwen team at Alibaba Cloud, capable of understanding text, audio, images, and video, as well as generating speech in real time.
Official Implementation for the paper: "ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos"
[CVPR 2025 🔥]A Large Multimodal Model for Pixel-Level Visual Grounding in Videos
1+1 > 2: Detector-Empowered Video Large Language Model for Spatio-Temporal Grounding and Reasoning
The official repo for "Vidi: Large Multimodal Models for Video Understanding and Editing"
WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning
Official Code for "ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning"
[ICML 2025 Oral] An official implementation of VideoRoPE & VideoRoPE++
Accelerating Streaming Video Large Language Models via Hierarchical Token Compression
Agent framework and applications built upon Qwen>=3.0, featuring Function Calling, MCP, Code Interpreter, RAG, Chrome extension, etc.
[arxiv'25] TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding
Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models
MR. Video: MapReduce is the Principle for Long Video Understanding
The repository provides code for running inference and finetuning with the Meta Segment Anything Model 3 (SAM 3), links for downloading the trained model checkpoints, and example notebooks that sho…
[AAAI2025] Video Repurposing from User Generated Content: A Large-scale Dataset and Benchmark