- Nanjing University
- Nanjing
- http://wanglimin.github.io
Stars
[NeurIPS 2025] LongVPO: From Anchored Cues to Self-Reasoning for Long-Form Video Preference Optimization
SteadyDancer: Harmonized and Coherent Human Image Animation with First-Frame Preservation
UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions
[AAAI 2026] Flowing Backwards: Improving Normalizing Flows via Reverse Representation Alignment
[ICCV 2025] MobileViCLIP: An Efficient Video-Text Model for Mobile Devices
[NeurIPS 2025 Spotlight] StreamForest: Efficient Online Video Understanding with Persistent Event Memory
[ICML 2025] Differentiable Solver Search for Fast Diffusion Sampling
DMM: Building a Versatile Image Generation Model via Distillation-Based Model Merging
[TPAMI] JointFormer: A Unified Framework with Joint Modeling for Video Object Segmentation
[NeurIPS 2025] VideoChat-R1 & R1.5: Enhancing Spatio-Temporal Perception and Reasoning via Reinforcement Fine-Tuning
[CVPR 2025] Tra-MoE: Learning Trajectory Prediction Model from Multiple Domains for Adaptive Policy Conditioning
[CVPR 2025] Online Video Understanding: OVBench and VideoChat-Online
[WACV 2025 Oral] Transferring Foundation Models for Generalizable Robotic Manipulation
[ICLR 2026] VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling
[ICCV 2025] p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay
[NeurIPS 2024] Exploring DCN-like Architectures for Fast Image Generation with Arbitrary Resolution
[ECCV 2024] ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video
VideoEval: Comprehensive Benchmark Suite for Low-Cost Evaluation of Video Foundation Model
[ECCV 2024 Oral] SPLAM: Accelerating Image Generation with Sub-path Linear Approximation Model
[CVPR 2024] Adapting Short-Term Transformers for Action Detection in Untrimmed Videos
[ECCV 2024] Fully Sparse 3D Occupancy Prediction & RayIoU Evaluation Metric
[CVPR 2024] Sparse Global Matching for Video Frame Interpolation with Large Motion
[NeurIPS 2024] VFIMamba: Video Frame Interpolation with State Space Models
[TPAMI 2024] Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding
[CVPR 2024] SportsHHI: A Dataset for Human-Human Interaction Detection in Sports Videos
[CVPR 2024] BIVDiff: A Training-free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models
[ECCV 2024] VideoMamba: State Space Model for Efficient Video Understanding