- ByteDance
- Shanghai/Shenzhen
- https://www.kaggle.com/shentao
- @SeuTao1
- https://scholar.google.com/citations?user=8cprenoAAAAJ&hl=zh-CN
Stars
[ICLR 2026] Data Pipeline, Models, and Benchmark for Omni-Captioner.
Generate high-definition short videos with one click using large AI models.
Official Code of NAVA: Native Audio-Visual Alignment for Generation.
Unlimited-length talking video generation that supports image-to-video and video-to-video generation
PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion
Code for "OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation"
Lens is a 3.8B-parameter text-to-image diffusion model that achieves quality competitive with and in several cases surpassing models like FLUX and SD3, while requiring significantly less training c…
VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects
Unified Codebase for Advanced World Models.
Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory
Official implementation of Tuna-2: Pixel Embeddings Beat Vision Encoders for Unified Understanding and Generation
ASID-Caption: Attribute-Structured and Quality-Verified Audiovisual Instruction Dataset and Training Pipeline for Fine-Grained Video Understanding.
Implementation of D4RT, Efficiently Reconstructing Dynamic Scenes, from DeepMind
A Distributed Attention Towards Linear Scalability for Ultra-Long Context, Heterogeneous Data Training
A framework for efficient model inference with omni-modality models
Gen-Searcher: Reinforcing Agentic Search for Image Generation
[ICML'26] Code and website for Self-Flow: Self-Supervised Flow Matching for Scalable Multi-Modal Synthesis
AI agents running research on single-GPU nanochat training automatically
Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models
A feed-forward 3D foundation model for reconstructing scenes from streaming data
[NeurIPS 2025 Oral] Representation Entanglement for Generation: Training Diffusion Transformers Is Much Easier Than You Think
[ICLR'25 Oral] Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think
JoyAI-Image is the unified multimodal foundation model for image understanding, text-to-image generation, and instruction-guided image editing.