- Cambridge, Massachusetts, United States
- http://kentang.net
Stars
VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation
SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer
HART: Efficient Visual Generation with Hybrid Autoregressive Transformer
A sparse attention kernel supporting mixed sparse patterns
PyTorch implementation of MAR+DiffLoss https://arxiv.org/abs/2406.11838
[ICML 2024] LLM and Simulation as Bilevel Optimizers: A New Paradigm to Advance Physical Scientific Discovery
Code for the paper DisCo-Diff: Enhancing Continuous Diffusion Models with Discrete Latents, ICML 2024
[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
Open-MAGVIT2: Democratizing Autoregressive Visual Generation
Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
Autoregressive Model Beats Diffusion: 🦙 Llama for Scalable Image Generation
(NeurIPS 2024 Oral 🔥) Improved Distribution Matching Distillation for Fast Image Synthesis
LaVIT: Empower the Large Language Model to Understand and Generate Visual Content
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
PyTorch emulation library for Microscaling (MX)-compatible data formats
[NeurIPS 2024 Oral][GPT beats diffusion🔥] [scaling laws in visual generation📈] Official impl. of "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction". An *ultra-sim…
Code for QuaRot, end-to-end 4-bit inference for large language models.
Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models"
VILA - a multi-image visual language model with training, inference and evaluation recipe, deployable from cloud to edge (Jetson Orin and laptops)
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models (ECCV 2024)
Code repository for Zero123++: a Single Image to Consistent Multi-view Diffusion Base Model.
ACL 2024 | LooGLE: Long Context Evaluation for Long-Context Language Models