robertluo1 (Tsinghua University, Beijing) - robertluo1.github.io
Starred repositories
The repository provides code for running inference with the Meta Segment Anything Model 2 (SAM 2), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.
Qwen3-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.
[NeurIPS 2024 Best Paper Award][GPT beats diffusion🔥] [scaling laws in visual generation📈] Official impl. of "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction". A…
CoreNet: A library for training deep neural networks
Taming Transformers for High-Resolution Image Synthesis
Chinese version of CLIP which achieves Chinese cross-modal retrieval and representation generation.
OmniGen: Unified Image Generation. https://arxiv.org/pdf/2409.11340
Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding
OmniGen2: Exploration to Advanced Multimodal Generation. https://arxiv.org/abs/2506.18871
Qwen2.5-Omni is an end-to-end multimodal model by Qwen team at Alibaba Cloud, capable of understanding text, audio, vision, video, and performing real-time speech generation.
Qwen3-Omni is a natively end-to-end, omni-modal LLM developed by the Qwen team at Alibaba Cloud, capable of understanding text, audio, images, and video, as well as generating speech in real time.
The hub for EleutherAI's work on interpretability and learning dynamics
An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.
A suite of image and video neural tokenizers
An easy/swift-to-adapt PyTorch Lightning template. (A wrapper template, simple and easy to use: with minor changes to your original PyTorch code, it adapts to Lightning.) You can port your previous PyTorch code much more easily using this template, and keep your freedom to edit a…
Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning, achieving state-of-the-art performance on 38 out of 60 public benchmarks.
This repo contains the code for a 1D tokenizer and generator
Codebase for Aria - an Open Multimodal Native MoE
The official implementation of Autoregressive Image Generation using Residual Quantization (CVPR '22)
A linear estimator on top of CLIP to predict the aesthetic quality of pictures
Official Jax Implementation of MaskGIT
Fine-Tuning Embedding for RAG with Synthetic Data
Official repository of "GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing"
AudioStory: Generating Long-Form Narrative Audio with Large Language Models
FlexTok: Resampling Images into 1D Token Sequences of Flexible Length
[NeurIPS 2024] CV-VAE: A Compatible Video VAE for Latent Generative Video Models