Organizations

@vllm-project

A high-throughput and memory-efficient inference and serving engine for LLMs

Python, 75,030 stars, 15,100 forks, updated Apr 2, 2026

Qwen3-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.

Jupyter Notebook, 18,855 stars, 1,712 forks, updated Jan 30, 2026

FlashMLA: Efficient Multi-head Latent Attention Kernels

C++, 12,549 stars, 1,005 forks, updated Mar 31, 2026

CUDA Templates and Python DSLs for High-Performance Linear Algebra

C++, 9,524 stars, 1,768 forks, updated Apr 2, 2026

A Datacenter Scale Distributed Inference Serving Framework

Rust, 6,470 stars, 985 forks, updated Apr 2, 2026

EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework based on veRL

Python, 4,799 stars, 365 forks, updated Mar 26, 2026

A fast multimodal LLM for real-time voice

Python, 4,389 stars, 370 forks, updated Dec 12, 2025

A framework for efficient model inference with omni-modality models

Python, 4,118 stars, 673 forks, updated Apr 2, 2026

Entropy Based Sampling and Parallel CoT Decoding

Python, 3,430 stars, 320 forks, updated Nov 13, 2024

How to optimize some algorithms in CUDA.

Cuda, 2,905 stars, 267 forks, updated Apr 1, 2026

VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo

Python, 1,791 stars, 171 forks, updated Apr 2, 2026

PyTorch implementation of Transfusion, "Predict the Next Token and Diffuse Images with One Multi-Modal Model", from MetaAI

Python, 1,331 stars, 71 forks, updated Jan 27, 2026

Train speculative decoding models effortlessly and port them smoothly to SGLang serving.

Python, 753 stars, 193 forks, updated Apr 2, 2026

CUDA/Metal accelerated language model inference

C, 632 stars, 31 forks, updated May 29, 2025

Discrete Diffusion Forcing (D2F): dLLMs Can Do Faster-Than-AR Inference

Python, 249 stars, 17 forks, updated Feb 3, 2026

Documentation for content creation

HTML, 233 stars, 22 forks, updated Oct 3, 2025

Manages vllm-nccl dependency

Python, 17 stars, 3 forks, updated Jun 3, 2024

Cross-platform transformer training

Python, 3 stars, 1 fork, updated Nov 14, 2025