- 16:30 (UTC -07:00)
- https://rogerw.io
- in/rogerywang
- @rogerw0108
Stars
A high-throughput and memory-efficient inference and serving engine for LLMs
Qwen3-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.
FlashMLA: Efficient Multi-head Latent Attention Kernels
CUDA Templates and Python DSLs for High-Performance Linear Algebra
A Datacenter Scale Distributed Inference Serving Framework
EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework based on veRL
A framework for efficient model inference with omni-modality models
Entropy Based Sampling and Parallel CoT Decoding
How to optimize some algorithms in CUDA.
VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo
PyTorch implementation of Transfusion, "Predict the Next Token and Diffuse Images with One Multi-Modal Model", from Meta AI
Train speculative decoding models effortlessly and port them smoothly to SGLang serving.
Discrete Diffusion Forcing (D2F): dLLMs Can Do Faster-Than-AR Inference