🎯
Focusing is all you need


gLLM: Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling

Python 41 3 Updated Sep 29, 2025

FlashMLA: Efficient Multi-head Latent Attention Kernels

C++ 11,800 909 Updated Sep 30, 2025
Python 4,296 408 Updated Sep 14, 2025

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.

Python 7,149 610 Updated Oct 10, 2025

Official PyTorch implementation for "Large Language Diffusion Models"

Python 3,019 201 Updated Sep 30, 2025

Standardized Distributed Generative and Predictive AI Inference Platform for Scalable, Multi-Framework Deployment on Kubernetes

Python 4,629 1,268 Updated Oct 10, 2025

llm-d enables high-performance distributed LLM inference on Kubernetes

Makefile 1,860 189 Updated Oct 9, 2025

Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers".

Python 2,195 182 Updated Mar 27, 2024

Kimi K2 is the large language model series developed by Moonshot AI team

8,317 548 Updated Sep 11, 2025

verl: Volcano Engine Reinforcement Learning for LLMs

Python 14,157 2,523 Updated Oct 10, 2025

Mirage Persistent Kernel: Compiling LLMs into a MegaKernel

C++ 1,874 133 Updated Oct 10, 2025

[ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

Python 494 34 Updated Feb 10, 2025

The official repo of Qwen (通义千问) chat & pretrained large language model proposed by Alibaba Cloud.

Python 19,476 1,623 Updated Sep 30, 2025

Supercharge Your LLM with the Fastest KV Cache Layer

Python 5,508 630 Updated Oct 10, 2025

Distributed Compiler based on Triton for Parallel Systems

Python 1,161 103 Updated Oct 2, 2025

The official code for the paper: LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs

Python 109 1 Updated Jul 1, 2025

[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves speedup of 2-5x compared to FlashAttention, without losing end-to-end metrics across language, image, and video models.

Cuda 2,502 238 Updated Oct 8, 2025

FlexFlow Serve: Low-Latency, High-Performance LLM Serving

C++ 63 5 Updated Sep 15, 2025

[NeurIPS 2025] MMaDA - Open-Sourced Multimodal Large Diffusion Language Models

Python 1,422 71 Updated Sep 19, 2025

A bidirectional pipeline parallelism algorithm for computation-communication overlap in DeepSeek V3/R1 training.

Python 2,868 304 Updated Mar 10, 2025

Analyze computation-communication overlap in V3/R1.

1,102 143 Updated Mar 21, 2025

Expert Parallelism Load Balancer

Python 1,276 196 Updated Mar 24, 2025

DeepEP: an efficient expert-parallel communication library

Cuda 8,590 948 Updated Oct 10, 2025

A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations

Python 15,152 1,091 Updated Oct 10, 2025

Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation

7,921 284 Updated May 15, 2025

My learning notes/codes for ML SYS.

Python 3,832 232 Updated Oct 6, 2025

A markdown version emoji cheat sheet

TypeScript 13,386 4,563 Updated Oct 10, 2025

Get up and running with OpenAI gpt-oss, DeepSeek-R1, Gemma 3 and other models.

Go 153,850 13,360 Updated Oct 10, 2025

Materials for learning SGLang

597 48 Updated Oct 1, 2025

High performance Transformer implementation in C++.

C++ 135 16 Updated Jan 18, 2025