
Fully autonomous & self-evolving research from idea to paper. Chat an Idea. Get a Paper. 🦞

Python · 11,260 stars · 1,290 forks · Updated Apr 10, 2026

Samples for CUDA developers that demonstrate features in the CUDA Toolkit

C · 9,082 stars · 2,318 forks · Updated Mar 30, 2026

FP16×INT4 LLM inference kernel that achieves near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.

Python · 1,057 stars · 86 forks · Updated Sep 4, 2024

📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉

Cuda · 10,288 stars · 1,045 forks · Updated Apr 12, 2026

An NVIDIA-curated collection of educational resources on general-purpose GPU programming.

Jupyter Notebook · 1,474 stars · 258 forks · Updated Apr 14, 2026

News and material links related to GPU programming

2,100 stars · 126 forks · Updated Mar 8, 2026

AIInfra (AI infrastructure) refers to the full AI-system stack, from underlying hardware such as chips up to the software layers that support training and inference of large AI models.

Jupyter Notebook · 6,752 stars · 883 forks · Updated Dec 22, 2025

A summary of notable work on optimizing LLM inference

236 stars · 8 forks · Updated Feb 14, 2026

Artifact for "Marconi: Prefix Caching for the Era of Hybrid LLMs" [MLSys '25 Outstanding Paper Award, Honorable Mention]

Python · 56 stars · 6 forks · Updated Mar 5, 2025

Sharing AI Infra knowledge & code exercises: getting started with the PyTorch/vLLM/SGLang frameworks ⚡️, performance acceleration 🚀, LLM fundamentals 🧠, AI hardware & software 🔧, and more

Jupyter Notebook · 1,688 stars · 136 forks · Updated Apr 8, 2026

Nano vLLM

Python · 12,941 stars · 1,941 forks · Updated Apr 13, 2026
Python · 7 stars · 5 forks · Updated May 28, 2025

A Throughput-Optimized Pipeline Parallel Inference System for Large Language Models

Python · 49 stars · 3 forks · Updated Dec 24, 2025

The official GitHub page for the survey paper "A Survey of Large Language Models".

Python · 12,143 stars · 940 forks · Updated Mar 11, 2025

Model Merging in LLMs, MLLMs, and Beyond: Methods, Theories, Applications and Opportunities. ACM Computing Surveys, 2026.

711 stars · 41 forks · Updated Apr 10, 2026

OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA 2, Qwen, GLM, Claude, etc.) across 100+ datasets.

Python · 6,880 stars · 764 forks · Updated Apr 16, 2026

A compact implementation of SGLang, designed to demystify the complexities of modern LLM serving systems.

Python · 4,002 stars · 579 forks · Updated Mar 13, 2026
Python · 309 stars · 31 forks · Updated Jul 10, 2025

InstAttention: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference

C · 16 stars · 1 fork · Updated Mar 30, 2025
Python · 41 stars · 5 forks · Updated Oct 16, 2025

Persist and reuse the KV cache to speed up your LLM.

Python · 271 stars · 72 forks · Updated Apr 16, 2026
Python · 174 stars · 30 forks · Updated Jul 15, 2025

Source code for the paper "KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing"

Python · 31 stars · 2 forks · Updated Oct 24, 2024

Official code repo for the O'Reilly Book - "Hands-On Large Language Models"

Jupyter Notebook · 25,118 stars · 5,818 forks · Updated Dec 17, 2025

The simplest, fastest repository for training/finetuning medium-sized GPTs.

Python · 56,743 stars · 9,706 forks · Updated Nov 12, 2025

A minimal PyTorch re-implementation of the OpenAI GPT (Generative Pretrained Transformer) training

Python · 24,162 stars · 3,213 forks · Updated Aug 15, 2024

📰 Must-read papers on KV Cache Compression (constantly updating 🤗).

688 stars · 24 forks · Updated Apr 15, 2026

Supercharge Your LLM with the Fastest KV Cache Layer

Python · 8,002 stars · 1,100 forks · Updated Apr 16, 2026

A course on LLM inference serving on Apple Silicon for systems engineers: build a tiny vLLM + Qwen.

Python · 4,094 stars · 308 forks · Updated Apr 13, 2026