shenjy0829
🤒 Out sick

Starred repositories

Source files to replicate experiments in my RLC 2025 paper.

Python 7 Updated Aug 22, 2025

Big & Small LLMs working together

Python 1,261 143 Updated Feb 13, 2026

High Performance LLM Inference Operator Library

C++ 723 58 Updated Feb 5, 2026

CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning

Cuda 420 24 Updated Jan 8, 2026

Python 39 3 Updated Jan 2, 2025

Sionna Research Kit: A GPU-Accelerated Research Platform for AI-RAN

Jupyter Notebook 71 13 Updated Dec 19, 2025

Allo Accelerator Design and Programming Framework (PLDI'24)

Python 344 64 Updated Feb 8, 2026

[ArXiv 2025] A curated list of papers on on-device large language models, focusing on model compression and system optimization techniques from the survey "On-Device Large Language Models: A Survey…

21 2 Updated Jan 27, 2026

Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models

Python 3,666 254 Updated Jan 14, 2026

cuTile is a programming model for writing parallel kernels for NVIDIA GPUs

Python 1,926 114 Updated Feb 12, 2026

Tile primitives for speedy kernels

Cuda 3,139 237 Updated Feb 10, 2026

DFlash: Block Diffusion for Flash Speculative Decoding

Python 543 35 Updated Feb 6, 2026

Miles is an enterprise-facing reinforcement learning framework for LLM and VLM post-training, forked from and co-evolving with slime.

Python 873 108 Updated Feb 13, 2026

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Cuda 6,172 819 Updated Feb 3, 2026

DeepEP: an efficient expert-parallel communication library

Cuda 8,980 1,096 Updated Feb 9, 2026

The simplest, fastest repository for training/finetuning medium-sized GPTs.

Python 53,065 8,983 Updated Nov 12, 2025

FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batchsizes of 16-32 tokens.

Python 1,018 86 Updated Sep 4, 2024

verl: Volcano Engine Reinforcement Learning for LLMs

Python 19,192 3,239 Updated Feb 13, 2026

Official inference framework for 1-bit LLMs

Python 28,408 2,328 Updated Feb 3, 2026

This repo releases the detailed benchmark code and results of Sea Labs AI.

Python 13 1 Updated Jan 3, 2026

Exploring how optimizations for GEMMs work

Python 25 3 Updated Jan 1, 2026

ArcticInference: vLLM plugin for high-throughput, low-latency inference

Python 395 48 Updated Feb 10, 2026

Papers from the computer science community to read and discuss.

Shell 103,247 6,269 Updated Oct 10, 2025

THE NEXT FUTURE

Go 4,761 397 Updated Dec 27, 2025

Nano vLLM

Python 11,668 1,568 Updated Nov 3, 2025

A Quirky Assortment of CuTe Kernels

Python 798 80 Updated Feb 11, 2026

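AIInfra (AI infrastructure) covers the AI system stack, from underlying hardware such as chips up to the upper-level software stack, that supports training and inference of large AI models.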

Jupyter Notebook 6,069 827 Updated Dec 22, 2025

CUDA implementation of LDPC decoding algorithm

C++ 42 12 Updated Dec 16, 2020

CrowdPose: Efficient Crowded Scenes Pose Estimation and A New Benchmark, CVPR 2019, Oral

Python 332 39 Updated Jan 26, 2023