moodist

C 24 7 Updated Dec 9, 2025

Domain-specific language designed to streamline the development of high-performance GPU/CPU/Accelerators kernels

C++ 4,280 353 Updated Dec 22, 2025

Triton Support in Compiler Explorer

TypeScript 5 Updated Aug 5, 2025

A Quirky Assortment of CuTe Kernels

Python 713 64 Updated Dec 22, 2025

Dynamic Instrumentation Tool Platform

C 2,981 602 Updated Dec 22, 2025

FlashInfer: Kernel Library for LLM Serving

Cuda 4,332 613 Updated Dec 23, 2025

Galvatron is an automatic distributed training system designed for Transformer models, including Large Language Models (LLMs).

Python 174 15 Updated Dec 16, 2025

NanoGPT (124M) in 3 minutes

Python 3,998 527 Updated Dec 22, 2025
Python 152 14 Updated Dec 27, 2024

LLM training in simple, raw C/CUDA

Cuda 28,447 3,336 Updated Jun 26, 2025

LLM101n: Let's build a Storyteller

35,926 1,961 Updated Aug 1, 2024

depyf is a tool to help you understand and adapt to PyTorch compiler torch.compile.

Python 770 26 Updated Oct 13, 2025

Tile primitives for speedy kernels

Cuda 3,009 217 Updated Dec 9, 2025

An ML Systems Onboarding list

957 36 Updated Jan 24, 2025

Fine-tuning & Reinforcement Learning for LLMs. 🦥 Train OpenAI gpt-oss, DeepSeek-R1, Qwen3, Gemma 3, TTS 2x faster with 70% less VRAM.

Python 49,763 4,104 Updated Dec 22, 2025

kaldi-asr/kaldi is the official location of the Kaldi project.

Shell 15,276 5,366 Updated Sep 22, 2025

Unified multidimensional array model that collects nonrectangular shapes, advanced indexing, views and sparsity into a single set of composable abstractions

Python 11 Updated Mar 26, 2024

FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batchsizes of 16-32 tokens.

Python 962 82 Updated Sep 4, 2024

Byted PyTorch Distributed for Hyperscale Training of LLMs and RLs

Python 910 53 Updated Nov 27, 2025

A PyTorch native platform for training generative AI models

Python 4,865 648 Updated Dec 23, 2025

CUDA Templates and Python DSLs for High-Performance Linear Algebra

C++ 8,999 1,592 Updated Dec 21, 2025

Training LLMs with QLoRA + FSDP

Jupyter Notebook 1,534 202 Updated Nov 9, 2024

GPU programming related news and material links

1,876 110 Updated Sep 17, 2025

Official inference library for Mistral models

Jupyter Notebook 10,607 1,002 Updated Nov 21, 2025

Distributed Machine Learning Patterns from Manning Publications by Yuan Tang https://bit.ly/2RKv8Zo

Python 482 47 Updated Sep 22, 2025

Interactive deep learning book with multi-framework code, math, and discussions. Adopted at 500 universities from 70 countries including Stanford, MIT, Harvard, and Cambridge.

Python 27,637 4,875 Updated Aug 18, 2024

An Extensible Deep Learning Library

Python 2,303 391 Updated Dec 11, 2025

ROCm Communication Collectives Library (RCCL)

C++ 405 193 Updated Dec 20, 2025

AMD ROCm™ Software - GitHub Home

Shell 6,001 501 Updated Dec 22, 2025

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)

Python 16,343 3,243 Updated Dec 22, 2025