Stars
MoE training for Me and You and maybe other people
My learning notes on ML systems (MLSys).
A framework for evaluating autoregressive code-generation language models.
Code for the paper "Efficient Training of Language Models to Fill in the Middle"
Code for the paper "Evaluating Large Language Models Trained on Code"
Source code for RULER: What’s the Real Context Size of Your Long-Context Language Models?
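For flavor, a toy single-needle retrieval probe in the spirit of RULER's synthetic tasks (RULER's actual generators also cover multi-needle, multi-hop, aggregation, and QA variants):

```python
import random

def make_niah_prompt(n_filler: int, rng: random.Random):
    """Toy needle-in-a-haystack prompt: hide one fact in repetitive
    filler and ask the model to retrieve it. Illustrative only."""
    answer = str(rng.randint(100000, 999999))
    needle = f"One of the special magic numbers is {answer}."
    filler = ["The grass is green. The sky is blue."] * n_filler
    filler.insert(rng.randrange(len(filler) + 1), needle)
    return " ".join(filler) + "\nWhat is the special magic number?", answer

prompt, answer = make_niah_prompt(n_filler=2000, rng=random.Random(0))
```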
Official repository for LiteTracker: Leveraging Temporal Causality for Accurate Low-latency Tissue Tracking; published at MICCAI 2025.
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
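Fine-grained scaling means each small group of elements carries its own FP8 scale rather than one scale per tensor, so a single outlier can't wreck precision everywhere. A pure-PyTorch reference sketch of the idea (not DeepGEMM's API; requires PyTorch ≥ 2.1 for the float8 dtype):

```python
import torch

def quantize_fp8_groups(x: torch.Tensor, group: int = 128):
    """Per-group FP8 quantization sketch: each contiguous group of
    `group` values along the last dim gets its own scale."""
    xg = x.reshape(*x.shape[:-1], -1, group)
    scale = (xg.abs().amax(-1, keepdim=True) / 448.0).clamp(min=1e-12)  # e4m3 max
    return (xg / scale).to(torch.float8_e4m3fn), scale

def ref_matmul(qa, sa, qb, sb):
    # Dequantize-then-matmul reference path; a real kernel keeps operands
    # in FP8 for the tensor cores and folds the scales into the epilogue.
    a = (qa.to(torch.float32) * sa).flatten(-2)
    b = (qb.to(torch.float32) * sb).flatten(-2)
    return a @ b.T

a, b = torch.randn(64, 256), torch.randn(32, 256)
out = ref_matmul(*quantize_fp8_groups(a), *quantize_fp8_groups(b))
print((out - a @ b.T).abs().max())  # small quantization error
```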
Tensors and Dynamic neural networks in Python with strong GPU acceleration
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
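A minimal initialization sketch, assuming a recent DeepSpeed; the config values are illustrative, not prescriptive:

```python
import torch
import deepspeed

model = torch.nn.Linear(1024, 1024)

ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "zero_optimization": {"stage": 2},  # shard optimizer state and gradients
    "bf16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}

# Typically run under the launcher, e.g.: deepspeed train.py
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
# Training step shape: loss = engine(batch); engine.backward(loss); engine.step()
```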
A small protein language model based on nanochat.
PyTorch native quantization and sparsity for training and inference
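A usage sketch of torchao's one-line quantization entry point; the names follow its README at the time of writing and may have shifted between releases:

```python
import torch
from torchao.quantization import quantize_, int8_weight_only

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 4096)
).to(torch.bfloat16)

# Swap the Linear weights to int8 in place; activations stay bf16.
quantize_(model, int8_weight_only())

x = torch.randn(1, 4096, dtype=torch.bfloat16)
print(model(x).shape)  # torch.Size([1, 4096])
```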
FlashAttention implemented with metal-cpp headers.
The simplest, fastest repository for training and finetuning small VLMs.
A PyTorch native platform for training generative AI models
PyTorch building blocks for the OLMo ecosystem
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit and 4-bit floating point (FP8 and FP4) precision on Hopper, Ada and Blackwell GPUs, to provide better performance…
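A single-layer sketch of the FP8 path, following the pattern in Transformer Engine's docs (requires an FP8-capable GPU such as Hopper or later):

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Delayed-scaling FP8 recipe, as in TE's quickstart examples.
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.E4M3)

layer = te.Linear(4096, 4096, bias=True).cuda()
x = torch.randn(16, 4096, device="cuda", dtype=torch.bfloat16)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)
print(y.shape)
```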
Fault tolerance for PyTorch (HSDP, LocalSGD, DiLoCo, Streaming DiLoCo)
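To make the DiLoCo term concrete, here is an algorithm-level sketch of the outer step from the DiLoCo paper, written against plain PyTorch rather than torchft's actual API:

```python
import copy
import torch

def diloco_outer_step(global_model, local_models, outer_opt):
    """Average the workers' weights, treat their drift from the global
    weights as a pseudo-gradient, and apply it with an outer optimizer
    (Nesterov SGD in the paper)."""
    with torch.no_grad():
        for p_g, *p_locals in zip(global_model.parameters(),
                                  *(m.parameters() for m in local_models)):
            avg = torch.stack([p.detach() for p in p_locals]).mean(0)
            p_g.grad = p_g.detach() - avg  # pseudo-gradient
    outer_opt.step()
    outer_opt.zero_grad()
    for m in local_models:  # re-sync workers from the new global weights
        m.load_state_dict(global_model.state_dict())

global_model = torch.nn.Linear(8, 8)
workers = [copy.deepcopy(global_model) for _ in range(4)]
outer_opt = torch.optim.SGD(global_model.parameters(),
                            lr=0.7, momentum=0.9, nesterov=True)
# ...each worker runs H local inner steps on its own data shard, then:
diloco_outer_step(global_model, workers, outer_opt)
```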
A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate.
CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for on-device inference and featuring cutting-edge techniques in sparse architectures, speculative sampling, and quantization.
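Of those techniques, speculative sampling is easy to sketch in isolation. The verification rule below follows the standard algorithm (Leviathan et al.) and is independent of CPM.cu's CUDA implementation:

```python
import torch

def verify_draft(target_probs, draft_probs, draft_tokens):
    """One speculative-sampling verification pass.

    target_probs: (k+1, vocab) target-model distributions
    draft_probs:  (k, vocab)   draft-model distributions
    draft_tokens: length-k list of token ids proposed by the draft model
    Returns (accepted_tokens, next_token_id).
    """
    for i, t in enumerate(draft_tokens):
        p, q = target_probs[i], draft_probs[i]
        if torch.rand(()) < torch.clamp(p[t] / q[t], max=1.0):
            continue  # accept draft token i, keep verifying
        # First rejection: resample from the normalized residual max(p-q, 0).
        residual = torch.clamp(p - q, min=0.0)
        return draft_tokens[:i], torch.multinomial(residual / residual.sum(), 1).item()
    # All k drafts accepted: take a free token from the target's extra dist.
    return draft_tokens, torch.multinomial(target_probs[-1], 1).item()
```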
A machine learning accelerator core designed for energy-efficient AI at the edge.