XuezheMax

💭 I may be slow to respond.

Highlights

  • Pro

Organizations

@asyml

Repositories

torchcomms: a modern PyTorch communications API

C++ · 310 stars · 47 forks · Updated Dec 21, 2025

Memory optimized Mixture of Experts

Python · 72 stars · 6 forks · Updated Jul 25, 2025

PyTorch bindings for CUTLASS grouped GEMM.

CUDA · 135 stars · 79 forks · Updated May 29, 2025
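"Grouped GEMM" here means running a batch of independent matrix multiplications with potentially different shapes (e.g., one per MoE expert) in a single fused kernel launch. As a reference for the semantics only, not the bindings' actual API, it reduces to a loop of matmuls:

```python
import torch

def grouped_gemm_reference(As, Bs):
    # Semantics of a grouped GEMM: one matmul per (A, B) problem, where
    # each problem may have a different shape (e.g., per-expert token
    # counts in an MoE layer). A real CUTLASS kernel fuses these into a
    # single launch; this naive loop only shows what is computed.
    return [A @ B for A, B in zip(As, Bs)]

# Toy usage: three "experts" receiving different numbers of routed tokens.
As = [torch.randn(n, 64) for n in (5, 17, 2)]
Bs = [torch.randn(64, 128) for _ in range(3)]
outs = grouped_gemm_reference(As, Bs)
print([o.shape for o in outs])
```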

Simple & Scalable Pretraining for Neural Architecture Research

Python · 305 stars · 32 forks · Updated Dec 6, 2025

The official Meta Llama 3 GitHub site

Python · 29,144 stars · 3,501 forks · Updated Jan 26, 2025

Python · 565 stars · 56 forks · Updated Sep 23, 2025

Fast trainer for educational purposes

Python · 22 stars · 12 forks · Updated Nov 26, 2025

Efficient Triton Kernels for LLM Training

Python · 5,962 stars · 452 forks · Updated Dec 20, 2025

UCCL is an efficient communication library for GPUs, covering collectives, P2P (e.g., KV-cache transfer, RL weight transfer), and expert parallelism (EP, e.g., GPU-driven communication)

C++ · 1,132 stars · 105 forks · Updated Dec 21, 2025

Reverse Engineering Gemma 3n: Google's New Edge-Optimized Language Model

Python · 254 stars · 17 forks · Updated May 27, 2025

DeepEP: an efficient expert-parallel communication library

CUDA · 8,820 stars · 1,035 forks · Updated Dec 5, 2025

MoBA: Mixture of Block Attention for Long-Context LLMs

Python · 2,017 stars · 128 forks · Updated Apr 3, 2025
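MoBA's core idea can be summarized in a few lines: pool each block of keys into a single representative, gate the query against those representatives, and attend only within the top-k selected blocks. Below is a minimal, non-causal sketch of that idea; the function name and shapes are illustrative assumptions, not the repo's API:

```python
import torch
import torch.nn.functional as F

def moba_attend(q, K, V, block=4, topk=2):
    # Sketch of block-sparse attention in the spirit of MoBA: score each
    # KV block by its mean-pooled key, then attend only within the top-k
    # blocks (causal masking from the paper is omitted for brevity).
    n, d = K.shape
    Kb = K.view(n // block, block, d)
    Vb = V.view(n // block, block, d)
    gate = Kb.mean(dim=1) @ q                 # (num_blocks,) block relevance
    sel = gate.topk(topk).indices             # indices of the chosen blocks
    Ks, Vs = Kb[sel].reshape(-1, d), Vb[sel].reshape(-1, d)
    attn = F.softmax(Ks @ q / d ** 0.5, dim=0)
    return attn @ Vs

q = torch.randn(64)
K, V = torch.randn(16, 64), torch.randn(16, 64)
print(moba_attend(q, K, V).shape)             # torch.Size([64])
```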

🐳 Efficient Triton implementations for "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention"

Python · 943 stars · 48 forks · Updated Mar 19, 2025

HIPIFY: Convert CUDA to Portable C++ Code

C++ · 637 stars · 101 forks · Updated Dec 21, 2025

Long context evaluation for large language models

Python · 224 stars · 23 forks · Updated Mar 3, 2025

Fast and memory-efficient exact attention

Python · 21,207 stars · 2,235 forks · Updated Dec 20, 2025

Official PyTorch implementation of Learning to (Learn at Test Time): RNNs with Expressive Hidden States

Python · 1,293 stars · 85 forks · Updated Jul 14, 2024
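The paper behind the entry above treats the RNN hidden state as a small model whose weights are updated by a self-supervised gradient step on each token. Here is a minimal sketch of one such step, assuming a linear inner model and a reconstruction loss; names like `ttt_linear_step` and the projections `theta_k/v/q` are illustrative, not the repo's API:

```python
import torch

def ttt_linear_step(W, x, theta_k, theta_v, theta_q, lr=0.1):
    # The hidden state W is itself a weight matrix, updated at test time
    # by one gradient step on a self-supervised loss; the update IS the
    # recurrence, and the output is read out with the updated state.
    k, v, q = x @ theta_k, x @ theta_v, x @ theta_q
    W = W.detach().requires_grad_(True)
    loss = ((k @ W - v) ** 2).sum()           # reconstruct the "value" view
    (grad,) = torch.autograd.grad(loss, W)
    W = (W - lr * grad).detach()              # learn from the current token
    return q @ W, W

d = 16
W = torch.zeros(d, d)
theta_k, theta_v, theta_q = (torch.randn(d, d) / d ** 0.5 for _ in range(3))
for x in torch.randn(8, 1, d):                # a toy 8-token sequence
    y, W = ttt_linear_step(W, x, theta_k, theta_v, theta_q)
```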

PyTorch implementation of the Google paper "Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention"

Python · 58 stars · 5 forks · Updated Dec 21, 2025

Unofficial PyTorch/🤗Transformers(Gemma/Llama3) implementation of Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention

Python · 375 stars · 34 forks · Updated Apr 23, 2024
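Both Infini-attention entries above implement the same mechanism from the paper: a compressive memory that is read and written linear-attention style alongside ordinary local attention. A minimal sketch of the memory read/write, assuming the paper's ELU+1 feature map; the function name is illustrative, not either repo's API:

```python
import torch
import torch.nn.functional as F

def infini_memory_step(M, z, Q, K, V):
    # Compressive memory per the Infini-attention paper. M: (d_k, d_v)
    # associative memory, z: (d_k,) normalizer; Q, K: (n, d_k), V: (n, d_v).
    sQ, sK = F.elu(Q) + 1, F.elu(K) + 1                # positive feature map
    A_mem = (sQ @ M) / (sQ @ z).unsqueeze(-1).clamp_min(1e-6)  # retrieval
    M = M + sK.transpose(-2, -1) @ V                   # write new bindings
    z = z + sK.sum(dim=-2)                             # update normalizer
    return A_mem, M, z

d_k, d_v, n = 8, 8, 4
M, z = torch.zeros(d_k, d_v), torch.zeros(d_k)
for _ in range(3):                                     # successive segments
    Q, K, V = torch.randn(n, d_k), torch.randn(n, d_k), torch.randn(n, d_v)
    A_mem, M, z = infini_memory_step(M, z, Q, K, V)
```

In the paper, the retrieved `A_mem` is then mixed with local dot-product attention through a learned sigmoid gate.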

Cambrian-1 is a family of multimodal LLMs with a vision-centric design.

Python · 1,975 stars · 134 forks · Updated Nov 7, 2025

🚀 Efficient implementations of state-of-the-art linear attention models

Python · 4,094 stars · 334 forks · Updated Dec 20, 2025

Building blocks for foundation models.

584 stars · 28 forks · Updated Jan 3, 2024

📰 Must-read papers and blogs on LLM based Long Context Modeling 🔥

1,850 stars · 78 forks · Updated Dec 6, 2025

This project extends the Kolmogorov-Arnold Network (KAN) architecture to convolutional layers, changing the classic linear transformation of the convolution to learna…

Jupyter Notebook · 910 stars · 97 forks · Updated Apr 8, 2025

Tile primitives for speedy kernels

CUDA · 3,008 stars · 217 forks · Updated Dec 9, 2025

An efficient pure-PyTorch implementation of Kolmogorov-Arnold Network (KAN).

Python · 4,549 stars · 408 forks · Updated Aug 1, 2024

Kolmogorov-Arnold Networks (KAN) using Chebyshev polynomials instead of B-splines.

Jupyter Notebook · 399 stars · 42 forks · Updated May 13, 2024
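The entry above swaps KAN's B-spline bases for Chebyshev polynomials, which are cheap to evaluate via the recurrence T_{n+1}(x) = 2x·T_n(x) − T_{n−1}(x) after squashing inputs into [-1, 1]. A minimal sketch of such a layer; class and parameter names are illustrative assumptions, not the notebook's code:

```python
import torch
import torch.nn as nn

class ChebyKANLayer(nn.Module):
    # Sketch of a KAN layer with Chebyshev bases: every input-output edge
    # carries a learnable univariate function expressed as a degree-K
    # Chebyshev expansion (degree >= 1 assumed).
    def __init__(self, in_dim, out_dim, degree=4):
        super().__init__()
        self.degree = degree
        self.coef = nn.Parameter(torch.randn(in_dim, out_dim, degree + 1)
                                 / (in_dim * (degree + 1)) ** 0.5)

    def forward(self, x):
        x = torch.tanh(x)                     # map inputs into [-1, 1]
        T = [torch.ones_like(x), x]           # T0, T1
        for _ in range(2, self.degree + 1):
            T.append(2 * x * T[-1] - T[-2])   # Chebyshev recurrence
        T = torch.stack(T, dim=-1)            # (batch, in_dim, degree+1)
        # y_j = sum_i sum_k coef[i, j, k] * T_k(x_i)
        return torch.einsum('bik,iok->bo', T, self.coef)

layer = ChebyKANLayer(3, 2)
print(layer(torch.randn(5, 3)).shape)         # torch.Size([5, 2])
```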

Kolmogorov-Arnold Networks

Jupyter Notebook · 16,056 stars · 1,539 forks · Updated Jan 19, 2025

Python · 749 stars · 62 forks · Updated May 24, 2024

Reference implementation of the Megalodon 7B model

CUDA · 527 stars · 54 forks · Updated May 17, 2025