quaternior

Organizations

@AIDASLab


Codebase for Coruscant: Co-Designing GPU Kernel and Sparse Tensor Core to Advocate Unstructured Sparsity in Efficient LLM Inference

Cuda 3 Updated Oct 17, 2025

Codebase for MUSTAFAR: Promoting Unstructured Sparsity for KV Pruning in LLM Inference

Python 8 2 Updated May 30, 2025

Summary of some awesome work for optimizing LLM inference

134 5 Updated Nov 2, 2025

From a+b to sparsemax(QK^T)V in Triton!

Jupyter Notebook 27 Updated Jun 19, 2025

A comprehensive list of papers about Large-Language-Diffusion-Models.

23 4 Updated Nov 4, 2025

Material for gpu-mode lectures

Jupyter Notebook 5,249 524 Updated Sep 23, 2025

Several optimization methods of half-precision general matrix vector multiplication (HGEMV) using CUDA core.

Cuda 68 7 Updated Sep 8, 2024

CUDA Core Compute Libraries

C++ 2,007 285 Updated Nov 5, 2025

High-speed GEMV kernels, with up to 2.7x speedup over the PyTorch baseline.

Cuda 120 7 Updated Jul 13, 2024

📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉

Cuda 8,317 823 Updated Oct 17, 2025
Cuda 28 2 Updated Apr 2, 2025

Curated collection of papers in MoE model inference

297 11 Updated Oct 20, 2025

Awesome LLM Books: Curated list of books on Large Language Models

1,061 161 Updated Oct 24, 2025

A curated list of neural network pruning resources.

2,481 332 Updated Apr 4, 2024

[NeurIPS 2024] A Generalizable World Model for Autonomous Driving

Python 809 58 Updated Jul 2, 2025

A curated list for Efficient Large Language Models

Python 1,891 144 Updated Jun 17, 2025
Python 10 1 Updated Sep 20, 2024

A low-latency & high-throughput serving engine for LLMs

Python 436 58 Updated Oct 16, 2025

Fast and memory-efficient exact attention

Python 20,352 2,113 Updated Nov 5, 2025
Python 16 1 Updated Jun 11, 2025

Official repository for the paper "SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention"

Python 100 7 Updated Sep 30, 2024

A framework for few-shot evaluation of language models.

Python 10,528 2,829 Updated Oct 29, 2025

Sirius, an efficient correction mechanism that significantly boosts contextual sparsity models on reasoning tasks while maintaining their efficiency gains.

Python 21 3 Updated Sep 10, 2024

🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.

Python 152,116 31,045 Updated Nov 5, 2025

Sample codes for my CUDA programming book

Cuda 1,922 375 Updated Feb 15, 2025
Python 345 44 Updated Apr 2, 2024

[Inflearn] Operating systems "dinosaur book" lectures, summarized

C 292 25 Updated May 7, 2024