Extreme KV Cache Compression for LLM Inference — C++17/CUDA implementation of TurboQuant (arXiv 2504.19874). 7.5x compression, <2% quality loss.
A powerful, large-scale, multimodal model for text-to-image generation.
LLM primitives rebuilt in Triton — FlashAttention 2.52×, fused AdamW 3.45×, Bias+GELU 14.65× faster than PyTorch
Deep Learning coursework (2025): attention mechanisms (Self/Flash/Linear/Sparse) and OCR with ResNet + Transformer Decoder.
Fused softmax + Flash Attention in OpenAI Triton — 50x VRAM reduction at seq_len=2048
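For context on what "fused softmax" means in practice, below is a minimal row-wise fused softmax sketch in Triton, loosely following the standard Triton tutorial pattern; the kernel and wrapper names are illustrative and not taken from the repository above. The VRAM saving comes from computing the row max, exponentials, and normalization in one pass per row, so no intermediate attention-score tensor is written back to global memory.

```python
# Illustrative sketch only; names (softmax_kernel, softmax) are assumptions,
# not the API of any repository listed on this page.
import torch
import triton
import triton.language as tl

@triton.jit
def softmax_kernel(out_ptr, in_ptr, n_cols, in_row_stride, out_row_stride,
                   BLOCK_SIZE: tl.constexpr):
    # One program instance handles one row of the input matrix.
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    x = tl.load(in_ptr + row * in_row_stride + cols, mask=mask, other=-float("inf"))
    # Numerically stable softmax computed entirely in registers/shared memory.
    x = x - tl.max(x, axis=0)
    num = tl.exp(x)
    den = tl.sum(num, axis=0)
    tl.store(out_ptr + row * out_row_stride + cols, num / den, mask=mask)

def softmax(x: torch.Tensor) -> torch.Tensor:
    n_rows, n_cols = x.shape
    out = torch.empty_like(x)
    BLOCK_SIZE = triton.next_power_of_2(n_cols)  # tl.arange needs a power of 2
    softmax_kernel[(n_rows,)](out, x, n_cols, x.stride(0), out.stride(0),
                              BLOCK_SIZE=BLOCK_SIZE)
    return out
```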
Research foundation for multimodal creative systems: single-stream audio, video, and text modeling for future workflow experiments.
Korean 3B LLM (pure Transformer) pretrained from scratch on 8× NVIDIA B200 GPUs with SFT + ORPO alignment
FlashAttention forward pass from scratch in CUDA C — with Nsight Compute profiling analysis
Decoder-only LLM trained on the Harry Potter books.
Vast.ai-first Qwen 3.5 SFT/LoRA training stack with Unsloth, CLI, and Gradio monitoring.
Extreme-performance Metal kernels for MLX. Optimized for Apple Silicon. Part of the Eco-Metal ecosystem.
🚀 Accelerate attention mechanisms with FlashMLA, featuring optimized kernels for DeepSeek models, enhancing performance through sparse and dense attention.
A minimalist, high-performance GPT implementation in PyTorch, optimized for research and training on the TinyStories dataset.
View GitHub-flavored Markdown files with syntax highlighting, diagrams, and math rendering directly in your browser.
PyTorch implementation of the paper "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness".
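As background on the algorithm that paper describes, here is a short, plain-PyTorch sketch of the tiled forward pass with online (streaming) softmax. It is illustrative only: the function name and block size are assumptions, it omits causal masking and batching, and without a fused kernel it gains none of the memory benefit of the real implementation.

```python
# Illustrative sketch of FlashAttention-style online softmax, not the
# repository's code. Shapes: q, k, v are (seq_len, head_dim).
import torch

def flash_attention_forward(q, k, v, block_size=128):
    seq_len, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)                                   # running weighted sum
    row_max = torch.full((seq_len, 1), float("-inf"),
                         dtype=q.dtype, device=q.device)        # running max per query
    row_sum = torch.zeros(seq_len, 1, dtype=q.dtype, device=q.device)  # running normalizer
    for start in range(0, seq_len, block_size):
        kb = k[start:start + block_size]
        vb = v[start:start + block_size]
        scores = (q @ kb.T) * scale                             # (seq_len, block)
        block_max = scores.max(dim=-1, keepdim=True).values
        new_max = torch.maximum(row_max, block_max)
        correction = torch.exp(row_max - new_max)               # rescale old accumulators
        p = torch.exp(scores - new_max)
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ vb
        row_max = new_max
    return out / row_sum                                        # normalize at the end
```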
⚡ Boost code analysis and comprehension with FastCode, delivering fast, scalable, and cost-efficient solutions for Python projects.
GPU-optimized UL2 mixture-of-denoisers data collator for T5/FLAN encoder-decoder pretraining. Supports span corruption, prefix LM, infilling, curriculum learning, Flash Attention unpadding, and HuggingFace Trainer integration.
Debian-based Docker images for LLM inference on AMD GPUs using ROCm and vLLM.