Extreme KV Cache Compression for LLM Inference — C++17/CUDA implementation of TurboQuant (arXiv 2504.19874). 7.5x compression, <2% quality loss.
A powerful, large-scale, multimodal model for text-to-image generation.
LLM primitives rebuilt in Triton — FlashAttention 2.52×, fused AdamW 3.45×, Bias+GELU 14.65× faster than PyTorch
Deep Learning coursework (2025): attention mechanisms (Self/Flash/Linear/Sparse) and OCR with ResNet + Transformer Decoder.
Fused softmax + Flash Attention in OpenAI Triton — 50x VRAM reduction at seq_len=2048
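For context on what "fused softmax" means in practice, below is a minimal row-wise fused softmax sketch in Triton, loosely following the standard Triton tutorial pattern; the kernel and wrapper names are illustrative and not taken from the repository above. The VRAM saving comes from computing the row max, exponentials, and normalization in one pass per row, so no intermediate attention-score tensor is written back to global memory.

```python
# Illustrative sketch only; names (softmax_kernel, softmax) are assumptions,
# not the API of any repository listed on this page.
import torch
import triton
import triton.language as tl

@triton.jit
def softmax_kernel(out_ptr, in_ptr, n_cols, in_row_stride, out_row_stride,
                   BLOCK_SIZE: tl.constexpr):
    # One program instance handles one row of the input matrix.
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    x = tl.load(in_ptr + row * in_row_stride + cols, mask=mask, other=-float("inf"))
    # Numerically stable softmax computed entirely in registers/shared memory.
    x = x - tl.max(x, axis=0)
    num = tl.exp(x)
    den = tl.sum(num, axis=0)
    tl.store(out_ptr + row * out_row_stride + cols, num / den, mask=mask)

def softmax(x: torch.Tensor) -> torch.Tensor:
    n_rows, n_cols = x.shape
    out = torch.empty_like(x)
    BLOCK_SIZE = triton.next_power_of_2(n_cols)  # tl.arange needs a power of 2
    softmax_kernel[(n_rows,)](out, x, n_cols, x.stride(0), out.stride(0),
                              BLOCK_SIZE=BLOCK_SIZE)
    return out
```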
Research foundation for multimodal creative systems: single-stream audio, video, and text modeling for future workflow experiments.
Korean 3B LLM (pure Transformer) pretrained from scratch on 8× NVIDIA B200 GPUs with SFT + ORPO alignment
FlashAttention forward pass from scratch in CUDA C — with Nsight Compute profiling analysis
Decoder-only LLM trained on the Harry Potter books.
Vast.ai-first Qwen 3.5 SFT/LoRA training stack with Unsloth, CLI, and Gradio monitoring.
Extreme-performance Metal kernels for MLX. Optimized for Apple Silicon. Part of the Eco-Metal ecosystem.
🚀 Accelerate attention mechanisms with FlashMLA, featuring optimized kernels for DeepSeek models, enhancing performance through sparse and dense attention.
A minimalist, high-performance GPT implementation in PyTorch, optimized for research and training on the TinyStories dataset.
View GitHub-flavored Markdown files with syntax highlighting, diagrams, and math rendering directly in your browser.
PyTorch implementation of the paper "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness".
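As background on the algorithm that paper describes, here is a short, plain-PyTorch sketch of the tiled forward pass with online (streaming) softmax. It is illustrative only: the function name and block size are assumptions, it omits causal masking and batching, and without a fused kernel it gains none of the memory benefit of the real implementation.

```python
# Illustrative sketch of FlashAttention-style online softmax, not the
# repository's code. Shapes: q, k, v are (seq_len, head_dim).
import torch

def flash_attention_forward(q, k, v, block_size=128):
    seq_len, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)                                   # running weighted sum
    row_max = torch.full((seq_len, 1), float("-inf"),
                         dtype=q.dtype, device=q.device)        # running max per query
    row_sum = torch.zeros(seq_len, 1, dtype=q.dtype, device=q.device)  # running normalizer
    for start in range(0, seq_len, block_size):
        kb = k[start:start + block_size]
        vb = v[start:start + block_size]
        scores = (q @ kb.T) * scale                             # (seq_len, block)
        block_max = scores.max(dim=-1, keepdim=True).values
        new_max = torch.maximum(row_max, block_max)
        correction = torch.exp(row_max - new_max)               # rescale old accumulators
        p = torch.exp(scores - new_max)
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ vb
        row_max = new_max
    return out / row_sum                                        # normalize at the end
```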
⚡ Boost code analysis and comprehension with FastCode, delivering fast, scalable, and cost-efficient solutions for Python projects.
GPU-optimized UL2 mixture-of-denoisers data collator for T5/FLAN encoder-decoder pretraining. Supports span corruption, prefix LM, infilling, curriculum learning, Flash Attention unpadding, and HuggingFace Trainer integration.
Debian-based Docker images for LLM inference on AMD GPUs using ROCm and vLLM.