diNo Research Group, LIPADE Lab
Paris
https://amy-77.github.io/
in/yanlin-qi-456177268
Stars
Understand and test language model architectures on synthetic tasks.
[ICLR 2024] Efficient Streaming Language Models with Attention Sinks
SkyRL: A Modular Full-stack RL Library for LLMs
Code for the paper "∞Bench: Extending Long Context Evaluation Beyond 100K Tokens": https://arxiv.org/abs/2402.13718
DeepSeek 4 Flash and PRO local inference engine for Metal, CUDA and ROCm
Fast CUDA matrix multiplication from scratch
Implementation for IceCache: Memory-Efficient KV-cache Management for Long-Sequence LLMs (ICLR 2026).
🚀🚀 Efficient implementations of Native Sparse Attention
Triton kernels and PyTorch ops for Block Attention Residuals (AttnRes)
A clean, modular SDK for building AI agents with OpenHands V1.
High-performance LLM operator library built on TileLang.
LMCache: Supercharge Your LLM with the Fastest KV Cache Layer
High-performance linear attention kernel library built on TileLang
Why should the next employee you distill be a colleague? Distill how anyone thinks: their mental models, decision heuristics, and expressive DNA.
A domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels
A kernel library written in tilelang
FlashKDA: high-performance Kimi Delta Attention kernels
RAFT contains fundamental, widely used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form building blocks for more easily writing …
TurboQuant reference implementation — KV cache compression with engineering insights (ICLR 2026 paper reproduction)
A CUDA kernel optimization toolkit for validation, benchmarking, Nsight Compute profiling, bottleneck analysis, and iterative tuning. It helps improve custom GPU operators with reproducible workflo…
A super AI lab with a large team of AI "doctors" as assistants. The best IDE for research, powered by AI.
An official lightweight library for the RaBitQ algorithm and its applications in vector search.
TurboQuant: Near-optimal KV cache quantization for LLM inference (3-bit keys, 2-bit values) with Triton kernels + vLLM integration