Starred repositories
A high-performance weight-synchronization framework for RL workflows, designed to propagate parameter updates from training to inference within seconds
Financial data platform for analysts, quants and AI agents.
[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves a 2-5x speedup over FlashAttention without losing end-to-end metrics across language, image, and video models.
A series of GPU optimization topics covering how to optimize CUDA kernels in detail, including several basic kernel optimizations: elementwise, reduce, s…
Efficient platform for inference and serving of local LLMs, including an OpenAI-compatible API server.
[🔥updating ...] AI-powered automated quantitative trading bot (fully local deployment). AI-powered Quantitative Investment Research Platform. 📃 online docs: https://ufund-me.github.io/Qbot ✨ :news: qbot-mini: https://github.com/Charmve/iQuant
IPTV live-source scraper: automatically aggregates hao趣网, TVBox, and other online live sources, selects the streams with the best resolution and speed, and updates periodically.
[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention
Code base and slides for ECE408: Applied Parallel Programming on GPU.
📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉
Large Language Model Text Generation Inference
Lightning fast C++/CUDA neural network framework
This project aims to reproduce Sora (OpenAI's T2V model); we hope the open-source community will contribute to it.
FlashInfer: Kernel Library for LLM Serving
Full and up-to-date source code of the chapters of the "SFML Game Development" book
How to learn PyTorch and OneFlow