Stars
Multi-V-VM / Codify
Forked from thebaselab/codeapp. Building a full-fledged code editor for iPad.
Disaggregated serving system for Large Language Models (LLMs).
From unknown beginner to Large Language Model (LLM) hero. Stay tuned for future updates!
A low-latency & high-throughput serving engine for LLMs
Implements several LLM KV cache sparsity methods.
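To make "KV cache sparsity" concrete, here is a minimal NumPy sketch of one heavy-hitter-style policy: keep only the cached tokens with the largest accumulated attention mass and evict the rest. The function names and the scoring rule are my own illustrative assumptions, not the repository's actual method.

```python
# Toy heavy-hitter KV-cache pruning: evict low-attention tokens, keep a fixed budget.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def prune_kv_cache(keys, values, attn_history, budget):
    """Keep only the `budget` cached tokens with the largest accumulated
    attention mass (heavy hitters); evict everything else."""
    if keys.shape[0] <= budget:
        return keys, values, attn_history
    keep = np.argsort(attn_history)[-budget:]   # indices of heavy hitters
    keep.sort()                                 # preserve original token order
    return keys[keep], values[keep], attn_history[keep]

# Toy decode step: one query attends over the pruned cache.
d, budget = 64, 128
keys   = np.random.randn(512, d).astype(np.float32)
values = np.random.randn(512, d).astype(np.float32)
attn_history = np.random.rand(512).astype(np.float32)   # running attention scores

keys, values, attn_history = prune_kv_cache(keys, values, attn_history, budget)
q = np.random.randn(d).astype(np.float32)
w = softmax(q @ keys.T / np.sqrt(d))
out = w @ values
attn_history += w                               # update running scores for future evictions
print(keys.shape, out.shape)                    # (128, 64) (64,)
```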
A machine learning accelerator core designed for energy-efficient AI at the edge.
[DATE'2025, TCAD'2025] Terafly: A Multi-Node FPGA-Based Accelerator Design for Efficient Cooperative Inference in LLMs
[DATE'25, ICCAD'25] An embedded FPGA-based LLM accelerator capable of supporting Llama2-7B
Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3 (NeurIPS'25).
A modern GUI client based on Tauri, designed to run on Windows, macOS, and Linux for a tailored proxy experience
Awesome Pruning. ✅ Curated Resources for Neural Network Pruning.
Chinese-language reinforcement learning tutorial (the "Mushroom Book" 🍄); read online at https://datawhalechina.github.io/easy-rl/
[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves speedup of 2-5x compared to FlashAttention, without losing end-to-end metrics across language, image, and video models.
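To illustrate the quantized-attention idea in the entry above, below is a toy NumPy sketch that quantizes Q and K to INT8 with per-tensor scales, computes the score matrix in integer arithmetic, and rescales before the softmax. This is my own simplified illustration under those assumptions, not SageAttention's per-block kernels.

```python
# Toy INT8-quantized attention: integer QK^T scores, float softmax and PV.
import numpy as np

def quantize_int8(x):
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def quantized_attention(Q, K, V):
    d = Q.shape[-1]
    q8, sq = quantize_int8(Q)
    k8, sk = quantize_int8(K)
    # Integer matmul, then undo the two scales and apply 1/sqrt(d).
    scores = (q8.astype(np.int32) @ k8.astype(np.int32).T) * (sq * sk / np.sqrt(d))
    return softmax(scores) @ V                  # softmax and PV kept in floating point here

Q = np.random.randn(16, 64).astype(np.float32)
K = np.random.randn(16, 64).astype(np.float32)
V = np.random.randn(16, 64).astype(np.float32)
ref = softmax(Q @ K.T / np.sqrt(64)) @ V
out = quantized_attention(Q, K, V)
print(np.abs(out - ref).max())                  # small error despite INT8 score computation
```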
LLM quantization (compression) toolkit with hardware-acceleration support for Nvidia CUDA, AMD ROCm, Intel XPU, and Intel/AMD/Apple CPUs via HF, vLLM, and SGLang.
An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
Course Project for High Level Chip Design (高层次芯片设计)
🚀🚀 [LLM] Train a 26M-parameter GPT completely from scratch in just 2 hours! 🌏
🚀🚀🚀 This repository lists some awesome public CUDA, cuda-python, cuBLAS, cuDNN, CUTLASS, TensorRT, TensorRT-LLM, Triton, TVM, MLIR, PTX and High Performance Computing (HPC) projects.
Hand-written interview questions (not LeetCode problems) from LLM roles (the main focus) and search/ads/recommendation AI algorithm roles, e.g. Self-Attention and AUC; they generally test overall ability more than LeetCode does and sit closer to real business needs and fundamentals.
Bringing Language Models to the Most Resource Constrained Devices
OpenVINO™ is an open source toolkit for optimizing and deploying AI inference
Fast inference from large language models via speculative decoding
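Since the last entry centers on speculative decoding, here is a minimal NumPy sketch of its accept/reject rule over toy next-token distributions. `draft_dist` and `target_dist` are stand-ins of my own for a small draft model and the large target model, and real systems verify all drafted tokens in a single target forward pass rather than one at a time as here.

```python
# Toy speculative decoding: draft a token, accept with prob min(1, p/q),
# resample from the adjusted target distribution on the first rejection.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8

def draft_dist(context):            # cheap proposal model (toy: random distribution)
    p = rng.random(VOCAB) + 0.5
    return p / p.sum()

def target_dist(context):           # expensive target model (toy: random distribution)
    p = rng.random(VOCAB) + 0.1
    return p / p.sum()

def speculative_step(context, k=4):
    """Draft up to k tokens; each is accepted with prob min(1, p_target/p_draft),
    and the first rejection triggers a resample from max(p - q, 0) normalized."""
    out = list(context)
    for _ in range(k):
        q = draft_dist(out)
        x = rng.choice(VOCAB, p=q)  # token proposed by the draft model
        p = target_dist(out)
        if rng.random() < min(1.0, p[x] / q[x]):
            out.append(int(x))      # accepted draft token
        else:
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            out.append(int(rng.choice(VOCAB, p=residual)))
            break                   # stop the draft run after a rejection
    return out

print(speculative_step([3, 1], k=4))
```

Because accepted tokens follow the target model's distribution exactly, the scheme trades extra cheap draft-model work for fewer sequential calls to the large model.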