- iluvatar.ai
- Cloud Security City, Nanjing
- https://scholar.google.com/citations?hl=zh-CN&user=I5UcdEYAAAAJ&view_op=list_works&sortby=pubdate#
Stars
torchcomms: a modern PyTorch communications API
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
Simple and efficient PyTorch-native transformer text generation in <1000 LOC of Python.
A PyTorch native platform for training generative AI models
[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves a 2-5x speedup over FlashAttention without losing end-to-end metrics across language, image, and video models.
DeepSparkInference has selected 216 inference models of both small and large sizes. The small models cover fields such as computer vision, natural language processing, and speech recognition; the L…
The DeepSpark open platform selects hundreds of open source application algorithms and models that are deeply coupled with industrial applications, supports mainstream application frameworks, and p…
Mirage Persistent Kernel: Compiling LLMs into a MegaKernel
Original source code, exercise solutions, and personal notes for "C++ Primer Plus, 6th Edition (Chinese edition)"; for learning and exchange only.
Ascend PyTorch adapter (torch_npu). Mirror of https://gitee.com/ascend/pytorch
FlashInfer: Kernel Library for LLM Serving
Causal depthwise conv1d in CUDA, with a PyTorch interface
Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)
Ongoing research training transformer models at scale
Hackable and optimized Transformers building blocks, supporting a composable construction.
This repository contains the results and code for the MLPerf™ Training v2.1 benchmark.
Forward and backward attention DNN operators implemented with LibTorch, cuDNN, and Eigen.
Efficient GPU kernels for block-sparse matrix multiplication and convolution
Memory Efficient Attention (O(sqrt(n))) for Jax and PyTorch
Implementation of a memory efficient multi-head attention as proposed in the paper, "Self-attention Does Not Need O(n²) Memory" (see the chunked-attention sketch after this list)
Fast and memory-efficient exact attention
DeepSparkHub selects hundreds of application algorithms and models, covering various fields of AI and general-purpose computing, to support mainstream intelligent-computing scenarios. This repo…
Pretrain, finetune ANY AI model of ANY size on 1 or 10,000+ GPUs with zero code changes.
How to optimize some algorithms in CUDA.
OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
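
Several of the starred attention repositories above (memory-efficient attention, FlashAttention) share the same core idea: compute exact softmax attention over key/value chunks with an online softmax, so the full score matrix is never materialized. Below is a minimal PyTorch sketch of that idea, not code from any of the listed repos; the shapes, `key_chunk` size, and function name are illustrative assumptions.

```python
import torch

def chunked_attention(q, k, v, key_chunk=256):
    """Exact softmax attention computed by streaming over key/value chunks.

    q: (n_q, d), k: (n_k, d), v: (n_k, d). Instead of materializing the full
    (n_q, n_k) score matrix, keep a running max, a running normalizer, and a
    running weighted sum of values, rescaling them as new chunks arrive.
    """
    scale = q.shape[-1] ** -0.5
    n_q = q.shape[0]
    acc = torch.zeros(n_q, v.shape[-1], dtype=q.dtype)        # running sum of exp(scores) @ v
    denom = torch.zeros(n_q, 1, dtype=q.dtype)                 # running sum of exp(scores)
    running_max = torch.full((n_q, 1), float("-inf"), dtype=q.dtype)

    for start in range(0, k.shape[0], key_chunk):
        k_blk = k[start:start + key_chunk]
        v_blk = v[start:start + key_chunk]
        scores = (q @ k_blk.T) * scale                          # (n_q, chunk)
        blk_max = scores.max(dim=-1, keepdim=True).values
        new_max = torch.maximum(running_max, blk_max)
        # Rescale the previous accumulators to the new max before adding this chunk.
        correction = torch.exp(running_max - new_max)
        p = torch.exp(scores - new_max)
        acc = acc * correction + p @ v_blk
        denom = denom * correction + p.sum(dim=-1, keepdim=True)
        running_max = new_max

    return acc / denom

# Quick check against the naive implementation.
if __name__ == "__main__":
    q, k, v = (torch.randn(1024, 64) for _ in range(3))
    ref = torch.softmax((q @ k.T) * q.shape[-1] ** -0.5, dim=-1) @ v
    assert torch.allclose(chunked_attention(q, k, v), ref, atol=1e-5)
```

With query blocks of size sqrt(n) and key blocks of size sqrt(n), this scheme gives the O(sqrt(n)) working memory described in "Self-attention Does Not Need O(n²) Memory"; the CUDA kernels in the starred projects fuse these steps on-chip rather than looping in Python.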