Domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels

C++ 3,348 243 Updated Oct 9, 2025

SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse–Linear Attention

63 2 Updated Oct 9, 2025

Triton adapter for Ascend. Mirror of https://gitee.com/ascend/triton-ascend

Python 76 7 Updated Sep 27, 2025

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Cuda 5,784 710 Updated Oct 9, 2025

DeepEP: an efficient expert-parallel communication library

Cuda 8,587 947 Updated Oct 9, 2025

A bidirectional pipeline parallelism algorithm for computation-communication overlap in DeepSeek V3/R1 training.

Python 2,868 304 Updated Mar 10, 2025

FlashMLA: Efficient Multi-head Latent Attention Kernels

C++ 11,798 907 Updated Sep 30, 2025

Repository hosting code for "Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations" (https://arxiv.org/abs/2402.17152).

Python 1,455 290 Updated Oct 9, 2025

Triton kernels for Flux

Python 22 Updated Jul 7, 2025

[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves a 2-5x speedup over FlashAttention without losing end-to-end metrics across language, image, and video models.

Cuda 2,494 238 Updated Oct 8, 2025
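
A minimal sketch of how a quantized-attention call like this is typically dropped in, assuming the `sageattention` package is installed; the `sageattn` name, `tensor_layout` argument, and shapes follow its published README-style API but should be checked against the repository.

```python
# Sketch only: assumes the sageattention package and a CUDA device.
import torch
from sageattention import sageattn

batch, heads, seqlen, headdim = 2, 8, 1024, 64
q = torch.randn(batch, heads, seqlen, headdim, dtype=torch.float16, device="cuda")
k = torch.randn(batch, heads, seqlen, headdim, dtype=torch.float16, device="cuda")
v = torch.randn(batch, heads, seqlen, headdim, dtype=torch.float16, device="cuda")

# Intended as a drop-in replacement for scaled_dot_product_attention-style calls.
out = sageattn(q, k, v, tensor_layout="HND", is_causal=False)
print(out.shape)  # (batch, heads, seqlen, headdim)
```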

A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton.

Python 577 30 Updated Aug 12, 2025

Fast and memory-efficient exact attention

Python 19,839 2,046 Updated Oct 8, 2025
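
For reference, a minimal call into the flash-attn package via its standard `flash_attn_func` entry point; the shapes, fp16 dtype, and CUDA-only placement below are the usual assumptions, not something taken from this listing.

```python
# Sketch only: assumes flash-attn is installed and tensors are fp16/bf16 on CUDA.
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim = 2, 1024, 8, 64
q = torch.randn(batch, seqlen, nheads, headdim, dtype=torch.float16, device="cuda")
k = torch.randn(batch, seqlen, nheads, headdim, dtype=torch.float16, device="cuda")
v = torch.randn(batch, seqlen, nheads, headdim, dtype=torch.float16, device="cuda")

# causal=True applies a lower-triangular mask, as in autoregressive decoding.
out = flash_attn_func(q, k, v, causal=True)
print(out.shape)  # (batch, seqlen, nheads, headdim)
```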

Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.

Python 252 45 Updated Oct 9, 2025

Development repository for the Triton language and compiler

MLIR 17,167 2,289 Updated Oct 9, 2025
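
As a quick illustration of what a Triton kernel looks like, here is the canonical vector-add example: a sketch assuming a CUDA GPU and the `triton` package, not tied to any particular repository in this list.

```python
# Sketch only: the standard Triton vector-add tutorial kernel.
import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n_elements = out.numel()
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n_elements, BLOCK_SIZE=1024)
    return out


x = torch.rand(4096, device="cuda")
y = torch.rand(4096, device="cuda")
assert torch.allclose(add(x, y), x + y)
```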

PyTorch/TorchScript/FX compiler for NVIDIA GPUs using TensorRT (Torch-TensorRT)

Python 2,869 370 Updated Oct 9, 2025
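
A minimal compile sketch with `torch_tensorrt.compile`, assuming the `torch_tensorrt` package and a TensorRT runtime are installed; the torchvision model and input shape are placeholders, not part of this listing.

```python
# Sketch only: ahead-of-time compilation of a torchvision model with Torch-TensorRT.
import torch
import torch_tensorrt
import torchvision.models as models

model = models.resnet18(weights=None).eval().cuda()
inputs = [torch_tensorrt.Input((1, 3, 224, 224), dtype=torch.float32)]

# enabled_precisions controls which kernel precisions TensorRT may select.
trt_model = torch_tensorrt.compile(model, inputs=inputs, enabled_precisions={torch.float32})

x = torch.randn(1, 3, 224, 224, device="cuda")
print(trt_model(x).shape)
```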

Free sharing of programming-related e-books; follow the personal WeChat official account: 编程与实战 (Programming and Practice)

4,712 1,186 Updated Apr 4, 2024

TensorRT-ONNX build for Windows

C++ 5 1 Updated Aug 4, 2021

micronet, a model compression and deployment lib. Compression: 1. quantization: quantization-aware training (QAT), High-Bit (>2b) (DoReFa / Quantization and Training of Neural Networks for Efficient Integer-…

Python 2,257 478 Updated May 6, 2025

Detectron2 Faster R-CNN in LibTorch C++

C++ 13 2 Updated Aug 6, 2021

Two-stage CenterNet

Python 1,222 187 Updated Nov 20, 2022

RLE (run-length encoding) vs Halcon vs OpenCV

C++ 39 16 Updated Jun 25, 2021

An easy-to-use PyTorch to TensorRT converter

Python 4,820 694 Updated Aug 17, 2024
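
A minimal conversion sketch with torch2trt, assuming torch2trt and TensorRT are installed; the torchvision model and example input are placeholders chosen for illustration.

```python
# Sketch only: convert a model by tracing it with example inputs, then compare outputs.
import torch
from torch2trt import torch2trt
import torchvision.models as models

model = models.alexnet(weights=None).eval().cuda()
x = torch.ones((1, 3, 224, 224), device="cuda")

# The converted module is called like a regular torch.nn.Module.
model_trt = torch2trt(model, [x])
y = model(x)
y_trt = model_trt(x)
print(torch.max(torch.abs(y - y_trt)))
```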

Supports Yolov5 (4.0/5.0), YoloR, YoloX, Yolov4, Yolov3, CenterNet, CenterFace, RetinaFace, Classify, and Unet. Converts darknet/libtorch/pytorch/mxnet models to ONNX and then to TensorRT.

C++ 210 42 Updated Aug 2, 2021

Libtorch Examples

C++ 42 16 Updated Jul 16, 2021

🔥 A mini PyTorch inference framework inspired by darknet (yolov3, yolov4, yolov5, unet, ...).

C++ 747 148 Updated Apr 23, 2023

Lightweight, portable, flexible distributed/mobile deep learning with a dynamic, mutation-aware dataflow dependency scheduler; for Python, R, Julia, Scala, Go, JavaScript, and more

C++ 20,827 6,751 Updated Oct 25, 2023

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet)

C 22,147 7,955 Updated Aug 28, 2025