XueSongTap

Starred repositories

vLLM Kunlun (vllm-kunlun) is a community-maintained hardware plugin designed to seamlessly run vLLM on the Kunlun XPU.

Python 190 19 Updated Dec 20, 2025

QLoRA: Efficient Finetuning of Quantized LLMs

Jupyter Notebook 10,789 869 Updated Jun 10, 2024

Tensors and Dynamic neural networks in Python with strong GPU acceleration

Python 96,040 26,308 Updated Dec 20, 2025

Domain-specific language designed to streamline the development of high-performance GPU/CPU/Accelerators kernels

C++ 4,263 350 Updated Dec 20, 2025

DeepRec is a high-performance recommendation deep learning framework based on TensorFlow. It is hosted in incubation in LF AI & Data Foundation.

C++ 1,150 361 Updated Jan 21, 2025

Collective communications library with various primitives for multi-machine training.

C++ 1,378 339 Updated Dec 2, 2025

Python frame stack sampler for CPython

C 2,148 64 Updated Nov 2, 2025

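PyTorch model training code for single precision, half precision, mixed precision, single-GPU, multi-GPU (DP / DDP), FSDP, and DeepSpeed, with a comparison of the training speed and GPU memory usage of each method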

Python 127 16 Updated Mar 16, 2024

Unicode routines (UTF8, UTF16, UTF32) and Base64: billions of characters per second using SSE2, AVX2, NEON, AVX-512, RISC-V Vector Extension, LoongArch64, POWER. Part of Node.js, WebKit/Safari, Lad…

C++ 1,604 106 Updated Dec 20, 2025

⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance.

Cuda 138 7 Updated May 10, 2025

A compact binary encoding for geographic data.

JavaScript 1,010 86 Updated Jan 17, 2022

CUDA/Metal accelerated language model inference

C 625 30 Updated May 29, 2025

Parallel algorithms and data structures for tree-based adaptive mesh refinement (AMR) with arbitrary element shapes.

C++ 247 62 Updated Dec 19, 2025

Megatron's multi-modal data loader

Python 295 35 Updated Dec 18, 2025

The official repo of Pai-Megatron-Patch for LLM & VLM large scale training developed by Alibaba Cloud.

Python 1,482 216 Updated Dec 15, 2025

Ongoing research training transformer language models at scale, including: BERT & GPT-2

Python 1,426 228 Updated Mar 20, 2024

Ongoing research training transformer models at scale

Python 14,656 3,399 Updated Dec 20, 2025

Train speculative decoding models effortlessly and port them smoothly to SGLang serving.

Python 558 119 Updated Dec 18, 2025

Scalable toolkit for efficient model alignment

Python 846 105 Updated Oct 6, 2025

Fast, Flexible and Portable Structured Generation

C++ 1,429 110 Updated Dec 19, 2025

The Art of Writing Efficient Programs, published by Packt

C 356 82 Updated May 13, 2024

brpc is an industrial-grade RPC framework written in C++, often used in high-performance systems such as search, storage, machine learning, advertisement, and recommendation. "brpc" mea…

C++ 17,417 4,082 Updated Dec 18, 2025

Solve Visual Understanding with Reinforced VLMs

Python 5,765 375 Updated Oct 21, 2025

Library for lifting machine code to LLVM bitcode

C++ 1,548 163 Updated Dec 16, 2025

Python interface to TopoToolbox

Python 16 17 Updated Dec 9, 2025

A TensorRT and C++ based deployment of FoundationPose, which makes integration lightweight and efficient. Supports Jetson Orin. Adapted from nvidia_isaac_pose_esitimation.

C++ 75 11 Updated Sep 26, 2025