wx-csy
👋
bonjour
  • Tsinghua University
  • Beijing, China

Organizations

@nju-calabash

Starred repositories

An evolutionary approach to find small and low latency sorting networks

HTML 82 10 Updated Feb 22, 2026

A tool for running small microbenchmarks on recent Intel and AMD x86 CPUs.

Python 523 69 Updated Mar 29, 2026

CUDA Python: Performance meets Productivity

Cython 3,294 301 Updated Jun 22, 2026

A high-performance distributed file system designed to address the challenges of AI training and inference workloads.

C++ 9,981 1,055 Updated May 7, 2026

Expert Parallelism Load Balancer

Python 1,390 204 Updated Mar 24, 2025

Analyze computation-communication overlap in V3/R1.

1,168 148 Updated Mar 21, 2025

A bidirectional pipeline parallelism algorithm for computation-communication overlap in DeepSeek V3/R1 training.

Python 2,966 327 Updated Jan 14, 2026

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Cuda 7,398 1,058 Updated Jun 4, 2026

DeepEP: an efficient expert-parallel communication library

Cuda 9,751 1,293 Updated Jun 15, 2026

FlashMLA: Efficient Multi-head Latent Attention Kernels

C++ 12,709 1,063 Updated Apr 30, 2026

Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation

8,007 288 Updated May 15, 2025

MoBA: Mixture of Block Attention for Long-Context LLMs

Python 2,128 151 Updated Apr 3, 2025

NVIDIA Linux open GPU with P2P support

C 1,382 142 Updated Jun 6, 2025

Doing simple retrieval from LLM models at various context lengths to measure accuracy

Jupyter Notebook 2,320 247 Updated Jun 8, 2026

Reexamining Direct Cache Access to Optimize I/O Intensive Applications for Multi-hundred-gigabit Networks

Makefile 103 22 Updated Sep 2, 2021

A fast GPU memory copy library based on NVIDIA GPUDirect RDMA technology

C 1,391 189 Updated Jun 15, 2026

A Flexible Framework for Experiencing Heterogeneous LLM Inference/Fine-tune Optimizations

Python 17,317 1,321 Updated Jun 21, 2026

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.

C++ 5,626 869 Updated Jun 22, 2026

How to optimize some algorithms in CUDA.

Cuda 3,093 279 Updated Jun 21, 2026

CUDA Templates and Python DSLs for High-Performance Linear Algebra

C++ 9,936 1,918 Updated Jun 21, 2026

Large Language Model Text Generation Inference

Python 10,863 1,270 Updated Mar 21, 2026

A high-throughput and memory-efficient inference and serving engine for LLMs

Python 83,535 18,309 Updated Jun 22, 2026

Transformer related optimization, including BERT, GPT

C++ 6,427 935 Updated Mar 27, 2024

Inference code for Llama models

Python 59,461 9,790 Updated Jan 26, 2025

Intel® Performance Counter Monitor (Intel® PCM)

C++ 3,289 524 Updated Jun 19, 2026

Grasper: A High Performance Distributed System for OLAP on Property Graphs.

C++ 30 9 Updated Apr 3, 2021

A solver for subgraph isomorphism problems, based upon a series of papers by subsets of McCreesh, Prosser, and Trimble.

C++ 107 30 Updated Jun 21, 2026

CP 2015 subgraph isomorphism experiments, data and paper

C++ 13 5 Updated Sep 5, 2015