  • National University of Singapore
  • Singapore

Starred repositories

Source code for 300+ books, kept here for quick reference

Jupyter Notebook 349 210 Updated Jun 9, 2026

🎨 ML Visuals contains figures and templates which you can reuse and customize to improve your scientific writing.

17,262 1,563 Updated Feb 13, 2023

Official PyTorch Implementation of "Scalable Diffusion Models with Transformers"

Python 8,617 791 Updated May 31, 2024

Compiler for multiple programming models (SYCL, C++ standard parallelism, HIP/CUDA) for CPUs and GPUs from all vendors: The independent, community-driven compiler for C++-based heterogeneous progra…

C++ 1,876 219 Updated Jun 12, 2026

Distributed reliable key-value store for the most critical data of a distributed system

Go 51,827 10,398 Updated Jun 12, 2026

ACCESS-OM3 MOM6-CICE6 configurations with optional WW3 and Wombat. All the configurations use the Payu and pre-built executables available on NCI.

9 19 Updated Jun 13, 2026

NVSHMEM‑Tutorial: Build a DeepEP‑like GPU Buffer

Cuda 192 16 Updated Feb 11, 2026

Checkpoint-engine is a simple middleware to update model weights in LLM inference engines

Python 965 86 Updated Jun 8, 2026

C++ implementation of ChatGLM-6B & ChatGLM2-6B & ChatGLM3 & GLM4(V)

C++ 2,964 327 Updated Jul 31, 2024

Tutorials for NVIDIA CUPTI samples

C++ 68 13 Updated Nov 3, 2025

Collective communications library with various primitives for multi-machine training.

C++ 1,430 359 Updated Jun 10, 2026

A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate.

Python 882 153 Updated Jun 13, 2026

A high-performance distributed file system designed to address the challenges of AI training and inference workloads.

C++ 9,966 1,055 Updated May 7, 2026

[DEPRECATED] Moved to ROCm/rocm-systems repo

C++ 418 205 Updated Jun 11, 2026

A lightweight, local-first, graphic-centric productivity tool to build your second brain. Supports Excalidraw/Tldraw whiteboards and Notion-like notes. …

TypeScript 2,623 177 Updated Jan 10, 2024

MAGI-1: Autoregressive Video Generation at Scale

Python 3,706 238 Updated Jun 17, 2025

This is a Chinese translation of the CUDA programming guide

1,985 291 Updated Nov 13, 2024

Official code repo for the O'Reilly Book - "Hands-On Large Language Models"

Jupyter Notebook 26,968 6,263 Updated Apr 24, 2026

Scalable and memory-optimized training of diffusion models

Python 1,361 140 Updated May 26, 2026

A fast GPU memory copy library based on NVIDIA GPUDirect RDMA technology

C 1,383 189 Updated Jun 13, 2026

A Distributed Attention Towards Linear Scalability for Ultra-Long Context, Heterogeneous Data Training

Python 848 58 Updated Jun 13, 2026

Distributed Compiler based on Triton for Parallel Systems

Python 1,459 151 Updated Apr 22, 2026

[2025] Efficient Vision Language Models: A Survey

1 Updated Nov 8, 2025

E3SM post-processing toolchain

Python 8 16 Updated Jun 12, 2026

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Cuda 25 22 Updated Jun 12, 2026

A lightweight data processing framework built on DuckDB and 3FS.

Python 4,960 444 Updated Mar 5, 2025

Athena++ radiation GRMHD code and adaptive mesh refinement (AMR) framework

C++ 339 186 Updated May 23, 2026

A native gRPC client & server implementation with async/await support.

Rust 12,287 1,222 Updated Jun 5, 2026

Democratizing Reinforcement Learning for LLMs

Python 5,608 577 Updated Jun 13, 2026