Starred repositories
Downloads lifetime Fitbit data and exports it into the format supported by Garmin Connect data importer. This includes historical body composition data (weight, BMI, and fat percentage), activity…
CUDA Tile IR is an MLIR-based intermediate representation and compiler infrastructure for CUDA kernel optimization, focusing on tile-based computation patterns and optimizations targeting NVIDIA te…
PyGWalker: Turn your dataframe into an interactive UI for visual analysis
OpenAI Triton backend for Intel® GPUs
Shared Middle-Layer for Triton Compilation
Triton for OpenCL backend, and use mlir-translate to get source OpenCL code
TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.
Efficient GPU support for LLM inference with x-bit quantization (e.g. FP6, FP5).
Burn is a next generation tensor library and Deep Learning Framework that doesn't compromise on flexibility, efficiency and portability.
Comparison of the output quality of quantization methods, using Llama 3, transformers, GGUF, EXL2.
Flash attention tutorial written in Python, Triton, CUDA, CUTLASS
Real-time webcam demo with SmolVLM and llama.cpp server
Real-time GPU profiling layer for Vulkan applications.
A tool for bandwidth measurements on NVIDIA GPUs.
GPGPU-Sim provides a detailed simulation model of a contemporary GPU running CUDA and/or OpenCL workloads and now includes an integrated (and validated) energy model, GPUWattch.
Diffusion model (SD, Flux, Wan, Qwen Image, Z-Image, …) inference in pure C/C++
[NeurIPS 2025] SpatialLM: Training Large Language Models for Structured Indoor Modeling
A bidirectional pipeline parallelism algorithm for computation-communication overlap in DeepSeek V3/R1 training.
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
DeepEP: an efficient expert-parallel communication library
FlashMLA: Efficient Multi-head Latent Attention Kernels
A book for learning the foundations of LLMs