- independent contributor @ HPC Users Alliance
- United States
- UTC -12:00
- https://yiakwy.github.io/
- in/lei-wang-1722a28a
- @yiakwy2023
- https://mp.weixin.qq.com/s/AVujFosiC15ZmSRvByYcRQ
- https://mp.weixin.qq.com/s/13NKhY3GccjU9Emz-cRSHQ
Highlights
- Pro
Stars
Autonomous GPU Kernel Generation via Deep Agents
Efficient implementation of DeepSeek Ops (Blockwise FP8 GEMM, MoE, and MLA) for AMD Instinct MI300X
Debugging torch distributed program
DeerFlow is a community-driven Deep Research framework, combining language models with tools like web search, crawling, and Python execution, while contributing back to the open-source community.
Distributed Compiler based on Triton for Parallel Systems
rocSHMEM intra-kernel networking runtime for AMD dGPUs on the ROCm platform.
TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. Tensor…
A unified library of SOTA model optimization techniques like quantization, pruning, distillation, speculative decoding, etc. It compresses deep learning models for downstream deployment frameworks …
New repo collection for NVIDIA Cosmos: https://github.com/nvidia-cosmos
The Torch-MLIR project aims to provide first class support from the PyTorch ecosystem to the MLIR ecosystem.
A retargetable MLIR-based machine learning compiler and runtime toolkit.
Code for solving LP on GPU using first-order methods
[ICML'24] Data and code for our paper "Training-Free Long-Context Scaling of Large Language Models"
PyTorch bindings for CUTLASS grouped GEMM for MoE.
NeurIPS 2024 Paper: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
Official implementation for the paper: "Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering"
A self-hosted, offline, ChatGPT-like chatbot. Powered by Llama 2. 100% private, with no data leaving your device. New: Code Llama support!
JAX for Graphcore IPU (experimental)
graphcore / hpc-cookbook
Forked from UoB-HPC/ipu-hpc-cookbook. Useful tutorials and recipes for developers doing low-level work with the Graphcore IPU
Poplar implementation of "Bundle Adjustment on a Graph Processor" (CVPR 2020)
MatMul performance benchmarks for a single CPU core, comparing hand-engineered and codegen kernels.