Stars
Tensors and dynamic neural networks in Python with strong GPU acceleration
Efficient Triton Kernels for LLM Training
llama3 implementation one matrix multiplication at a time
PyTorch compiler that accelerates training and inference. Get built-in optimizations for performance, memory, parallelism, and easily write your own.
A curated list of practical guides and resources for LLMs (LLMs Tree, Examples, Papers)
A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")
Run compilers interactively from your web browser and interact with the assembly
Optimizing SGEMM kernel functions on NVIDIA GPUs to close-to-cuBLAS performance.
A tool based on Excalidraw to create stop motion animations and slides.
Helper scripts for Vivado and Vivado HLS builds with CMake.
Examples shown as part of the tutorial "Productive parallel programming on FPGA with high-level synthesis".
A cheat sheet for mathematical notation in code form
FPGA+SoC+Linux+Device Tree Overlay+FPGA Manager U-Boot&Linux Kernel&Debian11 Images (for Xilinx:Zynq Ultrascale+ MPSoC)
Example for ZynqMP-FPGA-XRT (Xilinx RunTime for ZynqMP-FPGA-Linux)
Tool for updating the contents of BlockRAMs found in Xilinx 7 series bitstreams.
Scalable systolic array-based matrix-matrix multiplication implemented in Vivado HLS for Xilinx FPGAs.
A booklet on machine learning systems design with exercises. NOT the repo for the book "Designing Machine Learning Systems", which is `dmls-book`
A collection of out-of-tree LLVM passes for teaching and learning
Intro to Creative Coding workshop with p5.js and Tone.js
A high-level performance analysis tool for FPGA-based accelerators