This repository is a deep dive into optimizing fundamental linear algebra operations. It explores a range of performance engineering techniques on both CPU and GPU.
- CPU Optimization: Written in Mojo.
- GPU Programming: Using Apple's Metal framework (with Objective-C++ and Python).
```
.
├── pixi.toml                  # Mojo project configuration
├── cpu/
│   └── matrix_multiplication/ # GEMM optimization in Mojo
│
└── gpu/
    ├── dot_product/           # Dot product optimization in Metal/Objective-C++
    └── matrix_multiplication/ # Matrix multiplication in Metal/Python
```
This section details the journey of optimizing General Matrix Multiplication (GEMM) on the CPU using Mojo. We start with a textbook implementation and incrementally apply advanced techniques to leverage modern CPU architecture features.
The implementations for each stage can be found in `cpu/matrix_multiplication/`.
- Naive GEMM: The classic three-loop algorithm.
- Loop Reordering: Swapping loop order to improve cache locality for one of the input matrices.
- SIMD: Using Single Instruction, Multiple Data to perform vectorized computations.
- Micro-Kernel: A highly-optimized core function that computes a small tile of the result matrix, designed to maximize register usage.
- Cache Blocking: A BLIS-style approach that partitions matrices into blocks that fit into the different levels of the CPU cache, reducing trips to main memory (the naive, reordered, and blocked variants are sketched after this list).
- Parallelization: Using Mojo's `parallelize` to distribute the work across multiple CPU cores.
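The repository implements these stages in Mojo. As a language-neutral illustration (not the repo's code), the pure-Python sketch below contrasts the naive triple loop with the reordered and cache-blocked variants; the matrix sizes and the 32-element block size are arbitrary choices for the example.

```python
import numpy as np

def gemm_naive(A, B, C):
    """Textbook ijk order: the innermost loop strides down a column of B,
    which is cache-unfriendly for row-major storage."""
    M, K = A.shape
    _, N = B.shape
    for i in range(M):
        for j in range(N):
            for k in range(K):
                C[i, j] += A[i, k] * B[k, j]

def gemm_reordered(A, B, C):
    """ikj order: the innermost loop now walks rows of B and C, so
    consecutive accesses touch consecutive memory."""
    M, K = A.shape
    _, N = B.shape
    for i in range(M):
        for k in range(K):
            a = A[i, k]
            for j in range(N):
                C[i, j] += a * B[k, j]

def gemm_blocked(A, B, C, bs=32):
    """Cache blocking: work on bs-by-bs tiles so the tiles of A, B, and C
    being reused stay resident in cache."""
    M, K = A.shape
    _, N = B.shape
    for i0 in range(0, M, bs):
        for k0 in range(0, K, bs):
            for j0 in range(0, N, bs):
                for i in range(i0, min(i0 + bs, M)):
                    for k in range(k0, min(k0 + bs, K)):
                        a = A[i, k]
                        for j in range(j0, min(j0 + bs, N)):
                            C[i, j] += a * B[k, j]

# Sanity check against NumPy on a small problem.
A = np.random.rand(48, 48)
B = np.random.rand(48, 48)
for gemm in (gemm_naive, gemm_reordered, gemm_blocked):
    C = np.zeros((48, 48))
    gemm(A, B, C)
    assert np.allclose(C, A @ B)
```

The same cache-locality reasoning carries over to the Mojo versions, which additionally vectorize the inner computation with SIMD and distribute work across cores with `parallelize`.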
For a complete, in-depth explanation of each optimization technique and the theory behind it, please see the detailed write-up:
Prerequisites:
- Pixi installed (the project's Mojo environment is configured via pixi.toml).
Setup and Execution:
```sh
pixi shell
mojo run main.mojo
```
This section explores GPU programming using Apple's Metal framework. It includes implementations for both dot product (in Objective-C++) and matrix multiplication (in Python).
This part focuses on a common parallel programming challenge: the reduction operation. We implement a dot product using several different reduction strategies on the GPU.
Reduction Strategies Explored:
- GPU multiplication with final reduction on the CPU.
- Partial reduction on the GPU within threadgroups.
- Full reduction on the GPU using atomic operations.
- Tree-based reduction within a threadgroup to minimize divergence and maximize parallelism.
- Hierarchical reduction to handle inputs larger than a single threadgroup (the tree-based and hierarchical patterns are sketched after this list).
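The kernels themselves are written in Metal; as a rough emulation (not the repo's code), the NumPy sketch below mimics the tree-based in-threadgroup reduction and the two-pass hierarchical scheme. The 256-element group size is an assumed value, and the final pass over the partial sums happens on the host, mirroring the simplest strategy above.

```python
import numpy as np

def threadgroup_tree_reduce(block):
    """Tree reduction as one threadgroup does it in shared memory:
    each step, 'thread' t adds element t + stride into element t,
    halving the number of active threads until one value remains."""
    shared = block.copy()            # stands in for threadgroup memory
    stride = len(shared) // 2
    while stride > 0:
        for t in range(stride):      # these run in parallel on the GPU
            shared[t] += shared[t + stride]
        stride //= 2
    return shared[0]

def hierarchical_dot(x, y, group_size=256):
    """Two-pass hierarchical reduction for inputs larger than one
    threadgroup: pass 1 reduces each block to a partial sum; pass 2
    (here on the host) reduces the partial sums.
    group_size must be a power of two so the halving covers every element."""
    prod = x * y
    pad = (-len(prod)) % group_size  # pad with the identity element, 0
    prod = np.concatenate([prod, np.zeros(pad)])
    partials = [threadgroup_tree_reduce(prod[i:i + group_size])
                for i in range(0, len(prod), group_size)]
    return sum(partials)

x = np.random.rand(10_000)
y = np.random.rand(10_000)
assert np.isclose(hierarchical_dot(x, y), np.dot(x, y))
```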
For a detailed discussion of the algorithms, GPU architecture concepts, and trade-offs, refer to the documentation:
Prerequisites:
- macOS with Xcode Command Line Tools installed.
Build and Execution:
- Navigate to the dot product directory:
```sh
cd gpu/dot_product
```
- Build the project using the Makefile and run:
```sh
make run
```
This implementation provides a baseline for GEMM on the GPU, using Python and the PyObjC bridge to interface with Metal. The kernel uses a simple approach where each thread in the compute grid is responsible for calculating a single element of the output matrix C.
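The kernel is written in Metal Shading Language and dispatched from Python via PyObjC; as a rough illustration (not the actual kernel), the plain-Python sketch below emulates what a single thread computes, with `gid_x`/`gid_y` standing in for the thread's position in the compute grid.

```python
import numpy as np

def matmul_kernel_body(A, B, C, gid_x, gid_y):
    """What one GPU thread computes: the single output element
    C[gid_y, gid_x], as the dot product of row gid_y of A and
    column gid_x of B."""
    M, N = C.shape
    K = A.shape[1]
    if gid_x >= N or gid_y >= M:   # bounds guard for threads off the edge
        return
    acc = 0.0
    for k in range(K):
        acc += A[gid_y, k] * B[k, gid_x]
    C[gid_y, gid_x] = acc

# Metal would launch an M x N grid of threads; here we just loop over it.
M, N, K = 8, 6, 10
A = np.random.rand(M, K)
B = np.random.rand(K, N)
C = np.zeros((M, N))
for gid_y in range(M):
    for gid_x in range(N):
        matmul_kernel_body(A, B, C, gid_x, gid_y)
assert np.allclose(C, A @ B)
```

On the GPU, every "iteration" of that double loop runs as an independent thread; the sequential loop here only emulates the grid dispatch.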
For a brief overview of the Metal kernel and Python setup, see:
Prerequisites:
- macOS with a Metal-capable GPU.
- Python 3.12+
- A Python package manager like `pip` or `uv`.
Setup and Execution:
- Navigate to the matrix multiplication directory:
```sh
cd gpu/matrix_multiplication
```
- Install the required Python dependencies from `pyproject.toml`:
```sh
# Using uv
uv sync
```
- Run the script. It will compile the Metal kernel, run the matrix multiplication, and verify the result against NumPy.
```sh
python main.py
```
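The script's verification code isn't reproduced here, but comparing a float32 GPU result against NumPy generally requires a tolerance rather than exact equality, since float32 accumulation rounds differently from a higher-precision reference. A small, runnable illustration of why a check like `np.allclose` is the right tool:

```python
import numpy as np

# float32 products accumulated in float32 (as a GPU kernel typically does)
# drift slightly from a float64 reference, so exact comparison would fail.
rng = np.random.default_rng(0)
A = rng.random((256, 256), dtype=np.float32)
B = rng.random((256, 256), dtype=np.float32)

ref = (A.astype(np.float64) @ B.astype(np.float64)).astype(np.float32)
f32 = A @ B  # stand-in for the result read back from the Metal buffer

print("max abs difference:", np.abs(ref - f32).max())  # small but nonzero
assert np.allclose(ref, f32, rtol=1e-4, atol=1e-6)
```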