This repository is a deep dive into optimizing fundamental linear algebra operations. It explores a range of performance engineering techniques on both CPU and GPU.
- CPU Optimization: Written in Mojo.
- GPU Programming: Using Apple's Metal framework (with Objective-C++ and Python).
```
.
├── pixi.toml                  # Mojo project configuration
├── cpu/
│   └── matrix_multiplication/ # GEMM optimization in Mojo
│
└── gpu/
    ├── dot_product/           # Dot product optimization in Metal/Objective-C++
    └── matrix_multiplication/ # Matrix multiplication in Metal/Python
```
This section details the journey of optimizing General Matrix Multiplication (GEMM) on the CPU using Mojo. We start with a textbook implementation and incrementally apply advanced techniques to leverage modern CPU architecture features.
The implementations for each stage can be found in `cpu/matrix_multiplication/`.
- Naive GEMM: The classic three-loop algorithm.
- Loop Reordering: Swapping loop order to improve cache locality for one of the input matrices.
- SIMD: Using Single Instruction, Multiple Data to perform vectorized computations.
- Micro-Kernel: A highly-optimized core function that computes a small tile of the result matrix, designed to maximize register usage.
- Cache Blocking: A BLIS-style approach that partitions matrices into blocks that fit into the different levels of the CPU cache, reducing trips to main memory (the naive, reordered, and blocked variants are sketched after this list).
- Parallelization: Using Mojo's `parallelize` to distribute the work across multiple CPU cores.
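The repository implements these stages in Mojo. As a language-neutral illustration (not the repo's code), the pure-Python sketch below contrasts the naive triple loop with the reordered and cache-blocked variants; the matrix sizes and the 32-element block size are arbitrary choices for the example.

```python
import numpy as np

def gemm_naive(A, B, C):
    """Textbook ijk order: the innermost loop strides down a column of B,
    which is cache-unfriendly for row-major storage."""
    M, K = A.shape
    _, N = B.shape
    for i in range(M):
        for j in range(N):
            for k in range(K):
                C[i, j] += A[i, k] * B[k, j]

def gemm_reordered(A, B, C):
    """ikj order: the innermost loop now walks rows of B and C, so
    consecutive accesses touch consecutive memory."""
    M, K = A.shape
    _, N = B.shape
    for i in range(M):
        for k in range(K):
            a = A[i, k]
            for j in range(N):
                C[i, j] += a * B[k, j]

def gemm_blocked(A, B, C, bs=32):
    """Cache blocking: work on bs-by-bs tiles so the tiles of A, B, and C
    being reused stay resident in cache."""
    M, K = A.shape
    _, N = B.shape
    for i0 in range(0, M, bs):
        for k0 in range(0, K, bs):
            for j0 in range(0, N, bs):
                for i in range(i0, min(i0 + bs, M)):
                    for k in range(k0, min(k0 + bs, K)):
                        a = A[i, k]
                        for j in range(j0, min(j0 + bs, N)):
                            C[i, j] += a * B[k, j]

# Sanity check against NumPy on a small problem.
A = np.random.rand(48, 48)
B = np.random.rand(48, 48)
for gemm in (gemm_naive, gemm_reordered, gemm_blocked):
    C = np.zeros((48, 48))
    gemm(A, B, C)
    assert np.allclose(C, A @ B)
```

The same cache-locality reasoning carries over to the Mojo versions, which additionally vectorize the inner computation with SIMD and distribute work across cores with `parallelize`.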
For a complete, in-depth explanation of each optimization technique and the theory behind it, please see the detailed write-up:
Prerequisites:
- Pixi installed (the project's Mojo environment is configured via pixi.toml).
Setup and Execution:
```sh
pixi shell
mojo run main.mojo
```
This section explores GPU programming using Apple's Metal framework. It includes implementations for both dot product (in Objective-C++) and matrix multiplication (in Python).
This part focuses on a common parallel programming challenge: the reduction operation. We implement a dot product using several different reduction strategies on the GPU.
Reduction Strategies Explored:
- GPU multiplication with final reduction on the CPU.
- Partial reduction on the GPU within threadgroups.
- Full reduction on the GPU using atomic operations.
- Tree-based reduction within a threadgroup to minimize divergence and maximize parallelism.
- Hierarchical reduction to handle inputs larger than a single threadgroup (the tree-based and hierarchical patterns are sketched after this list).
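The kernels themselves are written in Metal; as a rough emulation (not the repo's code), the NumPy sketch below mimics the tree-based in-threadgroup reduction and the two-pass hierarchical scheme. The 256-element group size is an assumed value, and the final pass over the partial sums happens on the host, mirroring the simplest strategy above.

```python
import numpy as np

def threadgroup_tree_reduce(block):
    """Tree reduction as one threadgroup does it in shared memory:
    each step, 'thread' t adds element t + stride into element t,
    halving the number of active threads until one value remains."""
    shared = block.copy()            # stands in for threadgroup memory
    stride = len(shared) // 2
    while stride > 0:
        for t in range(stride):      # these run in parallel on the GPU
            shared[t] += shared[t + stride]
        stride //= 2
    return shared[0]

def hierarchical_dot(x, y, group_size=256):
    """Two-pass hierarchical reduction for inputs larger than one
    threadgroup: pass 1 reduces each block to a partial sum; pass 2
    (here on the host) reduces the partial sums.
    group_size must be a power of two so the halving covers every element."""
    prod = x * y
    pad = (-len(prod)) % group_size  # pad with the identity element, 0
    prod = np.concatenate([prod, np.zeros(pad)])
    partials = [threadgroup_tree_reduce(prod[i:i + group_size])
                for i in range(0, len(prod), group_size)]
    return sum(partials)

x = np.random.rand(10_000)
y = np.random.rand(10_000)
assert np.isclose(hierarchical_dot(x, y), np.dot(x, y))
```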
For a detailed discussion of the algorithms, GPU architecture concepts, and trade-offs, refer to the documentation:
Prerequisites:
- macOS with Xcode Command Line Tools installed.
Build and Execution:
- Navigate to the dot product directory:
```sh
cd gpu/dot_product
```
- Build the project using the Makefile and run:
```sh
make run
```
This implementation provides a baseline for GEMM on the GPU, using Python and the PyObjC bridge to interface with Metal. The kernel uses a simple approach where each thread in the compute grid is responsible for calculating a single element of the output matrix C.
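The kernel is written in Metal Shading Language and dispatched from Python via PyObjC; as a rough illustration (not the actual kernel), the plain-Python sketch below emulates what a single thread computes, with `gid_x`/`gid_y` standing in for the thread's position in the compute grid.

```python
import numpy as np

def matmul_kernel_body(A, B, C, gid_x, gid_y):
    """What one GPU thread computes: the single output element
    C[gid_y, gid_x], as the dot product of row gid_y of A and
    column gid_x of B."""
    M, N = C.shape
    K = A.shape[1]
    if gid_x >= N or gid_y >= M:   # bounds guard for threads off the edge
        return
    acc = 0.0
    for k in range(K):
        acc += A[gid_y, k] * B[k, gid_x]
    C[gid_y, gid_x] = acc

# Metal would launch an M x N grid of threads; here we just loop over it.
M, N, K = 8, 6, 10
A = np.random.rand(M, K)
B = np.random.rand(K, N)
C = np.zeros((M, N))
for gid_y in range(M):
    for gid_x in range(N):
        matmul_kernel_body(A, B, C, gid_x, gid_y)
assert np.allclose(C, A @ B)
```

On the GPU, every "iteration" of that double loop runs as an independent thread; the sequential loop here only emulates the grid dispatch.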
For a brief overview of the Metal kernel and Python setup, see:
Prerequisites:
- macOS with a Metal-capable GPU.
- Python 3.12+
- A Python package manager like `pip` or `uv`.
Setup and Execution:
- Navigate to the matrix multiplication directory:
```sh
cd gpu/matrix_multiplication
```
- Install the required Python dependencies from `pyproject.toml`:
```sh
# Using uv
uv sync
```
- Run the script. It will compile the Metal kernel, run the matrix multiplication, and verify the result against NumPy.
```sh
python main.py
```
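The script's verification code isn't reproduced here, but comparing a float32 GPU result against NumPy generally requires a tolerance rather than exact equality, since float32 accumulation rounds differently from a higher-precision reference. A small, runnable illustration of why a check like `np.allclose` is the right tool:

```python
import numpy as np

# float32 products accumulated in float32 (as a GPU kernel typically does)
# drift slightly from a float64 reference, so exact comparison would fail.
rng = np.random.default_rng(0)
A = rng.random((256, 256), dtype=np.float32)
B = rng.random((256, 256), dtype=np.float32)

ref = (A.astype(np.float64) @ B.astype(np.float64)).astype(np.float32)
f32 = A @ B  # stand-in for the result read back from the Metal buffer

print("max abs difference:", np.abs(ref - f32).max())  # small but nonzero
assert np.allclose(ref, f32, rtol=1e-4, atol=1e-6)
```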