High-Efficiency 16-bit BFloat16 Multiply-Accumulate (MAC) Unit for ML Acceleration. Verified for SkyWater 130nm (TinyTapeout 07). Includes FP32 accumulation and streaming I/O.
Updated Feb 16, 2026 - Python
A reproducible GPU benchmarking lab that compares FP16 vs FP32 training on MNIST using PyTorch, CuPy, and Nsight profiling tools. This project blends performance engineering with cinematic storytelling—featuring NVTX-tagged training loops, fused CuPy kernels, and a profiler-driven README that narrates the GPU’s inner workings frame by frame.
🎬 Explore GPU training efficiency with FP32 vs FP16 in this modular lab, utilizing Tensor Core acceleration for deep learning insights.
AI Performance Benchmark Tool. Unifies CPU, GPU (Metal), and NPU (Neural Engine) tests, reporting results in GFLOPS and TOPS.
🔍 Benchmark AI performance on Apple Silicon with a unified tool for CPU, GPU, and NPU testing, leveraging advanced strategies for accurate results.