LLMs wrote this README. I will fix it, I promise.
make test-python
make lint
make test-rtl-unit
make test-rtl-integration
Run the Python 16x16 matmul demo:
PYTHONPATH=src python3 examples/matmul_16.py
Run the local Qwen/vTPU chip-sim demo UI:
make qwen-demo-ui
make lint requires Verilator. The Python tests require the package dependencies declared in pyproject.toml.
Host / testbench / MMIO
        |
        v
Instruction memory -> decoder -> control FSM
        |                            |
        |                            v
        |                      DMA + barriers
        |                            |
        v                            v
Simulated HBM <----------> CMEM <----------> VMEM0 / VMEM1
                                                   |
                                                   v
                                        TensorCore local units
                                        MXU + vector + reduce
                                                   |
                                                   v
                                        results copied back to HBM
The top-level RTL version of this shape lives in rtl/top/virtual_tpu_v4_top.sv. The Python version of the same execution contract lives in src/virtual_tpu/golden.py.
Modern ML workloads spend much of their time multiplying large tensors. If the core operation is:
C = A @ B
then the machine should make that operation cheap, predictable, and easy to tile. The smallest end-to-end program in this repo is a 16x16 int8 matmul in examples/matmul_16.py, with reusable program construction in src/virtual_tpu/programs.py.
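As a minimal sketch of what that contract means numerically, the NumPy snippet below multiplies two random 16x16 int8 matrices with int32 accumulation. The helper name `matmul_int8` is an illustrative assumption, not a function from this repo.

```python
import numpy as np

def matmul_int8(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Multiply int8 matrices, widening to int32 so partial sums cannot wrap."""
    if a.dtype != np.int8 or b.dtype != np.int8:
        raise TypeError("both operands must be int8")
    return a.astype(np.int32) @ b.astype(np.int32)

rng = np.random.default_rng(0)
a = rng.integers(-128, 128, size=(16, 16), dtype=np.int8)
b = rng.integers(-128, 128, size=(16, 16), dtype=np.int8)
c = matmul_int8(a, b)

# Worst case for a 16-deep dot product: 16 * (-128) * (-128) fits easily in int32.
extreme = np.full((16, 16), -128, dtype=np.int8)
worst = matmul_int8(extreme, extreme)
```

Widening before the multiply matters: an int8 @ int8 product kept in int8 would overflow almost immediately.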
A matrix multiply reuses each element of A and B many times. A systolic array turns that reuse into local movement: A streams across one dimension, B streams across the other, and processing elements accumulate partial sums. This model uses an output-stationary dataflow because it is teachable, deterministic, and easy to verify before scaling.
The RTL compute path is split across rtl/primitive/pe_int8.sv, rtl/mxu/systolic_array.sv, and rtl/mxu/mxu_top.sv. The architectural rationale is in docs/02_ARCHITECTURE.md.
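To make the output-stationary idea concrete, here is a cycle-level toy in Python. It is a sketch under stated assumptions, not a model of the repo's RTL: the skew scheme (row i of A delayed by i cycles, column j of B delayed by j cycles) and the function name `systolic_matmul` are illustrative choices.

```python
import numpy as np

def systolic_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Simulate an output-stationary array: PE (i, j) owns accumulator C[i, j]."""
    n, k_dim = a.shape
    k_dim_b, m = b.shape
    assert k_dim == k_dim_b, "inner dimensions must match"
    acc = np.zeros((n, m), dtype=np.int32)    # stationary partial sums
    a_reg = np.zeros((n, m), dtype=np.int32)  # A operand currently held by each PE
    b_reg = np.zeros((n, m), dtype=np.int32)  # B operand currently held by each PE
    for t in range(k_dim + n + m - 2):
        # Operands move one PE per cycle: A to the right, B downward.
        a_reg[:, 1:] = a_reg[:, :-1].copy()
        b_reg[1:, :] = b_reg[:-1, :].copy()
        # Skewed injection at the left and top edges (assumed skew scheme).
        for i in range(n):
            k = t - i
            a_reg[i, 0] = a[i, k] if 0 <= k < k_dim else 0
        for j in range(m):
            k = t - j
            b_reg[0, j] = b[k, j] if 0 <= k < k_dim else 0
        # Every PE performs one multiply-accumulate per cycle.
        acc += a_reg * b_reg
    return acc

rng = np.random.default_rng(1)
a = rng.integers(-128, 128, size=(16, 16), dtype=np.int8)
b = rng.integers(-128, 128, size=(16, 16), dtype=np.int8)
c = systolic_matmul(a, b)
```

With this skew, PE (i, j) sees A[i, k] and B[k, j] on the same cycle t = i + j + k, so each accumulator receives exactly the K products it needs and nothing else.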
If every multiply fetched from global memory, arithmetic units would mostly wait. So each TensorCore gets local VMEM, and matmul instructions operate on VMEM addresses. HBM is for bulk tensor storage; VMEM is for the active working set.
The Python memory contract is in src/virtual_tpu/memory.py. The banked RTL VMEM implementation is in rtl/memory/vmem_top.sv and rtl/memory/vmem_bank.sv.
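The working-set idea can be sketched as a tiled matmul in which only one 16x16 tile of each operand is "resident" at a time. This is a hedged illustration: the names `hbm_a`, `vmem_a`, and `tiled_matmul` are hypothetical, and plain NumPy slicing stands in for DMA.

```python
import numpy as np

TILE = 16  # matches the 16x16 array size used in this README

def tiled_matmul(hbm_a: np.ndarray, hbm_b: np.ndarray, tile: int = TILE) -> np.ndarray:
    """Compute A @ B while touching bulk storage only one tile at a time."""
    n, k_dim = hbm_a.shape
    _, m = hbm_b.shape
    assert n % tile == 0 and k_dim % tile == 0 and m % tile == 0
    hbm_c = np.zeros((n, m), dtype=np.int32)
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            vmem_c = np.zeros((tile, tile), dtype=np.int32)  # local accumulator tile
            for k in range(0, k_dim, tile):
                # Stand-in for DMA: copy one tile of each operand into local memory.
                vmem_a = hbm_a[i:i + tile, k:k + tile].astype(np.int32)
                vmem_b = hbm_b[k:k + tile, j:j + tile].astype(np.int32)
                vmem_c += vmem_a @ vmem_b
            # Write the finished output tile back to bulk storage once.
            hbm_c[i:i + tile, j:j + tile] = vmem_c
    return hbm_c

rng = np.random.default_rng(2)
a = rng.integers(-128, 128, size=(64, 64), dtype=np.int8)
b = rng.integers(-128, 128, size=(64, 64), dtype=np.int8)
c = tiled_matmul(a, b)
```

The inner loop only ever holds three 16x16 tiles, which is the point of a small local memory: the active working set stays tiny even when the tensors in bulk storage are large.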
Accelerators are not just arithmetic engines. They are orchestration machines. This design uses explicit DMA instructions and barriers because the program should say when data moves, when compute may begin, and when results are safe to read.
The DMA and memory model are specified in docs/05_MEMORY_AND_DMA.md, encoded in src/virtual_tpu/isa.py, executed by src/virtual_tpu/golden.py, and implemented in RTL in rtl/memory/dma_engine.sv.
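The toy below illustrates why explicit barriers matter. It is a deliberately exaggerated, hypothetical model (the class `ToyDma`, its method names, and the "copies only land at a barrier" rule are assumptions for illustration); the actual semantics are whatever docs/05_MEMORY_AND_DMA.md and the golden executor define.

```python
from collections import deque

class ToyDma:
    """Hypothetical model: DMA requests are queued and become visible at a barrier."""

    def __init__(self) -> None:
        self.spaces = {"HBM": {}, "VMEM0": {}}  # sparse address -> value maps
        self.pending = deque()

    def dma(self, src_space, src_addr, dst_space, dst_addr, length) -> None:
        # Issue a copy; nothing is guaranteed visible until barrier() returns.
        self.pending.append((src_space, src_addr, dst_space, dst_addr, length))

    def barrier(self) -> None:
        # Drain every outstanding copy in issue order.
        while self.pending:
            src_space, src_addr, dst_space, dst_addr, length = self.pending.popleft()
            for offset in range(length):
                value = self.spaces[src_space].get(src_addr + offset, 0)
                self.spaces[dst_space][dst_addr + offset] = value

toy = ToyDma()
for addr, value in enumerate([1, 2, 3, 4]):
    toy.spaces["HBM"][addr] = value

toy.dma("HBM", 0, "VMEM0", 100, 4)
before = toy.spaces["VMEM0"].get(100)  # not yet safe to read
toy.barrier()
after = [toy.spaces["VMEM0"][100 + i] for i in range(4)]
```

Reading the destination before the barrier returns nothing useful in this toy, which mirrors the rule of thumb in the text: compute should start only after the program has said the data is there.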
TPU v4 public documentation describes TensorCore-local VMEM and chip-level CMEM. This educational model keeps that distinction: VMEM is local to a TensorCore, while CMEM can stage data shared across TensorCores. CMEM exists because not every useful data movement is directly HBM-to-local; some programs need a chip-level scratchpad between bulk memory and compute.
See the CMEM-staged matmul program in src/virtual_tpu/programs.py and the RTL shared memory in rtl/memory/cmem_top.sv.
The full v4-inspired shape has two TensorCores and four MXUs per TensorCore. The educational RTL keeps the structure visible while using small 16x16 arrays for tractable simulation. Target masks in the ISA select TensorCore-local VMEM spaces and eligible MXUs, which lets simple programs grow toward multi-unit scheduling without changing the core mental model.
The target field is documented in docs/03_ISA.md, modeled by target handling in src/virtual_tpu/golden.py, and wired through rtl/tensor_core/tensor_core.sv.
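A target mask can be pictured as a small bitfield. The bit layout below is an assumption made purely for illustration (bits 1:0 select TensorCores, bits 5:2 select MXUs within each selected TensorCore); the real encoding is the one documented in docs/03_ISA.md.

```python
NUM_TENSOR_CORES = 2  # from the v4-inspired shape described above
MXUS_PER_CORE = 4

def decode_target(target: int) -> list[tuple[int, int]]:
    """Expand an 8-bit target mask into (tensor_core, mxu) pairs.

    ASSUMED layout, not the repo's: bits [1:0] = TensorCore mask,
    bits [5:2] = MXU mask applied to every selected TensorCore.
    """
    if not 0 <= target <= 0xFF:
        raise ValueError("target must fit in 8 bits")
    tc_mask = target & 0b11
    mxu_mask = (target >> 2) & 0b1111
    cores = [tc for tc in range(NUM_TENSOR_CORES) if (tc_mask >> tc) & 1]
    mxus = [mxu for mxu in range(MXUS_PER_CORE) if (mxu_mask >> mxu) & 1]
    return [(tc, mxu) for tc in cores for mxu in mxus]

single = decode_target(0b000101)  # TensorCore 0, MXU 0
spread = decode_target(0b001111)  # both TensorCores, MXUs 0 and 1
```

A mask of this kind lets the same instruction format address one unit or several, which is how a single-unit program can grow toward multi-unit scheduling without a new encoding.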
Real tensor programs need more than matmul: elementwise operations, reductions, clamps, activations, and accumulation patterns all matter. This repo includes vector and reduce instructions so the architecture can express the non-MXU parts of simple ML kernels.
The operation names live in src/virtual_tpu/isa.py. The Python semantics are in src/virtual_tpu/golden.py. The RTL units are rtl/vector/vector_unit.sv and rtl/vector/reduce_unit.sv.
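As a rough picture of what the non-MXU parts of a kernel look like, the NumPy sketch below chains a bias add, a ReLU, a clamp to the int8 range, and a row reduction. The function names are hypothetical stand-ins, not the operation names defined in src/virtual_tpu/isa.py.

```python
import numpy as np

def vec_add(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    return x + y                      # elementwise add (broadcasts the bias row)

def vec_relu(x: np.ndarray) -> np.ndarray:
    return np.maximum(x, 0)           # activation

def vec_clamp(x: np.ndarray, lo: int, hi: int) -> np.ndarray:
    return np.clip(x, lo, hi)         # saturate into a target range

def reduce_sum(x: np.ndarray) -> np.ndarray:
    return x.sum(axis=-1, dtype=np.int32)  # reduce along the last axis

# int32 accumulators, as they might come out of an int8 matmul
acc = np.array([[-300, 40, 500], [7, -8, 9]], dtype=np.int32)
bias = np.array([10, 10, 10], dtype=np.int32)

act = vec_clamp(vec_relu(vec_add(acc, bias)), -128, 127)
row_sums = reduce_sum(act)
```

Here `act` is `[[0, 50, 127], [17, 2, 19]]` and `row_sums` is `[177, 38]`: the negative value is zeroed by the ReLU and the large value saturates at 127.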
The ISA is intentionally not a TPU ISA. It is a small educational instruction format:
127:120 opcode
119:112 flags
111:104 target
103:096 reserved
095:080 dst
079:064 src0
063:048 src1
047:032 imm0
031:016 imm1
015:000 imm2
Instructions are fixed 128-bit words because decode clarity matters more than code density in this project. The format is spacious enough to keep memory spaces, target masks, dimensions, and operation modes visible to readers.
Read the ISA spec in docs/03_ISA.md, the Python encoding in src/virtual_tpu/isa.py, the assembler in src/virtual_tpu/assembler.py, and the RTL decoder in rtl/isa/instr_decoder.sv.
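The bit layout above can be packed and unpacked with a few lines of Python. This is a sketch built only from the field table in this README; the helper names `encode` and `decode` and the opcode value `0x12` are illustrative assumptions, not the API or opcodes of src/virtual_tpu/isa.py.

```python
# (msb, lsb) for each field, copied from the table above
FIELDS = {
    "opcode": (127, 120),
    "flags": (119, 112),
    "target": (111, 104),
    "reserved": (103, 96),
    "dst": (95, 80),
    "src0": (79, 64),
    "src1": (63, 48),
    "imm0": (47, 32),
    "imm1": (31, 16),
    "imm2": (15, 0),
}

def encode(**values: int) -> int:
    """Pack named fields into one 128-bit word; unspecified fields are zero."""
    word = 0
    for name, value in values.items():
        msb, lsb = FIELDS[name]
        width = msb - lsb + 1
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << lsb
    return word

def decode(word: int) -> dict[str, int]:
    """Split a 128-bit word back into its named fields."""
    return {
        name: (word >> lsb) & ((1 << (msb - lsb + 1)) - 1)
        for name, (msb, lsb) in FIELDS.items()
    }

# 0x12 is a made-up opcode used only for this example.
word = encode(opcode=0x12, target=0x01, dst=0x0100, src0=0x0000, src1=0x0040,
              imm0=16, imm1=16, imm2=16)
raw = word.to_bytes(16, "little")  # 128 bits = 16 bytes
fields = decode(word)
```

Because every field sits on a byte or half-word boundary, a hex dump of the word can be read field by field, which is the decode clarity the fixed 128-bit format is paying for.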
The first implemented compute path is int8 inputs with int32 accumulation because exact integer behavior makes verification crisp. BF16/FP32 matmul also runs in the Python reference path and in RTL simulation: BF16 operands are stored as raw 16-bit values, expanded to FP32 for accumulation, and written back as FP32 results. FP16 and BF16 vector/reduce paths are intentionally out of scope for now.
The numeric plan is in docs/04_NUMERICS.md, and the Python helpers are in src/virtual_tpu/numeric.py.
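The BF16 path described above relies on BF16 being the top 16 bits of an IEEE-754 float32, so expansion is a 16-bit left shift. The sketch below shows that step in NumPy; the function names are hypothetical, and the truncating conversion back to BF16 is an assumption (the repo's helpers in src/virtual_tpu/numeric.py may round differently).

```python
import numpy as np

def bf16_bits_to_fp32(bits: np.ndarray) -> np.ndarray:
    """Expand raw BF16 bit patterns (uint16) to float32 by zero-filling the low 16 bits."""
    return (bits.astype(np.uint32) << 16).view(np.float32)

def fp32_to_bf16_bits(x: np.ndarray) -> np.ndarray:
    """Drop the low 16 bits of float32 (truncation; ASSUMED rounding mode)."""
    return (x.astype(np.float32).view(np.uint32) >> 16).astype(np.uint16)

# Raw BF16 encodings of 1.0, 2.0, -2.0, and 0.5
a_bits = np.array([[0x3F80, 0x4000], [0xC000, 0x3F00]], dtype=np.uint16)

a = bf16_bits_to_fp32(a_bits)  # [[1.0, 2.0], [-2.0, 0.5]] as float32
c = a @ a                      # accumulate and store in FP32
```

Storing operands as raw 16-bit values keeps the memory model type-agnostic; only the compute unit needs to know how to interpret the bits.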
RTL should implement a checked contract, not vibes. The Python golden executor runs the same instruction stream against the same memory-space model and produces the expected architectural result. Tests compare programs to NumPy and use cocotb/Verilator to check RTL behavior against the contract.
The development method is described in docs/07_VERIFICATION_DRIVEN_DEVELOPMENT.md. Python tests live in tests/python. RTL/cocotb tests live in tests/cocotb and are launched through tests/rtl.
Follow one 16x16 matmul from user intent to hardware-shaped execution:
- examples/matmul_16.py creates random A and B, writes them to simulated HBM, runs a program, and checks C against NumPy.
- src/virtual_tpu/programs.py builds that program: DMA A and B into VMEM, barrier, clear C, issue MATMUL, barrier, DMA C back to HBM, halt.
- src/virtual_tpu/isa.py turns those operations into fixed-width instructions with explicit address spaces, flags, targets, and dimensions.
- src/virtual_tpu/golden.py executes the instructions against src/virtual_tpu/memory.py, including DMA behavior, matmul semantics, barriers, errors, and counters.
- rtl/top/virtual_tpu_v4_top.sv wires the hardware-shaped version: instruction memory, decoder, control FSM, DMA, HBM model, CMEM, two TensorCores, VMEM, MXUs, vector/reduce units, and counters.
- tests/python/test_golden.py checks the golden model. tests/rtl/test_chip_runner.py launches chip-level RTL tests that compare HBM-visible results and MMIO counters.
That path is the heart of the repo: a program is data movement plus local compute plus synchronization, and every layer is inspectable.
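The whole walkthrough can be compressed into a toy interpreter. This is a sketch under loud assumptions: apart from MATMUL, the op names (`DMA_IN`, `DMA_OUT`, `CLEAR`, `BARRIER`, `HALT`), the tuple-based program format, and the `run` function are invented for illustration and do not reflect the repo's assembler or golden executor.

```python
import numpy as np

def run(program, hbm: dict) -> dict:
    """Execute a tiny op list against dict-based HBM and VMEM stand-ins."""
    vmem = {}
    for op, *args in program:
        if op == "DMA_IN":        # bulk memory -> local memory
            (name,) = args
            vmem[name] = hbm[name].copy()
        elif op == "BARRIER":     # no-op here: this toy executes copies synchronously
            pass
        elif op == "CLEAR":
            name, shape = args
            vmem[name] = np.zeros(shape, dtype=np.int32)
        elif op == "MATMUL":      # int8 operands, int32 accumulation
            dst, lhs, rhs = args
            vmem[dst] += vmem[lhs].astype(np.int32) @ vmem[rhs].astype(np.int32)
        elif op == "DMA_OUT":     # local memory -> bulk memory
            (name,) = args
            hbm[name] = vmem[name].copy()
        elif op == "HALT":
            break
        else:
            raise ValueError(f"unknown op {op!r}")
    return hbm

rng = np.random.default_rng(3)
hbm = {
    "A": rng.integers(-128, 128, size=(16, 16), dtype=np.int8),
    "B": rng.integers(-128, 128, size=(16, 16), dtype=np.int8),
}
program = [
    ("DMA_IN", "A"), ("DMA_IN", "B"), ("BARRIER",),
    ("CLEAR", "C", (16, 16)), ("MATMUL", "C", "A", "B"), ("BARRIER",),
    ("DMA_OUT", "C"), ("HALT",),
]
hbm = run(program, hbm)
expected = hbm["A"].astype(np.int32) @ hbm["B"].astype(np.int32)
```

Comparing `hbm["C"]` with `expected` is the same shape of check the example script performs against NumPy: the result is judged only by what is visible in bulk memory after the program halts.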
- Custom 128-bit instruction encoding.
- Python assembler and golden executor.
- Simulated HBM, CMEM, VMEM0, and VMEM1 memory spaces.
- Exact int8 x int8 -> int32 tiled matmul behavior.
- Vector and reduce golden operations.
- BF16/FP32 Python reference path and RTL-simulation matmul path.
- Packed-tile 64x64 matmul and single-tile MLP lowering examples.
- Python 3D mesh/torus architectural simulator.
- Structured tiny-target layout search with JSONL experiment logs and OpenROAD fragments.
- Golden-model performance counters.
- Structural SystemVerilog chip top with instruction, control, memory, DMA, TensorCore, MXU, vector, reduce, and counter modules.
- RTL CMEM/VMEM0/VMEM1/HBM data movement.
- TC0/TC1 target-local execution, target masks, and MMIO counters.
- pytest-based Python verification.
- Verilator lint and cocotb unit-test harness.
BF16 matmul is supported in RTL simulation; FP16 and BF16 vector/reduce remain unsupported.
docs/ Design specs, public contract, module contracts, and development plan
src/virtual_tpu/ Python ISA, assembler, memory model, golden executor, lowering, archsim
compiler/ Compatibility wrappers for the compiler package layout in docs
rtl/ SystemVerilog packages and RTL modules
tests/python/ pytest tests for the executable golden contract
tests/cocotb/ cocotb tests for RTL simulation
tests/rtl/ pytest wrappers for cocotb/Verilator tests
examples/ Small runnable programs
- Architecture: docs/02_ARCHITECTURE.md
- ISA: docs/03_ISA.md
- Numerics: docs/04_NUMERICS.md
- Memory and DMA: docs/05_MEMORY_AND_DMA.md
- RTL module contracts: docs/06_RTL_MODULE_CONTRACTS.md
- Verification-driven development: docs/07_VERIFICATION_DRIVEN_DEVELOPMENT.md
- Test plan: docs/08_TEST_PLAN.md
- Compiler and assembler: docs/10_COMPILER_ASSEMBLER.md
- Example programs: docs/11_EXAMPLE_PROGRAMS.md
- Multi-chip archsim: docs/12_ARCHSIM_MULTI_CHIP.md
- Layout search: docs/13_LAYOUT_SEARCH.md
The project follows a verification-first loop:
public contract -> spec -> golden model -> tests -> RTL -> RTL tests -> docs
Every architectural decision should be legible, every educational simplification should be labeled, and every implemented behavior should have a test or a clear path toward one.