Strix is a static profiler and timeline simulator for TPU Mosaic/Pallas
LLO dumps (post-finalize-llo .txt/.mlir). It parses an LLO dump, runs a
dual-clock (VPU + DMA) simulator, analyzes stalls and bottlenecks, and prints a
console summary. It can also emit Chrome/Perfetto-compatible trace JSON and
Graphviz DOT dataflow graphs.
- Python 3.12+
Core analysis has no third-party dependencies (standard library only).
The optional tpu extra (for the import subcommand) needs JAX, libtpu,
protobuf, and google-cloud-storage:
pip install -e ".[tpu]"# Implicit analyze (backward-compatible)
python -m strix path/to/post-finalize-llo.txt
# Explicit analyze
python -m strix analyze path/to/post-finalize-llo.txtBoth python -m strix and python -m strix.cli are equivalent entry points.
This prints a performance summary to stdout and writes trace.json in the
current directory.
Strix provides three subcommands:
| Command | Description |
|---|---|
analyze |
Parse and simulate an LLO dump. Default — if the first positional arg doesn't match any subcommand, analyze is prepended automatically. |
analyze-bundles |
Parse a *-final_bundles.txt file and map VLIW bundles to Pallas source lines. |
import |
Deploy a kernel to a GKE TPU pod, run a benchmark, and download IR dumps. |
cross-compile |
Cross-compile a kernel for a target TPU topology and export LLO without running. |
python -m strix analyze [OPTIONS] PATH
| Argument | Required | Description |
|---|---|---|
PATH |
yes | Path to a post-finalize-llo .txt or .mlir file |
| Option | Default | Description |
|---|---|---|
-t, --trace-output |
trace.json |
Output path for Chrome trace JSON. Set to '' to disable. |
--arg, --arg-override |
— | Override scalar SSA values, e.g. --arg %arg0=128 --arg %1237=10. Needed to resolve DMA sizes and loop bounds. |
--default-sld-value |
— | Default value for all unresolved llo.sld results, e.g. --default-sld-value 128. |
--exclude-instructions |
— | Exclude instructions by opcode, e.g. --exclude-instructions llo.nop llo.dbg. |
--dump-tree |
False |
Print the parsed Instruction tree to stdout before simulating. |
--tree-max-depth |
— | Maximum depth when printing the Instruction/OpEvent tree. |
--dataflow-output |
— | Output path for Graphviz DOT dataflow graph. Render with: dot -Tsvg output.dot -o output.svg |
Examples:
# Basic analysis
python -m strix analyze dump.txt
# Override SSA args and disable trace
python -m strix analyze dump.txt --arg %arg0=128 --trace-output ''
# Generate dataflow graph
python -m strix analyze dump.txt --dataflow-output graph.dot
dot -Tsvg graph.dot -o graph.svg
# Debug: dump parsed tree with limited depth
python -m strix analyze dump.txt --dump-tree --tree-max-depth 5The console output includes:
- Overview: total instructions, FLOPs, bytes, simulated time, VPU/DMA utilization
- Bottleneck: classification (Compute / Memory / Latency / Balanced) with stall ratio
- Instruction mix: breakdown by operation category
python -m strix analyze-bundles [OPTIONS] PATH
| Argument | Required | Description |
|---|---|---|
PATH |
yes | Path to a *-final_bundles.txt file |
| Option | Description |
|---|---|
--json OUTPUT |
Write structured JSON to OUTPUT instead of console table. |
--line N |
Filter to source locations that include line N. |
--source-root DIR |
Local directory matching the TPU pod source paths, for inline code display. |
Examples:
# Console table
python -m strix analyze-bundles final_bundles.txt
# Filter to a specific source line
python -m strix analyze-bundles final_bundles.txt --line 42
# Show source code alongside bundle mappings
python -m strix analyze-bundles final_bundles.txt --source-root ./src
# Export as JSON
python -m strix analyze-bundles final_bundles.txt --json bundles.jsonpython -m strix import [OPTIONS] KERNEL
| Argument | Required | Description |
|---|---|---|
KERNEL |
yes | Kernel module path, e.g. kernels.chunk_kda_fwd |
| Option | Default | Description |
|---|---|---|
--shape |
(required) | Comma-separated shape dimensions, e.g. 1,2048,4,128,128 |
--chunk-size |
— | Chunk size for the kernel. |
--tpu-type |
v7x |
TPU type (e.g. v6e, v7x). |
--tpu-topology |
2x2x1 |
TPU topology (e.g. 2x2x1, 4x4x4). |
Requires the [tpu] extras and authenticated GKE access.
Example:
python -m strix import kernels.chunk_kda_fwd \
--shape 1,2048,4,128,128 \
--tpu-type v7x \
--tpu-topology 2x2x1This deploys a K8s Job to the GKE cluster, runs the kernel with the given shape,
collects HLO/LLO/Mosaic IR dumps, and downloads them locally for further
analysis with strix analyze.
python scripts/cross_compile.py [OPTIONS]
Cross-compiles the fused MoE kernel on a small TPU slice (e.g. 4-chip v7x) targeting a larger topology (e.g. 2x8x8 = 128 chips, 256 devices). Uses jax.experimental.topologies to create a virtual mesh, then compiles via libtpu to dump HLO/LLO/Mosaic IR — without needing all physical devices.
| Option | Default | Description |
|---|---|---|
--topology |
2x8x8 |
Target TPU chip topology (e.g. 2x8x8, 2x2x1). |
--output-dir |
/tmp/cross_compile |
Root output directory for IR dumps. |
--num-tokens |
256 |
Number of tokens. |
--num-experts |
256 |
Number of experts. |
--top-k |
8 |
Top-K experts per token. |
--hidden-size |
8192 |
Hidden dimension size. |
--intermediate-size |
2048 |
Intermediate (FFN) dimension size. |
--se-intermediate-size |
2048 |
Shared expert intermediate dimension size. |
Requires running on a TPU VM with JAX >= 0.5.x and libtpu.
Example:
python scripts/cross_compile.py --topology 2x8x8IR dumps are written to <output-dir>/<topology>/{hlo,llo,mosaic}/.
The --trace-output file (default: trace.json) is compatible with
Perfetto and Chrome chrome://tracing. The trace has
four tracks:
| TID | Track |
|---|---|
| 1 | Compute (VPU) |
| 2 | DMA |
| 3 | Stall |
| 4 | Control Flow |
Open the file in Perfetto to inspect simulated compute/DMA timelines, stall regions, and loop nesting structure.
Use --dataflow-output to export a Graphviz DOT file showing the SSA dataflow
dependency graph. The graph includes:
- Hardware clusters: VPU and DMA operations grouped by execution unit
- Loop subgraphs: nested loop blocks shown as clusters
- Time-bucket rank constraints: same-T constraints displayed as rank annotations
- Cross-stream edges: dependencies between VPU and DMA highlighted
Render with Graphviz:
dot -Tsvg dataflow.dot -o dataflow.svgLLO dump (.txt/.mlir)
│
▼
[Parser] regex-based → Instruction tree
│
▼
[Simulator] dual-clock (VPU/DMA), SSA resolution, loop expansion
│
▼
[Analyzer] bottleneck classification, stall ratio, instruction mix
│
├──► Console summary (always)
├──► Chrome trace JSON (--trace-output)
└──► Graphviz DOT (--dataflow-output)
The hardware model defaults to v6e-class TPU specs (918 TFLOPS BF16, 32 GB HBM,
1600 GB/s bandwidth) and can be customized in strix/hardware.py.