Skip to content

primatrix/strix

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

237 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Strix

Strix is a static profiler and timeline simulator for TPU Mosaic/Pallas LLO dumps (post-finalize-llo .txt/.mlir). It parses an LLO dump, runs a dual-clock (VPU + DMA) simulator, analyzes stalls and bottlenecks, and prints a console summary. It can also emit Chrome/Perfetto-compatible trace JSON and Graphviz DOT dataflow graphs.

Requirements

  • Python 3.12+

Core analysis has no third-party dependencies (standard library only).

The optional tpu extra (for the import subcommand) needs JAX, libtpu, protobuf, and google-cloud-storage:

pip install -e ".[tpu]"

Quickstart

# Implicit analyze (backward-compatible)
python -m strix path/to/post-finalize-llo.txt

# Explicit analyze
python -m strix analyze path/to/post-finalize-llo.txt

Both python -m strix and python -m strix.cli are equivalent entry points.

This prints a performance summary to stdout and writes trace.json in the current directory.

Commands

Strix provides three subcommands:

Command Description
analyze Parse and simulate an LLO dump. Default — if the first positional arg doesn't match any subcommand, analyze is prepended automatically.
analyze-bundles Parse a *-final_bundles.txt file and map VLIW bundles to Pallas source lines.
import Deploy a kernel to a GKE TPU pod, run a benchmark, and download IR dumps.
cross-compile Cross-compile a kernel for a target TPU topology and export LLO without running.

analyze — Parse and simulate an LLO dump

python -m strix analyze [OPTIONS] PATH
Argument Required Description
PATH yes Path to a post-finalize-llo .txt or .mlir file
Option Default Description
-t, --trace-output trace.json Output path for Chrome trace JSON. Set to '' to disable.
--arg, --arg-override Override scalar SSA values, e.g. --arg %arg0=128 --arg %1237=10. Needed to resolve DMA sizes and loop bounds.
--default-sld-value Default value for all unresolved llo.sld results, e.g. --default-sld-value 128.
--exclude-instructions Exclude instructions by opcode, e.g. --exclude-instructions llo.nop llo.dbg.
--dump-tree False Print the parsed Instruction tree to stdout before simulating.
--tree-max-depth Maximum depth when printing the Instruction/OpEvent tree.
--dataflow-output Output path for Graphviz DOT dataflow graph. Render with: dot -Tsvg output.dot -o output.svg

Examples:

# Basic analysis
python -m strix analyze dump.txt

# Override SSA args and disable trace
python -m strix analyze dump.txt --arg %arg0=128 --trace-output ''

# Generate dataflow graph
python -m strix analyze dump.txt --dataflow-output graph.dot
dot -Tsvg graph.dot -o graph.svg

# Debug: dump parsed tree with limited depth
python -m strix analyze dump.txt --dump-tree --tree-max-depth 5

The console output includes:

  • Overview: total instructions, FLOPs, bytes, simulated time, VPU/DMA utilization
  • Bottleneck: classification (Compute / Memory / Latency / Balanced) with stall ratio
  • Instruction mix: breakdown by operation category

analyze-bundles — Map VLIW bundles to source lines

python -m strix analyze-bundles [OPTIONS] PATH
Argument Required Description
PATH yes Path to a *-final_bundles.txt file
Option Description
--json OUTPUT Write structured JSON to OUTPUT instead of console table.
--line N Filter to source locations that include line N.
--source-root DIR Local directory matching the TPU pod source paths, for inline code display.

Examples:

# Console table
python -m strix analyze-bundles final_bundles.txt

# Filter to a specific source line
python -m strix analyze-bundles final_bundles.txt --line 42

# Show source code alongside bundle mappings
python -m strix analyze-bundles final_bundles.txt --source-root ./src

# Export as JSON
python -m strix analyze-bundles final_bundles.txt --json bundles.json

import — Benchmark a kernel on TPU and download IR dumps

python -m strix import [OPTIONS] KERNEL
Argument Required Description
KERNEL yes Kernel module path, e.g. kernels.chunk_kda_fwd
Option Default Description
--shape (required) Comma-separated shape dimensions, e.g. 1,2048,4,128,128
--chunk-size Chunk size for the kernel.
--tpu-type v7x TPU type (e.g. v6e, v7x).
--tpu-topology 2x2x1 TPU topology (e.g. 2x2x1, 4x4x4).

Requires the [tpu] extras and authenticated GKE access.

Example:

python -m strix import kernels.chunk_kda_fwd \
    --shape 1,2048,4,128,128 \
    --tpu-type v7x \
    --tpu-topology 2x2x1

This deploys a K8s Job to the GKE cluster, runs the kernel with the given shape, collects HLO/LLO/Mosaic IR dumps, and downloads them locally for further analysis with strix analyze.


cross-compile — Cross-compile a kernel for a target TPU topology

python scripts/cross_compile.py [OPTIONS]

Cross-compiles the fused MoE kernel on a small TPU slice (e.g. 4-chip v7x) targeting a larger topology (e.g. 2x8x8 = 128 chips, 256 devices). Uses jax.experimental.topologies to create a virtual mesh, then compiles via libtpu to dump HLO/LLO/Mosaic IR — without needing all physical devices.

Option Default Description
--topology 2x8x8 Target TPU chip topology (e.g. 2x8x8, 2x2x1).
--output-dir /tmp/cross_compile Root output directory for IR dumps.
--num-tokens 256 Number of tokens.
--num-experts 256 Number of experts.
--top-k 8 Top-K experts per token.
--hidden-size 8192 Hidden dimension size.
--intermediate-size 2048 Intermediate (FFN) dimension size.
--se-intermediate-size 2048 Shared expert intermediate dimension size.

Requires running on a TPU VM with JAX >= 0.5.x and libtpu.

Example:

python scripts/cross_compile.py --topology 2x8x8

IR dumps are written to <output-dir>/<topology>/{hlo,llo,mosaic}/.


Trace Viewing

The --trace-output file (default: trace.json) is compatible with Perfetto and Chrome chrome://tracing. The trace has four tracks:

TID Track
1 Compute (VPU)
2 DMA
3 Stall
4 Control Flow

Open the file in Perfetto to inspect simulated compute/DMA timelines, stall regions, and loop nesting structure.

Dataflow Graph

Use --dataflow-output to export a Graphviz DOT file showing the SSA dataflow dependency graph. The graph includes:

  • Hardware clusters: VPU and DMA operations grouped by execution unit
  • Loop subgraphs: nested loop blocks shown as clusters
  • Time-bucket rank constraints: same-T constraints displayed as rank annotations
  • Cross-stream edges: dependencies between VPU and DMA highlighted

Render with Graphviz:

dot -Tsvg dataflow.dot -o dataflow.svg

Architecture

LLO dump (.txt/.mlir)
        │
        ▼
   [Parser]          regex-based → Instruction tree
        │
        ▼
   [Simulator]       dual-clock (VPU/DMA), SSA resolution, loop expansion
        │
        ▼
   [Analyzer]        bottleneck classification, stall ratio, instruction mix
        │
        ├──► Console summary (always)
        ├──► Chrome trace JSON (--trace-output)
        └──► Graphviz DOT (--dataflow-output)

The hardware model defaults to v6e-class TPU specs (918 TFLOPS BF16, 32 GB HBM, 1600 GB/s bandwidth) and can be customized in strix/hardware.py.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors