TraceML

A lightweight runtime health check for PyTorch training runs.

Quickstart • Compare Runs • Read Output • Use With Your Stack • FAQ

Is your GPU training, waiting on data, or blocked by one slow rank?

PyTorch training can look normal while step time is lost to a slow input pipeline, low GPU utilization, memory growth, distributed rank skew, or a run-to-run regression.

TraceML runs alongside your PyTorch training loop and writes a compact runtime performance report at the end of each run. With low overhead (<2% in our current benchmark runs), it helps you see where step time and memory went before opening heavier tools like torch.profiler or Nsight. It helps answer:

Are my GPUs waiting on a slow dataloader?
Is one distributed rank consistently slower than the others?
Is memory usage silently creeping upward during the run?
Did a recent code, data, or infrastructure change slow training down?

Here, health means runtime performance health: step time, input/compute/wait balance, memory behavior, distributed rank skew, and run-to-run change.

3-Minute Quickstart

1. Install the package

pip install traceml-ai

Using Hugging Face Trainer, PyTorch Lightning, Ray Train, W&B, or MLflow? Start with the native integration path in Use With Your Stack.

2. Wrap your training step

Add TraceML around the core training step. You do not need to change your model, optimizer, loss function, or dataloader.

import traceml_ai as traceml

traceml.init(mode="auto")

for batch in dataloader:
    with traceml.trace_step(model):
        optimizer.zero_grad(set_to_none=True)
        outputs = model(batch["x"])
        loss = criterion(outputs, batch["y"])
        loss.backward()
        optimizer.step()
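
To make the idea of wrapping a step concrete, here is a minimal sketch of a context manager that times whatever runs inside it. It is a hypothetical stand-in and not TraceML's implementation: `timed_step` and `step_times_ms` are names invented for this illustration, and the loop body is a placeholder for real forward, backward, and optimizer work.

```python
import time
from contextlib import contextmanager

# Illustrative only: a hypothetical stand-in, not TraceML's implementation.
step_times_ms = []


@contextmanager
def timed_step():
    """Record the wall-clock duration of the enclosed block in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the step raises, so failed steps are still recorded.
        step_times_ms.append((time.perf_counter() - start) * 1000.0)


for _ in range(3):
    with timed_step():
        # Placeholder for forward, backward, and optimizer work.
        sum(i * i for i in range(10_000))

print(f"mean step time: {sum(step_times_ms) / len(step_times_ms):.3f} ms")
```

On a GPU, CUDA kernels launch asynchronously, so plain wall-clock timing like this needs explicit synchronization before it reflects device work. That is one reason to rely on a purpose-built tool rather than hand-rolled timers.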

3. Run your script

traceml run train.py

For DDP, FSDP, and multi-node runs, see Distributed Training.

What You Get

TraceML writes two end-of-run artifacts:

logs/<run_name>/final_summary.json
logs/<run_name>/final_summary.txt
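
Because the JSON artifact is a plain file, it can be inspected with the standard library. The sketch below assumes nothing about the schema: `load_summary` and `top_level_keys` are hypothetical helpers, and the demo payload is made up so the example runs without a real training run.

```python
import json
import tempfile
from pathlib import Path


def load_summary(path):
    """Parse a saved summary file and return the resulting Python object."""
    with Path(path).open("r", encoding="utf-8") as f:
        return json.load(f)


def top_level_keys(summary):
    """Return sorted top-level keys, or an empty list for non-dict payloads."""
    return sorted(summary) if isinstance(summary, dict) else []


# Demo with a made-up payload; a real run would point at
# logs/<run_name>/final_summary.json instead.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "final_summary.json"
    demo_path.write_text(
        json.dumps({"hypothetical_section": {"value": 1}}), encoding="utf-8"
    )
    demo = load_summary(demo_path)
    print(top_level_keys(demo))
```

Listing the top-level keys first is a cheap way to explore the file before writing code that depends on specific fields.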

You can re-print a saved summary later without rerunning training:

traceml view logs/<run_name>/final_summary.json

Want a shareable report? Add --html-report to also write a self-contained final_summary.html (inline styling, no network, opens in any browser), or render one from a saved run after the fact:

traceml run train.py --html-report
traceml view logs/<run_name>/final_summary.json --html   # writes <...>.html

Instead of guessing why training feels slow, you get a compact diagnosis of where step time and memory went.

Example TraceML output:

+----------------------------------------------------------------------------+
|  Step Time                                                                 |
|  - Diagnosis: INPUT STRAGGLER                                              |
|  - Scope: compared over last 460 aligned steps across 4 global ranks       |
|  - Stats: total 303.7ms | input 254.5ms | compute 259.5ms | wait 40.5ms    |
|  - Why: r0 input was slower than median global rank (254.5/3.8ms).         |
+----------------------------------------------------------------------------+

In this example, rank 0 is the slow input rank, which can hold back the aligned distributed step.
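
The comparison behind a diagnosis like this can be sketched as a check of each rank's input time against the median across ranks. This is an illustration of the idea only, not TraceML's actual rule: the function name, the `ratio_threshold` value, and the per-rank numbers for ranks 1 to 3 are assumptions. Only the rank 0 value (254.5 ms) and the median (3.8 ms) come from the report above.

```python
from statistics import median


def find_input_straggler(input_ms_by_rank, ratio_threshold=2.0):
    """Return (rank, rank_ms, median_ms) if the slowest rank's input time is
    at least ratio_threshold times the median across ranks, else None."""
    med = median(input_ms_by_rank.values())
    rank, worst = max(input_ms_by_rank.items(), key=lambda kv: kv[1])
    if med > 0 and worst / med >= ratio_threshold:
        return rank, worst, med
    return None


# Rank 0 and the resulting median match the example report;
# the values for ranks 1-3 are made up for illustration.
input_ms = {0: 254.5, 1: 3.6, 2: 3.8, 3: 3.8}
print(find_input_straggler(input_ms))
```

Using the median rather than the mean keeps a single very slow rank from dragging the reference value toward itself.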

For experiment trackers, call traceml.summary() near the end of your script to get a flat dict of diagnosis statuses and average metrics. Keep final_summary.json when you want the full run artifact or an input for traceml compare.
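
A flat dict that mixes diagnosis statuses and numeric averages usually has to be split before logging, since trackers tend to treat numbers and strings differently. The sketch below shows one way to do that. The keys in `example` are hypothetical, as is the helper name; the real keys are whatever `traceml.summary()` returns.

```python
from numbers import Real


def split_for_tracker(flat_summary):
    """Split a flat dict into numeric metrics and string tags."""
    metrics, tags = {}, {}
    for key, value in flat_summary.items():
        # bool counts as a Real number in Python, so exclude it explicitly.
        if isinstance(value, Real) and not isinstance(value, bool):
            metrics[key] = float(value)
        else:
            tags[key] = str(value)
    return metrics, tags


# Hypothetical keys for illustration; real keys come from traceml.summary().
example = {"step_time_diagnosis": "INPUT STRAGGLER", "avg_step_ms": 303.7}
metrics, tags = split_for_tracker(example)
print(metrics)
print(tags)
```

The numeric half can then go to the tracker's metric-logging call and the string half to its tags or params.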

What TraceML Helps You Triage

TraceML is meant to be the first check when a training run is slower than expected. It points you to the likely area before you decide whether to open a heavier profiler.

| Area | What TraceML surfaces | What to inspect next |
| --- | --- | --- |
| Input pipeline | High input time or slow input rank | `num_workers`, `pin_memory`, transforms, tokenization, `collate_fn`, dataset/storage latency |
| GPU utilization / wait | Step time split across input, compute, and wait | input pipeline, CPU/GPU handoff, synchronization, distributed coordination |
| Distributed skew | One DDP/FSDP rank slower than the others | rank-local dataloading, data imbalance, node variance, storage/network differences |
| Memory creep | Memory usage growing during the run | retained tensors, logging references, loss accumulation, cached activations |
| Run regression | Changed metrics versus a known-good run | code changes, data changes, batch size, container, driver, hardware, infrastructure |
| Compute-heavy runs | Most time is spent in compute | open `torch.profiler` or Nsight for operator/kernel-level detail |
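
As an illustration of what "memory creep" means numerically, the sketch below fits a least-squares line to per-step memory samples and returns the slope in MB per step. It is not how TraceML detects memory growth; the function name and sample values are assumptions made for this example.

```python
def memory_growth_per_step(samples_mb):
    """Least-squares slope of memory (MB) against step index."""
    n = len(samples_mb)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2.0
    mean_y = sum(samples_mb) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples_mb))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


# Made-up samples: memory rising by 2 MB every step.
print(memory_growth_per_step([100.0, 102.0, 104.0, 106.0]))
```

A slope that stays positive over many steps, rather than flattening after warmup, is the pattern worth investigating. In a PyTorch loop, the samples could come from `torch.cuda.memory_allocated()`.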

Catching Regressions with Compare Mode

Compare two saved summaries, for example a slow run against a known-good baseline, to identify which metrics changed:

traceml compare input_slow/final_summary.json input_fixed/final_summary.json

+--------------------------------------------------------------------------------------+
|  TraceML Compare                                                                     |
+--------------------------------------------------------------------------------------+
|  Verdict: IMPROVEMENT                                                                |
|  Why: Step time decreased by 95.6%.                                                  |
|                                                                                      |
|  Metric                         A                B                Delta              |
|  Total step                     294.0 ms         13.0 ms          -280.9 ms (-95.6%) |
|  Input                          66.4 ms          2.7 ms           -63.7 ms (-95.9%)  |
+--------------------------------------------------------------------------------------+

See Compare Runs for the full report format.
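
The Delta column is ordinary arithmetic on the A and B values: an absolute difference and a percentage change relative to A. The helper below reproduces it for the two rows shown above; `metric_delta` is a name invented for this sketch, not part of TraceML.

```python
def metric_delta(a_ms, b_ms):
    """Return (absolute delta, percent change relative to A)."""
    delta = b_ms - a_ms
    pct = (delta / a_ms) * 100.0 if a_ms else float("nan")
    return delta, pct


# Values taken from the example compare report above.
for name, a, b in [("Total step", 294.0, 13.0), ("Input", 66.4, 2.7)]:
    delta, pct = metric_delta(a, b)
    print(f"{name}: {delta:+.1f} ms ({pct:+.1f}%)")
```

The report shows -280.9 ms for the total step while the rounded inputs give -281.0 ms, presumably because the report computes deltas from unrounded values.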

Display Modes

The --mode flag controls what you see during training. It does not change the final saved artifacts.

| Mode flag | Experience during training | Supported topology |
| --- | --- | --- |
| `--mode=summary` (default) | Silent execution | Single-node and multi-node multi-GPU |
| `--mode=cli` | Live terminal display | Single-node, including multi-GPU |
| `--mode=dashboard` | Live browser display | Single-node; requires `pip install "traceml-ai[dashboard]"` |

Current Support

Works today:

Single GPU training
Single-node multi-GPU DDP / FSDP
Multi-node DDP summary reports
Multi-node runs on Slurm (sbatch template + guide)
Run-to-run comparison from final_summary.json
Custom PyTorch loops, Hugging Face, PyTorch Lightning, and Ray Train

On the roadmap:

Multi-node live CLI / browser dashboard
Explicit collective / NCCL timing

Where TraceML Fits in the Stack

TraceML does not replace torch.profiler. It is the low-overhead first pass that helps you decide where to aim heavier profiling tools.

| Tool | Best used for | Output | Cost / overhead |
| --- | --- | --- | --- |
| TraceML | Classifying high-level bottlenecks: input, compute, wait, memory, rank skew | JSON fingerprint, text summary, live views | <2% in current benchmark runs; small code wrapper |
| `torch.profiler` | Inspecting expensive ops, kernels, and CUDA activity | Profiler trace | Higher overhead; requires profiler context |
| Nsight Systems | Debugging low-level CUDA and kernel behavior | GPU timeline | Separate profiler run |
| W&B / MLflow | Tracking training metrics and experiment history | Metrics dashboard / run history | Logging integration |
| `nvidia-smi` | Checking machine-level GPU health and utilization | Terminal metrics | No code changes |

Overhead

In our benchmark runs, TraceML adds:

<2% overhead on single GPU at default settings
<1% overhead on single-node multi-GPU at default settings
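
To check overhead on your own workload, one simple approach is to time the same run with and without instrumentation and compare. The helper below shows only the arithmetic; `overhead_percent` and the timings are made up for illustration and are not the benchmark methodology behind the figures above.

```python
def overhead_percent(baseline_s, traced_s):
    """Relative slowdown of the traced run versus the baseline, in percent."""
    if baseline_s <= 0:
        raise ValueError("baseline_s must be positive")
    return (traced_s - baseline_s) / baseline_s * 100.0


# Made-up timings for illustration, not benchmark results.
print(f"{overhead_percent(100.0, 101.5):.1f}%")
```

Run-to-run noise can easily exceed a 1-2% difference, so averaging over several runs of each variant gives a more trustworthy number.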

Troubleshooting Guides

These guides cover the common bottlenecks TraceML is designed to identify:

Feedback

For bugs, unexpected results, or feature requests, open a GitHub issue and use the matching issue template. The templates ask for the details we need to reproduce training-environment problems, including hardware, topology, launch command, TraceML version, PyTorch/CUDA versions, and redacted summary output.

GitHub issues: open an issue

If TraceML helped you find a real bottleneck, use the "I found a bottleneck" issue template. These reports help other training teams recognize similar problems.

Security reports: see SECURITY.md

Email: support@traceopt.ai

Contributing

Contributions are welcome, especially:

real slowdown examples and repros
distributed training edge cases
docs improvements
framework integrations

See CONTRIBUTING.md for development setup and contribution guidelines.

License

Apache 2.0. See LICENSE.

TraceOpt is a trademark of OptAI UG (haftungsbeschränkt).
