A lightweight runtime health check for PyTorch training runs.
Quickstart • Compare Runs • Read Output • Use With Your Stack • FAQ
PyTorch training can look normal while step time is lost to a slow input pipeline, low GPU utilization, memory growth, distributed rank skew, or a run-to-run regression.
TraceML runs alongside your PyTorch training loop and writes a compact
runtime performance report at the end of each run. With low overhead
(<2% in our current benchmark runs), it helps you see where step time and
memory went before opening heavier tools like torch.profiler or Nsight.
It helps answer:
- Are my GPUs waiting on a slow dataloader?
- Is one distributed rank consistently slower than the others?
- Is memory usage silently creeping upward during the run?
- Did a recent code, data, or infrastructure change slow training down?
Here, health means runtime performance health: step time, input/compute/wait balance, memory behavior, distributed rank skew, and run-to-run change.
```bash
pip install traceml-ai
```
Using Hugging Face Trainer, PyTorch Lightning, Ray Train, W&B, or MLflow? Start with the native integration path in Use With Your Stack.
Add TraceML around the core training step. You do not need to change your model, optimizer, loss function, or dataloader.
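```python
import traceml_ai as traceml

traceml.init(mode="auto")

# model, optimizer, criterion, and dataloader are your existing objects.
for batch in dataloader:
    # Wrap only the core training step.
    with traceml.trace_step(model):
        optimizer.zero_grad(set_to_none=True)
        outputs = model(batch["x"])
        loss = criterion(outputs, batch["y"])
        loss.backward()
        optimizer.step()
```
Then run your script with TraceML:
```bash
traceml run train.py
```
For DDP, FSDP, and multi-node runs, see Distributed Training.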
TraceML writes two end-of-run artifacts:
```text
logs/<run_name>/final_summary.json
logs/<run_name>/final_summary.txt
```
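Because final_summary.json is plain JSON, it can also be inspected with the Python standard library. The sketch below assumes nothing about TraceML's schema beyond the file being valid JSON; `flatten` and `load_summary` are hypothetical helper names, not part of TraceML.

```python
import json
from pathlib import Path
from typing import Any, Dict


def flatten(obj: Any, prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts and lists into dotted keys.

    Example: {"a": {"b": 1}} becomes {"a.b": 1}.
    """
    items: Dict[str, Any] = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            items.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            items.update(flatten(value, f"{prefix}{index}."))
    else:
        # Leaf value: drop the trailing dot from the accumulated prefix.
        items[prefix[:-1]] = obj
    return items


def load_summary(path: str) -> Dict[str, Any]:
    """Load a saved summary file and return it as a flat dict."""
    with Path(path).open(encoding="utf-8") as handle:
        return flatten(json.load(handle))


# Usage (replace the placeholder with your own run directory):
# for key, value in load_summary("logs/<run_name>/final_summary.json").items():
#     print(f"{key}: {value}")
```

Flattening first makes it easy to grep for a field or diff two runs by hand before reaching for traceml compare.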
You can re-print a saved summary later without rerunning training:
```bash
traceml view logs/<run_name>/final_summary.json
```
Want a shareable report? Add --html-report to also write a self-contained
final_summary.html (inline styling, no network, opens in any browser), or
render one from a saved run after the fact:
```bash
traceml run train.py --html-report
traceml view logs/<run_name>/final_summary.json --html  # writes <...>.html
```
Instead of guessing why training feels slow, you get a compact diagnosis of where step time and memory went.
Example TraceML output:
```text
+----------------------------------------------------------------------------+
| Step Time                                                                  |
| - Diagnosis: INPUT STRAGGLER                                               |
| - Scope: compared over last 460 aligned steps across 4 global ranks        |
| - Stats: total 303.7ms | input 254.5ms | compute 259.5ms | wait 40.5ms     |
| - Why: r0 input was slower than median global rank (254.5/3.8ms).          |
+----------------------------------------------------------------------------+
```
In this example, rank 0 is the slow input rank, which can hold back the aligned distributed step.
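One way to reason about a diagnosis like this is to compare each rank's input time with the median across ranks. The following is an illustrative sketch, not TraceML's actual detection logic: `find_input_straggler`, the 2x ratio threshold, and the per-rank values for ranks 1 to 3 are assumptions made for the example.

```python
from statistics import median
from typing import Dict, Optional


def find_input_straggler(
    input_ms_by_rank: Dict[int, float],
    ratio_threshold: float = 2.0,  # assumed cutoff, chosen for illustration
) -> Optional[int]:
    """Return the slowest rank if its input time is at least
    ratio_threshold times the median across ranks, else None."""
    if not input_ms_by_rank:
        return None
    med = median(input_ms_by_rank.values())
    slowest = max(input_ms_by_rank, key=input_ms_by_rank.get)
    if med > 0 and input_ms_by_rank[slowest] / med >= ratio_threshold:
        return slowest
    return None


# Rank 0 mirrors the 254.5 ms figure above; the other ranks are made up.
skewed = {0: 254.5, 1: 3.8, 2: 3.7, 3: 3.9}
balanced = {0: 4.0, 1: 3.8, 2: 3.7, 3: 3.9}

print(find_input_straggler(skewed))
print(find_input_straggler(balanced))
```

Comparing against the median rather than the mean keeps a single slow rank from dragging the baseline toward itself.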
For experiment trackers, call traceml.summary() near the end of your script
to get a flat dict of diagnosis statuses and average metrics. Keep
final_summary.json when you want the full run artifact or an input for
traceml compare.
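A flat dict like this usually needs a little shaping before it goes to a tracker, since most trackers chart numbers and store strings as metadata. A minimal sketch, assuming only what is stated above (a flat dict of statuses and averages): the key names in the example, the `traceml/` prefix, and the `split_summary` helper are hypothetical.

```python
from numbers import Real
from typing import Any, Dict, Tuple


def split_summary(
    summary: Dict[str, Any], prefix: str = "traceml/"
) -> Tuple[Dict[str, float], Dict[str, str]]:
    """Split a flat summary dict into numeric metrics and string statuses."""
    metrics: Dict[str, float] = {}
    statuses: Dict[str, str] = {}
    for key, value in summary.items():
        # bool is a subclass of int, so treat it as a status, not a metric.
        if isinstance(value, Real) and not isinstance(value, bool):
            metrics[prefix + key] = float(value)
        else:
            statuses[prefix + key] = str(value)
    return metrics, statuses


# Stand-in for the dict returned by traceml.summary(); keys are invented.
example = {"step_time_diagnosis": "INPUT STRAGGLER", "avg_step_ms": 303.7}
metrics, statuses = split_summary(example)
print(metrics)
print(statuses)
```

The numeric dict can then be handed to your tracker's own logging call, for example wandb.log(metrics) or mlflow.log_metrics(metrics), with the statuses stored as tags or config.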
TraceML is meant to be the first check when a training run is slower than expected. It points you to the likely area before you decide whether to open a heavier profiler.
| Area | What TraceML surfaces | What to inspect next |
|---|---|---|
| Input pipeline | High input time or slow input rank | num_workers, pin_memory, transforms, tokenization, collate_fn, dataset/storage latency |
| GPU utilization / wait | Step time split across input, compute, and wait | input pipeline, CPU/GPU handoff, synchronization, distributed coordination |
| Distributed skew | One DDP/FSDP rank slower than the others | rank-local dataloading, data imbalance, node variance, storage/network differences |
| Memory creep | Memory usage growing during the run | retained tensors, logging references, loss accumulation, cached activations |
| Run regression | Changed metrics versus a known-good run | code changes, data changes, batch size, container, driver, hardware, infrastructure |
| Compute-heavy runs | Most time is spent in compute | open torch.profiler or Nsight for operator/kernel-level detail |
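To make the memory creep row concrete: sustained growth can be separated from step-to-step noise by fitting a straight line to per-step memory samples and looking at its slope. This is a hedged sketch of that idea rather than TraceML's method; `memory_slope_mb_per_step` and the sample values are illustrative. In a real loop the samples could come from torch.cuda.memory_allocated(), which reports bytes.

```python
from typing import Sequence


def memory_slope_mb_per_step(samples_mb: Sequence[float]) -> float:
    """Least-squares slope of memory (MB) against step index.

    A slope near zero suggests stable memory; a persistently positive
    slope suggests something is being retained across steps.
    """
    n = len(samples_mb)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples_mb) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples_mb))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


# Illustrative samples: memory grows by 2 MB every step.
print(memory_slope_mb_per_step([1000.0, 1002.0, 1004.0, 1006.0]))
```

A common cause of a positive slope is accumulating the loss tensor itself (for example `total += loss`) instead of `loss.item()`, which keeps each step's autograd graph alive.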
Compare a slow run against a known good baseline to identify which metrics changed:
```bash
traceml compare input_slow/final_summary.json input_fixed/final_summary.json
```
```text
+--------------------------------------------------------------------------------------+
| TraceML Compare                                                                      |
+--------------------------------------------------------------------------------------+
| Verdict: IMPROVEMENT                                                                 |
| Why: Step time decreased by 95.6%.                                                   |
|                                                                                      |
| Metric        A           B           Delta                                          |
| Total step    294.0 ms    13.0 ms     -280.9 ms (-95.6%)                             |
| Input         66.4 ms     2.7 ms      -63.7 ms (-95.9%)                              |
+--------------------------------------------------------------------------------------+
```
See Compare Runs for the full report format.
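The figures in the Delta column are consistent with a simple calculation: the absolute change is B minus A, and the percentage is that change relative to A. A small sketch for checking numbers by hand, with `metric_delta` as a hypothetical helper name:

```python
from typing import Tuple


def metric_delta(a_ms: float, b_ms: float) -> Tuple[float, float]:
    """Return (absolute change in ms, percent change relative to A)."""
    if a_ms == 0:
        raise ValueError("baseline A must be non-zero to compute a percentage")
    change = b_ms - a_ms
    return change, change / a_ms * 100.0


# Input row from the example above: A = 66.4 ms, B = 2.7 ms.
change, percent = metric_delta(66.4, 2.7)
print(f"{change:.1f} ms ({percent:.1f}%)")
```

Applied to the rounded Total step values (294.0 and 13.0), this gives -281.0 ms rather than the displayed -280.9 ms, presumably because the report works from unrounded measurements.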
TraceML controls what you see during training with the --mode flag, without
changing the final saved artifacts.
| Mode flag | Experience during training | Supported topology |
|---|---|---|
| `--mode=summary` (default) | Silent execution | Single-node and multi-node multi-GPU |
| `--mode=cli` | Live terminal display | Single-node, including multi-GPU |
| `--mode=dashboard` | Live browser display | Single-node; requires `pip install "traceml-ai[dashboard]"` |
Works today:
- Single GPU training
- Single-node multi-GPU DDP / FSDP
- Multi-node DDP summary reports
- Multi-node runs on Slurm (sbatch template + guide)
- Run-to-run comparison from final_summary.json
- Custom PyTorch loops, Hugging Face, PyTorch Lightning, and Ray Train
On the roadmap:
- Multi-node live CLI / browser dashboard
- Explicit collective / NCCL timing
TraceML does not replace torch.profiler. It is the low-overhead first pass that
helps you decide where to aim heavier profiling tools.
| Tool | Best used for | Output | Cost / overhead |
|---|---|---|---|
| TraceML | Classifying high-level bottlenecks: input, compute, wait, memory, rank skew | JSON fingerprint, text summary, live views | <2% in current benchmark runs; small code wrapper |
| `torch.profiler` | Inspecting expensive ops, kernels, and CUDA activity | Profiler trace | Higher overhead; requires profiler context |
| Nsight Systems | Debugging low-level CUDA and kernel behavior | GPU timeline | Separate profiler run |
| W&B / MLflow | Tracking training metrics and experiment history | Metrics dashboard / run history | Logging integration |
| `nvidia-smi` | Checking machine-level GPU health and utilization | Terminal metrics | No code changes |
In our benchmark runs, TraceML adds:
- <2% overhead on single GPU at default settings
- <1% overhead on single-node multi-GPU at default settings
These guides cover the common bottlenecks TraceML is designed to identify:
- Find why PyTorch training is slow
- Find DataLoader Bottlenecks
- Debug Low GPU Utilization
- Debug DDP Rank Stragglers
- Find PyTorch Memory Creep
- Distributed Training
- Running on Slurm
- Use With Your Stack
- Compare Runs
- How to Read Output
- FAQ
For bugs, unexpected results, or feature requests, open a GitHub issue and use the matching issue template. The templates ask for the details we need to reproduce training-environment problems, including hardware, topology, launch command, TraceML version, PyTorch/CUDA versions, and redacted summary output.
GitHub issues: open an issue
If TraceML helped you find a real bottleneck, use the "I found a bottleneck" issue template. These reports help other training teams recognize similar problems.
Security reports: see SECURITY.md
Email: support@traceopt.ai
Contributions are welcome, especially:
- real slowdown examples and repros
- distributed training edge cases
- docs improvements
- framework integrations
See CONTRIBUTING.md for development setup and contribution guidelines.
Apache 2.0. See LICENSE.
TraceOpt is a trademark of OptAI UG (haftungsbeschränkt).