Goal: Correlate NCCL collective events, GPU telemetry (DCGM/NVML), and NIC/RDMA counters (ethtool/devlink) into a single, time-aligned report that helps you quickly identify training stalls caused by network congestion, completion queue (CQ) issues, or GPU starvation.
Status: early alpha. Simulation mode is stable, and real-time collectors for NVIDIA GPUs (`nvidia-smi`) and NICs (`ethtool`) are available; advanced DCGM/devlink collectors are planned behind feature flags. The legacy `inspector` binary remains available as a deprecated alias for one release.
- Enable NCCL logs in your training job:

      export NCCL_DEBUG=INFO
      export NCCL_DEBUG_SUBSYS=GRAPH,COLL,INIT

- Start real-time collectors:

      # Collect GPU telemetry
      nccue collect --gpu 0 --interval 1s --output gpu.csv &
      # Collect NIC telemetry
      nccue collect --nic eth0 --interval 1s --output nic.csv &

- Run your training job (NCCL logs will be captured to `nccl.log`).

- Stop collectors with Ctrl+C and correlate:

      nccue --out report.html correlate \
        --nccl-log nccl.log \
        --gpu-csv gpu.csv \
        --nic-csv nic.csv

- `docs/man/nccue.1.md` contains the latest `nccue(1)` help text and subcommand usage (plus alias notes).
- Regenerate it after CLI changes with `scripts/generate-man.sh` (captures `--help` output for the command and each subcommand).
- Put common defaults in `nccue.yaml` (auto-loaded from the working directory or passed via `--config`). Fields cover global output/log settings plus per-command inputs (`correlate.*`, `simulate.duration`). Every key can also be overridden via env vars such as `NCCUE_OUT`, `NCCUE_FORMAT`, and `NCCUE_NCCL_LOG`.
- Use `nccue doctor` to verify prerequisites on a host. It checks for `nvidia-smi`, `dcgmi`, `ethtool`, `devlink`, and basic permission requirements so you can spot missing packages before running collectors.
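To give a feel for the kind of check `nccue doctor` performs, here is a minimal sketch of a PATH lookup for the four tools listed above. It is not the tool's actual implementation: the function names `find_tool` and `missing_tools` are hypothetical, and the sketch only tests that a regular file with the right name exists, not that it is executable or that permissions are sufficient.

```rust
use std::env;
use std::ffi::OsStr;
use std::path::PathBuf;

/// Tools the README says `nccue doctor` looks for.
const REQUIRED_TOOLS: [&str; 4] = ["nvidia-smi", "dcgmi", "ethtool", "devlink"];

/// Return the first path on `path_var` that is a regular file named `tool`.
/// (Hypothetical helper; a real check would also verify the executable bit.)
fn find_tool(tool: &str, path_var: &OsStr) -> Option<PathBuf> {
    env::split_paths(path_var)
        .map(|dir| dir.join(tool))
        .find(|candidate| candidate.is_file())
}

/// List the required tools that cannot be found on `path_var`,
/// in the same order as `REQUIRED_TOOLS`.
fn missing_tools(path_var: &OsStr) -> Vec<&'static str> {
    REQUIRED_TOOLS
        .iter()
        .copied()
        .filter(|tool| find_tool(tool, path_var).is_none())
        .collect()
}
```

Calling `missing_tools(&std::env::var_os("PATH").unwrap_or_default())` would report which of the four tools are absent from the current `PATH`.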
NCCue/
├─ Cargo.toml # workspace
├─ crates/
│ ├─ nccue-core/ # data model, parsers, correlation
│ ├─ nccue-renderer/ # HTML report renderer
│ └─ nccue-cli/ # CLI: simulate | correlate (binary `nccue`, alias `inspector`)
├─ docs/ # detailed docs for agents & contributors
├─ scripts/ # helper scripts (nccl-tests runner, sampling)
└─ samples/ # example logs/CSVs (placeholder)
- NCCL shows collective phases but not NIC/RDMA counters.
- DCGM shows GPU/PCIe metrics but not NCCL context.
- ethtool/devlink show NIC counters but not which collective they align with.

This repo aligns all three so you can see when and why a collective stalls.
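The core idea of time alignment can be sketched in a few lines: take a collective's time window, select the telemetry samples whose timestamps fall inside it, and look at how a counter changed over that span. The types and field names below (`Window`, `NicSample`, `tx_discards`) are hypothetical and are not the `nccue-core` data model; the sketch also assumes samples are already sorted by timestamp and share a clock with the NCCL events.

```rust
/// A collective's time span in milliseconds since capture start (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
struct Window {
    label: String,
    start_ms: u64,
    end_ms: u64,
}

/// One NIC counter reading (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
struct NicSample {
    ts_ms: u64,
    tx_discards: u64,
}

/// Return the samples whose timestamps fall inside `window` (inclusive on both ends).
/// `samples` must be sorted by `ts_ms` so binary search can find the bounds.
fn samples_in_window<'a>(window: &Window, samples: &'a [NicSample]) -> &'a [NicSample] {
    let lo = samples.partition_point(|s| s.ts_ms < window.start_ms);
    let hi = samples.partition_point(|s| s.ts_ms <= window.end_ms);
    // `max` guards against a malformed window whose end precedes its start.
    &samples[lo..hi.max(lo)]
}

/// Growth of the discard counter across the window.
/// Returns 0 when fewer than two samples overlap the window.
fn discard_delta(window: &Window, samples: &[NicSample]) -> u64 {
    let inside = samples_in_window(window, samples);
    match (inside.first(), inside.last()) {
        (Some(first), Some(last)) => last.tx_discards.saturating_sub(first.tx_discards),
        _ => 0,
    }
}
```

A large `discard_delta` during a long window is the kind of signal that points at network congestion rather than GPU starvation. With a 1s sampling interval, short collectives may overlap zero or one sample, which is why the sketch returns 0 in that case instead of guessing.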
- HTML reports - Interactive, standalone HTML with embedded CSS and data visualization
- JSON output - Machine-readable format for automation and custom analysis pipelines
# HTML output (default)
cargo run -p nccue-cli -- --out report.html correlate --nccl-log nccl.log --gpu-csv gpu.csv --nic-csv nic.csv
# JSON output for automation
cargo run -p nccue-cli -- --format json --out report.json correlate --nccl-log nccl.log --gpu-csv gpu.csv --nic-csv nic.csv

- Summary cards highlight capture span, longest windows, and telemetry peaks at a glance.
- Inline GPU and NIC trend charts (no external JS) make congestion bursts obvious.
- Congestion counts, error spikes, and duration percentiles now appear in both HTML and JSON outputs.
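As an illustration of what a duration percentile means in this context, the sketch below computes nearest-rank percentiles over a list of window durations. The nearest-rank method and the function name `percentile` are assumptions made for the example; the report may use a different definition (for instance, linear interpolation).

```rust
/// Nearest-rank percentile of `durations_ms` for an integer percentile `p` in 1..=100.
/// Returns `None` for empty input or an out-of-range `p`.
fn percentile(durations_ms: &[u64], p: u32) -> Option<u64> {
    if durations_ms.is_empty() || p == 0 || p > 100 {
        return None;
    }
    let mut sorted = durations_ms.to_vec();
    sorted.sort_unstable();
    // Nearest-rank: rank = ceil(p / 100 * n), computed in integer arithmetic.
    let rank = (p as usize * sorted.len() + 99) / 100;
    Some(sorted[rank - 1])
}
```

A p50 close to the typical collective time combined with a much larger p99 suggests occasional stalls rather than uniformly slow collectives.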
The nccue-core crate provides a complete API for custom analysis. Check out the examples:
# Basic parsing example
cargo run --example basic_parsing
# Correlation and window creation
cargo run --example correlation
# Custom performance analysis
cargo run --example custom_analysis

Example: Custom Analysis Pattern
use nccue_core::{
parse_nccl_info_lines, windows_from_events, correlate, summarize_report, Report,
};
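// Parse NCCL logs. `nccl_log`, `gpu_samples`, and `nic_samples` (used below)
// are assumed to be loaded already, e.g. from the log file and collector CSVs.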
let events = parse_nccl_info_lines(&nccl_log)?;
// Create windows and correlate
let mut windows = windows_from_events(&events);
let mut notes = Vec::new();
correlate(&mut windows, &mut notes);
// Summarize telemetry
let summary = summarize_report(&windows, &gpu_samples, &nic_samples);
// Create report
let report = Report { windows, gpu_samples, nic_samples, notes, summary };
// Export as JSON
let json = serde_json::to_string_pretty(&report)?;

Benchmarks are available for all core operations:
# Run benchmarks
cargo bench --bench parsing_benchmarks
# Benchmark specific component
cargo bench --bench parsing_benchmarks -- nccl_parsing

Typical performance (on modern hardware):
- NCCL log parsing: ~100k events/sec
- CSV parsing: ~50k samples/sec
- Window correlation: ~10k windows/sec
Comprehensive test suite covering:
- Parser correctness and error handling
- Correlation logic with edge cases
- HTML/JSON/NDJSON rendering
- Integration tests with sample data
- GPU and NIC collectors (unit and integration)
- CLI and renderer behavior
# Run all tests
cargo test --workspace
# Run specific test suite
cargo test -p nccue-core
# Run with output
cargo test -- --nocapture

- Rust: 1.70.0 or later (MSRV enforced in CI)
- Linux: Primary target platform
- Optional: NVIDIA GPU with drivers for real GPU monitoring
- Optional: Mellanox NIC for real network monitoring
Current: Early alpha with comprehensive test coverage, full CI/CD pipeline, simulation mode, and basic real collectors for GPU/NIC telemetry.