NCCue

Goal: Correlate NCCL collective events, GPU telemetry (DCGM/NVML), and NIC/RDMA counters (ethtool/devlink) into a single, time-aligned report to quickly identify training stalls caused by network congestion, CQ issues, or GPU starvation.

Status: early alpha. Simulation mode is stable and real-time collectors for NVIDIA GPUs (nvidia-smi) and NICs (ethtool) are available; advanced DCGM/devlink collectors are planned behind feature flags. The legacy inspector binary remains available as a deprecated alias for one release.

Quick start (real logs & samplers)

Enable NCCL logs in your training job:

export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=GRAPH,COLL,INIT

Start real-time collectors:

# Collect GPU telemetry
nccue collect --gpu 0 --interval 1s --output gpu.csv &

# Collect NIC telemetry
nccue collect --nic eth0 --interval 1s --output nic.csv &

Run your training job (NCCL logs will be captured to nccl.log)

Stop collectors with Ctrl+C and correlate:

nccue --out report.html correlate \
  --nccl-log nccl.log \
  --gpu-csv gpu.csv \
  --nic-csv nic.csv

Manual page

docs/man/nccue.1.md contains the latest nccue(1) help text and subcommand usage (plus alias notes).
Regenerate it after CLI changes with scripts/generate-man.sh (captures --help output for the command and each subcommand).

Configuration & diagnostics

Put common defaults in nccue.yaml (auto-loaded from the working directory or passed via --config). Fields cover global output/log settings plus per- command inputs (correlate.*, simulate.duration). Every key can also be overridden via env vars such as NCCUE_OUT, NCCUE_FORMAT, and NCCUE_NCCL_LOG.
Use nccue doctor to verify pre-requisites on a host. It checks for nvidia-smi, dcgmi, ethtool, devlink, and basic permission requirements so you can spot missing packages before running collectors.

Project layout

NCCue/
├─ Cargo.toml                 # workspace
├─ crates/
│  ├─ nccue-core/        # data model, parsers, correlation
│  ├─ nccue-renderer/    # HTML report renderer
│  └─ nccue-cli/         # CLI: simulate | correlate (binary `nccue`, alias `inspector`)
├─ docs/                     # detailed docs for agents & contributors
├─ scripts/                  # helper scripts (nccl-tests runner, sampling)
└─ samples/                  # example logs/CSVs (placeholder)

Why this exists

NCCL shows collective phases but not NIC/RDMA counters.
DCGM shows GPU/PCIe metrics but not NCCL context.
ethtool/devlink show NIC counters but not which collective they align with. This repo aligns all three so you can see when and why a collective stalls.

Features

Output Formats

HTML reports - Interactive, standalone HTML with embedded CSS and data visualization
JSON output - Machine-readable format for automation and custom analysis pipelines

# HTML output (default)
cargo run -p nccue-cli -- --out report.html correlate --nccl-log nccl.log --gpu-csv gpu.csv --nic-csv nic.csv

# JSON output for automation
cargo run -p nccue-cli -- --format json --out report.json correlate --nccl-log nccl.log --gpu-csv gpu.csv --nic-csv nic.csv

Insightful Reports

Summary cards highlight capture span, longest windows, and telemetry peaks at a glance.
Inline GPU and NIC trend charts (no external JS) make congestion bursts obvious.
Congestion counts, error spikes, and duration percentiles now appear in both HTML and JSON outputs.

API Examples

The nccue-core crate provides a complete API for custom analysis. Check out the examples:

# Basic parsing example
cargo run --example basic_parsing

# Correlation and window creation
cargo run --example correlation

# Custom performance analysis
cargo run --example custom_analysis

Example: Custom Analysis Pattern

use nccue_core::{
    parse_nccl_info_lines, windows_from_events, correlate, summarize_report, Report,
};

// Parse NCCL logs
let events = parse_nccl_info_lines(&nccl_log)?;

// Create windows and correlate
let mut windows = windows_from_events(&events);
let mut notes = Vec::new();
correlate(&mut windows, &mut notes);

// Summarize telemetry
let summary = summarize_report(&windows, &gpu_samples, &nic_samples);

// Create report
let report = Report { windows, gpu_samples, nic_samples, notes, summary };

// Export as JSON
let json = serde_json::to_string_pretty(&report)?;

Performance

Benchmarks available for all core operations:

# Run benchmarks
cargo bench --bench parsing_benchmarks

# Benchmark specific component
cargo bench --bench parsing_benchmarks -- nccl_parsing

Typical performance (on modern hardware):

NCCL log parsing: ~100k events/sec
CSV parsing: ~50k samples/sec
Window correlation: ~10k windows/sec

Testing

Comprehensive test suite covering:

Parser correctness and error handling
Correlation logic with edge cases
HTML/JSON/NDJSON rendering
Integration tests with sample data
GPU and NIC collectors (unit and integration)
CLI and renderer behavior

# Run all tests
cargo test --workspace

# Run specific test suite
cargo test -p nccue-core

# Run with output
cargo test -- --nocapture

Requirements

Rust: 1.70.0 or later (MSRV enforced in CI)
Linux: Primary target platform
Optional: NVIDIA GPU with drivers for real GPU monitoring
Optional: Mellanox NIC for real network monitoring

Project Status

Current: Early alpha with comprehensive test coverage, full CI/CD pipeline, simulation mode, and basic real collectors for GPU/NIC telemetry.

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
.github		.github
crates		crates
docs/man		docs/man
examples		examples
samples		samples
scripts		scripts
.gitignore		.gitignore
Cargo.toml		Cargo.toml
LICENSE		LICENSE
README.md		README.md
clippy.toml		clippy.toml
deny.toml		deny.toml
rust-toolchain.toml		rust-toolchain.toml
rustfmt.toml		rustfmt.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

NCCue

Quick start (real logs & samplers)

Manual page

Configuration & diagnostics

Project layout

Why this exists

Features

Output Formats

Insightful Reports

API Examples

Performance

Testing

Requirements

Project Status

About

Uh oh!

Releases

Packages

Languages

License

eladwf/NCCue

Folders and files

Latest commit

History

Repository files navigation

NCCue

Quick start (real logs & samplers)

Manual page

Configuration & diagnostics

Project layout

Why this exists

Features

Output Formats

Insightful Reports

API Examples

Performance

Testing

Requirements

Project Status

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages