NCCue - a lightweight RDMA + GPU inspection toolkit for analyzing NCCL-style communication patterns

eladwf/NCCue

NCCue

Goal: Correlate NCCL collective events, GPU telemetry (DCGM/NVML), and NIC/RDMA counters (ethtool/devlink) into a single, time-aligned report, so that training stalls caused by network congestion, completion queue (CQ) issues, or GPU starvation can be identified quickly.

Status: early alpha. Simulation mode is stable and real-time collectors for NVIDIA GPUs (nvidia-smi) and NICs (ethtool) are available; advanced DCGM/devlink collectors are planned behind feature flags. The legacy inspector binary remains available as a deprecated alias for one release.

Quick start (real logs & samplers)

  1. Enable NCCL logs in your training job:

    export NCCL_DEBUG=INFO
    export NCCL_DEBUG_SUBSYS=GRAPH,COLL,INIT
  2. Start real-time collectors:

    # Collect GPU telemetry
    nccue collect --gpu 0 --interval 1s --output gpu.csv &
    
    # Collect NIC telemetry
    nccue collect --nic eth0 --interval 1s --output nic.csv &
  3. Run your training job and capture its NCCL output in nccl.log, for example by redirecting stdout/stderr or by setting NCCL_DEBUG_FILE=nccl.log.

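  4. Stop the collectors and correlate. Because they were started in the background with &, bring each one to the foreground with fg and press Ctrl+C, or send the same signal with kill -INT %1 %2: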

    nccue --out report.html correlate \
      --nccl-log nccl.log \
      --gpu-csv gpu.csv \
      --nic-csv nic.csv
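
To give a feel for what the correlate step has to read, here is a minimal sketch of parsing one NCCL INFO line. It assumes the line layout `host:pid:tid [rank] NCCL INFO message` that NCCL typically emits with NCCL_DEBUG=INFO. The `NcclInfoLine` type and `parse_info_line` function are hypothetical names for illustration and are not the nccue-core API.

```rust
/// One parsed NCCL INFO line. Hypothetical type, for illustration only.
#[derive(Debug, PartialEq)]
pub struct NcclInfoLine {
    pub host: String,
    pub pid: u32,
    pub rank: u32,
    pub message: String,
}

/// Parses lines shaped like `host:pid:tid [rank] NCCL INFO message`.
/// Returns `None` for anything that does not match that assumed layout,
/// such as ordinary training output mixed into the same log.
pub fn parse_info_line(line: &str) -> Option<NcclInfoLine> {
    // Split the prefix (`host:pid:tid [rank] `) from the message body.
    let (prefix, message) = line.split_once("NCCL INFO ")?;
    let (ids, rank_part) = prefix.trim().split_once(' ')?;

    // `ids` is `host:pid:tid`; the thread id is not needed here.
    let mut fields = ids.split(':');
    let host = fields.next()?.to_string();
    let pid = fields.next()?.parse().ok()?;

    // `rank_part` is `[rank]`.
    let rank = rank_part
        .trim()
        .strip_prefix('[')?
        .strip_suffix(']')?
        .parse()
        .ok()?;

    Some(NcclInfoLine {
        host,
        pid,
        rank,
        message: message.trim().to_string(),
    })
}
```

Returning `None` instead of an error keeps the sketch tolerant of interleaved non-NCCL lines; a real parser would likely also want timestamps, which this layout does not carry.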

Manual page

  • docs/man/nccue.1.md contains the latest nccue(1) help text and subcommand usage (plus alias notes).
  • Regenerate it after CLI changes with scripts/generate-man.sh (captures --help output for the command and each subcommand).

Configuration & diagnostics

  • Put common defaults in nccue.yaml (auto-loaded from the working directory or passed via --config). Fields cover global output/log settings plus per-command inputs (correlate.*, simulate.duration). Every key can also be overridden via env vars such as NCCUE_OUT, NCCUE_FORMAT, and NCCUE_NCCL_LOG.
  • Use nccue doctor to verify prerequisites on a host. It checks for nvidia-smi, dcgmi, ethtool, devlink, and basic permission requirements so you can spot missing packages before running collectors.
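
The override order described above (environment variable over config file over built-in default) can be sketched as follows. This is an illustrative helper, not NCCue's actual loader: `resolve_setting`, the flat key/value maps, and the choice to treat an empty variable as unset are all assumptions.

```rust
use std::collections::HashMap;

/// Resolves one setting: an environment value wins over the config-file
/// value, which wins over the built-in default. Hypothetical helper.
pub fn resolve_setting(
    env: &HashMap<String, String>,
    config: &HashMap<String, String>,
    env_key: &str,
    config_key: &str,
    default: &str,
) -> String {
    env.get(env_key)
        .filter(|v| !v.is_empty()) // assumption: an empty variable counts as unset
        .or_else(|| config.get(config_key))
        .map(|v| v.as_str())
        .unwrap_or(default)
        .to_string()
}
```

Passing the environment in as a map (for example `std::env::vars().collect()`) rather than reading it inside the function keeps the precedence logic easy to test.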

Project layout

NCCue/
├─ Cargo.toml                 # workspace
├─ crates/
│  ├─ nccue-core/        # data model, parsers, correlation
│  ├─ nccue-renderer/    # HTML report renderer
│  └─ nccue-cli/         # CLI: simulate | correlate | collect | doctor (binary `nccue`, alias `inspector`)
├─ docs/                     # detailed docs for agents & contributors
├─ scripts/                  # helper scripts (nccl-tests runner, sampling)
└─ samples/                  # example logs/CSVs (placeholder)

Why this exists

  • NCCL shows collective phases but not NIC/RDMA counters.
  • DCGM shows GPU/PCIe metrics but not NCCL context.
  • ethtool/devlink show NIC counters but not which collective they align with.

This repo aligns all three so you can see when and why a collective stalls.
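
As a rough illustration of what "aligning" means here, the sketch below attaches telemetry samples to collective windows by timestamp and reports the peak value seen inside each window. The `Window` and `Sample` types and the `peak_per_window` function are hypothetical and simplified; they are not the nccue-core data model.

```rust
/// A collective's time span in milliseconds. Hypothetical type.
#[derive(Debug, Clone, Copy)]
pub struct Window {
    pub start_ms: u64,
    pub end_ms: u64,
}

/// One telemetry reading, for example a NIC counter delta. Hypothetical type.
#[derive(Debug, Clone, Copy)]
pub struct Sample {
    pub ts_ms: u64,
    pub value: f64,
}

/// For each window, returns the peak sample value observed inside
/// `[start_ms, end_ms]`, or `None` if no sample fell in that span.
/// `samples` must be sorted by timestamp.
pub fn peak_per_window(windows: &[Window], samples: &[Sample]) -> Vec<Option<f64>> {
    windows
        .iter()
        .map(|w| {
            // Binary search for the first sample at or after the window start.
            let first = samples.partition_point(|s| s.ts_ms < w.start_ms);
            samples[first..]
                .iter()
                .take_while(|s| s.ts_ms <= w.end_ms)
                .map(|s| s.value)
                .fold(None, |peak: Option<f64>, v| {
                    Some(peak.map_or(v, |p| p.max(v)))
                })
        })
        .collect()
}
```

A window that yields `None` is itself informative: with a 1s sampling interval, short collectives can fall entirely between two samples.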

Features

Output Formats

  • HTML reports - Interactive, standalone HTML with embedded CSS and data visualization
  • JSON output - Machine-readable format for automation and custom analysis pipelines

# HTML output (default)
cargo run -p nccue-cli -- --out report.html correlate --nccl-log nccl.log --gpu-csv gpu.csv --nic-csv nic.csv

# JSON output for automation
cargo run -p nccue-cli -- --format json --out report.json correlate --nccl-log nccl.log --gpu-csv gpu.csv --nic-csv nic.csv

Insightful Reports

  • Summary cards highlight capture span, longest windows, and telemetry peaks at a glance.
  • Inline GPU and NIC trend charts (no external JS) make congestion bursts obvious.
  • Congestion counts, error spikes, and duration percentiles appear in both HTML and JSON outputs.
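
For readers unfamiliar with duration percentiles, here is a minimal sketch using the nearest-rank method. The `percentile` function is hypothetical, and NCCue may compute its percentiles differently (for example with interpolation).

```rust
/// Nearest-rank percentile of `durations_ms` for `p` in (0, 100].
/// Returns `None` for empty input or an out-of-range `p`.
pub fn percentile(durations_ms: &[u64], p: f64) -> Option<u64> {
    if durations_ms.is_empty() || !(p > 0.0 && p <= 100.0) {
        return None;
    }
    let mut sorted = durations_ms.to_vec();
    sorted.sort_unstable();
    // Nearest-rank: the smallest value with at least p% of the data at or below it.
    let rank = (p * sorted.len() as f64 / 100.0).ceil() as usize;
    Some(sorted[rank - 1])
}
```

With window durations of 10, 20, 30, 40, and 50 ms, `percentile(&d, 50.0)` gives 30 and `percentile(&d, 90.0)` gives 50; a large gap between the median and the tail is the kind of signal that points at occasional stalls rather than uniformly slow collectives.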

API Examples

The nccue-core crate provides a complete API for custom analysis. Check out the examples:

# Basic parsing example
cargo run --example basic_parsing

# Correlation and window creation
cargo run --example correlation

# Custom performance analysis
cargo run --example custom_analysis

Example: Custom Analysis Pattern

use nccue_core::{
    parse_nccl_info_lines, windows_from_events, correlate, summarize_report, Report,
};

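// Parse NCCL logs.
// `nccl_log`, `gpu_samples`, and `nic_samples` are assumed to have been
// loaded earlier by the caller; this excerpt does not define them.
let events = parse_nccl_info_lines(&nccl_log)?;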

// Create windows and correlate
let mut windows = windows_from_events(&events);
let mut notes = Vec::new();
correlate(&mut windows, &mut notes);

// Summarize telemetry
let summary = summarize_report(&windows, &gpu_samples, &nic_samples);

// Create report
let report = Report { windows, gpu_samples, nic_samples, notes, summary };

// Export as JSON
let json = serde_json::to_string_pretty(&report)?;

Performance

Benchmarks are available for all core operations:

# Run benchmarks
cargo bench --bench parsing_benchmarks

# Benchmark specific component
cargo bench --bench parsing_benchmarks -- nccl_parsing

Typical performance (on modern hardware):

  • NCCL log parsing: ~100k events/sec
  • CSV parsing: ~50k samples/sec
  • Window correlation: ~10k windows/sec

Testing

Comprehensive test suite covering:

  • Parser correctness and error handling
  • Correlation logic with edge cases
  • HTML/JSON/NDJSON rendering
  • Integration tests with sample data
  • GPU and NIC collectors (unit and integration)
  • CLI and renderer behavior

# Run all tests
cargo test --workspace

# Run specific test suite
cargo test -p nccue-core

# Run with output
cargo test -- --nocapture

Requirements

  • Rust: 1.70.0 or later (MSRV enforced in CI)
  • Linux: Primary target platform
  • Optional: NVIDIA GPU with drivers for real GPU monitoring
  • Optional: Mellanox NIC for real network monitoring

Project Status

Current: Early alpha with comprehensive test coverage, full CI/CD pipeline, simulation mode, and basic real collectors for GPU/NIC telemetry.
