OSU-Nowlab/osu-hidl

OSU HPC-AI Stack

High-Performance Inference and Deep Learning Stack

This repository contains the NOWLAB high-performance deep learning stack, integrating CUDA-aware MPI libraries with state-of-the-art deep learning frameworks for distributed training and inference on HPC systems.

Overview

The OSU HPC-AI stack provides:

  • CUDA-aware MPI: MVAPICH-Plus for optimized GPU-to-GPU communication
  • Deep Learning Frameworks: Custom builds of PyTorch, DeepSpeed, vLLM, and SGLang with MPI integration
  • Serving Methods: MAC-Attention for reuse-based decode acceleration
  • Communication Runtime: MCR-DL for modular, backend-agnostic distributed communication
  • Optimized Collective Operations: GPUDirect RDMA support for high-performance distributed training
  • HPC-Ready: Pre-configured for common HPC systems (MRI, Vista, etc.)

Architecture

[Figure: OSU HPC-AI Stack Architecture]

Repository Structure

osu-hpc-ai/
├── frameworks/              # Deep learning framework submodules
│   ├── pytorch/            # PyTorch 2.10.0 with MVAPICH-Plus
│   ├── deepspeed/          # DeepSpeed with ZeRO optimization
│   ├── sglang/             # SGLang structured generation
│   ├── vllm/               # vLLM high-throughput inference
│   └── mcr-dl/             # MCR-DL communication runtime
├── methods/                # Reusable method implementations
│   └── mac-attention/      # MAC-Attention source package and kernels
├── build_scripts/          # Automated build scripts
│   └── mri/               # MRI cluster-specific builds
├── docs/                   # Documentation
│   ├── build-guides/      # Framework build instructions
│   └── user-guides/       # Usage and method documentation
├── tests/                  # Installation verification tests
│   ├── pytorch/
│   ├── deepspeed/
│   ├── sglang/
│   ├── vllm/
│   └── mac-attention/
├── examples/               # Usage examples
│   ├── deepspeed/         # DeepSpeed training examples
│   ├── megatron-lm/       # Megatron-LM dummy training example
│   ├── inference/         # vLLM inference examples
│   └── mac-attention/     # MAC-Attention workflow example
├── benchmarks/             # Performance benchmarks
│   ├── communication/     # MPI communication benchmarks
│   └── mac-attention/     # MAC-Attention performance benchmarks
├── recipe/                 # End-to-end workflows
├── hpc_ai/                 # Python utilities package
├── setup_runtime_env.sh    # Runtime environment setup
└── README.md

Quick Start

Prerequisites

  • System: HPC cluster with NVIDIA GPUs (Compute Capability ≥ 6.0)
  • CUDA: Version 11.0+ (12.x recommended)
  • GCC: Version 9.0+ (13.3.0 recommended)
  • Python: Version 3.8-3.12
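
The version requirements above can be checked before starting a build. The following is a minimal sketch, not a script shipped with this repository: the helper names are hypothetical, the minimum versions are copied from the list above, and the example strings only imitate the style of `nvcc --version` and `gcc --version` output.

```python
import re
import sys

# Minimum versions taken from the Prerequisites list above.
MIN_PYTHON = (3, 8)
MAX_PYTHON = (3, 12)
MIN_CUDA = (11, 0)
MIN_GCC = (9, 0)


def parse_version(text):
    """Return the first 'major.minor' pair found in text as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return int(match.group(1)), int(match.group(2))


def python_ok(version=None):
    """True if the interpreter version lies within the supported range."""
    major_minor = tuple(version or sys.version_info[:2])
    return MIN_PYTHON <= major_minor <= MAX_PYTHON


def meets_minimum(version_text, minimum):
    """True if the version embedded in version_text is at least minimum."""
    return parse_version(version_text) >= minimum


if __name__ == "__main__":
    print("Python OK:", python_ok())
    # Illustrative strings; in practice pass the real tool output.
    print("CUDA OK:", meets_minimum("release 12.6, V12.6.20", MIN_CUDA))
    print("GCC OK:", meets_minimum("gcc (GCC) 13.3.0", MIN_GCC))
```

GPU compute capability is not covered here because querying it requires a working CUDA runtime.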

Installation

Option 1: Pre-built Wheels (Recommended)

We provide pre-built wheels for common configurations:

# Visit https://hpc-ai.engineering.osu.edu for download instructions
# Registration required for pre-built wheel access

# pip install <wheel-url-provided-after-registration>

Option 2: Build from Source

# Clone the repository
git clone --recursive https://github.com/OSU-Nowlab/osu-hpc-ai.git
cd osu-hpc-ai

# Build PyTorch with MPI support
cd frameworks/pytorch
export USE_CUDA=1 USE_MPI=1
MAX_JOBS=4 python setup.py install
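
After the build finishes, one way to confirm that MPI support was compiled in is `torch.distributed.is_mpi_available()`, which is part of the public PyTorch API. The sketch below wraps that call; the function name `check_mpi_build` is hypothetical, and it returns `None` when PyTorch cannot be imported at all.

```python
import importlib


def check_mpi_build():
    """Report whether the installed PyTorch was built with MPI support.

    Returns True or False when torch can be imported, and None when it cannot.
    """
    try:
        dist = importlib.import_module("torch.distributed")
    except ImportError:
        return None
    # is_available() is False when PyTorch was built without distributed support.
    if not dist.is_available():
        return False
    return bool(dist.is_mpi_available())


if __name__ == "__main__":
    status = check_mpi_build()
    if status is None:
        print("PyTorch is not importable in this environment")
    else:
        print("MPI backend available:", status)
```

A result of `False` usually means the build did not find an MPI installation, so `USE_MPI=1` had no effect.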

For detailed build instructions and system-specific guides, see docs/build-guides/.

For distributed training with MPI, see docs/user-guides/.

Components

MVAPICH-Plus

High-performance CUDA-aware MPI library with GPUDirect RDMA support.

  • Version: 4.1+
  • Features: CUDA/ROCm support, UCX backend, GPUDirect RDMA
  • Documentation: MVAPICH User Guide

PyTorch

NOWLAB fork with CUDA-aware MPI and FP16 communication.

Quick Example:

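import torch
import torch.distributed as dist

# Initialize with the MPI backend (launch the script under mpirun/mpiexec)
dist.init_process_group(backend='mpi')

# Placeholder model; substitute your own nn.Module
model = torch.nn.Linear(1024, 1024).cuda()

# Wrap the model for distributed data-parallel training
model = torch.nn.parallel.DistributedDataParallel(model)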

DeepSpeed

Microsoft DeepSpeed with MVAPICH integration for ZeRO optimization.

vLLM

High-throughput LLM inference engine.

  • Repository: OSU-Nowlab/vllm
  • Features: PagedAttention, distributed inference

SGLang

Structured generation language serving framework.

MAC-Attention

Reuse-based decode acceleration method for long-context inference.

  • Type: Method, not a serving framework
  • Documentation: Build Guide, User Guide
  • Assets: Example, Benchmarks, Tests
  • Current Status: Imported method package and validation assets. Framework-level serving-path integration is still a separate step.

Megatron-LM and verl

Experimental reinforcement learning validation path for running verl with Megatron-LM as the trainer and vLLM as the rollout engine on MRI.

  • Megatron-LM Documentation: Build Guide, Dummy Example
  • Current Status: Installation scripts, smoke tests, and a gated dummy recipe are present. The full dummy run still needs MRI GPU validation before production use.

Building from Source

See the build guides under docs/build-guides/.

Testing and Benchmarking

Correctness Tests

Tests verify that operations produce correct results within tolerance:

# Run PyTorch MPI correctness test
cd tests/pytorch
mpirun -n 2 python test_mpi_comm.py

See tests/README.md for more test suites.
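
A check of this kind compares a collective's output with a value computed independently. The sketch below illustrates the idea for a sum all-reduce without needing MPI. It is not the contents of test_mpi_comm.py; the function names and tolerance values are assumptions chosen for illustration.

```python
import math


def expected_allreduce_sum(world_size):
    """Expected result of a sum all-reduce when rank r contributes r + 1."""
    return world_size * (world_size + 1) / 2


def within_tolerance(actual, expected, rel_tol=1e-3, abs_tol=1e-5):
    """True if actual matches expected within relative and absolute tolerance."""
    return math.isclose(actual, expected, rel_tol=rel_tol, abs_tol=abs_tol)


def simulate_allreduce_sum(world_size):
    """Stand-in for the real collective: sum the per-rank contributions locally."""
    return float(sum(rank + 1 for rank in range(world_size)))


if __name__ == "__main__":
    for n in (2, 4, 8):
        result = simulate_allreduce_sum(n)
        print(n, within_tolerance(result, expected_allreduce_sum(n)))
```

In a real test, `simulate_allreduce_sum` would be replaced by `dist.all_reduce` on a tensor. Reduced-precision types such as FP16 generally need a looser tolerance than FP32.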

Performance Benchmarks

Benchmarks measure performance to detect regressions between releases:

# Run communication benchmarks
cd benchmarks/communication
mpirun -n 4 python run_all.py --scan

Individual operations can be benchmarked:

# Benchmark all_reduce with message size scanning
mpirun -n 4 python all_reduce.py --scan

# Benchmark specific operations
mpirun -n 4 python run_all.py --scan --all-reduce --broadcast

See benchmarks/README.md for more benchmarking options.
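
Message-size scans commonly sweep sizes in powers of two and report bandwidth at each size. The sketch below assumes that convention; the exact behavior of `--scan` in this repository is not specified here, and both function names are hypothetical. The bus-bandwidth factor 2(n-1)/n for all-reduce follows the convention used by the NCCL tests.

```python
def scan_sizes(min_bytes=8, max_bytes=1 << 20):
    """Return message sizes in bytes, doubling from min_bytes up to max_bytes."""
    sizes = []
    size = min_bytes
    while size <= max_bytes:
        sizes.append(size)
        size *= 2
    return sizes


def allreduce_bandwidth(size_bytes, seconds, world_size):
    """Return (algorithm bandwidth, bus bandwidth) in GB/s for one all-reduce."""
    algbw = size_bytes / seconds / 1e9
    # Scale by 2(n-1)/n so results are comparable across different rank counts.
    busbw = algbw * 2 * (world_size - 1) / world_size
    return algbw, busbw


if __name__ == "__main__":
    print(scan_sizes(8, 64))
    print(allreduce_bandwidth(1 << 30, 0.1, 4))
```

Reporting bus bandwidth rather than raw time makes it easier to compare runs with different rank counts when looking for regressions.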

Examples

Distributed Training with PyTorch + MPI

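# examples/pytorch/distributed_training.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Initialize MPI backend (rank and world size come from the MPI launcher)
dist.init_process_group(backend='mpi')

# Placeholder model, loss, and synthetic data; replace with your own
model = DDP(torch.nn.Linear(32, 4).cuda())
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
dataloader = [(torch.randn(8, 32), torch.randint(0, 4, (8,))) for _ in range(10)]

# Training loop
for inputs, labels in dataloader:
    inputs, labels = inputs.cuda(), labels.cuda()
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    loss.backward()  # DDP averages gradients across ranks during backward
    optimizer.step()

dist.destroy_process_group()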

Run with:

mpiexec -n 4 python examples/pytorch/distributed_training.py
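
When several ranks share a node, each rank usually selects its own GPU before creating CUDA tensors. The sketch below reads the node-local rank from environment variables set by the launcher. Which variable is present depends on the MPI library and scheduler, so the list is an assumption to adapt, and the `MVP_` name in particular should be checked against your MVAPICH documentation. Both helper names are hypothetical.

```python
import os

# Candidate variables, checked in order. Treat this list as an assumption
# and adjust it for the launcher used on your system.
LOCAL_RANK_VARS = (
    "MVP_COMM_WORLD_LOCAL_RANK",   # assumed name for newer MVAPICH releases
    "MV2_COMM_WORLD_LOCAL_RANK",   # MVAPICH2
    "OMPI_COMM_WORLD_LOCAL_RANK",  # Open MPI
    "SLURM_LOCALID",               # Slurm srun
)


def local_rank(env=None, candidates=LOCAL_RANK_VARS, default=0):
    """Return the node-local rank from the first candidate variable that is set."""
    env = os.environ if env is None else env
    for name in candidates:
        value = env.get(name)
        if value is not None:
            return int(value)
    return default


def device_index(rank_on_node, gpus_per_node):
    """Map a node-local rank to a GPU index, wrapping if ranks exceed GPUs."""
    if gpus_per_node < 1:
        raise ValueError("gpus_per_node must be at least 1")
    return rank_on_node % gpus_per_node


if __name__ == "__main__":
    rank = local_rank()
    print("local rank:", rank, "-> GPU", device_index(rank, 4))
```

In a training script, the result would typically be passed to `torch.cuda.set_device(...)` with `torch.cuda.device_count()` as the GPU count, before any model is moved to the GPU.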

DeepSpeed Training

# ZeRO Stage 3 training
deepspeed --num_gpus=8 examples/deepspeed/train_gpt.py \
    --deepspeed_config=ds_config.json
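
The command above expects a ds_config.json file. The following is a minimal sketch of generating one for ZeRO Stage 3. The keys are standard DeepSpeed configuration options, but the values are illustrative assumptions rather than the configuration used in this repository, and the helper names are hypothetical.

```python
import json


def make_zero3_config(micro_batch_size=1, grad_accum_steps=1):
    """Build a minimal DeepSpeed config dict that enables ZeRO Stage 3."""
    return {
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "gradient_accumulation_steps": grad_accum_steps,
        "zero_optimization": {
            "stage": 3,            # partition optimizer state, gradients, and parameters
            "overlap_comm": True,  # overlap gradient reduction with backward computation
        },
        "fp16": {"enabled": True},
    }


def write_config(path="ds_config.json", **kwargs):
    """Write the config to path as JSON and return the dict that was written."""
    config = make_zero3_config(**kwargs)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(config, handle, indent=2)
    return config


if __name__ == "__main__":
    print(json.dumps(make_zero3_config(), indent=2))
```

Calling `write_config()` in the working directory produces a file the `deepspeed` launcher can read; batch sizes should be tuned to the model and GPU memory.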

vLLM Inference

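from vllm import LLM, SamplingParams

# Example prompt and sampling settings; adjust for your workload
prompts = ["Explain CUDA-aware MPI in one sentence."]
params = SamplingParams(temperature=0.8, max_tokens=64)

llm = LLM(model="meta-llama/Llama-2-7b-hf", tensor_parallel_size=4)
outputs = llm.generate(prompts, sampling_params=params)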

See examples/ for more examples.

Supported Systems

We officially support and test on:

| System | Architecture | GPUs | CUDA | Status |
|---|---|---|---|---|
| MRI Cluster | x86_64 | A100 | 12.6 | Supported |
| TACC Vista | ARM64 | GH200 | 12.5 | Supported |
| Generic HPC | x86_64 | V100+ | 11.0+ | Community |

Performance

PyTorch Distributed Data Parallel with MVAPICH-Plus

GPT-2 training performance using NanoGPT benchmark on leading HPC systems. MVAPICH-Plus provides competitive or superior performance compared to vendor-optimized communication libraries.

OLCF Frontier (AMD MI250X)

| CPU | Memory | GPU | Interconnect |
|---|---|---|---|
| AMD EPYC 7A53 (64 cores @ 2GHz) | 512 GB DDR4 | AMD MI250X (4/Node, 128 GB HBM2e) | HPE Slingshot (200 Gb/s) |

Benchmark: GPT-2, Batch Size 12, Block Size 1024, PyTorch 2.10.0

[Figures: GPT-2 on Frontier, 1-8 GPUs; GPT-2 on Frontier, 16-128 GPUs]

SDSC Cosmos (AMD CDNA3 APU)

| CPU | Memory | GPU | Interconnect |
|---|---|---|---|
| x86_64 (96 cores @ 3.7GHz) | 512 GB HBM3 unified | Integrated CDNA3 (128 GB HBM3/APU) | HPE Cray Slingshot-11 (200 Gb/s) |

Benchmark: GPT-2, Batch Size 32, Block Size 1024, PyTorch 2.10.0

[Figure: GPT-2 on Cosmos]

TACC Vista (NVIDIA GH200)

| CPU | Memory | GPU | Interconnect |
|---|---|---|---|
| NVIDIA Grace (72 cores @ 3.1GHz) | 116 GB DDR5 | NVIDIA H200 (1/Node, 96 GB HBM3) | Mellanox NDR (400 Gb/s) |

Benchmark: GPT-2, Batch Size 12, Block Size 512, Gradient Accumulation 256, PyTorch 2.10.0
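
The benchmark settings determine how many tokens each optimizer step processes. The short sketch below works through that arithmetic. It assumes the gradient accumulation count applies per GPU, which the line above does not state, and the function name is hypothetical.

```python
def tokens_per_step(batch_size, block_size, grad_accum_steps=1, num_gpus=1):
    """Tokens consumed per optimizer step across all GPUs."""
    return batch_size * block_size * grad_accum_steps * num_gpus


if __name__ == "__main__":
    # Vista settings listed above, on a single GPU.
    print(tokens_per_step(12, 512, grad_accum_steps=256))
    # Frontier settings (no gradient accumulation stated), on 8 GPUs.
    print(tokens_per_step(12, 1024, num_gpus=8))
```

If the accumulation count is instead a global value divided among GPUs, the per-step total stays fixed as GPUs are added rather than growing with `num_gpus`.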

[Figure: GPT-2 on Vista]

For more performance results, visit Here.

Citation

If you use this software in your research, please cite:

@software{osu_hidl,
  title = {OSU HPC-AI: High-Performance Inference and Deep Learning Stack},
  author = {Network-Based Computing Laboratory, The Ohio State University},
  url = {https://github.com/OSU-Nowlab/osu-hpc-ai},
  year = {2025},
  license = {Apache-2.0}
}

Contributing

We welcome contributions from the community! Please see our Contributing Guide for details on:

  • Development setup
  • Code style guidelines
  • Testing requirements
  • Pull request process

By contributing, you agree that your contributions will be licensed under the Apache License 2.0.

License

This project is licensed under the Apache License 2.0; see the LICENSE file.

Individual components may have additional licenses:

  • PyTorch: BSD-3-Clause (see frameworks/pytorch/LICENSE)
  • DeepSpeed: Apache-2.0 (see frameworks/deepspeed/LICENSE)
  • vLLM: Apache-2.0 (see frameworks/vllm/LICENSE)
  • SGLang: Apache-2.0 (see frameworks/sglang/LICENSE)

Support


NOWLAB | The Ohio State University | nowlab.cse.ohio-state.edu
