High-Performance Inference and Deep Learning Stack
This repository contains the NOWLAB high-performance deep learning stack, integrating CUDA-aware MPI libraries with state-of-the-art deep learning frameworks for distributed training and inference on HPC systems.
The OSU HPC-AI stack provides:
- CUDA-aware MPI: MVAPICH-Plus for optimized GPU-to-GPU communication
- Deep Learning Frameworks: Custom builds of PyTorch, DeepSpeed, vLLM, and SGLang with MPI integration
- Serving Methods: MAC-Attention for reuse-based decode acceleration
- Communication Runtime: MCR-DL for modular, backend-agnostic distributed communication
- Optimized Collective Operations: GPUDirect RDMA support for high-performance distributed training
- HPC-Ready: Pre-configured for common HPC systems (MRI, Vista, etc.)
osu-hpc-ai/
├── frameworks/ # Deep learning framework submodules
│ ├── pytorch/ # PyTorch 2.10.0 with MVAPICH-Plus
│ ├── deepspeed/ # DeepSpeed with ZeRO optimization
│ ├── sglang/ # SGLang structured generation
│ ├── vllm/ # vLLM high-throughput inference
│ └── mcr-dl/ # MCR-DL communication runtime
├── methods/ # Reusable method implementations
│ └── mac-attention/ # MAC-Attention source package and kernels
├── build_scripts/ # Automated build scripts
│ └── mri/ # MRI cluster-specific builds
├── docs/ # Documentation
│ ├── build-guides/ # Framework build instructions
│ └── user-guides/ # Usage and method documentation
├── tests/ # Installation verification tests
│ ├── pytorch/
│ ├── deepspeed/
│ ├── sglang/
│ ├── vllm/
│ └── mac-attention/
├── examples/ # Usage examples
│ ├── deepspeed/ # DeepSpeed training examples
│ ├── megatron-lm/ # Megatron-LM dummy training example
│ ├── inference/ # vLLM inference examples
│ └── mac-attention/ # MAC-Attention workflow example
├── benchmarks/ # Performance benchmarks
│ ├── communication/ # MPI communication benchmarks
│ └── mac-attention/ # MAC-Attention performance benchmarks
├── recipe/ # End-to-end workflows
├── hpc_ai/ # Python utilities package
├── setup_runtime_env.sh # Runtime environment setup
└── README.md
- System: HPC cluster with NVIDIA GPUs (Compute Capability ≥ 6.0)
- CUDA: Version 11.0+ (12.x recommended)
- GCC: Version 9.0+ (13.3.0 recommended)
- Python: Version 3.8-3.12
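The version floors above can be checked before starting a long build. The following is a minimal sketch, not part of the repository: the helper names (`parse_version`, `python_ok`, `cuda_ok`, `gcc_ok`) are hypothetical, and the thresholds are simply copied from the list above.

```python
import re
import sys

# Thresholds copied from the prerequisites list; helper names are hypothetical.
PYTHON_MIN, PYTHON_MAX = (3, 8), (3, 12)
CUDA_MIN = (11, 0)
GCC_MIN = (9, 0)


def parse_version(text):
    """Return the first 'major.minor' pair found in text as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version found in {text!r}")
    return int(match.group(1)), int(match.group(2))


def python_ok(version=None):
    """True if major.minor falls within 3.8-3.12 inclusive."""
    major_minor = tuple((version or sys.version_info)[:2])
    return PYTHON_MIN <= major_minor <= PYTHON_MAX


def cuda_ok(version_text):
    """True if the CUDA version in version_text is at least 11.0."""
    return parse_version(version_text) >= CUDA_MIN


def gcc_ok(version_text):
    """True if the GCC version in version_text is at least 9.0."""
    return parse_version(version_text) >= GCC_MIN


if __name__ == "__main__":
    print("Python OK:", python_ok())
```

The version strings would typically come from `nvcc --version` and `gcc -dumpfullversion`; the GPU compute capability is not checked here.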
We provide pre-built wheels for common configurations:
# Visit https://hpc-ai.engineering.osu.edu for download instructions
# Registration required for pre-built wheel access
# pip install <wheel-url-provided-after-registration>

# Clone the repository
git clone --recursive https://github.com/OSU-Nowlab/osu-hpc-ai.git
cd osu-hpc-ai
# Build PyTorch with MPI support
cd frameworks/pytorch
export USE_CUDA=1 USE_MPI=1
MAX_JOBS=4 python setup.py install

For detailed build instructions and system-specific guides, see:
- PyTorch Build Guide
- DeepSpeed Build Guide
- SGLang Build Guide
- vLLM Build Guide
- MAC-Attention Build Guide
- Megatron-LM Build Guide
- verl Build Guide
For distributed training and inference with MPI:
- PyTorch DDP with MPI Guide
- DeepSpeed ZeRO Guide
- vLLM Serving Guide
- SGLang Structured Output Guide
- MCR-DL Backend Guide
- MAC-Attention User Guide

For troubleshooting and cluster-specific setup:
- Troubleshooting Guide
- MRI Cluster Setup
- TACC Vista Setup
- OLCF Frontier Setup
- SDSC Cosmos Setup
High-performance CUDA-aware MPI library with GPUDirect RDMA support.
- Version: 4.1+
- Features: CUDA/ROCm support, UCX backend, GPUDirect RDMA
- Documentation: MVAPICH User Guide
NOWLAB fork with CUDA-aware MPI and FP16 communication.
- Repository: OSU-Nowlab/pytorch
- Documentation: Build Guide
- Pre-built Wheels: Available for registered users at hpc-ai.engineering.osu.edu
- Features: MPI backend, GPUDirect RDMA
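When processes are launched by an MPI launcher, each rank usually has to select its own GPU before creating CUDA tensors. The sketch below shows one way to derive a device index from launcher environment variables. The variable list is an assumption (which variable is set depends on the launcher and its version), and the helper names `local_rank` and `device_index` are hypothetical, not part of this stack.

```python
import os

# Candidate variables, checked in order. This list is an assumption:
# confirm which one your launcher actually exports.
LOCAL_RANK_VARS = (
    "MV2_COMM_WORLD_LOCAL_RANK",   # MVAPICH2-style launchers
    "OMPI_COMM_WORLD_LOCAL_RANK",  # Open MPI
    "SLURM_LOCALID",               # Slurm srun
    "LOCAL_RANK",                  # torchrun
)


def local_rank(env=None):
    """Return this process's node-local rank, or 0 if no variable is set."""
    env = os.environ if env is None else env
    for name in LOCAL_RANK_VARS:
        value = env.get(name)
        if value is not None:
            return int(value)
    return 0


def device_index(rank, gpus_per_node):
    """Map a node-local rank onto a GPU index, wrapping if oversubscribed."""
    if gpus_per_node < 1:
        raise ValueError("gpus_per_node must be at least 1")
    return rank % gpus_per_node
```

In a training script, the result would typically be passed to `torch.cuda.set_device(...)` with `torch.cuda.device_count()` as `gpus_per_node`, before the model is moved to the GPU.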
Quick Example:
import torch
import torch.distributed as dist
# Initialize with MPI backend
dist.init_process_group(backend='mpi')
# Distributed training
model = torch.nn.parallel.DistributedDataParallel(model)

Microsoft DeepSpeed with MVAPICH integration for ZeRO optimization.
- Repository: OSU-Nowlab/DeepSpeed
- Features: ZeRO stages, MPI communication backend
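The ZeRO stage is selected in the DeepSpeed JSON configuration file. As a minimal sketch, the helper below generates such a file using standard DeepSpeed configuration keys. The function names and default values are illustrative assumptions, not the configuration shipped with this repository.

```python
import json


def make_zero_config(stage=3, micro_batch_per_gpu=1, grad_accum_steps=1, fp16=True):
    """Build a minimal DeepSpeed config dict for the given ZeRO stage."""
    if stage not in (0, 1, 2, 3):
        raise ValueError("ZeRO stage must be 0, 1, 2, or 3")
    return {
        "train_micro_batch_size_per_gpu": micro_batch_per_gpu,
        "gradient_accumulation_steps": grad_accum_steps,
        "zero_optimization": {"stage": stage},
        "fp16": {"enabled": fp16},
    }


def write_config(path, config):
    """Write the config as indented JSON, e.g. to ds_config.json."""
    with open(path, "w") as handle:
        json.dump(config, handle, indent=2)
        handle.write("\n")


if __name__ == "__main__":
    write_config("ds_config.json", make_zero_config(stage=3))
```

Stage 1 partitions optimizer state, stage 2 adds gradients, and stage 3 adds the parameters themselves, so higher stages trade extra communication for lower per-GPU memory.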
High-throughput LLM inference engine.
- Repository: OSU-Nowlab/vllm
- Features: PagedAttention, distributed inference
Structured generation language serving framework.
- Repository: OSU-Nowlab/sglang
- Features: RadixAttention, multi-GPU serving
Reuse-based decode acceleration method for long-context inference.
- Type: Method, not a serving framework
- Documentation: Build Guide, User Guide
- Assets: Example, Benchmarks, Tests
- Current Status: Imported method package and validation assets. Framework-level serving-path integration is still a separate step.
Experimental reinforcement learning validation path for running verl with Megatron-LM as the trainer and vLLM as the rollout engine on MRI.
- Megatron-LM Documentation: Build Guide, Dummy Example
- Current Status: Installation scripts, smoke tests, and a gated dummy recipe are present. The full dummy run still needs MRI GPU validation before production use.
See our comprehensive guides:
- Installation Guide - Complete installation instructions
- Framework Build Guides - See docs/build-guides/ for DeepSpeed and SGLang
Tests verify that operations produce correct results within tolerance:
# Run PyTorch MPI correctness test
cd tests/pytorch
mpirun -n 2 python test_mpi_comm.py

See tests/README.md for more test suites.
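To illustrate what "correct within tolerance" can mean for a collective, here is a minimal sketch of an all_reduce check. It assumes a hypothetical test in which rank `r` contributes a buffer filled with `r + 1`. The function names and tolerances are illustrative, not taken from `test_mpi_comm.py`.

```python
import math


def expected_all_reduce_sum(world_size, num_elements):
    """Expected result when rank r contributes a buffer filled with r + 1."""
    total = world_size * (world_size + 1) / 2
    return [total] * num_elements


def within_tolerance(actual, expected, rel_tol=1e-5, abs_tol=1e-8):
    """True if both sequences have equal length and match element-wise."""
    if len(actual) != len(expected):
        return False
    return all(
        math.isclose(a, e, rel_tol=rel_tol, abs_tol=abs_tol)
        for a, e in zip(actual, expected)
    )
```

A tolerance is needed because floating-point reductions, especially in FP16, may sum in a different order on each run and so differ in the last bits.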
Benchmarks measure performance to detect regressions between releases:
# Run communication benchmarks
cd benchmarks/communication
mpirun -n 4 python run_all.py --scan

Individual operations can be benchmarked:
# Benchmark all_reduce with message size scanning
mpirun -n 4 python all_reduce.py --scan
# Benchmark specific operations
mpirun -n 4 python run_all.py --scan --all-reduce --broadcast

See benchmarks/README.md for more benchmarking options.
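As a rough illustration of how a message-size scan can be turned into bandwidth numbers, the sketch below generates power-of-two sizes and converts a measured all_reduce time into algorithm and bus bandwidth. It uses the common ring-style convention of scaling by 2(n-1)/n. The function names and the size range are assumptions and may not match what `--scan` does in these scripts.

```python
def message_sizes(min_bytes=4, max_bytes=1 << 20):
    """Power-of-two message sizes from min_bytes up to max_bytes inclusive."""
    if min_bytes < 1:
        raise ValueError("min_bytes must be at least 1")
    sizes = []
    size = min_bytes
    while size <= max_bytes:
        sizes.append(size)
        size *= 2
    return sizes


def all_reduce_bandwidth_gbps(num_bytes, seconds, world_size):
    """Return (algorithm bandwidth, bus bandwidth) in GB/s for one all_reduce."""
    if seconds <= 0 or world_size < 1:
        raise ValueError("seconds must be positive and world_size at least 1")
    algorithm_bw = num_bytes / seconds / 1e9
    # Each byte crosses the links about 2 * (n - 1) / n times in a ring all_reduce.
    bus_bw = algorithm_bw * 2 * (world_size - 1) / world_size
    return algorithm_bw, bus_bw
```

Bus bandwidth is the more useful figure when comparing runs with different process counts, since it normalizes for the extra traffic that larger groups generate.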
# examples/pytorch/distributed_training.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
# Initialize MPI backend
dist.init_process_group(backend='mpi')
# Create model (MyModel, dataloader, criterion, and optimizer are user-defined placeholders)
model = MyModel().cuda()
model = DDP(model)
# Training loop
for inputs, labels in dataloader:
    optimizer.zero_grad()
    outputs = model(inputs.cuda())
    loss = criterion(outputs, labels.cuda())
    loss.backward()
    optimizer.step()

Run with:
mpiexec -n 4 python examples/pytorch/distributed_training.py

# ZeRO Stage 3 training
deepspeed --num_gpus=8 examples/deepspeed/train_gpt.py \
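    --deepspeed_config=ds_config.json

from vllm import LLM, SamplingParams
# Example inputs; the prompt text and sampling values are illustrative
prompts = ["Explain GPUDirect RDMA in one sentence."]
params = SamplingParams(temperature=0.8, max_tokens=64)
llm = LLM(model="meta-llama/Llama-2-7b-hf", tensor_parallel_size=4)
outputs = llm.generate(prompts, sampling_params=params)

See examples/ for more examples.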
We officially support and test on:
| System | Architecture | GPUs | CUDA | Status |
|---|---|---|---|---|
| MRI Cluster | x86_64 | A100 | 12.6 | Supported |
| TACC Vista | ARM64 | GH200 | 12.5 | Supported |
| Generic HPC | x86_64 | V100+ | 11.0+ | Community |
GPT-2 training performance using NanoGPT benchmark on leading HPC systems. MVAPICH-Plus provides competitive or superior performance compared to vendor-optimized communication libraries.
| CPU | Memory | GPU | Interconnect |
|---|---|---|---|
| AMD EPYC 7A53 (64 cores @ 2GHz) | 512 GB DDR4 | AMD MI250X (4/Node, 128 GB HBM2e) | HPE Slingshot (200 Gb/s) |
Benchmark: GPT-2, Batch Size 12, Block Size 1024, PyTorch 2.10.0
| CPU | Memory | GPU | Interconnect |
|---|---|---|---|
| x86_64 (96 cores @ 3.7GHz) | 512 GB HBM3 unified | Integrated CDNA3 (128 GB HBM3/APU) | HPE Cray Slingshot-11 (200 Gb/s) |
Benchmark: GPT-2, Batch Size 32, Block Size 1024, PyTorch 2.10.0
| CPU | Memory | GPU | Interconnect |
|---|---|---|---|
| NVIDIA Grace (72 cores @ 3.1GHz) | 116 GB DDR5 | NVIDIA H200 (1/Node, 96 GB HBM3) | Mellanox NDR (400 Gb/s) |
Benchmark: GPT-2, Batch Size 12, Block Size 512, Gradient Accumulation 256, PyTorch 2.10.0
For more performance results, visit Here.
If you use this software in your research, please cite:
@software{osu_hidl,
title = {OSU HPC-AI: High-Performance Inference and Deep Learning Stack},
author = {Network-Based Computing Laboratory, The Ohio State University},
url = {https://github.com/OSU-Nowlab/osu-hpc-ai},
year = {2025},
license = {Apache-2.0}
}

We welcome contributions from the community! Please see our Contributing Guide for details on:
- Development setup
- Code style guidelines
- Testing requirements
- Pull request process
By contributing, you agree that your contributions will be licensed under the Apache License 2.0.
This project is licensed under the Apache License 2.0; see the LICENSE file.
Individual components may have additional licenses:
- PyTorch: BSD-3-Clause (see frameworks/pytorch/LICENSE)
- DeepSpeed: Apache-2.0 (see frameworks/deepspeed/LICENSE)
- vLLM: Apache-2.0 (see frameworks/vllm/LICENSE)
- SGLang: Apache-2.0 (see frameworks/sglang/LICENSE)
- Website: https://hpc-ai.engineering.osu.edu
- GitHub Issues: Report bugs and request features
- Contact: Visit the HPC-AI website for support and contact information
NOWLAB | The Ohio State University | nowlab.cse.ohio-state.edu