GENIE (GPUs Eliminated for Network Infrastructure Examination) is a framework that allows users to test real network infrastructure for distributed ML, without requiring expensive GPUs.
Genie is deployed as a set of processes on a CPU-only cluster whose nodes are connected by real network infrastructure (RoCE, InfiniBand, Slingshot, etc.). The key idea is for each Genie process to replicate the behavior of a GPU and to inject real traffic into the network from the CPU. Genie uses ASTRA-sim to replicate the AI/ML workload and to capture the interaction between the workload and various network effects.
Some questions that Genie is expected to help answer:
- How can I validate that my network infrastructure has no faults (failing HW, SW misconfiguration, etc.) before attaching expensive GPUs and finding out the hard way?
- Which kinds of AI/ML workloads are more sensitive to specific network choices? (Design space exploration in the network space)
- Which kinds of collective algorithms create more congestion in the network?
The ASTRA-sim component in each Genie process takes as input the ML workload information of a single rank, provided as a Chakra execution trace (ET) graph.
The default behavior of ASTRA-sim is to iterate through the workload graph and simulate the duration of each (compute or communication) operator, which it uses to build an end-to-end timeline. For collective communications, ASTRA-sim uses its internal collective library to break each collective down into point-to-point send and receive messages, which it feeds into a network simulator such as ns-3 or htsim.
Genie extends ASTRA-sim so that the point-to-point messages that ASTRA-sim generates are translated into real traffic, for example by calling a transport library such as libibverbs. The Genie processes synchronize through the real network traffic itself, so no separate synchronization mechanism is needed.
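The two ideas in the paragraph above can be illustrated with a minimal sketch. It is not ASTRA-sim's code: `P2PMessage`, `ring_all_reduce_messages`, and `build_timeline` are hypothetical names, the ring algorithm is only one of several ways a collective can be decomposed, and the operator durations passed to `build_timeline` are made-up inputs rather than simulated values.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class P2PMessage:
    step: int        # position in the collective's schedule
    kind: str        # "send" or "recv"
    peer: int        # rank on the other end of the message
    chunk: int       # which slice of the buffer is moved
    size_bytes: int


def ring_all_reduce_messages(rank: int, world_size: int, total_bytes: int) -> List[P2PMessage]:
    """List the sends and receives one rank issues for a ring all-reduce."""
    if world_size < 2:
        return []
    chunk_bytes = total_bytes // world_size  # assumes an evenly divisible buffer
    right, left = (rank + 1) % world_size, (rank - 1) % world_size
    messages = []
    # N-1 reduce-scatter steps followed by N-1 all-gather steps.
    for step in range(2 * (world_size - 1)):
        messages.append(P2PMessage(step, "send", right, (rank - step) % world_size, chunk_bytes))
        messages.append(P2PMessage(step, "recv", left, (rank - step - 1) % world_size, chunk_bytes))
    return messages


def build_timeline(ops: List[Tuple[str, float, List[str]]]) -> Dict[str, Tuple[float, float]]:
    """Walk operators in dependency order; each starts when its last dependency ends."""
    timeline: Dict[str, Tuple[float, float]] = {}
    for name, duration_us, deps in ops:  # ops must already be topologically sorted
        start = max((timeline[d][1] for d in deps), default=0.0)
        timeline[name] = (start, start + duration_us)
    return timeline
```

For example, `ring_all_reduce_messages(0, 4, 4096)` lists the 12 messages (6 sends and 6 receives of 1024 bytes each) that rank 0 would issue in a 4-rank ring. In a simulator these messages would go to a network model; the timeline function only shows how per-operator durations chain into an end-to-end schedule.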
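The following sketch illustrates why real traffic can double as the synchronization mechanism. It is a stand-in, not Genie's implementation: it uses a local `socket.socketpair()` and two threads in place of separate processes and a transport library such as libibverbs, and `recv_exact`, `run_rank`, and `demo` are hypothetical names. The point is only that a blocking receive cannot complete until the peer's bytes have actually arrived, so each rank's progress is paced by the traffic itself.

```python
import socket
import threading
from typing import List, Tuple

Schedule = List[Tuple[str, int]]  # ("send" | "recv", size in bytes)


def recv_exact(sock: socket.socket, nbytes: int) -> None:
    """Block until exactly nbytes have arrived from the peer."""
    remaining = nbytes
    while remaining > 0:
        data = sock.recv(min(remaining, 65536))
        if not data:
            raise ConnectionError("peer closed the connection early")
        remaining -= len(data)


def run_rank(sock: socket.socket, schedule: Schedule, log: Schedule) -> None:
    """Replay one rank's point-to-point schedule as real bytes on a socket."""
    for kind, size in schedule:
        if kind == "send":
            sock.sendall(bytes(size))  # payload content is irrelevant, only its size
        else:
            recv_exact(sock, size)     # implicit synchronization point
        log.append((kind, size))


def demo() -> Tuple[Schedule, Schedule]:
    """Run two toy ranks that exchange one 4 KiB message in each direction."""
    sock0, sock1 = socket.socketpair()
    log0: Schedule = []
    log1: Schedule = []
    sched0 = [("send", 4096), ("recv", 4096)]
    sched1 = [("recv", 4096), ("send", 4096)]
    threads = [
        threading.Thread(target=run_rank, args=(sock0, sched0, log0)),
        threading.Thread(target=run_rank, args=(sock1, sched1, log1)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    sock0.close()
    sock1.close()
    return log0, log1
```

No barrier or coordinator appears anywhere in this sketch: the second rank cannot send its reply before the first rank's message has been delivered, which mirrors the property described above.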
For further reading, please refer to our arXiv preprint, available at https://arxiv.org/abs/2504.20854.
When citing this work, please use the following BibTeX entry:
@misc{genie2026,
title={Towards Easy and Realistic Network Infrastructure Testing for Large-scale Machine Learning},
author={Jinsun Yoo and ChonLam Lao and Lianjie Cao and Bob Lantz and Minlan Yu and Tushar Krishna and Puneet Sharma},
year={2025},
eprint={2504.20854},
archivePrefix={arXiv},
primaryClass={cs.NI},
url={https://arxiv.org/abs/2504.20854},
}