AI Infrastructure Engineer
Building the substrate for AI systems that train, reason, and improve themselves — from GPU clusters to agentic RL pipelines.
Large-scale LLM Training Infrastructure
Led end-to-end design and implementation of cloud-native AI training infrastructure, built from the ground up to support large-scale distributed training across heterogeneous GPU clusters (A100, H100, GB200). Integrated GPU scheduling, high-performance networking, and distributed training frameworks (PyTorch, Ray), achieving high cluster MFU (Model FLOPs Utilization). Drove system-level performance optimization across compute, networking, and storage layers, addressing bottlenecks in NCCL communication, GPU utilization, and I/O throughput in multi-node environments.
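For readers unfamiliar with the metric, MFU is the ratio of the FLOPs a training run actually sustains to the theoretical peak of the hardware it occupies. The snippet below is a minimal sketch, assuming the common rule of thumb of roughly 6 FLOPs per parameter per token for dense transformer training (forward plus backward, attention FLOPs ignored). The function name and the example numbers are hypothetical and do not describe the clusters above.

```python
def model_flops_utilization(
    n_params: float,
    tokens_per_second: float,
    n_gpus: int,
    peak_flops_per_gpu: float,
) -> float:
    """Estimate MFU as achieved training FLOPs/s over aggregate peak FLOPs/s.

    Assumes ~6 FLOPs per parameter per token (dense transformer,
    forward + backward), which ignores attention and recomputation.
    """
    if n_gpus <= 0 or peak_flops_per_gpu <= 0:
        raise ValueError("n_gpus and peak_flops_per_gpu must be positive")
    achieved_flops_per_second = 6.0 * n_params * tokens_per_second
    peak_flops_per_second = n_gpus * peak_flops_per_gpu
    return achieved_flops_per_second / peak_flops_per_second


# Hypothetical run: a 7B-parameter model on 64 GPUs, each with an assumed
# peak of 312 TFLOP/s, processing 200k tokens per second cluster-wide.
mfu = model_flops_utilization(
    n_params=7e9,
    tokens_per_second=200_000,
    n_gpus=64,
    peak_flops_per_gpu=312e12,
)
print(f"MFU = {mfu:.1%}")
```

The peak figure is passed in rather than hard-coded because it depends on the GPU model and numeric precision, so a heterogeneous fleet needs one value per cluster.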
GPU Cluster Scheduling & Kubernetes-Native AI Platforms
Architected a multi-cluster scheduling system spanning 5 GPU clusters, enabling cross-cluster workload orchestration, resource pooling, and improved global utilization. Serve as a reviewer and contributor to Volcano (CNCF), with contributions to gang scheduling, capacity plugin correctness, and Dynamic Resource Allocation (DRA) resource management. Designed an end-to-end ML platform on Kubernetes covering job lifecycle management, GPU affinity, multi-tenancy, autoscaling, and observability.
Agentic RL & Inference Infrastructure
Built infrastructure for agentic RL training and inference, integrating RL training frameworks (veRL, AReaL, NeMo-RL) and high-throughput inference engines (vLLM, SGLang) into production platforms. Built a pluggable OSWorld sandbox provider on the training platform, enabling closed-loop RL training pipelines for computer-use agents at scale.
| Layer | Technologies |
|---|---|
| Training Frameworks | PyTorch, Ray |
| RL Training | veRL, AReaL, NeMo-RL |
| Inference | vLLM, SGLang |
| Distributed | NCCL |
| Orchestration | Kubernetes, Volcano |
| Hardware | A100, H100, GB200 |
- Systems thinking first — AI infrastructure requires deep understanding of the full stack: hardware, networking, runtime, and model architecture.
- Measure before optimizing — bottlenecks in distributed training are rarely where intuition suggests. Profile first, optimize second.
- volcano-sh/volcano — Reviewer & Contributor. Core focus on GPU scheduling correctness and scalability for AI workloads.
- volcengine/verl · inclusionAI/AReaL · NVIDIA/NeMo-RL — Contributor. Agentic RL training infrastructure and inference backend stability.
Particularly drawn to automated research and self-improving agents — systems that can autonomously explore, experiment, and refine themselves. Interested in the infrastructure challenges these workloads introduce: long-horizon task execution, scalable sandbox environments, and tight feedback loops between inference and training.