feat(sglang): add support for sglang server #1267
Signed-off-by: Lin-xs <1833080950@qq.com>
Pull request overview
This PR introduces an opt-in “server mode” SGLang backend for RLinf, where rollout/eval callers can talk to a pool of external SGLang HTTP servers via an sglang-router (instead of using the in-process engine).
Changes:
- Added Ray workers to launch/own SGLang HTTP server subprocesses and an `sglang-router` subprocess, plus a driver-side orchestrator to place and wire them together.
- Added a small sync/async HTTP client for `/generate`, `/v1/chat/completions`, and `/health`.
- Added a runnable demo (Python + YAML + shell script) and added `sglang-router` to the `agentic-sglang` optional dependency group.
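
To make the client bullet above concrete, here is a minimal standard-library sketch of a caller checking `/health` and posting to `/generate` on a router URL. It is not the PR's `InferenceHTTPClient`: the helper names `is_healthy` and `generate` are hypothetical, the request body assumes SGLang's native `text` plus `sampling_params` shape, and `_StubRouter` is a local stand-in that only echoes the prompt so the example runs without a GPU.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Ignore any configured HTTP proxy so the local demo always talks to 127.0.0.1.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


class _StubRouter(BaseHTTPRequestHandler):
    """Local stand-in for a router (assumption: not real SGLang behaviour)."""

    def do_GET(self):
        self.send_response(200 if self.path == "/health" else 404)
        self.end_headers()

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length))
        reply = json.dumps({"text": "echo: " + body["text"]}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(reply)))
        self.end_headers()
        self.wfile.write(reply)

    def log_message(self, *args):
        pass  # keep the demo quiet


def is_healthy(base_url: str, timeout: float = 5.0) -> bool:
    """Return True only if GET {base_url}/health answers with status 200."""
    try:
        with _opener.open(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, HTTPError, and refused connections
        return False


def generate(base_url: str, prompt: str, max_new_tokens: int = 16) -> dict:
    """POST a prompt to {base_url}/generate and return the decoded JSON reply."""
    payload = {"text": prompt, "sampling_params": {"max_new_tokens": max_new_tokens}}
    req = urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with _opener.open(req, timeout=30.0) as resp:
        return json.loads(resp.read())


# Start the stub on an OS-assigned port, call it, then tear it down.
server = HTTPServer(("127.0.0.1", 0), _StubRouter)
threading.Thread(target=server.serve_forever, daemon=True).start()
base_url = f"http://127.0.0.1:{server.server_port}"
healthy = is_healthy(base_url)
result = generate(base_url, "hello")
server.shutdown()
server.server_close()
```

Against a real deployment, `base_url` would be the router URL, and the same two calls would work unchanged against a single server.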
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| `rlinf/hybrid_engines/sglang/server/server_launcher.py` | New `SGLangServerWorker` that spawns an SGLang HTTP server subprocess and waits for `/health`. |
| `rlinf/hybrid_engines/sglang/server/router_launcher.py` | New `SGLangRouterWorker` that spawns `sglang_router.launch_router` and supports dynamic server registration. |
| `rlinf/hybrid_engines/sglang/server/launcher.py` | Driver helper to build placement, launch server/router groups, and register servers. |
| `rlinf/hybrid_engines/sglang/server/http_client.py` | New `InferenceHTTPClient` with sync + async APIs for router/server HTTP endpoints. |
| `rlinf/hybrid_engines/sglang/server/__init__.py` | Exports new server-mode utilities. |
| `pyproject.toml` | Adds `sglang-router` under `agentic-sglang` extras. |
| `examples/reasoning/sglang_server_demo.py` | Demo script exercising sync/async generate and chat completions via the router. |
| `examples/reasoning/run_sglang_server_demo.sh` | Demo launch script. |
| `examples/reasoning/config/sglang_server_demo.yaml` | Demo configuration for server/router blocks and placement. |

    def _wait_for_http_health(host: str, port: int, timeout: float = 300.0) -> None:
        """Block until ``GET http://host:port/health`` returns 200, or raise."""
        deadline = time.perf_counter() + timeout
        url = f"http://{host}:{port}/health"
        last_err: Optional[BaseException] = None
        while time.perf_counter() < deadline:
            try:
                resp = requests.get(url, timeout=5)
                if resp.status_code == 200:
                    return
            except requests.exceptions.RequestException as e:
                last_err = e
            time.sleep(1.0)
        raise RuntimeError(
            f"sglang server at {url} did not become healthy within {timeout:.0f}s "
            f"(last error: {last_err!r})."
        )

    self._advertise_host = ray.util.get_node_ip_address()

    _wait_for_http_health(self._advertise_host, http_port)
    self.log_info(f"sglang server ready at {self.get_server_url()}")
    return self.get_server_url()

    router_cfg = (
        OmegaConf.to_container(self._router_cfg, resolve=True)
        if self._router_cfg is not None
        else {}
    ) or {}

    port = int(self._bind_port or self.acquire_free_port())
    self._port = port

    def __init__(
        self,
        base_url: str,
        connect_timeout: float = 10.0,
        max_connections: int = 1024 * 16,
    ):
        self.base_url = base_url.rstrip("/")
        self.connect_timeout = connect_timeout
        self.max_connections = max_connections

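
The constructor above defaults `max_connections` to `1024 * 16`, and the PR description notes that `ulimit -n` has to be raised accordingly for large fan-outs. The following is a small sketch, using only the Unix `resource` module, of how a caller might check that budget before opening that many sockets. The helper names and the `reserved` headroom of 256 descriptors are assumptions for illustration and are not part of the PR.

```python
import resource  # standard library, Unix-only


def fd_headroom(soft_limit: int, max_connections: int, reserved: int = 256) -> int:
    """Descriptors left after budgeting one per connection plus `reserved`.

    A negative result means the soft limit is too low for this fan-out.
    """
    return soft_limit - (max_connections + reserved)


def raise_soft_limit(target: int) -> int:
    """Raise the soft RLIMIT_NOFILE toward `target`, capped by the hard limit.

    Returns the soft limit in effect afterwards; never lowers it.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    if soft != resource.RLIM_INFINITY and soft < target:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
        soft = target
    return soft


# With a soft limit of 1024 (a common Linux default), 16384 connections cannot fit.
shortfall = fd_headroom(soft_limit=1024, max_connections=1024 * 16)
# With a soft limit of 65536 there is room to spare.
spare = fd_headroom(soft_limit=65536, max_connections=1024 * 16)
```

Calling `raise_soft_limit(1024 * 16 + 256)` at process start is one way to apply the result; whether it succeeds depends on the hard limit configured for the process.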
    agentic-sglang = [
        "sglang[all]==0.4.6.post5",
        "sglang-router",
        "torch-memory-saver",
        "numpy==2.2",
        "transformers==4.51.1",

    #! /bin/bash
    set -x

    tabs 4
    export CUDA_DEVICE_MAX_CONNECTIONS=1
    export RAY_DEDUP_LOGS=0

        return args


    class SGLangRouterWorker(Worker):

Should we move this to workers and name the file xxx worker?

        placement_strategy=router_placement,
    )

    router_handle = router_group.init_router() if router_group is not None else None

Could we make the startup path transactional and clean up partially-started resources on failure?
There are two related failure paths here:
- At the orchestrator level, if `server_handle.wait()`, `router_handle.wait()`, `get_server_url()`, or `register_server()` raises, the already-launched server/router Ray groups are never returned to the caller, so nothing shuts them down and their child processes can keep running.
- At the router worker level, `SGLangRouterWorker.init_router()` calls `subprocess.Popen()` and then waits for `/health`. If the process stays alive but never becomes healthy, `_wait_for_router_health()` raises while the router subprocess may still be holding the selected port.

For SGLang this is especially risky because failed startup can leave GPU-serving processes and router ports behind, causing OOM or port conflicts on the next run.
Could we wrap the launch/init/register sequence in try/except and best-effort shut down router then server before re-raising, and also make `init_router()` call `self.shutdown()`/reset state when its health wait fails?
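
One way to get the rollback requested above is `contextlib.ExitStack`, which runs registered callbacks in reverse order when the block exits and can be disarmed with `pop_all()` on success. The sketch below uses hypothetical stand-ins (`FakeGroup`, `launch_transactionally`, and the three callables) rather than RLinf's worker groups, so it only illustrates the control flow, not the PR's actual fix.

```python
from contextlib import ExitStack


class FakeGroup:
    """Hypothetical stand-in for a launched worker group exposing shutdown()."""

    def __init__(self, name, log):
        self.name, self.log = name, log

    def shutdown(self):
        self.log.append(f"shutdown {self.name}")


def launch_transactionally(start_server, start_router, register):
    """Start server, then router, then register; undo everything on failure."""
    with ExitStack() as cleanup:
        server = start_server()
        cleanup.callback(server.shutdown)
        router = start_router()
        cleanup.callback(router.shutdown)  # LIFO: runs before server.shutdown
        register(router, server)
        cleanup.pop_all()  # success: disarm the callbacks, caller owns both
        return server, router


log = []


def failing_register(router, server):
    raise RuntimeError("register_server failed")


try:
    launch_transactionally(
        lambda: FakeGroup("server", log),
        lambda: FakeGroup("router", log),
        failing_register,
    )
except RuntimeError as exc:
    failure = str(exc)  # the original error still reaches the caller
```

The same shape could apply inside a worker: register the subprocess's termination as a callback right after `subprocess.Popen()`, and call `pop_all()` only once the health wait has passed.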

    rollout_placement = PackedPlacementStrategy(
        start_hardware_rank=ranks[0],
        end_hardware_rank=ranks[-1],
        num_hardware_per_process=num_accelerators_per_engine,
    )

    server_group = SGLangServerWorker.create_group(
        config=config,
        sglang_cfg=rollout_cfg.server,
    ).launch(
        cluster=cluster,
        name=rollout_cfg.group_name,
        placement_strategy=rollout_placement,
    )

Could we avoid rebuilding placement from only the flat hardware-rank list here?
The caller likely already has a parsed RLinf placement strategy from ComponentPlacement. Reconstructing a new PackedPlacementStrategy from rollout_hardware_ranks loses the original placement semantics, especially node_group / heterogeneous cluster placement / flexible mappings. It also assumes the ranks are contiguous.
For example, if the original component placement targets a non-default node group, this helper currently creates a new PackedPlacementStrategy without passing that node group, so the server group can be scheduled against the default group instead of the configured one.
Could this helper accept the caller-provided placement strategy directly, or preserve the original node group and process-to-resource mapping when repacking?
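
The contiguity concern can be shown with plain lists. The sketch below does not call `PackedPlacementStrategy`; it assumes, from the parameter names in the diff, that `start_hardware_rank` and `end_hardware_rank` describe an inclusive range, and the helper names are hypothetical.

```python
def range_span(ranks):
    """Ranks covered by an inclusive [ranks[0], ranks[-1]] range."""
    return list(range(ranks[0], ranks[-1] + 1))


def split_into_engines(ranks, per_engine):
    """Group the given ranks, in order, into engines of `per_engine` ranks each."""
    if per_engine <= 0 or len(ranks) % per_engine != 0:
        raise ValueError("rank count must be a positive multiple of per_engine")
    return [ranks[i:i + per_engine] for i in range(0, len(ranks), per_engine)]


# A non-contiguous rollout placement: two devices on each of two nodes.
ranks = [0, 1, 4, 5]

# Rebuilding from only the first and last rank also claims ranks 2 and 3.
unintended = sorted(set(range_span(ranks)) - set(ranks))

# Grouping the explicit list keeps each engine on the ranks that were configured.
engines = split_into_engines(ranks, per_engine=2)
```

Here `unintended` is `[2, 3]` and `engines` is `[[0, 1], [4, 5]]`, which is why passing the caller's original placement through, rather than only its end points, preserves the mapping.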
Description
Adds a server-mode SGLang backend to RLinf so rollouts can talk to one or more SGLang HTTP engines through an `sglang-router` instead of using the in-process SGLang engine.

New module `rlinf/hybrid_engines/sglang/server/`:
- `SGLangServerWorker`: Ray worker that spawns an SGLang HTTP server per process group (sized by `tensor_parallel_size * pipeline_parallel_size`), waits on `/health`, and reports its URL.
- `SGLangRouterWorker`: single Ray worker that runs an `sglang-router` subprocess, dynamically registers/unregisters server URLs, and exposes router URL + health + a thin `generate` passthrough.
- `InferenceHTTPClient`: sync/async client over `requests`/`aiohttp` for `/generate`, `/v1/chat/completions`, and `/health` against either a router or a single server.
- `launch_sglang_router_and_server(config, cluster, rollout_hardware_ranks, ...)`: one-call orchestrator that builds a `PackedPlacementStrategy` from the rollout hardware ranks, launches the server group, brings up the router, and registers each server with the router from the driver.

Demo example wired up:
- `examples/reasoning/sglang_server_demo.py`: exercises sync/async `/generate` and `/v1/chat/completions` against the router for Qwen2.5-VL-3B.
- `examples/reasoning/config/sglang_server_demo.yaml`: `rollout.server` / `rollout.router` config block, with `launch_server` / `launch_router` toggles and `group_name` / `router_group_name`.
- `examples/reasoning/run_sglang_server_demo.sh`: launch script.

Dependency: `sglang-router` is added to the `agentic-sglang` optional dependency group in `pyproject.toml`.

Motivation and Context
The existing SGLang integration only supports the in-process engine, which couples rollout workers to the engine lifecycle and makes it hard to (1) share a pool of SGLang engines across multiple consumers, (2) scale the engine pool independently of the trainer, and (3) speak the standard OpenAI / SGLang HTTP protocol from agentic / tool-using code.
This change introduces a "server mode" path: each engine runs as a long-lived HTTP server, an `sglang-router` fronts the pool with cache-aware routing, and callers (RL rollouts, eval harnesses, agentic demos) talk to a single stable router URL. It is fully opt-in; existing configs are unaffected.

How has this been tested?
`bash examples/reasoning/run_sglang_server_demo.sh` on a 2-GPU node with `Qwen2.5-VL-3B` (TP=2, PP=1):
- `router_group.get_router_url()` returned a reachable URL.
- `/generate` and `/v1/chat/completions` paths all completed end-to-end.
- `shutdown().wait()`.

Additional information (optional, e.g., figures and logs):
- `router_node_rank` in `launch_sglang_router_and_server`.
- `aiohttp` defaults to `limit=100` connections; for large fan-outs bump the client's `max_connections` and `ulimit -n` accordingly (documented in the `InferenceHTTPClient` docstring).

Types of changes
Checklist: