Skip to content

feat(sglang): add support for sglang server#1267

Open
Lin-xs wants to merge 3 commits into
RLinf:mainfrom
Lin-xs:feat/sglang_router
Open

feat(sglang): add support for sglang server#1267
Lin-xs wants to merge 3 commits into
RLinf:mainfrom
Lin-xs:feat/sglang_router

Conversation

@Lin-xs

@Lin-xs Lin-xs commented Jun 11, 2026

Copy link
Copy Markdown
Collaborator

Description

Adds a server-mode SGLang backend to RLinf so rollouts can talk to one or more SGLang HTTP engines through an sglang-router instead of using the in-process SGLang engine.

New module rlinf/hybrid_engines/sglang/server/:

  • SGLangServerWorker — Ray worker that spawns an SGLang HTTP server per process group (sized by tensor_parallel_size * pipeline_parallel_size), waits on /health, and reports its URL.
  • SGLangRouterWorker — single Ray worker that runs an sglang-router subprocess, dynamically registers/unregisters server URLs, and exposes router URL + health + a thin generate passthrough.
  • InferenceHTTPClient — sync/async client over requests / aiohttp for /generate, /v1/chat/completions, and /health against either a router or a single server.
  • launch_sglang_router_and_server(config, cluster, rollout_hardware_ranks, ...) — one-call orchestrator that builds a PackedPlacementStrategy from the rollout hardware ranks, launches the server group, brings up the router, and registers each server with the router from the driver.

Demo example wired up:

  • examples/reasoning/sglang_server_demo.py — exercises sync/async /generate and /v1/chat/completions against the router for Qwen2.5-VL-3B.
  • examples/reasoning/config/sglang_server_demo.yamlrollout.server / rollout.router config block, with launch_server / launch_router toggles and group_name / router_group_name.
  • examples/reasoning/run_sglang_server_demo.sh — launch script.

Dependency: sglang-router is added to the agentic-sglang optional dependency group in pyproject.toml.

Motivation and Context

The existing SGLang integration only supports the in-process engine, which couples rollout workers to the engine lifecycle and makes it hard to (1) share a pool of SGLang engines across multiple consumers, (2) scale the engine pool independently of the trainer, and (3) speak the standard OpenAI / SGLang HTTP protocol from agentic / tool-using code.

This change introduces a "server mode" path: each engine runs as a long-lived HTTP server, an sglang-router fronts the pool with cache-aware routing, and callers (RL rollouts, eval harnesses, agentic demos) talk to a single stable router URL. It is fully opt-in — existing configs are unaffected.

How has this been tested?

  • Ran bash examples/reasoning/run_sglang_server_demo.sh on a 2-GPU node with Qwen2.5-VL-3B (TP=2, PP=1):
    • Server group came up, router registered the server, and router_group.get_router_url() returned a reachable URL.
    • Sync + async /generate and /v1/chat/completions paths all completed end-to-end.
    • Router + server groups shut down cleanly via shutdown().wait().
  • No existing rollout/training paths were modified, so existing reasoning / embodied configs continue to work unchanged.

Additional information (optional, e.g., figures and logs):

  • Router placement defaults to node rank 0 (head); override via router_node_rank in launch_sglang_router_and_server.
  • aiohttp defaults to limit=100 connections — for large fan-outs bump the client's max_connections and ulimit -n accordingly (documented in the InferenceHTTPClient docstring).
  • Server registration is serialized from the driver to keep worker ordering stable; can be parallelized from N workers if needed.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Documentation update (Document-only update)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

Signed-off-by: Lin-xs <1833080950@qq.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces an opt-in “server mode” SGLang backend for RLinf, where rollout/eval callers can talk to a pool of external SGLang HTTP servers via an sglang-router (instead of using the in-process engine).

Changes:

  • Added Ray workers to launch/own SGLang HTTP server subprocesses and an sglang-router subprocess, plus a driver-side orchestrator to place and wire them together.
  • Added a small sync/async HTTP client for /generate, /v1/chat/completions, and /health.
  • Added a runnable demo (Python + YAML + shell script) and added sglang-router to the agentic-sglang optional dependency group.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
rlinf/hybrid_engines/sglang/server/server_launcher.py New SGLangServerWorker that spawns an SGLang HTTP server subprocess and waits for /health.
rlinf/hybrid_engines/sglang/server/router_launcher.py New SGLangRouterWorker that spawns sglang_router.launch_router and supports dynamic server registration.
rlinf/hybrid_engines/sglang/server/launcher.py Driver helper to build placement, launch server/router groups, and register servers.
rlinf/hybrid_engines/sglang/server/http_client.py New InferenceHTTPClient with sync + async APIs for router/server HTTP endpoints.
rlinf/hybrid_engines/sglang/server/__init__.py Exports new server-mode utilities.
pyproject.toml Adds sglang-router under agentic-sglang extras.
examples/reasoning/sglang_server_demo.py Demo script exercising sync/async generate and chat completions via the router.
examples/reasoning/run_sglang_server_demo.sh Demo launch script.
examples/reasoning/config/sglang_server_demo.yaml Demo configuration for server/router blocks and placement.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +62 to +78
def _wait_for_http_health(host: str, port: int, timeout: float = 300.0) -> None:
"""Block until ``GET http://host:port/health`` returns 200, or raise."""
deadline = time.perf_counter() + timeout
url = f"http://{host}:{port}/health"
last_err: Optional[BaseException] = None
while time.perf_counter() < deadline:
try:
resp = requests.get(url, timeout=5)
if resp.status_code == 200:
return
except requests.exceptions.RequestException as e:
last_err = e
time.sleep(1.0)
raise RuntimeError(
f"sglang server at {url} did not become healthy within {timeout:.0f}s "
f"(last error: {last_err!r})."
)
Comment on lines +166 to +170
self._advertise_host = ray.util.get_node_ip_address()

_wait_for_http_health(self._advertise_host, http_port)
self.log_info(f"sglang server ready at {self.get_server_url()}")
return self.get_server_url()
Comment on lines +167 to +172
router_cfg = (
OmegaConf.to_container(self._router_cfg, resolve=True)
if self._router_cfg is not None
else {}
) or {}

Comment on lines +164 to +166
port = int(self._bind_port or self.acquire_free_port())
self._port = port

Comment on lines +63 to +71
def __init__(
self,
base_url: str,
connect_timeout: float = 10.0,
max_connections: int = 1024 * 16,
):
self.base_url = base_url.rstrip("/")
self.connect_timeout = connect_timeout
self.max_connections = max_connections
Comment thread pyproject.toml
Comment on lines 76 to 81
agentic-sglang = [
"sglang[all]==0.4.6.post5",
"sglang-router",
"torch-memory-saver",
"numpy==2.2",
"transformers==4.51.1",
Comment thread pyproject.toml
Comment on lines 77 to 79
"sglang[all]==0.4.6.post5",
"sglang-router",
"torch-memory-saver",
Comment on lines +1 to +7
#! /bin/bash
set -x

tabs 4
export CUDA_DEVICE_MAX_CONNECTIONS=1
export RAY_DEDUP_LOGS=0

return args


class SGLangRouterWorker(Worker):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we move this to workers and name the file xxx worker?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

placement_strategy=router_placement,
)

router_handle = router_group.init_router() if router_group is not None else None

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we make the startup path transactional and clean up partially-started resources on failure?

There are two related failure paths here:

  1. At the orchestrator level, if server_handle.wait(), router_handle.wait(), get_server_url(), or register_server() raises, the already-launched server/router Ray groups are returned by no one and their child processes can keep running.

  2. At the router worker level, SGLangRouterWorker.init_router() calls subprocess.Popen() and then waits for /health. If the process stays alive but never becomes healthy, _wait_for_router_health() raises while the router subprocess may still be holding the selected port.

For SGLang this is especially risky because failed startup can leave GPU-serving processes and router ports behind, causing OOM or port conflicts on the next run.

Could we wrap the launch/init/register sequence in try/except and best-effort shut down router then server before re-raising, and also make init_router() call self.shutdown()/reset state when its health wait fails?

Comment on lines +92 to +105
rollout_placement = PackedPlacementStrategy(
start_hardware_rank=ranks[0],
end_hardware_rank=ranks[-1],
num_hardware_per_process=num_accelerators_per_engine,
)

server_group = SGLangServerWorker.create_group(
config=config,
sglang_cfg=rollout_cfg.server,
).launch(
cluster=cluster,
name=rollout_cfg.group_name,
placement_strategy=rollout_placement,
)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we avoid rebuilding placement from only the flat hardware-rank list here?

The caller likely already has a parsed RLinf placement strategy from ComponentPlacement. Reconstructing a new PackedPlacementStrategy from rollout_hardware_ranks loses the original placement semantics, especially node_group / heterogeneous cluster placement / flexible mappings. It also assumes the ranks are contiguous.

For example, if the original component placement targets a non-default node group, this helper currently creates a new PackedPlacementStrategy without passing that node group, so the server group can be scheduled against the default group instead of the configured one.

Could this helper accept the caller-provided placement strategy directly, or preserve the original node group and process-to-resource mapping when repacking?

Lin-xs added 2 commits June 12, 2026 09:14
Signed-off-by: Lin-xs <1833080950@qq.com>
Signed-off-by: Lin-xs <1833080950@qq.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants