feat(libscheduler & rkforge): add gpu support & build cuda and candle-vllm image#618
feat(libscheduler & rkforge): add gpu support & build cuda and candle-vllm image#618Ricky-Daxia wants to merge 18 commits into
Conversation
… in test fixtures
…, use local image cache
…duling tests Fix gang_with_topology_lands_on_same_domain and gang_timeout_rolls_back by replacing scheduler.enqueue() with update_cache_pod(), which writes the pod into cache before enqueueing (enqueue-only caused silent cache misses in schedule_one). Add three GPU gang scheduling tests to xline_test.rs exercising the full Xline data path with GPU-labeled nodes and gang/topology pod specs: - test_gpu_gang_all_pods_land_on_same_nvlink_domain: TP=4 scenario, TopologyCoAffinityFilter + GangStateStore ensure all 4 pods land on the same NVLink domain node - test_gpu_gang_without_topology_constraint_is_atomic: gang-only All-or-Nothing without topology binding - test_gpu_gang_partial_fill_produces_no_assignment: insufficient GPU resources, gang never fills, no Assignment emitted before timeout Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…zip layers Three blocking bugs surfaced while building the candle-vllm CUDA image (FROM docker.m.daocloud.io/nvidia/cuda:12.9.0-cudnn-devel-ubuntu22.04): 1. FROM dropped image_parsed.registry, so explicit-registry FROM lines were silently routed to default_registry. Now the registry is carried into the image ref string when present. 2. pull_manifest bailed on multi-arch manifest lists, blocking nearly all public images. Add pull_concrete_manifest helper that picks the linux/amd64 entry via clone_with_digest and re-pulls. 3. get_media_type only recognised OCI tar+gzip, so Docker tar.gzip layers fell through to MediaType::Other and were std::fs::copy'd as raw tarballs instead of unpacked; ensure_unpacked_layer_dirs then failed with "Pulled layer is not an unpacked directory". Also widen format_push_error to print the full anyhow chain so push 500s surface the real upstream error. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reference Dockerfile from EricLBuehler/candle-vllm upstream, used to exercise rkforge against a real CUDA workload. See ~/docs/rkforge-candle-vllm-构建记录.md for the adapted build recipe (mirror swap, in-container proxy plumbing, webapp verify stub). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4f9918c498
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if matches!(&pod_info.scheduled, Some(n) if n == node_name) { | ||
| return true; |
There was a problem hiding this comment.
Count already-bound pods before returning
When update_cache_pod receives a pod that already has scheduled set (for example from Xline after scheduler restart or an external bind), it first inserts that pod into the cache and then calls assume; this early return sees the same node and skips adding the pod's CPU, memory, and GPU requests to the node. Those existing workloads are then invisible to NodeResourcesFit/NodeGpuResourcesFit, so subsequent scheduling can overcommit the node.
Useful? React with 👍 / 👎.
| if ctx.required_values.is_empty() { | ||
| return Status::default(); |
There was a problem hiding this comment.
Require topology label for the first gang member
For a gang pod with same_value topology constraints, the first member has no assumed peers, so required_values stays empty and this filter accepts every node, including nodes that do not carry the requested topology key. If that first member lands on an unlabeled node, the next member hits the assumed gang member missing topology label path and the gang sits assumed until timeout instead of choosing a labeled node up front.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4ce442f743
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if let Some(cmd) = spec.command { | ||
| self.args = Some(cmd); |
There was a problem hiding this comment.
Preserve container args when command is set
When a container spec sets both command and args, this branch replaces the runtime argv with only command, so the user-provided args are silently dropped (e.g. command: ["python"] plus args: ["app.py"] starts just python). The comment here says args should append, and this builder later writes only config.args into the OCI process argv, so the two vectors need to be combined in this branch.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6fd3200f5b
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| .expect("scheduling result rx closed before scheduler closed"); | ||
| } | ||
| Some(g) => { | ||
| let full = gang_state.add_member(&g.id, g.size, &pod_name, &chosen); |
There was a problem hiding this comment.
Remove deleted pods from gang state before counting them
When this records an assumed gang member, that entry is never removed if the pod is deleted or explicitly unassumed before the gang fills (I checked remove_cache_pod and unassume, and they only update the cache/queue). In that scenario, a later member can make add_member return full using a stale pod name, take_and_clear will emit an Assignment for the deleted pod, and the remaining members may be bound even though the gang is no longer complete.
Useful? React with 👍 / 👎.
No description provided.