feat(libscheduler & rkforge): add gpu support & build cuda and candle-vllm image by Ricky-Daxia · Pull Request #618 · rk8s-dev/rk8s

Ricky-Daxia · 2026-05-13T14:00:13Z

No description provided.

…ter plugins

… in test fixtures

…, use local image cache

…duling tests Fix gang_with_topology_lands_on_same_domain and gang_timeout_rolls_back by replacing scheduler.enqueue() with update_cache_pod(), which writes the pod into cache before enqueueing (enqueue-only caused silent cache misses in schedule_one). Add three GPU gang scheduling tests to xline_test.rs exercising the full Xline data path with GPU-labeled nodes and gang/topology pod specs: - test_gpu_gang_all_pods_land_on_same_nvlink_domain: TP=4 scenario, TopologyCoAffinityFilter + GangStateStore ensure all 4 pods land on the same NVLink domain node - test_gpu_gang_without_topology_constraint_is_atomic: gang-only All-or-Nothing without topology binding - test_gpu_gang_partial_fill_produces_no_assignment: insufficient GPU resources, gang never fills, no Assignment emitted before timeout Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

…zip layers Three blocking bugs surfaced while building the candle-vllm CUDA image (FROM docker.m.daocloud.io/nvidia/cuda:12.9.0-cudnn-devel-ubuntu22.04): 1. FROM dropped image_parsed.registry, so explicit-registry FROM lines were silently routed to default_registry. Now the registry is carried into the image ref string when present. 2. pull_manifest bailed on multi-arch manifest lists, blocking nearly all public images. Add pull_concrete_manifest helper that picks the linux/amd64 entry via clone_with_digest and re-pulls. 3. get_media_type only recognised OCI tar+gzip, so Docker tar.gzip layers fell through to MediaType::Other and were std::fs::copy'd as raw tarballs instead of unpacked; ensure_unpacked_layer_dirs then failed with "Pulled layer is not an unpacked directory". Also widen format_push_error to print the full anyhow chain so push 500s surface the real upstream error. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Reference Dockerfile from EricLBuehler/candle-vllm upstream, used to exercise rkforge against a real CUDA workload. See ~/docs/rkforge-candle-vllm-构建记录.md for the adapted build recipe (mirror swap, in-container proxy plumbing, webapp verify stub). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4f9918c498

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-05-13T14:02:54Z

+        if matches!(&pod_info.scheduled, Some(n) if n == node_name) {
+            return true;


Count already-bound pods before returning

When update_cache_pod receives a pod that already has scheduled set (for example from Xline after scheduler restart or an external bind), it first inserts that pod into the cache and then calls assume; this early return sees the same node and skips adding the pod's CPU, memory, and GPU requests to the node. Those existing workloads are then invisible to NodeResourcesFit/NodeGpuResourcesFit, so subsequent scheduling can overcommit the node.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-13T14:02:54Z

+        if ctx.required_values.is_empty() {
+            return Status::default();


Require topology label for the first gang member

For a gang pod with same_value topology constraints, the first member has no assumed peers, so required_values stays empty and this filter accepts every node, including nodes that do not carry the requested topology key. If that first member lands on an unlabeled node, the next member hits the assumed gang member missing topology label path and the gang sits assumed until timeout instead of choosing a labeled node up front.

Useful? React with 👍 / 👎.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4ce442f743

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-05-14T15:43:55Z

+        if let Some(cmd) = spec.command {
+            self.args = Some(cmd);


Preserve container args when command is set

When a container spec sets both command and args, this branch replaces the runtime argv with only command, so the user-provided args are silently dropped (e.g. command: ["python"] plus args: ["app.py"] starts just python). The comment here says args should append, and this builder later writes only config.args into the OCI process argv, so the two vectors need to be combined in this branch.

Useful? React with 👍 / 👎.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6fd3200f5b

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-05-15T15:10:16Z

+                            .expect("scheduling result rx closed before scheduler closed");
+                    }
+                    Some(g) => {
+                        let full = gang_state.add_member(&g.id, g.size, &pod_name, &chosen);


Remove deleted pods from gang state before counting them

When this records an assumed gang member, that entry is never removed if the pod is deleted or explicitly unassumed before the gang fills (I checked remove_cache_pod and unassume, and they only update the cache/queue). In that scenario, a later member can make add_member return full using a stale pod name, take_and_clear will emit an Assignment for the deleted pod, and the remaining members may be bound even though the gang is no longer complete.

Useful? React with 👍 / 👎.

Ricky-Daxia and others added 16 commits May 13, 2026 21:56

feat(libscheduler): add GPU/Gang/Topology fields to data model

9c73a08

feat(libscheduler): add GangStateStore for cross-cycle gang membership

43e180f

feat(libscheduler): track GPU requested count in Cache assume/unassume

636974e

feat(libscheduler): add NodeGpuResourcesFit and TopologyCoAffinityFil…

fcd99be

…ter plugins

feat(libscheduler): defer Assignment send until full gang assumed

5cd33b2

feat(libscheduler): rollback timed-out gang assumptions

76d6062

feat(libscheduler): parse GPU/Gang/Topology fields from xline payloads

8438376

test(libscheduler): add LLM inference E2E scheduling test

4046721

fix(rkl): supply new common::PodSpec gang/topology_constraints fields…

5cdca26

… in test fixtures

docs(plans): record implementation

7849608

build: add CUDA build context and Ubuntu 24.04 Dockerfile

037e87f

fix(rkforge): fix VFS path filtering, add parent dir creation on COPY…

3e26f22

…, use local image cache

chore: clippy

4f9918c

chatgpt-codex-connector Bot reviewed May 13, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed May 14, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed May 15, 2026

View reviewed changes

Ricky-Daxia added 2 commits May 16, 2026 12:27

fix(rkforge): fix candle-vllm image build

3d51884

chore: clippy

51c7660

Ricky-Daxia force-pushed the feat/libcsi branch from 6fd3200 to 51c7660 Compare May 16, 2026 04:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(libscheduler & rkforge): add gpu support & build cuda and candle-vllm image#618

feat(libscheduler & rkforge): add gpu support & build cuda and candle-vllm image#618
Ricky-Daxia wants to merge 18 commits into
rk8s-dev:mainfrom
Ricky-Daxia:feat/libcsi

Ricky-Daxia commented May 13, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 13, 2026

Uh oh!

chatgpt-codex-connector Bot May 13, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 14, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		if matches!(&pod_info.scheduled, Some(n) if n == node_name) {
		return true;

		if ctx.required_values.is_empty() {
		return Status::default();

Conversation

Ricky-Daxia commented May 13, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 13, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot May 13, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant