Skip to content

feat(libscheduler & rkforge): add gpu support & build cuda and candle-vllm image#618

Open
Ricky-Daxia wants to merge 18 commits into
rk8s-dev:mainfrom
Ricky-Daxia:feat/libcsi
Open

feat(libscheduler & rkforge): add gpu support & build cuda and candle-vllm image#618
Ricky-Daxia wants to merge 18 commits into
rk8s-dev:mainfrom
Ricky-Daxia:feat/libcsi

Conversation

@Ricky-Daxia
Copy link
Copy Markdown
Collaborator

No description provided.

Ricky-Daxia and others added 16 commits May 13, 2026 21:56
…duling tests

Fix gang_with_topology_lands_on_same_domain and gang_timeout_rolls_back
by replacing scheduler.enqueue() with update_cache_pod(), which writes
the pod into cache before enqueueing (enqueue-only caused silent cache
misses in schedule_one).

Add three GPU gang scheduling tests to xline_test.rs exercising the full
Xline data path with GPU-labeled nodes and gang/topology pod specs:
- test_gpu_gang_all_pods_land_on_same_nvlink_domain: TP=4 scenario,
  TopologyCoAffinityFilter + GangStateStore ensure all 4 pods land on
  the same NVLink domain node
- test_gpu_gang_without_topology_constraint_is_atomic: gang-only
  All-or-Nothing without topology binding
- test_gpu_gang_partial_fill_produces_no_assignment: insufficient GPU
  resources, gang never fills, no Assignment emitted before timeout

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…zip layers

Three blocking bugs surfaced while building the candle-vllm CUDA image
(FROM docker.m.daocloud.io/nvidia/cuda:12.9.0-cudnn-devel-ubuntu22.04):

1. FROM dropped image_parsed.registry, so explicit-registry FROM lines
   were silently routed to default_registry. Now the registry is carried
   into the image ref string when present.

2. pull_manifest bailed on multi-arch manifest lists, blocking nearly all
   public images. Add pull_concrete_manifest helper that picks the
   linux/amd64 entry via clone_with_digest and re-pulls.

3. get_media_type only recognised OCI tar+gzip, so Docker tar.gzip
   layers fell through to MediaType::Other and were std::fs::copy'd as
   raw tarballs instead of unpacked; ensure_unpacked_layer_dirs then
   failed with "Pulled layer is not an unpacked directory".

Also widen format_push_error to print the full anyhow chain so push 500s
surface the real upstream error.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reference Dockerfile from EricLBuehler/candle-vllm upstream, used to
exercise rkforge against a real CUDA workload. See
~/docs/rkforge-candle-vllm-构建记录.md for the adapted build recipe
(mirror swap, in-container proxy plumbing, webapp verify stub).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4f9918c498

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +44 to +45
if matches!(&pod_info.scheduled, Some(n) if n == node_name) {
return true;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Count already-bound pods before returning

When update_cache_pod receives a pod that already has scheduled set (for example from Xline after scheduler restart or an external bind), it first inserts that pod into the cache and then calls assume; this early return sees the same node and skips adding the pod's CPU, memory, and GPU requests to the node. Those existing workloads are then invisible to NodeResourcesFit/NodeGpuResourcesFit, so subsequent scheduling can overcommit the node.

Useful? React with 👍 / 👎.

Comment on lines +102 to +103
if ctx.required_values.is_empty() {
return Status::default();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Require topology label for the first gang member

For a gang pod with same_value topology constraints, the first member has no assumed peers, so required_values stays empty and this filter accepts every node, including nodes that do not carry the requested topology key. If that first member lands on an unlabeled node, the next member hits the assumed gang member missing topology label path and the gang sits assumed until timeout instead of choosing a labeled node up front.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4ce442f743

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +89 to +90
if let Some(cmd) = spec.command {
self.args = Some(cmd);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve container args when command is set

When a container spec sets both command and args, this branch replaces the runtime argv with only command, so the user-provided args are silently dropped (e.g. command: ["python"] plus args: ["app.py"] starts just python). The comment here says args should append, and this builder later writes only config.args into the OCI process argv, so the two vectors need to be combined in this branch.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6fd3200f5b

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

.expect("scheduling result rx closed before scheduler closed");
}
Some(g) => {
let full = gang_state.add_member(&g.id, g.size, &pod_name, &chosen);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Remove deleted pods from gang state before counting them

When this records an assumed gang member, that entry is never removed if the pod is deleted or explicitly unassumed before the gang fills (I checked remove_cache_pod and unassume, and they only update the cache/queue). In that scenario, a later member can make add_member return full using a stale pod name, take_and_clear will emit an Assignment for the deleted pod, and the remaining members may be bound even though the gang is no longer complete.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant