feat: add self-contained Docker cluster tooling #1134

Merged
DanMacDonald merged 8 commits into master from dmac/docker_cluster_tooling
Feb 14, 2026

@DanMacDonald DanMacDonald commented Feb 13, 2026

Collaborator

Summary

  • Add docker/agent_cluster/ — self-contained local cluster management with its own compose file, configs, and scripts (no dependencies on test infrastructure)
  • agent_cluster.sh supports build, deploy, hotdeploy, restart, status, logs, stop, destroy, validate-configs
  • Uses rsync for incremental source sync into Docker build volumes, enabling fast rebuilds (only changed files trigger recompilation)
  • Add docker/Dockerfile.builder and docker/Dockerfile.local for two-stage build pipeline
  • Add .claude/commands/deploy-cluster.md slash command for AI-assisted cluster management
  • Add validate-configs command that diffs consensus sections across node configs to catch drift early

Test plan

  • docker/agent_cluster/agent_cluster.sh validate-configs — configs consistent
  • docker/agent_cluster/agent_cluster.sh deploy --skip-build — all 3 nodes healthy
  • /deploy-cluster slash command tested end-to-end

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Local 3-node test cluster with CLI for building, deploying, restarting, status, logs, stop/destroy, hot-deploy and in-container exec/validation.
    • Auto-restart peer supervisor with health checks, progress detection, and backoff restart behavior.
    • Lightweight runtime image for local dev and a dedicated builder image to speed iterative builds.
  • Documentation

    • New comprehensive operational guide covering deployment, config parameters, validation, troubleshooting, and post-deploy steps.
  • Chores

    • Rust build/linker optimizations and minor debug/runtime image updates.

@DanMacDonald DanMacDonald changed the base branch from master to dmac/term_ledgers_1 February 13, 2026 05:38
Base automatically changed from dmac/term_ledgers_1 to master February 13, 2026 05:44
Add docker/agent_cluster/ with its own compose file, configs, and
management script for local 3-node cluster development. Uses rsync
for incremental source sync to enable fast rebuilds.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@DanMacDonald DanMacDonald force-pushed the dmac/docker_cluster_tooling branch from 8ef22a0 to 69ef052 Compare February 13, 2026 05:46
@Irys-xyz Irys-xyz deleted a comment from coderabbitai Bot Feb 13, 2026
@coderabbitai

coderabbitai Bot commented Feb 13, 2026

Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Adds a local multi-node Irys test-cluster: new builder/runtime Dockerfiles, Docker Compose cluster, three per-node TOML configs, a bash cluster CLI, a peer supervisor script with auto-restart/genesis discovery, storage submodule config, Cargo/Cargo.toml tweaks, and an operational deployment guide.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Operational docs: .claude/commands/deploy-cluster.md | New comprehensive user guide documenting configurable parameters, workflow for generating identical per-node TOML configs, build/deploy steps, validation, common pitfalls, and post-deploy actions. |
| Docker build/runtime images: docker/Dockerfile.builder, docker/Dockerfile.local, docker/Dockerfile.debug | Adds a one-time builder image (Rust toolchain + build deps), a lightweight runtime image that packages a pre-built binary, and tweaks the debug Dockerfile to use mold and native CPU RUSTFLAGS and remove digest pins. |
| Cluster CLI & orchestration: docker/agent_cluster/agent_cluster.sh, docker/agent_cluster/docker-compose.yaml | New bash CLI for build/deploy/restart/status/logs/stop/destroy/hotdeploy/exec/validate; uses Docker Compose to bring up a 1–3 node cluster with fixed IPs, mounts, and restart policies. |
| Node supervision: docker/agent_cluster/start_peer_with_auto_restart.sh | Supervisor script that discovers genesis hash, polls heights, injects temp configs, monitors progress, and performs exponential-backoff restarts with graceful shutdown/cleanup. |
| Cluster configs & storage layout: docker/agent_cluster/configs/.irys_submodules.toml, docker/agent_cluster/configs/irys-1.toml, docker/agent_cluster/configs/irys-2.toml, docker/agent_cluster/configs/irys-3.toml | Adds storage submodule hardcoded paths and three full per-node TOML configs (Genesis + 2 Peers) covering network, storage, mempool, oracle, consensus, genesis allocations and economic parameters. |
| Rust/Cargo config: .cargo/config.toml, Cargo.toml | Adds target linker/rustflags for x86_64-unknown-linux-gnu (clang + mold, -C target-cpu=native, -C link-arg=-fuse-ld=mold) and changes release LTO from "fat" to "thin". |
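
The exponential-backoff restart behaviour listed for start_peer_with_auto_restart.sh can be sketched roughly as below. This is an illustration only, not the script's code: the function name `backoff_seconds` and its argument order are hypothetical, and it assumes the delay is base^(restarts-1) seconds with an upper cap.

```shell
# Hypothetical helper: prints the delay before the Nth restart.
#   $1 = backoff base, $2 = restart count (1-based), $3 = cap in seconds
backoff_seconds() {
  base=$1; restarts=$2; cap=$3
  delay=1; n=1
  # Multiply step by step and stop early once the cap is reached,
  # which also avoids integer overflow for large restart counts.
  while [ "$n" -lt "$restarts" ] && [ "$delay" -lt "$cap" ]; do
    delay=$(( delay * base ))
    n=$(( n + 1 ))
  done
  if [ "$delay" -gt "$cap" ]; then
    delay=$cap
  fi
  printf '%s\n' "$delay"
}
```

With a base of 2 and a cap of 60, the first restarts would wait 1, 2, 4, 8 seconds and so on, and later ones would stay at 60.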

Sequence Diagram(s)

sequenceDiagram
    participant Operator
    participant CLI as agent_cluster.sh
    participant Builder as Docker Builder Image
    participant Orchestrator as Docker Compose
    participant Node1 as test-irys-1 (Genesis)
    participant Node2 as test-irys-2 (Peer)
    participant Node3 as test-irys-3 (Peer)

    Operator->>CLI: cmd_deploy (may trigger build)
    CLI->>Builder: build/runtime image (source sync, cargo build)
    Builder-->>CLI: image/artifact
    CLI->>Orchestrator: docker-compose up (mount configs)
    Orchestrator->>Node1: start container (Genesis entrypoint)
    Orchestrator->>Node2: start container (Peer, depends on Genesis)
    Orchestrator->>Node3: start container (Peer, depends on Genesis)
    Node1->>Node1: start_peer_with_auto_restart.sh (discover genesis, run binary)
    Node2->>Node2: start_peer_with_auto_restart.sh (wait for genesis, inject config, run binary)
    Node3->>Node3: start_peer_with_auto_restart.sh (wait for genesis, inject config, run binary)
    CLI->>Node1: poll health endpoints (height/status)
    CLI->>Operator: report status / errors

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Suggested reviewers

  • antouhou
  • JesseTheRobot
🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 42.42% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Merge Conflict Detection | ⚠️ Warning | ❌ Merge conflicts detected (7 files, all content conflicts): .cargo/config.toml, Cargo.toml, crates/c/build/main.rs, crates/chain/src/chain.rs, crates/types/src/config/consensus.rs, crates/types/src/config/mod.rs, docker/Dockerfile.debug. These conflicts must be resolved before merging into master. | Resolve conflicts locally and push changes to this branch. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: addition of self-contained Docker cluster tooling. It is directly related to the primary focus of the changeset. |


@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 10

🤖 Fix all issues with AI agents
In `@docker/agent_cluster/agent_cluster.sh`:
- Around line 496-511: In wait_for_node, avoid embedding color variables
directly into the printf format (SC2059); change the printf calls that currently
include ${GREEN}, ${RED}, and ${RESET} in the format string to use a format
specifier like %b or %s and pass the color variables as separate arguments
(e.g., printf '%b%s%b\n' "$GREEN" "up" "$RESET"), making the calls in the success
and timeout branches use explicit format parameters instead of interpolating the
variables into the format string.
- Around line 349-353: The script currently spawns python3 five times to extract
JSON fields into variables height, bi_height, peers, mining, and syncing;
replace those with a single JSON parse invocation (either a single python3 call
or jq) that reads info_json once and emits all required values in a stable
delimiter (e.g., newline or null-separated) which you then split into the
corresponding variables (height, bi_height, peers, mining, syncing); update the
block that sets those variables to use that single parse command and ensure any
2>/dev/null handling and default fallbacks are preserved.
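
The SC2059 item above can be sketched as follows. This is a minimal illustration, not the script's code: the escape values assigned to GREEN, RED and RESET are assumptions, and `ok` and `err` are stand-in helpers. The format string is a literal, the colors go through `%b` (which interprets backslash escapes) and the message through `%s`, so a message containing `%` cannot be misread as a format directive.

```shell
# Assumed color definitions; the real script may define them differently.
GREEN='\033[0;32m'
RED='\033[0;31m'
RESET='\033[0m'

# Literal format string: %b expands the escape sequences in the color
# variables, %s prints the message verbatim.
ok()  { printf '%b%s%b\n' "$GREEN" "$*" "$RESET"; }
err() { printf '%b%s%b\n' "$RED" "$*" "$RESET" >&2; }

ok "node 100% up"
```

Note that a single `%b\n` with several arguments would repeat the format once per argument and print each on its own line, which is why the sketch uses one specifier per argument.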

In `@docker/agent_cluster/docker-compose.yaml`:
- Around line 30-31: Add a proper healthcheck for the irys node and make the
peer services depend on that health state instead of just start order: add a
healthcheck stanza to the test-irys-1 service that probes the node (e.g., an
HTTP or CLI readiness check that returns success when the API is ready) and
update the peers' depends_on to use condition: service_healthy for test-irys-1
so Compose waits for readiness; verify start_peer_with_auto_restart.sh still
works with the healthcheck polling and adjust timeouts/retries in the
healthcheck to match the script's expectations.
- Around line 71-77: Update the Docker Compose network configuration under
networks -> test_network -> ipam -> config by replacing the overly broad subnet
"172.21.0.0/16" with a /24 range (e.g., "172.21.0.0/24") so the network provides
~254 usable addresses for the three-node cluster and reduces potential IP
overlap.
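
For reference, the address counts behind the subnet suggestion can be worked out with shell arithmetic. This is only the conventional calculation (total addresses minus the network and broadcast addresses); `usable_hosts` is a hypothetical helper, and Docker additionally reserves an address for the gateway.

```shell
# Hypothetical helper: usable IPv4 host addresses for a given prefix length.
usable_hosts() {
  prefix=$1
  printf '%s\n' $(( (1 << (32 - prefix)) - 2 ))
}

usable_hosts 16   # current 172.21.0.0/16
usable_hosts 24   # suggested 172.21.0.0/24
```

The two calls print 65534 and 254 respectively, so a /24 still leaves ample room for three nodes.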

In `@docker/agent_cluster/start_peer_with_auto_restart.sh`:
- Around line 190-193: The parent-shell NODE_PID becomes stale because
start_node, monitor_and_restart and restart_node run in subshells; fix by using
a PID file the monitor updates and shutdown_all reads: when start_node launches
the irys process (in start_node/ restart_node) write its actual pid to a known
file (e.g., /tmp/irys-node.pid) and update that file each time restart_node
replaces the process, ensure monitor_and_restart reads/uses the same file for
tracking, and change shutdown_all to read the pid from that file (and remove it)
instead of relying on the parent-shell NODE_PID; keep NODE_PID assignment as
auxiliary but canonicalize on the pidfile so the correct, current PID is always
used for kill operations.
- Around line 85-92: The sed substitution in make_config_with_genesis using
ghash is fragile when ghash contains sed metacharacters; update
make_config_with_genesis to avoid sed delimiter/escaping issues by replacing the
sed call that edits "$TMP_CONFIG" with a robust approach (e.g., use awk with -v
hash="$ghash" and gsub to replace the exact pattern expected_genesis_hash = ""
with expected_genesis_hash = "hash", or choose a sed invocation that uses a
different delimiter and properly escapes & and \ in "$ghash"); ensure you
operate on "$TMP_CONFIG" safely (write to a temp file and move it over) and keep
TMP_CONFIG and ghash references unchanged so the function behavior stays the
same.
- Around line 94-124: The start_node function currently waits up to 30 attempts
for the genesis node to reach MIN_HEIGHT but then silently proceeds even if the
condition never met; modify the wait loop in start_node (the block using
get_height, MIN_HEIGHT, CHECK_INTERVAL_SEC and GENESIS_URL) to detect timeout
and handle it explicitly — after the for-loop, if the latest height is still <
MIN_HEIGHT, log an error including the last observed height and either exit with
a non-zero status (e.g., exit 1) or return/fail the function instead of
continuing to launch IRYS_BIN, so the peer does not start when the genesis node
is not ready.
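
The explicit-timeout handling described in the last item could look roughly like this. It is a sketch under assumptions: `wait_for_min_height` and its arguments are hypothetical, and `get_height` is replaced by a stub driven by a made-up `FAKE_HEIGHT` variable so the example runs without a node.

```shell
# Stub standing in for the script's get_height (assumption for illustration).
get_height() { printf '%s\n' "${FAKE_HEIGHT:-0}"; }

# Hypothetical helper: $1 = minimum height, $2 = max attempts, $3 = sleep seconds.
# Returns 0 once the height is reached, 1 with an error message on timeout.
wait_for_min_height() {
  min_height=$1; max_attempts=$2; interval=$3
  attempt=0; height=0
  while [ "$attempt" -lt "$max_attempts" ]; do
    height=$(get_height)
    if [ "$height" -ge "$min_height" ]; then
      return 0
    fi
    attempt=$(( attempt + 1 ))
    sleep "$interval"
  done
  echo "genesis not ready: last height $height < $min_height" >&2
  return 1
}
```

A caller would then guard the launch, for example `wait_for_min_height "$MIN_HEIGHT" 30 "$CHECK_INTERVAL_SEC" || exit 1`, so the peer never starts against a genesis node that is not ready.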

In `@docker/Dockerfile.builder`:
- Around line 5-7: The RUN apt-get install line in the Dockerfile.builder
currently installs packages without pinned versions, making builds
non-reproducible; update the RUN command that contains "apt-get update &&
apt-get install -y --no-install-recommends build-essential pkg-config clang
cmake git ca-certificates libgmp-dev m4 file perl rsync" to either pin explicit
package versions (e.g., add =<version> for each package) or document and lock
the base distribution release (e.g., use a specific Debian/Ubuntu tag in the
FROM line) so subsequent builds install the same package versions; ensure the
change is applied to the same RUN stanza and include a short comment explaining
the chosen approach for reproducibility.

In `@docker/Dockerfile.local`:
- Around line 22-23: The two COPY lines (COPY docker/build-output/irys /app/irys
and COPY docker/entrypoint.sh /app/entrypoint.sh) assume the Docker build
context is the repository root; either update these paths to match the intended
context (e.g., if the build context is the docker/ folder use COPY
build-output/irys /app/irys and COPY entrypoint.sh /app/entrypoint.sh) or
explicitly document/ensure the build orchestration sets the context to the repo
root (add a short comment at the top of Dockerfile.local and/or update the build
script/CI step to pass the repo root as the context). Ensure the chosen approach
is consistent with the build script so the files referenced by those COPY
directives exist in the build context.
- Around line 10-30: Add a Docker HEALTHCHECK to the Dockerfile.local so
compose's depends_on can wait for the service to be ready; e.g., add a
HEALTHCHECK directive (with sensible interval/retries/timeouts) that executes a
lightweight probe against the running service (either calling the bundled binary
/app/irys with a --health/health command or invoking a small check implemented
in /app/entrypoint.sh), and if needed, extend entrypoint.sh to support a fast
exit status for the health probe; ensure the probe runs as the non-root user
(irys) or uses a helper that can run without root.

Comment thread docker/agent_cluster/agent_cluster.sh Outdated
Comment thread docker/agent_cluster/agent_cluster.sh
Comment thread docker/agent_cluster/docker-compose.yaml
Comment thread docker/agent_cluster/docker-compose.yaml
Comment thread docker/agent_cluster/start_peer_with_auto_restart.sh
Comment thread docker/agent_cluster/start_peer_with_auto_restart.sh
Comment on lines +190 to +193
# Re-launch node as child of a subshell to capture PID
( start_node ) &
NODE_PID=$!
log "Node restarted with PID $NODE_PID"

@coderabbitai coderabbitai Bot Feb 13, 2026

Contributor

⚠️ Potential issue | 🔴 Critical

Subshell isolation causes stale PID tracking — node process orphaned on shutdown after restarts.

Both start_node (Line 218) and monitor_and_restart (Line 223) run in subshells (( ... ) &). When restart_node (called inside the monitor's subshell) kills the old node and sets NODE_PID=$! (Line 192), this only updates the monitor's copy of the variable. The parent shell's NODE_PID remains the PID of the original ( start_node ) & subshell.

On signal delivery, shutdown_all (Line 211 trap) runs in the parent and kills the stale NODE_PID — the currently running node process (spawned by the monitor) is never terminated, leading to an orphaned irys process.

Consider one of these approaches:

  1. Write the current node PID to a file (e.g., /tmp/irys-node.pid) and have shutdown_all read from it.
  2. Use process groups (kill -- -$PGID) to ensure all descendants are terminated.
  3. Run the monitor in the foreground instead of a subshell, and use background only for the node.
🐛 Proposed fix using a PID file
+PID_FILE="/tmp/irys-node.pid"
+
 restart_node() {
   # Backoff grows with each restart up to a cap
   ((RESTARTS++))
   local backoff=$(( RESTART_BACKOFF_BASE ** (RESTARTS-1) ))
   (( backoff > RESTART_BACKOFF_MAX_SEC )) && backoff="$RESTART_BACKOFF_MAX_SEC"
 
   if [[ -n "${NODE_PID:-}" ]]; then
     kill -TERM "$NODE_PID" 2>/dev/null || true
     # wait up to 10s then SIGKILL
     for _ in {1..20}; do
       if ! kill -0 "$NODE_PID" 2>/dev/null; then break; fi
       sleep 0.5
     done
     if kill -0 "$NODE_PID" 2>/dev/null; then
       log "Node not exiting, sending SIGKILL"
       kill -KILL "$NODE_PID" 2>/dev/null || true
     fi
   fi
 
   # ...backoff and restart...
   ( start_node ) &
   NODE_PID=$!
+  echo "$NODE_PID" > "$PID_FILE"
   log "Node restarted with PID $NODE_PID"
 }
 
 shutdown_all() {
   log "Caught signal, shutting down..."
   if [[ -n "${MONITOR_PID:-}" ]]; then
     kill "$MONITOR_PID" 2>/dev/null || true
   fi
-  if [[ -n "${NODE_PID:-}" ]]; then
-    kill -TERM "$NODE_PID" 2>/dev/null || true
-    wait "$NODE_PID" 2>/dev/null || true
+  local current_pid
+  current_pid=$(cat "$PID_FILE" 2>/dev/null || echo "")
+  if [[ -n "$current_pid" ]]; then
+    kill -TERM "$current_pid" 2>/dev/null || true
+    wait "$current_pid" 2>/dev/null || true
   fi
+  rm -f "$PID_FILE"
   # ...rest of cleanup...
 }
 
 # Main
 ( start_node ) &
 NODE_PID=$!
+echo "$NODE_PID" > "$PID_FILE"

Also applies to: 211-238

🤖 Prompt for AI Agents
In `@docker/agent_cluster/start_peer_with_auto_restart.sh` around lines 190 - 193,
The parent-shell NODE_PID becomes stale because start_node, monitor_and_restart
and restart_node run in subshells; fix by using a PID file the monitor updates
and shutdown_all reads: when start_node launches the irys process (in
start_node/ restart_node) write its actual pid to a known file (e.g.,
/tmp/irys-node.pid) and update that file each time restart_node replaces the
process, ensure monitor_and_restart reads/uses the same file for tracking, and
change shutdown_all to read the pid from that file (and remove it) instead of
relying on the parent-shell NODE_PID; keep NODE_PID assignment as auxiliary but
canonicalize on the pidfile so the correct, current PID is always used for kill
operations.

Collaborator Author

Docker kills all container processes on stop, and PID 1 exit cleans up orphans. Low practical risk for a local dev tool — not worth the added complexity.

Contributor

@DanMacDonald You're absolutely right — I didn't fully account for the Docker context. The container runtime's process cleanup on stop/exit makes the PID tracking concern moot for this environment, and the added complexity of a PID file isn't justified for a local dev tool. Thanks for the clarification!

Comment thread docker/Dockerfile.builder
Comment thread docker/Dockerfile.local
Comment thread docker/Dockerfile.local
Comment on lines +22 to +23
COPY docker/build-output/irys /app/irys
COPY docker/entrypoint.sh /app/entrypoint.sh

Contributor

🧹 Nitpick | 🔵 Trivial

COPY context path assumes specific build context.

The COPY docker/build-output/irys and COPY docker/entrypoint.sh paths require the Docker build context to be the repo root. This is fine as long as the build script sets it correctly, but it's worth documenting or verifying in the build orchestration.

🤖 Prompt for AI Agents
In `@docker/Dockerfile.local` around lines 22 - 23, The two COPY lines (COPY
docker/build-output/irys /app/irys and COPY docker/entrypoint.sh
/app/entrypoint.sh) assume the Docker build context is the repository root;
either update these paths to match the intended context (e.g., if the build
context is the docker/ folder use COPY build-output/irys /app/irys and COPY
entrypoint.sh /app/entrypoint.sh) or explicitly document/ensure the build
orchestration sets the context to the repo root (add a short comment at the top
of Dockerfile.local and/or update the build script/CI step to pass the repo root
as the context). Ensure the chosen approach is consistent with the build script
so the files referenced by those COPY directives exist in the build context.
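
One way to verify the context in the build orchestration is a small pre-flight check before `docker build`. This is a sketch only: `check_build_context` is a hypothetical helper, not part of the PR; the two paths are the ones named in the COPY lines.

```shell
# Hypothetical pre-flight check: $1 = directory that will be passed to
# docker build as the context. Fails if a file needed by the COPY lines
# in Dockerfile.local is missing.
check_build_context() {
  context_dir=$1
  for required in docker/build-output/irys docker/entrypoint.sh; do
    if [ ! -e "$context_dir/$required" ]; then
      echo "missing from build context: $required" >&2
      return 1
    fi
  done
  return 0
}
```

From the repo root, usage might look like `check_build_context . && docker build -f docker/Dockerfile.local -t irys:local .`, assuming the build script uses the repo root as its context.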

@DanMacDonald DanMacDonald self-assigned this Feb 13, 2026
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@docker/agent_cluster/agent_cluster.sh`:
- Around line 60-64: The logging helpers info(), ok(), warn(), and err()
currently pass color variables inside the printf format string which triggers
SC2059; change each to use a literal '%b' (or '%b\n' where a newline is needed)
as the printf format and pass the colored message as the single argument so
interpolation of ANSI sequences is done via '%b' (e.g., replace the current
printf usage in info(), ok(), warn(), err() with printf '%b\n' and build the
message using the CYAN/GREEN/YELLOW/RED, the text ($*), and RESET as the
argument).
- Around line 573-577: The current sed invocation sed -n
'/^\[consensus\.Custom\]/,$ p' captures from the [consensus.Custom] header to
EOF and can include later unrelated sections; update the extraction used before
writing to "$tmp_dir/$name" so it stops at the next TOML section header (a line
starting with '['). Replace the one-shot tailing sed with a section-aware
extractor (e.g., switch to an awk-based or sed pattern that prints from
/^\[consensus\.Custom\]/ up to but not including the next line matching /^\[/)
while preserving the subsequent filtering (grep -v '^expected_genesis_hash',
comments, and blank lines) so only the consensus.Custom block is written to
"$tmp_dir/$name".
- Line 565: The trap is defined with double quotes so $tmp_dir is expanded at
definition time; change the trap to use single quotes around the command so the
variable is expanded at execution time instead (e.g., update the trap invocation
that references tmp_dir to use single-quoted command text and keep the inner
"$tmp_dir" quoted for safety).

Comment thread docker/agent_cluster/agent_cluster.sh
Comment thread docker/agent_cluster/agent_cluster.sh
Comment on lines +573 to +577
sed -n '/^\[consensus\.Custom\]/,$ p' "$cfg" \
| grep -v '^expected_genesis_hash' \
| grep -v '^#' \
| grep -v '^[[:space:]]*$' \
> "$tmp_dir/$name" || true

@coderabbitai coderabbitai Bot Feb 13, 2026

Contributor

⚠️ Potential issue | 🟠 Major

sed extracts from [consensus.Custom] to end-of-file — will diff non-consensus sections too.

sed -n '/^\[consensus\.Custom\]/,$ p' captures everything from the [consensus.Custom] header to the end of the file. If any TOML section appears after [consensus.Custom] in a config (e.g., [mining], [network]), it will be included in the comparison and falsely flagged as a consensus drift. This makes validation fragile and incorrect for any config layout where [consensus.Custom] isn't the final section.

🐛 Proposed fix — stop extraction at the next section header
         sed -n '/^\[consensus\.Custom\]/,$ p' "$cfg" \
+            | sed '/^\[/{ /^\[consensus\.Custom\]/!q; }' \
             | grep -v '^expected_genesis_hash' \
             | grep -v '^#' \
             | grep -v '^[[:space:]]*$' \
             > "$tmp_dir/$name" || true

Alternatively, replace the first sed with a single expression that extracts only the target section:

-        sed -n '/^\[consensus\.Custom\]/,$ p' "$cfg" \
+        sed -n '/^\[consensus\.Custom\]/,/^\[/{ /^\[consensus\.Custom\]/p; /^\[/!p; }' "$cfg" \

This prints lines from the [consensus.Custom] header up to (but not including) the next […] section header.

#!/bin/bash
# Verify whether any config file has additional sections after [consensus.Custom]
fd --type f '\.toml$' docker/agent_cluster/configs --exec sh -c '
    echo "=== $1 ==="
    # Print all section headers
    grep -n "^\[" "$1"
' _ {}
🤖 Prompt for AI Agents
In `@docker/agent_cluster/agent_cluster.sh` around lines 573 - 577, The current
sed invocation sed -n '/^\[consensus\.Custom\]/,$ p' captures from the
[consensus.Custom] header to EOF and can include later unrelated sections;
update the extraction used before writing to "$tmp_dir/$name" so it stops at the
next TOML section header (a line starting with '['). Replace the one-shot
tailing sed with a section-aware extractor (e.g., switch to an awk-based or sed
pattern that prints from /^\[consensus\.Custom\]/ up to but not including the
next line matching /^\[/) while preserving the subsequent filtering (grep -v
'^expected_genesis_hash', comments, and blank lines) so only the
consensus.Custom block is written to "$tmp_dir/$name".
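
An awk-based, section-aware extractor of the kind mentioned above might look like this. It is a sketch, not the PR's implementation: `extract_section` is a hypothetical name, and it assumes that any line starting with `[` begins a new section (so nested tables such as `[consensus.Custom.x]` would end the block) and that keys are not indented.

```shell
# Hypothetical helper: $1 = section name without brackets, $2 = TOML file.
# Prints the section header and its lines, skipping comments, blank lines
# and the per-node expected_genesis_hash key.
extract_section() {
  awk -v want="[$1]" '
    /^\[/ { in_section = ($0 == want) }
    in_section && !/^#/ && !/^[ \t]*$/ && !/^expected_genesis_hash/ { print }
  ' "$2"
}
```

Called as `extract_section consensus.Custom "$cfg" > "$tmp_dir/$name"`, it would produce the same input for the diff regardless of where the section sits in the file.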

Collaborator Author

[consensus.Custom] is the last section in all our configs, so the sed-to-EOF extraction works correctly. Worst case on a future config reorder would be a false positive in validation — harmless, and we'd catch it immediately.

Contributor

@DanMacDonald, that makes sense — if [consensus.Custom] is consistently the final section, the extraction works as intended. The fail-fast behavior on any future reorder is a reasonable tradeoff. Thanks for the clarification!

DanMacDonald and others added 2 commits February 13, 2026 12:09
- Platform-aware volumes: named Docker volumes on macOS (avoid VirtioFS),
  bind mounts on Linux (zero overhead)
- Configurable CARGO_JOBS and BUILD_MEMORY with platform-aware defaults
- Add --native flag for Docker-free builds (cross-compile on macOS)
- Add clean command for build cache cleanup
- Add mold linker to Dockerfile.builder and Dockerfile.debug
- Fix subshell PID issue in peer auto-restart (exec $IRYS_BIN)
- Replace python3 with jq in status command
- Rename image tag irys:debug -> irys:local
- Improve build logging: show rsync-detected changes and force warnings
  in Docker Desktop container logs
- Use thin LTO for faster link times
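
The effect of the `exec` change can be illustrated with a small stand-alone experiment. This is a sketch, not the supervisor script: `sleep 30` stands in for the node binary, and the temporary file is only there so the child can report its own PID. With `exec`, the backgrounded subshell is replaced by the command, so `$!` is the PID of the command itself rather than of a wrapper shell.

```shell
# Illustrative only: a short-lived child stands in for the node process.
pid_file=$(mktemp)

# The child records its own PID, then execs the long-running command.
( exec sh -c 'echo "$$" > "$1"; exec sleep 30' sh "$pid_file" ) &
node_pid=$!

# Wait briefly for the child to write its PID.
tries=0
while [ ! -s "$pid_file" ] && [ "$tries" -lt 50 ]; do
  sleep 0.1
  tries=$(( tries + 1 ))
done
child_pid=$(cat "$pid_file")

# node_pid and child_pid are the same process, so this signal reaches
# the real command directly instead of an intermediate shell.
kill -TERM "$node_pid" 2>/dev/null || true
wait "$node_pid" 2>/dev/null || true
rm -f "$pid_file"
```

Without the outer `exec`, `$!` would refer to the subshell, and terminating it could leave the actual command running.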

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
docker/Dockerfile.debug (1)

25-35: ⚠️ Potential issue | 🟡 Minor

mold is installed but never used — ENV RUSTFLAGS overrides config-level rustflags.

ENV RUSTFLAGS="-C target-cpu=native" (line 34) takes precedence over all rustflags entries in .cargo/config.toml, including the target-specific -C link-arg=-fuse-ld=mold. The linker = "clang" config key still applies, but without -fuse-ld=mold, clang will use the default linker — so mold is installed but unused.

Either remove mold from the install list (since this is a debug Dockerfile), or include the mold flag in the ENV:

-ENV RUSTFLAGS="-C target-cpu=native"
+ENV RUSTFLAGS="-C target-cpu=native -C link-arg=-fuse-ld=mold"
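
If the flags were assembled in a build script rather than hardcoded in the Dockerfile, the same idea could be expressed as below. This is a sketch under that assumption: `ensure_flag` is a hypothetical helper that appends a flag to an existing RUSTFLAGS value only when it is not already present.

```shell
# Hypothetical helper: prints $1 (current flags) with $2 appended
# unless $2 is already part of it.
ensure_flag() {
  case " $1 " in
    *" $2 "*) printf '%s\n' "$1" ;;
    *)        printf '%s\n' "${1:+$1 }$2" ;;
  esac
}

RUSTFLAGS=$(ensure_flag "${RUSTFLAGS:-}" "-C target-cpu=native")
RUSTFLAGS=$(ensure_flag "$RUSTFLAGS" "-C link-arg=-fuse-ld=mold")
export RUSTFLAGS
```

Because the environment variable replaces the config-level rustflags entirely, every flag that should apply has to be present in it.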
🤖 Fix all issues with AI agents
In @.cargo/config.toml:
- Around line 6-8: The target-specific section [target.x86_64-unknown-linux-gnu]
currently sets rustflags and overrides the global rustflags (dropping -C
target-cpu=native); update the rustflags in that section (and keep linker =
"clang") to include both flags so they are applied when building for
x86_64-unknown-linux-gnu — e.g., merge the global -C target-cpu=native into the
target's rustflags along with -C link-arg=-fuse-ld=mold so rustflags in
target.x86_64-unknown-linux-gnu contains both options.

In `@docker/agent_cluster/agent_cluster.sh`:
- Around line 485-490: cmd_status currently calls jq directly to parse info_json
into height, bi_height, peers, mining, syncing which will fail if jq is missing;
modify cmd_status to first check for jq availability (reuse the existing
have_jq() helper used in start_peer_with_auto_restart.sh) and if jq is absent,
fall back to the grep/sed extraction logic used there to populate height,
bi_height, peers, mining, and syncing, or print a clear error advising to
install jq before continuing; update the parsing branches where height,
bi_height, peers, mining, and syncing are assigned so both the jq and fallback
paths set the same variables.
- Around line 554-619: In cmd_hotdeploy, after each docker cp (both in the "all"
loop and the single-node branch), ensure the deployed binary has the expected
ownership and execute permissions by running a post-copy docker exec that sets
owner to the irys user and fixes mode (e.g., chown irys:irys /app/irys && chmod
755 /app/irys); update the "copy" phase (the loop that calls docker cp "$binary"
"${node}:/app/irys") and the single-node copy before docker restart to run this
docker exec and keep wait_for_node usage unchanged.

In `@docker/agent_cluster/docker-compose.yaml`:
- Around line 22-23: The docker compose fragment grants the NET_ADMIN capability
(cap_add: - NET_ADMIN) to containers which is overly permissive for a node
process; remove NET_ADMIN from cap_add by default and either make it opt-in via
an environment flag (e.g., ENABLE_NET_ADMIN) that conditionally adds the
capability or add an in-file comment next to the cap_add block explaining the
exact debugging tools that require NET_ADMIN and why it must be enabled; update
the docker-compose service definition where cap_add and NET_ADMIN appear to
implement the conditional enablement or the explanatory comment accordingly.

In `@docker/agent_cluster/start_peer_with_auto_restart.sh`:
- Around line 40-51: The jq pipeline in extract_height is overly complex and
fragile; replace the expression jq -r '(.height // .height? // "0") | tostring |
capture("(?<n>\\d+)")?.n // "0"' with a simpler, robust form that handles
numeric or string heights, for example jq -r '.height // 0 | tostring' (keep the
surrounding 2>/dev/null || echo "0" fallback). Update the extract_height
function to use this simpler jq expression so it returns a stable string height
for both number and string JSON values.
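
For the jq-free fallback path mentioned for cmd_status, a sed-based height extraction could look like this. It is a sketch, not the script's existing fallback: `extract_height_fallback` is a hypothetical name, and it assumes the JSON arrives on a single line with a top-level "height" key whose value is a number or a quoted number.

```shell
# Hypothetical jq-free fallback: reads JSON on stdin, prints the height,
# or 0 when no numeric "height" field is found.
extract_height_fallback() {
  h=$(sed -n 's/.*"height"[[:space:]]*:[[:space:]]*"\{0,1\}\([0-9]\{1,\}\).*/\1/p' | head -n 1)
  printf '%s\n' "${h:-0}"
}

echo '{"height": 42, "peers": 2}' | extract_height_fallback
```

A regex-based parser like this is fragile compared with jq, so it is best kept as a fallback and paired with the same default of 0 used on the jq path.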

Comment thread .cargo/config.toml Outdated
Comment thread docker/agent_cluster/agent_cluster.sh
Comment thread docker/agent_cluster/agent_cluster.sh
Comment thread docker/agent_cluster/docker-compose.yaml
Comment thread docker/agent_cluster/start_peer_with_auto_restart.sh
Adds fingerprint diagnostics for debugging incremental rebuild issues
and pins OpenSSL env vars to prevent cargo:rerun-if-env-changed triggers
across ephemeral container runs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@docker/agent_cluster/agent_cluster.sh`:
- Line 534: The grep invocation in the pipeline that builds the variable matches
(matches=$(docker logs "$node" 2>&1 | grep -i "$filter" | tail -20)) treats
patterns starting with "-" as options; change the grep call to use the option
separator so it becomes grep -i -- "$filter" to force the pattern to be
interpreted as an argument, and apply the same change to the other grep instance
referenced at line 548; look for the matches variable assignment and any similar
pipelines using docker logs, node, and filter and insert the "--" after the grep
options.
- Around line 62-65: The BUILD_MEMORY calculation currently uses local_mem from
`free -g`, which truncates small machines to 0 and then falls back to an unsafe
"14g"; fix by reading memory in megabytes (use `free -m` or /proc/meminfo) into
a variable (replace `local_mem` usage with a mem_mb value), set a sensible
minimum (e.g., 512 MB) and compute BUILD_MEMORY as the larger of min_sane_value
and floor(mem_mb * 3 / 4) and append an "m" suffix; keep `CARGO_JOBS` unchanged
but ensure `BUILD_MEMORY` uses the derived mem_mb-based value instead of the 14g
hardcode so low-memory VMs/CI get a safe capped allocation.
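
The `grep -- "$filter"` change from the first item can be illustrated with a small sketch. `filter_log_lines` is a hypothetical helper (the script inlines this pipeline), and `printf` stands in for `docker logs "$node" 2>&1`.

```shell
#!/usr/bin/env bash
# Hypothetical helper for illustration; agent_cluster.sh inlines this pipeline.
filter_log_lines() {
    local filter="$1"
    # '--' ends option parsing, so a filter such as '-v' is taken as the pattern
    # instead of being interpreted as grep's invert-match option.
    grep -i -- "$filter" | tail -20
}

# Stand-in for: docker logs "$node" 2>&1
printf '%s\n' 'node started' 'flag -v seen' 'ERROR: -V rejected' | filter_log_lines '-v'
```

Without the `--`, the same call would hand `-v` to grep as an option and leave it with no pattern at all.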

Comment on lines +62 to +65
CARGO_JOBS="${CARGO_JOBS:-$(( $(nproc 2>/dev/null || echo 4) / 2 ))}"
local_mem=$(free -g 2>/dev/null | awk '/Mem:/{print $2}' || echo 0)
BUILD_MEMORY="${BUILD_MEMORY:-$(( local_mem * 3 / 4 > 0 ? local_mem * 3 / 4 : 14 ))g}"
unset local_mem

@coderabbitai coderabbitai Bot Feb 13, 2026

Contributor

⚠️ Potential issue | 🟠 Major

BUILD_MEMORY fallback to 14g can OOM low-memory machines.

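When free -g reports 0 or 1 (roughly, machines with under 2 GiB of RAM), local_mem * 3 / 4 truncates to 0 in integer arithmetic; the same happens when free is unavailable and local_mem evaluates to 0. In either case the ternary falls through to the 14g default, which can over-commit a small VM or CI runner. A safer fallback would be capped at the memory that is actually available.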

Proposed fix
     CARGO_JOBS="${CARGO_JOBS:-$(( $(nproc 2>/dev/null || echo 4) / 2 ))}"
-    local_mem=$(free -g 2>/dev/null | awk '/Mem:/{print $2}' || echo 0)
-    BUILD_MEMORY="${BUILD_MEMORY:-$(( local_mem * 3 / 4 > 0 ? local_mem * 3 / 4 : 14 ))g}"
+    local_mem=$(free -g 2>/dev/null | awk '/Mem:/{print $2}' || echo 0)
+    if (( local_mem <= 1 )); then
+        # free -g truncates, so 0 could mean ≤1 GiB or `free` not found.
+        # Use free -m for a more accurate picture; final fallback 4g.
+        local_mem_mb=$(free -m 2>/dev/null | awk '/Mem:/{print $2}' || echo 0)
+        BUILD_MEMORY="${BUILD_MEMORY:-$(( local_mem_mb * 3 / 4 > 0 ? local_mem_mb * 3 / 4 : 4096 ))m}"
+    else
+        BUILD_MEMORY="${BUILD_MEMORY:-$(( local_mem * 3 / 4 ))g}"
+    fi
     unset local_mem
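
A megabyte-based variant of this calculation could look like the sketch below. The function name `build_memory_from_mb` and the 512 MB floor are illustrative assumptions, not code from `agent_cluster.sh`; the 3/4 ratio and the `free -m | awk` probe mirror the snippet under review.

```shell
#!/usr/bin/env bash
# Sketch under assumptions: build_memory_from_mb and the 512 MB floor are
# illustrative, not taken from agent_cluster.sh.
build_memory_from_mb() {
    local mem_mb="${1:-0}" min_mb="${2:-512}"
    local want=$(( mem_mb * 3 / 4 ))   # 75% of total, integer division
    if (( want < min_mb )); then
        want=$min_mb                   # floor for tiny or undetected totals
    fi
    printf '%sm\n' "$want"
}

# Total memory in MB; empty when `free` is missing.
mem_mb=$(free -m 2>/dev/null | awk '/Mem:/{print $2}')
BUILD_MEMORY="${BUILD_MEMORY:-$(build_memory_from_mb "${mem_mb:-0}")}"
echo "BUILD_MEMORY=$BUILD_MEMORY"
```

On a host without `free`, the result is simply the floor, so the floor value would need to be chosen with the expected build workload in mind.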

Collaborator Author

free doesn't exist on macOS (the primary target for this tool), so the fallback always triggers. Docker Desktop has its own memory cap that prevents actual OOM regardless of what we request. Low risk for a local dev tool.

Contributor

@DanMacDonald, understood—thanks for the context! Since Docker Desktop's resource limits provide a safety net and this is primarily targeting macOS developers, the 14g fallback is acceptable for a local dev tool.


✏️ Learnings added
Learnt from: DanMacDonald
Repo: Irys-xyz/irys PR: 1134
File: docker/agent_cluster/agent_cluster.sh:62-65
Timestamp: 2026-02-14T01:14:06.168Z
Learning: The `docker/agent_cluster/agent_cluster.sh` script is primarily designed for macOS developers using Docker Desktop, which provides its own memory caps to prevent OOM, making the 14g BUILD_MEMORY fallback acceptable for this local development tool.


Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In @.cargo/config.toml:
- Around line 6-8: The target-specific linker in .cargo/config.toml (the
[target.x86_64-unknown-linux-gnu] section) forces "clang" with
link-arg=-fuse-ld=mold but mold is not documented or installed and conflicts
with the devcontainer's ~/.cargo/config.toml which uses lld; fix by either 1)
adding mold to project docs (SETUP.md/README.md) and installing it in
.devcontainer/setup.sh, 2) moving the target.x86_64-unknown-linux-gnu linker
config out of repo-root .cargo/config.toml into a Docker-scoped or
devcontainer-only cargo config and documenting why, or 3) switching the linker
flag to a universally-available linker such as lld (make the change in
.cargo/config.toml and ensure .devcontainer/setup.sh and any user-scoped
~/.cargo/config.toml remain consistent) so local and devcontainer builds are not
broken.

In `@docker/Dockerfile.debug`:
- Line 34: The Dockerfile sets ENV RUSTFLAGS="-C target-cpu=native -C
link-arg=-fuse-ld=mold" which couples the built binary to the build host and
risks SIGILL on other machines; change the RUSTFLAGS to a portable value (e.g.,
replace target-cpu=native with target-cpu=x86-64) or add a clear comment above
the ENV line stating this image is local-only and must not be distributed, and
optionally remove or reconcile the redundant .cargo/config.toml flags brought in
by COPY . . so you don’t have duplicate/conflicting rustflags.

@DanMacDonald DanMacDonald merged commit eccaca4 into master Feb 14, 2026
25 of 26 checks passed
@DanMacDonald DanMacDonald deleted the dmac/docker_cluster_tooling branch February 14, 2026 17:54