runner-cleanup: hourly disk + memory maintenance for self-hosted runner hosts #913
Walkthrough
Adds a runner-cleanup subsystem: README and config declaring KEEP_IMAGES and KEEP_RUNNERS_WORK; systemd service/timer units and installer wiring to enable them; and a bash script. The script acquires a flock, reaps stuck Runner.Worker processes, classifies actions-runner-* users as busy or idle, wipes idle users' $HOME/_work/ (unless whitelisted), builds a Docker keep-set and removes/prunes images/networks/builders, defers docker volume prune while runners are busy, and performs additional system housekeeping (apt/journal/log truncation).
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks: ✅ 5 passed
Actionable comments posted: 2
🧹 Nitpick comments (1)
tools/modules/system/runner-cleanup/README.md (1)
63-75: ⚡ Quick win: Add language identifiers to fenced code blocks to satisfy markdown lint.
Line 63 and Line 69 use fenced blocks without a language, which triggers MD040. Add `text` (or `sh` where applicable) to keep docs lint-clean.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/modules/system/runner-cleanup/README.md` around lines 63 - 75, Two fenced code blocks in the README containing the snippets "runner-cleanup: no actions-runner-* users on this host — nothing to do." and the multiline example starting with "wipe /home/actions-runner-01/_work/" lack language identifiers and trigger MD040; fix by adding a language tag (use text for plain output and sh for shell commands) to those fenced blocks so they become e.g. ```text and ```sh respectively to satisfy the markdown linter.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tools/modules/system/runner-cleanup/runner-cleanup`:
- Line 43: The case arm handling -c|--config) currently dereferences $2
directly; guard against a missing argument by checking that the next positional
exists before assigning to CONFIG: inside the -c|--config) branch test whether
"${2:-}" (or $# >= 2) is set, and if not call the script's usage/failure path
and exit with a non-zero status, otherwise set CONFIG="$2" and shift 2; update
the -c|--config) branch accordingly to avoid an unbound-variable error under set
-u.
In `@tools/modules/system/runner-cleanup/runner-cleanup.service`:
- Line 3: The Documentation= line currently points to a misspelled or incorrect
URL; update the value of the Documentation= field so it points to the correct
Armbian configng repository URL (replace the existing
"https://github.com/armbian/configng" value with the verified, correctly spelled
repository URL) and confirm the link resolves in a browser; look for and edit
the line beginning with "Documentation=" to make this change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 2fd75c1b-0012-497c-a95b-3aa6725dddbd
📒 Files selected for processing (5)
tools/modules/system/runner-cleanup/README.md
tools/modules/system/runner-cleanup/runner-cleanup
tools/modules/system/runner-cleanup/runner-cleanup.conf
tools/modules/system/runner-cleanup/runner-cleanup.service
tools/modules/system/runner-cleanup/runner-cleanup.timer
♻️ Duplicate comments (1)
tools/modules/system/runner-cleanup/runner-cleanup (1)
53-53: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Guard `--config` against missing path argument.
With `set -u` enabled on line 33, `runner-cleanup --config` (without a path) will fail with an unbound variable error instead of showing a clean usage message.
Proposed fix:
- -c|--config) CONFIG="$2"; shift 2 ;;
+ -c|--config)
+   [[ $# -ge 2 ]] || { echo "Missing value for $1" >&2; usage >&2; exit 2; }
+   CONFIG="$2"; shift 2 ;;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/modules/system/runner-cleanup/runner-cleanup` at line 53, The --config option handler currently assigns CONFIG="$2" without validating the presence of a following argument, which under set -u causes an unbound variable error; update the case branch for '-c|--config)' in runner-cleanup to verify that "$2" exists and is not another option (e.g. empty or starts with '-') before assigning and shifting, and if the check fails call the existing usage/error flow (print usage and exit non-zero). Keep the change confined to the '-c|--config)' branch and use the same usage/error helper the script uses elsewhere so behavior is consistent with other option validations.
🧹 Nitpick comments (1)
tools/modules/system/runner-cleanup/README.md (1)
97-109: ⚡ Quick win: Add language specifier to fenced code block.
The code block showing example output should specify a language (or use `text`) for proper rendering and accessibility.
Proposed fix: tag both example-output blocks in the README as `text`. The first contains "runner-cleanup: no actions-runner-* users on this host — nothing to do."; the second (runners present and idle) contains:
wipe /home/actions-runner-01/_work/
keep image ghcr.io/armbian/builder:latest
rm image ubuntu:24.04 (sha256:...)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/modules/system/runner-cleanup/README.md` around lines 97 - 109, the fenced code blocks in the runner-cleanup README example do not include a language specifier; update both blocks that show output (the one containing "runner-cleanup: no actions-runner-* users on this host — nothing to do." and the subsequent block showing "wipe /home/actions-runner-01/_work/" etc.) to use a language tag like "text" so the examples render accessibly as plain text.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 0aa9b43e-2e7f-420f-8854-693dc134d490
📥 Commits
Reviewing files that changed from the base of the PR and between fd6521e4673951fde4b7533522c7bad38117283b and 97078222a47adfd36f745f602bc0ebe35f0a7fad.
📒 Files selected for processing (3)
tools/modules/system/runner-cleanup/README.md
tools/modules/system/runner-cleanup/runner-cleanup
tools/modules/system/runner-cleanup/runner-cleanup.conf
🚧 Files skipped from review as they are similar to previous changes (1)
tools/modules/system/runner-cleanup/runner-cleanup.conf
Actionable comments posted: 7
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tools/modules/system/module_armbian_runners.sh`:
- Line 171: The script currently silences failures of "systemctl enable --now
runner-cleanup.timer", which can hide a failed timer install; change this to run
the command without redirecting output, capture its exit code, and if non-zero
log a clear error (including command stderr) and attempt a safe fallback: try
"systemctl enable runner-cleanup.timer" then "systemctl start
runner-cleanup.timer", checking and logging each exit status; reference the
invocation "systemctl enable --now runner-cleanup.timer" in the fix and ensure
any error messages are emitted instead of using ">/dev/null 2>&1 || true".
In `@tools/modules/system/runner-cleanup/README.md`:
- Line 113: The README has fenced code blocks missing language identifiers;
update the two blocks containing the messages "runner-cleanup: no
actions-runner-* users on this host — nothing to do." and the block starting
"wipe /home/actions-runner-01/_work/" to use a language tag (e.g., change the
opening ``` to ```text) so markdown linters correctly detect the code block
language.
In `@tools/modules/system/runner-cleanup/runner-cleanup`:
- Around line 143-145: The early exit when runner_users is empty (the if block
checking ${#runner_users[@]} and calling note "runner-cleanup: no
actions-runner-* users..." then exit 0) prevents the later housekeeping steps
(section 4: journald/APT/log cleanup) from running; change the flow so that when
runner_users is empty you still perform the housekeeping: remove or relocate the
exit 0 and ensure the script continues to execute section 4 after logging the
"no actions-runner-* users" note (leave the note call intact but do not return
early), or alternatively call the housekeeping function/section explicitly
before exiting.
- Around line 231-237: The block that currently wipes "${work}" when "$unit" is
empty is unsafe; instead of running "find \"$work\" -mindepth 1 -delete" when no
systemd unit is found, change it to skip the wipe and emit a clear log like the
existing "skipping wipe to avoid mid-job damage" message (referencing the
variables unit, work, and user) so the script avoids deleting _work if the
runner cannot be quiesced; keep the early "log" message but replace the
destructive "run find ... -delete" with a skip-and-continue path that mirrors
the safe behavior used elsewhere.
- Around line 253-258: The current check using command -v docker only verifies
the CLI exists but not the daemon; update the logic so that before attempting
any container/image cleanup you try a daemon call (e.g., docker info) and if
that call fails treat Docker as unavailable: set goto_housekeeping and log a
skip rather than letting the script abort under set -euo pipefail. Apply the
same pattern around the docker cleanup blocks that call docker container prune
and docker image prune so those commands are executed only when the daemon check
succeeds; reference and update the conditional that currently uses command -v
docker, the goto_housekeeping variable, and the docker container prune / docker
image prune invocations.
- Around line 241-244: The cleanup currently exits if run find "$work" -mindepth
1 -delete fails (due to set -euo pipefail), preventing the later run systemctl
start "${unit}.service" from executing; change the flow so the restart is always
attempted: run systemctl stop "${unit}.service", then perform the wipe but do
not let failures abort the script (e.g. run the find with its failure masked or
capture its exit status), record any wipe error and continue, and finally always
attempt run systemctl start "${unit}.service" and call note "WARNING: failed to
restart ${unit}.service..." only if the restart itself fails. Ensure you
reference and modify the existing run invocation for find "$work" -mindepth 1
-delete and the subsequent run systemctl start "${unit}.service" and note call.
In `@tools/modules/system/runner-cleanup/runner-cleanup.timer`:
- Line 2: The timer unit runner-cleanup.timer has Description="Daily
runner-cleanup" but is scheduled with OnCalendar=hourly; update the Description in
runner-cleanup.timer (the Description field) to reflect the actual schedule
(e.g., "Hourly runner-cleanup" or "Runs hourly: runner-cleanup") so the
systemctl list-timers output is accurate; ensure any other similar unit with the
same mismatch (noted as 19-19) is updated the same way.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 48fac597-2195-441f-9913-2b1a3cc4bef4
📒 Files selected for processing (5)
tools/modules/system/module_armbian_runners.sh
tools/modules/system/runner-cleanup/README.md
tools/modules/system/runner-cleanup/runner-cleanup
tools/modules/system/runner-cleanup/runner-cleanup.conf
tools/modules/system/runner-cleanup/runner-cleanup.timer
…er hosts
Adds tools/modules/system/runner-cleanup/ (script, config, systemd
service + timer, README) and wires installation into
module_armbian_runners install so a fresh runner host gets the
cleanup helper automatically.
Sections, in order of execution per pass:
1. Reap stuck Runner.Worker processes. Any Worker whose etime
exceeds RUNNER_JOB_MAX_HOURS (default 4h) is SIGTERM'd, given
5s grace, then SIGKILL'd. Targets the failure mode where a
Worker deadlocks on artifact upload / network black hole and
silently holds GBs of RAM while the Listener reports the job
as 'still running'. Listener requeues. Runs first so a
freshly-reaped runner is correctly classified as idle below.
2. Per-runner _work wipe. For each idle runner NOT in
KEEP_RUNNERS_WORK (default 'actions-runner-01' — the alfa-tier
primary in module_armbian_runners convention; preserves kernel
ccache / toolchain caches):
systemctl stop <unit> → find _work -mindepth 1 -delete → systemctl start <unit>
The stop closes the race where the Listener could accept a new
job mid-wipe. Unit name discovered by grepping User= in
/etc/systemd/system/actions.runner.*.service. Busy runners are
left untouched. EXIT trap restarts any unit we stopped if the
script gets killed before reaching the start.
3. Docker prune. Per-operation safety instead of an all-or-nothing
fleet gate:
- docker image rm for images not in KEEP_IMAGES and not used
by any container (running or stopped). KEEP_IMAGES defaults
to lscr.io/linuxserver/swag:latest and lscr.io/linuxserver/
openssh-server:latest (configng-managed co-tenants).
- docker container/image/network/builder prune: always run.
Docker enforces its own in-use checks (container prune only
touches stopped, image prune only dangling, etc).
- docker volume prune -a: gated on full-fleet idle. A volume
between job steps can briefly look unused; -a would yank it.
Defers to next pass otherwise.
Gated on `command -v docker` AND `docker info` (CLI present and
daemon reachable); falls through to housekeeping on failure.
4. System housekeeping (always runs, independent of runner state):
- apt-get clean (/var/cache/apt/archives/*.deb)
- journalctl --vacuum-size=$JOURNAL_MAX_SIZE (default 500M)
- prune ~/_diag/{Runner,Worker}_*.log older than
$DIAG_LOG_KEEP_DAYS (default 14)
- truncate -s 0 on /var/lib/docker/containers/*/*-json.log
larger than $DOCKER_LOG_MAX_MB (default 512); safe mid-flight
because Docker keeps writing to the same fd
- opt-in: apt-get autoremove --purge -y via RUN_APT_AUTOREMOVE=1
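
The reaping step in section 1 can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the script's actual code: the function names `stale_worker_pids` and `reap_pid` are hypothetical, and it assumes process ages are read with `ps -eo pid=,etimes=,comm=` (procps), which the PR text does not confirm.

```bash
# Hypothetical helper: print PIDs of Runner.Worker processes older than a cap.
# Reads "PID ELAPSED_SECONDS COMMAND" lines on stdin, for example from:
#   ps -eo pid=,etimes=,comm=
stale_worker_pids() {
    local max_hours=$1
    awk -v max_s="$(( max_hours * 3600 ))" \
        '$3 == "Runner.Worker" && $2 > max_s { print $1 }'
}

# Hypothetical reaper: SIGTERM, wait for a grace period, then SIGKILL
# the process if it is still around.
reap_pid() {
    local pid=$1 grace=${2:-5}
    kill -TERM "$pid" 2>/dev/null || return 0   # already gone
    sleep "$grace"
    if kill -0 "$pid" 2>/dev/null; then
        kill -KILL "$pid" 2>/dev/null || true
    fi
}
```

A caller might combine the two as `ps -eo pid=,etimes=,comm= | stale_worker_pids "$RUNNER_JOB_MAX_HOURS" | while read -r pid; do reap_pid "$pid"; done`. Keeping the selection separate from the kill makes the threshold logic easy to check against canned `ps` output.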
Concurrency + watchdog:
- flock on /var/lock/runner-cleanup.lock (-n non-blocking; if held,
exit 0 cleanly). Belt-and-suspenders to systemd's natural
same-Type=oneshot-can't-overlap behaviour; also catches the
operator running ./runner-cleanup manually while a timer pass
is mid-flight.
- TimeoutStartSec=30min in the service unit (RuntimeMaxSec=
ignored on Type=oneshot per systemd-analyze; corrected to the
right knob). SIGTERM at cap → TimeoutStopSec=30s → SIGKILL.
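
The non-blocking lock described above might look something like this. It is a sketch assuming the util-linux `flock` utility; the function name `acquire_lock` and the choice of file descriptor 9 are illustrative, not taken from the script.

```bash
# Hypothetical helper: take an exclusive, non-blocking lock on a lock file.
# The lock is held for as long as fd 9 stays open, i.e. until the shell exits.
acquire_lock() {
    local lock_file=$1
    exec 9>"$lock_file"   # open (or create) the lock file on fd 9
    flock -n 9            # returns non-zero immediately if someone else holds it
}

# Intended use at the top of the script (path from the commit message):
#   acquire_lock /var/lock/runner-cleanup.lock || exit 0
```

Because the lock belongs to the open file description, it is released automatically when the process dies, so a killed pass cannot leave a stale lock behind.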
Defaults are baked into the script before the conf file is sourced,
so an outdated /etc/armbian/runner-cleanup.conf carried over from an
earlier install still picks up new defaults. A user who wants the
empty form can override with `KEEP_IMAGES=()` etc. in their conf.
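
The defaults-then-source ordering can be illustrated as follows. The variable names and default values come from this description; the wrapper function `load_config` and the `0` used to represent "off" for `RUN_APT_AUTOREMOVE` are assumptions made for the sketch.

```bash
# Hypothetical loader: set built-in defaults, then let the conf file override.
load_config() {
    local conf=$1
    # Defaults first, so an older conf that lacks a knob still gets a value.
    KEEP_IMAGES=(lscr.io/linuxserver/swag:latest lscr.io/linuxserver/openssh-server:latest)
    KEEP_RUNNERS_WORK=(actions-runner-01)
    RUNNER_JOB_MAX_HOURS=4
    JOURNAL_MAX_SIZE=500M
    DIAG_LOG_KEEP_DAYS=14
    DOCKER_LOG_MAX_MB=512
    RUN_APT_AUTOREMOVE=0   # assumed representation of "off by default"
    # Anything assigned in the conf replaces the default above.
    if [[ -r $conf ]]; then
        # shellcheck disable=SC1090
        source "$conf"
    fi
}
```

With this ordering, a conf containing only `KEEP_IMAGES=()` empties the keep-list while every other knob keeps its built-in value.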
Install path (module_armbian_runners install ends with):
- script + systemd units always overwritten (where fixes land)
- /etc/armbian/runner-cleanup.conf installed only if absent
(operator edits survive re-install); template always dropped
alongside as runner-cleanup.conf.dist for diff
- daemon-reload + enable --now timer; errors surfaced and retried
as separate enable + start so the diagnostic pinpoints which
half failed
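
A rough sketch of the install behaviour listed above, assuming coreutils `install` and `systemctl`. The function names `install_conf` and `enable_timer`, the file mode, and the log messages are hypothetical; the installer's real code may differ.

```bash
# Hypothetical helper: always refresh the .dist template, but only create
# the live conf when it does not exist yet (operator edits survive).
install_conf() {
    local template=$1 dest=$2
    install -m 0644 "$template" "${dest}.dist"
    if [[ ! -e $dest ]]; then
        install -m 0644 "$template" "$dest"
    fi
}

# Hypothetical helper: enable and start the timer, surfacing errors and
# retrying as two separate steps so the log shows which half failed.
enable_timer() {
    local timer=$1
    systemctl daemon-reload
    if ! systemctl enable --now "$timer"; then
        echo "enable --now $timer failed; retrying as separate steps" >&2
        systemctl enable "$timer" || echo "enable $timer failed" >&2
        systemctl start "$timer"  || echo "start $timer failed" >&2
    fi
}
```

After a re-install, `diff /etc/armbian/runner-cleanup.conf /etc/armbian/runner-cleanup.conf.dist` would then show any new defaults.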
Modes: --dry-run, --verbose, --config PATH (with guarded $2 access
so --config without an argument exits 2 cleanly under set -u).
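
The guarded option parsing can be sketched like this. The flag names and the default conf path are from this PR; the `parse_args` and `usage` functions, the `DRY_RUN`/`VERBOSE` variable names, and the usage text are assumptions. The sketch returns 2 from the function where the real script is described as exiting 2.

```bash
set -u   # unset variables are errors, which is why "$2" needs a guard

CONFIG=/etc/armbian/runner-cleanup.conf
DRY_RUN=0
VERBOSE=0

usage() { echo "usage: runner-cleanup [--dry-run] [--verbose] [--config PATH]"; }

# Hypothetical parser: returns 2 on bad usage instead of tripping set -u.
parse_args() {
    while [[ $# -gt 0 ]]; do
        case $1 in
            --dry-run) DRY_RUN=1; shift ;;
            --verbose) VERBOSE=1; shift ;;
            -c|--config)
                # Check the argument count before touching $2.
                if [[ $# -lt 2 ]]; then
                    echo "Missing value for $1" >&2
                    usage >&2
                    return 2
                fi
                CONFIG=$2; shift 2 ;;
            *) usage >&2; return 2 ;;
        esac
    done
}
```

A script would call it as `parse_args "$@" || exit $?`, so `runner-cleanup --config` with no path ends with a usage message and status 2 rather than an unbound-variable abort.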
Schedule: OnCalendar=hourly with RandomizedDelaySec=10min and
Persistent=true. Justified by the script's defensive design — most
runs no-op in well under a second; expensive paths only fire when
there's actual work. On hosts that often run out of space the higher
cadence catches container-log / journal / apt-cache accumulation
before it bites.
Host without runner users still runs the housekeeping section
(skips runner-scoped operations). Useful on any long-lived host;
runner-cleanup is the established entry point for the apt cache /
journal vacuum / docker-log truncate combination.
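
As one example from the housekeeping section, the docker-log truncation could be written along these lines. It assumes `find` and `truncate` from GNU findutils/coreutils; the function name `truncate_big_logs` and its argument order are hypothetical.

```bash
# Hypothetical helper: empty every <container>/<id>-json.log under ROOT that
# is larger than MAX_MB mebibytes. The file is truncated in place rather than
# deleted, so a writer holding it open keeps writing to the same file.
truncate_big_logs() {
    local root=$1 max_mb=$2
    find "$root" -mindepth 2 -maxdepth 2 -type f -name '*-json.log' \
        -size +"${max_mb}M" -exec truncate -s 0 {} +
}

# Intended use with the documented default (512):
#   truncate_big_logs /var/lib/docker/containers "$DOCKER_LOG_MAX_MB"
```

This part needs no runner users and no Docker CLI, only the log directory, which fits the claim that housekeeping is useful on any long-lived host.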
12546de to
749abee
Summary
Adds a small drop-in at `tools/modules/system/runner-cleanup/` that keeps self-hosted GitHub Actions runner hosts trim, and wires it into `module_armbian_runners install` so it lands automatically alongside the runners themselves.

Files (5):
- `runner-cleanup` → `/usr/local/sbin/runner-cleanup`
- `runner-cleanup.conf` → `/etc/armbian/runner-cleanup.conf`
- `runner-cleanup.service` → `/etc/systemd/system/runner-cleanup.service`
- `runner-cleanup.timer` → `/etc/systemd/system/runner-cleanup.timer`
- `README.md`

What it does, by section

1. Reap stuck `Runner.Worker` processes
Any `Runner.Worker` whose elapsed time exceeds `RUNNER_JOB_MAX_HOURS` (default 4h) is SIGTERM'd, given 5 s grace, then SIGKILL'd. Targets the well-known failure mode where a Worker deadlocks (artifact upload to a black hole, zombie subprocess) and silently holds several GB of RAM while the Listener reports the job as "still running". The Listener requeues the orphaned job onto another runner. Runs first so a freshly-reaped runner is correctly classified as idle below.

2. Per-runner `_work` wipe
For each idle runner NOT in `KEEP_RUNNERS_WORK` (default `actions-runner-01`, the alfa-tier primary in `module_armbian_runners` convention, used as a warm cache home for kernel ccache and toolchain downloads):
`systemctl stop <unit>` → `find _work -mindepth 1 -delete` → `systemctl start <unit>`
The stop closes the window where the Listener could accept a job mid-wipe. systemd's graceful stop waits for any race-in job to finish, so no live job is aborted. Unit name is discovered by grepping `User=` in `/etc/systemd/system/actions.runner.*.service`.

3. Docker prune (per-operation safety)
- `docker image rm` for images not in `KEEP_IMAGES` AND not used by any container.
- `docker container prune -f`
- `docker image prune -f`: dangling (`<none>:<none>`) only.
- `docker network prune -f`
- `docker builder prune -af`
- `docker volume prune -af`: gated on full-fleet idle. A volume between job steps can briefly look unused; `-a` would yank it. Defers to next pass otherwise.
`KEEP_IMAGES` defaults to `lscr.io/linuxserver/swag:latest` and `lscr.io/linuxserver/openssh-server:latest`, the two configng-managed services most commonly co-tenant on a runner box.

4. System housekeeping (always runs)
The long-tail disk hogs that don't depend on runner state:
- `apt-get clean`: `/var/cache/apt/archives/*.deb`.
- `journalctl --vacuum-size=$JOURNAL_MAX_SIZE` (default 500M): usually the single biggest free win.
- Prune `~/_diag/{Runner,Worker}_*.log` older than `$DIAG_LOG_KEEP_DAYS` (default 14).
- `truncate -s 0` on `/var/lib/docker/containers/*/*-json.log` larger than `$DOCKER_LOG_MAX_MB` (default 512): long-running containers like SWAG accumulate multi-GB log files; truncating mid-flight is safe.
- `apt-get autoremove --purge -y` (off by default; set `RUN_APT_AUTOREMOVE=1`).

Schedule
`OnCalendar=hourly` with `RandomizedDelaySec=10min` and `Persistent=true`. Justified by the script's defensive design: most runs no-op in well under a second; meaningful work only fires when there's something to clean. On a host that "often runs out of space" the higher cadence catches accumulation (container logs, journal, apt cache) before it bites.

Defaults baked into the script
`KEEP_IMAGES`, `KEEP_RUNNERS_WORK`, `RUNNER_JOB_MAX_HOURS`, and the housekeeping knobs are all set inside the script before the conf file is sourced. An outdated `/etc/armbian/runner-cleanup.conf` carried over from an earlier install therefore still benefits from any new defaults. A user who explicitly wants the empty form can still set `KEEP_RUNNERS_WORK=()` (etc.) in their conf to override.

Install behaviour
- `module_armbian_runners install` → runner-cleanup is bundled and the timer starts immediately.
- An existing `/etc/armbian/runner-cleanup.conf` is left untouched (operator edits survive); the template is always dropped as `runner-cleanup.conf.dist` so admins can `diff` for new defaults.

Test plan
- `./runner-cleanup --dry-run --verbose` prints "no actions-runner-* users — nothing to do" and exits 0.
- `wipe`, `stop`/`start`, `keep image`, `rm image`, and `[dry-run] docker container/image/network/builder/volume prune` lines.
- `_work` is touched; `docker volume prune -a` is skipped with a "still busy" message.
- `Runner.Worker`, fast-forward its etime, confirm reap.
- `module_armbian_runners install ...` (fresh) → confirm `systemctl status runner-cleanup.timer` shows enabled+active.
- `module_armbian_runners install ...` (re-install on a host with edited conf) → confirm conf untouched, `.dist` updated.

Commits (11)
`_work` wipe; `.dist` template drop; `Runner.Worker` processes (>4h); `module_armbian_runners`