
runner-cleanup: hourly disk + memory maintenance for self-hosted runner hosts #913

Merged
igorpecovnik merged 1 commit into main from add-runner-cleanup on May 13, 2026

igorpecovnik (Member) commented on May 13, 2026

Summary

Adds a small drop-in at tools/modules/system/runner-cleanup/ that keeps self-hosted GitHub Actions runner hosts trim — and wires it into module_armbian_runners install so it lands automatically alongside the runners themselves.

Files (5):

| File | Destination | Mode |
| --- | --- | --- |
| runner-cleanup | /usr/local/sbin/runner-cleanup | 0755 |
| runner-cleanup.conf | /etc/armbian/runner-cleanup.conf | 0644 |
| runner-cleanup.service | /etc/systemd/system/runner-cleanup.service | 0644 |
| runner-cleanup.timer | /etc/systemd/system/runner-cleanup.timer | 0644 |
| README.md | repo-side docs | |

What it does, by section

1. Reap stuck Runner.Worker processes

Any Runner.Worker whose elapsed time exceeds RUNNER_JOB_MAX_HOURS (default 4h) is SIGTERM'd, given 5 s grace, then SIGKILL'd. Targets the well-known failure mode where a Worker deadlocks (artifact upload to a black hole, zombie subprocess) and silently holds several GB of RAM while the Listener reports the job as "still running". Listener requeues the orphaned job onto another runner. Runs first so a freshly-reaped runner is correctly classified as idle below.
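The age check behind the reaper can be sketched as follows. This is a minimal illustration rather than the script's actual code: the helper names `etime_to_seconds`, `is_overdue`, and `reap_overdue_workers` are hypothetical, and locating Workers with `pgrep -x Runner.Worker` is an assumption about how the process appears in the process table.

```bash
#!/usr/bin/env bash
set -euo pipefail

RUNNER_JOB_MAX_HOURS="${RUNNER_JOB_MAX_HOURS:-4}"

# Convert a ps "etime" value ([[dd-]hh:]mm:ss) into seconds.
etime_to_seconds() {
    local etime="$1" days=0 h=0 m=0 s=0
    local -a parts
    if [[ "$etime" == *-* ]]; then
        days="${etime%%-*}"
        etime="${etime#*-}"
    fi
    IFS=: read -r -a parts <<< "$etime"
    case "${#parts[@]}" in
        3) h="${parts[0]}"; m="${parts[1]}"; s="${parts[2]}" ;;
        2) m="${parts[0]}"; s="${parts[1]}" ;;
        *) return 1 ;;
    esac
    # 10# forces base 10 so values such as "08" are not read as octal.
    echo $(( 10#$days * 86400 + 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# Succeed when the elapsed time is strictly greater than the limit.
is_overdue() {
    local seconds="$1" max_hours="${2:-$RUNNER_JOB_MAX_HOURS}"
    (( seconds > max_hours * 3600 ))
}

# SIGTERM each overdue Worker, wait up to 5 s, then SIGKILL if still alive.
# Hypothetical helper: the process match below is an assumption.
reap_overdue_workers() {
    local pid etime waited
    for pid in $(pgrep -x Runner.Worker || true); do
        etime="$(ps -o etime= -p "$pid" | tr -d ' ')" || continue
        is_overdue "$(etime_to_seconds "$etime")" || continue
        kill -TERM "$pid" 2>/dev/null || continue
        waited=0
        while (( waited < 5 )) && kill -0 "$pid" 2>/dev/null; do
            sleep 1
            waited=$(( waited + 1 ))
        done
        kill -KILL "$pid" 2>/dev/null || true
    done
}
```

The block only defines functions; nothing is killed until `reap_overdue_workers` is called. Keeping the time arithmetic in its own function makes the threshold easy to check without a real stuck Worker.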

2. Per-runner _work wipe

For each idle runner NOT in KEEP_RUNNERS_WORK (default actions-runner-01 — the alfa-tier primary in module_armbian_runners convention, used as a warm cache home for kernel ccache and toolchain downloads):

systemctl stop <unit>  →  find _work -mindepth 1 -delete  →  systemctl start <unit>

The stop closes the window where the Listener could accept a job mid-wipe. systemd's graceful stop waits for any race-in job to finish, so no live job is aborted. Unit name is discovered by grepping User= in /etc/systemd/system/actions.runner.*.service.
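A minimal sketch of the discovery and stop, wipe, start sequence described above. The function names `unit_for_user` and `wipe_work` are hypothetical, and the choice to attempt the restart even when the wipe fails is an assumption based on the review discussion, not a copy of the merged script.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Print the unit name (without ".service") whose file contains "User=<user>".
# Returns non-zero when no unit file matches.
unit_for_user() {
    local user="$1" dir="${2:-/etc/systemd/system}" file
    for file in "$dir"/actions.runner.*.service; do
        [[ -e "$file" ]] || continue
        if grep -qxF "User=${user}" "$file"; then
            basename "$file" .service
            return 0
        fi
    done
    return 1
}

# Stop the unit, empty the work directory, and always try to start the unit
# again. The wipe's exit status is returned after the restart attempt.
wipe_work() {
    local unit="$1" work="$2" rc=0
    systemctl stop "${unit}.service"
    find "$work" -mindepth 1 -delete || rc=$?
    systemctl start "${unit}.service" \
        || echo "WARNING: failed to restart ${unit}.service" >&2
    return "$rc"
}
```

A caller would typically skip the wipe when `unit_for_user` fails, since a runner that cannot be stopped cannot be protected from accepting a job mid-wipe. `-mindepth 1` keeps the `_work` directory itself in place and removes only its contents.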

3. Docker prune (per-operation safety)

| Step | Behaviour |
| --- | --- |
| docker image rm for images not in KEEP_IMAGES AND not used by any container | Always runs. Docker rejects in-use removals on its own. |
| docker container prune -f | Always. Stopped only. |
| docker image prune -f | Always. Dangling <none>:<none> only. |
| docker network prune -f | Always. Unused only. |
| docker builder prune -af | Always. Cold-cache slowdown but no breakage. |
| docker volume prune -af | Gated on full-fleet idle. A volume between two job steps can briefly look unused; -a would yank it. Defers to next pass otherwise. |

KEEP_IMAGES defaults to lscr.io/linuxserver/swag:latest and lscr.io/linuxserver/openssh-server:latest — the two configng-managed services most commonly co-tenant on a runner box.
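The keep-set filter and the volume-prune gate can be sketched like this. The helper names `is_kept`, `filter_removable`, and `volume_prune_allowed` are hypothetical, and the sketch omits the "not used by any container" check, relying on Docker rejecting in-use removals as the table notes.

```bash
#!/usr/bin/env bash
set -euo pipefail

KEEP_IMAGES=(
    lscr.io/linuxserver/swag:latest
    lscr.io/linuxserver/openssh-server:latest
)

# Succeed when the image reference is in KEEP_IMAGES.
is_kept() {
    local image="$1" keep
    for keep in "${KEEP_IMAGES[@]:-}"; do
        [[ "$image" == "$keep" ]] && return 0
    done
    return 1
}

# Read "repository:tag" lines on stdin; print those outside the keep-set.
filter_removable() {
    local image
    while IFS= read -r image; do
        [[ -n "$image" ]] || continue
        is_kept "$image" || printf '%s\n' "$image"
    done
}

# Volume prune is the only step gated on the whole fleet being idle.
volume_prune_allowed() {
    local busy_runners="$1"
    (( busy_runners == 0 ))
}
```

On a real host the filter could be fed from `docker image ls --format '{{.Repository}}:{{.Tag}}'`, with each printed reference passed to `docker image rm`. Separating the decision from the Docker calls keeps the policy testable on a machine without a Docker daemon.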

4. System housekeeping (always runs)

The long-tail disk hogs that don't depend on runner state:

  • apt-get clean (empties /var/cache/apt/archives/*.deb).
  • journalctl --vacuum-size=$JOURNAL_MAX_SIZE (default 500M) — usually the single biggest free win.
  • Prune ~/_diag/{Runner,Worker}_*.log older than $DIAG_LOG_KEEP_DAYS (default 14).
  • truncate -s 0 on /var/lib/docker/containers/*/*-json.log larger than $DOCKER_LOG_MAX_MB (default 512). Long-running containers like SWAG accumulate multi-GB log files; truncating mid-flight is safe because Docker keeps writing to the same file descriptor.
  • Opt-in: apt-get autoremove --purge -y (off by default; set RUN_APT_AUTOREMOVE=1).
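Two of the housekeeping steps above, sketched with `find`. The function names `truncate_large_logs` and `prune_diag_logs` are hypothetical, and the sketch assumes GNU findutils and coreutils; the real script may structure these steps differently.

```bash
#!/usr/bin/env bash
set -euo pipefail

DOCKER_LOG_MAX_MB="${DOCKER_LOG_MAX_MB:-512}"
DIAG_LOG_KEEP_DAYS="${DIAG_LOG_KEEP_DAYS:-14}"

# Empty every *-json.log under <dir> that is larger than <max_mb> MiB.
# The file is truncated in place, not deleted, so open writers keep working.
truncate_large_logs() {
    local dir="$1" max_mb="${2:-$DOCKER_LOG_MAX_MB}"
    [[ -d "$dir" ]] || return 0
    find "$dir" -type f -name '*-json.log' -size "+${max_mb}M" \
        -exec truncate -s 0 {} +
}

# Delete Runner_*.log and Worker_*.log files older than <days> days.
prune_diag_logs() {
    local dir="$1" days="${2:-$DIAG_LOG_KEEP_DAYS}"
    [[ -d "$dir" ]] || return 0
    find "$dir" -maxdepth 1 -type f \
        \( -name 'Runner_*.log' -o -name 'Worker_*.log' \) \
        -mtime "+${days}" -delete
}
```

For example, `truncate_large_logs /var/lib/docker/containers` and `prune_diag_logs "$HOME/_diag"` would apply the documented defaults. Both functions return quietly when the directory is missing, which suits a host that has no Docker or no runner installed.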

Schedule

OnCalendar=hourly with RandomizedDelaySec=10min and Persistent=true. Justified by the script's defensive design: most runs no-op in well under a second; meaningful work only fires when there's something to clean. On a host that "often runs out of space" the higher cadence catches accumulation (container logs, journal, apt cache) before it bites.

Defaults baked into the script

KEEP_IMAGES, KEEP_RUNNERS_WORK, RUNNER_JOB_MAX_HOURS, and the housekeeping knobs are all set inside the script before the conf file is sourced. An outdated /etc/armbian/runner-cleanup.conf carried over from an earlier install therefore still benefits from any new defaults. A user who explicitly wants the empty form can still set KEEP_RUNNERS_WORK=() (etc.) in their conf to override.
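The defaults-then-source ordering can be sketched as below. `load_config` is a hypothetical wrapper, and representing "off" as `RUN_APT_AUTOREMOVE=0` is an assumption; the variable names and default values otherwise follow the description above.

```bash
#!/usr/bin/env bash
set -euo pipefail

load_config() {
    local conf="$1"

    # Defaults first, so they apply even when the conf file predates them.
    KEEP_IMAGES=(
        lscr.io/linuxserver/swag:latest
        lscr.io/linuxserver/openssh-server:latest
    )
    KEEP_RUNNERS_WORK=(actions-runner-01)
    RUNNER_JOB_MAX_HOURS=4
    JOURNAL_MAX_SIZE=500M
    DIAG_LOG_KEEP_DAYS=14
    DOCKER_LOG_MAX_MB=512
    RUN_APT_AUTOREMOVE=0   # assumed representation of "off by default"

    # Anything the conf file assigns replaces the default above,
    # including an explicit empty array such as KEEP_RUNNERS_WORK=().
    if [[ -r "$conf" ]]; then
        # shellcheck source=/dev/null
        . "$conf"
    fi
}
```

Because the conf file is sourced last, a file that sets only one variable leaves every other default intact, and a missing file leaves all of them intact.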

Install behaviour

  • Fresh module_armbian_runners install → runner-cleanup is bundled and the timer starts immediately.
  • Re-install → script + systemd units are overwritten so fixes propagate; /etc/armbian/runner-cleanup.conf is left untouched (operator edits survive); template always dropped as runner-cleanup.conf.dist so admins can diff for new defaults.
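The conf-preserving part of the install can be sketched as follows. `install_conf` is a hypothetical helper name and the sketch assumes GNU `install` (for `-D`); it illustrates the behaviour described above, not the module's actual code.

```bash
#!/usr/bin/env bash
set -euo pipefail

install_conf() {
    local template="$1" dest="$2"
    # The template is always refreshed so admins can diff it for new defaults.
    install -D -m 0644 "$template" "${dest}.dist"
    # The live conf is written only when absent, so operator edits survive.
    if [[ ! -e "$dest" ]]; then
        install -D -m 0644 "$template" "$dest"
    fi
}
```

After a re-install, `diff /etc/armbian/runner-cleanup.conf /etc/armbian/runner-cleanup.conf.dist` would show which settings the shipped template has gained or changed.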

Test plan

  • On a non-runner host: ./runner-cleanup --dry-run --verbose prints "no actions-runner-* users — nothing to do" and exits 0.
  • On a runner host with all runners idle: dry-run lists wipe, stop/start, keep image, rm image, and [dry-run] docker container/image/network/builder/volume prune lines.
  • On a runner host with some runners busy: only idle runners' _work is touched; docker volume prune -a is skipped with a "still busy" message.
  • Reaper smoke: spawn a sleep-forever process named Runner.Worker, fast-forward its etime, confirm reap.
  • module_armbian_runners install ... (fresh) → confirm systemctl status runner-cleanup.timer shows enabled+active.
  • module_armbian_runners install ... (re-install on a host with edited conf) → confirm conf untouched, .dist updated.

Commits (11)

  • Initial scaffold (script + conf + service + timer + README)
  • KEEP_IMAGES defaults: SWAG, openssh-server
  • Per-operation safety (replaces original all-or-nothing fleet gate)
  • Stop systemd unit around _work wipe
  • KEEP_RUNNERS_WORK allowlist (defaults to actions-runner-01)
  • Bake defaults into the script, not just the conf
  • Preserve installed conf on re-install + .dist template drop
  • System-housekeeping section (apt/journal/diag/docker-logs)
  • Hourly timer instead of daily
  • Reap stuck Runner.Worker processes (>4h)
  • Bundle install with module_armbian_runners

github-actions Bot added the "size/large" label (PR with 250 lines or more) on May 13, 2026
coderabbitai Bot (Contributor) commented on May 13, 2026

Warning

Rate limit exceeded

@igorpecovnik has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 30 minutes and 41 seconds before requesting another review.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: df20571d-cc7e-46e9-9be8-7bfda13f462b

📥 Commits

Reviewing files that changed from the base of the PR and between d075857 and 749abee.

📒 Files selected for processing (6)
  • tools/modules/system/module_armbian_runners.sh
  • tools/modules/system/runner-cleanup/README.md
  • tools/modules/system/runner-cleanup/runner-cleanup
  • tools/modules/system/runner-cleanup/runner-cleanup.conf
  • tools/modules/system/runner-cleanup/runner-cleanup.service
  • tools/modules/system/runner-cleanup/runner-cleanup.timer

Walkthrough

Adds a runner-cleanup subsystem: README and config declaring KEEP_IMAGES and KEEP_RUNNERS_WORK; systemd service/timer units and installer wiring to enable them; and a bash script that acquires a flock, reaps stuck Runner.Worker processes, classifies actions-runner-* users as busy or idle, wipes idle users' $HOME/_work/ (unless whitelisted), builds a Docker keep-set and removes/prunes images/networks/builders, defers docker volume prune while runners are busy, and performs additional system housekeeping (apt/journal/log truncation).

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: adding an hourly disk and memory maintenance script for self-hosted runner hosts. |
| Description check | ✅ Passed | The description is comprehensive and directly related to the changeset, detailing each component, behavior, installation process, and test plan. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


github-actions Bot added the "Documentation" (Documentation changes or additions) and "05" (Milestone: Second quarter release) labels on May 13, 2026

coderabbitai Bot (Contributor) left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
tools/modules/system/runner-cleanup/README.md (1)

63-75: ⚡ Quick win

Add language identifiers to fenced code blocks to satisfy markdown lint.

Line 63 and Line 69 use fenced blocks without a language, which triggers MD040. Add text (or sh where applicable) to keep docs lint-clean.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/modules/system/runner-cleanup/README.md` around lines 63 - 75, Two
fenced code blocks in the README containing the snippets "runner-cleanup: no
actions-runner-* users on this host — nothing to do." and the multiline example
starting with "wipe /home/actions-runner-01/_work/" lack language identifiers
and trigger MD040; fix by adding a language tag (use text for plain output and
sh for shell commands) to those fenced blocks so they become e.g. ```text and
```sh respectively to satisfy the markdown linter.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/modules/system/runner-cleanup/runner-cleanup`:
- Line 43: The case arm handling -c|--config) currently dereferences $2
directly; guard against a missing argument by checking that the next positional
exists before assigning to CONFIG: inside the -c|--config) branch test whether
"${2:-}" (or $# >= 2) is set, and if not call the script's usage/failure path
and exit with a non-zero status, otherwise set CONFIG="$2" and shift 2; update
the -c|--config) branch accordingly to avoid an unbound-variable error under set
-u.

In `@tools/modules/system/runner-cleanup/runner-cleanup.service`:
- Line 3: The Documentation= line currently points to a misspelled or incorrect
URL; update the value of the Documentation= field so it points to the correct
Armbian configng repository URL (replace the existing
"https://github.com/armbian/configng" value with the verified, correctly spelled
repository URL) and confirm the link resolves in a browser; look for and edit
the line beginning with "Documentation=" to make this change.

---

Nitpick comments:
In `@tools/modules/system/runner-cleanup/README.md`:
- Around line 63-75: Two fenced code blocks in the README containing the
snippets "runner-cleanup: no actions-runner-* users on this host — nothing to
do." and the multiline example starting with "wipe
/home/actions-runner-01/_work/" lack language identifiers and trigger MD040; fix
by adding a language tag (use text for plain output and sh for shell commands)
to those fenced blocks so they become e.g. ```text and ```sh respectively to
satisfy the markdown linter.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2fd75c1b-0012-497c-a95b-3aa6725dddbd

📥 Commits

Reviewing files that changed from the base of the PR and between 8d92cf3 and fd6521e.

📒 Files selected for processing (5)
  • tools/modules/system/runner-cleanup/README.md
  • tools/modules/system/runner-cleanup/runner-cleanup
  • tools/modules/system/runner-cleanup/runner-cleanup.conf
  • tools/modules/system/runner-cleanup/runner-cleanup.service
  • tools/modules/system/runner-cleanup/runner-cleanup.timer

Comment thread tools/modules/system/runner-cleanup/runner-cleanup Outdated
Comment thread tools/modules/system/runner-cleanup/runner-cleanup.service

coderabbitai Bot (Contributor) left a comment

♻️ Duplicate comments (1)
tools/modules/system/runner-cleanup/runner-cleanup (1)

53-53: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Guard --config against missing path argument.

With set -u enabled on line 33, runner-cleanup --config (without a path) will fail with an unbound variable error instead of showing a clean usage message.

Proposed fix
-		-c|--config)  CONFIG="$2"; shift 2 ;;
+		-c|--config)
+			[[ $# -ge 2 ]] || { echo "Missing value for $1" >&2; usage >&2; exit 2; }
+			CONFIG="$2"; shift 2 ;;
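The guard in the proposed fix can be exercised in isolation with a small parser like the one below. `parse_args` is a hypothetical function, and returning 2 instead of calling the script's own `usage` helper is a simplification so the sketch stays self-contained; the option names and the default config path come from the PR description.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Parse the documented modes into globals. Returns 2 on a usage error.
parse_args() {
    CONFIG="/etc/armbian/runner-cleanup.conf"
    DRY_RUN=0
    VERBOSE=0
    while (( $# > 0 )); do
        case "$1" in
            -c|--config)
                # Check the argument count before touching $2, so set -u
                # never sees an unbound positional parameter.
                if (( $# < 2 )); then
                    echo "Missing value for $1" >&2
                    return 2
                fi
                CONFIG="$2"
                shift 2
                ;;
            --dry-run) DRY_RUN=1; shift ;;
            --verbose) VERBOSE=1; shift ;;
            *)
                echo "Unknown option: $1" >&2
                return 2
                ;;
        esac
    done
}
```

A script would call it as `parse_args "$@" || exit $?`, which turns `runner-cleanup --config` with no path into a clean exit status 2 instead of an unbound-variable abort.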
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/modules/system/runner-cleanup/runner-cleanup` at line 53, The --config
option handler currently assigns CONFIG="$2" without validating the presence of
a following argument, which under set -u causes an unbound variable error;
update the case branch for '-c|--config)' in runner-cleanup to verify that "$2"
exists and is not another option (e.g. empty or starts with '-') before
assigning and shifting, and if the check fails call the existing usage/error
flow (print usage and exit non-zero). Keep the change confined to the
'-c|--config)' branch and use the same usage/error helper the script uses
elsewhere so behavior is consistent with other option validations.
🧹 Nitpick comments (1)
tools/modules/system/runner-cleanup/README.md (1)

97-109: ⚡ Quick win

Add language specifier to fenced code block.

The code block showing example output should specify a language (or use text) for proper rendering and accessibility.

Proposed fix
-```
+```text
 runner-cleanup: no actions-runner-* users on this host — nothing to do.

 or, with runners present and idle:

-```
+```text
 wipe /home/actions-runner-01/_work/
 keep image ghcr.io/armbian/builder:latest
 rm image ubuntu:24.04 (sha256:...)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/modules/system/runner-cleanup/README.md` around lines 97 - 109, The
fenced code blocks in the runner-cleanup README example do not include a
language specifier; update both blocks that show output (the one containing
"runner-cleanup: no actions-runner-* users on this host — nothing to do." and
the subsequent block showing "wipe /home/actions-runner-01/_work/" etc.) to use
a language tag like "text" (e.g., change ``` to ```text) so the examples
render/accessibly as plain text.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@tools/modules/system/runner-cleanup/runner-cleanup`:
- Line 53: The --config option handler currently assigns CONFIG="$2" without
validating the presence of a following argument, which under set -u causes an
unbound variable error; update the case branch for '-c|--config)' in
runner-cleanup to verify that "$2" exists and is not another option (e.g. empty
or starts with '-') before assigning and shifting, and if the check fails call
the existing usage/error flow (print usage and exit non-zero). Keep the change
confined to the '-c|--config)' branch and use the same usage/error helper the
script uses elsewhere so behavior is consistent with other option validations.

---

Nitpick comments:
In `@tools/modules/system/runner-cleanup/README.md`:
- Around line 97-109: The fenced code blocks in the runner-cleanup README
example do not include a language specifier; update both blocks that show output
(the one containing "runner-cleanup: no actions-runner-* users on this host —
nothing to do." and the subsequent block showing "wipe
/home/actions-runner-01/_work/" etc.) to use a language tag like "text" (e.g.,
change ``` to ```text) so the examples render/accessibly as plain text.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0aa9b43e-2e7f-420f-8854-693dc134d490

📥 Commits

Reviewing files that changed from the base of the PR and between fd6521e4673951fde4b7533522c7bad38117283b and 97078222a47adfd36f745f602bc0ebe35f0a7fad.

📒 Files selected for processing (3)
  • tools/modules/system/runner-cleanup/README.md
  • tools/modules/system/runner-cleanup/runner-cleanup
  • tools/modules/system/runner-cleanup/runner-cleanup.conf

🚧 Files skipped from review as they are similar to previous changes (1)
  • tools/modules/system/runner-cleanup/runner-cleanup.conf

igorpecovnik changed the title from "runner-cleanup: maintenance script + systemd timer for runner hosts" to "runner-cleanup: hourly disk + memory maintenance for self-hosted runner hosts" on May 13, 2026

coderabbitai Bot (Contributor) left a comment

Actionable comments posted: 7

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/modules/system/module_armbian_runners.sh`:
- Line 171: The script currently silences failures of "systemctl enable --now
runner-cleanup.timer", which can hide a failed timer install; change this to run
the command without redirecting output, capture its exit code, and if non-zero
log a clear error (including command stderr) and attempt a safe fallback: try
"systemctl enable runner-cleanup.timer" then "systemctl start
runner-cleanup.timer", checking and logging each exit status; reference the
invocation "systemctl enable --now runner-cleanup.timer" in the fix and ensure
any error messages are emitted instead of using ">/dev/null 2>&1 || true".

In `@tools/modules/system/runner-cleanup/README.md`:
- Line 113: The README has fenced code blocks missing language identifiers;
update the two blocks containing the messages "runner-cleanup: no
actions-runner-* users on this host — nothing to do." and the block starting
"wipe /home/actions-runner-01/_work/" to use a language tag (e.g., change the
opening ``` to ```text) so markdown linters correctly detect the code block
language.

In `@tools/modules/system/runner-cleanup/runner-cleanup`:
- Around line 143-145: The early exit when runner_users is empty (the if block
checking ${#runner_users[@]} and calling note "runner-cleanup: no
actions-runner-* users..." then exit 0) prevents the later housekeeping steps
(section 4: journald/APT/log cleanup) from running; change the flow so that when
runner_users is empty you still perform the housekeeping: remove or relocate the
exit 0 and ensure the script continues to execute section 4 after logging the
"no actions-runner-* users" note (leave the note call intact but do not return
early), or alternatively call the housekeeping function/section explicitly
before exiting.
- Around line 231-237: The block that currently wipes "${work}" when "$unit" is
empty is unsafe; instead of running "find \"$work\" -mindepth 1 -delete" when no
systemd unit is found, change it to skip the wipe and emit a clear log like the
existing "skipping wipe to avoid mid-job damage" message (referencing the
variables unit, work, and user) so the script avoids deleting _work if the
runner cannot be quiesced; keep the early "log" message but replace the
destructive "run find ... -delete" with a skip-and-continue path that mirrors
the safe behavior used elsewhere.
- Around line 253-258: The current check using command -v docker only verifies
the CLI exists but not the daemon; update the logic so that before attempting
any container/image cleanup you try a daemon call (e.g., docker info) and if
that call fails treat Docker as unavailable: set goto_housekeeping and log a
skip rather than letting the script abort under set -euo pipefail. Apply the
same pattern around the docker cleanup blocks that call docker container prune
and docker image prune so those commands are executed only when the daemon check
succeeds; reference and update the conditional that currently uses command -v
docker, the goto_housekeeping variable, and the docker container prune / docker
image prune invocations.
- Around line 241-244: The cleanup currently exits if run find "$work" -mindepth
1 -delete fails (due to set -euo pipefail), preventing the later run systemctl
start "${unit}.service" from executing; change the flow so the restart is always
attempted: run systemctl stop "${unit}.service", then perform the wipe but do
not let failures abort the script (e.g. run the find with its failure masked or
capture its exit status), record any wipe error and continue, and finally always
attempt run systemctl start "${unit}.service" and call note "WARNING: failed to
restart ${unit}.service..." only if the restart itself fails. Ensure you
reference and modify the existing run invocation for find "$work" -mindepth 1
-delete and the subsequent run systemctl start "${unit}.service" and note call.

In `@tools/modules/system/runner-cleanup/runner-cleanup.timer`:
- Line 2: The timer unit runner-cleanup.timer has Description="Daily
runner-cleanup" but its OnCalendar=hourly schedule; update the Description in
runner-cleanup.timer (the Description field) to reflect the actual schedule
(e.g., "Hourly runner-cleanup" or "Runs hourly: runner-cleanup") so the
systemctl list-timers output is accurate; ensure any other similar unit with the
same mismatch (noted as 19-19) is updated the same way.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 48fac597-2195-441f-9913-2b1a3cc4bef4

📥 Commits

Reviewing files that changed from the base of the PR and between 9707822 and d075857.

📒 Files selected for processing (5)
  • tools/modules/system/module_armbian_runners.sh
  • tools/modules/system/runner-cleanup/README.md
  • tools/modules/system/runner-cleanup/runner-cleanup
  • tools/modules/system/runner-cleanup/runner-cleanup.conf
  • tools/modules/system/runner-cleanup/runner-cleanup.timer

Comment thread tools/modules/system/module_armbian_runners.sh Outdated
Comment thread tools/modules/system/runner-cleanup/README.md
Comment thread tools/modules/system/runner-cleanup/runner-cleanup Outdated
Comment thread tools/modules/system/runner-cleanup/runner-cleanup
Comment thread tools/modules/system/runner-cleanup/runner-cleanup
Comment thread tools/modules/system/runner-cleanup/runner-cleanup
Comment thread tools/modules/system/runner-cleanup/runner-cleanup.timer Outdated
…er hosts

Adds tools/modules/system/runner-cleanup/ (script, config, systemd
service + timer, README) and wires installation into
module_armbian_runners install so a fresh runner host gets the
cleanup helper automatically.

Sections, in order of execution per pass:

  1. Reap stuck Runner.Worker processes. Any Worker whose etime
     exceeds RUNNER_JOB_MAX_HOURS (default 4h) is SIGTERM'd, given
     5s grace, then SIGKILL'd. Targets the failure mode where a
     Worker deadlocks on artifact upload / network black hole and
     silently holds GBs of RAM while the Listener reports the job
     as 'still running'. Listener requeues. Runs first so a
     freshly-reaped runner is correctly classified as idle below.

  2. Per-runner _work wipe. For each idle runner NOT in
     KEEP_RUNNERS_WORK (default 'actions-runner-01' — the alfa-tier
     primary in module_armbian_runners convention; preserves kernel
     ccache / toolchain caches):
       systemctl stop <unit> → find _work -mindepth 1 -delete → systemctl start <unit>
     The stop closes the race where the Listener could accept a new
     job mid-wipe. Unit name discovered by grepping User= in
     /etc/systemd/system/actions.runner.*.service. Busy runners are
     left untouched. EXIT trap restarts any unit we stopped if the
     script gets killed before reaching the start.

  3. Docker prune. Per-operation safety instead of an all-or-nothing
     fleet gate:
       - docker image rm for images not in KEEP_IMAGES and not used
         by any container (running or stopped). KEEP_IMAGES defaults
         to lscr.io/linuxserver/swag:latest and lscr.io/linuxserver/
         openssh-server:latest (configng-managed co-tenants).
       - docker container/image/network/builder prune: always run.
         Docker enforces its own in-use checks (container prune only
         touches stopped, image prune only dangling, etc).
       - docker volume prune -a: gated on full-fleet idle. A volume
         between job steps can briefly look unused; -a would yank it.
         Defers to next pass otherwise.
     Gated on `command -v docker` AND `docker info` (CLI present and
     daemon reachable); falls through to housekeeping on failure.

  4. System housekeeping (always runs, independent of runner state):
       - apt-get clean (/var/cache/apt/archives/*.deb)
       - journalctl --vacuum-size=$JOURNAL_MAX_SIZE (default 500M)
       - prune ~/_diag/{Runner,Worker}_*.log older than
         $DIAG_LOG_KEEP_DAYS (default 14)
       - truncate -s 0 on /var/lib/docker/containers/*/*-json.log
         larger than $DOCKER_LOG_MAX_MB (default 512); safe mid-flight
         because Docker keeps writing to the same fd
       - opt-in: apt-get autoremove --purge -y via RUN_APT_AUTOREMOVE=1

Concurrency + watchdog:
  - flock on /var/lock/runner-cleanup.lock (-n non-blocking; if held,
    exit 0 cleanly). Belt-and-suspenders to systemd's natural
    same-Type=oneshot-can't-overlap behaviour; also catches the
    operator running ./runner-cleanup manually while a timer pass
    is mid-flight.
  - TimeoutStartSec=30min in the service unit (RuntimeMaxSec=
    ignored on Type=oneshot per systemd-analyze; corrected to the
    right knob). SIGTERM at cap → TimeoutStopSec=30s → SIGKILL.

Defaults are baked into the script before the conf file is sourced,
so an outdated /etc/armbian/runner-cleanup.conf carried over from an
earlier install still picks up new defaults. A user who wants the
empty form can override with `KEEP_IMAGES=()` etc. in their conf.

Install path (module_armbian_runners install ends with):
  - script + systemd units always overwritten (where fixes land)
  - /etc/armbian/runner-cleanup.conf installed only if absent
    (operator edits survive re-install); template always dropped
    alongside as runner-cleanup.conf.dist for diff
  - daemon-reload + enable --now timer; errors surfaced and retried
    as separate enable + start so the diagnostic pinpoints which
    half failed

Modes: --dry-run, --verbose, --config PATH (with guarded $2 access
so --config without an argument exits 2 cleanly under set -u).

Schedule: OnCalendar=hourly with RandomizedDelaySec=10min and
Persistent=true. Justified by the script's defensive design — most
runs no-op in well under a second; expensive paths only fire when
there's actual work. On hosts that often run out of space the higher
cadence catches container-log / journal / apt-cache accumulation
before it bites.

Host without runner users still runs the housekeeping section
(skips runner-scoped operations). Useful on any long-lived host;
runner-cleanup is the established entry point for the apt cache /
journal vacuum / docker-log truncate combination.
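The non-blocking lock described under "Concurrency + watchdog" in the commit message can be sketched like this. `acquire_lock` is a hypothetical helper, and the sketch assumes bash 4.1 or later (for automatic descriptor allocation) plus `flock` from util-linux.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Open <path> on a fresh descriptor and try to take an exclusive lock without
# blocking. The descriptor stays open, so the lock lasts until the shell exits.
acquire_lock() {
    local path="$1"
    exec {LOCK_FD}>"$path"
    flock -n "$LOCK_FD"
}
```

A script would use it near the top, for example `acquire_lock /var/lock/runner-cleanup.lock || exit 0`, so a manual run that collides with a timer pass exits cleanly instead of queueing behind it.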
igorpecovnik merged commit 5fb3f8f into main on May 13, 2026 (12 checks passed)
igorpecovnik deleted the add-runner-cleanup branch on May 13, 2026 at 08:27