
Add rescue mode, execution state tracking, and runtimes aggregate#932

Open
aliel wants to merge 13 commits into main from aliel-add-rescue-mode

Conversation

@aliel (Member) commented Apr 13, 2026

Rescue mode lets users boot a persistent instance from a minimal
rescue image while their original rootfs is attached as a secondary
drive for repair. The original rootfs is never moved or renamed.

POST /control/machine/{ref}/rescue - enter rescue mode
DELETE /control/machine/{ref}/rescue - exit and restore normal boot

Reinstall and erase are blocked while in rescue mode (409 conflict).

Execution state tracking:

  • New mode column on ExecutionRecord ("normal" or "rescue"), persisted across supervisor restarts.
  • New execution_events audit table records lifecycle events (created, stopped, reinstalled, rescue_entered, rescue_exited, etc.)
  • list_executions_v2 response now includes mode.

Runtimes aggregate:

  • Fetch and cache logic for the runtimes aggregate from the governance address, same pattern as the settings aggregate.
  • RuntimeEntry schema with id, name, type, item_hash, default fields.
  • Lookup helpers: get_runtime_by_id, get_default_runtime.
  • Rescue image hash is resolved from the aggregate (type "rescue") instead of a hardcoded config value.
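To make the lookup-helper behavior concrete, here is a minimal sketch of what get_runtime_by_id and get_default_runtime could look like over the aggregate's entries. The entry shape mirrors the RuntimeEntry fields listed above, but this is an illustration, not the PR's actual code.

```python
from typing import Optional

# Illustrative stand-in for the RuntimeEntry schema described above:
# a mapping with id, name, type, item_hash, and default fields.
RuntimeEntry = dict

def get_runtime_by_id(entries: list[RuntimeEntry], runtime_id: str) -> Optional[RuntimeEntry]:
    """Return the first entry whose id matches, or None if absent."""
    return next((e for e in entries if e.get("id") == runtime_id), None)

def get_default_runtime(entries: list[RuntimeEntry]) -> Optional[RuntimeEntry]:
    """Return the first entry flagged default=True, or None."""
    return next((e for e in entries if e.get("default")), None)
```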

@aliel aliel requested a review from foxpatch-aleph April 13, 2026 20:48
@aliel aliel marked this pull request as draft April 13, 2026 20:50
codecov bot commented Apr 14, 2026

Codecov Report

❌ Patch coverage is 82.42613% with 113 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.07%. Comparing base (72cb12d) to head (849ac97).
⚠️ Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
src/aleph/vm/orchestrator/views/operator.py 63.05% 51 Missing and 7 partials ⚠️
...ons/versions/0005_add_mode_and_execution_events.py 0.00% 24 Missing ⚠️
src/aleph/vm/orchestrator/utils.py 72.54% 14 Missing ⚠️
src/aleph/vm/controllers/qemu/instance.py 55.55% 2 Missing and 2 partials ⚠️
src/aleph/vm/pool.py 33.33% 3 Missing and 1 partial ⚠️
src/aleph/vm/storage.py 50.00% 3 Missing ⚠️
src/aleph/vm/controllers/qemu/backup.py 0.00% 2 Missing ⚠️
src/aleph/vm/models.py 66.66% 1 Missing ⚠️
src/aleph/vm/orchestrator/metrics.py 95.23% 1 Missing ⚠️
src/aleph/vm/orchestrator/run.py 50.00% 1 Missing ⚠️
... and 1 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #932      +/-   ##
==========================================
+ Coverage   68.33%   69.07%   +0.74%     
==========================================
  Files         104      107       +3     
  Lines       11983    12609     +626     
  Branches     1023     1066      +43     
==========================================
+ Hits         8188     8710     +522     
- Misses       3530     3624      +94     
- Partials      265      275      +10     


@aliel aliel force-pushed the aliel-add-rescue-mode branch from 6b18f6d to 28b5c03 Compare April 15, 2026 08:29
@aliel aliel marked this pull request as ready for review April 20, 2026 13:52
@foxpatch-aleph left a comment

The PR introduces a well-structured rescue mode feature with execution event tracking and runtimes aggregate. The core logic is sound, but there are two genuine issues that need addressing: (1) the Alembic migration uses placeholder revision IDs that will break the migration chain, and (2) operate_rescue_exit has a race condition where it can conflict with an in-progress rescue background task. Additionally, test coverage for the rescue API endpoints is completely missing.

src/aleph/vm/orchestrator/migrations/versions/0005_add_mode_and_execution_events.py (line 17): The revision IDs 'b3c4d5e6f7a8' and 'a1b2c3d4e5f6' are placeholder values. These must be replaced with real Alembic revision IDs (generated via alembic revision --rev-id=<real-id>) or the migration will fail to apply in the chain. Use alembic revision to generate proper IDs that match the actual previous revision.

src/aleph/vm/orchestrator/views/operator.py (line 1017): operate_rescue_exit does not check whether a background rescue task is in-flight for this VM. If a rescue is still downloading/restarting (task in state.tasks), this exit call could delete the rescue rootfs that the task is about to use, or leave the VM in an inconsistent state. Add a check: if vm_hash_str in state.tasks, return 409 with 'Rescue operation in progress'.

src/aleph/vm/orchestrator/views/operator.py (line 826): The rescue image download via download_file() has no timeout. If the network is slow or the server is unresponsive, this call could hang indefinitely, leaving the rescue task stuck. Consider passing a timeout to download_file or wrapping it with asyncio.wait_for.
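The suggested fix can be sketched with asyncio.wait_for. Note that download_file below is a stand-in coroutine simulating a hung transfer, not the real helper from the PR:

```python
import asyncio

# Stand-in for the real download helper: simulates an unresponsive server.
async def download_file(url: str, dest: str) -> None:
    await asyncio.sleep(10)

async def download_with_timeout(url: str, dest: str, timeout: float = 0.05) -> None:
    """Bound the rescue-image download so a stalled transfer fails fast."""
    try:
        await asyncio.wait_for(download_file(url, dest), timeout=timeout)
    except asyncio.TimeoutError:
        raise RuntimeError(f"rescue image download timed out after {timeout}s")
```

The same pattern applies whether the timeout is passed into the helper itself or wrapped at the call site; the important part is that the background rescue task can never block indefinitely.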

src/aleph/vm/orchestrator/views/operator.py (line 865): execution.mode is set to 'rescue' on line 865 before _restart_persistent_vm succeeds. If the restart fails, the in-memory state says 'rescue' but the DB still says 'normal'. The next GET /rescue status call would incorrectly report rescue mode. Consider setting mode only after _restart_persistent_vm completes, or persisting the mode atomically with the restart.

tests/supervisor/test_rescue.py (line 1): Missing test coverage for the three rescue API endpoints: operate_rescue (POST), operate_rescue_status (GET), and operate_rescue_exit (DELETE). At minimum, add tests for: 409 when already in rescue mode, 409 when not in rescue mode (exit), 202 in_progress response, 409 conflict blocking on erase/reinstall while in rescue mode, and the happy path for enter/exit.

src/aleph/vm/orchestrator/utils.py (line 116): fetch_runtimes_aggregate accesses resp_data['data']['runtimes']['entries'] without defensive checks. If the API returns an unexpected structure, this raises a KeyError. While get_runtimes catches Exception, a malformed response silently returns [] which may mask configuration issues. Consider logging the raw response structure on error for debugging.
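A defensive version of that access could look like the following sketch. The helper name and response shape are illustrative (modeled on the aggregate API), not the PR's actual code:

```python
import logging

logger = logging.getLogger(__name__)

def extract_runtime_entries(resp_data: dict) -> list:
    """Pull data.runtimes.entries out of an aggregate response,
    logging the raw structure instead of raising on malformed input."""
    data = resp_data.get("data") or {}
    runtimes = data.get("runtimes") or {}
    entries = runtimes.get("entries")
    if not isinstance(entries, list):
        logger.warning("Unexpected runtimes aggregate shape: %r", resp_data)
        return []
    return entries
```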

@foxpatch-aleph left a comment

The PR is a well-structured addition implementing rescue mode, execution event auditing, and runtimes aggregate. The architecture is sound and the code follows existing patterns. However, there is one genuine bug: operate_rescue_exit is missing a check for an in-progress rescue task (state.tasks), which creates a race condition allowing exit and rescue to run concurrently on the same VM. There are also several minor issues worth addressing.

src/aleph/vm/orchestrator/views/operator.py (line 1017): Missing race condition guard: operate_rescue_exit does not check state.tasks for an in-progress rescue task. Compare with operate_rescue_status (line 994) and operate_rescue (line 937). If a rescue is still downloading/restarting, operate_rescue_exit could delete the rescue rootfs and reset to normal mode while _run_rescue_work is mid-flight, causing data corruption or a crash. Add: if vm_hash_str in state.tasks: return web.json_response({'status': 'in_progress'}, status=202)
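The guard can be modeled in isolation as follows. The RescueState shape and status codes here are taken from the review text; the real handlers use aiohttp responses rather than bare integers:

```python
from dataclasses import dataclass, field

@dataclass
class RescueState:
    tasks: dict = field(default_factory=dict)    # vm_hash -> in-flight task
    results: dict = field(default_factory=dict)  # vm_hash -> finished result

def rescue_exit_status(state: RescueState, vm_hash: str) -> int:
    """Status the exit endpoint should answer with before touching disks."""
    if vm_hash in state.tasks:
        return 202  # rescue still in progress: do not delete the rescue rootfs
    return 200      # safe to tear down rescue and restore normal boot
```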

src/aleph/vm/orchestrator/utils.py (line 116): Missing key validation: fetch_runtimes_aggregate accesses resp_data['data']['runtimes']['entries'] without checking these keys exist. If the API returns an unexpected structure (e.g., empty runtimes key), this raises a KeyError that is caught by the outer try/except but silently swallowed. Consider adding explicit validation: if 'runtimes' not in resp_data.get('data', {}): return []

src/aleph/vm/orchestrator/metrics.py (line 84): SQLAlchemy default vs server_default: Column(String, nullable=False, default='normal', server_default='normal') — the Python default is ignored when using server_default. Remove the redundant default='normal' parameter to avoid confusion.

src/aleph/vm/orchestrator/migrations/versions/0005_add_mode_and_execution_events.py (line 17): The down_revision = 'a1b2c3d4e5f6' appears to be a placeholder. Verify this matches the actual latest migration revision before merging, otherwise Alembic will fail to apply the migration.

src/aleph/vm/orchestrator/views/operator.py (line 1048): In operate_rescue_exit, the tap interface is deleted and recreated only if pool.network and execution.vm and pool.network.interface_exists(...). However, if execution.vm is None (possible if the VM was never fully started), the tap reset is silently skipped. This is probably fine, but consider logging a warning if execution.vm is None since it means the rescue exit is incomplete.

src/aleph/vm/orchestrator/views/operator.py (line 997): In operate_rescue_status, state.results.pop(vm_hash_str) removes the result on first read. This means if the caller refreshes the status page after a successful rescue, they get a 409 instead of the rescue status. The fallback to execution.mode == 'rescue' (line 1003) handles this, but the result data (which includes the full message about /dev/vdb) is lost. Consider not popping the result, or caching the result alongside the mode check.

src/aleph/vm/orchestrator/views/operator.py (line 865): In _run_rescue_work, execution.mode = 'rescue' is set before _restart_persistent_vm. If the restart fails (e.g., systemd error), the execution record is left in rescue mode even though the VM is not actually in rescue. Consider setting the mode after _restart_persistent_vm succeeds, or rolling back on failure.
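The proposed ordering fix reduces to moving the assignment after the await. In this sketch, _restart_persistent_vm is a stand-in coroutine and the execution object is any object with a mode attribute:

```python
# Stand-in for the real restart helper; raises on failure like a systemd error.
async def _restart_persistent_vm(ok: bool) -> None:
    if not ok:
        raise RuntimeError("systemd restart failed")

async def enter_rescue(execution, restart_ok: bool) -> None:
    """Flip the mode only once the restart has succeeded."""
    await _restart_persistent_vm(restart_ok)  # may raise; mode stays untouched
    execution.mode = "rescue"                 # only reached on success
```

If the persisted DB write and the in-memory assignment must both happen, the same principle holds: perform both after the restart, or roll both back in an except block.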

src/aleph/vm/orchestrator/utils.py (line 98): The RuntimeEntry TypedDict uses total=False which makes all fields optional. However, id, name, type, item_hash are effectively required by consumers. Consider using a dataclass or a Pydantic model for stricter validation, or at least document which fields are required.
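One stricter shape, per this note, is a frozen dataclass where the required fields fail loudly at construction. This is an illustration of the suggestion, not the PR's actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RuntimeEntry:
    id: str
    name: str
    type: str
    item_hash: str
    default: bool = False  # the only genuinely optional field
```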

tests/supervisor/test_rescue.py (line 88): The _patch_session_maker fixture patches AsyncSessionMaker on the module-level aleph.vm.orchestrator.metrics which is shared across all tests. While this works because tests are isolated by in-memory DB, consider using autouse=False explicitly and ensuring tests that don't need the DB don't accidentally interfere. The fixture is fine as-is, but worth noting.

@foxpatch-aleph left a comment

The PR implements a well-structured rescue mode feature with execution state tracking and a runtimes aggregate. The core logic is sound: rescue mode boots from a .rescue overlay image while attaching the original rootfs as /dev/vdb, the audit trail via ExecutionEvent is useful, and the runtimes aggregate follows the existing settings aggregate pattern. However, there are a few issues worth addressing before merge.

src/aleph/vm/orchestrator/views/operator.py (line 848): When expected_sha256 is None (user-provided item_hash without a sha256 in the aggregate), the rescue image is downloaded without any integrity verification. Consider always requiring sha256 verification, or at minimum logging a warning when verification is skipped.

src/aleph/vm/orchestrator/views/operator.py (line 833): _run_rescue_work calls pool.stop_vm() which records a 'stopped' event, then _restart_persistent_vm is called. This means a 'stopped' event is recorded even though the VM immediately restarts. The event log will show stop/start pairs for rescue transitions. Consider recording 'stopped' only for explicit stops, not for restarts within _run_rescue_work, to avoid polluting the audit trail.

src/aleph/vm/orchestrator/views/operator.py (line 998): state.results.pop(vm_hash_str) removes the result after first read. This means if a user polls GET /rescue after success, they won't see the result from state.results and will fall through to the execution.mode == 'rescue' check. This is intentional (prevents replay), but the dual-path behavior (result from cache vs. live mode check) could return slightly different response bodies. Consider unifying the response format.

src/aleph/vm/orchestrator/views/operator.py (line 1033): operate_rescue_exit calls pool.stop_vm() which records a 'stopped' event, then calls _restart_persistent_vm which restarts the VM. Same issue as line 833: the 'stopped' event is recorded for a restart, not an actual stop. This creates misleading audit trail entries.

src/aleph/vm/orchestrator/metrics.py (line 84): The mode column uses default="normal" and server_default="normal". In SQLAlchemy 2.x, Column.default is only used when constructing a new Python object, while server_default is used for INSERT statements. Both being set is correct for this use case, but note that existing rows created before this column was added will have NULL until the migration's server_default constraint takes effect on future updates.

src/aleph/vm/orchestrator/migrations/versions/0005_add_mode_and_execution_events.py (line 12): The revision IDs b3c4d5e6f7a8 and a1b2c3d4e5f6 appear to be placeholder values. These should be verified against the actual alembic revision chain before merge.

src/aleph/vm/orchestrator/utils.py (line 111): The URL construction on this line has an unnecessary line break: f"{settings.API_SERVER}/api/v0/aggregates/" f"{settings.SETTINGS_AGGREGATE_ADDRESS}.json?keys=runtimes". While this works due to Python string concatenation, a single line would be cleaner.

tests/supervisor/test_rescue.py (line 88): The _patch_session_maker fixture monkeypatches metrics_mod.AsyncSessionMaker in-place. Since pytest runs tests sequentially and the patch is not undone between tests, this could cause test isolation issues if a test imports the module before the fixture runs. Consider using yield in the fixture to restore the original value after the test.

tests/supervisor/test_rescue.py (line 1): Missing test coverage for the rescue API endpoints themselves (operate_rescue, operate_rescue_status, operate_rescue_exit). The underlying functions (_run_rescue_work, RescueState) are tested indirectly through the metrics and utils tests, but the HTTP endpoints should have integration tests covering: 409 conflict when already in rescue mode, 409 when not in rescue mode, 403 unauthorized, and 202/in_progress flow.

src/aleph/vm/controllers/qemu/instance.py (line 216): In rescue mode, the original rootfs is inserted as a host volume with read_only=False. This allows writes to the original rootfs during rescue, which is the intended behavior for repair. However, consider whether data volumes should also be read-only in rescue mode to prevent accidental modification while the filesystem is mounted elsewhere.

aliel added 13 commits April 28, 2026 21:27
  Rescue mode lets users boot a persistent instance from a minimal
  rescue image while their original rootfs is attached as a secondary
  drive for repair. The original rootfs is never moved or renamed.

  POST /control/machine/{ref}/rescue   - enter rescue mode
  DELETE /control/machine/{ref}/rescue - exit and restore normal boot

  Reinstall and erase are blocked while in rescue mode (409 conflict).

  Execution state tracking:
  - New `mode` column on ExecutionRecord ("normal" or "rescue"),
    persisted across supervisor restarts.
  - New `execution_events` audit table records lifecycle events
    (created, stopped, reinstalled, rescue_entered, rescue_exited, etc.)
  - list_executions_v2 response now includes `mode`.

  Runtimes aggregate:
  - Fetch and cache logic for the runtimes aggregate from the
    governance address, same pattern as the settings aggregate.
  - RuntimeEntry schema with id, name, type, item_hash, default fields.
  - Lookup helpers: get_runtime_by_id, get_default_runtime.
  - Rescue image hash is resolved from the aggregate (type "rescue")
    instead of a hardcoded config value.
…s after entering rescue mode.

Run download and restart as a background task.
- Cloud-init auto-detects its ISO by the "cidata" label, so the drive letter is irrelevant.
  Placing host volumes first ensures the original rootfs appears as /dev/vdb in rescue mode as documented.
rescue: add HTTP endpoint tests
@aliel aliel force-pushed the aliel-add-rescue-mode branch from 7fb4f4c to 849ac97 Compare April 28, 2026 19:28