Add rescue mode, execution state tracking, and runtimes aggregate #932
aliel wants to merge 13 commits into main
Conversation
Codecov Report
❌ Patch coverage is

@@            Coverage Diff             @@
##             main      #932      +/-  ##
==========================================
+ Coverage    68.33%    69.07%   +0.74%
==========================================
  Files          104       107       +3
  Lines        11983     12609     +626
  Branches      1023      1066      +43
==========================================
+ Hits          8188      8710     +522
- Misses        3530      3624      +94
- Partials       265       275      +10

☔ View full report in Codecov by Sentry.
Force-pushed from 6b18f6d to 28b5c03
foxpatch-aleph left a comment
The PR introduces a well-structured rescue mode feature with execution event tracking and runtimes aggregate. The core logic is sound, but there are two genuine issues that need addressing: (1) the Alembic migration uses placeholder revision IDs that will break the migration chain, and (2) operate_rescue_exit has a race condition where it can conflict with an in-progress rescue background task. Additionally, test coverage for the rescue API endpoints is completely missing.
src/aleph/vm/orchestrator/migrations/versions/0005_add_mode_and_execution_events.py (line 17): The revision IDs 'b3c4d5e6f7a8' and 'a1b2c3d4e5f6' are placeholder values. These must be replaced with real Alembic revision IDs (generated via alembic revision --rev-id=<real-id>) or the migration will fail to apply in the chain. Use alembic revision to generate proper IDs that match the actual previous revision.
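For reference, a minimal sketch of the corrected header; both IDs below are illustrative, not real values from this repository:

```python
# Header of 0005_add_mode_and_execution_events.py (sketch).
# `revision` can be any unique ID generated by `alembic revision`, but
# `down_revision` must match the `revision` string of the previous
# migration file exactly, or the chain will not resolve.
revision = "0005_add_mode_and_execution_events"  # illustrative
down_revision = "0004_previous_revision_id"  # copy verbatim from the prior file
branch_labels = None
depends_on = None
```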
src/aleph/vm/orchestrator/views/operator.py (line 1017): operate_rescue_exit does not check whether a background rescue task is in-flight for this VM. If a rescue is still downloading/restarting (task in state.tasks), this exit call could delete the rescue rootfs that the task is about to use, or leave the VM in an inconsistent state. Add a check: if vm_hash_str in state.tasks, return 409 with 'Rescue operation in progress'.
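A minimal sketch of the suggested guard, assuming the handler's existing `state` and `vm_hash_str` variables:

```python
from aiohttp import web

async def operate_rescue_exit(request: web.Request) -> web.Response:
    ...  # existing auth and VM lookup; `state` and `vm_hash_str` as in the handler
    # Refuse to exit while a rescue background task is still running for this VM.
    if vm_hash_str in state.tasks:
        return web.json_response({"error": "Rescue operation in progress"}, status=409)
    ...  # existing exit logic
```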
src/aleph/vm/orchestrator/views/operator.py (line 826): The rescue image download via download_file() has no timeout. If the network is slow or the server is unresponsive, this call could hang indefinitely, leaving the rescue task stuck. Consider passing a timeout to download_file or wrapping it with asyncio.wait_for.
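One possible shape, assuming the existing download_file helper and its arguments; the 600-second budget is arbitrary:

```python
import asyncio

async def _download_rescue_image(url: str, rescue_path: str) -> None:
    # Bound the download so a stalled server cannot hang the rescue task forever.
    try:
        await asyncio.wait_for(download_file(url, rescue_path), timeout=600)
    except asyncio.TimeoutError:
        raise RuntimeError(f"timed out downloading rescue image from {url}")
```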
src/aleph/vm/orchestrator/views/operator.py (line 865): execution.mode is set to 'rescue' on line 865 before _restart_persistent_vm succeeds. If the restart fails, the in-memory state says 'rescue' but the DB still says 'normal'. The next GET /rescue status call would incorrectly report rescue mode. Consider setting mode only after _restart_persistent_vm completes, or persisting the mode atomically with the restart.
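A sketch of the suggested ordering, with names and signatures assumed from this comment:

```python
async def _run_rescue_work(execution) -> None:
    ...  # download and verify the rescue image, swap volumes, etc.
    # Restart first; flip the mode only once the restart has succeeded,
    # so a failed restart cannot leave in-memory state claiming "rescue".
    await _restart_persistent_vm(execution)
    execution.mode = "rescue"
```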
tests/supervisor/test_rescue.py (line 1): Missing test coverage for the three rescue API endpoints: operate_rescue (POST), operate_rescue_status (GET), and operate_rescue_exit (DELETE). At minimum, add tests for: 409 when already in rescue mode, 409 when not in rescue mode (exit), 202 in_progress response, 409 conflict blocking on erase/reinstall while in rescue mode, and the happy path for enter/exit.
src/aleph/vm/orchestrator/utils.py (line 116): fetch_runtimes_aggregate accesses resp_data['data']['runtimes']['entries'] without defensive checks. If the API returns an unexpected structure, this raises a KeyError. While get_runtimes catches Exception, a malformed response silently returns [] which may mask configuration issues. Consider logging the raw response structure on error for debugging.
foxpatch-aleph left a comment
The PR is a well-structured addition implementing rescue mode, execution event auditing, and runtimes aggregate. The architecture is sound and the code follows existing patterns. However, there is one genuine bug: operate_rescue_exit is missing a check for an in-progress rescue task (state.tasks), which creates a race condition allowing exit and rescue to run concurrently on the same VM. There are also several minor issues worth addressing.
src/aleph/vm/orchestrator/views/operator.py (line 1017): Missing race condition guard: operate_rescue_exit does not check state.tasks for an in-progress rescue task. Compare with operate_rescue_status (line 994) and operate_rescue (line 937). If a rescue is still downloading/restarting, operate_rescue_exit could delete the rescue rootfs and reset to normal mode while _run_rescue_work is mid-flight, causing data corruption or a crash. Add: if vm_hash_str in state.tasks: return web.json_response({'status': 'in_progress'}, status=202)
src/aleph/vm/orchestrator/utils.py (line 116): Missing key validation: fetch_runtimes_aggregate accesses resp_data['data']['runtimes']['entries'] without checking these keys exist. If the API returns an unexpected structure (e.g., empty runtimes key), this raises a KeyError that is caught by the outer try/except but silently swallowed. Consider adding explicit validation: if 'runtimes' not in resp_data.get('data', {}): return []
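A sketch combining the validation with logging of the raw payload; the helper name is hypothetical and `logger` is assumed to be the module logger:

```python
import logging

logger = logging.getLogger(__name__)

def _extract_runtime_entries(resp_data: dict) -> list:
    # Defensive parse: return [] and log the raw structure on an
    # unexpected shape instead of raising a silently-swallowed KeyError.
    entries = resp_data.get("data", {}).get("runtimes", {}).get("entries")
    if not isinstance(entries, list):
        logger.warning("Unexpected runtimes aggregate payload: %r", resp_data)
        return []
    return entries
```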
src/aleph/vm/orchestrator/metrics.py (line 84): SQLAlchemy default vs server_default: Column(String, nullable=False, default='normal', server_default='normal') sets both. With both present, SQLAlchemy sends the Python-side default on every ORM INSERT, so the server_default only matters for the migration backfill and for non-ORM writes. Remove the redundant default='normal' parameter to avoid confusion.
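Suggested form:

```python
from sqlalchemy import Column, String

# server_default alone covers both new inserts and the migration
# backfill; no overlapping Python-side default to reason about.
mode = Column(String, nullable=False, server_default="normal")
```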
src/aleph/vm/orchestrator/migrations/versions/0005_add_mode_and_execution_events.py (line 17): The down_revision = 'a1b2c3d4e5f6' appears to be a placeholder. Verify this matches the actual latest migration revision before merging, otherwise Alembic will fail to apply the migration.
src/aleph/vm/orchestrator/views/operator.py (line 1048): In operate_rescue_exit, the tap interface is deleted and recreated only if pool.network and execution.vm and pool.network.interface_exists(...). However, if execution.vm is None (possible if the VM was never fully started), the tap reset is silently skipped. This is probably fine, but consider logging a warning if execution.vm is None since it means the rescue exit is incomplete.
src/aleph/vm/orchestrator/views/operator.py (line 997): In operate_rescue_status, state.results.pop(vm_hash_str) removes the result on first read. This means if the caller refreshes the status page after a successful rescue, they get a 409 instead of the rescue status. The fallback to execution.mode == 'rescue' (line 1003) handles this, but the result data (which includes the full message about /dev/vdb) is lost. Consider not popping the result, or caching the result alongside the mode check.
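A sketch of the peek-instead-of-pop variant, with names assumed from the handler:

```python
async def operate_rescue_status(request):
    ...  # existing auth and VM lookup
    # Peek at the cached result without consuming it, so repeated polls
    # keep returning the full completion message (including /dev/vdb info).
    result = state.results.get(vm_hash_str)
    if result is not None:
        return web.json_response(result)
    ...  # fall through to the execution.mode == "rescue" check
```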
src/aleph/vm/orchestrator/views/operator.py (line 865): In _run_rescue_work, execution.mode = 'rescue' is set before _restart_persistent_vm. If the restart fails (e.g., systemd error), the execution record is left in rescue mode even though the VM is not actually in rescue. Consider setting the mode after _restart_persistent_vm succeeds, or rolling back on failure.
src/aleph/vm/orchestrator/utils.py (line 98): The RuntimeEntry TypedDict uses total=False which makes all fields optional. However, id, name, type, item_hash are effectively required by consumers. Consider using a dataclass or a Pydantic model for stricter validation, or at least document which fields are required.
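For illustration, a dataclass variant that enforces the required fields; field names follow the PR description, and treating default as a boolean is an assumption:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RuntimeEntry:
    # Required by consumers: constructing an entry without them raises.
    id: str
    name: str
    type: str
    item_hash: str
    # Optional: marks the aggregate's default runtime.
    default: bool = False
```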
tests/supervisor/test_rescue.py (line 88): The _patch_session_maker fixture patches AsyncSessionMaker on the module-level aleph.vm.orchestrator.metrics which is shared across all tests. While this works because tests are isolated by in-memory DB, consider using autouse=False explicitly and ensuring tests that don't need the DB don't accidentally interfere. The fixture is fine as-is, but worth noting.
foxpatch-aleph left a comment
The PR implements a well-structured rescue mode feature with execution state tracking and a runtimes aggregate. The core logic is sound: rescue mode boots from a .rescue overlay image while attaching the original rootfs as /dev/vdb, the audit trail via ExecutionEvent is useful, and the runtimes aggregate follows the existing settings aggregate pattern. However, there are a few issues worth addressing before merge.
src/aleph/vm/orchestrator/views/operator.py (line 848): When expected_sha256 is None (user-provided item_hash without a sha256 in the aggregate), the rescue image is downloaded without any integrity verification. Consider always requiring sha256 verification, or at minimum logging a warning when verification is skipped.
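A hypothetical helper sketching the suggestion (verify_rescue_image does not exist in this PR):

```python
import hashlib
import logging
from pathlib import Path

logger = logging.getLogger(__name__)

def verify_rescue_image(path: Path, expected_sha256: str | None) -> None:
    # Make skipped verification loud rather than silent.
    if expected_sha256 is None:
        logger.warning("rescue image %s downloaded without sha256 verification", path)
        return
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            h.update(chunk)
    if h.hexdigest() != expected_sha256:
        raise ValueError(f"sha256 mismatch for {path}: {h.hexdigest()} != {expected_sha256}")
```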
src/aleph/vm/orchestrator/views/operator.py (line 833): _run_rescue_work calls pool.stop_vm() which records a 'stopped' event, then _restart_persistent_vm is called. This means a 'stopped' event is recorded even though the VM immediately restarts. The event log will show stop/start pairs for rescue transitions. Consider recording 'stopped' only for explicit stops, not for restarts within _run_rescue_work, to avoid polluting the audit trail.
src/aleph/vm/orchestrator/views/operator.py (line 998): state.results.pop(vm_hash_str) removes the result after first read. This means if a user polls GET /rescue after success, they won't see the result from state.results and will fall through to the execution.mode == 'rescue' check. This is intentional (prevents replay), but the dual-path behavior (result from cache vs. live mode check) could return slightly different response bodies. Consider unifying the response format.
src/aleph/vm/orchestrator/views/operator.py (line 1033): operate_rescue_exit calls pool.stop_vm() which records a 'stopped' event, then calls _restart_persistent_vm which restarts the VM. Same issue as line 833: the 'stopped' event is recorded for a restart, not an actual stop. This creates misleading audit trail entries.
src/aleph/vm/orchestrator/metrics.py (line 84): The mode column uses default="normal" and server_default="normal". In SQLAlchemy 2.x, Column.default is applied by the ORM at INSERT time, while server_default is emitted in the DDL, which is also what lets the migration backfill existing rows with "normal" when the NOT NULL column is added. Setting both is workable for this use case; just note that the Python-side default takes precedence for ORM inserts.
src/aleph/vm/orchestrator/migrations/versions/0005_add_mode_and_execution_events.py (line 12): The revision IDs b3c4d5e6f7a8 and a1b2c3d4e5f6 appear to be placeholder values. These should be verified against the actual alembic revision chain before merge.
src/aleph/vm/orchestrator/utils.py (line 111): The URL construction on this line has an unnecessary line break: f"{settings.API_SERVER}/api/v0/aggregates/" f"{settings.SETTINGS_AGGREGATE_ADDRESS}.json?keys=runtimes". While this works due to Python string concatenation, a single line would be cleaner.
tests/supervisor/test_rescue.py (line 88): The _patch_session_maker fixture monkeypatches metrics_mod.AsyncSessionMaker in-place. Since pytest runs tests sequentially and the patch is not undone between tests, this could cause test isolation issues if a test imports the module before the fixture runs. Consider using yield in the fixture to restore the original value after the test.
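A sketch using pytest's built-in monkeypatch fixture, which restores the original attribute at teardown; fake_session_maker is a hypothetical stand-in for whatever the tests currently install:

```python
import pytest

@pytest.fixture
def _patch_session_maker(monkeypatch, fake_session_maker):
    import aleph.vm.orchestrator.metrics as metrics_mod
    # monkeypatch undoes this automatically after the test, so no
    # manual yield/restore bookkeeping is needed.
    monkeypatch.setattr(metrics_mod, "AsyncSessionMaker", fake_session_maker)
```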
tests/supervisor/test_rescue.py (line 1): Missing test coverage for the rescue API endpoints themselves (operate_rescue, operate_rescue_status, operate_rescue_exit). The underlying functions (_run_rescue_work, RescueState) are tested indirectly through the metrics and utils tests, but the HTTP endpoints should have integration tests covering: 409 conflict when already in rescue mode, 409 when not in rescue mode, 403 unauthorized, and 202/in_progress flow.
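A sketch of one such case; the fixture names and hash below are hypothetical:

```python
VM_HASH = "decadecadecadeca"  # placeholder item hash

async def test_rescue_returns_409_when_already_in_rescue(aiohttp_client, app_with_vm_in_rescue):
    # app_with_vm_in_rescue: an app fixture whose VM is already in rescue mode.
    client = await aiohttp_client(app_with_vm_in_rescue)
    resp = await client.post(f"/control/machine/{VM_HASH}/rescue")
    assert resp.status == 409
```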
src/aleph/vm/controllers/qemu/instance.py (line 216): In rescue mode, the original rootfs is inserted as a host volume with read_only=False. This allows writes to the original rootfs during rescue, which is the intended behavior for repair. However, consider whether data volumes should also be read-only in rescue mode to prevent accidental modification while the filesystem is mounted elsewhere.
Rescue mode lets users boot a persistent instance from a minimal
rescue image while their original rootfs is attached as a secondary
drive for repair. The original rootfs is never moved or renamed.
POST /control/machine/{ref}/rescue - enter rescue mode
DELETE /control/machine/{ref}/rescue - exit and restore normal boot
Reinstall and erase are blocked while in rescue mode (409 conflict).
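As a usage illustration, a client-side sketch of the enter/exit cycle; the port and the omitted operator-authentication headers are assumptions:

```python
import asyncio
import aiohttp

async def rescue_cycle(base_url: str, vm_hash: str) -> None:
    async with aiohttp.ClientSession(base_url) as session:
        resp = await session.post(f"/control/machine/{vm_hash}/rescue")
        print(resp.status, await resp.json())   # e.g. 202 while entering rescue
        resp = await session.delete(f"/control/machine/{vm_hash}/rescue")
        print(resp.status, await resp.json())   # restored to normal boot

# asyncio.run(rescue_cycle("http://localhost:4020", "<vm item hash>"))
```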
Execution state tracking:
- New `mode` column on ExecutionRecord ("normal" or "rescue"),
persisted across supervisor restarts.
- New `execution_events` audit table records lifecycle events
(created, stopped, reinstalled, rescue_entered, rescue_exited, etc.)
- list_executions_v2 response now includes `mode`.
Runtimes aggregate:
- Fetch and cache logic for the runtimes aggregate from the
governance address, same pattern as the settings aggregate.
- RuntimeEntry schema with id, name, type, item_hash, default fields.
- Lookup helpers: get_runtime_by_id, get_default_runtime.
- Rescue image hash is resolved from the aggregate (type "rescue")
instead of a hardcoded config value.
…s after entering rescue mode. Run download and restart as a background task.
- Cloud-init auto-detects its ISO by the "cidata" label, so the drive letter is irrelevant. Placing host volumes first ensures the original rootfs appears as /dev/vdb in rescue mode as documented.
rescue: add HTTP endpoint tests
Force-pushed from 7fb4f4c to 849ac97