[AAP-74497] Reset orphaned waiting jobs when controller node is deprovisioned#16467
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughWalkthroughTaskManager.reap_jobs_from_orphaned_instances now resets UnifiedJob rows stuck in waiting whose controller_node is not a live control/hybrid Instance, setting status to pending and clearing controller_node and execution_node. Two functional tests assert reset and non-reset behavior. ChangesOrphaned Waiting Jobs Reaper
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
awx/main/tests/functional/test_dispatch.py (1)
164-196: ⚡ Quick winAdd a hybrid-controller test case.
Both tests only exercise
node_type='control'. Neither would catch the hybrid-node regression noted intask_manager.py(a live hybrid controller being treated as orphaned). Once the filter is fixed to include hybrids, add a test asserting a waiting job on a livenode_type='hybrid'controller is left untouched.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@awx/main/tests/functional/test_dispatch.py` around lines 164 - 196, Add a new functional test mirroring test_waiting_job_not_reset_when_controller_node_alive that creates an Instance(hostname='awx-task-live', node_type='hybrid'), saves it, then creates a Job with status='waiting' and controller_node='awx-task-live', runs TaskManager().reap_jobs_from_orphaned_instances(), refreshes the job, and asserts the job.status remains 'waiting' and job.controller_node still equals 'awx-task-live' to ensure hybrid controller nodes are treated as alive by reap_jobs_from_orphaned_instances.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@awx/main/scheduler/task_manager.py`:
- Around line 693-700: The registered_control_nodes query in
reap_jobs_from_orphaned_instances only includes Instance.node_type='control' so
jobs whose controller_node was set to a hybrid host by process_pending_tasks get
treated as orphaned; change the lookup that builds registered_control_nodes to
include hybrid nodes (e.g., filter node_type__in=('control','hybrid')) so the
subsequent orphaned_waiting
exclude(controller_node__in=registered_control_nodes) correctly preserves
waiting jobs controlled by live hybrid instances, leaving the rest of the reset
logic (setting status/controller_node/execution_node and
save(update_fields=...)) unchanged.
---
Nitpick comments:
In `@awx/main/tests/functional/test_dispatch.py`:
- Around line 164-196: Add a new functional test mirroring
test_waiting_job_not_reset_when_controller_node_alive that creates an
Instance(hostname='awx-task-live', node_type='hybrid'), saves it, then creates a
Job with status='waiting' and controller_node='awx-task-live', runs
TaskManager().reap_jobs_from_orphaned_instances(), refreshes the job, and
asserts the job.status remains 'waiting' and job.controller_node still equals
'awx-task-live' to ensure hybrid controller nodes are treated as alive by
reap_jobs_from_orphaned_instances.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: 066d9a69-fba6-4093-a134-9d82b8a832b9
📒 Files selected for processing (2)
awx/main/scheduler/task_manager.pyawx/main/tests/functional/test_dispatch.py
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
071a863 to
6013798
Compare
|
|
My only concern is that this is duplicated with other logic that is already triggered from the heartbeat task. But IIRC, we already went through this story, and the problem is that in k8s instances will automatically de-provision themselves. The heartbeat handles nodes that exist in the database but are identified to be MIA. So I expect this is only handling cases where the labeled node does not exist which has not yet, nor will ever be, put into the scope of the heartbeat peer-judgement and reaping logic. This is still confusing, and I think we would clearly prefer more consolidation. Running this on the tail-end of the task manager has a fairly significant expense. I would like to consolidate better... eventually. How? My initial thoughts:
In both cases, I expect the "cleanup" background task would probably be the same. Because the trigger doesn't run on every tick of either the heartbeat or task manager, we don't have to be stingy with the logic it runs. |
AlanCoding
left a comment
There was a problem hiding this comment.
still a good isolated fix but I would like to some kind of followup task for reaper cleanup
…round task flow Refactored fragmented reaping logic across 5 entry points into a single architecture: heartbeat now only handles instance state transitions (marking offline/deleting), then triggers the task manager to detect orphaned jobs; task manager uses already-loaded in-memory instance data and a fast .exists() check to detect orphaned jobs without expensive DB queries, queuing idempotent background tasks with on_duplicate='queue_one' if needed; and background tasks reap_orphaned_jobs() and reset_orphaned_waiting_jobs() run independently with no arguments to avoid worker saturation. Extracted shared get_orphaned_running_jobs_query() function to DRY up query logic, removed inline job reaping from heartbeat and task manager hot paths, and updated tests to verify the new pattern. Task manager cycle load is now minimal (O(n) memory iteration + fast existence check), eliminating the expensive periodic DB queries from PR ansible#16467. Assisted-By: Haiku
…round task flow Refactored fragmented reaping logic across 5 entry points into a single architecture: heartbeat now only handles instance state transitions (marking offline/deleting), then triggers the task manager to detect orphaned jobs; task manager uses already-loaded in-memory instance data and a fast .exists() check to detect orphaned jobs without expensive DB queries, queuing idempotent background tasks with on_duplicate='queue_one' if needed; and background tasks reap_orphaned_jobs() and reset_orphaned_waiting_jobs() run independently with no arguments to avoid worker saturation. Extracted shared get_orphaned_running_jobs_query() function to DRY up query logic, removed inline job reaping from heartbeat and task manager hot paths, and updated tests to verify the new pattern. Task manager cycle load is now minimal (O(n) memory iteration + fast existence check), eliminating the expensive periodic DB queries from PR ansible#16467. Assisted-By: Haiku
Summary
waitingstatus when the controller pod they were assigned to is replaced/deprovisionedcontroller_nodeare never cleaned up — no existing reaper path catches themreap_jobs_from_orphaned_instances()in the task manager to detect waiting jobs whosecontroller_nodeno longer matches any registered control/hybrid instance, and resets them topendingso they can be re-dispatchedWhy existing reaping paths don't catch this
reaper.reap()— only queriesstatus='running', so waiting jobs are invisible to it entirely.startup_reaping()— also only queriesstatus='running'on the current node at startup._heartbeat_handle_lost_instances()— correctly resets waiting jobs, but only for instances still in the DB. OnceAWX_AUTO_DEPROVISION_INSTANCESdeletes the Instance record (same function, a few lines later), future heartbeats have no record to match against — the dead hostname is gone.reap_jobs_from_orphaned_instances()(before this fix) — checksexecution_nodeagainst registered instances, but never checkscontroller_node. A waiting job withcontroller_node='dead-pod'andexecution_node=''passes right through.dispatch_waiting_jobsre-check — only looks for waiting jobs wherecontroller_nodematches the current node (CLUSTER_HOST_ID), not orphaned jobs on other nodes.ISSUE TYPE
COMPONENT NAME
Test plan
🤖 Generated with Claude Code
Summary by CodeRabbit
Bug Fixes
Tests