Skip to content

[AAP-74497] Reset orphaned waiting jobs when controller node is deprovisioned#16467

Merged
fosterseth merged 1 commit into
ansible:develfrom
fosterseth:reap-orphaned-waiting-jobs
Jun 2, 2026
Merged

[AAP-74497] Reset orphaned waiting jobs when controller node is deprovisioned#16467
fosterseth merged 1 commit into
ansible:develfrom
fosterseth:reap-orphaned-waiting-jobs

Conversation

@fosterseth

@fosterseth fosterseth commented May 29, 2026

Copy link
Copy Markdown
Member

Summary

  • In K8s deployments, jobs can get permanently stuck in waiting status when the controller pod they were assigned to is replaced/deprovisioned
  • The instance record gets deleted but the waiting jobs referencing that controller_node are never cleaned up — no existing reaper path catches them
  • Extends reap_jobs_from_orphaned_instances() in the task manager to detect waiting jobs whose controller_node no longer matches any registered control/hybrid instance, and resets them to pending so they can be re-dispatched

Why existing reaping paths don't catch this

  1. reaper.reap() — only queries status='running', so waiting jobs are invisible to it entirely.
  2. startup_reaping() — also only queries status='running' on the current node at startup.
  3. _heartbeat_handle_lost_instances() — correctly resets waiting jobs, but only for instances still in the DB. Once AWX_AUTO_DEPROVISION_INSTANCES deletes the Instance record (same function, a few lines later), future heartbeats have no record to match against — the dead hostname is gone.
  4. reap_jobs_from_orphaned_instances() (before this fix) — checks execution_node against registered instances, but never checks controller_node. A waiting job with controller_node='dead-pod' and execution_node='' passes right through.
  5. Heartbeat dispatch_waiting_jobs re-check — only looks for waiting jobs where controller_node matches the current node (CLUSTER_HOST_ID), not orphaned jobs on other nodes.
ISSUE TYPE
  • Bug, Docs Fix or other nominal change
COMPONENT NAME
  • API

Test plan

  • Added functional test: waiting job with deprovisioned controller_node is reset to pending
  • Added functional test: waiting job with live controller_node (control and hybrid) is not touched
  • Verify existing reaper tests still pass
  • Manual validation in K8s environment: kill a task pod with waiting jobs and confirm they recover

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Jobs stuck in waiting are now auto-recovered when their controller node is unavailable: they are reset to pending and can be reassigned to active controllers.
  • Tests

    • Added functional tests verifying waiting jobs are reset when controllers are deprovisioned and remain waiting when controllers are alive.

@coderabbitai

coderabbitai Bot commented May 29, 2026

Copy link
Copy Markdown

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 04405883-52bb-4d70-83c2-cadc3baac261

📥 Commits

Reviewing files that changed from the base of the PR and between 071a863 and 6013798.

📒 Files selected for processing (2)
  • awx/main/scheduler/task_manager.py
  • awx/main/tests/functional/test_dispatch.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • awx/main/scheduler/task_manager.py
  • awx/main/tests/functional/test_dispatch.py

📝 Walkthrough

Walkthrough

TaskManager.reap_jobs_from_orphaned_instances now resets UnifiedJob rows stuck in waiting whose controller_node is not a live control/hybrid Instance, setting status to pending and clearing controller_node and execution_node. Two functional tests assert reset and non-reset behavior.

Changes

Orphaned Waiting Jobs Reaper

Layer / File(s) Summary
Reaper implementation for orphaned waiting jobs
awx/main/scheduler/task_manager.py
TaskManager.reap_jobs_from_orphaned_instances now queries live control/hybrid Instance hostnames and resets UnifiedJob rows in waiting whose controller_node is not among them, setting status='pending' and clearing controller_node and execution_node with a targeted save(update_fields=[...]).
Reaper test validation
awx/main/tests/functional/test_dispatch.py
Adds two functional tests in TestJobReaper: test_waiting_job_reset_when_controller_node_deprovisioned verifies a waiting job referencing a missing controller is reset to pending with cleared node fields; test_waiting_job_not_reset_when_controller_node_alive (parameterized for control and hybrid) verifies waiting jobs tied to live controllers remain waiting and retain controller_node.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 60.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main change: resetting orphaned waiting jobs when their controller node is deprovisioned, which directly matches the code additions in task_manager.py and the test coverage added.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
awx/main/tests/functional/test_dispatch.py (1)

164-196: ⚡ Quick win

Add a hybrid-controller test case.

Both tests only exercise node_type='control'. Neither would catch the hybrid-node regression noted in task_manager.py (a live hybrid controller being treated as orphaned). Once the filter is fixed to include hybrids, add a test asserting a waiting job on a live node_type='hybrid' controller is left untouched.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@awx/main/tests/functional/test_dispatch.py` around lines 164 - 196, Add a new
functional test mirroring test_waiting_job_not_reset_when_controller_node_alive
that creates an Instance(hostname='awx-task-live', node_type='hybrid'), saves
it, then creates a Job with status='waiting' and
controller_node='awx-task-live', runs
TaskManager().reap_jobs_from_orphaned_instances(), refreshes the job, and
asserts the job.status remains 'waiting' and job.controller_node still equals
'awx-task-live' to ensure hybrid controller nodes are treated as alive by
reap_jobs_from_orphaned_instances.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@awx/main/scheduler/task_manager.py`:
- Around line 693-700: The registered_control_nodes query in
reap_jobs_from_orphaned_instances only includes Instance.node_type='control' so
jobs whose controller_node was set to a hybrid host by process_pending_tasks get
treated as orphaned; change the lookup that builds registered_control_nodes to
include hybrid nodes (e.g., filter node_type__in=('control','hybrid')) so the
subsequent orphaned_waiting
exclude(controller_node__in=registered_control_nodes) correctly preserves
waiting jobs controlled by live hybrid instances, leaving the rest of the reset
logic (setting status/controller_node/execution_node and
save(update_fields=...)) unchanged.

---

Nitpick comments:
In `@awx/main/tests/functional/test_dispatch.py`:
- Around line 164-196: Add a new functional test mirroring
test_waiting_job_not_reset_when_controller_node_alive that creates an
Instance(hostname='awx-task-live', node_type='hybrid'), saves it, then creates a
Job with status='waiting' and controller_node='awx-task-live', runs
TaskManager().reap_jobs_from_orphaned_instances(), refreshes the job, and
asserts the job.status remains 'waiting' and job.controller_node still equals
'awx-task-live' to ensure hybrid controller nodes are treated as alive by
reap_jobs_from_orphaned_instances.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 066d9a69-fba6-4093-a134-9d82b8a832b9

📥 Commits

Reviewing files that changed from the base of the PR and between fccb674 and bd0b38d.

📒 Files selected for processing (2)
  • awx/main/scheduler/task_manager.py
  • awx/main/tests/functional/test_dispatch.py

Comment thread awx/main/scheduler/task_manager.py Outdated
@fosterseth fosterseth requested a review from AlanCoding May 29, 2026 15:50

@jessicamack jessicamack left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment thread awx/main/scheduler/task_manager.py Outdated
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@fosterseth fosterseth force-pushed the reap-orphaned-waiting-jobs branch from 071a863 to 6013798 Compare June 1, 2026 13:47
@sonarqubecloud

sonarqubecloud Bot commented Jun 1, 2026

Copy link
Copy Markdown

@AlanCoding

Copy link
Copy Markdown
Member

My only concern is that this is duplicated with other logic that is already triggered from the heartbeat task.

But IIRC, we already went through this story, and the problem is that in k8s instances will automatically de-provision themselves. The heartbeat handles nodes that exist in the database but are identified to be MIA. So I expect this is only handling cases where the labeled node does not exist which has not yet, nor will ever be, put into the scope of the heartbeat peer-judgement and reaping logic.

This is still confusing, and I think we would clearly prefer more consolidation. Running this on the tail-end of the task manager has a fairly significant expense. I would like to consolidate better... eventually. How? My initial thoughts:

  • at the tail end of task manager, run a single query to identify if active jobs exist with a labeled node (control or execution) that has no record in the database. If this condition is met, we trigger a background task that does more stuff.
  • the heartbeat is also treated as more of a trigger than anything else, and calls a general bad-instance-cleanup task when another instance is marked offline

In both cases, I expect the "cleanup" background task would probably be the same. Because the trigger doesn't run on every tick of either the heartbeat or task manager, we don't have to be stingy with the logic it runs.

Comment thread awx/main/scheduler/task_manager.py

@AlanCoding AlanCoding left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still a good isolated fix but I would like to some kind of followup task for reaper cleanup

@fosterseth fosterseth merged commit 5cc467d into ansible:devel Jun 2, 2026
21 checks passed
AlanCoding added a commit to AlanCoding/awx that referenced this pull request Jun 4, 2026
…round task flow

Refactored fragmented reaping logic across 5 entry points into a single architecture: heartbeat now only
handles instance state transitions (marking offline/deleting), then triggers the task manager to detect
orphaned jobs; task manager uses already-loaded in-memory instance data and a fast .exists() check to
detect orphaned jobs without expensive DB queries, queuing idempotent background tasks with
on_duplicate='queue_one' if needed; and background tasks reap_orphaned_jobs() and
reset_orphaned_waiting_jobs() run independently with no arguments to avoid worker saturation. Extracted
shared get_orphaned_running_jobs_query() function to DRY up query logic, removed inline job reaping from
heartbeat and task manager hot paths, and updated tests to verify the new pattern. Task manager cycle
load is now minimal (O(n) memory iteration + fast existence check), eliminating the expensive periodic DB
queries from PR ansible#16467.

Assisted-By: Haiku
AlanCoding added a commit to AlanCoding/awx that referenced this pull request Jun 8, 2026
…round task flow

Refactored fragmented reaping logic across 5 entry points into a single architecture: heartbeat now only
handles instance state transitions (marking offline/deleting), then triggers the task manager to detect
orphaned jobs; task manager uses already-loaded in-memory instance data and a fast .exists() check to
detect orphaned jobs without expensive DB queries, queuing idempotent background tasks with
on_duplicate='queue_one' if needed; and background tasks reap_orphaned_jobs() and
reset_orphaned_waiting_jobs() run independently with no arguments to avoid worker saturation. Extracted
shared get_orphaned_running_jobs_query() function to DRY up query logic, removed inline job reaping from
heartbeat and task manager hot paths, and updated tests to verify the new pattern. Task manager cycle
load is now minimal (O(n) memory iteration + fast existence check), eliminating the expensive periodic DB
queries from PR ansible#16467.

Assisted-By: Haiku
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants