[AAP-74497] Reset orphaned waiting jobs when controller node is deprovisioned by fosterseth · Pull Request #16467 · ansible/awx

fosterseth · 2026-05-29T15:36:36Z

Summary

In K8s deployments, jobs can get permanently stuck in waiting status when the controller pod they were assigned to is replaced/deprovisioned
The instance record gets deleted but the waiting jobs referencing that controller_node are never cleaned up — no existing reaper path catches them
Extends reap_jobs_from_orphaned_instances() in the task manager to detect waiting jobs whose controller_node no longer matches any registered control/hybrid instance, and resets them to pending so they can be re-dispatched

Why existing reaping paths don't catch this

reaper.reap() — only queries status='running', so waiting jobs are invisible to it entirely.
startup_reaping() — also only queries status='running' on the current node at startup.
_heartbeat_handle_lost_instances() — correctly resets waiting jobs, but only for instances still in the DB. Once AWX_AUTO_DEPROVISION_INSTANCES deletes the Instance record (same function, a few lines later), future heartbeats have no record to match against — the dead hostname is gone.
reap_jobs_from_orphaned_instances() (before this fix) — checks execution_node against registered instances, but never checks controller_node. A waiting job with controller_node='dead-pod' and execution_node='' passes right through.
Heartbeat dispatch_waiting_jobs re-check — only looks for waiting jobs where controller_node matches the current node (CLUSTER_HOST_ID), not orphaned jobs on other nodes.

ISSUE TYPE

Bug, Docs Fix or other nominal change

COMPONENT NAME

API

Test plan

Added functional test: waiting job with deprovisioned controller_node is reset to pending
Added functional test: waiting job with live controller_node (control and hybrid) is not touched
Verify existing reaper tests still pass
Manual validation in K8s environment: kill a task pod with waiting jobs and confirm they recover

🤖 Generated with Claude Code

Summary by CodeRabbit

Bug Fixes
- Jobs stuck in waiting are now auto-recovered when their controller node is unavailable: they are reset to pending and can be reassigned to active controllers.
Tests
- Added functional tests verifying waiting jobs are reset when controllers are deprovisioned and remain waiting when controllers are alive.

coderabbitai · 2026-05-29T15:36:58Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 04405883-52bb-4d70-83c2-cadc3baac261

📥 Commits

Reviewing files that changed from the base of the PR and between 071a863 and 6013798.

📒 Files selected for processing (2)

awx/main/scheduler/task_manager.py
awx/main/tests/functional/test_dispatch.py

🚧 Files skipped from review as they are similar to previous changes (2)

awx/main/scheduler/task_manager.py
awx/main/tests/functional/test_dispatch.py

📝 Walkthrough

Walkthrough

TaskManager.reap_jobs_from_orphaned_instances now resets UnifiedJob rows stuck in waiting whose controller_node is not a live control/hybrid Instance, setting status to pending and clearing controller_node and execution_node. Two functional tests assert reset and non-reset behavior.

Changes

Orphaned Waiting Jobs Reaper

Layer / File(s)	Summary
Reaper implementation for orphaned waiting jobs `awx/main/scheduler/task_manager.py`	`TaskManager.reap_jobs_from_orphaned_instances` now queries live control/hybrid `Instance` hostnames and resets `UnifiedJob` rows in `waiting` whose `controller_node` is not among them, setting `status='pending'` and clearing `controller_node` and `execution_node` with a targeted `save(update_fields=[...])`.
Reaper test validation `awx/main/tests/functional/test_dispatch.py`	Adds two functional tests in `TestJobReaper`: `test_waiting_job_reset_when_controller_node_deprovisioned` verifies a waiting job referencing a missing controller is reset to pending with cleared node fields; `test_waiting_job_not_reset_when_controller_node_alive` (parameterized for `control` and `hybrid`) verifies waiting jobs tied to live controllers remain waiting and retain `controller_node`.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 60.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and specifically describes the main change: resetting orphaned waiting jobs when their controller node is deprovisioned, which directly matches the code additions in task_manager.py and the test coverage added.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

awx/main/tests/functional/test_dispatch.py (1)
164-196: ⚡ Quick win

Add a hybrid-controller test case.

Both tests only exercise node_type='control'. Neither would catch the hybrid-node regression noted in task_manager.py (a live hybrid controller being treated as orphaned). Once the filter is fixed to include hybrids, add a test asserting a waiting job on a live node_type='hybrid' controller is left untouched.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@awx/main/tests/functional/test_dispatch.py` around lines 164 - 196, Add a new
functional test mirroring test_waiting_job_not_reset_when_controller_node_alive
that creates an Instance(hostname='awx-task-live', node_type='hybrid'), saves
it, then creates a Job with status='waiting' and
controller_node='awx-task-live', runs
TaskManager().reap_jobs_from_orphaned_instances(), refreshes the job, and
asserts the job.status remains 'waiting' and job.controller_node still equals
'awx-task-live' to ensure hybrid controller nodes are treated as alive by
reap_jobs_from_orphaned_instances.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@awx/main/scheduler/task_manager.py`:
- Around line 693-700: The registered_control_nodes query in
reap_jobs_from_orphaned_instances only includes Instance.node_type='control' so
jobs whose controller_node was set to a hybrid host by process_pending_tasks get
treated as orphaned; change the lookup that builds registered_control_nodes to
include hybrid nodes (e.g., filter node_type__in=('control','hybrid')) so the
subsequent orphaned_waiting
exclude(controller_node__in=registered_control_nodes) correctly preserves
waiting jobs controlled by live hybrid instances, leaving the rest of the reset
logic (setting status/controller_node/execution_node and
save(update_fields=...)) unchanged.

---

Nitpick comments:
In `@awx/main/tests/functional/test_dispatch.py`:
- Around line 164-196: Add a new functional test mirroring
test_waiting_job_not_reset_when_controller_node_alive that creates an
Instance(hostname='awx-task-live', node_type='hybrid'), saves it, then creates a
Job with status='waiting' and controller_node='awx-task-live', runs
TaskManager().reap_jobs_from_orphaned_instances(), refreshes the job, and
asserts the job.status remains 'waiting' and job.controller_node still equals
'awx-task-live' to ensure hybrid controller nodes are treated as alive by
reap_jobs_from_orphaned_instances.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 066d9a69-fba6-4093-a134-9d82b8a832b9

📥 Commits

Reviewing files that changed from the base of the PR and between fccb674 and bd0b38d.

📒 Files selected for processing (2)

awx/main/scheduler/task_manager.py
awx/main/tests/functional/test_dispatch.py

jessicamack

LGTM

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

sonarqubecloud · 2026-06-01T14:05:50Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
100.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

AlanCoding · 2026-06-02T13:36:40Z

My only concern is that this is duplicated with other logic that is already triggered from the heartbeat task.

But IIRC, we already went through this story, and the problem is that in k8s instances will automatically de-provision themselves. The heartbeat handles nodes that exist in the database but are identified to be MIA. So I expect this is only handling cases where the labeled node does not exist which has not yet, nor will ever be, put into the scope of the heartbeat peer-judgement and reaping logic.

This is still confusing, and I think we would clearly prefer more consolidation. Running this on the tail-end of the task manager has a fairly significant expense. I would like to consolidate better... eventually. How? My initial thoughts:

at the tail end of task manager, run a single query to identify if active jobs exist with a labeled node (control or execution) that has no record in the database. If this condition is met, we trigger a background task that does more stuff.
the heartbeat is also treated as more of a trigger than anything else, and calls a general bad-instance-cleanup task when another instance is marked offline

In both cases, I expect the "cleanup" background task would probably be the same. Because the trigger doesn't run on every tick of either the heartbeat or task manager, we don't have to be stingy with the logic it runs.

AlanCoding

still a good isolated fix but I would like to some kind of followup task for reaper cleanup

…round task flow Refactored fragmented reaping logic across 5 entry points into a single architecture: heartbeat now only handles instance state transitions (marking offline/deleting), then triggers the task manager to detect orphaned jobs; task manager uses already-loaded in-memory instance data and a fast .exists() check to detect orphaned jobs without expensive DB queries, queuing idempotent background tasks with on_duplicate='queue_one' if needed; and background tasks reap_orphaned_jobs() and reset_orphaned_waiting_jobs() run independently with no arguments to avoid worker saturation. Extracted shared get_orphaned_running_jobs_query() function to DRY up query logic, removed inline job reaping from heartbeat and task manager hot paths, and updated tests to verify the new pattern. Task manager cycle load is now minimal (O(n) memory iteration + fast existence check), eliminating the expensive periodic DB queries from PR ansible#16467. Assisted-By: Haiku

github-actions Bot added the component:api label May 29, 2026

coderabbitai Bot reviewed May 29, 2026

View reviewed changes

Comment thread awx/main/scheduler/task_manager.py Outdated

fosterseth requested a review from AlanCoding May 29, 2026 15:50

jessicamack approved these changes May 29, 2026

View reviewed changes

chrismeyersfsu reviewed May 29, 2026

View reviewed changes

Comment thread awx/main/scheduler/task_manager.py Outdated

Reset orphaned waiting jobs when controller node is deprovisioned

6013798

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fosterseth force-pushed the reap-orphaned-waiting-jobs branch from 071a863 to 6013798 Compare June 1, 2026 13:47

AlanCoding reviewed Jun 2, 2026

View reviewed changes

Comment thread awx/main/scheduler/task_manager.py

AlanCoding approved these changes Jun 2, 2026

View reviewed changes

fosterseth merged commit 5cc467d into ansible:devel Jun 2, 2026
21 checks passed

AlanCoding mentioned this pull request Jun 4, 2026

Use in-memory data for reaper waiting job logic, consolidate with heartbeat #16482

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[AAP-74497] Reset orphaned waiting jobs when controller node is deprovisioned#16467

[AAP-74497] Reset orphaned waiting jobs when controller node is deprovisioned#16467
fosterseth merged 1 commit into
ansible:develfrom
fosterseth:reap-orphaned-waiting-jobs

fosterseth commented May 29, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 29, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

jessicamack left a comment

Uh oh!

Uh oh!

sonarqubecloud Bot commented Jun 1, 2026

Uh oh!

AlanCoding commented Jun 2, 2026

Uh oh!

Uh oh!

AlanCoding left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

fosterseth commented May 29, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why existing reaping paths don't catch this

ISSUE TYPE

COMPONENT NAME

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

jessicamack left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

sonarqubecloud Bot commented Jun 1, 2026

Quality Gate passed

Uh oh!

AlanCoding commented Jun 2, 2026

Uh oh!

Uh oh!

AlanCoding left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

fosterseth commented May 29, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 29, 2026 •

edited

Loading