Skip to content

Use in-memory data for reaper waiting job logic, consolidate with heartbeat#16482

Open
AlanCoding wants to merge 5 commits into
ansible:develfrom
AlanCoding:new_reaper
Open

Use in-memory data for reaper waiting job logic, consolidate with heartbeat#16482
AlanCoding wants to merge 5 commits into
ansible:develfrom
AlanCoding:new_reaper

Conversation

@AlanCoding

@AlanCoding AlanCoding commented Jun 4, 2026

Copy link
Copy Markdown
Member
SUMMARY

Putting up a concrete implementation of my rambling review comment from @fosterseth patch

#16467

The heartbeat change here is slightly architectural. When a peer instance is marked offline, the jobs from that will not right away be reaped, as another tick of the task manager, and another background task is necessary. However, when a node goes down we are not in a rush to reap jobs from it, and we can get in trouble for doing it too fast. Plus, all of these have immediate re-scheduling happening so if the system isn't bogged down we can expect it to happen as fast as the system resources can carry it.

ISSUE TYPE
  • Bug, Docs Fix or other nominal change
COMPONENT NAME
  • API

Summary by CodeRabbit

  • New Features

    • New background tasks to detect and handle orphaned running and waiting jobs.
  • Bug Fixes

    • Improved handling when execution or controller nodes are unavailable or deprovisioned.
    • Running jobs on invalid execution nodes are failed with a clear explanation.
    • Waiting jobs on invalid controller nodes are reset to pending and reprocessed.
  • Tests

    • Expanded functional tests covering orphan detection, reaping, and heartbeat/lost-instance behavior.

@coderabbitai

coderabbitai Bot commented Jun 4, 2026

Copy link
Copy Markdown

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 77acd144-7b15-409b-bc68-18550ffa8f09

📥 Commits

Reviewing files that changed from the base of the PR and between ebc8a65 and 601bd11.

📒 Files selected for processing (1)
  • awx/main/tests/functional/test_dispatch.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • awx/main/tests/functional/test_dispatch.py

📝 Walkthrough

Walkthrough

This PR refactors orphaned job cleanup from synchronous inline code to background dispatcher tasks. It introduces new reaper tasks and a query helper, wires them through task-manager scheduling, updates lost-instance heartbeat handling, and adds comprehensive test coverage.

Changes

Orphaned Job Reaping Refactoring

Layer / File(s) Summary
Reaper background tasks and query helper
awx/main/dispatch/reaper.py
Adds get_orphaned_running_jobs_query() helper that filters UnifiedJob records by execution node validity, and two new dispatcher tasks: reap_orphaned_jobs() marks orphaned running jobs as failed, while reset_orphaned_waiting_jobs() resets orphaned waiting jobs to pending with cleared node fields and triggers task-manager scheduling.
Task Manager orphaned-job queueing
awx/main/scheduler/task_manager.py
Replaces reap_jobs_from_orphaned_instances() with queue_orphaned_job_reaping(), which computes valid node hostnames from in-memory instance data, checks for orphaned waiting and running jobs using already-loaded task data, and conditionally queues the background dispatcher tasks instead of performing reaping directly. Updates _schedule() to call the new queueing method.
Lost instance handling and task manager triggering
awx/main/tasks/system.py
Refactors _heartbeat_handle_lost_instances() to track affected instances in instances_marked_down, removes inline job reaping and resetting logic, and calls ScheduleTaskManager().schedule() to defer orphaned-job handling to the new task-manager dispatcher-task flow.
Test updates and new reaper task coverage
awx/main/tests/functional/test_dispatch.py
Imports reaper task functions directly and updates existing dispatch tests to simulate background behavior by calling reset_orphaned_waiting_jobs() explicitly. Adds new functional tests for reap_orphaned_jobs(), get_orphaned_running_jobs_query(), and reset_orphaned_waiting_jobs() behavior including filtering, status transitions, and field clearing.

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main changes: refactoring job reaping logic to use in-memory instance data and consolidating heartbeat/task manager/background task flow.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@AlanCoding

Copy link
Copy Markdown
Member Author

linter

linters: commands[1]> flake8 awx awxkit awx_collection
awx/main/dispatch/reaper.py:105:17: F541 f-string is missing placeholders
awx/main/dispatch/reaper.py:123:17: F541 f-string is missing placeholders
awx/main/scheduler/task_manager.py:28:1: F401 'awx.main.dispatch.reaper.reap_job' imported but unused
awx/main/scheduler/task_manager.py:29:1: F401 'awx.main.models.Instance' imported but unused
linters: exit 1 (6.70 seconds) /awx_devel> flake8 awx awxkit awx_collection pid=277

that's all

…round task flow

Refactored fragmented reaping logic across 5 entry points into a single architecture: heartbeat now only
handles instance state transitions (marking offline/deleting), then triggers the task manager to detect
orphaned jobs; task manager uses already-loaded in-memory instance data and a fast .exists() check to
detect orphaned jobs without expensive DB queries, queuing idempotent background tasks with
on_duplicate='queue_one' if needed; and background tasks reap_orphaned_jobs() and
reset_orphaned_waiting_jobs() run independently with no arguments to avoid worker saturation. Extracted
shared get_orphaned_running_jobs_query() function to DRY up query logic, removed inline job reaping from
heartbeat and task manager hot paths, and updated tests to verify the new pattern. Task manager cycle
load is now minimal (O(n) memory iteration + fast existence check), eliminating the expensive periodic DB
queries from PR ansible#16467.

Assisted-By: Haiku
@sonarqubecloud

sonarqubecloud Bot commented Jun 9, 2026

Copy link
Copy Markdown

@AlanCoding AlanCoding marked this pull request as ready for review June 9, 2026 19:52

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@awx/main/dispatch/reaper.py`:
- Line 107: The condition is using the method object instead of calling it:
change the check from referencing job.is_container_group_task to calling
job.is_container_group_task() so the boolean result is evaluated; update the if
statement in reaper.py where the job variable (UnifiedJob) is tested to use
job.is_container_group_task() so container-group tasks are correctly skipped.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 8a05ee96-852c-4983-903f-c1096e377a4b

📥 Commits

Reviewing files that changed from the base of the PR and between 49e21d7 and ebc8a65.

📒 Files selected for processing (4)
  • awx/main/dispatch/reaper.py
  • awx/main/scheduler/task_manager.py
  • awx/main/tasks/system.py
  • awx/main/tests/functional/test_dispatch.py


logger.info('Reaping orphaned running jobs')
for job in orphaned_running:
if not job.is_container_group_task:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

is_container_group_task is a method and must be called with parentheses.

Per context snippet 3, is_container_group_task() is defined as a method on UnifiedJob. The current code accesses it as a property, which will always be truthy (the method object itself), causing container-group tasks to never be skipped as intended.

🐛 Proposed fix
-        if not job.is_container_group_task:
+        if not job.is_container_group_task():
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if not job.is_container_group_task:
if not job.is_container_group_task():
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@awx/main/dispatch/reaper.py` at line 107, The condition is using the method
object instead of calling it: change the check from referencing
job.is_container_group_task to calling job.is_container_group_task() so the
boolean result is evaluated; update the if statement in reaper.py where the job
variable (UnifiedJob) is tested to use job.is_container_group_task() so
container-group tasks are correctly skipped.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant