AAP-78420: Updating the remediation condition in system.py to check for cpu and memory values set to 0. #16499

Draft
thanujdesu11 wants to merge 1 commit into ansible:devel from thanujdesu11:AAP-78420


@thanujdesu11

SUMMARY

Execution nodes get stuck permanently at capacity=1 with cpu=0 and memory=0 if the initial health check fails. The remediation condition that re-runs the health check is never triggered because its if statement only checks for capacity == 0. The capacity formulas floor the value at 1, so for these nodes the condition is never met.

The max(1, ...) floor in the capacity formulas and the capacity == 0 remediation check are contradictory:

get_cpu_effective_capacity(0): max(1, int(0 * 4)) → 1
get_mem_effective_capacity(0): max(1, (0 - 2GB) // 100MB) → 1
set_capacity_value(): min(1, 1) + (max(1, 1) - min(1, 1)) * 1.0 → 1

Since capacity is 1 (not 0), the remediation condition at line 636 never matches.
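
The arithmetic above can be reproduced with a minimal sketch. The helper names below are hypothetical stand-ins rather than the actual AWX functions, and the constants (4 forks per CPU, 2 GB reserved, 100 MB per fork, adjustment of 1.0) are taken from the formulas in this description, not read from AWX settings.

```python
# Simplified stand-ins for the capacity helpers described above.
# Constants come from the PR description and are assumptions here.
FORKS_PER_CPU = 4
RESERVED_MEM_MB = 2048
MEM_PER_FORK_MB = 100


def cpu_effective_capacity(cpu_count):
    # Floor of 1 even when no CPU data was collected.
    return max(1, int(cpu_count * FORKS_PER_CPU))


def mem_effective_capacity(mem_mb):
    # (0 - 2048) // 100 is negative, so the floor of 1 wins.
    return max(1, (mem_mb - RESERVED_MEM_MB) // MEM_PER_FORK_MB)


def combined_capacity(cpu_capacity, mem_capacity, capacity_adjustment=1.0):
    # Interpolate between the smaller and larger of the two capacities.
    lo = min(cpu_capacity, mem_capacity)
    hi = max(cpu_capacity, mem_capacity)
    return int(lo + (hi - lo) * capacity_adjustment)


cpu_cap = cpu_effective_capacity(0)
mem_cap = mem_effective_capacity(0)
capacity = combined_capacity(cpu_cap, mem_cap)
print(cpu_cap, mem_cap, capacity)  # 1 1 1
```

With cpu=0 and memory=0 every intermediate value is floored to 1, so a check for capacity == 0 cannot distinguish a node that was never measured from a very small one.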

I updated the remediation condition at awx/main/tasks/system.py:636 to also re-run the health check when instance.cpu == 0 and instance.memory == 0. With this change, the health check runs even when capacity is 1, as long as the cpu and memory values are both 0.

ISSUE TYPE
  • Bug, Docs Fix or other nominal change
COMPONENT NAME
  • API
STEPS TO REPRODUCE AND EXTRA INFO
  1. Deploy a containerized AAP cluster with 2 control nodes and 2 execution nodes
  2. During initial setup, if the receptor mesh is not fully established when the first execution_node_health_check runs, the health check may return partial/empty data
  3. The execution nodes end up with cpu=0, memory=0, version=ansible-runner-???, cpu_capacity=1, mem_capacity=1, capacity=1, node_state=ready
  4. Wait any amount of time - the nodes remain stuck at capacity=1 indefinitely with no automatic remediation

Expected Result: Remediation logic should detect that the CPU and Memory values are at 0 and retrigger a health check eventually.

Actual Result: Nodes are stuck without retriggering a health check, because the capacity value is stuck at 1.

# Before
elif instance.capacity == 0 and instance.enabled:

# After
elif (instance.capacity == 0 or (instance.cpu == 0 and instance.memory == 0)) and instance.enabled:
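
As a rough illustration of the before and after conditions, the sketch below evaluates both predicates against a stuck node. `FakeInstance` and the two `needs_health_check_*` functions are hypothetical names introduced only for this example; the real check is an `elif` branch inside awx/main/tasks/system.py.

```python
from dataclasses import dataclass


@dataclass
class FakeInstance:
    # Hypothetical stand-in carrying only the fields the condition reads.
    capacity: int
    cpu: float
    memory: int
    enabled: bool


def needs_health_check_old(instance):
    # Condition before this change.
    return instance.capacity == 0 and instance.enabled


def needs_health_check_new(instance):
    # Condition after this change: also catch nodes with no cpu/memory data.
    never_measured = instance.cpu == 0 and instance.memory == 0
    return (instance.capacity == 0 or never_measured) and instance.enabled


stuck = FakeInstance(capacity=1, cpu=0, memory=0, enabled=True)
healthy = FakeInstance(capacity=16, cpu=4, memory=8 * 2**30, enabled=True)
disabled = FakeInstance(capacity=1, cpu=0, memory=0, enabled=False)

for name, inst in [("stuck", stuck), ("healthy", healthy), ("disabled", disabled)]:
    print(name, needs_health_check_old(inst), needs_health_check_new(inst))
```

Only the stuck, enabled node changes behavior: the old condition skips it and the new one selects it, while healthy and disabled nodes are treated the same as before.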

@coderabbitai

coderabbitai Bot commented Jun 12, 2026


Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: d1877970-34ee-4858-b9e2-4d7f9cf35a28

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

