AAP-78420: Updating the remediation condition in system.py to check for cpu and memory values set to 0. #16499
thanujdesu11 wants to merge 1 commit into
SUMMARY
Execution nodes get stuck permanently at `capacity=1` with `cpu=0` and `memory=0` if the initial health check fails. The remediation condition that re-runs the health check is never triggered, because its if statement only checks for `capacity == 0`. The capacity calculations have a floor of 1, so that condition is never met.

The `max(1, ...)` floor in the capacity formulas and the `capacity == 0` remediation check are contradictory:

- `get_cpu_effective_capacity(0)` → `max(1, int(0 * 4))` → 1
- `get_mem_effective_capacity(0)` → `max(1, (0 - 2GB) // 100MB)` → 1
- `set_capacity_value()` → `min(1, 1) + (max(1, 1) - min(1, 1)) * 1.0` → 1

Since capacity is 1 (not 0), the remediation condition at line 636 never matches.
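The arithmetic above can be reproduced with a small standalone sketch. The functions below are simplified stand-ins written only from the formulas quoted in this description, not the actual AWX helpers; the constant names and the `capacity_value` signature are hypothetical.

```python
# Simplified stand-ins for the capacity formulas quoted above.
# Constants are assumptions taken from the description ("* 4", "2GB", "100MB").
FORKS_PER_CPU = 4
RESERVED_MEM_BYTES = 2 * 1024**3
MEM_PER_FORK_BYTES = 100 * 1024**2


def cpu_effective_capacity(cpu_count):
    # The max(1, ...) floor means a cpu count of 0 still yields 1.
    return max(1, int(cpu_count * FORKS_PER_CPU))


def mem_effective_capacity(mem_bytes):
    # With 0 bytes the inner value is negative, so the floor again yields 1.
    return max(1, (mem_bytes - RESERVED_MEM_BYTES) // MEM_PER_FORK_BYTES)


def capacity_value(cpu_capacity, mem_capacity, capacity_adjustment=1.0):
    # Interpolate between the smaller and larger of the two capacities.
    lower = min(cpu_capacity, mem_capacity)
    upper = max(cpu_capacity, mem_capacity)
    return int(lower + (upper - lower) * capacity_adjustment)


cpu_cap = cpu_effective_capacity(0)
mem_cap = mem_effective_capacity(0)
capacity = capacity_value(cpu_cap, mem_cap)

# A check for capacity == 0 can never be true for these inputs.
print(cpu_cap, mem_cap, capacity, capacity == 0)
```

Under these assumptions, a node that reports no CPU and no memory still ends up with a capacity of 1, so any check for `capacity == 0` is skipped.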
I updated the remediation condition in `awx/main/tasks/system.py:636` to also rerun the health check when `instance.cpu == 0 and instance.memory == 0`. With this change, the health check runs even if capacity is 1, as long as the cpu and memory values are both 0.

ISSUE TYPE
COMPONENT NAME
STEPS TO REPRODUCE AND EXTRA INFO
- Node state after the failed initial health check: `cpu=0`, `memory=0`, `version=ansible-runner-???`, `cpu_capacity=1`, `mem_capacity=1`, `capacity=1`, `node_state=ready`
- The node remains at `capacity=1` indefinitely with no automatic remediation

Expected Result: The remediation logic should detect that the CPU and memory values are 0 and eventually retrigger a health check.
Actual Result: Nodes are stuck without retriggering a health check, because the capacity value is stuck at 1.
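The difference between the actual and expected results can be illustrated with a minimal sketch of the two predicates. `NodeSnapshot` and both function names are hypothetical; the real condition in `awx/main/tasks/system.py` may include other clauses (for example, on timing or node type) that are not shown in this description.

```python
from dataclasses import dataclass


@dataclass
class NodeSnapshot:
    # Hypothetical stand-in for the instance fields mentioned in this PR.
    capacity: int
    cpu: int
    memory: int


def needs_health_check_before(node):
    # Original behaviour as described: only a capacity of 0 triggers a re-check.
    return node.capacity == 0


def needs_health_check_after(node):
    # Proposed behaviour as described: also re-check when cpu and memory are both 0.
    return node.capacity == 0 or (node.cpu == 0 and node.memory == 0)


stuck = NodeSnapshot(capacity=1, cpu=0, memory=0)
healthy = NodeSnapshot(capacity=8, cpu=2, memory=4 * 1024**3)

print(needs_health_check_before(stuck))   # False
print(needs_health_check_after(stuck))    # True
print(needs_health_check_after(healthy))  # False
```

The extra clause only matches nodes whose cpu and memory are both 0, so a node with populated values is unaffected in this sketch.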