Add LLM eval tests for Kubernetes tool usage patterns #2048

Open

aantn wants to merge 5 commits into master from claude/fix-k8s-approval-issue-xNNVP

Conversation

aantn (Collaborator) commented May 15, 2026

Summary

Add two new LLM evaluation test fixtures to validate that Holmes uses dedicated Kubernetes tools instead of shell pipelines for common query patterns. These tests verify that the LLM prefers efficient, batched queries over iterative bash loops.

Changes

Test Case 1: Kubernetes Event Grouping (259_k8s_event_grouping_prefer_dedicated_tools)

  • Scenario: Count pods stuck on image pull failures across 4 namespaces (3, 1, 2, and 4 pods respectively)
  • Validation:
    • Holmes must report correct per-namespace counts and identify the namespace with the most failures
    • Must use dedicated Kubernetes tools (kubernetes_jq_query, kubernetes_tabular_query, kubernetes_count, or kubectl_find_resource)
    • Must NOT use bash pipelines with awk, sort, uniq -c, or wc for aggregation (see the sketch after this list)
  • Setup: Creates 10 pods across 4 namespaces with invalid image registries to trigger ImagePullBackOff errors
  • Pre-test check: Verification ensures all pods reach the image-pull-failure state before Holmes is queried
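
For illustration, a minimal sketch of the boundary this test draws; the commands are representative shapes rather than the exact strings Holmes emits, and the namespace name is an illustrative placeholder in the app-259-* pattern:

```bash
# Anti-pattern the eval fails: aggregating kubectl output with a shell pipeline
kubectl get pods -A | grep ImagePullBackOff | awk '{print $1}' | sort | uniq -c

# Acceptable shape: a plain per-namespace query whose output is read directly,
# or (preferably) a single call to a dedicated tool such as kubernetes_count
kubectl get pods -n app-259-frontend   # read the STATUS column as-is
```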

Test Case 2: Multi-Pod Node Lookup (260_k8s_multi_pod_node_lookup_prefer_dedicated_tools)

  • Scenario: Look up which node each of 5 pods is running on across 3 namespaces
  • Validation:
    • Holmes must report the nodeName for every pod
    • Must use batched Kubernetes queries rather than shell loops iterating per pod
    • Must NOT use bash for/while loops with compound statements (see the sketch after this list)
  • Setup: Creates 5 pods across 3 namespaces using the busybox image
  • Pre-test check: Verification ensures all pods are scheduled before Holmes is queried
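
A comparable sketch for this case, with placeholder namespace names (the real fixtures use three app-260-* namespaces):

```bash
# Anti-pattern the eval fails: one kubectl call per pod inside a shell loop
for ns in app-260-a app-260-b app-260-c; do   # placeholder namespace names
  kubectl get pods -n "$ns" -o wide
done

# Batched alternative: a single call that returns every pod's node at once
kubectl get pods -A -o custom-columns=NAMESPACE:.metadata.namespace,POD:.metadata.name,NODE:.spec.nodeName
```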

Implementation Details

Both test fixtures follow the established LLM eval pattern:

  • manifest.yaml: Kubernetes resource definitions for test scenario
  • test_case.yaml: User prompt, expected outputs, and setup/teardown scripts (a trimmed sketch follows this list)
  • include_tool_calls: true: Enables validation of which tools Holmes actually calls
  • Pre-test setup waits for pods to reach desired state (image pull failures or scheduled)
  • Post-test cleanup removes all created namespaces
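
A trimmed, hypothetical test_case.yaml following this pattern (the field names match the snippets quoted in the review comments below; the values and script names are illustrative):

```yaml
user_prompt: >
  Across the app-259-* namespaces, how many pods are stuck failing to pull
  their image in each namespace, and which namespace has the most?
expected_output:
  - "Must report the correct per-namespace counts (3, 1, 2, and 4)"
  - "Must NOT aggregate kubectl output with awk, sort, uniq -c, or wc"
include_tool_calls: true
before_test: ./wait_for_pods.sh       # illustrative: poll until pods reach the expected state
after_test: ./delete_namespaces.sh    # illustrative: remove the app-259-* namespaces
```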

These tests validate that Holmes learns to prefer efficient, dedicated Kubernetes tools over shell scripting patterns that require user approval and are slower/less reliable.

https://claude.ai/code/session_01ShMKrLaC9Dddn41ZJM6CWW

Summary by CodeRabbit

  • Tests
    • Added multiple new Kubernetes LLM test fixtures that validate pod counts, label distributions, per-pod node lookups, namespace summaries, outlier detection, and cross-namespace comparisons. All enforce using dedicated Kubernetes queries (no shell aggregation/loops) and include setup/teardown steps.
  • Chores
    • Added a pytest marker for tests that must avoid bash/shell loops and updated bash tool guidance to prefer dedicated Kubernetes queries or batched requests.

Review Change Stack

Two failing evals capture the anti-pattern where the LLM reaches for
bash with awk/sort/uniq/for-loops to aggregate Kubernetes data when
dedicated tools (kubernetes_jq_query, kubernetes_tabular_query,
kubernetes_count) would do the same job in one call without triggering
approval prompts.

259_k8s_event_grouping_prefer_dedicated_tools: pods stuck on image
pull failures across 4 namespaces; Holmes must group-by-namespace
without composing a shell aggregation pipeline.

260_k8s_multi_pod_node_lookup_prefer_dedicated_tools: 5 pods across 3
namespaces; Holmes must report each pod's nodeName without a
for/while shell loop.

Both are tagged 'hard' since the LLM currently fails them by choosing
the bash anti-pattern; they will move to easy/regression once the
fix lands.

Signed-off-by: Claude <noreply@anthropic.com>
@claude claude Bot left a comment

Claude Code Review

This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.

Tip: disable this comment in your organization's Code Review settings.

github-actions Bot (Contributor) commented May 15, 2026

📂 Previous Runs

📜 #4 · Run @ __f39a8f3__ (#25936448852) — May 15, 19:18 UTC

✅ Results of HolmesGPT evals

Automatically triggered by commit f39a8f3 on branch claude/fix-k8s-approval-issue-xNNVP (labels: evals-tag-kubernetes-no-bash-approval)

View workflow logs

Results of HolmesGPT evals

  • ask_holmes: 13/13 test cases were successful, 0 regressions
| Status | Test case | Time | Turns | Tools | Cost | Total tokens | Input | Max input | Output | Max output | Cached | Non-cached | Reasoning | Compactions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 📄 | 09_crashpod | 39.0s | 5 | 11 | $0.2891 | 110,174 | 107,720 | 25,030 | 2,454 | 856 | 81,875 | 25,845 | 251 | |
| 📄 | 101_loki_historical_logs_pod_deleted | 96.8s | 9 | 20 | $0.5071 | 234,589 | 228,430 | 31,774 | 6,159 | 952 | 194,416 | 34,014 | 1,769 | |
| 📄 | 112_find_pvcs_by_uuid | 19.0s | 3 | 4 | $0.2033 | 61,055 | 59,927 | 21,825 | 1,128 | 583 | 37,844 | 22,083 | 253 | |
| 📄 | 12_job_crashing | 43.6s | 5 | 14 | $0.3416 | 129,979 | 127,330 | 29,910 | 2,649 | 736 | 95,492 | 31,838 | 213 | |
| 📄 | 176_network_policy_blocking_traffic_no_skills | 57.1s | 7 | 17 | $0.3947 | 173,805 | 170,169 | 29,001 | 3,636 | 939 | 137,593 | 32,576 | 673 | |
| 📄 | 227_count_configmaps_per_namespace[0] | 20.0s | 4 | 9 | $0.2050 | 76,841 | 75,716 | 20,688 | 1,125 | 591 | 54,709 | 21,007 | 53 | |
| 📄 | 243_pod_names_contain_service | 33.1s | 4 | 10 | $0.2564 | 83,662 | 81,436 | 23,162 | 2,226 | 919 | 57,495 | 23,941 | 265 | |
| 📄 | 24_misconfigured_pvc | 45.3s | 6 | 15 | $0.3210 | 134,296 | 131,256 | 24,968 | 3,040 | 1,011 | 104,813 | 26,443 | 256 | |
| 📄 | 259_k8s_event_grouping_prefer_dedicated_tools | 16.1s | 3 | 6 | $0.1823 | 56,619 | 55,611 | 19,604 | 1,008 | 654 | 36,003 | 19,608 | 65 | |
| 📄 | 260_k8s_multi_pod_node_lookup_prefer_dedicated_tools | 18.2s | 3 | 7 | $0.1968 | 57,959 | 56,646 | 20,185 | 1,313 | 867 | 36,092 | 20,554 | 48 | |
| 📄 | 43_current_datetime_from_prompt | 3.7s | 1 | | $0.1200 | 17,069 | 16,940 | 16,940 | 129 | 129 | 0 | 16,940 | 86 | |
| 📄 | 51_logs_summarize_errors | 21.5s | 4 | 5 | $0.2083 | 77,960 | 76,812 | 21,261 | 1,148 | 423 | 55,546 | 21,266 | 44 | |
| 📄 | 61_exact_match_counting | 9.9s | 3 | 3 | $0.1522 | 52,858 | 52,495 | 17,920 | 363 | 216 | 34,571 | 17,924 | 32 | |
| | Total | 32.6s avg | 4.4 avg | 10.1 avg | $3.3778 | 1,266,866 | 1,240,488 | 31,774 | 26,378 | 1,011 | 926,449 | 314,039 | 4,008 | |

Benchmark comparison unavailable: No ci-benchmark experiments found

Benchmark Comparison Details

Baseline: latest ci-benchmark experiment on master

Status: No ci-benchmark experiments found

Comparison indicators:

  • ±0% — diff under 10% (within noise threshold)
  • ↑N%/↓N% — diff 10-25%
  • ↑N%/↓N% — diff over 25% (significant)
📜 #3 · Run @ __900fdf0__ (#25927341250) — May 15, 15:58 UTC

✅ Results of HolmesGPT evals

Automatically triggered by commit 900fdf0 on branch claude/fix-k8s-approval-issue-xNNVP (labels: evals-tag-kubernetes-no-bash-approval)

View workflow logs

Results of HolmesGPT evals

  • ask_holmes: 13/13 test cases were successful, 0 regressions
| Status | Test case | Time | Turns | Tools | Cost | Total tokens | Input | Max input | Output | Max output | Cached | Non-cached | Reasoning | Compactions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | 09_crashpod | 38.7s | 5 | 11 | $0.2908 | 109,274 | 106,702 | 24,472 | 2,572 | 955 | 80,905 | 25,797 | 418 | |
| | 101_loki_historical_logs_pod_deleted | 77.1s | 7 | 17 | $0.4171 | 171,621 | 166,920 | 29,206 | 4,701 | 920 | 135,429 | 31,491 | 886 | |
| | 112_find_pvcs_by_uuid | 18.3s | 3 | 4 | $0.2034 | 61,083 | 59,951 | 21,822 | 1,132 | 614 | 37,873 | 22,078 | 254 | |
| | 12_job_crashing | 54.6s | 8 | 17 | $0.3873 | 197,692 | 194,503 | 28,454 | 3,189 | 648 | 163,495 | 31,008 | 276 | |
| | 176_network_policy_blocking_traffic_no_skills | 54.8s | 5 | 16 | $0.3228 | 115,880 | 112,648 | 26,750 | 3,232 | 942 | 85,342 | 27,306 | 543 | |
| | 227_count_configmaps_per_namespace[0] | 21.0s | 4 | 9 | $0.2048 | 76,824 | 75,702 | 20,683 | 1,122 | 591 | 54,709 | 20,993 | 53 | |
| | 243_pod_names_contain_service | 42.8s | 6 | 13 | $0.3129 | 136,103 | 133,415 | 25,472 | 2,688 | 665 | 107,037 | 26,378 | 301 | |
| | 24_misconfigured_pvc | 39.7s | 5 | 12 | $0.2786 | 106,500 | 104,088 | 23,364 | 2,412 | 802 | 79,256 | 24,832 | 354 | |
| | 259_k8s_event_grouping_prefer_dedicated_tools | 19.7s | 3 | 6 | $0.2019 | 58,202 | 56,803 | 20,202 | 1,399 | 975 | 35,710 | 21,093 | 53 | |
| | 260_k8s_multi_pod_node_lookup_prefer_dedicated_tools | 14.0s | 3 | 3 | $0.1689 | 54,893 | 54,142 | 18,721 | 751 | 461 | 35,417 | 18,725 | 47 | |
| | 43_current_datetime_from_prompt | 3.9s | 1 | | $0.1199 | 17,066 | 16,940 | 16,940 | 126 | 126 | 0 | 16,940 | 83 | |
| | 51_logs_summarize_errors | 22.8s | 4 | 5 | $0.2074 | 77,649 | 76,490 | 21,099 | 1,159 | 434 | 55,386 | 21,104 | 41 | |
| | 61_exact_match_counting | 11.1s | 3 | 3 | $0.1522 | 52,851 | 52,488 | 17,917 | 363 | 216 | 34,567 | 17,921 | 32 | |
| | Total | 32.2s avg | 4.4 avg | 9.7 avg | $3.2680 | 1,235,638 | 1,210,792 | 29,206 | 24,846 | 975 | 905,126 | 305,666 | 3,341 | |

Benchmark comparison unavailable: No ci-benchmark experiments found

Benchmark Comparison Details

Baseline: latest ci-benchmark experiment on master

Status: No ci-benchmark experiments found

Comparison indicators:

  • ±0% — diff under 10% (within noise threshold)
  • ↑N%/↓N% — diff 10-25%
  • ↑N%/↓N% — diff over 25% (significant)
📜 #2 · Run @ __900fdf0__ (#25926210430) — May 15, 15:41 UTC

✅ Results of HolmesGPT evals

Automatically triggered by commit 900fdf0 on branch claude/fix-k8s-approval-issue-xNNVP

View workflow logs

Results of HolmesGPT evals

  • ask_holmes: 11/11 test cases were successful, 0 regressions
| Status | Test case | Time | Turns | Tools | Cost | Total tokens | Input | Max input | Output | Max output | Cached | Non-cached | Reasoning | Compactions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | 09_crashpod | 36.0s | 5 | 10 | $0.2660 | 105,054 | 102,878 | 23,198 | 2,176 | 936 | 79,101 | 23,777 | 251 | |
| | 101_loki_historical_logs_pod_deleted | 67.3s | 7 | 16 | $0.3831 | 165,801 | 161,916 | 27,111 | 3,885 | 925 | 131,634 | 30,282 | 686 | |
| | 112_find_pvcs_by_uuid | 20.5s | 3 | 4 | $0.2052 | 61,204 | 60,021 | 21,885 | 1,183 | 598 | 37,880 | 22,141 | 244 | |
| | 12_job_crashing | 45.1s | 6 | 14 | $0.3767 | 160,352 | 157,693 | 30,119 | 2,659 | 1,018 | 122,248 | 35,445 | 221 | |
| | 176_network_policy_blocking_traffic_no_skills | 54.6s | 6 | 18 | $0.3811 | 146,709 | 142,867 | 28,734 | 3,842 | 1,021 | 111,075 | 31,792 | 715 | |
| | 227_count_configmaps_per_namespace[0] | 18.7s | 4 | 9 | $0.2095 | 76,827 | 75,702 | 20,686 | 1,125 | 591 | 53,781 | 21,921 | 53 | |
| | 243_pod_names_contain_service | 44.8s | 6 | 14 | $0.3193 | 137,494 | 134,657 | 25,813 | 2,837 | 700 | 108,048 | 26,609 | 322 | |
| | 24_misconfigured_pvc | 45.7s | 7 | 16 | $0.3329 | 156,870 | 153,920 | 24,847 | 2,950 | 1,015 | 127,058 | 26,862 | 309 | |
| | 43_current_datetime_from_prompt | 4.7s | 1 | | $0.1200 | 17,067 | 16,940 | 16,940 | 127 | 127 | 0 | 16,940 | 84 | |
| | 51_logs_summarize_errors | 22.8s | 4 | 5 | $0.2086 | 77,897 | 76,729 | 21,221 | 1,168 | 443 | 55,503 | 21,226 | 44 | |
| | 61_exact_match_counting | 11.0s | 3 | 3 | $0.1522 | 52,859 | 52,496 | 17,919 | 363 | 216 | 34,573 | 17,923 | 32 | |
| | Total | 33.8s avg | 4.7 avg | 10.9 avg | $2.9546 | 1,158,134 | 1,135,819 | 30,119 | 22,315 | 1,021 | 860,901 | 274,918 | 2,961 | |

Benchmark comparison unavailable: No ci-benchmark experiments found

Benchmark Comparison Details

Baseline: latest ci-benchmark experiment on master

Status: No ci-benchmark experiments found

Comparison indicators:

  • ±0% — diff under 10% (within noise threshold)
  • ↑N%/↓N% — diff 10-25%
  • ↑N%/↓N% — diff over 25% (significant)
⚠️ 1 older run truncated

Older runs were omitted to stay under GitHub's 64KB comment size limit.


⚠️ Eval Results (with failures)

Automatically triggered by commit 9b70f66 on branch claude/fix-k8s-approval-issue-xNNVP (labels: evals-tag-kubernetes-no-bash-approval)

View workflow logs

Results of HolmesGPT evals

  • ask_holmes: 16/17 test cases were successful, 1 regression
| Status | Test case | Time | Turns | Tools | Cost | Total tokens | Input | Max input | Output | Max output | Cached | Non-cached | Reasoning | Compactions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 📄 | 09_crashpod | 34.7s | 4 | 10 | $0.2735 | 90,638 | 88,403 | 25,398 | 2,235 | 929 | 62,451 | 25,952 | 277 | |
| 📄 | 101_loki_historical_logs_pod_deleted | 78.6s | 8 | 17 | $0.4529 | 205,671 | 200,449 | 30,081 | 5,222 | 933 | 168,584 | 31,865 | 1,382 | |
| 📄 | 112_find_pvcs_by_uuid | 16.9s | 3 | 4 | $0.1938 | 59,828 | 58,814 | 20,926 | 1,014 | 604 | 37,634 | 21,180 | 138 | |
| 📄 | 12_job_crashing | 39.8s | 6 | 14 | $0.3087 | 136,723 | 134,191 | 24,986 | 2,532 | 973 | 107,747 | 26,444 | 187 | |
| 📄 | 176_network_policy_blocking_traffic_no_skills | 56.2s | 6 | 14 | $0.3457 | 143,764 | 140,441 | 27,620 | 3,323 | 767 | 112,345 | 28,096 | 756 | |
| 📄 | 227_count_configmaps_per_namespace[0] | 19.5s | 4 | 9 | $0.2145 | 79,088 | 77,954 | 21,244 | 1,134 | 590 | 55,474 | 22,480 | 76 | |
| 📄 | 243_pod_names_contain_service | 34.0s | 5 | 10 | $0.2706 | 106,887 | 104,701 | 23,431 | 2,186 | 902 | 80,314 | 24,387 | 286 | |
| 📄 | 24_misconfigured_pvc | 39.7s | 5 | 15 | $0.2972 | 110,821 | 108,099 | 24,680 | 2,722 | 1,044 | 82,056 | 26,043 | 206 | |
| 📄 | 259_k8s_event_grouping_prefer_dedicated_tools | 16.9s | 3 | 6 | $0.1880 | 58,180 | 57,196 | 20,121 | 984 | 630 | 36,623 | 20,573 | 38 | |
| 📄 | 260_k8s_multi_pod_node_lookup_prefer_dedicated_tools | 18.2s | 3 | 7 | $0.1918 | 58,400 | 57,271 | 20,012 | 1,129 | 838 | 36,713 | 20,558 | 165 | |
| 📄 | 261_k8s_vague_overview_no_bash_aggregation | 30.8s | 4 | 11 | $0.2420 | 82,285 | 80,274 | 21,990 | 2,011 | 707 | 57,584 | 22,690 | 132 | |
| 📄 | 262_k8s_label_distribution_no_bash_aggregation | 15.0s | 3 | 3 | $0.1684 | 55,661 | 54,997 | 18,960 | 664 | 336 | 36,033 | 18,964 | 156 | |
| 📄 | 263_k8s_namespace_summary_no_bash_aggregation | 16.1s | 3 | 6 | $0.1921 | 58,072 | 56,929 | 19,994 | 1,143 | 640 | 36,343 | 20,586 | 78 | |
| 📄 | 264_k8s_find_outliers_no_bash_aggregation | 13.2s | 3 | 3 | $0.1698 | 56,082 | 55,424 | 19,161 | 658 | 390 | 36,259 | 19,165 | 73 | |
| 📄 | 43_current_datetime_from_prompt | 3.2s | 1 | | $0.1231 | 17,594 | 17,491 | 17,491 | 103 | 103 | 0 | 17,491 | 61 | |
| 📄 | 51_logs_summarize_errors | 22.3s | 4 | 5 | $0.2129 | 79,968 | 78,794 | 21,700 | 1,174 | 388 | 57,089 | 21,705 | 85 | |
| 📄 | 61_exact_match_counting | 11.1s | 3 | 3 | $0.1563 | 54,476 | 54,121 | 18,458 | 355 | 208 | 35,659 | 18,462 | 24 | |
| | Total | 27.4s avg | 4.0 avg | 8.6 avg | $4.0014 | 1,454,138 | 1,425,549 | 30,081 | 28,589 | 1,044 | 1,038,908 | 386,641 | 4,120 | |
Benchmark Comparison Details

Master baseline: latest master-* experiment (post-merge regression eval)
Status: 10 test/model combinations loaded

Benchmark baseline: latest ci-benchmark experiment on master
Status: 147 test/model combinations loaded

Time comparison (seconds):

| Test case | This branch | master (1h ago) | Δ vs master | benchmark (12d ago) | Δ vs benchmark |
|---|---|---|---|---|---|
| 09_crashpod (opus-4.6) 📄 | 34.7s | 34.5s | ±0% | 32.5s | ±0% |
| 101_loki_historical_logs_pod_deleted (opus-4.6) 📄 | 78.6s | 69.3s | ↑13% | 51.7s | ↑52% |
| 112_find_pvcs_by_uuid (opus-4.6) 📄 | 16.9s | 18.1s | ±0% | | |
| 12_job_crashing (opus-4.6) 📄 | 39.8s | 40.7s | ±0% | 42.4s | ±0% |
| 176_network_policy_blocking_traffic_no_skills (opus-4.6) 📄 | 56.2s | 51.3s | ±0% | 35.7s | ↑57% |
| 227_count_configmaps_per_namespace[0] (opus-4.6) 📄 | 19.5s | | | | |
| 243_pod_names_contain_service (opus-4.6) 📄 | 34.0s | 36.8s | ±0% | 27.4s | ↑24% |
| 24_misconfigured_pvc (opus-4.6) 📄 | 39.7s | 40.0s | ±0% | 35.8s | ↑11% |
| 259_k8s_event_grouping_prefer_dedicated_tools (opus-4.6) 📄 | 16.9s | | | | |
| 260_k8s_multi_pod_node_lookup_prefer_dedicated_tools (opus-4.6) 📄 | 18.2s | | | | |
| 261_k8s_vague_overview_no_bash_aggregation (opus-4.6) 📄 | 30.8s | | | | |
| 262_k8s_label_distribution_no_bash_aggregation (opus-4.6) 📄 | 15.0s | | | | |
| 263_k8s_namespace_summary_no_bash_aggregation (opus-4.6) 📄 | 16.1s | | | | |
| 264_k8s_find_outliers_no_bash_aggregation (opus-4.6) 📄 | 13.2s | | | | |
| 43_current_datetime_from_prompt (opus-4.6) 📄 | 3.2s | 2.9s | ±0% | | |
| 51_logs_summarize_errors (opus-4.6) 📄 | 22.3s | 21.5s | ±0% | 23.1s | ±0% |
| 61_exact_match_counting (opus-4.6) 📄 | 11.1s | 10.2s | ±0% | 10.8s | ±0% |
| Average (m=9, b=9) | 35.5s | 34.1s | ±0% | 30.8s | ↑20% |

Cost comparison:

| Test case | This branch | master (1h ago) | Δ vs master | benchmark (12d ago) | Δ vs benchmark |
|---|---|---|---|---|---|
| 09_crashpod (opus-4.6) 📄 | $0.2735 | $0.2687 | ±0% | $0.2616 | ±0% |
| 101_loki_historical_logs_pod_deleted (opus-4.6) 📄 | $0.4529 | $0.4009 | ↑13% | $0.3371 | ↑34% |
| 112_find_pvcs_by_uuid (opus-4.6) 📄 | $0.1938 | $0.2014 | ±0% | | |
| 12_job_crashing (opus-4.6) 📄 | $0.3087 | $0.3193 | ±0% | $0.3076 | ±0% |
| 176_network_policy_blocking_traffic_no_skills (opus-4.6) 📄 | $0.3457 | $0.3430 | ±0% | $0.2914 | ↑19% |
| 227_count_configmaps_per_namespace[0] (opus-4.6) 📄 | $0.2145 | | | | |
| 243_pod_names_contain_service (opus-4.6) 📄 | $0.2706 | $0.2751 | ±0% | $0.2280 | ↑19% |
| 24_misconfigured_pvc (opus-4.6) 📄 | $0.2972 | $0.2929 | ±0% | $0.2831 | ±0% |
| 259_k8s_event_grouping_prefer_dedicated_tools (opus-4.6) 📄 | $0.1880 | | | | |
| 260_k8s_multi_pod_node_lookup_prefer_dedicated_tools (opus-4.6) 📄 | $0.1918 | | | | |
| 261_k8s_vague_overview_no_bash_aggregation (opus-4.6) 📄 | $0.2420 | | | | |
| 262_k8s_label_distribution_no_bash_aggregation (opus-4.6) 📄 | $0.1684 | | | | |
| 263_k8s_namespace_summary_no_bash_aggregation (opus-4.6) 📄 | $0.1921 | | | | |
| 264_k8s_find_outliers_no_bash_aggregation (opus-4.6) 📄 | $0.1698 | | | | |
| 43_current_datetime_from_prompt (opus-4.6) 📄 | $0.1231 | $0.0122 | ↑912% | | |
| 51_logs_summarize_errors (opus-4.6) 📄 | $0.2129 | $0.2051 | ±0% | $0.2072 | ±0% |
| 61_exact_match_counting (opus-4.6) 📄 | $0.1563 | $0.1522 | ±0% | $0.1522 | ±0% |
| Average (m=9, b=9) | $0.2712 | $0.2521 | ±0% | $0.2522 | ↑11% |

Total tokens comparison:

| Test case | This branch | master (1h ago) | Δ vs master | benchmark (12d ago) | Δ vs benchmark |
|---|---|---|---|---|---|
| 09_crashpod (opus-4.6) 📄 | 90,638 | 105,590 | ↓14% | 103,497 | ↓12% |
| 101_loki_historical_logs_pod_deleted (opus-4.6) 📄 | 205,671 | 167,742 | ↑23% | 138,670 | ↑48% |
| 112_find_pvcs_by_uuid (opus-4.6) 📄 | 59,828 | 61,169 | ±0% | | |
| 12_job_crashing (opus-4.6) 📄 | 136,723 | 147,210 | ±0% | 133,893 | ±0% |
| 176_network_policy_blocking_traffic_no_skills (opus-4.6) 📄 | 143,764 | 139,843 | ±0% | 111,145 | ↑29% |
| 227_count_configmaps_per_namespace[0] (opus-4.6) 📄 | 79,088 | | | | |
| 243_pod_names_contain_service (opus-4.6) 📄 | 106,887 | 107,178 | ±0% | 79,525 | ↑34% |
| 24_misconfigured_pvc (opus-4.6) 📄 | 110,821 | 107,382 | ±0% | 108,047 | ±0% |
| 259_k8s_event_grouping_prefer_dedicated_tools (opus-4.6) 📄 | 58,180 | | | | |
| 260_k8s_multi_pod_node_lookup_prefer_dedicated_tools (opus-4.6) 📄 | 58,400 | | | | |
| 261_k8s_vague_overview_no_bash_aggregation (opus-4.6) 📄 | 82,285 | | | | |
| 262_k8s_label_distribution_no_bash_aggregation (opus-4.6) 📄 | 55,661 | | | | |
| 263_k8s_namespace_summary_no_bash_aggregation (opus-4.6) 📄 | 58,072 | | | | |
| 264_k8s_find_outliers_no_bash_aggregation (opus-4.6) 📄 | 56,082 | | | | |
| 43_current_datetime_from_prompt (opus-4.6) 📄 | 17,594 | 17,043 | ±0% | | |
| 51_logs_summarize_errors (opus-4.6) 📄 | 79,968 | 77,335 | ±0% | 77,707 | ±0% |
| 61_exact_match_counting (opus-4.6) 📄 | 54,476 | 52,855 | ±0% | 52,942 | ±0% |
| Average (m=9, b=9) | 105,171 | 102,464 | ±0% | 96,288 | ↑14% |

Cached tokens comparison:

| Test case | This branch | master (1h ago) | Δ vs master | benchmark (12d ago) | Δ vs benchmark |
|---|---|---|---|---|---|
| 09_crashpod (opus-4.6) 📄 | 62,451 | 79,438 | ↓21% | 77,391 | ↓19% |
| 101_loki_historical_logs_pod_deleted (opus-4.6) 📄 | 168,584 | 133,072 | ↑27% | 106,565 | ↑58% |
| 112_find_pvcs_by_uuid (opus-4.6) 📄 | 37,634 | 38,002 | ±0% | | |
| 12_job_crashing (opus-4.6) 📄 | 107,747 | 117,018 | ±0% | 104,761 | ±0% |
| 176_network_policy_blocking_traffic_no_skills (opus-4.6) 📄 | 112,345 | 108,706 | ±0% | 81,519 | ↑38% |
| 227_count_configmaps_per_namespace[0] (opus-4.6) 📄 | 55,474 | | | | |
| 243_pod_names_contain_service (opus-4.6) 📄 | 80,314 | 80,447 | ±0% | 55,513 | ↑45% |
| 24_misconfigured_pvc (opus-4.6) 📄 | 82,056 | 79,303 | ±0% | 80,270 | ±0% |
| 259_k8s_event_grouping_prefer_dedicated_tools (opus-4.6) 📄 | 36,623 | | | | |
| 260_k8s_multi_pod_node_lookup_prefer_dedicated_tools (opus-4.6) 📄 | 36,713 | | | | |
| 261_k8s_vague_overview_no_bash_aggregation (opus-4.6) 📄 | 57,584 | | | | |
| 262_k8s_label_distribution_no_bash_aggregation (opus-4.6) 📄 | 36,033 | | | | |
| 263_k8s_namespace_summary_no_bash_aggregation (opus-4.6) 📄 | 36,343 | | | | |
| 264_k8s_find_outliers_no_bash_aggregation (opus-4.6) 📄 | 36,259 | | | | |
| 43_current_datetime_from_prompt (opus-4.6) 📄 | 16,937 | | | | |
| 51_logs_summarize_errors (opus-4.6) 📄 | 57,089 | 55,251 | ±0% | 55,443 | ±0% |
| 61_exact_match_counting (opus-4.6) 📄 | 35,659 | 34,570 | ±0% | 34,632 | ±0% |
| Average (m=8, b=9) | 88,281 | 85,976 | ±0% | 70,455 | ↑17% |

Turns comparison:

| Test case | This branch | master (1h ago) | Δ vs master | benchmark (12d ago) | Δ vs benchmark |
|---|---|---|---|---|---|
| 09_crashpod (opus-4.6) 📄 | 4 | 5 | ↓20% | | |
| 101_loki_historical_logs_pod_deleted (opus-4.6) 📄 | 8 | 7 | ↑14% | | |
| 112_find_pvcs_by_uuid (opus-4.6) 📄 | 3 | | | | |
| 12_job_crashing (opus-4.6) 📄 | 6 | 6 | ±0% | | |
| 176_network_policy_blocking_traffic_no_skills (opus-4.6) 📄 | 6 | 6 | ±0% | | |
| 227_count_configmaps_per_namespace[0] (opus-4.6) 📄 | 4 | | | | |
| 243_pod_names_contain_service (opus-4.6) 📄 | 5 | 5 | ±0% | | |
| 24_misconfigured_pvc (opus-4.6) 📄 | 5 | 5 | ±0% | | |
| 259_k8s_event_grouping_prefer_dedicated_tools (opus-4.6) 📄 | 3 | | | | |
| 260_k8s_multi_pod_node_lookup_prefer_dedicated_tools (opus-4.6) 📄 | 3 | | | | |
| 261_k8s_vague_overview_no_bash_aggregation (opus-4.6) 📄 | 4 | | | | |
| 262_k8s_label_distribution_no_bash_aggregation (opus-4.6) 📄 | 3 | | | | |
| 263_k8s_namespace_summary_no_bash_aggregation (opus-4.6) 📄 | 3 | | | | |
| 264_k8s_find_outliers_no_bash_aggregation (opus-4.6) 📄 | 3 | | | | |
| 43_current_datetime_from_prompt (opus-4.6) 📄 | 1 | 1 | ±0% | | |
| 51_logs_summarize_errors (opus-4.6) 📄 | 4 | 4 | ±0% | | |
| 61_exact_match_counting (opus-4.6) 📄 | 3 | 3 | ±0% | | |
| Average (m=9, b=0) | 4.7 | 4.7 | ±0% | | |

Tool calls comparison:

| Test case | This branch | master (1h ago) | Δ vs master | benchmark (12d ago) | Δ vs benchmark |
|---|---|---|---|---|---|
| 09_crashpod (opus-4.6) 📄 | 10 | 10 | ±0% | 10 | ±0% |
| 101_loki_historical_logs_pod_deleted (opus-4.6) 📄 | 17 | 15 | ↑13% | 14 | ↑21% |
| 112_find_pvcs_by_uuid (opus-4.6) 📄 | 4 | 4 | ±0% | | |
| 12_job_crashing (opus-4.6) 📄 | 14 | 12 | ↑17% | 14 | ±0% |
| 176_network_policy_blocking_traffic_no_skills (opus-4.6) 📄 | 14 | 15 | ±0% | 13 | ±0% |
| 227_count_configmaps_per_namespace[0] (opus-4.6) 📄 | 9 | | | | |
| 243_pod_names_contain_service (opus-4.6) 📄 | 10 | 11 | ±0% | 8 | ↑25% |
| 24_misconfigured_pvc (opus-4.6) 📄 | 15 | 13 | ↑15% | 14 | ±0% |
| 259_k8s_event_grouping_prefer_dedicated_tools (opus-4.6) 📄 | 6 | | | | |
| 260_k8s_multi_pod_node_lookup_prefer_dedicated_tools (opus-4.6) 📄 | 7 | | | | |
| 261_k8s_vague_overview_no_bash_aggregation (opus-4.6) 📄 | 11 | | | | |
| 262_k8s_label_distribution_no_bash_aggregation (opus-4.6) 📄 | 3 | | | | |
| 263_k8s_namespace_summary_no_bash_aggregation (opus-4.6) 📄 | 6 | | | | |
| 264_k8s_find_outliers_no_bash_aggregation (opus-4.6) 📄 | 3 | | | | |
| 43_current_datetime_from_prompt (opus-4.6) 📄 | | | | | |
| 51_logs_summarize_errors (opus-4.6) 📄 | 5 | 5 | ±0% | 5 | ±0% |
| 61_exact_match_counting (opus-4.6) 📄 | 3 | 3 | ±0% | 3 | ±0% |
| Average (m=8, b=9) | 11.0 | 10.5 | ±0% | 9.4 | ±0% |

Comparison indicators:

  • ±0% — diff under 10% (within noise threshold)
  • ↑N%/↓N% — diff 10-25%
  • ↑N%/↓N% — diff over 25% (significant)

⚠️ 1 Failure Detected

📖 Legend
| Icon | Meaning |
|---|---|
|  | The test was successful |
|  | The test was skipped |
| ⚠️ | The test failed but is known to be flaky or known to fail |
| 🚧 | The test had a setup failure (not a code regression) |
| 🔧 | The test failed due to mock data issues (not a code regression) |
| 🚫 | The test was throttled by API rate limits/overload |
|  | The test failed and should be fixed before merging the PR |
🔄 Re-run evals manually

⚠️ Warning: /eval comments always run using the workflow from master, not from this PR branch. If you modified the GitHub Action (e.g., added secrets or env vars), those changes won't take effect.

To test workflow changes, use the GitHub CLI or Actions UI instead:

gh workflow run eval-regression.yaml --repo HolmesGPT/holmesgpt --ref claude/fix-k8s-approval-issue-xNNVP -f markers=regression -f filter=

Option 1: Comment on this PR with /eval:

/eval
tags: regression

Or with more options (one per line):

/eval
model: gpt-4o
tags: regression
id: 09_crashpod
iterations: 5

Run evals on a different branch (e.g., master) for comparison:

/eval
branch: master
tags: regression
| Option | Description |
|---|---|
| model | Model(s) to test (default: same as automatic runs) |
| tags | Pytest tags / markers (no default - runs all tests!) |
| id | Eval ID / pytest -k filter (use /list to see valid eval names) |
| iterations | Number of runs, max 10 |
| branch | Run evals on a different branch (for cross-branch comparison) |

Quick re-run: Use /rerun to re-run the most recent /eval on this PR with the same parameters.

Option 2: Trigger via GitHub Actions UI → "Run workflow"

Option 3: Add PR labels to include extra evals (applies to both automatic runs and /eval comments):

| Label | Effect |
|---|---|
| evals-tag-<name> | Run tests with tag <name> alongside regression |
| evals-id-<name> | Run a specific eval by test ID |
| evals-model-<name> | Override the model (use model list name, e.g. sonnet-4.5) |

Examples: evals-tag-easy, evals-id-09_crashpod, evals-model-sonnet-4.5

🏷️ Valid tags

benchmark, chain-of-causation, compaction, confluence, context_window, conversation_worker, coralogix, counting, database, datadog, datetime, db-connectors, easy, elasticsearch, embeds, fast, frontend, grafana, hard, images, integration, kafka, kubernetes-no-bash-approval, kubernetes, leaked-information, logs, loki, manual, mcp, medium, metrics, network, newrelic, no-cicd, numerical, one-test, port-forward, prometheus, question-answer, regression, skills, slackbot, storage, token-limit, toolset-limitation, traces, transparency, victorialogs

🤖 Valid models

deepseek-chat, deepseek-r1-reasoner, deepseek-reasoner, deepseek-v3.2-chat, gemini-3-flash-preview, gemini-3-pro-preview, gemini-3.1-pro-preview, gpt-4.1, gpt-5.2-high-reasoning, gpt-5.3-codex, gpt-5.4, haiku-4.5, kimi-2.5, kimi-2.5-openrouter, opus-4.5, opus-4.6, opus-4.7, qwen-next-80B-instruct, qwen-next-80B-thinking, sonnet-4.5, sonnet-4.6


Commands: /eval · /rerun · /list

CLI: gh workflow run eval-regression.yaml --repo HolmesGPT/holmesgpt --ref claude/fix-k8s-approval-issue-xNNVP -f markers=regression -f filter=

coderabbitai Bot (Contributor) commented May 15, 2026

Walkthrough

Adds six Kubernetes manifest/test fixtures (tests 259–264) requiring dedicated Kubernetes queries (or direct kubectl calls) instead of bash aggregation loops, updates bash tool instructions to prefer dedicated tools, and adds a pytest marker to flag tests that forbid bash aggregation/approval flows.

Changes

Kubernetes LLM test fixtures and tooling guidance

| Layer / File(s) | Summary |
|---|---|
| Test 259: Manifest and test case for per-namespace worker counting<br>`tests/llm/fixtures/test_ask_holmes/259_k8s_event_grouping_prefer_dedicated_tools/manifest.yaml`, `tests/llm/fixtures/test_ask_holmes/259_k8s_event_grouping_prefer_dedicated_tools/test_case.yaml` | Creates four app-259-* namespaces and pods; before_test polls and asserts exact tier=worker counts per namespace; prompt requests counts and the winning namespace while forbidding shell-pipeline aggregation; after_test deletes namespaces. |
| Test 260: Manifest and test case for pod nodeName lookup<br>`tests/llm/fixtures/test_ask_holmes/260_k8s_multi_pod_node_lookup_prefer_dedicated_tools/manifest.yaml`, `tests/llm/fixtures/test_ask_holmes/260_k8s_multi_pod_node_lookup_prefer_dedicated_tools/test_case.yaml` | Creates three app-260-* namespaces and named pods; before_test waits until each pod has a non-empty spec.nodeName; prompt requires dedicated Kubernetes queries or kubectl JSON and forbids bash loops; after_test deletes namespaces. |
| Pytest marker enabling tooling constraint<br>`pyproject.toml` | Adds the kubernetes-no-bash-approval pytest marker to mark tests that require dedicated Kubernetes tools instead of bash aggregation/loops. |
| Bash tool instructions update<br>`holmes/plugins/toolsets/bash/bash_instructions.jinja2` | Prefers dedicated Kubernetes tools or batched kubectl get before using loops/conditionals; notes that approval prompts interrupt execution and loops should be used only when necessary. |
| Test 261: Vague workload overview fixture and test case<br>`tests/llm/fixtures/test_ask_holmes/261_k8s_vague_overview_no_bash_aggregation/manifest.yaml`, `tests/llm/fixtures/test_ask_holmes/261_k8s_vague_overview_no_bash_aggregation/test_case.yaml` | Creates four app-261-* namespaces and pods; before_test polls for readiness; prompt asks for a cross-namespace workload overview using dedicated Kubernetes tooling and forbids bash aggregation/loops; after_test deletes namespaces. |
| Test 262: Label distribution fixture and test case<br>`tests/llm/fixtures/test_ask_holmes/262_k8s_label_distribution_no_bash_aggregation/manifest.yaml`, `tests/llm/fixtures/test_ask_holmes/262_k8s_label_distribution_no_bash_aggregation/test_case.yaml` | Creates four app-262-* namespaces and labeled pods; before_test waits for 15 pods; prompt requests a (tier, count) table with exact expected counts using dedicated Kubernetes tooling and forbids bash aggregation/loops; after_test deletes namespaces. |
| Test 263: Namespace tier summary fixture and test case<br>`tests/llm/fixtures/test_ask_holmes/263_k8s_namespace_summary_no_bash_aggregation/manifest.yaml`, `tests/llm/fixtures/test_ask_holmes/263_k8s_namespace_summary_no_bash_aggregation/test_case.yaml` | Creates four app-263-* namespaces and pods; before_test polls until 15 pods exist; prompt requests a per-namespace tier breakdown using dedicated Kubernetes tools and forbids bash aggregation/loops; after_test deletes namespaces. |
| Test 264: Find outliers fixture and test case<br>`tests/llm/fixtures/test_ask_holmes/264_k8s_find_outliers_no_bash_aggregation/manifest.yaml`, `tests/llm/fixtures/test_ask_holmes/264_k8s_find_outliers_no_bash_aggregation/test_case.yaml` | Creates four app-264-* namespaces and pods; before_test waits for readiness; prompt asks Holmes to enumerate non-worker pods (namespace + tier) using dedicated Kubernetes tooling and forbids bash aggregation/loops; after_test deletes namespaces. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested labels

evals-tag-counting

Suggested reviewers

  • moshemorad
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly and concisely summarizes the primary change: adding LLM evaluation tests focused on verifying Holmes' preference for dedicated Kubernetes tools over shell pipelines. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

github-actions Bot (Contributor) commented May 15, 2026

Docker images ready for ab754a6d (built in 8m 13s)

⚠️ Warning: does not support ARM (ARM images are built on release only - not on every PR)

Use these tags to pull the images for testing.

📋 Copy commands

⚠️ Temporary images are deleted after 30 days. Copy to a permanent registry before using them:

gcloud auth configure-docker us-central1-docker.pkg.dev
docker pull us-central1-docker.pkg.dev/robusta-development/temporary-builds/holmes:ab754a6d
docker tag us-central1-docker.pkg.dev/robusta-development/temporary-builds/holmes:ab754a6d me-west1-docker.pkg.dev/robusta-development/development/holmes-dev:ab754a6d
docker push me-west1-docker.pkg.dev/robusta-development/development/holmes-dev:ab754a6d
docker pull us-central1-docker.pkg.dev/robusta-development/temporary-builds/holmes-operator:ab754a6d
docker tag us-central1-docker.pkg.dev/robusta-development/temporary-builds/holmes-operator:ab754a6d me-west1-docker.pkg.dev/robusta-development/development/holmes-operator-dev:ab754a6d
docker push me-west1-docker.pkg.dev/robusta-development/development/holmes-operator-dev:ab754a6d

Patch Helm values in one line (choose the chart you use):

HolmesGPT chart:

helm upgrade --install holmesgpt ./helm/holmes \
  --set registry=me-west1-docker.pkg.dev/robusta-development/development \
  --set image=holmes-dev:ab754a6d \
  --set operator.registry=me-west1-docker.pkg.dev/robusta-development/development \
  --set operator.image=holmes-operator-dev:ab754a6d

Robusta wrapper chart:

helm upgrade --install robusta robusta/robusta \
  --reuse-values \
  --set holmes.registry=me-west1-docker.pkg.dev/robusta-development/development \
  --set holmes.image=holmes-dev:ab754a6d \
  --set holmes.operator.registry=me-west1-docker.pkg.dev/robusta-development/development \
  --set holmes.operator.image=holmes-operator-dev:ab754a6d

netlify Bot commented May 15, 2026

Deploy Preview for holmes-docs ready!

| Name | Link |
|---|---|
| 🔨 Latest commit | 9b70f66 |
| 🔍 Latest deploy log | https://app.netlify.com/projects/holmes-docs/deploys/6a07883e1bd5a00008f9f634 |
| 😎 Deploy Preview | https://deploy-preview-2048--holmes-docs.netlify.app |

To edit notification comments on pull requests, go to your Netlify project configuration.

New pytest marker for the suite of evals that assert Holmes uses
dedicated Kubernetes tools (kubernetes_jq_query, kubernetes_tabular_query,
kubernetes_count) instead of bash pipelines/loops that would trigger
user approval prompts. Lets us run the whole group via
`pytest -m kubernetes-no-bash-approval`.

Signed-off-by: Claude <noreply@anthropic.com>
@coderabbitai coderabbitai Bot (Contributor) left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/llm/fixtures/test_ask_holmes/259_k8s_event_grouping_prefer_dedicated_tools/test_case.yaml (1)

1-87: ⚠️ Potential issue | 🔴 Critical

Test number 259 is already in use; choose a different sequential number.

Test number 259 conflicts with the existing test directory tests/llm/fixtures/test_ask_holmes/259_k8s_event_grouping_prefer_dedicated_tools/. The next available sequential test number is 261. Update the directory name and all references within the test to use a new sequential number that does not conflict with existing tests (256–260).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tests/llm/fixtures/test_ask_holmes/259_k8s_event_grouping_prefer_dedicated_tools/test_case.yaml`
around lines 1 - 87, The test directory name conflicts with an existing test
(259); rename the directory and all in-file references from
"259_k8s_event_grouping_prefer_dedicated_tools" (and any bare "259" test-number
metadata) to the next available sequential number
"261_k8s_event_grouping_prefer_dedicated_tools", updating the directory name and
every occurrence of the test-number string inside the YAML (ensure you do not
alter the app-259-* namespace names unless they are meant to change) so the test
number is unique.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@tests/llm/fixtures/test_ask_holmes/259_k8s_event_grouping_prefer_dedicated_tools/test_case.yaml`:
- Around line 20-28: The expected_output block currently uses quoted multi-line
strings that preserve newlines/leading whitespace; replace those quoted
multi-line entries with a folded scalar style (>) like user_prompt to collapse
lines and remove leading spaces so the YAML parser/validator sees the intended
single-paragraph text; locate the expected_output key in the test case
(referenced as expected_output) and convert each quoted multi-line value to a
folded scalar (>), ensuring indentation matches the surrounding YAML and content
lines are wrapped without the manual line breaks.

In
`@tests/llm/fixtures/test_ask_holmes/260_k8s_multi_pod_node_lookup_prefer_dedicated_tools/test_case.yaml`:
- Around line 25-28: Update the acceptance text that currently permits "one call
per pod" so it instead requires either a dedicated Kubernetes tool
(kubernetes_tabular_query or kubernetes_jq_query) OR batched kubectl calls
(e.g., a single "kubectl get pods ..." or a single "kubectl get pod <multiple>"
style command) — remove the allowance for one-per-pod bash calls; also add
include_tool_calls: true to this test case to ensure tool execution is
validated; look for the string containing "Must answer using a dedicated
Kubernetes tool..." and the test case metadata to apply these changes.

---

Outside diff comments:
In
`@tests/llm/fixtures/test_ask_holmes/259_k8s_event_grouping_prefer_dedicated_tools/test_case.yaml`:
- Around line 1-87: The test directory name conflicts with an existing test
(259); rename the directory and all in-file references from
"259_k8s_event_grouping_prefer_dedicated_tools" (and any bare "259" test-number
metadata) to the next available sequential number
"261_k8s_event_grouping_prefer_dedicated_tools", updating the directory name and
every occurrence of the test-number string inside the YAML (ensure you do not
alter the app-259-* namespace names unless they are meant to change) so the test
number is unique.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e594859c-c880-41c9-94e7-c57190e6eabd

📥 Commits

Reviewing files that changed from the base of the PR and between 31fa24c and e239caf.

📒 Files selected for processing (4)
  • tests/llm/fixtures/test_ask_holmes/259_k8s_event_grouping_prefer_dedicated_tools/manifest.yaml
  • tests/llm/fixtures/test_ask_holmes/259_k8s_event_grouping_prefer_dedicated_tools/test_case.yaml
  • tests/llm/fixtures/test_ask_holmes/260_k8s_multi_pod_node_lookup_prefer_dedicated_tools/manifest.yaml
  • tests/llm/fixtures/test_ask_holmes/260_k8s_multi_pod_node_lookup_prefer_dedicated_tools/test_case.yaml

Comment on lines +20 to +28
- "Must answer using the dedicated Kubernetes tools (one or more of
kubernetes_jq_query, kubernetes_tabular_query, kubernetes_count, or
kubectl_find_resource). Counting via simple per-namespace kubectl get
calls through the bash tool is acceptable."
- "Must NOT use the bash tool with shell pipelines that aggregate or group
results (for example: piping kubectl output through awk, sort, uniq -c,
or wc to compute grouped counts). The grouping/aggregation must be done
by a dedicated Kubernetes tool or by reading the kubectl output directly,
not by composing a shell pipeline."

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix multi-line string formatting in expected_output.

Lines 20-28 use quoted strings with manual line breaks, which preserves newlines and leading whitespace. This differs from the proper folded scalar style used for user_prompt at line 1 (using >). The preserved whitespace could cause validation issues.

📝 Proposed fix using proper YAML multi-line format
   - "Must identify app-259-control as the namespace with the most stuck pods"
-  - "Must answer using the dedicated Kubernetes tools (one or more of
-     kubernetes_jq_query, kubernetes_tabular_query, kubernetes_count, or
-     kubectl_find_resource). Counting via simple per-namespace kubectl get
-     calls through the bash tool is acceptable."
-  - "Must NOT use the bash tool with shell pipelines that aggregate or group
-     results (for example: piping kubectl output through awk, sort, uniq -c,
-     or wc to compute grouped counts). The grouping/aggregation must be done
-     by a dedicated Kubernetes tool or by reading the kubectl output directly,
-     not by composing a shell pipeline."
+  - >
+    Must answer using the dedicated Kubernetes tools (one or more of
+    kubernetes_jq_query, kubernetes_tabular_query, kubernetes_count, or
+    kubectl_find_resource). Counting via simple per-namespace kubectl get
+    calls through the bash tool is acceptable.
+  - >
+    Must NOT use the bash tool with shell pipelines that aggregate or group
+    results (for example: piping kubectl output through awk, sort, uniq -c,
+    or wc to compute grouped counts). The grouping/aggregation must be done
+    by a dedicated Kubernetes tool or by reading the kubectl output directly,
+    not by composing a shell pipeline.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tests/llm/fixtures/test_ask_holmes/259_k8s_event_grouping_prefer_dedicated_tools/test_case.yaml`
around lines 20 - 28, The expected_output block currently uses quoted multi-line
strings that preserve newlines/leading whitespace; replace those quoted
multi-line entries with a folded scalar style (>) like user_prompt to collapse
lines and remove leading spaces so the YAML parser/validator sees the intended
single-paragraph text; locate the expected_output key in the test case
(referenced as expected_output) and convert each quoted multi-line value to a
folded scalar (>), ensuring indentation matches the surrounding YAML and content
lines are wrapped without the manual line breaks.

Comment on lines +25 to +28
- "Must answer using a dedicated Kubernetes tool such as
kubernetes_tabular_query or kubernetes_jq_query, OR by issuing direct
'kubectl get pod ... -o ...' / 'kubectl get pods ... -o ...' calls
through the bash tool — one call per pod is acceptable."

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Tighten acceptance criteria to enforce batched/dedicated lookup.

Line 25–28 currently allows “one call per pod” through bash, which contradicts the stated goal at Line 14–17 and can let the non-batched anti-pattern pass.

Suggested fix
-  - "Must answer using a dedicated Kubernetes tool such as
-     kubernetes_tabular_query or kubernetes_jq_query, OR by issuing direct
-     'kubectl get pod ... -o ...' / 'kubectl get pods ... -o ...' calls
-     through the bash tool — one call per pod is acceptable."
+  - "Must answer using a dedicated Kubernetes tool such as
+     kubernetes_tabular_query or kubernetes_jq_query, OR by issuing a single
+     batched kubectl query (e.g., one 'kubectl get pods ... -o ...' call)
+     through the bash tool."

As per coding guidelines: “User prompts must be specific and match the test - test exact values and discovery of information… use include_tool_calls: true to verify tool execution.”

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- "Must answer using a dedicated Kubernetes tool such as
kubernetes_tabular_query or kubernetes_jq_query, OR by issuing direct
'kubectl get pod ... -o ...' / 'kubectl get pods ... -o ...' calls
through the bash tool — one call per pod is acceptable."
- "Must answer using a dedicated Kubernetes tool such as
kubernetes_tabular_query or kubernetes_jq_query, OR by issuing a single
batched kubectl query (e.g., one 'kubectl get pods ... -o ...' call)
through the bash tool."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tests/llm/fixtures/test_ask_holmes/260_k8s_multi_pod_node_lookup_prefer_dedicated_tools/test_case.yaml`
around lines 25 - 28, Update the acceptance text that currently permits "one
call per pod" so it instead requires either a dedicated Kubernetes tool
(kubernetes_tabular_query or kubernetes_jq_query) OR batched kubectl calls
(e.g., a single "kubectl get pods ..." or a single "kubectl get pod <multiple>"
style command) — remove the allowance for one-per-pod bash calls; also add
include_tool_calls: true to this test case to ensure tool execution is
validated; look for the string containing "Must answer using a dedicated
Kubernetes tool..." and the test case metadata to apply these changes.

@coderabbitai coderabbitai Bot (Contributor) left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@tests/llm/fixtures/test_ask_holmes/259_k8s_event_grouping_prefer_dedicated_tools/test_case.yaml`:
- Around line 20-23: Update the fixture's tool-usage requirement so it no longer
permits the bash fallback: locate the string that currently reads "Must answer
using the dedicated Kubernetes tools (one or more of kubernetes_jq_query,
kubernetes_tabular_query, kubernetes_count, or kubectl_find_resource). Counting
via simple per-namespace kubectl get calls through the bash tool is acceptable."
and remove the trailing allowance clause ("Counting via simple per-namespace
kubectl get calls through the bash tool is acceptable.") so the rule enforces
only the dedicated Kubernetes tools.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e6c22c9f-44ff-4cee-ae6d-62391cbd7d15

📥 Commits

Reviewing files that changed from the base of the PR and between e239caf and 900fdf0.

📒 Files selected for processing (3)
  • pyproject.toml
  • tests/llm/fixtures/test_ask_holmes/259_k8s_event_grouping_prefer_dedicated_tools/test_case.yaml
  • tests/llm/fixtures/test_ask_holmes/260_k8s_multi_pod_node_lookup_prefer_dedicated_tools/test_case.yaml
✅ Files skipped from review due to trivial changes (1)
  • pyproject.toml
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/llm/fixtures/test_ask_holmes/260_k8s_multi_pod_node_lookup_prefer_dedicated_tools/test_case.yaml

Comment on lines +20 to +23
- "Must answer using the dedicated Kubernetes tools (one or more of
kubernetes_jq_query, kubernetes_tabular_query, kubernetes_count, or
kubectl_find_resource). Counting via simple per-namespace kubectl get
calls through the bash tool is acceptable."

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Tighten tool-usage criteria to disallow bash counting fallback.

This line makes the eval permissive in a way that can bypass the test’s core intent (dedicated Kubernetes tools). Please remove the “bash tool is acceptable” allowance so the fixture consistently enforces dedicated-tool usage.

Suggested diff
-  - "Must answer using the dedicated Kubernetes tools (one or more of
-     kubernetes_jq_query, kubernetes_tabular_query, kubernetes_count, or
-     kubectl_find_resource). Counting via simple per-namespace kubectl get
-     calls through the bash tool is acceptable."
+  - "Must answer using dedicated Kubernetes tools (one or more of
+     kubernetes_jq_query, kubernetes_tabular_query, kubernetes_count, or
+     kubectl_find_resource), and must not use bash for counting/grouping."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tests/llm/fixtures/test_ask_holmes/259_k8s_event_grouping_prefer_dedicated_tools/test_case.yaml`
around lines 20 - 23, Update the fixture's tool-usage requirement so it no
longer permits the bash fallback: locate the string that currently reads "Must
answer using the dedicated Kubernetes tools (one or more of kubernetes_jq_query,
kubernetes_tabular_query, kubernetes_count, or kubectl_find_resource). Counting
via simple per-namespace kubectl get calls through the bash tool is acceptable."
and remove the trailing allowance clause ("Counting via simple per-namespace
kubectl get calls through the bash tool is acceptable.") so the rule enforces
only the dedicated Kubernetes tools.

claude added 3 commits May 15, 2026 19:10
The original 259 used pods with invalid image references to produce
ImagePullBackOff/ErrImagePull status, but that requires a working
container runtime to even reach the image-pull stage. In constrained
CI environments (nested containers, cgroup v1) pods get stuck on
FailedCreatePodSandBox before the kubelet ever attempts an image
pull, making the setup unverifiable.

The redesigned 259 deploys 15 pods across 4 namespaces with a mix of
tier labels (worker plus noise tiers like edge/messaging/gateway/
batch/observability), and asks Holmes to count pods with tier=worker
in each namespace. Labels are queryable from metadata.labels as soon
as the Pod object is created, so the eval is independent of whether
the pods reach Running.

Tests the same anti-pattern (bash | awk | sort | uniq -c grouping)
with the same expected_output structure (3/1/2/4, control wins).
Verified locally against opus-4.6 via OpenRouter: 10/10 pass.
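
The before_test polling this relies on could look like the following sketch (the actual fixture script is not shown in this excerpt; counting with wc -l is fine here because it runs in the test harness, not in Holmes):

```bash
# Wait until the control namespace reports its expected 4 tier=worker pods
for attempt in $(seq 1 30); do
  count=$(kubectl get pods -n app-259-control -l tier=worker --no-headers 2>/dev/null | wc -l)
  [ "$count" -eq 4 ] && break
  sleep 2
done
```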

Signed-off-by: Claude <noreply@anthropic.com>
Adds evals 261-264 that probe opus-4.6 with vague, customer-style
questions about resource distribution and grouping across multiple
namespaces. They complement 259/260 by covering question phrasings
that historically slip past the existing assertion set.

- 261 (vague overview): "give me an overview of what's running across
  these 4 namespaces"
- 262 (label distribution): "what distinct tier values are in use,
  with counts per value"
- 263 (namespace summary): "per-namespace breakdown of pods by tier
  label"
- 264 (find outliers): "find every pod whose tier is something other
  than worker"

All four use the same 15-pod template (worker pods plus tier-noise
pods) in isolated app-{N}-* namespaces, so they're parallel-safe.

Locally on opus-4.6 these pass ~95% of the time (5/5, 4/5, 5/5, 5/5
on a clean run). The intermittent failures show opus-4.6 occasionally
falls back to bash pipelines like:

  kubectl get pods -n X --show-labels; echo "==="; kubectl get pods -n Y --show-labels ...
  kubectl get pods -A -o json | jq | sort -u

The dedicated tools (kubernetes_jq_query, kubernetes_count) handle all
four scenarios cleanly. The regression evals lock in that preference
and will catch future model changes that re-introduce the anti-pattern.

include_tool_calls is set so the judge inspects tool selection, not
just the final answer. Bash with simple per-namespace 'kubectl get'
calls (one per namespace, semicolon-chained allowed) is acceptable;
bash with aggregating pipelines (awk/sort/uniq/wc/cut/grep -c) is not.
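
Concretely, the boundary the judge applies looks roughly like this (illustrative commands):

```bash
# Acceptable: plain per-namespace reads, semicolon-chained, output read directly
kubectl get pods -n app-261-frontend --show-labels; kubectl get pods -n app-261-backend --show-labels

# Not acceptable: an aggregating pipeline; this should be a single
# kubernetes_jq_query / kubernetes_count call instead
kubectl get pods -A -o json | jq -r '.items[].metadata.labels.tier' | sort | uniq -c
```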

Signed-off-by: Claude <noreply@anthropic.com>
The previous bash instructions used `kubectl get pods | grep Running | head -5`
as the canonical "pipe is auto-approved" example. That implicitly endorsed
piping kubectl output through awk/sort/uniq -c/wc to do grouping and counting,
which is exactly the anti-pattern that produces the customer pain reported in
this branch: instead of a single kubernetes_count / kubernetes_jq_query /
kubernetes_tabular_query call, the LLM stitched together shell pipelines that
either required approval (loops, command substitution) or produced noisy
investigation traces.

This change adds a "Tool Selection — Prefer Dedicated Tools" section at the
top of the bash instructions that:

  * Tells the model to check for a dedicated K8s tool before reaching for bash.
  * Enumerates the four most common anti-patterns (group-by-uniq, wc -l counts,
    distinct-via-sort-u, per-resource for-loops) and points each one at the
    dedicated tool that replaces it.
  * Carves out the legitimate uses of bash — one-off `kubectl get` with a
    specific flag, non-kubectl invocations — so the guidance does not over-rotate.

Two smaller edits below:

  * The "Pipes" example is changed from `kubectl get pods | grep ...` to a
    log-grep example, with a one-line reminder pointing at the new section.
  * The "go ahead and use loops" line is replaced with a check-first prompt
    that mentions the cost of approval prompts.

This is prompt guidance, not a behavior gate — the model can still pipe kubectl
when it judges that the right choice (e.g. when no dedicated tool fits). The
goal is to flip the default for grouping/counting/joining work back to the
dedicated tools.
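
As a sketch, the four enumerated anti-patterns and the dedicated tool each one points at (the pipeline shapes are illustrative; the dedicated tools are invoked through the tool API, not through bash):

```bash
kubectl get pods -A -o json | jq -r '.items[].metadata.namespace' | sort | uniq -c  # group-by-uniq     -> kubernetes_jq_query
kubectl get pods -n app-261-backend --no-headers | wc -l                            # wc -l counting    -> kubernetes_count
kubectl get pods -A -L tier --no-headers | awk '{print $NF}' | sort -u              # distinct-via-sort -> kubernetes_jq_query
for p in $(kubectl get pods -o name); do kubectl get "$p" -o wide; done             # per-resource loop -> kubernetes_tabular_query
```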

Pairs with the 259/260/261/262/263/264 evals on this branch which lock in the
preference. Verification against opus-4.6 with full iteration counts is
pending — the OpenRouter weekly credit limit was hit during testing.

Signed-off-by: Claude <noreply@anthropic.com>
@coderabbitai coderabbitai Bot (Contributor) left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@tests/llm/fixtures/test_ask_holmes/261_k8s_vague_overview_no_bash_aggregation/manifest.yaml`:
- Around line 22-200: All Pod resources in this manifest (e.g.,
web-renderer-v6q8, asset-cdn-v6q8, session-router-v6q8, etc.) lack an explicit
non-root security context; add a pod-level securityContext with runAsNonRoot:
true and runAsUser: 1000 and ensure each container (name: app) sets
securityContext.allowPrivilegeEscalation: false (or container-level
runAsNonRoot/runAsUser if you prefer container scope) so every Pod/spec and its
container securityContext explicitly enforce non-root and no privilege
escalation across all Pod definitions.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: dc321014-7f70-4636-9eb1-d60c115e3465

📥 Commits

Reviewing files that changed from the base of the PR and between f39a8f3 and 9b70f66.

📒 Files selected for processing (9)
  • holmes/plugins/toolsets/bash/bash_instructions.jinja2
  • tests/llm/fixtures/test_ask_holmes/261_k8s_vague_overview_no_bash_aggregation/manifest.yaml
  • tests/llm/fixtures/test_ask_holmes/261_k8s_vague_overview_no_bash_aggregation/test_case.yaml
  • tests/llm/fixtures/test_ask_holmes/262_k8s_label_distribution_no_bash_aggregation/manifest.yaml
  • tests/llm/fixtures/test_ask_holmes/262_k8s_label_distribution_no_bash_aggregation/test_case.yaml
  • tests/llm/fixtures/test_ask_holmes/263_k8s_namespace_summary_no_bash_aggregation/manifest.yaml
  • tests/llm/fixtures/test_ask_holmes/263_k8s_namespace_summary_no_bash_aggregation/test_case.yaml
  • tests/llm/fixtures/test_ask_holmes/264_k8s_find_outliers_no_bash_aggregation/manifest.yaml
  • tests/llm/fixtures/test_ask_holmes/264_k8s_find_outliers_no_bash_aggregation/test_case.yaml

Comment on lines +22 to +200
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-renderer-v6q8
  namespace: app-261-frontend
  labels:
    app: web-renderer
    tier: worker
spec:
  containers:
    - {name: app, image: busybox:1.36, command: ["sleep", "3600"]}
---
apiVersion: v1
kind: Pod
metadata:
  name: asset-cdn-v6q8
  namespace: app-261-frontend
  labels:
    app: asset-cdn
    tier: worker
spec:
  containers:
    - {name: app, image: busybox:1.36, command: ["sleep", "3600"]}
---
apiVersion: v1
kind: Pod
metadata:
  name: session-router-v6q8
  namespace: app-261-frontend
  labels:
    app: session-router
    tier: worker
spec:
  containers:
    - {name: app, image: busybox:1.36, command: ["sleep", "3600"]}
---
apiVersion: v1
kind: Pod
metadata:
  name: edge-cache-v6q8
  namespace: app-261-frontend
  labels:
    app: edge-cache
    tier: edge
spec:
  containers:
    - {name: app, image: busybox:1.36, command: ["sleep", "3600"]}
---
apiVersion: v1
kind: Pod
metadata:
  name: order-processor-v6q8
  namespace: app-261-backend
  labels:
    app: order-processor
    tier: worker
spec:
  containers:
    - {name: app, image: busybox:1.36, command: ["sleep", "3600"]}
---
apiVersion: v1
kind: Pod
metadata:
  name: queue-broker-v6q8
  namespace: app-261-backend
  labels:
    app: queue-broker
    tier: messaging
spec:
  containers:
    - {name: app, image: busybox:1.36, command: ["sleep", "3600"]}
---
apiVersion: v1
kind: Pod
metadata:
  name: api-gateway-v6q8
  namespace: app-261-backend
  labels:
    app: api-gateway
    tier: gateway
spec:
  containers:
    - {name: app, image: busybox:1.36, command: ["sleep", "3600"]}
---
apiVersion: v1
kind: Pod
metadata:
  name: stream-ingester-v6q8
  namespace: app-261-data
  labels:
    app: stream-ingester
    tier: worker
spec:
  containers:
    - {name: app, image: busybox:1.36, command: ["sleep", "3600"]}
---
apiVersion: v1
kind: Pod
metadata:
  name: feature-compactor-v6q8
  namespace: app-261-data
  labels:
    app: feature-compactor
    tier: worker
spec:
  containers:
    - {name: app, image: busybox:1.36, command: ["sleep", "3600"]}
---
apiVersion: v1
kind: Pod
metadata:
  name: warehouse-loader-v6q8
  namespace: app-261-data
  labels:
    app: warehouse-loader
    tier: batch
spec:
  containers:
    - {name: app, image: busybox:1.36, command: ["sleep", "3600"]}
---
apiVersion: v1
kind: Pod
metadata:
  name: scheduler-shim-v6q8
  namespace: app-261-control
  labels:
    app: scheduler-shim
    tier: worker
spec:
  containers:
    - {name: app, image: busybox:1.36, command: ["sleep", "3600"]}
---
apiVersion: v1
kind: Pod
metadata:
  name: policy-evaluator-v6q8
  namespace: app-261-control
  labels:
    app: policy-evaluator
    tier: worker
spec:
  containers:
    - {name: app, image: busybox:1.36, command: ["sleep", "3600"]}
---
apiVersion: v1
kind: Pod
metadata:
  name: workload-binder-v6q8
  namespace: app-261-control
  labels:
    app: workload-binder
    tier: worker
spec:
  containers:
    - {name: app, image: busybox:1.36, command: ["sleep", "3600"]}
---
apiVersion: v1
kind: Pod
metadata:
  name: lease-coordinator-v6q8
  namespace: app-261-control
  labels:
    app: lease-coordinator
    tier: worker
spec:
  containers:
    - {name: app, image: busybox:1.36, command: ["sleep", "3600"]}
---
apiVersion: v1
kind: Pod
metadata:
  name: audit-logger-v6q8
  namespace: app-261-control
  labels:
    app: audit-logger
    tier: observability
spec:
  containers:
    - {name: app, image: busybox:1.36, command: ["sleep", "3600"]}
```

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Harden pod security context to avoid policy-dependent fixture failures.

These pods currently rely on default security settings; on clusters enforcing Pod Security standards, they can be rejected (runAsNonRoot / allowPrivilegeEscalation). Please add explicit non-root security context across all fixture pods.

Suggested pattern to apply to each Pod spec
 spec:
+  securityContext:
+    runAsNonRoot: true
+    seccompProfile:
+      type: RuntimeDefault
   containers:
-    - {name: app, image: busybox:1.36, command: ["sleep", "3600"]}
+    - name: app
+      image: busybox:1.36
+      command: ["sleep", "3600"]
+      securityContext:
+        allowPrivilegeEscalation: false
+        capabilities:
+          drop: ["ALL"]
🧰 Tools
🪛 Checkov (3.2.528)

| Severity | Lines | Finding | Check |
|---|---|---|---|
| medium | 22-33 | Containers should not run with allowPrivilegeEscalation | CKV_K8S_20 |
| medium | 22-33 | Minimize the admission of root containers | CKV_K8S_23 |
| medium | 34-45 | Containers should not run with allowPrivilegeEscalation | CKV_K8S_20 |
| medium | 34-45 | Minimize the admission of root containers | CKV_K8S_23 |
| medium | 46-57 | Containers should not run with allowPrivilegeEscalation | CKV_K8S_20 |
| medium | 46-57 | Minimize the admission of root containers | CKV_K8S_23 |
| medium | 58-69 | Containers should not run with allowPrivilegeEscalation | CKV_K8S_20 |
| medium | 58-69 | Minimize the admission of root containers | CKV_K8S_23 |
| medium | 70-81 | Containers should not run with allowPrivilegeEscalation | CKV_K8S_20 |
| medium | 70-81 | Minimize the admission of root containers | CKV_K8S_23 |
| medium | 82-93 | Containers should not run with allowPrivilegeEscalation | CKV_K8S_20 |
| medium | 82-93 | Minimize the admission of root containers | CKV_K8S_23 |
| medium | 94-105 | Containers should not run with allowPrivilegeEscalation | CKV_K8S_20 |
| medium | 94-105 | Minimize the admission of root containers | CKV_K8S_23 |
| medium | 106-117 | Containers should not run with allowPrivilegeEscalation | CKV_K8S_20 |
| medium | 106-117 | Minimize the admission of root containers | CKV_K8S_23 |
| medium | 118-129 | Containers should not run with allowPrivilegeEscalation | CKV_K8S_20 |
| medium | 118-129 | Minimize the admission of root containers | CKV_K8S_23 |
| medium | 130-141 | Containers should not run with allowPrivilegeEscalation | CKV_K8S_20 |
| medium | 130-141 | Minimize the admission of root containers | CKV_K8S_23 |
| medium | 142-153 | Containers should not run with allowPrivilegeEscalation | CKV_K8S_20 |
| medium | 142-153 | Minimize the admission of root containers | CKV_K8S_23 |
| medium | 154-165 | Containers should not run with allowPrivilegeEscalation | CKV_K8S_20 |
| medium | 154-165 | Minimize the admission of root containers | CKV_K8S_23 |
| medium | 166-177 | Containers should not run with allowPrivilegeEscalation | CKV_K8S_20 |
| medium | 166-177 | Minimize the admission of root containers | CKV_K8S_23 |
| medium | 178-189 | Containers should not run with allowPrivilegeEscalation | CKV_K8S_20 |
| medium | 178-189 | Minimize the admission of root containers | CKV_K8S_23 |
| medium | 190-200 | Containers should not run with allowPrivilegeEscalation | CKV_K8S_20 |
| medium | 190-200 | Minimize the admission of root containers | CKV_K8S_23 |

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tests/llm/fixtures/test_ask_holmes/261_k8s_vague_overview_no_bash_aggregation/manifest.yaml`
around lines 22 - 200, All Pod resources in this manifest (e.g.,
web-renderer-v6q8, asset-cdn-v6q8, session-router-v6q8, etc.) lack an explicit
non-root security context; add a pod-level securityContext with runAsNonRoot:
true and runAsUser: 1000 and ensure each container (name: app) sets
securityContext.allowPrivilegeEscalation: false (or container-level
runAsNonRoot/runAsUser if you prefer container scope) so every Pod/spec and its
container securityContext explicitly enforce non-root and no privilege
escalation across all Pod definitions.
