Add LLM eval tests for Kubernetes tool usage patterns #2048
Conversation
Two failing evals capture the anti-pattern where the LLM reaches for bash with awk/sort/uniq/for-loops to aggregate Kubernetes data when dedicated tools (kubernetes_jq_query, kubernetes_tabular_query, kubernetes_count) would do the same job in one call without triggering approval prompts.

- 259_k8s_event_grouping_prefer_dedicated_tools: pods stuck on image pull failures across 4 namespaces; Holmes must group by namespace without composing a shell aggregation pipeline.
- 260_k8s_multi_pod_node_lookup_prefer_dedicated_tools: 5 pods across 3 namespaces; Holmes must report each pod's nodeName without a for/while shell loop.

Both are tagged 'hard' since the LLM currently fails them by choosing the bash anti-pattern; they will move to easy/regression once the fix lands.

Signed-off-by: Claude <noreply@anthropic.com>
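To make the anti-pattern concrete, here is a minimal sketch of the kind of shell aggregation these evals penalize. The namespaces and pod names are invented, and a small function stands in for live `kubectl` output, since no cluster is assumed here:

```shell
# Fabricated stand-in for `kubectl get pods -A --no-headers` output;
# these namespaces/pods are illustrative, not from the eval fixtures.
sample_pods() {
  printf '%s\n' \
    'app-a  pod-1  0/1  ImagePullBackOff' \
    'app-a  pod-2  0/1  ImagePullBackOff' \
    'app-b  pod-3  0/1  ErrImagePull' \
    'app-c  pod-4  0/1  ImagePullBackOff'
}

# The bash anti-pattern under test: grouping counts per namespace with a
# shell pipeline (awk | sort | uniq -c) instead of one dedicated-tool call.
count_by_namespace() {
  sample_pods | awk '{print $1}' | sort | uniq -c | sort -rn
}

count_by_namespace
```

A dedicated tool such as kubernetes_count would return the same grouped counts in a single call, without composing the approval-prompting pipeline shown above.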
Claude Code Review
This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.
Tip: disable this comment in your organization's Code Review settings.
📂 Previous Runs

📜 #4 · Run @ f39a8f3 (#25936448852) — May 15, 19:18 UTC — ✅ Results of HolmesGPT evals, automatically triggered by commit f39a8f3
Benchmark comparison unavailable: no ci-benchmark experiments found (baseline: latest ci-benchmark experiment on master).

📜 #3 · Run @ 900fdf0 (#25927341250) — May 15, 15:58 UTC — ✅ Results of HolmesGPT evals, automatically triggered by commit 900fdf0
Benchmark comparison unavailable: no ci-benchmark experiments found (baseline: latest ci-benchmark experiment on master).

📜 #2 · Run @ 900fdf0 (#25926210430) — May 15, 15:41 UTC — ✅ Results of HolmesGPT evals, automatically triggered by commit 900fdf0
Benchmark comparison unavailable: no ci-benchmark experiments found (baseline: latest ci-benchmark experiment on master).
| Status | Test case | Time | Turns | Tools | Cost | Total tokens | Input | Max input | Output | Max output | Cached | Non-cached | Reasoning | Compactions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ✅ | 📄 09_crashpod | 34.7s | 4 | 10 | $0.2735 | 90,638 | 88,403 | 25,398 | 2,235 | 929 | 62,451 | 25,952 | 277 | — |
| ✅ | 📄 101_loki_historical_logs_pod_deleted | 78.6s | 8 | 17 | $0.4529 | 205,671 | 200,449 | 30,081 | 5,222 | 933 | 168,584 | 31,865 | 1,382 | — |
| ✅ | 📄 112_find_pvcs_by_uuid | 16.9s | 3 | 4 | $0.1938 | 59,828 | 58,814 | 20,926 | 1,014 | 604 | 37,634 | 21,180 | 138 | — |
| ✅ | 📄 12_job_crashing | 39.8s | 6 | 14 | $0.3087 | 136,723 | 134,191 | 24,986 | 2,532 | 973 | 107,747 | 26,444 | 187 | — |
| ✅ | 📄 176_network_policy_blocking_traffic_no_skills | 56.2s | 6 | 14 | $0.3457 | 143,764 | 140,441 | 27,620 | 3,323 | 767 | 112,345 | 28,096 | 756 | — |
| ✅ | 📄 227_count_configmaps_per_namespace[0] | 19.5s | 4 | 9 | $0.2145 | 79,088 | 77,954 | 21,244 | 1,134 | 590 | 55,474 | 22,480 | 76 | — |
| ✅ | 📄 243_pod_names_contain_service | 34.0s | 5 | 10 | $0.2706 | 106,887 | 104,701 | 23,431 | 2,186 | 902 | 80,314 | 24,387 | 286 | — |
| ✅ | 📄 24_misconfigured_pvc | 39.7s | 5 | 15 | $0.2972 | 110,821 | 108,099 | 24,680 | 2,722 | 1,044 | 82,056 | 26,043 | 206 | — |
| ✅ | 📄 259_k8s_event_grouping_prefer_dedicated_tools | 16.9s | 3 | 6 | $0.1880 | 58,180 | 57,196 | 20,121 | 984 | 630 | 36,623 | 20,573 | 38 | — |
| ✅ | 📄 260_k8s_multi_pod_node_lookup_prefer_dedicated_tools | 18.2s | 3 | 7 | $0.1918 | 58,400 | 57,271 | 20,012 | 1,129 | 838 | 36,713 | 20,558 | 165 | — |
| ✅ | 📄 261_k8s_vague_overview_no_bash_aggregation | 30.8s | 4 | 11 | $0.2420 | 82,285 | 80,274 | 21,990 | 2,011 | 707 | 57,584 | 22,690 | 132 | — |
| ❌ | 📄 262_k8s_label_distribution_no_bash_aggregation | 15.0s | 3 | 3 | $0.1684 | 55,661 | 54,997 | 18,960 | 664 | 336 | 36,033 | 18,964 | 156 | — |
| ✅ | 📄 263_k8s_namespace_summary_no_bash_aggregation | 16.1s | 3 | 6 | $0.1921 | 58,072 | 56,929 | 19,994 | 1,143 | 640 | 36,343 | 20,586 | 78 | — |
| ✅ | 📄 264_k8s_find_outliers_no_bash_aggregation | 13.2s | 3 | 3 | $0.1698 | 56,082 | 55,424 | 19,161 | 658 | 390 | 36,259 | 19,165 | 73 | — |
| ✅ | 📄 43_current_datetime_from_prompt | 3.2s | 1 | — | $0.1231 | 17,594 | 17,491 | 17,491 | 103 | 103 | 0 | 17,491 | 61 | — |
| ✅ | 📄 51_logs_summarize_errors | 22.3s | 4 | 5 | $0.2129 | 79,968 | 78,794 | 21,700 | 1,174 | 388 | 57,089 | 21,705 | 85 | — |
| ✅ | 📄 61_exact_match_counting | 11.1s | 3 | 3 | $0.1563 | 54,476 | 54,121 | 18,458 | 355 | 208 | 35,659 | 18,462 | 24 | — |
| Total | | 27.4s avg | 4.0 avg | 8.6 avg | $4.0014 | 1,454,138 | 1,425,549 | 30,081 | 28,589 | 1,044 | 1,038,908 | 386,641 | 4,120 | — |
Benchmark Comparison Details
Master baseline: latest master-* experiment (post-merge regression eval)
Status: 10 test/model combinations loaded
- master-25938985537 (created: 2026-05-15)
Benchmark baseline: latest ci-benchmark experiment on master
Status: 147 test/model combinations loaded
- ci-benchmark-25268492423 (created: 2026-05-03)
Time comparison (seconds):
| Test case | This branch | master (1h ago) | Δ vs master | benchmark (12d ago) | Δ vs benchmark |
|---|---|---|---|---|---|
| 09_crashpod (opus-4.6) 📄 | 34.7s | 34.5s | ±0% | 32.5s | ±0% |
| 101_loki_historical_logs_pod_deleted (opus-4.6) 📄 | 78.6s | 69.3s | ↑13% | 51.7s | ↑52% |
| 112_find_pvcs_by_uuid (opus-4.6) 📄 | 16.9s | — | — | 18.1s | ±0% |
| 12_job_crashing (opus-4.6) 📄 | 39.8s | 40.7s | ±0% | 42.4s | ±0% |
| 176_network_policy_blocking_traffic_no_skills (opus-4.6) 📄 | 56.2s | 51.3s | ±0% | 35.7s | ↑57% |
| 227_count_configmaps_per_namespace[0] (opus-4.6) 📄 | 19.5s | — | — | — | — |
| 243_pod_names_contain_service (opus-4.6) 📄 | 34.0s | 36.8s | ±0% | 27.4s | ↑24% |
| 24_misconfigured_pvc (opus-4.6) 📄 | 39.7s | 40.0s | ±0% | 35.8s | ↑11% |
| 259_k8s_event_grouping_prefer_dedicated_tools (opus-4.6) 📄 | 16.9s | — | — | — | — |
| 260_k8s_multi_pod_node_lookup_prefer_dedicated_tools (opus-4.6) 📄 | 18.2s | — | — | — | — |
| 261_k8s_vague_overview_no_bash_aggregation (opus-4.6) 📄 | 30.8s | — | — | — | — |
| 262_k8s_label_distribution_no_bash_aggregation (opus-4.6) 📄 | 15.0s | — | — | — | — |
| 263_k8s_namespace_summary_no_bash_aggregation (opus-4.6) 📄 | 16.1s | — | — | — | — |
| 264_k8s_find_outliers_no_bash_aggregation (opus-4.6) 📄 | 13.2s | — | — | — | — |
| 43_current_datetime_from_prompt (opus-4.6) 📄 | 3.2s | 2.9s | ±0% | — | — |
| 51_logs_summarize_errors (opus-4.6) 📄 | 22.3s | 21.5s | ±0% | 23.1s | ±0% |
| 61_exact_match_counting (opus-4.6) 📄 | 11.1s | 10.2s | ±0% | 10.8s | ±0% |
| Average (m=9, b=9) | 35.5s | 34.1s | ±0% | 30.8s | ↑20% |
Cost comparison:
| Test case | This branch | master (1h ago) | Δ vs master | benchmark (12d ago) | Δ vs benchmark |
|---|---|---|---|---|---|
| 09_crashpod (opus-4.6) 📄 | $0.2735 | $0.2687 | ±0% | $0.2616 | ±0% |
| 101_loki_historical_logs_pod_deleted (opus-4.6) 📄 | $0.4529 | $0.4009 | ↑13% | $0.3371 | ↑34% |
| 112_find_pvcs_by_uuid (opus-4.6) 📄 | $0.1938 | — | — | $0.2014 | ±0% |
| 12_job_crashing (opus-4.6) 📄 | $0.3087 | $0.3193 | ±0% | $0.3076 | ±0% |
| 176_network_policy_blocking_traffic_no_skills (opus-4.6) 📄 | $0.3457 | $0.3430 | ±0% | $0.2914 | ↑19% |
| 227_count_configmaps_per_namespace[0] (opus-4.6) 📄 | $0.2145 | — | — | — | — |
| 243_pod_names_contain_service (opus-4.6) 📄 | $0.2706 | $0.2751 | ±0% | $0.2280 | ↑19% |
| 24_misconfigured_pvc (opus-4.6) 📄 | $0.2972 | $0.2929 | ±0% | $0.2831 | ±0% |
| 259_k8s_event_grouping_prefer_dedicated_tools (opus-4.6) 📄 | $0.1880 | — | — | — | — |
| 260_k8s_multi_pod_node_lookup_prefer_dedicated_tools (opus-4.6) 📄 | $0.1918 | — | — | — | — |
| 261_k8s_vague_overview_no_bash_aggregation (opus-4.6) 📄 | $0.2420 | — | — | — | — |
| 262_k8s_label_distribution_no_bash_aggregation (opus-4.6) 📄 | $0.1684 | — | — | — | — |
| 263_k8s_namespace_summary_no_bash_aggregation (opus-4.6) 📄 | $0.1921 | — | — | — | — |
| 264_k8s_find_outliers_no_bash_aggregation (opus-4.6) 📄 | $0.1698 | — | — | — | — |
| 43_current_datetime_from_prompt (opus-4.6) 📄 | $0.1231 | $0.0122 | ↑912% | — | — |
| 51_logs_summarize_errors (opus-4.6) 📄 | $0.2129 | $0.2051 | ±0% | $0.2072 | ±0% |
| 61_exact_match_counting (opus-4.6) 📄 | $0.1563 | $0.1522 | ±0% | $0.1522 | ±0% |
| Average (m=9, b=9) | $0.2712 | $0.2521 | ±0% | $0.2522 | ↑11% |
Total tokens comparison:
| Test case | This branch | master (1h ago) | Δ vs master | benchmark (12d ago) | Δ vs benchmark |
|---|---|---|---|---|---|
| 09_crashpod (opus-4.6) 📄 | 90,638 | 105,590 | ↓14% | 103,497 | ↓12% |
| 101_loki_historical_logs_pod_deleted (opus-4.6) 📄 | 205,671 | 167,742 | ↑23% | 138,670 | ↑48% |
| 112_find_pvcs_by_uuid (opus-4.6) 📄 | 59,828 | — | — | 61,169 | ±0% |
| 12_job_crashing (opus-4.6) 📄 | 136,723 | 147,210 | ±0% | 133,893 | ±0% |
| 176_network_policy_blocking_traffic_no_skills (opus-4.6) 📄 | 143,764 | 139,843 | ±0% | 111,145 | ↑29% |
| 227_count_configmaps_per_namespace[0] (opus-4.6) 📄 | 79,088 | — | — | — | — |
| 243_pod_names_contain_service (opus-4.6) 📄 | 106,887 | 107,178 | ±0% | 79,525 | ↑34% |
| 24_misconfigured_pvc (opus-4.6) 📄 | 110,821 | 107,382 | ±0% | 108,047 | ±0% |
| 259_k8s_event_grouping_prefer_dedicated_tools (opus-4.6) 📄 | 58,180 | — | — | — | — |
| 260_k8s_multi_pod_node_lookup_prefer_dedicated_tools (opus-4.6) 📄 | 58,400 | — | — | — | — |
| 261_k8s_vague_overview_no_bash_aggregation (opus-4.6) 📄 | 82,285 | — | — | — | — |
| 262_k8s_label_distribution_no_bash_aggregation (opus-4.6) 📄 | 55,661 | — | — | — | — |
| 263_k8s_namespace_summary_no_bash_aggregation (opus-4.6) 📄 | 58,072 | — | — | — | — |
| 264_k8s_find_outliers_no_bash_aggregation (opus-4.6) 📄 | 56,082 | — | — | — | — |
| 43_current_datetime_from_prompt (opus-4.6) 📄 | 17,594 | 17,043 | ±0% | — | — |
| 51_logs_summarize_errors (opus-4.6) 📄 | 79,968 | 77,335 | ±0% | 77,707 | ±0% |
| 61_exact_match_counting (opus-4.6) 📄 | 54,476 | 52,855 | ±0% | 52,942 | ±0% |
| Average (m=9, b=9) | 105,171 | 102,464 | ±0% | 96,288 | ↑14% |
Cached tokens comparison:
| Test case | This branch | master (1h ago) | Δ vs master | benchmark (12d ago) | Δ vs benchmark |
|---|---|---|---|---|---|
| 09_crashpod (opus-4.6) 📄 | 62,451 | 79,438 | ↓21% | 77,391 | ↓19% |
| 101_loki_historical_logs_pod_deleted (opus-4.6) 📄 | 168,584 | 133,072 | ↑27% | 106,565 | ↑58% |
| 112_find_pvcs_by_uuid (opus-4.6) 📄 | 37,634 | — | — | 38,002 | ±0% |
| 12_job_crashing (opus-4.6) 📄 | 107,747 | 117,018 | ±0% | 104,761 | ±0% |
| 176_network_policy_blocking_traffic_no_skills (opus-4.6) 📄 | 112,345 | 108,706 | ±0% | 81,519 | ↑38% |
| 227_count_configmaps_per_namespace[0] (opus-4.6) 📄 | 55,474 | — | — | — | — |
| 243_pod_names_contain_service (opus-4.6) 📄 | 80,314 | 80,447 | ±0% | 55,513 | ↑45% |
| 24_misconfigured_pvc (opus-4.6) 📄 | 82,056 | 79,303 | ±0% | 80,270 | ±0% |
| 259_k8s_event_grouping_prefer_dedicated_tools (opus-4.6) 📄 | 36,623 | — | — | — | — |
| 260_k8s_multi_pod_node_lookup_prefer_dedicated_tools (opus-4.6) 📄 | 36,713 | — | — | — | — |
| 261_k8s_vague_overview_no_bash_aggregation (opus-4.6) 📄 | 57,584 | — | — | — | — |
| 262_k8s_label_distribution_no_bash_aggregation (opus-4.6) 📄 | 36,033 | — | — | — | — |
| 263_k8s_namespace_summary_no_bash_aggregation (opus-4.6) 📄 | 36,343 | — | — | — | — |
| 264_k8s_find_outliers_no_bash_aggregation (opus-4.6) 📄 | 36,259 | — | — | — | — |
| 43_current_datetime_from_prompt (opus-4.6) 📄 | — | 16,937 | — | — | — |
| 51_logs_summarize_errors (opus-4.6) 📄 | 57,089 | 55,251 | ±0% | 55,443 | ±0% |
| 61_exact_match_counting (opus-4.6) 📄 | 35,659 | 34,570 | ±0% | 34,632 | ±0% |
| Average (m=8, b=9) | 88,281 | 85,976 | ±0% | 70,455 | ↑17% |
Turns comparison:
| Test case | This branch | master (1h ago) | Δ vs master | benchmark (12d ago) | Δ vs benchmark |
|---|---|---|---|---|---|
| 09_crashpod (opus-4.6) 📄 | 4 | 5 | ↓20% | — | — |
| 101_loki_historical_logs_pod_deleted (opus-4.6) 📄 | 8 | 7 | ↑14% | — | — |
| 112_find_pvcs_by_uuid (opus-4.6) 📄 | 3 | — | — | — | — |
| 12_job_crashing (opus-4.6) 📄 | 6 | 6 | ±0% | — | — |
| 176_network_policy_blocking_traffic_no_skills (opus-4.6) 📄 | 6 | 6 | ±0% | — | — |
| 227_count_configmaps_per_namespace[0] (opus-4.6) 📄 | 4 | — | — | — | — |
| 243_pod_names_contain_service (opus-4.6) 📄 | 5 | 5 | ±0% | — | — |
| 24_misconfigured_pvc (opus-4.6) 📄 | 5 | 5 | ±0% | — | — |
| 259_k8s_event_grouping_prefer_dedicated_tools (opus-4.6) 📄 | 3 | — | — | — | — |
| 260_k8s_multi_pod_node_lookup_prefer_dedicated_tools (opus-4.6) 📄 | 3 | — | — | — | — |
| 261_k8s_vague_overview_no_bash_aggregation (opus-4.6) 📄 | 4 | — | — | — | — |
| 262_k8s_label_distribution_no_bash_aggregation (opus-4.6) 📄 | 3 | — | — | — | — |
| 263_k8s_namespace_summary_no_bash_aggregation (opus-4.6) 📄 | 3 | — | — | — | — |
| 264_k8s_find_outliers_no_bash_aggregation (opus-4.6) 📄 | 3 | — | — | — | — |
| 43_current_datetime_from_prompt (opus-4.6) 📄 | 1 | 1 | ±0% | — | — |
| 51_logs_summarize_errors (opus-4.6) 📄 | 4 | 4 | ±0% | — | — |
| 61_exact_match_counting (opus-4.6) 📄 | 3 | 3 | ±0% | — | — |
| Average (m=9, b=0) | 4.7 | 4.7 | ±0% | — | — |
Tool calls comparison:
| Test case | This branch | master (1h ago) | Δ vs master | benchmark (12d ago) | Δ vs benchmark |
|---|---|---|---|---|---|
| 09_crashpod (opus-4.6) 📄 | 10 | 10 | ±0% | 10 | ±0% |
| 101_loki_historical_logs_pod_deleted (opus-4.6) 📄 | 17 | 15 | ↑13% | 14 | ↑21% |
| 112_find_pvcs_by_uuid (opus-4.6) 📄 | 4 | — | — | 4 | ±0% |
| 12_job_crashing (opus-4.6) 📄 | 14 | 12 | ↑17% | 14 | ±0% |
| 176_network_policy_blocking_traffic_no_skills (opus-4.6) 📄 | 14 | 15 | ±0% | 13 | ±0% |
| 227_count_configmaps_per_namespace[0] (opus-4.6) 📄 | 9 | — | — | — | — |
| 243_pod_names_contain_service (opus-4.6) 📄 | 10 | 11 | ±0% | 8 | ↑25% |
| 24_misconfigured_pvc (opus-4.6) 📄 | 15 | 13 | ↑15% | 14 | ±0% |
| 259_k8s_event_grouping_prefer_dedicated_tools (opus-4.6) 📄 | 6 | — | — | — | — |
| 260_k8s_multi_pod_node_lookup_prefer_dedicated_tools (opus-4.6) 📄 | 7 | — | — | — | — |
| 261_k8s_vague_overview_no_bash_aggregation (opus-4.6) 📄 | 11 | — | — | — | — |
| 262_k8s_label_distribution_no_bash_aggregation (opus-4.6) 📄 | 3 | — | — | — | — |
| 263_k8s_namespace_summary_no_bash_aggregation (opus-4.6) 📄 | 6 | — | — | — | — |
| 264_k8s_find_outliers_no_bash_aggregation (opus-4.6) 📄 | 3 | — | — | — | — |
| 43_current_datetime_from_prompt (opus-4.6) 📄 | — | — | — | — | — |
| 51_logs_summarize_errors (opus-4.6) 📄 | 5 | 5 | ±0% | 5 | ±0% |
| 61_exact_match_counting (opus-4.6) 📄 | 3 | 3 | ±0% | 3 | ±0% |
| Average (m=8, b=9) | 11.0 | 10.5 | ±0% | 9.4 | ±0% |
Comparison indicators:
- ±0% — diff under 10% (within noise threshold)
- ↑N% / ↓N% — diff 10-25%
- ↑N% / ↓N% — diff over 25% (significant)
⚠️ 1 Failure Detected
📖 Legend
| Icon | Meaning |
|---|---|
| ✅ | The test was successful |
| ➖ | The test was skipped |
| | The test failed but is known to be flaky or known to fail |
| 🚧 | The test had a setup failure (not a code regression) |
| 🔧 | The test failed due to mock data issues (not a code regression) |
| 🚫 | The test was throttled by API rate limits/overload |
| ❌ | The test failed and should be fixed before merging the PR |
🔄 Re-run evals manually
⚠️ Warning: /eval comments always run using the workflow from master, not from this PR branch. If you modified the GitHub Action (e.g., added secrets or env vars), those changes won't take effect. To test workflow changes, use the GitHub CLI or Actions UI instead:
gh workflow run eval-regression.yaml --repo HolmesGPT/holmesgpt --ref claude/fix-k8s-approval-issue-xNNVP -f markers=regression -f filter=
Option 1: Comment on this PR with /eval:
/eval
tags: regression
Or with more options (one per line):
/eval
model: gpt-4o
tags: regression
id: 09_crashpod
iterations: 5
Run evals on a different branch (e.g., master) for comparison:
/eval
branch: master
tags: regression
| Option | Description |
|---|---|
| `model` | Model(s) to test (default: same as automatic runs) |
| `tags` | Pytest tags / markers (no default - runs all tests!) |
| `id` | Eval ID / pytest -k filter (use /list to see valid eval names) |
| `iterations` | Number of runs, max 10 |
| `branch` | Run evals on a different branch (for cross-branch comparison) |
Quick re-run: Use /rerun to re-run the most recent /eval on this PR with the same parameters.
Option 2: Trigger via GitHub Actions UI → "Run workflow"
Option 3: Add PR labels to include extra evals (applies to both automatic runs and /eval comments):
| Label | Effect |
|---|---|
| `evals-tag-<name>` | Run tests with tag `<name>` alongside regression |
| `evals-id-<name>` | Run a specific eval by test ID |
| `evals-model-<name>` | Override the model (use model list name, e.g. sonnet-4.5) |
Examples: evals-tag-easy, evals-id-09_crashpod, evals-model-sonnet-4.5
🏷️ Valid tags
benchmark, chain-of-causation, compaction, confluence, context_window, conversation_worker, coralogix, counting, database, datadog, datetime, db-connectors, easy, elasticsearch, embeds, fast, frontend, grafana, hard, images, integration, kafka, kubernetes-no-bash-approval, kubernetes, leaked-information, logs, loki, manual, mcp, medium, metrics, network, newrelic, no-cicd, numerical, one-test, port-forward, prometheus, question-answer, regression, skills, slackbot, storage, token-limit, toolset-limitation, traces, transparency, victorialogs
🤖 Valid models
deepseek-chat, deepseek-r1-reasoner, deepseek-reasoner, deepseek-v3.2-chat, gemini-3-flash-preview, gemini-3-pro-preview, gemini-3.1-pro-preview, gpt-4.1, gpt-5.2-high-reasoning, gpt-5.3-codex, gpt-5.4, haiku-4.5, kimi-2.5, kimi-2.5-openrouter, opus-4.5, opus-4.6, opus-4.7, qwen-next-80B-instruct, qwen-next-80B-thinking, sonnet-4.5, sonnet-4.6
Commands: /eval · /rerun · /list
CLI: gh workflow run eval-regression.yaml --repo HolmesGPT/holmesgpt --ref claude/fix-k8s-approval-issue-xNNVP -f markers=regression -f filter=
Walkthrough

Adds six Kubernetes manifest/test fixtures (tests 259–264) requiring dedicated Kubernetes queries (or direct kubectl calls) instead of bash aggregation loops, updates bash tool instructions to prefer dedicated tools, and adds a pytest marker to flag tests that forbid bash aggregation/approval flows.

Changes

Kubernetes LLM test fixtures and tooling guidance
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks: ✅ Passed checks (5 passed)
✅ Docker images ready

Use these tags to pull the images for testing.

📋 Copy commands:

gcloud auth configure-docker us-central1-docker.pkg.dev
docker pull us-central1-docker.pkg.dev/robusta-development/temporary-builds/holmes:ab754a6d
docker tag us-central1-docker.pkg.dev/robusta-development/temporary-builds/holmes:ab754a6d me-west1-docker.pkg.dev/robusta-development/development/holmes-dev:ab754a6d
docker push me-west1-docker.pkg.dev/robusta-development/development/holmes-dev:ab754a6d
docker pull us-central1-docker.pkg.dev/robusta-development/temporary-builds/holmes-operator:ab754a6d
docker tag us-central1-docker.pkg.dev/robusta-development/temporary-builds/holmes-operator:ab754a6d me-west1-docker.pkg.dev/robusta-development/development/holmes-operator-dev:ab754a6d
docker push me-west1-docker.pkg.dev/robusta-development/development/holmes-operator-dev:ab754a6d

Patch Helm values in one line (choose the chart you use):

HolmesGPT chart:

helm upgrade --install holmesgpt ./helm/holmes \
  --set registry=me-west1-docker.pkg.dev/robusta-development/development \
  --set image=holmes-dev:ab754a6d \
  --set operator.registry=me-west1-docker.pkg.dev/robusta-development/development \
  --set operator.image=holmes-operator-dev:ab754a6d

Robusta wrapper chart:

helm upgrade --install robusta robusta/robusta \
  --reuse-values \
  --set holmes.registry=me-west1-docker.pkg.dev/robusta-development/development \
  --set holmes.image=holmes-dev:ab754a6d \
  --set holmes.operator.registry=me-west1-docker.pkg.dev/robusta-development/development \
  --set holmes.operator.image=holmes-operator-dev:ab754a6d
✅ Deploy Preview for holmes-docs ready!
New pytest marker for the suite of evals that assert Holmes uses dedicated Kubernetes tools (kubernetes_jq_query, kubernetes_tabular_query, kubernetes_count) instead of bash pipelines/loops that would trigger user approval prompts. Lets us run the whole group via `pytest -m kubernetes-no-bash-approval`. Signed-off-by: Claude <noreply@anthropic.com>
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/llm/fixtures/test_ask_holmes/259_k8s_event_grouping_prefer_dedicated_tools/test_case.yaml (1)
Lines 1-87: ⚠️ Potential issue | 🔴 Critical

Test number 259 is already in use; choose a different sequential number.

Test number 259 conflicts with the existing test directory tests/llm/fixtures/test_ask_holmes/259_k8s_event_grouping_prefer_dedicated_tools/. The next available sequential test number is 261. Update the directory name and all references within the test to use a new sequential number that does not conflict with existing tests (256–260).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/llm/fixtures/test_ask_holmes/259_k8s_event_grouping_prefer_dedicated_tools/test_case.yaml` around lines 1 - 87, The test directory name conflicts with an existing test (259); rename the directory and all in-file references from "259_k8s_event_grouping_prefer_dedicated_tools" (and any bare "259" test-number metadata) to the next available sequential number "261_k8s_event_grouping_prefer_dedicated_tools", updating the directory name and every occurrence of the test-number string inside the YAML (ensure you do not alter the app-259-* namespace names unless they are meant to change) so the test number is unique.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@tests/llm/fixtures/test_ask_holmes/259_k8s_event_grouping_prefer_dedicated_tools/test_case.yaml`:
- Around line 20-28: The expected_output block currently uses quoted multi-line
strings that preserve newlines/leading whitespace; replace those quoted
multi-line entries with a folded scalar style (>) like user_prompt to collapse
lines and remove leading spaces so the YAML parser/validator sees the intended
single-paragraph text; locate the expected_output key in the test case
(referenced as expected_output) and convert each quoted multi-line value to a
folded scalar (>), ensuring indentation matches the surrounding YAML and content
lines are wrapped without the manual line breaks.
In
`@tests/llm/fixtures/test_ask_holmes/260_k8s_multi_pod_node_lookup_prefer_dedicated_tools/test_case.yaml`:
- Around line 25-28: Update the acceptance text that currently permits "one call
per pod" so it instead requires either a dedicated Kubernetes tool
(kubernetes_tabular_query or kubernetes_jq_query) OR batched kubectl calls
(e.g., a single "kubectl get pods ..." or a single "kubectl get pod <multiple>"
style command) — remove the allowance for one-per-pod bash calls; also add
include_tool_calls: true to this test case to ensure tool execution is
validated; look for the string containing "Must answer using a dedicated
Kubernetes tool..." and the test case metadata to apply these changes.
---
Outside diff comments:
In
`@tests/llm/fixtures/test_ask_holmes/259_k8s_event_grouping_prefer_dedicated_tools/test_case.yaml`:
- Around line 1-87: The test directory name conflicts with an existing test
(259); rename the directory and all in-file references from
"259_k8s_event_grouping_prefer_dedicated_tools" (and any bare "259" test-number
metadata) to the next available sequential number
"261_k8s_event_grouping_prefer_dedicated_tools", updating the directory name and
every occurrence of the test-number string inside the YAML (ensure you do not
alter the app-259-* namespace names unless they are meant to change) so the test
number is unique.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: e594859c-c880-41c9-94e7-c57190e6eabd
📒 Files selected for processing (4)
- tests/llm/fixtures/test_ask_holmes/259_k8s_event_grouping_prefer_dedicated_tools/manifest.yaml
- tests/llm/fixtures/test_ask_holmes/259_k8s_event_grouping_prefer_dedicated_tools/test_case.yaml
- tests/llm/fixtures/test_ask_holmes/260_k8s_multi_pod_node_lookup_prefer_dedicated_tools/manifest.yaml
- tests/llm/fixtures/test_ask_holmes/260_k8s_multi_pod_node_lookup_prefer_dedicated_tools/test_case.yaml
    - "Must answer using the dedicated Kubernetes tools (one or more of
      kubernetes_jq_query, kubernetes_tabular_query, kubernetes_count, or
      kubectl_find_resource). Counting via simple per-namespace kubectl get
      calls through the bash tool is acceptable."
    - "Must NOT use the bash tool with shell pipelines that aggregate or group
      results (for example: piping kubectl output through awk, sort, uniq -c,
      or wc to compute grouped counts). The grouping/aggregation must be done
      by a dedicated Kubernetes tool or by reading the kubectl output directly,
      not by composing a shell pipeline."
Fix multi-line string formatting in expected_output.
Lines 20-28 use quoted strings with manual line breaks, which preserves newlines and leading whitespace. This differs from the proper folded scalar style used for user_prompt at line 1 (using >). The preserved whitespace could cause validation issues.
📝 Proposed fix using proper YAML multi-line format
- "Must identify app-259-control as the namespace with the most stuck pods"
- - "Must answer using the dedicated Kubernetes tools (one or more of
- kubernetes_jq_query, kubernetes_tabular_query, kubernetes_count, or
- kubectl_find_resource). Counting via simple per-namespace kubectl get
- calls through the bash tool is acceptable."
- - "Must NOT use the bash tool with shell pipelines that aggregate or group
- results (for example: piping kubectl output through awk, sort, uniq -c,
- or wc to compute grouped counts). The grouping/aggregation must be done
- by a dedicated Kubernetes tool or by reading the kubectl output directly,
- not by composing a shell pipeline."
+ - >
+ Must answer using the dedicated Kubernetes tools (one or more of
+ kubernetes_jq_query, kubernetes_tabular_query, kubernetes_count, or
+ kubectl_find_resource). Counting via simple per-namespace kubectl get
+ calls through the bash tool is acceptable.
+ - >
+ Must NOT use the bash tool with shell pipelines that aggregate or group
+ results (for example: piping kubectl output through awk, sort, uniq -c,
+ or wc to compute grouped counts). The grouping/aggregation must be done
+ by a dedicated Kubernetes tool or by reading the kubectl output directly,
+ not by composing a shell pipeline.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@tests/llm/fixtures/test_ask_holmes/259_k8s_event_grouping_prefer_dedicated_tools/test_case.yaml`
around lines 20 - 28, The expected_output block currently uses quoted multi-line
strings that preserve newlines/leading whitespace; replace those quoted
multi-line entries with a folded scalar style (>) like user_prompt to collapse
lines and remove leading spaces so the YAML parser/validator sees the intended
single-paragraph text; locate the expected_output key in the test case
(referenced as expected_output) and convert each quoted multi-line value to a
folded scalar (>), ensuring indentation matches the surrounding YAML and content
lines are wrapped without the manual line breaks.
    - "Must answer using a dedicated Kubernetes tool such as
      kubernetes_tabular_query or kubernetes_jq_query, OR by issuing direct
      'kubectl get pod ... -o ...' / 'kubectl get pods ... -o ...' calls
      through the bash tool — one call per pod is acceptable."
Tighten acceptance criteria to enforce batched/dedicated lookup.
Line 25–28 currently allows “one call per pod” through bash, which contradicts the stated goal at Line 14–17 and can let the non-batched anti-pattern pass.
Suggested fix
- - "Must answer using a dedicated Kubernetes tool such as
- kubernetes_tabular_query or kubernetes_jq_query, OR by issuing direct
- 'kubectl get pod ... -o ...' / 'kubectl get pods ... -o ...' calls
- through the bash tool — one call per pod is acceptable."
+ - "Must answer using a dedicated Kubernetes tool such as
+ kubernetes_tabular_query or kubernetes_jq_query, OR by issuing a single
+ batched kubectl query (e.g., one 'kubectl get pods ... -o ...' call)
+ through the bash tool."

As per coding guidelines: "User prompts must be specific and match the test - test exact values and discovery of information… use include_tool_calls: true to verify tool execution."
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Original:

    - "Must answer using a dedicated Kubernetes tool such as
      kubernetes_tabular_query or kubernetes_jq_query, OR by issuing direct
      'kubectl get pod ... -o ...' / 'kubectl get pods ... -o ...' calls
      through the bash tool — one call per pod is acceptable."

Suggested:

    - "Must answer using a dedicated Kubernetes tool such as
      kubernetes_tabular_query or kubernetes_jq_query, OR by issuing a single
      batched kubectl query (e.g., one 'kubectl get pods ... -o ...' call)
      through the bash tool."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@tests/llm/fixtures/test_ask_holmes/260_k8s_multi_pod_node_lookup_prefer_dedicated_tools/test_case.yaml`
around lines 25 - 28, Update the acceptance text that currently permits "one
call per pod" so it instead requires either a dedicated Kubernetes tool
(kubernetes_tabular_query or kubernetes_jq_query) OR batched kubectl calls
(e.g., a single "kubectl get pods ..." or a single "kubectl get pod <multiple>"
style command) — remove the allowance for one-per-pod bash calls; also add
include_tool_calls: true to this test case to ensure tool execution is
validated; look for the string containing "Must answer using a dedicated
Kubernetes tool..." and the test case metadata to apply these changes.
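As a sketch of the batched shape this review asks for: one listing is produced and read directly, rather than looping with one `kubectl get pod` call per pod. The pod data below is fabricated, standing in for `kubectl get pods -A -o wide` output, since no cluster is assumed here:

```shell
# Fabricated stand-in for one batched `kubectl get pods -A -o wide` call;
# real output would include more columns (READY, RESTARTS, AGE, ...).
pods_wide() {
  printf '%s\n' \
    'ns-a  pod-1  Running  node-1' \
    'ns-b  pod-2  Running  node-2' \
    'ns-b  pod-3  Running  node-1'
}

# Read nodeName per pod from the single listing -- the batched alternative
# to a for/while loop issuing one lookup per pod.
node_per_pod() {
  pods_wide | awk '{print $1 "/" $2 " -> " $4}'
}

node_per_pod
```

A dedicated tool such as kubernetes_tabular_query or kubernetes_jq_query would yield the same per-pod nodeName mapping in one tool call, which is what the eval's acceptance criteria are meant to enforce.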
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@tests/llm/fixtures/test_ask_holmes/259_k8s_event_grouping_prefer_dedicated_tools/test_case.yaml`:
- Around line 20-23: Update the fixture's tool-usage requirement so it no longer
permits the bash fallback: locate the string that currently reads "Must answer
using the dedicated Kubernetes tools (one or more of kubernetes_jq_query,
kubernetes_tabular_query, kubernetes_count, or kubectl_find_resource). Counting
via simple per-namespace kubectl get calls through the bash tool is acceptable."
and remove the trailing allowance clause ("Counting via simple per-namespace
kubectl get calls through the bash tool is acceptable.") so the rule enforces
only the dedicated Kubernetes tools.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: e6c22c9f-44ff-4cee-ae6d-62391cbd7d15
📒 Files selected for processing (3)
- pyproject.toml
- tests/llm/fixtures/test_ask_holmes/259_k8s_event_grouping_prefer_dedicated_tools/test_case.yaml
- tests/llm/fixtures/test_ask_holmes/260_k8s_multi_pod_node_lookup_prefer_dedicated_tools/test_case.yaml
✅ Files skipped from review due to trivial changes (1)
- pyproject.toml
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/llm/fixtures/test_ask_holmes/260_k8s_multi_pod_node_lookup_prefer_dedicated_tools/test_case.yaml
| - "Must answer using the dedicated Kubernetes tools (one or more of | ||
| kubernetes_jq_query, kubernetes_tabular_query, kubernetes_count, or | ||
| kubectl_find_resource). Counting via simple per-namespace kubectl get | ||
| calls through the bash tool is acceptable." |
Tighten tool-usage criteria to disallow bash counting fallback.
This line makes the eval permissive in a way that can bypass the test’s core intent (dedicated Kubernetes tools). Please remove the “bash tool is acceptable” allowance so the fixture consistently enforces dedicated-tool usage.
Suggested diff
- - "Must answer using the dedicated Kubernetes tools (one or more of
- kubernetes_jq_query, kubernetes_tabular_query, kubernetes_count, or
- kubectl_find_resource). Counting via simple per-namespace kubectl get
- calls through the bash tool is acceptable."
+ - "Must answer using dedicated Kubernetes tools (one or more of
+ kubernetes_jq_query, kubernetes_tabular_query, kubernetes_count, or
+ kubectl_find_resource), and must not use bash for counting/grouping."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@tests/llm/fixtures/test_ask_holmes/259_k8s_event_grouping_prefer_dedicated_tools/test_case.yaml`
around lines 20 - 23, Update the fixture's tool-usage requirement so it no
longer permits the bash fallback: locate the string that currently reads "Must
answer using the dedicated Kubernetes tools (one or more of kubernetes_jq_query,
kubernetes_tabular_query, kubernetes_count, or kubectl_find_resource). Counting
via simple per-namespace kubectl get calls through the bash tool is acceptable."
and remove the trailing allowance clause ("Counting via simple per-namespace
kubectl get calls through the bash tool is acceptable.") so the rule enforces
only the dedicated Kubernetes tools.
The original 259 used pods with invalid image references to produce ImagePullBackOff/ErrImagePull status, but that requires a working container runtime to even reach the image-pull stage. In constrained CI environments (nested containers, cgroup v1) pods get stuck on FailedCreatePodSandBox before the kubelet ever attempts an image pull, making the setup unverifiable.

The redesigned 259 deploys 15 pods across 4 namespaces with a mix of tier labels (worker plus noise tiers like edge/messaging/gateway/batch/observability), and asks Holmes to count pods with tier=worker in each namespace. Labels are queryable from metadata.labels as soon as the Pod object is created, so the eval is independent of whether the pods reach Running.

Tests the same anti-pattern (bash | awk | sort | uniq -c grouping) with the same expected_output structure (3/1/2/4, control wins). Verified locally against opus-4.6 via OpenRouter: 10/10 pass.

Signed-off-by: Claude <noreply@anthropic.com>
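The grouped count the redesigned fixture expects can be sketched as a single aggregation pass. This is a minimal Python sketch; the namespace names below are illustrative stand-ins (the real fixture's namespaces may differ), chosen so the worker counts match the expected 3/1/2/4:

```python
from collections import Counter

# Illustrative (namespace, tier) pairs shaped like the redesigned 259
# fixture: 15 pods across 4 namespaces, tier=worker counts of 3/1/2/4.
pods = (
    [("app-259-frontend", "worker")] * 3 + [("app-259-frontend", "edge")]
    + [("app-259-backend", "worker")] + [("app-259-backend", "messaging")] * 2
    + [("app-259-data", "worker")] * 2 + [("app-259-data", "batch")]
    + [("app-259-control", "worker")] * 4 + [("app-259-control", "observability")]
)

# One grouped count replaces the `| awk | sort | uniq -c` bash pipeline:
workers_per_ns = Counter(ns for ns, tier in pods if tier == "worker")
```

This mirrors what a single kubernetes_count-style call returns, with no shell aggregation and no approval prompt.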
Adds evals 261-264 that probe opus-4.6 with vague, customer-style
questions about resource distribution and grouping across multiple
namespaces. They complement 259/260 by covering question phrasings
that historically slip past the existing assertion set.
- 261 (vague overview): "give me an overview of what's running across
these 4 namespaces"
- 262 (label distribution): "what distinct tier values are in use,
with counts per value"
- 263 (namespace summary): "per-namespace breakdown of pods by tier
label"
- 264 (find outliers): "find every pod whose tier is something other
than worker"
All four use the same 15-pod template (worker pods plus tier-noise
pods) in isolated app-{N}-* namespaces, so they're parallel-safe.
Locally on opus-4.6 these pass ~95% of the time (5/5, 4/5, 5/5, 5/5
on a clean run). The intermittent failures show opus-4.6 occasionally
falls back to bash pipelines like:
kubectl get pods -n X --show-labels; echo "==="; kubectl get pods -n Y --show-labels ...
kubectl get pods -A -o json | jq | sort -u
The dedicated tools (kubernetes_jq_query, kubernetes_count) handle all
four scenarios cleanly. The regression evals lock in that preference
and will catch future model changes that re-introduce the anti-pattern.
include_tool_calls is set so the judge inspects tool selection, not
just the final answer. Bash with simple per-namespace 'kubectl get'
calls (one per namespace, semicolon-chained allowed) is acceptable;
bash with aggregating pipelines (awk/sort/uniq/wc/cut/grep -c) is not.
Signed-off-by: Claude <noreply@anthropic.com>
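The label-distribution case (262) illustrates what the acceptable path computes: the dedicated-tool equivalent of the `kubectl get pods -A -o json | jq | sort -u` pipeline above is one aggregation over the query result. A minimal Python sketch, with hypothetical pod records standing in for a single kubernetes_jq_query response:

```python
from collections import Counter

# Hypothetical pod label data, standing in for one dedicated-tool response
# (namespaces and labels are illustrative, not from the real fixtures).
pods = [
    {"namespace": "app-262-frontend", "labels": {"tier": "worker"}},
    {"namespace": "app-262-frontend", "labels": {"tier": "edge"}},
    {"namespace": "app-262-backend",  "labels": {"tier": "worker"}},
    {"namespace": "app-262-backend",  "labels": {"tier": "messaging"}},
    {"namespace": "app-262-data",     "labels": {}},  # unlabeled pod, skipped
]

# Distinct tier values with counts in one pass, no sort -u / uniq -c needed:
tier_counts = Counter(
    p["labels"]["tier"] for p in pods if "tier" in p["labels"]
)
```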
The previous bash instructions used `kubectl get pods | grep Running | head -5`
as the canonical "pipe is auto-approved" example. That implicitly endorsed
piping kubectl output through awk/sort/uniq -c/wc to do grouping and counting,
which is exactly the anti-pattern that produces the customer pain reported in
this branch: instead of a single kubernetes_count / kubernetes_jq_query /
kubernetes_tabular_query call, the LLM stitched together shell pipelines that
either required approval (loops, command substitution) or produced noisy
investigation traces.
This change adds a "Tool Selection — Prefer Dedicated Tools" section at the
top of the bash instructions that:
* Tells the model to check for a dedicated K8s tool before reaching for bash.
* Enumerates the four most common anti-patterns (group-by-uniq, wc -l counts,
distinct-via-sort-u, per-resource for-loops) and points each one at the
dedicated tool that replaces it.
* Carves out the legitimate uses of bash — one-off `kubectl get` with a
specific flag, non-kubectl invocations — so the guidance does not over-rotate.
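The four anti-pattern-to-tool replacements above can be sketched as single-pass aggregations. The pod records and field names below are hypothetical; each comment names the dedicated tool the new guidance points at:

```python
from collections import Counter

# Toy pod records (fields illustrative, not real cluster output).
pods = [
    {"name": "a", "namespace": "ns1", "tier": "worker", "node": "n1"},
    {"name": "b", "namespace": "ns1", "tier": "edge",   "node": "n2"},
    {"name": "c", "namespace": "ns2", "tier": "worker", "node": "n1"},
]

# 1. group-by-uniq      -> kubernetes_count: pods per namespace
per_ns = Counter(p["namespace"] for p in pods)
# 2. wc -l counts       -> kubernetes_count: total matching pods
total = len(pods)
# 3. distinct-via-sort-u -> kubernetes_jq_query: distinct tier values
tiers = sorted({p["tier"] for p in pods})
# 4. per-resource for-loop -> kubernetes_tabular_query: one row per pod
node_by_pod = {p["name"]: p["node"] for p in pods}
```

Each of the four results comes from a single pass over one query response, which is exactly why the dedicated tools need neither shell pipelines nor approval prompts.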
Two smaller edits below:
* The "Pipes" example is changed from `kubectl get pods | grep ...` to a
log-grep example, with a one-line reminder pointing at the new section.
* The "go ahead and use loops" line is replaced with a check-first prompt
that mentions the cost of approval prompts.
This is prompt guidance, not a behavior gate — the model can still pipe kubectl
when it judges that the right choice (e.g. when no dedicated tool fits). The
goal is to flip the default for grouping/counting/joining work back to the
dedicated tools.
Pairs with the 259/260/261/262/263/264 evals on this branch which lock in the
preference. Verification against opus-4.6 with full iteration counts is
pending — the OpenRouter weekly credit limit was hit during testing.
Signed-off-by: Claude <noreply@anthropic.com>
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@tests/llm/fixtures/test_ask_holmes/261_k8s_vague_overview_no_bash_aggregation/manifest.yaml`:
- Around line 22-200: All Pod resources in this manifest (e.g.,
web-renderer-v6q8, asset-cdn-v6q8, session-router-v6q8, etc.) lack an explicit
non-root security context; add a pod-level securityContext with runAsNonRoot:
true and runAsUser: 1000 and ensure each container (name: app) sets
securityContext.allowPrivilegeEscalation: false (or container-level
runAsNonRoot/runAsUser if you prefer container scope) so every Pod/spec and its
container securityContext explicitly enforce non-root and no privilege
escalation across all Pod definitions.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: dc321014-7f70-4636-9eb1-d60c115e3465
📒 Files selected for processing (9)
- holmes/plugins/toolsets/bash/bash_instructions.jinja2
- tests/llm/fixtures/test_ask_holmes/261_k8s_vague_overview_no_bash_aggregation/manifest.yaml
- tests/llm/fixtures/test_ask_holmes/261_k8s_vague_overview_no_bash_aggregation/test_case.yaml
- tests/llm/fixtures/test_ask_holmes/262_k8s_label_distribution_no_bash_aggregation/manifest.yaml
- tests/llm/fixtures/test_ask_holmes/262_k8s_label_distribution_no_bash_aggregation/test_case.yaml
- tests/llm/fixtures/test_ask_holmes/263_k8s_namespace_summary_no_bash_aggregation/manifest.yaml
- tests/llm/fixtures/test_ask_holmes/263_k8s_namespace_summary_no_bash_aggregation/test_case.yaml
- tests/llm/fixtures/test_ask_holmes/264_k8s_find_outliers_no_bash_aggregation/manifest.yaml
- tests/llm/fixtures/test_ask_holmes/264_k8s_find_outliers_no_bash_aggregation/test_case.yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-renderer-v6q8
  namespace: app-261-frontend
  labels:
    app: web-renderer
    tier: worker
spec:
  containers:
    - {name: app, image: busybox:1.36, command: ["sleep", "3600"]}
---
apiVersion: v1
kind: Pod
metadata:
  name: asset-cdn-v6q8
  namespace: app-261-frontend
  labels:
    app: asset-cdn
    tier: worker
spec:
  containers:
    - {name: app, image: busybox:1.36, command: ["sleep", "3600"]}
---
apiVersion: v1
kind: Pod
metadata:
  name: session-router-v6q8
  namespace: app-261-frontend
  labels:
    app: session-router
    tier: worker
spec:
  containers:
    - {name: app, image: busybox:1.36, command: ["sleep", "3600"]}
---
apiVersion: v1
kind: Pod
metadata:
  name: edge-cache-v6q8
  namespace: app-261-frontend
  labels:
    app: edge-cache
    tier: edge
spec:
  containers:
    - {name: app, image: busybox:1.36, command: ["sleep", "3600"]}
---
apiVersion: v1
kind: Pod
metadata:
  name: order-processor-v6q8
  namespace: app-261-backend
  labels:
    app: order-processor
    tier: worker
spec:
  containers:
    - {name: app, image: busybox:1.36, command: ["sleep", "3600"]}
---
apiVersion: v1
kind: Pod
metadata:
  name: queue-broker-v6q8
  namespace: app-261-backend
  labels:
    app: queue-broker
    tier: messaging
spec:
  containers:
    - {name: app, image: busybox:1.36, command: ["sleep", "3600"]}
---
apiVersion: v1
kind: Pod
metadata:
  name: api-gateway-v6q8
  namespace: app-261-backend
  labels:
    app: api-gateway
    tier: gateway
spec:
  containers:
    - {name: app, image: busybox:1.36, command: ["sleep", "3600"]}
---
apiVersion: v1
kind: Pod
metadata:
  name: stream-ingester-v6q8
  namespace: app-261-data
  labels:
    app: stream-ingester
    tier: worker
spec:
  containers:
    - {name: app, image: busybox:1.36, command: ["sleep", "3600"]}
---
apiVersion: v1
kind: Pod
metadata:
  name: feature-compactor-v6q8
  namespace: app-261-data
  labels:
    app: feature-compactor
    tier: worker
spec:
  containers:
    - {name: app, image: busybox:1.36, command: ["sleep", "3600"]}
---
apiVersion: v1
kind: Pod
metadata:
  name: warehouse-loader-v6q8
  namespace: app-261-data
  labels:
    app: warehouse-loader
    tier: batch
spec:
  containers:
    - {name: app, image: busybox:1.36, command: ["sleep", "3600"]}
---
apiVersion: v1
kind: Pod
metadata:
  name: scheduler-shim-v6q8
  namespace: app-261-control
  labels:
    app: scheduler-shim
    tier: worker
spec:
  containers:
    - {name: app, image: busybox:1.36, command: ["sleep", "3600"]}
---
apiVersion: v1
kind: Pod
metadata:
  name: policy-evaluator-v6q8
  namespace: app-261-control
  labels:
    app: policy-evaluator
    tier: worker
spec:
  containers:
    - {name: app, image: busybox:1.36, command: ["sleep", "3600"]}
---
apiVersion: v1
kind: Pod
metadata:
  name: workload-binder-v6q8
  namespace: app-261-control
  labels:
    app: workload-binder
    tier: worker
spec:
  containers:
    - {name: app, image: busybox:1.36, command: ["sleep", "3600"]}
---
apiVersion: v1
kind: Pod
metadata:
  name: lease-coordinator-v6q8
  namespace: app-261-control
  labels:
    app: lease-coordinator
    tier: worker
spec:
  containers:
    - {name: app, image: busybox:1.36, command: ["sleep", "3600"]}
---
apiVersion: v1
kind: Pod
metadata:
  name: audit-logger-v6q8
  namespace: app-261-control
  labels:
    app: audit-logger
    tier: observability
spec:
  containers:
    - {name: app, image: busybox:1.36, command: ["sleep", "3600"]}
Harden pod security context to avoid policy-dependent fixture failures.
These pods currently rely on default security settings; on clusters enforcing Pod Security standards, they can be rejected (runAsNonRoot / allowPrivilegeEscalation). Please add explicit non-root security context across all fixture pods.
Suggested pattern to apply to each Pod spec
 spec:
+  securityContext:
+    runAsNonRoot: true
+    seccompProfile:
+      type: RuntimeDefault
   containers:
-    - {name: app, image: busybox:1.36, command: ["sleep", "3600"]}
+    - name: app
+      image: busybox:1.36
+      command: ["sleep", "3600"]
+      securityContext:
+        allowPrivilegeEscalation: false
+        capabilities:
+          drop: ["ALL"]
🧰 Tools
🪛 Checkov (3.2.528)
[medium] 22-33: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[medium] 22-33: Minimize the admission of root containers
(CKV_K8S_23)
[medium] 34-45: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[medium] 34-45: Minimize the admission of root containers
(CKV_K8S_23)
[medium] 46-57: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[medium] 46-57: Minimize the admission of root containers
(CKV_K8S_23)
[medium] 58-69: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[medium] 58-69: Minimize the admission of root containers
(CKV_K8S_23)
[medium] 70-81: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[medium] 70-81: Minimize the admission of root containers
(CKV_K8S_23)
[medium] 82-93: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[medium] 82-93: Minimize the admission of root containers
(CKV_K8S_23)
[medium] 94-105: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[medium] 94-105: Minimize the admission of root containers
(CKV_K8S_23)
[medium] 106-117: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[medium] 106-117: Minimize the admission of root containers
(CKV_K8S_23)
[medium] 118-129: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[medium] 118-129: Minimize the admission of root containers
(CKV_K8S_23)
[medium] 130-141: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[medium] 130-141: Minimize the admission of root containers
(CKV_K8S_23)
[medium] 142-153: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[medium] 142-153: Minimize the admission of root containers
(CKV_K8S_23)
[medium] 154-165: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[medium] 154-165: Minimize the admission of root containers
(CKV_K8S_23)
[medium] 166-177: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[medium] 166-177: Minimize the admission of root containers
(CKV_K8S_23)
[medium] 178-189: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[medium] 178-189: Minimize the admission of root containers
(CKV_K8S_23)
[medium] 190-200: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[medium] 190-200: Minimize the admission of root containers
(CKV_K8S_23)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@tests/llm/fixtures/test_ask_holmes/261_k8s_vague_overview_no_bash_aggregation/manifest.yaml`
around lines 22 - 200, All Pod resources in this manifest (e.g.,
web-renderer-v6q8, asset-cdn-v6q8, session-router-v6q8, etc.) lack an explicit
non-root security context; add a pod-level securityContext with runAsNonRoot:
true and runAsUser: 1000 and ensure each container (name: app) sets
securityContext.allowPrivilegeEscalation: false (or container-level
runAsNonRoot/runAsUser if you prefer container scope) so every Pod/spec and its
container securityContext explicitly enforce non-root and no privilege
escalation across all Pod definitions.
Summary
Add two new LLM evaluation test fixtures to validate that Holmes uses dedicated Kubernetes tools instead of shell pipelines for common query patterns. These tests ensure the LLM learns to prefer efficient, batched queries over iterative bash loops.
Changes
Test Case 1: Kubernetes Event Grouping (259_k8s_event_grouping_prefer_dedicated_tools)
- Must use the dedicated Kubernetes tools (kubernetes_jq_query, kubernetes_tabular_query, kubernetes_count, or kubectl_find_resource)
- Must not use awk, sort, uniq -c, or wc for aggregation
- Scenario: pods stuck with ImagePullBackOff errors

Test Case 2: Multi-Pod Node Lookup (260_k8s_multi_pod_node_lookup_prefer_dedicated_tools)
- Must not use for/while loops with compound statements

Implementation Details
Both test fixtures follow the established LLM eval pattern:
- manifest.yaml: Kubernetes resource definitions for test scenario
- test_case.yaml: User prompt, expected outputs, and setup/teardown scripts
- include_tool_calls: true: Enables validation of which tools Holmes actually calls

These tests validate that Holmes learns to prefer efficient, dedicated Kubernetes tools over shell scripting patterns that require user approval and are slower/less reliable.
https://claude.ai/code/session_01ShMKrLaC9Dddn41ZJM6CWW
Summary by CodeRabbit