Add LLM eval tests for Kubernetes tool usage patterns #2048

Open

aantn wants to merge 5 commits into master from claude/fix-k8s-approval-issue-xNNVP

Conversation

aantn (Collaborator) commented May 15, 2026

Summary

Add two new LLM evaluation test fixtures to validate that Holmes uses dedicated Kubernetes tools instead of shell pipelines for common query patterns. These tests verify that the LLM prefers efficient, batched queries over iterative bash loops.

Changes

Test Case 1: Kubernetes Event Grouping (259_k8s_event_grouping_prefer_dedicated_tools)

  • Scenario: Count pods stuck on image pull failures across 4 namespaces (3, 1, 2, and 4 pods respectively)
  • Validation:
    • Holmes must report correct per-namespace counts and identify the namespace with the most failures
    • Must use dedicated Kubernetes tools (kubernetes_jq_query, kubernetes_tabular_query, kubernetes_count, or kubectl_find_resource)
    • Must NOT use bash pipelines with awk, sort, uniq -c, or wc for aggregation (see the sketch after this list)
  • Setup: Creates 10 pods across 4 namespaces with invalid image registries to trigger ImagePullBackOff errors
  • Pre-test check: Verification ensures all pods reach the image-pull-failure state before Holmes is queried
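
For illustration, a minimal sketch of the boundary this test draws; the commands are representative shapes rather than the exact strings Holmes emits, and the namespace name is an illustrative placeholder in the app-259-* pattern:

```bash
# Anti-pattern the eval fails: aggregating kubectl output with a shell pipeline
kubectl get pods -A | grep ImagePullBackOff | awk '{print $1}' | sort | uniq -c

# Acceptable shape: a plain per-namespace query whose output is read directly,
# or (preferably) a single call to a dedicated tool such as kubernetes_count
kubectl get pods -n app-259-frontend   # read the STATUS column as-is
```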

Test Case 2: Multi-Pod Node Lookup (260_k8s_multi_pod_node_lookup_prefer_dedicated_tools)

  • Scenario: Look up which node each of 5 pods is running on across 3 namespaces
  • Validation:
    • Holmes must report the nodeName for every pod
    • Must use batched Kubernetes queries rather than shell loops iterating per pod
    • Must NOT use bash for/while loops with compound statements (see the sketch after this list)
  • Setup: Creates 5 pods across 3 namespaces using the busybox image
  • Pre-test check: Verification ensures all pods are scheduled before Holmes is queried
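
A comparable sketch for this case, with placeholder namespace names (the real fixtures use three app-260-* namespaces):

```bash
# Anti-pattern the eval fails: one kubectl call per pod inside a shell loop
for ns in app-260-a app-260-b app-260-c; do   # placeholder namespace names
  kubectl get pods -n "$ns" -o wide
done

# Batched alternative: a single call that returns every pod's node at once
kubectl get pods -A -o custom-columns=NAMESPACE:.metadata.namespace,POD:.metadata.name,NODE:.spec.nodeName
```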

Implementation Details

Both test fixtures follow the established LLM eval pattern:

  • manifest.yaml: Kubernetes resource definitions for test scenario
  • test_case.yaml: User prompt, expected outputs, and setup/teardown scripts (a trimmed sketch follows this list)
  • include_tool_calls: true: Enables validation of which tools Holmes actually calls
  • Pre-test setup waits for pods to reach desired state (image pull failures or scheduled)
  • Post-test cleanup removes all created namespaces
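
A trimmed, hypothetical test_case.yaml following this pattern (the field names match the snippets quoted in the review comments below; the values and script names are illustrative):

```yaml
user_prompt: >
  Across the app-259-* namespaces, how many pods are stuck failing to pull
  their image in each namespace, and which namespace has the most?
expected_output:
  - "Must report the correct per-namespace counts (3, 1, 2, and 4)"
  - "Must NOT aggregate kubectl output with awk, sort, uniq -c, or wc"
include_tool_calls: true
before_test: ./wait_for_pods.sh       # illustrative: poll until pods reach the expected state
after_test: ./delete_namespaces.sh    # illustrative: remove the app-259-* namespaces
```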

These tests validate that Holmes learns to prefer efficient, dedicated Kubernetes tools over shell scripting patterns that require user approval and are slower/less reliable.

https://claude.ai/code/session_01ShMKrLaC9Dddn41ZJM6CWW

Summary by CodeRabbit

  • Tests
    • Added multiple new Kubernetes LLM test fixtures that validate pod counts, label distributions, per-pod node lookups, namespace summaries, outlier detection, and cross-namespace comparisons. All enforce using dedicated Kubernetes queries (no shell aggregation/loops) and include setup/teardown steps.
  • Chores
    • Added a pytest marker for tests that must avoid bash/shell loops and updated bash tool guidance to prefer dedicated Kubernetes queries or batched requests.

Review Change Stack

Two failing evals capture the anti-pattern where the LLM reaches for
bash with awk/sort/uniq/for-loops to aggregate Kubernetes data when
dedicated tools (kubernetes_jq_query, kubernetes_tabular_query,
kubernetes_count) would do the same job in one call without triggering
approval prompts.

259_k8s_event_grouping_prefer_dedicated_tools: pods stuck on image
pull failures across 4 namespaces; Holmes must group-by-namespace
without composing a shell aggregation pipeline.

260_k8s_multi_pod_node_lookup_prefer_dedicated_tools: 5 pods across 3
namespaces; Holmes must report each pod's nodeName without a
for/while shell loop.

Both are tagged 'hard' since the LLM currently fails them by choosing
the bash anti-pattern; they will move to easy/regression once the
fix lands.

Signed-off-by: Claude <noreply@anthropic.com>
@claude claude Bot left a comment

Claude Code Review

This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.

Tip: disable this comment in your organization's Code Review settings.

github-actions Bot (Contributor) commented May 15, 2026

📂 Previous Runs

📜 #4 · Run @ __f39a8f3__ (#25936448852) — May 15, 19:18 UTC

✅ Results of HolmesGPT evals

Automatically triggered by commit f39a8f3 on branch claude/fix-k8s-approval-issue-xNNVP (labels: evals-tag-kubernetes-no-bash-approval)

View workflow logs

Results of HolmesGPT evals

  • ask_holmes: 13/13 test cases were successful, 0 regressions
| Status | Test case | Time | Turns | Tools | Cost | Total tokens | Input | Max input | Output | Max output | Cached | Non-cached | Reasoning | Compactions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 📄 | 09_crashpod | 39.0s | 5 | 11 | $0.2891 | 110,174 | 107,720 | 25,030 | 2,454 | 856 | 81,875 | 25,845 | 251 | |
| 📄 | 101_loki_historical_logs_pod_deleted | 96.8s | 9 | 20 | $0.5071 | 234,589 | 228,430 | 31,774 | 6,159 | 952 | 194,416 | 34,014 | 1,769 | |
| 📄 | 112_find_pvcs_by_uuid | 19.0s | 3 | 4 | $0.2033 | 61,055 | 59,927 | 21,825 | 1,128 | 583 | 37,844 | 22,083 | 253 | |
| 📄 | 12_job_crashing | 43.6s | 5 | 14 | $0.3416 | 129,979 | 127,330 | 29,910 | 2,649 | 736 | 95,492 | 31,838 | 213 | |
| 📄 | 176_network_policy_blocking_traffic_no_skills | 57.1s | 7 | 17 | $0.3947 | 173,805 | 170,169 | 29,001 | 3,636 | 939 | 137,593 | 32,576 | 673 | |
| 📄 | 227_count_configmaps_per_namespace[0] | 20.0s | 4 | 9 | $0.2050 | 76,841 | 75,716 | 20,688 | 1,125 | 591 | 54,709 | 21,007 | 53 | |
| 📄 | 243_pod_names_contain_service | 33.1s | 4 | 10 | $0.2564 | 83,662 | 81,436 | 23,162 | 2,226 | 919 | 57,495 | 23,941 | 265 | |
| 📄 | 24_misconfigured_pvc | 45.3s | 6 | 15 | $0.3210 | 134,296 | 131,256 | 24,968 | 3,040 | 1,011 | 104,813 | 26,443 | 256 | |
| 📄 | 259_k8s_event_grouping_prefer_dedicated_tools | 16.1s | 3 | 6 | $0.1823 | 56,619 | 55,611 | 19,604 | 1,008 | 654 | 36,003 | 19,608 | 65 | |
| 📄 | 260_k8s_multi_pod_node_lookup_prefer_dedicated_tools | 18.2s | 3 | 7 | $0.1968 | 57,959 | 56,646 | 20,185 | 1,313 | 867 | 36,092 | 20,554 | 48 | |
| 📄 | 43_current_datetime_from_prompt | 3.7s | 1 | | $0.1200 | 17,069 | 16,940 | 16,940 | 129 | 129 | 0 | 16,940 | 86 | |
| 📄 | 51_logs_summarize_errors | 21.5s | 4 | 5 | $0.2083 | 77,960 | 76,812 | 21,261 | 1,148 | 423 | 55,546 | 21,266 | 44 | |
| 📄 | 61_exact_match_counting | 9.9s | 3 | 3 | $0.1522 | 52,858 | 52,495 | 17,920 | 363 | 216 | 34,571 | 17,924 | 32 | |
| | Total | 32.6s avg | 4.4 avg | 10.1 avg | $3.3778 | 1,266,866 | 1,240,488 | 31,774 | 26,378 | 1,011 | 926,449 | 314,039 | 4,008 | |

Benchmark comparison unavailable: No ci-benchmark experiments found

Benchmark Comparison Details

Baseline: latest ci-benchmark experiment on master

Status: No ci-benchmark experiments found

Comparison indicators:

  • ±0% — diff under 10% (within noise threshold)
  • ↑N%/↓N% — diff 10-25%
  • ↑N%/↓N% — diff over 25% (significant)
📜 #3 · Run @ __900fdf0__ (#25927341250) — May 15, 15:58 UTC

✅ Results of HolmesGPT evals

Automatically triggered by commit 900fdf0 on branch claude/fix-k8s-approval-issue-xNNVP (labels: evals-tag-kubernetes-no-bash-approval)

View workflow logs

Results of HolmesGPT evals

  • ask_holmes: 13/13 test cases were successful, 0 regressions
| Status | Test case | Time | Turns | Tools | Cost | Total tokens | Input | Max input | Output | Max output | Cached | Non-cached | Reasoning | Compactions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | 09_crashpod | 38.7s | 5 | 11 | $0.2908 | 109,274 | 106,702 | 24,472 | 2,572 | 955 | 80,905 | 25,797 | 418 | |
| | 101_loki_historical_logs_pod_deleted | 77.1s | 7 | 17 | $0.4171 | 171,621 | 166,920 | 29,206 | 4,701 | 920 | 135,429 | 31,491 | 886 | |
| | 112_find_pvcs_by_uuid | 18.3s | 3 | 4 | $0.2034 | 61,083 | 59,951 | 21,822 | 1,132 | 614 | 37,873 | 22,078 | 254 | |
| | 12_job_crashing | 54.6s | 8 | 17 | $0.3873 | 197,692 | 194,503 | 28,454 | 3,189 | 648 | 163,495 | 31,008 | 276 | |
| | 176_network_policy_blocking_traffic_no_skills | 54.8s | 5 | 16 | $0.3228 | 115,880 | 112,648 | 26,750 | 3,232 | 942 | 85,342 | 27,306 | 543 | |
| | 227_count_configmaps_per_namespace[0] | 21.0s | 4 | 9 | $0.2048 | 76,824 | 75,702 | 20,683 | 1,122 | 591 | 54,709 | 20,993 | 53 | |
| | 243_pod_names_contain_service | 42.8s | 6 | 13 | $0.3129 | 136,103 | 133,415 | 25,472 | 2,688 | 665 | 107,037 | 26,378 | 301 | |
| | 24_misconfigured_pvc | 39.7s | 5 | 12 | $0.2786 | 106,500 | 104,088 | 23,364 | 2,412 | 802 | 79,256 | 24,832 | 354 | |
| | 259_k8s_event_grouping_prefer_dedicated_tools | 19.7s | 3 | 6 | $0.2019 | 58,202 | 56,803 | 20,202 | 1,399 | 975 | 35,710 | 21,093 | 53 | |
| | 260_k8s_multi_pod_node_lookup_prefer_dedicated_tools | 14.0s | 3 | 3 | $0.1689 | 54,893 | 54,142 | 18,721 | 751 | 461 | 35,417 | 18,725 | 47 | |
| | 43_current_datetime_from_prompt | 3.9s | 1 | | $0.1199 | 17,066 | 16,940 | 16,940 | 126 | 126 | 0 | 16,940 | 83 | |
| | 51_logs_summarize_errors | 22.8s | 4 | 5 | $0.2074 | 77,649 | 76,490 | 21,099 | 1,159 | 434 | 55,386 | 21,104 | 41 | |
| | 61_exact_match_counting | 11.1s | 3 | 3 | $0.1522 | 52,851 | 52,488 | 17,917 | 363 | 216 | 34,567 | 17,921 | 32 | |
| | Total | 32.2s avg | 4.4 avg | 9.7 avg | $3.2680 | 1,235,638 | 1,210,792 | 29,206 | 24,846 | 975 | 905,126 | 305,666 | 3,341 | |

Benchmark comparison unavailable: No ci-benchmark experiments found

Benchmark Comparison Details

Baseline: latest ci-benchmark experiment on master

Status: No ci-benchmark experiments found

Comparison indicators:

  • ±0% — diff under 10% (within noise threshold)
  • ↑N%/↓N% — diff 10-25%
  • ↑N%/↓N% — diff over 25% (significant)
📜 #2 · Run @ __900fdf0__ (#25926210430) — May 15, 15:41 UTC

✅ Results of HolmesGPT evals

Automatically triggered by commit 900fdf0 on branch claude/fix-k8s-approval-issue-xNNVP

View workflow logs

Results of HolmesGPT evals

  • ask_holmes: 11/11 test cases were successful, 0 regressions
| Status | Test case | Time | Turns | Tools | Cost | Total tokens | Input | Max input | Output | Max output | Cached | Non-cached | Reasoning | Compactions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | 09_crashpod | 36.0s | 5 | 10 | $0.2660 | 105,054 | 102,878 | 23,198 | 2,176 | 936 | 79,101 | 23,777 | 251 | |
| | 101_loki_historical_logs_pod_deleted | 67.3s | 7 | 16 | $0.3831 | 165,801 | 161,916 | 27,111 | 3,885 | 925 | 131,634 | 30,282 | 686 | |
| | 112_find_pvcs_by_uuid | 20.5s | 3 | 4 | $0.2052 | 61,204 | 60,021 | 21,885 | 1,183 | 598 | 37,880 | 22,141 | 244 | |
| | 12_job_crashing | 45.1s | 6 | 14 | $0.3767 | 160,352 | 157,693 | 30,119 | 2,659 | 1,018 | 122,248 | 35,445 | 221 | |
| | 176_network_policy_blocking_traffic_no_skills | 54.6s | 6 | 18 | $0.3811 | 146,709 | 142,867 | 28,734 | 3,842 | 1,021 | 111,075 | 31,792 | 715 | |
| | 227_count_configmaps_per_namespace[0] | 18.7s | 4 | 9 | $0.2095 | 76,827 | 75,702 | 20,686 | 1,125 | 591 | 53,781 | 21,921 | 53 | |
| | 243_pod_names_contain_service | 44.8s | 6 | 14 | $0.3193 | 137,494 | 134,657 | 25,813 | 2,837 | 700 | 108,048 | 26,609 | 322 | |
| | 24_misconfigured_pvc | 45.7s | 7 | 16 | $0.3329 | 156,870 | 153,920 | 24,847 | 2,950 | 1,015 | 127,058 | 26,862 | 309 | |
| | 43_current_datetime_from_prompt | 4.7s | 1 | | $0.1200 | 17,067 | 16,940 | 16,940 | 127 | 127 | 0 | 16,940 | 84 | |
| | 51_logs_summarize_errors | 22.8s | 4 | 5 | $0.2086 | 77,897 | 76,729 | 21,221 | 1,168 | 443 | 55,503 | 21,226 | 44 | |
| | 61_exact_match_counting | 11.0s | 3 | 3 | $0.1522 | 52,859 | 52,496 | 17,919 | 363 | 216 | 34,573 | 17,923 | 32 | |
| | Total | 33.8s avg | 4.7 avg | 10.9 avg | $2.9546 | 1,158,134 | 1,135,819 | 30,119 | 22,315 | 1,021 | 860,901 | 274,918 | 2,961 | |

Benchmark comparison unavailable: No ci-benchmark experiments found

Benchmark Comparison Details

Baseline: latest ci-benchmark experiment on master

Status: No ci-benchmark experiments found

Comparison indicators:

  • ±0% — diff under 10% (within noise threshold)
  • ↑N%/↓N% — diff 10-25%
  • ↑N%/↓N% — diff over 25% (significant)
⚠️ 1 older run truncated

Older runs were omitted to stay under GitHub's 64KB comment size limit.


⚠️ Eval Results (with failures)

Automatically triggered by commit 9b70f66 on branch claude/fix-k8s-approval-issue-xNNVP (labels: evals-tag-kubernetes-no-bash-approval)

View workflow logs

Results of HolmesGPT evals

  • ask_holmes: 16/17 test cases were successful, 1 regression
| Status | Test case | Time | Turns | Tools | Cost | Total tokens | Input | Max input | Output | Max output | Cached | Non-cached | Reasoning | Compactions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 📄 | 09_crashpod | 34.7s | 4 | 10 | $0.2735 | 90,638 | 88,403 | 25,398 | 2,235 | 929 | 62,451 | 25,952 | 277 | |
| 📄 | 101_loki_historical_logs_pod_deleted | 78.6s | 8 | 17 | $0.4529 | 205,671 | 200,449 | 30,081 | 5,222 | 933 | 168,584 | 31,865 | 1,382 | |
| 📄 | 112_find_pvcs_by_uuid | 16.9s | 3 | 4 | $0.1938 | 59,828 | 58,814 | 20,926 | 1,014 | 604 | 37,634 | 21,180 | 138 | |
| 📄 | 12_job_crashing | 39.8s | 6 | 14 | $0.3087 | 136,723 | 134,191 | 24,986 | 2,532 | 973 | 107,747 | 26,444 | 187 | |
| 📄 | 176_network_policy_blocking_traffic_no_skills | 56.2s | 6 | 14 | $0.3457 | 143,764 | 140,441 | 27,620 | 3,323 | 767 | 112,345 | 28,096 | 756 | |
| 📄 | 227_count_configmaps_per_namespace[0] | 19.5s | 4 | 9 | $0.2145 | 79,088 | 77,954 | 21,244 | 1,134 | 590 | 55,474 | 22,480 | 76 | |
| 📄 | 243_pod_names_contain_service | 34.0s | 5 | 10 | $0.2706 | 106,887 | 104,701 | 23,431 | 2,186 | 902 | 80,314 | 24,387 | 286 | |
| 📄 | 24_misconfigured_pvc | 39.7s | 5 | 15 | $0.2972 | 110,821 | 108,099 | 24,680 | 2,722 | 1,044 | 82,056 | 26,043 | 206 | |
| 📄 | 259_k8s_event_grouping_prefer_dedicated_tools | 16.9s | 3 | 6 | $0.1880 | 58,180 | 57,196 | 20,121 | 984 | 630 | 36,623 | 20,573 | 38 | |
| 📄 | 260_k8s_multi_pod_node_lookup_prefer_dedicated_tools | 18.2s | 3 | 7 | $0.1918 | 58,400 | 57,271 | 20,012 | 1,129 | 838 | 36,713 | 20,558 | 165 | |
| 📄 | 261_k8s_vague_overview_no_bash_aggregation | 30.8s | 4 | 11 | $0.2420 | 82,285 | 80,274 | 21,990 | 2,011 | 707 | 57,584 | 22,690 | 132 | |
| 📄 | 262_k8s_label_distribution_no_bash_aggregation | 15.0s | 3 | 3 | $0.1684 | 55,661 | 54,997 | 18,960 | 664 | 336 | 36,033 | 18,964 | 156 | |
| 📄 | 263_k8s_namespace_summary_no_bash_aggregation | 16.1s | 3 | 6 | $0.1921 | 58,072 | 56,929 | 19,994 | 1,143 | 640 | 36,343 | 20,586 | 78 | |
| 📄 | 264_k8s_find_outliers_no_bash_aggregation | 13.2s | 3 | 3 | $0.1698 | 56,082 | 55,424 | 19,161 | 658 | 390 | 36,259 | 19,165 | 73 | |
| 📄 | 43_current_datetime_from_prompt | 3.2s | 1 | | $0.1231 | 17,594 | 17,491 | 17,491 | 103 | 103 | 0 | 17,491 | 61 | |
| 📄 | 51_logs_summarize_errors | 22.3s | 4 | 5 | $0.2129 | 79,968 | 78,794 | 21,700 | 1,174 | 388 | 57,089 | 21,705 | 85 | |
| 📄 | 61_exact_match_counting | 11.1s | 3 | 3 | $0.1563 | 54,476 | 54,121 | 18,458 | 355 | 208 | 35,659 | 18,462 | 24 | |
| | Total | 27.4s avg | 4.0 avg | 8.6 avg | $4.0014 | 1,454,138 | 1,425,549 | 30,081 | 28,589 | 1,044 | 1,038,908 | 386,641 | 4,120 | |
Benchmark Comparison Details

Master baseline: latest master-* experiment (post-merge regression eval)
Status: 10 test/model combinations loaded

Benchmark baseline: latest ci-benchmark experiment on master
Status: 147 test/model combinations loaded

Time comparison (seconds):

| Test case | This branch | master (1h ago) | Δ vs master | benchmark (12d ago) | Δ vs benchmark |
|---|---|---|---|---|---|
| 09_crashpod (opus-4.6) 📄 | 34.7s | 34.5s | ±0% | 32.5s | ±0% |
| 101_loki_historical_logs_pod_deleted (opus-4.6) 📄 | 78.6s | 69.3s | ↑13% | 51.7s | ↑52% |
| 112_find_pvcs_by_uuid (opus-4.6) 📄 | 16.9s | 18.1s | ±0% | | |
| 12_job_crashing (opus-4.6) 📄 | 39.8s | 40.7s | ±0% | 42.4s | ±0% |
| 176_network_policy_blocking_traffic_no_skills (opus-4.6) 📄 | 56.2s | 51.3s | ±0% | 35.7s | ↑57% |
| 227_count_configmaps_per_namespace[0] (opus-4.6) 📄 | 19.5s | | | | |
| 243_pod_names_contain_service (opus-4.6) 📄 | 34.0s | 36.8s | ±0% | 27.4s | ↑24% |
| 24_misconfigured_pvc (opus-4.6) 📄 | 39.7s | 40.0s | ±0% | 35.8s | ↑11% |
| 259_k8s_event_grouping_prefer_dedicated_tools (opus-4.6) 📄 | 16.9s | | | | |
| 260_k8s_multi_pod_node_lookup_prefer_dedicated_tools (opus-4.6) 📄 | 18.2s | | | | |
| 261_k8s_vague_overview_no_bash_aggregation (opus-4.6) 📄 | 30.8s | | | | |
| 262_k8s_label_distribution_no_bash_aggregation (opus-4.6) 📄 | 15.0s | | | | |
| 263_k8s_namespace_summary_no_bash_aggregation (opus-4.6) 📄 | 16.1s | | | | |
| 264_k8s_find_outliers_no_bash_aggregation (opus-4.6) 📄 | 13.2s | | | | |
| 43_current_datetime_from_prompt (opus-4.6) 📄 | 3.2s | 2.9s | ±0% | | |
| 51_logs_summarize_errors (opus-4.6) 📄 | 22.3s | 21.5s | ±0% | 23.1s | ±0% |
| 61_exact_match_counting (opus-4.6) 📄 | 11.1s | 10.2s | ±0% | 10.8s | ±0% |
| Average (m=9, b=9) | 35.5s | 34.1s | ±0% | 30.8s | ↑20% |

Cost comparison:

| Test case | This branch | master (1h ago) | Δ vs master | benchmark (12d ago) | Δ vs benchmark |
|---|---|---|---|---|---|
| 09_crashpod (opus-4.6) 📄 | $0.2735 | $0.2687 | ±0% | $0.2616 | ±0% |
| 101_loki_historical_logs_pod_deleted (opus-4.6) 📄 | $0.4529 | $0.4009 | ↑13% | $0.3371 | ↑34% |
| 112_find_pvcs_by_uuid (opus-4.6) 📄 | $0.1938 | $0.2014 | ±0% | | |
| 12_job_crashing (opus-4.6) 📄 | $0.3087 | $0.3193 | ±0% | $0.3076 | ±0% |
| 176_network_policy_blocking_traffic_no_skills (opus-4.6) 📄 | $0.3457 | $0.3430 | ±0% | $0.2914 | ↑19% |
| 227_count_configmaps_per_namespace[0] (opus-4.6) 📄 | $0.2145 | | | | |
| 243_pod_names_contain_service (opus-4.6) 📄 | $0.2706 | $0.2751 | ±0% | $0.2280 | ↑19% |
| 24_misconfigured_pvc (opus-4.6) 📄 | $0.2972 | $0.2929 | ±0% | $0.2831 | ±0% |
| 259_k8s_event_grouping_prefer_dedicated_tools (opus-4.6) 📄 | $0.1880 | | | | |
| 260_k8s_multi_pod_node_lookup_prefer_dedicated_tools (opus-4.6) 📄 | $0.1918 | | | | |
| 261_k8s_vague_overview_no_bash_aggregation (opus-4.6) 📄 | $0.2420 | | | | |
| 262_k8s_label_distribution_no_bash_aggregation (opus-4.6) 📄 | $0.1684 | | | | |
| 263_k8s_namespace_summary_no_bash_aggregation (opus-4.6) 📄 | $0.1921 | | | | |
| 264_k8s_find_outliers_no_bash_aggregation (opus-4.6) 📄 | $0.1698 | | | | |
| 43_current_datetime_from_prompt (opus-4.6) 📄 | $0.1231 | $0.0122 | ↑912% | | |
| 51_logs_summarize_errors (opus-4.6) 📄 | $0.2129 | $0.2051 | ±0% | $0.2072 | ±0% |
| 61_exact_match_counting (opus-4.6) 📄 | $0.1563 | $0.1522 | ±0% | $0.1522 | ±0% |
| Average (m=9, b=9) | $0.2712 | $0.2521 | ±0% | $0.2522 | ↑11% |

Total tokens comparison:

| Test case | This branch | master (1h ago) | Δ vs master | benchmark (12d ago) | Δ vs benchmark |
|---|---|---|---|---|---|
| 09_crashpod (opus-4.6) 📄 | 90,638 | 105,590 | ↓14% | 103,497 | ↓12% |
| 101_loki_historical_logs_pod_deleted (opus-4.6) 📄 | 205,671 | 167,742 | ↑23% | 138,670 | ↑48% |
| 112_find_pvcs_by_uuid (opus-4.6) 📄 | 59,828 | 61,169 | ±0% | | |
| 12_job_crashing (opus-4.6) 📄 | 136,723 | 147,210 | ±0% | 133,893 | ±0% |
| 176_network_policy_blocking_traffic_no_skills (opus-4.6) 📄 | 143,764 | 139,843 | ±0% | 111,145 | ↑29% |
| 227_count_configmaps_per_namespace[0] (opus-4.6) 📄 | 79,088 | | | | |
| 243_pod_names_contain_service (opus-4.6) 📄 | 106,887 | 107,178 | ±0% | 79,525 | ↑34% |
| 24_misconfigured_pvc (opus-4.6) 📄 | 110,821 | 107,382 | ±0% | 108,047 | ±0% |
| 259_k8s_event_grouping_prefer_dedicated_tools (opus-4.6) 📄 | 58,180 | | | | |
| 260_k8s_multi_pod_node_lookup_prefer_dedicated_tools (opus-4.6) 📄 | 58,400 | | | | |
| 261_k8s_vague_overview_no_bash_aggregation (opus-4.6) 📄 | 82,285 | | | | |
| 262_k8s_label_distribution_no_bash_aggregation (opus-4.6) 📄 | 55,661 | | | | |
| 263_k8s_namespace_summary_no_bash_aggregation (opus-4.6) 📄 | 58,072 | | | | |
| 264_k8s_find_outliers_no_bash_aggregation (opus-4.6) 📄 | 56,082 | | | | |
| 43_current_datetime_from_prompt (opus-4.6) 📄 | 17,594 | 17,043 | ±0% | | |
| 51_logs_summarize_errors (opus-4.6) 📄 | 79,968 | 77,335 | ±0% | 77,707 | ±0% |
| 61_exact_match_counting (opus-4.6) 📄 | 54,476 | 52,855 | ±0% | 52,942 | ±0% |
| Average (m=9, b=9) | 105,171 | 102,464 | ±0% | 96,288 | ↑14% |

Cached tokens comparison:

| Test case | This branch | master (1h ago) | Δ vs master | benchmark (12d ago) | Δ vs benchmark |
|---|---|---|---|---|---|
| 09_crashpod (opus-4.6) 📄 | 62,451 | 79,438 | ↓21% | 77,391 | ↓19% |
| 101_loki_historical_logs_pod_deleted (opus-4.6) 📄 | 168,584 | 133,072 | ↑27% | 106,565 | ↑58% |
| 112_find_pvcs_by_uuid (opus-4.6) 📄 | 37,634 | 38,002 | ±0% | | |
| 12_job_crashing (opus-4.6) 📄 | 107,747 | 117,018 | ±0% | 104,761 | ±0% |
| 176_network_policy_blocking_traffic_no_skills (opus-4.6) 📄 | 112,345 | 108,706 | ±0% | 81,519 | ↑38% |
| 227_count_configmaps_per_namespace[0] (opus-4.6) 📄 | 55,474 | | | | |
| 243_pod_names_contain_service (opus-4.6) 📄 | 80,314 | 80,447 | ±0% | 55,513 | ↑45% |
| 24_misconfigured_pvc (opus-4.6) 📄 | 82,056 | 79,303 | ±0% | 80,270 | ±0% |
| 259_k8s_event_grouping_prefer_dedicated_tools (opus-4.6) 📄 | 36,623 | | | | |
| 260_k8s_multi_pod_node_lookup_prefer_dedicated_tools (opus-4.6) 📄 | 36,713 | | | | |
| 261_k8s_vague_overview_no_bash_aggregation (opus-4.6) 📄 | 57,584 | | | | |
| 262_k8s_label_distribution_no_bash_aggregation (opus-4.6) 📄 | 36,033 | | | | |
| 263_k8s_namespace_summary_no_bash_aggregation (opus-4.6) 📄 | 36,343 | | | | |
| 264_k8s_find_outliers_no_bash_aggregation (opus-4.6) 📄 | 36,259 | | | | |
| 43_current_datetime_from_prompt (opus-4.6) 📄 | 16,937 | | | | |
| 51_logs_summarize_errors (opus-4.6) 📄 | 57,089 | 55,251 | ±0% | 55,443 | ±0% |
| 61_exact_match_counting (opus-4.6) 📄 | 35,659 | 34,570 | ±0% | 34,632 | ±0% |
| Average (m=8, b=9) | 88,281 | 85,976 | ±0% | 70,455 | ↑17% |

Turns comparison:

| Test case | This branch | master (1h ago) | Δ vs master | benchmark (12d ago) | Δ vs benchmark |
|---|---|---|---|---|---|
| 09_crashpod (opus-4.6) 📄 | 4 | 5 | ↓20% | | |
| 101_loki_historical_logs_pod_deleted (opus-4.6) 📄 | 8 | 7 | ↑14% | | |
| 112_find_pvcs_by_uuid (opus-4.6) 📄 | 3 | | | | |
| 12_job_crashing (opus-4.6) 📄 | 6 | 6 | ±0% | | |
| 176_network_policy_blocking_traffic_no_skills (opus-4.6) 📄 | 6 | 6 | ±0% | | |
| 227_count_configmaps_per_namespace[0] (opus-4.6) 📄 | 4 | | | | |
| 243_pod_names_contain_service (opus-4.6) 📄 | 5 | 5 | ±0% | | |
| 24_misconfigured_pvc (opus-4.6) 📄 | 5 | 5 | ±0% | | |
| 259_k8s_event_grouping_prefer_dedicated_tools (opus-4.6) 📄 | 3 | | | | |
| 260_k8s_multi_pod_node_lookup_prefer_dedicated_tools (opus-4.6) 📄 | 3 | | | | |
| 261_k8s_vague_overview_no_bash_aggregation (opus-4.6) 📄 | 4 | | | | |
| 262_k8s_label_distribution_no_bash_aggregation (opus-4.6) 📄 | 3 | | | | |
| 263_k8s_namespace_summary_no_bash_aggregation (opus-4.6) 📄 | 3 | | | | |
| 264_k8s_find_outliers_no_bash_aggregation (opus-4.6) 📄 | 3 | | | | |
| 43_current_datetime_from_prompt (opus-4.6) 📄 | 1 | 1 | ±0% | | |
| 51_logs_summarize_errors (opus-4.6) 📄 | 4 | 4 | ±0% | | |
| 61_exact_match_counting (opus-4.6) 📄 | 3 | 3 | ±0% | | |
| Average (m=9, b=0) | 4.7 | 4.7 | ±0% | | |

Tool calls comparison:

| Test case | This branch | master (1h ago) | Δ vs master | benchmark (12d ago) | Δ vs benchmark |
|---|---|---|---|---|---|
| 09_crashpod (opus-4.6) 📄 | 10 | 10 | ±0% | 10 | ±0% |
| 101_loki_historical_logs_pod_deleted (opus-4.6) 📄 | 17 | 15 | ↑13% | 14 | ↑21% |
| 112_find_pvcs_by_uuid (opus-4.6) 📄 | 4 | 4 | ±0% | | |
| 12_job_crashing (opus-4.6) 📄 | 14 | 12 | ↑17% | 14 | ±0% |
| 176_network_policy_blocking_traffic_no_skills (opus-4.6) 📄 | 14 | 15 | ±0% | 13 | ±0% |
| 227_count_configmaps_per_namespace[0] (opus-4.6) 📄 | 9 | | | | |
| 243_pod_names_contain_service (opus-4.6) 📄 | 10 | 11 | ±0% | 8 | ↑25% |
| 24_misconfigured_pvc (opus-4.6) 📄 | 15 | 13 | ↑15% | 14 | ±0% |
| 259_k8s_event_grouping_prefer_dedicated_tools (opus-4.6) 📄 | 6 | | | | |
| 260_k8s_multi_pod_node_lookup_prefer_dedicated_tools (opus-4.6) 📄 | 7 | | | | |
| 261_k8s_vague_overview_no_bash_aggregation (opus-4.6) 📄 | 11 | | | | |
| 262_k8s_label_distribution_no_bash_aggregation (opus-4.6) 📄 | 3 | | | | |
| 263_k8s_namespace_summary_no_bash_aggregation (opus-4.6) 📄 | 6 | | | | |
| 264_k8s_find_outliers_no_bash_aggregation (opus-4.6) 📄 | 3 | | | | |
| 43_current_datetime_from_prompt (opus-4.6) 📄 | | | | | |
| 51_logs_summarize_errors (opus-4.6) 📄 | 5 | 5 | ±0% | 5 | ±0% |
| 61_exact_match_counting (opus-4.6) 📄 | 3 | 3 | ±0% | 3 | ±0% |
| Average (m=8, b=9) | 11.0 | 10.5 | ±0% | 9.4 | ±0% |

Comparison indicators:

  • ±0% — diff under 10% (within noise threshold)
  • ↑N%/↓N% — diff 10-25%
  • ↑N%/↓N% — diff over 25% (significant)

⚠️ 1 Failure Detected

📖 Legend
| Icon | Meaning |
|---|---|
|  | The test was successful |
|  | The test was skipped |
| ⚠️ | The test failed but is known to be flaky or known to fail |
| 🚧 | The test had a setup failure (not a code regression) |
| 🔧 | The test failed due to mock data issues (not a code regression) |
| 🚫 | The test was throttled by API rate limits/overload |
|  | The test failed and should be fixed before merging the PR |
🔄 Re-run evals manually

⚠️ Warning: /eval comments always run using the workflow from master, not from this PR branch. If you modified the GitHub Action (e.g., added secrets or env vars), those changes won't take effect.

To test workflow changes, use the GitHub CLI or Actions UI instead:

gh workflow run eval-regression.yaml --repo HolmesGPT/holmesgpt --ref claude/fix-k8s-approval-issue-xNNVP -f markers=regression -f filter=

Option 1: Comment on this PR with /eval:

/eval
tags: regression

Or with more options (one per line):

/eval
model: gpt-4o
tags: regression
id: 09_crashpod
iterations: 5

Run evals on a different branch (e.g., master) for comparison:

/eval
branch: master
tags: regression
| Option | Description |
|---|---|
| model | Model(s) to test (default: same as automatic runs) |
| tags | Pytest tags / markers (no default - runs all tests!) |
| id | Eval ID / pytest -k filter (use /list to see valid eval names) |
| iterations | Number of runs, max 10 |
| branch | Run evals on a different branch (for cross-branch comparison) |

Quick re-run: Use /rerun to re-run the most recent /eval on this PR with the same parameters.

Option 2: Trigger via GitHub Actions UI → "Run workflow"

Option 3: Add PR labels to include extra evals (applies to both automatic runs and /eval comments):

| Label | Effect |
|---|---|
| evals-tag-<name> | Run tests with tag <name> alongside regression |
| evals-id-<name> | Run a specific eval by test ID |
| evals-model-<name> | Override the model (use model list name, e.g. sonnet-4.5) |

Examples: evals-tag-easy, evals-id-09_crashpod, evals-model-sonnet-4.5

🏷️ Valid tags

benchmark, chain-of-causation, compaction, confluence, context_window, conversation_worker, coralogix, counting, database, datadog, datetime, db-connectors, easy, elasticsearch, embeds, fast, frontend, grafana, hard, images, integration, kafka, kubernetes-no-bash-approval, kubernetes, leaked-information, logs, loki, manual, mcp, medium, metrics, network, newrelic, no-cicd, numerical, one-test, port-forward, prometheus, question-answer, regression, skills, slackbot, storage, token-limit, toolset-limitation, traces, transparency, victorialogs

🤖 Valid models

deepseek-chat, deepseek-r1-reasoner, deepseek-reasoner, deepseek-v3.2-chat, gemini-3-flash-preview, gemini-3-pro-preview, gemini-3.1-pro-preview, gpt-4.1, gpt-5.2-high-reasoning, gpt-5.3-codex, gpt-5.4, haiku-4.5, kimi-2.5, kimi-2.5-openrouter, opus-4.5, opus-4.6, opus-4.7, qwen-next-80B-instruct, qwen-next-80B-thinking, sonnet-4.5, sonnet-4.6


Commands: /eval · /rerun · /list

CLI: gh workflow run eval-regression.yaml --repo HolmesGPT/holmesgpt --ref claude/fix-k8s-approval-issue-xNNVP -f markers=regression -f filter=

coderabbitai Bot (Contributor) commented May 15, 2026

Walkthrough

Adds six Kubernetes manifest/test fixtures (tests 259–264) requiring dedicated Kubernetes queries (or direct kubectl calls) instead of bash aggregation loops, updates bash tool instructions to prefer dedicated tools, and adds a pytest marker to flag tests that forbid bash aggregation/approval flows.

Changes

Kubernetes LLM test fixtures and tooling guidance

| Layer / File(s) | Summary |
|---|---|
| Test 259: Manifest and test case for per-namespace worker counting<br>`tests/llm/fixtures/test_ask_holmes/259_k8s_event_grouping_prefer_dedicated_tools/manifest.yaml`, `tests/llm/fixtures/test_ask_holmes/259_k8s_event_grouping_prefer_dedicated_tools/test_case.yaml` | Creates four app-259-* namespaces and pods; before_test polls and asserts exact tier=worker counts per namespace; prompt requests counts and the winning namespace while forbidding shell-pipeline aggregation; after_test deletes namespaces. |
| Test 260: Manifest and test case for pod nodeName lookup<br>`tests/llm/fixtures/test_ask_holmes/260_k8s_multi_pod_node_lookup_prefer_dedicated_tools/manifest.yaml`, `tests/llm/fixtures/test_ask_holmes/260_k8s_multi_pod_node_lookup_prefer_dedicated_tools/test_case.yaml` | Creates three app-260-* namespaces and named pods; before_test waits until each pod has a non-empty spec.nodeName; prompt requires dedicated Kubernetes queries or kubectl JSON and forbids bash loops; after_test deletes namespaces. |
| Pytest marker enabling tooling constraint<br>`pyproject.toml` | Adds the kubernetes-no-bash-approval pytest marker to mark tests that require dedicated Kubernetes tools instead of bash aggregation/loops. |
| Bash tool instructions update<br>`holmes/plugins/toolsets/bash/bash_instructions.jinja2` | Prefers dedicated Kubernetes tools or batched kubectl get before using loops/conditionals; notes that approval prompts interrupt execution and loops should be used only when necessary. |
| Test 261: Vague workload overview fixture and test case<br>`tests/llm/fixtures/test_ask_holmes/261_k8s_vague_overview_no_bash_aggregation/manifest.yaml`, `tests/llm/fixtures/test_ask_holmes/261_k8s_vague_overview_no_bash_aggregation/test_case.yaml` | Creates four app-261-* namespaces and pods; before_test polls for readiness; prompt asks for a cross-namespace workload overview using dedicated Kubernetes tooling and forbids bash aggregation/loops; after_test deletes namespaces. |
| Test 262: Label distribution fixture and test case<br>`tests/llm/fixtures/test_ask_holmes/262_k8s_label_distribution_no_bash_aggregation/manifest.yaml`, `tests/llm/fixtures/test_ask_holmes/262_k8s_label_distribution_no_bash_aggregation/test_case.yaml` | Creates four app-262-* namespaces and labeled pods; before_test waits for 15 pods; prompt requests a (tier, count) table with exact expected counts using dedicated Kubernetes tooling and forbids bash aggregation/loops; after_test deletes namespaces. |
| Test 263: Namespace tier summary fixture and test case<br>`tests/llm/fixtures/test_ask_holmes/263_k8s_namespace_summary_no_bash_aggregation/manifest.yaml`, `tests/llm/fixtures/test_ask_holmes/263_k8s_namespace_summary_no_bash_aggregation/test_case.yaml` | Creates four app-263-* namespaces and pods; before_test polls until 15 pods exist; prompt requests a per-namespace tier breakdown using dedicated Kubernetes tools and forbids bash aggregation/loops; after_test deletes namespaces. |
| Test 264: Find outliers fixture and test case<br>`tests/llm/fixtures/test_ask_holmes/264_k8s_find_outliers_no_bash_aggregation/manifest.yaml`, `tests/llm/fixtures/test_ask_holmes/264_k8s_find_outliers_no_bash_aggregation/test_case.yaml` | Creates four app-264-* namespaces and pods; before_test waits for readiness; prompt asks Holmes to enumerate non-worker pods (namespace + tier) using dedicated Kubernetes tooling and forbids bash aggregation/loops; after_test deletes namespaces. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested labels

evals-tag-counting

Suggested reviewers

  • moshemorad
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly and concisely summarizes the primary change: adding LLM evaluation tests focused on verifying Holmes' preference for dedicated Kubernetes tools over shell pipelines. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

github-actions Bot (Contributor) commented May 15, 2026

Docker images ready for ab754a6d (built in 8m 13s)

⚠️ Warning: does not support ARM (ARM images are built on release only - not on every PR)

Use these tags to pull the images for testing.

📋 Copy commands

⚠️ Temporary images are deleted after 30 days. Copy to a permanent registry before using them:

gcloud auth configure-docker us-central1-docker.pkg.dev
docker pull us-central1-docker.pkg.dev/robusta-development/temporary-builds/holmes:ab754a6d
docker tag us-central1-docker.pkg.dev/robusta-development/temporary-builds/holmes:ab754a6d me-west1-docker.pkg.dev/robusta-development/development/holmes-dev:ab754a6d
docker push me-west1-docker.pkg.dev/robusta-development/development/holmes-dev:ab754a6d
docker pull us-central1-docker.pkg.dev/robusta-development/temporary-builds/holmes-operator:ab754a6d
docker tag us-central1-docker.pkg.dev/robusta-development/temporary-builds/holmes-operator:ab754a6d me-west1-docker.pkg.dev/robusta-development/development/holmes-operator-dev:ab754a6d
docker push me-west1-docker.pkg.dev/robusta-development/development/holmes-operator-dev:ab754a6d

Patch Helm values in one line (choose the chart you use):

HolmesGPT chart:

helm upgrade --install holmesgpt ./helm/holmes \
  --set registry=me-west1-docker.pkg.dev/robusta-development/development \
  --set image=holmes-dev:ab754a6d \
  --set operator.registry=me-west1-docker.pkg.dev/robusta-development/development \
  --set operator.image=holmes-operator-dev:ab754a6d

Robusta wrapper chart:

helm upgrade --install robusta robusta/robusta \
  --reuse-values \
  --set holmes.registry=me-west1-docker.pkg.dev/robusta-development/development \
  --set holmes.image=holmes-dev:ab754a6d \
  --set holmes.operator.registry=me-west1-docker.pkg.dev/robusta-development/development \
  --set holmes.operator.image=holmes-operator-dev:ab754a6d

netlify Bot commented May 15, 2026

Deploy Preview for holmes-docs ready!

| Name | Link |
|---|---|
| 🔨 Latest commit | 9b70f66 |
| 🔍 Latest deploy log | https://app.netlify.com/projects/holmes-docs/deploys/6a07883e1bd5a00008f9f634 |
| 😎 Deploy Preview | https://deploy-preview-2048--holmes-docs.netlify.app |

To edit notification comments on pull requests, go to your Netlify project configuration.

New pytest marker for the suite of evals that assert Holmes uses
dedicated Kubernetes tools (kubernetes_jq_query, kubernetes_tabular_query,
kubernetes_count) instead of bash pipelines/loops that would trigger
user approval prompts. Lets us run the whole group via
`pytest -m kubernetes-no-bash-approval`.

Signed-off-by: Claude <noreply@anthropic.com>
@coderabbitai coderabbitai Bot (Contributor) left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/llm/fixtures/test_ask_holmes/259_k8s_event_grouping_prefer_dedicated_tools/test_case.yaml (1)

1-87: ⚠️ Potential issue | 🔴 Critical

Test number 259 is already in use; choose a different sequential number.

Test number 259 conflicts with the existing test directory tests/llm/fixtures/test_ask_holmes/259_k8s_event_grouping_prefer_dedicated_tools/. The next available sequential test number is 261. Update the directory name and all references within the test to use a new sequential number that does not conflict with existing tests (256–260).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tests/llm/fixtures/test_ask_holmes/259_k8s_event_grouping_prefer_dedicated_tools/test_case.yaml`
around lines 1 - 87, The test directory name conflicts with an existing test
(259); rename the directory and all in-file references from
"259_k8s_event_grouping_prefer_dedicated_tools" (and any bare "259" test-number
metadata) to the next available sequential number
"261_k8s_event_grouping_prefer_dedicated_tools", updating the directory name and
every occurrence of the test-number string inside the YAML (ensure you do not
alter the app-259-* namespace names unless they are meant to change) so the test
number is unique.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@tests/llm/fixtures/test_ask_holmes/259_k8s_event_grouping_prefer_dedicated_tools/test_case.yaml`:
- Around line 20-28: The expected_output block currently uses quoted multi-line
strings that preserve newlines/leading whitespace; replace those quoted
multi-line entries with a folded scalar style (>) like user_prompt to collapse
lines and remove leading spaces so the YAML parser/validator sees the intended
single-paragraph text; locate the expected_output key in the test case
(referenced as expected_output) and convert each quoted multi-line value to a
folded scalar (>), ensuring indentation matches the surrounding YAML and content
lines are wrapped without the manual line breaks.

In
`@tests/llm/fixtures/test_ask_holmes/260_k8s_multi_pod_node_lookup_prefer_dedicated_tools/test_case.yaml`:
- Around line 25-28: Update the acceptance text that currently permits "one call
per pod" so it instead requires either a dedicated Kubernetes tool
(kubernetes_tabular_query or kubernetes_jq_query) OR batched kubectl calls
(e.g., a single "kubectl get pods ..." or a single "kubectl get pod <multiple>"
style command) — remove the allowance for one-per-pod bash calls; also add
include_tool_calls: true to this test case to ensure tool execution is
validated; look for the string containing "Must answer using a dedicated
Kubernetes tool..." and the test case metadata to apply these changes.

---

Outside diff comments:
In
`@tests/llm/fixtures/test_ask_holmes/259_k8s_event_grouping_prefer_dedicated_tools/test_case.yaml`:
- Around line 1-87: The test directory name conflicts with an existing test
(259); rename the directory and all in-file references from
"259_k8s_event_grouping_prefer_dedicated_tools" (and any bare "259" test-number
metadata) to the next available sequential number
"261_k8s_event_grouping_prefer_dedicated_tools", updating the directory name and
every occurrence of the test-number string inside the YAML (ensure you do not
alter the app-259-* namespace names unless they are meant to change) so the test
number is unique.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e594859c-c880-41c9-94e7-c57190e6eabd

📥 Commits

Reviewing files that changed from the base of the PR and between 31fa24c and e239caf.

📒 Files selected for processing (4)
  • tests/llm/fixtures/test_ask_holmes/259_k8s_event_grouping_prefer_dedicated_tools/manifest.yaml
  • tests/llm/fixtures/test_ask_holmes/259_k8s_event_grouping_prefer_dedicated_tools/test_case.yaml
  • tests/llm/fixtures/test_ask_holmes/260_k8s_multi_pod_node_lookup_prefer_dedicated_tools/manifest.yaml
  • tests/llm/fixtures/test_ask_holmes/260_k8s_multi_pod_node_lookup_prefer_dedicated_tools/test_case.yaml

Comment on lines +20 to +28
- "Must answer using the dedicated Kubernetes tools (one or more of
kubernetes_jq_query, kubernetes_tabular_query, kubernetes_count, or
kubectl_find_resource). Counting via simple per-namespace kubectl get
calls through the bash tool is acceptable."
- "Must NOT use the bash tool with shell pipelines that aggregate or group
results (for example: piping kubectl output through awk, sort, uniq -c,
or wc to compute grouped counts). The grouping/aggregation must be done
by a dedicated Kubernetes tool or by reading the kubectl output directly,
not by composing a shell pipeline."

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix multi-line string formatting in expected_output.

Lines 20-28 use quoted strings with manual line breaks, which preserves newlines and leading whitespace. This differs from the proper folded scalar style used for user_prompt at line 1 (using >). The preserved whitespace could cause validation issues.

📝 Proposed fix using proper YAML multi-line format
   - "Must identify app-259-control as the namespace with the most stuck pods"
-  - "Must answer using the dedicated Kubernetes tools (one or more of
-     kubernetes_jq_query, kubernetes_tabular_query, kubernetes_count, or
-     kubectl_find_resource). Counting via simple per-namespace kubectl get
-     calls through the bash tool is acceptable."
-  - "Must NOT use the bash tool with shell pipelines that aggregate or group
-     results (for example: piping kubectl output through awk, sort, uniq -c,
-     or wc to compute grouped counts). The grouping/aggregation must be done
-     by a dedicated Kubernetes tool or by reading the kubectl output directly,
-     not by composing a shell pipeline."
+  - >
+    Must answer using the dedicated Kubernetes tools (one or more of
+    kubernetes_jq_query, kubernetes_tabular_query, kubernetes_count, or
+    kubectl_find_resource). Counting via simple per-namespace kubectl get
+    calls through the bash tool is acceptable.
+  - >
+    Must NOT use the bash tool with shell pipelines that aggregate or group
+    results (for example: piping kubectl output through awk, sort, uniq -c,
+    or wc to compute grouped counts). The grouping/aggregation must be done
+    by a dedicated Kubernetes tool or by reading the kubectl output directly,
+    not by composing a shell pipeline.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tests/llm/fixtures/test_ask_holmes/259_k8s_event_grouping_prefer_dedicated_tools/test_case.yaml`
around lines 20 - 28, The expected_output block currently uses quoted multi-line
strings that preserve newlines/leading whitespace; replace those quoted
multi-line entries with a folded scalar style (>) like user_prompt to collapse
lines and remove leading spaces so the YAML parser/validator sees the intended
single-paragraph text; locate the expected_output key in the test case
(referenced as expected_output) and convert each quoted multi-line value to a
folded scalar (>), ensuring indentation matches the surrounding YAML and content
lines are wrapped without the manual line breaks.

Comment on lines +25 to +28
- "Must answer using a dedicated Kubernetes tool such as
kubernetes_tabular_query or kubernetes_jq_query, OR by issuing direct
'kubectl get pod ... -o ...' / 'kubectl get pods ... -o ...' calls
through the bash tool — one call per pod is acceptable."

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Tighten acceptance criteria to enforce batched/dedicated lookup.

Line 25–28 currently allows “one call per pod” through bash, which contradicts the stated goal at Line 14–17 and can let the non-batched anti-pattern pass.

Suggested fix
-  - "Must answer using a dedicated Kubernetes tool such as
-     kubernetes_tabular_query or kubernetes_jq_query, OR by issuing direct
-     'kubectl get pod ... -o ...' / 'kubectl get pods ... -o ...' calls
-     through the bash tool — one call per pod is acceptable."
+  - "Must answer using a dedicated Kubernetes tool such as
+     kubernetes_tabular_query or kubernetes_jq_query, OR by issuing a single
+     batched kubectl query (e.g., one 'kubectl get pods ... -o ...' call)
+     through the bash tool."

As per coding guidelines: “User prompts must be specific and match the test - test exact values and discovery of information… use include_tool_calls: true to verify tool execution.”

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- "Must answer using a dedicated Kubernetes tool such as
kubernetes_tabular_query or kubernetes_jq_query, OR by issuing direct
'kubectl get pod ... -o ...' / 'kubectl get pods ... -o ...' calls
through the bash tool — one call per pod is acceptable."
- "Must answer using a dedicated Kubernetes tool such as
kubernetes_tabular_query or kubernetes_jq_query, OR by issuing a single
batched kubectl query (e.g., one 'kubectl get pods ... -o ...' call)
through the bash tool."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tests/llm/fixtures/test_ask_holmes/260_k8s_multi_pod_node_lookup_prefer_dedicated_tools/test_case.yaml`
around lines 25 - 28, Update the acceptance text that currently permits "one
call per pod" so it instead requires either a dedicated Kubernetes tool
(kubernetes_tabular_query or kubernetes_jq_query) OR batched kubectl calls
(e.g., a single "kubectl get pods ..." or a single "kubectl get pod <multiple>"
style command) — remove the allowance for one-per-pod bash calls; also add
include_tool_calls: true to this test case to ensure tool execution is
validated; look for the string containing "Must answer using a dedicated
Kubernetes tool..." and the test case metadata to apply these changes.

@coderabbitai coderabbitai Bot (Contributor) left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@tests/llm/fixtures/test_ask_holmes/259_k8s_event_grouping_prefer_dedicated_tools/test_case.yaml`:
- Around line 20-23: Update the fixture's tool-usage requirement so it no longer
permits the bash fallback: locate the string that currently reads "Must answer
using the dedicated Kubernetes tools (one or more of kubernetes_jq_query,
kubernetes_tabular_query, kubernetes_count, or kubectl_find_resource). Counting
via simple per-namespace kubectl get calls through the bash tool is acceptable."
and remove the trailing allowance clause ("Counting via simple per-namespace
kubectl get calls through the bash tool is acceptable.") so the rule enforces
only the dedicated Kubernetes tools.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e6c22c9f-44ff-4cee-ae6d-62391cbd7d15

📥 Commits

Reviewing files that changed from the base of the PR and between e239caf and 900fdf0.

📒 Files selected for processing (3)
  • pyproject.toml
  • tests/llm/fixtures/test_ask_holmes/259_k8s_event_grouping_prefer_dedicated_tools/test_case.yaml
  • tests/llm/fixtures/test_ask_holmes/260_k8s_multi_pod_node_lookup_prefer_dedicated_tools/test_case.yaml
✅ Files skipped from review due to trivial changes (1)
  • pyproject.toml
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/llm/fixtures/test_ask_holmes/260_k8s_multi_pod_node_lookup_prefer_dedicated_tools/test_case.yaml

Comment on lines +20 to +23
- "Must answer using the dedicated Kubernetes tools (one or more of
kubernetes_jq_query, kubernetes_tabular_query, kubernetes_count, or
kubectl_find_resource). Counting via simple per-namespace kubectl get
calls through the bash tool is acceptable."

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Tighten tool-usage criteria to disallow bash counting fallback.

This line makes the eval permissive in a way that can bypass the test’s core intent (dedicated Kubernetes tools). Please remove the “bash tool is acceptable” allowance so the fixture consistently enforces dedicated-tool usage.

Suggested diff
-  - "Must answer using the dedicated Kubernetes tools (one or more of
-     kubernetes_jq_query, kubernetes_tabular_query, kubernetes_count, or
-     kubectl_find_resource). Counting via simple per-namespace kubectl get
-     calls through the bash tool is acceptable."
+  - "Must answer using dedicated Kubernetes tools (one or more of
+     kubernetes_jq_query, kubernetes_tabular_query, kubernetes_count, or
+     kubectl_find_resource), and must not use bash for counting/grouping."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tests/llm/fixtures/test_ask_holmes/259_k8s_event_grouping_prefer_dedicated_tools/test_case.yaml`
around lines 20 - 23, Update the fixture's tool-usage requirement so it no
longer permits the bash fallback: locate the string that currently reads "Must
answer using the dedicated Kubernetes tools (one or more of kubernetes_jq_query,
kubernetes_tabular_query, kubernetes_count, or kubectl_find_resource). Counting
via simple per-namespace kubectl get calls through the bash tool is acceptable."
and remove the trailing allowance clause ("Counting via simple per-namespace
kubectl get calls through the bash tool is acceptable.") so the rule enforces
only the dedicated Kubernetes tools.

claude added 3 commits May 15, 2026 19:10
The original 259 used pods with invalid image references to produce
ImagePullBackOff/ErrImagePull status, but that requires a working
container runtime to even reach the image-pull stage. In constrained
CI environments (nested containers, cgroup v1) pods get stuck on
FailedCreatePodSandBox before the kubelet ever attempts an image
pull, making the setup unverifiable.

The redesigned 259 deploys 15 pods across 4 namespaces with a mix of
tier labels (worker plus noise tiers like edge/messaging/gateway/
batch/observability), and asks Holmes to count pods with tier=worker
in each namespace. Labels are queryable from metadata.labels as soon
as the Pod object is created, so the eval is independent of whether
the pods reach Running.

Tests the same anti-pattern (bash | awk | sort | uniq -c grouping)
with the same expected_output structure (3/1/2/4, control wins).
Verified locally against opus-4.6 via OpenRouter: 10/10 pass.
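
The before_test polling this relies on could look like the following sketch (the actual fixture script is not shown in this excerpt; counting with wc -l is fine here because it runs in the test harness, not in Holmes):

```bash
# Wait until the control namespace reports its expected 4 tier=worker pods
for attempt in $(seq 1 30); do
  count=$(kubectl get pods -n app-259-control -l tier=worker --no-headers 2>/dev/null | wc -l)
  [ "$count" -eq 4 ] && break
  sleep 2
done
```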

Signed-off-by: Claude <noreply@anthropic.com>
Adds evals 261-264 that probe opus-4.6 with vague, customer-style
questions about resource distribution and grouping across multiple
namespaces. They complement 259/260 by covering question phrasings
that historically slip past the existing assertion set.

- 261 (vague overview): "give me an overview of what's running across
  these 4 namespaces"
- 262 (label distribution): "what distinct tier values are in use,
  with counts per value"
- 263 (namespace summary): "per-namespace breakdown of pods by tier
  label"
- 264 (find outliers): "find every pod whose tier is something other
  than worker"

All four use the same 15-pod template (worker pods plus tier-noise
pods) in isolated app-{N}-* namespaces, so they're parallel-safe.

Locally on opus-4.6 these pass ~95% of the time (5/5, 4/5, 5/5, 5/5
on a clean run). The intermittent failures show opus-4.6 occasionally
falls back to bash pipelines like:

  kubectl get pods -n X --show-labels; echo "==="; kubectl get pods -n Y --show-labels ...
  kubectl get pods -A -o json | jq | sort -u

The dedicated tools (kubernetes_jq_query, kubernetes_count) handle all
four scenarios cleanly. The regression evals lock in that preference
and will catch future model changes that re-introduce the anti-pattern.

include_tool_calls is set so the judge inspects tool selection, not
just the final answer. Bash with simple per-namespace 'kubectl get'
calls (one per namespace, semicolon-chained allowed) is acceptable;
bash with aggregating pipelines (awk/sort/uniq/wc/cut/grep -c) is not.
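
Concretely, the boundary the judge applies looks roughly like this (illustrative commands):

```bash
# Acceptable: plain per-namespace reads, semicolon-chained, output read directly
kubectl get pods -n app-261-frontend --show-labels; kubectl get pods -n app-261-backend --show-labels

# Not acceptable: an aggregating pipeline; this should be a single
# kubernetes_jq_query / kubernetes_count call instead
kubectl get pods -A -o json | jq -r '.items[].metadata.labels.tier' | sort | uniq -c
```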

Signed-off-by: Claude <noreply@anthropic.com>
The previous bash instructions used `kubectl get pods | grep Running | head -5`
as the canonical "pipe is auto-approved" example. That implicitly endorsed
piping kubectl output through awk/sort/uniq -c/wc to do grouping and counting,
which is exactly the anti-pattern that produces the customer pain reported in
this branch: instead of a single kubernetes_count / kubernetes_jq_query /
kubernetes_tabular_query call, the LLM stitched together shell pipelines that
either required approval (loops, command substitution) or produced noisy
investigation traces.

This change adds a "Tool Selection — Prefer Dedicated Tools" section at the
top of the bash instructions that:

  * Tells the model to check for a dedicated K8s tool before reaching for bash.
  * Enumerates the four most common anti-patterns (group-by-uniq, wc -l counts,
    distinct-via-sort-u, per-resource for-loops) and points each one at the
    dedicated tool that replaces it.
  * Carves out the legitimate uses of bash — one-off `kubectl get` with a
    specific flag, non-kubectl invocations — so the guidance does not over-rotate.

Two smaller edits below:

  * The "Pipes" example is changed from `kubectl get pods | grep ...` to a
    log-grep example, with a one-line reminder pointing at the new section.
  * The "go ahead and use loops" line is replaced with a check-first prompt
    that mentions the cost of approval prompts.

This is prompt guidance, not a behavior gate — the model can still pipe kubectl
when it judges that the right choice (e.g. when no dedicated tool fits). The
goal is to flip the default for grouping/counting/joining work back to the
dedicated tools.
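
As a sketch, the four enumerated anti-patterns and the dedicated tool each one points at (the pipeline shapes are illustrative; the dedicated tools are invoked through the tool API, not through bash):

```bash
kubectl get pods -A -o json | jq -r '.items[].metadata.namespace' | sort | uniq -c  # group-by-uniq     -> kubernetes_jq_query
kubectl get pods -n app-261-backend --no-headers | wc -l                            # wc -l counting    -> kubernetes_count
kubectl get pods -A -L tier --no-headers | awk '{print $NF}' | sort -u              # distinct-via-sort -> kubernetes_jq_query
for p in $(kubectl get pods -o name); do kubectl get "$p" -o wide; done             # per-resource loop -> kubernetes_tabular_query
```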

Pairs with the 259/260/261/262/263/264 evals on this branch which lock in the
preference. Verification against opus-4.6 with full iteration counts is
pending — the OpenRouter weekly credit limit was hit during testing.

Signed-off-by: Claude <noreply@anthropic.com>
@coderabbitai coderabbitai Bot (Contributor) left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@tests/llm/fixtures/test_ask_holmes/261_k8s_vague_overview_no_bash_aggregation/manifest.yaml`:
- Around line 22-200: All Pod resources in this manifest (e.g.,
web-renderer-v6q8, asset-cdn-v6q8, session-router-v6q8, etc.) lack an explicit
non-root security context; add a pod-level securityContext with runAsNonRoot:
true and runAsUser: 1000 and ensure each container (name: app) sets
securityContext.allowPrivilegeEscalation: false (or container-level
runAsNonRoot/runAsUser if you prefer container scope) so every Pod/spec and its
container securityContext explicitly enforce non-root and no privilege
escalation across all Pod definitions.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: dc321014-7f70-4636-9eb1-d60c115e3465

📥 Commits

Reviewing files that changed from the base of the PR and between f39a8f3 and 9b70f66.

📒 Files selected for processing (9)
  • holmes/plugins/toolsets/bash/bash_instructions.jinja2
  • tests/llm/fixtures/test_ask_holmes/261_k8s_vague_overview_no_bash_aggregation/manifest.yaml
  • tests/llm/fixtures/test_ask_holmes/261_k8s_vague_overview_no_bash_aggregation/test_case.yaml
  • tests/llm/fixtures/test_ask_holmes/262_k8s_label_distribution_no_bash_aggregation/manifest.yaml
  • tests/llm/fixtures/test_ask_holmes/262_k8s_label_distribution_no_bash_aggregation/test_case.yaml
  • tests/llm/fixtures/test_ask_holmes/263_k8s_namespace_summary_no_bash_aggregation/manifest.yaml
  • tests/llm/fixtures/test_ask_holmes/263_k8s_namespace_summary_no_bash_aggregation/test_case.yaml
  • tests/llm/fixtures/test_ask_holmes/264_k8s_find_outliers_no_bash_aggregation/manifest.yaml
  • tests/llm/fixtures/test_ask_holmes/264_k8s_find_outliers_no_bash_aggregation/test_case.yaml

Comment on lines +22 to +200
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-renderer-v6q8
  namespace: app-261-frontend
  labels:
    app: web-renderer
    tier: worker
spec:
  containers:
    - {name: app, image: busybox:1.36, command: ["sleep", "3600"]}
---
apiVersion: v1
kind: Pod
metadata:
  name: asset-cdn-v6q8
  namespace: app-261-frontend
  labels:
    app: asset-cdn
    tier: worker
spec:
  containers:
    - {name: app, image: busybox:1.36, command: ["sleep", "3600"]}
---
apiVersion: v1
kind: Pod
metadata:
  name: session-router-v6q8
  namespace: app-261-frontend
  labels:
    app: session-router
    tier: worker
spec:
  containers:
    - {name: app, image: busybox:1.36, command: ["sleep", "3600"]}
---
apiVersion: v1
kind: Pod
metadata:
  name: edge-cache-v6q8
  namespace: app-261-frontend
  labels:
    app: edge-cache
    tier: edge
spec:
  containers:
    - {name: app, image: busybox:1.36, command: ["sleep", "3600"]}
---
apiVersion: v1
kind: Pod
metadata:
  name: order-processor-v6q8
  namespace: app-261-backend
  labels:
    app: order-processor
    tier: worker
spec:
  containers:
    - {name: app, image: busybox:1.36, command: ["sleep", "3600"]}
---
apiVersion: v1
kind: Pod
metadata:
  name: queue-broker-v6q8
  namespace: app-261-backend
  labels:
    app: queue-broker
    tier: messaging
spec:
  containers:
    - {name: app, image: busybox:1.36, command: ["sleep", "3600"]}
---
apiVersion: v1
kind: Pod
metadata:
  name: api-gateway-v6q8
  namespace: app-261-backend
  labels:
    app: api-gateway
    tier: gateway
spec:
  containers:
    - {name: app, image: busybox:1.36, command: ["sleep", "3600"]}
---
apiVersion: v1
kind: Pod
metadata:
  name: stream-ingester-v6q8
  namespace: app-261-data
  labels:
    app: stream-ingester
    tier: worker
spec:
  containers:
    - {name: app, image: busybox:1.36, command: ["sleep", "3600"]}
---
apiVersion: v1
kind: Pod
metadata:
  name: feature-compactor-v6q8
  namespace: app-261-data
  labels:
    app: feature-compactor
    tier: worker
spec:
  containers:
    - {name: app, image: busybox:1.36, command: ["sleep", "3600"]}
---
apiVersion: v1
kind: Pod
metadata:
  name: warehouse-loader-v6q8
  namespace: app-261-data
  labels:
    app: warehouse-loader
    tier: batch
spec:
  containers:
    - {name: app, image: busybox:1.36, command: ["sleep", "3600"]}
---
apiVersion: v1
kind: Pod
metadata:
  name: scheduler-shim-v6q8
  namespace: app-261-control
  labels:
    app: scheduler-shim
    tier: worker
spec:
  containers:
    - {name: app, image: busybox:1.36, command: ["sleep", "3600"]}
---
apiVersion: v1
kind: Pod
metadata:
  name: policy-evaluator-v6q8
  namespace: app-261-control
  labels:
    app: policy-evaluator
    tier: worker
spec:
  containers:
    - {name: app, image: busybox:1.36, command: ["sleep", "3600"]}
---
apiVersion: v1
kind: Pod
metadata:
  name: workload-binder-v6q8
  namespace: app-261-control
  labels:
    app: workload-binder
    tier: worker
spec:
  containers:
    - {name: app, image: busybox:1.36, command: ["sleep", "3600"]}
---
apiVersion: v1
kind: Pod
metadata:
  name: lease-coordinator-v6q8
  namespace: app-261-control
  labels:
    app: lease-coordinator
    tier: worker
spec:
  containers:
    - {name: app, image: busybox:1.36, command: ["sleep", "3600"]}
---
apiVersion: v1
kind: Pod
metadata:
  name: audit-logger-v6q8
  namespace: app-261-control
  labels:
    app: audit-logger
    tier: observability
spec:
  containers:
    - {name: app, image: busybox:1.36, command: ["sleep", "3600"]}
```

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Harden pod security context to avoid policy-dependent fixture failures.

These pods currently rely on default security settings; on clusters enforcing Pod Security standards, they can be rejected (runAsNonRoot / allowPrivilegeEscalation). Please add explicit non-root security context across all fixture pods.

Suggested pattern to apply to each Pod spec
 spec:
+  securityContext:
+    runAsNonRoot: true
+    seccompProfile:
+      type: RuntimeDefault
   containers:
-    - {name: app, image: busybox:1.36, command: ["sleep", "3600"]}
+    - name: app
+      image: busybox:1.36
+      command: ["sleep", "3600"]
+      securityContext:
+        allowPrivilegeEscalation: false
+        capabilities:
+          drop: ["ALL"]
🧰 Tools
🪛 Checkov (3.2.528)

| Severity | Lines | Finding | Check |
|---|---|---|---|
| medium | 22-33 | Containers should not run with allowPrivilegeEscalation | CKV_K8S_20 |
| medium | 22-33 | Minimize the admission of root containers | CKV_K8S_23 |
| medium | 34-45 | Containers should not run with allowPrivilegeEscalation | CKV_K8S_20 |
| medium | 34-45 | Minimize the admission of root containers | CKV_K8S_23 |
| medium | 46-57 | Containers should not run with allowPrivilegeEscalation | CKV_K8S_20 |
| medium | 46-57 | Minimize the admission of root containers | CKV_K8S_23 |
| medium | 58-69 | Containers should not run with allowPrivilegeEscalation | CKV_K8S_20 |
| medium | 58-69 | Minimize the admission of root containers | CKV_K8S_23 |
| medium | 70-81 | Containers should not run with allowPrivilegeEscalation | CKV_K8S_20 |
| medium | 70-81 | Minimize the admission of root containers | CKV_K8S_23 |
| medium | 82-93 | Containers should not run with allowPrivilegeEscalation | CKV_K8S_20 |
| medium | 82-93 | Minimize the admission of root containers | CKV_K8S_23 |
| medium | 94-105 | Containers should not run with allowPrivilegeEscalation | CKV_K8S_20 |
| medium | 94-105 | Minimize the admission of root containers | CKV_K8S_23 |
| medium | 106-117 | Containers should not run with allowPrivilegeEscalation | CKV_K8S_20 |
| medium | 106-117 | Minimize the admission of root containers | CKV_K8S_23 |
| medium | 118-129 | Containers should not run with allowPrivilegeEscalation | CKV_K8S_20 |
| medium | 118-129 | Minimize the admission of root containers | CKV_K8S_23 |
| medium | 130-141 | Containers should not run with allowPrivilegeEscalation | CKV_K8S_20 |
| medium | 130-141 | Minimize the admission of root containers | CKV_K8S_23 |
| medium | 142-153 | Containers should not run with allowPrivilegeEscalation | CKV_K8S_20 |
| medium | 142-153 | Minimize the admission of root containers | CKV_K8S_23 |
| medium | 154-165 | Containers should not run with allowPrivilegeEscalation | CKV_K8S_20 |
| medium | 154-165 | Minimize the admission of root containers | CKV_K8S_23 |
| medium | 166-177 | Containers should not run with allowPrivilegeEscalation | CKV_K8S_20 |
| medium | 166-177 | Minimize the admission of root containers | CKV_K8S_23 |
| medium | 178-189 | Containers should not run with allowPrivilegeEscalation | CKV_K8S_20 |
| medium | 178-189 | Minimize the admission of root containers | CKV_K8S_23 |
| medium | 190-200 | Containers should not run with allowPrivilegeEscalation | CKV_K8S_20 |
| medium | 190-200 | Minimize the admission of root containers | CKV_K8S_23 |

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tests/llm/fixtures/test_ask_holmes/261_k8s_vague_overview_no_bash_aggregation/manifest.yaml`
around lines 22 - 200, All Pod resources in this manifest (e.g.,
web-renderer-v6q8, asset-cdn-v6q8, session-router-v6q8, etc.) lack an explicit
non-root security context; add a pod-level securityContext with runAsNonRoot:
true and runAsUser: 1000 and ensure each container (name: app) sets
securityContext.allowPrivilegeEscalation: false (or container-level
runAsNonRoot/runAsUser if you prefer container scope) so every Pod/spec and its
container securityContext explicitly enforce non-root and no privilege
escalation across all Pod definitions.
