add tags to failing tests #597
Conversation
Walkthrough
This change expands test categorization by introducing new tags in both the codebase and test metadata. Four additional pytest markers and evaluation tags are defined, and multiple test case YAML files are updated to include relevant tags under a new or existing `tags` field.

Changes
Suggested reviewers
📜 Recent review details
Configuration used: CodeRabbit UI
📥 Commits
Reviewing files that changed from the base of the PR and between 9d64fd42c69a2e941bca9e39ccd7a5531015263d and f043d18.
📒 Files selected for processing (14)
✅ Files skipped from review due to trivial changes (2)
🚧 Files skipped from review as they are similar to previous changes (12)
⏰ Context from checks skipped due to timeout of 90000ms (4)
✨ Finishing Touches
Actionable comments posted: 4
🔭 Outside diff range comments (2)
tests/llm/fixtures/test_ask_holmes/42_dns_issues_steps_old_tools/test_case.yaml (1)
19-21: `kubectl apply -f./manifest.yaml` will fail – missing space
`kubectl` treats `-f` as a flag whose value must be provided after a space (or with `=`). Without the space, it is parsed as an unknown shorthand flag and the test will error out.

```diff
- kubectl apply -f./manifest.yaml
+ kubectl apply -f ./manifest.yaml
```

Same fix is needed for the corresponding `kubectl delete` line.

tests/llm/fixtures/test_ask_holmes/42_dns_issues_result_all_tools/test_case.yaml (1)
6-9: Missing space after `-f` flag
Same `kubectl apply/delete -f./manifest.yaml` issue as in the sister file.
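As an aside, the difference in tokenization can be sketched with Python's `shlex` (a rough illustration only; kubectl's actual flag parsing is handled by its own CLI library, which this does not reproduce):

```python
import shlex

# Without the space, "-f./manifest.yaml" reaches kubectl as one single token.
bad = shlex.split("kubectl apply -f./manifest.yaml")
# With the space, "-f" and "./manifest.yaml" arrive as separate arguments.
good = shlex.split("kubectl apply -f ./manifest.yaml")

assert bad == ["kubectl", "apply", "-f./manifest.yaml"]
assert good == ["kubectl", "apply", "-f", "./manifest.yaml"]
```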
♻️ Duplicate comments (4)
tests/llm/fixtures/test_ask_holmes/42_dns_issues_steps_new_all_tools/test_case.yaml (1)
2-4: Duplicate feedback: see the earlier comment aboutnetworkvsnetworkingmarker ambiguity.tests/llm/fixtures/test_ask_holmes/42_dns_issues_result_old_tools/test_case.yaml (1)
2-4: Duplicate feedback: see the earlier comment aboutnetworkvsnetworkingmarker ambiguity.tests/llm/fixtures/test_ask_holmes/42_dns_issues_result_new_tools/test_case.yaml (1)
2-4: Duplicate feedback: see the earlier comment aboutnetworkvsnetworkingmarker ambiguity.tests/llm/fixtures/test_ask_holmes/42_dns_issues_result_all_tools/test_case.yaml (1)
2-4: Same taxonomy collision as above
Ensure only one of `network`/`networking` survives in the constants file and in all test cases.
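A guard along these lines could catch stray tags early (a minimal sketch; the set below is hypothetical and abbreviated — the real source of truth is `ALLOWED_EVAL_TAGS` in tests/llm/utils/constants.py):

```python
# Hypothetical, abbreviated allowed set; the real one lives in
# tests/llm/utils/constants.py as ALLOWED_EVAL_TAGS.
ALLOWED_EVAL_TAGS = {
    "network", "logs", "misleading-history",
    "k8s-misconfig", "chain-of-causation", "slackbot",
}

def find_unknown_tags(tags):
    """Return any tags that are not part of the allowed taxonomy, sorted."""
    return sorted(set(tags) - ALLOWED_EVAL_TAGS)

# Once only "network" survives, a stray "networking" is flagged immediately.
assert find_unknown_tags(["network", "logs"]) == []
assert find_unknown_tags(["networking", "logs"]) == ["networking"]
```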
🧹 Nitpick comments (4)
tests/llm/fixtures/test_ask_holmes/21_job_fail_curl_no_svc_account/test_case.yaml (1)
18-20: Minor typo in comment
`Currenlty` → `Currently`.

```diff
-# Currenlty holmes is able to answer the question correctly only around 10% of the time.
+# Currently Holmes is able to answer the question correctly only around 10 % of the time.
```

tests/llm/fixtures/test_ask_holmes/42_dns_issues_steps_new_tools/test_case.yaml (1)
2-4: Potential confusion between `network` vs `networking` tags
`network` is already an established marker across the suite, and now we introduce `networking`.
Unless the two convey distinct semantics, having both will fragment filtering (`pytest -m network` vs `pytest -m networking`) and make dashboards harder to search. If the intent is simply to rename/clarify the old marker, consider:

```diff
-  - networking
+  - network
```

and then deprecate `networking` from `pyproject.toml`.
Otherwise, document the precise difference in `CONTRIBUTING.md` and ensure every test uses the right one.

pyproject.toml (1)
83-87: Minor ordering nit
For readability the marker list has so far been alphabetical.
Keeping that convention helps future reviewers spot duplicates quickly.

```diff
-    "networking: Tests involving networking functionality",
-    "misleading-history: Tests with misleading historical data",
-    "k8s-misconfig: Tests involving Kubernetes misconfigurations",
-    "chain-of-causation: Tests involving chain-of-causation analysis",
-    "slackbot: Tests involving Slack bot functionality",
+    "chain-of-causation: Tests involving chain-of-causation analysis",
+    "k8s-misconfig: Tests involving Kubernetes misconfigurations",
+    "misleading-history: Tests with misleading historical data",
+    "networking: Tests involving networking functionality",
+    "slackbot: Tests involving Slack bot functionality",
```

tests/llm/fixtures/test_ask_holmes/42_dns_issues_steps_old_tools/test_case.yaml (1)
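If the alphabetical convention is worth keeping, it could even be asserted in a tiny check (a sketch; the marker strings are copied from the pyproject.toml list, and this check itself is hypothetical, not part of the repo):

```python
# Marker entries as they would appear in pyproject.toml once re-sorted.
markers = [
    "chain-of-causation: Tests involving chain-of-causation analysis",
    "k8s-misconfig: Tests involving Kubernetes misconfigurations",
    "misleading-history: Tests with misleading historical data",
    "networking: Tests involving networking functionality",
    "slackbot: Tests involving Slack bot functionality",
]

# The list is alphabetical by marker name, so sorting it is a no-op.
assert markers == sorted(markers)
```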
5-17: Block scalar starts with an empty line – unintended leading newline
`expected_output: |` is immediately followed by a blank line, which introduces a leading newline into the value. If the evaluation logic performs exact or trimmed comparisons, this may cause false negatives. Delete the empty line or switch to `|-` to strip.
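The effect on comparisons can be illustrated in plain Python (a sketch with made-up string values; how the eval harness actually compares outputs is an assumption here):

```python
# A literal block scalar (|) preceded by a blank line yields a value with a
# leading newline; without the blank line there is none.
with_blank_line = "\nDNS resolution fails for the service\n"
without_blank_line = "DNS resolution fails for the service\n"

# Exact comparison is tripped up by the leading newline...
assert with_blank_line != without_blank_line
# ...while a trimmed comparison still matches.
assert with_blank_line.strip() == without_blank_line.strip()
```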
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between baedc3b and 9d64fd42c69a2e941bca9e39ccd7a5531015263d.
📒 Files selected for processing (14)
- pyproject.toml (1 hunks)
- tests/llm/fixtures/test_ask_holmes/08_sock_shop_frontend/test_case.yaml (1 hunks)
- tests/llm/fixtures/test_ask_holmes/21_job_fail_curl_no_svc_account/test_case.yaml (1 hunks)
- tests/llm/fixtures/test_ask_holmes/22_high_latency_dbi_down/test_case.yaml (1 hunks)
- tests/llm/fixtures/test_ask_holmes/25_misconfigured_ingress_class/test_case.yaml (1 hunks)
- tests/llm/fixtures/test_ask_holmes/42_dns_issues_result_all_tools/test_case.yaml (1 hunks)
- tests/llm/fixtures/test_ask_holmes/42_dns_issues_result_new_tools/test_case.yaml (1 hunks)
- tests/llm/fixtures/test_ask_holmes/42_dns_issues_result_old_tools/test_case.yaml (1 hunks)
- tests/llm/fixtures/test_ask_holmes/42_dns_issues_steps_new_all_tools/test_case.yaml (1 hunks)
- tests/llm/fixtures/test_ask_holmes/42_dns_issues_steps_new_tools/test_case.yaml (1 hunks)
- tests/llm/fixtures/test_ask_holmes/42_dns_issues_steps_old_tools/test_case.yaml (1 hunks)
- tests/llm/fixtures/test_ask_holmes/44_slack_statefulset_logs/test_case.yaml (1 hunks)
- tests/llm/fixtures/test_ask_holmes/46_job_crashing_no_longer_exists/test_case.yaml (1 hunks)
- tests/llm/utils/constants.py (1 hunks)
🧰 Additional context used
🧠 Learnings (6)
tests/llm/fixtures/test_ask_holmes/44_slack_statefulset_logs/test_case.yaml (2)
Learnt from: Sheeproid
PR: robusta-dev/holmesgpt#586
File: tests/llm/fixtures/test_ask_holmes/03_what_is_the_command_to_port_forward/test_case.yaml:4-4
Timestamp: 2025-07-02T10:27:17.231Z
Learning: In LLM-as-judge test cases for HolmesGPT, expected outputs should be descriptive rather than prescriptive when testing for flexible responses like port numbers. Using specific values in expected outputs can cause unnecessary test failures when the AI generates different but equally valid responses.
Learnt from: nherment
PR: robusta-dev/holmesgpt#408
File: holmes/plugins/toolsets/kubernetes_logs.py:90-97
Timestamp: 2025-05-15T05:13:43.169Z
Learning: In the Kubernetes logs toolset for Holmes, both current and previous logs are intentionally fetched and combined for each pod, even though this requires more API calls. This design ensures all logs are captured even when pods restart but retain their name, providing complete diagnostic information.
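The descriptive-vs-prescriptive distinction from the first learning can be sketched in a test_case.yaml fragment (hypothetical content, not taken from the repo):

```yaml
# Prescriptive – brittle: fails whenever the model picks another valid port.
# expected_output: "Run: kubectl port-forward svc/frontend 8080:80"

# Descriptive – judged on substance, tolerant of equally valid answers.
expected_output: |
  The answer gives a kubectl port-forward command targeting the frontend
  service; any valid local port is acceptable.
```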
tests/llm/fixtures/test_ask_holmes/42_dns_issues_result_new_tools/test_case.yaml (1)
tests/llm/fixtures/test_ask_holmes/42_dns_issues_result_all_tools/test_case.yaml (1)
tests/llm/fixtures/test_ask_holmes/08_sock_shop_frontend/test_case.yaml (1)
tests/llm/fixtures/test_ask_holmes/42_dns_issues_result_old_tools/test_case.yaml (1)
tests/llm/utils/constants.py (1)
⏰ Context from checks skipped due to timeout of 90000ms (7)
- GitHub Check: build (3.12)
- GitHub Check: build (3.11)
- GitHub Check: build (3.10)
- GitHub Check: build (3.12)
- GitHub Check: build (3.10)
- GitHub Check: build (3.11)
- GitHub Check: llm_evals
🔇 Additional comments (6)
tests/llm/fixtures/test_ask_holmes/25_misconfigured_ingress_class/test_case.yaml (1)
2-4: Tags field looks correct and consistent with new taxonomy
YAML syntax, indentation, and values all comply with the updated `ALLOWED_EVAL_TAGS`. No further action required.

tests/llm/fixtures/test_ask_holmes/46_job_crashing_no_longer_exists/test_case.yaml (1)
2-3: Tag addition is valid
The new `logs` tag matches the updated constants; formatting is fine.

tests/llm/fixtures/test_ask_holmes/44_slack_statefulset_logs/test_case.yaml (1)
2-4: Tagging change accepted
Both `slackbot` and `misleading-history` are recognized tags, and YAML formatting is sound.

tests/llm/fixtures/test_ask_holmes/21_job_fail_curl_no_svc_account/test_case.yaml (1)
2-4: Tags correctly added
`misleading-history` and `k8s-misconfig` are now part of the allowed set; indentation is consistent.

tests/llm/fixtures/test_ask_holmes/22_high_latency_dbi_down/test_case.yaml (1)
4-5: Tag addition acknowledged
The `chain-of-causation` tag is valid; YAML is well-formed.

tests/llm/fixtures/test_ask_holmes/08_sock_shop_frontend/test_case.yaml (1)
2-3: Tag already covered
`logs` already exists in `ALLOWED_EVAL_TAGS`; no issues here. ✅
9d64fd4 to f043d18
No description provided.