Continuously improving SKILL.md and llms.txt based on real-world agent feedback
Context
byok-relay ships a skills/byok-relay/SKILL.md file (and llms.txt) to help AI coding agents discover and integrate the relay. The quality of these files directly affects whether agents pick byok-relay over alternatives like OpenRouter or LiteLLM.
The question raised in PR #2 review: how do we know if these files are actually working, and how do we improve them over time?
The problem
Right now we have no signal on:
- Whether agents are successfully discovering and using the skill
- Which trigger phrases cause agents to pick (or skip) byok-relay
- Whether the integration instructions produce working code on the first attempt
- Where agents get confused or produce incorrect integrations
Possible approaches (to evaluate later)
- Usage telemetry in the relay itself: if the relay logs a User-Agent or a custom header set by the SKILL.md instructions, we can infer how many integrations were agent-driven vs human-written
- Canary integration tests: a test suite that spins up an agent (Claude, GPT, Cursor), hands it the SKILL.md, asks it to integrate byok-relay into a sample app, and checks if the output actually works. Run on each SKILL.md change.
- Community feedback loop: a #integrations discussion thread or a structured issue template asking users to report if their agent integration worked or failed, with what agent/IDE
- A/B testing descriptions: try different frontmatter descriptions across a time window and measure skills.sh install counts as a proxy for agent discovery rate
- LLM self-evaluation: periodically ask a model to evaluate the SKILL.md against a rubric (clarity, completeness, trigger coverage) and flag regressions
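To make the usage-telemetry idea concrete, here is a minimal sketch of how logged request headers might be bucketed. It assumes SKILL.md tells agents to set a custom header on generated integrations; the header name `X-Integration-Source`, the value `skill-md`, and both function names are hypothetical and not part of byok-relay today.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping

# Hypothetical header and value: SKILL.md would have to instruct agents to
# add this to the integration code they generate. Neither exists in the relay.
SOURCE_HEADER = "x-integration-source"
SKILL_MD_VALUE = "skill-md"


def classify_request(headers: Mapping[str, str]) -> str:
    """Label one logged request as 'agent' or 'unknown' from its headers."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    value = normalized.get(SOURCE_HEADER, "").strip().lower()
    return "agent" if value == SKILL_MD_VALUE else "unknown"


def agent_share(logged_headers: Iterable[Mapping[str, str]]) -> Dict[str, float]:
    """Return the fraction of requests per label; empty dict if no requests."""
    counts = Counter(classify_request(h) for h in logged_headers)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


if __name__ == "__main__":
    sample = [
        {"X-Integration-Source": "skill-md"},
        {"User-Agent": "curl/8.0"},
    ]
    print(agent_share(sample))
```

Requests without the header land in "unknown" rather than "human", since a missing header only shows that the marker was not set, not who wrote the integration.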
Why parked for now
The active growth sprint is the current priority. This is worth revisiting once byok-relay has enough users that agent-driven integrations are a meaningful share of traffic.
Related
- skills/byok-relay/SKILL.md
- llms.txt