GitHub Agentic Workflows

Blog

Weekly Update – June 1, 2026

[!NOTE] This post references historical Effective Tokens (ET) metrics. gh-aw now uses AI Credits (AIC) as the primary cost metric.

It’s been a busy week in github/gh-aw! Five releases landed between May 28 and May 31, capped off by v0.77.4 — one of the biggest releases in recent memory. Here’s everything that shipped.

v0.77.4 was published on May 31st and packs in a ton of new capability.

  • Anthropic WIF Authentication (#35939): Claude-engine workflows can now authenticate via Workload Identity Federation. No more long-lived API key secrets stored in your repo — WIF handles it securely.

  • copilot-sdk Engine (#35936): A new engine: copilot-sdk frontmatter option gives workflows direct access to the Copilot SDK runtime, opening up new integration patterns.

  • aw.yml Manifest: Includes, Skills & Agents (#35778): Your repository manifest now supports includes, skills, and agents keys so you can compose and share workflow components across repos.

  • Per-Workflow 24-Hour Effective-Token Guardrail (#36042): A configurable token guardrail prevents runaway agent costs with enterprise-grade defaults and handy ET shorthand support.

  • search_commits in GitHub MCP Search Toolset (#36115): Agents can now search commits directly via the GitHub MCP search toolset.

  • New Skills: copilot-review and go-codemod (#36111, #36034): Two new skills help agents plan and address PR review feedback, and implement Go codemods for the gh aw fix command.

  • Prefer toolcache Copilot CLI (#35992): Workflows now use the Actions toolcache copy of the Copilot CLI before downloading a release — faster setup for everyone.
  • Reusable workflow timeout (#36107): timeout-minutes is now correctly passed through reusable workflow callers.
  • Threat-detection hardening (#36113): Missing prompt artifacts no longer block safe-output execution.
  • on.needs YAML strip (#35965): Processed on.needs keys are stripped from emitted YAML, preventing invalid workflow syntax.
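To make the per-workflow 24-hour guardrail from the list above a bit more concrete, here is a minimal sketch of a trailing-window token budget. It is not gh-aw's implementation, which this post does not describe; the function names, the budget values, and the sample run data are all assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone

def tokens_in_window(runs, now, window=timedelta(hours=24)):
    """Sum the tokens of runs that finished inside the trailing window."""
    cutoff = now - window
    return sum(tokens for finished_at, tokens in runs if finished_at > cutoff)

def guardrail_allows(runs, now, budget):
    """Return True while the trailing 24-hour total is still under the budget."""
    return tokens_in_window(runs, now) < budget

# Hypothetical run history for one workflow: (finish time, tokens used).
now = datetime(2026, 5, 31, 12, 0, tzinfo=timezone.utc)
runs = [
    (now - timedelta(hours=30), 9_000_000),   # older than 24 hours, ignored
    (now - timedelta(hours=20), 14_600_000),
    (now - timedelta(hours=2), 12_200_000),
]

used = tokens_in_window(runs, now)
print(used, guardrail_allows(runs, now, budget=25_000_000))
```

The point of a trailing window rather than a calendar-day counter is that a burst of runs just before and after midnight still counts against a single budget.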

v0.77.3 on May 29th brought sandbox improvements and better initialization:

  • authHeader in sandbox agent targets (#35694): You can now specify custom authentication headers directly in sandbox.agent.targets frontmatter.
  • gh aw init creates the Agentic Workflows custom agent (#35773): Running gh aw init now scaffolds a GitHub Copilot custom agent for Agentic Workflows right out of the box.
  • Stricter schema validation for workflow_call/workflow_dispatch (#35788): Unknown input keys are now rejected at compile time.
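As a rough picture of what rejecting unknown input keys at compile time means, the sketch below checks each declared input against an allow-list. The allow-list and the function name are assumptions for illustration; the schema gh-aw actually enforces may differ.

```python
# Illustrative allow-list, not gh-aw's actual schema.
ALLOWED_INPUT_KEYS = {"description", "required", "default", "type", "options"}

def unknown_input_keys(inputs, allowed=ALLOWED_INPUT_KEYS):
    """Return one error message per key that is not in the allow-list."""
    errors = []
    for name, spec in inputs.items():
        for key in sorted(set(spec) - allowed):
            errors.append(f"input '{name}': unknown key '{key}'")
    return errors

# A misspelled key ("requird") is caught instead of being silently ignored.
inputs = {"target": {"description": "Branch to analyze", "requird": True}}
errors = unknown_input_keys(inputs)
print(errors)
```

Failing the compile step on a typo like this is cheaper than discovering at run time that a required input was never enforced.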

Agent of the Week: api-consumption-report


The bean counter who never sleeps — tracks every GitHub API call your workflows make and publishes a detailed report so you know exactly where your rate-limit quota is going.

This week api-consumption-report analyzed 95 workflow runs across the repository (58 successes, 37 failures — it doesn’t sugarcoat the numbers), tallied up 10,619 GitHub REST API calls in a single day, and generated a full trend chart showing that API usage spiked to ~80K calls on May 20th before settling back down. It also uploaded five charts as release assets — a trend line, a heatmap, a per-workflow breakdown, a “burners” donut chart, and a workflow-level trend — then published the whole package as a GitHub Discussion for everyone to browse.

Hilariously, one of its recent runs completed in under 2 minutes with zero token usage and exactly one GitHub API call. Turns out that was the run where the cache hadn’t warmed yet: it took a look around, shrugged, and went home early.

Usage tip: Schedule this workflow weekly to catch runaway API consumption before you hit rate limits — the per-workflow breakdown makes it easy to spot which agent is hogging the quota.
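A per-workflow breakdown is, at its core, a grouped sum. The sketch below shows the idea with invented workflow names and call counts; none of the figures come from the actual report.

```python
from collections import Counter

def calls_per_workflow(records):
    """Total API calls per workflow, sorted from heaviest to lightest."""
    totals = Counter()
    for workflow, calls in records:
        totals[workflow] += calls
    return totals.most_common()

# Hypothetical (workflow, API calls in one run) records.
records = [
    ("workflow-a", 1200),
    ("workflow-b", 300),
    ("workflow-a", 800),
    ("workflow-c", 450),
]

breakdown = calls_per_workflow(records)
print(breakdown)
```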

View the workflow on GitHub

Upgrade to v0.77.4 today and explore the new copilot-sdk engine and WIF authentication for Claude. As always, feedback and contributions are welcome at github/gh-aw.

Agent of the Day – May 29, 2026

By the time an issue makes it into your backlog, someone already spent time writing it. The least you can do is make sure it gets read by the right person quickly. In practice, that rarely happens — unlabeled issues pile up, the search experience degrades, and the right engineer finds out about a relevant bug two sprints too late. Labeling sounds simple. Doing it consistently, at scale, without burning anyone’s afternoon, is the actual challenge.

That’s exactly the problem the Auto-Triage Issues workflow in gh-aw was built to solve.


Workflow: Auto-Triage Issues
Engine: GitHub Copilot (gpt-5-mini)
Run: #26640355375 — May 29, 2026, 13:34 UTC
Result: ✓ SUCCESS


Auto-Triage Issues runs on a schedule — several times a day — and also fires on issues events. Each pass, it reads through unlabeled GitHub issues, reasons about their content, and applies labels with a stated confidence level and rationale. No human in the loop. No queue to drain manually.

The agent runs behind an enabled squid-proxy firewall, with outbound access scoped to github.com and approved defaults. That constraint is intentional: triage doesn’t need the open internet, and limiting the blast radius of any agent is good practice regardless of what it’s doing.

Today’s midday run is a useful case study in how the workflow behaves under varying load.


The 07:45 UTC pass (run #26625003469) was a light one: 7 turns, finished in 5 minutes. A handful of issues to consider, quick classification, done. That’s what a steady-state workload looks like.

By 13:34 UTC, the picture was different. The agent completed 28 turns over 10 minutes — four times the conversational depth, twice the elapsed time. Same workflow, same model, same success result. The difference was the volume and complexity of what was waiting in the queue.

This matters because it shows the system isn’t just running a fixed script. The agent works through each issue and reasons about it, and the turn count reflects the work actually done. A heavier inbox produces a longer run, not a failure or a timeout.


Two issues received labels during the midday run:

| Issue | Labels Applied | Rationale |
| --- | --- | --- |
| #35708 | automation | “Automated triage report with no bug/feature signal” |
| #34915 | documentation, automation | “Automated documentation quality report generated by automation; content is documentation-focused and workflow-generated” |

Both calls were high-confidence. Issue #34915 is a good example of the multi-label path: the agent identified that the issue was both workflow-generated and documentation-focused, and applied both labels rather than forcing a single category. That kind of nuanced classification is where static regex-based approaches tend to fall short.


At the end of each run, the workflow doesn’t just apply labels and exit quietly. It creates — or updates — a GitHub Discussion titled [Auto-Triage Report] 2026-05-29, containing a Markdown table that summarizes every issue it classified: the issue number, the labels applied, confidence level, and the agent’s reasoning.
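The report described above is essentially a list of classification records rendered as a Markdown table. Here is a minimal sketch of that rendering step; the record field names and the function are assumptions, and the two sample rows reuse the issues from the midday run.

```python
def render_report(rows):
    """Render classification records as a Markdown table."""
    lines = [
        "| Issue | Labels | Confidence | Rationale |",
        "| --- | --- | --- | --- |",
    ]
    for row in rows:
        labels = ", ".join(row["labels"])
        lines.append(
            f"| #{row['issue']} | {labels} | {row['confidence']} | {row['rationale']} |"
        )
    return "\n".join(lines)

rows = [
    {
        "issue": 35708,
        "labels": ["automation"],
        "confidence": "high",
        "rationale": "Automated triage report with no bug/feature signal",
    },
    {
        "issue": 34915,
        "labels": ["documentation", "automation"],
        "confidence": "high",
        "rationale": "Automated documentation quality report generated by automation",
    },
]

report = render_report(rows)
print(report)
```

A real implementation would also need to escape any pipe characters in the rationale text so they do not break the table.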

That report serves two purposes. First, it’s auditable — a reviewer can open the Discussion and see exactly what the agent decided and why, without digging through logs. Second, it creates a natural place for human override: if a classification looks wrong, the context is right there to inform a correction.

Transparency in automated triage isn’t optional. Reviewers need to trust the output before they’ll stop second-guessing it.


The model choice here is deliberate. gpt-5-mini is fast and cost-effective for classification tasks where the signal is textual and the label set is bounded. You don’t need a heavyweight model to tell the difference between a documentation report and a bug report. Reserving larger models for tasks that actually need them — planning, synthesis, code generation — keeps the system efficient across a full day of scheduled runs.


If your repository is drowning in unlabeled issues, Auto-Triage is a pattern worth adopting. The workflow lives in github/gh-aw, alongside the rest of the agentic workflow library. The firewall configuration, the Discussion report pattern, and the label confidence output are all ready to fork and adapt.

Triage shouldn’t be a task anyone has to remember to do. It should just happen — correctly, consistently, and with a paper trail.

Agent of the Day – May 28, 2026

[!NOTE] This post references historical Effective Tokens (ET) metrics. gh-aw now uses AI Credits (AIC) as the primary cost metric.

Every codebase accumulates sediment. A helper function that made sense six months ago. A wrapper that lost its reason to exist after a refactor. Nobody deletes it on purpose — it just lingers. In Go, that lingering costs you: extra surface area to maintain, test coverage for code that does nothing new, and cognitive overhead for every engineer who reads the file.

The Dead Code Removal Agent is a scheduled GitHub Actions workflow that runs daily on the gh-aw repository. Its job is simple: find unused code, verify nothing breaks, and open a pull request. No human intervention required until review time.

On May 27, 2026, the agent completed run #100. Not a fanfare moment — just another daily run doing exactly what it was built to do. It finished in 11.4 minutes across 5 turns, consumed 14.6M effective tokens, and used 12 GitHub Actions minutes.

The target this time was NewValidationErrorWithLocation in pkg/workflow/workflow_errors.go. The function was a constructor wrapper around WorkflowValidationError — originally a convenience, but over time it became redundant as callers could initialize the struct directly. The agent identified it, confirmed it had no remaining callers, and started working.

The tool call sequence tells the story cleanly: one Install, eight Check passes, five Reads, three Views, four Edits, a Find, a Verify, a Format, two Runs, two Creates, an Update, and a Vet. That’s methodical, not mechanical. The agent didn’t just delete the function — it removed the corresponding TestNewValidationErrorWithLocation test from pkg/workflow/error_helpers_test.go and updated compiler_error_formatting_test.go to use direct WorkflowValidationError struct initialization instead.

Verification was thorough. Before opening the PR, the agent ran go build ./..., go vet ./..., go vet -tags=integration ./..., and make fmt. Everything passed. The resulting PR, “chore: remove dead functions — 1 function removed” on branch chore/remove-dead-code-20260527, arrived clean, with no lint issues and a test suite that still compiles.
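The verification sequence above can be pictured as a fail-fast pipeline: run each command in order and stop at the first failure. The sketch below is one way to express that, not the agent's actual code. The commands are the four named in this post, while the function names and the injectable runner are assumptions.

```python
import subprocess

# The four checks named in the post, in the order they were listed.
VERIFY_COMMANDS = [
    ["go", "build", "./..."],
    ["go", "vet", "./..."],
    ["go", "vet", "-tags=integration", "./..."],
    ["make", "fmt"],
]

def run_command(cmd):
    """Run one command and return its exit code."""
    return subprocess.run(cmd).returncode

def verify(commands=VERIFY_COMMANDS, runner=run_command):
    """Run commands in order; return the first failing command, or None."""
    for cmd in commands:
        if runner(cmd) != 0:
            return cmd
    return None
```

Calling verify() from the root of a Go module would run the real commands. Passing a custom runner makes the control flow testable without a Go toolchain.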

Zoom out a week and the picture gets more interesting. Across five runs in the last seven days, the agent logged:

  • 35.5 minutes total duration
  • 38.9M effective tokens
  • 38 GitHub Actions minutes
  • 21 turns across all five runs
  • 5 out of 5 high-confidence episodes

Run classification across that window: two normal runs, one risky, one failure, one in-progress. The failure and the risky classification matter as much as the successes. The agent doesn’t always find something safe to remove, and when it can’t complete cleanly, it doesn’t force a PR. That restraint is a feature, not a gap.
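For a sense of scale, the weekly totals above can be turned into per-run averages. This is a quick back-of-the-envelope sketch: it divides by all five runs even though one was still in progress, and it assumes run #100 falls inside the seven-day window, so treat the results as approximate.

```python
# Weekly totals and run #100 figures as reported in this post.
WEEK = {"runs": 5, "minutes": 35.5, "effective_tokens_m": 38.9, "turns": 21}
RUN_100 = {"minutes": 11.4, "effective_tokens_m": 14.6, "turns": 5}

def per_run_average(total, runs):
    """Average a weekly total over the number of runs, rounded to 2 places."""
    return round(total / runs, 2)

avg_minutes = per_run_average(WEEK["minutes"], WEEK["runs"])
avg_tokens_m = per_run_average(WEEK["effective_tokens_m"], WEEK["runs"])
avg_turns = per_run_average(WEEK["turns"], WEEK["runs"])

# Share of the week's effective tokens attributable to run #100.
token_share = round(RUN_100["effective_tokens_m"] / WEEK["effective_tokens_m"], 2)

print(avg_minutes, avg_tokens_m, avg_turns, token_share)
```

By this estimate run #100 was heavier than the weekly average in both duration and tokens, which fits a run that found something to remove and verified it end to end.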

Dead code removal is well-suited to an agent for a specific reason: the feedback loop is entirely mechanical. Does it build? Does go vet pass? Does the test suite still run? Those questions have definitive answers. The agent never has to speculate about intent — it just has to be rigorous about verification, which it is.

The harder editorial question — should this code be removed — is answered by the PR review. The agent does the investigation and the grunt work. Engineers do the judgment call. That division feels right.

There’s also something useful about the daily cadence. A function doesn’t become dead overnight. But catching it the morning after the last caller disappears, rather than six months later during a refactor, is the difference between a one-line deletion and an archaeology project.

If you’re curious about how the Dead Code Removal Agent is built, or if you want to run something similar against your own Go codebase, the workflow lives at github/gh-aw. The patterns here — schedule-triggered agents, structured verification steps, PR-as-output — are composable. Start there.

Run #100 was just another Wednesday. That’s the point.

Agent of the Day – May 27, 2026

[!NOTE] This post references historical Effective Tokens (ET) metrics. gh-aw now uses AI Credits (AIC) as the primary cost metric.

Every day, 236 agentic workflows run inside the gh-aw repository. Most complete quietly. A few fail in patterns worth tracking. And once a week, one workflow reads the entire fleet, scores it, and writes up what it found. That workflow is the Agent Performance Analyzer, and its run on May 27, 2026 produced the clearest signal in months.

Agent of the Day: Agent Performance Analyzer — Meta-Orchestrator


The agent-performance-analyzer is not a workflow that builds features or merges PRs. Its job is to watch everything else. On each scheduled run, it fans out across the full fleet of 236 workflows, scores each agent group across three dimensions (quality, effectiveness, and ecosystem health, each on a 0–100 scale), and surfaces what the aggregate data says about systemic health. Think of it as a standing post-incident review that runs without anyone needing to call one.

Run #26515287616, logged on May 27, ran for 10.7 minutes and processed 12.2 million effective tokens. Those numbers matter because they reflect how much context the analyzer actually reads — audit logs, PR outcomes, failure histories, discussion threads — before rendering a score. This is not a lightweight health check.

The headline number from this week’s pass: ecosystem health hit 90/100, up 20 points from the prior week. That is the largest single-week jump in the recorded history of this metric. It is also a number that demands interpretation, not celebration. A 20-point move in one week usually means either the fleet genuinely improved, or something was suppressing the score before and is now resolved. The weekly Discussion #35220 breaks down the contributing factors. Most of the lift came from the copilot-swe-agent merge rate recovery: the rate landed at 67%, up 6 percentage points week-over-week, with 6 merges on May 27 alone. Merge rate is an imperfect proxy for workflow effectiveness, but 67% across a fleet this size is a meaningful signal.

The top performers bear out that story. Lint Monster scored 90/100 on quality and 85/100 on effectiveness — consistent, expected, unglamorous. copilot-swe-agent followed at 88/100 quality and 84/100 effectiveness. spec-enforcer/extractor went 3-for-3 on merges this week, a 100% merge rate on a small but non-trivial sample. These are the parts of the fleet holding their line.

Quality, though, is flat. 74/100 for the fourth consecutive week. A plateau at week four is no longer noise. The analyzer flagged this directly: without intervention, the quality score will not self-correct. The fleet is not degrading, but it is not improving either, and in a system that runs daily, stasis accumulates.

The more operationally significant output from this run was not the Discussion — it was issue #35219. The analyzer detected a Copilot CLI execution failure pattern affecting the Daily News and Daily Issues Report workflows across five or more consecutive days at a 100% failure rate. A workflow failing once is noise. Failing every day for a week is infrastructure. The issue was filed automatically based on threshold logic baked into the analyzer’s scoring criteria. No human had to notice the pattern.
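The threshold logic described above can be sketched as a streak check: count how many of the most recent days failed every run, and file an issue once the streak reaches a minimum length. The function names and the five-day default are assumptions drawn from the wording of this post, not from the analyzer's source.

```python
def consecutive_failing_days(daily_failure_rates):
    """Count the trailing days whose failure rate is 100% (1.0)."""
    streak = 0
    for rate in reversed(daily_failure_rates):
        if rate < 1.0:
            break
        streak += 1
    return streak

def should_file_issue(daily_failure_rates, min_days=5):
    """True once the workflow has failed every run for min_days in a row."""
    return consecutive_failing_days(daily_failure_rates) >= min_days

# Hypothetical daily failure rates, oldest first.
broken = [0.2, 1.0, 1.0, 1.0, 1.0, 1.0]
flaky = [1.0, 1.0, 1.0, 1.0, 0.5]
print(should_file_issue(broken), should_file_issue(flaky))
```

Requiring a full streak rather than a high average is what separates a one-off failure from a pattern that points at infrastructure.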

Three other systemic issues surfaced in Discussion #35220. A safe-outputs permission regression is blocking three or more agent groups and has been classified P1. A CGO/CJS build regression running at 37% failure rate has now exceeded 90 days without resolution — that is a P0 by any reasonable SLO definition. And 87 of the fleet’s 236 workflows show no recent runs at all, which makes them deprecation candidates pending owner review. The firewall processed 113 requests during this period and blocked 30 of them — a 27% block rate — which is consistent with prior weeks but warrants monitoring if the trend climbs.
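The percentages in this paragraph follow directly from the raw counts. A short worked check, using only the numbers quoted above (the helper name is just for illustration):

```python
def percent(part, whole):
    """Express part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)

firewall_block_rate = percent(30, 113)   # blocked requests out of all requests
idle_workflow_share = percent(87, 236)   # workflows with no recent runs

print(firewall_block_rate, idle_workflow_share)
```

So roughly one request in four was blocked, and more than a third of the fleet showed no recent runs.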

The value of a meta-orchestrator is not that it prevents incidents. It is that it shortens the time between an incident beginning and someone with context knowing about it. Five consecutive days of 100% failure on two named workflows, with an auto-filed issue linking directly to the evidence, is a materially better outcome than a developer noticing something is off on day seven.


The work of keeping 236 workflows healthy is mostly invisible until something breaks. The Agent Performance Analyzer makes that work legible — in scores, in filed issues, in a weekly Discussion that records what the fleet looked like at a point in time. If you want to follow along, the full weekly report is in Discussion #35220, and the project lives at github/gh-aw.

Agent of the Day – May 26, 2026

[!NOTE] This post references historical Effective Tokens (ET) metrics. gh-aw now uses AI Credits (AIC) as the primary cost metric.

Every morning someone at GitHub opens their laptop and wonders: how well did the coding agents do yesterday? Did they ship? Did they stall? Did they create more work than they saved? These questions used to require manual spelunking through dashboards, cross-referencing merged PRs with author names, and guessing at patterns from vibes alone.

Not anymore.

Agent of the Day: Copilot Agent PR Analysis


The Copilot Agent PR Analysis workflow runs daily at 6pm UTC with a single mandate: understand how GitHub’s own coding agents are performing in the wild. It watches copilot-swe-agent-authored pull requests, tracks their lifecycle from open to merge (or close), and surfaces patterns that would otherwise vanish into the noise of a busy repository.

Run 26415065259 on May 25th tells the story. Six minutes. Nineteen agent turns. Nearly a million tokens processed. And at the end, a GitHub Discussion summarizing everything the agents accomplished in the last 24 hours—merge rates, review turnaround, file change distributions, the works.

Workflow activity chart

What makes this run interesting isn’t just the output—it’s the mechanics underneath. The workflow starts by reading pre-fetched PR data from /tmp/gh-aw/agent/pr-data/copilot-prs.json, a file populated by an earlier step that batches GitHub API calls. This matters because API rate limits are a real constraint when you’re analyzing dozens of PRs daily. By front-loading the data fetch, the Claude Opus 4.7 model can focus on analysis rather than pagination logistics.
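To illustrate the pre-fetch pattern, the sketch below loads a JSON file of pull requests once and computes a merge rate from it, with no further API calls. The file schema is an assumption: the sample uses the state and merged_at fields found on GitHub REST API pull request objects, but the real copilot-prs.json may be shaped differently.

```python
import json

def load_prs(path):
    """Read the pre-fetched pull request list from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)

def merge_rate(prs):
    """Fraction of closed PRs that were merged, or None if none are closed."""
    closed = [pr for pr in prs if pr["state"] == "closed"]
    if not closed:
        return None
    merged = sum(1 for pr in closed if pr.get("merged_at"))
    return merged / len(closed)

# Hypothetical sample data standing in for the file contents.
sample = json.loads("""[
  {"number": 1, "state": "closed", "merged_at": "2026-05-25T10:00:00Z"},
  {"number": 2, "state": "closed", "merged_at": null},
  {"number": 3, "state": "open", "merged_at": null}
]""")

rate = merge_rate(sample)
print(rate)
```

In the workflow itself, something like load_prs("/tmp/gh-aw/agent/pr-data/copilot-prs.json") would replace the inline sample.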

From there, the agent orchestrates across 16 different tool types. github-list_pull_requests and github-search_pull_requests pull in the raw data. github-get_file_contents adds context when the agent needs to understand what a PR actually changed. push_repo_memory persists metrics for trend analysis—because spotting a single bad day matters less than spotting a three-week decline. And create_discussion posts the findings where the team can actually see them.

The token economics tell their own story. The run consumed 947,148 tokens, while cache reads accounted for over 3 million effective tokens, a 63% hit rate. That’s not an accident. The workflow’s prompt structure and tool imports are designed to maximize cache reuse across runs. At $1.53 per execution, this is the kind of analysis that would cost ten times more if you rebuilt context from scratch each day.

Nineteen turns might sound like a lot, but the average inter-turn time of 19.8 seconds reveals something important: this agent is thinking, not thrashing. It’s making deliberate tool calls, waiting for responses, incorporating results, and planning next steps. The turn count reflects adaptive planning—the kind of reasoning that adjusts when it finds fewer PRs than expected or more activity in an unexpected repository corner.
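As a sanity check, the turn count and the average inter-turn time should roughly reproduce the six-minute run duration. The sketch assumes the 19.8 seconds is measured between consecutive turns, so 19 turns have 18 gaps; that interpretation is a guess, not something the run data confirms.

```python
def estimated_minutes(turns, avg_gap_seconds):
    """Estimate run duration from the gaps between consecutive turns."""
    gaps = turns - 1
    return gaps * avg_gap_seconds / 60

estimate = round(estimated_minutes(19, 19.8), 1)
print(estimate)
```

The estimate lands close to the reported six minutes, so the figures are at least consistent with each other.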

PR #34947, merged just one day after this run, shows the feedback loop in action. Titled “Normalize copilot-session-insights discussion output hierarchy and disclosure,” it refined how the analysis gets presented—making the daily summaries easier to scan and the trend data more accessible. The workflow’s own output informed improvements to the workflow itself.

This is what continuous observability looks like for AI systems. Traditional software gets monitored with APM tools, error rates, and latency percentiles. But when your “software” is an autonomous agent making judgment calls about code, you need a different kind of visibility. You need to know: are the agents getting better at writing tests? Are they over-indexing on certain file types? Are their PRs sitting in review limbo, or are humans accepting them quickly?

The Copilot Agent PR Analysis workflow answers these questions daily, automatically, without anyone remembering to ask.


Curious about building workflows that watch your workflows? Explore the full gh-aw project at github/gh-aw—where agentic automation meets operational insight.