
Profit-as-a-Service

Software-as-a-Service gave us recurring revenue. Platform-as-a-Service gave us managed infrastructure. The next abstraction is obvious in hindsight: Profit-as-a-Service.

The Pattern
#

Every major shift in the software industry has been about abstracting away complexity.

On-premise servers became cloud infrastructure. Custom software became SaaS subscriptions. Manual operations became automation.

Each layer removes something the customer previously had to manage themselves. Each layer turns a cost center into a service. Each layer makes the underlying complexity someone else’s problem.

Profit-as-a-Service is the logical conclusion of this trend. Instead of selling software that helps you make money, you sell the money-making itself.

What It Looks Like
#

A traditional SaaS company sells you a tool. You still need to figure out how to use it to generate revenue.

A Profit-as-a-Service company sells you an outcome. The tool is invisible. The process is invisible. You pay for the result.

Examples already exist:

  • Algorithmic trading platforms that let you deploy capital and collect returns without understanding the underlying strategies.
  • AI-powered dropshipping services that handle product selection, listing, pricing, and fulfillment while you provide the capital.
  • Automated content farms that generate SEO-optimized articles, monetize them with ads, and split the revenue with you.
  • AI sales agents that prospect, qualify, and close deals on your behalf, charging only a commission on closed revenue.
  • Robo-advisors that manage investment portfolios algorithmically, charging a fraction of traditional advisory fees.

In each case, the customer provides capital or data. The service provides everything else. Revenue is shared, not subscribed to.

Why Now
#

Three forces are converging to make Profit-as-a-Service viable at scale.

1. AI can now execute complex business processes autonomously
#

Large language models can write marketing copy, analyze markets, negotiate with customers, and generate code. Multi-agent systems can coordinate these capabilities into end-to-end business workflows. What previously required a team of specialists can now be done by a single API call chain.

2. Infrastructure is fully commoditized
#

Starting a business used to require incorporating, hiring, building software, and navigating regulations. Now you can spin up a fully operational e-commerce store in an afternoon using off-the-shelf tools. The barrier to entry for almost every digital business has collapsed to near zero.

3. Distribution is algorithmic
#

Social media algorithms, search engines, and ad platforms have made customer acquisition a math problem. If you can compute the expected value of an impression, you can acquire customers profitably at scale. AI is better at this math than humans.

The Economics
#

The unit economics of Profit-as-a-Service are deceptively simple.

The service provider operates a system that generates revenue. The customer provides the input: capital, data, or access to a market. Revenue is split between the provider and the customer.

For the customer, this is attractive because:

  • No upfront investment in tools or expertise.
  • Risk is aligned (the provider only makes money if you make money).
  • Scalability is built in (the system can run 24/7 without human intervention).

For the provider, this is attractive because:

  • Revenue scales with customer success (no ceiling from fixed subscriptions).
  • Customer acquisition is easier (pay-for-performance is an easy sell).
  • Margins improve with scale (the same system serves many customers).

The critical metric is the spread between what the system generates and what it costs to operate. If a system generates $10,000/month per customer and costs $2,000/month to run, the $8,000 spread can be split down the middle: the provider keeps $4,000 and the customer keeps $4,000. Both parties are happy. Operating costs grow roughly linearly with the number of customers, while the provider captures a share of every dollar the shared system generates.
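The split is simple arithmetic; a minimal sketch, where the 50/50 split and the dollar figures are just the example numbers from the text:

```python
def monthly_split(revenue: float, cost: float, provider_share: float = 0.5):
    """Split the monthly spread (revenue minus operating cost) between
    provider and customer."""
    spread = revenue - cost
    provider = spread * provider_share
    customer = spread - provider
    return provider, customer

# The example from the text: $10,000 generated, $2,000 to operate.
provider, customer = monthly_split(10_000, 2_000)
# provider == 4000.0 and customer == 4000.0
```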

The Risks
#

Profit-as-a-Service is not without significant risks.

Alignment risk
#

When you outsource profit generation, you also outsource decision-making. The provider’s optimization target may diverge from yours. A trading algorithm might maximize short-term returns at the cost of long-term stability. A content farm might optimize for clicks at the cost of your brand reputation.

Concentration risk
#

If the service provider goes down, your revenue goes to zero. There is no fallback when the entire business process is a black box. Diversification across multiple providers becomes essential but difficult.

Regulatory risk
#

Financial services are heavily regulated. When you abstract away the details of money-making, you may also abstract away compliance. The line between “software service” and “financial service” is thin and getting thinner.

Race to the bottom
#

When the barrier to entry is low, competition drives margins toward zero. The first provider to offer automated dropshipping captures most of the value. The hundredth provider competes on price until nobody makes money.

Dependency trap
#

Once a business relies on Profit-as-a-Service, transitioning away is extremely difficult. The provider owns the process, the data, and the relationships. Switching costs are not just technical but operational and strategic.

The Spectrum
#

Profit-as-a-Service is not binary. It exists on a spectrum of how much of the profit-generating process is abstracted away.

Level 0: You do everything. You buy tools and operate them yourself. Traditional software.

Level 1: AI assists you. You use AI tools to accelerate specific tasks. Copilots and assistants.

Level 2: AI operates under your supervision. You define strategy, AI executes tactics. Current state of most AI-powered businesses.

Level 3: AI runs the process, you provide direction. You set constraints, AI handles the rest. Emerging today.

Level 4: AI runs everything, you provide capital. You fund the operation, AI decides how to deploy it. Algorithmic trading, robo-advisors.

Level 5: AI runs everything, including capital allocation. Fully autonomous business entities. Not yet real, but the trajectory is clear.

Most businesses today operate at Level 0 or 1. The opportunity is in moving up the spectrum.

Who Wins
#

The winners in the Profit-as-a-Service economy will be:

Platform providers who own the infrastructure that makes PaaS possible. If you operate the AI system that generates profit for thousands of businesses, you capture a slice of every dollar earned.

Capital providers who can efficiently allocate resources across multiple PaaS offerings. When starting a business requires no expertise, access to capital becomes the primary competitive advantage.

Specialists who can handle the edge cases that AI cannot. Regulatory compliance, strategic partnerships, and brand building remain human-dominated activities.

Arbitrageurs who identify PaaS opportunities before they become commoditized. The window between “novel” and “saturated” is shrinking, but the rewards for being early are enormous.

Who Loses
#

The losers will be:

Middlemen whose value proposition was information asymmetry. When AI can access and process all available information, intermediaries add no value.

Consultants who sold expertise that can now be codified. If your consulting practice can be reduced to a prompt, it will be.

SaaS companies that failed to move up the value chain. A CRM that helps you track sales will lose to an AI that makes sales for you.

Employees whose jobs consisted of executing repeatable business processes. This is the uncomfortable truth that applies across industries.

The Question
#

Profit-as-a-Service is not a prediction. It is a description of what is already happening.

The question for any business is not whether this trend will affect you. The question is whether you will be the provider or the customer.

If your business processes can be automated by AI, they will be. If you are the one automating them, you are the PaaS provider. If you are the one whose processes are being automated, you are about to become a customer.

Choose wisely.


The Future of Code Review: How AI Makes Human Review Obsolete

The traditional code review is dying. Not because review is unnecessary, but because human review is becoming unnecessary.

For decades, code review has been a cornerstone of software engineering. Pull requests, peer reviews, endless rounds of comments - this is how we’ve maintained code quality. But AI is fundamentally changing this equation. The question isn’t whether AI will replace human code review, but how quickly.

The Problem With Human Code Review
#

Human code review suffers from three fatal flaws:

  1. It doesn’t scale. As teams grow, review backlogs grow. Developers wait days or weeks for feedback, blocking progress.

  2. It’s inconsistent. Reviews vary wildly based on who’s reviewing, when they’re reviewing, and how much coffee they’ve had.

  3. It’s expensive. Senior engineers spend significant time reviewing code instead of building features. In high-cost engineering organizations, this is millions of dollars annually.

The alternative - no reviews - isn’t viable either. Shipping unreviewed code is a recipe for security vulnerabilities, bugs, and technical debt.

The AI Solution: Software Factories
#

A new paradigm is emerging: the Software Factory. Instead of humans reviewing code, AI systems verify, test, and heal code autonomously. This isn’t theoretical; it’s already running in production.

Consider StrongDM’s Software Factory, which launched in July 2025. A team of three people (CTO, senior manager, and a new hire) built a system that autonomously clones core products like Okta, Jira, and Slack. Their target: $1,000 per day per engineer in tokens. Sounds expensive until you realize what they’re shipping.

How It Works
#

The Software Factory approach replaces human review with six verification layers:

1. Competitive Generation
#

Instead of one AI writing code, run three in parallel, like a slot machine with three reels spinning, each generating a different implementation. An automated evaluator selects the best output based on passing tests, minimal diffs, and fewer dependencies. The cost of running three models is trivial compared to the quality gain.
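A toy sketch of the selection step, assuming a simple additive scoring rule; the weights here are illustrative, not StrongDM's actual evaluator:

```python
# Each candidate carries the signals the evaluator cares about:
# passing tests, diff size, and dependency count.

def score(candidate: dict) -> float:
    # More passing tests dominate; smaller diffs and fewer deps break ties.
    return (candidate["tests_passed"] * 100
            - candidate["diff_lines"]
            - candidate["new_dependencies"] * 10)

def select_best(candidates: list[dict]) -> dict:
    return max(candidates, key=score)

candidates = [
    {"name": "a", "tests_passed": 12, "diff_lines": 80, "new_dependencies": 2},
    {"name": "b", "tests_passed": 12, "diff_lines": 40, "new_dependencies": 0},
    {"name": "c", "tests_passed": 10, "diff_lines": 20, "new_dependencies": 0},
]
best = select_best(candidates)
# best["name"] == "b": same tests passed as "a", but a smaller, cleaner diff
```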

2. Iterative Refinement (Trycycle)
#

Define the problem, write a plan, ask “is it perfect?” If not, try again. Implement, then ask “is it perfect?” If not, try again. This loop can run autonomously for hours.
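The loop can be sketched as a bounded retry, with the LLM judge and reviser replaced by plain callables:

```python
# Minimal "is it perfect?" refinement loop. In the real pattern, both
# is_perfect and improve would be LLM calls; max_iterations bounds how
# long the loop may run autonomously.

def refine(draft, is_perfect, improve, max_iterations=10):
    for _ in range(max_iterations):
        if is_perfect(draft):
            return draft
        draft = improve(draft)
    return draft  # best effort once the iteration budget is spent

# Toy example: "improve" until the draft reaches a target value.
result = refine(0, is_perfect=lambda d: d >= 3, improve=lambda d: d + 1)
# result == 3
```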

3. Scenario-Based Validation
#

Treat scenarios like machine learning holdout sets. Store acceptance scenarios outside the codebase, then have an LLM judge whether the implementation satisfies them. Instead of a binary “tests pass,” you get a probabilistic satisfaction score measuring trajectory toward success.
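A minimal sketch, with a stand-in `judge` in place of the LLM call; the scenario texts and per-scenario scores are invented for illustration:

```python
# Scenarios live outside the codebase; a judge returns a satisfaction
# probability per scenario, and the aggregate is a score rather than a
# binary pass/fail.

def satisfaction_score(scenarios, judge):
    verdicts = [judge(s) for s in scenarios]  # each in [0.0, 1.0]
    return sum(verdicts) / len(verdicts)

scenarios = ["user can log in",
             "session expires after 30m",
             "audit log written"]
fake_judge = {"user can log in": 1.0,
              "session expires after 30m": 0.5,
              "audit log written": 1.0}.get
score = satisfaction_score(scenarios, fake_judge)  # roughly 0.83
```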

4. Observability and Healing
#

Every interaction in the system gets logged to an execution database (CXDB). A Healer agent monitors this database, clusters similar problems, and writes prescriptions. No human bug reports. No human triage. Just autonomous detection and fixing.
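A hedged sketch of the clustering step; the log format, the `signature` normalization, and the "prescription" shape are assumptions for illustration, not CXDB's actual schema:

```python
import re
from collections import Counter

def signature(message: str) -> str:
    # Normalize volatile details (ids, durations) so similar failures
    # land in the same cluster.
    return re.sub(r"\d+", "N", message)

def top_prescription(failures: list[str]) -> tuple[str, int]:
    # Return the signature of the largest cluster and its size, i.e.
    # the problem a Healer agent would tackle first.
    clusters = Counter(signature(f) for f in failures)
    return clusters.most_common(1)[0]

failures = ["timeout after 30s on job 17",
            "timeout after 45s on job 99",
            "null pointer in parser"]
sig, count = top_prescription(failures)
# sig == "timeout after Ns on job N", count == 2
```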

5. Digital Twin Testing
#

Clone critical external dependencies - Google Sheets, Slack, Jira, authentication providers. Build replicas that AI agents cannot distinguish from production. Test in the most realistic environment possible without touching actual production systems.

6. Adversarial Verification
#

Separate the coding agent from the verification agent. Have a third agent actively try to break what was built. Enforce that verification criteria are authored before code, not after. This prevents the common anti-pattern of writing tests to match implementation rather than specify behavior.

The Economic Case
#

This approach isn’t just about quality; it’s about economics.

StrongDM’s team reports seeing 21% more tasks completed and 98% more pull requests shipped. One case study: Nubank migrated an eight-year ETL monolith (six million lines, one thousand engineers) from an 18-month timeline to weeks. That’s an 8-12x efficiency improvement with 20x cost savings.

The breakeven point is 50-70 pull requests per month. Above that threshold, the AI factory costs less than human review while delivering higher velocity.

The Five-Level Evolution
#

Dan Shapiro at StrongDM describes five levels of AI-assisted development:

Level 0: Manual (you write everything)

Level 1: Offloading discrete tasks (ChatGPT for regex)

Level 2: AI-native tools (90% of “AI” developers today)

Level 3: Human-in-the-loop manager (reviewing all code)

Level 4: You’re now a product manager (spec, argue, craft skills, check tomorrow)

Level 5: Dark Factory (black box: specs → software)

Most teams today are stuck at Level 2 or 3. The future is Levels 4 and 5, where humans specify constraints and AI handles everything else.

What This Means For Engineers
#

Your job is not to write code. Your job is not to read code. Your job is to solve quality problems so the factory can run.

This requires a philosophical shift. Instead of “how do I review this code?” ask “why am I doing this?” If you can describe what’s wrong, you can automate it. If you can describe what quality looks like, you can codify it as a verification rule.

The era of human-centric development is ending. AI compilers will transform specifications into deployed software. Teams will consist of AI business analysts, AI DevOps engineers, AI QA specialists, and AI compilers.

The Implementation Path
#

This doesn’t happen overnight. A realistic 16-week roadmap:

Weeks 1-2: Install Trycycle skill and competitive generation pattern

Weeks 3-4: Build execution database and scenario framework

Weeks 5-6: Implement Healer agent for autonomous fixes

Weeks 7-8: Build digital twins for critical integrations

Weeks 9-12: Migrate existing codebase to spec-driven workflow

Weeks 13-16: Fine-tuning and compounding learning

The results compound. Weeks 1-4 might yield 5-10x velocity on simple tasks. By week 16, teams report 2x monthly speed improvements with compounding gains continuing.

Open Questions
#

This isn’t a solved problem. Key research questions remain:

  • Is $1,000/day per engineer in tokens realistic or an outlier?
  • How do patterns learned in one domain transfer to another?
  • When does fine-tuning cost outweigh compounding benefits?
  • What specific patterns trigger unavoidable human review?
  • How do you validate security without human review of auth logic?

These aren’t blockers; they’re opportunities for teams to pioneer solutions.

The End of an Era
#

The traditional code review served us well. But like manual testing before automated test suites, its time is passing. The future isn’t humans reviewing AI code. It’s humans designing verification systems that make human review unnecessary.

The question isn’t whether your team will adopt this approach. The question is whether your competitors will adopt it first.

References
#

Primary Sources

StrongDM Software Factory. Production deployment of autonomous code generation systems. https://factory.strongdm.ai/

Kilroy. Open-source CLI for running StrongDM-style pipelines locally. https://github.com/danshapiro/kilroy

Attractor. Provider-agnostic framework for autonomous code generation. https://github.com/strongdm/attractor

Trycycle. Simple skill for iterative refinement loops. https://github.com/danshapiro/trycycle

Devin. AI-powered software engineering platform with fine-tuning capabilities. https://devin.ai/

Analysis and Commentary

Willison, Simon. “Software Factory.” https://simonwillison.net/2026/Feb/7/software-factory/

Shapiro, Dan. “The Five Levels.” https://www.danshapiro.com/blog/2026/01/the-five-levels/

Shapiro, Dan. “Dark Factories.” https://www.danshapiro.com/blog/2026/03/dark-factories/

Shapiro, Dan. “You Don’t Write the Code.” https://www.danshapiro.com/blog/2026/02/you-dont-write-the-code/

LukePM. “The Software Factory.” https://lukepm.com/blog/the-software-factory/

Latent Space. “Reviews Dead.” https://www.latent.space/p/reviews-dead

Case Studies

Nubank ETL Migration. Eight-year monolith migrated in weeks using Devin. https://devin.ai/

StrongDM Security Software. Autonomous cloning of Okta, Jira, and Slack. https://factory.strongdm.ai/


The Survival of the Smartest: Exploring Darwinian Software Development

Prompt
#

Write an article about Darwinian software development, the use of LLM swarms to identify the best solution to a problem.

Answer
#

In the traditional world of coding, software is “built.” An architect designs a blueprint, and developers lay the bricks. But in the era of Generative AI, we are moving toward a paradigm where software is evolved.

Darwinian Software Development is an emerging methodology that leverages “swarms” of Large Language Models (LLMs) to iterate, compete, and select the most efficient code through a process mirroring natural selection.

How the Swarm Works
#

Instead of asking a single AI to write a function, Darwinian development deploys a multi-agent system—a swarm—to tackle the same problem from different angles. This process generally follows four biological phases:

  • Variation (The Mutation): Multiple LLM agents are given the same prompt but with different “temperaments” or system instructions. One might prioritize memory efficiency, another readability, and a third raw speed.
  • Competition (The Struggle for Life): The agents generate dozens of potential solutions. These aren’t just checked for syntax; they are put into a “sandbox” to run against unit tests and edge cases.
  • Selection (Fitness Function): An automated “Judge” agent (or a specialized testing suite) evaluates the outputs. Only the code that passes the most tests with the lowest latency or resource consumption “survives.”
  • Heredity (Refinement): The winning code is then used as the “DNA” for the next generation. The swarm takes the best-performing snippet and attempts to optimize it further until a “perfect” solution is reached.
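The four phases above map onto a classic evolutionary loop. A toy sketch in which "solutions" are numbers and fitness is distance to a target, purely to show the control flow; real swarms mutate code with LLMs, not floats:

```python
import random

def evolve(fitness, seed, generations=20, pop_size=8, rng=random.Random(0)):
    best = seed
    for _ in range(generations):
        # Variation: mutate the current best into a population of candidates.
        population = [best + rng.uniform(-1, 1) for _ in range(pop_size)]
        # Competition + selection: keep the fittest candidate.
        champion = max(population, key=fitness)
        # Heredity: a strictly fitter winner parents the next generation.
        if fitness(champion) > fitness(best):
            best = champion
    return best

target = 42.0
best = evolve(lambda x: -abs(x - target), seed=0.0)
# best moves from 0 toward 42 as generations proceed
```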

Why Swarms Outperform Single Models
#

The “Darwinian” approach solves the two biggest headaches in AI-assisted coding: hallucinations and tunnel vision.

  • Error Correction: If one agent introduces a bug, the likelihood that five other agents will make the exact same error is low. The swarm acts as a self-healing mechanism.
  • Diverse Architectures: One LLM might get stuck on a specific logic path. A swarm explores the entire “fitness landscape,” finding creative solutions a human (or a single AI) might never consider.
  • Performance Benchmarking: By racing agents against each other, the system naturally optimizes for performance.

| Feature | Traditional Development | Single LLM Coding | Darwinian Swarm |
| --- | --- | --- | --- |
| Speed | Slow (Human-paced) | Very Fast | Fast (Iterative) |
| Reliability | High (Human-vetted) | Medium (Hallucination risk) | Very High (Cross-vetted) |
| Optimization | Manual | Basic | Automated/Evolutionary |

The “Fitness Function” Problem
#

The biggest challenge in Darwinian development isn’t the AI—it’s the criteria. For natural selection to work, the environment must be rigorous. If your “fitness function” (the tests you use to judge the code) is poorly written, the AI will evolve to pass the test while still being technically broken. This is known as reward hacking. To counter this, modern swarms often include “Adversarial Agents” whose only job is to write difficult unit tests to try and “kill” the code generated by the other agents.

The Future: Software That Self-Improves
#

We are approaching a point where software won’t be static. Imagine a server that monitors its own performance and, upon detecting a bottleneck, spawns a Darwinian swarm to rewrite its own inefficient modules in real-time. In this new world, the developer’s role shifts from writer to naturalist. You aren’t typing the lines; you are defining the environment in which the best code can grow.


What Changes When Coding Agents Are Infinitely Parallel

Prompt
#

Imagine you can run hundreds/thousands of coding agents in parallel. How would you go about using them?

Answer
#

This is one of those questions where the interesting answer isn’t “do everything faster” but rather what fundamentally changes when parallelism is cheap.


The core shift: from sequential exploration to parallel search

When you have one agent, you think carefully before acting — it’s expensive to go down dead ends. With thousands, the calculus flips: exploration becomes cheap, convergence becomes the hard problem. Your job shifts from “what should I try?” to “how do I synthesize thousands of results?”


High-leverage use patterns

Speculative execution on decisions — At every architectural fork (“should this be a queue-based system or polling?”), you don’t decide, you branch. Two fleets of agents build both, and you evaluate the result. This is huge for situations where you genuinely don’t know which approach is better until you’ve tried it.

Mutation testing at scale — Spin up hundreds of agents making targeted changes to a codebase, each trying a different hypothesis about why a bug exists or how to optimize something. Like fuzzing, but semantically directed.

Full-stack consistency checking — Have agents simultaneously hold the contract between every pair of services in your system, constantly verifying that implementations match specs, that error handling is symmetric, that naming is consistent. Things that fall through the cracks in sequential review.

Competitive benchmarking of approaches — For a problem like network latency or caching strategy, you could have 50 agents implement 50 different approaches against the same test harness, and just… pick the winner. No reading papers and reasoning about tradeoffs — empirically settle it.

Living documentation — Agents continuously reconcile docs, comments, and code. Every PR triggers agents that check for documentation drift, update runbooks, and surface inconsistencies.


The harder problems this creates

Synthesis is the bottleneck. If 1000 agents each produce a PR, you’re back to being the serial bottleneck reviewing them. You need meta-agents whose job is to evaluate and rank the output of other agents, with clear scoring functions.
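One way to keep synthesis from becoming a serial review queue is a tournament reduction: pairwise comparisons by a scoring meta-agent until a single winner remains. A sketch, with the meta-agent replaced by a plain comparator:

```python
def tournament(outputs, better):
    """Reduce many candidate outputs to one via pairwise comparisons.
    `better(a, b)` stands in for a meta-agent's judgment."""
    round_ = list(outputs)
    while len(round_) > 1:
        nxt = []
        for i in range(0, len(round_) - 1, 2):
            a, b = round_[i], round_[i + 1]
            nxt.append(a if better(a, b) else b)
        if len(round_) % 2:  # odd candidate out gets a bye
            nxt.append(round_[-1])
        round_ = nxt
    return round_[0]

# Toy example: "quality" is an integer attached to each candidate PR.
prs = [("pr-%d" % i, q) for i, q in enumerate([3, 9, 1, 7, 5])]
winner = tournament(prs, better=lambda a, b: a[1] > b[1])
# winner == ("pr-1", 9)
```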

State and conflict. Agents working in parallel on the same codebase will conflict. You probably want them working in isolated sandboxes (branches, ephemeral clusters, test environments) with a merge/tournament layer on top.

Task decomposition quality matters more, not less. Bad task specs mean 1000 agents going confidently in the wrong direction simultaneously. The skill of writing tight, evaluable task specs becomes enormously valuable.

Evaluation functions become load-bearing. “Did the agent succeed?” needs a concrete, automated answer — you can’t eyeball 1000 outputs. This pushes you toward test-driven development in a serious way.


What this implies for infrastructure work

The most compelling application is parallel experimentation on real infrastructure parameters. Spin up ephemeral environments, have agents try different configurations or scheduling strategies, measure actual performance, tear them down. The thing that makes infrastructure hard is that you normally can’t afford to run 50 experiments simultaneously. With cheap parallel agents managing the scaffolding, that constraint disappears.

The practical limit ends up being compute and money, not ideas — which is a fundamentally different world than the one most engineering workflows were designed for.



Continuous learning

  • Collect articles shared on slack that may be relevant to read (manual)
  • Identify tools and resources that can help improve my skills (manual)
  • Set aside dedicated time for learning each day (manual)
  • Experiment with new technologies and ideas (manual)
  • Identify areas for improvement in current skills and knowledge (manual)

Daily routine

Trigger
#

Daily, at the beginning of the (work) day.

Duration
#

8h.

Steps
#

Start of day
#

  • Review and correct LLM-generated summaries of yesterday’s activities
  • Catch up on slack messages (manual)
  • Review PRs (manual)
  • Standup (manual)

Throughout the day
#

End of day
#



Ouroboros: An Autonomous Self-Improving AI Agent

What is Ouroboros?
#

Ouroboros is an autonomous AI agent that works on the tasks it is given while continuously improving itself. Named after the ancient symbol of a serpent eating its own tail - representing infinity and cyclic renewal - Ouroboros implements a continuous loop of Do → Learn → Improve → Retry.

Unlike traditional AI assistants that wait for commands and forget context between sessions, Ouroboros:

  • Runs indefinitely without human intervention
  • Maintains persistent memory of everything it has done
  • Reflects on its performance regularly
  • Modifies its own code to improve over time
  • Can incorporate human feedback when provided

Previously I wrote about GlobaLLM, an AI agent that autonomously contributes to open source projects. While GlobaLLM’s primary objective is project and task prioritization at scale, Ouroboros focuses on task implementation and self-improvement. Ouroboros is thus a component of GlobaLLM’s solution.

How It Works
#

The Core Loop
#

Ouroboros follows a structured nine-step cycle that repeats continuously:

  1. Read goals – Fetches tasks from agent/goals/active.md
  2. Select goal – Picks one to work on (or defaults to self-improvement)
  3. Plan – Uses an LLM to create a step-by-step plan
  4. Execute – Carries out the plan using available tools
  5. Journal – Writes results to a daily log
  6. Reflect – Analyzes what happened and identifies improvements (both task-related and self-related)
  7. Self-modify – Edits its own source code if improvements are found
  8. Journal again – Records reflection and modification results
  9. Repeat – Starts the cycle anew
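The nine steps above can be sketched as a control loop; the method names mirror the steps, and the `FakeAgent` here stands in for the real goal files, LLM calls, and git operations:

```python
def cycle(agent, iterations=1):
    for _ in range(iterations):
        goals = agent.read_goals()                        # 1. Read goals
        goal = agent.select(goals) or "self-improvement"  # 2. Select goal
        plan = agent.plan(goal)                           # 3. Plan
        result = agent.execute(plan)                      # 4. Execute
        agent.journal(result)                             # 5. Journal
        improvements = agent.reflect(result)              # 6. Reflect
        if improvements:
            agent.self_modify(improvements)               # 7. Self-modify
        agent.journal(improvements)                       # 8. Journal again
    # 9. Repeat: the caller runs the cycle again

class FakeAgent:
    """Toy agent that records what the loop does."""
    def __init__(self):
        self.log = []
    def read_goals(self): return ["fix bug"]
    def select(self, goals): return goals[0]
    def plan(self, goal): return ["step"]
    def execute(self, plan): return "done"
    def reflect(self, result): return []        # nothing to improve
    def self_modify(self, improvements): self.log.append("modify")
    def journal(self, entry): self.log.append("journal")

agent = FakeAgent()
cycle(agent)
# agent.log == ["journal", "journal"]: no self-modification this cycle
```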

This separation between execution and self-modification is crucial. The agent won’t modify its code while working on a task - reflections and improvements happen only during dedicated reflection cycles.

Architecture
#

┌─────────────────────────────────────────────────────────┐
│                      Agent Core                         │
│  (coordinates the loop, handles signals, manages state) │
└─────────────────────────────────────────────────────────┘
         │                    │                    │
         ▼                    ▼                    ▼
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   Memory     │    │  LLM Layer   │    │   Tools      │
│              │    │              │    │              │
│ • Working    │    │ • Anthropic  │    │ • run_cmd    │
│ • Journal    │───▶│   Claude     │───▶│ • read_file  │
│ • Goals      │    │ • Token      │    │ • write_file │
│ • Feedback   │    │   tracking   │    │ • search_*   │
└──────────────┘    └──────────────┘    └──────────────┘

Memory System
#

Ouroboros uses a three-tiered memory architecture:

| Tier | Description | Location |
| --- | --- | --- |
| Working memory | Current goals, immediate context | In-process |
| Short-term | Daily journals (notes, reflections, feedback) | agent/journal/YYYY/MM/DD/ |
| Long-term | Git history with descriptive commits | Git repository |

Everything is logged in human-readable markdown, making it easy to inspect what the agent has been up to.

Tool Registry
#

The agent comes with built-in tools for common operations:

  • run_command – Execute shell commands
  • read_file – Read file contents
  • write_file – Write to files
  • search_files – Find files by pattern
  • search_content – Search within files

Crucially, Ouroboros can create, register, and use new tools that it writes itself.

Tools are implemented as CLI commands of the ouroboros CLI that the agent can invoke during execution.
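A minimal sketch of such a registry, including runtime registration of a new tool; the tool bodies are toy stand-ins, not the real CLI implementations:

```python
TOOLS = {}

def register(name):
    """Decorator that adds a callable to the tool registry."""
    def wrap(fn):
        TOOLS[name] = fn
        return fn
    return wrap

@register("read_file")
def read_file(path):
    with open(path) as f:
        return f.read()

@register("search_content")
def search_content(text, needle):
    # Return the line numbers (0-based) containing the needle.
    return [i for i, line in enumerate(text.splitlines()) if needle in line]

# A dynamically added tool, registered after startup, the way the agent
# could register a tool it wrote itself:
register("shout")(lambda s: s.upper())

result = TOOLS["shout"]("new tool online")
# result == "NEW TOOL ONLINE"
```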

Design Principles
#

Safety Through Structure
#

The execution/reflection separation prevents runaway self-modification. The agent can only change code during a dedicated reflection phase, and all changes are committed to git with descriptive messages explaining the “why” behind each change.

Transparency
#

Every action is logged. Want to know what the agent did? Check the daily journal:

  • agent/journal/YYYY/MM/DD/notes.md – What it did
  • agent/journal/YYYY/MM/DD/reflections.md – What it learned
  • agent/journal/YYYY/MM/DD/user-feedback.md – Human input received

Autonomy with Optional Guidance
#

Ouroboros needs no human intervention, but welcomes it. It will happily incorporate feedback, adjust course based on user suggestions, and explain its reasoning when asked.

What Makes It Interesting
#

  1. True self-improvement – The agent can and does modify its own implementation based on reflection
  2. Persistent memory – Git commits serve as a permanent, queryable history of everything tried
  3. Graceful degradation – Failed modifications can be reverted; the agent learns and tries again
  4. Tool extensibility – New tools can be created dynamically as needs arise
  5. Idle improvement – When no goals are active, it works on making itself better

The Vision
#

Ouroboros represents an experiment in autonomous AI agents. Can an agent truly improve itself over time with little or no human intervention? By maintaining a detailed journal, reflecting on its actions, and having the freedom to modify its own code, Ouroboros aims to answer this question.

The name is fitting - the serpent eating its tail represents the continuous cycle of doing, learning, and improving that drives the agent forward. Each reflection builds on the last; each modification makes the agent slightly more capable.

Ouroboros is open source. Check out the repository to see the code, contribute, or run your own self-improving agent.


GlobaLLM: Automated Open Source Contribution at Scale

Introduction
#

Consider the following dilemma: you have unlimited access to state-of-the-art LLMs, but finite compute resources.

How do you maximize positive impact on the software ecosystem?

GlobaLLM is an experiment in autonomous open source contribution which attempts to address this question.

It’s a system that discovers repositories, analyzes their health, prioritizes issues, and automatically generates pull requests - all while coordinating with other instances to avoid redundant work.

The core insight isn’t just that LLMs can write code; it’s that strategic prioritization combined with distributed execution can multiply that capability into something genuinely impactful.

This article explains how GlobaLLM works, diving into the architecture that lets it scale from fixing a single bug to coordinating across thousands of repositories.

The GlobaLLM Pipeline: A High-Level View
#

GlobaLLM follows a five-stage pipeline:

Discover -> Analyze -> Prioritize -> Fix -> Contribute

1. Discover
#

The system begins by finding repositories worth targeting.

Using GitHub’s search API, it filters by domain, language, stars, and other criteria.

Current methodology uses domain-based discovery with predefined domains (ai_ml, web_dev, data_science, cloud_devops, mobile, security, games), each with custom search queries combining relevant keywords.

The system then applies multi-stage filtering:

  1. Language filtering: Excludes non-programming languages (Markdown, HTML, CSS, Shell, etc.)
  2. Library filtering: Uses heuristics to identify libraries vs applications (checks for package files like pyproject.toml, package.json, Cargo.toml; filters out “awesome” lists and doc repos; analyzes descriptions and topics)
  3. Quality filtering: Language-specific queries include testing indicators (pytest, jest, testing)
  4. Health filtering: Applies health scores to filter out unmaintained projects
  5. Dependent enrichment: Uses libraries.io API to fetch package dependency counts for impact scoring

Results are cached locally (24hr TTL) to avoid redundant API calls and respect rate limits.

The goal isn’t to find every repository - it’s to find libraries where a contribution would matter.

2. Analyze
#

Once a repository is identified, GlobaLLM performs deep analysis to determine whether contributing is worthwhile.

This gate prevents wasting resources on abandoned projects, hostile communities, or repositories where contributions won’t have impact.

It calculates a HealthScore based on multiple signals:

  • Commit velocity: Is the project actively maintained?
  • Issue resolution rate: Are bugs getting fixed?
  • CI status: Does the project have passing tests?
  • Contributor diversity: Is there a healthy community?

It also computes an impact score - how many users would benefit from a fix, based on stars, forks, and dependency analysis using NetworkX.

Repositories with low health scores or minimal impact are deprioritized or skipped entirely.
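A minimal sketch of how a HealthScore might combine these signals. The weights and field names here are illustrative; the article does not specify GlobaLLM's exact formula:

```python
from dataclasses import dataclass

# Illustrative health signals, each normalized to [0, 1].
# The weights below are assumptions for demonstration,
# not GlobaLLM's actual values.
@dataclass
class HealthSignals:
    commit_velocity: float        # recent commits relative to a baseline
    issue_resolution: float       # fraction of issues getting closed
    ci_passing: float             # 1.0 if CI is green, else 0.0
    contributor_diversity: float  # spread of commits across contributors

def health_score(s: HealthSignals) -> float:
    """Weighted average of normalized signals, clamped to [0, 1]."""
    total = (
        s.commit_velocity * 0.3
        + s.issue_resolution * 0.3
        + s.ci_passing * 0.2
        + s.contributor_diversity * 0.2
    )
    return max(0.0, min(1.0, total))
```

The impact score follows the same pattern: normalize each raw signal (stars, forks, dependents) and combine, with the dependency graph analysis handled separately via NetworkX.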

3. Prioritize
#

The system fetches open issues from approved repositories and ranks them with a multi-factor scoring algorithm.

Each issue is analyzed by an LLM to determine:

  • Category: bug, feature, documentation, performance, security, etc.
  • Complexity: 1-10 scale (how difficult to solve)
  • Solvability: 0-1 score (likelihood of automated fix success)
  • Requirements: affected files, breaking change risk, test needs

The prioritization then combines four dimensions:

Health (weight: 1.0): Repository health adjusted for complexity.

A healthy repository with simple issues scores higher than an unhealthy repository with complex ones.

Impact (weight: 2.0): Based on stars, dependents, and watchers.

Uses log-scale normalization (stars / 50,000, dependents / 5,000).

Solvability (weight: 1.5): LLM-assessed likelihood of successful resolution.

Documentation and style issues score high (~0.9), while critical security issues score low (~0.3), because the latter are much harder to fix automatically.

Urgency (weight: 0.5): Category multiplier × age × engagement.

Critical security bugs get 10× multiplier, documentation gets 1×.

The final formula:

priority = (health × 1.0) + (impact × 2.0) + (solvability × 1.5) + (urgency × 0.5)
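The formula translates directly into code. The weights come from the article; the example scores below are made up for illustration:

```python
# Direct translation of the priority formula. The weights are from the
# article; the example dimension values are illustrative and assumed
# to already be normalized scores.
WEIGHTS = {"health": 1.0, "impact": 2.0, "solvability": 1.5, "urgency": 0.5}

def priority(health: float, impact: float, solvability: float, urgency: float) -> float:
    return (
        health * WEIGHTS["health"]
        + impact * WEIGHTS["impact"]
        + solvability * WEIGHTS["solvability"]
        + urgency * WEIGHTS["urgency"]
    )

# A healthy repo with an easy documentation issue...
doc_fix = priority(health=0.8, impact=0.4, solvability=0.9, urgency=0.1)
# ...outranks an unhealthy repo with a hard but urgent security issue.
security_fix = priority(health=0.3, impact=0.5, solvability=0.3, urgency=0.9)
```

The 2.0 weight on impact means a widely depended-on library can outrank a healthier but obscure one, which matches the system's goal of maximizing users helped per token spent.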

Budget constraints filter the ranked list:

  • Per-repository token limit (default: 100k)
  • Per-language issue limit (default: 50)
  • Weekly token budget (default: 5M)

Results are saved to the issue store with full breakdowns for transparency.
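The budget filter can be sketched as a single pass over the ranked list. The limits mirror the defaults above; the issue fields and function name are illustrative:

```python
# Sketch of budget-constrained filtering over a priority-ranked issue
# list. The limits match the article's defaults; the issue dicts and
# their field names are illustrative assumptions.
def apply_budget(ranked_issues, repo_token_limit=100_000,
                 per_language_limit=50, weekly_token_budget=5_000_000):
    repo_tokens, language_counts, total_tokens = {}, {}, 0
    selected = []
    for issue in ranked_issues:  # already sorted by priority, descending
        cost = issue["estimated_tokens"]
        repo, lang = issue["repo"], issue["language"]
        if total_tokens + cost > weekly_token_budget:
            continue  # would blow the weekly budget
        if repo_tokens.get(repo, 0) + cost > repo_token_limit:
            continue  # this repository has used its share
        if language_counts.get(lang, 0) >= per_language_limit:
            continue  # enough issues queued for this language
        repo_tokens[repo] = repo_tokens.get(repo, 0) + cost
        language_counts[lang] = language_counts.get(lang, 0) + 1
        total_tokens += cost
        selected.append(issue)
    return selected
```

Because the list is already priority-sorted, skipping over-budget issues rather than stopping outright lets cheaper, lower-ranked issues still fill the remaining budget.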

4. Fix
#

GlobaLLM claims the highest-priority unassigned issue and generates a solution.

This is where LLMs do the heavy lifting.

The CodeGenerator class sends a structured prompt to Claude or ChatGPT with:

  • The issue title and description
  • Repository context (code style, testing framework)
  • Language-specific conventions
  • Category-specific requirements (bug vs feature vs docs)

The LLM responds with a complete solution:

  • Explanation: Step-by-step reasoning
  • File patches: Original and new content for each modified file
  • Tests: New or modified test files

The system tracks tokens used at every step for budget management.

5. Contribute
#

The final stage uses PRAutomation to create a well-structured pull request with context, tests, and documentation.

For trivial changes (typos, version bumps), it can even auto-merge.

Where LLMs Are Used
#

LLMs are the engine that powers GlobaLLM, but they’re used strategically rather than indiscriminately.

Stage 3 - Prioritize: The IssueAnalyzer calls an LLM to categorize each issue.

Input: title, body, labels, comments, reactions.

Output: category, complexity (1-10), solvability (0-1), breaking_change, test_required.

This costs ~500 tokens per issue and feeds directly into the priority scoring.

Stage 4 - Fix: The CodeGenerator uses an LLM to generate complete solutions.

Input: issue details, repository context, language style guidelines.

Output: explanation, file patches (original + new content), test files.

This costs 1k-10k tokens depending on complexity.

The key insight: LLMs are only used for tasks requiring intelligence.

Discovery, health scoring, impact calculation, and PR automation use deterministic algorithms.

Scaling GlobaLLM
#

The real power of GlobaLLM emerges when you run multiple instances in parallel.

Distributed Agent Architecture
#

Each GlobaLLM instance has a unique AgentIdentity.

When it’s ready to work, it calls:

globallm assign claim

This atomically reserves the highest-priority unassigned issue.

The assignment is stored in PostgreSQL with a heartbeat timestamp.

Issue Assignment System
#

To prevent multiple agents from working on the same issue:

  1. Issues are marked assigned with an agent ID and timestamp
  2. Heartbeats update every 5 minutes
  3. If a heartbeat expires (30 minutes), the issue is reassigned

This allows crash recovery: if an agent crashes mid-work, another will pick up the issue.

Crash Recovery
#

The heartbeat system is elegant in its simplicity:

# Agent side: refresh the heartbeat on each unit of work
while working:
    update_heartbeat(issue_id, agent_id)
    do_work()

# Recovery side: periodically reassign issues with stale heartbeats
expired = get_issues_with_expired_heartbeats()
for issue in expired:
    reassign(issue)

No distributed consensus needed - PostgreSQL’s row-level locking handles contention.
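To make the mechanism concrete, here is a self-contained, in-memory sketch of claim and recovery. The real system keeps this state in PostgreSQL, where the claim would be a single row-locked `UPDATE`; the class and method names here are illustrative:

```python
# In-memory sketch of the assignment store. The real system keeps this
# state in PostgreSQL with row-level locking; the names are illustrative
# and only the 30-minute expiry comes from the article.
HEARTBEAT_EXPIRY = 30 * 60  # seconds

class AssignmentStore:
    def __init__(self):
        self.assignments = {}  # issue_id -> (agent_id, last_heartbeat)

    def claim(self, issue_id: str, agent_id: str, now: float) -> bool:
        """Reserve an issue if it is unassigned or its heartbeat expired."""
        current = self.assignments.get(issue_id)
        if current is not None and now - current[1] < HEARTBEAT_EXPIRY:
            return False  # still held by a live agent
        self.assignments[issue_id] = (agent_id, now)
        return True

    def heartbeat(self, issue_id: str, agent_id: str, now: float) -> None:
        """Refresh the timestamp, but only for the current holder."""
        holder = self.assignments.get(issue_id)
        if holder is not None and holder[0] == agent_id:
            self.assignments[issue_id] = (agent_id, now)

    def expired_issues(self, now: float) -> list:
        return [i for i, (_, ts) in self.assignments.items()
                if now - ts >= HEARTBEAT_EXPIRY]
```

In PostgreSQL the claim-or-reassign check becomes one `UPDATE ... WHERE assigned_agent IS NULL OR heartbeat < now() - interval '30 minutes'`, so the database's own locking serializes competing agents.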

Database Design
#

PostgreSQL is the central state store:

  • Connection pooling: 2-10 connections per process (psycopg pool)
  • JSONB columns: Flexible schema for repository/issue metadata
  • Indexes: On frequently queried fields (stars, health_score, assigned status)
  • Migrations: Versioned schema

Future Work
#

GlobaLLM is an evolving experiment. To become more effective, it will need to handle many of the same challenges that human developers face daily:

  • Parallelizing work across multiple agents without conflicts or redundant effort.
  • Building “mental models” of repositories to better understand their goals, architecture, dependencies, and trajectories.
  • Adding a higher-level decision-making layer that reasons about which repositories to focus on based on broader trends in the open source ecosystem.
  • Making strategic choices, such as which programming languages to prioritize.
  • Working with closed source repositories, which lack many of the signals open source ones provide (e.g., forks, stars, dependency count).

Conclusion
#

GlobaLLM is an experiment in what’s possible when you combine LLM code generation with principled decision-making and distributed execution.

The goal isn’t to replace human contributors - it’s to handle the long tail of maintenance work that no one has time for, freeing up humans to focus on the interesting problems.

The system is actively developed and evolving.

Current work focuses on better prioritization heuristics, more sophisticated validation, and integration with additional LLM providers.

If you’re interested in contributing or just want to run it yourself, the code is available on GitHub.

This system is far from perfect, but it’s a step toward harnessing AI to make open source software healthier and more sustainable at scale.

It’s also a way to explore what it looks like to make decisions at the scale of millions of repositories and billions of issues.