- University of Illinois Urbana-Champaign
- https://alphapav.github.io/
Stars
AI enabled pair programmer for Claude, GPT, O Series, Grok, Deepseek, Gemini and 300+ models
CyberGym is a large-scale, high-quality cybersecurity evaluation framework designed to rigorously assess the capabilities of AI agents on real-world vulnerability analysis tasks.
Lightweight coding agent that runs in your terminal
🌎💪 BrowserGym, a Gym environment for web task automation
🪨 why use many token when few token do trick — Claude Code skill that cuts 65% of tokens by talking like caveman
Robust Speech Recognition via Large-Scale Weak Supervision
The next employee you want to distill doesn't have to be a colleague. Distill how anyone thinks: mental models, decision heuristics, expressive DNA.
OSS-Fuzz - continuous fuzzing for open source software.
AI agents running research on single-GPU nanochat training automatically
image scaling attacks for multi-modal prompt injection
AndroidWorld is an environment and benchmark for autonomous agents
An Illusion of Progress? Assessing the Current State of Web Agents
Code for "WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models"
[NeurIPS 2025 Spotlight] Reasoning Environments for Reinforcement Learning with Verifiable Rewards
Repo for the paper "Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks".
Official release of code for the paper "RL is a hammer and LLMs are nails: A simple RL approach to stronger prompt injection attacks"
Open-source implementation of AlphaEvolve
Get your documents ready for gen AI
[NeurIPS 2025] Latent Zoning Networks
The 100 line AI agent that solves GitHub issues or helps you in your command line. Radically simple, no huge configs, no giant monorepo—but scores >74% on SWE-bench verified!
An open-source AI agent that brings the power of Gemini directly into your terminal.
🔮Reasoning for Safer Code Generation; 🥇Winner Solution of Amazon Nova AI Challenge 2025
An open-source AI coding agent that lives in your terminal.
👩‍⚖️ Agent-as-a-Judge: The Magic for Open-Endedness
MCPMark is a comprehensive, stress-testing MCP benchmark designed to evaluate model and agent capabilities in real-world MCP use.