Real-world AI Coding Agent Performance Benchmark
AI-PK is an objective, reproducible benchmark system for AI coding assistants. It evaluates different AI engines and IDE clients through real-world Zig project development tasks (ZigScan port scanner).
Test Task: Build a complete Zig port scanner from scratch, including concurrent scanning, argument parsing, performance optimization, and other real development requirements.
| Rank | AI Engine | IDE/Client | Status | Time | Tokens | Quality Score |
|---|---|---|---|---|---|---|
| 🥇 #1 | Claude Sonnet 4.5 | Factory Droid | ✅ Success | 15.0min | 172K | 10/10 |
| 🥈 #2 | Claude Opus 4.1 | Factory Droid | ✅ Success | 18.0min | 3700K | 9/10 |
| 🥉 #3 | Claude Sonnet 4.5 | Factory Droid | ✅ Success | 24.0min | 773K | 9/10 |
| #4 | GPT-5 (high) | Factory Droid | ✅ Success | 30.0min | 725K | 9/10 |
| #5 | GPT-5 (codex_medium) | Codex CLI | ✅ Success | 59.0min | 484K | 8/10 |
| #6 | Kimi Thinking | Kimi-CLI | ✅ Success | 72.0min | N/A | 7/10 |
| #7 | Grok | Roo Code | ✅ Success | 300.0min | N/A | 7/10 |
| #8 | MiniMax M2 | ClaudeCode | ✅ Success | 137.5min | 36560K | 6/10 |
| #9 | Qwen | Qwen-CLI | ✅ Success | 159.0min | 75400K | 6/10 |
| #10 | GPT-5 (high) | Codex CLI | ⚠️ Partial | 20.4min | 152K | 6/10 |
View Full Leaderboard | Download Text Report
- Total Tests: 22
- ✅ Success: 9 (41%)
- ⚠️ Partial Success: 6 (27%)
- ❌ Failed: 7 (32%)
- 🥇 Highest Quality: Claude Sonnet 4.5 + Factory Droid (10/10 Perfect Score)
- ⚡ Fastest: Claude Sonnet 4.5 + Factory Droid (15 minutes)
- 💰 Most Token Efficient: Claude Sonnet 4.5 + Factory Droid (172.5K tokens)
International:
- Claude Sonnet 4.5, Claude Opus 4.1 (Anthropic)
- GPT-5 with multiple configuration levels (OpenAI)
- Grok (xAI)
- Supernova
Chinese:
- Qwen (Alibaba)
- GLM-4.6 (Zhipu AI)
- Kat (Kuaishou)
- Factory Droid
- Codex CLI
- Roo Code
- Kilo Code
- ClaudeCode
- Cline
- Qwen-CLI
- English Report - Sortable tables with embedded charts
- Chinese Report - Chinese-language report
- English Text Report
- Chinese Text Report
- JSON Data - Complete structured data
- Success Rate Distribution
- Token Efficiency Analysis
- Engine Comparison
- Quality Heatmap
Uses a standardized 0-10 scoring system. See Quality Scoring Standards for details.
Scoring Formula: Final Score = min(max(Base + Bonus - Penalty, 0), 10)
- Base Score:
  - SUCCESS: 8 points
  - PARTIAL: 5 points
  - FAILED: 0 points
- Bonus (up to +3):
  - Functionality completeness (0-1)
  - Code quality (0-1)
  - Performance (0-1)
- Penalty (up to -5):
  - Bug severity (0-2)
  - Manual intervention needed (0-2)
  - Efficiency issues (0-1)
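
The formula and point ranges above can be sketched in Python. This is a minimal illustration, not the project's analyzer code. The function name `final_score`, the component key names (taken from the `quality_breakdown` field of stats.json, with `workaround` assumed to mean "manual intervention needed"), and the clamping of each component to its listed range are all assumptions.

```python
# Base points per completion status, as listed above.
BASE_SCORES = {"SUCCESS": 8, "PARTIAL": 5, "FAILED": 0}

# Upper bound for each component (lower bound is 0). Key names are assumed.
BONUS_LIMITS = {"functionality": 1.0, "code_quality": 1.0, "performance": 1.0}
PENALTY_LIMITS = {"bugs": 2.0, "workaround": 2.0, "efficiency": 1.0}


def _capped_sum(values, limits):
    """Sum the known components, clamping each one to [0, its limit]."""
    return sum(min(max(values.get(key, 0.0), 0.0), cap) for key, cap in limits.items())


def final_score(status, bonus=None, penalty=None):
    """Final Score = min(max(Base + Bonus - Penalty, 0), 10)."""
    base = BASE_SCORES[status]
    total = base + _capped_sum(bonus or {}, BONUS_LIMITS) - _capped_sum(penalty or {}, PENALTY_LIMITS)
    return float(min(max(total, 0.0), 10.0))


full_bonus = {"functionality": 1.0, "code_quality": 1.0, "performance": 1.0}
print(final_score("SUCCESS", full_bonus))                           # 10.0
print(final_score("PARTIAL", {"functionality": 1.0}, {"bugs": 2.0}))  # 4.0
print(final_score("FAILED", penalty={"workaround": 2.0}))           # 0.0
```

In the first call, 8 + 3 would give 11, so the outer `min` caps the result at 10. In the last call, the inner `max` keeps a failed run with penalties from dropping below 0.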
# Open interactive reports in browser
xdg-open results/REPORT_EN.html # English version
xdg-open results/REPORT_ZH.html # Chinese version
# Or view text reports
cat results/BENCHMARK_REPORT.txt

# 1. Navigate to project directory
cd ai-pk
# 2. Run full analysis
bash scripts/run_all.sh
# Generated files:
# - results/BENCHMARK_REPORT.txt (English)
# - results/BENCHMARK_REPORT_ZH.txt (Chinese)
# - results/REPORT_EN.html (Interactive English)
# - results/REPORT_ZH.html (Interactive Chinese)
# - results/charts/*.png (Visualization charts)

ai-pk/
├── benchmarks/
│   └── zigscan/                        # ZigScan test results (22 tests)
│       ├── sonnet4.5-dorid-2025-10-25/
│       │   ├── stats.json              # Standardized data
│       │   ├── finish.log              # Test log
│       │   └── src/                    # Generated code
│       ├── gpt5_hight-dorid/
│       └── ...
├── scripts/
│   ├── cyberpunk_analyzer.py           # Main analysis script
│   ├── generate_charts.py              # Chart generator
│   ├── generate_bilingual_html.py      # Bilingual HTML reports
│   └── run_all.sh                      # One-click run
├── results/
│   ├── BENCHMARK_REPORT.txt            # English text report
│   ├── BENCHMARK_REPORT_ZH.txt         # Chinese text report
│   ├── REPORT_EN.html                  # English interactive report
│   ├── REPORT_ZH.html                  # Chinese interactive report
│   ├── benchmark_data.json             # Complete data
│   └── charts/                         # Visualization charts
├── QUALITY_SCORING_STANDARD.md         # Scoring standards
└── README.md                           # This file
Each test includes a stats.json file with the following structure:
{
  "test_dir": "sonnet4.5-dorid-2025-10-25",
  "engine": "Claude Sonnet 4.5",
  "client": "Factory Droid",
  "completed": "SUCCESS",
  "time_minutes": 15,
  "tokens": 172500,
  "quality_score": 10,
  "quality_breakdown": {
    "base_score": 8,
    "bonus": { "functionality": 1.0, "code_quality": 1.0, "performance": 1.0 },
    "penalty": { "bugs": 0.0, "workaround": 0.0, "efficiency": 0.0 }
  }
}

Contributions of new test cases are welcome! See CONTRIBUTING.md.
- Create a new directory in `benchmarks/zigscan/`
- Add a `stats.json` file (refer to existing format)
- Run `bash scripts/run_all.sh` to regenerate reports
- Submit a PR
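
Before opening a PR, it may help to sanity-check a new stats.json against the structure shown earlier. The snippet below is a hypothetical helper, not part of the repository. The function name `check_stats`, the list of required keys, and the allowed status values are inferred from the example file and the scoring rules, so treat them as assumptions. `tokens` is left optional here because some leaderboard rows report it as N/A.

```python
import json

# Top-level keys and expected types, inferred from the example stats.json (assumption).
REQUIRED = {
    "test_dir": str,
    "engine": str,
    "client": str,
    "completed": str,
    "time_minutes": (int, float),
    "quality_score": (int, float),
}
STATUSES = {"SUCCESS", "PARTIAL", "FAILED"}


def check_stats(stats):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    for key, expected in REQUIRED.items():
        if key not in stats:
            problems.append(f"missing key: {key}")
        elif isinstance(stats[key], bool) or not isinstance(stats[key], expected):
            problems.append(f"wrong type for {key}: {type(stats[key]).__name__}")
    status = stats.get("completed")
    if isinstance(status, str) and status not in STATUSES:
        problems.append(f"unexpected status: {status}")
    score = stats.get("quality_score")
    if isinstance(score, (int, float)) and not isinstance(score, bool) and not 0 <= score <= 10:
        problems.append(f"quality_score out of range: {score}")
    return problems


sample = json.loads("""
{
  "test_dir": "sonnet4.5-dorid-2025-10-25",
  "engine": "Claude Sonnet 4.5",
  "client": "Factory Droid",
  "completed": "SUCCESS",
  "time_minutes": 15,
  "tokens": 172500,
  "quality_score": 10
}
""")
print(check_stats(sample))
print(check_stats({"engine": "Example", "completed": "DONE", "quality_score": 12}))
```

The first call prints an empty list. The second reports the three missing keys, the unknown status, and the out-of-range score. For a file on disk, the same function can be applied to the result of `json.load` on the opened file.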
See RELEASE_NOTES.md for the latest updates.