gnusec/AI-PK

AI-PK: AI Coding Agent Benchmark

 █████╗ ██╗      ██████╗ ██╗  ██╗
██╔══██╗██║      ██╔══██╗██║ ██╔╝
███████║██║█████╗██████╔╝█████╔╝ 
██╔══██║██║╚════╝██╔═══╝ ██╔═██╗ 
██║  ██║██║      ██║     ██║  ██╗
╚═╝  ╚═╝╚═╝      ╚═╝     ╚═╝  ╚═╝

Real-world AI Coding Agent Performance Benchmark

🇨🇳 Chinese version | English

🎯 Introduction

AI-PK is an objective, reproducible benchmark system for AI coding assistants. It evaluates combinations of AI engines and IDE clients on a real-world Zig development task: building the ZigScan port scanner.

Test Task: Build a complete Zig port scanner from scratch, including concurrent scanning, argument parsing, performance optimization, and other real development requirements.


πŸ† TOP 10 Leaderboard

Rank AI Engine IDE/Client Status Time Tokens Quality Score
πŸ₯‡ #1 Claude Sonnet 4.5 Factory Droid βœ… Success 15.0min 172K 10/10
πŸ₯ˆ #2 Claude Opus 4.1 Factory Droid βœ… Success 18.0min 3700K 9/10
πŸ₯‰ #3 Claude Sonnet 4.5 Factory Droid βœ… Success 24.0min 773K 9/10
#4 GPT-5 (hight) Factory Droid βœ… Success 30.0min 725K 9/10
#5 GPT-5 (codex_medium) Codex CLI βœ… Success 59.0min 484K 8/10
#6 Kimi Thinking Kimi-CLI βœ… Success 72.0min N/A 7/10
#7 Grok Roo Code βœ… Success 300.0min N/A 7/10
#8 MiniMax M2 ClaudeCode βœ… Success 137.5min 36560K 6/10
#9 Qwen Qwen-CLI βœ… Success 159.0min 75400K 6/10
#10 GPT-5 (hight) Codex CLI ⚠️ Partial 20.4min 152K 6/10

πŸ“Š View Full Leaderboard | Download Text Report


📊 Statistics Overview

  • Total Tests: 22
  • ✅ Success: 9 (41%)
  • ⚠️ Partial Success: 6 (27%)
  • ❌ Failed: 7 (32%)
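
The percentages above are each outcome's share of the 22 tests, rounded to a whole number. A minimal sketch of that arithmetic follows; the `summarize` helper is hypothetical and is not part of the repository's scripts.

```python
from collections import Counter


def summarize(statuses):
    """Map each status to (count, percentage rounded to a whole number)."""
    total = len(statuses)
    counts = Counter(statuses)
    return {status: (n, round(100 * n / total)) for status, n in counts.items()}


# Outcome counts taken from the overview above (22 tests in total).
statuses = ["SUCCESS"] * 9 + ["PARTIAL"] * 6 + ["FAILED"] * 7
summary = summarize(statuses)

for status, (count, pct) in summary.items():
    print(f"{status}: {count} ({pct}%)")
```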

πŸ… Best Performance

  • πŸ₯‡ Highest Quality: Claude Sonnet 4.5 + Factory Droid (10/10 Perfect Score)
  • ⚑ Fastest: Claude Sonnet 4.5 + Factory Droid (15 minutes)
  • πŸ’° Most Token Efficient: Claude Sonnet 4.5 + Factory Droid (172.5K tokens)

🤖 Tested AI Engines

International:

  • Claude Sonnet 4.5, Claude Opus 4.1 (Anthropic)
  • GPT-5 with multiple configuration levels (OpenAI)
  • Grok (xAI)
  • Supernova

Chinese:

  • Qwen (Alibaba)
  • GLM-4.6 (Zhipu AI)
  • Kat (Kuaishou)

🛠️ Tested IDE/Clients

  • Factory Droid
  • Codex CLI
  • Roo Code
  • Kilo Code
  • ClaudeCode
  • Cline
  • Qwen-CLI

📈 Reports & Visualizations

Interactive HTML Reports

Text Reports

Data Files

  • 📊 JSON Data - Complete structured data

Charts


🎯 Scoring Standards

Every test is scored on a standardized 0-10 scale. See Quality Scoring Standards (QUALITY_SCORING_STANDARD.md) for details.

Scoring Formula: Final Score = min(max(Base + Bonus - Penalty, 0), 10)

  • Base Score:

    • SUCCESS: 8 points
    • PARTIAL: 5 points
    • FAILED: 0 points
  • Bonus (up to +3):

    • Functionality completeness (0-1)
    • Code quality (0-1)
    • Performance (0-1)
  • Penalty (up to -5):

    • Bug severity (0-2)
    • Manual intervention needed (0-2)
    • Efficiency issues (0-1)
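
The formula clamps the raw sum into the 0-10 range, so a successful run with the full bonus (8 + 3 = 11) still scores 10. A minimal sketch of the calculation, assuming the base scores listed above; `final_score` and `BASE_SCORES` are illustrative names, not taken from the project's scripts.

```python
# Base scores by completion status, as listed above.
BASE_SCORES = {"SUCCESS": 8, "PARTIAL": 5, "FAILED": 0}


def final_score(base, bonus, penalty):
    """Final Score = min(max(Base + Bonus - Penalty, 0), 10)."""
    return min(max(base + bonus - penalty, 0), 10)


# Successful run, full bonus, no penalty: 8 + 3 - 0 = 11, clamped down to 10.
perfect = final_score(BASE_SCORES["SUCCESS"], 3, 0)

# Partial run, bonus 1, penalty 2: 5 + 1 - 2 = 4.
partial = final_score(BASE_SCORES["PARTIAL"], 1, 2)

# Failed run with the maximum penalty: 0 + 0 - 5 = -5, clamped up to 0.
failed = final_score(BASE_SCORES["FAILED"], 0, 5)

print(perfect, partial, failed)
```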

🚀 Quick Start

View Reports

# Open interactive reports in browser
xdg-open results/REPORT_EN.html  # English version
xdg-open results/REPORT_ZH.html  # Chinese version

# Or view text reports
cat results/BENCHMARK_REPORT.txt

Run Analysis

# 1. Navigate to project directory
cd ai-pk

# 2. Run full analysis
bash scripts/run_all.sh

# Generated files:
# - results/BENCHMARK_REPORT.txt (English)
# - results/BENCHMARK_REPORT_ZH.txt (Chinese)
# - results/REPORT_EN.html (Interactive English)
# - results/REPORT_ZH.html (Interactive Chinese)
# - results/charts/*.png (Visualization charts)

πŸ“ Project Structure

ai-pk/
β”œβ”€β”€ benchmarks/
β”‚   └── zigscan/              # ZigScan test results (22 tests)
β”‚       β”œβ”€β”€ sonnet4.5-dorid-2025-10-25/
β”‚       β”‚   β”œβ”€β”€ stats.json    # Standardized data
β”‚       β”‚   β”œβ”€β”€ finish.log    # Test log
β”‚       β”‚   └── src/          # Generated code
β”‚       β”œβ”€β”€ gpt5_hight-dorid/
β”‚       └── ...
β”œβ”€β”€ scripts/
β”‚   β”œβ”€β”€ cyberpunk_analyzer.py      # Main analysis script
β”‚   β”œβ”€β”€ generate_charts.py         # Chart generator
β”‚   β”œβ”€β”€ generate_bilingual_html.py # Bilingual HTML reports
β”‚   └── run_all.sh                 # One-click run
β”œβ”€β”€ results/
β”‚   β”œβ”€β”€ BENCHMARK_REPORT.txt       # English text report
β”‚   β”œβ”€β”€ BENCHMARK_REPORT_ZH.txt    # Chinese text report
β”‚   β”œβ”€β”€ REPORT_EN.html             # English interactive report
β”‚   β”œβ”€β”€ REPORT_ZH.html             # Chinese interactive report
β”‚   β”œβ”€β”€ benchmark_data.json        # Complete data
β”‚   └── charts/                    # Visualization charts
β”œβ”€β”€ QUALITY_SCORING_STANDARD.md    # Scoring standards
└── README.md                      # This file

📊 Data Format

Each test includes a stats.json file with the following structure:

{
  "test_dir": "sonnet4.5-dorid-2025-10-25",
  "engine": "Claude Sonnet 4.5",
  "client": "Factory Droid",
  "completed": "SUCCESS",
  "time_minutes": 15,
  "tokens": 172500,
  "quality_score": 10,
  "quality_breakdown": {
    "base_score": 8,
    "bonus": { "functionality": 1.0, "code_quality": 1.0, "performance": 1.0 },
    "penalty": { "bugs": 0.0, "workaround": 0.0, "efficiency": 0.0 }
  }
}
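
Because `quality_breakdown` carries every input to the scoring formula, a `stats.json` file can be cross-checked against its own `quality_score`. The sketch below is one way to do that. It assumes `completed` takes one of `SUCCESS`, `PARTIAL`, or `FAILED` (the statuses named in the scoring section), and `check_stats` is a hypothetical helper rather than part of `scripts/`.

```python
import json

# Assumption: the "completed" field uses the three statuses from the scoring section.
VALID_STATUSES = {"SUCCESS", "PARTIAL", "FAILED"}
REQUIRED_KEYS = ["test_dir", "engine", "client", "completed",
                 "time_minutes", "tokens", "quality_score", "quality_breakdown"]


def recompute_score(stats):
    """Apply min(max(Base + Bonus - Penalty, 0), 10) to the breakdown."""
    breakdown = stats["quality_breakdown"]
    raw = (breakdown["base_score"]
           + sum(breakdown["bonus"].values())
           - sum(breakdown["penalty"].values()))
    return min(max(raw, 0), 10)


def check_stats(stats):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in stats]
    if problems:
        return problems
    if stats["completed"] not in VALID_STATUSES:
        problems.append(f"unexpected status: {stats['completed']}")
    expected = recompute_score(stats)
    if abs(expected - stats["quality_score"]) > 1e-9:
        problems.append(f"quality_score is {stats['quality_score']}, breakdown gives {expected}")
    return problems


# The example record from this section.
sample = json.loads("""
{
  "test_dir": "sonnet4.5-dorid-2025-10-25",
  "engine": "Claude Sonnet 4.5",
  "client": "Factory Droid",
  "completed": "SUCCESS",
  "time_minutes": 15,
  "tokens": 172500,
  "quality_score": 10,
  "quality_breakdown": {
    "base_score": 8,
    "bonus": { "functionality": 1.0, "code_quality": 1.0, "performance": 1.0 },
    "penalty": { "bugs": 0.0, "workaround": 0.0, "efficiency": 0.0 }
  }
}
""")

sample_problems = check_stats(sample)
print(sample_problems)
```

For the sample record the breakdown sums to 11, which the clamp reduces to 10, matching the stored `quality_score`.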

🤝 Contributing

Contributions of new test cases are welcome! See CONTRIBUTING.md.

Adding New Tests

  1. Create a new directory in benchmarks/zigscan/
  2. Add a stats.json file (refer to existing format)
  3. Run bash scripts/run_all.sh to regenerate reports
  4. Submit a PR
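
Steps 1 and 2 can be scripted. The following is a minimal sketch under stated assumptions: `scaffold_test` is a hypothetical helper, the directory and engine names are invented for illustration, and the zeroed values are placeholders to be replaced by hand with the real measurements before running `scripts/run_all.sh`.

```python
import json
import tempfile
from pathlib import Path


def scaffold_test(repo_root, test_dir, engine, client):
    """Create benchmarks/zigscan/<test_dir>/stats.json with placeholder values."""
    target = Path(repo_root) / "benchmarks" / "zigscan" / test_dir
    target.mkdir(parents=True, exist_ok=False)  # refuse to overwrite an existing test
    stats = {
        "test_dir": test_dir,
        "engine": engine,
        "client": client,
        "completed": "FAILED",   # placeholder: set to the real outcome
        "time_minutes": 0,       # placeholder
        "tokens": 0,             # placeholder
        "quality_score": 0,      # placeholder
        "quality_breakdown": {
            "base_score": 0,
            "bonus": {"functionality": 0.0, "code_quality": 0.0, "performance": 0.0},
            "penalty": {"bugs": 0.0, "workaround": 0.0, "efficiency": 0.0},
        },
    }
    path = target / "stats.json"
    path.write_text(json.dumps(stats, indent=2) + "\n", encoding="utf-8")
    return path


# Demonstration in a throwaway directory so the real checkout is left untouched.
with tempfile.TemporaryDirectory() as repo_root:
    stats_path = scaffold_test(repo_root, "example-engine-example-client",
                               "Example Engine", "Example Client")
    created = json.loads(stats_path.read_text(encoding="utf-8"))

print(created["test_dir"])
```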

πŸ“ Changelog

See RELEASE_NOTES.md for the latest updates.

About

We compare the development efficiency and cost of various LLMs and AI development kits on common practical projects, to provide a basis for our own tooling choices.
