Welcome to the AppEvalPilot project, a cutting-edge automated evaluation framework designed to comprehensively assess software application functionalities across an array of platforms. Tailored for versatility, this framework adeptly handles the evaluation of desktop, mobile, and web-based applications under a unified methodology.
AppEvalPilot runs fully automatically, with no manual intervention, streamlining your workflow while significantly cutting evaluation costs. By leveraging the framework, you not only accelerate the evaluation process but also achieve more accurate assessment outcomes. Ideal for developers and QA teams looking to improve the efficiency and quality of their testing, AppEvalPilot is a reliable solution for comprehensive, precise, and efficient application assessment. Join us in advancing software evaluation with AppEvalPilot.

- Cross-Platform Compatibility: A unified codebase facilitating evaluation across desktop applications, mobile applications, and web-based interfaces.
- Methodologically Robust Dynamic Assessment: In contrast to conventional benchmarks employing static evaluation methodologies, AppEvalPilot replicates the systematic workflow of professional testing engineers to conduct thorough application evaluation.
- Resource Efficiency: AppEvalPilot completes a comprehensive evaluation of 15-20 functional components within an application in approximately 8-9 minutes. The system operates continuously (24/7) to evaluate diverse applications at a cost of $0.26 per app, substantially more economical than human-conducted evaluations.
Demo video: demo.mp4

```bash
# Create a conda environment
conda create -n appeval python=3.10
conda activate appeval
# Clone the repository
git clone https://github.com/tanghaom/AppEvalPilot.git
cd AppEvalPilot
# Install dependencies
pip install uv
uv pip install -r requirements.txt
# Install appeval
uv pip install -e .
# Optional: Install enhanced version with OCR and icon detection capabilities
uv pip install -e .[ultra]
```

- Edit `config/config2.yaml` to configure your LLM model
- Recommended model: claude-3-5-sonnet-v2
- Ensure appropriate configuration of the `api_key` and `base_url` parameters in the configuration file
- For integration of additional multimodal models (e.g., Qwen2.5-VL-72B), add the corresponding model identifiers in `metagpt/provider/constant.py`
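
For orientation, below is a minimal sketch of what the LLM block of `config/config2.yaml` might contain, assuming a MetaGPT-style configuration; the field names here are an assumption, so treat `config/config2.yaml.example` as the authoritative template.

```yaml
# Sketch only: field names assume a MetaGPT-style config; follow config/config2.yaml.example.
llm:
  api_type: "openai"                        # provider type matching your endpoint
  model: "claude-3-5-sonnet-v2"             # recommended model
  base_url: "https://your-llm-endpoint/v1"  # replace with your provider's base URL
  api_key: "YOUR_API_KEY"                   # replace with your key
```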

```bash
# Run the main program to execute automated application evaluation
# This will run a single test case on a web application and evaluate its functionality
python main.py
```

```bash
# Run OSagent, a powerful GUI-based agent that automates everyday tasks for you,
# from ordering food delivery and booking rides to searching for information and sending it to your contacts
python scripts/run_osagent.py
```

```bash
# Start the FastAPI task management server (a request sketch follows these commands), which enables you to:
# - Submit and manage different types of test tasks (URL, Python app, Python Web app)
# - Asynchronously process tasks with status tracking
# - Manage conda environments and processes for application testing
python scripts/server.py
```

```bash
# Launch the Gradio web interface for easy test configuration and execution
# Provides a user-friendly UI to:
# - Configure and run tests on web applications
# - Monitor test execution progress and action history
# - View and analyze test results in real-time
python gradio_app.py
```
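
For the FastAPI task management server above, here is a hedged sketch of what submitting and polling a task could look like over HTTP. The port, routes, and JSON fields are illustrative assumptions rather than the server's documented API; consult `scripts/server.py` for the actual endpoints and payload schema.

```bash
# Hypothetical request only: the /tasks route, port, and payload fields are assumptions,
# not the server's confirmed API; see scripts/server.py for the real routes.
curl -X POST "http://localhost:8000/tasks" \
  -H "Content-Type: application/json" \
  -d '{"task_type": "url", "target": "https://example.com", "case_name": "demo_case"}'

# Poll the task status afterwards (again, an assumed route):
curl "http://localhost:8000/tasks/demo_case"
```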

```
AppEvalPilot/
├── main.py # Main program entry
├── gradio_app.py # Gradio web interface for test configuration and execution
├── setup.py # Package setup script
├── assets/ # Media assets for documentation
│ ├── images/ # Images for README and documentation
│ └── videos/ # Demo videos showcasing functionality
├── appeval/ # Core modules
│ ├── roles/ # Role definitions
│ │ ├── eval_runner.py # Automated testing role
│ │ └── osagent.py # Operating system agent
│ ├── actions/ # Action definitions
│ │ ├── screen_info_extractor.py # Screen information extraction
│ │ ├── case_generator.py # Test case generation
│ │ └── reflection.py # Reflection and analysis
│ ├── tools/ # Tool definitions
│ │ ├── chrome_debugger.py # Browser debugging tool
│ │ ├── icon_detect.py # Icon detection and description tool
│ │ ├── device_controller.py # Device control tool
│ │ └── ocr.py # OCR recognition tool
│ ├── prompts/ # Prompt templates
│ │ ├── case_generator.py # Application evaluation prompts
│ │ └── osagent.py # OS agent prompts
│ ├── utils/ # Utility functions
│ │ ├── excel_json_converter.py # Excel and JSON format conversion utilities
│ │ └── window_utils.py # Window control and browser automation utilities
│ └── __init__.py # Package initialization
├── scripts/ # Script files
│ ├── server.py # Service deployment script
│ └── test_*.py # Various component test scripts
├── data/ # Data files
├── config/ # Configuration files
│ └── config2.yaml.example # Example configuration template
└── work_dirs/ # Working directories for runtime data
```
Contributions to AppEvalPilot from the research community are welcome. For inquiries, suggestions, or potential collaborations, please join our Discord community: MetaGPT
If you find AppEvalPilot useful, please consider citing our work:

```bibtex
@article{bian2025you,
  title={You Don't Know Until You Click: Automated GUI Testing for Production-Ready Software Evaluation},
  author={Bian, Yutong and Lin, Xianhao and Xie, Yupeng and Liu, Tianyang and Zhuge, Mingchen and Lu, Siyuan and Tang, Haoming and Wang, Jinlin and Zhang, Jiayi and Chen, Jiaqi and others},
  journal={arXiv preprint arXiv:2508.14104},
  year={2025}
}
```
This project is distributed under the MIT License - refer to the LICENSE file for comprehensive details.