This repository contains a complete pipeline for evaluating Large Language Models (LLMs) on the QCBench chemistry dataset using both traditional accuracy metrics and the xVerify verification system.
First, install the required dependencies:
pip install datasets swanlab
Repository files:
- inference.py - Runs LLM inference on the QCBench dataset
- eval.py - Evaluates model answers using numerical accuracy metrics
- xVerify_eval.py - Evaluates model answers using the xVerify verification system
- report.py - Generates comprehensive analysis reports
- QCBench.json - The chemistry dataset containing 350 questions
Next, run the inference script to generate model answers:
python inference.py --model your_model_name --workers 30
Important: Before running, update the following in inference.py:
- Replace your_url with your API endpoint URL
- Replace your_api_key with your API key
The script will:
- Load the QCBench dataset (350 chemistry questions)
- Send questions to your specified LLM model
- Save results to results/results_{model_name}.jsonl
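The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in inference.py: the query_model stub, the "question" and "model_answer" field names, and the record layout are all assumptions. A real run would replace the stub with a call to your API endpoint.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def query_model(question: str) -> str:
    """Hypothetical stand-in for the API call made in inference.py."""
    return f"(model answer to: {question})"


def run_inference(questions, out_path, workers=30):
    """Query the model concurrently and write one JSON object per line."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so answers line up with questions.
        answers = list(pool.map(lambda q: query_model(q["question"]), questions))
    with open(out_path, "w", encoding="utf-8") as f:
        for item, answer in zip(questions, answers):
            # "model_answer" is an assumed field name, chosen for illustration.
            record = {**item, "model_answer": answer}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(answers)
```

Threads suit this kind of workload because each worker spends most of its time waiting on the network. Lowering the workers value reduces the number of requests in flight at once.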
Update the file paths in eval.py:
- Replace your_result_path with the path to your inference results file
- Replace the second your_result_path with your desired output path
Then run:
python eval.py
This will:
- Extract answers from \boxed{} LaTeX environments
- Compare numerical answers with ground truth
- Calculate accuracy scores
- Save scored results to JSON format
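The extraction and comparison steps might look like the sketch below. It is not the logic in eval.py: the function names, the 1% relative tolerance, and the choice to take the last \boxed{} occurrence are assumptions for illustration. The simple regular expression also does not handle nested braces such as \boxed{\frac{1}{2}}.

```python
import re
from decimal import Decimal, InvalidOperation


def extract_boxed(text):
    """Return the contents of the last boxed answer in text, or None."""
    # Matches \boxed{...} where the contents contain no nested braces.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None


def is_correct(predicted, truth, rel_tol=Decimal("0.01")):
    """Compare two numeric strings within a relative tolerance (assumed 1%)."""
    try:
        p, t = Decimal(predicted), Decimal(truth)
        if t == 0:
            return p == 0
        return abs(p - t) / abs(t) <= rel_tol
    except (InvalidOperation, TypeError):
        # Non-numeric or missing answers are scored as incorrect.
        return False


def accuracy(pairs):
    """Fraction of (model_output, ground_truth) pairs judged correct."""
    if not pairs:
        return 0.0
    hits = sum(is_correct(extract_boxed(out), truth) for out, truth in pairs)
    return hits / len(pairs)
```

For example, accuracy([(r"Thus \boxed{3.14}", "3.15"), ("no boxed answer", "2")]) scores the first pair as correct and the second as incorrect.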
Update the input path in xVerify_eval.py:
- Replace the default input path with your inference results file path
Then run:
python xVerify_eval.py --input your_results_file.json
This will:
- Use the xVerify model to verify answer correctness
- Provide detailed verification scores
- Save verification results
Update the input path in report.py:
- Replace the default input path with your evaluation results file path
Then run:
python report.py --input your_evaluation_results.json --model your_model_name
This will:
- Calculate overall accuracy
- Calculate per-class accuracy for each chemistry category
- Generate detailed performance reports
- Optionally log results to SwanLab (use --no-swanlab to disable)
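The overall and per-class accuracy calculation can be sketched as below, including the merge of Biochemistry and Organic Chemistry into "BOC" that this README describes. The "class" and "score" field names are assumptions, and report.py may structure its records differently.

```python
from collections import defaultdict

# Categories reported together as "BOC", as described in this README.
MERGED = {"Biochemistry": "BOC", "Organic Chemistry": "BOC"}


def per_class_accuracy(records):
    """Return (overall_accuracy, {category: accuracy}) for scored records."""
    totals, correct = defaultdict(int), defaultdict(int)
    for rec in records:
        # "class" and "score" are hypothetical field names for illustration.
        category = MERGED.get(rec["class"], rec["class"])
        totals[category] += 1
        correct[category] += int(rec["score"])
    overall = sum(correct.values()) / sum(totals.values()) if totals else 0.0
    by_class = {c: correct[c] / totals[c] for c in totals}
    return overall, by_class
```

Merging before counting means the BOC accuracy is weighted by the number of questions in each of the two source categories. It is not the average of two separate accuracies.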
The QCBench dataset includes the following chemistry categories:
- Analytical Chemistry
- Biochemistry (merged with Organic as "BOC" in reports)
- Inorganic Chemistry
- Materials Science
- Organic Chemistry (merged with Biochemistry as "BOC" in reports)
- Physical Chemistry
- Polymer Chemistry
- Technical Chemistry
results/
├── results_{model_name}.jsonl # Raw inference results
├── results_{model_name}.json # Scored results (from eval.py)
└── xverify_results/ # xVerify evaluation results
reports/
└── report_{model_name}.json # Analysis reports
# 1. Run inference
python inference.py --model gpt-4o --workers 30
# 2. Evaluate accuracy
python eval.py
# 3. Generate report
python report.py --input data/results_acc/results_gpt-4o.json --model gpt-4o
To customize the pipeline:
- Update API endpoints and keys in inference.py
- Modify system prompts for different evaluation scenarios
- Adjust timeout and retry settings as needed
- Modify tolerance settings in eval.py for numerical comparisons
- Adjust xVerify model parameters in xVerify_eval.py
- Customize report generation options in report.py
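As an illustration of what a retry setting controls, here is a generic retry helper with exponential backoff. It is a sketch under stated assumptions, not the mechanism used in inference.py: the helper name, its parameters, and the backoff schedule are all hypothetical.

```python
import time


def call_with_retries(fn, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure, wait with exponential backoff and try again."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                # Out of retries: surface the last error to the caller.
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
```

If the API call is made with the requests library, a per-request timeout can be set through the timeout argument of requests.post. Wrapping that call in a helper like this one retries requests that time out.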
- The pipeline automatically merges the Biochemistry and Organic Chemistry categories into a single "BOC" category in reports
- All numerical comparisons use high-precision decimal arithmetic to avoid floating-point errors
- The xVerify evaluation requires the xVerify model to be properly installed and configured
- SwanLab integration is optional and can be disabled with the --no-swanlab flag
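The note on decimal arithmetic can be illustrated with a short, self-contained comparison. It only demonstrates the general behaviour of Python's decimal module and is not taken from eval.py.

```python
from decimal import Decimal

# Binary floats cannot represent 0.1 or 0.2 exactly, so the sum is slightly off.
float_equal = (0.1 + 0.2 == 0.3)

# Decimal values built from strings keep the digits exactly as written.
decimal_equal = (Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))

print(float_equal, decimal_equal)  # prints: False True
```

Constructing Decimal values from strings rather than floats is what keeps the comparison exact. Decimal(0.1) would inherit the float's rounding error.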
- API Connection Issues: Check your API endpoint and key configuration in inference.py
- File Path Errors: Ensure all file paths are correctly updated in each script
- xVerify Model Issues: Verify xVerify model installation and configuration
- Memory Issues: Reduce the number of workers in inference.py if encountering memory problems
Dependencies:
- datasets - For dataset loading utilities
- swanlab - For experiment tracking (optional)
- requests - For API calls
- tqdm - For progress bars
- decimal - For high-precision numerical operations
- json - For data serialization
- os - For file operations
- argparse - For command-line argument parsing