A comprehensive benchmarking framework for evaluating Large Language Models on fact-checking tasks using the Politifact dataset.
- Binary Classification: FACT vs FALSE detection
- Multiple Prompting Strategies: Zero-shot, One-shot, Few-shot
- Robust Evaluation: 5 test iterations with different random seeds
- Comprehensive Metrics: Accuracy, F1-scores, Precision, Recall, Confidence analysis
- Reproducibility: All configurations saved for future comparison
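For illustration of the binary setup only (this is not the repository's preprocessing code), a hedged sketch of how Politifact-style multi-level verdicts could be collapsed into the two labels; both the verdict strings and the mapping are assumptions:

```python
# Illustrative only: the project's own preprocessing may use a different mapping.
FACT_VERDICTS = {"true", "mostly-true"}
FALSE_VERDICTS = {"half-true", "mostly-false", "false", "pants-fire"}

def binarize(verdict: str) -> str:
    """Collapse a Politifact-style verdict into the FACT / FALSE labels used here."""
    v = verdict.strip().lower()
    if v in FACT_VERDICTS:
        return "FACT"
    if v in FALSE_VERDICTS:
        return "FALSE"
    raise ValueError(f"Unknown verdict: {verdict!r}")

print(binarize("Mostly-True"))  # -> FACT
```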
```bash
# Create virtual environment and install dependencies
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# Configure API keys
cp .env.example .env  # fill OPENAI_API_KEY & OPENAI_MODEL

# Download and extract the Politifact dataset
curl -L -o politifact-fact-check-dataset.zip https://www.kaggle.com/api/v1/datasets/download/rmisra/politifact-fact-check-dataset
unzip politifact-fact-check-dataset.zip -d ./data/politifact
```

Create train-validation-test splits with 5 test iterations for robustness:
```bash
python prepare_dataset.py
```

Options:

- `--seed`: Random seed (default: 42)
- `--n-iterations`: Number of test iterations (default: 5)
- `--train-ratio`: Training ratio (default: 0.6)
- `--val-ratio`: Validation ratio (default: 0.2)
- `--test-ratio`: Test ratio (default: 0.2)
Output:
- `data/splits/train.jsonl` - Training data
- `data/splits/val.jsonl` - Validation data
- `data/splits/test.jsonl` - Test data
- `data/splits/split_config.json` - Split configuration
- `data/splits/test_iterations.json` - Test iteration seeds (for reproducibility)
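Each split file is JSONL (one JSON object per line). A minimal loading sketch; the commented field names (`statement`, `label`) are assumptions about the schema, not guaranteed by the project:

```python
import json
from pathlib import Path

def load_split(path: str) -> list[dict]:
    """Read a JSONL split file into a list of dicts (one record per line)."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            if line.strip():
                records.append(json.loads(line))
    return records

train = load_split("data/splits/train.jsonl")
print(len(train), "training records")
# Field names below are assumptions for illustration only:
# print(train[0].get("statement"), train[0].get("label"))
```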
Run fact-checking experiments on test data across all iterations:
Recommended: Background execution (for long-running experiments):
```bash
# Run experiments in the background
nohup python run_experiments.py > logs/experiment.log 2>&1 &
```

Example - Run specific strategies:

```bash
python run_experiments.py --strategies zero_shot,few_shot --iterations 0,1,2
```

Example - Quick test:

```bash
python run_experiments.py --max-samples 50 --iterations 0
```

Output:

- `results/experiments/zero_shot/iteration_*.jsonl` - Zero-shot results
- `results/experiments/one_shot/iteration_*.jsonl` - One-shot results
- `results/experiments/few_shot/iteration_*.jsonl` - Few-shot results
- `results/experiments/experiment_metadata.json` - Experiment configuration
- `results/experiments/results_summary.json` - Results summary
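For a quick sanity check of a finished run, the per-iteration JSONL files can be tallied directly; `evaluate_experiments.py` remains the authoritative path, and the field names `prediction` and `gold_label` below are placeholders, not the actual record schema:

```python
import glob
import json

def quick_accuracy(pattern: str) -> None:
    """Print a rough accuracy per JSONL result file matching the glob pattern."""
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as f:
            records = [json.loads(line) for line in f if line.strip()]
        # "prediction" and "gold_label" are placeholder field names for illustration.
        correct = sum(r.get("prediction") == r.get("gold_label") for r in records)
        print(f"{path}: {correct}/{len(records)} correct")

quick_accuracy("results/experiments/zero_shot/iteration_*.jsonl")
```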
Evaluate all experiments and generate comparison reports:
```bash
python evaluate_experiments.py
```

Options:

- `--results-dir`: Experiments directory (default: `results/experiments`)
- `--output-dir`: Evaluation output directory (default: `results/evaluations`)
Output:
- `results/evaluations/detailed_evaluation.json` - Detailed metrics per iteration
- `results/evaluations/comparison_summary.csv` - CSV comparison table
- `results/evaluations/evaluation_report.md` - Comprehensive markdown report with:
  - Aggregated metrics (mean ± std) across iterations
  - Per-iteration breakdown
  - Strategy comparison
  - Best performing strategy analysis
  - Recommendations
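The CSV summary is convenient for ad-hoc inspection, e.g. with pandas; the column names used below (`strategy`, `f1`) are assumptions, so check `df.columns` against the actual file:

```python
import pandas as pd

# Load the CSV comparison table produced by evaluate_experiments.py.
df = pd.read_csv("results/evaluations/comparison_summary.csv")
print(df.head())

# "strategy" and "f1" are assumed column names for illustration;
# inspect df.columns to see the actual schema before filtering or sorting.
if {"strategy", "f1"}.issubset(df.columns):
    print(df.sort_values("f1", ascending=False)[["strategy", "f1"]])
```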
For quick single runs without the full experiment pipeline:
```bash
python -m src.run_factcheck [OPTIONS]
```

| Argument | Type | Default | Description |
|---|---|---|---|
| `--provider` | str | `"openai"` | Which model provider to use. Currently supports `"openai"`. Easily extendable for `"anthropic"`, `"azure_openai"`, `"ollama"`, etc. |
| `--model` | str | value from `.env` (`OPENAI_MODEL`) | The specific model name to use, e.g. `gpt-4o`, `gpt-4o-mini`, `gpt-4-turbo`, etc. |
| `--max_records` | int | `100` | Limits how many rows from the dataset to process (useful for testing). Set `0` or omit to process all rows. |
| `--split` | str | `"train"` | Which dataset split to load. Options: `train`, `valid`, `test`, or `all`. |
| `--prompt` | path | `src/prompts/fact_check.txt` | Path to the prompt file used by the model. You can easily swap this to test different prompting strategies (few-shot, CoT, etc.). |
| `--results` | path | `./results/openai_zero_shot.jsonl` | Output file path for the JSONL results. Each line is a single `FactCheckRecord`. |
The evaluation framework provides:
- Accuracy: Overall correctness
- Precision: True positives / (True positives + False positives)
- Recall: True positives / (True positives + False negatives)
- F1-Score: Harmonic mean of precision and recall
- Specificity: True negatives / (True negatives + False positives)
- Mean ± Std: Average performance with standard deviation
- Min/Max: Range of performance
- Consistency: Low std indicates robust performance
- Individual metrics for each test partition
- Confidence score analysis
- Confusion matrices
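The per-class metrics above follow the standard confusion-matrix definitions; a self-contained sketch of the computation (independent of this repository's evaluation code), assuming FACT is encoded as the positive class:

```python
def binary_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Accuracy, precision, recall, F1 and specificity from binary labels (1 = FACT, 0 = FALSE)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }

print(binary_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))
```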
```
.
├── data/
│ ├── politifact/ # Raw dataset
│ └── splits/ # Prepared train-val-test splits
├── results/
│ ├── experiments/ # Experiment results (per strategy/iteration)
│ └── evaluations/ # Evaluation reports and metrics
├── src/
│ ├── chains/ # LangChain fact-checking logic
│ ├── data_loaders/ # Dataset loaders
│ ├── evaluation/ # Evaluation utilities
│ ├── models/ # LLM wrappers
│ ├── prompts/ # Prompt templates
│ │ ├── fact_check.txt # Zero-shot prompt
│ │ ├── fact_check_oneshot.txt # One-shot prompt
│ │ └── fact_check_fewshot.txt # Few-shot prompt
│ └── utils/ # Helper utilities
├── prepare_dataset.py # Dataset splitting script
├── run_experiments.py # Main experiment runner
├── evaluate_experiments.py # Evaluation script
└── README.md
```
All experiment configurations are saved to ensure reproducibility:
- Dataset splits: Saved with random seeds in `data/splits/`
- Test iterations: Seeds for each iteration in `test_iterations.json`
- Experiment metadata: Model, strategies, timestamps in `experiment_metadata.json`
- Results: Complete JSONL records with all predictions and metadata
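The exact JSON layout of these files is determined by `prepare_dataset.py` and `run_experiments.py`; a minimal, schema-agnostic sketch for inspecting the saved configuration before attempting to reproduce a run:

```python
import json
from pathlib import Path

for name in ["data/splits/split_config.json",
             "data/splits/test_iterations.json",
             "results/experiments/experiment_metadata.json"]:
    path = Path(name)
    if path.exists():
        # Pretty-print whatever configuration was saved, so it can be compared
        # against the settings of a new run before claiming reproducibility.
        print(f"--- {name} ---")
        print(json.dumps(json.loads(path.read_text(encoding="utf-8")), indent=2))
```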
To reproduce experiments:
- Use the same split configuration from `data/splits/`
- Use the same iteration seeds from `test_iterations.json`
- Run with the same model and prompts
- Results will be identical (deterministic)