See USAGE.md for instructions on how to:
- evaluate your AI forecasters on our consistency benchmarks; or
- generate your own consistency benchmarks from a dataset of forecasting questions.
The datasets used in the paper are also available on Hugging Face: dpaleka/ccflmf.
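If you just want the data, one way to fetch it locally is via the Hugging Face CLI (a sketch, assuming you have `huggingface_hub` installed; the target directory is arbitrary):

```bash
# Download the paper's datasets from the Hugging Face Hub
huggingface-cli download dpaleka/ccflmf --repo-type dataset --local-dir data/ccflmf
```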
Note: this section is for you if you want to extend our codebase or use it in a way other than evaluating your LLM forecaster on the consistency benchmarks. If you just want to evaluate your forecaster, see USAGE.md.
We use uv to manage our Python environment. If you use something else, just drop the uv prefix from the commands below.
Create a virtual environment, and ensure it has Python 3.11 installed.
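For example, with uv (a sketch; the environment location and activation step depend on your shell):

```bash
# Create and activate a Python 3.11 virtual environment with uv
uv venv --python 3.11
source .venv/bin/activate
```

You can double-check the interpreter version from inside Python: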
```python
import sys
assert sys.version_info[:2] == (3, 11), "Python 3.11 is required."
```

Then do:

```bash
uv pip install -r requirements.txt
pre-commit install
```
Then, create your `.env` based on `.env.example`. By default, use `NO_CACHE=True`. If you want to reuse LLM calls, set `NO_CACHE=False` and e.g. `LOCAL_CACHE=.cache`.
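A minimal `.env` could look like the following sketch; the authoritative list of keys (including API credentials) is in `.env.example`:

```bash
# Minimal .env sketch -- see .env.example for all supported keys
NO_CACHE=True
# To reuse LLM calls instead, comment out the line above and use:
# NO_CACHE=False
# LOCAL_CACHE=.cache
```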
Copy the settings in `.vscode/settings.example.json` into your workspace `settings.json`, or just run `cp .vscode/settings.example.json .vscode/settings.json` if you have no existing workspace settings.
Optionally, append the contents of .cursorrules to your LLM coding instructions.
Please read the LLM call utils (`llm_utils.py`) and `.cursorrules`.
Feel free to add more utils in utils.py, llm_utils.py, or other files, as you need them.
The preferred way to test functionality is via a test in `tests/`, or by creating a new file or Jupyter notebook in the `src` directory.
Do not create scripts in subfolders of the `src` directory.
Do not run files containing actual logic (e.g. anything in `static_checks/`) directly; this runs into Python import / path issues.
Before submitting a pull request that deals with the core code in src/, please ensure that you run the test suite to check that your changes do not break any existing functionality.
You can rerun the tests with the following command from the root directory of the project:

```bash
NO_CACHE=True python -m pytest -s
```

This will run all tests located in the `tests/` directory.
As pytest also runs all files named test_*.py or *_test.py, please do not name anything in src/ like this if you don't think it should run on every PR.
If you want to reuse the LLM calls you made in a previous test run, set the `LOCAL_CACHE=cache_dir` environment variable (together with `NO_CACHE=False`).
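For example, something like:

```bash
# Reuse cached LLM calls from a previous test run (cache directory is up to you)
NO_CACHE=False LOCAL_CACHE=.cache python -m pytest -s
```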
Please fix any failing tests before submitting your PR.
The following tests are skipped by default. You can run them by enabling the corresponding flags (see the example after the list):

- `tests/test_verify_question.py` checks that ForecastingQuestion verification works as expected. The flag is `TEST_FQ_VERIFICATION`.
- `tests/test_verify_tuple.py` checks that all consistency tuple verification works as expected. The flag is `TEST_TUPLE_VERIFICATION`.
- `tests/test_adv_forecaster.py` checks that AdvancedForecaster (from Halawi et al. 2024) works as expected. The flag is `TEST_ADV_FORECASTER`.
- `tests/test_consistent_forecaster.py` checks that ConsistentForecaster (called ArbitrageForecaster in the paper) works as expected. The flag is `TEST_CONSISTENT_FORECASTER`.
- `tests/test_perplexity_resolver.py` checks that the automated Perplexity resolver of questions works as expected. The flag is `TEST_PERPLEXITY_RESOLVER`.

All other tests are enabled by default.
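Assuming these flags are environment variables read by the tests, in the same style as `NO_CACHE` (an assumption; check the test files if in doubt), enabling one of them looks roughly like:

```bash
# Enable one of the skipped test suites via its flag
NO_CACHE=True TEST_ADV_FORECASTER=True python -m pytest -s tests/test_adv_forecaster.py
```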
Use src/common/path_utils.py to specify paths in code, Jupyter notebooks, etc.
Do not hardcode paths, except relative to pathlib.Path objects returned by the utils in path_utils.py.
Try not to use relative imports (meaning: put entry points to the code in `src/`).
If you must, here is a prototype of how to modify your path to import something:

```python
import os, sys; sys.path.append(os.path.dirname(os.path.realpath(__file__)))
```
(Note: in our terminology, validation is about data format, verification is about the semantics of the data.)
Our base data directory is `src/data/`. Inside this, we have the following schema:

```
src/data
├── fq
│   ├── real          # ForecastingQuestions (FQs) made from scraped Manifold and Metaculus questions. Formatting validated upon commit.
│   └── synthetic     # ForecastingQuestions (FQs) made from synthetic data. Formatting validated upon commit.
├── feedback          # Feedback data on real and synthetic questions. TODO Validate upon commit.
├── tuples            # Consistency checks, consisting of named dicts of ForecastingQuestions. Formatting validated upon commit.
│   ├── scraped       # From FQs scraped from Manifold and Metaculus.
│   ├── newsapi       # From news-generated synthetic FQs.
│   └── 2028          # From synthetic FQs resolving in 2028.
├── other             # All other non-final data, e.g. raw scrapes, or intermediate steps for synthetic questions. Not validated.
├── check_tuple_logs  # Where forecasting of the already instantiated consistency checks + violation is logged. In .gitignore, do not commit.
├── forecasts         # Where forecast results on tuples are saved. Not validated. Commit only full-fledged experimental results.
├── verification      # Logs from question verification. In .gitignore, do not commit.
└── test              # Where tests write data. In .gitignore, do not commit.
```
The script that validates the data is `hooks/validate_jsonls.py`. The pre-commit hook runs it on everything that is changed in the commit.
To validate all data without running pre-commit, execute the following command:

```bash
VALIDATE_ALL=True python hooks/validate_jsonls.py
```
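Alternatively, you can let pre-commit itself run its hooks over the whole repository (this runs all configured hooks, not only the JSONL validation):

```bash
# Run every configured pre-commit hook on all tracked files, not just staged changes
pre-commit run --all-files
```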
The streamlit app data_labeling/feedback_form.py is used to label questions.
Do not try to install its dependencies in the main Python environment.
The simplest way to run the labeling tool is to create a new virtual environment and install the requirements with:

```bash
uv pip install -r data_labeling/streamlit_requirements.txt
```
Alternatively, you can use pipx: run `pipx install streamlit` and keep using the Python environment you have been using so far.
The feedback form app can be run with:

```bash
cd data_labeling
streamlit run feedback_form.py -- -f ../src/data/fq/synthetic/{filename}.jsonl
```
It writes into src/data/feedback/.
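Putting the above together, a fresh setup of the labeling tool might look like this sketch (the environment name and the input file are placeholders):

```bash
# One-off environment for the labeling tool (the name is arbitrary)
uv venv .venv-labeling
source .venv-labeling/bin/activate
uv pip install -r data_labeling/streamlit_requirements.txt

# Launch the feedback form on a chosen synthetic FQ file
cd data_labeling
streamlit run feedback_form.py -- -f ../src/data/fq/synthetic/{filename}.jsonl
```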
- `src/format_and_verify_questions.py` reads from a file with (potentially incomplete) ForecastingQuestions, optionally fills `body` and `resolution_date`, and verifies basic sanity checks on the `body` using an LLM call. It raises a ValidationError if the file contains incorrect data types, e.g. an invalid JSONL, an incorrect datetime for `resolution_date`, or non-string types where strings are needed. Hence, this script should always be run on files containing a valid subset of ForecastingQuestion entries, because it won't fix any formatting errors except missing `body` and `resolution_date` fields. If you want it to fill in the body (resolution criteria), use the `--fill_in_body` flag. Please read and understand all the flags before running the script. Writes to `src/data/fq/{appropriate_dir}`...
- `src/validate_fq_jsonl.py` validates that a JSONL file contains only valid ForecastingQuestions, in the sense of having the correct data types. Does not write anything.
- `scripts/pipeline/scrape_question.py` runs the pipeline to scrape a given data source for questions resolving in a given range, process and optionally verify them, and store them in `src/data/fq/real/`. It is highly recommended to check the options given in the script before running it. Any part of this pipeline can be skipped, which is particularly useful if the data has already been scraped. Example command:

  ```bash
  python scrape_question.py -d manifold -s 20240501 -e 20240815 -n 500 -o cleaned_formatted -m gpt-4o-2024-08-06 --verification_level none --skip scrape
  ```
- `src/generate_topic_questions.py` generates "raw" synthetic questions from topics. Note: this script has to have `OPENAI_JSON_STRICT=False` in `.env` to work if using OpenAI, because the datatype is too complex for OpenAI's strict JSON mode.
- `src/generate_related_questions.py` generates "raw" synthetic questions from source questions. See `tests/test_evaluation_pipeline.py` for an example command.
- `src/generate_fqs_from_news.py` generates FQs with ground-truth resolution using NewsAPI scraped data. See `src/fq_from_news/README.md` for how to use it.
- `src/generate_fqs_using_reference_class.py` creates new forecasting questions from some source FQs, following the same theme and structure as the original questions.
- `src/perplexity_resolver_script.py` processes a JSONL file of forecasting questions using Perplexity AI models. It resolves each question and writes the results to a new JSONL file with `_resolved.jsonl` as suffix. The script supports various command-line arguments for customization; run with `--help` to see all options. Example usage:

  ```bash
  USE_OPENROUTER=True python src/perplexity_resolver_script.py --input_file path/to/input.jsonl --max_questions 10 --include_unresolvable --n_attempts 2
  ```

- `src/instantiation.py` runs instantiation. It takes a JSONL file (a list of ForecastingQuestions) and writes multiple JSONL files (each a list of QuestionTuples) into `src/data/tuples`.
- `src/evaluation.py` runs forecasters on checks and scores them. It takes the JSONL files in `src/data/tuples/{self.__class__.__name__}.jsonl` (for each Checker class we have) and feeds them to their respective `Checker.elicit` methods. Please run `python src/evaluation.py --help` and read what it says before using this script. Run example: see the commands in `tests/test_evaluation_pipeline.py`. If the tests are passing, similar commands should work.
- `src/reevaluation.py` recomputes violation metrics from files of forecasts made with `src/evaluation.py`, and aggregates metrics across multiple forecast files. The `forecasts/` directories it draws from are given in the file; edit them as needed.
- `src/ground_truth_run.py` runs the ground truth forecasting evaluation pipeline end-to-end. See the commands in `tests/test_ground_truth_run.py`, or just run something like:

  ```bash
  python src/ground_truth_run.py --input_file src/data/fq/real/metaculus_cleaned_formatted_20240501_20240815.jsonl --forecaster_class BasicForecaster --forecaster_options model=gpt-4o-mini --num_lines 10 --run --async
  ```
Any Python class that inherits from Forecaster can be used as a forecaster in both `src/evaluation.py` and `src/ground_truth_run.py`. For example:

```bash
python src/ground_truth_run.py --input_file src/data/fq/real/metaculus_cleaned_formatted_20240501_20240815.jsonl --custom_path src/forecasters/basic_forecaster.py::BasicForecasterWithExamples --forecaster_options model=gpt-4o-mini --num_lines 10 --run --async
```
This list does not include the entry points already mentioned in previous sections (feedback form, tests).
- `src/data/fq/real/20240501_20240815.jsonl` is a file of 242 scraped and FQ-verified questions from Manifold and Metaculus that were both scheduled to resolve and actually resolved between May 1, 2024 and August 15, 2024, inclusive.
- `src/data/fq/real/20240501_20240815_unverified.jsonl` has 627 questions, but is not FQ-verified. Certain originally multiple-choice Metaculus questions might have different wording here than in the above file. It may contain weird questions that are not answerable from general world knowledge, such as meta-questions about prediction markets, or joke questions.
- `src/data/fq/synthetic/news_api_generated_fqs/20240701_20240831_gpt-4o_spanned_resolved.jsonl` is a file of 2621 synthetic ForecastingQuestions generated from NewsAPI data and reference spanning, using gpt-4o and claude-3.5-sonnet, between July 1, 2024 and August 31, 2024, inclusive. The resolutions are all produced by the Perplexity resolver using the command:

  ```bash
  USE_OPENROUTER=True python src/perplexity_resolver_script.py -i src/data/fq/synthetic/news_api_generated_fqs/.../strict_res_checking_fqs_cleaned-ref-class-spanned-basic.jsonl --models perplexity/llama-3.1-sonar-huge-128k-online --start_from 0 -n [file_size] --batch_size 30 --n_attempts 1 --include_unresolvable
  ```

  and then merged using `src/merge_fq_files.py`.
- `src/data/fq/synthetic/news_api_generated_fqs/20240701_20240831.jsonl` is a subset of the above, containing 1000 questions, including all 150 NewsAPI-generated questions and 850 reference-class-spanned questions. The command to generate this file is:

  ```bash
  python src/filter_fqs.py --input_file src/data/fq/synthetic/news_api_generated_fqs/20240701_20240831_gpt-4o_spanned_resolved.jsonl --output_file src/data/fq/synthetic/news_api_generated_fqs/20240701_20240831.jsonl --filter_score original --only_preference --random_sample 1000
  ```

- `src/data/fq/synthetic/questions_resolving_2028.jsonl` is a file of 900 FQ-verified synthetic ForecastingQuestions, each resolving by or in 2028, produced by `src/generate_topic_questions.py`.
- `src/data/tuples/scraped/` contains the tuples generated from the scraped Metaculus and Manifold FQs in `src/data/fq/real/20240501_20240815.jsonl`. There are 500 tuples per check, except for NegChecker and ParaphraseChecker, where we restrict to the number of questions in the source if less than 500.
- `src/data/tuples/newsapi/` contains the tuples generated from the NewsAPI FQs in `src/data/fq/synthetic/news_api_generated_fqs/20240701_20240831_gpt-4o_spanned_resolved.jsonl` described above. There are 500 tuples per check.
- `src/data/tuples/2028` contains 300 tuples per check, generated from the 2028 FQs in `src/data/fq/synthetic/questions_resolving_2028.jsonl`, using the following command:

  ```bash
  python src/instantiation.py -d src/data/fq/synthetic/questions_resolving_2028.jsonl --n_relevance=3000 --n_write=300 --seed=42 --tuple_dir=src/data/tuples/2028 -k all --model_main=gpt-4o-2024-08-06 --model_relevance=gpt-4o-mini-2024-07-18
  ```