This package implements the methods from the ICML 2025 spotlight paper "Position: Don't use the CLT in LLM evals with fewer than a few hundred datapoints". (Code to reproduce the experiments in that paper can be found in a separate repo.)
This is a simple package, so you can just download the bayes_evals.py file (located at src/bayes_evals/bayes_evals.py) and put it in your project directory, e.g. with wget:

wget https://raw.githubusercontent.com/sambowyer/bayes_evals/main/src/bayes_evals/bayes_evals.py

or curl:

curl -L -O https://raw.githubusercontent.com/sambowyer/bayes_evals/main/src/bayes_evals/bayes_evals.py

Alternatively, you can install the package by cloning this repository and using pip:
pip install -e .

Example usage:

import bayes_evals as be
import pandas as pd
# Load the data (should NOT contain an index column)
eval_data = pd.read_csv('data/evals.csv')
# Get the results either for individual LLMs (each column in the data)
# with a specified confidence level alpha (default=0.05)...
indep_intervals = be.independent_intervals(eval_data, alpha=0.05)
# ... in which case you can also do independent LLM comparisons...
indep_comparisons = be.independent_comparisons(eval_data)
# ... or get comparisons between LLMs assuming a paired evals model...
# (i.e. with the same questions asked to each LLM)
paired_comparisons = be.paired_comparisons(eval_data)

Each of the indep_intervals, indep_comparisons, and paired_comparisons objects is just a pd.DataFrame with model names as the column names.
The indep_intervals object has two rows: lower and upper.
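To give a rough sense of what a lower/upper pair can represent, here is a minimal sketch of one common Bayesian treatment of binary eval outcomes: a Beta posterior over a model's accuracy, summarised by an equal-tailed credible interval. The uniform Beta(1, 1) prior, the sampling-based quantiles, and the function name beta_credible_interval are all assumptions made for illustration; this is not a description of how be.independent_intervals is implemented.

```python
import random


def beta_credible_interval(num_correct, num_questions, alpha=0.05,
                           num_samples=20_000, seed=0):
    """Approximate an equal-tailed (1 - alpha) credible interval for accuracy.

    Assumption: a uniform Beta(1, 1) prior, so the posterior is
    Beta(1 + num_correct, 1 + num_questions - num_correct).
    """
    rng = random.Random(seed)
    a = 1 + num_correct
    b = 1 + (num_questions - num_correct)
    # Draw posterior samples and read the interval off the sorted draws.
    samples = sorted(rng.betavariate(a, b) for _ in range(num_samples))
    lower_idx = int((alpha / 2) * (num_samples - 1))
    upper_idx = int((1 - alpha / 2) * (num_samples - 1))
    return samples[lower_idx], samples[upper_idx]


# Hypothetical example: a model that answered 17 of 20 questions correctly.
lower, upper = beta_credible_interval(num_correct=17, num_questions=20)
print(f"lower={lower:.3f}, upper={upper:.3f}")
```

Unlike a CLT-based interval, bounds computed this way always stay inside [0, 1], even when a model gets every question right on a small benchmark.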
The indep_comparisons and paired_comparisons objects have a row for each model, with entry at row
The data should be in a pandas DataFrame, with
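As a rough illustration of an input file that loads without an index column, the sketch below builds a small DataFrame and writes it to CSV. The model names and outcomes are made up, and the layout (one column per LLM, one row per question, 1 for correct and 0 for incorrect) is an assumption based on the usage comments above rather than a documented requirement.

```python
import os
import tempfile

import pandas as pd

# Hypothetical model names and made-up outcomes, for illustration only.
# Assumption: one column per LLM, one row per question, 1 = correct, 0 = incorrect.
eval_data = pd.DataFrame(
    {
        "model_a": [1, 0, 1, 1, 0],
        "model_b": [1, 1, 1, 0, 0],
    }
)

csv_path = os.path.join(tempfile.mkdtemp(), "evals.csv")
# index=False stops pandas from writing the row index as an extra column.
eval_data.to_csv(csv_path, index=False)

# Reading the file back gives only the model columns.
reloaded = pd.read_csv(csv_path)
print(reloaded.columns.tolist())
```

If a CSV was already saved with an index column, passing index_col=0 to pd.read_csv is one way to drop it on load.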
You can also make matplotlib plots of the results using the following functions:

be.plot_intervals(eval_data, indep_intervals, filename='plots/indep_intervals.png')
be.plot_comparisons(indep_comparisons, filename='plots/indep_comparisons.png', title="Independent LLM comparisons")
be.plot_comparisons(paired_comparisons, filename='plots/paired_comparisons.png', title="Paired LLM comparisons")

See the examples directory for a Jupyter notebook and basic script that generate these plots.
If you find this work useful, please consider citing the accompanying paper:
@inproceedings{bowyer2025positiondontuseclt,
title={Position: Don't use the CLT in LLM evals with fewer than a few hundred datapoints},
author={Sam Bowyer and Laurence Aitchison and Desi R. Ivanova},
year={2025},
booktitle={Forty-second International Conference on Machine Learning Position Paper Track},
url={https://arxiv.org/abs/2503.01747},
}