`bayes_evals`: A lightweight library for Bayesian analysis of LLM evals

This package implements the methods from the ICML 2025 spotlight paper "Position: Don't use the CLT in LLM evals with fewer than a few hundred datapoints". (Code to reproduce the experiments in that paper can be found in a different repo.)

Installation

This is a simple package, so you can just download the bayes_evals.py file (located at src/bayes_evals/bayes_evals.py) and put it in your project directory, e.g. with wget:

wget https://raw.githubusercontent.com/sambowyer/bayes_evals/main/src/bayes_evals/bayes_evals.py

or curl:

curl -L -O https://raw.githubusercontent.com/sambowyer/bayes_evals/main/src/bayes_evals/bayes_evals.py

Alternatively, you can install the package by cloning this repository and using pip:

pip install -e .

Usage

import bayes_evals as be
import pandas as pd

# Load the data (the CSV should NOT contain an index column)
eval_data = pd.read_csv('data/evals.csv')

# Get interval estimates for individual LLMs (each column in the data)
# at a specified level alpha (default=0.05, i.e. 95% intervals)...
indep_intervals = be.independent_intervals(eval_data, alpha=0.05)

# ... in which case you can also do independent LLM comparisons...
indep_comparisons = be.independent_comparisons(eval_data)

# ... or get comparisons between LLMs assuming a paired evals model...
# (i.e. with the same questions asked to each LLM)
paired_comparisons = be.paired_comparisons(eval_data)

Each of the indep_intervals, indep_comparisons, and paired_comparisons objects is a pd.DataFrame with model names as the column names. The indep_intervals object has two rows: lower and upper. The indep_comparisons and paired_comparisons objects have one row per model, with the entry at row $i$ and column $j$ being the probability that model $i$ is better than model $j$.
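To make the meaning of the interval bounds and the comparison probabilities concrete, here is a minimal standard-library sketch. It assumes a Beta-Bernoulli model with a uniform Beta(1, 1) prior on each model's accuracy and treats the two models as independent. These modelling choices, the helper names, and the example counts (18/20 and 10/20 correct) are illustrative assumptions, not a description of how bayes_evals is implemented; the paired comparison in particular is likely to use a different model.

```python
import random


def beta_posterior_samples(correct, total, n_samples=20000, rng=None):
    """Sample accuracies from Beta(1 + correct, 1 + incorrect).

    This is the posterior under an assumed uniform Beta(1, 1) prior.
    """
    rng = rng or random.Random(0)
    incorrect = total - correct
    return [rng.betavariate(1 + correct, 1 + incorrect) for _ in range(n_samples)]


def quantile_interval(samples, alpha=0.05):
    """Equal-tailed (1 - alpha) interval read off the sorted samples."""
    ordered = sorted(samples)
    last = len(ordered) - 1
    return ordered[int((alpha / 2) * last)], ordered[int((1 - alpha / 2) * last)]


def prob_better(samples_i, samples_j):
    """Monte Carlo estimate of P(accuracy_i > accuracy_j)."""
    wins = sum(a > b for a, b in zip(samples_i, samples_j))
    return wins / len(samples_i)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
samples_a = beta_posterior_samples(correct=18, total=20, rng=rng)  # hypothetical model A
samples_b = beta_posterior_samples(correct=10, total=20, rng=rng)  # hypothetical model B

lower, upper = quantile_interval(samples_a, alpha=0.05)
p_a_better = prob_better(samples_a, samples_b)

print(f"Model A 95% interval: [{lower:.3f}, {upper:.3f}]")
print(f"P(A better than B) ~= {p_a_better:.3f}")
```

In this sketch, lower and upper play the role of the two rows of an interval table, and p_a_better plays the role of a single comparison entry (row A, column B).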

Data format

The data should be in a pandas DataFrame with $Q$ rows (one per question) and $M$ columns (one per LLM). Every entry should be binary, with 1 indicating a correct answer and 0 indicating an incorrect answer. The columns should be named with the LLMs' names.
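As an illustration of this layout, the following sketch writes CSV text in the expected shape using only the standard library: a header row of model names, one row per question, 0/1 entries, and no index column. The model names, the scores, and the helper to_eval_csv are hypothetical and are not part of bayes_evals.

```python
import csv
import io

# Hypothetical per-model correctness records (1 = correct, 0 = incorrect),
# with one entry per question, in the same question order for every model.
results = {
    "model_a": [1, 0, 1, 1],
    "model_b": [0, 0, 1, 1],
}


def to_eval_csv(results):
    """Return CSV text with one column per model and one row per question."""
    names = list(results)
    n_questions = len(results[names[0]])
    for name in names:
        scores = results[name]
        if len(scores) != n_questions:
            raise ValueError(f"{name} has {len(scores)} answers, expected {n_questions}")
        if any(score not in (0, 1) for score in scores):
            raise ValueError(f"{name} contains non-binary entries")

    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(names)  # header: model names only, no index column
    for q in range(n_questions):
        writer.writerow([results[name][q] for name in names])
    return buffer.getvalue()


csv_text = to_eval_csv(results)
print(csv_text)
```

Saving csv_text to a file such as data/evals.csv should give something that pd.read_csv can load directly into a $Q \times M$ DataFrame.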

Displaying results

You can also make matplotlib plots of the results using the following functions:

be.plot_intervals(eval_data, indep_intervals, filename='plots/indep_intervals.png')

be.plot_comparisons(indep_comparisons, filename='plots/indep_comparisons.png', title="Independent LLM comparisons")

be.plot_comparisons(paired_comparisons, filename='plots/paired_comparisons.png', title="Paired LLM comparisons")

Examples

See the examples directory for a Jupyter notebook and a basic script that generate the above plots.

Citing

If you find this work useful, please consider citing the accompanying paper:

@inproceedings{bowyer2025positiondontuseclt,
      title={Position: Don't use the CLT in LLM evals with fewer than a few hundred datapoints}, 
      author={Sam Bowyer and Laurence Aitchison and Desi R. Ivanova},
      year={2025},
      booktitle={Forty-second International Conference on Machine Learning Position Paper Track},
      url={https://arxiv.org/abs/2503.01747}, 
}
