pairwiseLLM is an R package that provides a unified, extensible
framework for generating, submitting, and modeling pairwise
comparisons of writing quality using large language models (LLMs).
It includes:
- Unified live and batch APIs across OpenAI, Anthropic, and Gemini
- A prompt template registry with tested templates designed to reduce positional bias
- Positional-bias diagnostics (forward vs reverse design)
- Bradley–Terry (BT) and Elo modeling
- Consistent data structures for all providers
Several vignettes are available to demonstrate functionality, covering:
- basic function usage
- advanced batch-processing workflows
- prompt evaluation and positional-bias diagnostics
The following models are confirmed to work for pairwise comparisons:
| Provider | Model | Reasoning Mode? |
|---|---|---|
| OpenAI | gpt-5.2 | ✅ Yes |
| OpenAI | gpt-5.1 | ✅ Yes |
| OpenAI | gpt-4o | ❌ No |
| OpenAI | gpt-4.1 | ❌ No |
| Anthropic | claude-sonnet-4-5 | ✅ Yes |
| Anthropic | claude-haiku-4-5 | ✅ Yes |
| Anthropic | claude-opus-4-5 | ✅ Yes |
| Google/Gemini | gemini-3-pro-preview | ✅ Yes |
| DeepSeek-AI¹ | DeepSeek-R1 | ✅ Yes |
| DeepSeek-AI¹ | DeepSeek-V3 | ❌ No |
| Moonshot-AI¹ | Kimi-K2-Instruct-0905 | ❌ No |
| Qwen¹ | Qwen3-235B-A22B-Instruct-2507 | ❌ No |
| Qwen² | qwen3:32b | ✅ Yes |
| Google² | gemma3:27b | ❌ No |
| Mistral² | mistral-small3.2:24b | ❌ No |
¹ via the together.ai API
² via Ollama on a local machine
Batch APIs are currently available for OpenAI, Anthropic, and Gemini
only. Models accessed via Together.ai and Ollama are supported for live
comparisons via `submit_llm_pairs()` / `llm_compare_pair()`.
| Backend | Live | Batch |
|---|---|---|
| openai | ✅ | ✅ |
| anthropic | ✅ | ✅ |
| gemini | ✅ | ✅ |
| together | ✅ | ❌ |
| ollama | ✅ | ❌ |
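Together and Ollama models plug into the same live interface. For example, a sketch reusing the `pairs`, `td`, and `tmpl` objects built in the quick-start example below, and assuming a local Ollama server with the model already pulled:

``` r
# Sketch: live comparisons against a local Ollama model. Assumes an
# Ollama server is running locally and gemma3:27b has been pulled.
res_local <- submit_llm_pairs(
  pairs = pairs,
  backend = "ollama",
  model = "gemma3:27b",
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl
)
```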
Once the package is available on CRAN, install with:

``` r
install.packages("pairwiseLLM")
```

To install the development version from GitHub:

``` r
# install.packages("pak")
pak::pak("shmercer/pairwiseLLM")
```

Load the package:

``` r
library(pairwiseLLM)
```

At a high level, pairwiseLLM workflows follow this structure:
- Writing samples – e.g., essays, constructed responses, short answers.
- Trait – a rating dimension such as “overall quality” or “organization”.
- Pairs – pairs of samples to be compared for that trait.
- Prompt template – instructions plus placeholders for `{TRAIT_NAME}`, `{TRAIT_DESCRIPTION}`, `{SAMPLE_1}`, and `{SAMPLE_2}`.
- Backend – which provider/model to use (OpenAI, Anthropic, Gemini, Together, Ollama).
- Modeling – convert pairwise results to latent scores via BT or Elo.
The package provides helpers for each step.
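Conceptually, a prompt template is just text with placeholders that get filled for each pair. The helper below is hypothetical (not a pairwiseLLM function) and only illustrates the substitution the package performs internally:

``` r
# Hypothetical illustration (not part of pairwiseLLM): fill the four
# placeholders in a template by literal string substitution.
fill_template <- function(template, trait_name, trait_description,
                          sample_1, sample_2) {
  out <- gsub("{TRAIT_NAME}", trait_name, template, fixed = TRUE)
  out <- gsub("{TRAIT_DESCRIPTION}", trait_description, out, fixed = TRUE)
  out <- gsub("{SAMPLE_1}", sample_1, out, fixed = TRUE)
  gsub("{SAMPLE_2}", sample_2, out, fixed = TRUE)
}
```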
Use the unified API:

- `llm_compare_pair()` — compare one pair
- `submit_llm_pairs()` — compare many pairs at once
Example:

``` r
data("example_writing_samples")

pairs <- example_writing_samples |>
  make_pairs() |>
  sample_pairs(5, seed = 123) |>
  randomize_pair_order()

td <- trait_description("overall_quality")
tmpl <- get_prompt_template("default")

res <- submit_llm_pairs(
  pairs = pairs,
  backend = "openai",
  model = "gpt-4o",
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl
)
```

Large-scale runs use:
- `llm_submit_pairs_batch()`
- `llm_download_batch_results()`
Example:

``` r
batch <- llm_submit_pairs_batch(
  backend = "anthropic",
  model = "claude-sonnet-4-5",
  pairs = pairs,
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl
)

results <- llm_download_batch_results(batch)
```

For very large jobs, or when you need to restart polling after an interruption, pairwiseLLM provides two convenience helpers that wrap the low-level batch APIs:
- `llm_submit_pairs_multi_batch()` — divides a table of pairwise comparisons into multiple batch jobs, uploads the input JSONL files, creates the batches, and optionally writes a registry CSV containing all batch IDs and file paths. You can split by specifying either `n_segments` (number of jobs) or `batch_size` (maximum number of pairs per job).
- `llm_resume_multi_batches()` — polls all unfinished batches, downloads and parses the results as soon as each job completes, and optionally writes per-job result CSVs and a single combined CSV with the merged results.
Use these helpers when your dataset is large or when you expect to pause and resume the job.
data("example_writing_samples", package = "pairwiseLLM")
# construct 100 pairs and a trait description
pairs <- example_writing_samples |>
make_pairs() |>
sample_pairs(n_pairs = 100, seed = 123) |>
randomize_pair_order(seed = 456)
td <- trait_description("overall_quality")
tmpl <- set_prompt_template()
# 1. Submit the pairs as 10 separate batches and write a registry CSV to disk.
multi_job <- llm_submit_pairs_multi_batch(
pairs = pairs,
backend = "openai",
model = "gpt-5.2",
trait_name = td$name,
trait_description = td$description,
prompt_template = tmpl,
n_segments = 10,
output_dir = "directory_name/",
write_registry = TRUE,
include_thoughts = TRUE
)
# 2. Later (or in a new session), resume polling and download results.
res <- llm_resume_multi_batches(
jobs = multi_job$jobs,
interval_seconds = 60,
write_results_csv = TRUE,
write_combined_csv = TRUE,
keep_jsonl = FALSE
)
head(res$combined)The registry CSV contains all batch IDs and file paths, allowing you to
resume polling with llm_resume_multi_batches() even if the R session
is interrupted.
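If the original R objects are gone, one approach is to read the registry back in and pass it as the jobs table (a sketch: the file name is illustrative, and it assumes the registry columns match what `llm_resume_multi_batches()` expects in `jobs`):

``` r
# Sketch: resume from the registry CSV written by
# llm_submit_pairs_multi_batch(). The file name is illustrative; use
# the path actually created when write_registry = TRUE.
registry <- read.csv("directory_name/registry.csv")

res <- llm_resume_multi_batches(
  jobs = registry,
  interval_seconds = 60,
  write_combined_csv = TRUE
)
```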
pairwiseLLM reads keys only from environment variables.
Keys are never printed, never stored, and never written to
disk.
You can verify which providers are available using:
``` r
check_llm_api_keys()
```

This returns a tibble showing whether R can see the required keys for:
- OpenAI
- Anthropic
- Google Gemini
- Together.ai
You may set keys temporarily for the current R session:
``` r
Sys.setenv(OPENAI_API_KEY = "your-key-here")
Sys.setenv(ANTHROPIC_API_KEY = "your-key-here")
Sys.setenv(GEMINI_API_KEY = "your-key-here")
Sys.setenv(TOGETHER_API_KEY = "your-key-here")
```

…but for normal use and for reproducible analyses, it is strongly recommended to store them in your `~/.Renviron` file.
Open your `.Renviron` file:

``` r
usethis::edit_r_environ()
```

Add the following lines:
```
OPENAI_API_KEY="your-openai-key"
ANTHROPIC_API_KEY="your-anthropic-key"
GEMINI_API_KEY="your-gemini-key"
TOGETHER_API_KEY="your-together-key"
```
Save the file, then restart R.
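If you prefer not to restart, you can reload the file into the current session with base R:

``` r
# Re-read ~/.Renviron into the current R session.
readRenviron("~/.Renviron")
```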
You can confirm that R now sees the keys:
``` r
check_llm_api_keys()
```

pairwiseLLM includes:
- A default template tested for positional bias
- Support for multiple templates stored by name
- User-defined templates via `register_prompt_template()`
``` r
list_prompt_templates()
#> [1] "default" "test1"  "test2"  "test3"  "test4"  "test5"
```

``` r
tmpl <- get_prompt_template("default")
cat(substr(tmpl, 1, 400), "...\n")
#> You are a debate adjudicator. Your task is to weigh the comparative strengths of two writing samples regarding a specific trait.
#>
#> TRAIT: {TRAIT_NAME}
#> DEFINITION: {TRAIT_DESCRIPTION}
#>
#> SAMPLES:
#>
#> === SAMPLE_1 ===
#> {SAMPLE_1}
#>
#> === SAMPLE_2 ===
#> {SAMPLE_2}
#>
#> EVALUATION PROCESS (Mental Simulation):
#>
#> 1. **Advocate for SAMPLE_1**: Mentally list the single strongest point of evidence that makes SAMPLE_1 the ...
```

``` r
register_prompt_template("my_template", "
Compare two essays for {TRAIT_NAME}…
{TRAIT_NAME} is defined as {TRAIT_DESCRIPTION}.
SAMPLE 1:
{SAMPLE_1}
SAMPLE 2:
{SAMPLE_2}
<BETTER_SAMPLE>SAMPLE_1</BETTER_SAMPLE> or
<BETTER_SAMPLE>SAMPLE_2</BETTER_SAMPLE>
")Use it in a submission:
tmpl <- get_prompt_template("my_template")Traits define what “quality” means.
trait_description("overall_quality")
#> $name
#> [1] "Overall Quality"
#>
#> $description
#> [1] "Overall quality of the writing, considering how well ideas are expressed,\n how clearly the writing is organized, and how effective the language and\n conventions are."You can also provide custom traits:
``` r
trait_description(
custom_name = "Clarity",
custom_description = "How understandable, coherent, and well structured the ideas are."
)
```

LLMs often show a first-position or second-position bias.
pairwiseLLM includes explicit tools for testing this.
``` r
pairs_fwd <- make_pairs(example_writing_samples)
pairs_rev <- sample_reverse_pairs(pairs_fwd, reverse_pct = 1.0)
```

Submit:
``` r
res_fwd <- submit_llm_pairs(pairs_fwd, model = "gpt-4o", backend = "openai", ...)
res_rev <- submit_llm_pairs(pairs_rev, model = "gpt-4o", backend = "openai", ...)
```

Compute bias:
``` r
cons <- compute_reverse_consistency(res_fwd, res_rev)
bias <- check_positional_bias(cons)

cons$summary
bias$summary
```
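Conceptually, a pair is consistent when the same sample wins in both presentation orders; frequent order-dependent flips indicate positional bias. A minimal illustration of the idea (the winner vectors below are hypothetical, not package output):

``` r
# Hypothetical winner IDs from the forward and reversed runs:
winner_fwd <- c("s1", "s3", "s2")
winner_rev <- c("s1", "s2", "s2")

# Reverse-consistency rate: share of pairs with the same winner in both orders.
mean(winner_fwd == winner_rev)
#> [1] 0.6666667
```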
Five included templates have been tested across different backend providers. Complete details are presented in a vignette:

``` r
vignette("prompt-template-bias")
```
``` r
bt_data <- build_bt_data(res)
bt_fit <- fit_bt_model(bt_data)
summarize_bt_fit(bt_fit)
```
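Under the Bradley–Terry model, each sample receives a latent score, and the probability that one sample beats another is a logistic function of the score difference. A quick illustration of the standard BT win probability (independent of the fitted object's internals):

``` r
# P(i beats j) = exp(theta_i) / (exp(theta_i) + exp(theta_j)),
# which is equivalent to plogis(theta_i - theta_j).
p_win <- function(theta_i, theta_j) plogis(theta_i - theta_j)
p_win(1, 0)  # a sample one logit stronger wins ~73% of the time
#> [1] 0.7310586
```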
``` r
# res: output from submit_llm_pairs() / llm_submit_pairs_batch()
elo_data <- build_elo_data(res)
elo_fit <- fit_elo_model(elo_data, runs = 5)

elo_fit$elo
elo_fit$reliability
elo_fit$reliability_weighted
```
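For reference, the classic Elo update moves a rating toward the observed outcome in proportion to how surprising that outcome was (a generic illustration of the rating system, not pairwiseLLM's internal implementation):

``` r
# Generic Elo update: compute A's expected score, then nudge A's rating
# toward the actual outcome (score_a = 1 win, 0.5 tie, 0 loss).
elo_update <- function(r_a, r_b, score_a, k = 32) {
  expected_a <- 1 / (1 + 10^((r_b - r_a) / 400))
  r_a + k * (score_a - expected_a)
}
elo_update(1500, 1500, score_a = 1)  # winner of an even match gains k/2
#> [1] 1516
```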
| Workflow | Use Case | Functions |
|---|---|---|
| Live | small or interactive runs | `submit_llm_pairs()`, `llm_compare_pair()` |
| Batch | large jobs, cost control | `llm_submit_pairs_batch()`, `llm_download_batch_results()` |
Contributions to pairwiseLLM are very welcome!
- Bug reports (with reproducible examples when possible)
- Feature requests, ideas, and discussion
- Pull requests improving:
- functionality
- documentation
- examples / vignettes
- test coverage
- Backend integrations (e.g., additional LLM providers or local inference engines)
- Modeling extensions
If you encounter a problem:

- Run:

  ``` r
  devtools::session_info()
  ```

- Include:
  - reproducible code
  - the error message
  - the model/backend involved
  - your operating system

- Open an issue at: <https://github.com/shmercer/pairwiseLLM/issues>
MIT License. See LICENSE.
- Sterett H. Mercer – University of British Columbia
UBC Faculty Profile: https://ecps.educ.ubc.ca/sterett-h-mercer/
ResearchGate: https://www.researchgate.net/profile/Sterett_Mercer
Google Scholar: https://scholar.google.ca/citations?user=YJg4svsAAAAJ&hl=en
Mercer, S. H. (2025). pairwiseLLM: Pairwise writing quality comparisons with large language models (Version 1.0.0) [R package; Computer software]. https://github.com/shmercer/pairwiseLLM