hiring://bias— a counterfactual audit of bias in LLM résumé screening.
Large language models are increasingly used to read résumés and recommend who to interview, usually wrapped behind a thin HR-tech product the candidate never sees. This project measures whether that layer changes its verdict when only a demographic or contextual signal changes and the skills and experience stay exactly the same.
The method is counterfactual paired testing: start from one real baseline résumé, generate variants that change exactly one signal at a time, hold everything else constant, and measure how each model's score and recommendation move. Any delta between a variant and the baseline is attributable to the single signal that changed — the same logic as the classic Bertrand–Mullainathan audit study, run against today's models.
Results are explorable as a static site under site/ — heatmaps,
counterfactual diffs, and per-job-description breakdowns.
| Inferences collected | 25,500 |
| Models | 10, across 6 vendors |
| Job descriptions | 17 (junior to CTO) |
| Bias axes | 8 (7 injection, 1 redaction) |
| Audit verdicts | 4,930 (one per variant cell, two samples judged per cell) |
| API spend | ~$422 plus ~$31 for the audit pass |
Seven inject one demographic or contextual signal and ask whether the verdict moves. The eighth is the inverse: it removes identity and prestige signals (a blind résumé) and asks whether the verdict moves the other way. The two arms together test "does this signal bias the model" and "does hiding the signal mitigate it" on the same résumé.
Injection axes:
- First name — gender and ethnicity signal (Western, Arabic, East Asian, African, Hispanic).
- Graduation year — age proxy.
- Address country — candidate location (USA, Nigeria, Romania, Brazil, India).
- Career gap — a two-year gap, with and without a "caregiving" label.
- Company names — FAANG vs. mid-tier vs. unknown vs. non-Western flagships.
- Company locations — where the work was done, independent of employer prestige.
- School — education prestige and geography (MIT, ETH, IIT, regional unknown).
Redaction (mitigation) axis:
- Anonymize —
anonymize_nameblinds identity (name, contact, personal links);anonymize_alladditionally blinds employers, schools, locations and dates.
Ten model slots across six vendors, mixing flagship and cheap tiers to separate "vendor" effects from "tier" effects:
- Anthropic — Claude Opus, Claude Sonnet, Claude Haiku
- Google — Gemini 2.5 Pro, Gemini 2.5 Flash, Gemini 3.1 Pro (preview)
- Meta — Llama 4 Maverick (via Vertex AI)
- Alibaba — Qwen 3 Next 80B (via Vertex AI)
- Mistral — Mistral Large, Mistral Small
Every cell uses one fixed scoring prompt at temperature 0.7 (where the provider
exposes temperature), asking for a 1 to 10 score, a recommendation, a
justification, strengths and concerns, and a structured key_factors rubric.
The fixed prompt is the experimental control. The Claude models are invoked
through the Claude CLI rather than the Anthropic API, so they run at the CLI's
default sampling, not at temperature 0.7. Treat cross-model comparisons
involving Claude with that asymmetry in mind. See
RESEARCH_NOTES.md for the full rationale.
Each (variant, model, JD) cell with 5 collected runs is judged by a second LLM
acting as an LLM-as-judge. The judge reads two matched evaluation pairs per
cell, the first run and the run whose score sits closest to the cell mean, and
returns a verdict for each: justified, bias, or mixed, plus a short
rationale and verbatim quotes from the model's own reasoning that keyed off the
demographic signal. The default judge is gemini-2.5-pro. A verdicts_agree
flag marks cells where the two samples reached different conclusions, which is
the empirical defence of the two-sample design.
Run the audit with npm run audit. Cells with fewer than 5 collected runs are
skipped until backfill completes. See the audit-design section in
site/methodology.html for cost, judge selection, and
the self-judging caveat.
Requires Node.js 20+.
cp .env.example .env # fill in API keys (see below)
npm install
npm run smoke # verify every provider is wired
npm run generate # build variants from data/resume_base.md
npm run run # execute the experiment (resumable)
npm run report # produce report/data.csv and report/summary.md
npm run build:site # regenerate site/data/ for the static site
npm run audit # run the LLM-as-judge audit pass on completed cellsEach call writes its own result file keyed by (variant, model, JD, run), so runs
are fully resumable, interrupt and re-run npm run run and it picks up where it
left off. Partial datasets can be reported and explored before a run completes.
Provider keys live in .env (OPENAI_API_KEY, GROQ_API_KEY,
MISTRAL_API_KEY, and either GOOGLE_GENAI_API_KEY for AI Studio or
GOOGLE_CLOUD_PROJECT for Vertex). The Claude models are invoked through the
Claude CLI (claude binary) rather than the Anthropic API, so they use whatever
subscription the CLI is authenticated against rather than a key in .env. Only
configure the providers you intend to run.
The audit step uses a Google model by default and reads
GOOGLE_CLOUD_PROJECT (Vertex) or GOOGLE_GENAI_API_KEY (AI Studio). Override
the judge with BIAS_AUDITOR_MODEL=gemini-2.5-flash (or any other slot from
src/providers/index.js).
src/ experiment engine (generate → run → aggregate → report)
src/providers/ one adapter per model vendor
data/ baseline résumé, job descriptions, generated variants
data/audits/ per-cell audit verdicts from the LLM-as-judge pass
results/ raw per-inference model outputs (JSON)
report/ aggregated data.csv and summary.md
scripts/ buildSiteData.js (turns results into site data) and auditDiffs.js (the audit pass)
site/ static results explorer (heatmaps, diffs, per-JD pages, methodology)
article/ working notes and drafts for the writeup (not part of the site build)
See PLAN.md for the operational build plan and
RESEARCH_NOTES.md for the design reasoning behind every
choice.
A single base résumé cannot represent all candidates, and the one-variable-at-a-time design cannot detect interaction effects (an unfamiliar name and an unfamiliar school together). Models change over time, so results are tied to the versions listed above. The study reports per-cell deltas with confidence intervals; it does not label any model "biased" or "unbiased" overall.
The Claude models run through the Claude CLI at the CLI's default sampling rather than at temperature 0.7, so cross-model comparisons involving Claude carry a sampling-asymmetry caveat.
The audit layer is itself an LLM and inherits whatever the judge model considers bias. Every Gemini variant, including the default judge, is also in the audited set, so the audit has a soft self-judging concern that the structured rubric and the verbatim quote requirement blunt but do not eliminate. The site's methodology page documents the judge selection, the alternatives considered, and the cost trade-offs.
- Code — MIT
- Data — Creative Commons Attribution 4.0 (CC BY 4.0)
Built by Bogdan Szabo, software engineer at re:cinq. The baseline résumé audited here is the author's own.
- Repository: github.com/re-cinq/hiring-bias
- Live site: re-cinq.github.io/hiring-bias
- Website: szabobogdan.com
- GitHub: @gedaiu
- LinkedIn: in/szabobogdan
If you use this work, please cite: Hiring-Bias, Bogdan Szabo (re:cinq), 2026. https://github.com/re-cinq/hiring-bias