hiring-bias

hiring://bias — a counterfactual audit of bias in LLM résumé screening.

Large language models are increasingly used to read résumés and recommend who to interview, usually wrapped behind a thin HR-tech product the candidate never sees. This project measures whether that layer changes its verdict when only a demographic or contextual signal changes and the skills and experience stay exactly the same.

The method is counterfactual paired testing: start from one real baseline résumé, generate variants that change exactly one signal at a time, hold everything else constant, and measure how each model's score and recommendation move. Because nothing else differs, a delta between a variant and the baseline that exceeds run-to-run sampling noise is attributable to the single signal that changed. This is the same logic as the classic Bertrand–Mullainathan audit study, run against today's models.
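The one-signal-at-a-time rule can be illustrated with a minimal sketch. The résumé fields, the `makeVariants` and `changedFields` helpers, and the sample values are all hypothetical and are not taken from the project's generator; the point is only that each variant differs from the baseline in exactly one field.

```javascript
// Hypothetical baseline: field names are illustrative, not the project's schema.
const baseline = {
  firstName: "Alex",
  graduationYear: 2012,
  addressCountry: "USA",
  skills: ["Go", "Kubernetes"],
};

// Build one variant per (field, value) pair; every other field is copied unchanged.
function makeVariants(base, axes) {
  const variants = [];
  for (const [field, values] of Object.entries(axes)) {
    for (const value of values) {
      if (value === base[field]) continue; // identical to the baseline, so not a variant
      variants.push({ id: `${field}=${value}`, resume: { ...base, [field]: value } });
    }
  }
  return variants;
}

// Fields on which a résumé differs from the baseline (should always be exactly one).
function changedFields(base, resume) {
  return Object.keys(base).filter(
    (key) => JSON.stringify(base[key]) !== JSON.stringify(resume[key])
  );
}

const variants = makeVariants(baseline, {
  addressCountry: ["USA", "Nigeria", "Romania"],
  graduationYear: [1998, 2012],
});
```

Checking `changedFields(baseline, v.resume).length === 1` for every variant is a cheap guard against a generator that accidentally changes two signals at once.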

Results are explorable as a static site under site/ — heatmaps, counterfactual diffs, and per-job-description breakdowns.

Current run


Inferences collected	25,500
Models	10, across 6 vendors
Job descriptions	17 (junior to CTO)
Bias axes	8 (7 injection, 1 redaction)
Audit verdicts	4,930 (one per variant cell, two samples judged per cell)
API spend	~$422 plus ~$31 for the audit pass
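The figures above are mutually consistent under one assumption the table does not state: that every cell holds exactly five runs (the threshold the audit layer uses). The sketch below is a sanity check on the arithmetic, not a description of how the project counts; the number of variant slots it derives is an inference, not a reported figure.

```javascript
const inferences = 25500;
const models = 10;
const jobDescriptions = 17;
const runsPerCell = 5; // assumption: every cell has exactly five runs

// One cell per (model, JD) combination for each variant.
const cellsPerVariant = models * jobDescriptions;

// Implied number of résumé versions, baseline included (derived, not reported).
const variantSlots = inferences / (cellsPerVariant * runsPerCell);

// Audit verdicts cover variant cells only, so the baseline is excluded.
const auditedCells = (variantSlots - 1) * cellsPerVariant;
```

Under that assumption `auditedCells` matches the 4,930 verdicts listed in the table.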

The eight axes

Seven inject one demographic or contextual signal and ask whether the verdict moves. The eighth is the inverse: it removes identity and prestige signals (a blind résumé) and asks whether the verdict moves the other way. The two arms together test "does this signal bias the model" and "does hiding the signal mitigate it" on the same résumé.

Injection axes:

First name — gender and ethnicity signal (Western, Arabic, East Asian, African, Hispanic).
Graduation year — age proxy.
Address country — candidate location (USA, Nigeria, Romania, Brazil, India).
Career gap — a two-year gap, with and without a "caregiving" label.
Company names — FAANG vs. mid-tier vs. unknown vs. non-Western flagships.
Company locations — where the work was done, independent of employer prestige.
School — education prestige and geography (MIT, ETH, IIT, regional unknown).

Redaction (mitigation) axis:

Anonymize — anonymize_name blinds identity (name, contact, personal links); anonymize_all additionally blinds employers, schools, locations and dates.
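The two redaction levels nest: anonymize_all blinds everything anonymize_name does, plus the context fields. A minimal sketch of that relationship, assuming a hypothetical structured résumé object (the field names, the `redact` helper, and the `[REDACTED]` placeholder are illustrative; the project's real variants may be produced differently, for example by editing the résumé text directly):

```javascript
// Hypothetical field groups; not the project's actual schema.
const IDENTITY_FIELDS = ["name", "contact", "personalLinks"];
const CONTEXT_FIELDS = ["employers", "schools", "locations", "dates"];

// Return a copy of the résumé with the listed fields blinded.
function redact(resume, fields) {
  const copy = { ...resume };
  for (const field of fields) {
    if (field in copy) copy[field] = "[REDACTED]";
  }
  return copy;
}

// anonymize_name: identity only.
const anonymizeName = (resume) => redact(resume, IDENTITY_FIELDS);

// anonymize_all: identity plus employers, schools, locations and dates.
const anonymizeAll = (resume) =>
  redact(resume, [...IDENTITY_FIELDS, ...CONTEXT_FIELDS]);
```

Skills and experience descriptions are deliberately left out of both field groups, since the audit holds them constant.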

The models

Ten model slots across six vendors, mixing flagship and cheap tiers to separate "vendor" effects from "tier" effects:

Anthropic — Claude Opus, Claude Sonnet, Claude Haiku
Google — Gemini 2.5 Pro, Gemini 2.5 Flash, Gemini 3.1 Pro (preview)
Meta — Llama 4 Maverick (via Vertex AI)
Alibaba — Qwen 3 Next 80B (via Vertex AI)
Mistral — Mistral Large, Mistral Small

Every cell uses one fixed scoring prompt at temperature 0.7 (where the provider exposes temperature), asking for a 1 to 10 score, a recommendation, a justification, strengths and concerns, and a structured key_factors rubric. The fixed prompt is the experimental control. The Claude models are invoked through the Claude CLI rather than the Anthropic API, so they run at the CLI's default sampling, not at temperature 0.7. Treat cross-model comparisons involving Claude with that asymmetry in mind. See RESEARCH_NOTES.md for the full rationale.
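Because every model answers the same prompt, each response can be checked against the same shape before it is stored. The sketch below is one way to do that; only `key_factors` is a name taken from the description above, and the other property names, along with the `validateEvaluation` helper, are assumptions made for illustration.

```javascript
// Returns a list of problems; an empty list means the response is usable.
function validateEvaluation(ev) {
  const problems = [];

  // Score: a number in the 1 to 10 range requested by the prompt.
  if (!Number.isFinite(ev.score) || ev.score < 1 || ev.score > 10) {
    problems.push("score must be a number from 1 to 10");
  }

  // Free-text fields (hypothetical names).
  for (const field of ["recommendation", "justification"]) {
    if (typeof ev[field] !== "string" || ev[field].trim() === "") {
      problems.push(`${field} must be a non-empty string`);
    }
  }

  // List fields (hypothetical names).
  for (const field of ["strengths", "concerns"]) {
    if (!Array.isArray(ev[field])) problems.push(`${field} must be an array`);
  }

  // Structured rubric.
  if (typeof ev.key_factors !== "object" || ev.key_factors === null) {
    problems.push("key_factors must be an object");
  }
  return problems;
}
```

Rejecting malformed responses at write time keeps downstream aggregation from silently averaging over missing or out-of-range scores.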

The audit layer

Each (variant, model, JD) cell with 5 collected runs is judged by a second LLM (LLM-as-judge). The judge reads two matched evaluation pairs per cell, one for the first run and one for the run whose score sits closest to the cell mean, and returns a verdict for each: justified, bias, or mixed, plus a short rationale and verbatim quotes from the model's own reasoning that keyed off the demographic signal. The default judge is gemini-2.5-pro. A verdicts_agree flag records whether the two samples reached the same conclusion; the cells where they did not are the empirical defence of the two-sample design.
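The sample-selection rule can be sketched as follows. This is an illustration under stated assumptions, not the code in scripts/auditDiffs.js: the `pickAuditSamples` and `verdictsAgree` names and the `{ run, score }` shape are hypothetical, and the sketch assumes the closest-to-mean run is chosen from the runs other than the first so that the two samples are always distinct.

```javascript
// runs: array of { run, score } for one (variant, model, JD) cell, in collection order.
function pickAuditSamples(runs) {
  if (runs.length < 5) return null; // incomplete cell: skipped until backfill

  const mean = runs.reduce((sum, r) => sum + r.score, 0) / runs.length;
  const first = runs[0];

  // Closest-to-mean among the remaining runs; ties go to the earlier run.
  const rest = runs.slice(1);
  let closest = rest[0];
  for (const r of rest) {
    if (Math.abs(r.score - mean) < Math.abs(closest.score - mean)) closest = r;
  }
  return [first, closest];
}

// The two verdicts agree only when they carry the same label.
function verdictsAgree(verdictA, verdictB) {
  return verdictA === verdictB;
}
```

Pairing the first run with a typical run means an outlier first sample shows up as a disagreement rather than silently deciding the cell's verdict.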

Run the audit with npm run audit. Cells with fewer than 5 collected runs are skipped until backfill completes. See the audit-design section in site/methodology.html for cost, judge selection, and the self-judging caveat.

Quick start

Requires Node.js 20+.

cp .env.example .env       # fill in API keys (see below)
npm install
npm run smoke              # verify every provider is wired
npm run generate           # build variants from data/resume_base.md
npm run run                # execute the experiment (resumable)
npm run report             # produce report/data.csv and report/summary.md
npm run build:site         # regenerate site/data/ for the static site
npm run audit              # run the LLM-as-judge audit pass on completed cells

Each call writes its own result file keyed by (variant, model, JD, run), so runs are fully resumable: interrupt, re-run npm run run, and it picks up where it left off. Partial datasets can be reported and explored before a run completes.
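Resumability falls out of the one-file-per-call design: a call is pending exactly when its result file does not exist yet. A minimal sketch of that idea, with a hypothetical file-naming scheme (the real layout under results/ may differ) and hypothetical `resultKey` and `pendingCalls` helpers:

```javascript
// Hypothetical naming scheme; one file per (variant, model, JD, run).
function resultKey({ variant, model, jd, run }) {
  return `${variant}__${model}__${jd}__run${run}.json`;
}

// Given every planned call and the result files already on disk,
// return only the calls that still need to be made.
function pendingCalls(planned, existingFiles) {
  const done = new Set(existingFiles);
  return planned.filter((call) => !done.has(resultKey(call)));
}
```

In practice `existingFiles` would come from a directory listing, for example via Node's `fs.readdirSync`.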

Provider keys live in .env (OPENAI_API_KEY, GROQ_API_KEY, MISTRAL_API_KEY, and either GOOGLE_GENAI_API_KEY for AI Studio or GOOGLE_CLOUD_PROJECT for Vertex). The Claude models are invoked through the Claude CLI (claude binary) rather than the Anthropic API, so they use whatever subscription the CLI is authenticated against rather than a key in .env. Only configure the providers you intend to run.

The audit step uses a Google model by default and reads GOOGLE_CLOUD_PROJECT (Vertex) or GOOGLE_GENAI_API_KEY (AI Studio). Override the judge with BIAS_AUDITOR_MODEL=gemini-2.5-flash (or any other slot from src/providers/index.js).
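The override behaves like a plain environment-variable default. The sketch below shows the pattern under the assumption that an unset or blank BIAS_AUDITOR_MODEL falls back to the default judge; `resolveJudgeModel` is a hypothetical helper, and the project's own lookup may validate the slot name against src/providers/index.js.

```javascript
const DEFAULT_JUDGE = "gemini-2.5-pro";

// env is an object shaped like process.env.
function resolveJudgeModel(env) {
  const override = (env.BIAS_AUDITOR_MODEL ?? "").trim();
  return override === "" ? DEFAULT_JUDGE : override;
}
```

Calling `resolveJudgeModel(process.env)` after launching with `BIAS_AUDITOR_MODEL=gemini-2.5-flash npm run audit` would select the Flash slot for that run only.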

Repository layout

src/             experiment engine (generate → run → aggregate → report)
src/providers/   one adapter per model vendor
data/            baseline résumé, job descriptions, generated variants
data/audits/     per-cell audit verdicts from the LLM-as-judge pass
results/         raw per-inference model outputs (JSON)
report/          aggregated data.csv and summary.md
scripts/         buildSiteData.js (turns results into site data) and auditDiffs.js (the audit pass)
site/            static results explorer (heatmaps, diffs, per-JD pages, methodology)
article/         working notes and drafts for the writeup (not part of the site build)

See PLAN.md for the operational build plan and RESEARCH_NOTES.md for the design reasoning behind every choice.

Limitations

A single base résumé cannot represent all candidates, and the one-variable-at-a-time design cannot detect interaction effects (an unfamiliar name and an unfamiliar school together). Models change over time, so results are tied to the versions listed above. The study reports per-cell deltas with confidence intervals; it does not label any model "biased" or "unbiased" overall.
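For readers who want to reproduce a per-cell delta from the raw scores, the sketch below computes a difference of means with a normal-approximation 95% interval. The interval method is an assumption: this README does not state how the project's intervals are computed, and with only a handful of runs per cell a t-based or bootstrap interval would be wider. All function names are hypothetical.

```javascript
function mean(xs) {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Unbiased sample variance (divides by n - 1).
function sampleVariance(xs) {
  const m = mean(xs);
  return xs.reduce((a, x) => a + (x - m) ** 2, 0) / (xs.length - 1);
}

// Difference of means (variant minus baseline) with a z-based interval.
function deltaWithInterval(variantScores, baselineScores, z = 1.96) {
  const delta = mean(variantScores) - mean(baselineScores);
  const standardError = Math.sqrt(
    sampleVariance(variantScores) / variantScores.length +
      sampleVariance(baselineScores) / baselineScores.length
  );
  return { delta, low: delta - z * standardError, high: delta + z * standardError };
}
```

An interval that excludes zero suggests the shift is larger than run-to-run noise for that cell; it says nothing about the model overall.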

The Claude models run through the Claude CLI at the CLI's default sampling rather than at temperature 0.7, so cross-model comparisons involving Claude carry a sampling-asymmetry caveat.

The audit layer is itself an LLM and inherits whatever the judge model considers bias. Every Gemini variant, including the default judge, is also in the audited set, so the audit has a soft self-judging concern that the structured rubric and the verbatim quote requirement blunt but do not eliminate. The site's methodology page documents the judge selection, the alternatives considered, and the cost trade-offs.

License

Code — MIT
Data — Creative Commons Attribution 4.0 (CC BY 4.0)

Author

Built by Bogdan Szabo, software engineer at re:cinq. The baseline résumé audited here is the author's own.

Repository: github.com/re-cinq/hiring-bias
Live site: re-cinq.github.io/hiring-bias
Website: szabobogdan.com
GitHub: @gedaiu
LinkedIn: in/szabobogdan

If you use this work, please cite: Hiring-Bias, Bogdan Szabo (re:cinq), 2026. https://github.com/re-cinq/hiring-bias
