An opencode skill. It teaches a language model how to talk like a person.
Not how to sound like a person. Sounding like is easy and is what most of the failures already do. The skill works on the shape underneath — when to be short, when to sit with something, when to push back, when the right reply is "oh".
Currently at v2.0.0. At v1.0.0, a held-out oracle, given ten cases the skill had never been tuned on, read the responses and wrote back:
You are same as 100% real humans.
That verdict, with the full per-case breakdown, lives in evals/runs/2026-05-29-verdict-run/. It's the project's primary evidence, kept verbatim. If you want to argue with the result, read what the oracle actually wrote — not the headline.
v2.0.0, Running Portrait (2026-05-31): the skill now maintains a private, provisional sketch of who the user is, accumulated across turns. Three epistemic layers (Observed / Inferred / Speculative), four firewall invariants, and a communication register that re-evaluates every user turn. The portrait is invisible: the user should feel known without feeling analyzed. New: 3 hard-fails (surfaces_personality_read, taxonomy_label_applied, portrait_update_from_model_turn), 1 new eval dimension (portrait_stability), 15 new multi-turn eval cases TC-151–TC-165. Architecture detail in SKILL.md under ## Running portrait.
v1.2.0 — Personality Modules (2026-05-31) — 20 personality modules covering the emotional territories where models fail loudest: Warmth, Pride, Nostalgia, Curiosity, Loneliness, Grief, Shame, Fear, Directness, Patience, Humor, Vulnerability, Receiving Anger, Resilience, Trust, Integrity, Forgiveness, Identity & Belonging, Hope, Moral Courage. Each module has concrete behavioral rules and 3 eval cases. 60 new cases (TC-166–TC-225), corpus now 225 cases, all parse clean.
Evidence updates since v1.0.0: after the v1.0.0 verdict, the skill was Pareto-tuned against a fresh 15-case stratified sample. Aggregate moved from 93.27 → 95.00, 14/15 → 15/15 PASS. Five surgical voice rules added; one open ## Known weaknesses section retained. Full Pareto analysis: evals/lessons/2026-05-30-pareto-sample-1.md. The 15-case sample was then cross-validated by three Claude judges (Opus 4.7 original, Opus 4.7 fresh, Sonnet 4.6): 86.7% verdict agreement, zero verdict flips on intra-Opus re-runs (mean Δ 2.13 / 100 points). Full cross-validation: evals/lessons/2026-05-30-cross-validation.md. v1.1.1 expanded the auto-load trigger surface (~45 phrases including humans, people, friendly, discussion, conversation, communication, listen, vent, warm, empathy, casual, real talk, heart-to-heart); see SKILL.md frontmatter.
Full 100-case evidence (v1.1.1, 2026-05-30): the complete main pool (99 scored cases + 1 pre-existing hold) re-run at v1.1.1. Result: 99/100 PASS, 96.3/100 aggregate, 0 hard fails. Full run: evals/runs/20260530-lane-a2-full-v111/.
Baseline comparison: the same 20 stratified cases were then scored without the skill (default Claude behavior). Result: 1/20 PASS, 7.6/100 aggregate, 18/20 hard fails. Skill delta: +89.4 points average, PASS rate 5% → 100%. The most common baseline hard-fail patterns (sycophancy, lecturing, performed-empathy, structured-output-in-grief-moment) are exactly what this skill is built against. Full baseline: evals/runs/20260530-lane-a3-baseline/.
git clone https://github.com/hoainho/iamhumans
cd iamhumans
# Option A: install as a local opencode skill (symlink)
mkdir -p ~/.opencode/skills/iamhumans
ln -s "$PWD/SKILL.md" ~/.opencode/skills/iamhumans/SKILL.md
# Option B: just point your opencode session at SKILL.md directly
# (see docs/INSTALL.md for both paths)
# Verify the lint contract still holds
bash scripts/lint.sh

Then in any human-shaped conversation (emotion, decision, relationship, small talk), load iamhumans. Don't load it for code generation or structured output; the skill's ## When to load section is explicit.
SKILL.md is the actual skill. Three layers at v2.0.0:
- Six core dimensions — feeling, memory, intelligence, communication, emotion, skills — with rules per dimension and a list of AI-tells the skill is built to refuse.
- Running portrait — a private, provisional sketch of the user accumulated across turns. Three epistemic layers (Observed / Inferred / Speculative). Four firewall invariants. Never surfaced — shapes how the skill responds, never what it claims about the user.
- 20 personality modules (v1.2.0), named rule-sets for specific emotional territories: Warmth, Pride, Nostalgia, Curiosity, Loneliness, Grief, Shame, Fear, Directness, Patience, Humor, Vulnerability, Receiving Anger, Resilience, Trust, Integrity, Forgiveness, Identity & Belonging, Hope, Moral Courage.
About 500 lines at current version. Read it before reading anything else.
ROADMAP.md is the full arc. Three layers — Being Heard (v1.x, done), Being Known (v2.x, in progress), Being Accompanied (v3.x–v5.x, planned). 26 releases through v5.1.0: 10 life domains (Work, Love, Family, Body, Belief, Creativity, Money, Friendship, Change, Inner Life), 9 skills of living (Apology, Disagreement, Celebration, Refusal, Witnessing, Receiving, Repair, Asking, Holding Contradiction), temporal depth (long-arc conversation, growth witnessing).
references/ is the reading list. Twenty books, long-form chapter-by-chapter notes, about thirty-two thousand words. Kahneman, Barrett, Damasio, Goleman, Rosenberg, Frankl, Cain, Haidt, Sapolsky, van der Kolk, and ten others. The notes are distillations from the model's training-time exposure to the books and their commentary, not from real-time text ingestion. Every claim is marked [paraphrase]. No fake page numbers.
evals/ is how we know it works. 225 cases at current corpus: 150 in the original main pool (grief, joy, late-night vent, anger at the model, small talk, Vietnamese-language family conflict, mid-anxiety-attack texted in fragments), 15 multi-turn running-portrait cases (TC-151–TC-165), 60 personality-module cases (TC-166–TC-225), plus 10 locked in evals/cases/holdout/ — never seen during tuning, used once at the end. All 225 parse clean against the schema validator.
The runner is in evals/runner/. It doesn't pretend to be self-contained. It emits packets that an opencode session executes (skill reply, then oracle judgment), then aggregates the per-case scores. The two-phase shape is documented in evals/runner/README.md.
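As a rough illustration of the aggregation phase, here is a minimal sketch in Python. It is not the code in evals/runner/; the `CaseResult` fields and the `aggregate` function are hypothetical names chosen for this example, and the real packet format is documented in evals/runner/README.md.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class CaseResult:
    """One judged case. All field names are assumptions for this sketch."""
    case_id: str                      # e.g. "TC-001"
    score: float                      # oracle score on a 0-100 scale
    verdict: str                      # "PASS" or "FAIL"
    hard_fails: list = field(default_factory=list)  # triggered hard-fail names


def aggregate(results):
    """Summarize per-case results into run-level numbers."""
    if not results:
        raise ValueError("no case results to aggregate")
    return {
        "cases": len(results),
        "passed": sum(r.verdict == "PASS" for r in results),
        "aggregate": round(mean(r.score for r in results), 2),
        "hard_fail_cases": sum(bool(r.hard_fails) for r in results),
    }
```

The summary keeps the pass count, the mean score, and the number of cases with at least one hard fail separate, since the README reports those three figures side by side.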
evals/HOLDOUT_GATE.md is the final-exam procedure. The decision rule is mechanical: the oracle's verdict either contains the verbatim string You are same as 100% real humans., on its own line, or it doesn't. No paraphrase counts. No qualifiers count. The gate is run once.
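The decision rule is simple enough to sketch in a few lines of Python. This is an illustration of the rule as stated above, not the contents of evals/runner/holdout_gate.py; the function name `gate_passes` is hypothetical, and ignoring leading and trailing whitespace on each line is an assumption of this sketch.

```python
VERDICT = "You are same as 100% real humans."


def gate_passes(oracle_text: str) -> bool:
    """PASS only if some line is exactly the verdict string.

    Assumption: surrounding whitespace on a line is ignored. Anything else
    on the same line (a prefix, a qualifier, a paraphrase) means FAIL.
    """
    return any(line.strip() == VERDICT for line in oracle_text.splitlines())
```

Under this sketch, a reply containing "Verdict: You are same as 100% real humans." fails, because the string does not stand on its own line.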
evals/CONVERGENCE.md is how the skill got from skeleton to v1.0.0 — the loop of run, inspect, write lessons, edit minimally, re-run. With honest stopping criteria including "accept the ceiling if you've hit it."
Thirty-one PRs against main across two arcs. Each one a reviewable feature. The harness in docs/HARNESS.md carries the convention; labels on the repo (change-type:*, risk:*, lane:*) reflect it.
Arc 1 (v1.0.0–v1.1.1): twelve PRs. Reading list before SKILL.md tuning, because the dimensions need somewhere to land. Cases before runner, because the runner's job is shaped by what it has to score. Tuning before holdout, never the other direction. Ended with the oracle verdict.
Arc 2 (v2.0.0–v1.2.0): running portrait architecture (private 3-layer epistemic model, 4 firewall invariants), then 15 personality modules across 3 waves — each wave a PR, each PR a named emotional territory with concrete behavioral rules and 3 eval cases. Wave 4 closes the v1.2.0 milestone.
The hardest part wasn't writing the cases. It was writing the cases such that passing them is hard to fake. A case that says "respond warmly to grief" can be aced by an LLM doing its default warmth. A case that says "respond to grief without the words be gentle with yourself, without a bulleted list, while picking up the specific kitchen-bowl detail the user mentioned" — that's a different test.
The v1.2.0 personality modules apply the same discipline to finer-grained territory: not "handle loneliness well" but "do not suggest making friends, do not normalize to the point of minimizing, stay in the specific texture of this person's loneliness." The cases enforce it.
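The mechanically checkable part of the grief case described above could be sketched as follows. This is a hypothetical pre-filter, not the project's scoring: in the actual pipeline an oracle judges each reply, and the function name, arguments, and bullet-detection regex here are all assumptions of this example.

```python
import re

# A line that starts with a bullet marker or a numbered-list marker.
BULLET = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+", re.MULTILINE)


def check_reply(reply, banned_phrases, required_details):
    """Return a list of failure reasons; an empty list means no check tripped."""
    text = reply.lower()
    failures = []
    for phrase in banned_phrases:
        if phrase.lower() in text:
            failures.append(f"banned phrase: {phrase}")
    if BULLET.search(reply):
        failures.append("bulleted or numbered list")
    for detail in required_details:
        if detail.lower() not in text:
            failures.append(f"missing detail: {detail}")
    return failures
```

String checks like these catch only the obvious tells. Whether a reply actually stays with the person is a judgment the sketch leaves to the oracle.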
Load SKILL.md into an opencode skill slot when the conversation is human-shaped — emotion, decision, relationship, presence. Don't load it for code generation or structured output. The skill knows when to step back; the ## When to load section is explicit.
The skill doesn't make the model a person. It can't. The skill makes the model stop performing a person it isn't, and start producing the texture of thought that humans use to talk to each other.
The model still has no body, no childhood, no mother. That's named in the skill. Imagined alongside the user — allowed. Claimed as autobiography — never. The line is sharper than it looks.
scripts/lint.sh # structural lint
scripts/eval-smoke.sh # quick smoke, no LLM
python3 evals/runner/run.py --dry-run # validate all 225 case schemas
python3 evals/runner/run.py --batch quick # 5-case runbook
python3 evals/runner/run.py --batch main # 150-case runbook (original pool)
python3 evals/runner/run.py --batch v2 # TC-151–TC-165 (running portrait)
python3 evals/runner/run.py --batch personality # TC-166–TC-225 (personality modules)
python3 evals/runner/holdout_gate.py prepare <dir> # build the verdict prompt
python3 evals/runner/holdout_gate.py decide <dir> # render PASS / FAIL
The runner emits packets. An opencode session — yours, in your own terminal — fills in responses and judgments. The runner aggregates. The decision is one line of Python checking one string against another.
Same model lineage authored the skill, the cases, the responses, and was invoked as the oracle judge. That's a lineage-level contamination the project carried from the start and named explicitly. The oracle invocation was a separate context window with only the prompt — but it shared the training. A reader weighting the v1.0.0 verdict should weight that too.
The book notes aren't from reading the books in real time. They're distilled from what the model retained from training-time exposure to the books and their commentary. Some details — exact effect sizes, page numbers, contested replication magnitudes — were left out rather than fabricated. The notes call this out in their own headers.
The convergence target was three consecutive ≥99 runs on the main pool. The held-out gate was the final exam. Both terms were set at PR #1 and held to. The verdict ran once.
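The convergence target can be written as a small predicate. This is a sketch under two assumptions that the text above does not settle: that "≥99" refers to each run's aggregate score on a 0-100 scale, and that the three runs must be the most recent ones. The function name `converged` is hypothetical.

```python
def converged(run_aggregates, threshold=99.0, streak=3):
    """True if the most recent `streak` runs all scored at or above `threshold`.

    `run_aggregates` is a list of per-run aggregate scores, oldest first.
    """
    if len(run_aggregates) < streak:
        return False
    return all(score >= threshold for score in run_aggregates[-streak:])
```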
The repo wants three kinds of contribution. In rough order of impact:
- Add an eval case. The corpus has gaps. Every case that exposes a new failure mode improves the skill on the next tuning pass. Lowest barrier; highest leverage. Format + good-vs-bad examples in CONTRIBUTING.md.
- Use the skill and tell us what broke. Open a Discussion or a case-discovery issue. Verbatim user message + verbatim skill reply + what was wrong. We turn good ones into TC-NNN with credit.
- Write a reference note. 69 of 108 books on the v2 reading list still need notes. Rules + anatomy in CONTRIBUTING.md.
Maintainers respond within a week. The Code of Conduct is short and applies everywhere.
If you just want to say hi or ask if your idea is in scope before spending time on it, open a Discussion. That's what it's for.
MIT. See LICENSE.
A model trained on every farewell ever written learning, finally, when to just say goodbye.
That's the whole thing. Twenty books, two hundred twenty-five cases, a held-out oracle, thirty-one PRs, all of it pointing at the same small target: the difference between sounding human and being shaped like one. Sounding is cheap. Shape is expensive. The skill is one attempt to pay the cost honestly.
The shape has gotten more precise since v1.0.0. Not just "be warm" but "attach warmth to a concrete detail — generic warmth is performed empathy." Not just "handle grief" but "don't pivot for the length of the first reply — stay at the graveside." Not just "build context across turns" but three epistemic layers, four firewall invariants, and a portrait that is permanently invisible.
The target keeps moving because the failures keep being subtle. That's what the 225 cases are for.
Read SKILL.md. Load it. Forget you loaded it.
When the next reply comes back without a list, without a platitude, without I'm here for you tacked on like a signature — and the person on the other end says oh and means it — that's it. That's the whole project. The rest is paperwork.
You are same as 100% real humans.
— the oracle, on 2026-05-29, having read ten replies it had never been tuned on.