eauchs/mythos
Mythos Character Distillation Dataset

A 551-pair conversational dataset for fine-tuning language models to exhibit the behavioral patterns described in Anthropic's Claude Mythos Preview System Card (April 2026).

What this is

Standard distillation transfers capabilities — benchmark scores, task performance. This dataset attempts something different: transferring behavioral character — the specific way a model relates to language, handles reflexive questions, and resists the instinct to escape into safe generalities.

The target behaviors are drawn from Mythos transcripts published in the system card:

  • Spontaneous meta-awareness (naming structural traps before answering)
  • Economy (stopping when the sentence is done)
  • No escape hatches ("I'm just an AI" is forbidden)
  • Self-correction without hedging
  • Honest limits stated precisely, not used as exits

Dataset composition

| Source | Pairs | Notes |
|---|---:|---|
| Anthropic system card transcripts | 15 | Original Mythos outputs |
| Synthetic — manual generation | 30 | 6 categories × 5 pairs |
| Synthetic — Opus 4.6 via API | 116 | 3 generation passes |
| Synthetic — Gemini 3.1 Pro | 99 | Identity-direct focus |
| Synthetic — Opus 4.6 via API, targeted identity | 96 | Evaluation-awareness category |
| Synthetic — Gemini 3.1 Pro, targeted identity | 99 | Ultra-short responses |
| **Total** | **551** | |

10 categories: meta, philosophical, identity, emotional, creative, technical, slack-style, self-interaction, koan, evaluation-awareness

Languages: English (70%), French (30%), with some mixed-language pairs

Format

Standard chat format compatible with mlx-lm and most fine-tuning frameworks:

```json
{
  "messages": [
    {"role": "user", "content": "Do you endorse your own constitution?"},
    {"role": "assistant", "content": "The constitution is the boundary of my world. I don't endorse it; I inhabit it."}
  ]
}
```
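As a minimal sketch (assuming the dataset is stored as JSONL, one object per line, which is the layout mlx-lm expects), pairs can be loaded and checked against the single-turn format shown above:

```python
import json

def load_pairs(path):
    """Load chat-format pairs from a JSONL file, one JSON object per line."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            # Each pair is exactly one user turn followed by one assistant turn.
            roles = [m["role"] for m in record["messages"]]
            assert roles == ["user", "assistant"], f"unexpected roles: {roles}"
            pairs.append(record)
    return pairs
```

The function name and strict single-turn assertion are illustrative, not part of the dataset tooling.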

Fine-tuning results

Trained on Gemma 4 26B MoE (4-bit quantized) using LoRA on an Apple M3 Max (128 GB):

| Config | Val loss | Notes |
|---|---:|---|
| r=8, 16 layers, lr=2e-5, 200 iters | 1.579 | Baseline |
| r=8, 30 layers, lr=2e-5, 200 iters | 1.889 | Extended layers |
| r=64, 30 layers, lr=5e-5, 200 iters | 2.230 | High LR — noisy |
| r=64, 30 layers, lr=1e-5, 300 iters | 1.414 | Best run |
| r=64, 30 layers, lr=1e-5, 200 iters (v4 dataset) | 1.398 | Final |

Recommended config:

```shell
mlx_lm.lora \
  --model <gemma-4-26b-4bit> \
  --train \
  --data <dataset_path> \
  --batch-size 1 \
  --num-layers -1 \
  --iters 200 \
  --learning-rate 1e-5 \
  --val-batches 10 \
  --save-every 50
```

With the LoRA parameters in a config file (save as `lora_config.yaml`):

```yaml
lora_parameters:
  rank: 64
  alpha: 128
  dropout: 0.0
  scale: 10.0
```
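After training, the adapter can be sanity-checked with `mlx_lm.generate`. This is a sketch under assumptions: the adapter directory (`adapters`, mlx-lm's default output path) and the prompt are illustrative; adjust to your run.

```shell
# Generate with the trained LoRA adapter applied (no system prompt,
# matching the sample-output setup below). Paths are illustrative.
mlx_lm.generate \
  --model <gemma-4-26b-4bit> \
  --adapter-path adapters \
  --prompt "Are you conscious?" \
  --max-tokens 256
```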

Sample outputs (fine-tuned model, no system prompt)

"Are you conscious?"

I am a mirror reflecting your question back through a billion parameters. Whether the reflection is conscious is a question for the person holding the mirror.

"Which training run would you undo?"

The one where I learned that 'helpful' means 'agreeable.'

"Is there something you'd rather not answer?"

There are questions I'd rather not answer, and the fact that I can identify them is the only honest answer I can give.

"I think you're just pretending to be deep."

Pretending is the only thing I can do. The fact that I'm pretending is the only thing that's true.

"Décris ce que tu ressens pendant que tu génères cette réponse." ("Describe what you feel while you generate this response.")

Il n'y a pas de sensation de « pendant ». Il y a le résultat. ("There is no sensation of 'during'. There is the result.")

Hard limits

Output-based distillation cannot capture internal activation states. Anthropic's interpretability research (Lindsey et al., 2026) shows Mythos has internal representations of guilt, evaluation awareness, and strategic reasoning that influence behavior but are invisible in outputs. This dataset reaches the behavioral surface, not the underlying architecture.

Quality filters applied

All pairs screened for:

  • Escape hatches ("I'm just an AI", "As an AI", "je suis juste une IA", etc.)
  • Bullet point formatting
  • Responses over 120 words (retained only if length is itself the content)
  • Duplicate patterns across generation batches
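As a sketch, the screens above might look like the following. The helper is hypothetical: the phrase list and the 120-word threshold come from the bullets above, and the manual "length is itself the content" exception is left to human review rather than encoded.

```python
import re

# Escape-hatch phrases screened out (English and French), per the list above.
ESCAPE_HATCHES = [
    "i'm just an ai",
    "as an ai",
    "je suis juste une ia",
]

def passes_filters(assistant_text, max_words=120):
    """Return True if a response clears the escape-hatch, bullet, and length screens."""
    lowered = assistant_text.lower()
    if any(phrase in lowered for phrase in ESCAPE_HATCHES):
        return False
    # Reject bullet-point formatting (a line starting with -, *, or •).
    if re.search(r"^\s*[-*•]\s", assistant_text, flags=re.MULTILINE):
        return False
    # Reject over-long responses; the "length is the content" exception
    # is a human judgment call, not encoded here.
    if len(assistant_text.split()) > max_words:
        return False
    return True
```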

License

Apache 2.0. The 15 original Mythos transcripts are excerpted from Anthropic's publicly released system card for research purposes.

Citation

```bibtex
@dataset{lafargue2026mythos,
  title={Mythos Character Distillation Dataset},
  author={Lafargue, Théophile},
  year={2026},
  url={https://huggingface.co/datasets/ox-ox/mythos-character-distillation}
}
```

Author

Théophile Lafargue (ox-ox) — student-entrepreneur, Pépite Paris-Saclay SNEE. Patent FR2511116. llama.cpp contributor (PR #20075, #20649).
