emotype

Real-time Korean subtitles whose typography carries the speaker's affect.

emotype reads your face and your voice, decides what emotion is happening right now, and types the subtitle in a form that carries that emotion. Anger shakes. Sadness slumps. Joy bounces. Same words, different shapes — the way real speech already works, just made visible.

"Text carries meaning. Form carries affect."

Korean README · taewoopark.com (author site)

┌──────────┐   ┌─────────────┐
│  Camera  │──▶│  HSEmotion  │──┐
│  (face)  │   │  + V/A      │  │   ┌───────────────────┐
└──────────┘   └─────────────┘  │   │ Subtitle Overlay  │
                                ├──▶│  free-place       │
┌──────────┐   ┌─────────────┐  │   │  click-through    │
│   Mic    │──▶│  VAD + STT  │──┘   │  always-on-top    │
│  (voice) │   │  (Whisper)  │      └───────────────────┘
└──────────┘   └─────────────┘
                       │
                  Fusion: utterance segment ⟷ averaged emotion frames
                  → 1 utterance = 1 emotion = 1 design

Why emotype?

Most subtitles only carry what was said. Typography is treated as a neutral medium for legibility, and emotion is left to the viewer's imagination. emotype's hypothesis runs the other way:

Traditional Subtitles	emotype
Same neutral typography for every utterance	Per-utterance design — typography carries affect
Static font, color, weight	Dynamic intensity scaling (1px tremor → 32px shake)
Emotion left to the viewer	Emotion measured (face) and bound (audio) at the utterance level
One linear stream of text	A typographic instrument that performs the line

The same Korean line "괜찮아" ("it's fine") is a different utterance when the letters tremble on top of anger, slump on top of sadness, or bounce on top of joy. emotype takes the affect that's already in your face and voice and translates it into the typography of the subtitle, so the caption carries both meaning and feeling at once.

Design principles

One utterance = one emotion = one design. Even if your face shifts mid-sentence, the subtitle's design does not. Stability beats accuracy: a subtitle whose form mutates inside one breath is unreadable. The next utterance gets a fresh decision.
Affect is multidimensional. Categorical labels alone collapse 80% of everyday speech to neutral. emotype keeps three representations side by side: 8 categorical emotions, a continuous valence × arousal pair (Russell's circumplex), and a scalar intensity in [0, 1].
Form is amplitude. Strong anger shakes wider; soft anger trembles. Same category, different intensity. Letter spacing, motion frequency, glow radius, and weight all scale with intensity.
A subtitle expresses the utterance, not the speaker. Same person, two utterances, two designs. emotype never labels a person as "the angry one"; it dresses each utterance in a form. Form is performance, not diagnosis.

Theoretical Foundations

emotype's design draws on established work in affective computing, perceptual psychology, and on-device inference.

Affect Models

Ekman's basic emotions (1992): the categorical layer. emotype uses 8 categories (Ekman's six basic emotions plus neutral and contempt): anger, contempt, disgust, fear, happiness, neutral, sadness, surprise.
Russell's Circumplex (1980) — the dimensional layer. Every utterance also receives a continuous valence (pleasant ↔ unpleasant) and arousal (calm ↔ excited) pair. The four V/A quadrants determine the mood layer of the design (excited / tense / depressed / content).
Plutchik's dyads (1980) — reserved for V2. The current implementation cross-fades between adjacent base presets at 200 ms instead of synthesizing dyad designs.
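
The quadrant-to-mood step of the dimensional layer can be sketched as follows. This is a minimal illustration rather than emotype's actual code: the function name `mood_from_va` and the choice to split the plane at exactly 0 on both axes are assumptions, while the four mood names come from the text above.

```python
from typing import Literal

Mood = Literal["excited", "tense", "depressed", "content"]


def mood_from_va(valence: float, arousal: float) -> Mood:
    """Map a valence/arousal pair in [-1, 1] to one of four quadrant moods.

    Assumption: quadrants split at 0 on both axes, and ties go to the
    positive side.
    """
    if arousal >= 0:
        return "excited" if valence >= 0 else "tense"
    return "content" if valence >= 0 else "depressed"
```

With these boundaries, a high-arousal unpleasant reading such as `mood_from_va(-0.7, 0.6)` lands in the tense quadrant, which matches the mood listed for anger in the preset table.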

Perception & Reading

One breath = one design — typographic identity is preserved within a single utterance because reading collapses when form mutates mid-line. The subtitle is a reading artifact first.
Intensity ease-out — the mapping from intensity ∈ [0, 1] to amplitude follows 1 − (1 − i)² so weak expressions are still legible (a 2 px tremor is detectable) and strong expressions reach 160% of the preset's base amplitude.
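
The ease-out curve can be written out directly. The sketch below assumes one plausible reading of the text: the eased value in [0, 1] is multiplied by a 1.6 ceiling and by the preset's base amplitude. The names `ease_out`, `scaled_amplitude`, and `max_scale` are hypothetical, and the real mapping may combine these terms differently.

```python
def ease_out(intensity: float) -> float:
    """Quadratic ease-out 1 - (1 - i)^2, with the input clamped to [0, 1]."""
    i = min(1.0, max(0.0, intensity))
    return 1.0 - (1.0 - i) ** 2


def scaled_amplitude(base_px: float, intensity: float, max_scale: float = 1.6) -> float:
    """Assumed mapping: eased intensity times the 160% ceiling times the base."""
    return base_px * max_scale * ease_out(intensity)
```

Because the curve rises fastest near zero, an intensity of 0.25 already yields an eased value of 0.4375, so weak expressions move more than a linear mapping would allow.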

On-Device Inference

HSEmotion ENet-B0 8 V/A MTL (Savchenko 2022) — a 16 MB ONNX model that emits both 8-class logits and V/A from a single forward pass. Faster and more compact than running classification + V/A separately.
Streaming Silero VAD (snakers4): utterance segmentation through VADIterator at 16 kHz with a 512-sample window. Far more responsive than batch get_speech_timestamps, and the only pattern that yields low end-of-utterance latency.
Apple MLX Whisper — Metal + unified memory makes whisper-large-v3-turbo transcribe ~10 s of speech in ~0.6 s on M-series silicon, with Korean WER within 1% of full large-v3.
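
The streaming segmentation pattern can be sketched without loading the model. Silero's VADIterator is called once per audio chunk and returns either nothing, a dict with a `start` key, or a dict with an `end` key. The code below consumes events of that shape. `vad_step` is a stand-in callable and `stream_segments` is a hypothetical name, not part of silero-vad or emotype.

```python
from typing import Callable, Iterable, Iterator, Optional

VadEvent = Optional[dict]  # None, {"start": n} or {"end": n}


def stream_segments(
    chunks: Iterable[object],
    vad_step: Callable[[object], VadEvent],
) -> Iterator[tuple[int, int]]:
    """Yield (start, end) as soon as each utterance closes."""
    start: Optional[int] = None
    for chunk in chunks:
        event = vad_step(chunk)
        if not event:
            continue  # no boundary in this chunk
        if "start" in event:
            start = event["start"]
        elif "end" in event and start is not None:
            yield start, event["end"]
            start = None
```

The segment is emitted on the chunk that carries the end event, which is why the streaming form can hand audio to the transcriber without waiting for a batch pass over the whole recording.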

Features

Eight emotion presets

Each preset is a DesignToken — color palette, Korean primary + display font, Latin display font, motion type, intensity sensitivity. Drawn from nexu-io/open-design and re-grounded for Korean reading.

Emotion	Mood	Primary color	Korean display	Latin display	Motion
happiness	excited	`#EA580C` burnt orange	Jua	Limelight	bounce
anger	tense	`#DC2626` red	Black Han Sans	Outfit	shake + glitch
sadness	depressed	`#2F5B4F` forest	Nanum Myeongjo	Lora Italic	slow descend
fear	tense	`#3B82F6` electric	Pretendard Light	Audiowide	jitter + flicker
disgust	tense	`#37F712` toxic green	D2Coding	Space Mono	warp
surprise	excited	`#DB2777` magenta	Gowun Dodum	Fascinate	pop scale
contempt	depressed	`#111111` ink	Nanum Myeongjo Italic	Gelasio Italic	tilt + slow
neutral	content	`#0C0C09` graphite	Pretendard Regular	Inter	static

Live-reloadable from ~/.submaker/design_presets.yaml. Editing the YAML and saving updates the next utterance — no restart.
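
One simple way to get this behaviour is to compare the file's modification time before each utterance and re-parse only when it changes. The sketch below is an assumption about the mechanism, not a description of emotype's loader: `ReloadOnChange` is a hypothetical helper, and `parse` stands in for whatever YAML parser is used (for example `yaml.safe_load` from PyYAML).

```python
import os
from pathlib import Path
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class ReloadOnChange(Generic[T]):
    """Re-parse a file only when its modification time changes."""

    def __init__(self, path: Path, parse: Callable[[str], T]) -> None:
        self._path = path
        self._parse = parse
        self._mtime_ns: Optional[int] = None
        self._value: Optional[T] = None

    def get(self) -> T:
        mtime_ns = os.stat(self._path).st_mtime_ns
        if mtime_ns != self._mtime_ns:
            # First call, or the file was saved since the last call.
            self._value = self._parse(self._path.read_text(encoding="utf-8"))
            self._mtime_ns = mtime_ns
        return self._value  # type: ignore[return-value]
```

Calling `get()` once per utterance means an edit takes effect on the next utterance and never in the middle of one.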

On-device by default

  microphone ──▶ Silero VAD ──▶ utterance segments ──▶ mlx-whisper ──▶ Korean text
                                                       (turbo, ~0.6 s for 10 s audio)

The full STT path runs on your Mac. No audio leaves the machine unless you explicitly switch to a cloud provider. The cloud path is opt-in (pip install emotype[cloud]). When it is enabled, the router validates the credential, falls back to local on auth failure, and never logs the API key.
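
The fallback behaviour can be sketched as a small wrapper around two transcriber callables. Everything named here is hypothetical (`AuthError`, `Transcriber`, `transcribe_with_fallback`). The sketch only illustrates the control flow of trying the cloud provider first and falling back to local when credentials or the network fail.

```python
from typing import Callable


class AuthError(Exception):
    """Hypothetical error a cloud adapter raises when credentials are rejected."""


Transcriber = Callable[[bytes], str]


def transcribe_with_fallback(
    audio: bytes,
    cloud: Transcriber,
    local: Transcriber,
) -> tuple[str, str]:
    """Return (text, provider_used); fall back to local on auth or network failure."""
    try:
        return cloud(audio), "cloud"
    except (AuthError, ConnectionError):
        # The exception is swallowed here without logging its arguments,
        # so a key embedded in an error message is not written anywhere.
        return local(audio), "local"
```

Returning the provider name alongside the text lets the caller update a status lamp without inspecting exceptions.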

A free-place subtitle window

  ┌──────────────────────────────────────┐
  │                                      │  ◀ frameless
  │                                      │  ◀ always-on-top
  │            괜 찮 아 …                │  ◀ click-through (toggle with `L`)
  │                                      │  ◀ remembers position across runs
  │                                      │
  └──────────────────────────────────────┘

A Qt.FramelessWindowHint | Qt.WindowStaysOnTopHint overlay you can drag anywhere on any monitor. Multi-monitor positions persist in ~/.submaker/layout.json and snap back to the bottom-center if the saved rect ends up off-screen on a smaller setup.
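
The snap-back rule can be sketched as plain geometry, independent of Qt. The code below assumes that "off-screen" means the saved rectangle is not fully contained in any connected screen, and that the fallback is the bottom-center of the first screen with a 48 px margin. Both choices, and the names `Rect`, `fully_inside`, and `restore_position`, are illustrative assumptions.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    x: int
    y: int
    w: int
    h: int


def fully_inside(inner: Rect, outer: Rect) -> bool:
    """True when inner lies entirely within outer."""
    return (
        inner.x >= outer.x
        and inner.y >= outer.y
        and inner.x + inner.w <= outer.x + outer.w
        and inner.y + inner.h <= outer.y + outer.h
    )


def restore_position(saved: Rect, screens: list[Rect], margin: int = 48) -> Rect:
    """Keep the saved rect if a screen contains it, else bottom-center on screens[0]."""
    if any(fully_inside(saved, s) for s in screens):
        return saved
    primary = screens[0]
    x = primary.x + (primary.w - saved.w) // 2
    y = primary.y + primary.h - saved.h - margin
    return Rect(x, y, saved.w, saved.h)
```

For example, a rectangle saved on a second monitor that is no longer attached fails the containment test and is re-centered on the remaining screen.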

Multi-line typesetting with stagger continuity

Long Korean utterances are wrapped at the overlay's render width using QFontMetricsF glyph advances, then animated with a monotonically advancing character index across line breaks. The motion stagger feels continuous even when the line wraps mid-utterance.
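
The idea of a character index that keeps advancing across line breaks can be sketched as below. This is a simplified illustration: it breaks greedily at any character, and `advance` is a stand-in for a real glyph-width measurement (in Qt this role would be played by QFontMetricsF). The function name `wrap_with_stagger` is hypothetical.

```python
from typing import Callable


def wrap_with_stagger(
    text: str,
    max_width: float,
    advance: Callable[[str], float],
) -> list[list[tuple[str, int]]]:
    """Greedy per-character wrap; each glyph keeps a global stagger index."""
    lines: list[list[tuple[str, int]]] = [[]]
    width = 0.0
    for index, ch in enumerate(text):
        w = advance(ch)
        if lines[-1] and width + w > max_width:
            lines.append([])  # start a new line, but do not reset the index
            width = 0.0
        lines[-1].append((ch, index))
        width += w
    return lines
```

If each glyph's animation delay is its index times a fixed step, the first glyph of the second line starts right after the last glyph of the first, so the stagger does not restart at the wrap.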

Hot-swap capture devices

Pick a different camera or microphone from the control panel mid-session — the orchestrator gracefully tears down the running thread, joins it with a 2 s ceiling, and starts a fresh thread on the new device without dropping the live STT/emotion pipeline.
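
A teardown-and-restart with a bounded join can be sketched with the standard library. `CaptureWorker` and `hot_swap` are hypothetical names, and the loop body is a placeholder for reading from a real device. Only the 2 s join ceiling comes from the text above.

```python
import threading
import time
from typing import Optional


class CaptureWorker:
    """Stand-in for a camera or microphone thread bound to one device."""

    def __init__(self, device: str) -> None:
        self.device = device
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self) -> None:
        while not self._stop.is_set():
            time.sleep(0.01)  # placeholder for reading one frame or chunk

    def start(self) -> None:
        self._thread.start()

    def stop(self, timeout_s: float = 2.0) -> bool:
        """Signal the loop to exit and wait up to timeout_s; True if it exited."""
        self._stop.set()
        self._thread.join(timeout=timeout_s)
        return not self._thread.is_alive()


def hot_swap(current: Optional[CaptureWorker], new_device: str) -> CaptureWorker:
    """Tear down the running worker (if any), then start one on the new device."""
    if current is not None:
        current.stop()
    fresh = CaptureWorker(new_device)
    fresh.start()
    return fresh
```

The timeout on `join` keeps a stuck device driver from freezing the swap. What to do when `stop` returns False is left open in this sketch.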

Getting Started

Prerequisites

macOS (Apple Silicon recommended — mlx-whisper is M-series-only)
Python ≥ 3.11, < 3.13
~3 GB free disk for the cached Whisper turbo model + ONNX weights + fonts (downloaded on first run)
A camera and a microphone (built-in or USB; both hot-swappable)

Installation

git clone https://github.com/TaewoooPark/emotype.git
cd emotype
python3.12 -m venv .venv
source .venv/bin/activate
pip install -e .

Cloud STT (optional):

pip install -e '.[cloud]'         # both Google + OpenAI
pip install -e '.[google]'        # only Google STT v2
pip install -e '.[openai]'        # only OpenAI Whisper API

Pre-fetch the ~80 MB of fonts and models so the first launch isn't a download:

emotype-fetch-assets

(Otherwise they're fetched lazily into ~/.submaker/assets/ the first time the pipeline starts.)

Quick Start

emotype
# or, equivalently:
python -m submaker.app.cli

1. Pick a camera and a microphone in the right-hand control panel.
2. Press Space (or click Start). Lamps turn green when each stage is hot.
3. Drag the subtitle window to where you want it. Press L to lock click-through.
4. Speak. The subtitle appears in the design that matches what your face and voice are saying.
5. Slide intensity gain if the motion is too subtle or too aggressive for your taste.

Controls

Action	Input
Start / stop pipeline	`Space`
Lock subtitle (click-through)	`L`
Reset overlay to bottom-center	`R`
Decrease intensity gain	`[`
Increase intensity gain	`]`
Open `design_presets.yaml`	`Cmd+,`
Quit	`Cmd+Q`

Configuring STT providers

Drop credentials into ~/.submaker/credentials.toml (the file is created with 0600 perms; the loader masks every field in __repr__ so a stray log line never leaks a key):

[local]
model_repo = "mlx-community/whisper-large-v3-turbo"
language = "ko"

[google]
service_account_json_path = "/path/to/sa.json"
language = "ko-KR"

[openai]
api_key = "sk-…"
language = "ko"
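
The masking idea mentioned above can be sketched with a dataclass whose `__repr__` hides every field. This is an illustration only. `OpenAICredentials` is a hypothetical class, and emotype's loader may be structured differently.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True, repr=False)
class OpenAICredentials:
    api_key: str
    language: str = "ko"

    def __repr__(self) -> str:
        # Mask every field so an accidental log line cannot leak the key.
        masked = ", ".join(f"{f.name}=***" for f in fields(self))
        return f"{type(self).__name__}({masked})"
```

Because `str()` falls back to `__repr__` when no `__str__` is defined, f-strings and `print` calls are covered by the same masking.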

Switch the active provider from the control panel's STT dropdown. The lamp turns green once the healthcheck passes and red on auth or network failure. On failure the router falls back to local automatically, so a missing key never silently breaks transcription.

Architecture

Tech Stack

Layer	Technology
Capture	`opencv-python`, `sounddevice`, `pyobjc-framework-AVFoundation`
Face & emotion	`mediapipe.tasks.vision.FaceDetector`, HSEmotion ENet-B0 ONNX (`onnxruntime` + CoreML EP)
Speech	Streaming Silero VAD, `mlx-whisper` (local), Google STT v2 / OpenAI (optional)
State	`pydantic` v2 schemas, `queue.Queue` for data, `pyqtSignal` for UI
UI	PyQt5 — main window + frameless overlay + Qt stylesheets
Design tokens	YAML preset library, V/A interpolation, intensity scaling in LCH
Build	`pip install -e .`, optional PyInstaller `.app` (`submaker.spec`)

Pipeline Stages

  capture ──▶ emotion ──┐
                        ├──▶ fusion ──▶ design ──▶ overlay
  mic ──▶ VAD ──▶ STT ──┘

Stage	What it does	Output
Capture	Camera frames + mic PCM, hot-swappable, with backpressure	`Frame`, `AudioChunk`
Emotion	Face crop → HSEmotion → 8-class logits + V/A + intensity	`EmotionFrame`
VAD + STT	Silero VAD splits the mic into segments; Whisper transcribes each (final-only, no partials)	`Utterance`
Fusion	Averages the emotion frames whose timestamp falls inside the utterance	`EmotionSegment`
Design	Maps the chosen emotion to a `DesignToken` (font, colors, motion type, intensity)	`DesignedSubtitle`
Overlay	Frameless click-through PyQt5 window, multi-line wrap + per-glyph stagger	painted pixels
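
The fusion stage can be sketched as a filter plus a mean. The code below uses a trimmed stand-in for EmotionFrame and makes two assumptions that the text does not confirm: the fused label is the argmax of the averaged class probabilities, and an utterance with no frames inside its time span yields None. `Frame` and `fuse` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

LABELS = (
    "anger", "contempt", "disgust", "fear",
    "happiness", "neutral", "sadness", "surprise",
)


@dataclass(frozen=True)
class Frame:  # trimmed stand-in for EmotionFrame
    ts_ns: int
    probs: tuple[float, ...]
    valence: float
    arousal: float


def fuse(frames: list[Frame], t_start_ns: int, t_end_ns: int) -> Optional[dict]:
    """Average the frames whose timestamp falls inside the utterance."""
    inside = [f for f in frames if t_start_ns <= f.ts_ns <= t_end_ns]
    if not inside:
        return None
    n = len(inside)
    mean_probs = [sum(col) / n for col in zip(*(f.probs for f in inside))]
    best = max(range(len(LABELS)), key=mean_probs.__getitem__)
    return {
        "label": LABELS[best],
        "valence": sum(f.valence for f in inside) / n,
        "arousal": sum(f.arousal for f in inside) / n,
    }
```

Averaging before picking the label is one way to get a single, stable emotion per utterance even when individual frames disagree.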

All cross-thread payloads are Pydantic models. All timestamps are time.monotonic_ns(). Cross-thread data flows through bounded queue.Queues (with backpressure); UI notifications go through pyqtSignals (drop-able). The two are never mixed.
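
A bounded queue with monotonic timestamps can be sketched with the standard library. The queue size, the timeout, and the decision to report a full queue to the caller are assumptions for illustration. The text above does not say what emotype's producers do when a queue is full. `frame_queue` and `produce` are hypothetical names.

```python
import queue
import time

frame_queue: "queue.Queue[tuple[int, bytes]]" = queue.Queue(maxsize=4)


def produce(payload: bytes, timeout_s: float = 0.05) -> bool:
    """Block up to timeout_s for a free slot; report whether the item was queued."""
    try:
        frame_queue.put((time.monotonic_ns(), payload), timeout=timeout_s)
        return True
    except queue.Full:
        return False
```

The `maxsize` bound is what creates backpressure: a producer that outruns the consumer waits in `put` and cannot grow memory without limit.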

Data Model

from typing import Literal
from uuid import UUID

from pydantic import BaseModel


class EmotionFrame(BaseModel):
    track_id: UUID                     # face track, never persisted across sessions
    ts_ns: int                          # time.monotonic_ns()
    label: Literal[                     # 8 categorical emotions (Ekman + contempt)
        "anger", "contempt", "disgust", "fear",
        "happiness", "neutral", "sadness", "surprise",
    ]
    probs: list[float]                  # 8-class softmax, sums to 1.0
    valence: float                      # Russell circumplex, [-1, 1]
    arousal: float                      # Russell circumplex, [-1, 1]
    intensity: float                    # 0.5·dom_margin + 0.5·|arousal|, [0, 1]


class Utterance(BaseModel):
    segment_id: UUID                    # one per VAD segment
    t_start_ns: int
    t_end_ns: int
    text: str                           # final, never partial
    lang: str                           # ISO 639-1, default "ko"
    provider: Literal["local", "google", "openai"]
    confidence: float | None
    words: list[Word] | None            # word-level timing if provider supplies it (Word model not shown here)


class DesignToken(BaseModel):
    font_family_korean: str
    font_family_korean_display: str
    font_family_latin: str
    color_fg: str                       # hex sRGB
    color_outline: str
    color_glow: str | None
    motion_type: Literal[
        "bounce", "shake_glitch", "slow_descend", "jitter_flicker",
        "warp", "pop_scale", "tilt_slow", "static",
    ]
    motion_amplitude_px: float
    motion_frequency_hz: float
    intensity: float                    # forwarded from EmotionSegment, drives ease-out
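
The intensity formula in the EmotionFrame comment can be worked through in a few lines. The sketch assumes that `dom_margin` means the gap between the highest and second-highest class probability. The source does not define the term, so treat that reading, and the function names, as hypothetical.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over raw class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def intensity(probs: list[float], arousal: float) -> float:
    """0.5 * dom_margin + 0.5 * |arousal|, clamped to [0, 1].

    Assumption: dom_margin is the top-1 probability minus the runner-up.
    """
    top, second = sorted(probs, reverse=True)[:2]
    value = 0.5 * (top - second) + 0.5 * abs(arousal)
    return min(1.0, max(0.0, value))
```

With a top probability of 0.7, a runner-up of 0.1, and an arousal of -0.4, the two halves contribute 0.3 and 0.2, for an intensity of 0.5.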

Project Structure

submaker/
├── core/        # Pydantic data contracts (every cross-thread payload)
├── capture/     # CameraThread, AudioThread, DeviceManager (hot-swap)
├── emotion/     # MediaPipe FaceDetector + HSEmotion ONNX session
├── stt/         # Streaming Silero VAD, mlx-whisper, optional cloud adapters, router
├── fusion/      # Utterance × EmotionFrame ring buffer → EmotionSegment
├── design/      # 8 preset library, V/A interpolation, LCH intensity scaling
├── ui/          # MainWindow, ControlPanel, SubtitleOverlay, HUD
├── presets/     # design_presets.yaml — the YAML layer of the design system
└── assets/      # Lazy-downloaded ONNX/tflite models + fonts

Design Principles

1. Stability over accuracy. A subtitle whose form mutates mid-breath is unreadable. The cost of a wrong-but-stable design is far smaller than the cost of a flickering one.
2. Time is monotonic. Every ts_ns is time.monotonic_ns(). Wall clock is forbidden because device hot-swap, daylight savings, and NTP corrections silently corrupt subtitle ordering otherwise.
3. Final, not partial. STT speaks once per VAD segment. Partial transcriptions would force the design to redecide mid-utterance, which is exactly what principle 1 forbids.
4. Queues for data, signals for UI. queue.Queue carries cross-thread payloads with backpressure; pyqtSignal carries UI notifications and is allowed to drop. The two are never aliased.
5. No persistent identity. A track ID never inherits an emotion. When the face track breaks, the new track starts emotion-blank. The system measures the utterance, not the person.

Ethics

The face-emotion model is a hypothesis, not the truth. emotype is a tool for expression, not diagnosis.

Models are not facts. HSEmotion is biased toward Western expressions and AffectNet's collection conditions. Korean faces, masks, and non-prototypical expressions degrade the signal sharply.
No persistence. No camera frame or audio buffer is written to disk by emotype. Cloud STT calls are governed by the provider's policy and the user must give consent before sending audio out of the device.
No diagnosis. Don't use this for hiring, medical screening, or any decision-bearing context.
No identity labels. A track ID never inherits an emotion. emotype dresses utterances, not people.

If you build on emotype, please keep these constraints visible to your end users.

Status

emotype is alpha — it works on the developer's machine, the pipeline is end-to-end, and the eight design presets render. Expect rough edges around device hot-swap (occasionally drops frames during a swap), first-run UX (the model download is silent on stderr — no GUI progress yet), and Apple Silicon-only constraints (mlx-whisper does not run on Intel; cloud STT works on Intel but the local provider does not).

License

MIT — see LICENSE. Bundled-asset licenses are honoured separately:

Asset	License	Source
Pretendard	SIL OFL 1.1	`orioncactus/pretendard`
Black Han Sans, Jua, Nanum Myeongjo, Gowun Dodum, Limelight, Audiowide, Fascinate, Outfit, Lora, Inter, Space Mono, Gelasio	SIL OFL 1.1	Google Fonts
HSEmotion ENet-B0 ONNX	Apache 2.0	`HSE-asavchenko/face-emotion-recognition`
MediaPipe BlazeFace short-range	Apache 2.0	Google MediaPipe
Silero VAD	MIT	`snakers4/silero-vad`
Apple MLX Whisper	MIT	`ml-explore/mlx-examples`

Acknowledgements

HSEmotion for the ENet-B0 8 V/A MTL ONNX model.
MediaPipe for BlazeFace short-range.
silero-vad for the streaming VAD that makes utterance segmentation feel instant.
Apple MLX for making on-device Whisper actually fast on Mac.
Pretendard for the Korean variable typeface that holds 8 emotional registers without breaking a sweat.
nexu-io/open-design for the design-system seeds the 8 presets are derived from.

Connect

Built with PyQt5, Apple MLX, MediaPipe, ONNX Runtime, and Silero VAD.
