Real-time Korean subtitles whose typography carries the speaker's affect.
emotype reads your face and your voice, decides what emotion is happening right now, and types the subtitle in a form that carries that emotion. Anger shakes. Sadness slumps. Joy bounces. Same words, different shapes — the way real speech already works, just made visible.
"Text carries meaning. Form carries affect."
Korean README · taewoopark.com (author site)
    ┌──────────┐   ┌─────────────┐
    │  Camera  │──▶│  HSEmotion  │──┐
    │  (face)  │   │    + V/A    │  │   ┌───────────────────┐
    └──────────┘   └─────────────┘  │   │ Subtitle Overlay  │
                                    ├──▶│ free-place        │
    ┌──────────┐   ┌─────────────┐  │   │ click-through     │
    │   Mic    │──▶│  VAD + STT  │──┘   │ always-on-top     │
    │ (voice)  │   │  (Whisper)  │      └───────────────────┘
    └──────────┘   └─────────────┘
                          │
       Fusion: utterance segment ⟷ averaged emotion frames
                → 1 utterance = 1 emotion = 1 design
Most subtitles only carry what was said. Typography is treated as a neutral medium for legibility, and emotion is left to the viewer's imagination. emotype's hypothesis runs the other way:
| Traditional Subtitles | emotype |
|---|---|
| Same neutral typography for every utterance | Per-utterance design — typography carries affect |
| Static font, color, weight | Dynamic intensity scaling (1px tremor → 32px shake) |
| Emotion left to the viewer | Emotion measured (face) and bound (audio) at the utterance level |
| One linear stream of text | A typographic instrument that performs the line |
The same Korean line "괜찮아" ("it's fine") is a different utterance when the letters tremble on top of anger, slump on top of sadness, or bounce on top of joy. emotype takes the affect that's already in your face and voice and translates it into the typography of the subtitle, so the caption carries both meaning and feeling at once.
- One utterance = one emotion = one design. Even if your face shifts mid-sentence, the subtitle's design does not. Stability beats accuracy: a subtitle whose form mutates inside one breath is unreadable. The next utterance gets a fresh decision.
- Affect is multidimensional. Categorical labels alone collapse 80% of everyday speech to neutral. emotype keeps three representations side by side: 8 categorical emotions, a continuous valence × arousal pair (Russell's circumplex), and a scalar intensity in [0, 1].
- Form is amplitude. Strong anger shakes wider; soft anger trembles. Same category, different intensity. Letter spacing, motion frequency, glow radius, and weight all scale with intensity.
- A subtitle expresses the utterance, not the speaker. Same person, two utterances, two designs. emotype never labels a person as "the angry one"; it dresses each utterance in a form. Form is performance, not diagnosis.
emotype's design draws on established work in affective computing, perceptual psychology, and on-device inference.
- Ekman's basic emotions (1992): the categorical layer. emotype uses 8 categories (Ekman 7 + contempt): `anger`, `contempt`, `disgust`, `fear`, `happiness`, `neutral`, `sadness`, `surprise`.
- Russell's circumplex (1980): the dimensional layer. Every utterance also receives a continuous valence (pleasant ↔ unpleasant) and arousal (calm ↔ excited) pair. The four V/A quadrants determine the mood layer of the design (`excited` / `tense` / `depressed` / `content`).
- Plutchik's dyads (1980): reserved for V2. The current implementation cross-fades between adjacent base presets at 200 ms instead of synthesizing dyad designs.
- One breath = one design: typographic identity is preserved within a single utterance because reading collapses when form mutates mid-line. The subtitle is a reading artifact first.
- Intensity ease-out: the mapping from `intensity ∈ [0, 1]` to amplitude follows `1 − (1 − i)²`, so weak expressions are still legible (a 2 px tremor is detectable) and strong expressions reach 160% of the preset's base amplitude.
- HSEmotion ENet-B0 8 V/A MTL (Savchenko 2022): a 16 MB ONNX model that emits both 8-class logits and V/A from a single forward pass. Faster and more compact than running classification + V/A separately.
- Streaming Silero VAD (Snakers4): utterance segmentation through `VADIterator` at a 16 kHz / 512-sample window. Far more responsive than batch `get_speech_timestamps`, and the only pattern that yields low end-of-utterance latency.
- Apple MLX Whisper: Metal + unified memory lets `whisper-large-v3-turbo` transcribe ~10 s of speech in ~0.6 s on M-series silicon, with Korean WER within 1% of full `large-v3`.
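The intensity ease-out listed above can be sketched in a few lines. This is an illustration rather than the project's code: `ease_out`, `scaled_amplitude`, and the `max_scale` parameter are hypothetical names, and applying the 160% ceiling as a plain multiplier on the preset's base amplitude is an assumption.

```python
def ease_out(intensity: float) -> float:
    """Quadratic ease-out 1 - (1 - i)^2, with i clamped to [0, 1]."""
    i = min(1.0, max(0.0, intensity))
    return 1.0 - (1.0 - i) ** 2


def scaled_amplitude(base_px: float, intensity: float, max_scale: float = 1.6) -> float:
    """Scale a preset's base amplitude by the eased intensity.

    Assumption: the ceiling (max_scale times the base) is reached at intensity == 1.
    """
    return base_px * max_scale * ease_out(intensity)
```

Because the curve rises fastest near zero, an intensity of 0.25 already maps to about 44% of the ceiling, which is what keeps weak expressions visible.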
Each preset is a DesignToken — color palette, Korean primary + display font, Latin display font, motion type, intensity sensitivity. Drawn from nexu-io/open-design and re-grounded for Korean reading.
| Emotion | Mood | Primary | Korean display | Latin display | Motion |
|---|---|---|---|---|---|
| happiness | excited | #EA580C burnt orange | Jua | Limelight | bounce |
| anger | tense | #DC2626 red | Black Han Sans | Outfit | shake + glitch |
| sadness | depressed | #2F5B4F forest | Nanum Myeongjo | Lora Italic | slow descend |
| fear | tense | #3B82F6 electric | Pretendard Light | Audiowide | jitter + flicker |
| disgust | tense | #37F712 toxic green | D2Coding | Space Mono | warp |
| surprise | excited | #DB2777 magenta | Gowun Dodum | Fascinate | pop scale |
| contempt | depressed | #111111 ink | Nanum Myeongjo Italic | Gelasio Italic | tilt + slow |
| neutral | content | #0C0C09 graphite | Pretendard Regular | Inter | static |
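The Mood column uses the four valence × arousal quadrant names. A minimal sketch of a quadrant lookup follows, assuming the conventional circumplex layout (positive valence with high arousal is `excited`, negative valence with high arousal is `tense`, and so on) and assuming that zero counts as positive. `mood_quadrant` is a hypothetical helper, not a function from the codebase.

```python
def mood_quadrant(valence: float, arousal: float) -> str:
    """Map a valence/arousal pair in [-1, 1] to one of four mood labels."""
    if arousal >= 0.0:
        return "excited" if valence >= 0.0 else "tense"
    return "content" if valence >= 0.0 else "depressed"
```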
Live-reloadable from ~/.submaker/design_presets.yaml. Editing the YAML and saving updates the next utterance — no restart.
    microphone ──▶ Silero VAD ──▶ utterance segments ──▶ mlx-whisper ──▶ Korean text
                                                         (turbo, ~0.6 s for 10 s audio)
The full STT path runs on your Mac. No audio leaves the machine unless you explicitly switch to a cloud provider. The cloud path is opt-in (pip install emotype[cloud]) and the router validates the credential, falls back to local on auth failure, and never logs the API key.
    ┌──────────────────────────────────────┐
    │                                      │  ◀ frameless
    │                                      │  ◀ always-on-top
    │             괜  찮  아  …              │  ◀ click-through (toggle with `L`)
    │                                      │  ◀ remembers position across runs
    │                                      │
    └──────────────────────────────────────┘
A Qt.FramelessWindowHint | Qt.WindowStaysOnTopHint overlay you can drag anywhere on any monitor. Multi-monitor positions persist in ~/.submaker/layout.json and snap back to the bottom-center if the saved rect ends up off-screen on a smaller setup.
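The snap-back rule can be sketched without Qt. The example below assumes that "off-screen" means the saved rectangle overlaps no available screen, that the first screen in the list is the fallback, and that a fixed bottom `margin` is used. `Rect`, `intersects`, and `restore_rect` are hypothetical names.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    x: int
    y: int
    w: int
    h: int


def intersects(a: Rect, b: Rect) -> bool:
    """True when the two rectangles share any area."""
    return (a.x < b.x + b.w and b.x < a.x + a.w
            and a.y < b.y + b.h and b.y < a.y + a.h)


def restore_rect(saved: Rect, screens: list[Rect], margin: int = 48) -> Rect:
    """Keep the saved rect if some screen shows it, else snap to bottom-center."""
    if any(intersects(saved, s) for s in screens):
        return saved
    primary = screens[0]  # assumption: first screen is the fallback
    x = primary.x + (primary.w - saved.w) // 2
    y = primary.y + primary.h - saved.h - margin
    return Rect(x, y, saved.w, saved.h)
```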
Long Korean utterances are wrapped at the overlay's render width using QFontMetricsF glyph advances, then animated with a monotonically advancing character index across line breaks. The motion stagger feels continuous even when the line wraps mid-utterance.
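A rough sketch of the wrap-then-stagger idea: each glyph keeps a global index that keeps increasing across line breaks, and the animation delay depends only on that index. The `advance` callable stands in for a glyph-advance query such as `QFontMetricsF.horizontalAdvance`. Greedy per-character wrapping and the 30 ms step are simplifying assumptions.

```python
from typing import Callable


def wrap_with_indices(
    text: str, max_width: float, advance: Callable[[str], float]
) -> list[list[tuple[int, str]]]:
    """Greedy per-character wrap; every glyph keeps its global index."""
    lines: list[list[tuple[int, str]]] = [[]]
    used = 0.0
    for index, ch in enumerate(text):
        width = advance(ch)
        # Start a new line when the glyph would overflow a non-empty line.
        if lines[-1] and used + width > max_width:
            lines.append([])
            used = 0.0
        lines[-1].append((index, ch))
        used += width
    return lines


def stagger_delay_ms(index: int, step_ms: float = 30.0) -> float:
    """Animation start delay for a glyph, based only on its global index."""
    return index * step_ms
```

Since the delay ignores which line a glyph landed on, the first glyph of the second line starts exactly one step after the last glyph of the first.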
Pick a different camera or microphone from the control panel mid-session — the orchestrator gracefully tears down the running thread, joins it with a 2 s ceiling, and starts a fresh thread on the new device without dropping the live STT/emotion pipeline.
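The tear-down-and-restart pattern described here might look like the following sketch. `SwappableWorker` and its methods are hypothetical, and the capture loop is assumed to poll a `threading.Event` so it can exit promptly. Only the 2 s join ceiling comes from the text above.

```python
import threading
from typing import Callable


class SwappableWorker:
    """Runs one capture loop at a time and can restart it on a new device."""

    def __init__(self, loop: Callable[[str, threading.Event], None],
                 join_timeout_s: float = 2.0):
        self._loop = loop
        self._join_timeout_s = join_timeout_s
        self._stop = threading.Event()
        self._thread: threading.Thread | None = None

    def start(self, device: str) -> None:
        self._stop = threading.Event()  # fresh stop flag for each thread
        self._thread = threading.Thread(
            target=self._loop, args=(device, self._stop), daemon=True)
        self._thread.start()

    def stop(self) -> bool:
        """Signal the loop to exit; True if it stopped within the ceiling."""
        if self._thread is None:
            return True
        self._stop.set()
        self._thread.join(self._join_timeout_s)
        stopped = not self._thread.is_alive()
        self._thread = None
        return stopped

    def swap(self, device: str) -> None:
        """Stop the current loop, then start a new one on another device."""
        self.stop()
        self.start(device)
```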
- macOS (Apple Silicon recommended; `mlx-whisper` is M-series-only)
- Python ≥ 3.11, < 3.13
- ~3 GB free disk for the cached Whisper turbo model + ONNX weights + fonts (downloaded on first run)
- A camera and a microphone (built-in or USB; both hot-swappable)
    git clone https://github.com/TaewoooPark/emotype.git
    cd emotype
    python3.12 -m venv .venv
    source .venv/bin/activate
    pip install -e .

Cloud STT (optional):

    pip install -e '.[cloud]'   # both Google + OpenAI
    pip install -e '.[google]'  # only Google STT v2
    pip install -e '.[openai]'  # only OpenAI Whisper API

Pre-fetch the ~80 MB of fonts and models so the first launch isn't a download:

    emotype-fetch-assets

(Otherwise they're fetched lazily into `~/.submaker/assets/` the first time the pipeline starts.)

    emotype
    # or, equivalently:
    python -m submaker.app.cli

- Pick a camera and a microphone in the right-hand control panel.
- Press `Space` (or click Start). Lamps turn green when each stage is hot.
- Drag the subtitle window to where you want it. Press `L` to lock click-through.
- Speak. The subtitle appears in the design that matches what your face and voice are saying.
- Slide intensity gain if the motion is too subtle or too aggressive for your taste.
| Action | Input |
|---|---|
| Start / stop pipeline | Space |
| Lock subtitle (click-through) | L |
| Reset overlay to bottom-center | R |
| Decrease intensity gain | [ |
| Increase intensity gain | ] |
| Open `design_presets.yaml` | Cmd+, |
| Quit | Cmd+Q |
Drop credentials into ~/.submaker/credentials.toml (the file is created with 0600 perms; the loader masks every field in __repr__ so a stray log line never leaks a key):
    [local]
    model_repo = "mlx-community/whisper-large-v3-turbo"
    language = "ko"

    [google]
    service_account_json_path = "/path/to/sa.json"
    language = "ko-KR"

    [openai]
    api_key = "sk-…"
    language = "ko"

Switch the active provider from the control panel's STT dropdown. The lamp turns green once the healthcheck passes and red on auth or network failure, and the router automatically falls back to local so a missing key never silently breaks transcription.
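The fallback behaviour can be pictured with a small router sketch. Everything here is hypothetical (`SttRouter`, `AuthError`, the `Transcriber` callable type). Whether the real router stays on local after a failure or retries the cloud provider on the next utterance is not stated, so the sticky fallback below is an assumption.

```python
from typing import Callable

Transcriber = Callable[[bytes], str]


class AuthError(Exception):
    """Hypothetical error a cloud adapter raises on bad credentials."""


class SttRouter:
    def __init__(self, local: Transcriber, cloud: dict[str, Transcriber]):
        self._local = local
        self._cloud = cloud
        self.active = "local"

    def select(self, provider: str) -> None:
        if provider != "local" and provider not in self._cloud:
            raise ValueError(f"unknown provider: {provider}")
        self.active = provider

    def transcribe(self, audio: bytes) -> tuple[str, str]:
        """Return (provider_used, text); fall back to local on failure."""
        if self.active != "local":
            try:
                return self.active, self._cloud[self.active](audio)
            except (AuthError, ConnectionError):
                # Assumption: stay on local until the user re-selects.
                self.active = "local"
        return "local", self._local(audio)
```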
| Layer | Technology |
|---|---|
| Capture | opencv-python, sounddevice, pyobjc-framework-AVFoundation |
| Face & emotion | mediapipe.tasks.vision.FaceDetector, HSEmotion ENet-B0 ONNX (onnxruntime + CoreML EP) |
| Speech | Streaming Silero VAD, mlx-whisper (local), Google STT v2 / OpenAI (optional) |
| State | pydantic v2 schemas, queue.Queue for data, pyqtSignal for UI |
| UI | PyQt5 — main window + frameless overlay + Qt stylesheets |
| Design tokens | YAML preset library, V/A interpolation, intensity scaling in LCH |
| Build | pip install -e ., optional PyInstaller .app (submaker.spec) |
    capture ──▶ emotion ──┐
                          ├──▶ fusion ──▶ design ──▶ overlay
    mic ──▶ VAD ──▶ STT ──┘
| Stage | What it does | Output |
|---|---|---|
| Capture | Camera frames + mic PCM, hot-swappable, with backpressure | Frame, AudioChunk |
| Emotion | Face crop → HSEmotion → 8-class logits + V/A + intensity | EmotionFrame |
| VAD + STT | Silero VAD splits the mic into segments; Whisper transcribes each (final-only, no partials) | Utterance |
| Fusion | Averages the emotion frames whose timestamp falls inside the utterance | EmotionSegment |
| Design | Maps the chosen emotion to a DesignToken (font, colors, motion type, intensity) | DesignedSubtitle |
| Overlay | Frameless click-through PyQt5 window, multi-line wrap + per-glyph stagger | painted pixels |
All cross-thread payloads are Pydantic models. All timestamps are time.monotonic_ns(). Cross-thread data flows through bounded queue.Queues (with backpressure); UI notifications go through pyqtSignals (drop-able). The two are never mixed.
    from typing import Literal
    from uuid import UUID

    from pydantic import BaseModel

    # Note: `Word` (word-level timing) is referenced by `Utterance.words`
    # but its definition is not shown in this excerpt.


    class EmotionFrame(BaseModel):
        track_id: UUID        # face track, never persisted across sessions
        ts_ns: int            # time.monotonic_ns()
        label: Literal[       # 8 categorical emotions (Ekman + contempt)
            "anger", "contempt", "disgust", "fear",
            "happiness", "neutral", "sadness", "surprise",
        ]
        probs: list[float]    # 8-class softmax, sums to 1.0
        valence: float        # Russell circumplex, [-1, 1]
        arousal: float        # Russell circumplex, [-1, 1]
        intensity: float      # 0.5·dom_margin + 0.5·|arousal|, [0, 1]


    class Utterance(BaseModel):
        segment_id: UUID      # one per VAD segment
        t_start_ns: int
        t_end_ns: int
        text: str             # final, never partial
        lang: str             # ISO 639-1, default "ko"
        provider: Literal["local", "google", "openai"]
        confidence: float | None
        words: list[Word] | None  # word-level timing if provider supplies it


    class DesignToken(BaseModel):
        font_family_korean: str
        font_family_korean_display: str
        font_family_latin: str
        color_fg: str         # hex sRGB
        color_outline: str
        color_glow: str | None
        motion_type: Literal[
            "bounce", "shake_glitch", "slow_descend", "jitter_flicker",
            "warp", "pop_scale", "tilt_slow", "static",
        ]
        motion_amplitude_px: float
        motion_frequency_hz: float
        intensity: float      # forwarded from EmotionSegment, drives ease-out

submaker/
├── core/ # Pydantic data contracts (every cross-thread payload)
├── capture/ # CameraThread, AudioThread, DeviceManager (hot-swap)
├── emotion/ # MediaPipe FaceDetector + HSEmotion ONNX session
├── stt/ # Streaming Silero VAD, mlx-whisper, optional cloud adapters, router
├── fusion/ # Utterance × EmotionFrame ring buffer → EmotionSegment
├── design/ # 8 preset library, V/A interpolation, LCH intensity scaling
├── ui/ # MainWindow, ControlPanel, SubtitleOverlay, HUD
├── presets/ # design_presets.yaml — the YAML layer of the design system
└── assets/ # Lazy-downloaded ONNX/tflite models + fonts
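The `intensity` comment in `EmotionFrame` gives the formula `0.5·dom_margin + 0.5·|arousal|`. The README does not define `dom_margin`, so the sketch below assumes it is the gap between the two largest class probabilities. `intensity_from` is a hypothetical helper.

```python
def intensity_from(probs: list[float], arousal: float) -> float:
    """0.5 * dom_margin + 0.5 * |arousal|, clamped to [0, 1].

    Assumption: dom_margin is the gap between the top two probabilities.
    """
    top, runner_up = sorted(probs, reverse=True)[:2]
    dom_margin = top - runner_up
    value = 0.5 * dom_margin + 0.5 * abs(arousal)
    return min(1.0, max(0.0, value))
```

Under that reading, a flat distribution with zero arousal yields an intensity of 0, while a confident, high-arousal frame approaches 1.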
1. Stability over accuracy. A subtitle whose form mutates mid-breath is unreadable. The cost of a wrong-but-stable design is far smaller than the cost of a flickering one.
2. Time is monotonic. Every `ts_ns` is `time.monotonic_ns()`. Wall clock is forbidden because device hot-swap, daylight saving time, and NTP corrections would otherwise silently corrupt subtitle ordering.
3. Final, not partial. STT speaks once per VAD segment. Partial transcriptions would force the design to redecide mid-utterance, which is exactly what principle 1 forbids.
4. Queues for data, signals for UI. `queue.Queue` carries cross-thread payloads with backpressure; `pyqtSignal` carries UI notifications and is allowed to drop. The two are never aliased.
5. No persistent identity. A track ID never inherits an emotion. When the face track breaks, the new track starts emotion-blank. The system measures the utterance, not the person.
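The "queues for data" rule relies on bounded queues. A minimal sketch of a producer-side helper follows, assuming backpressure means the producer waits briefly for space and then reports failure. `offer` and its timeout are hypothetical, and what the caller does when it returns `False` is left open.

```python
import queue


def offer(q: queue.Queue, item: object, timeout_s: float = 0.05) -> bool:
    """Try to enqueue; wait up to timeout_s when the consumer is lagging."""
    try:
        q.put(item, timeout=timeout_s)
        return True
    except queue.Full:
        return False
```

With `queue.Queue(maxsize=n)`, a slow consumer makes `offer` block and eventually return `False` instead of letting memory grow without limit.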
The face-emotion model is a hypothesis, not the truth. emotype is a tool for expression, not diagnosis.
- Models are not facts. HSEmotion is biased toward Western expressions and AffectNet's collection conditions. Korean faces, masks, and non-prototypical expressions degrade the signal sharply.
- No persistence. No camera frame or audio buffer is written to disk by emotype. Cloud STT calls are governed by the provider's policy and the user must give consent before sending audio out of the device.
- No diagnosis. Don't use this for hiring, medical screening, or any decision-bearing context.
- No identity labels. A track ID never inherits an emotion. emotype dresses utterances, not people.
If you build on emotype, please keep these constraints visible to your end users.
emotype is alpha — it works on the developer's machine, the pipeline is end-to-end, and the eight design presets render. Expect rough edges around device hot-swap (occasionally drops frames during a swap), first-run UX (the model download is silent on stderr — no GUI progress yet), and Apple Silicon-only constraints (mlx-whisper does not run on Intel; cloud STT works on Intel but the local provider does not).
MIT — see LICENSE. Bundled-asset licenses are honoured separately:
| Asset | License | Source |
|---|---|---|
| Pretendard | SIL OFL 1.1 | orioncactus/pretendard |
| Black Han Sans, Jua, Nanum Myeongjo, Gowun Dodum, Limelight, Audiowide, Fascinate, Outfit, Lora, Inter, Space Mono, Gelasio | SIL OFL 1.1 | Google Fonts |
| HSEmotion ENet-B0 ONNX | Apache 2.0 | HSE-asavchenko/face-emotion-recognition |
| MediaPipe BlazeFace short-range | Apache 2.0 | Google MediaPipe |
| Silero VAD | MIT | snakers4/silero-vad |
| Apple MLX Whisper | MIT | ml-explore/mlx-examples |
- HSEmotion for the ENet-B0 8 V/A MTL ONNX model.
- MediaPipe for BlazeFace short-range.
- silero-vad for the streaming VAD that makes utterance segmentation feel instant.
- Apple MLX for making on-device Whisper actually fast on Mac.
- Pretendard for the Korean variable typeface that holds 8 emotional registers without breaking a sweat.
- `nexu-io/open-design` for the design-system seeds the 8 presets are derived from.
Built with PyQt5, Apple MLX, MediaPipe, ONNX Runtime, and Silero VAD.