emotype

Real-time Korean subtitles whose typography carries the speaker's affect.

emotype reads your face and your voice, decides what emotion is happening right now, and types the subtitle in a form that carries that emotion. Anger shakes. Sadness slumps. Joy bounces. Same words, different shapes — the way real speech already works, just made visible.

"Text carries meaning. Form carries affect."

Korean README  ·  taewoopark.com — author site

┌──────────┐   ┌─────────────┐
│  Camera  │──▶│  HSEmotion  │──┐
│  (face)  │   │  + V/A      │  │   ┌───────────────────┐
└──────────┘   └─────────────┘  │   │ Subtitle Overlay  │
                                ├──▶│  free-place       │
┌──────────┐   ┌─────────────┐  │   │  click-through    │
│   Mic    │──▶│  VAD + STT  │──┘   │  always-on-top    │
│  (voice) │   │  (Whisper)  │      └───────────────────┘
└──────────┘   └─────────────┘
                       │
                  Fusion: utterance segment ⟷ averaged emotion frames
                  → 1 utterance = 1 emotion = 1 design

Why emotype?

Most subtitles only carry what was said. Typography is treated as a neutral medium for legibility, and emotion is left to the viewer's imagination. emotype's hypothesis runs the other way:

| Traditional subtitles | emotype |
| --- | --- |
| Same neutral typography for every utterance | Per-utterance design — typography carries affect |
| Static font, color, weight | Dynamic intensity scaling (1px tremor → 32px shake) |
| Emotion left to the viewer | Emotion measured (face) and bound (audio) at the utterance level |
| One linear stream of text | A typographic instrument that performs the line |

The same Korean line "괜찮아" ("it's fine") is a different utterance when the letters tremble on top of anger, slump on top of sadness, or bounce on top of joy. emotype takes the affect that's already in your face and voice and translates it into the typography of the subtitle, so the caption carries both meaning and feeling at once.


Design principles

  1. One utterance = one emotion = one design. Even if your face shifts mid-sentence, the subtitle's design does not. Stability beats accuracy: a subtitle whose form mutates inside one breath is unreadable. The next utterance gets a fresh decision.
  2. Affect is multidimensional. Categorical labels alone collapse 80% of everyday speech to neutral. emotype keeps three representations side by side: 8 categorical emotions, a continuous valence × arousal pair (Russell's circumplex), and a scalar intensity in [0, 1].
  3. Form is amplitude. Strong anger shakes wider; soft anger trembles. Same category, different intensity. Letter spacing, motion frequency, glow radius, and weight all scale with intensity.
  4. A subtitle expresses the utterance, not the speaker. Same person, two utterances, two designs. emotype never labels a person as "the angry one"; it dresses each utterance in a form. Form is performance, not diagnosis.

Theoretical Foundations

emotype's design draws on established work in affective computing, perceptual psychology, and on-device inference.

Affect Models

  • Ekman's basic emotions (1992) — the categorical layer. emotype uses 8 categories (Ekman's six basic emotions plus contempt and neutral): anger, contempt, disgust, fear, happiness, neutral, sadness, surprise.
  • Russell's Circumplex (1980) — the dimensional layer. Every utterance also receives a continuous valence (pleasant ↔ unpleasant) and arousal (calm ↔ excited) pair. The four V/A quadrants determine the mood layer of the design (excited / tense / depressed / content).
  • Plutchik's dyads (1980) — reserved for V2. The current implementation cross-fades between adjacent base presets at 200 ms instead of synthesizing dyad designs.
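
The quadrant-to-mood mapping of the dimensional layer can be sketched in a few lines. This is a minimal illustration, not emotype's actual code: the function name `mood_from_va` and the choice to treat exactly 0.0 as the positive side are assumptions.

```python
def mood_from_va(valence: float, arousal: float) -> str:
    """Map a valence/arousal pair in [-1, 1] to one of the four mood quadrants.

    Assumption: a value of exactly 0.0 counts as the positive half-plane.
    """
    if arousal >= 0.0:
        # High arousal: pleasant is "excited", unpleasant is "tense".
        return "excited" if valence >= 0.0 else "tense"
    # Low arousal: pleasant is "content", unpleasant is "depressed".
    return "content" if valence >= 0.0 else "depressed"
```

For example, a pleasant but calm utterance such as `mood_from_va(0.5, -0.3)` lands in the `content` quadrant.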

Perception & Reading

  • One breath = one design — typographic identity is preserved within a single utterance because reading collapses when form mutates mid-line. The subtitle is a reading artifact first.
  • Intensity ease-out — the mapping from intensity ∈ [0, 1] to amplitude follows 1 − (1 − i)² so weak expressions are still legible (a 2 px tremor is detectable) and strong expressions reach 160% of the preset's base amplitude.
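
The ease-out curve can be written out directly. The sketch below assumes the eased value is multiplied linearly by the 160% ceiling; how emotype actually combines the curve with the preset's base amplitude is not stated here, so `scaled_amplitude` and `max_scale` are hypothetical names.

```python
def ease_out(intensity: float) -> float:
    """Quadratic ease-out 1 - (1 - i)^2, with i clamped to [0, 1]."""
    i = min(max(intensity, 0.0), 1.0)
    return 1.0 - (1.0 - i) ** 2


def scaled_amplitude(base_px: float, intensity: float, max_scale: float = 1.6) -> float:
    """Scale a preset's base amplitude by the eased intensity.

    Assumption: full intensity maps to max_scale (160%) of the base amplitude.
    """
    return base_px * max_scale * ease_out(intensity)
```

Because the curve rises steeply near zero, half intensity already yields 75% of the maximum (`ease_out(0.5)` is `0.75`), which is what keeps weak expressions visible.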

On-Device Inference

  • HSEmotion ENet-B0 8 V/A MTL (Savchenko 2022) — a 16 MB ONNX model that emits both 8-class logits and V/A from a single forward pass. Faster and more compact than running classification + V/A separately.
  • Streaming Silero VAD (Snakers4) — utterance segmentation through VADIterator at a 16 kHz / 512-sample window. Far more responsive than batch get_speech_timestamps and the only pattern that yields low end-of-utterance latency.
  • Apple MLX Whisper — Metal + unified memory makes whisper-large-v3-turbo transcribe ~10 s of speech in ~0.6 s on M-series silicon, with Korean WER within 1% of full large-v3.
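
The streaming segmentation pattern can be sketched without loading the model. The code assumes a `detect` callable that behaves the way silero-vad's `VADIterator` is understood to behave when called per window with sample-based output: it returns `None`, `{"start": sample}`, or `{"end": sample}`. The `segments` helper itself is hypothetical.

```python
from typing import Callable, Iterable, Iterator, Optional

WINDOW = 512  # samples per VAD window at 16 kHz, as stated above


def segments(
    windows: Iterable[list[float]],
    detect: Callable[[list[float]], Optional[dict]],
) -> Iterator[tuple[int, int]]:
    """Yield (start_sample, end_sample) as soon as each utterance closes."""
    start: Optional[int] = None
    for window in windows:
        event = detect(window)  # assumed: None, {"start": n} or {"end": n}
        if not event:
            continue
        if "start" in event:
            start = event["start"]
        elif "end" in event and start is not None:
            # The segment is emitted the moment the end event arrives,
            # which is what keeps end-of-utterance latency low.
            yield start, event["end"]
            start = None
```

In a real pipeline the windows would come from the microphone thread and each yielded range would be sliced out of the audio buffer and handed to the STT provider.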

Features

Eight emotion presets

Each preset is a DesignToken — color palette, Korean primary + display font, Latin display font, motion type, intensity sensitivity. Drawn from nexu-io/open-design and re-grounded for Korean reading.

| Emotion | Mood | Primary | Korean display | Latin display | Motion |
| --- | --- | --- | --- | --- | --- |
| happiness | excited | #EA580C burnt orange | Jua | Limelight | bounce |
| anger | tense | #DC2626 red | Black Han Sans | Outfit | shake + glitch |
| sadness | depressed | #2F5B4F forest | Nanum Myeongjo | Lora Italic | slow descend |
| fear | tense | #3B82F6 electric | Pretendard Light | Audiowide | jitter + flicker |
| disgust | tense | #37F712 toxic green | D2Coding | Space Mono | warp |
| surprise | excited | #DB2777 magenta | Gowun Dodum | Fascinate | pop scale |
| contempt | depressed | #111111 ink | Nanum Myeongjo Italic | Gelasio Italic | tilt + slow |
| neutral | content | #0C0C09 graphite | Pretendard Regular | Inter | static |

Live-reloadable from ~/.submaker/design_presets.yaml. Editing the YAML and saving updates the next utterance — no restart.
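
One simple way to get this behaviour is to compare the file's modification time each time a new utterance needs a design. The sketch below is an assumption about the mechanism, not emotype's implementation; `ReloadingFile` is a hypothetical name, and in practice `parse` would be a YAML loader such as PyYAML's `yaml.safe_load`.

```python
import os
from typing import Any, Callable, Optional


class ReloadingFile:
    """Re-parse a file only when its modification time has changed."""

    def __init__(self, path: str, parse: Callable[[str], Any]) -> None:
        self._path = path
        self._parse = parse
        self._mtime_ns: Optional[int] = None
        self._value: Any = None

    def get(self) -> Any:
        """Return the parsed content, reloading if the file was saved again."""
        mtime_ns = os.stat(self._path).st_mtime_ns
        if mtime_ns != self._mtime_ns:
            with open(self._path, encoding="utf-8") as fh:
                self._value = self._parse(fh.read())
            self._mtime_ns = mtime_ns
        return self._value
```

Calling `get()` once per utterance means an edit takes effect on the next line spoken, and an utterance already on screen keeps the design it was given.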

On-device by default

  microphone ──▶ Silero VAD ──▶ utterance segments ──▶ mlx-whisper ──▶ Korean text
                                                       (turbo, ~0.6 s for 10 s audio)

The full STT path runs on your Mac. No audio leaves the machine unless you explicitly switch to a cloud provider. The cloud path is opt-in (pip install emotype[cloud]) and the router validates the credential, falls back to local on auth failure, and never logs the API key.

A free-place subtitle window

  ┌──────────────────────────────────────┐
  │                                      │  ◀ frameless
  │                                      │  ◀ always-on-top
  │            괜 찮 아 …                │  ◀ click-through (toggle with `L`)
  │                                      │  ◀ remembers position across runs
  │                                      │
  └──────────────────────────────────────┘

A Qt.FramelessWindowHint | Qt.WindowStaysOnTopHint overlay you can drag anywhere on any monitor. Multi-monitor positions persist in ~/.submaker/layout.json and snap back to the bottom-center if the saved rect ends up off-screen on a smaller setup.
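
The snap-back rule is plain rectangle arithmetic and can be sketched without Qt. Several details below are assumptions: "off-screen" is taken to mean "not fully inside any screen", the first screen in the list is treated as the primary one, and the 48 px bottom margin is arbitrary. `Rect` and `restore_overlay` are hypothetical names.

```python
from typing import NamedTuple, Sequence


class Rect(NamedTuple):
    x: int
    y: int
    w: int
    h: int


def fully_inside(inner: Rect, outer: Rect) -> bool:
    """True when inner lies entirely within outer."""
    return (
        outer.x <= inner.x
        and outer.y <= inner.y
        and inner.x + inner.w <= outer.x + outer.w
        and inner.y + inner.h <= outer.y + outer.h
    )


def restore_overlay(saved: Rect, screens: Sequence[Rect], margin_px: int = 48) -> Rect:
    """Keep the saved rect if some screen contains it, else snap to bottom-center."""
    if any(fully_inside(saved, screen) for screen in screens):
        return saved
    primary = screens[0]  # assumption: first entry is the primary screen
    x = primary.x + (primary.w - saved.w) // 2
    y = primary.y + primary.h - saved.h - margin_px
    return Rect(x, y, saved.w, saved.h)
```

A position saved on a second monitor therefore survives as long as that monitor is attached, and falls back to the primary screen's bottom-center once it is unplugged.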

Multi-line typesetting with stagger continuity

Long Korean utterances are wrapped at the overlay's render width using QFontMetricsF glyph advances, then animated with a monotonically advancing character index across line breaks. The motion stagger feels continuous even when the line wraps mid-utterance.
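
The idea of a character index that keeps counting across line breaks can be sketched as follows. The `advance` callable stands in for a per-glyph width lookup (in the app this would presumably come from QFontMetricsF); breaking at any character rather than at word boundaries is an assumption, and `wrap_with_stagger` is a hypothetical name.

```python
from typing import Callable


def wrap_with_stagger(
    text: str,
    max_width: float,
    advance: Callable[[str], float],
) -> list[list[tuple[int, str]]]:
    """Wrap text into lines of (global_index, char) pairs.

    The index never resets at a line break, so a per-character delay of
    index * step keeps the motion stagger continuous across lines.
    """
    lines: list[list[tuple[int, str]]] = [[]]
    line_width = 0.0
    for index, ch in enumerate(text):
        width = advance(ch)
        # Start a new line only if the current one already holds something.
        if lines[-1] and line_width + width > max_width:
            lines.append([])
            line_width = 0.0
        lines[-1].append((index, ch))
        line_width += width
    return lines
```

With a stagger step of, say, 30 ms, the first glyph of the second line simply starts 30 ms after the last glyph of the first line, instead of restarting from zero.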

Hot-swap capture devices

Pick a different camera or microphone from the control panel mid-session — the orchestrator gracefully tears down the running thread, joins it with a 2 s ceiling, and starts a fresh thread on the new device without dropping the live STT/emotion pipeline.
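
The teardown-then-restart sequence can be sketched with the standard `threading` module. This is an illustration under assumptions: `SwappableWorker` is a hypothetical class, the capture loop is expected to poll a stop event, and a thread that misses the 2 s ceiling is simply abandoned as a daemon.

```python
import threading
from typing import Callable, Optional


class SwappableWorker:
    """Run one capture loop at a time; switching devices joins the old thread first."""

    def __init__(
        self,
        run: Callable[[str, threading.Event], None],
        join_timeout_s: float = 2.0,
    ) -> None:
        self._run = run
        self._join_timeout_s = join_timeout_s
        self._thread: Optional[threading.Thread] = None
        self._stop_event = threading.Event()

    def start(self, device: str) -> None:
        """Start capturing from device, tearing down any running thread first."""
        self.stop()
        self._stop_event = threading.Event()
        self._thread = threading.Thread(
            target=self._run, args=(device, self._stop_event), daemon=True
        )
        self._thread.start()

    def stop(self) -> bool:
        """Signal the loop to exit; return True if it ended within the ceiling."""
        if self._thread is None:
            return True
        self._stop_event.set()
        self._thread.join(self._join_timeout_s)
        exited = not self._thread.is_alive()
        self._thread = None  # a stuck daemon thread is abandoned, not reused
        return exited
```

Because each worker only feeds a queue, downstream stages keep running during the swap and just see a short gap in frames or audio.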


Getting Started

Prerequisites

  • macOS (Apple Silicon recommended — mlx-whisper is M-series-only)
  • Python ≥ 3.11, < 3.13
  • ~3 GB free disk for the cached Whisper turbo model + ONNX weights + fonts (downloaded on first run)
  • A camera and a microphone (built-in or USB; both hot-swappable)

Installation

git clone https://github.com/TaewoooPark/emotype.git
cd emotype
python3.12 -m venv .venv
source .venv/bin/activate
pip install -e .

Cloud STT (optional):

pip install -e '.[cloud]'         # both Google + OpenAI
pip install -e '.[google]'        # only Google STT v2
pip install -e '.[openai]'        # only OpenAI Whisper API

Pre-fetch the ~80 MB of fonts and models so the first launch isn't a download:

emotype-fetch-assets

(Otherwise they're fetched lazily into ~/.submaker/assets/ the first time the pipeline starts.)

Quick Start

emotype
# or, equivalently:
python -m submaker.app.cli

  1. Pick a camera and a microphone in the right-hand control panel.
  2. Press Space (or click Start). Lamps turn green when each stage is hot.
  3. Drag the subtitle window to where you want it. Press L to lock click-through.
  4. Speak. The subtitle appears in the design that matches what your face and voice are saying.
  5. Slide intensity gain if the motion is too subtle or too aggressive for your taste.

Controls

| Action | Input |
| --- | --- |
| Start / stop pipeline | Space |
| Lock subtitle (click-through) | L |
| Reset overlay to bottom-center | R |
| Decrease intensity gain | [ |
| Increase intensity gain | ] |
| Open design_presets.yaml | Cmd+, |
| Quit | Cmd+Q |

Configuring STT providers

Drop credentials into ~/.submaker/credentials.toml (the file is created with 0600 perms; the loader masks every field in __repr__ so a stray log line never leaks a key):

[local]
model_repo = "mlx-community/whisper-large-v3-turbo"
language = "ko"

[google]
service_account_json_path = "/path/to/sa.json"
language = "ko-KR"

[openai]
api_key = "sk-…"
language = "ko"

Switch the active provider from the control panel's STT dropdown. The lamp turns green once the healthcheck passes, red on auth or network failure, and the router auto-falls-back to local so a missing key never silently breaks transcription.
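
The fallback decision can be sketched as a small pure function. The real router's interface is not shown in this README, so everything here is hypothetical: `pick_provider`, the use of `OSError` subclasses (`PermissionError`, `ConnectionError`) to signal auth and network failures, and the idea that each provider registers a healthcheck callable.

```python
from typing import Callable, Mapping


def pick_provider(
    requested: str,
    healthchecks: Mapping[str, Callable[[], None]],
) -> tuple[str, str]:
    """Return (active_provider, lamp_color) for the requested STT provider.

    Assumption: a healthcheck raises an OSError subclass on auth or
    network failure and returns normally when the provider is usable.
    """
    if requested == "local":
        return "local", "green"
    check = healthchecks.get(requested)
    if check is None:
        # No credential configured: show the failure, keep transcribing locally.
        return "local", "red"
    try:
        check()
    except OSError:
        return "local", "red"
    return requested, "green"
```

The lamp reports the failure while the active provider quietly becomes `local`, so the user sees that something is wrong without losing subtitles.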


Architecture

Tech Stack

| Layer | Technology |
| --- | --- |
| Capture | opencv-python, sounddevice, pyobjc-framework-AVFoundation |
| Face & emotion | mediapipe.tasks.vision.FaceDetector, HSEmotion ENet-B0 ONNX (onnxruntime + CoreML EP) |
| Speech | Streaming Silero VAD, mlx-whisper (local), Google STT v2 / OpenAI (optional) |
| State | pydantic v2 schemas, queue.Queue for data, pyqtSignal for UI |
| UI | PyQt5 — main window + frameless overlay + Qt stylesheets |
| Design tokens | YAML preset library, V/A interpolation, intensity scaling in LCH |
| Build | pip install -e ., optional PyInstaller .app (submaker.spec) |

Pipeline Stages

  capture ──▶ emotion ──┐
                        ├──▶ fusion ──▶ design ──▶ overlay
  mic ──▶ VAD ──▶ STT ──┘

| Stage | What it does | Output |
| --- | --- | --- |
| Capture | Camera frames + mic PCM, hot-swappable, with backpressure | Frame, AudioChunk |
| Emotion | Face crop → HSEmotion → 8-class logits + V/A + intensity | EmotionFrame |
| VAD + STT | Silero VAD splits the mic into segments; Whisper transcribes each (final-only, no partials) | Utterance |
| Fusion | Averages the emotion frames whose timestamp falls inside the utterance | EmotionSegment |
| Design | Maps the chosen emotion to a DesignToken (font, colors, motion type, intensity) | DesignedSubtitle |
| Overlay | Frameless click-through PyQt5 window, multi-line wrap + per-glyph stagger | painted pixels |
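
The fusion stage can be sketched with plain dataclasses. The sketch makes assumptions the table does not state: the segment label is taken as the argmax of the averaged class probabilities, timestamps on the boundary count as inside, and an utterance with no matching frames yields `None`. `Frame`, `Segment`, and `fuse` are hypothetical stand-ins for the real Pydantic models.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

LABELS = ("anger", "contempt", "disgust", "fear",
          "happiness", "neutral", "sadness", "surprise")


@dataclass(frozen=True)
class Frame:
    ts_ns: int
    probs: Sequence[float]  # one probability per entry in LABELS
    valence: float
    arousal: float
    intensity: float


@dataclass(frozen=True)
class Segment:
    label: str
    valence: float
    arousal: float
    intensity: float
    n_frames: int


def fuse(frames: Sequence[Frame], t_start_ns: int, t_end_ns: int) -> Optional[Segment]:
    """Average every frame whose timestamp falls inside the utterance."""
    inside = [f for f in frames if t_start_ns <= f.ts_ns <= t_end_ns]
    if not inside:
        return None  # assumption: the caller picks a fallback design
    n = len(inside)
    mean_probs = [sum(f.probs[k] for f in inside) / n for k in range(len(LABELS))]
    best = max(range(len(LABELS)), key=mean_probs.__getitem__)
    return Segment(
        label=LABELS[best],
        valence=sum(f.valence for f in inside) / n,
        arousal=sum(f.arousal for f in inside) / n,
        intensity=sum(f.intensity for f in inside) / n,
        n_frames=n,
    )
```

Averaging before choosing the label is what yields one emotion per utterance: a brief mid-sentence flicker on the face only shifts the mean slightly instead of changing the design.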

All cross-thread payloads are Pydantic models. All timestamps are time.monotonic_ns(). Cross-thread data flows through bounded queue.Queues (with backpressure); UI notifications go through pyqtSignals (drop-able). The two are never mixed.
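
A bounded queue gives backpressure almost for free: when the consumer falls behind, `put` blocks the producer. The helper below is a minimal sketch; what emotype does when the wait times out (drop, retry, or block indefinitely) is not specified here, so `offer` and its timeout are assumptions.

```python
import queue


def offer(q: queue.Queue, item: object, timeout_s: float = 0.05) -> bool:
    """Wait up to timeout_s for space in a bounded queue.

    Returns True if the item was enqueued and False if the queue stayed
    full, leaving the drop-or-retry decision to the producer.
    """
    try:
        q.put(item, timeout=timeout_s)
        return True
    except queue.Full:
        return False
```

A capture thread would create its queue with a fixed size, for example `queue.Queue(maxsize=8)`, and call `offer` for each payload.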

Data Model

from typing import Literal
from uuid import UUID

from pydantic import BaseModel


class EmotionFrame(BaseModel):
    track_id: UUID                     # face track, never persisted across sessions
    ts_ns: int                          # time.monotonic_ns()
    label: Literal[                     # 8 categorical emotions (Ekman + contempt)
        "anger", "contempt", "disgust", "fear",
        "happiness", "neutral", "sadness", "surprise",
    ]
    probs: list[float]                  # 8-class softmax, sums to 1.0
    valence: float                      # Russell circumplex, [-1, 1]
    arousal: float                      # Russell circumplex, [-1, 1]
    intensity: float                    # 0.5·dom_margin + 0.5·|arousal|, [0, 1]


class Utterance(BaseModel):
    segment_id: UUID                    # one per VAD segment
    t_start_ns: int
    t_end_ns: int
    text: str                           # final, never partial
    lang: str                           # ISO 639-1, default "ko"
    provider: Literal["local", "google", "openai"]
    confidence: float | None
    words: list[Word] | None            # word-level timing if provider supplies it


class DesignToken(BaseModel):
    font_family_korean: str
    font_family_korean_display: str
    font_family_latin: str
    color_fg: str                       # hex sRGB
    color_outline: str
    color_glow: str | None
    motion_type: Literal[
        "bounce", "shake_glitch", "slow_descend", "jitter_flicker",
        "warp", "pop_scale", "tilt_slow", "static",
    ]
    motion_amplitude_px: float
    motion_frequency_hz: float
    intensity: float                    # forwarded from EmotionSegment, drives ease-out
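
The intensity formula in the `EmotionFrame` comment can be made concrete. The sketch assumes `dom_margin` means the gap between the highest and second-highest class probability, which the comment does not define; `frame_intensity` is a hypothetical name.

```python
from typing import Sequence


def frame_intensity(probs: Sequence[float], arousal: float) -> float:
    """Compute 0.5 * dom_margin + 0.5 * |arousal|, clamped to [0, 1].

    Assumption: dom_margin is the top probability minus the runner-up.
    """
    top, second = sorted(probs, reverse=True)[:2]
    dom_margin = top - second
    value = 0.5 * dom_margin + 0.5 * abs(arousal)
    return min(max(value, 0.0), 1.0)
```

Under this reading, a confident classification with calm arousal and an uncertain one with high arousal can land on the same intensity, since both halves carry equal weight.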

Project Structure

submaker/
├── core/        # Pydantic data contracts (every cross-thread payload)
├── capture/     # CameraThread, AudioThread, DeviceManager (hot-swap)
├── emotion/     # MediaPipe FaceDetector + HSEmotion ONNX session
├── stt/         # Streaming Silero VAD, mlx-whisper, optional cloud adapters, router
├── fusion/      # Utterance × EmotionFrame ring buffer → EmotionSegment
├── design/      # 8 preset library, V/A interpolation, LCH intensity scaling
├── ui/          # MainWindow, ControlPanel, SubtitleOverlay, HUD
├── presets/     # design_presets.yaml — the YAML layer of the design system
└── assets/      # Lazy-downloaded ONNX/tflite models + fonts

Design Principles

  1. Stability over accuracy. A subtitle whose form mutates mid-breath is unreadable. The cost of a wrong-but-stable design is far smaller than the cost of a flickering one.
  2. Time is monotonic. Every ts_ns is time.monotonic_ns(). Wall clock is forbidden because device hot-swap, daylight savings, and NTP corrections silently corrupt subtitle ordering otherwise.
  3. Final, not partial. STT speaks once per VAD segment. Partial transcriptions would force the design to redecide mid-utterance — exactly what principle 1 forbids.
  4. Queues for data, signals for UI. queue.Queue carries cross-thread payloads with backpressure; pyqtSignal carries UI notifications and is allowed to drop. The two are never aliased.
  5. No persistent identity. A track ID never inherits an emotion. When the face track breaks, the new track starts emotion-blank. The system measures the utterance, not the person.

Ethics

The face-emotion model is a hypothesis, not the truth. emotype is a tool for expression, not diagnosis.

  • Models are not facts. HSEmotion is biased toward Western expressions and AffectNet's collection conditions. Korean faces, masks, and non-prototypical expressions degrade the signal sharply.
  • No persistence. No camera frame or audio buffer is written to disk by emotype. Cloud STT calls are governed by the provider's policy and the user must give consent before sending audio out of the device.
  • No diagnosis. Don't use this for hiring, medical screening, or any decision-bearing context.
  • No identity labels. A track ID never inherits an emotion. emotype dresses utterances, not people.

If you build on emotype, please keep these constraints visible to your end users.


Status

emotype is alpha — it works on the developer's machine, the pipeline is end-to-end, and the eight design presets render. Expect rough edges around device hot-swap (occasionally drops frames during a swap), first-run UX (the model download is silent on stderr — no GUI progress yet), and Apple Silicon-only constraints (mlx-whisper does not run on Intel; cloud STT works on Intel but the local provider does not).


License

MIT — see LICENSE. Bundled-asset licenses are honoured separately:

| Asset | License | Source |
| --- | --- | --- |
| Pretendard | SIL OFL 1.1 | orioncactus/pretendard |
| Black Han Sans, Jua, Nanum Myeongjo, Gowun Dodum, Limelight, Audiowide, Fascinate, Outfit, Lora, Inter, Space Mono, Gelasio | SIL OFL 1.1 | Google Fonts |
| HSEmotion ENet-B0 ONNX | Apache 2.0 | HSE-asavchenko/face-emotion-recognition |
| MediaPipe BlazeFace short-range | Apache 2.0 | Google MediaPipe |
| Silero VAD | MIT | snakers4/silero-vad |
| Apple MLX Whisper | MIT | ml-explore/mlx-examples |

Acknowledgements

  • HSEmotion for the ENet-B0 8 V/A MTL ONNX model.
  • MediaPipe for BlazeFace short-range.
  • silero-vad for the streaming VAD that makes utterance segmentation feel instant.
  • Apple MLX for making on-device Whisper actually fast on Mac.
  • Pretendard for the Korean variable typeface that holds 8 emotional registers without breaking a sweat.
  • nexu-io/open-design for the design-system seeds the 8 presets are derived from.


Built with PyQt5, Apple MLX, MediaPipe, ONNX Runtime, and Silero VAD.
