Tags: pretyflaco/millet
v0.12.7: single-source path keeps diarized in-room speakers
v0.12.6 routed in-room recordings to the mono path (correctly diarizing the in-room speakers), but the mono path then remapped diarized speakers onto YOU/REMOTE by channel energy. On dual-mono audio every speaker is equally mic-dominant, so that remap collapsed the genuine speakers back into one.
Skip the channel-energy YOU/REMOTE relabeling (and channel correction) when the recording was detected single-source, keeping the pyannote diarization result so voiceprint naming can label each in-room speaker.
v0.12.6: fix in-room multi-speaker collapse in dual-diarize path
The default dual-diarize path assumes the mic (left) channel carries a single local speaker (labeled YOU) and only diarizes the system (right) channel. For an in-room recording (several people sharing one mic, system channel silent or a duplicate of the mic) every mic speaker collapsed into one.
Now detect single-source stereo and fall back to the mono path (mix down + diarize the combined signal), which splits the in-room speakers. Genuine remote calls (active, distinct system channel) keep using dual-diarize.
- _is_single_source_stereo: True when the system channel's active-sample RMS is below system_inactive_rms_ratio (0.10) of the mic's, OR the channels' Pearson correlation is >= channel_duplicate_corr (0.98). Conservative on analysis failure (keeps dual-diarize).
- _load_stereo_int16: ffmpeg-based stereo decode (wav/ogg).
- CLI --single-source-fallback/--no-single-source-fallback (default on); TranscriptionConfig.single_source_fallback + the two thresholds.
- Tests: silent/duplicate/decorrelated detection + dispatch fallback + the no-regression guard for real remote calls.
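The detection rule above can be sketched roughly as follows. This is a minimal stdlib-only illustration, not the project's code: the two thresholds come from the note, while the function names, the ACTIVE_FLOOR cutoff for "active" samples, and the plain-list sample format are assumptions.

```python
import math

SYSTEM_INACTIVE_RMS_RATIO = 0.10  # threshold quoted in the release note
CHANNEL_DUPLICATE_CORR = 0.98     # threshold quoted in the release note
ACTIVE_FLOOR = 100                # hypothetical int16 amplitude floor for "active" samples


def active_rms(samples, floor=ACTIVE_FLOOR):
    """RMS over samples whose magnitude reaches the activity floor."""
    active = [s for s in samples if abs(s) >= floor]
    if not active:
        return 0.0
    return math.sqrt(sum(s * s for s in active) / len(active))


def pearson(xs, ys):
    """Pearson correlation of two sample sequences (0.0 if undefined)."""
    n = min(len(xs), len(ys))
    if n < 2:
        return 0.0
    xs, ys = xs[:n], ys[:n]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def is_single_source_stereo(mic, system):
    """True when the system channel is near-silent or duplicates the mic."""
    try:
        mic_rms = active_rms(mic)
        if mic_rms == 0.0:
            return False  # nothing to compare against: stay on dual-diarize
        if active_rms(system) < SYSTEM_INACTIVE_RMS_RATIO * mic_rms:
            return True   # system channel effectively silent
        return pearson(mic, system) >= CHANNEL_DUPLICATE_CORR
    except (TypeError, ValueError):
        return False      # conservative on analysis failure
```

Returning False on any failure mirrors the note's "conservative" behavior: a wrong fallback to mono would hurt genuine remote calls more than a missed in-room detection.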
v0.12.5: title-aware schedule matching + sync collision guard
detect_meeting_type now considers the session title: a titled session only auto-matches a schedule whose name/folder slug equals the title slug, otherwise it returns None so the caller files it under its own folder. This stops an ad-hoc meeting recorded inside a schedule window (e.g. a "post-scrum" at 09:03 inside the 06:30-09:30 standup window) from being misfiled as the scheduled meeting. Untitled sessions keep the prior pure time-window behavior.
sync_session writes a local-only .session-id marker into each synced folder and disambiguates (<folder>-<sessionid-suffix>) instead of overwriting when an existing folder belongs to a different session. The marker is registered in the clone's .git/info/exclude so it is never committed/pushed and never trips the uncommitted-changes guard.
Pairs with vezir v0.7.16 (title injection + sync-as override).
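A minimal sketch of the title-aware rule, using the post-scrum example from the note. Schedule, slugify and match_schedule are hypothetical names, and it is an assumption here that a titled session must still fall inside the schedule window.

```python
import re
from dataclasses import dataclass
from datetime import time


@dataclass
class Schedule:
    name: str
    start: time
    end: time


def slugify(text):
    """Lowercase and collapse non-alphanumerics to single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def match_schedule(started_at, schedules, title=None):
    """Return the matching schedule name, or None to file the session on its own."""
    in_window = [s for s in schedules if s.start <= started_at <= s.end]
    if not title:
        # Untitled sessions: pure time-window behavior.
        return in_window[0].name if in_window else None
    wanted = slugify(title)
    for schedule in in_window:
        if slugify(schedule.name) == wanted:
            return schedule.name
    return None  # titled but no slug match: caller uses the session's own folder
```

With a standup window of 06:30-09:30, a session titled "post-scrum" at 09:03 returns None, while an untitled session at the same time still resolves to the standup.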
v0.12.4: robust language detection + sync exit-code
Language: whisperx detects from only the first ~30s of each channel, so a misleading opener (e.g. an opening 'Gracias') mislabeled an English meeting as Spanish even after the dominant-channel fix.
- Multi-window detection: sample N windows across each channel via faster-whisper's detect_language(language_detection_segments=N) instead of the first-30s guess (whisperx backend; --language-detection-segments, default 6).
- Soft default-language bias: --default-language <lang> keeps the team default unless a channel confidently detects another language (>= default_language_override_confidence, default 0.70). Fed into the dominant-channel selection.
Sync: cli/sync.py now raises SystemExit(1) when any session fails (e.g. git push rejected) instead of exiting 0, so callers no longer rely on scraping the log to notice a failed sync.
Tests: +default-language bias, +CLI sync exit-code. Full suite 295 pass, 7 pre-existing env-only failures.
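The soft default-language bias reduces to a small decision rule. A sketch, assuming a per-channel (language, confidence) detection result; pick_language is a hypothetical name and only the 0.70 default is taken from the note.

```python
def pick_language(detected, confidence, default_language=None,
                  override_confidence=0.70):
    """Keep the team default unless another language is detected confidently."""
    if default_language is None or detected == default_language:
        return detected
    if confidence >= override_confidence:
        return detected          # confident detection overrides the default
    return default_language      # weak detection: fall back to the default
```

A 0.55-confidence Spanish guess on an English-default team stays "en"; a 0.92-confidence one switches to "es".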
v0.12.3: summary language from dominant channel + per-language summaries
In the dual-channel paths the transcript/summary language was taken from the mic channel only. A local speaker's minority-language asides (e.g. a few Portuguese phrases) made the whole summary that language even when the meeting was mostly English on the system channel.
- Summary/transcript language now follows the channel with the most speech (_dominant_channel_language); mic wins exact ties.
- Each channel is word-aligned with its OWN detected language (_align_channel) instead of sharing the mic's language model.
- apply_labels gains summary_language: regenerate the summary in a chosen language and save it as an ADDITIONAL <base>.summary.<lang>.md (with suffixed meta/frontmatter sidecars), preserving the primary auto-detected summary. MeetingSummary.save gains lang_suffix.
- sync: <base>.summary.<lang>.md syncs as a distinct summary.<lang>.md; .frontmatter.json is excluded (also fixes a latent collision where the frontmatter sidecar could be pushed as transcript.json).
Tests: +8 (dominant-language selection, additional-language save/override). Full suite 285 pass, 7 pre-existing env-only failures.
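The dominant-channel selection with its tie rule can be illustrated as below. The function name and signature are hypothetical, and measuring "most speech" in seconds is an assumption; the note does not say which unit is used.

```python
def dominant_channel_language(mic_lang, mic_speech_seconds,
                              system_lang, system_speech_seconds):
    """Language of the channel with the most speech; mic wins exact ties."""
    if system_speech_seconds > mic_speech_seconds:
        return system_lang
    return mic_lang
```

A meeting with 40s of Portuguese asides on the mic and 900s of English on the system channel resolves to "en".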
v0.12.2: suppress phantom remote speakers in dual-diarize
pyannote can over-segment a single remote stream into multiple clusters (e.g. peeling short backchannel "yeah/cool/awesome" off the main speaker into a phantom), which voiceprint matching then mis-names from a weak, barely-over-threshold match.
- Voiceprint auto-apply gate: a match at/above MATCH_THRESHOLD is applied only if it has enough embeddable speech AND is unambiguous (strong absolute confidence OR a clear margin over the runner-up profile). SpeakerMatch gains evidence_seconds + margin; identify_speakers computes the per-cluster margin. Weak/ambiguous matches stay raw and route to needs_labeling instead of confidently mislabeling (e.g. the observed 0.69/0.13-margin false positive). Sidecar records only applied matches.
- Remote-cluster consolidation (dual-diarize): merge same-speaker clusters (voiceprint cosine >= cluster_merge_similarity) and absorb thin clusters (< cluster_min_speech_seconds embeddable) into the dominant remote; attach trivial unassigned segments to the nearest remote so a 0.4s one-liner no longer surfaces as a generic REMOTE. Behind --no-consolidate-remote-clusters.
Validated on a real 2-speaker session (4 speakers -> 2 + 1 raw, no false name) and a 13-speaker session (no legit speaker suppressed).
Tests: +18 (consolidation merge/absorb/no-over-merge/orphan/config + gate policy). Full suite 277 pass, 7 pre-existing env-only failures.
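The auto-apply gate is a conjunction of three checks. The sketch below is illustrative only: the SpeakerMatch fields follow the note, but every numeric threshold is a hypothetical placeholder (the note names MATCH_THRESHOLD without giving its value), chosen so that the observed 0.69 / 0.13-margin case is rejected.

```python
from dataclasses import dataclass

# All four values are hypothetical placeholders, not the project's settings.
MATCH_THRESHOLD = 0.65
MIN_EVIDENCE_SECONDS = 5.0
STRONG_CONFIDENCE = 0.80
MIN_MARGIN = 0.15


@dataclass
class SpeakerMatch:
    name: str
    confidence: float        # similarity to the best-matching profile
    evidence_seconds: float  # embeddable speech behind the match
    margin: float            # lead over the runner-up profile


def should_auto_apply(match):
    """Apply a voiceprint name only when the match is well-supported and unambiguous."""
    if match.confidence < MATCH_THRESHOLD:
        return False
    if match.evidence_seconds < MIN_EVIDENCE_SECONDS:
        return False
    return match.confidence >= STRONG_CONFIDENCE or match.margin >= MIN_MARGIN
```

Matches that fail the gate would stay as raw speaker ids and go to needs_labeling rather than being named.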
v0.12.1: fix label --auto discarding matches in non-interactive runs
label --auto auto-applied confident voiceprint matches, then prompted interactively for unrecognized speakers. In the vezir worker (no TTY) click.prompt hit EOF -> Abort, discarding ALL matches before they were written. Meetings with fully-recognizable speakers were left stuck in needs_labeling with raw SPEAKER_N ids.
Now: when stdin is not a TTY, skip prompting -- apply auto-matches, leave unmatched speakers as raw ids.
Also adds a *.autoid.json sidecar (name + confidence per speaker, keyed by final transcript id) so vezir's labeling screen can pre-fill recognized names and show confidence. Excluded from sync + transcript resolution.
3 new tests.
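The non-interactive guard amounts to checking isatty() before prompting. A stdlib sketch under assumptions: resolve_labels is a hypothetical helper, and the built-in input stands in for click.prompt.

```python
import sys


def resolve_labels(auto_matches, speaker_ids, prompt=input, stdin=sys.stdin):
    """Apply auto-matches; prompt for the rest only when stdin is a TTY."""
    labels = dict(auto_matches)
    interactive = stdin.isatty()
    for speaker_id in speaker_ids:
        if speaker_id in labels:
            continue
        if interactive:
            answer = prompt(f"Name for {speaker_id}: ").strip()
            labels[speaker_id] = answer or speaker_id
        else:
            # No TTY (e.g. a worker process): keep the raw id, never prompt.
            labels[speaker_id] = speaker_id
    return labels
```

Because the prompt is never reached without a TTY, an EOF can no longer abort the run and discard the matches already found.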
v0.12.0: dual-diarize default, per-channel ASR + remote speaker diarization
New default mixdown for stereo: dual-diarize. Transcribes mic and system
channels separately (Kemal = continuous YOU from mic, immune to overlap),
then runs pyannote diarization on the system channel only to split distinct
remote speakers (Openoms/Jonas/Max/...). Overlapping segments preserved.
Eliminates the overlap-fragmentation bug where mono+diarization flickered
words between speakers during talk-over ('This year' -> Openoms, 'they' ->
Kemal, 'rented the' -> Openoms, 'whole island' -> Kemal — Kemal said the
entire sentence).
Also includes:
- Channel-energy correction (mono path, --channel-correct): per-segment/word
RMS reassignment for turn-boundary leaks; on by default for --mixdown mono.
--channel-correct-margin (default 0.30) for tuning.
- DNS-retry hardening for millet sync git operations (clone/pull/push):
transient DNS failures auto-retry 5x with backoff.
- 11 new channel-correction tests; default-mixdown test updated.
Validated on DEVSTANDUP (5spk), LUKAS_2 (2spk), AB_BOARD (4spk .ogg):
overlap-fragmentation eliminated, all distinct remote speakers preserved.
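The DNS-retry hardening for git operations could look roughly like this. It is a sketch under assumptions: run_git_with_retry and the stderr markers are hypothetical, "5x" is read as five attempts in total, and exponential backoff is one possible schedule (the note only says "with backoff").

```python
import subprocess
import time

# Hypothetical stderr fragments treated as transient DNS failures.
DNS_ERROR_MARKERS = (
    "Could not resolve host",
    "Temporary failure in name resolution",
)


def run_git_with_retry(args, attempts=5, base_delay=1.0,
                       run=subprocess.run, sleep=time.sleep):
    """Run a git command, retrying only on transient DNS errors."""
    for attempt in range(1, attempts + 1):
        result = run(["git", *args], capture_output=True, text=True)
        if result.returncode == 0:
            return result
        stderr = result.stderr or ""
        transient = any(marker in stderr for marker in DNS_ERROR_MARKERS)
        if not transient or attempt == attempts:
            raise RuntimeError(f"git {' '.join(args)} failed: {stderr.strip()}")
        sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
```

Non-DNS failures such as a rejected push are raised immediately, since retrying them would not help.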
v0.11.0: opt-in Parakeet ASR backend (onnx-asr, English, CUDA)
Add a third ASR backend alongside whisperx and mlx: NVIDIA Parakeet TDT via onnx-asr (ONNX Runtime, pure-Python, no extra torch/transformers). Opt-in via --asr-backend parakeet; auto selection unchanged.
- millet/parakeet.py: backend + Silero VAD chunking for long audio (Parakeet's ~20-30s per-utterance limit), WhisperX-shaped output contract, cuDNN/cuBLAS ctypes preload so onnxruntime-gpu finds the torch-bundled CUDA libs, HF-cache completeness check.
- transcribe.py: parakeet backend validation, _transcribe_asr dispatch, config B (native timestamps, default) / C (--parakeet-keep-alignment) alignment toggle.
- cli: --asr-backend parakeet, --parakeet-model, --parakeet-keep-alignment; millet download parakeet (explicit, lazy model fetch).
- [parakeet] optional extra (onnx-asr[hub]); scripts/bench_asr.py harness + benchmark results doc.
- tests/test_parakeet.py: 12 tests (contract, B/C wiring, validation, dispatch, availability guard).
Benchmark note: on a 3090, whisperx is faster than Parakeet; Parakeet's value is finer segmentation, not speed. Stays opt-in pending further validation.
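One way to keep utterances under a per-utterance limit is to pack VAD speech segments greedily into bounded chunks. The sketch below is an assumption about how such chunking might work, not a description of millet/parakeet.py; pack_segments and the 20-second cap are hypothetical, and the (start, end) pairs stand in for VAD output.

```python
def pack_segments(segments, max_chunk_seconds=20.0):
    """Greedily merge (start, end) speech segments into chunks <= the cap."""
    chunks = []
    for start, end in segments:
        # Hard-split any single segment that is longer than the cap.
        while end - start > max_chunk_seconds:
            chunks.append((start, start + max_chunk_seconds))
            start += max_chunk_seconds
        if chunks and end - chunks[-1][0] <= max_chunk_seconds:
            # Extend the previous chunk while it still fits under the cap.
            chunks[-1] = (chunks[-1][0], end)
        else:
            chunks.append((start, end))
    return chunks
```

Merging short neighbors keeps the number of ASR calls down, while the hard split guarantees no chunk exceeds the cap even without a pause in speech.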