Working vault for a GSoC 2026 project to fine-tune Whisper on Greek municipal council speech, with LLM post-correction. Holds dataset exploration notes, decisions, specs, and the local review-UI prototype.
No training yet — dataset exploration comes first.
Start here:
- Current state
- Progress vs GSoC plan
- Agent instructions (AGENTS.md is a symlink to the same file)
- Project map
- Roadmap
- Decisions index
- Meetings index
- Logs index
- Exploration UI spec
- Local data model
- OpenCouncil meeting JSON notes
- GSoC proposal
- UI prototype
- CURRENT.md is the first file to read and should stay short.
- docs/decisions/: accepted decisions and open questions, split by theme. See decisions index.
- docs/progress.md: where we are against the GSoC plan. The plan itself is in the proposal.
- docs/roadmap.md: phases and current direction.
- docs/meetings/: normalized meeting notes.
- docs/specs/: product and implementation specs.
- docs/logs/: weekly digests only. See logs index for cadence rules.
- docs/reference/: stable technical references.
- archive/: superseded material, local only (gitignored).
- Data outputs live under data/; scripts live under scripts/.
- CLAUDE.md is the single source of truth for assistant instructions; AGENTS.md is a symlink to it for tools that read that filename.
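If the AGENTS.md symlink ever gets replaced by a regular copy, the two files can drift apart. A minimal sketch of a check, assuming both files sit in the same directory; the helper name `points_to` is hypothetical and not part of this repo, and the demo runs in a throwaway directory rather than the vault itself:

```python
import os
import tempfile
from pathlib import Path


def points_to(link: Path, target: Path) -> bool:
    """Return True when `link` is a symlink that resolves to the same file as `target`."""
    return link.is_symlink() and link.resolve() == target.resolve()


# Demo in a temporary directory (file contents here are placeholders).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "CLAUDE.md").write_text("assistant instructions\n")
    # Relative target, so the link keeps working if the directory moves.
    os.symlink("CLAUDE.md", root / "AGENTS.md")
    ok = points_to(root / "AGENTS.md", root / "CLAUDE.md")
    not_link = points_to(root / "CLAUDE.md", root / "CLAUDE.md")

print(ok, not_link)
```

Creating symlinks may need extra privileges on Windows; the check itself only reads the filesystem.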
- Full May 12 export: utterance-edits-may12-26.csv
- Historical sample (archived): archive/old-data/corrections-sample.csv
- Clean CSV: data/clean/corrections_clean.csv
- Rejected rows: data/reports/corrections_rejected.csv
- Summary JSON: data/reports/corrections_summary.json
Regenerate the cleaned data:

    rtk python3 scripts/preprocess_corrections.py

The SvelteKit correction-review prototype lives in ui/. It ingests the May 12 CSV into local SQLite and supports review labels, timestamp adjustments, stats, and included-row export. Meeting JSON matching (utterance IDs, speaker context, surrounding transcript) is the next gap to close.
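The ingest-label-export flow described above can be illustrated roughly as follows. This is a sketch under stated assumptions, not the prototype's actual code: the real UI is SvelteKit, and the table name `corrections`, the `review_label` column, the label value `include`, and the sample columns (`id`, `original`, `corrected`) are all hypothetical. Columns are created from whatever header the CSV carries, so no real schema is assumed.

```python
import csv
import io
import sqlite3


def ingest_csv(conn, csv_file, table="corrections"):
    """Load CSV rows into a SQLite table and add a nullable review_label column.

    Assumes the CSV header does not already contain a review_label column.
    """
    reader = csv.reader(csv_file)
    header = next(reader)
    # Quote identifiers so arbitrary header names can be used as column names.
    quoted = ['"{}"'.format(h.replace('"', '""')) for h in header]
    col_defs = ", ".join(q + " TEXT" for q in quoted)
    conn.execute(
        f'CREATE TABLE IF NOT EXISTS "{table}" ({col_defs}, "review_label" TEXT)'
    )
    placeholders = ", ".join("?" for _ in header)
    # Skip rows whose field count does not match the header.
    rows = [row for row in reader if len(row) == len(header)]
    conn.executemany(
        f'INSERT INTO "{table}" ({", ".join(quoted)}) VALUES ({placeholders})', rows
    )
    conn.commit()
    return len(rows)


def export_included(conn, table="corrections", label="include"):
    """Return only the rows a reviewer marked with the given label."""
    cur = conn.execute(f'SELECT * FROM "{table}" WHERE review_label = ?', (label,))
    return cur.fetchall()


# Hypothetical sample data; the real export's columns are not assumed here.
sample = io.StringIO("id,original,corrected\n1,καλημερα,καλημέρα\n2,δημος,δήμος\n")
conn = sqlite3.connect(":memory:")
count = ingest_csv(conn, sample)
conn.execute("UPDATE corrections SET review_label = ? WHERE id = ?", ("include", "1"))
included = export_included(conn)
print(count, included)
```

Storing every field as TEXT keeps the ingest lossless; typed columns such as timestamps would be a later refinement once the export's schema is pinned down.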