A computational linguistics tool to identify "forgotten" Romanian words - terms that exist in official dictionaries but have fallen out of modern usage.
Status: 📚 Definitions + 🔍 Phase 3 + 🌐 Web UI (PHP) — shortlist generated, definitions complete, web validation next
- Generates a list of the least used or forgotten words from Romanian dictionaries
- Compares official dictionaries (including archaisms) against usage frequency data
- Identifies linguistic "dark matter" - words that exist in dictionaries but have fallen out of active use
- Produces curated lists with rarity scores and linguistic metadata
See exploratory UI prototype: lab.gov2.ro/oțios
See also: initial specs / live (google doc)
flowchart TD
DEX[("DEX Online dump\n1.2 GB MySQL")]
subgraph P1["Phase 1 · Dictionary Analysis"]
A["create_sample_db.py"] --> B[("dex-sample.sql")]
B --> C["extract_lexemes.py"] --> D[("lexemes.db\n315k lexemes")]
D --> E["analyze_forgotten_words.py"] --> F[("forgotten_words_v1.csv")]
F --> G["create_curated_list.py"] --> H[("forgotten_words_curated.csv\n~140k candidates")]
D --> TAX["extract_taxonomy.py\n(run once)"] --> D
end
subgraph P2["Phase 2 · Corpus Validation"]
subgraph P2A["2a · wordfreq — fast, rough"]
WF["validate_with_wordfreq.py"] --> WF_OUT[("validated_wordfreq.csv\n1,868 rows")]
end
subgraph P2B["2b · Diachronic — recommended ✅"]
WS["process_wikisource.py\n(historical · 14M tokens)"]
CX["process_culturax.py\n(modern web · 17B tokens)"]
WS & CX --> CORP[("corpus_frequencies.db")]
CORP --> DIA["validate_diachronic.py"] --> DIA_OUT[("forgotten_words_diachronic.csv\n130k rows · 4 taxonomy cols")]
end
subgraph P2C["2c · Legacy Wikipedia — ⚠️ P0 bug"]
DL["download_wikipedia_ro.py"] --> PC["process_corpus.py"] --> VFW["validate_forgotten_words.py"] --> LEG[("validated.csv")]
end
end
subgraph P25["Phase 2.5 · Shortlist"]
SL["make_shortlist.py"] --> SL_OUT[("forgotten_words_shortlist.csv\n23,112 rows · 3 tiers")]
end
subgraph P3["Phase 3 · Web Validation"]
SW["search_wild.py\n--provider ddg | google"] --> WEB_OUT[("forgotten_words_web_validated.csv\nweb_score · last_seen_approx")]
end
DEX --> A
H --> WF
H --> WS & CX
H --> DL
D -.->|"taxonomy tags\n(dex_pos, register,\ndomain, etymology)"| DIA
DIA_OUT --> SL
WF_OUT --> SL
SL_OUT --> SW
LEG --> SW
Phase 2 paths are alternatives — run 2a for a quick pass, 2b for the recommended diachronic analysis (historical vs modern corpora), or 2c only if reproducing earlier results (it has a known P0 bug).
make_shortlist.py (Phase 2.5) filters the 130k diachronic rows down to the ~23k most defensible forgotten words before web validation.
# Activate virtual environment (adjust path to your venv)
source ~/devbox/envs/240826/bin/activate
# Install all dependencies
pip install -r requirements.txt
# 1. Create sample database (reduces 1.2GB to 285MB)
python create_sample_db.py
# 2. Extract lexeme data (creates CSV + SQLite database)
python extract_lexemes.py
# 3. Generate analysis and statistics
python analyze_forgotten_words.py
# 4. Create final curated list
python create_curated_list.py

Output: forgotten_words_curated.csv (~140k candidates)

python validate_with_wordfreq.py

Outputs two files:
- forgotten_words_validated_wordfreq.csv: words with Zipf < 3.0 (tier=forgotten)
- rare_words_wordfreq.csv: words with Zipf 3.0–4.5 and a non-empty dex_register tag (tier=rare_in_use)
The tier column supersedes is_forgotten (kept for backward compatibility). Note: wordfreq's Romanian coverage is binary (0.000 or ≥ 3.0), so treat this as a rough first pass. The rare_in_use gate on dex_register prevents modern unmarked vocabulary (neurologie, cowboy…) from polluting the rare list.
Extracts Tag, ObjectTag, and EntryLexeme tables from the DEX dump into lexemes.db, enabling register/domain/etymology/POS columns in the diachronic output.
# Sample dump (fast, ~5% coverage)
python extract_taxonomy.py
# Full dump (recommended — ~990k ObjectTag rows, full coverage)
python extract_taxonomy.py --sql data/dictionaries/dex-database.sql

Uses Wikisource RO (historical literary baseline) and CulturaX RO (modern web) to compute actual per-corpus frequencies. Designed to find words that appear in 19th-century literature but are absent from modern text.
# Wikisource — test run (500 docs, ~10s)
python process_wikisource.py --test
# Wikisource — full run (best on a VPS)
mkdir -p data/logs
nohup python process_wikisource.py --resume >> data/logs/wikisource.log 2>&1 &
echo $! > data/logs/wikisource.pid
# CulturaX — full run (64 parquet shards, ~40M docs; auto-restarts on network errors)
# Interactive (watch it run):
while true; do
python -u process_culturax.py --resume
[ $? -eq 0 ] && break
echo "[$(date)] restarting in 15s..." && sleep 15
done
# Background (logs to file):
VENV=~/g2-dev/monitorulpreturilor/venv/bin/python
mkdir -p data/logs
nohup bash -c "while true; do $VENV -u process_culturax.py --resume; [ \$? -eq 0 ] && break; echo \"[\$(date)] restarting in 15s...\"; sleep 15; done" \
>> data/logs/culturax.log 2>&1 &
echo $! > data/logs/culturax.pid

Output: corpus_frequencies.db with corpus_name = 'wikisource_ro' and corpus_name = 'culturax_ro'.
Note: process_culturax.py reads the 64 parquet shards directly via HfFileSystem + pyarrow and checkpoints at file + row-group level. This avoids the datasets streaming ds.skip() cycling bug that triggers when the checkpoint offset exceeds the dataset size.
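The file + row-group checkpointing described in the note can be sketched as follows. This is a minimal illustration, not the code in process_culturax.py: the function name `pending_units` and the checkpoint keys `shard` and `row_group` are hypothetical. In a real run the per-shard row-group counts would come from `pyarrow.parquet.ParquetFile(...).num_row_groups`, and each yielded pair would be loaded with `read_row_group`.

```python
from typing import Dict, Iterator, List, Tuple


def pending_units(
    row_groups_per_shard: List[int],
    checkpoint: Dict[str, int],
) -> Iterator[Tuple[int, int]]:
    """Yield (shard_index, row_group_index) pairs that still need processing.

    `checkpoint` stores the position of the NEXT unit to process, e.g.
    {"shard": 1, "row_group": 2}. The key names are assumptions.
    """
    start_shard = checkpoint.get("shard", 0)
    start_group = checkpoint.get("row_group", 0)
    for shard in range(start_shard, len(row_groups_per_shard)):
        # Only the shard we resume in starts part-way through.
        first = start_group if shard == start_shard else 0
        for group in range(first, row_groups_per_shard[shard]):
            yield shard, group


# Three shards with 2, 3 and 1 row groups; resume at shard 1, row group 2.
todo = list(pending_units([2, 3, 1], {"shard": 1, "row_group": 2}))
print(todo)
```

Because the checkpoint is an explicit position rather than a count of skipped rows, a checkpoint past the last shard simply yields nothing instead of wrapping around.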
# Compare historical vs modern frequencies, add taxonomy columns
python validate_diachronic.py
# Output: forgotten_words_diachronic.csv (130k rows)
# Filter down to the most defensible forgotten words
python make_shortlist.py --stats # preview counts by tier
python make_shortlist.py # write forgotten_words_shortlist.csv (~23k rows)

The DEX MySQL dump's DefinitionSimple table only covers ~4.6k of the 17.4k shortlist words. scrape_definitions.py fills the remaining gaps by extracting the synthesis (definition) from dexonline.ro for each missing word.
# Smoke test (5 words, no HTTP)
python scrape_definitions.py --dry-run --limit 5
# Small live run (test the scraper)
python scrape_definitions.py --limit 20 --delay 3.0
# Full run (all missing words, ~5–7 hours at 3s/request)
python scrape_definitions.py --delay 3.0 --merge
# Resume an interrupted run
python scrape_definitions.py --delay 3.0 --merge # automatically skips already-scraped
# Just upsert checkpoint into DB (if scraping completed but merge wasn't run)
python scrape_definitions.py --merge-only

Output: data/processed/scraped_definitions.csv (checkpoint with columns: word, definition, source_url, scraped_at, status). With --merge, all status=ok rows are upserted into definitions.db immediately. Resume is safe: each row is flushed immediately, and Ctrl+C stops cleanly.
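The resume behaviour can be pictured with a small sketch that reads the checkpoint CSV and drops words that are already done. The helper names are hypothetical, the sample rows use placeholder values, and treating only status=ok rows as done (so that failed rows are retried) is an assumption rather than confirmed behaviour of scrape_definitions.py.

```python
import csv
import io
from typing import Iterable, List, Set


def already_scraped(checkpoint_lines: Iterable[str]) -> Set[str]:
    """Collect words whose checkpoint row has status == 'ok' (assumed rule)."""
    reader = csv.DictReader(checkpoint_lines)
    return {row["word"] for row in reader if row["status"] == "ok"}


def words_to_scrape(missing: Iterable[str], done: Set[str]) -> List[str]:
    """Keep the original order, skipping words that are already done."""
    return [word for word in missing if word not in done]


# Placeholder checkpoint content using the documented column names.
# The "error" status value is hypothetical.
sample = io.StringIO(
    "word,definition,source_url,scraped_at,status\n"
    "alenă,placeholder definition,placeholder-url,2024-01-01T00:00:00,ok\n"
    "vece,,placeholder-url,2024-01-01T00:00:03,error\n"
)
done = already_scraped(sample)
print(words_to_scrape(["alenă", "vece", "pripoană"], done))
```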
# Dry run first
python search_wild.py --input data/processed/forgotten_words_shortlist.csv \
--provider ddg --limit 5 --dry-run
# DDG triage (no API key, good for first pass)
python search_wild.py --input data/processed/forgotten_words_shortlist.csv \
--provider ddg --limit 200 --delay 2
# Google CSE (cleaner results, needs env vars, 100/day free tier)
export GOOGLE_API_KEY="AIza..."
export GOOGLE_CSE_ID="017576..."
python search_wild.py --input data/processed/forgotten_words_shortlist.csv \
--provider google --limit 100

flowchart LR
subgraph JOBS["Long-running corpus jobs"]
J1["process_wikisource.py"]
J2["process_culturax.py"]
end
subgraph LOGS["data/logs/"]
L1["wikisource.log / .pid"]
L2["culturax.log / .pid"]
L3["run_history.jsonl"]
L4["health_status.json"]
end
subgraph MON["Monitoring scripts"]
ST["status.py\n(read-only · any time)"]
HC["health_check.py\n(cron · every 30 min)"]
AU["audit.py\n(cron · daily 02:00)"]
end
subgraph ALERT["Alert channels"]
AW["webhook\nOTZIOS_ALERT_URL"]
AE["email\nOTZIOS_ALERT_EMAIL"]
end
JOBS --> LOGS
LOGS --> ST & HC & AU
HC & AU --> AW & AE
health_check.py, audit.py, and status.py keep tabs on long-running corpus jobs. Run them manually or via cron (see CLAUDE.md for crontab lines).
python status.py # at-a-glance summary — corpora, artifacts, loops, audit
python health_check.py # check liveness, stalls, log errors, completion
python audit.py # snapshot run history + DB quality checks
python health_check.py --dry-run # print without alerting or writing state

status.py is read-only and safe to run any time. health_check.py and audit.py write logs and may alert.
Set OTZIOS_ALERT_URL (webhook) or OTZIOS_ALERT_EMAIL to receive push alerts.
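As a rough illustration of what a liveness and stall check over the .pid and .log files could look like, here is a minimal POSIX-only sketch. It is not the code in health_check.py: the function names and the 30-minute stall threshold are assumptions.

```python
import os
import tempfile
import time


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 performs the existence check only
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def log_is_stalled(log_path: str, max_age_seconds: float) -> bool:
    """Return True if the log file has not been modified recently."""
    age = time.time() - os.path.getmtime(log_path)
    return age > max_age_seconds


# Demo against this very process and a freshly created temp file.
with tempfile.NamedTemporaryFile() as tmp:
    print(pid_is_alive(os.getpid()), log_is_stalled(tmp.name, 1800))
```

In practice the PID would be read from data/logs/wikisource.pid or data/logs/culturax.pid, and the log path would be the matching .log file.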
Apostrophes in the word column: DEX Online encodes syllable stress using apostrophes (e.g. bucl'e, băt'ârn). The apostrophe-marked forms are not valid Romanian spellings; the clean form is in word_no_accent. The validated output from validate_with_wordfreq.py uses word_no_accent for all lookups and moves the raw word column to the end of the CSV for reference.
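Deriving the clean form from the raw DEX form is a one-line transformation. The sketch below assumes the stress mark is always the plain ASCII apostrophe, as in the examples above; the function name is hypothetical.

```python
def strip_stress_marks(word: str) -> str:
    """Remove DEX stress apostrophes to get the word_no_accent form."""
    # Assumption: stress is marked only with the ASCII apostrophe (').
    return word.replace("'", "")


for raw in ["bucl'e", "băt'ârn"]:
    print(raw, "->", strip_stress_marks(raw))
```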
All generated files live under data/processed/. Columns shared across files have the same meaning everywhere.
How the files relate:
forgotten_words_curated.csv: 140k dictionary suspects (no corpus signal)
↓ validate_diachronic.py
forgotten_words_diachronic.csv: 130k rows with corpus frequencies + taxonomy
↓ make_shortlist.py
forgotten_words_shortlist.csv: ~23k most defensible forgotten words (3 tiers)
↓ search_wild.py
forgotten_words_web_validated.csv: shortlist + real-world web presence
| Column | Description |
|---|---|
| word | Word form as it appears in DEX, including stress apostrophes (e.g. bucl'e). Use word_no_accent for lookups. |
| word_no_accent | Clean form with apostrophes removed; the canonical key for all frequency lookups. |
| frequency / dex_frequency | DEX frequency score, 0.0–1.0. Lower = rarer. 0.0 means the field was absent in DEX: treat it as missing data, not "rarest". |
| rarity_category | Bin derived from dex_frequency: very_rare (< 0.30), rare (0.30–0.50), uncommon (0.50–0.60), standard (0.60–1.0). standard means DEX considers the word canonical but corpus evidence may disagree. |
| description | Part-of-speech and register abbreviation from DEX (e.g. s.n. = neuter noun, adj. = adjective, înv. = archaic). |
| model_type | DEX inflection model code (e.g. I, A1). Identifies the paradigm used for conjugation/declension. |
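The rarity_category bins can be written as a small lookup. This is a sketch under two assumptions that the table does not settle: each bin includes its lower bound and excludes its upper bound, and a dex_frequency of 0.0 is returned as None to reflect "missing data". The function name is hypothetical.

```python
from typing import Optional


def rarity_category(dex_frequency: float) -> Optional[str]:
    """Map a DEX frequency score (0.0-1.0) to its rarity bin."""
    if dex_frequency == 0.0:
        return None  # field absent in DEX: missing data, not "rarest"
    if dex_frequency < 0.30:
        return "very_rare"
    if dex_frequency < 0.50:
        return "rare"
    if dex_frequency < 0.60:
        return "uncommon"
    return "standard"


for score in (0.0, 0.25, 0.30, 0.55, 0.94):
    print(score, rarity_category(score))
```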
Every DEX entry with frequency < 1.0 that passes form filters (length, not a proper noun, has a word-class marker). No corpus evidence — these are suspects, not confirmed forgotten words. Currently ~140k rows.
| Column | Description |
|---|---|
| notes | Raw notes from the DEX entry (register markers, usage labels, etc.). |
One row per candidate from forgotten_words_curated.csv, enriched with measured frequencies from both corpora and a verdict. This is the file to use for any downstream analysis — it tells you whether each word is actually missing from modern text, and by how much.
| Column | Description |
|---|---|
| hist_occurrences | Raw occurrence count in the Wikisource RO corpus (historical literary baseline, ~14M tokens). |
| hist_documents | Number of distinct Wikisource documents containing the word. |
| hist_ppm | hist_occurrences normalised to occurrences per million tokens in Wikisource. |
| modern_occurrences | Raw occurrence count in the CulturaX RO corpus (modern web text, ~17B tokens). |
| modern_documents | Number of distinct CulturaX documents containing the word. |
| modern_ppm | modern_occurrences normalised to occurrences per million tokens in CulturaX. |
| log_ratio | log₂((hist_ppm + S) / (modern_ppm + S)) where S = 0.1 per million (Laplace smoothing). Positive = historically skewed; negative = more common today. A value of 1.0 means the word is twice as frequent historically; −1.0 means twice as frequent now. |
| verdict | Categorical summary; see table below. |
| dex_pos | Full part-of-speech label from DEX Tag taxonomy (e.g. substantiv neutru, adjectiv, verb). Pipe-delimited if multiple. Empty until extract_taxonomy.py is run against the full dump. |
| dex_register | Stylistic register tags from DEX (e.g. învechit, popular, dialectal, livresc). Pipe-delimited. A word tagged învechit in DEX is direct editorial evidence of archaism, independent of corpus signal. |
| dex_domain | Subject domain tags (e.g. muzică, chimie, medicină, drept). Pipe-delimited. Useful for filtering out technical jargon. |
| dex_etymology | Etymology/origin tags (e.g. grecism, latinism, anglicism, turcism, slavonism). Pipe-delimited. |
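The ppm normalisation and the log_ratio formula from the table can be checked with a few lines of Python. The function names are hypothetical and the occurrence counts in the demo are made up for illustration; only the corpus sizes (14.3M and 17.0B tokens) and S = 0.1 come from this README.

```python
import math

SMOOTHING_PPM = 0.1  # S in the log_ratio formula


def ppm(occurrences: int, corpus_tokens: int) -> float:
    """Occurrences per million tokens."""
    return occurrences / corpus_tokens * 1_000_000


def log_ratio(hist_ppm: float, modern_ppm: float, s: float = SMOOTHING_PPM) -> float:
    """log2((hist_ppm + S) / (modern_ppm + S)); positive = historically skewed."""
    return math.log2((hist_ppm + s) / (modern_ppm + s))


# Illustrative counts only: 143 hits in Wikisource, 1,700 hits in CulturaX.
hist = ppm(143, 14_300_000)
modern = ppm(1_700, 17_000_000_000)
print(round(hist, 3), round(modern, 3), round(log_ratio(hist, modern), 2))
```

The smoothing term keeps the ratio finite when a word has zero occurrences in one corpus, which is the common case for the words this project is looking for.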
Verdict values:
| Verdict | Condition |
|---|---|
| extinct | hist_ppm ≥ 1.0 and modern_ppm < 0.1: well-attested historically, nearly absent today. |
| declining | log_ratio ≥ 1.0: at least 2× more frequent historically, but still has some modern presence. |
| historical_only | hist_ppm ≥ 0.1 and modern_ppm < 0.1: appears in old texts but not in modern corpus. |
| stable | −1.0 < log_ratio < 1.0: present in both corpora with no shift of 2× or more in either direction. |
| modern_only | modern_ppm ≥ 0.1 and hist_ppm < 0.1: not in historical texts but present today (likely a newer word or false positive). |
| emerging | log_ratio ≤ −1.0: at least 2× more frequent in modern corpus. |
| absent | Both hist_ppm < 0.1 and modern_ppm < 0.1: too rare to appear meaningfully in either corpus. |
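The verdict conditions overlap (an extinct word also satisfies historical_only and declining), so some evaluation order is needed. The sketch below picks one plausible order: presence checks first, then the log_ratio thresholds, with stable as the fallback for anything in between. That order, the stable fallback, and the function name are assumptions, not a description of validate_diachronic.py.

```python
import math

PRESENCE_PPM = 0.1   # below this a word counts as absent from a corpus
EXTINCT_PPM = 1.0    # "well-attested historically" threshold
SMOOTHING_PPM = 0.1  # S in the log_ratio formula


def verdict(hist_ppm: float, modern_ppm: float) -> str:
    """Classify a word from its historical and modern frequencies."""
    in_hist = hist_ppm >= PRESENCE_PPM
    in_modern = modern_ppm >= PRESENCE_PPM
    if not in_hist and not in_modern:
        return "absent"
    if not in_modern:
        return "extinct" if hist_ppm >= EXTINCT_PPM else "historical_only"
    if not in_hist:
        return "modern_only"
    ratio = math.log2((hist_ppm + SMOOTHING_PPM) / (modern_ppm + SMOOTHING_PPM))
    if ratio >= 1.0:
        return "declining"
    if ratio <= -1.0:
        return "emerging"
    return "stable"  # assumed: everything between the two thresholds


for pair in [(2.0, 0.0), (0.5, 0.05), (10.0, 1.0), (5.0, 5.0), (0.0, 0.0)]:
    print(pair, verdict(*pair))
```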
Generated by make_shortlist.py from the diachronic CSV. Three selection tiers, all with POS exclusions applied:
| Tier | confidence_tier value | Count | Criterion |
|---|---|---|---|
| A | corpus_extinct | 1,218 | verdict=extinct, hist_ppm > 0 |
| A | corpus_declining | 6,169 | verdict=declining, hist_ppm > 0 |
| A | corpus_historical_only | 9,399 | verdict=historical_only, hist_ppm > 0 |
| B | dex_invechit_absent | 2,994 | verdict=absent + dex_register contains învechit |
| C | dex_absent_highfreq | 3,332 | verdict=absent, hist_ppm=0, modern_ppm < 0.1, dex_frequency ≥ 0.85 |
Tier B: DEX editorial + absent from all corpora — two independent archaism signals. Tier C: highest DEX legitimacy but no corpus trace at all — the "most forgotten" words (e.g. oțios, dex_frequency=0.85). Tune Tier C with --dex-freq-threshold (default 0.85).
All rows carry is_forgotten = true (required by search_wild.py). Columns are a subset of the diachronic CSV plus confidence_tier.
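The tier criteria in the table translate into a short selection function. This is a sketch, not make_shortlist.py itself: the function name and argument list are hypothetical, the POS exclusions are left out because their rules are not listed here, and checking Tier B before Tier C is an assumption.

```python
from typing import Optional


def confidence_tier(
    verdict: str,
    hist_ppm: float,
    modern_ppm: float,
    dex_register: str,
    dex_frequency: float,
    dex_freq_threshold: float = 0.85,  # mirrors --dex-freq-threshold
) -> Optional[str]:
    """Return the confidence_tier value, or None if the row is not shortlisted."""
    # Tier A: corpus evidence of decline.
    if verdict in ("extinct", "declining", "historical_only"):
        return f"corpus_{verdict}" if hist_ppm > 0 else None
    if verdict == "absent":
        # Tier B: DEX editors tagged the word as archaic (pipe-delimited tags).
        if "învechit" in dex_register.split("|"):
            return "dex_invechit_absent"
        # Tier C: canonical in DEX but no corpus trace at all.
        if hist_ppm == 0 and modern_ppm < 0.1 and dex_frequency >= dex_freq_threshold:
            return "dex_absent_highfreq"
    return None


print(confidence_tier("absent", 0.0, 0.0, "", 0.85))
```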
All columns from the shortlist, plus web search results from search_wild.py.
| Column | Description |
|---|---|
| total_results | Approximate search result count returned by the provider for the word query. |
| in_wild | true if the provider returned at least one result, meaning the word still appears somewhere on the Romanian web. |
| web_score | Categorical bucket based on total_results. DDG: 0 / alive_rare (1–9) / alive (10–29) / common (30+). Google: 0 / alive_rare (1–9) / alive (10–99) / common (100+). |
| top_url | URL of the top-ranked search result, if any. |
| last_seen_approx | Best-effort approximate date the word was last seen on the web (parsed from result metadata; often empty). |
| provider | Search backend used: ddg (DuckDuckGo, no API key) or google (Google Custom Search, needs env vars). |
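The per-provider web_score buckets can be expressed as a threshold table. In this sketch the function and constant names are hypothetical, and because the table above does not name the bucket for zero results, `ZERO_LABEL = "0"` is a placeholder rather than the value search_wild.py actually writes.

```python
ZERO_LABEL = "0"  # placeholder: the real label for zero results is not documented

# provider -> (minimum for "alive", minimum for "common")
THRESHOLDS = {
    "ddg": (10, 30),
    "google": (10, 100),
}


def web_score(total_results: int, provider: str) -> str:
    """Bucket a result count using the provider-specific thresholds."""
    alive_min, common_min = THRESHOLDS[provider]
    if total_results <= 0:
        return ZERO_LABEL
    if total_results < alive_min:
        return "alive_rare"
    if total_results < common_min:
        return "alive"
    return "common"


print(web_score(30, "ddg"), web_score(30, "google"))
```

The same count lands in different buckets per provider, so web_score values should only be compared between rows that share a provider.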
Quick frequency screen via the wordfreq library, without streaming any corpus.
| Column | Description |
|---|---|
| lemma | Base form produced by simplemma.lemmatize(word, lang='ro'). This is what gets looked up in wordfreq. |
| zipf_frequency | Zipf-scale frequency from wordfreq's Romanian model (roughly: 6 = very common, 3 = uncommon, 0 = not in wordfreq's list at all). 0.0 does not mean "least common"; it means wordfreq has no signal for this word. |
| tier | Classification: forgotten (zipf < 3.0) / rare_in_use (3.0 ≤ zipf < 4.5, non-empty dex_register) / common (≥ 4.5 or no register tag). |
| is_forgotten | true if tier == 'forgotten'; kept for backward compatibility with search_wild.py. |
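The tier rule in the table reduces to two comparisons plus the register gate. The sketch below takes the Zipf score as an input instead of calling wordfreq, so it runs without the library; in the pipeline that score would come from looking up the lemma in wordfreq. The function name is hypothetical.

```python
def wordfreq_tier(zipf: float, dex_register: str) -> str:
    """Classify a word from its Zipf score and its DEX register tags."""
    if zipf < 3.0:
        return "forgotten"
    # The register gate keeps unmarked modern vocabulary out of the rare list.
    if zipf < 4.5 and dex_register.strip():
        return "rare_in_use"
    return "common"


for zipf, register in [(0.0, ""), (3.5, "învechit"), (3.5, ""), (5.0, "popular")]:
    print(zipf, repr(register), wordfreq_tier(zipf, register))
```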
otios/
├── data/
│ ├── dictionaries/ # DEX Online database (download separately)
│ └── processed/ # Generated lexeme data and results
├── docs/ # Documentation and specifications
│ ├── scripts-guide.md # Detailed script documentation
│ ├── romanian-forgotten-words-spec.md
│ └── results-summary.md
└── *.py # Processing scripts
- docs/scripts-guide.md - Comprehensive guide to all scripts
- docs/romanian-forgotten-words-spec.md - Technical specification
- docs/results-summary.md - Analysis results and findings
- docs/oțios.docx.md - Initial brainstorming document
- more docs: PHASE2_COMPLETE.md; phase2-test-results.md; scripts-guide.md
Top extinct words from the diachronic analysis (high historical frequency, near-zero modern):
| Word | Meaning | DEX freq | log₂ ratio | Register | Etymology |
|---|---|---|---|---|---|
| tibișir | type of muslin fabric | 0.82 | 8.53 | — | franțuzism |
| ghiftui | to stuff oneself | 0.94 | 7.44 | — | franțuzism |
| coșcodan | monkey (archaic) | 0.77 | 7.15 | — | — |
| bolboacă | clay cooking pot | 0.94 | 6.65 | învechit | — |
| stacan | type of goblet | 0.90 | 7.04 | — | — |
| ietac | private chamber | 0.91 | 4.19 | învechit | — |
DEX-tagged archaic words with no corpus signal at all (Tier B — "dark matter"):
| Word | Meaning | DEX freq | Register |
|---|---|---|---|
| vece | outhouse (from Ger. Wasserklosett) | 0.99 | învechit |
| alenă | breath, exhalation | 0.97 | învechit |
| hurducăi | to jolt, to shake about | 0.95 | învechit |
| pripoană | tethering stake | 0.95 | învechit |
- DEX Online Database: Official Romanian dictionary (1.2 GB MySQL dump)
- Download: dexonline.ro
- 315,247 lexemes with frequency data
- Archaic markers and linguistic metadata
- Fix the MySQL import (try an LLM-assisted import)
- Create another sample DB with max 3 inserts per table, for analytics
- Database setup and conversion
- Lexeme extraction pipeline
- Frequency-based analysis
- Quality filtering and curation
- CSV export with ~140k candidates (cutoff raised to DEX freq < 1.0)
Output: forgotten_words_curated.csv — ~140k candidates (dictionary suspects, corpus validation is the real gate)
- Wikisource RO corpus — 12,921 docs, 14.3M tokens (historical baseline)
- CulturaX RO corpus — 40.3M docs, 17.0B tokens (modern web)
- Diachronic comparison: log₂((hist_ppm + 0.1) / (modern_ppm + 0.1)) per word
- Taxonomy enrichment: dex_pos, dex_register, dex_domain, dex_etymology
- Shortlist generation: 23,112 words across 3 tiers and 5 confidence_tier values (Tier C added for corpus-absent DEX-canonical words)
Output: forgotten_words_diachronic.csv (130k rows) → forgotten_words_shortlist.csv (23k rows)
- Extract full definitions from DEX database
- Join Definition and DefinitionSimple tables
- Identify archaic markers (înv., arh., reg., dial.): dex_register column via Tag taxonomy
- Extract etymology information: dex_etymology column (grecism, latinism, turcism…)
- Add part-of-speech tagging: dex_pos column (substantiv neutru, adjectiv, verb…)
- Flag words with no definition body ("Fără definiție." entries like nombrilist)
- Parse first attestation dates
- Temporal analysis (when words fell out of use)
- Link to word families and cognates
- Integrate Romanian lemmatizer (spaCy-ro or nlp-cube)
- Match inflected forms to base words
- Improve recall (find "frumoaselor" when searching "frumos")
- Named entity recognition for better filtering
- Semantic clustering of forgotten words
- DDG triage pass on shortlist (~23k words, no quota)
- Google CSE pass on high-confidence subset (100/day free tier)
- Cross-reference: corpus verdict vs web presence
- Exploratory UI for browsing the shortlist (filter by tier, POS, etymology, domain, verdict, marks)
- Word detail view: DEX definition, corpus stats, dexonline.ro link
- PHP thin-API port, deployable on shared hosting (public/, tools/build_ui_db.py)
- localStorage bookmarks / notes / quick-tags (no server-side auth needed)
- Forgotten / rare-in-use toggle — browse two word tiers in both Flask and PHP UIs
- REST API for programmatic access
- Interactive visualizations
- Frequency decay curves (hist_ppm vs modern_ppm scatter)
- Etymological breakdown of extinct words
- Word cloud weighted by log_ratio
- Revival potential scoring algorithm
- Compare with other Romance languages
- Historical corpus analysis (Project Gutenberg)
- Machine translation of forgotten word contexts
- Crowdsourced validation platform
- Word-of-the-day feature
- Educational tools and quizzes
- Reverse pipeline: browse news and r/romania to find new words, used more than 3(?) times, that are not in the dictionary -> alternative dictionary
- Tools: convert texts to an archaic form using less-used words, with a coefficient of uniqueness (bigger number, harder words)
- Filter out uninteresting words that are too domain-specific: medicine, biology, etc.
- One-word-a-day game or quiz: guess what it means
- No lemmatization: buclele doesn't match bucle in the corpus. Inflected forms are invisible. Adding simplemma would significantly improve recall (backlog #6).
- POS tag noise: some words get wrong POS tags because the ObjectTag join occasionally pulls tags from adjacent dictionary entries. Supplementary metadata only; doesn't affect core analysis.
- Sparse etymology -ism tags: many words store "limba franceză" not "franțuzism" in DEX. Both are captured but the vocabulary is inconsistent across DEX editors.
- absent verdict ambiguity: 83k words with no corpus signal conflate "truly unused" with "only appears in inflected forms not tracked". Lemmatization (#6) would reduce this category significantly.
# Web validation — DDG triage on full shortlist
python search_wild.py --input data/processed/forgotten_words_shortlist.csv \
--provider ddg --limit 500 --delay 2
# Or just the highest-confidence tier first
python make_shortlist.py --limit 1137 # corpus_extinct only
python search_wild.py --input data/processed/forgotten_words_shortlist.csv \
--provider ddg --delay 2

See Activity History and Backlog for the changelog and open items / roadmap.
See CLAUDE.md for development guidelines and project context.