Oțios - Romanian Forgotten Words Finder

[⚠️NOTE] Data still needs some fine-tuning and QA.

A computational linguistics tool to identify "forgotten" Romanian words - terms that exist in official dictionaries but have fallen out of modern usage.

Status: 📚 Definitions + 🔍 Phase 3 + 🌐 Web UI (PHP) — shortlist generated, definitions complete, web validation next

What It Does

Generates a list of the least used or forgotten words from Romanian dictionaries
Compares official dictionaries (including archaisms) against usage frequency data
Identifies linguistic "dark matter" - words that exist in dictionaries but have fallen out of active use
Produces curated lists with rarity scores and linguistic metadata

See exploratory UI prototype: lab.gov2.ro/oțios

See also: initial specs / live (Google Doc)

Pipeline

End-to-end flow

flowchart TD
    DEX[("DEX Online dump\n1.2 GB MySQL")]

    subgraph P1["Phase 1 · Dictionary Analysis"]
        A["create_sample_db.py"] --> B[("dex-sample.sql")]
        B --> C["extract_lexemes.py"] --> D[("lexemes.db\n315k lexemes")]
        D --> E["analyze_forgotten_words.py"] --> F[("forgotten_words_v1.csv")]
        F --> G["create_curated_list.py"] --> H[("forgotten_words_curated.csv\n~140k candidates")]
        D --> TAX["extract_taxonomy.py\n(run once)"] --> D
    end

    subgraph P2["Phase 2 · Corpus Validation"]
        subgraph P2A["2a · wordfreq — fast, rough"]
            WF["validate_with_wordfreq.py"] --> WF_OUT[("validated_wordfreq.csv\n1,868 rows")]
        end
        subgraph P2B["2b · Diachronic — recommended ✅"]
            WS["process_wikisource.py\n(historical · 14M tokens)"]
            CX["process_culturax.py\n(modern web · 17B tokens)"]
            WS & CX --> CORP[("corpus_frequencies.db")]
            CORP --> DIA["validate_diachronic.py"] --> DIA_OUT[("forgotten_words_diachronic.csv\n130k rows · 4 taxonomy cols")]
        end
        subgraph P2C["2c · Legacy Wikipedia — ⚠️ P0 bug"]
            DL["download_wikipedia_ro.py"] --> PC["process_corpus.py"] --> VFW["validate_forgotten_words.py"] --> LEG[("validated.csv")]
        end
    end

    subgraph P25["Phase 2.5 · Shortlist"]
        SL["make_shortlist.py"] --> SL_OUT[("forgotten_words_shortlist.csv\n23,112 rows · 3 tiers")]
    end

    subgraph P3["Phase 3 · Web Validation"]
        SW["search_wild.py\n--provider ddg | google"] --> WEB_OUT[("forgotten_words_web_validated.csv\nweb_score · last_seen_approx")]
    end

    DEX --> A
    H --> WF
    H --> WS & CX
    H --> DL
    D -.->|"taxonomy tags\n(dex_pos, register,\ndomain, etymology)"| DIA
    DIA_OUT --> SL
    WF_OUT --> SL
    SL_OUT --> SW
    LEG --> SW

Phase 2 paths are alternatives — run 2a for a quick pass, 2b for the recommended diachronic analysis (historical vs modern corpora), or 2c only if reproducing earlier results (it has a known P0 bug). make_shortlist.py (Phase 2.5) filters the 130k diachronic rows down to the ~23k most defensible forgotten words before web validation.

Quick Start

Prerequisites

# Activate virtual environment (adjust path to your venv)
source ~/devbox/envs/240826/bin/activate

# Install all dependencies
pip install -r requirements.txt

Phase 1: Dictionary Analysis

# 1. Create sample database (reduces 1.2GB to 285MB)
python create_sample_db.py

# 2. Extract lexeme data (creates CSV + SQLite database)
python extract_lexemes.py

# 3. Generate analysis and statistics
python analyze_forgotten_words.py

# 4. Create final curated list
python create_curated_list.py

Output: forgotten_words_curated.csv (~140k candidates)

Phase 2a: Quick frequency screen (wordfreq)

python validate_with_wordfreq.py

Outputs two files:

forgotten_words_validated_wordfreq.csv — words with Zipf < 3.0 (tier=forgotten)
rare_words_wordfreq.csv — words with Zipf 3.0–4.5 and a non-empty dex_register tag (tier=rare_in_use)

The tier column supersedes is_forgotten (kept for backward compatibility). Note: wordfreq's Romanian coverage is binary (0.000 or ≥ 3.0), so treat this as a rough first pass. The rare_in_use gate on dex_register prevents modern unmarked vocabulary (neurologie, cowboy…) from polluting the rare list.

Phase 1b: Taxonomy extraction (run once after Phase 1)

Extracts Tag, ObjectTag, and EntryLexeme tables from the DEX dump into lexemes.db, enabling register/domain/etymology/POS columns in the diachronic output.

# Sample dump (fast, ~5% coverage)
python extract_taxonomy.py

# Full dump (recommended — ~990k ObjectTag rows, full coverage)
python extract_taxonomy.py --sql data/dictionaries/dex-database.sql

Phase 2b: Corpus validation — diachronic (recommended)

Uses Wikisource RO (historical literary baseline) and CulturaX RO (modern web) to compute actual per-corpus frequencies. Designed to find words that appear in 19th-century literature but are absent from modern text.

# Wikisource — test run (500 docs, ~10s)
python process_wikisource.py --test

# Wikisource — full run (best on a VPS)
mkdir -p data/logs
nohup python process_wikisource.py --resume >> data/logs/wikisource.log 2>&1 &
echo $! > data/logs/wikisource.pid

# CulturaX — full run (64 parquet shards, ~40M docs; auto-restarts on network errors)
# Interactive (watch it run):
while true; do
    python -u process_culturax.py --resume
    [ $? -eq 0 ] && break
    echo "[$(date)] restarting in 15s..." && sleep 15
done

# Background (logs to file):
# Path to the venv's Python interpreter (adjust to your setup)
VENV=~/g2-dev/monitorulpreturilor/venv/bin/python
mkdir -p data/logs
nohup bash -c "while true; do $VENV -u process_culturax.py --resume; [ \$? -eq 0 ] && break; echo \"[\$(date)] restarting in 15s...\"; sleep 15; done" \
  >> data/logs/culturax.log 2>&1 &
echo $! > data/logs/culturax.pid

Output: corpus_frequencies.db with corpus_name = 'wikisource_ro' and corpus_name = 'culturax_ro'.

Note: process_culturax.py reads the 64 parquet shards directly via HfFileSystem + pyarrow and checkpoints at file + row-group level. This avoids the datasets streaming ds.skip() cycling bug that triggers when the checkpoint offset exceeds the dataset size.
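The resume logic described above can be sketched without any parquet dependency. This is a minimal illustration, not the code in process_culturax.py: the names `pending_units`, `Shard` and `Checkpoint` are hypothetical, and shards are represented only by their row-group counts. The point it shows is that a (file, row-group) checkpoint that already sits at the end of the data yields nothing, rather than wrapping around.

```python
from typing import Iterator, List, Optional, Tuple

# (shard name, number of row groups): stand-ins for the real parquet shards
Shard = Tuple[str, int]
# (shard index, row-group index) of the last unit that was fully processed
Checkpoint = Optional[Tuple[int, int]]


def pending_units(shards: List[Shard], checkpoint: Checkpoint) -> Iterator[Tuple[int, int]]:
    """Yield the (shard_idx, row_group_idx) pairs that still need processing."""
    if checkpoint is None:
        start_shard, start_group = 0, 0
    else:
        # Resume right after the last completed row group.
        start_shard, start_group = checkpoint[0], checkpoint[1] + 1
    for shard_idx in range(start_shard, len(shards)):
        _, n_groups = shards[shard_idx]
        first = start_group if shard_idx == start_shard else 0
        for group_idx in range(first, n_groups):
            yield shard_idx, group_idx


shards = [("shard-00", 3), ("shard-01", 2)]
print(list(pending_units(shards, None)))    # fresh run: every unit
print(list(pending_units(shards, (0, 1))))  # resume after shard 0, row group 1
print(list(pending_units(shards, (1, 1))))  # everything done: empty list
```

In a real run each yielded pair would presumably map to one row group read from one shard, with the checkpoint written after that unit completes.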

Phase 2b continued: diachronic comparison + shortlist

# Compare historical vs modern frequencies, add taxonomy columns
python validate_diachronic.py
# Output: forgotten_words_diachronic.csv (130k rows)

# Filter down to the most defensible forgotten words
python make_shortlist.py --stats   # preview counts by tier
python make_shortlist.py           # write forgotten_words_shortlist.csv (~23k rows)

Phase 2.5: Fill definition gaps from dexonline.ro

The DEX MySQL dump's DefinitionSimple table only covers ~4.6k of the 17.4k shortlist words. scrape_definitions.py fills the remaining gaps by extracting the synthesis (definition) from dexonline.ro for each missing word.

# Smoke test (5 words, no HTTP)
python scrape_definitions.py --dry-run --limit 5

# Small live run (test the scraper)
python scrape_definitions.py --limit 20 --delay 3.0

# Full run (all missing words, ~5–7 hours at 3s/request)
python scrape_definitions.py --delay 3.0 --merge

# Resume an interrupted run
python scrape_definitions.py --delay 3.0 --merge      # automatically skips already-scraped

# Just upsert checkpoint into DB (if scraping completed but merge wasn't run)
python scrape_definitions.py --merge-only

Output: data/processed/scraped_definitions.csv (checkpoint with columns: word, definition, source_url, scraped_at, status). With --merge, all status=ok rows are upserted into definitions.db immediately. Resume is safe — each row is flushed instantly; Ctrl+C stops cleanly.
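A checkpoint file with these columns makes resuming straightforward. The sketch below is an illustration under assumptions, not the scraper's actual code: `already_scraped` and `append_row` are hypothetical helpers, the definition text is a placeholder, and it assumes that any word present in the checkpoint is skipped regardless of its status.

```python
import csv
import os
import tempfile

FIELDS = ["word", "definition", "source_url", "scraped_at", "status"]


def already_scraped(path):
    """Return the set of words in the checkpoint CSV (empty if the file does not exist yet)."""
    if not os.path.exists(path):
        return set()
    with open(path, newline="", encoding="utf-8") as fh:
        return {row["word"] for row in csv.DictReader(fh)}


def append_row(path, row):
    """Append one row and flush immediately, so an interrupted run loses nothing."""
    is_new = not os.path.exists(path)
    with open(path, "a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)
        fh.flush()


# Demo in a temporary directory so nothing under data/processed/ is touched.
checkpoint = os.path.join(tempfile.mkdtemp(), "scraped_definitions.csv")
append_row(checkpoint, {"word": "ietac", "definition": "(placeholder)",
                        "source_url": "", "scraped_at": "", "status": "ok"})

todo = [w for w in ["ietac", "stacan"] if w not in already_scraped(checkpoint)]
print(todo)  # ['stacan']
```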

Phase 3: Web validation

# Dry run first
python search_wild.py --input data/processed/forgotten_words_shortlist.csv \
    --provider ddg --limit 5 --dry-run

# DDG triage (no API key, good for first pass)
python search_wild.py --input data/processed/forgotten_words_shortlist.csv \
    --provider ddg --limit 200 --delay 2

# Google CSE (cleaner results, needs env vars, 100/day free tier)
export GOOGLE_API_KEY="AIza..."
export GOOGLE_CSE_ID="017576..."
python search_wild.py --input data/processed/forgotten_words_shortlist.csv \
    --provider google --limit 100

Monitoring

flowchart LR
    subgraph JOBS["Long-running corpus jobs"]
        J1["process_wikisource.py"]
        J2["process_culturax.py"]
    end

    subgraph LOGS["data/logs/"]
        L1["wikisource.log / .pid"]
        L2["culturax.log / .pid"]
        L3["run_history.jsonl"]
        L4["health_status.json"]
    end

    subgraph MON["Monitoring scripts"]
        ST["status.py\n(read-only · any time)"]
        HC["health_check.py\n(cron · every 30 min)"]
        AU["audit.py\n(cron · daily 02:00)"]
    end

    subgraph ALERT["Alert channels"]
        AW["webhook\nOTZIOS_ALERT_URL"]
        AE["email\nOTZIOS_ALERT_EMAIL"]
    end

    JOBS --> LOGS
    LOGS --> ST & HC & AU
    HC & AU --> AW & AE

health_check.py, audit.py, and status.py keep tabs on long-running corpus jobs. Run them manually or via cron (see CLAUDE.md for crontab lines).

python status.py                # at-a-glance summary — corpora, artifacts, loops, audit
python health_check.py          # check liveness, stalls, log errors, completion
python audit.py                 # snapshot run history + DB quality checks
python health_check.py --dry-run  # print without alerting or writing state

status.py is read-only — safe to run any time. health_check.py and audit.py write logs and may alert.

Set OTZIOS_ALERT_URL (webhook) or OTZIOS_ALERT_EMAIL to receive push alerts.

Data notes

Apostrophes in the word column — DEX Online encodes syllable stress using apostrophes (e.g. bucl'e, băt'ârn). These are not real Romanian words; the clean form is in word_no_accent. The validated output from validate_with_wordfreq.py uses word_no_accent for all lookups and moves the raw word column to the end of the CSV for reference.
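As a small illustration of the relationship between word and word_no_accent, the clean form can be obtained by dropping the apostrophes. The helper name `strip_stress_marks` is hypothetical, and the sketch assumes the plain ASCII apostrophe is the only stress marker.

```python
def strip_stress_marks(word: str) -> str:
    """Remove DEX stress apostrophes, leaving the plain word form."""
    return word.replace("'", "")


print(strip_stress_marks("bucl'e"))   # bucle
print(strip_stress_marks("băt'ârn"))  # bătârn
```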

Output files

All generated files live under data/processed/. Columns shared across files have the same meaning everywhere.

How the files relate:

forgotten_words_curated.csv    — 140k dictionary suspects (no corpus signal)
        ↓ validate_diachronic.py
forgotten_words_diachronic.csv — 130k rows with corpus frequencies + taxonomy
        ↓ make_shortlist.py
forgotten_words_shortlist.csv  — ~23k most defensible forgotten words (3 tiers)
        ↓ search_wild.py
forgotten_words_web_validated.csv — shortlist + real-world web presence

Shared columns

Column	Description
`word`	Word form as it appears in DEX, including stress apostrophes (e.g. `bucl'e`). Use `word_no_accent` for lookups.
`word_no_accent`	Clean form with apostrophes removed — the canonical key for all frequency lookups.
`frequency` / `dex_frequency`	DEX frequency score, 0.0–1.0. Lower = rarer. `0.0` means the field was absent in DEX — treat it as missing data, not "rarest".
`rarity_category`	Bin derived from `dex_frequency`: `very_rare` (< 0.30), `rare` (0.30–0.50), `uncommon` (0.50–0.60), `standard` (0.60–1.0). `standard` means DEX considers the word canonical but corpus evidence may disagree.
`description`	Part-of-speech and register abbreviation from DEX (e.g. `s.n.` = neuter noun, `adj.` = adjective, `înv.` = archaic).
`model_type`	DEX inflection model code (e.g. `I`, `A1`). Identifies the paradigm used for conjugation/declension.
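The rarity_category bins can be expressed as a small function. This is a sketch under assumptions: the function name is hypothetical, the bin edges are treated as half-open (a value of exactly 0.30 falls into `rare`), and returning `None` for 0.0 is just one way to honour the "missing data" caveat; the pipeline may handle that case differently.

```python
from typing import Optional


def rarity_category(dex_frequency: float) -> Optional[str]:
    """Map a DEX frequency score (0.0-1.0) to its rarity bin."""
    if dex_frequency == 0.0:
        return None  # assumption: absent field is reported as missing, not "rarest"
    if dex_frequency < 0.30:
        return "very_rare"
    if dex_frequency < 0.50:
        return "rare"
    if dex_frequency < 0.60:
        return "uncommon"
    return "standard"


for score in (0.0, 0.12, 0.30, 0.55, 0.85):
    print(score, rarity_category(score))
```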

`forgotten_words_curated.csv` — Phase 1 candidates (dictionary only)

Every DEX entry with frequency < 1.0 that passes form filters (length, not a proper noun, has a word-class marker). No corpus evidence — these are suspects, not confirmed forgotten words. Currently ~140k rows.

Column	Description
`notes`	Raw notes from the DEX entry (register markers, usage labels, etc.).

`forgotten_words_diachronic.csv` — Phase 2b validated output (corpus evidence)

One row per candidate from forgotten_words_curated.csv, enriched with measured frequencies from both corpora and a verdict. This is the file to use for any downstream analysis — it tells you whether each word is actually missing from modern text, and by how much.

Column	Description
`hist_occurrences`	Raw occurrence count in the Wikisource RO corpus (historical literary baseline, ~14M tokens).
`hist_documents`	Number of distinct Wikisource documents containing the word.
`hist_ppm`	`hist_occurrences` normalised to occurrences per million tokens in Wikisource.
`modern_occurrences`	Raw occurrence count in the CulturaX RO corpus (modern web text, ~17B tokens).
`modern_documents`	Number of distinct CulturaX documents containing the word.
`modern_ppm`	`modern_occurrences` normalised to occurrences per million tokens in CulturaX.
`log_ratio`	`log₂((hist_ppm + S) / (modern_ppm + S))` where S = 0.1 per million (Laplace smoothing). Positive = historically skewed; negative = more common today. A value of 1.0 means the word is twice as frequent historically; −1.0 means twice as frequent now.
`verdict`	Categorical summary — see table below.
`dex_pos`	Full part-of-speech label from DEX Tag taxonomy (e.g. `substantiv neutru`, `adjectiv`, `verb`). Pipe-delimited if multiple. Empty until `extract_taxonomy.py` is run against the full dump.
`dex_register`	Stylistic register tags from DEX (e.g. `învechit`, `popular`, `dialectal`, `livresc`). Pipe-delimited. A word tagged `învechit` in DEX is direct editorial evidence of archaism, independent of corpus signal.
`dex_domain`	Subject domain tags (e.g. `muzică`, `chimie`, `medicină`, `drept`). Pipe-delimited. Useful for filtering out technical jargon.
`dex_etymology`	Etymology/origin tags (e.g. `grecism`, `latinism`, `anglicism`, `turcism`, `slavonism`). Pipe-delimited.
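The ppm and log_ratio columns follow directly from the formulas above. A minimal sketch, with hypothetical function names and S fixed at 0.1 per million as stated:

```python
import math

SMOOTHING_PPM = 0.1  # S in the log_ratio formula


def ppm(occurrences: int, corpus_tokens: int) -> float:
    """Normalise a raw count to occurrences per million tokens."""
    return occurrences / corpus_tokens * 1_000_000


def log_ratio(hist_ppm: float, modern_ppm: float, s: float = SMOOTHING_PPM) -> float:
    """Smoothed log2 ratio; positive means the word skews historical."""
    return math.log2((hist_ppm + s) / (modern_ppm + s))


# (0.3 + 0.1) / (0.1 + 0.1) = 2, so the smoothed ratio is one doubling.
print(round(log_ratio(0.3, 0.1), 2))  # 1.0
# A word absent from both corpora gets a ratio of exactly zero.
print(log_ratio(0.0, 0.0))            # 0.0
```

The smoothing term keeps the ratio finite when a word has zero occurrences in one corpus, at the cost of compressing ratios for very rare words.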

Verdict values:

Verdict	Condition
`extinct`	`hist_ppm ≥ 1.0` and `modern_ppm < 0.1` — well-attested historically, nearly absent today.
`declining`	`log_ratio ≥ 1.0` — at least 2× more frequent historically, but still has some modern presence.
`historical_only`	`hist_ppm ≥ 0.1` and `modern_ppm < 0.1` — appears in old texts but not in modern corpus.
`stable`	`
`modern_only`	`modern_ppm ≥ 0.1` and `hist_ppm < 0.1` — not in historical texts but present today (likely a newer word or false positive).
`emerging`	`log_ratio ≤ −1.0` — at least 2× more frequent in modern corpus.
`absent`	Both `hist_ppm < 0.1` and `modern_ppm < 0.1` — too rare to appear meaningfully in either corpus.
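The verdict rules can be read as an ordered set of checks. The sketch below is one possible reading, not the logic in validate_diachronic.py: the function names are hypothetical, the order in which overlapping rules are tested is an assumption, and `stable` is used as the fallback only because its condition is cut off in the table above.

```python
import math


def log_ratio(hist_ppm: float, modern_ppm: float, s: float = 0.1) -> float:
    """Smoothed log2 ratio with S = 0.1 per million."""
    return math.log2((hist_ppm + s) / (modern_ppm + s))


def verdict(hist_ppm: float, modern_ppm: float) -> str:
    """Classify a word from its per-corpus frequencies (rule order is assumed)."""
    if hist_ppm < 0.1 and modern_ppm < 0.1:
        return "absent"
    if hist_ppm >= 1.0 and modern_ppm < 0.1:
        return "extinct"
    if hist_ppm >= 0.1 and modern_ppm < 0.1:
        return "historical_only"
    if modern_ppm >= 0.1 and hist_ppm < 0.1:
        return "modern_only"
    ratio = log_ratio(hist_ppm, modern_ppm)
    if ratio >= 1.0:
        return "declining"
    if ratio <= -1.0:
        return "emerging"
    return "stable"  # assumption: fallback when no other rule fires


print(verdict(2.0, 0.0))   # extinct
print(verdict(0.5, 0.0))   # historical_only
print(verdict(10.0, 1.0))  # declining
```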

`forgotten_words_shortlist.csv` — Phase 2.5 filtered shortlist

Generated by make_shortlist.py from the diachronic CSV. Three selection tiers, all with POS exclusions applied:

Tier	`confidence_tier` value	Count	Criterion
A	`corpus_extinct`	1,218	`verdict=extinct`, `hist_ppm > 0`
A	`corpus_declining`	6,169	`verdict=declining`, `hist_ppm > 0`
A	`corpus_historical_only`	9,399	`verdict=historical_only`, `hist_ppm > 0`
B	`dex_invechit_absent`	2,994	`verdict=absent` + `dex_register` contains `învechit`
C	`dex_absent_highfreq`	3,332	`verdict=absent`, `hist_ppm=0`, `modern_ppm < 0.1`, `dex_frequency ≥ 0.85`

Tier B: DEX editorial + absent from all corpora — two independent archaism signals. Tier C: highest DEX legitimacy but no corpus trace at all — the "most forgotten" words (e.g. oțios, dex_frequency=0.85). Tune Tier C with --dex-freq-threshold (default 0.85).

All rows carry is_forgotten = true (required by search_wild.py). Columns are a subset of the diachronic CSV plus confidence_tier.
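The tier criteria in the table translate into a short selection function. This is an illustrative sketch, not make_shortlist.py itself: the function name and signature are hypothetical, checking Tier B before Tier C is an assumption, and the POS exclusions are not modelled.

```python
from typing import Optional


def confidence_tier(verdict: str, hist_ppm: float, modern_ppm: float,
                    dex_register: str, dex_frequency: float,
                    dex_freq_threshold: float = 0.85) -> Optional[str]:
    """Return the confidence_tier value for a row, or None if it is not shortlisted."""
    # Tier A: corpus evidence of decline.
    if verdict in ("extinct", "declining", "historical_only") and hist_ppm > 0:
        return f"corpus_{verdict}"
    if verdict == "absent":
        # Tier B: no corpus trace, but DEX editors tagged the word as archaic.
        if "învechit" in dex_register.split("|"):
            return "dex_invechit_absent"
        # Tier C: no corpus trace, yet a high DEX frequency score.
        if hist_ppm == 0 and modern_ppm < 0.1 and dex_frequency >= dex_freq_threshold:
            return "dex_absent_highfreq"
    return None


print(confidence_tier("extinct", 3.2, 0.0, "", 0.82))         # corpus_extinct
print(confidence_tier("absent", 0.0, 0.0, "învechit", 0.99))  # dex_invechit_absent
print(confidence_tier("absent", 0.0, 0.0, "", 0.85))          # dex_absent_highfreq
```

Passing a different `dex_freq_threshold` mirrors what the --dex-freq-threshold option is described as tuning.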

`forgotten_words_web_validated.csv` — Phase 3 output

All columns from the shortlist, plus web search results from search_wild.py.

Column	Description
`total_results`	Approximate search result count returned by the provider for the word query.
`in_wild`	`true` if the provider returned at least one result — word still appears somewhere on the Romanian web.
`web_score`	Categorical bucket based on `total_results`. DDG: `0` / `alive_rare` (1–9) / `alive` (10–29) / `common` (30+). Google: `0` / `alive_rare` (1–9) / `alive` (10–99) / `common` (100+).
`top_url`	URL of the top-ranked search result, if any.
`last_seen_approx`	Best-effort approximate date the word was last seen on the web (parsed from result metadata; often empty).
`provider`	Search backend used: `ddg` (DuckDuckGo, no API key) or `google` (Google Custom Search, needs env vars).
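The web_score buckets differ only in where `common` starts for each provider. A minimal sketch, with a hypothetical function name and the thresholds taken from the table above:

```python
# Result count at which a word is considered "common", per provider.
COMMON_FROM = {"ddg": 30, "google": 100}


def web_score(total_results: int, provider: str) -> str:
    """Bucket an approximate result count into the web_score categories."""
    if total_results <= 0:
        return "0"
    if total_results < 10:
        return "alive_rare"
    if total_results < COMMON_FROM[provider]:
        return "alive"
    return "common"


print(web_score(0, "ddg"))      # 0
print(web_score(25, "ddg"))     # alive
print(web_score(50, "ddg"))     # common
print(web_score(50, "google"))  # alive
```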

`forgotten_words_validated_wordfreq.csv` — Phase 2a output

Quick frequency screen via the wordfreq library, without streaming any corpus.

Column	Description
`lemma`	Base form produced by `simplemma.lemmatize(word, lang='ro')`. This is what gets looked up in wordfreq.
`zipf_frequency`	Zipf-scale frequency from wordfreq's Romanian model (roughly: 6 = very common, 3 = uncommon, 0 = not in wordfreq's list at all). `0.0` does not mean "least common" — it means wordfreq has no signal for this word.
`tier`	Classification: `forgotten` (zipf < 3.0) / `rare_in_use` (3.0 ≤ zipf < 4.5, non-empty `dex_register`) / `common` (≥ 4.5 or no register tag).
`is_forgotten`	`true` if `tier == 'forgotten'` — kept for backward compatibility with `search_wild.py`.
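The tier rules above can be sketched as a single function. The name `wordfreq_tier` is hypothetical and the Zipf score is passed in directly, so the example runs without wordfreq or simplemma installed.

```python
def wordfreq_tier(zipf_frequency: float, dex_register: str) -> str:
    """Classify a word from its Zipf score and DEX register tags."""
    if zipf_frequency < 3.0:
        # Includes 0.0, which means "no wordfreq signal" rather than "least common".
        return "forgotten"
    if zipf_frequency < 4.5 and dex_register.strip():
        return "rare_in_use"
    return "common"


print(wordfreq_tier(0.0, ""))          # forgotten
print(wordfreq_tier(3.8, "învechit"))  # rare_in_use
print(wordfreq_tier(3.8, ""))          # common (no register tag)
print(wordfreq_tier(5.2, "popular"))   # common
```

The third call shows the register gate at work: an untagged word in the 3.0 to 4.5 band is treated as ordinary modern vocabulary rather than as rare.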

Project Structure

otios/
├── data/
│   ├── dictionaries/       # DEX Online database (download separately)
│   └── processed/          # Generated lexeme data and results
├── docs/                   # Documentation and specifications
│   ├── scripts-guide.md    # Detailed script documentation
│   ├── romanian-forgotten-words-spec.md
│   └── results-summary.md
└── *.py                    # Processing scripts

Documentation

docs/scripts-guide.md - Comprehensive guide to all scripts
docs/romanian-forgotten-words-spec.md - Technical specification
docs/results-summary.md - Analysis results and findings
docs/oțios.docx.md - Initial brainstorming document
more docs: PHASE2_COMPLETE.md; phase2-test-results.md; scripts-guide.md

Sample Results

Top extinct words from the diachronic analysis (high historical frequency, near-zero modern):

Word	Meaning	DEX freq	log₂ ratio	Register	Etymology
tibișir	type of muslin fabric	0.82	8.53	—	franțuzism
ghiftui	to stuff oneself	0.94	7.44	—	franțuzism
coșcodan	monkey (archaic)	0.77	7.15	—	—
bolboacă	clay cooking pot	0.94	6.65	învechit	—
stacan	type of goblet	0.90	7.04	—	—
ietac	private chamber	0.91	4.19	învechit	—

DEX-tagged archaic words with no corpus signal at all (Tier B — "dark matter"):

Word	Meaning	DEX freq	Register
vece	outhouse (from Ger. Wasserklosett)	0.99	învechit
alenă	breath, exhalation	0.97	învechit
hurducăi	to jolt, to shake about	0.95	învechit
pripoană	tethering stake	0.95	învechit

Data Sources

DEX Online Database: Official Romanian dictionary (1.2 GB MySQL dump)
- Download: dexonline.ro
- 315,247 lexemes with frequency data
- Archaic markers and linguistic metadata

Roadmap

misc notes / tasks

fix MySQL import - try an LLM-assisted import
create another sample db with max 3 inserts per table - for analytics

Phase 1: Dictionary Analysis (Complete ✅)

Database setup and conversion
Lexeme extraction pipeline
Frequency-based analysis
Quality filtering and curation
CSV export with ~140k candidates (cutoff raised to DEX freq < 1.0)

Output: forgotten_words_curated.csv — ~140k candidates (dictionary suspects, corpus validation is the real gate)

Phase 2: Corpus Validation (Complete ✅)

Wikisource RO corpus — 12,921 docs, 14.3M tokens (historical baseline)
CulturaX RO corpus — 40.3M docs, 17.0B tokens (modern web)
Diachronic comparison: log₂(hist_ppm / modern_ppm) per word
Taxonomy enrichment: dex_pos, dex_register, dex_domain, dex_etymology
Shortlist generation: 23,112 words across five confidence_tier values in three tiers (Tier C added for corpus-absent DEX-canonical words)

Output: forgotten_words_diachronic.csv (130k rows) → forgotten_words_shortlist.csv (23k rows)

Phase 3: Enhanced Metadata

Extract full definitions from DEX database
Join Definition and DefinitionSimple tables
Identify archaic markers (înv., arh., reg., dial.) — dex_register column via Tag taxonomy
Extract etymology information — dex_etymology column (grecism, latinism, turcism…)
Add part-of-speech tagging — dex_pos column (substantiv neutru, adjectiv, verb…)
Flag words with no definition body ("Fără definiție." entries like nombrilist)
Parse first attestation dates
Temporal analysis (when words fell out of use)
Link to word families and cognates

Phase 4: Lemmatization & Advanced NLP

Integrate Romanian lemmatizer (spaCy-ro or nlp-cube)
Match inflected forms to base words
Improve recall (find "frumoaselor" when searching "frumos")
Named entity recognition for better filtering
Semantic clustering of forgotten words

Phase 3: Web Validation (Next 🔍)

DDG triage pass on shortlist (~23k words, no quota)
Google CSE pass on high-confidence subset (100/day free tier)
Cross-reference: corpus verdict vs web presence

Phase 5: User Interface & Visualization

Exploratory UI for browsing the shortlist (filter by tier, POS, etymology, domain, verdict, marks)
Word detail view: DEX definition, corpus stats, dexonline.ro link
PHP thin-API port — deployable on shared hosting (public/, tools/build_ui_db.py)
localStorage bookmarks / notes / quick-tags (no server-side auth needed)
Forgotten / rare-in-use toggle — browse two word tiers in both Flask and PHP UIs
REST API for programmatic access
Interactive visualizations
- Frequency decay curves (hist_ppm vs modern_ppm scatter)
- Etymological breakdown of extinct words
- Word cloud weighted by log_ratio

Future Enhancements

Revival potential scoring algorithm
Compare with other Romance languages
Historical corpus analysis (Project Gutenberg)
Machine translation of forgotten word contexts
Crowdsourced validation platform
Word-of-the-day feature
Educational tools and quizzes
Reverse mode: browse news sites and r/romania to find new words that are used more than 3(?) times but are not in the dictionary, building an alternative dictionary

Further enhancements, marketing

tools: convert texts to an archaic form using less-used words, with a coefficient of uniqueness (the higher the number, the harder the words)
filter out uninteresting words that are too domain-specific: medicine, biology, etc.
one-word-a-day game? A quiz: guess what it means?

Known Issues & Limitations

No lemmatization — buclele doesn't match bucle in the corpus. Inflected forms are invisible. Adding simplemma would significantly improve recall (backlog #6).
POS tag noise — some words get wrong POS tags due to the ObjectTag join occasionally pulling tags from adjacent dictionary entries. Supplementary metadata only; doesn't affect core analysis.
Sparse etymology -ism tags — many words store "limba franceză" not "franțuzism" in DEX. Both are captured but the vocabulary is inconsistent across DEX editors.
absent verdict ambiguity — 83k words with no corpus signal conflate "truly unused" with "only appears in inflected forms not tracked". Lemmatization (#6) would reduce this category significantly.

Next Steps

# Web validation — DDG triage on full shortlist
python search_wild.py --input data/processed/forgotten_words_shortlist.csv \
    --provider ddg --limit 500 --delay 2

# Or just the highest-confidence tier first
python make_shortlist.py --limit 1137   # corpus_extinct only
python search_wild.py --input data/processed/forgotten_words_shortlist.csv \
    --provider ddg --delay 2

See Activity History and Backlog for the changelog and open items / roadmap.

Contributing

See CLAUDE.md for development guidelines and project context.
