Refresh ISO 639-3 data to current SIL code tables#53

Open
LavX wants to merge 1 commit into Diaoul:main from LavX:fix/iso-639-3-stale-data

@LavX LavX commented Jun 7, 2026

Summary

The bundled babelfish/data/iso-639-3.tab predated the 2017 ISO 639-3 update, so codes added since then were rejected. Most visibly, Montenegrin (cnr, added in 2017) was unknown:

>>> Language('cnr')
ValueError: 'cnr' is not a valid language

This regenerates the table from the current SIL code tables, picking up cnr plus every other addition, retirement, and correction since the old snapshot.

What changed

Source: https://iso639-3.sil.org/sites/iso639-3/files/downloads/iso-639-3.tab

The downloaded data was reformatted to match the existing file's conventions (CRLF line endings, no trailing newline, unchanged header) so the diff is limited to real data changes rather than whole-file line-ending churn:

  • +207 codes added (including cnr)
  • −152 codes retired
  • 324 in-place updates: ref-name corrections, plus the retirement of the A (ancient) language type whose members are now classified H (historical)
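The reformatting step described above could look roughly like the following. This is a minimal sketch, not the script used for this PR: `normalize_tab` is a hypothetical helper, and the header casing and sample row are illustrative assumptions rather than bytes copied from either file.

```python
def normalize_tab(raw: bytes, header: bytes) -> bytes:
    """Return raw with CRLF line endings, no trailing newline, and header as line 1.

    Hypothetical helper for illustration only.
    """
    # bytes.splitlines() treats LF, CRLF and lone CR as line boundaries.
    lines = raw.splitlines()
    # Drop blank lines at the end so the result has no trailing newline.
    while lines and not lines[-1].strip():
        lines.pop()
    if lines:
        # Reuse the old file's header so that line does not show up in the diff.
        lines[0] = header.rstrip(b"\r\n")
    return b"\r\n".join(lines)


# Illustrative input (assumed, not taken from the real files): an LF-terminated
# download whose header differs from the bundled one only in casing.
old_header = b"Id\tPart2B\tPart2T\tPart1\tScope\tLanguage_Type\tRef_Name\tComment"
downloaded = (
    b"Id\tPart2b\tPart2t\tPart1\tScope\tLanguage_Type\tRef_Name\tComment\n"
    b"cnr\tcnr\tcnr\t\tI\tL\tMontenegrin\t\n"
)
normalized = normalize_tab(downloaded, old_header)
```

Working on bytes rather than text avoids any implicit newline translation when the file is read or written.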

Net language count: 7874 → 7929.
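For reviewers who want to reproduce the counts, a comparison along these lines should work. It is a sketch under assumptions: the first column is taken to be the code, `load_rows` and `classify` are hypothetical names, and the sample rows (including the placeholder code `qqq`) are made up for illustration.

```python
import csv
import io


def load_rows(text: str) -> dict:
    """Map each code (first column, assumed to be Id) to the rest of its row."""
    # newline="" leaves CRLF intact so the csv module handles line endings itself.
    reader = csv.reader(
        io.StringIO(text, newline=""), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    next(reader, None)  # skip the header line
    return {row[0]: tuple(row[1:]) for row in reader if row}


def classify(old: dict, new: dict):
    """Split the difference into added, retired and updated-in-place codes."""
    added = new.keys() - old.keys()
    retired = old.keys() - new.keys()
    updated = {code for code in old.keys() & new.keys() if old[code] != new[code]}
    return added, retired, updated


# Made-up miniature tables, for illustration only.
header = "Id\tPart2B\tPart2T\tPart1\tScope\tLanguage_Type\tRef_Name\tComment"
old_text = "\r\n".join([
    header,
    "akk\takk\takk\t\tI\tA\tAkkadian\t",
    "qqq\t\t\t\tI\tL\tPlaceholder\t",
])
new_text = "\r\n".join([
    header,
    "akk\takk\takk\t\tI\tH\tAkkadian\t",
    "cnr\tcnr\tcnr\t\tI\tL\tMontenegrin\t",
])

old, new = load_rows(old_text), load_rows(new_text)
added, retired, updated = classify(old, new)

# The net count follows from additions and retirements alone.
assert len(old) + len(added) - len(retired) == len(new)
```

The same arithmetic applies to the totals reported here: 7874 + 207 − 152 = 7929.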

Tests

  • Added a regression test for cnr (test_montenegrin_cnr).
  • Updated the count-based assertions (LANGUAGES, name, alpha3b, alpha3t, opensubtitles, and the type code set) to the new totals.
  • Full suite passes on a clean install (poetry install && poetry run pytest).

Not addressed here

The issue mentioned #22 and #23 as related in spirit. Those (along with #24) are alternate-spelling fromname lookups (Pashto/Pushto, Divehi/Dhivehi, Greek/Modern Greek). ISO 639-3 stores a single canonical Ref_Name per language, so refreshing the data does not resolve them. They need alternate-name aliasing (e.g. bundling SIL's iso-639-3_Names.tab), which is left out of scope to keep this PR focused on the staleness fix.
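To make the out-of-scope idea concrete, alternate-name aliasing could be sketched as below. Nothing here is part of this PR or of babelfish: `build_alias_index` is a hypothetical helper, the three-column layout (Id, Print_Name, Inverted_Name) is an assumption about the names table, and the sample rows are illustrative.

```python
def build_alias_index(names_tab: str) -> dict:
    """Map every case-folded name to its code; several names may share one code.

    Hypothetical helper; assumes tab-separated rows of code, print name, inverted name.
    """
    index = {}
    for line in names_tab.splitlines()[1:]:  # skip the header line
        if not line.strip():
            continue
        code, print_name, *_ = line.split("\t")
        # setdefault keeps the first code seen if a name ever repeats.
        index.setdefault(print_name.casefold(), code)
    return index


# Illustrative rows only, not copied from the SIL download.
sample = "\n".join([
    "Id\tPrint_Name\tInverted_Name",
    "pus\tPashto\tPashto",
    "pus\tPushto\tPushto",
    "div\tDivehi\tDivehi",
    "div\tDhivehi\tDhivehi",
])
aliases = build_alias_index(sample)
```

With an index like this, a name lookup could fall back to `aliases.get(name.casefold())` when the canonical Ref_Name does not match.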

Fixes #52

The bundled iso-639-3.tab predated the 2017 ISO 639-3 update, so codes
added since then were rejected, most visibly Montenegrin (cnr):

    >>> Language('cnr')
    ValueError: 'cnr' is not a valid language

Regenerate the table from the current SIL download
(https://iso639-3.sil.org/sites/iso639-3/files/downloads/iso-639-3.tab),
keeping the existing CRLF format so the diff is limited to real data
changes: +207 codes (including cnr), -152 retired codes, and 324 in-place
updates (ref-name corrections plus the retirement of the "A" ancient type,
whose languages are now classified "H" historical).

Update the count-based test assertions to the new totals and add a
regression test for cnr.

Fixes Diaoul#52