Refresh ISO 639-3 data to current SIL code tables #53
Open
LavX wants to merge 1 commit into
The bundled iso-639-3.tab predated the 2017 ISO 639-3 update, so codes
added since then were rejected, most visibly Montenegrin (cnr):
    >>> Language('cnr')
    ValueError: 'cnr' is not a valid language
Regenerate the table from the current SIL download
(https://iso639-3.sil.org/sites/iso639-3/files/downloads/iso-639-3.tab),
keeping the existing CRLF format so the diff is limited to real data
changes: +207 codes (including cnr), -152 retired codes, and 324 in-place
updates (ref-name corrections plus the retirement of the "A" ancient type,
whose languages are now classified "H" historical).
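A minimal sketch of the line-ending normalization described above, assuming the download arrives as raw bytes with LF or mixed line endings. The function name normalize_table and the two-column sample are hypothetical, not part of this change, and the sketch does not touch the header row.

```python
def normalize_table(raw: bytes) -> bytes:
    """Return the table with CRLF line endings and no trailing newline."""
    # Collapse CRLF and bare CR to LF first so mixed input is handled uniformly.
    unified = raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    lines = unified.split(b"\n")
    # Drop empty lines at the end so the output has no trailing newline.
    while lines and lines[-1] == b"":
        lines.pop()
    return b"\r\n".join(lines)


# Illustrative input: an LF-terminated download with a trailing newline.
downloaded = b"Id\tRef_Name\ncnr\tMontenegrin\n"
normalized = normalize_table(downloaded)
```

Working on bytes rather than decoded text avoids any implicit newline translation, and the function is idempotent, so running it on an already-normalized file leaves it unchanged.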
Update the count-based test assertions to the new totals and add a
regression test for cnr.
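For reviewers who want to cross-check the new totals, here is a rough sketch of deriving counts straight from a table instead of reading them off the assertions. It assumes the count-based assertions correspond to non-empty columns, and it runs on a made-up three-row sample; the summarize helper, the column names in HEADER, and the qaa/qab rows are illustrative assumptions, not the project's test code.

```python
import csv
import io

# Assumed column layout of the bundled table.
HEADER = ["Id", "Part2B", "Part2T", "Part1", "Scope", "Language_Type", "Ref_Name", "Comment"]

# Illustrative rows, not real table contents; only the cnr/Montenegrin
# pairing comes from this PR.
SAMPLE_ROWS = [
    ["cnr", "cnr", "cnr", "", "I", "L", "Montenegrin", ""],
    ["qaa", "", "", "", "I", "H", "Sample Historical", ""],
    ["qab", "", "", "", "I", "L", "Sample Living", ""],
]
# CRLF between lines, no trailing newline, matching the bundled file's format.
SAMPLE_TABLE = "\r\n".join("\t".join(row) for row in [HEADER] + SAMPLE_ROWS)


def summarize(table_text: str) -> dict:
    """Collect the code set, per-column counts, and the language type set."""
    reader = csv.DictReader(
        io.StringIO(table_text, newline=""),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,
    )
    rows = list(reader)
    return {
        "codes": {row["Id"] for row in rows},
        "alpha3b": sum(1 for row in rows if row["Part2B"]),
        "alpha3t": sum(1 for row in rows if row["Part2T"]),
        "types": {row["Language_Type"] for row in rows},
    }


summary = summarize(SAMPLE_TABLE)
```

On the refreshed data, a check of this shape would be expected to find cnr in the code set and no "A" in the type set.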
Fixes Diaoul#52
Summary
The bundled babelfish/data/iso-639-3.tab predated the 2017 ISO 639-3 update, so codes added since then were rejected. Most visibly, Montenegrin (cnr, added in 2017) was unknown: Language('cnr') raised ValueError: 'cnr' is not a valid language.

This regenerates the table from the current SIL code tables, picking up cnr plus every other addition, retirement, and correction since the old snapshot.

What changed
Source: https://iso639-3.sil.org/sites/iso639-3/files/downloads/iso-639-3.tab
The downloaded data was reformatted to match the existing file's conventions (CRLF line endings, no trailing newline, unchanged header) so the diff is limited to real data changes instead of a whole-file line-ending churn:
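- 207 codes added (including cnr)
- 152 retired codes removed
- 324 in-place updates: ref-name corrections plus the retirement of the A (ancient) language type, whose members are now classified H (historical)

Net language count: 7874 → 7929.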
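The three categories above can be reproduced by comparing the old and new tables keyed on the Id column. The following is a sketch under stated assumptions rather than the script used for this PR: load_rows and classify_changes are hypothetical helpers, the header names are assumed, and the qaa/qab/qac rows are made up for illustration.

```python
import csv
import io

# Assumed column layout of the bundled table.
HEADER = ["Id", "Part2B", "Part2T", "Part1", "Scope", "Language_Type", "Ref_Name", "Comment"]


def build_table(rows: list[list[str]]) -> str:
    """Join rows with tabs and CRLF, with no trailing newline."""
    return "\r\n".join("\t".join(row) for row in [HEADER] + rows)


def load_rows(table_text: str) -> dict[str, tuple[str, ...]]:
    """Map each code (first column) to its full row, skipping the header."""
    reader = csv.reader(
        io.StringIO(table_text, newline=""),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,
    )
    next(reader)  # header row
    return {row[0]: tuple(row) for row in reader if row}


def classify_changes(old: dict, new: dict) -> tuple[list[str], list[str], list[str]]:
    """Split codes into added, retired, and updated-in-place."""
    added = sorted(new.keys() - old.keys())
    retired = sorted(old.keys() - new.keys())
    updated = sorted(c for c in old.keys() & new.keys() if old[c] != new[c])
    return added, retired, updated


# Made-up rows for illustration; only cnr/Montenegrin comes from this PR.
old = load_rows(build_table([
    ["qaa", "", "", "", "I", "A", "Sample Ancient", ""],
    ["qab", "", "", "", "I", "L", "Sample Retired", ""],
    ["qac", "", "", "", "I", "L", "Sample Stable", ""],
]))
new = load_rows(build_table([
    ["cnr", "", "", "", "I", "L", "Montenegrin", ""],
    ["qaa", "", "", "", "I", "H", "Sample Ancient", ""],
    ["qac", "", "", "", "I", "L", "Sample Stable", ""],
]))

added, retired, updated = classify_changes(old, new)
```

Whatever the inputs, the net change in row count equals additions minus retirements, which matches the figures reported here: 7874 + 207 - 152 = 7929.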
Tests
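- Added a regression test for cnr (test_montenegrin_cnr).
- Updated the count-based assertions (LANGUAGES, name, alpha3b, alpha3t, opensubtitles, and the type code set) to the new totals.
- Ran the test suite (poetry install && poetry run pytest).

Not addressed here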
The issue mentioned #22 and #23 as related in spirit. Those (along with #24) are alternate-spelling fromname lookups (Pashto → Pushto, Divehi → Dhivehi, Greek → Modern Greek). ISO 639-3 stores a single canonical Ref_Name per language, so refreshing the data does not resolve them. They need alternate-name aliasing (e.g. bundling SIL's iso-639-3_Names.tab), which is left out of scope to keep this PR focused on the staleness fix.

Fixes #52