language profiles for `language-detection` derived from wikimedia `*abstract.xml` files

tenzap/language-detection-profiles

Language profiles for language-detection

This repository contains language profiles for the language-detection Java library.

They were generated as described on the library's wiki. The details can be found at https://github.com/shuyo/language-detection/blob/wiki/Tools.md.

The input files are the Wikimedia abstract.xml files.
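
As a rough illustration of the generation step, the sketch below only assembles the command line; it does not run it. The `--genprofile -d <directory> <languages>` form is my reading of the Tools wiki page linked above, so treat that page as authoritative. The jar path `lib/langdetect.jar`, the directory name `abstracts`, and the helper name `genprofile_command` are assumptions made for the example.

```python
import shlex


def genprofile_command(dump_dir, languages, jar="lib/langdetect.jar"):
    """Build (but do not run) a langdetect profile-generation command.

    dump_dir  -- directory holding the downloaded *abstract.xml files (assumed layout)
    languages -- language codes to generate profiles for, e.g. ["en", "fr"]
    jar       -- path to langdetect.jar (hypothetical default)
    """
    if not languages:
        raise ValueError("at least one language code is required")
    return ["java", "-jar", jar, "--genprofile", "-d", str(dump_dir), *languages]


# Print the command so it can be reviewed before running it by hand.
print(shlex.join(genprofile_command("abstracts", ["en", "fr"])))
```

Building the argument list separately from executing it makes it easy to check the command against the wiki before anything is run.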

Wikimedia discontinued production of these files in February 2025. More details can be found on their ticket.

In June 2025, most files generated on 2025-02-01 were still accessible on the Wikimedia dump server. For some languages, however, they were no longer available. The most recent abstract.xml files I could find for those languages were dated 2022-08-20, on the your.org mirror. After some time, even the most recent abstract.xml files will no longer be accessible anywhere.

This repository contains the profiles generated from these abstract.xml files. Each language profile has a corresponding file with a .txt extension, which records:

  • which versions of langdetect.jar and jsonic.jar were used to generate the profile, along with their SHA-256 checksums
  • the URL the abstract.xml file was downloaded from
  • the timestamp of the abstract.xml file as reported by the HTTP server, in both Unix and human-readable formats
  • the date in the filename/dirname of the URL
  • the profile's license, in both human-readable and SPDX formats
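
A minimal sketch of reading such a metadata file, assuming each line has the form `key| value`, as in the jar entries quoted in this README. The helper name `parse_profile_metadata` is hypothetical, and the real .txt files may contain keys or layouts not handled here.

```python
def parse_profile_metadata(text: str) -> dict[str, str]:
    """Parse 'key| value' lines into a dict; lines without a '|' are skipped."""
    metadata = {}
    for line in text.splitlines():
        # Split on the first '|' only, so values containing '|' stay intact.
        key, sep, value = line.partition("|")
        if not sep:
            continue
        metadata[key.strip()] = value.strip()
    return metadata


# Sample lines copied from the jar entries listed in this README.
sample = """\
jsonic.jar-file| jsonic-1.3.10.jar
langdetect.jar-file| langdetect-1.1-20120112.jar
"""
info = parse_profile_metadata(sample)
print(info["langdetect.jar-file"])
```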

Note that the profiles generated with:

  • jsonic.jar-file| jsonic-1.3.10.jar
  • jsonic.jar-sha256| 150e3ac24b103cb0cd50595018dbe3c8ba4b3dee81a76eb4b987607f96e9602b
  • langdetect.jar-file| langdetect-1.1-20120112.jar
  • langdetect.jar-sha256| f692f03ecd2dc926ae79ddca689ccd24c66576a9eacb05d99ee191930e1257b3

are strictly identical to the profiles generated with the jars as of commit https://github.com/shuyo/language-detection/commit/a1b65d981fc40aad0763dd782acbc99ab40a6228, which at the time of writing is the last commit on the master branch.
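
To confirm that local copies of the jars match the ones listed above, their SHA-256 digests can be recomputed and compared. The sketch below is one way to do it with the Python standard library. The helper names (`sha256_of`, `check_jars`) and the idea that both jars sit in one directory are assumptions for illustration; the expected digests are copied from the list above.

```python
import hashlib
from pathlib import Path

# Digests copied from the list above.
EXPECTED = {
    "jsonic-1.3.10.jar": "150e3ac24b103cb0cd50595018dbe3c8ba4b3dee81a76eb4b987607f96e9602b",
    "langdetect-1.1-20120112.jar": "f692f03ecd2dc926ae79ddca689ccd24c66576a9eacb05d99ee191930e1257b3",
}


def sha256_of(path, chunk_size=1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_jars(directory, expected=EXPECTED) -> dict[str, bool]:
    """Map each expected jar name to True if it exists and its digest matches."""
    results = {}
    for name, want in expected.items():
        jar = Path(directory) / name
        results[name] = jar.is_file() and sha256_of(jar) == want.lower()
    return results
```

For example, `check_jars("lib")` returns a dict such as `{"jsonic-1.3.10.jar": True, ...}`, where `lib` is a hypothetical directory holding the jars. A missing file is reported as `False` rather than raising.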

License of language profiles

The language profile files (in the profiles directory) are computationally derived from Wikimedia dumps and are subject to the Wikimedia Foundation Terms of Use. Consequently, text submitted to Wikimedia by the copyright holders is licensed under:

  • Creative Commons Attribution-ShareAlike 4.0 International License ("CC BY-SA 4.0"), and
  • GNU Free Documentation License ("GFDL") (unversioned, with no invariant sections, front-cover texts, or back-cover texts).

Reusers may comply with either license or both.

The SPDX-License-Identifier would likely be CC-BY-SA-4.0 OR GFDL-1.1-no-invariants-or-later.

License of the extraction script

extract.sh is licensed under the MIT License. The copyright holder and license text can be found in the header of that file.

License of the 3rd party jars

  • langdetect*.jar is licensed under Apache-2.0 © Nakatani Shuyo
  • jsonic*.jar is licensed under Apache-2.0 © Hidekatsu Izuno
