This repository contains profiles for the language-detection Java library.
They have been generated as described on the wiki of the library. The details can be found at https://github.com/shuyo/language-detection/blob/wiki/Tools.md.
The input files are the Wikimedia abstract.xml files.
Wikimedia discontinued production of these files in February 2025. More details can be found in their ticket.
In June 2025, most files generated on 2025-02-01 were still accessible on the
Wikimedia dump server. For some languages, however, they were no longer
available. The most recent abstract.xml files I could find for these languages
were dated 2022-08-20, on the your.org mirror. Eventually, the most recent
abstract.xml files will no longer be accessible anywhere.
This repository contains the profiles generated based on these abstract.xml.
Each language profile has a corresponding file with .txt extension which
mentions:
- which versions of langdetect.jar and jsonic.jar were used to generate it, as well as their SHA-256 sums
- the URL where the abstract.xml was downloaded from
- the timestamp of the abstract.xml file as reported by the HTTP server, in both Unix and human-readable format
- the date in the filename/dirname of the URL
- its license in both human-readable and SPDX format
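As a rough illustration, a metadata file could be read like this. This is a minimal sketch that assumes each line of the .txt file has the `key| value` shape used for the jar entries in this README. The function name `parse_metadata` is hypothetical, and keys other than the four jar-related ones shown in this README are not known here.

```python
def parse_metadata(text):
    """Parse 'key| value' lines into a dict.

    Assumption: one field per line, key and value separated by the
    first '|'. Lines without a '|' are skipped.
    """
    fields = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if "|" not in line:
            continue
        key, value = line.split("|", 1)
        fields[key.strip()] = value.strip()
    return fields


# Sample built only from the jar entries listed in this README.
SAMPLE = """\
jsonic.jar-file| jsonic-1.3.10.jar
jsonic.jar-sha256| 150e3ac24b103cb0cd50595018dbe3c8ba4b3dee81a76eb4b987607f96e9602b
langdetect.jar-file| langdetect-1.1-20120112.jar
langdetect.jar-sha256| f692f03ecd2dc926ae79ddca689ccd24c66576a9eacb05d99ee191930e1257b3
"""

if __name__ == "__main__":
    for key, value in parse_metadata(SAMPLE).items():
        print(f"{key}: {value}")
```

Splitting on the first `|` only means that a value containing a `|` of its own would be kept whole.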
Note that the profiles generated with:
- jsonic.jar-file| jsonic-1.3.10.jar
- jsonic.jar-sha256| 150e3ac24b103cb0cd50595018dbe3c8ba4b3dee81a76eb4b987607f96e9602b
- langdetect.jar-file| langdetect-1.1-20120112.jar
- langdetect.jar-sha256| f692f03ecd2dc926ae79ddca689ccd24c66576a9eacb05d99ee191930e1257b3
are strictly identical to the profiles generated with the jars as of commit https://github.com/shuyo/language-detection/commit/a1b65d981fc40aad0763dd782acbc99ab40a6228, which at the time of writing is the last commit on the master branch.
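To check that local copies of the jars match the checksums listed above, something along these lines could be used. This is only a sketch: the helper names `sha256_of_file` and `verify` are hypothetical, and the jars are assumed to be stored under the file names given above.

```python
import hashlib
import sys

# Expected SHA-256 sums, copied from the list above.
EXPECTED = {
    "jsonic-1.3.10.jar":
        "150e3ac24b103cb0cd50595018dbe3c8ba4b3dee81a76eb4b987607f96e9602b",
    "langdetect-1.1-20120112.jar":
        "f692f03ecd2dc926ae79ddca689ccd24c66576a9eacb05d99ee191930e1257b3",
}


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """True if the file at `path` hashes to `expected_hex`."""
    return sha256_of_file(path) == expected_hex.lower()


if __name__ == "__main__":
    # Usage: python verify_jars.py <directory containing the jars>
    jar_dir = sys.argv[1] if len(sys.argv) > 1 else "."
    for name, expected in EXPECTED.items():
        ok = verify(f"{jar_dir}/{name}", expected)
        print(f"{name}: {'OK' if ok else 'MISMATCH'}")
```

Reading in chunks keeps memory use constant regardless of the jar size.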
The language profile files (in the profiles directory) are computationally
derived from Wikimedia dumps and are subject to the
Wikimedia Foundation Terms of Use.
Consequently, text submitted to Wikimedia by the copyright holders is licensed under:
- Creative Commons Attribution-ShareAlike 4.0 International License ("CC BY-SA 4.0"), and
- GNU Free Documentation License ("GFDL") (unversioned, with no invariant sections, front-cover texts, or back-cover texts).
Reusers may comply with either license or both.
The SPDX-License-Identifier would likely be CC-BY-SA-4.0 OR GFDL-1.1-no-invariants-or-later.
extract.sh is licensed under the MIT License. The copyright holder and license
can be found in the header of that file.
- langdetect*.jar is licensed under Apache-2.0 © Nakatani Shuyo
- jsonic*.jar is licensed under Apache-2.0 © Hidekatsu Izuno