A fast, accurate language detection library available for Java, Rust, Python, and C/C++ (via Rust FFI).
LangIdentify detects the language of text using a combination of ngram frequency analysis and whole-word ("topwords") frequency signals, both trained on the Wikipedia corpus. It supports 80+ languages across Latin, Cyrillic, Arabic, CJK, and many other scripts. It runs entirely offline with no network calls.
All implementations use the same model data files and produce equivalent detection results.
Most language detection libraries rely solely on character ngram models. While ngrams are an excellent primary signal, they struggle with short or ambiguous text. Consider:
- "was it Jimmy?" (English) vs. "was ist Jimmy?" (German) -- a single character difference
- "Where is Oberammergau?" -- clearly English, even though most ngrams look German
LangIdentify augments ngram scoring with a topwords signal that identifies common whole words from each language. This was the original motivation for writing the library: we needed higher accuracy on short sentences than existing libraries could provide.
- Accuracy -- blended ngram + topwords scoring, especially effective on short text
- Speed -- open-addressing hash tables, zero allocations in the detection path, fixed arrays
- Low memory -- the 28-language `europe_common` model is ~60 MB (lite) or ~306 MB (full) in Java; load only the languages you need
- Extensible -- adding a new language is straightforward if it has a reasonably sized Wikipedia edition
For Rust, see the Rust README. For Python, see the Python README.
```xml
<!-- Core detection library -->
<dependency>
  <groupId>com.jlpka.langidentify</groupId>
  <artifactId>langidentify-lib</artifactId>
  <version>1.0.2</version>
</dependency>

<!-- Bundled model data (choose one) -->
<dependency>
  <groupId>com.jlpka.langidentify</groupId>
  <artifactId>langidentify-models-lite</artifactId>
  <version>1.0.2</version>
</dependency>
<!-- or: langidentify-models-full for higher accuracy at more memory cost -->
```

```java
import com.jlpka.langidentify.*;
import java.util.List;

// Load the lite model for the languages you care about (throws IOException).
List<Language> languages = Language.fromCommaSeparated("en,fr,de,es,it");
Model model = Model.loadLite(languages);

// Create a detector (lightweight, not thread-safe -- use one per thread).
Detector detector = new Detector(model);

// Detect.
Language lang = detector.detect("Bonjour le monde");
System.out.println(lang);           // FRENCH
System.out.println(lang.isoCode()); // fr
```

After detection, `detector.results()` provides scoring details:
```java
detector.detect("The quick brown fox");
Detector.Results results = detector.results();
System.out.println(results.result); // ENGLISH
System.out.println(results.gap);    // confidence gap (0.0 = close, 1.0 = decisive)
System.out.println(results);        // full per-language score breakdown
```

For streaming or multi-part text, use the `addText` API:
```java
detector.clearScores();
detector.addText("Bonjour");
detector.addText(" le monde");
Language result = detector.computeResult(); // FRENCH
```

This also supports `char[]` and `Reader` inputs.
When you have prior context (e.g. an HTTP Accept-Language header or user locale), you can bias detection toward expected languages:
```java
double[] frenchBoost = model.buildBoostArray(Language.FRENCH, 0.08);
Language lang = detector.detect("message", frenchBoost); // FRENCH
// Without the boost, "message" is ambiguous between English and French.
```

Try to only configure the languages you actually need. Each additional language increases model loading time, memory usage, and detection latency. More importantly, closely related languages can cross-detect on very short phrases -- for example, adding Luxembourgish when you only need German may cause short German phrases to be misidentified.
In addition to being able to specify a list of languages, LangIdentify provides group aliases for convenience:
| Alias | Languages |
|---|---|
| `efigs` | English, French, Italian, German, Spanish |
| `efigsnp` | EFIGS + Dutch, Portuguese |
| `nordic` | Danish, Swedish, Norwegian, Finnish |
| `cjk` | Chinese (Simplified), Chinese (Traditional), Japanese, Korean |
| `europe_west_common` | EFIGSNP + Nordic |
| `europe_east_latin` | Albanian, Croatian, Czech, Estonian, Hungarian, Latvian, Lithuanian, Polish, Romanian, Slovak, Slovenian |
| `europe_cyrillic` | Belarusian, Bulgarian, Macedonian, Russian, Serbian, Ukrainian |
| `europe_common` | Western + Eastern European + Cyrillic |
| `europe_latin` | All European Latin-script languages |
| `europe` | All European languages (Latin + Cyrillic) |
| `latin_alphabet` | All Latin-script languages |
| `cyrillic_alphabet` | All Cyrillic-script languages |
| `arabic_alphabet` | Arabic, Pashto, Persian, Urdu |
| `unique_alphabet` | Languages where the script implies the language (Thai, Greek, Armenian, Georgian, etc.) |
| `all` | All 84 languages |
```java
List<Language> langs = Language.fromCommaSeparated("europe_west_common,cjk");
```

Note that languages trained on smaller Wikipedia corpora may be less accurate.
Both models are trained from the same Wikipedia data but cropped at different probability floors:
| | Lite | Full |
|---|---|---|
| Log-probability floor | -12 (≈ 6.1 × 10⁻⁶) | -15 (≈ 3.1 × 10⁻⁷) |
| Memory (28 langs) | ~60 MB | ~306 MB |
| Best for | Most use cases; good accuracy/memory balance | Maximum accuracy when memory is not a concern |
```java
Model lite = Model.loadLite(languages); // recommended default
Model full = Model.loadFull(languages); // when you need every last bit of accuracy
```

LangIdentify was benchmarked against two other well-known Java detection libraries: Lingua and Shuyo LangDetect (optimaize fork). Test data is from Lingua's accuracy report corpus.
Each library was loaded with all of its supported languages (LangIdentify: 84, Shuyo: 70, Lingua: 75) and evaluated on 10 European language test sets (1,000 sentences each).
| Language | LangIdentify (full) | LangIdentify (lite) | Lingua | Shuyo LangDetect |
|---|---|---|---|---|
| English | 100.0% | 99.9% | 99.1% | 99.3% |
| French | 99.8% | 99.7% | 98.8% | 99.0% |
| German | 99.9% | 99.8% | 99.7% | 99.8% |
| Danish | 99.6% | 98.9% | 97.8% | 94.3% |
| Finnish | 100.0% | 100.0% | 100.0% | 99.9% |
| Italian | 100.0% | 99.9% | 99.7% | 99.2% |
| Spanish | 99.7% | 99.4% | 96.7% | 97.3% |
| Portuguese | 99.8% | 99.8% | 97.9% | 98.8% |
| Dutch | 100.0% | 100.0% | 96.2% | 97.0% |
| Swedish | 99.4% | 98.8% | 98.7% | 96.3% |
Word pairs (Lingua test corpus, all supported languages loaded) -- where short-text accuracy matters most
| Language | LangIdentify (full) | LangIdentify (lite) | Lingua | Shuyo LangDetect |
|---|---|---|---|---|
| English | 94.3% | 91.4% | 88.6% | 57.7% |
| French | 96.1% | 93.6% | 94.5% | 78.6% |
| German | 94.6% | 90.8% | 94.1% | 72.7% |
| Danish | 84.9% | 80.6% | 83.9% | 69.0% |
| Finnish | 98.8% | 97.9% | 98.0% | 95.5% |
| Italian | 95.9% | 93.5% | 91.9% | 81.2% |
| Spanish | 79.2% | 76.1% | 68.7% | 43.5% |
| Portuguese | 88.5% | 83.4% | 85.3% | 58.3% |
| Dutch | 83.2% | 75.5% | 80.7% | 49.6% |
| Swedish | 91.2% | 82.8% | 88.6% | 66.5% |
The full LangIdentify model matches or beats both other libraries on all 10 languages in both tests; the only tie is Finnish sentences, where it and Lingua both score 100.0%. The lite model matches or beats both on full sentences but trails Lingua on 7 of the 10 word-pair sets. The advantage is most pronounced on short text, where the topwords signal makes the biggest difference. Note that word-pair accuracy drops for all libraries when the full language set is loaded, since two-word phrases are inherently ambiguous and more candidate languages increase the chance of a false match. Shuyo's percentages are somewhat inflated because it skips phrases it cannot classify (e.g. 191 of 1,000 Spanish word pairs), while LangIdentify and Lingua always produce a result.
If you know the likely languages in advance, configuring only those languages substantially improves short-text accuracy. The table below shows LangIdentify word-pair results with only the 10 test languages loaded, compared to all 84:
| Language | Full (10 langs) | Full (84 langs) | Lite (10 langs) | Lite (84 langs) |
|---|---|---|---|---|
| English | 97.8% | 83.4% | 95.4% | 78.9% |
| French | 97.7% | 95.9% | 95.8% | 92.9% |
| German | 96.8% | 94.9% | 94.0% | 91.2% |
| Danish | 95.9% | 84.9% | 93.9% | 80.2% |
| Finnish | 99.4% | 98.8% | 98.7% | 97.9% |
| Italian | 97.8% | 96.4% | 95.5% | 93.7% |
| Spanish | 84.6% | 78.9% | 82.8% | 75.7% |
| Portuguese | 90.5% | 88.6% | 86.5% | 83.3% |
| Dutch | 91.8% | 84.7% | 88.6% | 77.3% |
| Swedish | 94.5% | 91.1% | 90.1% | 82.7% |
Narrowing from 84 to 10 languages improves overall word-pair accuracy from 89.8% to 94.7% (full model). The gain is largest for languages that share vocabulary with many others -- English jumps from 83.4% to 97.8%, and Danish from 84.9% to 95.9%. For applications processing very short text, configuring only the expected languages is one of the most effective ways to improve accuracy.
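The overall figures quoted above are consistent with a plain unweighted mean of the ten per-language values in the table, assuming equally sized test sets. A quick check, with the class and method names chosen here purely for illustration:

```java
import java.util.Arrays;

final class WordPairMeans {
    // Full-model word-pair accuracy (%) copied from the table above, in row order:
    // English, French, German, Danish, Finnish, Italian, Spanish, Portuguese, Dutch, Swedish.
    static final double[] FULL_10_LANGS = {97.8, 97.7, 96.8, 95.9, 99.4, 97.8, 84.6, 90.5, 91.8, 94.5};
    static final double[] FULL_84_LANGS = {83.4, 95.9, 94.9, 84.9, 98.8, 96.4, 78.9, 88.6, 84.7, 91.1};

    /** Unweighted arithmetic mean; NaN for an empty array. */
    static double mean(double[] values) {
        return Arrays.stream(values).average().orElse(Double.NaN);
    }

    public static void main(String[] args) {
        System.out.println("84 languages: " + mean(FULL_84_LANGS));
        System.out.println("10 languages: " + mean(FULL_10_LANGS));
    }
}
```

The two means come out at about 89.76 and 94.68, which round to the 89.8% and 94.7% quoted in the text.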
Detection throughput was benchmarked on the same 10 European language sentence corpus (10,000 phrases). Each library was tested in two configurations: with only the 10 test languages loaded (first table), and with all supported languages loaded (second table), which increases per-phrase work since every language must be scored.
| Library | Mwords/s | ns/word |
|---|---|---|
| LangIdentify (lite) | 3.02 | 331 |
| LangIdentify (full) | 2.07 | 484 |
| Shuyo LangDetect | 1.03 | 969 |
| Lingua | 0.17 | 6,016 |
| Library | Mwords/s | ns/word | Languages |
|---|---|---|---|
| LangIdentify (lite) | 1.29 | 774 | 84 |
| LangIdentify (full) | 0.87 | 1,153 | 84 |
| Shuyo LangDetect | 0.34 | 2,933 | 70 |
| Lingua | 0.04 | 25,082 | 75 |
LangIdentify lite with all 84 languages loaded is still faster than Shuyo with only 10
languages. The relative performance gap widens as more languages are added, since
LangIdentify's open-addressing hash tables and fixed-array scoring scale more efficiently
than the alternatives. LangIdentify's hot loop operates on char[] primitives and avoids heap allocations.
Benchmarks were run single-threaded on a MacBook Air M4. Absolute throughput will vary by machine; relative comparisons between libraries are the more useful metric.
LangIdentify combines two statistical signals, both derived from Wikipedia:
- Ngrams -- character subsequences extracted from each word. For example, "hello" yields the 3-grams "hel", "ell", "llo". The relative frequencies of these ngrams differ across languages and form the primary detection signal. We typically evaluate 5-grams down to 1-grams, stopping at 3-grams if the word is fully covered.
- Topwords -- whole-word frequencies for common words like "the", "what", "vous", "ist". This signal is critical for short phrases where ngrams alone are ambiguous. For example, "was ist..." vs. "was it..." differ by a single character -- word frequencies make the distinction clear.
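As a concrete illustration of the ngram definition above, here is a minimal sketch of extracting every ngram of a given size from a word. The class and method names are made up for this example and are not part of the LangIdentify API; it also works on UTF-16 `char` units, which is a simplification.

```java
import java.util.ArrayList;
import java.util.List;

final class NgramSketch {
    /** Returns every contiguous substring of length n, in order of appearance. */
    static List<String> ngrams(String word, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            out.add(word.substring(i, i + n));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("hello", 3)); // [hel, ell, llo]
    }
}
```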
For each ngram and topword, we compute per-language log-probabilities from Wikipedia frequency data. We use log-space because raw probabilities are extremely small numbers (the product of many small per-token probabilities). For instance, a probability of 0.00003% becomes log(3 × 10⁻⁷) ≈ -15. In log-space, multiplication becomes addition, which is both faster and avoids floating-point underflow.
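The underflow argument is easy to reproduce. The sketch below (names are illustrative only) multiplies a per-token probability of 3 × 10⁻⁷ two hundred times, then does the same work by adding logarithms. The token count of 200 is an arbitrary choice for the demonstration.

```java
final class LogSpaceSketch {
    /** Multiplies raw probabilities directly; underflows to 0.0 for long inputs. */
    static double product(double p, int tokens) {
        double result = 1.0;
        for (int i = 0; i < tokens; i++) {
            result *= p;
        }
        return result;
    }

    /** Adds log-probabilities instead; the sum stays well within double range. */
    static double logSum(double p, int tokens) {
        double logP = Math.log(p); // natural log, about -15.02 for p = 3e-7
        double sum = 0.0;
        for (int i = 0; i < tokens; i++) {
            sum += logP;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(product(3e-7, 200)); // 0.0 (underflow)
        System.out.println(logSum(3e-7, 200));  // roughly -3004
    }
}
```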
There is a probability floor below which statistical noise dominates. Training data is domain-specific (Wikipedia), so overly precise probabilities would overfit. The lite model crops at log-probability -12 (≈ 6.1 × 10⁻⁶) and the full model at -15 (≈ 3.1 × 10⁻⁷). Ngrams and words not present in the model are assigned the floor probability.
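A minimal sketch of how a floor like this could be applied, assuming that "cropping" means dropping table entries below the floor and that lookups fall back to the floor for anything missing. The map-based table, the class name, and the sample log-probabilities are all hypothetical; the library itself uses open-addressing hash tables.

```java
import java.util.HashMap;
import java.util.Map;

final class FloorSketch {
    static final double LITE_FLOOR = -12.0; // lite-model floor quoted in the text

    /** Keeps only entries at or above the floor (assumed meaning of "cropping"). */
    static Map<String, Double> crop(Map<String, Double> raw, double floor) {
        Map<String, Double> kept = new HashMap<>();
        for (Map.Entry<String, Double> e : raw.entrySet()) {
            if (e.getValue() >= floor) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    /** Entries missing from the table score as the floor. */
    static double lookup(Map<String, Double> table, String key, double floor) {
        return table.getOrDefault(key, floor);
    }

    public static void main(String[] args) {
        Map<String, Double> raw = new HashMap<>();
        raw.put("the", -3.5);  // hypothetical value
        raw.put("xyz", -14.2); // hypothetical value, below the lite floor
        Map<String, Double> lite = crop(raw, LITE_FLOOR);
        System.out.println(lookup(lite, "the", LITE_FLOOR)); // -3.5
        System.out.println(lookup(lite, "xyz", LITE_FLOOR)); // -12.0
    }
}
```

Under these assumptions a rare entry and an unseen entry score identically, which is what keeps the lite model smaller without changing how misses are handled.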
For each word in the input:
- ngram scoring: we look up ngrams from 5-grams down to 1-grams in open-addressing hash tables, summing log-probabilities per language. If all tiles of a given ngram size are found (fully covered), we skip smaller sizes as an optimization.
- Topword scoring: the whole word is looked up in a separate topwords table. Single Latin-alphabet characters without accents are excluded, since isolated letters are not language-indicative.
- Apostrophe handling: words like "l'homme" are split at the apostrophe and each part ("l'" and "homme") is looked up separately as a topword. Apostrophes are included in ngrams (e.g. "d'u" is a valid 3-gram), which benefits languages like French, Italian, and English.
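The "fully covered" shortcut in the ngram step can be sketched as follows. This is only an illustration of the control flow: it uses a plain `Set` of known ngrams instead of per-language probability tables, it does not compute any score, and the names are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class CoverageSketch {
    static final int MAX_N = 5; // largest ngram size mentioned in the text

    /** True if every length-n tile of the word is present in the table. */
    static boolean fullyCovered(String word, int n, Set<String> table) {
        if (word.length() < n) {
            return false; // the word has no tiles of this size
        }
        for (int i = 0; i + n <= word.length(); i++) {
            if (!table.contains(word.substring(i, i + n))) {
                return false;
            }
        }
        return true;
    }

    /** Sizes that get looked up, largest first, stopping after the first fully covered size. */
    static List<Integer> sizesEvaluated(String word, Set<String> table) {
        List<Integer> sizes = new ArrayList<>();
        for (int n = Math.min(MAX_N, word.length()); n >= 1; n--) {
            sizes.add(n);
            if (fullyCovered(word, n, table)) {
                break; // smaller sizes are skipped
            }
        }
        return sizes;
    }

    public static void main(String[] args) {
        Set<String> table = new HashSet<>(Arrays.asList("hel", "ell", "llo"));
        System.out.println(sizesEvaluated("hello", table)); // [5, 4, 3]
    }
}
```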
The ngram and topword signals are normalized and blended, with topwords weighted more heavily when topword coverage is high (i.e. when many of the input words have topword hits).
For scripts that uniquely identify a language -- such as Thai, Georgian, Armenian, or Burmese -- detection is immediate based on the script alone, with no ngram lookup required. Ngram data is only loaded for alphabets shared by multiple configured languages (e.g. Latin, Cyrillic, Arabic). These can be added with the "unique_alphabet" alias.
When text contains multiple scripts (e.g. "He likes to say привет"), words are segmented at script boundaries and the predominant alphabet is determined by weighted character count. CJK ideographs are weighted 3× and Korean/Kana 2× to reflect their higher linguistic density per character. Only languages using the predominant alphabet are considered for the final result. For example, "我的名字是Jonathan" detects as Chinese because 5 Han characters at 3× weight (15) outweigh 8 Latin characters at 1× (8).
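The weighting can be approximated with the JDK's `Character.UnicodeScript`. The sketch below is not the library's implementation: it skips word segmentation, counts only letters, and maps "Korean/Kana" to the `HANGUL`, `HIRAGANA` and `KATAKANA` scripts, all of which are assumptions made for the example.

```java
import java.util.EnumMap;
import java.util.Map;

final class ScriptWeightSketch {
    /** Weight per character: 3 for Han ideographs, 2 for Hangul and Kana, 1 otherwise. */
    static int weight(Character.UnicodeScript script) {
        switch (script) {
            case HAN:
                return 3;
            case HANGUL:
            case HIRAGANA:
            case KATAKANA:
                return 2;
            default:
                return 1;
        }
    }

    /** Sums the per-character weights for each script found in the text. */
    static Map<Character.UnicodeScript, Integer> weightedCounts(String text) {
        Map<Character.UnicodeScript, Integer> totals = new EnumMap<>(Character.UnicodeScript.class);
        text.codePoints()
            .filter(Character::isLetter)
            .forEach(cp -> {
                Character.UnicodeScript script = Character.UnicodeScript.of(cp);
                totals.merge(script, weight(script), Integer::sum);
            });
        return totals;
    }

    /** Script with the highest weighted count, or UNKNOWN if the text has no letters. */
    static Character.UnicodeScript predominant(String text) {
        return weightedCounts(text).entrySet().stream()
            .max(Map.Entry.comparingByValue())
            .map(Map.Entry::getKey)
            .orElse(Character.UnicodeScript.UNKNOWN);
    }

    public static void main(String[] args) {
        System.out.println(weightedCounts("我的名字是Jonathan")); // Han 15, Latin 8
        System.out.println(predominant("我的名字是Jonathan"));    // HAN
    }
}
```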
CJK detection is handled by the related CJClassifier library. Chinese and Japanese share the same Unicode ideograph range and don't use spaces between words (with an average "word" length of roughly 1.5 characters), so standard ngram approaches don't work well. CJClassifier uses character unigram and adjacent-character bigram frequencies instead, also trained on Wikipedia data, to distinguish Chinese Simplified, Chinese Traditional, and Japanese.
Korean uses the distinct Hangul script and is identified by alphabet.
A small set of language-independent tokens (e.g. "http", "www") are marked as skipwords and excluded from scoring entirely.
All text is lowercased before scoring. Accented characters are preserved for detection (e.g. "café" retains the accent in both ngram and topword lookups).
We experimented with topword bigrams (e.g. the French sequence "y a" from "il y a") but found the memory cost was not justified by the marginal improvement in aggregate accuracy, even when restricted to bigrams of short words.
Both Bokmål (no) and Nynorsk (nn) are supported. If you only care about the Norwegian
language cluster without distinguishing dialects, configure just Bokmål (no), which has a
4x larger training corpus. The two dialects are similar enough that they cross-detect at some
rate when both are configured.
Afrikaans is very similar to Dutch: it evolved from Dutch dialects spoken by settlers in
Southern Africa, and the two remain largely mutually intelligible. When both are configured,
Afrikaans text will frequently cross-detect as Dutch. If you don't need to distinguish them,
configure only Dutch (nl).
Malay (ms) and Indonesian (id) are closely related standardizations of the same Malay
language. When both are configured, Malay text will frequently cross-detect as Indonesian.
If you don't need to distinguish them, configure only Indonesian (id), which has a larger
training corpus.
We use Croatian (hr) for Latin-script and Serbian (sr) for Cyrillic-script detection.
Bosnian has its own Wikipedia edition, but is statistically so close to Croatian that it
cross-detects heavily (~55% accuracy), so it is not included as a separate language. Montenegrin
does not have its own Wikipedia edition.
When evaluating on Wikipedia text (as opposed to curated test sets), one recurring issue is that articles contain foreign-language text (e.g. a French article quoting English). This means a measured accuracy of, say, 98.8% is typically closer to 100% in practice -- most of the "misses" are genuinely not in the expected language.
A new language can be added if it has a reasonably sized Wikipedia edition.
- Download the Wikipedia dump (e.g. for Nynorsk):
  https://dumps.wikimedia.org/nnwiki/20260201/nnwiki-20260201-pages-articles.xml.bz2
- Extract ngrams and topwords using the provided script:

  ```bash
  python3 scripts/calcngrams.py --alphabet latin --languages nn
  ```
- Reduce to model thresholds using ModelBuilder:

  ```bash
  export INVOKEBUILDER="java -cp tools/target/langidentify-tools-1.0.2.jar \
    com.jlpka.langidentify.tools.ModelBuilder"

  # Lite model (-12/-12)
  $INVOKEBUILDER reducengrams --infile ../wikidata/derived/ngrams-nn.txt \
    --outfile models-lite/src/main/resources/com/jlpka/langidentify/models/lite/ngrams-nn.txt.gz \
    --minlogprob -12.0
  $INVOKEBUILDER reducetopwords --infile ../wikidata/derived/topwords-nn.txt \
    --outfile models-lite/src/main/resources/com/jlpka/langidentify/models/lite/topwords-nn.txt.gz \
    --twminlogprob -12.0

  # Full model (-15/-15)
  $INVOKEBUILDER reducengrams --infile ../wikidata/derived/ngrams-nn.txt \
    --outfile models-full/src/main/resources/com/jlpka/langidentify/models/full/ngrams-nn.txt.gz \
    --minlogprob -15.0
  $INVOKEBUILDER reducetopwords --infile ../wikidata/derived/topwords-nn.txt \
    --outfile models-full/src/main/resources/com/jlpka/langidentify/models/full/topwords-nn.txt.gz \
    --twminlogprob -15.0
  ```
- Add the language enum in `Language.java` if it doesn't already exist, and rebuild.
A Rust implementation is available in the rust/langidentify/ directory.
It uses the same model data files and produces equivalent detection results. See the
Rust README for full documentation.
```rust
use langidentify::{Language, Model, Detector};
use std::sync::Arc;

let languages = Language::from_comma_separated("en,fr,de,es,it").unwrap();
let model = Arc::new(Model::load_lite(&languages).unwrap());
let mut detector = Detector::new(model);
assert_eq!(Language::French, detector.detect("Bonjour le monde"));
```

Detection speed and memory usage are roughly the same as Java: around 6% faster in some benchmarks, with a comparable memory footprint.
The langidentify-ffi crate provides a C-compatible shared/static library for use from
C, C++, or any language with a C FFI. See the
FFI README for the full API reference,
compiling/linking instructions, and a working example at
rust/eval/src/useffi.c.
A pure Python implementation is available in the python/ directory. It uses
the same model data files and produces equivalent detection results. See the
Python README for full documentation.
As a pure-Python implementation with no native extensions, detection is roughly 14× slower than Java/Rust at ~5,500 ns/word (lite, 10 languages). Memory usage is significantly higher — ~195 MB vs ~17 MB (lite, 10 languages) — due to per-object overhead in CPython's dict and float representations. For latency-sensitive or memory-constrained Python applications, consider the Rust FFI bindings.
```bash
cd python
make models    # copy model data from the Java project
pip install .
```

```python
from langidentify import Detector, Model, Language

languages = Language.from_comma_separated("en,fr,de,es,it")
model = Model.load(languages)
detector = Detector(model)
lang = detector.detect("Bonjour le monde")
print(lang)           # Language.FRENCH
print(lang.iso_code)  # fr
```

```
langidentify-parent
  core/               langidentify-lib           Core detection library (Java)
  models-lite/        langidentify-models-lite   Bundled lite model data
  models-full/        langidentify-models-full   Bundled full model data
  tools/              langidentify-tools         Evaluation and model building tools (Java)
  python/             langidentify               Pure Python port
  rust/langidentify/                             Rust port (core library + model crates + FFI)
  rust/eval/                                     Rust benchmarking/evaluation tools + C FFI example
```
Loading a Model is the expensive step — it decompresses and indexes the model data from the
bundled JAR resources. For the lite model with 10 languages this takes roughly 0.1 seconds and
~17 MB of resident memory; with 28 languages, ~0.4 seconds and ~60 MB. Once loaded, the model
is cached as a static singleton, so subsequent calls to Model.loadLite() with the same
language set return immediately without reloading.
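For readers curious how a load-once cache of this kind is commonly built, here is a generic sketch using `ConcurrentHashMap.computeIfAbsent`. It is not taken from the LangIdentify source: the class name, the string key, and the choice to make the key order-insensitive are all assumptions made for illustration.

```java
import java.util.Map;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

final class ModelCacheSketch<M> {
    private final Map<String, M> cache = new ConcurrentHashMap<>();
    private final Function<String, M> loader; // the expensive load step

    ModelCacheSketch(Function<String, M> loader) {
        this.loader = loader;
    }

    /** Order-insensitive key, so "fr,en" and "en,fr" share one cached entry. */
    static String key(Iterable<String> isoCodes) {
        TreeSet<String> sorted = new TreeSet<>();
        for (String code : isoCodes) {
            sorted.add(code);
        }
        return String.join(",", sorted);
    }

    /** Runs the loader at most once per distinct language set. */
    M get(Iterable<String> isoCodes) {
        return cache.computeIfAbsent(key(isoCodes), loader);
    }
}
```

With a cache like this, the first `get` for a language set pays the loading cost and every later call with the same set returns the stored instance.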
Creating a Detector is cheap — it just allocates a small set of scoring arrays against the
already-loaded model. However, Detector is intentionally not thread-safe (it reuses
internal buffers across calls for performance). Use one Detector per thread or per class
instance — there's no need to create a new one for every detection call, but don't share one
across threads:
```java
Model model = Model.loadLite(languages); // expensive once, then cached
ThreadLocal<Detector> detector = ThreadLocal.withInitial(() -> new Detector(model));

// In each thread:
Language lang = detector.get().detect(text);
```

```bash
mvn clean package
```

This produces:

- `core/target/langidentify-lib-1.0.2.jar` -- the core library
- `models-lite/target/langidentify-models-lite-1.0.2.jar` -- bundled lite model data
- `models-full/target/langidentify-models-full-1.0.2.jar` -- bundled full model data
- `tools/target/langidentify-tools-1.0.2.jar` -- uber-JAR for evaluation and model building
To run tests:
```bash
mvn test
```

- Java 11+
Contributions are welcome! Please open an issue or pull request at github.com/jlpka/langidentify.
Before submitting a PR, make sure all tests pass:
```bash
mvn test
```

- Author: Jeremy Lilley
- GitHub: github.com/jlpka/langidentify
- Email: jeremy@jlilley.net
Apache License 2.0 -- see LICENSE.
The bundled models contain statistical parameters derived from Wikipedia text. The models do not contain or reproduce Wikipedia text.