Skip to content

jlpka/langidentify

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

12 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

LangIdentify

A fast, accurate language detection library available for Java, Rust, Python, and C/C++ (via Rust FFI).

LangIdentify detects the language of text using a combination of ngram frequency analysis and whole-word ("topwords") frequency signals, both trained on the Wikipedia corpus. It supports 80+ languages across Latin, Cyrillic, Arabic, CJK, and many other scripts. It runs entirely offline with no network calls.

All implementations use the same model data files and produce equivalent detection results.

Why LangIdentify?

Most language detection libraries rely solely on character ngram models. While ngrams are an excellent primary signal, they struggle with short or ambiguous text. Consider:

  • "was it Jimmy?" (English) vs. "was ist Jimmy?" (German) -- a single character difference
  • "Where is Oberammergau?" -- clearly English, even though most ngrams look German

LangIdentify augments ngram scoring with a topwords signal that identifies common whole words from each language. This was the original motivation for writing the library: we needed higher accuracy on short sentences than existing libraries could provide.

Design goals

  • Accuracy -- blended ngram + topwords scoring, especially effective on short text
  • Speed -- open-addressing hash tables, zero allocations in the detection path, fixed arrays
  • Low memory -- the 28-language europe_common model is ~60 MB (lite) or ~306 MB (full) in Java; load only the languages you need
  • Extensible -- adding a new language is straightforward if it has a reasonably sized Wikipedia edition

Quick start (Java)

For Rust, see the Rust README. For Python, see the Python README.

Maven dependency

<!-- Core detection library -->
<dependency>
    <groupId>com.jlpka.langidentify</groupId>
    <artifactId>langidentify-lib</artifactId>
    <version>1.0.2</version>
</dependency>

<!-- Bundled model data (choose one) -->
<dependency>
    <groupId>com.jlpka.langidentify</groupId>
    <artifactId>langidentify-models-lite</artifactId>
    <version>1.0.2</version>
</dependency>
<!-- or: langidentify-models-full for higher accuracy at more memory cost -->

Basic usage

import com.jlpka.langidentify.*;
import java.util.List;

// Load the lite model for the languages you care about (throws IOException).
List<Language> languages = Language.fromCommaSeparated("en,fr,de,es,it");
Model model = Model.loadLite(languages);

// Create a detector (lightweight, not thread-safe -- use one per thread).
Detector detector = new Detector(model);

// Detect.
Language lang = detector.detect("Bonjour le monde");
System.out.println(lang);           // FRENCH
System.out.println(lang.isoCode()); // fr

Inspecting results

After detection, detector.results() provides scoring details:

detector.detect("The quick brown fox");
Detector.Results results = detector.results();
System.out.println(results.result);  // ENGLISH
System.out.println(results.gap);     // confidence gap (0.0 = close, 1.0 = decisive)
System.out.println(results);         // full per-language score breakdown

Incremental detection

For streaming or multi-part text, use the addText API:

detector.clearScores();
detector.addText("Bonjour");
detector.addText(" le monde");
Language result = detector.computeResult();  // FRENCH

This also supports char[] and Reader inputs.

Language boosts

When you have prior context (e.g. an HTTP Accept-Language header or user locale), you can bias detection toward expected languages:

double[] frenchBoost = model.buildBoostArray(Language.FRENCH, 0.08);
Language lang = detector.detect("message", frenchBoost);  // FRENCH
// Without the boost, "message" is ambiguous between English and French.

Choosing languages

Try to only configure the languages you actually need. Each additional language increases model loading time, memory usage, and detection latency. More importantly, closely related languages can cross-detect on very short phrases -- for example, adding Luxembourgish when you only need German may cause short German phrases to be misidentified.

In addition to being able to specify a list of languages, LangIdentify provides group aliases for convenience:

Alias Languages
efigs English, French, Italian, German, Spanish
efigsnp EFIGS + Dutch, Portuguese
nordic Danish, Swedish, Norwegian, Finnish
cjk Chinese (Simplified), Chinese (Traditional), Japanese, Korean
europe_west_common EFIGSNP + Nordic
europe_east_latin Albanian, Croatian, Czech, Estonian, Hungarian, Latvian, Lithuanian, Polish, Romanian, Slovak, Slovenian
europe_cyrillic Belarusian, Bulgarian, Macedonian, Russian, Serbian, Ukrainian
europe_common Western + Eastern European + Cyrillic
europe_latin All European Latin-script languages
europe All European languages (Latin + Cyrillic)
latin_alphabet All Latin-script languages
cyrillic_alphabet All Cyrillic-script languages
arabic_alphabet Arabic, Pashto, Persian, Urdu
unique_alphabet Languages where the script implies the language (Thai, Greek, Armenian, Georgian, etc.)
all All 84 languages
List<Language> langs = Language.fromCommaSeparated("europe_west_common,cjk");

Note that languages trained on smaller Wikipedia corpora may be less accurate.

Lite vs. full model

Both models are trained from the same Wikipedia data but cropped at different probability floors:

Lite Full
Log-probability floor -12 (≈ 6.1 × 10⁻⁶) -15 (≈ 3.1 × 10⁻⁷)
Memory (28 langs) ~60 MB ~306 MB
Best for Most use cases; good accuracy/memory balance Maximum accuracy when memory is not a concern
Model lite = Model.loadLite(languages);  // recommended default
Model full = Model.loadFull(languages);  // when you need every last bit of accuracy

Accuracy comparison

LangIdentify was benchmarked against two other well-known Java detection libraries: Lingua and Shuyo LangDetect (optimaize fork). Test data is from Lingua's accuracy report corpus.

Sentences (Lingua test corpus, all supported languages loaded)

Each library was loaded with all of its supported languages (LangIdentify: 84, Shuyo: 70, Lingua: 75) and evaluated on 10 European language test sets (1,000 sentences each).

Language LangIdentify (full) LangIdentify (lite) Lingua Shuyo LangDetect
English 100.0% 99.9% 99.1% 99.3%
French 99.8% 99.7% 98.8% 99.0%
German 99.9% 99.8% 99.7% 99.8%
Danish 99.6% 98.9% 97.8% 94.3%
Finnish 100.0% 100.0% 100.0% 99.9%
Italian 100.0% 99.9% 99.7% 99.2%
Spanish 99.7% 99.4% 96.7% 97.3%
Portuguese 99.8% 99.8% 97.9% 98.8%
Dutch 100.0% 100.0% 96.2% 97.0%
Swedish 99.4% 98.8% 98.7% 96.3%

Word pairs (Lingua test corpus, all supported languages loaded) -- where short-text accuracy matters most

Language LangIdentify (full) LangIdentify (lite) Lingua Shuyo LangDetect
English 94.3% 91.4% 88.6% 57.7%
French 96.1% 93.6% 94.5% 78.6%
German 94.6% 90.8% 94.1% 72.7%
Danish 84.9% 80.6% 83.9% 69.0%
Finnish 98.8% 97.9% 98.0% 95.5%
Italian 95.9% 93.5% 91.9% 81.2%
Spanish 79.2% 76.1% 68.7% 43.5%
Portuguese 88.5% 83.4% 85.3% 58.3%
Dutch 83.2% 75.5% 80.7% 49.6%
Swedish 91.2% 82.8% 88.6% 66.5%

LangIdentify wins all 10 languages. The advantage is most pronounced on short text, where the topwords signal makes the biggest difference. Note that word-pair accuracy drops for all libraries when the full language set is loaded, since two-word phrases are inherently ambiguous and more candidate languages increase the chance of a false match. Shuyo's percentages are somewhat inflated because it skips phrases it cannot classify (e.g. 191 of 1,000 Spanish word pairs), while LangIdentify and Lingua always produce a result.

Word pairs with narrowed language set (10 languages loaded)

If you know the likely languages in advance, configuring only those languages substantially improves short-text accuracy. The table below shows LangIdentify word-pair results with only the 10 test languages loaded, compared to all 84:

Language Full (10 langs) Full (84 langs) Lite (10 langs) Lite (84 langs)
English 97.8% 83.4% 95.4% 78.9%
French 97.7% 95.9% 95.8% 92.9%
German 96.8% 94.9% 94.0% 91.2%
Danish 95.9% 84.9% 93.9% 80.2%
Finnish 99.4% 98.8% 98.7% 97.9%
Italian 97.8% 96.4% 95.5% 93.7%
Spanish 84.6% 78.9% 82.8% 75.7%
Portuguese 90.5% 88.6% 86.5% 83.3%
Dutch 91.8% 84.7% 88.6% 77.3%
Swedish 94.5% 91.1% 90.1% 82.7%

Narrowing from 84 to 10 languages improves overall word-pair accuracy from 89.8% to 94.7% (full model). The gain is largest for languages that share vocabulary with many others -- English jumps from 83.4% to 97.8%, and Danish from 84.9% to 95.9%. For applications processing very short text, configuring only the expected languages is one of the most effective ways to improve accuracy.

Speed comparison

Detection throughput was benchmarked on the same 10 European language sentence corpus (10,000 phrases). Each library was tested in two configurations: with only the 10 test languages loaded, and with all supported languages loaded (which increases per-phrase work since every language must be scored).

10 languages loaded

Library Mwords/s ns/word
LangIdentify (lite) 3.02 331
LangIdentify (full) 2.07 484
Shuyo LangDetect 1.03 969
Lingua 0.17 6,016

All supported languages loaded

Library Mwords/s ns/word Languages
LangIdentify (lite) 1.29 774 84
LangIdentify (full) 0.87 1,153 84
Shuyo LangDetect 0.34 2,933 70
Lingua 0.04 25,082 75

LangIdentify lite with all 84 languages loaded is still faster than Shuyo with only 10 languages. The relative performance gap widens as more languages are added, since LangIdentify's open-addressing hash tables and fixed-array scoring scale more efficiently than the alternatives. LangIdentify's hot loop operates on char[] primitives and avoids heap allocations.

Benchmarks were run single-threaded on a MacBook Air M4. Absolute throughput will vary by machine; relative comparisons between libraries are the more useful metric.

How it works

Signals

LangIdentify combines two statistical signals, both derived from Wikipedia:

  1. ngrams -- character subsequences extracted from each word. For example, "hello" yields the 3-grams "hel", "ell", "llo". The relative frequencies of these ngrams differ across languages and form the primary detection signal. We typically evaluate 5-grams down to 1-grams, stopping at 3-grams if the word is fully covered.

  2. Topwords -- whole-word frequencies for common words like "the", "what", "vous", "ist". This signal is critical for short phrases where ngrams alone are ambiguous. For example, "was ist..." vs. "was it..." differ by a single character -- word frequencies make the distinction clear.

Probability model

For each ngram and topword, we compute per-language log-probabilities from Wikipedia frequency data. We use log-space because raw probabilities are extremely small numbers (the product of many small per-token probabilities). For instance, a probability of 0.00003% becomes log(3 × 10⁻⁷) ≈ -15. In log-space, multiplication becomes addition, which is both faster and avoids floating-point underflow.

There is a probability floor below which statistical noise dominates. Training data is domain-specific (Wikipedia), so overly precise probabilities would overfit. The lite model crops at log-probability -12 (≈ 6.1 × 10⁻⁶) and the full model at -15 (≈ 3.1 × 10⁻⁷). Ngrams and words not present in the model are assigned the floor probability.

Scoring

For each word in the input:

  • ngram scoring: we look up ngrams from 5-grams down to 1-grams in open-addressing hash tables, summing log-probabilities per language. If all tiles of a given ngram size are found (fully covered), we skip smaller sizes as an optimization.
  • Topword scoring: the whole word is looked up in a separate topwords table. Single Latin-alphabet characters without accents are excluded, since isolated letters are not language-indicative.
  • Apostrophe handling: words like "l'homme" are split at the apostrophe and each part ("l'" and "homme") is looked up separately as a topword. Apostrophes are included in ngrams (e.g. "d'u" is a valid 3-gram), which benefits languages like French, Italian, and English.

The ngram and topword signals are normalized and blended, with topwords weighted more heavily when topword coverage is high (i.e. when many of the input words have topword hits).

Alphabet-based detection

For scripts that uniquely identify a language -- such as Thai, Georgian, Armenian, or Burmese -- detection is immediate based on the script alone, with no ngram lookup required. Ngram data is only loaded for alphabets shared by multiple configured languages (e.g. Latin, Cyrillic, Arabic). These can be added with the "unique_alphabet" alias.

When text contains multiple scripts (e.g. "He likes to say привет"), words are segmented at script boundaries and the predominant alphabet is determined by weighted character count. CJK ideographs are weighted 3× and Korean/Kana 2× to reflect their higher linguistic density per character. Only languages using the predominant alphabet are considered for the final result. For example, "我的名字是Jonathan" detects as Chinese because 4 HAN characters at 3× weight outweigh 8 Latin characters at 1×.

Chinese, Japanese, and Korean

CJK detection is handled by the related CJClassifier library. Chinese and Japanese share the same Unicode ideograph range and don't use spaces between words (with an average "word" length of roughly 1.5 characters), so standard ngram approaches don't work well. CJClassifier uses character unigram and adjacent-character bigram frequencies instead, also trained on Wikipedia data, to distinguish Chinese Simplified, Chinese Traditional, and Japanese.

Korean uses the distinct Hangul script and is identified by alphabet.

Skipwords

A small set of language-independent tokens (e.g. "http", "www") are marked as skipwords and excluded from scoring entirely.

Case and accents

All text is lowercased before scoring. Accented characters are preserved for detection (e.g. "café" retains the accent in both ngram and topword lookups).

What we tried and didn't keep

We experimented with topword bigrams (e.g. the French sequence "y a" from "il y a") but found the memory cost was not justified by the marginal improvement in aggregate accuracy, even when restricted to bigrams of short words.

Language-specific notes

Norwegian dialects

Both Bokmål (no) and Nynorsk (nn) are supported. If you only care about the Norwegian language cluster without distinguishing dialects, configure just Bokmål (no), which has a 4x larger training corpus. The two dialects are similar enough that they cross-detect at some rate when both are configured.

Afrikaans and Dutch

Afrikaans is very similar to Dutch — Afrikaans evolved from Dutch dialects spoken by settlers in Southern Africa and the two remain largely mutually intelligible. When both are configured, Afrikaans text will frequently cross-detect as Dutch. If you don't need to distinguish them, configure only Dutch (nl).

Malay and Indonesian

Malay (ms) and Indonesian (id) are closely related standardizations of the same Malay language. When both are configured, Malay text will frequently cross-detect as Indonesian. If you don't need to distinguish them, configure only Indonesian (id), which has a larger training corpus.

Serbo-Croatian

We use Croatian (hr) for Latin-script and Serbian (sr) for Cyrillic-script detection. Bosnian has its own Wikipedia edition, but is statistically so close to Croatian that it cross-detects heavily (~55% accuracy), so it is not included as a separate language. Montenegrin does not have its own Wikipedia edition.

Wikipedia evaluation caveats

When evaluating on Wikipedia text (as opposed to curated test sets), one recurring issue is that articles contain foreign-language text (e.g. a French article quoting English). This means a measured accuracy of, say, 98.8% is typically closer to 100% in practice -- most of the "misses" are genuinely not in the expected language.

Adding a new language

A new language can be added if it has a reasonably sized Wikipedia edition.

  1. Download the Wikipedia dump (e.g. for Nynorsk):

    https://dumps.wikimedia.org/nnwiki/20260201/nnwiki-20260201-pages-articles.xml.bz2
    
  2. Extract ngrams and topwords using the provided script:

    python3 scripts/calcngrams.py --alphabet latin --languages nn
  3. Reduce to model thresholds using ModelBuilder:

    export INVOKEBUILDER="java -cp tools/target/langidentify-tools-1.0.2.jar \
        com.jlpka.langidentify.tools.ModelBuilder"
    
    # Lite model (-12/-12)
    $INVOKEBUILDER reducengrams --infile ../wikidata/derived/ngrams-nn.txt \
        --outfile models-lite/src/main/resources/com/jlpka/langidentify/models/lite/ngrams-nn.txt.gz \
        --minlogprob -12.0
    $INVOKEBUILDER reducetopwords --infile ../wikidata/derived/topwords-nn.txt \
        --outfile models-lite/src/main/resources/com/jlpka/langidentify/models/lite/topwords-nn.txt.gz \
        --twminlogprob -12.0
    
    # Full model (-15/-15)
    $INVOKEBUILDER reducengrams --infile ../wikidata/derived/ngrams-nn.txt \
        --outfile models-full/src/main/resources/com/jlpka/langidentify/models/full/ngrams-nn.txt.gz \
        --minlogprob -15.0
    $INVOKEBUILDER reducetopwords --infile ../wikidata/derived/topwords-nn.txt \
        --outfile models-full/src/main/resources/com/jlpka/langidentify/models/full/topwords-nn.txt.gz \
        --twminlogprob -15.0
  4. Add the language enum in Language.java if it doesn't already exist, and rebuild.

Rust port

A Rust implementation is available in the rust/langidentify/ directory. It uses the same model data files and produces equivalent detection results. See the Rust README for full documentation.

Quick start (Rust)

use langidentify::{Language, Model, Detector};
use std::sync::Arc;

let languages = Language::from_comma_separated("en,fr,de,es,it").unwrap();
let model = Arc::new(Model::load_lite(&languages).unwrap());
let mut detector = Detector::new(model);

assert_eq!(Language::French, detector.detect("Bonjour le monde"));

Performance relative to Java

Detection speed and memory usage are roughly the same as Java — around 6% faster in some benchmarks, with comparable memory footprint.

C/C++ FFI

The langidentify-ffi crate provides a C-compatible shared/static library for use from C, C++, or any language with a C FFI. See the FFI README for the full API reference, compiling/linking instructions, and a working example at rust/eval/src/useffi.c.

Python port

A pure Python implementation is available in the python/ directory. It uses the same model data files and produces equivalent detection results. See the Python README for full documentation.

Performance relative to Java

As a pure-Python implementation with no native extensions, detection is roughly 14× slower than Java/Rust at ~5,500 ns/word (lite, 10 languages). Memory usage is significantly higher — ~195 MB vs ~17 MB (lite, 10 languages) — due to per-object overhead in CPython's dict and float representations. For latency-sensitive or memory-constrained Python applications, consider the Rust FFI bindings.

Quick start (Python)

cd python
make models      # copy model data from the Java project
pip install .
from langidentify import Detector, Model, Language

languages = Language.from_comma_separated("en,fr,de,es,it")
model = Model.load(languages)
detector = Detector(model)

lang = detector.detect("Bonjour le monde")
print(lang)            # Language.FRENCH
print(lang.iso_code)   # fr

Project structure

langidentify-parent
  core/        langidentify-lib         Core detection library (Java)
  models-lite/ langidentify-models-lite  Bundled lite model data
  models-full/ langidentify-models-full  Bundled full model data
  tools/       langidentify-tools       Evaluation and model building tools (Java)
  python/      langidentify             Pure Python port
  rust/langidentify/                    Rust port (core library + model crates + FFI)
  rust/eval/                            Rust benchmarking/evaluation tools + C FFI example

Model loading and thread safety

Loading a Model is the expensive step — it decompresses and indexes the model data from the bundled JAR resources. For the lite model with 10 languages this takes roughly 0.1 seconds and ~17 MB of resident memory; with 28 languages, ~0.4 seconds and ~60 MB. Once loaded, the model is cached as a static singleton, so subsequent calls to Model.loadLite() with the same language set return immediately without reloading.

Creating a Detector is cheap — it just allocates a small set of scoring arrays against the already-loaded model. However, Detector is intentionally not thread-safe (it reuses internal buffers across calls for performance). Use one Detector per thread or per class instance — there's no need to create a new one for every detection call, but don't share one across threads:

Model model = Model.loadLite(languages);  // expensive once, then cached
ThreadLocal<Detector> detector = ThreadLocal.withInitial(() -> new Detector(model));

// In each thread:
Language lang = detector.get().detect(text);

Building from source

mvn clean package

This produces:

  • core/target/langidentify-lib-1.0.2.jar -- the core library
  • models-lite/target/langidentify-models-lite-1.0.2.jar -- bundled lite model data
  • models-full/target/langidentify-models-full-1.0.2.jar -- bundled full model data
  • tools/target/langidentify-tools-1.0.2.jar -- uber-JAR for evaluation and model building

To run tests:

mvn test

Requirements

  • Java 11+

Contributing

Contributions are welcome! Please open an issue or pull request at github.com/jlpka/langidentify.

Before submitting a PR, make sure all tests pass:

mvn test

Contact

License

Apache License 2.0 -- see LICENSE.

The bundled models contain statistical parameters derived from Wikipedia text. The models do not contain or reproduce Wikipedia text.

About

LangIdentify is a fast, high-accuracy language detection library for Java. It combines ngram classification with a "topwords" signal that boosts accuracy on short or ambiguous text. Models were trained on the Wikipedia corpus and cover 80+ languages.

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors