Standardize language option #79

@rth

From #78 (comment) by @joshlk

I think it would be sensible to identify languages throughout the package using ISO 639-1 two-letter codes (e.g. en, fr, de ...).

In particular, we should implement this for the Snowball stemmer in Python, which currently uses the full language names.
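To make the code-to-name translation concrete, here is a minimal sketch of a lookup from two-letter codes to the full language names that Snowball algorithms are conventionally called by. The function name `snowball_name` is hypothetical, and the list of languages is illustrative only; it is not a statement of what vtext supports.

```rust
/// Hypothetical helper (not part of vtext): map an ISO 639-1 code to the
/// full language name conventionally used for Snowball stemmers.
/// Returns `None` for codes outside this illustrative list.
pub fn snowball_name(code: &str) -> Option<&'static str> {
    // Accept "EN", "En", "en", ... by normalizing to lowercase first.
    match code.to_ascii_lowercase().as_str() {
        "da" => Some("danish"),
        "de" => Some("german"),
        "en" => Some("english"),
        "es" => Some("spanish"),
        "fr" => Some("french"),
        "it" => Some("italian"),
        "nl" => Some("dutch"),
        "pt" => Some("portuguese"),
        "ru" => Some("russian"),
        "sv" => Some("swedish"),
        _ => None,
    }
}
```

With something like this in place, the public API could accept the short code while the stemmer backend keeps using the full name internally.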

I am also wondering whether, in Rust, we should use a String for the language parameter or define an enum, e.g.

use vtext::lang;

let stemmer = SnowballStemmerParams::default().lang(lang::en).build();

The latter is probably simpler, but it is a bit harder to extend: if someone designs a custom estimator for a language not in the list (e.g. some ancient, rarely used language), they would have to create a new enum.
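One possible middle ground, sketched here purely as an assumption and not as an agreed design, is an enum with a catch-all variant that carries an arbitrary code. The type `Lang`, its variants, and the `code` method are all hypothetical names introduced for illustration.

```rust
use std::convert::Infallible;
use std::str::FromStr;

/// Hypothetical language identifier: a few built-in variants plus an
/// escape hatch for languages that are not in the list.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Lang {
    En,
    Fr,
    De,
    /// Any other language, identified by a user-supplied code.
    Other(String),
}

impl Lang {
    /// Return the two-letter code (or the custom code for `Other`).
    pub fn code(&self) -> &str {
        match self {
            Lang::En => "en",
            Lang::Fr => "fr",
            Lang::De => "de",
            Lang::Other(code) => code.as_str(),
        }
    }
}

impl FromStr for Lang {
    // Parsing never fails: unknown codes fall through to `Other`.
    type Err = Infallible;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        Ok(match s.to_ascii_lowercase().as_str() {
            "en" => Lang::En,
            "fr" => Lang::Fr,
            "de" => Lang::De,
            other => Lang::Other(other.to_string()),
        })
    }
}
```

The trade-off is that `Other` gives up some of the compile-time checking a closed enum provides, since any string is accepted there, but a custom estimator can then be keyed on a new code without defining a separate enum.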

Also, just to be consistent, the parameter name would be "lang", not "language", right?
