stem, a stemming algorithm in OCaml

A stemming algorithm attempts to reduce each word to its root, or "stem". This library lets you tokenize a document and apply the stemming algorithm to the resulting tokens, which are treated as words. It then counts how often each stem occurs and produces a CSV document mapping the stems to their frequencies.

The purpose of stemming is to treat several word forms (such as the French "tout", "toutes", and "tous", all forms of "all") as a single root. The resulting stems and their frequencies then reflect the information the document is trying to convey better than raw word counts do. The idea is to enable document indexing based on these stems.
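To make the tokenize, stem, count, and CSV steps concrete, here is a minimal, self-contained sketch. It does not use the stem library's API: `tokenize`, `toy_stem`, and `frequencies` are hypothetical names, the tokenizer is a naive whitespace split, and the "stemmer" is a hand-written lookup table standing in for a real algorithm.

```ocaml
(* Naive tokenizer: split on spaces and newlines, drop empty tokens. *)
let tokenize (text : string) : string list =
  String.split_on_char ' ' text
  |> List.concat_map (String.split_on_char '\n')
  |> List.filter (fun w -> w <> "")

(* Hypothetical stand-in for a real stemmer: a hand-written table.
   This is NOT the algorithm implemented by stem. *)
let toy_stem : string -> string = function
  | "toutes" | "tous" -> "tout"
  | w -> w

(* Count stems and return (stem, count) pairs, most frequent first;
   ties are broken alphabetically so the order is deterministic. *)
let frequencies (stem : string -> string) (tokens : string list)
    : (string * int) list =
  let table = Hashtbl.create 16 in
  List.iter
    (fun tok ->
      let s = stem tok in
      let n = Option.value (Hashtbl.find_opt table s) ~default:0 in
      Hashtbl.replace table s (n + 1))
    tokens;
  Hashtbl.fold (fun s n acc -> (s, n) :: acc) table []
  |> List.sort (fun (s1, n1) (s2, n2) ->
         match compare n2 n1 with 0 -> compare s1 s2 | c -> c)

(* Print one CSV line per stem, with the stem as a quoted string. *)
let () =
  tokenize "tout est bien toutes sont bien tous"
  |> frequencies toy_stem
  |> List.iter (fun (s, n) -> Printf.printf "%S,%d\n" s n)
```

Because "tout", "toutes", and "tous" collapse to one stem, they are counted together instead of as three unrelated words.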

How to install it and use it?

stem is a package available through OPAM. It provides two tools: stemmer and stem.ts. The latter allows you to specify multiple tokenizers, the language, and the way the result is displayed in CSV format:

$ opam install stem
$ stem.ts -l french -a bert:remove -a whitespace:remove file.txt
"est",14                             
"son",13
"tout",11
"Julien",11
"plus",9
"trouv",8
"dan",8
"bien",7
"m\195\170m",7
...
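
The stems in the output above appear as quoted strings with decimal byte escapes: `\195\170` is the UTF-8 encoding of "ê", so `"m\195\170m"` is the stem "mêm". Assuming the first field always follows OCaml's string-literal conventions (which these escapes suggest, but which is an assumption here), a line can be read back with the standard `Scanf` module. `parse_line` is a hypothetical helper, not part of stem:

```ocaml
(* Parse one output line such as "m\195\170m",7 into (stem, count).
   Assumption: the stem is written as an OCaml-style quoted string,
   so the %S conversion can decode its escapes. *)
let parse_line (line : string) : (string * int) option =
  try Scanf.sscanf line "%S,%d" (fun stem count -> Some (stem, count))
  with Scanf.Scan_failure _ | Failure _ | End_of_file -> None

let () =
  (* {|...|} is a raw string, so the backslashes reach the parser as-is. *)
  match parse_line {|"m\195\170m",7|} with
  | Some (stem, count) -> Printf.printf "%s -> %d\n" stem count
  | None -> prerr_endline "unparsable line"
```

Lines that do not match the expected shape yield `None` rather than raising an exception.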

