Details about the algorithm can be found in I. Segalovich «A fast morphological algorithm with unknown word guessing induced by a dictionary for a web search engine», MLMTA-2003.
The wrapper itself is MIT-licensed, but please remember that Yandex.MyStem is not open source and is licensed under the Yandex MyStem licensing terms.
- Java 8 or later (the artifacts are compiled with target=1.8).
- Linux, macOS, or Windows. The wrapper downloads the appropriate mystem binary the first time it is needed; you can also pass the path to a pre-installed binary explicitly.
The artifact is cross-built for Scala 2.12 and Scala 2.13.
TODO: deploy recent improvements
<!-- For Scala 2.13 -->
<dependency>
  <groupId>ru.stachek66.nlp</groupId>
  <artifactId>mystem-scala_2.13</artifactId>
  <version>0.3.1</version>
</dependency>
<!-- For Scala 2.12 -->
<dependency>
  <groupId>ru.stachek66.nlp</groupId>
  <artifactId>mystem-scala_2.12</artifactId>
  <version>0.3.1</version>
</dependency>

libraryDependencies += "ru.stachek66.nlp" %% "mystem-scala" % "0.3.0"

- mystem versions 3.0 and 3.1 are supported.
- 0.3.0 modernizes the build (Scala 2.12 / 2.13 cross-build, Java 8 target, publishing via the Sonatype Central Portal) and fixes the long-standing process-leak issue (#3). See CHANGELOG below.
- Please open issues for compatibility troubles or other requests.
Important: keep one MyStem instance per mystem binary in your application. The instance owns a long-lived child process; constructing and discarding many instances will spawn many processes.
import java.io.File
import ru.stachek66.nlp.mystem.holding.{Factory, MyStem, Request}

object Example {
  // Construct once, reuse, close on shutdown.
  val mystem: MyStem =
    new Factory("-igd --eng-gr --format json --weight")
      .newMyStem("3.0", Option(new File("/path/to/mystem"))) // or None to auto-download
      .get

  def main(args: Array[String]): Unit = try {
    mystem
      .analyze(Request("Есть большие пассажиры мандариновой травы"))
      .info
      .foreach(info => println(info.initial + " -> " + info.lex))
  } finally mystem.close() // releases the child OS process; idempotent
}

MyStem extends AutoCloseable, so you can also use it with any Using / try-with-resources style helper (scala.util.Using is available from Scala 2.13 onwards):
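import scala.util.Using
import ru.stachek66.nlp.mystem.holding.{Factory, Request}

// The process is released when the block exits, normally or with an exception.
Using.resource(new Factory().newMyStem("3.0").get) { mystem =>
  mystem.analyze(Request("Привет, мир!")).info.foreach(println)
}

A JVM shutdown hook is also installed automatically as a safety net for code paths that forget to call close().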
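The combination of an idempotent close() and a shutdown-hook fallback can be sketched in plain Java. This is only an illustration of the pattern, not the wrapper's actual implementation: the class name GuardedResource and the Runnable-based cleanup are hypothetical, chosen so the example runs without a mystem binary. In a real process holder the cleanup action would typically be something like process::destroy.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical holder (not part of mystem-scala): idempotent close() plus a shutdown-hook fallback. */
final class GuardedResource implements AutoCloseable {
    private final Runnable cleanup;
    private final AtomicBoolean closed = new AtomicBoolean(false);
    private final Thread hook;

    GuardedResource(Runnable cleanup) {
        this.cleanup = cleanup;
        // Safety net: if nobody calls close(), the JVM runs the cleanup on exit.
        this.hook = new Thread(this::releaseOnce, "guarded-resource-shutdown");
        Runtime.getRuntime().addShutdownHook(hook);
    }

    /** Runs the cleanup at most once, whoever gets here first (close() or the hook). */
    private void releaseOnce() {
        if (closed.compareAndSet(false, true)) {
            cleanup.run();
        }
    }

    @Override
    public void close() {
        releaseOnce();
        try {
            // Unregister the hook so closed instances do not accumulate until JVM exit.
            Runtime.getRuntime().removeShutdownHook(hook);
        } catch (IllegalStateException shutdownInProgress) {
            // The JVM is already shutting down; nothing left to unregister.
        }
    }

    boolean isClosed() {
        return closed.get();
    }
}
```

Because the class implements AutoCloseable, it works in a try-with-resources statement, and calling close() a second time is a no-op.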
scala.collection.JavaConversions was removed in Scala 2.13. The wrapper
now exposes a Java-friendly accessor on Response, so Java callers do not
need any Scala collection conversions:
import ru.stachek66.nlp.mystem.holding.Factory;
import ru.stachek66.nlp.mystem.holding.MyStem;
import ru.stachek66.nlp.mystem.holding.MyStemApplicationException;
import ru.stachek66.nlp.mystem.holding.Request;
import ru.stachek66.nlp.mystem.model.Info;
import scala.Option;

import java.io.File;
import java.util.List;

public final class MyStemJavaExample {

    // Construct once and reuse; Option.empty() lets the wrapper download the binary.
    private static final MyStem MYSTEM =
        new Factory("-igd --eng-gr --format json --weight")
            .newMyStem("3.0", Option.<File>empty())
            .get();

    public static void main(String[] args) throws MyStemApplicationException {
        try {
            List<Info> result =
                MYSTEM.analyze(Request.apply("И вырвал грешный мой язык")).getInfoAsList();
            for (Info info : result) {
                System.out.println(info.initial() + " -> " + info.lex() + " | " + info.rawResponse());
            }
        } finally {
            MYSTEM.close();
        }
    }
}

Because MyStem implements java.lang.AutoCloseable, an instance can also be managed by a try-with-resources statement instead of an explicit close() call in a finally block.
- GrammarInfo.person (API change). The 1p/2p/3p person tags emitted by mystem are now populated into a new person: Set[Person.Value] field. Previously the parser silently dropped them. Note: positional constructor calls to GrammarInfo will break; named-argument call sites are unaffected.
- Parens-pipe gr parsing. Real mystem 3.x output with --weight emits multi-analysis strings like A,plen=(acc,sg,m,anim|gen,sg,m|gen,sg,n). The pre-existing parser threw NoSuchElementException on the leading (. The new GrammarInfoParsing.toGrammarInfos(s): List[GrammarInfo] returns one GrammarInfo per pipe-alternative; the previous toGrammarInfo remains and now returns the most likely interpretation (mystem orders alternatives by descending probability, so .head).
- Wire-format aliases. mystem 3.x emits indic (we declared Value("ind") for indicative mood) and praet (we declared Value("past") for past tense). Without alias support every real verb output threw on parsing. GrammarMapBuilder.aliases (public) now maps indic→ind and praet→past; both forms parse.
- Process robustness. ExternalProcessServer.syncRequest no longer spins forever in while (!reader.ready()) {} when the child process exits without responding: BufferedReader.ready() returns false at EOF, not true. The busy-wait now also gates on process.isAlive, the drain loop breaks on null (avoiding appending the literal string "null"), and an exit-without-output surfaces as IOException("process exited before producing any response") so FailSafeExternalProcessServer can spawn a fresh child rather than wrap an empty success. The restart-on-death path is now functionally testable.
- Refactor: Factory.getExecutable's cached-binary version check is extracted into private[holding] isCorrectVersion(file, version): Boolean. Same observable contract, but the version-matching logic is now exercisable in unit tests without hitting the CDN.
- Test coverage. ~70 new tests (75 → 140 across 15 test classes). Statement coverage > 80%, branch coverage > 74% under scoverage. Direct unit tests for syncRequest (using python3 -u as a portable line-buffered echo stand-in), isCorrectVersion, MyStem.normalize, MyStemApplicationException, archive edge cases (empty archive, directory-as-first-entry), FailSafeExternalProcessServer restart-on-death, plus broader coverage of GrammarInfoParsing (parens-pipe alternatives, person, alias support, fail-loud on unknown tags inside parens).
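To make the parens-pipe and alias items above concrete, here is a minimal Java sketch of how such a gr string can be split into one tag list per alternative. It is an illustration under stated assumptions, not the wrapper's GrammarInfoParsing code: the class GrSketch and its methods are hypothetical, tags are kept as plain strings rather than typed enumeration values, and unknown tags are passed through instead of failing loudly.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch (not the library's parser): splits a mystem gr string into alternatives. */
final class GrSketch {
    // Wire-format spellings mapped to the names the model declares (as listed in the changelog).
    private static final Map<String, String> ALIASES = new HashMap<>();
    static {
        ALIASES.put("indic", "ind");
        ALIASES.put("praet", "past");
    }

    /** Splits "a,b,c" into tags, skipping empty pieces and applying aliases. */
    private static List<String> tags(String commaSeparated) {
        List<String> out = new ArrayList<>();
        for (String raw : commaSeparated.split(",")) {
            String tag = raw.trim();
            if (!tag.isEmpty()) {
                out.add(ALIASES.getOrDefault(tag, tag));
            }
        }
        return out;
    }

    /** Returns one tag list per pipe-alternative; the first is the most likely reading. */
    static List<List<String>> alternatives(String gr) {
        int eq = gr.indexOf('=');
        String head = eq < 0 ? gr : gr.substring(0, eq);   // tags shared by every alternative
        String tail = eq < 0 ? "" : gr.substring(eq + 1);  // form-specific tags
        if (tail.startsWith("(") && tail.endsWith(")")) {
            tail = tail.substring(1, tail.length() - 1);   // strip the surrounding parentheses
        }
        List<List<String>> result = new ArrayList<>();
        for (String alt : tail.split("\\|", -1)) {
            List<String> merged = new ArrayList<>(tags(head));
            merged.addAll(tags(alt));
            result.add(merged);
        }
        return result;
    }
}
```

For the string quoted above, A,plen=(acc,sg,m,anim|gen,sg,m|gen,sg,n), this yields three alternatives, the first being [A, plen, acc, sg, m, anim]. A string without parentheses produces a single alternative.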
- Fixed process leak (#3): MyStem now extends AutoCloseable; an internal JVM shutdown hook destroys the spawned mystem process if the user forgets to call close(). This unblocks Maven Surefire / mvn package runs that previously hung on Windows.
- Java-friendly API: Response#getInfoAsList(): java.util.List<Info> replaces the need for the (removed) scala.collection.JavaConversions.
- Cross-built for Scala 2.12 + 2.13 with profiles scala-2.12 / scala-2.13.
- Java 8 target preserved; runs on JDK 8/11/17 (CI matrix).
- Modernized build: publishing via Sonatype Central Portal (OSSRH was sunset 2025-06-30); current commons-compress, slf4j, org.json, scalatest, scala-maven-plugin.
- logback-classic is now a test dependency; consumers pick their own logging backend (only slf4j-api is on the compile classpath).
- OS detection fixed for modern Windows (10/11); it used to misroute to the macOS download URL.
- Traversable → Iterable throughout (deprecated/removed in 2.13/3).
- Linting + formatting: scalafmt, strict scalac flags (-deprecation -feature -unchecked -Xlint:_).
- GitHub Actions CI with a Java × Scala matrix and scoverage report.
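The OS-detection item above concerns choosing the right binary for the host platform. A minimal sketch of one way to classify the JVM's os.name property is shown below. The class OsSketch, the Family enum, and the matching rules are assumptions made for illustration; they are not the wrapper's actual detection code or the cause of the original bug.

```java
import java.util.Locale;

/** Hypothetical sketch (not the library's code): classifies an os.name value into an OS family. */
final class OsSketch {
    enum Family { WINDOWS, MACOS, LINUX, UNKNOWN }

    static Family detect(String osName) {
        String name = osName == null ? "" : osName.toLowerCase(Locale.ROOT);
        // Match Windows by prefix: a substring test on "win" would also match "darwin".
        if (name.startsWith("windows")) {
            return Family.WINDOWS;
        }
        if (name.contains("mac") || name.contains("darwin")) {
            return Family.MACOS;
        }
        if (name.contains("linux")) {
            return Family.LINUX;
        }
        return Family.UNKNOWN;
    }

    /** Convenience overload for the running JVM. */
    static Family detect() {
        return detect(System.getProperty("os.name"));
    }
}
```

Taking the OS name as a parameter keeps the classification testable on any machine, for example with the values "Windows 11", "Mac OS X", and "Linux".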
See the git log and the release page.
@misc{alekseev2018mystemscala,
author = {Anton Alekseev},
title = {mystem-scala},
year = {2018},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/alexeyev/mystem-scala/}},
commit = {the latest commit of the codebase you have used}
}

If you cite the wrapper, please also cite the original algorithm's paper.
Anton Alekseev anton.m.alexeyev@gmail.com
- Vladislav Dolbilov, @darl
- Mikhail Malchevsky
- @anton-shirikov
- Filipp Malkovsky
- @dizzy7