=============================================================
Osalausestaja: Clause Segmenter for Estonian
=============================================================
Clause Segmenter is a program that splits long and complex natural
language sentences into smaller segments (clauses). For example, the
sentence "Mees, keda seal kohtasime, oli tuttav ja teretas meid."
will be split into the following clauses:
"[Mees, [keda seal kohtasime,] oli tuttav ja] [teretas meid.]"
(in the example, clauses are surrounded by brackets)
The algorithm relies mainly on punctuation, conjunction words, and
finite verb forms in identifying clause boundaries.
For linguistic details/motivations behind the algorithm, see (Kaalep,
Muischnek 2012).
=========================
Requirements
=========================
For building the program (JAR file):
** Java JDK (at least version 1.8.x is expected);
** Apache Ant (at least version 1.8.2);
For using the program:
** A sentence segmenter;
** A word tokenizer;
** An Estonian morphological analyzer; possible options:
-- Filosoft Vabamorf: https://github.com/Filosoft/vabamorf
-- PyVabamorf: https://github.com/estnltk/pyvabamorf
-- T3MESTA (a commercial morphological analyzer);
** An Estonian morphological disambiguator:
-- Vabamorf's disambiguator:
https://github.com/Filosoft/vabamorf
NB! The clause segmenter also works on morphologically ambiguous
input, but the quality of the analysis is expected to be lower
than in the case of morphologically disambiguated text.
=========================
Building the program
=========================
The most straightforward way to compile the program is to use
Apache Ant and the build script ("build.xml" in the root directory).
Before building, the correct path to the JDK must be set in the file
"build.properties" (variable "java.home.location"). Then, building
and deploying can be invoked with the command:
ant deploy
(in the same directory where "build.xml" is located);
This compiles the Java source code, builds the JAR file (Osalau.jar),
and copies the JAR file along with the required files into the folder
"test";
=========================
Using the program
=========================
Basic usage
---------------
Before the clause segmenter can be applied to a text, a number of
text preprocessing steps must be performed: the text must be split into
sentences and tokens (words), and the words must be morphologically
analysed (and disambiguated).
The core of these functionalities is provided by the EstNLTK toolkit, so the
easiest way to use the clause segmenter is within that toolkit (see
https://github.com/estnltk/estnltk for more details).
The input of the clause segmenter is expected to be in the same
format as the output of Vabamorf's command 'etana analyze': JSON-structured
text in UTF-8 encoding. Note that the clause segmenter expects
word root analyses to be 'clean', without any phonetic markup symbols
(which can optionally be added in 'etana' with the flag '-phonetic').
An example of JSON input can be found in the file "test/example_input.json";
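For orientation, a minimal input might look roughly like the sketch below.
The 'analysis' field names shown here (root, lemma, ending, clitic,
partofspeech, form) follow common Vabamorf/pyvabamorf conventions and are
listed as an assumption; the file "test/example_input.json" remains the
authoritative example:

    {"words": [
        {"text": "Mees,",
         "analysis": [
            {"root": "mees", "lemma": "mees", "ending": "0",
             "clitic": "", "partofspeech": "S", "form": "sg n"}
         ]},
        ...
    ]}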
In the "test" folder, following command evokes clause segmenter on
the input file "example_input.json" and outputs the results to standard
input:
java -jar Osalau.jar -in file example_input.json -pretty_print
(flag "-pretty_print" switches on the pretty printing mode, otherwise,
all of the output JSON is on single line);
Alternatively, output can also be directed to a file by specifying:
java -jar Osalau.jar -in file example_input.json -pretty_print -out file my_output.json
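For scripting, the same file-based invocation can be wrapped in a few lines
of Python. This is only a sketch: it assumes that "Osalau.jar" and
"example_input.json" are in the current working directory (the "test" folder)
and that a Java runtime is on the PATH; only the flags documented above are
used, and "my_output.json" is a hypothetical output file name.

    import json
    import subprocess

    # Run the clause segmenter on a JSON input file and write the result
    # to "my_output.json".
    subprocess.run(
        ["java", "-jar", "Osalau.jar",
         "-in", "file", "example_input.json",
         "-out", "file", "my_output.json",
         "-pretty_print"],
        check=True)

    # Load the segmented output for further processing.
    with open("my_output.json", encoding="utf-8") as f:
        result = json.load(f)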
Flag "-pyvabamorf" evokes the program in a special standard input/output
processing mode, where the program reads a line from the standard input,
analyzes the line, and outputs the results (in a single line) to the standard
output.
java -jar Osalau.jar -pyvabamorf
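A possible way to drive this mode from Python is sketched below (an
illustration only; it assumes "Osalau.jar" is in the current directory and
that each input line is one JSON-encoded sentence in the format described
above):

    import json
    import subprocess

    # Start the segmenter in line-by-line standard input/output mode.
    proc = subprocess.Popen(
        ["java", "-jar", "Osalau.jar", "-pyvabamorf"],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE,
        encoding="utf-8")

    def segment(sentence):
        # Send one JSON line to the segmenter and read one JSON line back.
        proc.stdin.write(json.dumps(sentence, ensure_ascii=False) + "\n")
        proc.stdin.flush()
        return json.loads(proc.stdout.readline())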
More details about Vabamorf and its JSON format:
https://github.com/Filosoft/vabamorf
The "insensitive to missing commas" mode
-------------------------------------------
The clause segmenter can also be executed in a mode in which the program
attempts to be less sensitive to missing commas while detecting clause boundaries.
The flag "-ins_comma_mis" switches this mode on:
java -jar Osalau.jar -in file example_input_missing_commas.json -pretty_print -ins_comma_mis
(the file "example_input_missing_commas.json" can be found in the folder "test");
Note that this mode is experimental: compared to the default mode, it can
introduce additional incorrect clause boundaries, although it also improves clause
boundary detection in texts with (many) missing commas.
============================
Interpreting the output
============================
The clause segmenter marks clause boundaries: boundaries between regular
clauses, and the start and end positions of embedded clauses.
In the JSON input/output format, a clause boundary is indicated by adding the
object 'clauseAnnotation' to the token (at the same level as the objects 'text'
and 'analysis'). The 'clauseAnnotation' (a list of strings) can contain
three types of boundary markings:
KINDEL_PIIR -- indicates that there is a clause boundary AFTER the current
token: one clause ends and another starts;
KIILU_ALGUS -- marks the beginning of a new embedded clause BEFORE the current
token;
KIILU_LOPP -- marks the ending of an embedded clause AFTER the current token;
Example:
The sentence
"Mees, keda seal kohtasime, oli tuttav ja teretas meid."
will obtain the following clause annotations:
{'words': [ {'analysis': [ ... ],
             'text': 'Mees,'},
            {'analysis': [ ... ],
             'clauseAnnotation': ['KIILU_ALGUS'],
             'text': 'keda'},
            {'analysis': [ ... ],
             'text': 'seal'},
            {'analysis': [ ... ],
             'clauseAnnotation': ['KIILU_LOPP'],
             'text': 'kohtasime,'},
            {'analysis': [ ... ],
             'text': 'oli'},
            {'analysis': [ ... ],
             'text': 'tuttav'},
            {'analysis': [ ... ],
             'clauseAnnotation': ['KINDEL_PIIR'],
             'text': 'ja'},
            {'analysis': [ ... ],
             'text': 'teretas'},
            {'analysis': [ ... ],
             'text': 'meid.'} ]}
which should be interpreted as:
"keda" (KIILU_ALGUS) -- an embedded clause begins before "keda";
"kohtasime," (KIILU_LOPP) -- the embedded clause ends after "kohtasime,";
"ja" (KINDEL_PIIR) -- one clause ends after "ja" and another begins;
so, the corresponding clause structure should look like:
"[Mees, [keda seal kohtasime,] oli tuttav ja] [teretas meid.]"
(clauses are surrounded by brackets)
Note that embedded clauses can themselves contain other clauses and further
embedded clauses, so the whole clause structure is recursive in nature.
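To make the interpretation concrete, the bracketed clause structure can be
reconstructed from the 'clauseAnnotation' markings with a small sketch like
the one below (an illustration of the markup semantics, not part of the
program; it assumes the {'words': [...]} structure shown above):

    def bracketed_clauses(words):
        # Rebuild the bracketed clause string from clauseAnnotation markings.
        parts = ["["]                    # the first clause opens at sentence start
        for word in words:
            ann = word.get("clauseAnnotation", [])
            if "KIILU_ALGUS" in ann:     # embedded clause starts BEFORE this token
                parts.append("[")
            parts.append(word["text"])
            if "KIILU_LOPP" in ann:      # embedded clause ends AFTER this token
                parts.append("]")
            if "KINDEL_PIIR" in ann:     # clause boundary AFTER this token
                parts.append("] [")
        parts.append("]")                # close the last clause
        return " ".join(parts)

Applied to the example above, this yields (modulo whitespace around the
brackets) the bracketing shown at the beginning of this section.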
=========================
References
=========================
*) Kaalep, Heiki-Jaan; Muischnek, Kadri (2012). Osalausete tuvastamine
eestikeelses tekstis kui iseseisev ülesanne. Helle Metslang, Margit
Langemets, Maria-Maren Sepper (Eds.). Eesti Rakenduslingvistika Ühingu
aastaraamat (55-68). Tallinn: Eesti Rakenduslingvistika Ühing;
*) Kaalep, Heiki-Jaan; Muischnek, Kadri (2012). Robust clause boundary
identification for corpus annotation. Nicoletta Calzolari, Khalid Choukri,
Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mar (Eds.).
Proceedings of the Eighth International Conference on Language Resources
and Evaluation (LREC'12) (1632-1636). Istanbul, Turkey: ELRA;