A Linguistic Search Tool for Semitic Languages
Alon Itai
Knowledge Center for Processing Hebrew
Computer Science Department
Technion, Haifa, Israel
E-mail: itai@cs.technion.ac.il
Abstract
The paper discusses searching a corpus for linguistic patterns. Semitic languages have complex morphology and ambiguous writing
systems. We explore the properties of Semitic Languages that challenge linguistic search and describe how we used the Corpus
Workbench (CWB) to enable linguistic searches in Hebrew corpora.
1. Introduction
As linguistics matures, the methods it uses turn towards the empirical. It is no longer enough to introspect to gather linguistic insight. Data is required. While most search engines look for words, linguists are interested in grammatical patterns and in the usage of words within these patterns. For example, searching for the adjective "record" followed by a noun should yield "record highs" but not "record songs"; searching for the verb to eat (in any inflection) followed by a noun should yield sentences such as "John ate dinner." but not "Mary ate an apple" (the verb is followed by an article, not a noun).

To answer these needs, several systems have been constructed. Such systems take a corpus and preprocess it to enable linguistic searches. We argue that general purpose search tools are not suitable for Semitic languages unless special measures are taken. We then show how to use one such tool to enable linguistic searches in Semitic languages, with Modern Hebrew as a test case.

2. Semitic Languages
Semitic languages pose interesting challenges to linguistic search engines. The rich morphology entails that each word contains, in addition to the lemma, a large number of morphological and syntactical features: part of speech, number, gender, case. Nouns also inflect for status (absolute or construct) and possessive. Verbs inflect for person, tense, voice and accusative. Thus one may want to search for a plural masculine noun, followed by a plural verb in past tense with accusative inflection first person singular.

Additional problems arise from the writing system. Some prepositions and conjunctions are attached to the word as prefixes. For example, in Hebrew the word bbit [1] can only be analyzed as the preposition b (in) + the noun bit (house), whereas the word bit, which also starts with a b, can be analyzed only as the noun bit, since the remainder, it, is not a Hebrew word. Thus to find a preposition one needs to perform a morphological analysis of the word to decide whether the first letter is a preposition or part of the lemma. Hence, in order to extract useful information the text first has to be morphologically analyzed.

[1] We use the following transliteration: אבגדהוזחטיכלמנסעפצקרשת = abgdhwzxTiklmnsypcqršt

All this leads to a high degree of morphological ambiguity. Ambiguity is increased since the writing systems of Arabic and Hebrew omit most of the vowels. In a running Hebrew text a word has 2.2-2.8 different analyses on average. (The number of analyses depends on the corpus and on the morphological analyzer: an analyzer that distinguishes between more analyses or uses a larger lexicon will find more analyses.)

Ideally one would wish to use manually tagged corpora, i.e., corpora where the correct analysis of each word was manually chosen. However, since it is expensive to manually tag a large corpus, the size of such corpora is limited and many interesting linguistic phenomena will not be represented. Thus, one may either use automatically tagged corpora or use an ambiguous corpus and retrieve a sentence if any one of its possible analyses satisfies the query. We preferred the latter approach because of the high error rate of programs that attempt to find the right analysis in context. (An error rate of 5% per word entails that a 20-word sentence has a probability of only $(1 - 0.05)^{20} \approx e^{-1}$, about 0.36, of being analyzed entirely correctly.) Moreover, since these systems use machine learning (HMM or SVM) (Bar Haim et al. 2005, Diab et al. 2004, Habash and Rambow 2005), they prefer the more common structure, so rare linguistic structures are more likely to be incorrectly tagged. However, these are exactly the phenomena a corpus linguist would like to search for. Consequently, to successfully perform linguistic searches one cannot rely on automatic morphological disambiguation, and it is better to allow all possible analyses and retrieve a sentence even if only one of the analyses satisfies the query.
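The error-rate argument of Section 2 can be checked numerically. The following is a minimal sketch that assumes per-word tagging errors are independent (an assumption of this illustration, as is the helper name prob_all_correct); it computes the chance that a 20-word sentence is tagged without any error at a 5% per-word error rate.

```python
import math


def prob_all_correct(per_word_error: float, sentence_length: int) -> float:
    """Probability that every word of a sentence is tagged correctly,
    assuming independent per-word errors (an assumption of this sketch)."""
    return (1.0 - per_word_error) ** sentence_length


p_correct = prob_all_correct(0.05, 20)
p_some_error = 1.0 - p_correct

print(f"all 20 words correct: {p_correct:.3f}")
print(f"at least one error:   {p_some_error:.3f}")
print(f"e^-1 for comparison:  {math.exp(-1):.3f}")
```

Under this assumption only about a third of such sentences come out fully correct, which is why retrieving on any possible analysis is attractive.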
3. CWB
CWB, the Corpus Workbench, is a tool created at the University of Stuttgart for searching corpora. The tool enables linguistically motivated searches; for example, one may search for a single word, say "interesting".
The query language consists of Boolean combinations of regular expressions, which use the POSIX EGREP syntax, e.g. the query
"interest(s|(ed|ing)(ly)?)?"
matches any of the words interest, interests, interested, interesting, interestedly, interestingly.

One can also search for a lemma: "lemma=go" should yield sentences containing the words go, goes, going, went and gone. The search can be focused on part of speech, "POS=VERB". CWB deals with incomplete specifications by using regular expressions. For example, a verb can be subcategorized as VBG (present/past) and VBN (participle). The query [pos="VB.*"] matches both forms and may be used to match all parts of speech that start with the letters VB. ("." matches any single character and "*" after a pattern indicates 0 or more repetitions, thus ".*" matches any string of length 0 or more.) Finally, a query may consist of several words, thus ["boy"][POS=VERB] yields all sentences that contain the word boy followed by a verb.

To accommodate linguistic searches, the corpus needs to be tagged with the appropriate data (such as lemma and POS). The system then loads the tagged corpus to create an index. To that end the corpus should be reformatted in a special format.

CWB has been used for a variety of languages. It also supports UTF-8, thus allowing easy processing of non-Latin alphabets.

4. Creating an Index
In principle we adopted the CWB solution to partial and multiple analyses, i.e., we use regular expressions for partial matches. We created a composite POS consisting of the concatenation of all subfields of the analysis. For example, the complete morphological analysis of hšxqnim "the (male) players" is "NOUN-masculine-plural-absolute-definite". We encode all this information as the POS of the word, [pos="šxqnim-NOUN-masculine-plural-absolute-definite"]: the lemma is šxqnim, the main POS is noun, the gender masculine, the number plural, the status absolute, and the prefix h indicates that the word is definite. We included the lemma, since each analysis might have a different lemma.

To accommodate multiple analyses, we concatenate all the analyses (separated by ":"). For example, the word mxiibim receives the composite POS (shown here one analysis per line)
:mxaiib-ADJECTIVE-masculine-plural-abs-indef
:xiib-PARTICIPLE-Pi'el-xwb-unspecified-masculine-plural-abs-indef
:xiib-VERB-Pi'el-xwb-unspecified-masculine-plural-present
:PREFIX-m-preposition-xiib-NOUN-masculine-plural-abs-indefinite
:PREFIX-m-preposition-xiib-ADJECTIVE-masculine-plural-abs-indef:

The analyses are:
1. The adjective mxiib, gender masculine, number plural, status absolute, and the word is indefinite.
2. The verb xiib: a participle of a verb whose binyan (inflection pattern of the verb) is Pi'el, the root is xwb, the person is unspecified, gender masculine, number plural, the type of participle is noun, the status absolute, and the word is indefinite.
3. A verb whose root is xwb, binyan Pi'el, person unspecified, number plural and tense present.
4. The noun xiib, prefixed by the preposition m.
5. The adjective xiib, prefixed by the preposition m.

Thus one can retrieve the word by any one of the following POS queries:
[POS=".*-ADJECTIVE-.*"], [POS=".*-PARTICIPLE-.*"], [POS=".*-VERB-.*"], [POS=".*-NOUN-.*"].
However, one may also specify additional properties by using a pattern that matches subfields:
[POS=".*PREFIX-[^:]*preposition[^:]*-NOUN-.*"]
indicating that we are searching for a noun that is prefixed by a preposition. The sequence [^:]* denotes any sequence of 0 or more characters that does not contain ":" and is used to skip over unspecified subfields. Since the different analyses of a word are separated by ":" and ":" cannot appear within an analysis, the query cannot be satisfied by matching part of the query against one analysis and the remainder of the query against a subsequent analysis.

To create an index from a corpus, we first run the morphological analyzer of MILA (Itai and Wintner 2008), which creates XML files containing all the morphological analyses of each word. We developed a program to transform the XML files to the above format, which conforms to CWB's index format. Thus we were able to create CWB files.

Our architecture enables some Boolean combinations. Suppose we wanted to search for a two-word noun-adjective expression in which the two words agree in number. We could require that either the first word be a singular noun and the second word a singular adjective, or the first word be a plural noun and the second word a plural adjective. This requirement is expressed by the query
([pos=".*NOUN-singular-.*"] [pos=".*ADJECTIVE-singular-.*"])
|([pos=".*NOUN-plural-.*"] [pos=".*ADJECTIVE-plural-.*"])
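The role of [^:]* in the composite POS can be illustrated outside CWB. The sketch below uses Python's re module as a stand-in for CWB's matcher and assumes that an attribute test must match the whole attribute value (hence re.fullmatch). The composite string for mxiibim is rebuilt from the five analyses listed in Section 4 with each prefix joined to its analysis by "-"; the helper name matches is hypothetical.

```python
import re

# Composite POS of mxiibim: five analyses, each delimited by ":".
# The exact string is an assumption of this sketch.
ANALYSES = [
    "mxaiib-ADJECTIVE-masculine-plural-abs-indef",
    "xiib-PARTICIPLE-Pi'el-xwb-unspecified-masculine-plural-abs-indef",
    "xiib-VERB-Pi'el-xwb-unspecified-masculine-plural-present",
    "PREFIX-m-preposition-xiib-NOUN-masculine-plural-abs-indefinite",
    "PREFIX-m-preposition-xiib-ADJECTIVE-masculine-plural-abs-indef",
]
COMPOSITE = ":" + ":".join(ANALYSES) + ":"


def matches(pattern: str, composite_pos: str) -> bool:
    """Emulate an attribute test: the regex must cover the whole value."""
    return re.fullmatch(pattern, composite_pos) is not None


# A noun prefixed by a preposition: satisfied by the fourth analysis alone.
prefixed_noun = matches(r".*PREFIX-[^:]*preposition[^:]*-NOUN-.*", COMPOSITE)

# "Participle in present tense" holds for no single analysis.
# [^:]* keeps the pattern inside one analysis, so it is rejected.
strict = matches(r".*-PARTICIPLE-[^:]*-present.*", COMPOSITE)

# With ".*" instead, the pattern starts in the participle analysis and
# finishes in the following verb analysis: a spurious hit.
loose = matches(r".*-PARTICIPLE-.*-present.*", COMPOSITE)

print(prefixed_noun, strict, loose)
```

The contrast between the last two patterns is the reason subfields are skipped with [^:]* rather than with ".*".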
However, one must be careful to avoid queries of the type
[pos=".*NOUN.*" & pos=".*-singular-.*"]
since then we might return a word that has one analysis as a plural noun and another analysis as a singular verb.

5. Performance
To test the performance of the system we loaded a file of 814,147 words, with a total of 1,564,324 analyses, i.e., 2.36 analyses per word. Table 1 shows a sample of queries and their performance. The more general the query, the more time it required. However, the running time for these queries is reasonable. If the running time is linear in the size of the corpus, CWB should be able to support queries on 100-million-word corpora.
One problem we encountered is that of space. The index of the 814,147-word file required 25.2 MB, so each word requires about 31 bytes. A 100-million-word corpus would therefore require a 3.09 GB index file.

6. Writing Queries
Even though it is possible to write queries in the above format, we feel that it is unwieldy. First, the format is complicated and one may easily err. More importantly, in order to write a query one must be familiar with all the features of each POS and with the order in which they appear in the index. This is extremely user-unfriendly and we do not believe many people will be able to use such a system.

To overcome this problem, we are in the process of creating a GUI which will show for each POS the appropriate subfields; once a subfield is chosen, a menu will show all possible values of that subfield. Unspecified subfields will be filled by placeholders. The graphic query will then be translated to a CWB query and the results of this query will be presented to the user. We believe that the GUI will also be helpful for queries in languages that now use the CWB format.
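The translation step planned for the GUI (chosen subfields become literal values, unspecified subfields become placeholders) might look roughly as follows. This is only an illustrative sketch, not the actual GUI code: the subfield names and their order in SUBFIELDS, the per-subfield placeholder [^:-]*, and the function names are all hypothetical, and every analysis is assumed to be delimited by ":" on both sides.

```python
import re

# Hypothetical subfield order per POS; in practice the order is whatever
# the index uses, which this sketch does not know.
SUBFIELDS = {
    "NOUN": ["gender", "number", "status", "definiteness"],
    "ADJECTIVE": ["gender", "number", "status", "definiteness"],
}

FIELD_PLACEHOLDER = "[^:-]*"  # any single subfield value
LEMMA_PLACEHOLDER = "[^:]*"   # lemma plus optional prefix subfields


def to_cwb_query(pos: str, **chosen: str) -> str:
    """Translate a menu selection into a CWB-style query on the composite POS."""
    unknown = set(chosen) - set(SUBFIELDS[pos])
    if unknown:
        raise ValueError(f"unknown subfields for {pos}: {sorted(unknown)}")
    fields = [
        re.escape(chosen[name]) if name in chosen else FIELD_PLACEHOLDER
        for name in SUBFIELDS[pos]
    ]
    pattern = f".*:{LEMMA_PLACEHOLDER}-{pos}-" + "-".join(fields) + ":.*"
    return f'[pos="{pattern}"]'


def pattern_of(query: str) -> str:
    """Strip the [pos="..."] wrapper to recover the bare regular expression."""
    return query[len('[pos="'):-len('"]')]


query = to_cwb_query("NOUN", number="plural")
print(query)
```

A user would thus only pick "NOUN" and "number = plural" from menus; knowledge of the subfield order stays inside the translator.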
Regular Expression                                                                  Time (sec)  Output File (KB)
[pos=".*-MODAL-.*"] [pos=":היה-.*"] [pos=".*-VERB-[^:]*-infinitive:.*"];            0.117       13
[word="על"] [word="מנת"] [pos=".*-VERB-[^:]*-infinitive:.*"];                       0.038       28
[word="על"] [word="מנת"] [pos=".*PREFIX-ש.*"];                                      0.025       5
[pos=".*:הלך-[^:]*-present:.*"] [pos=".*-VERB-[^:]*-infinitive:.*"];                0.099       2
[word="בית"] [pos=":ספר.*"];                                                        0.017       7
[word="בית"] [pos=".*:ספר[^:]*-SUFFIX-possessive-.*"];                              0.014       1
"כותב";                                                                             0.009
[pos=".*:[^:]*-VERB-[^:]*:.*"];                                                     0.569
".*";                                                                               1.854
[pos=".*"];                                                                         1.85
".*"; [pos=".*"];                                                                   3.677
All the previous regular expressions concatenated                                   7.961
("כותב" | [pos=".*:[^:]*-VERB-[^:]*:.*"] | ".*" | [pos=".*"]);                      2.061
([pos=".*:[^:]*-VERB-[^:]*:.*" & word="כותב" & word=".*" & pos=".*"]);              0.168
Table 1: Example queries and their performance.
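The space estimate of Section 5, and the linear-time scenario it mentions, can be reproduced with a few lines of arithmetic. This sketch reads "MB" and "GB" as decimal units and scales the slowest single query of Table 1 (".*", 1.854 sec) linearly with corpus size; both are assumptions made for illustration, not measurements.

```python
WORDS = 814_147               # size of the test corpus
INDEX_BYTES = 25.2e6          # 25.2 MB, read as decimal megabytes (assumption)
TARGET_WORDS = 100_000_000    # hypothetical 100-million-word corpus

bytes_per_word = INDEX_BYTES / WORDS
projected_index_gb = bytes_per_word * TARGET_WORDS / 1e9

# Slowest single query in Table 1, scaled linearly (assumption).
slowest_query_sec = 1.854
projected_query_sec = slowest_query_sec * TARGET_WORDS / WORDS

print(f"bytes per word:       {bytes_per_word:.1f}")
print(f"projected index size: {projected_index_gb:.2f} GB")
print(f"projected query time: {projected_query_sec:.0f} sec")
```

The first two figures agree with the roughly 31 bytes per word and 3.09 GB quoted in Section 5; under the linear assumption the most general query would take just under four minutes.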
7. Conclusion
Until now linguistic searches were oriented to Western languages. Semitic languages exhibit more complex patterns, which at first sight might require designing entirely new tools. We have shown how to reuse an existing tool to efficiently conduct sophisticated searches.

The interface of current systems is UNIX based. This might be acceptable when the linguistic features are simple; for complex features, however, it is virtually impossible to memorize all the possibilities and render the queries properly. Thus a special GUI is necessary.

8. Acknowledgements
It is a pleasure to thank Ulrich Heid, Serge Heiden and Andrew Hardie who helped us use CWB. Last and foremost I wish to thank Gassan Tabajah whose technical assistance was invaluable.

9. References
Official Web page of CWB: http://cwb.sourceforge.net/
Bar Haim, R., Sima’an, K. and Winter, Y. (2005). Choosing an Optimal Architecture for Segmentation and POS-Tagging of Modern Hebrew. ACL Workshop on Computational Approaches to Semitic Languages.
Buckwalter, T. (2002). Buckwalter Arabic Morphological Analyzer Version 1.0. Linguistic Data Consortium catalog number LDC2002L49, ISBN 1-58563-257-0.
Buckwalter, T. (2004). Buckwalter Arabic Morphological Analyzer Version 2.0. Linguistic Data Consortium catalog number LDC2004L02, ISBN 1-58563-324-0.
Christ, O. (1994). A modular and flexible architecture for an integrated corpus query system. In Papers in Computational Lexicography (COMPLEX ’94), pp. 22--32, Budapest, Hungary.
Christ, O. and Schulze, B. M. (1996). Ein flexibles und modulares Anfragesystem für Textcorpora. In H. Feldweg and E. W. Hinrichs (eds.), Lexikon und Text, pp. 121--133. Max Niemeyer Verlag, Tübingen.
Diab, M., Hacioglu, K. and Jurafsky, D. (2004). Automatic Tagging of Arabic Text: From Raw Text to Base Phrase Chunks. In HLT-NAACL: Short Papers, pp. 149--152.
Habash, N. and Rambow, O. (2005). Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pp. 573--580, Ann Arbor.
Hajič, J. (2000). Morphological tagging: data vs. dictionaries. In Proceedings of NAACL-ANLP, pp. 94--101, Seattle, Washington.
Hajič, J. and Hladká, B. (1998). Tagging Inflective Languages: Prediction of Morphological Categories for a Rich, Structured Tagset. In Proceedings of COLING-ACL 1998, pp. 483--490, Montreal, Canada.
Itai, A. and Wintner, S. (2008). Language Resources for Hebrew. Language Resources and Evaluation, 42, pp. 75--98.
Lee, Y-S. et al. (2003). Language model based Arabic word segmentation. In ACL 2003, pp. 399--406.
Segal, E. (2001). Hebrew morphological analyzer for Hebrew undotted texts. M.Sc. thesis, Computer Science Department, Technion, Haifa, Israel.