A Linguistic Search Tool for Semitic Languages

Alon Itai
Knowledge Center for Processing Hebrew
Computer Science Department
Technion, Haifa, Israel
E-mail: itai@cs.technion.ac.il

Abstract
The paper discusses searching a corpus for linguistic patterns. Semitic languages have complex morphology and ambiguous writing
systems. We explore the properties of Semitic Languages that challenge linguistic search and describe how we used the Corpus
Workbench (CWB) to enable linguistic searches in Hebrew corpora.

1. Introduction
As linguistics matures, its methods turn towards the empirical. Introspection alone no longer
suffices to gain linguistic insight; data is required. While most search engines look for words,
linguists are interested in grammatical patterns and in the usage of words within these patterns.
For example, searching for the adjective "record" followed by a noun should yield "record highs"
but not "record songs"; searching for the verb to eat (in any inflection) followed by a noun should
yield sentences such as "John ate dinner." but not "Mary ate an apple." (the verb is followed by an
article, not a noun).

To answer these needs, several systems have been constructed. Such systems take a corpus and
preprocess it to enable linguistic searches. We argue that general-purpose search tools are not
suitable for Semitic languages unless special measures are taken. We then show how to use one such
tool to enable linguistic searches in Semitic languages, with Modern Hebrew as a test case.

2. Semitic Languages
Semitic languages pose interesting challenges to linguistic search engines. The rich morphology
entails that each word contains, in addition to the lemma, a large number of morphological and
syntactic features: part of speech, number, gender, case. Nouns also inflect for status (absolute
or construct) and possessive. Verbs inflect for person, tense, voice and accusative. Thus one may
want to search for a plural masculine noun followed by a plural verb in past tense with an
accusative inflection in the first person singular.

Additional problems arise from the writing system. Some prepositions and conjunctions are attached
to the word as prefixes. For example, in Hebrew the word bbit[1] may only be analyzed as the
preposition b (in) + the noun bit (house), whereas the word bit, which also starts with a b, can be
analyzed only as the noun bit, since the remainder, it, is not a Hebrew word. Thus to find a
preposition one needs to perform a morphological analysis of the word to decide whether the first
letter is a preposition or part of the lemma. Hence, in order to extract useful information, the
text first has to be morphologically analyzed.

[1] We use the following transliteration of the Hebrew alphabet:
    אבגדהוזחטיכלמנסעפצקרשת
    abgdhwzxTiklmnsypcqršt

All this leads to a high degree of morphological ambiguity. Ambiguity is increased since the
writing systems of Arabic and Hebrew omit most of the vowels. In running Hebrew text a word has
2.2-2.8 different analyses on average. (The number of analyses depends on the corpus and on the
morphological analyzer: an analyzer that distinguishes between more analyses or uses a larger
lexicon will find more analyses.)

Ideally one would wish to use manually tagged corpora, i.e., corpora where the correct analysis of
each word was manually chosen. However, since it is expensive to manually tag a large corpus, the
size of such corpora is limited and many interesting linguistic phenomena will not be represented.
Thus, one may either use automatically tagged corpora, or use an ambiguous corpus and retrieve a
sentence if any one of its possible analyses satisfies the query. We preferred the latter approach
because of the high error rate of programs that attempt to find the right analysis in context. (An
error rate of 5% per word entails that a 20-word sentence has probability of only
$(1 - 0.05)^{20} \approx e^{-1}$ of being analyzed correctly.) Moreover, since these systems use
machine learning (HMM or SVM) (Bar Haim et al. 2005, Diab et al. 2004, Habash and Rambow 2005),
they prefer the more common structures; thus rare linguistic structures are more likely to be
incorrectly tagged. However, these are exactly the phenomena a corpus linguist would like to search
for. Consequently, to successfully perform linguistic searches one cannot rely on automatic
morphological disambiguation; it is better to allow all possible analyses and retrieve a sentence
even if only one of its analyses satisfies the query.
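The sentence-level estimate in Section 2 (a 5% per-word error rate over a 20-word sentence) can be
checked numerically. The short Python sketch below assumes that per-word tagging errors are
independent, which is a simplification; the function name is illustrative only.

```python
import math


def sentence_accuracy(word_error_rate: float, sentence_length: int) -> float:
    """Probability that every word of a sentence is tagged correctly,
    assuming independent per-word errors (a simplifying assumption)."""
    return (1.0 - word_error_rate) ** sentence_length


p_correct = sentence_accuracy(0.05, 20)

print(f"all 20 words correct: {p_correct:.3f}")
print(f"e^-1 for comparison:  {math.exp(-1):.3f}")
print(f"at least one error:   {1.0 - p_correct:.3f}")
```

Under this assumption, roughly two out of three 20-word sentences would contain at least one
wrongly disambiguated word.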
3. CWB
CWB, the Corpus Workbench, is a tool created at the University of Stuttgart for searching corpora.
The tool enables linguistically motivated searches. For example, one may search for a single word,
say "interesting". The query language consists of Boolean combinations of regular expressions,
which use the POSIX egrep syntax; e.g., the query
"interest(s|(ed|ing)(ly)?)?"
searches for any of the words interest, interests, interested, interesting, interestedly,
interestingly.

One can also search by lemma: "lemma=go" should yield sentences containing the words go, goes,
going, went and gone. The search can be restricted to a part of speech, e.g. "POS=VERB". CWB deals
with incomplete specifications by using regular expressions. For example, a verb can be
subcategorized as VBG (present/past) and VBN (participle). The query [pos="VB.*"] matches both
forms and may be used to match all parts of speech that start with the letters VB. ("." matches any
single character and "*" after a pattern indicates 0 or more repetitions, thus ".*" matches any
string of length 0 or more.) Finally, a query may consist of several words; thus
["boy"][POS=VERB] yields all sentences that contain the word boy followed by a verb.

To accommodate linguistic searches, the corpus needs to be tagged with the appropriate data (such
as lemma and POS). The system then loads the tagged corpus to create an index. To that end the
corpus must be converted to a special format.

CWB has been used for a variety of languages. It also supports UTF-8, thus allowing easy processing
of non-Latin alphabets.

4. Creating an Index
In principle we adopted the CWB solution to partial and multiple analyses, i.e., using regular
expressions for partial matches. We created a composite POS consisting of the concatenation of all
subfields of the analysis. For example, the complete morphological analysis of hšxqnim "the (male)
players" is "NOUN-masculine-plural-absolute-definite". We encode all this information as the POS of
the word, [pos="šxqnim-NOUN-masculine-plural-absolute-definite"]: the lemma is šxqnim, the main POS
is noun, the gender masculine, the number plural, the status absolute, and the prefix h indicates
that the word is definite. We included the lemma, since each analysis might have a different lemma.

To accommodate multiple analyses, we concatenate all the analyses (separated by ":"). For example,

mxiibim
:mxiib-ADJECTIVE-masculine-plural-abs-indef
:xiib-PARTICIPLE-Pi'el-xwb-unspecified-masculine-plural-abs-indef
:xiib-VERB-Pi'el-xwb-unspecified-masculine-plural-present
:PREFIX-m-preposition-xiib-NOUN-masculine-plural-abs-indefinite
:PREFIX-m-preposition-xiib-ADJECTIVE-masculine-plural-abs-indef:

The analyses are:
1. The adjective mxiib: gender masculine, number plural, status absolute, and the word is
   indefinite.
2. The verb xiib as a participle: the binyan (inflection pattern of the verb) is Pi'el, the root is
   xwb, the person is unspecified, gender masculine, number plural, the type of participle is noun,
   the status absolute, and the word is indefinite.
3. A verb whose root is xwb: binyan Pi'el, person unspecified, number plural and tense present.
4. The noun xiib, prefixed by the preposition m.
5. The adjective xiib, prefixed by the preposition m.

Thus one can retrieve the word by any one of the following queries by POS:
[POS=".*-ADJECTIVE-.*"], [POS=".*-PARTICIPLE-.*"], [POS=".*-VERB-.*"], [POS=".*-NOUN-.*"].
However, one may also specify additional properties by using a pattern that matches subfields:
[POS=".*PREFIX-[^:]*preposition[^:]*-NOUN-.*"]
indicating that we are searching for a noun that is prefixed by a preposition. The sequence [^:]*
denotes any sequence of 0 or more characters that does not contain ":" and is used to skip over
unspecified subfields. Since the different analyses of a word are separated by ":" and ":" cannot
appear within an analysis, the query cannot be satisfied by matching one part of the query against
one analysis and the remainder of the query against a subsequent analysis.

To create an index from a corpus, we first run the morphological analyzer of MILA (Itai and Wintner
2008), which creates XML files containing all the morphological analyses of each word. We developed
a program to transform the XML files to the above format, which conforms to CWB's index format.
Thus we were able to create CWB files.

Our architecture enables some Boolean combinations. Suppose we wanted to search for a two-word
noun-adjective expression whose words agree in number. We could require that either the first word
be a singular noun and the second word a singular adjective, or the first word be a plural noun and
the second word a plural adjective. This is expressed by the query
([pos=".*NOUN-singular-.*"] [pos=".*ADJECTIVE-singular-.*"])
|([pos=".*NOUN-plural-.*"] [pos=".*ADJECTIVE-plural-.*"])
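The composite-POS encoding of Section 4 and the role of [^:]* can be illustrated with a short
Python sketch. It is not the MILA or CWB code: the helper names are hypothetical, Python's
re.fullmatch stands in for CWB's regular-expression matching (assumed here to match the whole
attribute value, as the leading and trailing ".*" in the queries suggest), and the analyses are
transcribed from the mxiibim example.

```python
import re


def composite_pos(analyses):
    """Join all analyses of one token into a single ':'-delimited value."""
    return ":" + ":".join(analyses) + ":"


def matches(pattern, value):
    """True if the pattern matches the entire attribute value."""
    return re.fullmatch(pattern, value) is not None


# Analyses of mxiibim, transcribed from the example in Section 4.
MXIIBIM = composite_pos([
    "mxiib-ADJECTIVE-masculine-plural-abs-indef",
    "xiib-PARTICIPLE-Pi'el-xwb-unspecified-masculine-plural-abs-indef",
    "xiib-VERB-Pi'el-xwb-unspecified-masculine-plural-present",
    "PREFIX-m-preposition-xiib-NOUN-masculine-plural-abs-indefinite",
    "PREFIX-m-preposition-xiib-ADJECTIVE-masculine-plural-abs-indef",
])

# A noun prefixed by a preposition: satisfied by the fourth analysis alone.
prefixed_noun = r".*PREFIX-[^:]*preposition[^:]*-NOUN-.*"

# No single analysis is a VERB with absolute status, so this must fail ...
strict_verb_abs = r".*-VERB-[^:]*-abs-.*"
# ... but '.*' may cross ':' and glue two different analyses together.
leaky_verb_abs = r".*-VERB-.*-abs-.*"

print(matches(prefixed_noun, MXIIBIM))    # True
print(matches(strict_verb_abs, MXIIBIM))  # False
print(matches(leaky_verb_abs, MXIIBIM))   # True, a spurious hit
```

The last result shows why unspecified subfields are skipped with [^:]* rather than with ".*": the
looser pattern pairs the VERB tag of one analysis with the status of a later one.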
One must be careful, however, to avoid queries of the type
[pos=".*NOUN.*" & pos=".*-singular-.*"]
since such a query might return a word that has one analysis as a plural noun and another analysis
as a singular verb.

5. Performance
To test the performance of the system we uploaded a file of 814,147 words, with a total of
1,564,324 analyses, i.e., 2.36 analyses per word. Table 1 shows a sample of queries and their
performance. The more general the query, the more time it required. However, the running time for
these queries is reasonable. If the running time is linear in the size of the corpus, CWB should be
able to support queries on 100-million-word corpora.

One problem we encountered is that of space. The index of the 814,147-word file required 25.2 MB;
thus each word requires about 31 bytes, and a 100-million-word corpus would require a 3.09 gigabyte
index file.

6. Writing Queries
Even though it is possible to write queries in the above format, we feel that it is unwieldy.
First, the format is complicated and one may easily err. More importantly, in order to write a
query one must be familiar with all the features of each POS and with the order in which they
appear in the index. This is extremely user-unfriendly, and we do not believe many people will be
able to use such a system.

To overcome this problem, we are in the process of creating a GUI which will show for each POS the
appropriate subfields; once a subfield is chosen, a menu will show all possible values of that
subfield. Unspecified subfields will be filled by placeholders. The graphic query will then be
translated to a CWB query and the results of this query will be presented to the user. We believe
that the GUI will also be helpful for queries in languages that now use CWB format.
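The translation step planned for the GUI in Section 6 might look roughly like the following Python
sketch. It assumes a fixed, known subfield order for each POS; the SUBFIELDS table, the subfield
names and the function name are hypothetical and serve only to show how unspecified subfields could
be filled with the [^:]* placeholder.

```python
# Hypothetical subfield order per main POS (an assumption for illustration;
# the real order is whatever the index actually uses).
SUBFIELDS = {
    "NOUN": ["gender", "number", "status", "definiteness"],
    "ADJECTIVE": ["gender", "number", "status", "definiteness"],
}

PLACEHOLDER = "[^:]*"  # skips an unspecified subfield without crossing ':'


def build_query(pos, **chosen):
    """Translate menu selections into a CWB-style query on the composite POS."""
    unknown = set(chosen) - set(SUBFIELDS[pos])
    if unknown:
        raise ValueError(f"unknown subfields for {pos}: {sorted(unknown)}")
    fields = [chosen.get(name, PLACEHOLDER) for name in SUBFIELDS[pos]]
    # The lemma is left unspecified; each analysis is delimited by ':'.
    body = "-".join([PLACEHOLDER, pos] + fields)
    return f'[pos=".*:{body}:.*"]'


print(build_query("NOUN", number="plural"))
```

For the selection above the sketch prints [pos=".*:[^:]*-NOUN-[^:]*-plural-[^:]*-[^:]*:.*"], so the
user never has to remember which subfields exist or in which order they are stored.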

Regular Expression                                                           Time (sec)  Output File (KB)

[pos=".*-MODAL-.*"] [pos=":היה-.*"][pos=".*-VERB-[^:]*-infinitive:.*"];      0.117       13
[word="על"][word="מנת"][pos=".*-VERB-[^:]*-infinitive:.*"];                  0.038       28
[word="על"][word="מנת"][pos=".*PREFIX-ש.*"];                                 0.025       5
[pos=".*:הלך-[^:]*-present:.*"] [pos=".*-VERB-[^:]*-infinitive:.*"];         0.099       2
[word="בית"][pos=":ספר.*"];                                                  0.017       7
[word="בית"][pos=".*:ספר[^:]*-SUFFIX-possessive-.*"];                        0.014       1
"כותב";                                                                      0.009
[pos=".*:[^:]*-VERB-[^:]*:.*"];                                              0.569
".*";                                                                        1.854
[pos=".*"];                                                                  1.85
".*"; [pos=".*"];                                                            3.677
All the previous regular expressions concatenated                            7.961
("כותב" | [pos=".*:[^:]*-VERB-[^:]*:.*"] | ".*" | [pos=".*"] );              2.061
([pos=".*:[^:]*-VERB-[^:]*:.*" & word="כותב" & word=".*" & pos=".*"]);       0.168

Table 1: Example queries and their performance.
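The space estimate of Section 5 follows from simple proportionality. The Python sketch below
reproduces it under two assumptions that the text does not state explicitly: that 1 MB means 10^6
bytes, and that index size grows linearly with the number of words.

```python
INDEX_BYTES = 25.2e6        # reported index size, taking 1 MB = 10**6 bytes
CORPUS_WORDS = 814_147      # words in the test file
TARGET_WORDS = 100_000_000  # size of the projected large corpus

# Average index cost of one word, then linear extrapolation.
bytes_per_word = INDEX_BYTES / CORPUS_WORDS
projected_gb = bytes_per_word * TARGET_WORDS / 1e9

print(f"{bytes_per_word:.1f} bytes per word")
print(f"{projected_gb:.2f} GB for {TARGET_WORDS:,} words")
```

The result is about 31 bytes per word and a little over 3 GB for 100 million words, in line with
the figures reported in Section 5.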

7. Conclusion
Until now, linguistic searches were oriented to Western languages. Semitic languages exhibit more
complex patterns, which at first sight might require designing entirely new tools. We have shown
how to reuse an existing tool to efficiently conduct sophisticated searches.

The interface of current systems is UNIX based. This might be acceptable when the linguistic
features are simple; however, for complex features it is virtually impossible to memorize all the
possibilities and render the queries properly. Thus a special GUI is necessary.

8. Acknowledgements
It is a pleasure to thank Ulrich Heid, Serge Heiden and Andrew Hardie, who helped us use CWB. Last
and foremost I wish to thank Gassan Tabajah, whose technical assistance was invaluable.

9. References
Official Web page of CWB: http://cwb.sourceforge.net/
Bar Haim, R., Sima’an, K. and Winter, Y. (2005). Choosing an Optimal Architecture for Segmentation
  and POS-Tagging of Modern Hebrew. ACL Workshop on Computational Approaches to Semitic Languages.
Buckwalter, T. (2002). Buckwalter Arabic Morphological Analyzer Version 1.0. Linguistic Data
  Consortium catalog number LDC2002L49, ISBN 1-58563-257-0.
Buckwalter, T. (2004). Buckwalter Arabic Morphological Analyzer Version 2.0. Linguistic Data
  Consortium catalog number LDC2004L02, ISBN 1-58563-324-0.
Christ, O. (1994). A modular and flexible architecture for an integrated corpus query system. In
  Papers in Computational Lexicography (COMPLEX ’94), pp. 22--32, Budapest, Hungary.
Christ, O. and Schulze, B. M. (1996). Ein flexibles und modulares Anfragesystem für Textcorpora. In
  H. Feldweg and E. W. Hinrichs (eds.), Lexikon und Text, pp. 121--133. Max Niemeyer Verlag,
  Tübingen.
Diab, M., Hacioglu, K. and Jurafsky, D. (2004). Automatic Tagging of Arabic Text: From Raw Text to
  Base Phrase Chunks. In HLT-NAACL: Short Papers, pp. 149--152.
Habash, N. and Rambow, O. (2005). Arabic Tokenization, Part-of-Speech Tagging and Morphological
  Disambiguation in One Fell Swoop. In Proceedings of the 43rd Annual Meeting of the Association
  for Computational Linguistics, pp. 573--580, Ann Arbor.
Hajič, J. (2000). Morphological tagging: data vs. dictionaries. In Proceedings of NAACL-ANLP, pp.
  94--101, Seattle, Washington.
Hajič, J. and Hladká, B. (1998). Tagging Inflective Languages: Prediction of Morphological
  Categories for a Rich, Structured Tagset. In Proceedings of COLING-ACL 1998, pp. 483--490,
  Montreal, Canada.
Itai, A. and Wintner, S. (2008). Language Resources for Hebrew. Language Resources and Evaluation,
  42, pp. 75--98.
Lee, Y-S. et al. (2003). Language model based Arabic word segmentation. In ACL 2003, pp. 399--406.
Segal, E. (2001). Hebrew morphological analyzer for Hebrew undotted texts. M.Sc. thesis, Computer
  Science Department, Technion, Haifa, Israel.