POS Tagging
A Thesis
BRAC University
by
December 2006
Declaration
I hereby declare that this thesis is based on the results found by myself.
Materials of work found by other researchers are mentioned by reference.
This Thesis, neither in whole nor in part, has been previously submitted
for any degree.
Acknowledgments
I would like to thank my thesis supervisor, Dr. Mumit Khan and co-
supervisor, Naushad UzZaman, for their guidance and ever helpful
comments on my work. I also thank my teachers at BRAC University, my
family and friends.
Abstract
Table of Contents

Declaration
Acknowledgments
Abstract
Table of Contents
Background
References
Appendix
List of Tables

Table 4: Performance of POS Taggers for Hindi [Test data and Tagset source: [41]]
Table 5: Performance of POS Taggers for Telugu [Test data and Tagset source: [41]]
Table 6: Performance of POS Taggers for Bangla [Test data and Tagset source: [41]]
Table 7: Performance of POS Taggers for Bangla on merged training and testing data [Test data and Tagset source: [41]]
List of Figures

Figure 8: Performance of POS Taggers for Hindi [Test data and Tagset source: [41]]
Figure 9: Performance of POS Taggers for Telugu [Test data and Tagset source: [41]]
Figure 10: Performance of POS Taggers for Bangla [Test data and Tagset source: [41]]
Background
1.1 Introduction
This thesis discusses different techniques for POS tagging for western
languages as well as South Asian languages. It presents performance
analyses of well-known tagging methods for the western languages using
corpora of varying sizes. It compares the performance of the same
techniques on some South Asian languages with that on the western
languages, and attempts to suggest which technique might be better
suited to the South Asian languages. It concludes with a discussion of
some improvement techniques that could be added to a baseline tagger
to improve its performance.
assigns a POS tag to the word or to each word in the sentence, and
produces the tagged text as output.
1.3 Classification
word-tag frequencies, rule sets, etc. [10]. The performance of these
models generally increases as the size of the training corpus increases.
The rule based POS tagging models apply a set of handwritten
rules and use contextual information to assign POS tags to words.
These rules are often known as context frame rules. For example, a
context frame rule might say something like: “If an ambiguous/unknown
word X is preceded by a Determiner and followed by a Noun, tag it as an
Adjective.” On the other hand, the transformation based approaches use
a pre-defined set of handcrafted rules as well as automatically induced
rules that are generated during training.
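Such a context frame rule can be sketched in Python (the language of the NLTK toolkit used later in this thesis); the rule and the tag names DET, NOUN, ADJ and UNK are illustrative assumptions, not taken from any cited tagger:

```python
def apply_context_frame_rule(tagged):
    """Example rule: if an ambiguous/unknown word is preceded by a
    Determiner and followed by a Noun, tag it as an Adjective."""
    out = list(tagged)
    for i in range(1, len(out) - 1):
        word, tag = out[i]
        if tag == "UNK" and out[i - 1][1] == "DET" and out[i + 1][1] == "NOUN":
            out[i] = (word, "ADJ")
    return out

print(apply_context_frame_rule([("the", "DET"), ("grubby", "UNK"), ("dog", "NOUN")]))
```

The rule fires only when both context tags match; everything else is left untouched.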
Morphology is the linguistic term for how words are built up from
smaller units of meaning known as morphemes [3]. In addition to
contextual information, morphological information is also used by some
models to aid in the disambiguation process. One such rule might be: “If
1.3.4. Stochastic
There are different models that can be used for stochastic POS tagging,
some of which are described below.
There are three basic problems that the HMM must solve to be used for
any practical purpose. They are as follows: (1) evaluation: computing the
probability of an observation sequence given the model; (2) decoding:
finding the state sequence that best explains the observation sequence;
and (3) learning: adjusting the model parameters to maximize the
probability of the training observations.
For POS tagging, HMM is used to model the joint probability distribution
P(word, tag). The generation process uses a probabilistic Finite State
Machine (FSM).
The HMM model trains on annotated corpora to find the transition
and emission probabilities. For a sequence of words w, the HMM
determines the sequence of tags t using the formula: t = argmax_t P(w, t)
The probability model for MEM is defined over H × T, where H is the
set of possible word and tag contexts, or “histories”, and T is the set of
allowable tags. The model's probability of a history h together with a tag t
is defined as:

p(h, t) = π ∏_{j=1..k} α_j^{f_j(h, t)}

where π is a normalization constant, {α_1, ..., α_k} are the positive model
parameters, {f_1, ..., f_k} are known as “features”, where f_j(h, t) ∈ {0, 1},
and each parameter α_j corresponds to a feature f_j.
Given a sequence of words {w_1, ..., w_n} and tags {t_1, ..., t_n} as training
data, h_i is defined as the history available when predicting t_i. The
parameters {α_1, ..., α_k} are then chosen to maximize the likelihood of the
training data p [10], using the following formula:

L(p) = ∏_{i=1..n} p(h_i, t_i)
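A minimal sketch of evaluating such a model, with one made-up binary feature and an invented parameter value; a history h is reduced here to just the current word:

```python
from math import prod

def maxent_prob(alphas, features, h, t, tags):
    """p(h, t) multiplies alpha_j^f_j(h, t) over the features; dividing by the
    sum over all allowable tags plays the role of the normalization constant."""
    score = lambda tag: prod(a ** f(h, tag) for a, f in zip(alphas, features))
    return score(t) / sum(score(tag) for tag in tags)

# one binary feature: the word ends in -ing and the proposed tag is VBG
f1 = lambda h, t: 1 if h.endswith("ing") and t == "VBG" else 0
print(maxent_prob([3.0], [f1], "running", "VBG", ["VBG", "NN"]))  # 3 / (3 + 1) = 0.75
```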
The different models described above have their own advantages and
disadvantages; however, they all face one common difficulty: assigning a
tag to an unknown word, i.e. a word the tagger has not seen previously
because it was not present in the training corpus. Different tagging
models use different methods to get around this problem. The rule
based taggers use certain rules to specially handle unknown or
ambiguous words. But the stochastic taggers have no way to calculate
the probabilities of an unknown word beforehand. So to solve the
problem, the taggers of this category calculate the probability that a
suffix of an unknown word occurs with a particular tag. If HMM is used,
the probability that a word containing that suffix occurs with a particular
tag in the given sequence is calculated. An alternate approach is to
assign a set of default tags to unknown words. The default tags typically
consist of the open classes, that is, word classes which freely admit
new words and are readily modified by morphological processes [3];
examples are Noun, Verb, Adjective and Adverb. The tagger then
disambiguates using the probabilities that those tags occur at the end of
the n-gram in question. A third approach is to calculate the probability
that each tag in the tagset occurs at the end of the n-gram, and to select
the path with the highest probability. This, however, is not the optimal
solution if the size of the tag set is large [9].
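A toy version of the suffix heuristic for unknown words; the suffix-to-tag table is an invented English example, not from the thesis:

```python
SUFFIX_TAGS = {"ing": "VERB", "ly": "ADV", "ness": "NOUN", "ous": "ADJ"}  # illustrative

def guess_tag(word, default="NOUN"):
    """Guess an unknown word's tag from its longest matching suffix,
    falling back to a default open-class tag."""
    for n in range(min(4, len(word) - 1), 0, -1):
        tag = SUFFIX_TAGS.get(word[-n:])
        if tag is not None:
            return tag
    return default

print(guess_tag("happily"), guess_tag("darkness"), guess_tag("zorp"))
```

A statistical tagger would weight these suffix-tag associations by corpus frequency rather than use a fixed table.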
A considerable amount of work has already been done in the field of POS
tagging for English. Different approaches like the rule based approach,
the stochastic approach and the transformation based learning approach
along with modifications have been tried and implemented. However, if
we look at the same scenario for South-Asian languages such as Bangla
and Hindi, we find that not much work has been done. The main
reason for this is the unavailability of a considerable amount of
annotated corpora of sound quality, on which the tagging models could
train to generate rules for the rule based and transformation based
models and probability distributions for the stochastic models [15].
To test the tagger, a 14276 entry lexicon was built from the first 20
sections of the 24 data sets of the Penn Treebank corpora,
corresponding to the Wall Street Journal. 95% coverage was found by
A different approach [28] is used in a POS tagger for the German
language. This tagger is known as the tree tagger. To correctly tag text,
the tagger builds a decision tree where the nodes are tags of previous
words. These nodes are used to determine the current node i.e. the tag
of the current word. Along with this, the tagger also uses a suffix tree to
improve its performance. The figure shows a small part of a sample decision
tree built by the tree tagger.
when trained on 21,000 words. When the tagger is trained on the same
amount of data with the best feature set, the accuracy improves to
82.67%. However, the size and type of the testing data are not mentioned
in this paper, although they can improve or degrade the measured
performance of the tagger to a great extent.
In [32], the authors report a hybrid tagger for Hindi that runs in
two phases to POS tag input text. In the first phase, the HMM based TnT
tagger is run on the untagged text to perform the initial tagging. During
this phase, a set of transformation rules is induced which are used later.
In the second phase, the set of transformation rules learnt earlier is used
on the initially tagged text to correct errors made in the first phase.
However, the performance of this tagger is not as good as the other
taggers reported for Hindi. It uses a corpus of 35,000 words annotated
with 26 tags, and the resulting accuracy is 79.66% using the TnT tagger.
The authors suggest that the low score could be the result of the
sparseness of the training data. The use of the set of transformation
rules in post processing improves the overall accuracy to 80.74%.
For the Tamil language, a tagger is reported in [33] that uses a suffix
stripper before performing the actual tagging to improve the accuracy.
The suffix stripper uses a list of suffixes, pronouns, adjectives and
adverbs to remove the suffixes from words. A simple block diagram of
the suffix stripper is included below.
The input format for the tagger is one sentence per line in which each
word is separated by a white space. On the input text, the tagger runs
the following algorithm to remove suffixes and then to complete the
tagging.
3. Using the combination of suffixes and the rules, apply the lexical rules
and assign the category.
4. For each sentence,
4.1. Apply the context sensitive rules on the unknown words.
4.2. Apply the context sensitive rules on the wrongly tagged words.
4.3. If no context rule applies for any unknown words, tag it as
noun.
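The stripping step can be sketched as follows; the suffix list is a made-up stand-in for the actual Tamil lists used in [33]:

```python
SUFFIXES = ["kal", "ai", "il", "ku"]  # hypothetical stand-ins for the real lists

def strip_suffix(word):
    """Strip the longest listed suffix, returning (stem, suffix or None);
    a stem of at least two characters must remain."""
    for suf in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suf) and len(word) > len(suf) + 1:
            return word[: -len(suf)], suf
    return word, None

print(strip_suffix("puukkal"))  # ('puuk', 'kal')
print(strip_suffix("oru"))      # ('oru', None)
```

The lexical rules of step 3 would then assign a category based on which suffix, if any, was removed.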
2.5 Bangla
A rule based POS tagger for Bangla is reported in [34], but only
the rules for Noun and Adjective are shown. No review of or comparison
with established work on POS tagging is given, nor is any performance
analysis reported, which makes it uncertain whether the approach is
worthwhile.
The tagset here consists of only 9 tags, which is very small compared to
well-known tagsets. A POS tagger works as an intermediate tool or
component for many advanced NLP applications as described earlier,
but with a tagset consisting of only 9 tags, the output of the POS tagger
can only be used in restricted applications. For English, most widely
used tagsets include the Brown tagset [35] consisting of 87 distinct tags,
and Penn Treebank’s tagset [36], consisting of (36 + 12 = 48) tags.
These show the necessity of a large tagset to use the POS tagging
information in other applications.
Notable work on POS tagging has been reported in [37] for Indian
Bangla. Here, an HMM based approach is used for tagging Bangla which
is a combination of both supervised and unsupervised learning for
training a Bigram based HMM. It also uses a morphological analyzer
before tagging that takes a word as input and gives all possible POS
tags for the word. This restricts the set of possible tags for a given word
to possibly increase the performance of the tagger.
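Restricting a word's candidate tags with a morphological analyzer can be sketched like this; the analyzer here is a hypothetical lookup, not the one used in [37]:

```python
def restrict_candidates(word, analyzer, tagset):
    """Return the tags the tagger should consider for `word`: the analyzer's
    output when it knows the word, otherwise the full tagset."""
    possible = analyzer(word)
    return sorted(possible) if possible else sorted(tagset)

TAGSET = {"NN", "VB", "ADJ", "ADV"}
toy_analyzer = lambda w: {"word": {"NN"}, "walks": {"VB", "NN"}}.get(w, set())
print(restrict_candidates("walks", toy_analyzer, TAGSET))  # ['NN', 'VB']
print(restrict_candidates("xyzzy", toy_analyzer, TAGSET))  # the full tagset
```

Pruning the search space this way reduces both ambiguity and decoding cost for the HMM.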
Another paper [38] uses a suffix based tree tagger, influenced by [28],
but this too is more of a morphological analyzer than a POS tagger. The
authors also mention n-gram based tagging, but do not describe how to
combine the two. This paper also lacks any review of or comparison with
established work on POS tagging; instead it only proposes a rule-based
technique. The paper also does not show any
The tagger was also tested on random sentences of 1003 words from
the CIIL corpus, which were more complex than the training data and
these were tagged manually. This resulted in some reduction in
accuracy. The performance of method 3 in this case was 84.37% while
that of method 1 and 2 were 59.93% and 61.79%, respectively.
3.1 Corpora
experiments that we did, our test sets were disjoint from the training
corpora.
3.2 Tagsets
1. Fineness vs. coarseness
When choosing the tagset for a POS tagger, we have to decide
whether the tags will allow for precise distinction of the various
For English, we used the Brown tagset [35], while for Bangla we used
our Bangla tagset [43], which is a two level tagset. The first level is the
high-level tagset for Bangla, which consists of only (12 + 2 = 14) tags
(Noun, Adjective, Cardinal, Ordinal, Fractional, Pronoun, Indeclinable,
Verb, Post Positions, Quantifiers, Adverb, Punctuation, Abbreviation and
Others). The second level is more fine-grained, with 41 tags. Most of our
experiments are based on the level 2 tagset (41 tags); however, we also
experimented in several cases with the level 1 tagset (14 tags). We also
used the 26-tag tagset of [15] for experimenting with Bangla, Hindi and
Telugu.
Apart from the corpora and the tagsets, we used the Natural Language
Toolkit (NLTK) [40], which is a set of computational linguistics and NLP
program modules, annotated corpora and tutorials supporting research
and teaching, written for the Python language. NLTK supports various
NLP tasks by providing implementations of algorithms such as the Brill
tagger, an HMM based POS tagger and n-gram based taggers. For our
experiments, we used the Unigram, Bigram, Brill and HMM tagging
modules of NLTK.
3.3 Taggers
The Bigram tagger works in exactly the same way as the Unigram
tagger; the only difference is that it considers the context when
assigning a tag to the current word. When training, it creates a frequency
distribution describing the frequencies with which each word is tagged
in different contexts. The context consists of the word to be tagged and
the tag of the previous word. When tagging, the tagger uses the
frequency distribution to assign each word the tag with the maximum
frequency given the context. In our case, when a context is encountered
for which no data has been learnt, the tagger backs off to the Unigram
tagger.
We compared the Unigram and Bigram taggers to the more advanced
taggers like HMM and Brill. We also used the Unigram tagger as the pre-
tagger of the Brill tagger. We used the unigram and n-gram (specifying
n=2 for creating a Bigram model) tagging modules of NLTK [40] to
create Unigram and Bigram taggers and do the tagging. We found that
for a small corpus of Bangla, both the taggers tag with similar results.
They are also extremely fast compared to the other taggers.
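The behaviour described above can be sketched without NLTK (whose tagger API differs between versions) as plain frequency tables with backoff; the toy corpus and tag names are invented:

```python
from collections import Counter, defaultdict

def train(tagged_sents):
    """Build unigram (word -> tag) and bigram ((prev_tag, word) -> tag)
    tables, keeping only the most frequent tag for each context."""
    uni, bi = defaultdict(Counter), defaultdict(Counter)
    for sent in tagged_sents:
        prev = "<s>"
        for word, tag in sent:
            uni[word][tag] += 1
            bi[(prev, word)][tag] += 1
            prev = tag
    pick = lambda table: {k: c.most_common(1)[0][0] for k, c in table.items()}
    return pick(uni), pick(bi)

def bigram_tag(words, uni, bi, default="NOUN"):
    """Tag with the bigram table, backing off to unigram, then to a default tag."""
    tagged, prev = [], "<s>"
    for w in words:
        tag = bi.get((prev, w)) or uni.get(w) or default
        tagged.append((w, tag))
        prev = tag
    return tagged

corpus = [[("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
          [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")]]
uni, bi = train(corpus)
print(bigram_tag(["the", "cat", "barks"], uni, bi))
```

NLTK builds the same kind of conditional frequency distributions internally and chains taggers through its backoff parameter.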
3.3.2 HMM
The HMM approach differs from the other POS tagging approaches
in that it considers the best combination of tags for a sequence of
words, whereas the other tagging methods greedily tag one word at a
time, without regard to the optimal combination [10].
As each of the words W_i can take any of the words in the vocabulary
{w_1, w_2, ..., w_W} as its value, we denote the value of W_i by w_i and a
particular sequence of values for W_{i,j} (i <= j) by w_{i,j}. Similarly, we
denote the value of T_i by t_i and a particular sequence of values for
T_{i,j} (i <= j) by t_{i,j}. The probability Pr(t_{1,n}, w_{1,n}) can then be
used to find the most likely sequence of POS tags for a given sequence of
words:

t_{1,n} = argmax_{t_{1,n}} Pr(t_{1,n}, w_{1,n})

In an HMM, the probability of the current tag t_i depends only on the
previous k tags t_{i-k,i-1}, and the probability of the current word w_i
depends only on the current tag t_i [9, 10], giving:

Pr(t_{1,n}, w_{1,n}) = ∏_{i=1..n} Pr(t_i | t_{i-k,i-1}) Pr(w_i | t_i)
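Under these assumptions with k = 1, the argmax over tag sequences can be computed with the Viterbi algorithm; the probability tables below are invented toy numbers, and a small floor stands in for proper smoothing of unseen events:

```python
def viterbi(words, tags, trans, emit, floor=1e-6):
    """Return the tag path maximizing prod_i Pr(t_i | t_{i-1}) * Pr(w_i | t_i),
    with '<s>' as the start state."""
    best = {"<s>": (1.0, [])}
    for w in words:
        best = {
            t: max(
                (p * trans.get((prev, t), floor) * emit.get((t, w), floor), path + [t])
                for prev, (p, path) in best.items()
            )
            for t in tags
        }
    return max(best.values())[1]

trans = {("<s>", "DET"): 0.8, ("DET", "NOUN"): 0.9, ("NOUN", "VERB"): 0.9}
emit = {("DET", "the"): 0.9, ("NOUN", "dog"): 0.8, ("VERB", "barks"): 0.8}
print(viterbi(["the", "dog", "barks"], ["DET", "NOUN", "VERB"], trans, emit))
```

Real implementations work in log space to avoid underflow on long sentences; plain products are kept here for readability.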
Using linguistic rules for a free word order language is more troublesome
than for a language with a relatively fixed word order. For such
languages, [6] suggests that the HMM based approach is more
appropriate for POS tagging than other approaches.
We used the HMM tagger of NLTK [40] on parts of the Brown corpus as
well as the Bangla corpora to test its performance. We started with a
small corpus and increased its size to find out how performance improves
with the number of training tokens. We noticed that for Bangla, HMM
performs similarly even when the size of the tagset changes; this can
be observed in the results given in the next chapter. We also
experimented with overlapping and disjoint training and testing data sets
to find out how the performance of HMM changes on data that has been
seen previously.
The stochastic taggers have high accuracy and are very fast to
tag after having been trained. But a common drawback of all stochastic
taggers is their size. A stochastic nth order tagger using back-off may
store huge tables containing n-gram entries and large sparse arrays
having millions of entries. So these taggers are not a very good choice if
they have to be deployed on mobile computing devices which have
relatively small storage space and computation power. This is where the
rule or transformation based taggers are useful.
The Brill tagger is a transformation based tagger that performs very well
but uses only a tiny fraction of the space required by the nth-order
stochastic taggers [16, 17]. The general idea of the tagger is very
simple. It uses a set of rules to tag data. Then it checks the tagged data
for potential errors and corrects them. At the same time it may learn
some new rules. Then it uses these new rules to again tag the corrected
In the same fashion we might paint the trunk a uniform brown before
going back to over-paint further details with a fine brush. Brill tagging
uses the same idea: get the bulk of the painting right with broad brush
strokes, then fix up the details. As time goes on, successively finer
brushes are used, and the scale of the changes becomes arbitrarily
small. The decision of when to stop is somewhat arbitrary [2].
The Brill tagging model works in two phases. In the first phase, the
tagger tags the input tokens with their most likely tag. This is usually
done using a Unigram tagging model. Then in the second phase, a set
of transformation rules is applied to the tagged data [16]. An
improvement to this technique is described in [17], where unannotated
text is passed through the initial state annotator at first. The initial state
annotator can range in complexity from assigning random structure to
assigning the output of a sophisticated manually created annotator. After
getting the output from the initial state annotator, it is compared to the
truth as specified in a manually annotated corpus. In this stage,
transformation rules are applied to the output of the initial state annotator
so that it resembles the truth better. After that, a greedy learning
algorithm is applied. At each iteration of learning, the transformation is
found whose application results in the highest score and the
transformation is then added to the ordered transformation list and the
training corpus is updated by applying the learned transformation.
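A single greedy learning step can be sketched as follows; the rule format (from-tag, to-tag, required-previous-tag) is a deliberate simplification of Brill's rule templates:

```python
def apply_rule(tagged, rule):
    """rule = (from_tag, to_tag, prev_tag): retag words whose current tag
    and left-context tag both match."""
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == rule[0] and out[i - 1][1] == rule[2]:
            out[i] = (word, rule[1])
    return out

def best_rule(initial, truth, candidates):
    """Greedy step: pick the candidate whose application best matches the truth."""
    score = lambda tagged: sum(t == g for (_, t), (_, g) in zip(tagged, truth))
    return max(candidates, key=lambda r: score(apply_rule(initial, r)))

initial = [("the", "DET"), ("can", "VERB"), ("rusts", "VERB")]
truth = [("the", "DET"), ("can", "NOUN"), ("rusts", "VERB")]
rules = [("VERB", "NOUN", "DET"), ("VERB", "ADJ", "DET")]
print(best_rule(initial, truth, rules))  # ('VERB', 'NOUN', 'DET')
```

The full learner repeats this step, appending each winning rule to an ordered list and re-tagging the corpus, until no rule improves the score enough.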
1. িdতীয় িব যুেd িমt বািহনীর েনতা িbিটশ pধানমntী uinটন চাির্চলেক গত সpােহর শুরেু ত টপেক েবয়ার e
তািলকায় sান লাভ কেরন ।
2. তেব িতিন যিদ আবার িনরব্াচন কেরন eবং জয়ী হন তাহেল হয়েতা e েরকর্ডo ভাঙেত পারেবন ।
Brill:
1. িdতীয়/NC িব যুেd/NC িমt/NC বািহনীর/NC েনতা/NC িbিটশ/ADJ pধানমntী/NC
uinটন/NP চাির্চলেক/NP গত/ADJ সpােহর/NC শুরেু ত/ADVT টপেক/NP েবয়ার/NP
e/DP তািলকায়/NC sান/NC লাভ/NC কেরন/VF । /PUNSF
Unigram:
1. িdতীয়/NP িব যুেd/NP িমt/NP বািহনীর/NC েনতা/NC িbিটশ/ADJ pধানমntী/NC
uinটন/NP চাির্চলেক/NP গত/ADJ সpােহর/NC শুরেু ত/ADVT টপেক/NP েবয়ার/NP
e/DP তািলকায়/NC sান/NC লাভ/NP কেরন/VF । /PUNSF
HMM:
1. িdতীয়/DP িব যুেd/NC িমt/NC বািহনীর/NC েনতা/NC িbিটশ/ADJ pধানমntী/NC
uinটন/NP চাির্চলেক/NP গত/ADJ সpােহর/NC শুরেু ত/ADVT টপেক/ADVT েবয়ার/NP
e/NP তািলকায়/NC sান/NC লাভ/NC কেরন/VF । /PUNSF
Brill:
1. িdতীয়/NN িব যুেd/NN িমt/NN বািহনীর/NN েনতা/NN িbিটশ/ADJ pধানমntী/NN
uinটন/NN চাির্চলেক/NN গত/ADJ সpােহর/NN শুরেু ত/ADV টপেক/NN েবয়ার/NN
e/PN তািলকায়/NN sান/NN লাভ/NN কেরন/VB । /PUNC
Unigram:
1. িdতীয়/NN িব যুেd/NN িমt/NN বািহনীর/NN েনতা/NN িbিটশ/ADJ pধানমntী/NN
uinটন/NN চাির্চলেক/NN গত/ADJ সpােহর/NN শুরেু ত/ADV টপেক/NN েবয়ার/NN
e/PN তািলকায়/NN sান/NN লাভ/NN কেরন/VB । /PUNC
HMM:
1. িdতীয়/PN িব যুেd/NN িমt/NN বািহনীর/NN েনতা/NN িbিটশ/ADJ pধানমntী/NN
uinটন/NN চাির্চলেক/NN গত/ADJ সpােহর/NN শুরেু ত/ADV টপেক/ADV েবয়ার/NN
e/NN তািলকায়/NN sান/NN লাভ/NN কেরন/VB । /PUNC
The results of our experiments are shown below in the forms of tables
and graphs.
[Graph: tagger accuracy (%) plotted against the number of training tokens (0 to 4484) for the HMM, Unigram and Brill taggers, with logarithmic trend lines.]
[Graph: a second plot of tagger accuracy (%) against training tokens (0 to 4484) for the HMM, Unigram and Brill taggers, with logarithmic trend lines.]
[Graph: tagger accuracy (%) against training tokens for the HMM, Unigram and Brill taggers, with logarithmic trend lines.]
trained the taggers using the training data as well as the development
data provided for [41], and tested the performance using the
corresponding test data.
Test data: 209 sentences, 4924 tokens from the SPSAL test
corpus
Sentences   Tokens   HMM accuracy   Unigram accuracy   Bigram accuracy   Brill accuracy
0 0 0 0 0 0
4 60 36 18 Insufficient data 37.6
7 113 32.2 23.8 Insufficient data 43.6
12 201 30.6 27.6 Insufficient data 46.7
21 415 39.8 35.8 35.8 53.8
30 607 43.6 37.6 37.7 56.2
38 826 50.5 40.3 40.5 60.3
43 1039 53.3 41.9 42.1 59.7
85 2017 57.8 46 46.4 61.8
182 4031 61.9 49.2 49.3 64.9
259 6017 62.8 50.9 51 68.8
362 8009 64.4 52 52.3 69.4
450 10001 64.4 52.7 53.1 69.1
535 12003 65.7 54.1 54.5 69.6
619 14011 66.3 54.5 54.9 69.7
698 16020 67.3 55.5 55.5 70.6
784 18019 67.3 55.8 56.2 70.6
865 20004 68 56.9 57.1 70.7
934 22010 67.5 57 57.3 70.8
1007 24030 68.6 57.7 55.7 71.1
1125 26005 68.5 58.4 57.5 71.3
1135 26148 68.5 58.5 57.5 71.5
Table 4: Performance of POS Taggers for Hindi [Test data and Tagset
source: [41]]
[Graph: accuracy (%) against training tokens for the HMM, Unigram, Bigram and Brill taggers, with logarithmic trend lines.]
Figure 8: Performance of POS Taggers for Hindi [Test data and Tagset
source: [41]]
Test data: 415 sentences, 5193 tokens from the SPSAL test
corpus
Sentences   Tokens   HMM accuracy   Unigram accuracy   Bigram accuracy   Brill accuracy
0 0 0 0 0 0
5 50 28.4 15.6 Insufficient data 45.7
9 102 28.1 16.4 Insufficient data 47.7
23 202 32.1 16.9 Insufficient data 48
54 401 30.8 18 Insufficient data 49.2
87 612 29.6 18.3 18.3 49.1
107 811 30.9 18.8 18.8 49.6
131 1004 31.7 19.1 19.1 38.2
248 2010 32.8 23.4 23.4 53.5
421 4001 42.6 28.1 28.2 57.9
605 6007 48 31.7 31.7 60.4
783 8002 51.1 34.9 34.5 62.6
994 10018 53 37.4 37.2 63.9
1192 12000 53.6 38.8 38.3 64.6
1409 14010 53.3 38.8 38.7 64.4
1626 16005 53.9 39.6 39.2 65
1842 18004 53.7 40.1 39.7 65.1
2048 20012 54.9 40.4 40.2 65.1
2184 22013 54.8 41.5 41 65.8
2335 24002 55.6 41.6 41 65.8
Table 5: Performance of POS Taggers for Telugu [Test data and Tagset
source: [41]]
[Graph: accuracy (%) against training tokens for the HMM, Unigram, Bigram and Brill taggers, with logarithmic trend lines.]
Figure 9: Performance of POS Taggers for Telugu [Test data and Tagset
source: [41]]
Test data: 400 sentences, 5225 tokens from the SPSAL test
corpus
Sentences   Tokens   HMM accuracy   Unigram accuracy   Bigram accuracy   Brill accuracy
0 0 0 0 0 0
8 51 14.3 14 Insufficient data 35.6
13 108 20.7 17.9 Insufficient data 39.6
21 206 26.5 19.3 19.3 40.9
37 405 30.7 21.8 21.8 42.7
53 605 32.7 24.1 24.1 45.4
69 807 36.4 27.7 27.7 48.6
87 1002 39.3 28.6 28.6 50.2
173 2004 44.3 36 36 55.8
304 4003 49.7 42.4 41.9 61.3
398 6036 49.8 45.6 45.3 63.8
532 8026 53.6 48.1 47.9 64.7
677 10001 54.3 49.8 49.5 65.6
Table 6: Performance of POS Taggers for Bangla [Test data and Tagset
source: [41]]
[Graph: accuracy (%) against training tokens for the HMM, Unigram, Bigram and Brill taggers, with logarithmic trend lines.]
Figure 10: Performance of POS Taggers for Bangla [Test data and
Tagset source: [41]]
The resulting high accuracy gain of the HMM model (from 62.7% to
92.9%) once more reveals that stochastic models are far superior to the
others when knowledge about unknown words is available.
For Bangla, we did not have any large annotated corpus available, and
the very low performance for Bangla in our experiments is mostly due to
the small corpus size and the sparseness of the training data, which
make it very difficult for stochastic taggers to build probability
distributions that capture the transitions between different states [44].
Within this limited corpus, our experiments suggested that for the three
South Asian languages Bangla, Hindi and Telugu, with a limited tagged
corpus, Brill's tagger performs better than the HMM based tagger and
the n-gram based Unigram and Bigram taggers. Researchers who want
to implement a tagger for a language with limited language resources,
i.e. without large annotated corpora, can try Brill's tagger or another rule
based tagger for their languages too.
From [20], we find that a tagger should have the following qualities to be
of practical use.
Accurate: The tagger should assign the best possible POS tag for every
word in the text to tag.
The next step could be to find out whether these patterns are present in
South Asian Languages and also, whether the above mentioned
guidelines are applicable to these languages as well. There are some
other state-of-the-art POS tagging techniques which could also be tried
for Bangla.
References
[29] Aniket Dalal, Kumar Nagaraj, Uma Sawant and Sandeep Shelke,
“Hindi Part-of-Speech Tagging and Chunking: A Maximum Entropy
Approach”, in Proceedings of the NLPAI Machine Learning 2006
Competition.
[39] Goutam Kumar Saha, Amiya Baran Saha and Sudipto Debnath,
“Computer Assisted Bangla Words POS Tagging”, in Proceedings of the
International Symposium on Machine Translation, NLP & TSS
(iSTRANS-2004), New Delhi, 2004.
[43] Bangla POS Tagset used in our Bangla POS tagger, available
online at http://www.naushadzaman.com/bangla_tagset.pdf
Appendix
12   Punctuation
         Sentence-final punctuation   PUNC   ।, ?, !
         Quote                        ", "
         Parenthesis                  ( ) { } [ ]
         Mid-sentence punctuation     , ; :
         Other punctuation            % .
13   Abbreviation   ABB   েমাঃ, ডাঃ
14   Others   OTHER