
Introduction to the CoNLL-2003 Shared Task:

Language-Independent Named Entity Recognition

Erik F. Tjong Kim Sang and Fien De Meulder


CNTS - Language Technology Group
University of Antwerp
{erikt,fien.demeulder}@uia.ua.ac.be

Abstract

We describe the CoNLL-2003 shared task: language-independent named entity recognition. We give background information on the data sets (English and German) and the evaluation method, present a general overview of the systems that have taken part in the task and discuss their performance.
1 Introduction

Named entities are phrases that contain the names of persons, organizations and locations. Example:

[ORG U.N. ] official [PER Ekeus ] heads for [LOC Baghdad ] .

This sentence contains three named entities: Ekeus is a person, U.N. is an organization and Baghdad is a location. Named entity recognition is an important task of information extraction systems. There has been a lot of work on named entity recognition, especially for English (see Borthwick (1999) for an overview). The Message Understanding Conferences (MUC) have offered developers the opportunity to evaluate systems for English on the same data in a competition. They have also produced a scheme for entity annotation (Chinchor et al., 1999). More recently, there have been other system development competitions which dealt with different languages (IREX and CoNLL-2002).

The shared task of CoNLL-2003 concerns language-independent named entity recognition. We will concentrate on four types of named entities: persons, locations, organizations and names of miscellaneous entities that do not belong to the previous three groups. The shared task of CoNLL-2002 dealt with named entity recognition for Spanish and Dutch (Tjong Kim Sang, 2002). The participants of the 2003 shared task have been offered training and test data for two other European languages: English and German. They have used the data for developing a named-entity recognition system that includes a machine learning component. The shared task organizers were especially interested in approaches that made use of resources other than the supplied training data, for example gazetteers and unannotated data.

2 Data and Evaluation

In this section we discuss the sources of the data that were used in this shared task, the preprocessing steps we have performed on the data, the format of the data and the method that was used for evaluating the participating systems.

2.1 Data

The CoNLL-2003 named entity data consists of eight files covering two languages: English and German [1]. For each of the languages there is a training file, a development file, a test file and a large file with unannotated data. The learning methods were trained with the training data. The development data could be used for tuning the parameters of the learning methods. The challenge of this year's shared task was to incorporate the unannotated data in the learning process in one way or another. When the best parameters were found, the method could be trained on the training data and tested on the test data. The results of the different learning methods on the test sets are compared in the evaluation of the shared task. The split between development data and test data was chosen to avoid systems being tuned to the test data.

The English data was taken from the Reuters Corpus [2]. This corpus consists of Reuters news stories between August 1996 and August 1997. For the training and development set, ten days' worth of data were taken from the files representing the end of August 1996. For the test set, the texts were from December 1996. The preprocessed raw data covers the month of September 1996.

[1] Data files (except the words) can be found on http://lcg-www.uia.ac.be/conll2003/ner/
[2] http://www.reuters.com/researchandstandards/
The text for the German data was taken from the ECI Multilingual Text Corpus [3]. This corpus consists of texts in many languages. The portion of data that was used for this task was extracted from the German newspaper Frankfurter Rundschau. All three of the training, development and test sets were taken from articles written in one week at the end of August 1992. The raw data were taken from the months of September to December 1992.

[3] http://www.ldc.upenn.edu/

Table 1 contains an overview of the sizes of the data files. The unannotated data contain 17 million tokens (English) and 14 million tokens (German).

English data       Articles  Sentences   Tokens
Training set            946     14,987  203,621
Development set         216      3,466   51,362
Test set                231      3,684   46,435

German data        Articles  Sentences   Tokens
Training set            553     12,705  206,931
Development set         201      3,068   51,444
Test set                155      3,160   51,943

Table 1: Number of articles, sentences and tokens in each data file.

2.2 Data preprocessing

The participants were given access to the corpus after some linguistic preprocessing had been done: for all data, a tokenizer, part-of-speech tagger, and a chunker were applied to the raw data. We created two basic language-specific tokenizers for this shared task. The English data was tagged and chunked by the memory-based MBT tagger (Daelemans et al., 2002). The German data was lemmatized, tagged and chunked by the decision tree tagger Treetagger (Schmid, 1995).

Named entity tagging of English and German training, development and test data was done by hand at the University of Antwerp. Mostly, MUC conventions were followed (Chinchor et al., 1999). An extra named entity category called MISC was added to denote all names which are not already in the other categories. This includes adjectives, like Italian, and events, like 1000 Lakes Rally, making it a very diverse category.

2.3 Data format

All data files contain one word per line, with empty lines representing sentence boundaries. At the end of each line there is a tag which states whether the current word is inside a named entity or not. The tag also encodes the type of named entity. Here is an example sentence:

U.N.      NNP  I-NP  I-ORG
official  NN   I-NP  O
Ekeus     NNP  I-NP  I-PER
heads     VBZ  I-VP  O
for       IN   I-PP  O
Baghdad   NNP  I-NP  I-LOC
.         .    O     O

Each line contains four fields: the word, its part-of-speech tag, its chunk tag and its named entity tag. Words tagged with O are outside of named entities and the I-XXX tag is used for words inside a named entity of type XXX. Whenever two entities of type XXX are immediately next to each other, the first word of the second entity will be tagged B-XXX in order to show that it starts another entity. The data contains entities of four types: persons (PER), organizations (ORG), locations (LOC) and miscellaneous names (MISC). This tagging scheme is the IOB scheme originally put forward by Ramshaw and Marcus (1995). We assume that named entities are non-recursive and non-overlapping. When a named entity is embedded in another named entity, usually only the top level entity has been annotated.

Table 2 contains an overview of the number of named entities in each data file.

English data       LOC   MISC  ORG   PER
Training set       7140  3438  6321  6600
Development set    1837   922  1341  1842
Test set           1668   702  1661  1617

German data        LOC   MISC  ORG   PER
Training set       4363  2288  2427  2773
Development set    1181  1010  1241  1401
Test set           1035   670   773  1195

Table 2: Number of named entities per data file.
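To make the format concrete, a minimal reader for this four-column layout could look as follows. The sketch is illustrative only: the function name, the use of plain whitespace splitting and the absence of any handling for document-boundary markers are our own assumptions, not part of the official data distribution.

from typing import List, Tuple

Token = Tuple[str, str, str, str]  # word, POS tag, chunk tag, named entity tag

def read_conll(path: str) -> List[List[Token]]:
    """Read a CoNLL-2003 style file: one token per line, four
    whitespace-separated fields, empty lines between sentences."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:                 # empty line marks a sentence boundary
                if current:
                    sentences.append(current)
                    current = []
                continue
            word, pos, chunk, ne = line.split()
            current.append((word, pos, chunk, ne))
    if current:                          # flush a final sentence without a trailing empty line
        sentences.append(current)
    return sentences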
2.4 Evaluation

The performance in this task is measured with the Fβ=1 rate:

    Fβ = ((β² + 1) * precision * recall) / (β² * precision + recall)    (1)

with β=1 (Van Rijsbergen, 1975). Precision is the percentage of named entities found by the learning system that are correct. Recall is the percentage of named entities present in the corpus that are found by the system. A named entity is correct only if it is an exact match of the corresponding entity in the data file.
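This exact-match scoring can be illustrated with a small sketch that extracts (type, start, end) spans from the IOB tags and scores them at the entity level. It is our own reimplementation of the metric as described, not the official shared task evaluation software; the helper names are ours and the B-XXX handling follows the format description above.

def iob_spans(tags):
    """Collect (type, start, end) spans from one sentence's IOB tags.
    A span starts at B-XXX, or at I-XXX when no entity of type XXX is open."""
    spans, start, label = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):   # sentinel closes the last open span
        if tag.startswith("B-") or tag == "O" or (tag.startswith("I-") and tag[2:] != label):
            if label is not None:
                spans.append((label, start, i))
            label, start = (tag[2:], i) if tag != "O" else (None, None)
        # a continuing I-XXX of the same type simply extends the open span
    return spans

def f_score(gold_tags, pred_tags, beta=1.0):
    """Entity-level precision, recall and F(beta) with exact span matching.
    gold_tags and pred_tags are lists of per-sentence IOB tag lists;
    scores are returned as fractions between 0 and 1."""
    gold, pred = set(), set()
    for k, (g, p) in enumerate(zip(gold_tags, pred_tags)):
        gold.update((k,) + s for s in iob_spans(g))
        pred.update((k,) + s for s in iob_spans(p))
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f = (beta ** 2 + 1) * precision * recall / (beta ** 2 * precision + recall)
    return precision, recall, f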
lex pos aff pre ort gaz chu pat cas tri bag quo doc
Florian + + + + + + + - + - - - -
Chieu + + + + + + - - - + - + +
Klein + + + + - - - - - - - - -
Zhang + + + + + + + - - + - - -
Carreras (a) + + + + + + + + - + + - -
Curran + + + + + + - + + - - - -
Mayfield + + + + + - + + - - - + -
Carreras (b) + + + + + - - + - - - - -
McCallum + - - - + + - + - - - - -
Bender + + - + + + + - - - - - -
Munro + + + - - - + - + + + - -
Wu + + + + + + - - - - - - -
Whitelaw - - + + - - - - + - - - -
Hendrickx + + + + + + + - - - - - -
De Meulder + + + - + + + - + - - - -
Hammerton + + - - - + + - - - - - -

Table 3: Main features used by the sixteen systems that participated in the CoNLL-2003 shared task, sorted by performance on the English test data. Aff: affix information (n-grams); bag: bag of words; cas: global case information; chu: chunk tags; doc: global document information; gaz: gazetteers; lex: lexical features; ort: orthographic information; pat: orthographic patterns (like Aa0); pos: part-of-speech tags; pre: previously predicted NE tags; quo: flag signalling that the word is between quotes; tri: trigger words.

3 Participating Systems

Sixteen systems have participated in the CoNLL-2003 shared task. They employed a wide variety of machine learning techniques as well as system combination. Most of the participants have attempted to use information other than the available training data. This information included gazetteers and unannotated data, and there was one participant who used the output of externally trained named entity recognition systems.

3.1 Learning techniques

The most frequently applied technique in the CoNLL-2003 shared task is the Maximum Entropy Model. Five systems used this statistical learning method. Three systems used Maximum Entropy Models in isolation (Bender et al., 2003; Chieu and Ng, 2003; Curran and Clark, 2003). Two more systems used them in combination with other techniques (Florian et al., 2003; Klein et al., 2003). Maximum Entropy Models seem to be a good choice for this kind of task: the top three results for English and the top two results for German were obtained by participants who employed them in one way or another.

Hidden Markov Models were employed by four of the systems that took part in the shared task (Florian et al., 2003; Klein et al., 2003; Mayfield et al., 2003; Whitelaw and Patrick, 2003). However, they were always used in combination with other learning techniques. Klein et al. (2003) also applied the related Conditional Markov Models for combining classifiers.

Learning methods that were based on connectionist approaches were applied by four systems. Zhang and Johnson (2003) used robust risk minimization, which is a Winnow technique. Florian et al. (2003) employed the same technique in a combination of learners. Voted perceptrons were applied to the shared task data by Carreras et al. (2003a), and Hammerton used a recurrent neural network (Long Short-Term Memory) for finding named entities.

Other learning approaches were employed less frequently. Two teams used AdaBoost.MH (Carreras et al., 2003b; Wu et al., 2003) and two other groups employed memory-based learning (De Meulder and Daelemans, 2003; Hendrickx and Van den Bosch, 2003). Transformation-based learning (Florian et al., 2003), Support Vector Machines (Mayfield et al., 2003) and Conditional Random Fields (McCallum and Li, 2003) were applied by one system each.
Combination of different learning systems has proven to be a good method for obtaining excellent results. Five participating groups have applied system combination. Florian et al. (2003) tested different methods for combining the results of four systems and found that robust risk minimization worked best. Klein et al. (2003) employed a stacked learning system which contains Hidden Markov Models, Maximum Entropy Models and Conditional Markov Models. Mayfield et al. (2003) stacked two learners and obtained better performance. Wu et al. (2003) applied both stacking and voting to three learners. Munro et al. (2003) employed both voting and bagging for combining classifiers.

3.2 Features

The choice of the learning approach is important for obtaining a good system for recognizing named entities. However, in the CoNLL-2002 shared task we found out that the choice of features is at least as important. An overview of some of the types of features chosen by the shared task participants can be found in Table 3.

All participants used lexical features (words) except for Whitelaw and Patrick (2003), who implemented a character-based method. Most of the systems employed part-of-speech tags and two of them have recomputed the English tags with better taggers (Hendrickx and Van den Bosch, 2003; Wu et al., 2003). Orthographic information, affixes, gazetteers and chunk information were also incorporated in most systems, although one group reports that the available chunking information did not help (Wu et al., 2003). Other features were used less frequently. Table 3 does not reveal a single feature that would be ideal for named entity recognition.
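Table 3 lists orthographic patterns (like Aa0) among the features. Purely as an illustration, one plausible way to compute such a pattern from a token is sketched below; the exact pattern definitions used by the individual systems are not specified in this paper, so this mapping is an assumption.

def ortho_pattern(word: str, collapse: bool = True) -> str:
    """Map a token to a coarse orthographic pattern: uppercase letters become
    'A', lowercase letters 'a', digits '0' and everything else '-'. With
    collapse=True, runs of the same symbol are reduced to a single symbol."""
    if not word:
        return ""
    symbols = []
    for ch in word:
        if ch.isupper():
            symbols.append("A")
        elif ch.islower():
            symbols.append("a")
        elif ch.isdigit():
            symbols.append("0")
        else:
            symbols.append("-")
    if collapse:
        collapsed = [symbols[0]]
        for s in symbols[1:]:
            if s != collapsed[-1]:
                collapsed.append(s)
        symbols = collapsed
    return "".join(symbols)

# Examples: ortho_pattern("Baghdad") == "Aa", ortho_pattern("U.N.") == "A-A-",
# and ortho_pattern("Win95") == "Aa0", the pattern mentioned in Table 3.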
3.3 External resources

Eleven of the sixteen participating teams have attempted to use information other than the training data that was supplied for this shared task. All included gazetteers in their systems. Four groups examined the usability of unannotated data, either for extracting training instances (Bender et al., 2003; Hendrickx and Van den Bosch, 2003) or for obtaining extra named entities for gazetteers (De Meulder and Daelemans, 2003; McCallum and Li, 2003). A reasonable number of groups have also employed unannotated data for obtaining capitalization features for words. One participating team has used externally trained named entity recognition systems for English as a part of a combined system (Florian et al., 2003).

Table 4 shows the error reduction of the systems with extra information compared with using only the available training data. The inclusion of extra named entity recognition systems seems to have worked well (Florian et al., 2003). Generally, the systems that only used gazetteers seem to gain more than systems that have used unannotated data for purposes other than obtaining capitalization information. However, the gain differences between the two approaches are most obvious for English, for which better gazetteers are available. With the exception of the result of Zhang and Johnson (2003), there is not much difference in the German results between the gains obtained by using gazetteers and those obtained by using unannotated data.

                  G  U  E  English  German
Zhang             +  -  -    19%      15%
Florian           +  -  +    27%       5%
Chieu             +  -  -    17%       7%
Hammerton         +  -  -    22%       -
Carreras (a)      +  -  -    12%       8%
Hendrickx         +  +  -     7%       5%
De Meulder        +  +  -     8%       3%
Bender            +  +  -     3%       6%
Curran            +  -  -     1%       -
McCallum          +  +  -     ?        ?
Wu                +  -  -     ?        ?

Table 4: Error reduction for the two development data sets when using extra information like gazetteers (G), unannotated data (U) or externally developed named entity recognizers (E). The lines have been sorted by the sum of the reduction percentages for the two languages.

3.4 Performances

A baseline rate was computed for the English and the German test sets. It was produced by a system which only identified entities which had a unique class in the training data. If a phrase was part of more than one entity, the system would select the longest one. All systems that participated in the shared task have outperformed the baseline system.
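This baseline can be sketched as follows. It is our reading of the description above (phrases with a unique class in the training data, longest match preferred); details such as the left-to-right matching order, the length cap and the use of plain I-XXX output tags are assumptions. Spans in the expected (label, start, end) form can be produced with the iob_spans helper sketched earlier.

def build_unique_phrases(training_sentences):
    """Map each entity phrase seen in training to its classes and keep only
    the phrases that always carry the same class.
    training_sentences is a list of (tokens, spans) pairs."""
    classes = {}
    for tokens, spans in training_sentences:
        for label, start, end in spans:
            classes.setdefault(tuple(tokens[start:end]), set()).add(label)
    return {phrase: labels.pop() for phrase, labels in classes.items() if len(labels) == 1}

def baseline_tag(tokens, unique_phrases, max_len=10):
    """Tag one sentence by matching known unambiguous phrases left to right,
    preferring the longest match at each position."""
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        match = None
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = tuple(tokens[i:i + length])
            if phrase in unique_phrases:
                match = (unique_phrases[phrase], length)
                break
        if match is None:
            i += 1
        else:
            label, length = match
            tags[i:i + length] = ["I-" + label] * length
            i += length
    return tags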
For all the Fβ=1 rates we have estimated significance boundaries by using bootstrap resampling (Noreen, 1989). From each output file of a system, 250 random samples of sentences have been chosen and the distribution of the Fβ=1 rates in these samples is assumed to be the distribution of the performance of the system. We assume that performance A is significantly different from performance B if A is not within the center 90% of the distribution of B.
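A minimal sketch of this procedure, assuming that each of the 250 samples redraws whole sentences with replacement and is scored with the entity-level f_score function sketched after Section 2.4:

import random

def bootstrap_f_scores(gold_sents, pred_sents, samples=250, seed=0):
    """Estimate a distribution of F(beta=1) rates by resampling whole
    sentences (with replacement) from one system's output."""
    rng = random.Random(seed)
    n = len(gold_sents)
    scores = []
    for _ in range(samples):
        indices = [rng.randrange(n) for _ in range(n)]
        gold = [gold_sents[i] for i in indices]
        pred = [pred_sents[i] for i in indices]
        scores.append(f_score(gold, pred)[2])   # f_score as sketched in Section 2.4
    return sorted(scores)

def significantly_different(score_a, distribution_b, coverage=0.90):
    """Performance A differs from B if A lies outside the central 90% of B's distribution."""
    lower = distribution_b[int((1.0 - coverage) / 2 * len(distribution_b))]
    upper = distribution_b[int((1.0 + coverage) / 2 * len(distribution_b)) - 1]
    return score_a < lower or score_a > upper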
The performances of the sixteen systems on the two test data sets can be found in Table 5. For English, the combined classifier of Florian et al. (2003) achieved the highest overall Fβ=1 rate. However, the difference between their performance and that of the Maximum Entropy approach of Chieu and Ng (2003) is not significant. An important feature of the best system that other participants did not use was the inclusion of the output of two externally trained named entity recognizers in the combination process. Florian et al. (2003) have also obtained the highest Fβ=1 rate for the German data. Here there is no significant difference between them and the systems of Klein et al. (2003) and Zhang and Johnson (2003).

English test   Precision  Recall    Fβ=1
Florian          88.99%   88.54%  88.76±0.7
Chieu            88.12%   88.51%  88.31±0.7
Klein            85.93%   86.21%  86.07±0.8
Zhang            86.13%   84.88%  85.50±0.9
Carreras (a)     84.05%   85.96%  85.00±0.8
Curran           84.29%   85.50%  84.89±0.9
Mayfield         84.45%   84.90%  84.67±1.0
Carreras (b)     85.81%   82.84%  84.30±0.9
McCallum         84.52%   83.55%  84.04±0.9
Bender           84.68%   83.18%  83.92±1.0
Munro            80.87%   84.21%  82.50±1.0
Wu               82.02%   81.39%  81.70±0.9
Whitelaw         81.60%   78.05%  79.78±1.0
Hendrickx        76.33%   80.17%  78.20±1.0
De Meulder       75.84%   78.13%  76.97±1.2
Hammerton        69.09%   53.26%  60.15±1.3
Baseline         71.91%   50.90%  59.61±1.2

German test    Precision  Recall    Fβ=1
Florian          83.87%   63.71%  72.41±1.3
Klein            80.38%   65.04%  71.90±1.2
Zhang            82.00%   63.03%  71.27±1.5
Mayfield         75.97%   64.82%  69.96±1.4
Carreras (a)     75.47%   63.82%  69.15±1.3
Bender           74.82%   63.82%  68.88±1.3
Curran           75.61%   62.46%  68.41±1.4
McCallum         75.97%   61.72%  68.11±1.4
Munro            69.37%   66.21%  67.75±1.4
Carreras (b)     77.83%   58.02%  66.48±1.5
Wu               75.20%   59.35%  66.34±1.3
Chieu            76.83%   57.34%  65.67±1.4
Hendrickx        71.15%   56.55%  63.02±1.4
De Meulder       63.93%   51.86%  57.27±1.6
Whitelaw         71.05%   44.11%  54.43±1.4
Hammerton        63.49%   38.25%  47.74±1.5
Baseline         31.86%   28.89%  30.30±1.3

Table 5: Overall precision, recall and Fβ=1 rates obtained by the sixteen participating systems on the test data sets for the two languages in the CoNLL-2003 shared task.

We have combined the results of the sixteen systems in order to see if there was room for improvement. We converted the output of the systems to the same IOB tagging representation and searched for the set of systems from which the best tags for the development data could be obtained with majority voting. The optimal set of systems was determined by performing a bidirectional hill-climbing search (Caruana and Freitag, 1994) with beam size 9, starting from zero features. A majority vote of five systems (Chieu and Ng, 2003; Florian et al., 2003; Klein et al., 2003; McCallum and Li, 2003; Whitelaw and Patrick, 2003) performed best on the English development data. Another combination of five systems (Carreras et al., 2003b; Mayfield et al., 2003; McCallum and Li, 2003; Munro et al., 2003; Zhang and Johnson, 2003) obtained the best result for the German development data. We have performed a majority vote with these sets of systems on the related test sets and obtained Fβ=1 rates of 90.30 for English (14% error reduction compared with the best system) and 74.17 for German (6% error reduction).
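For illustration, a token-level majority vote over the converted IOB outputs could look like the sketch below. The tie-breaking rule is our own assumption, and the bidirectional hill-climbing search over subsets of systems is not shown.

from collections import Counter

def majority_vote(tag_sequences):
    """Combine several systems' IOB tag sequences for one sentence by
    choosing, at each token position, the tag predicted most often.
    Ties are broken in favour of O, then the alphabetically first tag
    (an assumption; the paper does not specify a tie-breaking rule)."""
    voted = []
    for position_tags in zip(*tag_sequences):
        counts = Counter(position_tags)
        best_count = max(counts.values())
        candidates = sorted(tag for tag, count in counts.items() if count == best_count)
        voted.append("O" if "O" in candidates else candidates[0])
    return voted

# Example: three systems voting on a four-token sentence.
# majority_vote([["I-ORG", "O", "I-PER", "O"],
#                ["I-ORG", "O", "O",     "O"],
#                ["I-LOC", "O", "I-PER", "O"]]) == ["I-ORG", "O", "I-PER", "O"]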
4 Concluding Remarks

We have described the CoNLL-2003 shared task: language-independent named entity recognition. Sixteen systems have processed English and German named entity data. The best performance for both languages has been obtained by a combined learning system that used Maximum Entropy Models, transformation-based learning, Hidden Markov Models as well as robust risk minimization (Florian et al., 2003). Apart from the training data, this system also employed gazetteers and the output of two externally trained named entity recognizers. The performance of the system of Chieu and Ng (2003) was not significantly different from the best performance for English, and the method of Klein et al. (2003) and the approach of Zhang and Johnson (2003) were not significantly worse than the best result for German.

Eleven teams have incorporated information other than the training data in their system. Four of them have obtained error reductions of 15% or more for English and one has managed this for German. The resources used by these systems, gazetteers and externally trained named entity systems, still require a lot of manual work. Systems that employed unannotated data obtained performance gains of around 5%. The search for an excellent method for taking advantage of the vast amount of available raw text remains open.

Acknowledgements

Tjong Kim Sang is financed by IWT STWW as a researcher in the ATraNoS project. De Meulder is supported by a BOF grant supplied by the University of Antwerp.

References
Oliver Bender, Franz Josef Och, and Hermann Ney. 2003. Maximum Entropy Models for Named Entity Recognition. In Proceedings of CoNLL-2003.

Andrew Borthwick. 1999. A Maximum Entropy Approach to Named Entity Recognition. PhD thesis, New York University.

Xavier Carreras, Lluís Màrquez, and Lluís Padró. 2003a. Learning a Perceptron-Based Named Entity Chunker via Online Recognition Feedback. In Proceedings of CoNLL-2003.

Xavier Carreras, Lluís Màrquez, and Lluís Padró. 2003b. A Simple Named Entity Extractor using AdaBoost. In Proceedings of CoNLL-2003.

Rich Caruana and Dayne Freitag. 1994. Greedy Attribute Selection. In Proceedings of the Eleventh International Conference on Machine Learning, pages 28-36. New Brunswick, NJ, USA, Morgan Kaufmann.

Hai Leong Chieu and Hwee Tou Ng. 2003. Named Entity Recognition with a Maximum Entropy Approach. In Proceedings of CoNLL-2003.

Nancy Chinchor, Erica Brown, Lisa Ferro, and Patty Robinson. 1999. 1999 Named Entity Recognition Task Definition. MITRE and SAIC.

James R. Curran and Stephen Clark. 2003. Language Independent NER using a Maximum Entropy Tagger. In Proceedings of CoNLL-2003.

Walter Daelemans, Jakub Zavrel, Ko van der Sloot, and Antal van den Bosch. 2002. MBT: Memory-Based Tagger, version 1.0, Reference Guide. ILK Technical Report ILK-0209, University of Tilburg, The Netherlands.

Fien De Meulder and Walter Daelemans. 2003. Memory-Based Named Entity Recognition using Unannotated Data. In Proceedings of CoNLL-2003.

Radu Florian, Abe Ittycheriah, Hongyan Jing, and Tong Zhang. 2003. Named Entity Recognition through Classifier Combination. In Proceedings of CoNLL-2003.

James Hammerton. 2003. Named Entity Recognition with Long Short-Term Memory. In Proceedings of CoNLL-2003.

Iris Hendrickx and Antal van den Bosch. 2003. Memory-based one-step named-entity recognition: Effects of seed list features, classifier stacking, and unannotated data. In Proceedings of CoNLL-2003.

Dan Klein, Joseph Smarr, Huy Nguyen, and Christopher D. Manning. 2003. Named Entity Recognition with Character-Level Models. In Proceedings of CoNLL-2003.

James Mayfield, Paul McNamee, and Christine Piatko. 2003. Named Entity Recognition using Hundreds of Thousands of Features. In Proceedings of CoNLL-2003.

Andrew McCallum and Wei Li. 2003. Early results for Named Entity Recognition with Conditional Random Fields, Feature Induction and Web-Enhanced Lexicons. In Proceedings of CoNLL-2003.

Robert Munro, Daren Ler, and Jon Patrick. 2003. Meta-Learning Orthographic and Contextual Models for Language Independent Named Entity Recognition. In Proceedings of CoNLL-2003.

Eric W. Noreen. 1989. Computer-Intensive Methods for Testing Hypotheses. John Wiley & Sons.

Lance A. Ramshaw and Mitchell P. Marcus. 1995. Text Chunking Using Transformation-Based Learning. In Proceedings of the Third ACL Workshop on Very Large Corpora, pages 82-94. Cambridge, MA, USA.

Helmut Schmid. 1995. Improvements in Part-of-Speech Tagging with an Application to German. In Proceedings of EACL-SIGDAT 1995. Dublin, Ireland.

Erik F. Tjong Kim Sang. 2002. Introduction to the CoNLL-2002 Shared Task: Language-Independent Named Entity Recognition. In Proceedings of CoNLL-2002, pages 155-158. Taipei, Taiwan.

C.J. van Rijsbergen. 1975. Information Retrieval. Butterworths.

Casey Whitelaw and Jon Patrick. 2003. Named Entity Recognition Using a Character-based Probabilistic Approach. In Proceedings of CoNLL-2003.

Dekai Wu, Grace Ngai, and Marine Carpuat. 2003. A Stacked, Voted, Stacked Model for Named Entity Recognition. In Proceedings of CoNLL-2003.

Tong Zhang and David Johnson. 2003. A Robust Risk Minimization based Named Entity Recognition System. In Proceedings of CoNLL-2003.
