CS6113
Semantic Computing
Tagging Data with XML
Dr. Mohammad Abdul Qadir
aqadir@cust.edu.pk
The Tree Model of XML Documents:
An Example
<email>
  <head>
    <from name="Michael Maher" address="michaelmaher@cs.gu.edu.au"/>
    <to name="Grigoris Antoniou" address="grigoris@cs.unibremen.de"/>
    <subject>Where is your draft?</subject>
  </head>
  <body>
    Grigoris, where is the draft of the paper you promised me last week?
  </body>
</email>
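To make the tree model concrete, here is a minimal sketch (using Python's standard library, not part of the original slides) that parses the email document above and prints its element tree: tags, attributes, and text content.

import xml.etree.ElementTree as ET

# The email document from the slide, as a string.
EMAIL_XML = """<email>
  <head>
    <from name="Michael Maher" address="michaelmaher@cs.gu.edu.au"/>
    <to name="Grigoris Antoniou" address="grigoris@cs.unibremen.de"/>
    <subject>Where is your draft?</subject>
  </head>
  <body>Grigoris, where is the draft of the paper you promised me last week?</body>
</email>"""

def print_tree(element, depth=0):
    # Each XML element is a node in the tree; children are nested elements.
    text = (element.text or "").strip()
    print("  " * depth + element.tag, element.attrib, text)
    for child in element:
        print_tree(child, depth + 1)

print_tree(ET.fromstring(EMAIL_XML))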
Tagging with XML
Information extraction from unstructured documents, and then tagging of the extracted information:
Find and understand limited relevant parts of texts
Gather information from many pieces of text
Produce a semi-structured representation in XML
Named Entity Recognition (NER)
A very important sub-task: find and classify names in text, for example:
names of persons,
names of organizations,
names of geographical locations (countries, cities),
dates,
products, ...
NER Example
Salma lives in Rawalpindi and is studying Computer Science at Capital University of Science & Technology in 2019. She is a part-time worker at a call center in Islamabad.

<person>Salma</person> lives in <location>Rawalpindi</location> and is studying Computer Science at <organization>Capital University of Science & Technology</organization> in <date>2019</date>. She is a part-time worker at a call center in <location>Islamabad</location>.
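As an illustration only (not the tool used in this course), the following sketch produces this kind of tagging with the off-the-shelf spaCy NER model; it assumes the en_core_web_sm model is installed, and the tag names in TAG_MAP are our own choice.

import spacy

# Map spaCy's entity labels onto the tag names used in the example above.
TAG_MAP = {"PERSON": "person", "GPE": "location", "LOC": "location",
           "ORG": "organization", "DATE": "date"}

def tag_entities(text):
    nlp = spacy.load("en_core_web_sm")   # assumes this model is installed
    doc = nlp(text)
    out, last = [], 0
    for ent in doc.ents:
        tag = TAG_MAP.get(ent.label_)
        if tag is None:
            continue                     # skip entity types we do not map
        out.append(text[last:ent.start_char])
        out.append("<%s>%s</%s>" % (tag, ent.text, tag))
        last = ent.end_char
    out.append(text[last:])
    return "".join(out)

print(tag_entities("Salma lives in Rawalpindi and is studying Computer "
                   "Science at Capital University of Science & Technology."))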
Evaluation of NER
Precision, Recall, and the F measure
The 2x2 evaluation table:

             | correct | not correct
selected     | tp      | fp
not selected | fn      | tn

Precision: % of selected items that are correct = tp / (tp + fp)
Recall: % of correct items that are selected = tp / (tp + fn)
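A minimal sketch of these two measures in code, with made-up counts (tp, fp, fn) used purely for illustration:

def precision(tp, fp):
    # Fraction of the items the system selected that are actually correct.
    return tp / (tp + fp)

def recall(tp, fn):
    # Fraction of the truly correct items that the system managed to select.
    return tp / (tp + fn)

tp, fp, fn = 8, 2, 4          # hypothetical counts, for illustration only
print(precision(tp, fp))      # 0.8  -> 80% of selected items are correct
print(recall(tp, fn))         # 0.666... -> about 67% of correct items were selected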
A combined measure: F
A combined measure that assesses the P/R tradeoff is the F measure (a weighted harmonic mean):

F = 1 / (α(1/P) + (1 - α)(1/R)) = (β² + 1)PR / (β²P + R)

The harmonic mean is a very conservative average.
People usually use the balanced F1 measure, with β = 1 (that is, α = ½):

F1 = 2PR / (P + R)
A combined measure: F
P = 40%, R = 40%: F = ?
P = 75%, R = 25%: F = ?
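Working out the two cases above with F1 = 2PR/(P + R), as a quick check:

def f1(p, r):
    # Balanced F measure: harmonic mean of precision and recall.
    return 2 * p * r / (p + r)

print(f1(0.40, 0.40))   # 0.40  -> F1 = 40%
print(f1(0.75, 0.25))   # 0.375 -> F1 = 37.5%; the harmonic mean stays close
                        # to the lower of the two values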
Accuracy
Accuracy: % of all items (selected or not) that are correctly classified = (tp + tn) / (tp + fp + fn + tn)
OKE (Open Knowledge Extraction) Challenge
The Message Understanding Conference (MUC) was an annual event/competition at which results were presented.
It focused on extracting information from news articles about:
Terrorist events
Industrial joint ventures
Company management changes
NER
Typically, NER demands optimally combining a variety of clues, including:
orthographic features,
parts of speech,
similarity with existing databases of entities,
presence of specific signature words, and so on.
Methods for NER
Hand-written regular expressions
Example: finding (US) phone numbers (usage sketch after this list)
(?:\(?[0-9]{3}\)?[ .-])?[0-9]{3}[ .-]?[0-9]{4}
Develop rules
Using classifiers
Sequence models
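A minimal usage sketch of the hand-written pattern above with Python's re module (the sample text and numbers are made up):

import re

# Hand-written pattern for US-style phone numbers, as on the slide.
PHONE = re.compile(r"(?:\(?[0-9]{3}\)?[ .-])?[0-9]{3}[ .-]?[0-9]{4}")

text = "Call (415) 555-0123 or 650.555.0199 for details."
print(PHONE.findall(text))   # ['(415) 555-0123', '650.555.0199']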
CustNER
[Architecture diagram: the input text is annotated separately by Stanford NER, Illinois NER, and DBpedia Spotlight, each producing an annotated text; CustNER's pre-processor and rule engine then combine the three annotated texts, consulting DBpedia, to produce the final named entities.]
Pre-Processor: the lists of entities produced by the annotators contain some apparent false positives such as "he", "his", "goes", "the", which need to be removed (a minimal filtering sketch follows).
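A minimal sketch of such a filtering step, assuming the annotators' output has been collected into a plain list of mentions; the stop list here is illustrative only, not the actual CustNER list:

# Illustrative stop list of apparent false positives (pronouns, verbs, determiners).
FALSE_POSITIVES = {"he", "his", "she", "her", "it", "goes", "the", "a", "an"}

def preprocess(mentions):
    # Keep only mentions that are not in the false-positive stop list.
    return [m for m in mentions if m.lower() not in FALSE_POSITIVES]

print(preprocess(["Trump", "he", "Dublin City Council", "goes", "the"]))
# ['Trump', 'Dublin City Council']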
Rule 1 – Deciding the Boundary and Type of e from three Annotations

Text | Annotation by Stanford NER | Annotation by Illinois NER | Annotation by DBpedia Spotlight | Annotation selected by CustNER
White House National Trade Council | loc: White House | org: White House National Trade Council | thing: National Trade Council | org: White House National Trade Council
Mr Trump | per: Trump | per: Trump | surname: Mr Trump | per: Trump
Dublin City Council | loc: Dublin City | org: Dublin City Council | org: Dublin City Council | org: Dublin City Council
The Coming China Wars | misc: The Coming China Wars | loc: China | book: The Coming China Wars | loc: China
UK government | loc: UK | loc: UK | org: UK government | org: UK government
US President-elect | loc: US | title: President-elect | per: US President-elect | per: US President-elect
Rule 2 – Addition of Entities Recognized by Stanford or Illinois NER
Rule 3 – Checking around Title Entity
Rule 4 – Expanding Nationality Entities
Rule 5 – Addition of Mentions Having Corresponding DBpedia Resources
Rule 6 – For Recognizing Acronyms (a sketch follows after this list)
Rule 7 – For Adding Re-Occurrences of Added Entities
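One plausible reading of Rule 6, sketched for illustration only (this is not the published CustNER rule engine): once a multi-word entity has been recognized, occurrences of its acronym elsewhere in the text are added with the same type.

import re

def acronym(name):
    # Build an acronym from the initial letters, skipping common function words.
    stop = {"of", "the", "and", "for", "de"}
    return "".join(w[0].upper() for w in name.split()
                   if w[0].isalpha() and w.lower() not in stop)

def add_acronym_entities(text, entities):
    # entities: list of (mention, type) pairs already recognized in the text.
    found = dict(entities)
    for mention, etype in entities:
        acr = acronym(mention)
        if len(acr) >= 2 and re.search(r"\b" + re.escape(acr) + r"\b", text):
            found[acr] = etype
    return found

text = "Salma studies at Capital University of Science & Technology. CUST is in Islamabad."
print(add_acronym_entities(text,
      [("Capital University of Science & Technology", "org")]))
# {'Capital University of Science & Technology': 'org', 'CUST': 'org'}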
Example incorrect annotations in the OKE dataset and the corrections made

Named Entity | Previous Annotation | Corrected Annotation | Comments
Irish | 1 | 0 | "Irish" is not a person, organization or location; it is a nationality, so it is removed from the dataset.
Korean | 1 | org: Korean Air | "Korean" is a nationality, but the text actually has "Korean Air", which is an organization.
Yonhap news agency | 1 | org: Yonhap | "Yonhap" is the name of the organization, not "Yonhap news agency".
Ministry of Defence | 0 | 1: org | "Ministry of Defence" is an organization.
Russia | 0 | 1: loc | "Russia" is a location.
Paul Pogba's | 1 | per: Paul Pogba | "'s" is not part of the person name.
King Koopa | 1 | 0 | "King Koopa" is a turtle-like fictional character and not a person, location or organization.
legendary cryptanalyst Alan Turing | 1 | per: Alan Turing | "legendary cryptanalyst" is not part of the person name.
Santa | 0 | per: Santa | "Santa" or "Santa Claus" is a human fictional character.
U.S. | 0 | 1: loc | "U.S." is a location named entity.
Joker | 0 | 1: per | "Joker" is a person fictional character.
Persian army | 0 | 1: org | "Persian army" is the name of an organization.
Greenwich Village, Manhattan, New York City | 1 | loc: Greenwich Village, loc: Manhattan, loc: New York City | This entity has been broken down into three location entities: "Greenwich Village", "Manhattan" and "New York City".
FIFA | 0 | 1: org | "FIFA" is an acronym of an organization.
Results comparison of NE recognition task on the OKE evaluation dataset

Annotator | Weak Annotation Match (Precision / Recall / F1) | Strong Annotation Match (Precision / Recall / F1)
Stanford NER | 74.94 / 85.22 / 79.75 | 68.75 / 72.04 / 70.36
Illinois NER | 94.66 / 84.17 / 89.11 | 86.14 / 77.45 / 81.56
CustNER | 92.13 / 92.37 / 92.25 | 85.64 / 83.42 / 84.51
Results comparison of NE recognition and classification task on the OKE evaluation dataset

Annotator | Weak Annotation Match (Precision / Recall / F1 micro) | Strong Annotation Match (Precision / Recall / F1 micro)
Stanford NER | 68.91 / 78.36 / 73.33 | 64.18 / 67.25 / 65.68
Illinois NER | 85.76 / 76.25 / 80.73 | 79.94 / 71.88 / 75.70
CustNER | 83.73 / 83.95 / 83.84 | 80.27 / 77.98 / 79.11
Results comparison of strong annotation match for each type on the OKE evaluation dataset

Annotator | person (Precision / Recall / MicroF) | location (Precision / Recall / MicroF) | organization (Precision / Recall / MicroF)
Stanford NER | 62.56 / 72.11 / 66.99 | 64.71 / 64.23 / 64.47 | 68.85 / 60.00 / 64.12
Illinois NER | 87.01 / 74.86 / 80.48 | 80.99 / 77.78 / 79.35 | 60.94 / 54.17 / 57.35
CustNER | 85.88 / 83.52 / 84.68 | 80.49 / 75.57 / 77.95 | 66.67 / 68.49 / 67.57
Results comparison of NE recognition task on the CoNLL03 evaluation dataset

Annotator | Weak Annotation Match (Precision / Recall / F1) | Strong Annotation Match (Precision / Recall / F1)
Stanford NER | 86.33 / 94.66 / 90.30 | 86.28 / 87.72 / 86.99
Illinois NER | 95.70 / 95.29 / 95.49 | 98.05 / 91.20 / 94.50
CustNER | 90.98 / 97.70 / 94.22 | 91.80 / 91.31 / 91.55
Assignment: NER
1. Gather small paragraphs from the web with entities of your interest (at least ten)
2. Mark the entities in these paragraphs with relevant domain-specific tags
3. Use publicly available NER systems to tag these paragraphs
4. Tabulate the results
5. Compute P, R, F1 for each paragraph and each NER system (see the sketch after this list)
6. Compute the average P, R, F1 and then give your opinion in the discussion forum
7. Submit your report to the Assignment Folder before the next class
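A minimal sketch of one way to tabulate and average the scores for steps 4 to 6; the system names, paragraph counts, and numbers below are placeholders, not real results:

from statistics import mean

def f1(p, r):
    # Balanced F measure; returns 0 when both P and R are 0.
    return 2 * p * r / (p + r) if (p + r) else 0.0

# (precision, recall) per paragraph for each NER system -- placeholder values.
scores = {
    "system_A": [(0.80, 0.67), (0.75, 0.60), (0.90, 0.70)],
    "system_B": [(0.90, 0.50), (0.85, 0.70), (0.80, 0.65)],
}

for system, rows in scores.items():
    ps = [p for p, _ in rows]
    rs = [r for _, r in rows]
    fs = [f1(p, r) for p, r in rows]
    print("%s: avg P=%.2f, avg R=%.2f, avg F1=%.2f"
          % (system, mean(ps), mean(rs), mean(fs)))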