CHAPTER 1
INTRODUCTION
CHAPTER OVERVIEW
This chapter gives an idea of natural language processing (NLP) and information retrieval. Various levels of analysis involved in analysing text, along with the specific knowledge used by these levels of analysis, are discussed. Some of the factors that make automatic processing of languages difficult are also touched upon. The chapter underlines the role of grammar in language processing and introduces transformational grammar. Indian languages differ a lot from English. These differences are clearly pointed out. Further, a number of NLP applications are introduced along with some of the early NLP systems. Towards the end, information retrieval is discussed.
1.1 WHAT IS NATURAL LANGUAGE PROCESSING (NLP)
Language is the primary means of communication used by humans. It is
the tool we use to express the greater part of our ideas and emotions. It
shapes thought, has a structure, and carries meaning. Learning new
concepts and expressing ideas through them is so natural that we hardly
realize how we process natural language. But there must be some kind of representation in our mind of the content of language. When we want to express a thought, this content is rendered in language. As children, we never learn a computational model of language, yet this is the first step in the automatic processing of languages. Natural language processing (NLP) is concerned with the development of computational models of aspects of human language processing. There are two main reasons for such development:
1. To develop automated tools for language processing
2. To gain a better understanding of human communication
Building computational models with human language-processing abilities requires a knowledge of how humans acquire, store, and process language. It also requires a knowledge of the world and of language.
Historically, there have been two major approaches to NLP: the rationalist approach and the empiricist approach. Early NLP research followed the rationalist approach, which assumes the existence of some language faculty in the human brain. Supporters of this approach argue that it is not possible for children to learn a complex thing like natural language from limited sensory inputs. Empiricists do not believe in the existence of a language faculty. Instead, they believe in the existence of some general organization principles such as pattern recognition, generalization, and association. Learning of detailed structures can, therefore, take place through the application of these principles on the sensory inputs available to the child.
1.2 ORIGINS OF NLP
Natural language processing, sometimes mistakenly termed natural language understanding, originated from machine translation research. While natural language understanding involves only the interpretation of language, natural language processing includes both understanding (interpretation) and generation (production). NLP also includes speech processing. However, in this book, we are concerned with text processing only, covering work in the area of computational linguistics, and the tasks in which NLP has found useful application.
Computational linguistics is similar to theoretical linguistics and psycholinguistics, but it uses different tools. Theoretical linguists mainly provide a structural description of natural language and its semantics. They are not concerned with the actual processing of sentences or the generation of sentences from structural descriptions. They are in a quest for principles that remain common across languages and identify rules that capture linguistic generalizations. For example, most languages have constructs like noun and verb phrases. Theoretical linguists identify rules that describe and restrict the structure of languages (grammar). Psycholinguists explain how humans produce and comprehend natural language. Unlike theoretical linguists, they are interested in the representation of linguistic structures as well as in the process by which these structures are produced. They depend primarily on empirical investigations to back up their theories.
Computational linguistics is concerned with the study of language using computational models of linguistic phenomena. It deals with the application of computational methods to linguistic problems. In computational linguistics, representing a language is a major problem; most representations tackle only a small part of knowledge.
Representing the whole body of knowledge is almost impossible. The words knowledge and language should not be confused. This is discussed in detail in Section 1.3.
Computational models may be broadly classified under knowledge-driven and data-driven categories. Knowledge-driven systems rely on explicitly coded linguistic knowledge, often expressed as a set of handcrafted grammar rules. Acquiring and encoding such knowledge is difficult and is the main bottleneck in the development of such systems. They are, therefore, often constrained by the lack of sufficient coverage of domain knowledge. Data-driven approaches presume the existence of a large amount of data and usually employ some machine learning technique to learn syntactic patterns. The amount of human effort is less, and the performance of these systems is dependent on the quantity of the data. These systems are usually adaptive to noisy data.
As mentioned earlier, this book is mainly concerned with computational
linguistics approaches. We try to achieve a balance between semantic
(knowledge-driven) and data-driven approaches on one hand, and between
theory and practice on the other. It is at this point that the book differs
significantly from other textbooks in this area. The tools and techniques
have been covered to the extent that is needed to build sufficient
understanding of the domain and to provide a base for application.
NLP is no longer confined to classroom teaching and a few
traditional applications. With the unprecedented amount of information
now available on the web, NLP has become one of the leading techniques
for processing and retrieving information. In order to cope with these
developments, this book brings together information retrieval with NLP.
The term information retrieval is used here in a broad manner to include a
number of information processing applications such as information
extraction, text summarization, question answering, and so forth. The
distinction between these applications is made in terms of the level of
detail or amount of information retrieved. We consider retrieval of
information as part of processing. The word ‘information’ itself has a
much broader sense. It includes multiple modes of information, including
speech, images, and text. However, it is not possible to cover all these modes due to space constraints. Hence, this book focuses on textual information only.
1.3 LANGUAGE AND KNOWLEDGE
Language is the medium of expression in which knowledge is deciphered. We are not competent enough to define language and knowledge and their implications. We are here considering the text form of the language and the content of it as knowledge.
Language, being a medium of expression, is the outer form of the content it expresses. The same content can be expressed in different languages. But can language be separated from its content? If so, how can the content itself be represented? Generally, the meaning of one language is written in the same language (but with a different set of words). It may also be written in some other, formal, language. Hence, to process a language means to process the content of it. As computers are not able to understand natural language, methods are developed to map its content in a formal language. Sometimes, formal language content may have to be expressed in a natural language as well. Thus, in this book, language is taken up as a knowledge representation tool that has historically
represented the whole body of knowledge and that has been modified,
maybe through generation of new words, to include new ideas and
situations. The language and speech community, on the other hand,
considers a language as a set of sounds that, through combinations, conveys
meaning to a listener. However, we are concerned with representing and
processing text only. Language (text) processing has different levels, each
involving different types of knowledge. We now discuss various levels of
processing and the types of knowledge each involves.
The simplest level of analysis is lexical analysis, which involves analysis
of words. Words are the most fundamental unit (syntactic as well as
semantic) of any natural language text. Word-level processing requires
morphological knowledge, i.e., knowledge about the structure and
formation of words from basic units (morphemes). The rules for forming
words from morphemes are language specific.
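To make this concrete, here is a minimal sketch of rule-based morphological analysis in Python. The suffix rules are toy assumptions for English only; a real analyzer would use a full lexicon and language-specific rule sets.

# Toy suffix-stripping rules: (suffix, replacement appended to the stem)
SUFFIX_RULES = [
    ("ies", "y"),   # e.g., 'cities' -> 'city' + plural suffix
    ("ing", ""),    # e.g., 'playing' -> 'play' + progressive suffix
    ("ed", ""),     # e.g., 'played' -> 'play' + past suffix
    ("s", ""),      # e.g., 'plays' -> 'play' + 3rd-person/plural suffix
]

def analyze(word):
    # Return (stem, suffix) using the first matching rule, else (word, None).
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)] + replacement, suffix
    return word, None

print(analyze("playing"))  # ('play', 'ing')
print(analyze("cities"))   # ('city', 'ies')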
The next level of analysis is syntactic analysis, which considers a sequence of words as a unit, usually a sentence, and finds its structure. Syntactic analysis decomposes a sentence into its constituents (or words) and identifies how they relate to each other. It captures grammaticality or non-grammaticality of sentences by looking at constraints like word order, number, and case agreement. This level of processing requires syntactic knowledge, i.e., knowledge about how words are combined to form larger units such as phrases and sentences, and what constraints are imposed on them. Not every sequence of words results in a sentence. For example, 'I went to the market' is a valid sentence whereas 'went the I market to' is not. Similarly, 'She is going to the market' is valid, but 'She are going to the market' is not. Thus, this level of analysis requires detailed knowledge about rules of grammar.
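As an illustration, the word-order constraint can be checked mechanically with a small context-free grammar. The grammar below is a toy assumption covering only this example (it is not a grammar of English), and it uses the NLTK toolkit.

import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> V PP
PP -> P NP
NP -> 'I' | Det N
Det -> 'the'
N -> 'market'
V -> 'went'
P -> 'to'
""")
parser = nltk.ChartParser(grammar)

def grammatical(sentence):
    # A sentence is accepted if the parser finds at least one parse tree.
    return any(True for _ in parser.parse(sentence.split()))

print(grammatical("I went to the market"))   # True
print(grammatical("went the I market to"))   # False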
Yet another level of analysis is semantic analysis. Semantics is associated with the meaning of the language. Semantic analysis is concerned with creating meaningful representations of linguistic inputs. The general idea of semantic interpretation is to take natural language sentences or utterances and map them onto some representation of meaning. Defining meaning components is difficult, as grammatically valid sentences can be meaningless. One of the famous examples is 'Colorless green ideas sleep furiously' (Chomsky 1957). The sentence is well-formed, i.e., syntactically correct, but semantically anomalous. However, this does not mean that syntax has no role to play in meaning. Bach (2002) considers a language's 'semantics to be a projection of its syntax. That is, semantic structure is interpreted syntactic structure.'
But definitely, syntax is not the only component that contributes to meaning. Our conception of meaning is quite broad. We feel that humans apply all sorts of knowledge (i.e., lexical, syntactic, semantic, discourse, pragmatic, and world knowledge) to arrive at the meaning of a sentence. The starting point in semantic analysis, however, has been lexical semantics (the meaning of words). A word can have a number of possible meanings associated with it. But in a given context, only one of these meanings participates. Finding out the correct meaning of a particular use of a word is necessary for finding the meaning of larger units. However, the meaning of a sentence cannot be composed solely on the basis of the meanings of its words. Consider the following sentences:

Kabir and Ayan are married. (1.1a)
Kabir and Suha are married. (1.1b)

Both sentences have identical structures, and the meanings of the individual words are clear. But most of us end up with two different interpretations. We may interpret the second sentence to mean that Kabir and Suha are married to each other, but this interpretation does not occur for the first sentence. Syntactic structure and compositional semantics fail to explain these interpretations. We make use of pragmatic information. This means that semantic analysis requires pragmatic knowledge besides semantic and syntactic knowledge.
A still higher level of analysis is discourse analysis. Discourse-level processing attempts to interpret the structure and meaning of larger units, e.g., at the paragraph and document level, in terms of words, phrases, clusters, and sentences. It requires the resolution of anaphoric references and identification of discourse structure. It also requires discourse knowledge, that is, knowledge of how the meaning of a sentence is determined by preceding sentences (e.g., how a pronoun refers to the preceding noun), and how to determine the function of a sentence in the text. In fact, pragmatic knowledge may be needed for resolving anaphoric references. For example, in the following sentences, resolving the anaphoric reference 'they' requires pragmatic knowledge:

The district administration refused to give the trade union permission for the meeting because they feared violence. (1.2a)

The district administration refused to give the trade union permission for the meeting because they oppose government. (1.2b)
The highest level of processing is pragmatic analysis, which deals with the purposeful use of sentences in situations. It requires knowledge of the world, knowledge that extends beyond the contents of the text. The Cyc project (Lenat 1986) at the University of Austin is an attempt to utilize world knowledge in NLP. However, its usefulness in a general-domain NLP system is yet to be demonstrated. Furthermore, whether or not semantics can be associated with a symbol manipulator, and whether humans use logic in the same way as the Cyc project does, are both issues of debate.
1.4 THE CHALLENGES OF NLP
There are a number of factors that make NLP difficult. These relate to
the problems of representation and interpretation. Language computing
requires precise representation of content. Given that natural languages
are highly ambiguous and vague, achieving such representation can be
difficult. The inability to capture all the required knowledge is another
source of difficulty. It is almost impossible to embody all sources of
knowledge that humans use to process language. Even if this were done,
it is not possible to write procedures that imitate language processing as
done by humans. In this section, we detail some of the problems associated
with NLP.
Perhaps the greatest source of difficulty in natural language is identifying its semantics. The principle of compositional semantics considers the meaning of a sentence to be a composition of the meanings of the words appearing in it. In the earlier section, we saw a number of examples where this principle failed to work. Our viewpoint is that words alone do not make a sentence. Instead, it is the words as well as their syntactic and semantic relations that give meaning to a sentence. As pointed out by Wittgenstein (1953): 'The meaning of a word is its use in the language.' A language keeps on evolving. New words are added continually and existing words are introduced in new contexts. For example, most newspapers and TV channels use 9/11 to refer to the terrorist act on the World Trade Center in the USA in 2001. When we process written text or spoken utterances, we have access to an underlying mental representation. The only way a machine can learn the meaning of a specific word in a message is by considering its context, unless some explicitly coded general world or domain knowledge is available. The context of a word is defined by the co-occurring words. It includes everything that occurs before or after a word. The frequency of a word being used in a particular sense also affects its meaning. The English word 'while' was initially used to mean 'a short interval of time'. But now it is more in use as a conjunction. None of the usages of 'while' discussed in this chapter correspond to this meaning.
Idioms, metaphor, and ellipses add more complexity to identifying the meaning of the written text. As an example, consider the sentence:

The old man finally kicked the bucket. (1.3)

The meaning of this sentence has nothing to do with the words 'kick' and 'bucket' appearing in it.
Quantifier-scoping is another problem. The scope of quantifiers (the, each, etc.) is often not clear and poses problems in automatic processing.
The ambiguity of natural languages is another difficulty. These ambiguities go unnoticed most of the time, yet are correctly interpreted. This is possible because we use explicit as well as implicit sources of knowledge. Communication via language involves two brains, not just one: the brain of the speaker/writer and that of the hearer/reader. Anything that is assumed to be known to the receiver is not explicitly encoded. The receiver possesses the necessary knowledge and fills in the gaps while making an interpretation. As humans, we are aware of the context and current cultural knowledge, and also of the language and traditions, and utilize these to process the meaning. However, incorporating contextual and world knowledge poses the greatest difficulty in language computing.
An example of cultural impact on language is the representation of different shades of white in the Eskimo world. It may be hard for a person living in the plains to distinguish among various shades. Similarly, to an Indian, the word 'Taj' may mean a monument, a brand of tea, or a hotel, which may not be so for a non-Indian. Let us now take a look at the various sources of ambiguities in natural languages.
The first level of ambiguity arises at the word level. Without much effort, we can identify words that have multiple meanings associated with them, e.g., bank, bat, and still. A word may be ambiguous in its part-of-speech or it may be ambiguous in its meaning. The word 'can' is ambiguous in its part-of-speech whereas the word 'bank' is ambiguous in its meaning. We hardly consider all possible meanings of a word to get the correct one. A program, on the other hand, must be explicitly coded to resolve each meaning. Hence, we need to develop various models and algorithms to resolve them. Deciding whether 'can' is a noun or a verb is solved by 'part-of-speech tagging' whereas identifying whether a particular use of 'bank' corresponds to the 'financial institution' sense or the 'river bank' sense is solved by 'word sense disambiguation'. 'Part-of-speech tagging' and 'word sense disambiguation' algorithms are discussed in Chapters 3 and 5 respectively.
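For instance, with the NLTK toolkit (assuming its tokenizer and tagger data packages are installed), a part-of-speech tagger resolves the two uses of 'can' from context. The output shown is typical, not guaranteed.

import nltk

tokens = nltk.word_tokenize("I can open the can")
print(nltk.pos_tag(tokens))
# Typical output: [('I', 'PRP'), ('can', 'MD'), ('open', 'VB'),
#                  ('the', 'DT'), ('can', 'NN')]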
A sentence may be ambiguous even if the words in it are not, for example, the sentence 'Stolen rifle found by tree.' None of the words in this sentence is ambiguous, but the sentence is. This is an example of structural ambiguity. Verb sub-categorization may help to resolve this type of ambiguity, but not always. Probabilistic parsing, which is discussed in Chapter 4, is another solution. At a still higher level are pragmatic and discourse ambiguities. Ambiguities are discussed in Chapter 5.
A number of grammars have been proposed to describe the structure
of sentences. However, there are an infinite number of ways to generate
them, which makes writing grammar rules, and grammar itself, extremely
complex. On top of it, we often make correct semantic interpretations of
non-grammatical sentences. This fact makes it almost impossible for
grammar to capture the structure of all and only meaningful text.
1.5 LANGUAGE AND GRAMMAR
Automatic processing of language requires the rules and exceptions of a language to be explained to the computer. Grammar defines language. It consists of a set of rules that allows us to parse and generate sentences in a language. Thus, it provides the means to specify natural language. These rules relate information to coding devices at the language level, not at the world-knowledge level (Bharati et al. 1995). However, since world knowledge affects both the coding (i.e., words) and the coding convention (structure), this blurs the boundary between syntax and semantics. Nevertheless, such a separation is made because of the ease of processing and grammar writing.
The main hurdle in language specification comes from the constantly changing nature of natural languages and the presence of a large number of hard-to-specify rules. Several efforts have been made to provide formal specifications, which has led to the development of a number of grammars. Main among them are transformational grammar (Chomsky 1957), lexical functional grammar (Kaplan and Bresnan 1982), government and binding (Chomsky 1981), generalized phrase structure grammar, dependency grammar, Paninian grammar, and tree-adjoining grammar (Joshi 1985). Some of these grammars focus on derivation (e.g., phrase structure grammar) while others focus on relationships (e.g., dependency grammar, lexical functional grammar, Paninian grammar, and link grammar). We discuss some of these in Chapter 2. The greatest contribution to grammar comes from Noam Chomsky, who proposed a hierarchy of formal grammars based on levels of complexity. These grammars use phrase structure rules (or rewrite rules). The term 'generative grammar' is often used to refer to the general framework introduced by Chomsky. Generative grammar basically refers to any grammar that uses a set of rules to specify or generate all and only grammatical (well-formed) sentences in a language. Chomsky argued that phrase structure grammars are not adequate to specify natural language. He proposed a complex system of transformational grammar in his book Syntactic Structures (1957), in which he suggested that each sentence in a language has two levels of representation, namely, a deep structure and a surface structure (see Figure 1.1). The mapping from deep structure to surface structure is carried out by transformations. In the following paragraphs, we introduce transformational grammar.
Figure 1.1 Surface and deep structures of a sentence: the surface structure is a parse tree (S → NP VP) of 'Veena is played by Pooja', and the deep structure is a parse tree of 'Pooja plays veena'.
Transformational grammar was introduced by Chomsky in 1957. Chomsky argued that an utterance is the surface representation of a 'deeper structure' representing its meaning. The deep structure can be transformed in a number of ways to yield many different surface-level representations. Sentences with different surface-level representations having the same meaning share a common deep-level representation. Chomsky's theory was able to explain why sentences like

Pooja plays veena. (1.4a)
Veena is played by Pooja. (1.4b)

have the same meaning, despite having different surface structures (roles of subject and object are inverted). Both the sentences are generated from the same 'deep structure' in which the deep subject is Pooja and the deep object is the veena.
Transformational grammar has three components:
1. Phrase structure grammar
2. Transformational rules
3. Morphophonemic rules—These rules match each sentence
Tepresentation to a string of phonemes.
Each of these components consists of a set of rules. Phrase structure
grammar consists of rules that generate natural language sentences and
assign a structural description to them. As an example, consider the
following set of rules:
S → NP + VP
VP → V + NP
NP → Det + Noun
V → Aux + Verb
Det → the, a, an, ...
Verb → catch, write, eat, ...
Noun → police, snatcher, ...
Aux → will, is, can, ...
In these rules, S stands for sentence, NP for noun phrase, VP for verb phrase, and Det for determiner. Sentences that can be generated using these rules are termed grammatical. The structure assigned by the grammar is a constituent structure analysis of the sentence.
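The same rules can be written down and tested directly. The sketch below encodes them as an NLTK context-free grammar (an assumption of this illustration, not the book's own implementation) and parses sentence (1.5).

import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP
NP -> Det Noun
V -> Aux Verb
Det -> 'the' | 'a' | 'an'
Verb -> 'catch' | 'write' | 'eat'
Noun -> 'police' | 'snatcher'
Aux -> 'will' | 'is' | 'can'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the police will catch the snatcher".split()):
    print(tree)
# (S (NP (Det the) (Noun police))
#    (VP (V (Aux will) (Verb catch)) (NP (Det the) (Noun snatcher))))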
The second component of transformational grammar is a set of transformation rules, which transform one phrase-marker (underlying) into another phrase-marker (derived). These rules are applied on the terminal string generated by phrase structure rules. Unlike phrase structure rules, transformational rules are heterogeneous and may have more than one symbol on their left hand side. These rules are used to transform one surface representation into another, e.g., an active sentence into a passive one. The rule relating active and passive sentences (as given by Chomsky) is:

NP1 - Aux - V - NP2 → NP2 - Aux + be + en - V - by + NP1

This rule says that an underlying input having the structure NP1-Aux-V-NP2 can be transformed into NP2-Aux+be+en-V-by+NP1. This transformation involves the addition of the strings 'be' and 'en' and certain rearrangements of the constituents of the sentence.
Transformational rules
can be obligatory or optional. An obligatory transformation is one that
ensures agreement in number of subject and verb, etc., whereas an optional
transformation is one that modifies the structure of a sentence while
preserving its meaning. Morphophonemic rules match each sentence
representation to a string of phonemes.
Consider the active sentence:
The police will catch the snatcher. (1.5)
The application of phrase structure rules will assign the structure shown in Figure 1.2 to this sentence.
Figure 1.2 Parse structure of sentence (1.5): [S [NP The police] [VP [Verb [Aux will] [V catch]] [NP [Det the] [Noun snatcher]]]]
The passive transformation rules will convert the sentence into 'The + snatcher + will + be + en + catch + by + the + police' (Figure 1.3).
Figure 1.3 Structure of sentence (1.5) after applying passive transformations: [S [NP The snatcher] [VP [Verb [Aux will be en] [V catch]] [by [NP [Det the] [Noun police]]]]]
Another transformational rule will then reorder 'en + catch' to 'catch + en' and subsequently one of the morphophonemic rules will convert 'catch + en' to 'caught'. In general, the noun phrase is not always as simple as in sentence (1.5). It may contain other embedded structures such as adjectives, modifiers, relative clauses, etc. Long distance dependencies are other language phenomena that cannot be adequately handled by phrase structure rules. Long distance dependency refers to syntactic phenomena where a verb and its subject or object can be arbitrarily apart. The problem in the specification of appropriate phrase structure rules occurs because these phenomena cannot be localized at the surface structure level (Joshi and Vijayshanker 1989). Wh-movement (a syntactic phenomenon in which question words, called wh-words, appear at the beginning of the sentence, e.g., 'What is she reading?' instead of 'She is reading what?') is a specific case of these types of dependencies.
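A toy sketch of the whole pipeline may help: the passive transformation rearranges the constituents NP1-Aux-V-NP2 into NP2-Aux+be+en-V-by+NP1, and a single illustrative morphophonemic rule then rewrites 'en catch' as 'caught'. The function below is an assumption for illustration only; it works on flat token lists rather than real parse trees.

def passivize(np1, aux, verb, np2):
    # Transformational rule: NP1-Aux-V-NP2 -> NP2-Aux+be+en-V-by+NP1
    tokens = np2 + [aux, "be", "en", verb, "by"] + np1
    # Morphophonemic rule (one toy case): 'en' + 'catch' -> 'caught'
    out, i = [], 0
    while i < len(tokens):
        if tokens[i] == "en" and i + 1 < len(tokens) and tokens[i + 1] == "catch":
            out.append("caught")
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)

print(passivize(["the", "police"], "will", "catch", ["the", "snatcher"]))
# -> the snatcher will be caught by the police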
1.6 PROCESSING INDIAN LANGUAGES
There are a number of differences between Indian languages and English. This introduces differences in their processing. Some of these differences are listed here.
• Unlike English, Indian languages have SOV (Subject-Object-Verb) as the default sentence structure.
• Indian languages have a free word order, i.e., words can be moved freely within a sentence without changing the meaning of the sentence.
• Spelling standardization is more subtle in Hindi than in English.
• Indian languages have a relatively rich set of morphological variants.
• Indian languages make extensive and productive use of complex predicates (CPs).
• Indian languages use post-position (Karaka) case markers instead of prepositions.
• Indian languages use verb complexes consisting of sequences of verbs, e.g., gaa rahaa hai (singing) and khel rahaa hai (playing). The auxiliary verbs in this sequence provide information about tense, aspect, modality, etc.
Except for the direction in which its script is written, Urdu is closely
related to Hindi. Both share similar phonology, morphology, and syntax.
Both are free-word-order languages and use post-positions. They also
share a large amount of their vocabulary. Differences in the vocabulary
arise mainly because a significant portion of Urdu vocabulary comes
from Persian and Arabic, while Hindi borrows much of its vocabulary
from Sanskrit.
Paninian grammar provides a framework for Indian language models. These can be used for computation of Indian languages. The grammar focuses on extraction of Karaka relations from a sentence. We talk about
the details of modelling in Chapter 2. A parsing framework based on
Paninian grammar is introduced in Chapter 4 and issues involved in
Indian language translation (using Paninian grammar theory) are discussed
in Chapter 8.
1.7 NLP APPLICATIONS
Machine translation is the first application area of NLP. It involves the complete linguistic analysis of a natural language sentence, and linguistic generation of an output sentence. It is one of the most comprehensive and most challenging tasks in the area (AI-complete). However, the recent dramatic progress in the field of NLP has found interesting applications in information retrieval, information extraction, text summarization, etc. This book offers an extensive coverage of these recent applications, and also of traditional ones like machine translation and natural language generation. The focus has been on bridging the gap between theory and practice rather than on offering a gamut of linguistic, psychological, and computational theories.
The applications utilizing NLP include the following.

Machine Translation
This refers to automatic translation of text from one human language to another. It requires an understanding of the words and grammars of the two languages involved, semantics of the languages, and world knowledge.
Speech Recognition
This is the process of mapping acoustic speech signals to a set of words
The difficulties arise due to wide variations in the pronunciation of words, homonyms (e.g., dear and deer), and acoustic ambiguities (e.g., 'in the rest' and 'interest').
Speech Synthesis
Speech synthesis refers to automatic production of speech (utterance of
natural language sentences). Such systems can read out your mails on
telephone, or even read out a storybook for you. In order to generate
utterances, text has to be processed. So, NLP remains an important
component of any speech synthesis system.
Natural Language Interfaces to Databases
Natural language interfaces allow querying a structured database using
natural language sentences.
Information Retrieval
This is concerned with identifying documents relevant to a user's query.
NLP techniques have found useful applications in information retrieval. Indexing (stop word elimination, stemming, phrase extraction, etc.), word sense disambiguation, query modification, and knowledge bases have also been used in IR systems to enhance performance, e.g., by providing
methods for query expansion. WordNet, LDOCE (Longman Dictionary of
Contemporary English) and Roget's Thesaurus are some of the useful lexical
resources for IR research.
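As a small sketch of index-time processing, the Python fragment below removes stop words and stems the remaining terms with NLTK's Porter stemmer; the stop word list is a hand-picked assumption to keep the example self-contained.

from nltk.stem import PorterStemmer

STOP_WORDS = {"the", "is", "of", "to", "a", "and"}
stemmer = PorterStemmer()

def index_terms(text):
    # Keep non-stop words and reduce each to its stem.
    return [stemmer.stem(t) for t in text.lower().split() if t not in STOP_WORDS]

print(index_terms("The retrieval of relevant documents"))
# Typical output: ['retriev', 'relev', 'document']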
Information Extraction
An information extraction system captures and outputs factual information
contained within a document. Similar to an information retrieval system, it responds to a user's information need. However, unlike in an information retrieval system, the information need is not expressed as a keyword query. Instead, it is specified as pre-defined database schemas or templates.
Whereas an IR system identifies a subset of documents in a large repository of text database (e.g., in a library scenario, a subset of resources in a library), an information extraction system identifies a subset of information within a document that fits the pre-defined template.

Question Answering
Given a question and a set of documents, a question answering system attempts to find the precise answer, or at least the precise portion of text in which the answer appears. This is unlike an IR system, which returns the whole document that seems relevant to the user's query. A question answering system is different from an information extraction system in that the content to be extracted is unknown. In general, a question answering system benefits from having an information extraction system to identify entities in the text. A question answering system requires more NLP than an information retrieval system or an information extraction system. It requires not only precise analysis of questions and portions of texts but also semantic as well as background knowledge to answer certain types of questions.

Text Summarization
This deals with the creation of summaries of documents and involves syntactic, semantic, and discourse level processing of text.

1.8 SOME SUCCESSFUL EARLY NLP SYSTEMS

ELIZA (Weizenbaum 1966)
ELIZA is one of the earliest natural language understanding programs. It
uses syntactic patterns to mimic human conversation with the user. Here
is a sample conversation.
Eliza: Hello. I am ELIZA. How may I help you?
User: I am feeling a little bit sleepy.
Eliza: How long have you been feeling a little bit sleepy?
User: For almost half an hour.
Eliza: Please go on.
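Behaviour of this kind can be mimicked with a few pattern-response rules. The sketch below is only a guess at the flavour of such rules, not Weizenbaum's actual script:

import re

RULES = [
    (re.compile(r"i am (.+)", re.I), "How long have you been {0}?"),
    (re.compile(r".*"), "Please go on."),  # default response
]

def reply(utterance):
    text = utterance.strip().rstrip(".")
    for pattern, template in RULES:
        m = pattern.match(text)
        if m:
            return template.format(*m.groups())

print(reply("I am feeling a little bit sleepy."))
# -> How long have you been feeling a little bit sleepy?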
SysTran (System Translation)
The first SysTran machine translation system was developed in 1969 for Russian-English translation. SysTran also provided the first on-line machine translation service called Babel Fish, which is used by the AltaVista search engine for handling translation requests from users.

TAUM METEO
This is a natural language generation system used in Canada to generate weather reports. It accepts daily weather data and generates weather reports in English and French.
SHRDLU (Winograd 1972)
This is a natural language understanding system that simulates the actions of a robot in a block world domain. It uses syntactic parsing and semantic reasoning to understand instructions. The user can ask the robot to manipulate the blocks, to tell the blocks' configurations, and to explain its reasoning.

LUNAR (Woods 1977)
This was an early question answering system that answered questions about moon rocks.
1.9 INFORMATION RETRIEVAL
The availability of a large amount of text in electronic form has made it extremely difficult to get relevant information. Information retrieval systems aim at providing a solution to this.
The term 'information' should not be confused with the term 'entropy' (a numerical measure of the uncertainty of an outcome) as used in communication theory. Information is being used here to reflect the 'subject matter' or the 'content' of some text. We are not interested in 'digital communication', where bits and bytes are the information carriers. Instead, our focus is on the communication taking place between human beings as expressed through natural languages. Information is always associated with some data (text, numbers, images, and so on); we are concerned with text only. Hence, we consider words as the carriers of information and written text as the message encoded in natural language.
As a cognitive activity, the word 'retrieval' refers to the operation of accessing information from memory. We use the word 'retrieval' to refer to the operation of accessing information from some computer-based representation. Retrieval of information thus requires information to be processed and stored. Not all the information represented in computable form is retrieved. Instead, only the information relevant to the needs expressed in the form of a query is located. In order to get this relevance, the stored and processed information needs to be compared against the query representation. Information retrieval (IR) deals with all these facets. It is concerned with the organization, storage, retrieval, and evaluation of information relevant to the query.
Information retrieval deals with unstructured data. The retrieval is performed based on the content of the document rather than on its structure. IR systems usually return a ranked list of documents. IR components have been traditionally incorporated into different types
of information systems including database management systems, bibliographic text retrieval systems, question answering systems, and more recently search engines.
Current approaches for accessing large text collections can be classified into two categories. The first category consists of approaches that construct topic hierarchies, e.g., Yahoo. This helps the user locate documents of interest manually by traversing the hierarchy. However, it requires manual classification of new documents within the existing taxonomy. This makes it cost-ineffective and inapplicable due to the rapid growth of documents on the Web. The second category consists of approaches that rank the retrieved documents according to relevance. We discuss various IR models that support ranked retrieval in Chapter 9.
Major Issues in Information Retrieval (Siddiqui 2006)
There are a number of issues involved in the design and evaluation of IR systems, which are briefly discussed in this section. The first important point is to choose a representation of the document. Most human knowledge is coded in natural language, which is difficult to use as a knowledge representation language for computer systems. Most of the current retrieval models are based on keyword representation. This representation creates problems during retrieval due to polysemy, homonymy, and synonymy. Polysemy involves the phenomenon of a lexeme with multiple meanings. Homonymy is an ambiguity in which words that appear the same have unrelated meanings. Ambiguity makes it difficult for a computer to automatically determine the conceptual content of documents. Synonymy creates a problem when a document is indexed with one term and the query contains a different term, and the two terms share a common meaning. Another problem associated with keyword-based retrieval is that it ignores semantic and contextual information in the retrieval process. This information is lost in the extraction of keywords from the text and cannot be recovered by the retrieval algorithms.
A related issue is that of inappropriate characterization of queries by the user. There can be many reasons for the vagueness and inaccuracy of the user's queries, say for instance, her lack of knowledge of the domain, or even the inherent vagueness of the natural language. A user may fail to include relevant terms in the query or may include irrelevant terms. Inappropriate or inaccurate queries lead to poor retrieval performance. The problem of ill-specified queries can be dealt with by reformulating or expanding queries. An effective technique for this is relevance feedback, which modifies queries based on the feedback provided by the user on the initially retrieved documents.
In order to satisfy the user's request, an IR system matches the document representation with the query representation. Matching the query representation with that of the document is another issue. A number of measures have been proposed to quantify the similarity between a query and the document to produce a ranked list of results. Selection of the appropriate similarity measure is a crucial issue in the design of IR systems.
Evaluating the performance of IR systems is also a major issue. There are many aspects of evaluation, the most important being the effectiveness of the system. Recall and precision are the most widely used measures of effectiveness.
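For a single query, these two measures reduce to simple set arithmetic, as the short illustration below shows (the document identifiers are made up):

# precision = |relevant AND retrieved| / |retrieved|
# recall    = |relevant AND retrieved| / |relevant|
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d3", "d7"}

hits = retrieved & relevant
precision = len(hits) / len(retrieved)  # 2/4 = 0.50
recall = len(hits) / len(relevant)      # 2/3, approx. 0.67
print(precision, recall)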
As the major goal of IR is to search a document in a manner relevant to the query, understanding what constitutes relevance is also an important issue. Relevance is subjective in nature (Saracevic 1991). Only the user can tell the true relevance; it is not possible to measure this 'true relevance'.
One may, however, define the degree of relevance. Relevance has been
considered as a binary concept, whereas it is in fact a continuous function
(a document may be exactly what the user wants or it may be closely
related). Current evaluation techniques do not support this continuity as
it is quite difficult to put into practice. A number of relevance frameworks
have been proposed (Saracevic 1996). These include the system,
communication, psychological, and situational frameworks. The most
inclusive is the situational framework, which is based on the cognitive view of the information-seeking process and considers the importance of
situation, context, multi-dimensionality, and time. A survey of relevance
studies can be found in Mizzaro (1997). Most of the evaluations of IR
systems have so far been done on document test collections with known
relevance judgments.
The size of document collections and the varying needs of users also
complicate text retrieval. Some users require answers of limited scope,
while others require documents with a wider scope. These differing needs
can require different and specialized retrieval methods. However, these
are research issues and have not been dealt with in this book.
SUMMARY
• Language is the primary means of communication used by humans.
• Natural language processing is concerned with the development of computational models of aspects of human language processing.
• Theoretical linguists are mainly interested in providing a description of the structure and semantics of natural language, whereas computational linguists deal with the study of language from a computational point of view.
• Historically, there have been two major approaches to natural language processing, namely the rationalist approach and the empiricist approach.
• The highly ambiguous and vague nature of natural language makes it difficult to create a representation amenable to computing.
REFERENCES
Bach, Kent, 2002, Meaning and Truth, J. Keim Campbell, M. O'Rourke, and D. Shier (Eds.), Seven Bridges Press, New York, pp. 284-92.
Chomsky, Noam, 1957, Syntactic Structures, Mouton, The Hague.
———, 1981, Lectures on Government and Binding, Foris Publications, Dordrecht, The Netherlands.
Joshi, Aravind K., 1985, 'Tree adjoining grammars: How much context-sensitivity is required to provide reasonable structural descriptions?', Natural Language Parsing, D. Dowty, L. Karttunen, and A. Zwicky (Eds.), Cambridge University Press, Cambridge.
Joshi, Aravind K. and K. Vijayshanker, 1989, 'Treatment of long distance dependencies in LFG and TAG: functional uncertainty in LFG is a corollary in TAG,' Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, Vancouver, British Columbia, pp. 220-27.
Kaplan, R.M. and Joan Bresnan, 1982, 'Lexical functional grammar: A formal system for grammatical representation,' The Mental Representation of Grammatical Relations, Joan Bresnan (Ed.), MIT Press, Cambridge.
Lenat, D.B., M. Prakash, and M. Shepherd, 1986, 'Cyc: using common sense knowledge to overcome brittleness and knowledge acquisition bottlenecks,' AI Magazine, 6(4).
Mizzaro, S., 1997, 'Relevance: the whole history,' Journal of the American Society for Information Science, 48(9), pp. 810-32.
Saracevic, T., 1991, 'Individual differences in organizing, searching and retrieving information,' Proceedings of the 54th Annual Meeting of the American Society for Information Science (ASIS), pp. 82-86.
———, 1996, 'Relevance reconsidered,' Proceedings of CoLIS 2, Second International Conference on Conceptions of Library and Information Science: Integration in Perspective, P. Ingwersen and N.O. Pors (Eds.), The Royal School of Librarianship, Copenhagen, pp. 201-18.
Siddiqui, Tanveer, 2006, 'Intelligent techniques for effective information retrieval: a conceptual graph-based approach,' Ph.D. Thesis, J.K. Institute of Applied Physics, Deptt. of Electronics and Communication, University of Allahabad.
Weizenbaum, J., 1966, 'ELIZA—A computer program for the study of natural language communication between man and machine,' Communications of the ACM, 9(1).
Winograd, Terry, 1972, Understanding Natural Language, Academic Press, New York.
Woods, William, 1977, 'Lunar Rocks in Natural English: Explorations in Natural Language Question-answering,' Linguistic Structures Processing, A. Zampolli (Ed.), Elsevier, North Holland.
EXERCISES
1. Differentiate between the rationalist and empiricist approaches to natural language processing.
2. List the motivation behind the development of computational models of languages.
3. Briefly discuss the meaning components of a language.
4. What makes natural language processing difficult?
5. What is the role of transformational rules in transformational grammar? Explain with the help of examples.

CHAPTER 2
LANGUAGE MODELLING
CHAPTER OVERVIEW
The domain of language is quite vast. It presents an almost infinite number of sentences to the reader (or computer). To handle such a large number of sentences, we have to create a model of the domain, which can subsequently be simplified and handled computationally. A number of language models have been proposed. We introduce some of these models in this chapter. To create a general model of any language is a difficult task. There are two approaches for language modelling. One is to define a grammar that can handle the language. The other is to capture the patterns in a language statistically. This chapter has a mixed approach. It gives a glimpse of both grammar-based models and statistical language models. These include lexical functional grammar, government and binding, Paninian grammar, and the n-gram based model.
2.1 INTRODUCTION
Why and how do we model a language? This question has been discussed
by linguists since 500 BC. Computational linguists also have to confront
this question. It is obvious that our purpose is to understand and generate
natural languages from a computational viewpoint. One approach can be
to just take a language, try to understand every word and sentence of it,
and then come to a conclusion. This approach has not succeeded as there
are difficulties at each stage, which we will understand as we go through
this book. An alternative approach is to study the grammar of various
languages, compare them, and if possible, arrive at reasonable models
that facilitate our understanding of the problem and designing of natural-
language tools.
A model is a description of some complex entity or process. A language model is thus a description of language. Indeed, natural language is a complex entity and, in order to process it through a computer-based program, we need to build a representation (model) of it. This is known as language modelling. Language modelling can be viewed either as a problem of grammar inference or as a problem of probability estimation. A grammar-based language model attempts to distinguish a grammatical (well-formed) sentence from a non-grammatical (ill-formed) one, whereas a probabilistic language model attempts to identify a sentence based on a probability measure, usually a maximum likelihood estimate. These two viewpoints have led to the following categorization of language modelling approaches.
Grammar-based language model
A grammar-based approach uses the grammar of a language to create its model. It attempts to represent the syntactic structure of language. Grammar consists of hand-coded rules defining the structure and ordering of various constituents appearing in a linguistic unit (phrase, sentence, etc.). For example, a sentence usually consists of a noun phrase and a verb phrase. The grammar-based approach attempts to utilize this structure and also the relationships between these structures.
Statistical language modelling
The statistical approach creates a language model by training it from a
corpus. In order to capture regularities of a language, the training corpus
needs to be sufficiently large. Rosenfeld (1994) pointed out:
Statistical language modelling (SLM) is the attempt to capture
regularities of natural language for the purpose of improving the
performance of various natural language applications.
Statistical language modelling is one of the fundamental tasks in many
NLP applications, including speech recognition, spelling correction,
handwriting recognition, and machine translation. It has now found
applications in information retrieval, text summarization, and question answering also. A number of statistical language models have been proposed in the literature. The most popular of these are the n-gram models. We discuss this model in Section 2.3. The following section discusses various grammar-based models.
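As a preview of the n-gram models of Section 2.3, here is a minimal bigram model with maximum likelihood estimates. The toy corpus is an assumption of this sketch; real models are trained on large corpora and need smoothing for unseen bigrams.

from collections import Counter

corpus = "the police will catch the snatcher . the police will come .".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def prob(word, prev):
    # Maximum likelihood estimate: P(word | prev) = count(prev, word) / count(prev)
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(prob("police", "the"))  # 2/3: 'the' is followed by 'police' in 2 of its 3 occurrences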
2.2 VARIOUS GRAMMAR-BASED LANGUAGE MODELS
Various computational grammars have been proposed and studied, e.g., transformational grammar (Chomsky 1957), lexical functional grammar (Kaplan and Bresnan 1982), government and binding (Chomsky 1981), generalized phrase structure grammar (Gazdar et al. 1985), dependency grammar, Paninian grammar, and tree-adjoining grammar (Joshi 1985). This section focuses on lexical functional grammar (LFG), generalized phrase structure grammar (GPSG), government and binding (GB), and Paninian grammar, and introduces various approaches to understanding these grammars.
2.2.1 Generative Grammars
In 1957, in his book Syntactic Structures, Noam Chomsky wrote that we can generate sentences in a language if we know a collection of words and rules in that language. Only those sentences that can be generated as per the rules are grammatical. This point of view has dominated computational linguistics and is appropriately termed generative grammar.
The same idea can be used to model a language. If we have a complete
set of rules that can generate all possible sentences in a language, those
rules provide a model of that language. Of course, we are talking only
about the syntactical structure of language here.
Language is a relation between the sound (or the written text) and its meaning. Thus, any model of a language should also deal with the meaning
of its sentences. As seen earlier, we can have a perfectly grammatical but
meaningless sentence.
In this chapter, we will assume that grammars are a type of language
models.
2.2.2 Hierarchical Grammar
Chomsky (1956) described classes of grammars in a hierarchical manner, where the top layer contains the grammars represented by its subclasses. Hence, Type 0 (or unrestricted) grammar contains Type 1 (or context-sensitive) grammar, which in turn contains Type 2 (context-free) grammar, and that again contains Type 3 (regular) grammar. Although this relationship has been given for classes of formal grammars, it can be extended to describe grammars at various levels, such as in a class-subclass (embedded) relationship.
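For reference, the rule shapes that standardly characterize the four types are summarized below (A, B are non-terminals; a is a terminal; α, β, γ are arbitrary strings of terminals and non-terminals):

Type 0 (unrestricted):       α → β (α non-empty)
Type 1 (context-sensitive):  αAβ → αγβ (γ non-empty)
Type 2 (context-free):       A → γ
Type 3 (regular):            A → a or A → aB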
2.2.3 Government and Binding (GB)
As discussed in Chapter 1, a common viewpoint taken by linguists (not
computational linguists, however) is that the structure of a language (or
how well its sentences are formed) can be understood at the level of its
meaning, particularly while resolving structural ambiguity. However, the sentences are given at the syntactical level, and the transformation from meaning to syntax or vice versa is not well understood.
Transformational grammars assume two levels of existence of a sentence: one at the surface level and the other at the deep root level (this should not be confused with the meaning level). Government and binding (GB) theories have renamed them as s-level and d-level, and identified two more levels of representation (parallel to each other) called phonetic form and logical form. According to GB theories, language can be considered for analysis at the levels shown in Figure 2.1.
Figure 2.1 Different levels of representation in GB: d-structure maps to s-structure, which in turn maps to the (parallel) phonetic form and logical form.
If we describe language as the representation of some 'meaning' in a 'sound' form, then according to Figure 2.1, these two ends are the logical form (LF) and phonetic form (PF) respectively. GB is concerned with LF, rather than PF. Chomsky was the first to put forward a GB theory (Sells 1985).
Transformational grammars have hundreds of rewriting rules, which are generally language-specific and also construct-specific (e.g., rules for assertive and interrogative sentences in English, or for active and passive voice sentences). Generation of a complete set of coherent rules may not be possible. GB envisages that if we define rules for structural units at the deep level, it will be possible to generate any language with fewer rules. These deep-level structures are abstractions of noun phrase, verb phrase, etc., and are common to all languages. It is possible to do so because, as GB theory states, a child learns its mother tongue since the human mind is 'hard-wired' with some universal grammar rules. Thus, the language enters the mind and its abstract structure gives rise to actual phonetic structures. The existence of deep-level, language-independent, abstract structures, and the expression of these in surface-level, language-specific structures with the help of simple rules, is the main concern of GB theory.
Let us take an example to explain d- and s-structures.
Example 2.1
Mukesh was killed. (2.1)
(i) In transformational grammar, this can be represented as S → NP Aux VP, as given below:

Figure 2.2 TG representation of sentence (2.1): [S [NP Mukesh] [Aux was] [VP killed]]
(ii) In GB, the s-structure and d-structure are as follows:

Figure 2.3 Surface structure of sentence (2.1) in GB: [S [NP Mukesh] [INFL past] [VP [V be] [VP [V killed] [NP e]]]] (INFL: inflection; NP: noun phrase; VP: verb phrase; e: empty)

Figure 2.4 Deep structure of sentence (2.1) in GB: [S [NP e] [INFL past] [VP [V be] [VP [V killed] [NP Mukesh]]]]
Figure 2.4 Deep structure of sentence (2.1) in GBptrioval
26 Natural Language Processing and Information Retr
v set of theories thar y
Components of GB
[binding (GB) compris 7
aanntane ao sesteuctare an (0 Hogical for (py gy! Me
form). A general nsformational rule called 4,
structure Hevel, This cag
AN thy
ANU PUL Dy eye
can be 1
presen
Me
Government anc
structures from de
aside the phonetic
iecd at clstructire level
cif ine
as well as é
Joes not violate the const
in ity simplest from G
is appli
constituents at any pla
iples. Hence,
theories and priv
by Figure
distructure =
| Move a
‘Theories and principles
[__
\ nove a
pen
Theories and principles
logical form «
a)
structure
Figure 2.5 Components of GB
Hence, GB consists of 'a series of modules that contain constraints and principles' (Sells 1985) applied at various levels of its representations, together with the transformation rule Move α. Before elaborating on these modules, which include X-bar theory, the projection principle, θ-theory and the θ-criterion, C-command and government, case theory, the empty category principle (ECP), and binding theory, we discuss the general characteristics of GB.
GB considers all three levels of representation (d-, s-, and LF) as syntactic, and LF is also related to meaning or semantic-interpretive mechanisms. However, GB applies the same Move α transformation to map d-level to s-level, or s-level to LF level. The LF level helps in quantifier scoping and also in handling various sentence constructions such as passive or interrogative constructions. An example of LF representation may be helpful.
Example 2.2 Consider the sentence:

Two countries are visited by most travellers. (2.2)

Its two possible logical forms are:

LF1: [S two countries are visited by [NP most travellers]]
LF2: Applying Move α: [NP most travellers]i [S two countries are visited by ei]
In LF1, the interpretation is that most travellers visit the same two countries (say, India and China). In LF2, when we move [most travellers] outside the scope of the sentence, the interpretation can be that individual travellers visit two countries, which may be different for different travellers.
One of the important concepts in GB is that of constraints. It is the part of the grammar which prohibits certain combinations and movements; otherwise, Move α could move anything to any possible position. Thus, GB is basically the formulation of theories or principles which create constraints to disallow the construction of ill-formed sentences. To account for cross-lingual constraints of a similar type, GB can specify that 'a constituent cannot be moved from position X' (where X can have value X1 in one language, X2 in another, and so on). These rules are so general and language-independent that 'language-particular details of description typically go uncharted in GB' (Sells 1985).
Figure 2.5 showed the application of various theories and principles at three different levels of representation in GB. Figure 2.6 mentions these explicitly to show the organization of GB.
Figure 2.6 Organization of GB (adapted from Sells 1985): the d-structure is subject to X-bar theory and θ-theory; Move α, constrained by the projection principle and θ-criterion, derives the s-structure, to which the Case filter, ECP, binding theory, and control apply; a further application of Move α, again under the projection principle and θ-criterion, derives the logical form.
X-bar Theory
The X-bar theory is one of the central concepts in GB. Instead of defining several phrase structures and the sentence structure with separate sets of rules, X-bar theory defines them both as maximal projections of some head. In this manner, the entities defined become language independent. Thus, noun phrase (NP), verb phrase (VP), adjective phrase (AP), and prepositional phrase (PP) are maximal projections of noun (N), verb (V), adjective (A), and preposition (P) respectively, and can be represented as head X of their corresponding phrases (where X = {N, V, A, P}). Not only that, even the sentence