CHAPTER 1
INTRODUCTION

CHAPTER OVERVIEW

This chapter gives an idea of natural language processing (NLP). The various levels of analysis involved in analysing text, along with the knowledge used at each level, are discussed. Some of the factors that make the automatic processing of languages difficult are also touched upon. The chapter underlines the role of grammar in language processing and introduces transformational grammar. Indian languages differ considerably from English; these differences are clearly pointed out. Further, a number of NLP applications are introduced, along with some of the early NLP systems. Towards the end, information retrieval is discussed.

1.1 WHAT IS NATURAL LANGUAGE PROCESSING (NLP)

Language is the primary means of communication used by humans. It is the tool we use to express the greater part of our ideas and emotions. It shapes thought, has a structure, and carries meaning. Learning new concepts and expressing ideas through them is so natural that we hardly realize how we process natural language. But there must be some kind of representation, in our mind, of the content of language. When we want to express a thought, this content helps us represent it in language in real time. As children, we never learn a computational model of language, yet building such a model is the first step in the automatic processing of languages.

Natural language processing (NLP) is concerned with the development of computational models of aspects of human language processing. There are two main reasons for such development:

1. To develop automated tools for language processing
2. To gain a better understanding of human communication

Building computational models with human language-processing abilities requires a knowledge of how humans acquire, store, and process language. It also requires a knowledge of the world and of language.

Historically, there have been two major approaches to NLP: the rationalist approach and the empiricist approach. Early NLP research followed the rationalist approach, which assumes the existence of some innate language faculty in the human brain. Supporters of this approach argue that it is not possible to learn something as complex as natural language from limited sensory inputs alone. Empiricists do not believe in the existence of such an innate language faculty. Instead, they believe in the existence of some general organization principles, such as pattern recognition, generalization, and association. Learning of detailed structures can, therefore, take place through the application of these principles on the sensory inputs available to the child.

1.2 ORIGINS OF NLP

Natural language processing, sometimes mistakenly termed natural language understanding, originated from machine translation research. While natural language understanding involves only the interpretation of language, natural language processing includes both understanding (interpretation) and generation (production). NLP also includes speech processing. In this book, however, we are concerned with text processing only, covering work in the area of computational linguistics and the tasks in which NLP has found useful application.

Computational linguistics is similar to theoretical and psycho-linguistics, but uses different tools. Theoretical linguists mainly provide structural descriptions of natural language and its semantics. They are not concerned with the actual processing of sentences or the generation of sentences from structural descriptions.
They are in a quest for principles that remain common across languages, and identify rules that capture linguistic generalizations. For example, most languages have constructs like noun and verb phrases. Theoretical linguists identify rules that describe and restrict the structure of languages (grammar). Psycholinguists explain how humans produce and comprehend natural language. Unlike theoretical linguists, they are interested in the representation of linguistic structures as well as in the process by which these structures are produced. They rely primarily on empirical investigations to back up their theories.

Computational linguistics is concerned with the study of language using computational models of linguistic phenomena. It deals with the application of linguistic theories and computational techniques to the processing of language. In computational linguistics, representing a language is a major problem; most knowledge representations tackle only a small part of knowledge. Representing the whole body of knowledge is almost impossible. Words, knowledge, and language should not be confused; this is discussed in detail in Section 1.3.

Computational models may be broadly classified under two categories: knowledge-driven and data-driven. Knowledge-driven systems rely on explicitly coded linguistic knowledge, often expressed as a set of handcrafted grammar rules. Acquiring and encoding such knowledge is difficult, and this is the main bottleneck in the development of such systems. They are, therefore, often constrained by the lack of sufficient coverage of domain knowledge. Data-driven approaches presume the existence of a large amount of data and usually employ some machine learning technique to learn syntactic patterns. The amount of human effort required is less, and the performance of these systems depends on the quantity of the data. Such systems are usually adaptive to noisy data.

As mentioned earlier, this book is mainly concerned with computational linguistics approaches. We try to achieve a balance between semantic (knowledge-driven) and data-driven approaches on one hand, and between theory and practice on the other. It is at this point that the book differs significantly from other textbooks in this area. The tools and techniques have been covered to the extent needed to build a sufficient understanding of the domain and to provide a base for application.

NLP is no longer confined to classroom teaching and a few traditional applications. With the unprecedented amount of information now available on the web, NLP has become one of the leading techniques for processing and retrieving information. In order to cope with these developments, this book brings together information retrieval with NLP. The term information retrieval is used here in a broad manner, to include a number of information processing applications such as information extraction, text summarization, question answering, and so forth. The distinction between these applications is made in terms of the level of detail or the amount of information retrieved. We consider retrieval of information as part of processing. The word 'information' itself has a much broader sense: it includes multiple modes of information, such as speech, images, and text. However, it is not possible to cover all these modes due to space constraints. Hence, this book focuses on textual information only.

1.3 LANGUAGE AND KNOWLEDGE

Language is the medium of expression in which knowledge is deciphered. We are not competent enough to define language and knowledge and their implications.
We are here considering the text form of the language, and its content as knowledge. Language, being a medium of expression, is the outer form of the content it expresses. The same content can be expressed in different languages. But can language be separated from its content? If so, how can the content itself be represented? Generally, the meaning of one language is written in the same language (but with a different set of words). It may also be written in some other, formal, language. Hence, to process a language means to process its content. As computers are not able to understand natural language, methods are developed to map its content into a formal language. Sometimes, formal language content may have to be expressed in a natural language as well.

Thus, in this book, language is taken up as a knowledge representation tool that has historically represented the whole body of knowledge, and that has been modified, maybe through the generation of new words, to include new ideas and situations. The language and speech community, on the other hand, considers a language as a set of sounds that, through combinations, conveys meaning to a listener. However, we are concerned with representing and processing text only.

Language (text) processing has different levels, each involving different types of knowledge. We now discuss the various levels of processing and the types of knowledge each involves.

The simplest level of analysis is lexical analysis, which involves the analysis of words. Words are the most fundamental unit (syntactic as well as semantic) of any natural language text. Word-level processing requires morphological knowledge, i.e., knowledge about the structure and formation of words from basic units (morphemes). The rules for forming words from morphemes are language specific.

The next level of analysis is syntactic analysis, which considers a sequence of words as a unit, usually a sentence, and finds its structure. Syntactic analysis decomposes a sentence into its constituents (or words) and identifies how the constituents relate to each other. It captures the grammaticality or non-grammaticality of sentences by looking at constraints such as word order, number, and case agreement. This level of processing requires syntactic knowledge, i.e., knowledge about how words are combined to form larger units such as phrases and sentences, and about the constraints imposed on them. Not every sequence of words results in a sentence. For example, 'I went to the market' is a valid sentence whereas 'went the I market to' is not. Similarly, 'She is going to the market' is valid, but 'She are going to the market' is not. Thus, this level of analysis requires detailed knowledge about the rules of grammar. These first two levels are illustrated in the short sketch that follows.
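To make the lexical and syntactic levels concrete, here is a minimal sketch using the NLTK toolkit. NLTK is an illustrative choice only (the book itself is not tied to any particular library), and the sketch assumes NLTK is installed along with its 'punkt', 'wordnet', and tagger data packages.

import nltk
from nltk.stem import WordNetLemmatizer

sentence = "I went to the market"

# Lexical analysis: split the text into words and reduce each word to
# its base form using morphological knowledge.
tokens = nltk.word_tokenize(sentence)      # ['I', 'went', 'to', 'the', 'market']
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(t.lower(), pos='v') for t in tokens])
# 'went' is mapped to its morpheme-level base form 'go'

# Syntactic analysis begins by assigning a part of speech to each word,
# a first step towards finding the structure of the sentence.
print(nltk.pos_tag(tokens))   # e.g., [('I', 'PRP'), ('went', 'VBD'), ...]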
Yet another level of analysis is semantic analysis. Semantics is associated with the meaning of the language. Semantic analysis is concerned with creating meaningful representations of linguistic inputs. The general idea of semantic interpretation is to take natural language sentences or utterances and map them onto some representation of meaning. Defining meaning components is difficult, as grammatically valid sentences can be meaningless. One of the famous examples is 'Colorless green ideas sleep furiously' (Chomsky 1957). The sentence is well-formed, i.e., syntactically correct, but semantically anomalous. However, this does not mean that syntax has no role to play in meaning. Bach (2002) considers semantics to be a projection of its syntax; that is, semantic structure is interpreted syntactic structure. But syntax is definitely not the only component that contributes to meaning.

Our conception of meaning is quite broad. We feel that humans apply all sorts of knowledge (i.e., lexical, syntactic, semantic, discourse, pragmatic, and world knowledge) to arrive at the meaning of a sentence. The starting point in semantic analysis, however, has been lexical semantics (the meaning of words). A word can have a number of possible meanings associated with it, but in a given context only one of these meanings participates. Finding the correct meaning of a particular use of a word is necessary for finding the meaning of larger units. However, the meaning of a sentence cannot be composed solely on the basis of the meanings of its words. Consider the following sentences:

Kabir and Ayan are married. (1.1a)
Kabir and Suha are married. (1.1b)

Both sentences have identical structures, and the meanings of the individual words are clear. Yet most of us end up with two different interpretations. We may interpret the second sentence to mean that Kabir and Suha are married to each other, but this interpretation does not occur for the first sentence. Syntactic structure and compositional semantics fail to explain these interpretations; we make use of pragmatic information. This means that semantic analysis requires pragmatic knowledge besides semantic and syntactic knowledge.

A still higher level of analysis is discourse analysis. Discourse-level processing attempts to interpret the structure and meaning of even larger units, e.g., at the paragraph and document level, in terms of words, phrases, clusters, and sentences. It requires the resolution of anaphoric references and the identification of discourse structure. It also requires discourse knowledge, that is, knowledge of how the meaning of a sentence is determined by preceding sentences (for example, how a pronoun refers to a preceding noun) and how to determine the function of a sentence in the text. In fact, pragmatic knowledge may be needed for resolving anaphoric references. For example, in the following sentences, resolving the reference 'they' requires pragmatic knowledge:

The district administration refused to give the trade union permission for the meeting because they feared violence. (1.2a)
The district administration refused to give the trade union permission for the meeting because they oppose the government. (1.2b)

The highest level of processing is pragmatic analysis, which deals with the purposeful use of sentences in situations. It requires knowledge of the world, i.e., knowledge that extends beyond the contents of the text. The Cyc project (Lenat 1986) at the University of Austin is an attempt to utilize world knowledge in NLP. However, its usefulness in a general-domain NLP system is yet to be demonstrated. Furthermore, whether or not semantics can be associated with a symbol manipulator, and whether humans use logic in the way the Cyc project assumes, are both issues of debate.

1.4 THE CHALLENGES OF NLP

A number of factors make NLP difficult. These relate to the problems of representation and interpretation. Language computing requires a precise representation of content. Given that natural languages are highly ambiguous and vague, achieving such a representation can be difficult. The inability to capture all the required knowledge is another source of difficulty.
It is almost impossible to embody all the sources of knowledge that humans use to process language. Even if this could be done, it would still not be possible to write procedures that imitate language processing as done by humans. In this section, we detail some of the problems associated with NLP.

Perhaps the greatest source of difficulty in natural language is identifying its semantics. The principle of compositional semantics considers the meaning of a sentence to be a composition of the meanings of the words appearing in it. In the earlier section, we saw a number of examples where this principle failed to work. Our viewpoint is that words alone do not make a sentence. Instead, it is the words as well as their syntactic and semantic relations that give meaning to a sentence. As pointed out by Wittgenstein (1953): 'The meaning of a word is its use in the language.'

A language keeps on evolving. New words are added continually, and existing words are introduced in new contexts. For example, most newspapers and news channels used 9/11 to refer to the terrorist act on the World Trade Center in the USA in 2001. When we process written text or spoken utterances, we have access to an underlying mental representation. A machine, in contrast, can learn the meaning of a specific word in a message only by considering its context, unless some explicitly coded general world or domain knowledge is available. The context of a word is defined by its co-occurring words; it includes everything that occurs before or after a word. The frequency with which a word is used in a particular sense also affects its meaning. The English word 'while' was initially used to mean 'a short interval of time', but it is now more commonly used as a conjunction. None of the usages of 'while' discussed in this chapter correspond to the older meaning.

Idioms, metaphor, and ellipsis add further complexity to identifying the meaning of written text. As an example, consider the sentence:

The old man finally kicked the bucket. (1.3)

The meaning of this sentence has nothing to do with the words 'kick' and 'bucket' appearing in it.

Quantifier scoping is another problem. The scope of quantifiers (the, each, etc.) is often not clear, and this poses problems for automatic processing.

The ambiguity of natural languages is another difficulty. Ambiguities go unnoticed most of the time, yet are correctly interpreted. This is possible because we use explicit as well as implicit sources of knowledge. Communication via language involves two brains, not just one: the brain of the speaker/writer and that of the hearer/reader. Anything that is assumed to be known to the receiver is not explicitly encoded. The receiver possesses the necessary knowledge and fills in the gaps while making an interpretation. As humans, we are aware of the context and of current cultural knowledge, and also of the language and its traditions, and we utilize these to process the meaning. However, incorporating contextual and world knowledge poses the greatest difficulty in language computing.

An example of the cultural impact on language is the representation of different shades of white in the Eskimo world. It may be hard for a person living in the plains to distinguish among these various shades. Similarly, to an Indian, the word 'Taj' may mean a monument, a brand of tea, or a hotel, which may not be so for a non-Indian.

Let us now take a look at the various sources of ambiguity in natural languages.
The first level of ambiguity arises at the word level. Without much effort, we can identify words that have multiple meanings associated with them, e.g., bank, can, bat, and still. A word may be ambiguous in its part-of-speech or in its meaning. The word 'can' is ambiguous in its part-of-speech, whereas the word 'bank' is ambiguous in its meaning. We hardly consider all the possible meanings of a word; we directly get the correct one. A program, on the other hand, must be explicitly coded to resolve such ambiguity. Hence, we need to develop various models and algorithms to resolve it. Deciding whether 'can' is a noun or a verb is solved by part-of-speech tagging, whereas identifying whether a particular use of 'bank' corresponds to the 'financial institution' sense or the 'river bank' sense is solved by word sense disambiguation. Part-of-speech tagging and word sense disambiguation algorithms are discussed in Chapters 3 and 5 respectively; a small illustration of both tasks follows below.

A sentence may be ambiguous even if none of its words is, for example, the sentence 'Stolen rifle found by tree.' None of the words in this sentence is ambiguous, but the sentence is. This is an example of structural ambiguity. Verb sub-categorization may help to resolve this type of ambiguity, but not always. Probabilistic parsing, which is discussed in Chapter 4, is another solution. At a still higher level are pragmatic and discourse ambiguities. Ambiguities are discussed in Chapter 5.
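As an illustration, the following sketch runs both disambiguation tasks with NLTK (an assumed, illustrative toolkit; the book presents the algorithms themselves in Chapters 3 and 5). The lesk() function implements the classic dictionary-overlap (Lesk) algorithm over WordNet senses.

import nltk
from nltk.wsd import lesk

# Part-of-speech tagging resolves whether 'can' acts as a verb or a noun.
print(nltk.pos_tag(nltk.word_tokenize("She can see the can")))
# e.g., [('She', 'PRP'), ('can', 'MD'), ('see', 'VB'), ('the', 'DT'), ('can', 'NN')]

# Word sense disambiguation picks a sense of 'bank' from its context by
# overlapping dictionary glosses with the surrounding words.
context = nltk.word_tokenize("He sat on the bank of the river")
print(lesk(context, 'bank', 'n'))   # a WordNet synset, e.g., Synset('bank.n.01')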
A number of grammars have been proposed to describe the structure of sentences. However, there are an infinite number of ways to generate sentences, which makes writing grammar rules, and the grammar itself, extremely complex. On top of this, we often make correct semantic interpretations of non-grammatical sentences. This makes it almost impossible for a grammar to capture the structure of all and only meaningful text.

1.5 LANGUAGE AND GRAMMAR

Automatic processing of language requires the rules and exceptions of a language to be explained to the computer. Grammar defines language. It consists of a set of rules that allows us to parse and generate sentences in a language. Thus, it provides the means to specify natural language. These rules relate information to coding devices at the language level, not at the world-knowledge level (Bharati et al. 1995). However, since world knowledge affects both the coding (i.e., words) and the coding convention (structure), the boundary between syntax and semantics is blurred. Nevertheless, such a separation is made because of the ease of processing and grammar writing.

The main hurdle in language specification comes from the constantly changing nature of natural languages and the presence of a large number of hard-to-specify constructs. Several efforts have been made to provide language specifications, which has led to the development of a number of grammars. Chief among them are transformational grammar (Chomsky 1957), lexical functional grammar (Kaplan and Bresnan 1982), government and binding (Chomsky 1981), generalized phrase structure grammar, dependency grammar, Paninian grammar, and tree-adjoining grammar (Joshi 1985). Some of these grammars focus on derivation (e.g., phrase structure grammar) while others focus on relationships (e.g., dependency grammar, lexical functional grammar, Paninian grammar, and link grammar). We discuss some of these in Chapter 2.

The greatest contribution to grammar comes from Noam Chomsky, who proposed a hierarchy of formal grammars based on their level of complexity. These grammars use phrase structure rules (or rewrite rules). The term 'generative grammar' is often used to refer to the general framework introduced by Chomsky. Generative grammar basically refers to any grammar that uses a set of rules to specify or generate all and only the grammatical (well-formed) sentences in a language. Chomsky argued that phrase structure grammars are not adequate to specify natural language. In his book Syntactic Structures (1957), he proposed a complex system of transformational grammar, in which he suggested that each sentence in a language has two levels of representation, namely, a deep structure and a surface structure (see Figure 1.1). The mapping from deep structure to surface structure is carried out by transformations. In the following paragraphs, we introduce transformational grammar.

[Figure 1.1 Surface and deep structures of the sentences 'Veena is played by Pooja' and 'Pooja plays veena']

The basic idea in transformational grammar, as introduced by Chomsky, is that an utterance is the surface representation of a 'deeper structure' representing its meaning. The deep structure can be transformed in a number of ways to yield many different surface-level representations. Sentences with different surface-level representations but the same meaning share a common deep-level representation. Chomsky's theory was able to explain why sentences like

Pooja plays veena. (1.4a)
Veena is played by Pooja. (1.4b)

have the same meaning despite having different surface structures (the roles of subject and object are inverted). Both sentences are generated from the same deep structure, in which the deep subject is Pooja and the deep object is the veena.

Transformational grammar has three components:

1. Phrase structure grammar
2. Transformational rules
3. Morphophonemic rules, which match each sentence representation to a string of phonemes

Each of these components consists of a set of rules. Phrase structure grammar consists of rules that generate natural language sentences and assign a structural description to them. As an example, consider the following set of rules:

S → NP + VP
VP → V + NP
NP → Det + Noun
V → Aux + Verb
Det → the, a, an, ...
Verb → catch, write, eat, ...
Noun → police, snatcher, ...
Aux → will, is, can, ...

In these rules, S stands for sentence, NP for noun phrase, VP for verb phrase, and Det for determiner. Sentences that can be generated using these rules are termed grammatical. The structure assigned by the grammar is a constituent structure analysis of the sentence.
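The rules above can be written down directly as a context-free grammar and handed to an off-the-shelf parser. The following sketch does this with NLTK (an illustrative choice of toolkit, not one the book depends on), parsing a sentence licensed by these rules.

import nltk

# The phrase structure rules from the text, encoded as a CFG.
grammar = nltk.CFG.fromstring("""
S    -> NP VP
VP   -> V NP
NP   -> Det Noun
V    -> Aux Verb
Det  -> 'the' | 'a' | 'an'
Verb -> 'catch' | 'write' | 'eat'
Noun -> 'police' | 'snatcher'
Aux  -> 'will' | 'is' | 'can'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the police will catch the snatcher".split()):
    print(tree)
# (S (NP (Det the) (Noun police))
#    (VP (V (Aux will) (Verb catch)) (NP (Det the) (Noun snatcher))))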
The second component of transformational grammar is the set of transformational rules, which transform one phrase-marker (the underlying one) into another phrase-marker (the derived one). These rules are applied to the terminal strings generated by the phrase structure rules. Unlike phrase structure rules, transformational rules are heterogeneous and may have more than one symbol on their left hand side. These rules are used to transform one surface representation into another, e.g., an active sentence into a passive one. The rule relating active and passive sentences (as given by Chomsky) is:

NP1 - Aux - V - NP2 → NP2 - Aux + be + en - V - by + NP1

This rule says that an underlying input having the structure NP - Aux - V - NP can be transformed into NP - Aux + be + en - V - by + NP. The transformation involves the addition of the strings 'be' and 'en' and certain rearrangements of the constituents of the sentence.

Transformational rules can be obligatory or optional. An obligatory transformation is one that, for example, ensures agreement in number between subject and verb, whereas an optional transformation is one that modifies the structure of a sentence while preserving its meaning.

Morphophonemic rules match each sentence representation to a string of phonemes. Consider the active sentence:

The police will catch the snatcher. (1.5)

The application of the phrase structure rules will assign to this sentence the structure shown in Figure 1.2.

[Figure 1.2 Parse structure of sentence (1.5): [S [NP [Det The] [Noun police]] [VP [V [Aux will] [Verb catch]] [NP [Det the] [Noun snatcher]]]]]

The passive transformation rules will convert the sentence into 'The + snatcher + will + be + en + catch + by + the + police' (Figure 1.3).

[Figure 1.3 Structure of sentence (1.5) after applying the passive transformation]

Another transformational rule will then reorder 'en + catch' to 'catch + en', and subsequently one of the morphophonemic rules will convert 'catch + en' to 'caught'.
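The following toy sketch in Python renders the passive transformation and one morphophonemic rule. The constituent boundaries are supplied by hand here; in a real system they would come from the phrase structure analysis.

def passive_transform(np1, aux, verb, np2):
    # NP1 - Aux - V - NP2  ->  NP2 - Aux + be + en - V - by + NP1
    return np2 + aux + ['be', 'en'] + verb + ['by'] + np1

# One hand-coded morphophonemic rule: after 'en + catch' is reordered to
# 'catch + en', the pair is realized as the irregular form 'caught'.
MORPHOPHONEMIC = {('catch', 'en'): 'caught'}

def realize(symbols):
    out, i = [], 0
    while i < len(symbols):
        if symbols[i] == 'en' and i + 1 < len(symbols):
            verb = symbols[i + 1]
            # reorder 'en + V' to 'V + en', then apply the lookup; the
            # default suffixing handles regular forms such as 'eat' -> 'eaten'
            out.append(MORPHOPHONEMIC.get((verb, 'en'), verb + 'en'))
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return ' '.join(out)

deep = (['the', 'police'], ['will'], ['catch'], ['the', 'snatcher'])
print(realize(passive_transform(*deep)))
# the snatcher will be caught by the police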
In general, the noun phrase is not always as simple as in sentence (1.5). It may contain other embedded structures, such as adjectives, modifiers, relative clauses, etc. Long-distance dependencies are another language phenomenon that cannot be adequately handled by phrase structure rules. Long-distance dependency refers to syntactic phenomena in which a verb and its subject or object can be arbitrarily far apart. The problem in specifying appropriate phrase structure rules occurs because these phenomena cannot be localized at the surface structure level (Joshi and Vijayshanker 1989). Wh-movement is a specific case of this type of dependency. It refers to the syntactic phenomenon in which question words, called wh-words (e.g., 'what', 'who'), appear at the beginning of the sentence. For example, the object of the verb 'read' appears at the beginning of the sentence 'What is she reading?' rather than after the verb, as in 'She is reading a book.'

1.6 PROCESSING INDIAN LANGUAGES

There are a number of differences between Indian languages and English. These introduce differences in their processing. Some of these differences are listed here.

• Unlike English, Indian languages have SOV (Subject-Object-Verb) as the default sentence structure.
• Indian languages have a free word order, i.e., words can be moved freely within a sentence without changing the meaning of the sentence.
• Spelling standardization is more subtle in Hindi than in English.
• Indian languages have a relatively rich set of morphological variants.
• Indian languages make extensive and productive use of complex predicates (CPs).
• Indian languages use post-position (karaka) case markers instead of prepositions.
• Indian languages use verb complexes consisting of sequences of verbs, e.g., gaa rahaa hai (is singing) and khel rahaa hai (is playing). The auxiliary verbs in such a sequence provide information about tense, aspect, modality, etc.

Except for the direction in which its script is written, Urdu is closely related to Hindi. Both share a similar phonology, morphology, and syntax. Both are free-word-order languages and use post-positions. They also share a large amount of their vocabulary. Differences in the vocabulary arise mainly because a significant portion of Urdu vocabulary comes from Persian and Arabic, while Hindi borrows much of its vocabulary from Sanskrit.

Paninian grammar provides a framework for Indian language models, which can be used for the computation of Indian languages. The grammar focuses on the extraction of karaka relations from a sentence. We discuss the details of this modelling in Chapter 2. A parsing framework based on Paninian grammar is introduced in Chapter 4, and the issues involved in Indian language translation (using Paninian grammar theory) are discussed in Chapter 8.

1.7 NLP APPLICATIONS

Machine translation was the first application area of NLP. It involves the complete linguistic analysis of a natural language sentence and the linguistic generation of an output sentence. It is one of the most comprehensive and most challenging tasks in the area (AI-complete). However, the recent dramatic progress in the field of NLP has found interesting applications in information retrieval, information extraction, text summarization, etc. This book offers an extensive coverage of these recent applications, as well as of traditional ones like machine translation and natural language generation. The focus has been on bridging the gap between theory and practice rather than on offering a gamut of linguistic, psychological, and computational theories.

The applications utilizing NLP include the following.

Machine Translation
This refers to the automatic translation of text from one human language to another. It requires an understanding of the words and phrases involved, the semantics of the two languages, and world knowledge.

Speech Recognition
This is the process of mapping acoustic speech signals to a set of words. The difficulties arise due to wide variations in the pronunciation of words, homonyms (e.g., dear and deer), and acoustic ambiguities (e.g., 'in the rest' and 'interest').

Speech Synthesis
Speech synthesis refers to the automatic production of speech (the utterance of natural language sentences). Such systems can read out your mails on the telephone, or even read out a storybook for you. In order to generate utterances, text has to be processed, so NLP remains an important component of any speech synthesis system.

Natural Language Interfaces to Databases
Natural language interfaces allow querying a structured database using natural language sentences.

Information Retrieval
This is concerned with identifying documents relevant to a user's query. NLP techniques have found useful application in information retrieval: indexing (stop word elimination, stemming, phrase extraction, etc.), word sense disambiguation, and query modification. Knowledge bases have also been used in IR systems to enhance performance, e.g., by providing methods for query expansion. WordNet, LDOCE (Longman Dictionary of Contemporary English), and Roget's Thesaurus are some of the lexical resources useful for IR research.

Information Extraction
An information extraction system captures and outputs factual information contained within a document. Like an information retrieval system, it responds to a user's information need. However, unlike in an information retrieval system, the information need is not expressed as a keyword query.
Instead, the information need is specified as a pre-defined database schema or template. Whereas an IR system identifies a subset of documents in a large repository of text (e.g., in a library scenario, a subset of resources in a library), an information extraction system identifies a subset of information within a document that fits the pre-defined template.

Question Answering
Given a question and a set of documents, a question answering system attempts to find the precise answer, or at least the precise portion of text in which the answer appears. This is unlike an IR system, which returns whole documents that seem relevant to the user's query. A question answering system also differs from an information extraction system in that the content to be extracted is unknown. In general, a question answering system benefits from having an information extraction system to identify entities in the text. A question answering system requires more NLP than either an information retrieval system or an information extraction system: it requires not only precise analysis of questions and portions of texts, but also semantic as well as background knowledge to answer certain types of questions.

Text Summarization
This deals with the creation of summaries of documents and involves syntactic, semantic, and discourse level processing of text.

1.8 SOME SUCCESSFUL EARLY NLP SYSTEMS

ELIZA (Weizenbaum 1966)
ELIZA is one of the earliest natural language understanding programs. It uses syntactic patterns to mimic human conversation with the user. Here is a sample conversation.

Eliza: Hello. I am ELIZA. How may I help you?
User: I am feeling a little bit sleepy.
Eliza: How long have you been feeling a little bit sleepy?
User: For almost half an hour.
Eliza: Please go on.

SysTran (System Translation)
The first SysTran machine translation system was developed in 1969 for Russian-English translation. SysTran also provided the first on-line machine translation service, called Babel Fish, which is used by the AltaVista search engine for handling translation requests from users.

TAUM METEO
This is a natural language generation system used in Canada to generate weather reports. It accepts daily weather data and generates weather reports in English and French.

SHRDLU (Winograd 1972)
This is a natural language understanding system that simulates the actions of a robot in a block world domain. It uses syntactic parsing and semantic reasoning to understand instructions. The user can ask the robot to manipulate the blocks, to describe the block configurations, and to explain its reasoning.

LUNAR (Woods 1977)
This was an early question answering system that answered questions about moon rocks.
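The pattern-matching idea behind ELIZA can be sketched in a few lines of Python. The patterns below are illustrative inventions, not Weizenbaum's original script.

import re

RULES = [
    (r'I am (.*)',               'How long have you been {0}?'),
    (r'I feel (.*)',             'Why do you feel {0}?'),
    (r'.*\b(mother|father)\b.*', 'Tell me more about your family.'),
]

def eliza_reply(utterance):
    # Try each syntactic pattern in turn and echo the captured text back
    # inside a canned response template.
    for pattern, template in RULES:
        match = re.match(pattern, utterance, re.IGNORECASE)
        if match:
            return template.format(*match.groups())
    return 'Please go on.'   # default response, as in the dialogue above

print(eliza_reply('I am feeling a little bit sleepy'))
# How long have you been feeling a little bit sleepy?
print(eliza_reply('For almost half an hour'))
# Please go on.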
1.9 INFORMATION RETRIEVAL

The availability of a large amount of text in electronic form has made it extremely difficult to get relevant information. Information retrieval systems aim at providing a solution to this. The term 'information' here should not be confused with the term 'entropy' (a numerical measure of the uncertainty of an outcome) as used in communication theory. Information is being used here to reflect the 'subject matter' or the 'content' of some text. We are not interested in 'digital communication', where bits and bytes are the information carriers. Instead, our focus is on the communication taking place between human beings as expressed through natural language. Information is always associated with some data (text, numbers, images, and so on); we are concerned with text only. Hence, we consider words as the carriers of information, and written text as the message encoded in natural language.

As a cognitive activity, the word 'retrieval' refers to the operation of accessing information from memory. We use the word 'retrieval' to refer to the operation of accessing information from some computer-based representation. Retrieval of information thus requires the information to be processed and stored. Not all the information represented in computable form is retrieved. Instead, only the information relevant to the need expressed in the form of a query is located. In order to establish this relevance, the stored and processed information needs to be compared against the query representation. Information retrieval (IR) deals with all these facets. It is concerned with the organization, storage, retrieval, and evaluation of information relevant to the query.

Information retrieval deals with unstructured data. The retrieval is performed based on the content of the document rather than on its structure. IR systems usually return a ranked list of documents. IR components have been traditionally incorporated into different types of information systems, including database management systems, bibliographic text retrieval systems, question answering systems, and, more recently, search engines.

Current approaches for accessing large text collections can be classified into two categories. The first category consists of approaches that construct a topic hierarchy, e.g., Yahoo. This helps the user locate documents of interest manually by traversing the hierarchy. However, it requires manual classification of new documents within the existing taxonomy, which makes it cost-ineffective and inapplicable given the rapid growth of documents on the web. The second category consists of approaches that rank the retrieved documents according to relevance. We discuss various IR models that support ranked retrieval in Chapter 9.

Major Issues in Information Retrieval (Siddiqui 2006)

There are a number of issues involved in the design and evaluation of IR systems, which are briefly discussed in this section. The first important point is to choose a representation of the document. Most human knowledge is coded in natural language, which is difficult to use as a knowledge representation language for computer systems. Most of the current retrieval models are based on keyword representation. This representation creates problems during retrieval due to polysemy, homonymy, and synonymy. Polysemy involves the phenomenon of a lexeme having multiple meanings. Homonymy is an ambiguity in which words that appear the same have unrelated meanings. Ambiguity makes it difficult for a computer to automatically determine the conceptual content of documents. Synonymy creates a problem when a document is indexed with one term while the query contains a different term, and the two terms share a common meaning. Another problem associated with keyword-based retrieval is that it ignores semantic and contextual information in the retrieval process. This information is lost in the extraction of keywords from the text and cannot be recovered by the retrieval algorithms.
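A toy keyword-based retriever makes the synonymy problem concrete. The two 'documents' and the query below are invented examples.

docs = {
    1: "the physician examined the patient",
    2: "the doctor prescribed a new medicine",
}

# Build an inverted index: term -> set of documents containing it.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

def retrieve(query):
    # Return every document sharing at least one keyword with the query.
    hits = set()
    for term in query.split():
        hits |= index.get(term, set())
    return hits

print(retrieve("doctor"))   # {2} -- document 1 is missed, even though
                            # 'physician' expresses the same concept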
A related issue is that of the inappropriate characterization of queries by the user. There can be many reasons for the vagueness and inaccuracy of a user's queries, for instance, her lack of knowledge of the subject, or even the inherent vagueness of natural language itself. The user may fail to include relevant terms in the query, or may include irrelevant ones. Inappropriate or inaccurate queries lead to poor retrieval performance. The problem of ill-specified queries can be dealt with by expanding the query. An effective technique based on this idea is relevance feedback, which modifies queries based on the judgments given by the user on the initially retrieved documents.

In order to satisfy the user's request, an IR system matches the document representation with the query representation. Matching the query representation with that of the documents is another issue. A number of measures have been proposed to quantify the similarity between a query and a document, so as to produce a ranked list of results. Selection of an appropriate similarity measure is a crucial issue in the design of IR systems.

Evaluating the performance of IR systems is also a major issue. There are many aspects of evaluation, the most important being the effectiveness of the system. Recall and precision are the most widely used measures of effectiveness. As the major goal of IR is to search for documents relevant to a query, understanding what constitutes relevance is also an important issue. Relevance is subjective in nature (Saracevic 1991). Only the user can tell the true relevance, and it is not possible to measure this 'true relevance'. One may, however, define a degree of relevance. Relevance has usually been treated as a binary concept, whereas it is in fact a continuous function (a document may be exactly what the user wants, or it may be only closely related). Current evaluation techniques do not support this continuity, as it is quite difficult to put into practice. A number of relevance frameworks have been proposed (Saracevic 1996). These include the system, communication, psychological, and situational frameworks. The most inclusive is the situational framework, which is based on the cognitive view of the information-seeking process and considers the importance of situation, context, multi-dimensionality, and time. A survey of relevance studies can be found in Mizzaro (1997). Most evaluations of IR systems have so far been done on document test collections with known relevance judgments.

The size of document collections and the varying needs of users also complicate text retrieval. Some users require answers of limited scope, while others require documents with a wider scope. These differing needs can require different and specialized retrieval methods. However, these are research issues and have not been dealt with in this book.
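Recall and precision are straightforward to compute once the retrieved set and the relevance judgments are known. The sets below are invented for illustration.

# Documents the system retrieved, and documents judged relevant by the user.
retrieved = {1, 2, 3, 5, 8}
relevant  = {2, 3, 4, 8, 9, 10}

hits = retrieved & relevant
precision = len(hits) / len(retrieved)   # fraction of retrieved docs that are relevant
recall    = len(hits) / len(relevant)    # fraction of relevant docs that were retrieved

print(f"precision = {precision:.2f}, recall = {recall:.2f}")
# precision = 0.60, recall = 0.50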
SUMMARY

• Language is the primary means of communication used by humans.
• Natural language processing is concerned with the development of computational models of aspects of human language processing.
• Theoretical linguists are mainly interested in providing a description of the structure and semantics of natural language, whereas computational linguists deal with the study of language from a computational point of view.
• Historically, there have been two major approaches to natural language processing, namely the rationalist approach and the empiricist approach.
• The highly ambiguous and vague nature of natural language makes it difficult to create a representation amenable to computing.

REFERENCES

Bach, Kent, 2002, 'Meaning and Truth', J. Keim Campbell, M. O'Rourke, and D. Shier (Eds.), Seven Bridges Press, New York, pp. 284-92.
Chomsky, Noam, 1957, Syntactic Structures, Mouton, The Hague.
Chomsky, Noam, 1981, Lectures on Government and Binding, Foris Publications, Dordrecht, The Netherlands.
Joshi, Aravind K., 1985, 'Tree adjoining grammars: How much context-sensitivity is required to provide reasonable structural descriptions,' Natural Language Parsing, D. Dowty, L. Karttunen, and A. Zwicky (Eds.), Cambridge University Press, Cambridge.
Joshi, Aravind K. and K. Vijayshanker, 1989, 'Treatment of long distance dependencies in LFG and TAG: Functional uncertainty in LFG is a corollary in TAG,' Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, Vancouver, British Columbia, pp. 220-27.
Kaplan, R.M. and Joan Bresnan, 1982, 'Lexical functional grammar: A formal system for grammatical representation,' The Mental Representation of Grammatical Relations, Joan Bresnan (Ed.), MIT Press, Cambridge.
Lenat, D.B., M. Prakash, and M. Shepherd, 1986, 'CYC: Using common sense knowledge to overcome brittleness and knowledge acquisition bottlenecks,' AI Magazine, 6(4).
Mizzaro, S., 1997, 'Relevance: The whole history,' Journal of the American Society for Information Science, 48(9), pp. 810-32.
Saracevic, T., 1991, 'Individual differences in organizing, searching and retrieving information,' Proceedings of the 54th Annual Meeting of the American Society for Information Science (ASIS), pp. 82-86.
Saracevic, T., 1996, 'Relevance reconsidered,' Proceedings of CoLIS 2, Second International Conference on Conceptions of Library and Information Science: Integration in Perspective, P. Ingwersen and N.O. Pors (Eds.), The Royal School of Librarianship, Copenhagen, pp. 201-18.
Siddiqui, Tanveer, 2006, 'Intelligent techniques for effective information retrieval: A conceptual graph-based approach,' Ph.D. Thesis, J.K. Institute of Applied Physics, Department of Electronics and Communication, University of Allahabad.
Weizenbaum, J., 1966, 'ELIZA—A computer program for the study of natural language communication between man and machine,' Communications of the ACM, 9(1).
Winograd, Terry, 1972, Understanding Natural Language, Academic Press, New York.
Woods, William, 1977, 'Lunar rocks in natural English: Explorations in natural language question-answering,' Linguistic Structures Processing, A. Zampolli (Ed.), Elsevier, North Holland.

EXERCISES

1. Differentiate between the rationalist and empiricist approaches to natural language processing.
2. List the motivations behind the development of computational models of language.
3. Briefly discuss the meaning components of a language.
4. What makes natural language processing difficult?
5. What is the role of transformational rules in transformational grammar? Explain with the help of examples.

CHAPTER 2
LANGUAGE MODELLING

CHAPTER OVERVIEW

The domain of language is quite vast: it presents an almost infinite number of sentences to the reader (or computer). To handle such a large number of sentences, we have to create a model of the domain, which can subsequently be simplified and handled computationally. A number of language models have been proposed, and we introduce some of them in this chapter. Creating a general model of any language is a difficult task. There are two approaches to language modelling. One is to define a grammar that can handle the language. The other is to capture the patterns in a language statistically. This chapter takes a mixed approach: it gives a glimpse of both grammar-based models and statistical language models.
The models discussed include lexical functional grammar, government and binding, Paninian grammar, and n-gram based models.

2.1 INTRODUCTION

Why and how do we model a language? This question has been discussed by linguists since 500 BC. Computational linguists also have to confront it. It is obvious that our purpose is to understand and generate natural languages from a computational viewpoint. One approach could be to just take a language, try to understand every word and sentence of it, and then come to a conclusion. This approach has not succeeded, as there are difficulties at each stage, which we will come to understand as we go through this book. An alternative approach is to study the grammars of various languages, compare them, and if possible, arrive at reasonable models that facilitate our understanding of the problem and the design of natural-language tools.

A model is a description of some complex entity or process. A language model is thus a description of language. Indeed, natural language is a complex entity, and in order to process it through a computer-based program, we need to build a representation (model) of it. This is known as language modelling. Language modelling can be viewed either as a problem of grammar inference or as a problem of probability estimation. A grammar-based language model attempts to distinguish a grammatical (well-formed) sentence from a non-grammatical (ill-formed) one, whereas a probabilistic language model attempts to identify a sentence based on a probability measure, usually a maximum likelihood estimate. These two viewpoints have led to the following categorization of language modelling approaches.

Grammar-based language model A grammar-based approach uses the grammar of a language to create its model. It attempts to represent the syntactic structure of the language. The grammar consists of hand-coded rules defining the structure and ordering of the various constituents appearing in a linguistic unit (phrase, sentence, etc.). For example, a sentence usually consists of a noun phrase and a verb phrase. The grammar-based approach attempts to utilize this structure and also the relationships between these structures.

Statistical language model The statistical approach creates a language model by training it on a corpus. In order to capture the regularities of a language, the training corpus needs to be sufficiently large. As Rosenfeld (1994) pointed out: 'Statistical language modelling (SLM) is the attempt to capture regularities of natural language for the purpose of improving the performance of various natural language applications.' Statistical language modelling is one of the fundamental tasks in many NLP applications, including speech recognition, spelling correction, handwriting recognition, and machine translation. It has now found applications in information retrieval, text summarization, and question answering as well. A number of statistical language models have been proposed in the literature. The most popular of these are the n-gram models. We discuss this model in Section 2.3.
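As a preview of Section 2.3, the following sketch builds the simplest of these models, a maximum likelihood bigram model, from a three-sentence toy corpus (invented for illustration).

from collections import Counter

corpus = [
    "<s> the police caught the snatcher </s>",
    "<s> the police watched the market </s>",
    "<s> pooja plays veena </s>",
]

# Collect unigram and bigram counts from the training corpus.
unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def p(word, prev):
    # Maximum likelihood estimate: count(prev, word) / count(prev).
    return bigrams[(prev, word)] / unigrams[prev]

print(p("police", "the"))   # 2/4 = 0.5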
Language Modeling 23 this section focuses on Texical functional grammar (LEG), generalized structure grammar (GPSG), government and binding (GB), and nd introduces various approaches to understand 2.2.4 Generative Grammars In 1957, in his book on Syntactic Structures, Noam Chomsky wrote that we can generate sentences in a language if we know a collection of words ora fes in that Ianguage. Only those sentences that can be generated as per the rules are grammatical. This point of view has dominated computational linguistics and is appropriately termed generative grammar. The same idea can be used to model a language. If we have a complete set of rules that can generate all possible sentences in a language, those rules provide a model of that language. Of course, we are talking only about the syntactical structure of language here. Language is a relation between the sound (or the written text) and its meaning, Thus, any model of a language should also deal with the meaning of its sentences. As seen earlier, we can have a perfectly grammatical but meaningless sentence. In this chapter, we will assume that grammars are a type of language models. 2.2.2 Hierarchical Grammar Chomsky (1956) described classes of grammars in a hierarchical manner, where the top layer contained the grammars represented by its sub classes. Hence, Type 0 (or unrestricted) grammar contains Type I (or context sensitive grammar), which in turn contains Type 2 (context-free grammar) and that again contains Type 3 grammar (regular grammar). Although this relationship has been given for classes of formal grammars, it can be extended to describe grammars at various levels, such as in 2 class-sub class (embedded) relationship. 2.2.3. Government and Binding (GB) As discussed in Chapter 1, a common viewpoint taken by linguists (not computational linguists, however) is that the structure of a language (or how well its sentences are formed) can be understood at the level of its meaning, particularly while resolving struct al ambiguity. However, the sentences are given at the syntactical level and the transformation from meaning to syntax or vice vers: is not well understood. Natural Language Processing and Information Retrieval Transformational grammars assume two levels of existence of sentence, one at the surface level and the other at the deep root level (this thon not be confused with the meaning level). Government and binding (GE theories have renamed them as slevel and d-level, and identified ten more levels of representation (parallel to each other) called phonetic form and logical form. According to GB theories, language can be considered for analysis at the levels shown in Figure 2.1. d-structure s-structure Phonetic form Logical form Figure 2.1. Different levels of representation in GB If we describe language as the representation of some ‘meaning ‘sound’ form, then according to Figure 2.1, these two ends are the log form (LF) and phonetic form (PF) respectively The GB is concemed w LF, rather than PF. Chomsky was the first to put forward a GB theory (Peter Sells 1985). Transformational grammars have hundreds of rew7itng rules, w are generally language-specific and also construct-specific (@y. wales for assertive and interrogative sentences in English, or for active and passive voice sentences). Generation of a complete set of coherent may not be possible. The GB envisages that if we define rules for struct units at the deep level, it will be possible to generate any languss® fewer rales. 
These deep-level structures are abstractions of noun-phris. verb-phrase, etc., and common to all languages. It is possible to do GB theory states, a child learns its mother tongue because the hur mind is ‘hard-wired’ with some universal grammar rules. Thus, the enters the mind and its abstract structure gives rise to actual phox structures. The existence of deep level, language-independent, abst! structures, and the expression of these in surface level, language spe structures with the help of simple rules is the main concern of GB theo Let us take an example to explain d- and s-structures. Example 2.1 Mukesh was killed. Language Modelling 25 gi) In te ,sformational grammar, this can be represented as S-NP Aux VP as given below: Mukesh was killed Figure 2.2 TG representation of sentence (2.1) (ii) In GB, the s-structure and d-structure are as follows: Mukesh was killed (c) killed Mukesh Pp INFL VP (e) past kill Mukesh Mukesh past Vo VP Be V NP INFL: Inflection NP: Noun phrase | VP: Verb phrase Killed et empty Figure 2.3 Surface structure of sentence (2.1) in GB JN NP INFL VP. | A past V VP Be V NP Killed Mukesh Figure 2.4 Deep structure of sentence (2.1) in GB ptrioval 26 Natural Language Processing and Information Retr v set of theories thar y Components of GB [binding (GB) compris 7 aanntane ao sesteuctare an (0 Hogical for (py gy! Me form). A general nsformational rule called 4, structure Hevel, This cag AN thy ANU PUL Dy eye can be 1 presen Me Government anc structures from de aside the phonetic iecd at clstructire level cif ine as well as é Joes not violate the const in ity simplest from G is appli constituents at any pla iples. Hence, theories and priv by Figure distructure = | Move a ‘Theories and principles [__ \ nove a pen Theories and principles logical form « a) structure Figure 2.5 Components of GB Hence, GB consists of ‘a series of modules that contain constraints and principles’ (Sells 1985) applied at various levels of its representations and the transformation rule, Move a. Before elaborating on these modules— which include X-bar theory, projection principle, @-theory and @-criterion, C-command and government, case theory, empty category principle (ECP, and binding theory—we discuss the general characteristics of GB. The GB considers all three levels of representations (d-, s-, and LF) a syntactic, and LF is also related to meaning or semantic-interpretive mechanisms. However, GB applies the same Move @. transformation 10 map d-levels to s-levels or s-levels to LF level. LF level helps in quantifier scoping and also in handling various sentence constructions such as pass or interrogative constructions. An example of LF representation may be helpful. Example 2.2 Consider the sentence: Two countries are visited by most travellers. ea Its two possible logical forms are: LPI: [, Two countries are visited by [,,most travellers] LF2: Applying Move [xp Most travellers, | [, two countries are visited by ¢] Language Modelling 27 ¢ interpretation is that most travellers visit the same tw | India and China). In LF2, when we move [most travellers} scope of the sentence, the interpretation can be that i travellers visit vO countries, which may be different for different travellers One of the important concepts in GB is that of constraints. It is the pat of the grammar which prohibits certain combinations and movements; Prrerwise Move «can move anything (0 any possible position. 
Thus, GB epasically he formulation of theories or principles which create constraints to disallow the construction of ill-formed sentences. To account for cross: lingual constraints of similar type, GB can specify that ‘a constituent cannot be moved from position X? (where X can have value X, in one Janguage, X2 in another, and so on), These rules are so general and Janguage-independent that ‘Janguage-particular details of description typically go uncharted in GB’ (Sells 1985). Figure 2.5 showed the application of various theories and principles at three different levels of representations in GB. Figure 9.6 mentions these explicitly to understand the organization of GB. In LF1, th countries (sa) outside the aestructure ~<———|_X-bar theory Outheory Move a | Projection principle, O-criterion ee ~c——] Projection principle, O-criterion ECP, Binding theory, control Move A hu logical form <——| Projection principle, O-criterion adapted from Peter Sells 1985) Case filter | —> s-structure Figure 2.6 Organization of GB ( X Theory The X theory (pronounce in GB. Instead of defining several phras structure with separate sets of rules, xX maximal projections of some head. In this manne! become language independent. ‘Thus, noun ph (VP), adjective phrase (AP), and prepositional phrase (PP) are mass! projections of noun (N), verb (V), adjective (A), and preposition (P) respectively, and can be represented as head X of their corresponding phrases (where X = (N, V, A, P}). Not only that, eve d X-bar theory) is one of the central concepts e structures and the sentence theory defines them both as the entities defined (NP), verb phrase en the sentence
