NLP Unit 1 Part 2 (NLP Unit 1 & 2)
Chapter 2: Finding the Structure of Documents

The conventional HMM approach has certain weaknesses. For example, it has been shown that it is not possible to use any information beyond the words, such as the POS tags of the words, for speech segmentation. To this end, two simple extensions have been proposed. Shriberg et al. [27] suggest using explicit states to emit the boundary tokens, hence incorporating nonlexical information such as prosodic cues. This approach to sentence segmentation builds on the hidden event language model (HELM) introduced by Stolcke and Shriberg [28], which was originally designed for speech disfluencies and treats events as extra meta tokens. In this model, one state is reserved for each boundary token, SB and NB, and the rest of the states are for generating words; the boundary tokens are not part of the word sequence itself. To ease the computation, an imaginary token is conceptually inserted between all consecutive words (in the original model, in case a word precedes a disfluency). Example 2-1 is a conceptual representation of a word sequence with boundary tokens:

EXAMPLE 2-1: ... people NB are NB dead SB few NB pictures ...

The most probable boundary token sequence is again obtained simply by Viterbi decoding. The conceptual HELM for segmentation is depicted in Figure 2-3.

Figure 2-3: Conceptual hidden event language model for segmentation.

These extra boundary tokens are then used to capture other meta-information. The most commonly used meta-information is the feedback obtained from other classifiers. Typically, the posterior probability of being in a boundary state is used as a state observation likelihood after being divided by the prior probabilities [27]. These other classifiers may also be trained with other feature sets, such as prosodic or syntactic features. This hybrid approach is presented in Section 2.2.4.

For topic segmentation, Tur et al. [29] used the same idea and modeled topic-start and topic-final sections explicitly, which helped greatly for broadcast news topic segmentation.

The second extension is inspired by factored language models [30], which capture not only words but also morphological, syntactic, and other information. Guz et al. [31] proposed using a factored HELM (fHELM) for sentence segmentation, using POS tags in addition to words.

2.2.2 Discriminative Local Classification Methods

Discriminative classifiers aim to model P(y_i | x_i) of Equation 2.1 directly. The most important distinction is that whereas the class densities p(x | y) are model assumptions in generative approaches such as naive Bayes, in discriminative methods discriminant functions of the feature space define the model. A number of discriminative classification approaches, such as support vector machines, boosting, maximum entropy, and regression, have been used for this task. Discriminative approaches have been shown to outperform generative ones in many speech and language processing tasks, although their training typically requires iterative optimization.

In discriminative local classification, each boundary is processed separately with local and contextual features. No global (i.e., sentence- or document-wide) optimization is performed, unlike in sequence classification models. Instead, features related to a wider context may be incorporated into the feature set; for example, the predicted class of the previous or next boundary can be used in an iterative fashion.

For sentence segmentation, supervised learning methods have primarily been applied to newspaper articles.
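Before turning to specific studies, the following is a minimal sketch of discriminative local classification for sentence boundaries, assuming nothing from the chapter: each candidate boundary is represented by a few binary features and classified independently by a simple perceptron. The feature names, abbreviation list, and training examples are toy placeholders, not any particular system's design.

```python
# Minimal sketch of discriminative local classification for sentence boundaries:
# each candidate boundary is scored independently from local features.
# Features and training data below are toy placeholders, not from the chapter.

def boundary_features(tokens, i):
    """Binary features for the candidate boundary after tokens[i]."""
    prev_w, next_w = tokens[i], tokens[i + 1]
    return {
        "prev_ends_period": prev_w.endswith("."),
        "prev_is_abbrev": prev_w.lower() in {"dr.", "mr.", "etc."},
        "next_capitalized": next_w[:1].isupper(),
    }

def train_perceptron(examples, epochs=10):
    """examples: list of (feature_dict, label) pairs, label 1 = boundary."""
    w = {}
    for _ in range(epochs):
        for feats, label in examples:
            score = sum(w.get(f, 0.0) for f, v in feats.items() if v)
            pred = 1 if score > 0 else 0
            if pred != label:                      # mistake-driven update
                for f, v in feats.items():
                    if v:
                        w[f] = w.get(f, 0.0) + (1 if label else -1)
    return w

def classify(w, feats):
    return 1 if sum(w.get(f, 0.0) for f, v in feats.items() if v) > 0 else 0

tokens = "He met Dr. Smith . They talked .".split()
train = [(boundary_features(tokens, 2), 0),   # "Dr." + "Smith": not a boundary
         (boundary_features(tokens, 4), 1),   # "." + "They": boundary
         (boundary_features(tokens, 1), 0)]   # "met" + "Dr.": not a boundary
w = train_perceptron(train)
print(classify(w, boundary_features(tokens, 4)))  # -> 1 (boundary predicted)
print(classify(w, boundary_features(tokens, 2)))  # -> 0 (abbreviation, no boundary)
```

Each candidate is classified in isolation, which is exactly the "local" aspect discussed above; contextual information enters only through the feature set.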
Stamatatos, Fakotakis, and Kokkinakis [32] used transformation-based learning (TBL) to infer rules for finding sentence boundaries. Many classifiers have been tried for the task: regression trees [33], neural networks [34, 35], a C4.5 classification tree [36], maximum entropy classifiers [37, 38], support vector machines (SVMs), and naive Bayes classifiers. Mikheev treated the sentence segmentation problem as a subtask of POS tagging by assigning a tag to punctuation in the same way as to other tokens [39]. For tagging he employed a combination of HMM and maximum entropy approaches.

The popular TextTiling method of Hearst for topic segmentation [40, 22] uses a lexical cohesion metric in a word vector space as an indicator of topic similarity. TextTiling can be seen as a local classification method with a single feature of similarity. Figure 2-4 depicts a typical graph of similarity with respect to consecutive segmentation units; the document is chopped when the similarity falls below some threshold.

Figure 2-4: Similarity scores across consecutive segmentation units.

Two methods for computing the similarity scores were proposed: block comparison of adjacent text blocks and vocabulary introduction.

Most automatic topic segmentation work based on text sources has explored topical word usage cues in one form or another. Kozima [65] used mutual similarity of words in a sequence of text as an indicator of text structure. Reynar [66] presented a method that finds topically similar regions in the text by graphically modeling the distribution of word repetitions. Ponte and Croft [67] extracted related word sets for topic segments with the information retrieval technique of local context analysis and then compared the expanded word sets. Beeferman et al. [48] combined a large set of automatically selected lexical discourse cues in a maximum entropy model. They also incorporated topical word usage into the model by building two statistical language models: one static (topic independent) and one that adapts its word predictions on the basis of past words. They showed that the log likelihood ratio of the two predictors behaves as an indicator of topic boundaries and can thus be used as an additional feature in the exponential model classifier.

Syntactic Features

Syntactic information has been successfully captured by a number of studies. Mikheev used POS tags for sentence segmentation, and syntactic features in the form of constituency trees are used for global reranking, as described in Section 2.2.5. Dependency parse trees are also used. For morphologically rich languages, such as Czech and Turkish, morphological analyses of the words are used as additional cues [31, 68].

Formally, let t_1, ..., t_n be the sequence of POS or morphological tags extracted for the words. The same features can be extracted for these tags as for the words themselves (n-grams before, after, and around the candidate boundary), for example the tag bigrams t_{i-1} t_i and t_i t_{i+1}. Syntactic features are typically less useful for topic segmentation because topic changes are usually characterized by content shifts.

In the global model, under a probabilistic context-free grammar (PCFG), the probability of a sentence can be computed as the sum of the probabilities of all of its valid parse trees:

P(W) = Σ_t Π_{r ∈ t} P(r)

where t is a parse tree and r is a production rule used in that tree [69].
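To illustrate the formula above, the following minimal sketch computes a sentence's probability by summing, over a small hand-specified set of parse trees, the product of the probabilities of the rules used in each tree. The grammar, its probabilities, and the two readings of "they can fish" are toy values for illustration only (and are not normalized as a real PCFG would be).

```python
from math import prod

# Toy PCFG rule probabilities (illustrative values, not normalized per left-hand side).
rule_prob = {
    "S -> NP VP": 1.0,
    "NP -> they": 0.2,
    "NP -> fish": 0.1,
    "VP -> V NP": 0.6,
    "VP -> Aux V": 0.4,
    "Aux -> can": 0.5,
    "V -> can": 0.1,
    "V -> fish": 0.3,
}

# Two hand-built parse trees for the same string "they can fish",
# each represented as the multiset of rules it uses.
trees = [
    ["S -> NP VP", "NP -> they", "VP -> V NP", "V -> can", "NP -> fish"],   # "can" as main verb
    ["S -> NP VP", "NP -> they", "VP -> Aux V", "Aux -> can", "V -> fish"], # "can" as auxiliary
]

# P(W) = sum over parse trees t of the product over rules r in t of P(r).
p_sentence = sum(prod(rule_prob[r] for r in t) for t in trees)
print(p_sentence)  # approximately 0.0132 (both readings contribute)
```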
Discourse Features

For both speech and text, discourse cues are always important for segmentation. For example, in a typical broadcast news show, the anchor first gives the headlines, then the stories are presented one by one with optional anchor/reporter interactions, and then a commercial follows; topic-initial sentences often contain topic-start cue phrases.

Previous work on both text and speech segmentation has shown that cue phrases or discourse markers, that is, discourse particles such as now or by the way, provide valuable indicators of structural units in discourse [e.g., 70, 71]. Similarly, for speech, a change of speaker may indicate a sentence boundary, and commercials may indicate a topic boundary in broadcast news or conversations. Formally, for all events e ∈ E that appear in the vicinity of a boundary, a feature x_e can be generated to represent the occurrence of that event and, if relevant, a complementary feature can represent its nonoccurrence. Events have to be detected using additional systems not detailed in this book (such as a commercial detector) that may output confidence scores. In this case, the feature becomes x_e = cs_e, where cs_e is the confidence score with which that event is recognized.

Whereas earlier approaches try to capture such predetermined discourse cues, more recent corpus-based studies rely on machine learning approaches to automatically learn such patterns using informative feature sets. For example, Tur et al. [29] used explicit HMM states for topic-initial and topic-final sentences, which improved performance greatly. Rosenberg and Hirschberg [50] used statistical hypothesis testing for predetermining such phrases.

For meeting or conversation segmentation, discourse features are more complex and rely on argumentation structure. Most studies simply use the previous and next turns as discourse features, but higher-level semantic information such as dialog act tags or meeting agenda items can also be used for exploiting discourse information [72].

2.5.2 Features Only for Text

Typographical and Structural Features

For sentence and topic segmentation, typographical and structural cues, such as punctuation and headlines, are very informative. Sentence segmentation systems use the words and punctuation before and after the candidate boundary, the capitalization and POS tags of those words, word length, and how frequently the words appear in nonsentence-boundary contexts (e.g., as a lowercase word) compared to at the end or beginning of a sentence. Similarly, gazetteer information containing abbreviations, together with preprocessing and postprocessing patterns, is employed to process the text.

Formally, let g be a set of words that appear in a gazetteer. A feature is generated such that x_g(w) = 1 if w ∈ g. Similarly, the feature that denotes the relative frequency of the lowercase form of a word can be computed as

x_lc(w) = C(lc(w)) / C(w)

where lc(w) denotes the lowercase version of w and C(·) is a corpus count.

In his work on sentence segmentation, Gillick [21] observed that, for a given set of features, the choice of classifier had a much smaller impact on the results than a mismatch in the tokenization of the training and the test data. An unsupervised approach to finding sentence boundaries learns abbreviations from an unlabeled corpus [3]; even though the approach is independent of the language, it is unable to identify abbreviations if they are not used multiple times in the test corpus.

Other structural cues include paragraph boundaries, headlines, and section numbering. Such cues appear only in structured textual sources and may not exist in certain text such as blogs and chatrooms.
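As a concrete illustration of the typographical and gazetteer features above, here is a minimal sketch for a candidate boundary after token i. The abbreviation gazetteer and the corpus counts are toy placeholders, and the lowercase-frequency ratio is one plausible way to compute the relative-frequency feature sketched above, not a prescribed formula.

```python
from collections import Counter

ABBREVIATIONS = {"dr.", "mr.", "etc.", "e.g."}   # toy gazetteer

def typographic_features(tokens, i, corpus_counts: Counter):
    """Features for a candidate sentence boundary after tokens[i]."""
    prev_w, next_w = tokens[i], tokens[i + 1]
    # Frequency of the lowercased form relative to all observed casings,
    # a proxy for how often the word occurs away from a sentence start.
    lc_ratio = (corpus_counts[next_w.lower()]
                / max(1, corpus_counts[next_w.lower()] + corpus_counts[next_w.capitalize()]))
    return {
        "prev_ends_with_period": prev_w.endswith("."),
        "prev_is_abbreviation": prev_w.lower() in ABBREVIATIONS,
        "next_is_capitalized": next_w[:1].isupper(),
        "next_lowercase_ratio": round(lc_ratio, 2),
    }

counts = Counter({"the": 1000, "The": 120, "smith": 2, "Smith": 85})
tokens = "He met Dr. Smith yesterday .".split()
print(typographic_features(tokens, 2, counts))
# {'prev_ends_with_period': True, 'prev_is_abbreviation': True,
#  'next_is_capitalized': True, 'next_lowercase_ratio': 0.02}
```

A low lowercase ratio (as for "Smith") suggests a proper name rather than a sentence-initial word, which, combined with the abbreviation feature for "Dr.", argues against placing a boundary there.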
2.5.3 Features for Speech

When working with speech recognition output, some words may be incorrect due to recognition errors, degrading the quality of lexical features. Similarly, token start times and durations may be wrongly estimated, causing errors in prosodic feature computation. Typically, a large set of prosodic features is extracted for robustness to these errors.

Prosodic Features

When applying segmentation to speech rather than written text, many of the same approaches can be used, but with some important considerations. First, in the case of automatic processing of speech, lexical information comes from the output of a speech recognition system, which typically contains errors. Second, spoken language lacks explicit punctuation, capitalization, and formatting information; rather, this information is conveyed through the language and also through prosody, as explained shortly. Third, although some spoken language, such as news broadcasts, is read from a text, most natural speech is conversational. In natural, spontaneous speech, sentences can be "ungrammatical" from the perspective of formal syntax and typically contain significant numbers of normal speech disfluencies, such as filled pauses, repetitions, and repairs.

Spoken language input, on the other hand, provides additional, "beyond words" information through its intonational and rhythmic characteristics, that is, through its prosody. Prosody refers to patterns in pitch (fundamental frequency), loudness (energy), and timing (as conveyed through pausing and phonetic durations). Prosodic cues are known to be relevant to discourse structure in spontaneous speech and can therefore be expected to play a role in indicating sentence boundaries and topic transitions. Furthermore, prosodic cues are by their nature independent of word identity, so they tend to suffer less than lexical features from errors in automatic speech recognition.

Figure 2-5 depicts some general prosodic features used for segmenting speech into sentences, along with lexical features. Broadly speaking, the prosodic features associated with sentence boundaries are similar to those for topic boundaries because both involve conveying a break that serves to chunk information. Pause length, duration, and pitch and energy resets are generally greater in magnitude for the larger (i.e., topic) breaks, but similar types of prosodic features can be used for both tasks, trained of course for the task at hand.

Figure 2-5: Some basic prosodic and lexical features for speech segmentation: stylized pitch, pitch/energy difference, vowel/rhyme duration, pause, speaker change, and word and POS n-grams over the previous and next words at the boundary.

Prosodic features for sentence segmentation have been used in a number of studies [75, 27, 76, 77, 78, 51, 11, 79, 60, 80]. The simplest and most often used feature is a pause at the boundary of interest. For automatic processing, pauses are more easily obtained than other prosodic features because, unlike pitch and energy features, pause information can be extracted from automatic speech recognition output. Of course, not all sentence boundaries contain pauses, particularly in spontaneous speech, and conversely, not all pauses correspond to sentence boundaries; for example, many sentence-internal disfluencies also contain pauses. Some methods use simply the presence of a pause; others model the duration of the pause.
Pause durations can be quite large in the case of turn-final sentence boundaries in conversation because such regions correspond to time during which another participant is talking. Sentence segmentation for certain dialog acts, such as backchannels (e.g., "uh-huh"), which tend to occur in isolated turns, can thus be achieved fairly successfully using only pause information.

The pause feature is computed as

x_pause(w_i) = start(w_{i+1}) - end(w_i)

where start() and end() represent the timing in seconds of the beginning and the end of a word in the speech recognition output. Relevant side features are the pause before the word (to know if it is isolated) and the quantized pause, x_qpause(w_i) = 1 iff x_pause(w_i) > thr_pause, where thr_pause is set to, for example, 0.2 second. Pause duration does not by nature follow a normal distribution and tends to confuse classifiers that expect such a distribution. However, this single feature is often the most relevant one for segmenting speech.

More detailed prosodic modeling has included pitch, phone duration, and energy information. Pitch is captured by modeling the fundamental frequency during voiced regions of speech. Pitch conveys a wide range of types of information, including information about the prominence of a syllable, but for sentence segmentation the goal is usually to capture a reset in pitch. Thus, methods have looked at pitch differences across a word boundary, with a larger negative difference indicating a higher probability of a sentence boundary. In addition to modeling the break in pitch across a word boundary, some approaches [27] have also modeled a speaker-specific value to which pitch falls at the ends of utterances, which not only improves performance but also allows for causal modeling because it does not rely on speech after the pause [81].

Pitch is not a continuous function and cannot be computed outside of voiced regions. Therefore, pitch features can be undefined for a given boundary candidate, which might be a problem with certain classifiers. Computing pitch, and smoothing and interpolating it properly, is not the matter of this book and should be handled by appropriate software such as the widely used Praat [82]. Typically, features are computed from statistics of pitch values in a window before the end of the word preceding the candidate boundary and after the beginning of the word following it. For example, the pitch difference feature mentioned in the previous paragraph results in

x_pitch = max_{t ∈ W_e(w_i)} pitch(t) - max_{t ∈ W_s(w_{i+1})} pitch(t)

where pitch(t) is the pitch value at time t, W_e(w_i) is a temporal window anchored at the end of word w_i, and W_s(w_{i+1}) is a similar window at the start of word w_{i+1}. Variants of this feature can be created by changing the window size (e.g., 200 ms, 500 ms), changing the statistic computed on both sides of the boundary (e.g., min, max, mean), and normalizing the pitch values according to different factors (e.g., log space projection, standardization to the distribution of pitch values of the current speaker).
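The following minimal sketch computes the pause and pitch-reset features defined above. The word timings and the pitch track are toy values; in practice they would come from an ASR forced alignment and a pitch tracker such as Praat, and the window size is an illustrative choice.

```python
# Pause and pitch-reset features around a candidate boundary after word i.
# Word timings and pitch values below are toy placeholders.

def pause_feature(words, i, thr=0.2):
    """x_pause = start(w_{i+1}) - end(w_i), plus its quantized version."""
    pause = round(words[i + 1]["start"] - words[i]["end"], 3)
    return {"pause": pause, "pause_gt_thr": pause > thr}

def pitch_reset_feature(words, i, pitch_track, window=0.2):
    """Max pitch in a window before the end of w_i minus max pitch in a
    window after the start of w_{i+1}; None if either region is unvoiced."""
    end_i, start_next = words[i]["end"], words[i + 1]["start"]
    before = [f0 for t, f0 in pitch_track if end_i - window <= t <= end_i and f0 > 0]
    after = [f0 for t, f0 in pitch_track if start_next <= t <= start_next + window and f0 > 0]
    if not before or not after:          # pitch undefined in unvoiced regions
        return None
    return max(before) - max(after)

words = [{"word": "dead", "start": 1.10, "end": 1.45},
         {"word": "few", "start": 1.90, "end": 2.10}]
pitch_track = [(1.30, 180.0), (1.40, 150.0), (1.95, 210.0), (2.05, 205.0)]
print(pause_feature(words, 0))                     # {'pause': 0.45, 'pause_gt_thr': True}
print(pitch_reset_feature(words, 0, pitch_track))  # -30.0: a negative difference, i.e., a pitch reset
```

Here the pause exceeds the 0.2-second threshold and the pitch rises across the gap, both of which are cues toward hypothesizing a sentence boundary after "dead".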
Duration features for sentence segmentation aim to capture a phenomenon known as preboundary lengthening, in which the last region of speech before the end of a unit is stretched out in duration. (Interestingly, this phenomenon is also observed in music and in bird song [83].) Automatic modeling methods best capture preboundary lengthening when phone durations are normalized by the average duration of those phones in a corpus of similar speaking style. The duration of the rhyme (the vowel and any following consonants) of a prefinal syllable typically shows more lengthening than does the onset of that syllable. For example, let v be the last vowel of w_i, the word before the boundary candidate. A feature can be computed as the relative duration of that vowel compared to its average duration in a corpus C:

x_dur(v) = (end(v_{w_i}) - start(v_{w_i})) / avg_C(v)

where start(v_{w_i}) and end(v_{w_i}) are the timings of the vowel and avg_C(v) is its average duration in the corpus.

Energy features have also been employed in sentence boundary modeling, but with less success. From a descriptive point of view, energy behaves somewhat like pitch, falling toward the end of a sentence and often showing a reset for the next sentence. However, energy is affected by a myriad of factors, including the recording itself, and can be difficult to normalize both within and across talkers. Thus it has in general been less successful than pause, pitch, and duration features for automatic segmentation.

A final feature that is sometimes considered in prosodic modeling is voice quality. Some work has shown an association between sentence boundaries and voice quality, but because such phenomena are highly speaker dependent and difficult to capture automatically, most automatic segmentation work has relied on the previously mentioned features.

Work on topic boundaries has found that major shifts in topic typically show longer pauses, an extra-high F0 onset or reset, a higher maximum accent peak, shifts in speaking rate, and a greater range in F0 and intensity [e.g., 84, 85, 86, 87, 27]. Such cues are strong enough that subjects can perceive major discourse boundaries even in filtered speech [88]. In automatic segmentation work, features such as changes in speaker, the duration of silence and overlapping speech, and the presence of certain cue phrases were found to be indicative of changes in topic, and adding them improved segmentation accuracy significantly. Georgescul, Clark, and Armstrong [89] found that similar features also gave some improvement with their approach. However, Hsueh, Moore, and Renals [90] found this to be true only for coarse-grained topic shifts (corresponding in many cases to changes in the activity or state of the meeting, such as introductions or closing review) and that detection of finer-grained shifts in subject matter showed no improvement.
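Closing out the speech features, here is a minimal sketch of the normalized-duration (preboundary lengthening) feature defined earlier in this subsection. The phone timings and corpus averages are toy values; in practice they would come from a forced alignment and from corpus statistics of a similar speaking style.

```python
# Preboundary-lengthening feature: duration of the last vowel of the word
# before the candidate boundary, normalized by that phone's corpus average.
# Timings and averages below are toy placeholders.

AVG_VOWEL_DURATION = {"eh": 0.085, "iy": 0.090, "ay": 0.120}   # toy averages (seconds)

def normalized_last_vowel_duration(phones):
    """phones: list of (phone, start, end) for the word before the boundary."""
    vowels = [(p, s, e) for p, s, e in phones if p in AVG_VOWEL_DURATION]
    if not vowels:
        return None
    phone, start, end = vowels[-1]            # last vowel of the word
    return (end - start) / AVG_VOWEL_DURATION[phone]

# "dead" = d eh d; a stretched vowel before a boundary yields a ratio above 1.
phones_dead = [("d", 1.10, 1.16), ("eh", 1.16, 1.33), ("d", 1.33, 1.45)]
print(round(normalized_last_vowel_duration(phones_dead), 2))   # 2.0
```

A ratio well above 1 indicates that the vowel is stretched relative to its typical duration, which is the lengthening cue associated with unit-final position.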
When speech recognition output is used for training o testing, reference tokens are aligned with speech recognition output words using dynamic programming to minimize alignment error (such as using NIST sclite alignment tools) and boundary annotations are transferred to the speech recognition output. Unfortunately, sometimes perfect alignment is not possible. For example, two tokens in reference annots tions with a sentence boundary between them may be recognized by the speech recognizet as a single token. In such cases, it is not clear if the sentence boundary should be omit ved from the speech recognition annotations or should be included so a heuristic rule ® used. SOS 2.7 Discussion Although sentence segientation is a useful step for many language processing tasks, cal optimization of the segmentation parameters directly for the following task in compati® jo independent optimization for segmentation quality of the predicted sentence boundst®® has been empirically shown to be useful, For example, Walker et al. {91] observed t the hardcoded rules for sentence segmentation in a machine translation system rest! in very poor sentence segmentation generalization performance compared to the us? od - tnachine learning approach. Matusoy et al. (92] show that optimizing parameters of Similarly: Poy (he source language is useful for machine translation of spoken docu 1 infornaate a et ae {931 and Liu and Xie [94] study the effect of parameter optimiza he cont on extraction and speech summarization, respectively, instead of optii™ on the sentence segmentation task itself, . gibliogr@PhY 49 ang topic segmentation, automatic i pegarding ‘omatic transcription of s ; ope a s speech uses | e ols ict top eae the language model, and this has bean shown 60 tate : m4 ‘0 improve guage model trained on a matching topic or by building a 10 asp. either by re lel wherein the ic is i topic is a latent variable estimated during decoding. AS nal Haneage moe ic-driven domai jon i i topic-driven domain adaptation is used in a wide range of natural language [95] by allowing words More ge”! 5 : cessing tasks: In information retrieval, topic is modeled explicitly to contribute differently in function of the topic in which they occur or implicitly [96 using co-occurrence space reduction techniques. In automatic summarization Tan os ‘nd Chen (97) propose to reconsider the common assumption that a document is made of , ingle topic and include topic-specific information in their model. Word-sense disambi ati ‘ benefits from topic information, as many words have probably a dominant sense ae ven topic (98) sey 2.8 Summary We described the tasks of sentence and topic segmentation for text and speech input. We jncribed learning algorithms for these tacks in several categories. Depending on the type of input (i.e., text versus speech), several different types of features may be used for these For example, in text, typographical cues such as capitalization and punctuation can be benefical. whereas in speech, prosodic features may be useful. In parallel with the recent ‘advances in speech processing and discriminative machine ing methods, performance of sentence and topic segmentation systems have improved J high-dimensional feature sets. However, these systems still make errors, nn processing stages, such as machine translation, to be robust to such h is required for jointly optimizing the segmentation stage with the leas noise, Further researc! follow-on processing systems. Bibliogr aphy hatain, and S. 
Bibliography

[1] J. Mrozinski, E. W. D. Whittaker, P. Chatain, and S. Furui, "Automatic sentence segmentation of speech for automatic summarization," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2005.

[2] J. Makhoul, A. Baron, I. Bulyko, L. Nguyen, L. Ramshaw, D. Stallard, R. Schwartz, and B. Xiang, "The effects of speech recognition and punctuation on information extraction performance," in Proceedings of the International Conference on Spoken Language Processing (Interspeech), 2005.

[3] D. Jones, W. Shen, E. Shriberg, A. Stolcke, T. Kamm, and D. Reynolds, "Two experiments comparing reading with listening for human processing of conversational telephone speech," in Proceedings of EUROSPEECH, pp. 1145-1148, 2005.

[4] W. Francis, H. Kučera, and A. Mackie, Frequency Analysis of English Usage: Lexicon and Grammar. Boston: Houghton Mifflin, 1982.
