It has been shown that the conventional HMM approach has certain weaknesses. For example, it is not possible to use any information beyond words, such as POS tags of the words, for speech segmentation. To this end, two simple extensions have been proposed. Shriberg et al. [27] suggest using explicit states to emit the boundary tokens, hence incorporating nonlexical information, such as prosodic cues estimated with other models. This approach, which is used for sentence segmentation, was put forward with the hidden event language model (HELM), as introduced by Stolcke and Shriberg, which was originally designed for speech disfluencies. The approach treats boundary events as extra meta tokens. In this model, one state is reserved for each boundary token, SB and NB, and the rest of the states are for generating words. To ease the computation, an imaginary token is inserted between all consecutive words in case the word precedes a position where the boundary is not part of a disfluency. Example 2-1 is a conceptual representation of a sequence with boundary tokens:
EXAMPLE 2-1: ... people NB are NB dead SB few NB pictures ...
The most probable boundary token sequence is again obtained simply by Viterbi decoding. The conceptual HELM for segmentation is depicted in Figure 2-3.
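To make the decoding step concrete, the following minimal sketch (in Python) decodes SB/NB tokens with an assumed bigram model; the probability table is hypothetical and stands in for an n-gram model that would be estimated from boundary-annotated data.

```python
"""Minimal sketch of hidden event LM decoding over SB/NB boundary tokens.

The bigram table below is hypothetical; a real HELM estimates an n-gram model
over words interleaved with boundary tokens from labeled training data.
"""
import math

# Hypothetical bigram probabilities P(next_token | prev_token).
# "SB" = sentence boundary token, "NB" = non-boundary token.
BIGRAM = {
    ("people", "NB"): 0.6, ("NB", "are"): 0.3, ("people", "SB"): 0.05, ("SB", "are"): 0.1,
    ("are", "NB"): 0.5, ("NB", "dead"): 0.2, ("are", "SB"): 0.02, ("SB", "dead"): 0.05,
    ("dead", "SB"): 0.4, ("SB", "few"): 0.2, ("dead", "NB"): 0.1, ("NB", "few"): 0.01,
    ("few", "NB"): 0.7, ("NB", "pictures"): 0.3, ("few", "SB"): 0.05, ("SB", "pictures"): 0.1,
}
UNSEEN = 1e-4  # crude probability floor for unseen bigrams


def logp(prev, nxt):
    return math.log(BIGRAM.get((prev, nxt), UNSEEN))


def decode_boundaries(words):
    """Choose SB or NB for every gap between consecutive words.

    With a bigram LM each gap only interacts with the two surrounding words,
    so Viterbi decoding reduces to a local argmax; a trigram or longer-span
    model would require the full dynamic program over the token sequence.
    """
    events = []
    for w, w_next in zip(words, words[1:]):
        scores = {e: logp(w, e) + logp(e, w_next) for e in ("SB", "NB")}
        events.append(max(scores, key=scores.get))
    return events


if __name__ == "__main__":
    # Reproduces the boundary sequence of Example 2-1: ['NB', 'NB', 'SB', 'NB']
    print(decode_boundaries("people are dead few pictures".split()))
```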
These extra boundary tokens are then used to capture other meta-information. The most commonly used meta-information is the feedback obtained from other classifiers. Typically, the posterior probability of being in that boundary state is used as a state observation likelihood after being divided by prior probabilities [27]. These other classifiers may also be trained with other feature sets, such as prosodic or syntactic ones. This hybrid approach is
presented in Section 2.2.4.
For topic segmentation, Tur et al. [29] used the same idea and modeled topic-start and topic-final sections explicitly, which helped greatly for broadcast news topic segmentation.
The second extension is inspired by factored language models [30], which capture not only words but also morphological, syntactic, and other information. Guz et al. [31] proposed using a factored HELM (fHELM) for sentence segmentation, using POS tags in addition to words.
2.2.2 Discriminative Local Classification Methods
Discriminative classifiers aim to model P(yi|xi) of Equation 2.1 directly. The most important distinction is that whereas class densities, p(x|y), are model assumptions in generative approaches, such as naive Bayes, in discriminative methods, discriminant functions over the feature space define the model. A number of discriminative classification approaches, such as support vector machines, boosting, maximum entropy, and regression, are commonly used for these tasks.
Figure 2-3: Conceptual hidden event language model for segmentation
Discriminative approaches have been shown to outperform their generative counterparts in many speech and language processing tasks, although training usually requires iterative optimization.
In discriminative local classification, each boundary is processed separately with local and contextual features. No global (i.e., sentence- or document-wide) optimization is performed, unlike in sequence classification models. Instead, features related to a wider context may be incorporated into the feature set. For example, the predicted class of the previous or next boundary can be used in an iterative fashion.
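As an illustration of this local view, the sketch below classifies each candidate boundary independently with a logistic regression model; it assumes scikit-learn is available, and the toy training examples and feature names are illustrative rather than taken from the chapter.

```python
"""Sketch of discriminative local classification of candidate boundaries."""
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression


def boundary_features(tokens, i):
    """Local and contextual features for the gap after tokens[i]."""
    prev_tok, next_tok = tokens[i], tokens[i + 1]
    return {
        "prev_word": prev_tok.lower(),
        "next_word": next_tok.lower(),
        "prev_is_punct": prev_tok in ".!?",
        "next_is_capitalized": next_tok[:1].isupper(),
        "prev_is_abbrev_like": prev_tok.endswith(".") and len(prev_tok) <= 4,
    }


# Toy supervision: (token sequence, gap index, 1 if a sentence ends at that gap).
examples = [
    ("He left . She stayed".split(), 2, 1),
    ("Dr. Smith arrived late".split(), 0, 0),
    ("It was late . We left".split(), 3, 1),
    ("the U.S. economy grew".split(), 1, 0),
]

vec = DictVectorizer()
X = vec.fit_transform([boundary_features(toks, i) for toks, i, _ in examples])
y = [label for _, _, label in examples]
clf = LogisticRegression().fit(X, y)

# Each candidate boundary is scored independently (no sequence-level search).
test = "He saw Mr. Jones . They talked".split()
for i in range(len(test) - 1):
    p = clf.predict_proba(vec.transform([boundary_features(test, i)]))[0, 1]
    print(f"{test[i]} | {test[i + 1]}: P(boundary) = {p:.2f}")
```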
For sentence segmentation, supervised learning methods have primarily been applied
to newspaper articles. Stamatatos, Fakotakis, and Kokkinakis [32] used transformation-
based learning (TBL) to infer rules for finding sentence boundaries. Many classifiers have
been tried for the task: regression trees [33], neural networks [34, 35], a C4.5 classification
tree [36], maximum entropy classifiers [37, 38], support vector machines (SVMs), and naive Bayes classifiers. Mikheev treated the sentence segmentation problem as a subtask of POS tagging by assigning a tag to punctuation similar to other tokens [39]. For tagging he
employed a combination of HMM and maximum entropy approaches.
The popular TextTiling method of Hearst for topic segmentation [40, 22] uses a lexical cohesion metric in a word vector space as an indicator of topic similarity. TextTiling can be seen as a local classification method with a single feature of similarity. Figure 2-4 depicts a typical graph of similarity with respect to consecutive segmentation units. The document is chopped when the similarity is below some threshold.
[Figure 2-4: Similarity scores over consecutive segmentation units, with low-similarity points indicating candidate topic boundaries.]
Two ways of computing the similarity scores were proposed: block comparison, which compares the word vectors of adjacent blocks of text, and vocabulary introduction; lexical chains have also been used as a related cohesion cue.
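The following minimal sketch illustrates the block-comparison view of this idea: cosine similarity between word-count vectors of adjacent blocks, with a boundary hypothesized when the raw similarity falls below a threshold. The block size and threshold are illustrative, and Hearst's full algorithm additionally smooths the similarity curve and uses depth scores rather than a fixed cutoff.

```python
"""Sketch of a TextTiling-style cohesion score over adjacent blocks."""
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def topic_boundaries(sentences, block=3, threshold=0.15):
    """Return indices i such that a topic boundary is hypothesized before sentences[i]."""
    bags = [Counter(s.lower().split()) for s in sentences]
    boundaries = []
    for i in range(block, len(bags) - block + 1):
        left = sum(bags[i - block:i], Counter())   # block of sentences before the gap
        right = sum(bags[i:i + block], Counter())  # block of sentences after the gap
        if cosine(left, right) < threshold:
            boundaries.append(i)
    return boundaries
```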
Most automatic topic segmentation work based on text sources has explored topical word
usage cues in one form or other. Kozima [65] used mutual similarity of words in a sequence
of text as an indicator of text structure. Reynar [66] presented a method that finds topically similar regions in the text by graphically modeling the distribution of word repetitions.
Ponte and Croft [67] extracted related word sets for topic segments with the information
retrieval technique of local context analysis and then compared the expanded word sets.
Beeferman et al. [48] combined a large set of automatically selected lexical discourse cues in a maximum entropy model. They also incorporated topical word usage into the model by building two statistical language models: one static (topic independent) and one that adapts its word predictions on the basis of past words. They showed that the log likelihood ratio of the two predictors behaves as an indicator of topic boundaries and can thus be used as an additional feature in the exponential model classifier.
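A rough sketch of that topicality signal is given below, with simple unigram models standing in for the trigram models of the original work: a static background model and an adaptive model that interpolates with a cache of recent words, with their per-word log-likelihood ratio used as a feature. All counts and constants here are illustrative.

```python
"""Sketch of an adaptive-vs-static LM log-likelihood-ratio feature."""
import math
from collections import Counter, deque


class CacheLLRFeature:
    def __init__(self, background_counts: Counter, cache_size=200, interp=0.3):
        self.bg = background_counts
        self.bg_total = sum(background_counts.values())
        self.vocab = len(background_counts) + 1
        self.cache = deque(maxlen=cache_size)  # recent words: the "adaptive" part
        self.interp = interp

    def _p_static(self, w):
        # Add-one smoothed background (topic-independent) unigram probability.
        return (self.bg[w] + 1) / (self.bg_total + self.vocab)

    def _p_adaptive(self, w):
        cache_counts = Counter(self.cache)
        p_cache = (cache_counts[w] + 1) / (len(self.cache) + self.vocab)
        return self.interp * p_cache + (1 - self.interp) * self._p_static(w)

    def feature(self, w):
        """log P_adaptive(w) - log P_static(w); the advantage of the adaptive model
        tends to drop right after a topic shift, since the cache still reflects the
        vocabulary of the previous topic."""
        llr = math.log(self._p_adaptive(w)) - math.log(self._p_static(w))
        self.cache.append(w)
        return llr
```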
Syntactic Features

Syntactic information has been successfully captured by a number of studies. Mikheev [39] used POS tags for sentence segmentation. Similarly, syntactic features in the form of constituency trees are used for global reranking, as described in Section 2.2.5; dependency parse trees are also used. For morphologically rich languages, such as Czech and Turkish, morphological analyses are used as additional cues [31, 68].

Formally, let t_1, ..., t_n be the sequence of POS or morphological tags extracted for the words. The same features can be extracted as for words (n-grams before, after, and around the candidate boundary), for example, t_{i-1}t_i and t_{i+1}t_{i+2}. Syntactic features are typically less useful for topic segmentation because topic changes are usually characterized by content shifts.
To score a sentence candidate in the global model under a probabilistic context-free grammar (PCFG), we can compute the sum of the probability of all valid parse trees for the sentence:

P(s) = Σ_t P(t) = Σ_t Π_{r ∈ t} P(r),

where t is a parse tree and r is a production rule used in that tree [69].
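This sum can be computed without enumerating trees by the inside (CKY-style) algorithm. The sketch below does so for a toy PCFG in Chomsky normal form; the grammar and its probabilities are made up for illustration.

```python
"""Inside-algorithm sketch: total probability of a sentence under a toy PCFG."""
from collections import defaultdict

# Binary rules (lhs, (B, C)) -> probability, and lexical rules (lhs, word) -> probability.
BINARY = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("DT", "NN")): 0.6,
    ("VP", ("VB", "NP")): 0.7,
}
LEXICAL = {
    ("NP", "people"): 0.4,
    ("VP", "sleep"): 0.3,
    ("DT", "the"): 1.0,
    ("NN", "pictures"): 1.0,
    ("VB", "like"): 1.0,
}


def sentence_probability(words, binary=BINARY, lexical=LEXICAL, start="S"):
    """chart[(i, j, A)] = total probability that A derives words[i:j]."""
    n = len(words)
    chart = defaultdict(float)
    for i, w in enumerate(words):                     # terminal spans
        for (lhs, term), p in lexical.items():
            if term == w:
                chart[(i, i + 1, lhs)] += p
    for span in range(2, n + 1):                      # longer spans, bottom up
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (lhs, (b, c)), p in binary.items():
                    chart[(i, j, lhs)] += p * chart[(i, k, b)] * chart[(k, j, c)]
    return chart[(0, n, start)]


print(sentence_probability("people like the pictures".split()))  # 0.168 for this toy grammar
```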
Discourse Features

In speech or text, discourse cues are always important for segmentation. For example, in a broadcast news show, the anchor first gives the headlines, then a commercial follows, and then the stories are presented one by one with optional anchor/reporter interaction and topic start and end phrases.
Previous work on both text and speech segmentation has shown that cue phrases or discourse particles (items such as now or by the way) provide valuable indicators of structural units in discourse [e.g., 70, 71]. Similarly, for speech, a change of speaker may indicate a sentence boundary, and commercials may indicate a topic boundary in broadcast news or conversations. Formally, for all events e ∈ E that appear in the vicinity of a boundary, a feature x_e can be generated to represent the occurrence of that event and, if relevant, a corresponding feature can be used to represent the nonoccurrence of that event. Events have to be detected using additional systems not detailed in this book (such as a commercial detector) that may output confidence scores. In this case, the feature will be x_e = cs_e, where cs_e is the confidence score for that event to be recognized.
Whereas earlier approaches try to capture such predetermined discourse cues, more recent corpus-based studies rely on machine learning approaches to automatically learn such patterns using informative feature sets. For example, Tur et al. [29] used explicit HMM states for topic-initial and topic-final sentences, which improved performance greatly. Rosenberg and Hirschberg [50] used statistical hypothesis testing for predetermining such phrases. For meeting or conversation segmentation, discourse features are more complex and rely on argumentation structure. Most studies simply use previous and next turns as discourse features, but higher-level semantic information such as dialog act tags or meeting agenda items can also be used for exploiting discourse information [72].
2.5.2 Features Only for Text
Typographical and Structural Features
For sentence and topic segmentation, typographical and structural cues, such as punctuation and headlines, are very informative. Sentence segmentation systems use words and punctuation before and after the boundary, the capitalization and POS tags of those words, sentence length, and how frequently those words are used in nonsentence-boundary contexts (e.g., as a lowercase word) compared to at the end or beginning of a sentence. Similarly, gazetteer information containing abbreviations and preprocessing and postprocessing patterns is employed to process text.

Formally, let g be a set of words that appear in a gazetteer. A feature is generated such that x_g(w) = 1 if w ∈ g. Similarly, the feature that denotes the frequency of the lowercased form of a word can be computed as x_lc(w) = count(lc(w)) / count(w), where lc(w) denotes the lowercase version of w.
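A minimal sketch of these two features follows; the abbreviation gazetteer and the corpus counts are illustrative placeholders, and the count ratio is one plausible reading of the frequency feature above.

```python
"""Sketch of the gazetteer indicator x_g(w) and the lowercase-frequency feature."""
from collections import Counter

ABBREV_GAZETTEER = {"dr.", "mr.", "u.s.", "etc."}          # hypothetical gazetteer g

# Hypothetical corpus used to estimate how often a word form appears lowercased.
corpus_tokens = "The the the bank Bank THE bank Apple apple".split()
counts = Counter(corpus_tokens)                            # counts of exact forms
lower_counts = Counter(t.lower() for t in corpus_tokens)   # counts of all casings


def x_gazetteer(word: str) -> int:
    """x_g(w) = 1 if w is in the gazetteer g, else 0."""
    return int(word.lower() in ABBREV_GAZETTEER)


def x_lowercase_freq(word: str) -> float:
    """Fraction of the word's corpus occurrences that are written in lowercase."""
    total = lower_counts[word.lower()]
    return counts[word.lower()] / total if total else 0.0


print(x_gazetteer("Dr."), x_lowercase_freq("The"))         # 1 and 0.6 for this toy corpus
```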
In his work on sentence segmentation, Gillick [21] observed that, for a given set of features, the choice of classifier had a much smaller impact than a mismatch in the tokenization of the input between the training and the test data. Kiss and Strunk present an unsupervised approach for finding sentence boundaries that learns abbreviations from an unlabeled corpus. Even though the approach is independent of the language, it is unable to identify abbreviations if they are not used multiple times in the test corpus.
Other structural cues include paragraph boundaries, headlines, and section numbering. Such cues appear only in structured textual sources and may not exist in certain text such as blogs and chatrooms.
2.5.3 Features for Speech
When applying segmentation to speech rather than written text, many of the same approaches can be used, but with some important considerations. First, in the case of automatic processing of speech, lexical information comes from the output of speech recognition, which typically contains errors. Second, spoken language lacks explicit punctuation, capitalization, and formatting information. Rather, this information is conveyed through the language and also through prosody, as explained shortly. Third, although some spoken language, such as news broadcasts, is read from a text, most natural speech is conversational. In natural, spontaneous speech, sentences can be "ungrammatical" (from the perspective of formal syntax) and typically contain significant numbers of normal speech disfluencies, such as filled pauses, repetitions, and repairs.
Spoken language input, on the other hand, provides additional, "beyond words" information through its intonational and rhythmic information, that is, through its prosody. Prosody refers to patterns in pitch (fundamental frequency), loudness (energy), and timing (as conveyed through pausing and phonetic durations). Prosodic cues are known to be relevant to discourse structure in spontaneous speech and can therefore be expected to play a role in indicating sentence boundaries and topic transitions. Furthermore, prosodic cues by their nature are independent of word identity. Thus they tend to suffer less than do lexical features from errors in automatic speech recognition.

When working with speech recognition output, some words may be incorrect due to recognition errors, degrading the quality of lexical features. Similarly, token start times and their durations may also be wrongly estimated, causing errors in prosodic feature computation. Typically, a large set of prosodic features is extracted for robustness to these errors.

Prosodic Features
Figure 2-5 depicts some general prosodic features used for segmenting speech into sentences, along with lexical features. Broadly speaking, the prosodic features associated with sentence boundaries are similar to those for topic boundaries because both involve conveying a break that serves to chunk information. Pause length, duration lengthening, and pitch and energy resets are generally greater in magnitude for the larger (i.e., topic) breaks, but similar types of prosodic features can be used for both tasks, trained of course for the task at hand.
[Figure 2-5 shows, around the previous word, boundary, and next word: speaker change, stylized pitch, pitch/energy difference, vowel/rhyme duration, pause, and word and POS n-grams.]
Figure 2-5: Some basic prosodic and lexical features for speech segmentation
Prosodic features for sentence segmentation have been used in a number of studies [75, 27, 76, 77, 78, 51, 11, 79, 60, 80]. The simplest and most often used feature is a pause at the boundary of interest. For automatic processing, pauses are more easily obtained than other prosodic features because, unlike pitch and energy features, pause information can be extracted from automatic speech recognition output. Of course, not all sentence boundaries contain pauses, particularly in spontaneous speech. And conversely, not all pauses correspond to sentence boundaries. For example, many sentence-internal disfluencies also contain pauses. Some methods use simply the presence of a pause; others model the duration of the pause. Pause durations can be quite large in the case of turn-final sentence boundaries in conversation because such regions correspond to time during which another participant is talking. Sentence segmentation for certain dialog acts, such as backchannels (e.g., "uh-huh"), which tend to occur in isolated turns, can thus be achieved fairly successfully using only pause information.
The pause feature is computed as x_pause = start(w_{i+1}) − end(w_i), where start() and end() represent the timing in seconds of the beginning and the end of a word in the speech recognition output. Relevant side features are the pause before the word (to know if it is isolated) and the quantized pause, equal to 1 iff x_pause > thr_pause, where thr_pause is set to, for example, 0.2 second. Pause duration does not follow a normal distribution by nature and tends to confuse classifiers that expect such a distribution. However, this single feature is often the most relevant one for segmenting speech.
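The following minimal sketch computes these pause features from word timings (e.g., from recognizer output or a forced alignment); the word list is illustrative, and the 0.2 s threshold follows the text above.

```python
"""Sketch of the pause and quantized-pause features from word timings."""

# Hypothetical recognizer output: (word, start_time_s, end_time_s).
words = [("people", 0.00, 0.35), ("are", 0.40, 0.55),
         ("dead", 0.58, 0.95), ("few", 1.40, 1.62)]

THR_PAUSE = 0.2  # seconds


def pause_features(words, i):
    """Features for the candidate boundary between words[i] and words[i + 1]."""
    _, _, end_i = words[i]
    _, start_next, _ = words[i + 1]
    pause = max(0.0, start_next - end_i)   # x_pause = start(w_{i+1}) - end(w_i)
    return {"pause": pause, "pause_gt_thr": int(pause > THR_PAUSE)}


for i in range(len(words) - 1):
    print(words[i][0], "|", words[i + 1][0], pause_features(words, i))
```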
More detailed prosodic modeling has included pitch, phone duration, and energy information. Pitch is captured by modeling fundamental frequency during voiced regions of speech. Pitch conveys a wide range of types of information, including information about the prominence of a syllable, but for sentence segmentation the goal is usually to capture a reset in pitch. Thus, methods have looked at pitch differences across a word boundary, with a larger negative difference indicating higher probability of a sentence boundary. In addition to modeling the break in pitch across a word boundary, some approaches [27] have also modeled a speaker-specific value to which pitch falls at the ends of utterances. This not only improves performance but also allows for causal modeling because it does not rely on speech after the pause [81].
Pitch is not a continuous function and cannot be computed outside of voiced regions. Therefore, pitch features can be undefined for a given boundary candidate, which might be a problem with certain classifiers. Computing pitch, and smoothing and interpolating it properly, is not the matter of this book and should be handled by appropriate software; a widely used tool is Praat. Typically, features are computed from statistics of pitch values in a window before the end of the word before the candidate boundary and after the beginning of the word after the boundary. For example, the pitch difference feature described in the previous paragraph results in

x_pitch = ( max_{t ∈ W_e(w_i)} pitch(t) ) − ( max_{t ∈ W_s(w_{i+1})} pitch(t) ),

where pitch(t) is the pitch value at time t, W_e(w_i) is a temporal window anchored at the end of word w_i, and W_s(w_{i+1}) is a similar window at the start of word w_{i+1}. Variants of this feature can be created by changing the window size (i.e., 200 ms, 500 ms), changing the statistics computed on both sides of the boundary (i.e., min, max, mean), and normalizing pitch values according to different factors (i.e., log-space projection, standardization by the distribution of pitch values of the current speaker).
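A minimal sketch of this pitch-reset feature is given below; the frame values, window size, and handling of unvoiced frames are illustrative, and in practice the pitch contour would come from a tracker such as Praat, with smoothing and interpolation applied first.

```python
"""Sketch of the pitch-difference (reset) feature across a candidate boundary."""

WINDOW = 0.2  # seconds on each side of the boundary


def max_pitch_in(frames, t_start, t_end):
    """frames: list of (time_s, f0_hz); f0 == 0 marks unvoiced frames."""
    vals = [f0 for t, f0 in frames if t_start <= t <= t_end and f0 > 0]
    return max(vals) if vals else None


def pitch_reset(frames, end_prev_word, start_next_word, window=WINDOW):
    left = max_pitch_in(frames, end_prev_word - window, end_prev_word)
    right = max_pitch_in(frames, start_next_word, start_next_word + window)
    if left is None or right is None:
        return None  # undefined in unvoiced regions; the classifier must handle this
    # Pitch falls before a sentence end and resets higher afterward, so a larger
    # negative difference suggests a more likely boundary.
    return left - right
```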
tures for sentence segmentation aim to capture a phenomenon known as
Duration feat
undary lengthening in which the last region of speech before the end of a unit
uration. (Interestingly, this phenomenon is also observed in music and
ssonin bird song [83).) Automatic modeling methods best capture preboundary lengthening
shen phone durations are normalized by the average duration of those phones in a corpus of
“nila speaking style. The duration of the rhyme (the vowel and any following consonants)
“fa prefinal syllable typically shows more lengthening than does the onset of that syllable.
For example, let v be the last vowel in wi, the word before the boundary candidate.
4 feature can be computed as the relative duration of that vowel compared to its average
duration in a corpus C
chere pitch(t
end of word Us
this feature ca
prebor
is stretched out 11 d
start(Uw,)
tart(Uw)
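A short sketch of this normalized-duration feature follows; the phone timings and average vowel durations are illustrative, and in practice they come from forced alignment and corpus statistics of a similar speaking style.

```python
"""Sketch of a preboundary-lengthening (normalized vowel duration) feature."""

# Hypothetical average vowel durations (seconds) estimated from a corpus C.
AVG_VOWEL_DUR = {"iy": 0.09, "ae": 0.11, "ah": 0.07}


def duration_feature(last_vowel, start_s, end_s, avg_dur=AVG_VOWEL_DUR):
    """(end(v) - start(v)) / average duration of v in the corpus."""
    mean = avg_dur.get(last_vowel)
    if not mean:
        return None
    return (end_s - start_s) / mean


print(duration_feature("ae", 1.20, 1.38))  # about 1.6, i.e., noticeably lengthened
```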
Energy features have also been employed in sentence boundary modeling, but with less success. From a descriptive point of view, energy behaves somewhat like pitch, falling toward the end of a sentence and often showing a reset for the next sentence. However, energy is affected by a myriad of factors, including the recording itself, and can be difficult to normalize both within and across talkers. Thus it has in general been less successful than pause, pitch, and duration features for automatic segmentation.
A final feature that is sometimes considered in prosodic modeling is voice quality. Descriptive work has shown an association between sentence boundaries and voice quality, but because such phenomena are highly speaker dependent and difficult to capture automatically, most automatic segmentation work has relied on the previously mentioned features.
Work on topic boundaries has found that major shifts in topic typically show longer pauses, an extra-high F0 onset or reset, a higher maximum accent peak, shifts in speaking rate, and greater range in F0 and intensity [e.g., 84, 85, 86, 87, 27]. Such cues are salient enough that subjects can perceive major discourse boundaries even after spectral filtering [88]. In automatic segmentation work, features such as changes in speaker, the amount of silence and overlapping speech, and the presence of certain cue phrases have been found to be indicative of changes in topic, and adding them improved segmentation accuracy significantly. Georgescul, Clark, and Armstrong [89] found that similar features also gave some genuine improvement with their approach. However, Hsueh, Moore, and Renals [90] found this to be true only for coarse-grained topic shifts (corresponding in many cases to changes in the activity or state of the meeting, such as introductions or closing review) and that detection of finer-grained shifts in subject matter showed no improvement.
2.6 Processing Stages
Usually, the first step in the segmentation tasks is preprocessing to determine tokens and candidate boundaries. In languages like English, words are candidate tokens, but special cases like abbreviations and acronyms exist. In languages like Mandarin, with textual sources, a preceding word segmentation step can be employed.
Then a set of features, as described in the previous section, is extracted for each candidate. For speech data, token start times and durations are usually not available in the reference annotations of the spoken utterances, but these are necessary for computing prosodic features. Usually, a forced alignment or decoding step is performed to obtain these features.
Once the features are extracted, each candidate boundary is classified using one of the
methods described in the previous sections.
For testing, the automatically estimated token boundaries are compared to the boundaries in reference transcriptions. When speech recognition output is used for training or testing, reference tokens are aligned with speech recognition output words using dynamic programming to minimize alignment error (such as using NIST sclite alignment tools), and boundary annotations are transferred to the speech recognition output. Unfortunately, sometimes perfect alignment is not possible. For example, two tokens in reference annotations with a sentence boundary between them may be recognized by the speech recognizer as a single token. In such cases, it is not clear if the sentence boundary should be omitted from the speech recognition annotations or should be included, so a heuristic rule is used.
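The sketch below illustrates this transfer of boundary labels onto recognizer output using a standard-library alignment; real evaluations typically rely on NIST sclite, and the tie-breaking for boundaries lost inside merged or deleted tokens is just one possible heuristic.

```python
"""Sketch of transferring reference sentence boundaries onto recognizer output."""
import difflib


def transfer_boundaries(ref_words, ref_boundaries, hyp_words):
    """ref_boundaries: set of indices i meaning a sentence ends after ref_words[i].
    Returns the corresponding set of hypothesis-word indices."""
    ops = difflib.SequenceMatcher(None, ref_words, hyp_words).get_opcodes()
    ref_to_hyp = {}
    for tag, i1, i2, j1, j2 in ops:
        for k, i in enumerate(range(i1, i2)):
            # Map each reference token to a hypothesis token in the aligned block;
            # deletions (empty hypothesis block) fall back to the previous hyp token.
            j = min(j1 + k, j2 - 1) if j2 > j1 else max(j1 - 1, 0)
            ref_to_hyp[i] = j
    return {ref_to_hyp[i] for i in ref_boundaries if i in ref_to_hyp}


ref = "people are dead few pictures".split()   # sentence ends after "dead" (index 2)
hyp = "people are debt few pictures".split()   # recognizer misrecognized one word
print(transfer_boundaries(ref, {2}, hyp))      # -> {2}: boundary placed after "debt"
```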
2.7 Discussion
Although sentence segmentation is a useful step for many language processing tasks, optimization of the segmentation parameters directly for the following task, in comparison to independent optimization for segmentation quality of the predicted sentence boundaries, has been empirically shown to be useful. For example, Walker et al. [91] observed that the hardcoded rules for sentence segmentation in a machine translation system resulted in very poor sentence segmentation generalization performance compared to the use of a machine learning approach. Matusov et al. [92] show that optimizing the parameters of sentence segmentation for the source language is useful for machine translation of spoken documents. Similarly, Favre et al. [93] and Liu and Xie [94] study the effect of parameter optimization on information extraction and speech summarization, respectively, instead of optimizing on the sentence segmentation task itself.
Regarding topic segmentation, automatic transcription of speech can exploit topic information in the language model, and this has been shown to improve ASR, either by rescoring with a language model trained on a matching topic or by building a conditional language model wherein the topic is a latent variable estimated during decoding.
More generally, topic-driven domain adaptation is used in a wide range of natural language processing tasks. In information retrieval, topic is modeled explicitly [95], by allowing words to contribute differently as a function of the topic in which they occur, or implicitly [96], using co-occurrence space reduction techniques. In automatic summarization, Tan and Chen [97] propose to reconsider the common assumption that a document is made of a single topic and include topic-specific information in their model. Word-sense disambiguation benefits from topic information, as many words probably have a dominant sense in a given topic [98].
2.8 Summary
We described the tasks of sentence and topic segmentation for text and speech input. We described learning algorithms for these tasks in several categories. Depending on the type of input (i.e., text versus speech), several different types of features may be used for these tasks. For example, in text, typographical cues such as capitalization and punctuation can be beneficial, whereas in speech, prosodic features may be useful.
In parallel with the recent advances in speech processing and discriminative machine learning methods, the performance of sentence and topic segmentation systems has improved with the use of high-dimensional feature sets. However, these systems still make errors, requiring follow-on processing stages, such as machine translation, to be robust to such noise. Further research is required for jointly optimizing the segmentation stage with the follow-on processing systems.
Bibliography

[1] J. Mrozinski, E. W. D. Whittaker, P. Chatain, and S. Furui, "Automatic sentence segmentation of speech for automatic summarization," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2005.

[2] J. Makhoul, A. Baron, I. Bulyko, L. Nguyen, L. Ramshaw, D. Stallard, R. Schwartz, and B. Xiang, "The effects of speech recognition and punctuation on information extraction performance," in Proceedings of International Conference on Spoken Language Processing (Interspeech), 2005.

[3] D. Jones, W. Shen, E. Shriberg, A. Stolcke, T. Kamm, and D. Reynolds, "Two experiments comparing reading with listening for human processing of conversational telephone speech," in Proceedings of EUROSPEECH, pp. 1145-1148, 2005.

[4] W. Francis, H. Kučera, and A. Mackie, Frequency Analysis of English Usage: Lexicon and Grammar. Boston: Houghton Mifflin, 1982.