Technische Universität Dresden
Intonation Patterns of German - Model-based Quantitative Analysis and
Synthesis of F0 contours
Dipl.-Ing. Hansjörg Mixdorff
Dissertation approved by the Faculty of Electrical Engineering of the
Technische Universität Dresden
for the award of the academic degree of
Doktor-Ingenieur
(Dr.-Ing.)
Composition of the doctoral committee:
Chair: Prof. Dr.-Ing.habil. Schreiber
1st reviewer: Prof. Dr.-Ing.habil. Hoffmann
2nd reviewer: Prof. Dr.-Ing.habil. Fujisaki, Science University of Tokyo
3rd reviewer: Prof. Dr.habil. Stock, Martin-Luther-Universität Halle-Wittenberg
Date of submission: 26.5.1997
Date of scientific defense: 26.5.1998
For Ken
Contents
0.1 Table of Abbreviations Used
0.2 List of Symbols
0.3 Typography
1 Introduction
I Theoretical Part
2 Intonation - An Overview
2.1 The Term `Intonation' Employed in the Current Study
2.2 An Interdisciplinary Research Topic
2.3 Production and Perception of Intonation
2.3.1 The Production Process of Intonation
2.3.2 The Role of Intonation in Speech Perception
2.4 The Functions of Intonation and the Terminology Employed
2.4.1 General Functions of Intonation
2.4.2 Word accent
2.4.3 Accent, Stress, Pitch Accent and Stress Accent
2.4.4 Accent Group, Prosodic Phrase, Sentence Accent
2.4.5 Focus
2.4.6 Sentence Mode
2.4.7 Declination
2.4.8 Location of F0 peaks
2.4.9 Conclusions
2.5 Models of Intonation
2.5.1 Isačenko & Schädlich, 1964
2.5.2 Bierwisch, 1966
2.5.3 Pheby, 1981
2.5.4 Stock & Zacharias, 1982
2.5.5 Altmann, Batliner and Oppenrieder, 1989
2.5.6 Féry, Uhmann based on Pierrehumbert 1980
2.5.7 Adriaens, 1991, IPO
2.5.8 d'Alessandro & Mertens 1995
2.5.9 Bannert 1983
2.5.10 Kohler 1977, 1991
2.5.11 Discussion and Conclusions
3 The Fujisaki-Model
3.1 Model Components
3.2 Physiological Interpretation
3.3 Earlier Works Employing the Model
3.4 Earlier Application to German
3.5 Reasons for Choosing the Fujisaki-Model in this Study
4 The Approach Chosen for this Study
4.1 Introduction
4.2 Pre-Processing
4.2.1 Recording and Conversion of Speech-Data
4.2.2 Extraction and Editing of F0 contours
4.2.3 Marking of Word Boundaries
4.3 Modeling of F0 contours
4.3.1 Modeling Procedure
4.3.2 Modeling Constraints
4.3.3 Selection of Time Constants
4.3.4 Numerical Optimization
4.3.5 Auditory Check by Means of LPC-Analysis-Re-synthesis
II Experimental Part
5 Production of Sentence Mode and Focus
5.1 Introduction
5.2 Experimental Setting
5.2.1 Speech Material
5.2.2 Reading Conditions
5.2.3 Method of Analysis
5.3 Results
5.3.1 General Observations
5.3.2 Sentence Mode
5.3.3 Focal Condition
5.3.4 Accent command timing
5.3.5 Fb and Ap
5.3.6 Introduction of a Slow Rise Component
5.4 Parallels with Other Intonational Contrasts
5.5 Discussion and Conclusions
6 Perception of Sentence Mode
6.1 Motivation
6.2 Experimental Setting
6.2.1 The Stimuli
6.2.2 Experimental Design
6.2.3 Subjects
6.3 Results
6.3.1 General Observations
6.3.2 Method of Evaluation
6.3.3 Statement vs. Non-terminal Intonation
6.3.4 Non-terminal vs. Question Intonation
6.3.5 Number of Playback Actions
6.3.6 Drop-Outs
6.4 Discussion and Conclusions
6.5 Proposition of Intonational Elements
7 Phrasing and Accentuation of Complex Utterances
7.1 Introduction
7.2 The Text
7.3 Objectives
7.4 Features Extracted
7.4.1 Utterance Level
7.4.2 Phrase Level
7.4.3 Accent Level
7.5 Experimental Setting
7.6 Results
7.6.1 General Observations
7.6.2 Normalizing Effect of the Model
7.6.3 Phrasing
7.6.4 Accentuation
7.6.5 Overall Speech Rate
7.6.6 Variation of Speech Rate
7.7 Discussion and Conclusions
III Applications
8 Generation of F0 Contours for Speech Synthesis
8.1 Introduction
8.2 Existing Approaches
8.2.1 Inherent Prosody
8.2.2 `Transplantation' of F0 contours
8.2.3 Rule-based Systems
8.2.4 Summary
8.3 Rule-Generation for the Fujisaki-Model
8.4 Design of a Synthesis Scheme
8.4.1 Linguistic Pre-Processing
8.4.2 Application of Phonetic Rules
8.4.3 Discussion and Conclusions
9 Model-Based Analysis in Cross-Language Comparison
9.1 Introduction
9.2 Deviations Caused by L1
9.3 Contrasts of German and Japanese
9.3.1 Syntax
9.3.2 Segmental Phonetics
9.3.3 Intonation
9.4 The Experiment
9.4.1 Speech Material
9.4.2 Method of Analysis
9.4.3 Results
9.5 Conclusions
10 Discussion and Conclusions
10.1 Summary of Chapters
10.2 General Discussion
10.3 Discussion of Results
10.3.1 Results Concerning the Phrase Component
10.3.2 Results Concerning the Accent Component
10.4 Short-term Objectives
10.4.1 Semi-automatic Analysis
10.4.2 Development of a TTS-System with Fujisaki-model Prosody
10.5 Long-term Objectives
10.6 Examples of Future Applications
10.7 Final Remarks
A Texts of Speech Corpora and Contexts
A.1 Contexts for Speech Material Section 5.2.1
A.2 Contexts for Speech Material Section 5.4
A.3 List of Parameter Values Used for Stimuli in the Perception Experiment
A.4 Corpus Chapter 7 (German)
A.5 Corpus Chapter 9 (Japanese)
B Rules for Prosodic Control in Speech Synthesis (Chapter 8)
B.1 Symbolic Rules
B.2 Phonetic Rules for the Fujisaki-Model
C Details on Preliminary Experiment on Japanese Sentence Intonation (Chapter 9)
D Acknowledgements
E Curriculum Vitae
0.1 Table of Abbreviations Used
FFT Fast-Fourier-Transform
IPA International Phonetic Alphabet
LPC Linear Predictive Coding
TTS Text-to-Speech
VCG Verb-Complement Group
0.2 List of Symbols
F0 Fundamental frequency
T0 Fundamental period
Ap Phrase command magnitude
T0 Phrase command onset time
Aa Accent command amplitude
T1 Accent command onset time
T2 Accent command offset time
↑E Rising tone-switch early in the accent syllable
↑L Rising tone-switch late in the accent syllable
↓E Falling tone-switch early in the accent syllable
↓L Falling tone-switch late in the accent syllable
I↓ Information intoneme
I↑ Contact intoneme
N↑ Non-terminal intoneme
Bcat↑ Boundary tone, concatenated type
Bnocat↑ Boundary tone, non-concatenated type
0.3 Typography
Italics are generally used to indicate English translations of German exemplary
texts. Syllables which are locations of word accents are set in bold type. In
Figures, words which are the location of a narrow focus are CAPITALIZED.
Chapter 1
Introduction
Abstract
In the present study a model of German intonation is presented which elaborates
on the early tone switch approach by Isačenko to form a quantitative description
of intonational events. Basic elements, `intonemes' and `boundary tones', are
defined which characterize an arbitrary intonation (F0) contour and whose
properties can be described in terms of the physiologically motivated,
mathematically formulated Fujisaki-model of the generation process of F0.
Natural speech data is analyzed to yield typical parameter values for intonemes
in a given linguistic context.
In this chapter the motivation and aims of the present study are briefly
discussed.
At first thought, a text-to-speech (TTS) system and a learner of a foreign
language¹ reading a text in his target language may not have much in common.
If we think of a listener who is a native speaker of this language, however, we
find some similarities. In the case of TTS, a limited set of sound elements and
their imperfect concatenation may degrade the segmental quality of the speech
produced. In the case of the non-native speaker, strong interference from the
sound inventory of that speaker's mother tongue may produce a comparable effect.
Even worse, and severely impairing intelligibility for the listener, is the lack
or the non-nativeness of prosodic control.
Prosody helps us segment a speech continuum into meaningful units. It indicates,
for instance, whether the speaker is asking a question or giving an order, and
it provides clues to what he or she wants to emphasize. Apart from this
information, which will be called linguistic information, prosody can hint at
the emotional state of the speaker, his nationality, social background and, if
the conversation takes place over the telephone, his age and sex. The somewhat
strange comparison given at the beginning shows that prosody is a subject
equally important for quite different disciplines, i.e. speech processing,
foreign language teaching and, of course, general and contrastive linguistics.
According to the requirements and traditions of these fields of application of
prosodic knowledge, various theories and models for describing prosodic
phenomena were developed. On the one hand, there are, for instance, linguistic
models which are based on generative grammars and treat prosody as completely
predictable from syntactic structures. On the other hand, there are prosodic
models in the domain of speech processing which use statistical information
derived from measurements of acoustic parameters, like the fundamental frequency
(F0) contour, segmental duration and intensity². Most descriptions for the
purpose of teaching foreign prosody confine themselves to crude stylizations of
melodic patterns.
The author found himself confronted with this `gap' between disciplines when
he took an interest in the question of whether a non-native accent is detectable
from the prosodic features of the speech signal which can be associated with
`intonation'³ and, if so, whether this could be useful for teaching foreign
intonation.
¹ The term `foreign' requires some discussion since its use is
context-dependent. If we speak of a learner acquiring a language other than his
mother tongue, the term `foreign' applies to that second, third, ..., n-th
language (henceforth `L2'). If the same learner is confronted with native
speakers of that language, he may show a `foreign accent', i.e. an accent
influenced by his mother tongue. Hence, in this case the term `foreign' denotes
his native language. By convention, in this study the term `foreign' will be
applied to denote the language that is foreign from the standpoint of the
learner. From the standpoint of the listener belonging to that foreign language
community the learner may exhibit an accent which will be denoted as
`non-native'.
² Hidden-Markov-Modeling of F0 or segmental durations, for instance.
³ By intonation we do not only mean the melodic feature of an utterance
reflected by the F0 contour, but also pause structure and the temporal
relationship of the F0 contour with the segmental string.
Unfortunately, there is no universal system for describing intonation like the
IPA used for the sounds of speech, and it is not clear if the difference between
a native and a non-native intonation is a qualitative or a quantitative one, or
both. Hence a model of German intonation had to be developed by analyzing
selected speech corpora and thus creating a `standard' with which non-native
intonation could be compared. The model had to meet the following requirements:
1. quantitative and qualitative description of natural F0 contours
2. direct relationship between linguistic units and structures (word accent,
phrase accent, sentence mode etc.) and model components
3. direct control by means of speech analysis / re-synthesis
The quantitative model by Fujisaki [FH84] was adopted since it permits a
succinct description of natural F0 contours without a significant loss of
accuracy and permits an immediate synthesis. The model was combined with a
linguistic description of German intonation based on the works of Isačenko
[IS64] and Stock [SZ82] to yield linguistically meaningful model parameters.
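As a rough illustration of why the model lends itself to immediate synthesis,
the following Python sketch evaluates the commonly cited formulation of the
Fujisaki-model, ln F0(t) = ln Fb + sum of Ap * Gp(t - T0) + sum of
Aa * [Ga(t - T1) - Ga(t - T2)], for a given set of commands. Fb denotes the base
frequency; Ap, T0, Aa, T1 and T2 are the command parameters from the list of
symbols. The time constants alpha and beta, the ceiling gamma, all numerical
values and all function names are assumptions chosen for illustration only and
are not parameter values reported in this study.

```python
import math


def phrase_response(t, alpha):
    """Impulse response Gp(t) of the phrase control mechanism."""
    if t < 0.0:
        return 0.0
    return alpha * alpha * t * math.exp(-alpha * t)


def accent_response(t, beta, gamma=0.9):
    """Step response Ga(t) of the accent control mechanism, limited to gamma."""
    if t < 0.0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def fujisaki_log_f0(t, fb, phrase_cmds, accent_cmds,
                    alpha=2.0, beta=20.0, gamma=0.9):
    """Return ln F0(t) as base value plus phrase and accent components.

    phrase_cmds: list of (Ap, T0) pairs, accent_cmds: list of (Aa, T1, T2).
    alpha, beta (in 1/s) and gamma are illustrative defaults (assumptions).
    """
    value = math.log(fb)
    for ap, t0 in phrase_cmds:
        value += ap * phrase_response(t - t0, alpha)
    for aa, t1, t2 in accent_cmds:
        value += aa * (accent_response(t - t1, beta, gamma)
                       - accent_response(t - t2, beta, gamma))
    return value


# Hypothetical commands: one phrase command and one accent command.
phrase_cmds = [(0.5, -0.2)]        # (Ap, T0 in s)
accent_cmds = [(0.4, 0.3, 0.6)]    # (Aa, T1 in s, T2 in s)

# Sample the contour every 10 ms and convert from ln F0 back to Hz.
contour_hz = [math.exp(fujisaki_log_f0(0.01 * i, 80.0, phrase_cmds, accent_cmds))
              for i in range(150)]
```

Because the contour is fully determined by a handful of command parameters, a
change to a single command (for instance a larger Aa) can be re-synthesized and
auditioned directly.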
The two languages examined contrastively were German and Japanese. Most of the
study was conducted during a two-year stay granted by the Japanese Ministry of
Education, during which the author worked with Prof. Hiroya Fujisaki, collected
speech data from German and Japanese subjects and analyzed them under linguistic
constraints.
Chapter 2 starts the theoretical part (Part I) of this dissertation with a
description of the state of research in the field of intonation, the meaning of
intonation for speech perception and the mechanisms of its production. Then the
terminology and models of intonation are discussed with respect to earlier works
on which the current dissertation is based.
Chapter 3 introduces the Fujisaki-model, the interpretation of its components
and its physiological background. Special emphasis is placed on the studies by
Bernd Möbius, who first systematically applied the Fujisaki-model to German
intonation.
Chapter 4 addresses the approach adopted in the present study: the acquisition,
pre-processing and analysis of speech data and the problems involved.
The experimental part (Part II) of the dissertation begins with Chapter 5,
describing an experiment dealing with the analysis of short utterances that
convey intonational contrasts of sentence mode and focus.
In Chapter 6 a perception experiment is presented which was conducted to verify,
by way of example, the perceptual significance of the parameter values yielded
for the distinction of sentence modes in Chapter 5.
In the experiment discussed in Chapter 7, the model is applied to the analysis
of more complex utterances to examine phrasing and accentuation. A syntactic
and semantic analysis of the speech material is conducted to examine if and how
syntactic units are mapped onto intonation patterns.
Part III of this dissertation addresses applications of the intonation model
developed. Based on the statistical evaluation of the parameter values in the
experiments, a scheme for synthesizing near-to-natural F0 contours is proposed
in Chapter 8.
Chapter 9 examines intonational differences between Japanese learners of
German and German native speakers reading the text used in Chapter 7.
Part I
Theoretical Part
Chapter 2
Intonation - An Overview
Abstract
This chapter explains the term `intonation' as employed in this work and gives
an overview of intonation as an interdisciplinary research topic. The production
process and some important findings about the role of intonation in speech
perception are briefly discussed. The functions of intonation are then
introduced. In this context, emphasis is put on the linguistic functions and the
terminology commonly adopted. In the last part of this chapter, conventional
models of intonation are discussed which aim to establish the relationship
between the linguistic contents of an utterance and the F0 contour.
2.1 The Term `Intonation' Employed in the Current Study
The speech melody which is commonly associated with `intonation' belongs to
the group of features which affect domains in the speech signal above the
segments (syllables, phonemes) and hence has `suprasegmental' character. In some
empirical studies, the term `prosody' is employed as a synonym for intonation
(see, for instance, [FSH95, Bru95, TI94a, TI94b]), though prosody - in addition
to intonation - concerns other suprasegmental features like the relative timing
or intensity of segments in an utterance or features like voice quality or vowel
formant frequencies.
For a provisional definition it is assumed that the primary acoustical correlate
of intonation is the fundamental frequency contour F0(t)¹. It corresponds
to the periodicity of voiced speech sounds and is interrupted by voiceless
sounds and speaking pauses. Since F0 is inseparably connected with the
underlying segmental string of an utterance, the duration of these segments
(typically syllables, words, phrases) can be regarded as a secondary correlate
of intonation.
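The link between periodicity and F0 can be made concrete with a minimal Python
sketch that estimates F0 of a single analysis frame from the strongest peak of
its autocorrelation function and returns no value for a silent frame. This is a
generic textbook technique and not necessarily the extraction method used in
this study; the function name, the search range of 60 to 400 Hz, the voicing
threshold of 0.3 and the synthetic test signal are all assumptions made for
illustration.

```python
import math


def estimate_f0(frame, fs, f_min=60.0, f_max=400.0, threshold=0.3):
    """Estimate F0 in Hz from the autocorrelation peak; None if unvoiced.

    The autocorrelation is normalized by the frame energy only, so it decays
    with increasing lag; this favours the shortest period over its multiples.
    """
    n = len(frame)
    energy = sum(x * x for x in frame)
    if energy == 0.0:
        return None  # silent frame: no periodicity to measure
    lag_min = int(fs / f_max)
    lag_max = min(int(fs / f_min), n - 1)
    best_lag, best_r = None, 0.0
    for lag in range(lag_min, lag_max + 1):
        r = sum(frame[i] * frame[i + lag] for i in range(n - lag)) / energy
        if r > best_r:
            best_lag, best_r = lag, r
    if best_lag is None or best_r < threshold:
        return None  # too weakly periodic to be treated as voiced
    return fs / best_lag  # F0 = 1 / T0, with T0 = best_lag / fs


# Synthetic "voiced" frame: 125 Hz fundamental plus second harmonic, 40 ms.
fs = 16000
frame = [math.sin(2.0 * math.pi * 125.0 * i / fs)
         + 0.5 * math.sin(2.0 * math.pi * 250.0 * i / fs)
         for i in range(640)]
f0 = estimate_f0(frame, fs)
```

Applied frame by frame, such an estimator yields a contour with gaps wherever
the signal is voiceless or silent, which corresponds to the interrupted F0
contour described above.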
Hence, in the current study, the term `F0 contour' denotes the feature F0 in
its temporal relationship to meaningful units of speech (typically words) and the
timing of intonational events will be described relative to these units.
[Figure 2.1, image not reproduced. Panels: speech waveform; F0 in Hz (ticks at
60, 120, 180, 240) with the extracted F0 (+++), the model F0 and the phrase
component; accent commands Aa (ticks at 0.2, 0.6); time axis in s from -0.5 to
1.5. Example utterance: "Sie haben den Wagen geliehen."]
Figure 2.1: The key to figures showing speech data as used in the current study.
From top to bottom the speech waveform, the extracted F0 contour, the contour
produced with the Fujisaki-model and the corresponding accent commands are
displayed. Word boundaries are marked by the vertical dotted lines.
This is the reason why all figures of F0 contours produced for the study feature
the locations of word boundaries marked with vertical dotted lines as shown in
Figure 2.1, where the key to the figures of F0 contours henceforth employed is
given. In section 6.5 a more systematic definition of the relationship between
F0 contour and meaningful speech segments will be presented. The F0 contour
is consistently displayed in the log F domain².
¹ The slightly different notion of `intonaziya' as employed by Russian scholars
also comprises features such as speech tempo, intensity and voice quality
[Bul70, p.167].
² Throughout this dissertation the term `log' denotes the natural (base e)
logarithm.
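The effect of displaying F0 in the log domain can be sketched in a few lines of
Python. The example maps F0 samples in Hz to their natural logarithm and leaves
unvoiced frames undefined; the function name, the convention of coding unvoiced
frames as 0 and the sample values are assumptions for illustration.

```python
import math


def to_log_f0(f0_hz):
    """Map F0 samples in Hz to ln(F0); unvoiced frames (<= 0) become None."""
    return [math.log(f) if f > 0.0 else None for f in f0_hz]


# Hypothetical F0 track in Hz; 0.0 marks unvoiced frames or pauses.
track_hz = [0.0, 60.0, 120.0, 240.0, 0.0]
log_track = to_log_f0(track_hz)

# Equal frequency ratios map to equal distances on the log axis:
# the step from 60 to 120 Hz equals the step from 120 to 240 Hz.
step_low = log_track[2] - log_track[1]
step_high = log_track[3] - log_track[2]
```

On a linear Hz axis the second step would appear twice as large as the first,
whereas on the log axis both doublings cover the same distance.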
2.2 An Interdisciplinary Research Topic
[Figure 2.2, image not reproduced. Diagram with intonation research at the
center, linking Linguistics (General Linguistics, Comparative Linguistics,
Sociolinguistics, Conversation Analysis, Discourse Theory), Phonetics (Speech
Production, Speech Perception), Speech Technology (Speech Analysis, Speech
Synthesis, Speech Recognition), Speech Pathology and Foreign Language Teaching.
The connections are labeled with the items exchanged: linguistic and phonetic
knowledge, linguistic and phonetic models, model implementations, pathological
and didactic problems, therapeutic and teaching aids, and requirements.]
Figure 2.2: Intonation as an interdisciplinary research topic. The figure shows
the interaction between linguistics and phonetics with applied sciences like
speech technology, foreign language teaching and speech pathology.
Intonation by its nature is an object of interest to a number of scientific
fields with rather different traditions. Figure 2.2 attempts an outline of the
potential interactions between the disciplines, without laying claim to
completeness.
Nowadays it is widely accepted that intonation research should be conducted in
an interdisciplinary manner, but in practice this interaction is still much
impaired by the different concepts of scholars in the humanities (linguistics,
language education, speech pathology) and natural scientists (speech
technology). The author, from personal experience, feels that eventually the
individual scholars have to develop interdisciplinary skills in order to bridge
the gap and bring about scientific progress. The present study is designed as a
small contribution towards this goal.
As an introduction to the diverse character of intonation, a brief overview of
the applications of knowledge of intonation in various fields of research will
be given.
Linguistics: Traditionally, linguists concentrated on the systematic description
of written language. They developed hierarchical models (`grammars') which
divide sentences into words, words into syllables and syllables into phonemes.
Such models are therefore `segment-oriented'. The traditional descriptions of
intonation deny it a function independent of syntax [CH68] and reduce it to a
feature of the speech segments. The latter view still prevails in the
influential auto-segmental theory of intonation by Pierrehumbert [Pie80].
Although the importance of intonation is no longer denied, research activities
on segmental features of speech are still dominant. The leading question of
how linguistic units are expressed by means of intonation is subject to deeply
controversial discussions and scholars agree that intonation research is still
at its very beginning³.
During recent years linguists have come to widely accept the importance of
experimental phonetics and have conducted intensive studies on intonation (for
German, see section 2.5). On the one hand these were aimed at completing the
grammars of individual languages by a (language-specific) description of
intonation. On the other hand, scholars made an attempt to compare the
intonational features of different languages in order to establish a universal
system of intonation [PB88]. In this context, studies on foreign accent are
especially important [dB81].
A relatively new field is the application of intonational knowledge to
conversation analysis [Sel95] and discourse theory [Nak94, NA92, Yan95]. The
goal of these disciplines is the establishment of a theory on how the
organization of complex speech interactions is reflected by prosodic cues.
Speech Processing: The increasing demand for the application of speech in
man-machine communication in all areas ranging from telephony, telematics, and
automated translation to aids for the handicapped requires sophisticated
technology for the analysis, recognition and synthesis of speech. In the field
of speech recognition, the more the task develops from the recognition of single
words in a limited vocabulary towards the understanding of complex utterances,
the more suprasegmental features like intonation have to be taken into account
[Not91]⁴. These are important cues for the segmentation and classification
(question vs. declaration, for instance) of utterances [Str95]. In this context,
modeling intonation is an important task.
In speech synthesis, modeling prosodic features is indispensable for increasing
the intelligibility and naturalness of synthetic speech. Sophisticated models
of intonation are needed which predict F0 contours for a particular sentence at
a given speech rate. Chapter 8 of this study is dedicated to this problem.
Phonetics: Besides the issues connected with the linguistic contents of an
utterance as reflected by prosodic features, the general mechanisms of speech
production and the role of intonation in speech perception are important topics
of phonetic research. These issues will be addressed briefly in sections 2.3.1
and 2.3.2.
³ This was, for instance, the summary of the final discussion at the ATR 1995
International Workshop on Computational Modeling of Prosody for Spontaneous
Speech Processing in Kyoto, April 12-14 1995, in which renowned researchers such
as Hiroya Fujisaki, Kim Silverman, Klaus Kohler, Gösta Bruce and others took
part.
⁴ For languages which feature a lexical tone, the evaluation of intonation even
on the word level is necessary for distinguishing between segmentally similar
items.
Foreign Language Education: It is widely agreed that the acquisition of a
good command of intonational features in a foreign language is one of the most
difficult tasks a student must accomplish. Yet it is crucial for the degree of
intelligibility he or she will achieve [Die89]. In traditional language
education, however, intonation usually comes second to segmental phonetics,
which itself forms only a small part of the curriculum of common language
courses. This deficit has become more apparent as political and economic
globalization requires better communicative skills on the part of the learner of
a foreign language. In this context, individual computer-based language
education will play a growing role. Software is needed which is suited to the
special problems of the speaker of a language L1 who studies a target language
L2. Although a number of programs exist with which the student can train his
lexical, grammatical or orthographic skills, there are few systems which use
speech input to help correct the student's pronunciation [HR+93]. In this
context, visualization of speech can provide additional feedback where the
auditory channel fails because of mother tongue interference.
It seems desirable to develop more intelligent systems which are customized to
the special requirements of students with the same native language. Contrastive
studies of the intonational systems of L1 and L2 can help to predict problems and
select appropriate teaching materials. In this context, studies on foreign accent
are especially important [VEDB87].
Chapter 9 is dedicated to this problem in the case of German for Japanese
students.
Speech Pathology: Hearing impairments, especially if they are congenital or
acquired at an early age, are accompanied by a reduced intelligibility of
speech. Major factors for this are an imperfect command of phonatory effort
[DM+82] and a lack of control of the laryngeal function [ME77], which result in
a high degree of variation in the pitch patterns produced. The speech of hearing
impaired people may sound monotonous or, on the contrary, excessively emotional.
The basic pitch is often kept on a level either too high or too low, and it was
observed that hearing impaired persons have difficulties in changing their pitch
within a single syllable.
Teaching aids have been developed to overcome these problems which provide
feedback on pitch via tactile [PA95] or visual [Wat88] channels.
2.3 Production and Perception of Intonation
2.3.1 The Production Process of Intonation
The source of voiced speech sounds is the oscillation of the vocal cords, a mu-
cous membrane with muscular parts, the vocalis muscle, inside the larynx that
separates the sub-glottal volume (lungs and trachea) from the vocal tract. The
larynx is a cartilaginous structure suspended in a mesh of antagonistic muscles
which connect it with other parts of the skeleton and ensure its functionality
independent of head or neck position.
Figure 2.3: The muscles and cartilages of the larynx. The tension of the vocal
cords is actively influenced by the vocalis muscle itself and passively by the
cricothyroid muscle changing the relative position of the cricoid and thyroid
cartilages. From Helfrich (1985).
The major parts of the larynx (see Figure 2.3) are the thyroid, cricoid and
arytenoid cartilages and the epiglottis. The cartilages are connected by a number
of intrinsic larynx muscles. The vocal cords are suspended horizontally between
the cricoid and thyroid cartilages. The position of the paired arytenoid car-
tilages at the rear end of the glottis determines its degree of closure.
The arytenoid cartilages are controlled by the posterior cricoarytenoid muscle,
which dilates the glottis, and by the lateral cricoarytenoid and arytenoid muscles,
which contract it (Figure 2.4 from [Hob93]).
The frequency of the glottal oscillation (i.e. the laryngeal frequency Fx) is
basically determined by the oscillating mass which is controlled by the degree
of tension of the vocal cords [Hel85, p.31]. The tension of the vocal cords can
be increased either actively by the vocalis muscle or passively by the
cricothyroid muscle which modifies the relative position of the cricoid and thyroid
cartilages and hence causes a change in length of the vocal cords. As shown in
Figure 2.5, the movement of the thyroid cartilage has two degrees of freedom:
1) rotation around the cricothyroid joint and 2) translation of the thyroid cartilage
against the cricoid cartilage. The former is ascribed to the pars recta, the latter to
the pars obliqua of the cricothyroid muscle.
Shape and size of the glottal opening vary considerably. For light respira-
tion and whispered speech the membranous part is closed and only the small
part between the arytenoid cartilages remains open (Figure 2.6c). For stronger
respiration also the membranous part is dilated, producing a rhombic opening
(Figure 2.6a from [K+84]).
Figure 2.4: Muscles attached to the arytenoid cartilages controlling the degree of closure
of the glottis (left: cricoarytenoid muscle (posterior), center: lateral cricoarytenoid
muscle (anticus), right: arytenoid muscle and lateral cricoarytenoid muscle). From
Hobohm (1993).
Before voice onset, the vocal folds are adducted, but they do not have to be
fully closed to initiate phonation. The velocity of the airflow from the subglottal
volume is increased at the place of the glottal obstruction. Due to the so-called
`Bernoulli effect' this results in a negative pressure at the medial edges of the
glottis which are subsequently `sucked' together. As long as the subglottal pres-
sure is sufficient it bursts the glottal closure. This slightly reduces the subglottal
pressure, causing the rim of the glottis to collapse and briefly close (Figure 2.6b).
Subsequently, the glottis opens and closes again, and a quasi-periodic oscillation
starts, an effect very similar to playing a comb.
The frequency of the glottal oscillation is controlled via auditory feedback
and can be kept relatively constant even by untrained subjects (see footnote 5).
2.3.2 The Role of Intonation in Speech Perception
Most people have experienced that in an environment where several speakers
are talking at the same time, it is possible to selectively `tune in' and follow a
particular conversation. This phenomenon is known as the `cocktail party effect'
and alludes to the function of intonation as the `carrier wave' of speech which
works much like the carrier wave of a radio transmitter.
In the process of verbal communication, a major role of intonation is its
prominence-lending function by which salient information is acoustically marked
in the speech signal. It was shown that the degree of prominence of a syllable in
an utterance can basically be increased by manipulating either F0 or the duration
of the respective syllable [FKN94]. This has two consequences:
1. The percepts of F0 movements are easily confused with percepts of seg-
ments with higher energy (accounting for the inaccuracy of auditory tran-
scription of intonation)
2. The speech-organizing function of intonation (in non-tone languages) can
only partially be ascribed to its melodic properties, for instance, for the
discrimination of questions from declarations.
5 Hanson [Han78] reports a standard deviation between 2 and 3% for the fundamental period
of a single tone.
Figure 2.5: Two degrees of freedom in the movement of the cricoid cartilage: Translation
(left) and rotation (right) against the thyroid cartilage caused by activity of the
cricothyroid muscle producing a change in F0. From Zemlin (1968).
Helfrich [Hel85] conducted perception experiments to examine the role
of intonation as an intermediate step in the process of speech perception.
The first experiment (after Abrams & Bever [AB69]) deals with the localiz-
ation of click sounds whose position in a test utterance was varied. The results
show that the click is seldom identified at the point in the utterance where it actu-
ally occurs but the perceived location shifts to varying degrees depending on its
distance from inter-clause boundaries which therefore must present perceptual
cues for segmentation. Since, as Helfrich observes, the shapes of F0 movements
at inter-clause boundaries do not correspond to singularities or transients in the
contour - as similar movements also occur within a clause - the significance of
these shapes can only be explained as a result of evaluation of the partial pitch
contour assigned to the clause. This evaluation requires a `storage' of the pitch
for a duration between 1 and 2 seconds. Helfrich suggests that the storage may
well be connected to the association of F0 movements with lexical elements.
Figure 2.6: The shape of the glottal opening for a) respiration b) voicing and c) whisper.
From Kahle (1984).
In a second experiment (after Treisman [Tre64]) the subject is exposed to two
parallel acoustic stimuli (two read utterances spoken by different speakers and
presented to the left and right ears). The subject is asked to repeat one of the utter-
ances, the other one is declared irrelevant and to be ignored. The delay between
the utterances is varied as well as the text underlying the utterances, which may
be identical for both. After every trial the subject is asked whether the texts
presented seemed identical or not.
The results of the second experiment suggest that the storage of F0 for at least
2 s is possible without the listener identifying the lexical items in the utterance.
This may facilitate the backward directed `correction' of inadequate hypotheses
as to the syntactic and semantic contents of an utterance and thus generally
enhance the intelligibility of speech in the process of speech perception.
Helfrich concludes that F0 patterns serve as auditory units on the level of
the sensory memory. They constitute transitory steps in the perception process
towards higher-level (linguistic) representations.
A different kind of perceptual experiment was conducted by 't Hart [tH84]
and the IPO group and later by d'Alessandro and Mertens [Md95]. Whereas
the former proposed linear `copy contours' which they claimed produced the
same perceptual impression as the original contours, the latter examined which
changes in F0 caused a change in perception and determined thresholds. These
approaches are discussed in section 2.5 on intonation models.
2.4 The Functions of Intonation and the Terminology
Employed
2.4.1 General Functions of Intonation
Speaking of the functions of intonation, one has to take into account that speech
is characterized by the co-occurrence of various features, like F0, intensity, dur-
ation, voice quality etc. and hence intonation is never the only correlate of the
functions we assign to it.
Helfrich [Hel85] distinguishes between those functions of intonation which
modify meaning and those which do not (see Table 2.1). The former could also
be seen as the part of information which is consciously and intentionally provided
by the speaker, the `message', whereas the latter involuntarily accompanies it
(see footnote 6).
The linguistic features concern the way a message is formally coded and
organized into (intonational) units of a certain language. They correspond to
the `surface structure' of the message on a still rather abstract level. The ac-
tual meaning of the message can often not be decoded without interpreting the
underlying paralinguistic information.
6 Speakers, of course, are capable of consciously disguising their dialect or, on the contrary,
emphasizing it to mark their membership of a certain group. Actors are well capable of simu-
lating various emotional conditions. This kind of voluntary control of otherwise non-linguistic
features may rather belong to the paralinguistic features.
Table 2.1: Information conveyed by intonation (after Helfrich, 1985), `*' marking
features discussed in this study.
modifying meaning                                                  not modifying meaning
linguistic (lexical, syntactic, semantic)    paralinguistic        non-linguistic
sentence mode*                               speaker's intention,  age
discourse organization (focus)*              attitude              sex*
segmentation (integration, delimitation)*                          speaker's background
disambiguation                                                     (native language*,
                                                                   dialect, sociolect)
                                                                   emotional condition
The question "Are you tired?", for instance, is simply a request for
information on someone's psychological and physiological condition. If
it is asked with a concerned undertone then the message may be: "Come on,
you've been working so hard, you have to get yourself some sleep!" With an
ironical undertone, it may mean "You lazy guy, you've been sleeping all day and
still you're tired!"
This work is centered around the linguistic features marked with an asterisk
in Table 2.1, as the author is convinced that the formal characteristics of intona-
tion must be described appropriately before an attempt can be made to
examine paralinguistically influenced varieties.
Figure 2.7: Processes by which various kinds of information are coded in the prosodic
features of speech (Fujisaki 1995a). Higher level input information (linguistic, paralin-
guistic, non-linguistic) is transformed into speech sounds by a multi-stage process: 1)
message planning, 2) utterance planning, 3) motor command generation and 4) speech
sound production.
Fujisaki [Fuj92] - independent of Helfrich - developed his systematic model
of the process by which information is coded in prosodic features of speech. He
takes into account that intonation and its acoustic correlate, the F0 contour, are
the result of a complex multi-stage process, which is subject to certain constraints
at each of its steps (see Figure 2.7 from [Fuj95a]). Information from higher level
processes (`Input Information') is coded into abstract units and structures of a
particular language, here called `message planning'. The message planning is
guided by the rules and constraints which could also be called the `grammar' of
the language. In the next stage, an utterance is planned, taking into account
the phrasing, accentuation and pausing principles of the particular language. At
this stage, paralinguistic (see footnote 7) and non-linguistic information first enter
the production process, determining, for instance, the style and segmentation of
the utterance.
The utterance plan leads to the generation of a set of neuro-motor commands
for controlling the speech production mechanism. The information content
of the utterance is converted into acoustic correlates like accents, phrases and
pauses. Both stages, command generation and the speech production mechanism,
are characterized by physiological and anatomical constraints (finite time con-
stants, limited repertory of articulatory movements, limited range of F0 etc.).
Artemov [Art78] once stated that research on intonation is characterized by
the dilemma that depending on the particular language and even the particu-
lar researcher the terminology varies considerably. He therefore suggested that
scholars in the field should first understand the meaning of the terms used by their
fellow-researchers before they criticized them. For this reason, an overview
of terms employed in studies of intonation is given first. Terms which are
employed with different meanings by different scholars will be critically discussed.
In cases where terms simply represent synonyms for the same feature, the syn-
onyms are given and one term will be consistently employed throughout this
study.
2.4.2 Word accent
Early observations of spoken German revealed that every word of German when
uttered in isolation has at least one syllable which is most prominent. This led
to the assumption that this syllable was connected to a higher intensity than the
others and therefore was called a `stressed' syllable bearing the `word accent' or
`lexical accent'.
Empirical studies, however, indicated that word accent syllables generally
feature a co-occurrence (in descending order of importance) of a distinct F0 move-
ment, a duration longer than unaccented syllables and a peak of intensity [Rus91].
The contribution of these acoustic features may vary due to the syllable struc-
ture (short vowels → shorter syllable duration) or the absence of F0 (whispered
speech). Minimal pairs of words (mainly verbs) exist in German which are
segmentally equivalent and only distinguished by the position of their word ac-
cent. Fig. 2.8 shows the example 'umgehen (to handle) vs. um'gehen (to avoid).
The rules by which the accent syllable of a German word can be determined
will not be discussed here. They largely depend on the origin of the word and
there exists no complete formulation for all lexical items. In the case of simple
native words of German, the first syllable of the stem, generally the penultim-
ate of the word, carries the word accent [Koh77]. A number of function words
(prepositions, articles, conjunctions) are regarded as being unaccentable [SZ82,
p.44].
7 Unlike the conventional notions of `paralinguistic information' [Lav95], Fujisaki defines it
as the information which cannot be inferred from the written counterpart of an utterance, and
which is deliberately added by the speaker.
[Figure 2.8, two panels: 'umgehen (left) and um'gehen (right); axes: Fo [Hz], Time [s].]
Figure 2.8: Example of words distinguished by word-accent: "'umgehen" (to handle)
vs. "um'gehen" (to avoid). The speech wave form is displayed at the top of the figure,
and the extracted F0 contour in the center. It can be seen that a syllable bearing
the word accent is characterized by a distinct F0 movement, and longer duration, and
higher intensity than in the unaccented case.
2.4.3 Accent, Stress, Pitch Accent and Stress Accent
In traditional linguistic literature, the terms `accent' and `stress' have often been
employed synonymously to describe the degree of prominence of one syllable
in an utterance against the others [Phe81, p.850]. As already explained, it was
assumed for German that the prominence was caused by a higher intensity of the
accented syllable. For this reason, German is still often classified as a `stress ac-
cent language' as opposed to a `pitch accent language' such as Japanese [HK95].
In contrast, recent empirical studies discussed the occurrence of `pitch ac-
cents' in English, a language which also traditionally belonged to the group of
stress accent languages [Pie80].
Grønnum [Grn90] employs the terms `stressed syllable' and `sentence ac-
cent', with the sentence accent being the stressed syllable which is most prom-
inent in an utterance. She claims that both, inter alia, are connected to changes
in F0.
The author feels that the inconsistency in the use of the terms `accent' and
`stress', at least as far as German is concerned, makes it necessary to reconsider
them with regard to the results of our empirical phonetic studies.
As shown in Figure 2.8, the lexical accent of an isolated word of German is
connected with F0 movement, longer duration and intensity of the accent syllable.
The realization of word accent in disyllabic words of German and Japanese
was compared by Mixdorff and Fujisaki [MF95b] and it was found that F0 pat-
terns of words in both languages may look rather similar. The most important
difference between the languages is that syllable durations in Japanese are little
affected by the location of the pitch accent whereas a shift of the word accent
from the first to the second syllable in German words changes them significantly.
This leads us to the conclusion that in contrast to the so-called pitch accent lan-
guages, F0 is not indispensable for marking the word accent location in German
since duration and intensity shift with it.
So what is the function of F0? In Figures 2.9 and 2.10 two utterances of the
same sentence "Er kann damit umgehen." - "He can handle it." are displayed.
In the first utterance, a simple statement is made: "You know, he can handle
the situation, no problem." In the second, the auxiliary verb `kann' - `can'
is being emphasized and the connotation could be something like: "My God,
why do you underestimate him all the time? I told you, he can do it!" The
F0 pattern in Figure 2.9 on `umgehen' resembles the one for the word in isolation
(Figure 2.8), whereas it is rather flat in Figure 2.10. If the piece of speech
belonging to `umgehen' in the latter utterance is listened to in isolation, it can
still be identified as `'umgehen' against `um'gehen', because the ratio of duration
and intensity of the syllables in the word remains the same. Returning to the
original meaning of `stress' this is exactly what can be found here: The word
accent syllable is prominent because of its higher energy, it is `stressed'.
For this reason the term `accented' will henceforth be reserved for those words
of an utterance which feature a distinct F0 movement on the word accent syl-
lable. If the F0 movement is absent, the word accent syllable is still stressed,
but the word itself is `de-accented'. Hence in this definition for German, the
terms `stress' and `accent' are not relational dichotomies but the former simply
comprises a subset of the prosodic features of the latter:
term      prosodic features
accent    F0, duration, intensity
stress    duration, intensity
This view is shared by Sluijter [Slu95] for English who shows that stress and
accent also differ in their vocal source features.
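The relation between the two terms can be spelled out in a few lines of Python. This is only an illustrative sketch: the names ACCENT_CUES, STRESS_CUES and classify_word are hypothetical, and reducing each prosodic feature to a simple present/absent flag is an assumption made for the example, not a procedure used in this study.

```python
# Cue sets copied from the accent/stress table; all identifiers are hypothetical.
ACCENT_CUES = frozenset({"F0", "duration", "intensity"})
STRESS_CUES = frozenset({"duration", "intensity"})


def classify_word(cues):
    """Label a word by the cues observed on its word accent syllable.

    `cues` is a set of strings. Treating each cue as simply present or
    absent is a simplifying assumption of this sketch.
    """
    cues = frozenset(cues)
    if ACCENT_CUES <= cues:
        return "accented"
    if STRESS_CUES <= cues:
        return "stressed, de-accented"
    return "neither"


# The stress cues form a proper subset of the accent cues.
assert STRESS_CUES < ACCENT_CUES

print(classify_word({"F0", "duration", "intensity"}))  # `umgehen' as in Figure 2.9
print(classify_word({"duration", "intensity"}))        # `umgehen' as in Figure 2.10
```

The subset test mirrors the definition given here: removing the F0 movement turns an accented word into a de-accented one whose word accent syllable is still stressed.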
2.4.4 Accent Group, Prosodic Phrase, Sentence Accent
In utterances containing more than one word, unaccentable words are prosod-
ically linked to accented words. This either occurs proclitically, to a following
accented word (article + noun, for instance) or enclitically, to a preceding accen-
ted word (verb + pronoun, for instance). This produces the smallest meaningful
prosodic units, of which a longer utterance consists. These units are called
`accent-groups' [SZ82, p.55] or `tone groups' [Phe81, p.856]. Generally accent
groups are combined to form larger prosodic `phrases' building a sentence.
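The grouping described above can be sketched as a small Python routine. The rule used here is a deliberate simplification and an assumption of the example: unaccentable words attach proclitically to the next accentable word, and only unaccentable words left over at the end of the utterance attach enclitically to the last group. The function name accent_groups, the input format and the test sentence are hypothetical.

```python
def accent_groups(words):
    """Group (word, accentable) pairs into accent groups.

    Simplifying assumption: unaccentable words lean on the following
    accentable word (proclitic); trailing unaccentable words lean on the
    preceding group (enclitic).
    """
    groups = []
    pending = []  # unaccentable words still waiting for a host
    for text, accentable in words:
        if accentable:
            groups.append(pending + [text])
            pending = []
        else:
            pending.append(text)
    if pending:
        if groups:
            groups[-1].extend(pending)  # enclitic attachment
        else:
            groups.append(pending)      # utterance without accentable word
    return groups


# Hypothetical example: article + noun (proclitic), verb + pronoun (enclitic).
utterance = [("Der", False), ("Vater", True), ("sieht", True), ("ihn", False)]
print(accent_groups(utterance))
```

A realistic grouping would need syntactic information to decide between proclitic and enclitic attachment inside the utterance; the sketch only shows the data flow.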
In an utterance containing several accentable words, the word accent syl-
lables are potential locations of prominence. The word accent syllable which is
most prominent compared with the others in the utterance and/or marks the
communicatively most important part, is called the `sentence accent' or `core
accent' [SZ82, p.48]. In Grønnum's terminology [Grn90] the sentence accent
necessarily is the acoustically most prominent. This definition, however, seems
hardly applicable to German, since generally the last accent of an utterance,
which is often acoustically less prominent, determines its meaning. This will be
illustrated by the following example.
[Figure 2.9: axes Fo [Hz] and Time [s]; label: Er kann damit 'umgehen.]
Figure 2.9: Speech waveform (top) and F0 contour (center) of the utterance "Er kann
damit umgehen." - "He can handle it.", `umgehen' accented, neutral, unemotional state-
ment.
[Figure 2.10: axes Fo [Hz] and Time [s]; label: Er KANN damit 'umgehen.]
Figure 2.10: Speech waveform (top) and F0 contour (center) of the utterance "Er kann
damit umgehen." - "He can handle it.", `umgehen' de-accented, variety emphasizing
the capability of a person by putting the sentence accent on the otherwise de-accented
auxiliary verb `kann'.
In Figures 2.11 and 2.12 two utterances of the sentence "Der Wagen war an
der Wiese." - "The car was on the lawn." are displayed. Whereas in the first
utterance information is given concerning the place where the car was found (last
accent on `Wiese' - `lawn'), in the second utterance emphasis is laid on the fact
that the car was found on the lawn, and not, for instance, the bicycle. In both
cases, however, the word `Wagen' - `car' is the one with the largest F0 movement
and hence the acoustically most prominent.
In neutral, unemotional speech, the position of the sentence accent is generally
determined by the underlying syntactic structure. Some examples will be given,
with the sentence accent syllable printed in bold type.
Peter hat geschrieben. - Peter has written.
Peter hat einen Brief geschrieben. - Peter has written a letter.
Peter hat einen Brief an seinen Vater geschrieben. - Peter has written a letter to his father.
Among others, Stock [SZ82] and Pheby [Phe81] have formulated rules for
determining the sentence accent in German.
2.4.5 Focus
The term `focus' [ABO89, p.267] is closely related to the sentence accent.
It describes the semantic concept by which parts of speech (parts of a sentence,
single words or even syllables) receive prominence against others.
It also applies to the kind of isolated sentences that were given as examples
for the placement of the sentence accent in the preceding section. It is, for
instance, shown by the fact that the sentence accent shifts when the sentence
"Peter hat geschrieben." is expanded by the object "einen Brief", presumably
because of the higher amount of information in the noun.
In continuous speech the focus may be determined by the context preceding
an utterance. It may emphasize new information or contrast it against old in-
formation. This background information is not necessarily given explicitly, but
may be tacit knowledge of the speakers or even common sense. The salient
part-of-speech, which is also called the `focus domain', is marked by placing the
sentence accent on a word belonging to it, the `focus exponent' [Hoh82]. De-
pending on the extent of the focus domain it is called a `narrow' or `broad' focus,
the latter extending over more than a single content word.
One example of narrow focus was already displayed in Figure 2.12: "The car
(not the bicycle) was on the lawn."
Not every change of the focal condition, however, brings about a change of
the location of the sentence accent. This will be illustrated by an example from
Altmann [ABO89, p.277].
[Figure 2.11: axes Fo [Hz] and Time [s]; label: Der Wagen war an der Wiese.]
Figure 2.11: Example for the placement of the sentence accent: Speech waveform (top)
and F0 contour (center) of the utterance "Der Wagen war an der Wiese." - "The car
was on the lawn.". Variety stating the location where a car was found.
[Figure 2.12: axes Fo [Hz] and Time [s]; label: Der WAGEN war an der Wiese.]
Figure 2.12: Example for the placement of the sentence accent: Speech waveform (top)
and F0 contour (center) of the utterance "Der Wagen war an der Wiese." - "The car
was on the lawn.". Variety where emphasis was put on the type of vehicle that was
found on the lawn.
1) A: Gulda spielt Geige.
   B: Nein! Gulda spielt Klavier.
   A: Gulda plays the violin.
   B: No! Gulda plays the piano.
2) A: Was macht eigentlich Gulda?
   B: Gulda spielt Klavier.
   A: By the way, what's Gulda doing?
   B: Gulda's playing the piano.
3) A: Was ist denn hier los?
   B: Gulda spielt Klavier.
   A: What's happening round here?
   B: Gulda's playing the piano.
Altmann and his co-workers found in their study that a double contrast as
shown by the following example is generally not marked intonationally: "Sie läßt
die Nina das Leinen weben" vs. "...den Baumwollstoff färben." - "She lets Nina
weave the linen." vs. "...dye the cotton fabric."
Stock [SZ82] does not employ the term `focus', but distinguishes between
`neutral' and `contrastive' intonation. According to his concept, neutral intona-
tion can be predicted from the syntactic relationship between the constituents of
an utterance.
The examples of contrastive intonation he gives [SZ82, p.59] correspond to the nar-
row focus condition discussed above.
2.4.6 Sentence Mode
The term `sentence mode' [Alt87] denotes syntactic structures (sentences) which
can be assigned certain functions (statement, yes-/no-question etc.). The fea-
tures characterizing the sentence mode belong to four groups: a) specific words
(wh-words in wh-questions, for instance), b) word order (the position of the finite
verb, for instance), c) morphological marking (imperfect subjunctive in optative
sentences) and d) intonational marking. Intonational marking is especially import-
ant in those cases where a distinction by other features is not possible, namely
when two sentences have identical wording (see the example in Figures 2.13 and 2.14:
"Leihen wir den Wagen!" vs. "Leihen wir den Wagen?" - "Let's rent the car!"
vs. "Do we rent the car?"). The example also shows that the F0 contours do
not only differ at the tail of the utterances (falling in the exhortation and rising
in the question), but change their course already at the sentence accent location
on `leihen'.
2.4.7 Declination
The term `declination' denotes the global downward trend which can generally
be observed in F0 contours of declarative utterances. It causes the offset
value of F0 of the contour to generally lie below the onset value (see, for instance,
Figure 2.14).
In longer utterances, generally a reset or readjustment of the declination line
occurs.
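As a rough illustration of the notion of a declination line, the following Python sketch fits a straight line to voiced F0 samples by ordinary least squares. This is an assumption made for the example only: it is not the Fujisaki-model adopted in this work, and the function name fit_declination as well as the sample values are hypothetical.

```python
def fit_declination(times, f0):
    """Fit f0 ~ intercept + slope * t by ordinary least squares.

    times: sample times in seconds (voiced frames only)
    f0:    F0 values in Hz at those times
    Returns (slope in Hz/s, intercept in Hz).
    """
    n = len(times)
    if n < 2 or n != len(f0):
        raise ValueError("need at least two samples of equal length")
    mean_t = sum(times) / n
    mean_f = sum(f0) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("all samples share the same time stamp")
    sxy = sum((t - mean_t) * (f - mean_f) for t, f in zip(times, f0))
    slope = sxy / sxx
    intercept = mean_f - slope * mean_t
    return slope, intercept


# Hypothetical contour falling from 200 Hz to 170 Hz within 1.5 s.
slope, intercept = fit_declination([0.0, 0.5, 1.0, 1.5], [200.0, 190.0, 180.0, 170.0])
print(slope < 0)  # a negative slope corresponds to declination
```

A negative slope means that the offset value lies below the onset value; a declination reset in a longer utterance would require fitting separate lines per phrase, which this sketch does not attempt.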
[Figure 2.13: axes Fo [Hz] and Time [s]; label: Leihen wir den Wagen?]
Figure 2.13: Example for sentence mode `yes-/no-question': Speech waveform (top)
and F0 contour (center) of the utterance "Leihen wir den Wagen?" - "Shall we rent
the car?".
[Figure 2.14: axes Fo [Hz] and Time [s]; label: Leihen wir den Wagen!]
Figure 2.14: Example for sentence mode `exhortation': Speech waveform (top) and
F0 contour (center) of the utterance "Leihen wir den Wagen!" - "Let's rent the car!".
Figure 2.15: The three F0 peak patterns proposed by Kohler [Koh91], `Von' marking
the onset of the nuclear vowel. According to Kohler, `early' peaks denote established
facts, `medial' peaks new information and `late' peaks surprise, for instance.
2.4.8 Location of F0 peaks
Kohler [Koh91] observed that the perceptual impression caused by an utter-
ance, where the position of an F0 peak relative to the sentence accent syllable is
gradually shifted, does not change continuously, but is subject to abrupt changes
at certain points. Hence he attributes a phonological significance, namely a
categorical distinction, to the location of F0 peaks. As one possible interpretation,
he associates `early' peaks with established facts, `medial' peaks with new in-
formation and `late' peaks with a connotation of surprise (see Figure 2.15).
Kohler employed re-synthesized stimuli of one-accent utterances where an F0
peak was successively shifted from the left to the right and conducted two kinds
of experiments:
1. The subjects are exposed to the stimuli in ordered sequence and must
decide when they perceive a sudden change in the melody of the sentence.
→ More than 60% perceived a clear change at one point in the sequence.
2. The subjects are exposed to individual utterances and must relate these to
one of a set of given contexts.
→ Clear difference between early and medial peak utterances, less marked
between medial and late peaks.
The acoustic difference between early and medial peaks is that in the former
the F0 contour falls all across the syllable nucleus, whereas in the latter, part
of the rise occurs within the syllable nucleus. It is, however, generally disputed
whether speakers systematically modify the location of F0 peaks to convey a certain
meaning [Mob93, p.49].
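The acoustic distinction just described can be turned into a toy decision rule. The sketch below is an assumption-laden illustration, not Kohler's procedure: the early/medial boundary follows the description above (peak before or inside the syllable nucleus), whereas the boundary used for `late' peaks (peak at or after the end of the nucleus) is purely hypothetical, as are the function and parameter names.

```python
def peak_category(peak_time, nucleus_onset, nucleus_offset):
    """Assign an F0 peak to one of Kohler's three labels.

    All times are in seconds. The `late' boundary at the nucleus offset
    is a hypothetical choice made for this sketch.
    """
    if nucleus_offset <= nucleus_onset:
        raise ValueError("nucleus offset must follow nucleus onset")
    if peak_time <= nucleus_onset:
        return "early"   # F0 falls all across the nucleus
    if peak_time < nucleus_offset:
        return "medial"  # part of the rise lies inside the nucleus
    return "late"


# Hypothetical nucleus from 0.20 s to 0.35 s.
for t in (0.15, 0.25, 0.40):
    print(t, peak_category(t, 0.20, 0.35))
```

Since the perceptual category boundaries were determined experimentally, fixed time thresholds such as these should be read as placeholders only.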
2.4.9 Conclusions
Hawkins [Haw95] summarizes the systematic problems resulting from the multi-
dimensionality (see footnote 8) of the features influencing the (single-dimensional)
F0 contour:
1. There do not exist definitions for discrete units of intonation
2. Acoustic correlates of linguistic units are typically complex and influence
longer portions of the F0 contour
3. They generally contribute to more than one linguistic unit
4. They are highly variable (addition by the author)
8 linguistic, phonetic, physiological, pragmatic etc.
[Figure 2.16: axes Fo [Hz] and Time [s]; label: Er kann ihn damit UM'GEHEN.]
Figure 2.16: Speech waveform (top) and F0 contour (center) of the utterance "Er kann
ihn damit umgehen." - "Doing so he can cheat him.". An example for the co-occurrence
of linguistic information in the F0 contour. The portion of the F0 contour on the infinite
verb `um'gehen' - inter alia - carries the following information: 1) lexical (contrast
against `'umgehen'), 2) segmentation (marking finality), 3) sentence accent (prominence
by distinct F0 movement), 4) sentence mode (falling contour for statements).
This will be discussed using the example of the utterance "Er kann ihn damit
umgehen." - "Doing so, he can cheat him." (Figure 2.16). The F0 movement on
`um'gehen' - inter alia - has the following functions:
1. Marking of the word accent (distinction against `'umgehen')
2. Delimitation (marking finality)
3. Sentence accent (center of information → focus)
4. Marking of the sentence mode (falling contour → statement)
If one follows Kohler's view, it can be stated that the F0 peak is a medial
one (falling/rising F0 contour on the accent syllable) and possibly signals new
information.
During the last three decades, several descriptions (`models') of German in-
tonation were developed to define the relationship between the linguistic units
underlying an utterance and the F0 contour. The approaches differ considerably
depending on their theoretical background (linguistics, experimental phonetics)
and their application (linguistic research, speech technology, language educa-
tion). Most of the studies cover only some of the functions of intonation discussed
so far.
Table 2.2: Models of German intonation.
Abbreviations
IR: Intonation research      TI: Teaching intonation
SS: Speech synthesis         PL: Prosodic labeling
PE: Perception experiments
author (year)          F0 description            based on                     applications
Isačenko (1964)        tone switches             pitch perception             PE
Bierwisch (1966)       tone levels               syntact. surface structure   SS ?
Pheby (1981)           tone syllables, -groups   syntact. surface structure   TI
Stock (1982)           tone switches, intonemes  syntact. surface structure   TI
Bannert (1983)         minima, offset            phonological description     SS ?
Féry, Uhmann (1988)    H, L, *, %                auto-segmental theory        PL, IR
Altmann (1989)         on-, offsets, mins, max   sentence model               IR
Adriaens (1991)        copy-contours             pitch perception             PE
Kohler (1977, 1991)    peaks, valleys            experimental, linguistic     IR, SS
Möbius (1994)          Fujisaki-model            fitting F0 contours          SS
Table 2.3: Models of intonation developed for languages other than German.
author (year)          F0 description            based on                     applications
Pierrehumbert (1983)   H, L, *, %                auto-segmental theory        PL, IR
't Hart (1984)         copy-contours             pitch perception             PE
d'Alessandro (1995)    copy-contours             pitch perception             PE
Fujisaki (1984)        Fujisaki-model            fitting F0 contours          SS, PE
Hirst [Hir92] summarizes the requirements which an appropriate model of intonation
should meet:
1. simplicity and ease of calculation
2. closeness of approximation to observed F0 data
3. ease of estimation of parameters from F0 data
4. compatibility with data on speech production (electromyography)
and concludes: "Ultimately, the main criterion for choosing a particular
model must be its compatibility with an overall linguistic description."
Some of the most important models of German intonation will be discussed
in the following section.
Chapter 3 is dedicated to the Fujisaki-model which was adopted for this work.
2.5 Models of Intonation
Figure 2.17 illustrates the role of intonation models as a link between linguistic
structures and their acoustic manifestation, the F0 contour. In principle, two
general methods can be distinguished. One type of model (left to right in the
figure) deduces a phonological description from a linguistic structure (typically,
the syntactic surface structure), specifying accent levels and phrasal boundaries,
and transforms these into an abstract description of the F0 contour (footnote 9),
which is then converted into the actual F0 contour by application of phonetic
rules. This type of model generates F0 contours from higher-level linguistic
information, and hence the method can be called a "generative approach". The
opposite approach (right to left) aims at abstracting from the observable
F0 contour by means of approximation techniques, yielding phonologically relevant
basic elements. These are then used to infer linguistic units and structures.
Since this approach is based on the analysis of the observed F0 contour, whether
mathematical, graphical or auditory, it will be called an "analytical approach".
[Figure 2.17: diagram. Stages from left to right: linguistic information
(syntactic tree: S, NP, VP; Art N V Prp N), phonological representation (accents,
phrase boundaries), abstracted F0 contour (sequence of tone switches
#<A><B><C><D>#, tone sequence H* L* HL* L%, linear stylization), F0 contour
(F0 over time). The generative approach runs left to right, the analytical
approach right to left.]
Figure 2.17: The role of intonation models as links between linguistic structures and
their acoustic manifestations in the F0 contours of utterances. The generative approach
aims at producing F0 contours from higher level information, whereas the analytical
approach infers higher level information from the observed F0 contour by means of
some kind of abstraction.
Most studies of German intonation combine generative and analytical elements.
All approaches, though to varying degrees, refer to the analysis of
observed F0 data extracted from speech data generally elicited with regard to
some limited linguistic or phonetic problem. Elicitation tasks were, for instance,
focal placement or the realization of various sentence modes, as in Altmann et
al. [Alt84].
Not all of the resulting models, however, cover the complete sequence of in-
termediate steps between linguistic structure and the F0 contour, and some of
the models describe the connection between the two only either in left-to-right
(starting from the linguistic description) or right-to-left manner (starting from
the F0 contour).
Table 2.2 gives an overview of models of intonation developed for German with
a brief classification of their properties, theoretical background and applications.
Some of these models originate from approaches for other languages,
listed in Table 2.3. In the following sections the various approaches will be
described in further detail, with special emphasis on the works by Isacenko and
Stock, which inspired the present study.
9 In Figure 2.17, the three varieties of abstract representations of the F0 contour, `sequence of
tone switches', `tone sequence' and `linear stylization', are referred to only as examples. There
exist, of course, other (possibly better) representations.
When critically discussing earlier models, it should be kept in mind that the
object they describe is the same: features of German intonation. Hence, while pointing
out formal differences between the models, an attempt will be made to find the
essentials (in terms of `knowledge of German intonation') they have in common.
2.5.1 Isacenko & Schädlich, 1964
"No theory (of intonation) can be based on events which are never repeated."
The early work by Isacenko was one of the first consistent attempts to separate
the syntactic functions of intonation from attitudinal or emotional ones
(see, for instance, [IS64, p.31]), which up to then had often been confused. It
is based on perception experiments using synthesized stimuli with extremely
simplified F0 contours. These were designed to verify the hypothesis that the
syntactic functions of German intonation can be modeled using tone switches
between two constant F0 values connected to accented (`ictic') syllables and
`pitch-interrupters' at syntactic boundaries.
The stimuli were created by monotonizing natural utterances at two constant
frequencies and splicing the corresponding tapes at the locations of the tone
switches:
178.6 Hz      Vorbereitungen sind ge-          alles ist be-
150 Hz    die                        troffen                 reit
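The relation between the two levels, and the step-like shape of such a stimulus, can be made concrete with a small numerical sketch. It is illustrative only: the segment durations (a hypothetical 0.2 s each), the frame length and the helper names are assumptions and are not taken from [IS64].

```python
import math

HIGH_HZ = 178.6  # upper monotone level of the stimuli
LOW_HZ = 150.0   # lower monotone level of the stimuli


def interval_in_semitones(f_high, f_low):
    """Interval between two frequencies on the semitone scale."""
    return 12.0 * math.log2(f_high / f_low)


def two_level_contour(segments, frame_s=0.01):
    """Return one F0 value per frame for (text, level, duration_s) segments.

    level is 'H' or 'L'. Every frame inside a segment receives the constant
    frequency of that level, so F0 changes only at segment boundaries,
    i.e. at the tone switches.
    """
    contour = []
    for _text, level, duration_s in segments:
        f0 = HIGH_HZ if level == "H" else LOW_HZ
        contour.extend([f0] * round(duration_s / frame_s))
    return contour


# Segmentation following the example above; all durations are hypothetical.
segments = [
    ("die", "L", 0.2),
    ("Vorbereitungen sind ge-", "H", 0.2),
    ("troffen", "L", 0.2),
    ("alles ist be-", "H", 0.2),
    ("reit", "L", 0.2),
]

contour = two_level_contour(segments)
switches = sum(1 for a, b in zip(contour, contour[1:]) if a != b)
print(f"interval: {interval_in_semitones(HIGH_HZ, LOW_HZ):.2f} semitones")
print(f"tone switches: {switches}")
```

Under these assumptions the two levels lie about three semitones apart, and the example utterance contains four tone switches.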
Four types of tone switches are distinguished: 1) a fall before a low accent
syllable, 2) a fall after a high accent syllable, 3) a rise before a high accent
syllable and 4) a rise after a low accent syllable.
Figure 2.18 shows examples where syntactic functions are modelled by using
tone switches. The experiments showed a high consistency in the perceptual
judgment in a large number of subjects (N = 50, ratings 64-82 %).
2.5.2 Bierwisch, 1966
Bierwisch's description of German intonation [Bie66] stands in the tradition of
generative grammar theory. Starting from the syntactic tree structure of a sentence
(Figure 2.19 A), transformation rules are applied which yield accent levels
and `intonation units' (footnote 10), i.e. intonational phrases. Initially, the levels
of boundaries to the left and right of a constituent in a sentence are determined by
the depth of the node in the tree structure the constituent is connected to.
Intonation units are formed by successive, rule-guided deletion of these boundaries.
This cyclical rule, which may eventually remove all boundaries in an utterance, is
parameterized with a factor p, an abstract unit reflecting external factors like
speech rate and speaking style.
10 German: "Intonationseinheiten".
A) Sentence mode:
   die Kinder vertrau en den Eltern (question)          "The children trust the parents."
   die Kinder ver trauen den Eltern (unfinished)
B) Focus:
   die Kinder glauben dem Lehrer (broad)                "The children believe the teacher."
   die Kinder glauben dem Leh rer (narrow)
C) Disambiguation:
   ich weiß, daß der Mann im Auto schläft               "I know that the man in the car is asleep."
   ich weiß, daß der Mann im Auto schläft               "I know that the man is sleeping in the car."
D) Phrasing:
   Peter arbeitet // ver dient aber wenig               "Peter works, but earns little."
   die Zeit schriften // bringen Ar tikel und A nnoncen "The journals carry articles and advertisements."
Figure 2.18: Examples for syntactic functions modelled by tone switches from [IS64].
Underlining denotes low level of F0, overlining high level of F0. `//' indicates pitch
interrupters at phrase boundaries.
Taking into account accent levels, phrase boundaries and `syntactic intonation
markers' (SIM) (footnote 11), re-writing rules produce a phonetic transcription of the
F0 contour which is described by the relative tone levels of the syllables in the
utterance and the direction of the pitch movements at accented syllables
(Figure 2.19 B).
The system of intonational rules developed by Bierwisch shows a large amount
of experience and observational skill and, at least for the examples he presents,
yields plausible results. Bierwisch, however, does not give the phonetic evidence
for their relevance and leaves unanswered the question of which acoustic correlates
in the F0 contour correspond to the boundaries of the intonation units he
proposes.
Figure 2.20 shows the intonation model developed by Bierwisch.
11 SIMs include, for instance, markers for question-final rise (Q) and statement-final fall to
a low sentence accent syllable (D) which are derived from the deep structure of a sentence.
Bierwisch, however, gives only tentative explanations as to the taxonomy of the SIMs.
A) Syntactic tree:     S -> NP VP
   word classes:       Art Adj N V PN V
   sentence:           Das ganze Unternehmen ist nutzlos gewesen
B) accent levels:      2 3 1 3
   phrasal boundaries: 0# Das ganze 2# Unternehmen 1# ist nutzlos gewesen D 0#
   pitch movements:    rise rise fall
   tone levels:        1 3 3 1 1 3 3 3 1 1 1 1 1
Figure 2.19: Example of a phonetic description B) generated from a syntactic tree
structure A). Tone levels produced for the sentence "Das ganze Unternehmen ist nutzlos
gewesen." - "The whole venture has been futile." `#' indicates the boundaries of
`intonation units'. After Bierwisch (1966).
2.5.3 Pheby, 1981
Pheby's description of German intonation [Phe81] was incorporated into a recent
comprehensive work on German grammar by Flämig [Flam91]. It follows
Kohler's separation of utterances into tone groups (corresponding to Kohler's
`intonational units'), tacts, syllables and phonemes. In contrast to Kohler, every
tone group is characterized by one of three distinctive `tone patterns' on its most
prominent syllable, the `tone syllable':
1) falling, 2) rising, 3) sustained. Each of these three patterns has a strong and
a weak variety. According to Pheby, in the so-called "unmarked case", every tone
group corresponds to one syntactic clause. In the "marked case", a tone group
consists of more than one syntactic clause.
Tone groups may be separated by `linguistic pauses'. In the following example
tone syllables are printed in bold face, and boundaries of tone groups denoted
by `//':
// Wir wollen mit dem Zug fahren, // der weniger voll ist. //
"We want to take the train, it is less crowded."
Pheby calls the correspondence between tone groups and syntactic clauses
`congruence'. If it does not apply (the marked case), a modification in the meaning
of the utterance occurs:
// Wir wollen mit dem Zug fahren, der weniger voll ist //
"We want to take the less crowded train."
[Figure 2.20: block diagram. Blocks: deep structure; hierarchy of constituents in
surface structure; SIM; speech rate (p); accent; phrasing; intonation; outputs:
fundamental frequency, intensity, pauses, duration etc. Rule labels: (A) Accent
Rules, (I) Intonation Rules, (K) Convention for Phrasal Boundaries, (P) Phrasing
Rules, (T) Syntactical Transformation Rules.]
Figure 2.20: Top-down model as proposed by Bierwisch (1966). The deep structure
of a message is transformed into the surface structure of an utterance and `syntactic
intonation markers' (SIM). By application of accent and intonation rules these are
converted into the observable acoustic correlates of an utterance (F0, intensity, duration
and pausing).
Pheby postulates a relationship of `gravity' between the constituents of an utterance
which determines the location of the sentence accent. He points out that
in the unmarked case the sentence accent is generally placed on the constituent
word that is closest to the tail of the utterance. Furthermore, the gravity of
component parts of a sentence is not only determined by their position, but also
by their syntactic properties. Gravity and position in the utterance often
coincide (A < B denoting `A weaker than B'):
subject - indirect object - direct object (order)
subject < indirect object < direct object (relationship of gravity)
If the stronger component is missing or replaced by a pronoun the sentence accent
shifts:
"Das Kind hat einem Mann eine Zeitung gegeben."
 subject      indir. object  dir. object
"The child has given a newspaper to a man."
"Das Kind hat sie einem Mann gegeben."
 subject  (dir. object)  indir. object
"The child has given it to a man."
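The accent shift in these two examples can be sketched as a simple selection rule. The following is a deliberately reduced illustration, not Pheby's formalism: the role labels, the `is_pronoun` flag and the function name are assumptions, and the verb as well as positional effects are ignored.

```python
# Relationship of gravity as given in the text:
# subject < indirect object < direct object.
GRAVITY = {"subject": 0, "indirect object": 1, "direct object": 2}


def sentence_accent(constituents):
    """Pick the word carrying the sentence accent in the unmarked case.

    constituents: list of (role, word, is_pronoun) tuples.
    Pronominalized constituents are skipped, so the accent shifts to the
    next weaker constituent when the strongest one is replaced by a
    pronoun or is missing altogether.
    """
    candidates = [c for c in constituents if not c[2]]
    if not candidates:
        return None
    return max(candidates, key=lambda c: GRAVITY[c[0]])[1]


full = [
    ("subject", "Kind", False),
    ("indirect object", "Mann", False),
    ("direct object", "Zeitung", False),
]
pronominalized = [
    ("subject", "Kind", False),
    ("direct object", "sie", True),
    ("indirect object", "Mann", False),
]

print(sentence_accent(full))
print(sentence_accent(pronominalized))
```

With the full sentence the direct object is selected; once it is pronominalized, the selection falls on the indirect object, mirroring the shift described above.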
Selting [Sel93] criticizes Pheby's concept of tone groups as being too
grammar-oriented and offering little that is applicable to empirical analysis. The
`linguistic pauses' he proposes as prosodic cues to the boundaries of intonational
units are rarely found in natural speech data.
2.5.4 Stock & Zacharias, 1982
The tutorial on German sentence intonation by Stock and Zacharias [SZ82] was
originally designed as a textbook for teaching intonation to foreigners. It is based
on empirical studies by Stock [Sto80] and further develops the concept of tone
switches introduced by Isacenko. Stock and Zacharias formulate phonologically
distinctive elements of intonation which are called `intonemes', in the tradition of
the great Soviet intonation researcher Artemov [Art65]. Intonemes are characterized
by the occurrence of a tone switch at an accented syllable. Depending on
their communicative function, Stock distinguishes between the following kinds of
intonemes:
1. Information intoneme I↓: intoneme with falling tone switch (utterance-final),
signals the completeness of an utterance. Speaker's main intention:
conveying a message.
2. Contact intoneme I↑: intoneme with rising intonation (utterance-final),
marks questions where this cannot be concluded from the sentence
structure. Speaker's main intention: establishing contact.
3. Non-terminal intoneme N: intoneme with rising tone switch to a medium
level which is sustained after the accent syllable. It is usually found
with non-final accents and signals the incompleteness of an utterance. Its
use largely depends on speech rate and emotional condition.
Here are some examples:
(I↓) "Die Delegation fuhr nach Bu chenwald." "The delegation went to Buchenwald."
(I↑) "Wollt Ihr Pil ze suchen?" "Do you want to look for mushrooms?"
(N) "Sie gehen am Wald entlang, weil sie Pilze suchen wollen." "They walk along the forest,
because they want to look for mushrooms."
For better explanation, Stock distinguishes between several varieties of the
information and contact intonemes depending on the number of pre- and
post-accent syllables. These will not be discussed in detail here.
Stock's work was extremely informative for the author because of its concise
formulations of accentuation rules (word accent, phrase and sentence accents
(footnote 12)) and rules for the prosodic segmentation of sentences into `accent
groups'. Since those rules which describe sentence intonation are referred to later
on in the study, they are documented here.
12 The latter are called `Kernakzent' and `Hauptkernakzent' (`core accent' and `main core
accent') by Stock.
Rule A1: Unaccentable Words. Articles, prepositions, conjunctions, auxiliary
and modifying verbs, interrogative and relative adverbs, relative, personal,
interrogative and reflexive pronouns are generally unaccentable.
Rule A2: Core Accent in Attributive Groups (AG). Nouns, adjectives or
pronouns which are modified by an attribute form `attributive groups'.
The communicatively most important accent (the core accent) is placed on
the last accentable item in the AG. (Examples: "ein gefeierter Künstler" -
"a celebrated artist", "ein Korb Kirschen" - "a basket of cherries", "ein
Meister seines Faches" - "a master in his field", "im Monat Juli" - "in
the month of July").
Rule A3: Core Accent in Verb-Complement Groups (VCG). `Verb-Complement
Groups' consist of a verb which is modified by other constituents
(typically objects or adverbial qualifications), the complements. The
core accent is generally placed on the last accentable item of the last or only
complement in the VCG (Examples: "das Papier auf einen Tisch legen" -
"to put the paper on a table", "wir laufen schnell zum Bahnhof" - "we run
fast to the station"). Exceptions: VCGs whose complement is determined
(identifiable, for instance, by the use of the definite article), and VCGs
consisting of an adverbial qualification followed by a past participle (Examples:
"wir werden das genannte Buch referieren" - "we will give a report on
the book mentioned", "er ist morgens gekommen" - "he has arrived in the
morning").
Rule A4: Main Core Accent in Complex Utterances. In utterances containing
several AGs or VCGs the main core accent (= the sentence accent)
is placed on the last item bearing a core accent according to rules A2
and A3 (Example, core accent syllables in bold type, main core accent
underlined, VCGs delimited by { } and AGs by [ ]: "[Der Vorsitzende des
Ministerrats] {überreichte [den berufenen Professoren] [die Urkunden mit
ihrer Ernennung]}" - "The chairman of the Council of Ministers presented
the certificates of appointment to the appointed professors.").
Rule G1: Formation of Accent Groups. Accent groups are the smallest meaningful
prosodic units of an utterance. They are characterized by an accented
item to which unaccented items are prosodically linked. They are
not interrupted by major syntactic boundaries. The linkage occurs either
to a following accented item (proclisis) or to a preceding one (enclisis)
(Examples, accent groups delimited by [ ]: "[Wir untersuchten es][mit
der Lupe]." - "[We examined it][with the magnifying glass.]", "[Er half
ihm][beim Schreiben]." - "[He helped him][to write].").
Rule G2: Merging of Accent Groups. With increasing speaking rate, accent
groups tend to merge. This occurs especially where one part of speech is
subordinate to another. The boundary between theme and rheme in an
utterance is generally preserved.
Rule AK: Context-Dependent Shifting of the Core Accent. Depending on
the context of an utterance, the core accent may shift to any component
part of the sentence. This occurs, for instance, in cases where a contrast
is expressed or special emphasis is laid.
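Rules A1, A2/A3 and A4 lend themselves to a compact procedural sketch. The version below is a simplification under stated assumptions: the word-class tags (`ART`, `N`, `POSS` etc.) and the pre-grouped input are hypothetical, the exceptions to Rule A3 and the context-dependent Rule AK are not modelled, and the function names are illustrative.

```python
# Word classes treated as unaccentable, loosely following Rule A1.
# The tag names themselves are hypothetical.
UNACCENTABLE = {"ART", "PREP", "CONJ", "AUX", "PRON"}


def core_accent(group):
    """Rules A2/A3: last accentable item of a group of (word, tag) pairs."""
    accentable = [word for word, tag in group if tag not in UNACCENTABLE]
    return accentable[-1] if accentable else None


def main_core_accent(groups):
    """Rule A4: the last core accent in the utterance is the sentence accent."""
    accents = [a for a in map(core_accent, groups) if a is not None]
    return accents[-1] if accents else None


# The three attributive groups of the example given under Rule A4.
groups = [
    [("Der", "ART"), ("Vorsitzende", "N"), ("des", "ART"), ("Ministerrats", "N")],
    [("den", "ART"), ("berufenen", "ADJ"), ("Professoren", "N")],
    [("die", "ART"), ("Urkunden", "N"), ("mit", "PREP"),
     ("ihrer", "POSS"), ("Ernennung", "N")],
]

print([core_accent(g) for g in groups])
print(main_core_accent(groups))
```

For this input the core accents fall on `Ministerrats`, `Professoren` and `Ernennung`, and the last of these carries the main core accent.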
2.5.5 Altmann, Batliner and Oppenrieder, 1989
The ultimate objective of Altmann and his co-workers [ABO89] was the
definition of prototypes of German intonation. They addressed this problem
by experiments examining the realization of sentence mode and focus in natural
speech data. Two kinds of experiments were conducted:
1. Phonetic analysis of a great number of context-elicited natural utterances
(F0, duration, intensity)
2. Perception experiments with selected stimuli from 1): a) accent test (marking
of prominent syllables), b) classification test (assignment of functional
type, like statement, question etc.) and c) test of naturalness (checking for
acceptability of a stimulus within a given context).
The speech material consisted of segmentally identical sentences which can
only be distinguished by intonation, corresponding to contrastive `minimal pairs'
defined by Altmann's `sentence model' [Alt84] (see section 2.4.6). For lack of
an appropriate model of German intonation [ABO89, p.6,7], the experimenters
confined themselves to the extraction of onset and offset values and maxima and
minima of F0 in the phrases. In addition they estimated the duration and intensity
of syntactic phrases. The authors dismissed the thought of applying the
tone sequence approach by Pierrehumbert [Pie80] to the analysis of their data,
because of its lack of objectivity.
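The four F0 features named above are straightforward to compute once a phrase's F0 track is available. The sketch below is an assumption-laden illustration rather than the procedure of [ABO89]: it takes the F0 samples of the voiced frames as a plain list and expresses the values in semitones relative to a speaker-specific base frequency, as in the diagrams of Figure 2.21.

```python
import math


def semitones(f_hz, base_hz):
    """Distance of f_hz above base_hz on the semitone scale."""
    return 12.0 * math.log2(f_hz / base_hz)


def phrase_features(f0_hz, base_hz):
    """Onset, offset, minimum and maximum of one phrase's F0 track.

    f0_hz:   F0 samples (Hz) of the voiced frames of the phrase, in time order.
    base_hz: speaker-specific reference, e.g. the lowest F0 found in the data.
    All returned values are in semitones relative to base_hz.
    """
    st = [semitones(f, base_hz) for f in f0_hz]
    return {"onset": st[0], "offset": st[-1], "min": min(st), "max": max(st)}


# Hypothetical F0 samples of a single phrase.
features = phrase_features([100.0, 200.0, 150.0, 100.0], base_hz=100.0)
print(features)
```

Reducing a phrase to these four numbers discards the shape and timing of the movement between them, which is the limitation taken up at the end of this section.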
By averaging over the extracted features the authors constructed diagrams
for the prototypal patterns they proposed. Figure 2.21 shows some examples of
these diagrams. The underlying utterances were produced from sentences like
"Sie läßt die Nina das Leinen weben." - "She's having Nina weave the linen."
with the possible focal conditions:
1. narrow focus on the object `Leinen' ("FOKUS 2" in the figure)
2. narrow focus on the infinite verb `weben' ("FOKUS 3" in the figure)
3. broad focus on object plus infinite verb `Leinen weben'
4. narrow focus on object `Leinen' and narrow focus on the infinite verb
`weben'
The diagrams in Figure 2.21 show the conditions 1) narrow focus on `Leinen',
declarations (top left), 2) narrow focus on `Leinen', questions (top right),
3) narrow focus on `weben', declarations (bottom left) and 4) narrow focus on
`weben', questions (bottom right) (footnote 13).
13 A sample context for case 1): In einem Textilbetrieb; eine Mutter erkundigt sich bei
einer Angestellten nach den handwerklichen Fortschritten ihrer Tochter. Mutter: "Was läßt
die Meisterin meine Nina denn gerade weben?" Die Angestellte: "Sie läßt die Nina das
Leinen weben." - In a factory for textile goods, a mother asks an employee about the training
progress of her daughter. Mother: "What's the forewoman having Nina weave right now?"
The employee: "She's having Nina weave the linen."
Figure 2.21: Diagrams describing prototypal F0 patterns produced by Altmann et
al. (1989). The y-axis gives F0 in semitones relative to the base frequency of a speaker
(the lowest frequency produced in the data). The black squares indicate averaged
minimum and maximum values of F0 for the second and third phrase of a three-phrase
utterance. Duration is given in centiseconds. The average duration of the second and
third phrases is indicated by the horizontal lines limited by white dots. The diagrams
display the following conditions: top left: declaration, focus on phrase 2, top right:
question, focus on phrase 2, bottom left: declaration, focus on phrase 3, bottom right:
question, focus on phrase 3.
The black squares denote minimum and maximum values of F0 in the words
`Leinen' and `weben'. `n' denotes the number of speakers over which the F0 values
are averaged. The values are normalized to a semitone scale relative to the
speaker-individual minimum F0. The horizontal lines delimited by dots denote
the average durations of the items `Leinen' and `weben'.
A summary of the main results of this study:
1. There exist main and secondary patterns for the same syntactic structure
(`Kerntypen' and `Randtypen').
2. Sentences with different contexts (narrow vs. broad focus) are not necessarily
contrasted by intonation.
3. Questions are marked by high offset values of F0, declarations by low offset
values.
4. The perception of focal prominence may be influenced by two interacting
psychological factors: situational expectations (guided by the particular
context of an utterance) and habitual expectations (`the most important
thing comes last') (footnote 14).
It remains unclear whether the prototypal patterns are any more than an
impressionistic description of intonation. Quantifying extreme values of the
F0 contour does not reflect, for instance, the changes in the contour and its
relationship with the segments of an utterance, and therefore cannot be used for
synthesizing F0. The author believes, however, that Altmann's empirical approach
based on intonational contrasts is quite attractive, as it concentrates on cases where
the intonational `load' is maximal; hence its adoption for the first production
experiment of this work (Chapter 5).
2.5.6 Féry, Uhmann (based on Pierrehumbert, 1980)
[Figure 2.22: prosodic tree. Levels from top to bottom: utterance (υ), intonation
phrase (π), tact (τ), words (ω), syllable (σ, accented syllables marked σ*),
tone tier (H* T* H% H* T* T%), phoneme tier ("Die Fjorde in Norwegen sind
unbeschreiblich schön.").]
Figure 2.22: Hierarchical levels of the tone sequence model for German as proposed by
Féry (1988). The utterance "Die Fjorde in Norwegen sind unbeschreiblich schön." -
"The fjords in Norway are indescribably beautiful." is divided into intonation phrases
which consist of two tacts each. The first tact contains all syllables before the
nuclear accent of the intonation phrase and the second one the remaining ones. In the
tone tier accented syllables and syllables at prosodic boundaries are assigned high (H)
or low (L) tones, producing a somewhat impressionistic description of the F0 contour.
Pierrehumbert's approach [Pie80] describes F0 contours as a sequence of high
(H) and low (L) tones. These are associated with accented syllables and pros-
odic boundaries. The actual F0 range between `high' and `low' tones is only
locally dened and subject to rule-determined reductions, called `downstep' or
`catathesis'. Following autosegmental phonology, which Pierrehumbert applies to
intonation, an utterance is built of a number of parallel hierarchical layers, `tiers',
which consist of sequences of phonological segments. The `tone tier' corresponds
to the phonological elements which make up the F0 contour.
14 Batliner calls the former factor `Einstellungseffekt' and the latter `Deklinationseffekt'.
Figure 2.23: Example of an utterance labelled according to the tone sequence
approach for German by Féry (1988). The figure shows the F0 contour of the utterance
"Die Fjorde in Norwegen sind unbeschreiblich schön." - "The fjords in Norway are
indescribably beautiful." At the bottom accent levels are indicated by the number of `x'
symbols printed over the text of the sentence. `‖' symbols indicate the boundaries of
intonation phrases, `|' the boundaries of tacts.
Figure 2.22 shows a prosodic tree structure constructed for an example sentence
in Féry 1988 [Fer88], who applied the tone sequence approach to German.
Figure 2.23 shows the respective F0 contour and tones labelled by Féry. The
largest intonational unit of German proposed by Féry is the `intonation phrase',
which, as the author explains, corresponds to Pheby's `tone group'. It is
characterized by its most prominent syllable, the nuclear syllable. By default, the
intonation phrase is made of two tacts, the first comprising all syllables before
the nuclear syllable and the second the remaining ones. Féry does not specify the
criteria for determining the boundaries of intonation phrases; she only assumes
that equally prominent accents in an utterance may constitute separate intonation
phrases. The examples of two-phrase utterances in her data (as in Figure 2.23)
are split at the boundary between subject phrase and predicate phrase. The tonal
inventory proposed for German is specified in Table 2.4 after Uhmann [Uhm88].
`*' denotes a tone linked to a prominent accent syllable, `%' denotes a tone linked
to a syllable left of a prosodic boundary. Compared with English [BP86, p.256],
the pitch accent shapes H + L* and L + H* are missing.

Table 2.4: Tonal elements of the tone sequence approach as proposed by Uhmann
1988. T denotes a low tone (written L in the running text).

Function type: accent tone (phonological correlate of focal features)
  H* + T   falling F0 contour throughout accented syllable,
           affecting following unaccented syllables
  T* + H   rising F0 contour throughout accented syllable,
           affecting following unaccented syllables
  H*       F0 peak in the nucleus of accented syllable,
           no influence on following syllables
  T*       F0 valley in the nucleus of accented syllable,
           no influence on following syllables
Function type: boundary tone (phonological correlate of phrasing rules)
  T%       speaker-dependent F0 minimum (baseline) at phrase boundaries
  H%       F0 values above T% and speaker-dependent `medium' level
The selection of tonal patterns for the description of a particular F0 contour
seems to be based on intuitive choice rather than on objective criteria. In the
utterance "Die Fjorde in Norwegen sind unbeschreiblich schön." - "The fjords
in Norway are indescribably beautiful." (Figure 2.23), Féry assigns the accent
syllable `Nor-' the tone L*, although it is followed by a rise of the F0 contour
and therefore, according to Table 2.4, should rather be assigned L* + H.
At least for English the attempt has been made to generate well-formed
F0 contours from tone sequences (footnote 15).
2.5.7 Adriaens, 1991, IPO
Figure 2.24: Example of a stylized F0 contour from 't Hart (1984).
Adriaens' work on German intonation [Adr91] (cited after Möbius 1993 [Mob93,
pp.49-51]) is based on the studies by Cohen and 't Hart [tH84] for Dutch and
English at IPO Eindhoven. The basic principle of the approach is the approximation
of observed F0 contours by piecewise linear stylization, producing a `copy
contour' (see Figure 2.24 for an example). This is justified by the observation
that not every minor change in the F0 contour is perceptually relevant. The
resulting simplified contour is said to be `perceptually equivalent' to the original
one, although the notion of `equivalence' is not made clear. 't Hart reports that
the stylized contour is not always perceptually identical to the original contour,
but `fully acceptable' as representative of `normal intonation' [tH84]. Since the
IPO model does not consider the linguistic units underlying the utterance, the
question arises how the term `fully acceptable' should be interpreted, since an
utterance and hence the corresponding F0 contour can only be judged within a
communicative context.
15 Although this results in a poor degree of naturalness (N. Higuchi, ATR, personal
communication).
As a result of a heuristic investigation, Adriaens postulates a set of 12 standardized
F0 movements (seven rising and five falling) which differ as to the F0 interval
spanned, their position in the accented syllable and their duration. Seven
of these basic patterns are connected to accent syllables and have prominence-lending
functions, while the remaining five are only perceptually relevant. The
patterns are superimposed over four different standardized declination lines.
Adriaens does not investigate the functional aspects of intonation.
It is obvious that the copy contours for a particular utterance largely depend
on the phonetic judgment of the experimenter and hence lack objectivity.
2.5.8 d'Alessandro & Mertens, 1995
D'Alessandro and Mertens [Md95] developed a stylization approach for F0 contours
which takes into account the perceptual impression caused by a particular
contour (`pitch'). Based on results from glissando research (the perception
of pure tones of changing frequency) they state that a movement in the
F0 contour is only perceptible if it exceeds a certain threshold, the `glissando
threshold' (henceforth G). Also, a change in the slope of the F0 movement is
perceptible only when it exceeds the `differential glissando threshold' (henceforth
DG). F0 movements below these thresholds are perceived as static tones, which
corresponds to a short-term integration of F0. An utterance is segmented into
syllable nuclei, and using an integration algorithm applying G and DG (see
Figure 2.25) a linear stylization is produced for every syllable. The stylized contour
is used to re-synthesize the utterance. The synthesized speech is then compared with
the original. D'Alessandro and Mertens found thresholds of G = 0.32/T^2 and DG
= 20 for continuous speech. They did not investigate the linguistic content
of the utterances used.
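The role of the glissando threshold can be illustrated with a short calculation. The sketch below makes several assumptions that are not stated in the text above: T is taken to be the duration of the movement in seconds, pitch change is measured in semitones, and a movement is classified by comparing its mean rate of change with G. The differential threshold DG and the integration algorithm of [Md95] are not modelled, and the function names are illustrative.

```python
def glissando_threshold(duration_s, k=0.32):
    """Glissando threshold G = k / T^2 for a movement of duration T."""
    return k / duration_s ** 2


def is_perceived_as_glide(start_st, end_st, duration_s):
    """True if the mean rate of pitch change (assumed ST/s) exceeds G.

    Movements below the threshold would be treated as static tones.
    """
    rate = abs(end_st - start_st) / duration_s
    return rate > glissando_threshold(duration_s)


# A rise of 2 semitones within syllable nuclei of different durations.
print(is_perceived_as_glide(0.0, 2.0, 0.1))
print(is_perceived_as_glide(0.0, 2.0, 0.2))
```

Because G falls with the square of the duration, the same 2-semitone rise stays below the threshold in the 0.1 s nucleus but exceeds it in the 0.2 s nucleus.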
2.5.9 Bannert 1983
Bannert's description of German intonation [Ban83] is based on the analysis of
F0 contours of context-elicited read speech. The corpus contained utterances
of statements, yes-/no-questions and echo-questions produced by three speakers
(see Figure 2.26 for an example; the regions of the nuclear vowels are marked
by hatched rectangles and the onsets of vowels in accented syllables by vertical
lines). The sentences were constructed by successively expanding the phrase
"Die Männer" - "the men" by additional words: "Die Männer in der Menge" -
"the men in the crowd", "Der Müller will die längeren Männer in der Menge
nennen." - "The miller wants to name the taller men in the crowd." etc.
(footnote 16).
16 It is unclear whether the resulting all-voiced `nonsense' sentences are useful for eliciting
natural intonation, since the exclusive choice of the initial sounds [l] and [m] in the content
words makes them a kind of stave-rhyme.
[Figure 2.25: block diagram. Blocks between the input `speech' and the output
`synthesized speech': voiced/unvoiced decision, PDA, syllabic segmentation of
voiced parts, short-term integration, stylization (glissando and differential
glissando thresholds; contour segmentation, pitch targets, stylized contour),
re-synthesis.]
Figure 2.25: Block diagram of the pitch contour stylization algorithm [Md95]. PDA
is the pitch determination algorithm. V/UV is the voicing decision.
Input to the model is a linguistically defined structure (a sentence) with binary
prosodic features:
1. STRESS (syllable stress → lengthening of syllable)
2. LENGTH (of vowels)
3. ACCENT (rising F0 movement on nuclear vowel of accented syllable)
4. TERMINAL (terminal: statement, non-terminal: question)
5. PHRASING (realization of phrase boundaries by means of F0 movements,
not specified)
6. CONTRAST (contrast expands F0 range)
7. EMPHASIS (emphasis expands F0 range)
Figure 2.26: Examples of F0 patterns of statement (top), echo-question (center) and
yes-/no-question (bottom) from Bannert (1983): "Der lullende Müller von Lingen will
die längeren Männer immer Lümmel nennen." - "The lulling miller from Lingen always
wants to call the taller men louts." The vertical lines indicate the onset of the nuclear
vowel in accented syllables.
The phonological component of the model uses these features to generate
a) `basic temporal patterns' (the speech segments) and b) `basic tonal patterns'
(qualitative F0 contours). The latter are described by F0 targets (high (H) and low
(L) tones), which are further modified by the binary feature WIDE (wide for
contrastive or emphatic conditions) and aligned with the segmental string.
Non-final accent syllables are assigned L-tones, as are last syllables in statements,
whereas final accent syllables in statements and last syllables in questions are
assigned H-tones.
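The tone assignment just described can be illustrated with a minimal Python sketch. It is not Bannert's implementation: the function name `assign_tones` and the list-of-booleans input are hypothetical. Two cases are not covered by the summary above and are handled here by assumption: the final accent syllable in questions is left unassigned, and the last-syllable rule is applied only if that syllable has not already received a tone.

```python
from typing import List, Optional


def assign_tones(accented: List[bool], terminal: bool) -> List[Optional[str]]:
    """Assign 'H', 'L' or None to each syllable of one utterance.

    accented[i] is True if syllable i carries an accent.
    terminal is True for statements and False for questions.
    """
    n = len(accented)
    tones: List[Optional[str]] = [None] * n
    accent_idx = [i for i, is_acc in enumerate(accented) if is_acc]
    final_accent = accent_idx[-1] if accent_idx else None

    # Non-final accent syllables are assigned L-tones.
    for i in accent_idx:
        if i != final_accent:
            tones[i] = "L"

    # Final accent syllable: H in statements. The question case is not
    # stated in the text, so it is left unassigned (assumption).
    if final_accent is not None and terminal:
        tones[final_accent] = "H"

    # Last syllable: L in statements, H in questions. Applied only if the
    # syllable has no tone yet (assumption for overlapping rules).
    if n > 0 and tones[-1] is None:
        tones[-1] = "L" if terminal else "H"
    return tones


if __name__ == "__main__":
    pattern = [False, True, False, False, True, False]
    print(assign_tones(pattern, terminal=True))
    print(assign_tones(pattern, terminal=False))
```

The WIDE feature and the alignment with the segmental string are deliberately omitted, since the text gives no quantitative rule for them.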
The intonation algorithm of the model converts the sequence of H- and
L-tones into the desired F0 contour. Bannert observed that if the F0 range is
expanded, this affects only the maxima in the contour, while the local minima
(the L-tones, temporally aligned with the consonant-vowel boundary in accented
syllables, from where the F0 movement starts) remain relatively unaffected. He
therefore uses these local minima to define the global declination line (German
`Tallinie', `valley line') for his model. The characteristic F0 movements on
accented syllables are superimposed on the valley line. The intonation algorithm
connects the F0 targets thus determined by straight lines or by cosine interpolation.
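As an illustration of the final step, the following Python sketch connects a sequence of F0 targets either by straight lines or by cosine interpolation. The function name `f0_at`, the (time, F0) tuple representation and the example values are assumptions made for this sketch; the source does not specify how the valley line itself is computed, so that part is not modelled.

```python
import math
from typing import List, Tuple

Target = Tuple[float, float]  # (time in seconds, F0 in Hz)


def f0_at(targets: List[Target], t: float, cosine: bool = True) -> float:
    """Return the interpolated F0 value at time t.

    targets must be sorted by strictly increasing time. Outside the
    range of the targets the nearest target value is held constant.
    """
    if t <= targets[0][0]:
        return targets[0][1]
    for (t1, f1), (t2, f2) in zip(targets, targets[1:]):
        if t1 <= t <= t2:
            x = (t - t1) / (t2 - t1)  # relative position, 0..1
            if cosine:
                # Cosine easing: zero slope at both targets.
                x = (1.0 - math.cos(math.pi * x)) / 2.0
            return f1 + (f2 - f1) * x
    return targets[-1][1]


if __name__ == "__main__":
    # Hypothetical L and H targets, for illustration only.
    targets = [(0.0, 100.0), (0.3, 140.0), (0.6, 95.0)]
    for step in range(7):
        t = step * 0.1
        print(f"{t:.1f} s  linear {f0_at(targets, t, cosine=False):6.1f} Hz"
              f"  cosine {f0_at(targets, t):6.1f} Hz")
```

Both variants pass exactly through the targets. The cosine variant additionally has zero slope at each target, which yields a smoother contour than the piecewise linear one.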
A third component of the model, the modification component, a kind of
post-processor for the F0 contour generated by the intonation algorithm (taking
into account microprosodic influences and the like), is not elaborated by the
author. Bannert concludes that his model requires perceptual evaluation.
2.5.10 Kohler 1977, 1991
Earlier descriptions of German intonation by Kohler [Koh77] stand in the tra-
dition of the English musical school represented by Halliday [Hal67]. These are
inspired by musical descriptions of intonation which perceive F0 contours as
sequences of distinctive `tones'. Kohler subdivides utterances into `intonation
units', `tacts', syllables and phonemes. A tact starts with an accented syllable,
which is followed by unaccented syllables. An `intonation unit' is a sequence
of tacts that can be divided into an (optional) `pre-nucleus' and the `nucleus'.
The beginning of the nucleus is marked by the most prominent syllable in the
utterance. For German, Kohler distinguishes between the following tones for the
nucleus:
1. Tone 1: fall to a low level
2. Tone 2: rise to a high level
3. Tone 3: rise to a medium level
4. Tone 4: sustained medium level
5. Tone 5: fall followed by a rise to a medium level
6. Tone 6: rise followed by a fall to a low level
Möbius [Mob93, p.43] questions whether this inventory of tones permits
a complete description of German intonation patterns and whether the tones represent
intonational contrasts, a question left unanswered by Kohler.
Kohler 1991 [Koh91] introduces the Kiel intonation model (KIM), which gen-
erates an F0 contour from a linguistic description of a sentence. Following
Kohler's earlier work, the F0 contour corresponds to a sequence of peaks and
valleys which are aligned with the segmental string.
Like Bannert, Kohler defines a number of distinctive prosodic features which
are either binary or graded and assigned to segmental units (especially nuclear
vowels in a syllable, denoted <VOK>) or non-segmental units (phrasal
boundaries, for instance).
Kohler distinguishes between two domains of German prosody: stress and
intonation. Every lexical item has a syllable bearing the lexical stress, which
potentially becomes the location of sentence stress in an utterance. Sentence
stress is acoustically marked by duration <DSTRESS> and F0 movement
<FSTRESS> assigned to <VOK>. The prominence of an item in an utterance
is denoted by a digit between 0 and 9: 0 stands for unstressed words,
1 for secondary stress, and 2 to 9 for primary stress with varying degrees of emphasis.
Further markers denote de-accentuation <DEACC> (corresponds to `secondary
stress'), emphasis <EMPH> (`+' corresponds to prominence levels 3 to 9)
and stress level <STRLEV>, which maps primary stresses 3 to 9 to stress
levels 1 to 7. It remains unclear why this kind of redundancy was introduced
and why there are exactly 10 different levels of prominence.
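The redundancy noted above becomes visible when the markers are derived from the prominence digit. The following Python sketch follows only the description in the preceding paragraph; the function name `prominence_features` and the dictionary layout are hypothetical and are not taken from the KIM rule system.

```python
from typing import Dict, Optional, Union

Features = Dict[str, Union[str, bool, Optional[int]]]


def prominence_features(p: int) -> Features:
    """Derive the stress markers from a prominence digit between 0 and 9."""
    if not 0 <= p <= 9:
        raise ValueError("prominence must be a digit between 0 and 9")
    if p == 0:
        stress = "unstressed"
    elif p == 1:
        stress = "secondary"
    else:
        stress = "primary"
    return {
        "stress": stress,
        # De-accentuation corresponds to secondary stress.
        "DEACC": p == 1,
        # '+' for prominence levels 3 to 9.
        "EMPH": p >= 3,
        # Primary stresses 3..9 map to stress levels 1..7.
        "STRLEV": p - 2 if p >= 3 else None,
    }


if __name__ == "__main__":
    for digit in range(10):
        print(digit, prominence_features(digit))
```

In this reading every marker is a function of the single digit, which is why the additional features carry no independent information.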
The stress features assigned to a particular utterance are determined using
re-writing rules (`symbolic feature rules') which take into account the syntactic,
semantic and pragmatic contents of the sentence.
Figure 2.27: Environment of the Kiel intonation model [Koh91].
[Blocks shown: PRAGMATICS; SEMANTICS; LEXICON (symbolic representation of segments,
stress, boundaries); SYNTAX (with pragmatic and semantic markers); SURFACE STRUCTURES
(symbolic segment chains; boundary, stress, intonation markers); INTONATION MODEL
((a) symbolic feature rules, (b) parametric rules).]
In the `intonation domain' of the model, vowels with `primary' and `secondary'
sentence stresses are assigned basic F0 patterns which are either `valley' or `peak'-
shaped (<VALLEY>). These contours are reminiscent of the tones discussed
above and are selected according to the features `terminality' <TERMIN>,
question/declaration <QUEST>, and <EARLY> and <LATE>, which describe
the location of F0 peaks (early, medial and late); the phonological meaning
that Kohler assigns to the location of F0 peaks was discussed in section 2.4.8.
Using the output of the intonation part of the model, parametric rules are
applied to calculate F0 targets along the segmental string which also take into
account the microprosodic influence of the speech sounds. Finally, cosine
interpolation is applied to connect the F0 targets and produce a smoothed F0 contour.
Figure 2.27 shows the environment of the Kiel intonation model [Koh91, p.340].
2.5.11 Discussion and Conclusions
Returning to Figure 2.17, it can be stated that none of the approaches discussed
provides the means both to derive F0 contours from linguistic structures and to
infer linguistic structures from F0 contours. Tables 2.5 and 2.6 give a classification
of the models and the range of relationships they describe. Most of the
approaches are mainly generative in character. For the sake of comparison, the
original Fujisaki model and its application to German by Möbius (discussed in
section 3.4) are included in the tables.
Table 2.5: Steps between linguistic information and F0 contours covered by
models of German intonation.
LI: Linguistic information    PR: Phonological representation
AF: Abstracted F0 contour     F0: F0 contour
A: Analytical                 G: Generative
author          approach   steps covered (between LI, PR, AF and F0)
Isacenko        G          ⇒ ⇒
Bierwisch       G          ⇒ ⇒
Pheby           G          ⇒ ⇒
Stock           G          ⇒ ⇒
Bannert         G          ⇒ ⇒
Féry, Uhmann    A/G        ⇒ ⇒ ⇔
Altmann         A          ⇒ ⇒ ⇐
Adriaens        A          ⇐
KIM             G          ⇒ ⇒ ⇒
Möbius          A/G        ⇒ ⇔
Table 2.6: Steps between linguistic information and F0 contours covered by
non-German models of intonation.
author          approach   steps covered (between LI, PR, AF and F0)
Pierrehumbert   A/G        ⇒ ⇒ ⇔
't Hart         A          ⇐
d'Alessandro    A          ⇐
Fujisaki        A/G        ⇔ ⇔ ⇔
Among all intonation models discussed so far, KIM is undoubtedly the most
comprehensive descriptive system of German intonation. As Kohler explains,
KIM is developed and further refined by a method of `interactive investigation'
using a TTS system implementing the KIM rule system. This approach starts off
with a definition of linguistic units and structures and explores their manifestation
in the F0 contour by evaluating speech produced by rule-guided synthesis.
Because of its generative character, however, this method is less applicable to
the analysis of natural F0 data. It is nevertheless useful for perceptually validating
hypotheses for new rules. The same is true of Bannert's model.
Pierrehumbert's approach has the advantage of greatly simplifying the
F0 contour, but at the risk of little objectivity. The intuitive `ad hoc' choice
of tonal patterns does not resolve the question of the number of intonational
events.
Altmann, Batliner and Oppenrieder produced quantitative descriptions of pro-
totypal intonation patterns of German, although they reduced the F0 contour to
so few characteristic values that the contour can hardly be regenerated.
The greatest benefit of Adriaens' model is the data reduction gained by a
stylization method which preserves the general shape of the F0 contour. There are,
however, no criteria by which a degree of stylization that preserves all important
acoustic cues for linguistic information can be chosen. Mere `acceptability'
can hardly be a reliable measure. D'Alessandro and Mertens' results show that
movements in the F0 contour have to exceed certain thresholds in order to be
audible. It is, however, unclear whether the perception of accents in an utterance can be
directly compared with the perception of glissandi. The prominence of a certain
syllable, and hence the perceptibility of an F0 movement, is influenced, inter alia,
by the duration and intensity of that particular syllable. Furthermore, the
perceptual impression caused by an utterance is certainly influenced by other
factors, such as the expectation of the listener. Hence the linguistic content of
an utterance must be taken into account in the analysis. Even if the stylization
produced by the algorithm should be equivalent to perceived pitch, it is just
another data reduction approach which does not yield prototypal patterns for
generating F0 contours.
The intriguing simplicity and elegance of Isacenko's approach can only be
adequately judged if one listens to the stimuli he used in his experiments. Despite
their technically poor quality, they persuaded the author of the present study to
examine further the relevance of tone switches within a more sophisticated
framework of German intonation. The importance of Stock's work lies in
his elaboration and functional specification of the intonational units defined
by tone switches.
What do the approaches have in common?
1. They stress the importance of the course of the F0 contour at accented
syllables.
2. They emphasize the importance of the last accent in the utterance (the
sentence accent) for signalling focal condition, terminality and sentence
mode.
3. They observe de-accentuation after the sentence accent.
However, none of the models
1. solves the problem of how intonational domains in an utterance (`prosodic
phrases') can be delimited in the F0 contour (other than by syntactic criteria).
2. incorporates elements modelling the production process of F0.
3. comes in the form of a mathematical formulation.