

Neurocomputing 73 (2009) 87–96


Language-independent, neural network-based, text-to-phones conversion


Mario Malcangi (a,b,c,*), David Frontini (a,b,c)

(a) Università degli Studi di Milano, DICo—Dipartimento di Informatica e Comunicazione, Via Comelico 39, 20135 Milano, Italy
(b) Laboratorio DSP&RTS (Digital Signal Processing & Real-Time Systems), Italy
(c) Laboratorio LIM (Laboratorio di Informatica Musicale), Italy

* Corresponding author at: Università degli Studi di Milano, DICo—Dipartimento di Informatica e Comunicazione, Via Comelico 39, 20135 Milano, Italy. Tel.: +39 02 50314003; fax: +39 02 50316373. E-mail addresses: malcangi@dico.unimi.it (M. Malcangi), dfrontini@acm.org (D. Frontini).

doi:10.1016/j.neucom.2008.08.023

Article info: Available online 3 August 2009

Keywords: Text-to-speech; Rule-based ANN training; Window alignment; Speech synthesis

Abstract. A speech synthesizer based on an artificial neural network (ANN) is being developed for application to deeply embedded systems for language-independent speech commands on hands-free interfaces. A feed-forward, backpropagation artificial neural network has been trained for this purpose using a custom-developed, regular-expression-based, text-to-phone transcription engine to generate training patterns. Initial experimental results show the expected properties of language independence and in-system learning capability of this approach. The ANN demonstrates the capacity to generalize and map words missing at training time, as well as to reduce contradictions related to different pronunciations for the same word.

© 2009 Elsevier B.V. All rights reserved.

1. Introduction

High-quality text-to-phones conversion and speech synthesis will be a basic requirement in developing the next generation of computer-based portable applications. Solutions that use inexpensive processors and require minimal memory will be greatly preferred over those of today, based on high-performance processors and large memory consumption. Language-independent speech synthesis will also be much in demand among applications based on embedded systems, because it will lead to code-independent implementation.

Current implementations of text-to-speech (TTS) use a large dictionary as a lookup table to determine the most appropriate pronunciation of a word to be uttered. This approach has led to very high quality in speech production, but it places great demand on system resources (memory, computing time, etc.) and has limited versatility for different applications.

Text-to-phones conversion is also important in other applications. In speech recognition, it is a key solution for making these applications speaker-independent and in-system updatable. Using text-to-phone conversion, a missing word can be updated by spelling rather than uttering. Moreover, phone-to-text conversion is required in speech-recognition systems, above all those based on phoneme recognition (e.g. phonetic-based, speech-to-text recognition systems).

The applicability of soft-computing (artificial neural networks and fuzzy logic) to implementing text-to-phones conversion is subject to debate. Using artificial neural networks for phoneme-level, text-to-speech conversion has several advantages over hard-computing. Soft-computing's capacity to generalize makes it possible to map words missing from the database, as well as to reduce contradictions related to different pronunciations for the same word. Artificial neural networks have been shown to optimally solve a large class of applied pattern-matching problems, but very little research has been done to match the requirements of pattern generation in machine-to-human interaction.

On the other hand, pronunciation for specific languages has been studied extensively. As a result, we know a great deal about correspondences between written and uttered representations of a given text.

Even a phonetically spelled language like Italian, however, has numerous irregularities in pronunciation [5]. Though native speakers may not be immediately aware of the problem, some semantically distinct words (e.g. PÈRDONO, "they lose", vs. PERDÓNO, "forgiveness", or PÈSCA, "peach", vs. PÉSCA, "fishing") are actually spelled so as to be indistinguishable.

English is far more difficult to represent with a limited number of rules [25]. For example, the "I" in nearly all one-syllable words ending in "-IVE", such as "FIVE" and "LIVE" (adj.), is pronounced /aI/, but there are exceptions, such as "GIVE" and "LIVE" (verb), in which it is pronounced /I/. Most polysyllabic words ending in "-IVE" (like "OLIVE" or "VEGETATIVE") have the /I/ sound, but, again, there are many exceptions (like "ALIVE" or "ENDIVE"). These examples show that "the exception is the rule!" [12].

For many years, speech-production mechanisms have been investigated with the aim of generating natural utterances with an artificial speech synthesizer. Intelligibility in speech synthesis was rapidly achieved, but naturalness proved much more challenging. The reason for the difficulty lies mainly in our ears' sensitivity to

articulatory information embedded in human uttered speech. Such information is difficult to model mathematically.

Language modeling and rule-based representation of speech-production mechanisms attained successful results in the seventies, when industrial synthesizers were introduced to the mass market. However, few implementations were carried out, because electronic technology was not adequate to run language- and rule-based speech-production models.

In the days when desktop computing had meager computing power and memory storage, many research and development efforts sought optimal solutions to speech-synthesis problems, such as text-to-phoneme conversion based on neural networks and phoneme-based speech-synthesizer circuits. When digital signal-processor (DSP) chips were introduced in 1980, hardware implementation of speech synthesis was neglected, and firmware-based speech synthesis became a more interesting topic for researchers and developers. Because memory was a scarce resource, great effort was devoted to compressing speech data.

Over the last decade, the widespread availability of memory in desktop and laptop computers has driven speech-synthesis research to aim for high-quality speech production, with engineers and system designers paying little attention to system optimization. Naturalness has been attained by using redundancy as the primary solution to the many hurdles in producing quality speech, thanks to the assumption that vast resources (memory, computing power, etc.) are available at low cost.

The next wave of computing and communication technology exhibits a clear trend toward embedded computing solutions based on high-density integration technologies such as system-on-chip (SoC), system-on-package (SoP), etc. These emerging system technologies are very powerful, but not redundant in terms of memory and computing power. Speech synthesis needs a systemic approach to achieve new results befitting this new scenario.

Today, embedded computing (cellular telephones, PDAs, MP3 players, etc.) is more readily available than desktop computing. Speech-synthesis applications mostly target embedded and deeply embedded systems, especially to implement hands-free interfaces for applications that run on such systems.

Speech synthesis becomes a very complex task if the main goal is to implement it on a deeply embedded system with real-time, unlimited-vocabulary, speaker-independent, and language-independent specifications. Current high-quality speech-synthesis solutions, due to their reliance on substantial processing resources, cannot be scaled down to satisfy the emerging demands of embedded systems in every application field where human-to-machine interaction is required.

The artificial neural-network (ANN) approach [3,8] to speech synthesis [11,20] can optimally solve several implementation and application problems, primarily because it is closer to the process to be emulated, i.e. the human ability to communicate by means of voice and language. Moreover, because the linguistic approach to natural speech synthesis has proven effective, but the task of generating rules is time-consuming and tedious, an ANN may be the most suitable solution.

Sejnowski and Rosenberg [24] were the first researchers to demonstrate that an ANN could be successfully applied to the speech-synthesis challenge. They demonstrated that a three-layer backpropagation network (BPN) can be successfully trained to convert text into phonemic parameters to drive an articulatory speech synthesizer.

The most interesting result was achieved by observing the nature of the speech produced at incremental stages during the training phase. The utterance evolves, as in children during their learning stages. The ANN proved capable of learning pronunciation rules, as well as other embedded information, such as articulatory effects and inflection. Moreover, the ANN was able to produce appropriate pronunciation even when a specific rule had not been furnished during its training stage. Since that demonstration, many new ANN approaches to speech synthesis have been proposed [9,16,22,23].

The main advantage of applying an ANN to speech synthesis is its ability to learn to speak just as a human does. This means that it can learn to speak any language, just as a human does. The ANN engine is the same for any language it is trained in, so there is no coding dependency, because the ANN trained for a specific language is only data-dependent.

The traditional approach to TTS synthesis requires a large amount of data to represent all the knowledge about how a word has to be converted into a correct speech-synthesizer control stream. The ability of an ANN to generalize reduces the amount of such data and also performs a smoothing action on the output, resulting in more natural speech synthesis [2,4].

In the following sections, this paper presents the system framework, ANN architecture, training strategy, and performance evaluation. A brief concluding section summarizes the work and future plans.

2. System framework

The main idea in developing our system framework is to set up an almost fully automatic process to generate a ready-to-run, ANN-based speech synthesizer trained for a specific language, starting from ASCII text.

Each language is represented by a set of rules that encode all the information needed to correctly pronounce each word in that language. This set of rules is used to generate—from any text—the training patterns to set up the ANN-based speech synthesizer.

A special-purpose development environment (Fig. 1) was designed for this purpose. It consists of four functional blocks, each executing a whole task that helps develop, train, evaluate, and implement a complete text-to-speech application. The system framework consists of four main functional modules, chained as sketched after this list:

- regular-expression-based, text-to-phones translator;
- training-set builder;
- ANN engine; and
- speech synthesizer.

Fig. 1. System framework.
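To illustrate how the four blocks compose into one build flow, a minimal Python sketch follows. All function names and signatures here are hypothetical stand-ins for the blocks in Fig. 1, injected as callables; they are not the framework's actual API.

def build_tts_for_language(text_to_phones, build_training_set, train_ann, make_synth,
                           rule_set, corpus_text):
    """Compose the four functional blocks of Fig. 1 into one build flow.
    Each callable stands in for one module (hypothetical signatures)."""
    phonetic = text_to_phones(corpus_text, rule_set)      # regular-expression translator
    patterns = build_training_set(corpus_text, phonetic)  # training-set builder
    weights = train_ann(patterns)                         # ANN engine (JOONE-based)
    return make_synth(weights)                            # ready-to-run speech synthesizer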

2.1. Regular expression-based translator

The regular-expression-based, text-to-phones translator automatically generates the data needed to build the ANN training patterns. It is a specially developed engine that processes ASCII text and converts it into the corresponding phonetic transcription. To execute this task, it refers to a language-specific rule set, which embeds the phonetic, articulatory, and prosodic information needed to correctly and naturally utter each word in the input text.

2.2. Training-set builder

Input text and its phonetic transcription are the data input for the training-set builder. This automatically processes the phonetic transcription of the text used to train the ANN, generating the appropriate training patterns to tailor the ANN for this specific application.

One critical task executed by this ANN's training-set builder is to align text with its phonetic transcription. Manual alignment is superfluous thanks to rule-based alignment of text with phonetic transcription.

2.3. ANN engine

The ANN engine is the core processor of a neural-network-based speech synthesizer. It is based on JOONE (Java Object-Oriented Neural-Network Engine) [10]. This engine (Fig. 2) is trained with the patterns generated by the training-set builder. Once training is complete, the ANN engine is ready to process a text string and generate a pattern to run the speech synthesizer.

Fig. 2. JOONE.

This framework allows the ANN to be trained automatically by its user (application engineer or end user) and applied to a specific task. The only data needed to build the ANN that drives the speech synthesizer for a specific language are an extensive text and the phonetic rule set for that language. Both these data sets are currently available for any language. Therefore, such a neural-network-based speech synthesizer is effectively language-independent.

This architecture allows the ANN-based speech synthesizer to be updated to a different language simply by changing interconnection weightings, while leaving its programming code as is. This enables run-time, in-system updating of the speech-response application to the country where it is running.

2.4. Speech synthesizer

A synthesizer model based on the human vocal tract was chosen, so that unlimited utterances can be generated to attain the speech-production goal for any language.

The speech synthesizer is a phoneme sequencer that allows software to control the rate, pitch, amplitude, duration, movement rate, and articulation rate of spoken phonemes. Speech is synthesized by combining phones in the appropriate sequence.

The vocal tract is modeled by a set of cascaded, programmable filter sections. Pitched and unpitched utterances are synthesized using various excitation sources as input to the vocal-tract model with appropriately programmed filter sections. Virtually any kind of vocal sound can be generated, each with controllable amplitude.

To gain naturalness in speech production, the synthesizer also enables dynamic utterance control. When a phoneme is synthesized, its stationary state is reached through a linear transition, so that effective coarticulation can be executed between the previously synthesized phoneme and the current one. Phonetic articulation-rate adjustment is a control attribute of the speech synthesizer.

Inflection (pitch) is also controllable, statically or dynamically. This provides variable speech nature (male, female, child, etc.), as well as variable utterance for the same sex (bass, baritone, etc.).

To achieve accurate pronunciation, the speech synthesizer requires accurately selected phonemes, the rate at which they are uttered, the level of pitch at which they are emitted, and the intensity at which they are produced. Fine quality is then achieved by dynamically controlling articulation, inflection, and duration. These speech attributes are fed to the ANN during the training phase, so it will encode all this information in symbolic form at the output level when it transcribes a word. A front end is then implemented to encode the phonemic transcription into a control sequence for the speech synthesizer (Fig. 3).

Fig. 3. Formant speech synthesizer for generating unlimited utterances.
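The cascaded-filter vocal-tract model described in Section 2.4 can be illustrated with a minimal Python sketch. This is not the authors' synthesizer: the sample rate, formant frequencies, and bandwidths below are assumed values, and the excitation is reduced to a pulse train (pitched) or white noise (unpitched) driving a few second-order resonator sections.

import numpy as np
from scipy.signal import lfilter

FS = 16000  # sample rate in Hz (assumed)

def resonator_coeffs(freq, bw):
    """One programmable vocal-tract section: a second-order resonator."""
    r = np.exp(-np.pi * bw / FS)        # pole radius set by the bandwidth
    theta = 2 * np.pi * freq / FS       # pole angle set by the center frequency
    a = [1.0, -2 * r * np.cos(theta), r * r]
    b = [1.0 - r]                       # rough gain normalization
    return b, a

def synthesize_phone(f0, formants, dur=0.2, voiced=True):
    """Excite a cascade of resonator sections with a pulse train or noise."""
    n = int(dur * FS)
    if voiced:
        x = np.zeros(n)
        x[::max(1, int(FS / f0))] = 1.0  # impulse train at the pitch period
    else:
        x = np.random.randn(n) * 0.1     # unpitched (noise) excitation
    for freq, bw in formants:            # cascaded, programmable filter sections
        b, a = resonator_coeffs(freq, bw)
        x = lfilter(b, a, x)
    return x

# Hypothetical formant targets for an /a/-like vowel at 120 Hz pitch.
samples = synthesize_phone(120.0, [(700, 90), (1200, 110), (2600, 160)])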

3. Regular-expression-based, unrestricted, text-to-phones translator

Much of the computation in text-to-speech synthesis consists of a series of processing passes applied to the text to be uttered [1,18,21]. Each pass serves to transform the input text string into an output control-parameter string to be applied to a speech synthesizer to generate the appropriate speech unit, intonation, duration, and dynamics.

The text is first preprocessed to convert numbers, sequences, abbreviations, and special ASCII symbols ($, %, #, &, etc.) into the corresponding expanded text (e.g. $10 becomes "ten dollars"). Punctuation and word boundaries are processed by a set of rules that encode the corresponding prosody, assigning duration and pitch (F0) to the subsequent phones. Words are converted into phone streams by a language-specific set of rules. Each rule encodes the pronunciation pattern for a word or a piece of a word (morpheme).

In keeping with our goal of avoiding code-dependent implementation of the text-conversion process, our efforts have focused on defining a fully phonetic and linguistic data-driven (rules) text processor. The format of the rules is as follows:

C(A)D = B    (1)

which reads "A is transformed into B, if the text to which it belongs matches A in the sequence CAD", where C is a pre-context string and D is a post-context string. B is a phonetic string (symbols or codes).

Because many rules share elements of pre-context and/or post-context, classes of elements were defined to make rule coding more compact and to reduce the number of rules. For example, for Italian and English, the following classes of elements, with the corresponding regular expressions, were defined:

(!) | (^)|($)
(#) | ([AEIOUY]+)
(:) | ([^AEIOUY]*)
(+) | ([EIY])
($) | ([^AEIOUY])
(.) | ([BDGJMNRVWZ])
(>) | ([NR])    (2)

Using these classes, a general rule (for English) such as the following can be coded (using X-SAMPA to encode phones):

!(BI)# = /b/a/I/    (3)

The rule is thus valid for several text strings, such as BIO..., BION..., BIOG..., BIANNUAL, etc.

Rules are grouped in sublists containing all the rules with the same initial character in the context to be matched. Each sublist is internally ordered by specificity of the rules, with the most specific first, the most general last, and the last rule formulated as context-independent, so that it always matches:

!(B)! = /b/i:/
...
!(BI)# = /b/a/I/
...
(B) = /b/    (4)

The regular-expression-based, text-to-phones translator is a general-purpose, text-processor engine applicable to any written text in any language. It depends only on the rule set and on the classes defined for that language. The sublist for the initial character of the substring to be processed is scanned from top to bottom, until the appropriate rule matches. The rule is then applied. Finally, the related phone sequence is appended to the transcription result sequence.

The algorithm is as follows: if r is the word to be translated, R the current rule, and cover the size of the string to be converted:

size <- size(r)
for index = 0 to size do
    c <- get(r, index)
    Rules <- GetRules(c)
    for R in Rules do
        pre-Context <- createPreContext(r, c)
        post-Context <- createPostContext(r, c)
        if R is valid for c in r then
            result <- result + R.phoneme
            cover <- R.cover
            index <- index + cover
            break
        end if
    end for
end for

All the characters in the input string are first categorized and marked according to the class elements (expressions (2)). The algorithm is then applied. Each input character is compared sequentially to the set of rules, to find the one that matches.

The rules are grouped in separate tables, each internally ordered so that for any input character a successful match always exists; e.g., the English rules for final D are

!RUGG(ED)! = /1I/1d/
!CO(ED)! = /2E/2d/
#:(IED)! = /1i/1d/ (e.g. studied, muddied)
:(IED)! = /1a/1I/1d/ (e.g. lied, tried)
(TED)! = /1t/1I/1d/ (e.g. hated, lasted)
(DED)! = /1d/1I/1d/ (e.g. aided, added)
#(ED)! = /1d/ (e.g. played, hoed)
.(ED)! = /1d/ (e.g. rubbed, hugged, buzzed)
(ED)! = /1t/ (e.g. stopped, worked, bussed)
(D)! = /2d/ (e.g. red)    (5)

where the digits "1" and "2" preceding a phone symbol indicate its degree of intensity.

The union of all the left contexts of this group of rules is the universal set. The last rule always matches even if none of the preceding rules match.
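A minimal Python sketch of the matching pass just described may help. The rule encoding (pre-context, body, post-context, phones, cover) follows the paper's description, but the two-rule table is a toy subset modeled on rules (3) and (4), and all helper names are ours.

import re

# Each rule: (pre-context regex, body, post-context regex, phones, cover).
# Toy subset modeled on rules (3)-(4); real rule tables are language data.
RULES = {
    "B": [
        ("",   "BI", "[AEIOUY]", "/b/a/I/", 2),  # !(BI)# = /b/a/I/
        (".*", "B",  ".*",       "/b/",     1),  # (B) = /b/, context-free default
    ],
}

def transcribe(word):
    """Scan the word left to right, applying the most specific matching rule."""
    result, i = "", 0
    while i < len(word):
        for pre, body, post, phones, cover in RULES.get(word[i], []):
            j = i + len(body)
            if (word[i:j] == body
                    and re.fullmatch(pre, word[:i])   # pre-context must end here
                    and re.match(post, word[j:])):    # post-context must start here
                result += phones
                i += cover        # skip the characters this rule covers
                break
        else:
            i += 1                # letter not in the toy table: skip it
    return result

print(transcribe("BIO"))  # -> /b/a/I/ via the !(BI)# rule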

4. ANN architecture

The ANN has a three-layer, feed-forward, backpropagation architecture (FFBP-ANN), similar to that used in NETtalk [24]. Its inputs are fully connected to all the nodes in the hidden layer, and the hidden layer is fully connected to the output nodes (Fig. 4). Text is fed to the units in the input layer. Phonetic transcription comes out of the nodes at the output layer. All inputs and outputs have a linear activation function that controls the connection. A non-linear activation function (sigmoid) connects hidden-layer nodes to output-layer nodes as follows:

s_i = 1 / (1 + e^{-I_i}),  I_i = \sum_j w_{ij} s_j    (6)

where s_i is the output of the i-th unit, I_i is the total input to that unit, and w_{ij} is the weight from the j-th to the i-th unit.

Fig. 4. Architecture of the FFBP-ANN.

Fig. 5. Input-layer data encoding.

Input to the FFBP-ANN consists of nine consecutive characters of text to be phonetized. The output encodes the phone that corresponds to the character in the center position of the nine-character input window.

Each of the nine input nodes consists of 36 binary elements, one for each symbol in the character set (the complete alphabetic set A-Z plus a few control characters, such as space, period, question mark, etc.). There are thus 324 (36 x 9) total input elements for the FFBP-ANN, enough to fully encode a complete pattern, nine characters wide (Fig. 5).

The output consists of 394 nodes, one for each phone in the phone set to be used. The current output encodes the phone that corresponds to the middle character in the input-layer string. The pre-context and post-context of the current input character help determine the output.

To determine the best window size (Fig. 6), we look to the set of pronunciation rules. If W represents the longest possible character-string length of either pre-context or post-context, then window size (number of characters) must be chosen as follows:

window_size = 2W + 1    (7)

Choosing window size based on the longest possible character-string length of either pre-context or post-context ensures that the window is symmetrical and its size (number of characters) is always odd.

Fig. 6. Sliding window.

Studies on the mutual information provided by neighboring characters in determining the correct pronunciation of a word showed a nearly symmetrical, bell-shaped function. The information gain is maximum at the position of the letter to be pronounced. It then decreases symmetrically for left- and right-character context [17].

The size of the input and output layers depends on the number of different characters that may occur in the text to be uttered and the number of phones that the ANN needs to represent. If the ANN is to be trained for a specific language, the size of the input layer can be fixed as the character set of that language.

The size of the output layer cannot be fixed in advance, because a given language exhibits many variations (phones) of its basic speech sounds (phonemes). For this reason, the output layer is resized at training time to fit the synthesis model applied (phonemic, diphonic, etc.).

To set a fixed output-layer size, a binary-coded solution can be implemented. This solution may be advantageous for hardware implementation of the ANN, because it can accommodate a large range of phones using a fixed number of nodes at the output layer. A 9-bit encoded output layer can hold as many as 512 different phones, enough to drive any kind of speech-synthesis model. In any case, if the ANN implementation is based on programmable hardware technologies such as field-programmable gate arrays (FPGA) or programmable system-on-chip (PSoC), output-layer size need not be fixed.

A test was performed to compare the learning speed of an ANN with binary-encoded output to that of an ANN with binary-decoded (one-in-position) output. This test showed that the one-in-position ANN trains more quickly and learns better than the binary-encoded one, as shown in Fig. 7. The ANN with binary-decoded output also shows a slight increase in performance compared to the binary-encoded ANN trained with the same training pattern. Because learning speed and better performance are primary targets in our research, the one-in-position solution was chosen.

Fig. 7. Learning speed is better for one-in-position coding at the output-layer level.
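A sketch of the input/output encoding just described, under an assumed 36-symbol character set: each window character becomes a 36-element one-hot vector (324 inputs in all), and the target phone becomes a one-in-position vector over the phone inventory. The phone index used below is arbitrary.

import numpy as np

CHARS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ .?!,;-'\"_"   # 36 symbols (assumed set)
assert len(CHARS) == 36

def encode_window(window, phone_index, n_phones=394):
    """9-char window -> 324 one-hot inputs; target phone -> one-in-position."""
    assert len(window) == 9                  # window_size = 2W + 1 with W = 4
    x = np.zeros(9 * 36)
    for pos, ch in enumerate(window):
        x[pos * 36 + CHARS.index(ch)] = 1.0  # one bit per symbol per slot
    t = np.zeros(n_phones)
    t[phone_index] = 1.0                     # one-in-position target
    return x, t

x, t = encode_window("    GOGNA", 17)        # center char (index 4) is 'G'
print(x.sum(), t.sum())                      # 9.0 1.0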

5. Training strategy

FFBP-ANN training strategy is a key issue in our framework. One primary goal is to fully automate the training procedure, so that the ANN can be trained for any language, regardless of its peculiarities.

The backpropagation algorithm was used as the learning algorithm to train our FFBP-ANN, minimizing the average squared error between the actual output at the i-th neuron of the output layer and the target output value:

E = \sum_{i=1}^{N} (s_i - t_i)^2    (8)

where N is the total number of units in the output layer.

To avoid oscillations during the training phase, to amplify the learning rate, and to allow escape from local minima, a momentum-term-modified learning algorithm was applied:

w_{k+1} = w_k - \alpha g_k + \beta (w_k - w_{k-1})    (9)

where α is the adaptation step, g is the gradient of the error function E, and β is the momentum term. Successfully applying this algorithm required that training data (words and their pronunciations) be collected and that the input and output data be converted into suitable binary form.

Fig. 8 describes the sequence of processes used to generate the encoded training pattern, starting from the text containing the training-set pattern word. The phonetic-rule-based transcription engine processes the text so that its utterance is fully described, both statically (phonetics) and dynamically (prosody). The text and its corresponding phonetic transcription are then passed through the sliding-window process to align the text pattern (ANN input) with the phonetic pattern (ANN output). After successful alignment, the training pattern is binary-encoded.

Fig. 8. Training-set generation process.

Two main problems have to be solved: generating the set of training patterns and aligning them.
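Eqs. (8) and (9) translate into the following update step; the gradient computation itself is left abstract, since any standard backpropagation pass supplies it, and the parameter values are those reported in Section 5.3.

import numpy as np

def mse(outputs, targets):
    """Eq. (8): E = sum_i (s_i - t_i)^2 over the N output units."""
    return np.sum((outputs - targets) ** 2)

def momentum_step(w, w_prev, grad, alpha=0.15, beta=0.9):
    """Eq. (9): w_{k+1} = w_k - alpha * g_k + beta * (w_k - w_{k-1}).
    alpha is the adaptation step (learning rate), beta the momentum term."""
    return w - alpha * grad + beta * (w - w_prev)

# Toy usage: one update over a 4-weight vector with a constant gradient.
w_prev, w = np.zeros(4), np.ones(4)
w_next = momentum_step(w, w_prev, grad=np.full(4, 0.2))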

5.1. Training patterns

Rosenberg and Sejnowski developed a special-purpose dictionary of English words and their pronunciations. Each word consists of two strings: the phoneme string (transcribed in PronLex) and the stress string. Both strings are aligned, character by character, with the word, as shown for the two words "agglutinate" and "aberration":

a - X, 0
g - g, <
g - -, >
l - l, >
u - U, 1
t - t, <
i - -, 0
n - N, <
a - e, 2
t - t, <
e - -, <

a - @, 2
b - b, <
e - x, 0
r - r, <
r - -, >
a - e, 1
t - s, >
i - -, 0
o - x, <
n - n, <

where "<" is the right syllable boundary, ">" is the left syllable boundary, "1" is the primary stress, "2" is the secondary stress, and "0" is the tertiary stress.

This training set is very large but not exhaustive. It cannot be automatically generated for any language, because it requires a phonetic dictionary for the target language. It also needs manual alignment when word length fails to match phone-string length. In an attempt to achieve fully automated training-pattern generation, the pronunciation-rule set and the text-to-phones algorithm are used to generate the training pattern starting from the word list alone.

The pronunciation-rule set embeds both alignment information and stress information in each rule, so alignment can be carried out automatically during training-pattern generation. Stress and duration

information is encoded in the phone symbol as follows:

/a/ = not stressed
/'a/ = stressed
/'a:/ = stressed and long

This solution allows any prosodic information to be encoded into the phone name, so the ANN can learn more about the different pronunciations of each phone.

5.2. Aligning patterns

Pattern alignment [13] is automatically solved by looking up the pronunciation rules employed to generate the phonic transcription for each word in the training set. This is defined as a transformation T of the word string (r) into the phonetic string (p), obtained by applying the rules R (see Section 3):

p = T(R(r))

The alignment problem between the word character string and the corresponding phone character string is solved at run time, when the rule is applied. Implementation is based on the following algorithm:

if sizeof(p) = sizeof(r)
    no alignment required
else
    cover <- R.cover
    if cover > 1
        alignment required
    endif
endif

If alignment is not required, a one-to-one association between one character of the word r and one phone is established. If alignment is required, a many-to-one association is established between a number of characters of the word r equal to cover and one phone followed by cover-1 null phones (coded as "-").

Applying this method to the Italian word GOGNA yields the following alignment:

G = /g/
O = /'o/
G = /J:J/
N = /-/  (null phone inserted)
A = /a/

The following rules apply:

!(G) = /g/, cover = 1
...    (10)
(O)$$# = /'o/, cover = 1
...    (11)
(GN)#! = /J:J/, cover = 2
...    (12)
(A)! = /a/, cover = 1
...    (13)

Once alignment is complete, the sliding-window algorithm is applied to the transcribed word, generating the following training array (based on a nine-character window):

- - - - - - - - G  /-/
- - - - - - - G O  /-/
- - - - - - G O G  /-/
- - - - - G O G N  /-/
- - - - G O G N A  /g/
- - - G O G N A -  /'o/
- - G O G N A - -  /J:J/
- G O G N A - - -  /-/
G O G N A - - - -  /a/
O G N A - - - - -  /-/
G N A - - - - - -  /-/
N A - - - - - - -  /-/
A - - - - - - - -  /-/
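The alignment and windowing just described can be sketched in a few lines of Python for GOGNA; the per-letter covers come from rules (10)-(13) above, while the padding symbol and helper names are our own.

def align(word, rule_spans):
    """Many-to-one alignment: each (cover, phones) pair consumes `cover`
    letters and emits the phones followed by cover-1 null phones ('/-/')."""
    phones = []
    for cover, p in rule_spans:
        phones.append(p)
        phones.extend(["/-/"] * (cover - 1))  # null phones keep lengths equal
    assert len(phones) == len(word)
    return phones

def windows(word, phones, size=9):
    """Slide a size-wide window over the padded word, pairing each window
    with the phone of its center character (as in the array above)."""
    pad = size - 1
    padded = "-" * pad + word + "-" * pad     # '-' used as padding (assumed)
    padded_ph = ["/-/"] * pad + phones + ["/-/"] * pad
    center = size // 2
    return [(padded[i:i + size], padded_ph[i + center])
            for i in range(len(word) + pad)]

# GOGNA under rules (10)-(13): covers 1, 1, 2, 1.
for w, p in windows("GOGNA", align("GOGNA", [(1, "/g/"), (1, "/'o/"),
                                             (2, "/J:J/"), (1, "/a/")])):
    print(w, p)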

5.3. Training

Training is accomplished by running the backpropagation algorithm. This algorithm is difficult to set up for application-specific purposes, because several parameters must be trimmed for optimal learning: the number of hidden units, the initial random weights, the learning rate, the coefficient of momentum, and the stopping point (how many training epochs to run) [19]. Several experiments were conducted to find the best combination of such setup parameters, training the FFBP-ANN with different parameters and then evaluating performance comparatively. We set the training-function goal to a mean square error of 0.004.

Additional experiments concerned the size of the sliding window. Window size, according to our Italian pronunciation-rule set, proves to be nine characters wide (greater than that chosen originally by Sejnowski and Rosenberg [24]). Our performance evaluation showed that the FFBP-ANN trained with a nine-character window works better than the same ANN trained with a seven-character window, and only slightly worse than the ANN trained with an 11-character sliding window. The nine-character window size was chosen as the best trade-off between performance and ANN size.

The following setup was then identified:

Input layer units: 324
Hidden layer units: 80
Output layer units: 394
Learning epochs: 400
Learning rate: 0.15
Momentum: 0.9
Sliding window size: 9
Mean square error: 0.004

As a training set, 720 Italian words were used. The set consists of (nearly) all dictionary words starting with two particular alphabetical characters, "C" and "E" (Table 1). This training set is but a small portion of the whole Italian word set, yet it almost completely covers words with these two initial letters. Though limited in size, this training set covers all the Italian phonemes and a good percentage of the phones (170 of 394) our speech synthesizer can synthesize. The letters "C" and "E" are representative of the most difficult pronunciation rules for the ANN to learn (studies on human language learning show the difficulty of learning to correctly read these two letters in the words where they occur). For example, the soft "C", /tS/, takes longer to learn than the hard "C", /k/. Such a training set is very useful for testing the network's ability to absorb exceptions and to generalize them.

Table 1. Training set consisting of 720 Italian words starting with the two alphabetical characters "C" and "E".

Word | Transcription
CE | /_tS/'E_/
... | ...
CELLA | /_tS/'E/l:l/a_/
... | ...
CETO | /_tS/'E:/t/o_/
... | ...
CELIBATO | /_tS/'e:/l/i/b/a/t/o_/
... | ...
CETRIOLO | /_tS/e/t/4/i/'o:/l/o_/

Each input word is assigned a set of phone codes, each of which includes all the parameters that make the word sound natural when played through the synthesizer. The network learns the relationship between input words and output parameters through iteration. After 50 passes (epochs) through the training set, the phonetized text can be understood. After 100 passes, the error rate falls below 2%. After 400 training passes, the error rate is below 1% (Fig. 9). This represents an ANN in a stable state that can be used to synthesize good-quality speech.

Fig. 9. Learning curve after 400 epochs of 720 patterns.

To test the ANN thus set up, an Italian text was fed into the system and the corresponding utterance was produced by driving the speech synthesizer.

6. Performance evaluation

To evaluate the trained FFBP-ANN's performance, a special-purpose environment implementing both the soft-computing and the hard-computing text-to-speech models (TXT2SP) was developed. Both implementations share the pronunciation rule set, allowing meaningful comparison.

The hard-computing (algorithmic) implementation of text-to-speech uses a set of coded phonetic rules to execute the text-to-phones transcription. The soft-computing (ANN) implementation uses the trained knowledge obtained, during learning, from the same set of coded phonetic rules used to run the hard-computing implementation.

The same word (text) can be sent as input to both transcription processes (soft- and hard-computing) and compared at the phonetic level. Utterance-level comparison is also available, because the speech synthesizer is integrated into the TXT2SP development environment. This environment also provides the functions needed to set up and train the ANN, as well as to compare ANN performance with the rule-based implementation, using a text file as test input.

6.1. Test planning

The main goal of our research is to explore the ability of an FFBP-ANN to produce text-to-speech at an acceptable level of intelligibility and to match our main embedded-system requirements.

Two main tests were planned. The first was designed to evaluate whether the FFBP-ANN could correctly transcribe all the words used at training time. The second was designed to evaluate whether the FFBP-ANN could correctly transcribe words not included in the training set. To run the first test, the 720-word set used at training time was fed to the ANN. For the second, the ANN was fed a set of exception words not included in the training set.

6.2. Test results

On the first test, the hard-computing implementation correctly transcribed all the words (0% error rate), because all the test words are fully encoded in the pronunciation rule set. The FFBP-ANN correctly transcribed nearly all the words in the pronunciation rule set (2% error rate); the transcription errors did not compromise utterance intelligibility, because they concerned only stress position.

On the second test, the FFBP-ANN performed well with words not in the training set and not encoded in the pronunciation-rule set, while the hard-computing implementation failed systematically to transcribe the words correctly (100% error rate), because none of the test words are encoded in the pronunciation-rule set.

An example of this behavior arises when a proparoxytone word is tested. For example, the proparoxytone word

CEMBALO = /tS/'e/m:/b/a/l/o/

is not in the pronunciation rule set, so the hard-computing implementation incorrectly transcribes it as follows (default):

CEMBALO = /tS/e/m/b/'a:/l/o/

The soft-computing implementation, however, runs almost correctly:

CEMBALO = /tS/'e/m/b/a/l/o/

It fails only with regard to the duration of the /m/ phone. Moreover, it identifies the correct stress position. This demonstrates the FFBP-ANN's very high generalization ability, despite the limited number of exception words furnished in the 720-word training set.

Table 2 shows some proparoxytone words used to run the second test and the transcription results.

Table 2. Proparoxytone words not included in the training set but used to test the FFBP-ANN's ability to correctly transcribe exception words.

Proparoxytone word | Exact transcription | FFBP-ANN transcription
CECERO | /tS/'e:/tS/e/4/o/ | Correct
CEDERE | /tS/'E:/d/e/4/e/ | Correct
CEFALO | /tS/'E:/f/a/l/o/ | Correct
CELERE | /tS/'E:/l/e/r/e/ | Correct
CELTICO | /tS/'E/l:/t/i/k/o/ | Correct
CENERE | /tS/'e:/n/e/4/e/ | Correct
CENTESIMA | /tS/e/n/t/'E:/s/i/m/a/ | Wrong
CERULO | /tS/'E:/r/u/l/o/ | Correct
CESTOLA | /tS/'e/s:/t/o/l/a/ | Correct
CETERA | /tS/'E:/t/e/r/a/ | Correct
CETNICO | /tS/'e/t:/n/i/k/o/ | Correct

The FFBP-ANN incorrectly transcribes long words especially. This is due to the scarce sampling of long exception words in the training set. By increasing the number of exception words at training time and resizing the input window to 11 characters, a more robust generalization ability could be embedded in the ANN.

6.3. Analysis of results

Test results confirm our hypothesis: the FFBP-ANN can learn pronunciation rules from a limited set of words and produce speech from text over an unlimited vocabulary. By contrast, the hard-computing implementation of text-to-speech needs an exhaustive pronunciation-rule set and a phonetic transcription of all exception words.

These results also show that the FFBP-ANN implementation of text-to-speech can match the primary embedded-system application requirement: limited use of system memory.

7. Conclusions

An FFBP-ANN-based engine for text-to-phones transcription is being developed to act as the back end for phonetic synthesizers. The network can be trained on any language using the same set of phonemes, if an appropriate set of rules is available. The trained FFBP-ANN's ability to generalize shows that this approach to text-to-speech synthesis becomes advantageous compared with current implementations when embedded systems are its target.

Its main advantage is due to the FFBP-ANN's ability to correctly transcribe nearly any word in the trained language, including exception words. This ability leads to limited memory requirements when setting up an unlimited-vocabulary text-to-speech system. On the other hand, rule-based text-to-speech synthesis, to correctly cover every exception in a specific vocabulary, needs large amounts of memory, which are not available in embedded systems.

Another important advantage of the FFBP-ANN is that its implementation is language-independent. If its size parameters (the number of nodes at the input and output layers) are maximized to fit the language to be supported, then a single ANN engine can switch from one language to another by merely exchanging data (the node weights of the ANN trained for the specific language). Rule-based text-to-speech synthesis, to switch from one language to another, needs updated data (rule sets) as well as program code. This is a disadvantage for applications based on embedded systems, because programming code is read-only system information.

The strict code independence of an FFBP-ANN trained for text-to-speech transcription can also be advantageous for software-embedded applications, such as network-distributed services, where each client needs a specific language for voice interaction. Applications of this kind prove more difficult if program code needs to be updated, as required by rule-based text-to-speech synthesis. Finally, the FFBP-ANN's in-system learning ability can be part of the embedded system's native hardware capacity, so that speech-synthesis functionality can be trained at the user level.

Future efforts will strive to solve the problem of automatically generating the pronunciation rule set for a given language, starting out merely from text and its utterance, through automatic phone classification [7]. This will enable in-system language configuration of the ANN-based text-to-speech converter.

A further aim will be to develop an ANN-based speech synthesizer [6,14,15] that can be embedded in a system-on-chip (SoC) solution. Implementing fully ANN-based voice production holds great promise, because ANN architecture consists of highly repetitive structural patterns. This repetitiveness allows an ANN to be implemented efficiently on field-programmable gate arrays (FPGAs). The high gate density of the latest generation of FPGAs enables very large ANNs to be implemented on a single chip.

Moreover, the new molectronic (molecular-electronic) devices that are in the offing will boost current silicon density and processing speed by a factor of 100 to 1000. This means that, in the near future, it will be common to implement embedded ANNs. Therefore, a fully ANN-based implementation of a language-independent speech synthesizer with unlimited vocabulary will represent the optimal solution to be integrated into deeply embedded applications.

Acknowledgment

We owe special thanks to Philip Grew for his linguistic consultancy and some key suggestions.

References

[1] P.C. Bagshaw, Phonemic transcription by analogy in text-to-speech synthesis: novel word pronunciation and lexicon compression, Computer Speech and Language 12 (2) (1998) 119-142.
[2] G. Bakiri, T.G. Dietterich, Achieving high-accuracy text-to-speech with machine learning, in: R.I. Damper (Ed.), Data Mining Techniques in Speech Synthesis, Chapman & Hall, New York, NY, 2002.
[3] C. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, Oxford, 1995.
[4] J.A. Bullinaria, Representation, learning, generalization and damage in neural network models of reading aloud, Technical Report, Edinburgh University, 1994.
[5] L. Canepari, Il MaPI—Manuale di pronuncia italiana, Zanichelli, 2004.
[6] G.C. Cawley, M.D. Edgington, Generalization in neural speech synthesis, in: Proceedings of the Institute of Acoustics Autumn Conference, Windermere, UK, 1998.
[7] P. Cosi, P. Frasconi, M. Gori, L. Lastrucci, G. Soda, Competitive radial basis functions training for phone classification, Neurocomputing 34 (2000) 117-129.
[8] S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice-Hall, Englewood Cliffs, NJ, 1999.
[9] F. Hendessi, A. Ghayoori, T.A. Gulliver, A speech synthesizer for Persian text using a neural network with a smooth ergodic HMM, ACM Transactions on Asian Language Information Processing 4 (1) (2005).
[10] JOONE, Java Object Oriented Neural Engine, http://www.joone.org.
[11] O. Karaali, G. Corrigan, N. Massey, C. Miller, O. Schnurr, A. Macie, A high-quality text-to-speech system composed of multiple neural networks, in: IEEE International Conference on Acoustics, Speech and Signal Processing, Seattle, WA, 1998.
[12] P. Ladefoged, Vowels and Consonants: An Introduction to the Sounds of Languages, Blackwell Publishers, Malden, MA, 2004.
[13] C.X. Ling, H. Wang, Alignment algorithms for learning to read aloud, in: Proceedings of IJCAI, 1997.
[14] M. Malcangi, NeuroFuzzy approach to the development of a text-to-speech (TTS) synthesizer for deeply embedded applications, in: Proceedings of the 14th Turkish Symposium on Artificial Intelligence and Neural Networks, Cesme, Turkey, 2005.
[15] M. Malcangi, Combining a fuzzy logic engine and a neural network to develop an embedded audio synthesizer, in: T. Simos, G. Psihoyios (Eds.), Lecture Series on Computer and Computational Sciences, vol. 8, Brill Publishing, Leiden, The Netherlands, 2007, pp. 159-162.
[16] T. Kristensen, Two neural network paradigms of phoneme transcription—a comparison, in: Proceedings of the 2004 IEEE International Joint Conference on Neural Networks, 2004.
[17] J.M. Lucassen, R.L. Mercer, An information theoretic approach to the automatic determination of phonemic baseforms, in: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, San Diego, CA, 1984, pp. 42.5.1-42.5.4.
[18] Y. Marchand, R.I. Damper, A multistrategy approach to improving pronunciation by analogy, Computational Linguistics 26 (2) (2000) 195-219.
[19] M. Moreira, E. Fiesler, Neural networks with adaptive learning rate and momentum terms, IDIAP Technical Report, 1995.
[20] D.P. Morgan, C.L. Scofield, Neural Networks and Speech Processing, Kluwer Academic Publishers, London, 1991.
[21] D. O'Shaughnessy, Speech Communication—Human and Machine, Addison-Wesley, Reading, MA, 1987.
[22] M. Rahim, C. Goodyear, Articulatory synthesis with the aid of a neural net, in: Proceedings of ICASSP, Glasgow, Scotland, 1989, pp. 227-230.
[23] M.S. Scordilis, J.N. Gowdy, Neural network based generation of fundamental frequency contours, in: Proceedings of ICASSP, Glasgow, Scotland, 1989, pp. 219-222.
[24] T.J. Sejnowski, C.R. Rosenberg, Parallel networks that learn to pronounce English text, Complex Systems 1 (1987) 145-168.
[25] A. van den Bosch, A. Content, W. Daelemans, Measuring the complexity of writing systems, Journal of Quantitative Linguistics 1 (3) (1994) 178-188.

Mario Malcangi graduated in computer engineering from the Politecnico di Milano in 1981. His research is in the areas of speech processing and digital audio processing. He teaches digital signal processing and digital audio processing at the Università degli Studi di Milano. He has published several papers on topics in digital audio and speech processing. His current research efforts focus primarily on applying soft-computing methodologies (neural networks and fuzzy logic) to speech synthesis, speech recognition, and speaker identification, with deeply embedded systems as the platform that supports the application processing.

David Frontini received his B.Sc. in information science from the Università degli Studi di Milano. From 1999 to 2000 he worked as an IT consultant. From 2000 to 2003 he was an EMEA Professional Services member at Open Text Corporation. From 2003 to 2006 he was a senior IT specialist at IBM, working in the telecommunications business. From 2006 to 2007 he was an IT architect for SIA Group. Since 2007 he has been an IT architect at ASTIR. His research interests include intelligent information systems, such as neural networks and fuzzy logic.
