BEAT: The Behavior Expression Animation Toolkit
ABSTRACT
The Behavior Expression Animation Toolkit (BEAT) allows animators to input typed text that they wish to be spoken by an animated human figure, and to obtain as output appropriate and synchronized nonverbal behaviors and synthesized speech in a form that can be sent to a number of different animation systems. The nonverbal behaviors are assigned on the basis of actual linguistic and contextual analysis of the typed text, relying on rules derived from extensive research into human conversational behavior. The toolkit is extensible, so that new rules can be quickly added. It is designed to plug into larger systems that may also assign personality profiles, motion characteristics, scene constraints, or the animation styles of particular animators.

Keywords
Animation Systems, Facial Animation, Speech Synthesis

1. INTRODUCTION
The association between speech and other communicative behaviors poses particular challenges to procedural character animation techniques. Increasing numbers of procedural animation systems are capable of generating extremely realistic movement, hand gestures, and facial expressions in silent characters. However, when voice is called for, the issues of synchronization and appropriateness render disfluent otherwise more than adequate techniques. And yet there are many cases where we may want to animate a speaking character. Cartoon political rallies or cocktail party scenes, for example, demand a crowd of speaking and gesturing virtual actors. While spontaneous gesturing and facial movement occur naturally and effortlessly in our daily conversational activity, a trained eye is called for when such associations between nonverbal behaviors and words must be specified in explicit terms. For example, untrained animators, and autonomous animated interfaces, often generate a pointing gesture towards the listener when a speaking character says "you" ("If you want to come with me, get your coat on"). A point of this sort, however, never occurs in life (try it yourself and you will see that only if "you" is being contrasted with somebody else might a pointing gesture occur) and, what is much worse, makes an animated speaking character seem stilted, as if speaking a language not her own. In fact, for this reason, many animators rely on video footage of actors reciting the text, for reference or rotoscoping, or, more recently, rely on motion-captured data to drive speaking characters. These are expensive methods that may involve a whole crew of people in addition to the expert animator. This may be worth doing for characters that play a central role on the screen, but is not as justified for a crowd of extras.

In some cases, we may not even have the opportunity to handcraft or capture the animation. Embodied conversational agents as interfaces to web content, animated non-player characters in interactive role-playing games, and animated avatars in online chat environments all demand some kind of procedural animation. Although we may have access to a database of all the phrases a character can utter, we do not necessarily know in what context the words may end up being said, and may therefore not be able to link the speech to appropriate context-sensitive nonverbal behaviors beforehand.

BEAT allows one to animate a human-like body using just text as input. It uses linguistic and contextual information contained in the text to control the movements of the hands, arms and face, and the intonation of the voice. The mapping from text to facial, intonational and body gestures is contained in a set of rules derived from the state of the art in nonverbal conversational behavior research. Importantly, the system is extremely permeable, allowing animators to insert rules of their own concerning personality, movement characteristics, and other features that are realized in the final animation. Thus, in the same way as Text-to-Speech (TTS) systems realize written text in spoken language, BEAT realizes written text in embodied expressive behaviors. And, in the same way as TTS systems are permeable to trained users, allowing them to tweak intonation, pause length and other speech parameters, BEAT is permeable to animators, allowing them to write particular gestures, define new behaviors and tweak the features of movement.

The next section gives some background to the motivation for BEAT. Section 3 describes related work. Section 4 walks the reader through the implemented system, including explaining the methodology of text annotation, selection of nonverbal behaviors, and synchronization. An extended example is covered in Section 5. Section 6 presents our conclusions and describes possible directions for future work.
2. CONVERSATIONAL BEHAVIOR
To communicate with one another, we use words, of course, but we also rely on intonation (the melody of language), hand gestures (beats, iconics, pointing gestures [23]), facial displays (lip shapes, eyebrow raises), eye gaze, head movements and body posture. The form of each of these modalities – a rising tone vs. a falling tone, pointing towards oneself vs. pointing towards the other – is essential to the meaning. But the co-occurrence of behaviors is equally important. There is a tight synchrony among the different communicative modalities in humans. Speakers accentuate only the important words by speaking more forcefully, gesture along with the word that a gesture illustrates, and turn their eyes towards the listener when coming to the end of a thought. Meanwhile, listeners nod within a few hundred milliseconds of when the speaker's gaze shifts. This synchrony is essential to the meaning of conversation. Speakers will go to great lengths to maintain it (stutterers will repeat a gesture over and over again until they manage to utter the accompanying speech correctly), and listeners take synchrony into account in what they understand. (Readers can contrast "this is a stellar SIGGRAPH submission" [big head nod along with "stellar"] with "this is a . . . stellar SIGGRAPH submission" [big head nod during the silence].) When synchrony among the different communicative modalities is destroyed, as in low-bandwidth videoconferencing, satisfaction and trust in the outcome of a conversation are diminished. When synchrony among the different communicative modalities is maintained, as when one manages to nod at all the right places during the Macedonian policeman's directions despite understanding not a word, conversation comes across as successful.

Although all of these communicative behaviors work together to convey meaning, the communicative intention and the timing of all of them are based on the most essential communicative activity, which is speech. The same behaviors, in fact, have quite different meanings depending on whether they occur along with spoken language or not, and similar meanings are expressed quite differently when language is or is not a part of the mix. Indeed, researchers found that when people tried to tell a story without words, their gestures demonstrated entirely different shape and meaning characteristics – in essence, they began to resemble American Sign Language – as compared to when the gestures accompanied speech [23].

Skilled animators have always had an intuitive grasp of the form of the different communicative behaviors and the synchrony among them. Even animators, however, often turn to rotoscoping or motion capture in cases where the intimate portrayal of communication is of the essence.

3. RELATED WORK
Until the mid-1980s or so, animators had to manually enter the phonetic script that would result in lip-synching of a facial model to speech (cf. [26]). Today we take for granted the ability of a system to automatically extract (more or less beautiful) "visemes" from typed text, in order to synchronize lip shapes to synthesized or recorded speech [33]. We are even able to animate a synthetic face using voice input [6] or to re-animate actual videos of human faces in accordance with recorded audio [7]. [27] go further in the direction of communicative action and generate not just visemes, but also syntactic and semantic facial movements. And the gains are considerable, as "talking heads" with high-quality lip-synching significantly improve the comprehensibility of synthesized speech [22] and the willingness of humans to interact with synthesized speech [25], as well as decrease the need for animators to spend time on these time-consuming and thankless tasks.

Animators also spend an enormous amount of effort on the thankless task of synchronizing body movements to speech, either by intuition or by using rotoscoping or motion capture. And yet, we have still seen no attempts to automatically specify "gestemes" on the basis of text, or to automatically synchronize ("body-synch") those body and face behaviors to synthesized or recorded speech. The task is a natural next step after the significant existing work that renders communication-like human motion realistic in the absence of speech, or along with text balloons. Researchers have concentrated both on low-level features of movement and on aspects of humans such as intentionality, emotion, and personality. [5] devised a method of interpolating and modifying existing motions to display different expressions. [14] have concentrated on providing a tool for controlling the expressive shape and effort characteristics of gestures: taking existing gestures as input, their system can change the nature of how a gesture is perceived. [1] have concentrated on realistic emotional expression of the body. [4] and [3] have developed behavioral animation systems to generate animations of multiple creatures with varying personalities and/or intentionality. [8] constructed a system that portrays the gestural interaction between two agents as they pass and greet one another, and in which behavioral parameters were set by personality attribute "sliders." [29] concentrated on the challenge of representing the personality of a synthetic human in how it interacted with real humans, and on the specification of coordinated body actions using layers of motions defined relative to a set of periodic signals.

There have also been a smaller number of attempts to synthesize human behaviors specifically in the context of communicative acts. [20] implemented a graphical chat environment that automatically generates still poses in comic book format on the basis of typed text. This very successful system relies on conventions often used in chat room conversations (chat acronyms, emoticons) rather than on the linguistic and contextual features of the text itself. And the output of the system depends on our understanding of comic book conventions – as the authors themselves say, "characters pointing and waving, which occur relatively infrequently in real life, come off well in comics."

Synthesis of animated communicative behavior has started from an underlying computation-heavy "intention to communicate" [10], a set of natural language instructions [2], or a state machine specifying whether or not the avatar or human participant was speaking, and the direction of the human participant's gaze [15]. However, starting from an intention to communicate is too computation-heavy, and requires the presence of a linguist on staff. Natural language instructions guide the synthetic human's actions, but not its speech. And, while the state of speech is essential, the content of speech must also be addressed in the assignment of nonverbal behaviors.
In the current paper, we describe a toolkit that automatically suggests appropriate gestures, communicative facial expressions, pauses, and intonational contours for an input text, and also provides the synchronization information required to animate the behaviors in conjunction with a character's speech. This layer of analysis is designed to bridge the gap between systems that specify more natural or more expressive movement contours (such as [14] or [28]) and systems that suggest personality or emotional realms of expression (such as [3] or [29]).

4. SYSTEM
The BEAT system is built to be modular and user extensible, and to operate in real time. To this end, it is written in Java, is based on an input-to-output pipeline approach with support for user-defined filters and knowledge bases, and uses an XML tagging scheme. Processing is decomposed into modules which operate as XML transducers, each taking tagged text as input and producing tagged text as output. XML provides a natural way to represent information which spans intervals of text, and its use facilitates modularity and extensibility. Each module operates by reading in XML-tagged text (initially representing the text of the character's script only), converting it into a parse tree, manipulating the tree, then re-serializing the tree into XML before passing it to the next module. The various knowledge bases used in the system are also encoded in XML so that they can be easily extended for new applications.
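As a rough illustration of this design (not the authors' code; the class and method names below are invented for this sketch), each module can be modeled as a transducer over a DOM tree, and the toolkit as a chain of such transducers:

    import java.io.StringReader;
    import java.io.StringWriter;
    import java.util.List;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.dom.DOMSource;
    import javax.xml.transform.stream.StreamResult;
    import org.w3c.dom.Document;
    import org.xml.sax.InputSource;

    /** One pipeline stage: takes a tagged tree and returns a further-annotated tree. */
    interface XmlTransducer {
        Document process(Document taggedTree) throws Exception;
    }

    final class Pipeline {
        private final List<XmlTransducer> modules;

        Pipeline(List<XmlTransducer> modules) {
            this.modules = modules;
        }

        /** Parse the script, run it through every module in order, and re-serialize. */
        String run(String scriptXml) throws Exception {
            Document tree = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(scriptXml)));
            for (XmlTransducer module : modules) {
                tree = module.process(tree);   // each stage manipulates the parse tree
            }
            StringWriter out = new StringWriter();
            TransformerFactory.newInstance().newTransformer()
                    .transform(new DOMSource(tree), new StreamResult(out));
            return out.toString();
        }
    }

In such a sketch, a language tagger, behavior generator and behavior scheduler would each implement XmlTransducer, so that any one module can be swapped out without touching the rest of the chain.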
An overview of the system is shown in Figure 1. There are three main processing modules: the Language Tagging module, the Behavior Generation module and the Behavior Scheduling module. The stages of XML translation produced by each of these modules are shown in Figure 2. The Behavior Generation module is further divided into a Suggestion module and a Selection module, as our approach to the generation process is to first suggest all plausible behaviors and then use user-modifiable filters to trim them down to a set appropriate for a particular character. In Figure 1, user-definable data structures are indicated with dotted-line boxes. We will now discuss each of these components in turn.

[Figure 1. System architecture overview: the Language Tagging, Behavior Generation and Behavior Scheduling modules, together with the user-definable Knowledge Base, Discourse Model and Word Timing data structures.]

[Figure 2. Stages of XML translation for the utterance "It is some kind of a virtual actor": (a) input to the Language Tagging module; (b) output from the Tagging module / input to the Generation module, with CLAUSE, THEME/RHEME, OBJECT/ACTION and NEW tags; (c) output from the Generation module, with suggested GAZE, TONE, GESTURE and EYEBROWS behaviors; (d) the final animation schedule.]

4.1 Knowledge Base
The behavior generation modules draw on a knowledge base describing the objects and actions in the domain under discussion, as well as the kinds of places where emphasis should be placed. Currently, the knowledge base is stored in two XML files, one describing objects and the other describing actions. These knowledge bases are seeded with descriptions of generic objects and actions but can easily be extended for particular domains to increase the efficacy of nonverbal behavior assignment.

The object knowledge base contains definitions of classes and instances of objects. Figure 3 shows two example entries. The first defines a new object class CHARACTER as a type of person (vs. object or place) with two features: TYPE, describing whether the character is REAL or VIRTUAL, and ROLE, describing the actual profession. Each feature value is also described as being "normal" or "unusual" (e.g., a virtual person would be considered unusual), which is important since people tend to generate iconic gestures for the unusual aspects of objects they describe [34]. Each feature value can also provide a gesture specification which describes the type of hand gesture that should be used to depict it (as described below). The second knowledge base entry defines an object instance and provides values for each feature defined for the class.

    <CLASS NAME="CHARACTER" ISA="PERSON">
      <FEATURE NAME="TYPE">
        <VALUEDESC NAME="REAL" ISNORMAL="TRUE">
        <VALUEDESC NAME="VIRTUAL" ISNORMAL="FALSE"
                   GESTURE="gesture specification goes here">
      </FEATURE>
      <FEATURE NAME="ROLE">
        <VALUEDESC NAME="ACTOR" ISNORMAL="TRUE">
        <VALUEDESC NAME="ANIMATOR" ISNORMAL="TRUE">
      </FEATURE>
      <INSTANCE NAME="PUNK1">
        <VALUE FEATURE="ROLE" VALUE="ACTOR">
        <VALUE FEATURE="TYPE" VALUE="VIRTUAL">
      </INSTANCE>
    </CLASS>

    Figure 3. Example Object Knowledge Base

The action knowledge base contains associations between domain actions and hand gestures which can depict them. An example entry is

    <ACTION NAME="MOVE" GESTURE="R hand=5, moves from CC towards L …">

which simply associates a particular gesture specification with the verb "to move".

As mentioned above, the system comes loaded with a generic knowledge base, containing information about some objects and actions, and some common gestures. Gestures are specified using a compositional notation in which hand shapes and arm trajectories for each arm are specified independently. This makes the addition of new gestures easier, since existing trajectories or hand shapes can be re-used.
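A minimal sketch of how this object knowledge base might be represented and queried in code follows (BEAT itself stores these entries in XML; the Java types and method names here are invented for illustration). The query shown is the one the generation rules need later: find the gesture specification attached to an instance's unusual feature value.

    import java.util.List;
    import java.util.Map;
    import java.util.Optional;

    /** One feature value of an object class, e.g. TYPE=VIRTUAL, flagged normal/unusual. */
    record ValueDesc(String name, boolean isNormal, String gestureSpec) {}

    /** An object class such as CHARACTER, mapping feature names to their possible values. */
    record ObjectClass(String name, Map<String, List<ValueDesc>> features) {}

    /** A concrete instance such as PUNK1, with one chosen value per feature. */
    record ObjectInstance(String id, ObjectClass cls, Map<String, String> values) {

        /** Return the gesture spec attached to the first "unusual" feature value, if any.
         *  This mirrors the rule that iconic gestures tend to depict unusual aspects. */
        Optional<String> unusualFeatureGesture() {
            for (Map.Entry<String, String> chosen : values.entrySet()) {
                for (ValueDesc v : cls.features().getOrDefault(chosen.getKey(), List.of())) {
                    if (v.name().equals(chosen.getValue()) && !v.isNormal()) {
                        return Optional.ofNullable(v.gestureSpec());
                    }
                }
            }
            return Optional.empty();
        }
    }

For the PUNK1 instance of Figure 3, whose TYPE holds the unusual value VIRTUAL, such a query would return that value's gesture specification.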
4.2 Language Tagging
The language module of the toolkit is responsible for annotating input text with the linguistic and contextual information that allows successful nonverbal behavior assignment and scheduling. The toolkit was constructed so that animators need not concern themselves with linguistic analysis; however, in what follows we briefly describe the fundamental units of analysis used in the system. The language module automatically recognizes and tags each of these units in the text typed by the user. It should be noted that much of what is described in this section is similar, or in some places identical, to the kind of tagging that allows TTS systems to produce appropriate intonational contours and phrasing for typed text [17]. Additional annotations are used here, however, to allow not just intonation but also facial display and hand gestures to be generated. And these annotations allow not just generation, but also synchronization and scheduling of multiple nonverbal communicative behaviors with speech.

The largest unit is the UTTERANCE, which is operationalized as an entire paragraph of input. The utterance is broken up into CLAUSEs, each of which is held to represent a proposition. To detect clause boundaries, the tagging module looks for punctuation and the placement of verb phrases.

Clauses are further divided into two smaller units of information structure, a THEME and a RHEME. The former represents the part of the clause that creates a coherent link with a preceding clause, and the latter is the part that contributes some new information to the discussion [16]. For example, in the mini-dialogue "who is he?" "he is a student", the "he is" part of the second clause is that clause's theme and "student" is the rheme. Identifying the rheme is especially important in the current context since gestural activity is usually found within the rheme of an utterance [9]. The language module uses the location of verb phrases within a clause, and information about which words have been seen before in previous clauses, to assign information structure, following the heuristics described in [18].
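The heuristics themselves come from [18]; purely to illustrate the kind of computation involved, the following sketch (our own simplification, not BEAT's actual rule set) splits a clause into theme and rheme using only the "seen before" information, which already reproduces the "he is a student" example above.

    import java.util.List;

    /** A word plus whether its lemma has been seen in an earlier clause. */
    record Word(String text, boolean seenBefore) {}

    record InfoStructure(List<Word> theme, List<Word> rheme) {}

    final class ThemeRhemeTagger {
        static InfoStructure assign(List<Word> clause) {
            // Simplified: the longest prefix of already-seen words links back to the
            // preceding discourse (theme); the remainder carries the new content (rheme).
            // BEAT's actual heuristics [18] also use the location of the verb phrase.
            int split = 0;
            while (split < clause.size() && clause.get(split).seenBefore()) {
                split++;
            }
            return new InfoStructure(clause.subList(0, split),
                                     clause.subList(split, clause.size()));
        }
    }

After "who is he?", the clause "he is a student" yields the theme "he is" and the rheme "a student", matching the example in the text.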
The next-to-smallest unit is the word phrase, which in the current implementation describes either an ACTION or an OBJECT. These two correspond to the grammatical verb phrase and noun phrase, respectively. Actions and objects are linked to entries in the knowledge base whenever possible, as follows. For actions, the language module uses the verb head of the corresponding verb phrase as the key to look up an action description in the action database. If an exact match for that verb is not found, it is sent to an embedded word ontology module (using WordNet [24]), which creates a set of hypernyms, and those are again used to find matching descriptions in the knowledge base. A hypernym of a word is a related, but more generic (or broader), term; in the case of verbs, one can say that a certain verb is a specific way of accomplishing the hypernym of that verb. For example, "walking" is a way of "moving", so the latter is a hypernym of the former. Expanding the search for an action in the action database using hypernyms makes it possible to find and use any descriptions that may be available for a super-class of that action. The database therefore doesn't have to describe all possible actions, but can focus on high-level action categories. When an action description match is found, a description identifier is added to the ACTION tag.
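A sketch of this hypernym-expanded lookup is shown below. WordNet access is hidden behind a generic hypernymsOf function, and the class and method names are ours rather than BEAT's.

    import java.util.List;
    import java.util.Map;
    import java.util.Optional;
    import java.util.function.Function;

    final class ActionLookup {
        private final Map<String, String> actionDb;                 // verb lemma -> action description id
        private final Function<String, List<String>> hypernymsOf;   // e.g. backed by WordNet [24]

        ActionLookup(Map<String, String> actionDb, Function<String, List<String>> hypernymsOf) {
            this.actionDb = actionDb;
            this.hypernymsOf = hypernymsOf;
        }

        /** Try the verb itself first, then progressively more generic hypernyms.
         *  Hypernym hierarchies are acyclic, so the recursion terminates. */
        Optional<String> describe(String verbLemma) {
            if (actionDb.containsKey(verbLemma)) {
                return Optional.of(actionDb.get(verbLemma));
            }
            for (String broader : hypernymsOf.apply(verbLemma)) {
                Optional<String> found = describe(broader);
                if (found.isPresent()) {
                    return found;
                }
            }
            return Optional.empty();
        }
    }

With an action database containing only "move", looking up "walk" would still succeed through the hypernym chain walk -> move, so the generic MOVE gesture description can be reused.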
For objects, the module uses the noun head as well as any accompanying adjectives to find a unique instance of that object in the object database. If it finds a matching instance, it adds the unique identifier of that instance to the OBJECT tag.

The smallest units that the language module handles are the words themselves. The tagger uses the EngLite parser from Conexor (www.conexor.fi) to supply word categories and lemmas for each word. It also keeps track of all previously mentioned words and marks each incoming noun, verb, adverb or adjective as NEW if it has not been seen before. This "word newness" helps to determine which words should be emphasized by the addition of intonation, eyebrow motion or hand gesture [18]. Words can also stand in contrast to other words (for example, "I went to buy red apples but all they had were green ones"), a property often marked with hand gesture and intonation and therefore important to label. The language module currently labels contrasting adjectives by using WordNet to supply information about which words might be synonyms and which might be antonyms of one another [18]. Each word in a contrast pair is tagged with the CONTRAST tag.
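The bookkeeping involved is small; the following sketch (our own illustration, with WordNet antonym queries abstracted behind a predicate) shows one way to track word newness and contrast pairs:

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;
    import java.util.function.BiPredicate;

    final class WordTagger {
        private final Set<String> seenLemmas = new HashSet<>();
        private final Set<String> openClasses = Set.of("NOUN", "VERB", "ADJ", "ADV");
        private final BiPredicate<String, String> areAntonyms;   // e.g. answered via WordNet [24]

        WordTagger(BiPredicate<String, String> areAntonyms) {
            this.areAntonyms = areAntonyms;
        }

        /** Mark a noun/verb/adjective/adverb as NEW the first time its lemma appears. */
        boolean isNew(String lemma, String partOfSpeech) {
            if (!openClasses.contains(partOfSpeech)) {
                return false;
            }
            return seenLemmas.add(lemma);   // add() is true only on first occurrence
        }

        /** Pair up adjectives in the clause that stand in contrast (e.g. red vs. green). */
        List<String[]> contrastPairs(List<String> adjectives) {
            List<String[]> pairs = new ArrayList<>();
            for (int i = 0; i < adjectives.size(); i++) {
                for (int j = i + 1; j < adjectives.size(); j++) {
                    if (areAntonyms.test(adjectives.get(i), adjectives.get(j))) {
                        pairs.add(new String[] { adjectives.get(i), adjectives.get(j) });
                    }
                }
            }
            return pairs;
        }
    }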
In sum, the language tags that are currently implemented are:

• Clause
• Theme and rheme
• Word newness
• Contrast
• Objects and actions
4.3 Behavior Suggestion
The Behavior Suggestion module augments the tagged tree with all plausible nonverbal behaviors (gesture suggestions, eyebrow raises, gaze shifts and intonation), which the selection filters described below then trim down. One intonation strategy suggests accents for new lexical items within the rheme (cf. [17], [18]); the second suggests H* accents for all CONTRAST objects identified by the Tagger, following [30]; and the final intonation strategy simply suggests TTS pauses at CLAUSE boundaries.
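Grounded in the walkthrough in Section 5 (beats, eyebrow raises and accents are suggested for objects that lie inside a rheme and contain new words, and an iconic gesture is suggested when an object has an unusual feature value), a suggestion generator can be sketched as follows. The types, names and numeric priorities are invented for illustration; BEAT's generators operate directly over the XML tree.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Optional;

    /** A simplified stand-in for an OBJECT node in the tagged tree. */
    record TaggedObject(String text, boolean inRheme, boolean containsNewWords,
                        Optional<String> unusualFeatureGesture) {}

    record Suggestion(String behavior, String target, int priority) {}

    final class GestureSuggestionGenerator {
        /** Suggest all plausible behaviors; a later selection pass filters them. */
        List<Suggestion> suggest(List<TaggedObject> objects) {
            List<Suggestion> out = new ArrayList<>();
            for (TaggedObject obj : objects) {
                if (obj.inRheme() && obj.containsNewWords()) {
                    out.add(new Suggestion("BEAT", obj.text(), 1));        // lowest gesture class
                    out.add(new Suggestion("EYEBROWS", obj.text(), 2));
                    out.add(new Suggestion("ACCENT_H*", obj.text(), 2));   // on the new words
                }
                obj.unusualFeatureGesture().ifPresent(spec ->
                    out.add(new Suggestion("ICONIC:" + spec, obj.text(), 3)));
            }
            return out;
        }
    }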
4.4 Behavior Selection
The Behavior Selection module analyzes the tree, which now contains many, potentially incompatible, gesture suggestions, and reduces these suggestions down to the set that will actually be used in the animation. The selection process utilizes an extensible set of filters which are applied to the tree in turn, each of which can delete behavior suggestions that do not meet its criteria. In general, filters can reflect the personalities, affective state and energy level of characters by regulating how much nonverbal behavior they exhibit. Currently, two filter strategies are implemented: conflict resolution and priority threshold.

4.4.1 Conflict Resolution Filter
The conflict resolution filter detects all nonverbal behavior suggestion conflicts (those which physically cannot co-occur) and resolves the conflicts by deleting the suggestions with lower priorities. Conflicts are detected by determining, for each animation degree of freedom, the suggestions which co-occur and require that degree of freedom, even if specified at different levels of the XML tree. For each pair of such conflicting suggestions (in decreasing order of priority), the one with lower priority is deleted unless the two can be co-articulated (e.g., a beat gesture on top of an iconic gesture).

4.4.2 Priority Threshold Filter
The priority threshold filter simply removes all behavior suggestions whose priority falls below a user-specified threshold.
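Reusing the Suggestion record from the sketch above, these two filters might look like the following. This is a simplified illustration: the real conflict resolution filter groups suggestions by animation degree of freedom rather than by target phrase, and spares pairs that can be co-articulated.

    import java.util.Comparator;
    import java.util.List;
    import java.util.stream.Collectors;

    interface SuggestionFilter {
        List<Suggestion> apply(List<Suggestion> suggestions);
    }

    /** Drop every suggestion whose priority falls below a user-chosen threshold. */
    record PriorityThresholdFilter(int threshold) implements SuggestionFilter {
        public List<Suggestion> apply(List<Suggestion> suggestions) {
            return suggestions.stream()
                    .filter(s -> s.priority() >= threshold)
                    .collect(Collectors.toList());
        }
    }

    /** Keep only the highest-priority suggestion among those competing for one target. */
    record ConflictResolutionFilter() implements SuggestionFilter {
        public List<Suggestion> apply(List<Suggestion> suggestions) {
            return suggestions.stream()
                    .collect(Collectors.groupingBy(Suggestion::target))
                    .values().stream()
                    .map(group -> group.stream()
                            .max(Comparator.comparingInt(Suggestion::priority))
                            .orElseThrow())
                    .collect(Collectors.toList());
        }
    }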
4.5 Behavior Scheduling and Animation
The last module in the XML pipeline converts its input tree into a set of instructions which can be executed by an animation system, or edited by an animator prior to rendering. In general, there are two ways to achieve synchronization between a character animation subsystem and a subsystem for producing the character's speech (either through a TTS engine or from recorded audio samples). The first is to obtain estimates of word and phoneme timings and construct an animation schedule prior to execution (see Figure 7). The second approach is to assume the availability of real-time events from a TTS engine (generated while the TTS is actually producing audio) and compile a set of event-triggered rules to govern the generation of the nonverbal behavior. The first approach must be used for recorded-audio-based animation or TTS engines such as Festival [32], while the second must be used with TTS engines such as Microsoft's Whistler [19]. We have used both approaches in our systems, and the current toolkit is capable of producing both kinds of animation schedules, but we will focus our discussion here on absolute-time-based scheduling with a TTS engine such as Festival.

The first step in time-based scheduling is to extract only the text and intonation commands from the XML tree, translate these into a format for the TTS engine, and issue a request for word and phoneme timings. In our implementation, the TTS runs as a separate process, so part of the scheduling can continue while these timings are being computed.

The next step in the scheduling process is to extract all of the (non-intonation) nonverbal behavior suggestions from the tree, translate them into an intermediate form of animation command, and order them by word index into a linear animation proto-schedule.

Once the word and phoneme timings become available, the proto-schedule can be instantiated by mapping the word indices into execution times (relative to the start of the schedule). The schedule can then also be augmented with facial animation commands to lip-sync the phonemes returned from the TTS engine. Figure 8 shows a fragment of an animation schedule at this stage of compilation.

    <VISEME time=0.0 spec="A">
    <GAZE word=1 time=0.0 spec=AWAY_FROM_HEARER>
    <VISEME time=0.24 spec="E">
    <VISEME time=0.314 spec="A">
    <VISEME time=0.364 spec="TH">
    <VISEME time=0.453 spec="E">
    <GAZE word=3 time=0.517 spec=TOWARDS_HEARER>
    <R_GESTURE_START word=3 time=0.517 spec=BEAT>
    <EYEBROWS_START word=3 time=0.517>

    Figure 8. Example Abstract Animation Schedule Fragment
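This instantiation step is essentially a lookup of each command's word index in the word-timing table returned by the TTS engine, followed by a merge with the viseme stream. A minimal sketch (our own illustration, with invented type names) is:

    import java.util.ArrayList;
    import java.util.List;

    /** One abstract command: start a behavior when a given word is spoken. */
    record ProtoCommand(String command, int wordIndex) {}

    /** The same command with an absolute execution time, ready for the animator. */
    record TimedCommand(String command, double timeSeconds) {}

    final class ScheduleInstantiator {
        /**
         * Replace word indices with execution times (relative to the start of the
         * schedule), using the word onset times returned by the TTS engine, then
         * append the viseme timings for lip-sync. Word indices are 1-based, as in
         * the Figure 8 fragment.
         */
        static List<TimedCommand> instantiate(List<ProtoCommand> protoSchedule,
                                              double[] wordOnsetTimes,
                                              List<TimedCommand> visemes) {
            List<TimedCommand> schedule = new ArrayList<>();
            for (ProtoCommand cmd : protoSchedule) {
                schedule.add(new TimedCommand(cmd.command(),
                                              wordOnsetTimes[cmd.wordIndex() - 1]));
            }
            schedule.addAll(visemes);   // lip-sync commands from the phoneme timings
            schedule.sort((a, b) -> Double.compare(a.timeSeconds(), b.timeSeconds()));
            return schedule;
        }
    }

For the schedule in Figure 8, a GAZE command attached to word 3 would receive the onset time 0.517 s returned by the TTS engine.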
The final stage of scheduling involves compiling the abstract animation schedule into a set of legal commands for whichever animation subsystem is being used. This final compilation step has also been modularized in the toolkit. In addition to simply translating commands, it must concern itself with issues such as enabling, initializing and disabling different animation subsystem features; gesture approach, duration and relax times (the abstract schedule specifies only the peak time at the start of a phrase and the end-of-phrase relax time); and any time offsets between the speech production and animation subsystems.

Our current compilation target is a humanoid animation system we have developed called Pantomime [13]. Pantomime animates one or more VRML-defined characters (adhering to the H-ANIM standard [31]) using a variety of motor skill modules, and resolves any remaining conflicts in character degrees of freedom. Pantomime can receive an animation schedule for the character, with the schedule specifying motor skills to be executed at specific times relative to the start of the schedule. Hand and arm commands are treated specially, however, in that complete motions for each hand and arm are computed prior to the start of the schedule. As a result, motions through all specified keyframe positions can be spline-smoothed for more natural-looking behavior. Overlaid onto all commanded motion is a tailorable amount of Perlin noise on each character joint [28], and idle motor skills (such as eye blinking) provide a more life-like character. Pantomime renders the final set of character joint angles using OpenInventor.
4.6 Extensibility
As described in the introduction, BEAT has been designed to fit into a number of existing animation systems, or to exist as a layer between lower-level expressive features of motion and higher-level specification of personality or emotion. It has also been designed to be extensible in several significant ways. First, new entries can easily be made in the knowledge base to add new hand gestures corresponding to domain object features and actions. Second, the range of nonverbal behaviors, and the strategies for generating them, can easily be modified by defining new behavior suggestion generators. Behavior suggestion filters can also be tailored to the behavior of a particular character in a particular situation, or to a particular animator's style. Animation module compilers can be swapped in for different target animation subsystems. Finally, entire modules can easily be re-implemented (for example, as new techniques for text analysis become available) simply by adhering to the XML interfaces.

One additional kind of flexibility derives from the ability to override the output from any of the modules simply by including appropriate tags in the original text input. For example, an animator could force a character to raise its eyebrows on a particular word simply by wrapping the relevant EYEBROWS tag around the word in question; this tag will be passed through the Tagger, Generation and Selection modules and compiled into the appropriate animation commands by the Scheduler.
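For instance, a script with such a hand-written override might look like the following (the tag syntax here is illustrative rather than BEAT's exact schema):

    final class OverrideExample {
        /** The hand-written EYEBROWS tag survives tagging, generation and selection
         *  untouched and is compiled into an eyebrow-raise command by the Scheduler. */
        static String scriptWithForcedEyebrowRaise() {
            return "<UTTERANCE>It is some kind of a"
                 + " <EYEBROWS>virtual</EYEBROWS> actor.</UTTERANCE>";
        }
    }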
5. EXAMPLE ANIMATION
To demonstrate how the system works, in this section we walk through a couple of example utterances. The full animated example can be found on the accompanying video tape.

As a first example, we trace what happens when BEAT receives as input the two consecutive sentences "It is some kind of a virtual actor" and "You just have to type in some text, and the actor is able to talk and move by itself". Let's look at each sentence in turn.

The language tagging module processes the input first and generates an XML tree tagged with relevant language information, as described in section 4.2. The output of the language tagger is shown in Figure 2b. Of particular interest in Sentence 1 is the classification of "a virtual actor" as an object and the ability of the system to give it the unique identifier PUNK1. This is because, when looking for the object in the knowledge base, the module found, under the user-defined class CHARACTER, an instance of an ACTOR that is in fact of the virtual type; this was the only instance matching on this attribute, so the instance name PUNK1 was copied into the value of ID in the object tag.

When the behavior generator receives the XML tree from the language tagger, it applies generator rules to annotate the tree with appropriate behaviors, as described in section 4.3. Beats are suggested for the object "some kind of" and the object "a virtual actor" (previously identified as PUNK1) because these objects are inside a rheme and contain new words. Eyebrow raising is also suggested for these same objects, and intonational accents are suggested for all the new lexical items (words) contained in those two objects (i.e., "kind", "virtual" and "actor"). Eye gaze behavior and intonational boundary tones are suggested based on the division into theme and rheme. Of particular interest is the suggestion for an iconic gesture to accompany PUNK1. This suggestion was generated because, upon examining the database entry for PUNK1, the generator found that one of its attributes, namely the type, did not hold a value within a typical range. That is, the value 'virtual' was not considered a typical actor type. The form suggested for the gesture is retrieved from the database entry for the value 'virtual'; in this way the gesture highlights the surprising feature of the object.

When the behavior selection module receives the suggestions from the generator module, it notices that both a beat and an iconic gesture were suggested for PUNK1. Using the rule of gesture class priority (beats being the lowest class in the gesture family), the module filters out the beat and leaves in the iconic. No further conflicts are noticed, and no further filters have been included in this example. The resulting tree is shown in Figure 2c.

Lastly, the behavior scheduling module compiles the XML tree, including all suggestions not filtered out, into an action plan ready for execution by an animation engine, as described in section 4.5. The final schedule (without viseme codes) is shown in Figure 2d.

The second sentence is processed in much the same way. Part of the output of the behavior generator is shown in Figure 9. Two particular situations that arise with this sentence are of note. The first is that the action "to type in" is identified by the language module because an action description for typing is found in the action database; the gesture suggestion module can therefore suggest the use of an iconic gesture description, because the action occurs within a rheme. See Figure 10 for a snapshot of the generated "typing" gesture. The second is that although PUNK1 ("the actor") was identified again, no gesture was suggested for this object at this time because it is located inside a theme, as opposed to a rheme, part of the clause.

[Figure 9. Part of the output XML tree for the first example ("You just have to type in some text and the actor …"), showing SPEECH PAUSEs, GAZE AWAY/TOWARDS shifts, boundary tones (L-H%, L-L%), an ICONIC and a BEAT gesture suggestion, EYEBROWS, and H* accents.]

As an example of a different kind of nonverbal behavior assignment, let's look at how the system processes the sentence "Are you a good witch or a bad witch?". The output of the behavior generation module is shown in Figure 11. As well as suggesting the typical behaviors seen in the previous examples, here the language tagger has identified two contrasting adjectives in the same clause, "good" and "bad", and has assigned them to the same contrast group. When the gesture suggestion module receives the tagged text, generation rules suggest a contrast gesture on the "a good witch" object and on the "a bad witch" object. Furthermore, the shape suggested for these contrast gestures is a right-hand pose for the first object and a left-hand pose for the second object, since there are exactly two members of this contrast group. When filtering, the gesture selection module notices that the contrasting gestures were scheduled to peak at exactly the same moment as a couple of hand beats. The beats are filtered out using the gesture class priority rule, deciding that contrasting gestures are more important than beats. See Figure 12 for a snapshot of the contrast gesture.
6. CONCLUSIONS AND FUTURE WORK
The nonverbal behavior generated in this way is adequate for many purposes. Certainly, this kind of automated specification improves over the hand-animated associations between language and nonverbal behavior used in many current web-based agents or other autonomous systems. It also provides a first pass at the desired behaviors in those cases where manual improvement can follow up. The system is meant to suggest a baseline that, without any tweaking, will at least appear plausible, but it invites the input of an animator at any stage to affect the final output.

Future work includes more extensive automatic linguistic tagging and additional inferencing, relying further on WordNet or even on a database of common-sense knowledge such as Cyc [21]. In addition, further work is needed on the notion of a gesture ontology, including some basic spatial-configuration gesture elements. As it stands, hand gestures cannot be assembled out of smaller gestural parts, nor can they be shortened. When gesture descriptions are read from the knowledge base, they are currently placed in the animation schedule unchanged. The Behavior Scheduler makes sure the stroke of the gesture aligns with the correct word, but does not attempt to stretch out the rest of the gesture, for instance to span a whole phrase that needs to be illustrated. Similarly, it does not attempt to slow down or pause speech to accommodate a complex gesture, a phenomenon observed in people. Finally, additional nonverbal behaviors should be added: wrinkles of the forehead, smiles, ear wiggling. The system would also benefit from a visual interface that displays a manipulable timeline where either the scheduled events themselves can be moved around or the rules behind them modified.

In the meantime, we hope to have demonstrated that the animator's toolbox can be enhanced by the knowledge about gesture and other nonverbal behaviors, turn-taking, and linguistic structure that is incorporated and (literally) embodied in the Behavior Expression Animation Toolkit.
7. REFERENCES
[1] Amaya, K., Bruderlin, A., and Calvert, T., Emotion from Motion. Proc. Graphics Interface '96, pp. 222-229, 1996.
[2] Badler, N., Bindiganavale, R., Allbeck, J., Schuler, W., Zhao, L., and Palmer, M., Parameterized Action Representation for Virtual Human Agents, in Embodied Conversational Agents, J. Cassell, J. Sullivan, S. Prevost, and E. Churchill, Eds. Cambridge, MA: MIT Press, pp. 256-284, 2000.
[3] Becheiraz, P. and Thalmann, D., A Behavioral Animation System for Autonomous Actors Personified by Emotions. Proc. 1st Workshop on Embodied Conversational Characters, pp. 57-65, 1998.
[4] Blumberg, B. and Galyean, T. A., Multi-Level Direction of Autonomous Creatures for Real-Time Virtual Environments. Proc. SIGGRAPH '95, pp. 47-54, Los Angeles, CA, 1995.
[5] Bodenheimer, B., Rose, C., and Cohen, M., Verbs and Adverbs: Multidimensional Motion Interpolation, IEEE Computer Graphics and Applications, vol. 18 (5), pp. 32-40, 1998.
[6] Brand, M., Voice Puppetry. Proc. SIGGRAPH '99, pp. 21-28, Los Angeles, CA, 1999.
[7] Bregler, C., Covell, M., and Slaney, M., Video Rewrite: Driving Visual Speech with Audio. Proc. SIGGRAPH '97, pp. 353-360, Los Angeles, CA, 1997.
[8] Calvert, T., Composition of Realistic Animation Sequences for Multiple Human Figures, in Making Them Move: Mechanics, Control, and Animation of Articulated Figures, N. Badler, B. Barsky, and D. Zeltzer, Eds. San Mateo, CA: Morgan Kaufmann, pp. 35-50, 1991.
[9] Cassell, J., Nudge Nudge Wink Wink: Elements of Face-to-Face Conversation for Embodied Conversational Agents, in Embodied Conversational Agents, J. Cassell, J. Sullivan, S. Prevost, and E. Churchill, Eds. Cambridge, MA: MIT Press, pp. 1-27, 2000.
[10] Cassell, J., Pelachaud, C., Badler, N., Steedman, M., Achorn, B., Becket, T., Douville, B., Prevost, S., and Stone, M., Animated Conversation: Rule-Based Generation of Facial Expression, Gesture and Spoken Intonation for Multiple Conversational Agents. Proc. SIGGRAPH '94, pp. 413-420, Orlando, FL, 1994.
[11] Cassell, J. and Prevost, S., Distribution of Semantic Features Across Speech and Gesture by Humans and Computers. Proc. Workshop on the Integration of Gesture in Language and Speech, pp. 253-270, Newark, DE, 1996.
[12] Cassell, J., Torres, O., and Prevost, S., Turn Taking vs. Discourse Structure: How Best to Model Multimodal Conversation, in Machine Conversations, Y. Wilks, Ed. The Hague: Kluwer, pp. 143-154, 1999.
[13] Chang, J., Action Scheduling in Humanoid Conversational Agents, M.S. Thesis in Electrical Engineering and Computer Science. Cambridge, MA: MIT, 1998.
[14] Chi, D., Costa, M., Zhao, L., and Badler, N., The EMOTE Model for Effort and Shape. Proc. SIGGRAPH '00, pp. 173-182, New Orleans, LA, 2000.
[15] Colburn, A., Cohen, M. F., and Drucker, S., The Role of Eye Gaze in Avatar Mediated Conversational Interfaces, MSR-TR-2000-81. Microsoft Research, 2000.
[16] Halliday, M. A. K., Explorations in the Functions of Language. London: Edward Arnold, 1973.
[17] Hirschberg, J., Accent and Discourse Context: Assigning Pitch Accent in Synthetic Speech. Proc. AAAI '90, pp. 952-957, 1990.
[18] Hiyakumoto, L., Prevost, S., and Cassell, J., Semantic and Discourse Information for Text-to-Speech Intonation. Proc. ACL Workshop on Concept-to-Speech Generation, Madrid, 1997.
[19] Huang, X., Acero, A., Adcock, J., Hon, H.-W., Goldsmith, J., Liu, J., and Plumpe, M., Whistler: A Trainable Text-to-Speech System. Proc. 4th Int'l Conf. on Spoken Language Processing (ICSLP '96), pp. 2387-2390, Piscataway, NJ, 1996.
[20] Kurlander, D., Skelly, T., and Salesin, D., Comic Chat. Proc. SIGGRAPH '96, pp. 225-236, 1996.
[21] Lenat, D. B. and Guha, R. V., Building Large Knowledge-Based Systems: Representation and Inference in the Cyc Project. Reading, MA: Addison-Wesley, 1990.
[22] Massaro, D. W., Perceiving Talking Faces: From Speech Perception to a Behavioral Principle. Cambridge, MA: MIT Press, 1987.
[23] McNeill, D., Hand and Mind: What Gestures Reveal about Thought. Chicago, IL: The University of Chicago Press, 1992.
[24] Miller, G. A., Beckwith, R., Fellbaum, C., Gross, D., and Miller, K., Introduction to WordNet: An On-line Lexical Database, 1993.
[25] Nagao, K. and Takeuchi, A., Speech Dialogue with Facial Displays: Multimodal Human-Computer Conversation. Proc. ACL-94, pp. 102-109, 1994.
[26] Pearce, A., Wyvill, B., Wyvill, G., and Hill, D., Speech and Expression: A Computer Solution to Face Animation. Proc. Graphics Interface, pp. 136-140, 1986.
[27] Pelachaud, C., Badler, N., and Steedman, M., Generating Facial Expressions for Speech, Cognitive Science, vol. 20 (1), pp. 1-46, 1994.
[28] Perlin, K., Noise, Hypertexture, Antialiasing and Gesture, in Texturing and Modeling: A Procedural Approach, D. Ebert, Ed. Cambridge, MA: AP Professional, 1994.
[29] Perlin, K. and Goldberg, A., Improv: A System for Scripting Interactive Actors in Virtual Worlds. Proc. SIGGRAPH '96, pp. 205-216, 1996.
[30] Prevost, S. and Steedman, M., Specifying Intonation from Context for Speech Synthesis, Speech Communication, vol. 15, pp. 139-153, 1994.
[31] Roehl, B., Specification for a Standard Humanoid, Version 1.1, H-ANIM Working Group, Ed. http://ece.uwaterloo.ca/~h-anim/spec1.1/, 1999.
[32] Taylor, P., Black, A., and Caley, R., The Architecture of the Festival Speech Synthesis System. Proc. 3rd ESCA Workshop on Speech Synthesis, pp. 147-151, Jenolan Caves, Australia, 1998.
[33] Waters, K. and Levergood, T., An Automatic Lip-Synchronization Algorithm for Synthetic Faces. Proc. 2nd ACM International Conference on Multimedia, pp. 149-156, San Francisco, CA, 1994.
[34] Yan, H., Paired Speech and Gesture Generation in Embodied Conversational Agents, M.S. Thesis in the Media Lab. Cambridge, MA: MIT, 2000.