CHAPTER 7
Assessing Speaking
From a pragmatic view of language performance, listening and speaking are almost always closely interrelated. While it is possible to isolate some listening performance types (see Chapter 6), it is very difficult to isolate oral production tasks that do not directly involve the interaction of aural comprehension. Only in limited contexts of speaking (monologues, speeches, telling a story, and reading aloud) can we assess oral language without the aural participation of an interlocutor.
While speaking is a productive skill that can be directly and empirically observed, those observations are invariably colored by the accuracy and effectiveness of a test-taker's listening skill, which necessarily compromises the reliability and validity of an oral production test. How do you know for certain that a speaking score is exclusively a measure of oral production without the potentially frequent clarifications of an interlocutor? This interaction of speaking and listening challenges the designer of an oral production test to tease apart, as much as possible, the factors accounted for by aural intake.
Another challenge is the design of elicitation techniques. Because most speaking is the product of creative construction of linguistic strings, the speaker makes choices of lexicon, structure, and discourse. If your goal is to have test-takers demonstrate certain spoken grammatical categories, for example, the stimulus you design must elicit those grammatical categories in ways that prohibit the test-taker from avoiding or paraphrasing and thereby dodging production of the target form.
As tasks become more and more open-ended, the freedom of choice given to test-takers creates a challenge in scoring procedures. In receptive performance, the elicitation stimulus can be structured to anticipate predetermined responses and only those responses. In productive performance, the oral or written stimulus must be specific enough to elicit output within an expected range of performance such that scoring or rating procedures apply appropriately. For example, in a picture-series task, the objective of which is to elicit a story in a sequence of events, test-takers could opt for a variety of plausible ways to tell the story, all of which might be equally accurate. How can such disparate responses be evaluated? One solution is to assign not one but several scores for each response, each score representing one of several traits (pronunciation, fluency, vocabulary use, grammar, comprehensibility, etc.).
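The multiple-score idea can be made concrete in a short sketch. Everything here is illustrative: the trait names follow the list above, but the 0-5 point range, the function name `score_response`, and the equal-weight composite are assumptions, not part of any published rubric.

```python
# One score per trait for a single spoken response, plus a simple
# composite. Trait names follow the text; the 0-5 range and equal
# weighting are illustrative assumptions.
TRAITS = ("pronunciation", "fluency", "vocabulary", "grammar",
          "comprehensibility")

def score_response(ratings):
    """Check that every trait is rated 0-5; return the profile and composite."""
    missing = [t for t in TRAITS if t not in ratings]
    if missing:
        raise ValueError(f"missing trait scores: {missing}")
    for trait in TRAITS:
        if not 0 <= ratings[trait] <= 5:
            raise ValueError(f"{trait} score {ratings[trait]} not in 0-5")
    composite = sum(ratings[t] for t in TRAITS) / len(TRAITS)
    return {"profile": {t: ratings[t] for t in TRAITS},
            "composite": composite}

result = score_response({"pronunciation": 4, "fluency": 3, "vocabulary": 4,
                         "grammar": 3, "comprehensibility": 5})
```

Reporting the per-trait profile alongside the composite preserves the diagnostic information that a single overall number would hide.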
All of these issues will be addressed in this chapter as we review types of
spoken language and micro- and macroskills of speaking, then outline
numerous tasks for assessing speaking.
BASIC TYPES OF SPEAKING
In Chapter 6, we cited four categories of listening performance assessment
tasks. A similar taxonomy emerges for oral production.
1. Imitative. At one end of a continuum of types of speaking performance is the ability to simply parrot back (imitate) a word or phrase or possibly a sentence. While this is a purely phonetic level of oral production, a number of prosodic, lexical, and grammatical properties of language may be included in the criterion performance. We are interested only in what is traditionally labeled "pronunciation"; no inferences are made about the test-taker's ability to understand or convey meaning or to participate in an interactive conversation. The only role of listening here is in the short-term storage of a prompt, just long enough to allow the speaker to retain the short stretch of language that must be imitated.
2. Intensive. A second type of speaking frequently employed in assessment contexts is the production of short stretches of oral language designed to demonstrate competence in a narrow band of grammatical, phrasal, lexical, or phonological relationships (such as prosodic elements—intonation, stress, rhythm, juncture). The speaker must be aware of semantic properties in order to be able to respond, but interaction with an interlocutor or test administrator is minimal at best. Examples of intensive assessment tasks include directed response tasks, reading aloud, sentence and dialogue completion, limited picture-cued tasks (including simple sequences), and translation up to the simple sentence level.
3. Responsive. Responsive assessment tasks include interaction and test comprehension but at the somewhat limited level of very short conversations, standard greetings and small talk, simple requests and comments, and the like. The stimulus is almost always a spoken prompt (in order to preserve authenticity), with perhaps only one or two follow-up questions or retorts:
   A. Mary: Excuse me, do you have the time?
      Doug: Yeah. Nine-fifteen.
   B. T: What is the most urgent environmental problem today?
      S: I would say massive deforestation.
   C. Jeff: Hey, Stef, how's it going?
      Stef: Not bad, and yourself?
      Jeff: I'm good.
      Stef: Cool. Okay, gotta go.
4. Interactive. The difference between responsive and interactive speaking is in the length and complexity of the interaction, which sometimes includes multiple exchanges and/or multiple participants. Interaction can take the two forms of transactional language, which has the purpose of exchanging specific information, or interpersonal exchanges, which have the purpose of maintaining social relationships. (In the three dialogues cited above, A and B were transactional, and C was interpersonal.) In interpersonal exchanges, oral production can become pragmatically complex with the need to speak in a casual register and use colloquial language, ellipsis, slang, humor, and other sociolinguistic conventions.
5. Extensive (monologue). Extensive oral production tasks include speeches, oral presentations, and storytelling, during which the opportunity for oral interaction from listeners is either highly limited (perhaps to nonverbal responses) or ruled out altogether. Language style is frequently more deliberative (planning is involved) and formal for extensive tasks, but we cannot rule out certain informal monologues such as casually delivered speech (for example, my vacation in the mountains, a recipe for outstanding pasta primavera, recounting the plot of a novel or movie).
MICRO- AND MACROSKILLS OF SPEAKING
In Chapter 6, a list of listening micro- and macroskills enumerated the various components of listening that make up criteria for assessment. A similar list of speaking skills can be drawn up for the same purpose: to serve as a taxonomy of skills from which you will select one or several that will become the objective(s) of an assessment task. The microskills refer to producing the smaller chunks of language such as phonemes, morphemes, words, collocations, and phrasal units. The macroskills imply the speaker's focus on the larger elements: fluency, discourse, function, style, cohesion, nonverbal communication, and strategic options. The micro- and macroskills total roughly 16 different objectives to assess in speaking.
Micro- and macroskills of oral production
Microskills
1. Produce differences among English phonemes and allophonic variants.
2. Produce chunks of language of different lengths.
3. Produce English stress patterns, words in stressed and unstressed
   positions, rhythmic structure, and intonation contours.
4. Produce reduced forms of words and phrases.
5. Use an adequate number of lexical units (words) to accomplish pragmatic
   purposes.
6. Produce fluent speech at different rates of delivery.
7. Monitor one's own oral production and use various strategic devices—
   pauses, fillers, self-corrections, backtracking—to enhance the clarity of the
   message.
8. Use grammatical word classes (nouns, verbs, etc.), systems (e.g., tense,
   agreement, pluralization), word order, patterns, rules, and elliptical forms.
9. Produce speech in natural constituents: in appropriate phrases, pause
   groups, breath groups, and sentence constituents.
10. Express a particular meaning in different grammatical forms.
11. Use cohesive devices in spoken discourse.
Macroskills
12. Appropriately accomplish communicative functions according to situations,
    participants, and goals.
13. Use appropriate styles, registers, implicature, redundancies, pragmatic
    conventions, conversation rules, floor-keeping and -yielding, interrupting,
    and other sociolinguistic features in face-to-face conversations.
14. Convey links and connections between events and communicate such
    relations as focal and peripheral ideas, events and feelings, new
    information and given information, generalization and exemplification.
15. Convey facial features, kinesics, body language, and other nonverbal cues
    along with verbal language.
16. Develop and use a battery of speaking strategies, such as emphasizing
    key words, rephrasing, providing a context for interpreting the meaning of
    words, appealing for help, and accurately assessing how well your
    interlocutor is understanding you.
As you consider designing tasks for assessing spoken language, these skills can act as a checklist of objectives. While the macroskills have the appearance of being more complex than the microskills, both contain ingredients of difficulty, depending on the stage and context of the test-taker.
There is such an array of oral production tasks that a complete treatment is almost impossible within the confines of one chapter in this book. Below is a consideration of the most common techniques, with brief allusions to related tasks. As already noted in the introduction to this chapter, consider three important issues as you set out to design tasks:
1. No speaking task is capable of isolating the single skill of oral production.
  Concurrent involvement of the additional performance of aural
  comprehension, and possibly reading, is usually necessary.
2. Eliciting the specific criterion you have designated for a task can be tricky
  because beyond the word level, spoken language offers a number of
  productive options to test-takers. Make sure your elicitation prompt
  achieves its aims as closely as possible.
3. Because of the above two characteristics of oral production assessment, it
  is important to carefully specify scoring procedures for a response so that
  ultimately you achieve as high a reliability index as possible.
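One simple way to monitor the reliability mentioned in point 3 is to compare two raters' scores on the same set of responses. The sketch below computes an exact-agreement proportion, which is only a rough index; the function name and the rating data are invented for illustration.

```python
# Exact-agreement proportion between two raters on the same responses:
# one rough check on scoring reliability. Names and data are invented.
def exact_agreement(rater_a, rater_b):
    """Fraction of responses given the identical score by both raters."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty, equal-length score lists")
    matches = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
    return matches / len(rater_a)

# Ten responses scored 2/1/0 by two hypothetical raters:
rater_1 = [2, 1, 2, 0, 2, 1, 1, 2, 0, 2]
rater_2 = [2, 1, 2, 1, 2, 1, 0, 2, 0, 2]
```

If agreement is low, the remedy is usually a tighter scoring rubric or rater training, not a different statistic.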
DESIGNING ASSESSMENT TASKS: IMITATIVE SPEAKING
You may be surprised to see the inclusion of simple phonological imitation in a consideration of assessment of oral production. After all, endless repeating of words, phrases, and sentences was the province of the long-since-discarded Audiolingual Method, and in an era of communicative language teaching, many believe that non-meaningful imitation of sounds is fruitless. Such opinions have faded in recent years as we discovered that an overemphasis on fluency can sometimes lead to the decline of accuracy in speech. And so we have been paying more attention to pronunciation, especially suprasegmentals, in an attempt to help learners be more comprehensible.
An occasional phonologically focused repetition task is warranted as long as repetition tasks are not allowed to occupy a dominant role in an overall oral production assessment, and as long as you artfully avoid a negative washback effect. Such tasks range from word level to sentence level, usually with each item focusing on a specific phonological criterion. In a simple repetition task, test-takers repeat the stimulus, whether it is a pair of words, a sentence, or perhaps a question (to test for intonation production).
    Word repetition task
       Test-takers hear:     Repeat after me:
                             beat [pause] bit [pause]
                             bat [pause] vat [pause]               etc.
                             I bought a boat yesterday.
                             The glow of the candle is growing.    etc.
                             When did they go on vacation?
                             Do you like coffee?                   etc.
       Test-takers repeat the stimulus.
A variation on such a task prompts test-takers with a brief written stimulus that they are to read aloud. (In the section below on intensive speaking, some tasks are described in which test-takers read aloud longer texts.) Scoring specifications must be clear in order to avoid reliability breakdowns. A common form of scoring simply indicates a two- or three-point system for each response.
Scoring scale for repetition tasks
2     acceptable pronunciation
1     comprehensible, partially correct pronunciation
0     silence, seriously incorrect pronunciation
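As a rough illustration, the 2/1/0 scale above can be applied mechanically across a set of repetition items and totaled. The function name and the six invented item scores are hypothetical.

```python
# Totaling the 2/1/0 repetition scale over a set of items. The scale
# labels come from the box above; the item scores are invented.
SCALE = {
    2: "acceptable pronunciation",
    1: "comprehensible, partially correct pronunciation",
    0: "silence, seriously incorrect pronunciation",
}

def total_score(item_scores):
    """Return (points earned, maximum possible) for a list of item scores."""
    if any(s not in SCALE for s in item_scores):
        raise ValueError("every item score must be 2, 1, or 0")
    return sum(item_scores), 2 * len(item_scores)

earned, maximum = total_score([2, 1, 2, 2, 0, 1])  # six repetition items
```

Reporting earned points against the maximum keeps scores comparable across forms with different numbers of items.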
The longer the stretch of language, the more possibility for error and therefore the more difficult it becomes to assign a point system to the text. In such a case, it may be imperative to score only the criterion of the task. For example, in the sentence "When did they go on vacation?" since the criterion is falling intonation for wh-questions, points should be awarded regardless of any mispronunciation.
PHONEPASS® TEST
An example of a popular test that uses imitative (as well as intensive) production tasks is PhonePass, a widely used, commercially available speaking test in many countries. Among a number of speaking tasks on the test, repetition of sentences (of 8 to 12 words) occupies a prominent role. It is remarkable that research on the PhonePass test has supported the construct validity of its repetition tasks not just for a test-taker's phonological ability but also for discourse and overall oral production ability (Townshend et al., 1998; Bernstein et al., 2000; Cascallar & Bernstein, 2000).
The PhonePass test elicits computer-assisted oral production over a telephone. Test-takers read aloud, repeat sentences, say words, and answer questions. With a downloadable test sheet as a reference, test-takers are directed to telephone a designated number and listen for directions. The test has five sections.
PhonePass® test specifications
Part A:
Test-takers read aloud selected sentences from among those printed on the test sheet. Examples:
1.    Traffic is a huge problem in Southern California.
2.    The endless city has no coherent mass transit system.
3.    Sharing rides was going to be the solution to rush-hour traffic.
4.    Most people still want to drive their own cars, though.
Part B:
Test-takers repeat sentences dictated over the phone. Example: "Leave town on the next train."
Part C:
Test-takers answer questions with a single word or a short phrase of two or three words. Example: "Would you get water from a bottle or a newspaper?"
Part D:
Test-takers hear three word groups in random order and must link them in a correctly ordered sentence. Example: was reading / my mother / a magazine.
Part E:
Test-takers have 30 seconds to talk about their opinion about some topic that is dictated over the phone. Topics center on family, preferences, and choices.
Scores for the PhonePass test are calculated by a computerized scoring template and reported back to the test-taker within minutes. Six scores are given: an overall score between 20 and 80 and five subscores on the same scale that rate pronunciation, reading fluency, repeat accuracy, repeat fluency, and listening vocabulary.
The tasks on Parts A and B of the PhonePass test do not extend beyond the level of oral reading and imitation. Parts C and D represent intensive speaking (see the next section in this chapter). Section E is used only for experimental data-gathering and does not figure into the scoring. The scoring procedure has been validated against human scoring with extraordinarily high reliabilities and correlation statistics (.94 overall). Further, this ten-minute test correlates with the elaborate Oral Proficiency Interview (OPI, described later in this chapter) at .75, indicating a reasonably high degree of correspondence between the machine-scored PhonePass and the human-scored OPI (Bernstein et al., 2000).
The PhonePass findings could signal an increase in the future use of repetition and read-aloud procedures for the assessment of oral production. Because a test-taker's output is completely controlled, scoring using speech-recognition technology becomes achievable and practical. As researchers uncover the constructs underlying both repetition/read-aloud tasks and oral production in all its complexities, we will have access to more comprehensive explanations of why such simple tasks appear to be reliable and valid indicators of very complex oral production proficiency. Here are some details on the PhonePass test.
Producer:          Ordinate Corporation, Menlo Park, CA
Objective:         To test oral production skills of non-native English speakers
Primary market:    Worldwide, primarily in workplace settings where employees require a comprehensible command of spoken English; secondarily in academic settings for placement and evaluation of students
Type:              Computer-assisted, telephone-operated, with a test sheet
Response modes:    Oral, mostly repetition tasks
Specifications:    (see above)
Time allocation:   Ten minutes
Internet access:   www.ordinate.com
DESIGNING ASSESSMENT TASKS: INTENSIVE SPEAKING
At the intensive level, test-takers are prompted to produce short stretches of discourse (no more than a sentence) through which they demonstrate linguistic ability at a specified level of language. Many tasks are "cued" tasks in that they lead the test-taker into a narrow band of possibilities.
Parts C and D of the PhonePass test fulfill the criteria of intensive tasks as they elicit certain expected forms of language. Antonyms like high and low, happy and sad are prompted so that the automated scoring mechanism anticipates only one word. The either/or task of Part D fulfills the same criterion. Intensive tasks may also be described as limited response tasks (Madsen, 1983), mechanical tasks (Underhill, 1987), or what classroom pedagogy would label as controlled responses.
Directed Response Tasks
In this type of task, the test administrator elicits a particular grammatical form or a transformation of a sentence. Such tasks are clearly mechanical and not communicative, but they do require minimal processing of meaning in order to produce the correct grammatical output.
Directed response
Test-takers hear:
Tell me he went home.
Tell me that you like rock music.
Tell me that you aren't interested in tennis.
Tell him to come to my office at noon.
Remind him what time it is.
Read-Aloud Tasks
Intensive reading-aloud tasks include reading beyond the sentence level up to a paragraph or two. This technique is easily administered by selecting a passage that incorporates test specs and by recording the test-taker's output; the scoring is relatively easy because all of the test-taker's oral production is controlled. Because of the results of research on the PhonePass test, reading aloud may actually be a surprisingly strong indicator of overall oral production ability.
For many decades, foreign language programs have used reading passages to analyze oral production. Prator's (1972) Manual of American English Pronunciation included a "diagnostic passage" of about 150 words that students could read aloud into a tape recorder. Teachers listening to the recording would then rate students on a number of phonological factors (vowels, diphthongs, consonants, consonant clusters, stress, and intonation) by completing a two-page diagnostic checklist on which all errors or questionable items were noted. These checklists ostensibly offered direction to the teacher for emphases in the course to come.
An earlier form of the Test of Spoken English (TSE®; see below) incorporated one read-aloud passage of about 120 to 130 words with a rating scale for pronunciation and fluency. The following passage is typical:
Read-aloud stimulus, paragraph length
Despite the decrease in size—and, some would say, quality—of our cultural
world, there still remain strong differences between the usual British and
American writing styles. The question is, how do you get your message
across? English prose conveys its most novel ideas as if they were timeless
truths, while American writing exaggerates; if you believe half of what is said,
that's enough. The former uses understatement; the latter, overstatement.
There are also disadvantages to each characteristic approach. Readers who
are used to being screamed at may not listen when someone chooses to
whisper politely. At the same time, the individual who is used to a quiet
manner may reject a series of loud imperatives.
The scoring scale for this passage provided a four-point scale for pronunciation and for fluency, as shown in the box below.
Test of Spoken English scoring scale (1987, p. 10)
Pronunciation:
Points:
0.0-0.4       Frequent phonemic errors and foreign stress and intonation
patterns that cause the speaker to be unintelligible.
0.5-1.4       Frequent phonemic errors and foreign stress and intonation
patterns that cause the speaker to be occasionally unintelligible.
1.5-2.4       Some consistent phonemic errors and foreign stress and
intonation patterns, but the speaker is intelligible.
2.5-3.0       Occasional non-native pronunciation errors, but the speaker is
always intelligible.
Fluency:
Points:
0.0-0.4       Speech is so halting and fragmentary or has such a non-native
flow that intelligibility is virtually impossible.
0.5-1.4       Numerous non-native pauses and/or a non-native flow that
interferes with intelligibility.
1.5-2.4       Some non-native pauses but with a more nearly native flow so
that the pauses do not interfere with intelligibility.
2.5-3.0       Speech is smooth and effortless, closely approximating that of a
native speaker.
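A band scale like the one above is easy to encode as a lookup. The sketch below maps a pronunciation rating to its band; the short labels are my own paraphrases of the full descriptors, and ratings are assumed to be reported in tenths, so a value falling between bands (such as 0.45) raises an error rather than being silently assigned.

```python
# Lookup from a TSE-style pronunciation rating (0.0-3.0) to its band.
# Boundaries follow the scale above; the short labels paraphrase the
# full descriptors and are not official wording.
PRONUNCIATION_BANDS = [
    (0.0, 0.4, "frequently unintelligible"),
    (0.5, 1.4, "occasionally unintelligible"),
    (1.5, 2.4, "consistent errors but intelligible"),
    (2.5, 3.0, "occasional errors, always intelligible"),
]

def band_for(score):
    """Return the band label for a rating reported in tenths."""
    for low, high, label in PRONUNCIATION_BANDS:
        if low <= score <= high:
            return label
    raise ValueError(f"rating {score} is outside the 0.0-3.0 scale")
```

Encoding the bands this way also makes the scale's gaps and boundary cases explicit, which is exactly where rater disagreement tends to occur.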
Such a rating list does not indicate how to gauge intelligibility, which is mentioned in both lists. Such slippery terms remind us that oral production scoring, even with the controls that reading aloud offers, is still an inexact science.
Underhill (1987, pp. 77-78) suggested some variations on the task of simply reading a short passage:
    • reading a scripted dialogue, with someone else reading the other part
    • reading sentences containing minimal pairs, for example:
      Try not to heat/hit the pan too much.
      The doctor gave me a bill/pill.
    • reading information from a table or chart
If reading aloud shows certain practical advantages (predictable output, practicality, reliability in scoring), there are several drawbacks to using this technique for assessing oral production. Reading aloud is somewhat inauthentic in that we seldom read anything aloud to someone else in the real world, with the exception of a parent reading to a child, occasionally sharing a written story with someone, or giving a scripted oral presentation. Also, reading aloud calls on certain specialized oral abilities that may not indicate one's pragmatic ability to communicate orally in face-to-face contexts. You should therefore employ this technique with some caution, and certainly supplement it as an assessment task with other, more communicative procedures.
Sentence/Dialogue Completion Tasks and Oral Questionnaires
Another technique for targeting intensive aspects of language requires test-takers to read a dialogue in which one speaker's lines have been omitted. Test-takers are first given time to read through the dialogue to get its gist and to think about appropriate lines to fill in. Then as the tape, teacher, or test administrator produces one part orally, the test-taker responds. Here's an example.
Dialogue completion task
Test-takers read (and then hear):
[In a department store]
Salesperson: May I help you?
Customer:    ______________________________
Salesperson: Okay, what size do you wear?
Customer:    ______________________________
Salesperson: Hmmm. How about this green sweater here?
Customer:    ______________________________
Salesperson: Oh. Well, if you don't like green, what color would you like?
Customer:    ______________________________
Salesperson: How about this one?
Customer:    ______________________________
Salesperson: Great!
Customer:    ______________________________
Salesperson: It's on sale today for $39.95.
Customer:    ______________________________
Salesperson: Sure, we take Visa, MasterCard, and American Express.
Customer:    ______________________________
Test-takers respond with appropriate lines.
An advantage of this technique lies in its moderate control of the output of the test-taker. While individual variations in responses are accepted, the technique taps into a learner's ability to discern expectancies in a conversation and to produce sociolinguistically correct language. One disadvantage of this technique is its reliance on literacy and an ability to transfer easily from written to spoken English. Another disadvantage is the contrived, inauthentic nature of this task: Couldn't the same criterion performance be elicited in a live interview in which an impromptu role-play technique is used?
Perhaps more useful is a whole host of shorter dialogues of two or three lines, each of which aims to elicit a specified target. In the following example, somewhat unrelated items attempt to elicit the past tense, the future tense, yes/no question formation, and asking for the time. Again, test-takers see the stimulus in written form.
Directed response tasks
Test-takers see:
Interviewer: What did you do last weekend?
Test-taker:  _______________________
Interviewer: What will you do after you graduate from this program?
Test-taker:  _______________________
Test-taker:  _______________________?
Interviewer: I was in Japan for two weeks.
Test-taker:  _______________________?
Interviewer: It's ten-thirty.
Test-takers respond with appropriate lines.
One could contend that performance on these items is responsive, rather than intensive. True, the discourse involves responses, but there is a degree of control here that predisposes the test-taker to respond with certain expected forms. Such arguments underscore the fine lines of distinction between and among the five selected categories.
It could also be argued that such techniques are nothing more than a written form of questions that might otherwise (and more appropriately) be part of a standard oral interview. True, but the advantage that the written form offers is to provide a little more time for the test-taker to anticipate an answer, and it begins to remove the potential ambiguity created by aural misunderstanding. It helps to unlock the almost ubiquitous link between listening and speaking performance.
Underhill (1987) describes yet another technique that is useful for controlling the test-taker's output: form-filling, or what I might rename "oral questionnaire." Here the test-taker sees a questionnaire that asks for certain categories of information (personal data, academic information, job experience, etc.) and supplies the information orally.
Picture-Cued Tasks
One of the more popular ways to elicit oral language performance at both intensive and extensive levels is a picture-cued stimulus that requires a description from the test-taker. Pictures may be very simple, designed to elicit a word or a phrase; somewhat more elaborate and "busy"; or composed of a series that tells a story or incident. Here is an example of a picture-cued elicitation of the production of a simple minimal pair.
Picture-cued elicitation of minimal pairs
Test-takers see:
Test-takers hear: [test administrator points to each picture in succession]
What's this?
Grammatical categories may be cued by pictures. In the following sequence, comparatives are elicited:
Picture-cued elicitation of comparatives (Brown & Sahni, 1994, p. 135)
Test-takers hear: Use a comparative form to compare these objects.
Test-takers see:
The future tense is elicited with the following picture:
Picture-cued elicitation of future tense (Brown & Sahni, 1994, p. 145)
Notice that a little sense of humor is injected here: the family, bundled up in their winter coats, is looking forward to leaving the wintry scene behind them! A touch of authenticity is added in that almost everyone can identify with looking forward to a vacation on a tropical island.
Assessment of oral production may be stimulated through a more elaborate picture such as the one on the next page, a party scene.
Picture-cued elicitation of nouns, negative responses, numbers, and location
(Brown & Sahni, 1994, p. 116)
Test-takers see:
Test-takers hear:
       1.  [point to the table] What's this?
       2.  [point to the end table] What's this?
       3.  [point to several chairs] What are these?
       4.  [point to the clock] What's that?
       5.  [point to both lamps] What are those?
       6.  [point to the table] Is this a chair?
       7.  [point to the lamps] Are these clocks?
       8.  [point to the woman standing up] Is she sitting?
       9.  [point to the whole picture] How many chairs are there?
       10. [point to the whole picture] How many women are there?
       11. [point to the TV] Where is the TV?
       12. [point to the chair beside the lamp] Where is this chair?
       13. [point to one person] Describe this person.
In the first five questions, test-takers are asked to orally identify selected vocabulary items. In questions 6-13, assessment of the oral production of negatives, numbers, prepositions, and descriptions of people is elicited.
Moving into more open-ended performance, the following picture asks test-
takers not only to identify certain specific information but also to elaborate
with their own opinion, to accomplish a persuasive function, and to describe
preferences in paintings.
Picture-cued elicitation of responses and description (Brown & Sahni, 1994, p. 162)
Maps are another visual stimulus that can be used to assess the language
forms needed to give directions and specify locations. In the following
example, the test-taker must provide directions to different locations.
Map-cued elicitation of giving directions (Brown & Sahni, 1994, p. 169)
Test-takers see:
[a street map showing 3rd, 4th, and 5th Streets, a cross street, a river, a compass rose pointing north, and labeled locations including the library, bank, bookstore, and Macy's]
Test-takers hear:
You are at First and Jefferson Streets [point to the spot]. People ask you for
directions to get to five different places. Listen to their questions, then give
directions.
1.    Please give me directions to the bank.
2.    Please give me directions to Macy's Department Store.
3.    How do I get to the post office?
4.    Can you tell me where the bookstore is?
5.    Please tell me how to get to the library.
Scoring responses on picture-cued intensive speaking tasks varies, depending on the expected performance criteria. The tasks above that asked just for one-word or simple-sentence responses can be evaluated simply as "correct" or "incorrect." The three-point rubric (2, 1, and 0) suggested earlier may apply as well, with these modifications:
Scoring scale for intensive tasks
2      comprehensible; acceptable target form
1      comprehensible; partially correct target form
0      silence, or seriously incorrect target form
Opinions about paintings, persuasive monologue, and directions on a map
create a more complicated problem for scoring. More demand is placed on the
test administrator to make calculated judgments, in which case a modified
form of a scale such as the one suggested for evaluating interviews (below)
could be used:
•      grammar
•      vocabulary
•      comprehension
•      fluency
•      pronunciation
•      task (accomplishing the objective of the elicited task)
Each category may be scored separately, with an additional composite score
that attempts to synthesize overall performance. To attend to so many factors,
you will probably need to have an audiotaped recording for multiple listening.
One moderately successful picture-cued technique involves a pairing of two
test-takers. Each is supplied with the same set of four numbered pictures, each
picture minimally distinct from the others by one or two factors. One
test-taker is directed by a cue card to describe one of the four pictures in as
few words as possible. The second test-taker must then identify the picture. On
the next page is an example of four pictures.
Test-takers see:
Test-taker 1 describes (for example) picture C; test-taker 2 points to the
correct picture.
The task here is simple and straightforward and clearly in the intensive
category, as the test-taker must simply produce the relevant linguistic markers. Yet
it is still the task of the test administrator to distinguish a correctly produced
response from a correctly understood response, since sources of incorrectness may
not be easily pinpointed. If the pictorial stimuli are more complex than the above items,
greater burdens are placed on both speaker and listener, with consequently greater
difficulty in identifying which of the two committed the error.
Translation (of Limited Stretches of Discourse)
Translation is a part of our tradition in language teaching that we tend to
discount or disdain, if only because our current pedagogical stance plays down
its impor¬tance. Translation methods of teaching are certainly passe in an era
of direct approaches to creating communicative classrooms. But we should
remember that in countries where English is not the native or prevailing
language, translation is a meaningful communicative device in contexts where
the English user is called on to be an interpreter. Also, translation is a well-
proven communication strategy for learners of a second language.
Under certain constraints, then, it is not far-fetched to suggest translation as a
device to check oral production. Instead of offering pictures or written stimuli,
the test-taker is given a native language word, phrase, or sentence and is
asked to translate it. Conditions may vary from expecting an instant
translation of an orally elicited linguistic target to allowing more thinking time
before producing a translation of somewhat longer texts, which may optionally
be offered to the test-taker in written form. (Translation of extensive texts is
discussed at the end of this chapter.) As an assessment procedure, the
advantages of translation lie in its control of the output of the test-taker, which
of course means that scoring is more easily specified.
DESIGNING ASSESSMENT TASKS: RESPONSIVE SPEAKING
Assessment of responsive tasks involves brief interactions with an interlocutor,
differing from intensive tasks in the increased creativity given to the test-
taker and from interactive tasks by the somewhat limited length of utterances.
Question and Answer
Question-and-answer tasks can consist of one or two questions from an
interviewer, or they can make up a portion of a whole battery of questions and
prompts in an oral interview. They can vary from simple questions like "What
is this called in English?" to complex questions like "What are the steps
governments should take, if any, to stem the rate of deforestation in tropical
countries?" The first question is intensive in its purpose; it is a display question
intended to elicit a predetermined correct response. We have already looked
at some of these types of questions in the previous section. Questions at the
responsive level tend to be genuine referential questions in which the test-
taker is given more opportunity to produce meaningful language in response.
In designing such questions for test-takers, it's important to make sure that
you know why you are asking the question. Are you simply trying to elicit
strings of language output to gain a general sense of the test-taker's
discourse competence? Are you combining discourse and grammatical
competence in the same question? Is each question just one in a whole set of
related questions? Responsive questions may take the following forms:
Questions eliciting open-ended responses
Test-takers hear:
1.      What do you think about the weather today?
2.      What do you like about the English language?
3.      Why did you choose your academic major?
4.      What kind of strategies have you used to help you learn English?
5.      a. Have you ever been to the United States before?
b.      What other countries have you visited?
c.      Why did you go there? What did you like best about it?
d.      If you could go back, what would you like to do or see?
e.      What country would you like to visit next, and why?
Test-takers respond with a few sentences at most.
Notice that question 5 has five situationally linked questions that may vary
slightly depending on the test-taker's response to a previous question.
Oral interaction with a test administrator often involves the latter forming the
questions. The flip side of this usual concept of question-and-answer tasks is
to elicit questions from the test-taker. To assess the test-taker's ability to produce
questions, prompts such as this can be used:
Elicitation of questions from the test-taker
Test-takers hear:
•       Do you have any questions for me?
•       Ask me about my family or job or interests.
•       If you could interview the president or prime minister of your country,
what would you ask that person?
Test-takers respond with questions.
A potentially tricky form of oral production assessment involves more than one
test-taker with an interviewer, which is discussed later in this chapter. With two
students in an interview context, both test-takers can ask questions of each other.
Giving Instructions and Directions
We are all called on in our daily routines to read instructions on how to operate
an appliance, how to put a bookshelf together, or how to create a delicious
clam chowder. Somewhat less frequent is the mandate to provide such
instructions orally, but this speech act is still relatively common. Using such a
stimulus in an assessment context provides an opportunity for the test-taker
to engage in a relatively extended stretch of discourse, to be very clear and
specific, and to use appropriate discourse markers and connectors. The
technique is simple: the administrator poses the problem, and the test-taker
responds. Scoring is based primarily on comprehensibility and secondarily on
other specified grammatical or discourse categories. Here are some
possibilities.
Eliciting instructions or directions
Test-takers hear:
•       Describe how to make a typical dish from your country.
•       What's a good recipe for making ______?
•       How do you access email on a PC computer?
•       How would I make a typical costume for a ______ celebration in your
country?
•       How do you program telephone numbers into a cell (mobile) phone?
•       How do I get from ______ to ______ in your city?
Test-takers respond with appropriate instructions/directions.
Some pointers for creating such tasks: The test administrator needs to guard
against test-takers knowing and preparing for such items in advance lest they
simply parrot back a memorized set of sentences. An impromptu delivery of
instructions is warranted here, or at most a minute or so of preparation time.
Also, the choice of topics needs to be familiar enough that you are testing
not general knowledge but linguistic competence; therefore, topics beyond the
content schemata of the test-taker are inadvisable. Finally, the task should
require the test-taker to produce at least five or six sentences (of connected
discourse) to adequately fulfill the objective.
This task can be designed to be more complex, thus placing it in the category
of extensive speaking. If your objective is to keep the response short and
simple, then make sure your directive does not take the test-taker down a
path of complexity that he or she is not ready to face.
Paraphrasing
Another type of assessment task that can be categorized as responsive asks
the test-taker to read or hear a limited number of sentences (perhaps two to
five) and produce a paraphrase. For example:
Paraphrasing a story
Test-takers hear: Paraphrase the following little story in your own words.
My weekend in the mountains was fabulous. The first day we backpacked into
the mountains and climbed about 2,000 feet. The hike was strenuous but
exhilarating. By sunset we found these beautiful alpine lakes and made camp
there. The sunset was amazingly beautiful. The next two days we just kicked
back and did little day hikes, some rock climbing, bird watching, swimming,
and fishing. The hike out on the next day was really easy—all downhill—and
the scenery was incredible.
Test-takers respond with two or three sentences.
A more authentic context for paraphrase is aurally receiving and orally relaying
a message. In the example below, the test-taker must relay information from a
phone call to an office colleague named Jeff.
Paraphrasing a phone message
Test-takers hear:
Please tell Jeff that I'm tied up in traffic so I'm going to be about a half hour
late for the nine o'clock meeting. And ask him to bring up our question about
the employee benefits plan. If he wants to check in with me on my cell phone,
have him call 415-338-3095. Thanks.
Test-takers respond with two or three sentences.
The advantages of such tasks are that they elicit short stretches of output
and perhaps tap into test-takers' ability to practice the conversational art of
conciseness by reducing the output/input ratio. Yet you have to question the
criterion being assessed. Is it a listening task more than a production task? Does it
test short-term memory rather than linguistic ability? And how does the rater
determine the scoring of responses? If you use short paraphrasing tasks as an
assessment procedure, it's important to pinpoint the objective of the task clearly. In this
case, the integration of listening and speaking is probably more at stake than
simple oral production alone.
TEST OF SPOKEN ENGLISH (TSE®)
Somewhere straddling responsive, interactive, and extensive speaking tasks is
another popular commercial oral production assessment, the Test of Spoken English
(TSE®). The TSE is a 20-minute audiotaped test of oral language ability within an
academic or professional environment. TSE scores are used by many North
American institutions of higher education to select international teaching
assistants. The scores are also used for selecting and certifying health
professionals such as physicians, nurses, pharmacists, physical therapists, and
veterinarians.
The tasks on the TSE are designed to elicit oral production in various discourse
categories rather than in selected phonological, grammatical, or lexical
targets. The following content specifications for the TSE represent the
discourse and pragmatic contexts assessed in each administration:
1.    Describe something physical.
2.    Narrate from presented material.
3.    Summarize information of the speaker's own choice.
4.    Give directions based on visual materials.
5.    Give instructions.
6.    Give an opinion.
7.    Support an opinion.
8.    Compare/contrast.
9.    Hypothesize.
10.   Function "interactively."
11.   Define.
Using these specifications, Lazaraton and Wagner (1996) examined 15
different specific tasks in collecting background data from native and non-
native speakers of English.
1.     giving a personal description
2.     describing a daily routine
3.     suggesting a gift and supporting one's choice
4.     recommending a place to visit and supporting one's choice
5.     giving directions
6.     describing a favorite movie and supporting one's choice
7.     telling a story from pictures
8.     hypothesizing about future action
9.     hypothesizing about a preventative action
10.    making a telephone call to the dry cleaner
11.    describing an important news event
12.    giving an opinion about animals in the zoo
13.    defining a technical term
14.    describing information in a graph and speculating about its implications
15.    giving details about a trip schedule
From their findings, the researchers were able to report on the validity of the
tasks, especially the match between the intended task functions and the
actual output of both native and non-native speakers.
Following is a set of sample items as they appear in the TSE Manual, which is
downloadable from the TOEFL® website (see reference on page 167).
Test of Spoken English sample items
Part A.
Test-takers see: A map of a town
Test-takers hear: Imagine that we are colleagues. The map below is of a
neighboring town that you have suggested I visit. You will have 30 seconds to
study the map. Then I'll ask you some questions about it.
   1. Choose one place on the map that you think I should visit and give me
      some reasons why you recommend this place. (30 seconds)
   2. I'd like to see a movie. Please give me directions from the bus station to
      the movie theater. (30 seconds)
   3. One of your favorite movies is playing at the theater. Please tell me about
      the movie and why you like it. (60 seconds)
Part B.
Test-takers see:
A series of six pictures depicts a sequence of events. In this series, painters
have just painted a park bench. Their WET PAINT sign blows away. A man
approaches the bench, sits on it, and starts reading a newspaper. He quickly
discovers his suit has just gotten wet paint on it and then rushes to the dry
cleaner.
Test-takers hear:
Now please look at the six pictures below. I'd like you to tell me the story that
the pictures show, starting with picture number 1 and going through picture
number 6. Please take 1 minute to look at the pictures and think about the
story. Do not begin the story until I tell you to do so.
   4. Tell me the story that the pictures show. (60 seconds)
   5. What could the painters have done to prevent this? (30 seconds)
   6. Imagine that this happens to you. After you have taken the suit to the dry
      cleaners, you find out that you need to wear the suit the next morning.
      The dry cleaning service usually takes two days. Call the dry cleaner and
      try to persuade them to have the suit ready later today. (45 seconds)
   7. The man in the pictures is reading a newspaper. Both newspapers and
      television news programs can be good sources of information about
      current events. What do you think are the advantages and disadvantages
      of each of these sources? (60 seconds)
Part C.
Test-takers hear:
Now I'd like to hear your ideas about a variety of topics. Be sure to say as
much as you can in responding to each question. After I ask each question,
you may take a few seconds to prepare your answer, and then begin speaking
when you're ready.
  8. Many people enjoy visiting zoos and seeing the animals. Other people
      believe that animals should not be taken from their natural surroundings
     and put into zoos. I'd like to know what you think about this issue. (60
     seconds)
  9. I'm not familiar with your field of study. Select a term used frequently in
     your field and define it for me. (60 seconds)
Part D.
Test-takers see:
A graph showing an increase in world population over a half-century of time.
Test-takers hear:
  10.       This graph presents the actual and projected percentage of the
     world population living in cities from 1950 to 2010. Describe to me the
     information given in the graph. (60 seconds)
  11.       Now discuss what this information might mean for the future. (45
     seconds)
Part E.
Test-takers see:
A printed itinerary for a one-day bus tour of Washington, D.C., on which four
relatively simple pieces of information (date, departure time, etc.) have been
crossed out by hand and new handwritten information added.
Test-takers hear:
   12. Now please look at the information below about a trip to Washington,
D.C., that has been organized for the members of the Forest City Historical
Society. Imagine that you are the president of this organization. At the last
meeting, you gave out a schedule for the trip, but there have been some
changes. You must remind the members about the details of the trip and tell
them about the changes indicated on the schedule. In your presentation, do
not just read the information printed, but present it as if you were talking to a
group of people. You will have one minute to plan your presentation and will
be told when to begin speaking. (90 seconds)

TSE test-takers are given a holistic score ranging from 20 to 60, as described
in the TSE Manual (see Table 7.1).
Table 7.1 Test of Spoken English scoring guide (1995): TSE Rating Scale
60          Communication almost always effective: task performed very
competently; speech almost never marked by non-native characteristics
Functions performed clearly and effectively
Appropriate response to audience/situation
Coherent, with effective use of cohesive devices
Almost always accurate pronunciation, grammar, fluency, and vocabulary
50         Communication generally effective: task performed competently,
successful use of compensatory strategies; speech sometimes marked by non-
native characteristics
Functions generally performed clearly and effectively
Generally appropriate response to audience/situation
Coherent, with some effective use of cohesive devices
Generally accurate pronunciation, grammar, fluency, and vocabulary
40          Communication somewhat effective: task performed somewhat
competently, some successful use of compensatory strategies; speech
regularly marked by non-native characteristics
Functions performed somewhat clearly and effectively
Somewhat appropriate response to audience/situation
Somewhat coherent, with some use of cohesive devices
Somewhat accurate pronunciation, grammar, fluency, and vocabulary
30     Communication generally not effective: task generally performed poorly,
ineffective use of compensatory strategies; speech very frequently marked by
non-native characteristics
Functions generally performed unclearly and ineffectively
Generally inappropriate response to audience/situation
Generally incoherent, with little use of cohesive devices
Generally inaccurate pronunciation, grammar, fluency, and vocabulary
20      No effective communication: no evidence of ability to perform task, no
effective use of compensatory strategies; speech almost always marked by
non-native characteristics
No evidence that functions were performed
Incoherent, with no use of cohesive devices
No evidence of ability to respond appropriately to audience/situation
Almost always inaccurate pronunciation, grammar, fluency, and vocabulary
Holistic scoring taxonomies such as these imply a number of abilities that
comprise "effective" communication and "competent" performance of the task.
The original version of the TSE (1987) specified three contributing factors to a
final score of "overall comprehensibility": pronunciation, grammar, and fluency.
The current scoring scale of 20 to 60 listed above incorporates task
performance, functional appropriateness, and coherence as well as the form-
focused factors. From reported scores, institutions are left to determine their
own threshold levels of acceptability, but because scoring is holistic, they will
not receive an analytic score of how each factor breaks down (see Douglas &
Smith, 1997, for further information). Classroom
teachers who propose to model oral production assessments after the tasks on
the TSE must, in order to provide some washback effect, be more explicit in
analyzing the various components of test-takers' output. Such scoring rubrics
are presented in the next section.
Following is a summary of information on the TSE:
Test of Spoken English (TSE®)
Producer:       Educational Testing Service, Princeton, NJ
Objective: To test oral production skills of non-native English speakers
Primary market:         Primarily used for screening international teaching
assistants in
        universities in the United States; a growing secondary market is
        certifying health professionals in the United States
Type: Audiotaped with written, graphic, and spoken stimuli
Response modes: Oral tasks; connected discourse
Specifications:         (see sample items above)
Time allocation:        20 minutes
Internet access:        http://www.toefl.org/tse/tseindx.html
DESIGNING ASSESSMENT TASKS: INTERACTIVE SPEAKING
The final two categories of oral production assessment (interactive and
extensive speaking) include tasks that involve relatively long stretches of
interactive discourse (interviews, role plays, discussions, games) and tasks of
equally long duration but that involve less interaction (speeches, telling longer
stories, and extended explanations and translations). The obvious difference
between the two sets of tasks is the degree of interaction with an interlocutor.
Also, interactive tasks are what some would describe as interpersonal, while
the final category includes more transactional speech events.
Interview
When "oral production assessment" is mentioned, the first thing that comes to
mind is an oral interview: a test administrator and a test-taker sit down in a
direct face-to-face exchange and proceed through a protocol of questions and
directives. The interview, which may be tape-recorded for re-listening, is then
scored on one or more parameters such as accuracy in pronunciation and/or
grammar,          vocabulary      usage,       fluency,     sociolinguistic/pragmatic
appropriateness, task accomplishment, and even comprehension.
Interviews can vary in length from perhaps five to forty-five minutes,
depending on their purpose and context. Placement interviews, designed to
get a quick spoken sample from a student in order to verify placement into a
course, may need only five minutes if the interviewer is trained to evaluate the
output accurately. Longer comprehensive interviews such as the OPI (see the
next section) are designed to cover predetermined oral production contexts
and may require the better part of an hour.
Every effective interview contains a number of mandatory stages. Two decades
ago, Michael Canale (1984) proposed a framework for oral proficiency testing that
has withstood the test of time. He suggested that test-takers will perform at their
best if they are led through four stages:
  1. Warm-up. In a minute or so of preliminary small talk, the interviewer directs
     mutual introductions, helps the test-taker become comfortable with the
     situation, apprises the test-taker of the format, and allays anxieties. No
     scoring of this phase takes place.
  2. Level check. Through a series of preplanned questions, the interviewer
     stimulates the test-taker to respond using expected or predicted forms and
     functions. If, for example, from previous test information, grades, or other data,
     the test-taker has been judged to be a "Level 2" (see below) speaker, the
     interviewer's prompts will attempt to confirm this assumption. The responses
     may take very simple or very complex form, depending on the entry level of the
     learner. Questions are usually designed to elicit grammatical categories (such
     as past tense or subject-verb agreement), discourse structure (a sequence of
     events), vocabulary usage, and/or sociolinguistic factors (politeness
     conventions, formal/informal language). This stage could also give the
     interviewer a picture of the test-taker's extroversion, readiness to speak, and
     confidence, all of which may be of significant consequence in the interview's
     results. Linguistic target criteria are scored in this phase. If this stage is
     lengthy, a tape recording of the interview is important.
  3. Probe. Probe questions and prompts challenge test-takers to go to the
     heights of their ability, to extend beyond the limits of the interviewer's
     expectation, through increasingly difficult questions. Probe questions may be
     complex in their framing and/or complex in their cognitive and linguistic
     demand. Through probe items, the interviewer discovers the ceiling or
     limitation of the test-taker's proficiency. This need not be a separate stage
     entirely, but might be a set of questions that are interspersed into the
     previous stage. At the lower levels of proficiency, probe items may simply
     demand a higher range of vocabulary or grammar from the test-taker than
     predicted. At the higher levels, probe items will typically ask the test-taker to
     give an opinion or a value judgment, to discuss his or her field of
     specialization, to recount a narrative, or to respond to questions that are
     worded in complex form. Responses to probe questions may be scored, or
     they may be ignored if the test-taker displays an inability to handle such
     complexity.
  4. Wind-down. This final phase of the interview is simply a short period of time
     during which the interviewer encourages the test-taker to relax with some
     easy questions, sets the test-taker's mind at ease, and provides information
     about when and where to obtain the results of the interview. This part is not
     scored.
The suggested set of content specifications for an oral interview (below) may
serve as sample questions that can be adapted to individual situations.
Oral interview content specifications
Warm-up:
1.      Small talk
Level check:
The test-taker. . .
2.      answers wh-questions.
3.      produces a narrative without interruptions.
4.      reads a passage aloud.
5.      tells how to make something or do something.
6.      engages in a brief, controlled, guided role play.
Probe:
The test-taker. . .
7.      responds to interviewer's questions about something the test-taker
doesn't know and is planning to include in an article or paper.
8.      talks about his or her own field of study or profession.
9.      engages in a longer, more open-ended role play (for example, simulates
a difficult or embarrassing circumstance) with the interviewer.
10.     gives an impromptu presentation on some aspect of test-taker's field.
Wind-down:
11.     Feelings about the interview, information on results, further questions
Here are some possible questions, probes, and comments that fit those
specifications.
Sample questions for the four stages of an oral interview
1.     Warm-up:
How are you?
What's your name?
What country are you from? What (city, town]?
Let me tell you about this interview.
2.     Level check:
How long have you been in this [country, city]?
Tell me about your family.
What is your [academic major, professional interest, job]?
How long have you been working at your [degree, job]?
Describe your home [city, town] to me.
How do you like your home [city, town]?
What are your hobbies or interests? (What do you do in your spare time?)
Why do you like your [hobby, interest]?
Have you traveled to another country besides this one and your home
country? Tell me about that country.
Compare your home [city, town] to another [city, town].
What is your favorite food?
Tell me how to [make, do] something you know well.
What will you be doing ten years from now?
I'd like you to ask me some questions.
Tell me about an exciting or interesting experience you've had.
Read the following paragraph, please. [test-taker reads aloud]
Pretend that you are ______ and I am a ______. [guided role play follows]
3.     Probe:
What are your goals for learning English in this program?
Describe your [academic field, job] to me. What do you like and dislike about it?
What is your opinion of [a recent headline news event]?
Describe someone you greatly respect, and tell me why you respect that person.
If you could redo your education all over again, what would you do differently?
How do eating habits and customs reflect the culture of the people of a country?
If you were [president, prime minister] of your country, what would you like to
change about your country?
What career advice would you give to your younger friends?
Imagine you are writing an article on a topic you don't know very much about.
Ask me some questions about that topic.
You are in a shop that sells expensive glassware. Accidentally you knock over
an expensive vase, and it breaks. What will you say to the store
owner? [Interviewer role-plays the store owner]
4.     Wind-down:
Did you feel okay about this interview?
What are your plans for [the weekend, the rest of today, the future]?
You'll get your results from this interview [tomorrow, next week].
Do you have any questions you want to ask me?
It was interesting to talk with you. Best wishes.
The success of an oral interview will depend on
•       clearly specifying administrative procedures of the assessment
(practicality),
•       focusing the questions and probes on the purpose of the assessment
(validity),
•       appropriately eliciting an optimal amount and quality of oral production
from the test-taker (biased for best performance), and
•       creating a consistent, workable scoring system (reliability).
This last issue is the thorniest. In oral production tasks that are open-ended
and that involve a significant level of interaction, the interviewer is forced to
make judgments that are susceptible to some unreliability. Through
experience, training, and careful attention to the linguistic criteria being
assessed, the ability to make such judgments accurately will be acquired. In
Table 7.2, a set of descriptions is given for scoring open-ended oral interviews.
These descriptions come from an earlier version of the Oral Proficiency
Interview and are useful for classroom purposes.
The test administrator's challenge is to assign a score, ranging from 1 to 5, for
each of the six categories indicated above. It may look easy to do, but in
reality the lines of distinction between levels are quite difficult to pinpoint. Some
training, or at least a good deal of interviewing experience, is required to make
accurate assessments of oral production in the six categories. Usually the six
scores are then amalgamated into one holistic score, a process that might
not be relegated to a simple mathematical average if you wish to put more
weight on some categories than on others.
This five-point scale, once known as "FSI levels" (because they were first
advocated by the Foreign Service Institute in Washington, D.C.), is still in
popular use among U.S. government foreign service staff for designating
proficiency in a foreign language. To complicate the scoring somewhat, the
five-point holistic scoring categories have historically been subdivided into
"pluses" and "minuses" as indicated in Table 7.3. To this day, even though the
official nomenclature has now changed (see OPI description below), in-group
conversations refer to colleagues and co-workers by their FSI level: "Oh, Bob,
yeah, he's a good 3+ in Turkish—he can easily handle that assignment."
A variation on the usual one-on-one format with one interviewer and one test-
taker is to place two test-takers at a time with the interviewer. An advantage
of a two-on-one interview is the practicality of scheduling twice as many
candidates in the same time frame, but more significant is the opportunity for
student-student interaction. By deftly posing questions, problems, and role
plays, the interviewer can maximize the output of the test-takers while
lessening the need for his or her own output. A further benefit is the probable
increase in authenticity when two test-takers can actually converse with each
other. Disadvantages are equalizing the output between the two test-takers,
discerning the interaction effect of unequal comprehension and production
abilities, and scoring two people simultaneously.
Table 7.2. Oral proficiency scoring categories (Brown, 2001, pp. 406-407)

Level I
Grammar: Errors in grammar are frequent, but speaker can be understood by a native speaker used to dealing with foreigners attempting to speak his language.
Vocabulary: Speaking vocabulary inadequate to express anything but the most elementary needs.
Comprehension: Within the scope of his very limited language experience, can understand simple questions and statements if delivered with slowed speed, repetition, or paraphrase.
Fluency: (No specific fluency description. Refer to other four language areas for implied level of fluency.)
Pronunciation: Errors in pronunciation are frequent but can be understood by a native speaker used to dealing with foreigners attempting to speak his language.
Task: Can ask and answer questions on topics very familiar to him. Able to satisfy routine travel needs and minimum courtesy requirements. (Should be able to order a simple meal, ask for shelter or lodging, ask and give simple directions, make purchases, and tell time.)

Level II
Grammar: Can usually handle elementary constructions quite accurately but does not have thorough or confident control of the grammar.
Vocabulary: Has speaking vocabulary sufficient to express himself simply with some circumlocutions.
Comprehension: Can get the gist of most conversations on non-technical subjects (i.e., topics that require no specialized knowledge).
Fluency: Can handle with confidence but not with facility most social situations, including introductions and casual conversations about current events, as well as work, family, and autobiographical information.
Pronunciation: Accent is intelligible though often quite faulty.
Task: Able to satisfy routine social demands and work requirements; needs help in handling any complication or difficulties.

Level III
Grammar: Control of grammar is good. Able to speak the language with sufficient structural accuracy to participate effectively in most formal and informal conversations on practical, social, and professional topics.
Vocabulary: Able to speak the language with sufficient vocabulary to participate effectively in most formal and informal conversations on practical, social, and professional topics. Vocabulary is broad enough that he rarely has to grope for a word.
Comprehension: Comprehension is quite complete at a normal rate of speech.
Fluency: Can discuss particular interests of competence with reasonable ease. Rarely has to grope for words.
Pronunciation: Errors never interfere with understanding and rarely disturb the native speaker. Accent may be obviously foreign.
Task: Can participate effectively in most formal and informal conversations on practical, social, and professional topics.

Level IV
Grammar: Able to use the language accurately on all levels normally pertinent to professional needs. Errors in grammar are quite rare.
Vocabulary: Can understand and participate in any conversation within the range of his experience with a high degree of precision of vocabulary.
Comprehension: Can understand any conversation within the range of his experience.
Fluency: Able to use the language fluently on all levels normally pertinent to professional needs. Can participate in any conversation within the range of this experience with a high degree of fluency.
Pronunciation: Errors in pronunciation are quite rare.
Task: Would rarely be taken for a native speaker but can respond appropriately even in unfamiliar situations. Can handle informal interpreting from and into the language.

Level V
Grammar: Equivalent to that of an educated native speaker.
Vocabulary: Speech on all levels is fully accepted by educated native speakers in all its features including breadth of vocabulary and idioms, colloquialisms, and pertinent cultural references.
Comprehension: Equivalent to that of an educated native speaker.
Fluency: Has complete fluency in the language such that his speech is fully accepted by educated native speakers.
Pronunciation: Equivalent to and fully accepted by educated native speakers.
Task: Speaking proficiency equivalent to that of an educated native speaker.
Table 7.3. Subcategories of oral proficiency scores

Level  Description
0      Unable to function in the spoken language
0+     Able to satisfy immediate needs using rehearsed utterances
1      Able to satisfy minimum courtesy requirements and maintain very simple face-to-face conversations on familiar topics
1+     Can initiate and maintain predictable face-to-face conversations and satisfy limited social demands
2      Able to satisfy routine social demands and limited work requirements
2+     Able to satisfy most work requirements with language usage that is often, but not always, acceptable and effective
3      Able to speak the language with sufficient structural accuracy and vocabulary to participate effectively in most formal and informal conversations on practical, social, and professional topics
3+     Often able to use the language to satisfy professional needs in a wide range of sophisticated and demanding tasks
4      Able to use the language fluently and accurately on all levels normally pertinent to professional needs
4+     Speaking proficiency is regularly superior in all respects, usually equivalent to that of a well-educated, highly articulate native speaker
5      Speaking proficiency is functionally equivalent to that of a highly articulate, well-educated native speaker and reflects the cultural standards of the country where the language is spoken
Role Play

Role playing is a popular pedagogical activity in communicative language-teaching classes. Within constraints set forth by the guidelines, it frees students to be somewhat creative in their linguistic output. In some versions, role play allows some rehearsal time so that students can map out what they are going to say. And it has the effect of lowering anxieties as students can, even for a few moments, take on the persona of someone other than themselves.

As an assessment device, role play opens some windows of opportunity for test-takers to use discourse that might otherwise be difficult to elicit. With prompts such as "Pretend that you're a tourist asking me for directions" or "You're buying a necklace from me in a flea market, and you want to get a lower price," certain personal, strategic, and linguistic factors come into the foreground of the test-taker's oral abilities. While role play can be controlled or "guided" by the interviewer, the technique takes test-takers beyond simple intensive and responsive levels to a level of creativity and complexity that approaches real-world pragmatics. Scoring presents the usual issues in any task that elicits somewhat unpredictable responses from test-takers. The test administrator must determine the assessment objectives of the role play, then devise a scoring technique that appropriately pinpoints those objectives.
Discussions and Conversations

As formal assessment devices, discussions and conversations with and among students are difficult to specify and even more difficult to score. But as informal techniques to assess learners, they offer a level of authenticity and spontaneity that other assessment techniques may not provide. Discussions may be especially appropriate tasks through which to elicit and observe such abilities as

•       topic nomination, maintenance, and termination;
•       attention getting, interrupting, floor holding, control;
•       clarifying, questioning, paraphrasing;
•       comprehension signals (nodding, "uh-huh," "hmm," etc.);
•       negotiating meaning;
•       intonation patterns for pragmatic effect;
•       kinesics, eye contact, proxemics, body language; and
•       politeness, formality, and other sociolinguistic factors.

Assessing the performance of participants through scores or checklists (in which appropriate or inappropriate manifestations of any category are noted) should be carefully designed to suit the objectives of the observed discussion. Of course, discussion is an integrative task, and so it is also advisable to give some cognizance to comprehension performance in evaluating learners.
Games

Among informal assessment devices are a variety of games that directly involve language production. Consider the following types:

Assessment games

1. "Tinkertoy" game: A Tinkertoy (or Lego block) structure is built behind a
    screen. One or two learners are allowed to view the structure. In
    successive stages of construction, the learners tell "runners" (who can't
    observe the structure) how to re-create the structure. The runners then tell
    "builders" behind another screen how to build the structure. The builders
    may question or confirm as they proceed, but only through the two
    degrees of separation. Object: re-create the structure as accurately as
    possible.
2. Crossword puzzles are created in which the names of all members of a
    class are clued by obscure information about them. Each class member
    must ask questions of others to determine who matches the clues in the
    puzzle.
3. Information gap grids are created such that class members must conduct
    mini-interviews of other classmates to fill in boxes, e.g., "born in July,"
    "plays the violin," "has a two-year-old child," etc.
4. City maps are distributed to class members. Predetermined map directions
    are given to one student who, with a city map in front of him or her,
    describes the route to a partner, who must then trace the route and get to
    the correct final destination.
Clearly, such tasks have wandered away from the traditional notion of an oral production test and may even be well beyond assessments, but if you remember the discussion of these terms in Chapter 1 of this book, you can put the tasks into perspective. As assessments, the key is to specify a set of criteria and a reasonably practical and reliable scoring method. The benefit of such an informal assessment may not be as much in a summative evaluation as in its formative nature, with washback for the students.
ORAL PROFICIENCY INTERVIEW (OPI)

The best-known oral interview format is one that has gone through a considerable metamorphosis over the last half-century, the Oral Proficiency Interview (OPI). Originally known as the Foreign Service Institute (FSI) test, the OPI is the result of a historical progression of revisions under the auspices of several agencies, including the Educational Testing Service and the American Council on Teaching Foreign Languages (ACTFL). The latter, a professional society for research on foreign language instruction and assessment, has now become the principal body for promoting the use of the OPI. The OPI is widely used across dozens of languages around the world. Only certified examiners are authorized to administer the OPI; certification workshops are available, at costs of around $700 for ACTFL members, through ACTFL at selected sites and conferences throughout the year.
Specifications for the OPI approximate those delineated above under the discussion of oral interviews in general. In a series of structured tasks, the OPI is carefully designed to elicit pronunciation, fluency and integrative ability, sociolinguistic and cultural knowledge, grammar, and vocabulary. Performance is judged by the examiner to be at one of ten possible levels on the ACTFL-designated proficiency guidelines for speaking: Superior; Advanced: high, mid, low; Intermediate: high, mid, low; Novice: high, mid, low. A summary of those levels is provided in Table 7.4.
The ACTFL Proficiency Guidelines may appear to be just another form of the "FSI levels" described earlier. Holistic evaluation is still implied, and in this case only
Table 7.4. Summary highlights: ACTFL proficiency guidelines—speaking

Superior
Superior-level speakers are characterized by the ability to
•       participate fully and effectively in conversations in formal and informal settings on topics related to practical needs and areas of professional and/or scholarly interests
•       provide a structured argument to explain and defend opinions and develop effective hypotheses within extended discourse
•       discuss topics concretely and abstractly
•       deal with a linguistically unfamiliar situation
•       maintain a high degree of linguistic accuracy
•       satisfy the linguistic demands of professional and/or scholarly life

Advanced
Advanced-level speakers are characterized by the ability to
•       participate actively in conversations in most informal and some formal settings on topics of personal and public interest
•       narrate and describe in major time frames with good control of aspect
•       deal effectively with unanticipated complications through a variety of communicative devices
•       sustain communication by using, with suitable accuracy and confidence, connected discourse of paragraph length and substance
•       satisfy the demands of work and/or school situations

Intermediate
Intermediate-level speakers are characterized by the ability to
•       participate in simple, direct conversations on generally predictable topics related to daily activities and personal environment
•       create with the language and communicate personal meaning to sympathetic interlocutors by combining language elements in discrete sentences and strings of sentences
•       obtain and give information by asking and answering questions
•       sustain and bring to a close a number of basic, uncomplicated communicative exchanges, often in a reactive mode
•       satisfy simple personal needs and social demands to survive in the target language culture

Novice
Novice-level speakers are characterized by the ability to
•       respond to simple questions on the most common features of daily life
•       convey minimal meaning to interlocutors experienced in dealing with foreigners by using isolated words, lists of words, memorized phrases, and some personalized recombinations of words and phrases
•       satisfy a very limited number of immediate needs
four levels are described. On closer scrutiny, however, they offer a markedly different set of descriptors. First, they are more reflective of a unitary definition of ability, as discussed earlier in this book (page 71). Instead of focusing on separate abilities in grammar, vocabulary, comprehension, fluency, and pronunciation, they focus more strongly on the overall task and on the discourse ability needed to accomplish the goals of the tasks. Second, for classroom assessment purposes, the six FSI categories more appropriately describe the components of oral ability than do the ACTFL holistic scores, and therefore offer better washback potential. Third, the ACTFL requirement for specialized training renders the OPI less useful for classroom adaptation. Which form of evaluation is best is an issue that is still hotly debated (Reed & Cohen, 2001).
It was noted above that for official purposes, the OPI relies on an administrative network that mandates certified examiners, who pay a significant fee to achieve examiner status. This systemic control of the OPI adds test reliability to the procedure and assures test-takers that examiners are specialists who have gone through a rigorous training course. All these safeguards discourage the appearance of "outlaw" examiners who might render unreliable scores.
On the other hand, the whole idea of an oral interview under the control of an interviewer has come under harsh criticism from a number of language-testing specialists. Valdman (1988, p. 125) summed up the complaint:

From a Vygotskyan perspective, the OPI forces test-takers into a closed system where, because the interviewer is endowed with full social control, they are unable to negotiate a social world. For example, they cannot nominate topics for discussion, they cannot switch formality levels, they cannot display a full range of stylistic maneuvers. The total control the OPI interviewers possess is reflected by the parlance of the test methodology.... In short, the OPI can only inform us of how learners can deal with an artificial social imposition rather than enabling us to predict how they would be likely to manage authentic linguistic interactions with target-language native speakers.
Bachman (1988, p. 149) also pointed out that the validity of the OPI cannot simply be demonstrated "because it confounds abilities with elicitation procedures in its design, and it provides only a single rating, which has no basis in either theory or research."
Meanwhile, a great deal of experimentation continues to be conducted to design better oral proficiency testing methods (Bailey, 1998; Young & He, 1998). With ongoing critical attention to issues of language assessment in the years to come, we may be able to solve some of the thorny problems of how best to assess oral production in authentic contexts and to create valid and reliable scoring methods. Here is a summary of the ACTFL OPI:
American Council on Teaching Foreign Languages (ACTFL) Oral
Proficiency Interview (OPI)

Producer:    American Council on Teaching Foreign Languages, Yonkers, NY
Objective:   To test oral production skills of speakers in 37 different foreign
       languages
Primary market:     Certification of speakers for government personnel and
       employees in the workplace; evaluation of students in language
       programs
Type:  Oral interview—telephoned or in person
Response modes:     Oral production in a variety of genres and tasks
Specifications:     Personalized questions geared to the test-taker's interests and
       experiences; a variety of communication tasks designed to
       gauge the test-taker's upper limits; role play
Time allocation:    30-40 minutes
Internet access:    http://www.actfl.org/
DESIGNING ASSESSMENTS: EXTENSIVE SPEAKING

Extensive speaking tasks involve complex, relatively lengthy stretches of discourse. They are frequently variations on monologues, usually with minimal verbal interaction.
Oral Presentations

In the academic and professional arenas, it would not be uncommon to be called on to present a report, a paper, a marketing plan, a sales idea, a design of a new product, or a method. A summary of oral assessment techniques would therefore be incomplete without some consideration of extensive speaking tasks. Once again the rules for effective assessment must be invoked: (a) specify the criterion, (b) set appropriate tasks, (c) elicit optimal output, and (d) establish practical, reliable scoring procedures. And once again scoring is the key assessment challenge.

For oral presentations, a checklist or grid is a common means of scoring or evaluation. Holistic scores are tempting to use for their apparent practicality, but they may obscure the variability of performance across several subcategories, especially the two major components of content and delivery. Following is an example of a checklist for a prepared oral presentation at the intermediate or advanced level of English.
Oral presentation checklist

Evaluation of oral presentation

Assign a number to each box according to your assessment of the various aspects of the speaker's presentation.

3     Excellent
2     Good
1     Fair
0     Poor

Content:
□     The purpose or objective of the presentation was accomplished.
□     The introduction was lively and got my attention.
□     The main idea or point was clearly stated toward the beginning.
□     The supporting points were
•     clearly expressed
•     supported well by facts, argument
□     The conclusion restated the main idea or purpose.

Delivery:
□     The speaker used gestures and body language well.
□     The speaker maintained eye contact with the audience.
□     The speaker's language was natural and fluent.
□     The speaker's volume of speech was appropriate.
□     The speaker's rate of speech was appropriate.
□     The speaker's pronunciation was clear and comprehensible.
□     The speaker's grammar was correct and didn't prevent understanding.
□     The speaker used visual aids, handouts, etc., effectively.
□     The speaker showed enthusiasm and interest.
□     [If appropriate] The speaker responded to audience questions well.
Such a checklist is reasonably practical. Its reliability can vary if clear standards for scoring are not maintained. Its authenticity can be supported in that all of the items on the list contribute to an effective presentation. The washback effect of such a checklist will be enhanced by written comments from the teacher, a conference with the teacher, peer evaluations using the same form, and self-assessment.
Picture-Cued Story-Telling

One of the most common techniques for eliciting oral production is through visual pictures, photographs, diagrams, and charts. We have already looked at this elicitation device for intensive tasks, but at this level we consider a picture or a series of pictures as a stimulus for a longer story or description. Consider the following set of pictures.
Picture-cued story-telling task (Brown, 1999, p. 29)
It's always tempting to throw any picture sequence at test-takers and have them talk for a minute or so about them. But as is true of every assessment of speaking ability, the objective of eliciting narrative discourse needs to be clear. In the above example (with a little humor added!), are you testing for oral vocabulary (girl, alarm, coffee, telephone, wet, cat, etc.), for time relatives (before, after, when), for sentence connectors (then, and then, so), for past tense of irregular verbs (woke, drank, rang), and/or for fluency in general? If you are eliciting specific grammatical or discourse features, you might add to the directions something like "Tell the story that these pictures describe. Use the past tense of verbs." Your criteria for scoring need to be clear about what it is you are hoping to assess. Refer back to some of the guidelines suggested under the section on oral interviews, above, or to the OPI for some general suggestions on scoring such a narrative.
Retelling a Story, News Event

In this type of task, test-takers hear or read a story or news event that they are asked to retell. This differs from the paraphrasing task discussed above (pages 161-162) in that it is a longer stretch of discourse and a different genre. The objectives in assigning such a task vary from listening comprehension of the original to production of a number of oral discourse features (communicating sequences and relationships of events, stress and emphasis patterns, "expression" in the case of a dramatic story), fluency, and interaction with the hearer. Scoring should of course meet the intended criteria.
Translation (of Extended Prose)

Translation of words, phrases, or short sentences was mentioned under the category of intensive speaking. Here, longer texts are presented for the test-taker to read in the native language and then translate into English. Those texts could come in many forms: dialogue, directions for assembly of a product, a synopsis of a story or play or movie, directions on how to find something on a map, and other genres. The advantage of translation is in the control of the content, vocabulary, and, to some extent, the grammatical and discourse features. The disadvantage is that translation of longer texts is a highly specialized skill for which some individuals obtain post-baccalaureate degrees! To judge a nonspecialist's oral language ability on such a skill may be completely invalid, especially if the test-taker has not engaged in translation at this level. Criteria for scoring should therefore take into account not only the purpose in stimulating a translation but the possibility of errors that are unrelated to oral production ability.
One consequence of our being articulate mammals is an extraordinarily complex system of vocal communication that has evolved over the millennia of human existence. This chapter has offered a relatively sweeping overview of some of the ways we have learned to assess our wonderful ability to produce sounds, words, and sentences, and to string them together to make meaningful texts. This chapter's limited number of assessment techniques may encourage your imagination to explore a potentially limitless number of possibilities for assessing oral production.
EXERCISES
(Note: (I) Individual work; (G) Group or pair work; (C) Whole-class discussion.)
1. (G) In the introduction to the chapter, the unique challenges of testing
   speaking were described (interaction effect, elicitation techniques, and
   scoring). In pairs, offer practical examples of one of the challenges, as
   assigned to your pair. Explain your examples to the class.
2. (C) Review the five basic types of speaking that were outlined at the
   beginning of the chapter. Offer examples of each and pay special attention
   to distinguishing between imitative and intensive, and between responsive
   and interactive.
3. (G) Look at the list of micro- and macroskills of speaking on pages 142-143.
   In pairs, each assigned to a different skill (or two), brainstorm some tasks
   that assess those skills. Present your findings to the rest of the class.
4. (C) In Chapter 6, eight characteristics of listening (page 122) that make
   listening "difficult" were listed. What makes speaking difficult? Devise a
   similar list that could form a set of specifications to pay special attention to
   in assessing speaking.
5. (G) Divide the five basic types of speaking among groups or pairs, one type
   for each. Look at the sample assessment techniques provided and evaluate
   them according to the five principles (practicality, reliability, validity
   [especially face and content], authenticity, and washback). Present your
   critique to the rest of the class.
6. (G) In the same groups as in question #5 above, with the same type of
   speaking, design some other item types, different from the one(s) provided
   here, that assess the same type of speaking performance.
7. (I) Visit the website listed for the PhonePass test. If you can afford it and you
   are a non-native speaker of English, take the test. Report back to the class
   on how valid, reliable, and authentic you felt the test was.
8. (G) Several scoring scales are offered in this chapter, ranging from simple
   (2-1-0) score categories to the more elaborate rubric used for the OPI. In
   groups, each assigned to a scoring scale, evaluate the strengths and
   weak¬nesses of each. Pay special attention to intra-rater and inter-rater
   reliability.
9. (C) If possible, role-play a formal oral interview in your class, with one
   student (with beginning to intermediate proficiency in a language) acting as
   the test-taker and another (with advanced proficiency) as the test
   administrator. Use the sample questions provided on pages 169-170 as a
   guide. This role play will require some preparation. The rest of the class will
   then evaluate the effectiveness of the oral interview. Finally, the test-taker
   and administrator can offer their perspectives on the experience.
FOR YOUR FURTHER READING
Underhill, Nic. (1987). Testing spoken language: A handbook of oral testing
     techniques. Cambridge: Cambridge University Press.
This practical manual on assessing spoken language is still a widely used
     collection of techniques despite the fact that it was published in 1987.
     The chapters are organized into types of tests, elicitation techniques,
     scoring systems, and a discussion of several types of validity along with
     reliability.
Brown, J.D. (1998). New ways of classroom assessment. Alexandria, VA:
     Teachers of English to Speakers of Other Languages.
One of the many volumes in TESOL's "New Ways" series, this one presents a
     collection of assessment techniques across a wide range of skill areas.
     The two sections on assessing oral skills offer 17 different techniques.
Celce-Murcia, Marianne, Brinton, Donna, and Goodwin, Janet. (1996). Teaching
     pronunciation: A reference for teachers of English to speakers of other
     languages. Cambridge: Cambridge University Press.
This broadly based pedagogical reference book on teaching pronunciation also
     offers numerous examples and commentaries on assessment of
     pronunciation (which of course goes hand in hand with teaching). Most
     of the references to assessment deal with informal assessment, but the
     book also addresses formal assessment.