
LANGUAGE
TESTS
AT SCHOOL

John W. Oller, Jr.


Oller, John W., Jr., (1979). Language Tests at School: a
Pragmatic Approach. London: Longman.
(Japanese translation appeared in 1994 published in Tokyo:
Longman and Shubun.)

Language Tests
at School
A Pragmatic Approach

JOHN W. OLLER, Jr.

University of New Mexico
Albuquerque

LONGMAN
LONGMAN GROUP LIMITED
London
Associated companies, branches and representatives throughout the world

© Longman Group Ltd. 1979

All rights reserved. No part of this publication may be reproduced,
stored in a retrieval system, or transmitted in any form or by any
means, electronic, mechanical, photocopying, recording, or otherwise,
without the prior permission of the Copyright owner.

First published 1979

Oller, John W
Language tests at school. - (Applied linguistics and language study.)
1. Language and languages - Ability testing
I. Title  II. Series
407'.6  P53  78-41005

ISBN 0-582-55365-2
ISBN 0-582-55294-X Pbk

Printed in Great Britain by Butler & Tanner Ltd, Frome and London.

ACKNOWLEDGEMENTS
We are grateful to the following for permission to reproduce copyright material:
Longman Group Ltd., for an extract from Bridge Series: Oliver Twist edited by Latif
Doss; the author, John Pickford for his review of 'The Scandaroon' by Henry
Williamson from Bookcase Broadcast on the BBC World Service January 6th 1973,
read by John Pickford; Science Research Associates Inc., for extracts from 'A Pig Can
Jig' by Donald Rasmussen and Lynn Goldberg in The SRA Reading Program - Level A
Basic Reading Series © 1964, 1970 Donald E. Rasmussen and Lenina Goldberg.
Reprinted by permission of the publisher Science Research Associates Inc.; Board of
Education of the City of New York, from 'New York City Language Assessment
Battery'; reproduced from the Bilingual Syntax Measure by permission, Copyright ©
1975 by Harcourt Brace Jovanovich, Inc. All rights reserved; Center for Bilingual
Education, Northwest Regional Educational Laboratory from 'Oral Language Tests
for Bilingual Students'; McGraw-Hill Book Company, from 'Testing English as a
Second Language' by Harris; McGraw-Hill Book Company from 'Language Testing'
by Lado; Language Learning (North University Building) from 'Problems in Foreign
Language Testing'; Newbury House Publishers from 'Oral Interview' by Ilyin; 'James
Language Dominance Test', copyright 1974 by Peter James, published by Teaching
Resources Corporation, Boston, Massachusetts, U.S.A.; 'Black American Cultural
Attitude Scale', copyright 1973 by Perry Alan Zirkel, Ph.D, published by Teaching
Resources Corporation, Boston, Massachusetts, U.S.A.
Contents

Chapter 1  Introduction
A. What is a language test?
B. What is language testing research about?
C. Organization of this book
Key Points
Discussion Questions
Suggested Readings

PART ONE
THEORY AND RESEARCH BASES FOR
PRAGMATIC LANGUAGE TESTING

Chapter 2  Language Skill as a Pragmatic Expectancy Grammar
A. What is pragmatics about?
B. The factive aspect of language use
C. The emotive aspect
D. Language learning as grammar construction and modification
E. Tests that invoke the learner's grammar
Key Points
Discussion Questions
Suggested Readings

Chapter 3  Discrete Point, Integrative, or Pragmatic Tests
A. Discrete point versus integrative testing
B. A definition of pragmatic tests
C. Dictation and cloze procedure as examples of pragmatic tests
D. Other examples of pragmatic tests
E. Research on the validity of pragmatic tests
   1. The meaning of correlation
   2. Correlations between different language tests
   3. Error analysis as an independent source of validity data
Key Points
Discussion Questions
Suggested Readings

Chapter 4  Multilingual Assessment
A. Need
B. Multilingualism versus multidialectalism
C. Factive and emotive aspects of multilingualism
D. On test biases
E. Translating tests or items
F. Dominance and proficiency
G. Tentative suggestions
Key Points
Discussion Questions
Suggested Readings

Chapter 5  Measuring Attitudes and Motivations
A. The need for validating affective measures
B. Hypothesized relationships between affective variables and the
   use and learning of language
C. Direct and indirect measures of affect
D. Observed relationships to achievement and remaining puzzles
Key Points
Discussion Questions
Suggested Readings

PART TWO
THEORIES AND METHODS
OF DISCRETE POINT TESTING

Chapter 6  Syntactic Linguistics as a Source for Discrete Point Methods
A. From theory to practice, exclusively?
B. Meaning-less structural analysis
C. Pattern drills without meaning
D. From discrete point teaching to discrete point testing
E. Contrastive linguistics
F. Discrete elements of discrete aspects of discrete components of
   discrete skills - a problem of numbers
Key Points
Discussion Questions
Suggested Readings

Chapter 7  Statistical Traps
A. Sampling theory and test construction
B. Two common misinterpretations of correlations
C. Statistical procedures as the final criterion for item selection
D. Referencing tests against non-native performance
Key Points
Discussion Questions
Suggested Readings

Chapter 8  Discrete Point Tests
A. What they attempt to do
B. Theoretical problems in isolating pieces of a system
C. Examples of discrete point items
D. A proposed reconciliation with pragmatic testing theory
Key Points
Discussion Questions
Suggested Readings

Chapter 9  Multiple Choice Tests
A. Is there any other way to ask a question?
B. Discrete point and integrative multiple choice tests
C. About writing items
D. Item analysis and its interpretation
E. Minimal recommended steps for multiple choice test preparation
F. On the instructional value of multiple choice tests
Key Points
Discussion Questions
Suggested Readings

PART THREE
PRACTICAL RECOMMENDATIONS
FOR LANGUAGE TESTING

Chapter 10  Dictation and Closely Related Auditory Tasks
A. Which dictation and other auditory tasks are pragmatic?
B. What makes dictation work?
C. How can dictation be done?
Key Points
Discussion Questions
Suggested Readings

Chapter 11  Tests of Productive Oral Communication
A. Prerequisites for pragmatic speaking tests
B. The Bilingual Syntax Measure
C. The Ilyin Oral Interview and the Upshur Oral Communication Test
D. The Foreign Service Institute Oral Interview
E. Other pragmatic speaking tasks
Key Points
Discussion Questions
Suggested Readings

Chapter 12  Varieties of Cloze Procedure
A. What is the cloze procedure?
B. Cloze tests as pragmatic tasks
C. Applications of cloze procedure
D. How to make and use cloze tests
Key Points
Discussion Questions
Suggested Readings

Chapter 13  Essays and Related Writing Tasks
A. Why essays?
B. Examples of pragmatic writing tasks
C. Scoring for conformity to correct prose
D. Rating content and organization
E. Interpreting protocols
Key Points
Discussion Questions
Suggested Readings

Chapter 14  Inventing New Tests in Relation to a Coherent Curriculum
A. Why language skills in a school curriculum?
B. The ultimate problem of test validity
C. A model: the Mount Gravatt reading program
D. Guidelines and checks for new testing procedures
Key Points
Discussion Questions
Suggested Readings

Appendix  The Factorial Structure of Language Proficiency:
Divisible or Not?
A. Three empirically testable alternatives
B. The empirical method
C. Data from second language studies
D. The Carbondale Project, 1976-7
E. Data from first language studies
F. Directions for further empirical research

References
Index

LIST OF FIGURES

Figure 1  A cartoon drawing illustrating the style of the Bilingual
Syntax Measure
Figure 2  A hypothetical view of the amount of variance in learning
to be accounted for by emotive versus factive sorts of information
Figure 3  A dominance scale in relation to proficiency scales
Figure 4  Example of a Likert-type attitude scale intended for children
Figure 5  Componential breakdown of language proficiency proposed
by Harris (1969, p. 11)
Figure 6  Componential analysis of language skills as a framework
for test construction from Cooper (1968, 1972, p. 337)
Figure 7  'Language assessment domains' as defined by Silverman
et al (1976, p. 21)
Figure 8  Schematic representation of constructs posited by a
componential analysis of language skills based on discrete point
test theory, from Oller (1976c, p. 150)
Figure 9  The ship/sheep contrast by Lado (1961, p. 57) and Harris
(1969, p. 34)
Figure 10  The watching/washing contrast, Lado (1961, p. 57)
Figure 11  The pin/pen/pan contrast, Lado (1961, p. 58)
Figure 12  The ship/jeep/sheep contrast, Lado (1961, p. 58)
Figure 13  'Who is watching the dishes?' (Lado, 1961, p. 59)
Figure 14  Pictures from the James Language Dominance Test
Figure 15  Pictures from the New York City Language Assessment
Battery, Listening and Speaking Subtest
Figure 16  Pictures 5, 6 and 7 from the Bilingual Syntax Measure
Figure 17  Sample pictures for the Orientation section of the Ilyin
Oral Interview (1976)
Figure 18  Items from the Upshur Oral Communication Test
Figure 19  Some examples of visual closure - seeing the overall
pattern or Gestalt

LIST OF TABLES

Table 1  Intercorrelations of the Part Scores on the Test of English
as a Foreign Language Averaged over Forms Administered through
April, 1967
Table 2  Intercorrelations of the Part Scores on the Test of English
as a Foreign Language Averaged over Administrations from
October, 1966 through June, 1971
Table 3  Response Frequency Distribution Example One
Table 4  Response Frequency Distribution Example Two
Table 5  Intercorrelations between Two Dictations, Spelling Scores
on the Same Dictations, and Four Other Parts of the UCLA English
as a Second Language Placement Examination Form 2D Administered
to 145 Foreign Students in the Spring of 1971

TABLES IN THE APPENDIX

Table 1  Principal Components Analysis over Twenty-two Scores on
Language Processing Tasks Requiring Listening, Speaking, Reading,
and Writing as well as Specific Grammatical Decisions
Table 2  Varimax Rotated Solution for Twenty-two Language Scores
Table 3  Principal Components Analysis over Sixteen Listening Scores
Table 4  Varimax Rotated Solution for Sixteen Listening Scores
Table 5  Principal Components Analysis over Twenty-seven Speaking
Scores
Table 6  Varimax Rotated Solution over Twenty-seven Speaking Scores
Table 7  Principal Components Solution over Twenty Reading Scores
Table 8  Varimax Rotated Solution over Twenty Reading Scores
Table 9  Principal Components Analysis over Eighteen Writing Scores
Table 10  Varimax Rotated Solution over Eighteen Writing Scores
Table 11  Principal Components Analysis over Twenty-three Grammar
Scores
Table 12  Varimax Rotated Solution over Twenty-three Grammar Scores
Acknowledgements

George Miller once said that it is an 'ill-kept secret' that many
people other than the immediate author are involved in the writing of
a book. He said that the one he was prefacing was no exception, and
neither is this one. I want to thank all of those many people who have
contributed to the inspiration and energy required to compile, edit,
write, and re-write many times the material contained in this book. I
cannot begin to mention all of the colleagues, students, and friends
who shared with me their time, talent, and patience along with all of
the other little joys and minor agonies that go into the writing of a
book. Neither can I mention the teachers of the extended classroom
whose ideas have influenced the ones I have tried to express here. For
all of them this is probably a good thing, because it is not likely that
any of them would want to own some of the distillations of their ideas
which find expression here.

Allowing the same discretion to my closer mentors and collaborators,
I do want to mention some of them. My debt to my father
and to his Spanish program, published by Encyclopaedia Britannica
Educational Corporation, will be obvious to anyone who has used
and understood the pragmatic principles so well exemplified there. I
also want to thank the staff at Educational Testing Service, and the
other members of the Committee of Examiners for the Test of English
as a Foreign Language who were kind enough both to tolerate my
vigorous criticisms of that test and to help fill in the many lacunae in
my still limited understanding of the business of tests and
measurement. I am especially indebted to the incisive thinking and
challenging communications of John A. Upshur of the University
of Michigan, who chaired that committee for three of the four years
that I served on it. Alan Hudson of the University of New Mexico and
Robert Gardner of the University of Western Ontario stimulated my
interest in much of the material on attitudes and sociolinguistics
which has found its way into this book. Similarly, doctoral research
by Douglas Stevenson, Annabelle Scoon, Rose Ann Wallace,
Frances Hinofotis, and Herta Teitelbaum has had a significant
impact on recommendations contained here. Work in Alaska with
Eskimo children by Virginia Streiff, in the African primary schools by
John Hofman, in Papua New Guinea by Jonathon Anderson, in
Australia by Norman Hart and Richard Walker, and in Canada and
Europe by John McLeod has fundamentally influenced what is said
concerning the testing of children.

In addition to the acknowledgements due to people for contributing
to the thinking embodied here, I also feel a debt of gratitude
towards those colleagues who indirectly contributed to the
development of this book by making it possible to devote
concentrated periods of time thinking and studying on the topics
discussed here. In the spring of 1974, Russell Campbell, now
Chairman of the TESL group at UCLA, contributed to the initial
work on this text by inviting me to give a series of lectures on
pragmatics and language testing at the American University, English
Language Institute in Cairo, Egypt. Then, in the fall semester of 1975,
a six week visit to the University of New Mexico by Richard Walker,
Deputy Director of Mount Gravatt College of Advanced Education
in Brisbane, Australia, resulted in a stimulating exchange with several
centers of activity in the world where child language development is
being seriously studied in relation to reading curricula. The
possibility of developing tests to assess the suitability of reading
materials to a given group of children and the discussion of the
relation of language to curriculum in general is very much a product
of the dialogue that ensued.

More recently, the work on this book has been pushed on to
completion thanks to a grant from the Center for English as a Second
Language and the Department of Linguistics at Southern Illinois
University in Carbondale. Without that financial support and the
encouragement of Patricia Carrell, Charles Parish, Richard Daesch,
Kyle Perkins, and others among the faculty and students there, it is
doubtful that the work could have been completed.

Finally, I want to thank the students in the testing courses at
Southern Illinois University and at Concordia University in
Montreal who read all or parts of the manuscript during various
stages of development. Their comments, criticisms, and encouragement
have sped the completion of the work and have improved the
product immensely.
Preface

It is, in retrospect, with remarkable speed that the main principles
and assumptions have become accepted of what can be called the
teaching and learning of language as communication. Acceptance, of
course, does not imply practical implementation, but distinctions
between language as form and language as function, the meaning
potential of language as discourse and the role of the learner as a
negotiator of interpretations, the match to be made between the
integrated skills of communicative actuality and the procedures of
the classroom, among many others, have all been widely announced,
although not yet adequately described. Nonetheless, we are certain
enough of the plausibility of this orientation to teaching and learning
to suggest types of exercise and pedagogic procedures which are
attuned to these principles and assumptions. Courses are being
designed, and textbooks written, with a communicative goal, even if,
as experiments, they are necessarily partial in the selection of
principles they choose to emphasise.

Two matters are, however, conspicuously lacking. They are
connected, in that the second is an ingredient of the first. Discussion
of a communicative approach has been very largely concentrated
on syllabus specifications and, to a lesser extent, on the design of
exercise types, rather than on any coherent and consistent view of
what we can call the communicative curriculum. Rather than
examine the necessary interdependence of the traditional curriculum
components: purposes, methods and evaluations, from the
standpoint of a communicative view of language and language
learning, we have been happy to look at the components singly, and
at some much more than others. Indeed, and this is the second matter,
evaluation has hardly been looked at at all, either in terms of
assessment of the communicative abilities of the learner or the
efficacy of the programme he is following. There are many current
examples, involving materials and methods aimed at developing
communicative interaction among learners, which are preceded,
interwoven with or followed by evaluation instruments totally at
odds with the view of language taken by the materials and the
practice with which they are connected. Partly because we have not
taken the curricular view, and partly because we have directed our
innovations towards animation rather than evaluation, teaching and
testing are out of joint. Teachers are unwilling to adopt novel
materials because they can see that they no longer emphasise
exclusively the formal items of language structure which make up the
'psychometric-structuralist' (in Spolsky's phrase) tests another
generation of applied linguists have urged them to use. Evaluators of
programmes expect satisfaction in terms of this testing paradigm
even at points within the programme when quite other aspects of
communicative ability are being encouraged. Clearly, this is an
undesirable and unproductive state of affairs.

It is to these twin matters of communication and curriculum that
John Oller's major contribution to the Applied Linguistics and
Language Study Series is addressed. He poses two questions: how can
language testing relate to a pragmatic view of language as communication,
and how can language testing relate to educational
measurement in general?

Chapter 1 takes up both issues; in a new departure for this Series
John Oller shows how language testing has general relevance for all
educationalists, not just those concerned with language. Indeed, he
hints here at a topic he takes up later, namely the linguistic basis of
tests of intelligence, achievement and aptitude. Language testing, as a
branch of applied linguistics, has cross-curricular relevance for the
learner at school. The major emphasis, however, remains the
connection to be made between evaluation, variable learner
characteristics, and a psycho-socio-linguistic perspective on 'doing'
language-based tasks.

This latter perspective is the substance of the four Chapters in Part
One of the book. Beginning from a definition of communicative
proficiency in terms of 'accuracy' in a learner's 'expectancy grammar'
(by which Oller refers to the learner's predictive competence in
formal, functional and strategic terms) he proceeds to characterise
communication as a functional, context-bound and culturally-specific
use of language involving an integrated view of receptive and
productive skills. It is against such a yardstick that he is able, both in
Chapters 3 and 4 of Part One, and throughout Part Two, to offer a
close, detailed and well-founded critical assessment of the theories
and methods of discrete point testing. Such an approach to testing,
Oller concludes, is a natural corollary of a view of language as form
and usage, rather than of process and use. If the view of language
changes to one concerned with the communicative properties of
language in use, then our ways of evaluating learners' competences to
communicate must also change.

In following Spolsky's shift towards the 'integrative-sociolinguistic'
view of language testing, however, John Oller does
not avoid the frequently-raised objection that although such tests
gain in apparent validity, they do so at a cost of reliability in scoring
and handling. The immensely valuable practical recommendations
for pragmatically-orientated language tests in Part Three of the
book constantly return to this objection, and show that it can be
effectively countered. What is more, and this is a strong practical
theme throughout the argument of the book, it is necessary to invoke
a third criterion in language testing, that of classroom utility. Much
discrete point testing, he argues, is not only premissed on an untenable
view of language for the teacher of communication, but in requiring
time-consuming and often arcane pre-testing, statistical evaluation
and rewriting techniques, poses quite impractical burdens on the
classroom teacher. What is needed are effective testing procedures,
linked to the needs of particular instructional programmes, reflecting
a communicative view of language learning and teaching, but which
are within the design and administrative powers of the teacher.
Pragmatic tests must be reliable and valid: they need also to be
practicable and to be accessible without presupposing technical
expertise. If, as the examples in Part Three of the book show, they can
be made to relate and be relevant to other subjects in the curriculum
than merely language alone, then the evaluation of pragmatic and
communicative competence has indeed cross-curricular significance.

Finally, a word on the book's organisation; although it is lengthy,
the book has clear-cut divisions: the Introduction in Chapter 1
provides an overview; Part One defines the requirements on
pragmatic testing; Part Two defines and critically assesses current
and overwhelmingly popular discrete point tests, and the concluding
Part Three exemplifies and justifies, in practical and technical terms,
the shape of alternative pragmatic tests. Each Chapter is completed
by a list of Key Points and Discussion Questions, and Suggested
Readings, thus providing the valuable and necessary working
apparatus to sustain the extended and well-illustrated argument.

Christopher N Candlin, General Editor
Lancaster, July 1978.
Author's Preface

A simple way to find out something about how well a person knows a
language (or more than one language) is to ask him. Another is to hire
a professional psychometrist to construct a more sophisticated test.
Neither of these alternatives, however, is apt to satisfy the needs of
the language teacher in the classroom or of any other educator,
whether in a multilingual context or not. The first method is too
subject to error, and the second is too complicated and expensive.
Somewhere between the extreme simplicity of just asking and the
development of standardized tests there ought to be reasonable
procedures that the classroom teacher could use confidently. This
book suggests that many such usable, practical classroom testing
procedures exist and it attempts to provide language teachers and
educators in bilingual programs or other multilingual contexts access
to those procedures.

There are several textbooks about language testing already on the
market. All of them are intended primarily for teachers of foreign
languages or English as a second language, and yet they are generally
based on techniques of testing that were not developed for classroom
purposes but for institutional standardized testing. The pioneering
volume by Robert Lado, Language Testing (1961), the excellent book
by Rebecca Valette, Modern Language Testing: A Handbook (1967),
the equally useful book by David Harris, Testing English as a Second
Language (1969), Foreign Language Testing: Theory and Practice
(1972) by John Clark, Testing and Experimental Methods (1977) by
J. P. B. Allen and Alan Davies, and Valette's 1977 revision of Modern
Language Testing all rely heavily (though not exclusively) on
techniques and methods of constructing multiple choice tests
developed to serve the needs of mass production.

Further, the books and manuals oriented toward multilingual
education such as Oral Language Tests for Bilingual Students (1976)
by R. Silverman, et al, are typically aimed at standardized published
tests. It would seem that all of the previously published books
attempt to address classroom needs for assessing proficiency in one or
more languages by extending to the classroom the techniques of
standardized testing. The typical test format discussed is generally the
multiple-choice discrete-item type. However, such techniques are
difficult and often impracticable for classroom use. While Valette,
Harris, and Clark briefly discuss some of the so-called 'integrative'
tests like dictation (especially Valette), composition, and oral
interview, for the most part they concentrate on the complex tasks of
writing, pre-testing, statistically evaluating, and re-writing discrete
point multiple-choice items.

The emphasis is reversed in this book. We concentrate here on
pragmatic testing procedures which generally do not require pre-testing,
statistical evaluation, or re-writing before they can be applied
in the classroom or some other educational context. Such tests can be
shown to be as appropriate to monolingual contexts as they are to
multilingual and multicultural educational settings.¹

Most teachers, whether in a foreign language classroom or in a
multilingual school, do not have the time or the technical
background necessary for multiple choice test development, much
less for the work that goes into the standardization of such tests.
Therefore, this book focusses on how to make, give, and evaluate valid
and reliable language tests of a pragmatic sort. Theoretical and
empirical reasons are given, however, to establish the practical
foundation and to show why teachers and educators can confidently
use the recommended testing procedures without a great deal of
prerequisite technical training. Although such training is desirable in
its own right, and is essential to the researcher in psychometry,
psycholinguistics, sociolinguistics, or education per se, this book is
meant as a handbook for those many teachers and educators who do
not have the time to master fully (even if that were possible) the highly
technical fields of statistics, research design, and applied linguistics.
The book is addressed to educators at the consumer end of
educational research. It tries to provide practical information
without presupposing technical expertise. Practical examples of
testing procedures are given wherever they are appropriate.

¹ Since it is believed that a multilingual context is normally also multicultural, and since
it is also the case that language and culture are mutually dependent and inseparable,
the term 'multilingual' is often used as an abbreviation for the longer term
'multilingual-multicultural' in spite of the fact that the latter term is often preferred by
many authors these days.

The main criterion of success for the book is whether or not it is
useful to educators. If it also serves some psychometrists, linguists
and researchers, so much the better. It is hoped that it will fill an
important gap in the resources available to language teachers and
educators in multilingual and in monolingual contexts. Thoughtful
suggestions and criticism are invited and will be seriously weighed in
the preparation of future editions or revisions.

John Oller
Albuquerque, New Mexico, 1979



1

Introduction

A. What is a language test?
B. What is language testing research about?
C. Organization of this book

This introduction discusses language testing in relation to education
in general. It is demonstrated that many tests which are not
traditionally thought of as language tests may actually be tests of
language more than of anything else. Further, it is claimed that this is
true both for students who are learning the language of the school as a
second language, and for students who are native speakers of the
language used at school. The correctness of this view is not a matter
that can be decided by preferences. It is an empirical issue, and
substantiating evidence is presented throughout this book. The main
point of this chapter is to survey the overall topography of the crucial
issues and to consider some of the angles from which the salient
points of interest can be seen in bold relief. The overall organization
of the rest of the book is also presented here.

A. What is a language test?

When the term 'language test' is mentioned, most people probably
have visions of students in a foreign language classroom poring over
a written examination. This interpretation of the term is likely
because most educated persons and most educators have had such an
experience at one time or another. Though only a few may have really
learned a second language (and practically none of them in a
classroom context), many have at least made some attempt. For
them, a language test is a device that tries to assess how much has
been learned in a foreign language course, or some part of a course.

But written examinations in foreign language classrooms are only
one of the many forms that language tests take in the schools. For any
student whose native language or language variety is not used in the
schools, many tests not traditionally thought of as language tests may
be primarily tests of language ability. For learners who speak a minority
language, whether it is Spanish, Italian, German, Navajo, Digueño,
Black English, or whatever language, coping with any test in the
school context may be largely a matter of coping with a language test.
There are, moreover, language aspects to tests in general even for the
student who is a native speaker of the language of the test.

In one way or another, practically every kind of significant testing of
human beings depends on a surreptitious test of ability to use a
particular language. Consider the fact that the psychological
construct of 'intelligence' or IQ, at least insofar as it can be measured,
may be no more than language proficiency. In any case, substantial
research (see Oller and Perkins, 1978) indicates that language ability
probably accounts for the lion's share of variability in IQ tests.
It remains to be proved that there is some unique and meaningful
variability that can be associated with other aspects of intelligence
once the language factor has been removed. And, conversely, it is
apparently the case that the bulk of variability in language
proficiency test scores is indistinguishable from the variability
produced by IQ tests. In Chapter 3 we will return to consider what
meaningful factors language skill might consist of (also see the
Appendix on this question). It has not yet been shown conclusively
that there are unique portions of variability in language test scores
that are not attributable to a general intelligence factor.
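
As a rough numerical illustration of what it means for one test to
'account for' variability in another (the figures below are hypothetical,
not taken from Oller and Perkins, 1978), the proportion of variance
two measures share is conventionally the square of their correlation:

    # Hypothetical illustration of shared variance: if a language
    # proficiency test and an IQ test correlate at r, they share
    # r-squared of their variance.
    r = 0.90            # hypothetical correlation between the two tests
    shared = r ** 2     # proportion of variance the tests have in common
    print(f"r = {r:.2f} -> {shared:.0%} of the variance is shared")
    # prints: r = 0.90 -> 81% of the variance is shared

On such a reading, a correlation of 0.90 would leave only about 19 per
cent of the variance that could reflect anything other than the shared
factor.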
Psycholinguistic and sociolinguistic research relies on language test
results in obvious ways. As Upshur (1976) has pointed out, so does
research on the nature of psychologically real grammars. Whether
the criterion measure is a reaction time to a heard stimulus, the
accuracy of attempts to produce, interpret, or recall a certain type of
verbal material, or the amount of time required to do so, or a mark on
a scale indicating agreement or disagreement with a given statement,
language proficiency (even if it is not the object of interest per se) is
probably a pervasive factor in the design, and may often be the major
factor.

The renowned Russian psychologist, A. R. Luria (1959) has
argued that even motor tasks as simple as pushing or not pushing a
button in response to a red or green light may be rather directly
related to language skill in very young children. On a much broader
scale, achievement testing may be much more a problem of language
testing than is commonly thought.

For all of these reasons the problems of language testing are a large
subset of the problems of educational measurement in general. The
methods and findings of language testing research are of crucial
importance to research concerning psychologically real grammatical
systems, and to all other areas that must of necessity make
assumptions about the nature of such systems. Hence, all areas of
educational measurement are either directly or indirectly implicated.
Neither intelligence measurement, achievement testing, aptitude
assessment, personality gauging, attitude measurement, nor just
plain classroom evaluation can be done without language testing. It
therefore seems reasonable to suggest that educational testing in
general can be done better if it takes the findings of language testing
research into account.

B. What is language testing research about?

In general, the subject matter of language testing research is the use
and learning of language. Within educational contexts, the domain of
foreign language teaching is a special case of interest. Multilingual
delivery of curricula is another very important case of interest.
However, the phenomena of interest to research in language testing
are yet more pervasive.

Pertinent questions of the broader domain include: (1) How can
levels of language proficiency, stages of language acquisition (first or
second), degrees of bilingualism, or language competence be defined?
(2) How are earlier stages of language learning different from later
stages, and how can known or hypothesized differences be
demonstrated by testing procedures? (3) How can the effects of
instructional programs or techniques (or environmental changes in
general) be demonstrated empirically? (4) How are levels of language
proficiency and the concomitant social interactions that they allow or
deny related to the acquisition of knowledge in an educational
setting? This is not to say that these questions have been or even will
be answered by language testing research, but that they are indicative
of some of the kinds of issues that such research is in a unique position
to grapple with.
Three angles of approach can be discerned in the literature on
language testing research. First, language tests may be examined as
tests per se. Second, it is possible to investigate learner characteristics
using language tests as elicitation procedures. Third, specific
hypotheses about psycholinguistic and sociolinguistic factors in the
performance of language based tasks may be investigated using
language tests as research tools.

It is important to note that regarding a subject matter from one
angle rather than another does not change the nature of the subject
matter, and neither does it ensure that what can be seen from one
angle will be interpretable fully without recourse to other available
vantage points. The fact is that in language testing research, it is never
actually possible to decide to investigate test characteristics, or
learner traits, or psycholinguistic and sociolinguistic constraints on
test materials without making important assumptions about all three,
regardless of which happens to be in focus at the moment.

In this book, we will be concerned with the findings of research
from all three angles of approach. When the focus is on the tests
themselves, questions of validity, reliability, practicality, and
instructional value will be considered. The validity of a test is related
to how well the test does what it is supposed to do, namely, to inform
us about the examinee's progress toward some goal in a curriculum or
course of study, or to differentiate levels of ability among various
examinees on some task. Validity questions are about what a test
actually measures in relation to what it is supposed to measure.

The reliability of a test is a matter of how consistently it produces
similar results on different occasions under similar circumstances.
Questions of reliability have to do with how consistently a test does
what it is supposed to do, and thus cannot be strictly separated from
validity questions. Moreover, a test cannot be any more valid than it
is reliable.
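
A minimal numerical sketch of this ceiling, assuming the classical
test-theory result that a validity coefficient cannot exceed the square
root of the test's reliability coefficient (the figures below are
hypothetical):

    # Hypothetical figures: in classical test theory, a test's correlation
    # with any criterion (validity) is bounded above by the square root
    # of its correlation with itself on re-administration (reliability).
    import math

    reliability = 0.64                # hypothetical test-retest correlation
    ceiling = math.sqrt(reliability)  # upper bound on validity coefficients
    print(f"reliability = {reliability:.2f} -> "
          f"validity at most {ceiling:.2f}")
    # prints: reliability = 0.64 -> validity at most 0.80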
A test's practicality must be determined in relation to the cost in
terms of materials, time, and effort that it requires. This must include
the preparation, administration, scoring, and interpretation of the
test.

Finally, the instructional value of a test pertains to how easily it can
be fitted into an educational program, whether the latter involves
teaching a foreign language, teaching language arts to native
speakers, or verbally imparting subject matter in a monolingual or
multilingual school setting.

When the focus of language testing research is on learner
characteristics, the tests themselves may be viewed as elicitation
procedures for data to be subsequently analyzed. In this case, scores
on a test may be treated as summary statistics indicating various
positions on a developmental scale, or individual performances may
be analyzed in a more detailed way in an attempt to diagnose specific
aspects of learner development. The results of the latter sort of
analysis, often referred to as 'error analysis' (Richards, 1970a, 1971)
may subsequently enter into the process of prescribing therapeutic
intervention - possibly a classroom procedure.

When the focus of language testing research is the verbal material
in the test itself, questions usually relate to the psychologically real
grammatical constraints on particular phonological (or graphological)
sequences, syllable structures, vocabulary items, phrases, clauses,
and higher level patterns of discourse. Sociological constraints may
also be investigated with respect to the negotiability of those elements
and sequences of them in interactional exchanges between human
beings or groups of them.

For any of the stated purposes of research, and of course there are
others which are not mentioned, the tests may be relatively formal
devices or informal elicitation procedures. They may require the
production or comprehension of verbal sequences, or both. The
language material may be heard, spoken, read, written (or possibly
merely thought), or some combination of these. The task may require
recognition only, or imitation, or delayed recall, memorization,
meaningful conversational response, learning and long term storage,
or some combination of these.

Ultimately, any attempt to apply the results of language testing
research must consider the total spectrum of tests qua tests, learner
characteristics, and the psychological and sociological constraints on
test materials. Inferences concerning psychologically real grammars
cannot be meaningful apart from the results of language tests viewed
from all three angles of research outlined above. Whether or not a
particular language test is valid (or better, the degree to which it is
valid or not valid), whether or not an achievement test or aptitude
test, or personality inventory, or IQ test, or whatever other sort of test
one chooses to consider is a language test, is dependent on what
language competence really is and what sorts of verbal sequences
present a challenge to that competence. This is essentially the
question that Spolsky (1968) raised in the paper entitled: 'What does
it mean to know a language? Or, How do you get someone to perform
his competence?'
C. Organization of this book

A good way, perhaps the only acceptable way to develop a test of a
given ability is to start with a clear definition of the capacity in
question. Chapter 2 begins Part One, on Theory and Research Bases
for Pragmatic Language Testing, by proposing a definition for
language proficiency. It introduces the notion of an expectancy
grammar as a way of characterizing the psychologically real system
that governs the use of a language in an individual who knows that
language. Although it is acknowledged that the details of such a
system are just beginning to be understood, certain pervasive
characteristics of expectancy systems can be helpful in explaining
why certain kinds of language tests apparently work as well as they
do, and how to devise other effective testing procedures that take
account of those salient characteristics of functional language
proficiency.

In Chapter 3, it is hypothesized that a valid language test must
press the learner's internalized expectancy system into action and
must further challenge its limits of efficient functioning in order to
discriminate among degrees of efficiency. Although it is suggested
that a statistical average of native performance on a language test is
usually a reasonable upper bound on attainable proficiency, it is
almost always possible and is sometimes essential to discriminate
degrees of proficiency among native speakers, e.g. at various stages of
child language learning, or among children or adults learning to read,
or among language learners at any stage engaged in normal
inferential discourse processing tasks. Criterion referenced testing,
where passing the test or some portion of it means being able to
perform the task or tasks at some desired level of adequacy (which
may be native-like performance in some cases) is also discussed.
Pragmatic language tests are defined and exemplified as tasks that
require the meaningful processing of sequences of elements in the
target language (or tested language) under normal time constraints.
It is claimed that time is always involved in the processing of verbal
sequences.
Chapter 4 extends the discussion to questions often raised in
reference to bilingual-bicultural programs and other multilingual
contexts in general. The implications of the now famous Lau versus
Nichols case are discussed. Special attention is given to the role of
socio-cultural attitudes in language acquisition and in education in
general. It is hypothesized that, other things being equal, attitudes
expressed and perceived in the schools probably account for more
variance in rate and amount of learning than do educational
methodologies related to the transmission of the traditionally
conceived curriculum. Some of the special problems that arise in
multilingual contexts are considered, such as cultural bias in tests,
difficulties in translating test items, and methods of assessing
language dominance.

Chapter 5 concludes Part One with a discussion of the measurement
of attitudes and motivations. It discusses in some detail
questions related to the hypothesized relationship between attitudes
and language learning (first or second), and considers such variables
as the context in which the language learning takes place, and the
types of measurement techniques that have been used in previous
research. Several hypotheses are offered concerning the relationship
between attitudes, motivations, and achievement in education.
Certain puzzling facts about apparent interacting influences in
multilingual contexts are noted.
Part Two, Theories and Methods of Discrete Point Testing, takes up
some of the more traditional and probably less theoretically sound
ways of approaching language testing. Chapter 6 discusses some of
the difficulties associated with testing procedures that grew out of
contrastive linguistics, syntax based structure drills, and certain
assumptions about language structure and the learning of language
from early versions of transformational linguistics.

Pitfalls in relying too heavily on statistics for guiding test
development are discussed in Chapter 7. It is shown that different
theoretical assumptions may result in contradictory interpretations
of the same statistics. Thus it is argued that such statistical techniques
as are normally applied in test development, though helpful if used
with care, should not be the chief criterion for deciding test format.
An understanding of language use and language learning must take
priority in guiding format decisions.

Chapter 8 shows how discrete point language tests may produce
distorted estimates of language proficiency. In fact, it is claimed that
some discrete point tests are probably most appropriate as measures
of the sorts of artificial grammars that learners are sometimes
encouraged to internalize on the basis of artificial contexts of certain
syntax dominated classroom methods. In order to measure
communicative effectiveness for real-life settings in and out of the
classroom, it is reasoned that the language tests used in the classroom
(or in any educational context) must reflect certain crucial properties
of the normal use of language in ways that some discrete point tests
apparently cannot. Examples of discrete point items which attempt to
examine the pieces of language structure apart from some of their
systematic interrelationships are considered. The chapter concludes
by proposing that the diagnostic aims of discrete point tests can in
fact be achieved by so-called integrative or pragmatic tests. Hence, a
reconciliation between the apparently irreconcilable theoretical
positions is possible.
In conclusion to Part Two on discrete point testing, Chapter 9
provides a natural bridge to Part Three, Practical Recommendations
for Language Testing, by discussing multiple choice testing
procedures which may be of the discrete point type, or the integrative
type, or anywhere on the continuum in between the two extremes.
However, regardless of the theoretical bent of the test writer, multiple
choice tests require considerable technical skill and a good deal of
energy to prepare. They are in some respects less practical than some
of the pragmatic procedures recommended in Part Three precisely
because of the technical skills and the effort necessary to their
preparation. Multiple choice tests need to be critiqued by some native
speaker other than the test writer. This is necessary to avoid the
pitfalls of ambiguities and subtle differences of interpretation that
may not be obvious to the test writer. The items need to be pretested,
preferably on some group other than the population which will
ultimately be tested with the finished product (this is often not
feasible in classroom situations). Then, the items need to be
statistically analyzed so that non-functional or weak items can be
revised before they are used and interpreted in ways that affect
learners. In some cases, recycling through the whole procedure is
necessary even though all the steps of test development may have
been quite carefully executed. Because of these complexities and costs
of test development, multiple choice tests are not always suitable for
meeting the needs of classroom testing - or for broader institutional
purposes in some cases.
The reader who is interested mainly in classroom applications of
pragmatic testing procedures may want to begin reading at Chapter
10 in Part Three. However, unless the material in the early chapters
(especially 2 through 9) is fairly well grasped, the basis for many of the
recommendations in Part Three will probably not be appreciated
fully. For instance, many educators seem to have acquired the
impression that certain pragmatic language tests, such as those based
on the cloze procedure for example (see Chapter 12), are 'quick and
dirty' methods of acquiring information about language proficiency.
This idea, however, is apparently based only on intuitions and is
disconfirmed by the research discussed in Part One. Pragmatic tests
are typically better on the whole than any other procedures that have
been carefully studied. Whereas the prevailing techniques of
language testing that educators are apt to be most familiar with are
based on the discrete point theories, these methods are rejected in
Part Two. Hence, were the reader to skip over to Part Three
immediately, he might be left in a quandary as to why the pragmatic
testing techniques discussed there are recommended instead of the
more familiar discrete point (and typically multiple choice) tests.

Although pragmatic testing procedures are in some cases
deceptively simple to apply, they probably provide more accurate
information concerning language proficiency (and even specific
achievement objectives) than the more familiar tests produced on the
basis of discrete point theory. Moreover, not only are pragmatic tests
apparently more valid, but they are more practicable. It simply takes
less premeditation, and less time and effort to prepare and use
pragmatic tests. This is not to say that great care and attention are not
necessary to the use of pragmatic testing procedures, quite the
contrary. It is rather to say that hour for hour and dollar for dollar
the return on work and money expended in pragmatic testing will
probably offer higher dividends to the learner, the educator, and the
taxpayer. Clearly, much more research is needed on both pragmatic
and discrete point testing procedures, and many suggestions for
possible studies are offered throughout the text.
Chapter 10 discusses some of the practical classroom applications
of the procedure of dictating material in the target language.
Variations of the technique which are also discussed include the
procedure of 'elicited imitation' employed with monolingual,
bilingual, and bidialectal children.

In Chapter 11 attention is focussed on a variety of procedures
requiring the use of productive oral language skills. Among the ones
discussed are reading aloud, story retelling, dialogue dramatization,
and conversational interview techniques; the Foreign Service
Institute Oral Interview, the Ilyin Oral Interview, the Upshur Oral
Communication Test, and the Bilingual Syntax Measure are also
discussed.
The increasingly widely used cloze procedure and variations of it
are considered in Chapter 12. Basically the procedure consists of
deleting words from prose (or auditorily presented material) and
asking the examinee to try to replace the missing words. Because of
the simplicity of application and the demonstrated validity of the
technique, it has become quite popular in recent years. However, it is
probably not any more applicable to classroom purposes than some
of the procedures discussed in other chapters. The cloze procedure is
sometimes construed to be a measure of reading ability, though it
may be just as much a measure of writing, listening and speaking
ability (see Chapter 3, and the Appendix).
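
To make the mechanics concrete, here is a minimal sketch of the
fixed-ratio deletion just described; the every-fifth-word ratio and the
sample passage are illustrative assumptions, not a prescription from
the book:

    # Minimal fixed-ratio cloze construction: delete every nth word from
    # a passage, keeping the deleted words as the scoring key. The ratio
    # (5) and the passage are illustrative assumptions only.
    def make_cloze(text, every_nth=5):
        words = text.split()
        key = []
        for i in range(every_nth - 1, len(words), every_nth):
            key.append(words[i])
            words[i] = "_____"
        return " ".join(words), key

    passage, key = make_cloze(
        "The cloze procedure consists of deleting words from prose and "
        "asking the examinee to try to replace the missing words."
    )
    print(passage)   # the mutilated passage with blanks
    print(key)       # the deleted words, in order

Scoring responses against the key in this way corresponds to requiring
the exact deleted word; looser scoring conventions change only how
responses are compared against that key.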
Chapter 13 looks at writing tasks per se. It gives considerable space
to ways of approaching the difficulties of scoring relatively free
essays. Alternative testing methods considered include various
methods of increasing the constraints on the range of material that
the examinee may produce in response to the test procedure. Controls
range from suggesting topics for an essay to asking examinees to
rephrase heard or read material after a time lapse. Obviously, many
other control techniques are possible, and variations on the cloze
procedure can be used to construct many of them.
Chapter 14 considers ways in which testing procedures can be
related to curricula. In particular it asks, 'How can effective testing
procedures be invented or adapted to the needs of an instructional
program?' Some general guidelines are tentatively suggested both for
developing or adapting testing procedures and for studying their
effectiveness in relation to particular educational objectives. To
illustrate one of the ways that curriculum (learning, teaching, and
testing) can be related to a comprehensive sort of validation research,
the Mount Gravatt reading research project is discussed. This project
provides a rich source of data concerning preschool children, and
children in the early grades. By carefully studying the kinds of
language games that children at various age levels can play and win
(Upshur, 1973), that is, the kinds of things that they can explore
verbally and with success, Norman Hart, Richard Walker, and their
colleagues have provided a model for relating theories of language
learning via careful research to the practical educational task of
teaching reading. There are, of course, spin-off benefits to all other
areas of the curriculum because of the fundamental part played by
language use in every area of the educational process. It is strongly
urged that language testing procedures, especially for assessing the
language skills of children, be carefully examined in the light of such
research.
Throughout the text, wherever technical projects are referred to,
details of a technical sort are either omitted or are explained in
non-technical language. More complete research reports are often
referred to in the text (also see the Appendix) and should be consulted
by anyone interested in applying the recommendations contained
here to the testing of specific research hypotheses. However, for
classroom purposes (where at least some of the technicalities of
research design are luxuries) the suggestions offered here are intended
to be sufficient. Additional readings, usually of a non-technical sort,
are suggested at the end of each chapter. A complete list of technical
reports and other works referred to in the text is included at the end of
the book in a separate Bibliography. The latter includes all of the
Suggested Readings at the end of each chapter, plus many items not
given in the Suggested Readings lists. An Appendix reporting on a
recent empirical study of many of the pressing questions raised in the
body of the text is included at the end. The fundamental question
addressed there is whether or not language proficiency can be parsed
up into components, skills, and the like. Further, it is asked whether
language proficiency is distinct from IQ, achievement, and other
educational constructs. The Appendix is not included in the body of
the text precisely because it is somewhat technical.

It is to be expected that a book dealing with a subject matter that is
changing as rapidly as the field of language testing research should
soon be outdated. However, it seems that most current research is
pointing toward the refinement of existing pragmatic testing
procedures and the discovery of new ones and new applications. It
seems unlikely that there will be a substantial return to the strong
versions of discrete point theories and methods of the 1960s and early
1970s. In any event the emphasis here is on pragmatic language
testing because it is believed that such procedures offer a richer yield
of information.

KEY POINTS
1. Any test that challenges the language ability of an examinee can, at least
in part, be construed as a language test. This is especially true for
examinees who do not know or normally use the language variety of the
test, but is true in a broader sense for native speakers of the language of
the test.
2. It is not known to what extent language ability may be co-extensive with
IQ, but there is evidence that the relationship is a very strong one. Hence,
IQ tests (and many other varieties of tests as well) may be tests of
language ability more than they are tests of anything else.
3. Language testing is crucial to the investigation of psychologically real
grammars, to research in all aspects of distinctively human symbolic
behavior, and to educational measurement in general.
4. Language testing research may focus on tests, learners, or constraints on
verbal materials.
5. Among the questions of interest to language testing research are: (a) how
to operationally define levels of language proficiency, stages of language
learning, degrees of bilingualism, or linguistic competence; (b) how to
differentiate stages of learning; (c) how to measure possible effects of
instruction (or other environmental factors) on language learning; (d)
how language ability is related to the acquisition of knowledge in an
educational setting.
6. When the research is focussed on tests, validity, reliability, practicality,
and instructional value are among the factors of interest.
7. When the focus is on learners and their developing language systems,
tests may be viewed as elicitation procedures. Data elicited may then be
analyzed with a view toward providing detailed descriptions of learner
systems, and/or diagnosis of teaching procedures (or other therapeutic
interventions) to facilitate learning.
8. When research is directed toward the verbal material in a given test or
testing procedure, the effects of psychological or sociological constraints
built into the verbal sequences themselves (or constraints which are at
least implicit to language users) are at issue.
9. From all of the foregoing it follows that the results of language tests, and
the findings of language testing research are highly relevant to
psychological, sociological, and educational measurement in general.

DISCUSSION QUESTIONS
1. What tests are used in your school that require the comprehension or
production of complex sequences of material in a language? Reading
achievement tests? Aptitude tests? Personality inventories? Verbal IQ
tests? Others? What evidence exists to show that the tests are really
measures of different things?
2. Are tests in your school sometimes used for examinees whose native
language (or language variety) is not the language (or language variety)
used in the test? How do you think such tests ought to be interpreted?
3. Is it possible that a non-verbal test of IQ could have a language factor
unintentionally built into it? How are the instructions given? What
strategies do you think children or examinees must follow in order to do
the items on the test? Are any of those strategies related to their ability to
code information verbally? To give subvocal commands?
4. In what ways do teachers normally do language testing (unintentionally)
in their routine activities? Consider the kinds of instructions children or
adults must execute in the classroom.
5. Take any standardized test used in any school setting. Analyze it for its
level of verbal complexity. What instructions does it use? Are they more
or less complex than the tasks which they define or explain? For what age
level of children or for what proficiency level of second language learners
would such instructions be appropriate? Is guessing necessary in order to
understand the instructions of the test?
6. Consider any educational research project that you know of or have access to. What sorts of measurements did the research use? Was there a testing technique? An observational or rating procedure? A way of recording behaviors? Did language figure in the measurements taken?
7. Why do you think language might or might not be related to capacity to perform motor tasks, particularly in young children? Consider finding your way around town, or around a new building, or around the house. Do you ever use verbal cues to guide your own stops, starts, turns? Subvocal ones? How about in a strange place or when you are very tired? Do you ever ask yourself things like, Now what did I come in here for?
8. Can you conceive of any way to operationalize notions like language competence, degree of bilingualism, stages of learning, effectiveness of language teaching, rate of learning, level of proficiency without language tests?
"9. If you were to rank the criteria of validity , reliability, practicality and
instructional vall~e in their order o f importance, what order woul d you
put them in ? Consider the fact that validity wi thout practica lit y is
certain ly possible. The same is true for validity with ou t instructional
value. How abou t instructional value withou t va lidity?
lIIO. Do you consid er the concept of intelligence or IQ t o be a useful
theoretical construct? Do you believe th at resea rchers and theorists
know what they mean by the term apart fro m some test score ? How
about grammatical knowledge? Is it the same son of construct ?
·. . .·1J. Ca n you think of any way(~) that time is normally invol ved in a task like
readi ng a n ovel when no one is holding a stop-watch?
_0

SUGGESTED READINGS
1. George A. Miller, 'The Psycholinguists,' Encounter 23, 1964, 29-37. Reprinted in Charles E. Osgood and Thomas A. Sebeok (eds.) Psycholinguistics: A Survey of Theory and Research Problems. Bloomington, Indiana: Indiana University, 1965.
2. John W. Oller, Jr. and Kyle Perkins, Language in Education: Testing the Tests. Rowley, Massachusetts: Newbury House, 1978.
3. Bernard Spolsky, 'Introduction: Linguists and Language Testers' in Advances in Language Testing: Series 2. Approaches to Language Testing. Arlington, Virginia: Center for Applied Linguistics, 1978, v-x.
PART ONE
Theory and Research Bases
for Pragmatic Language Testing
2

Language Skill
as a Pragmatic Expectancy
Grammar
A. What is pragmatics about?
B. The factive aspect of language use
C. The emotive aspect
D. Language learning as grammar
construction and modification
E. Tests that invoke the learner's
grammar

Understanding what is to be tested is prerequisite to good testing of any sort. In this chapter, the object of interest is language as it is used for communicative purposes - for getting and giving information about (or for bringing about changes in) facts or states of affairs, and for expressing attitudes toward those facts or states of affairs. The notion of expectancy is introduced as a key to understanding the nature of psychologically real processes that underlie language use. It is suggested that expectancy generating systems are constructed and modified in the course of language acquisition. Language proficiency is thus characterized as consisting of such an expectancy generating system. Therefore, it is claimed that for a proposed measure to qualify as a language test, it must invoke the expectancy system or grammar of the examinee.

A. What is pragmatics about?


The newscaster in Albuquerque who smiled cheerfully while speaking of traffic fatalities, floods, and other calamities was expressing an entirely different attitude toward the facts he was referring to than was probably held by the friends and relatives of the victims, not to
mention the more compassionate strangers who were undoubtedly
watching him on the television. There might not have been any
disagreement about what the facts were, but the potent contrast in the
attitudes of the newscaster and others probably accounts for the
brevity of the man's career as the station's anchorman.
Thus, two aspects of language use need to be distinguished.
Language is usually used to convey information about people, things,
events, ideas, states of affairs, and attitudes toward all of the
foregoing. It is possible for two or more people to agree entirely about
the facts referred to or the assumptions implied by a certain statement
but to disagree markedly in their attitudes toward those facts. The
newscaster and his viewers probably disagreed very little or not at all
concerning the facts he was speaking of. It was his manner of
speaking (including his choice of words) and the attitude conveyed by
it that probably shortened his career.
Linguistic analysis has traditionally been concerned mainly with what might be called the factive (or cognitive) aspect of language use.
The physical stuff of language which codes factive information
usually consists of sequences of distinctive sounds which combine to
form syllables, which form words, which get hooked together in
highly constrained ways to form phrases, which make up clauses,
which also combine in highly restricted ways to yield the incredible
diversity of human discourse. By contrast, the physical stuff of
language which codes emotive (or affective, attitudinal) information
usually consists of facial expression, tone of voice, and gesture.
Psychologists and sociologists have often been interested more in the
emotive aspect of language than in the cognitive complexities of the
factive aspect. Certainly cognitive psychology and linguistics, along
with philosophy and logic, on the other hand, have concentrated on
the latter.
Although the two aspects are intricately interrelated, it is often
useful and sometimes essential to distinguish them. Consider, for
instance, the statement that Some of Richard's lies have been
discovered. This remark could be taken to mean that there is a certain
person named Richard (whom we may infer to be a male human),
who is guilty of lying on more than one or two occasions, and some of
whose lies have been found out. In addition, the remark implies that
there are other lies told by Richard which may be uncovered later.
Such a statement relates in systematic ways to a speaker's asserted
beliefs concerning certain states of affairs. Of course, the speaker may
be lying, or sincere but mistaken, or sincere and correct, and these are
only some of the many possibilities. In any case, however, as persons who know English, we understand the remark about Richard partly
by inferring the sorts of facts it would take to make such a statement
true. Such inferences are not perfectly understood, but there is no
doubt that language users make them. A speaker (or writer) must
make them in order to know what his listener (or reader) will
probably understand, and the listener (or reader) must make them in
order to know what a speaker (or writer) means.
In addition to the factive information coded in the words and
phrases of the statement, a person who utters that statement may
convey attitudes toward the asserted or implied states of affairs, and
may further code information concerning the way the speaker thinks
the listener should feel about those states of affairs. For instance, the
speaker may appear to hate Richard, and to despise his lies (both
those that have already been discovered and the others not yet found
out), or he may appear detached and impersonal. In speaking, such
emotive effects are achieved largely by facial expression, tone of
voice, and gesture, but they may also be achieved in writing by
describing the manner in which a statement is made or by skillful
choice of words. The latter, of course, is effective either in spoken or
written discourse as a device for coding emotive information. Notice
the change in the emotive aspect if the word lies is replaced by half-
truths: Some of Richard's half-truths have been discovered. The
disapproval is weakened still further if we say: Some of Richard's
mistakes have been discovered, and further still if we change mistakes
to errors of judgement.
In the normal use of language it is possible to distinguish two major
kinds of context. First, there is the physical stuff of language which is organized into a more or less linear arrangement of verbal elements skillfully and intricately interrelated with a sequence of rather precisely timed changes in tone of voice, facial expression, body posture, and so on. To call attention to the fact that in human beings
even the latter so-called 'paralinguistic' devices of communication are
an integral part of language use, we may refer to the verbal and
gestural aspects of language in use as constituting linguistic context.
With reference to speech it is possible to decompose linguistic context
into verbal and gestural contexts. With reference to writing, the terms
linguistic context and verbal context may be used interchangeably.
A second major type of context has to do with the world, outside of
language, as it is perceived by language users in relation to themselves
and valued other persons or groups. We will use the term
extralinguistic context to refer to states of affairs constituted by things, events, people, ideas, relationships, feelings, perceptions, memories, and so forth. It may be useful to distinguish objective aspects of extralinguistic context from subjective aspects. On the one hand, there is the world of existing things, events, persons, and so forth, and on the other, there is the world of self-concept, other-concept, interpersonal relationships, group relationships, and so on. In a sense, the two worlds are part of a single totality for any individual, but they are not necessarily so closely related. Otherwise, there would be no need for such terms as schizophrenia, or paranoia.

Neither linguistic nor extralinguistic contexts are simple in themselves, but what complicates matters still further and makes meaningful communication possible is that there are systematic correspondences between linguistic contexts and extralinguistic ones. That is, sequences of linguistic elements in normal uses of language are not haphazard in their relation to people, things, events, ideas, relationships, attitudes, etc., but are systematically related to states of affairs outside of language. Thus we may say that linguistic contexts are pragmatically mapped onto extralinguistic contexts, and vice versa.

We can now offer a definition of pragmatics. Briefly, it addresses the question: how do utterances relate to human experience outside of language? It is concerned with the relationships between linguistic contexts and extralinguistic contexts. It embraces the traditional subject matter of psycholinguistics and also that of sociolinguistics. Pragmatics is about how people communicate information about facts and feelings to other people, or how they merely express themselves and their feelings through the use of language for no particular audience, except possibly an omniscient God. It is about how meaning is both coded and in a sense invented in the normal intercourse of words and experience (to borrow a metaphor from Dewey, 1929).

B. The factive aspect of language use


Language, when it is used to convey information about facts, is always an abbreviation for a richer conceptualization. We know more about objects, events, people, relationships, and states of affairs than we are ever fully able to express in words. Consider the difficulty of saying all you know about the familiar face of a friend. The fact is that your best effort would probably fail to convey enough
information to enable someone else to single out your friend in a large crowd. This simply illustrates the fact that you know more than you
are able to say.
Here is another example. Not long ago, a person whom I know
very well, was involved in an accident. He was riding a ten-speed
bicycle around a corner when he was hit head-on by a pick-up. The
driver of the truck cut the corner at about thirty miles an hour leaving
no room for the cyclist to pass. The collision was inevitable. Blood
gushed from a three inch gash in the top of his head and a blunt
handlebar was rammed nearly to the bone in his left thigh. From this
description you have some vivid impressions about the events
referred to. However, no one needs to point out the fact that you do
not know as much about the events referred to as the person who
experienced them. Some of what you do know is a result of the
linguistic context of this paragraph, and some of what you know is
the result of inferences that you have correctly made concerning what
it probably feels like to have a blunt object rammed into your thigh,
or to have a three inch gash in your head, but no matter how much
you are told or are able to infer it will undoubtedly fall short of the
information that is available to the person who experienced the
events in his own flesh. Our words are successful in conveying only part of the information that we possess.
Whenever we say anything at all we leave a great deal more unsaid.
We depend largely for the effect of our communications not only on
what we say but also on the creative ability of our listeners to fill in
what we have left unsaid. The fact is that a normal listener supplies a
great deal of information by creative inference and in a very
important sense is always anticipating what the speaker will say next.
Similarly, the speaker is always anticipating what the listener will
infer and is correcting his output on the basis of feedback received
from the listener. Of course, some language users are more skillful in
such things than others.
We are practically always a jump or two ahead of the person that
we are listening to, and sometimes we even outrun our own tongues
when we are speaking. It is not unusual in a speech error for a speaker
to say a word several syllables ahead of what he intended to say, nor is
it uncommon for a listener to take a wrong turn in his thinking and
fail to understand correctly, simply because he was expecting
something else to be said.
It has been shown repeatedly that tampering with the speaker's
own feedback of what he is saying has striking debilitating effects
(Chase, Sutton, and First, 1959). The typical experiment illustrating
this involves delayed auditory feedback or sidetone. The speaker's
voice is recorded on a tape and played back a fraction of a second
later into a set of headphones which the speaker is wearing. The result
is that the speaker hears not what he is saying, but what he has just
said a fraction of a second earlier. He invariably stutters and distorts
syllables almost beyond recognition. The problem is that he is trying
to compensate for what he hears himself saying in relation to what he
expects to hear. After some practice, it is possible for the speaker to
ignore the delayed auditory feedback and to speak normally by
attending instead to the so-called kinesthetic feedback of the
movements of the vocal apparatus and presumably the bone
conducted vibrations of the voice.
The pervasive importance of expectations in the processing of all
sorts of information is well illustrated in the following remark by the
world renowned neurophysiologist, Karl Lashley:
The organization of language seems to me to be
characteristic of almost all other cerebral activity. There is a
series of hierarchies of organization: the order of vocal
movements in pronouncing the word, the order of words in the
sentence, the order of sentences in the paragraph, the rational
order of paragraphs in a discourse. Not only speech, but all
skilled acts seem to involve the same problems of serial
ordering, even down to the temporal coordinations of muscular
contractions in such a movement as reaching and grasping
(1951, p. 187).
A major aspect of language use that a good theory must explain is
that there is, in Lashley's words, 'a series of hierarchies of
organization.' That is, there are units that combine with each other to
form higher level units. For instance, the letters in a written word
combine to form the written word itself. The word is not a letter and
the letter is not a word, but the one unit acts as a building block for
the other, something like the way atoms combine to form molecules.
Of course, atoms consist of their own more elementary building
blocks and molecules combine in complex ways to become the
building blocks of a great diversity of higher order substances.
Words make phrases and the phrases carry new and different
meanings which are not part of the separate words of which they are
made. For instance, consider the meanings of the words head, red,
the, and beautiful. Now consider their meanings again in the phrase
the beautiful redhead as in the sentence, She's the beautiful redhead I've
been telling you about. At each higher level in the hierarchy, as John
Dewey (1929) put it, new meanings are bred from the copulating
forms. This in a nutshell is the basis of the marvelous complexity and
novelty of language as an instrument for coding information and for
conceptualizing.
Noam Chomsky, eminent professor of linguistics at the
Massachusetts Institute of Technology, is mainly responsible for the
emphasis in modern linguistics on the characteristic novelty of
sentences. He has argued convincingly (cf. especially Chomsky, 1972)
that novelty is the rule rather than the exception in the everyday use
of language. If a sentence is more than part of a ritual verbal pattern
(such as, 'Hello. How are you?'), it is probably a novel concoction of
the speaker which probably has never been heard or said by him
before. As George Miller (1964) has pointed out, a conservative estimate of the number of possible twenty-word sentences in English is on the order of the number of seconds in one hundred million centuries. Although sentences may share certain structural features, any particular one that happens to be uttered is probably invented, new on the spot. The probability that it has been heard before and memorized is very slight.
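To get a feel for the size of that figure, the time comparison can be unpacked with a rough conversion (taking a year to be about $3.15 \times 10^{7}$ seconds):

$$10^{8}\ \text{centuries} \times 10^{2}\ \frac{\text{years}}{\text{century}} \times 3.15 \times 10^{7}\ \frac{\text{seconds}}{\text{year}} \approx 3 \times 10^{17}\ \text{seconds}.$$

Even on this conservative estimate, then, the stock of possible twenty-word sentences exceeds by an astronomical margin anything a speaker could have heard and memorized, which is precisely Miller's point.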
The novelty of language, however, is a kind of freedom within limits. When the limits on the creativity allowed by language are violated, many versions of nonsense result. They may range from unpronounceable sequences like gbntmbwk (unpronounceable in English at least) to pronounceable nonsense such as nox ems glerf onmo kebs (from Osgood, 1955). They may be syntactically acceptable but semantically strange concoctions like the much overused example of Jabberwocky or Chomsky's (now trite) illustrative sentence, Colorless green ideas sleep furiously.¹
A less well known passage of nonsense was invented by Samuel Foote, one of the best known humorists of the eighteenth century, in order to prove a point about the organization of memory. Foote had been attending a series of lectures by Charles Macklin on oratory. On one particular evening, Mr Macklin boasted that he had mastered the principles of memorization so thoroughly that he could repeat any paragraph by rote after having read it only once. At the end of the
lecture, the unpredictable Foote handed Mr Macklin a piece of
paper on which he had composed a brief paragraph during the

¹ At one Summer Linguistics Institute, someone had bumper stickers printed up with Chomsky's famous sentence. One of them found its way into the hands of my brother, D. K. Oller, and eventually onto my bumper to the consternation of many Los Angeles motorists.
lecture. He asked Mr Macklin to kindly read it aloud once to the
audience and then to repeat it from memory. So Mr Macklin read:
So she went into the garden to cut a cabbage leaf to make an apple pie: and at the same time a great she-bear coming up the street pops its head in the shop. 'What! No soap!' So he died, and she very imprudently married the barber: and there were present the Picninnies, the Joblillies, and the Garcelies, and the Great Panjandrum himself, with the little round button at the top, and they all fell to playing the game of catch-as-catch-can, till the gunpowder ran out the heels of their boots (Samuel Foote, ca. 1854; see Cooke, 1902, p. 221f.)²
The incident probably improved Mr Macklin's modesty and it
surely instructed him on the importance of the reconstructive aspects
of verbal memory. We don't just happen to remember things in all
their detail: rather, we remember a kind of skeleton, or possibly a
whole hierarchy of skeletons, to which we attach the flesh of detail by
a creative and reconstructive process. That process, like all verbal and
cognitive activities, is governed largely by what we have learned to
expect. The fact is that she-bears rarely pop their heads into barber
shops, nor do people cut cabbage leaves to make apple pies. For
such reasons, Foote's prose is difficult to remember. It is because the
contexts, outside of language, which are mapped by his choice of
words are odd contexts in themselves. Otherwise, the word sequences
are grammatical enough.
Perhaps the importance of our normal expectancies concerning
words and what they mean is best illustrated by nonsense which
violates those expectancies. The sequence gbntmbwk forces on our attention things that we know only subconsciously about our language - for example, the fact that g cannot immediately precede b
at the beginning of a word, and that syllables in English must have a
vowel sound in them somewhere (unless shhhh! is a syllable). These
are facts we know because we have acquired an expectancy grammar
for English.
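Those two expectancies are simple enough to state as explicit rules. The sketch below is a deliberately crude, hypothetical illustration (in Python; nothing of the kind is proposed in the testing literature) encoding just the two constraints mentioned - no word-initial gb-, and at least one vowel letter somewhere - and it duly rejects gbntmbwk while accepting glerf:

```python
# A toy fragment of an 'expectancy grammar': two of the English
# phonotactic expectancies mentioned in the text, stated as rules.
# This is an illustrative sketch only, not a serious model.

VOWELS = set("aeiou")

def fits_english_expectancies(word: str) -> bool:
    word = word.lower()
    if word.startswith("gb"):       # g cannot immediately precede b word-initially
        return False
    if not (VOWELS & set(word)):    # a syllable needs a vowel sound, which in
        return False                # ordinary spelling means a vowel letter
    return True

for w in ("gbntmbwk", "glerf", "horse"):
    print(w, fits_english_expectancies(w))
# gbntmbwk False / glerf True / horse True
```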
Our internalized grammar tells us that glerf is a possible word in
English. It is pronounceable and is parallel to words that exist in the

² I am indebted to my father, John Oller, Sr., for this illustration. He used it often in his talks on language teaching to show the importance of meaningful sequence to recall and learning. He often attributed the prose to Mark Twain, and it is possible that Twain used this same piece of wit to debunk a supposed memory expert in a circus contest as my father often claimed. I have not been able to document the story about Twain, though it seems characteristic of him. No doubt he knew of Foote and may have consciously imitated him.

language such as glide, serf, and slurp; still, glerf is not an English word. Our grammatical expectancies are not completely violated by Lewis Carroll's phrase, the frumious bandersnatch, but we recognize this as a novel creation. Our internalized grammar causes us to suppose that frumious must be an adjective that modifies the noun bandersnatch. We may even imagine a kind of beast that, in the context, might be referred to as a frumious bandersnatch. Our inferential construction may or may not resemble anything that Carroll had in mind, if in fact he had any 'thing' in mind at all. The inference here is similar to supposing that Foote was referring to Macklin himself when he chose the phrase the Great Panjandrum himself, with the little round button at the top. In either case, similar
grammatical expectancies are employed.
But it may be objected that what we are referring to here as
grammatical involves more than what is traditionally subsumed
under the heading grammar. However, we are not concerned here
with grammar in the traditional sense as being something entirely
abstract and unrelated to persons who know languages. Rather, we
are concerned with the psychological realities of linguistic knowledge
as it is internalized in whatever ways by real human beings. By this
definition of grammar, the language user's knowledge of how to map
utterances pragmatically onto contexts outside of language and vice versa (that is, how to map contexts onto utterances) must be incorporated into the grammatical system. To illustrate, the word horse is effective in communicative exchanges if it is related to the right sort of animal. Pointing to a giraffe and calling it a horse is not an error in syntax, nor even an error in semantics (the speaker and listener may both know the intended meaning). It is the pragmatic
mapping of a particular exemplar of the category GIRAFFE (as an
object or real world thing, not as a word) that is incorrect. In an
important sense, such an error is a grammatical error.
The term expectancy grammar calls attention to the peculiarly
sequential organization of language in actual use. Natural language is
perhaps the best known example of the complex organization of
elements into sequences and classes, and sequences of classes which
are composed of other sequences of classes and so forth. The term
pragmatic expectancy grammar further calls attention to the fact that
the sequences of classes of elements, and hierarchies of them which
constitute a language are available to the language user in real life
situations because they are somehow indexed with reference to their
appropriateness to extralinguistic contexts.
In the normal use of language, no matter what level of language or mode of processing we think of, it is always possible to predict partially what will come next in any given sequence of elements. The elements may be sounds, syllables, words, phrases, sentences, paragraphs, or larger units of discourse. The mode of processing may be listening, speaking, reading, writing, or thinking, or some combination of these. In the meaningful use of language, some sort of pragmatic expectancy grammar must function in all cases.

A wide variety of research has shown that the more grammatically predictable a sequence of linguistic elements is, the more readily it can be processed. For instance, a sequence of nonsensical syllables as in the example from Osgood, Nox ems glerf onmo kebs, is more difficult than the same sequence with a more obvious structure imposed on it, as in The nox ems glerfed the onmo kebs. But the latter is still more difficult to process than, The bad boys chased the pretty girls. It is easy to see that the gradation from nonsense to completely acceptable sequences of meaningful prose can vary by much finer degrees, but these examples serve to illustrate that as sequences of linguistic elements become increasingly more predictable in terms of grammatical organization, they become easier to handle.

Not only are less constrained sequences more difficult than more constrained ones, but this generalization holds true regardless of which of the four traditionally recognized skills we are speaking of. It is also true for learning. In fact, there is considerable evidence to suggest that as organizational constraints on linguistic sequences are increased, ease of processing (whether perceiving, producing, learning, recalling, etc.) increases at an accelerating rate, almost exponentially. It is as though our learned expectations enable us to lie in wait for elements in a highly constrained linguistic context and make much shorter work of them than would be possible if they took us by surprise.
As we have been arguing throughout this chapter, the constraints on what may follow in a given sequence of linguistic elements go far beyond the traditionally recognized grammatical ones, and they operate in every aspect of our cognition. In his treatise on thinking, John Dewey (1910) argued that the 'central factor in thinking' is an element of expectancy. He gives an example of a man strolling along on a warm day. Suddenly, the man notices that it has become cool. It occurs to him that it is probably going to rain; looking up, he sees a dark cloud between himself and the sun. He then quickens his steps (p. 61). Dewey goes on to define thinking as 'that operation in which

present facts suggest other facts (or truths) in such a way as to induce belief in the latter upon the ground or the warrant of the former' (p. 8f.).

C. The emotive aspect


To this point, we have been concerned primarily with the factive aspect of language and cognition. However, much of what has been said applies as well to the emotive aspect of language use. Nonetheless there are contrasts in the coding of the two types of information. While factive information is coded primarily in distinctly verbal sequences, emotive information is coded primarily in gestures, tone of voice, facial expression, and the like. Whereas verbal sequences consist of a finite set of distinctive sounds (or features of sounds), syllables, words, idioms, and collocations, and generally of discrete and countable sequences of elements, the emotive coding devices are typically non-discrete and are more or less continuously variable.

For example, a strongly articulated p in the word pat hardly changes the meaning of the word, nor does it serve much better to distinguish a pat on the head from a cat in the garage. Shouting the word garage does not imply a larger garage, nor would whispering it necessarily change the meaning of the word in terms of its factive value. Either you have a garage to talk about or you don't, and there isn't much point in distinguishing cases in between the two extremes. With emotive information things are different. A wildly brandished fist is a stronger statement than a mere clenched fist. A loud shout means a stronger degree of emotion than a softer tone. In the kinds of devices typically used to code emotive information, variability in the strength of the symbol is analogically related to similar variability in the attitude that is symbolized.

In both speaking and writing, choice of words also figures largely in the coding of attitudinal information. Consider the differences in the attitudes elicited by the following sentences: (1) Some people say it is better to explain our point of view as well as give the news; (2) Some people say it is better to include some propaganda as well as give the news. Clearly, the presuppositions and implications of the two sentences are somewhat different, but they could conceivably be used in reference to exactly the same immediate states of affairs or extralinguistic contexts. Of the people polled in a certain study, 42.8% agreed with the first, while only 24.7% agreed with the second (Copi, 1972, p. 70). The 18.1% difference is apparently attributable to the difference between explaining and propagandizing.
Although the facts referred to by such terms as exaggerating and
lying may be the same facts in certain practical cases, the attitudes
expressed to\vard those facts by selecting one or the other term are
quite different. Indeed, the accompanying facial expression and tone
of voice may convey attitudinal information so forcefully as to even
contradict the factive claims of a statement. For instance, the teacher who says to a young child in an irritated tone of voice to 'Never mind about the spilled glue! It won't be any trouble to clean it up!' conveys
one message factively and a very different one emotively. In this case,
as in most such cases, the tone of voice is apt to speak louder than the
words. Somehow we are more willing to believe a person's manner of
speaking than whatever his words purport to say.
It is as if emotively coded messages were higher ranking and
therefore more authoritative messages. As Watzlawick, Beavin, and
Jackson (1967) point out in their book on the Pragmatics of Human
Communication, it is part of the function of emotive messages to
provide instructions concerning the interpretation of factively coded
information. Whereas the latter can usually be translated into
propositional forms such as 'This is what I believe is true', or 'This is
what I believe is not true', the former can usually be translated into
propositional forms about interpersonal relationships, or about how
certain factive statements are to be read. For instance, a facetious
remark may be said in such a manner that the speaker implies 'Take
this remark as a joke' or 'I don't really believe this, and you shouldn't
either.' At the same time, people are normally coding information
emotively about the way they see each other as persons. Such
messages can usually be translated into such propositional forms as
'This is the way I see you' or 'This is the way I see myself' or 'This is
the way I see you seeing me' and so on.
Although attitudes toward the self, toward others, and toward the
things that the self and others say may be more difficult to pin down
than are tangible states of affairs, they are nonetheless real. In fact,
Watzlawick et al. contend that emotive messages concerning such
abstract aspects of interpersonal realities are probably much more
important to the success of communicative exchanges than the
factively coded messages themselves. If the self in relationship to
others is satisfactorily defined, and if the significant others in
interactional relationships confirm one's definition of self and others,
communication concerning factive information can take place.
Otherwise, relationship struggles ensue. Marital strife over whether
or not one party loves the other, children's disputes about who said
what and whether or not he or she meant it, labor and management
disagreements about fair wages, and the arms race between the major
world powers, are all examples of breakdowns in factive communication once relationship struggles begin.
What is very interesting for a theory of pragmatic expectancy
grammar is that in normal communication, ways of expressing
attitudes are nearly perfectly coordinated with ways of expressing
factive information. As a person speaks, boundaries between
linguistic segments are nearly perfectly synchronized with changes in
bodily postures, gestures, tone of voice, and the like. Research by
Condon and Ogston (1971) has shown that the coordination of
gestures and verbal output is so finely grained that even the smallest
movements of the hands and fingers are nearly perfectly coincident
with boundaries in linguistic segments clear down to the level of the
phoneme. Moreover, through sound recordings and high resolution
motion photography they have been able to demonstrate that when
the body movements and facial gestures of a speaker and hearer 'are
segmented and displayed consecutively, the speaker and hearer look
like puppets moved by the same set of strings' (p. 158).
The demonstrated coordination of mechanisms that usually code
factive information and devices that usually code emotive infor-
mation shows that the anticipatory planning of the speaker and the
expectations of the listener must be in close harmony in normal
communication. Moreover, from the fact that they are so synchro-
nized we may infer something of the complexity of the advance
planning and hypothesizing that normal internalized grammatical
systems must enable language users to accomplish. Static grammatical devices which do not incorporate an element of real time would
seem hard put to explain some of the empirical facts which demand
explanation. Some sort of expectancy grammar, or a system
incorporating temporal constraints on linguistic contexts seems to be
required.

D. Language learning as grammar construction and modification


In a sense language is something that we learn, and in another it is a
medium through which learning occurs. Colin Cherry (1957) has said
that we never feel we have fully grasped an idea until we have 'jumped
on it with both verbal feet.' This seems to imply that language is not
just a means of expressing ideas that we already have, but rather that
it is a means of discovering ideas that we have not yet fully discovered.
John Dewey argued that language was not just a means of 'expressing antecedent thought', rather that it was a basis for the very act of creative thinking itself. He wryly observed that the things that a person says often surprise himself more than anyone else. Alice in Through the Looking Glass seems to have the same thought instilled in her own creative imagination through the genius of Lewis Carroll. She asks, 'How can I know what I am going to say until I have already said it?'

Because of the nature of human limitations and because of the complexities of our universe of experience, in order for the mind to cope with the vastness of the diversity, it categorizes and systematizes elements into hierarchies and sequences of them. Not only is the universe of experience more complex than we can perceive it to be at a given moment of time, but the depths of our memories have registered untold millions of details about previous experience that are beyond the grasp of our present consciousness.

Our immediate awareness can be thought of as an interface between external reality and the mind. It is like a corridor of activity where incoming elements of experience are processed and where the highly complex activities of thinking and language communication are effected. The whole of our cognitive experience may be compared to a more or less constant stream of complex and interrelated objects passing back and forth through this center of activity.

Because of the connections and interrelationships between incoming elements, and since they tend to cluster together in predictable ways, we learn to expect certain kinds of things to follow from certain others. When you turn the corner on the street where you live you expect to see certain familiar buildings, yards, trees, and possibly your neighbor's dog with teeth bared. When someone speaks to you and you turn in his direction, you expect to see him by looking in the direction of the sound you have just heard. These sorts of expectations, whether they are learned or innate, are so commonplace that they seem trivial. They are not, however. Imagine the shock of having to face a world in which such expectations stopped being correct. Think what it would be like to walk into your living room and find yourself in a strange place. Imagine walking toward someone and getting farther from him with every step. The violations of our commonest expectations are horror-movie material that make earthquakes and hurricanes seem like Disneyland.

Man's great advantage over other organisms which are also prisoners of time and space, is his ability to learn and use language to
systematize and organize experience more effectively. Through the
use of language we may broaden or narrow the focus of our attention
much the way we adjust the focus of our vision. We may think in
terms of this sentence, or today, or this school year, or our lifetime, or
known history, and so on. Regardless of how broad or narrow our
perspective, there is a sequence of elements attended to by our
consciousness within that perspective. The sequence itself may
consist of relatively simple elements, or sets of interrelated and highly
structured elements, but there must be a sequence because the totality
of even a relatively simple aspect of our universe is too complex to be
taken in at one gulp. We must deal with certain things ahead of
others. In a sense, we must take in elements single file at a given rate,
so that within the span of immediate consciousness, the number of
elements being processed does not exceed certain limits.
In a characteristic masterpiece publication, George Miller (1956)
presented a considerable amount of evidence from a wide variety of
sources suggesting that the number of separate things that our
consciousness can handle at anyone time is somewhere in the
neighborhood of seven, plus or minus one or two. He also pointed out
that human beings overcome this limitation in part by what he calls
'chunking'. By treating sequences or clusters of elements as unitary
chunks (or members of paradigms or classes) the mind constructs a
richer cognitive system. In other words, by setting up useful
categories of sequences, and categories of sequences of categories, our capacity to have correct expectations is enhanced - that is, we are
enabled to have correct expectations about more objects, or more
complex sorts of objects (in the most abstract sense of 'object')
without any greater cost to the cognitive system.
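A toy example (mine, not Miller's) makes the arithmetic of chunking concrete: the same twelve letters overload the span of immediate attention when handled one by one, but fall comfortably within it once they are recoded as four familiar acronyms:

```python
# Illustrative sketch of Miller's 'chunking' (the example is invented,
# not taken from Miller, 1956): recoding reduces the number of units
# that immediate consciousness has to hold.

letters = list("fbiciausakgb")
print(len(letters))              # 12 separate units - beyond the 7 +/- 2 span

chunks = ["FBI", "CIA", "USA", "KGB"]
print(len(chunks))               # 4 higher-order units - well within the span
```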
All of this is merely a way of talking about learning. As sequences
of elements at one level are organized into classes at a higher order of
abstraction, the organism can be said to be constructing an
appropriate expectancy grammar, or learning. A universal con-
sequence of the construction and modification of an appropriate
expectancy grammar is that the processing of sequences of elements
that conform to the constraints of the grammar is thus enhanced.
Moreover, it may be hypothesized that highly organized sequences of
elements that are presented in contexts where the basis for the
organization can be discovered will be more conducive to the
construction of a"n appropriate expectancy grammar than the
presentation of similar sequences without appropriate sorts of
context.
We are drawn to the generalization that there is an extremely important parallel between the normal use of language and the
learning of a language. The learner is never quite in the position of having no expectations to begin with. Even the newborn infant apparently has certain innate expectancies, e.g., that sucking its mother's breast will produce a desired effect. In fact, experiments by Bower (1971, 1974) seem to point to the conclusion that an infant is born with certain expectations of a much more specific sort - for example, the expectation that a seen object should have some tangible solidity to it. He proved that infants at surprisingly early ages were astonished when they passed their hands through the space occupied by what appeared to be a tangible object. However, his experiments show that infants apparently have to learn to expect entities (such as mother) to appear in only one place at one time. They also seem to have to learn that a percept of a moving object is caused by the same object as the percept of that same moving object when it comes to rest.

The problem, it would seem, from an educational point of view is how to take advantage of the expectancies that a learner has already acquired in trying to teach new material. The question is, what does the learner already know, and how can that knowledge be optimally utilized in the presentation of new material? It has been demonstrated many times over that learning of verbal material is enhanced if the meaningfulness of the material is maximized from the learner's point of view. An unpronounceable sequence of letters like gbntmbwk is more difficult to learn and to recall than say, nox ems glerf, in spite of the fact that the latter is a longer sequence of letters. The latter is easier because it conforms to some of the expectations that English speakers have acquired concerning phonological and graphological elements. A phrase like colorless green ideas conforms less well to our acquired expectancies than beautiful fall colors. Given appropriate contexts for the latter and the lack of them for the most part for the former, the latter should also be easier to learn to use appropriately than the former. A nonsensical passage like the one Mr Foote invented to stump Mr Macklin would be more difficult to learn than normal prose. The reason is simple enough. Learners know more about normal prose before the learning task begins.

Language programs that employ fully contextualized and maximally meaningful language necessarily optimize the learner's ability to use previously acquired expectancies to help discover the pragmatic mappings of utterances in the new language onto
extralinguistic contexts. Hence, they would seem to be superior to programs that expect learners to acquire the ability to use a language
on the basis of disconnected lists of sentences in the form of pattern drills, many of which are not only unrelated to meaningful extralinguistic contexts, but which are intrinsically unrelatable.

If one carefully examines language teaching methods and language learning settings which seem to be conducive to success in acquiring facility in the language, they all seem to have certain things in common. Whether a learner succeeds in acquiring a first language because he was born in the culture where that language was used, or was transported there and forced to learn it as a second language; whether a learner acquires a second language by hiring a tutor and speaking the language incessantly, or by marrying a tutor, or by merely maintaining a prolonged relationship with someone who speaks the language; whether the learner acquires the language through the command approach used successfully by J. J. Asher (1969, 1974), or the old silent method (Gattegno, 1963), or through a set of films of communicative exchanges (Oller, 1963-65), or by joining in a bilingual education experiment (Lambert and Tucker, 1972), certain sorts of data and motivations to attend to them are always present. The learner must be exposed to linguistic contexts in their peculiar pragmatic relationships to extralinguistic contexts, and the learner must be motivated to communicate with people in the target language by discovering those pragmatic relationships.

Although we have said little about education in a broader sense, everything said to this point has a broader application. In effect, the hypothesis concerning pragmatic expectancy grammar as a basis for explaining success and failure in language learning and language teaching can be extended to all other areas of the school curriculum in which language plays a large part. We will return to this issue in Chapter 14 where we discuss reading curricula and other language based parts of curricula in general. In particular we will examine research into the developing language skills of children in Brisbane, Australia (Hart, Walker, and Gray, 1977).

E. Tests that invoke the learner's grammar


When viewed from the vantage point assumed in this chapter, language testing is primarily a task of assessing the efficiency of the pragmatic expectancy grammar the learner is in the process of constructing. In order for a language test to achieve validity in terms of the theoretical construct of a pragmatic expectancy grammar, it
will have to invoke and challenge the efficiency of the learner's developing grammar. We can be more explicit. Two closely interrelated criteria of construct validity may be imposed on language tests: first, they must cause the learner to process (either produce or comprehend, or possibly to comprehend, store, and recall, or some other combination) temporal sequences of elements in the language that conform to normal contextual constraints (linguistic and extralinguistic); second, they must require the learner to understand the pragmatic interrelationship of linguistic contexts and extralinguistic contexts.

The two validity requirements just stated are like two sides of the same coin. The first emphasizes the sequential constraints specified by the grammar, and the second emphasizes the function of the grammar in relating sequences of elements in the language to states of affairs outside of language. In subsequent chapters we will often refer to these validity requirements as the pragmatic naturalness criteria. We will explore ways of accomplishing such assessment in Chapter 3, and in greater detail in Part Three which includes Chapters 10 through 14. Techniques that fail to meet the naturalness criteria are discussed in Part Two - especially in Chapter 8. Multiple choice testing procedures are discussed in Chapter 9.
KEY POINTS
1. To understand the problem of constructing valid language tests, it is
essential to understand the nature of the skill to be tested.
2. Two aspects of language in use need to be distinguished: the factive (or cognitive) aspect of language use has to do with the coding of information about states of affairs by using words, phrases, clauses, and discourse; the emotive (or affective) aspect of language use has to do with the coding of information about attitudes and interpersonal relationships by using facial expression, gesture, tone of voice, and choice of words. These two aspects of language use are intricately interrelated.
3. Two major kinds of context are distinguished: linguistic context consists of verbal and gestural aspects; and extralinguistic context similarly consists of objective and subjective aspects.
4. The systematic correspondences between linguistic and extralinguistic contexts are referred to as pragmatic mappings.
5. Pragmatics asks how utterances (and of course other forms of language in use) are related to human experience.
6. In relation to the factive aspect of coding information about states of affairs outside of language, it is asserted that language is always an abbreviation for a much more complete and detailed sort of knowledge.
7. An important aspect of the coding of information in language is the anticipatory planning of the speaker and the advance hypothesizing of the listener concerning what is likely to be said next.
8. A pragmatic expectancy grammar is defined as a psychologically real
system that sequentially orders linguistic elements in time and in relation
to extralinguistic contexts in meaningful ways.
9. As linguistic sequences become more highly constrained by grammatical
organization of the sorts illustrated, they become easier to process.
10. Whereas coding devices for factive information are typically digital
(either on or off, present or absent), coding devices for emotive
information are usually analogical (continuously variable). A tone of
voice which indicates excitement may vary with the degree of excitement,
but a digital device for, say, referring to a pair of glasses cannot be
whispered to indicate very thin corrective lenses and shouted to indicate
thick ones. The word eyeglasses does not have such a continuous
variability of meaning, but a wild-eyed shout probably does mean a
greater degree of intensity than a slightly raised voice.
11. Where there is a conflict between emotively coded information and
factive level information, the former usually overrides the latter.
12. When relationship struggles begin, factive level communication usually
ends. Examples are the wage-price spiral and the arms race.
13. The coding of factive and emotive information are very precisely
synchronized, and the gestural movements of speaker and listener in a
typical communicative exchange are also timed in surprisingly accurate
cadence.
14. Some sort of grammatical system incorporating the element of real time
and capable of substantial anticipatory-expectancy activity seems
required to explain well known facts of normal language use.
15. Language is both an object and a tool of learning. Cherry suggests that
we not only express ideas in words, but that we in fact discover them by
putting them into words.
16. Language learning is construed as a process of constructing an
appropriate expectancy generating system. Learning is enhancing one's
capacity to have correct expectations about the nature of experience.
17. It is hypothesized that language teaching programs (and by implication
educational programs in general) will be more effective if they optimize
the learner's opportunities to take advantage of previously acquired
expectancies in acquiring new knowledge.
18. It is further hypothesized that the data necessary to language acquisition
are what are referred to in this book as pragmatic mappings - i.e., the
systematic correspondences between linguistic and extralinguistic
contexts. In addition to opportunity, the only other apparent necessity is
sufficient motivation to operate on the requisite data in appropriate
ways.
19. Valid language tests are defined as those tests which meet the pragmatic
naturalness criteria - namely, those which invoke and challenge the
efficiency of the learner's expectancy grammar, first by causing the learner
to process temporal sequences in the language that conform to normal
contextual constraints, and second by requiring the learner to
understand the systematic correspondences of linguistic contexts and
extralinguistic contexts.
DISCUSSION QUESTIONS
1. Why is it so important to understand the nature of the skill you are trying to test? Can you think of examples of tests that have been used for educational or other decisions but which were not related to a careful consideration of the skill or knowledge they purported to assess? Study closely a test that is used in your school or that you have taken at some time in the course of your educational experience. How can you tell if the test is a measure of what it purports to measure? Does the label on the test really tell you what it measures?
2. Look for examples in your own experience illustrating the importance of grammatically based expectancies. Riddles, puns, jokes, and parlor games are good sources. Speech errors are equally good illustrations. Consider the example of the little girl who was asked by an adult where she got her ice cream. She replied, 'All over me,' as she looked sheepishly at the vanilla and chocolate stains all over her dress. How did her expectations differ from those of the adult who asked the question?
3. Keep track of listening or reading errors where you took a wrong turn in your thinking and had to do some retreading farther down the line. Discuss the source of such wrong turns.
4. Consider the sentences: (a) The boy was bucked off by the pony, and (b) The boy was bucked off by the barn (example from Woods, 1970). Why does the second sentence require a mental double-take? Note similar examples in your reading for the next few days. Write down examples and be prepared to discuss them with your class.

SUGGESTED READINGS
1. George A. Miller, 'The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information,' Psychological Review 63, 1956, 81-97.
2. Donald A. Norman, 'In Retrospect,' Memory and Attention, New York: Wiley, 1969, pp. 177-181.
3. Part VI of Focus on the Learner. Rowley, Mass.: Newbury House, 1973, pp. 265-300.
4. Bernard Spolsky, 'What Does It Mean to Know a Language, or How Do You Get Someone to Perform His Competence?' In J. W. Oller, Jr. and J. C. Richards (eds.) Focus on the Learner. Rowley, Mass.: Newbury House, 1973, 164-76.
3

Discrete Point, Integrative, or Pragmatic Tests
A. Discrete point versus integrative testing
B. A definition of pragmatic tests
C. Dictation and cloze procedure as examples of pragmatic tests
D. Other examples of pragmatic tests
E. Research on the validity of pragmatic tests
   1. The meaning of correlation
   2. Correlations between different language tests
   3. Error analysis as an independent source of validity data

Not all that glitters is gold, and not everything that goes by the name is twenty-four karat. Neither are all tests which are called language tests necessarily worthy of the name, and some are better than others. This chapter deals with three classes of tests that are called measures of language - but it will be argued that they are not equal in effectiveness. It is claimed that only tests which meet the pragmatic naturalness criteria defined in Chapter 2 are language tests in the most fundamental sense of what language is and how it functions.

A. Discrete point versus integrative testing


In recent years, a body of literature on language testing has developed which distinguishes two major categories of tests. John Carroll (1961, see the Suggested Readings at the end of this chapter) was the person
credited with first proposing the distinction between discrete point and integrative language tests. Although the types are not always different for practical purposes, the theoretical bases of the two approaches contrast markedly and the predictions concerning the effects and relative validity of different testing procedures also differ in fundamental ways depending on which of the two approaches one selects. The contrast between these two philosophies, of course, is not limited to language testing per se, but can be seen throughout the whole spectrum of educational endeavor.
Traditionally, a discrete point test is one that attempts to focus attention on one point of grammar at a time. Each test item is aimed at one and only one element of a particular component of a grammar (or perhaps we should say hypothesized grammar), such as phonology, syntax, or vocabulary. Moreover, a discrete point test purports to assess only one skill at a time (e.g., listening, or speaking, or reading, or writing) and only one aspect of a skill (e.g., productive versus receptive or oral versus visual). Within each skill, aspect, and component, discrete items supposedly focus on precisely one and only one phoneme, morpheme, lexical item, grammatical rule, or whatever the appropriate element may be. (See Lado, 1961, in Suggested Readings at the end of this chapter.) For instance, a phonological discrete item might require an examinee to distinguish between minimal pairs, e.g., pill versus peel, auditorily presented. An example of a morphological item might be one which requires the selection of an appropriate suffix such as -ness or -ity to form a noun from an adjective like secure, or sure. An example of a syntactic item might be a fill-in-the-blank type where the examinee must supply the suffix -s as in He walk___ to town each morning now that he lives in the city.1

1 Other discrete item examples are offered in Chapter 8 where we return to the topic of discrete point tests and examine them in greater detail.
The concept of an integrative test was born in contrast with the definition of a discrete point test. If discrete items take language skill apart, integrative tests put it back together. Whereas discrete items attempt to test knowledge of language one bit at a time, integrative tests attempt to assess a learner's capacity to use many bits all at the same time, and possibly while exercising several presumed components of a grammatical system, and perhaps more than one of the traditionally recognized skills or aspects of skills.
However, to base a definition of integrative language testing on what would appear to be its logical antithesis and in fact its competing predecessor is to assume a fairly limiting point of view. It is possible to look to other sources for a theoretical basis and rationale for so-called integrative tests.

B. A definition of pragmatic tests


The term pragmatic test has sometimes been used interchangeably with the term integrative test in order to call attention to the possibility of relating integrative language testing procedures to a theory of pragmatics, or pragmatic expectancy grammar. Whereas integrative testing has been somewhat loosely defined in terms of what discrete point testing is not, it is possible to be somewhat more precise in saying what a pragmatic test is: it is any procedure or task that causes the learner to process sequences of elements in a language that conform to the normal contextual constraints of that language, and which requires the learner to relate sequences of linguistic elements via pragmatic mappings to extralinguistic context.
Integrative tests are often pragmatic in this sense, and pragmatic tests are always integrative. There is no ordinary discourse situation in which a learner might be asked to listen to and distinguish between isolated minimal pairs of phonological contrasts. There is no normal language use context in which one's attention would be focussed on the syntactic rules involved in placing appropriate suffixes on verb stems or in moving the agent of an active declarative sentence from the front of the sentence to the end in order to form a passive (e.g., The dog bit John in the active form becoming John was bitten by the dog in the passive). Thus, discrete point tests cannot be pragmatic, and conversely, pragmatic tests cannot be discrete point tests. Therefore, pragmatic tests must be integrative.
But integrative language tasks can be conceived which do not meet one or both of the naturalness criteria which we have imposed in our definition of pragmatic tests. If a test merely requires an examinee to use more than one of the four traditionally recognized skills and/or one or more of the traditionally recognized components of grammar, it must be considered integrative. But to qualify as a pragmatic test, more is required.
In order for a test user to say something meaningful (valid) about the efficiency of a learner's developing grammatical system, the pragmatic naturalness criteria require that the test invoke and challenge that developing grammatical system. This requires processing sequences of elements in the target language (even if it is the learner's first and only language) subject to temporal contextual constraints. In addition, the tasks must be such that for examinees to do them, linguistic sequences must be related to extralinguistic contexts in meaningful ways.
Examples of tasks that do not qualify as pragmatic tests include all discrete point tests, the rote recital of sequences of material without attention to meaning, and the manipulation of sequences of verbal elements, possibly in complex ways, but in ways that do not require awareness of meaning. In brief, if the task does not require attention to meaning in temporally constrained sequences of linguistic elements, it cannot be construed as a pragmatic language test. Moreover, the constraints must be of the type that are found in normal uses of the language, not merely in some classroom setting that may have been contrived according to some theory of how languages should be taught. Ultimately, the question of whether or not a task is pragmatic is an empirical one. It cannot be decided by theory-based preferences, or opinion polls.

C. Dictation and cloze procedure as examples of pragmatic tests


The traditional dictation, rooted in the distant past of language teaching, is an interesting example of a pragmatic language testing procedure. If the sequences of words or phrases to be dictated are selected from normal prose, or dialogue, or some other natural form of discourse (or perhaps if the sequences are carefully contrived to mirror normal discourse, as in well-written fiction) and if the material is presented orally in sequences that are long enough to challenge the short term memory of the learners, a simple traditional dictation meets the naturalness requirements for pragmatic language tests. First, such a task requires the processing of temporally constrained sequences of material in the language and second, the task of dividing up the stream of speech and writing down what is heard requires understanding the meaning of the material - i.e., relating the linguistic context (which in a sense is given) to the extralinguistic context (which must be inferred).
Although an inspection of the results of dictation tests with appropriate statistical procedures (as we will see below) shows the technique to be very reliable and highly valid, it has not always been looked on with favor by the experts. For example, Robert Lado (1961) said:

Dictation ... on critical inspection ... appears to measure very little of language. Since the word order is given ... it does not test word order. Since the words are given ... it does not test vocabulary. It hardly tests the aural perception of the examiner's pronunciation because the words can in many cases be identified by context ... The student is less likely to hear the sounds incorrectly in the slow reading of the words which is necessary for dictation (p. 34).
Other authors have tended to follow Lado's lead:
As a testing device ... dictation must be regarded as generally uneconomical and imprecise (Harris, 1969, p. 5). Some teachers argue that dictation is a test of auditory comprehension, but surely this is a very indirect and inadequate test of such an important skill (Anderson, 1953, p. 43). Dictation is primarily a test of spelling (Somaratne, 1957, p. 48).
More recently, J. B. Heaton (1975), though he cites some of the up-to-date research on dictation in his bibliography, devotes less than two pages to dictation as a testing procedure and concludes that
dictation ... as a testing device measures too many different language features to be effective in providing a means of assessing any one skill (p. 186).
Davies (1977) offers much the same criticism of dictation. He suggests that it is too imprecise in diagnostic information, and further that it is apt to have an unfortunate 'wash back' effect (namely, in taking on 'the aura of language goals'). Therefore, he argues
it may be desirable to abandon such well-worn and suspect techniques for less familiar and less coherent ones (p. 66).
In the rest of the book edited by Allen and Davies (1977) there is only one other mention of dictation. Ingram (1977) in the same volume pegs dictation as a rather weak sort of spelling test (see p. 20).
If we were to rely on an opinion poll, the weight of the evidence would seem to be against dictation as a useful language testing procedure. However, the validity of a testing procedure is hardly the sort of question that can be answered by taking a vote.
Is it really necessary to read the material very slowly as is implied by Lado's remarks? The answer is no. It is possible to read slowly, but it is not necessary to do so. In fact, unless the material is presented in sequences long enough to challenge the learner's short term memory, and quickly enough to simulate the normal temporal nature of speech sequences, then perhaps dictation would become a test of spelling as Somaratne and Ingram suggest. However, it is not even necessary to count spelling as a criterion for correctness. Somaratne's remark seems to imply that one must, but research shows that one shouldn't. We will return to this question in particular, namely, the scoring of dictation, and other practical questions in Chapter 10.
The view that a language learner can take dictation (which is presented in reasonably long bursts, say, five or more words between pauses, and where each burst is given at a conversational rate) without doing some very active and creative processing is credible only from the vantage point of the naive examiner who thinks that the learner automatically knows what the examiner knows about the material being dictated. As the famous Swiss linguist pointed out three quarters of a century ago,
... the main characteristic of the sound chain is that it is linear. Considered by itself it is only a line, a continuous ribbon along which the ear perceives no self-sufficient and clear-cut division ... (quoted from lectures compiled by de Saussure's students, Bally, Sechehaye, and Riedlinger, 1959, pp. 103-104).
To prove that the words of a dictation are not necessarily 'given' from the learner's point of view, one only needs to try to write dictation in an unknown language. The reader may try this test: have a speaker of Yoruba, Thai, Mandarin, Serbian or some other language which you do not know say a few short sentences with pauses between them long enough for you to write them down or attempt to repeat them. Try something simple like, Say Man what's happening, or How's life been treating you lately, at a conversational rate. If the proof is not convincing, consider the kinds of errors that non-native speakers of English make in taking dictation.
non-native speakers of English make in taking dictation.
In a research report circulated in 1973, Johansson gave examples of vocabulary errors: eliquants, elephants, and elekvants for the word eloquence. It is possible that the first rendition is a spelling error, but that possibility does not exist for the other renditions. At the phrase level, consider of appearance for of the period, person in facts for pertinent facts, less than justice, lasting justice, last in justice for just injustice. Or when a foreign student at UCLA writes, to find particle man living better and mean help man and boy tellable damage instead of to find practical means of feeding people better and means of helping them avoid the terrible damage of windstorms, does it make sense to say that the words and their order were 'given'?
Though much research remains to be done to understand better what learners are doing when they take dictation, it is clear from the above examples that whatever mental processes they are performing must be active and creative. There is much evidence to suggest that there are fundamental parallels between tasks like taking dictation and using language in a wide variety of other ways. Among closely related testing procedures are sentence repetition tasks (or 'elicited imitation') which have been used in the testing of children for proficiency in one or more languages or language varieties. We return to this topic in detail in Chapter 10.
All of the research seems to indicate that in order for examinees to take dictation, or to repeat utterances that challenge their short term memory, it is necessary not only to make the appropriate discriminations in dividing up the continuum of speech, but also to understand the meaning of what is said.
Another example of a pragmatic language testing procedure is the cloze technique. The best known variety of this technique is the sort of test that is constructed by deleting every fifth, sixth, or seventh word from a passage of prose. Typically each deleted word is replaced by a blank of standard length, and the task set the examinee is to fill in the blanks by restoring the missing words. Other varieties of the procedure involve deleting specific vocabulary items, parts of speech, affixes, or particular types of grammatical markers.
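Readers who want to see the mechanics of the standard procedure laid bare may find a rough sketch in program form helpful. The following Python fragment is merely illustrative: the function name, the choice of a uniform six-character blank, and the decision to treat any whitespace-separated token as a 'word' are assumptions of this sketch, not part of any standard cloze procedure.

    def make_cloze(passage, n=5, blank="______"):
        # Replace every nth whitespace-separated token with a uniform
        # blank, keeping the deleted words as an answer key for scoring.
        words = passage.split()
        answers = {}
        for i in range(n - 1, len(words), n):
            answers[i] = words[i]
            words[i] = blank
        return " ".join(words), answers

    sample = ("The best known variety of this technique is the sort of "
              "test that is constructed by deleting every fifth word "
              "from a passage of prose.")
    cloze_text, answer_key = make_cloze(sample, n=5)
    print(cloze_text)

The answer key kept by the sketch corresponds to the exact-word scoring standard mentioned below; scoring by any contextually acceptable word would require human judges.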
The word cloze was invented by Wilson Taylor (1953) to call attention to the fact that when an examinee fills in the gaps in a passage of prose, he is doing something similar to what Gestalt psychologists call 'closure', a process related to the perception of incomplete geometric figures, for example. Taylor considered words deleted from prose to present a special kind of closure problem. From what is known of the grammatical knowledge the examinee brings to bear in solving such a closure problem, we can appreciate the fact that the problem is a very special sort of closure.
Like dictation, cloze tests meet both of the naturalness criteria for pragmatic language tests. In order to give correct responses (whether the standard of correctness is the exact word that originally appeared at a particular point, or any other word that fully fits the context of the passage), the learner must operate ___ the basis of both immediate and long-range ___ constraints. Whereas some of the blanks in a cloze test (say of the standard variety deleting every nth word) can be filled by attending only to a few words on either side of the blank, as in the first blank in the preceding sentence, other blanks in a typical cloze passage require attention to longer stretches of linguistic context. They often require inferences about extralinguistic context, as in the case of the second blank in the preceding sentence.
The word on seems to be required in the first blank by the words operate and the basis of, without any additional information. However, unless long range constraints are taken into account, the second blank offers many possibilities. If the examinee attended only to such constraints as are afforded by the words from operate onward, it could be filled by such words as missile, legal, or leadership. The intended word was contextual. Other alternatives which might have occurred to the reader, and which are in the general semantic target area might include temporal, verbal, extralinguistic, grammatical, pragmatic, linguistic, psycholinguistic, sociolinguistic, psychological, semantic, and so on.
In taking a cloze test, the examinee must utilize information that is inferred about the facts, events, ideas, relationships, states of affairs, social settings and the like that are pragmatically mapped by the linguistic sequences contained in the passage. Examples of cases where extralinguistic context and the linguistic context of the passage are interrelated are obvious in so-called deictic words such as here and now, then and there, this and that, pronouns that refer to persons or things, tense indicators, aspect markers on verbs, adverbs of time and place, determiners and demonstratives in general, and a host of others.
For a simple example, consider the sentence, A horse was fast when he was tied to a hitching post, and the same animal was also fast when he won a horse-race. If such a sentence were part of a larger context, say on the difficulties of the English language, and if we deleted the first a, the blank could scarcely be filled with the definite article the because no horse has been mentioned up to that point. On the other hand, if we deleted the the before the words same animal, the indefinite article could not be used because of the fact that the horse referred to by the phrase A horse at the beginning of the sentence is the same horse referred to by the phrase the same animal. This is an example of a pragmatic constraint. Consider the oddity of saying, The horse was fast when he was tied to a hitching post, and a same animal was also fast when he won a horse-race.
Even though the pragmatic mapping constraints involved in normal discourse are only partially understood by the theoreticians, and though they cannot be precisely characterized in terms of grammatical systems (at least not yet), the fact that they exist is well-known, and the fact that they can be tested by such pragmatic procedures as the cloze technique has been demonstrated (see Chapter 12).

All sorts of deletions of so-called content words (e.g., nouns, adjectives, verbs, and adverbs), and especially grammatical connectors such as subordinating conjunctions, negatives, and a great many others carry with them constraints that may range backward or forward across several sentences or more. Such linguistic elements may entail restrictions that influence items that are widely separated in the passage. This places a strain on short term memory which presses the learner's pragmatic expectancy grammar into operation. The accuracy with which the learner is able to supply correct responses can therefore be taken as an index of the efficiency of the learner's developing grammatical system. Ways of constructing, administering, scoring, and interpreting cloze tests and a variety of related procedures for acquiring such indices are discussed in Chapter 12.

D. Other examples of pragmatic tests


Pragmatic testing procedures are potentially innumerable. The techniques discussed so far, dictation, cloze, and variations of them, by no means exhaust the possibilities. Probably they do not even begin to indicate the range of reasonable possibilities to be explored. There is always a danger that minor empirical advances in educational research in particular may lead to excessive dependence on procedures that are associated with the progress. However, in spite of the fact that some of the pragmatic procedures thus far investigated do appear to work substantially better than their discrete point predecessors, there is little doubt that pragmatic tests can also be refined and expanded. It is important that the procedures which now exist and which have been studied should not limit our vision concerning other possibilities. Rather, they should serve as guideposts for subsequent refinement and development of still more effective and more informative testing procedures.

Therefore, the point of this section (and in a broader sense, this entire book) is not to provide a comprehensive list of possible pragmatic testing procedures, but rather to illustrate some of the possible types of procedures that meet the naturalness criteria concerning the temporal constraints on language in use, and the pragmatic mapping of linguistic contexts onto extralinguistic ones. Below, in section E of this chapter, we will discuss evidence concerning the validity of pragmatic tests. (Also, see the Appendix.)
Combined cloze and dictation. The examinee reads material from which certain portions have been deleted and simultaneously (or subsequently) hears the same material without deletions either live or on tape. The examinee's task is to fill in the missing portions the same as in the usual cloze procedure, but he has the added support of the auditory signal to help him fill in the missing portions. Many variations on this procedure are possible. Single words, or even parts of words, or sequences of words, or even whole sentences or longer segments may be deleted. The less material one deletes, presumably, the more the task resembles the standard cloze procedure, and the more one deletes, the more the task looks like a standard dictation.

Oral cloze procedure. Instead of presenting a cloze passage in a written format, it is possible to use a carefully prepared tape recording of the material with numbers read in for the blanks, or with pauses where blanks occur. Or, it is possible merely to read the material up to the blank, give the examinee the opportunity to guess the missing word, record the response, and at that point either tell the examinee the right answer (i.e., the missing word), or simply go on without any feedback as to the correctness of the examinee's response. Another procedure is to arrange the deletions so that they always come at the end of a clause or sentence. Any of these oral cloze techniques have the advantage of being usable with non-literate populations.

Dictation with interfering noise. Several varieties of this procedure have been used, and for a wide range of purposes. The best known examples are the versions of the Spolsky-Gradman noise tests used with non-native speakers of English. The procedure simply involves superimposing white noise (a wide spectrum of random noise sounding roughly like radio static or a shhhhshing sound at a constant level) onto taped verbal material. If the linguistic context under the noise is fully meaningful and subject to the normal extralinguistic constraints, this procedure qualifies as a pragmatic testing technique. Variations include noise throughout the material versus noise over certain portions only. It is argued, in any event, that the noise constitutes a situation somewhat parallel to many of the everyday contexts where language is used in less than ideal acoustic conditions, e.g., trying to have a conversation in someone's livingroom when the television and air conditioner are producing a high level of competing noise, or trying to talk to or hear someone else in the crowded lobby of a hotel, or trying to hear a message over a public address system in a busy air terminal, etc.
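With digitized recordings, the noise-mixing step itself is mechanically simple. The following sketch is only an illustration of the general idea, not a description of the original tape-based procedure; it assumes the speech has already been read into an array of samples, and the relative noise level used here is an arbitrary placeholder that would have to be calibrated against listener performance.

    import numpy as np

    def add_white_noise(speech, relative_level=0.1, seed=0):
        # Superimpose constant-level Gaussian (white) noise on a
        # digitized speech signal, scaled relative to the speech's
        # overall RMS loudness.
        rng = np.random.default_rng(seed)
        rms = np.sqrt(np.mean(np.square(speech)))
        noise = rng.normal(0.0, relative_level * rms, size=speech.shape)
        return speech + noise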

Paraphrase recognition. In one version, examinees are asked to read a sentence and then to select from four or five alternatives the best paraphrase for the given sentence. The task may be made somewhat more difficult by having examinees read a paragraph or longer passage and then select from several alternatives the one which best represents the central meaning or idea of the given passage. This task is somewhat similar to telling what a conversation was about, or what the main ideas of a speech were, and the like. Typically, such tests are interpreted as being tests of reading comprehension. However, they are pragmatic language tests inasmuch as they meet the naturalness criteria related to meaning and temporal constraints.

A paraphrase recognition task may be either in a written format or an oral format or some combination of them. An example of an oral format comes from the Test of English as a Foreign Language produced by Educational Testing Service, Princeton, New Jersey. Examinees hear a sentence like, John dropped the letter in the mailbox. Then they must choose between (a) John sent the letter; (b) John opened the letter; (c) John lost the letter; (d) John destroyed the letter.2

Of course, considerably more complicated items are possible. The discrete point theorist might object that since the first stimulus is presented auditorily and since the choices are then presented in a written format, it becomes problematic to say what the test is a test of - whether listening comprehension, or reading comprehension, or both. This is an issue that we will return to in Chapters 8 and 9, and which will be addressed briefly in the section on the validity of pragmatic tests below. Also, see the Appendix.

Question answering. In one section of the TOEFL, examinees are required to select the best answer from a set of written alternatives to an auditorily presented question (either on record or tape). For instance, the examinee might hear, When did Tom come here? In the test booklet he reads, (a) By taxi; (b) Yes, he did; (c) To study history; and (d) Last night. He must mark on his answer sheet the letter corresponding to the best answer to the given question.

2 This example and subsequent ones from the TOEFL are based on mimeographed hand-outs prepared by the staff at Educational Testing Service to describe the new format of the TOEFL in relation to the format used from 1961-1975.

A slightly different question answering task appears in a different section of the test. The examinee hears a dialogue such as:
MAN'S VOICE: Hello Mary. This is Mr Smith at the office. Is Bill feeling any better today?
WOMAN'S VOICE: Oh, yes, Mr Smith. He's feeling much better now. But the doctor says he'll have to stay in bed until Monday.
THIRD VOICE: Where is Bill now?
Possible answers from which the examinee must choose include: (a) At the office; (b) On his way to work; (c) Home in bed; and (d) Away on vacation.
Perhaps the preceding example, and other multiple choice examples may seem somewhat contrived. For this and other reasons to be discussed in Chapter 9, good items of the preceding type are quite difficult to prepare. Other formats which allow the examinee to supply answers to questions concerning less contrived contexts may be more suitable for classroom applications. For instance, sections of a television or radio broadcast in the target language may be taped. Questions formed in relation to those passages could be used as part of an interview technique aimed at testing oral skills.
A colorful, interesting, and potentially pragmatic testing technique is the Bilingual Syntax Measure (Burt, Dulay, and Hernandez, 1975). It is based on questions concerning colorful cartoon style pictures like the one shown in Figure 1 below.
The test is intended for children between the ages of four and nine, from kindergarten through second grade. Although the authors of the test have devised a scoring procedure that is essentially aimed at assessing control of less than twenty so-called functors (morphological and syntactic markers like the plural endings on nouns, or tense markers on verbs), the procedure itself is highly pragmatic. First, questions are asked in relation to specific extralinguistic contexts in ways that require the processing of sequences of elements in English, or Spanish, or possibly some other language. Second, those meaningful sequences of linguistic elements in the form of questions must be related to the given extralinguistic contexts in meaningful ways.
For instance, in relation to a picture such as the one shown in Figure 1, the child might be asked something like, How come he's so skinny? The questioner indicates the guy pushing the wheelbarrow. The situation is natural enough and seems likely to motivate a child to want to respond.

Figure 1. A cartoon drawing illustrating the style of the Bilingual Syntax Measure.

We return to the Bilingual Syntax Measure and a number of related procedures in Chapter 11.

Oral interview. In addition to asking specific questions about pictured or real situations, oral tests may take a variety of other forms. In effect, every opportunity a learner is given to talk in an educational setting can be considered a kind of oral language test. The score on such a test may be only the subjective impression that it makes on the teacher (or another evaluator), or it may be based on some more detailed plan of counting errors. Surprisingly perhaps, the so-called objective procedures are not necessarily more reliable. In fact, they may be less reliable in some cases. Certain aspects of language performances may simply lend themselves more to subjective judgement than they do to quantification by formula. For instance, Richards (1970b) has shown that naive native speakers are fairly reliable judges of word frequencies. Also, it has been known for a long time that subjective rankings of passages of prose are sometimes more reliable than rankings (for relative difficulty) based on readability formulas (Klare, 1974).

An institutional technique that has been fairly well standardized by the Foreign Service Institute uses a training procedure for judges who are taught to conduct interviews and to judge performance on the basis of carefully thought-out rating scales. This procedure is discussed along with the Ilyin Oral Interview (Ilyin, 1972) and Upshur's Oral Communication Test (no date), in Chapter 11.

Composition or essay writing. Most free writing tasks necessarily qualify as pragmatic tests. Because it is frequently difficult to judge examinees relative to one another when they may have attempted to say entirely different sorts of things, and because it is also difficult to say what constitutes an error in writing, various modified writing tasks have been used. For example, there is the so-called dehydrated sentence, or dehydrated essay. The examinee is given a telegraphic message and is asked to expand it. An instance of the dehydrated sentence is child/ride/bicycle/off embankment/last month. A dehydrated narrative might continue, was taken to hospital/lingered near death/family reunited/back to school/two weeks in hospital.

Writing tasks may range from the extreme case of allowing examinees to select their own topic and to develop it, to maximally controlled tasks like filling in blanks in a pre-selected (or even contrived) passage prepared by the teacher or examiner. The blanks might require open-ended responses on the order of whole paragraphs, or sentences, or phrases, or words. In the last case, we have arrived back at a rather obvious form of cloze procedure.

Another version of a fairly controlled writing task involves either listening to or reading a passage and then trying to reproduce it from recall. If the original material is auditorily presented, the task becomes a special variety of dictation. This procedure and a variety of others are discussed in greater detail in Chapter 13.

Narration. One of the techniques sometimes used successfully to elicit relatively spontaneous speech samples is to ask subjects to talk about a frightening experience or an accident where they were almost 'shaded out of the picture' (Paul Anisman, personal communication). With very young children, story re-telling, which is a special version of narration, has been used. It is important that such tasks seem natural to the child, however, in order to get a realistic attempt from the examinee. For instance, it is important that the person to whom the child is expected to re-tell the story is not the same person who has just told the story in the first place (he obviously knows it). It should rather be someone who has not (as far as the child is aware) heard the story before - or at least not the child's version.
Translation. Although translation, like other pragmatic procedures, has not been favored by the testing experts in recent years, it still remains in at least some of its varieties as a viable pragmatic procedure. It deserves more research. It would appear from the study by Swain, Dumas, and Naiman (1974) that if it is used in ways that approximate its normal application in real life contexts, it can provide valuable information about language proficiency. If the sequences of verbal material are long enough to challenge the short-term memory of the examinees, it would appear that the technique is a special kind of pragmatic paraphrase task.

E. Research on the validity of pragmatic tests


We have defined language use and language learning in relation to the
theoretical construct of a pragmatic expectancy grammar. Language
use is viewed as a process of interacting plans and hypotheses
concerning the pragmatic mapping of linguistic contexts onto
extralinguistic ones. Language learning is viewed as a process of
developing such an expectancy system. Further, it is claimed that a
language test must invoke and challenge the expectancy system of the
learner in order to assess its efficiency. In all of this discussion, we are
concerned with what may be called the construct validity of pragmatic
language tests. If they were to stand completely alone, such
considerations would fall far short of satisfactorily demonstrating the
validity of pragmatic language tests. Empirical tests must be applied
to the tests themselves to determine whether or not they are good tests
according to some purpose or range of purposes (see Oller and
Perkins, 1978).
In addition to construct validity which is related to the question of
whether the test meets certain theoretical requirements, there is the
matter of so-called content validity and of concurrent validity. Content
validity is related to the question of whether the test requires the
examinee to perform tasks that are really the same as or
fundamentally similar to the sorts of tasks one normally performs in
exhibiting the skill or ability that the test purports to measure. For
instance, we might ask of a test that purports to measure listening
comprehension for adult foreign students in American universities: does the test require the learner to do the sort of thing that it supposedly measures his ability to do? Or, for a test that purports to measure the degree of dominance of bilingual children in classroom contexts that require listening and speaking, we might ask: does the test require the children to say and do things that are similar in some fundamental way to what they are normally required to do in the classroom? These are questions about content validity.
With respect to concurrent validity, the question of interest is to what extent do tests that purport to measure the same skill(s), or component(s) of a skill (or skills) correlate statistically with each other? Below, we will digress briefly to consider the meaning of statistical correlation. An example of a question concerning concurrent validity would be: do several tests that purport to measure the same thing actually correlate more highly with each other than with a set of tests that purport to measure something different? For instance, do language tests correlate more highly with each other than with tests that are labeled IQ tests? And vice versa. Do tests which are labeled tests of listening comprehension correlate better with each other than they do with tests that purport to measure reading comprehension? And vice versa.
A special set of questions about concurrent validity relate to the matter of test reliability. In the general sense, concurrent validity is about whether or not tests that purport to do the same thing actually do accomplish the same thing (or better, the degree to which they accomplish the same thing). Reliability of tests can be taken as a special case of concurrent validity. If all of the items on a test labeled as a test of writing ability are supposed to measure writing ability, then there should be a high degree of consistency of performance on the various items on that test. There may be differences of difficulty level, but presumably the type of skill to be assessed should be the same. This is like saying there should be a high degree of concurrent validity among items (or tests) that purport to measure the same thing. In order for a test to have a high degree of validity of any sort, it can be shown that it must first have a high degree of reliability.

In addition to these empirical (and statistically determined) requirements, a good test must also be practical and, for educational purposes, we might want to add that it should also have instructional value. By being practical we mean that it should be usable within the limits of time and budget available. It should have a high degree of cost effectiveness.

By having instructional value we mean that it ought to be possible to use the test to enhance the delivery of instruction in student populations. This may be accomplished in a foreign language classroom by diagnosing student progress (and teacher effectiveness) in more specific ways. In some cases the test itself becomes a teaching procedure in the most obvious sense. In multilingual contexts better knowledge of student abilities to process information coded verbally in one or more languages can help motivate curricular decisions. Indeed, in monolingual contexts curricular decisions need to be related as much as is possible to the communication skills of students (see Chapter 14).
It has been facetiously observed that what we are concerned with when we add the requirements of practicality and instructional value is something we might call true validity, or valid validity. With so many kinds of validity being discussed in the literature today, it does not seem entirely inappropriate to ask somewhat idealistically (and sad to say, not superfluously) for a valid variety of validity that teachers and educators may at least aim for. Toward this end we might examine the results of theoretical investigations of construct validity, practical analyses of the content of tests, and careful study of the intercorrelations among a wide variety of testing procedures to address questions of concurrent validity.3

3 Another variety of validity sometimes referred to in the literature is face validity. Harris (1969) defines it as 'simply the way the test looks - to the examinees, test administrators, educators, and the like' (p. 21). Since these kinds of opinions are often based on mere experiences with things that have been called tests of such and such a skill in the past, Harris notes that they are not a very important part of determining the validity of tests. Such opinions are ultimately important only to the extent that they affect performance on the test. Where judgements of face validity can be shown to be ill-informed, they should not serve as a basis for the evaluation of testing procedures at all.
We will return to the matter of validity of pragmatic tests and their patterns of interrelationship as determined by concurrent validity studies after a brief digression to consider the meaning of correlation in the statistical sense of the term. The reader who has some background in statistics or in the mathematics underlying statistical correlation may want to skip over the next eleven paragraphs and go directly to the discussion of results of statistical correlations between various tests that have been devised to assess language skills.

1. The meaning of correlation. The purpose here is not to teach the reader to apply correlation necessarily, but to help the non-statistically trained reader to understand the meaning of correlation enough to appreciate some of the interesting findings of recent research on the reliability and validity of various language testing techniques. There are many excellent texts that deal with correlation more thoroughly and with its application to research designs. The interested reader may want to consult one of the many available references.4 No attempt is made here to achieve any sort of mathematical rigor - and perhaps it is worth noting that most practical applications of statistical procedures do not conform to all of the niceties necessary for mathematical precision attainable in theory (see Nunnally, 1967, pp. 7-10, for a discussion of this point). Few researchers, however, would therefore deny the usefulness of the applications.

4 An excellent text written principally for educators is Merle Tate, Statistics in Education and Psychology: A First Course. New York: Macmillan, 1965, especially Chapter VII. Or, see Nunnally (1967), or Kerlinger and Pedhazur (1973).
Here we are concerned with simple correlation, also known as Pearson product-moment correlation. To understand the meaning of this statistic, it is first necessary to understand the simpler statistics of the arithmetic mean, the variance, and the standard deviation on which it is based. The arithmetic mean of a set of scores is computed by adding up all of the scores in the set of interest and dividing by the number of scores in the set. This procedure provides a measure of central tendency of the scores. It is like an answer to the question, if we were to take all the amounts of whatever the test measures and distribute an equal amount to each examinee, how much would each one get with none left over? Whereas the mean is an index of where the true algebraic center of the scores is, the variance is an index of how much scores tend to differ from that central point.
Since the true degree of variability of possible scores on a test tends to be somewhat larger than the variability of scores made by a given group of examinees, the computation of test variance must correct for this bias. Without going into any detail, it has been proved mathematically that the best estimate of true test variance can be made as follows: first, subtract the mean score from each of the scores on the test and record each of the resulting deviations from the mean (the deviations will be positive quantities for scores larger than the mean, and negative quantities for scores less than the mean); second, square each of the deviations (i.e., multiply each deviation by itself) and record the result each time; third, add up all of the squares (note that all of the quantities must be either zero or a positive value since the square of a negative value is always a positive value); fourth, divide the sum of squares by the number of scores minus one (the subtraction of one at this point is the correction of estimate bias noted at the beginning of this paragraph). The result is the best estimate of the true variance in the population sampled.
The standard deviation of the same set of scores is simply the square root of the variance (i.e., the positive number which times itself equals the variance). Hence, the standard deviation and the variance are interconvertible values (the one can be easily derived from the other). Each of them provides an index of the overall tendency of the scores to vary from the mathematically defined central quantity (the mean). Conceptually, computing the standard deviation is something like answering the question: if we added to the mean and subtracted from the mean amounts of whatever the test measures, how much would we have to add and subtract on the average to obtain the original set of scores? It can be shown mathematically that for normal distributions of scores, the mean and the standard deviation tell everything there is to know about the distribution of scores. The mean defines the central point about which the scores tend to cluster and their tendency to vary from that central point is the standard deviation.
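The computational steps just described can be followed mechanically. The sketch below merely restates them in Python; the score values are invented solely for the illustration.

    import math

    scores = [12, 15, 9, 14, 10]          # invented example scores

    # Mean: the amount each examinee would get if the total were
    # distributed equally, with none left over
    mean = sum(scores) / len(scores)

    # Variance: sum the squared deviations from the mean, then divide
    # by (number of scores - 1) to correct the estimate bias
    deviations = [x - mean for x in scores]
    variance = sum(d * d for d in deviations) / (len(scores) - 1)

    # Standard deviation: the positive square root of the variance
    sd = math.sqrt(variance)
    print(mean, variance, sd)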
We can now say what Pearson product-moment correlation means. Simply stated, it is an index of the tendency for the scores of a group of examinees on one test to covary (that is, to differ from their respective mean in similar direction and magnitude) with the scores of the same group of examinees on another test. If, for example, the examinees who tend to make high scores on a certain cloze test also tend to make high scores on a reading comprehension test, and if those who tend to make low scores on the reading test also tend to make low scores on the cloze, the two tests are positively correlated. The square of the correlation between any two tests is an index of the variance overlap between them. Perfect correlation will result if the scores of examinees on two tests differ exactly in proportion to each other from their respective means.
One of the conceptually simplest ways to compute the product-moment correlation between two sets of test scores is as follows: first, compute the standard deviation for each test; second, for each examinee, compute the deviation from the mean on the first test and the deviation from the mean on the second test; third, multiply the deviation from the mean on test one times the deviation from the mean on test two for each examinee (whether the value of the deviation is positive or negative is important in this case because it is possible to get negative values on this operation); fourth, add up the products of deviations from step three (note that the resulting quantity is conceptually similar to the sum of squares of deviations in the computation of the variance of a single set of scores); finally, divide the quantity from step four by the standard deviation of test one times the standard deviation of test two times one less than the number of examinees. The resulting value is the correlation between the two tests.
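Again, the five steps translate directly into program form. In the following sketch the paired score lists are invented for the illustration; each position in the two lists represents one examinee.

    import math

    def sample_sd(xs):
        # Standard deviation computed by the steps described earlier
        m = sum(xs) / len(xs)
        return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

    def pearson_r(test_one, test_two):
        m1 = sum(test_one) / len(test_one)
        m2 = sum(test_two) / len(test_two)
        # Steps two through four: sum the products of paired deviations
        cross = sum((x - m1) * (y - m2)
                    for x, y in zip(test_one, test_two))
        # Final step: divide by sd1 * sd2 * (n - 1)
        return cross / (sample_sd(test_one) * sample_sd(test_two)
                        * (len(test_one) - 1))

    cloze     = [31, 25, 40, 28, 36]      # invented scores
    dictation = [62, 50, 78, 55, 70]
    print(pearson_r(cloze, dictation))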
Correlations may be positive or negative. We have already considered an example of positive correlation. An instance of negative correlation would result if we counted correct responses on, say, a cloze test, and errors, say, on a dictation. Thus, a high score on the cloze test would (if the tests were correlated positively, as in the previous example) correspond to a low score on the other. High scorers on the cloze test would typically be low scorers on the dictation (that is, they would make fewer errors), and low scorers on the cloze would be high scorers on the dictation (that is, they would make many errors). However, if the score on the cloze test were converted to an error count also, the correlation would become positive instead of negative. Therefore, in empirical testing research, it is most often the magnitude of correlation between two tests that is of interest rather than the direction of the relationship. However, the value of the correlation (plus or minus) becomes interesting whenever it is surprising. We will consider several such cases in Chapter 5 when we discuss empirical research with attitudes and motivations.
What about the magnitude of correlations? When should a correlation be considered high or low? Answers to such questions can be given only in relation to certain purposes, and then only in general and somewhat imprecise terms. In the first place, the size of correlations cannot be linearly interpreted. A correlation of .90 is not three times larger than a correlation of .30 - rather it is nine times larger. It is necessary to square the correlation in each case in order to make a more meaningful comparison. Since .90 squared is .81 and .30 squared is .09, and since .81 is nine times larger than .09, a correlation of .90 is actually nine times larger than a correlation of .30. Computationally (or perhaps we should say mathematically), a correlation is like a standard deviation, while the square of the correlation (or the coefficient of determination as it is called) is on the same order as the variance. Indeed, the square of the correlation of two tests is an index of the amount of variance overlap between the two tests - or put differently, it is an index of the amount of variance that they have in common. (For more thorough discussion, see Tate, 1965, especially Chapter VII.)
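The arithmetic of that comparison is easy to verify in a few lines of Python:

    r_high, r_low = 0.90, 0.30
    overlap_high = r_high ** 2      # about .81: 81% shared variance
    overlap_low = r_low ** 2        # about .09:  9% shared variance
    print(round(overlap_high / overlap_low, 2))   # 9.0, nine times larger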
With respect to reliability studies, correlations above .95 between, say, two alternate forms of the same test are considered quite adequate. Statistically, such a correlation means that the test forms overlap in variance at about the .90 level. That is, ninety percent of the total variance in both tests is present in either one by itself. One could feel quite confident that the tests would tend to produce very similar results if administered to the same population of subjects. What can be known from the one is almost identical to what can be known from the other, with a small margin of error.

On the other hand, a reliability index of .60 for alternate forms of the same test would not be considered adequate for most purposes. The two tests in this latter instance are scarcely interchangeable. It would hardly be justifiable to say that they are very reliable measures of whatever they are aimed at assessing. (However, one cannot say that they are necessarily measuring different things on the basis of such a correlation. See Chapter 7 on statistical traps.)
In general, whether the question concerns reliability or validity, low correlations are less informative than high correlations. An observed low correlation between two tests that are expected to correlate highly is something like the failure of a prospector in search of gold. It may be that there is no gold or it may be that the prospector simply hasn't turned the right stones or panned the right spots in the stream. A low correlation may result from the fact that one of the tests is too easy or too hard for the population tested. It may mean that one of the tests is unreliable. Or that both of them are unreliable. Or a low correlation may result from the fact that one or both tests do not measure what they are supposed to measure (i.e., are not valid), or merely that one of them (or both) has (or have) a low degree of validity.

A very high correlation is less difficult to interpret. It is more like a gold strike. The richer the strike, that is, the higher the correlation, the more easily it can be interpreted. A correlation of .85 or .90 between two tests that are superficially very different would seem to be evidence that they are tapping the same underlying skill or ability. In any event, it means at face value that the two tests share .72 or .81 of the total variance in both tests. That is, between 72 and 81 percent of what can be known from the one can be known equally well from the other.
A further point regarding the interpretation of reliability estimates should be made. Insofar as a reliability estimate is accurate, its square may be interpreted as the amount of non-random variance in the test in question. It follows that the validity of a test can never exceed its reliability, and further that validity indices can equal reliability indices only in very special circumstances - namely, when all the reliable (non-random) variance in one test is also generated by the other. We will return to this very important fact about correlations as reliability indices and correlations as validity indices below. In the meantime, we should keep in mind that a correlation between two tests should normally be read as a reliability index if the two tests are considered to be different forms of the same test or testing procedure. However, if the two tests are considered to be different tests or testing procedures, the correlation between them should normally be read as a validity index.

2. Correlations between different language tests. One of the first
studies that showed surprisingly high correlations between
substantially different language tests was done by Rebecca Valette (1964)
in connection with the teaching of French as a foreign language at the
college level. She used a dictation as part of a final examination for a
course in French. The rest of the test included: (1) a listening
comprehension task in a multiple choice format that contained items
requiring (a) identification of a phrase heard on tape, (b) completion
of sentences heard on tape, and (c) answering of questions concerning
paragraphs heard on tape; (2) a written sentence completion task of
the fill-in-the-blank variety; and (3) a sentence writing task where
students were asked to answer questions in the affirmative or negative
or to follow instructions entailed in an imperative sentence like, Tell
John to come here, where a correct written response might be, John,
come here.
For two groups of subjects, all first semester French students, one
of which had practiced taking dictation and the other of which had
not, the correlations between dictation scores and the other test
scores combined were .78 and .89, respectively. Valette considered
these correlations to be notably high and concluded that the 'dictée'
was measuring the same basic overall skills as the longer and more
difficult-to-prepare French examination.

Valette concluded that the difference in the two correlations could
be explained as a result of a practice effect that reduced the validity of
dictation as a test for students who had practiced taking dictation.
However, the two groups also had different teachers, which suggests
another possible explanation for the differences. Moreover, Kirn
(1972), in a study of dictation as a testing technique at UCLA, found
that extensive practice in taking dictation in English did not result in
substantially higher scores. Another possible explanation for the
differences in correlations between Valette's two groups might be that
dictation is a useful teaching procedure, in which case the difference
might be evidence of real learning.
Nevertheless, one of the results of Valette's study has been
replicated on numerous occasions with other tests and with entirely
different populations of subjects - namely, that dictation does
correlate at surprisingly high levels with a vast array of other
language tests. For instance, in a study at UCLA, a dictation task
included as part of the UCLA English as a Second Language
Placement Examination Form 1 correlated better with every other
part of that test than any other two parts correlated with each other
(Oller, 1970, Oller and Streiff, 1975). This would seem to suggest that
dictation was accounting for more of the total variance in the test
than any other single part of that test. The correlation between
dictation and the total score on all other test parts not including the
dictation (Vocabulary, Grammar, Composition, and Phonology -
for description see Oller and Streiff, 1975, pp. 73-5) was .85. Thus the
dictation was accounting for no less than 72% of the variance in the
entire test.
In a later study, using a different form of the UCLA placement test
(Form 2C), dictation correlated as well with a cloze test as either of
them did with any of the other subtests on the ESLPE 2C (Oller and
Conrad, 1971). This was somewhat surprising in view of the striking
differences in format of the two tests. The dictation is heard and
written, while the cloze test is read, with blanks to be filled in. The one
test utilizes an auditory mode primarily, whereas the other uses
mainly a visual mode.

Why would they not correlate better with more similar tasks than
with each other? For instance, why would the cloze test not correlate
better with a reading comprehension task or a vocabulary task (both
were included among the subtests on the ESLPE 2C)? The
correlation between cloze and dictation was .82, while the correlations
between cloze and reading, and cloze and vocabulary, were .80 and
.59, respectively. This surprising result confirmed a similar finding of
Darnell (1968), who found that a cloze task (scored by a somewhat
complicated procedure to be discussed in Chapter 12) correlated
better with the Listening Comprehension subscore on the TOEFL
than it did with any other part of that examination (which also
includes a subtest aimed at reading comprehension and one aimed at
vocabulary knowledge). The correlation between the cloze test and
the Listening Comprehension subtest was .73, and the correlation
with the total score on all subtests of the TOEFL combined was .82.

In another study, of the UCLA ESLPE 2A Revised, correlations of
.74, .84, and .85 were observed between dictations and cloze tests
(Oller, 1972). Also, three different cloze tests used with different
populations of subjects (above 130 in number in each case) correlated
above .70 in six cases with grammar tasks and paraphrase recognition
tasks. The cloze tests were scored by the contextually-appropriate
('acceptable word') method; see Chapter 12.
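Since cloze construction and the two scoring methods just mentioned recur throughout these studies (Chapter 12 treats them in detail), a minimal computational sketch may help fix the idea. The every-7th-word deletion rate and all names and data here are illustrative assumptions, not features of any of the tests cited:

    # A minimal sketch of cloze construction and scoring (illustrative only).
    def make_cloze(text, n=7):
        """Delete every nth word; return the mutilated text plus an answer key."""
        words = text.split()
        key = {}
        for i in range(n - 1, len(words), n):
            key[i] = words[i]
            words[i] = "______"
        return " ".join(words), key

    def score(responses, key, acceptable=None):
        """Exact-word score; optionally also credit acceptable alternatives."""
        exact = sum(1 for i, w in responses.items() if w == key.get(i))
        if acceptable is None:
            return exact
        appropriate = sum(1 for i, w in responses.items()
                          if w == key.get(i) or w in acceptable.get(i, set()))
        return exact, appropriate

Under the exact-word method only the deleted word itself is credited; under the acceptable-word method any response that fits the total surrounding context is credited as well.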
While Valette was looking at the performance of students in a
formal language classroom context where the language being studied
was not spoken in the surrounding community, the studies at UCLA
and the one by Darnell (at the University of Colorado) examined
populations of students in the United States who were in social
contexts where English was in fact the language of the surrounding
community. Yet the results were similar in spite of the contrasts in
tested populations and regardless of the contrasts in the tests used
(the various versions of the UCLA ESLPE, the TOEFL, and the foreign
language French exams).
Similar results are available, however, from still more diverse
settings. Johansson (1972) reported on the use of a combined cloze
and dictation procedure which produced essentially the same results
as the initial studies with dictation at UCLA. He found that his
combined cloze and dictation procedure correlated better with scores
on several language tests than any of the other tests correlated with
each other. It is noteworthy that Johansson's subjects were Swedish
college students who had learned English as a foreign language. The
correlation between his cloze-dictation and a traditional test of
listening comprehension was .83.
In yet another context, Stubbs and Tucker (1974) found that a
cloze test was generally the best predictor of the various sections of the
American University at Beirut English Entrance Examination. Their
subject population included mostly native speakers of Arabic
learning English as a foreign or second language. The cloze test
appeared to be superior to the more traditional parts of the EEE in
spite of the greater ease of preparation of the cloze test. In particular, the
cloze blanks seemed to discriminate better between high scorers and
low scorers than did the traditional discrete point types of items (see
Chapter 9 on item analysis and especially item discrimination).
A study by Pike (1973) with such diverse techniques as oral
interview (FSI type), essay ratings, cloze scores, the subscores on the
TOEFL, and a variety of other tasks yielded notably strong
correlations between tasks that could be construed as pragmatic tests.
He tested native speakers of Japanese in Japan, and Spanish speakers
in Chile and Peru. There were some interesting surprises in the simple
correlations which he observed. For instance, the essay scores
correlated better with the subtest labeled Listening Comprehension
for all three populations tested than with any of the other tests, and
the cloze scores (by Darnell's scoring method) correlated about as
highly with interview ratings as did any other pairs of subtests in the
data.
The puzzle remains. Why should tests that look so different in
terms of what they require people to do correlate so highly? Or, more
mysterious still, why should tests that purport to measure the same
skill or skills fail to correlate as highly with each other as they
correlate with other tests that purport to measure very different
skills? A number of explanations can be offered, and the data are by
no means all in at this point. It would appear that the position once
favored by discrete point theorists has been excluded by experimental
studies - that position was that different forms of discrete point tests
aimed at assessing the same skill, or aspect of a skill, or component of
a skill, ought to correlate better with each other than with, say,
integrative (especially, pragmatic) tests of a substantially different
sort. This position now appears to have been incorrect.
There is considerable evidence to show that in a wide range of
studies with a substantial variety of tests and a diverse selection of
subject populations, discrete point tests do not correlate as well with
each other as they do with integrative tests. Moreover, integrative
tests of very different types (e.g., cloze versus dictation) correlate even
more highly with each other than they do with language tests which
discrete point theory would identify as being more similar. The
correlations between diverse pragmatic tests, in other words,
generally exceed the correlations observed between quite similar
discrete point tests. This would seem to be a strong disproof of the
early claims of discrete point theories of testing, and one will search in
vain for an explanation in the strong versions of discrete point
approaches (see especially Chapter 8, and in fact all of Part Two).
Having discarded the strong version of what might be termed the
discrete point hypothesis - namely, that tests aimed at similar
elements, components, aspects of skills, or skills ought to correlate
more highly than tests that apparently require a greater diversity
of performances - we must look elsewhere for an explanation of the
pervasive results of correlation studies. Two explanations have been
offered. One is based on the pragmatic theory advocated in Chapter 2
of this book, and the other is a modified version of the discrete point
argument (Upshur, 1976, discusses this view, though it is doubtful
whether he thinks it is correct). From an experimental point of
view, it is obviously preferable to avoid advocacy and let the available
or obtainable data speak for themselves (Platt, 1964).
One hypothesis is that pragmatic language tests must correlate
highly if they are valid language tests. Therefore, the results of
correlation studies can be easily understood, or at least straightforwardly
interpreted, as evidence of the fundamental validity of the
variety of language tests that have been shown to correlate at such
remarkably high levels. The reason that a dictation and a cloze test
(which are apparently such different tasks) intercorrelate so strongly
is that both are effective devices for assessing the efficiency of the
learner's developing grammatical system, or language ability, or
pragmatic expectancy grammar, or cognitive network of the
language, or whatever one chooses to call it. There is substantial
empirical evidence to suggest that there may be a single unitary factor
that accounts for practically all of the variance in language tests
(Oller, 1976a). Perhaps that factor can be equated with the learner's
developing grammatical system.
One rather simple but convincing source of data on this question is
the fact that the validity estimates on pragmatic tests of different sorts
(i.e., the correlations between different ones) are nearly equal to the
reliability estimates for the same tests. From this it follows that the
tests must be measuring the same thing to a substantial extent.
Indeed, if the validity estimates were consistently equal, or nearly
equal, to the reliability estimates, we would be forced to conclude that
the tests are essentially measures of the same factor. This is an
empirical question, however, and another plausible alternative
remains to be ruled out.
Upshur (1976) suggests that perhaps the grammatical system of the
learner will account for a large and substantial portion of the variance
in a wide variety of language tests. This central portion of variance
might explain the correlations mentioned above, but there could still
be meaningful portions of variance left which would be attributable
to components of grammar or aspects of language skill, or the
traditional skills themselves.
Lofgren (1972) concluded that 'there appear to be four main
factors which are significant for language proficiency. These have
been named knowledge of words and structures, intelligence,
pronunciation, and fluency' (p. 11). He used a factor analytic
approach (a sophisticated variation on correlation techniques) to test
'Lado's idea ... that language' can be broken down into 'smaller
components in order to find common elements' (p. 8). In particular,
Lofgren wanted to test the view that language skills could be
differentiated into listening, speaking, reading, writing, and possibly
translating factors. His evidence would seem to support either the
unitary pragmatic factor hypothesis, or the central grammatical
factor with meaningful peripheral components as suggested by
Upshur. His data seem to exclude the possibility that meaningful
variances will emerge which are unique to the traditionally
recognized skills. Clearly, more research is needed in relation to this
important topic (see the Appendix).
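For readers unfamiliar with the factor analytic approach, a rough computational sketch may clarify what is at stake. The correlation matrix below is invented purely for illustration (it is not Lofgren's data, nor anyone else's); the size of the largest eigenvalue of such a matrix, relative to the sum of all eigenvalues, indicates how much of the total variance a single factor could absorb:

    import numpy as np

    # Hypothetical correlations among five language tests
    # (e.g., dictation, cloze, listening, reading, interview).
    R = np.array([
        [1.00, 0.82, 0.80, 0.75, 0.78],
        [0.82, 1.00, 0.79, 0.80, 0.74],
        [0.80, 0.79, 1.00, 0.77, 0.76],
        [0.75, 0.80, 0.77, 1.00, 0.72],
        [0.78, 0.74, 0.76, 0.72, 1.00],
    ])

    eigenvalues = np.linalg.eigvalsh(R)      # returned in ascending order
    share = eigenvalues[-1] / eigenvalues.sum()
    print(f"A single factor could account for {share:.0%} of the variance")

With correlations as uniformly high as these, the first factor absorbs roughly four-fifths of the total variance - the kind of pattern the unitary factor hypothesis predicts. A matrix with distinct clusters of high correlations would instead support the multiple-factor view.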
Closely related to the questions about the composition of language
skill (and these questions have only recently been posed with
reference to native speakers of any given language) are questions
about the important relation of language skill(s) to IQ and other
psychological constructs. If pragmatic tests are actually more valid
tests than other widely used measures of language skills, perhaps
these new measurement techniques can be used to determine the
relationship between ability to perform meaningful language
proficiency tasks and ability to answer questions on so-called IQ
tests, and educational tests in general. Preliminary results reported in
Oller and Perkins (1978) seem to support a single factor solution
where language proficiency accounts for nearly all of the reliable
variance in IQ and achievement tests.
Most of the studies of language test validity have dealt with second
or foreign language learners who are either adults or post-adolescents.
Extensions to native speakers, and to children who are
either native or non-native speakers of the language tested, are more
recent. Many careful empirical investigations are now under way or
have fairly recently been reported with younger subjects. In a
pioneering doctoral study at the University of New Mexico, Craker
(1971) used an oral form of the cloze procedure to assess language
skills of children at the first grade level from four ethnic backgrounds.
She reported significant discrimination between the four groups,
suggesting that the procedure may indeed be sensitive to variations in
levels of proficiency for children who are either preliterate or are just
beginning to learn to read.
Although data are lacking on many of the possible pragmatic
testing procedures that might be applied with children, the cloze
procedure has recently been used with literate children in the
elementary grades in contexts ranging from the Alaskan bush
country to the African primary school. Streiff (1977) investigated the
effects of the availability of reading resources on reading proficiency
among Eskimo children from the third to sixth grade, using cloze tests
as the criterion measure for reading proficiency. Hofman (1974) used
cloze tests as measures of reading proficiency in Uganda schools from
grades 2 through 9. Data were collected on children in 14 schools (12
African and 2 European). Since the tests were in a second language
for many of the African children, and in the native language for many
of the European children, some interesting comparisons are possible.
Concerning test reliabilities and internal consistencies of the various
cloze tests used, Hofman reports somewhat lower values for the 2nd
graders, but even including them, the average reliability estimate for
all nine test batteries is .91 - and none is below .85. These data were
based on a mean sample size of 264 subjects. The smallest number for
any test battery was 232.
In what may be the most interesting study of language proficiency
in young children to date, Swain, Lapkin, and Barik (1976) have
reported data closely paralleling results obtained with adult second
language learners. In their research, 4th grade bilinguals (or
English speaking children who are becoming bilingual in French)
were tested. Tests of proficiency in French were used and correlated
with a cloze test in French scored by the exact and acceptable word
methods (see Chapter 12 for elaboration on scoring methods).
Proficiency tests for English ability were also correlated with a cloze
test in English (also scored by both methods). In every case, the
correlations between cloze scores and the other measures of
proficiency used (in both languages) were higher than the correlations
between any of the other pairs of proficiency tests. This result would
seem to support the conclusion that the cloze tests were simply
accounting for more of the available meaningful variance in both the
native language (English) and in the second language (French).

The authors conclude: 'This study has indicated that the cloze tests
can be used effectively with young children ... the cloze technique has
been shown to be a valid and reliable means of measuring second
language proficiency. These characteristics along with the fact that it
is easy to construct, administer, and score combine to make the cloze
technique an efficient tool to use in summative evaluations. In
addition, a detailed analysis of the types of errors made on a cloze test
shows promise as an excellent source of information in formative
evaluations' (p. 40). Still more recently, Lapkin and Swain (1977) cite
further confirming evidence.
Regardless of the eventual answers yet to be obtained concerning
some of the subtler questions still remaining about the developing
field of pragmatic testing, it would seem that the substantial validity
of such testing techniques is supported from a wide range of research
angles. If language tests were like windows through which language
proficiency might be viewed, and if language proficiency were
thought of as a courtyard that could be seen from a number of
different windows, it would seem that a clearer view of the courtyard
is possible through some windows than through others. Pragmatic
tests seem to be among the clearest windows available for
determining what is in the courtyard. Moreover, very different
pragmatic tests seem to be producing a very high degree of agreement
about what is there - about what language proficiency really is.

3. Error analysis as an independent source of validity data. If
language tests are viewed as elicitation procedures for acquiring data
on examinees' performance in a certain language, one way of
assessing the validity of the procedures themselves is to see what
kinds of data they are capable of eliciting. Typically, the analysis of
learner performance in this way has the objective of making 'an
exhaustive description of an individual's grammar' (Swain, Dumas,
and Naiman, 1974, p. 68), or at least this would seem to be the ideal.
Usually such studies are referred to as error analyses. Richards
(1970a, 1971) popularized the term error analysis in two widely read
and often anthologized papers. Another term that has been used is
interlanguage analysis, following Selinker (1972). The fundamental
objective of all such studies would appear to be to characterize in a
useful way the developing language ability (whether in a first or other
language) of a learner or group of learners.
It is, however, possible to use the analysis of learner performances
on language tests as a basis for determining what it is that a certain
test or type of test measures. We have seen above some rather strong
evidence to suggest that at least certain types of so-called pragmatic
language tests are apparently tapping the same underlying language
ability. It is possible to raise the question whether error analyses
based on pragmatic tasks used as elicitation procedures will support
the same conclusion - whether pragmatic tests of many different
kinds do indeed seem to be tests of the same basic ability.
For instance, if a learner, or if learners in general, tend to make the
same sorts of errors in spontaneous speech that they make in a more
controlled interview situation, our confidence in the interview
technique would be strengthened. Or suppose learners could be
shown to make the same sorts of errors in a translation task, say from
the native language to the target language, that they make in
spontaneous speech in the target language. Since it takes a long time
to collect sufficient samples of spontaneous speech data, and since it is
relatively easier to set up a translation task, such a finding would be
very useful and would make testing and data collection for
interlanguage analysis much easier. Or consider the possibility that a
second language learner might tend to make similar sorts of errors in
writing a dictation of heard material and in speaking. Such a finding
could make the testing of speaking ability much more convenient.
The fact is that few carefully controlled studies that might shed
light on these important possibilities have been done. One notable
exception is a study of elicited imitation and translation by Swain,
Dumas, and Naiman (1974). Elicited imitation is a kind of oral
dictation procedure. That is, the material to be imitated is presented
to the examinee auditorily and the examinee's task is to repeat it.
Fraser, Bellugi, and Brown (1963) argued against this procedure as a
test of language competence much the way Lado (1961) and others
argued against dictation. They complained that the technique could
only assess a relatively superficial aspect of 'perceptual-motor skills'.
Other researchers (Menyuk, 1969, Ervin-Tripp, 1970, Slobin and
Welsh, 1973, and Natalicio and Williams, 1971) disagreed. They
argued that if the material to be repeated pushed the limits of the
short-term memory of the examinees, it was in fact a valid test of both
comprehension and production skills. With different aims in mind,
Labov and Cohen (1967) and Baratz (1969) had used elicited
imitation in studies of non-majority varieties of English. More
recently, the technique has been extended to test proficiency in two
major varieties of English by Politzer, Hoover, and Brown (1974).
However, with many of the arguments unresolved concerning the
actual demands of the task, Swain, Dumas, and Naiman (1974) set
out to study the validity of elicited imitation in relation to elicited
translation (that is, both tasks were under the control of an examiner
or experimenter; neither was spontaneous). They reasoned that if the
sentences children were to repeat exceeded immediate memory span,
then elicited imitation ought to be a test both of comprehension and
production skills. Translation, on the other hand, could be done in two
directions: either from the native language of the children (in this case
English) to the target language (French), or from the target language
to the native language. The first sort of translation, they reasoned,
could be taken as a test of productive ability in the target language,
whereas the second could be taken as a measure of comprehension
ability in the target language (presumably, if the child could
understand something in French he would have no difficulty in
expressing it in his native language, English).
In order to rule out the possibility that examinees might be using
different strategies for merely repeating a sentence in French as
opposed to translating it into English, Swain, Dumas, and Naiman
devised a careful comparison of the two procedures. One group of
children was told before each sentence whether they were to repeat it
or to translate it into English. If subjects used different strategies for
the two tasks, this procedure would allow them to plan their strategy
before hearing the sentence. A second group was given each sentence
first and told afterward whether they were to translate it or repeat it.
Since the children in this group did not know what they were to do
with the sentence beforehand, it would be impossible for them
consistently to use different strategies for the imitation task and the
translation task. Differences in error types or in the relative difficulty
of different syntactic structures might have implied different
strategies of processing, but there were no differences. The imitation
task was somewhat more difficult, but the types of errors and the rank
ordering of syntactic structures were similar for both tasks and for
both groups. (There were no significant differences at all between the
two groups.)
There were some striking similarities in performance on the two
rather different pragmatic tasks. 'For example, 75% of the children
who imitated the past tense by "a + the third person of the present
tense of the main verb" made the same error in the production task.
... Also, 69% of the subjects who inverted pronoun objects in
imitation made similar inversions in production. ... In sum, these
findings lead us to reject the view held by Fraser, Bellugi, and Brown
(1963) and others that imitation is only a perceptual-motor skill'
(Swain, Dumas, and Naiman, 1974, p. 72). Moreover, in a study by
Naiman (1974), children were observed to make many of the same
errors in spontaneous speech as they made in elicited translation from
the native language to the target language.
In yet another study, Dumas and Swain (1973) 'demonstrated that
when young second language learners similar to the ones in Naiman's
study (Naiman, 1974) were given English translations of their own
French spontaneous productions and asked to translate these
utterances into French, 75% of their translations matched their
original spontaneous productions' (Swain, Dumas, and Naiman,
1974, p. 73).
Although much more study needs to be done, and with a greater
variety of subject populations and techniques of testing, it seems
reasonable to say that there is already substantial evidence from
studies of learner outputs that quite different pragmatic tasks may be
tapping the same basic underlying competence. There would seem to
be two explanations for differences in performance on tasks that
require saying meaningful things in the target language versus merely
indicating comprehension by, say, translating the meaning of what
someone else has just said in the target language into one's own native
language. Typically, it is observed that tasks of the latter sort are
easier than the former. Learners can often understand things that
they cannot say; they can often repeat things that they could not have
put together without the support of a model; they may be able to read
a sentence which is written down that they could not have understood
at all if it were spoken; they may be able to comprehend written
material that they obviously could not have written. There seems to
be a hierarchy of difficulties associated with different tasks. The two
explanations that have been put forth parallel the two competing
explanations for the overlap in variance on pragmatic language tests.
explanations [or the overlap in variance on pragmatic language tests.
Discrete point testers have long insisted on the separation of tests
of the traditionally recognized skills. The extreme version of this
argument is to propose that learners possess different grammars for
different skills, aspects of skills, components of aspects of skills, and
so on, right down to the individual phonemes, morphemes, etc. The
disproof of this view seems to have already been provided now many
times over. Such a theoretical argument cannot embrace the data
from correlation studies or the data from error analyses. However, it
seems that a weaker version cannot yet be ruled out.

It is possible that there is a basic grammatical system underlying
all uses of language, but that there remain certain components which
are not part of the central core that account for what are frequently
referred to as differences in productive and receptive repertoires (e.g.
phonology for speaking versus for listening), or differences in
productive and receptive abilities (e.g. speaking and writing versus
listening and reading), or differences in oral and visual skills (e.g.
speaking and listening versus reading and writing), or components
that are associated with special abilities (such as the capacity to do
simultaneous translation or to imitate a wide variety of accents), and
so on. This sort of reasoning would harmonize with the
hypothesis discussed by Upshur (1976) concerning unique variances
associated with tests of particular skills or components of grammar.
Another plausible alternative exists, however, and was hinted at in
the article by Swain, Dumas, and Naiman (1974). Consider the rather
simple possibility that the differences in difficulty associated with
different language tasks may be due to differences in the load they
impose on the brain. It is possible that the grammatical system (call it
an expectancy grammar, or call it something else) functions with
different levels of efficiency in different language tasks - not because it
is a different grammar, but because of differences in the load it must
bear (or help consciousness to bear) in relation to different tasks. No
one would propose that because a man can carry a one hundred
pound weight up a hill faster than he can carry a one hundred and
fifty pound weight, he must therefore have different pairs of legs
for carrying different amounts of weight. It wouldn't even seem
reasonable to suggest that he has more weight-moving ability when
carrying one hundred pounds than when carrying one hundred
and fifty pounds. Would it make sense to suggest that there is an
additional component of skill that makes the one hundred pound
weight easier to carry?
The case of, say, speaking versus listening skill is obviously much
more complex than the analogy, which is intentionally a reduction to
absurdity. But the argument can apply. In speaking, the narrow
corridor of activity known as attention or consciousness must
integrate the motor coordination of signals to the articulators, telling
them what moves to make and in what order, when to turn the voice
on and off, when to push air and when not to; syllables must be timed,
with monitoring to make certain the right ones get articulated in the right
sequence; facial expressions, tones, gestures, etc. must be synchronized
with the stream of speech; and all of the foregoing must be
coordinated with certain ill-defined intentions to communicate (or
with pragmatic mappings of utterances onto extralinguistic contexts,
if you like). In listening, somewhat less is required. While the speaker
must both plan and monitor his articulatory output to make sure it
catches up with what he intended to say in form and meaning, all the
while continuing to plan and actively construct further forms and
meanings, the listener needs only to monitor his inferences
concerning the speaker's intended meanings; and to help him do this,
the listener has the already constructed sensory signals (that he can
hear and see) which the speaker is outputting. A similar explanation
can be offered for the fact that reading is somewhat less taxing than
writing, and that reading is somewhat easier than listening, and so
forth for each possible contrast. Swain, Dumas, and Naiman (1974)
anticipate this sort of explanation when they talk about 'the influence
of memory capacity on some of the specific aspects of processing
involved in tasks of imitation and translation' (p. 75).
Crucial experiments to force the choice between the two competing
explanations for hierarchies of difficulties in different language tasks
remain to be done. Perhaps a double-edged approach from both
correlation studies and error analysis will disprove one or the other of
the two competing alternatives, or possibly other alternatives remain
to be put forward. In the meantime, it would seem that the evidence
from error analysis supports the validity of pragmatic tasks.
Certainly, it would appear that studies of errors on different
elicitation procedures are capable of putting both of the interesting
alternatives in the position of being vulnerable to test. This is all that
science requires (Platt, 1964).
Teachers and educators, of course, require more. We cannot wait
for all of the data to come in, or for the crucial experiments to be
devised and executed. Decisions must be made, and they will be made
either wisely or unwisely, for better or for worse. Students in
classrooms cannot be left to sit there without a curriculum, and the
decisions concerning the curriculum must be made with or without
the aid of valid language tests. The best available option seems to be
to go ahead with the sorts of pragmatic language tests that have
proved to yield high concurrent validity statistics and to provide a
rich supply of information concerning learner grammatical systems.
In the next chapter we will consider aspects of language assessment in
multilingual contexts. In Part Two, we will discuss reasons for
rejecting certain versions of discrete point tests in favor of pragmatic
testing, and in Part Three, specific pragmatic testing procedures are
discussed in greater detail. The relevance of the procedures
recommended there to educational testing in general is discussed at
numerous points throughout Part Three.

KEY POINTS

1. Discrete point tests are aimed at specific elements of phonology, syntax,
or vocabulary within a presumed aspect (productive or receptive, oral or
visual) of one of the traditionally recognized language skills (listening,
speaking, reading, or writing).
2. The strong version of the discrete point approach argues that different
test items are needed to assess different elements of knowledge within
each component of grammar, and different subtests are needed for each
different component of each different aspect of each different skill.
Theoretically, many different tests are required.
3. Integrative tests are defined as antithetical to discrete point tests.
Integrative tests lump many elements and possibly several components,
aspects, and skills together and test them all at the same time.
4. While it can be argued that discrete point tests and integrative tests are
merely two extremes on a continuum, pragmatic tests constitute a
special class of integrative tests. It is possible to conceive of a discrete
point test as being more or less integrative, and an integrative test as
being more or less discrete, but pragmatic tests are more precisely
defined.
5. Pragmatic language tests must meet two naturalness criteria: first, they
must require the learner to utilize normal contextual constraints on
sequences in the language; and, second, they must require comprehension
(and possibly production also) of meaningful sequences of
elements in the language in relation to extralinguistic contexts.
6. Discrete point tests cannot be pragmatic tests.
7. The question whether or not a task is pragmatic is an empirical one. It
can be decided by logic (that is, by definition) and by experiment, but not
by opinion polls.
8. Dictation and cloze procedure are examples of pragmatic tests. First,
they meet the requirements of the definition, and second, they function in
experimental applications in the predicted ways.
9. Other examples include combinations of cloze and dictation, oral cloze
tasks, dictation with interfering noise, paraphrase recognition, question-answering,
oral interview, essay writing, narration, and translation.
10. A test is valid to the extent that it measures what it is supposed to
measure. Construct validity has to do with theoretical justification of a
testing procedure; content validity has to do with the faithfulness with
which a test reflects the normal uses of language to which it is related as a
measure of an examinee's skill; concurrent validity has to do with the
strength of correlations between tests that purport to measure the same
thing.
11. Correlation is a statistical index of the tendency of scores on two tests to
vary proportionately from their respective means. It is an index of the
square root of variance overlap, or variance common to two tests.
12. The square of the simple correlation between two tests is an unbiased
estimate of their variance overlap. The technical term for the square of
the correlation is the coefficient of determination. Correlations cannot be
compared linearly, but their squares can (see the worked example
following this list).
13. A high correlation is more informative and easier to interpret than a low
one. While a high correlation does mean that some sort of strong
relationship exists, a low correlation does not unambiguously mean that
a strong relationship does not exist between the tested variables. There
are many more explanations for low correlations than for high ones.
14. Generally, pragmatic tests of apparently very different types correlate at
higher levels with each other than they do with other tests. However, they
also seem to correlate better with the more traditional discrete point tests
than the latter do with each other. Hence, pragmatic tests seem to be
generating more meaningful variance than discrete item tests.
15. Two possibilities seem to exist: there may be a large factor of
grammatical knowledge of some sort in every language test, with certain
residual variances attributable to specific components, aspects, skills,
and the like; or, language skill may be a relatively unitary factor and
there may be no unique meaningful variances that can be attributed to
specific components, etc.
16. The relation of language proficiency to intelligence (or more specifically
to scores on so-called IQ tests) remains to be studied more carefully.
Scores on achievement tests and educational measures of all sorts should
also be examined critically with careful experimental procedures.
17. Error analysis, or interlanguage analysis, can provide additional validity
data on language tests. If language tests are viewed as elicitation
procedures, and if errors are analyzed carefully, it is possible to make
very specific observations about whether certain tests measure different
things or the same thing.
18. Available data suggest that very different pragmatic tasks, such as
spontaneous speech, elicited imitation, or elicited translation, tend to
produce the same kinds of learner errors.
19. Differences in difficulty across tasks may be explained by considering the
relative load on mental mechanisms.
20. Teachers and educators can't wait for all of the research data to come in.
At present, pragmatic testing seems to provide the most promise as a
reliable, valid, and usable approach to the measurement of language
ability.
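As a worked illustration of Key Point 12 (the numbers are chosen arbitrarily): correlations of .90 and .45 do not represent a two-to-one difference in strength of relationship. Their squares show that the first pair of tests shares four times as much variance as the second:

    \frac{(.90)^2}{(.45)^2} \;=\; \frac{.81}{.2025} \;=\; 4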

DISCUSSION QUESTIONS
1. What sort of testing procedure is more common in your school or in
educational systems with which you are familiar: discrete point testing,
or integrative testing? Are pragmatic tests used in classrooms that you
know of? Have you used dictation or cloze procedure in teaching a
foreign language? Some other application? For instance, assessing
reading comprehension, or the suitability of materials aimed at a certain
grade level?
" 2. D iscuss ways tha t yo u evalu a te Ihe language proficiency of childre n in
your cl ass room, or students at your uni versity, or perhaps how you
might estimate your own Janguage proficiency in a second Janguage you
have studied.
3. What are some of the drawbacks or advantages of a phonological
discrimination task where the examinee hears a sentence like, He thinks
he will sail his boat at the lake, and must decide whether he heard sell or
sail? Try writing several items of this type and discuss the factors that
enter into determining the correct answer. What expectancy biases may
arise?
4. In what way is understanding the sentence, Shove off, Smith, I'm tired of
talking to you, dependent on knowing the meaning of off? Could you test
such knowledge with an appropriate discrete item?
5. Try doing a cloze test and a dictation. Reflect on the strategies you use in
performing one or the other. Or give one to a class and ask them to tell
you what they attended to and what they were doing mentally while they
filled in the blanks or wrote down what they had heard.
6. Propose other forms of tasks that you think would qualify as pragmatic
tasks. Consider whether they meet the two naturalness criteria. If so, try
them out and see how they work.
7. Can you think of possible objections to a dictation with noise? What are
some arguments for and against such a procedure?
8. Given a low correlation between two language tests, say, .30, what are
some of the possible conclusions? Suppose the correlation is .90. What
could you conclude for the latter?
9. Keep a diary of listening errors and speech errors that you or people
around you make. Do the processes appear to be distinct? Interrelated?
10. Discuss Alice's query about not knowing what she would say next till she
had already said it. How does this fit the strategies you follow when you
speak? In what cases might the statement apply to your own speech?
11. Can a high correlation between two tests be taken as an indication of test
validity? When is it merely an indication of test reliability? What is the
difference? Can a test be valid without being reliable? How about the
reverse?
12. Try collecting samples of learner outputs in a variety of ways. Compare
error types. Do dissimilar pragmatic tests elicit similar or dissimilar
errors?
13. Consider the pros and cons of the long despised technique of translation
as a teaching and as a testing device. Are the potential uses similar? Can
you define clearly abuses to be avoided?
14. What sorts of tests do you think will yield superior diagnostic
information to help you know what to do to help a learner by teaching
strategies? Consider what you do now with the test data that are
available. Are there reading tests? Vocabulary tests? Grammar tests?
What do you do differently because of the information that you get from
the tests you use?
15. Compute the means, variances, and standard deviations for the
following sets of scores:
            Test A    Test B
    George     1         5
    Sarah      2        10
    Mary       3        15
(Note that these data are highly artificial. They are contrived purely to
illustrate the meaning of correlation while keeping the computations
extremely simple and manageable. A computational sketch for checking
answers follows question 18.)
"16. What is the correlation between tests A and B?
17. What would your interpretation of the correlation be if Tests A and B
were alternate forms of a placement test? What if they were respectively a
reading comprehension test and an oral interview?
"18. Repeat questio ns 15, 16, and 17 with the following data:
Test C Test D
George 5 7
Sarah 10 6
Mar), 15 5
(Bea r in mind that correlations would almo st never be done on such
small numbers of subjects. Answers: correla ti on between A a nd B is
+ 1.00, between C and D it is - 1.00.)
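The following sketch (Python is used purely as a convenience for the worked example; none of this code comes from the text) reproduces the computations asked for in questions 15 through 18:

    # Worked answers for discussion questions 15-18.
    a = [1, 2, 3]        # Test A
    b = [5, 10, 15]      # Test B
    c = [5, 10, 15]      # Test C
    d = [7, 6, 5]        # Test D

    def mean(xs):
        return sum(xs) / len(xs)

    def variance(xs):    # population variance
        m = mean(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    def sd(xs):          # standard deviation
        return variance(xs) ** 0.5

    def correlation(xs, ys):   # Pearson product-moment correlation
        mx, my = mean(xs), mean(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
        return cov / (sd(xs) * sd(ys))

    print(correlation(a, b))   # +1.0: A and B vary in perfect proportion
    print(correlation(c, d))   # -1.0: C and D vary inversely

Scores on Test B are exactly five times the scores on Test A, so the two sets vary in perfect proportion from their means; scores on Test D fall by one point each time Test C rises by five points, hence the perfect negative correlation.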

SUGGESTED READINGS


1. Anne Anastasi, Psychological Testing. New York: Macmillan, revised
edition, 1976. See especially Chapters 5 and 6 on test validity.
2. John B. Carroll, 'Fundamental Considerations in Testing for English
Language Proficiency of Foreign Students,' Testing, Washington, D.C.:
Center for Applied Linguistics, 1961, 31-40. Reprinted in H. B. Allen
and R. N. Campbell (eds.) Teaching English as a Second Language: A
Book of Readings. New York: McGraw Hill, 1972, 313-320.
3. L. J. Cronbach, Essentials of Psychological Testing. New York: Harper
and Row, 1970. See especially the discussion of different types of validity
in Chapter 5.
4. Robert Lado, Language Testing. New York: McGraw Hill, 1961. See
especially his discussion of discrete point test rationale, pp. 25-29 and
39-203.
5. John W. Oller, Jr., 'Language Testing,' in Ronald Wardhaugh and
H. Douglas Brown (eds.) Survey of Applied Linguistics. Ann Arbor,
Michigan: University of Michigan, 1976, 275-300.
6. Robert L. Thorndike, 'Reliability,' Proceedings of the 1963 Invitational
Conference on Testing Problems. Princeton, N.J.: Educational Testing
Service, 1964. Reprinted in Glenn H. Bracht, Kenneth D. Hopkins, and
Julian C. Stanley (eds.) Perspectives in Educational and Psychological
Measurement. Englewood Cliffs, N.J.: Prentice-Hall, 1972, 66-73.
4

Multilingual Assessment

A. Need
B. Multilingualism versus
multidialectalism
C. Factive and emotive aspects of
multilingualism
D. On test biases
E. Translating tests or items
F. Dominance and proficiency
G. Tentative suggestions

Multilingualism is a pervasive modern reality. Ever since that cursed
Tower was erected, the peoples of the world have had this problem.
In the United States alone, there are millions of people in every major
urban center whose home and neighborhood language is not one of
the majority varieties of English. Spanish, Italian, German, Chinese
and a host of other 'foreign' languages have actually become
American languages. Furthermore, Navajo, Eskimo, Zuni, Apache,
and many other native languages of this continent can hardly be
called 'foreign' languages. The implications for education are
manifold. How shall we deliver curricula to children whose language
is not English? How shall we determine what their language skills
are?

A. Need
Zirkel (1976) concludes an article entitled 'The why's and ways of
testing bilinguality before teaching bilingually,' with the following
paragraph:
The movement toward an effective and efficient means of
testing bilinguality before teaching bilingually is in progress. In
its wake is the hope that in the near future 'equality of
educational opportunity' will become more meaningful for
linguistically different pupils in our nation's elementary schools
(p. 328).
Earlier he observes, however, that 'a substantial number of bilingual
programs do not take systematic steps to determine the language
dominance of their pupils' (p. 324).
Since the 1974 Supreme Court ruling in the case of Lau versus
Nichols, interest in multilingual testing in the schools of the
United States has taken a sudden upswing. The now famous court
case involved a contest between a Chinese family in San Francisco
and the San Francisco school system. The following quotation from
the Court's Syllabus explains the nature of the case:
The failure of the San Francisco school system to provide
English language instruction to approximately 1,800 students
of Chinese ancestry who do not speak English denies them a
meaningful opportunity to participate in the public educational
program and thus violates §601 of the Civil Rights Act of 1964,
which bans discrimination based 'on the ground of race, color,
or national origin' (Lau vs. Nichols, 1974, No. 72-6520).
On page 2 of an opinion by Mr. Justice Stewart, concurred in by The
Chief Justice and Mr. Justice Blackmun, it is suggested that 'no
specific remedy is urged upon us. Teaching English to the students of
Chinese ancestry who do not speak the language is one choice. Giving
instruction to this group in Chinese is another' (1974, No. 72-6520).
Further, the Court argued:
Basic English skills are at the very core of what these public
schools teach. Imposition of a requirement that, before a child
can effectively participate in the educational program, he must
already have acquired those basic skills is to make a mockery of
public education. We know that those who do not understand
English are certain to find their classroom experiences wholly
incomprehensible and in no way meaningful (1974, No.
72-6520, p. 3).
As a result of the interpretation rendered by the Court, the U.S.
Office of Civil Rights convened a Task Force which recommended
certain so-called 'Lau Remedies'. Among other things, the main
document put together by the Task Force requires language
assessment procedures to determine certain facts about language use,
and it requires the rating of bilingual proficiency on a rough five point
scale (1. monolingual in a language other than English; 2. more
proficient in another language than in English; 3. balanced bilingual
in English and another language; 4. more proficient in English than
in another language; 5. monolingual in English).
Multilingual testing seems to have come to stay for a while in U.S.
schools, but as Zirkel and others have noted, it has come very late. It
is late in the sense of antiquated and inhumane educational programs
that placed children of language backgrounds other than English in
classes for the 'mentally retarded' (Diana versus California State
Education Department, 1970, No. C-7037), and it is late in terms of
bilingual education programs that were started in the 1960s and even
in the early 1970s on a hope and a promise but without adequate
assessment of pupil needs and capabilities (cf. John and Horner,
1971, and Shore, 1974, cited by Zirkel, 1976). In fact, as recently as
1976, Zirkel observes, 'a fatal flaw in many bilingual programs lies in
the linguistic identification of pupils at the critical point of the
planning and placement process' (p. 324).
Moreo ver, as Teitelbaum (1976), Zirkel (1976), and others have
often noted, typical method s of assessment such as s urname sur veys
(to identify Spanish speakers, fo r instance) or merel y as king a bo ut
lan guage preferences (e.g. , teacher or student questi onnaires) are
largely inadequate. The one wh o is most likely to be victimized by
such inadequate melhods is the child in the sch ool. One second grade r
indicated that he spo ke ' English- only' o n a rating sheet, but when
'casuall y asked la ter whelhe .. his pa rents spo ke Spanish, the child
respo nded without hesitation: SI, ellos hablan espailo l - pero yo, no.'
Perh aps someone had convinced hi m that he wa s not supposed to
speak Spanish?
Surely, for the sake of the child, it is necessary to obtain reliable
and valid information about what language(s) he speaks and
understands (and how well) before decisions are reached about
curriculum delivery and the language policy of the classroom. 'But,'
some sincere administrator may say, 'we simply can't afford to do
systematic testing on a wide scale. We don't have the people, the
budget, or the time.' The answer to such an objection must be
indignant if it is equally sincere and genuine. Can the schools afford
to do this year what they did last year? Can they afford to continue to
deliver instruction in a language that a substantial number of the
children cannot understand? Can they afford to do other wide scale
standardized testing programs whose results may be less valid?

It may be true that the educators cannot wait until all the research
results are in, but it is equally true that we cannot afford to play
political games of holding out for more budget to make changes in
ways the present budget is being spent, especially when those changes
were needed years ago. This year's budget and next year's (if there is a
next year) will be spent. People in the schools will be busy at many
tasks, and all of the available time will get used up. Doing all of it the
way it was done last year is proof only of the disappointing fact that a
system that purports to teach doesn't necessarily learn. Indeed, it is
merely another comment on the equally discouraging fact that many
students in the schools (and universities, no doubt) must learn in spite
of the system, which becomes an adversary instead of a servant to the
needs of learners.
The problems are not unique to the United States. They are worldwide
problems. Hofman (1974), in reference to the schools of
Rhodesia, says, 'It is important to get some idea, one that should have
been reached ten years ago, of the state of English in the primary
school' (p. 10). In the case of Rhodesia, and the argument can easily
be extended to many of the world's nations, Hofman questions the
blind and uninformed language policy imposed on teachers and
children in the schools. In the case of Rhodesia, at least until the time
his report was written, English was the required school language from
1st grade onward. Such policies have recently been challenged in
many parts of the world (not just in the case of a super-imposed
English), and reports of serious studies examining important variables
are beginning to appear (see for instance Bezanson and Hawkes,
1976, and Streiff, 1977). This is not to say that there may not be much
to be gained by thorough knowledge of one of the languages of the
world currently enjoying much power and prestige (such as English is
at the present moment), but there are many questions concerning the
price that must be paid for such knowledge. Such questions can
scarcely be posed without serious multilingual testing on a much
wider scale than has been common up till now.

B. Multilingualism versus multidialectalism


The problems of language testing in bilingual or multilingual
contexts seem to be compounded to a new order of magnitude each
time a new language is added to the system. In fact, it would seem that
the problems are more than just doubled by the presence of more than
one language, because there must be social and psychological
interactions between different language communities producing
complexities not present in monolingual communities. However,
there are some parallels between so-called monolingual communities
and multilingual societies. The former display a rich diversity of
language varieties in much the way that the latter exhibit a variety of
languages. To the extent that differences in language varieties parallel
differences in languages, there may be less contrast between the two
sorts of settings than is commonly thought. In both cases there is the
need to assess performance in relation to a plurality of group norms.
In both cases there is the difficulty of determining what group norms
are appropriate at different times and places within a given social
order.
It has sometimes been argued that children in the schools should be
compared only against themselves and never against group norms,
but this argument implicitly denies the nature of normal human
communication. Evaluating the language ability of an individual by
comparing him only against himself is a little like clapping with one
hand. Something is missing. It only makes sense to say that a person
knows a language in relation to the way that other persons who also
know that language perform when they use it. Becoming a speaker of
a particular language is a distinctively socializing process. It is a
process of identifying with and to some degree functioning as a
member of a social group.
In multilingual societies, where many mutually unintelligible
languages are common fare in the market places and urban centers,
the need for language proficiency testing as a basis for informing
educational policy is perhaps more obvious than in so-called
monolingual societies. However, the case of monolingual societies,
which are typically multidialectal, is deceptive. Although different
varieties of a language may be mutually intelligible in many
situations, in others they are not. At least since 1969, it has been
known that school children who speak different varieties of English
perform about equally badly in tests that require the repetition of
sentences in the other group's variety (Baratz, 1969). Unfortunately
for children who speak a non-majority variety of English, all of the
other testing in the schools is done in the majority variety (sometimes
referred to as 'standard English'). An important question currently
being researched is the extent to which educational tests in general
may contain a built-in language variety bias, and related to it is the
more general question concerning how much of the variance in
educational tests in general can be explained by variance in language
proficiency tests (see Stump, 1978, Gunnarsson, 1978, and
Pharis and Perkins, in press; also see the Appendix).


The parallel between multilingualism and multidialectalism is still more fundamental. In fact, there is a serious question of principle concerning whether it is possible to distinguish languages and dialects. Part of the trouble lies in the fact that for any given language (however we define it), there is no sense at all in trying to distinguish it from its dialects or varieties. The language is its varieties. The only sense in which a particular variety of a language may be elevated to a status above other varieties is in the manner of Orwell's pigs - by being a little more equal or in this case, a little more English or French or Chinese or Navajo or Spanish or whatever. One of the important rungs on the status ladder for a language variety (and for a language in the general sense) is whether or not it is written and whether or not it can lay claim to a long literary tradition. Other factors are who happens to be holding the reins of political power (obviously the language variety they speak is in a privileged position), and who has the money and the goods that others must buy with their money. The status of a particular variety of English is subject to many of the same influences that the status of English (in the broader sense) is controlled by.
The question, where does language X (or variety X) leave off and language Y (or variety Y) begin is a little like the question, where does the river stop and the lake begin. The precision of the answer, or lack of it, is not so much a product of clarity or unclarity of thought as it is a product of the nature of the objects spoken of. New terms will not make the boundaries between languages, or dialects, or between languages and language varieties any clearer. Indeed, it can be argued that the distinction between languages as disparate as Mandarin and English (or Navajo and Spanish) is merely a matter of degree. For languages that are more closely related, such as German and Swedish, or Portuguese and Spanish, or Navajo and Apache, it is fairly obvious that their differences are a matter of degree. However, in relation to abstract grammatical systems that may be shared by all human beings as part of their genetic programming, it may be the case that all languages share much of the same universal grammatical system (Chomsky, 1965, 1972).
Typically, varieties of a language that are spoken by minorities are termed 'dialects' in what sometimes becomes an unintentional (or possibly intentional) pejorative sense. For example, Ferguson and Gumperz (1971) suggest that a dialect is a 'potential language' (p. 34). This remark represents the tendency to institutionalize what may be
appropriately termed the 'more equal syndrome'. No one would ever suggest that the language that a middle class white speaks is a potential language - of course not, it's a real language. But the language spoken by inner city blacks - that's another matter. A similar tendency is apparent in the remark by the Chief Justice in the Lau case where he refers to the population of Chinese speaking children in San Francisco (the 1,800 who were not being taught English) as 'linguistically deprived' children (Lau versus Nichols, 1974, No. 72-6520, p. 3). Such remarks may reflect a modicum of truth, but deep within they seem to arise from ethnocentric prejudices that define the way I do it (or the way we do it) as intrinsically better than the way anyone else does it. It is not difficult to extend such intimations to 'deficit theories' of social difference like those advocated by Bernstein (1960), Bereiter and Engleman (1967), Herrnstein (1971), and others.
Among the important questions that remain unanswered and that are crucial to the differentiation of multilingual and monolingual societies are the following: to what extent does normal educational testing contain a language variety bias? And further, to what extent is that bias lesser or greater than the language bias in educational testing for children who come from a non-English speaking background? Are the two kinds of bias really different in type or merely in degree?

C. Factive and emotive aspects of multilingualism


Difficulties in communication between social groups of different language backgrounds (dialects or language varieties included) are apt to arise in two ways: first, there may be a failure to communicate on the factive level; or second, there may be a failure to communicate on the emotive level as well as the factive level. If a child comes to a school from a cultural and linguistic background that is substantially different from the background of the majority of teachers and students in the school, he brings to the communication contexts of the school many sorts of expectations that will be inappropriate to many aspects of the exchanges that he might be expected to initiate or participate in. Similarly, the representatives of the majority culture and possibly other minority backgrounds will bring other sets of expectations to the communicative exchanges that must take place. In such linguistically plural contexts, the likelihood of misinterpretations and breakdowns in communication is increased. On the
factive level, the actual forms of the language(s) may present some difficulties. The surface forms of messages may sometimes be uninterpretable, or they may sometimes be misinterpreted. Such problems may make it difficult for the child, teacher, and others in the system to communicate the factive-level information that is usually the focus of classroom activities - namely, transmitting the subject matter content of the curriculum. Therefore, such factive level communication problems may account for some portion of the variance in the school performance of children from ethnically and culturally different backgrounds, i.e., their generally lower scores on educational tests. As Baratz (1969) has shown, however, it is important to keep in mind the fact that if the tables were turned, if the majority were suddenly the minority, their scores on educational tests might be expected to plummet to the same levels as are typical of minorities in today's U.S. schools. Nevertheless, there is another important cluster of factors that probably affect variance in learning far more drastically than problems of factive level communication.
There is considerable evidence to suggest that the more elusive emotive or attitudinal level of communication may be a more important variable than the surface form of messages concerning subject matter. This emotive aspect of communication in the schools directly relates to the self-concept that a given child is developing, and it also relates to group loyalties and ethnic identity. Though such factors are difficult to measure (as we shall see in Chapter 5), it seems reasonable to hypothesize that they may account for more of the variance in learning in the schools than can be accounted for by the selection of a particular teaching methodology for instilling certain subject matter (factive level communication).
As the research in the Canadian experiments has shown, if the socio-cultural (emotive level) factors are not in a turmoil and if the child is receiving adequate encouragement and support at home, etc., the child can apparently learn a whole new way of coding information factively (a new linguistic system) rather incidentally and can acquire the subject matter and skills taught in the schools without great difficulty (Lambert and Tucker, 1972, Tucker and Lambert, 1973, Swain, 1976a, 1976b).
However, the very different experience of children in schools, say, in the Southwestern United States where many ethnic minorities do not experience such success requires a different interpretation. Perhaps the emotive messages that the child is bombarded with in the Southwest help explain the failure of the schools. Pertinent questions
are: how does the child see his culture portrayed in the curriculum? How does the child see himself in relation to the other children who may be defined as successful by the system? How does the child's home experience match up with the experience in the school?
It is hypothesized that variance in rate of learning is probably more sensitive to the emotive level messages communicated in facial expressions, tones of voice, deferential treatment of some children in a classroom, biased representation of experiences in the school curriculum, and so on, than to differences in factive level methods of presenting subject matter. This may be more true for minority children than it is for children who are part of the majority. A similar view has been suggested by Labov (1972) and by Goodman and Buck (1973).
The hypothesis and its consequences can be visualized as shown in Figure 2 where the area enclosed by the larger circle represents the total amount of variance in learning to be accounted for (obviously the Figure is a metaphor, not an explanation or model). The area enclosed by the smaller concentric circle represents the hypothetical amount of variance that might be explained by emotive message factors. Among these are messages that the child perceives concerning his own worth, the value of his people and culture, the viability of his language as a means of communication, and the validity of his life experience. The area in the outer ring represents the hypothetical portion of variance in learning that may be accounted for by appeal to factive level aspects of communication in the schools, such as methods of teaching, subject matter taught, language of presentation of the material, IQ, initial achievement levels, etc.
Of all the ways struggles for ethnic identity manifest themselves, and of all the messages that can be communicated between different social groups in their mutual struggles to identify and define themselves, William James (as cited by Watzlawick, et al, 1967, p. 86) suggested that the most cruel possible message one human being can communicate to another (or one group to another) is simply to pretend that the other individual (or group) does not exist. Examples are too common for comfort in the history of education. Consider the statement that Columbus discovered America in 1492 (Banks, 1972). Then ask, who doesn't count? (Clue: Who was already here before Columbus ever got the wind in his sails?) James said, 'No more fiendish punishment could be devised ... than that one should be turned loose in a society and remain absolutely unnoticed by all the members thereof' (as cited by Watzlawick, et al, 1967, p. 86).
[Figure 2 shows two concentric circles. The outer ring is labeled 'VARIANCE which can be attributed to METHODS'; the inner circle carries labels such as 'Representation of the worth of the child's own people', 'Validity of the child's own experience', 'Portrayal of the viability of the child's own language', and 'Portrayal of the value of the child's own culture'.]

Figure 2. A hypothetical view of the amount of variance in learning to be accounted for by emotive versus factive sorts of information (methods of conveying subject matter are represented as explaining variance in the outer ring, while the bulk is explained by emotive factors).

The interpretation of low scores on tests, therefore, needs to take account of possible emotive conflicts. While a high score on a language test or any other educational test probably can be confidently interpreted as indicating a high degree of skill in communicating factive information as well as a good deal of harmony between the child and the school situation on the emotive level, a low score cannot be interpreted so easily. In this respect, low scores on tests are somewhat like low correlations between tests (see the discussion in Chapter 3, section E, and again in Chapter 7, section B): they leave a greater number of options open. A low score may occur because the test was not reliable or valid, or because it was not suited to the child in difficulty level, or because it created emotive reactions that interfered with the cognitive task, or possibly because
the child is really weak in the skill tested. The last interpretation,
however, should be used with caution and only after the other
reasonable alternatives have been ruled out by careful study. It is
important to keep in mind the fact that an emotive-level conflict is
more likely to call for changes in the educational system and the way
that it affects children than for changes in the children.
In some cases it may be that simply providing the child with ample opportunity to learn the language or language variety of the educational system is the best solution; in others, it may be necessary to offer instruction in the child's native language, or in the majority language and the minority language; and no doubt other untried possibilities exist. If the emotive factors are in harmony between the school and the child's experience, there is some reason to believe that mere exposure to the unfamiliar language may generate the desired progress (Swain, 1976a, 1976b).
In the Canadian experiments, English speaking children who are taught in French for the first several years of their school experience learn the subject matter about as well as monolingual French speaking children, and they also incidentally acquire French. The term 'submersion' has recently been offered by Swain (1976b) to characterize the experience of minority children in the Southwestern United States who do not fare so well. The children are probably not all that different, but the social contexts of the two situations are replete with blatant contrasts (Fishman, 1976).

D. On test biases
A great deal has been written recently concerning cultural bias in tests (Briere, 1972, Condon, 1973). No doubt much of what is being said is true. However, some well-meaning groups have gone so far as to suggest a 'moratorium on all testing of minority children.' Their argument goes something like this. Suppose a child has learned a language that is very different from the language used by the representatives of the majority language and culture in the schools. When the child goes to school, he is systematically discriminated against (whether intentionally or not is irrelevant to the argument). All of the achievement tests, all of the classes, all of the informal teacher and peer evaluations that influence the degree of success or failure of the child are in a language (or language variety) that he has not yet learned. The entire situation is culturally biased against the child. He is regularly treated in a prejudicial way by the school system
as a whole. So, some urge that we should aim to get the cultural bias out of the school situation as much as possible and especially out of the tests. A smaller group urges that all testing should be stopped indefinitely pending investigation of other educational alternatives. The arguments supporting such proposals are persuasive, but the suggested solutions do not solve the problems they so graphically point out. Consider the possibility of getting the cultural bias out of language proficiency tests. Granted that language tests, though they may vary in the pungency of their cultural flavor, all have cultural bias built into them. They have cultural bias because they present sequences of linguistic elements of a certain language (or language variety) in specific contexts. In fact, it is the purpose of such tests to discriminate between various levels of skill, often, to discriminate between native and non-native performance. A language test is therefore intentionally biased against those who do not speak the language or who do so at different levels of skill.
Hence, getting the bias out of language tests, if pushed to the logical limits, is to get language tests to stop functioning as language tests. On the surface, preventing the cultural bias and the discrimination between groups that such tests provide might seem like a good idea, but in the long run it will create more problems than it can solve. For one, it will do harm to the children in the schools who most need help in coping with the majority language system by pretending that crucial communication problems do not exist. At the same time it would also preclude the majority culture representatives in schools from being exposed to the challenging opportunities of trying to cope in settings that use a minority language system. If this latter event is to occur, it will be necessary to evaluate developing learner proficiencies (of the majority types) in terms of the norms that exist for the minority language(s) and culture(s).
The more extreme alternative of halting testing in the schools is no real solution either. What is needed is more testing that is based on carefully constructed tests and with particular questions in mind followed by deliberate and careful analysis. Part of the difficulty is the lack of adequate data - not an overabundance of it. For instance, until recently (Oller and Perkins, 1978) there was no data on the relative importance of language variety bias, or just plain language bias in educational testing in general. There was always plenty of evidence that such a factor must be important to a vast array of educational tests, but how important? Opinions to the effect that it is not very important, or that it is of great importance, merely accent the
need for empirical research on the question. It is not a question that can be decided by vote - not even at the time honored 'grass roots level' - but that is where the studies need to be done.
There is another important way that some of the facts concerning test biases and their probable effects on the school performance of certain groups of children may have been over-zealously interpreted. These latter interpretations relate to an extension of a version of the strong contrastive analysis hypothesis familiar to applied linguists. The reasoning is not unappealing. Since children who do poorly on tests in school, and on tasks such as learning to read, write, and do arithmetic, are typically (or at least often) children who do not use the majority variety of English at home, their failure may be attributed to differences in the language (or variety of English) that they speak at home and the language that is used in the schools. Goodman (1965) offered such an explanation for the lower performance of inner city black children on reading tests. Reed (1973) seemed to advocate the same view. They suggested that structural contrasts in sequences of linguistic elements common to the speech of such children accounted for their lower reading scores. Similar arguments have been popular for years as an explanation of the 'difficulties' of teaching or learning a language other than the native language of the learner group. There is much controverting evidence, however, for either application of the contrastive analysis hypothesis (also see Chapter 6 below).
For instance, contrary to the prediction that would follow from the contrastive analysis hypothesis, in two recent studies, black children understood majority English about as well as white children, but the white children had greater difficulty with minority black English (Norton and Hodgson, 1973, and Stevens, Ruder, and Tew, 1973). While Baratz (1969) showed that white children tend to transform sentences presented in black English into their white English counterparts, and similarly, black children render sentences in white English into their own language variety, it would appear from her remarks that at least the black children had little difficulty understanding white English. This evidence does not conclusively eliminate the position once advocated by Goodman and Reed, but it does suggest the possibility of looking elsewhere for an explanation of reading problems and other difficulties of minority children in the schools. For example, is it not possible that sociocultural factors that are of an essentially non-linguistic sort might play an equal if not greater part in explaining school performance? One might ask whether black children in the communities where differences have
been observed are subject to the kinds of reinforcement and punishment contingencies that are present in the experience of comparable groups of children in the majority culture. Do they see their parents reading books at home? Are they encouraged to read by parents and older siblings? These are tentative attempts at phrasing the right questions, but they hint at certain lines of research.
As to the contrastive explanation of the difficulties of language learners in acquiring a new linguistic system, a question should suffice. Why should Canadian French be so much easier for some middle class children in Montreal to acquire than English is for many minority children in the Southwest? The answer probably does not lie in the contrasts between the language systems. Indeed, as the data continues to accumulate, it would appear that many of the children in bilingual programs in the United States (perhaps most of the children in most of the programs) are dominant in English when they come to school. The contrastive explanation is clearly inapplicable to those cases. For a review of the literature on second language studies and the contrastive analysis approaches, see Oller (1979). For a systematic study based on a Spanish-English bilingual program in Albuquerque, New Mexico, see Teitelbaum (1976).
If we reject the contrastive explanation, what then? Again it seems we are led to emotive aspects of the school situation in relation to the child's experience outside of the school. If the child's cultural background is neglected by the curriculum, if his people are not represented or are portrayed in an unfavorable light or are just simply misrepresented (e.g., the Plains Indian pictured in a canoe in a widely used elementary text, Joe Sando, personal communication), if his language is treated as unsuitable for educational pursuits (possibly referred to as the 'home language' but not the 'school language'), probably just about any teaching method will run into major difficulties.
It is in the area of cultural values and ways of expressing emotive level information in general (e.g., ethnic identity, approval, disapproval, etc.) where social groups may contrast more markedly and in ways that are apt to create significant barriers to communication between groups. The barriers are not so much in the structural systems of the languages (nor yet in the educational tests) as they are in the belief systems and ways of living of different cultures. Such differences may create difficulties for the acceptance of culturally distinct groups and the people who represent them. The failure of the minority child in school (or the failure of the school) is
more likely to be caused by a conflict between cultures and the personalities they sustain rather than a lack of cognitive skills or abilities (see Bloom, 1976).
In any event, none of the facts about test bias lessens the need for sound language proficiency testing. Those facts merely accent the demands on educators and others who are attempting to devise tests and interpret test results. And, alas, as Upshur (1969b) noted, test is still a four letter word.

E. Translating tests or items


Although it is possible to translate tests with little apparent loss of information, and without drastically altering the task set the examinees under some conditions, the translation of items for standardized multiple choice tests is beset by fundamental problems of principle. First we will consider the problems of translating tests item by item from one multiple choice format into another, and then we will return to consider the more general problem of translating pragmatic tasks from one language to another. It may seem surprising at the outset to note that the former translation procedure is probably not feasible while the latter can be accomplished without great difficulty.
A doctoral dissertation completed in 1974 at the University of New Mexico investigated the feasibility of translating the Boehm Test of Basic Concepts from English into Navajo (Scoon, 1974). The test attempts to measure the ability of school children in the early grades to handle such notions as sequence in time (before versus after) and location in space (beside, in front of, behind, under, and so on). It was reasoned by the original test writer that children need to be able to understand such concepts in order to follow everyday classroom instructions, and to carry out simple educational tasks. Scoon hoped to be able to get data from the translated test which would help to define instructional strategies to aid the Navajo child in the acquisition and use of such concepts.
Even though skilled bilinguals in English and Navajo helped with the translation task, and though allowances were made for unsuccessful translations, dialect variations, and the like, the tendency for the translated items to produce results similar to the original items was surprisingly weak. It is questionable whether the two tests can be said to be similar in what they require of examinees. Some of the items that were among the easiest ones on the English test
turned out to be very difficult on the Navajo version, and vice versa. The researcher began the project hoping to be able to diagnose learning difficulties of Navajo children in their own language. The study evolved to an investigation of the feasibility of translating a standardized test in a multiple choice format from English into Navajo. Scoon concluded that translating standardized tests is probably not a feasible approach to the diagnosis of educational aptitudes and skills.
All of this would lead us to wonder about the wisdom of translating standardized tests of 'intelligence' or achievement. Nevertheless, such translations exist. There are several reasons why translating a test, item by item, is apt to produce a very different test than the one the translators began with.
Translation of factive-level information is of course possible. However, much more is required. Translation of a multiple choice test item requires not only the maintenance of the factive information in the stem (or lead-in part) of the item, but the maintenance of it in roughly the same relation to the paradigm of linguistic and extralinguistic contexts that it calls to mind. Moreover, the relationships between the distractors and the correct answer must remain approximately the same in terms of the linguistic and extralinguistic contexts that they call to mind and in terms of the relationships (similarities and differences) between all of those contexts. While it may sometimes be difficult to maintain the factive content of one linguistic form when translating it into another language, this may be possible. However, to maintain the paradigm of interrelationships between linguistic and extralinguistic contexts in a set of distractors is probably not just difficult - it may well be impossible.
Translating delicately composed test items (on some of the delicacies, see Chapter 9) is something like trying to translate a joke, a pun, a riddle, or a poem. As Robert Frost once remarked, 'when a poem is translated, the poetry is often lost' (Kolers, 1968, p. 4). With test items (a lesser art form) it is the meaning and the relationship between alternative choices which is apt to be lost. A translation of a joke or poem often has to undergo such changes that if it were literally translated back into the source language it would not be recognizable. With test items it is the meaning of the items in terms of their effects on examinees that is apt to be changed, possibly beyond recognition.
A very common statement about a very ordinary fact in English may be an extremely uncommon statement about a very extraordinary fact in Navajo. A way of speaking in English may be incomprehensible in Navajo; for instance, the fact that you cut down a tree before you cut it up, which is very different from cutting up in a class or cutting down your teacher. Conversely, a commonplace saying in Navajo may be enigmatic if literally translated into English.
Successful translation of items requires maintaining roughly the same style level, the same frequency of usage of vocabulary and idiom, comparable phrasing and reference complexities, and the same relationships among alternative choices. In some cases this simply cannot be done. Just as a pun cannot be directly translated from one language into another precisely because of the peculiarities of the particular expectancy grammar that makes the pun possible, a test item cannot always be translated so as to achieve equal effect in the target language. This is due quite simply to the fact that the real grammars of natural languages are devices that relate to paradigms of extralinguistic contexts in necessarily unique ways.
The bare tip of the iceberg can be illustrated by data from word association experiments conducted by Paul Kolers and reported in Scientific American in March 1968. He was not concerned with the word associations suggested by test items, but his data illustrate the nature of the problem we are considering. The method consists of presenting a word to a subject such as mesa (or table) and asking him to say whatever other word comes to mind, such as silla (or chair). Kolers was interested in determining whether pairs of associated words were similar in both of the languages of a bilingual subject. In fact, he found that they were the same in only about one-fifth of the cases. Actually he complicated the task by asking the subject to respond in the same language as the stimulus word on two tests (one in each of the subject's two languages), and in the opposite language in two other tests (e.g., once in English to Spanish stimulus words, and once in Spanish to English stimulus words). The first two tests can be referred to as intralingual and the second pair as interlingual. The chart given below illustrates typical responses in Spanish and English under all four conditions. It shows that while the response in English was apt to be girl to the stimulus boy, in Spanish the word muchacho generated the response hombre. As is shown in the chart, the interlingual associations tended to be the same in about one out of five cases.
INTRALINGUAL                      INTERLINGUAL

ENGLISH-ENGLISH                   ENGLISH-SPANISH
table - dish                      table - silla
boy - girl                        boy - niña
king - queen                      king - reina
house - window

SPANISH-SPANISH                   SPANISH-ENGLISH
mesa - silla                      mesa - chair
muchacho - hombre                 muchacho - trousers
rey - reina                       rey - queen

TYPICAL RESPONSES in a word-association test were given by a subject whose native language was Spanish. He was asked to respond in Spanish to Spanish stimulus words, in English to the same words in English, and in each language to stimulus words in the other.

In view of such facts, it is apparent that it would be very difficult indeed to obtain similar associations between sets of alternative choices on multiple choice items. Scoon's results showed that the attempt at translating the Boehm test into Navajo did not produce a comparable test. This, of course, does not prove that a comparable test could not be devised, but it does suggest strongly that other methods for test development should be employed. For instance, it would be possible to devise a concept test in Navajo by writing items in Navajo right from the start instead of trying to translate items from an existing English test.
Because of the diversity of grammatical systems that different languages employ, it would be pure luck if a translated test item should produce highly similar effects on a population of speakers who have internalized a very different grammatical system for relating language sequences to extralinguistic contexts. We should predict in advance that a large number of such translation attempts would produce markedly different effects in the target language.
Is translation therefore never feasible? Quite the contrary. Although it is difficult to translate puns, jokes, or isolated test items in a multiple choice format, it is not terribly difficult to translate a novel, or even a relatively short portion of prose or discourse. Despite the idiosyncrasies of language systems with respect to the subtle and delicate interrelationships of their elements that make poetry and multiple choice test items possible, there are certain robust features that all languages seem to share. All of them have ways of coding factive (or cognitive) information and all of them have ways of
expressing emotive (or affective) messages, and in doing these things all languages are highly organized. They are both redundant and creative (Spolsky, 1968). According to recent research by John McLeod (1975), many languages are probably about equally redundant.
These latter facts about similarities of linguistic systems suggest the possibility of applying roughly comparable pragmatic testing procedures across languages with equivalent effects. For instance, there is empirical evidence in five languages that translations and the original texts from which they were translated can be converted into cloze tests of roughly comparable difficulty by following the usual procedures (McLeod, 1975, on procedures see also Chapter 12 of this book). The languages that McLeod investigated included Czech, Polish, French, German, and English. However, using different methods, Oller, Bowen, Dien, and Mason (1972) showed that it is probably possible to create cloze tests of roughly comparable difficulty (assuming the educational status, age, and socioeconomic factors are controlled) even when the languages are as different as English, Thai, and Vietnamese.
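Since 'the usual procedures' for cloze construction amount to mechanically deleting every nth word of a passage, the conversion step itself is easy to picture. The sketch below illustrates only the general fixed-ratio technique, not McLeod's actual materials; the every-seventh-word ratio, the function name, and the sample passage are assumptions made for the example.

    def make_cloze(passage, nth=7, blank="______"):
        # Convert a passage into a fixed-ratio cloze test by replacing
        # every nth word with a blank; return the mutilated text and
        # the answer key. Applying the same deletion ratio to a source
        # text and to a careful translation of it is what yields the
        # roughly comparable tests in two languages discussed above.
        words = passage.split()
        key = []
        for i in range(nth - 1, len(words), nth):
            key.append(words[i])
            words[i] = blank
        return " ".join(words), key

    # Hypothetical usage: the same call would be made on the original
    # passage and on its translation equivalent in the other language.
    cloze_text, answers = make_cloze(
        "Once upon a time there was a miller who had three sons, "
        "and when the old man died the mill went to the eldest."
    )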
In a different application of cloze procedure, Klare, Sinaiko, and Stolurow (1972) recommend 'back translation' by a translator who has not seen the original passage. McLeod (1975) used this procedure to check the faithfulness of the translations used in his study and referred to it as 'blind back translation'. Whereas in the item-by-item translation that is necessary for multiple choice tests of the typical standardized type there will be a systematic one-for-one correspondence between the original test items and the translation version, this is not possible in the construction of cloze tests over translations of equivalent passages. In fact it would violate the normal use of the procedure to try to obtain equivalences between individual blanks in the passages.
Other pragmatic testing procedures besides the cloze technique could perhaps also be translated between two or more languages in order to obtain roughly equivalent tests in different languages. Multiple choice tests that qualify as pragmatic tasks should be translatable in this way (for examples, see Chapter 9). Passages in English and Fante were equated by a translation procedure to test reading comprehension in a standard multiple choice format by Bezanson and Hawkes (1976).
Kolers (1968) used a brief paragraph carefully translated into French as a basis for investigating the nature of bilingualism. He also constructed two test versions in which either English or French phrases were interpolated into the text in the opposite language. The task required of subjects was to read each passage aloud. Kolers was interested in whether passages requiring switching between English and French would require more time than the monolingual passages. He determined that each switch on the average required an extra third of a second. The task, however, and the procedure for setting up the task could be adapted to the interests of bilingual or multilingual testing in other settings.
Ultimately, the arguments briefly presented here concerning the feasibility of setting up equivalent tests by translation between two or more languages are related to the controversy about discrete point and integrative or pragmatic tests. Generally speaking it should be difficult to translate discrete point items in a multiple choice (or other) format while maintaining equivalence, though it is probably quite feasible to translate pragmatic tests and thereby to obtain equivalent tests in more than one language. For any given discrete item, translation into another language will produce (in principle and of necessity) a substantially different item. For any given pragmatic testing procedure on the other hand, translation into another language (if it is done carefully) can be expected to produce a substantially similar test. In this respect, discrete items are delicate whereas pragmatic procedures are robust. Of course, the possibility of translating tests of either type merits a great deal more empirical study.

F. Dominance and proficiency


We noted above that the 'Lau Remedies' require data concerning language use and dominance. The questions of importance would seem to be, what language does the child prefer to use (or feel most comfortable using) with whom and in what contexts of experience? And, in which of two or more languages is the child most competent? The most common way of getting information concerning language use is by either interviewing the children and eliciting information from them, or by addressing a questionnaire to the teacher(s) or some other person who has the opportunity to observe the child. Spolsky, Murphy, Holm, and Ferrel (1972) offer a questionnaire in Spanish and one in Navajo (either of which can also be used in English) to 'classify' students roughly according to the same basic categories that are recommended in the 'Lau Remedies' (see Spolsky, et al, p. 79).
Since the 'Remedies' came more than two years later than the Spolsky, et al article, it may be safe to assume that the scale recommended in the 'Remedies' derives from that source. The teacher questionnaire involves a similar five point scale (see Spolsky, et al, p. 81).
Two crucial questions arise. Can children's responses to questions concerning their language-use patterns be relied upon for the important educational decisions that must be made? And, second, can teachers judge the language ability of children in their classrooms (bear in mind that in many cases the teachers are not bilingual themselves; in fact, Spolsky, 1974, estimates that only about 5% of the teachers on the Navajo reservation and in BIA schools speak Navajo)? Related to these crucial questions is the empirical problem of devising careful testing procedures to assess the validity of self-reported data (by the child), and teacher-reported judgements.
Spolsky, et al made several assumptions concerning the interview technique which they used for assessing dominance in Spanish and English:
1. ... bilingual dominance varies from domain to domain. Subscores were therefore given for the domains of home, neighborhood, and school.
2. A child's report of his own language use is likely to be quite accurate.
3. Vocabulary fluency (word-naming) is a good measure of knowledge of a language and it is a good method of comparing knowledge of two languages.
4. The natural bias of the schools in Albuquerque as a testing situation favors the use of English; this needs to be counteracted by using a Spanish speaking interviewer (p. 78).
If such assumptions can be made safely, it ought to be possible to make similar assumptions in other contexts and with little modification extend the 'three functional tests of oral proficiency' recommended by Spolsky, et al. Yet they themselves say, somewhat ambiguously:
Each of the tests described was developed for a specific purpose and it would be unwise to use it more widely without careful consideration, but the general principles involved may prove useful to others who need tests that can serve similar purposes (p. 77).
One is inclined to ask what kind of consideration? Obviously a local
meeting of the School Board or some other organization will not suffice to justify the assumptions listed above, or to guarantee the success of the testing procedures, with or without adaptations to fit a particular local situation (Spolsky, 1974). What is required first is some careful logical investigation of possible outcomes from the procedures recommended by Spolsky, et al, and other procedures which can be devised for the purpose of cross-validation. Second, empirical study is required as illustrated, for example, in the Appendix below.
Zirkel (1974) points out that it is not enough merely to place a child on a dominance scale. Simple logic will explain why. It is possible for two children to be balanced bilinguals in terms of such a scale but to differ radically in terms of their developmental levels. An extreme case would be children at different ages. A more pertinent case would be two children of the same age and grade level who are balanced bilinguals (thus both scoring at the midpoint of the dominance scale, see p. 76 above), but who are radically different in language skill in both languages. One child might be performing at an advanced level in both languages while the other child is performing at a much lower level in both languages. Measuring for dominance only would not reveal such a difference.
No experimentation is required to show the inadequacy of any procedure that merely assesses dominance - even if it does the job accurately, and it is doubtful whether some of the procedures being recommended can do even the job of dominance assessment accurately. Besides, there are important considerations in addition to mere language dominance which can enter the picture only when valid proficiency data are available for both languages (or each of the several languages in a multilingual setting). Moreover, with care to insure test equivalence across the languages assessed, dominance scores can be derived directly from proficiency data - the reverse is not necessarily possible.
Hence, the question concerning how to acquire reliable information concerning language proficiency in multilingual contexts, including the important matter of determining language dominance, is essentially the same question we have been dealing with throughout this book. In order to determine language dominance accurately, it is necessary to impose the additional requirement of equating tests across languages. Preliminary results of McLeod (1975), Klare, Sinaiko, and Stolurow (1972), Oller, Bowen, Dien, and Mason (1972), and Bezanson and Hawkes (1976) suggest that careful
translation may offer a solution to the equivalence problem, and no doubt there are other approaches that will prove equally effective.
There are pitfalls to be avoided, however. There is no doubt that it is possible to devise tests that do not accomplish what they were designed to accomplish - that are not valid. Assumptions of validity are justifiable only to the extent that assumptions of lack of validity have been disposed of by careful research. On that theme let us reconsider the four assumptions quoted earlier in this section. What is the evidence that bilingual dominance 'varies from domain to domain'?
In 1969, Cooper reported that a word-naming task (the same sort of task used in the Spanish-English test of Spolsky, et al, 1972) which varied by domains such as 'home' versus 'school' or 'kitchen' versus 'classroom' produced different scores depending on the domain referred to in a particular portion of the test. The task set the examinee was to name all the things he could think of in the 'kitchen', for example. Examinees completed the task for each domain (five in all in Cooper's study) in both languages without appropriate counterbalancing to avoid an order effect. Since there were significant contrasts between relative abilities of subjects to do the task in Spanish and English across domains, it was concluded that their degree of dominance varied from one domain to another. This is a fairly broad leap of inference, however.
Consider the following question: does the fact that I can name more objects in Spanish that I see in my office than objects that I can see under the hood of my car mean that I am relatively more proficient in Spanish when sitting in my office than when looking under the hood of my car? What Cooper's results seem to show (and Teitelbaum, 1976, found similar results with a similar task) is that the number of things a person can name in reference to one physical setting may be smaller or greater than the number that the same person can name in reference to another physical setting. This is not evidence of a very direct sort about possible changes in language dominance when sitting in your living room, or when sitting in a classroom. Not even the contrast in 'word-naming' across languages is necessarily an indication of any difference whatsoever in language dominance in a broader sense. Suppose a person learned the names of chess pieces in a language other than his native language, and suppose further that he does not know the names of the pieces in his native language. Would this make him dominant in the foreign language when playing chess?
A more important question is not whether there are contrasts across domains, but whether the 'word-naming' task is a valid indication of language proficiency. Insufficient data are available. At face value such a task appears to have little relation to the sorts of things that people normally do with language, children especially. Such a task does not qualify as a pragmatic testing procedure because it does not require time-constrained sequential processing, and it is doubtful whether it requires mapping of utterances onto extralinguistic contexts in the normal ways that children might perform such mappings - naming objects is relatively simpler than even the speech of median-ranged three-and-a-half year old children (Hart, 1974, and Hart, Walker, and Gray, 1977).
Teitelbaum (1976) correlated scores on word-naming tasks (in Spanish) with teacher-ratings and self-ratings differentiated by four domains ('kitchen, yard, block, school'). For a group of kindergarten through 4th grade children in a bilingual program in Albuquerque (nearly 100 in all), the correlations ranged from .15 to .45. Correlations by domain with scores on an interview task, however, ranged from .69 to .79. These figures hardly justify the differentiation of language dominance by domain. The near equivalence of the correlations across domains with a single interview task seems to show that the domain differentiation is pointless. Cohen (1973) has adapted the word-naming task slightly to convert it into a story-telling procedure by domains. His scoring is based on the number of different words used in each story-telling domain. Perhaps other scoring techniques should also be considered.
The second assumption quoted above was that a child's own report of his language use is apt to be 'quite accurate'. This may be more true for some children than for others. For the children in Teitelbaum's study neither the teacher's ratings nor the children's own ratings were very accurate. In no case did they account for more than 20% of the variance in more objective measures of language proficiency.
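(The arithmetic connecting this figure with the correlations cited two paragraphs above is simply that the proportion of variance accounted for is the square of the correlation coefficient. Taking the largest of the rating correlations reported,

$$ r = .45 \quad\Rightarrow\quad r^{2} = .2025 \approx 20\% \text{ of the variance,} $$

which is where the 20% ceiling comes from.)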
What about the child Zirkel (1976) referred to? What if some children are systematically indoctrinated concerning what language they are supposed to use at school and at home, as some advocates of the 'home language/school language' dichotomy urge? Some research with bilingual children seems to suggest that at an early age they may be able to discriminate appropriately between occasions when one language is called for and occasions when the other language is required (e.g., answering in French when spoken to in French, but in English when spoken to in English, Kinzel, 1964),
without being able to discuss this ability at a more abstract level (e.g., reporting when you are supposed to speak French rather than English). Teitelbaum's data reveal little correlation between questions about language use and scores on more objective language proficiency measures. Is it possible that a bilingual child is smart enough to be sensitive to what he thinks the interviewer expects him to say? Upshur (1971a) observes, 'it isn't fair to ask a man to cut his own throat, and even if we should ask, it isn't reasonable to expect him to do it. We don't ask a man to rate his proficiency when an honest answer might result in his failure to achieve some desired goal' (p. 58). Is it fair to expect a child to respond independently of what he may think the interviewer wants to hear?
We have dealt earlier with the third assumption quoted above, so we come to the fourth. Suppose that we assume the interviewer should be a speaker of the minority language (rather than English) in order to counteract an English bias in the schools. There are several possibilities. Such a provision may have no effect, the desired effect (if indeed it is desired, as it may distort the picture of the child's true capabilities along the lines of the preceding paragraph), or an effect that is opposite to the desired one. The only way to determine which result is the actual one is to devise some empirical measure of the relative magnitude of a possible interviewer effect.

G. Tentative suggestions
What methods then can be recommended for multilingual testing? There are many methods that can be expected to work well and deserve to be tried - among them are suitable adaptations of the methods discussed in this book in Part Three. Some of the ones that have been used with encouraging results include oral interview procedures of a wide range of types (but designed to elicit speech from the child and data on comprehension, not necessarily the child's own estimate of how well he speaks a certain language) - see Chapter 11. Elicited imitation (a kind of oral dictation procedure) has been widely used - see Chapter 10. Versions of the cloze procedure (particularly ones that may be administered orally) are promising and have been used with good results - see Chapter 12. Variations on composition tasks and story telling or retelling have also been used - see Chapter 13. No doubt many other procedures can be devised - Chapter 14 offers some suggestions and guidelines.
In brief, what seems to be required is a class of testing procedures providing a basis for equivalent tests in different languages that will yield proficiency data in both languages and that will simultaneously provide dominance scores of an accurate and sensitive sort. Figure 3 offers a rough conceptualization of the kinds of equivalent measures needed. If Scale A in the diagram represents a range of possible scores on a test in language A, and if Scale B represents a range of possible scores on an equivalent test in language B, the relative difference in scores on A and B can provide the basis for placement on the dominance scale C (modelled after the 'Lau Remedies' or the Spolsky, et al, 1972, scales).
[Figure 3 shows three horizontal scales: Scale A, running from 0% to 100%; Scale B, running from 0% to 100%; and Scale C, marked at five points labeled A, Ab, AB, Ba, and B.]

Figure 3. A dominance scale in relation to proficiency scales. (Scales A and B represent equivalent proficiency tests in languages A and B, while scale C represents a dominance scale, as required by the Lau Remedies. It is claimed that the meaning of C can only be adequately defined in relation to scores on A and B.)

It would be desirable to calibrate both of the proficiency scales with reference to comparable groups of monolingual speakers of each language involved (Cowan and Zarmed, 1976, followed such a procedure) in order to be able to interpret scores in relation to clear criteria of performance. The dominance scale can be calibrated by defining distances on that scale in terms of units of difference in proficiency on Scales A and B.
This can be done as follows: first, subtract each subject's score on A from the score on B. (If the tests are calibrated properly, it is not likely that anyone will get a perfect score on either test though there may be some zeroes.) Then rank order the results. They should range from a series of positive values to a series of negative values. If the group tested consists only of children who are dominant in one language, there will be only positive or only negative values, but not
both. The ends of the rank will define the ends of the dominance scale (with reference to the population tested) - that is, the A and B points on Scale C in Figure 3. The center point, AB on Scale C, is simply the zero position in the rank. That is the point at which a subject's scores in both languages are equal. The points between the ends and the center, namely, Ab and Ba, can be defined by finding the midpoint in the rank between that end (A or B) and the center (AB).
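A minimal sketch of the procedure just described may help to fix the steps. Everything below is illustrative only: the function name, the reading of 'midpoint in the rank' as a rank position (rather than an arithmetic midpoint of score values), and the sample scores are assumptions, not part of the source.

    def scale_c_points(scores_a, scores_b):
        # Difference scores: subtract each subject's score on A from the
        # score on B, as described above, then rank the results from most
        # A-dominant (most negative) to most B-dominant (most positive).
        diffs = sorted(b - a for a, b in zip(scores_a, scores_b))

        # The zero position in the rank is the AB point (equal scores in
        # both languages); here we take the rank position whose difference
        # is closest to zero.
        i_ab = min(range(len(diffs)), key=lambda i: abs(diffs[i]))

        # The ends of the rank define the A and B points; Ab and Ba are
        # the midpoints in the rank between each end and the AB position.
        i_a, i_b = 0, len(diffs) - 1
        return {
            "A":  diffs[i_a],
            "Ab": diffs[(i_a + i_ab) // 2],
            "AB": diffs[i_ab],
            "Ba": diffs[(i_ab + i_b) // 2],
            "B":  diffs[i_b],
        }

    # Hypothetical scores for six children on equivalent tests A and B:
    points = scale_c_points([62, 48, 90, 30, 55, 71],
                            [40, 47, 22, 75, 56, 15])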
The degree of accuracy with which a particular subject can be classed as A = monolingual in A, Ab = dominant in A, AB = equally bilingual in A and B, Ba = dominant in B, B = monolingual in B, can be judged quite accurately in terms of the standard error of differences on Scales A and B. The standard error of the differences can be computed by finding the standard deviation of the differences (A minus B, for each subject) and then dividing it by the square root of the number of subjects tested. If the distribution of differences is approximately normal, chances are better than 99 in 100 that a subject's true degree of bilinguality will fall within the range of plus or minus three standard errors above or below his actual attained score on Scale C. If measuring off ±3 standard errors from a subject's attained score still leaves him close to, say, Ab on the Scale, we can be confident in classifying him as 'dominant in A'.
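Again a small sketch, on the assumption that the standard error intended here is the standard deviation of the difference scores divided by the square root of the number of subjects; the function names and the confidence-band helper are mine, not the source's.

    import math

    def standard_error_of_differences(scores_a, scores_b):
        # Differences (A minus B) for each subject; the sign convention
        # does not affect the standard error.
        diffs = [a - b for a, b in zip(scores_a, scores_b)]
        n = len(diffs)
        mean = sum(diffs) / n
        variance = sum((d - mean) ** 2 for d in diffs) / n
        # Standard deviation of the differences divided by sqrt(n).
        return math.sqrt(variance) / math.sqrt(n)

    def confidence_band(diff_score, se, k=3):
        # Plus or minus k standard errors around a subject's attained
        # difference score; with k = 3 and roughly normal differences,
        # chances are better than 99 in 100 that the subject's true
        # degree of bilinguality falls inside this band.
        return (diff_score - k * se, diff_score + k * se)

A subject whose entire band stays close to a single point of Scale C (say, Ab) can then be classified with confidence, as the text suggests.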
Thus, if the average standard error of differences in scores on tests A and B is large, the accuracy of Scale C will be less than if the average standard error is small. A general guideline might be to require at least six standard errors between each of the five points on the dominance scale. It remains to be seen, however, what degree of accuracy will be possible. For suggestions on equating scales across languages, see Chapter 10, pp. 289-295.¹

¹ Also see discussion question number 10 at the end of Chapter 10. Much basic research is needed on these issues.

KEY POINTS

1. There is a serious need for multilingual testing in the schools not just of the United States, but in many nations.
2. In the Lau versus Nichols case in 1974, the Supreme Court ruled that the San Francisco schools were violating a section of the Civil Rights Code which 'bans discrimination based on the ground of race, color, or national origin' (1974, No. 72-6520). It was ruled that the schools should either provide instruction in the native language of the 1,800 Chinese speaking children in question, or provide special instruction in the English language.
3. Even at the present, in academic year 1978-9, many bilingual programs and many schools which have children of multilingual backgrounds are not doing adequate language assessment.
4. There are important parallels between multilingual and multidialectal societies. In both cases there is a need for language assessment procedures referenced against group norms (a plurality of them).
5. In a strong logical sense, a language is its varieties or dialects, and the dialects or varieties are languages. A particular variety may be elevated to a higher status by virtue of the 'more equal syndrome', but this does not necessitate that other varieties must therefore be regarded as less than languages - mere 'potential languages'.
6. Prejudices can be institutionalized in theories of 'deficits' or 'deprivations'. The pertinent question is, from whose point of view? The institutionalization of such theories into discriminatory educational practices may well create real deficits.
7. It is hypothesized that, at least for the minority child, and perhaps for the majority child as well, variance in learning in the schools may be much more a function of the emotive aspects of interactions within and outside of schools than it is a function of methods of teaching and presentation of subject matter per se.
8. When emotive level struggles arise, factive level communication usually stops altogether.
9. Ignoring the existence of a child or social group is a cruel punishment. Who discovered America?
10. Getting the cultural bias out of language tests would mean making them into something besides language tests. However, adapting them to particular cultural needs is another matter.
11. Contrastive analysis based explanations for the generally lower scores of minority children on educational tests run into major empirical difficulties. Other factors appear to be much more important than the surface forms of different languages or language varieties.
12. Translating discrete point test items is roughly comparable to translating jokes, or puns, or poems. It is difficult if not impossible.
13. Translating pragmatic tests or testing procedures on the other hand is more like translating prose or translating a novel. It can be done, not easily perhaps, but at least it is feasible.
14. 'Blind back translation' is one procedure for checking the accuracy of translation attempts.
15. Measures of multilingual proficiencies require valid proficiency tests. The validity of proposed procedures is an empirical question. Assumptions must be tested, or they remain a threat to every educational decision based on them.
16. Measuring dominance is not enough. To interpret the meaning of a score on a dominance scale, it is useful to know the proficiency scores from which it derives.
17. It is suggested that a dominance scale of the sort recommended in the Lau Remedies can be calibrated in terms of the average standard error of the differences in test scores against which the scale is referenced.
18. Similarly, it is recommended that scores on the separate proficiency tests be referenced against (and calibrated in terms of) the scores of monolingual children who speak the language of the test. (It is realized that this may not be possible to attain in reference to some very small populations where the minority language is on the wane.)

DISCUSSION QUESTIONS

1. How are curricular decisions regarding the delivery of instructional materials made in your school(s)? Schools that you know of?
2. If you work in or know of a bilingual program, what steps are taken in that program to assess language proficiency? How are the scores interpreted in relation to the curriculum? If you knew that a substantial number of the children in your school were approximately equally proficient in two languages, what curricular decisions would you recommend? What else would you need to know in order to recommend policies?
3. If you were asked to rank priorities for budgetary expenditures, where would language testing come on the list for tests and measurements? Or, would there be any such list?
4. What is wrong with 'surname surveys' as a basis for determining language dominance? What can be said about a child whose name is Ortiz? Smith? Reitzel? What about asking the child concerning his language preferences? What are some of the factors that might influence the child's response? Why not just have the teachers in the schools judge the proficiency of the children?
5. What price would you say would be a fair one for being able to communicate in one of the world's power languages (especially English)? Consider the case for the child in the African primary schools as suggested by Hofman. Is it worth the cost?
6. Try to conceive of a language test that need not make reference to group norms. How would such a test relate to educational policy?
7. Consider doing a study of possible language variety bias in the tests used in your school. Or perhaps a language bias study, or combination of the two, would be more appropriate. What kinds of scores would be available for the study? IQ? Aptitude? Achievement? Classroom observations? What sorts of language proficiency measures might you use? (Anyone seriously considering such a study is urged to consult Part Three, and to ask the advice of someone who knows something about research design before actually undertaking the project. It could be done, however, by any teacher or administrator capable of persuading others that it is worth doing.) Are language variety biases in educational testing different in type?
8. Begin to construct a list of the sociocultural factors that might be partially accountable for the widely discussed view that has been put forth by Jensen and Herrnstein to the effect that certain races are superior in intelligence. What is intelligence? How is it measured with reference to your school or school populations that you know of?
9. Spend a few days as an observer of children in a classroom setting. Note ways in which the teacher and significant others in the school communicate emotive information to the children. Look especially for contrasts in what is said and what is meant from the child's perspective. Consider the school curriculum with a view to its representation of different cultures and language varieties. Observe the behaviour of the children. What kinds of messages do they pick up and pass on? What kinds of beliefs and attitudes are they being encouraged to accept or reject?
10. Discuss the differences between 'submersion' and 'immersion' as educational approaches to the problem of teaching a child a new language and cultural system. (See Barik and Swain, 1975, in Suggested Readings at the end of this chapter.)
11. Consider an example, or several of them, of children who are especially low achievers in your school or in a school that you know of. Look for sociocultural factors that may be related to the low achievement. Then, consider some of the high achievers in the same context. Consider the probable effects of sociocultural contexts. Do you see a pattern emerging?
12. To what extent is 'cultural bias' in tests identical to 'experiential bias' - i.e., simply not having been exposed to some possible set of experiences? Can you find genuine cultural factors that are distinct from experiential biases? Examine a widely used standardized test. If possible discuss it with someone whose cultural experience is very different from your own. Are there items in it that are clearly biased? Culturally? Or merely experientially?
13. Discuss factors influencing children who learn to read before they come to school. Consider those factors in relation to what you know of children who fail to learn to read after they come to school. Where does language development fit into the picture? Books in the home? Models who are readers that the child might emulate? Do children of widely different dialect origins in the United States (or elsewhere) learn to read much the same material? Consider Great Britain, Ireland, Australia, Canada, and other nations.
14. Consider the fact that test is a four letter word (Upshur, 1969b). Why? How have tests been misused or simply used to make them seem so ominous, portentous, even wicked? Reconsider the definition of language tests offered in Chapter 1. Can you think of any innocuous and benign procedures that qualify as tests? How could such procedures be used to reduce the threat of tests?
15. Try translating a few discrete items on several types of tests. Have someone else who has not seen the original items do a 'blind back translation'. Check the comparability of the results. Try the same with a passage of prose. Can corrections be made to clear up the difficulties that arise? Are there substantial contrasts between the two tasks?
16. Try a language interview procedure that asks children how well they understand one or more languages and when they use it. A procedure such as the one suggested for Spanish-English bilinguals by Spolsky, et al (1972, see Suggested Readings for Chapter 4) should suffice. Then test the children on a battery of other measures - teacher evaluations of the type suggested by Spolsky, et al (1972) for Navajo-English bilinguals might also be used. Intercorrelate the scores and determine empirically their degree of variance overlap (a computational sketch follows these questions). To what extent can the various procedures be said to yield the same information?
17. What kinds of tests can you conceive of to assess the popular opinion that language dominance varies from domain to domain? Be careful to define 'domain' in a sociolinguistically meaningful and pragmatically useful way.
18. Devise proficiency tests in two or more languages and attempt to calibrate them in the recommended ways - both against the scores of monolingual reference groups (if these are accessible) and in terms of the average standard error of the differences on the two tests. Relate them to a five point dominance scale such as the one shown in Figure 3 above. Correlate scores on the proficiency tests that you have devised with other standardized tests used in the school from which your sample population was drawn. To what extent are variances on other tests accounted for by variances in language proficiencies (especially in English, assuming that it is the language of practically all standardized testing in the schools)?
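The 'variance accounted for' in questions 16 and 18 can be made concrete with a small computation: the squared correlation between two sets of scores estimates their variance overlap. Below is a minimal sketch in Python with invented scores (it assumes Python 3.10 or later for statistics.correlation):

    # Variance overlap between a devised proficiency test and a
    # standardized test, estimated as the squared Pearson correlation.
    # All scores here are invented for illustration.
    from statistics import correlation  # available in Python 3.10+

    proficiency = [55, 72, 48, 90, 63, 35, 80]    # hypothetical cloze scores
    standardized = [60, 70, 50, 88, 58, 40, 75]   # hypothetical achievement scores

    r = correlation(proficiency, standardized)
    print(f'r = {r:.2f}; variance overlap (r squared) = {r * r:.2f}')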

SUGGESTED READINGS

1. H. C. Barik and Merrill Swain, 'Three-year Evaluation of a Large Scale Early Grade French Immersion Program: the Ottawa Study,' Language Learning 25, 1975, 1-30.
2. Andrew Cohen, 'The Sociolinguistic Assessment of Speaking Skills in a Bilingual Education Program,' in L. Palmer and B. Spolsky (eds.) Papers on Language Testing. Washington, D.C.: TESOL, 1975, 173-186.
3. Paul A. Kolers, 'Bilingualism and Information Processing,' Scientific American 218, 1968, 78-84.
4. 'OCR Sets Guidelines for Fulfilling Lau Decision,' The Linguistic Reporter 18, 1975, 1, 5-7. (Gives addresses of Lau Centers and quotes the text of the 'Lau Remedies' recommended by the Task Force appointed by the Office of Civil Rights.)
5. Patricia J. Nakano, 'Educational Implications of the Lau v. Nichols Decision,' in M. Burt, H. Dulay, and M. Finocchiaro (eds.) Viewpoints on English as a Second Language. New York: Regents, 1977, 219-234.
6. Bernard Spolsky, 'Speech Communities and Schools,' TESOL Quarterly 8, 1974, 17-26.
7. Bernard Spolsky, Penny Murphy, Wayne Holm, and Allen Ferrel, 'Three Functional Tests of Oral Proficiency,' TESOL Quarterly 6, 1972, 221-235. (Also in Palmer and Spolsky, 75-90, see reference 2 above. Page references in this text are to the latter source.)
8. Perry A. Zirkel, 'A Method for Determining and Depicting Language Dominance,' TESOL Quarterly 8, 1974, 7-16.
9. Perry A. Zirkel, 'The Why's and Ways of Testing Bilinguality Before Teaching Bilingually,' The Elementary School Journal, March 1976, 323-330.
5

Measuring Attitudes and Motivations

A. The need for validating affective measures
B. Hypothesized relationships between affective variables and the use and learning of language
C. Direct and indirect measures of affect
D. Observed relationships to achievement and remaining puzzles

A great deal of research has been done on the topic of measuring the affective side of human experience. Personality, attitudes, emotions, feelings, and motivations, however, are subjective things and even our own experience tells us that they are as changeable as the wind. The question here is whether or not they can be measured. Further, what is the relationship between existing measures aimed at affective variables and measures aimed at language skill or other educational constructs?

A. The need for validating affective measures

No one seems to doubt that attitudinal factors are related to human performances.¹ In the preceding chapter we considered the hypothesis that emotive or affective factors play a greater part in determining success or failure in schools than do factive or cognitive factors (particularly teaching methodologies). The problem is how to determine what the affective factors might be. It is widely believed that a child's self-concept (confidence or lack of it, willingness to take social risks, and all around sense of well-being) must contribute to virtually every sort of school performance - or performance outside of the school for that matter. Similarly, it is believed that the child's view of others (whether of the child's own ethnic and social background, or of a different cultural background, whether peers or non-peers) will influence virtually every aspect of his interpersonal interactions in positive and negative ways that contribute to success or failure (or perhaps just to happiness in life, which though a vaguer concept, may be a better one).

¹ Throughout this chapter and elsewhere in the book, we often use the term 'attitudes' as a cover term for all sorts of affective variables. While there are many theories that distinguish between many sorts of attitudinal, motivational and personality variables, all of them are in the same boat when it comes to validation.
It is not difficult to believe that attitude variables are important to a wide range of cognitive phenomena - perhaps the whole range - but it is difficult to say just exactly what attitude variables are. Therefore, it is difficult to prove by the usual empirical methods of science that attitudes actually have the importance usually attributed to them. In his widely read and often cited book, Beyond Freedom and Dignity, B. F. Skinner (1971) offers the undisguised thesis that such concepts as 'freedom' and 'dignity', not to mention 'anxiety', 'ego', and the kinds of terms popular in the attitude literature, are merely loose and misleading ways of speaking about the kinds of events that control behavior. He advocates sharpening up our thinking and our ways of controlling behavior in order to save the world - 'not ... to destroy [the] environment or escape from it ... [but] to redesign it' (p. 39). To Skinner, attitudes are dispensable intervening variables between behavior and the consequences of behavior. They can thus be done away with. If he were quite correct, it ought to be possible to observe behaviors and their consequences astutely enough to explain all there is to explain about human beings - however, there is a problem for Skinner's approach. Even simpler systems than human beings are not fully describable in that way - e.g., try observing the input and output of a simple desk calculator and see if it is possible to determine how it works inside; then recall that human beings are much more complex than desk calculators.

Only very radical and very narrow (and therefore largely uninteresting and very weak) theories are able to dispense completely with attitudes, feelings, personalities, and other difficult-to-measure internal states and motives of human beings. It seems necessary, therefore, to take attitudes into account. The problem begins, however, as soon as we try to be very explicit about what an attitude is. Shaw and Wright (1967) say that 'an attitude is a hypothetical, or latent, variable' (p. 15). That is to say, an attitude is not the sort of variable that can be observed directly. If someone reports that he is angry, we must either take his word for it, or test his statement on the basis of what we see him doing to determine whether or not he is really angry. In attitude research, the chain of inference is often much longer than just the inference from a report to an attitude or from a behavioral pattern to an attitude. The quality or quantity of the attitude can only be inferred from some other variable that can be measured.
For instance, a respondent is frequently asked to evaluate a statement about some situation (or possibly a mere proposition about a very general state of affairs). Sometimes he is asked to say how he would act or feel in a given described situation. In so-called 'projective' techniques, it is further necessary for some judge or evaluator to rate the response of the subject for the degree to which it displays some attitude. In some of these techniques there are so many steps of inference where error might arise that it is amazing that such techniques ever produce usable data, but apparently they sometimes do. Occasionally, we may be wrong in judging whether someone is happy or sad, angry or glad, anxious or calm, but often we are correct in our judgements, and a trained observer may (not necessarily will) become very skilled in making such judgements.

Shaw and Wright (1967) suggest that 'attitude measurement consists of the assessment of an individual's responses to a set of situations. The set of situations is usually a set of statements (items) about the attitude object, to which the individual responds with a set of specific categories, e.g., agree and disagree. ... The ... number derived from his scores represents his position on the latent attitude variable' (p. 15). Or, at least, that is what the researcher hopes, and often it is what the researcher merely asserts. For example, Gardner and Lambert (1972) assert that the degree of a person's agreement with the statement that 'Nowadays more and more people are prying into matters that should remain personal and private' is a reflection of their degree of 'anti-democratic ideology' (p. 150). The statement appears in a scale supposed by its authors (Adorno, Frenkel-Brunswik, Levinson, and Sanford, 1950) to measure 'prejudice and authoritarianism'. The scale was used by Gardner and Lambert in the 1960s in the states of Maine, Louisiana, and Connecticut. Who can deny that the statement was substantially true and was becoming more true as the Nixon regime grew and flourished?
The trouble with most attitude measures is that they have never been subjected to the kind of critical scrutiny that should be applied to any test that is used to make judgements (or to refute judgements) about human beings. In spite of the vast amounts of research completed in the last three or four decades on the topic of attitudes, personality, and measures of related variables, precious little has been learned that is not subject to severe logical and empirical doubts. Shaw and Wright (1967), who report hundreds of attitude measures along with reliability statistics and validity results, where available, lament their 'impression that much effort has been wasted ...' and that 'nowhere is this more evident than in relation to the instruments used in the measurement of attitudes' (p. ix).

They are not alone in their disparaging assessment of attitude measures. In his preface to a tome of over a thousand pages on Personality: Tests and Reviews, Oscar K. Buros (1970) says:

Paradoxically, the area of testing which has outstripped all others in the quantity of research over the past thirty years is also the area in which our testing procedures have the generally least accepted validity. ... In my own case, the preparation of this volume has caused me to become increasingly discouraged at the snail's pace at which we are advancing compared to the tremendous progress being made in the areas of medicine, science, and technology. As a profession, we are prolific researchers; but somehow or other there is very little agreement about what is the resulting verifiable knowledge (p. xix).
In an article on a different topic and addressed to an entirely different audience, John R. Platt offered some comments that may help to explain the relative lack of progress in social psychology and in certain aspects of the measurement of sociolinguistic variables. He argued that a thirty year failure to agree is proof of a thirty year failure to do the kind of research that produces the astounding and remarkable advances of fields like 'molecular biology and high-energy physics' (1964, p. 350). The fact is that the social sciences generally are among the 'areas of science that are sick by comparison because they have forgotten the necessity for alternative hypotheses and disproof' (p. 350). He was speaking of his own field, chemistry, when he coined the terms 'The Frozen Method, The Eternal Surveyor, The Never Finished, The Great Man with a Single Hypothesis, The Little Club of Dependents, The Vendetta, The All Encompassing Theory which Can Never Be Falsified' (p. 350), but do these terms not have a certain ring of familiarity with reference to the social sciences? What is the solution? He proposes a return to the old-fashioned method of inductive inference - with a couple of embellishments.
It is necessary to form multiple working hypotheses instead of merely popularizing the ones that we happen to favor, and instead of trying merely to support, or worse yet to prove, hypotheses (which strictly speaking is a logical impossibility for interesting empirical ones), we should be busy eliminating the plausible alternatives - alternatives which are rarely addressed in the social sciences. As Francis Bacon stressed so many years ago, and Platt re-emphasizes, science advances only by disproofs.

Is it the failure to disprove that explains the lack of agreement in the social sciences? Is there a test that purports to be a measure of a certain construct? What else might it be a measure of? What other alternatives are there that need to be ruled out? Such questions have usually been neglected. Has a researcher found a significant difference between two groups of subjects? What plausible explanations for the difference have not been ruled out? In addition to forming multiple working hypotheses (which will help to keep our thinking impartial as well as clear), it is necessary always to recycle the familiar steps of the Baconian method: (1) form clear hypotheses; (2) design crucial experiments to eliminate as many as possible; (3) carry out the experiments; and (4) 'refine the possibilities that remain' (Platt, 1964, p. 347) and do it all over again, and again, and again, always eliminating some of the plausible alternatives and refining the remaining ones. By such a method, the researcher enhances the chances of formulating a more powerful explanatory theory on each cycle from data-to-theory to data-to-theory with maximum efficiency. Platt asks if there is any virtue in plodding almost aimlessly through thirty years of work that might be accomplished in two or three months with a little forethought and planning.
In the sequel to the first volume on personality tests, Buros offers the following stronger statement in 1974 (Personality Tests and Reviews II):

It is my considered belief that most standardized tests are poorly constructed, of questionable or unknown validity, pretentious in their claims, and likely to be misused more often than not (p. xvii).

In another compendium, one that reviews some 3,000 sources of psychological tests, E. Lowell Kelly writes in the Foreword (see Chun, Cobb, and French, 1975):

At first blush, the development of such a large number of assessment devices in the past 50 years would appear to reflect remarkable progress in the development of the social sciences. Unfortunately, this conclusion is not justified. ... nearly three out of four of the instruments are used but once and often only by the developer of the instrument. Worse still, ... the more popular instruments tend to be selected more frequently not because they are better measuring instruments, but primarily because of their convenience (p. v).

It is one thing to say that a particular attitude questionnaire, or rating procedure of whatever sort, measures a specific attitude or personality characteristic (e.g., authoritarianism, prejudice, anxiety, ego strength/weakness, or the like) but it is an entirely different matter to prove that this is so. Indeed, it is often difficult to conceive of any test whatsoever that would constitute an adequate criterion for 'degree of authoritarianism' or 'degree of empathy', etc. Often the only validity information offered by the designer or user of a particular procedure for assessing some hypothetical (at best latent) attitudinal variable is the label that he associates with the procedure. Kelly (1975) classes this kind of validity with several other types as what he calls 'pseudo-validity'. He refers to this kind of validity as 'nominal validity' - it consists in 'the assumption that the instrument measures what its name implies'. It is related closely to 'validity by fiat' - 'assertion by the author (no matter how distinguished!) that the instrument measures variable X, Y, or Z' (p. vii, from the Foreword to Chun, et al, 1975). One has visions of a king laying the sword blade on either shoulder of the knight-to-be and declaring to all ears, 'I dub thee Knight.' Hence the test author gives 'double' (or 'dubious', if you prefer) validity to his testing instrument - first by authorship and then by name.
Is there no way out of this difficulty? Is it not possible to require more of attitude measures than the pseudo-validities which characterize so many of them? In a classic paper on 'construct validity' Cronbach and Meehl (1955) addressed this important question. If we treat attitudes as theoretical constructs and measures of attitudes (or measures that purport to be measures of attitudes) as tests, then the tests are subject to the same sorts of empirical and theoretical justification that are applied to any construct in scientific study. Cronbach and Meehl say, 'a construct is some postulated attribute of people assumed to be reflected in performance' (p. 283). They continue, 'persons who possess this attribute will, in situation X, act in manner Y (with a stated probability)' (p. 284).

Among the techniques for assessing the validity of a test of a postulated construct are the following: select a group whose behavior is well known (or can be determined) and test the hypothesis that the behavior is related to an attitude (e.g., the classic example used by Thurstone and others was church-goers versus non-church-goers). Or, if the attitude or belief of a particular group is known, some behavioral criterion can be predicted and the hypothesized outcome can be tested by the usual method.
These techniques are rather rough and ready, and can be improved upon generally (in fact they must be improved upon if attitude measures are to be more finely calibrated) by devising alternative measures of the attitude or construct and assessing the degree (and pattern) of variance overlap on the various measures by correlation and related procedures (e.g., factor analysis and multiple regression techniques). For instance, as Cronbach and Meehl (1955) observe, 'if two tests are presumed to measure the same construct, a correlation between them is predicted ... If the obtained correlation departs from the expectation, however, there is no way to know whether the fault lies in test A, test B, or the formulation of the construct. A matrix of intercorrelations often points out profitable ways of dividing the construct into more meaningful parts, factor analysis being a useful computational method in such studies' (p. 287).

Following this latter procedure, scales which are supposed to assess the same construct can be correlated to determine whether in fact they share some variance. The extent of the correlation, or their tendency to correlate with a mathematically defined factor (or to 'load' on such a factor), may be taken as a kind of index of the validity of the measure. Actually, the latter techniques are related to the general internal consistency of measures that purport to measure the same thing. To prove a high degree of internal consistency between a variety of measures of, say, 'prejudice' is not to prove that the measures are indeed assessing degree of prejudice, but if any one of them on independent grounds can be shown to be a measure of prejudice then confidence in all of the measures is thereby strengthened. (For several applications of rudimentary factor analysis, see the Appendix.)
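To make the intercorrelation procedure concrete, here is a minimal sketch in Python; the three 'scales' and their scores are invented for illustration, and the sketch assumes Python 3.10 or later for statistics.correlation:

    # Intercorrelating three purported measures of one construct (say,
    # 'prejudice'). High intercorrelations support internal consistency
    # but do not by themselves establish construct validity.
    from statistics import correlation  # available in Python 3.10+

    measures = {
        'scale_1': [12, 18, 9, 22, 15, 7, 20],   # invented scores
        'scale_2': [10, 20, 11, 21, 14, 8, 19],  # invented scores
        'scale_3': [15, 16, 10, 23, 12, 9, 18],  # invented scores
    }

    names = list(measures)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = correlation(measures[a], measures[b])
            print(f'{a} x {b}: r = {r:.2f}')

A full factor analysis would go further, extracting the common dimension such scales share, but the correlation matrix is the raw material for it.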
Ultimately, there may be no behavioral criterion which can be proposed as a basis for validating a given attitude measure. At best there may be only a range of criteria, which, taken as a whole, cause us to have confidence in the measurement technique (for most presently used measures, however, there is no reason at all to have confidence in them). As Cronbach and Meehl (1955) put it, 'personality tests, and some tests of ability, are interpreted in terms of attributes for which there is no criterion' (p. 299). In such cases, it is principally through statistical and empirical methods of assessing the internal consistency of such devices that their construct validity is to be judged. They give the example of the measurement of temperature. One criterion we might impose on any measuring device is that it would show higher temperature for water when it is boiling than when it is frozen solid, or when it feels hot to the touch rather than cold to the touch - but the criterion can be proved to be more crude than the degrees of temperature a variety of thermometers are capable of displaying. Moreover, the subjective judgement of temperature is more subject to error than are measurements derived directly from the measuring techniques.
The trouble with attitudes is that they are so out of reach, and at the same time they are apparently subject to a kind of fluidity that allows them to change (or perhaps be created on the spot) in response to different social situations. Typically, it is the effects of attitudes that we are interested in rather than the attitudes per se, or alternatively, it is the social situations that give rise to both the attitudes and their effects which are the objects of interest. Recently, there has been a surge of interest in the topic of how attitudes affect language use and language behavior. We turn now to that topic and return below to the kinds of measures of attitudes that have been widely used in the testing of certain research hypotheses related to it.

B. Hypothesized relationships between affective variables and the use and learning of language

At first look, the relationship of attitudes to the use and learning of language may appear to be a small part of the problem of attitudes in general - what about attitude toward self, significant others, situations, groups, etc.? But a little reflection will show that a very wide range of variables are encompassed by the topic of this subsection. Moreover, they are variables that have received much special attention in recent years. In addition to concern for attitudes and the way they influence a child's learning and use of his own native language (or possibly his native languages in the event that he grows up with more than one at his command), there has been much concern for the possible effects that attitudes may have on the learning of a second or third language and the way such attitudes affect the choice of a particular linguistic code (language or style) in a particular context. Some of the concern is generated by actual research results - for instance, results showing a significant relationship between certain attitude measures or motivation indices and attainment in a second language. Probably most of it, however, is generated by certain appealing arguments that often accompany meager or even contradictory research findings - or no findings at all.
A widely cited and popular position is that of Guiora and his collaborators (see Guiora, Paluszny, Beit-Hallahmi, Catford, Cooley, and Dull, 1975). The argument is an elaborate one and has many interesting ramifications and implications, but it can be capsulized by a few selected quotations from the cited article. Crucial to the argument are the notions of empathy (being able to put yourself in someone else's shoes - apropos to Shakespeare's claim that 'a friend is another self'), and language ego (that aspect of self awareness related to the fact that I sound a certain way when I speak and that this is part of 'my' identity):

We hypothesized that this ability to shed our native pronunciation habits and temporarily adopt a different pronunciation is closely related to empathic capacity (p. 49).

One wonders whether having an empathic spirit is a necessary criterion for acquiring a native-like pronunciation in another language? A sufficient one? But the hypothesis has a certain appeal as we read more.

With pronunciation viewed as the core of language ego, and as the most critical contribution of language ego to self-representation, we see that the early flexibility of ego boundaries is reflected in the ease of assimilating native-like pronunciation by young children; the later reduced flexibility is reflected in the reduction of this ability in adults (p. 46).

Apart from some superfluity and circularity (or perhaps because of it) so far so good. They continue,

At this point we can link empathy and pronunciation of a second language. As conceived here, both require a temporary relaxation of ego boundaries and thus a temporary modification of self-representation. Although psychology traditionally regards language performance as a cognitive-intellectual skill, we are concerned here with that particular aspect of language behavior that is most critically tied to self-representation (p. 46).

But the most persuasive part of the argument comes two pages later:

superimposed upon the speech sounds of the words one chooses to utter are sounds which give the listener information about the speaker's identity. The listener can decide whether one is sincere or insincere. Ridicule the way I sound, my dialect, or my attempts at pronouncing French and you will have ridiculed me.

(One might, however, be regarded as oversensitive if one spoke very bad French, mightn't one?)

Ask me to change the way I sound and you ask me to change myself. To speak a second language authentically is to take on a new identity. As with empathy, it is to step into a new and perhaps unfamiliar pair of shoes (p. 48).
What about the empirical studies designed to test the relationship between degree of empathy and acquisition of unfamiliar phonological systems? What does the evidence show? The empirical data is reviewed by Schumann (1975), who, though he is clearly sympathetic with the general thesis, finds the empirical evidences for it excessively weak. The first problem is in the method for measuring 'empathy' by the Micro Momentary Expression test - a procedure that consists of watching silent psychiatric interviews and pushing a button every time a change is noticed in the expression on the face of the person being interviewed by the psychiatrist (presumably the film does not focus on the psychiatrist, but Guiora, et al, are not clear on this in their description of the so-called measure of empathy). Reasonable questions might include, is this purported measure of empathy correlated with other purported measures of the same construct? Is the technique reliable? Are the results similar on different occasions under similar circumstances? Can trained 'empathizers' do the task better than untrained persons? Is there any meaningful discrimination on the MME between persons who are judged on other grounds to be highly empathetic and persons who are judged to be less so? Apparently, the test, as a measure of empathy, must appeal to 'nominal validity'.
A second problem with the empirical research on the Guiora, et al, hypothesis is the measure of pronunciation accuracy, the Standard Thai Procedure:

The STP consists of a master tape recording of 34 test items (ranging in length from 1 to 4 syllables) separated by a 4 second pause. The voicer is a female Thai native speaker. ... Total testing time is 4½ minutes. ... (p. 49).

The scoring procedure is currently under revision. The basic evaluation method involves rating tone, vowel and consonant quality for selected phonetic units on a scale of 1 (poor), 2 (fair), or 3 (native-like). Data tapes are rated independently by three native Thai speakers, trained in pronunciation evaluation. A distinct advantage of the STP is that it can be used with naive subjects. It bypasses the necessity of first teaching subjects a second language (p. 50).

A distinct advantage to test people who have not learned the language? Presumably the test ought to discriminate between more and less native-like pronouncers of a language that the subjects have already learned. Does it? No data are offered. Obviously, the STP would not be a very direct test of, say, the ability to pronounce Spanish utterances with a native-like accent - or would it? No data are given. Does the test discriminate between persons who are judged to speak 'seven languages in Russian' (said with a thick Russian accent) and persons who are judged by respective native speaker groups to speak two or more languages so as to pass themselves for a native speaker of each of the several languages? Reliability and validity studies of the test are conspicuously absent.

The third problem with the Guiora, et al, hypothesis is that the empirical results that have been reported are either only weakly interpretable as supporting their hypothesis, or they are not interpretable at all. (Indeed, it makes little sense to try to interpret the meaning of a correlation between two measures about which it cannot be said with any confidence what they are measures of.) Their prediction is that empathetic persons will do better on the STP than less empathetic persons.
Two empirical studies attempted to establish the connection between empathy and attained skill in pronunciation - both are discussed more fully by Schumann (1975) than by Guiora, et al (1975). The first study with Japanese learners failed to show any relationship between attained pronunciation skill and the scores on the MME (the STP was not used). The second study, a more extensive one, found significant correlations between rated pronunciation in Spanish (for 109 subjects at the Defense Language Institute in a three month intensive language course), Russian (for 201 subjects at DLI), Japanese (13 subjects at DLI), Thai (40 subjects), and Mandarin (38). There was a difficulty, however. The correlations were positive for the first three languages and negative for the last two. This would seem to indicate that if the MME measures empathy, for some groups it is positively associated with the acquisition of pronunciation skills in another language, and for other groups, it is negatively associated with the same process. What can be concluded from such data? Very little of any substance - except that the validity of the measures, and the hypothesis, is in serious doubt.
In spite of all of the foregoing weaknesses in the empirical methods of Guiora, et al, their reasoning seems to have merit independently. Perhaps this is because the arguments they propose for relating language use and learning to affective variables find some kind of historical appeal inasmuch as they may relate to the feelings that people experience and report about the process of learning a language and the sense of personal identity that goes with the process. Schumann (1975) relates the argument to a variety of other viewpoints on how affective variables play a part in the acquisition of a second language. Earlier, Brown (1973) was apparently sympathetic to the position advocated by Guiora et al. He suggested that 'the very definition of communication implies a process of revealing one's self to another' (pp. 233-4), hence the 'ego' is very much involved in any process of language use and language learning by any definition.

Brown suggests serious investigation of such ego related factors as the role of imitative behaviors (related to model figures, admired persons), self-knowledge, self-esteem, and self-confidence. He reports that an unpublished study in 1973 (Lederer, see Brown's references p. 244) 'revealed that the self-concept of Detroit high school students was an overwhelming indicator of success in a foreign language' (p. 234). He also recommends investigation of variables such as empathy, introversion/extroversion, and aggression. (See Suggested Readings at the end of this chapter.)
Another very substantial tradition of research revolves around the work of Lambert and Gardner and their co-workers. That research paradigm hinges on the distinction between two major motives for learning the language of another group, integrative versus instrumental, and several satellite concepts. It is generally argued that if a learner wants to be like valued members of the target language community, he is more apt to be a successful learner of that target language. A learner who can be shown to have such an orientation can be assumed to have an integrative motive, whereas a learner who wants to acquire the language for utilitarian reasons is said to have an instrumental motive. In the early research on the topic one gets the impression that integratively oriented learners were expected to outperform instrumentally motivated learners (other things being equal), but in later studies the sharpness of this distinction faded somewhat.
For instance, in their first study Gardner and Lambert (1959) concluded that 'the integratively oriented students are generally more successful in acquiring French than those who are instrumentally oriented' (p. 196). Later, Anisfeld and Lambert (1961) weakened their original hypothesis substantially: 'our results may suggest that the reasons for studying a language, whether instrumental or integrative, are not important in themselves' (p. 224) whereas what is important is 'attitudes toward the language community' (p. 225). However, in yet another study, Lambert, Gardner, Barik, and Tunstall (1962) weaken the position still further:

Our previous studies indicated that favorable attitudes and orientations toward the other language contributed to a strong motivation to learn the other group's language and consequently correlated with achievement measures. In the present case, even the instrumentally oriented advanced students have strong motivations to learn the language, thereby reducing the relation of Orientation Index [which gives a 1 for an instrumental motive and a 2 for an integrative motive] scores to achievement (p. 241).

In a more recent study by Lukmani (1972), an instrumental motive seemed to be more strongly related to attainment in English as a second language in India than an integrative motive. Hence, all three possibilities concerning the original hypothesis have been obtained: in some cases, the integrative motive has been stronger (as was originally predicted, and as Gardner and Lambert, 1972, p. 16 still maintained), in others there has been no advantage of one over the other, and in still others, the instrumental motive appears to be the stronger.
What is to become of such a position? Should it not be rejected or substantially modified? It would seem that the empirical data only serve to confuse the issue - at least the data do not clearly support the 'working hypothesis', and the hypothesis keeps on working without ever drawing unemployment benefit (see Johnson and Krug, in press). The same sort of difficulties arise with respect to the satellite concepts of anomie (the sense of being cut loose from one's roots that may arise in the course of becoming a member of a different language and culture group), and ethnocentrism (like egocentrism, except that one's group rather than one's self is the center of attention). Early in Lambert's writing, the hypothesis was advanced that anomie was correlated with the learning of a second language - that is, the more anomie one felt about one's own culture and language, the more rapidly one might want to acquire another (see Lambert, 1955, and Gardner and Lambert, 1959). A similar claim was offered for ethnocentrism, although it was expected to be negatively correlated with attainment in the target language. Neither of these hypotheses (if they can be termed hypotheses) has proved to be any more susceptible to empirical test than the one about types of motivation. The arguments in each case have a lot of appeal, but the hypotheses themselves do not seem to be borne out by the data - the results are either unclear or contradictory. We will return to consider some of the specific measuring instruments used by Gardner and Lambert in section C.
Yet another research tradition in attitude measurement is that of Joshua Fishman and his co-workers. Perhaps his most important argument is that there must be a thoroughgoing investigation of the factors that determine who speaks what language (or variety) to whom and under what conditions.

In a recent article by Cooper and Fishman (1975), twelve issues of interest to researchers in the area of 'language attitude' are discussed. The authors define language attitudes either narrowly as related to how people think people ought to talk (following Ferguson, 1972) or broadly as 'those attitudes which influence language behavior and behavior toward language' (pp. 188-9). The first definition they consider to be too narrow and the second 'too broad' (p. 189).
Among the twelve questions are: (1) Does attitude have a characteristic structure? ...; (2) To what extent can language attitudes be componentialized? ...; (3) Do people have characteristic beliefs about their language? (e.g., that it is well suited for logic, or poetry, or some special use); (4) Are language attitudes really attitudes toward people who speak the language in question, or toward the language itself? Or some of each? (5) What is the effect of context on expressed stereotypes? Is one language considered inappropriate in some social settings but appropriate in others? (6) Where do language attitudes come from? Stereotypes that are irrational? Actual experiences? (7) What effects do they have? ('Typically, only modest relationships are found between attitude measures and the overt behaviors which such scores are designed to predict' p. 191); (8) Is one language apt to be more effective than another for persuading bilinguals under different circumstances? (9) 'What relationships exist among attitude scores obtained from different classes of attitudinal measurements (physiological or psychological reaction, situational behavior, verbal report)?' (p. 191); (10) How about breaking down the measures into ones that assess reactions 'toward an object, toward a situation, and toward the object in that situation' (p. 191)? What would the component structure be like then? (11) 'Do indirect measures of attitude (i.e., measures whose purpose is not apparent to the respondent) have higher validities than direct measures?' ... (p. 192); (12) 'How reliable are indirect measures ...?' (p. 192).
It would seem that attitude studies generally share certain strengths and weaknesses. Among the strengths is the intuitive appeal of the argument that people's feelings about themselves (including the way they talk) and about others (and the way they talk) ought to be related to their use and learning of any language(s). Among the weaknesses is the general lack of empirical vulnerability of most of the theoretical claims that are made. They stand or fall on the merits of their intuitive appeal and quite independently of any experimental data that may be accumulated. They are subject to wide differences of interpretation, and there is a general (nearly complete) lack of evidence on the validity of purported measures of attitudinal constructs. What data are available are often contradictory or uninterpretable - yet the favored hypothesis still survives. In brief, most attitude studies are not empirical studies at all. They are mere attempts to support favored 'working' hypotheses - the hypotheses will go right on working even if they turn out to predict the wrong (or worse yet, all possible) experimental outcomes.
Part of the problem with measures of attitudes is that they require subjects to be honest in the 'throat-cutting' way - to give information about themselves which is potentially damaging (consider Upshur's biting criticism of such a procedure, cited above on p. 98). Furthermore, they require subjects to answer questions or react to statements that sometimes place them in the double-bind of being damned any way they turn - the condemnation which intelligent people can easily anticipate may be light or heavy, but why should subjects be expected to give meaningful (indeed, non-pathological) responses to such items? We consider some of the measurement techniques that are of the double-bind type in the next section. As Watzlawick, Beavin, and Jackson (1967) reason, such double-bind situations are apt to generate pathological responses ('crazy' and irrational behaviors) in people who by all possible non-clinical standards are 'normal'.
And there is a further problem which, even if the others could be resolved, remains to be dealt with - it promises to be knottier than all of the rest. Whose value system shall be selected as the proper criterion for the valuation of scales or the interpretation of responses? By whose standards will the questions be interpreted? This is the essential validity problem of attitude inventories, personality measures, and affective valuations of all sorts. It knocks at the very doors of the unsolved riddles of human existence. Is there meaning in life? By whose vision shall it be declared? Is there an absolute truth? Is there a God? Will I be called to give an accounting for what I do? Is life on this side of the grave all there is? What shall I do with this man Jesus? What shall I do with the moral precepts of my own culture? Yours? Someone else's? Are all solutions to the riddles of equivalent value? Is none of any value? Who shall we have as our arbiter? Skinner? Hitler? Kissinger?
Shaw and Wright (1967) suggest that 'the only inferential step' in the usual techniques for the measurement of attitudes 'is the assumption that the evaluations of the persons involved in scale construction correspond to those of the individuals whose attitudes are being measured' (p. 13). One does not have to be a logician to know that the inferential step Shaw and Wright are describing is clearly not the only one involved in the 'measurement of attitudes'. In fact, if that were the only step involved it would be entirely pointless to waste time trying to devise measurement instruments - just ask the test constructors to divulge their own attitudes at the outset. Why bother with the step of inference?
There are other steps that involve inferential leaps of substantial magnitude, but the one they describe as the 'only' one is no doubt the crucial one. There is a pretense here that the value judgements concerning what is a prejudice or what is not a prejudice, or what is anxiety or what is not anxiety, what is aggressiveness or acquiescence, what is strength and what is weakness, etc. ad infinitum, can be acceptably and impartially determined by some group consensus. What group will the almighty academic community pick? Or will the choice of a value system for affective measurement in the schools be made by political leaders? By Marxists? Christians? Jews? Blacks? Whites? Chicanos? Navajos? Theists? Atheists? Upper class? Middle class? Humanists? Bigots? Existentialists? Anthropologists? Sexists? Racists? Militarists? Intellectuals?

The trouble is the same one that Plato discussed in his Republic - it is not a question of how to interpret a single statement on a questionnaire (though this is a question of importance for each such statement), it is a question of how to decide cases of disagreement. Voting is one proposal, but if the minority (or a great plurality of minorities, down to the level of individuals) get voted down, shall their values be repressed in the institutionalization of attitude measures?

C. Direct and indirect measures of affect


So many different kinds of techniques have been developed for the purpose of trying to get people to reveal their beliefs and feelings that it would be impossible to be sure that all of the types had been covered in a single review. Therefore, the intent of this section is to discuss the most widely used types of attitude scales and other measurement techniques and to try to draw some conclusions concerning their empirical validities - particularly the measures that have been used in conjunction with language proficiency and the sorts of hypotheses considered in section B above.

Traditionally, a distinction is made between 'direct' and 'indirect' attitude measures. Actually we have already seen that there is no direct way of measuring attitudes, nor can there ever be. This is not so much a problem with measuring techniques per se as it is a problem of the nature of attitudes themselves. There can be no direct measure of a construct that is evidenced only indirectly, and subjectively even then.
As qualities of human experience, emotions, attitudes, and values are notoriously ambiguous in their very expression. Joy or sadness may be evident by tears. A betrayal or an unfeigned love may be demonstrated with a kiss. Disgust or rejoicing may be revealed by laughter. Approval or disapproval by a smile. Physiological measures might offer a way out, but they would have to be checked against subjective judgements. An increase in heart rate might be caused by fear, anger, surprise, etc. But even if a particular constellation of glandular secretions, palmar sweating, galvanic skin response, and other physiological responses were thought to indicate some emotional state, presumably the test would have to be validated against subjective judgement by asking the patient displaying the pertinent constellation of factors, 'Do you feel angry now?' As D. M. MacKay (1951) noted in his insightful article 'Mindlike Behaviour in Artefacts,' we could know all about the inner workings of a neon sign without knowing the meaning of the words that it displays. The problem of attitudes is like that; it is distinctly a matter of interpretation.
What measures of attitudinal variables have been used or
recommended in studies of language use and language learning? We
have mentioned a few above; we will look more closely at a number of
them below. They include the notoriously unreliable 'projective'
techniques such as the Thematic Apperception Test (see Anastasi,
1976, pp. 565-7, and 579-85), which is a Rorschach-like test
consisting of 19 deliberately vague pictures concerning which
subjects are asked to tell stories. They also use the ranking of
statements concerning how well they describe the respondent. More
often attitude scales are of the more traditional agree-disagree type
originated by Likert (1932). Recently, a variant on the Likert scale
technique has been developed by Zirkel and Jackson (1974) for use
with children from four different ethnic backgrounds. It employs line
drawings of faces supposedly ranging from very happy to very sad.
First, we will consider three of the attitude scales of the Likert type
used by Gardner and Lambert (1972) - the F Scale, the E Scale, and
the Anomie Scale. The instructions used for each scale were similar:

The rollowing Slalemenls are opinions whic h have often been


expressed by students about yo ur age. They cover a wide range
of topics and it has been fo und that many people agree with
eac h statement and many disagree. There are no right or wro ng
answers. You are asked to mauk each statement in the left-hand
margin according to your agreement o r disagreement as
follow s :

+1: slight support, agreement        -1: slight opposition, disagreement
+2: moderate support, agreement      -2: moderate opposition, disagreement
+3: strong support, agreement        -3: strong opposition, disagreement

MODIFIED F SCALE

1. Obedience and respect for authority are the most
important virtues children should learn.
2. What youth needs most is strict discipline, rugged
determination, and the will to work and fight for family
and country.
3. Nowadays when so many different kinds of people move
around and mix together so much, a person has to protect
himself especially carefully against catching an infection or
disease from them.
4. What this country needs most, more than laws and political
programs, is a few courageous, tireless, devoted leaders in
whom the people can put their faith.
5. No weakness or difficulty can hold us back if we have
enough will power.
6. Human nature being what it is, there will always be war and
conflict.
7. A person who has bad manners, habits, and breeding can
hardly expect to get along with decent people.
8. People can be divided into two distinct classes: the weak
and the strong.
9. There is hardly anything lower than a person who does not
feel a great love, gratitude, and respect for his parents.
10. The true American way of life is disappearing so fast that
force may be necessary to preserve it.
11. Nowadays more and more people are prying into matters
that should remain personal and private.
12. If people would talk less and work more, everybody would
be better off.
13. Most people don't realize how much our lives are
controlled by plots hatched in secret places.

According to Gardner and Lambert (1972, p. 150), all of the
foregoing statements 'reflect antidemocratic ideology'. In fact, the
original authors of the scale developed items from research on 'the
potentially fascistic individual' (Adorno, et al, 1950, p. 1) which
'began with anti-Semitism in the focus of attention' (p. 2). Sources for
the items were subject protocols from 'factual short essay questions
pertaining to such topics as religion, war, ideal society, and so forth;
early results from projective questions; finally, and by far the most
important, material from the interviews and the Thematic
Apperception Tests' (p. 225). The thirteen items given above were
selected from Forms 45 and 40 of the Adorno, et al (1950) F Scale
consisting of some 46 items, according to Gardner and Lambert
(1972). Actually, item 10 given above was from Form 60 (an earlier
version of the F Scale).
There are two major sources of evidence bearing on the validity of the
Fascism Scale (that is, the F Scale) that are easily accessible. First,
there are the intercorrelations of the early versions of the F Scale with
measures that were supposed (by the original authors of the F Scale)
to be similar in content, and second, there are the same sorts of data in
the several correlation tables offered by Gardner and Lambert which can
be examined. According to Adorno, et al (1950, pp. 222-24) the
original purpose in developing the F Scale was to try to obtain a less
obvious measure of 'antidemocratic potential' (p. 223, their italics)
than was available in the E Scale (or Ethnocentrism Scale) which they
had already developed.

Immediately following is the Gardner and Lambert adaptation of a
selected set of the questions on the E Scale which was used in much
of their attitude research related to language use and language
learning. Below, we return to the question of the validity of the F
Scale in relation to the E Scale:
MODIFIED E SCALE

1. The worst danger to real Americanism during the last fifty
years has come from foreign ideas and agitators.
2. Now that a new world organization is set up, America must
be sure that she loses none of her independence and
complete power as a sovereign nation.
3. Certain people who refuse to salute the flag should be forced
to conform to such a patriotic action, or else be imprisoned.
4. Foreigners are all right in their place, but they carry it too far
when they get too familiar with us.
5. America may not be perfect, but the American way has
brought us about as close as human beings can get to a
perfect society.
6. It is only natural and right for each person to think that his
family is better than any other.
7. The best guarantee for our national security is for America
to keep the secret of the nuclear bomb.
These items were selected by Gardner and Lambert from 20
original items recommended for the final form of the Adorno, et al E
Scale. The original authors reason, 'the social world as most
ethnocentrists see it is arranged like a series of concentric circles
around a bull's-eye. Each circle represents an ingroup-outgroup
distinction; each line serves as a barrier to exclude all outside groups
from the center, and each group is in turn excluded by a slightly
narrower one. A sample "map" illustrating the ever-narrowing
ingroup would be the following: Whites, Americans, native-born
Americans, Christians, Protestants, Californians, my family, and
finally - I' (p. 148). Thus, the items on the E Scale are expected to
reveal the degree to which the respondent is unable 'to identify with
humanity' (p. 148).
How well does the E Scale, and its more indirect counterpart the F
Scale, accomplish its purpose? One way of testing the adequacy of
both scales is to check their intercorrelation. This was done by
Adorno, et al, and they found correlations ranging from a low of .59
to a high of .87 (1950, p. 262). They concluded that if the tests were
lengthened, or corrected for the expected error of measurement in
any such test, they should intercorrelate at the .90 level (see their
footnote, p. 264). From these data the conclusion can be drawn that if
either scale is tapping an 'authoritarian' outlook, both must be.
However, the picture changes radically when we examine the data
from Gardner and Lambert (1972).
In studies with their modified (in fact, shortened) versions of the F
and E Scales, the correlations were .33 (for 96 English speaking high
school students in Louisiana), .39 (for 145 English speaking high
school students in Maine), .33 (for 142 English speaking high school
students in Connecticut), .33 (for 80 French-American high school
students in Louisiana), and .46 (for 98 French-American high school
students in Maine). In none of these studies does the overlap in
variance on the two tests exceed 22%, and the pattern is quite different
from the true relationship posited by Adorno, et al between F and E
(the variance overlap should be nearly perfect).
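Since the argument here and below turns on squared correlations, the
arithmetic is worth spelling out (a brief editorial sketch in Python;
only the percentages, not the code, follow from the sources):

    # The variance overlap between two measures is the square of their
    # correlation (r squared).
    for r in (0.33, 0.39, 0.33, 0.33, 0.46):
        print(f"r = {r:.2f} -> shared variance = {r * r:.0%}")
    # The largest value, r = .46, yields about 21% shared variance, whereas
    # the Adorno, et al correlations of .59 to .87 imply roughly 35% to 76%.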
What is the explanation? One idea that has been offered previously
(Liebert and Spiegler, 1974, and their references) and which seems to
fit the data from Gardner and Lambert relates to a subject's tendency
merely to supply what are presumed to be the most socially
acceptable responses. If the subject were able to guess that the
experimenter does in fact consider some responses more appropriate
than others, this would create some pressure on sensitive subjects to
give the socially acceptable responses. Such pressure would tend to
result in positive correlations across the scales.
Another possibility is that subjects merely seek to appear
consistent from one answer to the next. The fact that one has agreed
or disagreed with a certain item on either the E or F Scale may set up a
strong tendency to respond as one has responded on previous items -
a so-called 'response set'. If the response set factor were accounting
for a large portion of the variance in measures like E and F, then this
would also account for the high correlations observed between them.
In either event, shortening the tests as Gardner and Lambert did
would tend to reduce the amount of variance overlap between them,
because it would necessarily reduce the tendency of the scales to
establish a response set, or it would reduce the saliency of socially
desirable responses. All of this could happen even if neither test had
anything to do with the personality traits they purport to measure.
Along this line, Crowne and Marlowe (1964) report:
Acquiescence has been established as a major response
determinant in the measurement of such personality variables
as authoritarianism (Bass, 1955, Chapman and Campbell,
1957, Jackson and Messick, 1958). The basic method has been
to show, first of all, that a given questionnaire - say the
California F Scale (Adorno, et al, 1950) - has a large proportion
of items keyed agree (or true or yes). Second, half the items are
reversed, now being scored for disagreement. Correlations are
then computed between the original and the reversed items.
Failure to find high negative correlations is, then, an indication
of the operation of response acquiescence. In one study of the F
Scale, in fact, significant positive correlations - strongly
indicative of an acquiescent tendency - were found (Jackson,
Messick, and Solley, 1957), (p. 7).
Actually, the failure to find high negative correlations is not
necessarily indicative only of a response acquiescence tendency; there
are a number of other possibilities, but all of them are fatal to the
claims of validity for the scale in question.
Another problem with scales like E and F involves the tendency for
respondents to differentiate factive and emotive aspects of statements
with which they are asked to agree or disagree. One may agree with
the factive content of a statement and disagree with the emotive tone
(both of which in the case of written questionnaires are coded
principally in choice of words). Consider, for instance, the factive
content and emotive tone of the following statements (the first
version is from the Anomie Scale which is discussed below):
A. The big trouble with our country is that it relies, for the
most part, on the law of the jungle: 'Get him before he gets
you.'
B. The most serious problem of our people is that too few of
them practice the Golden Rule: 'Do unto others as you
would have them do unto you.'
C. The greatest evil of our country is that we exist, for the most
part, by the law of survival: 'Speak softly and carry a big
stick.'
Whereas the factive content of the preceding statements is similar in
all cases, and though each might be considered a rough paraphrase of
the others, they differ greatly in emotive tone. Concerning such
differences (which they term 'content' and 'style' respectively),
Crowne and Marlowe (1964) report:
Applying this differentiation to the assessment of personality
characteristics or attitudes, Jackson and Messick (1958)
contended that both stylistic properties of test items and
habitual expressive or response styles of individuals may
outweigh the importance of item content. The way an item is
worded - its style of expression - may tend to increase its
frequency of endorsement (p. 8).
Their observation is merely a special case of the hypothesis which we
discussed in Chapter 4 (p. 82f) on the relative importance of factive
and emotive aspects of communication.
Taking all of the foregoing into account, the validity of the E and F
Scales is in grave doubt. The hypothesis that they are measures of the
same basic configuration of personality traits (or at least of similar
configurations associated with 'authoritarianism') is not the only
hypothesis that will explain the available data - nor does it seem to be
the most plausible of the available alternatives. Furthermore, if the
validity of the E and F Scales is in doubt, their pattern of
interrelationship with other variables - such as attained proficiency in
a second language - is essentially uninterpretable.
A third scale used by Gardner and Lambert is the Anomie Scale
adapted partly from Srole (1951, 1956):

ANOMIE SCALE

1. In the U.S. today, public officials aren't really very
interested in the problems of the average man.
2. Our country is by far the best country in which to live. (The
scale is reversed on this item and on number 8.)
3. The state of the world being what it is, it is very difficult for
the student to plan for his career.
4. In spite of what some people say, the lot of the average man
is getting worse, not better.
5. These days a person doesn't really know whom he can
count on.
6. It is hardly fair to bring children into the world with the
way things look for the future.
7. No matter how hard I try, I seem to get a 'raw deal' in
school.
8. The opportunities offered young people in the United
States are far greater than in any other country.
9. Having lived this long in this culture, I'd be happier moving
to some other country now.
10. In this country, it's whom you know, not what you know,
that makes for success.
11. The big trouble with our country is that it relies, for the
most part, on the law of the jungle: 'Get him before he gets
you.'
12. Sometimes I can't see much sense in putting so much time
into education and learning.

This test is intended to measure 'personal dissatisfaction or
discouragement with one's place in society' (Gardner and Lambert,
1972, p. 21).

Oddly perhaps, Gardner and Lambert (1972, and their other works
reprinted there) have consistently predicted that higher scores on the
preceding scale should correspond to higher performance in learning
a second language - i.e., that degree of anomie and attainment of
proficiency in a second language should be positively correlated -
however, they have predicted that the correlations for the E and F
Scales with attainment in a second language should be negative. The
difficulty is that other authors have argued that scores on the Anomie
Scale and the E and F Scales should be positively intercorrelated with
each other - that, in fact, 'anomia is a factor related to the formation
of negative rejective attitudes toward minority groups' (Srole, 1956,
p. 712).
Srole cites a correlation of .43 between a scale designed to measure
prejudice toward minorities and his 'Anomia Scale' (both 'anomie'
and 'anomia' are used in designating the scale in the literature) as
evidence that 'lostness is one of the basic conditions out of which
some types of political authoritarianism emerge' (p. 714, footnote
20). Yet other authors have predicted no relationship at all between
Anomie scores and F scores (Christie and Geis, 1970, p. 360).
Again, we seem to be wrestling with hypotheses that are flavored
more by the preferences of a particular research technique than they
are by substantial research data. Even 'nominal' validity cannot be
invoked when the researchers fail to agree on the meaning of the
name associated with a particular questionnaire. Gardner and
Lambert (1972) report generally positive correlations between
Anomie scores and E and F scores. This, rather than contributing to a
sense of confidence in the Anomie Scale, merely makes it, too, suspect
of a possible response set factor - or a tendency to give socially
acceptable responses, or possibly to give consistent responses to
similarly scaled items (negatively toned or positively toned). In brief,
there is little or no evidence to show that the scale in fact measures
what it purports to measure.
Christie and Geis (1970) suggest that the F Scale was possibly the
most studied measure of attitudes for the preceding twenty year
period (p. 38). One wonders how such a measure survives in the face
of data which indicate that it has no substantial claims to validity.
Further, one wonders why, if such a studied test has produced such a
conglomeration of contradictory findings, anyone should expect to
be able to whip together an attitude measure (with much less study)
that will do any better. The problems are not merely technical ones
associated with test reliability and validity, they are also moral ones
having to do with the uses to which such tests are intended to be put.
The difficulties are considered severe enough for Shaw and Wright
(1967) to put the following statement in a conspicuous location at the
end of the Preface to their book on Scales for the Measurement of
Attitudes:
The attitude scales in this book are recommended for research
purposes and for group testing. We believe that the available
information and supporting research does not warrant the
application of many of these scales as measures of individual
attitude for the purpose of diagnosis or personnel selection or
for any other individual assessment process (p. xi).
In spite of such disclaimers, application of such measurement
techniques to the diagnosis of individual performances - e.g.,
prognosticating the likelihood of individual success or failure in a
course of study with a view to selecting students who are more likely
to succeed in 'an overcrowded program' (Brodkey and Shore, 1976)-
is sometimes suggested:
A problem which has arisen at the University of New Mexico is
one of predicting the language-learning behavior of students in
an overcrowded program which may in the near future become
highly selective. This paper is a progress report on the design of
an instrument to predict good and poor language learning
behavior on the basis of personality. Subjects are students in the
English Tutorial Program, which provides small sized classes
for foreign, Anglo-American, and minority students with poor
college English skills. Students demonstrate a great range of
linguistic styles, including English as a second language,
English as a second dialect, and idiosyncratic problems, but all
can be characterized as lacking skill in the literate English of
college usage - a difficult 'target' language (p. 153).
In brief, Brodkey and Shore set out to predict teacher ratings of
students (on 15 positively scaled statements with which the teacher
must agree or disagree) on the basis of the student's own preferential
ordering of 40 statements about himself; some of the latter are listed
below. The problem was to predict which students were likely to
succeed. Presumably, students judged likely to succeed would be
given preference at time of admission.
Actually, the student was asked to sort the 40 statements twice -
first, in response to how he would like to be, and second, how he was at
the time of performing the task. A third score was derived by
computing the difference between the first two scores. (There is no
way to determine on the basis of information given by Brodkey and
Shore how the items were scaled - that is, it cannot be determined
whether agreeing with a particular statement contributed positively
or negatively to the student's score.)
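The two-sort procedure just described can be pictured with a small
sketch (the numbers and the absolute-difference formula are invented
for illustration; Brodkey and Shore do not report their exact
computation):

    # Hypothetical Q-sort discrepancy score: each list gives the pile
    # position assigned to the same statements in the 'how I would like
    # to be' sort and in the 'how I am now' sort.
    ideal = [9, 2, 7, 5]
    actual = [6, 4, 3, 5]
    discrepancy = sum(abs(i - a) for i, a in zip(ideal, actual))
    print(discrepancy)  # prints 8; a larger value means a wider self-ideal gap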
THE Q-SORT STATEMENTS
(from Appendix B of Brodkey and Shore, 1976, pp. 161-2)

1. My teacher can probably see that I am an interesting
person from reading my essays.
2. My teachers usually enjoy reading my essays.
3. My essays often make me feel good.
4. My next essay will be written mainly to please myself.
5. My essays often leave me feeling confused about my own
ideas.
6. My writing will always be poor.
7. No matter how hard I try, my grades don't really improve
much.
8. I usually receive fair grades on my assignments.
10. My next essay will be written mainly to please my teacher.
11. I dislike doing the same thing over and over again.
18. I often get my facts confused.
19. When I feel like doing something I go and do it now.
22. I have trouble remembering names and faces.
28. I am more interested in the details of a job than just getting
it done.
29. I sometimes have trouble communicating with others.
30. I sometimes make decisions too quickly.
31. I like to get one project finished before starting another.
32. I do my best work when I plan ahead and follow the plan.
34. I try to get unpleasant tasks out of the way before I begin
working on more pleasant tasks.
36. I always try to do my best, even if it hurts other people's
feelings.
37. I sometimes hurt other people's feelings without knowing
it.
38. I often let other people's feelings influence my decisions.
39. I am not very good at adapting to changes.
40. I am usually very aware of other people's feelings.
On the basis of what possible theory of personality can the
foregoing statements be associated with a definition of the
successful student? Suppose that some theory is proposed which
offers an unambiguous basis for scaling the items as positive or
negative. What is the relationship of an item like 37 to such a theory?
On the basis of careful thought one might conclude that statement 37
is not a valid description of any possible person, since if such a person
hurt other people's feelings without knowing it, how would he know
it? In such a case the score might be either positively or negatively
related to logical reasoning ability - depending on whether the item is
positively or negatively scaled. Note also the tendency throughout
to place the student in the position of potential double-binds.
Consider item 28 about the details of a job versus getting it done.
Agreeing or disagreeing may be true and false at the same time.
Further, consider the fact that if the items related to the teacher's
attitudes are scaled appropriately (that is, in accord with the teacher's
attitudes about what a successful learner is), the test may be a
measure of the subject's ability to perceive the teacher's attitudes -
i.e., to predict the teacher's evaluation of the subject himself. This
would introduce a high degree of correlation between the personality
measure (the Q-Sort) and the teacher's judgements (the criterion of
whether or not the Q-Sort is a valid measure of the successful student)
- but the correlation would be an artefact (an artificially contrived
result) of the experimental procedure. Or consider yet another
possibility. Suppose the teacher's judgements are actually related to
how well the student understands English - is it not possible that the
Q-Sort task might in fact discriminate among more and less proficient
users of the language? These possibilities might combine to produce
an apparent correlation between the 'personality' measure and the
definition of 'success'.
No statistical correlation (in the sense of Chapter 3 above) is
reported by Brodkey and Shore (1976). They do, however, report a
table of correspondences between Q-sort scores and grades assigned
in the course of study (which themselves are related to the subjective
evaluations of teachers stressing 'reward for effort, rather than
achievement alone', p. 154). Then they proceed to an individual
analysis of exceptional cases: the Q-Sort task is judged as not being
reliable for '5 Orientals, 1 reservation Indian, and 3 students
previously noted as having serious emotional problems' (p. 157). The
authors suggest, 'a general conclusion might be that the Q-sort is not
reliable for Oriental students, who may have low scores but high
grades, and is slightly less reliable for women than men ... [for]
students 30 or older, ... Q-sort scores seemed independent of grades
...' (p. 157). No explanations are offered for these apparently deviant
cases, but the authors conclude nonetheless that 'the Q-sort is on the
way to providing us with a useful predictive tool for screening
Tutorial Program applicants' (p. 158).
In another study, reported in an earlier issue of the same journal,
Chastain (1975) correlated scores on several personality measures
with grades in French, German, and Spanish for students numbering
80, 72, and 77 respectively. In addition to the personality measures
(which included scales purporting to assess anxiety, outgoingness,
and creativity), predictor variables included the verbal and
quantitative sub-scores on the Scholastic Aptitude Test, high school
rank, and prior language experience. Chastain observes, 'surprising
as it may seem, the direction of correlation was not consistent [for test
anxiety]' (p. 160). In one case it was negative (for 15 subjects enrolled
in an audio-lingual French course, -.48), and in two others it was
positive (for the 77 Spanish students, .37; and for the 72 German
students, .21). Chastain suggests that 'perhaps some concern about a
test is a plus while too much anxiety can produce negative results' (p.
160). Is his measure valid? Chastain's measure of Test Anxiety came
from Sarason (1958, 1961). An example item given by Sarason (1958)
is 'While taking an important examination, I perspire a great deal' (p.
340). In his 1961 study, Sarason reports correlations among 13
measures of 'intellectual ability' and the Test Anxiety scale along with
five other measures of personality (all of them subscales on the
Autobiographical Survey). For two separate studies with 326 males
and 412 females (all freshman or sophomore students at the
University of Washington, Seattle), no correlations above .30 were
reported. In fact, Test Anxiety produced the strongest correlations
with high school grade averages (divided into six categories) and with
scores on Cooperative English subtests. The highest correlation was
-.30 between Test Anxiety and the ACE Q (1948, presumably a
widely used test since the author gives only the abbreviation in the
text of the article). These are hardly encouraging validity statistics.
A serious problem is that correlations of above .4 between the
various subscores on the Autobiographical Survey may possibly be
explained in terms of response set. There is no reason for concluding
that Test Anxiety (as measured by the scale of the same name) is a
substantial factor in the variance obtained in the various 'intellectual'
variables. Since in no case did Chastain's other personality variables
account for as much as 10% of the variance in grades, they are not
discussed here. We only note in passing that he is probably correct in
saying that 'course grade may not be synonymous with achievement'
(p. 159) - in fact it may be sensitive to affective variables precisely
because it may involve some affectively based judgement (see
especially the basis for course grades recommended by Brodkey and
Shore, 1976, p. 154).
We come now to the empathy measure used by Guiora, Paluszny,
Beit-Hallahmi, Catford, Cooley, and Dull (1975) and by Guiora and
others. In studying the article by Haggard and Isaacs (1966), where
the original MME (Micro-Momentary Expression) test had its
beginnings, it is interesting to note that for highly skilled judges the
technique adapted by Guiora, et al, had average reliabilities of only
.50 and .55. The original authors (Haggard and Isaacs) suggest that 'it
would be useful to determine the extent to which observers differ in
their ability to perceive accurately rapid changes of facial expressions
and the major correlates of this ability' (p. 164). Apparently, Guiora
and associates simply adapted the test to their own purpose with little
change in its form and without attempting (or at least without
reporting attempts) to determine whether or not it constituted a
measure of empathy.
From their own description of the MME, several problems become
immediately apparent. The subject is instructed to push a button,
which is attached to a recording device, whenever he sees a change in
facial expression on the part of a person depicted on film. The first
obvious trouble is that there is no apparent way to differentiate
between hits and misses - that is, there is no way to tell for sure
whether the subject pushed the button when an actual change was
taking place or merely when the subject thought a change was taking
place. In fact, it is apparently the case that the higher the number of
button presses, the higher the judged empathy of the subject. Isn't it
just as reasonable to assume that an inordinately high rate of button
presses might correspond to a high rate of false alarms? In the data
reported by Haggard and Isaacs, even highly skilled judges were not
able to agree in many cases on when changes were occurring, much
less on the meaning of the changes (the latter would seem to be the
more important indicator of empathy). They observe, 'it is more
difficult to obtain satisfactory agreement when the task is to identify
and designate the impulse or affect which presumably underlies any
particular expression or expression change' (p. 158). They expected to
be able to tell more about the nature and meaning of changes when
they slowed down the rate of the film. However, in that condition (a
condition also used by Guiora and associates, see p. 51) the reliability
was even lower on the average than it was in the normal speed
condition (.50 versus .55, respectively).
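The hit versus false-alarm problem can be restated in signal-detection
terms. The sketch below is a purely hypothetical illustration (the
counts and the scoring function are invented and form no part of the
MME): two subjects with identical button-press totals can differ
sharply in accuracy, which is precisely what a raw count conceals:

    # Hypothetical illustration: raw press counts confound sensitivity
    # with response bias.
    def rates(presses_during_changes, actual_changes,
              presses_elsewhere, quiet_intervals):
        hit_rate = presses_during_changes / actual_changes
        false_alarm_rate = presses_elsewhere / quiet_intervals
        return hit_rate, false_alarm_rate

    # Both subjects press the button 30 times on a film with 40 real changes:
    print(rates(25, 40, 5, 40))   # (0.625, 0.125) - mostly hits
    print(rates(15, 40, 15, 40))  # (0.375, 0.375) - half false alarms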

Since it is ax iomatic (though perhaps not exactl y true for all


em pirical cases) that the validity of a test cannot exceed the squ are of
its reliability, the va lidity estimates for the MME would have to be in
the range of ,25 to .30 - this would be only fo r the case when the test is
a measure ofsomeone's ability to notice changes in facial expressions,
or better, as a meas ure of interjudge agreement on the task ofnolicing
changes in facial expressions. The ex trapolation from such judge·
ments to 'empa th y' as the constrUCt to be measured by th e MME
is a wild leap indeed. No validity estimates a re possible on the basis
of available data for an inferential ju mp of the latter sort.
Another widely used measure of attitudes - one that is somewhat
less direct than questions or statements concerning the attitudes of
the subject toward the object or situation of interest - is the semantic
differential technique which was introduced by Osgood, Suci, and
Tannenbaum (1957) for a wider range of purposes. In fact, they were
interested in the measurement of meaning in a broader sense. Their
method, however, was adapted to attitude studies by Lambert and
Gardner, and by Spolsky (1969a). Several follow-up studies on the
Spolsky research are discussed in Oller, Baca, and Vigil (1977).
Gardner and Lambert (1972) reported the use of seven point scales
of the following type (subjects were asked to rate themselves,
Americans, how they themselves would like to be, French-Americans,
and their French teacher):

SEMANTIC DIFFERENTIAL SCALES, BIPOLAR VARIETY

1. Interesting _:_:_:_:_:_:_ Boring
2. Prejudiced _:_:_:_:_:_:_ Unprejudiced
3. Brave _:_:_:_:_:_:_ Cowardly
4. Handsome _:_:_:_:_:_:_ Ugly
5. Colorful _:_:_:_:_:_:_ Colorless
6. Friendly _:_:_:_:_:_:_ Unfriendly
7. Honest _:_:_:_:_:_:_ Dishonest
8. Stupid _:_:_:_:_:_:_ Smart
9. Kind _:_:_:_:_:_:_ Cruel
10. Pleasant _:_:_:_:_:_:_ Unpleasant
11. Polite _:_:_:_:_:_:_ Impolite
12. Sincere _:_:_:_:_:_:_ Insincere
13. Successful _:_:_:_:_:_:_ Unsuccessful
14. Secure _:_:_:_:_:_:_ Insecure
15. Dependable _:_:_:_:_:_:_ Undependable
16. Permissive _:_:_:_:_:_:_ Strict
17. Leader _:_:_:_:_:_:_ Follower
18. Mature _:_:_:_:_:_:_ Immature
19. Stable _:_:_:_:_:_:_ Unstable
20. Happy _:_:_:_:_:_:_ Sad
21. Popular _:_:_:_:_:_:_ Unpopular
22. Hardworking _:_:_:_:_:_:_ Lazy
23. Ambitious _:_:_:_:_:_:_ Not Ambitious
Semantic differential scales of a unipolar variety were used by
Gardner and Lambert (1972) and by Spolsky (1969a) and others (see
Oller, Baca, and Vigil, 1977). In form they are very similar to the
bipolar scales except that the points of the scales have to be marked
with some value such as 'very characteristic' or 'not at all
characteristic' or possibly 'very much like me' or 'not at all like me'.
Seven point and five point scales have been used.
In an evaluation of attitudes toward the use of a particular
language, Lambert, Hodgson, Gardner, and Fillenbaum (1960) used
a 'matched guise' technique. Fluent French-English bilinguals
recorded material in both languages. The recordings from several
speakers were then presented in random order and subjects were
asked to rate the speakers. (Subjects were, of course, unaware that
each speaker was heard twice, once in English and once in French.)
SEMANTIC DIFFERENTIAL SCALES, UNIPOLAR VARIETY

1. Height    very little _:_:_:_:_:_:_ very much

and so on for the attributes: good looks, leadership, thoughtfulness,
sense of humor, intelligence, honesty, self-confidence, friendliness,
dependability, generosity, entertainingness, nervousness, kindness,
reliability, ambition, sociability, character, and general likability.
Spolsky (1969a) and others have used similar lists of terms
presumably defining personal attributes: helpful, humble, stubborn,
businesslike, shy, nervous, kind, friendly, dependable, and so forth.
The latter scales in Spolsky's studies, and several modeled after his,
were referenced against how subjects saw themselves to be, how they
would like to be, and how they saw speakers of their native language,
and speakers of a language they were in the process of acquiring.
How reliable and valid are the foregoing types of scales? Little
information is available. Spolsky (1969a) reasoned that scales such as
the foregoing should provide more reliable data than those which
were based on responses to direct questions concerning a subject's
agreement or disagreement with a statement rather bald-facedly
presenting a particular attitude bias, or than straightforward
questions about why subjects were studying the foreign language and
the like. The semantic differential type scales were believed to be more
indirect measures of subject attitudes and therefore more valid than
more direct questions about attitudes. The former, it was reasoned,
should be less susceptible to distortion by sensitive respondents.
Data concerning the tendency of scales to correlate in meaningful
ways are about the only evidence concerning the validity of such
scales. For instance, negatively valued scales such as 'stubborn',
'nervous', and 'shy' tend to cluster together (by correlation and factor
analysis techniques), indicating at least that subjects are differentiating
the semantic values of scales in meaningful ways. Similarly, scales
concerning highly valued positive traits such as 'kind', 'friendly',
'dependable', and the like also tend to be more highly correlated with
each other than with very dissimilar traits.
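The kind of clustering check described in this paragraph can be
pictured with a small numerical sketch (the ratings are invented; the
published studies used factor analysis over much larger samples):

    # Hypothetical seven-point ratings: rows are subjects, columns are
    # the scales stubborn, nervous, kind, friendly.
    import numpy as np
    ratings = np.array([
        [6, 5, 2, 1],
        [2, 3, 6, 7],
        [5, 6, 1, 2],
        [1, 2, 7, 6],
    ])
    print(np.round(np.corrcoef(ratings.T), 2))
    # Strong positive correlations within {stubborn, nervous} and within
    # {kind, friendly}; strong negative correlations across the clusters.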
There is also evidence that views of persons of different national,
ethnic, or linguistic backgrounds differ substantially in ways that are
characteristic of known attitudes of certain groups. For instance,
Oller, Baca, and Vigil (1977) report data showing that a group of
Mexican American women in a Job Corps program in Albuquerque
generally rate Mexicans substantially higher than they rate
Americanos (Anglo-Americans) on the same traits. It is conceivable
that such scales could be used to judge the strength of self-concept,
attitude toward other groups, and similar constructs. However, much
more research is needed before such measures are put forth as
measures of particular constructs. Furthermore, they are subject to
all of the usual objections concerning self-reported data.
Little research has been done with the measurement of attitudes in
children (at least this is true in relation to the questions and research
interests discussed in section B above). Recently, however, Zirkel and
Jackson (1974) have offered scales intended for use with children of
Anglo, Black, Native American, and Chicano heritages. These scales
are of the Likert-type (agree versus disagree on a five point scale with
one position for 'don't know'). The innovation in their technique
involves the use of line drawings of happy versus sad faces as shown
in Figure 4. Strickland (1970) may have been the first to use such a
method with children. It is apparently a device for obtaining scaled
responses to attitude objects (such as food, dress, games, well known
personalities who are models of a particular cultural group, and
symbols believed important in the definition of a culture). The scales
are used for non-readers and preliterate children. The Cultural
Attitude Scales exist in four forms (one for each of the above
designated ethnic groups).
Figure 4. Example of a Likert-type attitude scale intended for children.
(From Zirkel (1973), Black American Cultural Attitude Scale. The scale has
five points with a possible 'don't know' answer as represented at the extreme
left - scales are referenced against objects presumed important to defining
cultural attitudes and of a sort that can be pictured easily.)

The Technical Report indicates test-retest reliabilities ranging from
.52 to .61, and validity coefficients ranging from .15 to .46. These are
not impressive if one considers that only about 4% to 25% of the
variance in the scales is apt to be related to the attitudes they purport to
measure.
No figures on reliability or validity are given in the Test Manual.
The authors caution, 'the use of the Cultural Attitude Scales to
diagnose the acculturation of individual children in the classroom is
at this time very precarious' (p. 27). It seems to be implied that
'acculturation' is a widely accepted goal of educational programs,
and this is questionable. Further, it seems to be suggested that the
scales might someday be used to determine the level of acculturation
of individual children - this implication seems unwarranted. There is
not any reason to expect reliable measures of such matters ever to be
forthcoming. Nonetheless, the caution is commendable.
The original studies with the Social Distance Scale (Bogardus,
1925, 1933), from which the Cultural Attitude Scales very indirectly
derive, suggest that stereotypes of outgroups are among the most
stable attitudes and that the original scale was sufficiently reliable and
valid to use with some confidence (see Shaw and Wright, 1967). With
increasing awareness of crosscultural sensitivities, it may be that
measures of social distance would have to be recalibrated for today's
societal norms (if indeed such norms exist and can be defined), but the
original studies and several follow-ups have indicated reliabilities in
the range of .90 and 'satisfactory' validity according to Newcomb
(1950, as cited by Shaw and Wright, 1967, p. 408). The original scales
required the respondent to indicate whether he would marry into,
have as close friends, as fellow workers, as speaking acquaintances, as
visitors to his country, or would debar from visiting his country,
members of a specific minority or designated outgroup. The
Bogardus definition of 'social distance', however, is considerably
narrower than that proposed more recently by Schumann (1976). The
latter is, at this point, a theoretical construct in the making and is
therefore not reviewed here.

D. Observed relationships to achievement and remaining puzzles


The question addressed in this section is, how are affective variables
related to educational attainment in general, and in particular to the
acquisition of a second language? Put differently, what is the nature
and the strength of observed relationships? Gardner (1975) and
Gardner, Smythe, Clement, and Gliksman (1976) have argued that
attitudes are somewhat indirectly related to attainment of proficiency
in a second language. Attitudes, they reason, are merely one of the
types of factors that give rise to motivations, which are merely one of
the types of factors which eventually result in attainment of
proficiency in a second language. By this line of reasoning, attitudes
are considered to be causally related to achievement of proficiency in
a second language even though the relationship is not apt to be a
very strong one.
In a review of some 33 surveys of 'six different grade levels (grades 7
to 12) from seven different regions across Canada' (p. 32) involving
no less than about 2,000 subjects, the highest average correlation
between no less than 12 different attitude scales and two measures of
French achievement in no case exceeded .29 (Gardner, 1975). Thus,
the largest amount of variance in language proficiency scores that
was predicted on the average by the attitude measures was never
greater than about 8% (i.e., .29 squared). This result is not
inconsistent with the claim that the relationship between attitude
measures and attainment in a second language is quite indirect.
However, such a result also leaves open a number of alternative
explanations. It is possible that the weakness of the observed
relationships is due to the unreliability or lack of validity of the
attitude measures. If this explanation were correct, there might be a
much stronger relationship between attitudes and attained
proficiency than has or ever would become apparent using those
attitude measures. Another possibility is that the measures of language
proficiency used are themselves low in reliability or validity. Yet
another possibility is that attitudes do not cause attainment of
proficiency but rather are caused by the degree of proficiency attained
- though weakly perhaps. And many other possibilities exist.
Backman (1976), in a refreshingly different approach to the
assessment of attitudes, offers what she calls the 'chicken or egg'
puzzle. Do attitudes in fact cause behaviors in some way, or are
attitudes rather the result of behaviors? Savignon (1972) showed that
in the foreign language classroom positive attitudes may well be a
result of success rather than a cause.
It seems quite plausible that success in learning a second language
might itself give rise to positive feelings toward the learning situation
and everything (or everyone) associated with it. Similarly, failure
might engender less positive feelings. Yet another plausible
alternative is that attitudes and behaviors may be complexly
interrelated such that both of them influence each other. Bloom
(1976) prefers this latter alternative. Another possibility is that
attitudes are associated with the planning of actions and the
perception of events in some way that influences the availability of
sensory data and thus the options that are perceivable or conceivable
to the doer or learner.
The research of Manis and Dawes (1961) showing that cloze scores
were higher for subjects who agreed with the content of a passage
than for those who disagreed would seem to lend credence to this last
suggestion. Manis and Dawes concluded that it wasn't just that
subjects didn't want to give right answers to items on passages with
which they disagreed, but that they were in fact less able to give right
answers. Such an interpretation would also fit the data from a wide
variety of studies revealing expectancy biases of many sorts.
However, Doerr (in press) raises some experimental questions about
the Manis and Dawes design.
Regardless of the solution to the (possibly unsolvable) puzzle
about what causes what, it is still possible to investigate the strength
of the relationship of attitudes as expressed in response to
questionnaires and scores on proficiency measures of second
language ability for learners in different contexts. It has been
observed that the relationship is apparently stronger in contexts
where learners can avail themselves of opportunities to talk with
representatives of the target language group(s) than it is in contexts
where the learners do not have an abundance of such opportunities.
For instance, a group of Chinese adults in the Southwestern United
States (mostly graduate students on temporary visas) performed
somewhat better on a cloze test in English if they rated Americans as
high on a factor defined by such positive traits as helpfulness,
sincerity, kindness, reasonableness, and friendliness (Oller, Hudson,
and Liu, 1977). Similarly, a group of Mexican-American women
enrolled in a Job Corps program in Albuquerque, New Mexico
scored somewhat higher on a cloze test if they rated themselves higher
on a factor defined by traits such as calmness, conservativeness,
religiosity, shyness, humility, and sincerity (Oller, Baca, and Vigil,
1977). The respective correlations between the proficiency measures
and the attitude factors were .52 and .49.
In the cases of two populations of Japanese subjects learning
English as a foreign language in Japan, weak or insignificant
relationships between similar attitude measures and similar pro-
ficiency measures were observed (Asakawa and Oller, 1978, and
Chihara and Oller, 1978). In the first mentioned studies, where
learners were in a societal context rich in occasions where English
might be used, attitudinal variables seemed somewhat more closely
related to attained proficiency than in the latter studies, where
learners were presumably exposed to fewer opportunities to
communicate in English with representatives of the target language
culture(s).
These observed contrasts between second language contexts (such
as the ones the Chinese and Mexican-American subjects were
exposed to) and foreign language contexts (such as the ones the
Japanese subjects were exposed to) are by no means empirically
secure. They seem to support the hunch that attitudes may have a
greater importance to learning in some contexts than in others, and
the direction of the contrasts is consistently in favor of the second
language contexts. However, the pattern of sociocultural variables in
the situations referred to is sufficiently diverse to give rise to doubts
about their comparability. Further, the learners in the second
language contexts generally achieved higher scores in English.
Probably the stickiest and most persistent difficulty in obtaining
reliable data on attitudes is the necessity to rely on self-reports, or
worse yet, someone else's evaluative and second-hand judgement.
The problem is not just one of honesty. There is a serious question
whether it is reasonable to expect someone to give an unbiased report
of how they feel or think or behave when they are smart enough to
know that such information may in some way be used against them,
but this is not the only problem. There is the question of how reliable
and valid a person's judgements are even when there is no aversive
stimulus or threat to his security. How well do people know how
they would behave in such and such a hypothetical situation? Or,
how does someone know how they feel about a statement that may
not have any relevance to their present experience? Are average
scores on such tasks truly indicative of group tendencies in terms of
attitudes and their supposed correlative behaviors, or are they merely
indicative of group tendencies in responding to what may be
relatively meaningless tasks?
The foregoing questions may be unanswerable for precisely the
same reasons that they are interesting questions. However, there are
other questions that can be posed concerning subjective self-ratings
that are more tractable. For instance, how reliable are the self-ratings
by subjects of their own language skills in a second language, say?
Can subjects reliably judge their own ability in reading, writing, or
speaking and listening tasks? Frequently in studies of attitude, the
measures of attitude are correlated only with subjects' own reports of
how well they speak a certain language in a certain context, with no
objective measures of language skill whatsoever. Or, alternatively,
subjects are asked to indicate when they speak language X and to
whom and in what contexts. How reliable are such judgements?
In the cases where they can be compared to actual tests of language
proficiency, the results are not too encouraging. In a study with four
different proficiency measures (grammar, vocabulary, listening
comprehension, and cloze) and four self-rating scales (listening,
speaking, reading, and writing), in no case did a correlation between a
self-rating scale and a proficiency test reach .60 (there were 123
subjects) - this indicates less than 36% overlap in variance on any of
the self-ratings with any of the proficiency tests (Chihara and Oller,
1978). In another study (Oller, Baca, and Vigil, 1977) correlations
between a single self-rating scale and a cloze test in English scored by
exact and acceptable methods (see Chapter 12) were .33 and .37
respectively (subjects numbered 60).
Techniques which require others to make judgements about the
attitudes of a person or group seem to be even less reliable than self-
ratings. For instance, Jensen (1965) says of the most widely used
'projective' technique for making judgements about personality
variables, 'put frankly, the consensus of qualified judgement is that
the Rorschach is a very poor test and has no practical worth for any
of the purposes for which it is recommended by its devotees' (p. 293).
Careful research has shown that the Rorschach (administered to
about a million persons a year in the 1960s in the U.S. alone,
according to Jensen, at a total cost of 'approximately 25 million
dollars', p. 292) and other projective techniques like it (such as the
Thematic Apperception Test, mentioned at the outset of this chapter)
generate about as much variability across trained judges as they do
across subjects. In other words, when trained psychologists or
psychiatrists use projective interview techniques such as the TAT or
Rorschach to make judgements about the personality of patients or
clients, the judges differ in their judgements about as much as the
patients differ in their judged traits. On the basis of such tests it would
be impossible to tell the difference (even if there was one) between the
level of, say, anxiety in Mr Jones and Mr Smith. In the different
ratings of Mr Jones, he would appear about as different from himself
as he would from Mr Smith.²
In conclusion, the measurement of attitudes does not seem to be a
promising field - though it offers many challenges. Chastain urges in
the conclusion to his 1975 article that 'each teacher should do what he
or she can to encourage the timid, support the anxious, and loose the
creative' (p. 160). One might add that the teacher will probably be far
more capable of determining who is anxious, timid, creative, and we
may add empathetic, aggressive, outgoing, introverted, eager,
enthusiastic, shy, stubborn, inventive, egocentric, fascistic, ethnocentric,
kind, tender, loving, and so on, without the help of the existing
measures that purport to discriminate between such types of
personalities. Teachers will be better off relying on their own
intuitions based on a compassionate and kind-hearted interest in
their students.

² In spite of the now well-known weaknesses of the Rorschach (Jensen, 1965) and the
TAT (Anastasi, 1976), Ervin (1955) used the TAT to draw conclusions about contrasts
between bilingual subjects' performances on the test in each of their two languages.
The variables on which she claimed they differed were such things as 'physical
aggression', 'escaping blame', 'withdrawal', and 'assertions of independence' (p. 391).
She says, 'it was concluded that there are systematic differences in the content of speech
of bilinguals according to the language being spoken, and that the differences are
probably related to differences in social roles and standards of conduct associated with
the respective language communities' (p. 391). In view of recent studies on the validity
of the TAT (for instance, see the remarks by Anastasi, 1976, pp. 565-587), it is doubtful
that Ervin's results could be replicated. Perhaps someone should attempt a similar
study to see if the same pattern of results will emerge. However, as long as the validity
of the TAT and other projective techniques like it is in serious doubt, the results
obtained are necessarily insecure. Moreover, it is not only the validity of such
techniques as measures of something in particular that is in question - their
reliability as measures of anything at all is in question.

KEY POINTS

1. It is widely believed that attitudes are factors involved in the causation of
behavior and that they are therefore important to success or failure in
schools.
2. Attitudes toward self and others are probably among the most
important.
3. Attitudes cannot be directly observed, but must be inferred from
behavior.
4. Usually, attitude tests (or attempts to measure attitudes) consist of
asking a subject to say how he feels about or would respond to some
hypothetical situation (or possibly merely a statement that is believed to
characterize the attitude in question in some way).
5. Although attitude and personality research (according to Buros, 1970)
received more attention from 1940-1970 than any other area of testing,
attitude and personality measures are generally the least valid sort of
tests.
6. One of the essential ingredients of successful research that seems to have
been lacking in many of the attitude studies is the readiness to entertain
multiple hypotheses instead of a particular favored viewpoint.
7. Appeal to the label on a particular 'attitude measure' is not satisfactory
evidence of validity.
8. There are many ways of assessing the validity of a proposed measure of a
particular hypothetical construct such as an attitude or motivational
orientation. Among them are the development of multiple methods of
assessing the same construct and checking the pattern of correlation
between them; checking the attitudes of groups known to behave
differently toward the attitude object (e.g., the institutionalized church,
or possibly the schools, or charitable organizations); and repeated
combinations of the foregoing.
9. For some attitudes or personality traits there may be no adequate
behavioral criterion. For instance, if a person says he is angry, what
behavioral criterion will unfailingly prove that he is not in fact scared
instead of angry?
10. Affective variables that relate to the ways people interact through
language encompass a very wide range of conceivable affective variables.
11. It has been widely claimed that affective variables play an important part
in the learning of second languages. The research evidence, however, is
often contradictory.
12. Guiora, et al tried to predict the ease with which a new system of
pronunciation would be acquired on the basis of a purported measure of
empathy. Their arguments, however, probably have more intuitive
appeal than the research justifies.
13. In two cases, groups scoring lower on Guiora's test of empathy did better
in acquiring a new system of pronunciation than did groups scoring
higher on the empathy test. This directly contradicts Guiora's
hypothesis.
14. Brown (1973) and others have reported that self-concept may be an
important factor in predicting success in learning a foreign or second
language.

15. Gardner and Lambert have argued that integratively motivated learners
should perform better in learning a second language than instrumentally
motivated learners. The argument has much appeal, but the data confirm
every possible outcome - sometimes integratively motivated learners do
excel, sometimes they do not, and sometimes they lag behind
instrumentally motivated learners. Could the measures of motivation be
invalid? There are other possibilities.
16. The fact that attitude scales require subjects to be honest about
potentially damaging information makes those scales suspect of a built-
in distortion factor (even if the subjects try to be honest).
17. A serious moral problem arises in the valuation of the scales. Who is to
judge what constitutes a prejudicial view? An ethnocentric view? A bias?
Voting will not necessarily help. There is still the problem of determining
who will be included in the vote, or who is a member of which ethnic,
racial, national, religious, or non-religious group. It is known that the
meaning of responses to a particular scale is essentially uninterpretable
unless the value of the scale can be determined in advance.
18. Attitude measures are all necessarily indirect measures (if indeed they
can be construed as measures at all). They may, however, be
straightforward questions such as 'Are you prejudiced?' or they may be
cleverly designed scales that conceal their true purpose - for example,
the F or Fascism Scale by Adorno, et al.
19. A possible interpretation of the pattern of correlations for items on the F
Scale and the E Scale (Adorno, et al and Gardner and Lambert) is
response set. Since all of the statements are keyed in the same direction, a
tendency to respond consistently though independently of item content
would produce some correlation among the items and overlap between
the two scales. However, the correlations might have nothing to do with
fascism or ethnocentrism which is what the two scales purport to
measure.
20. It has been suggested that the emotive tone of statements included in
attitude questionnaires may be as important to the responses of subjects
as is the factive content of those same statements.
21. As long as the validity of purported attitude measures is in question, their
pattern of interrelationship with any other criteria (say, proficiency
attained in a second language) remains essentially uninterpretable.
22. Concerning the sense of lostness presumably measured by the Anomie
Scale, all possible predictions have been made in relation to attitudes
toward minorities and outgroups. The scale is moderately correlated
with the E and F Scales in the Gardner and Lambert data which may
merely indicate that the Anomie Scale too is subject to a response set
factor.
23. Experts rarely recommend the use of personality inventories as a basis
for decisions that will affect individuals.
24. A study of anxiety by Chastain (1975) revealed conflicting results: for
one group higher anxiety was correlated with better instead of worse
grades.
25. Patterns of correlation between scales of the semantic differential type as
applied to attitude measurement indicate some possible validity.
Clusters of variables that are distilled by factor analytic techniques show
semantically similar scales to be more closely related to each other than
to semantically dissimilar scales.
26. Attitude scales for children of different ethnic backgrounds have recently
been developed by Zirkel and Jackson (1974). No validity or reliability
information is given in the Test Manual. Researchers and others are
cautioned to use the scales only for group assessment - not for decisions
affecting individual children.
27. There may be no way to determine the extent to which attitudes cause
behavior or are caused by it, or both. The nature of the relationship may
differ according to context.
28. There is some evidence that attitudes may be more closely associated
with second language attainments in contexts that are richer in
opportunities for communication in the target language than in contexts
that afford fewer opportunities for such interaction.
29. In view of all of the research, teachers are probably better off relying on
their own compassionate judgements than on even the most highly
researched attitude measures.

DISCUSSION QUESTIONS
1. Reflect back over your educational experience. What factors would you
identify as being most important in your own successes and failures in
school settings? Consider the teachers you have studied under. How
many of them really influenced you for better or for worse? What specific
events can you point to that were particularly important to the
inspirations, discouragements, and day-to-day experiences that have
characterized your own education? In short, how important have
attitudes been in your educational experience and what were the causes
of those attitudes in your judgement?
2. Read a chapter or two from B. F. Skinner's Verbal Behavior (1957) or
Beyond Freedom and Dignity (1971) or better yet, read all of his writings
on one or the other topics. Discuss his argument about the dispensability
of intervening variables such as ideas, meaning, attitudes, feelings, and
the like. How does his argument apply to what is said in Chapter 5 of this
book?
3. What evidences would you accept as indicating a belligerent attitude? A
cheerful outgoing personality? Can such evidences be translated into
more objective or operational testing procedures?
4. Discuss John Platt's claims about the need for disproof in science. Can
you make a list of potentially falsifiable hypotheses concerning the
nature of the relationship between attitudes and learning? What would
you take as evidence that a particular view had indeed been disproved?
Why do you suppose that disproofs are so often disregarded in the
literature on attitudes? Can you offer an explanation for this? Are
intuitions concerning the nature and effects of attitudes apt to be less
reliable than tests which purport to measure attitudes?
5. Pick a test that you know of that is used in your school and look it up in
the compendia by Buros (see the bibliography at the end of this book).
What do the reviewers say concerning the test? Is it reliable? Does it have
a substantial degree of validity? How is it used in your school? How does
the use of the test affect decisions concerning children in the school?
6. Check the school files to see what sorts of data are recorded there on the
results of personality inventories (The Rorschach? The Thematic
Apperception Test?). Do a correlation study to determine what is the
degree of relationship between scores on available personality scales and
other educational measures.
7. Discuss the questions posed by Cooper and Fishman (p. 118f, this
chapter). What would you say they reveal about the state of knowledge
concerning the nature and effects of language attitudes and their
measurement?
8. Consider the moral problem associated with the valuation of attitude
scales: Brodkey and Shore (1976) say that 'A student seems to exhibit an
enjoyment of writing for its own sake, enjoyment of solitary work, a
rejection of Puritanical constraints, a good deal of self-confidence, and
sensitivity in personal relationships' (p. 158). How would you value the
statements on the Q-sort p. 130 above with respect to each of the
attitudinal or personality constructs offered by Brodkey and Shore?
What would you recommend the English Tutorial Program do with
respect to subjects who perform 'poorly' by someone's definition on the
Q-sort? The Native American? The subjects over 30? Orientals?
9. Discuss with a group the valuation of the scale on the statement: 'Human
nature being what it is, there will always be war and conflict.' Do you
believe it is moral and just to say that a person who agrees strongly with
this statement is to that extent fascistic or authoritarian or prejudiced?
(See the F Scale, p. 122f.)
10. Consider the meaning of disagreement with the statement: 'In this
country, it's whom you know, not what you know, that makes for
success.' What are some of the bases of disagreement? Suppose subjects
think success is not possible? Will they be apt to agree or disagree with
the statement? Is their agreement or disagreement characteristic of
Anomie in your view? Suppose a person feels that other factors besides
knowing the right people are more important to success. What response
would you expect him to give to the above statement on the Anomie
scale? Suppose a person felt that success was inevitable for certain types
of people (such a person might be regarded as a fatalist or an unrealistic
optimist). What response would you predict? What would it mean?
11. Pick any statement from the list given from the Q-sort on p. 130.
Consider whether in your view it is indicative of a person who is likely to
be a good student or not a good student. Better yet, take all of the
statements given and rank them from most characteristic of good
students to least characteristic. Ask a group of teachers to do the same.
Compare the rank orderings for degree of agreement.
12. Repeat the procedure of question 11 with the extremes of the scale
defined in terms of Puritanism (or, say, positive self concept) versus non-
Puritanism (or, say, negative self concept) and again compare the rank
orders achieved.
13. Consider the response you might give to a statement like: 'While taking
an important examination, I perspire a great deal.' What factors might
influence your degree of agreement or disagreement independently of the
degree of anxiety you may or may not feel when taking an important
examination? How about room temperature? Body chemistry (some
people normally perspire a great deal)? Your feelings about such a
subject? Your feelings about someone who is brazen enough to ask such
a socially obtuse question? The humor of mopping your brow as you
mark the spot labeled 'never' on the questionnaire?
14. What factors do you feel enter into the definition of empathy? Which of
those are potentially measurable by the MME? (See the discussion in the
text on pp. 131-4.)
15. How well do you think semantic differential scales, or Likert-type
(agree-disagree) scales in general can be understood by subjects who
might be tested with such techniques? Consider the happy-sad faces and
the neutral response of 'I don't know' indicated in Figure 4. Will children
who are non-literate (pre-readers) be able to perform the task
meaningfully in your judgement? Try it out on some relatively non-
threatening subject such as recess (play-time, or whatever it is called at
your school) compared against some attitude object that you know the
children are generally less keen about (e.g., arithmetic? reading? sitting
quietly?). The question is can the children do the task in a way that
reflects their known preferences (if indeed you do know their
preferences)?
16. How reliable and consistent are self-reports about skills, feelings,
attitudes, and the like in general? Consider yourself or persons whom
you know well, e.g., a spouse, sibling, room-mate, or close friend. Do you
usually agree with a person's own assessment of how he feels about such
and such? Are there points of disagreement? Do you ever feel you have
been right in assessing the feelings of others, who claimed to feel
differently than you believed they were actually feeling? Has someone
else ever enlightened you on how you were 'really' feeling? Was he
correct? Ever? Do you ever say everything is fine when in fact it is lousy?
What kinds of social factors might influence such statements? Is it a
matter of honesty or kindness or both or neither in your view?
17. Suppose you have the opportunity to influence a school board or other
decision-making body concerning the use or non-use of personality tests
in schools. What kinds of decisions would you recommend?

SUGGESTED READINGS
1. Anne Anastasi, 'Personality Tests,' Part 5 of the book by the same
author entitled Psychological Testing (4th ed). New York: Macmillan,
1976, 493-621. (Much of the information contained in this very
thorough book is accessible to the person not trained in statistics and
research design. Some of it is technical, however.)
2. H. Douglas Brown, 'Affective Variables in Second Language
Acquisition,' Language Learning 23, 1973, 231-44. (A thoughtful
discussion of affective variables that need to be more thoroughly
studied.)

3. Robert L. Cooper and Joshua A. Fishman, 'Some Issues in the Theory
and Measurement of Language Attitudes,' in L. Palmer and B. Spolsky
(eds.) Papers on Language Testing: 1967-1974. Washington, D.C.:
Teachers of English to Speakers of Other Languages, 187-98.
4. John Lett, 'Assessing Attitudinal Outcomes,' in June K. Phillips (ed.)
The Language Connection: From the Classroom to the World. ACTFL
Foreign Language Education Series. New York: National Textbook, in
press.
5. Paul Pimsleur, 'Student Factors in Foreign Language Learning: A
Review of the Literature,' Modern Language Journal 46, 1962, 160-9.
6. Sandra J. Savignon, Communicative Competence. Montreal: Marcel
Didier, 1972.
7. John Schumann, 'Affective Factors and the Problem of Age in Second
Language Acquisition,' Language Learning 25, 1975, 209-35. (Follows
up on Brown's discussion and reviews much of the recent second
language literature on the topic of affective variables - especially the
work of Guiora, et al.)
PART TWO
Theories and Methods
of Discrete Point Testing
6

Syntactic Linguistics
as a Source for
Discrete Point Methods
A. From theory to practice,
exclusively?
B. Meaning-less structural analysis
C. Pattern drills without meaning
D. From discrete point teaching to
discrete point testing
E. Contrastive linguistics
F. Discrete elements of discrete
aspects of discrete components of
discrete skills - a problem of
numbers

This chapter explores the effects of the linguistic theory that
contended language was primarily syntax-based - that meaning
could be dispensed with. That theoretical view led to methods of
language teaching and testing that broke language down into ever so
many little pieces. The pieces and their patterns were supposed to be
taught in language classrooms and tested in the discrete items of
discrete sections of language tests. The question is whether language
can be treated in this way without destroying its essence. Humpty
Dumpty illustrated that some things, once they are broken apart, are
exceedingly difficult to put together again. Here we examine the
theoretical basis of taking language apart to teach and test it piece by
piece. In Chapter 8, we will return to the question of just how feasible
this procedure is.

A. From theory to practice, exclusively?


Prevailing theories about the nature of language influence theories
about language learning which in their turn influence ways of
teaching and testing language. As Upshur observed (1969a) the
direction of the influence is usually from linguistic theory to learning
theory to teaching methods and eventually to testing. As a result of
the direction of influence, there have been important time lags -
changes in theory at one end take a long time to be realized in changes
at the other end. Moreover, just as changes in blueprints are easier
than changes in buildings, changes in theories have been made often
without any appreciable change in tests.
Although the chain of influence is sometimes a long and indirect
one, with many intervening variables, it is possible to see the
unmistakable marks of certain techniques of linguistic analysis not
only on the pattern drill techniques of teaching that derive from those
methods of analysis, but also on a wide range of discrete point testing
techniques.
The unidirectional influence from theory to practice is not healthy.
As John Dewey put it many years ago:
That individuals in every branch of human endeavor be
experimentalists engaged in testing the findings of theorists is
the sole guarantee for the sanity of the theorist (1916, p. 442).

Language theorists are not immune to the bite of the foregoing
maxim. In fact, because it is so easy to speculate about the nature of
language, and because it has been such an immensely popular
pastime with philosophers, psychologists, logicians, linguists,
educators, and others, theories of language - perhaps more than
other theories - need to be constantly challenged and put to the test in
every conceivable laboratory.
Surely the findings of classroom teachers (especially language
teachers or teachers who have learned the importance of language to
all aspects of an educational curriculum) are as important to the
theories of language as the theories themselves are to what happens in
the classroom. Unfortunately, the direction of influence has been
much too one-sided. Too often the teacher is merely handed untried
and untested materials that some theory says ought to work - too
often the materials don't work at all and teachers are left to invent
their own curricula while at the same time they are expected to
perform the absorbing task of delivering it. It's something like trying
to write the script, perform it, direct the production, and operate the
theater all at the same time. The incredible thing is that some teachers
manage surprisingly well.
If the direction of influence between theory and practice were
mutual, the interaction would be fatal to many of the existing
theories. This would wound the pride of many a theorist, but it would
generally be a healthy and happy state of affairs. As we have seen,
empirical advances are made by disproofs (Platt, 1964, citing Bacon).
They are not made by supporting a favored position - perhaps by
refining a favored position some progress is made, but what progress
is there in empirical research that merely supports a favored view
while pretending that there are no plausible competing alternatives?
The latter is not empirical research at all. It is a form of idol worship
where the theory is enshrined and the pretence of research is merely a
form - a ritual. Platt argues that a theory which cannot be 'mortally
endangered' is not alive. We may add that empirical research that
does not mortally endanger the hypotheses (or theories) to which it is
addressed is not empirical research at all.
How have linguistic theories influenced theories about language
learning and subsequently (or simultaneously in some cases)
methods of language teaching and testing? Put differently, what are
some of the salient characteristics of theoretical views that have
influenced practices in language teaching and testing? How have
discrete point testing methods, in particular, found justification in
language teaching methods and linguistic theories? What are the
most important differences between pragmatic testing methods and
discrete point testing methods?
The crux of the issue has to do with meaning. People use language
to inform others, to get information from others, to express their
feelings and emotions, to analyze and characterize their own
thoughts in words, to explain, cajole, reply, explore, incite, disturb,
encourage, plan, describe, promise, play, and much much more. The
crucial question, therefore, for any theory that claims to be a theory
of natural language (and as we have argued in Part One, for any test
that purports to assess a person's ability to use language) is how it
addresses this characteristic feature of language - the fact that
language is used in meaningful ways. Put somewhat differently,
language is used in ways that put utterances in pragmatic
correspondences with extra-linguistic contexts. Learning a language
involves discovering how to create utterances in accord with such
pragmatic correspondences.

B. Meaning-less structural analysis


We will briefly consider how the structural linguistics of the
Bloomfieldian era dealt with the question of meaning and then we will
consider how language teaching and eventually language testing
methodologies were subsequently affected. Bloomfield (1933, p. 139)
defined the meaning of a linguistic form as the situation in which the
speaker utters it, and the response it calls forth in the hearer. The
behavioristic motivation for such strict attention to observables will
be obvious to anyone familiar with the basic tenets of behaviorism
(see Skinner, 1953, and 1957). There are two major problems,
however, with such a definition of meaning. For one it ignores the
inferential processes that are always involved in the association of
meanings with linguistic utterances, and for another it fails to take
account of the importance of situations and contexts that are part of
the history of experience that influence the inferential connection of
utterances to meanings.
The greatest difficulties of the Bloomfieldian structuralism,
however, arise not directly from the definition of meaning that
Bloomfield proposed, but from the fact that he proposed to disregard
meaning in his linguistic theory altogether. He argued that 'in order
to give a scientifically accurate definition of meaning we should have
to have a scientifically accurate knowledge of everything in the
speaker's world' (p. 139). Therefore, he contended that meaning
should not constitute any part of a scientific linguistic analysis. The
implication of his definition was that since the situations which
prompt speech are so numerous, the number of meanings of the
linguistic units which occur in them must consequently be so large as
to render their description infeasible. Hence, Bloomfieldian
linguistics tried to set up inventories of phonemes, morphemes, and
certain syntactic patterns without reference to the ways in which
those units were used in normal communication.
What Bloomfield appeared to overlook was the fact that the
communicative use of language is systematic. If it were not, people
could not communicate as they do. While it may be impossible to
describe each of an infinitude of situations, just as it is impossible to
count up to even the lowest order of infinities, it is not impossible in
principle to characterize a generative system that will succeed where
simple enumeration fails. The problem of language description (or
better the characterization of language) is not a problem of merely
enumerating the elements of a particular analytical paradigm (e.g.,
the phonemes or distinctive sounds of a given language). The
problem of characterizing language is precisely the one that
Bloomfield ruled out of bounds - namely, how people learn and use
language meaningfully.

What effect would such thinking have on language teaching and
eventually on language testing? A natural prediction would be that it
ought to lead to a devastating disregard for meaning in the pragmatic
sense of the word. Indeed it did. But, before we examine critically
some of the debilitating effects on language teaching it is necessary to
recognize that Bloomfield's deliberate excision of meaning from
linguistic analyses was not a short-lived nor narrowly parochial
suggestion - it was widely accepted and persisted well into the 1970s
as a definitive characteristic of American linguistics. Though
Bloomfield's limiting assumption was certainly not accepted by all
American linguists and was severely criticized or ignored in certain
European traditions of considerable significance,¹ his particular
variety of structural linguistics was the one that unfortunately was to
pervade the theories and methods of language teaching in the United
States for the next forty or so years (with few exceptions, in fact, until
the present).
The commitment to a meaning-less linguistic analysis was
strengthened by Zellig Harris (1951) whose own thinking was
apparently very influential in certain similar assumptions of Noam
Chomsky, a student of Harris. Harris believed that it would be
possible to do linguistic analyses on the basis of purely formal criteria
having to do with nothing except the observable relationships
between linguistic elements and other linguistic elements. He said:
The whole schedule of procedures ... which is designed to begin
with the raw data and end with a statement of grammatical
structure, is essentially a twice made application of two major
steps: the setting up of elements and the statement of the
distribution of these elements relative to each other. The
elements are determined relatively to each other, and on the
basis of the distributional relations among them (Harris, 1951,
p. 61).
There is a problem, however, with Harris's method. How will the
first element be identified? It is not possible to identify an unidentified
element on the basis of its yet to be discovered relationships to other
yet to be discovered elements. Neither is it conceivable as Harris
recommends (1951, p. 7) that 'this operation can be carried out ...
only if it is carried out for all elements simultaneously'. If it cannot

1 For instance, Edward Sapir (1921) was one of the Americans who was not willing to
accept Bloomfield's limiting assumption about meaning. The Prague School of
linguistics in Czechoslovakia was a notable European stronghold which was little
enamored with Bloomfield's formalism (see Vachek, 1966 on the Prague group).
work for one unidentified element, how can it possibly work for all of
them? The fact is that it cannot work at all. Nor has anyone ever
successfully applied the methods Harris recommended. It is not a mere
procedural difficulty that Harris's proposals run aground on, it is the
procedure itself that creates the difficulty. It is intrinsically
unworkable and viciously circular (Oller, Sales, and Harrington,
1969).
Further, how can unidentified elements be defined in terms of
themselves or in terms of their relationships to other similarly
undefined elements? We will see below that Harris's recommendations
for a procedure to be used in the discovery of the
grammatical elements of a language, however indirectly and through
whatever chain of inferential steps, have been nearly perfectly
translated into procedures for teaching languages in classroom
situations - procedures that work about as well as Harris's methods
worked in linguistics.
Unfortunately, Bloomfield's limiting assumption about meaning
did not end its influence in the writings of Zellig Harris, but persisted
right on through the Chomskyan revolution and into the early 1970s.
In fact, it persists even today in teaching methods and standardized
instruments for assessing language skills of a wide variety of sorts as
we will see below.
Chomsky (1957) found what he believed were compelling reasons
for treating grammar as an entity apart from meaning. He said:
I think that we are forced to conclude that grammar is
autonomous and independent of meaning (p. 17).
and again at the conclusion to his book Syntactic Structures:
Grammar is best formulated as a self-contained study
independent of semantics (p. 106).
He was interested in
attempting to describe the structure of language with no explicit
reference to the way in which this instrument is put to use (p.
103).
Although Chomsky stated his hope that the syntactic theory he
was elaborating might eventually have 'significant interconnections
with a parallel semantic theory' (p. 103), his early theorizing made no
provision for the fact that words and sentences are used for
meaningful purposes - indeed that fact was considered, only to be
summarily disregarded. Furthermore, he later contended that the

communicative function of language was subsidiary and derivative -
that language as a syntactically governed system had its real essence
in some kind of 'inner totality' (1964, p. 58) - that native speakers of a
language are capable of producing 'new sentences that are
immediately understood by other speakers although they bear no
physical resemblance to sentences which are "familiar"' (Chomsky,
1966, p. 4).
The hoped for 'semantic theory' which Chomsky alluded to in
several places seemed to have emerged in 1963 when Katz and Fodor
published 'The Structure of a Semantic Theory'. However, they too
contended that a speaker was capable of producing and understanding
indefinitely many sentences that were 'wholly novel to him' (their
italics, p. 481). This idea, inspired by Chomsky's thinking, is an
exaggeration of the creativity of language - or an understatement
depending on how the coin is turned.
If everything about a particular utterance is completely new, it is
not an utterance in any natural language, for one of the most
characteristic facts about utterances in natural languages is that they
conform to certain systematic principles. By this interpretation, Katz
and Fodor have overstated the case for creativity. On the other hand,
for everything about a particular utterance to be completely novel,
that utterance would conform to none of the pragmatic constraints or
lower order phonological rules, syntactic patterns, semantic values
and the like. By this rendering, Katz and Fodor have underestimated
the ability of the language user to be creative within the limits set by
his language.
The fact is that precisely because utterances are used in
communicative contexts in particular correspondences to those
contexts, practically everything about them is familiar - their
newness consists in the fact that they constitute new combinations of
known lexical elements and known sequences of grammatical
categories. It is in this sense that Katz and Fodor's remark can be
read as an understatement. The meaningful use of utterances in
discourse is always new and is constantly a source of information and
meaning that would otherwise remain undisclosed.
The continuation of Bloomfield's limiting assumption about
meaning was only made quite clear in a footnote to An Integrated
Theory of Linguistic Descriptions by Katz and Postal (1964). In spite
of the fact that they claimed to be integrating Chomsky's syntactic
theory with a semantic one, they mentioned in a footnote that 'we
exclude aspects of sentence use and comprehension that are not
explicable through the postulation of a generative mechanism as the
reconstruction of the speaker's ability to produce and understand
sentences. In other words, we exclude conceptual features such as the
physical and sociological setting of utterances, attitudes, and beliefs
of the speaker and hearer, perceptual and memory limitations, noise
level of the settings, etc.' (p. 4). It would seem that in fact they
excluded just about everything of interest to an adequate theory of
language use and learning, and to methods of language teaching and
testing.
It is interesting to note that by 1965, Chomsky had wavered from
his originally strong stand on the separation of grammar and
meaning to the position that 'the syntactic and semantic structure of
natural languages evidently offers many mysteries, both of fact and of
principle, and any attempt to delimit these domains must certainly be
quite tentative' (p. 163). In 1972, he weakened his position still further
(or made it stronger from a pragmatic perspective) by saying, 'it is not
clear at all that it is possible to distinguish sharply between the
contribution of grammar to the determination of meaning, and the
contribution of so-called "pragmatic considerations", questions of
fact and belief and context of utterance' (p. 111).
From the argument that grammar and meaning were clearly
autonomous and independent, Chomsky had come a long way
indeed. He did not correct the earlier errors concerning Bloomfield's
assumption about meaning, but at least he came to the position of
questioning the correctness of such an assumption. It remains to be
seen how long it will take for his relatively recently acquired
skepticism concerning some of his own widely accepted views to filter
back to the methods of teaching and testing for which his earlier views
served as after-the-fact supports if not indeed foundational pillars. It
may not have been Chomsky's desire to have his theoretical thinking
applied as it has been (see his remarks at the Northeast Conference,
1966, reprinted in 1973), but can anyone deny that it has been applied
in such ways? Moreover, though some of his arguments may have
been very badly misunderstood by some applied linguists, his
argument about the autonomy of grammar is simple enough not to be
misunderstood by anyone.

C. Pattern drills without meaning


What effects, then, have the linguistic theories briefly discussed above
had on methods of language teaching and subsequently on methods
of language testing? The effects are direct, obvious, and unmistakable.
From meaning-less linguistic analysis comes meaning-less
pattern drill to instill the structural patterns or the distributional
'meanings' of linguistic forms as they are strung together in
utterances. In the Preface (by Charles C. Fries) to the first edition of
English Sentence Patterns (see Lado and Fries, 1957), we read:
The 'grammar' lessons here set forth ... consist basically of
exercises to develop habits
What kinds of habits? Exactly the sort of habits that Harris believed
were the essence of language structure and that his 'distributional
analysis' aimed to characterize. The only substantial difference was
that in the famed Michigan approach to teaching English as a second
or foreign language, it was the learner in the classroom who was to
apply the distributional discovery procedure (that is the procedure
for putting the elements of language in proper perspective in relation
to each other). Fries continues:
The habits to be learned consist of patterns or molds in which
the 'words' must be grasped. 'Grammar' from the point of view
of these materials is the particular system of devices which a
language uses to signal one of its various layers of meaning -
structural meaning (...). 'Knowing' this grammar for practical
use means being able to produce and to respond to these signals
of structural meaning. To develop such habits efficiently
demands practice and more practice, especially oral practice.
These lessons provide the exercises for a sound sequence of such
practice to cover a basic minimum of production patterns in
English (p. v).
In his Foreword to the later edition, Lado suggests,
The lessons are most effective when used simultaneously with
English Pattern Practices, which provides additional drill for
the patterns introduced here (Lado and Fries, 1957, p. iii).
In his introduction to the latter mentioned volume, Lado continues,
concerning the Pattern Practice Materials, that
they represent a new theory of language learning, the idea that
to learn a new language one must establish orally the patterns of
the language as subconscious habits. These oral practices are
directed specifically to that end. (His emphasis, Lado and Fries,
1958, p. xv)
... in these lessons, the student is led to practise a pattern,
changing some element of that pattern each time, so that
normally he never repeats the same sentence twice.
Furthermore, his attention is drawn to the changes, which are
stimulated by pictures, oral substitutions, etc., and thus the
pattern itself, the significant framework of the sentence, rather
than the particular sentence, is driven intensively into his habit
reflexes.
It would be false to assume that Pattern Practice, because it
aims at habit formation, is unworthy of the educated mind,
which, it might be argued, seeks to control language through
conscious understanding. There is no disagreement on the value
of having the human mind understand in order to be at its
learning best. But nothing could be more enslaving and
therefore less worthy of the human mind than to have it chained
to the mechanics of the patterns of the language rather than free
to dwell on the message conveyed through language. It is
precisely because of this view that we discover the highest
purpose of pattern practice: to reduce to habit what rightfully
belongs to habit in the new language, so that the mind and the
personality may be freed to dwell in their proper realm, that is,
on the meaning of the communication rather than the
mechanics of the grammar (pp. xv-xvi).
And just how do these pattern practices work? An example or two
will display the principle adequately. For instance, here is one from
Lado and Fries (1957):
Exercise 1c.1. (To produce affirmative short answers.) Answer
the questions with YES, HE IS; YES, SHE IS; YES, IT IS; ... For
example:
Is John busy? YES, HE IS.
Is the secretary busy? YES, SHE IS.
Is the telephone busy? YES, IT IS.
Am I right? YES, YOU ARE.
Are you and John busy? YES, WE ARE.
Are the students homesick? YES, THEY ARE.
Are you busy? YES, I AM.
(Continue:)
1. Is John busy?
2. Is the secretary busy?
3. Is the telephone busy?
4. Are you and John busy?
5. Are the students homesick?
6. Are you busy?
7. Is the alphabet important?
8. Is Mary tired?
9. Is she hungry?
10. Are you tired?
11. Is the teacher right?
12. Are the students busy?
13. Is the answer correct?
14. Am I right?
15. Is Mr. Brown a doctor?
Suppose that the well-meaning student wants to discover the
meaning of the sentences that are presented as Lado and Fries
suggest. How could it be done? How, for instance, will it be possible
for the learner to discover the differences between a phone being busy
and a person being busy? Or between being a doctor and being
correct? Or between being right and being tired? Or between the
appropriateness of asking a question like, 'Are you hungry?' on
certain occasions, but not 'Are you the secretary busy?'
While considering these questions, consider a series of pattern
drills selected more or less at random from a 1975 text entitled From
Substitution to Substance. The authors purport to take the learner
from 'manipulation to communication' (Paulston and Bruder, 1975,
p. 5). This particular series of drills (supposedly progressing from
more manipulative and mechanical types of drills to more meaningful
and communicative types) is designed to teach adverbs that involve
the manner in which something is done as specified by with plus a
noun phrase:
Model: C [Cue] He opened the door with a key.
can/church key
R [Response] He opened the can with a church key.
T [Teacher says] S [Student responds]
bottle/opener He opened the bottle with an opener.
box/his teeth He opened the box with his teeth.
letter/knife He opened the letter with a knife.

M1 [Mechanical type of practice]
Teaching Point: Contrast WITH + N/BY + N
Model: C [Cue] He used a plane to go there.
R [Response] He went there by plane.
C He used his teeth to open it.
R He opened it with his teeth.
T [Teacher] He used a telegram to answer it.
S [Student] He answered it by telegram.
T He used a key to unlock it.
S He unlocked it with a key.
T He used a phone to contact her.
S He contacted her by phone.
T He used a smile to calm them.
S He calmed them with a smile.
T He used a radio to talk to them.
S He talked to them by radio.

M2 [Meaningful drill according to the authors]
Teaching Point: Use of HOW and Manner Adverbials
Model: C [Cue] open a bottle
R1 [one possible response] How do you open a bottle?
R2 [another] (With an opener.)
C [Cue] finance a car
R1 How do you finance a car?
R2 (With a loan from the bank.)
(By getting a loan.)
T [Teacher]
light a fire
sharpen a pencil
make a sandwich
answer a question

C [Communicative drill according to the authors]
Teaching Point: Communicative Use
Model: C [Cue] How do you usually send letters to your country?
R (By airmail.) (By surface mail.)
C How does your friend listen to your problems?
R (Patiently.) (With a smile.)
T [Teacher] How do you pay your bills here?
How do you find your apartment here?
How will you go on your next vacation?
How can I find a good restaurant?
In the immediately foregoing pattern drills the point of the exercise
is specified in each case. The pattern that is being drilled is the only
motivation for the collection of sentences that appears in a particular
drill. That is why at the outset of this section we used the term
'syntactically-based pattern drills'. The drills that are selected are not
exceptional in any case. They are in fact characteristic of the major
texts in English as a second language and practically all of the texts
produced in recent years for the teaching of foreign languages (in
ESL/EFL see National Council of Teachers of English, 1973, Bird
and Woolf, 1968, Nadler, Marelli, and Haynes, 1971, Rutherford,
1968, Wright and McGillivray, 1971, Rand, 1969a, 1969b, and many
others). They all present learners with lists of sentences that are
similar in structure (though not always identical, as we will see
shortly) but which are markedly different in meaning.
How will the learner be able to discover the differences in meaning
between such similar forms? Of course, we must assume that the
learner is not already a native speaker - otherwise there would be no
point in studying ESL/EFL or a foreign language. The native speaker
knows the fact that saying, 'He used a plane to go there,' is a little less

natural than say ing , 'He went there in an airplane,' or 'He flew,' but
how will the non-n ati ve speaker discover such things on t he basis of
the information that can be made available in the drill') Perhaps the
authors of the drill are expecting the teacher to act out each sentence
in some way, or creatively to invent a meaningful context for each
sentence as it comes up. Anyone who has tried it knows that the hour
gets by before you ca n makeit through just a few sentences. Inve nting
contexts fo r sente nces like, 'Is the alphabet important ?' and 41s the
telephone busy '!' is like trying to wri te a slo ry where the characters,
the plot, and all oftbe backdrops cbange from one second to the next.
It is not just difficult to conceive of a context in which studen ts can be
homesick, alphabets important, the telephone, the secretary, and
John busy, Mary tired, me right, Brown a doctor and so on, but
before you get to page two the task is impossible.
If it is difficult for a native speaker to perform the task of inventing
contexts for such bizarre collections of utterances, why should
anyone expect a non-native speaker who doesn't know the language
to be able to do it? The simple truth is that they cannot do it. It is no
more possible to learn a language by such a method than it is to
analyze a language by the form of distributional analysis proposed by
Harris. It is necessary to get some data in the form of pragmatic
mappings of utterances onto meaningful contexts - failing that it is
not possible either to analyze a language adequately or to learn one at
all.
Worse yet, the typical (not the exceptional, but the ordinary
everyday garden variety) pattern drill is bristling with false leads
about similarities that are only superficial and will lead almost
immediately to unacceptable forms - the learner of course won't
know that they are unacceptable because he is not a native speaker of
the language and has little or no chance of ever discovering where he
went wrong. The bewildered learner will have no way of knowing that
for a person to be busy is not like a telephone being busy. What
information is there to prevent the learner from drawing the
reasonable conclusion that if telephones can be busy in the way that
people can, that televisions, vacuum cleaners, telegraphs, and
typewriters can be busy in the same sense? What will keep the learner
from having difficulty distinguishing the meanings of alphabet,
telephone, secretary, and doctor if he doesn't already know the
meanings of those words? If the learner doesn't already know that a
doctor, a secretary, and a teacher are phrases that refer to people with
different occupational statuses, how will the drill help him to discover
this information? What, from the learner's point of view is different
about homesick and a doctor? What will prevent the learner from
saying Are the students doctor and Is the alphabet busy?
Meaning would prevent such absurdities, but the nature of the
pattern drill encourages them. It is an invitation to confusion.
Without meaning to help keep similar forms apart, the result is
simple. They cannot be kept apart - they become mixed together
indiscriminately. This is not because learners are lacking in
intelligence, rather it is because they are in fact quite intelligent and
they use their intelligence to classify similar things together and to
keep different things apart. But how can they keep different things
apart when it is only the superficial similarities that have been
constantly called to their attention in a pattern drill?
The drills proposed by Paulston and Bruder are more remarkable
than the ones by Lado and Fries, because the Paulston-Bruder drills
are supposed to become progressively more meaningful - but they do
not. They merely become less obviously structured. The responses
and the stimuli to elicit them don't become any more meaningful. The
responses merely become less predictable as one progresses through
the series of drills concerning each separate point of grammatical
structure.
Who but a native speaker of English will know that opening a door
with a key is not very much like opening a can with a church key? The
two sentences are alike in fact primarily in terms of the way they
sound. If the non-native knew the meanings before doing the drill
there would be no need for doing the drill - but if he does need the
drill it will do him absolutely no good and probably some harm.
What will keep him from saying, He opened the can with a hand? Or,
He opened the bottle with a church key? Or, He opened the car with a
door? Or, He opened the faucet with a wrench? If he can say, He opened
the letter with a letter opener, why not, He opened the box with a box
opener? Or, He opened the plane with a plane opener? If the learner is
encouraged to say, He used a key to unlock it, why not, He used a letter
to write her, or He used a call to phone her?
In the drill above labelled M1, which the authors describe as a
mechanical drill, the object is to contrast phrases like with a knife and
by telegram. Read over the drill and then try to say what will prevent
the learner from coming up with forms like, He went there with a
plane, He contacted her with a phone, He unlocked it by key, He calmed
them by smile, He used radio to talk to them, He used phone to contact
her, He contacted her with telegram, etc.

In the drill labelled M2, which is supposed to be somewhat more
meaningful, additional traps are neatly laid for the non-native
speaker who is helplessly at the mercy of the patterns laid out in the
drill. When asked how to light a fire, sharpen a pencil, make a
sandwich, or answer a question, he has still fresh in his memory the
keyed answers to the question about how to open a bottle or finance a
car. He has heard that you can finance a car by getting a loan or that
you can open a bottle with an opener. What is to prevent the
unsuspecting and naive learner from saying that you can open a
bottle by getting an opener - structurally the answer is flawless and in
line with what he has just been taught. The answer is even creative.
But pragmatically it is not quite right. Because of a quirk of the
language that native speakers have learned, getting an opener does
not imply using it, though with an opener in response to the question
How do you open a bottle? does imply the required use of the opener.
What will keep the learner from creatively inventing forms like, by
a match in response to the question about starting a fire? Or by a
pencil sharpener in answer to how you sharpen a pencil? Would it not
be perfectly reasonable if when asked How do you answer a question?
the learner replied with an answer? or by answering?
The so-called 'Communicative Use' drill offers even more
interesting traps. How do you send your letters? By an airplane of
course. Or sometimes I send them with a ship. When I'm in a hurry
though I always send them with a plane. How does your friend listen
to your problems? By patience mostly, but sometimes he does so by
smiling. Bills? Well, I almost always bill them by mail or with a car -
sometimes in an airmail. My apartment? Easy. I found it by a friend.
We went there with a car. My next vacation? With an airplane. My
girl friend is wanting to go too, but she goes by getting a loan with a
bank. A good restaurant? You can go with a taxi.
Is there any need to say more? Is there an English teacher alive
anywhere who cannot write reams on the topic? What then, Oh
Watchman of the pattern drill? The pattern drill without meaning, my
Son, is as a door opening into darkness and leading nowhere but to
confusion.
If the preceding examples were exceptional, there might be reason
to hope that pattern drills of the sort illustrated above might be
transformed into more meaningful exercises. There is no reason for
such a hope, however. Pattern drills which are unrelated and
intrinsically unrelatable to meaningful extralinguistic contexts are
confusing precisely because they are well written - that is, in the sense
that they conform to the principles of the meaning-less theory of
linguistic analysis on which they were based. They are unworkable as
teaching methods for the same reason that the analytical principles
on which they are based are unworkable as techniques of linguistic
analysis. The analytical principles that disregard meaning are not just
difficult to apply, but they are fundamentally inapplicable to the
objects of interest - namely, natural languages.

D. From discrete point teaching (meaning-less pattern drills) to
discrete point testing
One might have expected that the hyperbole of meaning-less
language was fully expressed in the typical pattern drills that
characterized the language teaching of the 1950s and to a lesser extent
is still characteristic of most published materials today. However, a
further step toward complete meaninglessness was possible and was
advocated by two leading authorities of the 1960s. Brooks (1964) and
Morton (1960, 1966) urged that the minds of the learners who were
manipulating the pattern drills should be kept free and unencumbered
by the meanings of the forms they were practicing. Even
Lado and Fries (1957, 1958) at least argued that the main purpose of
pattern drills was not only to instill 'habits' but was to enable learners
to say meaningful things in the language. But, Brooks and Morton
developed the argument that skill in the purely manipulative use of
the language, as taught in pattern drills, would have to be fully
mastered before proceeding to the capacity to use the language for
communicative purposes. The analogy offered was the practicing of
scales and arpeggios by a novice pianist, before the novice could hope
to join in a concerto or to use the newly acquired habits expressively.
Clark (1972) apparently accepted this two stage model in relation
to the acquisition of listening comprehension in a foreign language.
Furthermore, he extended the model as a justification for discrete
point and integrative tests:
Second-level ability cannot be effectively acquired unless first-
level perception of grammatical cues and other formal
interrelationships among spoken utterances has become so
thoroughly learned and so automatic that the student is able to
turn most of his listening attention to 'those elements which
seem to him to contain the gist of the message' (Rivers, 1967, p.
193, as quoted by Clark, 1972, p. 43).
Clark continues:

Testing information of a highly diagnostic type would be useful
during the 'first stage' of instruction, in which sound
discriminations, basic patterns of spoken grammar, items of
functional vocabulary, and so forth were being formally taught
and practised. ... As the instructional emphasis changes from
formal work in discrete aspects to more extensive and less
controlled listening practice, the utility (and also the possibility)
of diagnostic testing is reduced in favor of evaluative
procedures which test primarily the students' comprehension of
the 'general message' rather than the apprehension of certain
specific sounds or sound patterns (p. 43).
How successful has this two stage dichotomy proved to be in
language teaching? Stockwell and Bowen hinted at the core of the
difficulty in their introduction to Rutherford (1968):
The most difficult transition in learning a language is going
from mechanical skill in reproducing patterns acquired by
repetition to the construction of novel but appropriate
sentences in natural social contexts. Language teachers ... not
infrequently ... fumble and despair, when confronted with the
challenge of leading students comfortably over this hurdle (p.
vii).
What if the hurdle were an unnecessary one? What if it were a mere
artefact of the attempt to separate the learning of the grammatical
patterns of the language from the communicative use of the
language? If we asked how often children are exposed to meaningless
non-contextualized language of the sort that second language
learners are so frequently expected to master in foreign language
classrooms, the answer would be, never. Are pattern drills, therefore,
necessary to language learning? The answer must be that they are
not. Further, pattern drills of the non-contextualized and non-
contextualizable variety are probably about as confusing as they are
informative.
If, as we have already seen above, pattern drills are associated with
the 'first stage' of a two stage process of teaching a foreign language
and if the so-called 'diagnostic tests' (or discrete point tests) are also
associated with that first stage, it only remains to show the connection
between the pattern drills and the discrete point items themselves.
Once this is accomplished, we will have illustrated each link in the
chain from certain linguistic theories to discrete point methods of
language testing.
Perhaps the area of linguistic analysis which developed the most
rapidly was the level of phonemics. Accordingly, a whole tradition of
pattern drills was created. It was oriented toward the teaching of
'pronunciation', especially the minimal phonemic contrasts of
various target languages. For instance, Lado and Fries (1954)
suggested:
A very simple drill for practicing the recognition of
distinctive differences can be made by arranging minimal pairs
of words on the blackboard in columns thus:
(The words they used were offered in phonetic script but are
presented here in their normal English spellings.)
man men
lass less
lad led
pan pen
bat bet
sat set
The authors continue:

The teacher pronounces pairs of words in order to make the student aware of the contrast. When the teacher is certain that the students are beginning to hear these distinctions he can then have them actively participate in the exercise (p. iv).

In a footnote the reader is reminded:

Care must be taken to pronounce such contrasts with the same intonation on both words so that the sole difference between the words will be the sound under study (op cit).
It is but a short step to test items addressed to minimal phonological contrasts. Lado and Fries point out in fact that a possible test item is a picture of a woman watching a baby versus a woman washing a baby. In such a case, the examinee might hear the statement, The woman is washing the baby, and point to or otherwise indicate the picture to which the utterance is appropriate.
Harris (1969) observes that the minimal pair type of exercise, of the sort illustrated above, 'is, in reality, a two-choice "objective test", and most sound discrimination tests are simply variations and expansions of this common classroom technique' (pp. 32-3). Other variations for which Harris offers examples include heard pairs of words where the learner (or in this case, the examinee) must indicate whether the two words are the same or different; or a heard triplet where the learner must indicate which of three words (e.g., jump, chump, jump) was different from the other two; or a heard sentence in which either member of a minimal pair might occur (e.g., It was a large ship, versus, It was a large sheep) where the examinee must indicate either a picture of a large ship or a large sheep depending on what was heard. Harris refers to the last case as an example of testing minimal pairs 'in context' (pp. 33-4). It is not difficult to see, however, that the types of contexts in which one might expect to find both ships and sheep are relatively few in number - certainly a very small minority of possible contexts in which one might expect to find ships without sheep or sheep without ships.
Vocabulary teaching by discrete point methods also leads rather directly to discrete point vocabulary tests. For instance, Bird and Woolf (1968) include a substitution drill set in a sentence frame of That's a ____, or This is a ____ with such items as chair, pencil, table, book, and door. It is a short step from such a drill to a series of corresponding test items. For example, Clark (1972) suggests a test item where a door, chair, table, and bed are pictured. Associated with each picture is a letter which the student may mark on an answer sheet for easy scoring. The learner hears in French, Voici une chaise, and should correspondingly mark the letter of the picture of the chair on the answer sheet.
Other item types suggested by Harris (1969) include: a word followed by several brief definitions from which the examinee must select the one that corresponds to the meaning of the given word; a definition followed by several words from which the examinee must select the one closest in meaning to the given definition; a sentence frame with an underlined word and several possible synonyms from which the examinee must choose the best alternative; and a sentence frame with a blank to be filled by one of several choices and where all but one of the choices fail to agree with the meaning requirements of the sentence frame.
Test items of the discrete point type aimed at assessing particular grammatical rules have often been derived directly from pattern drill formats. For example, in their text for teaching English as a foreign language in Mali, Bird and Woolf (1968) recommend typical transformation drills from singular statements to plural ones (e.g., Is this a book? to Are these books? and reverse, see p. 14a); from negative to negative interrogative (e.g., John isn't here, to Isn't John here? see p. 83); from interrogative to negative interrogative (e.g., Are we going? to Aren't we going? see p. 83); statement to question (e.g., He hears about Takamba, to What does he hear about? see p. 131); and so forth.
There are many other types of possible drills in relation to syntax, but the fact that drills of this type can and have been translated more or less directly into test items is sufficient perhaps to illustrate the trend. Spolsky, Murphy, Holm, and Ferrel (1972, 1975) give examples of test items requiring transformations from affirmative form to negative, or to question form, from present to past, or from present to future as part of a 'functional test of oral proficiency' for adult learners of English as a second language.

Many other examples could be given illustrating the connection between discrete point teaching and discrete point testing, but the foregoing examples should be enough to indicate the relationship, which is simple and fairly direct. Discrete point testing derives from the pattern drill methods of discrete point teaching and is therefore subject to many of the same difficulties.

E. Contrastive linguistics
One of the strongholds of the structural linguistics of the 1950s and perhaps to a lesser extent the 1960s was contrastive analysis. It has had less influence on work in the teaching of English as a second language in the United States than it has had on the teaching of English as a foreign language and the teaching of other foreign languages. There is no way to apply contrastive analysis to the preparation of materials for teaching English as a second language when the language backgrounds of the students range from Mandarin Chinese, to Spanish, to Vietnamese, to Igbo, to German, etc. It would be impossible for any set of materials to take into account all of the contrasts between all of the languages that are represented in many typical college level classes for ESL in the U.S. However, the claims of contrastive analysis are still relatively strong in the teaching of foreign languages and in recent years have been reasserted in relation to the teaching of the majority variety of English as a second dialect to children who come to school speaking some other variety.
The basic idea of contrastive analysis was stated by Lado (1957). It is

the assumption that we can predict and describe the patterns that will cause difficulty in learning, and those that will not cause difficulty, by comparing systematically the language and culture to be learned with the native language and culture of the student. In our view, the preparation of up-to-date pedagogical and experimental materials must be based on this kind of comparison (p. vii).
Similar claims were offered by Politzer and Staubach (1961), Lado (1964), Strevens (1965), Rivers (1964, 1968), Barrutia (1969), and Bung (1973). All of these authors were concerned with the teaching of foreign languages.
More recently, the claims of contrastive analysis have been extended to the teaching of reading in the schools. Reed (1973) says,

the more 'radically divergent' the non-standard dialect (i.e., the greater the structural contrast and historical autonomy vis-à-vis standard English), the greater the need for a second language strategy in teaching Standard English (p. 294).

Farther on she reasons that unless the learner is

enabled to bring to the level of consciousness, i.e., to formalize his intuitions about his dialect, it is not likely that he will come to understand and recognize the systematic points of contrast and interference between his dialect and the Standard English he must learn to control (p. 294).
Earlier, Goodman (1965) offered a similar argument based on contrastive analysis. He said,

the more divergence there is between the dialect of the learner and the dialect of learning, the more difficult will be the task of learning to read (as cited by Richards, 1972, p. 250). (Goodman has since changed his mind; see Goodman and Buck, 1973.)
If these remarks were expected to have the same sort of effects on language testing that other notions concerning language teaching have had, we should expect other suggestions to be forthcoming about related (in fact derived) methods of language testing. Actually, the extension to language testing was suggested by Lado (1961) in what was perhaps the first major book on the topic. He reasoned that language tests should focus on those points of difference between the language of the learners and the target language. First, the 'linguistic problems' were to be determined by a 'contrastive analysis' of the structures of the native and target languages. Then,

the test ... will have to choose a few sounds and a few structures at random hoping to give a fair indication of the general achievement of the student (p. 28).
More recently, testing techniques that focus on discrete points of difference between two languages or two dialects have generally fallen into disfavor. Exceptions however can be found. One example is the test used by Politzer, Hoover, and Brown (1974) to assess degree of control of two important dialects of English. Such items of difference between the majority variety of English and the minority variety at issue included the marking of possessives (e.g., John's house versus John house), and the presence or absence of the copula in the surface form (e.g., He goin' to town versus He is going to town). Interestingly, except for the manifest influence of contrastive analysis and discrete point theory on the scoring of the test used by Politzer et al, it could be construed as a pragmatic test (i.e., it consisted of a sequential text where the task set the learner was to repeat sequences of material presented at a conversational rate).
Among the most serious difficulties for tests based on contrastive linguistics is that they should be suited (in theory at least) for only one language background - namely, the language on which the contrastive analysis was performed. Upshur (1962) argues that this very fact results in a peculiar dilemma for contrastively based tests: either the tests will not differentiate ability levels among students with the same native language background and experience, 'or the contrastive analysis hypothesis is invalid' (p. 127). Since the purpose of all tests is to differentiate success and failure or degrees of one or the other, any contrastively based test is therefore either not a test or not contrastively based. A more practical problem for contrastively based tests is that learners from different source languages (or dialects) would require different tests. If there are many source languages contrastively based tests become infeasible.
Further, from the point of view of pragmatic language testing as discussed in Part One above, the contrastive analysis approach is irrelevant to the determination of what constitutes an adequate language test. It is an empirical question as to whether tests can be devised which are more difficult for learners from one source language than for learners from other source languages (where proficiency level is a controlled variable). Wilson (in press) discusses this problem and presents some evidence suggesting that contrastive analysis is not helpful in determining which test items will be difficult for learners of a certain language background.
In view of the fact that contrastive analysis has proved to be a poor basis for predicting errors that language learners will make, or for hierarchically ranking points of structure according to their degree of difficulty, it seems highly unlikely that it will ever provide a substantial basis for the construction of language tests. At best, contrastive analysis provides heuristics only for certain discrete point test items. Any such items must then be subjected to the same sorts of validity criteria as are any other test items.

F. Discrete elements of discrete aspects of discrete components of discrete skills - a problem of numbers
One of the serious difficulties of a thoroughgoing analytical model of discrete point testing is that it generates an inordinately large number of tests. If as Lado (1961) claimed, we 'need to test the elements and the skills separately' (p. 28), and if as he further argued we need separate tests for supposedly separate components of language ability, and for both productive and receptive aspects of those components, we wind up needing a very large number of tests indeed. It might seem odd to insist on pronunciation tests for speaking and separate pronunciation tests for listening, but Lado did argue that the 'linguistic problems' to be tested 'will differ somewhat for production and for recognition' and that therefore 'different lists are necessary to test the student's pronunciation in speaking and in listening' (p. 45). Such distinctions, of course, are also common to tests in speech pathology.
Hence, what is required by discrete point testing theory is a set of items for testing the elements of phonology, lexicon (or vocabulary), syntax, and possibly an additional component of semantics (depending on the theory one selects) times the number of aspects one recognizes (e.g., productive versus receptive) times the number of separate skills one recognizes (e.g., listening, speaking, reading, writing, and possibly others).
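The combinatorics here are worth spelling out. The following minimal sketch (in Python; the code and the particular factorizations are ours, not the original text's, though they are chosen to reproduce the subtest totals reported just below) shows how quickly a fully crossed model multiplies:

    # Number of separate tests implied by a fully crossed discrete point
    # model: one test per cell of the components x aspects x skills matrix.
    def subtest_count(components, aspects, skills):
        return components * aspects * skills

    # Harris (1969): 4 components x 4 skills, no separate aspect dimension
    print(subtest_count(components=4, aspects=1, skills=4))  # 16

    # Cooper (1972): one factorization consistent with the reported total,
    # e.g. 3 knowledge components x 2 aspects x 4 skills
    print(subtest_count(components=3, aspects=2, skills=4))  # 24

    # Oller (1976c): 3 components x 4 skill/modality combinations
    print(subtest_count(components=3, aspects=1, skills=4))  # 12

Note that recognizing one more component or aspect multiplies, rather than merely adds to, the number of required subtests.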
In fact, several different models have been proposed. Figure 5 below illustrates the componential analysis suggested by Harris (1969); Figure 6 shows the analysis suggested by Cooper (1972); Figure 7 shows a breakdown offered by Silverman, Noa, Russell, and Molina (1976); and finally, Figure 8 shows a slightly different model of discrete point categories proposed by Oller (1976c) in a discussion of possible research recommended to test certain crucial hypotheses generated by discrete point and pragmatic theories of language testing.

Harris's model would require (in principle) sixteen separate tests or subtests; Cooper's would require twenty-four separate tests or subtests; the model of Silverman et al would require sixteen; and the model of Oller would require twelve.
What classroom teacher has time to develop so many different tests? There are other grave difficulties in the selection of items for each test and in determining how many items should appear in each test, but these are discussed in Chapter 7, section A. The question is, are all the various tests or subtests necessary? We don't normally use only our phonology, or vocabulary, or grammar: why must they be taught and tested separately?

[Figure 5. Componential breakdown of language proficiency proposed by Harris (1969, p. 11): a matrix crossing the language skills of listening, speaking, reading, and writing with the components phonology/orthography, structure, vocabulary, and rate and general fluency.]

[Figure 6. Componential analysis of language skills as a framework for test construction, from Cooper (1968, 1972, p. 337): a matrix crossing listening, speaking, reading, and writing with knowledge of phonology, syntax, and semantics, plus totals.]

[Figure 7. 'Language assessment domains' as defined by Silverman et al (1976, p. 21).]
An empirical question which must be answered in order to justify such componentialization of language skill is whether tests that purport to measure the same component of language skill (or the same aspect, modality, or whatever) are in fact more highly correlated with each other than with tests that purport to measure different components. Presently available research results show many cases where tests that purport to measure different components or aspects (etc.) correlate as strongly or even more strongly than do tests that purport to measure the same components. These results are devastating to the claims of construct validity put forth by advocates of discrete point testing.

For example, Pike (1973) found that scores on an essay task correlated more strongly with the Listening Comprehension subscore on the TOEFL than with the Reading Comprehension subscore for three different groups of subjects. This sort of result controverts the prediction that tasks in the reading-writing modality ought to be more highly intercorrelated with each other than with tasks in the listening-speaking modality (of course, the TOEFL Listening Comprehension subtest does require reading). Perhaps more surprisingly, Darnell (1968) found that a cloze task was more highly correlated with the Listening Comprehension section on the TOEFL than with any of the other subscores. Oller and Conrad (1971) got a similar result with the UCLA ESL Placement Examination Form 2C. Oller and Streiff (1975) found that a dictation task was more strongly correlated with each other part of the UCLA ESLPE Form 1 than any of the other parts with each other. This was particularly surprising to discrete point theorizing in view of the fact that the dictation was the only section of the test that required substantial listening comprehension. Except for the phonological discrimination task, which required distinguishing minimal pairs in sentence sized contexts, no other subsection required listening comprehension at all.

[Figure 8. Schematic representation of constructs posited by a componential analysis of language skills based on discrete point test theory, from Oller (1976c, p. 150): receptive and productive modes crossed with an auditory/articulatory modality (listening, speaking) and a visual/manual modality (reading, writing), each cell subdivided into phonology or graphology, structure, and vocabulary.]
In conclusion, if tasks that bear a certain label (e.g., reading comprehension) correlate as well with tasks that bear different labels (e.g., listening comprehension, or vocabulary, or oral interview, etc.) as they do with each other, what independent justification can be offered for their distinct labels or for the positing of separate skills, aspects, and components of language? The only justification that comes to mind is the questionable theoretical bias of discrete point theory.

Such a justification is a variety, a not too subtle variety in fact, of validity by fiat, or nominal validity - for instance, the statement that a 'listening comprehension test' is a test of 'listening comprehension' because that is what it was named by its author(s); or that a 'reading' test is distinct from a 'grammar' test because they were assigned different labels by their creators. A 'vocabulary' test is different from a 'grammar' or 'phonology' test because they were dubbed differently by the theorists who rationalized the distinction in the first place.
There is a better basis for labeling tests. Tests may be referred to, not in terms of the hypothetical constructs that they are supposed to measure, but rather in terms of the sorts of operations they actually require of learners or examinees. A cloze test requires learners to fill in blanks. A dictation requires them to write down phrases, clauses or other meaningful segments of discourse. An imitation task requires them to repeat or possibly rephrase material that is heard. A reading aloud task requires reading aloud. And so forth. A synonym matching task is the most popular form of task usually called a 'Vocabulary Test'. If tests were consistently labeled according to what they require learners to do, a great deal of argumentation concerning relatively meaningless and misleading test labels could be avoided. Questions about test validity are difficult empirical questions which are only obscured by the careless assignment of test labels on the basis of untested theoretical (i.e., theory based) biases.
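Since tasks are being named here by the operations they require, it may help to state one such operation mechanically. Below is a minimal sketch of the classic fixed-ratio cloze procedure; the every-seventh-word deletion rate is illustrative only, nothing in the text prescribes it:

    # Delete every nth word of a passage, returning the mutilated text and
    # an answer key - the operation a 'cloze test' asks learners to reverse.
    def make_cloze(text, n=7, blank='______'):
        words = text.split()
        key = {}
        for i in range(n - 1, len(words), n):
            key[i] = words[i]
            words[i] = blank
        return ' '.join(words), key

    passage = ('The validity of the test is related to how well it enables '
               'us to predict performance in other discourse processing tasks.')
    mutilated, key = make_cloze(passage)
    print(mutilated)
    print(key)

The operational label carries no claim about which hypothetical 'component' the blanks tap; that remains the empirical question.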
In the final analysis, the question of whether or not language skill can be split up into so many discrete elements, components, and so forth, is an empirical question. It cannot be decided a priori by fiat. It is true that discrete point tests can be made from bits and pieces of language, just as pumpkin pie can be made from pumpkins. The question is whether the bits and pieces can be put together again, and the extent to which they are characteristic of the whole. We explore this question more thoroughly in Chapter 8 below. Here, the intent has been merely to show that discrete point theories of testing are derived from certain methods of teaching which in their turn derive from certain methods of linguistic analysis.
KEY POINTS
1. Theories of language influence theories of language learning which in their turn influence theories of language teaching which in their turn influence theories and methods of language testing.
2. A unidirectional influence from theory to practice (and not from practical findings to theories) is unhealthy.
3. A theory that cannot be mortally endangered is not alive.
4. Language is typically used for meaningful purposes - therefore, any theory of language that hopes to attain a degree of adequacy must countenance this fact.
5. Structural analysis less meaning led to pattern drills without meaning.
6. Bloomfield's exclusion of meaning from the domain of interest to linguistics was reified in the distributional discovery procedure recommended by Zellig Harris.
7. The insistence on grammar without meaning was perpetuated by Chomsky in 1957 when he insisted on a grammatical analysis of natural languages with no reference to how those languages were used in normal communication.
8. Even when semantic notions were incorporated into the 'integrated theory' of the mid sixties, Katz and Fodor, and Katz and Postal insisted that the speaker's knowledge of how language is used in relation to extralinguistic contexts should remain outside the pale of interest.
9. Analysis of language without reference to meaning or context led to theories of language learning which similarly tried to get learners to internalize grammatical rules with little or no chance of ever discovering the meanings of the utterances they were encouraged to habitualize through manipulative pattern drill.
10. Pattern drills, like the linguistic analyses from which they derive, focussed on 'structural meaning' - the superficial meaning that was associated with the distributional patterns of linguistic elements relative to each other and largely independent of any pragmatic motivation for uttering them.
11. Typical pattern drills based on the syntactic theories referred to above are essentially noncontextualizable - that is, there is no possible context in which all of the diverse things that are included in a pattern drill could actually occur. (See the drills in most any ESL/EFL text - refer to the list of references given on p. 161 above.)
12. It is quite impossible, not just difficult, for a non-native speaker to infer pragmatic contexts for the sentences correctly in a typical syntax based pattern drill unless he happens to know already what the drill purports to be teaching.
13. Syntactically motivated pattern drills are intrinsically structured in ways that will necessarily confuse learners concerning similar forms with different meanings - there is no way for the learner to discover the pragmatic motivations for the differences in meaning.
14. The hurdle between manipulative drills and communicative use of the utterances in them (or the rules they are supposed to instill in the learner) is an artefact of meaningless pattern drills in the first place.
15. Discrete point teaching, particularly the syntax based pattern drill approach, has been more or less directly translated into discrete point testing.
16. Contrastive linguistics contended that the difficult patterns of a particular target language could be determined in advance by a diligent and careful comparison of the native language of the learner with the target language.
17. The notion of contrastive linguistics was extended to the teaching of reading and to language tests in the claim that the points of difficulty (predicted by contrastive analysis) should be the main targets for teaching and for test items.
18. A major difficulty for contrastive linguistics is that it has never provided a very good basis for prediction. In many cases where the predictions are clear they are wrong, and in others, they are too vague to be of any empirical value.
19. Another serious difficulty is that every different native language background theoretically requires a different language test to assess knowledge of the same target language (e.g., English). Further, there seems to be no reason to expect a comparison of English and Spanish to provide a good test of either English or Spanish; rather, what appears to be required is something that can be arrived at independently of any comparison of the two languages - a test of one or the other language.
20. Many problems of test validity can be avoided if tests are labeled according to what they require examinees to do instead of according to what the test author thinks the test measures.
21. Finally, discrete point test theories require many subtests which are of questionable validity. Whether there exists a separable (and separately testable) component of vocabulary, another of grammar, and separable skills of listening, reading, writing, and so forth must be determined on an empirical basis.

DISCUSSION QUESTIONS
1. Take a pattern drill from any text. Ask what the motivation for the drill was. Consider the possibility (or impossibility) of contextualizing the drill by making obvious to the learner how the sentences of the drill might relate to realistic contexts of communication. Can it be done? If so, demonstrate how. If not, explain why not. Can contextualized or contextualizable drills be written? If so, how would the motivation for such drills differ from the motivation for noncontextualizable drills?
2. Examine pragmatically motivated drills such as those included in El español por el mundo (Oller, 1963-65). Study the way in which each sentence in the drill is relatable (in ways that can be and in fact are made obvious to the learner) to pragmatic contexts that are already established in the mind of the learner. How do such drills differ in focus from the drills recommended by other authors who take syntax as the starting point rather than meaning (or pragmatic mapping of utterance onto context)?
3. Work through a language test that you know to be widely used. Consider which of the subparts of the test rely on discrete point theory for their justification. What sorts of empirical studies could you propose to see if the tests in question really measure what they purport to measure? What specific predictions would you make concerning the intercorrelations of tests with the same label as opposed to tests with different labels? What if the labels are not more than mere labels?
4. Consider the sentences of a particular pattern drill in a language that you know well. Ask whether the sentences of the drill would be apt to occur in real life. For instance, as a starter, consider the likelihood of ever having to distinguish between She's watching the baby versus She's washing the baby. Can you conceive of more likely contrasts? What contextual factors would cause you to prefer She's giving the baby a bath, or She's bathing the infant, etc.?
5. Following up on question 4, take a particular sentence from a pattern drill and try to say all that you know about its form and meaning. For instance, consider what other forms it calls to mind and what other meanings it excludes or calls to mind. Compare the associations thus developed for one sentence with the set of similar associations that come to mind in the next sentence in the drill, and the next and the next. Now, analyze some of the false associations that the drill encourages. Try to predict some of the errors that the drill will encourage learners to make. Do an observational study where the pattern drill is used in a classroom situation and see if the false associations (errors) in fact arise. Or alternatively, record the errors that are committed in response to the sentences of a particular drill and see if you can explain them after the fact in terms of the sorts of associations that are encouraged by the drill.
6. As an exercise in distributional analysis, have someone give you a simple list of similarly structured sentences in a language that you do not know. Try to segment those sentences without reference to meaning. See if you can tell where the word boundaries are, and see if it is possible to determine what the relationships between words are - i.e., try to discover the structural meanings of the utterances in the foreign language without referring to the way those utterances are pragmatically mapped onto extralinguistic contexts. Then, test the success of your attempt by asking an informant to give you a literal word for word translation (or possibly a word for morpheme or phrase translation) along with any less literal translation into English. See how close your understanding of the units of the language was to the actual structuring that can be determined on the basis of a more pragmatic analysis.
7. Take any utterance in any meaningful linguistic context and assign to it the sort of tree structure suggested by a phrase structure grammar (or a more sophisticated grammatical system if you like). Now consider the question, what additional sorts of knowledge do I normally bring to bear on the interpretation of such sentences that is not captured by the syntactic analysis just completed? Extend the question. What techniques might be used to make a learner aware of the additional sorts of clues and information that native speakers make use of in coding and decoding such utterances? The extension can be carried one step further. What techniques could be used to test the ability of learners to utilize the kinds of clues and information available to native speakers in the coding and decoding of such utterances?
8. Is there a need for pattern drills without meaning in language teaching? If one chose to dispense with them completely, how could pattern drills be constructed so as to maximize awareness of the pragmatic consequences of each formal change in the utterances in the pattern drill? Consider the naturalness constraints proposed for pragmatic language tests. Could not similar naturalness constraints be imposed on pattern drills? What kinds of artificiality might be tolerable in such a system and what kinds would be intolerable?
9. Take any discrete test item from any discrete test. Embed it in a context (if that is possible; for some items it is very difficult to conceive of a realistic context). Consider then the degree of variability in possible choices for the isolated discrete item and for the same item in context. Which do you think would produce results most commensurate with genuine ability to communicate in the language in question? Why? Alternatively, take an item from a cloze test and isolate it from its context (this is always possible). Ask the same questions.

SUGGESTED READINGS
1. J. Stanley Ahmann and Marvin D. Glock, Measuring and Evaluating Educational Achievement, 2nd ed. Boston: Allyn and Bacon, 1975. See especially Chapters 2, 3, and 4.
2. John L. D. Clark, Foreign Language Testing: Theory and Practice. Philadelphia: Center for Curriculum Development, 1972, 25-113.
3. David P. Harris, Testing English as a Second Language. New York: McGraw-Hill, 1969, 1-11, 24-54.
4. J. B. Heaton, Writing English Language Tests. London: Longman, 1975.
5. Robert J. Silverman, Joslyn K. Noa, Randall H. Russell, and John Molina, Oral Language Tests for Bilingual Students: An Evaluation of Language Dominance and Proficiency Instruments. Portland, Oregon: Center for Bilingual Education (USOE, Department of HEW), 18-28.
7

Statistical Traps

A. Sampling theory and test construction
B. Two common misinterpretations of correlations
C. Statistical procedures as the final criterion for item selection
D. Referencing tests against non-native performance

There is an old adage that 'Figures don't lie', to which some sage added, 'But liars do figure'. While in the right contexts, both of these statements are true, both oversimplify the problems faced by anyone who must deal with the sorts of figures known as statistics produced by the figurers known as statisticians. Although most statistics can easily be misleading, most statisticians probably do not intend to be misleading even when they produce statistics that are likely to be misinterpreted.

The book by Huff (1954), How to Lie with Statistics, would probably be more widely read if it were on the topic, How to Lie without Statistics. Statistics per se is a dry and forbidding subject. The difficulty is probably not related to deliberate deception but to the difficulty of the objects of interest - namely, certain classes of numbers known as statistics. This chapter examines several common but misleading applications of statistics and statistical concepts. We are thinking here of language testing, but the problems are in fact very general in educational testing.

A. Sampling theory and test construction

One of the misapplications of statistical notions to language testing relates to the construction of tests. Discrete point theorists have suggested that

the various 'parts' of the domain of 'language proficiency' must be defined and represented in appropriate proportions on the test ...
Item analysis statistics take on a different meaning ... The concern is ... with how well the items on the test represent the content domain (Petersen and Cartier, 1975, p. 108) ...
At first glance, it might appear that, in principle, a test of general proficiency in a foreign language should be a sample of the entire language at large. In practice, obviously, this is neither necessary nor desirable. The average native speaker gets along quite well knowing only a limited sample of the language at large, so our course and test really only need to sample that sample. (Further discussion on what to sample will follow after touching on the problem of how to sample [p. 111].)
The reader will begin to appreciate the difficulty of determining just
what domains to sample (within the sample of what the native
speaker knows) a couple of paragraphs farther on in the Petersen-
Cartier argument.
Most language tests, including DLI's tests [i.e., the tests of
the Defense Language Institute], therefore make a kind of
stratified random sample, assuring by plan that some items test
grammatical features, some test phonological features, some
test vocabulary, and so on. Thus, for example, the DLI's
English Comprehension Level tests are constructed according
to a fairly complex sampling matrix which requires that specific
percentages of the total number of 120 items be devoted to
vocabulary, sound discrimination, grammar, idioms, listening
comprehension, reading comprehension, and so on.
The authors continue to point out that there was once an attempt to
determine the feasibility of establishing a universal item-
selection matrix of this sort for all languages, or perhaps for all
languages of a family, so that the problem of making a stratified
sample for test construction purposes could be reduced to a
somewhat standard procedure. However, such a procedure has
not, as yet, been found, and until it is, we must use some method
for establishing a rational sample of a language in our tests (p.
112).
The solution arrived at is to 'consider the present DLI courses as
rational samples of the language and to sample them ... for the item
objectives in our tests of general ability' (p. 113).1

1 The authors indicate in personal communication that they reached this decision only 'with great reluctance'. They did not in their words 'arrive at these conclusions naively but, in fact, only with some considerable pain'. Therefore, the foregoing and following remarks should be interpreted accordingly.
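To see what a 'sampling matrix' of this kind amounts to in practice, here is a minimal sketch of allocating the 120 items mentioned above across categories. The category labels are taken from the quotation, but the percentages are entirely hypothetical, since the actual DLI matrix is not reported:

    # Hypothetical percentages - the real DLI sampling matrix is not given
    # in the text; these figures are for illustration only.
    matrix = {'vocabulary': 0.25, 'sound discrimination': 0.10,
              'grammar': 0.25, 'idioms': 0.10,
              'listening comprehension': 0.15,
              'reading comprehension': 0.15}

    total_items = 120
    allocation = {category: round(share * total_items)
                  for category, share in matrix.items()}
    print(allocation)  # e.g. {'vocabulary': 30, 'sound discrimination': 12, ...}
    assert sum(allocation.values()) == total_items

The mechanics are trivial; the chapter's point is that nothing in the procedure tells us where the percentages themselves could rationally come from.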
Several crucial questions arise: What would a representative proportion of the phonology of the language be? How many items would be required to sample the component of phonology if such a component is in fact considered to be part of the larger domain of language proficiency? What percentage of the items on a language test should address the component of vocabulary (in order to be representative in the way that this component is sampled)? How many items should address the grammar component? What is a representative sampling of the grammatical structures of a language? How many noun phrases should it contain? Verb phrases? Left embedded clauses? Right branching relative clauses?

One is struck by the possibility that in spite of what anyone may say, there may be no answers to these questions. The suspicion that no answers exist is heightened by the admission that no procedure has yet been found which will provide a rational basis for 'making a stratified sample'. Indeed, if this is so, on what possible rational basis could specific percentages of a certain total number of items be determined?
Clearly, the motivation for separate items aimed at 'phonological features, vocabulary, grammar, idioms, listening comprehension, reading comprehension' and so forth is some sort of discrete point analysis that is presumed to have validity independent of any particular language test or portion thereof. However, the sort of questions that must be answered concerning the possible existence of such components of language proficiency cannot be answered by merely invoking the theoretical biases that led to the hypothesized components of language proficiency in the first place. The theoretical biases themselves must be tested - not merely assumed.
The problem for anyone proposing a test construction method that relies on an attempt to produce a representative (or even a 'rational') 'sampling of the language' is not merely a matter of how to parse up the universe of 'language proficiency' into components, aspects, skills, and so forth, but once having arrived at some division by whatever methods, the most difficult problem still remains - how to recognize a representative or rational sample in any particular portion of the defined universe.

It can be shown that any procedure that might be proposed to assess the representativeness of a particular sample of speech is doomed to failure because by any possible methods, all samples of real speech are either equally representative or equally unrepresentative of the universe of possible utterances. This is a natural and inevitable consequence of the fact that speech is intrinsically non-repetitive. Even an attempt to repeat a communicative exchange by using the same utterances with the same meanings is bound to fail because the context in which utterances are created is constantly changing. Utterances cannot be perfectly repeated. Natural discourse of the sort that is characteristic of the normal uses of human languages is even less repeatable - by the same logic. The universe to be sampled from is not just very large, it is infinitely large and non-repetitive.2 To speak of a rational sampling of possible utterances is like speaking of a rational sampling of future events or even of historical happenings. The problem is not just where to dip into the possible sea of experience, but also how to know when to stop dipping - i.e., when a rational sample has been achieved.
To make the problem more meaningful, consider sampling the possible sentences of English. It is known that the number of sentences must be infinite unless some upper bound can be placed on the length of an allowable sentence. Suppose we exclude all sentences greater than twenty words in length. (To include them only strengthens the argument, but since we are using the method of reduction to absurdity tying our hands in this way is a reasonable way to start.) Miller (1964) conservatively estimated that the number of grammatical twenty-word sentences in English is roughly 10²⁰. This figure derives directly from the fact that if we interrupt someone who is speaking, on the average there are about 10 words that can be used to form an appropriate grammatical continuation. The number 10²⁰ exceeds the number of seconds in 100,000,000 centuries.
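Miller's figure and the comparison with centuries of seconds can be checked directly. A minimal sketch, taking from the text the assumptions of roughly 10 grammatical continuations at each of 20 word positions:

    choices_per_word = 10        # average grammatical continuations (Miller, 1964)
    sentence_length = 20
    sentences = choices_per_word ** sentence_length    # 10**20

    seconds_per_year = 365.25 * 24 * 3600              # about 3.16e7
    seconds_in_1e8_centuries = 1e8 * 100 * seconds_per_year   # about 3.16e17

    print('%.1e twenty-word sentences' % sentences)
    print(sentences > seconds_in_1e8_centuries)        # True, by a factor of ~300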
Petersen and Cartier (1975) suggested that one way around the sampling problem was to 'sample the sample' of language that happened to appear in the particular course of study at the Defense Language Institute. Rather than trying to determine what would be a representative sample of the course, suppose we just took the whole course of an estimated '30 hours a week for 47 weeks' (p. 112). Suppose further that we make the contrary to fact assumption that in the course students are exposed to a minimum of one twenty-word sentence each second. If we then considered the entire course of study as a kind of language test, it would only be possible for such a test to cover about five million sentences - less than .000000000001 percent (one trillionth of one percent) of the possible twenty-word sentences in English. In what realistic sense could this 'sample' be considered representative of the whole? We are confronted here not just with a difficulty of how to select a 'sample', because it doesn't make any difference how we select the sample. It can never be argued that any possible sample of sentences is representative of the whole.

2 This is not the same as saying that every sentence is 'wholly novel' or 'totally unfamiliar' - a point of view that was argued against in Chapter 6 above. To say that normal communicative events cannot be repeated is like saying that you cannot relive yesterday or even the preceding moment of time. This does not mean that there is nothing familiar about today or that there will be nothing familiar about tomorrow. It is like saying, however, that tomorrow will not be identical to today, nor is today quite like yesterday, etc.
If we take a larger unit of discourse as our basic working unit, the difficulty of obtaining a representative sample becomes far greater since the number of possible discourses is many orders of magnitude larger than the number of possible twenty-word sentences. We are thus forced to the conclusion that the discrete point sampling notion is about as applicable to the problem of constructing a good language test as the method of listing sentences is to the characterization of the grammar of a particular natural language. No list will ever cover enough of the language to be interesting as a theory of grammar. It is fundamentally inadequate in principle - not just incomplete. Such a list could never be completed, and a test modeled after the same fashion could never be long enough even if we extended it for the duration of a person's life expectancy. In this vein, consider the fact that the number of twenty-word sentences by itself is about a thousand times larger than the estimated age of the earth (Miller, 1964).
The escape from the problem of sampling theory is suggested in the statement by Petersen and Cartier (1975) that 'the average native speaker gets along quite well knowing only a limited sample of the language at large, so our course and test really only need to sample that sample' (p. 111). Actually, the native speaker knows far more than he is ever observed to perform with his language. The conversations that a person has with others are but a small fraction of conversations that one could have if one chose to do so. Similarly, the utterances that one comprehends are but an infinitesimal fraction of the utterances that one could understand if they were ever presented. Indeed, the native speaker knows far more of his language than he will ever be observed to use - even if he talks a lot and if we observe him all the time. This was the original motivation for the distinction between competence and performance in the Chomskyan linguistics of the 1950s and 1960s.
The solution to the difficulty is not to shift our attention from many utterances (say, all of the ones uttered or understood by a given native speaker) to fewer utterances (say, all of those utterances presented to a language learner in a particular course of study at the DLI), but to remove our attention to an entirely different sort of object - namely, the grammatical system (call it a cognitive network, generative grammar, expectancy grammar, or interlanguage system) that the learner is in the process of internalizing and which when it is mature (i.e., like that of the native speaker) will generate not only the few meaningful utterances that happen to occur in a particular segment of time, but the many that the native speaker can say and understand.
Instead of trying to construct a language test that will 'representatively' or 'rationally sample' the universe of 'language', we should simply construct a test that requires the language learner to do what native speakers do with discourse (perhaps any discourse will do). Then the interpretation of the test is related not to the particular discourse that we happened to select, nor even to the universe of possible discourses in the sense of sampling theory. Rather, it is related to the efficiency of the learner's internalized grammatical system in processing discourse. The validity of the test is related to how well it enables us to predict the learner's performance in other discourse processing tasks.
We can differentiate between segments of discourse that are easy to process and segments that are more difficult to process. Further, we can differentiate between segments of discourse that would be appropriate to the subject matter of mathematics, as opposed to say, geography, or gardening as opposed to architecture. The selection of segments of discourse appropriate to a particular learner or group of learners would depend largely on the kinds of things that would be expected later on of the same learner or group. Perhaps in this sense it would be possible to 'sample' types of discourse - but this is a very different sort of sampling than 'sampling' the phonological contrasts of a language, say, in appropriate proportion to the vocabulary items, grammatical rules, and the like. A language user can select portions of college level texts for cloze tests, but this is not the sort of 'sampling' that we have been arguing against. To make the selection analogous to the sort of sampling we have been arguing against is in principle impossible, not just indefensible; one would have to search for prose with a specific proportion of certain phonological contrasts, vocabulary items, grammatical rules and the like in relation to a course syllabus for teaching the language of the test.
In short, sampling theory is either inapplicable or not needed. Where it might seem to apply, e.g., in the case of phonological contrasts, or vocabulary, it is not at all clear how to justify the weighting of subtests in relation to each other. Where sampling theory is clearly inapplicable, e.g., at the sentence or discourse levels, it is also obviously not needed. No elaborate sampling technique is needed to determine whether a learner can read college texts at a defined level of comprehension - nor is sampling theory necessary to the definition of the degree of comprehension that a given learner may exhibit.

B. Two common misinterpretations of correlations


Correlations between language tests have sometimes been misinterpreted in two ways: first, low correlations have sometimes been taken to mean that two tests with different labels are in fact measuring different skills, or aspects or components of skills; and, second, high correlations between dissimilar language processing tasks have sometimes been interpreted as indicating mere reliability or even a lack of validity. Depending upon the starting assumptions of a particular theoretical viewpoint, the same statistics may yield different interpretations - or may seem to support different, even mutually contradictory conclusions. When such contradictions arise, either one or both of the contradictory viewpoints must be wrong or poorly articulated.

For instance, it is not reasonable to interpret a low correlation between two tests as an indication that the two tests are measuring different skills and also that neither of them is reliable (or that one of them is not reliable). Since reliability is a prerequisite to validity, a given statistic cannot be taken as an indication of low reliability and high validity (yet this is sometimes suggested in the literature as we will see below). Similarly, a high correlation between two dissimilar language tests cannot be dismissed as a case of high reliability but low validity all in the same breath. This latter argument, however, is more complex, so we will take the simpler case first.
Suppose that the correlation between a particular test that bears the label 'Grammar' (or 'English Structure') and another language test that bears the label 'Vocabulary' (or 'Lexical Knowledge') is observed to be comparatively low, say, .40. Can it be concluded from such a low correlation that the two tests are therefore measuring different skills? Can we say on the basis of the low correlation between them that the so-called 'Grammar' test is a test of a 'grammar' component of language proficiency while the so-called 'Vocabulary' test is a test of a 'lexical' component? The answer to both questions is a simple, no.
The observed low correlation could result if both tests were in fact measures of the same basic factor but were both relatively unreliable measures of that factor. It could also result if one of the tests were unreliable; or if one of them were poorly calibrated with respect to the tested subjects (i.e., too easy or too difficult); or if one of the tests were in fact a poor measure of what it purported to measure even though it might be reliable; and so forth. In any case, a low correlation between two tests (even if it is expected on the basis of some theoretical reasoning) is relatively uninformative. It certainly cannot be taken as an indication of the validity of the correlated tests. Consider using recitation of poetry as a measure of empathy and quality of artistic taste as a measure of independence - would a low correlation between the two measures be a basis for claiming that one or both must be valid? Would a low correlation justify the assertion that the two tests are measures of different factors? The point is that low correlations would be expected if neither test were a measure of anything at all.
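The role of unreliability can be made precise with the standard correction for attenuation from classical test theory (the formula is not given in the text, but it underwrites exactly this argument): the correlation that can be observed between two measures is the correlation between the traits they tap, scaled down by the square root of the product of their reliabilities.

    from math import sqrt

    # Classical test theory: measurement error attenuates the correlation
    # that can be observed between two tests.
    def observed_r(true_r, reliability_x, reliability_y):
        return true_r * sqrt(reliability_x * reliability_y)

    # Two tests measuring the SAME underlying factor (true r = 1.0), each
    # with a reliability of only .40, would correlate at just .40 - the
    # sort of figure sometimes read as proof of 'different skills'.
    print(observed_r(1.0, 0.40, 0.40))  # 0.4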
A low correlation between a 'Grammar' test and a 'Vocabulary' test might well be the product of poor tests rather than an independence of hypothesized components of proficiency. In fact, many failures to achieve very high correlations would not prove by any means that very high correlations do not in fact exist between something we might call 'knowledge of vocabulary' and something else that we might call 'knowledge of grammar'. Indeed the two kinds of knowledge might be one kind in reality and no number of low correlations between language tests labeled 'Grammar' tests on the one hand and 'Vocabulary' tests on the other would suffice to exclude such a possibility - and so on for all possible dichotomies. In fact the observation of low correlations between language tests where high correlations might be expected is a little like fishing trips which produce small catches where large catches were expected. Small catches do not prove that big catches do not exist. The larger catches may simply be beyond the depths of previously used nets.
To carry the example a little farther, consider the following remark by Bolinger (1975) in his introductory text on linguistics: 'There is a vast amount of grammatical detail still to be dug out of the lexicon - so much that by the time we are through there may be little point in talking about grammar and lexicon as if they were two different things' (p. 299). Take a relatively common word in English like
continue. In a test such as the TOEFL, this word could appear as an item in the Vocabulary subsection. It might appear in a sentence frame something like, The people want to 'continue'. The examinees might be required to identify a synonymous expression such as keep on from a field of distractors. On the other hand, the word continue might just as easily appear in the English Structure section as part of a verb phrase, or on some other part of the test in a different grammatical form - e.g., continuation, continual, continuity, discontinuity, continuous, or the like. If the word appeared in the English Structure section it might be part of an item such as the following:

SPEAKER A: 'But do you think they'll go on building?'
SPEAKER B: 'Yes, I do because the contractor has to meet his deadline. I think ______.'

(A) the people continue to will
(B) will they want to continue
(C) to continue the people will
(D) they will want to continue
In an important sense, knowing a word is knowing how to use it in a meaningful context, a context that is subject to the normal syntactic (and other) constraints of a particular language. Does it make sense then to insist on testing word-knowledge independent of the constraints that govern the relationships between words in discourse? Is it possible? Even if it does turn out to be possible, the proof that it has been accomplished will have to come from sources of evidence other than mere low correlations between tests labeled 'Grammar' tests and tests called 'Vocabulary' tests. In particular, it will have to be shown that there exists some unique and meaningful variance associated with tests of the one type that is not also associated with tests of the other type - this has not yet been done. Indeed, many attempts to find such unique variances have failed (see the Appendix and references cited there).
In spite of the foregoing considerations, some researchers have contended that relatively low correlations between different language tests have more substantial meaning. For instance, the College Entrance Examination Board and Educational Testing Service recommended in their 1969 Manual for Studies in Support of Score Interpretation (for the TOEFL) that it may be desirable to 'study the intercorrelations among the parts [of the TOEFL] to determine the extent to which they are in fact measuring different abilities for the group tested' (p. 6).
The hint that low correlations might be taken as evidence that different subtests are in fact measuring different components of language proficiency or different skills is confirmed in two other reports. For instance, on the basis of the data in Table 1, the authors of the TOEFL Interpretive Information (Revised 1968) conclude: 'it appears that Listening Comprehension is measuring some aspect of English proficiency different from that measured by the other four parts, since the correlations of Listening Comprehension with each of the others are the four lowest in the table' (p. 14).
TABLE 1
Intercorrelations of the Part Scores on the Test of English as a Foreign Language, Averaged over Forms Administered through April 1967. (From Manual for TOEFL Score Recipients. Copyright © 1973 by Educational Testing Service. All rights reserved. Reprinted by permission.)

Subscores                        (1)    (2)    (3)    (4)    (5)
(1) Listening Comprehension       -     .62    .53    .63    .55
(2) English Structure            .62     -     .73    .66    .79
(3) Vocabulary                   .53    .73     -     .70    .77
(4) Reading Comprehension        .63    .66    .70     -     .72
(5) Writing Ability              .55    .79    .77    .72     -
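
The quoted claim is easy to check mechanically. The following sketch (in Python, purely illustrative; the dictionary simply restates the Table 1 values) confirms that the four correlations involving Listening Comprehension are indeed the four lowest of the ten distinct intercorrelations:

    # Restating Table 1: r[(i, j)] is the correlation between parts i and j.
    r = {(1, 2): .62, (1, 3): .53, (1, 4): .63, (1, 5): .55,
         (2, 3): .73, (2, 4): .66, (2, 5): .79,
         (3, 4): .70, (3, 5): .77, (4, 5): .72}

    listening = [v for k, v in r.items() if 1 in k]      # .62, .53, .63, .55
    others    = [v for k, v in r.items() if 1 not in k]  # .73, .66, .79, .70, .77, .72
    print(max(listening) < min(others))                  # True: Listening's are the four lowest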

Later, in an update of the same interpretive publication, Manual for TOEFL Score Recipients (1973), on the basis of a similar table (see Table 2 below), it is suggested that 'the correlations between Listening Comprehension and the other parts of the test are the lowest. This is probably because skill in listening comprehension may be quite independent of skills in reading and writing; also it is not possible to standardize the administration of the Listening Comprehension section to the same degree as the other parts of the test' (p. 15). Here the authors offer what amount to mutually exclusive explanations. Both cannot be correct. If the low correlations between Listening Comprehension and the other subtests are the product of unreliability in the techniques for giving
the Listening test (i.e., poor sound reproduction in some testing centers, or merely inconsistent procedures) then it is not reasonable to use the same low correlations as evidence that the Listening test is validly measuring something that the other subtests are not measuring. What might that something be? Clearly, if it is lack of consistency in the procedure that produces the low correlations, it is not listening ability as distinct from reading, or vocabulary knowledge, or grammar knowledge, etc. On the other hand, if the low correlations are produced by real differences in the underlying skills presumed to be tested, the administrative procedures for the Listening test must have substantial reliability. It just can't go both ways.

TABLE 2
Intercorrelations of the Part Scores on the Test of English as a Foreign Language, Averaged over Administrations from October 1966 through June 1971. (From Manual for TOEFL Score Recipients. Copyright © 1973 by Educational Testing Service. All rights reserved. Reprinted by permission.)

Subscores                        (1)    (2)    (3)    (4)    (5)
(1) Listening Comprehension       -     .64    .56    .65    .60
(2) English Structure            .64     -     .72    .67    .78
(3) Vocabulary                   .56    .72     -     .69    .74
(4) Reading Comprehension        .65    .67    .69     -     .72
(5) Writing Ability              .60    .78    .74    .72     -
In the 1973 manual, the authors continue the argument that the tests are in fact measuring different skills by noting that 'none of the correlations ... [in our Table 2] are as high as the reliabilities of the part scores', from which they conclude that 'each part is contributing something unique to the total score' (p. 15). The question that is still unanswered is what that 'something unique' is, and whether in the case of each subtest it is in any way related to the label on that subtest. Is the Reading Comprehension subtest more of a measure of reading ability than it is of writing ability or grammatical knowledge or
vocabulary knowledge or mere test-taking ability or a general proficiency factor or intelligence? The fact that the reliability coefficients are higher than the correlations between different part scores is no proof that the tests are measuring different kinds of knowledge. In fact, they may be measuring the same kinds of knowledge, and their low intercorrelations may indicate merely that they are not doing it as well as they could. In any event, it is axiomatic that validity cannot exceed reliability - indeed the general rule of thumb is that validity coefficients are not expected to exceed the square of the reliabilities of the intercorrelated tests (Tate, 1965). If a certain test has some error variance in it and a certain other test also has some error variance in it, the error is apt to be compounded in their intercorrelation. Therefore, the correlation between two tests can hardly be expected to exceed their separate reliabilities. It can equal them only in the very special case that the tests are measuring exactly the same thing.
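
To make the arithmetic of this compounding concrete, the familiar textbook form of the ceiling (the correction for attenuation; note that this standard formulation involves the square root of the product of the two reliabilities, and may differ in detail from the rule of thumb attributed to Tate, 1965) is

    r(X,Y) <= sqrt( r(X,X') x r(Y,Y') )

where r(X,X') and r(Y,Y') are the reliabilities of tests X and Y. For example, if a 'Grammar' test had a reliability of .90 and a 'Vocabulary' test a reliability of .60, their observed intercorrelation could hardly be expected to exceed the square root of .54, roughly .73, no matter how closely the underlying abilities were related. Only when the two tests measure exactly the same thing with equal reliability does the ceiling equal that common reliability.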
From all of the foregoing, it is possible to see that low (or at least relatively low) correlations between different language tests can be interpreted as indications of low reliability or validity, but hardly as proof that the tests are measuring different things.
If one makes the mistake of interpreting low correlations as evidence that the tests in question are measuring different things, how will one interpret higher correlations when they are in fact observed between equally diverse tests? For instance, if the somewhat lower correlations between the TOEFL Listening Comprehension subtest and the Reading Comprehension subtest (as compared against the intercorrelations between the Reading Comprehension subtest and the other subtests) represented in Tables 1 and 2 above are taken as evidence that the Listening test measures some skill that the Reading test does not measure, and vice versa, then how can we explain the fact that the Listening Comprehension subtest correlates more strongly with a cloze test (usually regarded as a reading comprehension measure) than the latter does with the Reading Comprehension subtest (see Darnell, 1968)?
Once high correlations between apparently diverse tests are discovered, the previous interpretations of low correlations as indicators of a lack of relationship between whatever skills the tests are presumed to measure are about as convincing as the arguments of the unsuccessful fisherman who said there were no fish to be caught. The fact is that the fish are in hand. Surprisingly high correlations have been observed between a wide variety of testing techniques with a
wide range of tested populations. The techniques include a whole family of procedures under the general rubric of cloze testing, as well as dictation, elicited imitation, essay writing, and oral interview. Populations have ranged from children and adult second language learners, to children and adults tested in their native language (see the Appendix).
What then can be made of such high correlations? Two interpretations have been offered. One of them argues that the strong correlations previously observed between cloze and dictation, for instance, are merely indications of the reliability of both procedures and proof in fact that they are both measuring basically the same thing - further, that one of them is therefore not needed since they both give the same information. A second interpretation is that the high correlations between diverse tests must be taken as evidence not only of reliability but also of substantial test validity. In the first case, it is argued that part scores on a language proficiency test should produce low intercorrelations in order to attain validity, and in the second that just the reverse is true. It would seem that both positions cannot be correct.
It is easy to see that the expectation of low correlations between tests that purport to measure different skills, components, or aspects of language proficiency in accord with discrete point test philosophy necessitates some method of explaining away high correlations when they occur. The solution of treating the correlations merely as indications of test reliability, as Rand (1972) has done, runs into very serious logical trouble. Why should we expect a dictation, which requires auditory processing of sequences of spoken material, to measure the same thing as a cloze test, which requires the learner to read prose and replace missing words? To say that these tests are tests of the same thing, or to interpret a high correlation between them as an indication of reliability (alone, and not something more) is to saw off the limb on which the whole of discrete point theory is perched. If cloze procedure is not different from dictation, then what is the difference between speaking and listening skills? What basis could be offered for distinguishing a reading test from a grammar test? Are such tasks more dissimilar than cloze and dictation? If we were to follow this line of reasoning just a short step further, we would be forced to conclude that low correlations between language tests of any sort are indicators of low reliabilities perforce. This is a conclusion that no discrete point theorist, however, could entertain, as it obliterates all of the distinctions that are crucial to discrete point testing.
What has been proposed by discrete point theorists, however, is that tests should be constructed so as to minimize, deliberately, the correlations between parts. If discrete point theorizing has substance to it, such a recommendation is not entirely without support. However, if the competing viewpoint of pragmatic or integrative test philosophy turns out to be more correct, test constructors should interpret low correlations as probable indicators of low validities and should seek to construct language tests that maximize the intercorrelation of part scores.
This does not mean that what is required is a test that consists of only one sort of task (e.g., reading without speaking, listening, or writing). On the contrary, unless one is interested only in ability to read with comprehension or something of the sort, to learn how well an individual understands, speaks, reads, and writes a language it may well be necessary (or at least highly desirable) to require him to do all four sorts of performances. The question here is not which or how many tasks should be included on a comprehensive language test, but what sort of interrelationship between performances on language tasks in general should be expected.
If an individual happens to be much better at listening tasks than at speaking tasks, or at reading and writing tasks than at speaking and listening tasks, we would be much more apt to discover this fact with valid language tests than with non-valid ones. However, the case of a particular individual, who may show marked differences in ability to perform different language tasks, is not an argument against the possibility of a very high correlation between those same tasks for an entire population of subjects, or for subjects in general.
What if we go down the other road? What if we assume that part scores on language tests that intercorrelate highly are therefore redundant and that one of the highly correlated test scores should be eliminated from the test? The next step would be to look for some new subtest (or to construct one or more than one) which would assess some different component of language skill not included in the redundant tests. In addition to sawing the limb out from under the discrete point test philosophy, we would be making a fundamental error in the definition of reliability versus validity. Furthermore, we would be faced with the difficult (and probably intrinsically insoluble) problem of trying to decide how much weight to assign to which subskill, component, or aspect, etc. We have discussed this latter difficulty in section A above.
The matter of confusing reliability and validity is the second point
to be attended to in this section. Among the methods for assessing test reliability are: the test-retest method; the technique of correlating one half of a test with the other half of the same test (e.g., correlating the average score on even numbered items with the average on odd numbered items for each presumed homogeneous portion of a test); and the alternate forms method (e.g., correlating different forms of the same test). In all of these cases, and in fact in all other measures of reliability (Kuder-Richardson formulas and other internal consistency measures included), reliability by definition has to do with tests or portions of tests that are in some sense the same or fundamentally similar.
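
As a concrete illustration of the split-half technique just mentioned, the sketch below (in Python; the data and all names are hypothetical, and the Spearman-Brown step at the end is the usual companion to split-half correlations, though it is not discussed in the text) estimates reliability by correlating odd-item totals with even-item totals:

    # A minimal sketch of split-half (odd-even) reliability estimation.
    # Hypothetical data: rows are examinees, columns are items scored 0/1.

    import statistics

    def pearson(x, y):
        """Pearson product-moment correlation between two score lists."""
        mx, my = statistics.mean(x), statistics.mean(y)
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = sum((a - mx) ** 2 for a in x) ** 0.5
        sy = sum((b - my) ** 2 for b in y) ** 0.5
        return cov / (sx * sy)

    def split_half_reliability(item_scores):
        """Correlate each examinee's odd-item total with the even-item
        total, then step the half-test correlation up to full-test length
        with the Spearman-Brown formula."""
        odd = [sum(row[0::2]) for row in item_scores]
        even = [sum(row[1::2]) for row in item_scores]
        r_half = pearson(odd, even)
        return 2 * r_half / (1 + r_half)

    scores = [                      # hypothetical examinees x items matrix
        [1, 1, 1, 0, 1, 1],
        [1, 0, 1, 1, 0, 1],
        [0, 1, 0, 0, 1, 0],
        [1, 1, 1, 1, 1, 0],
        [0, 0, 1, 0, 0, 1],
    ]
    print(round(split_half_reliability(scores), 2))   # about .95 for these data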
To interpret high correlations between substantially different tests, or tests that require the performance of substantially different tasks, as mere indicators of reliability is to redefine reliability in an unrecognizable way. If one accepts such a definition, then how will one ever distinguish between measures of reliability and measures of validity? The distinction, which is a necessary one, evaporates.
In the case of language tests that require the performance of substantially different tasks, to interpret a high correlation between them as an indication of reliability alone is to treat the tasks as the same when they are not, and to ignore the possibility that even more diverse pragmatic language tasks may be equally closely related. In the case of language tests, high correlations are probably the result of an underlying proficiency factor that relates to a psychologically real grammatical system. If such a factor exists, the ultimate test of validity of any language test is whether or not it taps that factor, and how well it does so. The higher the correlations obtained between diverse tasks, the stronger the confidence that they are in fact tapping such a factor. The reasoning may seem circular, but the circularity is only apparent. There are independent reasons for postulating the underlying grammatical system (or expectancy grammar) and there are still other bases for determining what a particular language test measures (e.g., error analysis). The crucial empirical test for the existence of a psychologically real grammar is in fact performance on language tests (or call them tasks) of different sorts. Similarly, the chief criterion of validity for any proposed language test is how well it assesses the efficiency of the learner's internalized grammatical system (or, in the terms of Part One of this book, the learner's expectancy system).
On the basis of present research (see Oller and Perkins, 1978) it
seems likely that Chomsky (1972) was correct in arguing that
language abilities are central to human intelligence. Further, as is discussed in greater detail in the Appendix, it is apparently the case that language ability, school achievement, and IQ all constitute a relatively unitary factor. However, even if this latter conclusion were not sustained by further research, the discrete point interpretations of correlations as discussed above would still have to be radically revised. The problems there are logical and analytical, whereas the unitary factor hypothesis is an empirical issue that requires experimental study.

C. Statistical procedures as the final criterion for item selection


Perhaps because of their distinctive jargon, perhaps because of their bristling mathematical formulas, or perhaps out of blind respect for things that one does not fully understand, statistical procedures (like budgets and curricula, as we noted in Chapter 4) are sometimes elevated from the status of slaves to educational purposes to the status of masters which define the purposes instead of serving them. In Chapter 9 below we return to the topic of item analysis in greater detail. Here it is necessary to define briefly the two item statistics on which the fate of most test items is usually decided - i.e., whether to include a particular item, exclude it, or possibly rewrite it and pretest it again. The first item statistic is the simple percentage of students answering correctly (item facility) or the percentage answering incorrectly (item difficulty). For the sake of parsimony we will speak only of item facility (IF) - in fact, a little reflection will show that item difficulty is merely another way of expressing the same numerical value that is expressed as IF.
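
For concreteness, item facility is trivial to compute. The sketch below (Python; the data are hypothetical) prints IF and, as its complement, item difficulty - the same numerical information expressed the other way around:

    # A minimal sketch of item facility (IF) for one item, assuming
    # responses scored 1 (correct) or 0 (incorrect). Item difficulty
    # is simply 1 - IF.

    def item_facility(responses):
        """Proportion of examinees answering the item correctly."""
        return sum(responses) / len(responses)

    answers = [1, 1, 0, 1, 0, 1, 1, 0]    # hypothetical responses to one item
    IF = item_facility(answers)
    print(IF, 1 - IF)                     # prints 0.625 0.375: IF and item difficulty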
The second item statistic that is usually used by professional testers in evaluating the efficiency of an item is item discrimination (ID). This latter statistic has to do with how well the item tends to separate the examinees who are more proficient at the task in question from those examinees who are less proficient. There are numerous formulas for computing IDs, but all of them are in fact methods of measuring the degree of correlation between the tendency to get high and low scores on the total test (or subtest) and the tendency to answer correctly or incorrectly on a particular test item. The necessary assumption is that the whole test (or subtest) is apt to be a better measure of whatever the item attempts to measure (and the item can be considered a kind of miniature test) than any single item. If a given item is valid (by this criterion) it must correlate positively and significantly with the total
test. If it does so, it is possible to conclude that high scorers on the test will tend to answer the item in question correctly more frequently than do low scorers. However, if the correlation is nil or, worse yet, if it is negative, the item is in the former case giving no information about the proficiency assessed by the test as a whole, and in the latter is reversing the trends on the total test - i.e., if the correlation between the item and the test is negative it means that the examinees who get high scores on the test tend to miss the item more frequently than examinees who get low scores on the test.
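
One common realization of the ID statistic just described is the point-biserial correlation between the item score and the total score; it is offered here only as an illustration, since the text deliberately leaves open which of the 'numerous formulas' is used, and the data and names are hypothetical:

    import statistics

    def item_discrimination(item, totals):
        """Point-biserial correlation between a 0/1 item score and total
        test scores. (In careful practice the item is often removed from
        the total first - the 'corrected' item-total correlation.)"""
        p = sum(item) / len(item)                  # proportion answering correctly
        q = 1 - p
        m1 = statistics.mean(t for t, i in zip(totals, item) if i == 1)
        m0 = statistics.mean(t for t, i in zip(totals, item) if i == 0)
        s = statistics.pstdev(totals)              # spread of the total scores
        return (m1 - m0) / s * (p * q) ** 0.5

    item   = [1, 1, 0, 1, 0, 1, 0, 0]              # responses to one item
    totals = [48, 45, 30, 41, 28, 44, 35, 31]      # total test scores
    print(round(item_discrimination(item, totals), 2))   # 0.94: high scorers get it right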
For reasons that are discussed in more detail in Chapter 9, items with very low or very high IFs and/or items with very low or negative IDs are usually discarded from tests. Very simply, items that are too easy or too hard provide little or no information about the range of proficiencies in a particular group of examinees, and items that have nil or negative IDs either contribute nothing to the total amount of meaningful variance in the test (i.e., the tendency of the test to spread the examinees over a scale ranging from less proficient to more proficient) or they in fact tend to depress the meaningful variance by cancelling out some of it (in the case of negative IDs).
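
Put procedurally, the screening step just described might look like the following sketch (Python; the cutoff values are purely illustrative assumptions - the text prescribes no particular thresholds):

    # A hypothetical screening pass over pretested items: items with
    # extreme IFs or with nil/negative IDs are flagged for rejection
    # or rewriting. All cutoffs here are illustrative only.

    def screen(items, if_lo=0.15, if_hi=0.85, id_min=0.20):
        """Split (name, IF, ID) triples into kept and flagged items."""
        keep, flag = [], []
        for name, IF, ID in items:
            if if_lo <= IF <= if_hi and ID >= id_min:
                keep.append(name)
            else:
                flag.append(name)
        return keep, flag

    pretest = [("item1", 0.55, 0.41),   # acceptable IF and ID
               ("item2", 0.97, 0.10),   # too easy, weak discrimination
               ("item3", 0.60, -0.15)]  # negative ID: reverses the test's trend
    print(screen(pretest))              # (['item1'], ['item2', 'item3'])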
Probably any multiple choice test, or other test that is susceptible to item analysis, can be significantly improved by the application of the above criteria for eliminating weak items. In fact, as is argued in Chapter 9, multiple choice tests which have not been subjected to the requirements of such item analyses should probably not be used for the purposes of making educational decisions - unfortunately, they are used for such purposes in many educational contexts.
The appropriate use of item analysis, then, is to eliminate (or at least to flag for revision) items that are for whatever reason inconsistent with the test as a whole, or items that are not calibrated appropriately to the level of proficiency of the population to be tested. But what about the items that are left unscathed by such analyses? What about the items that seem to be appropriate in IF and ID? Are they necessarily, therefore, valid items? If such statistics can be used as methods for eliminating weak items, why not use them as the final criteria for judging the items which are not eliminated as valid - once and for all? There are several reasons why acceptable item statistics cannot be used as the final basis for judging the validity of items to be included in tests. It is necessary that test items conform to minimal requirements of IF and ID, but even if they do, this is not a sufficient basis for judging the items to be 'valid' in any fundamental sense.
One of the reasons that item statistics cannot be used as final
criteria for item selection - and perhaps the most fundamental reason - relates to the assumption on which ID is based. Suppose that a certain Reading Comprehension test (or one that bears the label) is a fairly poor test of what it purports to measure. It follows that even items that correlate perfectly with the total score on such a test must also be poor items. For instance, if the test were really a measure of the learner's ability to recall dates or numbers mentioned in the reading selection and to do simple arithmetic operations to derive new dates, the items with the highest IDs might in actuality be the ones with the lowest validities (as tests of reading comprehension).
Another reason that item statistics cannot be relied on for the final selection of test items is that they may in fact push the test in a direction that it should not go. For example, suppose that one wants to test knowledge of words needed for college-level reading of texts in mathematics. (We ignore for the sake of the argument at this point the question of whether a 'vocabulary' test as distinct from other types of tests is really more of a measure of vocabulary knowledge than of other things. Indeed, for the sake of the argument at this point, let us assume that a valid test of 'vocabulary knowledge' can be constructed.) By selecting words from mathematics texts, we might construct a satisfactory test. But suppose that for whatever reason certain words like ogle, rustle, shimmer, sheen, chiffonier, spouse, fettered, prune, and pester creep into the examination, and suppose further that they all produce acceptable item statistics. Are they therefore acceptable items for a test that purports to measure the vocabulary necessary to the reading of mathematics texts?
Or change the example radically: do words like ogle and chiffonier belong in the TOEFL Vocabulary subtest? With the right field of distractors, either of these might produce quite acceptable item statistics - indeed the first member of the pair did appear in the TOEFL. Is it a word that foreign students applying to American universities need to know? As a certain Mr Jones pointed out, it did produce very acceptable item statistics. Should it therefore be included in the test? If such items are allowed to stand, then what is to prevent a test from gravitating further and further away from common forms and usages to more and more esoteric terms that produce acceptable item statistics?
To return to the example concerning the comprehension of words in mathematics texts, it is conceivable that a certain population of very capable students will know all the words included in the vocabulary test. Therefore, the items would turn out to be too easy by
item statistics criteria. Does this necessarily mean that the items are not sufficiently difficult for the examinees? To claim this is like arguing that a ten foot ceiling cannot be measured with a ten foot tape. It is like arguing that a ten foot tape is the wrong instrument to use for measuring a ten foot ceiling because the ceiling is not high enough (or alternatively because the tape is too short).
Or, to look at a different case, suppose that all of the subjects in a certain population perform very badly on our vocabulary test. The item statistics may be unacceptable by the usual standards. Does it necessarily follow that the test must be made less difficult? Not necessarily, because it is possible that the subjects to be tested do not know any of the words in the mathematics texts - e.g., they may not know the language of the texts. To claim therefore that the test is not valid and/or that it is too difficult is like claiming that a tape measure is not a good measure of length, and/or that it is not short enough, because it cannot be used to measure the distance between adjacent points on a line.
For all of the foregoing reasons, test item statistics alone cannot be used to select or (in the limiting cases just discussed) to reject items. The interpretation of test item statistics must be subordinate to higher considerations. Once a test is available which has certain independent claims to validity, item analysis may be a useful tool for refining that test and for attaining slightly (perhaps significantly) higher levels of validity. Such statistics are by no means, however, final criteria for the selection of items.

D. Referencing tests against non-native performance


The evolution of a test or testing procedure is largely determined by the assumptions on which that test or procedure is based. This is particularly true of institutional or standardized tests because of their longer survival expectancy as compared against classroom tests that are usually used only once and in only one form. Until now, discrete point test philosophy has been the principal basis underlying standardized tests of all sorts.
Since discrete point theories of language testing were generally articulated in relation to the performance of non-native speakers of a given target language, most of the language tests based on such theorizing have been developed in reference to the performance of non-native speakers of the language in question. Generally, this has been justified on the basis of the assumption that native speakers
should perform flawlessly, or nearly so, on language tasks that are normally included in such tests. However, native speakers of a language do vary in their ability to perform language related tasks. A child of six years may be just as much a native speaker of English as an adult of age twenty-five, but we do not expect the child to be able to do all of the things with English that we may expect of the adult - hence, there are differences in skill attributable to age or maturation. Neither do we expect an illiterate farmer who has not had the educational opportunities of an urbanite of comparable abilities to be able to read at the same grade level and with equal comprehension - hence, there are differences due to education and experience.
Furthermore, recent empirical research, especially Stump (1978), has shown that normal native speakers do vary greatly in proficiency and that this variance may be identical with what has formerly been called IQ and/or achievement.
Thus, we must conclude that there is a real choice: language tests can either be referenced against the performance of native speakers or they may be referenced against the performance of non-native speakers. Put more concretely, the effectiveness of a test item (or a subtest, or an entire battery of tests) may be judged in terms of how it functions with natives or non-natives in producing a range of scores - or in producing meaningful variance between better performers and worse performers.
If a non-native reference population is used, test writers will tend to prepare items that maximize the variability within that population. If native speakers are selected as a reference population, test writers will tend to arrange items so as to maximize the variability within that population. Or more accurately, the test writers will attempt to make the test(s) in either case as sensitive as possible to the variance in language proficiency that is actually characteristic of the population against which the test is referenced.
In general, the attempt to maximize the sensitivity of a test to true variabilities in tested populations is desirable. This is what test validity is about. The rub comes from the fact that in the case of language tests, the ability of non-native speakers to answer certain discrete test items correctly may be unrelated to the kinds of ability that native speakers display when they use language in normal contexts of communication.
There are a number of salient differences between the performance of native speakers of a given language and the performance of non-natives who are at various stages of development in acquiring the
same language as a second or foreign language. Among the differences is the fact that native speakers generally make fewer errors, less severe errors, and errors which have no relationship to another language system (i.e., native speakers do not have foreign accents, nor do they tend to make errors that originate in the syntactic, semantic, or pragmatic system of a competing language). Native speakers are typically able to process material in their native language that is richer in organizational complexities than the typical non-native can handle (other things such as age, educational experience, and the like being equal). Non-natives have difficulty in achieving the same level of skill that native speakers typically exhibit in handling jokes, puns, riddles, irony, sarcasm, facetious humor, hyperbole, double entendre, subtle innuendo, and so forth. Highly skilled native speakers are less susceptible to false analogies (e.g., pronounciation for pronunciation, ask it to him on analogy with forms like tell it to him) and are more capable of making the appropriate analogies afforded by the richness of their own linguistic system (e.g., the use of parallel phrasing across sentences, the use of metaphors, similes, contrasts, comparisons, and so on).
Because of the contrasts in native and non-native performance which we have noted above, and many others, the effect of referencing a test against one population rather than the other may be quite significant. Suppose the decision is made to use non-natives as a reference population - as, for instance, the TOEFL test writers decided to do in the early 1960s. What will be the effect on the eventual form of the test items? How will they be apt to differ from test items that are referenced against the performance of native speakers?
If the variance in the performance of natives is not completely similar to the variance in the performance of non-natives, it follows that items which work well in relation to the variance in one will not necessarily work well in relation to the variance in the other. In fact, we should predict that some of the items that are easy for native speakers should be difficult for non-natives and vice versa - some of the items that are easy for non-natives should be more difficult for native speakers. This last prediction seems anomalous. Why should non-native speakers be able to perform better on any language test item than native speakers? From the point of view of a sound theory of psycholinguistics, the fact is that native speakers should always outperform non-natives, other things being equal. However, if a given test of language proficiency is referenced against the
performance of non-native speakers, and if the variance in their performance is different from the variance in the performance of natives, it follows that some of the items in the test will tend to gravitate toward portions of variance in the reference population that are not characteristic of normal language use by native speakers. Hence, some of the items on a test referenced against non-native performance will be more difficult for natives than for non-natives, and many of the items on such tests may have little or nothing to do with actual ability to communicate in the tested language.
Why is there reason to expect variance in the language skills of non-native speakers to be somewhat skewed as compared against the variance in native performance (due to age, education, and the like)? For one thing, many non-native speakers - perhaps most non-natives who were the reference populations for tests like the TOEFL, the Modern Language Association Tests, and many other foreign language tests - are exposed to the target language primarily in somewhat artificial classroom contexts. Further, they are exposed principally to materials (syntax based pattern drills, for instance) that are founded on discrete point theories of teaching and analyzing languages. They are encouraged to form generalizations about the nature of the target language that would be very uncharacteristic of native speaker intuitions about how to say and mean things in that same language. No native speaker, for example, would be apt to confuse going to a dance in a car with going to a dance with a girl, but non-natives may be led into such confusions. Forms like going to a foreign country with an airplane and going to a foreign country in an airplane are often confused due to classroom experience - see Chapter 6, section C.
The kinds of false analogies, or incorrect generalizations, that non-natives are lured into by poorly conceived materials combined with good teaching might be construed as the basis for what could be called a kind of freak grammar - that is, a grammar that is suited only for the rather odd contexts of certain teaching materials and that is quite ill-suited for the natural contexts of communication. If a test then is aimed at the variance in performance that is generated by more or less effective internalization of such a freak grammar, it should not be surprising that some of the items which are sensitive to the knowledge that such a grammar expresses would be impervious to the knowledge that a more normal (i.e., native speaker) grammar specifies. Similarly, tests that are sensitive to the variance in natural grammars might well be insensitive to some of the kinds of discrete
point knowledge characteristically taught in language materials.
If the foregoing predictions were correct, interesting contrasts in native and non-native performance on tests such as the TOEFL should be experimentally demonstrable. In a study by Angoff and Sharon (1971), a group of native speaking college students at the University of New Mexico performed less well than non-natives on 21% of the items in the Writing Ability section of that examination. The Writing Ability subtest of the TOEFL consists primarily of items aimed at assessing knowledge of discrete aspects of grammar, style, and usage. The fact that some items are harder for natives than for non-natives draws into question the validity of those items as measures of knowledge that native speakers possess. Apparently some of the items are in fact sensitive to things that non-natives are taught but that native speakers do not normally learn. If the test were normed against the performance of native speakers in the first place, this sort of anomaly could not arise. By this logic, native performance is a more valid criterion against which to judge the effectiveness of test items than non-native performance is.
Another sense in which the performance of non-native speakers may be skewed (i.e., characteristic of unusual or freak grammars) is in the relationship between skills, aspects of skills, and components of skills. For instance, the fact that scores on a test of listening comprehension correlate less strongly with written tests of reading comprehension, grammatical knowledge (of the discrete point sort), and so-called writing ability (as assessed by the TOEFL, for instance) than the latter tests correlate with each other (see Tables 1 and 2 in section B above) may be a function of experiential bias rather than a consequence of the factorial structure of language proficiency. Many non-native speakers who are tested on the TOEFL, for example, may not have been exposed to models who speak fluent American English. Furthermore, it may well be that experience with such fluent models is essential to the development of listening skill hand in hand with speaking, reading, and writing abilities. Hence, it is possible that the true correlation between skills, aspects, and components of skills is much higher under normal circumstances than has often been assumed. Further, the best approach, if this is true, would be to make the items and tests maximally sensitive to the meaningful variance present in native speaker performance (e.g., that sort of variance that is due to normal maturation and experience).
In short, referencing tests against the performance of non-native speakers, though statistically an impeccable decision, is hardly
defensible from the vantage point of deeper principles of validity and practicality. In a fundamental and indisputable sense, native speaker performance is the criterion against which all language tests must be validated, because it is the only observable criterion in terms of which language proficiency can be defined. To choose non-native performance as a criterion whenever native performance can be obtained is like using an imitation (even if it is a good one) when the genuine article is ready to hand. The choice of native speaker performance as the criterion against which to judge the validity of language proficiency tests, and as a basis for refining and developing them, guarantees greater facility in the interpretation of test scores, and more meaningful test sensitivities (i.e., variance).
Another incidental benefit of referencing tests against native performance is the exertion of a healthy pressure on materials writers, teachers, and administrators to teach non-native speakers to do what natives do - i.e., to communicate effectively instead of teaching them to perform discrete point drills that have little or no relation to real communication. Of course, there is nothing surprising to the successful classroom teacher in any of these observations. Many language teachers have been devoting much effort to making all of their classroom activities as meaningful, natural, and relevant to the normal communicative uses of language as is possible, and that for many years.

KEY POINTS


1. Statistical reasoning is sometimes difficult and can easily be misleading.
2. There is no known rational way of deciding what percentage of items on a discrete point test should be devoted to the assessment of a particular skill, aspect, or component of a skill. Indeed, there cannot be any basis for componential analysis of language tests into phonology, syntax, and vocabulary subtests, because in normal uses of language all components work hand in hand and simultaneously.
3. The difficulty of representatively sampling the universe of possible sentences in a language or discourses in a language is insurmountable.
4. The sentences or discourses in a language which actually occur are but a small portion (an infinitesimally small portion) of the ones which could occur, and they are non-repetitive due to the very nature of human experience.
5. The discrete point method of sampling the universe of possible phrases, or sentences, or discourses, is about as applicable to the fundamental problem of language testing as the method of listing examples of phrases, sentences, or discourses is to the basic problem of characterizing language proficiency - or psychologically real grammatical systems.
6. The solution to the grammarian's problem is to focus attention at a
deeper level - not on the surface forms of utterances, but on the underlying capacity which generates not only a particular utterance, but utterances in general. The solution to the tester's problem is similar - namely to focus attention not on the sampling of phrases, sentences, or discourses per se, but rather on the assessment of the efficiency of the developing learner capacity which generates sequences of linguistic elements in the target language (i.e., the efficiency of the learner's psychologically real grammar that interprets and produces sequences of elements in the target language in particular correspondences to extralinguistic contexts).
7. Low correlations have sometimes been interpreted incorrectly as showing that tests with different labels are necessarily measures of what the labels name. There are, however, many other sources of low correlations. In fact, tests that are measures of exactly the same thing may correlate at low levels if one or both are unreliable, too hard or too easy, or simply not valid (e.g., if both are measures of nothing).
8. It cannot reasonably be argued that a low correlation between tests with different labels is due to a lack of validity for one of the tests and is also evidence that the tests are measuring different skills.
9. Observed high correlations between diverse language tests cannot be dismissed as mere indications of reliability - they must indicate that the proficiency factor underlying the diverse performances is validly tapped by both tests. Furthermore, such high correlations are not ambiguous in the way that low correlations are.
10. The expectation of low correlations between tests that require diverse language performances (e.g., listening as opposed to reading) is drawn from discrete point theorizing (especially the componentializing of language skills), but is strongly refuted when diverse language tests (e.g., cloze and dictation, sentence paraphrasing and essay writing, etc.) are observed to correlate at high levels.
11. To assume that high correlations between diverse tests are merely an indication that the tests are reliable is to treat different tests as if they were the same. If they are not in fact the same, and if they are treated as the same, what justification remains for treating any two tests as different tests (e.g., a phonology test compared against a vocabulary test)? To follow such reasoning to its logical conclusion is to obliterate the possibility of recognizing different skills, aspects, or components of skills.
12. Statistical procedures merit the position of slaves to educational purposes much the way hammers and nails merit the position of tools in relation to building shelves. If the tools are elevated to the status of procedures for defining the shape of the shelves or what sort of books they can hold, they are being misused.
13. Acceptable item statistics do not guarantee valid test items; neither do unacceptable item statistics prove that a given item is not valid.
14. Tests must have independent and higher claims to validity before item statistics per se can be meaningfully interpreted.
15. Language tests may be referenced against the performance of native or non-native speakers.
8

Discrete Point Tests

A. What they attempt to do
B. Theoretical problems in isolating pieces of a system
C. Examples of discrete point items
D. A proposed reconciliation with pragmatic testing theory

Here several of the goals of discrete point theory are considered. The theoretical difficulty of isolating the pieces of a system is examined along with the diagnostic aim of specific discrete point test items. It is concluded that the virtues of specific diagnosis are preserved in pragmatic tests without the theoretical drawbacks and artificiality of discrete item tests. After all, the elements of language only express their separate identities normally in full-fledged natural discourse.

A. What they attempt to do


Discrete point tests attempt to achieve a number of desirable goals. Perhaps the foremost among them is the diagnosis of learner difficulties and weaknesses. The idea is often put forth that if the teacher or other test interpreter is able to learn from the test results exactly what the learner's strengths and weaknesses are, he will be better able to prescribe remedies for problems and will avoid wasting time teaching the learner what is already known.
Discrete point tests attempt to assess the learner's capabilities to handle particular phonological contrasts from the point of view of perception and production. They attempt to assess the learner's capabilities to produce and interpret stress patterns and intonations on longer segments of speech. Special subtests are aimed at knowledge of vocabulary and syntax. Separate tests for speaking, listening, reading, and writing may be devised. Always it is correctly
phonological contrasts; vocabulary exercises focussing on the expansion of receptive or productive repertoires (or speaking or listening repertoires); syntax drills designed to teach certain patterns of structure for speaking, and others designed to teach certain patterns for listening, and others for reading, and yet others for writing; and so on until all components and skills were exhausted.
These three goals, that is, diagnosing learner strengths and weaknesses, prescribing curricula aimed at particular skills, and developing specific teaching strategies to help learners overcome particular weaknesses, are among the laudable aims of discrete point testing. It should be noted, however, that the theoretical basis of discrete point teaching is no better than the empirical results of discrete point testing. The presumed components of grammar are no more real for practical purposes than they can be demonstrated to be by the meaningful and systematic results of discrete point tests aimed at differentiating those presumed components of grammar. Further, the ultimate effectiveness of the whole philosophy of discrete point linguistic analysis, teaching, and testing (not necessarily in any particular order) is to be judged in terms of how well the learners who are subjected to it are thereby enabled to communicate information effectively in the target language. In brief, the whole of discrete point methodology stands or falls on the basis of its practical results. The question is whether learners who are exposed to such a method (or family of methods) actually acquire the target language.
The general impotence of such methods can be attested to by almost any student who has studied a foreign language in a classroom situation. Discrete point methods are notoriously ineffective. Their nearly complete failure is demonstrated by the paucity of fluent speakers of any target language who have acquired their fluency exclusively in a classroom situation. Unfortunately, since classroom situations are predominantly characterized by materials and methods that derive more or less directly from discrete point linguistic analysis, the verdict seems inescapable: discrete point methods don't work.
The next obvious question is why. How is it that methods which have so much authority, and just downright rational analytic appeal, fail so widely? Surely it is not for lack of dedication in the profession. It cannot be due to a lack of talented teachers and bright students, nor that the methods have not been given a fair try. Then, why?

B. Theoretical problems in isolating pieces of a system


Discrete point theories are predicated on the notion that it is possible to separate analytically the bits and pieces of language and then to teach and/or test those elements one at a time without reference to the contexts of usage from which those elements were excised. It is an undeniable fact, however, that phonemes do not exist in isolation. A child has to go to school to learn that he knows how to handle phonemic contrasts - to learn that his language has phonemic contrasts. It may be true that he unconsciously makes use of the phonemic contrast between see and say, for instance, but he must go to school to find out that he has such skills or that his language requires them. Normally, the phonemic contrasts of a language are no more consciously available to the language user than harmonic intervals are to a music lover, or than the peculiar properties of chemical elements are to a gourmet cook. Just as the relations between harmonics are important to a music lover only in the context of a musical piece (and probably not at all in any other context), and just as the properties of chemical elements are of interest to the cook only in terms of the gustatory effects they produce in a roast or dish of stew, phonemic contrasts are principally of interest to the language user only in terms of their effects in communicative exchanges - in discourse.
Discrete point analysis necessarily breaks the elements of language apart and tries to teach them (or test them) separately with little or no attention to the way those elements interact in a larger context of communication. What makes it ineffective as a basis for teaching or testing languages is that crucial properties of language are lost when its elements are separated. The fact is that in any system where the parts interact to produce properties and qualities that do not exist in the parts separately, the whole is greater than the sum of its parts. If the parts cannot just be shuffled together in any old order - if they must rather be put together according to certain organizational constraints - those organizational constraints themselves become crucial properties of the system which simply cannot be found in the parts separately.
An example of a discrete point approach to the construction of a test of 'listening grammar' is offered by Clark (1972):

    Basic to the growth of student facility in listening comprehension is the development of the ability to isolate and appropriately interpret important syntactical and morphological aspects of the spoken utterance such as tense, number, person, subject-object distinctions, declarative and imperative structures, attributions, and so forth. The student's knowledge of lexicon is not at issue here; and for that matter, a direct way of testing the aural identification of grammatical functions would be to use nonsense words incorporating the desired morphological elements or syntactic patterns. Given a sentence such as 'Le muglet a été candré par la friblonne,' [roughly translated from French, The muglet has been candered by the friblun, where muglet, cander, and friblun are nonsense words] the student might be tested on his ability to determine: 1) the time aspect of the utterance (past time), 2) the 'actor' and 'acted upon' ('friblonne' and 'muglet', respectively), and the action involved ('candre') [p. 53f].
First, it is assumed that listening skill is different from speaking skill, or reading skill, or writing skill. Further, that lexical knowledge as related to the listening skill is one thing while lexical knowledge as related to the reading skill is another, and further still that lexical knowledge is different from syntactic (or morphological) knowledge as each pertains to listening skill (or 'listening grammar'). On the basis of such assumptions, Clark proposes a very logical extension: in order to separate the supposedly separate skills for testing it is necessary to eliminate lexical knowledge from consideration by the use of nonsense words like muglet, cander, and friblun. He continues,

    If such elements were being tested in the area of reading comprehension, it would be technically feasible to present printed nonsense sentences of this sort upon which the student would operate. In a listening comprehension situation, however, the difficulty of retaining in memory the various strange words involved in the stimulus sentence would pose a listening comprehension problem independent of the student's ability to interpret the grammatical cues themselves. Instead of nonsense words (which would in any event be avoided by some teachers on pedagogical grounds), genuine foreign-language vocabulary is more suitably employed to convey the grammatical elements intended for aural testing [p. 54].
Thus, a logical extension of discrete point testing is offered for reading comprehension tests, but is considered inappropriate for listening tests. Let us suppose that such items as Clark is suggesting were used in reading comprehension tests to separate syntactic knowledge from lexical knowledge. In what ways would they differ from similar sentences that might occur in normal conversation, prose, or discourse? Compare The muglet has been candered by the friblun with The money has been squandered by the freeloader. Then consider the question whether the relationship that holds between the
grammatical subject and its respective predicate in each case is the same. Add a third example, The pony has been tethered by the barnyard. It is entirely unclear in the nonsense example whether the relationship between the muglet and the friblun is similar to the relationship between the money and the freeloader or whether it is similar to the relationship between the pony and the barnyard. How could such syntactic relationships and the knowledge of them be tested by such items? One might insist that the difference between squandering something and tethering something is strictly a matter of lexical knowledge, but can one reasonably claim that the relationship between a subject and its predicate is strictly a lexical relationship? To do so would be to erase any vestige of the original distinction between syntax and vocabulary. The fact that the money is in some sense acted upon by the freeloader who does something with it, namely, squanders it, and that the pony is not similarly acted upon by the barnyard is all bound up in the syntax and in the lexical items of the respective sentences, not to mention their potential pragmatic relations to extralinguistic contexts and their semantic relations to other similar sentences.
It is not even remotely possible to represent such intrinsically rich complexities with nonsense items of the sort Clark is proposing. What is the answer to questions like: Can fribluns be candered by muglets? Is candering something that can be done to fribluns? Can it be done to muglets? There are no answers to such questions, but there are clear and obvious answers to questions like: Can barnyards be tethered by ponies? Is tethering something that can be done to barnyards? Can it be done to ponies? Can freeloaders be squandered by money? Is squandering something that can be done to freeloaders? Can it be done to money? The fact that the latter questions have answers and that the former have none is proof that normal sentences have properties that are not present in the bones of those same sentences stripped of meaning. In fact, they have syntactic properties that are not present if the lexical items are not there to enrich the syntactic organization of the sentences.
The syntax of utterances seems to be just as intricately involved in the expression of meaning as the lexicon is, and to propose testing syntactic and lexical knowledge separately is like proposing to test the speed of an automobile with the wheels first and the engine later.
It makes little difference to the difficulties that discrete point testing creates if we change the focal point of the argument from the sentence level to the syllable level or to the level of full-fledged discourse.
Syllables have properties in disco urse that they do not h ave in


isolat ion an d sentences have properties in discou rse that they do not
have in isolati on a nd discourse has properti es in rela tion to everyday
experience th at it d oes n ot have when it is isolated fr OID such
ex perience. In fact, discourse ca nn ot really be considered d iscourse at
all ifit is not systematically related to experience in ways that can be
inferred by speakers of t he language. With res pect to syllables,
consider the stress and length ofa given syllable such as /lOd/ as in He
read the entire essay in one silling and in the sentence He read it is wha t
he did wilh iI, (as in response to Whal all earl h did you say he did lVilh
it 7) .
Can a learner be said to know a syllable on the basis of a discrete test item that requires him to distinguish it from other similar syllables? If the learner knew all the syllables of the language in that sense would this be the same as knowing the language?
For the word syllable in the preceding questions, substitute the words sound, word, syntactic pattern, but one must not substitute the words phrase, sentence, or conversation, because they certainly cannot be adequately tested by discrete item tests. In fact it is extremely doubtful that anything much above the level of the distinctive sounds (or phonemes) of a language can be tested one at a time as discrete point theorizing requires. Furthermore, it is entirely unclear what should be considered an adequate discrete point test of knowledge of the sounds or sound system of a language. Should it include all possible pairs of sounds with similar distributions? Just such a pairing would create a very long test if it only required discrimination decisions about whether a heard pair of sounds was the same or different. Suppose a person could handle all of the items on the test. In what sense could it be said that he therefore knows the sound system of the tested language?
The fact is that the sounds of a language are structured into sequences that make up syllables which are structured in complex ways into words and phrases which are themselves structured into sentences and paragraphs or higher level units of discourse, and the highest level of organization is rather obviously involved in the lowest level of linguistic unit production and interpretation. The very same sequence of sounds in one context will be taken for one syllable and in another context will be taken for another. The very same sound in one context may be interpreted as one word and in another as a completely different word (e.g., 'n, as in He's 'n ape, and in This 'n that 'n the other). A given sequence of words in one context may be taken to mean exactly the opposite of what they mean in another context (e.g., Sure you will, meaning either No, you won't or Yes, you will).
All of the foregoing facts and many others that are not mentioned here make the problems of the discrete item writer not just difficult but insurmountable in principle. There is no way that the normal facts of language can adequately be taught or tested by using test items or teaching materials that start out by destroying the very properties of language that most need to be grasped by learners. How can a person learn to map utterances pragmatically onto extralinguistic contexts in a language that he does not yet know (that is, to express and interpret information in words about experience) if he is forced to deal with words and utterances that are never related to extralinguistic experience in the required ways? The answer is that no one can learn a language on the basis of the principles advocated by discrete point theorists. This is not because it is very difficult to learn a language by experiencing bits and pieces of it in isolation from pragmatic contexts; it is because it is impossible to learn a language by experiencing bits and pieces of it in that way.
For the same reason, discrete test items that aim to test the knowledge of language independent of the use of that knowledge in normal contexts of communication must also fail. No one has ever proposed that instead of running races at the Olympics contestants should be subjected to a battery of tests including the analysis of individual muscle potentials, general quickness, speed of bending the leg at the knee joint, and the like - rather the speed of runners is tested by having them run. Why should the case be so different for language testing? Instead of asking how well a particular language learner can handle the bits and pieces of presumed analytical components of grammar, why not ask how well the learner can use all of the components (whatever they are) in dealing with discourse?

In addition to strong logic, there is much empirical evidence to show that discrete point methods of teaching fail and that discrete point methods of testing are inefficient. On the other hand, there are methods of teaching (and learning languages) that work - for instance, methods of teaching where the pragmatic mapping of utterances onto extralinguistic contexts is made obvious to the learner. Similarly, methods of testing that require the examinee to perform such mapping of sequences of elements in the target language are quite efficient.
C. Examples of discrete point items

The purpose of this section is to examine some examples of discrete point items and to consider the degree to which they produce the kinds of information they are supposed to produce - namely, diagnostic information concerning the mastery of specific points of linguistic structure in a particular target language (and for some testers, learners from a particular background language). In spite of the fact that vast numbers of discrete point test categories are possible in theory, they always get pared down to manageable proportions even by the theorists who advocated the more proliferate test designs in the first place.

For example, under the general heading of tests of phonology a goodly number of subheadings have been proposed including: subtests of phonemic contrasts, stress and intonation, subclassed still further into subsubtests of recognition and production, not to mention the distinctions between word stress versus sentence stress and so on. In actuality, no one has ever devised a test that makes use of all of the possible distinctions, nor is it likely that anyone ever will since the possible distinctions can be multiplied ad infinitum by the same methods that produced the commonly employed distinctions. This last fact, however, has empirical consequences in the demonstrable fact that no two discrete point testers (unless they have imitated each other) are apt to come up with tests that represent precisely the same categories (i.e., subtests, subsubtests, and the like). Therefore, the items used as examples here cannot represent all of the types of items that have been proposed. They do, however, represent commonly used types.

First, we will consider tests of phonological skills, then vocabulary, then grammar (usually limited to a narrow definition of syntax - that is, having to do with sequential relations between words or phrases, or clauses).

1. Phonological items. Perhaps the most often recommended and widest used technique for assessing 'recognition' or 'auditory discrimination' is the minimal pair technique or some variation of it. Lado (1961), Harris (1969), Clark (1972), Heaton (1975), Allen and Davies (1977), and Valette (1977) all recommend some variant of the technique. For instance, Lado suggests reading pairs of words with minimal sound contrasts while the students write down 'same' or 'different' (abbreviated to 'S' or 'D') for each numbered pair. To test 'speakers of Spanish, Portuguese, Japanese, Finnish' and other language backgrounds who are learning English as a foreign or second language, Lado proposes items like the following:
1. sleep : slip
2. jist : fist
3. ship : sheep
4. heat : heat
5. jeep : gyp
6. leap : leap
7. rid : read
8. mill : mill
9. neat : knit
10. beat : bit (Lado, 1961, p. 53).
Another item type which is quite similar is offered by both Lado and Harris. The specific examples here are from Lado. The teacher (or examiner) reads the words (with identical stress and intonation) and asks the learner (or examinee) to indicate which words are the same. If all are the same, the examinee is to check A, B, and C on the answer sheet. If none is the same he is to check O.
1. cat : cat : cot
2. run : sun : run
3. last : last : last
4. beast : best : best
5. pair : fair : chair (Lado, 1961, p. 74).
Now, let us consider briefly the question of what diagnostic information such test items provide. Suppose a certain examinee misses item 2 in the second set of items given above. What can we deduce from this fact? Is it safe to say that he doesn't know /s/ or is it /r/? Or could it be he had a lapse of attention? Could he have misunderstood the item instructions or marked the wrong spot on the answer sheet? What teaching strategies could be recommended to remedy the problem? What does missing item 2 mean with respect to overall comprehension?

Or suppose the learner misses item 4 in the second set given above. Ask the same questions. What about item 5 where three initial consonants are contrasted? The implication of the theory that highly focussed discrete point items are diagnostic by virtue of their being aimed at specific contrasts is not entirely transparent.
What about the adequacy of coverage of possible contrasts? Since it can be shown that the phonetic form of a particular phoneme is quite different when the phoneme occurs initially (after a pause or silence) rather than medially (between other sounds) or finally (before a pause or silence), an adequate recognition test for the sounds of English should presumably assess contrasts in all three positions. If the test were to assess only minimal contrasts, it should presumably test separately each vowel contrast and each consonantal contrast (ignoring the chameleonic phonemes such as /r/ and /l/ which have properties of vowels and consonants simultaneously, not to mention /w/ and /y/ which do not perfectly fit either category). It would have to be a very long test indeed. If there were only eleven vowels in English, the matrix of possible contrasts would be eleven times eleven, or 121, minus eleven (the diagonal pairs of the matrix which involve contrasts between each element and itself, or the null contrasts), or 110, divided by two (to compensate for the fact that the top half of the matrix is identical to the bottom half). Hence, the number of non-redundant pairs of vowels to be contrasted would be at least 55. If we add in the number of consonants that can occur in initial, medial, and final position, say, about twenty (to be on the conservative side), we must add another 490 items times the three positions, or 1,470, plus 55 equals 1,525 items. Furthermore, this estimate is still low because it does not account for consonant clusters, diphthongs, nor for vocalic elements that can occur in initial or final positions.
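For readers who wish to check the arithmetic, the counting rule just described is easy to state in a few lines of code. The sketch below (in Python, offered purely as an editorial illustration; nothing like it appears in the testing literature cited here) computes the number of non-redundant contrast pairs for an inventory of n sounds.

```python
# Illustrative only: counting the non-redundant minimal pairs that a
# 'complete' discrimination test would have to cover.

def contrast_pairs(n: int) -> int:
    """Non-redundant pairs among n sounds: the full n x n matrix,
    minus the n null contrasts on the diagonal, divided by two
    because the upper and lower triangles mirror each other."""
    return (n * n - n) // 2

print(contrast_pairs(11))  # 55 vowel pairs, as computed in the text

# The same count, taken once per position (initial, medial, final)
# for the consonant inventory, and then augmented with clusters and
# diphthongs, is what drives the total over a thousand items.
```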
Suppose the teacher decides to test only a sampling of the possible contrasts and develops a 100 item test. How will the data be used? Suppose there are twenty students in the class where the test is to be used. There would be 2,000 separate pieces of data to be used by the teacher. Suppose that each student missed a slightly different set of items on the test. How would this diagnostic information be used to develop different teaching strategies for each separate learner? Suppose that the teacher actually had the time and energy to sit down and go through the tests one at a time looking at each separate item for each separate learner. How would the item score for each separate learner be translated into an appropriate teaching strategy in each case? The problem we come back to is how to interpret a particular performance on a particular item on a highly focussed discrete point test. It is something like the problem of trying to determine the composition of sand in a particular sand box in relation to a certain beach by comparing the grains of sand in the box with the grains of sand on the beach - one at a time. Even if one were to set out to improve the degree of correspondence how would one go about it, and what criterion of success could be conceived?
Other types of items proposed to test phonological discrimination are minimal sentence pairs such as:
1. Will he sleep? Will he slip?
2. They beat him. They bit him.
3. Let me see the sheep. Let me see the ship. (Lado, 1961, p. 53).
Lado suggests that these items are more valid than words in isolation because they are more difficult: 'The student does not know where the difference will occur if it does occur' (p. 53). He argues that such a focussed sort of test item is to be preferred over more fully contextualized discourse samples because the former insures that the student has actually perceived the sound contrast rather than merely guessing the meaning or understanding the context instead of perceiving the 'words containing the difficult sounds' (p. 54). Lado refers to such guessing factors and context clues as 'non-language factors'.
But let's consider the matter a bit further. In what possible context would comprehension of a sentence like, Will he sleep? depend on someone's knowledge of the difference between sleep and slip? Is the knowledge associated with the meaning of the word sleep and the sort of states and the extralinguistic situations that the word is likely to be associated with less a matter of language proficiency than knowledge of the contrast between /iy/ and /i/? Is it possible to conceive of a context in which the sentence, Will he sleep? would be likely to be taken for the sentence, Will he slip? How often do you suppose slipping and sleeping would be expected to occur in the same contexts? Ask the same questions for each of the other example sentences.
Further, consider the difficulty of sentences such as the ones used in 2, where the first sentence sets up a frame that will not fit the second. If the learner assumes that the they in They beat him has the same referential meaning as the subsequent they in They bit him, the verb bit is unlikely. People may beat things or people or animals, but him seems likely to refer to a person or an animal. Take either option. Then when you hear, They bit him close on the heels of They beat him, what will you do with it? Does they refer to the same people and does him refer to the same person or animal? If so, how odd. People might beat a man or a dog, but would they then be likely to bite him? As a result of the usual expectancies that normal language users will generate in perceiving meaningful sequences of elements in their language, the second sentence is more difficult with the first as its antecedent. Hence, the kind of contextualization proposed by Lado to increase item validity may well decrease item validity. The function of the sort of context that is suggested for discrete point items of phonological contrasts is to mislead the better language learners into false expectancies instead of helping them (on the basis of normally correct expectancies set up by discourse constraints) to make subtle sound distinctions.
The items pictured in Figures 9-13 represent a different sort of attempt to contextualize discrete point contrasts in a listening mode. In Figure 9, for instance, both Lado (1961) and Harris (1969) have in mind testing the contrast between the words ship and sheep. Figure 10, from Lado (1961), is proposed as a basis for testing the distinction between watching and washing. Figure 11, also from Lado, is proposed as a basis for testing the contrasts between pin, pen, and pan - of course, we should note that the distinction between the first two (pin and pen) no longer exists in the widest used varieties of American English. Figure 12 aims to test the initial consonant distinctions between sheep and jeep and the vowel contrast between sheep and ship. Figure 13 offers the possibility of testing several contrasts by asking the examinee to point to the pen, pin, pan, picture, pitcher, the person who is watching the dishes (according to Lado, 1961, p. 59) and the person who is washing the dishes.

Figure 9. The ship/sheep contrast, Lado (1961, p. 57) and Harris (1969, p. 34).
Figure 10. The watching/washing contrast, Lado (1961, p. 57).
Figure 11. The pin/pen/pan contrast, Lado (1961, p. 58).
Figure 12. The ship/jeep/sheep contrast, Lado (1961, p. 58).
Figure 13. 'Who is watching the dishes?' (Lado, 1961, p. 59).
Pertinent questions for the interpretation of errors on items of the type related to the pictures in Figures 9-11 are similar to the questions posed above in relation to similar items without pictures. If it were difficult to prepare an adequate test to cover the phonemic contrasts of English without pictures it would surely be more difficult to try to do it with pictures. Presumably the motivation for using pictures is to increase the meaningfulness of the test items - to contextualize them, just as in the case of the sentence frames discussed two paragraphs earlier. We saw that the sentence contexts actually are apt to create false expectancies which would distract from the purpose of the items. What about the pictures?

Is it natural to say that the man in Figure 13 is watching the dishes? It would seem more likely that he might watch the woman who is washing the dishes. Or consider the man watching the window in Figure 10. Why is he doing that? Does he expect it to escape? To leave? To hatch? To move? To serve as an exit for a criminal who is about to try to get away from the law? If not for some such reason, it would seem more reasonable to say that the man in the one picture is staring at a window and the one in the other picture is washing a different window. If the man were watching the same window, how is it that he cannot see the man who is washing it? The context, which is proposed by Lado to make the contrasts meaningful, not only fails to represent normal uses of language accurately, but also fails to help the learner to make the distinctions in question. If the learner does not already know the difference between watching and washing, and if he was not confused before experiencing the test item he may well be afterward. If the learner does not already know the meaning of the words ship, sheep, pin, pan, and pen, and if the sound contrasts are difficult for the learner to perceive, the pictures in conjunction with meaningless similar sounding forms merely serve as a slightly richer basis for confusion. Why should the learner who is already having difficulty with the distinction say between pin and pen have any less difficulty after being exposed to the pictures associated with words which he cannot distinguish? If he should become able to perceive the distinction on the basis of some teaching exercise related to the test item types, on what possible basis is it reasonable to expect the learner to associate correctly the (previously unfamiliar) word sheep, for instance, with the picture of the sheep and the word ship with the picture of the ship?
The very form of the exercise (or test item) has placed the contrasting words in a context where all of the normal bases for the distinction in meaning have been deliberately obliterated. The learner is very much in the position of the child learning to spell to whom it is pointed out that the pairs of spellings their and there, pare and pair, son and sun represent different meanings and not to get confused about which is which. Such a method of presentation is almost certain to confuse the learner concerning which meaning goes with which spelling.

2. Vocabulary items. It is usually suggested that knowledge of words should be referenced against the modality of processing - that is, the vocabulary one can comprehend when reading. Hence, it is often claimed that there must be separate vocabulary tests for each of the traditionally recognized four skills, at least for receptive and productive repertoires. Above, especially in Chapter 3, we considered an alternative explanation for the relative availability of words in listening, speaking, reading, and writing. It was in fact suggested that it is probably the difficulty of the task and the load it places on memory and attention that creates the apparent differences in vocabulary across different processing tasks. Ask yourself the question whether you know or do not know a word you may have difficulty in thinking of at a particular juncture. Would you know it better if it were written? Less well if you heard it spoken? If you could understand its use by someone else how does this relate to your ability or inability to use the same word appropriately? It would certainly appear that there is room for the view that a single lexicon may account for word knowledge (whatever it may be) across all four skills. It may be merely the accessibility of words that changes with the nature of the processing task rather than the words actually in the lexicon.

In any event, discrete point theory requires tests of vocabulary and often insists that there must be separate tests for what are presumed to be different skill areas. A frequently-used type of vocabulary test is one aimed specifically at the so-called reading skill. For instance, Davies (1977) suggests the following vocabulary item:

Our tom cat has been missing ever since that day I upset his milk.
A. wild    C. name
B. drum    D. male

One might want to ask how useful the word tom in the sense given is for the students in the test population. Further, is it not possible that a wild tom cat became the pet in question and was then frightened off by the incident? What diagnostic information can one infer (since this is supposed to be a diagnostic type of item) from the fact that a particular student misses the item selecting, say, choice C, name? Does it mean that he does not know the meaning of tom in the sense given? That he doesn't know the meaning of name? That he doesn't know any of the words used? That he doesn't understand the sentence? Or are not all of these possibilities viable as well as many other combinations of them? What is specific about the diagnostic information supposedly provided by such an item?
Another item suggested by Davies (1977) is of a slightly different type:

Cherry: Red Fruit Vegetable Blue Cabbage Sweet Stalk Tree Garden

The task of the examinee is to order the words offered in relation to their closeness in meaning to the given word cherry. Davies avows, 'it may be argued that these tests are testing intelligence, particularly example 2 [of the two examples given immediately above] which demands a very high degree of literacy, so high that it may be entirely intelligence that is being tested here' (p. 81). There are several untested presuppositions in Davies' remark. One of them is that we know better what we are talking about when we speak of 'intelligence' than when we speak of language skill. (On this topic see the Appendix, especially part D.) Another is that the words in the proffered set of terms printed next to the word cherry have some intrinsic order in relation to cherries.
The difficulty with this item, as with all of the items of its type, is that the relationship between cherries and cabbage, or gardens, etc., has a great deal more to do with where one finds the cherries at the moment than with something intrinsic to the nature of the word cherry. At one moment the fact that a cherry is apt to be found on a cherry tree may be the most important defining property. In a different context the fact that some of the cherries are red and therefore edible may carry more pragmatic cash value than the fact that it is a fruit. In yet another context sweetness may be the property of greatest interest. It is an open empirical question whether items of the sort in question can be scored in a sensible way and whether or not they will produce a high correlation with tests of reading ability.
Lado (1961) was among the first language testers to suggest vocabulary items like the first of Davies' examples, given above. For instance, Lado suggested items in the following form:

Integrity
A. intelligence    C. intrigue
B. uprightness    D. weakness

Another alternative suggested by Lado (1961, p. 189) was:

The opposite of strong is
A. short    C. weak
B. poor    D. good

Similar items in fact can be found in books on language testing by Harris (1969), Clark (1972), Valette (1967, 1977), Heaton (1975) and in many other sources. In fact, they date back to the earliest forms of so-called 'intelligence' and also 'reading' tests (see Gunnarsson, 1978, and his references).
Two nagging questions continue to plague the user of discrete point vocabulary tests. The first is whether such tests really measure (reliably and validly) something other than what is measured by tests that go by different names (e.g., grammar, or pronunciation, not to mention reading comprehension or IQ). The second is whether the kind of knowledge displayed in such tests could not better be demonstrated in tasks that more closely resemble normal uses of language.

3. Grammar items. Again, there is the problem of deciding what modality is the appropriate one, or how many different modalities must be used in order to test adequately grammatical knowledge (whatever the latter may be construed to be). Sample items follow with sources indicated (all of them were apparently intended for a written mode of presentation):
i. 'Does John speak French?'
'I don't know what _____.'
A. does
B. speaks
C. he (Lado, 1961, p. 180).
ii. When ____________?
A. plan    C. to go
B. do    D. you (Harris, 1969, p. 28).
iii. I want to ______ home now.
A. gone    C. go
B. went    D. going (Davies, 1977, p. 77).

Similar items can be found in Clark (1972), Heaton (1975), and Valette (1977). Essentially they concentrate on the ordering of words or phrases in a minimal context, or they require selection of the appropriate continuation at some point in the sentence. Usually no larger context is implied or otherwise indicated.
In the Appendix we examine the correlation between a set of tests focussing on the formation of appropriate continuations in a given text, another set requiring the ordering of words, phrases, or clauses in similar texts, and a large battery of other tests. The results suggest that there is no reasonable basis for claiming that the so-called vocabulary (synonym matching) type of test items are measuring anything other than what the so-called grammar items (selecting the appropriate continuation, and ordering elements appropriately) are measuring. Further, these tests do not seem to be doing anything different from what standard dictation and cloze procedure can accomplish. Unless counter evidence can be produced to support the super-structure of discrete point test theory, it would appear to be in grave empirical difficulty.

D. A proposed reconciliation with pragmatic testing theory

From the arguments presented in this chapter and throughout this entire book - especially all of Part Two - one might be inclined to think that the 'elements' of language, whatever they may be, should never be considered at all. Or at least, one might be encouraged to read this recommendation between the lines. However, this would be a mistake. What, after all, does a pragmatic test measure? Does it not in fact measure the examinee's ability to make use of the sounds, syllables, words, phrases, intonations, clauses, etc. in the contexts of normal communication? Or at least in contexts that faithfully mirror normal uses of language? If the latter is so, then pragmatic tests are in fact doing what discrete point testers wanted done all along. Indeed, pragmatic tests are the only reasonable approach to testing language skills if we want to know how well the examinee can use the elements of the language in real-life communication contexts.

What pragmatic language tests accomplish is precisely what discrete point testers were hoping to do. The advantage that pragmatic tests offer to the classroom teacher and to the educator in general is that they are far easier to prepare than are tests of the discrete point type, and they are nearly certain to produce more meaningful and more readily interpretable results. We will see in Chapter 9 that the preparation and production of multiple choice tests is no simple task. We have already seen that the determination of how many items of certain types to include in discrete point tests poses intrinsically insoluble and pointless theoretical and practical mind-bogglers. For instance, how many vocabulary items should be included? Is tom as in tom cat worth including? What is the relative importance of vowel contrasts as compared against morphological contrasts (e.g., plural, possessive, tense marking, and the like)? Which grammatical points found in linguistic analyses should be found in language tests focussed on 'grammar'? What relative weights should be assigned to the various categories so determined? How much is enough to represent adequately the importance of determiners? Subject raising? Relativization? The list goes on and on and is most certainly not even close to being complete in the best analyses currently available.
The great virtue, the important insight of linguistic analysis, is in demonstrating that language consists of complicated sequences of elements, subsequences of sequences, and so forth. Further, linguistic research has helped us to see that the elements of language are to a degree analyzable. Discrete point theory tried to capitalize on this insight and pushed it to the proverbial wall. It is time now to re-evaluate the results of the application. Recent research with pragmatic language tests suggests that the essential insights of discrete point theories can be more adequately expressed in pragmatic tests than in overly simplistic discrete point approaches which obliterate crucial properties of language in the process of taking it to pieces. The pieces should be observed, studied, taught, and tested (it would seem) in the natural habitat of discourse rather than in isolated sentences pulled out of the clear blue sky.

In Part Three we will consider ways in which the diagnostic information sought by discrete point theory in isolated items aimed at particular rules, words, sound contrasts and the like can much more sensibly be found in learner protocols related to the performance of pragmatic discourse processing tasks where the focus is on communicating something to somebody rather than merely filling in some blank in some senseless (or nearly senseless) discrete item pulled from some strained test writer's brain. The reconciliation of discrete point theory with pragmatic testing is accomplished quite simply. All we have to do is acknowledge the fact that the elements of language are normally used in discourse for the purposes of communication - by the latter term we include all of the abstract, expressive, and poetic uses of language as well as the wonderful mundane uses so familiar to all normal human beings.
KEY POINTS
1. Discrete point approaches to testing derive from discrete point approaches to teaching. They are mutually supportive.
2. Discrete point tests are supposed to provide diagnostic input to specific teaching remedies for specific weaknesses.
3. Both approaches stand or fall together. If discrete point tests cannot be shown to have substantial validity, discrete point teaching will be necessarily drawn into question.
4. Similarly, the validity of discrete point testing and all of its instructional applications would be drawn into question if it were shown that discrete point teaching does not work.
5. Discrete point teaching is a notorious failure. There is an almost complete scarcity of persons who have actually learned a foreign language on the basis of discrete point methods of teaching.
6. The premise of discrete point theories, that language can be taken to pieces and put back together in the curriculum, is apparently false.
7. Any discourse in any natural language is more than the mere sum of its analyzable parts. Crucial properties of language are lost when it is broken down into discrete phonemic contrasts, words, structures and the like.
8. Nonsense, of the sort recommended by some experts as a basis for discrete point test items, does not exhibit many of the pragmatic properties of normal sensible utterances in discourse contexts.
9. The trouble is that the lowest level units of discourse are involved in the production and interpretation of the highest level units. They cannot, therefore, be separated without obliterating the characteristic relationships between them.
10. No one can learn a language (or teach one) on the basis of the principles advocated by discrete point theorists.
11. Discrete point tests of posited components are often separated into the categories of phonological, lexical, and syntactic tests.
12. It can easily be shown that such tests, even though they are advocated as diagnostic tests, do not provide very specific diagnostic information at all.
13. Typically, discrete items in multiple choice format require highly artificial and unnatural distinctions among linguistic forms.
14. Further, when an attempt to contextualize the items is made, it usually falls flat because the contrast itself is an unlikely one in normal discourse (e.g., watching the baby versus washing the baby).
15. Discrete items offer a rich basis for confusion to any student who may already be having trouble with whatever distinction is required.
16. Pragmatic tests can be shown to do a better job of what discrete point testers were interested in accomplishing all along.
17. Pragmatic tests assess the learner's ability to use the 'elements' of language (whatever they may be) in the normal contexts of human discourse.
18. Moreover, pragmatic tests are superior diagnostic devices.
DISCUSSION QUESTIONS
1. Obtain a protocol (answer sheet and test booklet) from a discrete point test (sound contrasts, vocabulary, or structure). Analyze each item, trying to determine exactly what it is that the student does not know on each item answered incorrectly.
2. Repeat the procedure suggested in question 1, this time with any protocol from a pragmatic task for the same student. Which procedure yields more finely grained and more informative data concerning what the learner does and does not know? (For recommendations on particular pragmatic tests, see Part Three.)
3. Interpret the errors found in questions 1 and 2 with respect to specific teaching remedies. Which of the two procedures (or possibly, the several techniques) yields the most obvious or most useful extensions to therapeutic interventions? In other words, which test is most easily interpreted with respect to instructional procedures?
4. Is there any information yielded by discrete point testing procedures that is not also available in pragmatic testing procedures? Conversely, is there anything in the pragmatic procedures that is not available in the discrete point approaches? Consider the question of sound contrasts, word usages, structural manipulations, and communicative activities.
5. What is the necessary relationship between being able to make a particular sound contrast in a discrete item test and being able to make use of it in communication? How could we determine if a learner were not making use of a particular sound contrast in conversation? Would the discrete point item help us to make this determination? How? What about word usage? Structural manipulation? Rhythm? Intonation?
6. Take any discrete point test and analyze it for content coverage. How many of the possible sound contrasts does it test? Words? Structural manipulations? Repeat the procedure for a pragmatic task. Which procedure is more comprehensive? Which is apt to be more representative? Reliable? Valid? (See the Appendix on the latter two issues, also Chapter 3 above.)
7. Analyze the two tests from the point of view of naturalness of what they require the learner to do with the language. Consider the implications, presuppositions, entailments, antecedents, and consequences of the statements or utterances used in the pragmatic context. For instance, ask what is implied by a certain form used and what it suggests which may not be stated overtly in the text. Do the same for the discrete point items.
8. Can the content of a pragmatic test be summarized? Developed? Expanded? Interpolated? Extrapolated? What about the content of items in a discrete point test? Which test has the richer forms, meaning-wise? Which forms are more explicit in meaning, more determinate? Which are more complex?
SUGGESTED READINGS
1. John L. D. Clark, Foreign Language Testing: Theory and Practice. Philadelphia: Center for Curriculum Development, 1972.
2. Robert Lado, Language Testing. London: Longman, 1961.
3. Rebecca Valette, Modern Language Testing. New York: Harcourt, 1977.
9

Multiple Choice Tests

A. Is there any other way to ask a question?
B. Discrete point and integrative multiple choice tests
C. About writing items
D. Item analysis and its interpretation
E. Minimal recommended steps for multiple choice test preparation
F. On the instructional value of multiple choice tests

The main purpose of this chapter is to clarify the nature of multiple choice tests - how they are constructed, the subjective decisions that go into their preparation, the minimal number of steps necessary before they can be reasonably used in classroom contexts, the incredible range and variety of tasks that they may embody, and finally, their general impracticality for day to day classroom application. It will be shown that multiple choice tests can be of the discrete point or integrative type or anywhere on the continuum in between the two extremes. Some of them may further meet the naturalness requirements for pragmatic language tests. Thus, this chapter provides a natural bridge between Part Two (contra discrete point testing) and Part Three (an exposition of pragmatic testing techniques).

A. Is there any other way to ask a question?

At a testing conference some years ago, it was reported that the following exchange took place between two of the participants. The speaker (probably John Upshur) was asked by a would-be discussant if multiple choice tests were really all that necessary. To which Upshur (according to Eugene Briere) quipped, 'Is there any other way to ask a question?' End of discussion. The would-be contender withdrew to the comfort and anonymity of his former sitting position.

When you think about it, conversations are laced with decision points where implicit choices are being constantly made. Questions imply a range of alternatives. Do you want to go get something to eat? Yes or no. How about a hamburger place, or would you rather have something a little more elegant? Which place did you have in mind? Are you speaking to me (or to someone else)? Questions just naturally seem to imply alternatives. Perhaps the alternatives are not usually so well defined as they are in multiple choice tests, and perhaps the implicit alternatives are not usually offered to confuse or trap the person in normal communication though they are explicitly intended for that purpose in multiple choice tests, but in both cases there is the fundamental similarity that alternatives (explicit or implicit) are offered. Pilate asked Jesus, 'What is truth?' Perhaps he meant, 'There is no answer to this question,' but at the same time he appeared to be interested in the possibility of a different view. Even abstract rhetorical questions may implicitly afford alternatives.

It would seem that multiple choice tests have a certain naturalness, albeit a strained one. They do in fact require people to make decisions that are at least similar in the sense defined above to decisions that people are often required to make in normal communication. But this, of course, is not the main argument in favor of their use. Indeed, the strain that multiple choice tests put on the flow of normal communicative interactions is often used as an argument against them.
The favor that multiple choice tests enjoy among professional testers is due to their presumed 'objectivity', and concomitant reliability of scoring. Further, when large numbers of people are to be tested in short periods of time with few proctors and scorers, multiple choice tests are very economical in terms of the effort and expense they require. The questions of validity posed in relation to language tests (or other types of tests) in general are still the same questions, and the validity requirements to be imposed on such tests should be no less stringent for multiple choice versions than for other test formats. It is an empirical question whether in fact multiple choice tests afford any advantage whatsoever over other types of tests. It is not the sort of question that can be decided by a vote of the American (or any other) Psychometric Association. It can only be decided by appropriate research (see the Appendix, also see Oller and Perkins, 1978).
The preparation and evaluation of specific multiple choice tests hinges on two things: the nature of the decision required by test items, and the nature of the alternatives offered to the examinee on each item. It is a certainty that no multiple choice test can be any better than the items that constitute it, nor can it be any more valid than the choices it offers examinees at requisite decision points. From this it follows that the multiple choice format can only be advantageous in terms of scoring and administrative convenience if we have a good multiple choice test in the first place.

It will be demonstrated here that the preparation of sound multiple choice tests is sufficiently challenging and technically difficult to make them impracticable for most classroom needs. This will be accomplished by showing some of the pitfalls that commonly trap even the experts. The formidable technical problem of item analysis done by hand will be shown to all but completely eliminate multiple choice formats from consideration. Further, it will be argued that the multiple choice format is intrinsically inimical to the interests of instruction. What multiple choice formats gain in reliability and ease of administration, in other words, is more than used up in detrimental instructional effects and difficulty of preparation.
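To see concretely what 'item analysis done by hand' involves, consider the two statistics most often computed for every item of a test: its facility (the proportion of examinees answering correctly) and a crude discrimination index (how much better high scorers do on the item than low scorers). The sketch below is merely illustrative - it assumes simple right/wrong scoring and an upper-versus-lower quarter split, and it is not drawn from this book's own treatment of the topic, which follows in section D.

```python
# A minimal sketch of classical item analysis for one item of a
# right/wrong scored test. Each record pairs an examinee's total
# test score with a 1 (item answered correctly) or 0 (missed).

def item_statistics(records):
    n = len(records)
    facility = sum(correct for _, correct in records) / n

    ranked = sorted(records)           # low scorers first
    k = max(1, n // 4)                 # size of upper and lower groups
    p_lower = sum(c for _, c in ranked[:k]) / k
    p_upper = sum(c for _, c in ranked[-k:]) / k
    discrimination = p_upper - p_lower # near zero or negative = suspect
    return facility, discrimination

records = [(55, 1), (72, 1), (48, 0), (91, 1), (63, 0),
           (40, 0), (85, 1), (77, 1), (52, 0), (68, 1)]
print(item_statistics(records))        # (0.6, 1.0) for this toy data
```

Even this toy case pairs a total score with a response for every examinee on every item; for a class of twenty and a test of a hundred items, that is the two thousand separate pieces of data already complained of in Chapter 8.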

B. Discrete point and integrative multiple choice tests

In Chapter 8 above, we already examined a number of multiple choice items of a discrete point type. There were items aimed at phonological contrasts, vocabulary, and 'grammar' (in the rather narrow sense of surface morphology and syntax). There are, however, many item types that can easily be put into a multiple choice format, or which are usually found in such a format, but which are not discrete point items. For instance, what discrete elements of language are tested in a paraphrase recognition task such as the following?

Match the given sentence with the alternative that most nearly says the same thing:
Before the turn of the century, the tallest buildings were rarely more than three storeys above ground (adapted from Heaton, 1975, p. 186).
A. After the turn of the century, buildings had more storeys above ground.
B. Buildings rarely had as many as four storeys above ground up until the turn of the century.
C. At about the turn of the century, buildings became more numerous and considerably taller than ever before.
D. Buildings used to have more storeys above ground than they did at about the turn of the century.

It would be hard to say precisely what point of grammar, vocabulary, etc. is being tested in the item just exemplified. Could a test composed of items of this type be called a test of reading comprehension? How about paraphrase recognition? Language proficiency in general? What if it were presented in a spoken format?

As we have noted before, the problem of what to say a test is a test of is principally an issue of test validity. It is an empirical question. What we can safely say on the basis of the item format alone is what the test requires the learner to do - or at least what it appears to require. Perhaps, therefore, it is best to call the item type a 'sentence paraphrase recognition' task. Thus, by naming the task rather than positing some abstract construct we avoid a priori validity commitments - that is, we suspend judgement on the validity questions pending empirical investigation. Nevertheless, whatever we choose to call the specific item type, it is clearly more at the integrative side of the continuum than at the discrete point end.
There are many other types of multiple choice items that are integrative in nature. Consider the problem of selecting answers to questions based on a text. Such questions may focus on some detail of information given in the text, the general topic of the text, something implied by the text but not stated, the meaning of a particular word, phrase, or clause in the text, and so forth. For example, read the following text and then select the best answers to the questions that follow:

Black Students in Urban Canada is an attempt to provide information to urban Canadians who engage in educational transactions with members of this ethnicity. Although the OISE conference did not attract educators from either west of Manitoba or from Newfoundland, it is felt that there is an adequate minimum of relevance such that concerned urban teachers from all parts of this nation may uncover something of profit (D'Oyley and Silverman, 1976, p. vi).

(1) This paragraph is probably
A. an introduction to a Canadian novel.
B. a recipe for transactional analysis.
C. a preface to a conference report.
D. an epilog to ethnic studies.
(2) The word ethnicity as used in the paragraph has to do with
A. sex.
B. skin color.
C. birthplace.
D. all of the above.
(3) The message of the paragraph is addressed to
A. all educators.
B. urban educators.
C. urban Canadians involved in education.
D. members of the ethnic group referred to.
(4) The abbreviation OISE probably refers to the
A. city in question.
B. relevant province or state.
C. journal that was published.
D. sponsoring organization.
(5) It is implied that the ethnic group in question lives in predominantly
A. rural settings.
B. suburban areas.
C. urban settings.
D. ghetto slums.
(6) Persons attending the meetings referred to were apparently
A. law enforcement officers.
B. black students.
C. educators.
D. all of the above.

The preceding item type is usually found in what is called a 'reading comprehension' test. Another way of referring to it is to say that it is a task that requires reading and answering questions - leaving open the question of what the test is a test of. It may, for instance, be a fairly good test of overall language proficiency. Or, it may be about as good a test of listening comprehension as of reading comprehension. These possibilities cannot be ruled out in advance on the basis of the superficial appearance of the test. Furthermore, it is certainly possible to change the nature of the task and make it into a listening and question answering problem. In fact, the only logical limits on the types of tests that might be constructed in similar formats are whatever limitations exist on the creative imaginations of the test writer. They could be converted, for instance, to an open-ended format requiring spoken responses to spoken questions over a heard text.
text.
Not only is it possible to find many alternate varieties of multiple choice tests that are clearly integrative in nature, but it is quite possible to take just about any pragmatic testing technique and convert it to some kind of multiple choice format more or less resembling the original pragmatic technique. For example, consider a cloze test over the preceding text - or, say, the continuation of it. We might delete every fifth word and replace it with a field of alternatives as follows:

Black Students in Urban (1) ____ (A) Europe
(B) America
(C) New Guinea
(D) Canada
is an attempt to (2) ____ (A) find
(B) provide
(C) include
(D) take
information to urban Canadians (3) ____ (A) which
(B) while
(C) who
(D) to
engage in educational transactions (4) ____ (A) with
(B) on
(C) to
(D) by
members of this ethnicity ...

Bear in mind the fact that either a printed format (see Jonz, 1974, Porter, 1976, Hinofotis and Snow, 1977) or a spoken format would be possible (Scholz, Hendricks, Spurling, Johnson, and Vandenburg, in press). For instance, in a spoken format the text might be recorded as 'Black Students in Urban blank one is an attempt to blank two information to urban Canadians blank three engage in educational transactions blank four members of this ethnicity ....' The examinee might see only the alternatives for filling in the numbered blanks, e.g.,

(1) ____ (A) Europe (B) America (C) New Guinea (D) Canada
(2) ____ (A) find (B) provide (C) include (D) take

etcetera. To make the task a bit easier in the auditory mode, the recorded text might be repeated one or more times. For some exploratory work with such a procedure in an auditory mode see the Appendix, also see Scholz, et al. (in press). For other suggestions for making the task simpler, see Chapters 10, 11, and 12 on factors that affect the difficulty of discourse processing tasks.
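The mechanical first step of such a conversion - the every-fifth-word deletion itself - could even be automated, as the following sketch illustrates (in Python, and purely hypothetical; writing three plausible distractors for each blank remains the item writer's job, and the genuinely hard part).

```python
# A minimal sketch of every-nth-word deletion for building a
# multiple choice cloze test from a passage.

def make_cloze(text: str, every: int = 5):
    words = text.split()
    keys, stem = [], []                # deleted words, mutilated text
    for i, word in enumerate(words, start=1):
        if i % every == 0:             # delete every nth word
            keys.append(word)
            stem.append(f"({len(keys)}) ____")
        else:
            stem.append(word)
    return " ".join(stem), keys

passage = ("Black Students in Urban Canada is an attempt to provide "
           "information to urban Canadians who engage in educational "
           "transactions with members of this ethnicity.")
stem, keys = make_cloze(passage)
print(stem)   # Black Students in Urban (1) ____ is an attempt ...
print(keys)   # ['Canada', 'provide', 'who', 'with']
```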
Once we have broached the possibility of using discourse as a basis for constructing multiple choice tasks, many variations on test item types can easily be conceived. For instance, instead of asking the examinee to select the appropriate continuation at a particular point in a text, he may be asked to select the best synonym or paraphrase for an indicated portion of text from a field of alternatives. Instead of focussing exclusively on words, it would be possible to use phrases or clauses or larger units of discourse as the basis for items. Instead of a synonym matching or paraphrase matching task, the examinee might be required to put words, phrases, or clauses in an appropriate order within a given context of discourse. Results from tasks of all these types are discussed in greater detail in the Appendix (also see the references given there). The important point is that any such tests are merely illustrative of a bare smattering of the possible types.

The question concerning which of the possible procedures are best is a matter for empirical consideration. Present findings seem to indicate that the most promising multiple choice tasks are those that require the processing of fairly extended segments of discourse - say, 150 words of text or more. However, a note of caution should be sounded. The construction of multiple choice tests is generally a considerably more complicated matter than the mere selection of an appropriate segment of discourse. Although pragmatic tasks can with some premeditation and creativity be converted into a variety of multiple choice tests, the latter are scarcely as easy to use as the original pragmatic tasks themselves (see Part Three). In the next section we will consider some of the technical problems in writing items (especially alternatives).

C. About writing items

There are only a handful of principles that need to be grasped in writing good items, but there are a great many ways to violate any or all of them. The first problem in writing items is to decide what sort of items to write. The second problem is to write the items with suitable distractors in each set of alternatives. In both steps there are many pitfalls. Professionally prepared tests are usually based on explicit instructions concerning the format for items in each section of the test. Not only will the superficial lay-out of the items be described and exemplified, but usually the kinds of fair content for questions will also be more or less circumscribed, and the intended test population will be described so as to inform item writers concerning the appropriate range of difficulty of items to be included in each part of the test.¹

¹ John Bormuth (1970) has developed an extensive argument for deriving multiple-choice items from curricula via explicit and rigorous linguistic transformations. The items in his methodology are directly tied to sentences uttered or written in the curriculum. The argument is provocative. However, it presupposes that the surface forms of the sentences in the curriculum are all that could be tested. Normal discourse processing, on the other hand, goes far beyond what is stated overtly in surface forms (see Frederiksen's recent articles and his references). Therefore, Bormuth's interesting proposal will not be considered further here. I am indebted to Ron Mackay (of Concordia University in Montreal) for calling Bormuth's argument to my attention.

Unfortunately, the questions of test validity are generally consigned to the statistician's department. They are rarely raised at the point of item writing. However, as a rule of thumb, all of the normal criteria for evaluating the validity of test content should be applied from the earliest stages of test construction. The first principle, therefore, would be to ask if the material to be included in items in the test is somehow related to the skill, construct, or curriculum that the test is supposed to assess or measure. Sad to say, many of the items actually included in locally prepared, teacher-made or multiple choice standardized tests are not always subjected to this primary evaluation. If a test fails this first evaluation, no matter how elegantly its items are constructed, it cannot be any better than any other ill-conceived test of whatever it is supposed to measure.
Assuming that the primary validity question has been properly considered, the next problem is to write the best possible items of the defined type(s). In some cases, it will not be necessary to write items from scratch, but rather to select appropriate materials and merely edit them or record them in some appropriate fashion. Let us assume that all test items are to be based on samples of realistic discourse. Arguments for this choice are given throughout the various chapters of this book. Other choices could be made - for instance, sentences in isolation could be used - but this would not change the principles directly related to the construction of items. It would merely change their fleshed-out realization in particular instances.

During the writing stage, each item must be evaluated for appropriateness of content. Does it ask for information that people would normally be expected to pay attention to in the discourse context in question? Is the decision that is required one that really seems to exercise the skill that the test as a whole is aimed at measuring? Is the correct choice really the best choice for someone who is good at the skill being measured (in this case, a good language user)? Are the distractors actually attractive traps for someone who is not so good at the skill in question? Are they well balanced in the sense of going together as a set? Do they avoid the inclusion of blatant (but extraneous) clues to the correct choice? In sum, is the item a well-conceived basis for a choice between clear alternatives?
In addition to insisting that the items of interest be anchored to realistic discourse contexts, on the basis of research findings presented elsewhere in this text (especially the Appendix), we will disregard many of the discrete point arguments of purity in item types. In other words, we will abandon the notion that vocabulary knowledge must be assessed as if it were independent of grammatical skill, or that reading items should not include a writing aspect, etc. All of the available empirical research seems to indicate that such distinctions are analytical niceties that have no fundamental factual counterparts in the variance actually produced by tests that are constructed on the basis of such distinctions. Therefore, the distinctions that we will make are principally in the types of tasks required of the learner - not in the hypothetical skills or constructs to be tested. For all of these reasons, we should also be clear about the fact that the construction of a multiple choice test is not apt to produce a test that is more valid than a test of a similar sort in some other format. The point in building a multiple choice test is to attain greater economy of administration and scoring. It is purely a question of practicality and has little or nothing to do with reliability and validity in the broader sense of these terms.
Most any discourse context can be dealt with in just about any
processing mode. For instance, consider a breakfast conversation.
Suppose that it involves what the various members of a family are
going to do that day, in addition to the normal 'pass the salt and
pepper' kind of conversation at breakfast. Nine-year-old Sarah spills
the orange juice while Mr Kowalsky is scalding his mouth on boiling
coffee and remonstrating that Mrs Kowalsky can't seem to cook a
thing without getting it too hot to eat. Thirteen-year-old Samuel
wants to know if he can have a dollar (make that five dollars) so he
can see that latest James Bond movie, and his mother insists that he
not forget the piano lesson at four, and to feed the cat ... It is
possible to talk about such a context; to listen to talk about it; to read
about it; to write about it. The same is true for almost any context
conceivable where normal people interact through the medium of
language.
It might be reasonable, of course, to start with a written text.
Stories, narratives, expository samples of writing, in fact, just about
any text may provide a suitable basis for language test material. It is
possible to talk about a story, listen to a story, answer questions
about a story, read a story, retell a story, write a story, and so forth.
What kinds of limits should be set on the selection of materials?
Obviously, one would not want to select test material that would
distract the test taker from the main job of selecting the appropriate
choices of the multiple choice items presented. Therefore, super-
charged topics about such things as rape, suicide, murder, and
heinous crimes should probably be avoided along with esoteric topics
of limited interest such as highly technical crafts, hobbies, games, and
the like (except, of course, in the very special case where the esoteric
or super-charged topic is somehow central to the instructional goals
to be assessed). Materials that state or imply moral, cultural, or racial
judgements likely to offend test takers should also probably be
avoided unless there is some specific reason for including them.
Let us suppose that the task decided on is a reading and question
answering type. Further, for whatever reasons, let us suppose that the
following text is selected:

Oliver Twist was born in a workhouse, and for a long time
after his birth there was considerable doubt whether the child
would live. He lay breathless for some time, rather unequally
balanced between this world and the next. After a few struggles,
however, he breathed, sneezed and uttered a loud cry.
The pale face of a young woman lying on the bed was raised
weakly from the pillow and in a faint voice she said, 'Let me see
the child and die.'
'Oh, you must not talk about dying yet,' said the doctor, as he
rose from where he was sitting near the fire and advanced
towards the bed.
'God bless her, no!' added the poor old pauper who was
acting as nurse.
The doctor placed the child in its mother's arms; she pressed
her cold white lips on its forehead; passed her hands over her
face; gazed wildly around, fell back - and died.
'It's all over,' said the doctor at last.
'Ah, poor dear, so it is!' said the old nurse.
'She was a good-looking girl, too,' added the doctor: 'where
did she come from?'
'She was brought here last night,' replied the old woman. 'She
was found lying in the street. She had walked some distance, for
her shoes were worn to pieces; but nobody knows where she
came from, or where she was going, nobody knows.'
'The old story,' said the doctor, shaking his head, as he
leaned over the body, and raised the left hand; 'no wedding-
ring, I see. Ah! Good night!' (Dickens, 1962, p. 1)

In framing questions concerning such a text (or any other) the first
thing to be considered is what the text says. What is it about? If it is a
story, like this one, who is referred to in it? What are the important
events? What is the connection between them? What is the
relationship between the people, events, and states of affairs referred
to? In other words, how do the surface forms in the text
pragmatically map onto states of affairs (or facts, imaginings, etc.)
which the text is about? The author of the text had to consider these
questions (at least implicitly) the same as the reader, or anyone who
would retell the story or discuss it. Linguistically speaking, this is the
problem of pragmatic mapping.
Thus, a possible place to begin in constructing test items would be
with the topic. What is the text about? There are many ways of posing
the question clearly, but putting it into a multiple choice format is a
bit more complicated than merely asking the question. Here we are
concerned with better and worse ways of forming such multiple
choice questions. How should the question be put, and what
alternatives should be offered as possible answers?
Consider some of the ways that the question can be badly put:
(1) The passage is about ____
A. a doctor
B. a nurse
C. an unmarried woman
D. a child
The trouble here is that the passage is in fact about all of the
foregoing, but is centered on none of them. If any were to be selected
it would probably have to be the child, because we understand from
the first paragraph of the text that Oliver Twist is the child who is
being born. Further, the attention of every person in the story is
primarily directed to the birth of this child. Even the mother is
concerned merely to look at him before she dies.
(2) A good title for this passage might be _____
A. 'Too young to die.'
B. 'A cross too heavy.'
C. 'A child is born.'
D. 'God bless her, no!'
Perhaps the author has C in mind, but the basis for that choice is a bit
obscure. After all, it isn't just any child, and neither is it some child of
great enough moment to justify the generic sense of 'a child'.
Now, consider a question that gets to the point:
(3) The central fact talked about in the story is _____
A. the birth of Oliver Twist
B. the death of an unwed mother
C. an experience of an old doctor
D. an old and common story
Hence, the best choice must fit the facts well. It is essential that the
correct answer be correct, and further that it be better than the other
alternatives offered.
Another common problem in writing items arises when the writer
selects facts that are in doubt on the basis of the given information
and forces a choice between two or more possible alternatives.
(4) When the author says that 'for a long time after his birth there
was considerable doubt whether the child would live' he probably
means that ______
A. the child was sickly for months or possibly years
B. Oliver Twist did not immediately start breathing at birth
C. the infant was born with a respiratory disease
D. the child lay still without breathing for minutes after birth
The trouble here is that the text does not give a sufficient basis for
selecting between the alternatives. While it is possible that only B is
correct, it is not impossible (on the basis of given information) that
one of the other three choices is also correct. Therefore, the choice
that is intended by the author to be the correct one (say, B) is not a
very reasonable alternative. In fact, none of the alternatives is really a
good choice in view of the indeterminacy of the facts. Hence, the facts
ought to be clear on the basis of the text, or they should not be used as
content for test items.
Finally, once the factual content of the item is clear and after the
correct alternative has been decided on, there is the matter of
constructing suitable distractors, or incorrect alternatives. The
distractors should not give away the correct choice or call undue
attention to it. They should be similar in form and content to the
correct choice and they should have a certain attractiveness about
them.
For instance, consider the following rewrite of the alternatives
offered for (3):
A. the birth of Oliver Twist
B. the death of the young unwed mother of Oliver Twist
C. the experience of the old doctor who delivered Oliver Twist
D. a common story about birth and death among unwed mothers
There are several problems here. First, Oliver Twist is mentioned in
all but one of the alternatives, thus drawing attention to him and
giving a clue as to the correct choice. Second, the birth of Twist is
mentioned or implied in all four alternatives, giving a second
unmistakable clue as to the correct choice. Third, the choices are not
well balanced - they become increasingly specific (pragmatically) in
choices B and C, and then jump to a vague generality in choice D.
Fourth, the correct choice, A, is the most plausible of the four even if
one has not read the text.
There are several common ways item writers often draw attention to
the correct choice in a field of alternatives without, of course,
intending to. For one, as we have already seen, the item writer may be
tempted to include the same information in several forms among the
various alternatives. This highlights that information. Another way
of highlighting is to include the opposite of the correct response. For
instance, as alternatives to the question about a possible title for the
text, consider the following:
A. 'The death of Oliver Twist.'
B. 'The birth of Oliver Twist.'
C. 'The same old story.'
D. 'God bless her, no!'
The inclusion of choice A calls attention to choice B and tends to
eliminate the other alternatives immediately.
The tendency to include the opposite of the correct alternative is
very common, especially when the focus is on a word or phrase
meaning:
(5) In the opening paragraph, the phrase 'unequally balanced
between this world and the next' refers to the fact that Oliver appears
to be ___
A. more alive than dead
B. more dead than alive
C. about to lose his balance
D. in an unpleasant mental state
To the test-wise examinee (or any moderately clever person), A and B
are apt to seem more attractive than C or D even if the examinee has
not read the original text about Twist.


Yet another way of cluing the test taker as to the appropriate
choice is to make it the longest and most complexly worded
alternative, or the shortest and most succinct. We saw an example of
the latter above with reference to item (3), where the correct choice
was obviously the shortest and the clearest one of the bunch. Here is
another case. The only difference is that now the longest alternative is
the correct choice:
(6) The doctor probably tells the young mother not to talk about
dying because ______
A. he doesn't think she will die
B. she is not at all ill
C. he wants to encourage her and hopes that she will not die
D. she is delirious
The tendency is to include more information in the correct alternative
in order to make absolutely certain that it is in fact the best choice.
Another motive is to make the distractors short to save time in
writing the items.
Another common problem in writing distractors is to include
alternatives that are ridiculous and often (perhaps only to the test
writer) hilarious. After writing forty or fifty alternatives there is a
certain tendency for the test writer to become a little giddy. It is
difficult to think of distractors without occasionally coming up with a
real humdinger. After one or two, the stage is set for a hilarious test,
but hilarity is not the point of the testing and it may be deleterious to
the validity of the test qua test. For instance, consider the following
item where the task is to select the best paraphrase for the given
sentence:
(7) The pale face of a young woman lying on the bed was raised
weakly from the pillow and she spoke in a faint voice.
A. A fainting face on a pillow rose up from the bed and spoke
softly to the young woman.
B. The pale face and the woman were lying faintly on the bed
when she spoke.
C. Weakly from the pillow the pale face rose up and faintly spoke
to the woman.
D. The woman who was very pale and weak lifted herself from the
pillow and spoke.
Alternative B is distracting in more ways than one. Choice C
continues the metaphor created, and neither is apt to be a very good
distractor except in a hilarious diversionary sense. Without reading
the given sentence or the story, choice D is the only sane alternative.
In sum, the common foul-ups in multiple choice item writing
include the following:
(1) Selecting inappropriate content for the item.
(2) Failure to include the correct answer in the field of alternatives.
(3) Including two or more plausible choices among the
alternatives.
(4) Asking the test taker to guess facts that are not stated or
implied.
(5) Leaving unintentional clues about the correct choice by
making it either the longest or shortest, or by including its
opposite, or by repeatedly referring to the information in the
correct choice in other choices, or by including ridiculous
alternatives.
(6) Writing distractors that don't fit together with the correct
choice - i.e., that are too general or too specific, too abstract
or too concrete, too simple or too complex.
These are only a few of the more common problems. Without doubt
there are many other pitfalls to be avoided.

D. Item analysis and its interpretation


Sensible item analysis involves the careful subjective interpretation of
some objective facts about the way examinees perform on multiple
choice items. Insofar as all tests involve determinate and quantifiable
choices (i.e., correct and incorrect responses, or subjectively
determined better and worse responses), item analysis at base is a very
general procedure. However, we will consider it here specifically with
reference to multiple choice items and the very conveniently
quantifiable data that they yield. In particular, we will discuss the
statistics that usually go by the names of item facility and item
discrimination. Finally, we will discuss the interpretation of response
frequency distributions.
We will be concerned with the meaning of the statistics, the
assumptions on which they depend in order to be useful, and their
computation. It will be shown that item analysis is generally so
tedious to perform by hand as to render it largely impracticable for
classroom use. Nevertheless, it will be argued that item analysis is an
important and necessary step in the preparation of good multiple
choice tests. Because of this latter fact, it is suggested that every
classroom teacher and educator who uses multiple choice test data
should know something of item analysis - how it is done, and what it
means.
(i) Item facility. One of the basic item statistics is item facility (IF).
It has to do with how easy (or difficult) an item is from the viewpoint
of the group of students or examinees taking the test of which that
item is a part. The reason for concern with IF is very simple - a test
item that is too easy (say, an item that every student answers
correctly) or a test item that is too difficult (one, say, that every
student answers incorrectly) can tell us nothing about the differences
in ability within the test population. There may be occasions when a
teacher in a classroom situation wants all of the students to answer an
item (or all the items on a test) perfectly. Indeed, such a goal seems
tantamount to the very idea of what teaching is about. Nevertheless,
in school-wide exams, or in tests that are intended to reveal
differences among the students who are better and worse performers
on whatever is being tested, there is nothing gained by including test
items that every student answers correctly or that every student
answers incorrectly.
The computation of IF is like the computation of a mean score for
a test, only the test in this case is a single item. Thus, an IF value can be
computed for each item on any test. It is in each case like a miniature
test score. The only difference between an IF value and a part score or
total score on a test is that the IF value is based on exactly one item. It
is the mean score of all the examinees tested on that particular item.
Usually it is expressed as a percentage or as a decimal indicating the
proportion of students who answered the item correctly:
IF = the number of students who answered the item correctly
divided by the total number of students
This formula will produce a decimal value for IF. To convert it to a
percentage, we multiply the result by 100. Thus, IF is the proportion
of students who answer the item in question correctly.
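
To make the arithmetic concrete, here is a minimal sketch of the IF
computation in Python. (Python itself, the function name, and the 0/1
scoring convention are illustrative assumptions for exposition; they are
not part of the procedure described in the text.)

```python
def item_facility(item_scores):
    """Item facility (IF): the proportion of examinees who answered
    the item correctly. item_scores holds one value per examinee:
    1 = correct, 0 = incorrect or omitted."""
    return sum(item_scores) / len(item_scores)

# Example: 43 of 100 examinees answer the item correctly.
scores = [1] * 43 + [0] * 57
print(item_facility(scores))         # 0.43 as a decimal
print(item_facility(scores) * 100)   # 43.0 as a percentage
```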
Some authors use the term 'item difficulty', but this is not what the
proportion of students answering correctly really expresses. The IF
increases as the item gets easier and decreases as it gets more difficult.
Hence, it really is an index of facility. To convert it to a difficulty
measure we would have to subtract the IF from the maximum
possible score on the item - i.e., 1.00 if we are thinking in terms of
decimal values, and 100% if we are thinking in terms of percentage
values. The proportion answering incorrectly should be referred to as
the item difficulty. We will not use the latter notion, however, because
it is completely redundant once we have the IF value.
By pure logic (or mathematics, if you prefer), we can see that the IF
of any item has to fall between zero and one, or between 0% and
100%. It is not possible for more than 100% of the examinees to
answer an item correctly, nor for fewer than 0% to do so. The worst
any group can do on an item is for all of them to answer it incorrectly
(IF = .00 = 0%). The best they can do is for all of them to answer it
correctly (IF = 1.00 = 100%). Thus, IF necessarily falls somewhere
between 0 and 1. We may say that IF ranges from 0 to 1.
For reasons given above, however, an item that everyone answers
correctly or incorrectly tells us nothing about the variance among
examinees on whatever the test measures. Therefore, items falling
somewhere between about .15 and .85 are usually preferred. There is
nothing absolute about these values, but professional testers always
set some such limits and throw away or rewrite items that are judged
to be too easy or too difficult. The point of the test items is almost
always to yield as much variance among examinees as possible. Items
that are too easy or too difficult yield very little variance - in fact, the
amount of meaningful variance must decrease as the item approaches
an IF of 100% or 0%. The most desirable IF values, therefore, are
those falling toward the middle of the range of possible values. IF
values falling in the middle of the range guarantee some variance in
scores among the examinees.
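
The relation between IF and variance can be made explicit: for an item
scored 0 or 1, the variance of the item scores across examinees is
IF × (1 − IF), which is greatest at IF = .50 and shrinks to zero as IF
approaches 0 or 1. The short sketch below illustrates this fact together
with the screening rule just described (the .15 and .85 cut-offs are the
illustrative limits mentioned above, not absolute constants):

```python
def item_variance(if_value):
    # For a 0/1-scored item, variance across examinees is p(1 - p);
    # it peaks at p = .50 and vanishes as IF approaches 0 or 1.
    return if_value * (1 - if_value)

def keep_item(if_value, low=0.15, high=0.85):
    # Screening rule: flag for rewriting or discarding any item whose
    # IF falls outside the preferred band.
    return low <= if_value <= high

for p in (0.00, 0.15, 0.50, 0.85, 1.00):
    print(p, item_variance(p), keep_item(p))
```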
However, merely obtaining variance is not enough. Meaningful
variance is required. That is, the variance must be reliable and it must
be valid. It must faithfully reflect variability among tested subjects on
the skill or knowledge that the test purportedly measures. This is
where another statistic is required for item evaluation.
(ii) Item discrimination. The fundamental issue in all testing and
measurement is to discriminate between larger and smaller quantities
of something, better and worse performances, success and failure,
more or less of whatever one wants to test or measure. Even when the
objective is to demonstrate mastery, as in a classroom setting where it
may be expected that everyone will succeed, the test cannot be a
measure of mastery at all unless it provides at least an opportunity for
failure or for the demonstration of something less than mastery. Or to
take another illustration, consider the case of engineers who 'test' the
strength of a railroad trestle by running a loaded freight train over it.
They don't expect the bridge to collapse. Nonetheless, the test
discriminates between the criterion of success (the bridge holding up)
and failure (the bridge collapsing). Thus, any valid test must
discriminate between degrees of whatever it is supposed to measure.
Even if only two degrees are distinguished - as in the case of mastery
versus something less (Valette, 1977) - discrimination between those
two degrees is still the principal issue.
In school testing where multiple choice tests are employed, it is
necessary to raise the question whether the variance produced by a
test item actually differentiates the better and worse performers, or
the more proficient examinees as against the less proficient ones.
What is required is an index of the validity of each item in relation to
some measure of whatever the item is supposed to be a measure.
Clearly if different test items are of the same type and are supposed to
measure the same thing, they should produce similar variances (see
Chapter 3 above for the definition of variance, and correlation). This
is the same as saying that the items should be correlated. That is, the
people who tend to answer one of the items correctly should also tend
to answer the other correctly and the people who tend to answer the
one item incorrectly should also tend to answer the other incorrectly.
If this were so for all of the items of a given type we would take the
degree of their correlation as an index of their reliability - or in some
terminologies their internal consistency. But what if the items could be
shown to correlate with some other criterion? What if it could be
shown that a particular item, for instance, or a batch of items were
correlated with some other measure of whatever the items purport to
measure? In the latter case, the correlation would have to be taken as
an index of validity - not mere internal consistency of the items.
What criterion is always available? Suppose we think of a test
aimed at assessing reading comprehension. Let's say the test consists
of 100 items. What criterion could be used as an index of reading
comprehension against which the validity of each individual item
could be assessed? In effect, for each subject who takes the test there
will be a score on the entire test and a score on each item of the test.
Presumably, if the subject does not answer certain items they will be
scored as incorrect. Now, which would be expected to be a better
measure of the subject's true reading comprehension ability, the total
score or a particular item score? Obviously, since the total score is a
composite of 100 items, it should be assumed to be a better (more
reliable and more valid) index of reading comprehension than any
single item on the test. Hence, since the total score is easily obtainable
and always available on any multiple choice test, it is the usual
criterion for assessing individual item reliabilities. Other criteria
could be used, however. For instance, the items on one test could
easily be assessed against the scores on some different test or tests
purporting to measure the same thing. In the latter instance, the other
test or tests would be used as bases for evaluating the validity of the
items on the first test.
In brief, the question of whether an individual test item
discriminates between examinees on some dimension of interest is
a matter of both reliability and validity. We cannot read an index of
item discrimination as anything more than an index of reliability,
however, unless the criterion against which the item is correlated has
some independent claims to validity. In the latter case, the index of
item discrimination becomes an index of validity over and above the
mere question of reliability.
The usual criterion selected for determining item discrimination is
the total test score. It is simply assumed that the entire test is apt to be
a better measure of whatever the test purports to measure than any
single test item by itself. This assumption is only as good as the
validity of the total test score. If the test as a whole does not measure
what it purports to measure, then high item discrimination values
merely indicate that the test is reliably measuring something - who
knows what. If the test on the whole is a valid measure of reading
comprehension, on the other hand, the strength of each item
discrimination value may be taken as a measure of the validity of that
item. Or, to put the matter more precisely, the degree of validity of the
test as a whole establishes the limitations on the interpretation of item
discrimination values. As far as human beings are concerned, a test is
never perfectly valid, only more or less valid within limits that can
be determined only by inferential methods.
To return to the example of the 100 item reading comprehension
test, let us consider how an item discrimination index could be
computed. The usual method is to select the total test score as the
criterion against which individual items on the test will be assessed.
The problem then is to compute or estimate the strength of the
correlation between each individual item on the test and the test as a
whole. More specifically, we want to know the strength of the
correlation between the scores on each item in relation to the scores
on all the items.
Since 100 correlations would be a bit tedious to compute, especially
when each one would require the manipulation of at least twice as
many scores as there are examinees (that is, all the total scores plus all
the scores on the item in question), a simpler method would be
desirable if we were to do the job by hand. With the present
availability of computers, no one would be apt to do the procedure by
hand any more, but just for the sake of clarity the Flanagan (1939)
technique of estimating the correlation between the scores on each
item and the score on the total test will be presented in a step by step
fashion.
Prior to computing anything, the test of course has to be
administered to a group of examinees. In order to do a good job of
estimating the discrimination values for each test item, the selected
test population (the group tested) should be representative of the
people for whom the test is eventually intended. Further, it should
involve a large enough number of subjects to ensure a good sampling
of the true variability in the population as a whole. It would not make
much sense to go to all the trouble of computing item discrimination
indices on a 100 item test with a sample of subjects of less than, say, 25.
Probably a sample of 50 to 100 subjects, however, would provide
meaningful (reliable and valid) data on the basis of which to assess the
validities of individual test items in relation to total test scores.
Once the test is administered and all the data are in hand, the first
step is to score the tests and place them in order from the highest score
to the lowest. If 100 subjects were tested, we would have 100 test
booklets ranking from the highest score to the lowest. If scores are
tied, it does not matter what order we place the booklets in for those
particular cases. However, all of the 98s must rank ahead of all of the
97s and so forth.
The next step (still following Flanagan's method) is to count off
from the top down to the score that marks off the top 27½% of the
papers. In the case of our data sample, this means that we count down
to the student that falls at the 28th position down from the top of the
stack of papers. We then designate that stack of papers that we have
just counted off as the High Scorers. This group will contain approx-
imately 27½% of all the students who took the test. In fact it contains
the 27½% of the students who obtained the highest scores on the test.
Then, in similar fashion we count up from the bottom of the
booklets remaining in the original stack to position number 28 to
obtain the corresponding group that will be designated Low Scorers.
The Low Scorers will contain as near as we can get to exactly 27½% of
the people who achieved scores ranking at the bottom of the stack.
We now have distinguished between the 27½% (rounded off in this
case to 28%) of the students who got the highest scores and the 27½%
who got the lowest scores on the test. From what we already know of
correlation, if scores on an individual item are correlated with the
total score it follows that for any item, more of the High Scorers
should get it right than of the Low Scorers. That is, the students who
are good readers should tend to get an item right more often than the
students who are not so good at reading. We would be disturbed if we
found an item that good readers (High Scorers) tended to miss more
frequently than weak readers (Low Scorers). Thus, for each item we
count the number of persons in the High Scorers group who answered
it correctly and compare this with the number of persons in the Low
Scorers group who answered it correctly. What we want is an index of
the degree to which each item tends to differentiate High and Low
Scorers the same as the total score does - i.e., an estimate of the
correlation between the item scores and the total score.
For each item, the following formula will yield such an index:
ID = the number of High Scorers who answered the item
correctly minus the number of Low Scorers who
answered the item correctly, divided by 27½% of the total
number of students tested
Flanagan showed that this method provides an optimum estimate of
the correlation between the item in question and the total test. Thus,
as in the case of product-moment correlation (see Chapter 3 above),
ID can vary from +1 to -1. Further, it can be interpreted as an
estimate of the computable correlation between the item and the total
score. Flanagan has demonstrated, in fact, that the method of
comparing the top 27½% against the bottom 27½% produces the best
estimate of the correlation that can be obtained by such a method
(better for example than comparing the top 50% against the bottom
50%, or the top third against the bottom third, and so on).
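
Since the whole Flanagan procedure (rank by total score, split off the
top and bottom 27½%, and count correct answers in each extreme group)
is purely mechanical, it reduces to a few lines of code. The sketch below
is one illustrative Python rendering of the method described above; the
function name, the data layout, and the handling of ties are assumptions
of the sketch, not prescriptions from the text.

```python
def item_discrimination(item_correct, total_scores, fraction=0.275):
    """Flanagan-style ID for a single item.
    item_correct: one 0/1 item score per examinee.
    total_scores: total test scores, in the same examinee order.
    fraction: proportion defining the extreme groups (27.5% here)."""
    n = len(total_scores)
    k = round(n * fraction)  # size of each extreme group (28 when n = 100)
    # Rank examinees by total score, highest first; ties may fall in any order.
    ranked = sorted(range(n), key=lambda i: total_scores[i], reverse=True)
    high, low = ranked[:k], ranked[-k:]
    high_correct = sum(item_correct[i] for i in high)
    low_correct = sum(item_correct[i] for i in low)
    return (high_correct - low_correct) / k

# Dummy data echoing the text: the 28 highest scorers all answer the item
# correctly and the 28 lowest scorers all miss it, so ID comes out +1.
totals = list(range(100, 0, -1))           # 100 distinct total scores
item = [1] * 50 + [0] * 50                 # item answered correctly by top half
print(item_discrimination(item, totals))   # 1.0
```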
A specific example of some dummy data (i.e., made up data) for
one of the above test items will show better how ID is computed in
actual practice. Suppose we assume that item (3) above, based on the
text about Oliver Twist, is the item of interest. Further, that it is one
of 100 items constituting the reading comprehension test posited
earlier. Suppose we have already administered the test to 100
examinees and we have scored and ranked them.
After determining the High Scorers and the Low Scorers by the
method described above, we must then determine how many in each
group answered the item correctly. We begin by examining the
answers to the item given by students in the High Scorers group. We
look at each test booklet to find out whether the student in question
got the item right or wrong. If he got it right we add one to the number
of students in the High Scorers group answering the item correctly. If
he got it wrong, we disregard his score. Suppose that we find 28 out of
28 students in the High Scorers group answered the item correctly.
Then, we repeat the counting procedure for the Low Scorers.
Suppose that 0 out of 28 students in the Low Scorers group answered
the item correctly. The ID for item (3) is equal to 28 minus 0, divided
by 28, or +1. That is, in this hypothetical case, the item discriminated
perfectly between the better and not-so-good readers. We would be
inclined to conclude that the item is a good one.
Take another example. Consider the following dummy data on item
(5) above (also about the Oliver Twist text). Suppose that 14 of the
people in the High Scorers group and 14 of the people in the Low
Scorers group answered the item correctly (as keyed, that is,
assuming the 'correct' answer really is correct). The ID would be 14
minus 14, divided by 28, or 0/28 = 0. From this we would conclude
that the item is not producing any meaningful variance at all in
relation to the performance on the entire test.
Take one further example. Consider item (4) above on the Twist
text. Let us suppose that all of the better readers selected the wrong
answer - choice A. Further, that all of the poorer readers selected the
answer keyed by the examiner as the correct choice - say, choice D.
We would have an ID equal to 0 minus 28, divided by 28, or -1.
From this we would be inclined to conclude that the item is no good.
Indeed, it would be fair to say that the item is producing exactly the
wrong kind of variance. It is tending to place the low scorers on the
item into the High Scorers group for the total score, and the high
scorers on the item are actually ending up in the Low Scorers group
for the overall test.
From all of the foregoing discussion about ID, it should be obvious
that high positive ID values are desirable, whereas low or negative
values are undesirable. Clearly, the items on a test should be
correlated with the test as a whole. The stronger the correlations, the
more reliable the test; and to the extent that the test as a whole is valid,
the stronger the correlations of items with total, the more valid the
items must be. Usually, professional testers set a value of .25 or .35 as
a lower limit on acceptable IDs. If an item falls below the arbitrary
cut-off point set, it is either rewritten or culled from the total set of
items on the test.

(iii) Response frequency distributions. In addition to finding out
how hard or easy an item is, and besides knowing whether it is
correlated with the composite of item scores in the entire test, the test
author often needs to know how each and all of the distractors
performed in a given test administration. A technique for determin-
ing whether a certain distractor is distracting any of the students or
not is simply to go through all of the test booklets (or answer sheets)
and see how many of the students selected the alternative in question.
A more informative technique, however, is to see what the
distribution of responses was for the High Scorers versus the Low
Scorers as well as for the group falling in between, call them the Mid
Scorers. In order to accomplish this, a response frequency
distribution can be set up as shown in Table 3 immediately below:

TABLE 3
Response Frequency Distribution Example One.

Item (3)                    A*    B    C    D    Omit
High Scorers (top 27½%)     28    0    0    0     0
Mid Scorers (mid 45%)       15   10   10    9     0
Low Scorers (low 27½%)       0    7    7    7     7

(* = keyed correct choice)
The table is based on hypothetical data for item (3) based on the
Oliver Twist text above. It shows that 28 of the High Scorers marked
the correct choice, namely A, and none of them marked B, C, or D,
and none of them failed to mark the item. It shows further that the
distribution of scores for the Mid group favored the correct choice A,
with B, C, and D functioning about equally well as distractors. No
one in the Mid group failed to mark the item. Finally, reading across
the last row of data in the chart, we see that no one in the Low group
marked the correct choice A, and equal numbers marked B, C, and D.
Also, 7 people in the Low group failed to mark the item at all.
IF and ID are directly computable from such a response frequency
distribution. We get IF by adding the figures in the column headed by
the letter of the correct alternative, in this case A. Here, the IF is 28
plus 15 plus 0, or 43, divided by 100 (the total number of subjects who
took the exam), which equals .43. The ID is 28 (the number of persons
answering correctly in the High group) minus 0 (the number
answering correctly in the Low group), which equals 28, divided by
27½% of all the subjects tested, or 28 divided by 28, which equals 1.
Thus, the IF is .43 and the ID is 1.
We would be inclined to consider this item a good one on the basis
of such statistics. Further, we can see that all of the distractors in the
item were working quite well. For instance, distractors B and C
pulled exactly 17 responses each, and D drew 16. Thus, there would
appear to be no dead wood among the distractors.
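
The bookkeeping behind Table 3, and the derivation of IF and ID from
it, is equally easy to mechanize. The following sketch rebuilds the table
from illustrative raw data; the group labels, the data layout, and the
keyed choice are assumed inputs supplied for the example, not part of
the procedure itself.

```python
from collections import Counter

def response_frequencies(responses, groups):
    """Tally a response frequency distribution.
    responses: one entry per examinee: 'A'..'D', or None for an omit.
    groups: parallel list of group labels: 'high', 'mid', or 'low'."""
    table = {g: Counter() for g in ("high", "mid", "low")}
    for resp, grp in zip(responses, groups):
        table[grp][resp if resp is not None else "Omit"] += 1
    return table

# Raw data matching Table 3 for item (3); 'A' is the keyed correct choice.
responses = (["A"] * 28                                          # High Scorers
             + ["A"] * 15 + ["B"] * 10 + ["C"] * 10 + ["D"] * 9  # Mid Scorers
             + ["B"] * 7 + ["C"] * 7 + ["D"] * 7 + [None] * 7)   # Low Scorers
groups = ["high"] * 28 + ["mid"] * 44 + ["low"] * 28

table = response_frequencies(responses, groups)
IF = sum(table[g]["A"] for g in table) / len(responses)  # (28+15+0)/100 = .43
ID = (table["high"]["A"] - table["low"]["A"]) / 28       # (28-0)/28 = 1.0
print(IF, ID)
```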
To see better what the response frequency distribution can tell us
about specific distractors, let's consider another hypothetical
example. Consider the data presented on item (4) in Table 4.

TABLE 4
Response Frequency Distribution Example Two.

Item (4)                    A     B    C    D*   Omit
High Scorers (top 27½%)     28    0    0    0     0
Mid Scorers (mid 45%)       15   15    0   14     0
Low Scorers (low 27½%)       0    0    0   28     0

Reading across row one, we see that all of the High group missed the
item by selecting the same wrong choice, namely A. If we look back at
the item we can find a likely explanation for this. The phrase 'for a
long time after his birth' does seem to imply alternative A, which
suggests that the child was sickly for 'months or possibly years'.
Therefore, distractor A should probably be rewritten. Similarly,
distractor A drew off at least 15 of the Mid group as well. Choice C,
on the other hand, was completely useless. It would probably have
changed nothing if that choice had not been among the field of
alternatives. Finally, since only the low scorers answered the item
correctly, it should undoubtedly be completely reworked or
discarded.
E. Minimal recommended steps for multiple choice
test preparation
By now the reader probably does not require much further
convincing that multiple choice test preparation is no simple matter.
Thus, all we will do here is state in summary form the steps
considered necessary to the preparation of a good multiple choice
test.

(1) Obtain a clear notion of what it is that needs to be tested.
(2) Select appropriate item content and devise an appropriate
item format.
(3) Write the test items.
(4) Get some qualified person to read the test items for editorial
difficulties of vagueness, ambiguity, and possible lack of
clarity (this step can save much wasted energy on later steps).
(5) Rewrite any weak items or otherwise revise the test format to
achieve maximum clarity concerning what is required of the
examinee.
(6) Pretest the items on some suitable sample of subjects other
than the specifically targeted group.
(7) Run an item analysis over the data from the pretesting.
(8) Discard or rewrite items that prove to be too easy or too
difficult, or low in discriminatory power. Rewrite or discard
non-functional alternatives based on response frequency
distributions.
(9) Possibly recycle through steps (6) to (8) until a sufficient
number of good items has been attained.
(10) Assess the validity of the finished product via some one or
more of the techniques discussed in Chapter 3 above, and
elsewhere in this book.
(11) Apply the finished test to the target population. Treat the data
acquired on this step in the same way as the data acquired on
step (6) by recycling through steps (7) to (10) until optimum
levels of reliability and validity are consistently attained.

In view of the complexity of the tasks involved in the construction
of multiple choice tests, it would seem inadvisable for teachers with
normal instructional loads to be expected to construct and use such
tests for normal classroom purposes. Furthermore, it is argued that
such tests have certain instructional drawbacks.
F. On the instructional value of multiple choice tests

While multiple choice tests have rather obvious advantages in terms
of administrative and scoring convenience, anyone who wants to
make such tests part of the daily instructional routine must be willing
to pay a high price in test preparation and possibly genuine
instructional damage. It is the purpose of the multiple choices offered
in any field of alternatives to trick the unwary, ill-informed, or less
skillful learner. Oddly, nowhere else in the curriculum is it common
procedure for educators to recommend deliberate confusion of the
learner - why should it be any different when it comes to testing?
It is paradoxical that all of the popular views of how learning can
be maximized seem to go nose to nose with both fists flying against the
very essence of multiple choice test theory. If the test succeeds in
discriminating among the stronger and weaker students it does so by
decoying the weaker learners into misconceptions, half-truths, and
Janus-faced traps.
Dean H. Obrecht once told a little anecdote that very neatly
illustrates the instructional dilemma posed by multiple choice test
items. Obrecht was teaching acoustic phonetics at the University of
Rochester when a certain student of Germanic extraction pointed out
the illogicality of the term 'spectrogram' as distinct from 'spectro-
graph'. The student observed that a spectrogram might be like a
telegram, i.e., something produced by the corresponding -graph, to
wit a telegraph or a spectrograph. On the other hand, the student
noted, the 'spectrograph' might be like a photograph for which there
is no corresponding photogram. 'Now which,' asked the bemused
student, 'is the machine and which is the record that it produces?'
Henceforth, Dr. Obrecht often complained that he could not be sure
whether it was the machine that was the spectrograph, or the piece of
paper.
What then is the proper use of multiple choice testing? Perhaps it
should be thoroughly re-evaluated as a procedure for educational
applications. Clearly, it has limited application in classroom
testing. The tests are difficult to prepare. Their analysis is tedious,
technically formidable, and fraught with pitfalls. Most importantly,
the design of distractors to trick the learners into confusing dilemmas
is counterproductive. It runs contrary to the very idea of education.
Is this necessary?
In the overall perspective multiple choice tests afford two principal
advantages: ease of administration and scoring. They cost a great
deal on the other hand in terms of preparation and counter-
productive instructional effects. Much research is needed to
determine whether the possibly contrary effects on learning can be
neutralized or even eliminated if the preparation of items is guided by
certain statable principles - for instance, what if all of the alternatives
were set up so that only factually incorrect distractors were used? It
might be that some types of multiple choice items (perhaps the sort
used in certain approaches to programmed instruction) are even
instructionally beneficial. But at this point, the instructional use of
multiple choice formats is not recommended.

KEY POINTS
1. There is a certain strained naturalness about multiple choice test formats
inasmuch as there does not seem to be any other way to ask a question.
2. However, the main argument in favor of using multiple choice tests is
their supposed 'objectivity' and their ease of administration and scoring.
3. In fact, multiple choice tests may not be any more reliable or valid than
similar tests in different formats - indeed, in some cases, it is known that
the open-ended formats tend to produce a greater amount of reliable and
valid test variance, e.g., ordinary cloze procedure versus multiple choice
variants (see Chapter 12, and the Appendix).
4. Multiple choice tests may be discrete point, integrative, or pragmatic -
there is nothing intrinsically discrete point about a multiple choice
format.
5. Pragmatic tasks, with a little imagination and a lot of work, can be
converted into multiple choice tests; however, the validity of the latter
tests must be assessed in all of the usual ways.
6. If one is going to construct a multiple choice test for language
assessment, it is recommended that the test author begin with a discourse
context as the basis for test items.
7. Items must be evaluated for content, clarity, and balance among the
alternatives they offer as choices.
8. Each set of alternatives should be evaluated for clarity, balance,
extraneous clues, and determinacy of the correct choice.
9. Texts, i.e., any discourse based set of materials, that discuss highly
technical, esoteric, super-charged, or otherwise distracting content
should probably be avoided in most instances.
10. Among the common pitfalls in item writing are selecting inappropriate
content; failure to include a thoroughly correct alternative; including
more than one plausible alternative; asking test takers to guess facts not
stated or implied; leaving unintentional clues as to the correct choice
among the field of alternatives; making the correct choice the longest or
shortest; including the opposite of the correct choice among the
alternatives; repeatedly referring to information in the correct choice in
other alternatives; and using ridiculous or hilarious alternatives.
11. Item analysis usually involves the examination of item facility indices,
item discrimination indices, and response frequency distributions.
12. Item facility is simply the proportion of students answering the item
correctly (according to the way it was keyed by the test item writer).
13. Item discrimination by Flanagan's method is the number of students in
the top 27½% of the distribution (based on the total test scores) minus the
number of students in the bottom 27½% answering the item correctly, all
divided by the number corresponding to 27½% of the distribution.
14. Item discrimination is an estimate of the correlation between scores on a
given item considered as a miniature test, and scores on the entire test. It
can also be construed in a more general sense as the correlation between
the item in question and any criterion measure considered to have
independent claims to validity.
15. Thus, ID is always a measure of reliability and may also be taken as a
measure of validity in the event that the total test score (or other
criterion) has independent claims to validity.
16. Response frequency distributions display alternatives against groups of
respondents (e.g., high, mid, and low). They are helpful in eliminating
non-functional alternatives, or misleading alternatives that are trapping
the better students.
17. Among the minimal steps for preparing multiple choice tests are the
following: (1) clarify what is to be tested; (2) decide on the type of test to
be used; (3) write the items; (4) have another person read and critique the
items for clarity; (5) rewrite weak items; (6) pretest the items; (7) item
analyze the pretest results; (8) rewrite bad items; (9) recycle steps (6)
through (8) as necessary; (10) assess validity of the product; (11) use the
test and continue recycling of steps (7) through (10) until sufficient
reliability and validity is attained.
18. Due to the complexity of the preparation of multiple choice tests, and to
their lack of instructional value, they are not recommended for
classroom applications.
19. Oddly, multiple choice tests are the only widely used educational devices
deliberately conceived to confuse learners.

DISCUSSION QUESTIONS
1. Examine a multiple choice test that is widely used. Consider the
alternatives offered to some of its items. In what ways are the distractors
composed so as to be maximally attractive to the unwary test takers?
2. Have you ever used multiple choice tests in your classroom? Or, better
still, have you ever taken a multiple choice test prepared by an amateur?
Have you been subjected to final examinations in a multiple choice
format? What were your reactions? What was the reaction of your
students? Were there any students that did well on the test who did not
know the subject matter supposedly being tested? How can this be
accounted for? Were there any points about which you or your students
seemed to be more confused after testing than before?
3. What uses of multiple choice tests can you conceive of that are not
confusing to learners? What experiments would be necessary to test the
validity and utility of multiple choice tests of the proposed type in
comparison with tests in other formats?
4. In what ways are multiple choice tests more practical than pragmatic
tests? In what ways is the reverse true? What are the key factors to be
considered?
5. Can you conceive of ways of short-cutting the recommended steps for
multiple choice test preparation? For instance, consider the possibility of
using more items in the pretest setting than are necessary for the final test
format. What are the costs and benefits?
6. Find out how much money, how many man hours, and how much
overall expense goes into the preparation of a single standardized test
that is widely used in the schools where you work, or where your children
attend, or in the area where you live. How does the expense compare with
the time, money, materials, and equipment available to the average
classroom teacher?
7. Have you ever done an item analysis of a multiple choice test by hand?
Would you consider doing one once a week? Once a month? Twice a
year? What kinds of changes in instructional load, equipment
availability and the like would be necessary to make regular multiple
choice testing in the classroom a reasonable teacher initiated and teacher
executed chore? What other alternatives exist? District wide tests?
Professionally prepared tests to go with particular curricula? How about
teacher made tests of an entirely different sort?

SUGGESTED READINGS
1. John Bormuth, On the Theory of Achievement Test Items. Chicago:
University of Chicago Press, 1970.
2. Alan Davies, 'The Construction of Language Tests,' in J. P. B. Allen and
Alan Davies (eds.) Testing and Experimental Methods. Volume 4 of the
Edinburgh Course in Applied Linguistics. London: Oxford University,
1977, pp. 38-104.
3. J. B. Heaton, Writing English Language Tests. London: Longman, 1975.
4. Rebecca Valette, Modern Language Testing. New York: Harcourt, 1977.
PART THREE
Practical Recommendations
for Language Testing
10

Dictation and Closely Related Auditory Tasks

A. Which dictation and other auditory tasks are pragmatic?
B. What makes dictation work?
C. How can dictation be done?
   1. Selecting the material and the procedure
   2. Examples of standard dictation, partial dictation, and
      elicited imitation
      i. Standard dictation
         a. Sample materials
         b. Administration procedures
         c. Scoring procedures
         d. Why spelling errors are not counted
         e. Sample scored protocols
      ii. Partial dictation
         a. Sample materials
         b. Administration procedures
      iii. Elicited imitation
         a. Sample materials
         b. Administration procedures
         c. Scoring procedures
   3. Interpreting scores and protocols

In Part Two, we considered some of the shortcomings of discrete point
approaches to language testing. Now we come to the question that
has no doubt been on the minds of many readers ever since the
beginning of the discussion of pragmatic language testing in Chapter
1 of Part One. That question is how to do it reliably and validly.
This chapter focusses attention on a family of auditory testing
procedures. The best researched of them is a variety of dictation
which we will refer to as standard dictation. This variety and others
related to it are described in some detail below, but it should be noted
early that the few testing techniques which are discussed in this and in
following chapters are scarcely indicative of the range of possible
pragmatic testing procedures. More particularly, the few auditory
tasks described in this chapter are far from a complete accounting of
possible tests. In fact, no complete listing can ever be obtained
because their number is unbounded in principle. The techniques
discussed are intended merely to introduce some of the possibilities
that are known to work well. The intent is not to set the limits of the
range of pragmatic auditory tasks that are possible.

A. Which dictation and other auditory tasks are pragmatic?


Any proposed testing procedure which is to qualify as a pragmatic
language processing task must meet the two naturalness criteria laid
down in Chapter 3 above: (1) It must require the processing of
temporal sequences of elements in the language constrained by the
normal meaningful relationships of such elements in discourse, and
(2) it must require the performer of the task to relate the sequences of
elements to extralinguistic context via what we defined above as
pragmatic mappings. More succinctly, pragmatic tasks require time
constrained processing of the meanings coded in discourse.
In order for such a pragmatic task to be used as a testing procedure,
it is necessary that it be quantifiable by some appropriate and feasible
scoring technique. It is assumed that all pragmatic tasks, insofar as
they have external manifestations from which processing can be
observed or inferred, are more or less quantifiable. The adequacy of
the quantification process is part of the assessment of the reliability
and validity of any such procedure used as a testing technique. Of
course, some techniques are more easily converted to scores than are
others, so if we have our choice, we are better off with the ones that
are most easily quantifiable.
It is not necessary, however, that pragmatic tasks single out any
particular skill or hypothesized component of grammar. Neither is it
necessary that the task require the processing of stimuli in only one
sensory modality or in only one productive output mode. Various
skills, more than one input/output channel, and all of the presumed
components of language proficiency are apt to be involved in any
pragmatic task.
Among the family of dictation procedures that have been used in a
variety of ways as testing techniques are standard dictation, partial
dictation, dictation with competing noise, dictation/composition,
and elicited imitation. The first of these, and probably the best
known, requires examinees to write verbal sequences of material as
spoken by an examiner or played back from a recording. If the
material is presented at a conversational rate and if it is given in
sequences of sufficient length to challenge the examinee's short term
memory, we can be assured that a deeper level of pragmatic
processing must take place. If the material, on the other hand, is
dictated very slowly and in very short sequences of words, or one
word at a time syllable-by-syllable, the task does not meet the
naturalness criteria.
Partial dictation is similar to standard dictation except that the
examinees are given a written version of the text (along with the
spoken version) where the written passage has certain portions left
out. The examinees must listen to the spoken material and fill in the
blanks in the written version. In this case, the task resembles a cloze
test except for the fact that all of the text is presented in the oral
version (which may by some methods be presented more than once)
and portions of it are also written down for the examinees in advance.
Other factors being equal, partial dictation is an easier task from the
examinee's point of view though it takes more effort to prepare from
the vantage point of the examiner. It is easier to perform because
more sensory information is given concerning the message - a partial
written version and a complete spoken version.
Dictation may be made more difficult by reducing the signal-to-noise ratio. As Miller, Heise, and Lichten (1951) thoroughly demonstrated many years ago, the speech signal is remarkably resistant to distortion. Normally we can understand spoken messages even under very noisy conditions (as when carrying on a conversation in an automobile speeding along a freeway). And, we can understand speech or writing when the signal itself is very weak (as in listening to a distant radio station with a lot of competition from static, or when reading the text of a badly faded duplicated copy). Spolsky, Sigurd, Sato, Walker, and Arterburn (1968) showed that dictation with added noise was a feasible language testing procedure. Pragmatic versions of the technique are possible so long as the naturalness criteria are met.
Another form of dictation is what has sometimes been called the dicto-comp, or dictation/composition. Examinees are instructed to listen to a text one or more times while it is presented either live or on tape at a conversational rate. Then they are asked to write from memory what they have heard. Verbatim scoring procedures are less applicable to this technique, and it presents many of the same scoring difficulties that we will encounter in reference to essays in general in Chapter 13. Scoring is more difficult than with standard dictation, but detailed procedures have been worked out with reference to specific texts (Frederiksen, 1975a, 1975b, and his references). Most applications of this procedure have been in studies of discourse processing in cognitive psychology, though the procedure (or at least a variant of it) has been used in Freshman English testing (Brodkey, 1972).
Elicited imitation is an auditory task that is similar to dictation in terms of the material presented to the examinee but dissimilar with respect to the response mode. In this case, the examinee hears the material, the same as in dictation (and with equal possibilities for variation), but instead of writing down the material the examinee is asked to repeat it or otherwise recount what was said. Both verbatim tasks and more loosely constrained retelling tasks have been used (Slobin and Welsh, 1967, pioneered the former technique in studies of child language; it is the more scorable of the two approaches; see also Miller, 1973, for an excellent list of references on verbatim tasks). The more loosely constrained retelling tasks run into the formidable scoring problems of the dictation/composition technique and essay grading or rating of free speech. A significant difference, however, in favor of the elicited imitation and dictation techniques is that the examiner at least knows fairly well what the examinee should be trying to say or write. In the essay and free speech cases, the limits of possible messages are less well defined.

B. What makes dictation work?


As testing procedures, simple dictation and its derivatives may seem a little too simple. Some may object that it is possible to write down (or repeat) sequences of material in a language that one does not even understand. This is an important objection and it deserves to be dealt with early in this discussion. A potential examiner may argue, for example, that some learners can take dictation in a certain language almost flawlessly, but that they cannot understand that language under conversational conditions. This is an interesting way of stating the objection because in this form the objection gratuitously offers its own solution: describe the 'conversational conditions' that the learner finds difficult and translate them into an appropriate dictation procedure. It follows that such a dictation must by definition constitute a challenge to the limits of the learner's ability in the language and it remains only to quantify the technique (i.e., develop a scoring procedure).
The reason that dictation works well is that the whole family of auditory tasks that it comprises faithfully reflects crucial aspects of the very activities that one must normally perform in processing discourse auditorily. In fact, in the case of dictation in particular and pragmatic testing procedures in general, language tests are much more direct measures of the competence underlying such performances than is the case with almost any other kind of educational testing. Performing a dictation task is a lot closer to understanding spoken discourse than answering multiple choice questions on a so-called aptitude test is to displaying an aptitude for learning or for whatever the aptitude test purports to measure. Consider the case of displaying memory for certain pairings of English words with Turkish words in a subtest of the Modern Language Aptitude Test. Such an activity is considerably less similar to learning a foreign language than dictation is to processing discourse. This same argument can be extended to a comparison of any pragmatic test with any other sort of educational measure.
If pragmatic language tests require examinees to do the kinds of tasks that a pragmatic theory of language use and language learning defines as characteristic of human discourse, then the tests should be as good as the theory, depending on how faithfully they mirror the performances that the theory tries to explain. The theory is vulnerable to other sorts of empirical tests, but the information from studies of language tests qua tests is also one of the sources of data to test the theory. The reasoning may seem circular, but it is certainly not viciously so. The theory itself can be used as an argument for the validity of proposed testing procedures that accord well with it because the theory itself is anchored to empirical reality through validity checks other than language test data and research. This is the requirement of the theoretical basis advocated by Cronbach and Meehl (1955) in their classic article on construct validity.
A second validity criterion concerns the pattern of interrelationship between tests that purport to measure the same construct. Cronbach and Meehl (1955) argued that if a variety of tests are believed (according to some theoretical reasoning) to assess the same construct (in our case the expectancy grammar of the language user), they should produce a corresponding pattern of intercorrelation. That is, they should intercorrelate strongly, demonstrating a high degree of variance overlap. This sort of validity evidence for dictation and closely related procedures was discussed in Chapter 3 above (also see the Appendix). In fact, the degree of correlation between dictation and other pragmatic testing procedures in many studies has approached the theoretical ideal given the estimated degree of reliability of the tests investigated (see especially Oller, 1972, Stubbs and Tucker, 1974, Irvine, Atai, and Oller, 1974, Hinofotis, 1976, Oller and Hinofotis, 1976, Bacheller, in press, LoCoco, in press, and Stump, 1978).
A third validity criterion suggested by Cronbach and Meehl (1955) pertains to the kinds of errors that examinees make in performing the tasks circumscribed by a procedure or set of testing procedures. Data from studies of learner errors (or better, learner protocols including all of the learner outputs) are supportive of the interpretation that pragmatic tasks of all sorts share a considerable amount of variability in terms of what they require learners to do. Comparisons of the types and quantities of errors reveal substantial similarities in performance across different pragmatic tasks (see the discussion of error analysis in Chapter 3 above).
In a nutshell, dictation and closely related procedures probably work well precisely because they are members of the class of language processing tasks that faithfully reflect what people do when they use language for communicative purposes in real life contexts. That class of tests is large, but it excludes many so-called 'language' tests that do not meet the pragmatic naturalness criteria.

C. How can dictation be done?


1. Selecting the material and the procedure. The first step is to determine the purpose for testing. What is to be done with the test data once gathered? Failure to deal adequately with this question results in the collection of far more data than can be analyzed in many cases, or else it leads to the collection of data that cannot be used to address the principal questions or needs. Once it has been determined what the test is for, the next step is to decide which of the many possible procedures to use and how to calibrate the test appropriately to the purpose for testing and the subject population to be tested. Both of these decisions should be informed by the educator's (or other person's) knowledge of the subjects to be tested, how the testing is to be articulated within the total curricular setting, and the practicalities of how many people and what amount of time (and possibly money) is to be spent in collecting, analyzing and interpreting the test data.
In relation to the foregoing decisions, some of the questions that should probably be considered are: How many times will the testing be done? If it is part of an instructional program, how can the test data be used to improve the efficiency of that program? Is the test data to be used for placement purposes? As a basis for diagnosing levels of overall proficiency and prescribing instruction accordingly? Is it to be used to assess specific instructional goals? There are many other questions, but perhaps these provide some clues as to how to attack the decision-making problem. Some examples of specific testing problems may also help.
Suppose the purpose of testing is to assess the ability of children to understand the language of the classroom, and possibly to assess the degree of bilinguality of children with more than one language. We might want to know, for instance, whether it makes sense to deliver instruction in English to children whose home language is, say, Spanish. The test should be defined in terms of the kinds of discourse tasks the children will be expected to perform in classroom contexts. If an estimate of dominance is desired it will be necessary to devise similar tasks of roughly equivalent difficulty in both languages and to obtain a proficiency estimate in each language along the lines suggested in Chapter 4 above. If the children are preliterate, or if they are still involved in the difficult process of acquiring literacy, an elicited imitation task might be selected. Other forms of dictation would be ruled out. Below, we consider in more detail one of the ways of doing elicited imitation.
Another possible purpose for institution-wide testing is the case of testing foreign students in a college or university admissions and placement procedure. The principal question might be whether the incoming students can understand the sorts of speech events they will be required to deal with in the course of their study in a certain degree program. If it is determined that college lecture material is the sort of discourse that should be used as a basis for test material, then it may make sense to use one or more texts drawn from typical lecture material as a basis for testing.
In many school contexts, and for many purposes, it would make a lot of sense to try to find out more about the ability of learners to process discourse of varying degrees of difficulty and concerning different subject matter domains. Dictation at an appropriate rate, of the kinds of material that learners are expected to cope with, is a promising way of investigating how well learners can handle a variety of school-related discourse processing tasks. Many educators may be surprised if not dismayed at the level of comprehension exhibited by students.

In some cases learners may be able to handle substantially richer verbal material than they are being exposed to. In an important study (discussed in greater detail in Chapter 14) Hart, Walker, and Gray (1977b) have shown that preliterate children are capable of dealing with texts of substantially greater pragmatic complexity than the ones with which they are presented in most reading programs. Their research shows that children who are exposed to richer and more complex materials, that are better articulated in relation to the kinds of discourse children are able to handle, learn to read and write much more readily (Hart, 1976). In other cases it may be that the language of the curriculum is too complex and needs to be simplified. If this were the case, however, it would probably be true only in a limited sense - e.g., perhaps certain concepts should be taught rather than taken for granted. In most instances, though, learners are probably capable of handling a great deal more information and of considerably greater complexity than any particular curriculum has found ways of expressing.
Or, consider the problem of testing foreign students at the college level. Probably the standard dictation procedure is about as appropriate as any testing technique to determine the relative abilities of foreign students to follow spoken English in whatever contexts are defined. Scores can be made more meaningful in relation to the general problem of admission by finding out how well the broad spectrum of entering students who are native speakers of English perform on the same task. There is evidence that there may be considerably more variability in the performance of native speakers on pragmatic processing tasks than could ever have been discovered with discrete point test procedures. Whereas it was reasonable to insist that the average native speaker should be able to answer all or nearly all of the items on a discrete point grammar test correctly, or on a phonological discrimination task, there is considerable room for variability in discourse processing skills that involve the higher integration of knowledge and verbally coded information. Recent evidence suggests that native speakers may tend to make the same kinds of errors in dictation that non-natives make (Fishman, in press), though the natives do make fewer errors.
Another possible purpose for wanting to test with some variety of dictation is to evaluate the acquisition of information of some more specific sort. For instance, a foreign language teacher, or an English teacher (or a math or statistics teacher) may want to know how well the students in a particular class are able to follow the discussion of a particular topic or segment of course material. A standard dictation of material (with fairly long bursts of sequences at a conversational rate) or perhaps better, a dictation/composition may be informative both to the teacher and to the students. If the sequences of verbal material are presented at a normal conversational speed and in fairly long bursts, it will be exceedingly difficult for learners to write down what they do not understand, or to recall and rephrase material that they cannot process at a fairly deep level of comprehension.
The research of Frederiksen (1975a, 1975b, and see his references) and that of Gentner (1976, also see his references) suggests strongly that the process of understanding discourse is considerably more constructive than sentence interpretation models of speech understanding would have led us to believe. Furthermore, learner performances probably vary much more and due to many more factors than the speech processing models of the 1960s would have led us to believe. Learning in any academic setting may be far more a matter of language comprehension in a fairly literal sense than traditional models might have led us to suppose (for some pioneering work on this topic, see Freedle and Carroll, 1972).

(i) Standard dictation. For the purpose of illustrating the most widely used dictation procedure, the following texts taken from dictation materials used in the testing of foreign students at UCLA in the spring of 1971 are selected:

(a) SAMPLE MATERIALS


Will Rogers (Text 1)
Will Rogers grew up in the western part of the United States. He was a real cowboy, riding horses around his father's ranch all day. When he was very young, his parents worried about him because he was always doing something wrong. No one could control him. His father finally sent him to a military school in the South. When the director of the school noticed Will's cowboy rope lying on top of his suitcase, he expected trouble with this new boy, and he was right.

A Taste for Music (Text 2)
A taste for music, a taste for anything, is an ability to consume it with pleasure. Taste in music is preferential consumption; a greater liking for certain kinds of it than for others. A broad taste in music involves the ability to consume with pleasure many kinds of it. Vast numbers of persons, many of them highly intelligent, derive no pleasure at all from organized sound. An even larger number can take it or leave it. They find it agreeable for the most part, stimulating to the sentiments, and occasionally interesting to the mind.

It can be said that neither of these texts is very characteristic of college level reading material and they may even be less representative of college classroom discourse. Nonetheless, both worked very well in discriminating English abilities of foreign students entering UCLA, which was their original purpose as dictation materials, and which is further testimony to the robustness of dictation as a testing technique.

(b) ADMINISTRATION PROCEDURES


However the decision concerning the selection of materials is reached, the next step is to decide how to set up the dictation task. There are many ways to influence the difficulty level of the task even after the difficulty level of the materials has been set by the selection process. Factors influencing the task difficulty are: (1) the conceptual difficulty of the word sequences themselves (other factors being held constant); (2) the overall speed of presentation; (3) the length of sequences of material that are presented between pauses; (4) the signal-to-noise ratio; (5) the number of times the text is presented; (6) the dialect and enunciation of the speaker and the dialect the hearer is most familiar with; and (7) a miscellany of other factors.
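In present-day terms, these factors can be thought of as the adjustable parameters of a dictation task. The following minimal sketch simply bundles them into a single specification; the names and default values are hypothetical illustrations, not from the text, apart from the seven-to-twenty word burst range suggested later in this section:

```python
from dataclasses import dataclass

@dataclass
class DictationTaskSpec:
    """Hypothetical bundle of the difficulty factors listed above.
    Factor (1), conceptual difficulty of the word sequences, and
    (7), the miscellany, resist simple quantification and are omitted."""
    words_per_minute: int = 150        # (2) overall speed of presentation (assumed value)
    min_burst_words: int = 7           # (3) shortest sequence between pauses
    max_burst_words: int = 20          # (3) longest sequence between pauses
    signal_to_noise_db: float = 30.0   # (4) signal-to-noise ratio (assumed value)
    presentations: int = 3             # (5) number of times the text is presented
    speaker_dialect: str = "unspecified"  # (6) dialect and enunciation of the speaker
```

Varying any one of these parameters, with the text held constant, moves the task along the continuum of difficulty described in the next paragraph.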
The determination of difficulty level is scarcely algorithmic. Rigorous criteria do not exist and quite possibly cannot ever be established. This is partly due to the tremendous variability of the experience of populations of subjects for which one might like to set such criteria. However, the fact that rigorous decision-making criteria do not exist does not prevent testers from reaching practical decisions concerning relevant factors. The main reason for mentioning such factors is to provide the reader with some notion concerning the remarkable spread of difficulty levels that may be obtained with a single text. Within certain ill-defined limits, the difficulty level of any text is more or less continuously variable.
Concerning the first factor mentioned in the preceding paragraph, we can easily see that Text 1 above (the one about Will Rogers) is easier than Text 2 (about music appreciation). The vocabulary in Text 1 is simpler and the ideas expressed are less abstract. However, by varying the overall speed of presentation, the length of sequences between pauses, and/or the signal-to-noise ratio, Text 1 could be made into a dictation task that would be more difficult than a different dictation task over Text 2.
Fortunately, the difficulty level of any given task is not the principal thing. If a group of examinees is observed to display a certain rank order of proficiency levels on, say, a particular chunk of Mark Twain's prose from a certain novel, chances are extremely good that they will display a similar rank order of levels on almost any other portion of text from the same novel, or another novel, or in fact any similar text or any other passage of discourse. (This is assuming that in each case the text in question is about equally familiar or unfamiliar to all of the tested subjects.)

It is not the difficulty level of the passage for a group of subjects or for any particular examinee that matters most, rather it is the tendency for examinees to perform proportionately well or poorly in relation to each other. This tendency for examinees to differ in performance, somewhat independently of the difficulty of any particular task, is what accounts for the robustness of the dictation procedure. In a word, it is the variance in test scores, not the mean of a certain group or the score of a particular subject on a particular task, that is the main issue. Therefore, the purpose of the examiner is not to find a text or define a task that will produce a certain mean score (i.e., a certain difficulty level), rather, it is to find a text and define a task over it that will generate a desirable amount of variance between subjects in the performance of that task.
If the examiner is looking for an indication of degree of overall proficiency, the object may be to set a task that is not too difficult for the examinees who are lowest in proficiency nor too easy for the examinees who are highest in proficiency. Fortunately again, the dictation procedure is normally sufficiently robust to satisfy this need regardless of which point in a range of difficulty levels is struck upon in the materials selection process. We return to this topic below in the discussion of the interpretation of scores on dictation tasks. If, on the other hand, the purpose of the test is to measure ability to comprehend a particular text that may have been assigned for study, a task over that same text would seem to be called for.
In the standard dictation procedure, the task set the examinees is to write down sequences of heard material. Usually, the text to be written is read once at a conversational rate while the examinees just listen. The second time it is read with pauses at pre-established break points, and sometimes with marks of punctuation given by the examiner. The third time it is read either with or without pauses (and/or punctuation) while the examinees check what they have written. An important step in setting up the procedure is deciding on break points in the text or texts to be dictated.
There are no absolute criteria for deciding where to place the pauses, but some general principles can be suggested as heuristic guidelines. Breaks should be spaced far enough apart to challenge the limits of the short term memory of the learners and to force a deeper level of processing than mere phonetic echoing. The amount of material that can easily be held in short term memory without really understanding its meaning will vary from learner to learner somewhat according to the proficiency level of the learners. More proficient examinees will be able to handle longer sequences with the same ease that less proficient examinees handle shorter ones. Probably sequences much less than seven words in length should be avoided. In some cases sequences in excess of twenty words in length may be desirable.
Pauses should always be inserted at natural break points. They should not be inserted in the middle of a phrase, e.g., '... grew up in the / western part of the United States ....' (where the inserted slash mark represents a pause, or break point in the presentation of the dictation). Pauses should be inserted at natural points where pauses might normally occur in a discourse, e.g. at the period after the phrase 'the United States'. Sometimes it may be necessary to break in the middle of a clause, or even in the middle of a series of phrases within a clause, e.g., a break might be set between 'grew up' and 'in the western part of the United States'. In all cases, the most natural break points are to be preferred.
In Text 1, the following break points were selected (as indicated by slash marks):

Will Rogers (Text 1)
Will Rogers grew up in the western part of the United States./ He was a real cowboy, riding horses around his father's ranch all day./ When he was very young, his parents worried about him/ because he was always doing something wrong./ No one could control him./ His father finally sent him to a military school in the South./ When the director of the school noticed Will's cowboy rope/ lying on top of his suitcase,/ he expected trouble with this new boy, and he was right./
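These guidelines can be roughly mechanized. The sketch below (the function name and the greedy strategy are illustrative choices, not the author's procedure) drafts candidate break points by preferring natural punctuation boundaries while keeping each burst within the seven-to-twenty word range suggested above; hand adjustment of the result is still assumed:

```python
def propose_breaks(text, min_words=7, max_words=20):
    """Greedy draft of dictation break points: break at punctuation once
    a burst reaches min_words, or unconditionally at max_words."""
    bursts, current = [], []
    for token in text.split():
        current.append(token)
        at_natural_boundary = token[-1] in ".!?,;"
        if (at_natural_boundary and len(current) >= min_words) or len(current) >= max_words:
            bursts.append(" ".join(current))
            current = []
    if current:  # flush any trailing partial burst
        bursts.append(" ".join(current))
    return bursts

# Applied to the opening of Text 1 it reproduces the first published break:
print(propose_breaks("Will Rogers grew up in the western part of the United States."))
# ['Will Rogers grew up in the western part of the United States.']
```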

Since the purpose of the test is decidedly not to assess the speed with which examinees can write, the pauses must be long enough not to turn the task into a speed writing contest. A rule of thumb that seems to work fairly well is for the examiner to subvocalize the spelling of each sequence of verbal material twice during the pause while the examinees are writing it. If the material is to be tape-recorded this process will generally provide ample time. For instance, the examiner would read aloud (either live in front of the examinees or onto a tape recording) the first designated sequence of words of the preceding text: 'Will Rogers grew up in the western part of the United States.' Then he would silently rehearse the spelling of the sequence: 'W-I-L-L-R-O-G-E-R-S-G-R-...' and so on twice through. Then he would proceed to the next sequence beginning 'He was a real cowboy, ...' and so on throughout the passage.
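For tape preparation, the rule of thumb can also be approximated numerically. A minimal sketch, assuming (my figure, not the author's) roughly a third of a second per subvocalized letter:

```python
def pause_seconds(burst, seconds_per_letter=0.3):
    """Approximate the rule of thumb: the pause should last about as
    long as silently spelling the burst out twice."""
    letters = sum(1 for ch in burst if ch.isalpha())
    return 2 * letters * seconds_per_letter

print(pause_seconds("Will Rogers grew up in the western part of the United States."))
# about 29 seconds of writing time for this 49-letter burst
```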
In some dictation tasks, the punctuation marks are given as part of the reading of the dictation. If this is done, and it usually is done in standard dictation, the examinees should be appropriately instructed to expect marks of punctuation to be given. For instance, the following instructions might be given either orally or in written form:
This is a test of your ability to comprehend and write down orally presented material. It is a dictation task. You will hear a paragraph read aloud three times. The first time it will be read at a conversational speed. Just listen and try to understand as much as you can. The second time the passage will be read with pauses for you to write down what you hear. Marks of punctuation will be given wherever they occur. When you hear the word 'period' at the end of a sentence, put a period (.) at that point in the paragraph. Do not write out the word P-E-R-I-O-D. Other marks of punctuation that will be given are 'comma' (,) and 'semicolon' (;). (The instructor should make sure the examinees are forewarned about all of the marks of punctuation that actually appear in a given passage, at least until they are familiar with the procedure to be followed.) The third time you will hear the same paragraph without pauses while you check what you have written down. You should write exactly what you hear.

Alternative forms of the dictation procedure would be to read the passage twice, or only once. Another possibility would be to require the examinee to write the passage with no marks of punctuation indicated. If this latter option were followed, it would probably be best to disregard punctuation altogether in the scoring or to allow as many forms of punctuation as possible for the given text. Including the marks of punctuation as part of the dictation procedure is one way around the difficulty of scoring punctuation (if at all). If the marks are indicated clearly in the oral procedure, putting them in becomes a part of the overall listening comprehension task.
If the object of the test is to measure overall ability to deal with auditorily presented material that is new or relatively unfamiliar, probably the methods of reading the material once, twice, or three times will produce roughly equivalent results though it is expected that a second or third reading will make the task progressively easier. Research needs to be done on this question, but we might expect an increase in mean score on the same passage if it is read a second time, and possibly another increase if it is read a third time, but the variance in scores should be largely overlapping (highly correlated) in all three cases. That is, the tendency of scores to differ from their respective means on each of the three tasks should be roughly similar. The choice of how many presentations of the material to use may make little difference in some cases, but in others there may be good reasons for preferring to read the material only once - for instance, when trying to simulate closely an actual listening task.

(c) SCORING PROCEDURES


If the examiner chooses to make a live presentation of the dictation material, he should make every effort to avoid slips of the tongue, unnecessary disfluencies, and the like, and should be prepared to compensate for his own errors in whatever scoring system is devised. It is possible for the examiner to say something different than what is actually in the written passage. Usually when this happens, the examiner is unaware of it until it is time to score the tests. The problem may then become obvious when the examiner discovers that all of the students tried in fact to write what he said and not what the passage actually contained. In such cases, it is appropriate to score the test for what was said rather than what appeared in the script.
Standard dictation is usually scored by allowing one point for every word in the text. It can be done in either of two roughly equivalent ways. If there are few correct sequences in the examinees' protocols, it is probably best (easiest and most reliable) to count only the correct words that appear in the appropriate sequence. If there are few errors on the other hand and many correct words in sequence, it is probably best to count errors and subtract their number from the total number of points. Because of the possibility of inserting words not in the text, it is possible by the error counting method to derive scores less than zero. Therefore, the error-counting method and the correct words-in-sequence method are not perfectly equivalent if intrusions are counted as errors. For reasons to be discussed below, it is best not to count spelling errors unless they seem to indicate a problem in the perception of distinctive sounds or knowledge of different word meanings.
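The two scoring methods lend themselves to a simple mechanical statement. The sketch below is illustrative only (the function names and the longest-common-subsequence matching strategy are mine, not a reconstruction of the UCLA scoring forms); it counts correct words in sequence and derives the error-count score from the same match, with the floor at zero discussed later in this section. Spelling normalization is assumed to have been done beforehand, since spelling errors are not counted:

```python
def words_in_sequence(text, protocol):
    """Count words of the dictated text reproduced in the correct order
    (a longest-common-subsequence match over words, ignoring case
    and sentence punctuation)."""
    clean = lambda s: [w.strip(".,;!?").lower() for w in s.split()]
    t, p = clean(text), clean(protocol)
    dp = [[0] * (len(p) + 1) for _ in range(len(t) + 1)]
    for i, tw in enumerate(t):
        for j, pw in enumerate(p):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if tw == pw
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[len(t)][len(p)]

def score_by_error_count(text, protocol):
    """Error-counting variant: deletions/distortions are unmatched text
    words, intrusions are unmatched protocol words; floored at zero."""
    matched = words_in_sequence(text, protocol)
    total = len(text.split())
    errors = (total - matched) + (len(protocol.split()) - matched)
    return max(0, total - errors)

# Example 3 below ('He was real cowboy' for 'He was a real cowboy'):
print(words_in_sequence("He was a real cowboy", "He was real cowboy"))  # 4 of 5
```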
Word-for-word scoring goes somewhat slowly for the first several papers (or protocols), but as the scorer becomes more familiar with the text, it goes considerably faster. Errors include deletions, distortions of form or sequence, and intrusions. Spelling errors can be distinguished as a special category as will be illustrated.
Examples of deletions in protocols generated by dictating the Will Rogers text (see above) to foreign students being admitted at UCLA in the spring of 1971 include the following (freehand circles above the line in the original protocols indicate deletions):
1. '... in the western ... He was a real cowboy ... all day ....' in place of 'Will Rogers grew up in the western part of the United States. He was a real cowboy, riding horses around his father's ranch all day ....' for a total of 15 deletion errors, or 10 correct words in sequence from a possible 25;
2. 'in western part of United States ....' instead of 'in the western part of the United States' scored as 2 deletion errors or alternatively as 6 correct words in sequence out of a possible 8;
3. 'He was real cowboy, ...' for 'He was a real cowboy' with 1 deletion error or 4 out of 5 points.
Some examples of distortions of form and their scoring are:
4. 'Will Rogers grow up in the western part of the United States. He was a real cowboy, leading horses in his father ranch all day ....' In this protocol there are distortion errors in the renderings of 'grew', 'riding', 'around', and 'father's' for a total of 4 errors or 21 of 25 possible points;
5. 'When the director of the school saw this real cowboy lying on top of his suitcase, ...' instead of 'when the director of the school noticed Will's cowboy rope lying on top of his suitcase ...' where the word 'noticed' has been transformed into 'saw this' (1 error), and 'Will's cowboy rope' has been rendered as 'real cowboy' (2 errors);
6. 'rinding horseback in his fathers range all day' for 'riding horses around his father's ranch all day' (3 errors: 'rinding' for 'riding', 'in' for 'around', and 'range' for 'ranch');
7. 'Wil cablero rok' for 'Will's cowboy rope' (3 errors).
Distortions of sequence are less common, but they do occur (frequently they are interlaced with a variety of other errors):
8. '... fathers ranch' instead of 'He was a real cowboy riding horses around his father's ranch all day' (the adverbial 'all day' is transposed to an earlier position; 'reading' is substituted for 'riding'; and 'in' is inserted where it did not appear at all - total score is 10 out of a possible 13, or a total of 3 errors);
9. ...
10. 'writing horses his around father's ranch all day' instead of 'riding horses around his father's ranch all day' ('his' out of order for 1 error and 'writing' for 'riding' for a second error);
11. '... trouble with the ... boy' instead of 'trouble with this new boy' (2 errors; 1 order, 1 intrusion or distortion);
12. '... military school' for 'sent him to a military school' (3 errors; 1 order, 1 distortion of form, 1 deletion).
Intrusions of words that did not appear in the text also occur occasionally:
13. 'expected the trouble with this new boy' for 'expected trouble with this new boy' (1 error, intrusion of 'the');
14. 'noticed Wills cowboy's rope' for 'noticed Will's cowboy rope' (1 error for the intrusion of the possessive morpheme apparently copied from 'Will's' to 'cowboy');
15. 'all day ...' instead of 'all day' (1 error);
16. '... Will cowboy rope lying on ... top' for 'Will's cowboy rope lying on top' (2 intrusion errors, 1 distortion).

The foregoing merely illustrate some of the common types of errors that occur in non-native attempts at writing dictation of the standard variety. There is one further type of error that we should consider before we proceed to look at several actual examinee protocols. It is possible to distinguish (generally) spelling errors from other types of errors.

(d) WHY SPELLING ERRORS ARE NOT COUNTED


A general rule of thumb is to ask of any potential spelling error whether it could be made by someone who really knows the language well and is capable of making all the appropriate sound distinctions, e.g., the distinction between /l/ and /r/ or between /w/ and /v/ and /b/, or say between the vowel sound of 'beat' versus the vowel of 'bit' and so forth. English is not an easy language to spell, and it is not the purpose of dictation (normally) to be a test of knowledge of English spelling. Below, examples of spelling errors appear in the left hand column; the correct spelling appears in the center column; and examples of lexical or phonological errors that are not classifiable as spelling errors are given in the third column. Practically all the examples are genuine errors from examinee protocols. Contrived examples are asterisked.
SPELLING ERRORS    CORRECT SPELLING    PHONOLOGICAL ERRORS

Will Rodgers       1. Will Rogers      Bill Roger(s)
Wit Radgers                            Will Radios
Wil Rogers                             Will Lagerse
Willi Rogers                           William Roger
                                       We ride
fathers            2. father's         farmer's
                                       father
                                       farther's
al day             3. all day          hole day
                                       whole day
notest             4. noticed          notice
noteced                                *know this
roap               5. rope             robe
*expekted          6. expected         excepted
                                       espected
yung               7. young            *lung
directer           8. director         *dractor
rannch             9. ranch            ransh
                                       range
suitecase          10. suitcase        *suitscase
somthing           11. something       somsing
                                       *some think
rong               12. wrong           *long
wright             13. right           write
abillity           14. ability         a meladic
abilitie
abilaty
militery           15. military        emelitery
millitary                              omilitary

Of course, there are cases where it is difficult to decide whether an error is really a spelling problem or is indicative of some other difficulty besides mere spelling. In such cases, for instance, 'pleshure' for 'pleasure', 'teast' for 'taste', 'ridding' for 'riding', 'fainaly' for 'finally', 'moust' for 'must', 'whit' for 'with' and similar instances, perhaps it is best to be consistently lenient or consistently stringent in scoring. In either case it is a matter of judgement for the scorer. Fortunately, such doubtful cases account for only a small portion of the total variance in dictation scores. For 145 foreign students tested at UCLA in 1971 on the texts illustrated above, the average number of spelling errors was 1.30 on the first passage (Will Rogers) and 1.61 on the second (A Taste for Music). Compare these figures against the average number of all other errors on the two passages, which was in excess of 50 on each passage. Furthermore, the number of errors classed as spelling errors concerning which there is doubt is only a small fraction of the total number of spelling errors.
Why not include spelling errors as part of the total score? That is, why not subtract them from the total score the same as any other type of errors? Table 5 shows that spelling errors are uncorrelated with the other types of errors that learners make in taking dictation, and they are also unrelated to performance on a variety of other tasks included in the UCLA English as a Second Language Placement Examination (Form 2D). The test consisted of four multiple choice sections in addition to the two dictations. It was administered to 145 subjects in the spring of 1971. The table reports the intercorrelations between the two dictations (Dictation 1, Dictation 2), a multiple choice synonym matching task (Vocabulary), a multiple choice reading comprehension task (Reading), a multiple choice grammar test requiring subjects to select the one appropriate word, phrase, or clause to fill a blank in a larger verbal context (Grammar Select), and a multiple choice task requiring subjects to put words, phrases, or clauses in an appropriate order (Grammar Order). Also included in the table are two spelling scores, one for each of the respective dictations. Since the spelling scores (Spelling 1, Spelling 2) are negative, that is, the worse the subject did the higher the score, and since all of the other test scores are expressed as the number of correct answers on the respective portion of the test, correlations between the spelling scores and the other test scores should be negative.
The table reveals, however, that in fact the spelling scores are slightly positively correlated with the other test scores represented in the correlation matrix. The spelling scores on the two dictations are substantially correlated with each other, but are scarcely correlated at all with any other test in the table. This would suggest that spelling errors on the two dictations used are not related to overall proficiency in English as measured by the dictations and the other parts of the UCLA ESLPE 2D. For this reason, it is recommended that spelling errors not be included as part of the total score computation for dictation tests. Whitaker (1976) and Johansson (1973) make similar recommendations.

TABLE 5
Intercorrelations between Two Dictations, Spelling Scores on the Same Dictations, and Four Other Parts of the UCLA English as a Second Language Placement Examination Form 2D Administered to 145 Foreign Students in the Spring of 1971.

UCLA ESLPE 2D Subparts    V      GS     GO     R      D1     D2     S1     S2

Vocabulary               1.00    .71    .64    .85    .47    .71   -.03    .04
Grammar Selection               1.00    .79    .75    .69    .64    .04    .08
Grammar Order                          1.00    .73    .61    .61    .06    .07
Reading                                       1.00    .47    .68    .06    .11
Dictation 1                                          1.00    .68    .06    .03
Dictation 2                                                 1.00    .06    .03
Spelling 1                                                         1.00    .55
Spelling 2                                                                1.00

From Table 5 we can see that the tendency to make spelling errors in the two dictations is substantially correlated. That is, the correlation between Spelling 1 and Spelling 2, see row 7 and column 8 of the correlation matrix, is .55. Thus, about .30 of the variance in spelling errors on Dictation 1 is also present in the variance in spelling errors on Dictation 2. A relationship of this strength would not be expected to occur by chance as often as one time in one thousand tries. That is, if we repeated the testing procedure with a thousand different groups of subjects of about 150 in number, we would not expect a correlation of .55 to occur by chance even one time. Hence, the correlation must be produced by some empirically reliable factor - say, ability to spell English words.
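The .30 figure is simply the squared correlation coefficient, the conventional index of shared variance between two measures:

\[ r_{S_1 S_2}^2 = (.55)^2 = .3025 \approx .30 \]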
However, none of the correlations between the spelling scores and the other subtests reaches significance. Correlations of the observed magnitudes (ranging from -.03 to .11) would be expected to occur by chance more than ten times in 100 tries. Besides, if the correlations were produced by some real relation between the cognitive ability underlying English language proficiency and ability to spell English words, the correlations ought to be stronger (i.e., of greater magnitude) and they ought to be consistently negative. If ability to spell English words is related to overall proficiency as assessed by the ESLPE 2D, then the greater the number of spelling errors in a dictation, the lower should be the proficiency of the examinee who made the errors. In fact, the slight positive correlations suggest that the greater the number of errors the higher the overall proficiency score.
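These judgements of significance follow from the usual t statistic for a correlation coefficient (a standard formula, not given in the original):

\[ t = r\sqrt{\frac{N-2}{1-r^2}} \]

With N = 145, r = .55 yields t of about 7.9, far beyond the .001 level, while even the largest of the spelling-by-subtest correlations, r = .11, yields only t of about 1.3, short of significance.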
Although spelling errors are uncorrelated with scores on any of the language proficiency subtests, the latter are all quite strongly correlated with each other. That is, a higher score on one part of the test will tend to correspond to a higher score on every other part of the test, while a lower score will tend to correspond to a lower score on all subtests. Along the lines of the argument presented in Chapter 3 of Part One, the correlations observed in Table 5 should strengthen our confidence in the various testing procedures represented there. The strongest relationship exists between the Reading subtest and the Vocabulary subtest. All of the tests, however, appear to have substantial reliability and validity.

(e) SAMPLE SCORED PROTOCOLS


To give a clearer notion of how the scoring of a standard dictation task might be done, three protocols for each of the sample texts from section 1 above are given in their entirety below. All are actual renditions by foreign students at the university level. The native language of the learner who wrote the protocol is given at the top left in parentheses. Cross-outs are the learner's own in each case. In some cases, error types are indicated in cursive script above the line as deletions, distortions, intrusions, spelling errors, and so on. Punctuation and capitalization have been corrected or ignored. In no case do they contribute to the total score. The marking method used is to encircle errors and place circles above the line where words are deleted. Intrusions above the number of errors in any given line of text are not counted toward the total score. The total score is then the number possible minus the number of errors. (Depending on how intrusions are treated, this method may result in a slightly different score than merely counting the total number of correct words in correct sequence.) It is not possible to get a score lower than zero so long as intrusions in excess of the number of errors in a given line of text do not count.

[The hand-drawn scoring marks on the protocols below are not recoverable; the learners' renditions are given as plain text, with '...' standing for illegible stretches.]

Protocol 1
(Japanese) SCORE:
Will Rogers grew up in the Western part of the United States. He was a real cowboy, riding horses his farther's ranches all day. When he was very young, his parents worried about him because he was always doing something wrong. No one could control him. His farther finally sent him to a military school in the South. When the director of the school noticed ... cowboy ... lying on the top of his suitcase, he expected trouble with this new boy and he was wright.

Protocol 2
(Chinese) SCORE: 65/87
Will Ragers grew up in the western part of the United States. He was a real cowboy, riding horses around ... ranch all day. When he was very young, his parents worried about him, because he was always ... something wrong. No one could control him. His father finally sent him to ... military school ... south. When the director of the school noticed ... cowboy ... top of ... suitcase, he expected ... trouble with ... new boy, and he was right.

Protocol 3
(Armenian) SCORE: 66/87
Will Rog... in Western part of United States. He was a real cowboy, riding horses around ... father's house all day. When he was very young, his parents ... about him because he always doing somthing wrong. No one could control him. His father finally sent him to ... melitery school in th Shout. When the director of ... school known the cowboy ... in the ... suitcase, he expected trouble with this new boy, and he was write.
To check the assigned scores, the reader may want to look back at the exact text given above on p. 274. Three examples from the second text follow:

Protocol 4
(Japanese - same subject as in Protocol 1) SCORE:
A taste of music, a taste for anything, is an ability to consume with the pleasure. Tast in music is proporential consumption, a greater liking ... certain kinds of it than for others. A broad taste in music involves the ability to consume with the pleasure many kinds of it. Vast numbers of persons, many of the highly intelligent, derive no pleasure at all from organized sound. An even larger number can take it ... leave it. They find it agreeable for the most part, stimulating ... the sentiments, and occasionally interested ... mind.

Protocol 5
(Chinese - same as Protocol 2) SCORE: .../94
A taste ... music, a taste ... anything is ... ability ... taste music consumption ... a greater ... than ... others. A ... taste music is ... ability ... consumption ... pleasure many kinds of it. ... numbers persons, many of ... highly intelegent ... from organized sound. Even larger number can take ... They find ... for most part. Stimulating ... and occation...
Protocol 6
(Armenian - see Protocol 3) SCORE: 60/94
A taste for music, a tate for anything, is ability to consure or pleasure. Taste ... music consumtion, a greater ... than for ... A brought taste ... music involve ... ability to ... pleasure many kinds of it. ... of persons, many of them ... intelegent, evide no pleasure ... organized sound. An even larger number can take it or leave it. They find it a ... for ... most part, ... and occasionaly ... mind.

(ii) Partial dictation. This technique is actually a combination of dictation and the cloze procedure (see Chapter 12 for a more detailed discussion of the latter). In partial dictation, actually all of the material is presented in an auditory version, and part of it is also presented in a printed form. The portions of text that are missing in the printed version are the criterion parts where the examinee must write what is heard - hence, though all of the material is presented in an auditory form, only part of it is really dictated for the learner to write down. The technique has a great deal of flexibility and may be done in such a way as to break up the text somewhat less than the standard form of dictation.
Johansson (1973) suggests two methods for selecting materials. One way is to tape a portion of natural discourse - a lecture, a radio program, a conversation, or some other verbal exchange. Another is to concoct a text or script to be tape recorded as if it were one of the foregoing, or merely to tape record a script, say, a paragraph of prose. In the first case it is necessary to transform the auditory version into a written form, that is, write the script. In the second, one starts with the script and then makes a recording of it. Another step in either case is to decide what portions of the script to leave blank. Once those decisions are reached, pauses of sufficient length must be inserted in the taped version. Probably the same rule of thumb recommended for standard dictation pauses can also be applied for partial dictation - namely, spell out the deleted portion twice in creating each respective pause. This method creates a pause length that is consistently related to the length of the deleted material (as consistent at least as the speller's timing in subvocally uttering the sequence of letters twice through).
In the samples that follow, the material which would be heard but which would not actually appear on the script placed in the hands of the examinees is in italics and enclosed in parentheses. The slash mark at the end of each italicized portion of the text indicates the location of a pause on the tape. Both samples were used by Johansson (1973) in his research with Swedish learners of English as a foreign language (at the college level). The first sample was created from a text by making a recording of it. The second sample originated as a radio program that was recorded and then subsequently transcribed for the purpose of the test.¹

¹ Johansson credits Stig Olsson, Lund University, with the selection of the materials and the creation of the text of the second sample.

(a) SAMPLE MATERIALS


Partial Dictation Sample 1
(Book Review: 'The Fetterman Massacre' by Dee Brown.)
The Fetterman Massacre is the story of Red Cloud's completely victorious ambush of 80 soldiers of the United States army in December 1866. Red Cloud, commanding armies of Sioux and Cheyenne, decoyed a troop of 80 men from Fort Phil Kearny on the Montana road. No one returned alive. The man who took the news of the massacre to the head of the telegraph near Fort Laramie was Portugee Phillips who, (at below-zero temperatures and through continuous blizzards)/, rode 236 miles in four days. Portugee Phillips survived. The horse died. Captain Brown and Colonel Fetterman, obviously by agreement, (shot each other to avoid capture)/. The Indians (had very few rifles)/, the United States soldiers (were reasonably well armed)/. They did not (run out of ammunition)/. Red Cloud's (victory was complete)/. It was, of course, (only temporary)/. The Montana road was opened again within ten years. (Westward pressure was inexorable)/. Dee Brown, as in Bury My Heart at Wounded Knee, (conveys the vast genocidal tragedy of the period)/. He is not (sentimental or sensational)/. He draws together (the pertinent facts with skill and clarity)/. In a sense he lets the facts tell the story, but to say this is (doing him less than justice)/. It is (his arrangement and presentation of the facts)/ that gives both books (their excitement and distinction)/. It is possible that some readers may think that Dee Brown (lacks the sweep and the eloquence)/ of the great American historian Francis Parkman but he has an eloquence of his own (rather more closely related to the literary taste of 1972)/, and I suspect that in the future it will be with writers like Parkman that Dee Brown (will be compared)/.

Partial Dictation Sample 2
(Book Review: 'The Scandaroon' by Henry Williamson.)
This book is about some very unusual relationships, not between men and women nor even between men and men, but between men and birds. The workpeople in the small town of Thirby in North Devon keep racing pigeons. That sport has made them deadly enemies of the peregrine falcons, which have lived and hunted along that stretch of the Devon coast for a thousand summers. The falcons swoop on the pigeons as they return home but Sam Baggott, keeper of the Black Horse Inn and a dedicated pigeon racer, has his own deterrent against 'they bloody hawks'. He (sends up decoy birds impregnated with strychnine)/, and dances for joy when the falcons tumble out of the sky, (having succumbed to his poisoned bait)/. The Scandaroon itself is a (rather unusual migrant from distant shores)/, something of a (brightly plumed cross between a pigeon and a crow)/. Its arrival in Thirby is greeted with considerable interest. Sam Baggott (lusts after it as a potential bait for the falcons)/. The local doctor regards it with great interest (as a natural history specimen)/. But the Scandaroon eventually becomes the property of (an elderly admiral's young son)/. That of course, is only the beginning of the story, and since it is rather a good one (I don't propose to repeat all of it here)/. In this short novel, set just after the First World War, Henry Williamson presents us with several contrasting views of the English countryside - that of (the dedicated naturalist)/, that of (the equally dedicated sportsman)/, and that of (the less passionately involved observer)/. Henry Williamson is not a sentimental nature writer. He knows the countryside and its people too well to lapse into the (casual urban dream of some rural Arcadia)/. He recognizes that (cruelty and beauty fuse together)/ when a peregrine falcon (swoops on a straggling pigeon)/. In his novels (he exhibits a telescopic eye for detail)/, and one feels that he cares more too. Our concern for the environment is essentially a social concern. It derives from the social problem (of providing ledger-space for the urban millions)/.

Before going on to discuss some of the details of the administration and scoring of partial dictations, a few words need to be said about the placement of blanks. Johansson (1973) says on the basis of his research with the procedure that it is important that blanks not be left in the middle of a sentence several words before the pause is inserted. For instance, items like the following should (according to Johansson) be avoided:
(1) Britain has the (fastest rising cost of living) of any country in the world./

He urges that if omission of material is to be made in the middle of a sentence, it should be preceded by a pause - otherwise, such items (where 'item' refers to each deleted portion in a text) prove to be excessively difficult. Perhaps the same hip-pocket principle used in the placement of breaks in standard dictation should also be followed in determining what to delete and what to leave in with partial dictation. However, it may be worth noting that by following Johansson's recommendation, the unacceptable item given above could theoretically be converted to an acceptable (but somewhat more difficult) item as follows:

(2) Britain has the (fastest rising cost of living of any country in the world)./

In general, therefore, it would seem that items for partial dictation can be made as difficult or easy (within certain ill-defined limits) as the examiner would like them to be. The most striking difference between either of the foregoing items and the following (3) is the length of the sequence between the beginning of the deletion and the pause:

(3) Britain has the fastest rising cost of living of (any country in the world)./

In (1) and (2) the number of words between the beginning of the deleted portion and the pause is 11 words in each case, whereas it is only 5 in (3). Hence, (1) and (2) ought to be somewhat more difficult than (3).
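Since the difficulty of an item depends in part on this distance, candidate items can be checked mechanically. The following minimal sketch (in Python) merely illustrates the word count just described; it assumes the parenthesis-and-slash notation of the examples above and is not part of Johansson's procedure.

    # A minimal sketch for checking how far the deleted portion of a
    # partial dictation item begins before the pause. The deletion is
    # marked with parentheses and the pause with a slash, as in the
    # example items above.

    def words_before_pause(item):
        """Count words from the start of the deletion '(' to the pause '/'."""
        span = item[item.index("("):item.index("/")]
        return len(span.replace("(", "").replace(")", "").split())

    item_1 = "Britain has the (fastest rising cost of living) of any country in the world/"
    item_3 = "Britain has the fastest rising cost of living of (any country in the world)/"

    print(words_before_pause(item_1))  # 11 - far ahead of the pause, so harder
    print(words_before_pause(item_3))  # 5 - closer to the pause, so easier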

(b) ADMINISTRATION PROCEDURES


In the research by Johansson, the recorded material was played only once. Subjects were instructed to listen to the tape and fill in the blanks in the text. They were told that a pause would occur at each slash mark (/). Examinees were also given 'a few minutes for final revision' (1973, p. 52) after the tape finished playing. Johansson used a language laboratory with separate headsets for each examinee, but a single tape playback should work equally well so long as the room acoustics are good. Live presentation is also possible with partial dictation and may be preferred by classroom teachers (as it usually is with standard versions of dictation).
The scoring methods employed with partial dictation are quite similar to those used with standard dictation. Johansson allowed one point for each correct word in the correct sequence. He did not subtract any points for words that were misspelled but clearly
recognizable as the correct form. His 'guiding principle' was to 'disregard errors which would not affect pronunciation, provided that the word is clearly recognizable and distinct from other words with a similar spelling' (1973, p. 15). Examples of incorrect spellings for which no points were subtracted included: strycnine (strychnine), poisened (poisoned), repeate (repeat), rifels (rifles), elequence (eloquence), sentimentel (sentimental) and frosen (frozen). Examples that illustrate genuine non-spelling errors include: strickney (strychnine), speciment (specimen), repit (repeat), exapit (exhibit), and causal (casual).
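The word-by-word scoring just described can be sketched as a small program. The following is only an illustration, not Johansson's own procedure: the table of tolerated misspellings is a hypothetical stand-in for his 'guiding principle', and the sequence matching uses a standard longest-common-subsequence count so that one point is awarded for each correct word in the correct order.

    # Sketch of verbatim scoring for (partial) dictation: one point per
    # correct word in the correct sequence. Misspellings that are "clearly
    # recognizable" are forgiven via a scorer-supplied table (hypothetical
    # entries below, echoing the examples in the text).

    ACCEPTED_MISSPELLINGS = {
        "strycnine": "strychnine",
        "poisened": "poisoned",
    }

    def normalize(word):
        word = word.lower().strip(".,;:!?'\"")
        return ACCEPTED_MISSPELLINGS.get(word, word)

    def verbatim_score(key_text, response_text):
        """Longest common subsequence of key words and response words:
        distortions, intrusions, and deletions all cost points."""
        key = [normalize(w) for w in key_text.split()]
        resp = [normalize(w) for w in response_text.split()]
        table = [[0] * (len(resp) + 1) for _ in range(len(key) + 1)]
        for i, kw in enumerate(key):
            for j, rw in enumerate(resp):
                table[i + 1][j + 1] = (table[i][j] + 1 if kw == rw
                                       else max(table[i][j + 1], table[i + 1][j]))
        return table[-1][-1]

    print(verbatim_score("having succumbed to his poisoned bait",
                         "having sucumbed to his poisened bait"))  # 5 of 6

Note that a score computed this way can never exceed the number of words in the key, which parallels the usual convention that no more errors are counted for a sequence than the number of words it contains.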

2. (iii) Elicited imitation. Perhaps the most widely used research technique with preliterate or barely literate children is elicited imitation. It has not always been a favored technique for language proficiency testing, however. Like dictation, it has seemed too simple to some of the experts. Some argued that it was not anything more than a test of very superficial levels of processing. The research with the technique, however, simply does not support this narrow interpretation. It certainly can be done in such a way as to challenge the short term memory of the examinees and to require a deeper level of processing.

The technique is called 'imitation' because it requires the examinees to repeat or 'imitate' utterances that are presented to them (usually the scoring system requires a verbatim repetition). Sometimes children or others imitate utterances spontaneously, but in the test situation the imitation is asked for by the adult or instructor who is giving the test - therefore, it is 'elicited imitation'. One of the most interesting applications of elicited imitation is in the determination of language variety dominance in children whose language varieties include forms other than the typical middle class standard varieties of English.

The two samples of discourse which have been used as elicited imitation tasks to distinguish levels of language variety proficiency (or dialect dominance) in children are from Politzer, Hoover, and Brown (1974). The first passage is the text of a story in a widely spoken variety of American English. It is about a man who was a slave and a great folk hero. The second sample of discourse is a more widely known story of another black hero whose exploits and courage are also told in a well-known ballad. The stories are probably about equally appropriate to the preliterate kindergarten and early primary grade children for whom they were intended.

(a) SAMPLE MATERIALS

High John the Conqueror
(from Politzer, Hoover, and Brown, 1974)
This here be a story. This story 'bout High John the Conqueror. High John might could be call a hero. High John he could go to all the farms. He could go where the black people was. This was because he was a preacher and a doctor. Couldn't none of the other black folk do that. High John was really smart. High John master wanted him to fight a black man. High John didn't really want to be fighting another slave like hisself. So High John he say to hisself: 'Master crazy. Why he want me to do that? I bet he be hoping we kill each other.' So High John he use his head to get out of fighting. He wait till the day of the fight. Peoples was coming from miles around. Black folk and white folk was there. Everybody get seated in they place. High John he walkeded [wɔktəd] up to the master daughter. Then he slap her. This take so much nerve that the other slave run away and refuse to fight. He refuse to fight anybody bad and nervy as High John.
John Henry
(from Politzer, Hoover, and Brown, 1974)
This is a story about John Henry. You have probably heard this story in school. John Henry could be called a hero. He was a worker on the railroad. John Henry was a leader. So he was always where the other workers were. None of the other workers knew as many people. John Henry's boss wanted John Henry to race a machine. At first John Henry didn't want to do it himself. John Henry says to himself: 'The boss is crazy. Why does he want me to do this? I'll bet he hopes I kill myself.' But he used his hammer anyway. He practiced till the day of the race. People were coming from miles around. Working folks and other folks were there. Everybody gets settled in his seat. John Henry picks up his hammer. Then he walks up to his boss and tells him he's ready to start. John Henry has so much strength that he hammers long hours till he beats the machine. He dies at the end, though.

It is probably apparent to the reader that the two passages are


rather deliberately similar. In fact, the parallelism is obviously contrived. John Henry muses to himself concerning his boss much the way High John does concerning the master. The intent of these two passages was to test certain contrasts between the surface forms of the language variety used in the first text against the corresponding forms used in the second text. In this sense, the two texts were contrived to

test discrete points of surface grammar. For all of the reasons already given in previous chapters, this use is not recommended. Furthermore, if the two passages are to be used to assess the dominance of a group of children in one language variety or the other, they should be carefully counterbalanced for an order effect. If all children were tested on one of the passages before the other, it is likely that performance on the second passage would be affected by previous experience with the first. Another possible problem for the intended application of the two sets of materials (that is, the application recommended by Politzer, et al) is that the John Henry story in all probability is more familiar to start with. It would not be improbable that the familiar story might in fact cause confusion in the minds of the children. It is a well known fact that similar stories such as the ones exemplified would be more apt to interfere with each other (become confused in the retelling for instance) than would dissimilar ones.

Why then use these texts as samples? The answer is simple. If the materials are used as tests of the ability to comprehend auditorily presented material (in either language variety) and if the imitation task is scored for comprehension, either set of materials will probably work rather well. However, attention to certain surface forms of grammar (e.g., whether the child says 'himself' or 'hisself' or whether he says 'might could' or simply 'could') should not be the point of a comprehension task. If the question is what forms the child prefers to use (that is, if the legitimate focus of the test is on surface grammar) then perhaps attention to such matters could be justified. In either case, the issues come to loggerheads in the scoring procedure. We will return to them there.

(b) ADMINISTRATION PROCEDURES


Politzer, Hoover, and Brown (1974) first recorded the texts. Then, the tape was played to one child at a time and the recording was stopped at each point where a criterion sentence was to be repeated. Not all of the sentences contained surface forms of interest. Since Politzer et al were interested only in whether the children could repeat the criterion surface forms, only the attempts at repeating those forms were transcribed by the examiner. If the form was correct, that is, if it corresponded perfectly with the form given on the tape, this was merely noted on the answer sheet. If not, the incorrect form was transcribed.

Another approach would be to use two tape recorders, one to play the tape and the other to record the child's responses. This approach was used in a study by Teitelbaum (1976) in an evaluation of a bilingual program in Albuquerque, New Mexico. If this is done the scoring of the child's responses should probably be done immediately after each session in order that the child's attempt be still fresh in the memory of the examiner when it comes time to score it. Furthermore, instead of scoring the attempts merely for some of the words, they could be scored for all of the attempted material. Alternative methods of scoring are taken up below.

In order to try to insure that the first performance did not interfere with the second, Politzer et al allowed a week to elapse between testings. However, all children were tested first on one test and then on the other (apparently, as the authors are not perfectly clear on this point). If there were an order effect, that is, if children automatically do better on the second passage because of having practiced the technique, it would tend to inflate scores on the second passage unrealistically. Therefore, in dominance testing it is recommended that half the children be tested first on one passage and half on the other, or that there be a warm up testing that does not count. If the first option is chosen this will tend to spread the practice effect evenly over both tests. If the second option is selected it will tend to eliminate the practice effect by making the children equally familiar with the testing technique before either of the actual tests occurs. The first option would be appropriate if one is merely interested in the dominance characteristics of a group of children while the latter would be more appropriate if one is interested in making individual decisions relative to the language-variety dominance of individual children.

The notion of allowing a time lapse between testings makes some sense, but there is no guarantee that the children will forget the first story during the time lapse. If in fact the first story is still fresh in their minds when the second one is encountered, they may very well experience some problems of comprehension. They may, for instance, expect the events of the story the second time around to correspond to the events on the first test. The best way to eliminate the problem thoroughly is to select materials that are less similar than the examples given above but which are known to be comparable in difficulty level according to other criteria - e.g., sentence complexity, vocabulary difficulty, and content.
This can be achieved in the following way. (1) Find two texts of
roughly equivalent difficulty and appropriateness for the test population. (2) Create versions of the texts in both languages (or language varieties) of the population to be tested (e.g., in English and in Spanish, or in the majority variety of English and in some non-majority variety of English). (3) Test each examinee on one of the texts in one language (or language variety) and on the other text in the other language (or language variety). (4) Test approximately equal numbers of students on each possible combination of text with language (or language variety). That is, suppose that L1 = language or language variety number one, and L2 = language or language variety number two. Further, suppose that TA = the first text and TB = the second text. No examinee would take the same text twice. By requirement (3) they would take L1TA and L2TB, or L2TA and L1TB. By requirement (4) about half would take the first combination of tests and one half would take the second combination. The language (or language variety) dominance of the population can then be estimated by comparing the mean scores of all subjects on L1 (over both texts) with the mean scores on L2 (again over both texts). If the texts are of equivalent difficulty (and this can be determined by comparing the mean of all scores on TA (over both languages) with the mean on TB (over both languages)), then the dominance score for each individual subject can be estimated.
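The arithmetic behind these comparisons is simple enough to sketch directly. The records, score values, and the difference-score formula below are hypothetical illustrations of the design just outlined, not a published scoring routine.

    # Hypothetical records of the counterbalanced design: each tuple is
    # (subject, text, language, score). Half the subjects took L1-TA with
    # L2-TB; the other half took L2-TA with L1-TB, per requirements (3)-(4).
    records = [
        ("s1", "TA", "L1", 42), ("s1", "TB", "L2", 31),
        ("s2", "TA", "L2", 28), ("s2", "TB", "L1", 40),
        ("s3", "TA", "L1", 38), ("s3", "TB", "L2", 25),
        ("s4", "TA", "L2", 30), ("s4", "TB", "L1", 44),
    ]

    def mean(xs):
        return sum(xs) / len(xs)

    # Group dominance: mean on L1 (over both texts) vs. mean on L2.
    l1_mean = mean([s for _, _, lang, s in records if lang == "L1"])
    l2_mean = mean([s for _, _, lang, s in records if lang == "L2"])

    # Equivalence check on the texts: mean on TA vs. mean on TB,
    # disregarding language.
    ta_mean = mean([s for _, text, _, s in records if text == "TA"])
    tb_mean = mean([s for _, text, _, s in records if text == "TB"])

    print(f"group dominance (L1 minus L2): {l1_mean - l2_mean:+.1f}")
    print(f"text difficulty gap (TA minus TB): {ta_mean - tb_mean:+.1f}")

    # A rough individual estimate: put TB scores on TA's scale using the
    # observed difficulty gap, then take the subject's L1 - L2 difference.
    def individual_dominance(subject):
        adjusted = {}
        for sid, text, lang, s in records:
            if sid == subject:
                adjusted[lang] = s if text == "TA" else s + (ta_mean - tb_mean)
        return adjusted["L1"] - adjusted["L2"]

    print(f"s1 dominance: {individual_dominance('s1'):+.1f}")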
In situations where multilingualism and language dominance estimates are not the main point of the testing, the problem is much simpler. The question is more apt to be, how well can these children understand this story and similar ones? Or, how well does a certain child perform in relation to his own previous performances or in relation to other children who use the same language? In such cases decisions concerning how to set up and administer the elicited imitation task probably allow as much flexibility as standard or partial dictation which we have already discussed above. Among the pertinent decisions are where to place the pauses and how long a burst of speech to expect the child to be able to repeat and how accurately.

(c) SCORING PROCEDURES


If the object of the testing is to assess the child's ability to comprehend and restate the meaning of each sentence in, for instance, one of the discourse chunks given above as samples, a score based on content rather than verbatim repetition would seem to make more sense. In all probability, scores based on content and scores based on verbatim

repetition will be highly correlated in any case.


In the task as devised by Politzer et al (1974) only certain sentences
were supposed to be repeated, and only certain words in those
sentences were scored. For instance, in the text about High John the
Conqueror, the first sentence, 'This here be a story,' was to be
repeated and the italicized words were scored. If the subject got all
three words in exactly the form they were presented originally, the
'item' was scored correct. The same procedure was followed for all
items (15 in all, many of them single words). Thus, scores might range
from 0 to 15. Clearly, this method is based on discrete point test
philosophy. It is appropriate only to find out whether the children in
question normally use the surface forms tested or (perhaps more
accurately) whether or not they can repeat them accurately when they
encounter them in sentences as part of a story. Note that this may
have little or nothing to do with their actual comprehension of the
story in question. In fact, the child might comprehend all of the story
and get none of the criterion forms 'correct' because he may
transform them all to the surface forms characteristic of his own
language variety. Alternatively, the child might get a reasonably high
score on the criterion forms and not understand the material very
well. This is because Politzer et al did not pay any attention in the
scoring to the rest of the sentence. It could be missing or distorted -
e.g., a form like 'This here be,' would get a score of 1 while a form like 'This is a story,' would get a score of 0.
If comprehension is the key thing, another scoring procedure
seems to be called for. It is possible to use a verbatim scoring method
that requires an exact repetition of the forms used in each sentence.
This scoring would correspond roughly to the technique illustrated
above with respect to standard dictation. By such a method, there
would be as many points possible as there are words in the text to be
repeated. For High John the Conqueror there would be 186 points
possible. Such a scoring would give a much wider range of variability
and a much better indication of the degree of overall comprehension
of the passage than the discrete point scoring with a maximum of only
15 points possible relating to only a small number of words in the
passage. It would also be possible to ask the children to repeat only
certain sentences in the text. In this case, the technique might be set up
like a partial dictation. In such a case, the verbatim scoring would be
applied just as it is in partial dictation.
Another scoring technique and one that would seem to make even
more sense in relation to the assessment of comprehension would be
to allow full credit for full restatements of the content of sentences even if the form of the restatement is changed. For example, if the child hears, 'This here be a story,' and he repeats, 'This is a story,' he could be allowed full credit - say 1 point for each word in the presented material, which would be a score of 5. The next sentence is 'This story 'bout High John the Conqueror.' If he repeats, 'This story is about High John the Conqueror,' or 'This here story be about High John the Conqueror,' in either case he would get full credit of 7 points. In fact, if the child said, 'It is about High John the Conqueror,' he should also receive full credit. But suppose he said, 'This story 'bout somebody name John,' or 'This story is about someone named John,' in either case the child should receive less than full credit. A practical system might be to count off one point for each of the additional words necessary to specify more fully the original meaning - e.g., High John the Conqueror, where the italicized portions are in a sense deleted from the child's response. Hence, the score would be 7 minus 3 or 4 for either rendition. By this scoring system, the child might be bidialectal in terms of auditory comprehension and might prefer to speak only one of the two dialects or language varieties represented. This scoring technique would give a measure of the child's level of comprehension of the story rather than his preference for one dialect or the other.
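The arithmetic of this more lenient content scoring can also be sketched in a few lines. The helper below is only an illustration of the '7 minus 3' example: judging which specifying words a restatement has lost is left to the scorer, so the lists passed in are hypothetical human judgements, not something the program could decide.

    # Content scoring sketch: full credit equals the number of words in the
    # presented sentence; one point comes off for each word (as judged by
    # the scorer) needed to restore the original meaning.

    def content_score(presented, missing_specifics):
        full_credit = len(presented.split())
        return max(full_credit - len(missing_specifics), 0)

    sentence = "This story 'bout High John the Conqueror"

    # 'This story is about someone named John' leaves out 'High', 'the',
    # and 'Conqueror': 7 - 3 = 4 points.
    print(content_score(sentence, ["High", "the", "Conqueror"]))  # 4

    # A formally different but complete restatement loses nothing: 7 points.
    print(content_score(sentence, []))  # 7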
It would of course be possible to obtain both a language preference score and a language comprehension score by scoring the same subject protocols by the content method just illustrated, and by the verbatim method illustrated above. The scores would probably be highly correlated for a group of children, but there would probably be some exceptional cases of children who differ markedly in their scores by the two methods. For instance, some children may be truly bidialectal both in comprehension and in production of the surface forms of both dialects. Others may be equally good at comprehending both dialects (or both languages in the case of say Spanish and English) but they may prefer to produce forms only in one variety (or only in one language).

3. Interpreting scores and protocols. We come now to the most important, and probably the most neglected, aspect of the testing procedures discussed in this chapter. What does a score mean for a given subject? What does an average score mean for a given group of subjects? What can or should be done differently in an educational program, or in a classroom, as suggested by the outcomes of the

testing? What can be learned from a study of individual learner


protocols, or groups of them?
Answers to the foregoing questions will vary significantly from population to population and depending on the purpose of the testing. This does not mean that meaningful generalizations are impossible, however. The principal purpose in any of the hypothesized testing situations referred to in this chapter is to find out how well a learner or group of learners can comprehend auditorily presented materials. In fact, it is to find out how completely they can process one or another form of discourse. Therefore, the interpretation of scores should relate to that basic underlying purpose.

In relation to the main question just defined, it is possible to look at the performance of a given learner (or group) from several possible angles, among them: (1) How does the learner do at time one in relation to time two, three, etc.? More succinctly, in relation to a classroom situation, does the learner show any real improvement across time? This question can be phrased with reference to language learning in the foreign language classroom, the ability to follow narrative prose about historical events in a history class, the capacity to comprehend complex mathematical formulas talked about in a statistics class, or whatever other educational task can be translated into a discourse processing task. Hence, the focus may be on improvement in language skill per se, or it may be on improvement in the ability to handle some subject matter that can be expressed in a linguistic form. (2) How does learner skill in the discourse processing task defined by the test tend to vary across learners? That is, how does learner A compare with learner B, C, D, and so on? Where does learner A fit on a developmental scale of language development (or any other sort of learning that is closely related to the ability to use language), and more specifically, where does learner A rank on the scale defined by test scores relative to other learners who were also tested? (3) What skills or kinds of knowledge does the learner exhibit or fail to exhibit in the protocols relative to a given discourse processing task? For instance, is a general problem of language proficiency indicated? E.g., the learner does not understand language x or variety x. Is a deficit of specific content indicated? E.g., the learner does not understand some concept or has not had some relevant prerequisite experience?
With respect to question (1) raised in the preceding paragraph, a
few comments are necessary on the possibility of improving on
dictation (or one of the other tasks derived from it) without a
corresponding improvement in language skill or in the knowledge base that language skill provides access to. Valette (1973) states that the repeated use of dictation (and presumably any closely related assessment technique) should not be recommended. She apparently believed, at that time, that learners who are exposed to repeated testing by dictation may improve only in their ability to take dictation and not in their knowledge of the language or anything else. However, Kim (1972) reported that the repeated use of dictation did not result in spurious gains. Kim tested students in ESL classes at UCLA repeatedly during the course of a ten week quarter with dictation and was somewhat disappointed that there was no apparent gain from the beginning of the quarter to the end. This would tend to suggest the happy result that dictation is relatively insensitive to a practice effect, and the sad result that all of the practice in English that ESL students were getting in the UCLA ESL classes used in the study was not helping them much if at all. Valette (1973) apparently based her contrary conclusion on a study with French as a foreign language (Valette, 1964). The question deserves more investigation, but it would seem on the strength of Kim's more extensive study and the larger population of subjects used in her work that dictation scores are not likely to improve without a concomitant improvement in language proficiency.
It is possible to learn a great deal from a systematic study of the protocols of a group of learners on a dictation task. The maximum learning comes only at considerable cost in time and effort, however. There is no easy way to do a thorough analysis of the errors in a dictation without spending hours closely examining learner protocols. Learners themselves, however, may benefit greatly from doing some of the work in studying their own errors. A classroom teacher, too, can readily form some general impressions about learner comprehension and rates of progress by occasionally scoring dictation tasks himself.
To illustrate the point briefly, consider what can be learned from only two samples of data - in particular, consider protocols 2 and 5 above on p. 283f. Both are protocols from the Chinese subject. We can see immediately from a thoughtful look at the learner's output that he is having a good deal of trouble with a number of phonological and structural elements of English. For instance, he hears Rogers as Lagerse, probably [lagərs].² Two problems are

2 See William A. Smalley, Manual of Articulatory Phonetics, 1961.



apparent here. First, the /l/-/r/ distinction is not clear to him. This is borne out later in the text on music where he writes setlements for sentiments. Apparently the flap pronunciation of the nt in [sɛɾ̃əmənts] is close enough to an l-t combination to cause difficulty in an unfamiliar phonological setting - namely, in the lexical item sentiments which he either does not know at all or does not know well. He also deletes the final /r/ on larger in the same text, and distorts derive to get devised. A second problem is the final voiced sibilant /z/ on Rogers. He is usually unsure of consonant clusters at the ends of words - witness parts for part, alway for always, controled for control, send for sent, consumptions for consumption, base for vast, intelligence for intelligent, finded for find it, mine for mind. He often devoices final voiced consonants as in brought for broad. He reveals a general failure to comprehend the sense of tense markers such as the -ed in relation to modals like could as in could controled instead of could control, also send for sent, and devised for derive. He is unsure of the meaning and use of determiners of all sorts - a very young instead of very young, military school south for a military school in the South, with new boy for with this new boy, is ability for is an ability. The use and form of prepositional phrases and intersentential relations of all sorts can also be shown to be weak in his developing grammatical system.

In addition, many other highly specific diagnoses are possible. Each can be checked and corroborated or revised on the basis of other texts and other discourse processing tasks. Nevertheless, it should be obvious here and throughout the subsequent chapters in Part Three that pragmatic tests are vastly richer sources of diagnostic data than their discrete point rivals. Further, as we will see especially in Chapter 13, specific workable remedies can be devised more readily in relation to texts than in relation to isolated bits and pieces of language.
KEY POINTS
1. Dictation and other pragmatic tasks are unbounded in number.
2. Examples given in this chapter merely illustrate a few of the procedures that have been used and which are known to work rather well.
3. Pragmatic tests require time constrained processing of meanings coded in discourse.
4. They don't need to single out a posited component of a skill or even one of the traditionally recognized skills.
5. Standard dictation, partial dictation, dictation with noise, dictation/composition, and elicited imitation are closely related techniques.
6. Dictation and the other procedures described are more similar to the
performances they purport to assess than are most educational tests.
7. Language tests like dictation meet three stringent construct validity criteria: (a) they satisfy the requirements of a theory; (b) they typically show strong positive correlations with tasks that meet the same theoretical requirements; (c) the errors that are generated by dictation procedures correspond closely to the kinds of errors learners make in real life language uses.
8. The choice of method and material in dictation testing depends primarily on the purpose to which the test is to be put.
9. Possible purposes for the auditory testing techniques discussed include determining the ability of children in a school to understand the language of instruction, placing foreign students in an appropriate course of study at the college level, and measuring levels of comprehension for discourse concerning specific subject matter.
10. Processing of discourse is a constructive and creative task which goes beyond the surface forms that are given.
11. Connected chunks of discourse are recommended for testing purposes because they display certain crucial properties of normal constraints on language use that cannot be expressed in disconnected sentences.
12. Factors known to affect the difficulty of a dictation task are: (a) the difficulty of the text itself; (b) speed of presentation; (c) length of bursts between pauses; (d) signal-to-noise ratio; (e) number of presentations; (f) dialect of speaker and of listener; and (g) others.
13. Setting a difficulty level for a task for a particular group of learners is a matter of subjective judgement - rigorous criteria for such a decision cannot be set (at least not at the present state of our knowledge).
14. It is the variance in test scores, however, rather than the difficulty of a particular task that is the principal thing.
15. In giving standard dictation, word sequences between pauses should probably be seven words in length or more, and the pauses should be inserted at natural break points.
16. One technique for setting the length of pauses is to spell, letter-by-letter, each word sequence sub-vocally twice before proceeding to the next word sequence in the text.
17. By some methods the material is read three times and the marks of punctuation are given during the second and possibly also the third reading. Neither of these points, however, is essential.
18. Standard dictation may be scored allowing one point for each correct word that appears in the correct sequence.
19. Errors include distortions of various sorts, intrusions of extraneous words into the text, and deletions. Usually, no more errors are counted for any given sequence of words than the number of words it contains. This prevents scores lower than zero.
20. Spelling errors which do not indicate difficulties in perception of distinct sounds of the language or which do not affect the lexical identity of a word should not be counted.
21. Spelling errors are probably not at all correlated with other types of errors in dictations or with language proficiency.
22. An example of a spelling error is sumpthing for something. An example of

an error that affects word meaning is write for right. An example of an


error that indicates a sound perception problem is somsing for something.
23. Partial dictation, a technique developed by Johansson in Sweden, combines dictation with cloze procedure.
24. The text of a passage is provided with some portions deleted. A complete
auditory version is presented and the examinee must fill in the missing
portions of the written version of the text.
25. Elicited imitation is similar to dictation and partial dictation except that
the response is oral instead of written. Therefore, it can be conveniently
used with preliterate children or non-literate adults.
26. Discrete point scoring procedures are not recommended, however, as
they do not necessarily reflect comprehension of the text.
27. Testing bilingual (or bidialectal) populations presents some special
problems. See Chapter 4 Part One. Special steps must be taken in order
to insure test equivalence across languages. (See discussion question 10
below.)
28. Two scoring procedures are suggested for elicited imitation: a word-by-
word (verbatim) scoring is suggested to determine which language or
language variety a child (or group) prefers to speak (and simultaneously
how well they can produce it); and a more lenient content scoring is
recommended to determine how well a child (or group) comprehends a
text in a given language (or language variety).
29. The meanings of scores and the interpretation of specific learner
protocols are points that require much individual attention depending
on the special circumstances of a given test situation and test population.
30. Questions related to rate of progress, degree of comprehension of specific
subject matter, and ability to process discourse in a particular language
or language variety can all be addressed by appropriate study of scores
and learner protocols on the types of tests discussed in this chapter.

DISCUSSION QUESTIONS
1. Consider the stream of speech. Take any utterance of any sequence of
words and ask where the word boundaries are represented in the speech
signal. How does the listener know where one word ends and another
begins?
2. Try recording an utterance on an electronic recording device while there
is a great deal of noise in the background. Listen to the tape. Can you
recognize the utterance? Listen to it repeatedly. What happens to the
sound of the speech signal? Does it remain constant? Play the tape for
some of your students, classmates, or friends and ask them to tell you (or
better, write down) what they hear on the tape. Read and study the
protocols. Ask in what sense the words in a dictation are given.
3. Why is time such an essential feature of a dictation, partial dictation,
elicited imitation, or other auditory discourse processing task? What are
the crucial facts about the way attention and memory work that make
time such an important factor?
4. Try listening to a fairly long burst of speech and repeating it. Can you do
it without comprehending what is said? Try the same task with a shorter burst of speech. Up to about how many words can you repeat without understanding? Can you do better if you do not have to give a word for word (verbatim) repetition? Why or why not?
5. Discuss spelling. To what extent does it seem to your mind to be a language based task? How is spelling a word like (or unlike) knowing how to use it or understand it in a speech context? Ask the same question for punctuation. Do you know anyone who is highly fluent in a language but who cannot spell? Do you know anyone who is a reasonably good speller but who does not seem to be very highly verbal?
6. What are some of the factors that must be taken into consideration in translating a text (or a chunk of discourse) from one language into another? What are some of the factors that make such translation difficult? Are some texts therefore more easily translated than others? Is highly accurate translation usually possible or is it the exception? Consider translating something like the life story of Alex Haley into a language other than English. What kinds of things would translate easily and what kinds would not? Or take a simpler problem: consider translating from English to Spanish the directions for getting from the gymnasium to the cafeteria on a certain college campus that you know well. What is the difference between the two translation problems? Can you relate them to the distinctions introduced in Part One above?
7. Use one of the testing procedures discussed in this chapter and analyze the protocols of the learners. Try the task yourself and reflect on the internal processes you are executing in order to perform the task. Try to develop a model of how the task must be done - that is, the minimal steps to be executed.
8. In relation to question 7, pick some other educational testing procedure and do the same for it. Then compare the discourse processing task with the other procedure. How are they similar and how are they different?
9. Analyze a dictation protocol of some learner studying a foreign language. Compare your analysis with the results of a discrete point test.
10. Perform a language dominance study along the following lines: Select two texts (TA and TB), say TA is in language (or language variety) one (L1) and TB is in L2. Two more texts are created by carefully translating TA into L2 and TB into L1. The four texts are then used as the basis for four tests. Each learner is tested on two of the four, either TA in L1 and TB in L2, or TB in L1 and TA in L2. Approximately equal numbers of subjects are tested on the two pairs of tests. The success of the equating procedure for the two texts can be roughly determined by averaging all scores on TA regardless of the language of the test, and all scores on TB also disregarding the language of the test. The relative proficiency of the group in L1 as compared against L2 can be determined roughly by similar averages over both tests in L1 and both tests in L2 (that is, disregarding the text). The relative proficiency of a given subject in L1 and L2 may be estimated by examining his score on the two texts and taking into account the relative difficulty of the texts. Relate the difference scores in L1 and L2 to the dominance scale suggested above in Chapter 4.

SUGGESTED READINGS
1. H. Gradman and B. Spolsky, 'Reduced Redundancy Testing: A Progress Report.' In R. Jones and B. Spolsky (eds.) Testing Language Proficiency. Arlington, Va.: Center for Applied Linguistics, 1975, 59-70.
2. Stig Johansson, 'An Evaluation of the Noise Test: A Method for Testing Overall Second Language Proficiency by Perception under Masking Noise,' International Review of Applied Linguistics 11, 1973, 109-133.
3. Stig Johansson, Partial Dictation as a Test of Foreign Language Proficiency. Lund, Sweden: Department of English, Contrastive Studies Report No. 3, 1973.
4. Diana S. Natalicio and Frederick Williams, Repetition as an Oral Language Assessment Technique. Austin, Texas: Center for Communication Research, University of Texas, 1971.
5. B. Spolsky, Bengt Sigurd, M. Sato, E. Walker, and C. Arterburn, 'Preliminary Studies in the Development of Techniques for Testing Overall Second Language Proficiency,' in J. A. Upshur and J. Fata (eds.) Problems in Foreign Language Testing. Language Learning, Special Issue No. 3, 1968, 79-101.
6. Thomas A. Stump, 'Cloze and Dictation as Predictors of Intelligence and Achievement Scores,' in J. W. Oller, Jr. and Kyle Perkins (eds.) Language in Education: Testing the Tests. Rowley, Mass.: Newbury House, 1978.
7. R. M. Valette, 'Use of Dictée in the French Language Classroom,' Modern Language Journal 49, 1964, 431-434.
8. S. F. Whitaker, 'What is the Status of Dictation?' Audio-Visual Language Journal 14, 1976, 87-93.
11

Tests of Productive Oral Communication

A. Prerequisites for pragmatic speaking tests
   1. Examples of pragmatic speaking tasks
   2. The special need for oral language tests
B. The Bilingual Syntax Measure
C. The Ilyin Oral Interview and the Upshur Oral Communication Test
D. The Foreign Service Institute Oral Interview
E. Other pragmatic speaking tasks
   1. Reading aloud
   2. Oral cloze procedure
   3. Narrative tasks
While it is not claimed anywhere in this book that speaking and listening tasks are based on independent skills (nor that reading and writing tasks are), in this chapter we focus attention on tasks that require the production of utterances in overt response to discourse contexts. The issue, however, is the outward manifestation of discourse processing in the form of speech. We avoid hypothesizing the existence of a special 'speaking' skill as distinct from language ability in general. In the preceding chapter the focus was on listening tasks, but the principal examples involved reading and writing as well as listening and speaking. In later chapters we will focus attention on reading (see Chapter 12) and writing (see Chapter 13) as distinct forms of discourse processing. However, we continue to work from the premise that all of the traditionally recognized language skills are based on the same fundamental sort of language competence or expectancy grammar.

A. Prerequisites for pragmatic speaking tests

People usually talk both because they have something to say and because they want to be heard. Sometimes, however, they talk just because they want someone to listen. The child who talks his way through a stone stacking and dirt piling project may prefer no one to be listening. An adult doing the same sort of thing is said to be thinking out loud. There is a tale about a certain famed linguist who once became so engrossed in a syntactic problem at a Linguistic Society meeting that he forgot he had an audience. He had turned to the blackboard and had begun to mutter incoherently until the noise of the would-be audience rose to a level that disturbed his concentration. These are examples of talk that arises from having something to say but without much concern for any audience besides one's self.
It is also possible to have nothing much to say and nevertheless to want someone to listen, or to feel compelled to speak just because someone appears to be listening. Lovers tell each other that they are in love. The words do not carry much new cognitive information, but who is to say that they do not convey meaning? The lecturer who is accused of saying nothing in many different ways is apparently filling some ill-defined need. Perhaps he is escaping the uneasy feeling that comes when the air waves are still and all those students are sitting there expectantly as if something should be said. The audience's presence goads the speaker to talk even if he has nothing to say.

But these are unusual cases. In the normal situation, speech is motivated by someone's having something to say to someone else. Even the muttering and incoherent linguist had something to say, and he had an audience at all times consisting of at least himself. Lovers certainly feel that they are saying something when they tell each other well known truths and trivia called 'sweet nothings', and the uneasy lecturer who blithers on saying nothing in many ways also feels that he is saying something even though his audience may not know or care what it might be. Hence, even in these special cases of phatic communion or verbalization of thoughts, there is something to say and someone to say it to.

What is to be said and the person to whom it is said are among the principal things in communicative acts. Words and sequences of them find their reason for being (regardless of whether they are eloquent or drab) in contexts where people express meanings to one another. Song and poetry may be special cases, but they are not excluded.
For all of these reasons, in testing the ability of a person to speak a language, it is essential that the test involve something to say and someone to say it to. It is possible to imagine what one would say if one wanted to convey a certain meaning, but it is easier to find the right words when they are called for in the stream of experience. 'Take care of the sense,' the Duchess said, 'and the sounds will take care of themselves' (quoth Miller and Johnson-Laird, 1976, p. v). And, she may well have added, 'Forget the sense or ignore the context and the sounds will scarcely come out at all.' If this were not so, anyone could perform any part in any play without having to memorize any lines.

Therefore, the more contrived the task and the more it taxes the imagination and conjuring powers of the examinees, the less apt it is to be an effective test of their ability to perform in appropriate speech acts. For these pragmatic reasons, we seek testing procedures that provide the crucial props of something to say and someone to say it to, or at least that faithfully reflect situations in which such factors are present. Put somewhat more narrowly, we require testing techniques that will afford opportunities for examinees to display their ability to string sequences of elements in a stream of speech in appropriate correspondence with extralinguistic context. In short, we need tests that meet the pragmatic naturalness criteria.

1. Examples of pragmatic speaking tasks. Along with a modified scoring of the Bilingual Syntax Measure we consider the Ilyin Oral Interview and the Upshur Oral Communication Test. It is argued that the scoring technique for such tests should relate to the totality of the discourse level meanings and not exclusively to discrete points of morphology or syntax. Interview procedures, it is suggested, constitute special cases of conversations that are examiner directed. Attention is often focussed on picture-based contexts interpreted jointly by the examiner and the examinee. By their very nature, such conversational episodes usually involve listening comprehension (the converse is not necessarily true, however). In effect, the examiner confronts the examinee with verbal problem-solving tasks that require the pragmatic mapping of utterances to context and the reverse. As we will see below, the scoring of interviews is usually analogous to scoring elicited imitation tasks for content rather than surface form.

Less structured speaking tests can be conducted along the lines of the Foreign Service Institute Oral Interview. The FSI approach is discussed in some detail because it is probably the best known and

most widely researched technique for testing oral language skills of adult second language learners. Though the extension of such oral language tests to native speakers has not been discussed widely, it already exists in certain individual measures of 'IQ' such as the Wechsler Intelligence Scale for Children. In any case, classroom teachers and other educators can learn much about assessing day-to-day conversational exchanges and educational tasks from a careful consideration of interview procedures such as the FSI technique. (See also the Appendix for a consideration of how the FSI Oral Interview relates to other testing techniques.)

It is fairly obvious why conversational techniques such as structured and unstructured oral interviews constitute pragmatic speaking tasks (for the examiner at least if not always for the examinee), but it is less obvious how certain other pragmatic tasks can qualify. For instance, can reading aloud be construed as a pragmatic language task? What about an oral fill-in-the-blank test (i.e., an oral cloze test)? What about a narrative repetition or retelling task (e.g., the spoken analogue of dictation/composition)? Or how about a creative construction task such as inventing a story on the spot? All of these have been tried though not all seem equally promising for reasons that relate more or less directly to the pragmatic naturalness criteria, and to technical difficulties of quantification or scoring.

2. The special need for oral language tests. Interestingly, speech is the most manifest of language abilities. If a person cannot write we do not consider him as not having language, but speaking ability is more fundamental. We are apt to say that a person who cannot fluently produce a language does not know or has not fully learned the language. One who has thoroughly mastered the spoken form of a language, on the other hand, is said to know it in some fundamental sense independent of whether or not he can read or write it. Indeed, in many cases many thousands of speakers of at least many hundreds of languages can neither read nor write them because the languages in question have never been transcribed and systematized orthographically. They are languages nonetheless.
Furthermore, speech is important in another sense. A person may indicate comprehension and involvement in human discourse by merely appearing bright-eyed and interested, but these evidences of comprehension are generally quite subordinate to speech acts where the same person puts into words the evidence of comprehension and participation. In fact, comprehension is not merely understanding
someone else's words, it is profoundly more. It is carrying thoughts to a deeper level and expressing meanings that were not stated. It involves reaching beyond the given in the personalization of meanings. As John Dewey (1929) pointed out, even deductive reasoning typically involves discovery in the sense of going beyond what is known. Speech is the principal device for displaying human knowledge through discourse in the present tense and it is simultaneously a mutually engaging method of intelligent discovery of that knowledge.

This is not to say that other forms of human discourse are not important and effective methods for displaying and participating in intelligent activity, but it is to say that in an important sense, speech is the method par excellence of having on-going intelligent interaction with other human beings. Further, speech is the common denominator that makes written and other symbolic systems intelligible to normal human beings. Signing systems of the deaf are special cases, but they do not disprove the role of speech in normals, they merely demonstrate the severity of the lack in persons who are deprived of the blessings of speech that hearing persons are able to enjoy.
The special importance of speech is manifested in the increasing concern among educators and legislators for the language development of children in the schools. Cazden, Bond, Epstein, Matz, and Savignon (1976) report that any California school receiving state money for Early Childhood Education must 'not only include an oral language component but also evaluate it' (p. 1). The same authors view with some trepidation the fact that 'evaluation instruments become an implicit in-service curriculum for teachers, an internalized framework that influences the mini-tests that teachers continuously construct in the classroom as they take children's words as indicators of what they have learned' (p. 1). It is therefore imperative that the testing techniques used in the schools be as effective and humane as can be devised.

A recent monograph published by the Northwest Regional Educational Laboratory evaluates 24 different tests that purport to assess oral language skills (Silverman, Noa, Russell, and Molina, 1976). It is disappointing to note that none of the fourteen measures evaluated among those listed as commercially available was rated above 'fair' (on a three point scale ranging from 'poor' to 'good') in terms of either 'validity' or 'technical excellence'.

It seems clear that there is a serious need for better oral tests. It is

the intent of this chapter to offer some techniques that can be adapted to specific classroom or other educational needs to help fill the gap. Here, more than elsewhere in the book, we are forced into largely unresearched territory. For ways of checking the adequacy of techniques that may be adapted from suggestions given here, see the Discussion Questions at the end of this chapter. (Also see references to recent research in the Appendix.)

In addition to the growing need for effective oral language assessment in relation to early childhood education and especially multilingual education, there is the constant need for assessment of oral language skills in foreign language classrooms and across the whole spectrum of educational endeavors. The oral skills of teachers are probably no less important than those of the children, not to mention those of the parents, administrators, and others in the larger school community.

B. The Bilingual Syntax Measure


Among bilingual measures, the colorful cartoon-styled booklet
bearing the label, the Bilingual Syntax Measure, has already become a pace setter. It has several appealing qualities. Its cartoon drawings were probably inspired by the Sesame Street type of educational innovation and they are naturally motivating to preschoolers and children in the early grades. Compare for instance the picture shown as Figure 1 in Chapter 3, p. 48 with its drab counterpart from the James Language Dominance Test shown as Figure 14 or the equally unmotivating pictures from the New York City Language Assessment Battery, Listening and Speaking Subtest shown as Figure 15. It is not just the color that differentiates the test illustrations as bases for eliciting speech. In both the James and the New York City test, the intent of the pictures displayed is to elicit one-word names of objects. There is something unnatural about telling an experimenter or examiner the name of an object when it is obvious to the child that the examiner already knows the correct answer. By contrast the question 'How come he's so skinny?' (while the examiner is pointing to the skinny man in Figure 1, p. 48 above) requires an inference that a child is usually elated to be able to make. The question has pragmatic point. It makes sense in relation to a context of experience that the child can relate to, whereas questions like, 'What's this a picture of?' or 'Point to the television,' have considerably less pragmatic appeal. They do not elicit speech in relation to any meaningful context of discourse

Figure 14. Pictures from the James Language Dominance Test.

Figure 15. Pictures from the New York City Language Assessment Battery, Listening and Speaking Subtest.

where facts are known to have antecedents and consequences. Both


types of questions require relating words to objects or object
situations, but only the sort of question asked in the BSM requires the

pragmatic mapping of utterance onto an implied context of


discourse. Further, it requi res the stringing together ofsequences and
subsequences of mea nin gful elements in the tested language.
The questions asked in relation to the series of pictures tha t comes
at the end of the BSM suggest possihilities for elicitation of
meaningful speech in a connected context of discourse where a
chronological and cause-effect type of relation ship is ohtained
between a series of evenLS. In Figure 16, w here all three oflhe pertinent
pictures are displayed, in the firs t picture the King is ab out to take a
bite out of a drumstick. In the same picture the little dog to his left is
eyeing the fowl hungrily. In the next picture, while the King tu rns to
take some fruit offa platter the dog makes off with the bird. In picture
three the King drops the fruit and with eyes wide is agape over the
missing meat. Sly Mr Dog with winking eye .is licking ye olde chops.
The story has point. It is rich in linguistic potential for eliciting speech
from c hil dren. It has a starting point and a picture punch line. If the
chil d is willing to play the examiner's ga me, the pictured events
provide an interestin g context fo r discourse.
The relative complexity of pragmatic mappings of utterances onto
context that can be achieved is suggested by paraphrases of the
questions asked in the BSM. Tn order to protect the security of the
test, only questions 5, 6, and 7 below are given in the exact form in
which they appear in the BSM .
(I) The examiner points to the first picture in the sequence (picture 5)
and asks the child to point out theKing.
(2) Then the child is asked to point to the dog in the second picture
(picture 6).
(3) Next, the last picture (picture 7) is indica ted and again the child is
asked to point out the Kin g.
(4) The firs t scored question in relation to these pictures asks why the
dog is looking at the Kin g (picture 5).
(5) 'What happened to the King's food?' (Picture 3.)
(6) 'What would have happened if the dog had n't eaten lhe food ?'
(No pa rticul ar picture indicated.)
(7) 'What hap pened to that apple ?' Examiner points to the third
pic ture.
(8) Finally, the child is asked why the apple fel l.
Beca use of the discrete-point theory that Bu rt, Dulay, and
Hern andez (1975, 1976) we re wo rking from, they recommend scorin g
only questions 5, 6, and 7. Further, they are act ually concerned with
the past irregular form s ('ate' in quest ion 5, 'fell' in question 7) and
the perfect conditional ('would have' in question 6, see Burt et al 1976, p. 27, Table 9). By their scoring method, if the child uses those

Figure 16. Pictures 5, 6, and 7 reproduced from the Bilingual Syntax Measure by permission, copyright © 1975 by Harcourt, Brace, Jovanovich, Inc. All rights reserved.
about the BSM is that similar picture tasks that depict events with
meaningful sequence constraints could easily be adapted to similar
testing techniques by a creative teacher. Many tasks could be devised.
The example selected for discussion here is therefore offered only as
an illustration of how one might begin.

C. The Ilyin Oral Interview and the Upshur Oral Communication Test


Research with techniques like the BSM for assessing the relative
ability of children in two or more languages has probably been less
extensive than work with techniques for second language oral
assessment, but the latter are also relatively unresearched in
comparison to techniques that rely mainly on response modes other
than speaking. Among the recently developed structured interview
techniques are the Ilyin Oral Interview (1976) developed by Donna
Ilyin in San Francisco and the Oral Communication Test developed at
the English Language Institute of the University of Michigan chiefly
under the direction of John A. Upshur. The Ilyin interview is the
more typical and the more pragmatically oriented of the two so we
will consider it first.
A page from the student's test booklet used during the interview is
displayed as Figure 17. The pictures attempt to summarize several
days' activities in terms of major sequences of events on those days.
For instance, the first sequence of events pictured across the top of the
page was supposed to have occurred 'Last Sunday' as indicated at the
left. The first pictured event was supposed to have taken place at 9:55
in the morning as indicated by the clock under the picture (the man
getting out of bed); the second at 10:25 (the man eating breakfast);
and the third between 11:00 am and 5:00 pm (off to the beach with a
female friend). Ilyin explains in the Manual (1976) that the examinee
is to relate to the pictures in terms of whatever day the test is actually
administered on. Therefore, she has constructed a form to be
administered on any weekday except Friday and another to be given
on any weekday except Monday (this is so the referents of 'Yesterday'
and 'Tomorrow' will fall on weekdays or weekends as desired). The
two forms may be used separately in test-retest studies or together to
obtain higher reliability. Each consists of a 50 item version
recommended for lower level students and a 30 item version which
may be used at intermediate and advanced levels.
The set of pictures given in Figure 17 are actually used in orienting
the examinee to the overall procedure. They will serve the purpose

Figure 17. Sample pictures for the Orientation Section of the Ilyin Oral
Interview (1976).

here, however, of illustrating the testing technique. The point is not to
recommend the particular procedures of Ilyin's test, but rather to
show how the technique could be adapted easily to a wide variety of
approaches.
Once it is clear that the examinee knows how the procedure works
and understands the meanings of the time slots referred to by the
separate picture sequences as well as the picture sequence on any
particular day, a number of meaningful questions can be posed. For
instance, the examinee may be asked to tell what the fictitious person
in the pictures is doing today at the approximate time of the
examination. For instance, 'What is the man in the picture doing
right now? It's about 10:00 am.' An appropriate response might be:
'He's in class taking notes while the professor is writing on the
blackboard.' From there, more complex questions can be posed by
either looking forward in time to what the pictured person, say, 'Bill'
(the name offered by Ilyin), is going to do, or what he has already
done. For instance, we might ask:
(1) What was Bill doing at 7:15 this morning?
(2) Where is he going to be at lunch time?
(3) Where was he last Sunday at 7:45 in the morning?
and so forth. The number and complexity of the questions that can be
reasonably posed in relation to such simple contexts is surprisingly
large. Neither is it necessary that every single event that is to be asked
about also be pictured. What is necessary is that the range of
appropriate responses is adequately limited so as to make scoring
feasible and reliable.
It is possible to follow a strategy in the construction of such tasks of
working outward from the present moment either toward the past or
toward the future. Or one might follow the strategy of chronologically
ordered questions that generate something like a narrative with
past to present to future events guiding the development. It is also
possible and tempting for the discrete point minded person
deliberately to plan the elicitation of certain structures and linguistic
forms, but to opt for such an organizational motive for a sequence of
questions is likely to obliterate the sense of normal discourse
constraints unless the discrete points of structure are planned very
cleverly into the sequence.
If, on the other hand, one opts for one of the more pragmatic
strategies, say of merely following the chronology of events in the
series, asking simpler questions at the beginning and more complex
ones later in the series, it is likely that the discrete points of syntax one
might have tried to elicit more deliberately will naturally fall out in
the pragmatically motivated sequence. An example of a relatively
simple question would be: 'Where is Bill now? It's 10:00.' A more
complex question would be: 'What was Bill doing yesterday at
10:25?' A still more complex question would be: 'What would Bill be
doing today, right now, if it were Sunday?' Or even more complex:
'What would Bill have been doing Sunday at 7:45 if he had thought it
was a regular week-day?'
Oral tests built around actual or contrived contexts of discourse
afford essentially unlimited possibilities. The examples given above in
relation to the Ilyin Oral Interview merely display some of the options.
Clearly the technique could be modified in any number of ways to
create more difficult or less difficult tasks. For instance, the procedure
could easily be converted into a story retelling task if E were to tell the
examinee just what is happening in each pictured event (say, starting
from past and working up to future events). Then the examinee might
be asked to retell the story, preferably not to the same E, but to
someone else. To make the task simpler, the second person might
prompt the examinee concerning the pictures by asking appropriate
leading questions. Or, to make the task simpler still, it could easily be
converted to an elicited imitation task.
Another testing technique that has been used with non-native
speakers of English is the Oral Communication Test developed by
John A. Upshur at the University of Michigan. It is a highly
structured oral interview task where the examinee is tested on his
ability to convey certain kinds of information to the examiner. In the
development of the test, Upshur (1969a) defined 'productive
communication' as a process whereby 'as a result of some action by a
speaker (or writer) his audience creates a new concept' (p. 179). The
criterion of successful communication was that there must be 'a
correspondence between the intentions of a speaker and the concept
created by his audience' (p. 179).
Upshur and his collaborators set up a test to assess productive
communication ability as follows: (1) examinee and examiner are
presented with four pictures which differ in certain crucial respects on
one or more 'conceptual dimensions'. (Figure 18 displays three sets of
four pictures each.) (2) The examinee is told which of the four he is to
describe to the examiner. They are separated from each other so that
the examiner cannot see which of the four pictures has been
designated as the criterion for the examinee. Further, the pictures are
in different orders in the examiner's set; otherwise the examinee could
Figure 18. Items from the Upshur Oral Communication Test.

merely say, 'The one to the far left,' or 'The second one from the left,'
and so forth. (3) The examinee tells the examiner which picture to
mark and the examiner makes a guess. (4) The number of hits, that is,
correct guesses by the examiner, is the score of the examinee.
Interestingly, Upshur found in experimental uses of this testing
technique that time was a crucial factor. If examinees were given
unlimited amounts of time there was little difference in performance
between speakers who were more proficient and speakers who were
less proficient as judged by other assessment techniques. However, if
criterion. The technique would, however, afford more realistic testing
of the kinds of discourse constraints that emerge in normal
conversation. Further, it would be possible to consider additional
scoring techniques that discriminate responses of the examinee in
greater detail.
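
The hit-count scoring just described is simple to state precisely. The
sketch below (in Python; the function name and sample data are ours,
purely for illustration) computes an examinee's score as the number of
items on which the examiner guessed the designated picture:

    # Each item presents four pictures; 'designated' holds the index of the
    # picture the examinee had to describe, 'guessed' the examiner's guess.
    def score_hits(designated, guessed):
        return sum(d == g for d, g in zip(designated, guessed))

    print(score_hits([2, 0, 3, 1], [2, 1, 3, 1]))  # 3 of 4 items communicated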

D. The Foreign Service Institute Oral Interview


Perhaps the most widely used oral testing procedure is one developed
by the Foreign Service Institute. It involves, usually, two interviewers
and a candidate in a room set aside for the purpose of the interview.
Sessions are tape recorded for future reference and to establish a
permanent record for possible validity studies and other purposes.
The main objective of the oral interview is to determine the level of
'speaking' proficiency of candidates on a five point scale defined
roughly as follows:
(1) Able to satisfy routine travel needs and minimum courtesy
requirements. Can ask and answer questions on topics very
familiar to him; within the scope of his very limited
language experience can understand simple questions and
statements, ...
(2) Able to satisfy routine social demands and limited work
requirements. Can handle with confidence but not with
facility most social situations including introductions and
casual conversations about current events, as well as work,
family, and autobiographical information.
(3) Able to speak the language with sufficient structural accuracy
and vocabulary to participate effectively in most formal and
informal conversations on practical, social, and professional
topics. Can discuss particular interests and special fields of
competence with reasonable ease; comprehension is quite
complete for a normal rate of speech; vocabulary is broad
enough that he rarely has to grope for a word; accent may
be obviously foreign; control of grammar good; errors
never interfere with understanding and rarely disturb the
native speaker.
(4) Able to use the language fluently and accurately on all levels
normally pertinent to professional needs. Can understand
and participate in any conversation within his range of
experience with a high degree of fluency and precision of
vocabulary; would rarely be taken for a native speaker, but
can respond appropriately even in unfamiliar situations;
errors of pronunciation and grammar quite rare; can
handle informal interpreting from and into the language.
(5) Speaking proficiency equivalent to that of an educated native
speaker. Has complete fluency in the language such that his
speech on all levels is fully accepted by educated native
speakers in all of its features, including breadth of
vocabulary and idiom, colloquialisms, and pertinent
cultural references (ETS, 1970, pp. 10-11).¹

The interview normally does not take more than about thirty
minutes and except in cases where subjects simply are not able to
carry on a conversation in the language it usually takes at least fifteen
minutes. While the Manual for Peace Corps Language Testers
prepared by ETS in 1970 stresses that the conversation should be
natural, it also emphasizes the point that 'it is not simply a friendly
conversation on whatever topics come to mind ... It is rather, a
specialized procedure which efficiently uses the relatively brief testing
period to explore many different aspects of the student's language
competence in order to place him into one of the categories described'
(ETS, 1970, p. 11).
The influence of discrete point theory is not lacking in the FSI
procedure. Candidates are rated separately on scales that pertain to
accent, grammar, vocabulary, fluency, and comprehension. In the
Appendix we will show that these separate ratings apparently do not
contribute different types of variance and that in fact they appear to
add little to what could be obtained by simply assigning an overall
rating of oral language proficiency (also see Callaway, in press, and
Mullen, in press b). Nevertheless, in order to provide a fairly
comprehensive description of the FSI procedure, we will look at the
five separate scales that FSI uses 'to supplement the overall rating'.
(As we will see below, the ratings on these scales are weighted
differentially. We will see that the grammar scale receives the heaviest
weighting, followed by vocabulary, then comprehension, then
fluency, then accent which receives the lowest weighting.)
Accent
1. Pronunciation frequently unintelligible.
2. Frequent gross errors and a very heavy accent make
understanding difficult, require frequent repetition.
3. 'Foreign accent' requires concentrated listening and
mispronunciations lead to occasional misunderstanding and
apparent errors in grammar or vocabulary.
4. Marked 'foreign accent' and occasional mispronunciations
which do not interfere with understanding.
5. No conspicuous mispronunciations, but would not be taken
for a native speaker.
6. Native pronunciation, with no trace of 'foreign accent'.

¹ This quote and subsequent materials from the same publication (Manual for Peace
Corps Language Testers, 1970) is reprinted by permission of Educational Testing
Service.

Grammar
1. Grammar almost entirely inaccurate except in stock
phrases.
2. Constant errors showing control of very few major patterns
and frequently preventing communication.
3. Frequent errors showing some major patterns uncontrolled
and causing occasional irritation and misunderstanding.
4. Occasional errors showing imperfect control of some
patterns but no weakness that causes misunderstanding.
5. Few errors, with no patterns of failure.
6. No more than two errors during the interview.

Vocabulary
1. Vocabulary inadequate for even the simplest conversation.
2. Vocabulary limited to basic personal and survival areas
(time, food, transportation, family, etc.).
3. Choice of words sometimes inaccurate, limitations of
vocabulary prevent discussion of some common
professional and social topics.
4. Professional vocabulary adequate to discuss special
interests; general vocabulary permits discussion of any non-
technical subject with some circumlocutions.
5. Professional vocabulary broad and precise; general vocab-
ulary adequate to cope with complex practical problems
and varied social situations.
6. Vocabulary apparently as accurate and extensive as that of
an educated native speaker.

Fluency
1. Speech is so halting and fragmentary that conversation is
virtually impossible.
2. Speech is very slow and uneven except for short or routine
sentences.
3. Speech is frequently hesitant and jerky; sentences may be
left uncompleted.
4. Speech is occasionally hesitant, with some unevenness
caused by rephrasing and groping for words.
5. Speech is effortless and smooth, but perceptibly non-native
in speed and evenness.
6. Speech on all professional and general topics as effortless
and smooth as a native speaker's.
Comprehension
1. Understands too little for the simplest type of conversation.
2. Understands only slow, very simple speech on common
social and touristic topics; requires constant repetition and
rephrasing.
3. Understands careful, somewhat simplified speech directed
to him, with considerable repetition and rephrasing.
4. Understands quite well normal educated speech directed to
him, but requires occasional repetition and rephrasing.
5. Understands everything in normal educated conversation
except for very colloquial or low-frequency items, or
exceptionally rapid or slurred speech.
6. Understands everything in both formal and colloquial
speech to be expected of an educated native speaker (ETS,
1970, pp. 20-22).
Although the Manual seems to insist that the above verbal
descriptions of points on the various scales are merely 'sup-
plementary' in nature, there is a table for converting scores on the
various scales to a composite total score which can then be converted
to a rating on the overall five levels given above. The conversion table
is given below:

WEIGHTING TABLE

Proficiency Description    1    2    3    4    5    6
Accent                     0    1    2    2    3    4
Grammar                    6   12   18   24   30   36
Vocabulary                 4    8   12   16   20   24
Fluency                    2    4    6    8   10   12
Comprehension              4    8   12   15   19   23

Total: ____

CONVERSION TABLE

Total Score (from Weighting Table)    FSI Level
16-25                                 0+
26-32                                 1
33-42                                 1+
43-52                                 2
53-62                                 2+
63-72                                 3
73-82                                 3+
83-92                                 4
93-99                                 4+
For example, suppose a given candidate is interviewed and it is
decided that he rates a 2 on the scale for Accent. According to the
verbal description this means that the candidate makes 'frequent
gross errors', has 'a very heavy accent', and requires 'frequent
repetition'. From the Weighting Table the tester will determine that a
rating of 2 on the Accent scale is worth 1 point toward the total
overall rating and the eventual determination of the overall level on
the FSI rating system. Suppose further that the examinee is rated 3 on
the Grammar scale ('Frequent errors showing some major patterns
uncontrolled and causing occasional irritation and misunderstand-
ing'). By the Weighting Table this score is worth 18 points toward the
total. Say then that the same examinee is rated 3 on Vocabulary for
an additional 12 points; 3 on Fluency for 6 points; and 3 on
Comprehension for 12 points. The examinee's total score would thus
be 1 + 18 + 12 + 6 + 12 = 49. This score according to the Conversion
Table would rank the candidate at level 2. That is, the candidate
would be judged to be 'Able to satisfy routine social demands and
limited work requirements.'
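
Since the weighting and conversion logic is entirely mechanical, it can
be stated compactly in code. The following sketch (in Python; the
names are ours, and the values are simply transcribed from the
Weighting and Conversion Tables above) reproduces the worked example:

    # Weights from the FSI Weighting Table: position 0 holds the points
    # awarded for a rating of 1, position 1 for a rating of 2, and so on.
    WEIGHTS = {
        'accent':        [0, 1, 2, 2, 3, 4],
        'grammar':       [6, 12, 18, 24, 30, 36],
        'vocabulary':    [4, 8, 12, 16, 20, 24],
        'fluency':       [2, 4, 6, 8, 10, 12],
        'comprehension': [4, 8, 12, 15, 19, 23],
    }

    # Conversion Table: (lowest total, highest total, FSI level).
    LEVELS = [(16, 25, '0+'), (26, 32, '1'), (33, 42, '1+'), (43, 52, '2'),
              (53, 62, '2+'), (63, 72, '3'), (73, 82, '3+'), (83, 92, '4'),
              (93, 99, '4+')]

    def fsi_level(ratings):
        """ratings maps each scale name to a rating from 1 to 6."""
        total = sum(WEIGHTS[scale][r - 1] for scale, r in ratings.items())
        for low, high, level in LEVELS:
            if low <= total <= high:
                return total, level
        return total, None  # totals below 16 fall outside the table

    # The worked example: 2 on Accent and 3 on each of the other scales.
    print(fsi_level({'accent': 2, 'grammar': 3, 'vocabulary': 3,
                     'fluency': 3, 'comprehension': 3}))  # -> (49, '2')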
Is it fair to say that the technique is already formidable? There are
many problems in the interpretation of the verbal descriptions of the
various scales, and there are many more in the interpretation of the
meaning of the overall ratings once they are arrived at. Nonetheless,
the procedure seems to work fairly well. The examiners of course are
required to complete a fairly rigorous training program, and there are
a number of procedural niceties that we have omitted discussing. The
interested reader should obtain further information from the ETS
Manual for Peace Corps Language Testers (1970) and other ETS and
FSI publications. The point in discussing the procedure here has not
been to recommend it in particular, but rather to suggest some ways
that similar testing techniques that would be more feasible for a
broader spectrum of educational assessment problems might be
constructed.
Surely if raters can be trained to agree substantially on what
qualifies as 'the simplest type of comprehension', 'very simple
speech', 'somewhat simplified speech', 'everything in normal
educated conversation except for very colloquial or low-frequency
items', 'everything in both formal and colloquial speech' (see the
verbal descriptions of the six point Comprehension scale given
above), they can do the simpler task of rating subjects on a five or six
point scale where the points are defined in terms of assigning the
lowest ratings to the worst performers and the highest to the highest.
If interview performance is also compared against performance on a
more structured oral test, the meanings of points on the interview
scale can be referenced against scores on the other task and vice versa.
Obviously, the FSI Oral Interview was conceived with adults in
mind. Further, it was aimed toward adults who were expected to fill
governmental posts at home or abroad. However vague the verbal
descriptions of performances may be in relation to the target
population and the requisite skills, they nonetheless work rather well.
There is every reason to believe that similar techniques referenced
against different target populations and different performances might
work equally well.
The essential features of the Oral Interview procedure used by FSI
are perhaps difficult to identify. However, if we desire to extend what
has been learned from that technique of evaluation to other similar
techniques, we must decide what the parameters of the similarity are
to be. What techniques in other words can we expect to produce
results similar to those achieved with the FSI Oral Interview?
Obviously, not all of the results are desirable. The ones that would be
useful in a wide range of oral testing settings are the attainment of
highly reliable ratings of speech samples in ways that are at least
related to the performance of specific educational tasks. In this sense,
the generalizability of the procedure to the rating of speech samples in
a wide range of contexts seems possible.
As we show in the Appendix, the FSI Oral Interview is not
dependent for its reliability on the componential breakdown of skills.
In fact, it is apparently dependent on the ability of intelligent speakers
of a language (namely, the raters) to assign scores to performances
that are not defined in any adequate descriptive terminology. It is
apparently the case that the utility of the FSI procedure is dependent
primarily on the ability of raters to differentiate performances on one
basic dimension - the pragmatic effectiveness of speech acts.
One hesitates to propose any particular example of an extension of
the FSI rating procedure to other speech acts because the range of
possibilities defies the imagination. It is probably the case that
suitable rating scales can be created for almost any spontaneous
speech act that tends to recur under specifiable circumstances. For
instance, repetitive interactions in the classroom between children
and teachers, or between children and other children, or outside the
classroom between children and others. Formal interview situations
of a wide variety could be judged for pragmatic complexity, affective
arousal (say, enthusiasm and effectiveness in creating the same
feelings in the interlocutor) and confidence. Or perhaps they should
be judged only in terms of pragmatic complexity of utterances and at
the same time the effectiveness of those same utterances.
People are constantly engaged in evaluating the speech of other
people. It is not so strange an occurrence that it is exclusively the
domain of language testing per se. Viewers of television and movies
judge the performances of actors to be more or less effective. Readers
implicitly or often explicitly judge the effectiveness of a narrator in
telling a story, or of an expositor in explaining something clearly.
Some speakers are said to be articulate and others less so. Some
performances of speech acts are judged to be particularly adept,
others inept. All of these judgements presumably have to do with the
appropriateness of usages of words in contexts of speech. Hence, the
FSI type of rating scales would appear to be generalizable to speech
acts of many different sorts.
Requisite decisions include how the speech acts are to be elicited;
what the rating scale(s) will be referenced against; and who is a
qualified rater. Speech acts may be contrived in interview settings (as
they are in the case of the FSI Oral Interview), or they may be taped
or simply observed interactions between the examinee and some
other or others. The rating scale may refer to the surface form of the
utterances used by the examinee - i.e., focussing on questions of well-
formedness (by someone's definition), or it may refer to questions of
pragmatic effectiveness - i.e., how well does the examinee
communicate meanings in the setting of the speech act.
Persons who are asked to rate performances of examinees should
be speakers of the language in question and should be able to
demonstrate their ability to differentiate poor performances from
good ones in agreement with other competent judges. The latter
requirement can be met by taking samples of better and worse
performances and asking judges (or potential judges) to rate them. If
judges consistently agree on what constitutes a better performance or
more succinctly on how to rank order a series of clearly differentiated
performances ranging from weak to strong, the necessary rudiments
of reliability and validity are probably implicit in such judgements.
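
One common statistic for quantifying such rank-order agreement
between judges is Spearman's rank correlation; the text prescribes no
particular statistic, so the formula and the sample rankings below are
offered only as an illustration:

    # Spearman's rank correlation between two judges' rankings of the
    # same performances (1 = weakest). Values near 1 indicate agreement.
    def spearman_rho(ranks_a, ranks_b):
        n = len(ranks_a)
        d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
        return 1 - (6 * d2) / (n * (n ** 2 - 1))

    judge1 = [1, 2, 3, 4, 5, 6]
    judge2 = [2, 1, 3, 4, 6, 5]
    print(round(spearman_rho(judge1, judge2), 3))  # 0.886: substantial agreement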

E. Other pragmatic speaking tasks


Clearly, the specific tasks discussed to this point were all designed
with special purposes in mind. There are other testing procedures that
can be applied to a wider variety of testing purposes and which were
not designed with any special population or testing objective in mind.
For instance, a generally applicable oral language task is reading
aloud. Another is oral cloze procedure. Still another is narrative
repetition, the oral analogue of what was discussed in Chapter 10
above under the term dictation/composition. In this section, we turn
our attention to these techniques in particular as tasks for eliciting
speech. Reading aloud and oral cloze have the distinct advantage of
being somewhat more easily quantifiable than some of the more
open-ended procedures discussed above. Elicited imitation which
was discussed in Chapter 10 can also be used as a speaking task, and it
too, like oral cloze testing, is relatively easy to score.
1. Reading aloud. It will be objected early that reading and
speaking are dissimilar tasks. This is largely true if one reads silently,
and even if one reads aloud, there are differences between reading
aloud and speaking. Therefore, reading aloud can only be used for
persons who are known to be good readers in at least one language
besides the one they are to be tested in, and in situations where the
learners have had ample opportunity to become literate in the
language in which they are to be tested.
Paul A. Kolers (1968) reported results with a reading aloud task
which he used in a study of the nature of bilingualism. He asked
subjects to read a paragraph in English and also its translation in
French. His purpose was to discover whether there was a significant
difference in the amount of time required to read the text in English or
in French and the time required to read the same material mixed
together in both French and English. The latter detail is of interest to
this discussion only insofar as Kolers' study showed that the
efficiency of processing is related to the amount of time it takes to
convert the printed form to a spoken stream of speech. The English
passage and its French translation are given below:

His horse, followed by two hounds, made the earth resound
under its even tread. Drops of ice stuck to his cloak. A strong
wind was blowing. One side of the horizon lighted up, and in the
whiteness of the early morning light, he saw rabbits hopping at
the edge of their burrows.

Son cheval, suivi de deux bassets, en marchant d'un pas égal
faisait résonner la terre. Des gouttes de verglas se collaient à son
manteau. Une brise violente soufflait. Un côté de l'horizon
s'éclaircit; et, dans la blancheur du crépuscule, il aperçut des
lapins sautillant au bord de leurs terriers.
How could reading aloud be used as a measure of speaking ability
or at least of fluency in reading aloud? There are a variety of
conceivable scoring procedures. One easy way to score such a task
would be to record the reading protocols on tape and then measure
the amount of time from the onset of the first word to the termination
of the last word in the spoken protocol. This can be done with a stop
watch or with the timing mechanism on the tape recorder provided
the latter is accurate enough. There must also be some method for
taking into account the accuracy of the rendition. A word by word
scoring for recognizability (and/or accuracy of pronunciation) could
be recorded along with the amount of time required for the reading.
The average time required by a sample of fluent native speakers
would provide a kind of ceiling (as opposed to a baseline)
performance against which to compare the performance of non-
natives on the same task.
Another possible scoring technique would be to rate the reading
aloud protocols the way one would rate speech protocols in an
interview setting - subjectively according to loosely stated criteria.
For instance, the reading could be rated for accuracy and this rating
could be reported along with the amount of time required to complete
the task.
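
Either way, the objective part of the score reduces to a timing and a
word count. The following is a minimal sketch of the timing-plus-accuracy
scoring, with the native-speaker ceiling used as the point of comparison
(the data, units, and names are invented for illustration):

    # Score one reading-aloud protocol: word-by-word recognizability plus
    # reading rate, compared against a native-speaker ceiling on the same text.
    def score_reading_aloud(seconds, words_recognizable, words_total,
                            native_mean_seconds):
        return {
            'accuracy': round(words_recognizable / words_total, 2),
            'wpm': round(words_total / (seconds / 60), 1),
            'native_ceiling_wpm': round(words_total / (native_mean_seconds / 60), 1),
        }

    # A 100-word passage read in 55 seconds with 92 recognizable words;
    # fluent natives average 40 seconds on the same passage.
    print(score_reading_aloud(55, 92, 100, 40))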
Reading aloud is probably easier than speaking. That is, a person
should be expected to be able to read fluently things that he could not
say fluently without the aid of a written text. However, it is unlikely
that a person who cannot speak fluently in any context without a
script will be able to do so with a script. A foreign language learner,
for example, who cannot carry on a simple conversation fluently will
probably not be able to read a somewhat more complex text with
fluency either. Hence, the technique deserves investigation as a
pragmatic speaking task.
It may be true that a fluent native speaker can read aloud while
thinking about something else and that in this special case the task
violates the pragmatic naturalness criterion of meaningfulness.
However, the important question for the possible use we are
discussing is whether the non-native speaker who does not know the
language well can read a passage of text fluently without understand-
ing it in the sense of mapping the sequences of words onto appropriate
extralinguistic context. If the non-native must comprehend what he is
reading in order to do so fluently, then reading aloud to that extent
qualifies as a pragmatic testing procedure. (For some preliminary
data on reading aloud as a language proficiency measure, see the
Appendix, especially section D.)
2. Oral cloze procedure. One of the most versatile and least used
oral testing procedures is the oral cloze test (Taylor, 1956). It can be
constructed in a multitude of forms. Oral cloze procedures have been
applied in a variety of ways to special research interests, but only
recently to problems of language proficiency assessment. For
instance, Miller and Selfridge (1950) used a variety of oral cloze
technique to construct texts of differing degrees of approximation to
normal English. Aborn, Rubenstein, and Sterling (1959) used an oral
cloze approach in a study of constraints on words in sentences.
Craker (1971) used an oral cloze test to study the relationship
between language proficiency and scores on educational tests for
children in Albuquerque, New Mexico. A pioneering study by
Stevenson (1974) used an oral cloze test as a measure of the English
proficiency of foreign students at Indiana University. Scholz, et al (in
press) used oral cloze tests for a similar purpose at Southern Illinois
University (again see the Appendix, section D).
Although no thoroughgoing study of the properties of oral cloze
tests has been done, on the strength of the extensive work with written
cloze tasks and the few successful applications of oral cloze tests, the
technique can be recommended with some confidence. To be sure,
there are important differences in the oral applications of the
technique and the more familiar reading and writing procedures (see
Chapter 12), but there are also some substantial similarities. Whereas
in the written task examinees can utilize context on both sides of a
blank (or decision point), in an oral cloze test usually only the
preceding context is available to help the examinee infer the
appropriate next word at any decision point. This difference can be
minimized by allowing the test subjects to hear the passage once or
twice without any deletions before they actually attempt to fill in
blanks in the text. Someone might object that this modification of
oral cloze procedure would make it more of a test of listening
comprehension, but for reasons given in Part One above, this is not a
very significant objection. Normal speaking always involves an
element of listening comprehension. However, even if the decisions in
an oral cloze test must be made only on the basis of preceding context
and without reference to the material to follow, there are still
fundamental similarities between written and oral cloze tests.
Probably the most important is that in both tasks it is necessary to
generate expectancies concerning what is likely to come next at a
given point. This requires knowledge of discourse constraints
concerning the limits of sequiturs.
There are many ways of doing oral cloze tests. We will begin by
considering a variant that derives in fairly straightforward fashion
from the well-known written cloze test procedures. As in the case of
dictation, a first step is to select an appropriate text. The
considerations that were discussed in Chapter 10 would also
generally apply to the selection of material for an oral cloze task.
Because speaking tasks place a heavier burden on short term memory
and on the attention of the speaker, however, the text for an oral cloze
test should probably be pitched at a somewhat lower level other
things being equal. The text may be a tape-recording of a radio or
television broadcast, a segment of a drama, or a portion of some
written text converted to an oral form.

An example of a fairly difficult text for an oral cloze test follows.
Slash marks indicate pause points. Deletions are italicized.

Each year since the 1960s hospitals in the United States have
had to accommodate about one million additional/ (patients).
As a result, hospitals across the/ (country) have searched for
new ways to/ (be) more efficient in order to provide/ (the) best
possible patient care.
Harris Hospital, a 628-bed institution/ (in) Fort Worth,
Texas, has successfully/ (met) this need using a computer-based
system. Key to the/ (system) is an IBM computer and more/
(than) 100 terminals located at the admitting/ (desk), all nursing
stations and in key/ (departments) throughout the hospital.
Staff members can now/ (enter) or retrieve all pertinent
medical information/ (on) every patient from any authorized
location in the hospital (from the text of an IBM advertisement,
Reader's Digest, February, 1977, p. 216).

A text like this one can be administered in a variety of ways. Probably
most of the variations in the way the task is done would tend to affect
its overall difficulty level more than anything else. For instance, the
text may be presented in its entirety (without any deletions) one or
more times before it is presented with pauses for the examinee(s) to fill
in the missing portions. After each decision point, where the
examinee is given a certain amount of time to guess the next word in
the sequence, the correct form may or may not be given before
progressing to the next decision point.
It is also possible to test examinees over every word in the text by
systematically shifting the deletion points and by making a new pass
through the material for each new set of deletion points. For instance,
if every fifth word is deleted on the first pass, it is only necessary to
shift all blanks one word to the left, make another pass, and repeat
this procedure four times (that is, for a total of five passes) in order to
test examinees over every word in the text.
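
The bookkeeping for such shifting passes is easy to get wrong by
hand; the sketch below (in Python; our own illustration, not part of
any published procedure) generates the five passes so that every word
is blanked exactly once:

    # Five passes of an every-fifth-word deletion; shifting the deletion
    # points one word left on each pass covers every word exactly once.
    def cloze_passes(text, n=5):
        words = text.split()
        passes = []
        for k in range(n):
            target = (n - 1 - k) % n  # residue class deleted on this pass
            passes.append(' '.join('_____' if i % n == target else w
                                   for i, w in enumerate(words)))
        return passes

    for p in cloze_passes('Each year since the 1960s hospitals in the '
                          'United States have had to accommodate about '
                          'one million additional patients.'):
        print(p)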
Another technique which can be used to test subjects over every
word in a given text is what may be called a forward build-up. The
examinee is given a certain amount of lead-in material, say, the first
ten words of text, and then is required to guess each successive word
of the text. After each guess he is given the material up to and
including the word just attempted. Obviously, there are many
possible variations on this theme as well. It could be done sentence by
sentence throughout a text, or it could be used over the entire text. It
has the drawback of seeming tedious because of the number of times
that the lead-in material preceding each guess has to be repeated. The
word just preceding the word to be attempted at any decision point
may have been heard only once (twice if the examinee guessed
correctly on the preceding attempt), but the word before that has
been heard at least twice, the word before that three times, and so on,
such that the word n words before a given decision point will have
been heard n times when that decision point is reached in the text.
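
A sketch of the forward build-up prompts (again our own
illustration) makes both the procedure and its tedium visible: each
prompt repeats all the material heard so far, so early words are heard
over and over:

    # Generate forward build-up prompts: after a ten-word lead-in, the
    # examinee guesses each next word, then hears the text up to and
    # including that word before the following guess.
    def forward_buildup(text, lead_in=10):
        words = text.split()
        return [' '.join(words[:i]) for i in range(lead_in, len(words))]

    for prompt in forward_buildup('Each year since the 1960s hospitals in '
                                  'the United States have had to accommodate '
                                  'about one million additional patients.'):
        print(prompt + ' ...?')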
Yet another possibility is for the examinee to guess not the next
word, but the next sensible phrase or unit of meaning. This type of
oral cloze test seems to be closer to the sort of hypothesizing that
normal language users are always doing when listening to speech with
full comprehension.
If every word in a text is to be tested, which necessitates one of the
iterative procedures with multiple passes through the same material,
a passage of no more than 100 words will probably provide
substantial reliability as a testing device. The reliability attained, of
course, will vary according to the degree of appropriateness of the
difficulty level of the text, and a number of related parameters that
affect difficulty (see Chapter 10 on the discussion of selecting a
dictation text). If an every nth word deletion procedure is to be used
without iterative passes through the text, a passage of n times fifty
words will probably provide adequate reliability for most any purpose.
For classroom tests with more specific aims - e.g., assessing
comprehension of a particular instruction, a given lesson, a portion of
a text, or the like - it is quite possible that much shorter tasks would
have sufficient reliability and validity to be used effectively.
Oral cloze tasks could conceivably be done live, but for many
reasons this is not generally recommended. Tape-recorded texts offer
ample time for editing on the part of the examiner which is simply not
available in face-to-face live testing. If examinees can be tested in a
language laboratory where individual responses are recorded, this
too is an advantage generally. Testing can be done one-on-one, but
for many obvious reasons this is less economical in most educational
settings.
It is also possible to have examinees write responses to oral cloze
items. This possibility requires investigation, however, before it can
be seriously recommended as an alternative to elicited speech.
Further, if the responses are to be written, many of the techniques
that are applicable in the case of spoken responses cannot be used for
testing. For instance, if the examinee is asked to write responses, no
feedback can be given concerning the correct form before proceeding
to the next decision point - that would be giving away the answers. In
the oral situation by contrast, if the examinee is required to speak his
answer into a microphone in so many seconds, subsequent feedback
concerning the appropriate word will not impinge on the word or
words spoken onto the tape some seconds before by the examinee.
Iterative techniques are generally less applicable also if responses are
written. We should note, however, that the written response
possibility has much appeal because it would allow the testing of
many subjects simultaneously without the need for more than one
tape recorder.
Oral cloze test protocols may be scored in several ways of which we
will only briefly consider two. Responses may be scored for the exact
word to appear at a particular point in a text, or they may be scored
for their appropriateness to the preceding (and/or total) context.
Both of these options are discussed in considerably greater detail
along with several other options in Chapter 12 on written cloze tasks.
Therefore, only two recommendations will be made here, and the
reader is encouraged to consider the discussion in Chapter 12 on the
same topic. It is believed to be generally relevant to the scoring
problem regardless of the testing mode. However, for oral cloze tests,
it is suggested that (1) contextually appropriate responses should be
scored as correct unless there is some compelling reason to use the
exact word scoring technique; and (2) in determining what
constitutes an appropriate response, all of the context to which the
learner has been exposed at a given point in the text should be taken
into account (this is assuming that the text is meaningfully sequenced
to start with - that it does not consist of disjointed and unrelated
sentences). If the learner has been exposed to the entire text on one or
more readings, then the entire text is the context to be taken into
account in judging the appropriateness of responses.
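
A minimal sketch of the two scoring methods follows (the function
name and the lists of acceptable words are illustrative assumptions,
since appropriateness is ultimately a judgement call):

    # Exact-word scoring counts only the word actually deleted; contextual
    # scoring also accepts any word judged appropriate to the context.
    def score_cloze(responses, exact_words, acceptable=None):
        if acceptable is None:
            return sum(r == w for r, w in zip(responses, exact_words))
        return sum(r == w or r in ok
                   for r, w, ok in zip(responses, exact_words, acceptable))

    guesses = ['patients', 'nation', 'be']
    answers = ['patients', 'country', 'be']
    print(score_cloze(guesses, answers))                              # exact: 2
    print(score_cloze(guesses, answers, [set(), {'nation'}, set()]))  # contextual: 3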
TESTS OF PRODUCTI VE ORAL COMMUNICATION 333
3. Narrative lGsks. Normall y, describing a picture is not a
narrative task. h onl y becomes one if the picture fits into a stream of
experience where certain events lead to other events. Where causes
have effects, and effects ha ve ca uses - where there is a meaningful a nd
no n-random sequential development of one thing leading to another.
It is not necessary that the progression be a strictly logical one -
experience is not strictly logical · . but it is necessary that there be some
sort of discoverable connection between events in a sequence in order
for the sequence to serve as an appropriate basis rar narrati ve.
As we no ted in C hapter 10, story retelling tasks are merely a special
kind of elicited imitation task. There are many other sor ts of
narrative tasks however. A very interesting exam ple is offered by
Cazd en, et al (1976), A child is given a set of materials and is
instructed to build o r make wha tever he wa nts to with them . The
malerials in experimental trials includ ed so me 'plastic foam,
fasteners, paper. pipe cleaners, tape, etc.' and some 'tools' includi ng
'magic markers and scissors'. The children were given about 30
minutes for th e construction project, then they were asked to write
abo ut' " ho w" they made whatever they made' (p, 7f) , The writing
task was allotted 35 minutes.
An oral form of this testing procedure could be conducted by
asking children to tell how they made what they made to a non-
threatening interlocutor. If desired, the protocols could be tape-
recorded for later analysis and scoring. One possible simplification of
the scoring of such protocols would be to use a rating procedure
comparable to the kinds of scales invented by the FSI (see above) for
the oral assessment of adult protocols. Scales for creativity,
communicative effectiveness, and whatever else the teacher might
subjectively want to evaluate could be constructed. These should be
carefully researched, however, to assess their validity. Perhaps a
single overall scale would make the most sense. See Callaway (in
press) and Mullen (in press b).
Another type of narrative task consists of retelling a story. This
technique is probably most applicable for children, but appropriate
versions can also be constructed for adults (e.g., telling someone how
to follow a set of complicated instructions that have just been
presented by a third party). The Gallup-McKinley School District in
New Mexico has recently collaborated with the Lau Center in
Albuquerque in developing a pilot test that includes a story retelling
task. Children are interviewed one at a time by two adults.
First the child is asked non-threatening questions about himself
and things that he likes. Then the examiner tells the child a story
preceded by the instruction that the child is to listen carefully so that
he can retell the story to another person who has ostensibly not heard
the story before. As much as possible, the story relates to things that
represent familiar objects in the reservation environment of Gallup,
New Mexico. The things used include a toy pick-up (of which one
child avowed 'Every kid should have one o' them'), a string doubling
for a rope, a horse made of plastic, a toy dog, and some other objects.
The story involves going on a trip to a place designated by the child as
a place that he or she would like to visit more than any other place in
the world. The pick-up gets stuck. The horse is unloaded and hitched
to the bumper to pull it out. Meanwhile the dog gets lost chasing
rabbits. After a search he is located on a desert road where he is being
chased by a giant jackrabbit.
The first adult who told the story then leaves the room and the
second adult who has not heard the story enters and the child (usually
with considerable alacrity) tells the second person the story. It is
important that the second adult be someone who (from the child's
point of view) has not heard the story. It seems odd even to a child to
turn around and retell a story to someone who just made it up, but it is
not unnatural at all to tell a story just heard to a party that has not
heard it before.
The scoring of protocols (which are usually recorded on tape) is
only slightly less difficult than it is in the case of narrative tasks where
the content is less well defined. At least in the case of story retelling,
the content that the child is supposed to express is fairly well defined.
It must be decided whether to subtract points for insertion of
extraneous material or in fact to reward it as a case of verbal
creativity. It must be determined how to weight (if at all) different
portions of the content. Some events may be more salient than others
- e.g., the jackrabbit chasing the dog is a chuckler for any reservation
child. One possible solution is to assign overall subjective ratings to
the child based on predetermined scales - e.g., overall communicative
effectiveness may be rated on a scale of one to five where the points on
the scale are defined in terms of nothing more specific than better and
worse performances of different children attempting the task.
Another possibility is to score protocols for major details of the story
and the comprehensibility of the rendition of those details by the
child. A perfect rendition would receive full credit for all major facts.
A rendition that omits a major fact of the story would have a certain
number of points subtracted depending on the subjectively
determined weighting of that fact in the story. A rendition of a
particular fact that is distorted or only partly comprehensible would
receive partial credit, and so forth.
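
A weighted-fact scheme of that kind might be sketched as follows;
the facts, weights, and partial-credit values here are invented, since
the text leaves them to the test maker's subjective judgement:

    # Weighted major facts of the story; a rendition maps each fact to
    # 1.0 (fully comprehensible), 0.5 (distorted/partial), or 0.0 (omitted).
    STORY_FACTS = {
        'trip to a chosen place': 2,
        'pick-up gets stuck': 3,
        'horse pulls the truck out': 3,
        'dog lost chasing rabbits': 2,
        'dog found, chased by jackrabbit': 3,
    }

    def score_retelling(rendition):
        return sum(weight * rendition.get(fact, 0.0)
                   for fact, weight in STORY_FACTS.items())

    perfect = {fact: 1.0 for fact in STORY_FACTS}
    partial = dict(perfect, **{'horse pulls the truck out': 0.5,
                               'dog lost chasing rabbits': 0.0})
    print(score_retelling(perfect))  # 13.0: full credit for all major facts
    print(score_retelling(partial))  # 9.5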
Versions of narrative tasks for adults or for more advanced
learners can easily be conceived. Interpretive tasks are actually not so
uncommon as the opponents of translation as a testing device have
sometimes argued. A text may be presented in either the target
language (that is, the target of the testing procedure) or in some other
language known to the examinee(s). The task then is telling someone
else what the text contained - that is, reiterating it to some third party.
In this light, translation of a substantial text (say, fifty words or more)
is merely a special kind of retelling or paraphrasing task. The reader is
left to explore the many other possibilities that can be developed
along these lines. It is suggested that the scoring considerations
discussed in Chapter 13 with reference to essay tasks be taken as a
foundation for investigating alternative scoring methods for other
productive tasks such as narrative reporting, etc. Also, see Chapter 14
for sample speech protocols from the Mount Gravatt Australian
research project and for suggestions concerning the relationship of
speech to literacy and school tasks in general.

KEY POINTS
1. Pragmatic speaking tasks should generally involve something to say and
someone to say it to.
2. Examples of pragmatic speaking tasks include structured and
unstructured interviews, and a wide range of text processing and
construction tasks including oral cloze procedures, narrative based
tasks, and for some purposes perhaps, reading aloud.
3. Speaking tests are important because of the special importance
associated with speech in all aspects of human intercourse.
4. There is a serious need for better oral language tests.
5. The Bilingual Syntax Measure is an example of one published test that
affords the possibility of meaningful sequence across test items.
6. Its colorful illustrations and surprise value make it a naturally suitable
basis for eliciting discourse type protocols from children.
7. Alternative question formats and non-discrete point scoring procedures,
however, are recommended.
8. The Ilyin Oral Interview is used as an example of how pictured sequences
of events can be used to elicit discourse in a somewhat more complex
form than the Bilingual Syntax Measure.
9. A possible modification of the Oral Communication Test is discussed.
10. The FSI Oral Interview technique is suggested as a procedure with many
possible applications.
6. John A. Upshur, 'Objective Evaluation of Oral Proficiency in the ESOL
Classroom,' TESOL Quarterly 5, 1971. Reprinted in Palmer and Spolsky
(1975), 53-65. See reference 2 above.
7. John A. Upshur, 'Productive Communication Testing: A Progress
Report,' in G. E. Perren and J. L. M. Trim (Eds.) Selected Papers of the
Second International Congress of Applied Linguistics. Cambridge,
England: Cambridge University Press, 1971, 435-441. Reprinted in J.
W. Oller and J. C. Richards (Eds.) Focus on the Learner: Pragmatic
Perspectives for the Language Teacher. Rowley, Mass.: Newbury House,
1973, 177-183.
12

Varieties of Cloze Procedure

A. What is the cloze procedure?
B. Cloze tests as pragmatic tasks
C. Applications of cloze procedure
1. Judging the difficulty of texts
2. Rating bilinguals
3. Estimating reading comprehension
4. Studying textual constraints
5. Evaluating teaching effectiveness
D. How to make and use cloze tests
1. Selecting material for the task
2. Deciding on the deletion procedure
3. Administering the test
4. Scoring procedures
   I. The exact word method
   II. Scoring for contextual appropriateness
   III. Weighting degrees of appropriateness
   IV. Interpreting the scores and protocols

In this chapter we continue to apply the pragmatic principles
discussed in earlier chapters, but we focus our attention on a written
testing technique known as cloze procedure. Though many of the
findings of studies with written cloze tests are undoubtedly
generalizable to other types of tests and to other modalities of
language processing, the written cloze procedure is largely a reading
and writing task. Among the questions addressed are: (1) What is the
cloze procedure and why is it considered a pragmatic task? (2) What
are some of the applications of the cloze procedure? (3) What scoring
procedures are best? (4) How can scores on cloze tasks be
interpreted?

A. What is the cloze procedure?


W. L. Taylor is credited with being the inventor of the cloze
technique. He is also responsible for coining the word 'cloze' which is
rather obviously a spelling corruption of the word 'close' as in 'close
the door'. It has been a stumbling stone to many a typesetter and has
often been misspelled by an overzealous and unknowing editor who
found it difficult to believe that anyone would really intend to spell a
word C-L-O-Z-E. The term is a mnemonic or perhaps a humorless
pun intended to call to mind the process of closure celebrated by
Gestalt psychologists. In the cloze technique blanks are placed in
prose where words in the text have been deleted. Filling the blanks by
guessing the missing words is, according to Taylor's notion, a special
kind of closure - hence the term cloze. The reader's guessing of
missing words is a kind of gap filling task that is not terribly unlike the
perceiver's completion of imperfect visual patterns (for instance, the
square, the letter A, and the smiling face of Figure 19).
Compare the ability of a perceiver to complete the patterns in
Figure 19 with the ability of a reader to complete the following
mutilated portions of text:
(1) one, t_o, t___e, f__r, _ive, s_x, s___n, ...
(2) Four _____ and seven _____ ago _____ ...
(3) After the mad dog had bitten several people he was finally
sxghtxd nxxr thx xdgx xf txwn xnd shxt bx a local farmer.
(4) It is true that persons _____ view the treatment of
mental _____ from a clinical perspective tend _____
explain socioeconomic and ethnic differences _____
biological terms.
In example (1) the reader has no difficulty in supplying the missing
letters of the words 'two', 'three' and so on. The series is highly
redundant, which is similar to saying that the reader's expectancy
grammar anticipates the series on the basis of very little textual
information. Example (2) is also highly redundant if one happens to
know the first line of Lincoln's Gettysburg address. If the text is
unfamiliar to the reader, (2) is considerably more difficult to fill in
Figure 19. Some examples of visual closure - seeing the overall pattern or
Gestalt.

than (3) or (4), otherwise it is probably somewhat easier. In any case,


even when the original text is not stored in its enti re ty or at least in
some recoverable fo rm in memo ry, missing o r mutilated po rtions
m ay nonetheless be recoverable by a creative process of construction
as is illustrated by examples (3) and (4). The mutilated words in (3)
are 'sigh ted near the edge of town and shot by' and in (4) the missing
words are 'who ', 'retardation', ' to', and 'in'.
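
The vowel-masking distortion in example (3) is mechanical enough to
reproduce exactly (this sketch is our own, not Taylor's procedure):

    # Replace every vowel with 'x', the distortion used in example (3).
    def mask_vowels(text):
        return ''.join('x' if c.lower() in 'aeiou' else c for c in text)

    print(mask_vowels('sighted near the edge of town and shot by'))
    # -> sxghtxd nxxr thx xdgx xf txwn xnd shxt bx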
The above examples do not even begin to exemplify the range of
possible distortions that readers can cope with in the processing of
print, but perhaps they provide a basis for the comparison of the
notion of pattern completion in Gestalt psychology and the concept
of closure in relation to the processing of text. It would seem that the
perceiver's ability to fill in gaps in imperfect patterns such as the ones
exemplified in Figure 19 (or in more complicated visual examples)
may be related to the ability to construct the same patterns. In any
event, it would seem that the ability to fill in gaps in prose is
characterizable in that way. The reader can supply the missing
portions by a constructive process similar to the way the writer put
the text together in the first place. When the material is almost
completely redundant, e.g., filling in the missing letters in the series of
words in example (1), or filling in the missing words in a text that has
been committed to memory, the task would seem to be somewhat like
the process of filling in the gaps in imperfect visual patterns.
However, when the material is not so familiar and is therefore less
redundant, the power of the generative mechanisms necessary to fill
in the gaps or to restore distorted portions of text would seem to be
more complex by orders of magnitude than the simple visual cases.
Consider the kinds of information that have to be utilized in order
to fill the blanks in example (4) above, or to restore the mutilated
portion of example (3). In the processing of (3), the first mutilated
word could probably be guessed by spelling clues alone. However,
there is much more information available to the reader who knows
the language of the text. He can infer that a mad dog who has bitten
several people is a menace to be disposed of. Further, if he knows
anything of mad dogs he is apt to infer correctly that once the
hydrophobia expresses itself in the slobbering symptoms of vicious
madness, the dog certainly cannot be saved. Therefore, an intelligent
community of human beings interested in protecting their own
persons and the well-being of pets and livestock would in all
probability actively hunt the diseased animal and dispose of the
threat. The range of possible meanings at any point in the text is
expected to conform to these pragmatic considerations. Semantic
and syntactic constraints further limit the range of possible
continuations at any particular point in the text. Hence, when the
reader gets to the word 'sxghtxd' he is actively expecting a past
participle of a verb to complete the verb phrase 'was finally
____ '. Subsequent possibilities are similarly limited by the
expectancy grammar of the learner and by the textual clues of word
spaces and unmutilated letters.
In example (4), the restoration of missing words is dependent on a
host of complex and interrelated systems of grammatical knowledge.
The reader may easily infer that the appropriate pragmatic mapping
of the sentence relates to the clinical treatment of some mental
disorder. Further, that it relates to the characterization of a way of
dealing with such disorders by persons taking the clinical perspective
as the point of departure for explaining socioeconomic and ethnic
differences. On the basis of semantic and syntactic constraints the
reader can determine (if not consciously, at least intuitively and
subconsciously) that a relative pronoun is required for the first blank

to subordinate the following clause to the preceding and also to serve
as the surface subject of the verb 'view'. On the basis of semantic and
syntactic clues, the reader knows that the second blank in the text of
example (4) requires a noun to serve as the head of the noun phrase
that starts with the adjective 'mental'. Moreover, the reader knows
that the noun phrase 'mental _____' is part of a prepositional
phrase 'of mental _____' which serves as a modifier of the
preceding noun 'treatment' telling in fact what the treatment in
question is a treatment of. On the basis of such information, the
reader may be inclined to fill in the blank with words such as
'disorders', 'retardates', or possibly with the correct word 'retardation'.
If 'treatment' is read in the sense of 'consideration',
'health' is a possibility.
From all of the foregoing, we may deduce that the cloze procedure
- that is, the family of techniques for systematically distorting
portions of text - is a method for testing the learner's internalized
system of grammatical knowledge. In fact the cloze technique elicits
information concerning the efficiency of the little understood
grammatical processes that the learner performs when restoring
missing or mutilated portions of text. Wilson Taylor viewed the cloze
procedure in this way from its inception. His notion of 'grammatical
expectation' was necessarily less refined than it has come to be in the
last twenty years, but he very clearly had the same idea in mind when
he cited the following argument which he attributed to Charles E.
Osgood:
Some words are more likely than others to appear in certain
patterns or sequences. 'Merry Christmas' is a more probable
combination than 'Merry birthday'. 'Please pass the _____'
is more often completed by 'salt' than by 'sodium chloride' or
'blowtorch'. Some transitions from one word to the next are,
therefore, more probable than others (Taylor, 1953, p. 419).

According to Taylor, Osgood argued that the foregoing was a


product of the redundancy of natural language:
' Man coming' means the same as 'A mao is coming this way
now'. The latter, which is more like ordinary Englis h, is
redundant: it indicates the singular number of the subject three
times (by 'a', 'man', 'is'), the present tense twice ('is coming' and
'no\v'), and the direction of action twice ('co ming' and ' this
way'). Sueh repetitions of meaning, such internal ties between
words, make it possible to replace 'is', ' this', 'way" or 'now',
should any one of them be missed (p. 41 8).
B. Cloze tests as pragmatic tasks

Not all fill-in-blank tests are pragmatic. It is possible to set up fill-in-
blank items over sentences that are as disjointed as a poorly planned
pattern drill. Cloze items may also be arranged to focus on certain
discrete points of structure or morphology. Davies (1975), for
instance, reports results with a variety of cloze procedure that deletes
only certain grammatical categories of words. He also clued the
examinee by leaving in the first letter of the deleted word - e.g.,
'T___ i___ a test o___ reading comprehension' (p. 122).
Frequently, blanks were placed contiguously in spite of the fact that
research has shown that a deletion ratio of greater than one word in
five creates many items that even native speakers of English cannot
complete and Davies was working with non-natives. Consider the
first sentence of one of his tests: 'B___ changes i___ t___
home are less revolutionary, a___ easier t___ assimilate,
t___ changes i___ industry ....' These items may work fairly
well, but they cannot be taken as indicative of the sort of items that
appear in more standard cloze tests. Since Davies' items are placed
only on function words, they cannot be expected to produce as much
reliable variance as items placed, say, every fifth word over an entire
text.
What sorts of cloze tests do qualify as pragmatic? There are many
types. The most commonly used, and therefore, the best researched
type, is the cloze test constructed by deleting every nth word of a
passage. This procedure has been called the fixed-ratio method
because it deletes 1/nth of the words in the passage. For instance, an
every 5th word deletion ratio would result in 1/5th of the words being
blanked out of the text. By this technique, the number of words
correctly replaced (by the exact-word scoring procedure) or the
number of contextually appropriate words supplied (by the
contextually appropriate scoring method) is a kind of overall index of
the subject's ability to process the prose in the text. Or alternatively
the average score of a group of examinees on several passages may be
taken as an indication of the comprehensibility of each text to the
group of subjects in question. Or from yet another angle, constraints
within any text may be studied by comparing scores on individual
items.
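To make the fixed-ratio method concrete, a minimal sketch in Python follows. The function name and the convention that a 'word' is simply anything between spaces are conveniences of the sketch rather than part of the standard procedure; a real test constructor would also weigh the cautions about editing items discussed below.

    def make_cloze(text, n=5, blank="________"):
        # A minimal sketch of the fixed-ratio (every nth word) deletion method.
        # Returns the mutilated passage and the list of deleted words, which
        # can later serve as the key for exact-word scoring.
        words = text.split()  # simplification: punctuation stays attached to words
        deleted = []
        for i in range(n - 1, len(words), n):  # delete word n, 2n, 3n, ...
            deleted.append(words[i])
            words[i] = blank
        return " ".join(words), deleted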
Another type of cloze procedure (or family of them) is what has
been called the variable-ratio method. Instead of deleting words
according to a counting procedure, words may be selected on some

other basis. For instance, it is possible to delete only words that are
richly laden with meaning; typically these would include the nouns,
verbs, adjectives, adverbs, or some combination of them in the text in
question. Another version leaves out only the so-called function
words, e.g., the prepositions, conjunctions, articles and the like.
It is also possible to use an every nth word procedure with some
discretionary judgement. This is probably the most commonly used
method for classroom testing. Instead of only deleting words on a
counting basis, the counting technique may be used only as a general
guide. Thus, it is common practice to skip over items such as proper
nouns, dates, and other words that would be excessively difficult to
replace. When the test constructor begins to edit many items in a
text, however, he should be aware of the fact that the cloze test thus
derived is not necessarily apt to generate the usual reliability and
validity. Neither are all of the previous generalizations about other
properties of cloze tests apt to be true in such a case.
Because of the fact that cloze items are usually scattered over an
entire text on some fixed or variable ratio method, cloze tests are
generally tests of discourse level processing. Further, it has been
shown that performance on cloze items is affected by the amount of
text on either side of a blank up to at least fifty words plus (Oller,
1975). Apparently cloze items reflect overall comprehension of a text.
Not every item is sensitive to long-range constraints (Chavez, Oller,
Chihara, and Weaver, 1977), but enough items apparently are
sensitive to such constraints to affect overall performance.
It is difficult to imagine anyone filling in the blanks on a cloze test
correctly without understanding the meaning of the text in the sense
of mapping it onto extralinguistic context - hence, cloze tests seem to
meet the second of the two pragmatic naturalness constraints. But
what about the temporal sequence of cloze items and cloze test
material? Are there time constraints that challenge short term
memory? In response to this second question, consider the following
brief text:
(5) 'The general content and overall plan of the previous
edition have proved so well adapted to the needs of its users that
an attempt to change its essential character and form _____
inadvisable' (from the Preface to Webster's Seventh New
Collegiate Dictionary, 1969, p. 4a).
The word deleted from the blank is 'seems'. In order to answer the
item correctly, the reader has to process the preceding verb 'have
proved' twenty words earlier. Further, he presumably has to hold in
attention the subject noun phrase 'an attempt' which appears seven
words back. The example illustrates the considerable length of
segments that must be taken into account in order to fill in some of the
blanks in a passage of text. Unless the short term memory is aided by
a fairly efficient grammatical system that processes segments fairly
quickly leaving the short term memory free to handle new incoming
segments, errors will occur. At least some of those errors will be the
result of a breakdown in the processing of long range temporal
constraints.
Thus, it can be reasoned that there are time constraints on cloze
items. Therefore, cloze tests generally satisfy the first pragmatic
naturalness requirement: they require the learner to process temporal
sequences of elements in the language that conform to normal
contextual constraints. Although it has sometimes been argued that
cloze items are only sensitive to contexts of about five to ten words on
either side of a blank (Aborn, Rubenstein, and Sterling, 1959,
Schlesinger, 1968, Carroll, 1972, and Davies, 1975), these claims have
been shown to be generally incorrect (Oller, 1975, Chihara, et al,
1977, Chavez, et al, 1977).
It is interesting to note that Wilson Taylor was intuitively aware of
the fundamental difference between fill-in-blank tests over single
sentences and cloze tests ranging over substantial portions of
connected discourse. In his first paper on the topic of cloze procedure,
he included a subsection entitled, 'Not a Sentence-Completion Test'
(1953, p. 417). After noting the superficial similarities between these
two types of tests, Taylor points out the basic differences: first, 'cloze
procedure deals with contextually interrelated series of blanks, not
isolated ones'; and second, 'the cloze procedure does not deal directly
with specific meaning. Instead it repeatedly samples the likeness
between the language patterns used by the writer to express what he
meant and those possibly different patterns which represent readers'
guesses at what they think he meant' (p. 417).
Perhaps the distinguishing quality of cloze tests can be stated
somewhat more succinctly: they require the utilization of discourse
level constraints as well as structural constraints within sentences.
Probably, it is this distinguishing characteristic which makes cloze
tests so robust and which generates their surprisingly strong validity
coefficients in relation to other pragmatic testing procedures (see
Chapter 3 above and the Appendix below).

C. Applications of cloze procedure

Among the applications of the cloze procedure which we will consider
briefly are the following: judging readability of textual materials,
estimating reading comprehension, studying the nature of contextual
constraints, estimating overall language proficiency (especially in
bilinguals and second language learners), and evaluating teaching
effectiveness.

1. Judging the difficulty of texts. According to Klare (1976), the
interest in judging the readability of texts has been around for a long,
long time: 'Lorge (1944), for example, reports attempts by the
Talmudists in 900 AD to use word and idea counts to aid them in this
task' (p. 55). Apparently, the motive for this interest has always been
to make written materials more effective for communicative
purposes. In the service of this objective, many readability formulas
have been developed - most of them in this century. Klare (1974-5)
reports that before 1960 there were already at least 30 formulas of
various sorts in the extant literature. Since 1960, many other formulas
have been developed. All of them tend to rely on one or more of the
following criteria: sentence length, vocabulary difficulty, references
to persons, and possibly indices of syntactic and morphological
complexity. A recently developed formula by Botel and Granowski
(1972) requires the rating of syntactic units on a scale of zero to three.
The most widely used formulas, however, the Dale-Chall (Dale and
Chall, 1948a, 1948b) and the Flesch (1948) rely on the average
sentence length in words plus one or two other criteria. For the Dale-
Chall formula only one additional index is required - the number of
unfamiliar words. The Flesch formula requires two additional pieces
of information - the number of affixes and the number of references
to people. Detailed instructions for both formulas are given in the
original sources.
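For reference, the variant of Flesch's formula most often cited in later literature estimates 'reading ease' from average sentence length and average syllables per word, rather than from the affix and person-reference counts just described. A minimal sketch of that commonly cited form follows; the function name and the example counts are illustrative only.

    def flesch_reading_ease(total_words, total_sentences, total_syllables):
        # Commonly cited form of the Flesch Reading Ease score; higher
        # values (roughly 0-100) indicate easier text.
        asl = total_words / total_sentences   # average sentence length in words
        asw = total_syllables / total_words   # average syllables per word
        return 206.835 - 1.015 * asl - 84.6 * asw

    # A 100-word sample with 5 sentences and 140 syllables:
    print(round(flesch_reading_ease(100, 5, 140), 1))  # about 68.1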
On the whole, the readability formulas have not been as successful
as it was originally hoped they might be. Glazer (1974) contends that
the formulas are inaccurate because of the fact that 'all language
elements can, in some way, be involved in the reading comprehension
process' (p. 405), but only a few of the elements can easily be
incorporated in the indices of the formulas. She goes on to say,
'attempts to develop formulas to include these [additional] variables
has resulted in the development of complex instruments too
cumbersome for practical use' (p. 405). One of the shortcomings of
practically all of the formulas is that a pivotal element is sentence
length, yet it has been shown that sentence length per se does not
necessarily make content less recoverable. In fact, increasing the
length of sentences may improve readability in many cases as has
been shown experimentally by Pearson (1974). For example, two
sentences like, 'The chain broke. The machine stopped,' are not
necessarily easier to comprehend than a single sentence such as,
'Because the chain broke, the machine stopped.' Pearson concludes
that 'readability studies must begin with the question: what is the best
way to communicate a given idea?' (p. 191).
Other solutions, frequently discussed in the literature, include
appeal to subjective judgements and attempts to measure the
comprehensibility of different texts via a variety of testing techniques.
There is some evidence that subjective judgements of sentence
complexities, word frequencies, and overall passage difficulties may
have some validity. However, in order to attain necessary levels of
reliability, many judges are required, and even then, the judgements
are not very precise.
Klare (1976) showed that the overall ranking of five passages from
the McCall-Crabbs Standard Test Lessons in Reading (1925) by fifty-
six professional writers was identical to the ranking based on
multiple choice questions over the passages and another ranking
based on the Flesch Reading Ease formula (Flesch, 1948). Richards
(1970b) similarly demonstrated that judgements of familiarity of
words were fairly reliable (.775 for the last list of 26 words presented
in different dummy lists on two different occasions two weeks apart
for forty judges). Glazer (1974) argues that similar judgements can
reliably be made for sentence difficulties.
However, comprehension questions and judgements of difficulty
are apparently too imprecise for some purposes. Though both seem
to offer higher validities than the formulas, neither is a very consistent
measuring technique. Klare (1976) for instance notes wide discrepancies
in the judgements of difficulty levels offered by the fifty-six
professional writers who participated in his study. Only one of the
five passages was not ranked in all possible positions by at least some
of the judges. That was the most difficult passage. No one rated it the
easiest. But, every possible rank was assigned to each of the four
remaining passages.
Multiple choice questions are similarly fraught with problems.
Who is to say that questions written for different passages are
themselves of equivalent difficulty? That is, suppose a set of multiple

choice questions is constructed for the first passage in Klare's study,
another set for the second, and so forth for all five passages of text, as
in fact was done in order to create the widely used McCall-Crabbs
Standard Test Lessons in Reading. How can we be certain that the
questions asked in relation to passage one are of the same difficulty as
those asked in relation to passage two? Three? Four? And so forth.
In judging the various sets of multiple choice questions for difficulty,
we are in precisely the same boat as in judging the passages. Is there
no escape from this circle?
It was these kinds of considerations that led Wilson Taylor (1953)
to propose the cloze procedure as a basis for measuring the
readability of prose. Klare, Sinaiko, and Stolurow (1972) point out
that the average score of a group of subjects on a cloze test is an actual
measure of readability whereas the formulas, and even the multiple
choice question techniques, are probably best regarded only as
methods for estimating difficulty levels. Indeed, the cloze procedure
is a suitable device for validating (that is, testing the validity of) the
other techniques. In anticipation of the well aimed criticism of the
formulas by Glazer (1974), twenty-one years ago, Taylor pointed out
that 'a cloze score appears to be a measure of the aggregate of
influences of all factors [his italics] which interact to affect the degree
of correspondence between the language patterns of transmitter and
receiver' (1953, p. 432). The formulas, on the other hand (as Glazer,
1974 noted) take into account only some of the superficial aspects of
the text.
Taylor (1953) and Klare (1974-5) recommend that the best
estimates of readability may be obtained by 'clozing' every word in
sample texts. To keep our reasoning simple, suppose that someone
wants to measure the readability of three stories for fourth grade
children in a certain school. If there are 75 fourth graders in the
school in three separate classrooms with 25 in each, the tests might be
set up as follows: first construct five test forms over each text.
Assuming that each text is at least 250 words long, with an every fifth
word deletion procedure, it is possible to construct five forms over
each text by deleting the fifth, tenth, fifteenth, ... and so forth on the
first pass; the fourth, ninth, fourteenth, ... on the second; the third,
eighth, ... on the third; the second, seventh, ... on the fourth; and the
first, sixth, ... on the fifth pass. Suppose then that the test forms are
stacked and distributed as follows:
Form   Text   Student   Class
 1      1        1        1
 1      2        2        1
 1      3        3        1
 2      1        4        1
 ...
 5      3       15        1
 1      1       16        1
 ...
 4      1       25        1
 4      2       26        2
 ...
 5      3       75        3

The first student receives form 1 of text 1. The second student gets
form 1 of text 2. The third student receives form 1 of text 3. The fourth
student then gets form 2 of text 1 and the procedure continues
recycling until all twenty-five children in class 1 have received tests.
The procedure is then continued for student number 26 through 50
(i.e., all of those in class 2), and for 51 through 75. Each child then
takes one form of the five possible forms over one of the three texts.
The selection of subjects who take the test over text 1 as opposed to
text 2 or text 3 is quite random. It is exceedingly unlikely that all of the
better readers happen to be seated in every third seat - or that all of
the not-so-good readers are seated in a similar arrangement.
Thus, after testing, the average score of the 25 children who
attempted text 1 can be compared against the average of the 25 who
did one of the tests over text 2 and the 25 who did one of the tests over
text 3. In short, the difficulties of the three texts can thus be directly
compared.
If it is desired to obtain a finer degree of accuracy, the measures can
be repeated by administering a new set of tests stacked as follows on
two additional occasions:

On the second test day:

Form   Text   Student   Class
 1      2        1        1
 1      3        2        1
 1      1        3        1
 2      2        4        1
 ...

Hence, on occasion one, the first, fourth, seventh, ... through the
seventy-third students take one of the five forms over text 1; the
second, fifth, eighth, ... and seventy-fourth students take one of the
five forms over text 2; the third, sixth, ninth, ... and seventy-fifth
students take one of the forms over text 3. On the second occasion,
the situation is adjusted so that the first, fourth, ... through seventy-
third students take one of the five tests over text 2 and the remaining
two groups take tests over text 3 and text 1 in that order. On the third
occasion of testing, the situation is once more adjusted so that the
first, fourth, ... through seventy-third students take one of the five
forms over text 3 and the remaining two groups take one of the tests
over texts 1 and 2 respectively.
By this latter method, the learning effect of taking one of the tests
on day one, and another on day two and yet another on day three is
spread equally over all three texts. In fact, the experimenter would
have 75 points of reference over which to average in order to compare
the readability of the three passages. Further, if there were any reason
to do so, the experimenter could compare performance over the
fifteen separate tests - that is, the five forms over text 1, the five over 2,
and the five over 3. For each of those comparisons, there would be
(75 x 3 = 225 divided by 5 = 45) exactly 45 different scores.
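The rotation just described can be stated compactly. The following minimal sketch assigns each of the 75 students a text and a form on each of the three occasions; the zero-indexing and the assumption that forms keep cycling in the same order on later occasions are conveniences of the sketch rather than requirements of the design.

    def assignment(student, occasion):
        # student: 0-74, occasion: 0-2; returns (text, form), both 0-indexed.
        text = (student + occasion) % 3  # each student's text rotates across occasions
        form = (student // 3) % 5        # forms cycle in blocks of three students
        return text, form

    # On the first occasion, the first student gets form 1 of text 1 and the
    # fourth student gets form 2 of text 1, as in the first table above:
    print(assignment(0, 0))  # -> (0, 0)
    print(assignment(3, 0))  # -> (0, 1)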
Ultimately the teacher or educator doing the testing will have to
make intelligent decisions about what procedure will be best suited to
measuring readability levels in a given educational setting. However,
some general guidelines can be offered based on the extensive
research literature using the cloze procedure as a measure of
readability. A minimum of 50 blanks in a given cloze test, and at least
25 points of reference (that is, individual subject scores) for every
desired comparison, will generally assure sufficient reliability. By
now the research has made it abundantly clear (see Potter, 1968,
Klare, Sinaiko, and Stolurow, 1972, Klare 1974-5) that the cloze
procedure affords substantially more trustworthy information about
the difficulty levels of samples of text than any other method yet
devised. Other pragmatic text processing tasks may work equally
well, but this has not been demonstrated. What has been shown
convincingly is that the cloze technique works much better than the
readability formulas.
Anderson (1971b) offers the following suggestions for determining
what a given cloze score means with respect to a non-native student's
understanding of a particular passage. A cloze score of .53 or above
(by the exact word method, see below) corresponds to what has
traditionally been called the 'independent level of reading'. A score
between .44 and .53, he suggests, is in the appropriate range for
instructional materials, the so-called 'instructional level'. However, a
score less than .44 falls into the 'frustrational level' and would not,
according to Anderson, be appropriate for instructional use.
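Anderson's cut-offs are easy to apply mechanically. A minimal sketch follows; the function name and the handling of scores exactly at a boundary are conveniences of the sketch rather than Anderson's own formulation.

    def reading_level(proportion_correct):
        # Classify an exact-word cloze score (0.0-1.0) by Anderson's (1971b)
        # suggested ranges for non-native students.
        if proportion_correct >= 0.53:
            return 'independent'
        elif proportion_correct >= 0.44:
            return 'instructional'
        else:
            return 'frustrational'

    print(reading_level(0.47))  # -> instructional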
Presumably, the percentages of correct items required for similar
judgements concerning native speakers should be adjusted down-
ward somewhat. No comparison between natives and non-natives is
given in the Anderson (1971b) paper, but Potter (1968) may offer a
clue. He suggests that a widely accepted rule of thumb concerning
multiple choice comprehension questions is to judge reading
materials suitable 'for a pupil's instructional use if he responds
correctly to 75 percent or more of the items' (p. 7). In a study of native
speakers, Bormuth (1967) found that 'cloze scores between 40 and 45
percent have been found comparable to the 75 percent criterion'
(Potter 1968, p. 7).
Thus, it would appear that the lower bound on materials in the
instructional range for native speakers (.40 to .45) is about the same
as for non-natives (.44). This is contrary to the expectation that it
might be lower for natives. Of course, it is important to realize that
guidelines of this type can never be made any more accurate than the
meanings of the terms 'independent', 'instructional', and 'frustrational'.
Nevertheless, insofar as such terms have determinate
meanings, they may help in interpreting either the cloze score of an
individual or the average score of a group over a given text or perhaps
a range of texts.

2. Rating bilinguals. Although the first applications of cloze
procedure were to the measurement of readability levels of texts, it
became apparent almost immediately that the technique would find
many other uses. Osgood and Sebeok (1965) suggested that it might
be used for assessing the relative proficiency of a bilingual person in
his two languages. Taylor had hinted at this possibility even earlier in
an article reviewing research results in 1956. He had suggested that 'it
also seems possible to use the cloze method ... for testing the progress
of students learning a foreign language' (p. 99).
A not very encouraging conclusion along the line of testing foreign
language proficiency was reached by Carroll, Carton, and Wilds
(1959). Cloze tests were developed in English, French, and German.
The authors claimed that the cloze technique measures language skill
quite indirectly and that there may be a specific skill involved in
taking cloze tests that is not strongly correlated with language
proficiency. There were several problems which may have led them to
this conclusion, however. First, the criterion measure (against which
the cloze tests were correlated) was not another language test, but was
in fact the Modern Language Aptitude Test - which itself has
surprisingly weak validity (Carroll, 1967). A second difficulty was
that the tests developed were not of the standard every nth deletion
ratio. (We discuss below the fact that other procedures seem to work
less well on the whole.) A third problem was that the generalizations
were based on quite small sample sizes with subjects who may have
been atypical in the first place. (See the discussion in Chapter 7 above
concerning the problems of referencing tests against the performance
of non-natives who have had only a classroom exposure to the
language.) In any event, in spite of these facts, the conclusions of
Carroll, et al probably were the main factor in discouraging much
further research with cloze as a measure of language proficiency per se
until the late 1960s and early 1970s. Since then, the cloze technique
has proved to be a very useful measure of language proficiency.
Holtzman and Hopf presented a paper to the Speech Association
of America in December 1965 on cloze procedure as a test of English
language proficiency, but it was not until 1968 that Darnell published
his study revealing some of the potential merits of cloze as a measure
of proficiency in English as a second language. After 1968, many
researchers began to take interest. Spolsky (1969b) almost immediately
summarized the results of Darnell's 1968 study in the
TESOL Quarterly.
For 48 non-native speakers of English, Darnell found a reliability
of .86 for a cloze test (scored by a somewhat complex scoring
procedure which we will return to below in section D.4.iii) and a
correlation of .83 with the Test of English as a Foreign Language. In
view of the estimated reliability of the TOEFL and of the cloze test
Darnell used, the correlation of .83 between the two tests
approximates the maximum correlation that could be found even if
the two tests were perfectly correlated - i.e., if the TOEFL did not
produce any variance not also generated by the cloze test.¹
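The reasoning here rests on the standard psychometric ceiling on observed correlations: two tests cannot correlate more highly than the square root of the product of their reliabilities. A minimal sketch follows; the TOEFL reliability figure is hypothetical, chosen only to illustrate the calculation, since the text does not report it.

    import math

    cloze_reliability = 0.86   # Darnell's figure, as reported above
    toefl_reliability = 0.90   # hypothetical value, assumed for illustration

    # Maximum observable correlation between the two tests:
    ceiling = math.sqrt(cloze_reliability * toefl_reliability)
    print(round(ceiling, 2))   # -> 0.88, so the observed .83 is near the ceiling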
Shortly after Darnell's study was completed, several reports
pointing to similar conclusions began to appear. An Australian
doctoral dissertation by Jonathon Anderson was completed in 1969
showing remarkably high reliability and validity coefficients for cloze
tests of 50 items or more used with non-native speakers of English. A
Master's thesis at UCLA by Christine Conrad yielded similar results
in 1970.
Subsequent research by Oller, Bowen, Dien, and Mason (1972)
extended the technique to tests across languages and provided
evidence that the method could be used to develop roughly equivalent
tests in different languages. Their study investigated tests in English,
Thai, and Vietnamese. The fact that the extension was made to such
unrelated and vastly different language systems provided empirical
support for the hope that parallel tests could be developed by a simple
translation procedure.² Encouraging results have also been reported
by McLeod (1975). He produced tests in Czech, English, French,
German, and Polish. The translation method of deriving equivalent
tests has been further researched by Johansson (in press) who used
cloze tests in English and Swedish as a basis for what he called a
'bilingual reading comprehension index'.

¹ Any readers who may have skipped over the statistical portion of Chapter 3, section
E.1 may want to return to it before pressing on.
² Osgood and Sebeok (1965) had suggested just such a procedure 7 years before the first
data became available. Interestingly, Osgood is credited by Taylor (1953) as having
inspired the development of the cloze procedure in the first place. Johansson (in press)
cites Osgood and Sebeok with reference to the translation, but Oller et al
(1972), McLeod (1975), and Klare (1974-5, esp. p. 95f) were all apparently unaware of
the suggestion by Osgood and Sebeok.

Johansson (in press), however, voiced an old skepticism concerning
the standard every nth word deletion procedure for the
assessment of reading comprehension. Therefore, he created tests by
deleting only content words (i.e., nouns, verbs, adjectives, adverbs,
and generally words that appeared to be heavily laden with meaning).
He reasoned that deletion of function words (e.g., prepositions,
conjunctions, particles, and the like) would create difficulties that
would not reflect the true comprehensibility of a text to a non-native
speaker of the language in question. This would lead to incorrect
assessment of relative rates of reading comprehension across
languages. Perhaps Johansson's argument has merit if applied
exclusively to some notion of 'reading comprehension' across
languages. However, if we are interested in global language
proficiency, the very reasons offered against the standard every nth
deletion procedure would vie in favor of it.
Interestingly, the inventor of the cloze technique was also the first
to investigate the important questions concerning optimum deletion
procedures. Taylor (1957) addressed the following question raised by
'certain skeptics': 'Wouldn't sampling points created by deleting only
"important" words such as nouns or verbs yield more discriminating
results than the practice of counting out words without regard for
their differing functions?' (Taylor, 1957, p. 21). To answer the
question, Taylor constructed three cloze tests of 80 items each over
the same text. The first test consisted of items derived by the standard
cloze procedure. The second was based on content deletions only
(adverbs, verbs, and nouns) over words that had proved in previous
cloze tests over the same text to be difficult to replace - i.e., they
generated a greater number of incorrect responses. The third test was
based on function word deletions (verb auxiliaries, conjunctions,
pronouns, and articles) over items that had proved relatively easy to
replace - i.e., they generated fewer incorrect responses than content
words.
Taylor administered the three tests to randomly selected subgroups
of a sample of 152 Air Force trainees. Criteria against which the cloze
scores were correlated included pre and post comprehension
questions of the multiple choice variety over the same article that the
cloze tests were based on, and the Armed Forces Qualification Test
(purportedly a measure of IQ). On the whole, the cloze test
constructed by the standard method yielded higher reliability and
validity coefficients. Each test was administered twice - once before
the article was studied, and again after study and after about a week

had elapsed. The pre and post correlations for the cloze tests were .88
(standard version), .80 (function word version), and .84 (content
version). The pre and post correlations for the comprehension
questions (also administered once before the article was studied and a
week later after study) were .83, .74, and .74 for the respective groups
of subjects. Overall, the validity coefficients were highest for the
standard procedure. Similar results have been attained with non-native
speakers of English (compare correlation coefficients reported
by Oller and Inal, 1971 with a test of prepositions and verb particles
with corresponding coefficients for standard cloze tests reported by
Oller, 1972, Stubbs and Tucker, 1974, LoCoco, in press, Hinofotis
and Snow, in press).

3. Estimating reading comprehension. Ever since Taylor's first
studies in 1953, it has been known that cloze scores were good indices
of reading comprehension. Taylor's original argument was to show
that the differences between subjects tended to remain constant
across different texts. This is the same as saying that the cloze scores
of the same subjects on different texts were correlated. Since then, it
has been demonstrated many times over that cloze scores are
extremely sensitive measures of reading ability. Correlations between
cloze scores and multiple choice reading comprehension tests are
consistently strong - usually between .6 and .7 and sometimes higher
(Ruddell, 1965, Potter, 1968, Anderson, 1971a). This is consistently
true for both native and non-native subjects, though poorer results
are sometimes reported when the method of test construction is
radically altered (Davies, 1975, Weaver and Kingston, 1963).
Cloze scores have also been used in attempts to assess the amount
of information gained through study of a given text. The question
asked is, how much do scores improve if subjects are allowed to study
the unmutilated text before taking a cloze test over it. Taylor (1956,
1957) reported significant gains in cloze scores as a result of students
having the opportunity to study the text between test occasions. In
addition to the two administrations of the cloze tests, a multiple
choice test on the same article was also administered before study and
after study.
The cloze test seemed to be more sensitive to gains due to study
than was the multiple choice test. The average gain on the cloze test
was 8.46 points while on the multiple choice test the gain was 4.79
points. Though Taylor does not mention the number of points
allowed on the multiple choice comprehension test, it is possible to

infer that the cloze scores were in fact more sensitive to the study
effect from the statistics that are given relative to the respective
variances of the gain scores (for the statistically minded readers, these
were the t-ratios of 9.55 for the cloze test gains and 6.43 for the
multiple choice test gains). Further, the reliability of the cloze test was
higher judging from the test and re-test correlation of .88, compared
against .83 for the multiple choice test.
Also, though both tests were substantially correlated with each
other in the before-study condition (.70), they were even more
strongly correlated in the after-study condition (.80). This would seem
to indicate that whatever both tests were measuring in the after
condition, they measured more of it after subjects studied the article
than before they studied it. Perhaps this is because multiple choice
comprehension questions can sometimes be answered correctly
without reading the article at all (Carroll, 1972), but this is hardly
possible in the case of cloze scores since the cloze test is the article.
The fact that the correlation between the cloze test and the multiple
choice test is higher in the after-study condition than it is in the before-study
condition may suggest that the multiple choice test becomes a
more valid test after the material has been studied. In other words,
that the multiple choice test may become more sensitive to the
meaningful variance between subjects after they have studied the
article. Of course, it is not the test itself that has changed, rather the
way it impresses the minds of the subjects who are responding to it.
Along this same line, the multiple choice comprehension test and the
Armed Forces Qualification Test (an IQ test modeled after the
Stanford-Binet) correlated at .65 before study but at .70 after study.
The cloze scores, incidentally, were somewhat more strongly
correlated with the AFQT in both conditions (.73 and .74, for the
before and after conditions respectively).

4. Studying textual constraints. Aborn, Rubenstein, and Sterling
(1959) investigated constraints within sentences. They concluded that
the predictability of a word in a given context was generally inversely
proportional to the size of the grammatical class. That is, the larger
the number of words in a given grammatical class, the harder it is to
predict a particular member of that class in a sentence context.
Prepositions, for instance, are generally easier to predict than
adjectives. They also demonstrated that a blank with context on both
sides of it was easier to fill in than one with context only preceding or
only following. Their study, however, had one major limitation: they
were examining sentences in isolation from any larger context of
discourse.
MacGinitie (1961) criticized the study by Aborn et al for their
failure to examine contexts longer than single disconnected
sentences. It is interesting that Aborn et al had also criticized some of
the work prior to their own because it had failed to look beyond
contexts longer than 32 consecutive letters of text. MacGinitie took a
step further and suggested looking at contexts of prose materials up
to 144 words in length. He concluded, among other things, that
'context beyond five words did not help in the restoration of ...
missing words' (p. 127). However, he noted that 'this does not mean
that constraints never operate over distances of more than four or five
words' (p. 128). For instance, 'knowing the topic may have a ...
generalized influence that does not decline with decreasing length of
context in an easily specifiable way' (p. 128).
Among the important questions raised by the studies just cited are:
first, what sorts of constraints are cloze items sensitive to? And,
second, what sorts of constraints exist? These questions are
intertwined, but perhaps their intermingling is not hopelessly
confounded. We may be sure, at least, that if the cloze procedure is
sensitive to some sort of contextual constraint or other that the
constraint exists. Thus, the cloze procedure can be used both to assess
the existence of certain sorts of constraints (namely the ones that it is
sensitive to) and it may be used along with other techniques to
measure the strength of the effect of those contextual constraints.
MacGinitie (1961) cites M. R. Trabue as saying that 'the difficulty
of a test sentence is influenced rather markedly by the number and
character of sentences near it' (p. 128), but the evidence that this is so
is not very convincing in the MacGinitie study. Coleman and Miller
(1967) reached a conclusion very similar to the one MacGinitie came
to. On the basis of three different deletion procedures, they concluded
that guesses are 'constrained very slightly, if at all, by words' across
sentence boundaries. In fact, they argued that contexts beyond five
words did not affect cloze scores very much and that contexts beyond
twenty words from a given cloze item (in texts of 150 words in length)
did not significantly affect item scores at all.
There are some interesting contrary results, however. For example,
Darnell (1963) found that if the fifteen sentences of a certain passage
of prose that consisted primarily of what he called 'because' relations
were presented in orders other than the original so-called 'because'
order, cloze scores were adversely affected. That is, if the order of the

sentences was logically consistent with a deductive form of argument,
it was easier to fill in cloze blanks over the text than if the sentences
were presented in any one of six other orders investigated. Similarly,
Carroll, Carton, and Wilds (1959) showed that if a text were divided
into ten word segments with one cloze item inserted in each segment,
the items were much easier to fill in if the segments were presented in the
original order of the text than if they were presented in a scrambled
order. However, as MacGinitie (1961) noted, 'scrambling the order
of the segments probably not only obscures the paragraph topic, ...
but also reduces restoration scores through misdirection and
confusion' (p. 128).
Oller (1975) set out to show that cloze items are indeed sensitive to
constraints across contexts that exceed the five to ten word limit
suggested by previous research. The technique used was also a cut
and scramble procedure similar to that of Carroll et al (1959), except
that five separate lengths of context were investigated. Five prose
passages each of 100 words plus were successively cut into five, ten,
twenty-five, and fifty word segments. It was demonstrated that items
inserted in the five word segments were significantly more difficult
than the very same items in the ten word segments, and so on. In fact,
the difference between items in the full texts and items in scrambled
presentations of the fifty word segments was greater than the
corresponding difference between items in the twenty-five word
segments and in the fifty word segments, and so on.
While it is true that a cut and scramble procedure may create false
leads which will make the restoration of missing words more difficult
than it would be if there were no misleading contexts, is this not part
of the question - that is, whether or not cloze items are sensitive to
contexts beyond a certain number of words? Indeed it would seem
that cloze items are sensitive to constraints that range beyond the
previously estimated limits of five to ten words on either side of a
blank. If this is so, the procedure can be used to study the effects of
discourse constraints ranging across single sentence boundaries.
Incidentally, this result was also hinted at in the study by Miller and
Coleman (1967). They found that on the average, words were 9.2
percent easier to guess when none of their surrounding context (up to
a total of 149 words before and after) was mutilated than when an
every fifth word deletion technique was used. Furthermore, when
only preceding context was given, the average loss in predictability
was 30 percent.
One of the difficulties with the cut and scramble techniques used by

Carroll, Carton, and Wilds (1959) and by Oller (1975) could be
overcome if whole sentences were used as the criterial segments. If
cloze items are not sensitive to constraints ranging beyond the
boundaries of a single sentence, items inserted in a prose text should
be equally difficult regardless of the order of sentences in the text.
Chihara, Oller, Weaver, and Chavez (1977), however, have shown
that this is not true. Cloze items were inserted in texts on an every nth
word deletion basis. For text A every sixth word was deleted, and for
text B every seventh word was deleted. Then, the sentences of both
texts were systematically reordered. Forty-two native speakers of
English and 201 Japanese studying English as a foreign language
completed one of the texts in the sequential condition and the other in
the scrambled condition. In spite of the fact that the sentence
boundaries were left intact in both presentations, the scrambled
sentence versions were considerably more difficult.
Since unlike the Darnell (1963) study there was no attempt to select
passages that would maximize the effect of sentence order, perhaps
it is safe to infer that cloze items are in fact generally sensitive to
constraints across sentence boundaries in prose. Consider the
following example items from one of the texts used in the Chihara et
al study:
Joe is a freshman and he (1) IS having all of the problems
that most (2) FRESHMEN have. As a matter of fact, his
(3) PROBLEMS started before he even left home. (4)
HE had to do a lot of (5) THINGS that he didn't like
to do (6) JUST because he was going to go (7) AWAY
to college ...
Not all of the foregoing items proved to be easier in the normal text
condition than in the scrambled sentence condition, but items 3 and 4
were considerably easier in the full text condition. It is not too
difficult to see why this is so. Since the fact that Joe was having
problems is mentioned in the first sentence, it constrains the possible
responses to item 3. Similarly, the fact that Joe is the person being
talked about in sentences one and two constrains the possible
responses to item 4. By contrast, items 1, 2, 5, and 6 proved to be
about equally difficult in the normal textual condition and in the
scrambled sentences version.
On the whole, items that proved to be maximally sensitive to
discourse constraints were simply those that involved meanings that
were expressed over larger segments of discourse. Interestingly, they
were not limited to contentives (e.g., nouns, verbs, and the like). For

instance, the word 'the' was sensitive to long-range constraints in a
context where it implied reference to a pair of slacks that had been
mentioned previously. The words 'still' and 'so' in the phrases 'still
growing' and 'so fast' also proved to be sensitive to discourse
constraints (i.e., constraints beyond sentence boundaries) in a
context where Joe's father was talking about whether or not Joe
should buy a certain pair of slacks.
Many questions remain unanswered. Perhaps the rapidly growing
interest in textual constraints and their effects on memory (see
Tulving, 1972) and the corresponding research on discourse grammars
(Frederiksen, 1975a, 1975b, Schank, 1972, and Rummelhart, 1975)
will help to shed some light on the questions that can now be
formulated more clearly than before. There is every reason to believe
that the cloze procedure will continue to make an important
contribution to our meager but growing knowledge of the effects of
contextual constraints. What appears to be needed are clearer
notions of just what sorts of conceptual dependencies exist within
discourse.

5. Evaluating teaching effectiveness. If cloze scores are sensitive to
some of the sorts of variables discussed under earlier applications,
they must also be sensitive to many of the things that teachers are
trying to accomplish in the way of instructional objectives. For
instance, if cloze scores can be used to judge the readability of texts,
they can probably be used to evaluate the appropriateness of certain
curricular decisions concerning materials selected for instructional
use. If cloze scores are sensitive indices of language proficiency, they
can be used to measure attainment in foreign language classes, or in
bilingual educational settings.
Although much of what has been said in earlier portions of this
book questions the usefulness of discrete point objectives in the
teaching of grammar (in the narrow sense of phonology, morphology
and syntax), cloze tests are more useful than competing discrete point
alternatives for the evaluation of the effectiveness of an instructional
program in achieving such specific objectives. Furthermore, cloze
tests have the virtue (if applied to normal texts rather than isolated
sentences or highly artificial, contrived texts) of assessing points of
grammatical knowledge in normal contexts of usage. (See Chapter 8
section D above.)
A more appropriate application of the cloze technique in the
evaluation of instruction, however, would seem to be in relation to
particular contexts of discourse rather than points of grammar in the
abstract (not in real discourse contexts). For instance, if a language
learner has been taught to perform certain speech acts (e.g., ordering
a meal at a restaurant, buying a ticket at the airport, taking a
telephone message, and the like), cloze tests over samples of discourse
exemplifying such acts might be appropriate indicators of the
effectiveness of instruction. Or if a student has been asked to study a
certain text or article, a cloze test over portions of the text might be an
appropriate index of the effectiveness of study. If specific points of
grammar have been emphasized in the instruction, these could be
tested specifically with reference to actual usages in real contexts.
Oller and Nagata (1974) used a cloze test to determine the effect of
English as a foreign language in the elementary schools in a certain
prefecture in Japan. It was demonstrated that by the time children
who had been exposed to EFL in the elementary grades reached the
eleventh grade in high school, any advantage gained by having
studied EFL in the elementary grades had been overcome by students
who had not had EFL exposure prior to the secondary level.
Although the children who had received EFL instruction in the
elementary grades showed an initial advantage over children who had
not been exposed to such instruction, their advantage seemed to be
lost by the eleventh grade. The loss of advantage was attributed to the
fact that the children who got EFL in elementary grades were placed
together with children who had not had EFL and were in fact forced
to repeat materials they had already studied in earlier grades.
To date, the applications of cloze procedure to individual
classroom studies, though they may have been fairly widespread in
the last few years, have not been widely published. Nevertheless,
Kim's research at UCLA (1972) suggests that cloze tests are not
generally sensitive to any practice effect, and, therefore, they should
be quite applicable to routine classroom studies with repeated testing.

D. How to make and use cloze tests


We have already discussed several varieties of cloze testing
procedures in earlier parts of this chapter. Further, we have discussed
some of the advantages of what might be called the 'standard'
procedure. Here we outline some specific suggestions that may be
used as general guidelines in constructing cloze tests for a variety of
purposes.
test - filling in the blanks. Passages that require esoteric or technical
knowledge not generally available or which is available only to some
of the students should also be avoided. Texts that contain arguments
or statements that some of the students may strongly disagree with
(e.g., texts that state strong pro or con views on controversial issues
such as capital punishment, abortion, religion, politics, and the like)
should also probably be avoided. Passages that do not contain
enough words of running text to provide a sufficient number of
blanks (about 50) should probably be set aside in favor of longer
texts. Of course, there may be special circumstances where any or all
of the foregoing suggestions should be disregarded. For instance, it
would make no sense to avoid using a short text when it is that same
short text that has been studied. It would not make sense to avoid a
religious topic when the class is a course on religion, and so forth.

2. Deciding on the deletion procedure. Unless the purpose of the
testing involves a need to assess student performance on some
particular grammatical form, type of content or the like, an every nth
deletion procedure will probably work best. The length of the selected
text is also a factor to be taken into account when deciding on a
deletion procedure. If the text is 250 words or thereabouts, five 50
item cloze tests are possible with an every fifth word deletion
technique.
Simply begin counting from a word near the beginning of the text
and delete every fifth word thereafter until 50 blanks are obtained. It
matters little where the counting begins. Though some researchers
have left the first sentence intact and have begun their deletions on the
second sentence of a text, Klare, Sinaiko, and Stolurow (1972)
contend that this is not necessary - though there is no harm in it
either. For similar reasons, some have also left a certain amount of
unmutilated lead out at the end of the text - i.e., one or more
unmutilated sentences at the end of the test. This too is unnecessary,
though again, there is no harm in it.
If an every nth deletion ratio is desired, and if it is desired to leave
blanks in the entire text at roughly equal intervals, the test preparer
may simply count the number of words in the text and divide by fifty
to arrive at a suitable deletion ratio. For approximately 350 words of
text an every seventh word deletion ratio will produce 50 blanks (i.e.,
350 divided by 50 equals 7). For 400 words of text an every eighth
word deletion ratio will yield 50 blanks, and so on.
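Put as a rule of thumb, the deletion ratio is simply the word count divided by the desired number of blanks, rounded to a whole number. A one-line sketch follows; the function name and the minimum ratio of two are conveniences of the sketch.

    def deletion_ratio(word_count, desired_blanks=50):
        # Choose an every-nth ratio that spreads the desired number of blanks
        # over the whole text; never delete more than every other word.
        return max(2, round(word_count / desired_blanks))

    print(deletion_ratio(350))  # -> 7 (every seventh word)
    print(deletion_ratio(400))  # -> 8 (every eighth word)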
If the purpose is proficiency testing, there is probably nothing to be
gained by preparing multiple forms over a given cloze passage. Such
forms might have some instructional value, but Jongsma's 1971
review of cloze tests as instructional procedures per se expresses
skepticism. However, if multiple forms are desired, they can easily be
prepared by an every nth word deletion procedure as described
above in section C.1.

3. Administering the test. The following instructions to test takers
are appropriate to cloze tests constructed over normal sequential
passages of text. (Cloze items over isolated sentences, of course, do
not qualify as pragmatic tests at all and are not recommended.)

In the following passage, some of the words have been left
out. First, read over the entire passage and try to understand
what it is about. Then, try to fill in the blanks. It takes exactly one
word to fill in each blank. Contractions like can't, or words
written with a hyphen like well-being, count as single words. If
you are not sure of the word that has been left out, guess.
Consider the following example:
The ____ barked furiously, and the ____ ran up the
tree.
The words that correctly fill in the blanks are dog and cat.

These directions, of course, would have to be simplified for children
in the early grades, and probably they should be given orally and with
an example at the blackboard. However, they are quite adaptable to
almost any cloze test procedure for most any purpose. For instance, if
the subjects to be tested are non-native speakers of the language of the
test, and if they all happen to speak a common language as in a
foreign language classroom situation, the test instructions can be
presented in their native language.
If the test is gauged appropriately in difficulty, it is usually possible
for even the slowest students to have ample time to attempt every item
of a 50 item cloze test within a 50 minute class period. If the text is
easier, much less time may be required - as little as ten minutes per 50
item test for the fastest students. Non-native speakers, of course, take
longer than fluent natives, but unless the test is inappropriately
calibrated for the group, a 50 minute class period is usually long
enough for every student in a class to attempt every item on a 50 item
test. If the test is much too difficult, on the other hand, there is little
point in extending the amount of time allowed to complete the test.
Fortunately, cloze tests are extremely robust measures, and even if
a test is on the whole more difficult than would normally be
appropriate, or if it is too easy for a given group of students, it will still
provide some meaningful discrimination. Regardless of the difficulty
level of the test, the weakest students usually make some points, and
the strongest rarely make perfect scores. Thus, the test will usually
provide some meaningful discrimination even if it is poorly judged in
difficulty level. The scoring technique, of course, may greatly affect
the difficulty level of the test. A test that is too difficult by the exact
word technique may not be too difficult at all if a more lenient scoring
technique is used.

4. Scoring procedures. It should be noted at the outset that all of
the scoring methods that have ever been investigated produce highly
correlated measures.
correlated measures. The bulk of the change in scores that results
from a more lenient scoring procedure is simply a change in the mean
score. Usually, there is little change in the relative rank order of scores
- that is, high scorers will remain high scorers and low scorers will
remain low relative to all of the subjects taking the test. If a more
lenient scoring technique is used, all of the examinees may tend to
improve by about 15 points, say, but if they all improve equally, it
follows that the high scorers will still be at the high end of the rank
and the low scorers will still be at the low end.
Correlations as high as .99 have been observed between exact word
scores and scores that allowed synonyms for the exact words as
correct answers (Miller and Coleman, 1967). Usually the correlations
are in the .9 and up range (Klare, Sinaiko, and Stolurow, 1972) and
this also holds for non-native speakers of the language in question
(Stubbs and Tucker, 1974, Irvine, Atai, and Oller, 1974, Oller, Baca,
and Vigil, 1977, and their references). Thus, the exact word
technique, which is generally the easiest to apply, is usually (but not
always) to be preferred.

(i) The exact word method. The rationale behind the exact word
scoring method goes back to Taylor (1953). He reasoned that the
ability of a reader to fill in the very word used by a writer would be a
suitable index of the degree of correspondence between the language
system employed by the writer and that employed by the reader.
However, it can be argued that requiring the reader to replace the
very word used by the writer is too stringent a requirement. It
certainly places a severe limit on the range of possible responses. In
fact, it would keep the creative reader from supplying an even better
word in some cases.
For instance, consider item (3) exemplified above in the text about
Joe and his problems. The line reads, 'As a matter of fact, his
____ started before he even left home.' The correct word by the
exact word criterion is 'problems'. Other alternatives which seem
equally good, or possibly better by some standards, such as
'difficulties', 'troubles', 'trials', 'tribulations', or 'worries', would
have to be marked wrong - just as wrong, in fact, as such clearly off
the wall fill-ins as 'methods', 'learning', or worse yet 'of', 'be', or
'before'. Many writers would have preferred to use some word other
than 'problems' to avoid repeating that word in two consecutive
sentences.
Nonetheless, in spite of the fact that it does seem counter-intuitive
to give more credit for the very word the author used at a particular
point in a text than for some other equally or possibly even more
appropriate word, this whole line of argument is easily countered by
the fact that the exact word method of scoring cloze tests correlates so
strongly with all of the other proposed methods. As Swain, Lapkin,
and Barik (1976) note, however, when very fine discrimination is
desired, or for instructional purposes, it may make more sense to use
a scoring method that is more lenient.
If the total score on a composite of 50 items or more is the main
object of interest (e.g., in judging the difficulty of reading materials,
or possibly in rating the proficiency of non-native speakers of a
language, or in evaluating the relative proficiency of bilinguals in
their two languages), the exact word method probably loses very little
of the available information. However, if one is interested in specific
performances of examinees on particular items, much information
may be lost by the exact word technique. It would seem that some of
the other methods have some instructional advantages and there is
evidence to show that they are somewhat more strongly correlated
with other proficiency criteria than the exact word method is.

(ii) Scoring for contextual appropriateness. The exact word scoring
method is simple enough, but how do you score for something as
vague as 'contextual appropriateness'? The general guideline that is
usually stated is to count any word that fully fits the total surrounding
context as correct. There is no escaping the fact that such a suggestion
invokes the subjective judgement of some interpreter - namely, the
person or persons who must do the scoring.
It is not enough simply to say, count the exact word or any
synonym for it as correct. The notion of synonymy makes sense in
relation to some nouns, verbs, adjectives, and adverbs, but what is a
synonym for the word 'even' as in, 'He couldn't even arrive on time'?
There are other words that could replace it, such as 'just', 'barely',
'quite', and so on, but they are not synonyms of 'even'. Some of the
possible alternatives would be ruled out by other facts mentioned or
implied in the context. Usually the kinds of constraints that a native
speaker of a language is intuitively aware of cannot be easily made
explicit - but this does not mean that they are not real nor does it
mean that they cannot be used to good effect in scoring cloze tests.
Some general guidelines can be stated: first, see if the word in a
given blank is the exact word; if so, score it as correct; second, if it is
not the exact word, check to see if it fits the immediately surrounding
context, that is, whether or not it violates any local constraints in the
same sentence or surrounding phrases; third, ask if it is consistent
with all of the preceding and subsequent text (this includes previous
and subsequent responses in other blanks as filled in by the
examinee). If the response passes all of these checks, score it as
correct. Otherwise score it as incorrect. For example, consider the
following item:

He couldn't ____ arrive on time. He was always late to
everything she ever planned for him. It made little difference
that he was the Governor - he could have shown some
consideration for his wife.

A response like 'have' is clearly incorrect. It violates what we have
called 'local constraints'. It would require a past tense marker on the
verb 'arrive' but none is there, so 'have' would be an incorrect
response.
What about the word 'still' inserted in the same blank? It doesn't
violate any local constraints, but it does seem inconsistent with the
meanings that are stated and implied in the subsequent lines of the
text. There is nothing to indicate that the man referred to ever arrived
on time, and there is plenty to indicate that in fact he never did. Thus,
'still' is an inappropriate response.
On the other hand, 'ever' would be appropriate to all of the stated
and implied meanings. The word 'barely' would seem to fit the local
constraints, but not the long range ones. 'Just' seems inappropriate to
the total context, but it might be difficult to say exactly how it is
inappropriate.
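The three checks just described can be laid out as a simple decision procedure. The Python sketch below is only an added illustration; the two judgment functions are placeholders for decisions that, as the text stresses, can only be made by a rater who knows the language.

```python
def score_response(response, exact_word, fits_local, fits_global):
    """Score one cloze response by the three checks given above.
    `fits_local` and `fits_global` stand in for a human rater's
    judgments about local and long-range constraints; they cannot
    be computed mechanically."""
    if response == exact_word:
        return 1                      # first check: the exact word
    if fits_local(response) and fits_global(response):
        return 1                      # second and third checks both pass
    return 0                          # otherwise score as incorrect
```

On the example item, a rater's judgments would yield a score of 1 for 'ever' and 0 for 'have', 'still', 'barely', and 'just'.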
Because of the subjectivity of such a method, some might fear that
it would prove unreliable. If it were very unreliable, however, the high
correlations that are observed between scores generated on this basis
and scores based on the exact word technique could not arise.
Furthermore, careful research has shown that the technique is
sufficiently reliable that even test scores generated by different raters
may be regarded as equivalent for all practical purposes (see Hanzeli,
1976, and Naccarato and Gilmore, 1976). In addition, it has been
shown that fewer items fail to discriminate between degrees of
proficiency with the appropriate word scoring technique than with
the exact word technique (Oller, 1972).
Naccarato and Gilmore (1976) showed by a very stringent
statistical technique that one rater working alone could achieve
substantial reliability and that the scores generated by two raters
working separately were about equally reliable. In fact, if a different
native speaker scored each test in a batch of tests, the scores would
closely approximate the set of scores over all the tests that would be
obtained if any one of the raters scored all of the tests.
This result is important because it makes it possible for large
populations of subjects to be tested by the cloze technique and still
use the contextually appropriate scoring method. For instance, if
1,000 subjects were tested in a given educational setting, the problem
of scoring all the tests by the contextually appropriate method would
be nearly insurmountable if all of it had to be done by one person, but
if 100 scorers could be used, each one would only have to score 10
tests. This can easily be done in less than an hour.
Research has also shown that the contextually appropriate scoring
method does generate some meaningful variance that is lost if an
exact-word scoring technique is used. In spite of the very high degree
of variance overlap between the exact-word and the contextually-
appropriate scoring method (sometimes referred to as the
'acceptable-word method'), the latter often produces slightly higher
correlations with external criteria of validation - i.e., with other tests
that purport to measure similar abilities (Oller, 1972, Stubbs and
Tucker, 1974, Swain, Lapkin, and Barik, 1976, and LoCoco, in
press).
(iii) Weighting degrees of appropriateness. Some responses seem to
be more appropriate than others. In general, it seems that responses
which violate local constraints are more serious errors than responses
that violate longer range constraints. In fact, several schemas for
weighting responses for their degree of conformity to contextual
constraints have been suggested and in fact used. In this section, we
will consider two kinds of weighting systems. The first is based on
distinctions between types of errors in relation to some linguistic
analysis of the text, and the second is based on the performance of a
group of native speakers on the same items. In the latter case, the
frequency of occurrence of a given word in response to a given cloze
item by a specified reference group (e.g., native speakers) is used as a
basis for assigning a weighted score to responses to each item made by
a test group (e.g., non-natives).
The first method of weighting scores is based on an analysis
(whether implicit or explicit) of the text. Categories of errors are
distinguished along the following lines. The most serious is the sort
that violates the strongest and most obvious contextual constraints.
The least serious is the sort that violates the weakest and least obvious
contextual constraints. If a response violates constraints in its
immediately surrounding context, it will usually violate longer range
constraints as well, but this is not always the case. It is possible, for
instance, for a response to fit the sense of the text but to violate some
local constraint - e.g., the use of a plural form where a mass noun has
been deleted, as in 'the peoples of the world' instead of the 'people'. If
groups of people were the intended signification, the first form would
be correct, but if all the people of the world in the collective sense were
intended, the second form would be more correct. In either case, the
sense of the text is partly preserved even if the wrong morphological
form is selected.
A scale of degrees of appropriateness can be exemplified best with
reference to a particular cloze item in a particular context. Suppose
we consider item (3) from the text about Joe and his problems: 'His
____ started before he even left home.'
a. The best response (perhaps) is the very word used by the
author of the text - namely, 'problems'.
b. The second best (or perhaps, an equally good) response is a
close synonym for the deleted word - say, the word
'difficulties'.
c. Perhaps the next best response would be one that preserves
the overall intent of the text, but does so with an incorrect
form of the lexical item used - e.g., 'bewildering', instead of,
for instance, 'bewilderment'.
d. A response that is appropriate to the local constraints of the
item but which is not appropriate to the meaning of the text
as a whole would probably be judged more severely
(certainly it would be more incorrect than type b or c) - e.g.,
'methods'.
e. An even more severe error would be one that failed to fit
either the local or the long range constraints - e.g., 'before'.
On the basis of some such scale, points could be awarded
differentially. For example, a score of 4 could be awarded for the
exact word (category a), 3 for a perfectly acceptable substitute, 2 for
category c, 1 for d, and 0 for e. Or, categories a and b could be equally
weighted - say, 3 for a or b, 2 for c, 1 for d, 0 for e. Or, categories c and
d could be equally weighted - say, 2 for a or b, 1 for c or d, and 0 for e.
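Any of these scales reduces to a table of category weights. The fragment below, again only an illustrative sketch added to this edition of the text, encodes the first of the three scales just mentioned; it presumes a rater has already assigned each response to one of the categories a through e.

```python
# The first weighting scale described above: 4 for the exact word,
# 3 for a close synonym, 2 for category c, 1 for d, and 0 for e.
weights = {'a': 4, 'b': 3, 'c': 2, 'd': 1, 'e': 0}

def weighted_total(assigned_categories):
    """Sum the weights for a list of rater-assigned categories,
    one per cloze item, e.g. ['a', 'c', 'b', 'e']."""
    return sum(weights[category] for category in assigned_categories)
```

The alternative scales are obtained simply by changing the dictionary, e.g. {'a': 3, 'b': 3, 'c': 2, 'd': 1, 'e': 0} for the second.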
All of the foregoing systems of scoring tests for adult foreign
students studying English as a second language were investigated by
Oller (1972). The results showed all of the methods to be either
equivalent or slightly inferior to a method that counted categories a
and b worth 1 point and all the rest worth nothing - i.e., for scoring
purposes, differentially weighting degrees of appropriateness had no
significant effect on the overall variance produced. Not only were all
of the methods investigated highly correlated with each other, but all
were about equally correlated with scores produced by other
language tests. The contextually appropriate method yielded slightly
better results than the exact word method, however. Thus it is
probably safe to conclude that complex weighting systems do not
afford much if any improvement over the simpler methods of scoring.
One other system of weighting responses deserves discussion. It is
the technique referred to as 'clozentropy' by D. K. Darnell. No doubt
Darnell's system is the most complex one yet proposed as a basis for
scoring cloze tests, but it has a great deal of intuitive appeal.
Moreover, if computer time and programming expertise are
available, the technique is feasible - at least for research purposes.
The reason for discussing it here is not to recommend it as a
classroom procedure, but rather to show that it is roughly equivalent
to simpler methods that are more appropriate for classroom use.
Darnell was interested in testing the English skill of non-native
speakers - in particular, foreign students at the University of
Colorado at Boulder. Because of the negative results reported by
Carroll, Carton, and Wilds (1959, see the discussion in section C. 2
above), Darnell shied away from the exact word scoring method, and
opted to develop a whole new system based on the responses of native
speakers. He administered a cloze test to native speakers (200 of
them) on the assumption that the responses given by such a group
would be a reasonable criterion against which to judge the adequacy
of responses given by non-native foreign students to the same test
items. He reasoned that the responses given more frequently by
native speakers should be assigned a heavier weighting than less
frequent responses.
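The core idea - weighting each response by how often native speakers give it - can be sketched as follows. This is not Darnell's actual clozentropy formula, which is considerably more elaborate; it is only a simple frequency weighting added here to make the idea concrete, and the example responses and counts are invented for the illustration.

```python
from collections import Counter

def frequency_weights(native_responses):
    """Relative frequency of each word among the native-speaker
    responses to a single cloze item, used as that word's weight."""
    counts = Counter(native_responses)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

# Hypothetical criterion data for one item (counts are invented):
weights = frequency_weights(['justice'] * 120 + ['fairness'] * 60 + ['punishment'] * 20)
score = weights.get('fairness', 0.0)   # a non-native's response; unattested words score 0
```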
Without going into any detail, Darnell showed that his method
of scoring by the clozentropy technique generated surprisingly
high correlations with the Test of English as a Foreign Language
(Educational Testing Service and the College Entrance
Examination Board) - to be exact, .83 for a group of 48 foreign
students. The only difficulty was that Darnell's method required
about as much work as the preparation of an objectively scorable test
of some other type (that is, other than a cloze test). In short, the virtue
of the ease of preparing a cloze test was swallowed up in the difficulty
of scoring it.
Fortunately, Pike (1973) showed that there was a very high
correlation between Darnell's scoring technique and an exact word
scoring technique. Both scoring methods produced strong cor-
relations with the TOEFL. This latter result has been further
substantiated by Irvine et al (1974). If both exact and
contextually appropriate methods are substantially correlated with
each other (at .9 or better), and if exact and contextually appropriate
scores as well as exact and clozentropy scores are strongly correlated
with TOEFL (at .6 to .8 or better), it follows that exact, contextually
appropriate, and clozentropy scores are all strongly correlated with
each other.
From all of the foregoing, we may conclude that except for special
research purposes there is probably little to be gained by using a
complex weighting scale for degrees of appropriateness in scoring
cloze tests. Either an exact word or contextually appropriate method
should be used. The choice between them should be governed by the
purpose to which the scores are to be put. If the intent is to use only
the overall score, probably the exact word method is best. If the
intent is to look at specific performances on particular test items or in
particular student protocols, probably the contextually appropriate
scoring should be used.
(iv) Interpreting the scores and protocols. Robert Ebel (1972) has
argued that test validity in the traditional sense of the term may not
be as important as test interpretability. An important question that is
too often overlooked is what the test scores mean. In this chapter we
have examined what cloze scores mean with reference to a wide variety
of other tests that involve the use of written forms of language. We
have seen, for instance, that it is possible to form a rough idea about
the readability of a text on the basis of a particular group's
performance on cloze items formed over the text. We have also seen
that it is possible to form a notion of a given learner's ability to
comprehend textual material of a defined level of difficulty. Further,
we have seen that specific hypotheses about contextual constraints
can be tested via performance on cloze tests. Finally, group scores can
be used to assess the effectiveness of instructional methods.
There are yet other diagnostic uses of cloze tests. For instance, it is
possible to form specific hypotheses about individual learner
competencies. It is sometimes even possible to see precisely where a
given student went awry and with study it is sometimes possible to
form a good guess as to what led to the error - e.g., inappropriate
instructional procedures, or incomplete learning of constraints on
certain linguistic forms. As an example, consider the following items
from a cloze test protocol (the answers were given by a Japanese adult
studying English as a second language):
... A responsible society wants (30) punishment, not revenge.
It does (31) neither need to get revenge (32) ____ itself, nor
should it (33) ____ to do so.
The learner has seemingly been led to believe that the 'neither/nor'
construction is appropriate for such contexts. This could be the fault
of inappropriate pattern practice (i.e., incomplete exemplification of
usages of such a construction). It would certainly appear that the
learner is clued to place 'neither' in the blank for item 31 because of
the 'nor' that follows seven words later. The word 'punishment' in
item 30 reveals a partial understanding of the text, but an incomplete
understanding of the meaning of the word. Although 'punishment'
may not be synonymous with 'revenge', still it involves an element of
vengeance or retribution. The word that was actually deleted from
blank 30 was 'justice'.
In another test item, from another cloze test, the same student
reveals another interesting effect of instruction:
'I think Daddy (43) should do the dishes tonight.'
'Your father (44) have worked hard all day.' 'But so
have (45) you, Mother. You've worked hard all day
too, (46) haven't that so?' asked the oldest. ...
Item 46 is the one of greatest interest (though there is an error in item
44 as well). The learner, apparently having been trained by pattern
drilling of the sort exemplified in Chapter 6, notices the auxiliary
'have' in 'You've worked ...' and inappropriately forms the tag
question 'haven't that so?' Item 44 is also revealing inasmuch as it
shows an inadequate comprehension of the meaning of the present
perfect. No obvious explanation suggests itself for the learner's
failure to notice that the subject is singular. However, a study of other
protocols by this same learner shows that he often has difficulty with
the complex verb forms involving auxiliaries and modals, e.g.,
... marijuana has (3) been effects on people ...
... simple punishment is one of the most
often (28) use arguments for the death penalty ...
... Abortion (66) is
freed women from unwanted (67) baby ...
Note in the last example that the problem of the plural/singular
distinction arises again in item 67.
The foregoing examples do not begin to exhaust the richness of
even the sample protocols referred to for the single subject selected.
Perhaps they will serve, however, to illustrate some of the kinds of
information that can be inferred from learner protocols on cloze
tests.

KEY POINTS
1. The cloze technique was invented by Wilson Taylor who likened it to the
kind of closure characteristic of visual perception of incomplete patterns.
2. A cloze test is constructed by deleting words from a text or segment of
discourse.
3. It is not a mere sentence completion task, however, because cloze items
are not independent and range across the sequentially related sentences
of a discourse.
4. Closure of blanks left in text is possible because of the redundancy of
discourse and the internalized expectancy system that speakers of a
language possess.
5. The standard test construction technique, known as the fixed ratio
method, involves deleting every nth word (where n usually varies from 5
to 10) and replacing each one with a standardized blank (usually about
15 typed spaces).
6. The standard length of the cloze test is 50 items - thus the passage length
must be approximately 50 times n.
7. Another procedure, the properties of which are less determinate, is to
delete words on some variable ratio usually decided by a rational
selection procedure - e.g., delete only content words.

3 As we noted above, topics such as the decriminalization of marijuana, capital
punishment, and abortion are not universally recommendable. They may be
distracting because of their controversial nature. They were used in several tests
exemplified here, however, to test the effects of attitude variables in cloze tests over
precisely such topics (see Doerr, in press).
8. Cloze tests can be shown to require both temporally constrained
sequential processing and pragmatic mapping of linguistic elements onto
extralinguistic contexts; therefore, they are pragmatic tests.
9. It is possible to construct fill-in-blank tests that do not meet one or both
of the pragmatic naturalness criteria.
10. Cloze procedure has been shown to be a highly reliable method of
distinguishing the readability of texts; in fact, the technique appears to
be a good measure of the redundancy of discourse to particular
audiences.
11. The cloze method as a measure of readability is superior in validity to any
of the other techniques that have been proposed and studied carefully:
the readability formulas are less accurate.
12. As a rough guideline it is suggested that at least 25 scores on tests of at
least 50 items be used for comparative purposes - e.g., for comparing the
readability of texts, the language proficiency of groups and the like.
13. Anderson (1971b) suggests that cloze scores at .53 or above (by the exact-
word scoring) can be equated with an 'independent' level of reading for
the material tested; a score between .44 and .53 falls within the
'instructional' range; and below .44 in the 'frustrational' range.
14. Research by Bormuth (1967) indicates that scores in the .4 to .45 range
and up are suitable for instruction with native speakers insofar as they
are roughly equivalent to scores of .75 on the traditional multiple choice
tests of reading comprehension.
15. Osgood and Sebeok (1965) suggested using the cloze method for
assessing the relative proficiency of bilinguals in each of their languages.
16. Only recently has it been demonstrated that a simple translation
procedure may yield roughly comparable tests in two or more languages.
(However, see the equating procedure recommended in Chapter 10 p.
292f.)
17. Taylor (1957) demonstrated that the every nth word deletion technique
yielded cloze items of somewhat greater reliability and validity than
deliberate selection of words to be deleted - e.g., deleting only content or
only function words.
18. Subsequent research with non-native speakers suggests that cloze tests
based on an every nth word procedure are also superior for bilingual
testing.
19. Cloze scores are known to be good indicators of whatever is measured by
standardized reading comprehension tests, and appear to be somewhat
more sensitive to gains attributable to study than are standard multiple
choice items.
20. Deletion techniques that operate over every word in a text seem
especially well-suited to the study of contextual constraints.
21. Although there is some controversy in the literature, it seems empirically
established that cloze tests are sensitive to constraints that range across
sentences - though some items are more sensitive than others.
22. Cloze scores can also be used to evaluate the effectiveness of an
instructional procedure.
23. For most purposes, the standard method of deleting every nth word of a
text is best.
24. For an every 5th word deletion ratio, 250 words of text will yield 50
items; in general, the length of text needed can be roughly determined by
multiplying the number of desired blanks by the number of words
between deletions.
25. Instructions should encourage examinees to read the entire text before
attempting to fill in blanks and should inform them that there is no
penalty for guessing.
26. Most adult students (native or non-native) can fill in, or at least attempt,
50 items in a 50 minute period - perhaps a general guideline of one
minute per item could be followed for shorter tests, though for longer
tests such a rule of thumb would probably be overgenerous. For younger
subjects, on the other hand, it might be on the stingy side.
27. Correlations between exact-word scores and contextually appropriate
scores are sufficiently strong to recommend the exact-word scoring for
most purposes - however, research has shown that slightly finer
discrimination can be achieved by the contextually appropriate method.
28. Techniques that weight responses for degrees of appropriateness or for
degrees of conformity to native speaker performance do not seem to
yield finer degrees of discrimination for practical purposes.
29. It has been shown that contextually appropriate scoring can be done
either by a single rater working alone or by several raters, with equivalent
results (Naccarato and Gilmore, 1976).
30. In addition to the more common uses, cloze scores can also be
interpreted for diagnostic information concerning individual learner
performances.

DISCUSSION QUESTIONS


1. Discuss the factors that contribute to the difficulty (or the ease) of filling
in the blanks on a cloze test. Compare these with the factors discussed in
Chapters 10 and 11. To what extent can such factors be generalized to all
discourse processing tasks? What are the factors that contribute to the
differences across such tasks - e.g., what factors might make an oral cloze
task more difficult than a dictation over the same text?
2. Taylor (1953, 1956, 1957) believed that the cloze testing concept could be
extended to visual processing. Consider the possibility of a text
accompanied by audio-visual aids used as a basis for a cloze procedure
that might leave out whole phrases, clauses, or even exchanges between
persons, i.e., if the picture were left intact but the sound were
intermittently interrupted.
3. What does a person have to know in order to fill in the missing portions
of the items in Figure 19? How does that knowledge resemble (or fail to
resemble) the knowledge necessary to fill in the blanks in a cloze test over
the numbers 'one, two, ...' or the Gettysburg Address, or some less
familiar text?
4. Construct and administer a cloze test to a group of students. Analyze the
students' responses with a view to determining the strategies they
employed in filling in the items. Compare the responses of different
students. Compare the difficulty of different items and the constraints
operating on each. Which items do you think are sensitive to constraints
across sentences?
5. Construct multiple forms of cloze tests over a single text using an every
fifth word deletion procedure. Administer the tests in such a way that the
practice effect is distributed roughly equally over all parts of the text.
This can be done by handing out the five forms of the test so that the first
student gets form one, the second gets two, the third three, the fourth
four, the fifth five, the sixth one, and so on. On the second test day, hand
out the tests so that student one gets form two, the second gets form
three, the third gets form four, the fourth five, the fifth one, and so on.
This procedure is repeated three more times. At the end of the five test
sessions, each student will have completed all five forms of the cloze
passage. However, student one will have completed them in the order
one, two, ... five; the second will have completed them in the order two,
three, four, five, one; the third in the order three, four, five, one, two; and
so on. Thus, the practice of having taken one of the forms on day one and
a different one on occasion two is nearly equally spread over all five
forms of the cloze passage. Score all of the tests and count the percentage
of right answers (say, by the exact-word scoring to make it simpler) to
each item in each form. Write the percentage of correct answers above
each word in a full copy of the text. Which words were easiest and why?
Which were the hardest? What sorts of constraints seem to be operating
within phrases, across phrases, within clauses, across clauses, and so on?
6. If you are a foreign language teacher, you might want to try the
following: select a passage you think your students can handle in the
target language (i.e., in the foreign language in this case); translate it into
English; construct cloze tests over each text; use the same deletion
procedure (e.g., every fifth word, say); give both tests to your students
(target language passage first, then the passage in English). The scores
your students make in English should be roughly comparable to the
scores native speakers of the target language would have made on the test
in the target language. How do your students compare to native speaker
performance on the task? In other words, the difference between their
scores in English and their scores in the target language is roughly
equivalent to the difference between their scores in the target language
and scores that would have been achieved by a socioeconomically and
educationally comparable group of native speakers of the target
language.
7. What effect would you predict if the distance between blanks on a cloze
text were increased? Would it make the task easier or harder? Why?
8. Try cutting a text into five word segments and inserting a cloze item in the
middle of each segment. Type each segment on a separate card and
shuffle the cards - or use some other scrambling procedure. Then try
running the items together to form a text. What sorts of false leads are set
up by such a contrived cloze passage? How does such a text differ from
normal prose? Repeat the procedure with different lengths of segments.
Is the effect any different? When does the effect cease to be significant to
the reader? Consider shuffling the chapters of a book into a new order.
That is, consider the effects that would be produced. It isn't a very
practical experiment to suggest.
9. No one, to the knowledge of this writer, has ever investigated the effects
of different instructions on scores on cloze tests. What if learners were
not instructed to read over the entire text before attempting items? What
effect would you expect? What if they were told not to guess - that there
would be a penalty for guessing incorrectly? How would such changes in
instructions be likely to affect scores?
10. Try to conceptualize a scoring schema that assigns points for different
degrees of correctness. What advantages would your system have for
diagnostic purposes? Could such a schema be based on the errors that
learners are actually observed to make? Could the method you propose
be taught to someone else? Try it out with another scorer over a batch of
cloze tests. See how much the two of you agree or disagree in assigning
particular responses to the weighted categories you have set up.
11. Analyze the responses of a single student on various pragmatic tasks
including cloze tests completed by that student. Try to devise special
drills or exercises (based on discourse processing tasks) that will help this
student overcome certain determined deficiencies. For instance, if the
learner makes errors in the use of articles across all of the tasks, try to
determine with respect to particular texts what information exactly the
student is not attending to that the native speaker would have noticed -
e.g., the fact that a particular pair of slacks had been mentioned prior to a
certain item requiring 'the' in reference to that same pair of slacks (see the
text about Joe's problems and the discussion above on pp. 361-362).

SUGGESTED READINGS
1. Jonathon Anderson, 'Selecting a Suitable "Reader": Procedures for
Teachers to Assess Language Difficulty,' Regional English Language
Center Journal 2, 1971, 35-42.
2. John E. Hofman, Assessment of English Proficiency in the African
Primary School. Series in Education, University of Rhodesia, Occasional
Paper Number 3, 1974.
3. Stig Johansson, Partial Dictation as a Test of Foreign Language
Proficiency. Swedish-English Contrastive Studies, Report Number 3,
1973.
4. George R. Klare, 'Assessing Readability,' Reading Research Quarterly
10, 1974-1975, 62-102.
5. Bernard Spolsky, 'Reduced Redundancy as a Language Testing Tool,'
paper presented at the 2nd International Congress of Applied
Linguistics, Cambridge, England, September, 1969. ERIC ED 031-702.
6. Joseph Bartow Stubbs and G. Richard Tucker, 'The Cloze Test as a
Measure of English Proficiency,' Modern Language Journal 58, 1974,
239-241.
7. Merrill Swain, Sharon Lapkin, and Henri C. Barik, 'The Cloze Test as a
Measure of Second Language Proficiency,' Working Papers on
Bilingualism 11, 1976, 32-43.
8. Wilson L. Taylor, '"Cloze Procedure": A New Tool for Measuring
Readability,' Journalism Quarterly 30, 1953, 415-433.
9. Wilson L. Taylor, 'Recent Developments in the Use of "Cloze
Procedure",' Journalism Quarterly 33, 1956, 42-48, 99.
13

Essays and Related Writing Tasks

A. Why essays?
B. Examples of pragmatic writing
tasks
C. Scoring for conformity to correct
prose
D. Rating content and organization
E. Interpreting protocols

For many reasons writing is one of the foundational skills of educated
persons. Tests of writing skills are therefore needed. This chapter
explores traditional essay techniques and suggests both objective and
subjective scoring methods. It is recognized that the problem of
quantifying essay tasks is a crucial difficulty in school applications. A
method of interpreting learner protocols with a view toward helping
learners to overcome language difficulties is discussed. Though essay
testing may require more work of the teacher and of the students than
many other testing procedures, it is considered to be a profitable
assessment technique. Certainly it affords a rich yield of diagnostic
information concerning the learner's developing expectancy
grammar.

A. Why essays?
One of the most often used and perhaps least understood methods of
testing language skills is the traditional essay or composition. Usually
a topic or selection of topics is assigned and students are asked to
write an essay of approximately so many words, or possibly to write
an essay within a certain time limit. Sometimes, in the interest of
controlling the nature of the task, students may be asked to retell a
narrative or to expand on or summarize an exposition. For example,
the student might be asked to discuss the plot of a novel or short
story, or to report on the major themes of a non-fictional book or
passage of prose.
Essays are probably used so frequently because of the value placed
on the ability to organize and express our ideas in written form. This
ability is counted among the most important skills imparted by any
educational system. Witness the three R's. It is not known to what
extent the acquisition of writing skill carries over into other aspects of
language use - e.g., being an articulate speaker, a good listener, or an
insightful reader - but there is evidence that all of these skills are
substantially interrelated. Moreover, there is much evidence to
suggest that practice and improvement in one skill area may rather
automatically result in a corresponding gain in another skill area.
There is even some suggestion in the data from second language
learning that practice in speaking and listening (in real life
communication) may have as much of an effect on reading and
writing as it does on speaking and listening and conversely, pragmatic
practice in reading and writing (where communication is the goal)
may affect performance in speaking and listening at least as much as
reading and writing (see Murakami, in press).
However, in spite of their popularity, there is a major problem with
free composition tasks as tests. It is difficult to quantify the results -
i.e., to score the protocols produced by examinees. We will see below
that reliable methods of scoring can be devised, and that they are only
slightly more complex conceptually than the methods suggested in
Chapter 10 for the scoring of dictation protocols. Perhaps we should
not stress the difficulty of obtaining reliable scores, but rather the ease
and likelihood of obtaining unreliable (and therefore invalid) scores
on essay tasks.
As we noted earlier, in Chapters 10 and 11, tasks that require the
production of sequences of linguistic material that are not highly
constrained all present similar scoring problems. For instance, a
dictation/composition task, or a hear and retell task, or a creative
story telling task, and all similar production tasks whether written or
oral entail stubborn scoring problems. The oral tasks require an extra
step that is not present in the scoring of written protocols - namely
transcribing the protocols from tape or scoring them live (an even
more difficult undertaking in most cases). Further, the scorer must in
many cases try to infer what the examinee intended to say or write
instead of merely going by what appears in the protocol. If the
examinee intends to say 'he did it by a hammer', then there is no error
- i.e., if the person referred to performed the act near a hammer rather
than 'with' a hammer. On the other hand, if the examinee intended to
encode the instrumental meaning (i.e., the hammer was the
instrument used in the action performed) there is at least one error.
In this chapter we are concerned with defining methods of scoring
essay tasks in particular, and productive language tasks in general. It
is assumed that essay tasks are fundamentally similar to speaking
tasks in certain respects - namely in that both types of discourse
processing usually presuppose someone's saying (or writing) some-
thing for the benefit of someone else. If the writer has nothing to say
he is in very much the same boat as the speaker holding forth on no
particular topic. lfthe writer has something to say and no prospective
audience, he may be in the position of the child describing its own
performances with no audience in mind, or the adult who is said to be
thinking out loud.
Of course, these parallels can be drawn too closely. There are
important differences between acts of speaking and acts of writing as
any literate person knows all too well. The old saw that a person
should write as he would speak, or the popular wisdom that unclear
writing is the result of unclear thinking, like all forms of proverbial
wisdom, require intelligent interpretation. Nonetheless, it is taken for
granted here that much of what was said in Chapter 11 concerning
productive oral testing is applicable to writing tasks and need not be
repeated here, and conversely that much of what is suggested here
concerning the scoring of written protocols (especially essays) can be
applied to transcriptions of oral protocols or to tape recorded oral
performances.

B. Examples of pragmatic writing tasks


Just as it is possible to conceive of non-pragmatic tests of other sorts,
it is also possible to invent testing procedures that may require
writing, but which are not in any realistic sense pragmatic. Sentence
completion tasks, for instance, do not necessarily involve pragmatic
mapping onto extra-linguistic context, and they do not generally
involve higher order discourse constraints ranging beyond the
boundary of a single sentence. A typical school task that requires the
utilization of words in sentences (e.g., as a part of a spelling exercise)
would not qualify as a pragmatic task. Neither would a written
transformation task that requires changing declarative sentences into
questions, passives into actives, present tense statements into past
tense statements, etc. In general, any manipulative exercise that uses
isolated sentences unrelated to a particular pragmatic context of
discourse does not constitute a pragmatic writing task.
The key elements that must be present in order for a writing task to
qualify are similar to those for speaking tasks. The writer must have
something to say; there must be someone to say it to (either explicitly
or implicitly); and the task must require the sequential production of
elements in the language that are temporally constrained and related
via pragmatic mapping to the context of discourse defined by (or for)
the writer. Probably such tasks will be maximally successful when the
writer is motivated to write about something that has personal value
to himself and that he would want to communicate to someone else
(preferably the intended readership). Contrived topics and possibly
imagined audiences can be expected to be successful only to the extent
that they motivate the writer to get personally (albeit vicariously)
involved in the production of the text. An unmotivated com-
municator is a notoriously poor source of information. To the degree
that a task fails or succeeds in eliciting a highly motivated
performance, it will probably fail or succeed in eliciting valid
information about the writing ability of the examinee.
In this chapter we will consider writing tasks to be pragmatic if they
relate to full fledged contexts of discourse that are known to the writer
and that the writer is attempting to make known to the reader.
Protocols that meet these requirements have two important
properties that disjointed sentence tasks do not have. First, as
Rummelhart (1975) points out, 'Connected discourse differs from an
unrelated string of sentences ... in that it is possible to pick out what
is important in connected discourse and summarize it without
seriously altering the meaning of the discourse' (p. 226). Second, and
for the same reasons, it is possible to expand on a discourse text and
to interpolate facts and information that are not overtly stated.
Neither of these things is possible with disjointed strings of sentences.
Writing about a poignant experience, a narrow escape, recol-
lections of childhood, and the like all constitute bases for pragmatic
essay tasks. Of course, topics need not focus on the past, nor even on
what is likely. They may be entirely fictional predicated on nothing
but the writer's creative imagination. Or, the writing may be
analytical or expository. The task may be to summarize an argument;
retell a narrative; recall an accident; explain a lecture; expand on a
summary; fill in the details in an incomplete story; and so on.
There really is no limit to the kinds of writing tasks that are
potentially usable as language tests. There is a problem, however,
with using such tasks as tests and it has to do with scoring. How can
essays or other writing tasks be converted to numbers that will yield
meaningful variance between learners? Below we consider two
methods of converting essay protocols and the like to numerical
values. The first method involves counting errors and determining the
ratio of the number of errors to the number of opportunities for
errors, and the second technique depends on a more subjective rating
system roughly calibrated in the way the FSI rating scales for
interview protocols are calibrated.

C. Scoring for conformity to correct prose


It is possible in scoring essays to look only at certain so-called
grammatical 'functors'. Such a scoring method would be similar to
the one used in scoring the Bilingual Syntax Measure (discussed in
Chapter 11, above). It would, however, be a discrete point scoring
method. For instance, the rater might check certain morphemes such
as plurals, tense indicators, and the like on 'obligatory occasions'.
The subject's score would be the ratio of correct usages to the number
of obligatory occasions for the use of such functors. This method,
however, does not necessarily have any direct relationship to how
effectively the student expresses intended meanings. Therefore, it is
considered incomplete and is rejected in favor of a method that
focusses on meaning (see Evola, Mamer, and Lentz, in press).
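For concreteness, the discrete-point ratio just described (and rejected) can be written out. The fragment below is only an added sketch, not part of the original discussion, and the function name is invented for the illustration.

```python
def functor_score(correct_uses, obligatory_occasions):
    """Ratio of correct uses of a grammatical functor (plurals,
    tense markers, etc.) to the number of obligatory occasions
    for its use, as in the discrete-point method described above."""
    return correct_uses / obligatory_occasions
```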
To score an essay for its conformity to correct prose, it is first
necessary to determine what the essay writer was trying to say. In
making such a determination, there is no way to avoid the inferential
judgement of someone who knows the language of the essay. Further,
it helps a great deal if the reader studies what is said in its full context.
Knowledge of the topic, the outline of the material, or any other clues
to intended meanings may also be helpful. To illustrate the
importance of context to the reader's task of discovering (or rather
inferring) the writer's intended meanings, consider the following
excerpts of unclear prose:
(1) I woke up when I was in the hospital with broken legs
because I had a flat tire.
(2) Adaptation is the way how to grow the plant or the
animals in their environments.
In the first case the task was to 'write about an accident you have
witnessed'. In studying the protocol, from the text surrounding the
sentence given as example (1) it is clear that the author was driving
carelessly when he had an accident. He blew a tire going seventy miles
per hour on an expressway. The next thing he knew he was in a
hospital with two broken legs. In the second case, the task was to
study a text and then write it from recall. The sentence given as
example (2) above was apparently an attempt to reproduce from
memory the sentence: 'An adaptation is a body structure that makes
an animal or plant more likely to survive in its environment.'
Once the reader has developed a notion of what the writer had in
mind (or what the writer should have had in mind, as in the case of
example 2 where the task is to recall a given text), it is possible to
assess the degree of conformity of what the author said to what a
more skilled writer might have said (or to what the text actually said
when the task is recall).
A simple method that can be used with essay scoring is to restate
any portion of the protocol that does not conform to idiomatic usage
or which is clearly in error. The error may be in saying something that
does not express the intended meaning - for instance, the author of
example (1) above did not wake up because he had a flat tire - or the
error may be in simply saying something quite incorrectly - for
instance, adaptation is not 'the way how to grow the plant or the
animals' as stated in example (2) above, rather it is a change that
makes plants or animals better suited to their environments.
Rewriting the protocol to make it conform to standard usage and
also to express the intended meaning of the author may be difficult,
but it is not impossible, and it does provide a systematic basis for
evaluating the quality of a text. Furthermore, rewriting an essay to
express the intended meanings (insofar as they can be determined by
the scorer) requires evaluation not just in terms of how well the text
conforms to discrete points of morphology and syntax, but how well
it expresses the author's intended meanings. There is guesswork and
inference in any such rewriting process, but this reflects a normal
aspect of the interpretation of language in use whether it is spoken or
written. Indeed, the difficulties associated with making the right
guesses about intended meanings should reflect the degree of
conformity of the essay to normal clear prose. Hence, the guessing
involved is not a mere artefact of the task but reflects faithfully the
normal use of language in communication.
In a classroom situation, there are many reasons why the rewriting
of each learner's protocol should be done in consultation with the
writer of the essay. However, due to practical considerations, this is
not always possible. Nonetheless, whether the procedure can be done
consultatively or not, rewriting the text to make it express the scorer's
best guess as to the intended meaning and to make it do so in
idiomatic form is apt to be a much more useful procedure peda-
gogically than merely marking the errors on the student's paper
and handing the paper back. In the latter case, the attention is
focussed on the surface form of the language used in the essay, in the
former, the attention is focussed on the intended meaning and the
surface form is kept in proper perspective - as a servant to the
meaning.
Once the rewriting has been carefully completed - deleting
superfluous or extraneous material, including obligatory information
that may not have been included in the protocol, changing distorted
forms or misused items, and so forth - a score may be computed as
follows: first, count the number of error-free words in the examinee's
protocol; second, subtract from the number of error-free words, the
number of errors (allowing a maximum of one error per word of
text); third, divide the result of step two by the number of words in the
rewritten text. These steps can be expressed simply as shown in the
following formula:
ESSAY SCORE = [(the number of error-free words in the student's
protocol) minus (the number of errors in the
student's protocol)] divided by (the number of
words in the rewritten text)
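The formula is easy to apply once the three counts are in hand. Below is a small added sketch in Python (not from the original text) that computes the score and checks it against the two protocols reproduced later in this section.

```python
def essay_score(error_free_words, errors, rewrite_word_count):
    """(Error-free words minus errors) divided by the number of
    words in the rewritten text. At most one error per word of
    text is assumed to have been counted."""
    return (error_free_words - errors) / rewrite_word_count

# Protocol 1 below: 124 error-free words, 24 errors, 142-word rewrite.
print(f"{essay_score(124, 24, 142):.2f}")   # 0.70
# Protocol 2 below: 127 error-free words, 70 errors, 188-word rewrite.
print(f"{essay_score(127, 70, 188):.2f}")   # 0.30
```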
In general there are three types of errors. There are words that must
be deleted from the student's protocol; there are words that must be
added; and there are words that must be changed.
Below, the suggested scoring method is applied to three protocols
of essays written by college level students at Southern Illinois
University in the Center for English as a Second Language. The
protocols are genuine and have not been changed except to transform
them from their original cursive script to print. The cursive notes
above the line are my suggested rewrites. Other scorers might have
made some different suggestions for changes. Or, they might have
been more or less lenient. The point of the method, however, is not to
attempt to provide a scoring system that will be equivalent across all
possible scorers. This would not even be a reasonable objective.
Differences across readers cannot be dissolved nor would an attempt
to eradicate such differences reflect the normal use of language in
written form or any other form. The point of the scoring method is to
provide a technique of converting essay protocols to a quantity that
reflects (in the view of at least one proficient reader) the degree of
conformity of those protocols to effective idiomatic prose.
More research is needed to test the reliability and validity of the
recommended scoring method, but it has been shown to be
substantially reliable and valid in recent work at Southern Illinois
University (see Scholz, Hendricks, Spurling, Johnson, and
Vandenburg, in press, and Kaczmarek, in press). In brief, the method
probably works as well as it does because it very clearly assesses the
overall conformity of the student's writing to idiomatic prose.

Protocol 1
Score: (124-24)/142 = .70
(An advanced ESL student at Southern Illinois University)
[The cursive rewrites above the line and the word-count numbering of the original cannot be reproduced here; only the student's text follows, with illegible words shown as [...].]
I was going home from the school. When I was standing before a red
light at the corner, a yellow car [...] the red light and hit a blue car
passing the [...]. Obviously the driver of yellow car was at fault. It was
about noon and there was heavy traffic. The blue car was almost
damaged. The driver of yellow car was calm. He came to the other
driver and begged his pardon. But the driver of blue car was nervous.
Someone called the police. After five minutes a police car came and
towed the two cars away. The guilty driver was [...] fifty dollar ticket.
The blue car had many damages that estimated about five hundred
dollars. Though I was late [...] home that day, but I had an interesting
story to tell my parents.

Rewrite of I
I 1 :!> I.t 5 6 I e. 9 10 (\ 12-
I was going home from school. When I was standing at a
13 14- IS It> 17 Ie. 1"1 2.0 .l-l '2.2. 2.3 2,+ 2"'>2.6 '2.7
red light at the corner, a yellow car ran the light and hit a blue
28 Z."l ~o "3.\ ,:2.. . .~~ 34 3.':;' 'H
car gomg through the intersectIOn. ObVIOusly, tlie driver of
n 3'0 3"'1 ,,+0 41 4-1- 4~ l+4- 45 I+b L\.-i L\-~ 1.\-"1
the yellow car was at fault. It was about noon and there was
50;1 52 53 S4- SS?b r::>7 '5$ 54'
heavy traffic. The blue car was almost destroyed. The driver
&0 f,1 61. {,; b4- (,$ _ j,b ('1 "8 b'l"10 il ""12-
of the yellow car was calm. He came over to the other driver
ESSAYS AND RELATED WRITING T ASKS 389
73 71;- 75 ." 77 78 -, q to 81 81 &3 34-
and begged his pardon. But the driver of the blue ca r was
~5 f.b &7 is " '\0 ql ~l q~
nervous. Someone called the police. After five minutes a
'14- 0,'5 C\b <n <\g q'l 100 101 10.. 10~ 10'+
police car came and towed the two cars away. The guilty
lOS lob len lOS 10". 11 0 III IlL il 3 I I ... 11'5 lib 11"'1 II,
driver was given a fifty dollar ticket. The blue car had a lot of
Ilq 12.0 1:1.1 11."1.. 12..~ 12..ij.. 11.5 l1..b 12-7
damage estimated at about five hundred dollars. Though J
!l. S , 1'2.1\ 11 0 \'!.I 1'01. , ~~ ':S'+ \~S \ ~b 1\ , n'll 13'1
was late getting home that day. I had an interesting story to
11.. 0 14 1 '4-1.-
tell my pa rents.

Protocol 2
Score: (127 - 70)/188 = .30
(An intermediate ESL student at Southern Illinois University)

Rewrite of 2
Four years ago, before I got my diploma, I had a very bad accident
that I am going to tell about. At that time my father had two cars.
One evening, about 8 p.m., my father and my mother went out to
visit some of our relatives. I told my younger brother to open the
door of the garage and we would drive for a little while. I didn't have
a driver's license. We drove off on a highway, but finally decided to
go home. I turned the car around very fast and I couldn't control it.
Another car was parked there and I ran into it very hard. Because I
didn't have a driver's license, I thought I'd better go. I went home
very fast. When my father came home, he asked me what happened
to the car. I told him I had run into a wall and he believed me. But
the day after that the police stopped my father when he was driving
in that car. He paid a lot of money to the police and to the man
whose car I had run into.

Protocol 3
Score: (12 - 21)/33 = -.27
(A beginning student of ESL at Southern Illinois University)

Rewrite of 3
I have a car. I was backing into the street and another car hit me. I
called the police. They came. They gave a ticket to the other man
because he hit me.

Just as in the case of dictations (and cloze tasks) spelling errors are
counted only when they distort a word's morphology or pro-
nunciation (e.g., the word 'happened' spelled 'happend' is not scored
as an error, whereas the word 'believed' spelled 'blive' is scored as an
error). In the above protocols, punctuation errors are corrected, but
do not contribute to the total score of the examinee. This is not to
suggest that punctuation and spelling are not important aspects of
writing skill, but for the same reasons that spelling is not scored in
dictation these mechanical features are not counted in the essay score
either. Because of the results with non-native speakers of English, it is
assumed that learning to spell English words and learning to put in
appropriate punctuation marks in writing are relatively independent
of the more fundamental problem of learning the language. Research
is needed to see if this is not also true for native speakers of English.
It can readily be seen from the three sample protocols that the
scores reflect the presumed range of abilities of the examinees. The
most advanced student got the highest score (.70), the beginner got
the lowest score (-.27), and the intermediate student got a score in
the middle (.30). The beginner's negative score reflects the fact that
the free writing task (students were asked to write about an accident
for fifteen minutes) was so difficult that his errors outnumbered the
correct sequences he was able to generate. It is probably true that the
task was too difficult and therefore was frustrating and to that extent
pedagogically inappropriate for this student and for others like him,
but it does not follow from this that the test was invalid for this
student. Quite the contrary, the test was valid inasmuch as it revealed
the difference between the ability of the beginner, the intermediate,
and the advanced student to perform the essay writing task - namely,
to describe an accident within a fifteen minute time limit.
Many scoring methods besides the one exemplified could be used.
For instance, if for whatever reason the examiner wanted to place a
premium on quantity of output, the examinee might be awarded
points for errorless words of text. The score might be the number of
errorless words of text written in a certain time period. Briere (1966)
even argued that the mere quantity of output regardless of errors
should be considered. In an experimental study he claimed to have
shown that learners who were encouraged to write as much as they
could as fast as they could within a time limit learned as much as
students who received corrective feedback (i.e., whose papers were
marked for errors). However, from a testing point of view, it would
have to be shown that a mere word count would correlate with other
presumed valid measures of writing skill. In the three examples given
above, the best essay is not the longest.
On the whole it would seem best to use a scoring method that
awards positive points for error-free words and subtracts points for
errors. To keep the focus of the examinee and the scorer on the
intended meanings and the clear expression of them, some attention
should be paid to the amount of deviation from clear prose in any
given attempt at written expression. The scoring method proposed
above reflects all of these considerations.

D. Rating content and organization

It has long been supposed that subjective ratings were less accurate
than more objective scoring methods. However, as we have seen
repeatedly, subjective judgements are indispensable in decisions
concerning whether a writer has expressed well his intended meaning
and, of course, in determining what that intended meaning is. There is
no escape from subjective judgement in the interpretation of normal
expression in a natural language. In fact, there are many reasons to
suppose that a trained judge may be the most reliable source of
information about how well something is said or written. The crucial
question in any appeal to such judgements is whether or not they are
reliable - that is, do different judges render similar decisions
concerning the same samples of writing, and do the same judges
render similar decisions concerning the same samples of writing on
different occasions. The question of the reliability of ratings is the
same whether we are thinking of written or spoken (possibly
recorded) samples of language.
Research with the FSI oral testing procedure discussed in Chapter
11 above has indicated that trained judges can render highly reliable
evaluations of samples of speech acquired in oral interview settings.
More recent work with oral ratings and with the evaluation of written
protocols has indicated that even untrained raters tend to render
fairly reliable judgements though trained raters do still better.
Although Mullen found substantial variability across judges in the
calibration of their evaluations, reliability across judges (that is, the
correlation of the ratings of the same subjects by different judges) was
consistently high. For both written and oral protocols, the
correlation across pairs of judges working independently was
consistently above .5 and for most pairs was above .8 (Mullen, in
press a, and Mullen, in press b). Callaway (in press) similarly showed
that the correlation across judges in the evaluation of spoken
protocols was consistently above .7 for more than half of the raters in
a sample of 70.
The work of both Callaway and Mullen indicates further that
judges always seem to be evaluating communicative effectiveness,
regardless of whether they are trying to gauge 'fluency', 'accentedness',
'nativeness', 'grammar', 'vocabulary', 'content', 'comprehension', or
whatever. Moreover, this is apparently true both for naive
(untrained) raters and for ESL teachers (trained raters, in Callaway's
study). The variance overlap across scales aimed at, say, 'accentedness'
versus 'vocabulary' was nearly perfect within the estimated
limits of the reliability of the separate scales. This suggests that
whatever it is that judges are assessing when they evaluate the
'accentedness' of a speech sample, it is the same thing that they are
considering when they attempt to evaluate the 'vocabulary' used in it.
Similarly for written protocols, whatever it is that judges are
evaluating when they assess the 'content' is nearly perfectly
correlated (within the estimated error of measurement) with
whatever they are assessing when they judge the 'organization' or the
'grammar' and so on. Both of the studies by Mullen, and the
Callaway research as well as recent work by Oller and Hinofotis
(1976) and by Scholz et al. (in press) point to the conclusion that
trained and untrained judges alike listen for the communicative effect
of speech and they read for the communicative effect of writing. Put
very simply, the judges all seem to agree that what is most important
in the evaluation of the use of language is how well a person says what
he means. They may not agree on this consciously or overtly, but their
behaviour in evaluating protocols of speech and writing supports the
conclusion that they agree on this implicitly at the very least.
But isn't it true that judges may differ widely in their ratings of the
same subjects on the same rating scale? The answer, of course, is that
they do sometimes differ widely. Well, doesn't this prove that their
judgements are therefore unreliable? No, it doesn't prove that at all.
Consider a scale that asks judges to rate the three sample protocols
given earlier in this chapter on degree of communicative effectiveness.
Suppose that the scale allows for points to be assigned between zero
and ten. Suppose further that two different judges are asked to assign
scores on the same ten point scale to each of the three protocols. Let's
say that judge A is somewhat more severe than judge B. The two
judges assign scores as follows:
                             Judge A    Judge B
Protocol 1 (advanced)           3          10
Protocol 2 (intermediate)       2           9
Protocol 3 (beginning)          1           8
Judge A consistently assigned lower scores than B, yet the two judges
ranked the students in exactly the same order. In both cases the
advanced student was ranked ahead of the intermediate student who
was ranked ahead of the beginning student. The judges may have
disagreed about how to calibrate the scale, but nonetheless, their
evaluations are perfectly correlated.
The correlation is perfect because the relative distance of any given
score in the set from the mean score for that set is exactly the same
across the two sets of scores. That is, the advanced student is rated
one point above the mean by both of the judges and the beginner is
rated one point below the mean. In both cases, the intermediate
student is rated at the mean. Hence, there is perfect agreement (in this
hypothetical example) about the relative proficiencies of the three
subjects tested even though the scores assigned are quite different
when compared directly. The same remarks would hold true if a third
judge, C, were added who rated the protocols 10, 5, and 0 on the same
scale. The reader can check the correlations between judges by
computing the correlations between judge A and B, A and C, and B
and C (see Chapter 3, section E.1 above for an explanation of the
computational procedure).
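As an illustration of that check, the short sketch below (in Python; it is an aid to the reader, not part of the original text) computes the Pearson product-moment correlation for each pair of the hypothetical judges.

    def pearson_r(x, y):
        # Pearson product-moment correlation between two lists of scores.
        n = len(x)
        mean_x, mean_y = sum(x) / n, sum(y) / n
        dev_x = [v - mean_x for v in x]
        dev_y = [v - mean_y for v in y]
        covariation = sum(a * b for a, b in zip(dev_x, dev_y))
        ss_x = sum(a * a for a in dev_x) ** 0.5
        ss_y = sum(b * b for b in dev_y) ** 0.5
        return covariation / (ss_x * ss_y)

    a = [3, 2, 1]     # judge A: advanced, intermediate, beginner
    b = [10, 9, 8]    # judge B
    c = [10, 5, 0]    # judge C

    for x, y in ((a, b), (a, c), (b, c)):
        print(round(pearson_r(x, y), 6))   # 1.0 for every pair

Although the three judges calibrate the ten point scale quite differently, every pairwise correlation is perfect.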
If the scores are expressed in terms of units of deviation from the
mean score for each set of scores, it can be seen that the judges are in
fact in perfect agreement about how the subjects' performances
should be ranked relative to each other. In all three cases, the
advanced student is rated as one standard deviation above the mean;
the intermediate student is rated at the mean; and the beginner is
rated at one standard deviation below the mean. Thus, it is possible
for judges to agree substantially on ratings of speech or writing even if
they disagree substantially on the particular score that should be
assigned to a given subject.
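The standardization just described can be made concrete with one more illustrative sketch (again Python, and again not part of the original text). It assumes the sample standard deviation (the n - 1 denominator), under which each judge's three ratings come out exactly one standard deviation above the mean, at the mean, and one below it.

    def deviation_units(scores):
        # Express each score as its deviation from the mean,
        # measured in standard-deviation units (n - 1 denominator).
        n = len(scores)
        mean = sum(scores) / n
        sd = (sum((v - mean) ** 2 for v in scores) / (n - 1)) ** 0.5
        return [(v - mean) / sd for v in scores]

    for ratings in ([3, 2, 1], [10, 9, 8], [10, 5, 0]):
        print(deviation_units(ratings))   # [1.0, 0.0, -1.0] in every case

The identical profiles show why judges A, B, and C, despite their different calibrations, agree perfectly on the relative standing of the three examinees.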
The available research indicates that subjective ratings that
evaluate the overall communicative effect of a piece of writing (or
speech) are about as informative as objective scores in terms of
differentiating between better and worse performances of the
individuals in a group of students. However, subjective ratings are
summative in nature and do not provide the diagnostic detail that is
possible with the more objective type of scoring outlined earlier in this
chapter. Another factor to be considered is that the more objective
type of scoring is probably less sensitive to individual differences of
calibration.

E. Interpreting protocols
Essay tasks have often been favored as classroom testing techniques.
Educators sometimes appeal to them as a kind of ultimate criterion
against which other tests must be judged. However, the greatest
virtue of essay tasks may also be their greatest liability. While it is true
that such tasks confer the freedom and responsibility of choice on the
examinee, they also require thoughtful evaluation on the part of the
examiner. The writer may elect to express very simple ideas only in
words that he is sure of, or he may venture into more complex or
profound meanings that stretch his capacity to put things into words.
Whether the writer has charted a conservative or a daring course is
left to the judgement of the examiner. For this reason, equal scores
may not represent equivalent performances, and unequal scores do
not necessarily imply that the higher score represents the better
performance.
The main question in interpreting an essay protocol is, 'What was
the writer trying to say, and how well did he say it?' There may seem
to be two questions here, but for good reasons readers tend to treat
them as one. If a writer does not express things fairly well, it will be
hard to tell what he is trying to say; similarly, if a writer has little or
nothing to say, how can he say it well? It does not take a sage to see
the wisdom of saying nothing unless there is something to say. In fact,
a person has to go to school for a very long time to become able to
write on topics which do not naturally elicit a desire to communicate
(though this is not because people lack things to talk and write
about).
Once the evaluator is fairly confident that he knows what the writer
was trying to express, it is possible to evaluate how well the job was
done. Obviously, the evaluator cannot be much better at evaluating
than he himself is at writing, and this is where the teeth of subjectivity
bite hard. The examiner must place himself in the shoes of the writer
and try to figure out precisely what the writer meant and whether he
said it well, or how he could have said it better. Thus, the largest part
of evaluating essays or other written protocols is inferential, just as
the interpretation of speech and writing in other contexts is
inferential.
In spite of the criticism that essay tasks allow the writer the
freedom to avoid 'difficult structures', such tasks are nonetheless
usually quite revealing. A number of problems can be diagnosed by
studying a protocol such as number 1 above. Among the glaring
errors is the failure to use the definite article where it is required in
such expressions as 'The driver of yellow car'. The surface aspect of
this error lies in the fact that any noun phrase with a countable head
noun (such as 'car' or 'yellow car') requires an article. The deeper
aspect of this same error is that without the article, in fact without a
definite article, the writer fails to call attention to the fact that the
reader knows which yellow car the writer is referring to - namely, the
one that ran the red light. If the writer can be made to see this, he can
learn to use the article correctly in such cases.
Another noun phrase problem occurs in the phrase 'many
damages'. Here, the trouble is that we normally think of damage as a
kind of amorphous something which is uncountable. The word
'damage' does not usually appear in the plural form because the
referent is not discrete and countable. There can be a little damage or
a lot of damage, but not few or many damages. The reader should
note that this has no more to do with the syntax of countable and
uncountable nouns than it has to do with the concept of what damage
is. If we were to conceptualize damage differently, there is nothing in
the grammar of the English language that would keep us from
counting it or saying 'many damages'. The problem involves meaning
and goes far beyond 'pure syntax' even if there were such a thing. If
the writer can see the sense or meaning the way the native speaker
does, this kind of error can be overcome.
The writer confuses 'give' and 'take'. This distinction has puzzled if
not bewildered many a grammarian, yet native speakers have little or
no difficulty in using the terms distinctly and appropriately. In this
context, 'give' is required because the police were there at the scene.
There was no need to take the ticket anywhere. The policeman just
handed it to the driver who was presumed to be at fault. Had the
driver who was to receive the ticket left the scene of the accident and,
say, gone home to Chicago (from Carbondale), it is possible that
someone might have taken him the ticket. It would appear that the
cop gave the ticket to the man, and the man to whom he gave it took it
home to Chicago or wherever he was going.
Two other errors involved clause connectors of sorts. The writer
says, 'The blue car had a lot of damage that estimated about five
hundred dollars'. The problem the writer needs to be made to see is
that the way he has it written, the damage is doing the estimating.
Another difficulty in joining clauses together appropriately occurs in
the last sentence of the protocol where the learner avows, 'Though I
was late getting home that day, but I had an interesting story to tell
my parents'. Perhaps the learner does not realize that 'though' and
'but' in this sentence both set up the condition for another statement
offering contrary information. In the case of 'but' the contrast has a
kind of backward look whereas with 'though' an anticipatory set is
cued for information yet to come.
How can the diagnostic information noted above be applied?
Surely it is not enough to offer grammatical explanations to the
learner. In fact, such explanations may not even be helpful. What can
be done? For one thing, factual pattern drills may be used. For
instance, using only the facts in the story as told by the writer (and
well known to the writer) it is possible to set up the following pattern
drill with 'the' following a preposition and elsewhere.

1. I was waiting at a light. The color of the light was red.
2. A car ran the light. The color of the car was yellow.
3. It hit another car. The color of the car that it hit was blue.
4. The color of the car that hit the blue one was yellow.
5. The driver of the blue car was nervous.
6. The driver of the yellow car came over and apologized.
7. The police gave the driver of the yellow car a ticket. The
amount of the fine was fifty dollars.

In addition to the above sentences it is possible to set up meaningful
question and answer drills. For example,
1a. Where were you waiting?
    At a light.
1b. What color was the light?
    It was red.
1c. What was red?
    The light was red.
2a. What happened next?
    A car ran the light.
2b. What color was the car?
    It was yellow.
2c. What was yellow?
    The car was.
3a. Which car ran the light?
    The yellow car did.
and so forth.
The key advantage to the above type of pattern drill is the meaning.
The writer knows what the facts are. The deep structure, or the sense,
is known. It is how to express the sense in terms of surface structure
that the writer needs to discover. The drills should be designed to help
the learner see how to express meanings by using certain surface
forms. The main difference between the type of drill illustrated above
(and the number and variety of such drills is unlimited) and the type
so often found in ESL or foreign language curricula is where the
learner's attention is focussed. In the drills recommended here, the
focus is on the meaning. In the disconnected and relatively
meaningless sentences of typical pattern drills, on the other hand, the
focus is on the surface form - with scarcely a thought for the meaning.
(See Chapter 6, section C above.) Here the learner knows what the
meaning is and can discover what the surface form should be. With
meaningless sentences, by contrast, the meaning is not known to the
learner and cannot be discovered (unless he already knows the
language).
Much remains to be discovered concerning the nature of writing
and the discourse processing skills that underlie it. However, there is
little reason to doubt the utility of essay writing as a reasonable
testing procedure to find out how well learners can manipulate the
written forms of a language. In fact, there are many more arguments
in favor of essay writing as a testing technique than there are against
it. Its usefulness as a diagnostic tool for informing the organizers of a
curriculum cannot be overlooked. Further, such tests can easily
(though not effortlessly) be integrated into a sensible curriculum.

KEY POINTS
1. The freedom allowed by essay tasks may be both their greatest strength
   and weakness - a strength because they require a lot of the examinee, and
   a weakness because of the judgement required of the examiner.
2. Except for the greater accessibility of the written protocols of learners,
the evaluation of writing performances is similar to the evaluation of
spoken protocols.
3. The fundamental problem in using essay tasks as tests is the difficulty of
quantification - converting performances to scores.
4. A technique for evaluating the conformity of a text to normal written
discourse is for a skilled writer (presumably the language teacher) to read
the text and restate any portions of it that do not seem to express the
author's intended meaning, or which do not conform to idiomatic usage.
5. For instructional value and also to insure the most accurate possible
inferences concerning intended meanings, such inferences are best made
in consultation with the author of the text in question.
6. Something to say and the motivation to say it are crucial ingredients to
pragmatic writing tasks.
7. A recommended scoring procedure is to count the number of error-free
words in the text; subtract the number of errors; and divide the result by
the number of words in the error-free version of the text.
8. Research has shown that subjective ratings of written protocols are
about as reliable as objective scoring techniques and that the subjective
methods generate about as much valid test variance as the objective
techniques do.
9. Research has also shown that attempts to direct attention toward
presumed components or aspects of writing ability (e.g., vocabulary,
grammar, organization, content, and the like) have generally failed.
10. Apparently, whatever is involved in the skill of writing is a relatively
unitary factor that does not lend itself easily to componential analysis.
11. It must be noted that the reliability of ratings of essays by different judges
    is only marginally related to the calibration of the ratings - it is
    principally a matter of whether different judges rank subjects similarly.
    Judges that differ widely in the specific ratings they assign to a set of
    protocols may agree almost entirely in what they rank as high and what
    they rank as low in the set.
12. A given learner's protocol may serve as a basis for an in depth analysis of
    that learner's developing grammatical system. Further, it may provide
    the basis for factual pattern drills designed to help the learner acquire
    difficult structures and usages.
13. Indeed, factual pattern drills, derived directly from the facts in the
    contexts, can serve as a basis for preparing materials for an entire class or
    a whole curriculum to be used in many classes.
14. Essays are reasonable testing devices that have the advantage of being
    easily incorporated into a language curriculum.

DISCUSSION QUESTIONS


1. How did you learn to write essays in your native language? What
   experiences in and out of school helped you to acquire skill as a writer?
   What was the relationship of the experiences you had as a talker to your
   becoming a writer?
2. To what extent do you agree or disagree with the suggestion that people
would do well to write the way they talk?
3. Consider a task where you read a book and summarize in written form
   what you have read. How would the written form compare with a
   summary you might give to someone who wanted to know what the book
   was about? Or, consider seeing a movie and writing a narrative of what
   happened in it. What is the relation between writing and experience?
   Introspect on the task of writing a brief autobiography.
4. Assign an essay task to a group of students and analyze the protocols.
   Try to find group tendencies that would justify drills (factual pattern
   practices) which would benefit the maximum number of students.
5. Compare the protocols of learners on the essay task with protocols from
   dictation, cloze procedure, oral interview and the like.
6. What would you predict concerning a comparison of a discrete point
   method for scoring essay protocols and a more comprehensive scoring
   method focussed on meaning? What sort of experiment would be
   necessary to test the validity of the two methods? How could correlation
   be used to assess which of the two methods was producing the greater
   quantity of reliable and valid variance? How could it be determined that
   the variance produced was in fact attributable to the learner's
   communicative effectiveness in performing the essay task?
7. Discuss your own experiences as a teacher of writing skills or the notions
   that you would try to employ if you were such a teacher. What methods
   of instruction have been (or would be, in your view) most effective? What
   evidence can you offer? Is it your opinion that writing classes per se
   usually teach people to be good writers? If not, what is lacking? How can
people learn to express their ideas in writing, and how can they be
instructed so that they will improve?
8. Discuss the differences and similarities between a recall task, such as the
one briefly described in Chapter 13, and an essay task where a topic and
major points to be covered are suggested by the examiner or instructor.
Consider also the case where no topic is offered.
9. Construct scales for evaluating essays that try to sort out all of the things
that you think contribute to good or effective writing. For instance, you
might have a scale that assesses content and organization; another for
vocabulary; another for syntactic and morphological correctness, etc.
Apply the scales as diligently as you can to the essays written by a group
of students (natives, non-natives, pre-secondary, secondary, or post-
secondary, no matter). Compute correlations between the scales.
Examine the correlations and the individual protocols to evaluate the
success with which you have partitioned the writing skill into
components, aspects, or elements.

SUGGESTED READINGS
1. William E. Coffman, 'The Validity of Essay Tests,' in Glenn H. Bracht,
Kenneth D. Hopkins, and Julian C. Stanley (eds.), Perspectives in
Educational and Psychological Measurement. Englewood Cliffs, New
Jersey: Prentice-Hall, 1972, 157-63.
2. Celeste M. Kaczmarek, 'Scoring and Rating Essay Tasks,' in J. W. Oller
and Kyle Perkins (eds.), Research in Language Testing. Rowley, Mass.:
Newbury House, in press.
3. Nancy Martin, Pat D'Arcy, Bryan Newton, and Robert Parker, Writing
and Learning across the Curriculum 11-16. Leicester, England:
Woolaston Parker, 1976.
4. Karen A. Mullen, 'Evaluating Writing Proficiency in ESL,' in Oller and
Perkins (op. cit.).
14

Inventing New Tests in Relation to a Coherent Curriculum

A. Why language skills in a school curriculum?
B. The ultimate problem of test validity
C. A model: the Mount Gravatt reading program
D. Guidelines and checks for new testing procedures

Tests in education should be purposefully related to what the schools
are trying to accomplish. For this to be so, it is necessary to carry the
validation of testing techniques beyond the desk, chalkboard, and
classroom to the broader world of experience. The tests and the
curriculum which they are part of are both designed to serve the
interests of students and of the larger community beyond the school
grounds. They constitute the planned part of the educational effort to
instill values, and to impart skills and knowledge to the populace at
large. Education in this sense is not an accident and there is no excuse
for its effects to remain as poorly understood as they are at the present
time. In this chapter we will see that the curriculum is subject to
validation by research just as the tests are. Whereas the purpose of the
tests is to measure variance in performances of various sorts, it is the
purpose of the curriculum to produce such variances over the course
of time, or possibly to eliminate them. In all of this, language plays a
crucial pivotal role - the curriculum and the tests are largely
dependent upon language. In this chapter, we will consider some of
the factors necessary to valid language curricula and tests.

A. Why language skills in a school curriculum?


Doesn't the question seem rhetorical? Isn't it obvious why language
skills are crucial to the whole business of education? Perhaps it is
obvious that language skills are important to what happens in the
schools, but it may not be obvious just how important they are.
Stump (1978) observes that language is central to at least two of
the three R's, and we may add that it may be of nearly equal
importance in the early stages of all three. Reading requires
knowledge of language in a fundamental and obvious way. Writing
similarly requires language. Furthermore, it isn't enough merely to be
able to recognize and form the shapes of letters. Consider the
difficulty of writing something coherent about something of which
you know absolutely nothing, or alternatively, consider the difficulty
of writing something intelligible in a language that you do not know
(albeit on a topic of which you may have vast knowledge).
It has often been observed by math teachers that a big part of the
problem is simply learning how to read, interpret, and refer to the
symbology. As a statistics instructor put it recently, 'Once you have
learned the language of multiple regression, you will have the course
in your pocket.' Before that? That is, until the language is
comprehensible, presumably you will not have understood the
content of the course. Thus, it isn't just 'Readin' 'n Writin" that
require language skills. 'Rithmetic' requires language skills too.
Stump's milestone study of fourth and seventh grade children in
the Saint Louis schools demonstrates that all of the major
standardized tests used in that school district (including the Lorge-
Thorndike IQ Test, Verbal and Non-Verbal; and the Iowa Tests of
Basic Skills, with subtests aimed at language, arithmetic, and
reading) are principally measures of the same language factor that is
common to cloze tests and dictations - the same sort used with non-
native speakers in many of the studies referred to in earlier chapters of
this book. Further, Stump's study shows that the distinction between
verbal and non-verbal IQ (as expressed in the items of the Lorge-
Thorndike) is quite suspect. At the seventh grade level, non-verbal IQ
was as strongly correlated with the cloze and dictation scores as was
verbal IQ.
Could it be that Stump's results are spurious? Is it possible that the
particular tests investigated are not characteristic of similar
educational tests in general? The content and the history of the
development of the tests make any such explanation unlikely. As
Robertson (1972) has shown, nearly every widely used IQ test in
existence was modeled after the same pattern - namely the original
Binet, later modified to become the Stanford-Binet. The Lorge-
Thorndike test is no exception, rather, it is typical of group
administered IQ tests. The same argument can be offered for the Iowa
Tests of Basic Skills. It is very similar to many other batteries
of achievement tests. Furthermore, as Gunnarsson (1978) has
demonstrated, it can be concluded directly from the items on
standardized tests that purport to measure 'intelligence', 'achievement',
and 'personality' that they are scarcely distinguishable from
tests that purport to measure various aspects of 'language ability'.
Unfortunately, the similarities between educational tests generally
and tests that are known to measure language proficiency reliably
and validly are not usually taken into account in the interpretation
and application of the tests in placement, counseling and guidance,
remedial teaching, and the curriculum in general. IQ tests may be
used to label a child as 'mentally deficient' rather than as not knowing
a certain language. Personality tests may be used to identify children
who 'have problems of adapting' rather than children who for
whatever reason have not learned a certain language or perhaps who
have not become very proficient in that language. Achievement tests
are usually interpreted in relation to how much of the curriculum has
been learned or not learned rather than in relation to how much of the
language of the curriculum the child understands.
Language pervades education. This is no doubt a fact that
motivated a book of papers on the topic of Language Comprehension
and the Acquisition of Knowledge edited by Roy O. Freedle and John
B. Carroll (1972). Being able to use a language or even a particular
variety of a language seems to be a prerequisite for anything that
education attempts to accomplish; or viewed from a slightly different
angle, we might say that the ability to use language in a variety of
ways is the principal thing that education tries to instill. It is never
true to say that language is a mere adjunct of the curriculum, for
without language, there could be no curriculum at all. The three R's
and all the benefits to which they provide access are founded in
language ability.

B. The ultimate problem of test validity


It is said in a widely used book on foreign language testing that the
final criterion of test validity in foreign language teaching is whether
or not the test assesses what is taught in the foreign language
classroom. There is a problem with this criterion of validity, however.
It relates to the fact that most of what is taught in foreign language
classrooms is apparently not language. At least it is not language as it
is known outside foreign language classrooms. Few, if any, of the
many thousands of students who study foreign languages in
classroom settings are able to pass as native speakers of those
languages even though they may have been among the best students
in their classes. The ones who do acquire the foreign language fluently
usually have had to supplement their classroom experience by many
hours of experience outside the classroom on their own time and
creating their own curriculum. Many will report that they learned
next to nothing in the classroom context and that they only acquired
the language after travelling to the country where it was spoken.
Merely requiring language tests (or other educational tests) to
measure what the curriculum teaches may not be enough. It will be
enough only if the curriculum teaches what it is supposed to teach.
Most foreign language curricula do not. They do not produce people
who can understand and use the foreign languages they try to teach.
Hence, there must be a deeper validity requirement. It must be more
abstract, more profound, more fundamental than any particular
curriculum. The ultimate validity criterion is not a mere requirement
for tests alone, but it must be a requirement for the curriculum itself.
If the purpose of a curriculum is to teach children to read in a
certain variety of English, the ultimate criterion of success (and the
very definition of valid tests for assessing success) is whether or not
children who have had the program can read in that variety of
English. If the purpose of the curriculum is to teach people to
communicate in a foreign language, the criterion of success is how
well they can communicate in the language after they have had the
course. Thus, the validity of tests in education must be referenced
against the skill, performance, ability, or whatever the educational
program purports to instill. Anything less will not do. The curriculum
itself is subject to validation just as the tests are.
What is the ultimate criterion of validity for language tests? To
many it will seem presumptuous even to pose such a question, but
putting the question clearly should not be more objectionable than
merely offering a glib answer to a vaguer form of the question in an
implicit unstated form as is usually done in the testing literature.
Certainly the ultimate criterion of validity in language testing cannot
be the curriculum because the language teacher's curriculum is also
subject to validation. The ultimate criterion must be what people do
when they use language for the purpose of communication in all of its
ramifications - from poetry and song to ordinary conversation.
But how can the facts of language usage be clarified so that they can
serve as the basis for validating curricula and tests? Part of the
difficulty is that the 'facts of language usage' are so pervasive and so
thoroughly a part of everyday experience that even the most essential
research may seem trivial to the casual observer. Why bother to find
out how people learn and use language? Everyone who is not an
imbecile already knows how people learn and use language. There is a
danger that to the casual observer the most fundamental aspects of
language will not seem worth the investigative effort, yet there is no
rational activity of man that is more profound and abstract than the
use and learning of language. The latter generalization follows as a
logical entailment of the fact that any cognitive activity can be talked
about. Whether it be the conception of the theory of relativity or
figuring out why the key won't turn in the lock, from the least abstract
cognitions to the most profound thoughts, language remains at least
one level above, because we can always talk about whatever the
activity may be and however abstract it may be. We can turn
language inward on itself and talk about talking about talking to any
degree of abstraction our minds can handle.
Carl Frederiksen (1977a) urges research into the nature of discourse
processing. He suggests that
the ability to produce and comprehend discourse is among the
most important areas of cognitive development. ... Discourse
is the natural unit of language, whether a discourse is produced
by a single speaker, as in stories or school texts, or by more than
one speaker as in conversations. Consequently, the development
of communicative competency intimately involves
processes which operate across sentences, conversational turns,
and larger stretches of language. ... Discourse processing skill
is ... central to children's learning and cognitive development
(p. 1).

If discourse is the natural unit of language use, and if discourse
processing skill is the principal objective of child language learning, it
would make sense to propose discourse processing as the prime
object of study for the validation of language based curricula.
Teaching a child to read in his native language can be viewed as a
problem of teaching him to process discourse in an unfamiliar mode.
Teaching a person (whether a child or an adult) another language can
be viewed as a problem of teaching him to process discourse in that
other language. Superficially the tasks appear to be radically
different, but they may not be nearly as different as they appear on the
surface. Reading curricula, foreign language curricula, and many other
language based educational programs are fundamentally similar in
their aim to instill discourse processing skill. Therefore, at a deep
level, the ultimate validity criterion for evaluating such programs is
how well they succeed in enabling students to process discourse in
different modes or different languages, and possibly on certain topics
or concerning certain subjects. By the same token, the ultimate
criterion for the validity of language tests is the extent to which they
reliably assess the ability of examinees to process discourse.
Tests that purport to be measures of language ability that do not
assess discourse processing skill can hardly be called 'language tests'
at all. Neither can curricula that purport to teach language or
language skills (e.g., reading) be called 'language curricula' unless
they really do teach people to use language in the normal ways that
language is used. It would seem to be axiomatic that instructional or
testing procedures that do not relate to discourse processing skill in
demonstrable ways can scarcely be called 'language teaching' or
'language testing'. Further, instructional and evaluational pro-
cedures that are principally oriented toward discourse processing
tasks can scarcely be called anything else. So-called 'intelligence',
'achievement', 'aptitude', and 'personality' tests that require complex
discourse processing skills should be considered language tests until it
is clearly demonstrated by empirical research that they are actually
measures of something else. This is imperative because we have a
much better chance as educators of bringing about therapeutic
changes in language proficiency than we have of changing 'innate
intelligence' or 'aptitude' or perhaps even 'personality'. The stakes
are too high to base decisions on untested theoretical opinions.
Moreover, there are a number of promising methods of empirically
determining just what the nature of discourse processing actually is.
Why should we rely on opinions when empirical answers are
accessible?
Two kinds of investigation are currently yielding informative
results. The first type of study involves the analysis of discourse with a
view to determining precisely what sorts of processes would be
necessary to produce or comprehend it, and the second method of
investigation involves detailed examination of the inputs to and
outputs from human beings involved in the activity of discourse
processing. Examples of the first type of study include the
investigation of the constraints underlying stories, narratives, and
other forms of discourse currently being done by psychologists and
others who are attempting to simulate human discourse processing
with computational procedures. Rummelhart's attempt to character-
ize the grammatical systems underlying stories (1972) offers many
insights into the nature of discourse processing and how it is distinct
from mere knowledge of words or sentence patterns. Schank and
Colby (1973) and Grimes (1975) demonstrate the crucial role of
knowledge of the world and the relationships that hold between
things, people, events, states-of-affairs, and so on, to the notion of
coherence in discourse. Frederiksen (1977a) has demonstrated
convincingly that inference plays an important role in both the
production and comprehension of some of the simplest aspects of
utterances and surface forms of discourse such as pronoun
reference, to take a simple case in point.
Frederiksen argues, partly on the basis of requisite assumptions for
computer simulation, that normal discourse processing is usually
guided by inferential assumptions about intentions, meanings,
topics, and the like. Such assumptions or hypotheses about meaning
must often be inferred at an abstract level utilizing information that is
not given anywhere in the text. Thus, he reasons that overemphasis of
surface correspondence of sound and symbol, or visual pattern and
word, in the early stages of reading curricula may quite logically
produce non-readers - children who have many of the superficial
skills that may be necessary to the process, but who cannot
comprehend text because they get bogged down in trying to decode
forms at the surface while the sense of the discourse eludes them.
The argument can easily be extended to language curricula in
general. If the teaching methods focus on relations between surface
forms without making it possible for the learner to discover the
relationship between discourse and the contexts of discourse, they are
bound to fail. Further, we may relate Frederiksen's point to language
testing procedures. If a proposed task does not require processing
that involves relating discourse to contexts of experience outside the
linguistic forms per se, then the task cannot legitimately be called a
language test in the normal sense of the word 'language'.
A second type of empirical study that is relevant to the questions
associated with human discourse processing involves an examination
of the human activity itself. Goodman's work with reading miscues
(1965, 1968, 1970) is a good example of a technique for discovering
something of the way that human beings actually process discourse.
There are, of course, many other angles of approach. Frederiksen
(1976a, 1976b, 1977a, 1977b) has coupled his work in computer
modeling with investigations of actual protocols of children
reporting discourse. He used a story retelling technique to elicit
samples of data. Other techniques that have been used widely, though
they have certainly not yet been wrung dry, include samples of
writing, conversation, etc. Indeed, any pragmatic language test offers
a suitable basis for eliciting data for the investigation of discourse
processing. Conversely, any discourse processing study is in a
fundamental sense an investigation of both the nature and the
validity of pragmatic language testing.

C. A model: the Mount Gravatt reading program


By the arguments already presented in this chapter we are led to
conclude that teaching and testing are merely aspects of the same
basic problem. With respect to language curricula the crucial
problem for teaching is to instill discourse processing skill. The
central problem for testing is to assess the extent of such skill. If the
objective of the curriculum is to teach children to read in their native
language, the objective of the tests must be to assess their ability to do
so. At the surface, teaching and testing may look like quite different
activities, but down underneath the apparent differences there is an
important sameness - indeed an identity of purpose. If that sameness
is not obtained for whatever reason, there must be a validity problem
either in the teaching or in the testing.
Just as the fundamental sameness of teaching and testing activities
in schools may escape notice, the relationship between educational
tasks and normal experience outside the classroom may similarly be
neglected. If learning to read is a problem of learning to process
discourse in a different mode, then it would make sense to capitalize
on the discourse processing skill that the learner already has. Yet
most curricula do not fully capitalize on the natural relationship
between what Hart, Walker, and Gray (1977) have termed 'oracy and
literacy'. That is to say, reading curricula should take advantage of
what the child already knows about the processing of discourse in an
oral mode. They should maximize his chances of success in learning
to read by building on what he already knows.
Of course, it would be important to have an understanding of the
mechanics of the reading process itself, but this understanding would
necessarily remain subordinate to the main goal of teaching children
to comprehend and produce written discourse. The mechanics of
surface processing compared with the deeper processing of meaning
pale to a much lesser significance. The central question becomes how
to capitalize on what the child already knows of oral language in
teaching him to read. Traditional answers have been based on
guesses. Vast experience of educators with failures in initial reading
programs indicates that the guesses have not always been very
helpful. Another alternative would be to examine empirically the
language of children prior to an attempt to set up a reading
curriculum.
The intent of a research program might well be to discover the
kinds of things children can say and understand in normal contexts of
oral discourse prior to the presentation of written forms. Hart,
Walker, and Gray (1977) report on a ten year study of child language
that has offered many insights for reading curricula. They examined
sizable samples of data from 2½, 3½, 4½, 5½, and 6½ year old children.
For each child included in the study, data consisted of every utterance
spoken to or by the child from the time he got up in the morning until
he went to bed that night. In addition to the tape recording of all
utterances, a running commentary was kept on the contexts in which
the utterances occurred. Thus it was possible to link utterances with
contexts, taking into account the normal antecedent constraints of
previous contexts, their consequences, and so forth.
The recorded protocols of children's discourse provide a basis for
testing many specific hypotheses and for answering many questions
about the nature of child language acquisition. Further, on the basis
of such data it is possible to test existing curricula for their degree of
similarity to the kinds of uses normal children put language to - at,
say, age 5½. For instance, are the utterances and contexts of child
language experience similar to those found in widely used reading
programs? Hart et al. have found that in fact child language in some
important respects bears a closer resemblance to newspaper copy, or
to the text of an article likely to be found in a popular magazine, than
to the language of widely used reading curricula.
Fortunately, the Australian project provides an alternative
curriculum for the teaching of reading and suggests an approach to
the validation of curricula of many sorts. Indeed, it affords a model,
or at least one possible way, of relating the curricula employed in
school to the experience of children outside the school. As they point
out in Child Language: A Key to Literacy, it is a widely accepted
notion that teaching should begin with what is known and build upon
it. It certainly should not begin by tearing down or throwing away
what is known.

Well, where should the reading curriculum begin? There have been
many answers. For instance, the so-called 'phonics method' insists
that it is sounding out the words that the child must learn to do. The
'phonetic' system on the other hand asserts that it is the relationship
between sound and symbol itself that the curriculum must simplify by
making sure that each phoneme of the language in question has only
one spelling instead of many as in traditional orthographic systems
(especially English, where the spellings of words are notorious for
their diversity of pronunciation, and similarly, the pronunciations
are notorious for the diversity of representations). Another method
emphasizes the recognition of 'whole words'. Yet another stresses the
'experience of the child' outside of school. Another author may
suggest the pot-pourri approach or the eclectic method, taking a little
from here and a little from there according to personal preference.
The Australian research program at Mount Gravatt is refreshingly
empirical. It is predicated on the belief of Norman Hart and other
members of the team that the best foundation for teaching children to
read is what they already know about talking and understanding talk.
This idea is not new, but their use of it is new. They reasoned that if
the language used in beginning readers were pragmatically mapped
onto experiences, facts, events, and states of affairs of known or
discoverable sorts (from the child's vantage point), and if the
utterances that appeared in written form in the texts also conformed
to the kinds of utterances (that is, pragmatic shapes) already known to
the children, they would have an easier time of learning to recognize
and manipulate the printed forms of those utterances.
The task Hart el ol set themselves was, therefore, two-fold : 51"st, to
determine the nature of children's utterances in contexts of normal
communicatio n ; second, to devel op a reading curriculum that would
utilize such utterance forms. Probably the most striking difference
between the uttera nces of children and t he reading curricula to which
they are usually exposed (subjected?) has to do with the deep struct u re
of children's disco urse - that is, the meanings they are capable of
understanding and expressing. In particular, it is obvious in
comparing samples of actual child speech with samples of school
book material that the children' s language is far richer, more
interesting and more complex than the language of the books.
Children are capable of understandin g, and producing more abstract,
more comple x, and more intriguing plots, situations, relationships,
and states of affairs than they are usually presented with in school
texts. A second contrast has to do with the forms of utterances. The
actual protocols of child language reveal a very different surface form than the typical early reading lessons in most school books.
To illustrate the aforementioned contrasts, it may be useful to compare actual transcripts of child language discourse with samples of text taken from reading curricula for the early grades. The child language samples are from the Mount Gravatt Language Research Program. They are excerpted from the transcripts associated with 5½ year old children from the Brisbane area. In the first example of child discourse, Jason is at school doing a kind of show-and-tell about a trip to a Japanese restaurant, The Little Tokyo. Jason begins with what is apparently a practiced introduction for such occasions:

'Mrs Simmons, Boys and Girls, we went to the restaurant and got chopsticks. Ah ...' The teacher interrupts, 'They're at school here somewhere, aren't they?' Jason answers, 'In my port.' He is referring to a lunch-box type of affair that the Australian children carry their school things in back and forth between home and school. The teacher suggests, 'Go and find them so you can show them to us.' While Jason is getting the chopsticks Mrs Simmons says, 'We're just waiting for Jason.' He returns holding up the chopsticks and announces, 'These're chopsticks.' The teacher interprets, 'You like those. They're chopsticks. Chinese people eat with them.' The teacher aide interjects, 'Jason will show you how he eats with them.' Another child puts in, 'We got some.' Mrs Simmons responds, 'You've got some at home too?' Then, turning to Jason, she urges, 'Show them how you use them, Jason. That's right.' The aide puts in, 'Did you like the Chinese food you had?' Jason answers, 'Aw, aw, some chicken.' Mrs Simmons answers, 'Oh, that was nice,' and the aide asks, 'What did Mummy have?' 'Ah, rice,' Jason replies. 'She had rice,' Mrs Simmons answers, 'Right. Thanks, Jason.'

Throughout the school day, Jason makes repeated references to the experience at the restaurant. For instance, later, during free time, he says,

'I'm doing a restaurant first. I'm gonna do the sukiaki. The Little Tokyo. Mrs Simmons!' She answers, 'Yes.' 'Um, the restaurant's called The Little Tokyo ...'

Compare the discourse in which Jason participates with the following material from McCracken and Walcutt (1969), Lippincott's Basic Reading:

Ann ran. A man ran. A ram ran (p. 9).
Run, rat, run, run, run, run. Run to a red sun. Run to a red sun. Run, run, run (p. 17).
Interestingly, on the following pages the text seems to ramble from one context to another with no thought for meaning or context. At least there is no apparent attempt to maintain an experientially sensible flow. For instance, the authors jauntily jump from dropping eggs and spinning tops to a ram running at two boys whose names begin with T and who happen to be resting in a tent. In the end a dog named Rags chases Red, the squirrel, up a tree.

The next sample of actual child discourse comes from a different 5½ year old child. He's having breakfast with Mom and the experimenters who are collecting the data sample:

Mom says, 'That's a good boy. Don't pull the cloth. You've got it all crooked.' The child responds, 'Mr. Fraser's bigger'n you. Oh, who pulling the table?' Mother replies, 'You are!' He answers, 'I'm not.' She says, 'Sit up and be quiet.' He repeats, 'I'm not. Something under the table what's pulling.' Mother answers, 'All right, fine.' One of the experimenters approaches. 'Can I sit anywhere?' Mother says, 'Just anywhere. Anywhere, yes. You've got a spoon, right?' The experimenter (apparently Mr. Fraser) answers, 'Yes.' The child interjects, 'I'm beat you.' Mr. Fraser responds, 'You beat me, did you?' 'Yeah,' the child answers. 'You've got your other course yet, so you might be beaten yet,' the experimenter challenges. 'I only want one thing and then nothing,' the child responds, indicating that he is almost through. Now, Mother addresses the other experimenter, 'Did your father find his way all right, Danelle?' 'Yes,' Danelle replies. 'Oh, beaut,' says the child's mother in typical Australian slang. Danelle continues, 'We missed the street first, but, ah, when we were going past, and I said, "That's it." He said, "No it isn't." We looked past. Yes, that was it.' The other experimenter puts in, 'You were right, eh?' The child adds, 'You still remember where my house is, cause I told you, but you didn't know, did you, a while ago?' Danelle says, 'No.' The child continues, 'You thought you didn't, they did.' Danelle says, 'Mmmm.' 'But, but I told you. Then you know,' the child continues and then adds as an afterthought, 'You didn't drive here.' Danelle asks, 'Hmmm?' 'You didn't drive here, cause you haven't got any license.' 'Mmm. That's right,' Danelle answers. 'If you drive ...' the child begins but is interrupted by his mother. 'Do you have a license, Danelle?' she asks. 'No,' says Danelle. 'You haven't,' Mother says rhetorically. 'I wish I had now,' Danelle laments. 'Cause too bad!' the child interjects, 'I would put a match in the car.' Danelle says, 'Oh,' and he continues, 'and put in three bags of dynamite.' 'Don't be rude, Brenton,' his mother remonstrates, but he continues undaunted. 'And push the gear lever up, and I will blow up.' 'Mmmm,' says Danelle. Now Brenton warms to the conversation, 'That will be the last of the powerline. It will blow up. ... Even pour some pepper in your nose, and you will go achoo and I will ...' Mother interrupts, 'I think you're being a rude little boy.' He goes on, 'and I will put some salt down in your nose. And I will put some wire and some matches. I will light the match, put it in, and you will blow up to pieces.' 'Then I wouldn't be able to come and visit you,' Danelle suggests. But Brenton has a solution, 'Cause I think you could, cause I'd make you back into the same pieces and put black hair on you.' 'Oh,' she says. 'I started to find, cause I started to find some black chalk under the house ...' he continues.
Now, compare the foregoing discourse with a sample of text from Rasmussen and Goldberg (1970), A Pig Can Jig, published by Science Research Associates (a subsidiary of IBM) designed for use in reading curricula in the early grades:

man Dan ran fan can I the (p. 1).
I ran. Dan ran. The man ran (p. 2).
I can fan. Dan can fan. The man can fan (p. 3).

The text continues with declarative sentences with several possible permutations of Dan and the man being fanned. Later, near the end of the book a character named 'Jim' has a remarkable conversation with 'Dad' about fitting a big rag in a bag. Dad asks if it can fit. Jim says it can if he rips the bag - so 'rip! rip! rip! zip! zip! zip!' And with rags in his lap Jim finally gets the big one in the bag (see p. 83). Strange discourse, isn't it? Hardly like what children really say.
A final example of child discourse comes from a little girl in the same unpublished Mount Gravatt data sample. She too is 5½. At school on the playground she begins to talk with one of the experimenters who is helping to record the speech sample:

'That's what Mark taught me,' she says, 'karate chops.' 'Oh, did he,' says the adult. 'He used, he used to have his arm in plaster. He got ran over by a car. He got knocked over by a car. And guess what he was doin'. Ridin' a motor bike with one hand.' 'Yeh,' says the adult. 'He's clever,' continues the little girl. 'He's clever, is he?' says the adult. 'He changed hands and he fell off. He nearly broke the other arm. He fell in the water and wet all the plaster.'

Compare this data protocol with the following excerpt of text from Early, Cooper, Santeusanio, and Adell (1974), Sun Up Reading Skills 1, which is part of the Bookmark Reading Program published by Harcourt, Brace, Jovanovich:

The sun was up (p. 5). Sandy [the dog] was up (p. 6). Bing [the cat] was up (p. 7).
This unusual beginning is followed by all possible combinations of the greeting 'Good Morning' with the sun, Sandy, Bing, Bing and Sandy, and Sandy and Bing. Then on page 41 there is a sequence where Bing, the dog, and a certain grasshopper, each in their turn go 'hop, hop, hop'. Later, near the end of the book, a turtle, a duck, a rabbit, and a character known as 'Little Pig' all fall down the hill, 'down, down, down' (p. 72).
There are some remarkable differences between the sorts of things children say, the kinds of conversations they engage in, and the texts they are expected to learn to read from. In the children's speech samples, there is a rich system of organization whereby utterances are tied to meanings - present events, previous events, persons, objects, causes and effects. In the reading texts there is a near total disregard for such organization. The point of such texts, apparently, is to present forms of language that use a small inventory of elements (letters and sequences of them as surrogates for sounds and syllables). The object is certainly not to say anything a child would be likely to think of saying. The materials exemplified from widely used readers, rather, display a near complete disregard for intelligent communication. They lack any sense of flow or coherence which is so characteristic of actual language use. They say the most unusual things for the sake of being able to use certain syllables, consonants, or vowels with practically no attention whatsoever to the highly developed expectancies of children concerning the likely relationships between utterances and meanings.

The reading materials developed by Hart, Walker, Gray and the other members of the Mount Gravatt research team on the other hand are predicated on the assumption that the texts should contain utterance forms and meanings that are systematically related to the sorts of meanings that children commonly communicate spontaneously, and in the normal school contexts of their everyday experience. They contend that in order to make the reading materials meaningful to the child, 'he has to be placed in a practical situation which clearly demonstrates contextual meaning' (Hart and Gray, 1977, p. 2). The approach which they have developed on the basis of 'Pragmatic Language Lessons' is systematically rooted in key concepts of the total curriculum as well as the linguistic experience of the child. It has profited much from the failures of other reading programs that lacked the pragmatic emphasis, and it stands on the shoulders of those programs that have emphasized the language experience of children.
Although it is impossible to do justice to the full scope of even a single lesson in the Mount Gravatt program in the short space that is available here, it may be instructive to note the contrast between the first lesson in their program and the examples cited earlier. With suitable illustrations displaying the pragmatic sense of the language forms used, in the first lesson, the children read:

I'm five. I'm walking to school. That's my teacher. I've got friends (see Hart and Gray, 1977, p. 24 of the Teacher's Manual).

Preliminary results reveal that children are not only learning to read in very short order, but they are spontaneously writing as well.
But what has all of this to do with language testing? Everything. If we are interested in valid tests of how people actually use language, the ultimate validity criterion is how they actually use language. The question of how to develop new testing procedures (or how to test existing ones) is intimately related to what we want the tests to be tests of. The latter question, at least in education, is related to the equally important question concerning what the curriculum is supposed to be accomplishing. The problems of teaching and testing are as closely linked in a coherent philosophy of education as are time and space in modern physics.

The above samples of children's discourse illustrate a remarkable complexity and abstractness as well as coherence and sense. The speech of children is not empty as the old time Dick and Jane readers might have led us to suppose. Neither is it dull and insipid as many of the modern approaches imply. It does not repeat endlessly ideas that have already been made clear or which can be easily inferred by ordinary people from the context. It does not jump aimlessly about from topic to topic like a pattern drill. Child discourse arises within meaningful contexts that spark interest and set the genius of language off and running. It has coherence in the sense of meaningful sequence where events, things, people, relationships, and states of affairs are connected by causal and other relations. It has sense because the utterances of children are related to present and implied contexts in sensible non-random ways.

D. Guidelines and checks for new testing procedures


In the preceding chapters we have discussed quite a number of tests that have been shown to meet the two naturalness criteria for
pragmatic language tests. Further, we have noted in the case of each type of testing procedure discussed that it is really a family of testing procedures rather than a single test. For instance, dictation can be done in a vast variety of ways, as can cloze testing, oral interview, and essay writing. There are potentially many other procedures that will work as well or possibly even better than the ones considered in this book. It seems likely that the specific procedures recommended here will be improved on as more research is done in the coming years. It is, therefore, probably safe to say that the best pragmatic testing procedures have yet to be invented. It is the purpose of this final section to suggest some heuristic guidelines on the basis of which new procedures might be developed. Of course, any technique that is proposed will have to be tested the same as the existing procedures have been and are being tested. When dealing with such empirically vulnerable quantities as test scores, it is never safe merely to assume that a test is a good test of whatever it purports to be a test of. Tests must be tested.

By now it may be obvious to the reader that the first guideline to be recommended must be to select a discourse processing task that faithfully mirrors things that people normally do when using language in natural contexts. Deviation from normal uses of language requires justification in every case - whereas adherence to normal uses of language is an initial basis for asserting test validity. Scoring conversational exchanges for communicative effectiveness on a subjective basis requires less justification than scoring conversational exchanges on the basis of certain discrete points of surface form singled out by a linguistic analysis. The question of whether a person can or cannot make himself understood is a common sense question that relates easily to normal uses of language. However, the question of whether or not a person knows and uses certain functors (e.g., the plural morpheme, articles, the possessive morphemes, pronominal forms, prepositions, tense markers and the like) is less obviously and less directly related to how well he can speak or use a language. The latter type of question thus requires more justification as a basis for testing than does the former.

Discourse processing tasks of all sorts are logical choices for tasks that may reasonably be converted into procedures that might justifiably be called language tests. It is necessary that such tasks be quantifiable in some way in order for them to be used for measurement and evaluation in the usual ways - for instance, to assess the student's ability; to evaluate the effectiveness of the
teacher; and to judge the instructional validity of the curriculum. Quantification may be accomplished by virtue of a scoring procedure that counts certain units in the surface form of the discourse, such as the counting procedures that are used with dictation and cloze testing, or it may be achieved on a more subjective basis in relation to a judgement of communicative effectiveness as in the evaluation of oral interview protocols.
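To make the quantification contrast concrete, the following sketch shows the exact-word scoring convention commonly used with cloze tests. The function name and the sample data are invented for illustration and are not drawn from any of the procedures reported in this book.

# A minimal sketch of exact-word cloze scoring: one point per blank
# restored verbatim (case and surrounding spaces ignored).
def score_cloze_exact(responses, originals):
    return sum(1 for given, expected in zip(responses, originals)
               if given.strip().lower() == expected.strip().lower())

# Hypothetical protocol with five blanks; three are restored exactly.
print(score_cloze_exact(["the", "dog", "ran", "a", "house"],
                        ["the", "dog", "walked", "a", "home"]))  # prints 3

An acceptable-word variant would simply test membership in a set of contextually acceptable responses for each blank rather than identity with the single deleted word.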
The two principal properties of discourse that language tests must reflect are related to the naturalness criteria iterated and reiterated in relation to each of the testing procedures discussed in preceding chapters. First, the task must require the processing of sequences of language elements in temporal relationships, and second, it must require the mapping of those sequences of elements onto extra-linguistic contexts (and vice versa). This is tantamount to saying, as Frederiksen (1977a) does, that a major property of discourse is its 'coherence'. There is a meaningful sequence of words, phrases, and so on that corresponds to ordered relationships between states of affairs, events, objects, persons, etc. in the world of experience which is distinct from the linguistic forms per se. Further, the connections that exist between linguistic forms of discourse and extra-linguistic context, that is the pragmatic mapping relationships, are distinguished by the fact (as Frederiksen also points out) that they are discoverable only by 'inference'.

What sources of data can be investigated as bases for proposing new pragmatic language testing procedures? The Mount Gravatt research program is suggestive. If we want to know how people use language in school, it would make sense to investigate the uses of language in school settings. Similarly, if we want a test of a particular type of discourse processing - e.g., the ability to understand the language of the courts in the U.S., sometimes called 'legalese' - it would make sense to go to the contexts in which discourse of the type in question arises naturally. Examine it. Analyze it. Synthesize it. Develop some likely procedures and try them. Evaluate, revise, retest and refine them.
But how can we test the tests? We have already suggested several heuristics. Among them are the requirements of natural discourse processing tasks, but we must go further before we claim validity for any proposed language test. It needs to be demonstrated that the test produces the kind of test variance that it claims to produce. It must be shown that the test is reliable in the sense defined above in Chapter 3, and it must be shown that the test is valid in the sense of correlating
with other tests that are known to produce reliable and valid measures of the same skill or of a skill that is known to be strongly correlated with the skill tested.
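The correlational check just described can be sketched in a few lines; the scores below are invented, and in practice the criterion would be an established measure such as a well-studied dictation or cloze test.

# A minimal sketch of concurrent validation: correlate a proposed test
# with an established measure of the same skill.
import numpy as np

new_test  = np.array([12, 15,  9, 20, 17, 11, 14, 18])  # hypothetical scores
criterion = np.array([30, 34, 22, 45, 40, 27, 31, 43])  # established measure

r = np.corrcoef(new_test, criterion)[0, 1]  # Pearson product-moment r
print(round(r, 2))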
In the end we have to make practical choices. We run the risk of a
certain amount of inescapable error. We cannot, however, afford to
avoid the risk. Decisions are required. The schools are now in session
and the decisions are presently demanded. They cannot be neglected.
Not deciding on testing procedures is merely a form of a decision that
itself is fraught with the potential for irreparable harm to school
goers. The decisions cannot be avoided. The best course, it would
seem, is to make the best choices available and to keep on making
them better.

KEY POINTS
1. Language processing skills are essential in obvious ways to reading and
writing, and they are also important in less obvious but easily
demonstrable ways to arithmetic - hence to all of the three R's.
2. Stump's research shows that traditional educational measures (in
particular the ones used in the Saint Louis schools) are principally
measures of language skill.
3. Gunnarsson has demonstrated that tests of many different types
including so-called measures of 'intelligence', 'achievement', and
'personality' include item types that are often indistinguishable from
each other, though they appear in tests with different labels, and which
are also indistinguishable from items that appear on standardized tests
aimed at measuring language proficiency.
4. The ultimate problem of test validity is a matter of what the curriculum
tries to teach - not the curriculum itself.
5. The validity criterion for language tests can be shown to be identical to
the validity criterion for language teaching curricula.
6. Discourse processing skill is central to all sorts of learning and to human
cognition in general.
7. Investigations of human discourse processing skills are presently
proceeding on two related fronts: first, there are simulation studies
where the objective is to get a machine or an appropriately programmed
device to produce and/or interpret discourse; second, there are
investigative studies that actually collect data on inputs to and outputs
from human beings during their production and comprehension of
discourse.
8. Research into the nature of discourse processing suggests that it is guided
by attention to hypotheses about meaning or plans for communicating
meanings - not a very surprising fact in itself.
9. But, if the foregoing fact is not realized by curriculum designers, it is possible that they will neglect the meanings and requisite inferential processes associated with their encoding and decoding and will plan curricula that focus too narrowly on surface forms of language.
10. Reading curricula that train children to concentrate on decoding symbols into sounds, or words into sequences of sounds, and the like may result in children who can laboriously read sounds and words without understanding what the text they are reading is about.
11. Language teaching curricula that focus on the surface forms of language and neglect the pragmatic mapping of those surface forms onto extra-linguistic contexts and the inferential processes that the normal pragmatic mappings require must necessarily fall short of their primary goal - namely, enabling learners to understand, produce, read, and write sensible discourse in the language.
12. The Mount Gravatt reading program was developed out of research into the discourse of children. Among other things, the research demonstrated that the children were typically able to handle considerably more complex ideas than they are usually exposed to in books designed to teach them how to read. Further, the surface forms used in the books are scarcely similar to those actually used by children in normal discourse.
13. It can be effectively argued that the accurate characterization of real discourse processing has everything to do with the development not only of curricula for instilling discourse processing skill, but also of tests that purport to assess such skill.
14. Conformity to normal discourse processing is a natural criterion to require of language tests - related to it are the criteria of coherence and inference, meaningful sequence and pragmatic mapping.
15. In the final analysis tests must be tested in empirical contexts to determine whether they measure what they are supposed to measure.

DISCUSSION QUESTIONS


1. Consider ways in which language skills are crucial to the development of knowledge within a particular area of the curriculum at your school. Analyze the written materials presented to students, collected from them, spoken explanations and lectures presented to them, and responses orally elicited from them. Where do the students get practice in developing the language skills necessary to hear about, talk about, read about and write about mathematics? Geography? Chemistry? Social studies? Literature?
2. Consider a difficult learning task that was set for you in school. How did you conquer it, or why did it conquer you? Did language figure in the problem? Was it written? Spoken? Heard? Read?
3. If IQ tests are principally measures of language proficiency, how can their present uses be modified to make them more meaningful and more valid educational tools?
4. If achievement tests are as much a measure of language proficiency as of unique subject matter areas, what does this imply for the interpretation and use of achievement batteries? What remedies would you recommend, for instance, if a child got a low score in a certain subject matter, or in all of them (which is usually the case for low scorers)?
5. Do you believe that IQ can be modified? Do you think that language
proficiency can be modified? Do you think that what IQ tests measure can be modified? How could you test your opinions?
6. If there can be shown to be no test variance produced by so-called measures of IQ that is not also produced by so-called measures of language proficiency, what can be deduced about both types of test? Now, consider inserting other labels in relation to the same empirical result. For instance, consider personality tests in relation to language tests? Achievement tests in relation to language tests?
7. Which sort of construct, as an abstract object of psychometric theory, is more difficult to define - language proficiency or intelligence? Personality or language proficiency? Defend your answers with empirical data. What will you take as evidence for intelligence, language, or personality? Which sort of evidence is most readily accessible to the senses? Which is most accessible to test? To experimentation? To change? To therapeutic intervention?
8. Invent a testing technique for use in your classroom to assess student knowledge, your success as a teacher, the effectiveness of the curriculum you are using (or that your school is using), or pick one of the many evaluation procedures that you have already used. Evaluate the procedure. Does it conform to the normal requirements on pragmatic tests? Does it require normal language use? Does it relate the curriculum to normal contexts of discourse that the learner can relate to as a human being (with intelligence)? Does it require the manipulation of meanings and forms under normal temporal constraints? Could another task that meets these requirements be constructed to do what you want done?
9. Analyze samples of test data - learner outputs in response to tests used in your school. Do the learner protocols reflect normal communicative processes? Is the learner trying to say something, or write something meaningful for the benefit of someone? Or is the task devoid of the essential communication properties of normal language use? Are the tests of a discrete point, analytical, take-things-apart type? Or do they require the use of information in relation to normal contexts of human experience?
10. Write a letter to your school principal, head administrator, or the school board asking them to explain the purpose behind the standardized or other testing procedures used in your school or school district. Ask for the rationale behind the use of IQ measures, personality inventories, aptitude batteries, and any other tests that may be used for prognostic purposes. Ask him, or them, to explain how the achievement batteries that may be used reflect what the school is or should be teaching in the curriculum. If you are dissatisfied with the response, as you are likely to be, then why not get involved in trying to change the tests and the curriculum in your own classroom, school, district, state, or nation to make them more responsive to the main objective of education, namely, enabling people to enjoy fully the benefits of the negotiation of human discourse.
SUGGESTED READINGS
1. J. Britton, Language and Learning. New York: Penguin, 1973.
2. Bjarni Gunnarsson, 'A Look at the Content Similarities Between Intelligence, Achievement, Personality, and Language Tests,' in J. W. Oller, Jr. and Kyle Perkins (eds.) Language in Education: Testing the Tests. Rowley, Mass.: Newbury House, 1978.
3. Norman W. M. Hart, R. F. Walker, and B. Gray, The Language of Children: A Key to Literacy. Reading, Mass.: Addison-Wesley, 1977.
Appendix

The Factorial Structure of Language Proficiency: Divisible or Not?

A. Three empirically testable alternatives
B. The empirical method
C. Data from second language studies
D. The Carbondale Project, 1976-7
E. Data from first language studies
F. Directions for further empirical research

This Appendix is included for several reasons. It discusses the recent findings of several research studies that support some of the implications and suggestions offered in the earlier chapters of the book, and it clarifies some of the avenues of empirical investigation that have only just begun to be explored. It is included as an appendix rather than as a part of the body of the text because of the statistical technicality of the arguments and data presented. There are many related questions that could be discussed, but the main focus of attention here will be whether or not language ability can be divided up into separately testable components.

Among the closely related issues that are considered peripheral to the main question is whether or not first language learning and second language learning are essentially similar or fundamentally different processes. The evidence that can be amassed from the data discussed here would seem to suggest that the similarities across the two learning tasks outweigh the differences - that inferences concerning second language acquisition (the attainment of second language proficiency) are usually also applicable to first language acquisition (attainment of first language proficiency), and vice versa. Further, the data do not seem to support the view that what has sometimes been called 'acquisition' (language learning in the natural contexts of communication) is distinct from 'learning' (language
learning in formal classroom contexts, insofar as the latter can ever be said actually to take place). About the only distinction that seems supported by the data is that learning in the classroom is usually far less efficient than learning in more natural communication contexts.

A. Three empirically testable alternatives


Is language proficiency, however it may be attained, divisible into components that may be assessed separately, for instance, by different testing procedures? Another way of putting the question is to ask whether language processing tasks of diverse sorts tend to produce overlapping variances or whether they tend to produce unique variances (in the algebraic sense of 'variance', as defined in Chapter 3, above). Or putting it in yet another way, we might ask, what is the extent of correlation between a variety of language processing tasks, or tests that purport to measure different aspects of language proficiency? The usual assumption is that tests which have the same name, or which purport to measure the same thing, should be highly correlated, whereas tests that purport to measure different things need not be highly correlated, and in some cases should not be.

With respect to language proficiency, three possibilities can be suggested. It has been suggested that language skill might be divided up into components much the way discrete point testers have suggested. We will refer to this possibility as the divisibility hypothesis. For instance, it might be possible to differentiate knowledge of vocabulary, grammar, and phonology. Further, it might be possible to distinguish test variances associated with the traditionally recognized skills of listening, speaking, reading, and writing, or aspects of these skills such as productive versus receptive competencies vis-à-vis the hypothesized components (e.g., productive phonology in an oral mode, or receptive vocabulary in a written mode). But this is only the first of three possibilities.

A second alternative is that the construct of language proficiency may be more like a viscous substance than like a machine that can readily be broken down into component parts. We will refer to this second possibility as the indivisibility, or unitary competence hypothesis. It may be that language proficiency is relatively more unitary than the discrete point testers have contended. Perhaps what has been called 'vocabulary' knowledge (as measured by tests that have been called 'vocabulary' tests) cannot in fact be distinguished from 'grammar' knowledge (as measured by 'grammar' tests). This alternative is not apt to be found as appealing as the first mentioned
one, but it cannot be excluded by pure logic.

A third possibility is to take a kind of middle ground. Actually, there are many points between any two well-defined positions, so logically, this third alternative could express itself at any point between the two extreme possibilities already defined. It will be called the partial divisibility hypothesis. It could be argued that in addition to a general component common to all of the variances of all language tests (at least those with some claim to validity), there ought to be portions of variance reliably (consistently) associated with tests in a listening mode that would not also be associated with tests in, say, a reading mode. That is, the 'reading' tests ought to share some variance among them that would not be common to 'listening' tests; 'vocabulary' tests should share some variance not common to 'grammar' tests; and so on for all of the contrasts between all of the posited components assumed to exist over and above the general component presumed to be common to all of the language tests.

Whether we take the first alternative or the third, we must find testing procedures that will generate variances that are unique to tests that are supposed to measure different things. Either the indivisibility hypothesis or the partial divisibility hypothesis allows for a large general factor (or component of variance) common to all language tests. The difference between these alternatives is that the indivisibility hypothesis allows only for a general component of test variance. Once such a component is accounted for, the indivisibility hypothesis predicts that no additional reliable variance will remain to be accounted for.

Hence, the three alternatives allow for three kinds of variance - error variance (random variance), reliable variance common to all of the tests, and reliable variance common only to some of the tests. They can be summarized as follows:

The Divisibility Hypothesis (H1): there will be reliable variance shared by tests that assess the same component, skill, aspect, or element of language proficiency, but essentially no common variance across tests of different components, skills, aspects, or elements;
The Indivisibility Hypothesis (H2): there will be reliable variance shared by all of the tests and essentially no unique variance shared by tests that purport to measure a particular skill, component, or aspect of language proficiency;
The Partial Divisibility Hypothesis (H3): there will be a large chunk of reliable variance shared by all of the tests, plus small amounts of reliable variance shared by only some of the tests.
In all three cases, some non-reliable variance must be allowed for. Thus, it is important to the question of test validity to determine whether the error variance is large or small in relation to the reliable variance attributable to a particular construct or skill. If, for instance, the reliable variance attributable to a general factor were as large as the estimated reliable variance of any single test in, say, a large battery of diverse tests, it would seem reasonable to assume that the only variance left over after the general factor was accounted for would have to be error variance, or unreliable variance.
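Stated in a compact notation (introduced here only for convenience; it is not the notation of the studies discussed below), the three hypotheses make competing claims about the variance of any test i in a battery:

\sigma_i^2 = \sigma_g^2 + \sigma_{s(i)}^2 + \sigma_{e(i)}^2

where \sigma_g^2 is reliable variance shared with all of the tests, \sigma_{s(i)}^2 is reliable variance shared only with tests of the same posited skill or component, and \sigma_{e(i)}^2 is error variance. H1 predicts that \sigma_g^2 is essentially zero, H2 predicts that \sigma_{s(i)}^2 is essentially zero, and H3 predicts that both reliable terms are present, with \sigma_g^2 large.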

B. The empirical method


Thus, we can see that the question posed at the outset - namely, whether or not language proficiency is divisible into components - can be construed as an empirical issue with at least three alternative outcomes. The crucial experimental method from the point of view of language proficiency measurement is to examine the correlations among a battery of tests administered to a large enough sample of language learners to provide the necessary statistical reliability. Actually, such an empirical study can be viewed as a method of evaluating the theoretical hypotheses and at the same time assessing the construct validity of the various tests that might be included in such a study. If the tests employed were not demonstrably reliable in their own right, and if they had no independent claims to validity, a failure to clarify the choice between the theoretical positions would hardly be conclusive. On the other hand, if the tests employed in the research can be shown on independent grounds to be valid measures of language ability, the results may indeed discriminate convincingly between the several theoretical positions, and at the same time, further substantiate the claims to validity of some or all of the tests employed. Similarly, if tests that purport to measure different things can be shown to measure essentially the same thing, the construct validity of those tests must be re-evaluated.

Some of the evidence concerning the above hypotheses was discussed briefly in Chapter 3 of this volume. However, the technicality of the procedures used in some of the data analyses requires that they be discussed in somewhat more detail than seemed appropriate in the main body of the text. Most of the studies that are pertinent to the central question rely on the statistical technique of factor analysis - or in fact on the family of techniques that go by that name. In particular, the most appropriate method for testing for a
general factor is the one originally developed by Charles Spearman at
about the turn of the century. It is the method often referred to as
'principal components analysis'.
Factor analysis is a family of statistical procedures that examine many correlations simultaneously to sort out patterns of relationships. All factoring methods are concerned with variances in the statistical and algebraic sense of the term. Especially, they are concerned with determining what patterns of common and unique variances exist for a given set of measures. The principal components method was originally conceptualized by Spearman in an attempt to determine whether or not there existed a general factor of what he called 'intelligence'. This factor was often spoken of in the literature and at professional meetings, and came to be referred to as 'g'. The empirical evidence for 'g' consisted in factoring a battery of tests to a principal components solution. When this was done repeatedly, it was determined that a general factor existed which 'explained' or 'accounted for' a substantial portion of the variance in just about any complex problem solving test. Whether it was a matter of finding a route through a maze or discussing a verbal analogy or matching abstract visual patterns, 'g' always seemed to emerge as the first and largest principal component of just about any such test. Thus, it came to be believed that there was indeed a general factor of intelligence, call it 'g'.

As Nunnally (1967) points out, the notion of a general factor of intelligence though popular for a season lost its luster after a few years and in the 1960s and early 1970s was rarely referred to in the literature. Jensen (1969) mentioned it simply to note that it remains 'like a Rock of Gibraltar', undaunted by the several theoretical attacks launched against it. Nunnally, however, points out a method of extending and simplifying the technique of principal components analysis to test for a single general factor. He shows that if there were only one factor, it is a mathematical necessity that the correlation between any two tests in the set used in the definition of the general factor must equal the product of the separate correlations of each of the tests with the general factor. It sounds complex, but at base it is quite simple.
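Stated compactly (again, the notation is introduced here only for convenience): if a single factor g underlies all of the tests, and r_{jg} and r_{kg} are the correlations, or loadings, of tests j and k on g, then for every pair of tests

r_{jk} = r_{jg} \cdot r_{kg}

so the entire correlation matrix can be checked, pair by pair, against the products of the loadings.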
The first principal component extracted in the factor analysis is actually a linear combination of the variables (or tests) entered into the computations in the first place. The correlation of each contributing variable with that factor is an index of the contribution which that variable makes to the definition of the factor. The same
correlation can also be used as an estimate of the validity of the contributing variable as a measure of the posited factor supposed to underlie that principal component. Suppose we call that factor an expectancy grammar (or any other name that signifies the internalized knowledge of language). To the extent that the posited factor actually exists, the 'loadings' (another name for the correlations of the individual contributors with that factor) of the respective variables input to the analysis can be read directly as validity coefficients. Further, if we assume that the general factor, that is the expectancy grammar, exhausts the available meaningful variance, it follows as a mathematical necessity that the product of the loadings of any pair of variables on that factor must equal the correlation between them. Without going into any of the details, this is similar to saying that the general factor (if it exists, and if the indivisibility hypothesis stated above is correct) is either the only factor that exists, or it is the only one that the tests are capable of measuring.
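A minimal numerical sketch of this principal components check follows. The score matrix is invented (a common component is injected so that a 'g' will emerge); in a real analysis each column would hold learners' scores on one test.

# A sketch of the Spearman-style principal components check.
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(size=(200, 5))     # 200 learners, 5 hypothetical tests
scores += 2.0 * scores[:, [0]]         # inject a shared 'g'-like component

R = np.corrcoef(scores, rowvar=False)  # correlations among the five tests
eigvals, eigvecs = np.linalg.eigh(R)   # eigh returns ascending eigenvalues
loadings = eigvecs[:, -1] * np.sqrt(eigvals[-1])  # first principal component
loadings *= np.sign(loadings.sum())    # orient the arbitrary sign

print(eigvals[-1] / eigvals.sum())     # proportion of variance explained
print(loadings.round(2))               # read as validity coefficients
# Under the indivisibility hypothesis, R[j, k] should approximate
# loadings[j] * loadings[k] for every pair of tests j and k.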
Fortunately, even if the indivisibility hypothesis should turn out to be incorrect, the factoring methods applicable subsequent to a principal components analysis can conveniently be used to test hypothesis three. As we will see shortly, there is no hope whatsoever for the first hypothesis (the divisibility hypothesis). Other statistical techniques can also be used, but the most obvious approach is the one used here. Multiple regression techniques that treat individual language tests as repeated measures of the posited general component of expectancy grammar can be used to sharpen up the picture (for this approach see Kerlinger and Pedhazur, 1973), but the conceptualization of the problem with regression techniques is considerably more complex and is not as easily accessible in computer programs. Of course, without computing facilities neither approach would be very feasible. Both techniques are computationally so complex that they could scarcely be done at all by hand.

C. Data from second language studies


The first application of Nunnally's proposed modification of Spearman's test for a general factor was to several versions of the UCLA English as a Second Language Placement Examination. That test consisted of at least five subtests in each of its administrations. The data were collected between 1969 and 1972, and were analyzed in 1974 with the Nunnally method. The results were reported in a
European psycholinguistics journal, Die Neueren Sprachen, and were also presented at a meeting of the Pacific Northwest Conference on the Teaching of Foreign Languages sponsored by the American Council of Teachers of Foreign Languages in April of 1976. (See Oller, 1976a.)
In brief, the findings showed that once the general factor predicted by the indivisibility (or unitary competence) hypothesis was extracted, essentially no meaningful variance was left in any of the tests on the several batteries of UCLA ESLPEs examined. The first principal component extracted accounted for about .70 of the total variance produced by all of the tests (in each of five separate batteries of tests), and whatever variance remained could not be attributed to anything other than error of measurement. In other words, both the divisibility and partial divisibility hypotheses (H1 and H3, above) were clearly ruled out. Since all five of the test batteries investigated were administered to rather sizable samples of incoming foreign students the results were judged to be fairly conclusive. Whatever the separate grammar, reading, phonology, composition, dictation, and cloze tests were measuring, they were apparently all measuring it. Some of them appeared to be better measures of the global language proficiency factor, but all appeared to be measures of that factor and not much else.

The second application of the Nunnally technique to second language data came in 1976. Data from the Test of English as a Foreign Language were available from a study reported by Irvine, Atai, and Oller (1974). The battery of tests investigated included the five subtests of the TOEFL - Listening Comprehension, English Structure, Vocabulary, Reading Comprehension, and Writing Ability - along with a cloze test and a dictation. The subject sample consisted of 159 Iranian adults in Tehran. Again the results supported H2, the indivisibility hypothesis. A single global factor emerged and practically no variance whatsoever remained to be accounted for once that factor was extracted.
A third application of Nunnally's suggested technique was to data collected as part of a doctoral dissertation study by Frances Hinofotis at the Center for English as a Second Language at Carbondale, Illinois in 1975-6. Hinofotis (1976) investigated the pattern of relationships among the various parts of the placement test used at CESL (SIU in Carbondale), the TOEFL, the five subscales on the Foreign Service Institute Oral Interview, plus a cloze test (in a standard written format). Her study, together with the Irvine et al
data, provided the basis for a report presented at the Linguistic Society winter meeting in Philadelphia (Oller and Hinofotis, 1976). The results with the CESL data were somewhat less clearcut than with the Iranian subjects. A major difference in the test batteries examined was that the Hinofotis data included the measures from the FSI Oral Interview procedure.
While a general factor emerged from the Hinofotis data, it was not possible to rule out the alternative that a separate factor, possibly associated with speaking skill, also existed. The data were examined from several different angles, and in each case, the general factor accounted for no less than .65 of the total available variance.¹ However, a rotated varimax solution seemed to justify the suggestion that the five subscales of the FSI Oral Interview were measuring something other than what was measured by the nine other tests studied. The question that arose at this point was whether the apparently separate oral factor could reliably be associated with a 'speaking skill' rather than the mere subjective judgement of the interviewers who provided the five ratings on each subject. Clearly, if a speaking skill factor could be isolated or at least partially separated from the general factor, this result would force a rejection of the strong version of the indivisibility hypothesis in favor of the partial divisibility option. There was also some hint in the Hinofotis data that perhaps a 'graphic skills' factor could be distinguished from the general factor.
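For readers who want to see what such a rotated solution looks like in practice, here is a minimal sketch using a present-day library; the data, the two-factor setting, and the loadings are all invented and have nothing to do with the actual Hinofotis battery.

# A sketch of a varimax-rotated two-factor solution on invented data.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)
general = rng.normal(size=(150, 1))  # hypothetical general proficiency
oral = rng.normal(size=(150, 1))     # hypothetical oral-specific ability
graphic_tests = general + 0.3 * oral + 0.5 * rng.normal(size=(150, 5))
oral_tests    = general + 1.5 * oral + 0.5 * rng.normal(size=(150, 5))
battery = np.hstack([graphic_tests, oral_tests])

fa = FactorAnalysis(n_components=2, rotation='varimax').fit(battery)
print(fa.components_.round(2))  # rows are factors, columns the ten 'tests'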
Although the Hinofotis data raised the possibility of separate skills (e.g., listening, speaking, reading, and writing, or possibly oral skills as distinct from graphic skills) it seemed to rule out conclusively the possibility of separable components (e.g., phonology versus vocabulary versus grammar and the like). There was no evidence whatsoever to support the claim that the so-called 'vocabulary' measures were producing any reliable variance that was not also attributable to the so-called 'grammar' or 'structure' tests.

¹ We should take note of the fact that the data reported by Oller (1976a) and by Oller and Hinofotis (1976) used slightly different statistical procedures. Whereas the first study used a non-iterative procedure with communality estimates of unity in the diagonals for the principal components solution, the latter studies reported by Oller and Hinofotis used an iterative procedure which successively refined the communality estimates through a procedure of repeated calculations. As a result, all of the available variance in the correlation matrix for the TOEFL, cloze and dictation from the Iranian subjects was explained by a single general factor. This seemingly anomalous finding is due to the iterative procedure and to the fact that once the error variance was discarded, nothing remained that was not attributable to the general factor. Whereas this procedure (factoring with iterations) is a common one, Nunnally argues that the method of using unities in the diagonal is mathematically preferable. For this reason, and because of the desire to avoid the seeming anomaly of a single factor explaining all of the variance in a battery of tests (which disturbed at least one psychometrist consulted) subsequent analyses reported here revert to the non-iterative method of placing unities in the diagonal of the initial correlation matrix.
In fact, there was no basis for claiming that the five separate scales of the FSI Oral Interview were measures of different things at all. Since the Hinofotis study, this latter result has been confirmed independently by Mullen (in press a, in press b) and Callaway (in press). When people try to judge speech for vocabulary, or for grammar, or for fluency, or even for accent, they apparently are so influenced by the overall communicative effect that all of the judgements turn out to be quite indistinguishable. This is a serious validity problem for the FSI Oral Interview - it could be resolved by condensing all of the scales to one single scale. It is important to realize that the failure of the FSI technique to distinguish components of oral skill is doubly significant because at least one of the contributing judges must be a trained linguist. If trained linguists cannot make such distinctions, who can?

D. The Carbondale Project, 1976-7


It remained to determine whether the variance produced by the oral interview procedure used in the Hinofotis study was merely a subjective consistency possibly unrelated to speaking skill, or whether it was a genuine source of variance in a separable component of language skill that could be called 'speaking ability'. A suitable empirical approach would be to devise a battery of speaking tests - or at least tasks that require speaking - along with a battery of tests aimed at the other traditionally recognized skills of listening, reading, and writing. At the same time, it would be desirable to include only tasks known to produce reliable variance and which had independent claims to validity as language tests.

The opportunity for the latter sort of study came in 1976-7 during the author's visiting appointment in the Department of Linguistics at Southern Illinois University on a grant from the Center for English as a Second Language. Several other reports discussing aspects of the research undertaken are to be published in two separate volumes, Language in Education: Testing the Tests and Research in Language Testing (Oller and Perkins, 1978, and in press). Therefore, only the relevant non-redundant data will be discussed here.
The project began by assembling a team of researchers, mostly
volunteers, with the able help of Richard Daesch, Administrative Director of CESL at SIU, and Charles Parish, Academic Director of CESL. Other major contributors included Professor Kyle Perkins, who was a driving force behind the attitudinal research incorporated, and George Scholz, who headed a group of instructors and staff responsible for much of the oral testing that was done. Reports by many other contributors to the work are included in the volumes mentioned above. The second task was to assemble a battery of tests aimed at four skills and, at Charles Parish's urging, important points of grammatical knowledge.
Early on there was considerable discussion about whether or not the tests should be aimed at discrete points of structure exclusively or whether they should have a broader focus. In the end, the desirability of including discourse processing tasks was agreed to by all, and the desire to assess specific points of grammatical knowledge was bent to the mold of a modified cloze testing procedure with selected points of deletion. Professor Parish prepared the latter test.

The enormity of the task of preparing tests of listening, speaking, reading, and writing (not to mention the grammar test battery) was enough to discourage many potential participants and some of them fell by the wayside. However, the objective of producing a battery of tests that would answer certain crucial questions about the factorial structure of language proficiency and which would also eventuate in the possibility of constructing a better placement instrument for the CESL program provided sufficient incentive to keep a surprisingly large number of teaching assistants, research assistants, and faculty doggedly plodding on.

In all, 182 students were tested. A total of 60 separate discourse processing tasks were used. Not every student was tested on all 60 tests. This turned out to be impossible due to the fact that all of the tests had to be administered to small groups meeting in separate classes. Because of absences, occasional equipment failures, inevitable scheduling difficulties, etc., it was not possible to get test data on every subject for every task. However, the smallest group of subjects that completed any sub-battery of tests (i.e., the sub-batteries aimed at listening, at speaking, and so forth) was never less than 36 and in some cases was as high as 137. Students tested were only those enrolled in CESL classes - they ranged in placement from the lowest level (1) to the highest level (5). There was a considerable tendency for the drop-outs to be students who were at the lower end of the distribution of scores. Hence, the students who completed most or all
of t he tests tended [ 0 be at the high end of the distribution. According
to Rand ( 1976) this sho uld bias things against the indivisibility
hypot hesis by reducing the total spread of scores.
Since the tests (or samples of them) are reproduced in Research in
Language Testing, only brief descriptions will be given here. Further,
only the factor analyses using list-wise deletion of missing cases will
be reported. Although separate computations based on the pair-wise
option were also done corresponding to each of the analyses
reported, the patterning of the data was nearly identical in every case
for the principal component (or 'g' factor) solution. Moreover, the
list-wise deletion of missing cases has the advantage of selecting cases
for analysis which are quite comparable across all of the tests
included in that particular analysis. The pair-wise deletion of missing
data, on the other hand, produces correlations across pairs of tests that
may be based on quite different sub-groups drawn from the total
population (Nie, Hull, Jenkins, Steinbrenner and Bent, 1975). Hence,
the list-wise procedure affords a more straightforward interpretation
of the factorial composition of the variance produced by the several
tests - it is less apt to be confounded by differences that might pertain
merely to contrasts across sub-groups accidentally selected by the
sampling technique.
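The contrast between the two options can be sketched as follows (a
minimal illustration in pandas, not the SPSS routines of Nie et al.
actually used for the analyses; the data frame is hypothetical):

    import numpy as np
    import pandas as pd

    # One row per examinee, one column per test; NaN marks a missed test.
    rng = np.random.default_rng(0)
    scores = pd.DataFrame(rng.normal(size=(20, 3)),
                          columns=["Test1", "Test2", "Test3"])
    scores.iloc[:5, 2] = np.nan  # five examinees missed Test3

    # List-wise deletion: every correlation is computed over the same
    # cases, namely those complete on all of the tests.
    listwise = scores.dropna().corr()

    # Pair-wise deletion (the pandas default): each correlation uses
    # whatever cases happen to be complete on that particular pair.
    pairwise = scores.corr()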
First we will look at the factorial structure of the various sub-
batteries taken as a whole (the entire battery of tests); then we will
examine the subscores pertaining to the separate tests within each
sub-battery. It should be kept in mind that the tests discussed in this
first analysis in many cases are composite scores derived from several
separate sub-tests.

i. Overall pattern.
Five types of tests can be distinguished. First, there were five sets of
tests aimed at listening tasks. They included the subtest called
Listening Comprehension on the Comprehensive English Language
Test (henceforth referred to as CELT-LC); an open-ended cloze test
in a listening format (Listening Cloze); a multiple choice cloze test in
a listening format (Listening MC Cloze); a multiple choice listening
comprehension test based on questions over three different texts that
the examinees heard (MC Listening Comprehension); and three
dictations (Dictation).
Second, there were four speaking tasks. The first was an oral
interview which resulted in at least five relatively independent scores
for accent (OI Accent), grammar (OI Grammar), vocabulary (OI
Vocabulary), fluency (OI Fluency), and comprehension (OI
Comprehension), respectively. The other three consisted of, first, a
composite repetition score over three separate texts (Repetition);
three fill-in-the-blank oral cloze tests (where the responses were
spoken into a microphone, Oral Cloze); and a reading aloud task
over three separate texts (Reading Aloud).
Third, there were three types of reading tasks. The first was the
Reading subtest from the CELT (CELT-R). The second type of reading
task involved identifying a synonym or paraphrase for a word,
phrase, or clause in a text (actually there were three tests of this type,
each over a different text) (MC Reading Match). The third reading
task was actually a composite of eight open-ended cloze tests in a
written format (Standard Cloze).
Fourth, three writing test scores were included. The first type of
writing task was actually an essay scored in two ways - first by a
subjective rating on a five point scale (Essay Rating), and second by
an objective scoring technique (number of error-free words minus
number of errors, all divided by the number of words required for a
fully intelligible rewrite by a native speaker, Essay Score). The second
type of so-called 'writing' test was actually the result of an attempt to
analyze writing tasks into three subtasks - namely, selecting the
appropriate word, phrase, or clause to continue a text at any given
point; editing texts for errors in choice of words, phrases, or clauses;
and ordering words, phrases, and clauses appropriately at certain
decision points in given texts (MC Writing). The third type consisted
of a teacher's rating on a five point scale of the accuracy and
completeness of written recalls of three separate texts that were
displayed in a printed format for exactly one minute each (Recall
Rating).
The fifth and last type of test included in the overall analysis
comprised two tasks aimed at grammatical knowledge. The first of
these was the Structure subtest on the CELT (CELT-S), and the
second was a modified cloze test (126 items) aimed at specific points
of grammatical usage (Parish's Grammar test, referred to above).
Following the program of test analysis discussed above, two factor
solutions are presented - the result of a principal components
analysis is given in Table 1, and of a varimax rotation method in
Table 2. In the first column of Table 1, the loadings of each test on the
hypothesized 'g' factor are given. In column two, the squares of those
loadings can be read to determine the amount of variance shared by
the 'g' factor and the test in question. The sum of the squared loadings,
or Eigen value, is given at the bottom of column two, and can be
divided by the total number of tests in the analysis to get the total
amount of explained variance, equal to .52.
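The arithmetic behind these columns can be sketched in a few lines (a
generic reconstruction in Python of the standard principal components
computation, not the program actually used for the study):

    import numpy as np

    def g_factor(corr_matrix):
        """First principal component of a correlation matrix, laid out
        as in Table 1: loadings on g, squared loadings (variance shared
        with g), their sum (the Eigen value), and the proportion of
        total variance explained."""
        eigenvalues, eigenvectors = np.linalg.eigh(corr_matrix)
        top = np.argmax(eigenvalues)          # eigh sorts ascending
        loadings = eigenvectors[:, top] * np.sqrt(eigenvalues[top])
        if loadings.sum() < 0:                # fix the arbitrary sign
            loadings = -loadings
        squared = loadings ** 2
        eigen_value = squared.sum()
        return loadings, squared, eigen_value, eigen_value / len(corr_matrix)

On this account, the .52 reported here is simply the Eigen value of
11.49 divided by the 22 scores entered into the analysis.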

TABLE 1
Principal Components Analysis over Twenty-two Scores on Language
Processing Tasks Requiring Listening, Speaking, Reading, and Writing
as well as Specific Grammatical Decisions (N = 27).

SCORES                            LOADINGS ON g   SQUARED LOADINGS

Listening
  CELT-LC                              .64              .41
  Listening Cloze                      .78              .61
  Listening MC Cloze                   .40              .16
  MC Listening Comprehension           .62              .38
  Dictation                            .83              .69

Speaking
  OI Accent                            .42              .18
  OI Grammar                           .88              .77
  OI Vocabulary                        .80              .64
  OI Fluency                           .62              .38
  OI Comprehension                     .88              .77
  Repetition                           .59              .35
  Oral Cloze                           .76              .56
  Reading Aloud                        .56              .31

Reading
  CELT-R                               .64              .41
  MC Reading Match                     .83              .69
  Standard Cloze                       .83              .69

Writing
  Essay Rating                         .71              .50
  Essay Score                          .77              .59
  MC Writing                           .85              .72
  Recall Rating                        .77              .59

Grammar
  CELT-S                               .55              .30
  Grammar (Parish test)                .88              .77

Eigen value = 11.49
The loadings on g given in Table 1 reveal that there are good tests of
g in each of the batteries of tests or scores. For instance, both the
Listening Cloze and the Dictation tasks produce substantial loadings
on g. Similarly, the OI Grammar, OI Vocabulary, OI
Comprehension, and Oral Cloze scores produce substantial loadings
on the same factor. Two of the reading tasks load heavily on the g
factor - both the MC Reading Match task and the Standard Cloze.
Among the writing tasks, the best measure of g appears to be the MC
Writing test, but all four tasks produce substantial loadings -
noteworthy among them are the two subjective scores based on Essay
Rating and Recall Rating. Finally, the Parish Grammar test loads
heavily on g.
It is worth noting that the 27 subjects included in the analysis
reported in Table 1 are a relatively small sub-sample of the total
number of subjects tested. In particular, this sub-sample is relatively
higher in the distribution than would be likely to be chosen on a
random basis - hence, the variability in the sample should be
depressed somewhat, and the g factor should be minimized by such a
selection. This kind of selection, according to some writers (especially
Rand, 1976), should reduce the importance of g, yet it does not.
Furthermore, as we noted above, the results of a pair-wise deletion of
missing cases (thus basing computations of the requisite correlations
on different and considerably larger sub-samples of subjects) were
essentially similar. Hence, the pattern observed in Table 1 appears to
be quite characteristic of the population as a whole. Nonetheless, for
the sake of completeness a varimax solution over the same twenty-two
tests is presented in Table 2.
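The rotation itself can be rendered generically as follows (the classic
Kaiser varimax iteration in numpy form; a reconstruction for the
interested reader, not the routine actually used in the study):

    import numpy as np

    def varimax(loadings, max_iter=100, tol=1e-6):
        """Rotate a (tests x factors) loading matrix so that the
        variance of the squared loadings within each factor is
        maximized, driving each test toward a few large loadings."""
        p, k = loadings.shape
        rotation = np.eye(k)
        criterion = 0.0
        for _ in range(max_iter):
            rotated = loadings @ rotation
            gradient = loadings.T @ (
                rotated ** 3
                - rotated @ np.diag((rotated ** 2).sum(axis=0)) / p
            )
            u, s, vt = np.linalg.svd(gradient)
            rotation = u @ vt
            if s.sum() < criterion * (1 + tol):
                break
            criterion = s.sum()
        return loadings @ rotation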
The patterning of correlations (factor loadings, in this case) in
Table 2 is somewhat different from that in Table 1. However, the total
amount of explained variance is not a great deal higher (at least not
for the interpreted loadings). Further, the patterning of loadings is
not particularly easy to explain on the basis of any existing theory of
discrete point notions about different components or skills. For
instance, although both Reading Aloud and OI Fluency are supposed
to involve an element of speed, it does not emerge as a separate factor
- that is, the two types of scores load on different factors, namely, 1
and 3. Similarly, the OI Comprehension scale does not tend to load
on the same factor(s) as the various Listening tests (especially factor
3); rather it tends to load on factor 1 with a considerable variety of
other tasks.
TABLE 2
Varimax Rotated Solution for Twenty-two Language Scores (derived
from the principal components solution partially displayed in Table 1;
only loadings above .47 are interpreted).

SCORES                        LOADINGS (FACTORS 1-5)       h²

Listening
  CELT-LC                        .49, .48                 .47
  Listening Cloze                .57                      .32
  Listening MC Cloze             .88                      .77
  MC Listening Comp              .80                      .64
  Dictation                      .55                      .30

Speaking
  OI Accent                      .88                      .77
  OI Grammar                     .69                      .48
  OI Vocabulary                  .78                      .61
  OI Fluency                     .77                      .59
  OI Comprehension               .80                      .64
  Repetition                     .68                      .46
  Oral Cloze                     .47                      .22
  Reading Aloud                  .85                      .72

Reading
  CELT-R                         .57, .48                 .55
  MC Reading Match               .71, .52                 .77
  Standard Cloze                 .67                      .45

Writing
  Essay Rating                   .56                      .31
  Essay Score                    .56, .49                 .55
  MC Writing                     .51, .61                 .76
  Recall Rating                  .74                      .55

Grammar
  CELT-S                         .87                      .76
  Grammar (Parish Test)          .55, .64                 .89

Eigen value = 13.22

One might try to make a case for interpreting factors 4 and 5 as
being primarily associated with speaking and listening, whereas 2 and
3 seem to be predominantly reading and writing factors - however,
there are exceptions. Why, for example, does the essay score load with
the speaking and listening tests on factor 4? Why does the Reading
Aloud test load so heavily on factor 3 and not at all on 1, 2, 4, or 5?
One possible explanation for the patterning observed in Table 2 is
that it may be largely due to random variances attributable to error in
the measurement technique. Interestingly, the varimax solution
presented here is not strikingly similar to the one that emerged in the
Hinofotis study (see Oller and Hinofotis, 1976). If the patterning is
not due to random variation, it should tend to reappear in repeated
studies. However, for the varimax rotation (though not for the
principal components solution) the pair-wise deletion procedure
tended to produce a somewhat different pattern of factors than the
list-wise procedure. This too lends support to the possibility that the
varimax patterning may be a result of unreliable variance rather than
valid differences across tasks.
A final bit of evidence favoring the unitary competence hypothesis
is the fact that the loadings on g (see Table 1) are higher than the
loadings on any other factor (see Table 2) for 15 of the 22 scores input
to the original analysis. Thus, it would appear that g is the better
theoretical explanation of the available non-random variance for well
over half of the tasks or scales studied. It is also a surprisingly good
basis for explaining substantial amounts of variance in all of the
scores except possibly the Listening MC Cloze, which only loaded at
.40, and the OI Accent scale, which loaded at .42 on the g factor.

ii. Listening.
Among the listening tasks investigated were the CELT-LC; three
open-ended cloze tests in a listening format (Listening Cloze A,
Listening Cloze B, and Listening Cloze C); three multiple choice
cloze tests in a listening format (Listening MC Cloze A, Listening
MC Cloze B, Listening MC Cloze C); three listening comprehension
passages with multiple choice questions following each (MC
Listening Comprehension A, B, and C); and three dictations
(Dictation A, B, and C). There were several ulterior motives that
guided the selection of precisely these tasks. For one, it was assumed
that listening tasks that required the use of full-fledged discourse were
preferable to tasks requiring the processing of isolated sentences and
the like (see the argument in Part Two of the main body of the book
above). A second guiding principle was the need to develop tests that
could be scored easily with large numbers of subjects. Yet another
motivation, and the one that is of prime importance to this Appendix,
was to include tasks that were known to have some reliability and
validity along with the more experimental tests (the latter being the
cloze tests in listening format).

The results of a principal components solution extracting the first
principal factor are given in Table 3. The problem here, as before, is
to select the solution that best explains the patterning of the data -
i.e., that maximizes the explained variance in the tests investigated. It
is not sufficient merely to find factors. One must also account for
them on the basis of some reasonable explanation. An alternative
factor solution is given in Table 4. There, four factors are extracted,
but no particularly satisfying pattern emerges.

TABLE 3
Principal Components Analysis over Sixteen Listening Scores (N = 36).

SCORES                          LOADINGS ON g   SQUARED LOADINGS

CELT-LC                              .50              .25
List Cloze A (Exact)*                .70              .49
List Cloze B (Exact)                 .80              .64
List Cloze C (Exact)                 .66              .44
List MC Cloze A                      .76              .58
List MC Cloze B                      .52              .27
List MC Cloze C                      .71              .50
List Comprehension A                 .63              .40
List Comprehension B                 .59              .35
List Comprehension C                 .68              .46
Dictation A                          .69              .47
Dictation B                          .73              .53
Dictation C                          .71              .50
List Cloze A (Appropriate)**         .44              .19
List Cloze B (Appropriate)           .58              .34
List Cloze C (Appropriate)           .19              .04

Eigen value = 6.47

* The exact word scoring method was used to obtain the value for each protocol (12
items in each subsection).
** The so-called 'appropriate' scores used here were actually obtained by counting only
the words that had not already been included in the exact score. This, of course, is not
the usual method, but was easier to apply in this case due to the computer program
used at the SIU Testing Center where the data was initially processed. The exact word
method and the more usual appropriate word method are explained in greater detail in
Chapter 12 above.
Why, for instance, should the CELT-LC load on factor 2 while the
multiple choice Listening Comprehension tests A, B, and C all load
on factor 1? Why is it that Listening Cloze A scatters over three
factors (2, 3, and 4) while the other two load consistently on factor 2?
It would appear that the patterning in Table 3 is due to reliable
variance attributable to the g factor while the patterning in Table 4
results from the partitioning of unreliable variances. Interestingly, the
multiple choice tests all tend to load on factor 1 in Table 4, except for
the CELT-LC. The Dictations load primarily on factor 2 in Table 4,
but hardly as strongly as the composite Dictation score loaded on g in
Table 1 above.2

TABLE 4
Varimax Rotated Solution for Sixteen Listening Scores (N = 36).

SCORES                        LOADINGS (FACTORS 1-4)       h²

CELT-LC                          .67                      .45
List Cloze A (Exact)             .53, .49, .46            .73
List Cloze B (Exact)             .78                      .61
List Cloze C (Exact)             .84                      .71
List MC Cloze A                  .80                      .64
List MC Cloze B                  .84                      .71
List MC Cloze C                  .76                      .58
List Comprehension A             .69                      .48
List Comprehension B             .71                      .50
List Comprehension C             .72                      .52
Dictation A                      .67                      .45
Dictation B                      .84                      .71
Dictation C                      .79                      .62
List Cloze A (Appropriate)       .78                      .61
List Cloze B (Appropriate)       .80                      .64
List Cloze C (Appropriate)      -.69                      .48

Eigen value = 9.44

2 It may be worth noting here that the listening tasks were the very last tasks to be
administered in the series. Because of the extensive testing done during the project, the
order may be significant due to a fatigue factor. The latter tests in the series seemed to
be somewhat less reliable on the whole and also produced the weakest correlations.
The order for the remaining test types reported below was Writing, Grammar,
Reading, with the Speaking tests interspersed among all the rest.
iii. Speaking.
The speaking tasks were the result of the collaboration of a group of
graduate students and staff at CESL under the capable leadership of
George Scholz. Lela Vandenburg was also very much involved in the
initial stages of the data collection and her responsibilities were later
largely taken over by Deborah Hendricks. Together, they and two
other students have compiled two extensive reports related to the
tests aimed at speaking skills (see Scholz, Hendricks, et al., in press,
and Hendricks, et al., in press). The summary offered here
incorporates their major findings and relates them to aspects of the
total project outside the purview of their reports.
The speaking tasks used by Scholz et al. included an interview
procedure modeled after the FSI technique. Five rating scales were
used (OI Accent, OI Grammar, OI Vocabulary, OI Fluency, and OI
Comprehension - see Chapter 11 above for elaboration on each of
these). Over and above the scales, an FSI Oral Proficiency Level was
assessed for each interview. In addition to the oral interview
procedure, three other types of task were used - repetition, oral cloze,
and reading aloud. It was reasoned that since all three of the latter
tasks involved speaking in the target language they would be suitable
procedures against which to compare the scores on the oral interview
scales. Each of the latter tasks employed three texts - one judged to be
easy in relation to the supposed level of skill of CESL students,
another judged to be moderately difficult, and finally, one of
considerable difficulty. In all, then, nine texts were required - three for
the repetition tasks (Repetition A, B, and C); three for the oral cloze
tasks (Oral Cloze A, B, and C); and three for the reading aloud texts.
The latter three were used to generate three scores each - the amount
of time (in seconds) required to read the text (Reading Time A, B, and
C); the number of fully recognizable words (Reading Accuracy Exact
A, B, C); and the number of additional words that were appropriate
to the sense of the text though they did not actually appear in it
(Reading Accuracy Appropriate A, B, C).
Repetition tasks were scored in two ways. The first set of scores
were based on the number of words reproduced in recognizable form
that actually appeared in the original text (Repetition Exact A, B, C).
The second set of scores consisted of the number of additional words
produced that were actually appropriate to the sense of the original
text though they did not appear in it (Repetition Appropriate A, B,
C). Similarly, the oral cloze tasks were scored both for exact word
responses (Oral Cloze Exact A, B, C), and for words other than the
exact word that fit the context (Oral Cloze Appropriate A, B, C).3
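The two scoring conventions for these tasks can be sketched as
follows (a hypothetical rendering in Python; which alternatives
counted as 'appropriate' was of course a matter of rater judgment,
and footnote 3 notes that the convention changed in later batteries):

    def score_protocol(responses, exact_words, acceptable_alternatives):
        """Score one protocol by the exact method and by the
        'appropriate' method used for the speaking tasks, i.e. exact
        words plus any additional acceptable words (see footnote 3)."""
        exact = 0
        additional = 0
        for i, (given, target) in enumerate(zip(responses, exact_words)):
            if given == target:
                exact += 1
            elif given in acceptable_alternatives.get(i, set()):
                additional += 1
        return exact, exact + additional  # (exact, appropriate)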
In all there were twenty-seven scores computed over the speaking
tasks. A principal components analysis over those scores is given in
Table 5, followed in Table 6 by a varimax rotation to an orthogonal
solution revealing seven factors with an Eigen value greater than one.

There are several facts that should be considered in interpreting the
factor analyses in Tables 5 and 6. Foremost among them are the
methods used for obtaining the various scores.
Consider Table 5 first. Repetition exact scores over texts A and B
load substantially on the g factor - however, the 'appropriate' scores
are scarcely correlated with that factor at all. This is because of the
nature of the task. The more an examinee tended to answer with an
exact word repetition of the original, the less he was apt to come up
with some other appropriate rendition - hence, the low correlation
between appropriate scores for the repetition tasks, and also for the
reading aloud tasks, over all three texts. In fact, in the case of the
appropriate scores for the reading aloud texts, the correlation with g
tended to be negative. This is not surprising if the method of
obtaining those scores is kept in mind. Similarly, the correlations
(loadings) of the time scores for the reading aloud tasks tended to be
substantial and negative for all three of the texts in question. This is
not surprising since we should expect lower proficiencies to result in
longer reading times. The appropriate scores over the oral cloze tests,
however, were on the whole slightly better measures of g than the
exact scores. This is also a function of the way the appropriate scores
were computed in this case (see footnote 3 below) - by adding the exact
word score and any additional acceptable words to obtain the
appropriate score. Thus, the best measures of g appear to be ratings
based on the oral interview procedure, exact scores of the A and B
repetition texts, appropriate cloze scores, especially A and C, reading
aloud times, and for some reason the accuracy score (exact) on texts B
and C. Due partly to the unreliability of some of the measures, and no
doubt also to the fact that some of the scores computed are not
actually measures of g at all, the total amount of explained variance
in all of the scores is relatively small (.34). However, if we eliminate
the scores that are not expected to correlate with g (in particular, the
appropriate word scores for the repetition and reading aloud tasks),
the total amount of variance explained by the g factor jumps to .44.
All four types of tasks appear to be measuring g within the limits of
reliability of the various scores derived from them.

3 Here, the appropriate score is the sum of exact plus additional appropriate words. In
fact, this 'appropriate' scoring method is the only one that includes the exact word
score. In subsequent tests (see Tables 7-12 below), cloze scores that are called
'appropriate' word scores are actually a count of the words over and above (not
including) the exact words restored in the text in question.
TABLE 5
Principal Components Analysis over Twenty-seven Speaking Scores
(N = 64).

SCORES                          LOADINGS ON g   SQUARED LOADINGS

OI Accent                            .59              .35
OI Grammar                           .83              .69
OI Vocabulary                        .79              .63
OI Fluency                           .80              .64
OI Comprehension                     .87              .76
FSI Oral Proficiency Level           .87              .76
Repetition Exact A                   .71              .50
Repetition Appropriate A             .07              .00
Repetition Exact B                   .87              .76
Repetition Appropriate B             .29              .08
Repetition Exact C                   .39              .15
Repetition Appropriate C            -.06              .00
Oral Cloze Exact A                   .54              .29
Oral Cloze Appropriate A             .66              .44
Oral Cloze Exact B                   .59              .35
Oral Cloze Appropriate B             .43              .18
Oral Cloze Exact C                   .38              .14
Oral Cloze Appropriate C             .67              .45
Reading Aloud Time A                -.65              .42
Reading Aloud Exact A                .39              .15
Reading Aloud Appropriate A         -.10              .01
Reading Aloud Time B                -.65              .42
Reading Aloud Exact B                .70              .49
Reading Aloud Appropriate B         -.36              .13
Reading Aloud Time C                -.54              .29
Reading Aloud Exact C                .52              .27
Reading Aloud Appropriate C         -.17              .03

Eigen value = 9.39

TABLE 6
Varimax Rotated Solution over Twenty-seven Speaking Scores (N = 64).

SCORES                        LOADINGS (FACTORS 1-7)       h²

OI Accent                        .48                      .23
OI Grammar                       .89                      .79
OI Vocabulary                    .90                      .81
OI Fluency                       .83                      .69
OI Comprehension                 .87                      .76
FSI Oral Level                   .92                      .85
Rep Exact A                      .45, .46                 .41
Rep Approp A                     .73                      .53
Rep Exact B                      .48, .50                 .48
Rep Approp B                     .49, .46                 .45
Rep Exact C                      .69                      .48
Rep Approp C                     .58                      .34
OC Exact A                       .49                      .24
OC Approp A                      .75                      .56
OC Exact B                       .52, .45                 .47
OC Approp B                      .86                      .74
OC Exact C                       .72                      .52
OC Approp C                      .56                      .33
RA Time A                       -.82                      .67
RA Exact A                       .61, .44, .45            .77
RA Approp A                      .87                      .76
RA Time B                       -.84                      .71
RA Exact B                       .63                      .40
RA Approp B                      .72                      .52
RA Time C                       -.82                      .67
RA Exact C                       .45, .56                 .51
RA Approp C                      .73                      .53

Eigen value = 15.22

In Table 6, the varimax rotated factors reveal a possible fluency
factor (see factor 3) associated strongly with all three of the reading
aloud time scores, but oddly, not associated with the fluency scale on
the oral interview. The exact scores for oral cloze, reading aloud, and
one of the repetition tasks all load on factor 5. Factor 7 appears to be
a throw-out created by some sort of unreliable variance in the exact
score over repetition text C. Perhaps this is due to the fact that the
task turned out to be too difficult for most of the examinees. Similar
unreliabilities emerge with reference to several of the subscores.
Considering the fact that the g factor displayed in Table 5 accounts
for substantial variance in all of the sub-tasks investigated, and
taking into account that the more reliable tests seem to load very
substantially on that factor, the results of the speaking subtests on the
whole seem to support the indivisible competence hypothesis rather
than the partial divisibility hypothesis.

iv. Reading.
The first reading score was the Reading subtest on the CELT (CELT-
R). The second type of reading test was a matching task in a multiple
choice format. It involved selecting the nearest match in meaning for
an underlined word, phrase, or clause from a field of alternatives at
various decision points in a text. There were three texts (MC Reading
Match A, B, C). Finally, there were eight cloze tests in an every fifth
word format scored by the exact and appropriate word methods. The
appropriate word score here, however (see footnote 3 above), did not
include the exact word score as a subpart - rather it was simply the
count of appropriate responses over and above the exact word
responses.
There was an ulterior motive for including so many different cloze
tests. We wanted to find out if agreeing or disagreeing with
controversial subject matter would affect performance. Therefore,
pro and con texts on the legalization of marijuana (Marijuana Pro,
Marijuana Con), the abolition of capital punishment (Cap Pun Pro,
Con), and the morality of abortion (Abortion Pro, Con) were included
along with two neutral texts on non-controversial subjects (Cloze A
and B). Results on the agreement versus disagreement question are
reported by Doerr (in press). In a nutshell, it was determined that the
effect of one's own belief, though possibly significant, is apparently not
very strong. In fact, it would appear from the data to be quite
negligible. Furthermore, the correlation between scores over texts
with which subjects agreed and with which they disagreed was very
strong, .91, and there was no significant contrast in scores. Similarly,
the correlation across pro and con texts (independent of the subjects'
agreement or disagreement) was .90. The pro texts were significantly
easier, however. These results at least mollify if they do not in fact
controvert the findings of Manis and Dawes (1961) concerning the
sensitivity of cloze scores to differences in beliefs. Further research is
needed with a more sensitive experimental design, and with more
proficient examinees.
As before, the data were factor analyzed to a principal components
solution and then rotated to a varimax solution. Results of these
procedures are given in Tables 7 and 8. Whereas the loadings on g
account for 47% of the total available variance, the interpreted
loadings on the four factor rotated solution only account for an
additional 6% (for a total of 53%). Again, the pattern of loadings in
the rotated solution seems to indicate a considerable amount of
unreliable variability in the data. As noted above, this is partly
attributable to the data collection procedures, and partly to the
shortness of some of the subtests for which scores are reported.

TABLE 7
Principal Components Solution over Twenty Reading Scores (N = 51).

SCORES                          LOADINGS ON g   SQUARED LOADINGS

CELT-Reading Subtest                 .57              .32
MC Reading Match A                   .74              .55
MC Reading Match B                   .75              .56
MC Reading Match C                   .71              .50
Cloze Exact A                        .66              .44
Cloze Exact B                        .69              .48
Marijuana Pro Exact                  .68              .46
Marijuana Con Exact                  .79              .62
Cap Pun Pro Exact                    .71              .50
Cap Pun Con Exact                    .71              .50
Abortion Pro Exact                   .85              .72
Abortion Con Exact                   .74              .55
Cloze A (Approp)                     .54              .29
Cloze B (Approp)                     .63              .40
Marijuana Pro (Approp)               .71              .50
Marijuana Con (Approp)               .55              .30
Cap Pun Pro (Approp)                 .74              .55
Cap Pun Con (Approp)                 .62              .38
Abortion Pro (Approp)                .63              .40
Abortion Con (Approp)                .64              .41

Eigen value = 9.40
TABLE 8
Varimax Rotated Solution over Twenty Reading Scores (N = 61).

SCORES                        LOADINGS (FACTORS 1-4)       h²

CELT-R                           .65                      .42
MCRM A                           .79                      .62
MCRM B                           .65                      .42
MCRM C                           .55                      .30
Cloze Exact A                    .77                      .59
Cloze Exact B                    .68                      .46
MP Exact                         .81                      .66
MC Exact                         .49, .41, .50            .66
CP Exact                         .63, .47                 .62
CC Exact                         .54                      .29
AP Exact                         .52, .67                 .72
AC Exact                         .47, .57                 .55
Cloze A (Approp)                 .48, .58                 .57
Cloze B (Approp)                 .70                      .49
MP (Approp)                      .49, .49                 .48
MC (Approp)                      .72                      .52
CP (Approp)                      .81                      .66
CC (Approp)                      .43                      .18
AP (Approp)                      .73                      .53
AC (Approp)                      .85                      .72

Eigen value = 10.46

In the cloze scores reported here (Table 7, especially) and above in
Table 5, the appropriate scores appear to be contributing about as
strongly to g as the exact scores. Further, in neither of the rotated
solutions (see Tables 6 and 8) do the appropriate and exact scores
over the cloze texts clearly differentiate themselves - that is, they do
not load exclusively on separate orthogonal factors.

Therefore, in view of all of the foregoing, it seems reasonable to
conclude that the indivisibility hypothesis is again to be preferred. In
other words, little or no explanatory power is gained by the rotated
solution over the single factor principal components solution. These
results also accord well with the findings of Anderson (1976). He
found that a single factor accounted for 65% of the variance in ten
different measures of what were presumed to be eight different
aspects of reading ability. His subjects were 150 children spread
evenly over three grade levels in elementary schools in Papua New
Guinea. For many of them, English was a second language.

v. Writing.
Eighteen writing scores were generated. In this analysis three separate
scores were obtained over the essay task referred to above in Tables 1
and 2. First, the results were rated on a five point scale by the
instructors in the classes in which the subjects originally wrote the
essays (see Kaczmarek, in press, for a more complete description of
the procedure). Second, the same essays were judged for content and
organization on a different five point scale by a different rater. Third,
the person assigning the latter subjective rating also computed an
objective score by counting the number of error-free words in the
subject's protocol, subtracting the number of errors, and dividing the
result by the total number of words in an errorless rewrite of the text.
These scores are referred to below as Instructor's Essay Rating,
Content and Organization Rating, and Essay Score.
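As a formula, the objective score is a simple ratio (a sketch; the
counts themselves came from trained judges, and the sample figures
below are invented):

    def essay_score(error_free_words, errors, rewrite_length):
        # Error-free words minus errors, divided by the number of words
        # in a fully intelligible errorless rewrite of the protocol.
        return (error_free_words - errors) / rewrite_length

    # A hypothetical protocol: 180 error-free words, 25 errors, and a
    # 210-word rewrite yield a score of about .74.
    print(round(essay_score(180, 25, 210), 2))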
Next, a battery of multiple choice tests aimed at components of
writing skill was included. Each task was based on decisions related
to text. The first task was to select the appropriate word, phrase, or
clause to fill in a blank in a text. It was, thus, a kind of multiple choice
cloze test. This task was repeated over three texts - one supposed to
be easy for the subjects tested (Select A), a second supposed to be
moderately difficult (Select B), and a third expected to be difficult
(Select C).
The second type of test included in the multiple choice battery
required editorial decisions concerning errors systematically
implanted in texts. The errors were of the type frequently made by
native and non-native writers. The format was similar except that the
criterial word, phrase, or clause was underlined in each case and
followed by several alternatives. The first alternative in every case
was to leave the underlined portion alone - i.e., no change. In
addition, four other options were provided as possible replacements
for the underlined portion. Again, easy, moderate, and difficult texts
were used (Edit A, B, and C).
The third type of test included in the multiple choice battery
required subjects to place words, phrases, or clauses in an
appropriate order within a given text. In each case four words,
phrases, or clauses (or some combination) were provided and the
subject had to associate the correct unit with the appropriate blank in
a series of four blanks. As before, three texts were used (Order A, B,
and C).
A final type of writing task used was a read and recall procedure.
Again, three texts were used. Each one was displayed via an overhead
projector for one minute. Then subjects (having been instructed
beforehand) wrote all that they could recall from the text. The
resulting protocols were scored in two ways. First, they were rated on
a simple five point scale (see Kaczmarek, in press, for elaboration);
second, they were scored by allowing one point per word of text that
conformed to something stated or implied by the original text. The
rating was done by the instructors who also rated the essay task and is
thus referred to as Instructor's Rating of Recall A, B, and C. The
scoring, on the other hand, was done by the same team of graders
who scored the essay task (Recall Score A, B, and C).
The principal components solution is presented in Table 9 and the
rotated solution in Table 10. Whereas 49% of the total available
variance is explained by g, only 4 additional percentage points are
explained by the rotated solution, which accounts for a total of 53%.
Furthermore, the patterning displayed in Table 10 certainly appears
to be due largely to chance.

TABLE 9
Principal Components Analysis over Eighteen Writing Scores (N = 137).

SCORES                              LOADINGS ON g   SQUARED LOADINGS

Instructors' Essay Rating                .80              .64
Content and Organization Rating          .80              .64
Essay Score                              .77              .59
Select A                                 .63              .40
Select B                                 .74              .55
Select C                                 .71              .50
Edit A                                   .67              .45
Edit B                                   .62              .38
Edit C                                   .45              .20
Order A                                  .75              .56
Order B                                  .69              .48
Order C                                  .62              .38
Recall Rating A                          .78              .61
Recall Rating B                          .69              .48
Recall Rating C                          .78              .61
Recall Score A                           .74              .55
Recall Score B                           .66              .44
Recall Score C                           .62              .38

Eigen value = 8.84
TABLE 10
Varimax Rotated Solution over Eighteen Writing Scores (N = 137).

SCORES                            LOADINGS (FACTORS 1-3)       h²

Instructors' Essay Rating            .45, .57                 .60
Content and Organization Rating      .70                      .49
Essay Score                          .77                      .59
Select A                             .49                      .24
Select B                             .70                      .49
Select C                             .65                      .42
Edit A                               .45, .52                 .52
Edit B                               .49, .53                 .59
Edit C                               .71                      .50
Order A                              .69                      .48
Order B                              .55                      .30
Order C                              .46, .60                 .66
Recall Rating A                      .57, .65                 .75
Recall Rating B                      .75                      .56
Recall Rating C                      .81                      .66
Recall Score A                       .70                      .49
Recall Score B                       .64                      .41
Recall Score C                       .71                      .50

Eigen value = 9.25

There is no clear tendency for the same types of scores to load on the
same factors. For instance, the Select scores scatter over three
uncorrelated factors. Thus, again the indivisibility hypothesis seems
to be the better explanation of the data.

vi. Grammar.
Two kinds of grammar scores were used. First, the subtest on the
CELT labeled Structure was included, and second, there were several
subscores computed over Parish's Grammar test. Except for the fact
that it was possible to compute subscores over the latter test, there
would have been no need for this section. However, it serves a very
useful purpose in providing a straightforward test of the view that
discrete points of grammar can be usefully distinguished whether for
teaching or for testing purposes. Further, it is worth noting that the
author of the grammar items began with (and perhaps still maintains)
a sympathetic view toward the discrete point analysis of grammatical
'knowledge' (whatever the latter turns out to be).
Among the subscores on the Parish Grammar test were sums of
scores over selected items representing particular grammatical
categories believed to be functionally similar by the test author. For
instance, noun modifiers such as adjectives and determiners were
extracted by a computer scoring technique to obtain a subscore called
Adj Det. It is important to note that the items did not include any
heavily laden content adjectives but rather were limited to highly
redundant demonstratives and the like. The entire text of the test with
the items used in computing subscores is given in the Appendix to
Research in Language Testing. The other subscores are named in
Tables 11 and 12, which report the requisite factor analyses. It is
apparent from an examination of both tables that there is no basis
whatever for claiming that separately testable grammatical categories
exist. In the principal components solution and in the varimax
rotation, all of the exact scores over all of the various categories load
on the same factor. Again, the additional factors that are required for
the rotated solution add little new information to what is already
contained in g. Further, it is only the appropriate scores which tend to
sort out randomly onto different factors. This is undoubtedly due to
the fact that there was little reliable variance in those scores. There
could not be, because the exact word scores exhausted nearly all of
the variability in the performance of different examinees. If they
didn't know the correct answer (in fact, the exact word deleted),
chances were good that they could not get the item correct even if they
were allowed to put in some other response. Hence, the appropriate
scores here, as in the cases of the repetition tasks and the reading
aloud tasks, added little new information.

TABLE 11
Principal Components Analysis over Twenty-three Grammar Scores
(N = 63).

SCORES                                  LOADINGS ON g   SQUARED LOADINGS

CELT-Structure                               .70              .49
Adjectives & Determiners Exact               .78              .61
Adverbs & Particles Exact                    .83              .69
Copula Forms ('Be') Exact                    .77              .59
'Do' Forms Exact                             .78              .61
Inevitable Nouns (as in idioms) Exact        .67              .45
Interrogatives Exact                         .83              .69
Modals Exact                                 .80              .64
Prepositions Exact                           .86              .74
Pronouns Exact                               .91              .83
Subordinating Conjunctions Exact             .74              .55
Verbals Exact                                .89              .79
Adjectives & Determiners Approp              .15              .02
Adverbs & Particles Approp                   .23              .05
Copula ('Be') Approp                         .08              .01
'Do' Forms Approp                            .08              .01
Inevitable Nouns Approp                      .24              .06
Interrogatives Approp                        .08              .01
Modals Approp                                .28              .08
Prepositions Approp                          .06              .00
Pronouns Approp                              .00              .00
Subordinating Conjunctions Approp            .19              .04
Verbals Approp                              -.08              .01

Eigen value = 7.97

TABLE 12
Varimax Rotated Solution over Twenty-three Grammar Scores (N = 63).

SCORES                        LOADINGS (FACTORS 1-5)       h²

CELT-Structure                   .70                      .49
Adj Det Exact                    .77                      .59
Adv Part Exact                   .85                      .72
Copula (be) Exact                .81                      .66
Do Exact                         .77                      .59
Inevitable N Exact               .70                      .49
Interrog Exact                   .81                      .66
Modals Exact                     .79                      .62
Prep Exact                       .88                      .77
Pronouns Exact                   .90                      .81
Subord Conj Exact                .72                      .52
Verbals Exact                    .87                      .76
Adj Det Approp                   .68                      .46
Adv Part Approp                  .59                      .35
Copula Approp                    .64                      .41
Do Approp                        .68                      .46
Inevitable N Approp              .64                      .41
Interrog Approp                  .55                      .30
Modals Approp                    .63                      .40
Prep Approp                      .60                      .36
Pronouns Approp                  .70                      .49
Subord Conj Approp               .63                      .40
Verbals Approp                   .69                      .48

Eigen value = 12.20

E. Data from first language studies

The relationship between measures of language ability and IQ has
often been examined in empirical studies of children and adults who
are native speakers of English (and no doubt of other languages).
However, for some reason, high correlations between IQ tests and
other tests - e.g., vocabulary tests, listening comprehension tests,
reading tests, achievement batteries, personality indices, cloze
procedure, etc. - have usually been interpreted to mean that the other
tests were also incidentally testing IQ. This interpretation is
unassailable from the point of view of pure statistics, but it leaves
open another alternative which is equally acceptable and worthy of
consideration.

That other possibility is that the so-called 'IQ' tests are very likely
measures of something other than innate intelligence. Although no
one would be apt to deny that in fact 'IQ' tests measure language
proficiency to a great extent, the tests are rarely applied and
interpreted as if they were measures of language proficiency - an
attained skill as opposed to an innate, immutable, genetically
determinable quantity. For instance, the San Diego twins (CBS
News, August 1977), whose language system was incomprehensible to
anyone but themselves, were wrongly diagnosed as mentally retarded
on the strength of so-called 'IQ' tests. Indeed, for thousands of
children, the meaning of variance in IQ scores (not to mention other
educational tests which are equally suspect) is crucial to the kind of
educational experience which they are apt to receive.
Put simply, the question is whether or not intelligence in the
abstract can be identified with what IQ tests measure.
Related to this first question is the important matter of whether or
not what IQ tests measure can be modified by experience. If it can be
shown that what IQ tests measure is principally language proficiency
and nothing else, it will be excessively difficult for Jensen and others
to keep maintaining that 'IQ' (that is, whatever is measured by
so-called 'IQ' tests) is genetically determined (or even largely so). This
would make about as much sense as claiming that the fact that an
Englishman speaks English and not Chinese or something else is
genetically determined, and has nothing to do with the fact that he
grew up in England and just happened to learn English. It would be
tantamount to the claim that attained language proficiency is not
attained. In the case of second language learners, this latter claim is
patently false - even in the case of first language learners it is likely to
be false for all practical purposes.
With respect to first language acquisition it might be reasonable to
ask whether language skill can be componentialized just as we have
done in relation to second language learning. For instance, is there a
component of vocabulary knowledge that is demonstrably distinct
from syntactic knowledge? Can a component of intelligence be
distinguished from a component of reading ability? Can either of
these be distinguished from specific kinds of knowledge typically
displayed through the use of language - e.g., social studies, history,
biology, etc.? One might suppose these questions to be a bit
ridiculous since they seem to challenge the very core of accepted
educational dogma, but that in itself only increases the shock value of
posing the questions in the first place. It certainly does not decrease
their importance to educators. Indeed, if anything, it makes it all the
more imperative that the widely accepted dogmas be empirically
supported. If no empirical support can be offered (save that 'this is the
way we've been doing it for two hundred years or more'), the dogmas
should perhaps be replaced by more defensible notions. In any event,
the strength of the distinctions common to educational practices
should be founded on something other than precedent and opinion.
Is there any reason to suppose that the g factor of intelligence so
widely recognized by educational psychologists and psychometrists
might actually be something other than global language proficiency?
There are a great many relevant studies - so many that we cannot do
more than mention a smattering of them. For instance, Kelly (1965)
found that a certain form of the Brown-Carlsen Listening
Comprehension Test correlated more strongly with the Otis Test of
Mental Abilities than it did with another form of the Brown-Carlsen.
This raised the serious question whether the Brown-Carlsen test was
measuring anything different from what the Otis test was measuring.
However, since the Otis test was supposed to be a measure of IQ, the
author questioned the Brown-Carlsen's validity rather than that of
the Otis test. Presumably this course of action was due to the
mystique associated with the concept of intelligence - a mystique that
listening tests apparently lack.
Concannon (1975) compared results on the Stanford-Binet and the
Peabody Picture Vocabulary Test with 94 preschoolers. Correlations
at grades 3, 4, and 5 were .65, .55, and .61. These figures roughly
approximate the maximum possible correlations that could be
achieved if the tests were measuring the very same thing, given the
limitations on the reliability of both tests. In other words, it is
doubtful that correlations of significantly greater magnitude would
be obtained if the tests were merely administered twice and the
corresponding sets of scores on the same tests were correlated.
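The ceiling in question is the classical correction for attenuation: the
correlation obtainable between two tests cannot exceed the square
root of the product of their reliabilities. A worked sketch (the
reliability figures here are hypothetical, chosen only to show that
ceilings near the reported values arise from quite ordinary
reliabilities):

    import math

    def max_correlation(reliability_x, reliability_y):
        # Classical test theory: r(x, y) <= sqrt(r_xx * r_yy).
        return math.sqrt(reliability_x * reliability_y)

    print(round(max_correlation(0.70, 0.60), 2))  # 0.65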
Hunt, Lunneberg, and Lewis (1975) investigated various aspects of
what they considered verbal intelligence. In particular they were
interested in whether or not a verbal IQ test could be taken as a
measure of IQ independent of acquired knowledge. They
conclude that

    although a verbal intelligence test is directly a measure of what
    people know, it is indirectly a way of identifying people who can
    code and manipulate verbal stimuli rapidly in situations in which
    knowledge per se is not a major factor (p. 233).

Does this not sound reminiscent of one of the naturalness constraints
on pragmatic language tests? But what about the other side of the
same coin? Don't the so-called IQ tests also require pragmatic
mapping of utterances onto contexts? Gunnarsson (1978) at least
has demonstrated for many so-called verbal and non-verbal IQ tests
that pragmatic mapping is involved.
Ultimately, it is a matter of investigating the nature of test
variances and their patterns of intercorrelation with other tests. In a
study of high school seniors, Stinson and Morrison (1959) found a
correlation of .85 between a reading test and the Wechsler
Intelligence Scale for Adults. Does this mean that the reading test is
actually an IQ test, or that the IQ test is a reading test? Or could it be
that both are measuring the same global factor of language
proficiency? In another study, Wakefield, Veselka, and Miller
(1974-5) found that a single canonical variate accounted for 75% of
the variance in the several subtests on the Iowa Tests of Basic Skills
and the Prescriptive Reading Inventory. One of the test batteries is
supposedly aimed at achievement. Nevertheless, it would appear that
both batteries are measuring much the same thing.
In another study focussed on quite different questions, Chansky
(1964) found a range of correlations between .58 and .68 between
presumed tests of personality and so-called reading tests. Is it possible
that as much as 45% of the variance in so-called personality tests is
attributable to reading ability? To global language proficiency? It is
noteworthy that the amount of variance thus attributable to a
language factor approximates the estimated total available non-
random variance (i.e., the square of the estimated reliability) of the
personality measures (see Chapter 5, above).
Until recently all that was lacking in terms of empirical data needed
to link IQ and language proficiency tests of the sort discussed in Part
Three of this book was an empirical study of native speakers
employing pragmatic language tests in combination with a battery of
IQ and achievement tests. The results of such a study are now
available in the master's thesis of Tom Stump. He showed that a
single factor of language proficiency accounted for about equal
amounts of variance in the dictations and cloze tests included, as well
as in the IQ scores (both verbal and non-verbal) and the various
achievement scores. The test batteries used were the Lorge-
Thorndike Intelligence Test, and the Iowa Tests of Basic Skills. A few
scores were also available on the Stanford-Binet IQ Test, but only
enough to indicate the possibility that the language tests were
apparently better measures of whatever the IQ tests were measures of,
than were the IQ tests. Interestingly, for the 109 seventh graders
tested, the non-verbal portion of the Lorge-Thorndike correlated
more strongly with the language scores (cloze and dictation) than did
the verbal portion of the same test. Further, for the two samples of
data (109 fourth graders and the nearly equal number of seventh
graders - native speakers of English enrolled in middle class Saint
Louis schools) a single principal component accounted for .54 of the
total variance in all of the tests in the one case, and .67 in the other.
In sum, it would seem that the data from first language studies
support neither a componentialization of language skills nor a
differentiation of IQ and language tests. Apparently, the g factor of
intelligence is indistinguishable from global language proficiency.
Moreover, the relative indivisibility of the g factor seems to hold for
either first or second language proficiency.

F. Directions for further empirical research


Implications of the foregoing findings for education are sweeping.
They suggest a possible need for reform that would challenge some of
the most deeply seated notions of what school is about - how schools
fail and how they succeed. The potential reforms that might be
required if these findings can be substantiated are difficult to predict.
Clearly they point us in the direction of curricula in which the focus is
on the skills required to negotiate symbols rather than on the 'subject
matter' in the traditional senses of the term. They point away from
the medieval notion that psychology, grammar, philosophy, English,
history, and biology are intrinsically different subject matters.
Physics and mathematics may not be as reasonably distinct from
English literature and sociology as the structure of universities
implies.
Because of the potency of the implications of language testing
research for the whole spectrum of educational endeavor, it is of
paramount importance that the findings on which decisions are based
be the most carefully sought out and tested results that can be
obtained. If it is not reasonable to make the traditional distinction
between language and IQ, then the distinction should be done away
with or replaced by an updated version consistent with available
empirical findings. If vocabulary is not really distinct in a useful sense
from syntax, then the texts and the tests should abandon that
distinction or replace it with one that can be empirically defended.
Certainly, when we are dealing in an area of education where
empirical answers are readily accessible, there is no defence for
untested opinions. There is no room for appeal to authority
concerning what a test is a test of. It doesn't matter a great deal what
someone (anyone) thinks a test is measuring - what does matter is
how people perform on the test and how their performance on that
test compares with their performance on other similar and dissimilar
tasks.
The research discussed here only begins to scratch the surface.
Many unanswered questions remain, and many of the answers
proposed here will no doubt need to be refined if not in fact discarded.
The Stump project needs to be replicated in a large range of contexts.
The study in Carbondale with second language learners needs to be
refined and repeated in a context where more stringent controls can
be imposed on the administration of the tasks employed. Other tasks
need to be thoroughly studied. Especially, the nature of variability in
the proficiency of native speakers in a greater variety of languages
and dialects needs to be studied.
In spite of all the remaining uncertainties, it seems safe to suggest
that the current practice of many ESL programs, textbooks, and
curricula of separating listening, speaking, and reading and writing
activities is probably not just pointless but in fact detrimental. A
similar conclusion can be suggested with respect to curricula in the
native language. It would appear that every teacher in every area of
the curriculum should be teaching all of the traditionally recognized
language skills.
References

Aborn, M., H. Rubenstein, and T. D. Sterling. 1959. 'Sources of contextual
constraint upon words in sentences'. Journal of Experimental Psychology
57, 171-180.
Adorno, T. W., Else Frenkel-Brunswick, D. J. Levinson, and R. N. Sanford.
1950. The authoritarian personality. New York: Harper.
Ahman, J. Stanley, and Marvin D. Glock. 1975. Measuring and evaluating
educational achievement. Boston: Allyn and Bacon, 2nd ed.
Allen, Harold B., and R. N. Campbell (eds.) 1972. Teaching English as a
second language: a book of readings. New York: McGraw Hill.
Allen, J. P. B., and Alan Davies. 1977. Testing and experimental methods.
London: Oxford University Press.
Anastasi, Anne. 1976. Psychological testing. New York: Macmillan.
Anderson, D. F. 1953. 'Tests of achievement in the English language'.
English Language Teaching 7, 37-69.
Anderson, Jonathon. 1969. Application of cloze procedure to English learned
as a foreign language. Unpublished doctoral dissertation, University of
New England, Australia.
Anderson, Jonathon. 1971a. 'Research on comprehension in reading'. In
Bracken and Malmquist (eds.) Improving reading ability around the world.
Newark, Delaware: International Reading Association.
Anderson, Jonathon. 1971b. 'Selecting a suitable "reader": procedures for
teachers to assess language difficulty'. Regional English Language Center
Journal 2, 35-42.
Anderson, Jonathon. 1976. Psycholinguistic experiments in foreign language
testing. Santa Lucia, Queensland, Australia: University of Queensland
Press.
Angoff, William R., and Auriel T. Sharon. 1971. 'Comparison of scores
earned by native American college students and foreign applicants to U.S.
colleges'. TESOL Quarterly 5, 129-136. Also in Palmer and Spolsky
(1975), 154-162.
Anisfeld, M., and W. E. Lambert. 1961. 'Social and psychological variables
in learning Hebrew'. Journal of Abnormal and Social Psychology 63,
524-529. Also in Gardner and Lambert (1972), 217-227.
Asakawa, Yoshio, and John W. Oller, Jr. 1977. 'Attitudes and attained
proficiency in EFL: a sociolinguistic study of Japanese learners at the
secondary level'. SPEAQ Journal 1, 71-80.
Asher, J. J. 1969. 'The total physical response approach to second language
learning'. Modern Language Journal 53, 3-17.
Asher, J. J. 1974. 'Learning a second language through commands: the
second field test'. Modern Language Journal 58, 24-32.
Bacheller, Frank. In press. 'Communicative effectiveness as predicted by
judgements of the severity of learner errors in dictations'. In Oller and
Perkins.
Backman, Nancy. 1976. 'Two measures of affective factors as they relate
to progress in adult second language learning'. Working Papers on
Bilingualism 10, 100-122.
Bally, Charles, Albert Sechehaye, and Albert Riedlinger (eds.) 1959. Course
in general linguistics: Ferdinand de Saussure. Translated by Wade
Baskin. New York: McGraw Hill.
Banks, James A. 1972. 'Imperatives in ethnic minority education'. Phi Delta
Kappan 53, 266-269.
Baratz, Joan. 1969. 'A bidialectal task for determining language proficiency
in economically disadvantaged Negro children'. Child Development 40,
889-901.
Barik, Henri C. and Merrill Swain. 1975. 'Three-year evaluation of a large
scale early grade French immersion program: the Ottawa study'.
Language Learning 25, 1-30.
Barrutia, Richard. 1969. Linguistic theory of language learning as related to
machine teaching. Heidelberg: Julius Groos Verlag.
Bass, B. M. 1955. 'Authoritarianism or acquiescence?' Journal of Abnormal
and Social Psychology 51, 616-623.
Bereiter, C. and S. Engelmann. 1967. Teaching disadvantaged children in
the preschool. Englewood Cliffs, N.J.: Prentice-Hall.
Bernstein, Basil. 1960. 'Language and social class'. British Journal of
Sociology 11, 271-276.
Bezanson, Keith A. and Nicolas Hawkes. 1976. 'Bilingual reading skills of
primary school children in Ghana'. Working Papers on Bilingualism 11,
44-73.
Bird, Charles S. and W. L. Woolf. 1968. English in Mali. Edited by James E.
Redden. Carbondale, Illinois: Southern Illinois University Press.
Bloom, Benjamin S. 1976. Human characteristics and school learning. New
York: McGraw Hill.
Bloomfield, Leonard. 1933. Language. New York: Holt, Rinehart, and
Winston.
Bogardus, E. S. 1925. 'Measuring social distance'. Journal of Applied
Sociology 9, 299-308.
Bogardus, E. S. 1933. 'Social distance scale'. Sociology and Social Research
17, 265-271.
Bolinger, Dwight L. 1975. Aspects of language. Second edition. New York:
Harcourt, Brace, and Jovanovich.
Bormuth, John. 1967. 'Comparable cloze and multiple choice compre-
hension test scores'. Journal of Reading 10, 291-299.
Bormuth, John. 1970. On the theory of achievement test items. Chicago:
University of Chicago Press.
Botel, M. and Alvin Granowsky. 1972. 'A formula for measuring syntactic
complexity: a directional effort'. Elementary English 49, 513-516.
Bower, T. G. R. 1971. 'The object in the world of the infant'. Scientific
American 225, 30-38.
Bower, T. G. R. 1974. Development in infancy. San Francisco: Freeman.
Bracht, Glenn H., Kenneth D. Hopkins, and Julian C. Stanley (eds.) 1972.
Perspectives in educational and psychological measurement. Englewood
Cliffs, N.J.: Prentice-Hall.
Briere, Eugene. 1966. 'Quantity before quality in second language
composition'. Language Learning 16, 141-151.
Briere, Eugene. 1972. 'Cross cultural biases in language testing'. Paper
presented at the Third International Congress of Applied Linguistics in
Copenhagen, Denmark. Reprinted in Oller and Richards (1973), 214-227.
Britton, J. 1973. Language and learning. New York: Penguin.
Brodkey, Dean. 1972. 'Dictation as a measure of mutual intelligibility: a
pilot study'. Language Learning 22, 203-220.
Brodkey, Dean and Howard Shore. 1976. 'Student personality and success in
an English language program'. Language Learning 26, 153-162.
Brooks, Nelson. 1964. Language and language learning. New York:
Harcourt, Brace, and World.
Brown, H. Douglas. 1973. 'Affective variables in second language
acquisition'. Language Learning 23, 231-244.
Bung, Klaus. 1973. Towards a theory of programmed language instruction.
The Hague: Mouton.
Buros, Oscar K. (ed.) 1970. Personality: tests and reviews. Highland Park,
N.J.: Gryphon Press.
Buros, Oscar K. (ed.) 1974. Personality tests and reviews II: a monograph
consisting of the personality sections of the seventh mental measurements
yearbook 1972 and tests in print 1974. Highland Park, N.J.: Gryphon
Press.
Burt, Marina K., Heidi C. Dulay, and Eduardo Hernandez-Chavez. 1973.
The bilingual syntax measure. With illustrations by Gary Kimler. New
York: Harcourt, Brace, and Jovanovich.
Burt, Marina K., Heidi C. Dulay, and Eduardo Hernandez-Chavez. 1975.
Bilingual syntax measure: manual. New York: Harcourt, Brace, and
Jovanovich.
Burt, Marina K., Heidi C. Dulay, and Eduardo Hernandez-Chavez. 1976.
Bilingual syntax measure: technical handbook. New York: Harcourt,
Brace, and Jovanovich.
Ca ll away, Donn . 1n press. 'Accent and the eV<1\ uation or ES L oral
profiCiency'. In Oller and Perkins.
Ca rroll, John B. 196 1. 'Fundamental consider<1ti o ns in testin g for English
proficiency o f fore ign students' . Testing fhe English proficiency offoreign
students. Wa shington, D.C. : Center for Applied Linguist ics, 31 40.
Reprinted in Allen a nd Campbell (1972 ), 3 13- 120.
Carro lL John B. 1967. 'Foreign la nguage pro ficiency levels a tt a ined by
language maj ors near graduation fro m college '. Foreign Language Annals
1,131 - 151.
Ca rro ll. John B. 197 2, 'Definingla nguage comprehension :some speculat ions'.
In F reedle and Ca rroll (1972), 1- 29.
Carroll, John B., Aaron S. Carton, and Claudia P. Wilds. 1959. 'An investigation of cloze items in the measurement of achievement in foreign languages'. College Entrance Examination Board Research and Development Report. Laboratory for Research in Instruction, Harvard University.
Carroll, Lewis. 1957. Alice in wonderland and through the looking glass. New
York: Grosset and Dunlap.
Cazden, Courtney B., James T. Bond, Ann S. Epstein, Robert D. Matz, and
Sandra J. Savignon. 1976. 'Language assessment: where, what, and how'.
Paper presented at the Workshop for Exploring Next Steps in Qualitative
and Quantitative Research Methodologies in Education, Monterey,
California. Also in Anthropology and Education Quarterly 8, 1977, 83-91.
Chansky, Norman. 1964. 'A note on the validity of reading test scores'.
Journal of Educational Research 58, 90.
Chapman, L. J. and D. T. Campbell. 1957. 'Response set in F scale'. Journal
of Abnormal and Social Psychology 54, 129-132.
Chase, Richard A., S. Sutton, and Daphne First. 1959. 'Bibliography:
delayed auditory feedback'. Journal of Speech and Hearing Disorders 2,
193-200.
Chastain, Kenneth. 1975. 'Affective and ability factors in second language acquisition'. Language Learning 25, 153-161.
Chavez, M. A., Tetsuro Chihara, John W. Oller, Jr., and Kelley Weaver. 1977. 'Are cloze items sensitive to constraints across sentences? II'. Paper read at the Eleventh Annual TESOL Convention, Miami, Florida.
Cherry, Colin. 1957. On human communication: a review, a survey, and a criticism. Second edition, 1966. Cambridge, Mass.: MIT Press.
Chihara, Tetsuro and John W. Oller, Jr. 1978. 'Attitudes and attained
proficiency in EFL: a sociolinguistic study of adult Japanese learners'.
Language Learning 28, 55-68.
Chihara, Tetsuro, John W. Oller, Jr., Kelley Weaver, and Mary Anne Chavez. 1977. 'Are cloze items sensitive to constraints across sentences? I'. Language Learning 27, 63-73.
Chomsky, Noam A. 1957. Syntactic structures. The Hague: Mouton.
Chomsky, Noam A. 1964. 'Current issues in linguistic theory'. In Fodor and
Katz (1964), 50-118.
Chomsky, Noam A. 1965. Aspects of the theory of syntax. Cambridge, Mass.: MIT Press.
Chomsky, Noam A. 1966a. Topics in the theory of syntax. The Hague: Mouton.
Chomsky, Noam A. 1966b. 'Linguistic theory'. In R. G. Mead (ed.) Northeast conference on the teaching of foreign languages. Menasha, Wisconsin: George Banta. Reprinted in Oller and Richards (1973), 29-35.
Chomsky, Noam A. 1972. Language and mind. New York: Harcourt, Brace, and Jovanovich.
Christie, R. and Florence Geis. 1970. Studies in machiavellianism. New York: Academic Press.
Chun, Ki Taek, Sidney Cobb, and J. R. P. French Jr. 1975. Measures for psychological assessment: a guide to 3,000 original sources and their applications. Ann Arbor, Michigan: Survey Research Center of the Institute for Social Research. (Foreword by E. Lowell Kelly.)
Clark, John L. D. 1972. Foreign language testing: theory and practice. Philadelphia: Center for Curriculum Development.
Coffman, William E. 1966. 'The validity of essay tests'. Journal of Educational Measurement 3, 151-156. Reprinted in Bracht, Hopkins and Stanley (1972), 157-163.
Cohen, Andrew. 1973. 'The sociolinguistic assessment of speaking skills in a bilingual education program'. Paper read at the International Seminar on Language Testing sponsored jointly by the AILA Commission on Tests and the TESOL organization, San Juan, Puerto Rico. In Palmer and Spolsky (1975), 173-186.
Coleman, E. B. and G. R. Miller. 1967. 'The measurement of information gained during prose learning'. Reading Research Quarterly 3, 369-386.
College Entrance Examination Board and Educational Testing Service. 1968. Test of English as a foreign language: interpretive information. Princeton, N.J.: Educational Testing Service.
College Entrance Examination Board and Educational Testing Service. 1969. Manual for studies in support of score interpretation. Princeton, N.J.: Educational Testing Service.
College Entrance Examination Board and Educational Testing Service. 1973. Manual for TOEFL score recipients. Princeton, N.J.: Educational Testing Service.
Concannon, S. J. 1975. 'Comparison of the Stanford-Binet scale with the Peabody Picture Vocabulary Test'. Journal of Educational Research 69, 104-105.
Condon, Elaine C. 1973. 'The cultural context of language testing'. Paper read at the International Seminar on Language Testing sponsored jointly by the AILA Commission on Tests and the TESOL organization, in San Juan, Puerto Rico. In Palmer and Spolsky (1975), 204-217.
Condon, W. S. and W. D. Ogston. 1971. 'Speech and body motion synchrony of the speaker-hearer'. In David L. Horton and James J. Jenkins (eds.) The perception of language. Columbus, Ohio: Merrill, 150-173.
Conrad, Christine. 1970. The cloze procedure as a measure of English proficiency. Unpublished master's thesis, University of California, Los Angeles. Abstracted in Workpapers in TESL: UCLA 5, 1971, 159.
Cooke, W. (ed.) 1902. The table talk and bon-mots of Samuel Foote. London: Myers and Rogers.
Cooper, Robert L. 1968. 'An elaborated language testing model'. In John A. Upshur and Julia Fata (eds.) Problems in foreign language testing. Language Learning, Special Issue Number 3, 57-72. Reprinted in Allen and Campbell (1972), 330-346.
Cooper, Robert L. 1969. 'Two contextualized measures of degree of bilingualism'. Modern Language Journal 53, 172-178.
Cooper, Robert L. and Joshua A. Fishman. 1973. 'Some issues in the theory and measurement of language attitude'. Paper read at the International Seminar on Language Testing sponsored jointly by the AILA Commission on Tests and the TESOL organization, San Juan, Puerto Rico. In Palmer and Spolsky (1975), 187-197.
Copi, Irving. 1972. An introduction to logic. Fourth edition. New York: Macmillan.
Cowan, J. Ronayne and Zahreh Zarmed. 1976. 'Reading performance of bilingual children according to type of school and home language'. Working Papers on Bilingualism 11, 74-114.
Craker, Viola. 1971. Clozentropy procedure as an instrument for measuring oral English competencies of first grade children. Unpublished doctoral dissertation, University of New Mexico.
Cronbach, Lee J. 1970. Essentials of psychological testing. New York: Harper and Row.
Cronbach, Lee J. and P. E. Meehl. 1955. 'Construct validity in psychological
tests'. Psychological Bulletin 52, 281-302.
Crowne, D. P. and D. Marlowe. 1964. The approval motive. New York:
Wiley.
Dale, E. and J. S. Chall. 1948a. 'A formula for predicting readability'. Educational Research Bulletin 27, 11-20, 28.
Dale, E. and J. S. Chall. 1948b. 'A formula for predicting readability: instructions'. Educational Research Bulletin 27, 37-54.
Darnell, Donald K. 1963. 'The relation between sentence order and comprehension'. Speech Monographs 30, 97-100.
Darnell, Donald K. 1968. The development of an English language proficiency test of foreign students using a cloze-entropy procedure. ERIC ED 024039.
Davies, Alan. 1975. 'Two tests of speeded reading'. In Jones and Spolsky,
119-130.
Davies, Alan. 1977. 'The construction of language tests'. In Allen and Davies, 38-104.
Dewey, John. 1910. How we think. Boston: D. C. Heath.
Dewey, John. 1916. Essays in experimental logic. New York: Dover.
Dewey, John. 1929. 'Nature, communication, and meaning'. Chapter 5 in Experience and nature. Chicago: Open Court. Reprinted in Hayden and Alworth (1965), 265-296.
Diana vs. California State Education Department. 1970. CA No. C-7037 RFD (N.D. Cal., February 3).
Dickens, C. 1962. Oliver Twist. Edited and abridged by Latif Doss. Hong
Kong: The Bridge Series, Longman.
Doerr, Naomi. In press. 'The effects of agreement/disagreement on cloze scores'. In Oller and Perkins.
D'Oyley, Vincent and H. Silverman (eds.) 1976. Preface to Black Students in
urban Canada. TESL Talk 7, January.
Dumas, Guy and Merrill Swain. 1973. 'L'apprentissage du français langue seconde en classe d'immersion dans un milieu torontois'. Paper read at the Conference on Bilingualism and Its Implications in the West, University of Alberta.
Early, Margaret, Elizabeth K. Cooper, Nancy Santeusanio, and Marian Young Adell. 1974. Teacher's edition for sun-up and reading skills 1. New York: Harcourt, Brace, and Jovanovich.
Ebel, Robert L. 1970. 'Some limitations on criterion referenced measurement'. Paper read at a meeting of the American Educational Research Association. In G. H. Bracht, Kenneth D. Hopkins, and Julian C. Stanley (eds.) Perspectives in educational and psychological measurement. Englewood Cliffs, New Jersey: Prentice-Hall, 1972, 74-87.
Educational Testing Service. 1970. Manual for peace corps language testers. Princeton, N.J.: ETS.
Ervin-Tripp, Susan M. 1970. 'Structure and process in language acquisition'. In James F. Alatis (ed.) Report of the twenty-first annual round table meeting on linguistics and language studies. Washington, D.C.: Georgetown University Press, 312-344.
Evola, J., E. Mamer, and R. Lentz. In press. 'Discrete point versus global scoring for cohesive devices'. In Oller and Perkins.
Ferguson, Charles A. 1972. 'Soundings: some topics in the study of language attitudes in multilingual areas'. Paper read at the Tri-university Meeting on Language Attitudes, Yeshiva University.
Ferguson, Charles A. and John Gumperz. 1971. 'Linguistic diversity in South Asia'. In Anwar S. Dil (ed.) Language structure and language use: essays by Charles A. Ferguson. Palo Alto, California: Stanford University Press.
Fishman, Joshua A. 1976. A series of lectures on bilingualism and bilingual education. University of New Mexico.
Fishman, M. In press. 'We all make the same mistakes: a comparative study of native and non-native errors'. In Oller and Perkins.
Flanagan, J. C. 1939. 'General considerations in the selection of test items and a short method of estimating the product-moment coefficient of correlation from data at the tails of the distribution'. Journal of Educational Psychology 30, 674-680.
Flesch, Rudolf. 1948. 'A formula for predicting readability: instructions'. Journal of Applied Psychology 32, 221-233.
Fodor, J. A. and J. J. Katz (eds.) 1964. The structure of language: readings in the philosophy of language. Englewood Cliffs, N.J.: Prentice-Hall.
Fraser, Colin, Ursula Bellugi, and Roger Brown. 1963. 'Control of grammar in imitation, comprehension, and production'. Journal of Verbal Learning and Verbal Behavior 2, 121-135.
Frederiksen, Carl H. 1975a. 'Effects of context induced processing operations on semantic information acquired from discourse'. Cognitive Psychology 7, 139-166.
Frederiksen, Carl H. 1975b. 'Representing logical and semantic structure of knowledge acquired from discourse'. Cognitive Psychology 7, 371-459.
Frederiksen, Carl H. 1977a. 'Inference and the structure of children's discourse'. Paper read at the Symposium on the Development of Discourse Processing Skills, at a meeting of the Society for Research in Child Development in New Orleans.
Frederiksen, Carl H. 1977b. 'Discourse comprehension and early reading'. Unpublished paper available from the National Institute of Education, Washington D.C.
Freedle, Roy O. and John B. Carroll (eds.) 1972. Language comprehension and the acquisition of knowledge. Washington, D.C.: V. H. Winston, and New York: Wiley.
Gardner, Robert C. 1975. 'Social factors in second language acquisition and bilinguality'. Paper read at the invitation of the Canada Council's Consultative Committee on the Individual, Language, and Society at a conference in Kingston, Ontario.
Gardner, Robert C. and Wallace E. Lambert. 1959. 'Motivational variables in second language acquisition'. Canadian Journal of Psychology 13, 266-272. Reprinted in Gardner and Lambert (1972), 191-197.
Gardner, Robert C. and Wallace E. Lambert. 1972. Attitudes and motivation in second language learning. Rowley, Mass.: Newbury House.
Gardner, Robert C., R. Smythe, R. Clement, and L. Gliksman. 1976. 'Second language learning: a social psychological perspective'. Canadian Modern Language Review 32, 198-213.
Gattegno, Caleb. 1963. Teaching foreign languages in schools: the silent way. New York: Educational Solutions.
Gentner, Donald R. 1976. 'The structure and recall of narrative prose'. Journal of Verbal Learning and Verbal Behavior 15, 411-418.
Glazer, Susan Mandel. 1974. 'Is sentence length a valid measure of difficulty in readability formulas?' The Reading Teacher 27, 464-468.
Goodman, Kenneth S. 1965. 'Dialect barriers to reading comprehension'. Elementary English 42.
Goodman, Kenneth S. 1968. 'The psycholinguistic nature of the reading process'. In K. S. Goodman (ed.) The psycholinguistic nature of the reading process. Detroit, Michigan: Wayne State University Press, 13-26.
Goodman, Kenneth S. 1970. 'Reading: a psycholinguistic guessing game'. In Harry Singer and Robert Ruddell (eds.) Theoretical models and processing in reading. Newark, Delaware: International Reading Association, 259-272.
Goodman, Kenneth S. and Catherine Buck. 1973. 'Dialect barriers to reading comprehension revisited'. The Reading Teacher 27, 6-12.
Gradman, Harry L. and Bernard Spolsky. 1975. 'Reduced redundancy testing: a progress report'. In Jones and Spolsky (1975), 59-70.
Grimes, Joseph Evans. 1975. The thread of discourse. The Hague: Mouton.
Guiora, A. Z., Maria Paluszny, Benjamin Beit-Hallahmi, J. C. Catford, Ralph E. Cooley, and Cecelia Yoder Dull. 1975. 'Language and person: studies in language behavior'. Language Learning 25, 43-62.
Gunnarsson, Bjarni. 1978. 'A look at the content similarities between intelligence, achievement, personality, and language tests'. In Oller and Perkins, 17-35.
Haggard, L. A. and R. S. Isaacs. 1966. 'Micro-momentary facial expressions as indicators of ego-mechanisms in psychotherapy'. In L. A. Gottschalk and A. H. Auerbach (eds.) Methods of research in psychotherapy. New York: Appleton, Century and Crofts, 154-165.
Hanzeli, Victor E. 1976. 'The effectiveness of cloze tests in measuring the competence of students of French in an academic setting'. Paper read at the University of Kentucky Foreign Language Conference. Also in the French Review 50, 1977, 865-874.
Harris, David P. 1969. Testing English as a second language. New York: McGraw Hill.
Harris, Zellig. 1951. Structural linguistics. Chicago: University of Chicago Press.
Hart, Norman W. M. 1974. 'Language of young children'. Education News December, 29-31.
Johansson, Stig. In press. 'Reading comprehension in the native language and the foreign language: on an English-Swedish reading comprehension index'. In Arne Zettersten (ed.) Language testing. Copenhagen, Denmark: Department of English, University of Copenhagen.
John, Vera and Vivian J. Horner. 1971. Early childhood bilingual education. New York: Modern Language Association.
Johnson, Thomas Ray and Kathy Krug. In press. 'Integrative and instrumental motivations: in search of a measure'. In Oller and Perkins.
Jones, R. L. and Bernard Spolsky (eds.) 1975. Testing language proficiency. Arlington, Virginia: Center for Applied Linguistics.
Jongsma, Eugene R. 1971. 'The cloze procedure: a survey of the research'. Occasional Papers in Reading. Bloomington, Indiana: Indiana University School of Education. ERIC ED 050893.
Jonz, Jon. 1974. 'Improving on the basic egg: the multiple choice cloze test'. Paper read at the Eighth Annual Meeting of TESOL, Denver, Colorado. Also in Language Learning 26, 1976, 255-265.
Kaczmarek, Celeste. In press. 'Rating and scoring essays'. In Oller and Perkins.
Katz, J. J. and J. A. Fodor. 1963. 'The structure of a semantic theory'. Language 39, 170-210. Reprinted in Fodor and Katz (1964), 479-518.
Katz, J. J. and P. M. Postal. 1964. An integrated theory of linguistic descriptions. Cambridge, Mass.: MIT Press.
Kelly, C. M. 1965. 'An investigation of the construct validity of two commercially published tests'. Speech Monographs 32, 139-143.
Kerlinger, Fred N. and Elazar J. Pedhazur. 1973. Multiple regression in behavioral research. New York: Holt, Rinehart, and Winston.
Kinzel, Paul. 1964. Lexical and grammatical interference in the speech of a bilingual child. Seattle: University of Washington Press.
Kirn, Harriet. 1972. The effect of practice on performance on dictations and cloze tests. Unpublished master's thesis, University of California, Los Angeles. Abstracted in Workpapers in TESL: UCLA 6, 102.
Klare, George R. 1974-5. 'Assessing readability'. Reading Research Quarterly 10, 62-102.
Klare, George R. 1976. 'Judging readability'. Instructional Science 5, 55-61.
Klare, George R., H. W. Sinaiko, and L. M. Stolurow. 1972. 'The cloze procedure: a convenient readability test for training materials and translations'. International Review of Applied Psychology 21, 77-106.
Kolers, Paul A. 1968. 'Bilingualism and information processing'. Scientific American 218, 78-84.
Labov, William. 1972. Language in the inner city: studies in the black English vernacular. Philadelphia: University of Pennsylvania Press.
Labov, William and P. Cohen. 1967. 'Systematic relating of standard and non-standard rules in the grammar of negro speakers'. Project Literacy 7 (as cited by Politzer, Hoover, and Brown, 1974).
Lado, Robert. 1957. Linguistics across cultures. Ann Arbor, Michigan: University of Michigan.
Lado, Robert. 1961. Language testing. New York: McGraw Hill.
Lado, Robert. 1964. Language teaching: a scientific approach. New York: McGraw Hill.
Lado, Robert and Charles C. Fries. 1954. English pronunciation. Ann Arbor: University of Michigan.
Lado, Robert and Charles C. Fries. 1957. English sentence patterns. Ann Arbor: University of Michigan Press.
Lado, Robert and Charles C. Fries. 1958. English pattern practices. Ann Arbor: University of Michigan Press.
Lambert, Wallace E. 1955. 'The measurement of linguistic dominance of bilinguals'. Journal of Abnormal and Social Psychology 50, 197-200.
Lambert, Wallace E., R. C. Gardner, H. C. Barik, and K. Tunstall. 1962. 'Attitudinal and cognitive aspects of intensive study of a second language'. Journal of Abnormal and Social Psychology 66, 358-368. Reprinted in Gardner and Lambert (1972), 228-245.
Lambert, Wallace E., R. C. Hodgson, R. C. Gardner, and S. Fillenbaum. 1960. 'Evaluational reactions to spoken language'. Journal of Abnormal and Social Psychology 55, 44-51.
Lambert, Wallace E. and G. Richard Tucker. 1972. Bilingual education of children: the St. Lambert experiment. Rowley, Mass.: Newbury House.
Lapkin, Sharon and Merrill Swain. 1977. 'The use of English and French cloze tests in a bilingual education program evaluation: validity and error analysis'. Language Learning 27, 279-314.
Lashley, Karl S. 1951. 'The problem of serial order in behavior'. In L. A. Jeffress (ed.) Cerebral mechanisms in behavior. New York: Wiley, 112-136. Also in Sol Saporta and Jarvis R. Bastian (eds.) Psycholinguistics: a book of readings. New York: Holt, Rinehart, and Winston, 1961, 180-197.
Lau vs. Nichols. 1974. 414 U.S. 563, 94 S. Ct. 786, 39 L. Ed. 2d 1.
Lett, John. 1977. 'Assessing attitudinal outcomes'. In June K. Phillips (ed.) The language connection: from the classroom to the world. ACTFL Foreign Language Education Series. National Textbook.
Liebert, R. M. and Michael D. Spiegler. 1974. Personality: strategies for the study of man. Revised edition. Homewood, Illinois: Dorsey Press.
Likert, R. 1932. 'A technique for the measurement of attitudes'. Archives of Psychology, Number 140.
LoCoco, Veronica Gonzalez-Mena. 1976. 'A comparison of three methods for the collection of L2 data: free composition, translation, and picture description'. Working Papers on Bilingualism 8, 59-86.
LoCoco, Veronica Gonzalez-Mena. 1977. 'A comparison of cloze tests as measures of overall language proficiency'. Modern Language Journal, in press.
Lofgren, Horst. 1972. 'The measurement of language proficiency'. Studia psychologica et pedagogica, Series altera 57. Lund, Sweden: CWK Gleerup Berlingska Boktryckeriet.
Lorge, Irving. 1944. 'Word lists as background for communication'. Teachers College Record 45, 543-552.
Lukmani, Yasmeen. 1972. 'Motivation to learn and language proficiency'. Language Learning 22, 261-274.
Luria, A. R. 1959. 'The directive function of speech in development and dissolution'. Word 15, 341-352.
MacGinitie, W. H. 1961. 'Contextual constraint in English prose paragraphs'. Journal of Psychology 51, 121-130.
MacKay, D. M. 1951. 'Mindlike behavior in artefacts'. British Journal of the Philosophy of Science 2, 105-121.
McCall, W. A. and L. M. Crabbs. 1925. Standard test lessons in reading: teacher's manual for all books. New York: Bureau of Publications, Teacher's College, Columbia University.
McCracken, Glenn and Charles C. Walcutt. 1969. Teacher's edition: Lippincott's basic reading. Philadelphia: J. B. Lippincott.
McLeod, John. 1975. 'Uncertainty reduction in different languages through reading comprehension'. Journal of Psycholinguistic Research 4, 343-355.
Manis, M. and Dawes, R. 1961. 'Cloze score as a function of attitude'. Psychological Reports 9, 79-84.
Martin, Nancy, Pat D'Arcy, Bryan Newton, and Robert Parker. 1976. Writing and learning: across the curriculum 11-16. Foreword by James Britton. Leicester: Woolaston Parker.
Menyuk, Paula. 1969. Sentences children use. Cambridge, Mass.: MIT Press.
Miller, G. A. 1956. 'The magical number seven plus or minus one or two'. Psychological Review 63, 81-97.
Miller, G. A. 1964. 'The psycholinguists: on the new scientists of language'. Encounter 23, 29-37. Reprinted in Osgood and Sebeok (1965), as an appendix.
Miller, G. A., G. A. Heise, and W. Lichten. 1951. 'The intelligibility of speech as a function of the context of the test materials'. Journal of Experimental Psychology 41, 81-97.
Miller, G. A. and Phillip N. Johnson-Laird. 1976. Language and perception. Cambridge, Mass.: Harvard University Press.
Miller, G. A. and Jennifer Selfridge. 1950. 'Verbal context and the recall of meaningful material'. American Journal of Psychology 63, 176-185.
Miller, G. R. and E. B. Coleman. 1967. 'A set of thirty-six prose passages calibrated for complexity'. Journal of Verbal Learning and Verbal Behavior 6, 851-854.
Miller, Jon F. 1973. 'Sentence imitation in pre-school children'. Language and Speech 16, 1-14.
Morton, Rand. 1960. The language lab as a teaching machine: notes on the mechanization of language learning. Ann Arbor: University of Michigan.
Morton, Rand. 1966. 'The behavioral analysis of Spanish syntax: toward an acoustic grammar'. IRAL 4, 170-177.
Mullen, Karen. In press a. 'Evaluating writing proficiency in ESL'. In Oller
and Perkins.
Mullen, Karen. In press b. 'Rater reliability and oral proficiency evaluations'. Paper read at the First International Conference on Frontiers in Language Proficiency and Dominance Testing, April 1977. In Oller and Perkins.
Murakami, Mitsuhisa. In press. 'Behavioral and attitudinal correlates of progress in ESL'. In Oller and Perkins (b).
Naccarato, R. W. and G. M. Gillmore. 1976. 'The application of generalizability theory to a college level French placement test'. Paper read at the Fourth Annual Pacific Northwest Educational Research and Evaluation Conference, Seattle, Washington.
Nadler, Harvey, Leonard R. Marelli, and Charles S. Haynes. 1971. American English: grammatical structure book 1. Paris: Didier.
Naiman, N. 1974. 'The use of elicited imitation in second language acquisition research'. Working Papers on Bilingualism 2, 1-37.
Nakano, Patricia J. 1977. 'Educational implications of the Lau vs. Nichols decision'. In Marina K. Burt, Heidi C. Dulay, and Mary Finocchiaro (eds.) Viewpoints on English as a second language. New York: Regents, 219-234.
Natalicio, Diana S. and F. Williams. 1971. Repetition as an oral language assessment technique. Austin, Texas: Center for Communication Research.
National Council of Teachers of English. 1973. English for today. Second edition, W. R. Slager, Lois McIntosh, Harold B. Allen, and Bernice E. Leary. New York: McGraw Hill.
Newcomb, T. M. 1950. Social psychology. New York: Holt, Rinehart, and Winston.
Nie, Norman H., C. Hadlai Hull, Jean G. Jenkins, Karen Steinbrenner, and Dale H. Brent. 1975. Statistical package for the social sciences. Second edition. New York: McGraw Hill.
Norton, Darryl E. and William R. Hodgson. 1973. 'Intelligibility of black and white speakers for black and white listeners'. Language and Speech 16, 207-210.
Nunnally, J. C. 1967. Psychometric theory. New York: McGraw Hill.
'OCR sets guidelines for fulfilling Lau decision'. 1975. The Linguistic Reporter 18, 1, 5, 7.
Oller, John W. Sr. 1963. El español por el mundo; primer nivel: la familia Fernández. Chicago: Encyclopedia Britannica Films.
Oller, John W. Sr., and Angel Gonzalez. 1965. El español por el mundo; segundo nivel: Emilio en España. Chicago: Encyclopedia Britannica Films.
Oller, John W. Jr. 1970. 'Dictation as a device for testing foreign language proficiency'. Workpapers in TESL: UCLA 4, 37-42. Also in English Language Teaching 25, 1971, 254-259.
Oller, John W. Jr. 1972. 'Scoring methods and difficulty levels for cloze tests of proficiency in ESL'. Modern Language Journal 56, 151-158.
Oller, John W. Jr. 1975. 'Cloze, discourse, and approximations to English'. In Marina K. Burt and Heidi C. Dulay (eds.) New directions in second language teaching and bilingual education: on TESOL '75. Washington, D.C.: TESOL, 345-355.
Oller, John W. Jr. 1976a. 'Evidence for a general language proficiency factor: an expectancy grammar'. Die Neueren Sprachen 2, 165-174.
Oller, John W. Jr. 1976b. 'Language testing'. In Ronald Wardhaugh and H. Douglas Brown (eds.) Survey of applied linguistics. Ann Arbor: University of Michigan Press, 275-300.
Oller, John W. Jr. 1976c. 'A program for language testing research'. In H. Douglas Brown (ed.) Papers in second language acquisition: proceedings of the sixth annual conference on applied linguistics at the University of Michigan. Language Learning, Special Issue Number 4, 141-165.
Oller, John W. Jr. 1977. 'The psychology of language and contrastive linguistics: the research and the debate'. Foreign Language Annals, in press.
Oller, John W. Jr., Lori Baca, and Fred Vigil. 1977. 'Attitudes and attained proficiency in ESL: a sociolinguistic study of Mexican-Americans in the Southwest'. TESOL Quarterly 11, 173-183.
Oller, John W. Jr., J. Donald Bowen, Ton That Dien, and Victor Mason. 1972. 'Cloze tests in English, Thai, and Vietnamese: native and non-native performance'. Language Learning 22, 1-15.
Oller, John W. Jr. and Christine Conrad. 1971. 'The cloze procedure and ESL proficiency'. Language Learning 21, 183-196.
Oller, John W. Jr. and Francis Ann Butler Hinofotis. 1976. 'Two mutually exclusive hypotheses about second language proficiency: factor analytic studies of a variety of language tests'. Paper read at the Winter Meeting of the Linguistic Society of America, Philadelphia. Also in Oller and Perkins (in press).
Oller, John W. Jr., Alan J. Hudson, and Phyllis Fei Liu. 1977. 'Attitudes and attained proficiency in ESL: a sociolinguistic study of native speakers of Chinese in the United States'. Language Learning 27, 1-27.
Oller, John W. Jr. and Nevin Inal. 1971. 'A cloze test of English prepositions'. TESOL Quarterly 5, 315-326. Reprinted in Palmer and Spolsky (1975), 37-49.
Oller, John W. Jr. and Naoko Nagato. 1974. 'The long-term effect of FLES: an experiment'. Modern Language Journal 58, 15-19.
Oller, John W. Jr. and Kyle Perkins (eds.) 1978. Language in education: testing the tests. Rowley, Mass.: Newbury House.
Oller, John W. Jr. and Kyle Perkins (eds.) in press. Research in language testing. Rowley, Mass.: Newbury House.
Oller, John W. Jr. and Jack C. Richards (eds.) 1973. Focus on the learner: pragmatic perspectives for the language teacher. Rowley, Mass.: Newbury House.
Oller, John W. Jr., Bruce Dennis Sales, and Ronald V. Harrington. 1969. 'A
basic circularity in traditional and current linguistic theory'. Lingua 22,
317-328.
Oller, John W. Jr. and Virginia Streiff. 1975. 'Dictation: a test of grammar based expectancies'. English Language Teaching 30, 25-36. Also in Jones and Spolsky (1975), 71-88.
Osgood, Charles E. 1955. 'A behavioristic analysis of perception and
language as cognitive phenomena'. In Ithiel de Sola Pool (ed.)
Contemporary approaches to cognition: a report of a symposium at the
University of Colorado 1955. Cambridge, Mass.: MIT Press, 1957,
75-118.
Osgood, Charles E. and Thomas A. Sebeok. 1965. Psycholinguistics: a survey of theory and research problems. Bloomington, Indiana: Indiana University Press.
Osgood, Charles E., G. T. Suci, and P. H. Tannenbaum. 1957. The
measurement of meaning. Urbana: University of Illinois Press.
Palmer, Leslie and Bernard Spolsky (eds.) 1975. Papers on language testing:
1967-1974. Washington, D.C.: TESOL.
Paulston, Christina B. and Mary Newton Bruder. 1975. From substitution to
substance: a handbook of structural pattern drills. Rowley, Mass.:
Newbury House.
Pearson, David P. 1974. 'The effects of grammatical complexity on children's comprehension, recall, and conception of certain semantic relations'. Reading Research Quarterly 10, 155-192.
Petersen, Calvin R. and Francis A. Cartier. 1975. 'Some theoretical problems and practical solutions in proficiency test validity'. In Jones and Spolsky (1975), 105-118.
Pike, Lewis. 1973. An evaluation of present and alternative item formats for use in the TOEFL. Princeton, N.J.: Educational Testing Service (mimeo).
Pimsleur, Paul. 1962. 'Student factors in foreign language learning: a review of the literature'. Modern Language Journal 46, 160-169.
Platt, John R. 1964. 'Strong inference'. Science 146, 347-353.
Politzer, Robert L., M. R. Hoover, and D. Brown. 1974. 'A test of proficiency in black standard and non-standard speech'. TESOL Quarterly 8, 27-35.
Politzer, Robert L. and Charles Staubach. 1961. Teaching Spanish: a linguistic orientation. Boston: Ginn.
Porter, D. 1976. 'Modified cloze procedure: a more valid reading comprehension test'. English Language Teaching 30, 151-155.
Potter, Thomas C. 1968. A taxonomy of cloze research, Part I: readability and reading comprehension. Inglewood, California: Southwestern Regional Laboratory for Educational Research and Development.
Rand, Earl. 1969a. Constructing dialogs. New York: Holt, Rinehart, and Winston.
Rand, Earl. 1969b. Constructing sentences. New York: Holt, Rinehart, and Winston.
Rand, Earl. 1972. 'Integrative and discrete point tests at UCLA'. Workpapers in TESL: UCLA 6, 67-78.
Rand, Earl. 1976. 'A factored homogeneous items approach to the UCLA ESL Placement Examination'. Paper read at the Tenth Annual Meeting of TESOL, New York City.
Rasmussen, Donald and Lynn Goldberg. 1970. A pig can jig. Chicago: Science Research Associates.
Reed, Carole. 1973. 'Adapting TESL approaches to the teaching of written standard English as a second dialect to speakers of American black English vernacular'. TESOL Quarterly 7, 289-308.
Richards, Jack C. 1970a. 'A non-contrastive approach to error analysis'. Paper read at the Fourth Annual Meeting of TESOL, San Francisco. Also in English Language Teaching 25, 204-219. Reprinted in Oller and Richards (1973), 96-113.
Richards, Jack C. 1970b. 'A psycholinguistic measure of vocabulary selection'. IRAL 8, 87-102.
Richards, Jack C. 1971. 'Error analysis and second language strategies'. Language Sciences 17, 12-22. Reprinted in Oller and Richards (1973), 114-135.
Richards, Jack C. 1972. 'Some social aspects of language learning'. TESOL Quarterly 6, 243-254.
Rivers, Wilga. 1964. The psychologist and the foreign language teacher. Chicago: University of Chicago Press.
Rivers, Wilga. 1967. 'Listening comprehension'. In Mildred R. Donaghue (ed.) Foreign languages and the schools: a book of readings. Dubuque, Iowa: William C. Brown, 189-200.
Rivers, Wilga. 1968. Teaching foreign language skills. Chicago: University of Chicago Press.
Robertson, Gary J. 1972. 'Development of the first group mental ability test'.
In Bracht, Hopkins, and Stanley (1972), 183-190.
Ruddell, Robert B. 1965. 'Reading comprehension and structural redundancy in written material'. Proceedings of the International Reading Association 10, 308-311.
Rummelhart, D. E. 1975. 'Notes on a schema for stories'. In D. G. Bobrow and A. M. Collins (eds.) Representation and understanding: studies in cognitive science. New York: Academic Press, 211-236.
Rutherford, William E. 1968. Modern English: a textbook for foreign students. New York: Harcourt, Brace, and Jovanovich.
Sapir, Edward. 1921. Language: an introduction to the study of speech. New York: Harcourt, Brace, and World.
Sarason, I. G. 1958. 'Interrelationships among individual difference variables, behavior in psychotherapy, and verbal conditioning'. Journal of Abnormal and Social Psychology 56, 339-344.
Sarason, I. G. 1961. 'Test anxiety and intellectual performance of college students'. Journal of Educational Psychology 52, 201-206.
Savignon, Sandra J. 1972. Communicative competence: an experiment in foreign language teaching. Montreal: Marcel Didier.
Schank, Roger. 1972. 'Conceptual dependency: a theory of natural language understanding'. Cognitive Psychology 3, 552-631.
Schank, Roger and K. Colby (eds.) 1973. Computer models of thought and language. San Francisco: W. H. Freeman.
Schlesinger, I. M. 1968. Sentence structure and the reading process. The
Hague: Mouton.
Scholz, George, D. Hendricks, R. Spurling, M. Johnson, and L. Vandenburg. In press. 'Is language ability divisible or unitary? A factor analysis of twenty-two language proficiency tests'. In Oller and Perkins.
Schumann, John H. 1975. 'Affective factors and the problem of age in second
language acquisition'. Language Learning 25, 209-235.
Schumann, John H. 1976. 'Social distance as a factor in second language
acquisition'. Language Learning 26, 135-143.
Scoon, Annabelle. 1974. The feasibility of test translation - English to Navajo. Unpublished doctoral dissertation, University of New Mexico, Albuquerque.
Selinker, Larry. 1972. 'Interlanguage'. IRAL 10, 209-231.
Shaw, Marvin E. and Jack M. Wright. 1967. Scales for the measurement of attitudes. New York: McGraw Hill.
Shore, M. 1974. Final report of project BEST: the content analysis of 125 Title VII bilingual programs funded in 1969 and 1970. New York: Bilingual Education Application Research Unit, Hunter College (mimeo, as cited by Zirkel, 1976, 328).
Silverman, Robert J., Joslyn K. Noa, Randall H. Russell, and John Molina. 1976. Oral language tests for bilingual students: an evaluation of language dominance and proficiency instruments. Washington, D.C.: United States Office of Education.
Skinner, B. F. 1953. Science and human behavior. New York: Macmillan.
Skinner, B. F. 1957. Verbal behavior. New York: Appleton, Century, Crofts.
Skinner, B. F. 1971. Beyond freedom and dignity. New York: Alfred A. Knopf.
Slobin, D. I. 1973. 'Introduction to chapter on studies of imitation and comprehension'. In C. A. Ferguson and D. I. Slobin (eds.) Studies of child language and development. New York: Holt, Rinehart, and Winston, 462-465.
Slobin, D. I. and C. A. Welsh. 1967. 'Elicited imitation as a research tool in developmental psycholinguistics'. Paper presented at the Center for Research on Language and Language Behavior, University of Michigan, March, 1967. In C. A. Ferguson and D. I. Slobin (eds.) Studies of child language and development. New York: Holt, Rinehart, and Winston, 1973, 485-497.
Smalley, William A. 1961. Manual of articulatory phonetics. Ann Arbor, Michigan: Cushing-Malloy.
Somaratne, W. 1957. Aids and tests in the teaching of English. London: Oxford University Press.
Spearman, Charles E. 1904. '"General intelligence" objectively determined and measured'. American Journal of Psychology 15, 201-293.
Spolsky, Bernard. 1968. 'What does it mean to know a language? Or how do you get someone to perform his competence?' Paper read at the Second Conference on Problems in Foreign Language Testing, University of Southern California. Reprinted in Oller and Richards (1973), 164-176.
Spolsky, Bernard. 1969a. 'Attitudinal aspects of second language learning'. Language Learning 19, 271-283. Reprinted in Allen and Campbell (1972), 403-414.
Spolsky, Bernard. 1969b. 'Recent research in TESOL'. TESOL Quarterly 3, 355.
Spolsky, Bernard. 1969c. 'Reduced redundancy as a language testing tool'. Paper read at the Second International Congress of Applied Linguistics, Cambridge, England. ERIC ED 031702.
Spolsky, Bernard. 1974. 'Speech communities and schools'. TESOL Quarterly 8, 17-26.
Spolsky, Bernard. 1976. 'Language testing: art or science'. Paper read at the Fourth International Congress of Applied Linguistics. In Proceedings of the fourth international congress of applied linguistics. Stuttgart, Germany: Sonderdruck, 9-28. Also appears as 'Introduction: linguists and language testers' in Advances in language testing: Series 2, Approaches to language testing. Arlington, Virginia: Center for Applied Linguistics, 1978, v-x.
Spolsky, Bernard, Bengt Sigurd, M. Sato, E. Walker, and C. Arterburn. 1968. 'Preliminary studies in the development of techniques for testing overall second language proficiency'. In Upshur and Fata, 79-101.
Spolsky, Bernard, Penny Murphy, Wayne Holm, and Allen Ferrel. 1972. 'Three functional tests of oral proficiency'. TESOL Quarterly 6, 221-235. Reprinted in Palmer and Spolsky (1975), 75-90.
Srole, Leo. 1951. 'Social dysfunction, personality, and social distance attitudes'. Paper read at a meeting of the American Sociological Society,
66-73.
Tucker, G. Richard and Wallace E. Lambert. 1973. 'Sociocultural aspects of language study'. In Oller and Richards, 246-250.
Tulving, E. 1972. 'Episodic and semantic memory'. In E. Tulving and W. Donaldson (eds.) Organization and memory. New York: Academic Press.
Upshur, John A. n.d. Oral communication test. Ann Arbor: University of Michigan (mimeo).
Upshur, John A. 1962. 'Language proficiency testing and the contrastive analysis dilemma'. Language Learning 12, 123-127.
Upshur, John A. 1969a. 'Productive communication testing: a progress report'. Paper read at the Second International Congress of Applied Linguistics, Cambridge, England. In G. E. Perren and J. L. M. Trim (eds.) Selected papers of the second international congress of applied linguistics. Cambridge University Press, 1971, 435-441. Reprinted in Oller and Richards (1973), 177-183.
Upshur, John A. 1969b. 'TEST is a four letter word'. Paper read at the EPDA Institute, University of Illinois.
Upshur, John A. 1971. 'Objective evaluation of oral proficiency in the ESOL classroom'. TESOL Quarterly 5, 47-60. Also in Palmer and Spolsky (1975), 53-65.
Upshur, John A. 1973. 'Context for language testing'. In Oller and Richards (1973), 200-213.
Upshur, John A. 1976. 'Discussion of "A program for language testing research"'. In H. Douglas Brown (ed.) Papers in second language learning: proceedings of the sixth annual conference on applied linguistics at the University of Michigan. Language Learning, Special Issue Number 4, 167-174.
Upshur, John A. and Julia Fata (eds.) 1968. Problems in foreign language testing. Language Learning, Special Issue Number 3.
Vachek, J. 1966. The linguistic school of Prague. Bloomington, Indiana: Indiana University Press.
Valette, Rebecca M. 1964. 'The use of dictée in the French language classroom'. Modern Language Journal 39, 431-434.
Valette, Rebecca M. 1973. 'Developing and evaluating communication skills in the classroom'. TESOL Quarterly 7, 407-424.
Valette, Rebecca M. 1977. Modern language testing. Second edition. New York: Harcourt, Brace, and Jovanovich.
Wakefield, J. A., Ronald E. Veselka, and Leslie Miller. 1974-5. 'A comparison of the Iowa Tests of Basic Skills and the Prescriptive Reading Inventory'. Journal of Educational Research 68, 347-349.
Watzlawick, Paul, Janet Beavin, and Kenneth Jackson. 1967. Pragmatics of human communication: a study of interactional patterns, pathologies, and paradoxes. New York: Norton.
Weaver, W. W. and A. J. Kingston. 1963. 'A factor analysis of the cloze procedure and other measures of reading and language ability'. Journal of Communication 13, 252-261.
Whitaker, S. F. 1976. 'What is the status of dictation?' Audio Visual Journal 14, 87-93.
Wilson, Craig. 1977. 'Can ESL cloze tests be contrastively biased: Vietnamese as a test case'. Paper read at the First International Conference on Frontiers in Language Proficiency and Dominance Testing, Southern Illinois University. In Oller and Perkins (in press).
Woods, William A. 1970. 'Transition network grammars for natural language analysis'. Communications of the Association for Computing Machinery 13, 591-606.
Wright, Audrey L. and James H. McGillivray. 1971. Let's learn English.
Fourth edition. New York: American Book.
Zirkel, Perry A. 1973. Black American cultural attitude scale. Austin, Texas:
Learning Concepts.
Zirkel, Perry A. 1974. 'A method for determining and depicting language
dominance'. TESOL Quarterly 8, 7-16.
Zirkel, Perry A. 1976. 'The why's and ways of testing bilinguality before
teaching bilingually'. The Elementary School Journal, March, 323-330.
Zirkel, Perry A. and S. L. Jackson. 1974. Cultural attitude scales: test manual. Austin, Texas: Learning Concepts.
Index

abbreviation, 19, 33
Aborn, M., 329, 347, 358, 359
abstraction, 30
accent (see also pronunciation)
  rating scales, 321, 335, 392, 393, 433ff
achievement, 3, 5, 9, 11, 62, 71, 82, 89, 102, 138, 195, 200, 403, 406, 418-420, 455
acquisition (see language learning)
acquiescence, 125
Adell, M., 413
adjectives, 44
Adorno, T. W., 107, 123, 124, 144
advance hypothesizing (see expectancy grammar)
adverbs (see also pragmatic mappings), 43
affective factors (see also attitudes and emotive aspect), 17, 33, 105-148
  direct versus indirect measures of, 121-138
Ahman, J. Stanley, 180, 459
Alatis, James F., 464
Allen, H. B., 73, 459, 463, 470, 475
Allen, J. P. B., 40, 217, 259, 459
American College Tests, 208
American University at Beirut, 59
American University at Beirut English Entrance Examination, 59
Anastasi, Anne, 73, 122, 142, 208, 459
Anderson, D. F., 40, 459
Anderson, Jonathon, 40, 353, 355, 357, 376, 379, 447, 459
Angoff, William H., 202, 206, 459
Anisfeld, M., 117, 459
Anisman, Paul, 49
anomie, 117, 126, 127, 146
Anomie Scale, 122, 126, 127, 144
anticipatory planning (see expectancy grammar)
anxiety, 106, 107, 144
aptitude, 3, 5, 12, 102, 406
appropriateness (see also pragmatic mappings), 24
Arabic, tests in, 59
arithmetic, 382, 418
Armed Forces Qualification Test, 356, 358
Arterburn, C., 264, 302, 475
articles (see also determiners), 395, 396
artificial grammars, 7
Asakawa, Yoshio, 140, 459
Asher, J. J., 32, 459, 460
aspect markers (see also pragmatic mappings), 43
Atai, Parvin, 267, 367, 429, 467
attention (see also consciousness and short-term memory), 224
attitudes, 3, 7, 16-19, 27, 28, 33, 55, 103, 105-148
  reliability and validity of measures, 108, 136
  teacher, 131
Auerbach, A. H., 466
auditory discrimination (see comprehension, listening and minimal pairs)
auditory comprehension (see comprehension, listening)
aural perception (see also comprehension, listening), 40
authoritarianism (see also ethnocentrism and fascism), 107, 110, 125, 127
Autobiographical Survey, 132
auxiliaries, 375
Baca, Lori, 134-136, 141, 367, 471
Bacheller, F., 267, 460
Backman, Nancy, 139, 460
Bacon, Francis, 109, 152
Bally, Charles, 41, 460
Banks, James A., 82, 460
Baratz, Joan, 65, 78, 81, 86, 460
Barik, Henri, 63, 103, 104, 117, 368, 370, 379, 460, 469, 476
Barrutia, Richard, 170, 460
Bass, B. M., 125, 460
Bastian, Jarvis R., 469
Beavin, Janet, 27, 119, 477
Beit-Hallahmi, Benjamin, 113, 133, 466
belief systems (see also expectancy grammar), 7
Bellugi, Ursula, 65, 66, 465
Bereiter, C., 80, 460
Bernstein, Basil, 80, 460
Bezanson, Keith A., 76, 92, 95, 460
BIA schools, 94
bias, 101, 103
  cultural, 7, 84-88
  expectancy, 72, 139
  experiential, 103
  language variety, 80
  test, 84-88
  theoretical, 176
bilingual education (see also curricula, multilingual delivery), 32, 75, 87, 100, 102
Bilingual Syntax Measure, 9, 47, 48, 305-314, 335, 385
  discrete point versus pragmatic scoring of, 47, 310-314, 335, 337
bilingual reading comprehension index, 355
bilingual testing (see multilingual testing)
bilingualism (see also dominance, bilingual), 13, 51, 74, 92, 327
  degrees of, 3, 11, 13, 100, 268, 355, 366, 368
Bird, Charles S., 161, 168, 460
Black American Cultural Attitude Scale, 137
black English, tests in, 2, 65ff, 78, 86
blind back translation, 92, 101
Bloom, Benjamin S., 88, 139, 460
Bloomfield, Leonard, 153-156, 177, 460
Bloomfieldian structuralism, 152-154
Bobrow, D. G., 473
Boehm Test of Basic Concepts, 88
Bogardus, E. S., 137, 138, 460
Bolinger, Dwight L., 188, 460
Bond, T., 307, 338, 462
Bormuth, John, 238, 259, 353, 376, 460
Botel, M., 348, 460
Bowen, J. Donald, 92, 95, 165, 355, 471
Bower, T. G. R., 31, 461
Bracht, Glenn H., 73, 208, 400, 461, 463, 464, 467, 473, 476
Brent, D., 433, 471
Briere, Eugene J., 84, 232, 391, 461
Britton, J., 421, 461, 470
Brodkey, Dean, 129-131, 133, 146, 265, 461
Brooks, Nelson, 165, 461
Brown-Carlsen Listening Comprehension Test, 454, 455
Brown, D., 65, 66, 171, 289-291, 468, 472
Brown, H. Douglas, 73, 116, 143, 147, 461, 471, 476
Brown, Roger, 65, 66, 465
Bruder, Mary, 160, 163, 472
Buck, Catherine, 82, 170, 466
Bung, Klaus, 170, 461
Buros, Oscar K., 108, 109, 143, 145
Burt, Marina K., 47, 104, 310, 312, 461, 470, 471
Callaway, Donn, 321, 333, 392, 393, 431, 461
Campbell, D. T., 125, 462
Campbell, R. N., 73, 459, 463, 475
Carroll, John B., 36, 72, 270, 347, 358, 360, 361, 372, 403, 461, 465
Carroll, Lewis, 24, 29, 462
Cartier, Francis A., 182, 184, 185, 208, 472
Carton, A. S., 354, 360, 361, 461
Catford, J. C., 113, 133, 466
Cazden, Courtney, 307, 333, 338, 462
central grammatical factor (see also g), 61, 62
central tendency, 53
cerebral activity, 21
Chall, Jean S., 348, 464
Chansky, N., 456, 462
Chapman, L. J., 125, 462
Chase, Richard A., 21, 462
Chastain, Kenneth, 132, 142, 144, 462
Chavez, M. A., 346, 347, 361, 462
Cherry, Colin, 28, 34, 462
Chihara, Tetsuro, 140, 141, 346, 347, 361, 462
child language (see also language learning), 10
Chomsky, Noam A., 22, 79, 154-157, 177, 185, 195, 462
Christie, R., 128, 462
Chun, Ki Taek, 109, 462
chunking, 30
Clark, John L. D., 165, 168, 180, 212-214, 217, 230, 463
Clement, R., 138, 466
cloze, 8-10, 39, 42-46, 49, 58-60, 63, 64, 70-72, 92, 98, 139, 141, 176, 180, 227, 285, 340-379, 390, 402, 416
  administration, 366-367
  applications, 348-363
  deletion procedures, 345, 365, 366
    fixed ratio, 345, 375
    variable ratio, 345, 375
  effects of agreement/disagreement on scores, 445, 446
  measure of readability, 348-354, 376
  multiple choice format, 92, 236, 257, 378, 433ff
  oral, 306, 326-332, 335, 336, 338, 433
  preparation, 363-373
    selection of materials, 364, 365
  protocols, 373-375
  reliability, 63, 355, 356, 376
  scoring, 59, 63, 367-373, 377
    exact word, 367-368
    contextual appropriateness, 368-370
    weighting degrees of appropriateness, 370-373
  validity, 63, 64, 192, 193, 356, 357, 376
cloze-dictation (see also partial dictation), 59
clozentropy, 373
Cobb, Sidney, 109, 462
coefficient of determination, 55, 70
Coffman, W. E., 400, 463
cognitive aspect (see factive aspect)
Cohen, Andrew, 97, 104, 463
Cohen, P., 65, 468
coherence of discourse (see also meaningful sequence), 407, 419
Colby, K., 407, 474
Coleman, E. B., 359, 360, 367, 463, 470
College Entrance Examination Board, 189, 373, 463
Collins, A. M., 473
communicative effectiveness, 392-394, 399, 416, 417
communicative purposes, 16
competence, language (see intelligence, general and language proficiency)
composition (see essays)
comprehension (see also receptive mode), 12, 51, 65, 66, 175, 176, 270, 291
  listening, 40, 46, 50, 57, 59, 141, 190-193
    rating scale, 322, 323, 392, 433ff
  reading, 46, 54, 58, 59, 190-192, 226, 235, 357-358, 434ff
Comprehensive English Language Test, 433ff
comprehensive language test, 194
Comprehensive Tests of Basic Skills, 208
Concannon, S. J., 455, 463
Concordia University, 238
Condon, Elaine C., 84, 463
Condon, W. C., 28, 463
confidence (see self-concept)
Conrad, Christine, 58, 175, 355, 463, 471
consciousness (see also short term memory), 30, 68
constraints on verbal sequences, 4, 12, 70, 358, 359, 376
  local, 369, 371
  long range, 43, 371
  pragmatic, 43
construct, theoretical (see also validity, construct), 110
content words, 356, 376
context, 18, 19, 23, 30, 31, 33, 40, 42, 43, 47
  extralinguistic, 19, 24, 26, 32, 33, 38, 39, 42-44, 47, 50, 70, 89-91, 152, 162, 164, 177, 205, 346
    objective versus subjective, 19, 33
  linguistic, 18, 20, 25, 28, 32-34, 38, 42-45, 50, 89
    gestural versus verbal, 18
  pragmatic (see also pragmatic mappings), 178, 216, 230
  social
    monolingual (see also multidialectal), 74-84, 101
    multilingual (see also multilingualism), 59, 69, 95, 101
    foreign language versus second language, 140
contextualization, 220, 223, 229
contrastive linguistics, 7, 86, 87, 101, 169-172, 178
Cooke, W., 23, 463
Cooley, Ralph E., 113, 133, 466
Cooper, E., 413, 464
Cooper, Robert, 96, 118, 146, 148, 172, 173, 463
Cooperative English Tests, 132
Copi, Irving, 26, 463
correlation, 51-57, 70, 73, 248, 393, 394
  high versus low, 187-196, 205
  magnitude of, 56
  misinterpretations of, 187-196
  sign (plus or minus), 55
counterfactual conditional, 337
covariance (see also coefficient of determination), 54
Cowan, J. Ronayne, 99, 464
Crabbs, L. M., 350, 469
Craker, Viola, 62, 329, 464
criterion referenced testing (see native speaker as criterion), 6
Cronbach, Lee J., 73, 110-112, 208, 266, 267, 464
Crowne, D. P., 125, 126, 464
Cultural Attitude Scales, 136, 137
  reliability and validity, 137
cultural bias (see bias)
curricula, 69, 76, 82, 87, 398, 399, 401-421
  evaluation, 4, 10, 417
  language of, 269, 401-403
  multilingual, 3, 6, 7, 102, 308
Czech, tests in, 92, 355
DLI (see Defense Language Institute)
Daesch, Richard, 432
Dale-Chall readability formula, 348
Dale, E., 348, 464
D'Arcy, P., 400, 470
Darnell, Donald K., 58-60, 175, 192, 354, 355, 359, 361, 372, 373, 464
Davies, Alan, 40, 217, 224, 225, 259, 345, 357, 459, 464
Dawes, Robyn, 139, 445
Defense Language Institute, 115, 182, 184
deictic words (see also pragmatic mappings), 43
dehydrated sentences, 49
delayed auditory feedback (see feedback)
demonstratives (see also pragmatic mappings), 43
determiners (see also pragmatic mappings), 43, 298, 395, 396
de Saussure, Ferdinand, 41, 460
Dewey, John, 19, 22, 25, 26, 151, 307, 464
diagnosis, 209, 211, 217, 218, 225, 228, 229, 297, 298, 374, 381, 396
dialect (see language varieties)
dialogue (see also oral interview), 9, 47
Diana versus California, 76
Dickens, Charles, 241, 464
Dien, Ton That, 92, 95, 355, 471
dictation, 39-45, 57-61, 65, 70-72, 98, 176, 193, 262-302, 402, 416, 433ff
  protocols, 282-285, 295-300
  punctuation of, 274-276, 390
  rate, 41
  scoring, 41, 276-285
    spelling errors, 278-282, 390
  selecting materials, 267-271
  standard, 227, 263, 264, 270-285, 298, 299
  validity, 265-267
  variance in, 58
dictation-composition (see also recall tasks), 264, 265, 298
dicto-comp (see dictation-composition)
difficulty of processing (see also hierarchy of difficulties), 271, 377, 416
Dil, Anwar S., 465
discourse processing, 269, 298, 299, 303-358, 379, 405
  constraints on, 221, 329, 418, 432
discovery procedure, 158
discrete point teaching, 165-169, 178, 229
discrete point testing, 7-9, 11, 36-39, 60, 61, 67, 70, 93, 103, 150-230, 269, 291, 301, 432
  goals of, 209-212
  items, 217-227
  reconciliation with pragmatic testing, 227-228
  scoring methods, 300
distinctive features, 26, 36
distractors (see also multiple choice tests), 89, 237, 242, 256-258
distributional analysis, 154-155, 158, 162, 179
divisibility hypothesis, 424-458
Doerr, Naomi, 139, 375, 445, 464
domain (as a factor in language proficiency), 94-98, 104
dominance, bilingual, 7, 51, 75, 93-104, 300, 301
dominance, language variety, 289-295
dominance scale, 99, 101, 104
double-bind, 119
Dulay, Heidi C., 47, 104, 311, 461, 470, 471
Dull, Cecelia Yoder, 113, 133, 466
Dumas, Guy, 50, 64-69, 464, 476
EEE (see American University at Beirut English Entrance Examination)
E scale (see also ethnocentrism), 122-125, 127, 128, 144
early childhood education, 307f
Early, Margaret, 413, 464
Ebel, Robert L., 208, 373, 464
educational measurement, 3, 12, 69, 102, 181
educational policy, 78
Educational Testing Service, 46, 189, 321, 323, 324, 373, 463
ego (see also self-concept), 106, 116
  language, 113
  strength/weakness, 110
elicitation procedures (see also language tests), 4-5, 12, 65, 69
elicited imitation, 9, 42, 65ff, 69, 71, 78, 176, 264, 265, 268, 289-295, 298, 338, 434ff
  administration, 291-293
  scoring, 293-295
elicited translation (see translation as a test)
emotive aspect (see also attitudes and affective factors), 17, 26-28, 33, 34, 92, 101-103, 105-148
empathy, 113-116, 133, 143, 147
Engelmann, S., 80, 460
English Comprehension Level, 182
English, tests in, 41, 46, 59ff, 77, 90ff, 94ff, 117, 182, 354, 355, 357, 361
Epstein, A., 307, 338, 462
error analysis, 5, 64-69, 71, 195
errors, 20, 24, 35, 48, 66, 71, 72, 205, 207, 208, 230, 347, 379, 390, 395
  listening, 297-299
  spelling, 41, 280-282, 299
Ervin, Susan, 65, 464
ESLPE (see UCLA English as a Second Language Placement Examination)
essays, 10, 49, 58, 60, 70, 98, 265, 381-400
  rating, 60, 385-392, 434ff
  scoring, 10, 385-392, 399, 434ff
ethnocentrism (see also authoritarianism), 80, 117, 118, 123, 144
Evola, J., 385, 465
expectancy bias (see bias)
expectancy grammar, 6, 16-35, 38, 43, 50, 61, 68, 186, 195, 303, 343, 375, 381, 428
experience, 29, 30, 33, 34, 204
  home versus school, 82
  universe of, 29, 184
extralinguistic context (see context)
F scale (Fascism Scale), 122-128, 144, 146
face validity, 52
factive aspect, 17, 19-26, 28, 33, 34, 91, 101, 105, 144
factor analysis
  of attitude variables, 111, 145
  of language tests, 62, 423-458
false expectancies (see also expectancy grammar), 221, 223
fascism, 123-128, 144
Fata, Julia, 475, 477
feedback, 20
  delayed auditory, 21
Ferguson, Charles, 79, 118, 465, 474
Ferrel, Alan, 93, 104, 169, 475
Finocchiaro, Mary, 104, 470
First, Daphne, 21, 462
Fishman, Joshua A., 118, 146, 148, 463, 465
Fishman, Michelle, 270, 465
Flanagan, J. C., 250, 258, 465
Flesch readability formula, 348, 349
Flesch, Rudolf, 348-349, 465
fluency, 62
  rating scale, 322, 323, 337, 392, 434ff, 444
Fodor, J. A., 156, 177, 462, 465, 468
Foote, Samuel, 22ff, 31, 463
Foreign Service Institute Oral Interview, 9, 49, 60, 305, 306, 320-326, 335, 337, 338, 385, 392, 430, 443
Fraser, Colin, 65, 66, 465
freak grammars (see also artificial grammars), 203
Frederiksen, Carl, 238, 265, 270, 362, 405, 407, 417, 465
Freedle, Roy O., 270, 403, 461, 465
French, J. R. P. Jr., 109, 462
French, tests in, 57, 64ff, 84, 92, 93, 139, 297, 327, 354, 355
Frenkel-Brunswik, Else, 107, 459
Fries, Charles C., 158, 159, 163, 165, 167, 468
Frost, Robert, 89
function words, 356, 376
functors, 47, 383, 416
g (see also intelligence, general), 427-428, 433-458
  first language studies, 451-456
  indistinguishable from global language proficiency, 456
gain scores, 358
Gallup McKinley School District, 333
Gardner, Robert C., 107, 116, 118, 122-125, 127, 128, 134, 135, 138, 144, 459, 465, 466, 469, 475
Gattegno, Caleb, 32, 466
Geis, Florence, 128, 462
general factor (see also indivisibility hypothesis, partial divisibility, and g), 429
generative system (see also expectancy grammar), 153, 186
Gentner, D., 270, 466
German, tests in, 92, 354, 355
Gestalt psychology, 42, 341, 342
gestural context (see context, and emotive aspect)
Gilmore, G. M., 370, 377, 470
Glazer, S., 348-350, 466
Gliksman, L., 138, 466
Glock, Marvin D., 180, 459
Goldberg, L., 413, 473
Gonzalez, Angel, 471
Goodman, Kenneth S., 82, 86, 170, 407, 466
Gottschalk, L. A., 466
Gradman, Harry Lee, 45, 302, 466
grammar, 24, 37, 185
  tests, 58, 72, 141, 176, 280-282
  items, 226, 227, 233
  rating scale, 322, 323, 335, 392, 433ff
grammatical expectation, 344
grammatical knowledge, 13, 24, 432
grammatical rule, 37
Granowski, A., 348, 460
graphological elements, 31
Gray, Brian, 32, 97, 269, 408-409, 414-415, 421, 467
Grimes, J., 407, 466
group norms, 78, 101
Guiora, Alexander, 2, 113-116, 133, 143, 148, 466
Gumperz, John, 79, 465
Gunnarsson, Bjarni, 78, 403, 418, 421, 455, 466
Haggard, L. A., 133, 466
Hanzeli, V., 370, 466
Harrington, Ronald V., 155, 472
Harris, David P., 40, 52, 167-168, 172-173, 180, 217, 218, 466
Harris, Zellig, 154-155, 158, 162, 177, 466
Hart, Norman W. M., 10, 32, 97, 269, 408-410, 414-415, 421, 466-467
Hawkes, Nicolas, 76, 92, 95, 460
Haynes, Charles S., 161, 471
Heaton, J. B., 40, 180, 217, 233, 259, 467
Heise, G. A., 264, 470
Hendricks, D., 236, 388, 441, 474
Hernandez-Chavez, Eduardo, 47, 310, 461
Herrnstein, Richard, 80, 102, 467
hierarchies of linguistic elements, 21, 23, 24, 29
hierarchy of task difficulties, 67-69
Hinofotis, Frances A. Butler, 236, 267, 357, 393, 429-431, 438, 467, 471
Hofman, John, 63, 77, 102, 379, 467
Hodgson, William R., 86, 471
Holm, Wayne, 93, 104, 169, 475
Holtzman, P., 354, 467
Hoover, M. R., 65-66, 171, 289-291, 468, 472
Hopf, 354, 467
Hopkins, Kenneth D., 73, 208, 400, 461, 463, 464, 467, 473, 476
Horner, Vivian J., 76, 467
Hudson, Alan, 141, 472
Huff, D., 181, 467
Hull, C., 433, 471
Hunt, E., 455, 467
IQ (see intelligence)
Ilyin, Donna, 9, 49, 314, 316, 338, 467
Ilyin Oral Interview, 9, 49, 305, 314-317, 335, 337
immediate memory (see short-term memory)
immersion, 103
implication, 26
Inal, N., 357, 472
Indiana University, 329
Ingram, E., 40, 467
indivisibility hypothesis, 424-458
inductive inference (in science), 108
inference, 18, 417, 419
innate expectations, 29, 31
instructional value (of tests), 4, 13, 49, 51, 52, 230, 256
instrumental motive, 116, 144
integrative motive, 116, 144
integrative tests, 36-39, 60, 70, 93
  distinct from pragmatic tests, 70
  reconciliation with discrete point testing, 8, 227-230
intelligence, general (see also language factor), 2-3, 5, 11-13, 51, 62, 71, 82, 89, 102, 163, 195, 200, 208, 225, 227, 306, 356, 358, 406, 418, 419, 420, 427-428
  verbal versus non-verbal, 12
interlanguage, 64, 65, 71, 186
internal consistency, 248
internal states (see attitudes)
interpersonal relationship (see emotive aspect), 27, 33
interpretability, 373
intersentential relations, 298
interview technique (see oral interview)
interviewer effect, 98 (see also oral interview)
Iowa Tests of Basic Skills, 402-403, 455f
Irvine, P., 267, 367, 373, 429, 467
Isaacs, R. S., 133, 466
item analysis, 60, 196ff, 205, 233, 245-255, 257-259
  discrimination (ID) (as correlation), 196f, 245, 247-252
  facility (IF), 245-247, 257
  response frequency, 245, 252-254
Jackson, D. N., 125-126, 467
Jackson, Kenneth, 27, 119, 477
Jackson, S. L., 122, 136, 137, 145, 477
James Language Dominance Test, 308-309
James, William, 82
Jenkins, J., 433, 463, 471
Jensen, Arthur, 102, 141, 142, 427, 454, 467
Johansson, Stig, 41, 59, 281, 285-288, 300, 302, 355-356, 379, 467
John, Vera, 76, 467
Johnson, M., 236, 388, 474
Johnson, Thomas, 117, 468
Johnson-Laird, P. N., 305, 470
Jones, Randall L., 208, 302, 464, 468
Jongsma, E., 366, 468
Jonz, Jon, 236, 468
Kaczmarek, C., 388, 400, 448-449, 468
Katz, J., 156, 177, 462, 465, 468
Kelly, E. Lowell, 109-110, 454, 462, 468
Kerlinger, Fred N., 53, 428, 468
kinesthetic feedback (see feedback)
Kingston, A. J., 357, 477
Kinzel, Paul, 97, 468
Kirn, Harriet, 58, 297, 363, 468
Klare, George R., 49, 92, 348-350, 353, 355, 364-365, 367, 379, 468
knowledge, 3, 12, 307
Kolers, Paul A., 89, 90, 92-93, 104, 327, 468
Krug, Kathy, 117, 468
Kuder-Richardson formulas, 195
Labov, William, 65, 82, 468
Lado, Robert, 37, 39f, 62, 65, 73, 158-159, 163, 165, 167, 169-170, 172, 217-218, 222-223, 225, 230, 468
Lambert, Wallace E., 32, 81, 107, 116, 118, 122-125, 127-128, 134-135, 144, 459, 465-466, 469, 475-476
language ability (see intelligence and language proficiency)
language arts, 4
language assessment (see language tests)
language attitudes, 118
language factor (see also intelligence and language proficiency), 12
language learning, 1-3, 7, 11, 13, 28, 32, 34, 50, 62
  affective variables in, 112-147
language policy, 74-103
language proficiency
  factorial structure, 423-458
  global factor (see unitary factor, indivisibility hypothesis and intelligence, general)
language proficiency (see also intelligence), 2-3, 5-7, 9-12, 16, 50, 62, 64-65, 93, 95, 102, 105, 173, 195, 204, 403, 418, 420
  multilingual, 98-101
  self ratings of, 97, 103, 141
  teacher ratings of, 97, 103
language teaching methods, 32
language tests (see pragmatic tests), 1-3, 6, 11, 36-73, 101, 102
  as elicitation procedures, 4
language testing research, 1, 3-6, 12
language use, 19, 38, 50, 93
language varieties (see also multilingualism and multidialectalism), 2, 11, 12, 78-79, 84, 101-102
Lapkin, Sharon, 63-64, 368, 370, 379, 469, 476
Lashley, Karl, 21, 469
Lau versus Nichols, 6, 75, 80, 100, 469
Lau Remedies, 75, 93-94, 99, 101, 104
learner characteristics (see error analysis and interlanguage)
learner protocols, 228, 282-284, 301, 375, 381-383, 385, 398-399, 409
  spoken, 407, 411-415
  written, 276-278, 295-298, 388-390
learner systems (see expectancy grammar)
learning, 25, 30, 34, 59
  rate, 13
  variance in, 82-83, 99, 101
Leary, Bernice E., 470
Lentz, R., 465
Lett, John, 148, 469
levels of reading comprehension, 353
  frustrational, 353
  independent, 353
  instructional, 353
Levinson, D. J., 107, 459
Lewis, J., 455, 467
lexical items (see vocabulary)
Lichten, W., 264, 470
Liebert, R. M., 125, 469
Likert, R., 122, 469
Likert-type scales, 122, 136, 147
linguistic competence (see competence)
linguistic context (see context)
linguistic sequences (see context, linguistic)
linguistics, 17, 21, 228
listening, 10, 35, 37, 49, 59, 68f, 70, 141
  ability, 10, 212
  mode, 60, 62, 68
  tasks, 433-440
literacy, 408
Liu, P. F., 140, 472
LoCoco, V., 267, 338, 357, 370, 469
Lofgren, Horst, 62, 469
logic, 17
Lorge, I., 348, 469
Lorge-Thorndike Intelligence Test, 402, 403, 456
Lukmani, Yasmeen, 117, 469
Lund University, 286
Lunneberg, C., 455, 467
Luria, A. R., 2, 469
McCall-Crabbs Standard Test Lessons in Reading, 349, 350
McCall, W. A., 350, 469
McCracken, G., 411, 469
McGillivray, James H., 477
MacGinitie, W. H., 358-360, 469
McIntosh, Lois, 470
MacKay, M., 121, 469
MacKay, R., 238
Macklin, Charles, 22f, 31
McLeod, John, 92, 95, 355, 469
Mamer, E., 465
Manis, Marvin, 139, 445
Marelli, Leonard R., 161, 470
Marlowe, D., 125, 126, 464
Martin, N., 400, 470
Mason, Victor, 92, 95, 355, 471
matched-guise, 135
Matz, R., 307, 338, 462
mean, 53, 54, 72
meaning constraints (see pragmatic mappings, also naturalness criteria)
meaning, importance of, 150-165, 177, 214, 385
meaningful sequence (see naturalness constraints), 22-24, 31, 319, 335
mechanical drill (see also syntax-based drill), 163
Meehl, P. E., 110-112, 208, 266-267, 464
memory, 224, 343
Menyuk, Paula, 65, 470
mental retardation, 453
Messick, 125-126, 467
Micro Momentary Expression Test (see also empathy), 114-115, 133, 134
  reliability, 133
  validity, 134
Miller, George A., 13, 22, 30, 35, 184-185, 264, 305, 329, 470
Miller, G. R., 359-360, 367, 463, 470
Miller, Jon E., 265, 470
Miller, Leslie, 455, 477
minimal pairs, 37
  phonemic contrasts, 167, 217-224
MME (see Micro Momentary Expression Test)
modals, 298, 375
mode of processing, 25
Modern Language Aptitude Test, 266
Modern Language Association Tests, 202
Molina, John, 172, 180, 307, 474
monolingual contexts (multidialectalism)
morphemes, 385
morphological items, 37, 47, 233
Morrison, M., 455, 475
motivation (see attitudes)
Morton, Rand, 165, 470
motor tasks, 2, 12
Mount Gravatt Reading Research Project, 10, 408-415
Mullen, K., 333, 392, 393, 400, 431
multidialectalism, 77-84
multilingual testing (see also bilingual education), 74-104, 355, 376
multilingualism, 74-84
  factive and emotive aspects, 80-84
multiple choice tests, 8, 47, 57, 88, 91, 227, 231-259, 376, 433ff
  item writing, 237-245
  reliability of scoring, 232
  steps for preparation, 255
multiple regression, 111, 428
Murakami, M., 382, 470
Murphy, Penny, 93, 104, 169, 475
Naccarato, R. W., 370, 377, 470
Nadler, Harvey, 161, 470
Nagato, N., 365, 472
Naiman, Neil, 50, 64-69, 470, 476
Nakano, Patricia J., 104, 470
narrative repetition (see also elicited imitation), 326, 336
narration (see also story telling), 332-335
Natalicio, Diana, 65, 302, 338, 470
National Council of Teachers of English, 161, 470
native language, 2, 12
native speaker, 2, 62
  criterion for test performance, 6, 8, 199-208
nativeness, rating scale, 392
naturalness criteria, 6, 28, 33, 34, 36, 38, 42, 44, 46, 70, 72, 180, 221, 263-265, 267, 305, 306, 319, 346, 376, 415, 417
Navajo, tests in, 88ff, 93
negatives, 44
Newcomb, J. M., 137, 471
Newton, B., 400, 470
New York City Language Assessment Battery, 308-309
Nie, N. H., 433, 471
Noa, Joslyn K., 172, 180, 307, 474
noise procedure, 45, 70, 72, 264, 298
nominal validity, 109-110, 128, 176
non-native performance as a criterion for tests, 199-208
nonsense, 214, 229
normal distribution, 54
Norman, Donald, 35
Northwest Regional Educational Laboratory, 307
Norton, Darry, 86, 471
noun, 44
  phrases, 395, 396
Nunnally, J. C., 53, 427-429, 431, 471
objectivity, 232, 257
obligatory occasions, 385
Obrecht, H., 256
oracy, 408
oral cloze (see cloze)
Oral Communication Test, 9, 49, 305, 314, 317-320, 335, 337
oral interview, 9, 47-49, 60, 70, 176, 301-338, 416, 433ff
oral modality, 9, 37, 47, 68, 70, 301-338
  need for oral language tests, 306-308
Ogston, W. D., 23, 463
Oller, D. K., 22
Oller, J. W., 2, 13, 35, 50, 58, 59, 61-62, 73, 87, 92, 95, 134-136, 140-141, 155, 172, 175, 195, 233, 267, 302, 339, 346-347, 355, 357, 360-361, 363, 370, 372, 393, 421, 429-431, 438, 459-462, 464-465, 467-468, 470-473, 475-477
Oller, J. W. Sr., 23, 32, 177, 400, 471
Olsson, Stig, 286
Osgood, Charles E., 13, 22, 25, 134, 344, 354-355, 376, 472
other concept (see interpersonal relationships)
Otis Test of Mental Abilities, 454f
Palmer, Leslie, 104, 148, 338, 459, 463, 467, 472, 475, 476
Paluszny, Maria, 113, 133, 466
paralinguistics, 18
paraphrasing, 336
paraphrase recognition, 46, 50, 70
  sentence, 234
Parish, C., 432, 434, 451
Parker, R., 400, 470
partial dictation, 264, 285-289, 298, 300
  administration, 288-289
  spelling errors, 289
partial divisibility hypotheses, 425-458
pattern drills, 32, 157-165, 179, 204, 415
  factual, 397, 399
Paulston, C. B., 160, 163, 472
Peabody Picture Vocabulary Test, 455
Pearson, D., 349, 472
Pedhazur, Elazar J., 53, 428
percept, 31
perceptual-motor skills, 65
Perkins, Kyle, 2, 50, 62, 79, 195, 233, 302, 400, 421, 431-432, 460-461, 464, 465, 468, 470, 472, 476, 477
Perren, G. E., 339, 476
personality, 3, 5, 12, 105, 109, 403, 406, 418, 420 (see also attitudes)
Peterson, Calvin R., 182, 184-185, 208, 472
Pharis, Keith, 79
Phillips, June K., 148, 469
philosophy, 17
phonemic contrasts (see minimal pairs)
phonological items, 31, 37, 58, 68, 70-71, 217-224, 233, 269
  consonant clusters, 298
Pike, Lewis, 60, 174, 373, 472
Pimsleur, Paul, 148, 472
Platt, John R., 61, 69, 108-109, 145, 152, 472
plural endings (see also morphological items), 375, 385, 396
Polish, tests in, 92, 355
Politzer, Robert, 65, 170, 171, 289-292, 294, 468, 472
Porter, D., 236, 473
Postal, Paul M., 156, 177, 467
Potter, T. C., 353, 357, 473
practicality, 4, 12, 13, 51, 52, 239
pragmatic cash value, 225
pragmatic expectancy grammar (see expectancy grammar)
pragmatic language tasks (or tests), 6, 8-9, 39, 42, 44-50, 60-61, 64, 69-71, 93, 101, 180, 193, 229, 231, 237, 259, 261-458
pragmatic mappings, 19, 23-25, 31, 33, 34, 38, 43-44, 50, 68, 152, 162, 178, 179, 214-216, 241, 305, 310, 312, 343, 376, 383, 384, 410, 417, 419, 455
pragmatic naturalness criteria (see naturalness criteria)
pragmatic reading lessons, 414
pragmatics, 16, 19, 27, 33
The Prague School, 154
prejudice, 107, 110, 111 (see also authoritarianism and social distance)
  institutionalization of, 80, 101
prepositional phrases, 298
Prescriptive Reading Inventory, 455
present perfect, 375
presuppositions, 26
principal component analysis, 427-428, 433
product-moment correlation
productive mode, 5, 9, 37, 65-68, 70, 303-338
productive oral testing, 303-338, 383
proficiency scales, 99
projective techniques, 107, 122, 141, 142
  reliability and validity, 142
pronouns, 43 (see also pragmatic mappings)
pronunciation, 40, 62, 143, 167, 226
  (and empathy), 113-116
psycholinguistics, 2, 13, 19
psychologically real grammar (see also expectancy grammar)
Q-sort, 130ff, 146
quantification, 398
question answering, 46, 57, 70, 240
  meaningful drills, 397
Rand, Earl, 161, 193, 432, 436, 473
Rasmussen, D., 413, 473
rating scales, 320-326, 385, 400
  reliability, 392, 393, 399
readability formulas, 49, 348, 353, 376
reading
  ability, 10, 12, 37, 46, 63, 72, 141, 225, 226
  curriculum, 10, 419
  phonetic system, 410
  phonics, 410
  mode, 49, 62, 69
  tests, 280-282
reading aloud, 9, 176, 306, 326-328, 338, 434ff
  fluency, 327, 444
  miscues, 407
recall tasks, 400, 434ff
receptive mode, 5, 37, 67, 68, 70 (see also comprehension)
redundancy, 344, 375, 376
Reed, Carole, 86, 170, 473
relationship information (see emotive aspect), 27-29
reliability, 4, 12, 13, 51, 56, 57, 72, 194, 195, 207, 248, 338, 392-393
repetition (see also elicited imitation), 434ff
response set, 125 (see also acquiescence)
rewriting essay protocols, 386-391
Richards, Jack C., 5, 35, 48, 64, 339, 349, 461, 462, 472, 473, 475, 476
Riedlinger, Albert, 41, 460
Rivers, Wilga, 165, 170, 473
Robertson, G., 402, 473
robustness, 271, 347
Rodman, Robert, 465
Rorschach Ink Blot Test, 122, 146
Rubenstein, R., 329, 347, 358
Ruddell, R., 357, 466, 473
Ruder, K. F., 86, 475
Rumelhart, D. E., 362, 384, 407, 473
Russell, Randall H., 172, 180, 307, 474
Rutherford, William E., 161, 166, 473
Sales, Bruce Dennis, 155, 472
SAT (see Scholastic Aptitude Test)
sampling theory, 181-187, 204, 206, 219
San Diego Twins, 453
Sando, Joe, 87
Sanford, R. N., 107, 459
Santeusanio, N., 413, 464
Sapir, Edward, 154, 473
Saporta, Sol, 469
Sarason, I. G., 132, 474
Sato, M., 264, 302, 475
Savignon, Sandra, 139, 148, 307, 338, 474
Schank, R., 362, 407, 474
Schlesinger, I. M., 347, 474
Scholastic Aptitude Test, 132, 208
Scholz, G., 236-237, 329, 388, 393, 432, 441, 474
Schumann, John, 114-116, 138, 148, 474
Scoon, Annabelle, 88, 91, 474
Sebeok, Thomas A., 13, 354-355, 376, 472
Sechehaye, Albert, 41, 460
second language learning (see language learning)
self-concept, 106, 136, 143
self-reports, 147
Selfridge, J., 329, 470
Selinker, Larry, 64, 474
semantic differential scales, 144, 147
  bipolar, 134
  unipolar, 135
semantics, 24
sentence completion (see also cloze), 58
sentence repetition (see elicited imitation)
serial order, 21
Sharon, A., 206, 212, 459
Shaw, Marvin E., 106-108, 120, 129, 137, 474
Shore, Howard, 129-131, 133, 146, 461
Shore, M., 76, 474
short-term memory, 40, 42, 44, 50, 65, 66, 273, 347
sidetone (see feedback)
Sigurd, B., 264, 302, 475
Silverman, R., 172, 174, 180, 307
Sinaiko, H. W., 92, 95, 350, 353, 364, 365, 367, 468
Singer, Harry, 466
Skinner, B. F., 106, 145, 153, 474
Slager, W. R., 470
Slobin, Daniel I., 65, 265, 474
Smalley, W. A., 297
Smythe, C. R., 138, 466
Snow, Becky, 236, 357, 467
social contexts (see multilingualism and multidialectalism), 59
social distance, 138
Social Distance Scale, 137
social sciences, 108
sociocultural factors (see emotive aspect)
sociolinguistics, 2, 19
Solley, C. M., 126, 467
Sornaratne, W., 40-41, 474
Srole, Leo, 128, 475
STP (Standard Thai Procedure)
standard deviation, 53-55, 72
standard error, 100, 104
Standard Thai Procedure (see also pronunciation), 114-115
  reliability and validity, 115
Stanford-Binet Intelligence Scale, 358, 455, 456
Stanley, Julian C., 73, 208, 400, 461, 463, 464, 467, 473, 476
statistics, 7, 181-208
spontaneous speech, 49, 65, 67, 71
Southern Illinois University, 329, 387, 429, 431, 432, 439
Spanish, tests in, 93, 47, 76, 90ff, 94ff
speaking, 10, 37, 62, 65, 68-70, 141
  ability, 10, 193, 336
  rating scales, 320-321
  mode, 68
  tasks, 303-338
Spearman, C., 427-428, 474
speech, 306-308, 335
  protocols, 337
speech errors (see errors)
Speigler, D., 125, 469
spelling, 40-41, 224, 279-282, 301, 390
Spolsky, Bernard, 5, 35, 45, 93-96, 99, 103-104, 134-135, 148, 169, 208, 264, 302, 338, 355, 379, 459, 463, 464, 466, 467, 468, 472, 475, 476
Spurling, R., 236, 388, 474
Staubach, Charles, 170, 472
Steinbrenner, K., 433, 471
stereotypes, 118
Sterling, T., 329, 347, 358, 459
Steven, J. H., 86, 475
Stevenson, Douglas, 329, 475
Stinson, P., 455, 475
Stockwell, Robert, 166
Stolurow, L. M., 92, 95, 350, 353, 364-365, 367, 468
story retelling, 9, 49, 70, 97, 98, 332-334, 338, 408
Streiff, Virginia, 58, 63, 77, 175, 472, 475
stress and intonation, 217
Strevens, Peter, 170, 475
Strickland, G., 136, 475
Stubbs, Joseph Bartow, 59, 267, 357, 367, 370, 379, 475
Stump, Thomas, 78, 200, 267, 302, 402, 418, 456-457, 475
subjective judgments, 392-394, 398
submersion, 103
subordinating conjunctions, 44
successful student, 130ff
Suci, G. T., 134, 472
suffixes, 37, 38
summative evaluation, 64
surname surveys, 76, 102
Sutton, S., 21, 462
Swain, Merrill, 50, 63-69, 81, 84, 103, 104, 368, 370, 379, 460, 464, 476
Swedish, tests in, 355
syllables, 215
synonym matching (see also vocabulary), 176, 227
syntactic linguistics, 150-180 (see also transformational linguistics)
syntactic markers, 47
syntax, 24, 37, 70, 214, 233, 396
  based structure drills (see pattern drills)
Tannenbaum, P. H., 134, 472
Tate, Merle, 53, 56, 192, 476
Taylor, Wilson, 42, 328, 341, 344, 347, 350, 355-357, 367, 375-377, 380, 476
teaching
  effectiveness, 362-363
  methods, 105
Teitelbaum, Herta, 76, 87, 96-98, 292, 476
temporal constraints (see naturalness criteria)
tense indicators (see also pragmatic mappings), 43, 298, 385
Test Anxiety, 132
Test of English as a Foreign Language, 46-47, 59-60, 174-175, 188-192, 198, 201-203, 208, 355, 429, 430
test equivalence across languages, 88-93, 300, 301, 376
Tew, Roy, 86, 475
Thai, tests in, 92, 115, 355
Thematic Apperception Test, 122, 123, 142, 146
thinking, 25, 26
Thorndike, Robert L., 73
time (see naturalness criteria)
TOEFL (see Test of English as a Foreign Language)
tone of voice (see emotive aspect)
Trabue, M. R., 359
transformation exercise, 383
transformational linguistics, 7
translation (as a test), 50, 62, 65, 71-72, 325, 336
  from native language, 66
  from target language, 66
translation of tests, 88-93, 96, 103, 301, 376
  of discrete point tests, 88-92, 101
Trim, J. L. M., 339, 476
Tucker, G. Richard, 32, 59, 81, 267, 357, 367, 370, 379, 469, 475-476
Tulving, E., 362, 476
