Brown - Language Assessment - 23 - 24
● Offers comments.
● Written work gets assessed by self, teacher, and maybe other students.
A good teacher never ceases to assess students, whether those assessments are incidental
or intentional.
Tests are a subset of assessment; they are not the only form of assessment that a teacher can make.
For optimal learning to take place, students must have the freedom in the classroom to
experiment and try out their own hypotheses about language without feeling that their overall
competence is being “judged”. During these practice activities, teachers are observing
students’ performance and making various evaluations of the learner. In the ideal classroom,
all those observations feed into the way the teacher provides instruction to each student.
1. Informal Assessment: “coaching” students and giving them feedback. This means
incidental, unplanned comments and responses or a marginal comment on a paper.
A good deal of a teacher’s informal assessment is embedded in classroom tasks
designed to elicit performance but NOT with the intent of recording results and
making fixed judgements about a student’s competence.
1. Practicality
A good test is practical: it stays within financial limitations and time constraints, and it is
easy to administer, score, and interpret. The extent to which a test is practical
sometimes hinges on whether a test is designed to be:
2. Reliability
We can be unreliable in the consistency we apply to test evaluation: unclear scoring criteria, fatigue, carelessness, a bia
3. Validity
Validity is the degree to which the test actually measures what it is intended to measure.
Statistical correlation with other related measures is a standard method, but ultimately validity
can be established only by observation and theoretical justification. There is no final, absolute, and
objective measure of validity. We have to ask questions that give us convincing evidence
that a test accurately and sufficiently measures the test-taker for the particular objective, or
criterion, of the test. If that evidence is there, then the test may be said to have criterion
validity.
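The statistical correlation mentioned above can be sketched in a few lines. This is only an illustration, assuming Python and two invented score lists (`new_test` and `established` are hypothetical data, not figures from the source). A high coefficient is one piece of evidence at most, since the notes stress that validity rests on observation and theoretical justification.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length with at least 2 scores")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations from each mean
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when all scores are identical")
    return cov / (spread_x * spread_y)


# Hypothetical scores for the same six students on two measures
new_test = [55, 62, 70, 74, 81, 90]      # the classroom test being checked
established = [50, 60, 68, 77, 80, 93]   # an existing, related measure

r = pearson_r(new_test, established)
print(f"r = {r:.2f}")
```

A coefficient close to 1 suggests the two measures rank students similarly. It does not show that either one measures the intended ability.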
How can teachers be somewhat assured that a test is indeed valid? Three types of
validation are important:
a. Content validity
If a test requires the test-taker to perform the behaviour that is being measured, it
can claim content validity. You can usually determine content validity,
observationally, if you can define the achievement that you are measuring. If you are
trying to assess a person’s ability to speak a second language in a conversational
setting, a test that requires the learner actually to speak within some sort of authentic
context achieves content validity.
b. Face validity
Face validity asks the question “Does the test, on the ‘face’ of it, appear from the
learner’s perspective to test what it is designed to test?”. To achieve “peak”
performance on a test, a learner needs to be convinced that the test is indeed testing
what it claims to test. Face validity is almost always perceived in terms of content: if
the test samples the actual content of what the learner has achieved or expects to
achieve, then face validity will be perceived.
c. Construct validity
One way to look at construct validity is to ask the question “Does this test actually tap
into the theoretical construct as it has been defined?”.
● “Proficiency” is a construct.
● “Self-esteem” is a construct.
4. Authenticity
5. Washback
When students take a test, ideally they will receive information (feedback) about their
competence, based on their performance. That feedback should “wash back” to them in the
form of useful diagnoses of strengths and weaknesses. Informal assessment is more likely
to have built-in washback effects, because the teacher is usually providing interactive
feedback.
The challenge to teachers is to create classroom tests that serve as learning devices
through which washback is achieved. Students’ incorrect responses can become windows of
insight into further work; their correct responses may need to be praised. Washback
enhances a number of basic principles of language acquisition:
● Intrinsic motivation.
● Autonomy.
● Self-confidence.
● Language ego.
● Interlanguage.
● Strategic investment.
Washback also implies that students have ready access to you to discuss the feedback and
evaluation you have given. Washback may also imply the benefit learners experience in their
preparation for a test, before the fact = “Wash forward”. By using appropriate strategies for
reviewing, synthesising, and consolidating material before taking a test, students may find
that the preparation time is as beneficial as the feedback received after the fact.
Kinds of tests
● Today: test designers are still challenged in their quest for more authentic,
content-valid instruments that simulate real-world interaction while still meeting reliability and
practicality criteria.
This historical perspective underscores two major approaches to language testing that still prevail, even if in mutated fo
Discrete-point tests were constructed on the assumption that language can be broken down
into its component parts and those parts adequately tested. Those components are basically
the skills of listening, speaking, reading, writing, the various hierarchical units of language
within each skill, and subcategories within those units. It was claimed that a typical
proficiency test, by adequate sampling of these units, can achieve validity.
Criticism: language competence is a unified set of interacting abilities that cannot be tested
separately. The claim was that communicative competence is so global and requires such
integration that it cannot be captured in additive tests of grammar and reading and
vocabulary and other discrete points of language.
● Cloze tests
a reading passage (150-300 words) that has been “mutilated” by the deletion of roughly
● Dictations
this is familiar to virtually all classroom language learners. The argument for claiming dic
The unitary trait hypothesis suggested an “indivisible” view of language proficiency, namely,
that vocabulary, grammar, phonology, the four skills, and other discrete points of language
cannot be distinguished from each other. The hypothesis contended that there is a general
factor of language proficiency such that all the discrete points do not add up to that whole.
But it was finally admitted that this hypothesis was WRONG.
First issue: global obsession over standardised tests, mass-produced by corporations and
government agencies, hailed as empirically validated and thought to provide accurate
measures of ability.
Researchers continue to focus on the components of communicative competence in their efforts to specify the multiple
multi-trait approach to testing:
● Strategic components.
2. Authenticity
The focus in language pedagogy on communication in real-world contexts has spurred many
attempts to create more communicative assessment procedures. Users of a language
creatively interact with other people as well as with texts; this means that tests have to
involve people in actually performing the behaviour that we want to measure. Interactive
testing involves test-takers in speaking, requesting, responding, interacting, or in combining
listening and speaking, or reading and writing.
Creating authentic tasks within formal assessment procedures presents some dilemmas
because they are often complex and lack practicality. They are also difficult to create and
even more difficult to evaluate because they often involve reliability issues.
Another problem raised by authentic assessment tasks is how to judge the difficulty of a
task, an important factor in standardised testing. They are rarely confined to one simple level
of difficulty across phonological, syntactic, discourse, and pragmatic planes.
Authenticity almost always means the integration of two or more skills, and so how is an
evaluator to judge, say, both listening and speaking competence? They are interdependent
skills, and so a speaking error may actually stem from a listening error, or vice versa.
3. Performance-Based Assessment
An authentic task in any assessment implies that the test-taker must engage in actual
performance of the specified linguistic objective. Instead of just offering paper-and-pencil
single-answer tests of possibly hundreds of discrete items, performance-based testing of
typical school subjects involves:
● Open-ended problem solving tasks.
● Hands-on projects.
● Student portfolios.
● Experiments.
● Group projects.
Such testing is time-consuming and therefore expensive, but the losses in practicality are
made up for in higher validity. Higher content validity is achieved as learners are measured
in the process of performing the objectives of a lesson/course. In an ESL context, if you do a
little more formative evaluation during students’ performance of various tasks, you will be
taking some steps toward meeting some of the goals of performance-based testing.
Intelligence was once viewed strictly as the ability to perform linguistic and logical-mathematical
problem solving: the “IQ” concept of intelligence. Research on
intelligence by psychologists led to the expansion of standard theories of intelligence (on
which standardised IQ and other tests are based) to include inter- and intrapersonal, spatial,
kinesthetic, contextual, and emotional intelligences, among others.
These new conceptualizations of intelligence infused the decade of the 90s with a sense of both freedom and responsib
But we assumed the responsibility for tapping into whole
language skills, learning processes, and the ability to negotiate meaning. The challenge is to
test interpersonal, creative, communicative, interactive skills, and in doing so, to place some
trust in our subjectivity.
● Authentic assessment.
● Performance-based assessment.
● Informal assessment.
● Alternatives in assessment.
Traditional testing offers significantly higher levels of practicality. More time and higher
institutional budgets are required to administer and evaluate assessments that presuppose
more subjective evaluation, more individualization, and more interaction in the process of
offering feedback (payoff: more useful washback, intrinsic motivation, greater validity).
Traditional Tests | Alternatives in Assessment
Summative | Formative
Rapidly growing testing industry = danger of an abuse of power. Tests represent a social
technology deeply embedded in education, government, and business; as such they provide
the mechanism for enforcing power and control. Tests are most powerful as they are often
the single indicators for determining the future of individuals.
Test designers, and the corporate sociopolitical infrastructure that they represent, have an
obligation to maintain certain standards as specified by their client educational institutions.
These standards bring with them certain ethical issues.
Some see the ethics of testing as a case of critical language testing. The issues of critical
language testing are numerous:
Make sure that students actually perform the criterion objectives that your test was designed
to assess. You need to know as specifically as possible what it is you want to test. Carefully
list everything that you think your students should “know” or be able to “do”, based on the
material students are responsible for.
Test specifications for classroom use can be a simple and practical outline of your test.
These informal classroom-oriented specifications give you an indication of (a) which of the
topics (objectives) you will cover, (b) what the item types will be, (c) how many items will be in
each section, and (d) how much time is allocated for each.
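The four parts of such a specification can be written down as a small outline. The sketch below is one possible way to do it, assuming Python. The `Section` class, the `summarise` helper, and every section entry are hypothetical examples, not content from the source.

```python
from dataclasses import dataclass


@dataclass
class Section:
    objective: str   # (a) topic or objective covered
    item_type: str   # (b) item type
    num_items: int   # (c) number of items in the section
    minutes: int     # (d) time allocated


# Hypothetical specification for a short classroom test
spec = [
    Section("Listening: short dialogues", "multiple choice", 10, 10),
    Section("Grammar: past tense forms", "cloze", 10, 10),
    Section("Writing: short paragraph", "open-ended", 1, 15),
]


def summarise(sections):
    """Return (total number of items, total minutes) for the whole test."""
    total_items = sum(s.num_items for s in sections)
    total_minutes = sum(s.minutes for s in sections)
    return total_items, total_minutes


total_items, total_minutes = summarise(spec)
for s in spec:
    print(f"{s.objective}: {s.num_items} x {s.item_type}, {s.minutes} min")
print(f"Total: {total_items} items, {total_minutes} min")
```

Adding up the totals early makes it easier to see whether the test fits the class period before any items are drafted.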
A first draft will give you a good idea of what the test will look like, how students will perceive
it (face validity), the extent to which authentic language and contexts are present, the length
of the listening stimuli, how well a storyline comes across, how things like the cloze testing
format will work, and other practicalities.
Ask yourself a number of important questions: Are the directions to each section absolutely
clear? Is there an example item for each section? Does each item measure a specified
objective? Is each item stated in clear, simple language? Does each multiple-choice item
have appropriate distractors? Does the difficulty of each item seem to be appropriate for
your students? Do the sum of the items and the test as a whole adequately reflect the learning
objectives?
In your final editing of the test, imagine that you are one of your students. Go through each
set of directions and all items slowly and deliberately, timing yourself as you do so. Often we
underestimate the time students will need to complete a test. If the test needs to be
shortened or lengthened, make the necessary adjustments. Then make sure your test is
neat and uncluttered on the page.
After you give your test, you will have some information about how easy or difficult it was,
about the time limits, and about your students' affective reaction to it and their general
performance. Take note of these forms of feedback and use them for making your next test.
With some preparation in test-taking strategies, learners can allay some of their fears and
put their best foot forward during a test. Through strategies-based test-taking, they can avoid
miscues due to the format of the test alone. They should also be able to demonstrate their
competence through an optimal level of performance. The principle “bias for best” means
you should design, prepare, administer, and evaluate tests in such a way that the best
performance of the students will be elicited.
Sometimes students don’t know what is being tested when they tackle a test. You can help to
foster perception with:
Make sure that the language in your test is as natural and authentic as possible. Also, try to
give language some context so that items aren’t just a string of unrelated language samples.
Also, the tasks themselves need to be tasks in a form that students have practised and feel
comfortable with.
Formal tests must be learning devices through which students can receive a diagnosis of
areas of strength and weakness. Their incorrect responses can become windows of insight
about further work. Your prompt return of written tests with your feedback is therefore very
important to intrinsic motivation.
1. Portfolios
● Be clear yourself on the principal purpose of the portfolio and make sure your
feedback speaks to that purpose.
● Help students to process your feedback and show them how to respond to
your responses.
2. Journals
Usually one thinks of journals simply as opportunities for learners to write relatively freely
without undue concern for grammaticality. Journals can have a number of purposes:
language-learning logs, grammar discussions, responses to readings, self-assessment, and
reflections on attitudes and feelings about oneself. These are guidelines for using journals in
a classroom:
● Give guidelines on length of each entry and any other format expectations.
● Be clear yourself on the principal purpose of the journals and make sure your
feedback speaks to that purpose.
● Help students to process your feedback, and show them how to respond to
your responses.
3. Conferences
● Reviewing portfolios.
● Responding to journals.
Through conferences, a teacher can assume the role of a facilitator and guide, rather than a
master controller and deliverer of final grades. In this intrinsically motivating atmosphere,
students can feel that the teacher is an ally who is encouraging self-reflection. Conferences
are by nature formative; they are not dialogues meant to be graded.
4. Observations
One of the characteristics of an effective teacher is the ability to observe students as they
perform. Teachers are constantly engaged in a process of taking students’ performance and
intuitively assessing it and using those evaluations to offer feedback. Observations can
become systematic, planned procedures for real-time, almost surreptitious recording of
student verbal and nonverbal behaviour. One of the objectives of such observation is to
assess students as much as possible without their awareness of the observation, so that the
naturalness of their linguistic performance will be maximised.
Checklists and grids are a common form of recording observed behaviour. Checklists need
not be that elaborate. Simpler options may be more realistic. Rating scales have also been
suggested for recording observations. You will often find moderate practicality and reliability
in observations, especially if the objectives are kept simple. Face validity and content validity
are likely to get high marks since observations are likely to be integrated into the ongoing
process of a course. Authenticity is high because, if an observation goes relatively unnoticed
by the student, then there is little likelihood of contrived situations. Washback can be high if
you take the time and effort to help students to become aware of your data on their
performance.
A closer look at the acquisition of any skills reveals the importance of self-assessment and
the benefit of peer-assessment. Successful learners extend the learning process well
beyond the classroom and the presence of a teacher or tutor, autonomously mastering the
art of self-assessment.
Research has shown a number of advantages of self- and peer-assessment: speed, direct
involvement of students, the encouragement of autonomy, and increased motivation
because of self-involvement in the process of learning. Of course, the disadvantage of
subjectivity looms large, and must be considered whenever you propose to involve students
in this kind of assessment. However, self- and peer-assessment can surely be implemented
to evaluate oral production, listening comprehension, writing and reading skills.
Formal standardised tests are almost by definition highly practical, reliable instruments. They
are designed to minimise time and money, and to be painstakingly accurate in their scoring.
Alternative assessment requires considerable time and effort on the part of the teacher and
the student. But alternative assessment also offers markedly greater washback, superior
formative measures, and greater face validity.
With some creativity and effort, we can transform otherwise inauthentic and negative-
washback-producing tests into more pedagogically fulfilling learning experiences by doing
the following:
2. Performance-based assessment
Standardised tests, to a large extent, do not elicit actual performance on the part of test-
takers. Performance-based assessment implies productive, observable skills, such as
speaking and writing, of content-valid tasks. Such performance usually brings with it an air of
authenticity: real-world tasks that students have had time to develop. They often imply an
integration of language skills, perhaps all four skills. Because the tasks that students perform
are consistent with course goals and curriculum, students and teachers are likely to be more
motivated to perform them.
In reality, performance-based assessment procedures need to be treated with the same rigour
as traditional tests. This implies that teachers should: