TESTING, ASSESSING, EVALUATING, AND TEACHING
Definitions of the terms measurement, test, and evaluation
        The terms measurement, test, and evaluation are often used synonymously, but they do
not refer to the same thing. It is important to understand the distinctions among them. To do so,
consider the figure below:
[Figure: a Venn diagram in which the circles for Evaluation, Measurement, and Test overlap,
creating five numbered areas (1-5).]
The figure distinguishes five areas, which explain the relationships among the three
terms:
     Area 1 : Evaluation that involves neither tests nor measurement.
       e.g. : The use of qualitative descriptions of students' performance for diagnosing
              learning problems.
     Area 2 : A non-test measurement used for evaluation.
       e.g. : A teacher's ranking used for assigning grades.
     Area 3 : A test used for purposes of evaluation.
       e.g. : The use of an achievement test to determine students' progress.
     Area 4 : A non-evaluative use of tests and measurement.
       e.g. : The use of a proficiency test as a criterion in second language acquisition
              research.
     Area 5 : A non-test measurement that is not used for evaluation.
       e.g. : Assigning code numbers to subjects in second language research according to
              native language.
From the distinctions above, it is clear that:
            Not all measurements are tests
            Not all tests are evaluative
            Not all evaluation involves either measurement or tests
TESTS
         In everyday usage, the word test often connotes unpleasant feelings of anxiety or
self-doubt. According to Brown (2004), a test is a method of measuring a person's ability,
knowledge, or performance in a given domain. As a method, a test must be explicit and
structured, such as multiple-choice questions accompanied by prescribed correct answers. Some
tests measure general ability, while others focus on very specific competences or objectives.
Most language tests measure one's ability to perform language, that is, to speak, write, read, or
listen to a subset of language.
        According to Bachman (1990), a test is a measurement instrument designed to elicit a
specific sample of an individual's behavior. From this definition, it is clear that a test is one type
of measurement; there are, of course, many other kinds of instruments for measuring. The
distinction between test and measurement is thus clear, and the two terms cannot be substituted
for one another.
Kinds of tests
There are many kinds of tests, each with a specific purpose and a particular criterion to be
measured (Brown, 2007). Below, you will find descriptions of five test types that are in common
use in language curricula.
     Proficiency Tests
        A proficiency test is not intended to be limited to any one course, curriculum, or single
        skill in the language. Proficiency tests have traditionally consisted of standardized
        multiple-choice items on grammar, vocabulary, reading comprehension, oral
        comprehension, and sometimes a sample of writing. Typical examples of standardized
        proficiency tests are the Test of English as a Foreign Language (TOEFL) and the
        International English Language Testing System (IELTS).
     Diagnostic Tests
        A diagnostic test is designed to diagnose a particular aspect of a language. A diagnostic
        test in pronunciation might have the purpose of determining which phonological features
        of English are difficult for a learner and should therefore become part of the curriculum.
     Placement Tests
        A placement test is designed to place a student into an appropriate level or section of a
        language curriculum or school.
     Achievement Tests
        An achievement test is related directly to classroom lessons, units, or even a total
        curriculum. It is limited to particular material covered in the curriculum within a
        particular time frame and is offered after a course has covered the objectives in
        question.
     Aptitude Tests
        A language aptitude test is designed to measure a person’s capacity or general ability to
        learn a foreign language and to be successful in that undertaking. Aptitude tests are
        considered to be independent of a particular language. Two standardized aptitude tests
        were once in popular use – the Modern Language Aptitude Test (MLAT) and the
        Pimsleur Language Aptitude Battery (PLAB). Both are English language tests and
        require students to perform such tasks as memorizing numbers and vocabulary, listening
        to foreign words, and detecting spelling clues and grammatical patterns.
MEASUREMENT
         Bachman (1990) states that measurement (in the social sciences) is the process of
quantifying the characteristics of persons according to explicit procedures and rules. This
definition includes three distinguishing features: quantification, characteristics, and explicit rules
and procedures. Quantification involves the assigning of numbers. Characteristics can be
physical or mental. In testing, we are almost always interested in quantifying mental attributes
and abilities. Mental attributes include aptitude, intelligence, motivation, field
dependence/independence, attitude, native language, fluency in speaking, and achievement in
reading, while abilities refer to performance on a set of mental tasks. The third distinguishing
feature of measurement is that quantification must be done according to explicit rules and
procedures.
        If we are to interpret the score on a given test as an indicator of an individual’s ability,
that score must be both reliable and valid. Reliability has to do with the consistency of measures
across different times, test forms, raters, and other characteristics of the measurement context.
The primary concerns in examining the reliability of test scores are:
            1. To identify the different sources of error.
            2. To use appropriate empirical procedures for estimating the effect of these
                sources of error on test scores.
Validity, meanwhile, refers to the extent to which the inferences or decisions we make on the
basis of test scores are meaningful, appropriate, and useful (American Psychological
Association, 1985, in Bachman, 1990). In examining validity, we must also be concerned with
the appropriateness and usefulness of a test score for a given purpose. Reliability and validity are
both essential to the use of tests. Reliability is a quality of test scores themselves, while validity
is a quality of test interpretation and use. Therefore, a test score that is not reliable cannot be
valid.
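        To make the second concern above concrete, one common empirical procedure for
estimating reliability from a single test administration is Cronbach's alpha, which treats the
items of a test as repeated measures of the same ability. Below is a minimal Python sketch; the
item scores are invented purely for illustration.

    # A minimal sketch of one common reliability estimate, Cronbach's alpha.
    # Rows are test takers, columns are items; all scores are invented.
    scores = [
        [1, 1, 0, 1],
        [1, 0, 0, 1],
        [0, 0, 0, 0],
        [1, 1, 1, 1],
        [0, 1, 0, 1],
    ]

    def variance(xs):
        # Sample variance (n - 1 in the denominator).
        mean = sum(xs) / len(xs)
        return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)

    k = len(scores[0])  # number of items
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    total_var = variance([sum(row) for row in scores])

    alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
    print(f"Cronbach's alpha = {alpha:.2f}")  # closer to 1.0 = more consistent

A low value of alpha signals a large error component in the scores, which is exactly the kind of
information the two concerns above are meant to uncover.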
        Measurement specialists have defined four types of measurement scales: nominal,
ordinal, interval, and ratio. A nominal scale comprises numbers that are used to name the classes
or categories of a given attribute. An ordinal scale comprises the numbering of different levels of
an attribute that are ordered with respect to each other. An interval scale is a numbering of
different levels in which the distances, or intervals, between the levels are equal. A ratio scale
additionally has an absolute zero point, so that comparisons between levels can be made in terms
of ratios.
                                       Type of scale
      Property               Nominal      Ordinal      Interval      Ratio
      Distinctiveness           +            +             +           +
      Ordering                  -            +             +           +
      Equal intervals           -            -             +           +
      Absolute zero point       -            -             -           +
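        To make the table concrete, the short Python sketch below pairs each scale type with the
kind of data it yields and the comparisons it licenses; all values are invented for illustration.

    # Nominal: numbers only name categories; no ordering is implied.
    native_language = {1: "Arabic", 2: "Japanese", 3: "Spanish"}

    # Ordinal: levels are ordered, but the gaps between them need not be equal.
    class_rank = {"Ana": 1, "Budi": 2, "Chen": 3}  # 1st, 2nd, 3rd

    # Interval: equal distances between levels but no absolute zero,
    # so differences are meaningful while ratios are not.
    score_a, score_b = 60, 30
    print(score_a - score_b)  # a 30-point gap is interpretable
    # Claiming that score_a reflects "twice the ability" would not be justified.

    # Ratio: equal intervals AND an absolute zero point, so ratios make sense.
    words_read_a, words_read_b = 300, 150
    print(words_read_a / words_read_b)  # "twice as many words" is meaningful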
       As test developers and test users, we all sincerely want our tests to be the best measures
possible. In order to measure a given language ability, we must be able to specify what it is, and
this specification generally occurs at two levels. First, at the theoretical level, we need to specify
the ability in relation to, and in contrast with, other language abilities and other factors that may
affect test performance. Second, at the operational level, we need to specify the instances of
language performance that we are willing to interpret as indicators, or tokens, of the ability we
wish to
measure. In addition to the limitations related to the underspecification of factors that affect test
performance, there are characteristics of the processes of observation and quantification that
limit our interpretations of test results. These derive from the fact that all measures of mental
ability are necessarily indirect, incomplete, imprecise, subjective, and relative.
       The limitations discussed above restrict our ability to make such inferences. A major
concern of language test development, therefore, is to minimize the effects of these limitations.
To accomplish this, the development of language tests needs to be based on a logical sequence of
procedures linking the putative ability, or construct, to the observed performance. This sequence
includes three steps: (1) identifying and defining the construct theoretically; (2) defining the
construct operationally; and (3) establishing procedures for quantifying observations (Thorndike
and Hagen, 1977, in Bachman, 1990).
        Those general steps in measurement provide a framework both for the development of
language tests and for the interpretation of language test results, in that they provide the essential
linkage between the unobservable language ability or construct we are interested in measuring
and the observation of performance, or the behavioral manifestation, of that construct in the form
of a test score. As an example of the application of these steps to language test development,
consider a theoretical definition of pragmatic competence such as the one Bachman (1990)
presents. The steps in measurement discussed above also relate to virtually all concerns
regarding the interpretation of test results:
            defining the construct theoretically provides the basis for evaluating the validity of
               the uses of test scores.
            defining the construct operationally is also related to test validity, in that the
               observed relationships among different measures of the same theoretical construct
               provide the basis for investigating concurrent relatedness.
            establishing procedures for quantifying observations is directly related to
               reliability, in that the precision of the scales we use and the consistency with
               which they are applied across different test administrations, test forms, scorers,
               and groups of test takers will affect the results of tests (a minimal sketch of such
               a procedure follows this list).
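        As a concrete illustration of the third step, an explicit scoring rule such as the one
sketched below in Python keeps quantification consistent across raters, administrations, and
forms. The rubric levels and descriptors are invented for illustration, not taken from any
published scale.

    # A minimal sketch of quantifying observations by explicit rule.
    # The rubric levels and descriptors are invented for illustration.
    FLUENCY_RUBRIC = {
        4: "speech is smooth; pauses are rare and natural",
        3: "occasional pauses, but the message is delivered without strain",
        2: "frequent pauses; listener effort is required",
        1: "speech is fragmentary; communication often breaks down",
    }

    def score_performance(rater_levels):
        # Averaging the level assigned by each rater is one explicit,
        # repeatable procedure for turning observations into a number.
        return sum(rater_levels) / len(rater_levels)

    print(score_performance([3, 4, 3]))  # three raters -> 3.33...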
EVALUATION
        Evaluation can be defined as the systematic gathering of information for the purpose of
making decisions. The probability of making the correct decision in any given situation is a
function not only of the ability of the decision maker, but also of the quality of the information
upon which the decision is based. Evaluation does not necessarily entail testing. It is only when
the results of tests are used as the basis for making a decision that evaluation is involved.
Definitions of the terms testing, assessing, and teaching
       Before differentiating these three terms, consider the figure below:
[Figure: three nested circles, with TESTS inside ASSESSMENT and ASSESSMENT inside
TEACHING.]
          The figure shows that tests are a subset of assessment, and assessment itself is a subset
of teaching. Tests can be useful devices, but they are only one among many procedures and
tasks that teachers can use to assess students in teaching and learning activities. Teaching sets up
the practice games of language learning: the opportunities for learners to listen, think, take risks,
set goals, process feedback from teachers, and then recycle through the skills that they are trying
to master.
       A useful step in distinguishing among tests, assessment, and teaching is to distinguish
between informal and formal assessment. Informal assessment can take a number of forms,
starting with incidental, unplanned comments and responses, along with coaching and other
impromptu feedback to students. Formal assessments, on the other hand, are exercises or
procedures specifically designed to tap into a storehouse of skills and knowledge. They are
systematic, planned sampling techniques constructed to give the teacher and student an appraisal
of student achievement. Thus, all tests are formal assessments, but not all formal assessment is
testing.
         Another distinction concerns the function of an assessment. Two functions are
commonly identified in the literature: formative and summative assessment. Most classroom
assessment is formative assessment: evaluating students in the process of forming their
competences and skills, with the goal of helping them to continue that growth process.
Summative assessment, by contrast, aims to measure, or summarize, what a student has grasped,
and typically occurs at the end of a course or unit of instruction. A summation of what a student
has
learned implies looking back and taking stock of how well that student has accomplished
objectives, but does not necessarily point the way to future progress. Final exams in a course and
general proficiency exams are examples of summative assessment.
        From a historical perspective, two major approaches to language testing were debated in
the 1970s and early 1980s, and they still prevail today, even in mutated form: discrete-point and
integrative testing. Discrete-point tests are constructed on the assumption that language can be
broken down into its component parts and that those parts can be tested successfully. These
components are the skills of listening, speaking, reading, and writing, and various units of
language (discrete points) of phonology/graphology, morphology, lexicon, syntax, and
discourse. Integrative approaches, in contrast, hold that language competence is a unified set of
interacting abilities that cannot be tested separately. Two types of tests have historically been
claimed to be examples of integrative tests: cloze tests and dictation. A cloze test is a reading
passage (perhaps 150 to 300 words) in which roughly every sixth or seventh word has been
deleted; the test taker is required to supply words that fit into those blanks. Dictation is a
familiar language-teaching technique that evolved into a testing technique. Supporters argue that
dictation is an integrative test because it taps into the grammatical and discourse competencies
required for other modes of performance in a language. Success on a dictation requires careful
listening, reproduction in writing of what is heard, efficient short-term memory, and, to an
extent, some expectancy rules to aid the short-term memory.
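        The deletion procedure for a cloze test is mechanical enough to sketch in code. The
minimal Python example below deletes every nth word and keeps the deleted words as a scoring
key; the passage, the deletion interval, and the exact-word scoring assumption are all illustrative
choices, and a real cloze passage would be far longer.

    # A minimal sketch of cloze-test construction: delete roughly every
    # nth word of a passage and keep the answers for exact-word scoring.
    def make_cloze(passage, n=7):
        words = passage.split()
        blanks, answers = [], []
        for i, word in enumerate(words, start=1):
            if i % n == 0:
                answers.append(word)   # record the deleted word
                blanks.append("_____")
            else:
                blanks.append(word)
        return " ".join(blanks), answers

    text = ("Dictation is a familiar language-teaching technique that evolved "
            "into a testing technique, and a cloze passage would normally be "
            "far longer than this short illustrative example.")
    cloze, key = make_cloze(text, n=7)
    print(cloze)
    print(key)  # the deleted words serve as the scoring key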
Principles of language assessment
        Whether focusing on testing or assessing, a finite number of principles can be named
that serve as guidelines for the design of a new test or assessment and for evaluating the efficacy
of an existing procedure. The term test is used here as a generic term for both tests and formal
assessments, since all the principles apply to both (Brown, 2007). There are five basic principles
for designing effective tests and assessments:
     Practicality
        A practical test stays within the means of financial limitations and time constraints,
        and is reasonably easy to administer, score, and interpret.
     Reliability
        A reliable test is consistent and dependable.
     Validity (content, face, and construct)
        The validity of a test is the degree to which the test actually measures what it is
        intended to measure.
     Authenticity
        In a test, authenticity may be present in the following ways:
              The language used in the test is as natural as possible.
              Items are contextualized.
              Topics and situations are interesting, enjoyable, and humorous.
              Some thematic organization of items is provided, such as through a story line.
              Tasks represent real-world tasks.
     Washback
        The feedback should wash back to students in the form of useful diagnoses of
        strengths and weaknesses.
Current issues in classroom testing
        By the mid-1980s, the language-testing field had abandoned arguments about the unitary
trait hypothesis and had begun to focus on designing communicative language-testing tasks. As
communicative testing presented challenges to test designers, test constructors began to identify
the kinds of real-world tasks that language learners were called upon to perform. Weir (1990, in
Brown, 2004) reminded his readers that “to measure language proficiency ... account must now
be taken of: where, when, how, with whom, and why language is to be used, and on what topics,
and with what effect.” The assessment field thus became more and more concerned with the
authenticity of tasks and the genuineness of texts.
         Instead of just offering paper-and-pencil selective-response tests of a plethora of
separate items, performance-based assessment of language typically involves oral production,
written production, open-ended responses, integrated performance (across skill areas), group
performance, and other interactive tasks. Such assessment is time-consuming and therefore
expensive, but
those extra efforts are paying off in the form of more direct testing because students are assessed
as they perform actual or simulated real-world tasks. In technical terms, higher content validity is
achieved because learners are measured in the process of performing the targeted linguistic acts.
In an English language-teaching context, performance-based assessment means that you may
have a difficult time distinguishing between formal and informal assessment. If you rely a little
less on formally structured tests and a little more on evaluation while students are performing
various tasks, you will be taking some steps toward meeting the goals of performance-based
testing.
        The design of communicative performance-based assessment rubrics continues to
challenge both assessment experts and classroom teachers. Such efforts to improve various facets
of classroom testing are accompanied by some stimulating issues, all of which are helping to
shape our current understanding of effective assessment. Intelligence was once viewed strictly as
the ability to perform (a) linguistic and (b) logical-mathematical problem solving. This “IQ”
concept of intelligence has permeated the western world and its way of testing for almost a
century. However, research on intelligence by psychologists like Howard Gardner, Robert
Sternberg, and Daniel Goleman has begun to turn the psychometric world upside down. Standard
theories of intelligence, on which standardized IQ (and other) tests are based, were expanded to
include, among others, seven additional components (Brown, 2007). The seven components are:
          interpersonal intelligence
          intrapersonal intelligence
          spatial intelligence
          musical intelligence
          bodily-kinesthetic intelligence
          contextual intelligence
          emotional intelligence
      These new conceptualizations of intelligence have not been universally accepted by the
academic community. Nevertheless, their intuitive appeal infused the decade of the 1990s with a
sense of both freedom and responsibility in our testing agenda. Coupled with parallel educational
reforms, they helped to free us from relying exclusively on timed, discrete-point, analytical tests
in measuring language. We were prodded to cautiously combat the potential tyranny of
“objectivity” and its accompanying impersonal approach. But we also assumed the
responsibility for tapping into whole language skills, learning processes, and the ability to
negotiate meaning.
        Recent years have seen a burgeoning of assessment in which the test taker performs
responses on a computer. Some computer-based tests (also known as “computer-assisted” or
“web-based” tests) are small-scale, “home-grown” tests available on websites. Others are
standardized, large-scale tests in which thousands or even tens of thousands of test takers are
involved. Students receive prompts (or probes, as they are sometimes referred to) in the form of
spoken or written stimuli from the computerized test and are required to type (or in some cases,
speak) their responses. Almost all computer-based test items have fixed, closed-ended responses.
Computer-based testing, with or without computer-adaptive testing (CAT) technology (a
minimal sketch of the adaptive idea follows the lists below), offers these advantages:
          classroom-based testing
          self-directed testing on various aspects of a language (vocabulary, grammar,
           discourse, one or all of the four skills, etc.)
        practice for upcoming high-stakes standardized tests
        some individualization, in the case of CATs
        large-scale standardized tests that can be administered easily to thousands of test-
           takers at many different stations, then scored electronically for rapid reporting of
           results.
Of course, some disadvantages are present in our current predilection for computerizing testing.
Among them:
         A lack of security and the possibility of cheating are inherent in classroom-based,
           unsupervised computerized tests.
         Occasional “home-grown” quizzes that appear on unofficial websites may be
           mistaken for validated assessments.
         The multiple-choice format preferred for most computer-based tests contains the
           usual potential for flawed item design.
         Open-ended responses are less likely to appear because of the need for human
           scorers, with all the attendant issues of cost, reliability, and turn-around time.
         The human interactive element (especially in oral production) is absent.
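        As a rough illustration of the individualization that CATs offer, the Python sketch below
uses a simple staircase rule rather than the item-response-theory models real systems employ: a
correct answer makes the next item harder, and a wrong answer makes it easier. The item bank,
the items, and the difficulty levels are all invented for illustration.

    # A minimal sketch of adaptive item selection with a staircase rule.
    # Real CATs use item response theory; this is only an illustration.
    # Each entry: difficulty level -> (prompt, correct answer).
    ITEM_BANK = {
        0: ("She ___ to school every day.", "goes"),
        1: ("If I ___ rich, I would travel.", "were"),
        2: ("Hardly ___ he arrived when it began to rain.", "had"),
        3: ("The report, ___ findings were disputed, was withdrawn.", "whose"),
        4: ("___ the committee to have objected, the plan would have failed.", "Were"),
    }

    def run_adaptive_test(get_answer, start_level=2, num_items=5):
        level, correct = start_level, 0
        for _ in range(num_items):
            prompt, answer = ITEM_BANK[level]
            if get_answer(prompt).strip() == answer:
                correct += 1
                level = min(level + 1, max(ITEM_BANK))  # harder item next
            else:
                level = max(level - 1, min(ITEM_BANK))  # easier item next
        return correct, level  # the final level is a crude ability estimate

    # Example run with answers typed at the console:
    # score, estimate = run_adaptive_test(lambda q: input(q + " "))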
                                      REFERENCES
Bachman, L.F. 1990. Fundamental considerations in language testing. Oxford: Oxford
     University Press.
Brown, H.D. 2004. Language assessment: Principles and classroom practices. White Plains,
      NY: Pearson Education.
Brown, H.D. 2007. Teaching by principles: An interactive approach to language pedagogy.
      White Plains, NY: Pearson Education.