Reliability
• Consistency in measurement.
• Reliability by itself doesn't mean that a test is good or bad.
• Measured by the reliability coefficient: the ratio of true score variance to total variance. (Total Variance = True Score Variance + Error Variance)
• The greater the proportion of the total variance attributed to true variance, the more reliable the test.
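A quick worked example of that ratio (the variance figures below are made up for illustration, not taken from these notes):

```python
# Illustrative numbers only: suppose the true-score variance is 80 and the
# error variance is 20 for a set of test scores.
true_variance = 80.0
error_variance = 20.0

total_variance = true_variance + error_variance       # 100.0
reliability = true_variance / total_variance          # 0.80

print(f"Total variance: {total_variance}")
print(f"Reliability coefficient: {reliability:.2f}")   # 0.80 of the variance is "true"
```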
Sources of Error Variance
• TEST CONSTRUCTION - Item sampling / content sampling
• TEST ADMINISTRATION - Temperature, materials, test-taker variables, examiner-related variables
• TEST SCORING AND INTERPRETATION - Scoring and scoring systems, human & electronic
• The most common way of computing a correlation is through the Pearson Product-Moment Correlation Coefficient.
• The correlation coefficient (r) expresses the degree of correspondence, or relationship, between two sets of scores. A perfect correlation is ±1.00; a zero correlation indicates complete absence of a relationship, as might occur by chance.
• Desirable reliability coefficients usually fall in the 0.80s and 0.90s.
• This concept underlies the computation of the error of measurement of a single score, whereby we can predict the range of fluctuation likely to occur in a single individual's test score as a result of irrelevant or chance factors.
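A minimal sketch of the Pearson product-moment computation (the score lists are hypothetical; the same calculation underlies the test-retest, parallel-forms, and interscorer estimates described below):

```python
import statistics

def pearson_r(x, y):
    """Pearson product-moment correlation between two lists of scores."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical scores from two administrations of the same test
first_admin  = [12, 15, 9, 20, 17, 11, 14]
second_admin = [13, 14, 10, 19, 18, 12, 15]
print(round(pearson_r(first_admin, second_admin), 3))
```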
RELIABILITY ESTIMATES

1. Test-Retest Reliability
• Estimate of reliability obtained by correlating pairs of scores from the same people on two different administrations of the same test.
• It assumes that what is being measured is stable over time.
• It assumes no significant learning occurs in the time between administrations.
• It is subject to errors due to practice effects.
• Nature of the test may also change with repetition, as is the case with problems involving reasoning or ingenuity.
• Most appropriate for measuring reaction time and perceptual judgments.
• *The reliability estimate is called the coefficient of stability.

Take note:
• When test-retest reliability is reported in the test manual, the interval over which it is measured should always be specified.
• Since retest correlations decrease progressively as the interval lengthens, there is an infinite number of test-retest reliability coefficients for any test.
2. Parallel-Forms Reliability Estimate
• Estimate of the extent to which item sampling and other errors have affected scores on versions of the same test.
• Alternate/parallel forms: independently constructed tests designed to meet the same specifications. They should have the same number of items, and the items should be expressed in the same form and cover the same content. Range and level of difficulty should be equal.
• The degree of relationship between various forms of a test is measured by a coefficient of equivalence.
• Such a reliability coefficient is a measure of both temporal consistency and consistency of response to different item samples or test forms.
• If you want to use this kind of reliability, ensure that you pay attention to content sampling.
• Reports should also be accompanied by the length of the interval between testings as well as relevant intervening experiences.
• Note that parallel forms are helpful not just for establishing reliability but also for use in follow-up studies. They are a means of reducing the likelihood of coaching or cheating.
• May still be subject to practice effects.

3. Split-Half Reliability Estimate
• Split-half reliability is obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered once.
• Temporal stability doesn't enter, as only one test session is involved.
• A measure of internal consistency, inter-item consistency, or homogeneity.
• Randomly assign items to one or the other half of the test.
• Odd-even split.
• Divide by content so that each half is equivalent in content and difficulty.

How to Split the Test
• STEP 1: Divide the test into equivalent halves.
• STEP 2: Calculate the Pearson r between the scores on the two halves.
• STEP 3: Adjust the half-test reliability using the Spearman-Brown formula, which estimates the effect of lengthening or shortening a test. Other things being equal, the longer the test, the more reliable it will be.
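A small sketch of the three steps on a made-up 0/1 item matrix, using an odd-even split and the Spearman-Brown correction r_full = 2r / (1 + r):

```python
from statistics import correlation  # Python 3.10+

# Hypothetical item responses (1 = correct, 0 = wrong); rows are examinees.
items = [
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 0, 1, 1, 1, 0, 1],
    [0, 0, 0, 1, 0, 0, 1, 0],
]

# STEP 1: odd-even split (items 1, 3, 5, ... vs items 2, 4, 6, ...).
odd_scores  = [sum(row[0::2]) for row in items]
even_scores = [sum(row[1::2]) for row in items]

# STEP 2: Pearson r between the two half-test scores.
r_half = correlation(odd_scores, even_scores)

# STEP 3: Spearman-Brown correction for the full-length test.
r_full = (2 * r_half) / (1 + r_half)
print(round(r_half, 3), round(r_full, 3))
```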
4. Kuder-Richardson Reliability & Coefficient Alpha
• A way of finding reliability that also uses a single form of the test, based on the consistency of responses to the items in the test.
• This interitem consistency is influenced by two sources of error variance: (1) content sampling (as in alternate-form and split-half reliability) and (2) heterogeneity of the behavior domain being sampled. The more homogeneous the domain, the higher the interitem consistency.
• It's important to also consider whether the criterion the test is trying to predict is itself relatively homogeneous or heterogeneous.
• The most common formula for finding interitem consistency is the Kuder-Richardson Formula 20 (KR-20). Unlike the split-half approach, KR-20 analyzes performance on each item.
• The Kuder-Richardson formula is applicable to tests whose items are scored as right or wrong, or according to some other all-or-none system.
• For tests that have multiple-scored items (on a Likert-scale personality test, for example), a generalized formula has been derived, called Coefficient Alpha.
• The procedure is to find the variance of all individuals' scores for each item and then to add these variances across all items.

5. Interscorer Reliability
• Degree of agreement or consistency between two or more scorers with regard to a particular measure.
• Most applicable for tests of creativity and projective tests.
• Controls for examiner variance.
• The simplest way is to have a sample of test papers independently scored by two examiners.
• The index of measurement is the coefficient of interscorer reliability.
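A minimal sketch of the item-variance procedure (the response matrices are invented): with right/wrong items the coefficient-alpha formula reduces to KR-20, and the same function applied to Likert-type items gives coefficient alpha.

```python
from statistics import pvariance

def cronbach_alpha(rows):
    """Coefficient alpha: rows are examinees, columns are item scores.
    With 0/1 item scoring this reduces to KR-20."""
    k = len(rows[0])                                   # number of items
    item_columns = list(zip(*rows))                    # column-wise item scores
    item_var_sum = sum(pvariance(col) for col in item_columns)
    total_var = pvariance([sum(row) for row in rows])  # variance of total scores
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

# Hypothetical right/wrong (1/0) responses for six examinees on five items.
binary = [
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 1, 0],
]
print(round(cronbach_alpha(binary), 3))   # KR-20 for dichotomous items

# Hypothetical 1-to-5 Likert ratings: the same formula gives coefficient alpha.
likert = [
    [4, 5, 4, 3],
    [2, 3, 2, 2],
    [5, 5, 4, 5],
    [3, 3, 3, 2],
    [4, 4, 5, 4],
]
print(round(cronbach_alpha(likert), 3))
```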
Errors Due to Examiner's Bias

1. Error of Central Tendency
• A less-than-accurate rating or evaluation by a rater or judge, due to that rater's general tendency to make ratings near the midpoint of the scale.
• The rater doesn't want to assign low or high scores; everyone ends up rated as average.

2. Leniency/Generosity Error
• The rater's tendency to be too forgiving or insufficiently critical.

3. Severity Error
• The rater's tendency to be overly critical.

4. Halo Effect
• The tendency of the rater to judge all aspects of an individual using a general impression that was formed on only one or a few of the individual's characteristics.

5. Horn Effect
• The opposite of the halo effect: a tendency to let one poor rating influence all other ratings, resulting in a lower overall evaluation than deserved.

6. Contrast Error
• Happens when raters compare examinees with one another instead of against performance standards.
• This may result in an average employee being rated as a high performer when compared to their underperforming peers, or a good performer being rated as a poor performer when compared to their high-performing peers.

• The reliability of a test may be expressed in terms of the standard error of measurement (SEM), also called the standard error of a score. This measure is suited to the interpretation of individual scores.
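For a concrete sense of scale, a sketch of the SEM computation, SEM = SD × √(1 − reliability); the SD and reliability values are assumed for illustration only:

```python
import math

# Assumed values for illustration: a test with SD = 15 and reliability = 0.91
# (deviation-IQ-style numbers; they are not taken from these notes).
sd_test = 15.0
reliability = 0.91

sem = sd_test * math.sqrt(1 - reliability)   # standard error of measurement
print(round(sem, 2))                         # ~4.5 score points

# A rough 95% band around an obtained score of 100, using +/- 1.96 SEM.
obtained = 100
print(round(obtained - 1.96 * sem, 1), round(obtained + 1.96 * sem, 1))
```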
Validity

3 Categories:
Content validity – evaluation of the subjects, topics, or contents covered in a test
Criterion-related validity – evaluation of the relationship of scores to scores on other tests or instruments
Construct validity – comprehensive analysis of the theoretical framework plus scores on other tests

1. Content Validity
• Describes a judgment of how adequately a test samples behavior representative of the universe of behavior that the test is designed to sample.

How to Create Content-Valid Items
• a) Content validity is built into a test from the outset through the choice of appropriate items.
• Test specifications are drawn up to show the content areas or topics to be covered, the instructional objectives or processes to be tested, and the relative importance of individual topics and processes.
• b) In test manuals, the number and qualifications of subject-matter experts must be reported.
• c) Other empirical procedures: total scores and performance on individual items can be checked.
• Face validity is a judgment of how relevant the test items appear to be.
• Fundamentally, face validity concerns rapport and public relations.
• Face validity can be improved by merely reformulating test items in terms that appear relevant and plausible in the particular setting in which they will be used.

2. Criterion-related Validity
• A judgment of how adequately a test score can be used to infer an individual's most probable standing on some measure of interest.
• The measure of interest is the criterion; criterion-related validity reflects the effectiveness of the test in predicting an individual's performance in specified activities.
• Samples of criteria: academic achievement, performance in specialized training, job performance, psychiatric diagnosis.

2 TYPES:
• Concurrent validity: the extent to which scores on a new measure relate to scores from a criterion measure administered at the same time.
• Predictive validity: uses the scores from the new measure to predict performance on a criterion measure administered at a later time.

3. Construct Validity
• A judgment about the appropriateness of inferences drawn from test scores regarding individuals' standings on a variable called a construct.
• A construct is an informed scientific idea developed or hypothesized to describe or explain behavior.
• Constructs are unobservable, presupposed (underlying) traits that a test developer may invoke to
describe test behavior or criterion performance.
Examples of constructs are job satisfaction, personality,
bigotry, clerical aptitude, depression, motivation, self-
esteem, emotional adjustment, potential
dangerousness, executive potential, creativity,
mechanical comprehension and more.
• Construct validation focused attention on the role of
psychological theory in test construction.
Techniques that Contribute to Construct
Identification
a) Correlations with other tests
• Correlations between new and similar earlier tests
are sometimes cited as evidence that the new test
measures approximately the same general area of
behavior as other tests designated by the same
name.
• Correlations with other tests are employed in still another way: to demonstrate that the new test is relatively free from the influence of certain irrelevant factors.
b) Factor analysis
• Factor Analysis is a refined statistical technique for
analyzing the interrelationships of behavior data. In
the process, the number of variables or categories
in terms of which each individual’s performance
can be described is reduced from the number of
original tests to a relatively small number of factors,
or common traits.
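As a rough illustration only (these notes do not prescribe any software), a factor analysis of a small made-up score matrix could be sketched with scikit-learn's FactorAnalysis; the data and the choice of two factors are assumptions:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis  # assumes scikit-learn is installed

# Hypothetical scores of 8 people on 4 subtests (rows = people, columns = tests).
scores = np.array([
    [12, 14, 30, 28],
    [ 9, 10, 22, 20],
    [15, 16, 35, 33],
    [11, 12, 27, 25],
    [ 8,  9, 20, 19],
    [14, 15, 33, 30],
    [10, 11, 24, 23],
    [13, 13, 31, 29],
])

# Reduce the four observed variables to two common factors.
fa = FactorAnalysis(n_components=2, random_state=0)
fa.fit(scores)
print(fa.components_)   # factor loadings: one row per factor, one column per subtest
```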
c) Internal Consistency
• Items that fail to show a significantly greater
proportion of “passes” in the upper than in the
lower criterion group are considered as invalid, and
are either eliminated or revised.
• Another application involves correlation of subtests
with total score, and any subtests whose
correlation is too low will be eliminated.
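A minimal sketch of the upper-versus-lower comparison described above, using invented item responses; the difference in pass proportions is the usual item-discrimination index:

```python
# Hypothetical 1/0 responses to one item, split by standing on the criterion
# (group sizes and responses are made up).
upper_group_item = [1, 1, 1, 0, 1, 1, 1, 1]   # examinees high on the criterion
lower_group_item = [1, 0, 0, 0, 1, 0, 0, 1]   # examinees low on the criterion

p_upper = sum(upper_group_item) / len(upper_group_item)
p_lower = sum(lower_group_item) / len(lower_group_item)

# Items with a small or negative difference are the ones flagged for
# elimination or revision.
discrimination = p_upper - p_lower
print(round(p_upper, 2), round(p_lower, 2), round(discrimination, 2))
```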
d) Convergent and Discriminant Validation
• D. T. Campbell (1960) pointed out that in order to demonstrate construct validity, we must show not only that a test correlates highly with the variables it should theoretically correlate with, but also that it does not correlate with variables from which it is supposed to differ.
• Convergent evidence – a high relationship with measures the construct is supposed to be related to.
• Discriminant evidence – a low relationship with measures the construct is NOT supposed to be related to.
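A small sketch of the idea with invented scores: the new scale should correlate highly with a related measure (convergent evidence) and near zero with an unrelated one (discriminant evidence):

```python
from statistics import correlation  # Python 3.10+

# Hypothetical scores: a new depression scale, an established depression scale
# (convergent check), and a clerical-speed test (discriminant check).
new_scale = [10, 14, 8, 20, 17, 12, 15, 9]
related   = [11, 15, 9, 19, 18, 11, 14, 10]   # should correlate highly
unrelated = [27, 22, 23, 27, 24, 28, 23, 26]  # should correlate near zero

print("convergent r   =", round(correlation(new_scale, related), 2))
print("discriminant r =", round(correlation(new_scale, unrelated), 2))
```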
Standard error of estimate (SEEST): computes the margin of error to be expected in an individual's predicted criterion score as a result of the imperfect validity of the test.
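A minimal sketch of the computation, SEE = SD_criterion × √(1 − r²); the criterion SD and validity coefficient are assumed for illustration:

```python
import math

# Assumed numbers: criterion GPA with SD = 0.60 and a test whose
# criterion-related validity coefficient is r = 0.50.
sd_criterion = 0.60
validity_r = 0.50

see = sd_criterion * math.sqrt(1 - validity_r ** 2)   # standard error of estimate
print(round(see, 3))   # ~0.52: expected margin of error around a predicted GPA
```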