Validity
In everyday language, we say that something is valid if it is sound and meaningful - well-grounded
in logic or evidence. For example, we speak of a valid argument or a valid reason. In such
instances, people make a judgment based on evidence about the meaningfulness of something.
Similarly, in the language of psychological assessment,
“Validity is a term used in conjunction with the
meaningfulness of an assessment score - what the score truly
means.”
The Concept of Validity
Validity, as applied to an assessment, is an estimate of how well a test measures what it
purports to measure in a particular context. It is a judgment based on evidence about the
appropriateness of inferences drawn from the test scores.
            “An inference is a logical result or deduction.”
The validity of a test, or of test scores, is frequently described as “acceptable” or “weak”;
such descriptions reflect a judgment about how adequately the test measures what it is supposed
to measure.
Judgment of a Test’s Validity
  It is a judgment of how useful the test is for a particular purpose with a particular
  population of people.
What is really meant is that the test has been shown to be valid for a particular use with a
particular population at a particular time. No test or assessment technique is “universally
valid” for all time, for all uses, with all types of populations. A test may be shown to be
valid within reasonable boundaries of a contemplated use; if those boundaries are exceeded, the
validity of the test may be called into question.
→ The validity of a test may diminish as the culture or the times change.
→ The validity of a test must be proven again from time to time.
Validation
It is the process of gathering and evaluating evidence about validity. In the validation of a
test for a specific purpose, both the test developer and the test user play a role. However, test
developers typically provide the validity evidence in test manuals.
Local Validation Studies
These are validation studies that test users conduct with their own groups of testtakers.
Such studies provide insight regarding a particular population of testtakers as compared with
the normative sample described in the test manual.
Why are these studies important?
✭ when the test user seeks to transform a nationally standardized test into Braille for
administration to blind and visually impaired testtakers.
✭ when the test user plans to alter in some way the format, instructions, language, or content
of the test.
✭ when the test user wants to use a test with a population of testtakers that differs in some
significant way from the population on which the test was standardized.
Categories of Validity
Validity can be conceptualized according to three categories.
 1. Content Validity
 2. Criterion-related Validity
 3. Construct Validity
Construct validity is an “umbrella validity” since every other variety of validity falls under
it. However, all three types of validity evidence contribute to the unified picture of a test’s
validity.
Three approaches to assessing validity – associated, respectively, with content validity,
criterion-related validity, and construct validity – are:
    Scrutinizing the test’s content.
    Relating scores obtained on the test to other test scores or other measures.
    Executing a comprehensive analysis of how scores on the test can be understood within some
    theoretical framework for understanding the construct that the test was designed to measure.
These three approaches are not mutually exclusive. Each should be considered one type of
evidence that, with the others, contributes to a judgment concerning the validity of a test.
Face Validity
Face validity relates more to what a test appears to measure to the person being tested than to
what the test actually measures. It is a judgment concerning how relevant the test items
appear to be.
              “A paper-and-pencil personality test labeled the Introversion/Extroversion Test,
              with items that ask respondents whether they have acted in an introverted or an
              extroverted way in particular situations, may be perceived as a highly face-valid
              test. On the other hand, a personality test in which respondents are asked to
              report what they see in inkblots may be perceived as a test with low face validity.
              Many respondents would be left wondering how what they said they saw in the
              inkblots really had anything at all to do with personality.”
Content Validity
Content validity describes a judgment of how adequately a test samples behavior
representative of the universe of behavior that the test was designed to sample.
                 “For example, the universe of behavior referred to by a term such as
                 “assertive” is very wide-ranging. A content-valid, paper-and-pencil test of
                 assertiveness would be one that adequately samples this wide range of behavior.”
   One way to quantify content validity is to have each expert rater answer the following
   question for every item (a worked sketch follows this list) - is the skill or knowledge
   measured by this item:
                 ✭ Essential
                 ✭ Useful but not essential
                 ✭ Not necessary
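These three response options match a common quantification procedure, Lawshe’s content validity
ratio (CVR), which the notes above do not name; the sketch below is therefore an assumption about
how such ratings would typically be scored. For each item, CVR = (n_e - N/2) / (N/2), where n_e
is the number of raters marking the item “Essential” and N is the total number of raters. The
rater counts used here are invented.

def content_validity_ratio(n_essential, n_raters):
    """Lawshe's CVR = (n_e - N/2) / (N/2); ranges from -1.00 to +1.00."""
    return (n_essential - n_raters / 2) / (n_raters / 2)

# Example: 8 of 10 expert raters judge an item "Essential"
print(content_validity_ratio(8, 10))   # prints 0.6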
Criterion-related Validity
Criterion-related validity is a type of validity that assesses how well one measure
predicts an outcome based on another measure (called a criterion). It shows how effectively
a test or instrument corresponds with an external criterion to which it should theoretically be
related.
There are two main types of criterion-related validity:
1. Concurrent Validity - The degree to which a new test correlates with an established
measure of the same construct when both are administered at the same time.
              “A newly developed anxiety scale is tested against the Beck Anxiety Inventory
              (BAI). If the scores on both tests are highly correlated, the new test has good
           concurrent validity.”
2. Predictive Validity - The extent to which a test predicts future performance or behavior.
       “The Graduate Record Examination (GRE) psychology subject test predicting success
       in a psychology graduate program. If high GRE scores are associated with higher
       academic performance or thesis quality, it shows predictive validity. Or: a personality
       test (e.g., the Big Five Inventory) used in recruitment predicts job performance or
       teamwork behavior six months later.”
The validity coefficient is a statistical value that shows the strength of the relationship
between a test and a criterion measure.
 It’s most commonly expressed as a correlation coefficient (r) and ranges from -1.00
                                     to +1.00.
If you develop a new stress inventory, and it correlates with a well-known measure like the
Perceived Stress Scale (PSS) at r = 0.60, that means your test has good concurrent validity
with a validity coefficient of 0.60.
    A higher coefficient indicates a stronger relationship between the test and the criterion.
    It is used to evaluate criterion-related validity (both concurrent and predictive).
    Most psychological and educational testing research accepts:
         r = .35 to .65 → Moderate to strong validity
         r ≥ .70 → Very strong (rare in social sciences)
         r ≤ .30 → Weak validity
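As a rough illustration of how such a coefficient is obtained, the sketch below correlates scores
on a hypothetical new test with scores on a criterion measure. The scores are invented, and
numpy’s corrcoef is used simply as one convenient way to compute Pearson’s r.

import numpy as np

# Invented scores for eight testtakers: a new test vs. an established criterion
# measure (real validation samples would be far larger).
new_test  = np.array([12, 18, 25, 30, 22, 15, 28, 20])
criterion = np.array([10, 20, 27, 33, 21, 14, 30, 18])

# Pearson's r between test and criterion is the validity coefficient
r = np.corrcoef(new_test, criterion)[0, 1]
print(round(r, 2))   # values near +/-1.00 indicate a strong test-criterion relationship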
Incremental validity refers to the additional value a new test or measure provides in
predicting an outcome above and beyond what existing tests already predict.
In simpler words - Does this new test tell us something extra that we didn’t already know from
other tests?
    “you're trying to predict job performance - You already use a cognitive ability test. Now
    you want to add a personality test (e.g., conscientiousness from the Big Five). If adding
    the personality test improves your prediction of job performance significantly, then it has
    incremental validity.”
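A minimal sketch of how incremental validity is usually examined: fit a prediction model with the
existing predictor alone, then add the new test and look at the gain in explained variance
(delta R-squared). The variable names and simulated data below are hypothetical.

import numpy as np

# Hypothetical simulated data: job performance predicted from a cognitive ability
# test, then from cognitive ability plus a conscientiousness scale.
rng = np.random.default_rng(0)
n = 200
cognitive = rng.normal(size=n)
conscientiousness = rng.normal(size=n)
performance = 0.5 * cognitive + 0.3 * conscientiousness + rng.normal(scale=0.8, size=n)

def r_squared(predictors, outcome):
    """Proportion of variance in the outcome explained by a least-squares fit."""
    X = np.column_stack([np.ones(len(outcome)), predictors])   # add intercept
    beta, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    residuals = outcome - X @ beta
    return 1 - residuals.var() / outcome.var()

r2_existing = r_squared(cognitive.reshape(-1, 1), performance)
r2_combined = r_squared(np.column_stack([cognitive, conscientiousness]), performance)
print(round(r2_combined - r2_existing, 3))   # delta R^2: the new test's incremental validity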
Reliability
In everyday conversation, reliability is a synonym for “dependability” or
“consistency” - if we’re lucky, we have a reliable friend who is always there for us in a
time of need.
In the language of psychometrics, reliability refers to consistency in measurement.
It is important for us, as users of tests, to know how reliable tests and other measurement
procedures are. But reliability is not an all-or-none matter. A test may be reliable in one
context and unreliable in another.
There are different types and degrees of reliability.
Test-Retest Reliability means checking how consistent a test is over time. To do
this, the same people take the same test twice, at different times. Then we see how
similar their scores are. If the scores are very similar, the test is reliable.
This method is useful for things that don’t change quickly—like personality traits. But if
the thing being measured changes a lot over time (like mood or energy level), then this
method doesn’t work well.
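In practice, the two sets of scores are simply correlated; the brief sketch below uses invented
scores from five people tested twice.

import numpy as np

# Invented scores from the same five people, tested on two occasions
time1 = np.array([40, 55, 62, 48, 70])
time2 = np.array([42, 53, 65, 50, 68])

# The test-retest reliability coefficient is the correlation between the two administrations
print(round(np.corrcoef(time1, time2)[0, 1], 2))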
Coefficient of Stability is a special name for test-retest reliability when the time
between the two tests is long—usually more than six months.
As time goes by, people change. They might learn, forget, or grow in different ways. So,
the longer the gap between the tests, the more likely their scores will be different. That’s
why the reliability usually drops with time. The changes over time create errors, making
the test less reliable.
Coefficient of Equivalence checks how similar two different versions of the same test are.
Let’s say a test has two versions (Form A and Form B), both designed to measure the
same thing. We give one version to a group of people, and then the other version to the
same group. If the scores from both versions are very close, it means both forms are
equal in quality and measure the same thing reliably.
  This kind of reliability is called alternate-forms or parallel-forms
 reliability, and the result is known as the coefficient of equivalence.
Parallel Forms of a Test
   Parallel forms are two versions of the same test that are almost exactly equal. That
   means:
   Both have the same average scores (means).
   Both have the same spread of scores (variances).
   These tests are designed to measure the same ability or trait in the same way.
    In real life, creating perfect parallel forms is hard, so we usually create alternative forms
   instead.
   Alternative Forms of a Test
   These are different versions of the same test.
   They might not be perfectly equal, but they’re made to:
   Cover the same content.
   Be similar in difficulty.
   So, while they aren't technically “parallel,” they’re close enough to compare.
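A brief sketch (with invented scores) of what is actually compared: for parallel forms we would
expect nearly equal means and variances in addition to a high correlation, while for alternative
forms the correlation between the two forms - the coefficient of equivalence - is the main
interest.

import numpy as np

# Invented scores of the same six people on Form A and Form B of a test
form_a = np.array([21, 34, 28, 40, 25, 31])
form_b = np.array([23, 33, 30, 38, 24, 32])

print(form_a.mean(), form_b.mean())            # parallel forms: means should be about equal
print(form_a.var(ddof=1), form_b.var(ddof=1))  # parallel forms: variances should be about equal
print(np.corrcoef(form_a, form_b)[0, 1])       # coefficient of equivalence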
   Two Ways to Check Test-Retest Reliability
1. Same Test, Same People, Two Times:
    You give the same test to the same group at two different times.
    The similarity between the two sets of scores shows how reliable the test is.
2. Parallel/Alternative Forms:
    You give two different versions (forms) of the test to the same group at two different
   times and compare their scores.
   Factors That Can Affect Test Scores Over Time
   Motivation: A person might try harder the first time or second time.
   Fatigue: They might be tired during one of the tests.
   Practice or Learning: Taking the test once may help them do better next time, not
   because of improvement in ability, but because of familiarity.
   Therapy or Life Events: Experiences between tests might genuinely change the person.
   Item Sampling
   Sometimes, the difference in test scores isn’t about the person’s real ability.
   It could be due to which questions were included in the test.
   If one version had easier or harder items, it could unfairly affect the results.
   Internal Consistency Estimate of Reliability
   One way to measure the reliability of a test is by checking how consistent it is within
   itself. This means we look at whether the different items or questions on the test are all
working together to measure the same thing. This type of reliability is called an internal
consistency estimate of reliability. It helps us understand if the test is balanced and
focused, or if some items are out of place. A related concept is inter-item consistency,
which refers to how closely related the individual test items are. If a test has good
internal consistency, a person who scores well on one part of the test is likely to score
well on the other parts too.
Split-Half Reliability
One method used to check internal consistency is called split-half reliability. This
method is helpful when you don’t want to or cannot give the same test more than once.
In this approach, the test is given just one time to a group of people. Then, the test is
divided into two equal halves—for example, one half may include the odd-numbered
questions and the other half the even-numbered ones. After that, the scores from each
half are compared to see how similar they are. If the two halves show similar results, it
means the test is reliable and the items are measuring the same underlying concept. This
method is especially useful when it is not practical or possible to give the test twice or to
create two different versions of it.
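A minimal sketch of the procedure, using an odd/even split of invented item scores. The half-test
correlation is then adjusted upward with the Spearman-Brown formula (standard practice for
split-half estimates, though not named in these notes) to estimate the reliability of the
full-length test.

import numpy as np

# Invented item scores: rows = six testtakers, columns = ten items
scores = np.array([
    [3, 4, 2, 5, 3, 4, 3, 5, 2, 4],
    [1, 2, 2, 1, 2, 1, 2, 2, 1, 1],
    [4, 5, 4, 5, 5, 4, 5, 5, 4, 5],
    [2, 3, 2, 3, 2, 3, 2, 3, 3, 2],
    [5, 4, 5, 5, 4, 5, 4, 5, 5, 4],
    [2, 2, 3, 2, 3, 2, 2, 3, 2, 3],
])

odd_half  = scores[:, 0::2].sum(axis=1)   # total of items 1, 3, 5, ...
even_half = scores[:, 1::2].sum(axis=1)   # total of items 2, 4, 6, ...

r_half = np.corrcoef(odd_half, even_half)[0, 1]
r_full = 2 * r_half / (1 + r_half)        # Spearman-Brown estimate for the whole test
print(round(r_half, 2), round(r_full, 2))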
Inter-Item Reliability (Consistency Between Items)
Inter-item reliability means how well all the questions (or items) on a test relate to each
other. In other words, it checks if the questions are working together to measure the same
thing. To find this out, we don’t need to give the test more than once — it can be
measured using just one version of the test given one time.
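Inter-item consistency is most often summarized with coefficient alpha (Cronbach’s alpha), a
statistic these notes do not name explicitly; the sketch below, with a small invented matrix of
item scores, shows how it can be computed from a single administration.

import numpy as np

def cronbach_alpha(item_scores):
    """item_scores: rows = testtakers, columns = items."""
    k = item_scores.shape[1]
    item_variances = item_scores.var(axis=0, ddof=1)       # variance of each item
    total_variance = item_scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Invented item scores: rows = five testtakers, columns = four items
scores = np.array([
    [3, 4, 2, 5],
    [1, 2, 2, 1],
    [4, 5, 4, 5],
    [2, 3, 2, 3],
    [5, 4, 5, 5],
])
print(round(cronbach_alpha(scores), 2))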
Homogeneity of a Test
A test is called homogeneous when all of its items or questions are focused on measuring
one single trait or topic. For example, if a test is designed to measure anxiety, and all the
questions are only about anxiety, then it is a homogeneous test. The more the items are
similar in focus, the more consistent the test is likely to be. This is because it is only
sampling from a narrow or specific content area.
Heterogeneity of a Test
On the other hand, if a test includes questions that measure different traits or abilities, it
is called heterogeneous. For example, if a test includes questions about anxiety,
depression, and self-esteem all together, it is measuring multiple things. Since the items
are more varied, they are less likely to be strongly related to each other, and the inter-
item consistency will be lower.
Relationship Between Homogeneity and Inter-Item Consistency
The more homogeneous a test is — meaning it focuses on one topic — the higher the
inter-item consistency is likely to be. This is because all the questions are closely related
and point in the same direction. But if a test is heterogeneous, measuring many things at
once, then it will naturally have lower inter-item consistency, because the questions are
about different ideas.
Interpreting a Reliability Coefficient
An important question is: "How high should the reliability score be?"
A simple answer is: "It depends on how important the decisions based on the test are."
If the test is very important (like making life-changing decisions), then it must be very
reliable. But if the test is less important, then we don’t need as high reliability.
For example, if a test helps make major decisions (like medical diagnoses), it should be
held to high standards. But if a test is just one part of many things being considered, then
its reliability doesn't need to be as high.
As a general rule, we can think of reliability like school grades:
.90s = A (very good reliability)
.80s = B (good, but below .85 is like a B-)
.65 to .70s = weak, barely acceptable
Below .65 = failing, not acceptable
__________________________________________________________________________________________________