
Psychological Assessment
© 2019 American Psychological Association
1040-3590/19/$12.00 http://dx.doi.org/10.1037/pas0000626

Constructing Validity: New Developments in Creating Objective Measuring Instruments

Lee Anna Clark and David Watson
University of Notre Dame

In this update of Clark and Watson (1995), we provide a synopsis of major points of our earlier article and discuss issues in scale construction that have become more salient as clinical and personality assessment has progressed over the past quarter-century. It remains true that the primary goal of scale development is to create valid measures of underlying constructs and that Loevinger’s theoretical scheme provides a powerful model for scale development. We still discuss practical issues to help developers maximize their measures’ construct validity, reiterating the importance of (a) clear conceptualization of target constructs, (b) an overinclusive initial item pool, (c) paying careful attention to item wording, (d) testing the item pool against closely related constructs, (e) choosing validation samples thoughtfully, and (f) emphasizing unidimensionality over internal consistency. We have added (g) consideration of the hierarchical structures of personality and psychopathology in scale development, discussion of (h) codeveloping scales in the context of these structures, (i) “orphan” and “interstitial” constructs, which do not fit neatly within these structures, (j) problems with “conglomerate” constructs, and (k) developing alternative versions of measures, including short forms, translations, informant versions, and age-based adaptations. Finally, we have expanded our discussions of (l) item-response theory and of external validity, emphasizing (m) convergent and discriminant validity, (n) incremental validity, and (o) cross-method analyses, such as questionnaires and interviews. We conclude by reaffirming that all mature sciences are built on the bedrock of sound measurement and that psychology must redouble its efforts to develop reliable and valid measures.

Public Significance Statement


Over the past 50 years, our understanding has greatly increased regarding how various psychological
problems are interrelated and how they relate to various aspects of personality. In this context, this
article describes a “best practice” process and relevant specific issues for developing measures to
assess personality and psychological problems.

Keywords: test construction, construct validity, hierarchical structures, incremental validity

Clark and Watson (1995) discussed theoretical principles, practical issues, and pragmatic decisions in the process of objective scale development to maximize the construct validity of measures. In this article, we reiterate a few points we discussed previously that continue to be underappreciated or largely ignored, but we primarily discuss additional issues that have become important as a result of advances in the field over the past two-plus decades. As before, we focus on language-mediated measures (vs., e.g., coding of direct observations), on scales with clinical relevance (i.e., those of most interest to readers of this journal), and on measures that are indeed intended to measure a construct (vs., e.g., a checklist to assess mortality risk that could be used to make valid inferences for the purpose of life insurance premiums1). Readers are encouraged to review the previous paper for points underdeveloped in this one, as we often provide here only a synopsis of the earlier material.

Author note: Lee Anna Clark and David Watson, Department of Psychology, University of Notre Dame. Lee Anna Clark is the author and copyright owner of the Schedule for Nonadaptive and Adaptive Personality, © 2014. There is no licensing fee for use of the SNAP family of measures for non-commercial, unfunded research. For all other uses, LAC negotiates a mutually acceptable licensing fee. We acknowledge the helpful comments of Jane Loevinger on the original version of this article. The study reported herein was supported by National Institute of Mental Health Grant R01-MH068472 to David Watson. Correspondence concerning this article should be addressed to Lee Anna Clark, Department of Psychology, University of Notre Dame, E390 Corbett Family Hall, Notre Dame, IN 46556. E-mail: la.clark@nd.edu

1 We thank an anonymous reviewer for pointing out this issue and providing this example.

The Centrality of Psychological Measurement

Measurement is fundamental in science, and, arguably, the two most important qualities related to measurement are reliability and validity.

Note that we say “measurement” not “measure.” Despite the thousands of times that some variant of the phrase “[measure] X has been shown to have good reliability and validity” has appeared in articles’ Method sections,2 the phrase is vacuous. Validity in particular is not a property of a measure, but pertains to interpretations of measurements. As first stated by Cronbach and Meehl (1955), “One does not validate a test, but only a principle for making inferences” (p. 297). Similarly, the fifth edition of the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, National Council on Measurement in Education, and Joint Committee on Standards for Educational and Psychological Testing [AERA, APA, & NCME], 2014) states unequivocally, “Validity refers to the degree to which evidence and theory support the interpretations of test scores for proposed uses of a test. Validity is, therefore, the most fundamental consideration in developing and evaluating tests” (p. 11). Accordingly, investigating a measure’s construct validity necessarily involves empirical tests of hypothesized relations among theory-based constructs and their observable manifestations (Cronbach & Meehl, 1955), and absent an articulated theory (i.e., “the nomological net”), there is no construct validity. Despite the implication that a series of interrelated investigations is required to understand the construct(s) that a measure assesses, scale developers often speak rather lightly of establishing a scale’s construct validity in its initial publication. Cronbach and Meehl (1955) was published 60+ years ago, and 30+ years have passed since the third edition of the Standards (APA, 1985), which firmly established construct validity as the core of measurement. Yet, there remains widespread misunderstanding regarding the overarching concept of construct validity and what establishing construct validity entails. Clearly, test developers, not to mention test users, either do not fully appreciate or willfully choose to ignore the complexity and importance of the concept.

Why Should I Care About Construct Validity?

First, construct validity is the foundation of clinical utility. That is, to the extent that real-world decisions (e.g., eligibility for social services, psycho- or pharmaco-therapy selection) are based on psychological measurements, the quality of those decisions depends on the construct validity of the measurements on which they are based. Second, practitioners increasingly are asked to justify use of specific assessment procedures to third-party payers. Use of psychological measures whose precision and efficiency are well established within an articulated theory that is well supported by multiple types of empirical data (i.e., measurements with demonstrated construct validity) may be required in the future. Third, progress in psychological science, especially as we explore more deeply the interface between psychosocial and neurobiological systems, is critically dependent on measurement validity. Detailed understanding of brain activity will be useful only insofar as we can connect it to phenotypic phenomena, so the more validly and reliably we can measure experienced affects, behaviors, and cognitions, the more we will be able to advance psychology and neuroscience.

A Theoretical Model for Scale Development

Loevinger’s (1957) monograph remains the most complete exposition of theoretically based psychological test construction. In both the previous and this article, we offer practical guidance for applying Loevinger’s theoretical approach to the process of scale development, with specific emphasis on the “three components of construct validity”: substantive, structural, and external.

Substantive Validity: Conceptualization and Development of an Initial Item Pool

Conceptualization. There is essentially no limit to the number of psychological constructs that can be operationalized as scales, and sometimes it seems that there is a scale for every human attribute (e.g., adaptability, belligerence, complexity, docility, efficiency, flexibility, grit, hardiness, imagination, . . . zest). However, not all of these represent a sufficiently important and distinct construct to justify scale development. As a thought experiment, imagine listing not only the thousands of such constructs in the English language (Allport & Odbert, 1936), but also doing the same for the approximately 7,000 existing human languages. No doubt many of the constructs these words represent are highly overlapping, and it would be absurd to argue that each one would make a significant difference in predicting real-world outcomes. This point holds true even within just the English language. Thus, an essential early step is to crystallize one’s conceptual model by writing a precise, reasonably detailed description of the target construct.

Literature review. To articulate the basic construct as clearly and thoroughly as possible, this step should be embedded in a literature review to ensure that the construct doesn’t already have one or more well-constructed measures and to describe the construct in its full theoretical and hierarchical-structural context, including its level of abstraction and how it is distinguished from near-neighbor constructs. For instance, in developing a new measure of hopelessness, the literature review would encompass not only existing measures of hopelessness, but also measures of related, broader constructs (e.g., depression and optimism-pessimism), and somewhat less immediately related constructs that might correlate with the target construct, such as various measures of negative affect (anxiety, guilt and shame, dissatisfaction, etc.) to articulate the hypothesized overlap and distinctiveness of hopelessness in relation to other negative affects. That is, conceptual models must articulate both what a construct is and what it is not. The importance of a comprehensive literature review cannot be overstated, because it enables a clear articulation of how the proposed measure(s) will be either a theoretical or an empirical improvement over existing measures or fill an important measurement gap.

Our emphasis on theory and structure is not meant to intimidate or to imply that one must have from the outset a fully articulated set of interrelated theoretical concepts and know in advance exactly how the construct will fill a gap in an established hierarchy of psychological constructs or improve measurement over existing scales. Rather, our point is that serious consideration of theoretical and structural issues prior to scale construction increases the likelihood that the resulting scale will make a substantial contribution by providing significant incremental validity over existing measures, a topic we return to in a subsequent section.

2 So as not to be completely hypocritical, we admit that we, too, use this short-hand language.

Hierarchical structure of constructs. It is now well established that psychological constructs—at least in the clinical and personality domains that are our focus—are ordered hierarchically at different levels of abstraction or breadth (see Comrey, 1988; Watson, Clark, & Harkness, 1994). In personality, for instance, one can conceive of the narrow-ish traits of “talkativeness” and “attention-seeking,” the somewhat broader concepts of “gregariousness” and “ascendance” that encompass these more specific terms, respectively, and the still more general disposition of “extraversion” that subsumes all these lower order constructs. Scales can be developed to assess constructs at each level of abstraction, so a key initial issue that is too often overlooked is the level at which a construct is expected to fit in a particular structure.

As our knowledge of the hierarchical structure of personality and psychopathology has grown, so too has the importance of considering measures as elements in these structures (vs. focusing on developing a single, isolated scale). In particular, the broader the construct (i.e., the higher it lies in the hierarchy), the more important it is to articulate its lower level components—that is, to explicate the nature of its multidimensionality. This has become important enough that in many cases, new scales should not be developed in isolation, but rather should be “co-developed” with the express intent of considering their convergent and discriminant validity as part of initial scale development (i.e., not leaving consideration of convergent and discriminant validity to the later external-validation phase). We discuss this further in a subsequent section.

Orphan, interstitial, and conglomerate constructs. Growth in our understanding of hierarchical structures has increased awareness that not all constructs fit neatly into these structures. We consider three types of such constructs. Orphan constructs are unidimensional constructs that load only weakly on any superordinate dimension, and their value is relative to their purpose: They may have significant incremental predictive power over established constructs for specific outcomes or be important in a particular area of study. For example, intrinsic religiosity is largely unrelated to the personality-trait hierarchy (Lee, Ogunfowora, & Ashton, 2005), yet it predicts various mental health outcomes across the life span (e.g., Ahmed, Fowler, & Toro, 2011). Similarly, dependency is an important clinical construct that does not load strongly on any primary personality domain (e.g., Lowe, Edmundson, & Widiger, 2009).

Interstitial constructs are both unidimensional and blends of two distinct constructs, such that factor analysis of their items yields (a) a single factor on which all scale items load and/or (b) two (or more) factors that are either orthogonal with all or almost all items loading on both (all) factors or highly correlated (i.e., per an oblique rotation). For example, Watson, Suls, and Haig (2002) showed that self-esteem is unidimensional but yet has strong loadings on both (low) negative affectivity and positive affectivity, because the measure’s items themselves inherently encompass variance from both higher order dimensions. Some interstitial constructs blend two dimensions at the same hierarchical level: In the interpersonal circumplex (IPC; Zimmermann & Wright, 2017), dominance and affiliation typically constitute the primary axes, and dimensions that fall between these define interstitial constructs (e.g., IPC arrogance and introversion-extraversion). In a perfect circumplex, the axes location is arbitrary; in reality, psychological theory and the utility of the measures’ empirical findings guide axis placement.

Finally, conglomerate constructs are intended to be “winning combinations” of two or more modestly to moderately related constructs. For example, the popular construct grit, defined as “perseverance and passion for long-term goals” (Duckworth, Peterson, Matthews, & Kelly, 2007, p. 1087), was intended to predict success in domains as varied as the National Spelling Bee and West Point. A recent meta-analysis, however, found that its perseverance facet better predicted success than the construct as a whole (Credé, Tynan, & Harms, 2017), thus challenging the theoretical basis of the construct. More generally, conglomerate constructs rarely fulfill their enticing premise that the total is greater than the sum of its parts. If development of such a construct is pursued, the burden of proof is on the developer to show that the conglomerate is superior to the linear combination of its components.

We stress here that we are not suggesting that one should seek to eliminate these types of constructs from one’s measurement model and strive for all scales to mark a single factor clearly. On the contrary, orphan and interstitial constructs—with low and cross-loadings, respectively—are particularly important for providing a full characterization and understanding of hierarchical structures of personality and psychopathology. We aim rather to alert structural researchers to the fact that this variance also should be recognized and appropriately modeled.

Broader implications of hierarchical models. The emergence of hierarchical models has led to the important recognition that scales—even highly homogeneous ones—contain multiple sources of variance that reflect different hierarchical levels. For example, a well-designed assertiveness scale contains not only construct-specific variance reflecting stable individual differences in this lower order trait, but also includes shared variance with other lower order components of extraversion (e.g., gregariousness and positive emotionality; Watson, Stasik, Ellickson-Larew, & Stanton, 2015), which reflects the higher order construct of extraversion. Thus, a lower order scale simultaneously contains both unique (lower order facet) and shared (higher order trait) components. Multivariate techniques such as multiple regression (e.g., Watson, Clark, Chmielewski, & Kotov, 2013) and bifactor analysis (e.g., Mansolf & Reise, 2016) can be used to isolate the specific influences of these different elements.

It is less widely recognized that items also simultaneously contain multiple sources of variance reflecting different levels of the hierarchy in which they are embedded. We illustrate this point using data from a large sample (N = 8,305) that includes patients, adults, postpartum women, and college students (Watson et al., 2013, Study 1). All participants completed the Inventory of Depression and Anxiety Symptoms (IDAS; Watson et al., 2007).

We focus on the IDAS item I woke up much earlier than usual. At its most specific level, this item can be viewed as an indicator of the construct of terminal insomnia/early morning awakening: It correlates strongly with another terminal-insomnia item (r = .65 with I woke up early and could not get back to sleep) and could be used to create a very narrow measure of this construct. However, this item also correlates moderately (rs = .34 to .45; see the upper portion of Table 1) with items assessing other types of insomnia and could be combined with those items to create a unidimensional measure of insomnia: We subjected these items to a confirmatory factor analysis (CFA) using PROC CALIS in SAS 9.4 (SAS Institute, Inc., 2013) to test how well a single factor modeled their intercorrelations. We used four fit indices to evaluate the model: the standardized root-mean-square residual (SRMR), root-mean-square error of approximation (RMSEA), comparative fit index (CFI), and Tucker-Lewis Index (TLI).3 A one-factor model fit these data extremely well (CFI = .996, TLI = .988, SRMR = .013, RMSEA = .048), demonstrating that this item is a valid indicator of general insomnia.

Table 1
Correlations Among Selected IDAS Items (Overall Standardized Sample)

Paraphrased item                          1     2     3
Model 1 – Sleep problems, AIC = .46
  1. Slept less than usual                —
  2. Had trouble falling asleep          .45    —
  3. Woke up earlier than usual          .36   .34    —
  4. Slept very poorly                   .54   .61   .45
Model 2 – Depression, AIC = .31
  1. Felt depressed                       —
  2. Did not have much of an appetite    .34    —
  3. Woke up earlier than usual          .26   .22    —
  4. Took a lot of effort to get going   .50   .25   .24
Model 3 – Internalizing, AIC = .29
  1. Felt dizzy or lightheaded            —
  2. Little things made me mad           .32    —
  3. Woke up earlier than usual          .24   .24    —
  4. Was difficult to make eye contact   .31   .36   .24

Note. N = 8,305. IDAS-II = expanded version of the Inventory of Depression and Anxiety Symptoms; AIC = average interitem correlation. The target item is “Woke up earlier than usual.”
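The AIC values in Table 1 are easy to reproduce. The following minimal Python sketch (our illustration, not the authors' SAS-based analysis; all variable names are ours) recovers the Model 1 value and derives standardized coefficient alpha from it:

```python
import numpy as np

# Interitem correlations for Table 1, Model 1 (sleep problems);
# item 3 is the target item "Woke up earlier than usual".
R = np.array([
    [1.00, 0.45, 0.36, 0.54],
    [0.45, 1.00, 0.34, 0.61],
    [0.36, 0.34, 1.00, 0.45],
    [0.54, 0.61, 0.45, 1.00],
])

k = R.shape[0]
off_diag = R[np.triu_indices(k, k=1)]      # the 6 unique correlations
aic = off_diag.mean()                      # average interitem correlation
alpha = k * aic / (1 + (k - 1) * aic)      # standardized coefficient alpha

print(f"AIC = {aic:.2f}")                  # -> 0.46, matching Table 1
print(f"standardized alpha = {alpha:.2f}") # -> 0.77
```

A full CFA with the fit indices reported above would require a structural equation modeling package; the point here is only the arithmetic behind the AIC values in the table.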
Further, this item correlates moderately (rs = .22 to .26; middle portion of Table 1) with other symptoms of major depression and could be used to create a unidimensional measure of this broader construct. A CFA of these items also indicated that a one-factor model fit the data well (CFI = .983, TLI = .948, SRMR = .023, RMSEA = .068). Thus, this item also is a valid indicator of depression. Finally, at an even broader level of generality, this item is moderately related (all rs ≥ .24; bottom portion of Table 1) to indicators of internalizing psychopathology and could be combined with them to create a unifactorial measure of this overarching construct. Again, a CFA of these items indicated that a one-factor model fit the data very well (CFI = .998, TLI = .995, SRMR = .007, RMSEA = .018), showing that this item is also a valid indicator of internalizing.

In theory, one could extend this analysis to establish that this item also reflects the influence of a general factor of psychopathology (Tackett et al., 2013). Consequently, structural analyses—based on different sets of indicators reflecting varying hierarchical levels—could be used to establish that the item I woke up much earlier than usual is simultaneously an indicator of (a) terminal insomnia, (b) general insomnia, (c) depression, (d) internalizing, and (e) general psychopathology. Note, moreover, that this complexity is inherent in the item itself. We could have used any number of items to illustrate this point, so these analyses strongly support the assertion that items and scales typically reflect multiple meanings and constructs, not simply a single set of inferences (Standards, p. 11). They also highlight the importance of writing and selecting items very carefully during the process of scale construction.

Creation of an item pool. The next step is item writing. No existing data-analytic technique can remedy item-pool deficiencies, so this is a crucial stage whose fundamental goal is to sample systematically all potentially relevant content to ensure the measure’s ultimate content validity. Loevinger (1957) offered the classic articulation of this principle: “the items of the pool should be chosen so as to sample all possible contents which might comprise the putative trait according to all known alternative theories of the trait” (p. 659; emphasis in original). Two key implications of this principle are that the initial pool should be broader and more comprehensive than one’s theoretical view of the target construct and that it should include content that ultimately will be eliminated. Simply put, psychometric analyses can identify items to drop but not missing content that should have been included; accordingly, one initially should be overinclusive.

In addition, the item pool must include an adequate sample of each major content area that potentially composes the construct, because undersampled areas likely will be underrepresented in the final scale. To ensure that all aspects of a construct are assessed adequately, some test developers recommend creating formal subscales, called homogeneous item composites (Hogan, 1983) or factored homogeneous item dimensions (Comrey, 1988), to assess each content area (see Watson et al., 2007, for an example). Ideally, the number of items in each content area should be proportional to that area’s importance in the target construct, but often the theoretically ideal proportions are unknown. In general, however, broader content areas should be represented by more items than narrower ones.

Many of these procedures are traditionally described as reflecting the theoretical-rational or deductive method of scale development (e.g., Burisch, 1984), but we consider them an initial step in an extensive process, not a “stand-alone” scale-development method. Loevinger (1957) emphasized that attending to content was necessary, but not sufficient; rather, empirical validation of content was critical: “If theory is fully to profit from test construction . . ., every item [on a scale] must be accounted for” (Loevinger, 1957, pp. 94–95). This obviously is an ideal to be striven for, not an absolute requirement (see also Comrey, 1988; Haynes, Richard, & Kubany, 1995).

Good scale construction is an iterative process involving several stages of item writing, each followed by conceptual and psychometric analysis that sharpen one’s understanding of the nature and structure of the target domain and may identify shortcomings in the initial item pool. For instance, factor analysis might identify subscales and also show that the initial pool contains too few items to assess one or more content domains reliably. Accordingly, new items must be written, and additional data collected and analyzed. Alternatively, analyses may suggest that the target construct’s original conceptualization is countermanded by the empirical results, requiring revision of the theoretical model, a point we develop further later.

3 Fit is generally considered acceptable if CFI and TLI are .90 or greater and SRMR and RMSEA are .10 or less (Finch & West, 1997; Hu & Bentler, 1998); and as excellent if CFI and TLI are .95 or greater and SRMR and RMSEA are .06 or less (Hu & Bentler, 1999).
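Footnote 3’s rules of thumb are simple enough to encode directly. A small illustrative helper (ours, not part of any standard package), applied to the one-factor insomnia model reported above:

```python
def rate_fit(cfi: float, tli: float, srmr: float, rmsea: float) -> str:
    """Rate model fit using the cutoffs in footnote 3
    (Finch & West, 1997; Hu & Bentler, 1998, 1999)."""
    if cfi >= .95 and tli >= .95 and srmr <= .06 and rmsea <= .06:
        return "excellent"
    if cfi >= .90 and tli >= .90 and srmr <= .10 and rmsea <= .10:
        return "acceptable"
    return "poor"

print(rate_fit(cfi=.996, tli=.988, srmr=.013, rmsea=.048))  # -> excellent
```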

Basic principles of item writing. In addition to sampling well, it is essential to write “good” items, and it is worth the time to consult the item-writing literature on how to do this (e.g., Angleitner & Wiggins, 1985; Comrey, 1988). We mention only a few basic principles here. Items should be simple, straightforward, and appropriate for the target population’s reading level. Avoid (a) expressions that may become dated quickly; (b) colloquialisms that may not be familiar across age, ethnicity, region, gender, and so forth; (c) items that virtually everyone (e.g., “Sometimes I am happier than at other times”) or no one (e.g., “I am always furious”) will endorse; and (d) complex or “double-barreled” items that assess more than one characteristic; for example, “I would never drink and drive for fear that I might be stopped by the police,” assesses both a behavior’s (non)occurrence and a putative motive. Finally, the exact phrasing of items can greatly influence the construct that is being measured. For example, the inclusion of almost any negative mood term (e.g., “I worry about . . .,” “I am upset [or bothered or troubled] by . . .”) virtually guarantees a substantial neuroticism/negative affectivity component to an item.

Choice of format.

Two dominant formats. Currently, the two dominant response formats in personality and clinical assessment are dichotomous responding (e.g., true/false; yes/no) and Likert-type rating scales with three or more options. There are several considerations in choosing between these, but surprisingly little empirical research. Recently, however, Simms, Zelazny, Williams, and Bernstein (under review) systematically examined an agree–disagree format with response options ranging from 2 to 11, evaluating the psychometric properties and convergent validity of a well-known personality trait measure in a large undergraduate sample. Their results indicated that psychometric quality (e.g., internal consistency reliability, dependability) increased up to six response options, but the number of response options had less effect on validity than expected.

Likert-type scales are used with various response formats, including frequency (e.g., never to always), degree or extent (e.g., not at all to very much), similarity (e.g., very much like me to not at all like me), and agreement (e.g., strongly agree to strongly disagree). Obviously, the nature of the response format constrains item content in an important way and vice versa (Comrey, 1988). For example, frequency formats are inappropriate if the items themselves use frequency terms (e.g., “I often lose my temper”). Whether to label all or only some response options also must be decided. Most measures label up to about six response options; beyond that, practices vary (e.g., labeling only the extremes, every other response, etc.). With an odd number of response options, the middle option’s label must be considered carefully (e.g., cannot say confounds uncertainty with a midrange rating such as neither agree nor disagree), whereas even numbers of response options force respondents to “fall on one side of the fence or the other,” which some respondents dislike. However, Simms et al. found no systematic differences between odd versus even numbers of response options. More research of this type is needed using a broad range of constructs (e.g., psychopathology, attitudes), samples (e.g., patients, community adults), types of response formats (i.e., extent, frequency), and so on.

Less common formats. Checklists have fallen out of favor because they are more prone to response biases (e.g., D. P. Green, Goldman, & Salovey, 1993), whereas visual analog scales—now easily scored via computer administration—are making a comeback, particularly in studies of medical problems using simple ratings of single mood terms or problem severity (e.g., pain, loudness of tinnitus). Simms et al. (under review) found them to be only slightly less reliable/valid than numerical rating scales. Forced-choice formats, which are largely limited to legacy measures in personality and clinical assessment, also are making a comeback in the industrial-organizational psychology literature because of advances in statistical modeling techniques that solve problems of ipsative data that previously plagued this format (e.g., Brown & Maydeu-Olivares, 2013). The advantage of this format is reduction in the effect of social desirability responding, but unfortunately we do not have the space to do justice to these developments here.

Derivative versions. Different versions of a measure may be needed for a range of purposes and include short forms, translations, age-group adaptations, and other-report forms (e.g., for parents, spouses). Adaptations require revalidation, but far too often this step is skipped or given short shrift. We have space only to raise briefly some key issues in derivations. When available, we refer readers to sources that provide more detailed guidance for their development.

Short forms. Smith, McCarthy, and Anderson (2000) provide an excellent summary of the many challenges in developing and validating short forms. They address the tendency to try to maintain a similar level of internal consistency (e.g., coefficient alpha) by narrowing the content, which leads into the classic attenuation paradox of psychometrics (Boyle, 1991; Loevinger, 1954): Increasing a test’s internal consistency beyond a certain point can reduce its validity relative to its initially intended interpretation(s). Specifically, by narrowing the scale content, the scope and nature of the assessed construct is itself changed; in particular, it increases item redundancy, thereby reducing the total amount of construct-related information the test provides.
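The attenuation paradox lends itself to a quick simulation. In the toy sketch below (ours; the variance values are arbitrary assumptions), a deliberately redundant item set earns a higher coefficient alpha than a content-diverse set, yet its total score correlates less strongly with an external criterion:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
trait = rng.normal(size=n)              # latent target construct
criterion = trait + rng.normal(size=n)  # external criterion

def make_items(k, unique_sd, redundancy_sd=0.0):
    """k items = trait + (optional shared nuisance) + unique error.
    The shared nuisance factor mimics narrowed, redundant content."""
    nuisance = redundancy_sd * rng.normal(size=n)
    return np.column_stack(
        [trait + nuisance + unique_sd * rng.normal(size=n) for _ in range(k)]
    )

def coefficient_alpha(items):
    k = items.shape[1]
    return k / (k - 1) * (1 - items.var(axis=0, ddof=1).sum()
                          / items.sum(axis=1).var(ddof=1))

broad = make_items(8, unique_sd=1.2)                      # diverse content
narrow = make_items(8, unique_sd=0.6, redundancy_sd=0.9)  # redundant content

for name, items in (("broad", broad), ("narrow", narrow)):
    validity = np.corrcoef(items.sum(axis=1), criterion)[0, 1]
    print(f"{name:6s} alpha = {coefficient_alpha(items):.2f}, "
          f"r with criterion = {validity:.2f}")
# narrow wins on alpha (~.98 vs. ~.85) but loses on validity (~.52 vs. ~.65)
```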
Broadly speaking, the central challenge in creating short forms is to maintain the level of test information while simultaneously significantly reducing scale length. Analyses based on item response theory (IRT; Reise, Ainsworth, & Haviland, 2005; Simms & Watson, 2007) can be invaluable for this purpose by providing a detailed summary of the nature of the construct-related information that each item provides, which can be used to identify a reduced set of items that yields maximal information. Thus, we strongly support the increasing use of IRT to create short forms (e.g., Carmona-Perera, Caracuel, Pérez-García, & Verdejo-García, 2015).

Developing short forms also provides an opportunity to improve a measure’s psychometric properties, particularly in the case of hierarchical instruments. For example, some of the domain scores of the widely used Revised NEO Personality Inventory (NEO PI-R; Costa & McCrae, 1992) correlate substantially with one another, which lessens their discriminant validity (e.g., Costa & McCrae, 1992, reported a −.53 correlation between Neuroticism and Conscientiousness). The measure contains six lower order facet scales to assess each domain, but in creating a short form—the NEO Five-Factor Inventory (NEO-FFI; Costa & McCrae, 1992)—the authors selected the best markers of each higher order domain, rather than sampling equally across their facets, which improved the measure’s discriminant validity. In a student sample (N = 329; Watson, Clark, & Chmielewski, 2008), the NEO-FFI Neuroticism and Conscientiousness scales correlated significantly lower (r = −.23) than did those of the full NEO PI-R version (r = −.37).

lower (r " #.23) than did those of the full NEO PI-R version sometimes feels unreal”). It also is important to note that self-
(r " #.37). informant correlations are typically modest (.20 –.30) to moderate
As new technologies have enabled new research methods, such (.40 –.60) for a wide variety of reasons (see Achenbach et al.,
as ecological momentary assessment, in which behavior is sampled 2005; De Los Reyes et al., 2015; and Connelly & Ones, 2010 for
multiple times over the course of several days or weeks, and as meta-analyses and discussions).
researchers increasingly investigate complex interplays of diverse Adaptations for different age groups. Consistency of mea-
factors influencing such outcomes as psychopathology or crime surement across developmental periods requires measure adapta-
via multilevel modeling, the demand for very short forms— one to tion and revalidation. Many such adaptations are extensions down-
three items— has increased. These approaches offer important new ward to adolescence of measures developed for adults (e.g., Linde,
perspectives on human behavior, but meaningful and generalizable Stringer, Simms, & Clark, 2013). For children, measure adaptation
results still depend on measures’ reliability and validity. Psycho- and revalidation are enormously complex because of the need for
metric research is accruing on measurement relevant to these (a) multisource assessment— child self-report except in very
methods. For example, Soto and John (2017) reported that their young children, multiple informant reports (e.g., parent, teacher)
This article is intended solely for the personal use of the individual user and is not to be disseminated broadly.

15-item extrashort form (of the 60-item Big Five Inventory-2), that typically show low-to-moderate agreement (De Los Reyes et
This document is copyrighted by the American Psychological Association or one of its allied publishers.

which has one item per facet, should be used only to derive domain al., 2015), and behavioral observation—and (b) to address devel-
and not facet scores, whereas the 30-item short form could be used opmental changes in both the nature of the construct itself, and
to assess facets. Similarly, McCrae (2015) cautioned that, on children’s levels on the construct of interest (e.g., Putnam et al.,
average, only one third of single items’ variance reflected the 2008). For the elderly, adaptations may or may not be needed to
target construct for broad traits (e.g., neuroticism); moreover, account for normal age-related changes. For example, if the re-
items may capture valid subfacet or “nuance” variance, but little is search question is how does life satisfaction change across the life
known about the construct validity of nuances compared to do- span, then adjustments for age-related change in physical health
mains and facets. The crucial point is that it remains incumbent on would confound the results. In contrast, when assessing psycho-
the researcher to demonstrate the validity of the inferences drawn pathology, normal-range age-related decline in physical ability
from ultrashort form measures. needs to be considered lest it be wrongly interpreted as a psycho-
Translations. Translations into Western European languages logical symptom. See Achenbach, Ivanova, and Rescorla (2017)
may be the “worst offenders” in terms of ignoring revalidation, in for a research program on multicultural, multiinformant assess-
that articles often do not even indicate that study measures are ment of psychopathology across the life span.
translations other than by implication (e.g., stating that the data
were collected in Germany). There is quite a large literature on
translation validation to which we cannot do justice here, so we Structural Validity: Item Selection and
simply refer readers to a chapter that explicitly discusses the key Psychometric Evaluation
issues (Geisinger, 2003) and two examples of strong translation
development and validation processes (Schwartz et al., 2014; Test-construction strategies. Choosing a test-construction or
Watson, Clark, & Tellegen, 1984). This issue is sufficiently im- item-selection strategy should match the scale development goal
portant that the Test Standards explicitly state that those who both and the target construct(s)’ theoretical conceptualization. Loev-
develop and use translations/adaptations are responsible for pro- inger (1957) described three main conceptual models: (a) quanti-
viding evidence of their construct validity (AERA, APA, & NCME, tative (dimensional) models that differentiate individuals with re-
2014, Standard 3.12). spect to degree or level of the target construct, (b) class models that
Informant versions. There are various reasons for collecting seek to categorize individuals into qualitatively different groups,
information from individuals other than, or in addition to, target and (c) more complex dynamic models. However, since then,
individuals: difficulty or inability in responding for themselves (e.g., including since the publication of our previous article, consider-
individuals with dementia, McDade-Montez, Watson, O’Hara, & able research has confirmed that dimensional models fit the vast
Denburg, 2008; or children, Putnam, Rothbart, & Gartstein, 2008), majority of data best (Haslam, Holland, & Kuppens, 2012; Markon,
concerns about valid responding (e.g., psychopathology, Achenbach, Chmielewski, & Miller, 2011a, 2011b), so anyone considering either
Krukowski, Dumenci, & Ivanova, 2005), and simply to provide a class or a dynamic model should have a very well-established
another perspective on the person’s behavior (e.g., Funder, 2012). theoretical reason to pursue it; consequently, we do not discuss them
Preparing informant versions is not simply a matter of changing further in this article.
“I” in self-report items to “He” or “She” in informant versions, Loevinger (1957) championed the concept of structural validity
although this works for some items, particularly those with high (see also Messick, 1995)—that a scale’s internal structure (i.e.,
behavioral visibility (e.g., “I/He/She get(s) into more fights than interitem correlations) should parallel the external structure of the
most people”). Depending on the purpose of the informant assess- target trait (i.e., correlations among nontest manifestations of the
ment, items that reflect one’s self-view (e.g., “I haven’t made trait), and that items should reflect the underlying (latent) trait
much of my life”) may need to be phrased so that informants report variance. These concerns parallel the three main item-selection
on the target’s self-view (e.g., “She thinks she hasn’t made much strategies for dimensional constructs: empirical (primarily con-
of her life”) or so that they report their own perspective about the cerned with nontest manifestations), internal consistency (con-
person (e.g., “He hasn’t made much of his life”). Similarly, infor- cerned with interitem structure), and item response theory (focused
mants can report on items that refer to internal experience (e.g., “I on latent traits). These methods are not mutually exclusive and
sometimes feel unreal”) only if the person has talked about those typically should be used in conjunction with one another: struc-
experiences, so such items must reflect this fact (“She says that she tural validity encompasses all three.

Criterion-based methods. Beginning with Meehl’s (1945) “empirical manifesto,” empirically keyed test construction became the dominant scale-construction method. However, because of major difficulties in cross-validation and generalization, plus the method’s inability to advance psychological theory, its popularity waned. The field readily embraced Cronbach and Meehl’s (1955) notion of construct validity, although a number of years passed before the widespread availability of computers facilitated a broad switch to internal consistency methods of scale construction. Today, attention to empirical correlations with nontest criteria has largely shifted to the external validation phase of test development, although there is no reason to avoid examining these relations early in scale development. One common strategy is to administer the initial item pool to both community and clinical samples, and to consider differences in items’ mean levels across the two groups as one criterion, among several, for item selection.

Internal-consistency methods. Currently, the single most widely used method for item selection is some form of internal-consistency analysis. When developing a single scale, corrected item-total correlations are used frequently to eliminate items that do not correlate strongly with the assessed construct, but factor analysis is essential when the target construct is conceptualized as part of a hierarchical structure or when multiple constructs are being developed simultaneously. We typically use exploratory factor analysis (EFA) to identify the underlying latent dimension (for a single unidimensional scale) or dimensions (for a higher order scale with lower order facets). We then use the identified dimension(s) as the basis for scale creation (e.g., Clark & Watson, 1995; Simms & Watson, 2007).
Consistent with general guidelines in the broader factor analytic literature (e.g., Comrey, 1988; Fabrigar, Wegener, MacCallum, & Strahan, 1999; Russell, 2002), we recommend using principal factor analysis (PFA vs. principal components analysis; PCA) as the initial extraction method (but see Cortina, 1993, for arguments favoring PCA). It also is helpful to examine both orthogonal and oblique rotations (Watson, 2012). For most purposes, we recommend eliminating items (a) with primary loadings below .35 to .40 (for broader scales; below .45 to .50 for narrower scales) and (b) that have similar or stronger loadings on other factors, although these guidelines may need to be relaxed in some circumstances (e.g., clinical measures in which it may be important to include low base-rate items). Resulting scales can then be refined using CFA.
Factor analysis is a tool that can be used wisely or foolishly. Fortunately, the nature of factor analysis is such that blind adherence to a few simple rules typically will lead to developing a decent (though not likely optimal) scale. Even with factor analysis, there is no substitute for good theory and careful thought. For example, as noted earlier, internal consistency and breadth are countervailing, so simply retaining the strongest loading items may not yield a scale that best represents the target construct. That is, if the top-loading items are highly redundant with one another, including them all will increase internal consistency estimates but also may create an overly narrow scale that does not assess the construct optimally. This represents another illustration of the attenuation paradox we discussed previously.
Similarly, if items that reflect the theoretical core of the construct do not correlate strongly with it or with each other in preliminary analyses, it is not wise simply to eliminate them without considering why they did not behave as expected: Is the theory inadequate? Are the items poorly worded? Is the sample nonrepresentative in some important way? Are the items’ base rates too extreme? And are there too few items representing the core construct?

Item response theory (IRT). IRT (Reise et al., 2005; Reise & Waller, 2009; Simms & Watson, 2007) increasingly is being used in scale development; as noted earlier, it plays a particularly important role in short-form creation. IRT is based on the assumption that item responses reflect levels of an underlying construct and, moreover, that each item’s response-trait relation can be described by a monotonically increasing function called an item characteristic curve (ICC). Individuals with higher levels of the trait have greater expected probabilities for answering the item in a keyed direction (e.g., highly extraverted individuals are more likely to endorse an item about frequent partying). ICCs provide the precise values (within a standard error range) of these probabilities across the entire range of trait levels.

In using IRT, the emphasis is on identifying the specific items that are maximally informative for each individual, given his or her level of the underlying dimension. For instance, a challenging algebra problem may provide useful information for respondents with a high level of mathematical ability (who may or may not be able to get it correct), but it is uninformative when given to individuals with little mathematical facility, because we know in advance that they almost surely will get it wrong. From an IRT perspective, the optimal item is one for which a respondent has a 50% probability of answering correctly or endorsing in the keyed direction, because this provides the greatest increment in trait-relevant information for that person.

Within the IRT literature, a model with parameters for item difficulty and item discrimination is used most frequently (Reise & Waller, 2009; Simms & Watson, 2007). Item difficulty is the point along the underlying continuum at which an item has a 50% probability of being answered correctly (or endorsed in the keyed direction) across all respondents. Items with high (vs. low) difficulty values reflect higher (vs. lower) trait levels and have low (vs. high) correct-response/endorsement probabilities. Discrimination reflects the degree of psychometric precision, or information, that an item provides across difficulty levels.
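To make these ideas concrete, the two-parameter logistic (2PL) model just described can be written out in a few lines; the functions are ours, but the formulas are the standard ones:

```python
import numpy as np

def icc_2pl(theta, a, b):
    """2PL item characteristic curve: probability of a keyed response
    given trait level theta, discrimination a, and difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item: a**2 * P * (1 - P);
    it peaks where the keyed-response probability is 50%, i.e., theta == b."""
    p = icc_2pl(theta, a, b)
    return a ** 2 * p * (1.0 - p)

theta = np.linspace(-3, 3, 7)
print(icc_2pl(theta, a=1.5, b=0.0))           # P = .50 exactly at theta == b
print(item_information(theta, a=1.5, b=0.0))  # information is maximal there
```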

IRT offers two important advantages over other item-selection strategies. First, it enables specification of the trait level at which each item is maximally informative. This information can then be used to identify a set of items that yield precise, reliable, and valid assessment across the entire range of the trait. Thus, IRT methods offer an enhanced ability to discriminate among individuals at the extremes of trait distributions (e.g., both among those very high and very low in extraversion). Second, IRT methods allow estimation of each individual’s trait level without administering a fixed set of items. This flexibility permits the development of computer-adaptive tests (CATs) in which assessment focuses primarily on the subset of items that are maximally informative for each respondent (e.g., difficult items for quantitatively gifted individuals, easier items for those low in mathematical ability). CATs are extremely efficient and provide roughly equivalent trait-relevant information using fewer items than conventional measures (typical item reductions are ≥50%; Reise & Waller, 2009; Rudick, Yam, & Simms, 2013).
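Schematically, the core CAT step reduces to a one-line selection rule. The sketch below is ours; it reuses item_information() from the 2PL sketch above and omits the trait-estimate update (e.g., by maximum likelihood) that a real CAT engine would perform after each response:

```python
def next_item(theta_hat, item_params, administered):
    """Pick the unadministered item with maximal information at theta_hat.

    item_params: list of (a, b) tuples; administered: set of item indices.
    """
    candidates = (
        (item_information(theta_hat, a, b), j)
        for j, (a, b) in enumerate(item_params)
        if j not in administered
    )
    return max(candidates)[1]

# e.g., an easy item (b = -2.0) is chosen for a low provisional estimate:
params = [(1.5, -2.0), (1.5, 0.0), (1.5, 2.0)]
print(next_item(theta_hat=-1.8, item_params=params, administered=set()))  # -> 0
```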
As a scale development technique, IRT’s main limitation is that it requires a good working knowledge of the basic underlying trait(s) to be modeled (i.e., that one’s measurement model be reasonably well established). Consequently, IRT methods are most useful in domains in which the basic constructs are well understood and less helpful when the underlying structure is unclear. Thus, EFA remains the method of choice during the early, investigative stages of assessment within a domain. Once the basic factors/scales/constructs within the domain have been established, they can be refined further using analytic approaches such as CFA and IRT.
It also is more challenging to develop multiple scales simultaneously in IRT than using techniques such as EFA. Consequently, IRT-based scales frequently are developed in isolation from one another, with insufficient attention paid to discriminant validity. For example, Pilkonis et al. (2011) describe the development of brief, IRT-based anxiety and depression scales. Although these scales display some exemplary qualities and compare favorably in many ways to more traditional indicators of these constructs, the depression and anxiety trait scores correlated .81 with one another (Pilkonis et al., 2011). Similarly, they correlated .79 in a sample of 448 Mechanical Turk (MTurk) workers (Watson, Stanton, & Clark, 2017). This example illustrates the importance of developing scales simultaneously—using techniques such as EFA—to maximize measures’ discriminant validity.
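This kind of discriminant-validity audit is easy to script once provisional scale scores exist. A hedged pandas sketch, with hypothetical function and column names of our own invention:

```python
import pandas as pd

def conv_disc_summary(scores: pd.DataFrame, target: str,
                      convergent: list[str]) -> tuple[float, float]:
    """Compare a target scale's convergent vs. discriminant correlations.

    scores: DataFrame with one column of scale scores per scale;
    target: the new scale; convergent: scales it *should* track.
    """
    r = scores.corr()[target].drop(target)
    conv = r[convergent].mean()
    disc = r.drop(convergent).abs().mean()
    return conv, disc

# e.g., with hypothetical columns "new_dep", "anchor_dep", "anchor_anx":
# conv, disc = conv_disc_summary(df, "new_dep", ["anchor_dep"])
# A correlation of .81 with an anxiety scale, as in the example above,
# would flag a discriminant-validity problem.
```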
Initial data collection. In this section, “initial” does not include pilot testing, which can be helpful to conduct on moderately sized samples of convenience (e.g., 100–200 college students or an MTurk sample) to test item formats, ensure that links to online surveys work, and so on.

Sample considerations. Basic item-content decisions that will shape the scale’s empirical and conceptual development are made after the first full round of data collection. Therefore, it is very important to use at least one, and preferably two or three, large, reasonably heterogeneous sample(s). Based on evidence regarding the stability and replicability of structural analyses (Guadagnoli & Velicer, 1988; MacCallum, Widaman, Zhang, & Hong, 1999), we recommend a minimum of 300 respondents per (sub)sample, including students, community adults (e.g., MTurk), and, ideally, a sample specifically representing the target population. If that is not feasible (which is not uncommon for financial reasons), then at least one sample should include individuals who share a key attribute with the target population. For example, if the measure is to be used in a clinical setting, then consider using an analog sample (e.g., college students or MTurk workers) who score above a given cutpoint on a psychopathology screening scale.

As we discuss below, the reason that it is critical to obtain such data early on is that the target construct and its items may have rather different properties in such samples compared to college-student or community-adult samples. If this is not discovered until late in the development process, the scale’s utility may be seriously compromised.

Inclusion of comparison (anchor) scales. In the initial round of data collection, it is common practice to administer the preliminary item pool without additional items or scales. This practice is regrettable, however, because it does not permit examination of the target construct’s boundaries, which is critical to understanding the construct from both theoretical and empirical viewpoints. Just as the initial literature review permits identification of existing scales and concepts that may help establish the measure’s convergent and discriminant validity, marker scales assessing these other constructs should be included in the initial data collection to begin to test these hypotheses. Too often, test developers belatedly discover that their new scale correlates too strongly with an established measure or, worse, with one of a theoretically distinct construct.

Scale codevelopment. As our knowledge of the hierarchical structure of personality and psychopathology has grown, paying attention to where a construct fits within a particular structure has become important enough that in most cases, new scales should not be developed in isolation, but rather “co-developed” with the express intent of considering their convergent and discriminant validity as part of the initial scale-development process. Note that single scales with subscales are special examples of hierarchical structures, so the various principles we discuss here are relevant to such cases as well; we specifically address subscales subsequently.

When codeveloping scales for constructs conceptualized as fitting within a hierarchical structure, there are several issues to consider initially, the most important of which are (a) whether the primary focus of scale development is on the lower order constructs, the higher order constructs, or both and (b) how extensive a portion of the hierarchical structure the scale-development project targets. For example, in developing the SNAP (Clark, Simms, Wu, & Casillas, 2014), the focus was on lower order scales relevant to personality pathology, which, accordingly, were developed without regard to how they loaded on higher order dimensions. In contrast, in developing the Faceted Inventory of the Five-Factor Model (FI-FFM; Watson, Nus, & Wu, 2017), the focus was equally on the higher and lower order dimensions, so scales that either cross-loaded on two or more higher order dimensions or did not load strongly on any higher order dimension were not included in the final instrument.4

4 Note that the former and latter approaches facilitate and eschew, respectively, including orphan and interstitial constructs, which we discussed earlier.

In developing the Personality Inventory for the Diagnostic and Statistical Manual of Mental Disorders (DSM), fifth edition (PID-5; Krueger, Derringer, Markon, Watson, & Skodol, 2011), the facets of the five hypothesized domains were initially created individually. Then, exploratory factor analyses (EFA) were run on all items within each domain, extracting factors up to one more than the number of each domain’s hypothesized facets, and selecting the EFA with the best fit to the data for each domain. In several cases, several facets were collapsed into one scale because their respective items formed a single factor. In other cases, items were moved from one facet to another. Fewer than half of the facets “survived” these analyses more or less intact; in the most extreme case, five facets merged into one.
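Deciding how many factors to extract (here, up to one more than the number of hypothesized facets) is itself a judgment call. The PID-5 team compared the fit of competing EFA solutions; a common complementary aid, sketched below in plain numpy, is Horn's parallel analysis (our illustration, not the PID-5 procedure):

```python
import numpy as np

def parallel_analysis(data, n_sims=100, seed=0):
    """Suggest a factor count: retain leading eigenvalues of the observed
    correlation matrix that exceed the mean eigenvalues obtained from
    same-sized random-normal data (Horn's parallel analysis)."""
    rng = np.random.default_rng(seed)
    n, k = data.shape
    obs = np.linalg.eigvalsh(np.corrcoef(data, rowvar=False))[::-1]
    sim = np.mean(
        [np.linalg.eigvalsh(np.corrcoef(rng.normal(size=(n, k)),
                                        rowvar=False))[::-1]
         for _ in range(n_sims)],
        axis=0,
    )
    exceed = obs > sim
    return int(exceed.argmin()) if not exceed.all() else k

# usage: n_factors = parallel_analysis(item_matrix)  # item_matrix: (n, k)
```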
However, even this level of attention proved to be insufficient. A scale-level EFA with oblique rotation found significant cross-factor loadings for 11 (44%) of the PID-5’s 25 facets, and subsequent studies have established that the measure would have benefitted from additional examination of discriminant validity across domains. Crego, Gore, Rojas, and Widiger (2015) provided a brief review of such studies and reported that in their own data, five PID-5 facets (20%) had higher average discriminant (cross-domain) than convergent (within-domain) correlations. In the Improving the Measurement of Personality Project (IMPP), with a mixed sample of 305 outpatients and 302 community adults screened to be at high risk for personality pathology, we replicated those results for four of these five facets (Clark, 2018).
screened to be at high risk for personality pathology, we replicated those results for four of these five facets (Clark, 2018). Of course, it may be that, although conceptualized initially as facets of a particular domain, these traits actually are interstitial and thus should have high cross-loadings. Our point here is not to set the bar for scale development prohibitively high, but to draw attention to important issues, such as careful attention to expected and observed convergent and discriminant validity, so as to maximize the outcomes of scale-development projects and increase the likelihood that new scales will make important contributions to the psychological literature.

Subscales. A special case of codevelopment is that of hierarchically multidimensional scales, commonly called measures with subscales. These measures are unidimensional at their higher order level, with correlated subfactors at their lower order level. Factor analysis of the items comprising a hierarchically multidimensional construct yields a strong general factor, with all or almost all items loading strongly on this factor; nonetheless, when additional factors are extracted, clear—yet correlated—lower order dimensions emerge. For example, two subscales constitute the SNAP-2 Self-Harm scale—Suicide Potential and Low Self-esteem (Clark et al., 2014)—the former composed of item content directly concerned with suicide and self-harming thoughts, feelings, and behaviors, and the latter composed of items expressing self-derogation versus self-satisfaction. In the IMPP sample, all 16 items loaded quite strongly on the first general factor (M loading = .52, range = .41–.66), yet when two varimax-rotated factors were extracted, the two subscales' items loaded cleanly on the two factors, with mean loadings of .57 and .17 on their primary and nonprimary factors, respectively (Clark, 2018); unit-weighted subscales constructed from the two subsets of items correlated .48. Any hierarchically dimensional scale (i.e., any measure with subscales) should have similar properties.

Psychometric evaluation: An iterative process. We return here to an earlier point: Good scale construction is an iterative process involving an initial cycle of preliminary measure development, data collection, and psychometric evaluation, followed by at least one additional cycle of revision of both measure and construct, data collection, psychometric evaluation, revision, and so forth. The most often neglected aspect of this process is revision of the target construct's conceptualization. Too often, scale developers assume that their initial conceptualization is entirely correct, considering only the measure as open to revision. However, it is critical to remain open to rethinking one's initial construct—to "listen to the data," not "make the data talk."

Often this involves only slight tweaking, but it may involve more fundamental reconceptualization. For example, the Multidimensional Personality Questionnaire (Tellegen & Waller, 2008) originally had a single, bipolar, trait-affect scale, but ended with nearly orthogonal negative and positive emotionality scales. The SNAP-2 Self-Harm scale provides a converse example. Initially, two scales, Low Self-esteem (originally Self-derogation) and Suicide Proneness, were developed independently. However, across repeated rounds of data collection in diverse samples, the scales correlated highly and yielded a single dimension with two lower order facets, so they were combined to form a Self-Harm scale with two subscales. This necessitated reconceptualization of the combined item set as a single construct with two strongly correlated item-content subsets. Research on self-injurers provided an initial basis for this reconceptualization: Glenn, Michel, Franklin, Hooley, and Nock (2014) found that the relation between self-injury and experimentally tested pain analgesia was mediated by self-criticalness and hypothesized that "the tendency to experience self-critical thoughts in response to stressful events . . . increases the likelihood of both self-injury and pain analgesia" (p. 921). A full reconceptualization lies in the future, but the convergence of findings from independently conducted self-injury research and psychometric scale evaluation is evidence supporting both.

Analysis of item distributions. Before conducting more complex structural analyses, scale developers should examine individual items' response distributions. Two considerations are paramount. First, it is important to consider eliminating items that have highly skewed and unbalanced distributions. In a true/false format, these are items that virtually everyone (e.g., 95% or more) either endorses or denies; with a Likert-rating format, these are items to which almost all respondents respond similarly (e.g., "slightly agree"). Highly unbalanced items are undesirable for several reasons: (a) When most respondents answer similarly, items convey very little information, except perhaps at extremes of the trait distribution; (b) relatedly, items with limited variability are likely to correlate weakly with other items and therefore will fare poorly in structural analyses; and (c) items with extremely unbalanced distributions can produce highly unstable correlational results (see Clark & Watson, 1995, for an example from Comrey, 1988).

Importantly, only items with unbalanced distributions across diverse samples representing the full range of the scale's target population should be eliminated. As mentioned earlier, many items show very different response distributions across clinical and nonclinical samples. For instance, the item "I have things in my possession that I can't explain how I got" likely would be endorsed by very few undergraduates, whereas in an appropriate patient sample, it may have a much higher endorsement rate and prove useful in assessing clinically significant levels of dissociative pathology. Thus, it may be desirable to retain items that assess important construct-relevant information in a sample more like the target population, even if they have extremely unbalanced distributions and, therefore, relatively poor psychometric properties in others.
of the target construct’s conceptualization. Too often, scale devel- The second consideration is that it is desirable to retain items
opers assume that their initial conceptualization is entirely correct, with a broad range of distributions. In the case of true/false and
considering only the measure as open to revision. However, it is Likert-type items, respectively, this means keeping items with
critical to remain open to rethinking one’s initial construct—to widely varying endorsement percentages and means (in IRT
“listen to the data” not “make the data talk.” terms, items with widely varying difficulty parameters), be-
Often this involves only slight tweaking, but it may involve cause most constructs represent continuously distributed dimen-
more fundamental reconceptualization. For example, the Multidi- sions, such that scores can occur across the entire dimension.
mensional Personality Questionnaire (Tellegen & Waller, 2008) Thus, it is important to retain items that discriminate at many
originally had a single, bipolar, trait-affect scale, but ended with different points along the continuum (e.g., at mild, moderate,
nearly orthogonal negative and positive emotionality scales. The and extreme levels). Returning to the earlier example, “I have
SNAP-2 Self-Harm scale provides a converse example. Initially things in my possession that I can’t explain how I got” may be
two scales, Low Self-esteem (originally Self-derogation) and Sui- useful precisely because it serves to define the extreme upper-
cide Proneness were developed independently. However, across end of the dissociative continuum (i.e., those who suffer from
repeated rounds of data collection in diverse samples, the scales dissociative identity disorder).
correlated highly and yielded a single dimension with two lower As noted earlier, a key advantage of IRT (Reise et al., 2005;
order facets, so they were combined to form a Self-Harm scale, Reise & Waller, 2009) is that it yields parameter estimates that
with two subscales. This necessitated reconceptualization of the specify the point along a continuum at which a given item is
combined item set as single construct with two strongly correlated maximally informative. These estimates can be used to choose an
item-content subsets. Research on self-injurers provided an initial efficient set of items that yield precise assessment across the entire
range of the continuum, which naturally leads to retaining items with widely varying distributions.
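To make this concrete, consider the two-parameter logistic (2PL) model, under which item information is I(θ) = a²P(θ)[1 − P(θ)] and peaks at θ = b; the sketch below uses purely illustrative parameter values:

```python
# A small worked sketch of the IRT point made above: under the 2PL model,
# information I(theta) = a^2 * P(theta) * (1 - P(theta)), which peaks at
# theta = b, so items with different difficulties (b) measure best at
# different trait levels.
import numpy as np

def p_2pl(theta, a, b):
    """Probability of endorsement under the 2PL model."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def information(theta, a, b):
    p = p_2pl(theta, a, b)
    return a**2 * p * (1 - p)

theta = np.linspace(-4, 4, 801)
# An "easy" item (b = -1.5) and a severe item like the dissociation example
# (b = +2.5): each is maximally informative at a different trait level.
for a, b in [(1.5, -1.5), (1.5, 2.5)]:
    peak = theta[np.argmax(information(theta, a, b))]
    print(f"a={a}, b={b}: information peaks near theta = {peak:.2f}")
```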
Unidimensionality, internal consistency, and coefficient alpha. The next stage is to determine which items to eliminate or retain in the item pool via structural analyses. This is most critical when seeking to create a theoretically based measure of a target construct, so that the goal is to measure one thing (i.e., the target construct)—and only this thing—as precisely as possible. This goal may seem relatively straightforward, but it remains poorly understood by test developers and users. The most obvious problem is the widespread misapprehension that this goal can be attained simply by demonstrating an "acceptable" level of internal consistency reliability, typically as estimated by coefficient alpha (Cronbach, 1951). A further complication is that recommendations regarding .80 as an "acceptable" level of internal consistency for basic research (e.g., Nunnally, 1978; Streiner, 2003) are widely ignored, such that characterizations of coefficient alphas in the .60s and .70s as "good" or "adequate" are far too common.

More fundamentally, psychometricians long have disavowed using reliability indexes to establish scales' homogeneity (see Boyle, 1991; Cortina, 1993). To understand why this is so, we must distinguish between internal consistency and homogeneity or unidimensionality. "Internal consistency" refers to the overall degree to which a scale's items are intercorrelated, whereas "homogeneity" and "unidimensionality" indicate whether or not the scale items assess a single underlying factor or construct (Briggs & Cheek, 1986; Cortina, 1993). Thus, internal consistency is a necessary but not sufficient condition for homogeneity or unidimensionality. In other words, a scale cannot be homogeneous unless all of its items are interrelated. Because theory-driven assessment seeks to measure a single construct systematically, the test developer ultimately is pursuing the goal of homogeneity or unidimensionality, not internal consistency per se.

Unfortunately, KR-20 and coefficient alpha are measures of internal consistency, not homogeneity, and so are of limited utility in establishing the unidimensionality of a scale. Furthermore, they are ambiguous and imperfect indicators even of internal consistency, because they essentially are a function of two parameters: scale length and the average interitem correlation (AIC; Cortina, 1993; Cronbach, 1951). Thus, one can achieve a high internal consistency reliability estimate with many moderately correlated items, a small number of highly intercorrelated items, or various combinations of scale length and AIC. Whereas the AIC is a straightforward indicator of internal consistency, scale length is entirely irrelevant to it. In fact, with a large number of items, it is exceedingly difficult to avoid having a high reliability estimate, so coefficient alpha is virtually useless for scales containing 40 or more items (Cortina, 1993).
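For standardized items, this two-parameter dependence can be written explicitly:

```latex
% Standardized coefficient alpha as a function of scale length (k items)
% and the average interitem correlation \bar{r} (the AIC):
\alpha_{\mathrm{std}} = \frac{k\,\bar{r}}{1 + (k - 1)\,\bar{r}}
% Holding \bar{r} = .30 fixed: k = 3 items give \alpha \approx .56,
% whereas k = 40 items give \alpha \approx .94; alpha rises mechanically
% with length, whatever the scale's dimensionality.
```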
Accordingly, the AIC is a much more useful index than coefficient alpha, and test developers should work toward a target AIC, rather than a particular level of alpha. As a more specific guideline, we recommend that the AIC fall in the range of .15 to .50 (see Briggs & Cheek, 1986), with the scale's optimal value determined by the generality versus specificity of the target construct. For a broad higher order construct such as extraversion, a mean correlation as low as .15–.20 may be desirable; by contrast, for a valid measure of a narrower construct such as talkativeness, a much higher mean intercorrelation (e.g., in the .40 to .50 range) is needed.

As suggested earlier, however, even the AIC cannot alone establish the unidimensionality of a scale; in fact, a multidimensional scale actually can have an "acceptable" AIC: Cortina (1993, Table 2) artificially constructed an 18-item scale composed of two distinct nine-item groups. The items within each cluster had an AIC of .50. However, the two clusters were uncorrelated. Obviously, the full scale was not unidimensional, instead reflecting two completely independent dimensions; nevertheless, it had a coefficient alpha of .85 and a moderate AIC (.24).

This example clearly illustrates that one can achieve a seemingly satisfactory AIC by averaging many higher coefficients with many lower ones. Thus, unidimensionality cannot be ensured simply by focusing on the average interitem correlation; rather, it is necessary to examine the range and distribution of these correlations as well. Consequently, we must amend our earlier guideline to state that not only the AIC, but virtually all of the individual interitem correlations, also should fall somewhere in the .15 to .50 range to ensure unidimensionality. Ideally, almost all of the interitem correlations would be moderate in magnitude and cluster narrowly around the mean value. Green (1978) articulated this principle most eloquently, stating that to assess a broad construct, the item intercorrelation matrix should appear as "a calm but insistent sea of small, highly similar correlations" (pp. 665–666).
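Cortina's demonstration can be reconstructed in a few lines; the sketch below builds the implied correlation matrix and recovers both summary values, along with the telltale spread of the interitem correlations:

```python
# A runnable reconstruction of Cortina's (1993) demonstration as summarized
# above: two uncorrelated 9-item clusters with r = .50 within each cluster.
import numpy as np

k = 18
R = np.zeros((k, k))
R[:9, :9] = R[9:, 9:] = 0.50      # within-cluster correlations
np.fill_diagonal(R, 1.0)

off_diag = R[np.triu_indices(k, 1)]
aic = off_diag.mean()
alpha = (k * aic) / (1 + (k - 1) * aic)   # standardized alpha

print(f"AIC = {aic:.2f}, alpha = {alpha:.2f}")   # AIC = 0.24, alpha = 0.85
# The range of interitem correlations (0 to .50) gives the game away:
# they cluster at two values rather than forming Green's "calm sea."
print(f"min r = {off_diag.min():.2f}, max r = {off_diag.max():.2f}")
```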
Cross-validation. If our previous recommendations have been followed, this section is hardly necessary, because we advise testing the item pool and resultant scales in multiple samples from essentially the beginning of the process. Cross-validation has been made much easier by the existence of crowdsourcing platforms such as MTurk, and it appears that most scale-development articles published these days in top-line journals, such as Psychological Assessment, include a cross-validation sample. However, we easily were able to identify some that did not, even among articles new enough to be in the "Online First" publication stage.

Loevinger split the structural and external stages at the point wherein the focus moves from items to total scores. We follow her lead and shift now to the external phase of development.

External Validity: An Ongoing Process

We have emphasized the iterative process of scale development. Phrased in its most extreme form, scale development ends only when a measure is "retired" because, owing to increased knowledge, it is better to develop a new measure of a revised construct than to modify an existing measure. That said, when it has been established that measures have strong psychometric properties relative to their respective target constructs, evaluation shifts to focusing on their placement in their immediate and broader nomological net (Cronbach & Meehl, 1955). As our focus is primarily on scale development, we cover only a few important aspects of this stage. We refer readers to Smith and McCarthy (1995), who describe the later "refinement" stages of scale development in some detail.

First, however, it is important to note that the quality of the initial scale-development stages has clear ramifications for external validity. If the concept is clearly conceptualized and delineated, its "rival" constructs and target criteria will also be clearer. If the original item pool included a widely relevant range of content, the scale's range of clinical utility will be more clearly defined. If the measure was constructed with a focus on
unidimensionality (vs. internal consistency), the scale will identify a more homogeneous clinical group, rather than a more heterogeneous one requiring further demarcation. Finally, if convergent and discriminant validity have been considered from the outset, it will be far easier to delineate the construct boundaries and achieve the important goal of knowing exactly what the scale measures and what it does not.

Convergent and discriminant validity. According to the Test Standards, "Relationships between test scores and other measures intended to assess the same or similar constructs provide convergent evidence, whereas relationships between measures purportedly of different constructs provide discriminant evidence" (AERA, APA, & NCME, 2014, pp. 16–17). Inclusion of "similar constructs" in this definition creates an unfortunate gray area in which it is unclear whether a construct is similar enough to provide convergent—as opposed to discriminant—evidence. It is clearer simply to state that convergent validity is assessed by examining relations among purported indicators of the same construct. Using Campbell and Fiske's (1959) terminology, convergent evidence is established by examining monotrait correlations.

Campbell and Fiske (1959) argue that convergent correlations "should be significantly different from zero and sufficiently large to encourage further examination of validity" (p. 82). Unfortunately, what "sufficiently large" means in this context cannot be answered simply, because the expected magnitude of these correlations will vary dramatically as a function of various design features. The single most important factor is the nature of the different methods that are used to examine convergent validity. In their original formulation, Campbell and Fiske (1959) largely assumed that investigators would examine convergence across fundamentally different methods. For example, in one analysis (their Table 2), they examined the associations between trait scores assessed using (a) peer ratings versus (b) a word association task.

Over time, investigators began to interpret the concept of "method" much more loosely, for example, offering correlations among different self-report measures of the same target construct to establish convergent validity (e.g., Watson et al., 2017, 2007). This practice is not problematic, but obviously is very different from what Campbell and Fiske (1959) envisioned. Most notably, convergent correlations will be—and should be—substantially higher when they are computed within the same basic method (e.g., between different self-report measures of neuroticism) than when they are calculated across very different methods (e.g., between self- vs. informant-rated neuroticism). This, in turn, means that the same level of convergence might support construct validity in one context, but challenge it in another. For instance, it would be difficult to argue that a .45 correlation between two self-report measures of self-esteem reflects adequate convergent validity, but the same correlation between self- and parent-ratings of self-esteem might do so.

Discriminant validity involves examining how a measure relates to purported indicators of other constructs (i.e., heterotrait correlations). Discriminant validity is particularly important in establishing that highly correlated constructs within hierarchical models are, in fact, empirically distinct from one another (see, e.g., Watson & Clark, 1992; Watson et al., 2017). Indeed, the most interesting tests of discriminant validity involve near-neighbor constructs that are known to be strongly related to one another (Watson, 2012).

Campbell and Fiske (1959) state that discriminant validity is established by demonstrating that convergent correlations are higher than discriminant coefficients. For instance, self-rated self-esteem should correlate more strongly with peer-rated self-esteem than with peer-rated extraversion. One complication, however, is that the meaning of the word "higher" is ambiguous in this context. Many researchers interpret it rather loosely to mean simply that the convergent correlation must be descriptively higher than the discriminant correlations to which it is compared. For instance, if the convergent correlation is .50, and the highest discriminant correlation is only .45, then it is assumed that this requirement is met.

It is better to use the more stringent requirement that the convergent correlation should be significantly higher than the discriminant coefficients to which it is compared, which obviously is more difficult to meet. Perhaps most importantly, it also requires relatively large sample sizes (typically, at least 200 observations) to have sufficient statistical power to conduct these tests in a meaningful way. Nevertheless, the payoff is well worth it in terms of the greater precision of the validity analyses. For instance, Watson et al. (2008) examined the convergent and discriminant validity of the 11 nonoverlapping IDAS scales in a sample of 605 outpatients. The convergent correlations ranged from .52 to .71, with a mean value of .62. Significance tests further revealed that these convergent correlations exceeded the discriminant coefficients in 219 of 220 comparisons (99.5%). These results thereby provide substantial evidence of discriminant validity.
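One common option for such significance tests is Williams's t test for two dependent correlations that share a variable (described in Steiger's 1980 review of tests for comparing correlations). A hedged sketch, with purely illustrative values:

```python
# Williams's t test, one standard way to run the "more stringent" comparison
# described above: a convergent r(self X, peer X) versus a discriminant
# r(self X, peer Y), which are dependent because they share variable j.
from math import sqrt
from scipy import stats

def williams_t(r_jk, r_jh, r_kh, n):
    """Test r_jk vs. r_jh, where j is the shared variable; r_kh links k and h."""
    detR = 1 - r_jk**2 - r_jh**2 - r_kh**2 + 2 * r_jk * r_jh * r_kh
    rbar = (r_jk + r_jh) / 2
    t = (r_jk - r_jh) * sqrt(
        (n - 1) * (1 + r_kh)
        / (2 * ((n - 1) / (n - 3)) * detR + rbar**2 * (1 - r_kh) ** 3)
    )
    return t, n - 3

# Convergent r = .50 vs. discriminant r = .45 in n = 200 (r_kh assumed .40):
# descriptively different, but here t is only about 0.8 and p is far from
# significant, showing why the stringent criterion is harder to meet.
t, df = williams_t(0.50, 0.45, 0.40, 200)
p = 2 * stats.t.sf(abs(t), df)
print(f"t({df}) = {t:.2f}, p = {p:.3f}")
```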
“method” much more loosely, for example, offering correlations (1978) called a “risky test” (p. 818), one that provides strong
among different self-report measures of the same target construct support for the construct if it passes. It is not uncommon for
to establish convergent validity (e.g., Watson et al., 2017, 2007). developers of a measure of some aspect of psychopathology to
This practice is not problematic, but obviously is very different claim criterion validity based on finding significant differences
from what Campbell and Fiske (1959) envisioned. Most notably, between scores on the measure in target-patient and nonpatient
convergent correlations will be—and should be—substantially samples. This is not a risky test; for example, it would be far better
higher when they are computed within the same basic method to show that the measure differentiated a particular target-patient
(e.g., between different self-report measures of neuroticism) than group from other types of patients.
when they are calculated across very different methods (e.g., Incremental validity. Criterion validity also involves the re-
between self- vs. informant-rated neuroticism). This, in turn, means lated concept of incremental validity. Incremental validity is es-
that the same level of convergence might support construct validity in tablished by demonstrating that a measure adds significantly to the
one context, but challenge it in another. For instance, it would be prediction of a criterion over and above what can be predicted by
difficult to argue that a .45 correlation between two self-report mea- other sources of data (Hunsley & Meyer, 2003). Ensuring that a
sures of self-esteem reflects adequate convergent validity, but the scale is sufficiently distinct from well-established constructs that it
same correlation between self- and parent-ratings of self-esteem might has significant incremental validity for predicting important exter-
do so. nal variables is a crucial issue that often is given too little consid-
Discriminant validity involves examining how a measure relates eration.
to purported indicators of other constructs (i.e., heterotrait corre- Three interrelated issues are important when considering incre-
lations). Discriminant validity is particularly important in estab- mental validity. They also are related to discriminant validity, so
lishing that highly correlated constructs within hierarchical models we discuss them together. The first issue is what variables to use
are, in fact, empirically distinct from one another (see, e.g., Wat- to test incremental validity—most notably, what are its competing
son & Clark, 1992; Watson et al., 2017). Indeed, the most inter- predictors, and also what criterion is being predicted. The second
esting tests of discriminant validity involve near-neighbor con- is intertwined with where a measure fits within an established
structs that are known to be strongly related to one another hierarchical structure; a full understanding of a new measure’s
(Watson, 2012). incremental validity requires comparison with other measures at
the same level of abstraction. The third is how much incremental validity is enough, which affects interpretation of findings and the new measure's value to the field. When considering incremental and discriminant validity, it again is important to put the construct to a "risky test" (Meehl, 1978). For incremental validity, this means adding significant predictive power to a known, strong correlate of the criterion, whereas for discriminant validity, this means showing independence from variables that one might expect would be correlated with the new measure. The biggest challenge, therefore, is to demonstrate discriminant and incremental validity for a new measure of an established construct.

We again use grit (Duckworth et al., 2007) to illustrate. In both its seminal article and that introducing its short form (Duckworth & Quinn, 2009), grit's developers acknowledged that conscientiousness (C) was highly correlated with grit (r ≈ .70–.75). Conscientiousness was thus a strongly competing predictor, raising concerns about both incremental and discriminant validity; nonetheless, grit showed incremental validity over C in two studies of relevant criteria. However, although Duckworth and Quinn (2009) acknowledged that grit might not outpredict C's facets, they did not examine this question empirically. Subsequently, MacCann and Roberts (2010) reported that grit had no incremental validity over eight C facets for predicting a number of relevant variables. Moreover, meta-analytic results (Credé et al., 2017) indicated that grit had insufficient discriminant validity and "was simply a different manifestation of conscientiousness" (p. 12). These results reinforce the importance of considering which level of an established hierarchy provides the "riskier" test of incremental validity; such tests also serve to determine where a new construct fits best within that hierarchy.
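In regression terms, such a test amounts to comparing nested models: does adding the new scale to the competitor block significantly increase R²? A schematic sketch (all data and effect sizes below are simulated placeholders, not results from the grit literature; in practice the competitor block would contain, e.g., all eight C facet scores):

```python
# A sketch of a "risky" incremental-validity test: does a new scale add to
# prediction once a strong, established competitor is already in the model?
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 400
competitor = rng.normal(size=n)                           # established trait scale
new_scale = 0.8 * competitor + 0.6 * rng.normal(size=n)   # highly correlated rival
criterion = 0.5 * competitor + rng.normal(size=n)

base = sm.OLS(criterion, sm.add_constant(competitor)).fit()
full = sm.OLS(criterion,
              sm.add_constant(np.column_stack([competitor, new_scale]))).fit()

# F test of the R-squared increment from adding the new scale.
f, p, df_diff = full.compare_f_test(base)
print(f"delta R2 = {full.rsquared - base.rsquared:.3f}, "
      f"F({int(df_diff)}, {int(full.df_resid)}) = {f:.2f}, p = {p:.3f}")
```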
Watson et al. (2017) provide a good recent example of "risky" tests of incremental validity involving the FI-FFM neuroticism facet scales. The authors first presented convergent and discriminant validity analyses to establish that some scales assessed the same trait dimensions as scales in the NEO Personality Inventory-3 (NEO-PI-3; McCrae, Costa, & Martin, 2005), whereas others did not. They then reported incremental-validity analyses wherein they showed that the two FI-FFM scales assessing novel traits—Somatic Complaints and Envy—added significantly to the prediction of various criteria (e.g., anxiety and depressive disorder diagnoses, measures of health anxiety and hypochondriasis) over and above the NEO-PI-3 facets.

Finally, the issue of whether a construct has sufficient incremental validity for predicting relevant constructs has no simple answer. Rather, it depends on the purpose and scope of prediction and ultimately reduces to a type of cost-benefit analysis. For example, in epidemiological studies, relatively weak predictors with small but significant incremental validity (e.g., 2–3%) may lead to public-health-policy recommendations that, if followed, could save thousands of lives. Thus, in medicine, statistics such as "number needed to treat" (NNT; an estimate of the number of additional people who would need to be treated to affect one of them) have been developed. But whether 10 or 100 is a small enough number to warrant treating more people depends on the cost of the treatment, the amount of harm caused by not treating, and the degree of benefit to those treated successfully. For a low-cost treatment with great harm for not treating and great benefit for successful treatment (e.g., a cheap vaccine for a usually fatal illness), a very large NNT might still be small enough. Conversely, for a very expensive treatment with modest harm for not treating and modest benefit for successful treatment (e.g., an experimental treatment that only slightly prolongs life), a very small NNT would be more appropriate. Use of such statistics in psychological research would help clarify the evaluation of new measures' incremental validity.
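The arithmetic behind the NNT is simple; as a hedged illustration with hypothetical rates:

```latex
% NNT is the reciprocal of the absolute risk reduction (ARR). For a
% hypothetical treatment that lowers the rate of a bad outcome from
% 20\% (untreated) to 15\% (treated):
\mathrm{ARR} = 0.20 - 0.15 = 0.05, \qquad
\mathrm{NNT} = \frac{1}{\mathrm{ARR}} = \frac{1}{0.05} = 20
% That is, roughly 20 people must be treated to prevent one additional
% bad outcome.
```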
Cross-method analyses. The utility of obtaining information from multiple sources is increasingly being recognized, and research related to the issues that arise when diverse sources provide discrepant information also is growing, but the topic still remains relatively understudied. Earlier, we discussed informant reports, and we do not have much additional space to devote to the broader topic of cross-method analyses, so we simply bring a few issues to readers' attention.

Self-report questionnaires versus interviews. Self-ratings often are denigrated unfairly as "just self-report," whereas, for many psychological phenomena, self-report is the best—or even the only—appropriate method. For example, no one knows how a person is feeling or how much something hurts other than that person, but inherent in this strength are self-report's greatest limitations: Because N always = 1 for information regarding an individual's internal sensations, no other method is available to verify such self-reports, even though (a) individuals may not report on their internal thoughts or feelings accurately, either because they choose not to do so (e.g., if honest reporting might lead to an adverse decision) or because they cannot (e.g., poor insight, memory lapses, dementia); (b) individuals may interpret and use rating scales differently (e.g., "usually" may represent different subjective frequencies across individuals; Schmidt, Le, & Ilies, 2003); and (c) individual items may be interpreted differently depending on one's level of the trait. For example, the item "I'm often not as cautious as I should be" was intended as an indicator of impulsivity, but was endorsed positively more often by respondents lower on the trait (i.e., more cautious individuals!; Clark et al., 2014). Nonetheless, despite these limitations, self-reported information is remarkably reliable and supports valid inference most of the time.

Interviews often are considered superior to self-report because (a) they involve "expert judgment" and (b) follow-up questions permit clarification of responses. However, for the most part, they are based on self-report and thus reflect the strengths and limitations of both self-report and interviewing. Perhaps the most serious limitation of interviews is that interviewers always filter interviewees' responses through their own perspective and, as the clinical-judgment literature has shown repeatedly (e.g., Dawes, Faust, & Meehl, 2002), this typically lowers reliability and predictive validity relative to an empirically established method, a problem that decreases with the degree of structure in the interview. For example, the DSM–III field trials reported unstructured-interview-based interrater and retest reliabilities for personality disorder of .61 and .54, respectively (Spitzer, Forman, & Nee, 1979), whereas a later review based on semistructured interviews reported these values to be .79 and .62 (Zimmerman, 1994). Convergence with self-report also drops when unstructured rather than semistructured interviews are used; for example, Clark, Livesley, and Morey (1997) reported mean correlations of .25 versus .40 and mean kappas of .08 versus .27 for these comparisons. Nonetheless, demonstrating good convergent/discriminant validity between self-report and interview measures of constructs provides support for both methods (e.g., Dornbach-Bender et al., 2017; Watson et al., 2008).

Symptom scales versus diagnoses. There often is a parallelism of symptoms versus diagnoses with self-report versus interview that is important to consider in its own right. However, we focus here only on interview-based assessment of both. Because they are more fully dimensional than dichotomous diagnoses, symptom scales are both more reliable and more valid (Markon et al., 2011a, 2011b). However, diagnoses carry a greater degree of clinical "respectability," which we believe is unwarranted but is a reality nonetheless. One important issue is the use of "skip-outs": not assessing a symptom set if a core criterion is not met. To the extent possible, we recommend against using skip-outs, because there often is important information in symptoms even when a core criterion is not met (Dornbach-Bender et al., 2017; Kotov, Perlman, Gámez, & Watson, 2015), with some clear exceptions, such as trauma-related symptoms. Demonstrating that similar inferences can be made from symptom measures and diagnoses increases the credibility of the symptom measures, so we recommend comparing them when feasible, all the while recognizing that the obtained correlations are simply indicators of convergent validity between two measures of the same phenomenon, not a comparison of a proxy against a gold standard.
Conclusion

We concluded Clark and Watson (1995) by noting that both the target of measurement and measurement of the target are important for optimal scale development, and that later stages will proceed more smoothly if the earlier stages have both theoretical clarity and empirical precision. These points are still important today, but we now also encourage scale developers to consider the broader context of their target construct, in particular, the reasonably well-established hierarchical structures of personality and, to a lesser but growing extent, psychopathology.

As a result of the expansion of knowledge in our field, it is increasingly important to attend to measures' external validity, particularly convergent and discriminant validity and incremental validity over well-established measures, as well as to use multitrait, multimethod, multioccasion frameworks for evaluating new measures. Perhaps we can summarize the direction in which scale development is moving by stating that in the fields of personality and psychopathology, the nomological net is no longer just an abstract ideal to which we need only pay lip service, but a practical reality that deserves our full and careful consideration.
References

Achenbach, T. M., Ivanova, M. Y., & Rescorla, L. A. (2017). Empirically based assessment and taxonomy of psychopathology for ages 1½–90+ years: Developmental, multi-informant, and multicultural findings. Comprehensive Psychiatry, 79, 4–18. http://dx.doi.org/10.1016/j.comppsych.2017.03.006

Achenbach, T. M., Krukowski, R. A., Dumenci, L., & Ivanova, M. Y. (2005). Assessment of adult psychopathology: Meta-analyses and implications of cross-informant correlations. Psychological Bulletin, 131, 361–382. http://dx.doi.org/10.1037/0033-2909.131.3.361

Ahmed, S. R., Fowler, P. J., & Toro, P. A. (2011). Family, public and private religiousness and psychological well-being over time in at-risk adolescents. Mental Health, Religion & Culture, 14, 393–408. http://dx.doi.org/10.1080/13674671003762685

Allport, G. W., & Odbert, H. S. (1936). Trait-names: A psycho-lexical study. Psychological Monographs, 47, i–171. http://dx.doi.org/10.1037/h0093360

American Educational Research Association, American Psychological Association, National Council on Measurement in Education, and Joint Committee on Standards for Educational and Psychological Testing (AERA, APA, & NCME). (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

American Psychological Association. (1985). Standards for educational and psychological testing. Washington, DC: Author.

Angleitner, A., & Wiggins, J. S. (1985). Personality assessment via questionnaires. New York, NY: Springer-Verlag.

Boyle, G. J. (1991). Does item homogeneity indicate internal consistency or item redundancy in psychometric scales? Personality and Individual Differences, 12, 291–294. http://dx.doi.org/10.1016/0191-8869(91)90115-R

Briggs, S. R., & Cheek, J. M. (1986). The role of factor analysis in the development and evaluation of personality scales. Journal of Personality, 54, 106–148. http://dx.doi.org/10.1111/j.1467-6494.1986.tb00391.x

Brown, A., & Maydeu-Olivares, A. (2013). How IRT can solve problems of ipsative data in forced-choice questionnaires. Psychological Methods, 18, 36–52. http://dx.doi.org/10.1037/a0030641

Burisch, M. (1984). Approaches to personality inventory construction: A comparison of merits. American Psychologist, 39, 214–227. http://dx.doi.org/10.1037/0003-066X.39.3.214

Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81–105. http://dx.doi.org/10.1037/h0046016

Carmona-Perera, M., Caracuel, A., Pérez-García, M., & Verdejo-García, A. (2015). Brief moral decision-making questionnaire: A Rasch-derived short form of the Greene dilemmas. Psychological Assessment, 27, 424–432. http://dx.doi.org/10.1037/pas0000049

Clark, L. A. (2018). The Improving the Measurement of Personality Project (IMPP). Unpublished dataset. Notre Dame, IN: University of Notre Dame.

Clark, L. A., Livesley, W. J., & Morey, L. (1997). Personality disorder assessment: The challenge of construct validity. Journal of Personality Disorders, 11, 205–231. http://dx.doi.org/10.1521/pedi.1997.11.3.205

Clark, L. A., Simms, L. J., Wu, K. D., & Casillas, A. (2014). Schedule for Nonadaptive and Adaptive Personality—2nd Edition (SNAP-2): Manual for administration, scoring, and interpretation. Notre Dame, IN: University of Notre Dame.

Clark, L. A., & Watson, D. (1995). Constructing validity: Basic issues in objective scale development. Psychological Assessment, 7, 309–319. http://dx.doi.org/10.1037/1040-3590.7.3.309

Comrey, A. L. (1988). Factor-analytic methods of scale development in personality and clinical psychology. Journal of Consulting and Clinical Psychology, 56, 754–761. http://dx.doi.org/10.1037/0022-006X.56.5.754

Connelly, B. S., & Ones, D. S. (2010). An other perspective on personality: Meta-analytic integration of observers' accuracy and predictive validity. Psychological Bulletin, 136, 1092–1122. http://dx.doi.org/10.1037/a0021212

Cortina, J. M. (1993). What is coefficient alpha? An examination of theory and applications. Journal of Applied Psychology, 78, 98–104. http://dx.doi.org/10.1037/0021-9010.78.1.98

Costa, P. T., & McCrae, R. R. (1992). Revised NEO Personality Inventory (NEO-PI-R) and NEO Five-Factor Inventory (NEO-FFI) professional manual. Odessa, FL: Psychological Assessment Resources.

Credé, M., Tynan, M. C., & Harms, P. D. (2017). Much ado about grit: A meta-analytic synthesis of the grit literature. Journal of Personality and Social Psychology, 113, 492–511. http://dx.doi.org/10.1037/pspp0000102

Crego, C., Gore, W. L., Rojas, S. L., & Widiger, T. A. (2015). The discriminant (and convergent) validity of the Personality Inventory for DSM–5. Personality Disorders: Theory, Research, and Treatment, 6, 321–335. http://dx.doi.org/10.1037/per0000118

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297–334. http://dx.doi.org/10.1007/BF02310555

Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281–302. http://dx.doi.org/10.1037/h0040957

Dawes, R. M., Faust, D., & Meehl, P. E. (2002). Clinical versus actuarial judgment. In T. Gilovich, D. Griffin, & D. Kahneman (Eds.), Heuristics and biases: The psychology of intuitive judgment (pp. 716–729). New York, NY: Cambridge University Press. http://dx.doi.org/10.1017/CBO9780511808098.042

De Los Reyes, A., Augenstein, T. M., Wang, M., Thomas, S. A., Drabick, D. A. G., Burgers, D. E., & Rabinowitz, J. (2015). The validity of the multi-informant approach to assessing child and adolescent mental health. Psychological Bulletin, 141, 858–900. http://dx.doi.org/10.1037/a0038498

Dornbach-Bender, A., Ruggero, C. J., Waszczuk, M. A., Gamez, W., Watson, D., & Kotov, R. (2017). Mapping emotional disorders at the finest level: Convergent validity and joint structure based on alternative measures. Comprehensive Psychiatry, 79, 31–39. http://dx.doi.org/10.1016/j.comppsych.2017.06.011

Duckworth, A. L., Peterson, C., Matthews, M. D., & Kelly, D. R. (2007). Grit: Perseverance and passion for long-term goals. Journal of Personality and Social Psychology, 92, 1087–1101. http://dx.doi.org/10.1037/0022-3514.92.6.1087

Duckworth, A. L., & Quinn, P. D. (2009). Development and validation of the Short Grit Scale (Grit-S). Journal of Personality Assessment, 91, 166–174. http://dx.doi.org/10.1080/00223890802634290

Fabrigar, L. R., Wegener, D. T., MacCallum, R. C., & Strahan, E. J. (1999). Evaluating the use of exploratory factor analysis in psychological research. Psychological Methods, 4, 272–299. http://dx.doi.org/10.1037/1082-989X.4.3.272

Finch, J. F., & West, S. G. (1997). The investigation of personality structure: Statistical models. Journal of Research in Personality, 31, 439–485. http://dx.doi.org/10.1006/jrpe.1997.2194

Funder, D. C. (2012). Accurate personality judgment. Current Directions in Psychological Science, 21, 177–182. http://dx.doi.org/10.1177/0963721412445309

Geisinger, K. F. (2003). Testing and assessment in cross-cultural psychology. In J. R. Graham & J. A. Naglieri (Eds.), Handbook of psychology: Vol. 10. Assessment psychology (pp. 95–117). Hoboken, NJ: Wiley. http://dx.doi.org/10.1002/0471264385.wei1005

Glenn, J. J., Michel, B. D., Franklin, J. C., Hooley, J. M., & Nock, M. K. (2014). Pain analgesia among adolescent self-injurers. Psychiatry Research, 220, 921–926. http://dx.doi.org/10.1016/j.psychres.2014.08.016

Green, B. F., Jr. (1978). In defense of measurement. American Psychologist, 33, 664–670. http://dx.doi.org/10.1037/0003-066X.33.7.664

Green, D. P., Goldman, S. L., & Salovey, P. (1993). Measurement error masks bipolarity in affect ratings. Journal of Personality and Social Psychology, 64, 1029–1041. http://dx.doi.org/10.1037/0022-3514.64.6.1029

Guadagnoli, E., & Velicer, W. F. (1988). Relation of sample size to the stability of component patterns. Psychological Bulletin, 103, 265–275. http://dx.doi.org/10.1037/0033-2909.103.2.265

Haslam, N., Holland, E., & Kuppens, P. (2012). Categories versus dimensions in personality and psychopathology: A quantitative review of taxometric research. Psychological Medicine, 42, 903–920. http://dx.doi.org/10.1017/S0033291711001966

Haynes, S. N., Richard, D. C. S., & Kubany, E. S. (1995). Content validity in psychological assessment: A functional approach to concepts and methods. Psychological Assessment, 7, 238–247. http://dx.doi.org/10.1037/1040-3590.7.3.238

Hogan, R. T. (1983). A socioanalytic theory of personality. In M. Page (Ed.), 1982 Nebraska Symposium on Motivation (pp. 55–89). Lincoln, NE: University of Nebraska Press.

Hu, L., & Bentler, P. M. (1998). Fit indices in covariance structure modeling: Sensitivity to underparameterized model misspecification. Psychological Methods, 3, 424–453. http://dx.doi.org/10.1037/1082-989X.3.4.424

Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6, 1–55. http://dx.doi.org/10.1080/10705519909540118

Hunsley, J., & Meyer, G. J. (2003). The incremental validity of psychological testing and assessment: Conceptual, methodological, and statistical issues. Psychological Assessment, 15, 446–455. http://dx.doi.org/10.1037/1040-3590.15.4.446

Kotov, R., Perlman, G., Gámez, W., & Watson, D. (2015). The structure and short-term stability of the emotional disorders: A dimensional approach. Psychological Medicine, 45, 1687–1698. http://dx.doi.org/10.1017/S0033291714002815

Krueger, R. F., Eaton, N. R., Derringer, J., Markon, K. E., Watson, D., & Skodol, A. E. (2011). Personality in DSM–5: Helping delineate personality disorder content and framing the metastructure. Journal of Personality Assessment, 93, 325–331. http://dx.doi.org/10.1080/00223891.2011.577478

Lee, K., Ogunfowora, B., & Ashton, M. C. (2005). Personality traits beyond the Big Five: Are they within the HEXACO space? Journal of Personality, 73, 1437–1463. http://dx.doi.org/10.1111/j.1467-6494.2005.00354.x

Linde, J. A., Stringer, D. M., Simms, L. J., & Clark, L. A. (2013). The Schedule for Nonadaptive and Adaptive Personality for Youth (SNAP-Y): A new measure for assessing adolescent personality and personality pathology. Assessment, 20, 387–404. http://dx.doi.org/10.1177/1073191113489847

Loevinger, J. (1954). The attenuation paradox in test theory. Psychological Bulletin, 51, 493–504. http://dx.doi.org/10.1037/h0058543

Loevinger, J. (1957). Objective tests as instruments of psychological theory. Psychological Reports, 3, 635–694.

Lowe, J. R., Edmundson, M., & Widiger, T. A. (2009). Assessment of dependency, agreeableness, and their relationship. Psychological Assessment, 21, 543–553. http://dx.doi.org/10.1037/a0016899

MacCallum, R. C., Widaman, K. F., Zhang, S., & Hong, S. (1999). Sample size in factor analysis. Psychological Methods, 4, 84–99. http://dx.doi.org/10.1037/1082-989X.4.1.84

MacCann, C., & Roberts, R. D. (2010). Do time management, grit, and self-control relate to academic achievement independently of conscientiousness? In R. Hicks (Ed.), Personality and individual differences: Current directions (pp. 79–90). Queensland, Australia: Australian Academic Press.

Mansolf, M., & Reise, S. P. (2016). Exploratory bifactor analysis: The Schmid-Leiman orthogonalization and Jennrich-Bentler analytic rotations. Multivariate Behavioral Research, 51, 698–717. http://dx.doi.org/10.1080/00273171.2016.1215898

Markon, K. E., Chmielewski, M., & Miller, C. J. (2011a). The reliability and validity of discrete and continuous measures of psychopathology: A quantitative review. Psychological Bulletin, 137, 856–879. http://dx.doi.org/10.1037/a0023678

Markon, K. E., Chmielewski, M., & Miller, C. J. (2011b). "The reliability and validity of discrete and continuous measures of psychopathology: A quantitative review": Correction to Markon et al. (2011). Psychological Bulletin, 137, 1093. http://dx.doi.org/10.1037/a0025727

McCrae, R. R. (2015). A more nuanced view of reliability: Specificity in the trait hierarchy. Personality and Social Psychology Review, 19, 97–112. http://dx.doi.org/10.1177/1088868314541857

McCrae, R. R., Costa, P. T., Jr., & Martin, T. A. (2005). The NEO-PI-3: A more readable Revised NEO Personality Inventory. Journal of Personality Assessment, 84, 261–270. http://dx.doi.org/10.1207/s15327752jpa8403_05

McDade-Montez, E. A., Watson, D., O'Hara, M. W., & Denburg, N. L. (2008). The effect of symptom visibility on informant reporting. Psychology and Aging, 23, 940–946. http://dx.doi.org/10.1037/a0014297

Meehl, P. E. (1945). The dynamics of "structured" personality tests. Journal of Clinical Psychology, 1, 296–303.

Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806–834. http://dx.doi.org/10.1037/0022-006X.46.4.806

Messick, S. (1995). Standards of validity and the validity of standards in performance assessment. Educational Measurement: Issues and Practice, 14, 5–8. http://dx.doi.org/10.1111/j.1745-3992.1995.tb00881.x

Nunnally, J. C. (1978). Psychometric theory (2nd ed.). New York, NY: McGraw-Hill.

Pilkonis, P. A., Choi, S. W., Reise, S. P., Stover, A. M., Riley, W. T., & Cella, D. (2011). Item banks for measuring emotional distress from the Patient-Reported Outcomes Measurement Information System (PROMIS®): Depression, anxiety, and anger. Assessment, 18, 263–283. http://dx.doi.org/10.1177/1073191111411667

Putnam, S. P., Rothbart, M. K., & Gartstein, M. A. (2008). Homotypic and heterotypic continuity of fine-grained temperament during infancy, toddlerhood, and early childhood. Infant and Child Development, 17, 387–405. http://dx.doi.org/10.1002/icd.582

Reise, S. P., Ainsworth, A. T., & Haviland, M. G. (2005). Item response theory: Fundamentals, applications, and promise in psychological research. Current Directions in Psychological Science, 14, 95–101. http://dx.doi.org/10.1111/j.0963-7214.2005.00342.x

Reise, S. P., & Waller, N. G. (2009). Item response theory and clinical measurement. Annual Review of Clinical Psychology, 5, 27–48. http://dx.doi.org/10.1146/annurev.clinpsy.032408.153553

Rudick, M. M., Yam, W. H., & Simms, L. J. (2013). Comparing countdown- and IRT-based approaches to computerized adaptive personality testing. Psychological Assessment, 25, 769–779. http://dx.doi.org/10.1037/a0032541

Russell, D. W. (2002). In search of underlying dimensions: The use (and abuse) of factor analysis in Personality and Social Psychology Bulletin. Personality and Social Psychology Bulletin, 28, 1629–1646. http://dx.doi.org/10.1177/014616702237645

SAS Institute, Inc. (2013). SAS/STAT software: Version 9.4. Cary, NC: SAS Institute.

Schmidt, F. L., Le, H., & Ilies, R. (2003). Beyond alpha: An empirical examination of the effects of different sources of measurement error on reliability estimates for measures of individual-differences constructs. Psychological Methods, 8, 206–224. http://dx.doi.org/10.1037/1082-989X.8.2.206

Schwartz, S. J., Benet-Martínez, V., Knight, G. P., Unger, J. B., Zamboanga, B. L., Des Rosiers, S. E., . . . Szapocznik, J. (2014). Effects of language of assessment on the measurement of acculturation: Measurement equivalence and cultural frame switching. Psychological Assessment, 26, 100–114. http://dx.doi.org/10.1037/a0034717

Simms, L. J., & Watson, D. (2007). The construct validation approach to personality scale construction. In R. W. Robins, R. C. Fraley, & R. F. Krueger (Eds.), Handbook of research methods in personality psychology (pp. 240–258). New York, NY: Guilford Press.

Simms, L. J., Zelazny, K., Williams, T., & Bernstein, L. (under review). Does the number of response options matter? Psychometric perspectives using personality questionnaire data. Unpublished manuscript, University at Buffalo, Buffalo, NY.

Smith, G. T., & McCarthy, D. M. (1995). Methodological considerations in the refinement of clinical assessment instruments. Psychological Assessment, 7, 300–308. http://dx.doi.org/10.1037/1040-3590.7.3.300

Smith, G. T., McCarthy, D. M., & Anderson, K. G. (2000). On the sins of short-form development. Psychological Assessment, 12, 102–111. http://dx.doi.org/10.1037/1040-3590.12.1.102

Soto, C. J., & John, O. P. (2017). Short and extra-short forms of the Big Five Inventory–2: The BFI-2-S and BFI-2-XS. Journal of Research in Personality, 68, 69–81. http://dx.doi.org/10.1016/j.jrp.2017.02.004

Spitzer, R. L., Forman, J. B., & Nee, J. (1979). DSM–III field trials: I. Initial interrater diagnostic reliability. The American Journal of Psychiatry, 136, 815–817. http://dx.doi.org/10.1176/ajp.136.6.815

Streiner, D. L. (2003). Starting at the beginning: An introduction to coefficient alpha and internal consistency. Journal of Personality Assessment, 80, 99–103. http://dx.doi.org/10.1207/S15327752JPA8001_18

Tackett, J. L., Lahey, B. B., van Hulle, C., Waldman, I., Krueger, R. F., & Rathouz, P. J. (2013). Common genetic influences on negative emotionality and a general psychopathology factor in childhood and adolescence. Journal of Abnormal Psychology, 122, 1142–1153. http://dx.doi.org/10.1037/a0034151

Tellegen, A., & Waller, N. G. (2008). Exploring personality through test construction: Development of the Multidimensional Personality Questionnaire. In G. J. Boyle, G. Matthews, & D. H. Saklofske (Eds.), The SAGE handbook of personality theory and assessment: Vol. 2. Personality measurement and testing (pp. 261–292). Thousand Oaks, CA: Sage. http://dx.doi.org/10.4135/9781849200479.n1

Watson, D. (2012). Objective tests as instruments of psychological theory and research. In H. Cooper (Ed.), Handbook of research methods in psychology: Vol. 1. Foundations, planning, measures, and psychometrics (pp. 349–369). Washington, DC: American Psychological Association. http://dx.doi.org/10.1037/13619-019

Watson, D., & Clark, L. A. (1992). On traits and temperament: General and specific factors of emotional experience and their relation to the five-factor model. Journal of Personality, 60, 441–476. http://dx.doi.org/10.1111/j.1467-6494.1992.tb00980.x

Watson, D., Clark, L. A., & Chmielewski, M. (2008). Structures of personality and their relevance to psychopathology: II. Further articulation of a comprehensive unified trait structure. Journal of Personality, 76, 1485–1522. http://dx.doi.org/10.1111/j.1467-6494.2008.00531.x

Watson, D., Clark, L. A., Chmielewski, M., & Kotov, R. (2013). The value of suppressor effects in explicating the construct validity of symptom measures. Psychological Assessment, 25, 929–941. http://dx.doi.org/10.1037/a0032781

Watson, D., Clark, L. A., & Harkness, A. R. (1994). Structures of personality and their relevance to psychopathology. Journal of Abnormal Psychology, 103, 18–31. http://dx.doi.org/10.1037/0021-843X.103.1.18

Watson, D., Clark, L. A., & Tellegen, A. (1984). Cross-cultural convergence in the structure of mood: A Japanese replication and a comparison with U.S. findings. Journal of Personality and Social Psychology, 47, 127–144. http://dx.doi.org/10.1037/0022-3514.47.1.127

Watson, D., Nus, E., & Wu, K. D. (2017). Development and validation of the Faceted Inventory of the Five-Factor Model (FI-FFM). Assessment. Advance online publication. http://dx.doi.org/10.1177/1073191117711022

Watson, D., O'Hara, M. W., Simms, L. J., Kotov, R., Chmielewski, M., McDade-Montez, E. A., . . . Stuart, S. (2007). Development and validation of the Inventory of Depression and Anxiety Symptoms (IDAS). Psychological Assessment, 19, 253–268. http://dx.doi.org/10.1037/1040-3590.19.3.253

Watson, D., Stanton, K., & Clark, L. A. (2017). Self-report indicators of negative valence constructs within the Research Domain Criteria (RDoC): A critical review. Journal of Affective Disorders, 216, 58–69. http://dx.doi.org/10.1016/j.jad.2016.09.065

Watson, D., Stasik, S. M., Ellickson-Larew, S., & Stanton, K. (2015). Extraversion and psychopathology: A facet-level analysis. Journal of Abnormal Psychology, 124, 432–446. http://dx.doi.org/10.1037/abn0000051

Watson, D., Suls, J., & Haig, J. (2002). Global self-esteem in relation to structural models of personality and affectivity. Journal of Personality and Social Psychology, 83, 185–197. http://dx.doi.org/10.1037/0022-3514.83.1.185

Zimmerman, M. (1994). Diagnosing personality disorders: A review of issues and research methods. Archives of General Psychiatry, 51, 225–245. http://dx.doi.org/10.1001/archpsyc.1994.03950030061006

Zimmermann, J., & Wright, A. G. C. (2017). Beyond description in interpersonal construct validation: Methodological advances in the circumplex structural summary approach. Assessment, 24, 3–23. http://dx.doi.org/10.1177/1073191115621795

Received January 14, 2018
Revision received May 7, 2018
Accepted May 15, 2018