In physics education research, instructors and researchers often use research-based assessments (RBAs)
to assess students’ skills and knowledge. In this paper, we support the development of a mechanics
cognitive diagnostic to test and implement effective and equitable pedagogies for physics instruction.
Adaptive assessments using cognitive diagnostic models provide significant advantages over fixed-length
RBAs commonly used in physics education research. As part of a broader project to develop a cognitive
diagnostic assessment for introductory mechanics within an evidence-centered design framework, we
identified and tested the student models of four skills that cross content areas in introductory physics: apply
vectors, conceptual relationships, algebra, and visualizations. We developed the student models in three
steps. First, we based the model on learning objectives from instructors. Second, we coded the items on
RBAs using the student models. Finally, we tested and refined this coding using a common cognitive
diagnostic model, the deterministic inputs, noisy “and” gate model. The data included 19 889 students who
completed either the Force Concept Inventory, Force and Motion Conceptual Evaluation, or Energy and
Momentum Conceptual Survey on the LASSO platform. The results indicated a good to adequate fit for the
student models with high accuracies for classifying students with many of the skills. The items from these
three RBAs do not cover all of the skills in enough detail; however, they will form a useful initial item bank
for the development of the mechanics cognitive diagnostic.
DOI: 10.1103/PhysRevPhysEducRes.21.010103
knowledge and skill acquisition to help tailor instruction to students' needs.

To support the development of the MCD, we investigated the skills assessed by three RBAs commonly used in introductory college mechanics courses [1]. This research develops the models for the student skills and the evidence for assessing those skills as a component of the larger development of the MCD. The MCD will leverage this information to provide instructors with timely and actionable formative assessments.

II. RESEARCH QUESTION

To support the development of the MCD to measure skills across introductory mechanics content areas, we developed and applied a model of four skills to three commonly used RBAs for introductory mechanics courses. To this end, we ask the following research question:
• What skills and content areas do three RBAs for introductory mechanics cover?

III. DEFINITIONS

To support readers' interpretation of our research, Table I includes a selection of terms and their definitions.

TABLE I. A selection of terms and their definitions.

Computerized adaptive testing (CAT)—Administered on computers, the test adaptively selects appropriate items for each person to match student proficiency [12–14].
Proficiency—"…the student's general facility with answering the items correctly on the assessment under consideration" [15]. Higher proficiency increases the probability of answering assessment items correctly. Different fields use different terms for proficiency, such as skill, ability, latent trait, and omega.
Skills—A latent attribute that students need to master to answer items correctly and that cuts across content areas [13,16,17].
Q-matrix—A Q-matrix, or "question matrix," is a binary matrix that maps the relationship between test items and the underlying skills they measure. Each row represents a test item, and each column represents a specific skill. An entry of 1 in the matrix indicates that a particular skill is required to answer the corresponding test item correctly, while a 0 indicates that the skill is not required.
Cognitive diagnostic (CD) assessment—An assessment method that evaluates students on specific skills to determine mastery. In contrast to traditional assessment methods that measure students on a single proficiency, CD provides diagnostic information on students' skill strengths and weaknesses to support personalized educational strategies [18,19].
Classification accuracy—The agreement between observed and true skill classifications. In practice, this is calculated using the expected skill classifications rather than the true classifications, which is detailed in an example around Eqs. (4) and (6) in Ref. [20].
Deterministic inputs, noisy "and" gate (DINA) model—A cognitive diagnostic model assuming that a student must master all the required skills to solve an item correctly. The absence of any required skill cannot be compensated by the mastery of others. This model operates within a binary framework, categorizing each skill as either mastered or not mastered [19,21–23].
Evidence-centered design—A framework for developing educational assessments based on establishing logical, evidence-based arguments [24].

IV. LITERATURE REVIEW

Many physics education researchers and instructors use existing fixed-length RBAs. PhysPort [25] and the LASSO platform [26] provide lists of and resources for these RBAs. Initially, instructors administered these RBAs with paper and pencil, but the administration is moving to online formats [27]. This move to online data collection has led to the development of CATs for introductory physics that have advantages over fixed-length tests. In this section, we discuss RBAs in introductory mechanics, options for administering RBAs online, CAT broadly, and the application of CAT to RBAs in physics.

A. RBAs in introductory mechanics

PhysPort [28] provides an extensive list of RBAs for physics and other extensive pedagogical resources. PhysPort, however, does not administer assessments online. RBA developers and researchers have instead often relied on Qualtrics or the LASSO platform [26,27] to administer the RBAs they develop or use online. Administering RBAs online allows assessing students in or outside of class to save class time, automatically analyzing the collected data, and aggregating the data for research purposes [29].

PhysPort describes 117 RBAs [28], with 16 RBAs for introductory mechanics. Each RBA targets content areas and skills important for physics learning. The titles of each RBA often state the focus of the RBAs. For example, our study analyzed data from three RBAs because we had access to enough data for the analysis in this paper through the LASSO database. The Force Concept Inventory (FCI) [30] focuses on conceptual knowledge of forces and kinematics. The Force and Motion Conceptual Evaluation (FMCE) [31] provides similar coverage but has four energy questions. The Energy and Momentum Conceptual Survey (EMCS) [32] covers exactly what the name states. Other assessment names also
portray skills or content areas of interest to physics education: the Test of Understanding Graphs in Kinematics, the Test of Understanding Vectors in Kinematics, and the Rotational Kinematics Inventory. These names imply that graphs and vectors play an important role in many physics courses and that many physics courses cover rotation. As discussed below, cognitive diagnostics allow for incorporating additional items to cover new topics throughout their lifetime.

B. Cognitive diagnostic—Computerized adaptive testing

Computerized adaptive testing (CAT) uses item response theory to establish a relationship between the student's proficiency level and the probability of their success in answering test items [13]. CAT selects items based on student responses to the preceding items to estimate the student's proficiency and then aligns each item's difficulty with the individual's proficiency [13]. This continuous adaptation of item difficulty to student proficiency ensures that the test remains challenging and engaging for the students throughout its duration and provides a more precise estimation of student proficiency than paper-and-pencil assessment [12–14]. Compared to paper-and-pencil assessment methods, CAT requires fewer items to accurately measure students' proficiency while controlling the selected items for content variety [33]. Chen et al. [34] show that CAT supports test security by drawing from a large item bank to control for item overexposure and how CAT can use pretest proficiency estimates for item selection and proficiency estimation to maximize test efficiency.
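The selection logic described above can be sketched in a few lines of R. This is only an illustration under an assumed Rasch (one-parameter) item response model with made-up item difficulties and a grid-search proficiency estimate; it is not the CD-CAT algorithm developed for the MCD.

```r
# Illustrative sketch only: a minimal Rasch-based adaptive item selector.
# Item difficulties, the ability grid, and the example bank are hypothetical.
rasch_p <- function(theta, b) 1 / (1 + exp(-(theta - b)))

# Grid-search maximum likelihood estimate of proficiency from responses so far
estimate_theta <- function(responses, difficulties) {
  grid <- seq(-4, 4, by = 0.05)
  loglik <- sapply(grid, function(th) {
    p <- rasch_p(th, difficulties)
    sum(responses * log(p) + (1 - responses) * log(1 - p))
  })
  grid[which.max(loglik)]
}

# Select the unused item whose difficulty best matches the current estimate
# (for the Rasch model this maximizes Fisher information).
select_next_item <- function(theta_hat, difficulties, administered) {
  available <- setdiff(seq_along(difficulties), administered)
  available[which.min(abs(difficulties[available] - theta_hat))]
}

# Example: three items answered so far from a hypothetical ten-item bank
bank <- seq(-2, 2.5, length.out = 10)   # hypothetical difficulties
answered <- c(1, 4, 7)                  # items already administered
scores <- c(1, 1, 0)                    # right or wrong on those items
theta_hat <- estimate_theta(scores, bank[answered])
next_item <- select_next_item(theta_hat, bank, answered)
```

Each new response updates the proficiency estimate, and the next item is chosen to be most informative at that estimate; a CD-CAT replaces the single proficiency estimate with a posterior over skill mastery profiles.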
Combining cognitive diagnostic (CD) models and CAT improves the assessment process and categorizes students based on their mastery of the distinct skills associated with each item. CD models aim to estimate how students' cognitive proficiency relates to the specific skills or content necessary to solve individual test items [13,35], with a skill being a fundamental cognitive unit or proficiency that students need to acquire and master to answer certain items [16,17]. The deterministic inputs, noisy "and" gate (DINA) model is a CD model that facilitates the assessment of skill mastery profiles and the estimation of item parameters [36]. The DINA model leverages a Q-matrix to test the relationships between items and the skills requisite to answer them [37], thereby providing a structured framework for monitoring the mastery levels of distinct proficiencies [37]. The DINA model has been applied to evaluate students' mastery across various skills, including problem solving [38], computational thinking [17], and domain-specific knowledge [37].

C. CAT in physics education

We are unaware of any CD assessments in physics. Researchers have, however, conducted studies on the effectiveness of CAT using item response theory to evaluate students' proficiencies [12,39]. One such study by Istiyono et al. [40] utilized CAT to assess the physics problem-solving skills of senior high school students, revealing that most students' competencies fell within the medium-to-low categories. Morphew et al. [12] explored the use of CAT to evaluate physics proficiency and identify the areas where students needed to improve when preparing for course exams in an introductory physics course. Their studies showed that students who used the CAT improved their performance on subsequent exams. In another study, Yasuda et al. [41] indicated that CAT can reduce testing time through shorter test lengths while maintaining the accuracy of the test measurement and administration. Yasuda et al. [39] examined item overexposure in an FCI-CAT, employing pretest proficiency for item selection. This shortened test duration while maintaining accuracy and enhanced security by reducing item content memorization and sharing among students.

V. THEORETICAL FRAMEWORK

We drew on evidence-centered design [24] to inform our development of the MCD. Evidence-centered design was first applied in the high-stakes context of the graduate record examinations [24,42] and has also been effectively utilized in physics education research for the development of RBAs [43,44]. We used three core premises in the evidence-centered design framework [24].
1. Assessment developers need content and context expertise to create high-quality items. In this analysis, we focused on three RBAs developed by physics education researchers—FCI, FMCE, and EMCS.
2. Assessment developers use evidence-based reasoning to evaluate students' comprehension and identify misunderstandings accurately. In this analysis, we developed a Q-matrix that identified which underlying skills were required to correctly answer each item (more details in Sec. VI B).
3. When creating assessments, developers must consider various factors such as resource availability, limitations, and usage conditions. For instance, the LASSO platform supports multiple-choice items and needs web-enabled devices, but it conserves class and instructor time.

FIG. 1. An evidence-centered design framework for creating the mechanics cognitive diagnostic (MCD). This paper focuses on the student models and evidence models. The student models determine the skills and content areas that our assessment aims to measure. The evidence models apply the DINA model to the multiple-choice questions (task model) students answer to measure students' skills. Our CD-CAT algorithm will determine which items to ask students, who will take the assessment online through the LASSO platform.
Our work used the conceptual assessment framework provided by the evidence-centered design framework with its five models [24] (shown in Fig. 1) to guide assessment development. The models and their connections to our work are as follows:
1. Student models focus on identifying one or more variables directly relevant to the knowledge, skills, or proficiencies an instructor wishes to examine. In this project, a qualitative analysis (see Sec. VI B) indicated that four skills (i.e., apply vectors, conceptual relationships, algebra, and visualizations) and four content areas (i.e., kinematics, forces, energy, and momentum) would be optimum for our MCD.
2. Evidence models include evidence rules and measurement models to provide a guide to update information regarding a student's performance. The evidence rules govern how observable variables summarize a student's performance on individual test items. The measurement model transforms the student responses into the student skill profile. In this project, the evidence rules were binary, right or wrong scores, and the measurement model is the DINA model, which includes the Q-matrix.
3. Task model describes what students do to provide input to the evidence models. In this project, the task model was multiple-choice questions.
4. Assembly model describes how the three models above, including the student models, evidence models, and task models, work together to form the psychometric frame of the assessment. In the broader project, we developed a CD-CAT algorithm that integrated models 1–3 for the MCD.
5. Delivery model describes integrating all the models required for evaluation. We used the online LASSO platform [29,45] in this project.

In this paper, we focus on the student models and evidence models (models 1–2). These models are instrumental in aligning our analysis with the research question. By evaluating the student models, we gain insights into the range of competencies RBAs are designed to assess. Similarly, through the evidence models, we understand how these assessments capture and represent student understanding in various skills and content areas.

VI. MATERIALS AND METHODS

To answer the research question, we employed a mixed methods approach, using qualitative coding to identify the skills and content areas to measure for the student models. Subsequent quantitative analyses drove the testing of the evidence models and iterative improvements of the student models. We first used artifacts from courses to build the student models of skills that cut across the content of introductory mechanics courses. We then identified RBAs with sufficient data available through the LASSO platform and coded each item for the skills it assessed. Finally, we used an iterative process that applied the DINA model to build the evidence models and to improve our definitions of the skills and the coding of the skills on each item. In this iterative process, the DINA model suggested changes to the item skill codes initially made by content experts. The suggested changes were accepted or rejected by content experts. We then ran a final DINA model on our revised codes.

A. RBAs data collection and cleaning

Our analysis examined student responses on three RBAs: the FCI (30 items, 12 932 students), FMCE (47 items, 5510 students), and EMCS (25 items, 1447 students). Our dataset came from the LASSO platform [26,29,45]. LASSO provided post-test data from 19 889 students across the three assessments. We removed assessments completed in less than 5 min and assessments with missing answers.
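A minimal sketch of that cleaning rule is shown below. The data frame layout, including the completion-time column and item column names, is a hypothetical stand-in; the paper does not describe the LASSO export format.

```r
# Illustrative sketch only: drop assessments finished in under 5 minutes or
# with any unanswered items. Column names are hypothetical placeholders.
clean_responses <- function(responses, item_cols, min_minutes = 5) {
  keep <- responses$duration_min >= min_minutes &
    stats::complete.cases(responses[, item_cols])
  responses[keep, , drop = FALSE]
}

# Example call with made-up column names:
# cleaned <- clean_responses(raw_data, item_cols = paste0("q", 1:30))
```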
B. Qualitative data analysis

We developed an initial list of skills and content areas covered in physics courses by coding learning objectives from courses using standards-based grading. We focused on standards-based grading because instructors explicitly list the learning objectives students should master during the course [46]. Initially, we coded a set of skills based on both the standards and the items on the RBAs; the skills included apply vectors, conceptual understanding, algebra, visualizations, and definitions. We discarded definitions as a skill because it represents a memorized response that the other skills cover in greater depth by asking students to apply or understand the concept. We are also not aware of RBAs for introductory physics that ask definition questions. Table II lists the four skills and their definitions.

TABLE II. Definition of the skills in the FCI, FMCE, and EMCS assessments.

Apply vectors: Item requires manipulating vectors in more than one dimension or has a change in sign for a 1D vector quantity.
Conceptual relationships: Item requires students to identify a relationship between variables and/or the situations in which those relationships apply.
Algebra: Item requires students to reorganize one or more equations. This goes beyond recognizing the standard forms of equations.
Visualizations: Item requires extracting information from or creating formal visualizations such as xy plots, bar plots, or line graphs.

We initially coded content areas at a finer grain size to match the standards-based grading learning objectives; e.g., kinematics was split into four areas across two variables: 1D or 2D and constant velocity or constant acceleration. These content areas, however, were too fine grained to develop an assessment with a reasonable length for students to complete or a realistic size item bank. Therefore, we simplified the content codes to kinematics, forces, energy, and momentum for these three RBAs. Table III lists the four content areas covered by these three RBAs and their definitions.

TABLE III. Definition of the content areas in the FCI, FMCE, and EMCS assessments.

Based on this initial set of codes we developed, we coded each item for its relevant skills and content areas. Our coding team included three researchers with backgrounds in physics and teaching physics. Each item was independently coded by at least two team members. The three coders then compared the coding for the items and reached a consensus on all items. This consensus coding of the three assessments provided one of the inputs into the DINA analysis.

C. Quantitative data analysis

1. DINA model

The deterministic inputs, noisy "and" gate (DINA) model is the foundational cognitive diagnostic model [21,22]. The DINA model is used to analyze responses to test items and determine the underlying skills that students possess [19]. A Q-matrix [47] (acting as the deterministic input) defines the relationship between test items and the required skills, which we defined in Table I. Each row of the Q-matrix corresponds to a test item, and each column corresponds to a skill. Q-matrix entries are binary, indicating whether a skill is needed for a specific item. The DINA model produces a skill profile for each student, represented as a binary vector, indicating whether they have mastered each skill. For example, a profile of [1, 0, 1, 0] means the student has mastered skills 1 and 3 but not skills 2 and 4. The DINA model assumes a student needs to have mastered all the required skills for a particular item to answer it correctly. If a student lacks even one required skill, the model assumes the student will answer the item incorrectly [23]. The model incorporates a probabilistic component (the noisy "and" gate) to account for real-world inconsistencies with two complementary parameters: slip (s) and guess (g). Slip is the probability that a student who has mastered all the required skills still answers the item incorrectly due to carelessness, distraction, or error. Guess is the probability that a student who has not mastered all the required skills answers the item correctly by guessing or other factors. Slip and guess add a stochastic element to help account for the noise in real testing scenarios, where students might guess or make unexpected errors. For each item, the probability that a student answers correctly is determined by whether they have the required skills and by the slip and guess parameters. If the student has all required skills, then P(correct) = 1 − s. If the student does not have all required skills, then P(correct) = g. The model estimates each student's skill profile based on their responses, the Q-matrix, and the slip and guess parameters for each item. We used the DINA model because the model fits indicated it was not necessary to use a more complex model like the generalized DINA model.
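The paper does not write out the item response function, but the description above corresponds to the standard DINA formulation [23]. In the usual notation (our choice of symbols), with α_ik the mastery indicator for student i on skill k and q_jk the Q-matrix entry for item j and skill k,

\[
\eta_{ij} = \prod_{k=1}^{K} \alpha_{ik}^{\,q_{jk}}, \qquad
P(X_{ij} = 1 \mid \boldsymbol{\alpha}_{i}) = (1 - s_j)^{\eta_{ij}} \, g_j^{\,1 - \eta_{ij}},
\]

so that η_ij = 1 (all required skills mastered) gives P = 1 − s_j and η_ij = 0 gives P = g_j, matching the two cases described in the text.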
In this study, we used the DINA model to analyze students' response data for each of the three RBAs to further refine our item codes and calibrate each item's slip and guess parameters. The DINA model analyses also generated skill mastery profiles for each student, which were not the focus of the research question in this paper. These psychometric analyses were implemented using the G-DINA package [48] in the R programming environment.

RMSEA2 and SRMSR were used to assess the degree of the model-data fit. RMSEA2 is the root mean square error of approximation (RMSEA) based on the M2 statistic using the univariate and bivariate margins. RMSEA2 ranges from 0 to 1, and RMSEA2 < 0.06 indicates a good fit [49,50]. SRMSR, the standardized root mean squared residual, has acceptable values ranging between 0 and 0.8. Models with SRMSR < 0.05 can be viewed as well fitted, and models with SRMSR < 0.08 are typically considered acceptable [50–52]. Additionally, the skill-level classification accuracy, defined in Table I, informed the reliability and validity of the CD assessment. Classification accuracies range from 0 to 1, with values greater than or equal to 0.9 considered high [53,54] and values greater than 0.8 considered acceptable [55].

The appropriateness of the Q-matrix plays an important role in CD assessments and affects the degree of model-data fit. Inappropriate specifications in the Q-matrix may lead to poor model fit and thus may produce incorrect skill diagnosis results for students. Therefore, we included a Q-matrix validation step in the study. The input Q-matrices for the DINA analysis for each RBA were constructed by content experts, as detailed in the prior section. In the Q-matrix validation step, detailed below, the DINA analysis further examined each Q-matrix to identify potential misspecifications in the Q-matrices.
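To make this pipeline concrete, the sketch below runs the same kind of analysis with the GDINA package [48] on simulated data: fitting the DINA model, checking absolute fit (RMSEA2 and SRMSR), extracting slip and guess estimates and skill profiles, and running the empirical Q-matrix check based on the proportion of variance accounted for [56] that is described in the next subsection. The Q-matrix, the simulated responses, and all object names are illustrative placeholders, not the study's data or scripts.

```r
# Illustrative sketch only: a DINA analysis with the GDINA package [48] on
# simulated data. The Q-matrix and all parameter values are placeholders.
library(GDINA)

# Toy Q-matrix: 10 items by 4 skills
# (apply vectors, conceptual relationships, algebra, visualizations)
Q <- matrix(c(1,0,0,0,  0,1,0,0,  0,0,1,0,  0,0,0,1,  1,1,0,0,
              0,1,0,1,  1,0,1,0,  0,1,1,0,  1,0,0,1,  0,0,1,1),
            ncol = 4, byrow = TRUE)

# Simulate responses from the DINA model itself (see the equations in Sec. VI C 1)
set.seed(1)
N <- 500; J <- nrow(Q); K <- ncol(Q)
alpha <- matrix(rbinom(N * K, 1, 0.5), N, K)   # simulated skill mastery profiles
slip  <- rep(0.10, J); guess <- rep(0.20, J)
eta   <- 1 * (alpha %*% t(Q) == matrix(rowSums(Q), N, J, byrow = TRUE))
prob  <- eta * (1 - matrix(slip, N, J, byrow = TRUE)) +
         (1 - eta) * matrix(guess, N, J, byrow = TRUE)
responses <- matrix(rbinom(N * J, 1, prob), N, J)

# Fit the DINA model and inspect the quantities discussed above
fit <- GDINA(dat = responses, Q = Q, model = "DINA")
modelfit(fit)                  # M2-based RMSEA2 and SRMSR
coef(fit, what = "gs")         # item guess and slip estimates
personparm(fit, what = "MAP")  # each student's estimated skill profile

# Empirical Q-matrix validation (PVAF method [56]); in the study, suggested
# changes were reviewed by the coders rather than adopted automatically.
Qval(fit, method = "PVAF")
```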
2. Q-matrix validation

The analysis fitted the DINA model to students' post-assessment responses using the Q-matrix constructed by the three coders. The proportion of variance accounted for method [56] measured the relationships between the items and the skills specified in the provided Q-matrix. The analysis of the empirical response data suggested changes to the provided Q-matrix, which the three coders reviewed. The coders assessed the suggested modifications for how well they aligned with the definitions and revised the Q-matrix when the majority of the team agreed with the suggested changes. The refined Q-matrix was then used in subsequent CD modeling analyses.

Table IV presents a summary detailing the frequency of data-driven modifications suggested, the number adopted by the coders, and the rate of adoption for each of the three assessments under study. The FCI, for example, had 11 proposed changes of the 90 possible changes (30 items, each with three possible skills), and the coders adopted 7 of these suggestions. For instance, the conceptual relationships skill was initially not considered essential for item 7. However, empirical response data suggested that this skill was required to answer item 7 correctly. After review, the expert panel endorsed this modification; thereby, the value in the Q-matrix corresponding to the intersection of item 7 and conceptual relationships was changed from "0" to "1."

TABLE IV. Q-matrix modifications and adoption rates.

          Total items   Possible changes   Suggested changes   Adopted changes   Adoption rate (%)   Change rate (%)
FCI       30            90                 11                  7                 64                  7.8
FMCE      47            141                14                  5                 36                  3.5
EMCS      25            75                 1                   1                 100                 4.0
Overall   102           306                26                  13                50                  4.2

Overall, only 8.5% of the codings (26 of 306) were identified for reexamination by this analysis. Of the 26 proposed changes, 13 were adopted across the 3 assessments, yielding an overall adoption rate of 50%. This iterative approach to informing the validity of the Q-matrix avoids overreliance on either expert opinion or empirical data, harmonizing both information sources to enhance the accuracy of the Q-matrix. Table VIII (see the Appendix) shows the final coding for each RBA item across the four content areas and four skills.

VII. FINDINGS

This section addresses the research question by detailing the skills and content areas measured by the three assessments, as shown in Table VIII in the Appendix. First, we present which of the four skills the items on the three assessments measured and the number of skills the items measured. The specific models relating the items to the four skills are presented in the Appendix; see Tables IX–XI. Second, we show the content areas covered in the three assessments. Finally, we examine the skills across content areas. This structure highlights the various aspects of the items in these three assessments.

A. Skills

FCI—The FCI assessed three skills (Fig. 2). Eighteen items assessed the apply vectors skill, 17 assessed the conceptual relationships skill, 1 assessed the visualizations skill, and 0 assessed the algebra skill. The majority of items assessed a single skill. Twenty-four items (80%) assessed a single skill, 6 items (20%) assessed two skills, and 0 items assessed three skills (Table V).
FMCE—The FMCE assessed the same three skills as the FCI (Fig. 2). All 47 items assessed the conceptual relationships skill, 19 items assessed the visualizations skill, 18 items assessed the apply vectors skill, and 0 items assessed the algebra skill. The majority of items assessed multiple skills. Thirteen items (28%) assessed a single skill, while 31 items (66%) assessed two skills, and 3 items (6%) assessed three skills (Table V).
EMCS—Similar to the FCI and FMCE, the EMCS assessed the apply vectors and conceptual relationships skills (Fig. 2). The EMCS differed in that it included 2 items that assessed the algebra skill. Of the 25 EMCS items, 23 assessed the conceptual relationships skill (with items 3 and 13 both coded for energy and momentum), 5 assessed the apply vectors skill, 2 assessed the algebra skill, and 0 assessed the visualizations skill. The EMCS was the only assessment with items assessing the algebra skill. Most items assessed a single skill. Twenty items (80%) assessed a single skill, 5 items (20%) assessed two skills, and 0 items assessed three skills (Table V).

FIG. 2. The distribution of items across skills, content areas, and assessments. Note that each item can assess multiple skills. Only 2 items, 3 and 13 of the EMCS, assessed multiple content areas (i.e., energy and momentum) under the conceptual relationships skill.

B. Content areas

FCI—The FCI assessed two content areas (Fig. 2 and Table VII). Eighteen items assessed forces, 12 assessed kinematics, and 0 assessed energy and momentum. All 30 items (100%) assessed a single content area.
FMCE—The FMCE assessed three content areas (Fig. 2 and Table VII). Thirty-one items assessed forces, 12 assessed kinematics, and 4 assessed energy.

1. DINA model fit

The analysis fitted the DINA model with the refined Q-matrix to the response data. According to the established criteria [50,57], the model demonstrated satisfactory fit (RMSEA2 < 0.05, SRMSR < 0.07) for the FCI and the EMCS, while the fit for the FMCE was unsatisfactory (RMSEA2 = 0.090, SRMSR = 0.110). These outcomes suggest that the model adequately represents the underlying data structure for the FCI and EMCS but might not capture the latent structure of the FMCE well.

Table VI presents the classification accuracy [20] for each skill across the three RBAs. As discussed in the skills section, not all skills were measured by each of the RBAs; 9 of 12 were possible. For those skills that were measured, 7 of the 9 classification accuracies were high (over 0.9). The classification accuracy of visualizations for the FCI (0.79) and algebra for the EMCS (0.63) was notably lower. The lower classification accuracy reflects the lack of items measuring these skills (Fig. 2).

TABLE VI. Skill classification accuracy by assessment.

        Apply vectors   Conceptual relationships   Algebra   Visualizations
FCI     0.97            0.96                       ···       0.79
FMCE    0.96            0.98                       ···       0.91
EMCS    0.94            0.95                       0.63      ···
The three RBAs had classification accuracies above 0.9 for the apply vectors and conceptual relationships skills, as shown in Table VI. This makes sense for the FMCE and FCI, given that they each had at least 17 items for each of the apply vectors and conceptual relationships skills (Fig. 2). Although the EMCS only had 5 items measuring apply vectors, the classification accuracy was still 0.94. This finding indicates that a relatively small number of items can still accurately assess a skill. The number of items measuring the algebra skill on the EMCS (2) and the visualizations skill on the FCI (1) was not sufficient to generate useful classification accuracies (< 0.8). Combining the three assessments into a single item bank should provide sufficient coverage of the apply vectors, conceptual relationships, and visualizations skills, but it will not offer enough items to assess the algebra skill. Additionally, the combined item bank will require additional items to assess the visualizations and apply vectors skills in the content areas of energy and momentum.

IX. LIMITATIONS

The DINA analysis assumes students have mastered each skill assessed by an item to answer that item correctly. A less restrictive analysis, such as the generalized DINA, that assumes some questions can be answered by only mastering a subset of skills or by students who have only partially mastered skills may provide a better fit. The three RBAs constrained the skills that the analyses could test. This was an obvious issue for the algebra skill, which was only assessed by two items on one assessment. Physics instructors also likely value and teach other skills they would want to assess, such as the ability to decompose complex problems into smaller pieces to solve, as assessed by the Mechanics Reasoning Inventory [59]. The analysis does not test the extent to which the items and assessments act differently across populations, e.g., gender, race, or type of physics course. Mixed evidence exists about the measurement invariance [60] and differential item functioning [61,62] of the FCI and FMCE. The combination of items from these three assessments administered through a cognitive diagnostic at a large scale will provide a dataset to identify and understand item differences and potential item biases between groups of students.

X. CONCLUSIONS

Combining 102 items from three RBAs into a single item bank to create a CD-CAT provides a solid foundation for building the MCD. The limited number of items assessing the algebra skill and the apply vectors and visualizations skills for energy and momentum point to these as specific areas for improvement of the item bank. Delivering the MCD online, fortunately, has the advantage of allowing for the inclusion of new items under development to fill in gaps in the item bank. The combined item bank will also improve classification accuracy by having more items to draw on. However, the high classification accuracy (0.941) for the apply vectors skill on the EMCS indicates that even just 5 items can provide a high classification accuracy. This result indicates that shorter assessments may allow for high levels of classification accuracy for skills while also using fewer questions. We plan on ensuring sufficient classification accuracy with a minimum of ten items for each content area and skill combination. This will also provide enough items to estimate student proficiency when an instructor administers a single content area and skill combination as a weekly test. Future work will add content areas for mathematics and rotational mechanics.

Using LASSO as the delivery system for the MCD provides instructors with an adaptive tool to assess students' skills and knowledge across content areas or in specific content areas. In particular, using a cognitive diagnostic for the assembly model allows instructors to design formative assessments by choosing the skills and content areas to measure. Integrating guidelines and constraints on test lengths will help instructors design accurate assessments of those skills and content areas. The cognitive diagnostic also allows flexible timing; instructors can design pretests or post-tests that cover many skills and content areas or weekly tests focused on a few skills for one content area.

For researchers, the MCD will collect longitudinal data across skills and content areas. These data can inform the development of learning progressions or skills transfer across content areas, such as applying vectors in mathematical, kinematics, and momentum content areas. Developing more items that cover multiple content areas can inform how physics content interacts, which current RBAs do not assess. Because LASSO is free for instructors, the data will likely also represent a broader cross section of physics learners [63] than physics education research has historically included [64].

ACKNOWLEDGMENTS

This research was made possible through the financial support provided by National Science Foundation Grant No. 2141847. We extend our appreciation to LASSO for their support in both collecting and sharing data for this research.

APPENDIX

The Appendix includes the coding and the refined Q-matrix tables (Tables VIII–XI) for the three assessments used to conduct the DINA model analysis.
TABLE VIII. The skills and content areas for items from the FCI, FMCE, and EMCS. Note that "FCI_01" represents an abbreviation of the assessment name and the number of the item on the assessment.

TABLE IX. The refined Q-matrix for each FCI item, represented as binary coding, with * denoting adopted changes from the suggested Q-matrix of the DINA model.

FCI item   Apply vectors   Conceptual relationships   Algebra   Visualizations
1          0               1                          0         0
2          0*              1                          0         0
3          0*              1                          0         0
4          0               1                          0         0
5          1               0                          0         0
6          0               1                          0         0
7          1               1                          0         0
8          1               0*                         0         0
9          1               0                          0         0
10         0               1                          0         0
11         1               0                          0         0
12         1               1                          0         0
13         1               0                          0         0
14         1               1                          0         0
15         0*              1                          0         0
16         0               1                          0         0
17         1               0                          0         0
18         1               0                          0         0
19         0               1                          0         0
20         0               1                          0         1
21         1               0                          0         0
22         1               0                          0         0
23         1               1                          0         0
24         0               1                          0         0
25         1               1*                         0         0
26         1               0                          0         0
27         1               0*                         0         0
28         0               1                          0         0
29         1               0*                         0         0
30         1               0                          0         0
TABLE X. The refined Q-matrix for each FMCE item, represented as binary coding, with * denoting adopted changes from the suggested Q-matrix of the DINA model.

FMCE item   Apply vectors   Conceptual relationships   Algebra   Visualizations
1           1               1                          0         0
2           0*              1                          0         0
3           1               1                          0         0
4           1*              1                          0         0
5           1               1                          0         0
6           1*              1                          0         0
7           1               1                          0         0
8           1               1                          0         0
9           1               1                          0         0
10          1               1                          0         0
11          1               1                          0         0
12          1               1                          0         0
13          1               1                          0         0
14          0               1                          0         1
15          0               1                          0         1
16          0               1                          0         1
17          0               1                          0         1
18          0               1                          0         1
19          0               1                          0         1
20          1*              1                          0         1
21          1               1                          0         1
22          0               1                          0         1
23          0               1                          0         1
24          0               1                          0         1
25          0               1                          0         1
26          0               1                          0         1
27          1               1                          0         0
28          1               1                          0         0
29          1               1                          0         0
30          0               1                          0         0
31          0               1                          0         0
32          0               1                          0         0
33          0               1                          0         0
34          0               1                          0         0
35          0               1                          0         0
36          0               1                          0         0
37          0               1                          0         0
38          0               1                          0         0
39          0               1                          0         0
40          0               1                          0         1
41          1*              1                          0         1
42          0               1                          0         1
43          0               1                          0         1
44          0               1                          0         1
45          0               1                          0         1
46          0               1                          0         0
47          0               1                          0         0

TABLE XI. The refined Q-matrix for each EMCS item, represented as binary coding, with * denoting adopted changes from the suggested Q-matrix of the DINA model.

EMCS item   Apply vectors   Conceptual relationships   Algebra   Visualizations
1           1               1                          0         0
2           0               1                          0         0
3           0               1                          0         0
4           0               1                          0         0
5           1               1                          0         0
6           0               1                          0         0
7           0               1                          0         0
8           0               1                          0         0
9           0               1                          0         0
10          0               1                          0         0
11          1               0                          0         0
12          0               1                          0         0
13          1               1                          0         0
14          0               1                          0         0
15          0               1                          1         0
16          0               1                          0         0
17          0               1                          0         0
18          0               1                          0         0
19          0               1                          0         0
20          0               1                          0         0
21          0               1                          1         0
22          0               1                          0         0
23          1               0*                         0         0
24          0               1                          0         0
25          0               1                          0         0
[1] Adrian Madsen, Sarah B. McKagan, and Eleanor C. Sayre, Resource letter RBAI-1: Research-based assessment instruments in physics and astronomy, Am. J. Phys. 85, 245 (2017).
[2] Jennifer L. Docktor and José P. Mestre, Synthesis of discipline-based education research in physics, Phys. Rev. ST Phys. Educ. Res. 10, 020119 (2014).
[3] Adrian Madsen, Sarah B. McKagan, Mathew Sandy Martinuk, Alexander Bell, and Eleanor C. Sayre, Research-based assessment affordances and constraints: Perceptions of physics faculty, Phys. Rev. Phys. Educ. Res. 12, 010115 (2016).
[4] Ben Van Dusen and Jayson Nissen, Equity in college physics student learning: A critical quantitative intersectionality investigation, J. Res. Sci. Teach. 57, 33 (2020).
[5] Bethany R. Wilcox and H. J. Lewandowski, Research-based assessment of students' beliefs about experimental physics: When is gender a factor?, Phys. Rev. Phys. Educ. Res. 12, 020130 (2016).
[6] Ronald K. Thornton, Dennis Kuhl, Karen Cummings, and Jeffrey Marx, Comparing the force and motion conceptual evaluation and the force concept inventory, Phys. Rev. ST Phys. Educ. Res. 5, 010105 (2009).
[7] Siera M. Stoen, Mark A. McDaniel, Regina F. Frey, K. Mairin Hynes, and Michael J. Cahill, Force concept inventory: More than just conceptual understanding, Phys. Rev. Phys. Educ. Res. 16, 010105 (2020).
[8] James T. Laverty, Amogh Sirnoorkar, Amali Priyanka Jambuge, Katherine D. Rainey, Joshua Weaver, Alexander Adamson, and Bethany R. Wilcox, A new paradigm for research-based assessment development, presented at PER Conf. 2022, Grand Rapids, MI, 10.1119/perc.2022.pr.Laverty.
[9] Nance S. Wilson, Teachers expanding pedagogical content knowledge: Learning about formative assessment together, J. Serv. Educ. 34, 283 (2008).
[10] Jacqueline Leighton and Mark Gierl, Cognitive Diagnostic Assessment for Education: Theory and Applications (Cambridge University Press, Cambridge, England, 2007).
[11] Ying Cui, Mark J. Gierl, and Hua-Hua Chang, Estimating classification consistency and accuracy for cognitive diagnostic assessment, J. Educ. Measure. 49, 19 (2012).
[12] Jason W. Morphew, Jose P. Mestre, Hyeon-Ah Kang, Hua-Hua Chang, and Gregory Fabry, Using computer adaptive testing to assess physics proficiency and improve exam performance in an introductory physics course, Phys. Rev. Phys. Educ. Res. 14, 020110 (2018).
[13] Hua-Hua Chang, Psychometrics behind computerized adaptive testing, Psychometrika 80, 1 (2015).
[14] David J. Weiss, Improving measurement quality and efficiency with adaptive testing, Appl. Psychol. Meas. 6, 473 (1982).
[15] John Stewart, John Hansen, and Lin Ding, Quantitative methods in PER, in The International Handbook of Physics Education Research: Special Topics (AIP Publishing LLC, Melville, NY, 2023), Chap. 24.
[16] Christoph Helm, Julia Warwas, and Henry Schirmer, Cognitive diagnosis models of students' skill profiles as a basis for adaptive teaching: An example from introductory accounting classes, Empirical Res. Vocat. Educ. Train. 14, 1 (2022).
[17] Tingxuan Li and Anne Traynor, The use of cognitive diagnostic modeling in the assessment of computational thinking, AERA Open 8, 23328584221081256 (2022).
[18] Hamdollah Ravand and Alexander Robitzsch, Cognitive diagnostic modeling using R, Pract. Assess. Res. Eval. 20, 11 (2015).
[19] Jimmy De La Torre and Nathan Minchen, Cognitively diagnostic assessments and the cognitive diagnosis model framework, Psicol. Educ. 20, 89 (2014).
[20] Wenyi Wang, Lihong Song, Ping Chen, Yaru Meng, and Shuliang Ding, Attribute-level and pattern-level classification consistency and accuracy indices for cognitive diagnostic assessment, J. Educ. Measure. 52, 457 (2015).
[21] Edward Haertel, An application of latent class models to assessment data, Appl. Psychol. Meas. 8, 333 (1984).
[22] Brian W. Junker and Klaas Sijtsma, Cognitive assessment models with few assumptions, and connections with nonparametric item response theory, Appl. Psychol. Meas. 25, 258 (2001).
[23] J. de la Torre, DINA model and parameter estimation: A didactic, J. Educ. Behav. Stat. 34, 115 (2009).
[24] Robert J. Mislevy, Russell G. Almond, and Janice F. Lukas, A brief introduction to evidence-centered design, ETS Res. Rep. Ser. 2003, i–29 (2003).
[25] Physport assessments: Force and motion conceptual evaluation (n.d.), https://www.physport.org/assessments/assessment.cfm?A=FMCE.
[26] Learning Assistant Alliance (2020), https://lassoeducation.org/.
[27] Ben Van Dusen, Mollee Shultz, Jayson M. Nissen, Bethany R. Wilcox, N. G. Holmes, Manher Jariwala, Eleanor W. Close, H. J. Lewandowski, and Steven Pollock, Online administration of research-based assessments, Am. J. Phys. 89, 7 (2021).
[28] Physport: Browse assessments (n.d.), https://www.physport.org/assessments/.
[29] Ben Van Dusen, LASSO: A new tool to support instructors and researchers, American Physical Society Forum on Education Fall 2018, arXiv:1812.02299.
[30] David Hestenes, Malcolm Wells, and Gregg Swackhamer, Force concept inventory, Phys. Teach. 30, 141 (1992).
[31] Ronald K. Thornton and David R. Sokoloff, Assessing student learning of Newton's laws: The force and motion conceptual evaluation and the evaluation of active learning laboratory and lecture curricula, Am. J. Phys. 66, 338 (1998).
[32] Chandralekha Singh and David Rosengrant, Multiple-choice test of energy and momentum concepts, Am. J. Phys. 71, 607 (2003).
[33] Alper Şahin and Durmus Özbasi, Effects of content balancing and item selection method on ability estimation in computerized adaptive tests, Eurasian J. Educ. Res. (2017).
[34] Shu-Ying Chen, Pui-Wa Lei, and Wen-Han Liao, Controlling item exposure and test overlap on the fly in computerized adaptive testing, Br. J. Math. Stat. Psychol. 61, 471 (2008).
[35] Carlos Fernando Collares, Cognitive diagnostic modeling in healthcare professions education: An eye-opener, Adv. Health Sci. Educ. 27, 427 (2022).
[36] Rose C. Anamezie and Fidelis O. Nnadi, Parameterization of teacher-made physics achievement test using deterministic-input-noisy-and-gate (DINA) model, J. Prof. Issues Eng. Educ. Pract. 9, 101 (2018), https://iiste.org/Journals/index.php/JEP/article/view/45266.
[37] Yunxiao Chen, Jingchen Liu, Gongjun Xu, and Zhiliang Ying, Statistical analysis of Q-matrix based diagnostic classification models, J. Am. Stat. Assoc. 110, 850 (2015).
[38] Jiwei Zhang, Jing Lu, Jing Yang, Zhaoyuan Zhang, and Shanshan Sun, Exploring multiple strategic problem solving behaviors in educational psychology research by using mixture cognitive diagnosis model, Front. Psychol. 12, 568348 (2021).
[39] J. Yasuda, N. Mae, M. M. Hull, and M. Taniguchi, Analysis to develop computerized adaptive testing with the force concept inventory, J. Phys. Conf. Ser. 1929, 012009 (2021).
[40] Edi Istiyono, Wipsar Sunu Brams Dwandaru, and Revnika Faizah, Mapping of physics problem-solving skills of senior high school students using PhysProSS-CAT, Res. Eval. Educ. 4, 144 (2018).
[41] Jun-ichiro Yasuda, Naohiro Mae, Michael M. Hull, and Masa-aki Taniguchi, Optimizing the length of computerized adaptive testing for the force concept inventory, Phys. Rev. Phys. Educ. Res. 17, 010115 (2021).
[42] Kathleen M. Sheehan, Irene Kostin, and Yoko Futagi, Supporting efficient, evidence-centered item development for the GRE verbal measure, ETS Research Report No. RR-07-29, 2007.
[43] Benjamin Pollard, Robert Hobbs, Rachel Henderson, Marcos D. Caballero, and H. J. Lewandowski, Introductory physics lab instructors' perspectives on measurement uncertainty, Phys. Rev. Phys. Educ. Res. 17, 010133 (2021).
[44] Michael Vignal, Gayle Geschwind, Benjamin Pollard, Rachel Henderson, Marcos D. Caballero, and H. J. Lewandowski, Survey of physics reasoning on uncertainty concepts in experiments: An assessment of measurement uncertainty for introductory physics labs, arXiv:2302.07336.
[45] Jayson M. Nissen, Ian Her Many Horses, Ben Van Dusen, Manher Jariwala, and Eleanor Close, Providing context for identifying effective introductory mechanics courses, Phys. Teach. 60, 179 (2022).
[46] Ian D. Beatty, Standards-based grading in introductory university physics, J. Scholarship Teach. Learn. 13, 1 (2013), https://scholarworks.iu.edu/journals/index.php/josotl/article/view/3264.
[47] Kikumi K. Tatsuoka, Architecture of knowledge structures and cognitive diagnosis: A statistical pattern recognition and classification approach, in Cognitively Diagnostic Assessment (Routledge, London, 2012), pp. 327–359.
[48] Wenchao Ma and Jimmy de la Torre, GDINA: An R package for cognitive diagnosis modeling, J. Stat. Softw. 93, 1 (2020).
[49] Daire Hooper, Joseph Coughlan, and Michael Mullen, Evaluating model fit: A synthesis of the structural equation modelling literature, in Proceedings of the 7th European Conference on Research Methodology for Business and Management Studies (2008), Vol. 2008, pp. 195–200.
[50] Li-tze Hu and Peter M. Bentler, Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives, Struct. Equation Modell. 6, 1 (1999).
[51] Alberto Maydeu-Olivares and Harry Joe, Assessing approximate fit in categorical data analysis, Multivariate Behav. Res. 49, 305 (2014).
[52] Sung Tae Jang, The implications of intersectionality on Southeast Asian female students' educational outcomes in the United States: A critical quantitative intersectionality analysis, Am. Educ. Res. J. 55, 1268 (2018).
[53] Zhengqi Tan, Jimmy De la Torre, Wenchao Ma, David Huh, Mary E. Larimer, and Eun-Young Mun, A tutorial on cognitive diagnosis modeling for characterizing mental health symptom profiles using existing item responses, Prev. Sci. 24, 480 (2023).
[54] Qianru Liang, Jimmy de la Torre, Mary E. Larimer, and Eun-Young Mun, Mental health symptom profiles over time: A three-step latent transition cognitive diagnosis modeling analysis with covariates, in Dependent Data in Social Sciences Research (Springer, Cham, 2024), pp. 539–562, 10.1007/978-3-031-56318-8_22.
[55] Justin Paulsen, Dubravka Svetina, Yanan Feng, and Montserrat Valdivia, Examining the impact of differential item functioning on classification accuracy in cognitive diagnostic models, Appl. Psychol. Meas. 44, 267 (2020).
[56] Jimmy de la Torre and Chia-Yi Chiu, A general method of empirical Q-matrix validation, Psychometrika 81, 253 (2016).
[57] Peter M. Bentler, Comparative fit indexes in structural models, Psychol. Bull. 107, 238 (1990).
[58] Genaro Zavala, Santa Tejeda, Pablo Barniol, and Robert J. Beichner, Modifying the test of understanding graphs in kinematics, Phys. Rev. Phys. Educ. Res. 13, 020111 (2017).
[59] Andrew Pawl, Analia Barrantes, Carolin Cardamone, Saif Rayyan, and David E. Pritchard, Development of a mechanics reasoning inventory, AIP Conf. Proc. 1413, 287 (2012).
[60] Alicen Morley, Jayson M. Nissen, and Ben Van Dusen, Measurement invariance across race and gender for the force concept inventory, Phys. Rev. Phys. Educ. Res. 19, 020102 (2023).
[61] Adrienne Traxler, Rachel Henderson, John Stewart, Gay Stewart, Alexis Papak, and Rebecca Lindell, Gender fairness within the force concept inventory, Phys. Rev. Phys. Educ. Res. 14, 010103 (2018).
[62] Rachel Henderson, Paul Miller, John Stewart, Adrienne Traxler, and Rebecca Lindell, Item-level gender fairness in the force and motion conceptual evaluation and the conceptual survey of electricity and magnetism, Phys. Rev. Phys. Educ. Res. 14, 020103 (2018).
[63] Jayson M. Nissen, Ian Her Many Horses, Ben Van Dusen, Manher Jariwala, and Eleanor W. Close, Tools for identifying courses that support development of expertlike physics attitudes, Phys. Rev. Phys. Educ. Res. 17, 013103 (2021).
[64] Stephen Kanim and Ximena C. Cid, Demographics of physics education research, Phys. Rev. Phys. Educ. Res. 16, 020106 (2020).