Atm 07 23 796
Page 1 of 96
Abstract: This article is the first section of a 16-part methodology series on the construction of clinical prediction models. This first section introduces the concept, current application status, construction methods and processes, and classification of clinical prediction models, as well as the necessary conditions for conducting such research and the problems currently faced. The second section concentrates on variable screening methods in multivariate regression analysis. The third section introduces the construction of prediction models based on logistic regression and the drawing of nomograms. The fourth section concentrates on the Cox proportional hazards regression model and nomogram drawing. The fifth section introduces the calculation of C-statistics in the logistic regression model. The sixth section introduces two common methods for calculating the C-index in Cox regression with R. The seventh section focuses on the principle and calculation of the net reclassification index (NRI) using R. The eighth section focuses on the principle and calculation of the integrated discrimination improvement (IDI) using R. The ninth section explores the evaluation of clinical utility after model construction: decision curve analysis (DCA). The tenth section supplements the ninth and introduces decision curve analysis for survival outcome data. The eleventh section discusses the external validation of the logistic regression model. The twelfth discusses the in-depth evaluation of the Cox regression model with R, including calculating the concordance index of discrimination (C-index) in the validation data set and drawing the calibration curve. The thirteenth section introduces how to handle survival outcome data with competing risk models in R. The fourteenth section introduces how to draw the nomogram of a competing risk model with R. The fifteenth section discusses the identification of outliers and the imputation of missing values. The sixteenth section introduces advanced variable selection methods for linear models, such as ridge regression and LASSO regression.
© Annals of Translational Medicine. All rights reserved. Ann Transl Med 2019;7(23):796 | http://dx.doi.org/10.21037/atm.2019.08.63
Zhou et al. Clinical prediction models with R
Submitted Jun 05, 2019. Accepted for publication Aug 02, 2019.
doi: 10.21037/atm.2019.08.63
View this article at: http://dx.doi.org/10.21037/atm.2019.08.63
Introduction to Clinical Prediction Models

This section mainly introduces the concept, current application status, construction methods and processes, and classification of clinical prediction models, the necessary conditions for conducting such research, and the current problems.

Background
common in cohort studies. There are many similarities among the diagnostic model, the prognostic model and the disease occurrence model. Their outcomes are often dichotomous data, and their effect indicators are the absolute risks of outcome occurrence, that is, the probability of occurrence, rather than relative effect indicators such as the relative risk (RR), odds ratio (OR) or hazard ratio (HR). At the technical level, in all of these models researchers face the selection of predictors, the establishment of modeling strategies, and the evaluation and verification of model performance.

Applications of clinical prediction models

As described in the background, clinical prediction models are widely used in medical research and practice. With the help of clinical prediction models, clinical researchers can select appropriate study subjects more accurately, patients can make choices more beneficial to themselves, doctors can make better clinical decisions, and health management departments can better monitor and manage the quality of medical services and allocate medical resources more rationally. The effects of clinical prediction models are reflected in almost every level of the three-level disease prevention system:

Primary prevention of disease
The clinical prediction model can provide patients and doctors with a quantitative risk value (probability) of developing a particular disease in the future based on the current health status, offering a more intuitive and powerful scientific tool for health education and behavioral intervention. For example, the Framingham cardiovascular risk score, based on the Framingham heart studies, clarified that lowering blood lipids and blood pressure could prevent myocardial infarction (7).

Secondary prevention of disease
Diagnostic models often use non-invasive, low-cost and easy-to-acquire indicators to construct diagnostic tools with high sensitivity and specificity, putting into practice the idea of "early detection, early diagnosis, early treatment", which has important health-economic significance.

Tertiary prevention of disease
The prognostic model provides quantitative estimates of the probabilities of disease recurrence, death, disability and complications, guiding symptomatic treatment and rehabilitation programs, preventing disease recurrence, reducing mortality and disability, and promoting functional recovery and quality of life.

There are several mature prediction models in clinical practice. For example, the Framingham, QRISK, PROCAM and ASSIGN scores are all well-known prediction models. The TNM staging system for malignant tumors is the most representative prediction model. The biggest advantage of TNM is that it is simple and fast, and its greatest problem is that the prediction is not accurate enough, which falls far short of clinicians' expectations. The need for predictive tools in clinical practice goes far beyond predicting disease occurrence or patients' prognosis. If we can predict the patient's disease status in advance, it may guide treatment. For example, for patients with liver cancer, if we can predict in advance whether there is microvascular infiltration, it may help surgeons choose between standard resection and extended resection, which are completely different procedures. Preoperative neoadjuvant radiotherapy and chemotherapy is the standard treatment for T1-4N+ middle and low rectal cancer. However, it has been found in clinical practice that the lymph node status estimated from preoperative imaging examinations is not accurate enough, with a high proportion of false positives and false negatives. Is it possible to predict the patient's lymph node status accurately based on known characteristics before radiotherapy and chemotherapy? These clinical problems might be solved by constructing a suitable prediction model.

Research approach of clinical prediction models

Clinical prediction models are not as simple as fitting a statistical model. From establishment through verification, evaluation and application, a clinical prediction model involves a complete research process. Many scholars have discussed the research approaches of clinical prediction models (8-11). The journal Heart recently published a review in which the authors used a cardiovascular disease (CVD) risk score as an example to explore, with the help of visual graphics, how to construct a disease prediction model, and proposed six important steps (12):
(I) Select a data set of predictors as potential CVD influencing factors to be included in the risk score;
(II) Choose a suitable statistical model to analyze the relationship between the predictors and CVD;
(III) Select the variables from the existing predictors that are significant enough to be included in the risk score;
(IV) Construct the risk score model;
(V) Evaluate the risk score model;
(VI) Explain the applications of the risk score in clinical practice.
The author combined literature reports and personal research experience and summarized the research steps as shown in Figure 1.

Determination of the clinical problem and selection of the research type

Clinical prediction models can answer questions about etiology, diagnosis, patients' response to treatment and prognosis of diseases. Different research designs are required for different problems. For instance, with regard to etiology studies, a cohort study can be used to predict whether a disease occurs based on potential causes. Questions about diagnostic accuracy are suitable for a cross-sectional design, as the predictive factors and outcomes occur at the same time or within a short period. To predict patients' response to treatment, a cohort study or randomized controlled trial (RCT) can be applied. For prognostic problems, a cohort study is suitable, as there is a longitudinal time logic between predictors and outcomes.
A cohort study assessing etiology requires rational selection of study subjects and control of confounding factors. In studies of diagnostic models, a "gold standard", or reference standard, is required to diagnose the disease independently, and the reference standard diagnosis should be performed blindly. That is to say, the reference standard diagnosis cannot rely on information about the predictors in the prediction model, so as to avoid diagnostic review bias. Assessing patients' response to treatment is a type of interventional research; here it is likewise necessary to select study subjects rationally and control the interference of non-test factors. In studies of prognostic models, there is a longitudinal relationship between predictors and outcomes, and researchers usually expect to observe the outcome of the disease in its natural course, so the prospective cohort study is the most common and the best research design for prognostic models.

Establishment of the study design and implementation protocol, data collection and quality control

A good study design and implementation protocol are needed. First, we need to review the literature to determine the number of prediction models to be constructed:
(I) At present, there is no prediction model for the specific clinical problem. To construct a new model, generally a training set is required to construct the model and a validation set to verify the prediction ability of the model.
(II) There are already prediction models at present. To construct a new model, a training set is applied to build the new model and the same validation set is applied to verify the prediction ability of the existing model and the new model, respectively.
(III) To update existing models, the same validation set is used to verify the prediction ability of the two models.
With regard to the generation of training and validation data sets, data can be collected prospectively or retrospectively; data sets collected prospectively are of higher quality. For the modeling population, the sample size is expected to be as large as possible. For prospective clinical studies, the preparation of relevant documents includes the research protocol, the researcher's operation manual, the case report form, and the ethical approval document. Quality control and management of data collection should also be performed. If data are collected retrospectively, the data quality should be evaluated, outliers should be identified, and missing values should be properly processed, for example by imputation or deletion. Finally, the training data set for modeling and the validation set for verification are determined according to the actual situation. Sometimes, for practical reasons, we can only model and verify in the same data set, which is allowed, but the external applicability of the model will be affected to some extent.

Establishment and evaluation of clinical prediction models

Before establishing a prediction model, it is necessary to clarify the predictors reported in the previous literature, determine the principles and methods for selecting predictors, and choose the type of mathematical model to be applied. Usually a parametric or semi-parametric model is used, such as a logistic regression model or a Cox regression model. Sometimes machine learning algorithms are used to build models, and most of these are non-parametric. Because there are no parameters such as regression coefficients, the clinical interpretation of such non-parametric models is difficult. The model is then fitted and its parameters estimated. It is necessary to determine the presentation form of the prediction model in advance. Currently, four forms are commonly used for prediction models:
(I) Formula. Use mathematical formulas directly as
[Figure 1 appears here. Recoverable labels from the flow chart: variable screening methods (ridge regression, elastic net; cluster analysis: hierarchical clustering, K-means clustering, PAM); parameterized models (generalized linear models: logistic regression, Poisson regression; discriminant analysis, KNN, SVM); validation-set evaluation (discrimination: C-statistics/C-index; calibration: calibration plot); if two models already exist, validate both in one data set (C-statistics, calibration plot, DCA curve).]
Figure 1 The flow chart of construction and evaluation of clinical prediction models.
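Figure 1 lists discrimination (C-statistics, C-index) among the evaluation steps. As a minimal, language-agnostic illustration of the idea (this is not code from the series, which uses R), the following Python sketch computes the C-statistic of a set of predicted risks for a binary outcome as the proportion of event/non-event pairs ranked concordantly, with ties counted as one half:

```python
from itertools import product

def c_statistic(risks, outcomes):
    """C-statistic for a binary outcome: the proportion of
    (event, non-event) pairs in which the event subject received
    the higher predicted risk; ties count as 0.5."""
    events = [r for r, y in zip(risks, outcomes) if y == 1]
    nonevents = [r for r, y in zip(risks, outcomes) if y == 0]
    if not events or not nonevents:
        raise ValueError("need both events and non-events")
    concordant = sum(
        1.0 if re > rn else 0.5 if re == rn else 0.0
        for re, rn in product(events, nonevents)
    )
    return concordant / (len(events) * len(nonevents))

# Toy example: higher predicted risks mostly go with events.
risks = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
outcomes = [1, 1, 0, 1, 0, 1, 0, 0]
print(c_statistic(risks, outcomes))  # → 0.8125
```

For a binary outcome this pairwise proportion equals the area under the ROC curve; in practice it is computed with established statistical packages rather than by hand, as later sections of the series show in R.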
scores by appropriate mathematical operations, and to make it into a website for online use.
(IV) Scoring system. The regression coefficients of the regression model are transformed into a quantifiable scoring system through appropriate mathematical operations.
The first form is mainly for the linear regression model, which is a deterministic regression. The latter forms are based on parametric or semi-parametric models, and their statistical nature is a visual representation of the model parameters. Researchers can make a choice based on the actual conditions. After the model is built, how do we evaluate its pros and cons? The evaluation and verification of the model require more advanced statistical techniques. For example, the discrimination, calibration, clinical effectiveness and other indicators of the prediction model are evaluated to determine its performance.

Validation of clinical prediction models

The performance of a prediction model is prone to change as the scenario and the population change. Therefore, a complete study of a prediction model should include validation of the model. Validation covers the internal validity and external validity of the model. Internal validity reflects the reproducibility of the model and can be assessed through cross-validation and Bootstrap using the data of the study itself. External validity reflects the generalizability of the model and needs to be assessed with data sets not from the study itself, which are temporally and geographically independent, or completely independent.
Internal and external validation of the model are necessary steps to assess its stability and applicability. The data sets for internal and external validation should be heterogeneous, but not to an excessive extent. Generally, data from the original institution are used as the training set to build the model, and a part of the internal data are randomly selected for internal validation. Data from other institutions are selected as the external verification data set. Of course, it is best to perform validation on an external data set. I will introduce several methods to verify internal validity.
(I) Split-half method. Randomly divide the existing data into two parts, one for building the model and the other for validating it; this semi-division provides an "internal verification". Since only half of the data is used to build the model, the model is relatively unstable, so studies with small sample sizes are not suitable for this method.
(II) Cross-validation method. This method is a further evolution of the split-half method. Half-fold (two-fold) cross-validation and ten-fold cross-validation are commonly used. Half-fold cross-validation divides the original data into two parts, one for establishing and the other for validating the model; the roles of the two parts are then exchanged so that they mutually verify each other. Ten-fold cross-validation divides the data into ten parts, using nine parts to establish the model and the remaining part to verify it. By establishing and verifying the model ten times in this way, a relatively stable model can be constructed.
(III) Bootstrap method. The conventional Bootstrap internal validity analysis randomly samples a certain number of cases with replacement from the original data set to build a model, and then uses the original data set to verify the model. By repeating the random sampling, establishment and validation 500–1,000 times, 500–1,000 models can be obtained and the parameter distributions of the model can be summarized; the final parameter values of the model can therefore be determined. The Bootstrap method is a fast-developing method of recent years, made practical by the growth of computing power. It has been shown that models obtained through this method have higher stability than those from the previous two methods. It can be expected that the Bootstrap method will be increasingly applied to the internal validity analysis of prediction models. Of course, if conditions permit, we should perform external validation of prediction models as much as possible to improve their external applicability.

Assessment of clinical effectiveness of clinical prediction models

The ultimate question for a clinical prediction model is whether it changes the behaviors of doctors and patients and improves patients' outcomes or cost-effectiveness; this is the clinical effectiveness study of the clinical prediction model. From the methodological point of view, generally the training set and the validation set are divided according to the new prediction model. For example, for predicting a dichotomous outcome, we can assess the clinical effectiveness by assessing the sensitivity and
of RCT and ignored the great value of real-world data. RCT data undoubtedly have the highest quality, but the data have been strictly screened, so the extrapolation of the evidence is limited. Real-world data come from daily clinical practice, reflect the efficacy of clinical interventions more comprehensively, and provide evidence with better external applicability. However, the biggest problems of real-world studies are that data quality varies wildly and there are too many confounding factors that are difficult to identify. Therefore, more sophisticated statistical methods are necessary to find the truth among the complicated confounding factors. It is not easy to sift sand for gold, and a solid statistical foundation serves as the sieve. Here we need to understand that confounding factors exist objectively, because no clinical outcome is the result of a single factor. There are two levels of correction for confounding factors. One is correction at the experimental design stage, the top-level correction, such as equalizing confounding factors between groups through randomization and a sufficient sample size. This is also the reason why the RCT is popular: as long as the sample size is sufficient and the randomization is correct, the problem of confounding factors is solved once and for all. The second is post hoc correction through statistical methods, which is obviously not as thorough as the RCT correction, but is closer to the real situation of our clinical practice.
(III) Sample size. Because there are many confounding factors in real-world research, a certain sample size is necessary to achieve sufficient statistical power to discern the influence of confounding factors on the outcome. A simple and feasible principle for screening variables by multivariate analysis is that for each variable included in the multivariate analysis there should be 20 samples with the endpoint, which is called the "1:20 principle" (13,14).
(IV) Clinical research insight. The purpose of constructing a clinical prediction model is to solve clinical problems. The ability to discover valuable clinical problems is an insight cultivated through wide reading and clinical practice.

Issues currently faced in the development of prediction models

(I) Low clinical conversion rate. The main reason is that the clinical application of a prediction model needs to balance accuracy against simplicity. Imagine a model as easy to use as TNM staging but more accurate than TNM staging: what choice would you make?
(II) Most clinical prediction models have been constructed and validated on retrospective data sets, and validation is rarely performed on prospective data. Therefore, the stability of the results predicted by these models is comparatively poor.
(III) Validation of most clinical prediction models is based on internal data. Most articles have only one data set; even when there are two data sets, one to construct and the other to validate, the two often come from the same research center. If the validation of a prediction model can be further extended to the data set of another research center, the application value of the model will be greatly expanded. This work is extremely difficult and requires multi-center cooperation. Moreover, most domestic centers do not have a complete database for validation, which comes back to the topic of "database importance" discussed earlier.

Brief summary

The original intention of the clinical prediction model is to predict the status and prognosis of diseases with a small number of easily collected, low-cost predictors. Therefore, most prediction models are short and refined. This was logical and rational in an era when information technology was underdeveloped and data collection, storage and analysis were costly. However, with economic development and the advancement of technology, the costs of data collection and storage have been greatly reduced and data analysis technology keeps improving. Therefore, clinical prediction models should also break through this inherent concept and apply larger amounts of data (big data) and more complex models and algorithms (machine learning and artificial intelligence) to serve doctors, patients and medical decision makers with more accurate results.
In addition, from the perspective of a clinical doctor conducting clinical research, the following four principles should be grasped when conducting research on clinical prediction models:
(I) Building a better clinical prediction model is also an inherent requirement of precision medicine;
(II) How to get high-quality data? Database
construction is the core competitiveness, while the prediction model is only a technical method;
(III) We need to recognize that the RCT is as important as the real-world study; both are ways to provide reliable clinical evidence;
(IV) Validation of the models requires internal and external cooperation, so we should strengthen internal scientific research cooperation and improve awareness of multi-center scientific research cooperation.

Variable screening methods in multivariate regression analysis

Background

Linear regression, logistic regression and the Cox proportional hazards regression model are very widely used multivariate regression analysis methods. We have introduced the computational principles, application of associated software and result interpretation of these three methods in our book Intelligent Statistics (15). However, we said little about independent variable screening methods, which has caused practitioners confusion during data analysis and article writing. This section focuses on that topic.
When practitioners are confronted with problems during independent variable screening, they turn to statisticians, who will suggest applying the automatic screening procedures in software such as logistic regression and Cox regression in IBM SPSS, which offer seven methods for variable screening, as follows (16,17):
(I) Conditional parameter estimation likelihood ratio test (Forward: Condition);
(II) Likelihood ratio test of maximum partial likelihood estimation (Forward: LR);
(III) Wald chi-square test (Forward: Wald);
(IV) Conditional parameter estimation likelihood ratio test (Backward: Condition);
(V) Likelihood ratio test of maximum partial likelihood estimation (Backward: LR);
(VI) Wald chi-square test (Backward: Wald);
(VII) Enter method (all variables included; the default method).
In clinical research reports, many authors actually adopt another screening approach, which I am going to discuss: they first perform univariate regression analysis of each variable one by one, and those with a P value below 0.1 are included in the regression model (the threshold could also be 0.05 or 0.2; in common practice it ranges between 0.05 and 0.2). This method is very controversial. For practitioners, how to choose a better method is a real and open question. To be honest, there is no standard answer, but we still have some rules for better variable screening:
(I) When the sample size is large enough and the statistical test power is sufficient, you can choose one of the six stepwise screening methods mentioned above. Here we introduce a rule that helps you evaluate the test power quickly: 20 samples (events) should be available for each variable. For example, in a Cox regression analysis, if we include 10 variables associated with prognosis, at least 200 patients reaching the endpoint event, such as death, should be recruited (200 dead patients should be included, not 200 patients in total), because samples without the endpoint event are not considered effective samples for the test (13,14).
(II) When the sample size does not meet the first condition, or statistical power is insufficient for some other reason, the screening method widely used in most clinical reports should be applied: first perform univariate regression analysis of each variable one by one, then include those with a P value below 0.2 in the regression model. As mentioned before, this method is quite controversial despite its wide application.
(III) Even the second screening method will be challenged in practice. Sometimes we find that variables clearly associated with prognosis are excluded because they fail the pre-specified screening criteria. For example, in a prostate cancer prognosis study, the authors may find that the Gleason score is not significantly associated with prognosis in the screening model, while the Gleason score is a confirmed prognostic factor for prostate cancer in previous studies. What should we do then? In our opinion, for professional and clinical reasons we should include such variables in the analysis even though they are disqualified by the statistical screening method.
To sum up, the author recommends the third variable screening method: the univariate analysis results, clinical reasons, sample size and statistical power should be considered at the same time. We will explain it in detail below.
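The recommended combination of rules (univariate pre-screening at a liberal P threshold, forcing in clinically established factors such as the Gleason score, and checking the 1:20 events-per-variable principle) can be sketched as one small routine. The following Python sketch illustrates the logic only, with hypothetical variable names; it is not code from the series, and the univariate P values are assumed to have been computed beforehand:

```python
def screen_variables(univariate_p, forced_in, n_events,
                     events_per_variable=20, p_threshold=0.2):
    """Combine the three rules discussed above:
    1. keep predictors whose univariate P value is below p_threshold;
    2. always keep clinically established predictors (forced_in),
       regardless of their univariate P value;
    3. flag a violation of the 1:20 principle, i.e. fewer than
       ~20 endpoint events per retained variable."""
    selected = [v for v, p in univariate_p.items() if p < p_threshold]
    for v in forced_in:  # clinical knowledge overrides statistics
        if v not in selected:
            selected.append(v)
    max_vars = n_events // events_per_variable
    too_many = len(selected) > max_vars
    return selected, too_many

# Hypothetical univariate results from a prognosis study.
p_values = {"age": 0.03, "stage": 0.001, "gleason": 0.35, "smoking": 0.60}
selected, too_many = screen_variables(p_values, forced_in=["gleason"], n_events=120)
print(selected)  # → ['age', 'stage', 'gleason']
print(too_many)  # 120 events support up to 6 variables → False
```

Here "gleason" is retained despite its non-significant univariate P value, mirroring the third rule above, while "smoking" is dropped; the event count is then checked against the 1:20 principle.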
Disputes and consensus

The discussion about variable screening has been going on for a long time. Statisticians consider it from a very professional perspective, but clinical doctors do not always stick to their suggestions. It is very hard to distinguish right from wrong for the practical problems encountered in clinical studies, such as a small sample size or limited knowledge of the exact factors driving the outcome. However, we still have some standards for reference. In a review of good-quality clinical studies published in top journals, five considerations arise during variable screening (18):
(I) Clinical perspective. This is the most basic consideration for variable screening. Medical statistical analysis can be meaningless if it is statistical analysis alone. From the professional point of view, confirmed factors significantly associated with the outcome, such as the Gleason score in prostate cancer, should be included in the regression model; we do not need to consider their statistical disqualification in variable screening.
(II) Screening based on univariate analysis. Independent variables are included in the multivariate analysis based on the results of the univariate analysis: variables with a significant P value are entered into the multivariate regression. If P<0.1 we may regard it as "significant"; sometimes P<0.2 or P<0.05 is used instead. The threshold can vary with the sample size: a big sample size goes with a small P value threshold, and a small sample size with a relatively big one. This screening method is quite common in published articles, even in top journals. Though it has been questioned by statisticians, it is still used because no more precise and scientific option is available at present; even statisticians have not found a better replacement. Until we find a better option, it is better to have this one than none.
(III) The variables are chosen based on the influence of a confounding factor "Z" on the test factor or exposure factor "X". Specifically, we observe whether the effect of "X" on the dependent variable "Y" changes when "Z" is added. First run the basic model that only includes "X" and record the regression coefficient β1, then add "Z" to the model to see how much β1 changes. It is generally considered that if the β1 change exceeds 10%, the variable needs to be adjusted for; otherwise it does not. This method differs from the second one in that it quantifies the effect of the confounding factor. It is not perfect, because the effects that "Z" and "X" exert on "Y" could themselves be affected by other confounding factors, a thought that may lead to logical confusion; this complicated methodological problem is left for further exploration. In our opinion, this is an acceptable option for variable screening, especially for studies with very specific targets: we can confirm the effect of "X" on "Y", this effect is real, and what we do is regulate these confounding factors.
(IV) Choosing the right number of variables to be included in the final model is very important. This is a realistic problem. If the sample size is large enough and the statistical power is sufficient, we can use the variable screening methods provided by the statistical software to filter the variables automatically and identify those that are statistically independent predictors of the outcome. But "the ideal is plump while the reality is skinny": sometimes we want to consider many variables while the sample size is rather small. We then have to compromise between statistical power and variable screening, and a compromise can bear better results (13,14).
(V) Above we listed four commonly used variable screening approaches. Many other methods, such as those based on model parameters (coefficient of determination R², AIC, log-likelihood, C-statistics, etc.), can also be an option. The sheer number of variable screening methods is good evidence that no single best method is available in practice. This article aims to help us find the right screening method rather than to crown the best or worst one; choosing the fittest one according to the actual conditions is the goal of this article.

The methods for recruiting different types of variables

Continuous variables
For a continuous variable, there is a good protocol for reference. If the relationship between the variable and the outcome is linear, you can include the continuous variable in the regression model directly. If not, you can transform it into
© Annals of Translational Medicine. All rights reserved. Ann Transl Med 2019;7(23):796 | http://dx.doi.org/10.21037/atm.2019.08.63
Annals of Translational Medicine, Vol 7, No 23 December 2019 Page 11 of 96
a dichotomous variable or an ordinal categorical variable and then put it into the regression formula. We have already changed continuous variables into categorical variables in this way. We do this transformation because the variable may not be linearly related to the outcome; some relationship other than a linear one may be present.

Continuous variable transformation summary
When a continuous variable is included in a regression model, the original variable should be used as far as possible, but actual needs should also be considered. The variable can be transformed according to certain rules: two-category grouping, aliquot grouping, equidistant grouping, and grouping by clinical cut-off value are all used to obtain a better professional explanation. By optimal truncation point analysis, we convert continuous variables into categorical variables and introduce them into the regression model as dummy variables. A continuous variable can thus be presented in a regression model in different ways; we will give specific examples below. No matter which way is chosen, the general principle is that the transformation should make professional interpretation and understanding easier (19-21).

Normal transformation
For continuous variables that follow a normal distribution, this is not a problem. However, when we are confronted with data that do not fit a normal distribution, we can apply a transformation based on some function so that the data are normalized and fit the regression model. The original data can be normalized by different functions, such as the square root, LnX, Log10X and 1/X, according to their own character. If you have normalized the original data, you should interpret the transformed variable rather than the original one in the regression model, or you can recover the effect of the original independent variable on the original dependent variable through the function used in the transformation.
For example, the authors performed a normality test in the article they published in JACC, 2016 (21). The original expression is as follows: "Normality of continuous variables was assessed by the Kolmogorov-Smirnov test." Methods of normality testing include using parameters of the data distribution (the skewness and kurtosis values), using plots of the data distribution (histogram, P-P plot, Q-Q plot), or applying nonparametric tests (Shapiro-Wilk test, Kolmogorov-Smirnov test) to help us evaluate the normality of the data. In their research, variables such as troponin I, NT-proBNP, and corin did not fit a normal distribution, so the authors described the baseline characteristics of the recruited subjects by median (first quartile – third quartile). For example, the median of troponin I is 4.5 (1.8–12.6) ng/mL. Multivariable linear regression was then performed to analyze corin. The original expression is as follows: "multiple linear regression analysis was applied to determine factors influencing corin levels. Levels of troponin I, NT-proBNP, and corin were normalized by Log10 transformation." That is, troponin I, NT-proBNP and corin were normalized with the Log10 function and then entered into the multivariable linear regression. The authors then performed Cox regression. Though there is no such requirement for Cox regression, the Log10 function was again used to normalize troponin I, NT-proBNP and corin, and all three variables were included in the same form as in the multivariable linear regression model, for consistency.

Transformation for each change of fixed increment
If a continuous variable is introduced into the model directly in its original form, the regression parameter is interpreted as the change in the dependent variable caused by each one-unit change. However, sometimes the effect of a one-unit change may be weak. Therefore, we can transform the continuous independent variable into a categorical variable by a fixed interval, in an equidistant grouping, and then introduce it into the model for analysis. This grouping makes the results easier for patients to understand and apply. For example, suppose we include patients aged 31 to 80 years. We can divide age into the groups 31–40, 41–50, 51–60, 61–70 and 71–80, according to a 10-year interval, and then enter the resulting five groups into the model as dummy variables for analysis. However, if the variable has a very wide range, grouping by the methods mentioned above will lead to too many groups and too many dummy variables, which is redundant in the analysis and very hard to interpret clinically. In the opposite case, data with a small range cannot be grouped further and so cannot be transformed into a categorical variable either. What should we do when confronted with these two situations?
Here, we can refer to an article published in JACC, 2016 (19). In the model, the authors used "per" many times, such as per 5% change, per 0.1 U, or per 100 mL/min. This is the transformation of continuous variables by a fixed increment of change, which is presented as "per +
interval + unit". We will illustrate two examples from this article. The mean of the oxygen uptake efficiency slope is 1,655 U, and the 5–95% population range runs from 846 to 2,800 U — a really wide range. If the original data were put into the formula, a per 1 U change would lead to a very weak change in HR, which is meaningless in clinical practice. If it were transformed into a categorical variable, many groups would appear. So the authors entered a per 100 U change into the model and found that the mortality risk decreases by 9% (HR =0.91, 95% CI: 0.89–0.93) when the oxygen uptake efficiency slope increases by 100 U. Another example is the variable peak RER. The median is 1.08 U, and the 5–95% population range runs from 0.91 to 1.27 U — a really small range. If the original data were put into the formula, a per 1 U change would lead to a very big change in HR; in clinical practice, patients with a change of 1 U are quite rare, and this result would be of limited practicality. It is also very hard to transform into a categorical variable because of the small range. So the authors entered a per 0.1 U change into the model and found that the mortality risk decreases by 6% (HR =0.94, 95% CI: 0.86–1.04) when peak RER increases by 0.1 U; however, this is not statistically significant.
Then, how can we do this transformation? If we want to change the scale from each 1 unit to each 100 units, the unit becomes 100 times larger: we only need to divide the original variable by 100 and then include it in the model. Similarly, if we want to change the scale from 1 unit to 0.1 unit, the unit is reduced by 10 times: it is only necessary to multiply the original variable by 10 and include it in the regression model.

Transformation of each standard deviation
In clinical studies we meet another transformation: the independent variable changing per SD increase. Let us look at an article published in JACC in 2016 (20). Age and systolic blood pressure are included in the model as per SD increase. When age increases by one SD, the risk of atherosclerotic cardiovascular disease (ASCVD) increases by 70% (HR =1.70, 95% CI: 1.32–2.19); when systolic blood pressure (SBP) increases by one SD, the risk of ASCVD increases by 25% (HR =1.25, 95% CI: 1.05–1.49). Here the authors have put continuous variables into the model in the form of per SD increase. Assuming the variables fit a normal distribution, the area within the interval mean ±1 SD is 68.27%, the area within mean ±1.96 SD is 95%, and the area within mean ±2.58 SD is 99%. We can tell that if the data span about 4 SD, about 95% of the samples will be covered. Therefore, for new variables, especially rare ones whose clinical interpretation is still unclear, we can put per SD into the model. This can guide the patient to see within how many standard deviations of the population distribution his or her actual measurement lies, and then to assess how much the corresponding risk will change.
It is very simple to do this kind of transformation. We can do it in these two ways:
(I) Before constructing the regression model, standardize the original continuous variables, and bring the standardized independent variables into the regression model. The regression coefficient obtained is then the effect on the dependent variable of each 1-SD change in the independent variable. (Attention: only the independent variables are standardized here.)
(II) If the original variables are not standardized, they can be brought into the model directly, giving the unstandardized coefficients; the unstandardized coefficient is then multiplied by the standard deviation of the independent variable, which gives the so-called standardized coefficient. This is the effect on the dependent variable of each additional SD of the independent variable.

Rank variable
Rank variables are very common. A rank variable is a kind of ordered multi-category variable: several values may be present in the same variable, and these values are rank-correlated with each other. For example, the grade of hypertension (0= normal, 1= high normal, 2= Grade 1, 3= Grade 2, 4= Grade 3), the level of urine protein (0=−, 1=±, 2=+, 3=++, 4=+++, 5=++++), and the effect of a drug (invalid, improvement, cure) are all rank variables. A rank variable differs from a non-ordered multi-category variable in that it increases or decreases monotonically. When ordered multi-category variables enter a Logistic regression model, they should not be brought in directly as continuous variables unless each one-unit change leads to the same risk ratio change in the outcome; mostly, the change is not so ideal. So we suggest treating ordered multi-category variables as dummy variables, so that each level can be compared with the reference. When the outcome is not linearly related, optimal scale regression can be used to explore the inflection point of the effect.
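The per-fixed-increment and per-SD rescalings above are exact algebraic identities, which can be verified in base R with lm() on simulated data (the variable names and numbers below are invented for illustration, not taken from the cited articles):

```r
set.seed(3)
x <- rnorm(300, mean = 1655, sd = 400)   # e.g., oxygen uptake efficiency slope, in U
y <- 2 + 0.004 * x + rnorm(300)

b_per_1   <- coef(lm(y ~ x))[["x"]]            # effect per 1 U
b_per_100 <- coef(lm(y ~ I(x / 100)))[[2]]     # effect per 100 U
# Dividing the variable by 100 scales the coefficient up by exactly 100
all.equal(b_per_100, 100 * b_per_1)

# Per-SD form, way (I): standardize x first, then fit
b_per_sd <- coef(lm(y ~ scale(x)))[[2]]
# Way (II): multiply the unstandardized coefficient by sd(x)
all.equal(b_per_sd, b_per_1 * sd(x))
```

The same identities hold for the coefficients of logistic and Cox models, which is why the "per 100 U" and "per SD" hazard ratios can be obtained by simply rescaling the variable before fitting.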
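Entering a rank variable as dummy variables, as suggested above, amounts to wrapping it in factor() so that each level gets its own coefficient against the reference level. A minimal sketch with simulated data (the variable names and effect sizes are invented):

```r
set.seed(4)
# Urine protein level coded 0-5, an ordered multi-category (rank) variable
protein <- sample(0:5, 400, replace = TRUE)
# Simulate a binary outcome whose risk does NOT rise evenly per level
p <- plogis(-2 + c(0, 0.2, 0.3, 1.0, 1.4, 1.5)[protein + 1])
outcome <- rbinom(400, 1, p)

# As dummy variables: one coefficient per level vs. the reference level 0
fit_dummy <- glm(outcome ~ factor(protein), family = binomial)
length(coef(fit_dummy))   # intercept + 5 level coefficients

# As a continuous variable: a single slope, which forces the same
# odds-ratio change per one-level step - usually too strong an assumption
fit_cont <- glm(outcome ~ protein, family = binomial)
length(coef(fit_cont))    # intercept + 1 slope
```

Comparing the level-specific coefficients of fit_dummy is what reveals an uneven, non-linear risk pattern that the single slope of fit_cont would hide.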
[Figure: the workflow of clinical prediction model construction — Identify clinical issues → Determine research strategies → Determine the predictors → Determine the outcome → Construct the prediction model → Assessment of model discrimination ability → Assessment of model calibration → Assessment of clinical effectiveness.]
2= black, 3= other).

[Case 1] interpretation
In this case, the dependent variable is dichotomous (whether or not a low birth weight infant is delivered). The purpose of the study is to investigate the independent influencing factors of low birth weight infants, which is consistent with the application conditions of binary Logistic regression. As there is only one data set in this case, we use it as the training set to build the model and then use the Bootstrap resampling method to perform internal validation in the same data set. It should be noted that we could also randomly divide the data set into a training set and an internal validation set at a 7:3 ratio, but we did not do so considering the sample size. We will demonstrate the construction of the prediction model for low birth weight infants and the drawing of the Nomogram with R below. The data were collected and named "Lweight.sav", which is saved in the current working path of R. The data and code can be downloaded from the attachments in this Section for readers to practice. The specific analysis and calculation steps are as follows:
(I) Screen the independent influencing factors affecting low birth weight infants and construct a Logistic regression model;
(II) Visualize the Logistic regression model and draw a Nomogram;
(III) Calculate the discrimination (C-Statistics) of the Logistic model;
(IV) Perform internal validation with the resampling method and draw the Calibration curve.

[Case 1] R codes and results interpretation
Load the "foreign" package for importing external data in .sav format (IBM SPSS style data format), and load the rms package to build the Logistic regression model and to plot the nomogram (23):
library(foreign)
library(rms)
Import the external data in .sav format and name it "mydata". Then set the data to the structure of a data frame and display the first 6 lines of the data frame.
## 4 88 normal weight 21 108 white    smoking 0 no pih yes 2 2594
## 5 89 normal weight 18 107 white    smoking 0 no pih yes 0 2600
## 6 91 normal weight 21 124 other no smoking 0 no pih no  0 2622
Data preprocessing: set the outcome variable as a dichotomous variable, define "low birth weight" as "1", and set the unordered categorical variable "race" as dummy variables.
mydata$low <- ifelse(mydata$low =="low weight",1,0)
mydata$race1 <- ifelse(mydata$race =="white",1,0)
mydata$race2 <- ifelse(mydata$race =="black",1,0)
mydata$race3 <- ifelse(mydata$race =="other",1,0)
Load the data frame "mydata" into the current working environment and "package" the data using function datadist().
attach(mydata)
dd<-datadist(mydata)
options(datadist='dd')
Fit the Logistic regression model using function lrm() and present the results of the model fitting and the model parameters. Note: the parameter C under "Rank Discrim. Indexes" in the output can be read directly; this is the C-Statistics of model "fit1". According to the calculation results, the C-Statistics is 0.738 in this example. The meaning and calculation method of C-Statistics will be further explained in the following sections.
fit1<-lrm(low ~ age+ftv+ht+lwt+ptl+smoke+ui+race1+race2,
    data = mydata, x = T, y = T)
fit1
## Logistic Regression Model
##
## lrm(formula = low ~ age + ftv + ht + lwt + ptl + smoke + ui +
##     race1 + race2, data = mydata, x = T, y = T)
##
##             Model Likelihood     Discrimination    Rank Discrim.
##                Ratio Test           Indexes           Indexes
## Obs  189   LR chi2     31.12    R2     0.213     C      0.738
##  0   130   d.f.            9    g      1.122     Dxy    0.476
##  1    59   Pr(> chi2) 0.0003    gr     3.070     gamma  0.477
##            max |deriv|  7e-05   gp     0.207     tau-a  0.206
##                                 Brier  0.181
##
## Coef       S.E.    Wald Z  Pr(>|Z|)
## Intercept   1.1427 1.0873   1.05   0.2933
## age        -0.0255 0.0366  -0.69   0.4871
## ftv         0.0321 0.1708   0.19   0.8509
## ht=pih          1.7631 0.6894  2.56   0.0105
## lwt            -0.0137 0.0068 -2.02   0.0431
## ptl             0.5517 0.3446  1.60   0.1094
## smoke=smoking   0.9275 0.3986  2.33   0.0200
## ui=yes          0.6488 0.4676  1.39   0.1653
## race1          -0.9082 0.4367 -2.08   0.0375
## race2           0.3293 0.5339  0.62   0.5374
##
Use function nomogram() to construct the Nomogram object "nom1" and print the Nomogram. The result is shown in Figure 3.
nom1 <- nomogram(fit1, fun = plogis,fun.at = c(.001, .01, .05,
    seq(.1,.9, by = .1), .95, .99, .999),
    lp = F, funlabel = "Low weight rate")
plot(nom1)
[Figure 3 shows the Nomogram: each predictor of model "fit1" (age, ftv, ht, lwt, ptl, smoke, ui, race1, race2) is scored on a Points scale of 0–100, and the Total points axis (0–280) is converted to the predicted Low weight rate (0.05–0.8).]
Figure 3 Nomogram based on model "fit1".
##
## n=189 Mean absolute error=0.037 Mean squared error=0.00173
## 0.9 Quantile of absolute error=0.054
[Figure 4 shows the calibration curve of model "fit1": the Apparent, Bias-corrected and Ideal lines are plotted against Predicted Pr{low=1}; B=100 repetitions, boot, mean absolute error =0.035, n=189.]
Figure 4 Calibration curve based on model "fit1".
From the calculation results of the Logistic regression model fit1 above and Figure 3, it is obvious that the contribution of some predictors to the model is negligible, such as the variable "ftv". There are also some predictors that may not be suitable for entering the prediction model as dummy variables, such as "race", for which the clinical operation is cumbersome. We can consider properly converting the unordered categorical variables into dichotomous variables and involving them in the regression model. The adjusted codes are as follows:
First of all, according to the actual situation, we convert the unordered categorical variable "race" into a binomial variable. The standard of conversion is mainly based on professional knowledge: we classify "white" as one category and "black and other" as the other.
mydata$race <- as.factor(ifelse(mydata$race=="white", "white", "black and other"))
Use function datadist() to "package" the current data set again.
dd<-datadist(mydata)
options(datadist ='dd')
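The C-Statistics read from the lrm() output (0.738 for "fit1") is the probability that a randomly chosen event receives a higher predicted probability than a randomly chosen non-event. That definition can be sketched directly in base R; the toy predictions below are invented for illustration and are not the Case 1 data:

```r
# Toy predicted probabilities and observed outcomes (1 = event)
pred <- c(0.9, 0.8, 0.7, 0.4, 0.3, 0.2)
obs  <- c(1,   1,   0,   1,   0,   0)

# C-statistic: share of concordant event/non-event pairs, ties counted as 1/2
cstat <- function(p, y) {
  pe <- p[y == 1]; pn <- p[y == 0]          # event / non-event predictions
  d  <- outer(pe, pn, "-")                  # all event-vs-non-event pairs
  mean((d > 0) + 0.5 * (d == 0))            # concordance over the pairs
}
cstat(pred, obs)   # 8 of 9 pairs concordant -> 8/9, about 0.889
```

For a fitted lrm() model the same number is what the "C" entry of the printed Rank Discrim. Indexes reports.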
[Figure 5 shows the Nomogram based on model "fit2": the predictors ht, lwt, ptl, smoke and race (white vs. black and other) are scored on Points scales of 0–100, with Total points from 0 to 240.]
Exclude the variable "ftv", which contributes less to the result, from the regression model, then reconstruct model "fit2" and display the model parameters. It can be seen that … The result is shown in Figure 5.
nom2 <- nomogram(fit2, fun = plogis,fun.at = c(.001, .01, .05,
    seq(.1,.9, by=.1), .95, .99, .999),
[Figure 7 shows the Nomogram for model "fit": age (35–85) and sex (male/female) are scored on Points scales of 0–100, with Total points from 0 to 180 converted to the predicted Risk (0.45–0.9).]
##
## lrm(formula = status ~ age + sex, data = lung, x = T, y = T)
##
##             Model Likelihood     Discrimination    Rank Discrim.
##                Ratio Test           Indexes           Indexes
## Obs  228   LR chi2     16.85    R2     0.103     C      0.666
##  1    63   d.f.            2    g      0.708     Dxy    0.331
##  2   165   Pr(> chi2) 0.0002    gr     2.030     gamma  0.336
##            max |deriv|  2e-09   gp     0.138     tau-a  0.133
##                                 Brier  0.185
##
## Coef        S.E.    Wald Z  Pr(>|Z|)
## Intercept   -0.5333 1.0726  -0.50   0.6190
## age          0.0319 0.0170   1.87   0.0609
## sex=female  -1.0484 0.3084  -3.40   0.0007
##
Use function nomogram() to plot the Nomogram of the risk estimate for the Logistic regression model "fit", as shown in Figure 7.
nom <- nomogram(fit, fun= function(x)1/(1+exp(-x)),# fun=plogis
    lp = F, funlabel = "Risk")
plot(nom)
The graphic interpretation is the same as before.
Use the calibrate() function to construct the object of the calibration curve "cal" and print the calibration curve. The result is shown in Figure 8.
cal <- calibrate(fit, method = 'boot', B = 100)
plot(cal,xlim = c(0,1.0),ylim = c(0,1.0))
##
## n=228 Mean absolute error=0.014 Mean squared error=0.00034
## 0.9 Quantile of absolute error=0.032
[Figure 8 shows the calibration curve of model "fit": the Apparent, Bias-corrected and Ideal lines are plotted against Predicted Pr{status=2}; B=100 repetitions, boot, mean absolute error =0.019, n=228.]
Figure 8 Calibration curve based on model "fit".
The graphic interpretation is the same as before.

Brief summary
In summary, this section introduces the construction of the Logistic regression prediction model and the drawing of the Nomogram. It should be noted that to assess the practical
value of a prediction model, its operability should be considered as well as the accuracy of its prediction. In addition to internal validation, external validation is sometimes necessary. In this case, as external validation data were not obtained, the external validation process is not demonstrated, and validation is only performed in the original data set with the Bootstrap method.

Method of building nomogram based on Cox regression model with R

Background
Human beings are always crazy about "fortune-telling". Whether it is "fortune-telling" in Chinese culture or "astrology" in Western culture, both show people's enthusiasm for it. In this section, we will discuss another scientific kind of "fortune-telling": a model that assesses the prognosis of patients. As an oncologist, you will be confronted during clinical practice with questions like "how long will I survive" from patients suffering from cancer. It is really a gut-wrenching question. Mostly, we can give a median survival time based on the staging of the corresponding disease. Actually, clinical staging is the basis of our prediction for these patients; in other words, it is itself a "prediction model". We answer this question with the median survival time for the patient's clinical stage. This brings new problems, because it may not be accurate to predict the survival time of a specific individual by the median survival of a group of people. No one can tell whether this specific individual will enjoy a better or worse prognosis (15,17).
Is there any possibility that we can calculate the survival of every patient by a more accurate and scientific method? The answer is yes. We can first construct a mathematical model with the Cox proportional hazards model, and then visualize the parameters associated with the patient's survival with a Nomogram, which can calculate the survival probability of each patient relatively accurately. A Nomogram is, in essence, the visualization of a regression model. It sets scoring criteria according to the regression coefficients of all the independent variables and gives a score to each value of each independent variable, so that for each patient a total score can be calculated. A function then converts between the total score and the occurrence probability, and the predicted probability of each patient's outcome can be obtained. For example, we have a 40-year-old male pancreatic cancer patient who has undergone an operation. The clinical stage is IV. The tumor is located in the head of the pancreas, and intraoperative radiotherapy is applied. Peritoneal metastasis is present. We can calculate a total score from all this available information with a mathematical model: an age of 40 years can be scored 10 points, male gender 4 points, and so on… Finally, the total score is obtained. Different scores correspond to different survival probabilities at 3 months, 6 months and 1 year. The complicated Cox regression formula is now a visual graph. Practitioners can calculate the survival probability of each patient conveniently, and relatively accurate "fortune-telling" can be presented to each patient. In the previous episode, we talked about the Logistic regression Nomogram. The Cox regression Nomogram is quite similar to the Logistic Nomogram in interpretation (15,17).
Like the previous episode, the first question is when should we choose Cox regression? It is actually about choosing the method for multivariable analysis. If the outcome we are observing is survival, or a "time to event" outcome, we can choose the Cox regression model. We have already introduced how to screen variables in the 2nd section. We should also pay attention to the balance between the number of variables brought into the prediction model and the convenience and practicality of the model. We will show two examples of Nomogram construction with R; the detailed use of R will be presented here rather than the statistical principles behind it.

[Example 1]
Here we will use the data in [Example 1] to introduce the construction of a survival prediction model and the corresponding Nomogram. The original data have been simplified for better understanding and practice. The clinical data of 1,215 invasive breast cancer patients were downloaded from TCGA (https://genome-cancer.ucsc.edu/). We have simplified the original data by the steps in Table 1. The definition and assignment of the variables are presented in Table 2. We will try to construct a survival prediction model and the corresponding Nomogram for this cohort. Readers can download the original data and R code in the attachment file of this episode for practice.

[Example 1] analysis
This cohort is about the construction of a prognosis prediction model. The steps are as follows:
(I) Cox regression will be used and screening
independent prognostic factors based on the training set; the predictive model is built first. The data set used for modeling is generally referred to as the training set or internal data set. You can refer to the already published Intelligent Statistics and Crazy Statistics (15,17) for details about data entry, univariable Cox regression and multivariable Cox regression. Finally, we get three independent prognostic variables: age, PgR, Pathologic_stage.
(II) Build the Nomogram based on these three variables (these 3 have been treated as independent variables in this Cox model).
(III) Assess the discrimination efficiency of the model; the C-Index will be calculated.
(IV) Validate the model with an external data set. If an external data set is not available, bootstrap resampling based on the internal data set and a Calibration Plot are recommended for validation (22,24).
The building of the Cox regression model-based Nomogram, C-Index calculation, Bootstrap resampling and Calibration Plot are emphasized here. All processing can be done in R (R software download: https://www.r-project.org/). The processed data are saved as "BreastCancer.sav" under the current R working directory. The results are shown in Figures 9 and 10.

[Example 1] R codes and its interpretation
Load the rms package and the necessary helper packages.
library(foreign)
library(rms)
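Step (IV), bootstrap internal validation, can be sketched in base R: refit the model on resampled data, compare its bootstrap performance with its performance on the original data, and subtract the average optimism. The sketch below uses a simulated cohort and a logistic model for simplicity (all names and numbers are invented); rms::validate() and rms::calibrate() automate the same idea for lrm() and cph() models.

```r
set.seed(5)
# Simulated training data standing in for a real cohort
n <- 300
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-0.5 + 0.8 * x1 + 0.5 * x2))
dat <- data.frame(y, x1, x2)

# C-statistic (ties counted as 1/2)
cstat <- function(p, y) {
  d <- outer(p[y == 1], p[y == 0], "-")
  mean((d > 0) + 0.5 * (d == 0))
}

fit_full <- glm(y ~ x1 + x2, family = binomial, data = dat)
apparent <- cstat(predict(fit_full, type = "response"), dat$y)

# Bootstrap optimism: train on a resample, then test on the original data
optimism <- replicate(100, {
  b  <- dat[sample(n, replace = TRUE), ]
  fb <- glm(y ~ x1 + x2, family = binomial, data = b)
  cstat(predict(fb, type = "response"), b$y) -
    cstat(predict(fb, newdata = dat, type = "response"), dat$y)
})
corrected <- apparent - mean(optimism)   # optimism-corrected C-statistic
c(apparent = apparent, corrected = corrected)
```

The corrected value is almost always a little lower than the apparent one, which is exactly the over-optimism that internal validation is meant to expose.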
[Table 2 fragment: variable assignments, e.g., 3= simple mastectomy, 4= other method; 3= other.]
[Figure 9 shows the Nomogram based on the Cox model: Age (25–90), Pathologic_stage (Stage I–IV) and PgR (Negative/Positive) are scored on Points scales of 0–100; Total points (0–220) are converted to 1-year OS (0.95–0.80), 3-year OS (0.95–0.4) and 5-year OS (0.95–0.2).]
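The survival probabilities that the Figure 9 Nomogram reads off for a given total score can also be reproduced programmatically: a fitted Cox model plus survfit() yields S(t) for a new patient. As the "BreastCancer.sav" file is not reproduced here, the sketch below uses the built-in lung data from the survival package (shipped with R) instead, and the 60-year-old male patient is an invented example:

```r
library(survival)   # ships with R; provides coxph(), survfit() and the lung data

# A Cox model analogous to the one behind Figure 9 (different data set)
cox_lung <- coxph(Surv(time, status) ~ age + sex, data = lung)

# Predicted survival curve for a hypothetical new patient: 60-year-old male
new_patient <- data.frame(age = 60, sex = 1)
sf <- survfit(cox_lung, newdata = new_patient)

# 1-year overall survival: the value of S(t) at t = 365 days
os_1yr <- summary(sf, times = 365)$surv
round(os_1yr, 2)
```

This is exactly the quantity a reader obtains graphically by summing the patient's points in a Cox Nomogram and dropping down to the "1-year OS" axis.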
… package.
coxm <-cph(Surv(Months,Status==1) ~ Age+Pathologic_stage+PgR,
[Figure 10 shows the calibration curve of the model; the vertical axis is "Actual 5-year OS (proportion)".]
very good predicting value. In this example, the C-Index equals 0.7503 and se(C-Index) equals 0.02992. All results above are the direct output of the software (23).
Calculation of the complement of the C-Index:
library(Hmisc)
S<-Surv(breast$Months,breast$Status==1)
rcorrcens(S~predict(coxm),outx = TRUE)
##
## Somers' Rank Correlation for Censored Data    Response variable: S
##
##                  C    Dxy  aDxy   SD    Z  P    n
## predict(coxm) 0.22 -0.559 0.559 0.09 6.21  0  549
set. The lateral axis shows the predicted survival rate of each patient, while the vertical axis shows the actual survival rate of each patient. It is ideal if the red line in the picture exactly coincides with the blue dotted line.

[Example 2] analysis

[Example 2]
Survival in patients with advanced lung cancer from the North Central Cancer Treatment Group. Performance scores rate how well the patient can perform usual daily activities. There are 10 variables in total:
(I) inst: institution code;
(II) time: survival time in days;
(III) status: censoring status 1 = censored, 2 = dead;
(IV) age: age in years;
(V) sex: male =1, female =2;
(VI) ph.ecog: ECOG performance score (0= good, 5= dead);
(VII) ph.karno: Karnofsky performance score (bad =0 to good =100) rated by physician;
(VIII) pat.karno: Karnofsky performance score as rated by patient;
(IX) meal.cal: calories consumed at meals;
(X) wt.loss: weight loss in last six months.
## 1  3  306 2 74 1 1  90 100 1175 NA
## 2  3  455 2 68 1 0  90  90 1225 15
## 3  3 1010 1 56 1 0  90  90   NA 15
## 4  5  210 2 57 1 1  90  60 1150 11
## 5  1  883 2 60 1 0 100  90   NA  0
## 6 12 1022 1 74 1 1  50  80  513  0
You can further use the following command to display the variable descriptions in the lung dataset.
help(lung)
## starting httpd help server ... done
Variable tags can be added to the dataset variables for subsequent explanation.
lung$sex <- factor(lung$sex,
    levels = c(1,2),
    labels = c("male", "female"))
"Packaging" the data before building the regression model
"Packaging" the data before building the regression model and Nomogram with the rms package: this is a very important step for Nomogram building. The command "?datadist" shows the detailed help document of this function.

dd <- datadist(lung)
options(datadist="dd")

A Cox regression fit is then built with time and status as the dependent variables and age and sex as the independent variables. Show the parameters of this model: Dxy =0.206, so C-Index =0.206/2+0.5 =0.603.

fit <- cph(Surv(time,status) ~ age + sex, data = lung, x = T, y = T, surv = T)
fit
## Cox Proportional Hazards Model
##
## cph(formula = Surv(time, status) ~ age + sex, data = lung, x = T,
##     y = T, surv = T)
##
##               Model Tests          Discrimination Indexes
## Obs    228    LR chi2     14.12    R2    0.060
## Events 165    d.f.            2    Dxy   0.206
## Center 0.8618 Pr(> chi2) 0.0009    g     0.356
##               Score chi2  13.72    gr    1.428
##               Pr(> chi2) 0.0010
##
##             Coef    S.E.   Wald Z Pr(>|Z|)
## age          0.0170 0.0092  1.85  0.0646
## sex=female  -0.5131 0.1675 -3.06  0.0022
##

Calculate the median survival time.

med <- Quantile(fit)

Build the survival function object surv.

surv <- Survival(fit)

Build the Nomogram of median survival based on the Cox regression with the commands below, as shown in Figure 11.

nom <- nomogram(fit, fun = function(x) med(lp = x),
                funlabel = "Median Survival Time")
plot(nom)

The interpretation of this Nomogram can be referred to that of Example 1.
The C-Index can be calculated with the commands below; this gives an objective assessment of the model.

rcorrcens(Surv(time,status) ~ predict(fit), data = lung)
##
## Somers' Rank Correlation for Censored Data
## Response variable: Surv(time, status)
##
##              C     Dxy    aDxy  SD    Z    P     n
## predict(fit) 0.397 -0.206 0.206 0.051 4.03 1e-04 228

The parameter C here is 0.397 (C=0.397); it is actually the complement of the C-Index. So the C-Index is 0.603 (C-Index = 1 − 0.397 =0.603), exactly the same as calculated before.

We can modify the calibration curve with the commands below (Figure 13). Firstly, the calibrate() function is used to build the object cal. Then we plot the graph, and you can use the graph parameters to beautify the calibration curve. The command "?calibrate" shows more details about the parameters.

cal <- calibrate(fit, cmethod='KM', method="boot",
                 u = 365, m = 50, B = 100)
plot(cal)
plot(cal, lwd=2, lty=1,
     errbar.col=c(rgb(0,118,192,maxColorValue=255)),
     xlim=c(0.1,1), ylim=c(0.1,1),
     xlab="Nomogram-Predicted Probability of 1-Year",
     ylab="Actual Probability of 1-Year",
     col=c(rgb(192,98,83,maxColorValue=255)))

The interpretation of this graph can be referred to that of Example 1.
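The Dxy-to-C conversion used above can also be done programmatically. Below is a minimal sketch, with the values transcribed from the rcorrcens() output above; note that the SD printed there is for Dxy, so it is halved to put it on the C scale.

```r
# Sketch: convert Somers' Dxy to a C-index and an approximate 95% CI.
# Dxy and SD are transcribed from the rcorrcens() output above.
Dxy <- 0.206
SD  <- 0.051
C   <- Dxy / 2 + 0.5
ci  <- C + c(-1.96, 1.96) * SD / 2
round(c(C = C, lower = ci[1], upper = ci[2]), 3)
# C = 0.603, approximate 95% CI 0.553 to 0.653
```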
Page 26 of 96 Zhou et al. Clinical prediction models with R
[Figures 11 and 12 Nomograms for the lung cancer Cox regression model; axes: Points (0–100), Age (35–85), Sex (male/female), Total points (0–180), Linear predictor (−0.7 to 0.6)]
calculated. In the logistic regression model, sometimes the consistency can also be measured by the Hosmer-Lemeshow goodness-of-fit test. The calibration curve is a scatter plot of the actual incidence and the predicted incidence. Statistically, the calibration curve is a visualized result of the Hosmer-Lemeshow goodness-of-fit test (28-30).
It is worth noting that a well-discriminated model may have poor calibration. For example, a model can correctly determine that one person's risk of disease is five times that of another person, determining the two risks to be 5% and 1%, while in fact the risks of the two are 50% and 10%. Such a model is far off the mark, which is bad calibration. The calibration of a model can be tested with the Hosmer-Lemeshow test: if the result is statistically significant, there is a difference between the predicted and observed values. Discrimination and calibration are both important evaluations of a model, but many newly developed models are not fully evaluated. A systematic review of cardiovascular risk prediction models found that only 63% of the models reported discrimination, and even fewer models, 36%, reported calibration.

R-squared (R2)
The coefficient of determination, commonly known as "R-squared", is used as a guideline to measure the accuracy of the model; it combines discrimination and consistency. The coefficient of determination R2 is more comprehensive, but slightly rough (31,32).
Below we explain the method of calculating C-Statistics in R with a classic case of Logistic regression. As in the previous Sections, the focus is on the process of calculation in R rather than on complex statistical principles.

Case analysis

[Case 1]
Hosmer and Lemeshow studied the influencing factors of low birth weight in infants in 1989. The outcome variable was whether or not a low birth weight infant was delivered (variable name "low"; dichotomous; 1= low birth weight, i.e., birth weight <2,500 g; 0= non-low birth weight). The influencing factors (independent variables) considered include: maternal pre-pregnancy weight (lwt, unit: pounds); maternal age (age, unit: year); maternal smoking during pregnancy (smoke, 0= no smoking, 1= smoking); pre-pregnancy premature births (ptl, unit: times); whether the mother has high blood pressure (ht, 0= no, 1= yes); uterine irritability to contraction, oxytocin and other stimuli (ui, 0= no, 1= yes); the number of community physician visits in the first three months of pregnancy (ftv, unit: times); race (race, 1= white, 2= black, 3= other ethnic groups). We sorted out the data in this example and named it "Lowweight.sav", which is stored in the current working path. For the convenience of the reader, the data and code can be downloaded in the attachment to this Section.

[Case 1] analysis
The dependent variable in this case is a binary outcome variable (low birth weight or not). The purpose of the study was to investigate the independent influencing factors of low birth weight infants, which is consistent with the application conditions of binary Logistic regression. We construct a Logistic regression equation with "age + ftv + ht + lwt + ptl + smoke + ui + race" as the independent variables and "low" as the dependent variable. Based on this Logistic regression model, we have three methods to calculate its C-Statistics:
(I) Method 1. Use the lrm() function in the rms package to construct a logistic regression model and directly read the parameter C under the model's "Rank Discrim. Indexes", which is C-Statistics.
(II) Method 2. Construct a logistic regression model, use the predict() function to calculate the model prediction probability, then use the ROCR package to draw the ROC curve from the predicted probability and calculate the area under the curve (AUC), which is C-Statistics. Note: this method is consistent with the calculation method in SPSS.
(III) Method 3. Construct a logistic regression model, use the predict() function to calculate the model prediction probability, and directly calculate the area under the ROC curve (AUC) using the somers2() function in the Hmisc package. Note: this method is consistent with the calculation method in SPSS.

R codes of calculation process
Load the foreign package and the rms package.

library(foreign)
library(rms)

Import external data in .sav format and convert the data
into a data frame structure, while presenting the first 6 rows of the data frame.

mydata <- read.spss("Lowweight.sav")
mydata <- as.data.frame(mydata)
head(mydata)
##   id           low age lwt  race      smoke ptl     ht  ui ftv  bwt
## 1 85 normal weight  19 182 black no smoking   0 no pih yes   0 2523
## 2 86 normal weight  33 155 other no smoking   0 no pih  no   3 2551
## 3 87 normal weight  20 105 white    smoking   0 no pih  no   1 2557
## 4 88 normal weight  21 108 white    smoking   0 no pih yes   2 2594
## 5 89 normal weight  18 107 white    smoking   0 no pih yes   0 2600
## 6 91 normal weight  21 124 other no smoking   0 no pih  no   0 2622

Set the outcome variable to two categories, define "low weight" as "1", and convert the unordered multi-category variable "race" to dummy variables.

mydata$low <- ifelse(mydata$low =="low weight",1,0)
mydata$race1 <- ifelse(mydata$race =="white",1,0)
mydata$race2 <- ifelse(mydata$race =="black",1,0)
mydata$race3 <- ifelse(mydata$race =="other",1,0)

Load the data into the current working environment and "package" the data using the datadist() function.

dd <- datadist(mydata)
options(datadist='dd')

The logistic regression model is fitted using the lrm() function in the rms package.

(I) Method 1. Directly read the Rank Discrim. parameter C in the model parameters, i.e., C-Statistics =0.738.

fit1 <- lrm(low ~ age+ftv+ht+lwt+ptl+smoke+ui+race1+race2,
            data = mydata, x = T, y = T)
fit1
## Logistic Regression Model
##
## lrm(formula = low ~ age + ftv + ht + lwt + ptl + smoke + ui +
##     race1 + race2, data = mydata, x = T, y = T)
##
##             Model Likelihood     Discrimination    Rank Discrim.
##             Ratio Test           Indexes           Indexes
## Obs 189     LR chi2 31.12        R2    0.213       C     0.738
## 0   130     d.f. 9               g     1.122       Dxy   0.476
## 1    59     Pr(> chi2) 0.0003    gr    3.070       gamma 0.477
## max |deriv| 7e-05                gp    0.207       tau-a 0.206
##                                  Brier 0.181
##
##               Coef    S.E.   Wald Z Pr(>|Z|)
## Intercept      1.1427 1.0873  1.05  0.2933
## age           -0.0255 0.0366 -0.69  0.4871
## ftv            0.0321 0.1708  0.19  0.8509
## ht=pih         1.7631 0.6894  2.56  0.0105
## lwt           -0.0137 0.0068 -2.02  0.0431
## ptl            0.5517 0.3446  1.60  0.1094
## smoke=smoking  0.9275 0.3986  2.33  0.0200
## ui=yes         0.6488 0.4676  1.39  0.1653
## race1         -0.9082 0.4367 -2.08  0.0375
## race2          0.3293 0.5339  0.62  0.5374
##

(II) Method 2. Calculate the AUC using the ROCR package; the code is as follows.
First, calculate the prediction probability from the constructed logistic regression model.

mydata$predvalue <- predict(fit1)

Load the ROCR package.

library(ROCR)

Use the prediction() function to build the object "pred", and the performance() function to build the object "perf" to plot the ROC curve.

pred <- prediction(mydata$predvalue, mydata$low)
perf <- performance(pred,"tpr","fpr")

Draw the ROC curve as shown in Figure 14.

plot(perf)
abline(0,1, col = 3, lty = 2)

The area under the ROC curve (AUC), i.e., C-Statistics, is 0.7382008, consistent with the above calculation result.

auc <- performance(pred,"auc")
auc
## An object of class "performance"
## Slot "x.name":
## [1] "None"
##
## Slot "y.name":
## [1] "Area under the ROC curve"
##
## Slot "alpha.name":
## [1] "none"
##
## Slot "x.values":
## list()
##
## Slot "y.values":
## [[1]]
## [1] 0.7382008
##
##
## Slot "alpha.values":
## list()

(III) Method 3. Calculate with the somers2() function in the Hmisc package; we can see that AUC =0.7382, consistent with the above calculation results.

library(Hmisc)
somers2(mydata$predvalue, mydata$low)
##         C       Dxy           n     Missing
## 0.7382008 0.4764016 189.0000000   0.0000000

Brief summary
In fact, no matter which method we use, the standard error of C-Statistics is not directly given, so the calculation of
biostatistics at Vanderbilt University in 1996, to see if the model can make accurate predictions.
The definition of C-Index is very simple: C-Index = concordant pairs/usable pairs. Imagine pairing all the subjects randomly; N subjects will produce N*(N-1)/2 pairs. If the sample size N is large, the calculation is huge and must be done with computer software. We first find the concordant pairs as the numerator. What is a concordant pair? Taking Cox regression in survival analysis as an example: if the subject with the longer actual survival time also has the larger predicted survival probability, or the subject with the shorter survival time has the smaller predicted probability, the prediction is consistent with the actual result; otherwise, it is inconsistent. Then, among all the pairs, we find the usable pairs as the denominator. What is a usable pair? In Cox regression for survival analysis, for example, at least one of the two subjects in a usable pair must have reached the target endpoint event. That is to say, if neither of the paired subjects has an endpoint event throughout the observation period, the pair is not included in the denominator. In addition, there are two other situations that need to be excluded:
(I) If one of the paired subjects has an endpoint event and the other is lost to follow-up, the survival times of the two cannot be compared, and such pairs should be excluded;
(II) Pairs in which both subjects died at the same time should also be excluded.
After finding the usable pairs as the denominator, how do we determine the numerator?
What is the relationship between C-Index and AUC? We have said that C-Index is an indicator that can be used to evaluate the discrimination ability of various models. For the binary logistic regression model, C-Index can be simplified as: the probability that the predicted risk of a patient with the disease is greater than the predicted risk of a patient without the disease. It has been shown that the C-Index for binary logistic regression is equivalent to AUC. AUC mainly reflects the predictive ability of the binary logistic regression model, but C-Index can evaluate the accuracy of the prediction results of various models. It can be easily understood that C-Index is an extension of AUC, and AUC is a special case of C-Index.

Consistency and calibration
Consistency refers to the congruency of the probability of actual occurrence and the predicted probability. It may seem a bit puzzling, so we again cite the example above. The 100 people we predicted do not mean that we really use the model to predict whether a person has the disease or not: the model only gives us the probability of being sick, and the person is judged to have the disease when that probability is greater than a certain cutoff value (e.g., 0.5). For example, for 100 people we finally get 100 probabilities from the model, ranging from 0 to 1. We rank the 100 probabilities from small to large and then divide them into 10 groups with 10 people in each group. The actual probability is the proportion of disease in each group of 10 people; the predicted probability is the average of the 10 predicted probabilities in each group. We then compare the two numbers, one as the abscissa and one as the ordinate; a Balance Plot is obtained, and the 95% range of the plot can also be calculated. In the logistic regression model, sometimes the consistency can also be measured by the Hosmer-Lemeshow goodness-of-fit test. The calibration curve is a scatter plot of the actual incidence and the predicted incidence. Statistically, the calibration curve is a visualized result of the Hosmer-Lemeshow goodness-of-fit test (28,29,35).
It is worth noting that a well-discriminated model may have poor calibration. For example, a model can correctly determine that one person's risk of disease is five times that of another person, determining the two risks to be 5% and 1%, while in fact the risks of the two are 50% and 10%. Such a model is far off the mark, which is bad calibration. The calibration of a model can be tested with the Hosmer-Lemeshow test: if the result is statistically significant, there is a difference between the predicted and observed values. Discrimination and calibration are both important evaluations of a model, but many newly developed models are not fully evaluated. A systematic review of cardiovascular risk prediction models found that only 63% of the models reported discrimination, and even fewer models, 36%, reported calibration.

R-squared (R2)
The coefficient of determination, commonly known as "R-squared", is used as a guideline to measure the accuracy of the model; it combines discrimination and consistency. The coefficient of determination R2 is more comprehensive, but slightly rough.

Calculation methods of C-Index
In many clinical articles, it is often seen that the
discriminating ability of a model described in the statistical methods is measured by C-Statistics or C-Index. Below we use R to demonstrate the calculation of C-Index for Cox regression. The C-Statistics calculation for Logistic regression has been introduced in Section 5. As with the previous Sections, the focus of this article is on the R calculation process; we try to avoid the complex statistical principles.
Strictly speaking, C-Index includes the following types. We only introduce the first one, which is more commonly used in clinical practice:
(I) Harrell's C;
(II) C-statistic by Begg et al. (survAUC::BeggC);
(III) C-statistic by Uno et al. (survC1::Inf.Cval; survAUC::UnoC);
(IV) Gonen and Heller concordance index for Cox models (survAUC::GHCI, CPE::phcpe, clinfun::coxphCPE).
There are two common calculation methods for C-Index in the Cox regression model:
(I) Method 1: output directly from the coxph() function of the survival package; the version of R needs to be higher than 2.15, and you need to install the survival package in advance. This method outputs the C-Index (corresponding to model parameter C) together with its standard error; the 95% confidence interval can be obtained as C plus or minus 1.96*se. This method is also applicable when many indicators are combined (22-24).
(II) Method 2: using the cph() function and the validate() function in the rms package; both the un-adjusted and the bias-adjusted C-Index are available (23).

R code and its interpretation
Simulate a set of survival data and set it to the data frame structure. age and bp are the independent variables, os and death are the survival outcome; the data frame is named "sample.data", and the first 6 lines of the data frame are displayed:

age <- rnorm(200,50,5)
bp <- rnorm(200,120,10)
d.time <- rexp(200)
cens <- runif(200,.5,2)
death <- d.time <= cens
os <- pmin(d.time, cens)
sample.data <- data.frame(age = age, bp = bp, os = os, death = death)
head(sample.data)
##        age       bp        os death
## 1 51.48918 122.6809 1.7626881 FALSE
## 2 53.41727 138.3601 0.6110944 FALSE
## 3 46.42942 104.4238 0.2602561  TRUE
## 4 48.07264 127.1035 0.1138379  TRUE
## 5 53.56849 130.9504 0.3202129  TRUE
## 6 53.94391 135.3093 1.2506595 FALSE

(I) Method 1. survival package: load the survival package, fit the Cox regression model with the coxph() function, display the model results with the summary() function and assign them to the object sum.surv; the model parameter "concordance" displayed is the C-Index. In this example, C-Index =0.5416 and se(C) =0.02704.

library(survival)
fit <- coxph(Surv(os, death) ~ age + bp, data = sample.data)
sum.surv <- summary(fit)
c_index <- sum.surv$concordance
c_index
##          C      se(C)
## 0.54156912 0.02704007

(II) Method 2. rms package: build a Cox regression model and read the model parameter Dxy; Dxy*0.5+0.5 is the C-Index. Note: the seed is set here with the set.seed() function in order to make the final result reproducible, because the adjusted result of the validate() function is random.

library(rms)
## Loading required package: Hmisc
## Loading required package: lattice
## Loading required package: Formula
## Loading required package: ggplot2
## Loading required package: SparseM
set.seed(123)
dd <- datadist(sample.data)
options(datadist='dd')
fit.cph <- cph(Surv(os, death) ~ age + bp, data = sample.data,
               x = TRUE, y = TRUE, surv = TRUE)
fit.cph
## Cox Proportional Hazards Model
##
## cph(formula = Surv(os, death) ~ age + bp, data = sample.data,
##     x = TRUE, y = TRUE, surv = TRUE)
##
##                Model Tests         Discrimination Indexes
## Obs    200     LR chi2     4.97    R2  0.025
## Events 137     d.f.           2    Dxy 0.083
## Center -1.3246 Pr(> chi2) 0.0833   g   0.218
##                Score chi2  4.95    gr  1.243
##                Pr(> chi2) 0.0842
##
##       Coef    S.E.   Wald Z Pr(>|Z|)
## age    0.0169 0.0178  0.95  0.3416
## bp    -0.0180 0.0089 -2.01  0.0444
##
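The pair-counting definition of Harrell's C given earlier can be verified by brute force on a toy example. This is an illustrative sketch with made-up data, not part of the original example:

```r
# Brute-force Harrell's C on hypothetical toy data, applying the
# "concordant pairs / usable pairs" definition directly. Higher risk
# scores should correspond to shorter survival times.
time   <- c(5, 8, 2, 10, 6)
status <- c(1, 0, 1, 1, 0)              # 1 = event, 0 = censored
risk   <- c(0.9, 0.3, 1.2, 0.95, 0.5)   # e.g., a model's linear predictor
conc <- 0; usable <- 0
for (i in 1:4) for (j in (i + 1):5) {
  if (time[i] == time[j]) next          # tied event times are excluded
  pair <- c(i, j)
  shorter <- pair[which.min(time[pair])]
  other   <- pair[pair != shorter]
  if (status[shorter] == 0) next        # shorter time censored: pair unusable
  usable <- usable + 1
  if (risk[shorter] > risk[other]) conc <- conc + 1
}
c(C = conc / usable, usable = usable)   # C = 6/7, i.e., about 0.857
```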
Calculate the adjusted and the un-adjusted C-Index; the result is as follows.

v <- validate(fit.cph, dxy = TRUE, B = 1000)
Dxy <- v[rownames(v)=="Dxy", colnames(v)=="index.corrected"]
orig_Dxy <- v[rownames(v)=="Dxy", colnames(v)=="index.orig"]
bias_corrected_c_index <- abs(Dxy)/2 + 0.5
orig_c_index <- abs(orig_Dxy)/2 + 0.5
bias_corrected_c_index
## [1] 0.5276304
orig_c_index
## [1] 0.5415691

Un-adjusted C-Index =0.5416, adjusted C-Index =0.5276.

Brief summary
C-Index is one of the most important parameters in the evaluation of the Cox regression model. It reflects the quality of the model's predictions and is an important parameter for measuring the discrimination of the model. However, this parameter cannot be calculated in IBM SPSS. This Section has described two calculation methods in R, and mastering one of them is sufficient. The author recommends the first one, because it reports the C-Index and its standard error at the same time, so the confidence interval of the C-Index can be calculated conveniently.

Calculation method of NRI with R

Background
NRI was originally used to quantitatively evaluate the improvement in classification performance of a new marker over the original marker. Since NRI is capable of evaluating the precision of a diagnostic test, it can also be used to evaluate the performance of a predictive model, because statistically a diagnostic test and a predictive model are the same. Based on published clinical trials, NRI has been widely used to compare the accuracy of two predictive models. We have discussed methods to compare the accuracy and discrimination of two predictive models, such as C-Statistics and AUC. However, both methods have several limitations:
(I) C-Statistic/AUC lacks sensitivity. When evaluating the improvement in predictive performance of a model after incorporating a new marker, the improvement in C-Statistic/AUC is always small, so the new marker sometimes fails to significantly improve the C-Statistic/AUC.
(II) The meaning of C-Statistic/AUC is hard to understand and properly explain in clinical use.
NRI overcomes these two limitations.

Calculation method of NRI
We use a dichotomous diagnostic marker as an example to explain the principle of NRI calculation and then quantitatively compare the predictive performance of different models. Calculation of NRI can be done with a customized function in R or by formula; comparing the performance of predictive models requires statistical software. In short, the original marker classifies study objects into patients and non-patients, while the new marker reclassifies them into patients and non-patients. When comparing the results of classification and reclassification, some objects may be mistakenly classified by the original marker but corrected by the new marker, and vice versa. Therefore, the classification of study objects changes when different markers are used, and we use this change to calculate NRI (26,27,36). It may look confusing, but the calculation below can help readers understand the concept.
First, we classify study objects into a disease group and a healthy group using the gold standard test. Two matched fourfold tables are generated from the classification results of the original and new markers within the two groups, as shown below in Tables 3 and 4.
We mainly focus on study objects who are reclassified. As shown in Tables 3 and 4, in the disease group (N1 in total), c1 objects are correctly classified by the new marker but mistakenly classified by the original marker, while b1 objects are correctly classified by the original marker but mistakenly classified by the new marker. So, compared with the original model, the improved proportion of correct classification in the new model is (c1 − b1)/N1. Similarly, in the healthy group (N2 in total), b2 objects are correctly classified by the new marker but mistakenly classified by the original marker, while c2 objects are correctly classified by the original marker but mistakenly classified by the new marker. The improved proportion of correct classification in the new model is (b2 − c2)/N2. Finally, combining the two groups, NRI = (c1 − b1)/N1 + (b2 − c2)/N2, which is often referred to as the absolute NRI.
If NRI > 0, there is positive improvement, indicating that the new marker has better predictive value than the original marker; if NRI < 0, there is negative improvement, indicating that the new marker has worse
Table 3 Reclassification in disease group (N1)
                        New marker
Original marker     Positive    Negative
Positive            a1          b1
Negative            c1          d1

Table 4 Reclassification in healthy group (N2)
                        New marker
Original marker     Positive    Negative
Positive            a2          b2
Negative            c2          d2

predictive value than the original marker; if NRI =0, there is no improvement. We can calculate a z-score to determine whether the difference between the new model and the original model reaches a significant level; the z-score follows the standard normal distribution, and the P value can be calculated from it. The formula for the z-score is:

Z = NRI / sqrt[(b1 + c1)/N1^2 + (b2 + c2)/N2^2]   [2]

The example described above concerns a dichotomous diagnostic marker. In predictive model studies the situation is more complicated, but the principle is the same. Making a diagnosis solely based on a dichotomous diagnostic marker seems too "crude"; researchers may be more concerned with the risk of developing the disease in the future rather than the present status of having or not having the disease, and a predictive model can offer the probability of developing a disease or another outcome. For example, study objects can be classified into a low-risk group, an intermediate-risk group and a high-risk group based on the predicted risk probability. If the outcome variable is a multi-category variable, ROC analysis is not appropriate, because the outcome variable of ROC analysis is usually dichotomous; with a multi-category outcome the ROC may present as a spherical surface, which is hard to draw. Even if the ROC could be drawn, it would still be hard to compare the AUCs of two predictive models directly, which makes it complicated to explain their clinical significance. However, NRI can easily solve these problems. So how does NRI solve them?
Take the paper about the methodology of NRI calculation published in Stat Med as an example (26). Based on the famous Framingham Heart Study, researchers assessed the improvement of a new model, which incorporated high-density lipoprotein cholesterol (HDL-C) into the classic model, in the prediction of the 10-year risk of coronary heart disease (CHD). The researchers compared the AUCs of the new model and the classic model: the AUCs of the two models were 0.774 and 0.762, respectively. The AUC only increased by 0.012 after incorporating HDL-C and failed to reach a significant level (P=0.092), which indicated that the new model had no significant improvement, as shown in Figure 1 of the paper (26). The researchers further classified study objects into a low-risk group (<6%), an intermediate-risk group (6–20%) and a high-risk group (>20%), as shown in Table 2 of the paper (26). They also calculated the NRI (NRI =12.1%), z-score (z-score =3.616) and P value (P<0.001), which indicated that incorporating the new marker improved the predictive performance, with the proportion of correct classification increasing by 12.1%.
The principle of NRI has been fully discussed above. Here we discuss how to calculate NRI using R software. There are three circumstances:
(I) To calculate how much the predictive performance of a new marker improves compared with the original marker, we can use the formula listed above or the R code in the reference material to calculate NRI;
(II) To calculate the NRI of two predictive models constructed by Logistic regression;
(III) To calculate the NRI of two predictive models constructed by Cox regression.
The calculation methods using R are listed below (Table 5) (27,37-39). We mainly demonstrate how to calculate NRI using the "nricens" package, which is highly recommended.

Case analysis

The calculation of NRI of two markers
[Case 1]
Researchers wanted to assess the predictive value of two diagnostic tests for diabetes. They used three methods (the gold standard test, diagnostic test 1 and diagnostic test 2) to evaluate the disease status of 100 patients. The data used here are documented in the appendix "diagnosisdata.csv". Disease status predicted by the gold standard test and
[Table 5 R tools for NRI calculation; only one row survived extraction: Hmisc, CRAN, improveProb() function, Not available, Not available, Not available]

function based on the definition we describe above. The R codes are presented below:

NRIcalculate = function(m1="dia1", m2="dia2", gold="gold"){
  datanri = datanri[complete.cases(datanri),]
  for (i in 1:length(names(datanri))){
    if (names(datanri)[i]==m1) nm1 = as.numeric(i)
    if (names(datanri)[i]==m2) nm2 = as.numeric(i)
    if (names(datanri)[i]==gold) ngold = as.numeric(i)
  }
  if(names(table(datanri[,nm1]))[1]!="0" ||
     names(table(datanri[,nm1]))[2]!="1") stop("index test 1 value not 0 or 1")
  if(names(table(datanri[,nm2]))[1]!="0" ||
     names(table(datanri[,nm2]))[2]!="1") stop("index test 2 value not 0 or 1")
  if(names(table(datanri[,ngold]))[1]!="0" ||
     names(table(datanri[,ngold]))[2]!="1") stop("reference standard value not 0 or 1")
  datanri1 = datanri[datanri[,ngold]==1,]
  table1 = table(datanri1[,nm1], datanri1[,nm2])
  datanri2 = datanri[datanri[,ngold]==0,]
  table2 = table(datanri2[,nm1], datanri2[,nm2])
  p1 = as.numeric(table1[2,1]/table(datanri[,ngold])[2])
  p2 = as.numeric(table1[1,2]/table(datanri[,ngold])[2])
  p3 = as.numeric(table2[2,1]/table(datanri[,ngold])[1])
  p4 = as.numeric(table2[1,2]/table(datanri[,ngold])[1])
  # ... (the remainder of the function, which computes NRI, z and P from
  # p1-p4, is not reproduced in this excerpt)
}

Use the NRI calculation function NRIcalculate() to calculate NRI. The codes are presented below:

NRIcalculate(m1="t1", m2="t2", gold="gold")
## [1] "NRI=0.566, z=4.618, P=<0.001"

m1 is the variable name of diagnostic test 1, m2 is the variable name of diagnostic test 2, and gold is the gold standard test. The NRI is 0.566: the predictive value of diagnostic test 1 is significantly higher than that of diagnostic test 2.

NRI calculation of dichotomous outcome
[Case 2]
This data set is from the Mayo Clinic trial in primary biliary cirrhosis (PBC) of the liver conducted between 1974 and 1984. A total of 424 PBC patients, referred to Mayo Clinic during that ten-year interval, met eligibility criteria for the randomized placebo-controlled trial of the drug D-penicillamine. The first 312 cases in the data set participated in the randomized trial and contain largely complete data. The additional 112 cases did not participate in the clinical trial but consented to have basic measurements recorded and to be followed for survival. Six of those cases were lost to follow-up shortly after diagnosis,
© Annals of Translational Medicine. All rights reserved. Ann Transl Med 2019;7(23):796 | http://dx.doi.org/10.21037/atm.2019.08.63
Page 36 of 96 Zhou et al. Clinical prediction models with R
Table 6 Data structure and data description

Age: in years
Albumin: serum albumin (g/dL)
alk.phos: alkaline phosphatase (U/liter)
Ascites: presence of ascites
ast: aspartate aminotransferase, once called SGOT (U/mL)
bili: serum bilirubin (mg/dL)
chol: serum cholesterol (mg/dL)
copper: urine copper (μg/day)
Edema: 0 no edema, 0.5 untreated or successfully treated, 1 edema despite diuretic therapy
hepato: presence of hepatomegaly or enlarged liver
ID: case number
Platelet: platelet count
protime: standardised blood clotting time
Sex: m/f
Spiders: blood vessel malformations in the skin
Stage: histologic stage of disease (needs biopsy)
Status: status at endpoint, 0/1/2 for censored, transplant, dead
Time: number of days between registration and the earlier of death, transplantation, or study analysis in July, 1986
trt: 1/2/NA for D-penicillamine, placebo, not randomized
trig: triglycerides (mg/dL)

So the data here are on an additional 106 cases as well as the 312 randomized participants. The outcome of our study is dichotomous, therefore the data require conversion. As for the "time" variable, some samples failed to reach 2,000 days, meaning that these patients died or were censored before 2,000 days; here we delete the patients censored before 2,000 days. The detailed description of the other variables is available using "?pbc". A nearly identical data set is found in appendix D of Fleming and Harrington; this version has fewer missing values.

R codes and its interpretation

Firstly, we load the "nricens" package and the dataset, then extract the first 312 observations.

library(nricens)
## Loading required package: survival
dat= pbc[1:312,]
dat$sex= ifelse(dat$sex=='f', 1, 0)

We delete the subjects whose follow-up time is shorter than 2,000 days and who were not observed to die. "[]" performs the filtering, "|" stands for "or", and "&" stands for "and": rows with "time" >2,000, together with rows with "time" <2,000 whose "status" is "dead" (=2), are selected. Do not miss the last ",": the trailing comma means that all columns are kept for the rows which fit the filter.

dat= dat[dat$time > 2000 | (dat$time < 2000 & dat$status == 2),]

Define the endpoint: 1 stands for time <2,000 days and status =2 (dead), and 0 stands for the others.

event= ifelse(dat$time < 2000 & dat$status == 2, 1, 0)

Build a matrix out of a subset of dat containing age, bili and albumin.

z.std= as.matrix(subset(dat, select = c(age, bili, albumin)))

Build a matrix out of a subset of dat containing age, bili, albumin and protime.

z.new= as.matrix(subset(dat, select = c(age, bili, albumin, protime)))

Construct two logistic regression models, mstd and mnew; model "mnew" has one more variable, "protime". Calculation with the "nricens" package requires x = TRUE, which means that the fitted object contains the model matrix.

mstd= glm(event ~ ., binomial(logit), data.frame(event, z.std), x=TRUE)
mnew= glm(event ~ ., binomial(logit), data.frame(event, z.new), x=TRUE)
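The row filter and endpoint definition above can be checked with a few toy records. A minimal sketch of the same logic (Python; the records below are invented):

```python
# Invented (time, status) records; status: 0 censored, 1 transplant, 2 dead.
records = [(2500, 0), (1500, 2), (1500, 0), (2100, 2), (800, 1)]

# Keep subjects followed past 2,000 days, or dead before 2,000 days;
# subjects censored or transplanted before 2,000 days are dropped.
kept = [r for r in records if r[0] > 2000 or (r[0] < 2000 and r[1] == 2)]

# Endpoint: 1 if dead before 2,000 days, else 0.
event = [1 if (t < 2000 and s == 2) else 0 for (t, s) in kept]
print(kept, event)   # prints [(2500, 0), (1500, 2), (2100, 2)] [0, 1, 0]
```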
(III) Calculation of risk category NRI using ('event', 'z.std', 'z.new'), updown = 'category'.

nribin(event= event, z.std = z.std, z.new = z.new, cut = c(0.2, 0.4),
niter = 100, updown = 'category')

##
## STANDARD prediction model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.98927136 2.20809035 0.4480212 6.541379e-01
## age 0.07128234 0.01988079 3.5854876 3.364490e-04
## bili 0.61686651 0.10992947 5.6114755 2.006087e-08
## albumin -1.95859156 0.53031693 -3.6932473 2.214085e-04
##
## NEW prediction model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.16682234 2.92204889 -0.3993165 6.896600e-01
## age 0.06659224 0.02032242 3.2767864 1.049958e-03
## bili 0.59995139 0.11022521 5.4429600 5.240243e-08
## albumin -1.88620553 0.53144647 -3.5491919 3.864153e-04
## protime 0.20127560 0.18388726 1.0945598 2.737095e-01
##
## UP and DOWN calculation:
## #of total, case, and control subjects at t0: 232 88 144
##
## Reclassification Table for all subjects:
## New
## Standard < 0.2 < 0.4 >= 0.4
## < 0.2 110 3 0
## < 0.4 3 30 0
## >= 0.4 0 2 84
##
## Reclassification Table for case:
## New
## Standard < 0.2 < 0.4 >= 0.4
## < 0.2 7 0 0
## < 0.4 0 8 0
## >= 0.4 0 2 71
##
## Reclassification Table for control:
## New
## Standard < 0.2 < 0.4 >= 0.4
## < 0.2 103 3 0
## < 0.4 3 22 0
## >= 0.4 0 0 13
##
## NRI estimation:
## Point estimates:
## Estimate
## NRI -0.02272727
## NRI+ -0.02272727
## NRI- 0.00000000
## Pr(Up|Case) 0.00000000
## Pr(Down|Case) 0.02272727
## Pr(Down|Ctrl) 0.02083333
## Pr(Up|Ctrl) 0.02083333
##
## Now in bootstrap..
##
## Point & Interval estimates:
## Estimate Std.Error Lower Upper
## NRI -0.02272727 0.03630980 -0.05063291 0.11288105
## NRI+ -0.02272727 0.02303343 -0.05063291 0.03448276
## NRI- 0.00000000 0.03004483 -0.02684564 0.07746479
## Pr(Up|Case) 0.00000000 0.01763929 0.00000000 0.04878049
## Pr(Down|Case) 0.02272727 0.02334453 0.00000000 0.08860759
## Pr(Down|Ctrl) 0.02083333 0.03459169 0.00000000 0.12676056
## Pr(Up|Ctrl) 0.02083333 0.01853583 0.00000000 0.05970149

(IV) Calculation of risk difference NRI using ('mdl.std', 'mdl.new'), updown = 'diff'.

nribin(mdl.std= mstd, mdl.new = mnew, cut = 0.02, niter = 0,
updown = 'diff')

##
## UP and DOWN calculation:
## #of total, case, and control subjects at t0: 232 88 144
## #of subjects with 'p.new - p.std > cut' for all, case, control: 34 17 17
## #of subjects with 'p.std - p.new < cut' for all, case, control: 36 13 23
##
## NRI estimation:
## Point estimates:
## Estimate
## NRI 0.08712121
## NRI+ 0.04545455
## NRI- 0.04166667
## Pr(Up|Case) 0.19318182
## Pr(Down|Case) 0.14772727
## Pr(Down|Ctrl) 0.15972222
## Pr(Up|Ctrl) 0.11805556

(V) Calculation of risk difference NRI using ('event', 'z.std', 'z.new'), updown = 'diff'.

nribin(event= event, z.std = z.std, z.new = z.new, cut = 0.02,
niter = 100, updown = 'diff')

##
## STANDARD prediction model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.98927136 2.20809035 0.4480212 6.541379e-01
## age 0.07128234 0.01988079 3.5854876 3.364490e-04
## bili 0.61686651 0.10992947 5.6114755 2.006087e-08
## albumin -1.95859156 0.53031693 -3.6932473 2.214085e-04
##
## NEW prediction model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.16682234 2.92204889 -0.3993165 6.896600e-01
## age 0.06659224 0.02032242 3.2767864 1.049958e-03
## bili 0.59995139 0.11022521 5.4429600 5.240243e-08
## albumin -1.88620553 0.53144647 -3.5491919 3.864153e-04
## protime 0.20127560 0.18388726 1.0945598 2.737095e-01
##
## UP and DOWN calculation:
## #of total, case, and control subjects at t0: 232 88 144
## #of subjects with 'p.new - p.std > cut' for all, case, control: 34 17 17
## #of subjects with 'p.std - p.new < cut' for all, case, control: 36 13 23
##
## NRI estimation:
## Point estimates:
## Estimate
## NRI 0.08712121
## NRI+ 0.04545455
## NRI- 0.04166667
## Pr(Up|Case) 0.19318182
## Pr(Down|Case) 0.14772727
## Pr(Down|Ctrl) 0.15972222
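The categorical NRI can be reproduced by hand from the reclassification tables printed above: cases moving to a higher risk category count for the new model and cases moving down count against it, with the opposite for controls. A sketch of that bookkeeping (Python, using the case and control tables from the categorical nribin output above):

```python
# Rows = standard-model category, columns = new-model category
# (low <0.2, intermediate <0.4, high >=0.4), from the output above.
case = [[7, 0, 0], [0, 8, 0], [0, 2, 71]]
ctrl = [[103, 3, 0], [3, 22, 0], [0, 0, 13]]

def up_down(tab):
    """Counts above / below the diagonal, plus the table total."""
    up = sum(tab[i][j] for i in range(3) for j in range(3) if j > i)
    down = sum(tab[i][j] for i in range(3) for j in range(3) if j < i)
    return up, down, sum(map(sum, tab))

up_c, down_c, n_case = up_down(case)   # 0 up, 2 down among 88 cases
up_n, down_n, n_ctrl = up_down(ctrl)   # 3 up, 3 down among 144 controls

nri_case = (up_c - down_c) / n_case    # NRI+ = -2/88
nri_ctrl = (down_n - up_n) / n_ctrl    # NRI- = 0/144
print(round(nri_case + nri_ctrl, 8))   # prints -0.02272727, as in the output
```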
## Pr(Up|Ctrl) 0.11805556
##
## Now in bootstrap..
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Point & Interval estimates:
## Estimate Std.Error Lower Upper
## NRI 0.08712121 0.09606989 -0.02530364 0.3338028
## NRI+ 0.04545455 0.04172751 -0.02941176 0.1279070
## NRI- 0.04166667 0.07033087 -0.02898551 0.2214765
## Pr(Up|Case) 0.19318182 0.09990415 0.00000000 0.3797468
## Pr(Down|Case) 0.14772727 0.07988383 0.00000000 0.2650602
## Pr(Down|Ctrl) 0.15972222 0.12260050 0.00000000 0.4140127
## Pr(Up|Ctrl) 0.11805556 0.06328156 0.00000000 0.2420382

(VI) Calculation of risk difference NRI using ('event', 'p.std', 'p.new'), updown = 'diff'.

nribin(event= event, p.std = p.std, p.new = p.new, cut = 0.02,
niter = 100, updown = 'diff')

##
## UP and DOWN calculation:
## #of total, case, and control subjects at t0: 232 88 144
## #of subjects with 'p.new - p.std > cut' for all, case, control: 34 17 17
## #of subjects with 'p.std - p.new < cut' for all, case, control: 36 13 23
##
## NRI estimation:
## Point estimates:
## Estimate
## NRI 0.08712121
## NRI+ 0.04545455
## NRI- 0.04166667
## Pr(Up|Case) 0.19318182
## Pr(Down|Case) 0.14772727
## Pr(Down|Ctrl) 0.15972222
## Pr(Up|Ctrl) 0.11805556
##
## Now in bootstrap..
##
## Point & Interval estimates:
## Estimate Std.Error Lower Upper
## NRI 0.08712121 0.07622661 -0.06506300 0.2364524
## NRI+ 0.04545455 0.06076779 -0.08602151 0.1666667
## NRI- 0.04166667 0.04317476 -0.04444444 0.1205674
## Pr(Up|Case) 0.19318182 0.03999520 0.11688312 0.2674419
## Pr(Down|Case) 0.14772727 0.03670913 0.07142857 0.2278481
## Pr(Down|Ctrl) 0.15972222 0.03103384 0.09615385 0.2237762
## Pr(Up|Ctrl) 0.11805556 0.02604235 0.07407407 0.1703704

R code interpretation: the codes above compare the predictive performance of two models. "cut" represents the cut-off value(s) of the predicted risk. Here we define two cut-off values, which stratify the risk into three groups: low risk (0–20%), intermediate risk (20–40%) and high risk (40–100%); the continuous predicted risk is converted to a categorical variable based on these cut-off values. "updown" defines how the change in a sample's predicted risk is judged: "category" treats the predicted risk as a categorical variable (low, intermediate and high risk), while "diff" treats it as a continuous value. When "diff" is selected, "cut" is a single value, for example 0.02, meaning that a difference in predicted risk between the new and the original model of more than 0.02 counts as reclassification. "niter" is the number of iterations, namely the number of bootstrap resamples; calculating the standard error of NRI requires resampling, and if "niter" =0 the standard error of NRI is not calculated. Generally, set "niter" =1,000: the larger "niter" is, the longer the computing time, but the higher the accuracy. The significance level α is 0.05.

Results are listed as below:
(I) Tables of all outcomes, positive outcomes and negative outcomes. In a predictive model, "case" means that the outcome takes place, and "control" means that the outcome fails to take place.

## Reclassification Table for all subjects:
## New
## Standard < 0.2 < 0.4 >= 0.4
## < 0.2 110 3 0
## < 0.4 3 30 0
## >= 0.4 0 2 84
##
## Reclassification Table for case:
## New
## Standard < 0.2 < 0.4 >= 0.4
## < 0.2 7 0 0
## < 0.4 0 8 0
## >= 0.4 0 2 71
##
## Reclassification Table for control:
## New
## Standard < 0.2 < 0.4 >= 0.4
## < 0.2 103 3 0
## < 0.4 3 22 0
## >= 0.4 0 0 13

(II) Point estimation, standard error and confidence interval of NRI. Compared with the original model, the proportion of correct reclassification changes by −2.2727%: incorporating the new variable reduces the predictive accuracy, and the new model is worse than the original model.
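The "diff" NRI can likewise be recovered from the printed counts: among 88 cases, 17 moved up and 13 moved down; among 144 controls, 17 moved up and 23 moved down. A sketch of the arithmetic (Python, using those counts from the output above):

```python
# Counts taken from the nribin 'diff' output above.
n_case, up_case, down_case = 88, 17, 13
n_ctrl, up_ctrl, down_ctrl = 144, 17, 23

nri_plus = (up_case - down_case) / n_case    # 4/88  = 0.04545455
nri_minus = (down_ctrl - up_ctrl) / n_ctrl   # 6/144 = 0.04166667
print(round(nri_plus + nri_minus, 8))        # prints 0.08712121, as printed above
```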
## Estimate Std.Error Lower Upper
## NRI -0.02272727 0.03093492 -0.04211382 0.08275862
## NRI+ -0.02272727 0.02163173 -0.05376344 0.04950495
## NRI- 0.00000000 0.02853621 -0.03571429 0.08333333
## Pr(Up|Case) 0.00000000 0.01963109 0.00000000 0.06930693
## Pr(Down|Case) 0.02272727 0.01939570 0.00000000 0.07142857
## Pr(Down|Ctrl) 0.02083333 0.03822346 0.00000000 0.14503817
## Pr(Up|Ctrl) 0.02083333 0.02285539 0.00000000 0.09160305

Outcome of survival data

[Case 3]

We use the same data as in Case 2. The difference between the NRI of survival data and the NRI of categorical data is that the former requires a Cox regression model, so we need to construct two Cox models and calculate the NRI between them.

R codes and its interpretation

Firstly, we load the necessary packages and data. Here consider the pbc dataset in the survival package as an example.

time= dat$time
event= ifelse(dat$status==2, 1, 0)

There are many ways to calculate the risk category NRI; readers could choose any one of them. The first method is:
(I) By the KM estimator using ('mdl.std', 'mdl.new').

nricens(mdl.std= mstd, mdl.new = mnew, t0 = 2000, cut = c(0.2, 0.4),
niter = 100, updown = 'category')

##
## UP and DOWN calculation:
## #of total, case, and control subjects at t0: 312 88 144
##
## Reclassification Table for all subjects:
## New
## Standard < 0.2 < 0.4 >= 0.4
## < 0.2 139 7 1
## < 0.4 17 72 6
## >= 0.4 0 5 65
##
## Reclassification Table for case:
## New
## Standard < 0.2 < 0.4 >= 0.4
## < 0.2 9 2 0
## < 0.4 1 21 4
## >= 0.4 0 0 51
##
## Reclassification Table for control:
## New
## Standard < 0.2 < 0.4 >= 0.4
## < 0.2 92 4 1
## < 0.4 9 29 2
## >= 0.4 0 3 4
##
## NRI estimation by KM estimator:
## Point & Interval estimates:
## Estimate Lower Upper
## NRI+ 0.05123381 -0.09480483 0.14708696
## NRI- 0.05904686 -0.01180288 0.11261994
## Pr(Up|Case) 0.06348538 0.01113699 0.16595888
## Pr(Down|Case) 0.01225156 0.00000000 0.15653476
## Pr(Down|Ctrl) 0.09583016 0.02631760 0.16468399
## Pr(Up|Ctrl) 0.03678329 0.01316053 0.08912133
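The "KM estimator" in nricens refers to estimating the event proportions at t0 with the Kaplan-Meier method, so that subjects censored before t0 are handled properly rather than discarded. A toy Kaplan-Meier survival estimate (Python; the observations below are invented, and this is a didactic sketch, not the nricens internals):

```python
# Invented (time, event) pairs; event 1 = death, 0 = censored.
obs = sorted([(3, 1), (5, 0), (7, 1), (7, 1), (9, 0), (12, 1)])

def km_survival(obs, t0):
    """Kaplan-Meier survival probability at t0 for time-ordered observations."""
    s, at_risk = 1.0, len(obs)
    for time, event in obs:
        if time > t0:
            break
        if event == 1:
            s *= (at_risk - 1) / at_risk   # survival factor at each death
        at_risk -= 1                       # deaths and censorings leave the risk set
    return s

# S(10) = (5/6) * (3/4) * (2/3) = 5/12
print(round(km_survival(obs, 10), 4))   # prints 0.4167
```

Processing tied deaths one at a time multiplies to the same product as grouping them, so the toy version matches the usual d-events-among-n-at-risk formula.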
(II) By the KM estimator using ('time', 'event', 'z.std', 'z.new').

nricens(time= time, event = event, z.std = z.std, z.new = z.new,
t0 = 2000, cut = c(0.2, 0.4), niter = 100, updown = 'category')

##
## STANDARD prediction model (Cox model):
## coef exp(coef) se(coef) z Pr(>|z|)
## age 0.03726683 1.0379699 0.009048925 4.118371 3.815600e-05
## bili 0.13531179 1.1448937 0.013711323 9.868617 5.694436e-23
## albumin -1.44611854 0.2354825 0.221997986 -6.514107 7.312356e-11
##
## NEW prediction model (Cox model):
## coef exp(coef) se(coef) z Pr(>|z|)
## age 0.03362675 1.0341985 0.009214173 3.649460 2.627925e-04
## bili 0.12517886 1.1333511 0.014406820 8.688861 3.660902e-18
## albumin -1.39395237 0.2480928 0.217046959 -6.422354 1.341831e-10
## protime 0.28602917 1.3311313 0.070536400 4.055058 5.012193e-05
##
## UP and DOWN calculation:
## #of total, case, and control subjects at t0: 312 88 144
##
## Reclassification Table for all subjects:
## New
## Standard < 0.2 < 0.4 >= 0.4
## < 0.2 139 7 1
## < 0.4 17 72 6
## >= 0.4 0 5 65
##
## Reclassification Table for case:
## New
## Standard < 0.2 < 0.4 >= 0.4
## < 0.2 9 2 0
## < 0.4 1 21 4
## >= 0.4 0 0 51
##
## Reclassification Table for control:
## New
## Standard < 0.2 < 0.4 >= 0.4
## < 0.2 92 4 1
## < 0.4 9 29 2
## >= 0.4 0 3 4
##
## NRI estimation by KM estimator:
##
## Point estimates:
## Estimate
## NRI 0.11028068
## NRI+ 0.05123381
## NRI- 0.05904686
## Pr(Up|Case) 0.06348538
## Pr(Down|Case) 0.01225156
## Pr(Down|Ctrl) 0.09583016
## Pr(Up|Ctrl) 0.03678329
##
## Now in bootstrap..
##
## Point & Interval estimates:
## Estimate Lower Upper
## NRI 0.11028068 -0.03560702 0.20881092
## NRI+ 0.05123381 -0.08359649 0.11206601
## NRI- 0.05904686 -0.01795177 0.13023171
## Pr(Up|Case) 0.06348538 0.01955126 0.17904180
## Pr(Down|Case) 0.01225156 0.00000000 0.19505939
## Pr(Down|Ctrl) 0.09583016 0.02681223 0.20450527
## Pr(Up|Ctrl) 0.03678329 0.01779895 0.09818359

(III) By the KM estimator using ('time', 'event', 'p.std', 'p.new').

nricens(time= time, event = event, p.std = p.std, p.new = p.new,
t0 = 2000, cut = c(0.2, 0.4), niter = 100, updown = 'category')

##
## UP and DOWN calculation:
## #of total, case, and control subjects at t0: 312 88 144
##
## Reclassification Table for all subjects:
## New
## Standard < 0.2 < 0.4 >= 0.4
## < 0.2 139 7 1
## < 0.4 17 72 6
## >= 0.4 0 5 65
##
## Reclassification Table for case:
## New
## Standard < 0.2 < 0.4 >= 0.4
## < 0.2 9 2 0
## < 0.4 1 21 4
## >= 0.4 0 0 51
##
## Reclassification Table for control:
## New
## Standard < 0.2 < 0.4 >= 0.4
## < 0.2 92 4 1
## < 0.4 9 29 2
## >= 0.4 0 3 4
##
## NRI estimation by KM estimator:
##
## Point estimates:
## Estimate
## NRI 0.11028068
## NRI+ 0.05123381
## NRI- 0.05904686
## Pr(Up|Case) 0.06348538
## Pr(Down|Case) 0.01225156
## Pr(Down|Ctrl) 0.09583016
## Pr(Up|Ctrl) 0.03678329
##
## Now in bootstrap..
##
## Point & Interval estimates:
## Estimate Lower Upper
## NRI 0.11028068 0.045766814 0.17347974
## NRI+ 0.05123381 -0.001867747 0.10493820
## NRI- 0.05904686 0.013007633 0.10168702
## Pr(Up|Case) 0.06348538 0.021802922 0.11136896
## Pr(Down|Case) 0.01225156 0.000000000 0.03783913
## Pr(Down|Ctrl) 0.09583016 0.056016704 0.13446013
## Pr(Up|Ctrl) 0.03678329 0.018781733 0.05795319

(IV) Calculation of risk difference NRI by the KM estimator, updown = 'diff'.
Brief summary

It is necessary to have a correct understanding of NRI. NRI and C-statistics both evaluate the discrimination of models. The improvement in C-statistics is sometimes limited while the NRI improves significantly, which means that the predictive performance of the new model improves significantly compared with the original model. It should be noted that there is a small difference between the R codes for dichotomous data and for survival data. In this Section, we discussed the calculation of NRI; in the next Section, we will introduce the calculation theory and definition of another index, IDI.

Calculation method of IDI with R

Background

In the previous section about the principle and calculation methods of NRI, we compared AUC (also known as C-statistics) with NRI. NRI has two advantages:
(I) NRI is more sensitive than C-statistics/AUC derived from ROC (26);
(II) NRI is easier to understand in clinical practice when a cut-off value is given, for example, a cut-off value of a diagnostic marker or cut-off values to stratify low risk, intermediate risk and high risk (27).
However, NRI has its disadvantages: NRI only considers performance improvement at one time point and fails to evaluate the overall improvement of a predictive model. Therefore, we could use another index: IDI (Integrated Discrimination Index) (40,41).
Some readers may ask: is AUC/C-statistics able to evaluate the overall improvement of a predictive model? To answer this question, we must go back to the limitations of AUC/C-statistics. If we must compare IDI with AUC/C-statistics, IDI is more sensitive and easier to understand in clinical practice. It should be noted that we could calculate the AUC/C-statistics of one predictive model, but we could not calculate the NRI or IDI of one predictive model: IDI and NRI are calculated from the comparison of two models, and one model alone does not have an IDI or NRI.
We don't know whether the last paragraph makes sense to readers. But when we are dealing with difficult problems, we could put them aside and forget about the advantages and disadvantages of AUC/C-statistics, NRI and IDI. What we should remember is that when comparing the diagnostic power of two markers or comparing two predictive models, we could use not only AUC/C-statistics but also NRI and IDI, which together give a comprehensive perspective on how much the predictive performance improves.

Calculation principle of IDI

The formula of IDI reflects the difference between the predictive probabilities of two models (26,36). Therefore, IDI is calculated based on the predictive probability of each study object under the given predictive models. The formula is:

IDI = (Pnew,events − Pold,events) − (Pnew,non-events − Pold,non-events)

Pnew,events and Pold,events are the average predictive probabilities over the study objects in the disease group using the new model and the original model, respectively. Pnew,events − Pold,events represents the improvement in predictive probability: for the disease group, the higher the predicted probability of disease, the more accurate the predictive model, so a larger difference means that the new model is better.
Pnew,non-events and Pold,non-events are the average predictive probabilities over the study objects in the healthy group using the new model and the original model, respectively. Pnew,non-events − Pold,non-events represents the reduction in predictive probability: for the healthy group, the smaller the predicted probability of disease, the more accurate the predictive model, so a smaller difference means that the new model is better.
At last, IDI is obtained by subtraction. In general, a larger IDI means better predictive performance of the new model. As with NRI, if IDI >0, there is positive improvement, which indicates that the new model has better predictive value than the original model; if IDI <0, there is negative improvement, which indicates that the new model has worse predictive value than the original model; if IDI =0, there is no improvement. We could calculate a z-score to determine whether the difference between the new model and the original model reaches a significant level; the z-score approximately obeys the standard normal distribution. The formula for the z-score calculation is listed below:

Z = IDI / √((SEevents)² + (SEnon-events)²)  [3]

SEevents is the standard error of (Pnew,events − Pold,events): first, calculate the predictive probability of each patient in the disease group using the new and the original model and take their difference; then calculate the standard error of that difference. SEnon-events is the standard error of (Pnew,non-events − Pold,non-events): first, calculate the predictive probability of each patient in the healthy group using the new and the original model and take their difference; then calculate the standard error of that difference.
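The IDI formula above amounts to four group means. As a numeric illustration (Python; all predicted probabilities below are invented, not taken from any case data):

```python
# Invented predicted probabilities from an old and a new model.
p_old_events = [0.60, 0.70, 0.55]            # disease group, original model
p_new_events = [0.72, 0.80, 0.61]            # disease group, new model
p_old_nonevents = [0.30, 0.20, 0.25, 0.35]   # healthy group, original model
p_new_nonevents = [0.25, 0.18, 0.22, 0.31]   # healthy group, new model

def mean(xs):
    return sum(xs) / len(xs)

# IDI = (Pnew,events - Pold,events) - (Pnew,non-events - Pold,non-events)
idi = (mean(p_new_events) - mean(p_old_events)) \
    - (mean(p_new_nonevents) - mean(p_old_nonevents))
print(round(idi, 4))   # prints 0.1283
```

Here the new model raises the average probability in the disease group and lowers it in the healthy group, so both terms contribute positively to the IDI.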
Case study

Calculation of IDI

[Case 1]

Researchers want to evaluate the predictive value of two diagnostic markers for diabetes. Three diagnostic methods (gold standard test, diagnostic marker 1 and diagnostic marker 2) are applied to 100 study objects. The data used in this case is available in the appendix "diagnosisdata.csv". The disease status predicted by the gold standard test, diagnostic test 1 and diagnostic test 2 is listed, of which "gold" represents the results of the gold standard test (1= disease, 0= healthy); "t1" represents the results of diagnostic test 1 (1= positive, 0= negative); and "t2" represents the results of diagnostic test 2 (1= positive, 0= negative). Readers could use the formulas listed above; here we use our own R codes to calculate the IDI of the two diagnostic tests. We have organized our data, renamed it "diagnosisdata.csv" and stored it in the current working directory. To make it easier for readers to practice, data and codes are available for download in the appendix.

R codes and its interpretation

Because there is no function available for the calculation of IDI, we need to define a function named "IDIcalculate()" based on the definition we described above. Codes are presented below:

IDIcalculate=function(m1="dia1",m2="dia2",gold="gold"){
dataidi=dataidi[complete.cases(dataidi),]
for (i in 1:length(names(dataidi))){
if(names(dataidi)[i]==m1)nm1=as.numeric(i)
if(names(dataidi)[i]==m2)nm2=as.numeric(i)
if(names(dataidi)[i]==gold)ngold=as.numeric(i)
}
if(names(table(dataidi[,ngold]))[1]!="0" ||
names(table(dataidi[,ngold]))[2]!="1")
stop("reference standard value not 0 or 1")
logit1=glm(dataidi[,ngold]~dataidi[,nm1],
family=binomial(link='logit'),data=dataidi)
dataidi$pre1=logit1$fitted.values
logit2=glm(dataidi[,ngold]~dataidi[,nm2],
family=binomial(link='logit'),data=dataidi)
dataidi$pre2=logit2$fitted.values
dataidi$predif=dataidi$pre1-dataidi$pre2
dataidi1=dataidi[dataidi[,ngold]==1,]
dataidi2=dataidi[dataidi[,ngold]==0,]
p1=mean(dataidi1$pre1)
p2=mean(dataidi1$pre2)
p3=mean(dataidi2$pre1)
p4=mean(dataidi2$pre2)
IDI=round(p1-p2-p3+p4,3)
z=IDI/sqrt(sd(dataidi1$predif)/length(dataidi1$predif)+
sd(dataidi2$predif)/length(dataidi2$predif))
z=round(as.numeric(z),3)
pvalue=round((1-pnorm(abs(z)))*2,3)
if(pvalue<0.001)pvalue="<0.001"
result=paste("IDI=",IDI,",z=",z,",p=",pvalue,sep="")
return(result)
}

Load the case data into the current working directory, read it in, and set the data format as a data frame. Codes are presented below:

library(foreign)
diagnosisdata <- read.csv("diagnosisdata.csv")
dataidi=as.data.frame(diagnosisdata)

Using the IDI calculation function IDIcalculate() to calculate IDI. Codes are presented below:

IDIcalculate(m1="t1",m2="t2",gold="gold")
## [1] "IDI=0.599,z=5.803,p=<0.001"

m1 is the variable name of diagnostic test 1, m2 is the variable name of diagnostic test 2, and gold is the gold standard test. The IDI is 0.599: the discrimination of diagnostic test 1 is significantly better than that of diagnostic test 2.

IDI calculation of dichotomous outcome

[Case 2]

The data used here is a dataset from the Mayo Clinic which can be imported from the "survival" package. It contains clinical data and PBC status of 418 patients, of which the first 312 patients participated in the randomized trial and the others came from cohort studies. We use the data from the first 312 patients to predict survival status. "Status" is the outcome variable: "0" means censored, "1" means liver transplant, and "2" means dead. But the outcome of our study is dichotomous, therefore it requires data conversion; we construct a logistic model based on the patients' survival status. The detailed description of the other variables is available using "?pbc". R packages for IDI calculation are shown in Table 7.

R codes and its interpretation

Here consider the pbc dataset in the survival package as an example. First, we load the "survival" package and the dataset, then extract the first 312 observations.

library(survival)
dat=pbc[1:312,]
dat$sex=ifelse(dat$sex=='f',1,0)

Subjects censored before 2,000 days are excluded.

dat=dat[dat$time>2000|(dat$time<2000&dat$status==2),]

We predict the event of "death" before 2,000 days.

event=ifelse(dat$time<2000&dat$status==2,1,0)

Standard prediction model: age, bilirubin, and albumin.

z.std=as.matrix(subset(dat,select=c(age,bili,albumin)))

New prediction model: age, bilirubin, albumin, and protime.

z.new=as.matrix(subset(dat,select=c(age,bili,albumin,protime)))

glm() fits the two logistic models.

mstd=glm(event~.,binomial(logit),data.frame(event,z.std),x=TRUE)
mnew=glm(event~.,binomial(logit),data.frame(event,z.new),x=TRUE)
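The P value returned by IDIcalculate() is the two-sided tail probability of the standard normal distribution, pvalue = 2·(1 − Φ(|z|)). A quick check (Python) that the printed z of 5.803 indeed corresponds to P < 0.001:

```python
from math import erf, sqrt

def norm_cdf(x):
    # standard normal CDF via the error function
    return 0.5 * (1 + erf(x / sqrt(2)))

z = 5.803
p = 2 * (1 - norm_cdf(abs(z)))   # two-sided P value
print(p < 0.001)                 # prints True
```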
First, we load the "survival" package and the dataset, then extract the first 312 observations.

library(survival)
dat=pbc[1:312,]
dat$time=as.numeric(dat$time)

Define the survival outcome, with "dead" as the endpoint.

dat$status=ifelse(dat$status==2, 1, 0)

Define the time point.

t0=365*5

Construct a basic matrix containing the regression model variables.

indata0 = as.matrix(subset(dat,select=c(time,status,age,bili,albumin)))

Construct a new matrix adding one new regression model variable.

indata1 = as.matrix(subset(dat,select=c(time,status,age,bili,albumin,protime)))

Variable matrix of the basic regression model:

covs0<-as.matrix(indata0[,c(-1,-2)])

Variable matrix of the new regression model:

covs1<-as.matrix(indata1[,c(-1,-2)])

dat[,2:3] is the survival outcome: the second and third columns are the survival time and status, respectively. covs0 and covs1 are the original and the new variable matrix, respectively, t0 is the time point, and npert is the number of perturbation iterations. The calculation result of IDI is listed below:

library(survIDINRI)
## Loading required package: survC1
x<-IDI.INF(dat[,2:3], covs0, covs1, t0, npert=100)
IDI.INF.OUT(x)
## Est. Lower Upper p-value
## M1 0.025 -0.001 0.055 0.079
## M2 0.226 -0.057 0.401 0.079
## M3 0.012 -0.002 0.036 0.079

The IDI is 0.025, indicating that the predictive performance of the new model improves by 2.5% compared with the original model. Furthermore, the result is illustrated as (Figure 15):

IDI.INF.GRAPH(x)

Figure 15 The comparison of two models.

Brief summary

We have introduced AUC, NRI, IDI and DCA for evaluating and comparing the predictive performance of two models in this series of articles. These indices reflect predictive performance from different angles. Here we give a summary of the three indices.
(I) The AUC/C-statistic derived from ROC analysis is the classic method and the foundation of discrimination evaluation. Although NRI and IDI were developed recently and are highly recommended, the AUC/C-statistic is still the basic method to evaluate the improvement in predictive performance between two models. Of course, we recommend that C-statistics/AUC, NRI and IDI all be calculated, and it would be perfect if DCA analysis is also available. But done is better than perfect.
(II) If the outcome variable is a multiple categorical variable, for example low risk, intermediate risk and high risk, NRI and IDI are better; the AUC/C-statistic is more complicated and harder to explain, as we have discussed before.
(III) The calculation of NRI is related to the cut-off points. If the cut-off points are set too high or there are too few of them, NRI could be underestimated and fail to reach a significant level; if they are set too low or there are too many of them, NRI could be overestimated in clinical practice. Therefore, setting the cut-off values is important for NRI calculation, and it is necessary to choose them based on clinical need. If the cut-off point is hard to determine, IDI and AUC are better; if the cut-off value can be determined, NRI is better.
(IV) Using DCA in clinical utility analysis is the icing on the cake. DCA is not the only method for clinical utility analysis, which we will discuss in the next Section.
In addition, we need to consider a question which can easily be neglected: after adding a new marker, the new model is more complicated than the original model. Is the complicated model acceptable? Is the new marker accessible? Is the new marker convenient? Everything has its pros and cons, and researchers need to make a decision between model improvement and the cost of adding a new marker.
Table 8 A subset of the Framingham Heart Study dataset (only the top 10 observations are listed)
ID Sex SBP (mmHg) DBP (mmHg) SCL (mmol/L) Chdfate Follow up (days) Age (years) BMI (kg/m2) Month
…
ID, serial number; SBP, systolic blood pressure; DBP, diastolic blood pressure; SCL, serum cholesterol; BMI, body mass index.
Decision Curve Analysis for Binary Outcome with R

Background

In the previous sections we explored the C-statistics (i.e., AUC, the area under the ROC curve) that evaluate the discrimination of a predictive model. But is that good enough? The answer: there is no best, only better. For example, predicting whether a patient is ill by a continuous index has a certain probability of false positives and false negatives regardless of which value is selected as the cutoff. Since neither of these situations can be avoided, we will start from the original motivation of constructing the predictive model and try to find a model that yields the greatest net benefit. But how can we calculate the net benefit of a prediction?

In 2006, Andrew Vickers et al., working at Memorial Sloan-Kettering Cancer Center, invented a new calculation method called Decision Curve Analysis (DCA) (42-44). Compared with ROC analysis, which was born during the Second World War, DCA is obviously still young, but the student has outdone the master: many top medical journals, such as Ann Intern Med, JAMA, BMJ and J Clin Oncol, have encouraged the use of the DCA method (45). So how do we draw an attractive decision curve?

Statisticians always think first of using R to implement new algorithms, and this case is no exception. The DCA algorithm based on the R language was announced first, followed by SAS- and Stata-based DCA algorithms. Kerr et al. also created an R package called DecisionCurve for the implementation of the decision curve method, which can no longer be downloaded from the official CRAN website. All the functions of the original DecisionCurve package have been integrated into the rmda package, so when you need to draw a decision curve you just have to install the rmda package in R (46). The tutorials on the Internet on how to install the DecisionCurve package in newer versions of R are therefore no longer appropriate (47); the correct way is to install the rmda package directly. Below we focus on the method of drawing DCA curves and do not dwell on complicated statistical principles.

Case study

Dichotomous Outcome

[Case 1]

The data is a subset of the famous Framingham Heart Study data set from the NHLBI (National Heart, Lung, and Blood Institute), containing 4,699 samples and 10 variables. The independent variables include sex (sex), SBP, diastolic blood pressure (DBP), serum cholesterol (SCL), age (age), body mass index (BMI), etc., and the dependent variable is the CHD-related death event (chdfate). In this case, the dependent variable is a binary variable, with 1 for death and 0 for no death during the follow-up period. The data structure is shown in Table 8. We sorted the data and named the file 'Framingham.csv', which is stored in the current working path.
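The "net benefit" that a decision curve plots has a simple closed form: at a chosen threshold probability pt, net benefit = TP/n − (FP/n) × pt/(1−pt), where TP and FP are the true and false positives obtained by treating everyone whose predicted probability exceeds pt. A minimal base-R sketch of this quantity (the function name and toy data are ours, not part of the case):

```r
# Net benefit at threshold pt: (TP - FP * pt/(1-pt)) / n
net_benefit <- function(pred_prob, outcome, pt) {
  treat <- pred_prob >= pt           # patients the model sends to treatment
  tp <- sum(treat & outcome == 1)    # true positives among the treated
  fp <- sum(treat & outcome == 0)    # false positives among the treated
  n  <- length(outcome)
  (tp - fp * pt / (1 - pt)) / n
}

# Toy example: 6 predicted probabilities and observed outcomes
p <- c(0.9, 0.8, 0.6, 0.4, 0.2, 0.1)
y <- c(1,   1,   0,   1,   0,   0)
nb <- net_benefit(p, y, pt = 0.5)   # 2 TP, 1 FP at threshold 0.5 -> 1/6
```

The weight pt/(1−pt) is the odds at which a patient would be indifferent to treatment, which is why the threshold probability doubles as a statement about how a clinician values false positives relative to true positives.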
is needed. However, readers should understand the truth: DCA is not the only way to assess the clinical utility of predictive models, nor is it a perfect one (49). In fact, the method we use most often was mentioned in the first section. For predicting dichotomous outcomes, we should check whether the prediction model has good sensitivity and specificity; for predicting survival outcomes, we generally check whether patients can be classified into good and poor prognosis groups based on the predictive model, for example by calculating the score of each subject with a nomogram, dividing patients into good and poor prognosis groups according to a cutoff value, and then drawing Kaplan-Meier survival curves.

Decision curve analysis (DCA) for survival outcomes with R

Background

The DCA of survival outcome data is summarized in this Section. In the previous Section, we introduced using the rmda package to perform DCA for a binary outcome, but the rmda package has no function for the survival version of DCA, and currently no package on CRAN can process survival outcome data for DCA. Here we introduce how to perform DCA of survival outcome data based on the source code provided on the MSKCC website (48). We authors are just knowledge porters: the copyright of the R source code in this paper belongs to the original authors, and we only own the copyright of the text annotation and result interpretation parts of the code.

Figure 19 DCA of survival outcome data.

[Case 1]

Read in the data file "dca.txt" under the current working path.

data.set <- read.table("dca.txt", header=TRUE, sep="\t")
attach(data.set)
str(data.set)
## 'data.frame': 750 obs. of 10 variables:
## $ patientid : int 1 2 3 4 5 6 7 8 9 10 ...
## $ cancer : int 0 0 0 0 0 0 0 1 0 0 ...
## $ dead : int 0 0 0 0 0 0 1 0 0 0 ...
## $ ttcancer : num 3.009 0.249 1.59 3.457 3.329 ...
## $ risk_group : Factor w/3 levels "high","intermediate",..: 3 1 3 3 3 2 2 2 3 2 ...
## $ casecontrol : int 0 0 0 1 0 1 0 1 0 1 ...
## $ age : num 64 78.5 64.1 58.5 64 ...
## $ famhistory : int 0 0 0 0 0 0 0 0 0 0 ...
## $ marker : num 0.7763 0.2671 0.1696 0.024 0.0709 ...
## $ cancerpredmarker: num 0.0372 0.57891 0.02155 0.00391 0.01879 ...

This is a dataframe of survival data, including 750 observations and 10 variables:
$ patientid: serial number.
$ cancer: whether cancer occurs, binary variable, 1 = cancer, 0 = non-cancer; the dependent variable.
$ dead: dead or not, binary variable, 1 = dead, 0 = alive.
$ ttcancer: the time from the start of follow-up to the occurrence of cancer, continuous variable, time variable.
$ risk_group: risk factor, factor variable, ordinal, 3 = "high", 2 = "intermediate", 1 = "low".
$ casecontrol: grouping variable, binary variable, 1 = "case", 0 = "control".
$ age: age, continuous variable.
$ famhistory: family history, 0 = no, 1 = yes.
$ marker: a biomarker level, continuous variable.
$ cancerpredmarker: tumor biomarker level, continuous variable.

R codes and its interpretation

The source() function is used to load the source code downloaded from the MSKCC website, which is downloaded in advance and saved to the current working path.

source("stdca.R")

Subsequently, we can directly use the stdca() function defined by this source file for DCA of survival data. Usage of the stdca() function is as follows:

stdca(data, outcome, predictors, timepoint, xstart = 0.01, xstop = 0.99, xby = 0.01, ymin = -0.05, probability = NULL, harm = NULL, graph = TRUE, intervention = FALSE, interventionper = 100, smooth = FALSE, loess.span = 0.10, cmprsk = FALSE)
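Note that stdca() takes predicted probabilities, not a model object, so a Cox model must first be reduced to one predicted failure probability per patient at the chosen time point, i.e., one minus the predicted survival probability. A self-contained sketch of that step, using the survival package's built-in lung data rather than the case data (the column selection and the 365-day time point are our own assumptions for illustration):

```r
library(survival)

# Built-in example data; keep complete cases for the chosen covariates
lung2 <- na.omit(lung[, c("time", "status", "age", "sex", "ph.ecog")])
lung2$dead <- as.integer(lung2$status == 2)  # status: 1 = censored, 2 = dead

fit <- coxph(Surv(time, dead) ~ age + sex + ph.ecog, data = lung2)

# Predicted probability of death by day 365 for every patient:
# survfit() gives each patient's survival curve, summary(..., times=)
# evaluates it at the time point, and 1 - S(t) is the failure probability.
surv365 <- summary(survfit(fit, newdata = lung2), times = 365)$surv
pr_fail365 <- 1 - as.numeric(surv365)
```

A vector like pr_fail365 is exactly what would be stored as a new column and passed to the predictors argument of stdca().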
Notes on the stdca() function parameters are as follows:
data: a data frame containing the variables in the model.
outcome: the outcome (response) variable. Must be a variable contained within the data frame specified in data=.
predictors: the predictor variables. Must be variables contained within the data frame specified in data=.
timepoint: specifies the time point at which the decision curve analysis is performed.
probability: specifies whether or not each of the independent variables is a probability. The default is TRUE.
xstart: starting value for the x-axis (threshold probability) between 0 and 1. The default is 0.01.
xstop: stopping value for the x-axis (threshold probability) between 0 and 1. The default is 0.99.
xby: increment for the threshold probability. The default is 0.01.
ymin: minimum bound for the graph. The default is –0.05.
harm: specifies the harm(s) associated with the independent variable(s). The default is none.
graph: specifies whether or not to display the graph of net benefits. The default is TRUE.
intervention: plot the net reduction in interventions.
interventionper: number of net reductions in interventions per integer. The default is 100.
smooth: specifies whether or not to smooth the net benefit curve. The default is FALSE.
loess.span: specifies the degree of smoothing. The default is 0.10.
cmprsk: whether the outcome is evaluated in the presence of a competing risk. The default is FALSE.

For the statistical analysis of the above case, we first need to define a survival function object, which contains the outcome of the study and the time when the outcome occurred, namely the "cancer" and "ttcancer" variables of the dataframe in this case.

library(survival)
## Attaching package: 'survival'
## The following object is masked from 'data.set':
## cancer
Srv = Surv(data.set$ttcancer, data.set$cancer)

Next, we build the Cox regression model named coxmod using the coxph() function in the survival package. The code is as follows:

coxmod <- coxph(Srv ~ age + famhistory + marker, data=data.set)

The coxmod model was then used to calculate each patient's probability of developing cancer by 1.5 years, i.e., the complement of the predicted 1.5-year survival probability, as shown below:

data.set$pr_failure18 <- c(1 - (summary(survfit(coxmod,
newdata=data.set), times=1.5)$surv))

This step is necessary: according to the parameter notes above, only a variable can be passed to "predictors", so the model's predicted probability is introduced as a new variable to reflect the predictive power of the entire model. If a single raw predictor were passed in instead, it would represent the prediction ability of that factor alone rather than of the entire model.

Use the stdca() function for DCA of the survival outcome data. The DCA curve of "coxmod" is shown in Figure 20 below.

stdca(data=data.set, outcome="cancer", ttoutcome="ttcancer",
timepoint=1.5, predictors="pr_failure18", xstop=0.5, smooth=TRUE)

## $N
## [1] 750
##
## $predictors
## predictor harm.applied probability
## 1 pr_failure18 0 TRUE
##
## $interventions.avoided.per
## [1] 100
##
## $net.benefit
## threshold all none pr_failure18 pr_failure18_sm
## 1 0.01 0.2122351058 0 0.21181105 0.21174613
## 2 0.02 0.2041966885 0 0.20361189 0.20385712
## 3 0.03 0.1959925306 0 0.20030295 0.20030295
## 4 0.04 0.1876174528 0 0.19999857 0.19999857
## 5 0.05 0.1790660576 0 0.18877249 0.18877249
## 6 0.06 0.1703327178 0 0.18397250 0.18397250
## 7 0.07 0.1614115642 0 0.18451216 0.18451216
## 8 0.08 0.1522964725 0 0.17599183 0.17599183
## 9 0.09 0.1429810491 0 0.17050884 0.17050884
## 10 0.10 0.1334586163 0 0.16958715 0.16958715
## 11 0.11 0.1237221963 0 0.16366952 0.16366952
## 12 0.12 0.1137644940 0 0.16432269 0.16432269
## 13 0.13 0.1035778790 0 0.16149156 0.16149156
## 14 0.14 0.0931543659 0 0.15465439 0.15465439
## 15 0.15 0.0824855938 0 0.14784631 0.14784631
## 16 0.16 0.0715628032 0 0.14257246 0.14257246
## 17 0.17 0.0603768129 0 0.14110378 0.14110378
## 18 0.18 0.0489179935 0 0.13798791 0.13798791
## 19 0.19 0.0371762404 0 0.13763724 0.13763724
## 20 0.20 0.0251409434 0 0.12895335 0.12895335
## 21 0.21 0.0128009553 0 0.12448413 0.12448413
## 22 0.22 0.0001445573 0 0.12833148 0.12833148
## 23 0.23 -0.0128405783 0 0.12638034 0.12638034
## 24 0.24 -0.0261674280 0 0.12723684 0.12723684
## 25 0.25 -0.0398496604 0 0.12215471 0.12215471
## 26 0.26 -0.0539016828 0 0.11416302 0.11416302
## 27 0.27 -0.0683386922 0 0.10999741 0.10999741
## 28 0.28 -0.0831767296 0 0.10939791 0.10939791
## 29 0.29 -0.0984327399 0 0.10822856 0.10822856
## 30 0.30 -0.1141246361 0 0.10585886 0.10585886
## 31 0.31 -0.1302713700 0 0.10300367 0.10300367
## 32 0.32 -0.1468930078 0 0.10181986 0.10181986
## 33 0.33 -0.1640108139 0 0.10224584 0.10224584
## 34 0.34 -0.1816473414 0 0.10113914 0.10113914
Figure 20 DCA curve of "coxmod" based on Cox regression model.

## 42 0.42 -0.3446331816 0 0.08169947 0.08169947
## 43 0.43 -0.3682232374 0 0.08143814 0.08143814
## 44 0.44 -0.3926557952 0 0.07927021 0.07927021
## 45 0.45 -0.4179768096 0 0.08356079 0.08356079
## 46 0.46 -0.4442356395 0 0.07803458 0.07803458
## 47 0.47 -0.4714853685 0 0.07986688 0.07986688
## 48 0.48 -0.4997831640 0 0.07858227 0.07858227
## 49 0.49 -0.5291906771 0 0.07768162 0.07751346
## 50 0.50 -0.5597744906 0 0.07587186 0.07591062
##
## $interventions.avoided
## threshold pr_failure18 pr_failure18_sm
## 1 0.01 -4.198140 -4.7660406
## 2 0.02 -2.865499 -0.7203969
## 3 0.03 13.937020 13.9370196
## 4 0.04 29.714687 29.7146874
## 5 0.05 18.442215 18.4422153
## 6 0.06 21.368990 21.3689903
## 7 0.07 30.690796 30.6907960
## 8 0.08 27.249660 27.2496595
## 9 0.09 27.833653 27.8336527
## 10 0.10 32.515680 32.5156803
## 11 0.11 32.321013 32.3210130
## 12 0.12 37.076008 37.0760083
## 13 0.13 38.757620 38.7576205
## 14 0.14 37.778585 37.7785849
## 15 0.15 37.037739 37.0377387
## 16 0.16 37.280070 37.2800702
## 17 0.17 39.413756 39.4137560
## 18 0.18 40.576297 40.5762971
## 19 0.19 42.828111 42.8281111
## 20 0.20 41.524962 41.5249619
## 21 0.21 42.014147 42.0141471
## 22 0.22 45.448092 45.4480917
## 23 0.23 46.608741 46.6087406
## 24 0.24 48.578018 48.5780178
## 25 0.25 48.601310 48.6013099
## 26 0.26 47.833800 47.8337999
## 27 0.27 48.216799 48.2167988
## 28 0.28 49.519193 49.5191931
## 29 0.29 50.596386 50.5963863
## 30 0.30 51.329482 51.3294816
## 31 0.31 51.922508 51.9225082
## 32 0.32 52.851484 52.8514842
## 33 0.33 54.058169 54.0581691
## 34 0.34 54.893846 54.8938461
## 35 0.35 55.506940 55.5069400
## 36 0.36 55.983082 55.9830819
## 37 0.37 56.744897 56.7448975
## 38 0.38 57.066090 57.0660899
## 39 0.39 57.582567 57.5825669
## 40 0.40 58.475938 58.4759382
## 41 0.41 58.110328 58.1103281
## 42 0.42 58.874509 58.8745086
## 43 0.43 59.606275 59.6062750
## 44 0.44 60.063309 60.0633094
## 45 0.45 61.299040 61.2990397
## 46 0.46 61.309982 61.3099824
## 47 0.47 62.173764 62.1737642
## 48 0.48 62.656255 62.6562552
## 49 0.49 63.164259 63.1469446
## 50 0.50 63.564635 63.5686257

In the call above, data = data.set specifies the data set; outcome = "cancer" defines the dichotomous outcome; ttoutcome = "ttcancer" specifies the time variable; timepoint = 1.5 defines the time point as 1.5 years; predictors = "pr_failure18" passes in the prediction probability calculated from the Cox regression model; and probability = TRUE (the default) declares that a probability is being passed in. Only if you use a single raw factor for prediction should probability be set to FALSE.

Now let's construct two Cox regression models:

coxmod1 <- coxph(Srv ~ age + famhistory + marker, data=data.set)
coxmod2 <- coxph(Srv ~ age + famhistory + marker + risk_group,
data=data.set)

According to the survival function, the complement of the 1.5-year survival probability, namely the probability of developing cancer by 1.5 years, was calculated for each of the two models, with the code as follows:

data.set$pr_failure19 <- c(1 - (summary(survfit(coxmod1,
newdata=data.set), times=1.5)$surv))
data.set$pr_failure20 <- c(1 - (summary(survfit(coxmod2,
newdata=data.set), times=1.5)$surv))

The stdca() function is then used for DCA of the two Cox regression models. The DCA curves of "coxmod1" and "coxmod2" are shown in Figure 21 below.

stdca(data=data.set, outcome="cancer", ttoutcome="ttcancer",
timepoint=1.5, predictors=c("pr_failure19","pr_failure20"), xstop=0.5,
smooth=TRUE)

## $N
## [1] 750
##
## $predictors
## predictor harm.applied probability
## 1 pr_failure19 0 TRUE
## 2 pr_failure20 0 TRUE
##
## $interventions.avoided.per
## [1] 100
##
## $net.benefit
## threshold all none pr_failure19 pr_failure20 pr_failure19_sm
## 1 0.01 0.2122351058 0 0.21181105 0.21195459 0.21174613
## 2 0.02 0.2041966885 0 0.20361189 0.20592854 0.20385712
## 3 0.03 0.1959925306 0 0.20030295 0.20082319 0.20030295
## 4 0.04 0.1876174528 0 0.19999857 0.19793636 0.19999857
## 5 0.05 0.1790660576 0 0.18877249 0.18850393 0.18877249
## 6 0.06 0.1703327178 0 0.18397250 0.18758263 0.18397250
## 7 0.07 0.1614115642 0 0.18451216 0.17871488 0.18451216
## 8 0.08 0.1522964725 0 0.17599183 0.17549498 0.17599183
## 9 0.09 0.1429810491 0 0.17050884 0.17292665 0.17050884
## 10 0.10 0.1334586163 0 0.16958715 0.17172714 0.16958715
## 11 0.11 0.1237221963 0 0.16366952 0.16413986 0.16366952
## 12 0.12 0.1137644940 0 0.16432269 0.16115961 0.16432269
## 13 0.13 0.1035778790 0 0.16149156 0.15997205 0.16149156
## 14 0.14 0.0931543659 0 0.15465439 0.15102760 0.15465439
## 15 0.15 0.0824855938 0 0.14784631 0.14680192 0.14784631
## 16 0.16 0.0715628032 0 0.14257246 0.14533848 0.14257246
## 17 0.17 0.0603768129 0 0.14110378 0.14377813 0.14110378
## 18 0.18 0.0489179935 0 0.13798791 0.14260467 0.13798791
## 19 0.19 0.0371762404 0 0.13763724 0.13073905 0.13763724
## 20 0.20 0.0251409434 0 0.12895335 0.13033339 0.12895335
## 21 0.21 0.0128009553 0 0.12448413 0.13224742 0.12448413
## 22 0.22 0.0001445573 0 0.12833148 0.13027540 0.12833148
## 23 0.23 -0.0128405783 0 0.12638034 0.12984114 0.12638034
## 24 0.24 -0.0261674280 0 0.12723684 0.12158006 0.12723684
## 25 0.25 -0.0398496604 0 0.12215471 0.11550497 0.12215471
## 26 0.26 -0.0539016828 0 0.11416302 0.11224497 0.11416302
## 27 0.27 -0.0683386922 0 0.10999741 0.11336879 0.10999741
## 28 0.28 -0.0831767296 0 0.10939791 0.11138142 0.10939791
## 29 0.29 -0.0984327399 0 0.10822856 0.11042784 0.10822856
## 30 0.30 -0.1141246361 0 0.10585886 0.10713217 0.10585886
## 31 0.31 -0.1302713700 0 0.10300367 0.10922587 0.10300367
## 32 0.32 -0.1468930078 0 0.10181986 0.10710860 0.10181986
## 33 0.33 -0.1640108139 0 0.10224584 0.10514071 0.10224584
## 34 0.34 -0.1816473414 0 0.10113914 0.10168529 0.10113914
## 35 0.35 -0.1998265312 0 0.09905699 0.10084803 0.09905699
## 36 0.36 -0.2185738208 0 0.09633101 0.09572434 0.09633101
## 37 0.37 -0.2379162624 0 0.09534742 0.09287710 0.09534742
## 38 0.38 -0.2578826537 0 0.09187725 0.09547575 0.09187725
## 39 0.39 -0.2785036808 0 0.08964716 0.08909328 0.08964716
## 40 0.40 -0.2998120755 0 0.09002751 0.08873631 0.09002751
## 41 0.41 -0.3218427887 0 0.08197475 0.08693879 0.08197475
## 42 0.42 -0.3446331816 0 0.08169947 0.08850919 0.08169947
## 43 0.43 -0.3682232374 0 0.08143814 0.08923527 0.08143814
## 44 0.44 -0.3926557952 0 0.07927021 0.08132415 0.07927021
## 45 0.45 -0.4179768096 0 0.08356079 0.07934853 0.08356079
## 46 0.46 -0.4442356395 0 0.07803458 0.07668500 0.07803458
## 47 0.47 -0.4714853685 0 0.07986688 0.07507451 0.07986688
## 48 0.48 -0.4997831640 0 0.07858227 0.07639787 0.07858227
## 49 0.49 -0.5291906771 0 0.07768162 0.07665976 0.07751346
## 50 0.50 -0.5597744906 0 0.07587186 0.08208139 0.07591062
## pr_failure20_sm
## 1 0.21199928
## 2 0.20575976
## 3 0.20082319
## 4 0.19793636
## 5 0.18850393
## 6 0.18758263
## 7 0.17871488
## 8 0.17549498
## 9 0.17292665
## 10 0.17172714
## 11 0.16413986
## 12 0.16115961
## 13 0.15997205
## 14 0.15102760
## 15 0.14680192
## 16 0.14533848
## 17 0.14377813
## 18 0.14260467
## 19 0.13073905
## 20 0.13033339
## 21 0.13224742
## 22 0.13027540
## 23 0.12984114
## 24 0.12158006
## 25 0.11550497
## 26 0.11224497
## 27 0.11336879
## 28 0.11138142
## 29 0.11042784
## 30 0.10713217
## 31 0.10922587
## 32 0.10710860
## 33 0.10514071
## 34 0.10168529
## 35 0.10084803
## 36 0.09572434
## 37 0.09287710
## 38 0.09547575
## 39 0.08909328
## 40 0.08873631
## 41 0.08693879
## 42 0.08850919
## 43 0.08923527
## 44 0.08132415
## 45 0.07934853
## 46 0.07668500
## 47 0.07507451
## 48 0.07639787
## 49 0.07746882
## 50 0.08189491
##
## $interventions.avoided
## threshold pr_failure19 pr_failure20 pr_failure19_sm pr_failure20_sm
## 1 0.01 -4.198140 -2.777062 -4.7660406 -2.565550
## 2 0.02 -2.865499 8.486095 -0.7203969 7.687163
## 3 0.03 13.937020 15.619119 13.9370196 15.619119
## 4 0.04 29.714687 24.765386 29.7146874 24.765386
## 5 0.05 18.442215 17.931950 18.4422153 17.931950
## 6 0.06 21.368990 27.024866 21.3689903 27.024866
## 7 0.07 30.690796 22.988695 30.6907960 22.988695
## 8 0.08 27.249660 26.678283 27.2496595 26.678283
## 9 0.09 27.833653 30.278333 27.8336527 30.278333
## 10 0.10 32.515680 34.441667 32.5156803 34.441667
## 11 0.11 32.321013 32.701568 32.3210130 32.701568
## 12 0.12 37.076008 34.756418 37.0760083 34.756418
## 13 0.13 38.757620 37.740714 38.7576205 37.740714
## 14 0.14 37.778585 35.550699 37.7785849 35.550699
## 15 0.15 37.037739 36.445916 37.0377387 36.445916
## 16 0.16 37.280070 38.732229 37.2800702 38.732229
## 17 0.17 39.413756 40.719466 39.4137560 40.719466
## 18 0.18 40.576297 42.679487 40.5762971 42.679487
## 19 0.19 42.828111 39.887305 42.8281111 39.887305
## 20 0.20 41.524962 42.076980 41.5249619 42.076980
## 21 0.21 42.014147 44.934621 42.0141471 44.934621
## 22 0.22 45.448092 46.137300 45.4480917 46.137300
## 23 0.23 46.608741 47.767357 46.6087406 47.767357
## 24 0.24 48.578018 46.786704 48.5780178 46.786704
## 28 0.28 49.519193 50.029238 49.5191931 50.029238
## 29 0.29 50.596386 51.134831 50.5963863 51.134831
## 30 0.30 51.329482 51.626588 51.3294816 51.626588
## 31 0.31 51.922508 53.307450 51.9225082 53.307450
## 32 0.32 52.851484 53.975342 52.8514842 53.975342
## 33 0.33 54.058169 54.645916 54.0581691 54.645916
## 34 0.34 54.893846 54.999864 54.8938461 54.999864
## 35 0.35 55.506940 55.839561 55.5069400 55.839561
## 36 0.36 55.983082 55.875228 55.9830819 55.875228
## 37 0.37 56.744897 56.324276 56.7448975 56.324276
## 38 0.38 57.066090 57.653214 57.0660899 57.653214
## 39 0.39 57.582567 57.495935 57.5825669 57.495935
## 40 0.40 58.475938 58.282258 58.4759382 58.282258
## 41 0.41 58.110328 58.824666 58.1103281 58.824666
## 42 0.42 58.874509 59.814900 58.8745086 59.814900
## 43 0.43 59.606275 60.639848 59.6062750 60.639848
## 44 0.44 60.063309 60.324720 60.0633094 60.324720
## 45 0.45 61.299040 60.784209 61.2990397 60.784209
## 50 0.50 63.564635 64.185588 63.5686257 64.166483

Figure 21 DCA curves of "pr_failure 19" and "pr_failure 20" with the "All" and "None" reference lines (x axis: threshold probability; y axis: net benefit).

Figure 22 DCA curve of a single predictor "thickness" based on univariate Cox regression model.

Figure 23 DCA curve of a single predictor "thickness" based on univariate Cox regression model. Y axis represents the net reduction in interventions per 100 persons.
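The net reduction in interventions per 100 persons that Figure 23 plots, and that stdca() prints in its $interventions.avoided blocks, can be reproduced from the net-benefit columns. Assuming stdca() follows the standard decision-curve formula (net reduction per 100 = (NB_model − NB_all) / (pt/(1−pt)) × 100; the function below is our own sketch, not part of stdca()), the first printed row can be checked by hand:

```r
# Net reduction in interventions per 'per' patients at threshold pt,
# given the model's net benefit and the treat-all net benefit.
interventions_avoided <- function(nb_model, nb_all, pt, per = 100) {
  (nb_model - nb_all) / (pt / (1 - pt)) * per
}

# Check against the first printed row of $interventions.avoided:
# threshold 0.01, all = 0.2122351058, pr_failure18 = 0.21181105
ia <- interventions_avoided(nb_model = 0.21181105,
                            nb_all   = 0.2122351058,
                            pt       = 0.01)
# ia is approximately -4.1982, matching the printed -4.198140
# up to the rounding of the printed inputs.
```

A negative value, as at the lowest thresholds here, means the model would trigger slightly more interventions than simply treating everyone.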
stdca(data=data.set, outcome="diedcancer", ttoutcome="time",
timepoint=545, predictors="thickness", probability="FALSE", xstop=.25,
intervention="TRUE")

## Warning: 'newdata' had 1 row but variables found have 205 rows
## [1] "thickness converted to a probability with Cox regression. Due to linearity and proportional hazards assumption, miscalibration may occur."

## $N
## [1] 205
##
## $predictors
## predictor harm.applied probability
## 1 thickness 0 FALSE
##
## $interventions.avoided.per
## [1] 100
##
## $net.benefit
## threshold all none thickness
## 1 0.01 4.043816e-02 0 0.0404381573
## 2 0.02 3.064671e-02 0 0.0306467100
## 3 0.03 2.065338e-02 0 0.0235375096
## 4 0.04 1.045185e-02 0 0.0350104433
## 5 0.05 3.555344e-05 0 0.0341713892
## 6 0.06 -1.060237e-02 0 0.0170264274
## 7 0.07 -2.146906e-02 0 0.0076078405
## 8 0.08 -3.257198e-02 0 0.0074231177
## 9 0.09 -4.391893e-02 0 0.0034843206
## 10 0.10 -5.551803e-02 0 0.0048780488
## 11 0.11 -6.737778e-02 0 0.0055357632
## 12 0.12 -7.950707e-02 0 0.0050997783
## 13 0.13 -9.191520e-02 0 0.0053826745
## 14 0.14 -1.046119e-01 0 0.0049914918
## 15 0.15 -1.176073e-01 0 0.0045911047
## 16 0.16 -1.309122e-01 0 0.0041811847
## 17 0.17 -1.445376e-01 0 0.0037613870
## 18 0.18 -1.584954e-01 0 -0.0015466984
## 19 0.19 -1.727978e-01 0 0.0003011141
## 20 0.20 -1.874578e-01 0 -0.0036585366
## 21 0.21 -2.024889e-01 0 -0.0038900895
## 22 0.22 -2.179054e-01 0 -0.0041275797
## 23 0.23 -2.337224e-01 0 -0.0029141590
## 24 0.24 -2.499556e-01 0 -0.0030808729
## 25 0.25 -2.666216e-01 0 -0.0032520325
##
## $interventions.avoided
## threshold thickness
## 1 0.01 0.000000
## 2 0.02 0.000000
## 3 0.03 9.325362
## 4 0.04 58.940624
## 5 0.05 64.858088
## 6 0.06 43.285110
## 7 0.07 38.630737
## 8 0.08 45.994366
## 9 0.09 47.929951
## 10 0.10 54.356468
## 11 0.11 58.993685
## 12 0.12 62.045024
## 13 0.13 65.114732
## 14 0.14 67.327791
## 15 0.15 69.245776
## 16 0.16 70.924012
## 17 0.17 72.404809
## 18 0.18 71.498851
## 19 0.19 73.794804
## 20 0.20 73.519697
## 21 0.21 74.710978
## 22 0.22 75.793960
## 23 0.23 77.270575
## 24 0.24 78.176984
## 25 0.25 79.010880
Note that we use a single variable to predict the outcome, so probability = FALSE, and the other parameter settings are basically the same as in Case 1.

External validation of Logistic regression model with R

Background

Logistic regression can be used to establish a clinical prediction model for dichotomous outcome variables, and can also be used to predict binary clinical events, such as effective/ineffective, occurrence/non-occurrence, recurrence/non-recurrence and so on. There are differences between good and bad prediction models. A good one not only accurately predicts the probability of endpoint events, which means a good calibration, but also distinguishes the objects with different probabilities of endpoint events in the data set, which means a good discrimination.

Calibration evaluation

The hoslem.test() function in the ResourceSelection package is used to perform the Hosmer-Lemeshow goodness-of-fit test, which is usually used to evaluate the calibration of a Logistic model (50). We should load the R package we need first:

install.packages("ResourceSelection")
library(ResourceSelection)

Logistic regression model construction

To establish the Logistic regression model, we simulate a data set, namely a training set, for modeling, in which all samples of the data set are used for modeling. We could also extract part of the samples for modeling while using the rest, namely the testing set, for internal validation of the model.

set.seed(123)
n <- 100
x <- rnorm(n)
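The simulation code is truncated here by the page break. To keep the idea reproducible, the following is our own self-contained continuation (not the authors' original code): it simulates a binary outcome, fits a logistic model with glm(), and computes the Hosmer-Lemeshow statistic by hand in base R, so the logic behind hoslem.test() is visible:

```r
set.seed(123)
n <- 1000
x <- rnorm(n)

# Simulate a binary outcome whose true model is logit(p) = -1 + 2x
p_true <- plogis(-1 + 2 * x)
y <- rbinom(n, 1, p_true)

fit <- glm(y ~ x, family = binomial)
p_hat <- fitted(fit)

# Hosmer-Lemeshow statistic with g = 10 groups of predicted risk
g <- 10
grp <- cut(p_hat,
           breaks = quantile(p_hat, probs = seq(0, 1, length.out = g + 1)),
           include.lowest = TRUE)
obs   <- tapply(y, grp, sum)       # observed events per group
expct <- tapply(p_hat, grp, sum)   # expected events per group
n_g   <- tapply(y, grp, length)    # group sizes
hl_stat <- sum((obs - expct)^2 / (expct * (1 - expct / n_g)))
p_value <- 1 - pchisq(hl_stat, df = g - 2)
```

A large p-value means no evidence of miscalibration; since the fitted model matches the data-generating model here, the test should usually not reject.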
## $Table_HLtest
## total meanpred meanobs predicted observed

A good prediction model should be robust enough that, whether in the training set, the internal validation set or an external validation set, it shows good discrimination and calibration. A good performance in the training set does not necessarily mean a good performance in the validation set. In addition, a good discrimination does not necessarily mean a good calibration, and vice versa.

Figure 26 ROC curve in validation set.

The evaluation of Cox regression model based on pec package with R
Case analysis

We used the clinical data of 232 patients with renal clear cell carcinoma downloaded from the TCGA database (https://portal.gdc.cancer.gov/) for practical operation. There were 8 variables in the dataset. Among the eight variables, death was the outcome (dependent) variable and OS was the length of survival time. In addition, age, gender, TNM, SSIGN and Fuhrman were the independent variables.

R code and its interpretation

The pec package and the other necessary auxiliary packages should be loaded first. Then, the data in Case_in_TCGA.csv are read in. The R code is shown as follows.

library(dplyr)
library(rms)
library(survival)
library(pec)
data <- read.csv("Case_in_TCGA.csv")

The 232 patients were randomly divided into a training set and a validation set, in which data.1 was the training set and data.2 was the validation set.

set.seed(1450)
x <- nrow(data) %>% runif()
data <- transform(data, sample = order(x)) %>% arrange(sample)
data.1 <- data[1:(nrow(data)/2), ]
data.2 <- data[((nrow(data)/2)+1):nrow(data), ]

Then, two Cox regression models were fitted with the 116 cases of training set data.1.

cox1 <- cph(Surv(os,death)~age+gender+tnm+fuhrman+ssign, data=data.1, surv=TRUE)
cox2 <- cph(Surv(os,death)~tnm+ssign, data=data.1, surv=TRUE)

The survival time points at which survival is predicted need to be set; the times argument of predictSurvProb() takes a vector that contains the time points to be predicted. In this case, the survival probability of each patient in the validation set at the first, third and fifth year is predicted according to the model built on the training set.

t <- c(1,3,5)
survprob <- predictSurvProb(cox1, newdata=data.2, times=t)
head(survprob)
##             1         3         5
## 117 0.9128363 0.7447740 0.6346714
## 118 0.8907077 0.6879995 0.5615877
## 119 0.9077517 0.7314529 0.6172421
## 120 0.9121396 0.7429388 0.6322600
## 121 0.8366330 0.5619540 0.4109773
## 122 0.8577605 0.6091122 0.4653860

The excellence of this package is that it can calculate the survival probability of each patient in the validation set based on the prediction model built on the training set. Moreover, it can also calculate the C-index in the validation set. The C-index, namely the concordance index, mainly reflects the discrimination ability and accuracy of a prediction model. Its definition is simply the number of consistent pairs divided by the number of useful pairs. There will be N*(N-1)/2 pairs if N subjects are randomly paired; if the sample size is large, this calculation cannot be completed without the assistance of computer software.

First, we should find the number of consistent pairs as the numerator. What is a consistent pair? Taking Cox regression analysis of survival data as an example, if the actual survival time is long and the predicted survival rate is also high, or the predicted survival rate is low when the actual survival time is short, we can conclude that the predicted result is consistent with the actual result; otherwise, it is inconsistent. Then, the number of useful pairs should be found as the denominator. What is a useful pair? Taking Cox regression analysis as an example, a useful pair requires that at least one of the two paired individuals reaches the target endpoint; that is to say, if neither of the paired patients shows an endpoint event during the whole observation period, the pair cannot be included in the denominator. In addition, there are two other situations that need to be excluded:
(I) One of the paired subjects reaches an endpoint, while the other cannot reach an endpoint because of loss to follow-up;
(II) The paired subjects die at the same time.
Now that the denominator has been identified, how do we get the numerator?

In fact, the cindex() function in the pec package can calculate the C-index of a prediction model. Meanwhile, it can evaluate the discrimination of different regression modeling strategies for the same data via cross-validation. The discrimination of the two models can be verified with the C-index in data.2.

c_index <- cindex(list("Cox(5 variables)"=cox1, "Cox(2 variables)"=cox2),
                  formula=Surv(os,death)~age+gender+tnm+fuhrman+ssign,
                  data=data.2,
                  eval.times=seq(1,5,0.1))
plot(c_index, xlim = c(0,5))
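The pair-counting definition above can be checked by hand on a toy example. This is a minimal base-R sketch, not the pec implementation; the two censoring-related exclusions (I) and (II) are ignored for simplicity, and the risk scores are hypothetical:

```r
# Toy data: observed survival time, event indicator (1 = endpoint reached),
# and a hypothetical predicted risk score (higher score = worse prognosis)
time  <- c(2, 5, 7, 9)
event <- c(1, 1, 0, 1)
risk  <- c(0.9, 0.5, 0.6, 0.2)

num <- 0; den <- 0
for (i in 1:(length(time) - 1)) {
  for (j in (i + 1):length(time)) {
    shorter <- if (time[i] < time[j]) i else j
    longer  <- if (time[i] < time[j]) j else i
    if (event[shorter] == 1) {        # "useful" pair: the earlier subject had the event
      den <- den + 1
      # "consistent" pair: the subject who failed earlier has the higher risk score
      if (risk[shorter] > risk[longer]) num <- num + 1
    }
  }
}
c(consistent = num, useful = den, C.index = num / den)
```

Here the pair (subject 3, subject 4) is not useful because the shorter time belongs to a censored subject, so the C-index is 4/5 = 0.8.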
Figure 27 The discrimination index of Cox (2 variables) compared with Cox (5 variables) without cross-validation.

Figure 28 The discriminability of Cox (2 variables) compared with Cox (5 variables) with cross-validation.
Figure 29 The calibration plot performed by the pec package.

Figure 30 The calibration plot performed by the pec package with cross-validation.

and calibration evaluation. Because the outcome in the Cox regression model includes time, the calculation method is slightly different from that of logistic regression. However, whether for a Cox regression model or a logistic regression model, a good prediction model should be robust enough to show good discrimination and calibration in the training set, the internal validation set and the external validation set. A good performance in the training set does not necessarily mean a good performance in the validation set. In addition, a good discrimination does not necessarily mean a good calibration, and vice versa.

Fine-Gray test and competing risk model with R

Background

When observing whether an event occurs, if the event is obstructed by other events, there may be multiple outcome events; in such so-called competing risk research, some outcomes will prevent the occurrence of the event of interest or affect the probability of its occurrence. All outcome events form a competing relationship. For example, researchers collected the clinical data of 518 elderly patients diagnosed with mild cognitive impairment (MCI) in 2007, including basic demographic characteristics, lifestyle, physical examination and comorbidity information, and completed six follow-up surveys during 2010–2013; the main outcome was whether Alzheimer's disease (AD) occurred. During the follow-up period, a total of 78 cases of AD occurred; in addition, 28 subjects relocated, 31 withdrew and 25 died. What are the factors that affect the transition from MCI to AD?

In this case, if an MCI patient dies of cancer, cardiovascular disease, a car accident or other causes during the observation period without developing AD, he can no longer contribute to the onset of AD; that is, the endpoint of death "competes" with the occurrence of AD. In traditional survival analysis, individuals who die before developing AD or who are lost to follow-up are all treated as censored, which may lead to bias (54). For an elderly population with a high mortality rate, when there are competing risk events, the traditional survival analysis methods (Kaplan-Meier method, Log-rank test, Cox proportional hazards regression model) will overestimate the risk of the disease of interest, thus leading to competing risk bias. Some studies have found that about 46% of the literature may have such bias (54).

In this case, the competing risk model is appropriate. The so-called competing risk model is an analytical method to process survival data with multiple potential outcomes. As early as 1999, Fine and Gray proposed the semi-parametric proportional subdistribution hazards model, for which the commonly used endpoint index is the cumulative incidence function (CIF) (55,56). In this case, death before AD can be taken as a competing risk event of AD, and the competing risk model is adopted for statistical analysis. Univariate competing risk analysis is often used to estimate the incidence of the endpoint event of interest, and multivariate analysis is often used to explore prognostic factors and effect sizes.

Case analysis

[Case 1]
This case data was downloaded from http://www.stat.unipg.it/luca/R/. Researchers planned to investigate the curative effect of bone marrow transplantation compared with blood transplantation for the treatment of leukemia,
with the endpoint event defined as "recurrence". Unfortunately, some patients die of adverse reactions after transplantation, so the endpoint "recurrence" can never be observed in these patients. In other words, "transplant-related death" and "recurrence" are competing risk events, and a competing risk analysis is therefore required.

fit1 <- cuminc(ftime, Status, D)
fit1
## Tests:
##        stat         pv df
## 1 2.8623325 0.09067592  1
## 2 0.4481279 0.50322531  1
## Estimates and Variances:
The crr() function is convenient for multivariate analysis. The function is used as follows:

crr(ftime, fstatus, cov1, cov2, tf, cengroup, failcode = 1, cencode = 0,
    subset, na.action = na.omit, gtol = 1e-06, maxiter = 10, init, variance = TRUE)

You can refer to the crr() help documentation for a detailed explanation of each parameter. It should be noted that the function must be given the time variable and the outcome variable, and then be passed the covariates as a matrix or data frame. Firstly, the covariates entering the

Interpretation of results: after controlling for the competing risk event, the phase variable, that is, the stage of disease, is an independent risk factor for patient recurrence. Taking patients in the "Relapse" stage as the reference, the hazard ratios (95% CI) of cumulative recurrence for patients in the CR1, CR2 and CR3 stages were 0.332 (0.159, 0.695), 0.361 (0.180, 0.724) and 0.481 (0.155, 1.490), respectively, with corresponding P values of 0.0034, 0.0041 and 0.2000.
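Since crr() takes its covariates as a numeric matrix rather than a model formula, factor covariates must be dummy-coded first, typically with model.matrix(). The following sketch uses a synthetic stand-in for the bmt data (same variable layout as the case data, but simulated values), so the coefficients are not those reported in the text:

```r
library(cmprsk)

# Synthetic stand-in for the bmt data frame (hypothetical values)
set.seed(1)
n <- 120
bmt <- data.frame(
  ftime  = rexp(n, 0.02),
  Status = sample(0:2, n, replace = TRUE),  # 0 censored, 1 recurrence, 2 death
  Phase  = factor(sample(c("CR1", "CR2", "CR3", "Relapse"), n, replace = TRUE)),
  Age    = round(rnorm(n, 35, 12)),
  Source = factor(sample(c("BM+PB", "PB"), n, replace = TRUE))
)

# Expand factors to dummy variables and drop the intercept column
cov1 <- model.matrix(~ Phase + Age + Source, data = bmt)[, -1]
fit  <- crr(ftime = bmt$ftime, fstatus = bmt$Status, cov1 = cov1,
            failcode = 1, cencode = 0)
summary(fit)   # subdistribution hazard ratios with 95% CIs
```

With the real bmt data loaded, the same two lines building cov1 and fit reproduce the multivariate Fine-Gray analysis described above.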
Secondly, the competing risk events that the competing risk model can consider are also limited: currently, only the binary endpoint of the Cox model has been extended to a three-level classification, namely outcome events, censoring and competing risk events. Even so, it can be difficult to interpret the results. Therefore, readers should evaluate and experiment fully when choosing statistical methods.

Nomogram of competing risk model with R

Background

When the cmprsk package for competing risk models is loaded in R, univariate and multivariate analyses of survival data with competing risk events can be carried out using the cuminc() and crr() functions. We have described the implementation in R in detail in the previous section, so it is not repeated here. So how do you visualize the competing risk model? How do you draw a nomogram of the competing risk model? Here we demonstrate how to draw such a nomogram in R.

Case analysis

## $ Phase : Factor w/ 4 levels "CR1","CR2","CR3",..: 4 2 3 2 2 4 1 1 1 4 ...
## $ Age   : int 48 23 7 26 36 17 7 17 26 8 ...
## $ Status: int 2 1 0 2 2 2 0 2 0 1 ...
## $ Source: Factor w/ 2 levels "BM+PB","PB": 1 1 1 1 1 1 1 1 1 1 ...
## $ ftime : num 0.67 9.5 131.77 24.03 1.47 ...

This is a data frame with 7 variables and a total of 177 observations.
$ Sex: sex, factor variable, 2 levels: "F", "M".
$ D: disease type, factor variable, 2 levels: "ALL (acute lymphocytic leukemia)", "AML (acute myelocytic leukemia)".
$ Phase: phase of disease, factor variable, 4 levels: "CR1", "CR2", "CR3", "Relapse".
$ Age: age, continuous variable.
$ Status: outcome variable, 0 = censored, 1 = recurrence, 2 = competing risk event.
$ Source: type of intervention, factor variable, 2 levels: "BM+PB (bone marrow transplantation + blood transplantation)", "PB (blood transplantation)".
$ ftime: time variable, continuous.

Firstly, the variables in the dataset bmt are further processed.

bmt$id <- 1:nrow(bmt)  # sort the data set by rows and generate an ordinal id
bmt$age <- bmt$Age
bmt$sex <- as.factor(ifelse(bmt$Sex=='F',1,0))
bmt$D <- as.factor(ifelse(bmt$D=='AML',1,0))
## n= 177, number of events= 56
##
## $points.tables[[6]]
##   Total Points Pr( ftime < 36 )
## 1          100           0.0894
Figure 32 Nomogram predicting cumulative recurrence risk at 36 and 60 months using the competing risk model. The nomogram estimates that patient no. 31 has a cumulative risk of recurrence of 0.196 at 36 months and 0.213 at 60 months. *, P<0.05; ***, P<0.001.
Figure 33 Nomogram predicting cumulative risk of recurrence at 36 and 60 months using the Cox proportional hazards model. According to the nomogram's estimate, the cumulative risk of recurrence in patient no. 31 at 36 and 60 months is 0.205 and 0.217, respectively. *, P<0.05; ***, P<0.001.
Outlier recognition and missing data handling with R

Background

In Section 1, we described the sources of data for clinical prediction model construction. Whether data are collected prospectively or retrospectively, it is common to have outliers or missing values in the data set. Outliers and missing values are often a tricky issue for statisticians and can lead to errors if not handled properly. Outliers may cause our results to deviate from the real results, and the loss of information caused by missing values may lead to modeling failure. Therefore, it is important to correctly identify outliers and properly handle missing values before performing data analysis. The work discussed in this section should be carried out before modeling; we should not leave it to the end, or think it does not matter, just because it is the last section in this series of articles.

Outliers

What is an outlier? In a word, a value of a variable that goes beyond common sense and does not conform to the rest of the data is an outlier. For example, suppose we collected fasting blood glucose from a group of patients, and one of them had a fasting blood glucose of more than 50 mmol/L; this is obviously an abnormal value. For another example, suppose we investigated the prevalence of hypertension in the elderly over 60 years old in Xuhui District, Shanghai, and one subject has an SBP exceeding 1,400 mmHg; this is obviously an abnormal value, likely a recording error, the true SBP more likely being 140.0 mmHg. Sometimes an outlier is a relative concept that depends on the context in which the clinical research data were collected. For example, if our study is of children under the age of 10, children of this age group are unlikely to be graduate students, their height is unlikely to exceed 170 cm, and their weight is unlikely to exceed 100 kg. There is another situation in which abnormal values may be generated: when the sampling is flawed. For example, if 1,000 people are drawn from Area A and 100 people from Area B, the 100 people from Area B are likely to form a separate cluster whose values are abnormally higher or lower than those of Area A. This situation also occurs in clinical
studies. When we study the efficacy of an intervention, if only some patients show a significant effect, this part of

##     0%    25%    50%    75%   100%
## 100.00 137.25 175.00 212.00 250.00
assign(as.character(as.list(match.call())$dt), dt, envir = .GlobalEnv)
cat("Outliers successfully removed", "\n")
return(invisible(dt))
} else {
cat("Nothing changed", "\n")
return(invisible(var_name))
}
}

The custom function has only two parameters: the first is the name of the data set and the second is the variable name. The reader can run the code directly, as long as the data set and variable names are properly replaced. Here we treat the values outside the whiskers of the box plot as outliers; we can also re-define outliers according to domain expertise, such as values greater or less than the mean ± 3 standard deviations. At the end of the function, a user-input prompt is also set: the user decides whether to remove the outliers recognized by the function from the dataset by typing "yes" or "no".

Below we simulate a set of data to verify the functionality of this custom outlier recognition function.

set.seed(123)
df <- data.frame(height = c(sample(100:250, 1000, replace = TRUE), NA, 380, 20))
outlierKD(df, height)
## Outliers identified: 2
## Proportion (%) of outliers: 0.2
## Mean of the outliers: 200
## Mean without removing outliers: 174.63
## Mean if we remove outliers: 174.58
## Do you want to remove outliers
## and to replace with NA? [yes/no]:
# The author typed "yes", so the outliers 380 and 20 in the original data were removed.

Missing value recognition

Data loss is common in clinical studies. For example, when collecting data, a nurse may forget to record the amount of urine at a certain time point because of busy work; when a researcher wants to study the effect of lactate changes on mortality, the patient may have had blood lactate monitored at only one time point. Other reasons for missing data include coding errors, equipment failures, and non-response of respondents in survey studies (15). In statistical packages, some functions, such as logistic regression, may automatically delete missing data. If there are only a small number of incomplete observations, this processing will not be much of a problem. However, when a large number of observations contain missing values, the default row deletion in these functions can result in a large loss of information. In this case, the analyst should look carefully at the mechanisms that may have produced the missing data and find an appropriate way to handle them.

How to deal with missing values is a headache for clinical statisticians, so I decided to use a section to discuss this thorny topic. Whether data are missing, and to what degree, directly affects the quality of the data, and the quality of the data ultimately affects our research results. If missing data are not handled appropriately, the entire statistical analysis is likely to fail. This section describes how to handle missing data in R and introduces some basic skills for dealing with missing data.

In R, "NA" represents a missing value. When an Excel table with empty cells is imported into the R console, these empty cells are replaced by NA. This is different from STATA, which represents "empty cells" with ".". In R, the same missing value symbol is used for numeric and character variables. R provides some functions to handle missing values. To determine whether a vector contains missing values, you can use the is.na() function, which is the most common method for determining whether an element is of NA type. It returns an object of the same length as the incoming parameter in which every element is a logical value (FALSE or TRUE). Suppose we have 6 patients, but only 4 values are recorded and 2 are missing.

x <- c(1.8,2.3,NA,4.1,NA,5.7)
is.na(x)
## [1] FALSE FALSE  TRUE FALSE  TRUE FALSE

The vector returned by is.na() has a length of 6, and the third and fifth elements are TRUE, indicating that these patients' values are missing.

Use the which() function to find the location of the NAs. Someone might try a logical check (such as x == NA) to detect missing data, but this cannot return TRUE because missing values are not comparable: the value returned by the "==" operator is NA. You must use the missing value function instead. By using which(), you can find out which elements of a vector are NA. In this example the which() function returns 3 and 5, indicating that the third and fifth patients' x values are missing.

which(is.na(x))
## [1] 3 5

Use the sum() function to calculate the number of NAs in the vector.
sum(is.na(x))
## [1] 2

We can remove the missing values in x.

x[!is.na(x)]
## [1] 1.8 2.3 4.1 5.7

Here we use the "!" logical operator, i.e., the negation operation. The code means: find all the elements of the vector x that are not missing values. When there is an NA in a vector, adding, subtracting, multiplying, dividing, averaging or computing the standard deviation all return NA, because by default R treats the NA as one of the elements, which leads to no solution. For example:

sum(x)
## [1] NA
mean(x)
## [1] NA

In this case, we need to use the na.rm parameter of these calculation functions to change it from the default FALSE to TRUE.

mean(x, na.rm = TRUE)
## [1] 3.475
sum(x, na.rm = TRUE)
## [1] 13.9

The R code above deals with missing values in vectors. In practice, however, what we encounter more often are missing values in a data frame. Next, we simulate a data frame with missing values. Note that the third patient has a missing sex value and the fourth patient has a missing x value.

id<-c(1,2,3,4,5,6)
sex<-c("m","f",NA,"f","m","f")
x<-c(0.7,3.4,4.1,NA,6.7,5.3)
df<-data.frame(id,sex,x)
df
##   id  sex   x
## 1  1    m 0.7
## 2  2    f 3.4
## 3  3 <NA> 4.1
## 4  4    f  NA
## 5  5    m 6.7
## 6  6    f 5.3

Now, we can calculate the proportion of missing values in the df data frame, as well as the mean, standard deviation and so on after removing the missing values. Many functions can do this. Here we recommend the describe() function in the psych package, which conveniently gives a series of descriptive statistics.

install.packages("psych")
library(psych)
## Warning: package 'psych' was built under R version 3.5.3
describe(df)
##      vars n mean   sd median trimmed  mad min max range  skew kurtosis
## id      1 6 3.50 1.87    3.5    3.50 2.22 1.0 6.0     5  0.00    -1.80
## sex*    2 5 1.40 0.55    1.0    1.40 0.00 1.0 2.0     1  0.29    -2.25
## x       3 5 4.04 2.25    4.1    4.04 1.78 0.7 6.7     6 -0.29    -1.61
##        se
## id   0.76
## sex* 0.24
## x    1.01

The describe() function returns the basic statistics of the dataset. "n" is the number of non-missing observations of the variable. "mean", "sd" and "median" are the mean, standard deviation and median after removing the NAs. "trimmed" is the mean recalculated after removing 10% of the data from each end, which aims to remove the influence of extreme values. "mad", "min", "max" and "range" are the median absolute deviation, minimum, maximum and range, respectively. "skew", "kurtosis" and "se" are the skewness, kurtosis and standard error, respectively; the first two measure whether the data follow a normal distribution.

Although some default settings in regression models can effectively ignore missing data, it is sometimes also necessary to create a new data frame that excludes missing data. We only need to apply na.omit() to the data frame, and it returns a new data frame with the missing values removed.

df_omit<-na.omit(df)
df_omit
##   id sex   x
## 1  1   m 0.7
## 2  2   f 3.4
## 5  5   m 6.7
## 6  6   f 5.3

The na.omit() call above returns a new data frame with missing values removed. It can be seen that the third and fourth patients, who had missing data, were removed.

Missing value visualization

The visualization of missing values can help us inspect the missing values in a dataset more intuitively, which helps us to impute the missing values later. In this section,
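Besides describe(), the amount of missingness can be quantified directly in base R; a minimal sketch on the same small df simulated above:

```r
# Rebuild the small demo data frame (same values as in the text)
id  <- c(1, 2, 3, 4, 5, 6)
sex <- c("m", "f", NA, "f", "m", "f")
x   <- c(0.7, 3.4, 4.1, NA, 6.7, 5.3)
df  <- data.frame(id, sex, x)

colMeans(is.na(df))        # proportion of NA per variable
sum(!complete.cases(df))   # number of rows containing at least one NA
```

Here sex and x each have 1/6 of their values missing, and 2 of the 6 rows are incomplete, which is exactly why na.omit() below drops two patients.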
the author will mainly introduce the use of the VIM package. The demo dataset for this section is the built-in R dataset "airquality".

data("airquality")
str(airquality)
## 'data.frame': 153 obs. of 6 variables:
## $ Ozone  : int 41 36 12 18 NA 28 23 19 8 NA ...
## $ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
## $ Wind   : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
## $ Temp   : int 67 72 74 62 56 66 65 59 61 69 ...
## $ Month  : int 5 5 5 5 5 5 5 5 5 5 ...
## $ Day    : int 1 2 3 4 5 6 7 8 9 10 ...

The airquality dataset contains 153 observations and 6 variables, and from the output above we can see that there are missing values in this dataset. Before visualizing, first explore the missing data pattern using the md.pattern() function in the mice package.

install.packages("mice")
library(mice)
## Loading required package: lattice
##
## Attaching package: 'mice'
## The following objects are masked from 'package:base':
##
##     cbind, rbind
md.pattern(airquality)
##     Wind Temp Month Day Solar.R Ozone
## 111    1    1     1   1       1     1  0
## 35     1    1     1   1       1     0  1
## 5      1    1     1   1       0     1  1
## 2      1    1     1   1       0     0  2
##        0    0     0   0       7    37 44

Therefore, visualization tools should be used before the imputation operation, and after imputation a diagnosis should usually be made to determine whether the imputed values are reasonable. The functions that can be used to visualize missing data are as follows: aggr(), matrixplot(), scattMiss() and marginplot().

install.packages('VIM')
library(VIM)
## Loading required package: colorspace
## Loading required package: grid
## Loading required package: data.table
## VIM is ready to use.
## Since version 4.0.0 the GUI is in its own package VIMGUI.
##
## Please use the package to use the new (and old) GUI.
## Suggestions and bug-reports can be submitted at: https://github.com/alexkowa/VIM/issues
##
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
##
##     sleep
aggr_plot <- aggr(airquality, col=c('red','blue'), numbers=TRUE,
    sortVars=TRUE, labels=names(airquality), cex.axis=.7, gap=3,
    ylab=c("Histogram of missing data","Pattern"))
## Variables sorted by number of missings:
##  Variable      Count
##     Ozone 0.24183007
##   Solar.R 0.04575163
##      Wind 0.00000000
##      Temp 0.00000000
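The per-variable proportions reported by aggr() with sortVars=TRUE can be cross-checked in base R on the same built-in dataset:

```r
data("airquality")
# Proportion of NA per variable, sorted as in the aggr() summary
sort(colMeans(is.na(airquality)), decreasing = TRUE)
```

Ozone has 37 of 153 values missing (0.2418) and Solar.R has 7 of 153 (0.0458), matching the aggr() output above.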
Figure 34 Visualization of missing values (1).

Figure 35 Visualization of missing values (2).

The number in the lower left corner of Figure 35 indicates the observations in which both variables are missing. The red box plot on the left side of the figure shows the distribution of Solar.R when Ozone is missing, and the blue box plot shows the distribution of Solar.R after removing the missing values of Ozone. The box plots at the bottom of the graph are just the opposite, reflecting the distribution of Ozone when Solar.R is missing and when it is not.

Imputation of missing values

Imputation of missing values is a more complicated problem. First, readers must consider whether it is necessary to impute the missing values at all. The premise of an imputable data set is that the data are missing at random. In addition, we have to consider some practical situations. For example, if only 5% of your data are missing, the ratio is so small that you can simply remove those records from the original data; but if 20% of the data are missing, we would lose a lot of information by eliminating them all. In the real world, it is often difficult to collect the missing data again. Therefore, the question readers need to consider is: what method could be used to impute the missing values when the true values are not available? What method can restore the original appearance of the missing values to the greatest extent?

There are many types of data missingness, which are summarized in the following three cases (15,17):
(I) Missing completely at random (MCAR): the missingness is not related to any variables (whether observed or unobserved), and there is no systematic reason for it.
(II) Missing at random (MAR): the missingness is related to other variables but not to the unmeasured value itself.
(III) Not missing at random (NMAR): everything that is neither MCAR nor MAR.

There are various imputation methods for missing values: simple methods, such as filling in the mean or median directly, and slightly more complicated methods, such as regression imputation. However, no matter which method is used, it is difficult to measure the quality of the imputation, because in the real world we cannot obtain the missing values, so we cannot verify the correctness of the imputation. On a computer, however, we can evaluate different imputation methods by simulation, and such an evaluation is carried out later.

In this section, the author will introduce several missing value imputation methods to the reader for reference. To better show how to impute data, we first simulated 200 observations. The data frame contains three variables: sex, mean arterial pressure (MAP) and lactic acid (lac). To enable the reader to get the same results as this article, we set the seed value for each random simulation. Here we use the mean of each variable to fill in the missing values.

set.seed(123)
sex<-rbinom(200, 1, 0.45)
sex[sex==1]<-"male"
sex[sex==0]<-"female"
set.seed(123)
sex.miss.tag<-rbinom(200, 1, 0.3) # MCAR
sex.miss<-ifelse(sex.miss.tag==1,NA,sex)
set.seed(123)
map <- round(abs(rnorm(200, mean = 70, sd = 30)))
map <- ifelse(map <= 40, map + 30, map)
set.seed(123)
lac <- rnorm(200, mean = 5, sd = 0.7) - map*0.04
lac <- abs(round(lac, 1))
set.seed(123)
lac.miss.tag <- rbinom(200, 1, 0.3)
lac.miss <- ifelse(lac.miss.tag == 1, NA, lac)
data <- data.frame(sex.miss, map, lac.miss)

In these data, the lactate level is assumed to be related to MAP: blood lactate reflects tissue perfusion, which is associated with mean arterial pressure. We hypothesized that MAP and lactate are negatively correlated. To add randomness, we use the rnorm() function to generate the intercept. We assume that the missingness of the gender variable is consistent with MCAR.

The following steps use the summary() function to inspect the data set and calculate its standard deviation.

summary(lac.miss)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.600 1.700 2.100 2.015 2.300 2.700 55
sd(lac.miss, na.rm = TRUE)
## [1] 0.4012707

In the above output, we find 55 missing lactate values and a mean of 2.015. Below, the missing values are replaced with the mean, and the round() function is used to keep the result to one decimal place.

lac.mean <- round(ifelse(is.na(lac.miss), mean(lac.miss, na.rm = TRUE), lac.miss), 1)

Next, we use a visual method to check the distribution of the missing values after replacing them with the mean.

library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit

scatterplot(lac.mean ~ map | lac.miss.tag, lwd = 2,
  main = "Scatter Plot of lac vs. map by # missingness",
  xlab = "Mean Arterial Pressure (mmHg)",
  ylab = "Lactate (mmol/l)",
  legend.plot = TRUE, smoother = FALSE,
  id.method = "identify",
  boxplots = "xy"
)

Unsurprisingly, all inserted values were 2.1 mmol/L of lactate (Figure 36), so the mean and standard deviation of the new sample were biased compared with the actual sample. The mode or the median can be inserted in the same way, which is left to the readers. Although these rough methods are convenient for missing-value interpolation, they underestimate the variance (making it smaller than the actual value), ignore the relationships between variables, and ultimately bias statistics such as the mean and standard deviation. Therefore, such rough estimates should only be used when a small amount of data is missing and cannot be applied generally, and we need to master methods for more complex problems. Next, I will use the BostonHousing dataset in the mlbench package to demonstrate the common methods for filling in missing values.

install.packages('mlbench')

## 2 9.14 21.6
## 3 4.03 34.7
## 4 2.94 33.4
## 5 5.33 36.2
## 6 5.21 28.7

The BostonHousing dataset contains 506 observations and 14 variables that reflect basic characteristics of Boston residents, including the crime rate in each town and the number of non-retail businesses. Since the BostonHousing dataset itself has no missing values, we randomly generate some missing values.
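The mode and median imputation left to the readers above follow the same pattern as the mean; a minimal sketch, using an illustrative toy vector standing in for lac.miss (the names lac.median, lac.mode and the toy values are ours, not from the original):

```
# Toy vector with missing values, standing in for lac.miss above
lac.miss <- c(1.8, 2.1, NA, 2.1, NA, 1.9)

# Median imputation, rounded to one decimal place as in the mean example
lac.median <- round(ifelse(is.na(lac.miss), median(lac.miss, na.rm = TRUE), lac.miss), 1)

# Mode imputation: base R has no mode function, so take the most frequent value
mode.val <- as.numeric(names(which.max(table(lac.miss))))
lac.mode <- ifelse(is.na(lac.miss), mode.val, lac.miss)
```

Like the mean, these single-value insertions shrink the variance and distort the distribution, so the same caveats apply.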
[Figure 36: scatter plot of lac vs. map by missingness]

install.packages('Hmisc')
library(Hmisc)
## Loading required package: survival
## Loading required package: Formula
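Having loaded Hmisc, its impute() function wraps this kind of single-value filling; a hypothetical sketch on the BostonHousing data (injecting missingness into the rm column is our illustrative assumption, not the original's code):

```
library(mlbench)
library(Hmisc)

data(BostonHousing)
set.seed(123)
# Inject 50 missing values into the average-rooms column
BostonHousing$rm[sample(nrow(BostonHousing), 50)] <- NA

# impute() fills NAs with the median (or any supplied function/value)
rm.imp <- impute(BostonHousing$rm, median)
sum(is.na(rm.imp))
```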
model needs to be based on screening imaging omics features, so the preparatory workload is much larger than for the first type, and the imaging omics parameters are more abundant than the clinical features (62).
(III) With the widespread use of high-throughput biotechnologies such as genomics and proteomics, clinical researchers are attempting to mine biomarker characteristics from these vast amounts of biological information to construct clinical prediction models. This type of predictive model is a good entry point from basic medicine to clinical medicine (63).
Because there are so many "omics"-related features in the second and third types of model, variable selection is very difficult, and the traditional variable selection methods described in Section 2 are hard to apply. So, is there a better solution? The answer is YES. The regularization methods described in this Section are one such solution. Regularization can constrain the regression coefficients and even shrink them to zero. There are now many algorithms, and combinations of algorithms, that implement regularization. In this Section, we will focus on ridge regression and LASSO regression.

Introduction of regularization

The general linear model is Y = β0 + β1X1 + … + βnXn + e, and the best fit attempts to minimize the residual sum of squares (RSS). RSS is the sum of the squared differences between the actual and the estimated values, which can be expressed as e1² + e2² + … + en². With regularization, we add a new term, called the shrinkage penalty, to the quantity being minimized. This penalty term contains λ together with a normalized form of the β coefficients (weights); different regularization techniques normalize the weights in different ways. In brief, we replace RSS with RSS + λ(normalized coefficients) in the model. λ, called the tuning parameter, is chosen during model building. If λ = 0, the model is equivalent to ordinary least squares (OLS), because the penalty term vanishes (64).

What are the advantages of regularization? First, the regularization method is computationally very efficient: we only need to fit one model for each λ, so efficiency is greatly improved. Second, there is the bias/variance trade-off. In a linear model where the relationship between the dependent variable and the predictors is close to linear, the least squares estimate is almost unbiased but may have high variance, which means that small changes in the training set can lead to large changes in the least squares coefficient estimates. Regularization, through an appropriate choice of λ, optimizes this bias/variance trade-off and thus improves the fit of the regression model. Finally, regularization of the coefficients can also solve the over-fitting problem caused by multicollinearity (64).

Introduction of ridge regression

We will briefly introduce what ridge regression is and what it can and cannot do. In ridge regression, the penalty term is the sum of the squared coefficients, called the L2-norm, and we try to minimize RSS + λΣβj². As λ increases, the regression coefficients β decrease, tending toward 0 but never reaching it. The advantage of ridge regression is that it can improve prediction accuracy, but because it cannot set the coefficient of any variable exactly to zero, it cannot reduce the number of variables, so the model's interpretability suffers. To solve this problem, we can use the LASSO regression described below.

In addition, ridge regression is often used to deal with collinearity in linear regression. It is generally believed that collinearity leads to over-fitting and very large parameter estimates. Adding a penalty on the regression coefficients β to the least squares objective function solves this problem; the idea is the same as regularization in general, so ridge regression can resolve it.

LASSO regression

Unlike the L2-norm in ridge regression, LASSO regression uses the L1-norm, the sum of the absolute values of all the coefficients; that is, it minimizes RSS + λΣ|βj|. This shrinkage penalty can shrink coefficients all the way to zero, which is a distinct advantage over ridge regression because it greatly increases the interpretability of the model.

If LASSO regression is so good, do we still need ridge regression? The answer is YES. When there is a high degree of collinearity or correlation, LASSO regression may delete important predictive variables, which would reduce the predictive power of the model. For example, if
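In symbols, with RSS the residual sum of squares and λ ≥ 0 the tuning parameter, the two penalized criteria discussed in this Section are:

```latex
% Ridge regression: L2 (squared) penalty; coefficients shrink toward zero
\min_{\beta}\; \mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^{2}

% LASSO: L1 (absolute-value) penalty; coefficients can shrink exactly to zero
\min_{\beta}\; \mathrm{RSS} + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```

Setting λ = 0 in either criterion recovers ordinary least squares.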
both variables A and B should be included in the prediction model, LASSO regression may shrink the coefficient of one of them to zero (64). So ridge regression and LASSO regression are complementary, and choosing the right method for the problem at hand is the key to statistical application.

Currently published clinical prediction model articles using LASSO regression are of two types: one screens the imaging omics features mentioned above, the other screens genomic features. There are usually dozens to hundreds of such features, and they cannot all be included in the predictive model; if they were, the model would be very bloated, and many features may be unrelated to the outcome, so it is sensible to use LASSO regression to reduce the features. Generally, the tens to hundreds of features are reduced to several or a dozen, and a score is then calculated for each patient from the regression coefficients of the LASSO equation and the value of each feature. This score is converted to a categorical variable based on an appropriate cutoff value and included in the regression model as a predictive feature (62,63).

tissue samples of 699 patients and is stored in a data frame with 11 variables. This data frame contains the following columns:
ID: sample code number (not unique).
V1: clump thickness.
V2: uniformity of cell size.
V3: uniformity of cell shape.
V4: marginal adhesion.
V5: single epithelial cell size.
V6: bare nuclei (16 values are missing).
V7: bland chromatin.
V8: normal nucleoli.
V9: mitoses.
class: "benign" or "malignant".

Data processing

We first load the MASS package and prepare the breast cancer data:

library(glmnet)
library(MASS)
biopsy$ID = NULL
names(biopsy) = c("thick", "u.size", "u.shape", "adhsn", "s.size", "nucl", "chrom", "n.nuc", "mit", "class")
biopsy.v2 <- na.omit(biopsy)
set.seed(123) # random number generator
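The later code for this example refers to objects x, y, newx and test whose construction is not reproduced in this extract; a minimal sketch of how they are presumably created, assuming a random 70/30 split of biopsy.v2 (the split ratio is an assumption):

```
set.seed(123)
ind <- sample(2, nrow(biopsy.v2), replace = TRUE, prob = c(0.7, 0.3))
train <- biopsy.v2[ind == 1, ]   # ~70% training set
test  <- biopsy.v2[ind == 2, ]   # ~30% test set

# glmnet requires matrix input for the nine predictors
x    <- as.matrix(train[, 1:9])
y    <- train$class              # "benign"/"malignant" outcome
newx <- as.matrix(test[, 1:9])
```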
Figure 37 The relationship between the coefficient and the Log(λ) in the ridge regression.

[Figure 38: ROC curve of the ridge model on the test set; AUROC = 0.9974]
for this line is 0.03951. But for simplicity's sake, we'll set lambda equal to 0.05 for the test set.

So let's see graphically how the regression coefficients vary with lambda; simply add the argument xvar = "lambda" to the plot() function.

plot(ridge, xvar = "lambda", label = TRUE)

This graph shows that as lambda falls, the shrinkage decreases and the absolute values of the coefficients increase (Figure 37). To look at the coefficients at a particular value of lambda, use the predict() function. Now let's see what the coefficients are when lambda is 0.05: we specify the parameters s = 0.05 and type = "coefficients". The glmnet() function fits the model using the specified lambda value rather than interpolating from the lambda values on either side of it. The R code is as follows:

ridge.coef <- predict(ridge, s = 0.05, type = "coefficients")
ridge.coef
## 10 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -5.4997937
## thick 0.2116920
## u.size 0.1284357
## u.shape 0.1540309
## adhsn 0.1301851
## s.size 0.1665205
## nucl 0.1874988
## chrom 0.1821222
## n.nuc 0.1378914
## mit 0.1277047

It can be seen that a non-zero regression coefficient is obtained for every feature. Next, we validate on the test set; the features need to be transformed into matrix form.

The predict() function is then used to build an object named ridge.y, specifying type = "response" and a lambda value of 0.05. The R code is the following:

ridge.y <- predict(ridge, newx = newx, type = "response", s = 0.05)

By calculating the error and the AUC, we can see the performance of this model on the test set:

library(InformationValue)
##
## Attaching package: 'InformationValue'
##
## confusionMatrix, precision, sensitivity, specificity

actuals <- ifelse(test$class == "malignant", 1, 0)
misClassError(actuals, ridge.y)
## [1] 0.0191
plotROC(actuals, ridge.y)

This misclassification rate is only 0.0191, indicating that the model has a high level of classification and prediction ability (Figure 38).

LASSO regression modeling

It is easy to run a LASSO regression by changing one parameter of the ridge regression model: in the glmnet() function, alpha = 0 for ridge regression is changed to alpha = 1. Run the R code to see the output of the model and check all the fitting results:

lasso <- glmnet(x, y, family = "binomial", alpha = 1)
print(lasso)
##
## Call: glmnet(x = x, y = y, family = "binomial", alpha = 1)
##
## Df %Dev Lambda
## [1,] 0 -8.474e-16 0.3951000
type.measure = "auc",
nfolds = 5)

[Figure 40: ROC curve on the test set; AUROC = 0.9968]
Figure 40 The performance of this model on the test set.

## s.size 0.05802241
## nucl 0.24709063
## chrom 0.08831568
## n.nuc 0.07823788

## u.size 0.11901349
## u.shape 0.09179324
## adhsn .
## s.size .
## nucl 0.17732550
## chrom 0.02233980
## n.nuc 0.02830596
## mit .

It can be seen that the four selected features are thickness, u.size, u.shape and nucl. As in the previous Section, we will look at the performance of this model on the test set by the error and the AUC (Figure 42):

library(InformationValue)
##
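The truncated arguments above (type.measure = "auc", nfolds = 5) appear to come from a cross-validated LASSO fit with cv.glmnet() that is not fully reproduced in this extract; a plausible reconstruction, with the object name lasso.cv matching its later use:

```
set.seed(123)
lasso.cv <- cv.glmnet(x, y, family = "binomial",
                      type.measure = "auc",  # optimize cross-validated AUC
                      nfolds = 5)            # 5-fold cross-validation
plot(lasso.cv)                    # CV curve of AUC versus log(lambda)
coef(lasso.cv, s = "lambda.1se")  # coefficients under the 1-SE rule
```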
patient's PSA level at regular time intervals and determine

## $ age : int 50 58 74 58 62 50 64 58 47 63 ...
## $ lbph : num -1.39 -1.39 -1.39 -1.39 -1.39 ...
## $ svi : int 0 0 0 0 0 0 0 0 0 0 ...
## $ lcp : num -1.39 -1.39 -1.39 -1.39 -1.39 ...
## $ gleason: int 6 6 7 6 6 6 6 6 6 6 ...
## $ pgg45 : int 0 0 20 0 0 0 0 0 0 0 ...
## $ lpsa : num -0.431 -0.163 -0.163 -0.163 0.372 ...
## $ train : logi TRUE TRUE TRUE TRUE TRUE TRUE ...

Some problems need to be considered when examining the data structure. The first 10 observations of svi, lcp, gleason, and pgg45 have the same numbers, with one exception: the third observation of gleason. To make sure these features are usable as input features, we convert the gleason variable into a dichotomous variable, with 0 representing a score of 6 and 1 representing a score of 7 or higher. Deleting variables might reduce the predictive power of the model, and missing values can also cause problems in the glmnet package. The indicator variable can be encoded in a single line of code: use the ifelse() command to specify the column of the data frame to convert, with the rule that if the value of the observation is x, it is encoded as y, and otherwise as z.

prostate$gleason <- ifelse(prostate$gleason == 6, 0, 1)
table(prostate$gleason)
##
## 0 1
## 35 62

First, we establish a training data set and a test data set. Because the observations already contain a variable indicating whether each one belongs to the training set, we can use the subset() function to assign observations with train equal to TRUE to the training set, and those with train equal to FALSE to the test set. It is also necessary to remove the train feature, because we do not want to use it as a predictor. As follows:

train <- subset(prostate, train == TRUE)[, 1:9]
str(train)
## 'data.frame': 67 obs. of 9 variables:
## $ lcavol : num -0.58 -0.994 -0.511 -1.204 0.751 ...
## $ lweight: num 2.77 3.32 2.69 3.28 3.43 ...
## $ age : int 50 58 74 58 62 50 58 65 63 63 ...
## $ lbph : num -1.39 -1.39 -1.39 -1.39 -1.39 ...
## $ svi : int 0 0 0 0 0 0 0 0 0 0 ...
## $ lcp : num -1.39 -1.39 -1.39 -1.39 -1.39 ...
## $ gleason: num 0 0 1 0 0 0 0 0 0 1 ...
## $ pgg45 : int 0 0 20 0 0 0 0 0 0 30 ...
## $ lpsa : num -0.431 -0.163 -0.163 -0.163 0.372 ...

test = subset(prostate, train == FALSE)[, 1:9]
str(test)
## 'data.frame': 30 obs. of 9 variables:
## $ lcavol : num 0.737 -0.777 0.223 1.206 2.059 ...
## $ lweight: num 3.47 3.54 3.24 3.44 3.5 ...
## $ age : int 64 47 63 57 60 69 68 67 65 54 ...
## $ lbph : num 0.615 -1.386 -1.386 -1.386 1.475 ...
## $ svi : int 0 0 0 0 0 0 0 0 0 0 ...
## $ lcp : num -1.386 -1.386 -1.386 -0.431 1.348 ...
## $ gleason: num 0 0 0 1 1 0 0 1 0 0 ...
## $ pgg45 : int 0 0 0 5 20 0 0 20 0 0 ...
## $ lpsa : num 0.765 1.047 1.047 1.399 1.658 ...

Ridge regression model

In ridge regression the model includes all eight features, so a comparison between the ridge regression model and the best-subset model is expected. We use the glmnet package, which requires the input variables to be stored in a matrix rather than a data frame. The call for ridge regression is glmnet(x = input matrix, y = response variable, family = distribution function, alpha = 0). When alpha is 0, ridge regression is performed; when alpha is 1, LASSO regression is performed. It is also easy to prepare the training data for glmnet: use the as.matrix() function to process the input data and create a vector for the response variable, as shown below:

x <- as.matrix(train[, 1:8])
y <- train[, 9]

Now we can run ridge regression. We save the result in an object with an appropriate name, such as ridge. One very important point to note: the glmnet package normalizes the inputs before computing the λ values, and then reports the coefficients on the original (non-normalized) scale. We need to specify the distribution of the response variable as gaussian, because it is continuous, and alpha = 0 for ridge regression. As follows:

ridge <- glmnet(x, y, family = "gaussian", alpha = 0)

This object contains all the information we need to evaluate the model. First try the print() function, which shows the number of non-zero coefficients, the percentage of deviance explained, and the corresponding λ value. The algorithm defaults to 100 λ values, but it stops early if the increase in percentage of deviance explained between two λ values is not significant, that is, when the algorithm has converged to the optimal solution. All the λ results are listed below:

print(ridge)
##
## Call: glmnet(x = x, y = y, family = "gaussian", alpha = 0)
At this point, some charts are very useful. Let's take a look:

plot(ridge, label = TRUE)

Compared with the previous two graphs, this one shows that as λ decreases, the coefficients and the fraction of deviance explained increase (Figure 45). If the λ value were 0, the shrinkage penalty would be ignored and the model

[Figure 45: coefficient paths versus fraction of deviance explained in the ridge regression]

The graph below shows the relationship between predicted and actual values in the ridge regression (Figure 46). Similarly, there are two interesting outliers at the larger values of the PSA measurement. In practical situations,
Figure 47 The relationship between the coefficient and the Log(λ) in the LASSO regression.

As follows:

lasso.coef <- predict(lasso, s = 0.045, type = "coefficients")
lasso.coef
## 9 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -0.1305900670
## lcavol 0.4479592050
## lweight 0.5910476764
## age -0.0073162861
## lbph 0.0974103575
## svi 0.4746790830
## lcp .
lasso.cv$lambda.min # minimum
## [1] 0.00189349
lasso.cv$lambda.1se # one standard error away
## [1] 0.08586749

Figure 48 The relationship between predicted and actual values in the LASSO regression.

Using lambda.1se, we can complete the following process: view the coefficients and perform model validation on the test set:

coef(lasso.cv, s = "lambda.1se")
## 9 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -0.3080148498
## lcavol 0.4416782463
## lweight 0.5300563493
## age .
## lbph 0.0666015918
## svi 0.4194205799
## lcp .
## gleason 0.2475400081
## pgg45 0.0001654219

lasso.y.cv = predict(lasso.cv, newx = newx, type = "response", s = "lambda.1se")
lasso.cv.resid = lasso.y.cv - test$lpsa
mean(lasso.cv.resid^2)
## [1] 0.4455302

Figure 49 The relationship between the logarithm of λ and the mean square error in the LASSO regression.
Brief summary

The aim of this Section is to introduce how to apply advanced feature selection techniques to linear models through a prostate dataset with a small amount of data. The dependent variable of the data set is quantitative, but the glmnet package we use also supports qualitative dependent variables (binary and multinomial) and survival outcome data. We introduced regularization, applied these techniques to build models, and then compared them. Regularization is a powerful technology that improves computational efficiency and extracts more meaningful features than other modeling techniques. In addition, we also used the caret package to optimize multiple parameters while training the model.

The data used in this article can be found online at: http://cdn.amegroups.cn/static/application/1091c788c0342c498b882bd963c5aafb/2019.08.63-1.zip

Acknowledgments

Funding: This work was partly financially supported by the National Natural Science Foundation of China (grant numbers: 81774146, 81602668 and 81760423), Beijing NOVA Programme (grant number: xxjh2015A093 and Z1511000003150125), the Shanghai Youth Medical Talents-Specialist Program and the Shanghai Sailing Program (grant number 16YF1401700).

Footnote

Conflicts of Interest: The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

References

1. Reza Soroushmehr SM, Najarian K. Transforming big data into computational models for personalized medicine and health care. Dialogues Clin Neurosci 2016;18:339-43.
2. Bibault JE, Giraud P, Burgun A. Big Data and machine learning in radiation oncology: State of the art and future prospects. Cancer Lett 2016;382:110-7.
3. Collins GS, Reitsma JB, Altman DG, et al. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. The TRIPOD Group. Circulation 2015;131:211-9.
4. Adams ST, Leveson SH. Clinical prediction rules. BMJ 2012;344:d8312.
5. Ranstam J, Cook JA, Collins GS. Clinical prediction models. Br J Surg 2016;103:1886.
6. Moons KG, Royston P, Vergouwe Y, et al. Prognosis and prognostic research: what, why, and how? BMJ 2009;338:b375.
7. Kannel WB, McGee D, Gordon T. A general cardiovascular risk profile: the Framingham Study. Am J Cardiol 1976;38:46-51.
8. Han K, Song K, Choi BW. How to Develop, Validate, and Compare Clinical Prediction Models Involving Radiological Parameters: Study Design and Statistical Methods. Korean J Radiol 2016;17:339-50.
9. Lee YH, Bang H, Kim DJ. How to Establish Clinical Prediction Models. Endocrinol Metab (Seoul) 2016;31:38-44.
10. Steyerberg EW, Vergouwe Y. Towards better clinical prediction models: seven steps for development and an ABCD for validation. Eur Heart J 2014;35:1925-31.
11. Su TL, Jaki T, Hickey GL, et al. A review of statistical updating methods for clinical prediction models. Stat Methods Med Res 2018;27:185-97.
12. Woodward M, Tunstall-Pedoe H, Peters SA. Graphics and statistics for cardiology: clinical prediction rules. Heart 2017;103:538-45.
13. Hickey GL, Grant SW, Dunning J, et al. Statistical primer: sample size and power calculations-why, when and how? Eur J Cardiothorac Surg 2018;54:4-9.
14. Norman G, Monteiro S, Salama S. Sample size calculations: should the emperor's clothes be off the peg or made to measure? BMJ 2012;345:e5278.
15. Zhou Z, Hu Z. Intelligent Statistics. Changsha: Central South University Press, 2016.
16. Zhang W. Advanced Course of SPSS Statistical Analysis. Beijing: Higher Education Press, 2004.
17. Zhou Z, Hu Z. Crazy Statistics. Changsha: Central South University Press, 2018.
18. Stone GW, Maehara A, Lansky AJ, et al. A prospective natural-history study of coronary atherosclerosis. N Engl J Med 2011;364:226-35.
19. Keteyian SJ, Patel M, Kraus WE, et al. Variables Measured During Cardiopulmonary Exercise Testing as Predictors
Impact of Risk Prediction Models With Decision Curves: Guidance for Correct Interpretation and Appropriate Use. J Clin Oncol 2016;34:2534-40.
50. ResourceSelection: Resource Selection (Probability) Functions for Use-Availability Data. Available online: https://cran.r-project.org/web/packages/ResourceSelection/index.html
51. PredictABEL: Assessment of Risk Prediction Models.
52. Robin X, Turck N, Hainard A, et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics 2011;12:77.
53. pec: Prediction Error Curves for Risk Prediction Models in Survival Analysis. Available online: https://cran.r-project.org/web/packages/pec/index.html
54. He P, Eriksson F, Scheike TH, et al. A Proportional Hazards Regression Model for the Sub-distribution with Covariates Adjusted Censoring Weight for Competing Risks Data. Scand Stat Theory Appl 2016;43:103-22.
55. Fine JP, Gray RJ. A Proportional Hazards Model for the Subdistribution of a Competing Risk. Journal of the American Statistical Association 1999;94:496-509.
56. Gray RJ. A Class of K-Sample Tests for Comparing the Cumulative Incidence of a Competing Risk. Ann Stat 1988;16:1141-54.
57. Scrucca L, Santucci A, Aversa F. Competing risk analysis using R: an easy guide for clinicians. Bone Marrow Transplant 2007;40:381-7.
58. Scrucca L, Santucci A, Aversa F. Regression modeling of competing risk using R: an in depth guide for clinicians. Bone Marrow Transplant 2010;45:1388-95.
59. Zhang Z, Cortese G, Combescure C, et al. Overview of model validation for survival regression model with competing risks using melanoma study data. Ann Transl Med 2018;6:325.
60. Geskus RB. Cause-specific cumulative incidence estimation and the fine and gray model under both left truncation and right censoring. Biometrics 2011;67:39-49.
61. Zhang T, Chen X, Liu Z. Medical Statistics Graphics with R. Beijing: People's Medical Publishing House, 2018.
62. Huang YQ, Liang CH, He L, et al. Development and Validation of a Radiomics Nomogram for Preoperative Prediction of Lymph Node Metastasis in Colorectal Cancer. J Clin Oncol 2016;34:2157-64.
63. Tang XR, Li YQ, Liang SB, et al. Development and validation of a gene expression-based signature to predict distant metastasis in locoregionally advanced nasopharyngeal carcinoma: a retrospective, multicentre, cohort study. Lancet Oncol 2018;19:382-93.
64. Lesmeister C. Mastering Machine Learning with R. Second ed. Birmingham, UK: Packt Publishing Ltd., 2017.
Cite this article as: Zhou ZR, Wang WW, Li Y, Jin KR,
Wang XY, Wang ZW, Chen YS, Wang SJ, Hu J, Zhang HN,
Huang P, Zhao GZ, Chen XX, Li B, Zhang TS. In-depth
mining of clinical data: the construction of clinical prediction
model with R. Ann Transl Med 2019;7(23):796. doi: 10.21037/
atm.2019.08.63