
Special Report


In-depth mining of clinical data: the construction of clinical prediction model with R
Zhi-Rui Zhou1#, Wei-Wei Wang2#, Yan Li3#, Kai-Rui Jin4, Xuan-Yi Wang4, Zi-Wei Wang5, Yi-Shan Chen6,
Shao-Jia Wang7, Jing Hu6, Hui-Na Zhang6, Po Huang6, Guo-Zhen Zhao6, Xing-Xing Chen4, Bo Li6,
Tian-Song Zhang8
1Department of Radiotherapy, Huashan Hospital, Shanghai Medical College, Fudan University, Shanghai 200040, China; 2Department of Thoracic Surgery, The Third Affiliated Hospital of Kunming Medical University & Yunnan Provincial Tumor Hospital, Kunming 650118, China; 3Department of Anesthesiology, The Fourth Affiliated Hospital, Harbin Medical University, Harbin 150001, China; 4Department of Radiation Oncology, Shanghai Cancer Center, Shanghai Medical College, Fudan University, Shanghai 200040, China; 5Department of Urology, Changhai Hospital, The Second Military Medical University, Shanghai 200040, China; 6Beijing Hospital of Traditional Chinese Medicine, Capital Medical University, Beijing Institute of Traditional Chinese Medicine, Beijing 100010, China; 7Department of Gynecologic Oncology, The Third Affiliated Hospital of Kunming Medical University & Yunnan Provincial Tumor Hospital, Kunming 650118, China; 8Internal Medicine of Traditional Chinese Medicine Department, Jing’an District Central Hospital, Fudan University, Shanghai 200040, China
Contributions: (I) Conception and design: ZR Zhou, B Li, TS Zhang; (II) Administrative support: B Li; (III) Provision of study materials or patients:
ZR Zhou; (IV) Collection and assembly of data: All authors; (V) Data analysis and interpretation: All authors; (VI) Manuscript writing: All authors;
(VII) Final approval of manuscript: All authors.
#These authors contributed equally to this work.
Correspondence to: Zhi-Rui Zhou. Department of Radiotherapy, Huashan Hospital, Shanghai Medical College, Fudan University, Shanghai 200040,
China. Email: zzr3711@163.com; Bo Li. Beijing Hospital of Traditional Chinese Medicine, Capital Medical University, Beijing Institute of
Traditional Chinese Medicine, Beijing 100010, China. Email: dr.libo@vip.163.com; Tian-Song Zhang. Internal Medicine of Traditional Chinese
Medicine Department, Jing’an District Central Hospital, Fudan University, Shanghai 200040, China. Email: zhangtiansong@fudan.edu.cn.

Abstract: This article is part of a 16-section methodology series on the construction of clinical prediction models. The first section introduces the concept, current application status, construction methods and processes, and classification of clinical prediction models, as well as the necessary conditions for conducting such research and the problems currently faced. The second section concentrates on variable screening methods in multivariate regression analysis. The third section introduces the construction of prediction models based on logistic regression and nomogram drawing. The fourth section concentrates on the Cox proportional hazards regression model and nomogram drawing. The fifth section introduces the calculation of C-statistics in the logistic regression model. The sixth section introduces two common calculation methods for the C-index in Cox regression based on R. The seventh section focuses on the principle and calculation of the net reclassification index (NRI) using R. The eighth section focuses on the principle and calculation of the integrated discrimination improvement (IDI) using R. The ninth section explores a method for evaluating clinical utility after model construction: decision curve analysis (DCA). The tenth section supplements the previous one and introduces decision curve analysis for survival outcome data. The eleventh section discusses the external validation of the logistic regression model. The twelfth section discusses the in-depth evaluation of the Cox regression model based on R, including calculating the concordance index of discrimination (C-index) in the validation data set and drawing the calibration curve. The thirteenth section introduces how to handle survival outcome data with a competing risk model in R. The fourteenth section introduces how to draw the nomogram of the competing risk model with R. The fifteenth section discusses the identification of outliers and the imputation of missing values. The sixteenth section introduces advanced variable selection methods for linear models, such as ridge regression and LASSO regression.

© Annals of Translational Medicine. All rights reserved. Ann Transl Med 2019;7(23):796 | http://dx.doi.org/10.21037/atm.2019.08.63

Keywords: Clinical prediction models; R; statistical computing

Submitted Jun 05, 2019. Accepted for publication Aug 02, 2019.
doi: 10.21037/atm.2019.08.63
View this article at: http://dx.doi.org/10.21037/atm.2019.08.63

Introduction to Clinical Prediction Models

Background

For a doctor, if there were a certain “specific function” to predict whether a patient will have some unknown outcome, then many medical practice modes and clinical decisions would change. The demand is so strong that almost every day we hear the sigh, “If I could have known in advance, I would certainly not have done this!” For example, if we can predict that a patient with a malignant tumor is resistant to a certain chemotherapy drug, then we will not give the patient that drug; if we can predict that a patient may have major bleeding during surgery, then we will be careful and prepare sufficient blood products for the operation; if we can predict that a patient with hyperlipidemia will not benefit from a certain lipid-lowering drug, then we can avoid many meaningless medical interventions.

As a quantitative tool for assessing risk and benefit, the clinical prediction model can provide more objective and accurate information for the decision-making of doctors, patients and health administrators, so its application is becoming more and more common. Under this rigid demand, research on clinical prediction models is in the ascendant.

The current medical practice model has evolved from empirical medicine to evidence-based medicine and then to precision medicine, and the value of data has never been more important. The rapid development of data acquisition, data storage and analysis, and prediction technology in the big data era has made the vision of personalized medicine more and more achievable (1,2). From the perspective of the evolution of medical practice models, accurately predicting the likelihood of a certain clinical outcome is also an inherent requirement of the current precision medicine model.

This paper summarizes research on clinical prediction models in terms of the concept, current application status, construction methods and processes, and classification of clinical prediction models, as well as the necessary conditions for conducting such research and the current problems.

Concept of clinical prediction model

A clinical prediction model uses a parametric, semi-parametric or non-parametric mathematical model to estimate the probability that a subject currently has a certain disease, or the likelihood of a certain outcome in the future (3). In other words, the clinical prediction model predicts the unknown from the known: the model is a mathematical formula through which known features are used to calculate the probability of the occurrence of an unknown outcome (4,5). Clinical prediction models are generally built with various regression analysis methods, and the statistical nature of regression analysis is to find the “quantitative causality”; simply put, regression analysis quantifies how much X affects Y. Commonly used methods include the multiple linear regression model, the logistic regression model and the Cox regression model. The evaluation and verification of the effectiveness of prediction models are the key to statistical analysis, data modeling and project design, and are also the most demanding part of data analysis technology (6).

Based on the clinical issues studied, clinical prediction models include diagnostic models, prognostic models and disease occurrence models (3). From a statistical point of view, prediction models can be constructed as long as the outcome of a clinical problem (Y) can be quantified by the features (X). The diagnostic model is common in cross-sectional studies, focusing on the clinical symptoms and characteristics of study subjects and the probability of diagnosing a certain disease. The prognostic model focuses on the probability of outcomes such as recurrence, death, disability and complications of a particular disease within a certain period of time; this model is common in cohort studies. There is another type of prediction model that predicts whether a particular disease will occur in the future based on the general characteristics of the subject, which is also


common in cohort studies. There are many similarities among the diagnostic model, the prognostic model and the disease occurrence model. Their outcomes are often dichotomous data, and their effect measures are the absolute risks of the outcome, that is, probabilities of occurrence, rather than relative effect measures such as the relative risk (RR), odds ratio (OR) or hazard ratio (HR). At the technical level, in all of these models researchers face the selection of predictors, the establishment of modeling strategies, and the evaluation and verification of model performance.

Applications of clinical prediction models

As described in the background, clinical prediction models are widely used in medical research and practice. With their help, clinical researchers can select appropriate study subjects more accurately, patients can make choices more beneficial to themselves, doctors can make better clinical decisions, and health management departments can better monitor and manage the quality of medical services and allocate medical resources more rationally. The effects of clinical prediction models are reflected in almost every level of the three-grade disease prevention system:

Primary prevention of disease
The clinical prediction model can provide patients and doctors with a quantitative risk value (probability) of developing a particular disease in the future based on current health status, offering a more intuitive and powerful scientific tool for health education and behavioral intervention. For example, the Framingham cardiovascular risk score, based on the Framingham Heart Study, clarified that lowering blood lipids and blood pressure could prevent myocardial infarction (7).

Secondary prevention of disease
Diagnostic models often use non-invasive, low-cost and easy-to-acquire indicators to construct diagnostic tools with high sensitivity and specificity, putting into practice the idea of “early detection, early diagnosis, early treatment,” which has important health economics significance.

Tertiary prevention of disease
The prognostic model provides quantitative estimates of the probabilities of disease recurrence, death, disability and complications, guiding symptomatic treatment and rehabilitation programs, preventing disease recurrence, reducing mortality and disability, and promoting functional recovery and quality of life.

There are several mature prediction models in clinical practice. For example, the Framingham, QRISK, PROCAM and ASSIGN scores are all well-known prediction models. The TNM staging system for malignant tumors is the most representative prediction model. The biggest advantage of TNM is that it is simple and fast; its greatest problem is that the prediction is not accurate enough, which falls far short of clinicians' expectations. The need for predictive tools in clinical practice goes far beyond predicting disease occurrence or prognosis. If we could predict a patient's disease status in advance, it would guide treatment: for patients with liver cancer, for example, if we could predict in advance whether there is microvascular infiltration, it might help surgeons choose between standard resection and extended resection, which are completely different. Preoperative neoadjuvant chemoradiotherapy is the standard treatment for T1-4N+ middle and low rectal cancer; however, it has been found in clinical practice that the lymph node status estimated from imaging examinations before surgery is not accurate enough, and the proportion of false positives or false negatives is high. Is it possible to predict the patient's lymph node status accurately based on known characteristics before radiotherapy and chemotherapy? These clinical problems might be solved by constructing a suitable prediction model.

Research approach of clinical prediction models

Clinical prediction models are not as simple as fitting a statistical model. From establishment and verification to evaluation and application, there is a complete research process for the clinical prediction model. Many scholars have discussed the research approaches of clinical prediction models (8-11). Heart recently published a review in which the authors used a risk score for cardiovascular disease (CVD) as an example to explore how to construct a predictive model of disease with the help of visual graphics, and proposed six important steps (12):
(I) Select a data set of predictors as potential CVD influencing factors to be included in the risk score;
(II) Choose a suitable statistical model to analyze the relationship between the predictors and CVD;
(III) Select the variables from the existing predictors that are significant enough to be included in the risk score;


(IV) Construct the risk score model;
(V) Evaluate the risk score model;
(VI) Explain the applications of the risk score in clinical practice.
The author has combined literature reports with personal research experience and summarized the research steps as shown in Figure 1.

Clinical problem determination and research type selection

Clinical prediction models can answer questions about the etiology, diagnosis, treatment response and prognosis of diseases, and different research designs are required for different problems. For instance, for etiology studies, a cohort study can be used to predict whether a disease occurs based on potential causes. Questions about diagnostic accuracy are suitable for a cross-sectional design, as the predictive factors and outcomes occur at the same time or within a short period. To predict patients' response to treatment, a cohort study or randomized controlled trial (RCT) can be applied. For prognostic problems, a cohort study is suitable, as there is a longitudinal time logic between predictors and outcomes.

A cohort study assessing etiology requires rational selection of study subjects and control of confounding factors. In studies of diagnostic models, a “gold standard” or reference standard is required to diagnose the disease independently, and the reference standard diagnosis should be performed blinded; that is, the reference standard diagnosis cannot rely on information about the predictors in the prediction model, so as to avoid diagnostic review bias. Assessing patients' response to treatment is a type of interventional research, and it is likewise necessary to select study subjects rationally and control the interference of non-test factors. In studies of prognostic models, there is a longitudinal relationship between predictors and outcomes, and researchers usually expect to observe the outcome of the disease in its natural course, so the prospective cohort study is the most common and the best research design for prognostic models.

Establishment of study design and implementation protocol, data collection and quality control

A good study design and implementation protocol are needed. First, we need to review the literature to determine the number of prediction models to be constructed:
(I) At present, there is no prediction model for the specific clinical problem. To construct a new model, generally a training set is required to construct the model and a validation set to verify its prediction ability.
(II) Prediction models already exist. To construct a new model, a training set is applied to build the new model, and the same validation set is applied to verify the prediction ability of the existing model and the new model respectively.
(III) To update existing models, the same validation set is used to verify the prediction ability of the two models.
With regard to the generation of training and validation data sets, data can be collected prospectively or retrospectively, and prospectively collected data sets are of higher quality. For the modeling population, the sample size is expected to be as large as possible. For prospective clinical studies, the preparation of relevant documents includes the research protocol, the researcher's operation manual, the case report form and the ethical approval document; quality control and management of data collection should also be performed. If data are collected retrospectively, the data quality should also be evaluated, outliers should be identified, and missing values should be properly handled, for example by imputation or deletion. Finally, the training data set for modeling and the validation set for verification are determined according to the actual situation. Sometimes, for practical reasons, we can only model and verify in the same data set; this is allowed, but the external applicability of the model will be affected to some extent.

Establishment and evaluation of clinical prediction models

Before establishing a prediction model, it is necessary to review the predictors reported in the previous literature, determine the principles and methods for selecting predictors, and choose the type of mathematical model to apply. Usually a parametric or semi-parametric model is used, such as the logistic regression model or the Cox regression model. Sometimes machine learning algorithms are used to build models, and most of these models are non-parametric; because there are no parameters such as regression coefficients, the clinical interpretation of such non-parametric models is difficult. Then fit the model and estimate its parameters. It is also necessary to determine the presentation form of the prediction model in advance. Currently, four forms are commonly used in prediction models:
(I) Formula. Use mathematical formulas directly as


[Figure 1 is a flow chart. Variable screening methods: regularization (ridge regression, LASSO regression, elastic net), cluster analysis (hierarchical clustering, K-means clustering, PAM), and principal component and factor analysis. Prediction model construction: parametric models (general linear model: linear regression; generalized linear model: logistic regression, Poisson regression; discriminant analysis), semi-parametric models (Cox proportional hazards model, competing risk model) and non-parametric models (machine learning methods: KNN, SVM, classification and regression trees, random forest, neural networks/deep learning). Training set modeling and presentation: linear regression coefficients converted into scores; logistic and Cox models drawn as nomograms; LASSO regression to screen variables before drawing a nomogram with linear, logistic or Cox methods; machine learning algorithms to construct non-parametric models. Prediction model evaluation: with no existing model and two data sets, validate in the validation set (discrimination: area under the ROC curve, C-statistics, C-index; calibration: calibration plot; DCA analysis: DCA curve); with two existing models, validate both in one data set using the same indexes; with one existing model, add/remove variables or introduce new variables and evaluate the improvement of the new model over the old one (C-statistics, net reclassification index (NRI), integrated discrimination improvement (IDI), DCA curve).]

Figure 1 The flow chart of construction and evaluation of clinical prediction models.
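The construction, presentation and internal validation steps summarized in Figure 1 can be sketched in a few lines of R. The example below is only an illustration on simulated data: the predictors (age, sex, marker), their coefficients, the sample size and the use of the rms package (lrm(), nomogram(), validate()) are assumptions of this sketch, not details taken from the article.

```r
# Sketch only: simulated data; variable names and coefficients are hypothetical
library(rms)
set.seed(123)

n      <- 300
age    <- rnorm(n, mean = 60, sd = 10)
sex    <- factor(sample(c("female", "male"), n, replace = TRUE))
marker <- rnorm(n, mean = 2, sd = 0.5)
# True linear predictor used to simulate a dichotomous outcome
lp <- -8 + 0.08 * age + 0.6 * (sex == "male") + 0.9 * marker
y  <- rbinom(n, 1, plogis(lp))
d  <- data.frame(age, sex, marker, y)

# rms needs a datadist object describing the ranges of the predictors
dd <- datadist(d); options(datadist = "dd")

# Construction: fit a logistic regression model on the training data
fit <- lrm(y ~ age + sex + marker, data = d, x = TRUE, y = TRUE)

# Presentation: convert the regression coefficients into a nomogram
nom <- nomogram(fit, fun = plogis, funlabel = "Predicted probability")
plot(nom)

# Internal validation: 200 bootstrap resamples give optimism-corrected
# indexes (e.g., Dxy, from which the C-statistic is (Dxy + 1) / 2)
validate(fit, method = "boot", B = 200)
```

For survival data, the analogous sketch would use rms::cph() with survival::Surv() as the outcome; the same nomogram() and validate() calls then apply to the Cox model.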

the prediction model tool.
(II) Nomogram. The regression coefficients of the regression model are transformed into scores through appropriate mathematical transformations and plotted as a nomogram, which serves as the predictive model tool.
(III) Web calculator. In essence, this also converts the regression coefficients of the regression model into


scores by appropriate mathematical operations, and makes them into a website for online use.
(IV) Scoring system. The regression coefficients of the regression model are transformed into a quantifiable scoring system through appropriate mathematical operations.
The first form is mainly for the linear regression model, which is a deterministic regression. The latter forms are based on parametric or semi-parametric models, and their statistical nature is the visual representation of the model parameters. Researchers can choose among them according to actual conditions. After the model is built, how should its pros and cons be evaluated? The evaluation and verification of the model demand a higher level of statistical analysis technique. For example, the discrimination, calibration, clinical effectiveness and other indicators of the prediction model are evaluated to determine its performance.

Validation of clinical prediction models

The performance of a prediction model is prone to change as the scenario and the population change. Therefore, a complete study of a prediction model should include validation of the model. Validation covers the internal validity and external validity of the model. Internal validity reflects the reproducibility of the model and can be assessed through cross-validation and the Bootstrap with the data of the study itself. External validity reflects the generalizability of the model and needs to be assessed with data sets not from the study itself, which are temporally and geographically independent, or completely independent.

Internal and external validation are necessary steps to assess the stability and applicability of the model. The data sets for internal validation and external validation should be heterogeneous, but only to a certain extent. Generally, data from the original institution are used as the training set to build the model, and a randomly selected part of the internal data is used to perform internal validation; data from other institutions are selected as the external validation data set. Of course, it is best to perform validation on an external data set. Several methods to verify internal validity are introduced below.
(I) Split-half method. Randomly divide the existing data into two parts, one for building the model and the other for validating it; this split provides the “internal verification.” Since only half of the data is used to build the model, the model is relatively unstable, and studies with small sample sizes are not suitable for this method.
(II) Cross-validation method. This method is a further evolution of the split-half method. Half-fold (two-fold) cross-validation and ten-fold cross-validation are commonly used. Half-fold cross-validation divides the original data into two parts; one part is used to establish the model and the other to validate it, and the roles of the two parts are then exchanged so that they verify each other. Ten-fold cross-validation divides the data into ten parts, using nine parts to establish the model and the remaining part to verify it. By establishing and verifying the model ten times in this way, a relatively stable model can be constructed.
(III) Bootstrap method. The conventional Bootstrap internal validity analysis randomly samples a certain number of cases, with replacement, from the original data set to build a model, and then uses the original data set to verify the model. By repeating this random sampling, model establishment and validation 500–1,000 times, 500–1,000 models are obtained, the parameter distributions of the model can be summarized, and the final parameter values of the model can therefore be determined. The Bootstrap is a fast-developing method of recent years that has grown with the increase in computing power. Models obtained through this method have proved more stable than those from the previous two methods, and it can be expected that the Bootstrap will be increasingly applied to internal validity analysis of prediction models. Of course, if conditions permit, we should carry out external validation of prediction models as much as possible to improve their external applicability.

Assessment of clinical effectiveness of clinical prediction models

The ultimate goal of clinical prediction models is to change the behaviors of doctors and patients and to improve patient outcomes or the cost-effectiveness of care; studying whether a model achieves this is the clinical effectiveness study of clinical prediction models. From the methodological point of view, generally the training set and the validation set are divided according to the new prediction model. For example, for a dichotomous outcome, we can assess the clinical effectiveness by assessing the sensitivity and


specificity of the model. For survival outcomes, we generally evaluate whether patients can be classified into good or poor prognosis groups according to the prediction model. For instance, the score of each subject is calculated from the nomogram, patients are classified into good and poor prognosis groups according to a certain cutoff value, and a Kaplan-Meier survival curve is then drawn. Decision curve analysis is also a commonly used method for assessing the clinical effectiveness of models. From the perspective of the final purpose of model construction and study design, the best clinical effectiveness assessment is a randomized controlled trial, and usually cluster randomized controlled trials are used to assess whether the application of prediction models can improve patient outcomes and reduce medical costs.

Update of clinical prediction models

Even for well-validated clinical prediction models, performance degrades over time due to changes in disease risk factors, unmeasured risk factors, treatment measures and the treatment background, a phenomenon named calibration drift. Therefore, clinical prediction models need to evolve and be updated dynamically. For example, the frequent updates of the most commonly used malignant tumor staging system, TNM, are made for these very reasons.

Current research on clinical prediction models can be roughly divided into three categories from the perspective of clinicians

(I) Prediction models constructed with traditional clinical features, pathological features, physical examination results, laboratory test results, and so on. The predictive variables in this type of model are more convenient to acquire clinically, and the construction of these models is more feasible.
(II) With the maturity of radiomics research methods, more and more researchers are aware that certain imaging manifestations or parameters represent specific biological characteristics. Using the massive imaging parameters of color Doppler ultrasound, CT, MR or PET combined with clinical features to construct prediction models can often further improve their accuracy. Modeling of this type is based on screening radiomics features. The preparatory workload is much larger than for the first category, and close cooperation between the clinical department and the imaging department is needed.
(III) With the wide use of high-throughput biotechnologies such as genomics and proteomics, clinical researchers are attempting to explore featured biomarkers for constructing prediction models from these vast amounts of biological information. Such prediction models are a good entry point for translating basic medicine into clinical medicine, but such research requires strong financial support, as various omics tests of the clinical specimens must be done. However, the input and output of scientific research are directly proportional. As the Chinese saying goes, “if you are reluctant to part with the child, you will never trap the wolf”; nobody would actually trap a wolf with a child, but the point stands: there is no return without investment. Once research willing to invest in omics analysis is well translated into the clinic, it can generally yield articles with high impact factors. In addition, biological samples must be obtained; otherwise there is no foundation on which to launch such research.

The necessary conditions for conducting clinical prediction model research from the perspective of clinicians

(I) Build a follow-up database for a single disease and collect patient information as completely as possible, including but not limited to: demographic characteristics; past history, family history and personal history; disease-related information, such as important physical and laboratory findings before treatment, disease severity, clinical stage, pathological stage and histological grade; treatment information, such as surgical methods, radiotherapy and chemotherapy regimens, dose and intensity; patients' outcomes: for cancer patients, consistent follow-up is required to obtain outcomes, which is an extremely difficult and complex task; and other information, such as genetic information, if available. Database construction is a core competency.
(II) Judging from previously published prediction model articles, most are based on retrospective data sets and only a fraction on prospective data sets. Such research is easier to carry out than an RCT and belongs to the real-world study area that we are now advocating. Real-world studies and RCTs should be twin pearls on the crown of clinical research, complementing each other. In the past, we overemphasized the importance

© Annals of Translational Medicine. All rights reserved. Ann Transl Med 2019;7(23):796 | http://dx.doi.org/10.21037/atm.2019.08.63
Page 8 of 96 Zhou et al. Clinical prediction models with R

of RCT and ignored the great value of real-world data. RCT data undoubtedly have the highest quality, but the data have been strictly screened, so the extrapolation of the evidence is limited. Real-world data come from our daily clinical practice, reflect the efficacy of clinical interventions more comprehensively, and the evidence has better external applicability. However, the biggest problems of real-world study are that the data quality varies widely and that there are too many confounding factors that are difficult to identify. Therefore, more sophisticated statistical methods are necessary to find the truth among the complicated confounding factors. Sifting sand for gold is not easy, and a solid statistical foundation is the sieve. Here we need to understand that confounding factors exist objectively, because no clinical outcome results from a single factor. There are two levels of correction for confounding factors. One is correction at the experimental design stage, which is the top-level correction, such as equalizing confounding factors between groups by randomization with a sufficient sample size. This is also the reason why the RCT is popular: as long as the sample size is sufficient and the randomization is correct, the problem of confounding factors is solved once and for all. The second is after-the-fact correction through statistical methods, which is obviously not as thorough as the RCT correction, but this second situation is closer to the reality of our clinical practice.
(III) Sample size. Because there are many confounding factors in real-world research, a certain sample size is necessary to achieve sufficient statistical power to discern the influence of confounding factors on the outcome. A simple and feasible principle for screening variables by multivariate analysis is that for each variable included in the multivariate analysis there should be 20 samples with the endpoint event, which is called the "1:20 principle" (13,14).
(IV) Clinical research insight. The construction of a clinical prediction model is meant to solve clinical problems. The ability to discover valuable clinical problems is an insight cultivated through wide reading and clinical practice.

Issues currently faced in the development of prediction models

(I) Low clinical conversion rate. The main reason is that the clinical application of a prediction model needs to balance the accuracy of the model against its simplicity. Imagine there were a model as easy to use as TNM staging but more accurate than TNM staging: what choice would you make?
(II) Most clinical prediction models are constructed and validated on retrospective datasets, and validation is rarely performed on prospective data. Therefore, the stability of the results predicted by these models is comparatively poor.
(III) The validation of most clinical prediction models is based on internal data; most articles have only one dataset. Even when there are two datasets, one to construct and the other to validate, the two datasets often come from the same research center. If the validation of a prediction model can be further extended to the dataset of another research center, the application value of the model will be greatly expanded. This work is extremely difficult and requires multi-center cooperation. Moreover, most domestic centers do not have a complete database for validation, which comes back to the topic of "database importance" discussed earlier.

Brief summary

The original intention of the clinical prediction model is to predict the status and prognosis of diseases with a small number of easily collected, low-cost predictors; therefore, most prediction models are short and refined. This was logical and rational in an era when information technology was underdeveloped and data collection, storage and analysis were costly. However, with the development of the economy and the advancement of technology, the costs of data collection and storage have been greatly reduced and the technology of data analysis keeps improving. Therefore, the clinical prediction model should also break through its inherent concept and apply larger amounts of data (big data) and more complex models and algorithms (machine learning and artificial intelligence) to serve doctors, patients and medical decision makers with more accurate results.
In addition, from the perspective of a clinical doctor conducting clinical research, the following four principles should be grasped when conducting research on clinical prediction models:
(I) Building a better clinical prediction model is also an inherent requirement of precision medicine;
(II) How can we get high-quality data? Database

construction is the core competitiveness, while the prediction model is only a technical method;
(III) We need to recognize that the RCT and the real-world study are equally important; both are ways to provide reliable clinical evidence;
(IV) Validation of the models requires internal and external cooperation, so we should strengthen the internal cooperation of scientific research and raise awareness of multi-center scientific research cooperation.

Variable screening methods in multivariate regression analysis

Background

Linear regression, Logistic regression and the Cox proportional hazards regression model are very widely used multivariate regression analysis methods. We introduced the computational principles, the application of the associated software and the interpretation of the results of these three methods in our book Intelligent Statistics (15). However, we said little about independent variable screening methods, which has caused practitioners confusion during data analysis and article writing. This section focuses on that part.
When practitioners turn to statisticians with problems of independent variable screening, the statisticians will usually suggest the automatic screening provided by software such as the Logistic regression and Cox regression modules in IBM SPSS, which offer seven methods for variable screening (16,17):
(I) Conditional parameter estimation likelihood ratio test (Forward: Condition);
(II) Likelihood ratio test of maximum partial likelihood estimation (Forward: LR);
(III) Wald chi-square test (Forward: Wald);
(IV) Conditional parameter estimation likelihood ratio test (Backward: Condition);
(V) Likelihood ratio test of maximum partial likelihood estimation (Backward: LR);
(VI) Wald chi-square test (Backward: Wald);
(VII) Enter method (all variables included; the default method).
Actually, in clinical trial reports many authors adopt the screening strategy I am going to talk about next: they first perform univariate regression analysis of every variable one by one, and those with a P value less than 0.1 will be included in the regression formula (the cutoff could also be 0.05 or 0.2, but it commonly lies between 0.05 and 0.2). This method is very controversial. For practitioners, how to choose a better method is a real test; to be honest, there is no standard answer. But we still have some rules for better variable screening:
(I) When the sample size is large enough and the statistical test power is sufficient, you can choose one of the screening methods mentioned above. Here we introduce a rule that helps you evaluate the test efficiency quickly: 20 samples (events) should be available for each variable. For example, in a Cox regression, if we include 10 variables associated with prognosis, at least 200 patients with the endpoint event, such as death, should be recruited (200 dead patients, not 200 patients in total), because samples without the endpoint event are not counted as effective samples for the test (13,14).
(II) When the sample size does not satisfy the first condition, or the statistical power is insufficient for other reasons, the screening method widely used in most clinical reports can be applied: first perform univariate regression analysis of every variable one by one; those with a P value less than 0.2 are included in the regression formula. As mentioned before, this method remains quite controversial despite its wide application.
(III) Even the second screening method can be challenged in practice. Sometimes variables known to be associated with prognosis are excluded because they fail the pre-specified screening criteria. For example, in a prostate cancer prognosis study the authors may find that Gleason score is not significantly associated with prognosis in the screening model, while Gleason score is a confirmed prognostic factor for prostate cancer in previous studies. What should we do then? In our opinion, for professional and clinical reasons we should include such variables in the analysis even though they are disqualified by the statistical screening method.
To sum up, the author recommends the third variable screening approach: the univariate analysis results, clinical reasons, sample size and statistical power should be considered at the same time. We will explain it in detail below.
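The univariate pre-screening and the "1:20" events-per-variable check described above can be sketched in code. The article's own examples use R and SPSS; purely as an illustration, the following Python sketch screens binary predictors against a binary outcome with a Pearson chi-square test (the function names, the toy data and the choice of test are our own assumptions, not taken from the article):

```python
import math

def chi2_sf_1df(x):
    """Survival function of the chi-square distribution with 1 df.
    P(X > x) = erfc(sqrt(x / 2)), since X = Z^2 for Z ~ N(0, 1)."""
    return math.erfc(math.sqrt(x / 2.0))

def univariate_p(var, outcome):
    """Pearson chi-square P value for a binary predictor vs. a binary outcome.
    Assumes all row and column margins of the 2x2 table are non-zero."""
    # 2x2 contingency table: rows = predictor 0/1, columns = outcome 0/1
    n = [[0, 0], [0, 0]]
    for v, o in zip(var, outcome):
        n[v][o] += 1
    total = sum(map(sum, n))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = sum(n[i]) * (n[0][j] + n[1][j]) / total
            chi2 += (n[i][j] - expected) ** 2 / expected
    return chi2_sf_1df(chi2)

def screen(candidates, outcome, p_cut=0.2, epv=20):
    """Keep variables with univariate P < p_cut; warn if the 1:20 rule is broken."""
    kept = [name for name, values in candidates.items()
            if univariate_p(values, outcome) < p_cut]
    events = sum(outcome)          # only subjects with the endpoint count
    max_vars = events // epv       # 20 events per candidate variable
    if len(kept) > max_vars:
        print(f"Warning: {len(kept)} variables kept but only {events} events; "
              f"the 1:20 rule allows at most {max_vars}.")
    return kept
```

On toy data, `screen` keeps only predictors that pass the lenient univariate cut, and the `events // epv` line makes the 1:20 sample-size check explicit before any multivariate model is fitted.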

Disputes and consensus

The discussion about variable screening has been going on for a long time. Statisticians consider it from a very professional perspective; however, clinical doctors do not always stick to their suggestions. It is very hard to tell right from wrong for the actual problems met during clinical study, such as a small sample size, or limited knowledge with which to confirm the exact factors for the outcome. However, we still have some standards for reference during screening. In a review of good-quality clinical studies published in top journals, five considerations emerge for variable screening (18):
(I) Clinical perspective. This is the most basic consideration for variable screening. Medical statistical analysis can be meaningless if it is just statistical analysis. From the professional point of view, confirmed factors significantly associated with the outcome, such as Gleason score in prostate cancer, should be included in the regression model; we do not need to consider their statistical disqualification in variable screening.
(II) Screening based on univariate analysis. Independent variables are included in the multivariate analysis based on the results of the univariate analysis: variables with a significant P value are entered into the multivariate regression. If P<0.1 we may call it "significant"; sometimes P<0.2 or P<0.05 is considered "significant". The cutoff can vary with sample size: a big sample size goes with a small P value, and a small sample size with a relatively big one. This screening method is quite common in published articles, even in top journals. Though it has been questioned by statisticians, it is still used because no more precise and scientific option is available now; even statisticians cannot find a better replacement. Until a better option replaces this one, it is better to have one than none.
(III) The variables are chosen based on the influence of the confounding factor "Z" on the test factor or exposure factor "X". To be specific, we observe whether the effect of "X" on the dependent variable "Y" changes when "Z" is added. First run the basic model that only includes "X" and record the regression coefficient β1, then add "Z" to the model to see how much β1 changes. It is generally considered that if the β1 change exceeds 10%, the variable needs to be adjusted for; otherwise it does not. This method differs from the second one in that it quantifies the effect of the confounding factor. It is not perfect, because the effects that "Z" and "X" exert on "Y" could themselves be affected by other confounding factors, and this line of thought may lead to logical confusion; the complicated methodological problems are left for further exploration by the smart ones. In our opinion, this is an acceptable option for variable screening, especially for programs with very specific targets: we can confirm the effect of "X" on "Y", this effect is real, and what we do can control these confounding factors.
(IV) Choosing the right number of variables to eventually include in the model is very important. This is a realistic problem. If the sample size is large enough and the statistical power is sufficient, we can use the variable screening methods provided by statistical software to automatically filter the variables and keep those with statistically independent effects on the outcome. "The ideal is plump, but the reality is skinny": sometimes we want to consider a lot of variables while the sample size is pretty small. We then have to compromise between statistical efficiency and variable screening, and a compromise can bear better results (13,14).
(V) Above we listed four commonly used variable screening approaches. Many other variable screening methods, such as those based on model parameters (the determination coefficient R2, AIC, the log-likelihood, C-statistics, etc.), can be options as well. The fact that there are so many variable screening methods is good evidence that no best one is available in practice. This article aims to help us find the right screening method rather than to crown the best or the worst one; choosing the fittest one according to the actual situation is the goal.

The methods for recruiting different types of variables

Continuous variable
For a continuous variable, there is a good protocol for reference. If the relationship between the variable and the outcome is linear, you can include the continuous variable in the regression formula. If not, you can transform it into

a dichotomous variable or an ordinal categorical variable, and then put it into the regression formula. We have often changed a continuous variable into a categorical variable in this way. We do this transformation because the variable may not be linearly related to the outcome; some relationship other than a linear one may be present.

Continuous variable transformation summary
When a continuous variable is included in a regression model, the original variable should be used as far as possible, while actual needs should also be considered. The variable can be transformed based on certain rules: two-category grouping, aliquot grouping, equidistant grouping and clinical cut-off value grouping are available for better professional interpretation. By optimal truncation point analysis, we convert continuous variables into categorical variables and introduce them into the regression model as dummy variables. In a regression model the continuous variable can be presented in different ways, and we will give specific examples below. Whichever way is chosen, the general principle is that the transformation should make professional interpretation and understanding easier (19-21).
Normal transformation
For continuous variables that follow a normal distribution, this is not a problem. However, when we are confronted with data that do not fit a normal distribution, we can transform them with some function so that the data are normalized and fit the regression model. Original data can be normalized by different functions, such as the square root, LnX, Log10X and 1/X methods, according to their own character. If you have normalized the original data, you should interpret the transformed variable instead of the original one in the regression model, or you can reckon the effect of the original independent variable on the original dependent variable according to the function used in the transformation.
For example, the authors of an article published in JACC in 2016 performed a normality test (21). The original expression is as follows: "Normality of continuous variables was assessed by the Kolmogorov-Smirnov test." Methods of normality testing include using the parameters of the data distribution (the skewness and kurtosis values), using data distribution graphs (histogram, P-P plot, Q-Q plot), or applying nonparametric tests (Shapiro-Wilk test, Kolmogorov-Smirnov test) to evaluate the normality of the data. In that research, variables such as troponin I, NT-proBNP and corin followed a skewed distribution, so the authors described the baseline characteristics of the recruited subjects by median (first quartile to third quartile); for example, the median troponin I was 4.5 (1.8–12.6) ng/mL. Multivariable linear regression was then performed to analyze corin. The original expression is as follows: "Multiple linear regression analysis was applied to determine factors influencing corin levels. Levels of troponin I, NT-proBNP, and corin were normalized by Log10 transformation." That is, variables like troponin I, NT-proBNP and corin were normalized by the Log10 function and then entered into the multivariable linear regression. The authors then performed Cox regression; although Cox regression has no specific distributional requirement, the Log10 function was again used to normalize troponin I, NT-proBNP and corin, keeping all three variables consistent with those in the multivariable linear regression model.
Transformation for each change of fixed increment
If a continuous variable is introduced directly into the model in its original form, the regression parameter is interpreted as the change in the dependent variable caused by each one-unit change of the variable. However, sometimes the effect of such a change may be weak. Therefore, we can transform the continuous independent variable into a categorical variable by fixed intervals, in an equidistant grouping, and then introduce it into the model for analysis. This grouping is good for patients' understanding and application. For example, suppose we include patients aged 31 to 80 years: we can divide them into groups of 31–40, 41–50, 51–60, 61–70 and 71–80 according to a 10-year age interval, and the five dummy variables thus set will be included in the model for analysis. However, if the variable has a very wide range, grouping by the methods mentioned before will lead to too many groups and too many dummy variables, which is quite redundant in the analysis and also very hard to interpret clinically. On the contrary, some data have such a small range that they cannot be grouped and therefore cannot be transformed into a categorical variable either. What should we do when confronted with these two situations?
Here we can refer to an article published in JACC in 2016 (19). In the model, the authors used "per" a lot, such as per 5% change, per 0.1 U, per 100 mL/min, etc. This is the transformation of continuous variables in fixed increments per change, which is presented in the form "per +

interval + unit". We will illustrate two examples from this article. The mean oxygen uptake efficiency slope is 1,655 U, and the 5–95% population range is 846 to 2,800 U, a really big range. If the original data were put into the formula, a change of 1 U would lead to a very weak change in HR, which is meaningless in clinical practice; if the variable were transformed into categorical variables, many groups would appear. So the authors entered per-100 U change into the model and found that the mortality risk decreases by 9% (HR =0.91, 95% CI: 0.89–0.93) when the oxygen uptake efficiency slope increases by 100 U. Another example is the variable peak RER. The median is 1.08 U, and the 5–95% population range is 0.91–1.27 U, a really small range. If the original data were put into the formula, a change of 1 U would lead to a very big change in HR, and in clinical practice patients with a change of 1 U are quite rare, so this result would be of limited practicality; transformation into a categorical variable is also hard because of the small range. So the authors entered per-0.1 U change into the model and found that the mortality risk decreases by 6% (HR =0.94, 95% CI: 0.86–1.04) when peak RER increases by 0.1 U; this, however, is not statistically significant.
Then, how can we do this transformation? If we want to change the scale from 1 unit to 100 units, so that the reported effect is 100 times larger, we only need to divide the original variable by 100 and then include it in the model. Similarly, if we want to change the scale from 1 unit to 0.1 unit, reducing the reported change by 10 times, it is only necessary to multiply the original variable by 10 and include it in the regression model.
Transformation of each standard deviation
In clinical studies we meet another transformation method: the independent variable expressed per SD increase. Let us look at an article published in JACC in 2016 (20). Age and systolic pressure are included in the model as per-SD increases. As age increases by one SD, the risk of atherosclerotic cardiovascular disease (ASCVD) increases by 70% (HR =1.70, 95% CI: 1.32–2.19); as systolic blood pressure (SBP) increases by one SD, the risk of ASCVD increases by 25% (HR =1.25, 95% CI: 1.05–1.49). Here the authors have put continuous variables into the model in the form of per-SD increases. Assuming that a variable fits the normal distribution, the area within the mean ±1 SD is 68.27%, within the mean ±1.96 SD it is 95%, and within the mean ±2.58 SD it is 99%. We can tell that if the data range within about 4 SD, roughly 95% of samples are covered. Therefore, for new variables, especially rare ones whose clinical interpretation is still unclear, we can put per-SD change into the model. This can guide patients to see within which range, in standard deviations of the population distribution, their actual measurement falls, and then to assess how much the corresponding risk changes.
It is very simple to do this kind of transformation. We can do it in either of two ways:
(I) Before constructing the regression model, normalize the original continuous variables and bring the normalized independent variables into the regression model. The regression coefficient obtained is the influence on the dependent variable of each SD of the independent variable. (Attention: only the independent variables are normalized here.)
(II) If the original variables are not normalized, they can be brought directly into the model to obtain the unstandardized coefficients; multiplying an unstandardized coefficient by the standard deviation of its independent variable then gives the so-called standardized coefficient. This is the effect on the dependent variable of each additional SD of the independent variable.

Rank variable
The rank variable is very common. It is a kind of ordered multi-category variable: generally, multiple values may be present in the same variable, and these values are rank-correlated with each other. For example, the grade of hypertension (0= normal, 1= high normal, 2= Grade 1, 3= Grade 2, 4= Grade 3), the level of urine protein (0=−, 1=±, 2=+, 3=++, 4=+++, 5=++++) and the effect of a drug (invalid, improvement, cure) are all rank variables. A rank variable is different from a non-ordered multi-category variable in that an ordered multi-category variable presents a monotonic increase or decrease. When ordered multi-category variables enter a Logistic regression model, they are not suggested to be brought in directly as continuous variables unless each one-unit change leads to the same risk ratio change in the outcome; mostly, however, the change is not so ideal. So we suggest treating ordered multi-category variables as dummy variables, so that each level can be compared with another. When the relationship with the outcome is not linear, optimal scale regression should be used to explore the inflection point of the effect.

Non-ordered & multi-categorical variable
The non-ordered multi-categorical variable is a very common variable type. Usually there are several possible values in a multi-categorical variable, while there is no hierarchical relationship between them. For example, race (1= white, 2= black, 3= yellow, 4= others) and method of drug administration (1= oral, 2= hypodermic, 3= intravenous, 4= others) are both non-ordered multi-category variables. When non-ordered multi-category variables enter a Logistic or Cox regression model, we need to set dummy variables before bringing them into the model. We introduce the dummy variable setting methods in the following.

Dummy variable setting methods
(I) Indicator: this method specifies the reference level of the categorical variable. The parameters calculated here are relative to the last or the first level of the variable, depending on whether you choose first or last in the corresponding Reference Category setting.
(II) Simple: this method calculates the ratio of each level of the categorical variable compared with the reference level.
(III) Difference: this method compares each level of the categorical variable with the mean of the preceding levels. It is the exact opposite of Helmert, so it is also called reversed Helmert. For example, level 2 is compared with the mean of level 1, level 3 with the mean of levels 1 and 2, and so forth. If the coefficient becomes small at a certain level and is not statistically significant, the effect of the categorical variable on the risk ratio has reached its plateau. This option is generally used for ordered categorical variables, such as smoking dose; if researchers analyzed such data as an independent non-ordered multi-category variable, it would be meaningless.
(IV) Helmert: each level of the categorical variable is compared with the mean of the following levels. If the coefficient of a certain level increases and is statistically significant, it indicates that the categorical variable has an impact on the risk rate from this level onwards. It can also be used for ordered categorical variables.
(V) Repeated: the levels of the categorical variable are compared with the levels adjacent to them; except for the first level, the "previous level" is used as the reference level.

Brief summary

We have now summarized the screening methods and the variable transformation methods. An absolutely perfect way does not exist in the real world, but we can still choose the right one. Rather than picking a method in haste, we need more scientific solutions. One approach for reference: you can construct multiple models (model 1, model 2, model 3…) based on previously published clinical trials, especially those with high impact, and obtain the objective outcome of each model. This is actually a sensitivity analysis: different models are constructed based on different variables, and variables that are closely related to the real world will lead to relatively stable outcomes even across different models. This is also a way to reach the goal; we will not judge it here, since what we want is to find, from the results, the factors most stably related to the outcome.
During the construction of a predictive model we have specific considerations beyond screening all the possible variables. For example, TNM staging for malignant tumors is widely used because of its easy application in clinical practice rather than its prognostic value; actually, the predictive value of TNM staging is just so-so. Here we have to address another question: how can we weigh the accuracy of a model against its simplicity? More variables may lead to more accurate prediction by a model while making it much more difficult to apply clinically; sometimes a compromise should be made.

Method of building nomogram based on Logistic regression model with R

Background

The need for prediction models in clinical practice goes far beyond predicting disease occurrence or patient prognosis. As explained in Section 1, we may often make a completely different clinical decision if we can predict the patient's disease state in advance. For example, for patients with liver cancer, if we can predict in advance whether there is microvascular infiltration, it may help surgeons choose between standard resection and extended resection, which are completely different procedures. Preoperative neoadjuvant radiotherapy and chemotherapy is the standard treatment for T1-4N+ middle and low rectal cancer. However, it is found in clinical practice that the status of the lymph nodes estimated from imaging examinations before surgery is not accurate enough, and

[Figure 2. Flowchart: Identify clinical issues → Determine research strategies → Determine the predictors → Determine the outcome → Construct the prediction model → Assessment of model discrimination ability → Assessment of model calibration → Assessment of clinical effectiveness]
Figure 2 Research process and technical routes of three prediction models.

the proportion of false positive or false negative is high. Is it possible to predict the patient's lymph node status accurately based on known characteristics before radiotherapy and chemotherapy? If we can build such a prediction model, then we can make clinical decisions more accurately and avoid improper decision-making caused by misjudgment. More and more people are becoming aware of the importance of this problem. At present, researchers have made vast efforts to build prediction models or improve existing prediction tools. The construction of Nomograms is one of the most popular research directions.

When should you choose Logistic regression to build a prediction model? This depends on the clinical problem and on how the clinical outcome is set up. If the outcome is dichotomous, unordered categorical or ranked data, we can choose Logistic regression to construct the model. In principle, unordered (multinomial) Logistic regression and ranked (ordinal) Logistic regression are applied to unordered multi-categorical or ranked outcomes, but their results are difficult to explain. Thus, we generally convert unordered multi-category or ranked outcomes into dichotomous outcomes and use binary Logistic regression to construct the model. Outcomes such as "whether liver cancer has microvascular infiltration" and "recurrence of lymph node metastasis before rectal cancer" mentioned above belong to dichotomous outcomes. Binary Logistic regression can be used for constructing, evaluating and validating the prediction model (15).

The screening principles for model predictors are consistent with the principles described in section 2. In addition, we need to consider two points: on the one hand, the sample size and the number of independent variables included in the model should be weighed; on the other hand, we should weigh the accuracy of the model against the convenience of using it, to finally determine the number of independent variables entering the prediction model (13,14,17).

In this section, we will use two specific cases to introduce the complete process of constructing a Logistic regression prediction model with the R language and drawing a Nomogram. We avoid complex statistical principles as much as possible and focus on the R implementation of the method.

We can summarize the process of constructing and verifying clinical prediction models into the following eight steps (22):
(I) Identify clinical issues and determine scientific hypotheses;
(II) Determine research strategies of prediction models according to the previous literature;
(III) Determine the predictors of the prediction model;
(IV) Determine the outcome variables of the prediction model;
(V) Construct the prediction model and calculate the model predictors;
(VI) Assess the discrimination ability of the model;
(VII) Assess the calibration of the model;
(VIII) Assess the clinical effectiveness of the model.
The research process of prediction model construction is shown in Figure 2.

[Case 1] analysis

[Case 1]
Hosmer and Lemeshow studied the influencing factors of low birth weight infants in 1989. The outcome variable is whether a low birth weight infant is delivered (variable name: "low"; dichotomous variable; 1= low birth weight, i.e., infant birth weight <2,500 g; 0= not low birth weight). The possible influencing factors (independent variables) include: maternal pre-pregnancy weight (lwt, unit: pound); maternal age (age, unit: year); whether the


mother smokes during pregnancy (smoke, 0= no, 1= yes); number of preterm births before pregnancy (ptl, unit: time); high blood pressure (ht, 0= no, 1= yes); uterine irritability to contraction caused by stimulation, oxytocin, etc. (ui, 0= no, 1= yes); visits to community doctors in the first three months of pregnancy (ftv, unit: time); race (race, 1= white, 2= black, 3= other).

[Case 1] interpretation
In this case, the dependent variable is dichotomous (whether or not a low birth weight infant is delivered). The purpose of the study is to investigate the independent influencing factors of low birth weight infants, which is consistent with the application conditions of binary Logistic regression. As there is only one data set in this case, we use this data set as the training set to build the model, and then use the Bootstrap resampling method to perform internal validation in the same data set. It should be noted that we could also randomly divide the data set into a training set and an internal validation set at a 7:3 ratio, but we did not do so considering the sample size. We will demonstrate the construction of the prediction model of low birth weight infants and the drawing of the Nomogram with R below. The data were collected and named "Lweight.sav", which is saved in the current working path of R. The data and code can be downloaded from the attachments of this Section for readers to practice. The specific analysis and calculation steps are as follows:
(I) Screen the independent influencing factors affecting low birth weight infants and construct a Logistic regression model;
(II) Visualize the Logistic regression model and draw a Nomogram;
(III) Calculate the discrimination (C-Statistics) of the Logistic model;
(IV) Perform internal validation with the resampling method and draw the calibration curve.

[Case 1] R codes and results interpretation
Load the "foreign" package for importing external data in .sav format (IBM SPSS data format), and load the rms package to build the Logistic regression model and to plot the nomogram (23):

library(foreign)
library(rms)

Import the external data in .sav format and name it "mydata". Then convert the data to a data frame and display its first 6 rows.

mydata<-read.spss("Lowweight.sav")
mydata<-as.data.frame(mydata)
head(mydata)
##   id           low age lwt  race      smoke ptl     ht  ui ftv  bwt
## 1 85 normal weight  19 182 black no smoking   0 no pih yes   0 2523
## 2 86 normal weight  33 155 other no smoking   0 no pih  no   3 2551
## 3 87 normal weight  20 105 white    smoking   0 no pih  no   1 2557
## 4 88 normal weight  21 108 white    smoking   0 no pih yes   2 2594
## 5 89 normal weight  18 107 white    smoking   0 no pih yes   0 2600
## 6 91 normal weight  21 124 other no smoking   0 no pih  no   0 2622

Data preprocessing: set the outcome variable as a dichotomous variable, define "low birth weight" as "1", and set the unordered categorical variable "race" as dummy variables.

mydata$low <- ifelse(mydata$low =="low weight",1,0)
mydata$race1 <- ifelse(mydata$race =="white",1,0)
mydata$race2 <- ifelse(mydata$race =="black",1,0)
mydata$race3 <- ifelse(mydata$race =="other",1,0)

Load the data frame "mydata" into the current working environment and "package" the data using the function datadist().

attach(mydata)
dd<-datadist(mydata)
options(datadist='dd')

Fit the Logistic regression model using the function lrm() and present the results of the model fitting and the model parameters. Note: the parameter C under "Rank Discrim. Indexes" in the output can be read directly; this is the C-statistic of model "fit1". According to the calculation results, the C-statistic is 0.738 in this example. The meaning and calculation method of the C-statistic will be further explained in the following sections.

fit1<-lrm(low ~ age+ftv+ht+lwt+ptl+smoke+ui+race1+race2,
    data = mydata, x = T, y = T)
fit1
## Logistic Regression Model
##
## lrm(formula = low ~ age + ftv + ht + lwt + ptl + smoke + ui +
##     race1 + race2, data = mydata, x = T, y = T)
##
##        Model Likelihood     Discrimination    Rank Discrim.
##           Ratio Test           Indexes           Indexes
## Obs 189    LR chi2 31.12      R2 0.213         C     0.738
## 0   130    d.f.    9          g  1.122         Dxy   0.476
## 1   59     Pr(> chi2) 0.0003  gr 3.070         gamma 0.477
## max |deriv| 7e-05             gp 0.207         tau-a 0.206
##                               Brier 0.181
##
##               Coef    S.E.   Wald Z Pr(>|Z|)
## Intercept      1.1427 1.0873  1.05  0.2933
## age           -0.0255 0.0366 -0.69  0.4871
## ftv            0.0321 0.1708  0.19  0.8509


## ht=pih          1.7631 0.6894  2.56  0.0105
## lwt            -0.0137 0.0068 -2.02  0.0431
## ptl             0.5517 0.3446  1.60  0.1094
## smoke=smoking   0.9275 0.3986  2.33  0.0200
## ui=yes          0.6488 0.4676  1.39  0.1653
## race1          -0.9082 0.4367 -2.08  0.0375
## race2           0.3293 0.5339  0.62  0.5374
##

Use the function nomogram() to construct the Nomogram object "nom1" and print the Nomogram. The result is shown in Figure 3.

nom1 <- nomogram(fit1, fun = plogis, fun.at = c(.001, .01, .05,
    seq(.1,.9, by = .1), .95, .99, .999),
    lp = F, funlabel = "Low weight rate")
plot(nom1)

Figure 3 Nomogram based on model "fit1".

Use the function calibrate() to construct the calibration curve object "cal1" and print the calibration curve. The result is shown in Figure 4.

cal1 <- calibrate(fit1, method = 'boot', B = 100)
plot(cal1,xlim = c(0,1.0),ylim = c(0,1.0))
##
## n=189   Mean absolute error=0.037   Mean squared error=0.00173
## 0.9 Quantile of absolute error=0.054

Figure 4 Calibration curve based on model "fit1".

From the calculation results of the Logistic regression model fit1 above and Figure 3, it is obvious that the contribution of some predictors to the model is negligible, such as the variable "ftv". There are also predictors, such as "race", that may not be suitable for entering the prediction model as dummy variables, because the clinical operation would be cumbersome. We can consider properly converting the unordered categorical variables into dichotomous variables and involving them in the regression model. The adjusted codes are as follows.

First of all, according to the actual situation, we convert the unordered categorical variable "race" into a binomial variable. The standard of conversion is mainly based on professional knowledge. We classify "white" as one category and "black and other" as another.

mydata$race <- as.factor(ifelse(mydata$race=="white", "white", "black and other"))

Use the function datadist() to "package" the current data set.

dd<-datadist(mydata)
options(datadist ='dd')
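Since the C-statistic of a binary Logistic model equals the area under the ROC curve, the value that lrm() reports for "fit1" (C = 0.738) can be cross-checked from the fitted probabilities. A minimal sketch — the pROC package is an assumption here, as it is not used elsewhere in this article:

```r
# Cross-check the C-statistic of "fit1" as the area under the ROC curve.
# Assumes fit1 and mydata from the code above; pROC must be installed.
library(pROC)

prob <- predict(fit1, type = "fitted")  # fitted probabilities of low = 1
roc_obj <- roc(mydata$low, prob)        # ROC curve of outcome vs. probability
auc(roc_obj)                            # expected to match the C of 0.738
```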


Figure 5 Nomogram based on model "fit2".

Exclude the variable "ftv" that contributes less to the result from the regression model, then reconstruct the model as "fit2" and display the model parameters. It can be seen that C-Statistics =0.732.

fit2<-lrm(low ~ ht+lwt+ptl+smoke+race,
    data=mydata, x = T, y = T)
fit2
## Logistic Regression Model
##
## lrm(formula = low ~ ht + lwt + ptl + smoke + race, data = mydata,
##     x = T, y = T)
##
##        Model Likelihood      Discrimination    Rank Discrim.
##           Ratio Test            Indexes           Indexes
## Obs 189    LR chi2 28.19       R2 0.195         C     0.732
## 0   130    d.f.    5           g  1.037         Dxy   0.465
## 1   59     Pr(> chi2) <0.0001  gr 2.820         gamma 0.467
## max |deriv| 1e-05              gp 0.194         tau-a 0.201
##                                Brier 0.184
##
##               Coef    S.E.   Wald Z Pr(>|Z|)
## Intercept      0.7743 0.8303  0.93  0.3511
## ht=pih         1.6754 0.6863  2.44  0.0146
## lwt           -0.0137 0.0064 -2.14  0.0322
## ptl            0.6006 0.3342  1.80  0.0723
## smoke=smoking  0.9919 0.3869  2.56  0.0104
## race=white    -1.0487 0.3842 -2.73  0.0063
##

Use the function nomogram() to construct the Nomogram object "nom2", and print the Nomogram. The result is shown in Figure 5.

nom2 <- nomogram(fit2, fun = plogis, fun.at = c(.001, .01, .05,
    seq(.1,.9, by=.1), .95, .99, .999),
    lp = F, funlabel = "Low weight rate")
plot(nom2)

Nomogram interpretation: it is assumed that a pregnant woman has the following characteristics: pregnancy-induced hypertension, weight of 100 pounds, two premature births, smoking, and black. Then we can calculate the score of each feature of the pregnant woman according to the value of each variable: pregnancy-induced hypertension (68 points) + weight 100 pounds (88 points) + two premature births (48 points) + smoking (40 points) + black (42 points) = 286 points. The probability of delivering a low birth weight infant with a total score of 286 is greater than 80% (22,24). Note that the portion exceeding 80% in this example is not displayed on the Nomogram. Readers can try to adjust the parameter settings to display all the prediction probabilities within the range of 0–1.

Use the function calibrate() to construct the calibration curve object "cal2" and print the calibration curve. The result is shown in Figure 6 below.

cal2 <- calibrate(fit2, method ='boot', B = 100)
plot(cal2,xlim = c(0,1.0),ylim = c(0,1.0))
##
## n=189   Mean absolute error=0.021   Mean squared error=0.00077
## 0.9 Quantile of absolute error=0.036

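The nomogram reading for the worked example above can be cross-checked directly from the fitted model with predict(); the patient profile below mirrors that example (pregnancy-induced hypertension, 100 pounds, two premature births, smoking, black), and the factor levels are the ones shown in the model output:

```r
# Predicted probability of a low birth weight infant for the example patient.
# Assumes fit2 and the recoded mydata from the code above.
newpt <- data.frame(ht = "pih", lwt = 100, ptl = 2,
                    smoke = "smoking", race = "black and other")
predict(fit2, newdata = newpt, type = "fitted")
# The nomogram reading for this profile is > 0.8; the output should agree.
```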

Figure 6 Calibration curve based on model "fit2".

Interpretation of the calibration curve: in fact, the calibration curve is a scatter plot of the actual occurrence probability versus the predicted probability. The calibration curve visualizes the result of the Hosmer-Lemeshow goodness-of-fit test, so in addition to the calibration curve, we should also check the result of the Hosmer-Lemeshow goodness-of-fit test. The closer the prediction rate and the actual occurrence rate are to the line Y = X, with a P value of the Hosmer-Lemeshow goodness-of-fit test greater than 0.05, the better the model is calibrated (24). In this case, the calibration curve almost coincides with the Y = X line, indicating that the model is well calibrated.

[Case 2] analysis

[Case 2]
Survival in patients with advanced lung cancer from the North Central Cancer Treatment Group. Performance scores rate how well the patient can perform usual daily activities. Total 10 variates:
(I) inst: Institution code;
(II) time: Survival time in days;
(III) status: censoring status 1=censored, 2=dead;
(IV) age: Age in years;
(V) sex: Male=1 Female=2;
(VI) ph.ecog: ECOG performance score (0=good 5=dead);
(VII) ph.karno: Karnofsky performance score (bad=0-good=100) rated by physician;
(VIII) pat.karno: Karnofsky performance score as rated by patient;
(IX) meal.cal: Calories consumed at meals;
(X) wt.loss: Weight loss in last six months.

The case data set is actually survival data. In order to be consistent with the theme of this Section, we only consider the binomial attribute of the outcome (status 1 = censored, 2 = dead). Again, we select the Logistic regression model to construct and visualize the model, draw the Nomogram, calculate the C-Statistics, and plot the calibration curve.

[Case 2] R codes and its interpretation
Load the survival package, the rms package and other auxiliary packages.

library(survival)
library(rms)

Demonstrate with the "lung" data set in the survival package. We can enumerate all the data sets in the survival package by using the following command.

data(package = "survival")

Read the "lung" data set and display its first 6 lines.

data(lung)
head(lung)
##   inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
## 1    3  306      2  74   1       1       90       100     1175      NA
## 2    3  455      2  68   1       0       90        90     1225      15
## 3    3 1010      1  56   1       0       90        90       NA      15
## 4    5  210      2  57   1       1       90        60     1150      11
## 5    1  883      2  60   1       0      100        90       NA       0
## 6   12 1022      1  74   1       1       50        80      513       0

You can use the following command to display the variable descriptions in the lung data set.

help(lung)

Variable tags can be added to the data set variables for subsequent explanation.

lung$sex <- factor(lung$sex,
    levels = c(1,2),
    labels = c("male", "female"))

According to the requirements of the rms package to build the regression model and to draw the Nomogram, we need to "package" the data in advance, which is the key step in drawing the Nomogram. Use the command "?datadist" to view its detailed help documentation.

dd=datadist(lung)
options(datadist="dd")

Using "status" as the dependent variable, and "age" and "sex" as the independent variables, the Logistic regression model "fit" is constructed, and the model parameters are shown. It can be seen that C-Statistics =0.666.
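The calibration discussion for [Case 1] above recommends checking the Hosmer-Lemeshow goodness-of-fit test alongside the calibration curve, and the same check applies to the lung model here. lrm() does not report this test, so the sketch below refits the formula with glm() and uses hoslem.test() from the ResourceSelection package (an assumption — any equivalent implementation will do):

```r
# Hosmer-Lemeshow goodness-of-fit test for the lung model (status ~ age + sex).
# glm() needs a 0/1 outcome, so status (1=censored, 2=dead) is shifted by 1.
library(survival)           # provides the lung data set
library(ResourceSelection)  # provides hoslem.test(); extra package (assumption)

data(lung)
fit_glm <- glm(I(status - 1) ~ age + sex, family = binomial, data = lung)
hoslem.test(fit_glm$y, fitted(fit_glm), g = 10)
# A P value greater than 0.05 supports adequate calibration (see text above).
```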


Figure 7 Nomogram based on model "fit".

fit <- lrm(status~ age+sex, data = lung, x=T,y=T)
fit
## Logistic Regression Model
##
## lrm(formula = status ~ age + sex, data = lung, x = T, y = T)
##
##        Model Likelihood     Discrimination    Rank Discrim.
##           Ratio Test           Indexes           Indexes
## Obs 228    LR chi2 16.85      R2 0.103         C     0.666
## 1   63     d.f.    2          g  0.708         Dxy   0.331
## 2   165    Pr(> chi2) 0.0002  gr 2.030         gamma 0.336
## max |deriv| 2e-09             gp 0.138         tau-a 0.133
##                               Brier 0.185
##
##             Coef    S.E.   Wald Z Pr(>|Z|)
## Intercept   -0.5333 1.0726 -0.50  0.6190
## age          0.0319 0.0170  1.87  0.0609
## sex=female  -1.0484 0.3084 -3.40  0.0007
##

Use the function nomogram() to plot the Nomogram of the risk estimate for the Logistic regression model "fit", as shown in Figure 7.

nom <- nomogram(fit, fun= function(x)1/(1+exp(-x)), # fun=plogis
    lp = F, funlabel = "Risk")
plot(nom)

The graphic interpretation is the same as before.
Use the calibrate() function to construct the calibration curve object "cal" and print the calibration curve. The result is shown in Figure 8.

cal <- calibrate(fit, method = 'boot', B = 100)
plot(cal,xlim = c(0,1.0),ylim = c(0,1.0))
##
## n=228   Mean absolute error=0.014   Mean squared error=0.00034
## 0.9 Quantile of absolute error=0.032

Figure 8 Calibration curve based on model "fit".

The graphic interpretation is the same as before.

Brief summary

In summary, this section introduces the construction of a Logistic regression prediction model and the drawing of a Nomogram. It should be noted that to assess the practical


value of a prediction model, its operability should be considered as well as the accuracy of its prediction. In addition to internal validation, external validation is sometimes necessary. In this case, as external validation data were not obtained, the external validation process is not demonstrated, and validation is only performed in the original data set with the Bootstrap method.

Method of building nomogram based on Cox regression model with R

Background

Human beings are always crazy about "fortune-telling". Whether it is "fortune-telling" in Chinese culture or "astrology" in Western culture, it all shows people's enthusiasm for this. In this section, we will discuss another scientific "fortune-telling": a model which assesses the prognosis of patients. As an oncologist, you will be confronted with questions like "how long will I survive" from patients suffering from cancer during clinical practice. It is really a gut-wrenching question. Mostly, we can give a median survival time based on the staging of the corresponding disease. Actually, clinical staging is the basis of our prediction for these patients, or in other words it is a "prediction model": we answer this question with the median survival time according to the clinical stage. This brings new questions, because it may not be accurate to predict the survival time of a specific individual by the median survival time of a group of people. No one can tell whether this specific individual will enjoy a better or worse prognosis (15,17).

Is there any possibility that we can calculate the survival of every patient by a more accurate and scientific method? The answer is yes. We can first construct a mathematical model with the Cox proportional hazards model, and then visualize the parameters associated with the patient's survival with a Nomogram, which can relatively accurately calculate the survival probability of each patient. A Nomogram is in essence the visualization of a regression model. It sets the scoring criteria according to the regression coefficients of all the independent variables, and then gives a score to each value of each independent variable, so that a total score can be calculated for each patient. The total score is transformed into the probability of the outcome through a function, and the predicted probability of each patient's outcome can be obtained. For example, we have a 40-year-old male pancreatic cancer patient who has undergone an operation. The clinical stage is IV. The tumor is located in the head of the pancreas and intraoperative radiotherapy was applied. Peritoneal metastasis is present. We can calculate the total score according to all this available information by a mathematical model: 40 years old can be scored 10 points; male gender can be scored 4 points; and so on… Finally, the total score can be obtained. Different scores correspond to different survival probabilities at 3 months, 6 months and 1 year. The complicated Cox regression formula is now a visual graph. Practitioners can calculate the survival probability of each patient conveniently, and relatively accurate "fortune-telling" can be presented to each patient. In the previous episode, we talked about the Logistic regression Nomogram. The Cox regression Nomogram is quite similar to the Logistic Nomogram in interpretation (15,17).

Like the previous episode, the first question is: when should we choose Cox regression? It is actually about choosing the method for multivariable analysis. If the outcome we are observing is survival, or what we call a "time to event" outcome, we can choose the Cox regression model. We have already introduced how to screen variables in the 2nd section. We should also pay attention to the balance between the number of variables brought into the prediction model and the convenience and practicality of the model. We will show two examples of Nomogram construction with R. The detailed R implementation will be presented here rather than the statistical principles behind it.

[Example 1] analysis

[Example 1]
Here we will use the data in [Example 1] to introduce the construction of a survival prediction model and the corresponding Nomogram. The original data have been simplified for better understanding and practice. The clinical data of 1,215 invasive breast cancer patients were downloaded from TCGA (https://genome-cancer.ucsc.edu/). We have simplified the original data by the steps in Table 1. The definition and assignment of the variables are presented in Table 2. We will try to construct a survival prediction model and the corresponding Nomogram for this cohort. Readers can download the original data and R code in the attachment file of this episode for better practice.

[Example 1] analysis
This cohort is about the construction of a prognosis prediction model. The steps are as follows:
(I) Cox regression will be used and screening


Table 1 Survival data of 1,215 breast cancer patients

No.    Months  Status  Age  ER  PgR  Margin_status  Pathologic_stage  HER2_Status  Menopause_status  Surgery_method  Histological_type
1      130.9   0       55   1   1    0              /                 /            1                 2               2
2      133.5   0       50   1   1    0              2                 /            2                 1               1
3      43.4    0       62   1   1    0              2                 /            2                 2               1
4      40.7    0       52   1   1    /              1                 /            /                 3               1
5      11.6    0       50   1   1    0              3                 /            2                 2               2
6      /       /       /    /   /    /              /                 /            /                 /               /
7      /       /       /    /   /    /              /                 /            /                 /               /
8      10.1    0       52   1   0    /              2                 /            /                 /               3
9      8.6     0       70   1   0    0              1                 0            2                 1               3
10     14.6    0       59   1   1    1              2                 0            /                 1               1
11     44.0    0       56   1   1    0              1                 0            1                 2               3
12     48.8    0       54   1   1    0              2                 0            1                 2               1
13     14.5    0       61   1   1    0              2                 0            2                 1               3
14     47.9    0       39   0   1    0              2                 0            1                 1               1
15     21.2    0       52   1   1    0              2                 0            /                 1               1
…
1,211  29.4    0       77   1   1    0              1                 /            2                 1               2
1,212  15.6    0       46   1   1    0              3                 /            2                 2               2
1,213  16.3    0       68   1   1    0              2                 /            2                 3               2
1,214  109.6   0       61   1   1    1              3                 /            2                 4               2
1,215  108.5   0       46   1   1    0              1                 /            1                 1               2

independent prognostic factors based on the training set, and the predictive model can be built first. The data set used for modeling is generally referred to as the training set or internal data set. You can refer to the already published Intelligent Statistics and Crazy Statistics (15,17) for the details about data entry, univariable Cox regression and multivariable Cox regression. Finally, we get three independent variables for prognosis: age, PgR, Pathologic_stage.
(II) Building the Nomogram based on these three variables (these 3 have been treated as independent variables in this Cox model).
(III) Assessing the discrimination efficiency of the model. The C-Index will be calculated.
(IV) Validation of the model can be performed with an external data set. If an external data set is not available, Bootstrap resampling based on the internal data set together with the Calibration Plot is recommended for validation (22,24).

The building of the Cox regression model-based Nomogram, C-Index calculation, Bootstrap resampling and Calibration Plot are emphasized here. All processing can be done by R (R software download: https://www.r-project.org/). All processed data will be saved as "BreastCancer.sav" and put under the R current working directory. The results are shown in Figures 9 and 10.

[Example 1] R codes and its interpretation
Load the rms package and the necessary helper packages.

library(foreign)
library(rms)


Table 2 Variable definition, assignment and description

Variable name      Variable annotation           Variable assignment and description
No.                Number                        /
Months             Survival time                 Continuous variable (month)
Status             Outcome                       1= dead, 0= censored
Age                Age                           Continuous variable (year)
ER                 Estrogen receptor status      1= positive, 0= negative
PgR                Progesterone receptor status  1= positive, 0= negative
Margin_status      Surgical margin status        1= positive, 0= negative
Pathologic_stage   Histopathologic stage         1= stage I, 2= stage II, 3= stage III, 4= stage IV
HER2_status        HER2 status                   1= positive, 0= negative
Menopause_status   Menstrual status              1= premenopause, 2= postmenopause
Surgery_method     Surgery methods               1= lumpectomy, 2= modified radical mastectomy, 3= simple mastectomy, 4= other method
Histological_type  Histological type             1= infiltrating ductal carcinoma, 2= infiltrating lobular carcinoma, 3= other

Figure 9 Nomogram of Cox regression model.


Figure 10 Calibration curve of Cox model.

Data preparation: load the external data in ".sav" format.

breast<-read.spss("BreastCancer.sav")

Convert the data set "breast" to data frame format.

breast<-as.data.frame(breast)
breast<-na.omit(breast)

Display the first 6 rows of data in the breast data frame.

head(breast)
##    No    Months Status Age       ER      PgR Margin_status
## 9   9  8.633333 Censor  70 Positive Negative      Nagative
## 11 11 44.033333 Censor  56 Positive Positive      Nagative
## 12 12 48.766667 Censor  54 Positive Positive      Nagative
## 13 13 14.466667 Censor  61 Positive Positive      Nagative
## 14 14 47.900000 Censor  39 Negative Positive      Nagative
## 19 19 39.866667 Censor  50 Positive Positive      Nagative
##    Pathologic_stage HER2_Status Menopause_status
## 9           Stage I    Negative   Post menopause
## 11          Stage I    Negative    Pre menopause
## 12         Stage II    Negative    Pre menopause
## 13         Stage II    Negative   Post menopause
## 14         Stage II    Negative    Pre menopause
## 19         Stage II    Positive   Post menopause
##                 Surgery_method             Histological_type
## 9                   Lumpectomy                         Other
## 11 Modified Radical Mastectomy                         Other
## 12 Modified Radical Mastectomy Infiltrating Ductal Carcinoma
## 13                  Lumpectomy                         Other
## 14                  Lumpectomy Infiltrating Ductal Carcinoma
## 19                  Lumpectomy Infiltrating Ductal Carcinoma

Define the endpoint event: define the ending value "Dead" as the endpoint event "dead".

breast$Status<-ifelse(breast$Status=="Dead",1,0)

Set the reference level of the polytomous variable.

breast$Pathologic_stage<- relevel(breast$Pathologic_stage, ref = 'Stage I')

Build the Cox regression model with the function cph() in the rms package.

coxm <-cph(Surv(Months,Status==1) ~ Age+Pathologic_stage+PgR,
    x = T,y = T, data = breast, surv = T)

Build the survival function objects and define them as surv1, surv2 and surv3.

surv<- Survival(coxm)
surv1<- function(x)surv(1*12,lp=x) # defines time.inc, 1-year OS
surv2<- function(x)surv(1*36,lp=x) # defines time.inc, 3-year OS
surv3<- function(x)surv(1*60,lp=x) # defines time.inc, 5-year OS

Integrate the data with the function datadist() (this is compulsory processing in the rms package during the construction of a regression model).

dd<-datadist(breast)
options(datadist = 'dd')

Build the Nomogram: "maxscale" sets the highest point scale, which is usually set to 100 or 10 points; "fun.at" sets the survival scale; "xfrac" sets the distance between the data axes and the left labels. You can adjust these parameter values to observe the change of the Nomogram. The meaning of the other parameters can be found in the help menu of the nomogram() function.

nom<-nomogram(coxm,fun = list(surv1,surv2,surv3),lp = F,
    funlabel = c('1-Year OS','3-Year OS','5-Year OS'),maxscale = 100,
    fun.at = c('0.95','0.85','0.80','0.70','0.6','0.5','0.4','0.3','0.2','0.1'))
plot((nom),xfrac = .3)

Nomogram interpretation: a point in Figure 9 is a selected scoring standard or scale. For each independent variable, a straight line perpendicular to the Points axis (through a ruler) is made at the value of that variable, and the intersection point represents the score under that value. For example, Age at 25 means 0 points; Age at 90 means 100 points. The corresponding points of these independent variables of each patient can be summed up to get the total points, which are then located on the Total points axis; a perpendicular line from there to the survival axes indicates the survival rate of this patient (3- or 5-year OS).

Calculation of the C-Index.

f<-coxph(Surv(Months,Status==1) ~ Age+Pathologic_stage+PgR, data = breast)
sum.surv<-summary(f)
c_index<-sum.surv$concordance
c_index
##          C      se(C)
## 0.77878820 0.05734042

The meaning of the C-Index in R is similar to that of the area under the ROC curve. It ranges from 0 to 1; the closer it gets to 1, the greater the predictive value of the Cox regression model. Generally speaking, if the C-Index reaches 0.7, the model has


In this example, the C-Index equals 0.7788 and se(C) equals 0.0573. All results above are the direct output of the software (23).
Calculate the complement of the C-Index.
library(Hmisc)
S<-Surv(breast$Months,breast$Status==1)
rcorrcens(S~predict(coxm),outx = TRUE)
##
## Somers' Rank Correlation for Censored Data    Response variable:S
##
##                  C    Dxy  aDxy   SD    Z P   n
## predict(coxm) 0.22 -0.559 0.559 0.09 6.21 0 549
The standard (calibration) curve will be built next. The argument u should be in accord with the time.inc defined in the previous regression model f: if time.inc is 60 in the model f, u should be 60. The argument m should be in line with the sample size. The standard curve divides the whole sample into 3 or 4 groups (presented as 3 or 4 points in the chart); m is the sample size of each group, so m*3 equals, or approximately equals, the total sample size.
cal<- calibrate(coxm, cmethod = 'KM', method = 'boot', u = 60, m = 100, B = 100)
Print and modify the graphic parameters of the standard curve. The modified calibration curve is shown in Figure 10.
plot(cal,lwd=2,lty=1,
     errbar.col=c(rgb(0,118,192,maxColorValue=255)),
     xlim=c(0.6,1),ylim=c(0.6,1),
     xlab="Nomogram-Predicted Probability of 5-Year OS",
     ylab="Actual 5-Year OS (proportion)",
     col=c(rgb(192,98,83,maxColorValue=255)))
lines(cal[,c("mean.predicted","KM")],type="b",lwd=2,
      col=c(rgb(192,98,83,maxColorValue=255)), pch=16)
abline(0,1,lty=3,lwd=2,col=c(rgb(0,118,192,maxColorValue=255)))
Interpretation of the modified standard curve: we validate the predictive efficiency of this Nomogram model with the bootstrap resampling method in the internal data set. The lateral axis shows the predicted survival rate of each patient, while the vertical axis shows the actual survival rate. Ideally, the red line in the picture exactly coincides with the blue dotted line.

[Example 2] analysis

[Example 2]
Survival in patients with advanced lung cancer from the North Central Cancer Treatment Group. Performance scores rate how well the patient can perform usual daily activities. Total 10 variates:
(I) inst: institution code;
(II) time: survival time in days;
(III) status: censoring status 1 = censored, 2 = dead;
(IV) age: age in years;
(V) sex: male =1, female =2;
(VI) ph.ecog: ECOG performance score (0 = good, 5 = dead);
(VII) ph.karno: Karnofsky performance score (bad =0 to good =100) rated by physician;
(VIII) pat.karno: Karnofsky performance score as rated by patient;
(IX) meal.cal: calories consumed at meals;
(X) wt.loss: weight loss in last six months.

[Example 2] interpretation
This cohort is about survival. Here, the time-to-event attribute (status 1 = censored, 2 = dead) associated with the outcome will be considered. A Cox regression model will be built and visualized through a Nomogram; the C-Index will be calculated and the calibration curve drawn using R.

[Example 2] R codes and its interpretation
Load the survival package, the rms package and other helper packages.
library(survival)
library(rms)
We demonstrate with the lung data set in the survival package; the following command enumerates all the data sets in the survival package.
data(package = "survival")
Read the lung data set and display its first 6 lines.
data(lung)
head(lung)
## inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
## 1 3 306 2 74 1 1 90 100 1175 NA
## 2 3 455 2 68 1 0 90 90 1225 15
## 3 3 1010 1 56 1 0 90 90 NA 15
## 4 5 210 2 57 1 1 90 60 1150 11
## 5 1 883 2 60 1 0 100 90 NA 0
## 6 12 1022 1 74 1 1 50 80 513 0
You can further use the following command to display the variable descriptions in the lung dataset.
help(lung)
## starting httpd help server ... done
Variable tags can be added to dataset variables for subsequent explanation.
lung$sex <- factor(lung$sex,
                   levels = c(1,2),
                   labels = c("male", "female"))
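The same tagging pattern applies to any categorical column; as a quick check that the recoding worked, the new labels can be tabulated (a small sketch using the same lung data from the survival package):

```r
# Sketch: verify the factor recoding by tabulating the new labels
# (uses the survival package's lung data, as loaded above).
library(survival)
lung$sex <- factor(lung$sex, levels = c(1, 2), labels = c("male", "female"))
table(lung$sex)  # male 138, female 90 in this data set
```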


"Package" the data before building the regression model and the Nomogram with the rms package. This is a very important step for Nomogram building; the command "?datadist" shows the detailed help document for this function.
dd=datadist(lung)
options(datadist="dd")
The Cox regression fit is built with the dependent variables time and status and the independent variables age and sex. Show the parameters of this model: Dxy =0.206, so C-Index =0.206/2+0.5 =0.603.
fit <- cph(Surv(time,status) ~ age + sex, data = lung, x = T, y = T, surv = T)
fit
## Cox Proportional Hazards Model
##
## cph(formula = Surv(time, status) ~ age + sex, data = lung, x = T,
##     y = T, surv = T)
##
##                 Model Tests          Discrimination Indexes
## Obs       228   LR chi2     14.12    R2    0.060
## Events    165   d.f.            2    Dxy   0.206
## Center 0.8618   Pr(> chi2) 0.0009    g     0.356
##                 Score chi2  13.72    gr    1.428
##                 Pr(> chi2) 0.0010
##
##             Coef    S.E.   Wald Z Pr(>|Z|)
## age          0.0170 0.0092  1.85  0.0646
## sex=female  -0.5131 0.1675 -3.06  0.0022
##
Calculate the median survival time.
med <- Quantile(fit)
Build the survival function object surv.
surv <- Survival(fit)
Build the Nomogram of median survival time based on the Cox regression with the commands below, as shown in Figure 11.
nom <- nomogram(fit, fun = function(x) med(lp = x),
                funlabel = "Median Survival Time")
plot(nom)
The interpretation of this Nomogram can be referred to that of Example 1.
Next, a Nomogram of survival rates will be built based on the Cox regression. The unit of survival time in the lung data set is "day", so we first set the survival objects (surv1, surv2) based on the survival function.
surv<- Survival(fit)
surv1<- function(x)surv(365,lp = x) # defines time.inc, 1-year survival probability
surv2<- function(x)surv(730,lp = x) # defines time.inc, 2-year survival probability
Then the survival Nomogram of the Cox regression can be built directly with the commands below, as shown in Figure 12.
nom <- nomogram(fit, fun = list(surv1, surv2),
                funlabel = c("1-year Survival Probability",
                             "2-year Survival Probability"))
plot(nom, xfrac=.6)
The interpretation of this Nomogram can be referred to that of Example 1.
The C-Index can be calculated with the commands below; this gives an objective assessment of the model.
rcorrcens(Surv(time,status) ~ predict(fit), data = lung)
##
## Somers' Rank Correlation for Censored Data    Response variable:Surv(time, status)
##
##                  C    Dxy  aDxy    SD    Z     P   n
## predict(fit) 0.397 -0.206 0.206 0.051 4.03 1e-04 228
The parameter C here is 0.397, which is actually the complement of the C-Index. So the C-Index is 1 − 0.397 = 0.603, exactly the same as calculated before.
We can build and modify the calibration curve with the commands below (Figure 13). First, the calibrate() function is used to build the object cal; then we print the graph, and the graphic parameters can be used to beautify the calibration curve. The command "?calibrate" shows more details about the parameters.
cal <- calibrate(fit, cmethod='KM', method="boot",
                 u = 365, m = 50, B = 100)
plot(cal)
plot(cal,lwd=2,lty=1,
     errbar.col=c(rgb(0,118,192,maxColorValue=255)),
     xlim=c(0.1,1),ylim=c(0.1,1),
     xlab="Nomogram-Predicted Probability of 1-Year",
     ylab="Actual Probability of 1-Year",
     col=c(rgb(192,98,83,maxColorValue=255)))
The interpretation of this graph can be referred to that of Example 1.

Brief summary
This episode has introduced the building of survival prediction models and Nomograms. A good model should be convenient in application and accurate in prediction. External validation is as important as internal validation in accuracy assessment; in our examples, external validation is not presented because a suitable external data set is not available. Many articles about Nomograms for clinical prediction have been published, and a Nomogram is better at "fortune-telling" than TNM staging. However, practitioners are still used to the TNM staging system for "fortune-telling", perhaps because TNM staging is much more convenient. From this perspective, fewer variables should be included when building a Nomogram, to ensure more convenience in clinical practice. That will lead to another question:


Figure 11 Nomogram based on median survival time of Cox regression.

Figure 12 Nomogram based on survival probability of Cox regression model.
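To read Figure 12 numerically rather than graphically, the fitted model can be queried directly: the Survival() object converts a patient's linear predictor into the survival probabilities the nomogram displays. A self-contained sketch re-creating the Example 2 fit (requires the rms package; the age/sex values are illustrative):

```r
# Sketch: the survival probabilities that the nomogram displays, computed
# directly from the model (re-creates the Example 2 fit; requires rms).
library(survival)
library(rms)
lung$sex <- factor(lung$sex, levels = c(1, 2), labels = c("male", "female"))
dd <- datadist(lung); options(datadist = "dd")
fit  <- cph(Surv(time, status) ~ age + sex, data = lung,
            x = TRUE, y = TRUE, surv = TRUE)
surv <- Survival(fit)
lp <- predict(fit, newdata = data.frame(age = 60, sex = "female"))  # linear predictor
surv(365, lp)  # predicted 1-year survival probability for this patient
surv(730, lp)  # predicted 2-year survival probability
```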


Figure 13 Calibration curve based on Cox regression model.

which one should be put in priority, the accuracy or the practicality? It really depends on the purpose of the research.

Calculation C-Statistics of Logistic regression model with R

Background
In the previous two sections, we mentioned building the Logistic regression model and the Cox regression model in R, and briefly introduced the calculation of C-Statistics and C-Index, but did not focus on it. In this section, we detail the calculation of C-Statistics for Logistic regression models using R. In fact, the receiver operating characteristic (ROC) curve of the Logistic regression model is based on the predicted probability, and the area under the ROC curve (AUC) is equal to C-Statistics, so IBM SPSS software can also calculate C-Statistics; this is not repeated here (15,17).
When we build a regression model through the training set, how do we scientifically evaluate the accuracy of the model's predictions? For example, there are two fortune tellers, each with a stall on a street corner. Miss Wang wishes to get her marriage fortune told by one of these fortune tellers. Who should she ask, Mr. Zhang or Mr. Li? A simple choosing method would be to go with whomever is more accurate, but this can only be known by past customers' word of mouth. Clinical prediction models are similar to this: the most fundamental requirement is to ensure that the predictions are accurate. So how do you evaluate whether a prediction model is accurate? In general, the merits of a prediction model can be evaluated using the following three aspects (22-24).

Discrimination index
It refers to the predictive ability of the regression model to distinguish between diseased/no disease, effective/ineffective, and dead/alive outcomes. For example, there are 100 people, 50 diagnosed with the disease and 50 without; we used the prediction method to predict that 45 are sick and 55 are not. The number of the 45 people that overlap with the 50 people who are really sick directly determines the accuracy of the model's predictive ability, which we call "accuracy". It is usually measured by the ROC curve and C-Statistics (the AUC in the Logistic regression model is equal to C-Statistics). Of course, the Net Reclassification Index (NRI) and the integrated discrimination improvement (IDI) are among the other metrics; we will explain these further in the following sections (25-27).
For each individual, we want to avoid both misdiagnosis and missed diagnosis. Therefore, for the Logistic regression prediction model, the ROC curve is often drawn, as in a diagnostic test, to judge the degree of discrimination. The difference is that the indicator used to plot the ROC curve is no longer a clinical test result but the predicted probability of the Logistic regression model. Judging whether the event occurs based on the magnitude of the predicted probability, we get a series of sensitivities and specificities for plotting the ROC curve, which helps us understand whether the constructed predictive model can accurately predict the occurrence of the event.

Consistency and calibration
It refers to the consistency of the probability of actual occurrence and the probability of prediction. It may seem a bit puzzling, so we cite the above example again. "The 100 people we predicted" does not mean that we really use the model to decide whether a person has the disease or not; the model only gives us the probability of a person being sick, and the disease status is judged by whether that probability exceeds a certain cutoff value (e.g., 0.5). For example, with 100 people, we will finally get 100 probabilities through the model, ranging from 0 to 1. We rank the 100 probabilities from small to large and then divide them into 10 groups with 10 people in each group. The actual probability is the proportion of diseased people among these 10; the predicted probability is the average of the 10 predicted probabilities in each group. Then we compare the two numbers, one as the abscissa and one as the ordinate; a calibration plot is obtained, and the 95% range of the plot can also be calculated.
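The decile grouping just described can be sketched in base R with toy data (the probabilities and outcomes below are simulated, not from the article):

```r
# Sketch: observed vs. predicted event rates by decile of predicted risk
# (toy simulated data; in practice p would come from predict(fit, type = "response")).
set.seed(1)
p <- runif(100)                      # toy predicted probabilities
y <- rbinom(100, 1, p)               # toy observed outcomes
grp <- cut(rank(p, ties.method = "first"), breaks = 10, labels = FALSE)
observed  <- tapply(y, grp, mean)    # actual event rate per decile
predicted <- tapply(p, grp, mean)    # mean predicted probability per decile
cbind(predicted, observed)           # plot(predicted, observed); abline(0, 1) gives the calibration plot
```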


In the logistic regression model, sometimes the consistency can also be measured by the Hosmer-Lemeshow goodness-of-fit test. The calibration curve is a scatter plot of the actual incidence against the predicted incidence; statistically, the calibration curve is a visualized result of the Hosmer-Lemeshow goodness-of-fit test (28-30).
It is worth noting that a well-discriminating model may have a poor calibration. For example, a model may correctly determine that one person's risk of disease is five times another's, giving risks of 5% and 1%, while in fact the risks of the two are 50% and 10%. The ranking is right, but the predicted risks are far off the mark, which is a bad calibration. The calibration of the model can be tested with the Hosmer-Lemeshow test: if the result is statistically significant, there is a difference between the predicted and observed values. Discrimination and calibration are both important evaluations of a model, but many newly developed models are not fully evaluated. A systematic review of cardiovascular risk prediction models found that only 63% of the models reported discrimination, and even fewer, 36%, reported calibration.

R-squared (R2)
The coefficient of determination, commonly known as "R-squared", is used as a guideline to measure the accuracy of the model; it combines discrimination and consistency. The coefficient of determination R2 is more comprehensive, but slightly rough (31,32).
Below we explain the method of calculating C-Statistics in the R language with a classic case of Logistic regression. As with the previous Sections, the focus is on the R calculation process rather than on complex statistical principles.

Case analysis

[Case 1]
Hosmer and Lemeshow studied the influencing factors of low birth weight in infants in 1989. The outcome variable was whether or not a low-birth-weight infant was delivered (variable name "low", dichotomous, 1 = low birth weight, i.e., birth weight <2,500 g; 0 = non-low birth weight). The influencing factors (independent variables) considered include: maternal pre-pregnancy weight (lwt, unit: pounds); maternal age (age, unit: year); maternal smoking during pregnancy (smoke, 0 = not smoking, 1 = smoking); pre-pregnancy premature births (ptl, unit: times); whether the mother has high blood pressure (ht, 0 = no, 1 = yes); uterine irritability in response to contraction, oxytocin and other stimuli (ui, 0 = no, 1 = yes); the number of community physician visits in the first three months of pregnancy (ftv, unit: times); race (race, 1 = white, 2 = black, 3 = other ethnic groups). We sorted out the data in this example and named it "Lowweight.sav", stored in the current working path. For the convenience of the reader, the data and code can be downloaded in the attachment to this Section.

[Case 1] analysis
The dependent variable in this case is a binary outcome variable (low birth weight or not). The purpose of the study was to investigate the independent influencing factors of low birth weight, which is consistent with the application conditions of binary Logistic regression. We construct a Logistic regression equation with "age + ftv + ht + lwt + ptl + smoke + ui + race" as the independent variables and "low" as the dependent variable. Based on this Logistic regression model, we have three methods to calculate its C-Statistics:
(I) Method 1. Use the lrm() function in the rms package to construct the Logistic regression model and directly read the "Rank Discrim. Indexes" parameter C of the model, which is C-Statistics.
(II) Method 2. Construct the Logistic regression model, calculate the model's predicted values with the predict() function, then use the ROCR package to draw the ROC curve from the predicted values and calculate the area under the curve (AUC), which is C-Statistics. Note: this method is consistent with the calculation method in SPSS.
(III) Method 3. Construct the Logistic regression model, calculate the model's predicted values with the predict() function, and directly calculate the area under the ROC curve (AUC) with the somers2() function in the Hmisc package. Note: this method is consistent with the calculation method in SPSS.

R codes of calculation process
Load the foreign package and the rms package.
library(foreign)
library(rms)


Import the external data in .sav format, convert it into a data frame structure, and present the first 6 rows of the data frame.
mydata<- read.spss("Lowweight.sav")
mydata<- as.data.frame(mydata)
head(mydata)
## id low age lwt race smoke ptl ht ui ftv bwt
## 1 85 normal weight 19 182 black no smoking 0 no pih yes 0 2523
## 2 86 normal weight 33 155 other no smoking 0 no pih no 3 2551
## 3 87 normal weight 20 105 white smoking 0 no pih no 1 2557
## 4 88 normal weight 21 108 white smoking 0 no pih yes 2 2594
## 5 89 normal weight 18 107 white smoking 0 no pih yes 0 2600
## 6 91 normal weight 21 124 other no smoking 0 no pih no 0 2622
Set the outcome variable to two categories, defining "low weight" as "1", and convert the unordered multi-class variable "race" into dummy variables.
mydata$low <- ifelse(mydata$low =="low weight",1,0)
mydata$race1 <- ifelse(mydata$race =="white",1,0)
mydata$race2 <- ifelse(mydata$race =="black",1,0)
mydata$race3 <- ifelse(mydata$race =="other",1,0)
Load the data into the current working environment and "package" the data using the datadist() function.
dd<-datadist(mydata)
options(datadist='dd')
The logistic regression model is fitted using the lrm() function in the rms package.
(I) Method 1. Directly read the "Rank Discrim. Indexes" parameter C in the model output, i.e., C-Statistics =0.738.
fit1<-lrm(low~ age+ftv+ht+lwt+ptl+smoke+ui+race1+race2,
          data = mydata,x = T,y = T)
fit1
## Logistic Regression Model
##
## lrm(formula = low ~ age + ftv + ht + lwt + ptl + smoke + ui +
##     race1 + race2, data = mydata, x = T, y = T)
##
##                   Model Likelihood     Discrimination   Rank Discrim.
##                   Ratio Test           Indexes          Indexes
## Obs 189           LR chi2 31.12        R2 0.213         C 0.738
## 0 130             d.f. 9               g 1.122          Dxy 0.476
## 1 59              Pr(> chi2) 0.0003    gr 3.070         gamma 0.477
## max |deriv| 7e-05                      gp 0.207         tau-a 0.206
##                                        Brier 0.181
##
##                Coef    S.E.   Wald Z Pr(>|Z|)
## Intercept       1.1427 1.0873  1.05  0.2933
## age            -0.0255 0.0366 -0.69  0.4871
## ftv             0.0321 0.1708  0.19  0.8509
## ht=pih          1.7631 0.6894  2.56  0.0105
## lwt            -0.0137 0.0068 -2.02  0.0431
## ptl             0.5517 0.3446  1.60  0.1094
## smoke=smoking   0.9275 0.3986  2.33  0.0200
## ui=yes          0.6488 0.4676  1.39  0.1653
## race1          -0.9082 0.4367 -2.08  0.0375
## race2           0.3293 0.5339  0.62  0.5374
##
(II) Method 2. Calculate the AUC using the ROCR package; the code is as follows.
First, calculate the predicted values of the constructed logistic regression model.
mydata$predvalue <- predict(fit1)
Load the ROCR package.
library(ROCR)
Use the prediction() function to build the object "pred", and the performance() function to build the object "perf", for plotting the ROC curve.
pred <- prediction(mydata$predvalue, mydata$low)
perf<- performance(pred,"tpr","fpr")
Draw the ROC curve as shown in Figure 14 below.
plot(perf)
abline(0,1, col = 3, lty = 2)
Calculate the area under the ROC curve (AUC): C-Statistics =0.7382008, consistent with the above result.
auc <- performance(pred,"auc")
auc
## An object of class "performance"
## Slot "x.name":
## [1] "None"
##
## Slot "y.name":
## [1] "Area under the ROC curve"
##
## Slot "alpha.name":
## [1] "none"
##
## Slot "x.values":
## list()
##
## Slot "y.values":
## [[1]]
## [1] 0.7382008
##
##
## Slot "alpha.values":
## list()
(III) Method 3. Calculation with the somers2() function of the Hmisc package; we can see that AUC =0.7382, consistent with the above results.
library(Hmisc)
somers2(mydata$predvalue, mydata$low)
##         C       Dxy           n   Missing
## 0.7382008 0.4764016 189.0000000 0.0000000

Brief summary
In fact, no matter which method we use, the standard error of C-Statistics is not directly given, so the calculation of the confidence interval is rather troublesome, which is not as convenient as SPSS software.

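When the standard error is not directly available in R, one workaround is a percentile bootstrap of the AUC. The sketch below uses toy data and the Mann-Whitney rank identity for the AUC instead of somers2(); with the real case, pred and y would be mydata$predvalue and mydata$low:

```r
# Sketch: percentile bootstrap CI for the AUC (toy data; the AUC is computed
# via the Mann-Whitney rank-sum identity rather than a package function).
set.seed(123)
y    <- rbinom(200, 1, 0.3)     # toy outcomes
pred <- y + rnorm(200)          # toy scores, informative by construction
auc  <- function(score, outcome) {
  r  <- rank(score)
  n1 <- sum(outcome == 1); n0 <- sum(outcome == 0)
  (sum(r[outcome == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
boot <- replicate(500, { i <- sample(200, replace = TRUE); auc(pred[i], y[i]) })
quantile(boot, c(0.025, 0.975))  # approximate 95% CI for the AUC
```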

Figure 14 ROC curve.

If you want to report the C-Statistics confidence interval for various practical needs, you can consider using SPSS software for ROC analysis: SPSS directly gives the standard error and confidence interval of the AUC. Readers can try it themselves.
If you want to compare the areas under the ROC curves (AUC, or C-Statistics) of two models, you can refer to the following formula:

Z = (AUC1 − AUC2) / √(SE1² + SE2²)  [1]

You can then check the Z distribution table according to the Z value to get the P value.
So far, the demonstration of the three methods of calculating C-Statistics for Logistic regression in this section is complete.

Calculation C-Index of Cox regression model with R

Background
In the past decade, there has been an increase in the number of articles in clinical research that construct and validate a predictive model. What is a predictive model? In short, a predictive model predicts a clinically unknown outcome from known parameters, and the model itself is a mathematical formula. That is, the so-called model uses the known parameters to calculate the probability of an unknown outcome, which is called prediction.
The statistical nature of the clinical prediction model is regression modeling analysis. The essence of regression is to find the mathematical relationship between the dependent variable Y and multiple independent variables X. Three types of regression modeling are most commonly used in clinical research: multiple linear regression, Logistic regression, and Cox regression. When we construct a regression model through variable selection in the training set, how do we scientifically evaluate the accuracy of the model's predictions? As in the example given in the previous section, there are two fortune tellers, each with a stall on a street corner. Miss Wang wishes to get her marriage fortune told by one of these fortune tellers. Who should she ask, Mr. Zhang or Mr. Li? A simple choosing method would be to go with whomever is more accurate, but this can only be known by past customers' word of mouth. Clinical prediction models are similar to this: the most fundamental requirement is to ensure that the predictions are accurate. So how do you evaluate whether a prediction model is accurate? In general, the merits of a prediction model can be evaluated using the following three aspects (15,17).

Discrimination ability
It refers to the predictive ability of the regression model to distinguish between diseased/no disease, effective/ineffective, and dead/alive outcomes. For example, there are 100 people, 50 diagnosed with the disease and 50 without; we used the prediction method to predict that 45 are sick and 55 are not. The number of the 45 people that overlap with the 50 people who are really sick directly determines the accuracy of the model's predictive ability, which we call "accuracy". It is usually measured by the ROC curve and C-Statistics (the AUC in the Logistic regression model is equal to C-Statistics). Of course, NRI and IDI are among the other metrics; we will explain these further in the following sections (25-27).
The C-Index is a general indicator, especially for the evaluation of the discriminative ability of the Cox regression model (33,34). The C-Index ranges between 0.5 and 1.0. C-Index =0.5 means complete inconsistency, indicating that the model has no predictive effect; C-Index =1.0 means complete consistency, indicating that the model's predictions agree completely with the actual outcomes. It is generally considered that the C-Index indicates lower accuracy between 0.50 and 0.70, moderate accuracy between 0.71 and 0.80, higher accuracy above 0.80, and extremely high accuracy above 0.9.
C-Index (full name: Concordance Index), also often written as Harrell's C-Index, Concordance C, C-statistic, etc., is mainly used to reflect the discriminative ability of predictive models. It was first proposed in 1996 by Harrell, professor of biostatistics at Vanderbilt University, to see if a model can make accurate predictions.

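Formula [1] above can be wrapped in a small helper function; the AUC and SE values below are purely illustrative, not results from the article's models:

```r
# Sketch: two-sided P value for comparing two AUCs via formula [1]
# (independent-samples form; the inputs are illustrative numbers only).
compare_auc <- function(auc1, se1, auc2, se2) {
  z <- (auc1 - auc2) / sqrt(se1^2 + se2^2)
  c(Z = z, P = 2 * pnorm(-abs(z)))
}
compare_auc(0.74, 0.03, 0.65, 0.04)  # Z = 1.8, P ~ 0.072
```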

The definition of the C-Index is very simple: C-Index = concordant pairs/useable pairs. Imagine pairing all the subjects randomly; N subjects will produce N*(N−1)/2 pairs. If the sample size N is large, the calculation is huge and must be done with computer software. We first find the concordant pairs as the numerator. What is a concordant pair? Taking Cox regression survival analysis as an example: if the subject with the longer actual survival time also has the larger predicted survival probability, or the subject with the shorter survival time has the smaller predicted survival probability, the prediction is concordant with the actual result; otherwise, it is discordant. Then, among all the pairs, we find the useable pairs as the denominator. What is a useable pair? In the case of Cox regression survival analysis, at least one of the two subjects in a useable pair must have the target endpoint event. That is to say, if neither of the paired subjects has an endpoint event throughout the observation period, the pair is not included in the denominator. In addition, there are two other situations that need to be excluded:
(I) If one of the paired subjects has an endpoint event and the other is lost to follow-up, the survival times of the two cannot be compared, and such pairs should be excluded;
(II) Pairs in which both subjects die at the same time should also be excluded. After finding the useable pairs as the denominator, we determine the numerator as described above.
What is the relationship between C-Index and AUC? We have said that the C-Index is an indicator that can be used to evaluate the distinguishing ability of various models. For the binary logistic regression model, the C-Index can be simplified as: the probability that the predicted risk of a diseased subject is greater than that of a non-diseased one. It has been shown that the C-Index for binary logistic regression is equivalent to the AUC. The AUC mainly reflects the predictive ability of the binary logistic regression model, but the C-Index can evaluate the accuracy of the predictions of various models. It is easy to understand that the C-Index is an extension of the AUC, and the AUC is a special case of the C-Index.

Consistency and calibration
It refers to the congruency of the probability of actual occurrence and the probability of prediction. It may seem a bit puzzling, so we cite the above example again. "The 100 people we predicted" does not mean that we really use the model to decide whether a person has the disease or not; the model only gives us the probability of a person being sick, and the disease status is judged by whether that probability exceeds a certain cutoff value (e.g., 0.5). For example, with 100 people, we will finally get 100 probabilities through the model, ranging from 0 to 1. We rank the 100 probabilities from small to large and then divide them into 10 groups with 10 people in each group. The actual probability is the proportion of diseased people among these 10; the predicted probability is the average of the 10 predicted probabilities in each group. Then we compare the two numbers, one as the abscissa and one as the ordinate; a calibration plot is obtained, and the 95% range of the plot can also be calculated. In the logistic regression model, sometimes the consistency can also be measured by the Hosmer-Lemeshow goodness-of-fit test. The calibration curve is a scatter plot of the actual incidence against the predicted incidence; statistically, the calibration curve is a visualized result of the Hosmer-Lemeshow goodness-of-fit test (28,29,35).
It is worth noting that a well-discriminating model may have a poor calibration. For example, a model may correctly determine that one person's risk of disease is five times another's, giving risks of 5% and 1%, while in fact the risks of the two are 50% and 10%. The ranking is right, but the predicted risks are far off the mark, which is a bad calibration. The calibration of the model can be tested with the Hosmer-Lemeshow test: if the result is statistically significant, there is a difference between the predicted and observed values. Discrimination and calibration are both important evaluations of a model, but many newly developed models are not fully evaluated. A systematic review of cardiovascular risk prediction models found that only 63% of the models reported discrimination, and even fewer, 36%, reported calibration.

R-squared (R2)
The coefficient of determination, commonly known as "R-squared", is used as a guideline to measure the accuracy of the model; it combines discrimination and consistency. The coefficient of determination R2 is more comprehensive, but slightly rough.

Calculation methods of C-Index
In many clinical articles, it is often seen that the discriminating ability of the method described in the statistical methods section is measured by C-Statistics or C-Index.
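The pair-counting definition given above (C-Index = concordant pairs/useable pairs) can be written out directly for small data sets. This toy base-R sketch skips tied observation times, skips pairs whose shorter time is censored, and counts tied predictions as one half; survival::concordance() is the production implementation:

```r
# Sketch: Harrell's C by brute-force pair counting for right-censored data
# (toy illustration only; risk means higher value = worse predicted outcome).
harrell_c <- function(time, event, risk) {
  conc <- usable <- 0
  n <- length(time)
  for (i in 1:(n - 1)) for (j in (i + 1):n) {
    if (time[i] == time[j]) next             # tied times: skipped here
    s <- if (time[i] < time[j]) i else j     # subject with the shorter time
    l <- if (time[i] < time[j]) j else i
    if (event[s] == 0) next                  # shorter time censored: not usable
    usable <- usable + 1
    if (risk[s] > risk[l]) conc <- conc + 1          # higher risk died earlier
    else if (risk[s] == risk[l]) conc <- conc + 0.5  # tied predictions count half
  }
  conc / usable
}
harrell_c(time = c(5, 10, 3, 8), event = c(1, 0, 1, 1),
          risk = c(2.0, 0.5, 3.0, 1.0))  # perfect concordance: returns 1
```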


discriminating ability of a prediction model is measured by C-Statistics or C-Index. Below we use the R language to demonstrate the calculation of the C-Index of a Cox regression. The C-Statistics calculation for logistic regression was introduced in Section 5. As in the previous Sections, the focus of this article is the R calculation process; we try to avoid complex statistical principles.

Strictly speaking, C-Index includes the following types. We only introduce the first one, which is the most commonly used in clinical practice:
(I) Harrell's C;
(II) C-statistic by Begg et al. (survAUC::BeggC);
(III) C-statistic by Uno et al. (survC1::Inf.Cval; survAUC::UnoC);
(IV) Gonen and Heller concordance index for Cox models (survAUC::GHCI, CPE::phcpe, clinfun::coxphCPE).

There are two common ways to calculate the C-Index of a Cox regression model:
(I) Method 1: output it directly from the coxph() function of the survival package; this requires an R version higher than 2.15, and the survival package needs to be installed in advance. This method outputs the C-Index (the model parameter C) together with its standard error, so the 95% confidence interval can be obtained as C ± 1.96*se. The method is also applicable to models combining many indicators (22-24).
(II) Method 2: using the cph() function and the validate() function of the rms package, both the un-adjusted and the bias-adjusted C-Index are available (23).

R code and its interpretation

Simulate a set of survival data and store it as a data frame: age and bp are the independent variables, and os and death are the survival outcome. The data frame is named "sample.data", and its first 6 rows are displayed:

age <- rnorm(200, 50, 5)
bp <- rnorm(200, 120, 10)
d.time <- rexp(200)
cens <- runif(200, .5, 2)
death <- d.time <= cens
os <- pmin(d.time, cens)
sample.data <- data.frame(age = age, bp = bp, os = os, death = death)
head(sample.data)
##        age       bp        os death
## 1 51.48918 122.6809 1.7626881 FALSE
## 2 53.41727 138.3601 0.6110944 FALSE
## 3 46.42942 104.4238 0.2602561  TRUE
## 4 48.07264 127.1035 0.1138379  TRUE
## 5 53.56849 130.9504 0.3202129  TRUE
## 6 53.94391 135.3093 1.2506595 FALSE

(I) Method 1: survival package. Load the survival package, fit the Cox regression model with the coxph() function, display the model results with the summary() function and assign them to the object sum.surv; the model parameter concordance that is displayed is the C-Index. In this example, C-Index =0.5416, se(C) =0.02704.

library(survival)
fit <- coxph(Surv(os, death) ~ age + bp, data = sample.data)
sum.surv <- summary(fit)
c_index <- sum.surv$concordance
c_index
##          C      se(C)
## 0.54156912 0.02704007

(II) Method 2: rms package. Build a Cox regression model and read the model parameter Dxy; Dxy*0.5 + 0.5 is the C-Index. Note: the seed is set here with the set.seed() function so that the final result can be reproduced, because the adjusted result of the validate() function is random.

library(rms)
## Loading required package: Hmisc
## Loading required package: lattice
## Loading required package: Formula
## Loading required package: ggplot2
## Loading required package: SparseM
set.seed(123)
dd <- datadist(sample.data)
options(datadist = 'dd')
fit.cph <- cph(Surv(os, death) ~ age + bp, data = sample.data,
               x = TRUE, y = TRUE, surv = TRUE)
fit.cph
## Cox Proportional Hazards Model
##
## cph(formula = Surv(os, death) ~ age + bp, data = sample.data,
##     x = TRUE, y = TRUE, surv = TRUE)
##
##                  Model Tests          Discrimination Indexes
## Obs       200    LR chi2      4.97    R2    0.025
## Events    137    d.f.         2       Dxy   0.083
## Center -1.3246   Pr(> chi2)   0.0833  g     0.218
##                  Score chi2   4.95    gr    1.243
##                  Pr(> chi2)   0.0842
##
##      Coef    S.E.   Wald Z Pr(>|Z|)
## age   0.0169 0.0178  0.95  0.3416
## bp   -0.0180 0.0089 -2.01  0.0444
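The concordance that coxph() reports above is Harrell's C: among all usable pairs of subjects, the proportion in which the subject with the higher predicted risk is the one who fails earlier. As a language-agnostic check of that definition, here is a minimal Python sketch (not the survival package's implementation; ties in time are ignored for brevity, and the toy vectors are hypothetical), together with the 95% interval C ± 1.96*se described for Method 1:

```python
def harrells_c(times, events, risks):
    """Harrell's C: fraction of usable pairs in which the subject with
    the higher predicted risk fails earlier. A pair is usable only if
    the subject with the shorter time actually had the event."""
    concordant, usable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(i + 1, n):
            # a = member of the pair with the shorter follow-up time
            a, b = (i, j) if times[i] < times[j] else (j, i)
            if not events[a]:
                continue  # shorter time is censored -> pair unusable
            usable += 1
            if risks[a] > risks[b]:
                concordant += 1.0
            elif risks[a] == risks[b]:
                concordant += 0.5  # risk ties count as half
    return concordant / usable

# toy data: the two earliest deaths carry the highest risk scores
times  = [2.0, 4.0, 5.0, 1.0]
events = [1, 1, 0, 1]            # 1 = death observed, 0 = censored
risks  = [0.9, 0.3, 0.1, 0.95]   # model-predicted risk scores
print(harrells_c(times, events, risks))  # -> 1.0 (perfect concordance)

# 95% CI from the coxph() output above: C +/- 1.96 * se(C)
print(0.5416 - 1.96 * 0.02704, 0.5416 + 1.96 * 0.02704)
```

With C =0.5416 and se =0.02704 the interval is roughly 0.489 to 0.595, which is why the text recommends Method 1: the standard error comes for free.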


Calculate the adjusted C-Index and the un-adjusted C-Index; the result is as follows.

v <- validate(fit.cph, dxy = TRUE, B = 1000)
Dxy = v[rownames(v) == "Dxy", colnames(v) == "index.corrected"]
orig_Dxy = v[rownames(v) == "Dxy", colnames(v) == "index.orig"]
bias_corrected_c_index <- abs(Dxy)/2 + 0.5
orig_c_index <- abs(orig_Dxy)/2 + 0.5
bias_corrected_c_index
## [1] 0.5276304
orig_c_index
## [1] 0.5415691

Un-adjusted C-Index =0.5416, adjusted C-Index =0.5276.

Brief summary

C-Index is one of the most important parameters in the evaluation of a Cox regression model. It reflects how well the model predicts and is an important parameter for measuring the discrimination of the model. However, this parameter cannot be calculated in IBM SPSS. This Section has described two calculation methods in R, and grasping one of them is sufficient. The author recommends the first one, because it reports the C-Index and its standard error at the same time, so the confidence interval of the C-Index can be calculated conveniently.

Calculation method of NRI with R

Background

NRI was originally used to quantitatively evaluate the improvement in classification performance of a new marker over the original marker. Since NRI is capable of evaluating the precision of a diagnostic test, NRI can also be used to evaluate the performance of a predictive model, because statistically a diagnostic test and a predictive model are the same. Based on published clinical trials, NRI has been widely used to compare the accuracy of two predictive models. We have discussed methods to compare the accuracy and discrimination of two predictive models, such as C-Statistics and AUC. However, both methods have several limitations:
(I) C-Statistic/AUC lacks sensitivity. When evaluating the improvement in predictive performance of a model after incorporating a new marker, the improvement in C-Statistic/AUC is always small, so the new marker sometimes fails to significantly improve the C-Statistic/AUC.
(II) The meaning of C-Statistic/AUC is hard to understand and to explain properly in clinical use.
NRI overcomes these two limitations.

Calculation method of NRI

We use a dichotomous diagnostic marker as an example to explain the principle of the NRI calculation, and then quantitatively compare the predictive performance of different models. The NRI can be calculated with a customized function in R or by formula; comparing the performance of predictive models requires statistical software. In short, the original marker classifies study objects into patients and non-patients, while the new marker reclassifies them into patients and non-patients. When the results of classification and reclassification are compared, some objects may be mistakenly classified by the original marker but corrected by the new marker, and vice versa. Therefore, the classification of study objects changes when different markers are used, and we use this change to calculate the NRI (26,27,36). It may look confusing, but the calculation below can help readers understand the concept.

First, we classify study objects into a disease group and a healthy group using the gold standard test. Two matched fourfold tables are generated from the classification results of the original and new markers within the two groups, as shown below in Tables 3 and 4.

We mainly focus on the study objects who are reclassified. As shown in Tables 3 and 4, in the disease group (N1 in total), c1 objects are correctly classified by the new marker and mistakenly classified by the original marker, while b1 objects are correctly classified by the original marker and mistakenly classified by the new marker. So, compared with the original model, the improved proportion of correct classification in the new model is (c1 − b1)/N1. Similarly, in the healthy group (N2 in total), b2 objects are correctly classified by the new marker and mistakenly classified by the original marker, while c2 objects are correctly classified by the original marker and mistakenly classified by the new marker; the improved proportion of correct classification in the new model is (b2 − c2)/N2. Finally, combining the two groups, NRI = (c1 − b1)/N1 + (b2 − c2)/N2, which is often referred to as the absolute NRI.

If NRI > 0, it means positive improvement, which indicates that the new marker has better predictive value than the original marker; if NRI < 0, it means negative improvement, which indicates that the new marker has worse


Table 3 Reclassification in disease group

Disease group (N1)            New marker
                          Positive    Negative
Original marker
  Positive                   a1          b1
  Negative                   c1          d1

Table 4 Reclassification in healthy group

Healthy group (N2)            New marker
                          Positive    Negative
Original marker
  Positive                   a2          b2
  Negative                   c2          d2

predictive value than the original marker; if NRI = 0, it means no improvement. We can calculate a z-score to determine whether the difference between the new model and the original model reaches a significant level. The z-score obeys the standard normal distribution; its formula is listed below, and the P value can be calculated from the z-score.

Z = NRI / sqrt((b1 + c1)/N1^2 + (b2 + c2)/N2^2)   [2]

The example described above is a dichotomous diagnostic marker. In predictive model studies, the situation is more complicated, but the principle is the same. Making a diagnosis solely based on a dichotomous diagnostic marker seems too "crude": researchers may be more concerned with the risk of having the disease in the future than with the present status of having or not having the disease, and a predictive model can offer the probability of developing a disease or another outcome. For example, study objects can be classified into a low-risk, an intermediate-risk and a high-risk group based on the predicted risk probability. If the outcome variable is a multiple categorical variable, ROC analysis is not appropriate, because the outcome variable of an ROC analysis is usually dichotomous. With a multiple categorical outcome, the ROC may present as a spherical surface, which is hard to draw; even if it were drawn, it would still be hard to compare the AUCs of two predictive models directly, which makes it complicated to explain their clinical significance. However, NRI can easily solve these problems. So how does NRI solve them?

Take the paper about the methodology of NRI calculation published in Stat Med as an example (26). Based on the famous Framingham Heart Study, researchers assessed the improvement of a new model, which incorporated high-density lipoprotein-C (HDL-C) into the classic model, in the prediction of the 10-year risk of coronary heart disease (CHD). The researchers compared the AUC of the new model and the classic model: the AUCs of the two models were 0.774 and 0.762, respectively. The AUC only increased by 0.012 after incorporating HDL-C and failed to reach a significant level (P=0.092), which indicated that the new model had no significant improvement, as shown in Figure 1 of that paper (26). The researchers further classified the study objects into a low-risk group (<6%), an intermediate-risk group (6–20%) and a high-risk group (>20%), as shown in Table 2 of that paper (26). They also calculated the NRI (NRI =12.1%), z-score (z-score =3.616) and P value (P<0.001), which indicated that incorporating the new marker improved the predictive performance: the proportion of correct classification increased by 12.1%.

The principle of NRI has been fully discussed above. Here we discuss how to calculate NRI using R software. There are three circumstances:
(I) To calculate how much the predictive performance of a new marker improves compared with the original marker, one can use the formula listed above or the R code in the reference material;
(II) To calculate the NRI of two predictive models constructed by logistic regression;
(III) To calculate the NRI of two predictive models constructed by Cox regression.
The calculation methods using R are listed below (Table 5) (27,37-39). We mainly demonstrate how to calculate NRI using the "nricens" package, which is highly recommended.

Case analysis

The calculation of NRI of two markers
[Case 1]
Researchers wanted to assess the predictive value of two diagnostic tests for diabetes. They used three methods (the gold standard test, diagnostic test 1 and diagnostic test 2) to evaluate the disease status of 100 patients. The data used here are documented in the appendix file "diagnosisdata.csv". The disease status predicted by the gold standard test and
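The absolute NRI and the z-score of formula [2] are simple arithmetic and can be checked by hand. Here is a small Python sketch (the fourfold-table counts b1, c1, b2, c2 below are hypothetical, chosen only to exercise the formula):

```python
import math

def nri_and_z(b1, c1, n1, b2, c2, n2):
    """Absolute NRI from the two fourfold tables (disease group of
    size N1, healthy group of size N2) and its z-score, formula [2]."""
    nri = (c1 - b1) / n1 + (b2 - c2) / n2
    z = nri / math.sqrt((b1 + c1) / n1 ** 2 + (b2 + c2) / n2 ** 2)
    return nri, z

# hypothetical example: among 60 diseased subjects the new marker corrects
# 9 and worsens 3; among 140 healthy subjects it corrects 8 and worsens 4
nri, z = nri_and_z(b1=3, c1=9, n1=60, b2=8, c2=4, n2=140)
print(round(nri, 4), round(z, 2))  # -> 0.1286 2.05
```

A z of about 2.05 exceeds 1.96, so at a significance level of 0.05 this hypothetical new marker would be judged a significant improvement.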


Table 5 Packages in R for the calculation of NRI

R package     Download   Categorical variable outcome   Survival data        Calculation of categorical NRI   Calculation of continuous NRI
Hmisc         CRAN       improveProb() function         Not available        Not available                    Not available
nricens       CRAN       nribin() function              nricens() function   Available                        Available
PredictABEL   CRAN       reclassification() function    Not available        Available                        Available
survNRI       github     survNRI() function             Available            Not available                    Not available

NRI, Net Reclassification Index.

the assessing diagnostic tests were listed, of which "gold" represented the results of the gold standard test (1= disease, 0= healthy), "t1" represented the results of diagnostic test 1 (1= positive, 0= negative), and "t2" represented the results of diagnostic test 2 (1= positive, 0= negative). Readers could use the formulas listed above; here we use our own R code to calculate the NRI of the two diagnostic tests. We have sorted the data, renamed the file and stored it in the current working directory. To make it easier for readers to practice, the data and codes are available for download in the appendix of this article.

R code and its interpretation
Because there is no function available for the direct calculation of this NRI, we need to define a function named NRIcalculate() based on the definition described above. The R code is presented below:

NRIcalculate = function(m1 = "dia1", m2 = "dia2", gold = "gold"){
  datanri = datanri[complete.cases(datanri),]
  for (i in 1:length(names(datanri))){
    if (names(datanri)[i] == m1) nm1 = as.numeric(i)
    if (names(datanri)[i] == m2) nm2 = as.numeric(i)
    if (names(datanri)[i] == gold) ngold = as.numeric(i)
  }
  if (names(table(datanri[,nm1]))[1] != "0" || names(table(datanri[,nm1]))[2] != "1") stop("index test 1 value not 0 or 1")
  if (names(table(datanri[,nm2]))[1] != "0" || names(table(datanri[,nm2]))[2] != "1") stop("index test 2 value not 0 or 1")
  if (names(table(datanri[,ngold]))[1] != "0" || names(table(datanri[,ngold]))[2] != "1") stop("reference standard value not 0 or 1")
  datanri1 = datanri[datanri[,ngold] == 1,]
  table1 = table(datanri1[,nm1], datanri1[,nm2])
  datanri2 = datanri[datanri[,ngold] == 0,]
  table2 = table(datanri2[,nm1], datanri2[,nm2])
  p1 = as.numeric(table1[2,1]/table(datanri[,ngold])[2])
  p2 = as.numeric(table1[1,2]/table(datanri[,ngold])[2])
  p3 = as.numeric(table2[2,1]/table(datanri[,ngold])[1])
  p4 = as.numeric(table2[1,2]/table(datanri[,ngold])[1])
  NRI = round(p1 - p2 - p3 + p4, 3)
  z = NRI/sqrt((p1 + p2)/table(datanri[,ngold])[2] + (p3 + p4)/table(datanri[,ngold])[1])
  z = round(as.numeric(z), 3)
  pvalue = round((1 - pnorm(abs(z)))*2, 3)
  if (pvalue < 0.001) pvalue = "<0.001"
  result = paste("NRI=", NRI, ", z=", z, ", p=", pvalue, sep = "")
  return(result)
}

Copy the case data set to the current working directory, load the case data and set the data format as a data frame. The code is presented below:

library(foreign)
dignosisdata <- read.csv("dignosisdata.csv")
datanri = as.data.frame(dignosisdata)

Use the NRI calculation function NRIcalculate() to calculate the NRI. The code is presented below:

NRIcalculate(m1 = "t1", m2 = "t2", gold = "gold")
## [1] "NRI=0.566, z=4.618, P=<0.001"

m1 is the variable name of diagnostic test 1, m2 is the variable name of diagnostic test 2, and gold is the gold standard test. The NRI is 0.566: the reclassification performance of diagnostic test 1 is significantly higher than that of diagnostic test 2.

NRI calculation of dichotomous outcome
[Case 2]
This data is from the Mayo Clinic trial in primary biliary cirrhosis (PBC) of the liver conducted between 1974 and 1984. A total of 424 PBC patients, referred to Mayo Clinic during that ten-year interval, met eligibility criteria for the randomized placebo-controlled trial of the drug D-penicillamine. The first 312 cases in the data set participated in the randomized trial and contain largely complete data. The additional 112 cases did not participate in the clinical trial, but consented to have basic measurements recorded and to be followed for survival. Six of those cases were lost to follow-up shortly after diagnosis,


Table 6 Data structure and data description

Variable names   Description
Age              In years
Albumin          Serum albumin (g/dL)
alk.phos         Alkaline phosphatase (U/liter)
Ascites          Presence of ascites
ast              Aspartate aminotransferase, once called SGOT (U/mL)
bili             Serum bilirubin (mg/dL)
chol             Serum cholesterol (mg/dL)
copper           Urine copper (μg/day)
Edema            0 no edema, 0.5 untreated or successfully treated, 1 edema despite diuretic therapy
hepato           Presence of hepatomegaly or enlarged liver
ID               Case number
Platelet         Platelet count
protime          Standardised blood clotting time
Sex              M/F
Spiders          Blood vessel malformations in the skin
Stage            Histologic stage of disease (needs biopsy)
Status           Status at endpoint, 0/1/2 for censored, transplant, dead
Time             Number of days between registration and the earlier of death, transplantation, or study analysis in July, 1986
trt              1/2/NA for D-penicillamine, placebo, not randomized
trig             Triglycerides (mg/dL)

so the data here are on an additional 106 cases as well as the 312 randomized participants.

We use the data from the first 312 patients to predict survival status at the time point of 2,000 days. It should be noted that the original data are survival data, while here we define a dichotomous outcome (alive or dead) regardless of survival time. First load the dataset, as shown in Table 6. "Status" is the outcome variable: "0" means censored, "1" means liver transplant, and "2" means dead. But the outcome of our study is dichotomous, so the data require conversion. As for the "time" variable, some samples failed to reach 2,000 days, which shows that these patients died or were censored before 2,000 days; here we delete the data which failed to reach 2,000 days. A detailed description of the other variables is available using "?pbc".

A nearly identical data set is found in appendix D of Fleming and Harrington; this version has fewer missing values.

R code and its interpretation
First, we load the "nricens" package and the dataset, then extract the first 312 observations.

library(nricens)
## Loading required package: survival
dat = pbc[1:312,]
dat$sex = ifelse(dat$sex == 'f', 1, 0)

We delete the data whose follow-up time is shorter than 2,000 days. "[]" stands for filter, "|" stands for "or", and "&" stands for "and". So, data with "time" >2,000, or with "time" <2,000 but "status" being "dead", are selected. Do not miss the last ",": it means selecting all columns of the rows which fit the filter.

dat = dat[dat$time > 2000 | (dat$time < 2000 & dat$status == 2), ]

Define the endpoint: 1 stands for time <2,000 and status =2 (dead), 0 stands for the others.

event = ifelse(dat$time < 2000 & dat$status == 2, 1, 0)

Build a matrix out of a subset of dat containing age, bili and albumin.

z.std = as.matrix(subset(dat, select = c(age, bili, albumin)))

Build a matrix out of a subset of dat containing age, bili, albumin and protime.

z.new = as.matrix(subset(dat, select = c(age, bili, albumin, protime)))

Construct two logistic regression models, mstd and mnew; model "mnew" has one more variable, "protime". Calculation with the "nricens" package requires x = TRUE, which means that the output contains the design matrix.

mstd = glm(event ~ ., binomial(logit), data.frame(event, z.std), x = TRUE)
mnew = glm(event ~ ., binomial(logit), data.frame(event, z.new), x = TRUE)

Calculate the predicted risk of the two models.

p.std = mstd$fitted.values
p.new = mnew$fitted.values

The logistic models are now fitted. There are many ways to calculate the NRI; readers can choose any one, and the first method is recommended.
(I) Calculation of risk category NRI using ('mdl.std', 'mdl.new').
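The row filter and endpoint definition above are plain boolean logic, so they are easy to sanity-check outside R. Here is a hedged Python sketch of the same selection on made-up records (the field names mirror pbc's time/status columns; the records themselves are invented):

```python
# toy records mimicking pbc's time/status fields (status 2 = dead)
rows = [
    {"time": 2500, "status": 0},  # alive past 2,000 days -> kept, event 0
    {"time": 1500, "status": 2},  # died before 2,000 days -> kept, event 1
    {"time": 1200, "status": 0},  # censored early -> dropped
    {"time": 1800, "status": 1},  # transplanted early -> dropped
]

# same condition as: dat[dat$time > 2000 | (dat$time < 2000 & dat$status == 2), ]
kept = [r for r in rows
        if r["time"] > 2000 or (r["time"] < 2000 and r["status"] == 2)]

# same as: event = ifelse(dat$time < 2000 & dat$status == 2, 1, 0)
event = [1 if r["time"] < 2000 and r["status"] == 2 else 0 for r in kept]
print(len(kept), event)  # -> 2 [0, 1]
```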


nribin(mdl.std = mstd, mdl.new = mnew, cut = c(0.2, 0.4),
       niter = 100, updown = 'category')
##
## UP and DOWN calculation:
## #of total, case, and control subjects at t0: 232 88 144
##
## Reclassification Table for all subjects:
##          New
## Standard  < 0.2  < 0.4  >= 0.4
##   < 0.2     110      3       0
##   < 0.4       3     30       0
##   >= 0.4      0      2      84
##
## Reclassification Table for case:
##          New
## Standard  < 0.2  < 0.4  >= 0.4
##   < 0.2       7      0       0
##   < 0.4       0      8       0
##   >= 0.4      0      2      71
##
## Reclassification Table for control:
##          New
## Standard  < 0.2  < 0.4  >= 0.4
##   < 0.2     103      3       0
##   < 0.4       3     22       0
##   >= 0.4      0      0      13
##
## NRI estimation:
## Point estimates:
##                  Estimate
## NRI           -0.02272727
## NRI+          -0.02272727
## NRI-           0.00000000
## Pr(Up|Case)    0.00000000
## Pr(Down|Case)  0.02272727
## Pr(Down|Ctrl)  0.02083333
## Pr(Up|Ctrl)    0.02083333
##
## Now in bootstrap..
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Point & Interval estimates:
##                  Estimate  Std.Error       Lower      Upper
## NRI           -0.02272727 0.03093492 -0.04211382 0.08275862
## NRI+          -0.02272727 0.02163173 -0.05376344 0.04950495
## NRI-           0.00000000 0.02853621 -0.03571429 0.08333333
## Pr(Up|Case)    0.00000000 0.01963109  0.00000000 0.06930693
## Pr(Down|Case)  0.02272727 0.01939570  0.00000000 0.07142857
## Pr(Down|Ctrl)  0.02083333 0.03822346  0.00000000 0.14503817
## Pr(Up|Ctrl)    0.02083333 0.02285539  0.00000000 0.09160305

(II) Calculation of risk category NRI using ('event', 'z.std', 'z.new').

nribin(event = event, z.std = z.std, z.new = z.new, cut = c(0.2, 0.4),
       niter = 100, updown = 'category')
##
## STANDARD prediction model:
##                Estimate Std. Error    z value     Pr(>|z|)
## (Intercept)  0.98927136 2.20809035  0.4480212 6.541379e-01
## age          0.07128234 0.01988079  3.5854876 3.364490e-04
## bili         0.61686651 0.10992947  5.6114755 2.006087e-08
## albumin     -1.95859156 0.53031693 -3.6932473 2.214085e-04
##
## NEW prediction model:
##                Estimate Std. Error    z value     Pr(>|z|)
## (Intercept) -1.16682234 2.92204889 -0.3993165 6.896600e-01
## age          0.06659224 0.02032242  3.2767864 1.049958e-03
## bili         0.59995139 0.11022521  5.4429600 5.240243e-08
## albumin     -1.88620553 0.53144647 -3.5491919 3.864153e-04
## protime      0.20127560 0.18388726  1.0945598 2.737095e-01
##
## UP and DOWN calculation:
## #of total, case, and control subjects at t0: 232 88 144
##
## Reclassification Table for all subjects:
##          New
## Standard  < 0.2  < 0.4  >= 0.4
##   < 0.2     110      3       0
##   < 0.4       3     30       0
##   >= 0.4      0      2      84
##
## Reclassification Table for case:
##          New
## Standard  < 0.2  < 0.4  >= 0.4
##   < 0.2       7      0       0
##   < 0.4       0      8       0
##   >= 0.4      0      2      71
##
## Reclassification Table for control:
##          New
## Standard  < 0.2  < 0.4  >= 0.4
##   < 0.2     103      3       0
##   < 0.4       3     22       0
##   >= 0.4      0      0      13
##
## NRI estimation:
## Point estimates:
##                  Estimate
## NRI           -0.02272727
## NRI+          -0.02272727
## NRI-           0.00000000
## Pr(Up|Case)    0.00000000
## Pr(Down|Case)  0.02272727
## Pr(Down|Ctrl)  0.02083333
## Pr(Up|Ctrl)    0.02083333
##
## Now in bootstrap..
##
## Point & Interval estimates:
##                  Estimate  Std.Error       Lower      Upper
## NRI           -0.02272727 0.03630980 -0.05063291 0.11288105
## NRI+          -0.02272727 0.02303343 -0.05063291 0.03448276
## NRI-           0.00000000 0.03004483 -0.02684564 0.07746479
## Pr(Up|Case)    0.00000000 0.01763929  0.00000000 0.04878049
## Pr(Down|Case)  0.02272727 0.02334453  0.00000000 0.08860759
## Pr(Down|Ctrl)  0.02083333 0.03459169  0.00000000 0.12676056
## Pr(Up|Ctrl)    0.02083333 0.01853583  0.00000000 0.05970149
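The point estimates printed above can be reproduced directly from the case and control reclassification tables: NRI+ = Pr(Up|Case) − Pr(Down|Case) and NRI− = Pr(Down|Ctrl) − Pr(Up|Ctrl), where "up" means moving to a higher risk category under the new model. A Python check using the tables from the output (the helper name categorical_nri is ours, not part of nricens):

```python
def categorical_nri(case_tab, ctrl_tab):
    """NRI from 3x3 reclassification tables whose rows are the standard
    model's category and columns the new model's, ordered low -> high."""
    def up_down(tab):
        n = sum(sum(row) for row in tab)
        up = sum(tab[i][j] for i in range(3) for j in range(3) if j > i)
        down = sum(tab[i][j] for i in range(3) for j in range(3) if j < i)
        return up / n, down / n

    up_case, down_case = up_down(case_tab)
    up_ctrl, down_ctrl = up_down(ctrl_tab)
    nri_pos = up_case - down_case   # events should move up
    nri_neg = down_ctrl - up_ctrl   # non-events should move down
    return nri_pos + nri_neg, nri_pos, nri_neg

case = [[7, 0, 0], [0, 8, 0], [0, 2, 71]]        # tables printed above
ctrl = [[103, 3, 0], [3, 22, 0], [0, 0, 13]]
nri, nri_pos, nri_neg = categorical_nri(case, ctrl)
print(round(nri, 8))  # -> -0.02272727, matching nribin()'s estimate
```

Only 2 of the 88 cases move (downwards), and the 3 up and 3 down moves among the 144 controls cancel out, which is the entire story behind the negative NRI here.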


(III) Calculation of risk category NRI using ('event', 'z.std', 'z.new').

nribin(event = event, z.std = z.std, z.new = z.new, cut = c(0.2, 0.4),
       niter = 100, updown = 'category')
##
## STANDARD prediction model:
##                Estimate Std. Error    z value     Pr(>|z|)
## (Intercept)  0.98927136 2.20809035  0.4480212 6.541379e-01
## age          0.07128234 0.01988079  3.5854876 3.364490e-04
## bili         0.61686651 0.10992947  5.6114755 2.006087e-08
## albumin     -1.95859156 0.53031693 -3.6932473 2.214085e-04
##
## NEW prediction model:
##                Estimate Std. Error    z value     Pr(>|z|)
## (Intercept) -1.16682234 2.92204889 -0.3993165 6.896600e-01
## age          0.06659224 0.02032242  3.2767864 1.049958e-03
## bili         0.59995139 0.11022521  5.4429600 5.240243e-08
## albumin     -1.88620553 0.53144647 -3.5491919 3.864153e-04
## protime      0.20127560 0.18388726  1.0945598 2.737095e-01
##
## UP and DOWN calculation:
## #of total, case, and control subjects at t0: 232 88 144
##
## Reclassification Table for all subjects:
##          New
## Standard  < 0.2  < 0.4  >= 0.4
##   < 0.2     110      3       0
##   < 0.4       3     30       0
##   >= 0.4      0      2      84
##
## Reclassification Table for case:
##          New
## Standard  < 0.2  < 0.4  >= 0.4
##   < 0.2       7      0       0
##   < 0.4       0      8       0
##   >= 0.4      0      2      71
##
## Reclassification Table for control:
##          New
## Standard  < 0.2  < 0.4  >= 0.4
##   < 0.2     103      3       0
##   < 0.4       3     22       0
##   >= 0.4      0      0      13
##
## NRI estimation:
## Point estimates:
##                  Estimate
## NRI           -0.02272727
## NRI+          -0.02272727
## NRI-           0.00000000
## Pr(Up|Case)    0.00000000
## Pr(Down|Case)  0.02272727
## Pr(Down|Ctrl)  0.02083333
## Pr(Up|Ctrl)    0.02083333
##
## Now in bootstrap..
##
## Point & Interval estimates:
##                  Estimate  Std.Error       Lower      Upper
## NRI           -0.02272727 0.03630980 -0.05063291 0.11288105
## NRI+          -0.02272727 0.02303343 -0.05063291 0.03448276
## NRI-           0.00000000 0.03004483 -0.02684564 0.07746479
## Pr(Up|Case)    0.00000000 0.01763929  0.00000000 0.04878049
## Pr(Down|Case)  0.02272727 0.02334453  0.00000000 0.08860759
## Pr(Down|Ctrl)  0.02083333 0.03459169  0.00000000 0.12676056
## Pr(Up|Ctrl)    0.02083333 0.01853583  0.00000000 0.05970149

(IV) Calculation of risk difference NRI using ('mdl.std', 'mdl.new'), updown = 'diff'.

nribin(mdl.std = mstd, mdl.new = mnew, cut = 0.02, niter = 0,
       updown = 'diff')
##
## UP and DOWN calculation:
## #of total, case, and control subjects at t0: 232 88 144
## #of subjects with 'p.new - p.std > cut' for all, case, control: 34 17 17
## #of subjects with 'p.std - p.new > cut' for all, case, control: 36 13 23
##
## NRI estimation:
## Point estimates:
##                 Estimate
## NRI           0.08712121
## NRI+          0.04545455
## NRI-          0.04166667
## Pr(Up|Case)   0.19318182
## Pr(Down|Case) 0.14772727
## Pr(Down|Ctrl) 0.15972222
## Pr(Up|Ctrl)   0.11805556

(V) Calculation of risk difference NRI using ('event', 'z.std', 'z.new'), updown = 'diff'.

nribin(event = event, z.std = z.std, z.new = z.new, cut = 0.02,
       niter = 100, updown = 'diff')
##
## STANDARD prediction model:
##                Estimate Std. Error    z value     Pr(>|z|)
## (Intercept)  0.98927136 2.20809035  0.4480212 6.541379e-01
## age          0.07128234 0.01988079  3.5854876 3.364490e-04
## bili         0.61686651 0.10992947  5.6114755 2.006087e-08
## albumin     -1.95859156 0.53031693 -3.6932473 2.214085e-04
##
## NEW prediction model:
##                Estimate Std. Error    z value     Pr(>|z|)
## (Intercept) -1.16682234 2.92204889 -0.3993165 6.896600e-01
## age          0.06659224 0.02032242  3.2767864 1.049958e-03
## bili         0.59995139 0.11022521  5.4429600 5.240243e-08
## albumin     -1.88620553 0.53144647 -3.5491919 3.864153e-04
## protime      0.20127560 0.18388726  1.0945598 2.737095e-01
##
## UP and DOWN calculation:
## #of total, case, and control subjects at t0: 232 88 144
## #of subjects with 'p.new - p.std > cut' for all, case, control: 34 17 17
## #of subjects with 'p.std - p.new > cut' for all, case, control: 36 13 23
##
## NRI estimation:
## Point estimates:
##                 Estimate
## NRI           0.08712121
## NRI+          0.04545455
## NRI-          0.04166667
## Pr(Up|Case)   0.19318182
## Pr(Down|Case) 0.14772727
## Pr(Down|Ctrl) 0.15972222


## Pr(Up|Ctrl)   0.11805556
##
## Now in bootstrap..
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Point & Interval estimates:
##                 Estimate  Std.Error       Lower     Upper
## NRI           0.08712121 0.09606989 -0.02530364 0.3338028
## NRI+          0.04545455 0.04172751 -0.02941176 0.1279070
## NRI-          0.04166667 0.07033087 -0.02898551 0.2214765
## Pr(Up|Case)   0.19318182 0.09990415  0.00000000 0.3797468
## Pr(Down|Case) 0.14772727 0.07988383  0.00000000 0.2650602
## Pr(Down|Ctrl) 0.15972222 0.12260050  0.00000000 0.4140127
## Pr(Up|Ctrl)   0.11805556 0.06328156  0.00000000 0.2420382

(VI) Calculation of risk difference NRI using ('event', 'p.std', 'p.new'), updown = 'diff'.

nribin(event = event, p.std = p.std, p.new = p.new, cut = 0.02,
       niter = 100, updown = 'diff')
##
## UP and DOWN calculation:
## #of total, case, and control subjects at t0: 232 88 144
## #of subjects with 'p.new - p.std > cut' for all, case, control: 34 17 17
## #of subjects with 'p.std - p.new > cut' for all, case, control: 36 13 23
##
## NRI estimation:
## Point estimates:
##                 Estimate
## NRI           0.08712121
## NRI+          0.04545455
## NRI-          0.04166667
## Pr(Up|Case)   0.19318182
## Pr(Down|Case) 0.14772727
## Pr(Down|Ctrl) 0.15972222
## Pr(Up|Ctrl)   0.11805556
##
## Now in bootstrap..
##
## Point & Interval estimates:
##                 Estimate  Std.Error       Lower     Upper
## NRI           0.08712121 0.07622661 -0.06506300 0.2364524
## NRI+          0.04545455 0.06076779 -0.08602151 0.1666667
## NRI-          0.04166667 0.04317476 -0.04444444 0.1205674
## Pr(Up|Case)   0.19318182 0.03999520  0.11688312 0.2674419
## Pr(Down|Case) 0.14772727 0.03670913  0.07142857 0.2278481
## Pr(Down|Ctrl) 0.15972222 0.03103384  0.09615385 0.2237762
## Pr(Up|Ctrl)   0.11805556 0.02604235  0.07407407 0.1703704

R code interpretation: this compares the predictive performance of two models. "cut" represents the cut-off value(s) of the predicted risk. Here we define two cut-off values, which stratify the risk into three groups: low risk (0–20%), intermediate risk (20–40%) and high risk (40–100%); the continuous predicted risk is thereby converted into a categorical variable. "updown" defines how the change in a sample's predicted risk is judged. With "category", a change means moving between the low-, intermediate- and high-risk categories. With "diff", the change is a continuous difference, and "cut" is then a single value, for example 0.02, meaning that a difference of more than 0.02 between the predicted risks of the new and original models counts as reclassification. "niter" is the number of iterations, namely the number of bootstrap resamples; calculating the standard error of the NRI requires resampling. If "niter" =0, the standard error of the NRI is not calculated. Generally, set "niter" =1,000: a larger "niter" takes a longer computing time, but the larger "niter" is, the higher the accuracy. The significance level α is 0.05.

The results are listed below:
(I) Tables of all outcomes, positive outcomes and negative outcomes. In a predictive model, "case" means that the outcome takes place, and "control" means that the outcome fails to take place.

## Reclassification Table for all subjects:
##          New
## Standard  < 0.2  < 0.4  >= 0.4
##   < 0.2     110      3       0
##   < 0.4       3     30       0
##   >= 0.4      0      2      84
##
## Reclassification Table for case:
##          New
## Standard  < 0.2  < 0.4  >= 0.4
##   < 0.2       7      0       0
##   < 0.4       0      8       0
##   >= 0.4      0      2      71
##
## Reclassification Table for control:
##          New
## Standard  < 0.2  < 0.4  >= 0.4
##   < 0.2     103      3       0
##   < 0.4       3     22       0
##   >= 0.4      0      0      13

(II) Point estimation, standard error and confidence interval of the NRI. Compared with the original model, the proportion of correct reclassification changes by −2.2727%: incorporating the new variable reduces the predictive accuracy, so the new model is worse than the original model.
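With updown = 'diff', the same NRI decomposition applies to the up/down counts that nribin() prints (17 up and 13 down among the 88 cases, 17 up and 23 down among the 144 controls). A quick Python check of the point estimate:

```python
# counts reported by nribin() above for cut = 0.02
cases, ctrls = 88, 144
up_case, down_case = 17, 13   # cases with p.new - p.std > cut, and the reverse
up_ctrl, down_ctrl = 17, 23

nri_pos = (up_case - down_case) / cases  # events moving up is good
nri_neg = (down_ctrl - up_ctrl) / ctrls  # non-events moving down is good
nri = nri_pos + nri_neg
print(round(nri, 8))  # -> 0.08712121, matching the output above
```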


##                  Estimate  Std.Error       Lower      Upper
## NRI           -0.02272727 0.03093492 -0.04211382 0.08275862
## NRI+          -0.02272727 0.02163173 -0.05376344 0.04950495
## NRI-           0.00000000 0.02853621 -0.03571429 0.08333333
## Pr(Up|Case)    0.00000000 0.01963109  0.00000000 0.06930693
## Pr(Down|Case)  0.02272727 0.01939570  0.00000000 0.07142857
## Pr(Down|Ctrl)  0.02083333 0.03822346  0.00000000 0.14503817
## Pr(Up|Ctrl)    0.02083333 0.02285539  0.00000000 0.09160305

Outcome of survival data
[Case 3]
We use the same data as in Case 2. The difference between the NRI of survival data and the NRI of categorical data is that the former requires a Cox regression model. So, we need to construct two Cox models and calculate the NRI of these two models.

R codes and its interpretation
Firstly, we load the necessary packages and data. Here consider the pbc dataset in the survival package as an example.

library(nricens)
dat= pbc[1:312,]
dat$sex= ifelse(dat$sex=='f', 1, 0)

Predicting the event of 'death':

time= dat$time
event= ifelse(dat$status==2, 1, 0)

Standard prediction model: age, bilirubin, and albumin.

z.std= as.matrix(subset(dat, select = c(age, bili, albumin)))

New prediction model: age, bilirubin, albumin, and protime.

z.new= as.matrix(subset(dat, select = c(age, bili, albumin, protime)))

Using coxph() to construct the Cox regression models mstd and mnew:

mstd= coxph(Surv(time,event) ~ ., data.frame(time,event,z.std), x=TRUE)
mnew= coxph(Surv(time,event) ~ ., data.frame(time,event,z.new), x=TRUE)

Predicted risk at t0=2,000, i.e., the predicted risk at the time point of 2,000 days:

p.std= get.risk.coxph(mstd, t0=2000)
p.new= get.risk.coxph(mnew, t0=2000)

There are many ways to calculate the risk category NRI. Readers could choose any one; the first method is recommended.

(I) By the KM estimator using ('mdl.std', 'mdl.new').

nricens(mdl.std= mstd, mdl.new = mnew, t0 = 2000, cut = c(0.2, 0.4),
niter = 100, updown = 'category')
##
## UP and DOWN calculation:
## #of total, case, and control subjects at t0: 312 88 144
##
## Reclassification Table for all subjects:
##            New
## Standard   < 0.2  < 0.4  >= 0.4
##   < 0.2      139      7       1
##   < 0.4       17     72       6
##   >= 0.4       0      5      65
##
## Reclassification Table for case:
##            New
## Standard   < 0.2  < 0.4  >= 0.4
##   < 0.2        9      2       0
##   < 0.4        1     21       4
##   >= 0.4       0      0      51
##
## Reclassification Table for control:
##            New
## Standard   < 0.2  < 0.4  >= 0.4
##   < 0.2       92      4       1
##   < 0.4        9     29       2
##   >= 0.4       0      3       4
##
## NRI estimation by KM estimator:
##
## Point estimates:
##                  Estimate
## NRI            0.11028068
## NRI+           0.05123381
## NRI-           0.05904686
## Pr(Up|Case)    0.06348538
## Pr(Down|Case)  0.01225156
## Pr(Down|Ctrl)  0.09583016
## Pr(Up|Ctrl)    0.03678329
##
## Now in bootstrap……
##
## Point & Interval estimates:
##                  Estimate       Lower      Upper
## NRI            0.11028068 -0.05865007 0.20446631
## NRI+           0.05123381 -0.09480483 0.14708696
## NRI-           0.05904686 -0.01180288 0.11261994
## Pr(Up|Case)    0.06348538  0.01113699 0.16595888
## Pr(Down|Case)  0.01225156  0.00000000 0.15653476
## Pr(Down|Ctrl)  0.09583016  0.02631760 0.16468399
## Pr(Up|Ctrl)    0.03678329  0.01316053 0.08912133

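For reference, get.risk.coxph() returns the model-based risk of the event by t0, i.e., 1 − S(t0 | x). A conceptual sketch of the same quantity via survfit() (an illustration of the idea, not the package's internal code):

```r
# Predicted risk at t0 is one minus the model-based survival probability
library(survival)
fit <- coxph(Surv(time, status == 2) ~ age + bili + albumin,
             data = pbc[1:312, ], x = TRUE)
sf <- survfit(fit, newdata = pbc[1:2, ])        # curves for two example subjects
risk_t0 <- 1 - summary(sf, times = 2000)$surv   # risk of death by day 2,000
```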

(II) By the KM estimator using ('time', 'event', 'z.std', 'z.new').

nricens(time= time, event = event, z.std = z.std, z.new = z.new,
t0 = 2000, cut = c(0.2, 0.4), niter = 100, updown = 'category')
##
## STANDARD prediction model (Cox model):
##                coef exp(coef)    se(coef)         z     Pr(>|z|)
## age      0.03726683 1.0379699 0.009048925  4.118371 3.815600e-05
## bili     0.13531179 1.1448937 0.013711323  9.868617 5.694436e-23
## albumin -1.44611854 0.2354825 0.221997986 -6.514107 7.312356e-11
##
## NEW prediction model (Cox model):
##                coef exp(coef)    se(coef)         z     Pr(>|z|)
## age      0.03362675 1.0341985 0.009214173  3.649460 2.627925e-04
## bili     0.12517886 1.1333511 0.014406820  8.688861 3.660902e-18
## albumin -1.39395237 0.2480928 0.217046959 -6.422354 1.341831e-10
## protime  0.28602917 1.3311313 0.070536400  4.055058 5.012193e-05
##
## UP and DOWN calculation:
## #of total, case, and control subjects at t0: 312 88 144
##
## Reclassification Table for all subjects:
##            New
## Standard   < 0.2  < 0.4  >= 0.4
##   < 0.2      139      7       1
##   < 0.4       17     72       6
##   >= 0.4       0      5      65
##
## Reclassification Table for case:
##            New
## Standard   < 0.2  < 0.4  >= 0.4
##   < 0.2        9      2       0
##   < 0.4        1     21       4
##   >= 0.4       0      0      51
##
## Reclassification Table for control:
##            New
## Standard   < 0.2  < 0.4  >= 0.4
##   < 0.2       92      4       1
##   < 0.4        9     29       2
##   >= 0.4       0      3       4
##
## NRI estimation by KM estimator:
##
## Point estimates:
##                  Estimate
## NRI            0.11028068
## NRI+           0.05123381
## NRI-           0.05904686
## Pr(Up|Case)    0.06348538
## Pr(Down|Case)  0.01225156
## Pr(Down|Ctrl)  0.09583016
## Pr(Up|Ctrl)    0.03678329
##
## Now in bootstrap……
##
## Point & Interval estimates:
##                  Estimate       Lower      Upper
## NRI            0.11028068 -0.03560702 0.20881092
## NRI+           0.05123381 -0.08359649 0.11206601
## NRI-           0.05904686 -0.01795177 0.13023171
## Pr(Up|Case)    0.06348538  0.01955126 0.17904180
## Pr(Down|Case)  0.01225156  0.00000000 0.19505939
## Pr(Down|Ctrl)  0.09583016  0.02681223 0.20450527
## Pr(Up|Ctrl)    0.03678329  0.01779895 0.09818359

(III) By the KM estimator using ('time', 'event', 'p.std', 'p.new').

nricens(time= time, event = event, p.std = p.std, p.new = p.new,
t0 = 2000, cut = c(0.2, 0.4), niter = 100, updown = 'category')
##
## UP and DOWN calculation:
## #of total, case, and control subjects at t0: 312 88 144
##
## Reclassification Table for all subjects:
##            New
## Standard   < 0.2  < 0.4  >= 0.4
##   < 0.2      139      7       1
##   < 0.4       17     72       6
##   >= 0.4       0      5      65
##
## Reclassification Table for case:
##            New
## Standard   < 0.2  < 0.4  >= 0.4
##   < 0.2        9      2       0
##   < 0.4        1     21       4
##   >= 0.4       0      0      51
##
## Reclassification Table for control:
##            New
## Standard   < 0.2  < 0.4  >= 0.4
##   < 0.2       92      4       1
##   < 0.4        9     29       2
##   >= 0.4       0      3       4
##
## NRI estimation by KM estimator:
##
## Point estimates:
##                  Estimate
## NRI            0.11028068
## NRI+           0.05123381
## NRI-           0.05904686
## Pr(Up|Case)    0.06348538
## Pr(Down|Case)  0.01225156
## Pr(Down|Ctrl)  0.09583016
## Pr(Up|Ctrl)    0.03678329
##
## Now in bootstrap……
##
## Point & Interval estimates:
##                   Estimate        Lower      Upper
## NRI             0.11028068  0.045766814 0.17347974
## NRI+            0.05123381 -0.001867747 0.10493820
## NRI-            0.05904686  0.013007633 0.10168702
## Pr(Up|Case)     0.06348538  0.021802922 0.11136896
## Pr(Down|Case)   0.01225156  0.000000000 0.03783913
## Pr(Down|Ctrl)   0.09583016  0.056016704 0.13446013
## Pr(Up|Ctrl)     0.03678329  0.018781733 0.05795319

(IV) Calculation of the risk-difference NRI by the KM estimator, updown = 'diff'.


nricens(mdl.std= mstd, mdl.new = mnew, t0 = 2000, updown = 'diff',
cut = 0.05, niter = 100)
##
## UP and DOWN calculation:
## #of total, case, and control subjects at t0: 312 88 144
## #of subjects with 'p.new - p.std > cut' for all, case, control: 34 21 11
## #of subjects with 'p.std - p.new < cut' for all, case, control: 40 12 8
##
## NRI estimation by KM estimator:
##
## Point estimates:
##                  Estimate
## NRI            0.10070960
## NRI+           0.05097223
## NRI-           0.04973737
## Pr(Up|Case)    0.22431499
## Pr(Down|Case)  0.17334277
## Pr(Down|Ctrl)  0.10859064
## Pr(Up|Ctrl)    0.05885327
##
## Now in bootstrap……
##
## Point & Interval estimates:
##                  Estimate       Lower     Upper
## NRI            0.10070960 -0.05948241 0.3051724
## NRI+           0.05097223 -0.06240698 0.1771789
## NRI-           0.04973737 -0.02707081 0.2263106
## Pr(Up|Case)    0.22431499  0.06642735 0.3380647
## Pr(Down|Case)  0.17334277  0.00000000 0.2792304
## Pr(Down|Ctrl)  0.10859064  0.01470473 0.3250865
## Pr(Up|Ctrl)    0.05885327  0.02707081 0.1276894

(V) Calculation of the risk-difference NRI by the IPW estimator, updown = 'diff'.

nricens(mdl.std= mstd, mdl.new = mnew, t0 = 2000, updown = 'diff',
cut = 0.05, point.method = 'ipw', niter= 100)
##
## UP and DOWN calculation:
## #of total, case, and control subjects at t0: 312 88 144
## #of subjects with 'p.new - p.std > cut' for all, case, control: 34 21 11
## #of subjects with 'p.std - p.new < cut' for all, case, control: 40 12 8
##
## NRI estimation by IPW estimator:
##
## Point estimates:
##                  Estimate
## NRI            0.06361038
## NRI+           0.08444371
## NRI-          -0.02083333
## Pr(Up|Case)    0.22905909
## Pr(Down|Case)  0.14461537
## Pr(Down|Ctrl)  0.05555556
## Pr(Up|Ctrl)    0.07638889
##
## Now in bootstrap……
##
## Point & Interval estimates:
##                   Estimate       Lower     Upper
## NRI             0.06361038 -0.04977671 0.3115166
## NRI+            0.08444371 -0.01895903 0.2349312
## NRI-           -0.02083333 -0.06164384 0.1323529
## Pr(Up|Case)     0.22905909  0.09065899 0.3639325
## Pr(Down|Case)   0.14461537  0.01041488 0.2470408
## Pr(Down|Ctrl)   0.05555556  0.00000000 0.2695035
## Pr(Up|Ctrl)     0.07638889  0.03546099 0.1621622

The interpretation of the R codes is the same as for the NRI of the dichotomous outcome mentioned above. Results are presented below.
(I) Tables of all outcomes, positive outcomes and negative outcomes. In a predictive model, "case" means the outcome takes place; "control" means the outcome fails to take place.

## Reclassification Table for all subjects:
##            New
## Standard   < 0.2  < 0.4  >= 0.4
##   < 0.2      139      7       1
##   < 0.4       17     72       6
##   >= 0.4       0      5      65
##
## Reclassification Table for case:
##            New
## Standard   < 0.2  < 0.4  >= 0.4
##   < 0.2        9      2       0
##   < 0.4        1     21       4
##   >= 0.4       0      0      51
##
## Reclassification Table for control:
##            New
## Standard   < 0.2  < 0.4  >= 0.4
##   < 0.2       92      4       1
##   < 0.4        9     29       2
##   >= 0.4       0      3       4

(II) Point estimation, standard error and confidence interval of the NRI. Compared with the original model, the proportion of correctly reclassified subjects improves by 11.028%. Incorporating the new variable improves the predictive accuracy: the new model is better than the original model.

##                  Estimate       Lower      Upper
## NRI            0.11028068 -0.05865007 0.20446631
## NRI+           0.05123381 -0.09480483 0.14708696
## NRI-           0.05904686 -0.01180288 0.11261994
## Pr(Up|Case)    0.06348538  0.01113699 0.16595888
## Pr(Down|Case)  0.01225156  0.00000000 0.15653476
## Pr(Down|Ctrl)  0.09583016  0.02631760 0.16468399
## Pr(Up|Ctrl)    0.03678329  0.01316053 0.08912133

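With updown = 'diff', subjects are classified as moving "up" or "down" by comparing the change in predicted risk against cut rather than against risk categories. A minimal illustration with made-up predicted risks (not from the pbc data):

```r
# "diff" up/down classification with cut = 0.05, hypothetical predicted risks
p.std <- c(0.10, 0.30, 0.50)
p.new <- c(0.18, 0.28, 0.44)
cut <- 0.05
up   <- p.new - p.std > cut   # TRUE FALSE FALSE: risk raised by more than 0.05
down <- p.std - p.new > cut   # FALSE FALSE TRUE: risk lowered by more than 0.05
```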

Brief summary

It is necessary to have a correct understanding of the NRI. The NRI and C-statistics both evaluate the discrimination of models. The improvement in the C-statistic is sometimes limited while the NRI improves significantly, which means that the predictive performance of the new model improves significantly compared with the original model. It should be noted that there is a small difference between the R codes for dichotomous data and for survival data. In this Section, we discussed the calculation of the NRI. In the next Section, we will introduce the calculation theory and definition of another index, the IDI.

Calculation method of IDI with R

Background

In the previous section about the principle and calculation methods of the NRI, we compared the AUC (also known as the C-statistic) with the NRI. The NRI has two advantages:
(I) NRI is more sensitive than the C-statistic/AUC derived from ROC analysis (26);
(II) NRI is easier to understand in clinical practice when a cut-off value is given, for example, the cut-off value of a diagnostic marker or the cut-off values that stratify low risk, intermediate risk and high risk (27).
However, the NRI has its disadvantage: it only considers performance improvement at one time point and fails to evaluate the overall improvement of a predictive model. Therefore, we could use another index: the IDI (integrated discrimination improvement) (40,41).

Some readers may ask: is the AUC/C-statistic able to evaluate the overall improvement of a predictive model? To answer this question, we must go back to the limitations of the AUC/C-statistic. If we must compare the IDI with the AUC/C-statistic, the IDI is more sensitive and easier to understand in clinical practice. It should be noted that we can calculate the AUC/C-statistic of one predictive model, but we cannot calculate the NRI or IDI of one predictive model: the IDI and NRI are calculated from the comparison of two models. One model does not have an IDI or NRI.

We don't know whether the last paragraph makes sense to readers. But when we are dealing with difficult problems, we could put them aside and forget about the advantages and disadvantages of the AUC/C-statistic, NRI and IDI. What we should remember is that when comparing the diagnostic power of two markers or comparing two predictive models, we could use not only the AUC/C-statistic but also the NRI and IDI, which together give a comprehensive perspective on how much the predictive performance improves.

Calculation principle of IDI

The formula of the IDI reflects the difference between the predictive probabilities of two models (26,36). Therefore, the IDI is calculated based on the predictive probability of each study object under the given predictive models. The formula is:

IDI = (Pnew,events − Pold,events) − (Pnew,non-events − Pold,non-events)

Pnew,events and Pold,events are the average predictive probabilities of the study objects in the disease group using the new model and the original model, respectively. Pnew,events − Pold,events represents the improvement in predictive probability. For the disease group, the higher the predicted probability of disease, the more accurate the predictive model. Therefore, a larger difference means that the new model is better.

Pnew,non-events and Pold,non-events are the average predictive probabilities of the study objects in the healthy group using the new model and the original model, respectively. Pnew,non-events − Pold,non-events represents the reduction in predictive probability. For the healthy group, the smaller the predicted probability of disease, the more accurate the predictive model. Therefore, a smaller difference means that the new model is better.

At last, the IDI is calculated by subtraction. In general, a larger IDI means better predictive performance of the new model. Like the NRI, if IDI >0, it means positive improvement, which indicates that the new model has better predictive value than the original model; if IDI <0, it means negative improvement, which indicates that the new model has worse predictive value than the original model; if IDI =0, it means no improvement. We can calculate a z-score to determine whether the difference between the new model and the original model reaches a significant level. The z-score approximately obeys the standard normal distribution. The formula for the z-score is:

Z = IDI / √[(SEevents)² + (SEnon-events)²]	[3]

SEevents is the standard error of (Pnew,events − Pold,events). First, calculate the predictive probability of each patient in the disease group using the new and the original model, and take the difference; then calculate the standard error of that difference. SEnon-events is the standard error of (Pnew,non-events − Pold,non-events). First, calculate the predictive probability of each patient in the healthy group using the new and the original model, and take the difference; then calculate the standard error of that difference.

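The definitions above can be written out directly. The following sketch uses made-up predicted probabilities for six subjects (not real data) to show how the IDI and the z-score of formula [3] are computed:

```r
# Toy illustration of the IDI formula; all probabilities are hypothetical
p.old <- c(0.60, 0.70, 0.50, 0.30, 0.20, 0.40)  # old model: 3 events, 3 non-events
p.new <- c(0.80, 0.75, 0.65, 0.25, 0.10, 0.35)  # new model, same subjects
event <- c(1, 1, 1, 0, 0, 0)
d.ev <- p.new[event == 1] - p.old[event == 1]   # change among events
d.ne <- p.new[event == 0] - p.old[event == 0]   # change among non-events
IDI <- mean(d.ev) - mean(d.ne)                  # 0.1333 - (-0.0667) = 0.2
# z-score per formula [3]: SEs are the standard errors of the mean differences
se.ev <- sd(d.ev) / sqrt(length(d.ev))
se.ne <- sd(d.ne) / sqrt(length(d.ne))
z <- IDI / sqrt(se.ev^2 + se.ne^2)
p <- 2 * (1 - pnorm(abs(z)))
```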

Case study

Calculation of IDI
[Case 1]
Researchers want to evaluate the predictive value of two diagnostic markers for diabetes. Three diagnostic methods (a gold standard test, diagnostic marker 1 and diagnostic marker 2) are applied to 100 study objects. The data used in this case is available in the appendix "diagnosisdata.csv". Disease status predicted by the gold standard test, diagnostic test1 and diagnostic test2 are listed, of which "gold" represents the result of the gold standard test (1= disease, 0= healthy); "t1" represents the result of diagnostic test1 (1= positive, 0= negative); and "t2" represents the result of diagnostic test2 (1= positive, 0= negative). Readers could use the formulas listed above; here we use our own R codes to calculate the IDI of the two diagnostic tests. We have organized our data, renamed it "diagnosisdata.csv" and stored it in the current working directory. To make it easier for readers to practice, the data and codes are available for download in the appendix.

R codes and its interpretation
Because there is no ready-made function for the calculation of this IDI, we define a function named "IDIcalculate()" based on the definition we described above. Codes are presented below:

IDIcalculate=function(m1="dia1",m2="dia2",gold="gold"){
  dataidi=dataidi[complete.cases(dataidi),]
  for (i in 1:length(names(dataidi))){
    if(names(dataidi)[i]==m1)nm1=as.numeric(i)
    if(names(dataidi)[i]==m2)nm2=as.numeric(i)
    if(names(dataidi)[i]==gold)ngold=as.numeric(i)
  }
  if(names(table(dataidi[,ngold]))[1]!="0" ||
     names(table(dataidi[,ngold]))[2]!="1")
    stop("reference standard value not 0 or 1")
  logit1=glm(dataidi[,ngold]~dataidi[,nm1],
             family=binomial(link='logit'),data=dataidi)
  dataidi$pre1=logit1$fitted.values
  logit2=glm(dataidi[,ngold]~dataidi[,nm2],
             family=binomial(link='logit'),data=dataidi)
  dataidi$pre2=logit2$fitted.values
  dataidi$predif=dataidi$pre1-dataidi$pre2
  dataidi1=dataidi[dataidi[,ngold]==1,]
  dataidi2=dataidi[dataidi[,ngold]==0,]
  p1=mean(dataidi1$pre1)
  p2=mean(dataidi1$pre2)
  p3=mean(dataidi2$pre1)
  p4=mean(dataidi2$pre2)
  IDI=round(p1-p2-p3+p4,3)
  z=IDI/sqrt(sd(dataidi1$predif)/length(dataidi1$predif)+
             sd(dataidi2$predif)/length(dataidi2$predif))
  z=round(as.numeric(z),3)
  pvalue=round((1-pnorm(abs(z)))*2,3)
  if(pvalue<0.001)pvalue="<0.001"
  result=paste("IDI=",IDI,",z=",z,",p=",pvalue,sep="")
  return(result)
}

Load the case data into the current working directory, read it in, and set the data format as a data frame. Codes are presented below:

library(foreign)
diagnosisdata <- read.csv("diagnosisdata.csv")
dataidi=as.data.frame(diagnosisdata)

Use the IDI calculation function IDIcalculate() to calculate the IDI. Codes are presented below:

IDIcalculate(m1="t1",m2="t2",gold="gold")
## [1] "IDI=0.599,z=5.803,p=<0.001"

m1 is the variable name of diagnostic test1, m2 is the variable name of diagnostic test2, and gold is the gold standard test. The IDI is 0.599: the predictive value of diagnostic test1 is significantly higher than that of diagnostic test2.

IDI calculation of dichotomous outcome
[Case 2]
The data used here is a dataset from the Mayo Clinic which can be imported from the "survival" package. It contains clinical data and PBC status of 418 patients, of which the first 312 patients participated in a randomized trial and the others came from cohort studies. We use the data from the first 312 patients to predict survival status. "status" is the outcome variable: "0" means censored, "1" means liver transplant, and "2" means dead. But the outcome of our study is dichotomous, so the data requires conversion. We construct a logistic model based on the patients' survival status. A detailed description of the other variables is available using "?pbc". R packages for IDI calculation are shown in Table 7.

R codes and its interpretation
Here consider the pbc dataset in the survival package as an example. First, we load the "survival" package and the dataset, then extract the first 312 observations.

library(survival)
dat=pbc[1:312,]
dat$sex=ifelse(dat$sex=='f',1,0)

Subjects censored before 2,000 days are excluded.

dat=dat[dat$time>2000|(dat$time<2000&dat$status==2),]

Predicting the event of 'death' before 2,000 days:

event=ifelse(dat$time<2000&dat$status==2,1,0)

Standard prediction model: age, bilirubin, and albumin.

z.std=as.matrix(subset(dat,select=c(age,bili,albumin)))

New prediction model: age, bilirubin, albumin, and protime.

z.new=as.matrix(subset(dat,select=c(age,bili,albumin,protime)))

glm() fit (logistic model):

mstd=glm(event~.,binomial(logit),data.frame(event,z.std),x=TRUE)
mnew=glm(event~.,binomial(logit),data.frame(event,z.new),x=TRUE)


Table 7 R packages for IDI calculation (37)

R package     Download   Categorical outcome           Survival outcome
PredictABEL   CRAN       reclassification() function   Not available
survIDINRI    CRAN       Not available                 IDI.INF() function

IDI, integrated discrimination improvement.

Using the PredictABEL package.

library(PredictABEL)
## Loading required package: Hmisc
## Loading required package: lattice
## Loading required package: Formula
## Loading required package: ggplot2
## Loading required package: ROCR
## Loading required package: gplots
## Loading required package: epitools
## Loading required package: PBSmodelling

pstd<-mstd$fitted.values
pnew<-mnew$fitted.values

Use cbind() to add the pre-defined variable "event" to the data and rename the result "dat_new".

dat_new=cbind(dat,event)

Calculate the NRI and IDI. The IDI is irrelevant to the cut-off values. cOutcome is the column index of the outcome variable; predrisk1 and predrisk2 are the predicted risks of the original and the new model, respectively.

reclassification(data=dat_new,cOutcome=21,
predrisk1=pstd,predrisk2=pnew,
cutoff=c(0,0.2,0.4,1))
## _________________________________________
##
## Reclassification table
## _________________________________________
##
## Outcome: absent
##
##               Updated Model
## Initial Model [0,0.2) [0.2,0.4) [0.4,1] % reclassified
##   [0,0.2)         103         3       0              3
##   [0.2,0.4)         3        22       0             12
##   [0.4,1]           0         0      13              0
##
## Outcome: present
##
##               Updated Model
## Initial Model [0,0.2) [0.2,0.4) [0.4,1] % reclassified
##   [0,0.2)           7         0       0              0
##   [0.2,0.4)         0         8       0              0
##   [0.4,1]           0         2      71              3
##
## Combined Data
##
##               Updated Model
## Initial Model [0,0.2) [0.2,0.4) [0.4,1] % reclassified
##   [0,0.2)         110         3       0              3
##   [0.2,0.4)         3        30       0              9
##   [0.4,1]           0         2      84              2
## _________________________________________
##
## NRI(Categorical) [95% CI]: -0.0227 [-0.0683 to 0.0229]; P value: 0.32884
## NRI(Continuous) [95% CI]: 0.0391 [-0.2238 to 0.3021]; P value: 0.77048
## IDI [95% CI]: 0.0044 [-0.0037 to 0.0126]; P value: 0.28396

The IDI is 0.0044, indicating that the new model improves by 0.44% compared with the original model.

IDI of survival outcome data
[Case 3]
The data used here is the same as in Case 2: the Mayo Clinic pbc dataset imported from the "survival" package. Again we use the first 312 patients, who participated in the randomized trial. "status" is the outcome variable: "0" means censored, "1" means liver transplant, and "2" means dead. Here, however, the outcome of interest is the survival outcome itself, so "status" is converted into a binary event indicator ("dead" as the endpoint) and analyzed together with the survival time. A detailed description of the other variables is available using "?pbc".

R codes and its interpretation
Here consider the pbc dataset in the survival package as an example.

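The IDI reported by reclassification() can also be obtained directly from its definition using the fitted probabilities. A self-contained sketch (rows with missing values are dropped first, so the result may differ marginally from the output above):

```r
# Direct IDI from the definition: mean probability change among events
# minus mean probability change among non-events
library(survival)
d <- pbc[1:312, c("time", "status", "age", "bili", "albumin", "protime")]
d <- d[complete.cases(d), ]
d <- d[d$time > 2000 | (d$time < 2000 & d$status == 2), ]
event <- ifelse(d$time < 2000 & d$status == 2, 1, 0)
m.std <- glm(event ~ age + bili + albumin,           binomial, data = d)
m.new <- glm(event ~ age + bili + albumin + protime, binomial, data = d)
p.std <- fitted(m.std); p.new <- fitted(m.new)
idi <- (mean(p.new[event == 1]) - mean(p.std[event == 1])) -
       (mean(p.new[event == 0]) - mean(p.std[event == 0]))
round(idi, 4)  # close to the 0.0044 reported by reclassification()
```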

First, we load the "survival" package and the dataset, then extract the first 312 observations.

library(survival)
dat=pbc[1:312,]
dat$time=as.numeric(dat$time)

Define the survival outcome, with "dead" as the endpoint.

dat$status=ifelse(dat$status==2,1, 0)

Define the time point.

t0=365*5

Construct a basic matrix containing the regression model variables.

indata0 = as.matrix(subset(dat,select=c(time,status,age,bili,albumin)))

Construct a new matrix adding one new regression model variable.

indata1 = as.matrix(subset(dat,select=c(time,status,age,bili,albumin,protime)))

Variable matrix of the basic regression model:

covs0<-as.matrix(indata0[,c(-1,-2)])

Variable matrix of the new regression model:

covs1<-as.matrix(indata1[,c(-1,-2)])

dat[,2:3] is the survival outcome: the second and third columns are the survival time and status, respectively. covs0 and covs1 are the original and new variable matrices, respectively. t0 is the time point and npert is the number of iterations. The calculation result of the IDI is listed below:

library(survIDINRI)
## Loading required package: survC1
x<-IDI.INF(dat[,2:3],covs0, covs1, t0, npert=100)
IDI.INF.OUT(x)
##     Est.  Lower Upper p-value
## M1 0.025 -0.001 0.055   0.079
## M2 0.226 -0.057 0.401   0.079
## M3 0.012 -0.002 0.036   0.079

The IDI (M1) is 0.025, indicating that the predictive performance of the new model improves by 2.5% compared with the original model.

Furthermore, the result is illustrated graphically (Figure 15):

IDI.INF.GRAPH(x)

Figure 15 The comparison of two models.

Brief summary

We have introduced the AUC, NRI, IDI and DCA for evaluating and comparing the predictive performance of two models in this series of articles. These indices reflect the predictive performance from different angles. Here we summarize them.
(I) The AUC/C-statistic derived from ROC analysis is the classic method and the foundation of discrimination evaluation. Although the NRI and IDI were developed recently and are highly recommended, the AUC/C-statistic is still the basic method to evaluate the improvement in predictive performance between two models. Of course, we recommend that the C-statistic/AUC, NRI and IDI should all be calculated. It would be perfect if a DCA analysis is also available. But done is better than perfect.
(II) If the outcome variable is a multi-category variable, for example, low risk, intermediate risk and high risk, the NRI and IDI are better choices. The AUC/C-statistic is more complicated and harder to explain in this setting, as we have discussed before.
(III) The calculation of the NRI is related to the cut-off points. If a cut-off point is too high or the number of cut-off points is too small, the NRI could be underestimated and fail to reach a significant level. If a cut-off point is too low or the number of cut-off points is too large, the NRI could be overestimated in clinical practice. Therefore, setting the cut-off values is important for NRI calculation, and it is necessary to set the correct cut-off points based on clinical need. If the cut-off points are hard to determine, the IDI and AUC are better; if the cut-off values can be determined, the NRI is better.
(IV) Using DCA in clinical utility analysis is the icing on the cake. DCA is not the only method for clinical utility analysis, which we will discuss in the next Section.

In addition, we need to consider a question which can easily be neglected. After adding a new marker, the new model is more complicated than the original model. Is the complicated model acceptable? Is the new marker accessible? Is it convenient? Everything has its pros and cons; researchers need to weigh model improvement against the cost of adding a new marker.


Table 8 A subset of the Framingham Heart Study dataset (only the top 10 observations are listed)

ID  Sex  SBP (mmHg)  DBP (mmHg)  SCL (mmol/L)  Chdfate  Follow up (days)  Age (years)  BMI (kg/m2)  Month
 1    1         106          68           239        0             7,345           60         22.9      1
 2    1         118          78           252        1             1,765           46         22        1
 3    2         135          85           284        0            11,545           49         30.6      1
 4    2         154          92           196        0            11,688           52         36.1      1
 5    1         162         102           275        0             6,039           55         29.3      1
 6    2         136          66           313        0             9,436           62         25.4      1
 7    1         140          95           245        0            11,688           50         29.5      1
 8    1         112          68           210        0            11,688           37         25.2      1
 9    2         168          96           190        0            11,688           47         27.2      1
10    2         114          78           245        1             5,302           45         28.6      1

ID, serial number; SBP, systolic blood pressure; DBP, diastolic blood pressure; SCL, serum cholesterol; BMI, body mass index.

Decision Curve Analysis for Binary Outcome with R

Background

In the previous sections we explored the C-statistic (i.e., AUC, the area under the ROC curve), which evaluates the discrimination of a predictive model. But is it good enough? The answer is: there is no best, only better. For example, predicting whether a patient is ill by a continuous index carries a certain probability of false positives and false negatives regardless of which value is selected as the cutoff. Since neither of these situations can be avoided, we should go back to the original motivation for constructing a predictive model and try to find the model that yields the greatest net benefit. But how do we calculate the net benefit of a prediction?

In 2006, Andrew Vickers et al., working at Memorial Sloan Kettering Cancer Center, invented a new calculation method called decision curve analysis (DCA) (42-44). Compared with ROC analysis, which was born during the Second World War, DCA is obviously still a youngster, but the student has outshone the master: many top medical journals, such as Ann Intern Med, JAMA, BMJ and J Clin Oncol, have encouraged the use of the DCA method (45). Then how do we draw an attractive decision curve?

Statisticians always think first of using R to implement new algorithms, and this was indeed the case here: the R-based DCA algorithm was announced first, followed by the SAS- and Stata-based DCA algorithms. Kerr et al. also created an R package called DecisionCurve for the implementation of the decision curve method, which can no longer be downloaded from the official CRAN website. All the functions of the original DecisionCurve package have been integrated into the rmda package, so when you need to draw a decision curve, you just have to install the rmda package in R (46). The tutorials on the Internet on how to install the DecisionCurve package in the new version of R are therefore no longer appropriate (47); the correct way is to install the rmda package directly. Below we focus on the method of drawing DCA curves and do not explain too many complicated statistical principles.

Case study

Dichotomous outcome
[Case 1]
The data is a subset of the famous Framingham Heart Study dataset from the NHLBI (National Heart, Lung, and Blood Institute), containing 4,699 samples and 10 variables. The independent variables include sex (sex), systolic blood pressure (SBP), diastolic blood pressure (DBP), serum cholesterol (SCL), age (age), body mass index (BMI), etc., and the dependent variable is the CHD-related death event (chdfate). In this case, the dependent variable is a two-category variable, coded 1 for death and 0 for no death during the follow-up period. The data structure is shown in Table 8. We sorted it out and named it 'Framingham.csv', which is stored in the current

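The quantity that DCA plots has a simple closed form: at threshold probability pt, net benefit = TP/n − FP/n × pt/(1 − pt). A minimal sketch of this standard cohort-sample definition with hypothetical predictions (rmda's case-control option additionally conditions on the supplied prevalence):

```r
# Net benefit of a prediction rule at threshold probability pt:
#   NB = TP/n - FP/n * pt/(1 - pt)
net_benefit <- function(pred, outcome, pt) {
  n  <- length(outcome)
  tp <- sum(pred >= pt & outcome == 1)  # treated and truly diseased
  fp <- sum(pred >= pt & outcome == 0)  # treated unnecessarily
  tp / n - fp / n * pt / (1 - pt)
}
# Toy check with made-up predictions:
net_benefit(pred = c(0.1, 0.3, 0.8, 0.9), outcome = c(0, 0, 1, 1), pt = 0.5)
# 0.5 (2 true positives, no false positives, n = 4)
```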

working path of the R software. For the convenience of the reader, the data and code can be downloaded from the attachments in this section.

[Case 1] Interpretation

We will use the [Case 1] dataset, with the CHD-related death event (chdfate) as the outcome variable, to establish two logistic regression models and demonstrate the DCA curve method. One is a simple logistic regression model (simple) with SCL as the only predictor of CHD-related death (outcome); the other is a multivariate logistic regression model (complex), in which sex, age, BMI, SCL, SBP and DBP are the predictors and CHD-related death is the outcome (outcome).

R codes and its interpretation
Load the rmda package (you need to install it in advance) and then load the data.

install.packages("rmda")
library(rmda)
Data<-read.csv('Framingham.csv',sep = ',')

DCA model construction. We firstly build a simple model using the decision_curve() function, named simple.

simple<- decision_curve(chdfate~scl,data = Data,
                        family = binomial(link ='logit'),
                        thresholds= seq(0,1, by = 0.01),
                        confidence.intervals = 0.95,
                        study.design = 'case-control',
                        population.prevalence = 0.3)
## Calculating net benefit curves for case-control data. All calculations
## are done conditional on the outcome prevalence provided.

R code interpretation: In the decision_curve() function, family=binomial(link='logit') fits the model by logistic regression. The thresholds argument sets the range of the abscissa (the threshold probability), which is generally 0–1; if, however, there is consensus that intervention must be taken once the threshold probability exceeds a certain value, say 40%, then studying thresholds above 0.4 makes no sense, and the range can be set to 0–0.4. The "by" value is the step between consecutive data points. "study.design" sets the type of study, either "cohort" or "case-control". When the study type is "case-control", the "population.prevalence" parameter must also be supplied, because in a case-control study the prevalence cannot be estimated from the data and needs to be provided in advance.

Then we use the decision_curve() function to construct a complex logistic regression model and name it complex. The syntax is basically the same as for the simple model; it only adds the independent variables SBP + DBP + age + BMI + sex to the original simple model.

complex<-decision_curve(chdfate~scl+sbp+dbp+age+bmi+sex,
                        data = Data,family = binomial(link ='logit'),
                        thresholds = seq(0,1, by = 0.01),
                        confidence.intervals= 0.95,
                        study.design = 'case-control',
                        population.prevalence = 0.3)
## Calculating net benefit curves for case-control data. All calculations
## are done conditional on the outcome prevalence provided.
## Note: The data provided is used to both fit a prediction model and
## to estimate the respective decision curve. This may cause bias in
## decision curve estimates leading to over-confidence in model performance.

We combine the fitted simple and complex models into a single list and name it List.

List<- list(simple,complex)

We use the plot_decision_curve() function to plot the DCA curve, as shown in Figure 16 below.

plot_decision_curve(List,
                    curve.names=c('simple','complex'),
                    cost.benefit.axis =FALSE,col= c('red','blue'),
                    confidence.intervals=FALSE,
                    standardize = FALSE)
## Note: When multiple decision curves are plotted, decision curves for
## 'All' are calculated using the prevalence from the first DecisionCurve
## object in the list provided.

Code interpretation: The object passed to the plot_decision_curve() function is the List defined earlier. If you only draw one curve, you can directly replace List with simple or complex. curve.names gives the name of each curve in the legend; the order must match the order in which the models were combined into List above. "cost.benefit.axis" adds an extra abscissa axis, the cost:benefit ratio; its default value is TRUE, so remember to set it to FALSE when you do not need it. col sets the colors. "confidence.intervals" sets whether to plot the confidence interval of each curve, and standardize sets whether the net benefit (NB) is standardized by the prevalence. The DCA curves are shown in Figure 16 below.

Curve interpretation: It can be seen that the net benefit of the complex model is higher than that of the simple model for thresholds in the range 0.1–0.5.

You can view the data points on the complex model curve with the command shown below, where NB can also be changed to sNB, indicating the net benefit standardized by prevalence.

summary(complex,measure = 'NB')
##
## Net Benefit (95% Confidence Intervals):
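Before reading the summary output, it helps to see what decision_curve() is computing. At a threshold probability pt, the net benefit of a model is NB = TP/n − (FP/n) × pt/(1 − pt), where TP and FP count the subjects whose predicted risk is at least pt. A minimal base-R sketch on simulated data (the data and all variable names here are our own, not the Framingham file):

```r
set.seed(1)
n <- 2000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 1.2 * x))   # simulated binary outcome

fit  <- glm(y ~ x, family = binomial)
phat <- predict(fit, type = "response")   # predicted risks

# Net benefit of treating everyone whose predicted risk is >= pt:
# NB = TP/n - (FP/n) * pt / (1 - pt)
net_benefit <- function(phat, y, pt) {
  treat <- phat >= pt
  tp <- sum(treat & y == 1) / length(y)   # true positives per subject
  fp <- sum(treat & y == 0) / length(y)   # false positives per subject
  tp - fp * pt / (1 - pt)
}

nb_model <- net_benefit(phat, y, 0.2)                  # model-guided strategy
nb_all   <- mean(y) - (1 - mean(y)) * 0.2 / (1 - 0.2)  # 'All' reference line
nb_none  <- 0                                          # 'None' reference line
```

At pt = 0 the model strategy treats everyone, so its net benefit equals the event rate, which is where the model curve and the "All" curve meet on the left of every DCA plot.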


Figure 16 DCA curve. (Curves: simple, complex, All, None; y-axis, net benefit; x-axis, high risk threshold.)

## ----------------------------------------------------------------------------
##  risk       cost:benefit  percent     All      chdfate ~ scl + sbp +    None
##  threshold  ratio         high risk            dbp + age + bmi + sex
##  0          0:1           100         0.3      0.3                      0
##  0.1        1:9           96.157      0.222    0.224                    0
##  0.2        1:4           71.688      0.125    0.147                    0
##  0.3        3:7           47.593      0        0.085                    0
##  0.4        2:3           23.955      -0.167   0.033                    0
##  0.5        1:1           8.659       -0.4     0.007                    0
##  1          Inf:1         0           NA       NA                       NA
## ----------------------------------------------------------------------------
## (the full output prints one row per 0.01 threshold step, each followed by
## its 95% confidence interval; only representative rows are shown here)

summary(complex,measure = 'sNB')

The result has been omitted.

Draw a clinical impact curve.
We use the plot_clinical_impact() function to plot the clinical impact curve of the simple model: use the simple model to predict the risk stratification of 1,000 people, display the cost:benefit axis with 8 scale points, and display the confidence intervals; the result is shown in Figure 17.

plot_clinical_impact(simple, population.size = 1000,
                     cost.benefit.axis = T,
                     n.cost.benefits = 8,
                     col = c('red','blue'),
                     confidence.intervals = T,
                     ylim = c(0,1000),
                     legend.position ="topright")

Figure 17 Clinical impact curve of simple model. (Curves: number high risk; number high risk with event. y-axis, number high risk out of 1,000; x-axis, high risk threshold, with a cost:benefit ratio axis from 1:100 to 100:1.)

We continue to use the plot_clinical_impact() function to plot the clinical impact curve of the complex model: use the complex model to predict the risk stratification of 1,000 people, display the cost:benefit axis with 8 scale points, and display the confidence intervals, as shown in Figure 18.

plot_clinical_impact(complex, population.size = 1000,
                     cost.benefit.axis = T,
                     n.cost.benefits = 8,col = c('red','blue'),
                     confidence.intervals = T,
                     ylim = c(0,1000),
                     legend.position ="topright")

Figure 18 Clinical impact curve of complex model. (Same curves and axes as Figure 17.)

Curve interpretation: the red curve (number high risk) indicates the number of people classified as positive (high risk) by the simple model (Figure 17) or the complex model (Figure 18) at each threshold probability; the blue curve (number high risk with event) is the number of true


positives for each threshold probability.
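The two curves of a clinical impact plot are simple transformations of the predicted-risk distribution: at each threshold, "number high risk" is the population size times P(predicted risk ≥ threshold), and "number high risk with event" is the population size times the true-positive proportion. A base-R sketch on simulated data (variable names are our own, not rmda internals):

```r
set.seed(2)
n <- 5000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + x))
phat <- predict(glm(y ~ x, family = binomial), type = "response")

# Counts per 1,000 screened at a given threshold probability
clinical_impact <- function(phat, y, pt, population = 1000) {
  high_risk <- phat >= pt
  c(n_high_risk       = population * mean(high_risk),
    n_high_risk_event = population * mean(high_risk & y == 1))
}

clinical_impact(phat, y, 0.3)
```

As the threshold rises both counts fall, and the vertical gap between the two curves is the number of false positives per 1,000 screened.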


Survival outcome data

[Case 2]
The Melanoma data frame has data on 205 patients in Denmark with malignant melanoma. This data frame contains the following columns:
(I) Time: survival time in days, possibly censored;
(II) Status: 1 = died from melanoma, 2 = alive, 3 = dead from other causes;
(III) Sex: 1 = male, 0 = female;
(IV) Age: age in years;
(V) Year: year of operation;
(VI) Thickness: tumour thickness in mm;
(VII) Ulcer: 1 = presence, 0 = absence.

R codes and its interpretation
In this case the outcome is survival data, and CRAN currently has no package for DCA analysis of survival data. We can write a custom function or use one already written by other researchers. The stdca.R file in the attachment is the source code of such a function (48), which you can use directly. We first load the source code of stdca.R; the reader can download it from the attachment to this article.

source("stdca.R")

Load the MASS package and call the Melanoma data set.

library(MASS)
data.set <- Melanoma

Define the survival outcome.

data.set$diedcancer = ifelse(data.set$status==1, 1, 0)

Use the stdca() function defined in the previous step to perform decision curve analysis. The stdca() function can be called, for example, as:

stdca(data=data.set, outcome="diedcancer", ttoutcome="time",
      timepoint=545, predictors="thickness", probability=FALSE,
      xstop=.25, intervention=TRUE)

stdca(data=data.set, outcome="diedcancer", ttoutcome="time",
      timepoint=545, predictors="thickness", probability=FALSE, xstop=.25)
## Loading required package: survival
## [1] "thickness converted to a probability with Cox regression. Due
## to linearity and proportional hazards assumption, miscalibration may
## occur."
## $N
## [1] 205
##
## $predictors
##   predictor harm.applied probability
## 1 thickness            0       FALSE
##
## $interventions.avoided.per
## [1] 100
##
## $net.benefit
##    threshold           all none    thickness
## 1       0.01  4.043816e-02    0 0.0404381573
## 2       0.02  3.064671e-02    0 0.0306467100
## 3       0.03  2.065338e-02    0 0.0235375096
## (rows continue in 0.01 steps up to threshold 0.25; omitted here)
##
## $interventions.avoided
##    threshold thickness
## 1       0.01  0.000000
## 2       0.02  0.000000
## 3       0.03  9.325362
## (rows continue up to threshold 0.25, where 79.010880 interventions per
## 100 patients are avoided; intermediate rows omitted)

The principle for interpreting the DCA curve of survival data is similar to that of the DCA curve for binary data (Figure 19).

Brief summary

Decision curve analysis is currently used to evaluate the clinical utility of prediction models. The method for two-category outcomes is relatively mature, but the handling of survival outcomes is still somewhat tricky, and further improvement and updating of the method


is needed. However, readers should understand one truth: DCA is neither the only way to assess the clinical utility of prediction models, nor a perfect one (49). In fact, the approach we use most often was mentioned in the first section: for two-category outcomes, we examine whether the prediction model has good sensitivity and specificity; for survival outcomes, we generally examine whether patients can be divided into good-prognosis and poor-prognosis groups by the prediction model, for example by calculating each subject's nomogram score, splitting patients at a cutoff value into good- and poor-prognosis groups, and then drawing Kaplan-Meier survival curves.

Figure 19 DCA of survival outcome data. (Curves: None, All, thickness; y-axis, net benefit; x-axis, threshold probability, 0.05–0.25.)

Decision curve analysis (DCA) for survival outcomes with R

Background

[Case 1]
The DCA of survival outcome data is summarized in this section. In the previous section we introduced the rmda package to perform DCA for a binary outcome, but the rmda package has no function for survival DCA, and currently no CRAN package handles DCA of survival outcome data. Here we introduce how to perform DCA of survival outcome data based on the source code provided on the MSKCC website (48). The authors are merely knowledge porters: the copyright of the R source code in this paper belongs to the original author, and we own only the copyright of the text annotations and result interpretations accompanying the code.

Case analysis

DCA analysis of multivariate Cox regression

[Case 1]
Read in the data file "dca.txt" under the current working path.

data.set <- read.table("dca.txt", header=TRUE, sep="\t")
attach(data.set)
str(data.set)
## 'data.frame': 750 obs. of 10 variables:
## $ patientid       : int 1 2 3 4 5 6 7 8 9 10 ...
## $ cancer          : int 0 0 0 0 0 0 0 1 0 0 ...
## $ dead            : int 0 0 0 0 0 0 1 0 0 0 ...
## $ ttcancer        : num 3.009 0.249 1.59 3.457 3.329 ...
## $ risk_group      : Factor w/ 3 levels "high","intermediate",..: 3 1 3 3 3 2 2 2 3 2 ...
## $ casecontrol     : int 0 0 0 1 0 1 0 1 0 1 ...
## $ age             : num 64 78.5 64.1 58.5 64 ...
## $ famhistory      : int 0 0 0 0 0 0 0 0 0 0 ...
## $ marker          : num 0.7763 0.2671 0.1696 0.024 0.0709 ...
## $ cancerpredmarker: num 0.0372 0.57891 0.02155 0.00391 0.01879 ...

This is a data frame of survival data with 750 observations of 10 variables:
$ patientid: patient number.
$ cancer: whether cancer occurred; binary variable, 1 = cancer, 0 = no cancer; the dependent variable.
$ dead: dead or not; binary variable, 1 = dead, 0 = alive.
$ ttcancer: time from the start of follow-up to the occurrence of cancer; continuous time variable.
$ risk_group: risk group; ordinal factor variable with levels "low", "intermediate", and "high".
$ casecontrol: grouping variable; binary variable, 1 = "case", 0 = "control".
$ age: age; continuous variable.
$ famhistory: family history; 0 = no, 1 = yes.
$ marker: a biomarker level; continuous variable.
$ cancerpredmarker: model-predicted probability of cancer; continuous variable.

R codes and its interpretation
The source() function loads the source code downloaded from the MSKCC website, which should be downloaded in advance and saved to the current working path.

source("stdca.R")

Subsequently, we can directly use the stdca() function defined in this file for DCA analysis of survival data. The usage of the stdca() function is as follows:

stdca(data, outcome, predictors, timepoint, xstart = 0.01,
      xstop = 0.99, xby = 0.01, ymin = -0.05, probability = NULL,
      harm = NULL, graph = TRUE, intervention = FALSE,
      interventionper = 100, smooth = FALSE, loess.span = 0.10,
      cmprsk = FALSE)
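Conceptually, stdca() converts the model to a predicted probability at the landmark time and then computes a censoring-aware net benefit. The core quantity can be sketched with the Kaplan-Meier estimator from the survival package; this is our own illustration of the principle on simulated data (all names are ours), not the stdca() source code:

```r
library(survival)

set.seed(3)
n <- 1000
marker <- rnorm(n)
etime  <- rexp(n, rate = exp(0.8 * marker) / 4)   # event time driven by marker
ctime  <- rexp(n, rate = 1 / 6)                   # independent censoring time
time   <- pmin(etime, ctime)
event  <- as.integer(etime <= ctime)

# Net benefit at landmark time t of "treat if marker >= cutoff":
#   P(high risk) * [ (1 - S_high(t)) - S_high(t) * pt / (1 - pt) ]
# where S_high(t) is the Kaplan-Meier event-free probability at t within
# the high-risk group, so that censored subjects are handled properly.
nb_survival <- function(marker, time, event, cutoff, t, pt) {
  high <- marker >= cutoff
  km   <- survfit(Surv(time[high], event[high]) ~ 1)
  s_t  <- summary(km, times = t)$surv
  mean(high) * ((1 - s_t) - s_t * pt / (1 - pt))
}

nb_survival(marker, time, event, cutoff = 0.5, t = 2, pt = 0.2)
```

With no censoring this reduces to the binary-outcome net benefit formula, which is why the interpretation of the resulting curve is unchanged.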

© Annals of Translational Medicine. All rights reserved. Ann Transl Med 2019;7(23):796 | http://dx.doi.org/10.21037/atm.2019.08.63
Page 54 of 96 Zhou et al. Clinical prediction models with R

Notes on the stdca() function parameters are as follows:
data: a data frame containing the variables in the model.
outcome: the outcome (response) variable; must be a variable contained in the data frame specified by data=.
predictors: the predictor variable(s); must be contained in the data frame specified by data=.
timepoint: specifies the time point at which the decision curve analysis is performed.
probability: specifies whether or not each of the independent variables is already a probability. The default is TRUE.
xstart: starting value for the x-axis (threshold probability), between 0 and 1. The default is 0.01.
xstop: stopping value for the x-axis (threshold probability), between 0 and 1. The default is 0.99.
xby: increment for the threshold probability. The default is 0.01.
ymin: minimum bound for the graph. The default is -0.05.
harm: specifies the harm(s) associated with the independent variable(s). The default is none.
graph: specifies whether or not to display the graph of net benefits. The default is TRUE.
intervention: plot the net reduction in interventions instead. The default is FALSE.
interventionper: reports the net reduction in interventions per this many patients. The default is 100.
smooth: specifies whether or not to smooth the net benefit curve. The default is FALSE.
loess.span: specifies the degree of smoothing. The default is 0.10.
cmprsk: whether to evaluate the outcome in the presence of a competing risk. The default is FALSE.

For the statistical analysis of the above case, we first need to define a survival object containing the study outcome and the time at which the outcome occurred, namely the "cancer" and "ttcancer" variables of the data frame in this case.

library(survival)
##
## Attaching package: 'survival'
## The following object is masked from 'data.set':
##
##     cancer
Srv = Surv(data.set$ttcancer, data.set$cancer)

Next, we build a Cox regression model named coxmod using the coxph() function in the survival package. The code is as follows:

coxmod <- coxph(Srv ~ age + famhistory + marker, data=data.set)

The fitted coxmod model is then used to calculate the complement of the predicted cancer-free (survival) probability at 1.5 years, i.e., the predicted probability of developing cancer by 1.5 years, as shown below:

data.set$pr_failure18 <- c(1 - (summary(survfit(coxmod,
newdata=data.set), times=1.5)$surv))

This step is necessary: according to the stdca() parameter rules above, only one variable can be passed to "predictors" here, so the model-predicted probability is introduced as a new variable to reflect the predictive power of the entire model. If a single raw predictor were passed in instead, it would represent only that factor's ability to predict the outcome, rather than the predictive ability of the whole model.

Use the stdca() function for DCA analysis of the survival outcome data. The DCA curve of "coxmod" is shown in Figure 20 below.

stdca(data=data.set, outcome="cancer", ttoutcome="ttcancer",
      timepoint=1.5, predictors="pr_failure18", xstop=0.5, smooth=TRUE)
## $N
## [1] 750
##
## $predictors
##      predictor harm.applied probability
## 1 pr_failure18            0        TRUE
##
## $interventions.avoided.per
## [1] 100
##
## $net.benefit
##    threshold          all none pr_failure18 pr_failure18_sm
## 1       0.01 0.2122351058    0   0.21181105      0.21174613
## 2       0.02 0.2041966885    0   0.20361189      0.20385712
## 3       0.03 0.1959925306    0   0.20030295      0.20030295
## (rows continue in 0.01 steps up to threshold 0.50; omitted here)


## $interventions.avoided
##    threshold pr_failure18 pr_failure18_sm
## 1       0.01    -4.198140      -4.7660406
## 2       0.02    -2.865499      -0.7203969
## 3       0.03    13.937020      13.9370196
## (rows continue in 0.01 steps up to threshold 0.50, where 63.564635
## interventions per 100 patients are avoided; intermediate rows omitted)

Figure 20 DCA curve of "coxmod" based on the Cox regression model. (Curves: None, All, pr_failure18; y-axis, net benefit; x-axis, threshold probability, 0–0.5.)

Interpretation of the stdca() call: data = data.set specifies the data set; outcome = "cancer" defines the dichotomous outcome; ttoutcome = "ttcancer" specifies the time variable; timepoint = 1.5 defines the time point as 1.5 years; predictors = "pr_failure18" passes in the prediction probability calculated from the Cox regression model, and because a probability is passed in, probability = TRUE is required, which is also the default setting; only if a single raw factor is used for prediction should it be set to FALSE.

Now let's construct two Cox regression models:

coxmod1 <- coxph(Srv ~ age + famhistory + marker, data=data.set)
coxmod2 <- coxph(Srv ~ age + famhistory + marker + risk_group,
data=data.set)

For each of the two models, the complement of the predicted cancer-free probability at 1.5 years, i.e., the predicted probability of developing cancer by 1.5 years, is calculated with the following code:

data.set$pr_failure19 <- c(1 - (summary(survfit(coxmod1,
newdata=data.set), times=1.5)$surv))
data.set$pr_failure20 <- c(1 - (summary(survfit(coxmod2,
newdata=data.set), times=1.5)$surv))

The stdca() function is then used for DCA analysis of the two Cox regression models. The DCA curves of "coxmod1" and "coxmod2" based on the Cox regression models are shown in Figure 21 below.

stdca(data=data.set, outcome="cancer", ttoutcome="ttcancer",
      timepoint=1.5, predictors=c("pr_failure19","pr_failure20"), xstop=0.5,
      smooth=TRUE)

## $N
## [1] 750
##


## $predictors
##      predictor harm.applied probability
## 1 pr_failure19            0        TRUE
## 2 pr_failure20            0        TRUE
##
## $interventions.avoided.per
## [1] 100
##
## $net.benefit
##    threshold           all none pr_failure19 pr_failure20 pr_failure19_sm
## 1       0.01  0.2122351058    0   0.21181105   0.21195459      0.21174613
## 2       0.02  0.2041966885    0   0.20361189   0.20592854      0.20385712
## 3       0.03  0.1959925306    0   0.20030295   0.20082319      0.20030295
## 4       0.04  0.1876174528    0   0.19999857   0.19793636      0.19999857
## 5       0.05  0.1790660576    0   0.18877249   0.18850393      0.18877249
## 6       0.06  0.1703327178    0   0.18397250   0.18758263      0.18397250
## 7       0.07  0.1614115642    0   0.18451216   0.17871488      0.18451216
## 8       0.08  0.1522964725    0   0.17599183   0.17549498      0.17599183
## 9       0.09  0.1429810491    0   0.17050884   0.17292665      0.17050884
## 10      0.10  0.1334586163    0   0.16958715   0.17172714      0.16958715
## 11      0.11  0.1237221963    0   0.16366952   0.16413986      0.16366952
## 12      0.12  0.1137644940    0   0.16432269   0.16115961      0.16432269
## 13      0.13  0.1035778790    0   0.16149156   0.15997205      0.16149156
## 14      0.14  0.0931543659    0   0.15465439   0.15102760      0.15465439
## 15      0.15  0.0824855938    0   0.14784631   0.14680192      0.14784631
## 16      0.16  0.0715628032    0   0.14257246   0.14533848      0.14257246
## 17      0.17  0.0603768129    0   0.14110378   0.14377813      0.14110378
## 18      0.18  0.0489179935    0   0.13798791   0.14260467      0.13798791
## 19      0.19  0.0371762404    0   0.13763724   0.13073905      0.13763724
## 20      0.20  0.0251409434    0   0.12895335   0.13033339      0.12895335
## 21      0.21  0.0128009553    0   0.12448413   0.13224742      0.12448413
## 22      0.22  0.0001445573    0   0.12833148   0.13027540      0.12833148
## 23      0.23 -0.0128405783    0   0.12638034   0.12984114      0.12638034
## 24      0.24 -0.0261674280    0   0.12723684   0.12158006      0.12723684
## 25      0.25 -0.0398496604    0   0.12215471   0.11550497      0.12215471
## 26      0.26 -0.0539016828    0   0.11416302   0.11224497      0.11416302
## 27      0.27 -0.0683386922    0   0.10999741   0.11336879      0.10999741
## 28      0.28 -0.0831767296    0   0.10939791   0.11138142      0.10939791
## 29      0.29 -0.0984327399    0   0.10822856   0.11042784      0.10822856
## 30      0.30 -0.1141246361    0   0.10585886   0.10713217      0.10585886
## 31      0.31 -0.1302713700    0   0.10300367   0.10922587      0.10300367
## 32      0.32 -0.1468930078    0   0.10181986   0.10710860      0.10181986
## 33      0.33 -0.1640108139    0   0.10224584   0.10514071      0.10224584
## 34      0.34 -0.1816473414    0   0.10113914   0.10168529      0.10113914
## 35      0.35 -0.1998265312    0   0.09905699   0.10084803      0.09905699
## 36      0.36 -0.2185738208    0   0.09633101   0.09572434      0.09633101
## 37      0.37 -0.2379162624    0   0.09534742   0.09287710      0.09534742
## 38      0.38 -0.2578826537    0   0.09187725   0.09547575      0.09187725
## 39      0.39 -0.2785036808    0   0.08964716   0.08909328      0.08964716
## 40      0.40 -0.2998120755    0   0.09002751   0.08873631      0.09002751
## 41      0.41 -0.3218427887    0   0.08197475   0.08693879      0.08197475
## 42      0.42 -0.3446331816    0   0.08169947   0.08850919      0.08169947
## 43      0.43 -0.3682232374    0   0.08143814   0.08923527      0.08143814
## 44      0.44 -0.3926557952    0   0.07927021   0.08132415      0.07927021
## 45      0.45 -0.4179768096    0   0.08356079   0.07934853      0.08356079
## 46      0.46 -0.4442356395    0   0.07803458   0.07668500      0.07803458
## 47      0.47 -0.4714853685    0   0.07986688   0.07507451      0.07986688
## 48      0.48 -0.4997831640    0   0.07858227   0.07639787      0.07858227
## 49      0.49 -0.5291906771    0   0.07768162   0.07665976      0.07751346
## 50      0.50 -0.5597744906    0   0.07587186   0.08208139      0.07591062
##    pr_failure20_sm
## 1       0.21199928
## 2       0.20575976
## 3       0.20082319
## 4       0.19793636
## 5       0.18850393
## 6       0.18758263
## 7       0.17871488
## 8       0.17549498
## 9       0.17292665
## 10      0.17172714
## 11      0.16413986
## 12      0.16115961
## 13      0.15997205
## 14      0.15102760
## 15      0.14680192
## 16      0.14533848
## 17      0.14377813
## 18      0.14260467
## 19      0.13073905
## 20      0.13033339
## 21      0.13224742
## 22      0.13027540
## 23      0.12984114
## 24      0.12158006
## 25      0.11550497
## 26      0.11224497
## 27      0.11336879
## 28      0.11138142
## 29      0.11042784
## 30      0.10713217
## 31      0.10922587
## 32      0.10710860
## 33      0.10514071
## 34      0.10168529
## 35      0.10084803
## 36      0.09572434
## 37      0.09287710
## 38      0.09547575
## 39      0.08909328
## 40      0.08873631
## 41      0.08693879
## 42      0.08850919
## 43      0.08923527
## 44      0.08132415
## 45      0.07934853
## 46      0.07668500
## 47      0.07507451
## 48      0.07639787
## 49      0.07746882
## 50      0.08189491
##
## $interventions.avoided
##    threshold pr_failure19 pr_failure20 pr_failure19_sm pr_failure20_sm
## 1       0.01    -4.198140    -2.777062      -4.7660406       -2.565550
## 2       0.02    -2.865499     8.486095      -0.7203969        7.687163
## 3       0.03    13.937020    15.619119      13.9370196       15.619119
## 4       0.04    29.714687    24.765386      29.7146874       24.765386
## 5       0.05    18.442215    17.931950      18.4422153       17.931950
## 6       0.06    21.368990    27.024866      21.3689903       27.024866
## 7       0.07    30.690796    22.988695      30.6907960       22.988695
## 8       0.08    27.249660    26.678283      27.2496595       26.678283
## 9       0.09    27.833653    30.278333      27.8336527       30.278333
## 10      0.10    32.515680    34.441667      32.5156803       34.441667
## 11      0.11    32.321013    32.701568      32.3210130       32.701568
## 12      0.12    37.076008    34.756418      37.0760083       34.756418
## 13      0.13    38.757620    37.740714      38.7576205       37.740714
## 14      0.14    37.778585    35.550699      37.7785849       35.550699
## 15      0.15    37.037739    36.445916      37.0377387       36.445916
## 16      0.16    37.280070    38.732229      37.2800702       38.732229
## 17      0.17    39.413756    40.719466      39.4137560       40.719466
## 18      0.18    40.576297    42.679487      40.5762971       42.679487
## 19      0.19    42.828111    39.887305      42.8281111       39.887305
## 20      0.20    41.524962    42.076980      41.5249619       42.076980
## 21      0.21    42.014147    44.934621      42.0141471       44.934621
## 22      0.22    45.448092    46.137300      45.4480917       46.137300
## 23      0.23    46.608741    47.767357      46.6087406       47.767357
## 24      0.24    48.578018    46.786704      48.5780178       46.786704


## 25      0.25    48.601310    46.606390      48.6013099       46.606390
## 26      0.26    47.833800    47.287894      47.8337999       47.287894
## 27      0.27    48.216799    49.128319      48.2167988       49.128319
## 28      0.28    49.519193    50.029238      49.5191931       50.029238
## 29      0.29    50.596386    51.134831      50.5963863       51.134831
## 30      0.30    51.329482    51.626588      51.3294816       51.626588
## 31      0.31    51.922508    53.307450      51.9225082       53.307450
## 32      0.32    52.851484    53.975342      52.8514842       53.975342
## 33      0.33    54.058169    54.645916      54.0581691       54.645916
## 34      0.34    54.893846    54.999864      54.8938461       54.999864
## 35      0.35    55.506940    55.839561      55.5069400       55.839561
## 36      0.36    55.983082    55.875228      55.9830819       55.875228
## 37      0.37    56.744897    56.324276      56.7448975       56.324276
## 38      0.38    57.066090    57.653214      57.0660899       57.653214
## 39      0.39    57.582567    57.495935      57.5825669       57.495935
## 40      0.40    58.475938    58.282258      58.4759382       58.282258
## 41      0.41    58.110328    58.824666      58.1103281       58.824666
## 42      0.42    58.874509    59.814900      58.8745086       59.814900
## 43      0.43    59.606275    60.639848      59.6062750       60.639848
## 44      0.44    60.063309    60.324720      60.0633094       60.324720
## 45      0.45    61.299040    60.784209      61.2990397       60.784209
## 46      0.46    61.309982    61.151553      61.3099824       61.151553
## 47      0.47    62.173764    61.633348      62.1737642       61.633348
## 48      0.48    62.656255    62.419612      62.6562552       62.419612
## 49      0.49    63.164259    63.057903      63.1469446       63.140791
## 50      0.50    63.564635    64.185588      63.5686257       64.166483

Figure 21 DCA curves of "coxmod1" and "coxmod2" based on the two Cox regression models (net benefit against threshold probability, 0–0.5; reference curves "None" and "All").

Univariate Cox regression and DCA analysis
[Case 2]
We use the built-in Melanoma dataset in the MASS package. This is a data frame with 7 variables and 205 observations in total:
$ Time: time variable, continuous variable.
$ Status: outcome variable, 1 = death from melanoma, 2 = alive, 3 = death from other causes.
$ Sex: sex variable, 1 = male, 0 = female, binary variable.
$ Age: age, continuous variable.
$ Year: time point of surgery, continuous variable.
$ Thickness: tumor thickness, unit: mm, continuous variable.
$ Ulcer: whether the tumor has an ulcer or not, 1 represents the presence of ulcer, 0 represents the absence of ulcer, dichotomous variable.

R codes and their interpretation
The stdca() function is used for DCA analysis of the univariate Cox regression model. The DCA curves of the single predictor "thickness" based on the Cox regression model are shown in Figures 22 and 23.

Figure 22 DCA curve of a single predictor "thickness" based on univariate Cox regression model (net benefit against threshold probability, 0.05–0.25).

Figure 23 DCA curve of a single predictor "thickness" based on univariate Cox regression model. The Y axis represents the net reduction in interventions per 100 persons.

source("stdca.R")
library(MASS)
data.set <- Melanoma
data.set$diedcancer = ifelse(data.set$status==1, 1, 0)

## Decision Curve Analysis
stdca(data=data.set, outcome="diedcancer", ttoutcome="time", timepoint=545,
      predictors="thickness", probability=FALSE, xstop=.25)

## [1] "thickness converted to a probability with Cox regression. Due to linearity and proportional hazards assumption, miscalibration may occur."
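The conversion reported in this message can be understood with the same survfit() pattern used for "coxmod1" and "coxmod2" above: when probability=FALSE, the predictor is first turned into an event probability at the chosen time point through a univariate Cox model. A minimal sketch of the idea (our own illustration, not the actual stdca.R source; the object names cox.thick and p.thick are ours):

```r
library(survival)

# Fit a univariate Cox model for the single predictor "thickness"
cox.thick <- coxph(Surv(time, diedcancer) ~ thickness, data = data.set)

# Event probability at day 545 = 1 - predicted survival probability
data.set$p.thick <- c(1 - summary(survfit(cox.thick, newdata = data.set),
                                  times = 545)$surv)
```

The linearity and proportional-hazards caveat in the message refers to this intermediate Cox model, not to the decision curve itself.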


## $N
## [1] 205
##
## $predictors
##   predictor harm.applied probability
## 1 thickness            0       FALSE
##
## $interventions.avoided.per
## [1] 100
##
## $net.benefit
##    threshold           all none     thickness
## 1       0.01  4.043816e-02    0  0.0404381573
## 2       0.02  3.064671e-02    0  0.0306467100
## 3       0.03  2.065338e-02    0  0.0235375096
## 4       0.04  1.045185e-02    0  0.0350104433
## 5       0.05  3.555344e-05    0  0.0341713892
## 6       0.06 -1.060237e-02    0  0.0170264274
## 7       0.07 -2.146906e-02    0  0.0076078405
## 8       0.08 -3.257198e-02    0  0.0074231177
## 9       0.09 -4.391893e-02    0  0.0034843206
## 10      0.10 -5.551803e-02    0  0.0048780488
## 11      0.11 -6.737778e-02    0  0.0055357632
## 12      0.12 -7.950707e-02    0  0.0050997783
## 13      0.13 -9.191520e-02    0  0.0053826745
## 14      0.14 -1.046119e-01    0  0.0049914918
## 15      0.15 -1.176073e-01    0  0.0045911047
## 16      0.16 -1.309122e-01    0  0.0041811847
## 17      0.17 -1.445376e-01    0  0.0037613870
## 18      0.18 -1.584954e-01    0 -0.0015466984
## 19      0.19 -1.727978e-01    0  0.0003011141
## 20      0.20 -1.874578e-01    0 -0.0036585366
## 21      0.21 -2.024889e-01    0 -0.0038900895
## 22      0.22 -2.179054e-01    0 -0.0041275797
## 23      0.23 -2.337224e-01    0 -0.0029141590
## 24      0.24 -2.499556e-01    0 -0.0030808729
## 25      0.25 -2.666216e-01    0 -0.0032520325
##
## $interventions.avoided
##    threshold thickness
## 1       0.01  0.000000
## 2       0.02  0.000000
## 3       0.03  9.325362
## 4       0.04 58.940624
## 5       0.05 64.858088
## 6       0.06 43.285110
## 7       0.07 38.630737
## 8       0.08 45.994366
## 9       0.09 47.929951
## 10      0.10 54.356468
## 11      0.11 58.993685
## 12      0.12 62.045024
## 13      0.13 65.114732
## 14      0.14 67.327791
## 15      0.15 69.245776
## 16      0.16 70.924012
## 17      0.17 72.404809
## 18      0.18 71.498851
## 19      0.19 73.794804
## 20      0.20 73.519697
## 21      0.21 74.710978
## 22      0.22 75.793960
## 23      0.23 77.270575
## 24      0.24 78.176984
## 25      0.25 79.010880

stdca(data=data.set, outcome="diedcancer", ttoutcome="time", timepoint=545,
      predictors="thickness", probability=FALSE, xstop=.25, intervention=TRUE)

## Warning: 'newdata' had 1 row but variables found have 205 rows
## [1] "thickness converted to a probability with Cox regression. Due to linearity and proportional hazards assumption, miscalibration may occur."

The remaining printed output of this second call ($N, $predictors, $net.benefit and $interventions.avoided) is identical to that of the previous call and is omitted here.
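The quantities printed above follow directly from the standard decision-curve definitions: at a threshold probability pt, net benefit = TP/n − FP/n × pt/(1 − pt), and the net reduction in interventions per 100 patients is (NB_model − NB_all)/(pt/(1 − pt)) × 100. A minimal sketch for a binary outcome (the function names are ours, not part of stdca.R; stdca() itself uses Kaplan-Meier estimates for survival data):

```r
# Net benefit of a prediction rule at threshold pt for a binary outcome
net_benefit <- function(pred, event, pt) {
  n  <- length(event)
  tp <- sum(pred >= pt & event == 1)  # true positives at this threshold
  fp <- sum(pred >= pt & event == 0)  # false positives at this threshold
  tp/n - fp/n * pt/(1 - pt)
}

# Net reduction in interventions per 100 patients, relative to "treat all"
interventions_avoided <- function(nb_model, nb_all, pt) {
  (nb_model - nb_all) / (pt / (1 - pt)) * 100
}

# Check against row 5 of the tables above (pt = 0.05):
interventions_avoided(0.0341713892, 3.555344e-05, 0.05)  # 64.858088
```

This reproduces the printed $interventions.avoided value from the corresponding $net.benefit row.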


Note that we use a single variable to predict the outcome, so probability = FALSE; the other parameter settings are basically the same as in Case 1.

External validation of Logistic regression model with R

Background
Logistic regression can be used to establish a clinical prediction model for dichotomous outcome variables and to predict binary clinical events, such as effective/ineffective, occurrence/non-occurrence, recurrence/non-recurrence and so on. Prediction models differ in quality. A good model not only accurately predicts the probability of the endpoint event, which means good calibration, but also distinguishes the subjects with different probabilities of the endpoint event in the data set, which means good discrimination. It can also identify the possible influence of certain factors on the endpoint event, including independent risk factors or protective factors. Therefore, how to judge and validate the model is particularly important. The evaluation indexes of the model have been mentioned before. This section mainly introduces the external validation of the Logistic regression model.

Out-of-time validation can be adopted for model validation. For example, we can use the 2005–2010 samples for modeling and the 2010–2015 samples for validation; in this way, we can evaluate whether the prediction of the model remains accurate over time. Validation across modeling techniques can also be adopted: for a certain data set, to select the best model in the testing data, not only logistic regression but also discriminant analysis, decision trees, support vector machines (SVM) and other methods can be used. Regardless of which method we use, external validation of the model with a data set different from the one used in modeling is an important part of the process.

When modeling, the samples extracted from the sample data set for modeling are called the training set, and the samples reserved from the sample data set for internal validation are called the testing set. A model that performs well in a single data set does not necessarily perform satisfactorily in other data sets, so external validation of the model in a new data set, called the validation set, is required.

Calibration evaluation
The hoslem.test() function in the ResourceSelection package is used to perform the Hosmer-Lemeshow goodness of fit test, which is usually used to evaluate the calibration of a Logistic model (50). We should load the required R package first:

install.packages("ResourceSelection")
library(ResourceSelection)

Logistic regression model construction
To establish the Logistic regression model, we simulate a data set, namely the training set, in which all samples are used for modeling. We could also extract part of the samples for modeling and use the rest, namely the testing set, for internal validation of the model.

set.seed(123)
n <- 100
x <- rnorm(n)
xb <- x
pr <- exp(xb)/(1+exp(xb)) # generate a probability from 0 to 1 by logistic regression
y <- 1*(runif(n) < pr) # determine from the probability "pr" whether each patient had the event; the greater pr is, the greater the probability that y=1, i.e., that the endpoint event occurred for the sample
intern.data <- data.frame(x=x, y=y)
mod <- glm(y~x, intern.data, family=binomial) # generate the model "mod"

Carry out the Hosmer-Lemeshow test on the model, dividing the data into a certain number of groups "g". The meaning of the parameter "g" was explained together with the concept of calibration in the previous section. When we predict the outcome probabilities of 100 people, the model does not directly tell us disease/no disease; it only gives the probability of disease, and we diagnose disease/no disease according to whether the probability exceeds a certain cut-off value, such as 0.5. If there are 100 people in the data set, we finally get 100 probabilities from the model, namely 100 numbers between 0 and 1.0. We rank these numbers from smallest to largest and divide them into 10 groups of 10 people. The actual probability of each group is the percentage of its 10 people who got sick, and the predicted probability is the average of the 10 predicted probabilities in that group. Comparing these two numbers, taking one as the abscissa and the other as the ordinate, we obtain the calibration plot, for which a 95% confidence interval can also be calculated. In a Logistic regression model, calibration can also be measured by the Hosmer-Lemeshow goodness of fit test.
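The grouping logic just described can be reproduced by hand on the training data (a sketch that assumes the objects mod and intern.data defined in this section):

```r
p   <- fitted(mod)                                # 100 predicted probabilities
grp <- cut(rank(p), breaks = 10, labels = FALSE)  # 10 equal groups, ranked smallest to largest
pred_mean <- tapply(p, grp, mean)                 # mean predicted probability per group
obs_rate  <- tapply(intern.data$y, grp, mean)     # observed event rate per group
plot(pred_mean, obs_rate, xlab = "Predicted risk", ylab = "Observed risk")
abline(0, 1)  # points near this diagonal indicate good calibration
```

The ten (pred_mean, obs_rate) pairs are exactly the points of the calibration plot described above.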


hl <- hoslem.test(mod$y, fitted(mod), g=10)
hl

##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: mod$y, fitted(mod)
## X-squared = 6.4551, df = 8, P value = 0.5964

In the function hoslem.test(), the first parameter is whether the target event actually occurred, the second parameter is the event probability predicted by the model, and the third parameter is the grouping parameter g. The function then calculates the goodness-of-fit statistic of the Hosmer-Lemeshow test. If the model fits well, this statistic should follow the Chi-square distribution with g-2 degrees of freedom; if the P value of the hypothesis test is less than the test level α, the model does not fit well. Here the P value is 0.5964, so it cannot be concluded that the model fits poorly.

cbind(hl$observed, hl$expected)

##               y0 y1    yhat0    yhat1
## [0.111,0.298]  7  3 7.692708 2.307292
## (0.298,0.396]  8  2 6.491825 3.508175
## (0.396,0.454]  5  5 5.764301 4.235699
## (0.454,0.494]  6  4 5.243437 4.756563
## (0.494,0.564]  7  3 4.739571 5.260429
## (0.564,0.624]  4  6 4.077834 5.922166
## (0.624,0.669]  2  8 3.532070 6.467930
## (0.669,0.744]  2  8 2.910314 7.089686
## (0.744,0.809]  1  9 2.213029 7.786971
## (0.809,0.914]  2  8 1.334912 8.665088

This generates the Hosmer-Lemeshow contingency table. Here, y0 is the observed number of subjects without the event and y1 the observed number with the event in each group, while yhat0 and yhat1 are the corresponding expected numbers calculated from the probabilities predicted by the model.

Validation in external data set
Simulate an external data set in the same way, and conduct external validation of the model:

set.seed(123)
n.e <- 150 # ".e" represents external
x.e <- rnorm(n.e) # the independent variable x
xb.e <- x.e
pr.e <- exp(xb.e)/(1+exp(xb.e)) # pr.e is the event probability simulated by logistic regression
y.e <- 1*(runif(n.e) < pr.e) # y represents the actual occurrence of the event: 0 = no occurrence, 1 = occurrence
exter.data <- data.frame(x=x.e, y=y.e) # the simulated external data set

# Predict the event probability in the external data with the model built from the internal data;
# the first argument is the GLM model object, the second is the data set to validate
pr.e <- predict(mod, exter.data, type=c("response"))
hl.e <- hoslem.test(y.e, pr.e, g=10)
hl.e

##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: y.e, pr.e
## X-squared = 10.313, df = 8, P value = 0.2437

P=0.2437>0.05, so it cannot be concluded that the model fits poorly, suggesting that the model performs well in the new data set. If P<0.05, the model fit would be considered poor.

Hosmer-Lemeshow test
The Hosmer-Lemeshow test statistic is calculated as follows: compute the prediction probabilities of the model and divide the data into ten groups according to the predicted probability, which corresponds to the parameter g in the function above.

pihat <- mod$fitted
pihatcat <- cut(pihat, breaks=c(0, quantile(pihat, probs=seq(0.1,0.9,0.1)), 1), labels=FALSE)

Then calculate the components of the Hosmer-Lemeshow test statistic.

meanprobs <- array(0, dim=c(10,2)) # a 10x2 matrix to save the average predicted probabilities of the event occurring and not occurring
expevents <- array(0, dim=c(10,2)) # a 10x2 matrix to save the expected numbers of occurrence and non-occurrence calculated from the probabilities
obsevents <- array(0, dim=c(10,2)) # a 10x2 matrix to save the numbers of events that actually occurred and did not occur

for (i in 1:10) {
  meanprobs[i,1] <- mean(pihat[pihatcat==i]) # average event probability in each group
  expevents[i,1] <- sum(pihatcat==i)*meanprobs[i,1] # expected number of events = group size * average probability
  obsevents[i,1] <- sum(y[pihatcat==i]) # actual number of events
  # the same method gives the expected and actual numbers of non-events
  meanprobs[i,2] <- mean(1-pihat[pihatcat==i])
  expevents[i,2] <- sum(pihatcat==i)*meanprobs[i,2]
  obsevents[i,2] <- sum(1-y[pihatcat==i])
}

Calculate the Hosmer-Lemeshow test statistic. It has been proved that if the model fits well, the statistic


should obey the chi-square distribution with g-2 degrees of freedom. The null hypothesis is that the Hosmer-Lemeshow test statistic obeys the chi-square distribution with g-2 degrees of freedom; the alternative hypothesis is that it does not. The significance level is 0.05.

hosmerlemeshow <- sum((obsevents-expevents)^2 / expevents)
hosmerlemeshow

## [1] 6.455077

The calculated statistic is consistent with the result of the hoslem.test() function above.

Calibration plot
Use the calibration curve (Figure 24) to evaluate the model visually. We use the plotCalibration() function in the PredictABEL package to draw the calibration plot (51).

install.packages("PredictABEL")
library(PredictABEL)

## Warning: package 'Hmisc' was built under R version 3.5.2

# Use plotCalibration to draw the calibration curve
# Parameter description: data is the data set to be validated
# cOutcome specifies which column the outcome variable is in
# predRisk is the probability of occurrence predicted by the model; it can be calculated with the predict() function
# groups is the number of groups
# rangeaxis is the range of the axes
# Draw the calibration plot in the external data now
plotCalibration(data=exter.data,
                cOutcome=2,
                predRisk=pr.e,
                groups=10,
                rangeaxis=c(0,1))

## $Table_HLtest
##               total meanpred meanobs predicted observed
## [0.111,0.259)    15    0.199   0.333      2.99        5
## [0.259,0.372)    15    0.306   0.333      4.59        5
## [0.372,0.426)    15    0.393   0.333      5.90        5
## [0.426,0.477)    15    0.449   0.267      6.73        4
## [0.477,0.536)    15    0.499   0.400      7.49        6
## [0.536,0.593)    15    0.562   0.467      8.44        7
## [0.593,0.655)    15    0.624   0.533      9.36        8
## [0.655,0.725)    15    0.685   0.733     10.28       11
## [0.725,0.808)    15    0.764   0.600     11.46        9
## [0.808,0.914]    15    0.866   0.733     12.99       11
##
## $Chi_square
## [1] 10.329
##
## $df
## [1] 8
##
## $p_value
## [1] 0.2427

Figure 24 Calibration plot (observed risk against predicted risk, both from 0 to 1).

The horizontal axis of the calibration plot is the predicted risk, and the vertical axis is the observed risk. Each point represents a group, and the basic idea is consistent with that of the Hosmer-Lemeshow test.

It is not enough to evaluate only the calibration of the model. Suppose the short-term mortality rate after a particular surgery is 0.1%, and a poor model predicts a mortality rate of 0.1% for all patients regardless of their health status, smoking status or diabetes status; suppose also that exactly one of 1,000 patients died after surgery. The model's prediction is then consistent with the facts (both short-term mortality rates are 0.1%), meaning the model is well calibrated, but it does not differentiate the patients with a high risk of death from those with a low risk, which means its discrimination is insufficient. That is why, in the further external validation of the model, we also need to evaluate its discrimination, i.e., its ability to differentiate patients with a high risk of death from patients with a low risk.

Discrimination evaluation

ROC curve
Use the Logistic regression model fitted above to draw the ROC curve with the pROC package (52) (Figure 25).


# Extract the probability values predicted by the model
pr <- predict(mod, type=c("response"))

install.packages("pROC")
library(pROC)

## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
##     cov, smooth, var

# Use the roc() function: the left side of the formula is the actual event occurrence,
# and the right side is the event probability predicted by the model
roccurve <- roc(y ~ pr)
# Draw the ROC curve
plot.roc(roccurve, xlim = c(1,0), ylim = c(0,1))
# Get the AUC value of the ROC curve
auc(roccurve)

## Area under the curve: 0.7455

Figure 25 ROC curve (sensitivity against specificity).

In intern.data, the AUC value of the model is 0.7455.

ROC curve in external dataset
Then, verify the model's discrimination in the external dataset and draw the ROC curve (Figure 26).

pr.e <- predict(mod, newdata = exter.data, type=c("response"))
roccurve <- roc(y.e ~ pr.e)
# Draw the ROC curve
plot.roc(roccurve)
# Get the AUC value of the ROC curve
auc(roccurve)

## Area under the curve: 0.6723

Figure 26 ROC curve in the validation set (sensitivity against specificity).

The model was verified in the external data set with AUC = 0.6723, indicating that the model retains a certain degree of discrimination in the validation of the external data set.

Summary and discussion
We have summarized the methods of external validation for the Logistic regression model above, including calibration evaluation and discrimination evaluation. A good prediction model should be robust enough that, whether in the training set, the internal validation set or the external validation set, it shows good discrimination and calibration. A good performance in the training set does not necessarily mean a good performance in the validation set. In addition, good discrimination does not necessarily mean good calibration, and vice versa.

The evaluation of Cox regression model based on pec package with R

Background
According to the logistic regression equation, the rate of outcome occurrence can be predicted from the patient's independent variables. In Cox regression, many scholars may be confused about how to use an established Cox model of survival outcome data to predict the survival probability of an individual patient. The functions cph() in the rms package, coxph() in the survival package and survfit() in the survival package are synthesized by the function predictSurvProb() in the R pec package, which can calculate an individual patient's survival probability (53).

The usage of predictSurvProb() in the R pec package is predictSurvProb(object, newdata, times, ...). Here, object is a well-fitted model produced by survival::coxph() or rms::cph(); newdata is a data set of data.frame style, in which the rows represent observations and the columns represent the variables to be used in the prediction model; and times is a


vector that contains the points to be predicted. ## 120 0.9121396 0.7429388 0.6322600
## 121 0.8366330 0.5619540 0.4109773
## 122 0.8577605 0.6091122 0.4653860
Case analysis
The excellence of this package is that it can calculate the
We used the clinical data of 232 patients with renal clear survival probability rate of each patient in the validation
cell carcinoma downloaded from TCGA database (https:// set based on the prediction model built by the training set.
portal.gdc.cancer.gov/) for practical operation. There were Moreover, it can also calculate the C-index in the validation
8 variables in the dataset. Among the eight variables, death was the outcome variable (dependent variable) and OS was the length of survival time; age, gender, TNM, SSIGN and Fuhrman were the independent variables.

R code and its interpretation

The pec package and the other necessary auxiliary packages should be loaded first. Then the data in Case_in_TCGA.csv are read in. The R code is shown as follows.

library(dplyr)
library(rms)
library(survival)
library(pec)
data <- read.csv("Case_in_TCGA.csv")

Two hundred and thirty-two patients were randomly divided into a training set and a validation set, in which data.1 was the training set and data.2 was the validation set.

set.seed(1450)
x <- nrow(data) %>% runif()
data <- transform(data, sample=order(x)) %>% arrange(sample)
data.1 <- data[1:(nrow(data)/2),]
data.2 <- data[((nrow(data)/2)+1):nrow(data),]

Then two Cox regression models were fitted with the 116 cases of the training set data.1.

cox1 <- cph(Surv(os,death)~age+gender+tnm+fuhrman+ssign, data=data.1, surv=TRUE)
cox2 <- cph(Surv(os,death)~tnm+ssign, data=data.1, surv=TRUE)

The survival time points at which to predict the survival rate need to be set. In this case, the survival probability of each patient in the validation set at the first, third and fifth year is predicted according to the model built on the training set.

t <- c(1,3,5)
survprob <- predictSurvProb(cox1, newdata=data.2, times=t)
head(survprob)
##             1         3         5
## 117 0.9128363 0.7447740 0.6346714
## 118 0.8907077 0.6879995 0.5615877
## 119 0.9077517 0.7314529 0.6172421

The C-index, namely the Concordance Index, is mainly used to reflect the discrimination ability and accuracy of a prediction model. Its definition is simple: the number of concordant pairs divided by the number of usable pairs. Randomly pairing N subjects yields N*(N-1)/2 pairs, so if the sample size is large, the calculation cannot be completed without the assistance of computer software. First, we find the number of concordant pairs, the numerator. What is a concordant pair? Taking Cox regression for survival data as an example, if the actual survival time is long and the predicted survival rate is also high, or if the predicted survival rate is low when the actual survival time is short, we conclude that the prediction is concordant with the actual result; otherwise, it is discordant. Then the number of usable pairs, the denominator, is found. What is a usable pair? Again taking Cox regression as an example, a usable pair requires that at least one of the two paired individuals reaches the target endpoint; that is, if neither paired patient shows an endpoint event during the whole observation period, the pair cannot be included in the denominator. In addition, two other situations need to be excluded:
(I) One of the paired subjects reaches an endpoint, while the other cannot be observed to reach it because of loss to follow-up;
(II) Both members of the pair die at the same time.
Now that the denominator has been identified, how do we get the numerator? In fact, the cindex() function in the pec package calculates the C-index of a prediction model, and it can also compare the discrimination of different regression modeling strategies on the same data via cross-validation. The discrimination of the models can be verified by the C-index in data.2.

c_index <- cindex(list("Cox(5 variables)"=cox1, "Cox(2 variables)"=cox2),
                  formula=Surv(os,death)~age+gender+tnm+fuhrman+ssign,
                  data=data.2,
                  eval.times=seq(1,5,0.1))
plot(c_index, xlim = c(0,5))
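The pair-counting definition above can be sketched language-agnostically. Below is a toy version in Python (an illustration added here, not the original text's code or the pec implementation), ignoring refinements that real packages apply, such as IPCW weighting in pec's cindex():

```python
from itertools import combinations

def c_index(time, event, risk):
    """Harrell's C: concordant pairs / usable pairs.

    time  : observed follow-up time per subject
    event : 1 if the endpoint was observed, 0 if censored
    risk  : predicted risk score (higher score = shorter predicted survival)
    """
    concordant = usable = 0.0
    for i, j in combinations(range(len(time)), 2):
        # order the pair so that subject i has the shorter observed time
        if time[j] < time[i] or (time[i] == time[j] and event[j] and not event[i]):
            i, j = j, i
        if not event[i]:
            continue  # shorter time is censored: pair not usable
        if time[i] == time[j] and event[j]:
            continue  # both reach the endpoint at the same time: excluded
        usable += 1
        if risk[i] > risk[j]:
            concordant += 1      # shorter survival got the higher predicted risk
        elif risk[i] == risk[j]:
            concordant += 0.5    # tied predictions count as half
    return concordant / usable

# a perfectly concordant toy example
print(c_index([1, 2, 3, 4], [1, 1, 1, 1], [4, 3, 2, 1]))  # 1.0
```

A pair whose shorter observed time is censored never enters the denominator, matching exclusion (I) above; simultaneous endpoint events match exclusion (II).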

© Annals of Translational Medicine. All rights reserved. Ann Transl Med 2019;7(23):796 | http://dx.doi.org/10.21037/atm.2019.08.63
Page 64 of 96 Zhou et al. Clinical prediction models with R

Figure 27 The discrimination index of Cox (2 variables) compared with Cox (5 variables) without cross-validation. (Axes: Time (year) vs. Concordance index.)

Figure 28 The discriminability of Cox (2 variables) compared with Cox (5 variables) with cross-validation. (Axes: Time (year) vs. Concordance index.)

The figure above (Figure 27) shows that the model Cox (2 variables) is better than Cox (5 variables) in terms of discrimination. However, this result has not been cross-validated and may be unstable. Therefore, we can further compare the discrimination of the two regression models by cross-validation.

The model discrimination was verified in data.2, and the cross-validation was performed using the bootstrap re-sampling method.

c_index <- cindex(list("Cox(5 variables)"=cox1, "Cox(2 variables)"=cox2),
                  formula=Surv(os,death)~fuhrman+ssign,
                  data=data,
                  eval.times=seq(1,5,0.1),
                  splitMethod="bootcv",
                  B=1000)
## Warning: executing %dopar% sequentially: no parallel backend registered
## 100
## 200
## 300
## 400
## 500
## 600
## 700
## 800
## 900
## 1000
plot(c_index, xlim = c(0,5))

The figure above (Figure 28) shows that the discrimination of the two Cox regression models is similar. However, the model with only two variables is obviously more convenient than the model with five variables, and the simpler model is therefore preferable.

The pec package can not only calculate the survival rate of each subject and the C-index in the validation set; it can also draw a calibration plot in the validation set, which cannot be done with the rms package. Calibration refers to the consistency between the actual outcome rate and the predicted rate. For example, when we need to predict the prevalence of a disease in 100 subjects, the model cannot directly output disease/no disease. In fact, the model only gives the probability of disease for each subject, and we judge disease/no disease according to whether the probability exceeds a certain cut-off value (such as 0.5). Thus, 100 numbers between 0 and 1 are obtained from the model. We then divide these 100 subjects into ten groups. For each group, the actual probability is the percentage of subjects in that group who actually suffered from the disease, and the predicted probability is the average of the predicted probabilities in that group. Finally, we plot predicted probability against actual probability, one as the abscissa and the other as the ordinate, to obtain the calibration plot; the 95% confidence interval of the calibration plot can also be calculated.

We used the calPlot() function in the pec package to demonstrate the calibration plot in the validation set (Figure 29).

calPlot(list("Cox(5 variables)"=cox1, "Cox(2 variables)"=cox2),
        time=3, # Set the time point you want to observe
        data=data.2)

Similarly, we can use the bootstrap method to re-sample the 232 cases and perform cross-validation for the model (Figure 30).

calPlot(list("Cox(5 variables)"=cox1, "Cox(2 variables)"=cox2),
        time=3, # Set the time point you want to observe
        data=data,
        splitMethod = "BootCv",
        B=1000)

Brief summary

Above we summarized the methods for external validation of the Cox regression model, including the discrimination index

Annals of Translational Medicine, Vol 7, No 23 December 2019 Page 65 of 96

and calibration evaluation. Because the outcome of the Cox regression model includes time, the calculation method is slightly different from that of Logistic regression. However, whether it is a Cox or a Logistic regression model, a good prediction model should be robust enough to show good discrimination and calibration in the training set, the internal validation set and the external validation set alike. A good performance in the training set does not necessarily mean a good performance in the validation set. In addition, good discrimination does not necessarily mean good calibration, and vice versa.

Figure 29 The Calibration Plot performed by pec package. (Axes: Predicted survival probability (%) vs. Observed survival frequencies (%).)

Figure 30 The Calibration Plot performed by pec package with cross-validation. (Axes: Predicted survival probability (%) vs. Observed survival frequencies (%).)

Fine-Gray test and competing risk model with R

Background

When observing whether an event occurs, the event may be obstructed by other events: in such so-called competing risk research there can be multiple outcome events, and some outcomes prevent the occurrence of the event of interest or affect the probability of its occurrence. All the outcome events form a competitive relationship and are mutually competing risk events.

For example, some researchers collected the clinical data of 518 elderly patients diagnosed with mild cognitive impairment (MCI) in this city in 2007, including basic demographic characteristics, lifestyle, physical examination and comorbidity information, and completed six follow-up surveys in 2010–2013; the main outcome was whether Alzheimer's disease (AD) occurred. During the follow-up period, a total of 78 cases of AD occurred, along with 28 cases of relocation, 31 cases of withdrawal and 25 deaths. What are the factors that affect the transition from MCI to AD?

In this case, if an MCI patient dies of cancer, cardiovascular disease, a car accident or another cause during the observation period without developing AD, he can no longer contribute to the onset of AD; that is, the endpoint of death "competes" with the occurrence of AD. In traditional survival analysis, individuals who die before developing AD, individuals lost to follow-up and individuals who do not develop AD are all treated as censored data, which may lead to bias (54). For an elderly population with a high mortality rate, when there are competing risk events, the traditional survival analysis methods (Kaplan-Meier method, Log-rank test, Cox proportional hazards regression model) will overestimate the risk of the disease of interest, thus leading to competing risk bias. Some studies have found that about 46% of the literature may have such bias (54).

In this case, the competing risk model is appropriate. The so-called competing risk model is an analytical method for processing survival data with multiple potential outcomes. As early as 1999, Fine and Gray proposed the semi-parametric proportional subdistribution hazards model, and the commonly used endpoint index is the cumulative incidence function (CIF) (55,56). In this example, death before AD can be taken as a competing risk event for AD, and the competing risk model is adopted for statistical analysis. Univariate analysis of competing risks is often used to estimate the incidence of the endpoint events of interest, and multivariate analysis is often used to explore prognostic factors and effect sizes.

Case analysis

[Case 1]
This case data was downloaded from http://www.stat.unipg.it/luca/R/. Researchers planned to investigate the curative effect of bone marrow transplantation compared with blood transplantation for the treatment of leukemia,


with the endpoint event defined as "recurrence". Unfortunately, some patients died of adverse reactions after transplantation, so the endpoint "recurrence" cannot be observed in those patients. In other words, "transplant-related death" and "recurrence" are competing risk events. Therefore, the competing risk model is adopted for statistical analysis (57,58).

Firstly, import the data file 'bmtcrr.csv' from the current working path.

library(foreign)
bmt <- read.csv('bmtcrr.csv')
str(bmt)
## 'data.frame': 177 obs. of 7 variables:
## $ Sex : Factor w/ 2 levels "F","M": 2 1 2 1 1 2 2 1 2 1 ...
## $ D : Factor w/ 2 levels "ALL","AML": 1 2 1 1 1 1 1 1 1 1 ...
## $ Phase : Factor w/ 4 levels "CR1","CR2","CR3",..: 4 2 3 2 2 4 1 1 1 4 ...
## $ Age : int 48 23 7 26 36 17 7 17 26 8 ...
## $ Status: int 2 1 0 2 2 2 0 2 0 1 ...
## $ Source: Factor w/ 2 levels "BM+PB","PB": 1 1 1 1 1 1 1 1 1 1 ...
## $ ftime : num 0.67 9.5 131.77 24.03 1.47 ...

This is a dataframe with 7 variables and a total of 177 observations.
$ Sex: sex, factor, 2 levels: "F", "M".
$ D: disease type, factor, 2 levels: "ALL (acute lymphocytic leukemia)", "AML (acute myelocytic leukemia)".
$ Phase: phase of disease, factor, 4 levels: "CR1", "CR2", "CR3", "Relapse".
$ Age: age, continuous variable.
$ Status: outcome variable, 0 = censored, 1 = recurrence, 2 = competing risk event.
$ Source: type of intervention, factor, 2 levels: "BM+PB (bone marrow transplantation + blood transplantation)", "PB (blood transplantation)".
$ ftime: time variable, continuous.

The cmprsk package for the competing risk model was loaded, the data frame bmt was attached, and the disease type D was converted to a factor variable.

library(cmprsk)
## Loading required package: survival
bmt$D <- as.factor(bmt$D)
attach(bmt)

Fine-Gray test (univariate analysis)

Similar to the Log-rank test for comparing the survival outcomes of two groups, univariate analysis can also be carried out while considering competing risk events. Next, we use the cuminc() function to carry out the univariate Fine-Gray test.

fit1 <- cuminc(ftime,Status,D)
fit1
## Tests:
##        stat         pv df
## 1 2.8623325 0.09067592  1
## 2 0.4481279 0.50322531  1
## Estimates and Variances:
## $est
##              20        40        60        80       100       120
## ALL 1 0.3713851 0.3875571 0.3875571 0.3875571 0.3875571 0.3875571
## AML 1 0.2414530 0.2663827 0.2810390 0.2810390 0.2810390        NA
## ALL 2 0.3698630 0.3860350 0.3860350 0.3860350 0.3860350 0.3860350
## AML 2 0.4439103 0.4551473 0.4551473 0.4551473 0.4551473        NA
##
## $var
##                20          40          60          80         100         120
## ALL 1 0.003307032 0.003405375 0.003405375 0.003405375 0.003405375 0.003405375
## AML 1 0.001801156 0.001995487 0.002130835 0.002130835 0.002130835          NA
## ALL 2 0.003268852 0.003373130 0.003373130 0.003373130 0.003373130 0.003373130
## AML 2 0.002430406 0.002460425 0.002460425 0.002460425 0.002460425          NA

Interpretation of results: "1" denotes the defined endpoint and "2" denotes the competing risk event. The statistic in the first row is 2.8623325 with P=0.09067592, indicating that, after accounting for the competing risk events (whose own test is given by the statistic and P value in the second row), there was no statistically significant difference in the cumulative recurrence risk between "ALL" and "AML" (P=0.09067592).

$est: the estimated cumulative recurrence rate and cumulative competing-risk-event rate of the "ALL" and "AML" groups at each time point (the defined endpoint and the competing risk event are distinguished by "1" and "2", consistent with rows 1 and 2 above).

$var: the variances of the estimated cumulative recurrence rate and cumulative competing-risk-event rate of the "ALL" and "AML" groups at each time point, organized in the same way.

Below, we draw the curves of the cumulative recurrence rate and the cumulative competing-risk-event incidence rate to represent the numeric results above intuitively (Figure 31).
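The cumulative incidence estimates in $est above come from a nonparametric estimator of the CIF: at each event time, the CIF of the cause of interest grows by the all-cause Kaplan-Meier survival just before that time multiplied by that cause's hazard increment. As a rough illustration (a toy Python sketch with hypothetical data, not the cuminc() implementation):

```python
def cuminc_toy(time, status, cause=1):
    """Nonparametric cumulative incidence for one cause:
    CIF(t) accumulates S(t-) * d_cause(t) / n(t) over event times t,
    where S is the all-cause Kaplan-Meier survival just before t
    and n(t) is the number of subjects still at risk."""
    event_times = sorted({t for t, s in zip(time, status) if s != 0})
    surv, cif, curve = 1.0, 0.0, []
    for t in event_times:
        at_risk = sum(1 for x in time if x >= t)
        d_cause = sum(1 for x, s in zip(time, status) if x == t and s == cause)
        d_all = sum(1 for x, s in zip(time, status) if x == t and s != 0)
        cif += surv * d_cause / at_risk
        surv *= 1.0 - d_all / at_risk
        curve.append((t, cif))
    return curve

# status: 0 = censored, 1 = endpoint of interest, 2 = competing event
print(cuminc_toy([1, 2, 3, 4], [1, 2, 1, 0]))
```

On this toy data the CIF at t=3 is 0.5, whereas naively censoring the competing event and taking 1 minus the Kaplan-Meier estimate would give 0.625 — the overestimation by traditional methods described in the Background above.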


plot(fit1, xlab = 'Month', ylab = 'CIF', lwd=2, lty=1,
     col = c('red','blue','black','forestgreen'))

Figure 31 The curves of cumulative recurrence rate and cumulative competing-risk-event incidence rate. (Axes: Month vs. CIF; curves: ALL 1, AML 1, ALL 2, AML 2.)

Figure interpretation: the vertical axis represents the CIF and the horizontal axis is time. Look first at the red curve corresponding to ALL 1 and the blue curve corresponding to AML 1 ("1" denotes the defined endpoint and "2" the competing risk event). From the figure, the recurrence risk of the ALL group was higher than that of the AML group, but the difference did not reach statistical significance, P=0.09067592. Similarly, the black curve corresponding to ALL 2 lies below the grass-green curve corresponding to AML 2, so the incidence of competing risk events in the ALL group is lower than in the AML group, which also fails to reach statistical significance, P=0.50322531. As can be seen, the curves were "entangled" in the first 20 months, so no statistically significant result was obtained. Simply put, this figure can be summarized in one sentence: after accounting for the competing risk events, there is no statistically significant difference in the cumulative recurrence risk between "ALL" and "AML", P=0.09067592.

Competitive risk model (multivariate analysis)

The following is a multivariate analysis of survival data considering competing risk events. In the cmprsk package, the crr() function is convenient for multivariate analysis. The function is used as follows:

crr(ftime, fstatus, cov1, cov2, tf, cengroup, failcode=1, cencode=0, subset, na.action=na.omit, gtol=1e-06, maxiter=10, init, variance=TRUE)

You can refer to the crr() help documentation for a detailed explanation of each parameter. Note that the function must be given the time variable and the outcome variable, followed by the covariate matrix or dataframe. Firstly, the covariates entering the model are defined and organized as a dataframe.

cov <- data.frame(age = bmt$Age,
                  sex_F = ifelse(bmt$Sex=='F',1,0),
                  dis_AML = ifelse(bmt$D=='AML',1,0),
                  phase_cr1 = ifelse(bmt$Phase=='CR1',1,0),
                  phase_cr2 = ifelse(bmt$Phase=='CR2',1,0),
                  phase_cr3 = ifelse(bmt$Phase=='CR3',1,0),
                  source_PB = ifelse(bmt$Source=='PB',1,0)) ## Set dummy variables
cov

Construct the multivariate competing risk model. Here, failcode=1 and cencode=0 need to be specified, indicating that the endpoint event is coded "1" and censoring is coded "0"; any other value (here "2") is treated as a competing risk event by default.

fit2 <- crr(bmt$ftime, bmt$Status, cov, failcode=1, cencode=0)
summary(fit2)
## Competing Risks Regression
##
## Call:
## crr(ftime = bmt$ftime, fstatus = bmt$Status, cov1 = cov, failcode = 1,
##     cencode = 0)
##
##               coef exp(coef) se(coef)      z p-value
## age        -0.0185     0.982   0.0119 -1.554  0.1200
## sex_F      -0.0352     0.965   0.2900 -0.122  0.9000
## dis_AML    -0.4723     0.624   0.3054 -1.547  0.1200
## phase_cr1  -1.1018     0.332   0.3764 -2.927  0.0034
## phase_cr2  -1.0200     0.361   0.3558 -2.867  0.0041
## phase_cr3  -0.7314     0.481   0.5766 -1.268  0.2000
## source_PB   0.9211     2.512   0.5530  1.666  0.0960
##
##           exp(coef) exp(-coef)  2.5% 97.5%
## age           0.982      1.019 0.959 1.005
## sex_F         0.965      1.036 0.547 1.704
## dis_AML       0.624      1.604 0.343 1.134
## phase_cr1     0.332      3.009 0.159 0.695
## phase_cr2     0.361      2.773 0.180 0.724
## phase_cr3     0.481      2.078 0.155 1.490
## source_PB     2.512      0.398 0.850 7.426
##
## Num. cases = 177
## Pseudo Log-likelihood = -267
## Pseudo likelihood ratio test = 24.4 on 7 df,

Interpretation of results: after accounting for the competing risk events, the phase variable, i.e., the stage of disease, is an independent risk factor for patient recurrence. Taking patients in the "Relapse" stage as the reference, the hazard ratios (95% CI) for the cumulative recurrence rate of patients in the CR1, CR2 and CR3 stages were 0.332 (0.159, 0.695), 0.361 (0.180, 0.724) and 0.481 (0.155, 1.490), with corresponding P values of 0.0034, 0.0041 and 0.2000, respectively.
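The exp(coef) and confidence-interval columns in the crr() output follow the usual normal-approximation arithmetic. A small Python sketch (an illustrative helper added here, not cmprsk code) reproduces the phase_cr1 row from coef = -1.1018 and se = 0.3764:

```python
import math

def hr_with_ci(coef, se, z=1.96):
    """Hazard ratio exp(coef) with a normal-approximation 95% CI:
    exp(coef - z*se) to exp(coef + z*se)."""
    return (math.exp(coef),
            math.exp(coef - z * se),
            math.exp(coef + z * se))

hr, lo, hi = hr_with_ci(-1.1018, 0.3764)
print(round(hr, 3), round(lo, 3), round(hi, 3))  # 0.332 0.159 0.695
```

The same arithmetic applies to every row of the coefficient table above.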


Brief summary

This Section has introduced in detail the Fine-Gray test and the competing risk model implemented in the cmprsk package of R. In concrete applications, readers should pay attention to two points:

Firstly, use the Fine-Gray test and the competing risk model selectively: if the endpoint has competing risk events that are likely to affect the conclusion, then this model is suitable. It is not necessarily better than the Cox model; the two models should complement each other.

Secondly, the competing risk model itself considers competing risk events in a limited way: it only extends the binary endpoint of the Cox model to a three-way classification, namely outcome events, censoring and competing risk events. Even so, the results can be difficult to interpret. Therefore, readers should evaluate and experiment fully when choosing statistical methods.

Nomogram of competing risk model with R

Background

When the cmprsk package for the competing risk model is loaded in R, univariate and multivariate analyses of survival data that consider competing risk events can be carried out with the cuminc() and crr() functions. We described the implementation in R in detail in the previous Section, so it is not repeated here.

So how do you visualize the competing risk model? How do you draw a nomogram for it? Here we demonstrate how to draw such a nomogram with R.

Case analysis

[Case 1]
This case data was downloaded from http://www.stat.unipg.it/luca/R/. Researchers planned to investigate the curative effect of bone marrow transplantation compared with blood transplantation for the treatment of leukemia, with the endpoint event defined as "recurrence". Unfortunately, some patients died of adverse reactions after transplantation, so the endpoint "recurrence" cannot be observed in those patients. In other words, "transplant-related death" and "recurrence" are competing risk events. Therefore, the competing risk model is adopted for statistical analysis (57,58).

R code and its interpretation

Firstly, import the data file 'bmtcrr.csv' from the current working path.

library(foreign)
bmt <- read.csv('bmtcrr.csv')
str(bmt)
## 'data.frame': 177 obs. of 7 variables:
## $ Sex : Factor w/ 2 levels "F","M": 2 1 2 1 1 2 2 1 2 1 ...
## $ D : Factor w/ 2 levels "ALL","AML": 1 2 1 1 1 1 1 1 1 1 ...
## $ Phase : Factor w/ 4 levels "CR1","CR2","CR3",..: 4 2 3 2 2 4 1 1 1 4 ...
## $ Age : int 48 23 7 26 36 17 7 17 26 8 ...
## $ Status: int 2 1 0 2 2 2 0 2 0 1 ...
## $ Source: Factor w/ 2 levels "BM+PB","PB": 1 1 1 1 1 1 1 1 1 1 ...
## $ ftime : num 0.67 9.5 131.77 24.03 1.47 ...

This is a dataframe with 7 variables and a total of 177 observations; the variables are as described in the previous Section.

Firstly, the variables in the dataset bmt are further processed.

bmt$id <- 1:nrow(bmt) # Sort the data set by rows and generate an ordinal id
bmt$age <- bmt$Age
bmt$sex <- as.factor(ifelse(bmt$Sex=='F',1,0))
bmt$D <- as.factor(ifelse(bmt$D=='AML',1,0))
bmt$phase_cr <- as.factor(ifelse(bmt$Phase=='Relapse',1,0))
bmt$source <- as.factor(ifelse(bmt$Source=='PB',1,0))

View the data structure and present the first six rows of data. It can be seen that we have re-assigned the covariates in the data set and binarized the multi-level variables. Note that dummy variables are deliberately not created for the multi-level variables here: if dummy variables appear in a nomogram, interpretation of the results becomes confusing, so dummy variables should be avoided in nomograms.
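The ifelse() recoding above collapses each multi-level variable to 0/1 (e.g., phase_cr is 1 only for "Relapse"). The same idea, written generically (an illustrative Python sketch, not part of the original code):

```python
def binarize(values, positive_level):
    """Recode a categorical variable as 0/1: 1 for the chosen level,
    0 for every other level (mirrors ifelse(x == level, 1, 0))."""
    return [1 if v == positive_level else 0 for v in values]

print(binarize(["Relapse", "CR2", "CR3", "CR1"], "Relapse"))  # [1, 0, 0, 0]
```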


str(bmt)
## 'data.frame': 177 obs. of 12 variables:
## $ Sex : Factor w/ 2 levels "F","M": 2 1 2 1 1 2 2 1 2 1 ...
## $ D : Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 1 1 ...
## $ Phase : Factor w/ 4 levels "CR1","CR2","CR3",..: 4 2 3 2 2 4 1 1 1 4 ...
## $ Age : int 48 23 7 26 36 17 7 17 26 8 ...
## $ Status : int 2 1 0 2 2 2 0 2 0 1 ...
## $ Source : Factor w/ 2 levels "BM+PB","PB": 1 1 1 1 1 1 1 1 1 1 ...
## $ ftime : num 0.67 9.5 131.77 24.03 1.47 ...
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ age : int 48 23 7 26 36 17 7 17 26 8 ...
## $ sex : Factor w/ 2 levels "0","1": 1 2 1 2 2 1 1 2 1 2 ...
## $ phase_cr: Factor w/ 2 levels "0","1": 2 1 1 1 1 2 1 1 1 2 ...
## $ source : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
head(bmt)
##   Sex D   Phase Age Status Source  ftime id age sex phase_cr source
## 1   M 0 Relapse  48      2  BM+PB   0.67  1  48   0        1      0
## 2   F 1     CR2  23      1  BM+PB   9.50  2  23   1        0      0
## 3   M 0     CR3   7      0  BM+PB 131.77  3   7   0        0      0
## 4   F 0     CR2  26      2  BM+PB  24.03  4  26   1        0      0
## 5   F 0     CR2  36      2  BM+PB   1.47  5  36   1        0      0
## 6   M 0 Relapse  17      2  BM+PB   2.23  6  17   0        1      0

The regplot() function in the regplot package can draw a more aesthetic nomogram. However, it currently only accepts regression objects returned by the coxph(), lm() and glm() functions. Therefore, in order to draw the nomogram of the competing risk model, we need to weight the original data set to create a new data set for the competing risk analysis (59,60). The main feature of the crprep() function in the mstate package is to create this weighted data set, as shown in the R code below. We can then use the coxph() function to fit the competing risk model on the weighted data set and pass it to the regplot() function to plot the nomogram. For the specific weighting principles, readers may refer to the literature published by Geskus et al. (60); they are not covered here.

Next, we create the weighted data set for the original data set bmt and name it df.w. The parameter trans= specifies the endpoint event and competing risk event that need to be weighted; cens= specifies the censoring value; id= passes in the id variable of the data set bmt; keep= lists the covariates to be retained in the weighted data set.

library(mstate)
## Loading required package: survival
df.w <- crprep("ftime", "Status",
               data=bmt, trans=c(1,2),
               cens=0, id="id",
               keep=c("age","sex","D","phase_cr","source"))
df.w$T <- df.w$Tstop - df.w$Tstart

The above code has created the weighted data set df.w, on which we can then use the coxph() function for the competing risk analysis.

m.crr <- coxph(Surv(T,status==1)~age+sex+D+phase_cr+source,
               data=df.w,
               weight=weight.cens,
               subset=failcode==1)
summary(m.crr)
## Call:
## coxph(formula = Surv(T, status == 1) ~ age + sex + D + phase_cr +
##     source, data = df.w, weights = weight.cens, subset = failcode ==
##     1)
##
##   n= 686, number of events= 56
##
##                coef exp(coef) se(coef)      z Pr(>|z|)
## age        -0.02174   0.97850  0.01172 -1.854  0.06376 .
## sex1       -0.10551   0.89987  0.27981 -0.377  0.70612
## D1         -0.53163   0.58764  0.29917 -1.777  0.07556 .
## phase_cr1   1.06140   2.89040  0.27870  3.808  0.00014 ***
## source1     1.06564   2.90269  0.53453  1.994  0.04620 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##           exp(coef) exp(-coef) lower .95 upper .95
## age          0.9785     1.0220    0.9563     1.001
## sex1         0.8999     1.1113    0.5200     1.557
## D1           0.5876     1.7017    0.3269     1.056
## phase_cr1    2.8904     0.3460    1.6739     4.991
## source1      2.9027     0.3445    1.0181     8.275
##
## Concordance= 0.737 (se = 0.037 )
## Likelihood ratio test= 28.33 on 5 df, p=3e-05
## Wald test            = 28.54 on 5 df, p=3e-05
## Score (logrank) test = 30.49 on 5 df, p=1e-05

Next, we can plot the nomogram using the regplot() function. In the nomogram, the covariate values of the patient with id=31 in the data set are mapped to the corresponding scores, the total score is calculated, and the cumulative recurrence probabilities at 36 and 60 months are computed, i.e., the cumulative recurrence probabilities with competing risk events taken into account. The calculated results are 0.196 and 0.213, respectively (Figure 32).

library(regplot)
regplot(m.crr, observation=df.w[df.w$id==31 & df.w$failcode==1,],
        failtime = c(36, 60), prfail = T, droplines=T)
## Replicate weights assumed
## Click on graphic expected. To quit click Esc or press Esc
## $points.tables
## $points.tables[[1]]
##         source Points
## source2      1    100
## source1      0     38
##
## $points.tables[[2]]


##           phase_cr Points
## phase_cr1        0     38
## phase_cr2        1    100
##
## $points.tables[[3]]
##    D Points
## D2 1      8
## D1 0     38
##
## $points.tables[[4]]
##      sex Points
## sex1   0     38
## sex2   1     32
##
## $points.tables[[5]]
##   age Points
## 1   0     78
## 2  10     65
## 3  20     53
## 4  30     40
## 5  40     28
## 6  50     15
## 7  60      3
## 8  70    -10
##
## $points.tables[[6]]
##   Total Points Pr( T < 36 )
## 1          100       0.0232
## 2          150       0.0543
## 3          200       0.1243
## 4          250       0.2705
## 5          300       0.5276
## 6          350       0.8318
## 7          400       0.9856

In order to facilitate the comparison, a Cox regression model can be further constructed on the original data set bmt; the covariate values of the patient with id=31 are again mapped to the corresponding scores, the total score is calculated, and the cumulative recurrence probabilities at 36 and 60 months are calculated. The results are 0.205 and 0.217, respectively (Figure 33).

library(survival)
m.cph <- coxph(Surv(ftime,Status==1)~age+sex+D+phase_cr+source,
               data=bmt)
summary(m.cph)
## Call:
## coxph(formula = Surv(ftime, Status == 1) ~ age + sex + D + phase_cr +
##     source, data = bmt)
##
##   n= 177, number of events= 56
##
##                 coef exp(coef) se(coef)      z Pr(>|z|)
## age        -0.007766  0.992264 0.011952 -0.650   0.5158
## sex1        0.371888  1.450470 0.283306  1.313   0.1893
## D1         -0.643592  0.525402 0.295888 -2.175   0.0296 *
## phase_cr1   1.373882  3.950657 0.290598  4.728 2.27e-06 ***
## source1     0.315122  1.370427 0.552842  0.570   0.5687
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##           exp(coef) exp(-coef) lower .95 upper .95
## age          0.9923     1.0078    0.9693    1.0158
## sex1         1.4505     0.6894    0.8324    2.5273
## D1           0.5254     1.9033    0.2942    0.9383
## phase_cr1    3.9507     0.2531    2.2352    6.9828
## source1      1.3704     0.7297    0.4637    4.0498
##
## Concordance= 0.726 (se = 0.036 )
## Likelihood ratio test= 30.74 on 5 df, p=1e-05
## Wald test            = 29.99 on 5 df, p=1e-05
## Score (logrank) test = 33.48 on 5 df, p=3e-06

regplot(m.cph, observation=bmt[bmt$id==31,],
        failtime = c(36,60), prfail = TRUE, droplines=T)
## Click on graphic expected. To quit click Esc or press Esc
## $points.tables
## $points.tables[[1]]
##         source Points
## source2      1     48
## source1      0     32
##
## $points.tables[[2]]
##           phase_cr Points
## phase_cr1        0     32
## phase_cr2        1    100
##
## $points.tables[[3]]
##    D Points
## D2 1      0
## D1 0     32
##
## $points.tables[[4]]
##      sex Points
## sex1   0     32
## sex2   1     50
##
## $points.tables[[5]]
##   age Points
## 1   0     44
## 2  20     36
## 3  40     28
## 4  60     21
## 5  80     13
##
## $points.tables[[6]]
##   Total Points Pr( ftime < 36 )
## 1          100           0.0894


Coxph regression
Points
0 20 40 60 80 100

Source* 1
0

Phase_cr***
1
0

D
0
1

Sex 0
1

Age

70 60 50 40 30 20 10 0

Total-points-to-outcome nomogram:

229
Total points
150 200 250 300 350

0.196
Pr (T<36)
0.06 0.08 0.1 0.12 0.16 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
0.213
Pr (T<60)
0.06 0.08 0.1 0.12 0.16 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

Figure 32 Nomogram predicting cumulative recurrence risk at 36 and 60 months using the competitive risk model. Nomogram estimates
that patient no. 31 has a cumulative risk of recurrence of 0.196 and 0.213 at 36 and 60 months, respectively. *, P<0.05; ***, P<0.001.

## 2 120 0.1308
## 3 140 0.1892
## 4 160 0.2695
## 5 180 0.3751
## 6 200 0.5053
## 7 220 0.6514
## 8 240 0.7935
## 9 260 0.9057
## 10 280 0.9708
## 11 300 0.9950

It can be seen that the cumulative recurrence risks calculated by the competing risk model and the Cox proportional hazards model differ slightly, with the competing risk estimate for patient No. 31 being slightly lower. The endpoint event we defined occurred in patient No. 31, that is, the patient relapsed after transplantation, and for such a patient the two models give similar results. When a patient is censored or a competing risk event occurs, however, the results of the two models differ substantially; readers can verify this for themselves.

Brief summary

This paper describes in detail the use of the mstate and regplot R packages to plot the nomogram of a competing risk model. In fact, this is a flexible method: first the original data set is weighted, then a Cox regression model is fitted to the weighted data set to build the competing risk model, and finally the nomogram is drawn. This paper does not cover further evaluation of the competing risk model; the riskRegression package in R can evaluate a prediction model built on competing risk analysis, for example by calculating the C-index and drawing calibration curves.
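Reading a cumulative risk off a total-points table like the one printed above is just piecewise-linear interpolation between tabulated rows. A sketch of that arithmetic in Python; the pts/risk vectors are copied from the printed output, and the query of 150 points is an arbitrary illustration:

```python
# Total-points-to-cumulative-risk table, copied from the printed output above.
pts  = [120, 140, 160, 180, 200, 220, 240, 260, 280, 300]
risk = [0.1308, 0.1892, 0.2695, 0.3751, 0.5053, 0.6514,
        0.7935, 0.9057, 0.9708, 0.9950]

def lookup(total_points):
    """Piecewise-linear interpolation between adjacent table rows."""
    for i in range(len(pts) - 1):
        x0, x1 = pts[i], pts[i + 1]
        if x0 <= total_points <= x1:
            y0, y1 = risk[i], risk[i + 1]
            return y0 + (y1 - y0) * (total_points - x0) / (x1 - x0)
    raise ValueError("total points outside the tabulated range")

print(lookup(150))  # halfway between 0.1892 and 0.2695
```

The nomogram's probability axis performs the same mapping graphically, except that the drawn axis is continuous rather than tabulated.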

© Annals of Translational Medicine. All rights reserved. Ann Transl Med 2019;7(23):796 | http://dx.doi.org/10.21037/atm.2019.08.63

[Figure 33: Coxph regression nomogram. Axes: Points (0–100); Source; Phase_cr***; D*; Sex; Age (0–80); Total points (100–280); Pr (ftime <36); Pr (ftime <60). For patient no. 31: total points 145, Pr (ftime <36) =0.205, Pr (ftime <60) =0.217.]

Figure 33 Nomogram predicting cumulative recurrence risk at 36 and 60 months using the Cox proportional hazards model. According to the nomogram, the cumulative recurrence risk of patient no. 31 at 36 and 60 months is 0.205 and 0.217, respectively. *, P<0.05; ***, P<0.001.
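The direction of the gap between the two figures (0.205 from the Cox model vs. 0.196 from the competing risk model at 36 months) is the expected one: treating competing events as censoring makes 1 − KM overshoot the cumulative incidence. A toy discrete-time check in Python, with invented per-interval hazards rather than the fitted ones:

```python
# Invented hazards: h1 = relapse, h2 = competing event (e.g., death), per interval.
h1 = [0.05, 0.04, 0.03]
h2 = [0.02, 0.02, 0.02]

surv = 1.0      # P(no event of either kind so far)
cif = 0.0       # cumulative incidence of relapse (competing-risk estimate)
km = 1.0        # relapse-free "survival" when competing events are censored
for a, b in zip(h1, h2):
    cif += surv * a         # relapsing now requires having escaped both events
    surv *= 1 - a - b
    km *= 1 - a             # ignores that the competing event removes patients
naive = 1 - km

print(cif < naive)  # True: 1 - KM overestimates the cumulative incidence
```

The overestimation grows with the hazard of the competing event, which is why the two nomograms agree closely only when competing events are rare.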

Outlier recognition and missing data handling with R

Background

In Section 1, we described the sources of data for clinical prediction model construction. Whether data are collected prospectively or retrospectively, outliers and missing values are common in the data set. Outliers and missing values are often a tricky issue for statisticians and can lead to errors if not handled properly: outliers may cause our results to deviate from the true results, and the loss of information caused by missing values may lead to modeling failure. Therefore, it is important to correctly identify outliers and properly handle missing values before performing data analysis. Although this Section is the last in this series of articles, the work it describes should be carried out before modeling; it must not be left to the end or treated as unimportant.

Outliers

What is an outlier? In a word, a value of a variable that lies beyond common sense and does not fit the facts is an outlier. For example, suppose we collected fasting blood glucose from a group of patients and one of them had a fasting blood glucose of more than 50 mmol/L; this is obviously an abnormal value. For another example, suppose we investigated the prevalence of hypertension in the elderly over 60 years old in Xuhui District, Shanghai, and one subject has an SBP exceeding 1,400 mmHg. This is obviously abnormal; it is likely to be a recording error, and the true SBP is more likely 140.0 mmHg. Sometimes an outlier is a relative concept, related to the context in which our clinical research data are collected. For example, if our study is of children under the age of 10, then children of this age group are unlikely to be graduate students, their height is unlikely to exceed 170 cm, and their weight is unlikely to exceed 100 kg. There is another situation in which abnormal values may be generated: unbalanced sampling. For example, if 1,000 people are sampled from Area A and only 100 from Area B, the 100 people from Area B may form a cluster of their own whose values are systematically higher or lower than those of Area A. The same situation occurs in clinical


studies. When we study the efficacy of an intervention, if only some patients show a significant effect, this part of the data is "outlying" compared with the patients whose effects are less obvious, yet these outliers are exactly what we care about most. Therefore, the judgment of abnormal values must be tied to the actual situation and must not be arbitrary, so as to avoid serious mistakes. When we are not sure about the data, the best solution is to check the original data record.

Below I will introduce several commonly used functions to identify outliers in a dataset. Suppose we have collected the height of 1,000 subjects. First, we can use the boxplot() function to draw a box plot describing the data, and then the range() function to find the maximum and minimum of these values.

First, we simulate 1,000 subjects with heights of 100–250 cm and use range() to see the range of height in this group.

set.seed(123)
height <- sample(100:250,1000,replace = TRUE)
boxplot(height)
range(height)
## [1] 100 250

Use the min() and max() functions to return the minimum and maximum values of an object.

min(height)
## [1] 100
max(height)
## [1] 250

If there are missing values in the data, you must set the parameter na.rm = TRUE, otherwise the results cannot be calculated.

height <- c(sample(100:250,998,replace = TRUE),NA,NA)
boxplot(height)
range(height,na.rm = TRUE)
## [1] 100 250

We can use the mean() and median() functions to calculate the mean and median, respectively, again setting na.rm = TRUE when there are missing values.

mean(height,na.rm = TRUE)
## [1] 174.7244
median(height,na.rm = TRUE)
## [1] 175

We can also use the sd(), quantile() or fivenum() functions to calculate the standard deviation and the quartiles.

sd(height,na.rm = TRUE)
## [1] 43.19566
quantile(height,na.rm = TRUE)
## 0% 25% 50% 75% 100%
## 100.00 137.25 175.00 212.00 250.00
fivenum(height)
## [1] 100 137 175 212 250

Note that the na.rm parameter of fivenum() defaults to TRUE, but the sd() and quantile() functions require us to set na.rm=TRUE, otherwise the result cannot be calculated.

The methods above can help us identify maxima or minima, but sometimes extreme values do not appear singly but in clusters, and then these methods are not enough. In practice, we generally define outliers according to the mean and standard deviation of a variable, or according to the median and quartiles (the Tukey method), in the context of the research background. For example, we can declare every value beyond the mean ±3 standard deviations an outlier. We can also judge outliers of a categorical variable: for example, if gender is coded 1= male, 2= female, then an assigned value of 3 is an outlier. Here we introduce a custom function (61). This function judges outliers according to the quartile-based Tukey method, which effectively avoids the influence of extreme values on the mean and standard deviation. The function is as follows:

outlierKD <- function(dt, var) {
  var_name <- eval(substitute(var),eval(dt))
  tot <- sum(!is.na(var_name))
  na1 <- sum(is.na(var_name))
  m1 <- mean(var_name, na.rm = T)
  par(mfrow=c(2, 2), oma=c(0,0,3,0))
  boxplot(var_name, main="With outliers")
  hist(var_name, main="With outliers", xlab=NA, ylab=NA)
  outlier <- boxplot.stats(var_name)$out # outliers are defined here as the "out" values of the box plot
  mo <- mean(outlier)
  var_name <- ifelse(var_name %in% outlier, NA, var_name)
  boxplot(var_name, main="Without outliers")
  hist(var_name, main="Without outliers", xlab=NA, ylab=NA)
  title("Outlier Check", outer=TRUE)
  na2 <- sum(is.na(var_name))
  cat("Outliers identified:", na2 - na1, "\n")
  cat("Proportion (%) of outliers:", round((na2 - na1) / tot*100, 1), "\n")
  cat("Mean of the outliers:", round(mo, 2), "\n")
  m2 <- mean(var_name, na.rm = T)
  cat("Mean without removing outliers:", round(m1, 2), "\n")
  cat("Mean if we remove outliers:", round(m2, 2), "\n")
  response <- readline(prompt="Do you want to remove outliers and to replace with NA? [yes/no]: ")
  if(response == "y" | response == "yes"){
    dt[as.character(substitute(var))] <- invisible(var_name)


    assign(as.character(as.list(match.call())$dt), dt, envir = .GlobalEnv)
    cat("Outliers successfully removed", "\n")
    return(invisible(dt))
  } else{
    cat("Nothing changed", "\n")
    return(invisible(var_name))
  }
}

The custom function has only two parameters: the first is the name of the data set and the second is the variable name. The reader can run the code directly, as long as the data set and variable names are properly substituted. Here the "out" values of the box plot are taken as outliers; we can also redefine the outliers according to domain expertise, for example as values beyond the mean ± 3 standard deviations. At the end of the function a user prompt is set: the user decides whether to remove the outliers recognized by the function from the dataset by typing "yes" or "no".

Below we simulate a set of data to verify the functionality of this custom outlier recognition function.

set.seed(123)
df <- data.frame(height = c(sample(100:250, 1000, replace = TRUE), NA, 380, 20))
outlierKD(df, height)
## Outliers identified: 2
## Proportion (%) of outliers: 0.2
## Mean of the outliers: 200
## Mean without removing outliers: 174.63
## Mean if we remove outliers: 174.58
## Do you want to remove outliers
## and to replace with NA? [yes/no]:
## Nothing changed
# The author typed "yes", so the outliers 380 and 20 in the original data were removed.

Missing value recognition

Missing data are common in clinical studies. For example, when collecting data, a nurse may forget to record the urine volume at a certain time point because of busy work; when a researcher wants to study the effect of lactic acid changes on mortality, a patient may have had the blood lactate value measured at only one time point. Other reasons for missing data include coding errors, equipment failures, and non-response of respondents in survey studies (15). In statistical packages, some functions, such as logistic regression, may automatically delete missing data. If there are only a small number of incomplete observations, this processing will not cause much of a problem. However, when a large number of observations contain missing values, the default row deletion in these functions can result in a large loss of information. In this case, the analyst should carefully consider the mechanisms that may have produced the missing data and find an appropriate way to handle them.

How to deal with missing values is a headache for clinical statisticians, so I decided to devote a Section to this thorny topic. The presence and degree of missingness directly affect the quality of the data, and the quality of the data ultimately affects our research results; if missing data are not handled appropriately, the entire statistical analysis is likely to fail. This Section describes how to handle missing data in R and introduces some basic skills for dealing with them.

In R, "NA" represents a missing value. When an Excel table with empty cells is imported into the R console, these empty cells are replaced by NA. This is different from STATA, which replaces empty cells with ".". The same missing value symbol is used for numeric and character variables in R. R provides some functions to handle missing values. To determine whether a vector contains missing values, you can use the is.na() function, the most common method for determining whether an element is of NA type. It returns an object of the same length as the incoming parameter, in which every element is a logical value (FALSE or TRUE). Suppose we have 6 patients, but only 4 values were recorded and 2 are missing.

x <- c(1.8,2.3,NA,4.1,NA,5.7)
is.na(x)
## [1] FALSE FALSE TRUE FALSE TRUE FALSE

The vector returned by is.na() has length 6, and the third and fifth elements are TRUE, indicating that these patients' values are missing.

Use the which() function to find the location of the NAs. Someone might use a logical check (such as x==NA) to detect missing data; this cannot return TRUE, because missing values are not comparable and the "==" operator itself returns NA, so you must use the missing value function. With the which() function you can find which elements of the vector contain NA. In this example the which() function returns 3 and 5, indicating that the third and fifth patients' x values are missing.

which(is.na(x))
## [1] 3 5

The sum() function can then be used to count the NAs in the vector.


sum(is.na(x))
## [1] 2

We can remove the missing values in x:

x[!is.na(x)]
## [1] 1.8 2.3 4.1 5.7

Here we use "!", the logical "not" operator; the code finds all the elements of the vector x that are not missing. When there is an NA in a vector, adding, subtracting, multiplying, dividing, averaging, or computing the standard deviation all return NA, because by default R treats the NA as one of the elements, which makes the result undefined. For example:

sum(x)
## [1] NA
mean(x)
## [1] NA

In this case, we need to change the na.rm parameter of these calculation functions from its default FALSE to TRUE.

mean(x,na.rm = TRUE)
## [1] 3.475
sum(x,na.rm = TRUE)
## [1] 13.9

The R code above deals with missingness in vectors, but in practice we more often deal with missingness in a data frame. Next, we simulate a data frame with missing values. Note that the third patient has a missing sex value and the fourth patient a missing x value.

id<-c(1,2,3,4,5,6)
sex<-c("m","f",NA,"f","m","f")
x<-c(0.7,3.4,4.1,NA,6.7,5.3)
df<-data.frame(id,sex,x)
df
## id sex x
## 1 1 m 0.7
## 2 2 f 3.4
## 3 3 <NA> 4.1
## 4 4 f NA
## 5 5 m 6.7
## 6 6 f 5.3

Now we can calculate the proportion of missing values in the df data frame, as well as the mean, standard deviation, and so on after removing the missing values. Many functions can do this; here we recommend the describe() function in the psych package, which conveniently gives a series of descriptive statistics.

install.packages("psych")
library(psych)
## Warning: package 'psych' was built under R version 3.5.3
describe(df)
## vars n mean sd median trimmed mad min max range skew kurtosis
## id 1 6 3.50 1.87 3.5 3.50 2.22 1.0 6.0 5 0.00 -1.80
## sex* 2 5 1.40 0.55 1.0 1.40 0.00 1.0 2.0 1 0.29 -2.25
## x 3 5 4.04 2.25 4.1 4.04 1.78 0.7 6.7 6 -0.29 -1.61
## se
## id 0.76
## sex* 0.24
## x 1.01

The describe() function returns the basic statistics of the dataset. "n" is the number of non-missing observations of the variable; "mean", "sd" and "median" are the mean, standard deviation, and median after removing the NAs. "trimmed" is the mean recalculated after removing 10% of the data from each end, which removes the influence of extreme values. "mad", "min", "max", and "range" are the median absolute deviation, minimum, maximum, and range, respectively. "skew", "kurtosis", and "se" are the skewness, kurtosis, and standard error; the first two measure how well the data fit a normal distribution.

Although some default settings in regression models can effectively ignore missing data, it is often necessary to create a new data frame that excludes missing data. We only need to apply na.omit() to the data frame, and it returns a new data frame with the missing values removed.

df_omit<-na.omit(df)
df_omit
## id sex x
## 1 1 m 0.7
## 2 2 f 3.4
## 5 5 m 6.7
## 6 6 f 5.3

The na.omit() call above returns a new data frame with missing values removed; it can be seen that the third and fourth patients, who had missing data, were dropped.

Missing value visualization

The visualization of missing values helps us see the missing values in the dataset more intuitively, which will help us interpolate them later. In this section,


the author will mainly introduce the VIM package. The demo dataset for this section is the built-in R dataset "airquality".

data("airquality")
str(airquality)
## 'data.frame': 153 obs. of 6 variables:
## $ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
## $ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
## $ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
## $ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
## $ Month : int 5 5 5 5 5 5 5 5 5 5 ...
## $ Day : int 1 2 3 4 5 6 7 8 9 10 ...

The airquality dataset contains 153 observations and 6 variables, and from the output above we can see that it contains missing values. Before visualizing, we first explore the missing data pattern using the md.pattern() function in the mice package.

install.packages("mice")
library(mice)
## Loading required package: lattice
##
## Attaching package: 'mice'
## The following objects are masked from 'package:base':
##
## cbind, rbind
md.pattern(airquality)
## Wind Temp Month Day Solar.R Ozone
## 111 1 1 1 1 1 1 0
## 35 1 1 1 1 1 0 1
## 5 1 1 1 1 0 1 1
## 2 1 1 1 1 0 0 2
## 0 0 0 0 7 37 44

In the body of the output table, "1" represents a non-missing value and "0" a missing value. The first column shows the number of observations with each unique missing data pattern: in our example, 111 observations contain no missing data, 35 observations are missing only the Ozone variable, 5 observations are missing only Solar.R, and so on. The rightmost column shows the number of missing variables in each particular pattern; for example, the first row has no missing values and therefore displays "0". The last row counts the missing values for each variable: the Wind variable has no missing values, while the Ozone variable has 37. In a study, variables with many missing values may be eliminated, and this table provides useful reference information for that decision.

Below we call the VIM package to visualize the missing values. Studying missing data patterns is necessary for choosing a suitable interpolation method, so visualization should be performed before the interpolation operation, and after interpolation a diagnosis should usually be made to determine whether the interpolated values are reasonable. The functions that can be used to visualize missing data include aggr(), matrixplot(), scattMiss() and marginplot().

install.packages('VIM')
library(VIM)
## Loading required package: colorspace
## Loading required package: grid
## Loading required package: data.table
## VIM is ready to use.
## Since version 4.0.0 the GUI is in its own package VIMGUI.
##
## Please use the package to use the new (and old) GUI.
## Suggestions and bug-reports can be submitted at: https://github.com/alexkowa/VIM/issues
##
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
##
## sleep
aggr_plot <- aggr(airquality, col=c('red','blue'), numbers=TRUE, sortVars=TRUE, labels=names(airquality), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))
## Variables sorted by number of missings:
## Variable Count
## Ozone 0.24183007
## Solar.R 0.04575163
## Wind 0.00000000
## Temp 0.00000000
## Month 0.00000000
## Day 0.00000000

The aggr() function helps us visualize the missing values. The left graph is a histogram of the proportion of missing values: Ozone and Solar.R have missing values, with Ozone's missing value ratio above 20%. The right graph reflects the pattern of missing values, with red indicating not missing and blue indicating missing: only Ozone is missing in 22.9% of observations, only Solar.R in 3.3%, and both in 1.3%, while observations with complete data account for 72.5%.

In addition, the marginplot() function can help us visualize the distribution of missing values.

marginplot(airquality[1:2])

In the graphs below (Figures 34,35), the open lake-blue circles indicate non-missing values, the solid red points indicate missing values, and the dark purple point


indicates that both variables are missing. The red box plot on the left side of the figure shows the distribution of Solar.R when Ozone is missing, and the blue box plot shows the distribution of Solar.R after removing the missing values of Ozone. The box plots at the bottom of the graph are just the opposite, reflecting the distribution of Ozone when Solar.R is missing and when it is not.

[Figure 34: aggr() plot. Left panel "Histogram of missing data", right panel "Pattern": complete cases 0.725, only Ozone missing 0.229, only Solar.R missing 0.033, both missing 0.013; variables Ozone, Solar.R, Wind, Temp, Month, Day.]

Figure 34 Visualization of missing values (1).

[Figure 35: marginplot of Solar.R (0–300) against Ozone (0–150), with marginal box plots for missing and non-missing values.]

Figure 35 Visualization of missing values (2).

Imputation of missing values

Interpolation of missing values is a more complicated problem. First, readers must consider whether it is necessary to interpolate the missing values at all; the premise of an imputable data set is that the data are missing at random. In addition, we have to consider the actual situation. For example, if only 5% of your data are missing, the proportion is so small that you can simply remove those observations from the original data. But if 20% of the data are missing, we will lose a lot of information if they are all eliminated, and we should then try to interpolate the missing values. And if 50% of the values in your original data are missing, obviously no method, however powerful, can make up for this "congenital defect"; in that case we must look for the reason at the source of data collection. However, in the real world it is often difficult to collect the missing data again. Therefore, the questions readers need to consider are: when the missing data cannot be re-collected, what method can be used to interpolate the missing values, and which method restores the original values to the greatest extent?

There are many types of missingness, which are summarized in the following three cases (15,17):
(I) Missing completely at random (MCAR): the missingness is not related to any other variables (either observed or unobserved values), and there is no systematic reason for it.
(II) Missing at random (MAR): the missingness is related to other variables but not to the unmeasured value itself.
(III) Not missing at random (NMAR): everything that is neither MCAR nor MAR.

There are various interpolation methods for missing values: simple ones, such as filling in the mean or median directly, and slightly more complicated ones, such as regression interpolation. However, no matter which method is used, it is difficult to measure the quality of the imputation, because in the real world we cannot obtain the true missing values and therefore cannot verify the correctness of the interpolation. On the computer, though, we can evaluate different interpolation methods by simulation, and this evaluation is useful.

In this section, the author will introduce several missing value interpolation methods for the reader's reference. To better show how to interpolate data, we first simulate 200 observations. The data frame contains three variables: sex, mean arterial pressure (MAP), and lactic acid (lac). To enable the reader to get the same results as this article, we set the seed value for each random simulation. Here we use the mean of each variable to fill in the missing values.

set.seed(123)
sex<-rbinom(200, 1, 0.45)
sex[sex==1]<-"male"
sex[sex==0]<-"female"
set.seed(123)
sex.miss.tag<-rbinom(200, 1, 0.3)#MCAR
sex.miss<-ifelse(sex.miss.tag==1,NA,sex)


set.seed(123)
map<-round(abs(rnorm(200, mean = 70, sd = 30)))
map<-ifelse(map<=40,map+30,map)
set.seed(123)
lac<- rnorm(200, mean = 5, sd = 0.7) -map*0.04
lac<-abs(round(lac,1))
set.seed(123)
lac.miss.tag<-rbinom(200, 1, 0.3)
lac.miss<-ifelse(lac.miss.tag==1,NA,lac)
data<-data.frame(sex.miss,map,lac.miss)

In these data, the lactic acid content is assumed to be related to the MAP: the blood lactate value reflects tissue perfusion, which is associated with MAP, and we hypothesize that MAP and lactic acid are negatively correlated. To increase randomness, we use the rnorm() function to generate the intercept. We assume that the missingness of the gender variable is MCAR.

Next, we use the summary() function to inspect the data set and calculate its standard deviation.

summary(lac.miss)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.600 1.700 2.100 2.015 2.300 2.700 55
sd(lac.miss,na.rm=TRUE)
## [1] 0.4012707

In the output above we find that lactic acid has 55 missing values, a mean of 2.015, and a standard deviation of 0.4012707.

Using the mean, mode, or median to estimate missing values is a quick and easy way. The initialise() function in the VIM package can do the job, but it is mainly used internally by other functions and has no advantage over other ways of doing a single imputation. For example, suppose we want to fill the missing values of a continuous variable with the mean. The following R code returns a lac.mean variable containing complete data, in which the missing values of lac.miss are replaced by the average of the other values; the round() function keeps one decimal place.

lac.mean<-round(ifelse(is.na(lac.miss),mean(lac.miss,na.rm=TRUE),lac.miss),1)

Next, we use a visual method to check the distribution of the values after replacing the missing ones with the average.

library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
scatterplot(lac.mean ~ map | lac.miss.tag, lwd=2,
  main="Scatter Plot of lac vs. map by # missingness",
  xlab="Mean Arterial Pressure (mmHg)",
  ylab="Lactate (mmol/l)",
  legend.plot=TRUE, smoother=FALSE,
  id.method="identify",
  boxplots="xy"
)

Unsurprisingly, all inserted values were 2.1 mmol/L of lactic acid (Figure 36), so the mean and standard deviation of the new sample are biased compared with the actual sample. Imputation with the mode or median works the same way and is left to the reader. Although these rough methods are convenient, they underestimate the variance (which becomes smaller than the actual value), ignore the relationships between variables, and ultimately bias statistics such as the mean and standard deviation. Therefore, such rough estimates can only be used when very little data is missing and cannot be widely applied, and we need to master the handling of more complex situations. Next, I will use the BostonHousing dataset in the mlbench package to demonstrate the common filling methods for missing values.

install.packages('mlbench')
library(mlbench)
## Warning: package 'mlbench' was built under R version 3.5.2
data("BostonHousing")
head(BostonHousing)
## crim zn indus chas nox rm age dis rad tax ptratio b
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12
## lstat medv
## 1 4.98 24.0
## 2 9.14 21.6
## 3 4.03 34.7
## 4 2.94 33.4
## 5 5.33 36.2
## 6 5.21 28.7

The BostonHousing dataset contains 506 observations and 14 variables that reflect the basic situation of Boston city dwellers, including the crime rate of each town and the number of non-retail businesses. Since the BostonHousing dataset itself has no missing values, we randomly generate some missing values.
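Before moving on, the claim above that mean imputation understates the variance can be checked with a few lines of arithmetic. A toy check in Python, with invented numbers: three "missing" entries are filled with the mean of the seven observed ones, which leaves the mean untouched but shrinks the spread.

```python
from statistics import mean, pstdev

observed = [1.8, 2.3, 4.1, 5.7, 2.9, 3.3, 4.8]   # the non-missing values
filled = observed + [mean(observed)] * 3          # 3 NAs replaced by the mean

print(abs(mean(filled) - mean(observed)) < 1e-9)  # True: mean is unchanged
print(pstdev(filled) < pstdev(observed))          # True: variance is understated
```

Every value placed exactly at the mean contributes zero to the sum of squared deviations while inflating the denominator, so the imputed sample always looks less variable than the data really were.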


[Figure 36: scatter plot of Lactate (mmol/L, 1.0–2.5) against Mean arterial pressure (mmHg, 40–160), grouped by lac.miss.tag (0/1); the imputed points all lie at 2.1 mmol/L.]

Figure 36 Distribution of missing values filled with the average.

original_data <- BostonHousing
set.seed(123)
BostonHousing[sample(1:nrow(BostonHousing), 80), "rad"] <- NA
BostonHousing[sample(1:nrow(BostonHousing), 80), "ptratio"] <- NA

Here we set 80 missing values each for rad and ptratio; the former is a factor variable and the latter is numerical. Next, we call the md.pattern() function of the mice package to observe the missing pattern and then decide how to further process the missing values.

install.packages('mice')
library(mice)
md.pattern(BostonHousing)
## crim zn indus chas nox rm age dis tax b lstat medv rad ptratio
## 362 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
## 64 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
## 64 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1
## 16 1 1 1 1 1 1 1 1 1 1 1 1 0 0 2
## 0 0 0 0 0 0 0 0 0 0 0 0 80 80 160

The md.pattern() function is used to view the missing value patterns in the dataset. In the results above, 0 indicates missing and 1 indicates not missing. It can be seen that there are 80 missing values in each of "rad" and "ptratio": 362 observations have no missing values, 64 observations are missing only rad, 64 are missing only ptratio, and 16 are missing both.

Similarly, we can interpolate the missing values in a simple way. First we need to download and load the Hmisc package; its impute() function performs automatic interpolation of missing values.

install.packages('Hmisc')
library(Hmisc)
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
##
## Attaching package: 'Hmisc'
## The following object is masked from 'package:psych':
##
## describe
## The following objects are masked from 'package:base':
##
## format.pval, units
im_mean <- impute(BostonHousing$ptratio, mean) #interpolate the mean
im_median <- impute(BostonHousing$ptratio, median) #interpolate the median
im_spe <-impute(BostonHousing$ptratio, 20) #interpolate a specified value
head(im_mean)
## 1 2 3 4 5 6
## 15.30000 17.80000 17.80000 18.70000 18.46033* 18.70000

In the code above, the author uses three simple methods to interpolate the missing values. A number marked with an asterisk indicates that the value is imputed rather than original. Obviously, this method is only applicable to continuous data, not to categorical data.

Then we focus on the advanced interpolation methods for missing values in the mice package. "mice" stands for Multivariate Imputation by Chained Equations. The mice package provides a variety of advanced missing value processing methods and uses a two-step interpolation: first the mice() function builds the model, then the complete() function generates the complete data. mice(df) returns multiple complete copies of df, each of which interpolates different values for the missing data, and the complete() function returns one (the default) or more of these data sets. The following demonstrates how to interpolate the two variables rad and ptratio using this method:

set.seed(123)
miceMod <- mice(BostonHousing[, !names(BostonHousing) %in% "medv"], method="rf")
##
## iter imp variable
## 1 1 rad ptratio
## 1 2 rad ptratio

© Annals of Translational Medicine. All rights reserved. Ann Transl Med 2019;7(23):796 | http://dx.doi.org/10.21037/atm.2019.08.63
Page 80 of 96 Zhou et al. Clinical prediction models with R

## 1 3 rad ptratio
## 1 4 rad ptratio
## 1 5 rad ptratio
## 2 1 rad ptratio
## 2 2 rad ptratio
## 2 3 rad ptratio
## 2 4 rad ptratio
## 2 5 rad ptratio
## 3 1 rad ptratio
## 3 2 rad ptratio
## 3 3 rad ptratio
## 3 4 rad ptratio
## 3 5 rad ptratio
## 4 1 rad ptratio
## 4 2 rad ptratio
## 4 3 rad ptratio
## 4 4 rad ptratio
## 4 5 rad ptratio
## 5 1 rad ptratio
## 5 2 rad ptratio
## 5 3 rad ptratio
## 5 4 rad ptratio
## 5 5 rad ptratio
# mice interpolation based on the random forest model; you need to install the 'randomForest' package in advance
miceoutput <- complete(miceMod) # generate the complete data
anyNA(miceoutput)
## [1] FALSE

The random forest algorithm is used here; you can skip the details. It is a common machine learning algorithm that is not covered in this book; we only want to show how an advanced statistical method can be used to interpolate missing values. Whether we choose the random forest algorithm or a simple linear regression algorithm, the principles differ but the goal is the same. The miceoutput object is the imputed data set, and the anyNA() function confirms that it contains no missing values, i.e., the missing values have been interpolated. Since we kept the original complete data set, we can use it to calculate the interpolation precision.

actuals <- original_data$rad[is.na(BostonHousing$rad)]
predicteds <- miceoutput[is.na(BostonHousing$rad), "rad"]
mean(actuals != predicteds)
## [1] 0.125

Readers can see that the error rate of the interpolation is 12.5%, which means that most of the missing values are interpolated correctly; this is obviously much more accurate than simply using mean interpolation. It should be emphasized that the regression method takes into account the dependence between the data, but the estimation of the variation of the imputed value is not accurate; the degree of variation can be adjusted by increasing the random error of the regression model (15,17).

In addition to the mice package, the VIM package described earlier can also interpolate missing values, providing advanced methods such as hot-deck, K-nearest neighbor, linear regression, and more. Readers can explore these on their own.

Brief summary

This Section systematically introduces methods of outlier identification, including a function for custom outlier recognition, although that function only applies to continuous variables. This Section also introduces several methods for filling in missing values. It is simple to interpolate with the mean, the median, or a specific value, while more complex methods such as linear regression and random forest can also be applied. No matter which method is used for interpolation, there is a certain risk, and readers need to be cautious when using it. In addition, missing-value recognition and imputation should be individualized; it is difficult to have a method that is universal.

Ridge regression and LASSO regression with R

Background

As the dimensions and depth of data become more complex, variable selection becomes more and more difficult. In Section 1, we mentioned that, from the perspective of clinicians, there are three types of current clinical predictive model research.
(I) Using traditional clinical features, pathological features, physical examination results and laboratory test results to construct clinical predictive models. Because these predictive variables are easily acquired in clinical practice, this type of model is more feasible than the other two types.
(II) With the development of imaging omics research methods, more and more researchers are aware that some manifestations or parameters of imaging can represent specific biological characteristics. Using these massive imaging parameters, whether color Doppler ultrasound, CT, MR, or PET parameters, combined with clinical features to construct clinical predictive models can often further improve the accuracy of the predictive model. This type of


model needs to be based on screening imaging omics features, so the preliminary workload is much larger than for the first type, and the imaging omics parameters are more abundant than the clinical features (62).
(III) With the widespread use of high-throughput biotechnology such as genomics and proteomics, clinical researchers are attempting to mine biomarker characteristics from these vast amounts of biological information to construct clinical predictive models. This type of predictive model is a good entry point from basic medicine to clinical medicine (63).
Because there are too many features related to "omics" in the second and third types of model, variable selection is very difficult, and it is very hard to use the traditional variable selection methods described in Section 2. So, is there a better solution? The answer is YES. The regularization method described in this Section is one of the solutions. Regularization can limit the regression coefficients and even reduce them to zero. Now, there are many algorithms, or combinations of algorithms, that can be used to implement regularization. In this Section, we will focus on ridge regression and LASSO regression.

Introduction of regularization

The general linear model is Y = β0 + β1X1 + … + βnXn + e, and the best fit attempts to minimize the residual sum of squares (RSS). RSS is the sum of the squared differences between the actual values and the estimated values, which can be expressed as e1² + e2² + … + en². With regularization, we add a new term to the minimization of RSS, called the shrinkage penalty. This penalty term contains a tuning parameter λ together with a normalized form of the β coefficients (weights); different regularization techniques normalize the weights in different ways. In brief, we replace RSS with RSS + λ(normalized coefficients) in the model, and we choose λ, the tuning parameter, during model building. If λ = 0, the model is equivalent to ordinary least squares (OLS), because the normalization term is canceled out (64).

What are the advantages of regularization? Firstly, the regularization method is computationally very efficient: we only need to fit one model for each λ, so the efficiency is greatly improved. Secondly, there is the bias/variance trade-off. In a linear model where the relationship between the response and the predictors is close to linear, the least squares estimate is almost unbiased but may have high variance, which means that small changes in the training set can lead to large changes in the least squares coefficient estimates. Regularization can select an appropriate λ to optimize the bias/variance trade-off and thereby improve the fit of the regression model. Finally, the regularization of the coefficients can also be used to solve the over-fitting problem caused by multicollinearity (64).

Introduction of ridge regression

We will briefly introduce what ridge regression is, what it can do, and what it cannot do. In ridge regression, the norm term is the sum of the squares of all coefficients, called the L2-norm. In the regression model, we try to minimize RSS + λΣβj². As λ increases, the regression coefficients β decrease, tending toward 0 but never equal to 0. The advantage of ridge regression is that it can improve prediction accuracy; but because it cannot set the coefficient of any variable exactly to zero, it is difficult to meet the requirement of reducing the number of variables, so there can be problems with model interpretability. To solve this problem, we can use the LASSO regression described below.

In addition, ridge regression is more often used to deal with collinearity in linear regression. It is generally believed that collinearity leads to over-fitting and very large parameter estimates. Adding a penalty function on the regression coefficients β to the least squares objective can solve this problem; the regularization idea is the same, so ridge regression can resolve it.

LASSO regression

Different from the L2-norm in ridge regression, LASSO regression uses the L1-norm, the sum of the absolute values of all variable weights; that is, it minimizes RSS + λΣ|βj|. This shrinkage penalty can shrink weights exactly to zero, which is a distinct advantage over ridge regression because it greatly increases the interpretability of the model.

If LASSO regression works so well, do we still need ridge regression? The answer is YES. When there is a high degree of collinearity or correlation, LASSO regression may delete some important predictive variables, which will lose predictive power of the model. For example, if


both variables A and B should be included in the prediction model, LASSO regression may shrink the coefficient of one of them to zero (64). So, ridge regression and LASSO regression are complementary to each other, and choosing a suitable method for the problem at hand is the key to statistical applications.

The currently published clinical prediction model articles using LASSO regression include two types: one is the screening of imaging omics features mentioned above, and the other is genomics screening. There are usually dozens to hundreds of omics features. These features cannot all be included in the predictive model, because if we did so the model would be very bloated and many features might not be related to the outcome, so it is sensible to use LASSO regression to reduce the features. Generally, tens to hundreds of features are reduced to several or a dozen, and then a score is calculated for each patient according to the regression coefficients of the LASSO regression equation and the value of each feature. This score is then converted to a categorical variable based on an appropriate cutoff value and included in the regression model as a predictive feature (62,63).

Case analysis

In the following examples, we use the glmnet package to select the appropriate variables and generate the corresponding models. The regularization techniques described above also apply to categorical outcomes, both binary and multinomial. We will introduce sample data that can be used for Logistic regression: the biopsy dataset of breast cancer in the MASS package. In regression models with quantitative response variables, regularization is an important technique for handling high-dimensional data sets; because the Logistic regression function has a linear component, the L1-norm and L2-norm regularization can be used with it as well.

[Example 1] analysis
[Example 1]
The biopsy dataset in the MASS package is a dataset from Wisconsin breast cancer patients. The aim is to determine whether a biopsy result is benign or malignant. The researchers used fine needle aspiration (FNA) techniques to collect samples and perform biopsies to determine the diagnosis (malignant or benign). Our task is to develop a predictive model that is as accurate as possible in determining the nature of the tumor. The data set contains tissue samples of 699 patients and is stored in a data frame with 11 variables. This data frame contains the following columns:
ID: sample code number (not unique).
V1: clump thickness.
V2: uniformity of cell size.
V3: uniformity of cell shape.
V4: marginal adhesion.
V5: single epithelial cell size.
V6: bare nuclei (16 values are missing).
V7: bland chromatin.
V8: normal nucleoli.
V9: mitoses.
class: "benign" or "malignant".

Data processing
We first load the MASS package and prepare the breast cancer data:

library(glmnet)
library(MASS)
biopsy$ID = NULL
names(biopsy) = c("thick", "u.size", "u.shape", "adhsn", "s.size", "nucl", "chrom", "n.nuc", "mit", "class")
biopsy.v2 <- na.omit(biopsy)
set.seed(123) # random number generator
ind <- sample(2, nrow(biopsy.v2), replace = TRUE, prob = c(0.7, 0.3))
train <- biopsy.v2[ind == 1, ] # the training data set
test <- biopsy.v2[ind == 2, ] # the test data set

Convert the data to generate the input matrix and the labels:

x <- as.matrix(train[, 1:9])
y <- train[, 10]

Ridge regression modeling
We first build the model using ridge regression and store the results in an object named ridge. Please note: the glmnet package standardizes the input values before the lambda sequence is calculated. We need to specify the distribution of the response variable as "binomial" because it is a binary outcome, and specify alpha = 0 to indicate ridge regression. The R code is as follows:

ridge <- glmnet(x, y, family = "binomial", alpha = 0)

This object contains all the information we need to evaluate the model. The first thing to do is use the print() function, which shows the number of nonzero regression coefficients, the percentage of deviance explained and the corresponding lambda value. The default number of computations in the package is 100, but if the improvement in the percentage of deviance between two lambda values is not significant, the algorithm stops before 100 computations; in other words, the algorithm converges to the optimal solution. We list all the lambda outcomes:


print(ridge)
##
## Call: glmnet(x = x, y = y, family = "binomial", alpha = 0)
##
## Df %Dev Lambda
## [1,] 9 -8.474e-16 395.10000
## [2,] 9 4.716e-03 360.00000
## [3,] 9 5.174e-03 328.00000
## [4,] 9 5.675e-03 298.90000
## [5,] 9 6.224e-03 272.30000
## [6,] 9 6.826e-03 248.10000
## [7,] 9 7.486e-03 226.10000
## [8,] 9 8.209e-03 206.00000
## [9,] 9 9.000e-03 187.70000
## [10,] 9 9.868e-03 171.00000
## [11,] 9 1.082e-02 155.80000
## [12,] 9 1.186e-02 142.00000
## [13,] 9 1.300e-02 129.40000
## [14,] 9 1.424e-02 117.90000
## [15,] 9 1.560e-02 107.40000
## [16,] 9 1.709e-02 97.86000
## [17,] 9 1.872e-02 89.17000
## [18,] 9 2.050e-02 81.25000
## [19,] 9 2.245e-02 74.03000
## [20,] 9 2.457e-02 67.45000
## [21,] 9 2.687e-02 61.46000
## [22,] 9 2.939e-02 56.00000
## [23,] 9 3.215e-02 51.02000
## [24,] 9 3.515e-02 46.49000
## [25,] 9 3.841e-02 42.36000
## [26,] 9 4.196e-02 38.60000
## [27,] 9 4.582e-02 35.17000
## [28,] 9 5.001e-02 32.05000
## [29,] 9 5.455e-02 29.20000
## [30,] 9 5.947e-02 26.60000
## [31,] 9 6.480e-02 24.24000
## [32,] 9 7.056e-02 22.09000
## [33,] 9 7.677e-02 20.13000
## [34,] 9 8.347e-02 18.34000
## [35,] 9 9.067e-02 16.71000
## [36,] 9 9.840e-02 15.22000
## [37,] 9 1.067e-01 13.87000
## [38,] 9 1.156e-01 12.64000
## [39,] 9 1.250e-01 11.52000
## [40,] 9 1.351e-01 10.49000
## [41,] 9 1.458e-01 9.56100
## [42,] 9 1.571e-01 8.71200
## [43,] 9 1.691e-01 7.93800
## [44,] 9 1.816e-01 7.23300
## [45,] 9 1.948e-01 6.59000
## [46,] 9 2.086e-01 6.00500
## [47,] 9 2.230e-01 5.47100
## [48,] 9 2.380e-01 4.98500
## [49,] 9 2.534e-01 4.54200
## [50,] 9 2.694e-01 4.13900
## [51,] 9 2.857e-01 3.77100
## [52,] 9 3.025e-01 3.43600
## [53,] 9 3.195e-01 3.13100
## [54,] 9 3.369e-01 2.85300
## [55,] 9 3.544e-01 2.59900
## [56,] 9 3.721e-01 2.36800
## [57,] 9 3.898e-01 2.15800
## [58,] 9 4.076e-01 1.96600
## [59,] 9 4.253e-01 1.79200
## [60,] 9 4.429e-01 1.63200
## [61,] 9 4.604e-01 1.48700
## [62,] 9 4.777e-01 1.35500
## [63,] 9 4.947e-01 1.23500
## [64,] 9 5.114e-01 1.12500
## [65,] 9 5.279e-01 1.02500
## [66,] 9 5.439e-01 0.93410
## [67,] 9 5.596e-01 0.85110
## [68,] 9 5.749e-01 0.77550
## [69,] 9 5.898e-01 0.70660
## [70,] 9 6.042e-01 0.64390
## [71,] 9 6.181e-01 0.58670
## [72,] 9 6.316e-01 0.53450
## [73,] 9 6.446e-01 0.48710
## [74,] 9 6.572e-01 0.44380
## [75,] 9 6.693e-01 0.40440
## [76,] 9 6.809e-01 0.36840
## [77,] 9 6.920e-01 0.33570
## [78,] 9 7.027e-01 0.30590
## [79,] 9 7.129e-01 0.27870
## [80,] 9 7.226e-01 0.25400
## [81,] 9 7.319e-01 0.23140
## [82,] 9 7.408e-01 0.21080
## [83,] 9 7.492e-01 0.19210
## [84,] 9 7.572e-01 0.17500
## [85,] 9 7.649e-01 0.15950
## [86,] 9 7.721e-01 0.14530
## [87,] 9 7.789e-01 0.13240
## [88,] 9 7.854e-01 0.12060
## [89,] 9 7.915e-01 0.10990
## [90,] 9 7.973e-01 0.10020
## [91,] 9 8.027e-01 0.09127
## [92,] 9 8.079e-01 0.08316
## [93,] 9 8.127e-01 0.07577
## [94,] 9 8.172e-01 0.06904
## [95,] 9 8.214e-01 0.06291
## [96,] 9 8.254e-01 0.05732
## [97,] 9 8.291e-01 0.05223
## [98,] 9 8.326e-01 0.04759
## [99,] 9 8.359e-01 0.04336
## [100,] 9 8.389e-01 0.03951

Taking the 100th row as an example, the number of non-zero regression coefficients, that is, the number of features contained in the model, is 9; in ridge regression this number is constant. You can also see that the percentage of deviance explained is 0.8389 and the tuning parameter


Figure 37 The relationship between the coefficient and the Log(λ).
Figure 38 The performance of this model on the test set (ROC curve, AUROC: 0.9974).

for this line is 0.03951. But for simplicity's sake, we will set lambda equal to 0.05 for the test set.

So, let's see graphically how the regression coefficients vary with lambda. Simply add the argument xvar = "lambda" to the plot() function:

plot(ridge, xvar = "lambda", label = TRUE)

This graph shows that as lambda falls, the shrinkage decreases and the absolute values of the coefficients increase (Figure 37). To look at the coefficients for a particular value of lambda, use the predict() function. Now, let's see what the coefficients are when lambda is 0.05. We specify the parameters s = 0.05 and type = "coefficients", so that the coefficients are given for that specific lambda value rather than interpolated from the lambda values on either side of it. The R code is as follows:

ridge.coef <- predict(ridge, s = 0.05, type = "coefficients")
ridge.coef
## 10 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -5.4997937
## thick 0.2116920
## u.size 0.1284357
## u.shape 0.1540309
## adhsn 0.1301851
## s.size 0.1665205
## nucl 0.1874988
## chrom 0.1821222
## n.nuc 0.1378914
## mit 0.1277047

It can be seen that a non-zero regression coefficient is obtained for all the features. Next, we validate on the test set; the features need to be transformed into matrix form, just as we did on the training set:

newx <- as.matrix(test[, 1:9])

The predict() function is then used to build an object named ridge.y, specifying the parameter type = "response" and a lambda value of 0.05. The R code is the following:

ridge.y <- predict(ridge, newx = newx, type = "response", s = 0.05)

By calculating the error and AUC, we can see the performance of this model on the test set:

library(InformationValue)
##
## Attaching package: 'InformationValue'
##
## confusionMatrix, precision, sensitivity, specificity
actuals <- ifelse(test$class == "malignant", 1, 0)
misClassError(actuals, ridge.y)
## [1] 0.0191
plotROC(actuals, ridge.y)

This misclassification rate is only 0.0191, indicating that the model has a high level of classification and prediction ability (Figure 38).

LASSO regression modeling
It is easy to run LASSO regression by changing one parameter of the ridge regression model: in the glmnet() function, alpha = 0 for ridge regression is changed to alpha = 1. Run the R code to see the output of the model and check all the fitting results:

lasso <- glmnet(x, y, family = "binomial", alpha = 1)
print(lasso)
##
## Call: glmnet(x = x, y = y, family = "binomial", alpha = 1)
##
## Df %Dev Lambda
## [1,] 0 -8.474e-16 0.3951000
## [2,] 3 1.005e-01 0.3600000
## [3,] 3 1.848e-01 0.3280000
## [4,] 3 2.558e-01 0.2989000


## [5,] 3 3.165e-01 0.2723000
## [6,] 3 3.691e-01 0.2481000
## [7,] 3 4.152e-01 0.2261000
## [8,] 4 4.568e-01 0.2060000
## [9,] 4 4.961e-01 0.1877000
## [10,] 5 5.317e-01 0.1710000
## [11,] 6 5.645e-01 0.1558000
## [12,] 6 5.940e-01 0.1420000
## [13,] 6 6.204e-01 0.1294000
## [14,] 6 6.442e-01 0.1179000
## [15,] 6 6.657e-01 0.1074000
## [16,] 7 6.852e-01 0.0978600
## [17,] 7 7.031e-01 0.0891700
## [18,] 7 7.194e-01 0.0812500
## [19,] 7 7.341e-01 0.0740300
## [20,] 7 7.474e-01 0.0674500
## [21,] 7 7.595e-01 0.0614600
## [22,] 7 7.705e-01 0.0560000
## [23,] 8 7.805e-01 0.0510200
## [24,] 8 7.897e-01 0.0464900
## [25,] 8 7.980e-01 0.0423600
## [26,] 8 8.055e-01 0.0386000
## [27,] 8 8.124e-01 0.0351700
## [28,] 8 8.186e-01 0.0320500
## [29,] 8 8.242e-01 0.0292000
## [30,] 8 8.292e-01 0.0266000
## [31,] 8 8.338e-01 0.0242400
## [32,] 8 8.380e-01 0.0220900
## [33,] 8 8.417e-01 0.0201300
## [34,] 8 8.450e-01 0.0183400
## [35,] 8 8.480e-01 0.0167100
## [36,] 8 8.507e-01 0.0152200
## [37,] 9 8.532e-01 0.0138700
## [38,] 9 8.555e-01 0.0126400
## [39,] 9 8.577e-01 0.0115200
## [40,] 9 8.596e-01 0.0104900
## [41,] 9 8.613e-01 0.0095610
## [42,] 9 8.628e-01 0.0087120
## [43,] 9 8.642e-01 0.0079380
## [44,] 9 8.654e-01 0.0072330
## [45,] 8 8.664e-01 0.0065900
## [46,] 8 8.673e-01 0.0060050
## [47,] 8 8.681e-01 0.0054710
## [48,] 8 8.688e-01 0.0049850
## [49,] 8 8.695e-01 0.0045420
## [50,] 8 8.700e-01 0.0041390
## [51,] 8 8.705e-01 0.0037710
## [52,] 8 8.709e-01 0.0034360
## [53,] 8 8.713e-01 0.0031310
## [54,] 8 8.716e-01 0.0028530
## [55,] 8 8.719e-01 0.0025990
## [56,] 8 8.721e-01 0.0023680
## [57,] 8 8.723e-01 0.0021580
## [58,] 8 8.725e-01 0.0019660
## [59,] 8 8.726e-01 0.0017920
## [60,] 8 8.728e-01 0.0016320
## [61,] 8 8.729e-01 0.0014870
## [62,] 8 8.730e-01 0.0013550
## [63,] 8 8.731e-01 0.0012350
## [64,] 8 8.731e-01 0.0011250
## [65,] 8 8.732e-01 0.0010250
## [66,] 8 8.732e-01 0.0009341
## [67,] 8 8.733e-01 0.0008511
## [68,] 8 8.733e-01 0.0007755
## [69,] 8 8.734e-01 0.0007066
## [70,] 8 8.734e-01 0.0006439
## [71,] 8 8.734e-01 0.0005867
## [72,] 8 8.734e-01 0.0005345
## [73,] 9 8.735e-01 0.0004871
## [74,] 9 8.735e-01 0.0004438
## [75,] 9 8.736e-01 0.0004044
## [76,] 9 8.736e-01 0.0003684
## [77,] 9 8.736e-01 0.0003357
## [78,] 9 8.737e-01 0.0003059
## [79,] 9 8.737e-01 0.0002787
## [80,] 9 8.737e-01 0.0002540
## [81,] 9 8.737e-01 0.0002314
## [82,] 9 8.737e-01 0.0002108
## [83,] 9 8.737e-01 0.0001921
## [84,] 9 8.737e-01 0.0001750

Notice that the model-building process stops after 84 steps, because the explained deviance no longer improves as lambda decreases. Notice also that the Df column varies with lambda: all nine features would be included in the model when lambda is 0.0001750. For the purpose of learning and practice, we first test a model with fewer features, such as the seven-feature model. The results above show that the model changes from seven features to eight when lambda drops below 0.0560, so we use this lambda value when evaluating the model on the test set.

As with ridge regression, the results can be plotted (Figure 39). The R code is as follows:

plot(lasso, xvar = "lambda", label = TRUE)

As for ridge regression, we can look at the coefficient values of the seven-feature model by passing the chosen lambda value to the predict() function. The R code is the following:

lasso.coef <- predict(lasso, s = 0.056, type = "coefficients")
lasso.coef
## 10 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -4.04233153
## thick 0.17983856
## u.size 0.12484046
## u.shape 0.12038690
## adhsn .


Figure 39 The relationship between the coefficient and the Log(λ).

## s.size 0.05802241
## nucl 0.24709063
## chrom 0.08831568
## n.nuc 0.07823788
## mit .

The LASSO algorithm shrinks the coefficients of the adhsn and mit variables to zero at s = 0.056. Here is how the LASSO model performs on the test set:

lasso.y <- predict(lasso, newx = newx, type = "response", s = 0.056)
actuals <- ifelse(test$class == "malignant", 1, 0)
misClassError(actuals, lasso.y)
## [1] 0.0383
plotROC(actuals, lasso.y)

This misclassification rate is only 0.0383, indicating that the model has a high level of classification and prediction ability (Figure 40).

Figure 40 The performance of this model on the test set (ROC curve, AUROC: 0.9968).

Cross-validation
In the function cv.glmnet(), set the value of family to "binomial", set the value of type.measure to the area under the ROC curve (auc), and use 5-fold cross-validation:

set.seed(123)
fitCV <- cv.glmnet(x, y, family = "binomial",
                   type.measure = "auc",
                   nfolds = 5)

Plotting fitCV, we can see the relationship between AUC and λ (Figure 41):

plot(fitCV)

Adding only one feature can result in a significant improvement in AUC. Let's look at the model coefficients at lambda.1se, the λ one standard error above the optimum:

fitCV$lambda.1se
## [1] 0.1293669
coef(fitCV, s = "lambda.1se")
## 10 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -2.52374214
## thick 0.07189973
## u.size 0.11901349
## u.shape 0.09179324
## adhsn .
## s.size .
## nucl 0.17732550
## chrom 0.02233980
## n.nuc 0.02830596
## mit .

It can be seen that the four selected features are thick, u.size, u.shape and nucl. As in the previous Section, we look at the performance of this model on the test set by error and AUC (Figure 42):

library(InformationValue)
##
## Attaching package: 'InformationValue'
predCV <- predict(fitCV, newx = as.matrix(test[, 1:9]),
                  s = "lambda.1se", type = "response")
actuals <- ifelse(test$class == "malignant", 1, 0)
misClassError(actuals, predCV)
## [1] 0.0478
plotROC(actuals, predCV)

The results show that the effect of this model is basically the same as that of the previous Logistic regression model; it seems that lambda.1se is not the optimal choice. Let's see whether the model selected with lambda.min can improve the prediction:

predCV.min <- predict(fitCV, newx = as.matrix(test[, 1:9]),
                      s = "lambda.min", type = "response")
misClassError(actuals, predCV.min)
## [1] 0.0239
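misClassError() in the InformationValue package compares the predicted probabilities against the actual labels at a probability cutoff (0.5 by default). A minimal base-R sketch of that calculation, using made-up labels and probabilities rather than the biopsy data:

```r
# Hypothetical labels and predicted probabilities, for illustration only
actuals_demo <- c(1, 0, 1, 1, 0, 0, 1, 0)
probs_demo <- c(0.91, 0.12, 0.35, 0.88, 0.07, 0.64, 0.79, 0.22)

# Classify at the default 0.5 cutoff, then take the proportion misclassified
pred_demo <- ifelse(probs_demo >= 0.5, 1, 0)
mean(pred_demo != actuals_demo)
## [1] 0.25
```

Here two of the eight hypothetical cases fall on the wrong side of the cutoff, giving an error of 0.25; the misclassification rates reported above are the same proportion computed over the test-set observations.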


Figure 41 Relationship between AUC and Log(λ).
Figure 42 The performance of this model on the test set (ROC curve, AUROC: 0.9965).

This misclassification rate is only 0.0239, indicating that the model has a high level of classification and prediction ability.

[Example 2] analysis
[Example 2]
The example is a prostate cancer data set. Although this data set is relatively small, with only 97 observations and 9 variables, it is enough for us to master the regularization methods and compare them with the traditional methods. Stanford University Medical Center provided prostate-specific antigen (PSA) data for 97 patients who underwent radical prostatectomy. Our goal is to establish a predictive model to predict postoperative PSA levels using the data from clinical examinations. PSA may be a more effective prognostic variable than other variables when predicting whether a patient can or should recover after surgery. After the operation, the doctor will check the patient's PSA level at regular intervals and determine whether the patient is recovering through various formulas. Preoperative predictive models and postoperative data (not provided here) work together to improve the diagnosis and prognosis of prostate cancer. The data set collected from 97 males is stored in a data frame with 10 variables, as follows:
(I) lcavol: logarithm of tumor volume;
(II) lweight: logarithm of prostate weight;
(III) age: patient age (years);
(IV) lbph: logarithm of the amount of benign prostatic hyperplasia (BPH), a non-cancerous prostatic hyperplasia;
(V) svi: seminal vesicle invasion, indicating whether the cancer cells have invaded the seminal vesicle through the prostate wall (1 = yes, 0 = no);
(VI) lcp: logarithm of capsular penetration, indicating the extent to which cancer cells have spread beyond the prostate capsule;
(VII) gleason: patient's Gleason score, given by a pathologist after biopsy, indicating the degree of variation of the cancer cells; the higher the score, the more dangerous the disease;
(VIII) pgg45: the percentage of Gleason patterns that are 4 or 5;
(IX) lpsa: logarithm of the PSA value; this is the outcome variable;
(X) train: a logical vector (TRUE or FALSE) that distinguishes the training data set from the test data set.

Data processing
This data set is included in the ElemStatLearn package of R. After loading the required packages and data frames, we can examine the possible connections between the variables, as follows:

library(ElemStatLearn) # contains the data
library(glmnet) # allows ridge regression, LASSO and elastic net
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-16
library(caret) # this will help identify the appropriate parameters
## Loading required package: lattice
## Loading required package: ggplot2

After loading the packages, bring up the prostate dataset and look at the data structure, as follows:

data(prostate)
str(prostate)
## 'data.frame': 97 obs. of 10 variables:
## $ lcavol : num -0.58 -0.994 -0.511 -1.204 0.751 ...
## $ lweight: num 2.77 3.32 2.69 3.28 3.43 ...
## $ age : int 50 58 74 58 62 50 64 58 47 63 ...


## $ lbph : num -1.39 -1.39 -1.39 -1.39 -1.39 ...
## $ svi : int 0 0 0 0 0 0 0 0 0 0 ...
## $ lcp : num -1.39 -1.39 -1.39 -1.39 -1.39 ...
## $ gleason: int 6 6 7 6 6 6 6 6 6 6 ...
## $ pgg45 : int 0 0 20 0 0 0 0 0 0 0 ...
## $ lpsa : num -0.431 -0.163 -0.163 -0.163 0.372 ...
## $ train : logi TRUE TRUE TRUE TRUE TRUE TRUE ...

Some problems need to be considered when examining the data structure. The first 10 observations of svi, lcp, gleason, and pgg45 have the same number, with one exception: the third observation of gleason. To ensure that these features are feasible as input features, we convert the gleason variable into a dichotomous variable, with 0 representing a score of 6 and 1 indicating a score of 7 or higher. Deleting the variable might cost the model predictive power, and missing values can also cause problems in the glmnet package. We can easily encode the indicator variable in a single line of code. Use the ifelse() command to specify the column you want to convert in the data frame and then convert according to this rule: if the value of the observation is x, encode it as y, otherwise encode it as z.

prostate$gleason <- ifelse(prostate$gleason == 6, 0, 1)
table(prostate$gleason)
##
## 0 1
## 35 62

Firstly, we establish a training data set and a test data set. Because there is already a variable in the data indicating whether an observation belongs to the training set, we can use the subset() function to assign the observations with a train value of TRUE to the training set, and the observations with a train value of FALSE to the test set. It is also necessary to remove the train feature because we don't want to use it as a predictive variable. As follows:

train <- subset(prostate, train == TRUE)[, 1:9]
str(train)
## 'data.frame': 67 obs. of 9 variables:
## $ lcavol : num -0.58 -0.994 -0.511 -1.204 0.751 ...
## $ lweight: num 2.77 3.32 2.69 3.28 3.43 ...
## $ age : int 50 58 74 58 62 50 58 65 63 63 ...
## $ lbph : num -1.39 -1.39 -1.39 -1.39 -1.39 ...
## $ svi : int 0 0 0 0 0 0 0 0 0 0 ...
## $ lcp : num -1.39 -1.39 -1.39 -1.39 -1.39 ...
## $ gleason: num 0 0 1 0 0 0 0 0 0 1 ...
## $ pgg45 : int 0 0 20 0 0 0 0 0 0 30 ...
## $ lpsa : num -0.431 -0.163 -0.163 -0.163 0.372 ...

test = subset(prostate, train == FALSE)[, 1:9]
str(test)
## 'data.frame': 30 obs. of 9 variables:
## $ lcavol : num 0.737 -0.777 0.223 1.206 2.059 ...
## $ lweight: num 3.47 3.54 3.24 3.44 3.5 ...
## $ age : int 64 47 63 57 60 69 68 67 65 54 ...
## $ lbph : num 0.615 -1.386 -1.386 -1.386 1.475 ...
## $ svi : int 0 0 0 0 0 0 0 0 0 0 ...
## $ lcp : num -1.386 -1.386 -1.386 -0.431 1.348 ...
## $ gleason: num 0 0 0 1 1 0 0 1 0 0 ...
## $ pgg45 : int 0 0 0 5 20 0 0 20 0 0 ...
## $ lpsa : num 0.765 1.047 1.047 1.399 1.658 ...

Ridge regression model

In ridge regression the model includes all eight features, so a comparison between the ridge regression model and the optimal subset model is expected. We use the package glmnet. This package requires the input variables to be stored in a matrix rather than in a data frame. The syntax for ridge regression is glmnet(x = input matrix, y = response variable, family = distribution function, alpha = 0). When alpha is 0, ridge regression is performed; when alpha is 1, LASSO regression is performed. It is also easy to prepare the training set data for glmnet: use the as.matrix() function to process the input data, and create a vector as the response variable, as shown below:

x <- as.matrix(train[, 1:8])
y <- train[, 9]

Now we can run ridge regression. We save the result in an object and give the object an appropriate name, such as ridge. There is a very important point to note: the glmnet package will standardize the input before calculating the λ values and then return the coefficients on the original (non-standardized) scale. We need to specify the distribution of the response variable as gaussian because it is continuous, and specify alpha = 0 for ridge regression. As follows:

ridge <- glmnet(x, y, family = "gaussian", alpha = 0)

This object contains all the information that we need to evaluate the model. First try the print() function, which will show the number of non-zero coefficients, the fraction of deviance explained, and the corresponding λ value. The default number of λ values computed by the algorithm is 100, but if the improvement between two λ values is not significant, the algorithm stops before 100 steps; that is, the algorithm has converged to the optimal solution. All the λ results are listed below:

print(ridge)
##
## Call: glmnet(x = x, y = y, family = "gaussian", alpha = 0)


##
## Df %Dev Lambda
## [1,] 8 3.801e-36 878.90000
## [2,] 8 5.591e-03 800.80000
## [3,] 8 6.132e-03 729.70000
## [4,] 8 6.725e-03 664.80000
## [5,] 8 7.374e-03 605.80000
## [6,] 8 8.086e-03 552.00000
## [7,] 8 8.865e-03 502.90000
## [8,] 8 9.718e-03 458.20000
## [9,] 8 1.065e-02 417.50000
## [10,] 8 1.168e-02 380.40000
## [11,] 8 1.279e-02 346.60000
## [12,] 8 1.402e-02 315.90000
## [13,] 8 1.536e-02 287.80000
## [14,] 8 1.682e-02 262.20000
## [15,] 8 1.842e-02 238.90000
## [16,] 8 2.017e-02 217.70000
## [17,] 8 2.208e-02 198.40000
## [18,] 8 2.417e-02 180.70000
## [19,] 8 2.644e-02 164.70000
## [20,] 8 2.892e-02 150.10000
## [21,] 8 3.163e-02 136.70000
## [22,] 8 3.457e-02 124.60000
## [23,] 8 3.777e-02 113.50000
## [24,] 8 4.126e-02 103.40000
## [25,] 8 4.504e-02 94.24000
## [26,] 8 4.915e-02 85.87000
## [27,] 8 5.360e-02 78.24000
## [28,] 8 5.842e-02 71.29000
## [29,] 8 6.364e-02 64.96000
## [30,] 8 6.928e-02 59.19000
## [31,] 8 7.536e-02 53.93000
## [32,] 8 8.191e-02 49.14000
## [33,] 8 8.896e-02 44.77000
## [34,] 8 9.652e-02 40.79000
## [35,] 8 1.046e-01 37.17000
## [36,] 8 1.133e-01 33.87000
## [37,] 8 1.225e-01 30.86000
## [38,] 8 1.324e-01 28.12000
## [39,] 8 1.428e-01 25.62000
## [40,] 8 1.539e-01 23.34000
## [41,] 8 1.655e-01 21.27000
## [42,] 8 1.778e-01 19.38000
## [43,] 8 1.907e-01 17.66000
## [44,] 8 2.041e-01 16.09000
## [45,] 8 2.181e-01 14.66000
## [46,] 8 2.327e-01 13.36000
## [47,] 8 2.477e-01 12.17000
## [48,] 8 2.631e-01 11.09000
## [49,] 8 2.790e-01 10.10000
## [50,] 8 2.951e-01 9.20700
## [51,] 8 3.115e-01 8.38900
## [52,] 8 3.281e-01 7.64400
## [53,] 8 3.447e-01 6.96500
## [54,] 8 3.614e-01 6.34600
## [55,] 8 3.780e-01 5.78200
## [56,] 8 3.945e-01 5.26900
## [57,] 8 4.108e-01 4.80100
## [58,] 8 4.268e-01 4.37400
## [59,] 8 4.424e-01 3.98600
## [60,] 8 4.576e-01 3.63200
## [61,] 8 4.724e-01 3.30900
## [62,] 8 4.866e-01 3.01500
## [63,] 8 5.003e-01 2.74700
## [64,] 8 5.134e-01 2.50300
## [65,] 8 5.260e-01 2.28100
## [66,] 8 5.380e-01 2.07800
## [67,] 8 5.493e-01 1.89300
## [68,] 8 5.601e-01 1.72500
## [69,] 8 5.703e-01 1.57200
## [70,] 8 5.800e-01 1.43200
## [71,] 8 5.891e-01 1.30500
## [72,] 8 5.976e-01 1.18900
## [73,] 8 6.057e-01 1.08400
## [74,] 8 6.133e-01 0.98730
## [75,] 8 6.204e-01 0.89960
## [76,] 8 6.270e-01 0.81960
## [77,] 8 6.333e-01 0.74680
## [78,] 8 6.391e-01 0.68050
## [79,] 8 6.445e-01 0.62000
## [80,] 8 6.496e-01 0.56500
## [81,] 8 6.543e-01 0.51480
## [82,] 8 6.587e-01 0.46900
## [83,] 8 6.628e-01 0.42740
## [84,] 8 6.666e-01 0.38940
## [85,] 8 6.701e-01 0.35480
## [86,] 8 6.733e-01 0.32330
## [87,] 8 6.763e-01 0.29460
## [88,] 8 6.790e-01 0.26840
## [89,] 8 6.815e-01 0.24460
## [90,] 8 6.838e-01 0.22280
## [91,] 8 6.859e-01 0.20300
## [92,] 8 6.877e-01 0.18500
## [93,] 8 6.894e-01 0.16860
## [94,] 8 6.909e-01 0.15360
## [95,] 8 6.923e-01 0.13990
## [96,] 8 6.935e-01 0.12750
## [97,] 8 6.946e-01 0.11620
## [98,] 8 6.955e-01 0.10590
## [99,] 8 6.964e-01 0.09646
## [100,] 8 6.971e-01 0.08789

Take line 100 as an example. The number of non-zero coefficients, that is, the number of variables included in the model, is 8. Remember that in ridge regression this number is constant. It can also be seen that the fraction of deviance explained is 0.6971 and the value of the tuning parameter λ is 0.08789. Here we can decide which λ to use on the test set. This λ value should be 0.08789, but for simplicity we can try 0.10 on the test set.
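Before turning to the charts, it may help to see the algebra that a ridge fit is based on: with centered predictors, the penalized least-squares estimate solves (X'X + λI)β = X'y. The base-R sketch below uses toy data and our own function name; it is a simplification of what glmnet does internally (glmnet also standardizes the predictors and computes a whole path of λ values):

```r
# Ridge coefficients from the penalized normal equations:
# (X'X + lambda*I) beta = X'y, with centered predictors and response.
set.seed(1)
n <- 50; p <- 3
x <- matrix(rnorm(n * p), n, p)
y <- x %*% c(1, -0.5, 0) + rnorm(n)

ridge_beta <- function(x, y, lambda) {
  xc <- scale(x, center = TRUE, scale = FALSE)  # center predictors
  yc <- y - mean(y)                             # center response
  solve(crossprod(xc) + diag(lambda, ncol(xc)), crossprod(xc, yc))
}

b0 <- ridge_beta(x, y, 0)   # lambda = 0: reproduces the OLS slopes
b5 <- ridge_beta(x, y, 5)   # lambda = 5: same variables, shrunken values
max(abs(b0 - coef(lm(y ~ x))[-1]))  # essentially zero
```

As in the glmnet output above, no coefficient is dropped as λ grows; the whole coefficient vector is only shrunk toward zero.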


At this point, some charts are very useful. Let's take a look at the default chart in the package. Set label = TRUE to annotate the curve as follows:

plot(ridge, label = TRUE)

In the default graph, the Y-axis is the regression coefficient and the X-axis is the L1 norm. The relationship between the coefficients and the L1 norm is shown in Figure 43. There is another X-axis above the graph, and the numbers on it represent the number of features in the model.

Figure 43 The relationship between the coefficient and the L1 norm.

We can also see how the coefficients change with λ. Just adjust the plot() call slightly with the parameter xvar = "lambda". Another option is to replace lambda with dev and see how the coefficients vary with the fraction of deviance explained.

plot(ridge, xvar = "lambda", label = TRUE)

Figure 44 The relationship between the coefficient and the Log(λ).

This graph shows that as λ decreases, the shrinkage penalty decreases and the absolute values of the coefficients increase (Figure 44). We can use the predict() function to see the coefficient values at a specific λ. If we want to know the coefficient values when λ is 0.1, we specify the parameter s = 0.1 and the parameter type = "coefficients"; we should use a λ value that glmnet() used when fitting the model, rather than a value interpolated from the λ values on either side. As follows:

ridge.coef <- predict(ridge, s = 0.1, type = "coefficients")
ridge.coef
## 9 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 0.130475478
## lcavol 0.457279371
## lweight 0.645792042
## age -0.017356156
## lbph 0.122497573
## svi 0.636779442
## lcp -0.104712451
## gleason 0.346022979
## pgg45 0.004287179

It is important to note that the coefficients of the age, lcp, and pgg45 variables are very close to zero, but not yet zero. Don't forget to look at the relationship between the deviance and the coefficients:

plot(ridge, xvar = "dev", label = TRUE)

Compared with the previous two graphs, from this graph we can see that as λ decreases, the coefficients and the fraction of deviance explained increase (Figure 45). If the λ value is 0, the shrinkage penalty is ignored and the model becomes equivalent to OLS. In order to prove this on the test set, we need to convert the features as we did on the training set:

newx <- as.matrix(test[, 1:8])

Then we use the predict() function to create an object named ridge.y, specifying the parameter type = "response" and a λ value of 0.10, and draw a statistical graph of the relationship between the predicted and actual values, as shown below:

ridge.y = predict(ridge, newx = newx, type = "response", s = 0.1)
plot(ridge.y, test$lpsa, xlab = "Predicted",
ylab = "Actual", main = "Ridge Regression")

The graph shows the relationship between the predicted and actual values in the ridge regression (Figure 46). Similarly, there are two interesting outliers at the larger values of the PSA measurement. In practical situations,


we suggest a more in-depth study of the outliers, to find out whether they are really different from the other data or whether we have missed something. A comparison with the MSE benchmark may tell us something different. We can calculate the residuals first, then the mean of the squared residuals:

ridge.resid <- ridge.y - test$lpsa
mean(ridge.resid^2)
## [1] 0.4783559

MSE = 0.4783559 in the ridge regression. Now we test LASSO and see if we can reduce the error.

Figure 45 The relationship between the coefficient and the fraction of deviance explained.

Figure 46 The relationship between predicted and actual values in the ridge regression.

LASSO regression model

Running LASSO is very simple: we just need to change one parameter of the ridge regression model, that is, change alpha = 0 to alpha = 1 in the glmnet() call. Run the code to see the output of the model and check all the fitting results:

lasso <- glmnet(x, y, family = "gaussian", alpha = 1)
print(lasso)
##
## Call: glmnet(x = x, y = y, family = "gaussian", alpha = 1)
##
## Df %Dev Lambda
## [1,] 0 0.00000 0.878900
## [2,] 1 0.09126 0.800800
## [3,] 1 0.16700 0.729700
## [4,] 1 0.22990 0.664800
## [5,] 1 0.28220 0.605800
## [6,] 1 0.32550 0.552000
## [7,] 1 0.36150 0.502900
## [8,] 1 0.39140 0.458200
## [9,] 2 0.42810 0.417500
## [10,] 2 0.45980 0.380400
## [11,] 3 0.48770 0.346600
## [12,] 3 0.51310 0.315900
## [13,] 4 0.53490 0.287800
## [14,] 4 0.55570 0.262200
## [15,] 4 0.57300 0.238900
## [16,] 4 0.58740 0.217700
## [17,] 4 0.59930 0.198400
## [18,] 5 0.61170 0.180700
## [19,] 5 0.62200 0.164700
## [20,] 5 0.63050 0.150100
## [21,] 5 0.63760 0.136700
## [22,] 5 0.64350 0.124600
## [23,] 5 0.64840 0.113500
## [24,] 5 0.65240 0.103400
## [25,] 6 0.65580 0.094240
## [26,] 6 0.65870 0.085870
## [27,] 6 0.66110 0.078240
## [28,] 6 0.66310 0.071290
## [29,] 7 0.66630 0.064960
## [30,] 7 0.66960 0.059190
## [31,] 7 0.67240 0.053930
## [32,] 7 0.67460 0.049140
## [33,] 7 0.67650 0.044770
## [34,] 8 0.67970 0.040790
## [35,] 8 0.68340 0.037170
## [36,] 8 0.68660 0.033870
## [37,] 8 0.68920 0.030860
## [38,] 8 0.69130 0.028120
## [39,] 8 0.69310 0.025620
## [40,] 8 0.69460 0.023340
## [41,] 8 0.69580 0.021270
## [42,] 8 0.69680 0.019380
## [43,] 8 0.69770 0.017660
## [44,] 8 0.69840 0.016090
## [45,] 8 0.69900 0.014660
## [46,] 8 0.69950 0.013360
## [47,] 8 0.69990 0.012170
## [48,] 8 0.70020 0.011090
## [49,] 8 0.70050 0.010100
## [50,] 8 0.70070 0.009207
## [51,] 8 0.70090 0.008389
## [52,] 8 0.70110 0.007644


## [53,] 8 0.70120 0.006965
## [54,] 8 0.70130 0.006346
## [55,] 8 0.70140 0.005782
## [56,] 8 0.70150 0.005269
## [57,] 8 0.70150 0.004801
## [58,] 8 0.70160 0.004374
## [59,] 8 0.70160 0.003986
## [60,] 8 0.70170 0.003632
## [61,] 8 0.70170 0.003309
## [62,] 8 0.70170 0.003015
## [63,] 8 0.70170 0.002747
## [64,] 8 0.70180 0.002503
## [65,] 8 0.70180 0.002281
## [66,] 8 0.70180 0.002078
## [67,] 8 0.70180 0.001893
## [68,] 8 0.70180 0.001725
## [69,] 8 0.70180 0.001572

Note that the model-building process stops after step 69 because the fraction of deviance explained no longer improves as λ decreases. It should also be noted that the Df column varies with λ. When the λ value is 0.001572, all eight variables are included in the model. However, for testing purposes, we first use a model with fewer variables, such as the 7-variable model. From the result rows shown above, we can see that the model changes from 7 to 8 variables when the λ value is approximately 0.045. Therefore, this λ value should be used when evaluating the model on the test set.

Like the ridge regression, we can draw the results in a graph. As follows:

plot(lasso, xvar = "lambda", label = TRUE)

This graph shows how LASSO works (Figure 47). Note the curves labeled 8, 3, and 6, which correspond to the variables pgg45, age, and lcp, respectively. It seems that lcp stays at zero until it is included in the model as the last variable.

Figure 47 The relationship between the coefficient and the Log(λ) in the LASSO regression.

We can calculate the coefficient values of the 7-variable model by the same operation as in the ridge regression, putting the λ value into the predict() function. As follows:

lasso.coef <- predict(lasso, s = 0.045, type = "coefficients")
lasso.coef
## 9 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -0.1305900670
## lcavol 0.4479592050
## lweight 0.5910476764
## age -0.0073162861
## lbph 0.0974103575
## svi 0.4746790830
## lcp .
## gleason 0.2968768129
## pgg45 0.0009788059

The LASSO algorithm zeros out the coefficient of lcp when the λ value is 0.045. Below is the performance of the LASSO model on the test set (Figure 48):

lasso.y <- predict(lasso, newx = newx,
type = "response", s = 0.045)
plot(lasso.y, test$lpsa, xlab = "Predicted", ylab = "Actual",
main = "LASSO")

Calculate the value of the MSE as below:

lasso.resid <- lasso.y - test$lpsa
mean(lasso.resid^2)
## [1] 0.4437209

It seems that our statistical chart is much the same as above, but the MSE value shows a minor improvement. Any major improvement would have to come from an elastic net. To perform elastic net modeling, we can continue to use the glmnet package. The adjustment is that we must solve not only the λ value but also the elastic net parameter α. Remember that α = 0 represents the ridge regression penalty, α = 1 represents LASSO, and the elastic net has 0 ≤ α ≤ 1. Solving two different parameters at the same time can be cumbersome and confusing, but we can resort to the caret package in R.

Cross-validation

Now we try K-fold cross-validation. The glmnet package uses 10-fold cross-validation by default when estimating the λ value with cv.glmnet(). In K-fold cross-validation, the data is divided into k subsets (folds) of equal size; each time, k-1 subsets are used to fit the model and the remaining subset is used as the test set, and finally the k results are combined (generally by averaging) to determine the final parameters.
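The fold bookkeeping that cv.glmnet() automates can be sketched in a few lines of base R. The example below is ours, with simulated data and lm() as a stand-in model, for illustration only; cv.glmnet() performs the same split, fit, and average cycle for every candidate λ:

```r
# Manual K-fold cross-validation: assign folds, hold each one out in
# turn, fit on the rest, and average the held-out errors.
set.seed(123)
n <- 60
dat <- data.frame(x = rnorm(n))
dat$y <- 1 + 2 * dat$x + rnorm(n)

k <- 3
fold <- sample(rep(1:k, length.out = n))  # random fold labels 1..k

cv_mse <- sapply(1:k, function(i) {
  fit <- lm(y ~ x, data = dat[fold != i, ])         # fit on k-1 folds
  pred <- predict(fit, newdata = dat[fold == i, ])  # predict held-out fold
  mean((dat$y[fold == i] - pred)^2)
})
mean(cv_mse)  # the averaged error used to compare candidate models
```

Each observation appears in exactly one held-out fold, which is what makes the averaged error an honest estimate of out-of-sample performance.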


In this method, each subset is used only once as a test set. It is very easy to use K-fold cross-validation in the glmnet package. The results include the λ value for each fit and the corresponding MSE. The default setting is α = 1, so if we want to try ridge regression or an elastic net, we must specify the α value. Because we want to see as few input variables as possible, we keep the default setting, but because of the amount of data in the training set, we use only 3 folds:

set.seed(123)
lasso.cv = cv.glmnet(x, y, nfolds = 3)
plot(lasso.cv)

Figure 48 The relationship between predicted and actual values in the LASSO regression.

Figure 49 The relationship between the logarithm of λ and the mean squared error in the LASSO regression.

The CV statistical chart is quite different from the other charts in glmnet: it shows the relationship between the logarithm of λ and the mean squared error, together with the number of variables in the model (Figure 49). The two vertical dashed lines in the chart mark the log(λ) of the minimum MSE (left dashed line) and the log(λ) at one standard error from the minimum. If overfitting is a problem, then the position one standard error away from the minimum is a very good starting point. We can also get the two specific values of λ. As follows:

lasso.cv$lambda.min # minimum
## [1] 0.00189349
lasso.cv$lambda.1se # one standard error away
## [1] 0.08586749

Using lambda.1se, we can complete the following process: view the coefficients and perform model validation on the test set:

coef(lasso.cv, s = "lambda.1se")
## 9 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -0.3080148498
## lcavol 0.4416782463
## lweight 0.5300563493
## age .
## lbph 0.0666015918
## svi 0.4194205799
## lcp .
## gleason 0.2475400081
## pgg45 0.0001654219

lasso.y.cv = predict(lasso.cv, newx = newx, type = "response",
s = "lambda.1se")
lasso.cv.resid = lasso.y.cv - test$lpsa
mean(lasso.cv.resid^2)
## [1] 0.4455302

The error of this model is 0.45, with the coefficients of age and lcp set to zero and pgg45 nearly zero. We have obtained three different models through the analysis of this data set. The errors of these models on the test set are as below:
(I) Ridge regression model: 0.48;
(II) LASSO model: 0.44;
(III) LASSO cross-validation model: 0.45.
Considering only the error, the LASSO model, which includes seven features, is the best. But can this optimal model answer the question we are trying to address? By cross-validation we obtained a model with a λ value of about 0.125, which is simpler and may be more suitable; we would prefer it because it is more interpretable. Expertise from oncologists, urologists, and pathologists is needed to help us figure out what makes the most sense. This is true, but it also requires more data: with the sample size of this example, merely changing the random number seed or re-dividing the training and test sets may result in a large change in the results. In the end, such results may not only provide no answers but may also raise more questions.
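The different behaviour of the three models reflects how the two penalties act on an individual coefficient. In the simplified case of a single standardized predictor (an orthonormal design), ridge rescales the OLS estimate while LASSO soft-thresholds it, which is why LASSO can set a coefficient such as lcp exactly to zero. The numbers and function names below are hypothetical, for illustration only:

```r
# One-dimensional shrinkage rules under an orthonormal design:
# ridge rescales the OLS estimate; LASSO soft-thresholds it.
ridge_shrink   <- function(b, lambda) b / (1 + lambda)
soft_threshold <- function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)

b_ols <- c(2.0, 0.3, -0.05)  # hypothetical OLS estimates
ridge_shrink(b_ols, 0.5)     # all shrunk, none exactly zero
soft_threshold(b_ols, 0.1)   # the smallest estimate is set exactly to zero
```

This is why ridge kept all eight variables at every λ while the LASSO Df column shrank as λ grew.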


Brief summary

The aim of this section is to introduce how to apply advanced feature selection techniques to linear models through a prostate dataset with a small amount of data. The dependent variable of the data set is quantitative, but the glmnet package we use also supports qualitative dependent variables (binomial and multinomial) and survival outcome data. We introduced regularization, applied these techniques to build models, and then compared them. Regularization is a powerful technique that improves computational efficiency and extracts more meaningful features than other modeling techniques. In addition, we can use the caret package to optimize multiple parameters while training the model.

The data used in this article can be found online at: http://cdn.amegroups.cn/static/application/1091c788c0342c498b882bd963c5aafb/2019.08.63-1.zip

Acknowledgments

Funding: This work was partly financially supported by the National Natural Science Foundation of China (grant numbers: 81774146, 81602668 and 81760423), Beijing NOVA Programme (grant numbers: xxjh2015A093 and Z1511000003150125), the Shanghai Youth Medical Talents-Specialist Program, and the Shanghai Sailing Program (grant number: 16YF1401700).

Footnote

Conflicts of Interest: The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.




Cite this article as: Zhou ZR, Wang WW, Li Y, Jin KR, Wang XY, Wang ZW, Chen YS, Wang SJ, Hu J, Zhang HN, Huang P, Zhao GZ, Chen XX, Li B, Zhang TS. In-depth mining of clinical data: the construction of clinical prediction model with R. Ann Transl Med 2019;7(23):796. doi: 10.21037/atm.2019.08.63
