
Letters

https://doi.org/10.1038/s41591-019-0447-x

Corrected: Author Correction

End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography

Diego Ardila1,5, Atilla P. Kiraly1,5, Sujeeth Bharadwaj1,5, Bokyung Choi1,5, Joshua J. Reicher2, Lily Peng1, Daniel Tse1*, Mozziyar Etemadi3, Wenxing Ye1, Greg Corrado1, David P. Naidich4 and Shravya Shetty1

With an estimated 160,000 deaths in 2018, lung cancer is the most common cause of cancer death in the United States1. Lung cancer screening using low-dose computed tomography has been shown to reduce mortality by 20–43% and is now included in US screening guidelines1–6. Existing challenges include inter-grader variability and high false-positive and false-negative rates7–10. We propose a deep learning algorithm that uses a patient's current and prior computed tomography volumes to predict the risk of lung cancer. Our model achieves a state-of-the-art performance (94.4% area under the curve) on 6,716 National Lung Cancer Screening Trial cases, and performs similarly on an independent clinical validation set of 1,139 cases. We conducted two reader studies. When prior computed tomography imaging was not available, our model outperformed all six radiologists with absolute reductions of 11% in false positives and 5% in false negatives. Where prior computed tomography imaging was available, the model performance was on-par with the same radiologists. This creates an opportunity to optimize the screening process via computer assistance and automation. While the vast majority of patients remain unscreened, we show the potential for deep learning models to increase the accuracy, consistency and adoption of lung cancer screening worldwide.

In 2013, the United States Preventive Services Task Force recommended low-dose computed tomography (LDCT) lung cancer screening in high-risk populations based on reported improved mortality in the National Lung Cancer Screening Trial (NLST)2–5. In 2014, the American College of Radiology published the Lung-RADS guidelines for LDCT lung cancer screening, to standardize image interpretation by radiologists and dictate management recommendations1,6. Evaluation is based on a variety of image findings, but primarily nodule size, density and growth6. At screening sites, Lung-RADS and other models such as PanCan are used to determine malignancy risk ratings that drive recommendations for clinical management11,12. Improving the sensitivity and specificity of lung cancer screening is imperative because of the high clinical and financial costs of missed diagnosis, late diagnosis and unnecessary biopsy procedures resulting from false negatives and false positives5,13–17. Despite improved consistency, persistent inter-grader variability and incomplete characterization of comprehensive imaging findings remain as limitations7–10 of Lung-RADS. These limitations suggest opportunities for more sophisticated systems to improve performance and inter-reader consistency18,19. Deep learning approaches offer the exciting potential to automate more complex image analysis, detect subtle holistic imaging findings and unify methodologies for image evaluation20.

A variety of software devices have been approved by the Food and Drug Administration (FDA) with the goal of addressing workflow efficiency and performance through augmented detection of lung nodules on lung computed tomography (CT)21. Clinical research has primarily focused on either nodule detection or diagnostic support for lesions manually selected by imaging experts22–27. Nodule detection systems were engineered with the goal of improving radiologist sensitivity in identifying nodules while minimizing costs to specificity, thereby falling into the category of computer-aided detection (CADe)28. This approach highlights small nodules, leaving malignancy risk evaluation and clinical decision making to the clinician. Diagnostic support for pre-identified lesions is included in computer-aided diagnosis (CADx) platforms, which are primarily aimed at improving specificity. CADx has gained greater interest and even first regulatory approvals in other areas of radiology, though not in lung cancer at the time of manuscript preparation29.

To move beyond the limitations of prior CADe and CADx approaches, we aimed to build an end-to-end approach performing both localization and lung cancer risk categorization tasks using the input CT data alone. More specifically, we were interested in replicating a more complete part of a radiologist's workflow, including full assessment of the LDCT volume, focus on regions of concern, comparison to prior imaging when available and calibration against biopsy-confirmed outcomes.

Another important high-level decision in our approach was to learn features using deep convolutional neural networks (CNN), rather than using hand-engineered features such as texture features or specific Hounsfield unit values. We chose to learn features because this approach has repeatedly been shown superior to hand-engineered features in many open computer vision competitions in the past five years30,31, including the Kaggle 2017 Data Science Bowl which used NLST data32.

There were three key components in our new approach (Fig. 1).

1Google AI, Mountain View, CA, USA. 2Stanford Health Care and Palo Alto Veterans Affairs, Palo Alto, CA, USA. 3Northwestern Medicine, Chicago, IL, USA. 4New York University-Langone Medical Center, Center for Biological Imaging, New York City, NY, USA. 5These authors contributed equally: Diego Ardila, Atilla P. Kiraly, Sujeeth Bharadwaj, Bokyung Choi. *e-mail: tsed@google.com

Fig. 1 | Overall modeling framework. For each patient, the model uses a primary LDCT volume and, if available, a prior LDCT volume as input. The model
then analyzes suspicious and volumetric ROIs as well as the whole-LDCT volume and outputs an overall malignancy prediction for the case, a risk bucket
score (LUMAS) and localization for predicted cancerous nodules.
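To make the framework in Fig. 1 concrete, the following is a minimal structural sketch of the dataflow, not the authors' released code: every function below is a hypothetical stand-in for a learned component, and the internals are placeholders chosen only to show how the outputs fit together.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the learned components; internals are placeholders.
def detect_candidate_rois(volume, top_k=2):
    """Cancer ROI detection model: returns the top_k suspicious sub-volumes."""
    return [volume[:32, :32, :32], volume[-32:, -32:, -32:]][:top_k]

def full_volume_features(volume):
    """Full-volume model: a global feature vector from the whole (resampled) LDCT."""
    return np.tanh(rng.normal(size=1024) * 0.01)

def roi_malignancy_probability(roi, global_features):
    """Cancer risk prediction model: per-ROI probability from local and global features."""
    x = roi.mean() + global_features.mean()
    return float(1.0 / (1.0 + np.exp(-x)))

def predict_case(current_volume):
    """Case-level malignancy score; regions matched from a prior scan, when
    available, would be fed to the risk model alongside each current ROI."""
    rois = detect_candidate_rois(current_volume)
    feats = full_volume_features(current_volume)
    probs = [roi_malignancy_probability(r, feats) for r in rois]
    # 'noisy-or' over candidate ROIs (see Methods): the case is malignant
    # if any single candidate ROI is malignant.
    return 1.0 - float(np.prod([1.0 - p for p in probs]))

print(predict_case(rng.normal(size=(128, 128, 128))))
```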

First, we constructed a three-dimensional (3D) CNN model that performs end-to-end analysis of whole-CT volumes, using LDCT volumes with pathology-confirmed cancer as training data (the 'full-volume model').

Second, we trained a CNN region-of-interest (ROI) detection model to detect 3D cancer candidate regions in the CT volume (the 'cancer ROI detection model'). We collected additional bounding box labels to train this model.

Third, we developed a CNN cancer risk prediction model that operates on outputs from both the cancer ROI detection model and the full-volume model. This can also incorporate regions from a patient's previous scans, which is accomplished by assessing regions in prior scans corresponding to the cancer candidate regions in the current scan, and then assigning a case-level malignancy score (we use the term 'case' to refer to a single patient visit, which could contain multiple CT volumes). This component was also trained on case-level, pathology-confirmed cancer labels (see Methods, Model development and training).

To complete this study, multiple datasets were acquired and various clinical evaluations were performed, as follows.

A deep learning model for analysis of malignancy risk in lung cancer screening CTs was developed from a NLST dataset consisting of 42,290 CT cases from 14,851 patients, 578 of whom developed biopsy-confirmed cancer within the 1-year follow-up period. This represents the entire publicly available dataset provided by the National Institutes of Health (there were 26,722 patients in NLST). Details of how this dataset was selected from the entire NLST screening arm and the inclusion/exclusion criteria are given in Extended Data Fig. 1. Patients were randomly assigned into one of three sets: a training set (70%), a tuning set (15%) and a test set (15%). All CT scan volumes from each patient were then placed into the corresponding set based on this patient assignment. An individual volume was considered cancer-positive if the result of a biopsy or surgical resection was positive during the screening study year, and considered cancer-negative if the patient was cancer-free in the 1-year follow-up screen. Supplementary Tables 1, 2 and 3 contain information on demographics and cancer staging, CT model manufacturer and nodule characteristics for all NLST subsets.

On the test dataset, for 6,716 cases (86 cancer-positives) the model achieved an area under the receiver operating characteristic of 94.4% (95% confidence interval, 91.1–97.3) (see Methods, Statistical analysis). For comparison with radiologists, we then thresholded the model's predictions at three different cutoffs to produce four different lung malignancy scores (LUMAS). These thresholds were chosen so that LUMAS scores corresponded with the probability of malignancy in Lung-RADS buckets 1/2, 3+, 4A+ and 4B/X on the tuning set33 (see Methods, Operating point selection, for more detail on model score thresholding for LUMAS). Buckets 1 and 2 were combined, as they have the same management recommendation: referral to continued annual screening.

We conducted a two-part retrospective reader study with six US board-certified radiologists (average of 8 years clinical experience, range 4–20 years). In the first part, the radiologists graded a single-screening CT volume. Readers were given access to associated patient demographics and clinical history, while the deep learning model did not have access to this information. Additionally, while the volumes were resampled for the model, the readers assessed the full-resolution original CT cases. Neither the radiologists nor the model had access to previous screening CT volumes from the patient (see Methods, Reader studies). Radiologists reviewed a subset of the test dataset consisting of 507 patients (83 cancer-positives). On this subset of the test set, the model's area under the curve (AUC) was 95.9 (95% confidence interval, 92.8–98.1). This AUC, and the sensitivity/specificity for LUMAS and radiologists, are presented in Fig. 2a,b. The performance of all six radiologists trended at or below the model's receiver operating curve (Fig. 2b).

We compared the model to the average reader performance by measuring the sensitivity and specificity for each LUMAS score and its corresponding Lung-RADS risk bucket (see Methods, Reader studies). The model achieved significantly better sensitivity (P < 0.05 for all three thresholds) and better specificity (P < 0.05 for two of three thresholds) than the average radiologist (Fig. 2c,d). For instance, comparison of the operating point of LUMAS 3+ to Lung-RADS 3+ yielded a statistically significant specificity boost of 11.6% (95% confidence interval, 7.8–15.1) and a sensitivity boost of 5.2% (95% confidence interval, 0.38–9.9).

We present an alternative methodology for comparison in Supplementary Table 4a,b where, rather than using LUMAS, we set the model sensitivity to match the average reader, compared specificity and then matched specificity to compare sensitivity.

Extended Data Fig. 2 shows the same analysis presented in this section, except that the results have been reweighted to take into account the sampling from the total of 26,722 patients in the NLST screening arm.

In the second part, CT volumes from both the current and previous year were available to the model and the same six radiologists. Comparison with previous scans to assess interval growth is an important component of Lung-RADS34.
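The LUMAS bucketing introduced above reduces to mapping a probability through three ordered cutoffs. A minimal sketch follows; the cutoff values here are hypothetical, since the actual thresholds were fit on the tuning set to match Lung-RADS malignancy probabilities (see Methods, Operating point selection).

```python
import bisect

# Hypothetical cutoffs for illustration only; the real thresholds were chosen
# on the tuning set to match the estimated PPVs of Lung-RADS 3, 4A and 4B/X.
LUMAS_CUTOFFS = [0.02, 0.08, 0.30]
LUMAS_BUCKETS = ["1/2", "3", "4A", "4B/X"]

def lumas_bucket(p_malignancy: float) -> str:
    """Map a case-level malignancy probability to a LUMAS risk bucket."""
    return LUMAS_BUCKETS[bisect.bisect_right(LUMAS_CUTOFFS, p_malignancy)]

print(lumas_bucket(0.01), lumas_bucket(0.12), lumas_bucket(0.55))  # 1/2 4A 4B/X
```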

[Figure 2 appears here. Panel a: ROC curve of the model (AUC 95.9; 95% CI: 92.8–98.1) with LUMAS operating points versus the average reader at Lung-RADS 3+, 4A+ and 4B/X. Panel b: magnified view showing each of the six individual readers. Panels c–e, reconstructed from the figure's summary tables (values are % with 95% CIs):

Fig. 2c, sensitivity:
3+: average reader 90.0 (86.1, 93.4); model 95.2 (89.9, 98.9); delta +5.2 (0.4, 9.8), P = 0.0386
4A+: average reader 82.9 (76.6, 89.4); model 90.4 (83.3, 96.3); delta +7.4 (1.7, 12.9), P = 0.0114
4B/X: average reader 62.5 (54.4, 70.7); model 79.5 (70.8, 88.2); delta +17.1 (9.0, 24.5), P < 1 × 10–4

Fig. 2d, specificity:
3+: average reader 69.7 (66.6, 72.8); model 81.3 (77.3, 84.9); delta +11.6 (7.8, 15.1), P < 1 × 10–4
4A+: average reader 86.0 (83.4, 88.4); model 91.0 (88.1, 93.9); delta +5.0, P = 0.0008
4B/X: average reader 95.3 (94.0, 96.6); model 96.5 (94.6, 98.2); delta +1.1 (–0.4, 2.6), P = 0.143

Fig. 2e, localization hit rate:
3+: Hit@1 73/74, Hit@2 74/74; 4A+: Hit@1 72/73, Hit@2 73/73; 4B/X: Hit@1 62/63, Hit@2 63/63]

Fig. 2 | Results from the reader study—lung cancer screening on a single CT volume. a–e, Performance of radiologists and model in predicting malignancy
using single screening CT volumes. Model performance shown in the AUC and summary tables is based on case-level malignancy score. LUMAS buckets
refers to operating points selected to match the predicted probability of cancer for Lung-RADS 3+, 4A+ and 4B/X. a, Performance of model (blue line)
versus average radiologist for various Lung-RADS categories (crosses) using a single CT volume. The length of the crosses represents the confidence
intervals (CIs). The area highlighted in blue is magnified in b to show the performance of each of the six radiologists at various Lung-RADS risk buckets.
c, Sensitivity comparison between model and average radiologist. d, Specificity comparison between model and average radiologist. Both sensitivity and
specificity analyses were conducted with n = 507 volumes from 507 patients, with P values computed using a two-sided permutation test with 10,000
random resamplings of the data. e, Hit rate localization analysis used to measure how often the model correctly localized a cancerous lesion.

Readers graded 308 volumes from the first reader study that were not from the initial baseline NLST prevalence screening; all of the cases in this subset had prior scans available (see Methods, Reader study—lung cancer screening using current and prior CT volume). On this subset, the model's AUC was 92.6% (95% confidence interval, 86.5–97.3). Notably, both the reader and model performance dropped relative to the first part of the reader study as a result of dropping the CTs from the baseline year. We performed the same comparison as in the previous reader study (Fig. 3). LUMAS showed statistically significant improved specificity for the 4A+ bucket, and otherwise matched the average reader sensitivity and specificity (Fig. 3c,d).

[Figure 3 appears here. Panel a: ROC curve of the model (AUC 92.6; 95% CI: 86.5–97.3) with LUMAS operating points versus the average reader. Panel b: magnified view showing the six individual readers. Panels c–e, reconstructed from the figure's summary tables (values are % with 95% CIs):

Fig. 3c, sensitivity:
3+: average reader 86.7 (79.7, 92.9); model 87.5 (76.5, 97.2); delta +0.8
4A+: average reader 82.1 (74.1, 89.4); model 82.5 (69.0, 93.9); delta +0.4 (–10.3, 10.3), P = 0.975
4B/X: average reader 70.0 (59.4, 80.3); model 72.5 (58.8, 85.7); delta +2.5 (–6.8, 12.3), P = 0.665

Fig. 3d, specificity:
3+: average reader 83.7 (81.1, 87.0); model 84.2 (77.7, 88.1); delta +0.5 (–3.7, 4.6), P = 0.8172
4A+: average reader 89.1 (87.1, 91.9); model 92.7 (87.8, 97.4); delta +3.4 (0.7, 6.6), P = 0.0156
4B/X: average reader 95.3 (92.0, 96.2); model 96.5 (93.2, 98.4); delta +1.9 (–0.1, 4.1), P = 0.0527

Fig. 3e, localization hit rate:
3+: Hit@1 31/31, Hit@2 31/31; 4A+: Hit@1 31/31, Hit@2 31/31; 4B/X: Hit@1 27/27, Hit@2 27/27]

Fig. 3 | Results from the reader study—lung cancer screening using current and prior CT volume. a–e, Model performance in the AUC curve and
summary tables is based on case-level malignancy score. The term ‘LUMAS buckets’ refers to operating points selected to represent sensitivity/specificity
at the 3+, 4A+ and 4B/X thresholds. a, Performance of model (blue line) versus average radiologist at various Lung-RADS categories (crosses) using a CT
volume and a prior CT volume per patient. The length of the crosses represents the 95% confidence interval. The area highlighted in blue is magnified in
b to show the performance of each of the six radiologists at various Lung-RADS categories in this reader study. c, Sensitivity comparison between model
and average radiologist. d, Specificity comparison between model and average radiologist. Both sensitivity and specificity analyses were conducted with
n = 308 volumes from 308 patients, with P values computed using a two-sided permutation test with 10,000 random resamplings of the data. e, Hit rate
localization analysis to measure how often the model correctly localized a cancerous lesion.

We present an alternative methodology for comparison in Supplementary Table 4c,d where, rather than using LUMAS, we set the model sensitivity to match the average reader, compared specificity and then matched specificity to compare sensitivity.

Extended Data Fig. 3 shows the same analysis presented in this section, except the results have been reweighted to take into account the sampling from the total 26,722 patients in the NLST screening arm.

Application of the model to all 6,716 cases (86 cancer-positives) in the held-out NLST test set yielded an overall AUC of 94.4% (95% confidence interval, 91.1–97.3). A total of 2,302 cases from the baseline year did not have prior volumes available, but in all other cases readers and the model had access to both current and prior year volumes. We followed an earlier algorithmic methodology33 to estimate Lung-RADS performance from NLST nodule annotations. Because the nodule annotations in NLST do not contain all of the findings needed by the Lung-RADS guidelines (see Methods, Retrospective application of model to NLST), for comparison of the model to this Lung-RADS estimate we chose a different operating point.


For the 1-year cancer outcomes data, we found a boost in specificity (5.0%; 95% confidence interval, 4.2–5.7). We also analyzed the model's performance for a longer-term endpoint, cancer within 2 years, resulting in an AUC of 87.3% (95% confidence interval, 83.2–90.9). For this endpoint, the model yielded improvements in both sensitivity (9.5%; 95% confidence interval, 2.5–16.4) and specificity (5.1%; 95% confidence interval, 4.4–5.9) relative to retrospective-Lung-RADS.

Extended Data Fig. 4 shows the same analysis presented in this section, except that the results have been reweighted to take into account the sampling from the total 26,722 patients in the NLST screening arm.

Under institutional review board (IRB) approval, we evaluated the model on an additional independent, fully de-identified screening dataset from a US academic medical center, resulting in an AUC of 95.5% (95% confidence interval, 88.0–98.4) (Fig. 4b and Extended Data Fig. 5a). This dataset contained 1,139 cases (27 cancer-positives) and was used to evaluate model performance for biopsy- and/or surgically confirmed lung cancers. The model was not trained or tuned using this dataset. Images were not submitted for re-interpretation by radiologists (see Methods, Development and validation datasets, for more details on the dataset). We also evaluated the sensitivity and specificity of LUMAS (Fig. 4b). For LUMAS 3+, we found a sensitivity of 81.5% (95% confidence interval, 66.7–95.0) and a specificity of 89.3% (95% confidence interval, 87.5–91.2).

We performed a localization analysis to measure how often a correct cancer diagnosis was linked with a correct localization. A bounding box was produced by the model for the top two candidate lesions by malignancy risk. For the localization ground truth, each of 79 scans was labeled by two radiologists from a pool of nine. Every scan was derived from a cancer-positive patient in NLST. The radiologists were given the location and staging information from the pathology report, as well as all CT volumes from the patient's data. They were then instructed to label all malignancies with a bounding box. The highest-ranked bounding box overlapped with a malignancy in the scan labeled by our radiologists in all but one case (Figs. 2e and 3e), for a Hit@1 rate of 98%. The Hit@2 rate was 100% (see Methods, Localization analysis, for more details on the Hit metric). These findings were consistent regardless of the specific LUMAS score used to define a true-positive. For a more detailed analysis of the extent of overlap, see Extended Data Fig. 5b.

Given the perceived 'black box' nature of deep learning, an important step in evaluating clinical performance is a deeper assessment of the modeling results. We measured performance on many data subsets to show that the model's overall performance improvements were not obscuring poor performance in clinically relevant subsets. Additionally, we attempted to understand where the model's performance improvements were greatest. The full list of subsets and metrics can be seen in the Supplementary Information (see Supplementary Tables 5 and 6). The model was not statistically inferior relative to the average reader for any metric, subset or risk bucket in either part of the reader study.

Some of the subsets we analyzed were based on a patient's cancer stage when diagnosed according to NLST pathology data. In the first part of the reader study (see Reader study—lung cancer screening on a single CT volume), LUMAS 4B/X versus the average reader Lung-RADS 4B/X showed an absolute improvement in sensitivity of 24.4% (95% confidence interval, 10.4–37.2) for early-stage cancers. Another group of subsets were based on NLST nodule size annotations. On the subset with nodules 8–15 mm, we saw an absolute improvement in sensitivity of 42.4% (95% confidence interval, 24.7–58.0) in LUMAS 4B/X versus the average reader Lung-RADS 4B/X. An example case of this type is illustrated in Extended Data Fig. 6d, annotated by one radiologist as containing a 12-mm nodule.

Further exploration of model results was completed by two additional radiologists (with 10 and 21 years of clinical experience) reviewing the 140 cases of disagreement between the radiologist consensus and the model from the first part of the reader study (without prior volumes) to evaluate possible causes of disparities (see Supplementary Information, Subjective analysis). The radiologists observed scarring in 22% of the model–reader disagreements and, in 57% of these cases, LUMAS appropriately assigned a lower risk bucket than the readers. This downgrading of scarring accounts for some of the specificity improvements in the model. An example where LUMAS downgraded risk for a cancer-negative case with scarring is shown in Extended Data Fig. 6c (see Supplementary Information, Subjective analysis, for more details of the analysis).

Further analysis of the model's results included examining attribution regions computed with integrated gradients35, using three radiologists with an average of 23 years' clinical experience (range 10–38 years). Positive and negative classification regions were examined by three radiologists on a subset of examples from the test set. The attribution regions indicated that the model primarily concentrated within and on the edges of the nodule, although in some cases also on the vasculature in the parenchyma (see Supplementary Information, Subjective analysis; Extended Data Fig. 7 and example model false positives in Extended Data Fig. 8).

In summary, we used advanced deep learning techniques to train models with state-of-the-art performance by leveraging full 3D LDCT volumes, pathology-confirmed case results and prior volumes. These models, if clinically validated, could aid clinicians in evaluating lung cancer screening exams.

Our end-to-end priors approach generates case-level malignancy risk predictions as well as localization information for LDCT lung screening volumes. The strong performance of the model at the case level has important potential clinical relevance. The observed increase in specificity could translate to fewer unnecessary follow-up procedures. Increased sensitivity in cases without priors could translate to fewer missed cancers in clinical practice, especially as more patients begin screening. For patients with prior imaging exams, the performance of the deep learning model could enable gains in workflow efficiency and consistency, as assessment of prior imaging is already a key component of a specialist's workflow36. Given that LDCT screening is in the relatively early phases of adoption, the potential for considerable improvement in patient care in the coming years is substantial. The model's localization directs follow-up for the specific lesion(s) of greatest concern. These predictions are critical for patients proceeding for further work-up and treatment, including diagnostic CT, positron emission tomography (PET)/CT or biopsy.

Malignancy risk prediction allows for the possibility of augmenting existing, manually created interpretation guidelines such as Lung-RADS, which are limited to subjective clustering and assessment to approximate cancer risk. Numerous investigations have evaluated CADx applications built to assist radiologists in classification of suspected lesions previously detected and segmented by radiologists18. These prior CADx studies typically report only a lesion-level classification performance, which is not comparable to this work. In contrast, the model presented performs human-independent detection and classification on full volumes. Past non-peer-reviewed efforts that have attempted direct, automated malignancy prediction from full volumes using deep learning methods reported AUCs as high as 0.88 (ref. 37). However, these models were primarily trained and tested on smaller portions of the NLST dataset, did not evaluate the use of priors and did not report localization metrics32,37. We hypothesize that taking into account a larger context in our cancer risk prediction model (larger ROIs around candidate regions, whole-3D volume assessment and priors) and training on a larger portion of NLST led to superior performance.

While we did note a performance decrease in the with-priors subset, we also found a corresponding drop in performance for our readers.
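The Hit@k localization metric used in the analysis above can be stated compactly: a case counts as a hit if any of the model's top-k boxes, ranked by malignancy score, overlaps a radiologist-annotated malignancy. A minimal sketch, assuming axis-aligned 3D boxes and an any-overlap criterion (the paper analyzes the extent of overlap separately in Extended Data Fig. 5b):

```python
def boxes_overlap(a, b):
    """True if two axis-aligned 3D boxes, each given as
    (zmin, ymin, xmin, zmax, ymax, xmax), intersect."""
    return all(a[i] < b[i + 3] and b[i] < a[i + 3] for i in range(3))

def hit_at_k(predicted_boxes, truth_boxes, k):
    """Hit@k: does any of the top-k predicted boxes overlap a
    radiologist-annotated malignancy?"""
    return any(boxes_overlap(p, t) for p in predicted_boxes[:k] for t in truth_boxes)

# Hypothetical boxes for illustration:
preds = [(10, 10, 10, 20, 20, 20), (40, 40, 40, 50, 50, 50)]
truth = [(15, 15, 15, 30, 30, 30)]
print(hit_at_k(preds, truth, k=1), hit_at_k(preds, truth, k=2))  # True True
```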

[Figure 4 appears here. Panel a: ROC curves for the model predicting cancer in 1 year (AUC 94.4; 95% CI: 91.1–97.3) and in 2 years (AUC 87.3; 95% CI: 83.2–90.9), with Lung-RADS retrospectively applied to NLST reads for each endpoint and the NLST comparison operating point. Panels b and c, reconstructed from the figure's summary tables (values are % with 95% CIs):

Fig. 4b, cancer in 1 year:
Sensitivity: retrospective Lung-RADS 77.9 (67.9, 86.2); model 83.7 (75.0, 91.1); delta +5.8 (–2.9, 14.6), P = 0.300
Specificity: retrospective Lung-RADS 90.1 (89.3, 90.7); model 95.0 (94.7, 95.7); delta +5.0 (4.2, 5.7), P < 1 × 10–4
Cancer in 2 years:
Sensitivity: retrospective Lung-RADS 54.2 (46.6, 63.8); model 64.7 (55.9, 72.8); delta +9.5 (2.5, 16.4), P = 0.0143
Specificity: retrospective Lung-RADS 90.1 (89.3, 90.7); model 95.2 (94.7, 95.7); delta +5.1 (4.4, 5.9), P = 0.0059

Fig. 4c, LUMAS bucket cutoffs on the independent dataset (AUC 95.5; 95% CI: 88.0, 98.4):
3+: sensitivity 81.5 (66.7, 95.0); specificity 89.4 (87.5, 91.2)
4A+: sensitivity 70.4 (53.1, 87.5); specificity 95.0 (93.7, 98.8)
4B/X: sensitivity 59.3 (39.3, 78.3); specificity 98.1 (97.2, 98.8)]

Fig. 4 | Results of the full NLST and independent test sets. a, Comparison of model performance to NLST reader performance on the full NLST test set.
NLST reader performance was estimated by retrospectively applying Lung-RADS 3 criteria to the NLST reads. b, Sensitivity and specificity comparisons
between the model and Lung-RADS retrospectively applied to NLST reads. The comparison was performed on n = 6,716 cases, using a two-sided
permutation test using 10,000 random resamplings of the data. c, Sensitivity and specificity of different LUMAS buckets on an independent dataset
comprising n = 1,139 cases, using the same two-sided permutation test with 10,000 random resamplings. The full AUC plot is shown in Extended Data Fig. 5a.
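The figure reports AUCs with 95% CIs. One standard way to obtain such intervals, assumed here for illustration and not necessarily the authors' exact procedure (see Methods, Statistical analysis), is to bootstrap the test cases and recompute the AUC via the rank-sum identity:

```python
import numpy as np

def auc(labels, scores):
    """AUC via the Mann-Whitney rank-sum identity, with tied scores
    receiving their average rank."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    for s in np.unique(scores):       # average ranks over ties
        tie = scores == s
        ranks[tie] = ranks[tie].mean()
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def bootstrap_auc_ci(labels, scores, n_boot=1000, seed=0):
    """Percentile-bootstrap 95% CI for the AUC over resampled cases."""
    labels = np.asarray(labels)
    scores = np.asarray(scores)
    rng = np.random.default_rng(seed)
    stats = []
    while len(stats) < n_boot:
        idx = rng.integers(0, len(labels), len(labels))
        if 0 < labels[idx].sum() < len(idx):  # need both classes present
            stats.append(auc(labels[idx], scores[idx]))
    return np.percentile(stats, [2.5, 97.5])

# Toy labels/scores for illustration:
rng = np.random.default_rng(2)
y = rng.integers(0, 2, 200)
s = y * 0.5 + rng.random(200)
print(round(auc(y, s), 3), bootstrap_auc_ci(y, s))
```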

This decrease may be because patients with easy-to-spot cancers are diagnosed and dropped from the study in the baseline year, leaving only more subtle cancer cases.

We propose a LUMAS system in this paper, but the underlying techniques allow for broader exploration of other risk stratification methods. Incorporating these new methods into CAD systems could also address the issues of inter-grader variability in lung cancer assessment, a pattern seen in both our reader study (see Supplementary Table 7) and prior publications38,39.

Explainability of deep learning models is still at an early stage. To begin to explore how the model evaluates risk of malignancy, we asked our clinicians to analyze a subset of cases subjectively. We hypothesize that there are advantages to the model's more consistent visualization of morphological features, such as scars and nodules, in 3D. Additionally, the model was not bound by the size guidelines in Lung-RADS, allowing for new risk categorization. We found cases where the model appeared to use features outside of the main nodule, such as the vasculature and parenchyma surrounding the nodule (see Supplementary Information, Subjective analysis). However, we do not know whether the model incorporates other abnormalities such as background emphysema in its predictions. Further examination using model attribution techniques may allow radiologists to take advantage of the same visual features used by the model to assess malignancy.

Our study did have some important limitations. While our radiologist-comparison studies were larger than in prior published work32, they were still limited to retrospective data from the NLST dataset. Although clinical comparison metrics were limited to a small number of general (not thoracic) radiologists, lung cancer screening is commonly performed by general radiologists40. Another limitation resulting from initial lung cancer screening studies is the relative lack of cancer outcomes information available. In spite of this, our multi-stage modeling approach was able to leverage the natural distribution of data from the screening population using only 398 cancer-positives for training. We were also encouraged by the indicators of generalizability of our model to an independent dataset from another patient population. As we used only two datasets during testing, there is a limit to the conclusions that can be drawn about generalizability. However, the NLST test set we used represents 33 different test sites across 21 different manufacturer and model combinations. In addition, our academic medical center test set is derived from 1,139 cases, all in the years post-NLST. Further study will require testing and tuning against an even broader variability of screening data parameters to ensure generalizability.

Lastly, although we presented a methodology for choosing operating points for the model, this was primarily for the purposes of comparing reader and model performance. It is important to stress that the selection of operating points for use in clinical practice remains an ongoing area of research, potentially involving an analysis of costs and outcomes to properly trade off between sensitivity and specificity.

More robust retrospective and prospective studies will be required to ensure clinical applicability as screening programs continue to scale. In future studies we aim to explore different approaches in presenting radiologists with model output assessments, including malignancy risk calculations and localization. Correlating the performance improvements with documented improved clinical outcomes and health system costs will also be required to determine potential impact. Another opportunity would be to apply similar modeling techniques to routine diagnostic CT, aiding in the detection and management of incidental pulmonary nodules.

In addition to its application to lung cancer screening, the deep learning techniques applied in this study have considerable relevance to other types of 3D imaging data. For instance, this approach holds promise for magnetic resonance imaging, PET or other types of volumetric or multi-view problem research. Our research also has applications in workflows involving comparison with a patient's prior imaging.


Lastly, the early stage of lung cancer screening adoption led to a relative scarcity of quality ground truth data for training. While this presented a challenge during the research process, it demonstrated that it is possible for deep learning to achieve radiologist-level performance with a smaller number of positive examples. As data scarcity is a common problem in medical deep learning research, we hope these methods will translate to new opportunities for exploration, especially in rare diseases.

In conclusion, these results represent a step toward automated image evaluation via lung cancer risk malignancy estimation through deep learning. We believe this research could supplement future approaches to lung cancer screening as well as support assisted- or second-read workflows. In addition, we believe the general approach employed in our work, mainly outcomes-based training, full-volume techniques and directly comparable clinical performance evaluation, may lay additional groundwork toward deep learning medical applications.

Online content
Any methods, additional references, Nature Research reporting summaries, source data, statements of code and data availability and associated accession codes are available at https://doi.org/10.1038/s41591-019-0447-x.

Received: 1 October 2018; Accepted: 5 April 2019; Published online: 20 May 2019

References
1. American Lung Association. Lung cancer fact sheet. American Lung Association http://www.lung.org/lung-health-and-diseases/lung-disease-lookup/lung-cancer/resource-library/lung-cancer-fact-sheet.html (accessed 11 September 2018).
2. Jemal, A. & Fedewa, S. A. Lung cancer screening with low-dose computed tomography in the United States—2010 to 2015. JAMA Oncol. 3, 1278 (2017).
3. US Preventive Services Task Force. Final update summary: lung cancer: screening (1AD). US Preventive Services Task Force https://www.uspreventiveservicestaskforce.org/Page/Document/UpdateSummaryFinal/lung-cancer-screening (2018).
4. National Lung Screening Trial Research Team et al. Reduced lung-cancer mortality with low-dose computed tomographic screening. N. Engl. J. Med. 365, 395–409 (2011).
5. Black, W. C. et al. Cost-effectiveness of CT screening in the National Lung Screening Trial. N. Engl. J. Med. 371, 1793–1802 (2014).
6. Lung CT screening reporting & data system. American College of Radiology https://www.acr.org/Clinical-Resources/Reporting-and-Data-Systems/Lung-Rads (accessed 11 September 2018).
7. van Riel, S. J. et al. Observer variability for Lung-RADS categorisation of lung cancer screening CTs: impact on patient management. Eur. Radiol. 29, 924–931 (2019).
8. Singh, S. et al. Evaluation of reader variability in the interpretation of follow-up CT scans at lung cancer screening. Radiology 259, 263 (2011).
9. Mehta, H. J., Mohammed, T.-L. & Jantz, M. A. The American College of Radiology lung imaging reporting and data system: potential drawbacks and need for revision. Chest 151, 539–543 (2017).
10. Martin, M. D., Kanne, J. P., Broderick, L. S., Kazerooni, E. A. & Meyer, C. A. Lung-RADS: pushing the limits. Radiographics 37, 1975–1993 (2017).
11. Winkler Wille, M. M. et al. Predictive accuracy of the PanCan lung cancer risk prediction model—external validation based on CT from the Danish Lung Cancer Screening Trial. Eur. Radiol. 25, 3093–3099 (2015).
12. De Koning, H., Van Der Aalst, K., Ten Haaf, M. & Oudkerk, H. D. K. C. PL02.05 Effects of volume CT lung cancer screening: mortality results of the NELSON randomised-controlled population based trial. J. Thorac. Oncol. 13, S185 (2018).
13. Field, J. K. et al. UK Lung Cancer RCT Pilot Screening Trial: baseline findings from the screening arm provide evidence for the potential implementation of lung cancer screening. Thorax 71, 161–170 (2016).
14. McMahon, P. M. et al. Cost-effectiveness of computed tomography screening for lung cancer in the United States. J. Thorac. Oncol. 6, 1841–1848 (2011).
15. Goffin, J. R. et al. Cost-effectiveness of lung cancer screening in Canada. JAMA Oncol. 1, 807 (2015).
16. Tomiyama, N. et al. CT-guided needle biopsy of lung lesions: a survey of severe complication based on 9783 biopsies in Japan. Eur. J. Radiol. 59, 60–64 (2006).
17. Wiener, R. S., Schwartz, L. M., Woloshin, S. & Welch, H. G. Population-based risk for complications after transthoracic needle lung biopsy of a pulmonary nodule: an analysis of discharge records. Ann. Intern. Med. 155, 137–144 (2011).
18. Ciompi, F. et al. Towards automatic pulmonary nodule management in lung cancer screening with deep learning. Sci. Rep. 7, 46479 (2017).
19. Gillies, R. J., Kinahan, P. E. & Hricak, H. Radiomics: images are more than pictures, they are data. Radiology 278, 563–577 (2016).
20. Gulshan, V. et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA 316, 2402–2410 (2016).
21. Bogoni, L. et al. Impact of a computer-aided detection (CAD) system integrated into a picture archiving and communication system (PACS) on reader sensitivity and efficiency for the detection of lung nodules in thoracic CT exams. J. Digit. Imaging 25, 771–781 (2012).
22. Ye, X. et al. Shape-based computer-aided detection of lung nodules in thoracic CT images. IEEE Trans. Biomed. Eng. 56, 1810–1820 (2009).
23. Bellotti, R. et al. A CAD system for nodule detection in low-dose lung CTs based on region growing and a new active contour model. Med. Phys. 34, 4901–4910 (2007).
24. Sahiner, B. et al. Effect of CAD on radiologists' detection of lung nodules on thoracic CT scans: analysis of an observer performance study by nodule size. Acad. Radiol. 16, 1518–1530 (2009).
25. Firmino, M., Angelo, G., Morais, H., Dantas, M. R. & Valentim, R. Computer-aided detection (CADe) and diagnosis (CADx) system for lung cancer with likelihood of malignancy. Biomed. Eng. Online 15, 2 (2016).
26. Armato, S. G. et al. Lung cancer: performance of automated lung nodule detection applied to cancers missed in a CT screening program. Radiology 225, 685–692 (2002).
27. Valente, I. R. S. et al. Automatic 3D pulmonary nodule detection in CT images: a survey. Comput. Methods Prog. Biomed. 124, 91–107 (2016).
28. Das, M. et al. Performance evaluation of a computer-aided detection algorithm for solid pulmonary nodules in low-dose and standard-dose MDCT chest examinations and its influence on radiologists. Br. J. Radiol. 81, 841–847 (2008).
29. Quantitative Insights. Quantitative Insights gains industry's first FDA clearance for machine learning driven cancer diagnosis. PRNewswire https://www.prnewswire.com/news-releases/quantitative-insights-gains-industrys-first-fda-clearance-for-machine-learning-driven-cancer-diagnosis-300495405.html (2018).
30. Russakovsky, O. et al. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115, 211–252 (2015).
31. Lin, T.-Y. et al. Microsoft COCO: Common Objects in Context. Comput. Vis. ECCV 2014, 740–755 (2014).
32. Liao, F., Liang, M., Li, Z., Hu, X. & Song, S. Evaluate the malignancy of pulmonary nodules using the 3D deep leaky noisy-or network. Preprint at https://arxiv.org/abs/1711.08324 (2017).
33. Pinsky, P. F. et al. Performance of Lung-RADS in the National Lung Screening Trial: a retrospective assessment. Ann. Intern. Med. 162, 485 (2015).
34. Manos, D. et al. The Lung Reporting and Data System (LU-RADS): a proposal for computed tomography screening. Can. Assoc. Radiol. J. 65, 121–134 (2014).
35. Sun, Y. & Sundararajan, M. Axiomatic attribution for multilinear functions. In Proc. 12th ACM Conference on Electronic Commerce—EC '11 https://doi.org/10.1145/1993574.1993601 (2011).
36. Varela, C., Karssemeijer, N., Hendriks, J. H. C. L. & Holland, R. Use of prior mammograms in the classification of benign and malignant masses. Eur. J. Radiol. 56, 248–255 (2005).
37. Trajanovski, S. et al. Towards radiologist-level cancer risk assessment in CT lung screening using deep learning. Preprint at https://arxiv.org/abs/1804.01901 (2019).
38. Pinsky, P. F., Gierada, D. S., Nath, P. H., Kazerooni, E. & Amorosa, J. National Lung Screening Trial: variability in nodule detection rates in chest CT studies. Radiology 268, 865–873 (2013).
39. Armato, S. G. 3rd et al. The Lung Image Database Consortium (LIDC): an evaluation of radiologist variability in the identification of lung nodules on CT scans. Acad. Radiol. 14, 1409–1421 (2007).
40. Kazerooni, E. A. et al. ACR–STR practice parameter for the performance and reporting of lung cancer screening thoracic computed tomography (CT). J. Thorac. Imaging 29, 310–316 (2014).

Acknowledgements
The authors acknowledge the NCI and the Foundation for the National Institutes of Health for their critical roles in the creation of the free publicly available LIDC/IDRI/NLST Database used in this study. All participants enrolled in NLST signed an informed consent developed and approved by the screening center's IRBs, the NCI IRB and the Westat IRB. The authors thank the NCI for access to NCI data collected by the NLST. The statements herein are solely those of the authors and do not represent or imply concurrence or endorsement by the NCI. The authors would like to thank M. Etemadi and his team at Northwestern Medicine for data collection, de-identification

and research support. These team members include E. Johnson, F. Garcia-Vicente, D. Melnick, J. Heller and S. Singh. We also thank C. Christensen and his team at Northwestern Medicine IT, including M. Lombardi, C. Wilbar and R. Atanasiu. We would also like to acknowledge the work of the team working on labeling infrastructure and, specifically, J. Yoshimi, who implemented many of the features we needed for labeling of ROIs, and J. Wong for coordinating and recruiting radiologist labelers. We would also like to acknowledge the work of the team that put together the data-handling infrastructure, including G. Duggan and K. Eswaran. Lastly, we would like to acknowledge the helpful feedback on the initial drafts from Y. Liu and S. McKinney.

Author contributions
D.A., A.P.K., S.B. and B.C. developed the network architecture and data/modeling infrastructure, training and testing setup. D.A. and A.P.K. created the figures, wrote the methods and performed additional analysis requested in the review process. J.J.R., S.S., D.T., D.A., A.P.K. and S.B. wrote the manuscript. D.P.N. and J.J.R. provided clinical expertise and guidance on the study design. G.C. and S.S. advised on the modeling techniques. M.E., S.S., J.J.R., B.C., W.Y. and D.A. created the datasets, interpreted the data and defined the clinical labels. D.A., B.C., A.P.K. and S.S. performed statistical analysis. S.S., L.P. and D.T. initiated the project and provided guidance on the concept and design. S.S. and D.T. supervised the project.

Competing interests
D.P.N. and J.J.R. are paid consultants of Google Inc. D.P.N. is on the Medical Advisory Board of VIDA Diagnostics, Inc. and Exact Sciences. M.E.'s lab received funding from Google Inc. to support the research collaboration. This study was funded by Google Inc. The remaining authors are employees of Google Inc. and own stock as part of the standard compensation package. The authors have no other competing interests to disclose.

Additional information
Extended data is available for this paper at https://doi.org/10.1038/s41591-019-0447-x.
Supplementary information is available for this paper at https://doi.org/10.1038/s41591-019-0447-x.
Reprints and permissions information is available at www.nature.com/reprints.
Correspondence and requests for materials should be addressed to D.T.
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© The Author(s), under exclusive licence to Springer Nature America, Inc. 2019


Methods When classifying each candidate region, a purely two-stage approach would
Development and validation datasets. We used data from the NLST study, have access only to features within the candidate region and not from the full
consisting of 42,290 CT cases from 14,851 patients, 638 of whom developed biopsy- volume. It was not technically feasible to train a model on the full volume at the
confirmed cancer within 1 year of a LDCT screening (see Extended Data Fig. 1 for original resolution. To provide this global context for every candidate region,
more details on NLST dataset selection)41. Patients were randomly assigned to a we trained a model on the full volume at a reduced resolution to predict cancer
training set (70%), a tuning set (15%) or a test set (15%). Because not all negative diagnosis and then combined features extracted from this model to those extracted
cases from NLST have been made publicly available, the training, tuning and from each candidate region. The input volume for each case was the entire 3D CT
test sets had cancer percentages of 3.9, 4.5 and 3.7, respectively (slightly higher volume for the case, including the lung, mediastinum, heart, chest wall and so on,
than the 1–2% range reported for NLST in general and in real-world practice). just as a radiologist would be given in practice. No manual image segmentation
Supplementary Tables 1, 2 and 3 describe demographics, scanner information was performed. A total of 29,541 cases were used for training, including all volumes
and nodule and cancer characteristics for relevant subsets of this dataset. All with slice thickness less than or equal to 2.5 mm. The model consists of lung
participants enrolling in NLST signed an informed consent developed and segmentation, cancer ROI detection, a full-volume model and a final cancer risk
approved by the screening centers’ IRBs, the National Cancer Institute (NCI) IRB prediction model based on the outputs of the full-volume model and the cancer
and the Westat IRB. Additional details regarding cases in the dataset are available ROI detection model. For each of these components, we chose a more general
through the National Institutes of Health Cancer Data Access System. Briefly, computer vision task (that is, instance segmentation, object detection and video
LDCTs were collected from multiple institutions, with slice spacing varying from classification) that was similar to the task performed by the component. Then, for
1.25 to 5 mm and scanner vendors varying by site. We filtered out the 5-mm scans each task we chose an approach that was state of the art at the time of our modeling
to better represent the slice spacing of a typical modern screening protocol42, experiments. For a schematic overview of the model see Fig. 1, and for a more
and the largest remaining slice spacing was 2.5 mm. A diagnosis of lung cancer detailed overview see Extended Data Fig. 10.
established by biopsy at any time during the same year as a screening case counted The approach consists of four components, all trained using the TensorFlow
as a ground truth true-positive case. This included cases identified as incidental platform (Google Inc.)43:
cancers diagnosed during the same screening year as an initially negative screening (1) Lung segmentation. We trained a lung segmentation Mask-RCNN44 ap-
exam. An exam was considered negative if the patient proved cancer-free on proach, trained on the LUNA45 dataset using the TensorFlow Object Detec-
1-year follow-up; patients in the trial had multi-year follow-up. Patients had up to tion API46, which produced the lung segmentation mask. This mask was used
3 years of screening, all via LDCT and, in nearly all cases, only one visit occurred to compute the center of its bounding box for step (c) and to determine an
per year with exceptions made for patients with inadequate imaging or interval alignment with the prior volume. Since only the bounding box center is the
development of symptoms concerning for cancer. In cases where prior imaging was key result of interest, the precise segmentation boundaries are not a factor in
used for testing and development purposes, the screening exam from the preceding our modeling approach. It is likely that other lung segmentation approaches
year was selected. As screening read data from NLST were gathered once per year could substitute this component. Finding the lung center allows us to focus
for each patient, it was important also to evaluate the model once per year for each further processing on the lungs.
patient for our tuning and testing sets. We chose the latest case per screening year, (2) Cancer ROI detection model. This was trained on 1.4 × 0.7 mm2 (spac-
since this was the most likely case to have generated the screening read because ing, pixel size) voxel size volumes. The cancer ROI detection architecture
patients typically were asked to return only if imaging was inadequate. Within was a RetinaNet47 modified to be in 3D and to remove the feature pyramid
each case we used the best available reconstruction kernel (See Supplementary network48. Extended Data Fig. 6a demonstrates how a large ROI was cropped
Information, Kernel selection) with the highest number of slices. around each bounding box detected. The detection model was initialized
An independent dataset from an academic medical center was used to further by first training on LIDC39 and then trained on radiologist-annotated lesion
validate the model’s performance. This dataset consisted of 1,139 cases from bounding boxes collected on the NLST dataset. The cancer ROI detection
907 patients collected as part of a screening program (see Extended Data Fig. 9 for component outputs ROIs from all input volumes, even if no nodules are
exclusion criteria and further details); 209 of the patients and 232 of the cases had present. In this case, the most nodule-like regions are proposed as ROIs.
priors available. These data were not used in the training or tuning of the model. (3) Full-volume model. An end-to-end convolutional model, 3D inflated Incep-
The data were a fully de-identified lung cancer screening CT dataset. The ground tion V1 (ref. 49,50), was trained on the 1.5-mm3 voxel size volumes to predict
truth for cancer on this dataset was defined based on lung cancer International cancer within 1 year, fine-tuning from a checkpoint trained on ImageNet51.
Classification of Disease codes with biopsy or surgical confirmation of cancer Each of these volumes was a large region cropped around the center of the
via manual review of the pathology note. For cancer-negatives, patients had a bounding box as determined by lung segmentation. This cancer prediction
cancer-free follow-up examination at least 1 year after the initial screening exam. model was trained with focal loss47 to try to mitigate the sparsity of positive
Slice spacing for CTs in this dataset varied from 1.25 to 3.0 mm, with the majority examples. We trained the model to predict cancer probability and then
(84%) being 3.0 mm. Notably, our training set in NLST had a maximum spacing of used the last layer before the final probability, which contains 1,024 units.
2.5 mm, suggesting that our model generalized to different scanning parameters. We take these 1,024 numbers as the output for this model, and use them as
features later on.
(4) Cancer risk prediction model. A final cancer classification model was used to
Model development and training. Overall, the model is trained to take the entire consider the output of the previous two models. In all cases, 3D Inception is
CT volume and automatically produce a score predicting whether the patient used to extract features. Throughout the model components, our approach to
received a cancer diagnosis in the same study year. First, for clarity it is important classifying and extracting features from 3D volumes is heavily based on this
to define the following terms. 3D Inception model51. First, features were extracted from the detected ROIs
Volume always refers to the full CT volume (that is, the entire set of axial (Extended Data Fig. 6a). Features from the full-volume model were appended
images comprising the volume)—whether in original resolution or resampled. to the final layers of each detected ROI in the second-stage model, so that all
When we describe that the ‘volume’ is labeled as malignant or non-malignant, we predictions relied on both nodule-level local information and global context
intend to communicate that the label is at a case level (that is, ‘there is cancer in the from the entire CT volume. Extended Data Fig. 10 illustrates the unified end-
CT scan somewhere’). to-end approach, after the top two candidate ROIs were passed to the second-
Bounding box is a rectangular 3D sub-volume containing a malignancy. Our stage malignancy classification model. It was trained as a single convolutional
radiologist labelers were instructed to draw boxes that tightly encapsulate the neural network with shared parameters across all detected ROIs. Each ROI
malignancy. We call these resulting sub-volumes bounding boxes for this reason. was passed through this network to predict its individual malignancy score.
Our detection model aims to predict these bounding boxes. The final prediction was generated by combining the two probability scores
ROI is a fixed-size, 3D sub-volume containing a malignancy and some as shown in Extended Data Fig. 10 (ref. 32). This model was also trained with
surrounding context. Once we have bounding boxes from our detection model, we focal loss47 to try to mitigate the sparsity of positive examples.
take a fixed 90-mm3 region around each bounding box. We call this larger 3D sub-
volume an ROI. The final cancer prediction model was developed to allow as input either
(4) Cancer risk prediction model. A final cancer classification model was used to consider the output of the previous two models. In all cases, 3D Inception is used to extract features; throughout the model components, our approach to classifying and extracting features from 3D volumes is heavily based on this 3D Inception model51. First, features were extracted from the detected ROIs (Extended Data Fig. 6a). Features from the full-volume model were appended to the final layers of each detected ROI in the second-stage model, so that all predictions relied on both nodule-level local information and global context from the entire CT volume. Extended Data Fig. 10 illustrates the unified end-to-end approach once the top two candidate ROIs have been passed to the second-stage malignancy classification model. It was trained as a single convolutional neural network with shared parameters across all detected ROIs. Each ROI was passed through this network to predict its individual malignancy score. The final prediction was generated by combining the two probability scores as shown in Extended Data Fig. 10 (ref. 32). This model was also trained with focal loss47 to try to mitigate the sparsity of positive examples.
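For reference, a minimal NumPy sketch of the binary focal loss47 (the alpha and gamma values below are the defaults suggested in ref. 47, not necessarily our tuned settings):

import numpy as np

def binary_focal_loss(y_true, p_pred, alpha=0.25, gamma=2.0, eps=1e-7):
    # FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t): well-classified
    # examples are down-weighted so the rare positives dominate training.
    p = np.clip(p_pred, eps, 1.0 - eps)
    p_t = np.where(y_true == 1, p, 1.0 - p)
    alpha_t = np.where(y_true == 1, alpha, 1.0 - alpha)
    return np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t))

# A confident, correct negative contributes almost nothing to the loss,
# while a missed positive is penalized heavily:
print(binary_focal_loss(np.array([0]), np.array([0.01])))  # ~7.5e-7
print(binary_focal_loss(np.array([1]), np.array([0.01])))  # ~1.13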
The final cancer prediction model was developed to accept as input either a single CT scan (without a prior) or both the current and prior-year scans (with a prior). The prior and current volumes were aligned based on the lung bounding box centers of the two volumes, and then by aligning the centers of nearby candidate ROIs from the prior scan when available (Extended Data Fig. 6b). In each case a 3D shift of the prior volume is performed to align the two centers. Higher-level spatial feature maps from the current and prior scans were combined and passed through additional convolutional layers with batch normalization. Since the features of the current and prior scans considered at these higher levels represent the entire 90-mm3 sub-volume (the 64 × 128-mm2 cropped sub-volume with voxel size 1.4 × 0.7 mm2) at low spatial resolution, precise alignment of the nodules is not required.
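A minimal sketch of this center-based alignment (NumPy; an integer-voxel shift with zero padding, assuming both volumes were already resampled to the same voxel size; all names are illustrative):

import numpy as np

def align_by_center(prior_vol, prior_center, current_center):
    # Translate the prior volume so that its detected center (candidate
    # ROI center, or lung center when no prior detection is available)
    # lands on the corresponding center in the current volume. Voxels
    # shifted in from outside the field of view are zero-padded; sub-voxel
    # precision is unnecessary given the later max-pooling layers.
    shift = np.round(np.asarray(current_center)
                     - np.asarray(prior_center)).astype(int)
    out = np.zeros_like(prior_vol)
    src = tuple(slice(max(0, -s), prior_vol.shape[i] - max(0, s))
                for i, s in enumerate(shift))
    dst = tuple(slice(max(0, s), prior_vol.shape[i] - max(0, -s))
                for i, s in enumerate(shift))
    out[dst] = prior_vol[src]
    return out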
In the case of a malignant prediction, nodule localization was performed by selecting the ROI with the highest malignancy score.
For a benign prediction, the detection model is still forced to produce two ROIs, which are then rejected later by the cancer risk prediction model. The final model is an ensemble of ten models trained with different random initializations. Additional detail can be found in Supplementary Information, Additional modeling details.

Clinical validation. The NLST-based test set comprised 6,716 cases, 86 of which had a biopsy-confirmed cancer within 1 year of screening. The model’s output is a probability between 0 and 1, which was bucketed using three thresholds. We used a previously developed approach to estimate the positive predictive value (PPV) of Lung-RADS 3, Lung-RADS 4A and Lung-RADS 4B/X33. We then chose three operating points that matched these PPV values on our tuning set, to have comparable probability of malignancy with the four existing Lung-RADS risk buckets. Since Lung-RADS 1 and 2 have the same management recommendation (return to routine annual screening) and risk of malignancy, we grouped them in the same bucket for this experiment. These operating points define LUMAS by establishing cutoffs for 1/2 versus 3 and 4A/B/X, 1/2/3 versus 4A/B/X, and 1/2/3/4A versus 4B/X: the same cutoffs within Lung-RADS at which the likelihood of malignancy increases and management changes. When readers gave S- (other non-lung cancer findings) or C- (prior lung cancer diagnosis) modified ratings, these were treated in the same way as ratings without modifications (for example, 3C was treated the same as 3), and cases with ratings of 0 were considered not gradable and were dropped from the analysis. Both test sets were run only once to avoid influencing model development. Additionally, all individuals who worked on modeling and image analysis were blinded to the diagnoses in the test set.

Operating point selection. We define three LUMAS operating points as a way to compare the model to the readers. We computed Lung-RADS 3+ performance on our tune set using the nodule annotations from the original NLST readers, arriving at a PPV of 0.11. We then adjusted the threshold of our model on the tune set to match this PPV of 0.11 and used the resulting model score threshold as our LUMAS 3+ threshold. We estimated 4A+ and 4B/X PPVs using a previous analysis of NLST33, which gave a PPV of 0.15 for 4A+ and 0.25 for 4B/X, from which we computed the LUMAS thresholds for 4A and 4B/X, respectively. We present an alternative way of making model-to-reader comparisons in Supplementary Table 4.
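A sketch of the threshold search used to match a target PPV on the tune set (plain NumPy; scores and labels stand for the model’s tune-set outputs and the binary cancer outcomes, and the function name is illustrative):

import numpy as np

def threshold_for_ppv(scores, labels, target_ppv):
    # Scan cutoffs from the most to the least confident prediction and
    # return the most inclusive score cutoff whose tune-set PPV (precision
    # among cases with score >= cutoff) still reaches the target, for
    # example 0.11 for LUMAS 3+, 0.15 for 4A+ and 0.25 for 4B/X.
    scores, labels = np.asarray(scores), np.asarray(labels)
    order = np.argsort(scores)[::-1]
    tp = np.cumsum(labels[order])
    ppv = tp / np.arange(1, len(labels) + 1)
    qualifying = np.where(ppv >= target_ppv)[0]
    if len(qualifying) == 0:
        raise ValueError('Target PPV not attainable on this tune set.')
    return scores[order][qualifying[-1]]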
Retrospective application of model to NLST. The model used current and prior CT volumes when available. We followed the methodology in prior work33 to estimate the performance of Lung-RADS 3 across the entire held-out test set, using nodule growth annotations to take priors into account when possible. For brevity, we call this performance estimate retrospective-Lung-RADS. As can be seen from Figs. 2b and 3b, retrospective-Lung-RADS seems to overestimate the specificity and underestimate the sensitivity compared to the readers for Lung-RADS 3+. One reason for these differences may be that the NLST nodule annotations are insufficient for accurate retrospective computation of Lung-RADS: for example, endobronchial nodules were not noted in the NLST data and were therefore ignored, and the exact amount of nodule growth was not noted. As retrospective-Lung-RADS operates in such a different part of the receiver operating curve compared to Lung-RADS, we chose a different, non-LUMAS operating point to compare our model’s performance to the readers in NLST (see Fig. 4a). We found two operating points, one matching the sensitivity and one matching the specificity of retrospective-Lung-RADS on our tune set. We then chose a final operating point midway between these two to improve both sensitivity and specificity in a balanced manner.

Reader studies. A two-part reader study was conducted comparing the model to six radiologists on a subset of the test set. All radiologists were US board-certified, with an average of 8 years’ clinical experience (range 4–20 years). Each reader independently reviewed the same set of cases and applied the Lung-RADS 2014 v.1 criteria to determine a Lung-RADS score. A fully featured, web-based DICOM viewer (eUnity, Client Outlook Inc.) with FDA 510(k) clearance was used to evaluate cases. While the first reader study did not use prior imaging, the second used a single prior CT scan for comparison. In each case, readers were given information about the patient: race, gender, ethnicity, smoking history and cancer history. The model does not make use of this clinical information, as initial experiments with these data did not improve performance.

Performance comparisons for malignancy risk evaluation were made between the model and the average results of the six radiologists. For the model (using LUMAS) and the average reader (using Lung-RADS), we computed sensitivity and specificity at each of the three risk bucket thresholds: 3+, 4A+ and 4B/X. The average reader sensitivity and specificity were computed by taking the average of the six individual reader sensitivities and specificities, respectively.

The without-priors reader study subset consisted of 507 patients, 83 of whom were cancer-positive. There was a single volume per patient, and the subset was enriched for biopsied cases (see Extended Data Fig. 1 for details on exclusion and enrichment). The cancer-negative biopsy cases were down-weighted in the subsequent analysis so that the final metrics on negatives were representative of a random sampling of negatives from the NLST test set. The with-priors reader study was conducted on all cases from the without-priors reader study that had an available prior.

Reader study—lung cancer screening on a single CT volume. A total of 507 cases (83 cancer-positive) were each independently interpreted by six US board-certified radiologists. In this study, neither the model nor the readers were given access to prior cases. Only axial CT slices were available for the first 250 cases; for the remaining cases, sagittal and coronal reformations and maximum-intensity projection images were available. Lung-RADS scores, slice number and anatomic lung location were recorded, and readers saved an ROI for each lesion with a Lung-RADS score of 2 or greater.

Reader study—lung cancer screening using current and prior CT volume. After completing part one, 308 patients (40 cancer-positive) known to have prior CTs available were re-presented to the same readers, now with the prior-year scan available and with readers following the Lung-RADS guidelines for baseline comparisons. They were then allowed to modify their Lung-RADS scores from part one. The model was also given access to the same CT scan from the prior year.
Localization analysis. Each NLST cancer-positive volume was labeled by two radiologists from a pool of nine (five fellows and four practicing radiologists, with a range of experience of 4–21 years (average, 9 years)). The radiologists were given the coarse locations of the pathology-confirmed malignancies noted in NLST and were then asked to label all malignancies with bounding boxes. We used all boxes labeled by either radiologist. Overlapping boxes referencing the same malignancy were combined into a single box by averaging the coordinates comprising the box. The Hit@N metric was defined as the fraction of true-positive cases in which the top N candidate lesions from the detection model had any overlap with an annotated malignancy. The recall metric involving all cancer-positive cases is presented in Supplementary Table 8.
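A sketch of the Hit@N computation (plain Python; boxes are given as (z0, y0, x0, z1, y1, x1) corner pairs, a convention chosen here for illustration):

def boxes_overlap(a, b):
    # Two axis-aligned 3D boxes intersect iff they overlap on every axis.
    return all(a[i] < b[i + 3] and b[i] < a[i + 3] for i in range(3))

def hit_at_n(cases, n):
    # Fraction of true-positive cases in which any of the top-n detected
    # candidates (sorted by score) overlaps any annotated malignancy box.
    hits = sum(
        any(boxes_overlap(det, gt)
            for det in case['detections'][:n]
            for gt in case['annotations'])
        for case in cases)
    return hits / len(cases)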
Statistical analysis. All confidence intervals were computed from the percentiles of 1,000 random resamplings (bootstraps) of the data. Confidence intervals for differences were derived by computing the metric of interest and then computing the reader–model difference on each bootstrap. P values for sensitivity and specificity comparisons were computed using a standard permutation test52 with 10,000 random resamplings of the data. Briefly, for each resampling we randomly swapped the reader and model results for each case35. We then performed a two-sided hypothesis test comparing the observed model–reader difference with the distribution of 10,000 model–reader differences across the resampled data to obtain an empirical P value.
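These resampling procedures can be sketched as follows (NumPy; a percentile bootstrap for a reader–model difference and the paired permutation test described above, simplified to a single generic metric over per-case arrays):

import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci(metric, reader, model, labels, n_boot=1000, alpha=0.05):
    # Percentile CI for metric(reader) - metric(model) over case resamples.
    n = len(labels)
    diffs = []
    for _ in range(n_boot):
        i = rng.integers(0, n, n)  # resample cases with replacement
        diffs.append(metric(reader[i], labels[i]) - metric(model[i], labels[i]))
    return np.percentile(diffs, [100 * alpha / 2, 100 * (1 - alpha / 2)])

def permutation_p(metric, reader, model, labels, n_perm=10_000):
    # Two-sided empirical P value: randomly swap reader and model
    # outputs per case and compare against the observed difference.
    observed = abs(metric(reader, labels) - metric(model, labels))
    count = 0
    for _ in range(n_perm):
        swap = rng.random(len(labels)) < 0.5
        r, m = np.where(swap, model, reader), np.where(swap, reader, model)
        count += abs(metric(r, labels) - metric(m, labels)) >= observed
    return count / n_perm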
Reporting Summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
This study used three datasets that are publicly available: LUNA: https://luna16.grand-challenge.org/data/; LIDC: https://wiki.cancerimagingarchive.net/display/Public/LIDC-IDRI; NLST: https://biometry.nci.nih.gov/cdas/learn/nlst/images/. The dataset from Northwestern Medicine was used under license for the current study, and is not publicly available.

Code availability
The code used for training the models has a large number of dependencies on internal tooling, infrastructure and hardware, and its release is therefore not feasible. However, all experiments and implementation details are described in sufficient detail in the Methods section to allow independent replication with non-proprietary libraries. Several major components of our work are available in open source repositories: Tensorflow: https://www.tensorflow.org; Tensorflow Estimator API: https://www.tensorflow.org/guide/estimators; Tensorflow Object Detection API: https://github.com/tensorflow/models/tree/master/research/object_detection (the lung segmentation model and cancer ROI detection model were trained using this framework); Inflated Inception: https://github.com/deepmind/kinetics-i3d (the full-volume model and the second-stage model were trained using this feature extractor).

References
41. National Cancer Institute. National Lung Screening Trial https://www.cancer.gov/types/lung/research/nlst (2018).
42. The American College of Radiology. Adult lung cancer screening specifications. https://www.acr.org/Clinical-Resources/Lung-Cancer-Screening-Resources (2014).
43. Abadi, M. et al. Tensorflow: a system for large-scale machine learning. OSDI 16, 265–283 (2016).
44. He, K., Gkioxari, G., Dollar, P. & Girshick, R. Mask R-CNN. In IEEE International Conference on Computer Vision (ICCV) https://doi.org/10.1109/iccv.2017.322 (IEEE, 2017).
45. Setio, A. A. A. et al. Validation, comparison, and combination of algorithms for automatic detection of pulmonary nodules in computed tomography images: The LUNA16 challenge. Med. Image Anal. 42, 1–13 (2017).
46. Huang, J. et al. Speed/accuracy trade-offs for modern convolutional object detectors. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) https://doi.org/10.1109/cvpr.2017.351 (IEEE, 2017).

47. Lin, T.-Y., Goyal, P., Girshick, R., He, K. & Dollar, P. Focal loss for dense object detection. IEEE Trans. Pattern Anal. Mach. Intell. Preprint at https://doi.org/10.1109/TPAMI.2018.2858826 (2018).
48. Lin, T.-Y. et al. Feature pyramid networks for object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) https://doi.org/10.1109/cvpr.2017.106 (IEEE, 2017).
49. Carreira, J. & Zisserman, A. Quo vadis, action recognition? A new model and the Kinetics dataset. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) https://doi.org/10.1109/cvpr.2017.502 (IEEE, 2017).
50. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J. & Wojna, Z. Rethinking the Inception architecture for computer vision. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) https://doi.org/10.1109/cvpr.2016.308 (IEEE, 2016).
51. Deng, J. et al. ImageNet: a large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition https://doi.org/10.1109/cvprw.2009.5206848 (IEEE, 2009).
52. Chihara, L. M. & Hesterberg, T. C. Mathematical Statistics with Resampling and R (John Wiley & Sons, 2014).


Extended Data Fig. 1 | NLST STARD diagram. a, Diagram describing exclusions made in our analysis. b, Table describing exclusions made by the NCI when selecting images to release from NLST. Note that there were 623 screen-detected cancers but a total of 638 cancer-positive patients; the additional 15 patients were diagnosed during the screening window, but not because of a positive screening result. In this table, row 3 (‘Relevant Images’) means that, for cancer-positive patients, images from the year of the cancer diagnosis were available and, for cancer-negative patients, that all 3 years of screening images were available. Note that the publicly available version of NLST downsampled screening groups 3 (no nodule, some abnormality) and 4 (no nodule, no abnormality). In Extended Data Figs. 2, 3 and 4 and Supplementary Table 4 we present another version of the main analysis that compensates for this downsampling by upweighting patients within these groups.


Extended Data Fig. 2 | Results from the reader study—lung cancer screening on a single CT volume: reweighted. a–e, Identical to Fig. 2, except that
we took into account the biased sampling done in the selection of the NLST data released. This meant that examples in screening groups 3 (no nodule,
some abnormality) and 4 (no nodule, no abnormality) were upweighted by the same factor by which they were downsampled (see Extended Data Fig. 1
for further details on the groups). Model performance shown in the AUC curve and summary tables is based on case-level malignancy score. LUMAS
buckets refers to operating points selected to match the predicted probability of cancer for Lung-RADS 3+, 4A+ and 4B/X. a, Performance of model (blue
line) versus average radiologist for various Lung-RADS categories (crosses) using a single CT volume. The lengths of the crosses represent the confidence
intervals. The area highlighted in blue is magnified in b to show the performance of each of the six radiologists at various Lung-RADS risk buckets.
c, Sensitivity comparison between the model and the average radiologist. d, Specificity comparison between the model and the average radiologist. Both sensitivity and specificity analyses were conducted with n = 507 volumes from 507 patients, with P values computed using a two-sided permutation test with 10,000 random resamplings of the data. e, Hit rate localization analysis used to measure how often the model correctly localized a cancerous lesion.


Extended Data Fig. 3 | Results from the reader study—lung cancer screening using current and prior CT volume: reweighted. a–e, Identical to Fig. 3,
except that we took into account the biased sampling done in the selection of the 15,000-patient NLST dataset released. This meant that for screening groups 3
(no nodule, some abnormality) and 4 (no nodule, no abnormality) we upweighted each example by the same factor by which they were downsampled.
Model performance in the AUC curve and summary tables is based on case-level malignancy score. The term LUMAS buckets refers to operating points
selected to represent sensitivity/specificity at the 3+, 4A+ and 4B/X thresholds. a, Performance of model (blue line) versus average radiologist at various
Lung-RADS categories (crosses) using a CT volume and a prior CT volume for a patient. The length of the crosses represents the 95% confidence interval.
The area highlighted in blue is magnified in b to show the performance of each of the six radiologists at various Lung-RADS categories in this reader
study. c, Sensitivity comparison between the model and the average radiologist. d, Specificity comparison between the model and the average radiologist. Both sensitivity and specificity analyses were conducted with n = 308 volumes from 308 patients, with P values computed using a two-sided permutation test with 10,000 random resamplings of the data. e, Hit rate localization analysis used to measure how often the model correctly localized a cancerous lesion.


Extended Data Fig. 4 | Results from the full NLST test set and independent test set: reweighted. a,b, Identical to Fig. 4 except that we took into account
the biased sampling done in the selection of the NLST data released. This meant that for screening groups 3 (no nodule, some abnormality) and 4
(no nodule, no abnormality) we upweighted each example by the same factor by which they were downsampled. The comparison was performed on
n = 6,716 cases, using a two-sided permutation test with 10,000 random resamplings of the data. a, Comparison of model performance to NLST reader
performance on the full NLST test set. NLST reader performance was estimated by retrospectively applying Lung-RADS 3 criteria to the NLST reads. b,
Sensitivity and specificity comparisons between the model and Lung-RADS retrospectively applied to NLST reads.


Extended Data Fig. 5 | Independent dataset ROC curve and intersection over union for localization. a, ROC curve for the independent test set with n = 1,139 cases, using a two-sided permutation test with 10,000 random resamplings of the data. b, For each detection that was a ‘hit’ (that is, it overlapped a labeled malignancy), this plot shows the volume of the intersection between the detection and the ground truth divided by the volume of their union. In 3D, intersection over union (IOU) drops much faster than in two dimensions (2D): for example, given a nodule measuring 1 mm per side and a correctly centered bounding box measuring 2 mm per side, the resulting IOU is 1/8 = 0.125, whereas the analogous situation in 2D would give an IOU of 1/4 = 0.25.
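The caption’s arithmetic as a two-line check (pure Python; cubes specified by side length, with the ground truth centered inside the detection, so the intersection is the inner volume and the union is the outer volume):

def centered_cube_iou(inner_side, outer_side, dims):
    # IOU of a cube of side inner_side centered in one of side outer_side.
    return (inner_side / outer_side) ** dims

print(centered_cube_iou(1, 2, dims=3))  # 0.125 in 3D
print(centered_cube_iou(1, 2, dims=2))  # 0.25 in 2D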


Extended Data Fig. 6 | Examples of ROIs from the detection model and examples of cases where the model prediction differs from the consensus grade.
a, Example slices from cancer ROIs (cyan) determined by bounding boxes (red) detected by the cancer ROI detection model. The final classification model takes the larger surrounding context, illustrated by the cyan ROI, as input. b, Sample alignment of a prior CT with the current CT based on the detected cancer
bounding box, which is performed by centering both sub-volumes at the center of their respective detected bounding boxes. When a prior detection is not
available, the lung center is used for an approximate alignment. Note that features derived from this large, 90-mm3 context are compared for classification
at a late stage in the model after several max-pooling layers that can discard spatial information. Therefore, a precise voxel-to-voxel alignment is not
necessary. c, Example cancer-negative case with scarring that was correctly downgraded from a consensus grade of Lung-RADS 4B to LUMAS 1/2 by the
model. d, Example cancer-positive case with a nodule (size graded as 7–12 mm, depending on the radiologist) correctly upgraded from grades of Lung-
RADS 3 and 4A (depending on the radiologist) to LUMAS 4B/X by the model.


Extended Data Fig. 7 | Attribution maps generated using integrated gradients. a, Example of model attributions for a cancer-positive case. The top row
shows the input volume for the full-volume and cancer risk prediction models, respectively. The lower row shows the attribution overlay with positive
(magenta) and negative (blue) region contributions to the classifications. In all cancer cases under the attributions study, the readers strongly agreed
that the model focused on the nodule. Also, in 86% of these cases, the global and second-stage models focused on the same region. b, Example of model
attributions for a cancer-negative case. The left-hand image shows a slice from the input sub-volume. The right-hand image shows positive (magenta) and negative (blue) attributions overlaid. The readers found that, in 40% of the negative cases examined, the model focused on vascular
regions in the parenchyma.


Extended Data Fig. 8 | Example LUMAS false positive cases. a, 4B/X false positives. b, 4A+ false positives.


Extended Data Fig. 9 | STARD diagram of low-dose-screening CT patients from an academic medical center used for the independent validation test
set. We required a minimum of 1 year of follow-up for cancer-negative cases. This resulted in a median follow-up time of 625 d across all patients once all
exclusion criteria were taken into account. To clarify, this means that the median amount of time from the first screening CT to either a cancer diagnosis or
the last follow-up event was 625 d. There were 209 patients (232 cases) with priors in this set of 1,139.


Extended Data Fig. 10 | Illustration of the architecture of the end-to-end cancer risk prediction model. The model is trained to take the entire CT volume as input and automatically produce a score predicting the cancer diagnosis. In all cases, the input volume is first resampled into two different fixed
voxel sizes as shown. Two ROI detections are used per input volume, from which features are extracted to arrive at per-ROI prediction scores via a fully
connected neural network. The prior ROI is padded to all zeros when a prior is not available.

Corresponding author(s): Daniel Tse

Reporting Summary
Nature Research wishes to improve the reproducibility of the work that we publish. This form provides structure for consistency and transparency
in reporting. For further information on Nature Research policies, see Authors & Referees and the Editorial Policy Checklist.

Statistical parameters
When statistical analyses are reported, confirm that the following items are present in the relevant location (e.g. figure legend, table legend, main
text, or Methods section).
n/a Confirmed
- The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement
- An indication of whether measurements were taken from distinct samples or whether the same sample was measured repeatedly
- The statistical test(s) used AND whether they are one- or two-sided. Only common tests should be described solely by name; describe more complex techniques in the Methods section.
- A description of all covariates tested
- A description of any assumptions or corrections, such as tests of normality and adjustment for multiple comparisons
- A full description of the statistics including central tendency (e.g. means) or other basic estimates (e.g. regression coefficient) AND variation (e.g. standard deviation) or associated estimates of uncertainty (e.g. confidence intervals)
- For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted. Give P values as exact values whenever suitable.
- For Bayesian analysis, information on the choice of priors and Markov chain Monte Carlo settings
- For hierarchical and complex designs, identification of the appropriate level for tests and full reporting of outcomes
- Estimates of effect sizes (e.g. Cohen's d, Pearson's r), indicating how they were calculated
- Clearly defined error bars. State explicitly what error bars represent (e.g. SD, SE, CI)

Our web collection on statistics for biologists may be useful.

Software and code


Policy information about availability of computer code
Data collection eUnity: fully featured PACS viewer with FDA 510(k) clearance. Used to collect reader study results.
MAPLE: Internal labeling tool. Used to collect localization ground truth.

Data analysis Colab: Internal version of Colab, which is an IPython notebook viewer
Pandas: Internal fork of open source library Pandas which is a framework for tabular data
Matplotlib: Internal fork of open source library Matplotlib which is for making plots
sklearn: Internal fork of open source library Scikit-Learn which we used for metrics such as AUC
Tensorflow: Internal fork of open source library used to train machine learning models
Apache Beam: Internal fork of open source library used for large scale batch processing
Tensorflow object detection API: https://github.com/tensorflow/models/tree/master/research/object_detection
Inflated Inception: https://github.com/deepmind/kinetics-i3d

For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors/reviewers
upon request. We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Research guidelines for submitting code & software for further information.

Data



Policy information about availability of data
All manuscripts must include a data availability statement. This statement should provide the following information, where applicable:
- Accession codes, unique identifiers, or web links for publicly available datasets
- A list of figures that have associated raw data
- A description of any restrictions on data availability
We used three datasets which are publicly accessible:

LUNA: https://luna16.grand-challenge.org/data/
LIDC: https://wiki.cancerimagingarchive.net/display/Public/LIDC-IDRI
NLST: https://biometry.nci.nih.gov/cdas/learn/nlst/images/

The dataset from Northwestern was used under license for the current study, and so is not publicly available. The data, or a test subset, may be available from
Northwestern Medicine subject to ethical approvals.

Field-specific reporting
Please select the best fit for your research. If you are not sure, read the appropriate sections before making your selection.
Life sciences Behavioural & social sciences Ecological, evolutionary & environmental sciences
For a reference copy of the document with all sections, see nature.com/authors/policies/ReportingSummary-flat.pdf

Life sciences study design


All studies must disclose on these points even when the disclosure is negative.
Sample size The first step in determining the sample size was the size of the test set we decided to use for the dataset from the National Lung Cancer
Screening Trial (NLST). We had to balance having enough data to train the algorithm against having enough data to validate it. We
used a 70% training (29,541 cases, 401 cancer positive), 15% tuning (6,309 cases, 100 cancer positive), 15% testing (6,729 cases, 87 cancer
positive) split which is a standard way of splitting datasets for deep learning research. We believe this sample size was sufficient for the test
set because the test set represents all 33 sites in the NLST trial, it contains all 4 stages of cancer, and all CT manufacturers present in the trial.

For our independent dataset, the medical institution returned all available cases after NLST publication related to lung cancer screening. We
used all cases where we could arrive at a clear conclusion about the cancer outcome.

For our reader studies, we used positive enrichment by taking all cases within the test set with a same-year positive cancer diagnosis or
biopsy, and then randomly sampling negatives. We believe the sample of negatives was sufficient as it was 5x larger than the number of
positives used and we were able to see statistically significant improvements in performance for specificity in both reader studies.

Data exclusions We excluded data only when it made subsequent analysis impossible:

We excluded 3 studies that our readers determined were not gradable, as there would be no way of making a reader–model comparison since no reader grade was returned.
Cases where neither reader found a bounding box suspicious for malignancy in the volume were excluded from the localization analysis, since there was no bounding box to compare against.
There were a small number of patients in the independent dataset for whom either no images were available or it was not possible to assess ground truth owing to insufficient follow-up, for instance when an image was suspicious for cancer but lacked biopsy confirmation.

Replication We replicated the high performance of our model on a completely independent dataset from an academic medical center, with different scan
parameters, and from a disjoint time period.

Randomization For NLST, we randomly split patients into the train, tune or test split. All imaging and metadata from each patient were associated with the same split as the patient.

For the reader study, we randomly selected negative cases from the test set. After a random selection of cases we randomly chose one
volume from each patient to avoid having the same patient twice in the reader study.

Blinding We held out the data from the test set and did not give anyone in the research group access to the images until we froze our choice of model
and produced the test set results. We have done only one previous evaluation on the test set, for an abstract for RSNA-2018 (using a different model). In that case we only ran the model on the test set once, withholding access otherwise. No one on the model development team has
been allowed to inspect the model’s performance on the test set at any point.



Reporting for specific materials, systems and methods

Materials & experimental systems Methods


n/a Involved in the study n/a Involved in the study
Unique biological materials ChIP-seq
Antibodies Flow cytometry
Eukaryotic cell lines MRI-based neuroimaging
Palaeontology
Animals and other organisms
Human research participants

Human research participants


Policy information about studies involving human research participants
Population characteristics For NLST, the patient population characteristics are best described in the original NLST publication:
The National Lung Screening Trial: Overview and Study Design
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3009383/

For our independent dataset, we included all patients from the center who underwent lung cancer
screening.

Recruitment All participants enrolling in NLST signed an informed consent form developed and approved by the screening centers' institutional review boards (IRBs), the National Cancer Institute (NCI) IRB and the Westat IRB. Additional details regarding cases in the dataset are available through the National Institutes of Health Cancer Data Access System.
The independent dataset was gathered retrospectively under approval from the Northwestern University IRB.
