Cancer Microbiome Atlas
Cancer Microbiome Atlas
                                                                    Correspondence
                                                                    anders.dohlman@duke.edu (A.B.D.),
                                                                    xiling.shen@duke.edu (X.S.)
                                                                    In Brief
                                                                    Dohlman et al. present The Cancer
                                                                    Microbiome Atlas, a public database of
                                                                    decontaminated, tissue-resident
                                                                    microbial profiles of TCGA
                                                                    gastrointestinal cancer tissues. As these
                                                                    profiles are matched to specific TCGA
                                                                    tissue samples, this work allows
                                                                    identification of prognostic species and
                                                                    provides a resource for performing multi-
                                                                    omic, pan-cancer analyses of host-
                                                                    microbe interactions.
Highlights
d   Decontaminated microbial compositions for 3,689
    gastrointestinal cancer samples
Resource
The cancer microbiome atlas: a pan-cancer
comparative analysis to distinguish tissue-resident
microbiota from contaminants
Anders B. Dohlman,1,* Diana Arguijo Mendoza,1 Shengli Ding,1 Michael Gao,2 Holly Dressman,3 Iliyan D. Iliev,4
Steven M. Lipkin,4 and Xiling Shen1,5,*
1Department of Biomedical Engineering, Center for Genomics and Computational Biology, Duke Microbiome Center, Duke University,
SUMMARY
Studying the microbial composition of internal organs and their associations with disease remains chal-
lenging due to the difficulty of acquiring clinical biopsies. We designed a statistical model to analyze the prev-
alence of species across sample types from The Cancer Genome Atlas (TCGA), revealing that species
equiprevalent across sample types are predominantly contaminants, bearing unique signatures from each
TCGA-designated sequencing center. Removing such species mitigated batch effects and isolated the tis-
sue-resident microbiome, which was validated by original matched TCGA samples. Gene copies and nucle-
otide variants can further distinguish mixed-evidence species. We, thus, present The Cancer Microbiome
Atlas (TCMA), a collection of curated, decontaminated microbial compositions of oropharyngeal, esopha-
geal, gastrointestinal, and colorectal tissues. This led to the discovery of prognostic species and blood
signatures of mucosal barrier injuries and enabled systematic matched microbe-host multi-omic analyses,
which will help guide future studies of the microbiome’s role in human health and disease.
INTRODUCTION                                                            sites and disease states, and these sequencing datasets contain
                                                                        a significant number of sequencing reads of microbial origin
The human body supports an ecosystem of 10–100 trillion micro-          (Kostic et al., 2011; Poore et al., 2020; Robinson et al., 2017).
organisms (Luckey, 1972; Sender et al., 2016), representing 500–        Large sequencing projects can, thus, be mined to promote un-
1,000 unique species per individual (Human Microbiome Project           derstanding of host-microbe interactions in both healthy and
Consortium, 2012; Qin et al., 2010). Perturbations to this              diseased human tissue. To that end, the bioinformatics tool
ecosystem, termed dysbiosis, can impact human health: micro-            PathSeq (Kostic et al., 2011) was used to identify enrichment
bial alterations have been implicated in a variety of health condi-     of Fusobacterium nucleatum in The Cancer Genome Atlas
tions, including obesity, diabetes, inflammatory bowel disease,         (TCGA) colorectal cancer (CRC) tumors (Kostic et al., 2013; Kos-
cancer, and other diseases (Elinav et al., 2019; Iliev and Leo-         tic et al., 2012). Since then, dozens of research articles explored
nardi, 2017; Levy et al., 2017; Schirmer et al., 2019). While public    the role of F. nucleatum in tumorigenesis, finding associations
microbiome projects such as the Human Microbiome Project                with stage, survival, metastasis, and even drug response (Bull-
(HMP) and MetaHIT have helped bring tremendous insights                 man et al., 2017; Flanagan et al., 2014; Yu et al., 2017). More
into the diversity and function of human flora, these databases         broadly, sequencing data from TCGA has been used ad hoc to
are dominated by tissue swabs and stool samples that do not             screen for viral and bacterial presence in stomach adenoma
necessarily reflect the microbial composition of internal organs        (Cancer Genome Atlas Research Network, 2014) and cervical
(Grice et al., 2008; Prast-Nielsen et al., 2019). Collection of clin-   cancer (Cancer Genome Atlas Research Network et al., 2017)
ical biopsies specifically dedicated to microbial profiling remains     specifically, as well as viromes (Tang et al., 2013) and bacter-
difficult despite many disease-related host-microbe interactions        iomes (Robinson et al., 2017). Recently, analysis of TCGA
occurring at the epithelium of internal body sites.                     sequencing data has been used to demonstrate the potential
   Next-generation sequencing (NGS) is frequently used to pro-          for bloodborne microbial DNA to diagnose certain cancers
file biopsied human tissue samples at a broad range of body             (Poore et al., 2020). Given that even low-biomass tumors contain
                                                            Cell Host & Microbe 29, 281–298, February 10, 2021 ª 2020 Elsevier Inc. 281
            ll
                                                                                                                                          Resource
Figure 1. WGS and WXS harbor colorectal bacterial reads distinct from blood and brain
See also Figure S1
(A) Matched analysis of bacterial sequencing reads per million (RPM) in normal tissue (yellow), tumor tissue (blue), and blood (red) from CRC and BC patients in
TCGA. Significance is given by paired, one-sided t tests.
(B) Abundance data from (A) but comparing solid tissue (pooled tumor and normal) with blood samples from BC (green) and CRC (brown) patients. Significance is
given by one-sided t tests.
(C) Comparison of bacterial species prevalence in WGS data for CRC blood and CRC tissue samples reveals populations of tissue-enriched species (blue) and
species that are equiprevalent in blood and tissue (red). Black circles denote species associated with MBI.
                                                                                                                               (legend continued on next page)
282 Cell Host & Microbe 29, 281–298, February 10, 2021
                                                                                                                                              ll
Resource
tissue-specific microbiomes (Nejman et al., 2020), analysis of                 tracted from TCGA sequencing data matched the microbial
microbial DNA and RNA in TCGA sequencing data has great po-                    composition of the original tissue samples.
tential for diagnostic applications, as well as for exploring host-               Finally, we ran the vetted decontamination algorithm to estab-
microbe interactions along molecular and clinical axes.                        lish the TCMA database, which users can access from an inter-
   However, few actionable microbiota targets like F. nucleatum                active website (https://tcma.pratt.duke.edu). The database
have emerged from such analyses. When examining a subset of                    contains curated tissue-resident microbial profiles for 4,937
TCGA sequencing data, previous analyses (Robinson et al.,                      sequencing runs on 3,689 unique samples from 1,772 patients
2017) found that microbial reads from a number of species                      representing 5 TCGA projects and 21 anatomic sites with tis-
were the result of contamination, and that distinguishing                      sue-resident populations. As proof-of-principle, we used
contamination from tissue-embedded microbes remained an                        TCMA to identify two bacterial co-abundance groups in CRC tis-
outstanding challenge for use of this dataset. Indeed, while con-              sue, including species enriched in CRC tumors compared with
cerns over contamination are less pressing for samples with high               matched adjacent normal tissue and species prognostic of
microbial biomass such as stool or swabs, microbiome studies                   patient survival. TCMA enabled a matched microbe-host tran-
on low-biomass samples suffer from contamination during sam-                   scriptomic, proteomic, and epigenetic analysis that identified as-
ple collection and DNA extraction. Contamination can originate                 sociations between microbes and host gene expression patterns
from the laboratory environment, including nucleic acid extrac-                and pathways. Finally, by comparing TCMA-curated blood sam-
tion kits (Davis et al., 2018; Eisenhofer et al., 2019; Glassing               ples of CRC and brain cancer (BC) patients, we identified a bac-
et al., 2016). Thus, controlling for contamination in these data-              terial signature associated with colorectal mucosal barrier injury
sets is a crucial step that must precede downstream analyses                   unique to CRC blood samples. Thus, TCMA constitutes a power-
of host-microbe interactions. Samples for multi-institutional pro-             ful resource for validation and hypothesis generation in future
jects are acquired, processed, and sequenced at different sites,               studies of host-microbe interactions relevant to cancer.
each of which may introduce its own contaminants that influence
the extracted microbial profiles, impede reproducibility, and                  RESULTS
complicate discovery of microbial biomarkers. Thus, a number
of strategies have been deployed to identify contamination in                  WGS and WXS harbor colorectal bacterial reads distinct
TCGA sequencing data, through examination of batch effects,                    from blood and brain
sample analyte concentrations, and through manual curation                     To explore the microbial populations of sequenced TCGA tissue,
(Poore et al., 2020; Robinson et al., 2017). To date, such analyses            we began by analyzing multi-platform sequencing data for 730
have never been validated by original TCGA tissue or blood sam-                tissue and 555 blood samples from 617 CRC (TCGA projects
ples, nor have decontaminated TCGA microbiome datasets                         COAD/READ) patients and for 958 tissue and 914 blood samples
been made readily available.                                                   from 923 BC (TCGA projects GBM/LGG) patients. For several
   Sequencing data in TCGA provide a unique opportunity for                    thousand whole-genome sequencing (WGS) and whole-exome
identifying tissue-specific microbiota, since matched tissue                   sequencing (WXS) experiments, we retrieved raw sequencing
and blood samples from various cancer types are processed                      data from the TCGA database and extracted and mapped
and sequenced in parallel using various sequencing platforms                   high-quality reads of bacterial origin using PathSeq (Kostic
at designated centers (Choi et al., 2017). Using an unbiased sta-              et al., 2011). We found that microbial reads were more abundant
tistical model comparing the prevalence of microbial species in                and more diverse in solid tissue than in matched blood samples
tissue and blood samples, we isolated the tissue-resident micro-               from CRC patients; in contrast, the abundance and diversity of
biome in TCGA sequencing data. We found that species equally                   microbial reads were no greater in BC tissue than matched blood
prevalent across tissue types and blood samples are mostly ar-                 samples (Figures 1A and S1A). Furthermore, CRC tissue had
tifacts or contaminants that (1) bear unique signatures from the               more abundant and diverse microbiota than BC tissue (Figures
designated TCGA sequencing center and (2) comprise more                        1B and S1B), consistent with the notion that blood and brain tis-
than half of all detectible microbial sequencing reads in many tis-            sue are more sterile than colorectal tissues. Notably, microbial
sue samples. With gene- and nucleotide-level resolution, our                   reads were also more abundant and diverse in blood samples
model is also capable of normalizing read counts for ‘‘mixed-ev-               from CRC patients than in those of BC patients (Figures 1B
idence’’ cases, in which sequencing reads aligning a given                     and S1B).
species may come from a combination of endogenous and                             Comparative analysis of microbial reads between CRC tissue
contaminant microbiota. To validate our approach, we obtained                  and blood samples revealed two distinct groups of bacterial spe-
original matched CRC tissue and plasma samples that were pre-                  cies: those enriched in tissue and those equally prevalent in tis-
viously sequenced by TCGA and performed 16S rRNA amplicon                      sue and blood (Figures 1C and S1C). Species were seldom more
sequencing. This independently confirmed not only the absence                  prevalent in blood than in tissue. Species that were more preva-
of putative contaminants but also that the tissue-resident,                    lent in CRC tissue than in CRC blood included many species
computationally decontaminated microbial profiles that we ex-                  known to be associated with mucosal barrier injury (MBI)
(D) Comparison of bacterial species prevalence in WGS data for CRC and BC samples reveals populations of CRC-enriched species (brown) and species that are
equiprevalent in CRC and BC (green).
(E) Relative abundance of bacterial phyla in WGS data for tissue (top) and blood (bottom) samples from CRC (left) and BC (right) patients.
(F and G) Heat tree comparing relative abundance of bacteria in WGS data for (F) matched blood samples (red) versus tissue samples (blue) and (G) CRC tissue
(brown) versus BC tissue (green).
                                                                                          Cell Host & Microbe 29, 281–298, February 10, 2021 283
           ll
                                                                                                                                    Resource
Figure 2. Most equiprevalent taxa are common contaminants and associated with particular sequencing centers
See also Figure S2
(A) Genera commonly found in negative controls of metagenomic sequencing experiments (Eisenhofer et al., 2019) are highly prevalent in blood samples.
(CDC, 2019), whereas the group equally present in CRC tissue                      negative controls of metagenomic sequencing experiments (Ei-
and blood contained very few such species (Figures 1C and                         senhofer et al., 2019). Overall, genera in this ‘‘common contam-
S1C). By comparison, nearly all species detected in samples                       inant’’ set were more prevalent in blood samples (p = 4.45e 10;
from BC patients were equiprevalent in tissue and blood, with                     Figures 2B and 2A) than genera not in the list.
few enriched in tissue (Figures 1C and S1C). Similar comparative                     Species equiprevalent in CRC tissue and blood were also
analyses of tissue and blood from CRC and BC patients showed                      considerably more genetically and phenotypically diverse than
significant populations of disease-enriched species for CRC but                   species enriched in CRC with respect to their G-C content,
few for BC (Figures 1D and S1D). We then repeated this compar-                    genome size, and optimal growth conditions. Conversely,
ative prevalence analysis using samples from ovarian cancer                       CRC-enriched species were much less tolerant to extreme
(OVC; TCGA project OV; Figures S1E and S1F).                                      growth conditions, with optimal temperature, pH, and NaCl
   The microbial composition of CRC tissue samples was also                       levels more closely resembling those of human homeostasis
markedly distinct from that of matched blood samples or BC tis-                   (Figures 2C and S2B). Together, these results suggest that the
sue samples. Among the most dominant phyla in CRC tissue                          equiprevalent group may contain contaminant species, which
were Bacteroidetes and Firmicutes, which were relatively absent                   have larger genomes—a signature of ‘‘generalist’’ bacteria that
in blood and brain tissue samples (Figures 1E and S1G). Next, we                  must endure more variable and unstable environmental condi-
compared the relative abundance of bacterial taxa in CRC tissue                   tions than gut microbiota (Sriswasdi et al., 2017).
versus matched blood samples (Figure 1F) and CRC tissue
versus BC tissues (Figure 1G) from different donors. These ana-                   Equiprevalent species are associated with particular
lyses were largely consistent, with taxa from Bacteroidetes,                      sequencing centers
Firmicutes, and Fusobacteria clades being consistently overrep-                   Principle coordinate analysis (PCoA) of UniFrac distances be-
resented in CRC tissue, compared with Proteobacteria and Ac-                      tween CRC samples demonstrated that the primary axis of
tinobacteria, which accounted for a relatively greater fraction of                variation in TCGA microbiome data could be attributed to differ-
reads in CRC blood samples and BC tissue samples. Genera                          ences between blood samples and tissue samples (41.43%)
that were relatively more abundant in CRC blood samples or                        (Figures 2D and S2C). Interestingly, the second axis of microbial
BC tissue samples compared with CRC tissue samples consis-                        variation was determined by the sequencing center at which the
tently included Acinetobacter, Mycobacterium, and Ralstonia,                      samples were processed (17.20%), regardless of sample type.
among others (Figures 1F and 1G). Metagenomic profiling of                        All TCGA samples (tissue and blood) were harvested at a tissue
TCGA samples using Kraken2 (Wood et al., 2019) largely recapit-                   source site and then sent to designated genome sequencing
ulated PathSeq results (Figures S1H–S1J).                                         centers (Figure S2D). While the first PCoA axis captured differ-
   Together, these comparative analyses were capable of distin-                   ences in the presence of tissue-enriched species that are more
guishing species enriched in CRC tissues from those with similar                  abundant in CRC tissue than in blood, the second axis captured
prevalence across different blood samples and disease types.                      species found in both tissue and blood samples at similar levels,
The analyses confirmed that bacteria in CRC tissues were (1)                      many of which were associated with sequencing center (Figures
more diverse and abundant and (2) were enriched for mucosa-                       2D, S2E, and S2F). We then examined the abundance of equi-
related species.                                                                  prevalent species in blood samples and found significant clus-
                                                                                  tering according to the sequencing centers at which the samples
Species equiprevalent in tissue and blood are                                     were processed (Figures 2E and S2F). For comparison, we per-
predominantly contaminants                                                        formed the same analysis on tissue and blood samples from BC
Besides species enriched in CRC tissues, a significant number of                  patients, which revealed no discernible variation between tissue
detected species were equally prevalent in blood, CRC tissue,                     and blood samples, but rather significant clustering by
BC tissue, and OV tissue (Figures 1C, 1D, and S1C–S1F). While                     sequencing center (Figures S2G and S2H).
compromised epithelial barrier function may allow the transloca-                     Therefore, the majority of species equiprevalent in tissue and
tion of microorganisms to the bloodstream at low levels (Chelak-                  blood are not endogenous but are mostly artifacts introduced
kot et al., 2018), we expected that such species would be prev-                   during processing and profiling at respective sequencing cen-
alent at much lower levels in blood than in CRC tissue. To                        ters. For ease of description, we will refer to equiprevalent
analyze equiprevalent species and determine their origin, we first                species as ‘‘contaminants’’ and tissue-enriched species as ‘‘tis-
examined a set of 70 bacterial genera known to be enriched in                     sue-resident’’ for the remainder of the article. However, the
                                                                                             Cell Host & Microbe 29, 281–298, February 10, 2021 285
            ll
                                                                                                                                              Resource
equiprevalent population may still contain biologically relevant                    Detecting tissue-resident and contaminant species with
species that are detected in both tissue and blood.                                 gene-level resolution
                                                                                    For species designated as tissue-resident (e.g., B. vulgatus) or
A generalizable model for isolating tissue-resident                                 as contamination (e.g., A. junii), we subsequently explored the
microbiota in TCGA tumor samples                                                    extent to which microbial genes could be reliably detected in
Based on the comparative analyses of prevalence in tissue and                       TCGA sequencing data. Using annotated genomes to search
blood, we developed a generalizable statistical model to distin-                    for gene-level assignments, we found that for many such spe-
guish tissue-resident microbiota from contaminant species                           cies, sequencing alignments provided coverage of the full micro-
across cancer types in TCGA. Of the 1,136 bacterial species                         bial genome. As expected, gene prevalence profiles of tissue
detectable in more than 5% of CRC tissue samples, this model                        and blood samples largely recapitulated those of species-level
classified 769 species as tissue-resident (67.69%) and 367 spe-                     assignments (Figures 3A, 3B, S3A, and S3B). For tissue-resident
cies as contaminants (32.31%). Tissue-resident populations                          species, the gene prevalence distribution was much lower in
identified by comparing prevalence in tissue and blood were                         blood samples than tissue, while for contaminants, the gene
largely consistent with prevalence comparisons of CRC tissue                        prevalence distribution of blood and tissue samples were nearly
with BC tissue, as well as analogous comparisons made with                          identical (Figures 3D, 3E, S3C, and S3D); genome coverage was
WXS data (Figure S2I). The model was used to perform binary                         greater in tissue than blood for tissue-resident species but iden-
classifications for all observed species. For each sample in the                    tical for contaminant species (Figures 3G, 3H, S3E, and S3F).
cohort, and at each subsequent taxonomic level, we then used                        Likewise, genome coverage in tissue samples was nearly equal
species-level classifications to design a mixture model esti-                       at Harvard and Baylor for tissue-resident species but not for
mating the fraction of contaminant read counts within a given                       contaminant species (Figures S3G and S3H). These results sug-
clade. This method provides a generalizable approach for de-                        gest that gene- and nucleotide-level analyses of microbial
composing observed microbial populations into their tissue-resi-                    sequencing reads may be leveraged to help distinguish contam-
dent and contaminant fractions on a sample-by-sample basis at                       ination from tissue-resident populations.
multiple taxonomic levels.
                                                                                    Distinguishing tissue-resident Escherichia reads from
Proteobacteria and Actinobacteria contribute the                                    contamination
largest fraction of contaminant reads                                               An outstanding challenge in controlling contamination is the
As expected, the removal of contaminant species resulted in a                       problem of mixed-evidence cases in which detected sequencing
reduction in the number of bacterial reads in all sample types.                     reads come from an unknown combination of endogenous and
Specifically, bacterial species classified as contaminants ac-                      contaminant sources (Poore et al., 2020; Robinson et al.,
counted for a median of 16.27% bacterial read counts in tissue                      2017). For example, although Escherichia coli is ubiquitous
but varied considerably (Figure 2F). Contaminants consistently                      among human microbiomes, species-level E. coli reads were
dominated blood samples, with a median of 99.45% of detected                        present in tissue (64.68%) and blood (66.29%) at nearly equal
WGS reads being the result of contamination. The phyla Proteo-                      rates and were strongly associated with sequencing center (Fig-
bacteria and Actinobacteria contributed the greatest fraction of                    ure 2E). We therefore explored whether gene-level read align-
contaminant reads in WGS data, with medians of 76.67% and                           ments could provide greater resolution and could be used to
80.95% of reads, respectively, found in CRC tissue samples being                    estimate the fraction of sequencing reads resulting from contam-
the result of contamination (Figure 2G). By contrast, only small                    ination versus endogenous microbiota. To test this, we mapped
fractions of Firmicutes (1.70%), Bacteroidetes (0.02%), and Fuso-                   microbial sequencing reads from TCGA tissue and blood sam-
bacteria (0.00%) reads were predicted to be the result of contam-                   ples to genes in the annotated E. coli genome.
ination (Figure 2G). Contamination rates were largely similar for                      Overall, reads aligning to E. coli genes in tissue and blood sam-
WXS and across sequencing centers (Figures S2J and S2K).                            ples were detected at up to the same rates as species-level E. coli
   Additionally, correlation between the normalized relative                        alignments (Figure 3C) and had similar genome coverage (Fig-
abundances of taxa in matched WGS and WXS samples was                               ure 3I). However, a small number of E. coli genes displayed a
predictive of contamination rates. Correlations between Bacter-                     signature analogous to tissue-resident microbiota in our spe-
oidetes, Fusobacteria, and Firmicutes abundances in WGS and                         cies-level prevalence analysis (Figure 3C). Moreover, we
WXS were consistently high, in contrast to Actinobacteria and                       observed bimodality in the blood prevalence of E. coli genes (Fig-
Proteobacteria (Figure 2H). For blood samples, the normalized                       ure 3F), suggesting the presence of distinct tissue-resident and
relative abundances of these five phyla were wholly uncorrelated                    contaminant E. coli populations. We identified a set of 119
between matched WGS and WXS (Figure S2L). Overall, these re-                        E. coli genes significantly enriched in tissue samples (q < 0.01; Fig-
sults show that significant fractions of the bacterial reads in WGS                 ure 3J), several of which have credible reasons for being enriched
data for CRC tissue and blood samples are the result of contam-                     in tissue samples. The top candidate, cadA (q = 4.44E-9), is a gene
ination from Actinobacteria and Proteobacteria species.                             encoding one of two lysine decarboxylases (Kikuchi et al., 1997;
(G–I) Coverage of WGS reads aligning to genomes of B. vulgatus (G), A. junii (H), and E. coli (I) in blood (red) and tissue (blue).
(J) Top 25 E. coli genes most significantly enriched in tissue.
(K) Comparison of the prevalence of E. coli genes, cadA and ldcC, in blood (red) and tissue (blue).
(L) Results of GO pathway analysis of tissue-enriched E. coli genes.
*Indicates tissue-enriched E. coli genes
                                                                                               Cell Host & Microbe 29, 281–298, February 10, 2021 287
           ll
                                                                                                                                     Resource
Figure 4. Decontamination removes sequencing center artifacts and original TCGA tissue and blood samples validate tissue-resident micro-
bial compositions and equiprevalent species as contaminants, see also Figure S4; Table S1
(A) Abundance of WGS bacteria before and after decontamination. Samples with no reduction in bacterial reads lie along the gray line. Experiments with low
microbial biomass a priori are disproportionally affected by decontamination.
(B) Relative abundance of bacterial phyla in tissue samples before and after decontamination, sorted by their a priori abundance of Actinobacteria.
(C) PCoA of the decontaminated, tissue-resident microbial component reveals retention of variation related to sample type but not sequencing center.
(D) PCoA of the contaminant microbial component reveals retention of variation related to sequencing center but not sample type.
(E and F) Prevalence of bacterial species in tissue samples sequenced at Baylor versus Harvard (E) before and (F) after removing contamination.
Yamamoto et al., 1997) produced by E. coli; the other is ldcC,                 most prominently by the removal of contaminant Actinobacteria
which is not enriched in tissue samples (q = 0.16) (Figures 3K                 and Proteobacteria reads (Figure 4B).
and S3J). While ldcC encodes a gene that is constitutively ex-                    Despite being naive to sequencing center, our prevalence-
pressed, cadA transcription is induced under conditions of anaer-              based model for decomposing the TCGA microbiome data into
obic growth at low pH and its gene product displays greater ther-              tissue-resident and contaminant fractions also mitigated cen-
mostability and acid tolerance (Lemonnier and Lane, 1998).                     ter-related batch effects. Prior to removing contamination,
Additionally, genes in the pks island encoding colibactin were                 TCGA microbiome data clustered by both sample type and
significantly more prevalent in tissue than blood, matching previ-             sequencing center (Figure 2D). However, unsupervised clus-
ous reports that E. coli strains expressing this gene are associated           tering of the tissue-resident component extracted from the orig-
with CRC tissues (Figure S3L) (Arthur et al., 2012).                           inal TCGA sequencing data showed no dependency on
   Discrepancies in intraspecies genome content may be ex-                     sequencing center and maintained variation related to sample
plained by adaptive gene loss, an evolutionary mechanism                       type (Figure 4C). Examining the contamination component, we
whereby bacteria dispense with genes that are unnecessary                      found the opposite: samples no longer clustered by sample
for their environmental conditions (Koskiniemi et al., 2012;                   type but rather organized exclusively according to sequencing
Mira et al., 2001). Pathway analysis (Huang da et al., 2009) re-               center (Figure 4D). These results reflected the removal of species
vealed that tissue-enriched E. coli genes were significantly                   that were uniquely prevalent in tissue samples from either Baylor
associated with processes including iron-ion homeostasis, en-                  or Harvard (Figures 4E, 4F, and S4A).
terobactin biosynthesis, ion transport, ferric-enterobactin trans-                Finally, our algorithm greatly increased the similarity between
port, and copper-ion response (p < 0.01) (Figure 3L). Iron (Fe3+)              the microbial populations in patient-matched tissue samples
and copper (Cu2+) are abundant in the host and can be toxic to                 sequenced at both Harvard and Baylor, while maintaining diversity
E. coli in acidic, aerobic conditions; therefore, strains of E. coli           among samples overall (Figure 4G). Thus, our prevalence-based
must tightly regulate intracellular concentrations of these                    model is able to homogenize matched samples sequenced at
metals and undergo selection to do so (Porcheron et al.,                       different centers and mitigate sequencing center artifacts.
2013; Rensing and Grass, 2003). Given that hypothetically
bloodborne E. coli would also have to contend with high con-                   Original TCGA tissue and blood samples validate tissue-
centrations of copper and iron, enrichment of these genes                      resident microbial compositions and equiprevalent
and processes in tissue relative to blood suggests that the                    species as contaminants
majority of E. coli reads detected in blood samples are not                    To benchmark our analysis, we obtained five primary CRC tumor
endogenous but rather the result of contamination. However,                    samples and matched plasma samples from an original TCGA tis-
tissue-enriched genes such as cadA and others serve as                         sue provider (Table S1). These samples were specifically chosen to
benchmarks for distinguishing the two.                                         ensure that each tissue and plasma sample was profiled by WGS at
                                                                               both Baylor and Harvard. For controls, we also procured three
Tissue-enriched sequencing reads can be identified                             plasma samples from healthy individuals and spiked one plasma
with nucleotide precision                                                      sample with E. coli. We then used 16S amplicon sequencing to vali-
We then examined microbial sequencing reads at nucleotide-                     date that the bacterial composition of the original TCGA samples
level resolution. Given that gene-level alignments helped resolve              resembled the decontaminated compositions extracted from
mixed-evidence cases, we explored whether bacterial sequence                   TCGA sequencing data for matched tumor samples (Figure 4H).
variants, such as SNPs, could be used in a similar fashion.                       We found that the original TCGA tumor samples contained
Variant prevalence across CRC tissue and blood samples largely                 ample bacterial diversity and read counts (Figures 4I and S4B)
recapitulated the results from species- and gene-level profiles                and that their bacterial composition largely recapitulated the de-
(Figures S3L–S3M). Interestingly, we also found populations of                 contaminated, tissue-resident microbial population our decon-
apparent tissue-enriched and equiprevalent variants in E. coli                 tamination model extracted from TCGA WGS data on matched
genomes, suggesting that analyses of sequence variants may                     samples (Figures 4J and S4D). In addition to increasing the sim-
prove useful in distinguishing between endogenous and contam-                  ilarity between WGS and 16S validation results (Figure S4C),
inant sequencing reads in mixed-evidence cases.                                decontamination greatly improved the concordance between
                                                                               microbial compositions of matched WGS experiments per-
Decontamination removes sequencing center artifacts                            formed at Harvard and Baylor (Figure 4G). Despite detecting a
Removing contamination affected all samples, but samples with                  large number of bacterial sequencing reads in tumor samples,
low bacterial abundance a priori were the most affected (Fig-                  bacterial diversity of CRC plasma was not significantly greater
ure 4A), consistent with observations that low-biomass samples                 than healthy plasma (p = 0.30) or water controls (p = 0.44). More-
are the most profoundly affected by contamination (Eisenhofer                  over, the 16S bacterial composition of original TCGA plasma
et al., 2019; Glassing et al., 2016). Decontamination also regular-            samples was distinct from the bacterial composition of WGS
ized the relative abundance profiles of CRC tissue samples,                    data from the same samples (Figures S4E and S4F), supporting
(G) Comparison of weighted UniFrac distances before and after removing contamination among all tissues (left) and specifically matched tissues sequenced at
both Baylor and Harvard (right).
(H) Design of the validation experiment. Data are represented as mean ± 95% CI.
(I) Bacterial diversity of 16S rRNA-seq results from tissue (blue), plasma (red), and controls (bottom panel).
(J) Relative abundances in 16S results for tissue compared with tissue samples sequenced using WGS at Harvard and Baylor, before and after contamination.
                                                                                         Cell Host & Microbe 29, 281–298, February 10, 2021 289
         ll
                                                         Resource
290 Cell Host & Microbe 29, 281–298, February 10, 2021
                                                                                                                                           ll
Resource
the notion that the majority of bacterial reads detected in TCGA              sue types. As proof-of-principle, we used batch-normalized
blood samples are contamination introduced during DNA extrac-                 RPPA, mRNA-seq, miRNA-seq, and methylation 27-K data
tion and sequencing, rather than at the time of procurement.                  from TCGA to compute correlations between features in these
These validation results demonstrate that our model accurately                datasets with genera in the Fusobacterium and Bacteroides
identifies and removes contaminants and that computational                    clusters identified previously (Figures 5D, 5E, S5C, and S5D).
decontamination produced profiles that represent the true mi-                 For each of these assays, we found that these bacterial co-abun-
crobial composition of tissue.                                                dance groups were predictive of host gene expression patterns.
                                                                              For instance, in RPPA protein expression data we found that
Colorectal tissue microbiomes cluster into                                    ADAR1 and PARP1 expression appeared to distinguish these
Fusobacterium and Bacteroides co-abundance groups                             co-abundance groups (Figure 5E). The protein ADAR1 is upregu-
Having validated the contamination-adjusted microbial profiles,               lated by inflammatory mediators such as TNF-alpha and IFN-
we sought to leverage the decontaminated TCMA dataset to                      gamma (Yang et al., 2003) and regulates pathogen detection
investigate whether certain subgroups of microbiota were more                 and autoinflammation by discriminating self from non-self RNA
likely to be found together in tissue from CRC patients. Using                (Chung et al., 2018), while PARP1 regulates DNA repair and is
the bootstrapping procedure, SparCC (Friedman and Alm,                        activated by Helicobacter pylori in gastric cancer (Nossa et al.,
2012), we found two anticorrelated co-abundance groups (Fig-                  2009). Independently, we found that ADAR1 expression corre-
ures 5A–5C, S5A, and S5B): the ‘‘Fusobacterium cluster’’ con-                 lated with expression of PARP1, TNF-alpha, and IFN-gamma in
tained Porphyromonas, Prevotella, Peptostreptococcus, and                     TCGA RNA-seq data for CRC (Figure S5E). These results sug-
Campylobacter, among other species that were associated with                  gest that genes regulating inflammation and pathogen response
tumor samples; the second ‘‘Bacteroides cluster’’ is larger and               may distinguish the Fusobacterium and Bacteroides co-abun-
contained a highly correlated set of microbes, including Parabac-             dance groups in CRC. More broadly, these analyses illustrate
teroides, Clostridium, and Alistipes. This group may represent a              the utility of TCMA as a unique resource for comparing microbial
more normal/healthy microbiome, as several of these species                   and multi-omic host profiles from matched tissue samples.
were positively associated with normal tissue samples. Taxa in
the Fusobacterium cluster were significantly associated with colo-            Matched tumor-normal analysis reveals species
rectal neoplasms, while taxa in the Bacteroides cluster were asso-            associated with colorectal neoplasms
ciated with C. difficile infection, irritable bowel syndrome, and             The TCGA database contains detailed annotations on each tis-
cirrhosis (q < 0.05). These co-abundance groups may represent                 sue donor, including statistics on tumor stage, size, morphology,
two distinct ‘‘enterotypes’’ of CRC tissue microbiomes.                       and location, as well information on patient survival, treatment
                                                                              history, and therapeutic response. To identify microbes predic-
Bacterial co-abundance groups are predictive of the                           tive of pathological and prognostic characteristics of CRC tissue,
host tissue molecular environment                                             we used matched normal tissue and primary tumor samples to
Next, we evaluated whether the bacterial co-abundance groups                  perform a paired comparison of tissue-resident microbes (Fig-
we identified had discernible effects on host gene expression or              ures 5F, S5F, and S5G). This analysis identified 37 species that
regulation. Microbiota and host cells are known to engage with                were significantly enriched in either normal (n = 14) or tumor
one another through a complex variety of molecular interactions.              (n = 23) samples (p < 0.05) (Table S2).
Host-derived nutrients and dietary macromolecules are utilized                   The species most significantly associated with CRC tumors
by microorganisms as a food source, while microbial byproducts                compared with matched normal tissue was F. nucleatum
including short-chain fatty acids (SCFAs) are known to modulate               (p = 1.82E-3), which is known to promote intestinal tumorigen-
gene expression, cell differentiation, and inflammatory response              esis. Overall, approximately half of tumor-associated species
(Donohoe et al., 2011; Furusawa et al., 2013).                                belonged to the genus Fusobacterium, including F. hwasookii,
   The TCGA database contains a dense cube of molecular                       F. massiliense, and a number of unclassified Fusobacterium
profiling data, including matched genetic, epigenetic, transcrip-             spp. (p < 0.01). Non-Fusobacterium species associated
tional, and proteomic assays performed on thousands of sam-                   with CRC tumors included P. micra, S. moorei, and P. stomatis
ples. Thus, the ability to compare microbial profiles with                    (p < 0.05), several of which belonged to the Fusobacterium
matched host molecular profiles represents an unprecedented                   co-abundance group (Figure S5A) and have previously been
opportunity for querying host-microbe interactions in various tis-            implicated in CRC (Kostic et al., 2013; Purcell et al., 2017; Warren
Figure 5. Colorectal tissue microbiomes cluster into Fusobacterium and Bacteroides co-abundance groups predictive of host tissue molec-
ular environment
See also Figure S5; Table S2
(A) Heatmap clustering of correlations between bacterial genera reveals anticorrelated clusters of genera, characterized by Bacteroides and Fusobacterium
(purple triangles). Axes are colored according to species’ association with tumor (blue) or matched adjacent normal tissue (yellow).
(B and C) (B) Bacteroides- and (C) Fusobacterium-associated co-abundance networks. Node size is proportional to the prevalence of the genera in tissue
samples, and node hue is proportional to abundance.
(D and E) Co-abundance groups are predictive of gene expression (D; RNA-seq) and protein expression (E; RPPA).
(F) Heat tree comparing bacterial taxa abundance in tumor samples (blue) or matched normal tissue (yellow).
(G) Survival analysis p values of species in the Bacteroides and Fusobacterium co-abundance groups.
(H) OS curves for Bacteroides spp.
                                                                                        Cell Host & Microbe 29, 281–298, February 10, 2021 291
           ll
                                                                                                                                Resource
Figure 6. Microbial presence in CRC tissue is predictive of host gene expression pathways and MBI See also Figure S6
(A) Correlation between host gene expression (columns) and CLR-transformed species abundances (rows). Rows are colored according to each species’ as-
sociation with tumor (blue) or normal tissue (yellow).
et al., 2013). Other species, including several Campylobacter                     spective of their association with tumor or normal tissue (Figures
spp. did not have extant links to the disease. Of these,                          6B and S6B) and (2) processes related to inflammatory cancer
C. ureolyticus (Log2FC = 1.97; p = 2.19e 2) is an emerging                        pathways and cell-cell adhesion were enriched among genes
gastrointestinal pathogen implicated in inflammatory bowel dis-                   correlated with tumor-associated and normal tissue-associated
ease and colitis (Bullman et al., 2013; O’Donovan et al., 2014),                  species, respectively (Figures 6C, 6D, S6C, and S6D). Specif-
prompting further examination. C. ureolyticus abundance corre-                    ically, both tumor- and normal tissue-associated species were
lated with expression of several genes, including CAMK2D and                      enriched for processes relating to intestinal IgA production, anti-
UGDH (Figures S5J, S5H, and S5I), and several genes expressed                     gen presentation, natural killer cell-mediated cytotoxicity, cyto-
by C. ureolyticus were significantly associated with tumor sam-                   kine signaling, and primary immunodeficiency, suggesting
ples compared with normal tissue (Figure S5J). Additionally,                      near-universal activation of an immunogenic transcriptional
C. ureolyticus was associated with worse progression-free inter-                  response to the presence of these bacteria (Figures 6B and S6B).
val (PFI) in recurrent CRC patients (Figure S5K).                                    We also found that pathways including DNA replication, DNA
   Taxa that were significantly more abundant in adjacent normal                  repair, oxidative phosphorylation, p53 signaling, and ribosome
tissue compared with matched tumor tissue were dominated by                       activity were all negatively enriched among normal tissue-asso-
Bacteroides and Parabacteroides spp. (p < 0.05) (Figure S5G),                     ciated species, and positively enriched among tumor-associ-
many of which belonged to the Bacteroides co-abundance                            ated species, particularly for Fusobacterium spp. (Figures 6C
group (Figure S5A). By leveraging patient-matched tumor and                       and S6C). Conversely, genes involved in the regulation of cellular
normal tissue samples, TCMA may thus be used identify bacte-                      adhesion were positively enriched among normal tissue-associ-
rial associations with CRC and other gastrointestinal cancers.                    ated species and negatively enriched among tumor-associated
                                                                                  species (Figures 6D and S6D). Together, these results indicate
Survival analysis reveals candidate microbial                                     that within this cohort of CRC tumor samples, tumor-associated
biomarkers predictive of clinical outcomes                                        species may be associated with proinflammatory, neoplastic
Using survival data collected by the PanCanAtlas (Liu et al.,                     transformations and loss of epithelial integrity.
2018), we next examined whether co-abundance groups were
predicative of overall survival (OS). For each species, we used                   Microbial presence in CRC blood samples indicate
a log-rank test to assess its individual prognostic value. Interest-              mucosal barrier injury
ingly, species in the Bacteroides co-abundance group were                         As shown in Figures 1B and S1B, bacteria were significantly more
generally more prognostic of survival than the Fusobacterium                      abundant and diverse in blood samples from CRC patients than
co-abundance group (Figure 5G). We found over a dozen Bac-                        from BC patients (p < 0.01). The presence of transient, endoge-
teroides spp. that were prognostic of survival, including                         nous microbial DNA in the bloodstream has been reported in pri-
B. cellulosilyticus and several unclassified Bacteroides spp. (Fig-               mary CRC patients, often before diagnosis, and may even be pre-
ures 5H and S5L). These findings demonstrate the utility of                       dictive of tumor stage and location (Abdulamir et al., 2011; Poore
TCMA for the identification of prognostic microbial biomarkers                    et al., 2020). Loss of mucosal barrier function is a common feature
relevant to CRC and other cancers.                                                of CRC and other chronic inflammatory conditions and may lead
                                                                                  to microbial translocations from CRC tumors to the lamina propria
Microbial presence in CRC tissue is predictive of host                            and bloodstream (Oshima and Miwa, 2016; Yu, 2018).
immunogenic response, inflammatory cancer pathways,                                  To explore this possibility, we examined the abundance of
and cell-cell adhesion                                                            subsets of bacterial species and genera designated as common
Next, we explored whether the 37 species that we identified as                    commensal (n = 407) or MBI-associated (n = 693) (CDC, 2019).
significantly associated with either tumor or normal tissue sam-                  Examining MBI-associated species among decontaminated
ples had identifiable effects on host gene expression or related                  blood samples within the CRC cohort, we found that species
biological pathways. Comparing normalized abundances of                           associated with MBI were considerably more prevalent than
these species with matched mRNA expression data from 159                          those that were not (p = 2.42e 7; Figure 6E). We then compared
CRC tumor samples, we computed correlations and found tran-                       the abundance of MBI-associated genera with that of common
scriptional patterns that were associated with both tumor- and                    commensals and discovered that genera associated with MBI
normal tissue-associated species (Figures 6A and S6A). Given                      were frequently more abundant in the blood of CRC patients
the observed differences in the transcriptional correlations of tu-               than BC patients (Figures 6F and S6E). These results point to-
mor- and normal tissue-associated bacteria, we subsequently                       ward the potential utility of bloodborne bacterial DNA from
performed gene-set enrichment analysis (Subramanian et al.,                       MBI-associated organisms as a potential biomarker for CRC.
2005) to identify biological pathways associated with the abun-
dance of these species.                                                           Contamination-adjusted tissue microbiome profiles for
  Pathway analysis revealed that (1) genes correlated with the                    all gastrointestinal cancers in TCGA
abundance of bacterial species were consistently enriched for                     Having successfully identified the CRC tissue-associated mi-
the activation of immune system pathways and processes, irre-                     crobial component in the TCGA dataset, we analyzed samples
(B–D) Comparison of differentially abundant species and their association with tissue type (x axis) versus enrichment score (y axis) for KEGG terms (A) ‘‘natural
killer cell-mediated cytotoxicity’’ (B), ‘‘DNA replication’’ (C), and ‘‘cell adhesion molecules’’ (D).
(E) Bacterial species implicated in MBI are more prevalent in decontaminated blood samples than other species.
(F) Bacterial genera implicated in MBI (red) are more abundant in CRC blood (brown) than BC blood (green), in contrast to some commensal species (blue).
                                                                                             Cell Host & Microbe 29, 281–298, February 10, 2021 293
            ll
                                                                                                                                          Resource
Figure 7. Contamination-adjusted tissue microbiome profiles for all gastrointestinal cancers in TCGA
See also Figure S7
(A) Pan-cancer abundance of bacteria in solid tissue samples from TCGA projects prior to decontamination. Data are represented as mean ± 95% CI.
(B) Estimated fraction of contaminant reads for sequencing experiments on tumor (blue), normal (yellow), and blood (red) samples for each sequencing project
in TCMA.
(C) Classification of tissue-resident (blue) and contaminant (red) species across TCGA gastrointestinal tissues by comparison of prevalence in blood and tissue.
(D) Labeling of tissue-resident (blue) and contaminant (red) species across gastrointestinal tissues by comparison of prevalence in brain tissue and disease-
specific tissue, using classification from (C).
(E) Estimated proportions of tissue-resident (blue) and contaminant (red) species for each TCGA project.
from other cancer types to search for tissue-resident micro-                     signature of tissue-resident species in HNSC, STAD, and
biota. All sequencing datasets contained some bacterial                          ESCA projects by comparing species prevalence in tissue with
reads but as expected, they were most abundant in gastroin-                      blood and brain samples (Figures 7C and 7D), and we estimated
testinal cancers (Figure 7A). In particular, tissue samples                      the fraction of minimally detectable species that were tissue-
from head and neck cancer (HNSC), colon cancer (COAD),                           resident or contaminants (Figure 7E). For each of these cancer
rectal cancer (READ), esophageal cancer (ESCA), and                              types, we applied our decomposition model to isolate tissue-
stomach cancer (STAD) had the greatest number of bacterial                       resident populations and establish TCMA. Few statistically sig-
reads prior to decontamination, whereas uveal melanoma                           nificant tissue-resident populations in bladder (BLCA), breast
(UVM), lung squamous-cell (LUSC), and glioblastoma had                           (BRCA), uterine (UCEC), cervical (CESC), or prostate (PRAD)
the fewest.                                                                      cancers could be detected (Figures S7A and S7B). The microbial
   Given the abundance of microbial reads in gastrointestinal                    biomass in these tissues is known to be magnitudes less
cancer types, we used our prevalence-based approach to deter-                    than that of gastrointestinal tissues despite similar levels of
mine whether tissue-resident microbiota were present and esti-                   contamination (p = 0.56), hence it may be more challenging to
mate the fraction of contaminant reads in each sample type (Fig-                 distinguish the few tissue-resident species from the over-
ure 7B). In addition to COAD and READ, we found a strong                         whelming proportion of contaminants.
294 Cell Host & Microbe 29, 281–298, February 10, 2021
                                                                                                                                      ll
Resource
                                                                                Cell Host & Microbe 29, 281–298, February 10, 2021 295
            ll
                                                                                                                                               Resource
You, Lawrence David, Justin Silverman, Kimberly Roche, and Darien Lombardi          Choi, J.H., Hong, S.E., and Woo, H.G. (2017). Pan-cancer analysis of system-
for constructive discussion and advice. We are grateful to Caroline Connor for      atic batch effects on somatic sequence variations. BMC Bioinformatics
reviewing and editing the manuscript. This study was supported by NIH               18, 211.
R35GM122465, DK119795, and DARPA W911NF1920111.                                     Chung, H., Calis, J.J.A., Wu, X., Sun, T., Yu, Y., Sarbanes, S.L., Dao Thi, V.L.,
                                                                                    Shilvock, A.R., Hoffmann, H.H., Rosenberg, B.R., and Rice, C.M. (2018).
AUTHOR CONTRIBUTIONS                                                                Human ADAR1 prevents endogenous RNA from triggering translational shut-
                                                                                    down. Cell 172, 811–824.e14.
A.B.D. and X.S. initiated the concept, designed the analyses, and interpreted
the data. A.B.D. performed all the analyses. D.A.M. assisted A.B.D. with gene-      Davidson-Pilon, C., Kalderstam, J., Jacobson, N., Zivich, P., Kuhn, B.,
and sequence-level analyses. A.B.D., S.D., and H.D. designed and performed          Williamson, M., Sean-Reed, J.K., Fiore-Gartland, A., Datta, D., and Moneda,
the experiments. A.B.D. and M.G. designed and deployed the TCMA website.            L. (2020). CamDavidsonPilon/lifelines: 0.24.6 (Zenodo).
A.B.D. and X.S. wrote the manuscript. S.M.L. and I.D.I. assisted with concep-       Davis, N.M., Proctor, D.M., Holmes, S.P., Relman, D.A., and Callahan, B.J.
tion, scientific direction, and writing.                                            (2018). Simple statistical identification and removal of contaminant sequences
                                                                                    in marker-gene and metagenomics data. Microbiome 6, 226.
DECLARATION OF INTERESTS                                                            DeSantis, T.Z., Hugenholtz, P., Larsen, N., Rojas, M., Brodie, E.L., Keller, K.,
                                                                                    Huber, T., Dalevi, D., Hu, P., and Andersen, G.L. (2006). Greengenes, a
The authors declare no competing interests.
                                                                                    chimera-checked 16S rRNA gene database and workbench compatible with
                                                                                    ARB. Appl. Environ. Microbiol. 72, 5069–5072.
Received: June 12, 2020
Revised: September 29, 2020                                                         Dobin, A., Davis, C.A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., Batut,
Accepted: December 1, 2020                                                          P., Chaisson, M., and Gingeras, T.R. (2013). STAR: ultrafast universal RNA-seq
Published: December 30, 2020                                                        aligner. Bioinformatics 29, 15–21.
                                                                                    Donohoe, D.R., Garge, N., Zhang, X., Sun, W., O’Connell, T.M., Bunger, M.K.,
REFERENCES                                                                          and Bultman, S.J. (2011). The microbiome and butyrate regulate energy meta-
                                                                                    bolism and autophagy in the mammalian colon. Cell Metab 13, 517–526.
Abdulamir, A.S., Hafidh, R.R., and Abu Bakar, F. (2011). The association of         Eisenhofer, R., Minich, J.J., Marotz, C., Cooper, A., Knight, R., and Weyrich,
Streptococcus bovis/gallolyticus with colorectal tumors: the nature and the         L.S. (2019). Contamination in low microbial biomass microbiome studies: is-
underlying mechanisms of its etiological role. J. Exp. Clin. Cancer Res. 30, 11.    sues and recommendations. Trends Microbiol 27, 105–117.
                                  €hlbauer, M., Tomkovich, S., Uronis, J.M.,
Arthur, J.C., Perez-Chanona, E., Mu                                                 Elinav, E., Garrett, W.S., Trinchieri, G., and Wargo, J. (2019). The cancer micro-
Fan, T.J., Campbell, B.J., Abujamel, T., Dogan, B., Rogers, A.B., et al. (2012).    biome. Nat. Rev. Cancer 19, 371–376.
Intestinal inflammation targets cancer-inducing activity of the microbiota.
                                                                                    Flanagan, L., Schmid, J., Ebert, M., Soucek, P., Kunicka, T., Liska, V., Bruha,
Science 338, 120–123.
                                                                                    J., Neary, P., Dezeeuw, N., Tommasino, M., et al. (2014). Fusobacterium nucle-
Bolyen, E., Rideout, J.R., Dillon, M.R., Bokulich, N.A., Abnet, C.C., Al-Ghalith,   atum associates with stages of colorectal neoplasia development, colorectal
G.A., Alexander, H., Alm, E.J., Arumugam, M., Asnicar, F., et al. (2019).           cancer and disease outcome. Eur. J. Clin. Microbiol. Infect. Dis. 33,
Reproducible, interactive, scalable and extensible microbiome data science          1381–1390.
using QIIME 2. Nat. Biotechnol. 37, 852–857.
                                                                                                                        €nwald, N.J. (2017). Metacoder: an R pack-
                                                                                    Foster, Z.S., Sharpton, T.J., and Gru
Brighenti, E., Calabrese, C., Liguori, G., Giannone, F.A., Trerè, D., Montanaro,   age for visualization and manipulation of community taxonomic diversity data.
L., and Derenzini, M. (2014). Interleukin 6 downregulates p53 expression and        PLoS Comput. Biol. 13, e1005404.
activity by stimulating ribosome biogenesis: a new pathway connecting inflam-
mation to cancer. Oncogene 33, 4396–4406.                                           Friedman, J., and Alm, E.J. (2012). Inferring correlation networks from genomic
                                                                                    survey data. PLOS Comput. Biol. 8, e1002687.
Bullman, S., Lucid, A., Corcoran, D., Sleator, R.D., and Lucey, B. (2013).
Genomic investigation into strain heterogeneity and pathogenic potential of         Furusawa, Y., Obata, Y., Fukuda, S., Endo, T.A., Nakato, G., Takahashi, D.,
the emerging gastrointestinal pathogen Campylobacter ureolyticus. PLoS              Nakanishi, Y., Uetake, C., Kato, K., Kato, T., et al. (2013). Commensal
One 8, e71515.                                                                      microbe-derived butyrate induces the differentiation of colonic regulatory
                                                                                    T cells. Nature 504, 446–450.
Bullman, S., Pedamallu, C.S., Sicinska, E., Clancy, T.E., Zhang, X., Cai, D.,
Neuberg, D., Huang, K., Guevara, F., Nelson, T., et al. (2017). Analysis of         Gemmell, E., and Seymour, G.J. (1993). Interleukin 1, interleukin 6 and trans-
Fusobacterium persistence and antibiotic response in colorectal cancer.             forming growth factor-beta production by human gingival mononuclear cells
Science 358, 1443–1448.                                                             following stimulation with Porphyromonas gingivalis and Fusobacterium nu-
                                                                                    cleatum. J. Periodont. Res. 28, 122–129.
Bush, S.J., Foster, D., Eyre, D.W., Clark, E.L., De Maio, N., Shaw, L.P.,
Stoesser, N., Peto, T.E.A., Crook, D.W., and Walker, A.S. (2020). Genomic di-       Glassing, A., Dowd, S.E., Galandiuk, S., Davis, B., and Chiodini, R.J. (2016).
versity affects the accuracy of bacterial single-nucleotide polymorphism-call-      Inherent bacterial DNA contamination of extraction and sequencing reagents
ing pipelines. GigaScience 9, giaa007.                                              may affect interpretation of microbiota in low bacterial biomass samples.
                                                                                    Gut Pathog 8, 24.
Cancer Genome Atlas Research Network (2014). Comprehensive molecular
characterization of gastric adenocarcinoma. Nature 513, 202–209.                    Gopalakrishnan, V., Spencer, C.N., Nezi, L., Reuben, A., Andrews, M.C.,
                                                                                    Karpinets, T.V., Prieto, P.A., Vicente, D., Hoffman, K., Wei, S.C., et al.
Cancer Genome Atlas Research Network; Albert Einstein College of Medicine;
                                                                                    (2018). Gut microbiome modulates response to anti-PD-1 immunotherapy in
Analytical Biological Service; Barretos Cancer Hospital; Baylor College of
                                                                                    melanoma patients. Science 359, 97–103.
Medicine; Beckman Research Institute of City of Hope; Buck Institute for
Research on Aging; Canada’s Michael Smith Genome Sciences Center;                   Grice, E.A., Kong, H.H., Renaud, G., Young, A.C., NISC Comparative
Harvard Medical School; Helen F. Graham Cancer Center &Research                     Sequencing Program, Bouffard, G.G., Blakesley, R.W., Wolfsberg, T.G.,
Institute at Christiana Care Health Services, et al. (2017). Integrated genomic     Turner, M.L., and Segre, J.A. (2008). A diversity profile of the human skin mi-
and molecular characterization of cervical cancer. Nature 543, 378–384.             crobiota. Genome Res 18, 1043–1050.
CDC (2019). NHSN organism list. In National Healthcare Safety Network               Huang, da W., Sherman, B.T., and Lempicki, R.A. (2009). Systematic and inte-
(NHSN) patient safety component manual (Centers for Disease Control and             grative analysis of large gene lists using DAVID bioinformatics resources. Nat.
Prevention).                                                                        Protoc. 4, 44–57.
Chelakkot, C., Ghim, J., and Ryu, S.H. (2018). Mechanisms regulating intesti-       Human Microbiome Project Consortium (2012). Structure, function and diver-
nal barrier integrity and its pathological implications. Exp. Mol. Med. 50, 103.    sity of the healthy human microbiome. Nature 486, 207–214.
296 Cell Host & Microbe 29, 281–298, February 10, 2021
                                                                                                                                                         ll
Resource
Iliev, I.D., and Leonardi, I. (2017). Fungal dysbiosis: immunity and interactions     Oshima, T., and Miwa, H. (2016). Gastrointestinal mucosal barrier function and
at mucosal barriers. Nat. Rev. Immunol. 17, 635–646.                                  diseases. J. Gastroenterol. 51, 768–778.
Islami, F., and Kamangar, F. (2008). Helicobacter pylori and esophageal can-          Pertea, G., and Pertea, M. (2020). GFF Utilities: GffRead and GffCompare
cer risk: a meta-analysis. Cancer Prev. Res. (Phila) 1, 329–338.                      [version 1; peer review: 3 approved. F1000Res 9, 304.
Jiang, W. (2018). A protocol for quantizing total bacterial 16S rDNA in plasma        Plottel, C.S., and Blaser, M.J. (2011). Microbiome and malignancy. Cell Host
as a marker of microbial translocation in vivo. Cell. Mol. Immunol. 15, 937–939.      Microbe 10, 324–335.
Kikuchi, Y., Kojima, H., Tanaka, T., Takatsuka, Y., and Kamio, Y. (1997).             Poore, G.D., Kopylova, E., Zhu, Q., Carpenter, C., Fraraccio, S., Wandro, S.,
Characterization of a second lysine decarboxylase isolated from Escherichia           Kosciolek, T., Janssen, S., Metcalf, J., Song, S.J., et al. (2020). Microbiome an-
coli. J. Bacteriol. 179, 4486–4492.                                                   alyses of blood and tissues suggest cancer diagnostic approach. Nature 579,
Koskiniemi, S., Sun, S., Berg, O.G., and Andersson, D.I. (2012). Selection-           567–574.
driven gene loss in bacteria. PLoS Genet 8, e1002787.                                 Poplin, R., Ruano-Rubio, V., DePristo, M.A., Fennell, T.J., Carneiro, M.O., Van
Kostic, A.D., Chun, E., Robertson, L., Glickman, J.N., Gallini, C.A., Michaud,        der Auwera, G.A., Kling, D.E., Gauthier, L.D., Levy-Moonshine, A., Roazen, D.,
M., Clancy, T.E., Chung, D.C., Lochhead, P., Hold, G.L., et al. (2013).               et al. (2017). Scaling accurate genetic variant discovery to tens of thousands of
Fusobacterium nucleatum potentiates intestinal tumorigenesis and modulates            samples. bioRxiv https://www.biorxiv.org/content/10.1101/201178v3.
the tumor-immune microenvironment. Cell Host Microbe 14, 207–215.
                                                                                      Porcheron, G., Garénaux, A., Proulx, J., Sabri, M., and Dozois, C.M. (2013).
Kostic, A.D., Gevers, D., Pedamallu, C.S., Michaud, M., Duke, F., Earl, A.M.,         Iron, copper, zinc, and manganese transport and regulation in pathogenic
Ojesina, A.I., Jung, J., Bass, A.J., Tabernero, J., et al. (2012). Genomic analysis   Enterobacteria: correlations between strains, site of infection and the relative
identifies association of Fusobacterium with colorectal carcinoma. Genome             importance of the different metal transport systems for virulence. Front. Cell.
Res 22, 292–298.                                                                      Infect. Microbiol. 3, 90.
Kostic, A.D., Ojesina, A.I., Pedamallu, C.S., Jung, J., Verhaak, R.G., Getz, G.,      Prast-Nielsen, S., Tobin, A.M., Adamzik, K., Powles, A., Hugerth, L.W.,
and Meyerson, M. (2011). PathSeq: software to identify or discover microbes           Sweeney, C., Kirby, B., Engstrand, L., and Fry, L. (2019). Investigation of the
by deep sequencing of human tissue. Nat. Biotechnol. 29, 393–396.                     skin microbiome: swabs vs. biopsies. Br. J. Dermatol. 181, 572–579.
Krzywinski, M., Schein, J., Birol, I., Connors, J., Gascoyne, R., Horsman, D.,        Purcell, R.V., Visnovska, M., Biggs, P.J., Schmeier, S., and Frizelle, F.A. (2017).
Jones, S.J., and Marra, M.A. (2009). Circos: an information aesthetic for             Distinct gut microbiome patterns associate with consensus molecular sub-
comparative genomics. Genome Res 19, 1639–1645.                                       types of colorectal cancer. Sci. Rep. 7, 11590.
Lemonnier, M., and Lane, D. (1998). Expression of the second lysine decar-
                                                                                      Qin, J., Li, R., Raes, J., Arumugam, M., Burgdorf, K.S., Manichanh, C., Nielsen,
boxylase gene of Escherichia coli. Microbiology 144, 751–760.
                                                                                      T., Pons, N., Levenez, F., Yamada, T., et al. (2010). A human gut microbial gene
Levy, M., Kolodziejczyk, A.A., Thaiss, C.A., and Elinav, E. (2017). Dysbiosis and     catalogue established by metagenomic sequencing. Nature 464, 59–65.
the immune system. Nat. Rev. Immunol. 17, 219–232.
                                                                                      Quinlan, A.R., and Hall, I.M. (2010). BEDTools: a flexible suite of utilities for
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G.,      comparing genomic features. Bioinformatics 26, 841–842.
Abecasis, G., and Durbin, R.; 1000 Genome Project Data Processing
                                                                                                     €ndar, F., Diehl, S., Gru
                                                                                      Ramı́rez, F., Du                        €ning, B.A., and Manke, T. (2014).
Subgroup (2009). The Sequence Alignment/Map format and SAMtools.
                                                                                      deepTools: a flexible platform for exploring deep-sequencing data. Nucleic
Bioinformatics 25, 2078–2079.
                                                                                      Acids Res 42, W187–W191.
Liao, Y., Smyth, G.K., and Shi, W. (2014). featureCounts: an efficient general
                                                                                      Rensing, C., and Grass, G. (2003). Escherichia coli mechanisms of copper ho-
purpose program for assigning sequence reads to genomic features.
                                                                                      meostasis in a changing environment. FEMS Microbiol. Rev. 27, 197–213.
Bioinformatics 30, 923–930.
Liu, J., Lichtenberg, T., Hoadley, K.A., Poisson, L.M., Lazar, A.J., Cherniack,       Robinson, K.M., Crabtree, J., Mattick, J.S., Anderson, K.E., and Dunning
A.D., Kovatich, A.J., Benz, C.C., Levine, D.A., Lee, A.V., et al. (2018). An inte-    Hotopp, J.C. (2017). Distinguishing potential bacteria-tumor associations
grated TCGA pan-cancer clinical data resource to drive high-quality survival          from contamination in a secondary data analysis of public cancer genome
outcome analytics. Cell 173, 400–416.e11.                                             sequence data. Microbiome 5, 9.
Luckey, T.D. (1972). Introduction to intestinal microecology. Am. J. Clin. Nutr.      Rubinstein, M.R., Wang, X., Liu, W., Hao, Y., Cai, G., and Han, Y.W. (2013).
25, 1292–1294.                                                                        Fusobacterium nucleatum promotes colorectal carcinogenesis by modulating
                                                                                      E-cadherin/beta-catenin signaling via its FadA adhesin. Cell Host Microbe 14,
Ma, W., Huang, C., Zhou, Y., Li, J., and Cui, Q. (2017). Micropattern: a web-
                                                                                      195–206.
based tool for microbe set enrichment analysis and disease similarity calcula-
tion based on a list of microbes. Sci. Rep. 7, 40200.                                 Schirmer, M., Garner, A., Vlamakis, H., and Xavier, R.J. (2019). Microbial genes
                                                                                      and pathways in inflammatory bowel disease. Nat. Rev. Microbiol. 17,
Marotz, C., Amir, A., Humphrey, G., Gaffney, J., Gogul, G., and Knight, R.
                                                                                      497–511.
(2017). DNA extraction for streamlined metagenomics of diverse environ-
mental samples. BioTechniques 62, 290–293.                                            Sender, R., Fuchs, S., and Milo, R. (2016). Revised estimates for the number of
McMurdie, P.J., and Holmes, S. (2013). phyloseq: an R package for reproduc-           human and bacteria cells in the body. PLoS Biol 14, e1002533.
ible interactive analysis and graphics of microbiome census data. PLoS One 8,         Sivan, A., Corrales, L., Hubert, N., Williams, J.B., Aquino-Michaels, K., Earley,
e61217.                                                                               Z.M., Benyamin, F.W., Lei, Y.M., Jabri, B., Alegre, M.-L., et al. (2015).
Mira, A., Ochman, H., and Moran, N.A. (2001). Deletional bias and the evolution       Commensal Bifidobacterium promotes antitumor immunity and facilitates
of bacterial genomes. Trends Genet 17, 589–596.                                       anti-PD-L1 efficacy. Science 350, 1084–1089.
Nejman, D., Livyatan, I., Fuks, G., Gavert, N., Zwang, Y., Geller, L.T., Rotter-      Sriswasdi, S., Yang, C.C., and Iwasaki, W. (2017). Generalist species drive mi-
Maskowitz, A., Weiser, R., Mallel, G., Gigi, E., et al. (2020). The human tumor       crobial dispersion and evolution. Nat. Commun. 8, 1162.
microbiome is composed of tumor type-specific intracellular bacteria.                 Subramanian, A., Tamayo, P., Mootha, V.K., Mukherjee, S., Ebert, B.L.,
Science 368, 973–980.                                                                 Gillette, M.A., Paulovich, A., Pomeroy, S.L., Golub, T.R., Lander, E.S., and
Nossa, C.W., Jain, P., Tamilselvam, B., Gupta, V.R., Chen, L.F., Schreiber, V.,       Mesirov, J.P. (2005). Gene set enrichment analysis: a knowledge-based
Desnoyers, S., and Blanke, S.R. (2009). Activation of the abundant nuclear fac-       approach for interpreting genome-wide expression profiles. Proc. Natl.
tor poly(ADP-ribose) polymerase-1 by Helicobacter pylori. Proc. Natl. Acad.           Acad. Sci. USA 102, 15545–15550.
Sci. USA 106, 19998–20003.                                                            Tang, K.W., Alaei-Mahabadi, B., Samuelsson, T., Lindh, M., and Larsson, E.
O’Donovan, D., Corcoran, G.D., Lucey, B., and Sleator, R.D. (2014).                   (2013). The landscape of viral expression and host gene fusion and adaptation
Campylobacter ureolyticus: a portrait of the pathogen. Virulence 5, 498–506.          in human cancer. Nat. Commun. 4, 2513.
                                                                                                 Cell Host & Microbe 29, 281–298, February 10, 2021 297
             ll
                                                                                                                                                 Resource
Thorvaldsdóttir, H., Robinson, J.T., and Mesirov, J.P. (2013). Integrative           Yamamoto, Y., Miwa, Y., Miyoshi, K., Furuyama, J.-i., and Ohmori, H. (1997).
Genomics Viewer (IGV): high-performance genomics data visualization and               The Escherichia coli ldcC gene encodes another lysine decarboxylase, prob-
exploration. Brief. Bioinform. 14, 178–192.                                           ably a constitutive enzyme. Genes Genet. Syst. 72, 167–172.
Viaud, S., Saccheri, F., Mignot, G., Yamazaki, T., Daillère, R., Hannani, D.,        Yang, J.H., Luo, X., Nie, Y., Su, Y., Zhao, Q., Kabir, K., Zhang, D., and
Enot, D.P., Pfirschke, C., Engblom, C., Pittet, M.J., et al. (2013). The intestinal   Rabinovici, R. (2003). Widespread inosine-containing mRNA in lymphocytes
microbiota modulates the anticancer immune effects of cyclophosphamide.               regulated by ADAR1 in response to inflammation. Immunology 109, 15–23.
Science 342, 971–976.                                                                 Yu, L.C. (2018). Microbiota dysbiosis and barrier dysfunction in inflammatory
                                                                                      bowel disease and colorectal cancers: exploring a common ground hypothe-
Warren, R.L., Freeman, D.J., Pleasance, S., Watson, P., Moore, R.A.,                  sis. J. Biomed. Sci. 25, 79.
Cochrane, K., Allen-Vercoe, E., and Holt, R.A. (2013). Co-occurrence of anaer-
                                                                                      Yu, T., Guo, F., Yu, Y., Sun, T., Ma, D., Han, J., Qian, Y., Kryczek, I., Sun, D.,
obic bacteria in colorectal carcinomas. Microbiome 1, 16.
                                                                                      Nagarsheth, N., et al. (2017). Fusobacterium nucleatum promotes chemore-
Wood, D.E., Lu, J., and Langmead, B. (2019). Improved metagenomic analysis            sistance to colorectal cancer by modulating autophagy. Cell 170, 548–
with Kraken 2. Genome Biol 20, 257.                                                   563.e16.
298 Cell Host & Microbe 29, 281–298, February 10, 2021
                                                                                                                              ll
Resource
STAR+METHODS
Continued
REAGENT or RESOURCE                               SOURCE                                       IDENTIFIER
Oligonucleotides
Primers for 16S analysis                          IDT                                          N/A
FWD:GTGYCAGCMGCCGCGGTAA
REV:GGACTACNVGGGTWTCTAAT
Software and Algorithms
GATK 4.0.3 (PathSeq & HaplotypeCaller)            (Kostic et al., 2011; Poplin et al., 2017)   https://github.com/broadinstitute/gatk/
SAMtools 1.9                                      (Li et al., 2009)                            http://samtools.sourceforge.net/
phyloseq 1.30.0                                   (McMurdie and Holmes, 2013)                  https://github.com/joey711/phyloseq
metacoder 0.3.3                                   (Foster et al., 2017)                        https://grunwaldlab.github.io/
                                                                                               metacoder_documentation/
gffread 0.11.6                                    (Pertea and Pertea, 2020)                    https://github.com/gpertea/gffread
STAR 2.7.3a                                       (Dobin et al., 2013)                         https://github.com/alexdobin/STAR/
subread 1.6.4                                     (Liao et al., 2014)                          http://subread.sourceforge.net/
deepTools 3.3.0                                   (Ramı́rez et al., 2014)                      https://github.com/deeptools/deepTools
bedtools 2.29.0                                   (Quinlan and Hall, 2010)                     https://github.com/arq5x/bedtools2
circos 0.69.8                                     (Krzywinski et al., 2009)                    http://circos.ca/software/download/
IGV 2.4.14                                        (Thorvaldsdóttir et al., 2013)              http://software.broadinstitute.org/
                                                                                               software/igv/
QIIME2 2019.7                                     (Bolyen et al., 2019)                        https://qiime2.org/
SparCC                                            (Friedman and Alm, 2012)                     https://bitbucket.org/yonatanf/
                                                                                               sparcc/src/default/
MicroPattern                                      (Ma et al., 2017)                            http://www.cuilab.cn/micropattern
lifelines 0.23.8                                  (Davidson-Pilon et al., 2020)                https://github.com/CamDavidsonPilon/
                                                                                               lifelines/tree/0.24.6
GSEA 4.0.3                                        (Subramanian et al., 2005)                   https://www.gsea-msigdb.org/gsea/
Other
Patient metadata for TCGA validation              This paper                                   Table S1
samples
Tumor- and normal tissue-associated               This paper                                   Table S2
taxa
RESOURCE AVAILABILITY
Lead contact
Further information and requests for resources should be directed to and will be fulfilled by the lead contact, Xiling Shen (xiling.shen@
duke.edu).
Materials availability
The Cancer Genome Atlas (TCGA) collected biospecimens and associated clinical information from human subjects, under informed
consent and authorization of local institutional review boards. All TCGA sequencing data were accessed from the Genomic Data
Commons (GDC) portal in accordance with the TCGA Data Use Certification Agreement and under authorization of Duke’s campus
institutional review board. Original TCGA tissue and plasma samples were acquired from Indivumed, a third-party vendor. Healthy
patient samples used for validation analyses were collected and analyzed under authorization of Duke’s campus institutional review
board. Primers used from 16S sequencing were acquired from IDT. This study did not generate new unique reagents.
METHOD DETAILS
Decomposition of observed TCGA microbial profiles into tissue-resident and contaminant fractions
The classification of tissue-resident microbiota for each TCGA project was performed at the species-level using WGS sequencing
data. To assess whether a species deviated significantly from equiprevalence and identify a tissue-resident population, we found
the most generalizable criteria combined a statistical test of proportions with a hard cutoff on blood prevalence. Species were defined
as tissue-resident if they were prevalent in fewer than 20% of blood samples and significantly more prevalent in tissue than blood by a
one-sided Fisher exact test (q < 0.05).
   Fisher’s test offered two major benefits: (1) it performs well for low prevalence cases, meaning that it naturally removed low-prev-
alence species which could not be statistically distinguished from contamination, and (2) it is sensitive to the group sample sizes pro-
vided (tissue and blood), making it sufficiently generalizable across sequencing projects which had varying numbers of tissue and
blood samples. Because of the very large total number of detectible species (n = 11,745 in CRC), we used FDR-correction to adjust
for multiple tests (q < 0.05). The second filter was hard cutoff on blood prevalence (<20%). This was effective for classifying high-
prevalence species, which were statistically more prevalent in tissue than blood but still detectible in more blood samples than
was plausible for endogenous blood-borne bacteria. Ultimately, for CRC data the second cutoff was only relevant for four species
from the Enterobacteriaceae family, which likely represent mixed-evidence species. Finally, we defined ‘‘detectible’’ as having more
than one sequencing read aligning to a given taxon (R2 reads). Singletons (taxa with a single read) are known to frequently be
sequencing artifacts or false positives and are commonly removed to reduce noise in downstream metagenomic analyses.
   Many reads are not aligned at the species level. For example, unambiguous genus-level alignments are not necessarily equal to the
sum of unambiguous species-level alignments from species within that genus. Therefore, in order to preserve read counts at taxo-
nomic ranks above species-level, we adjusted read counts to reflect that a given clade could be comprised of a combination of
contaminant and tissue-resident species. The decomposition of observed metagenomic data (K) into tissue-resident (T) and contam-
inant (C) components for a given taxon in a given sample can be described using a mixture with two components of the form K = mT +
nC, where m and n represent the estimated fractions of tissue-resident or contaminant sequencing reads belonging to a given taxon,
respectively (such that m + n = 1). For all taxa above the species-level, we assigned m and n using the relative fractions of unambig-
uously aligned sequencing reads from species classified as tissue-resident or contaminant within the corresponding clade. For taxa
with fewer than 5 unambiguously assigned reads, we imputed mixtures from other sequencing runs processed on the same plate or
center. Defining these mixtures thus allowed us to propagate the classification of tissue-resident species to higher taxonomic ranks
on a sample-by-sample basis while preserving read counts that were unambiguously aligned above the species level.
genome tracks were then plotted using the Circos software (Krzywinski et al., 2009). For visualization of cadA and ldcC alignments,
counts from each sample and bin (10bp) were summed and divided by the total number of samples per track. The bedgraph files were
converted to bigwig using UCSC bedGraphToBigWig, then bigwigs were plotted using The Integrated Genomics Viewer (IGV) (Thor-
valdsdóttir et al., 2013). Each track was scaled to the max bin height within the viewable region.
we then calculated the median log2 ratio between the relative abundance of each taxa in each tissue type. Significance values were
calculated using Wilcoxon’s rank-sums test and corrected for false discovery rate. Taxa with significant p-values (p < 0.05) were
selected for downstream analysis.
Survival analysis
We performed our survival analyses using a log-rank test, using the lifelines survival analysis python package (Davidson-Pilon et al.,
2020). Relative abundances of decontaminated bacterial compositions for all tissue samples belonging to a given patient were used
for both models. Data on patient survival, disease-free interval, and progression-free interval were collected from the PanCanAtlas’
clinical follow-up data (Liu et al., 2018). For log-rank tests, patients were segregated into two groups: one with taxa relative abun-
dance below the bottom quartile (‘‘low’’ or ’’absent’’), and one with above the top quartile (‘‘high’’). In many cases, particularly for
species-level alignments, the bottom quartile was zero and therefore may include more than a quarter of patients. To ensure quartiles
were non-equal, taxa that were present in fewer than 25% of samples were excluded from the analysis. The CPH test was performed
with default parameters and 10-fold cross-validation.
All statistical tests between unmatched groups were performed using a Wilcoxon rank-sums test (p-value), and all statistical tests
between matched groups were performed using a Wilcoxon signed-rank test (p-value) unless otherwise specified. Statistical tests
of prevalence were performed using a one-sided Fisher’s exact test. Statistical tests of variance for microbial compositions were
performed using PERMANOVA. Statistical tests for survival analyses were performed using the log-rank test. For multiple tests,
the false discovery rate (FDR; q-value) was calculated using the Benjamini-Hochberg method. All analyses were performed in python
3.7.1 and R 3.6.1. P-values are indicated as follows: *, <0.05; **, <0.01; ***, <0.001.
Supplemental Information
Figure S1. WGS and WXS harbor colorectal bacterial reads distinct from blood and brain, Related to
Figure 1
(A) Matched analysis of Shannon diversity of bacteria in normal tissue (yellow), tumor tissue (blue), and
blood (red) from CRC and BC patients in TCGA. By a paired, one-sided t-test, tumor and normal
tissue from CRC patients have more bacterial diversity than CRC blood, but BC tumor tissue does not
(B) Shows diversity data from (A), but pools together tumor and normal tissue samples from BC (green)
and CRC (brown) patients. By an unpaired, one-sided t-test CRC tissue has greater bacterial diversity
than BC tissue, and CRC blood has greater bacterial diversity than BC blood.
(C) Comparison of bacterial species prevalence in WXS data for blood and tissue samples from CRC and
BC.
(D) Comparison of bacterial species prevalence in WXS data for BC and CRC samples from tissue and
blood.
(E) Comparison of bacterial species prevalence in WGS and WXS data for blood and tissue samples from
OVC.
(F) Comparison of bacterial species prevalence in WGS data for OVC and CRC samples from tissue and
blood.
(G) Relative abundance of bacterial phyla in WXS data for tissue and blood samples from CRC and BC
patients.
(H) Comparison of bacterial species prevalence in tissue and blood using Kraken2 results.
(C – D) Correlation between PathSeq and Kraken2 read counts estimated for selected phyla (I) and genera (J)
in matched sequencing experiments from tumors (blue) and normal tissue (yellow).
Figure S2. Most equiprevalent taxa are common contaminants and are associated with particular
                                                                                                                   1
(A)        Genera ranked by their prevalence in CRC blood samples. Common contaminants are marked in red.
(B) Panel comparing gene count, G-C content, pH optima, and NaCl optima for tissue-enriched (blue) and
(C) PCoA of WXS data for CRC samples, labeled by sample type (left) and sequencing center (right).
(D) Schematic showing the path of TCGA sequencing samples from their tissue source site (TSS) to their
(E) Abundance of selected tissue-resident species overlaid on PCoA plot of WGS sequencing data (Figure
2D).
(F) Abundance of putative contaminant species overlaid on PCoA plot of WGS sequencing data (Figure
2D).
(G) PCoA of WGS data for BC samples, labeled by sample type (top) and sequencing center (bottom).
(H) Heatmap clustering of bacterial species’ normalized log10-abundance in blood samples from BC
patients.
(I) Species classified as tissue-resident by comparing WGS prevalence in tissue and blood samples (blue)
are consistent with prevalence comparisons of BC tissue vs. CRC tissue and for analogous
(J) Fraction of contaminant bacterial reads from WXS data for normal (yellow), tumor (blue), and blood
(red) samples from CRC patients, broken down by the five most prevalent phyla.
(K) Fraction of contaminant bacterial reads from WGS data for each sample type at Harvard, Baylor, and
Broad.
(L) The CLR-normalized relative abundances of the five most prevalent phyla are uncorrelated between
Figure S3. Detecting tissue-resident and contaminant species with gene-level resolution, Related to
Figure 3
(A) Prevalence of genes belonging to F. nucleatum genes in blood vs. tissue; representative of tissue-
                                                                                                             2
(B)       Prevalence of genes belonging to C. provencense genes in blood vs. tissue; representative of
(C – D) Distribution of gene prevalence in blood (red) and tissue (blue) for F. nucleatum (C) and C. provencense
(D).
(E – F) Genome coverage of reads aligning to F. nucleatum (E) and C. provencense (F) in blood (red) and
tissue (blue).
(G – I) Genome coverage of B. vulgatus (G), A. junii (H), E. coli (I) in samples from Baylor and Harvard.
(J) Distribution of sequencing reads aligned to cadA (left) and ldcC (right) genes in blood (red) and tissue
(blue).
(K) Prevalence of genes in the pks island encoding colibactin in tissue and blood samples. These genes
are enriched in tissue from CRC patients, consistent with previous reports.
(L) Barplot showing average prevalence of variants from species defined as tissue-resident and
equiprevalent species in blood (red) and tissue (blue). Data are represented as mean ± 95% CI.
(Mycobacterium spp.), and mixed-evidence species (E. coli). E. coli appears to harbor both tissue-
Figure S4. Metagenomic analysis of original TCGA tissue and blood samples validate tissue-resident
(B) Total bacterial reads from 16S results of tissue (blue), plasma (red) and controls (bottom panel). Data
(C) Weighted UniFrac distances between microbial compositions of tissue sequenced at Harvard and
Baylor (WGS) compared with patient-matched tissue analyzed at Duke (16S), before and after
                                                                                                                 3
(D)     Relative abundances of bacterial phyla in 16S results for tissue compared with tissue samples
(E – F) Relative abundances of bacterial phyla in 16S results for blood compared with blood samples
sequenced using WGS at Harvard and Baylor (E) and WXS (F).
tumor-associated (blue) and normal tissue-associated (yellow) species. Points are sized proportional to
(B) T-SNE visualization showing coabundance of detected species and their relative abundances (purple).
(E) ADAR1, which distinguished of Fusobacterium and Bacteroides coabundance groups, is positively
(F) Volcano plot showing species that are more abundant in tumor samples (blue) or matched normal tissue
(yellow).
(G) Heat-tree showing Bacteroides spp. and their associations with normal (yellow) and tumor (blue) tissue.
(H) Barplot showing genes significantly correlated with C. ureolyticus expression in tissue samples from
CRC patients.
(I) Scatterplots showing correlation of C. ureolyticus abundance (CLR) and gene expression of CAMK2D
and UGDH.
(J) Several in the C. ureolyticus genome are significantly associated with tumor samples (blue) relative to
(K) C. ureolyticus is predictive of progression-free interval (PFI) among among patients with CRC
recurrence.
                                                                                                             4
(L)     Several species in the Bacteroides cluster are negatively predictive of patient survival (OS).
Figure S6. Bacteria in TCGA sequencing data are prognostic of mucosal barrier injury, immune
(A) Heatmap showing correlations between CLR-transformed abundances of tumor- and normal tissue-
associated species with features in the PARADIGM pathway matrix. PARADIGM scores are inferred
(B – D) Comparison of differentially abundant species and their association with tissue type (x-axis) versus
enrichment score (y-axis) for KEGG terms associated with immune pathways (B), cancer-related
(E) Enrichment scores for tumor-associated (blue) and normal tissue-associated (yellow) species across
(F) Panel showing enrichment scores across pathologic stage for all tumor-associated (blue) and normal
(G) Abundance of species in BC blood (green) and CRC blood (brown) samples, ordered by statistical
significance.
Figure S7. Contamination-adjusted tissue microbiome profiles for all gastrointestinal cancers in TCGA,
Related to Figure 7
(B) Prevalence of species in BC tissue and disease-specific tissue from non-gastrointestinal cancers.