Article https://doi.org/10.1038/s41467-022-34405-3

Faecal microbiome-based machine learning for multi-class disease diagnosis

Received: 2 September 2022
Accepted: 21 October 2022

Qi Su 1,2,3,4,6, Qin Liu 1,2,3,4,6, Raphaela Iris Lau 1,2,3, Jingwan Zhang 1,2,3,4, Zhilu Xu 1,2,3,4, Yun Kit Yeoh 1, Thomas W. H. Leung 2, Whitney Tang 1,2,3, Lin Zhang 1,2,3,4, Jessie Q. Y. Liang 2,3,4, Yuk Kam Yau 1,2,3, Jiaying Zheng 1,2,3, Chengyu Liu 1,2,3, Mengjing Zhang 1,2,3, Chun Pan Cheung 1,2,4, Jessica Y. L. Ching 1,2,3, Hein M. Tun 1,3,5, Jun Yu 2,3,4, Francis K. L. Chan 1,2,3,4 & Siew C. Ng 1,2,3,4

Systemic characterisation of the human faecal microbiome provides the opportunity to develop non-invasive approaches in the diagnosis of a major human disease. However, shared microbial signatures across different diseases make accurate diagnosis challenging in single-disease models. Herein, we present a machine-learning multi-class model using a faecal metagenomic dataset of 2,320 individuals with nine well-characterised phenotypes, including colorectal cancer, colorectal adenomas, Crohn’s disease, ulcerative colitis, irritable bowel syndrome, obesity, cardiovascular disease, post-acute COVID-19 syndrome and healthy individuals. Our processed data covers 325 microbial species derived from 14.3 terabytes of sequence. The trained model achieves an area under the receiver operating characteristic curve (AUROC) of 0.90 to 0.99 (interquartile range, IQR, 0.91–0.94) in predicting different diseases in the independent test set, with a sensitivity of 0.81 to 0.95 (IQR 0.87–0.93) at a specificity of 0.76 to 0.98 (IQR 0.83–0.95). Metagenomic analysis from public datasets of 1,597 samples across different populations observes comparable predictions with AUROC of 0.69 to 0.91 (IQR 0.79–0.87). Correlation of the top 50 microbial species with disease phenotypes identifies 363 significant associations (FDR < 0.05). This microbiome-based multi-disease model has potential clinical application in disease diagnostics and treatment response monitoring and warrants further exploration.
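
As a reading aid for the "range (IQR)" summaries quoted above, the sketch below recomputes the test-set AUROC range and interquartile range from the nine per-phenotype AUC values printed in Fig. 1b. The quartile convention (linear interpolation including both end points) is an assumption; the article does not state which convention was used.

```python
from statistics import quantiles

# Per-phenotype AUC values as printed in Fig. 1b (one phenotype versus all others).
auc = {
    "Health": 0.91, "CA": 0.90, "CD": 0.93, "CRC": 0.94, "CVD": 0.91,
    "IBS-D": 0.99, "Obesity": 0.92, "PACS": 0.98, "UC": 0.93,
}

values = sorted(auc.values())
# Assumption: quartiles by linear interpolation that includes both end points.
q1, median, q3 = quantiles(values, n=4, method="inclusive")

print(f"range {values[0]:.2f} to {values[-1]:.2f}, IQR {q1:.2f} to {q3:.2f}")
# prints: range 0.90 to 0.99, IQR 0.91 to 0.94
```

Under that convention the nine values give the same 0.90 to 0.99 range and 0.91 to 0.94 IQR as reported.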

Recent studies have shown that imbalanced intestinal microbiota, termed “dysbiosis”, contributes to various human diseases1. The current development of microbial markers has mostly used binary classifiers2–5. Emerging evidence, however, suggests that most health conditions exhibit overlapping gut microbiome signatures6, thus single-disease diagnostic models are likely to be confounded by unrelated diseases and may lead to misclassification. Although an attempt has been made to develop a multi-class diagnostic model, heterogeneity, technical bias and batch effects involved in the previous work relying on public datasets for analyses would limit

1 Microbiota I-Center (MagIC), Hong Kong SAR, China.
2 Department of Medicine and Therapeutics, The Chinese University of Hong Kong, Hong Kong SAR, China.
3 Li Ka Shing Institute of Health Sciences, State Key Laboratory of Digestive Disease, Institute of Digestive Disease, The Chinese University of Hong Kong, Hong Kong SAR, China.
4 Center for Gut Microbiota Research, Faculty of Medicine, The Chinese University of Hong Kong, Hong Kong SAR, China.
5 JC School of Public Health and Primary Care, The Chinese University of Hong Kong, Hong Kong SAR, China.
6 These authors contributed equally: Qi Su, Qin Liu.
e-mail: siewchienng@cuhk.edu.hk

Nature Communications | (2022)13:6818 1


accuracy7. Here, we develop the largest single-site dataset to date covering multiple diseases, adopt a machine learning multi-class model to predict different diseases using species-level faecal microbiome profiling, and validate the findings using public metagenome datasets across different populations.

Results
We performed metagenomic sequencing of faecal samples from 2320 Hong Kong Chinese (mean age 54.9, 48.7% female, Source Data, Supplementary Fig. 1a, see the “Methods” section) consisting of 9 well-characterised disease phenotypes: colorectal cancer (CRC, n = 174), colorectal adenomas (CA, n = 168), Crohn’s disease (CD, n = 200), ulcerative colitis (UC, n = 147), irritable bowel syndrome (diarrhoea subtype, IBS-D, n = 145), obesity (n = 148), cardiovascular disease (CVD, n = 143), post-acute COVID-19 syndrome (PACS, n = 302) and healthy controls (n = 893). In total, we obtained 14.3 terabytes of sequence at an average depth of 6.15 gigabases for each metagenome and identified 1208 bacterial species. Amongst them, 325 bacterial species had a relative abundance higher than 0.15% and these species were present in over 5% of the subjects (Source Data).

Shared microbiome signatures across different phenotypes
We observed differences in bacterial diversity (Shannon) and richness (number of species) in different diseases, and we found that both indices vary across phenotypes (Source Data, Supplementary Fig. 1b, c). These results are consistent with a recent meta-analysis8, indicating that ecological indices may not be robust indicators of health or disease. Then, we explored associations of microbial composition at the species level with disease phenotypes using a linear model of MaAsLin2 after adjusting for biological and technical confounders (see the “Methods” section). We found a total of 1061 significant associations between these nine phenotypes and 215 bacterial taxa at the species level (FDR < 0.05). Amongst the 215 species, more than 94% were significantly associated with two or more diseases, which is consistent with previous works showing that numerous signals are shared among different diseases6,9 (Source Data, Supplementary Fig. 1d). For instance, Klebsiella pneumoniae, a well-characterised opportunistic pathogen10, was positively associated with CD, CRC, IBS-D, Obesity, PACS and UC in our cohort, whilst Roseburia intestinalis, a promising probiotic with butyrate-producing properties11, negatively correlated with these six disease phenotypes (Source Data, Supplementary Fig. 1d). Next, we found that both PCoA analysis based on beta-diversity and random forest (RF) binary classifier could significantly separate all disease phenotypes (Source Data, Supplementary Table 1, Supplementary Fig. 2a–c, all p < 0.001). Whilst common microbial signatures were shared across diseases, these findings pointed to the presence of disease-specific microbial composition. However, it is unknown whether binary classifiers can capture these disease-specific signatures. Therefore, we tested the specificity of our trained binary models in unrelated diseases, and the results showed a high misdiagnosis rate (average 0.52, IQR 0.41–0.65, Source Data, Supplementary Fig. 2d). These results suggested that the binary classifier failed to capture real disease-specific features based solely on single disease versus control samples.

Development of faecal microbiome-based multi-class diagnosis model
Classification tasks in machine learning involving more than two classes are known as “multi-class classification”, which can effectively account for confounding effects of unrelated classes12. Based on our cohort of 2320 Hong Kong Chinese, we trained five machine learning multi-class classifiers (RF, K-nearest neighbours (KNN), multi-layer perceptron (MLP), support vector machine (SVM), and graph convolutional neural network (GCN)) to classify different diseases using species-level data from the training set (70% samples with the same class proportions as the cohort) and presented their final performance from the withheld test set (30% samples, Fig. 1a, see the “Methods” section). All these models achieved a mean AUROC of 0.67–0.99 (interquartile range, IQR 0.81–0.92), suggesting that multi-class disease classification based on the faecal microbiome was feasible (Source Data, Supplementary Fig. 3). Amongst them, the RF multi-class model achieved a mean AUROC of 0.90–0.99 (IQR 0.91–0.94, one versus all others, Fig. 1b) for different disease phenotypes in the test set. The performance of the RF model in the test set significantly outperformed all other models (Source Data, Supplementary Fig. 3b) and was similar to that of the training set (calculated by 5-fold cross-validation, Source Data, Supplementary Fig. 3c), suggesting high integrity of this classifier. Therefore, the RF multi-class model was used for further analyses. At a threshold based on the highest Youden’s Index, the sensitivities of our RF multi-class classifier ranged from 0.81 to 0.95 (IQR 0.87–0.93) at specificities of 0.76 to 0.98 (IQR 0.83–0.95) for different diseases with accuracy from 0.77 to 0.98 (IQR 0.82–0.92, one versus all others, Fig. 1c), highlighting good diagnostic performance. For example, our classifier achieved a mean AUROC of 0.94 for CRC with a sensitivity of 0.88 at a specificity of 0.85 (accuracy 0.85, one versus all others, Fig. 1b, c); this performance was superior to that of our trained binary classifier (CRC versus health, mean AUROC 0.91, Source Data, Supplementary Fig. 2c) and a previously published CRC diagnostic model2. Further assessment using predicted probabilities in the test set showed that the trained classifier achieved a mean AUROC of 0.94 for all one versus one classifications (IQR 0.92–0.98, Source Data, Supplementary Fig. 4a) with high sensitivities (IQR 0.88–0.95) and specificities (IQR 0.83–0.94, Source Data, Supplementary Fig. 4b), which supported a superior performance of multi-class model analyses over binary models (Source Data, Supplementary Fig. 2c).

To fully characterise the RF multi-class model, we compared its performance under different split ratios and achieved similar results, suggesting high stability and good predictive power without risk of overfitting (Source Data, Supplementary Fig. 3d). Given that subjects with CRC or colorectal adenomas were older than other subjects (Source Data, Supplementary Fig. 1a), we assessed our model stratified by age and found consistent performance (Source Data, Supplementary Fig. 5). In addition, our model achieved a mean AUROC of 0.87 in distinguishing CRC and colorectal adenomas (Source Data, Supplementary Fig. 4), which supported that the effect of age on the model was likely to be negligible. To rule out the possibility that uneven cohort sizes across different diseases may influence the classification performance, we trained a separate RF multi-class classifier by randomly pooling 143 subjects from each disease phenotype (a total of 1287 subjects, 70% training, 30% testing) and found an AUROC of 0.83–0.99 (one versus all others, IQR 0.89–0.96; one versus one, IQR 0.89–0.97; Source Data, Supplementary Fig. 6) which was comparable to the AUROC of 0.90–0.99 in the 2,320 individuals (one versus all others, IQR 0.91–0.94; one versus one, IQR 0.92–0.98; Fig. 1b, Source Data, Supplementary Fig. 4). Importantly, the AUROC values of the model increased with the increasing number of features, which suggested again that overfitting based on the 325 selected features was unlikely (Source Data, Supplementary Fig. 7).

Validation of multi-class model on independent datasets
Then, we integrated 1597 shotgun faecal metagenome data from 12 public datasets from Asia, Europe and North America (Source Data, Supplementary Table 2, Supplementary Fig. 8a). Our RF multi-class classifier showed a mean AUROC of 0.69–0.91 (IQR 0.79–0.87, Source Data, Supplementary Table 3) in classifying different diseases, and generally outperformed all other models (Source Data, Supplementary Fig. 8b). Such performance from an independent validation cohort further confirmed the robustness and generalisability of our model across different populations and geographical locations. To further validate the accuracy of our model, we selected 60 patients who had a


[Figure 1. Panel a: workflow diagram. Labels: faecal microbiome, n = 2320 (Health, n = 893; CA, n = 168; CD, n = 200; CRC, n = 174; CVD, n = 143; IBS-D, n = 145; Obesity, n = 148; PACS, n = 302; UC, n = 147); microbiome profiling, 325 bacterial species; training 70%, test 30%; SVM, KNN, RF, MLP, GCN; 5-fold cross-validation; repeats (n = 20); optimal trained multi-class classifier; predicted probabilities per phenotype.]

[Panel b: ROC curve (325 features). x-axis: 1 - Specificity; y-axis: Sensitivity.]

Phenotype   AUC    95% CI
Health      0.91   0.90-0.92
CA          0.90   0.89-0.92
CD          0.93   0.91-0.94
CRC         0.94   0.93-0.96
CVD         0.91   0.89-0.93
IBS-D       0.99   0.98-0.99
Obesity     0.92   0.89-0.96
PACS        0.98   0.97-0.99
UC          0.93   0.91-0.94

[Panel c: Thresholds at highest Youden index.]

Phenotype   Threshold   Sensitivity   Specificity   Accuracy   Youden index
Health      0.36586     0.81          0.83          0.82       0.64
CA          0.08028     0.93          0.76          0.77       0.69
CD          0.09361     0.88          0.83          0.83       0.71
CRC         0.09666     0.88          0.85          0.85       0.72
CVD         0.06786     0.87          0.77          0.78       0.64
IBS-D       0.12423     0.94          0.98          0.98       0.93
Obesity     0.10805     0.88          0.95          0.94       0.82
PACS        0.15348     0.95          0.92          0.92       0.87
UC          0.09891     0.86          0.86          0.86       0.72

Fig. 1 | Faecal microbiome-based machine learning for multi-class disease diagnosis. a Framework for dataset partition, model training and independent validation. b Area under the receiver operating characteristic curve (AUROC, centre for the error bands is median). c Performance metric details of the trained random forest multi-class classifier for classifying one phenotype from all others using species-level faecal microbiome data in the independent test set. SVM support vector machine, KNN K-nearest neighbours, RF random forests, MLP multi-layer perceptron, GCN graph convolutional neural network, CA colorectal adenomas, CD Crohn’s disease, CRC colorectal cancer, CVD cardiovascular disease, IBS-D diarrhoea-dominant irritable bowel syndrome, PACS post-acute COVID-19 syndrome, UC ulcerative colitis. Source data are provided as a Source Data file.
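
Panel c of Fig. 1 reports, for each phenotype, the probability threshold with the highest Youden index (sensitivity + specificity - 1) in a one-versus-all-others setting. The following is a minimal sketch of how such a threshold can be picked from predicted probabilities. The function name `youden_threshold` and the toy data are hypothetical and are not taken from the study; the authors' own implementation is not shown in the article.

```python
def youden_threshold(y_true, scores):
    """Return (threshold, sensitivity, specificity, J) maximising Youden's J.

    y_true: 0/1 labels (1 = phenotype of interest, 0 = all other phenotypes).
    scores: predicted probabilities for that phenotype.
    A sample is called positive when its score is >= threshold.
    """
    pairs = list(zip(y_true, scores))
    n_pos = sum(1 for y, _ in pairs if y == 1)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative sample")

    best = None
    # Every observed score is a candidate cut-off.
    for thr in sorted({s for _, s in pairs}):
        tp = sum(1 for y, s in pairs if y == 1 and s >= thr)
        tn = sum(1 for y, s in pairs if y == 0 and s < thr)
        sens = tp / n_pos
        spec = tn / n_neg
        j = sens + spec - 1
        if best is None or j > best[3]:
            best = (thr, sens, spec, j)
    return best


# Toy data, made up for illustration only.
labels = [0, 0, 0, 0, 1, 0, 1, 1]
probs = [0.02, 0.05, 0.07, 0.09, 0.10, 0.30, 0.40, 0.80]
thr, sens, spec, j = youden_threshold(labels, probs)
print(thr, sens, spec, round(j, 2))  # prints: 0.1 1.0 0.8 0.8
```

Because the probabilities of a nine-class model are spread over nine phenotypes, the optimal one-versus-rest cut-off can sit well below 0.5, as the thresholds of roughly 0.07 to 0.37 in panel c show.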

complete recovery from COVID-19 infection. Our trained model showed an accuracy of 83.3% (50/60) in classifying these subjects as healthy (Source Data, Supplementary Fig. 8c). These data verified that fully recovered COVID-19 survivors (and without PACS) shared similar gut microbiome profiles with healthy people13. Additionally, we tested our trained RF model on diseases not included in our training dataset, including liver cirrhosis and constipation-dominant IBS datasets (n = 60, see the “Methods” section). We found that, for most subjects (48/60, Source Data, Supplementary Fig. 8d), our RF multi-class model could not make a prediction because the predicted probabilities failed the corresponding thresholds, and these subjects might be categorised as undetermined. The misclassification rate for each phenotype ranged from 0% (0/60; CA, CVD, IBS-D, Obesity) to 5% (3/60; CD, CRC, PACS; Source Data, Supplementary Fig. 9d), suggesting that our model has a high specificity and accuracy for the nine phenotypes within our cohort with a low risk of misclassification for unrelated diseases.

Associations between bacterial features and phenotypes
Next, we correlated the top 50 bacterial species contributing to the model (Source Data, Supplementary Table 4) with different disease phenotypes to identify clues to model interpretability. These top 50 bacterial species achieved a mean AUROC of 0.88 to 0.99 (IQR 0.90–0.93, Source Data, Supplementary Fig. 9a) for different diseases in our test set, and a mean AUROC of 0.67 to 0.90 (IQR 0.78–0.86, Source Data, Supplementary Fig. 9b) in the public dataset. A total of 363 significant associations were found between these 50 species and different disease phenotypes (Hong Kong cohort, FDR < 0.05, Fig. 2). Compared with healthy controls, almost all disease states were associated with a significantly decreased abundance of microbiota from the bacterial phylum Firmicutes or Actinobacteria (FDR < 0.05) and a significant increase in Bacteroidetes (FDR < 0.05). Imbalance in the Firmicutes/Bacteroidetes ratio had previously been reported primarily in patients with obesity and IBD14, but its associations with other diseases have not been reported. Nonetheless, such shared microbial signatures may serve as a basis for distinguishing health and disease. Then, we identified specific microbial signatures that can classify different diseases (Fig. 2). Specifically, the abundance of several bacterial species in Bacteroidetes differed significantly between patients with PACS, UC and CD. Subjects with PACS showed a significant increase in abundance of Bacteroides vulgatus and Bacteroides xylanisolvens, while those with UC were enriched in Bacteroides ovatus, and subjects with CD showed significant decreases in Bacteroides uniformis, Bacteroides vulgatus and Bacteroides xylanisolvens, compared with healthy


Fig. 2 | Microbial species associated with health status or different disease phenotypes. The top 50 microbial species contributing to the random forest multi-class classifier were clustered by taxonomy, and different phenotypes were clustered using hierarchical clustering. Associations were coloured by direction of effect (red, positive; blue, negative; p < 0.05), with associations significant at FDR < 0.05 marked with a plus (positive correlations) or minus (negative correlations), respectively. The nominal significance (p-value) of associations was calculated by MaAsLin 2, and the false discovery rate (FDR) was computed by Benjamini–Hochberg correction. CA colorectal adenomas, CD Crohn’s disease, CRC colorectal cancer, CVD cardiovascular disease, IBS-D diarrhoea-dominant irritable bowel syndrome, PACS post-acute COVID-19 syndrome, UC ulcerative colitis. Source data are provided as a Source Data file.
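
The Fig. 2 caption separates nominal significance (p < 0.05) from significance after Benjamini–Hochberg correction (FDR < 0.05). The sketch below shows the standard Benjamini–Hochberg adjustment in plain Python to make that distinction concrete. The p-values are invented for illustration, and the function is not the MaAsLin 2 implementation.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Invented p-values for five species-phenotype associations.
pvals = [0.001, 0.008, 0.039, 0.041, 0.60]
qvals = benjamini_hochberg(pvals)
significant = [q < 0.05 for q in qvals]
print([round(q, 5) for q in qvals])  # prints: [0.005, 0.02, 0.05125, 0.05125, 0.6]
print(significant)                   # prints: [True, True, False, False, False]
```

In this toy example the third and fourth associations are nominally significant (p < 0.05) but do not pass FDR < 0.05, so in the figure's scheme they would be coloured without a plus or minus mark.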

controls. Although patients with CRC and colorectal adenomas shared relatively similar gut bacteria composition, the abundance of Parvimonas micra was significantly higher in patients with CRC but not colorectal adenomas, compared to healthy controls, which was consistent with previous findings showing that Parvimonas micra can be used as a marker to distinguish CRC from colorectal adenomas15,16. For other diseases, microbiome differences were mainly driven by Actinobacteria. Subjects with obesity showed increases in Actinomyces naeslundii, Actinomyces odontolyticus and Actinomyces oris, and subjects with IBS-D showed increases in Collinsella aerofaciens and Collinsella stercoris. We further correlated bacteria and phenotypes in the assembled public dataset, and found that many disease-specific biomarkers are stable across datasets, such as Bacteroides for UC, Parvimonas micra for CRC and Actinomyces for obesity (Source Data, Supplementary Fig. 9c). Overall, these results suggest that our model can capture various disease-specific microbial signatures, which may explain the robust diagnostic performance of this multi-class classifier.

Discussion
Overall, our data showed that the faecal microbiome-based multi-class model for disease diagnosis is feasible. The novelty lies in the high-quality dataset, and superior and reproducible machine-learning methods which are of high clinical relevance. We believe this multi-class model of classifying diseases has potential clinical applications and can serve as a non-invasive way of screening various diseases in clinical practice or for disease risk assessment. Our results also have implications for the potential development of biomarkers for predicting drug response and common treatment strategies using the


identified shared or specific markers for multiple diseases. This work has some limitations. Firstly, the disease spectrum of this study is still limited, and the inclusion of more phenotypes can further enhance the value of this multi-class diagnostic tool. Secondly, biological evidence to support the identified microbiome–phenotype associations is limited and future work to delineate the mechanisms of these associations is needed to facilitate our understanding of the role of the shared and disease-specific microbiome in disease pathogenesis. Also, the pooled public dataset did not specify co-morbidities and antibiotic use, thus model performance may vary upon the exclusion of these subjects. Since our model predicts probabilities for multiple diseases simultaneously, it may also apply to multi-disease diagnosis in a single patient. Though we could not validate it at this moment, this hypothesis should be tested in the future.

To our knowledge, we present the largest faecal microbiome datasets with different disease phenotypes and developed a machine learning multi-class model that achieved high performance for disease classification. This non-invasive microbiome-based model could potentially be applied clinically to complement disease diagnostics and treatment response monitoring.

Methods
Ethics statement
The study was approved by The Joint Chinese University of Hong Kong – New Territories East Cluster Clinical Research Ethics Committee (The Joint CUHK-NTEC CREC). All subjects provided written informed consent.

Study population
All participants were recruited and diagnosed at the Prince of Wales Hospital in Hong Kong from January 2017 to March 2022. Subjects with CRC and CA were diagnosed by colonoscopy and confirmed on histology examinations; subjects with CD and UC were diagnosed based on standard criteria of endoscopy, radiology, and histological examinations. Subjects with IBS were diagnosed according to the ROME III criteria, and endoscopy and enteroscopy were performed to exclude other GI disorders such as IBD, coeliac disease, parasite infestations, or other organic disorders. Obesity was defined as subjects with a body mass index (BMI) of over 28 and with no other medical co-morbidities. Subjects with cardiovascular disease (CVD) were recruited from the public as part of a survey of cardiovascular health in the Hong Kong general population. Subjects underwent carotid ultrasounds to measure intima-media thickness (IMT) of the common, internal, and external carotid arteries (CCA, ICA and ECA, respectively) and carotid bulbs and subjects that had ≥50% stenosis in a single or multiple vessels were regarded as having the risk of CVD. Subjects with post-acute COVID-19 syndrome (PACS) were defined as those with at least one persistent symptom or long-term complications of SARS-CoV-2 infection beyond 4 weeks from the viral clearance which could not be explained by an alternative diagnosis, and we assessed the presence of the 30 most commonly reported symptoms post-COVID after illness onset13,17 (Source Data, Supplementary Table 5). All subjects with other diseases (apart from the obesity group) had a normal range of BMI of 18.5–22.9. All subjects are on a stable traditional Chinese style diet and are of Han Chinese ethnicity. Patients were excluded if they had the following: age under 18 or over 80; self-reported comorbidities of other diseases; infection with an enteric pathogen; acquired immunodeficiency syndrome; known history of organ dysfunction or failure and abdominal surgery; active malignancy or undergoing radio-chemotherapy; short bowel syndrome; taking drugs commonly known to affect the gut microbiome including proton pump inhibitors, oral anti-diabetics, non-steroidal anti-inflammatory drugs, corticosteroids, laxatives or selective serotonin reuptake inhibitors and antibiotics or probiotics use within three months of sample collection; pregnant or breastfeeding; on special diets such as vegetarians.

Healthy controls were recruited during the same recruitment period from the community through advertisement and from the endoscopy centre at the Prince of Wales Hospital and included subjects who had a normal colonoscopy (faecal samples collected before bowel preparation). The exclusion criteria for healthy controls were known complex infections or sepsis; known history of severe organ failure (including decompensated cirrhosis, malignant disease, kidney failure, epilepsy, active serious infection, acquired immunodeficiency syndrome); bowel surgery in the last 6 months (excluding colonoscopy/procedure related to perianal disease); the presence of an ileostomy/stoma; current pregnancy; any long-term drugs for chronic diseases; the use of antibiotics in the last 3 months; the use of laxatives or anti-diarrhoeal drugs in the last 3 months or recent dietary changes (e.g., becoming vegetarian/vegan). Finally, a total of 2320 subjects were recruited. Clinical metadata and dietary data were collected during clinical interviews. Besides, an additional 60 subjects (mean age 53.5, 48.3% female) were prospectively followed up for up to two years after the COVID-19 infection and were confirmed to have fully recovered from the initial infection without any symptoms of PACS. These subjects served as an independent validation cohort and provided serial faecal samples after SARS-CoV-2 clearance.

Faecal samples
Faecal samples were collected at home by all subjects using tubes prepared by investigators containing preservative media (cat. 63700, Norgen Biotek Corp, Ontario, Canada). The Norgen preservative can preserve and allow safe transportation of microbial DNA & RNA at ambient temperature, eliminating sample variability. The stool sample was sent to the hospital within 24 h of collection and stored at −80 °C refrigerators until further processing. We have previously shown that data on gut microbiota composition generated from faecal samples collected using this preservative medium was comparable to data obtained from fresh samples that were immediately stored at −80 °C18.

Faecal DNA extraction and sequencing
After removing the preservative media, microbial DNA was isolated with the Qiagen (Hilden, Germany) QIAamp DNA Stool Mini Kit, according to the manufacturer’s instructions. After the quality control procedures by Qubit 2.0, agarose gel electrophoresis, and Agilent 2100, extracted DNA was subjected to DNA library construction, completed through the processes of end repairing, adding A to tails, purification and PCR amplification, using Nextera DNA Flex Library Preparation kit (Illumina, San Diego, CA). Libraries were subsequently sequenced on our in-house sequencer Illumina NextSeq 550 (150 base pairs paired-end) at the Center for Microbiota Research, The Chinese University of Hong Kong. All samples were in random order for DNA extraction, library construction and sequencing. ZymoBIOMICS Spike-in Control I (High Microbial Load, Cat: D6320-10, ZYMO Research, USA) and ZymoBIOMICS Microbial Community DNA Standard (Cat: D6306-A) were used as positive controls during DNA extraction, library construction, sequencing and quality assessment.

Microbiome profiling
Raw sequence data were quality filtered using Trimmomatic V.39 to remove the adaptor, low-quality sequences (quality score < 20), and reads shorter than 50 base pairs. Contaminating human reads were filtered using Kneaddata (V.0.10.0, Reference database: GRCh38 p12) with default parameters. Following this, microbiota composition profiles were inferred from quality-filtered forward reads using MetaPhlAn3 version 3.0.14. GNU parallel (v2018) was used for parallel analysis jobs to accelerate data processing. Species whose average abundance and prevalence were <0.15% and <5%, respectively, were filtered out. Alpha diversity metrics (Shannon diversity, Chao1 richness) were calculated by using the phyloseq package (v1.26.0).
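
The abundance and prevalence filter and the Shannon index described under "Microbiome profiling" can be sketched as follows. This is an illustration in plain Python, not the authors' pipeline (they used MetaPhlAn3 and the phyloseq R package). It assumes a species is kept only when its mean relative abundance is at least 0.15% and it is detected in at least 5% of samples, and that the Shannon index uses the natural logarithm. The species names and numbers are hypothetical.

```python
import math


def keep_species(profiles, min_mean_abundance=0.15, min_prevalence=0.05):
    """Return the species passing both filters.

    profiles: {species: [relative abundance in % for each sample]}.
    Assumption: a species must meet BOTH cut-offs to be retained.
    """
    kept = []
    for species, abundances in profiles.items():
        mean_abundance = sum(abundances) / len(abundances)
        # Prevalence = fraction of samples in which the species is detected.
        prevalence = sum(1 for a in abundances if a > 0) / len(abundances)
        if mean_abundance >= min_mean_abundance and prevalence >= min_prevalence:
            kept.append(species)
    return kept


def shannon(abundances):
    """Shannon diversity H = -sum(p * ln p) over the non-zero proportions."""
    total = sum(abundances)
    proportions = [a / total for a in abundances if a > 0]
    return -sum(p * math.log(p) for p in proportions)


# Hypothetical relative abundances (%) of three species in four samples.
profiles = {
    "species_a": [50.0, 60.0, 40.0, 55.0],
    "species_b": [0.0, 0.1, 0.0, 0.2],   # mean 0.075% -> filtered out
    "species_c": [50.0, 39.9, 60.0, 44.8],
}
kept = keep_species(profiles)
print(kept)                            # prints: ['species_a', 'species_c']
print(round(shannon([50.0, 50.0]), 4)) # prints: 0.6931
```

A sample split evenly between two species has H = ln 2, the maximum for two species; uneven communities score lower.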


Microbiome analysis proportions across folds). The optimal models selected based on
All statistical analyses were done using R version 4.0.3. The ggpubr cross-validated results were evaluated in the withheld evaluation
package (https://github.com/kassambara/ggpubr) performed non- dataset as the final performance for predicting different diseases. This
parametric statistical testing between groups and accounted for mul- process was repeated 20 times to obtain a distribution of random
tiple hypothesis testing corrections when necessary. Principal forest prediction evaluations on the validation set, and the mean
coordinates analysis (PCoA) based on beta-diversity (Bray–Curtis dis- AUROC and AUPR value was calculated accordingly for the visualisa-
tance matrix calculated using relative abundances of microbial spe- tion of results. The highly ranked and frequently selected microbial
cies) was used to visualise the clustering of samples based on their features were considered predictive signatures for further interpreta-
species-level compositional profiles. The microbiome composition tion. We retrieved prediction performance using the same training
differences between different phenotypes were calculated by permu- datasets.
tational multivariate analysis of variance (PERMANOVA) using distance
matrices (adonis) in the adonis function of vegan R package V.2.5–7 Model evaluation
with 999 permutations. Associations of specific microbial species with We included AUROC to characterise the model performance as our
phenotypes were identified using the multivariate analysis by linear models initially provided outputs of probabilities for each disease
models (MaAsLin2) statistical frameworks implemented in the Hut- phenotype, and these predicted probabilities were then used to
tenhower Lab Galaxy instance (http://huttenhower.sph.harvard.edu/ estimate the risk of disease occurrence or absence, which formed a
galaxy/) with healthy controls as reference. The linear model also binary status that was analysed to provide an AUROC value. The
included age, sex and technical factors (library DNA concentration, AUROC is a widely applied metric that considers the trade-offs
sequencing read depth, sequencing batch) to further correct for between sensitivity and specificity at all possible thresholds for
potential batch effects and confounders. BMI was not included as apart comparing the performance across various classifiers with a baseline
from the obese group, all subjects from other disease groups had a value of 0.5 for a random classifier. AUPR was provided as a com-
normal BMI requirement (18.5–22.9) and there was no difference in the plimentary assessment, which considers the trade-offs between
BMI across different phenotypes. Benjamini–Hochberg correction was precision (or positive predictive value) and recall (or sensitivity) with
used to control for multiple testing, and results were considered sig- a baseline that equals the proportion of positive disease cases in all
nificant at false discovery rate (FDR) < 0.05. samples.

Random Forest binary classifier
To account for sequencing batch effects across samples processed in different periods, we applied ComBat from the sva package (v3.44.0)19 to the relative abundances before machine learning model development. Binary sub-cohorts were composed of two phenotypes drawn from the entire cohort. A total of 36 binary sub-cohorts were generated, covering all pairwise comparisons of the nine phenotypes. The binary classifiers were random forests built with the scikit-learn20 library under Python 3.6.7, as this algorithm has been shown to outperform, on average, other learning tools for microbiota data5. The normalised abundance table from each binary sub-cohort was used to train the model. Machine learning models were first trained on the randomly selected training set (70%, 20-times repeated, fivefold-stratified cross-validation) and then applied to the withheld validation set (30%) to assess the final performance. This process was repeated 20 times to obtain a distribution of random forest prediction evaluations on the validation set, and the mean AUROC value was calculated accordingly for the visualisation and reporting of results.

Machine learning for diagnosis of multi-diseases
Multi-class models were implemented in Python 3.6.7 using standard libraries that are publicly available: pandas (0.23.4), numpy (1.14.5), scikit-learn (1.1), and matplotlib (2.2.3). For each phenotype, samples were randomly divided into a training set (70% of samples, total n = 1624) and a test set for independent evaluation (remaining 30%, total n = 696), with class proportions balanced across the cohort, training set and test set. Random forests (RF), K-nearest neighbours (KNN), multi-layer perceptron (MLP) and support vector machine (SVM) were used as classifier models for the diagnosis of different phenotypes, using species-level taxonomic profiles of the faecal microbiome. We implemented the RF multi-class classifier with the following modifications to the default scikit-learn settings: n_estimators = 2000 and class_weight = balanced. KNN, SVM and MLP were implemented from scikit-learn with the default settings. In addition, we reconstructed a graph convolutional neural network (GCN) model based on a published work with the same parameter settings7. A nested cross-validation procedure was applied to calculate within-training set accuracy by splitting data into training and test sets for 20-times repeated, fivefold-stratified cross-validation (balancing class proportions across folds).

Public data download and processing
For the construction of the external dataset, a total of 1,597 raw shotgun faecal metagenomes were acquired from 12 independently published studies across 11 countries (114 for UC21–23, 102 for CD22–24, 218 for CVD25, 177 for CRC26–28, 86 for adenoma27,28, 83 for IBS-D29–31, 81 for obesity32, and 736 corresponding healthy controls). In addition, a total of 60 raw shotgun faecal metagenomes from subjects with liver cirrhosis33 (n = 38, validation cohort from Qin et al.33) or constipation-dominant IBS31 (IBS-C, n = 22) were downloaded to construct an unrelated disease dataset. After downloading, quality filtering and species-level taxonomic profiling were performed as described above.

Statistics and reproducibility
No statistical method was used to predetermine the sample size. Instead, the study aimed to obtain the largest possible sample size to maximise the performance of the machine learning multi-class model. All 2,320 samples were successfully sequenced and passed the quality assessment (read depth > 10 million), so no data were excluded from the analyses. For each phenotype, samples were randomly divided into a training set (70% of samples, total n = 1624) and a test set for independent evaluation (remaining 30%, total n = 696). Within the training set, a nested cross-validation procedure was applied to calculate within-training set accuracy by randomly splitting the data into training and test sets for 20-times repeated, fivefold-stratified cross-validation (balancing class proportions across folds). The optimal models selected based on the cross-validated results were evaluated on the withheld evaluation dataset to give the final performance for predicting different diseases. This process was repeated 20 times to obtain a distribution of random forest prediction evaluations on the validation set, and the mean AUROC and AUPR values were calculated accordingly for the visualisation of results. The study did not include any interventions, so conventional blinding (as used in clinical trials or intervention studies) was not relevant to this study.
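The training and evaluation scheme described in the Methods (a stratified 70/30 split, repeated stratified fivefold cross-validation within the training set, then a single evaluation on the withheld 30%) might look roughly like the sketch below. It is an illustration under stated assumptions and not the authors' code: the data are synthetic (`make_classification`), the number of trees and repeats is reduced for speed, no hyperparameter selection is performed, and the one-vs-rest AUROC scorer is an assumed choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import (
    RepeatedStratifiedKFold,
    cross_val_score,
    train_test_split,
)

# Synthetic stand-in for a species-level abundance table (not the study data).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=8,
    n_classes=3, random_state=0,
)

# 70/30 split with class proportions preserved in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

# The text reports n_estimators = 2000; 100 trees keep this sketch fast.
model = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=0
)

# Repeated, stratified fivefold cross-validation inside the training set.
# The text uses 20 repeats; 2 repeats are used here for speed.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)
cv_auroc = cross_val_score(
    model, X_train, y_train, cv=cv, scoring="roc_auc_ovr"
)

# Refit on the whole training set and score once on the withheld 30%.
model.fit(X_train, y_train)
test_auroc = roc_auc_score(
    y_test, model.predict_proba(X_test), multi_class="ovr"
)
print(f"CV AUROC {np.mean(cv_auroc):.3f}, held-out AUROC {test_auroc:.3f}")
```

Keeping the withheld set out of every cross-validation fold is what allows the final score to serve as an independent estimate; repeating the whole procedure with different random seeds would give the distribution of scores that the text averages.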

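The figure of 36 binary sub-cohorts reported for the random forest binary classifier follows from taking every unordered pair of the nine phenotypes. A short sketch makes the count explicit; the phenotype abbreviations below are illustrative shorthand and not necessarily the labels used in the study's code.

```python
from itertools import combinations

# Nine phenotype labels; the abbreviations are illustrative shorthand.
phenotypes = [
    "CRC", "CA", "CD", "UC", "IBS-D", "Obesity", "CVD", "PACS", "Healthy",
]

# Every unordered pair of phenotypes defines one binary sub-cohort.
pairs = list(combinations(phenotypes, 2))
print(len(pairs))  # 9 * 8 / 2 = 36
```

Each pair could then be used to subset the abundance table to the samples of those two phenotypes before training a binary classifier.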

Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
The raw metagenomes generated in this study have been deposited in the NCBI Sequence Read Archive database under accession code PRJNA841786. The publicly available raw sequencing data were downloaded through the NCBI Sequence Read Archive using the accession numbers retrieved from the cited papers, including DRA006684, DRA008156, ERP008729, ERP005534, ERP023788, ERP021923, PRJEB36140, PRJEB37924, PRJEB33500, PRJNA400072, PRJEB1220, PRJNA429990, PRJEB15371, and PRJEB6337. The reference database GRCh38.p12 was downloaded from https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.38. Source data are provided with this paper.

Code availability
Code and scripts developed in this study are all available at the GitHub repository (https://github.com/qsu123/multi_class_diagnosis)34.

References
1. Lynch, S. V. & Pedersen, O. The human intestinal microbiome in health and disease. N. Engl. J. Med. 375, 2369–2379 (2016).
2. Liang, J. Q. et al. A novel faecal Lachnoclostridium marker for the non-invasive diagnosis of colorectal adenoma and cancer. Gut 69, 1248–1257 (2020).
3. Vila, A. V. et al. Gut microbiota composition and functional changes in inflammatory bowel disease and irritable bowel syndrome. Sci. Transl. Med. 10, eaap8914 (2018).
4. Shaukat, A. & Levin, T. R. Current and future colorectal cancer screening strategies. Nat. Rev. Gastroenterol. Hepatol. 19, 521–531 (2022).
5. Pasolli, E., Truong, D. T., Malik, F., Waldron, L. & Segata, N. Machine learning meta-analysis of large metagenomic datasets: tools and biological insights. PLoS Comput. Biol. 12, e1004977 (2016).
6. Gacesa, R. et al. Environmental factors shaping the gut microbiome in a Dutch population. Nature 604, 732–739 (2022).
7. Khan, S. & Kelly, L. Multiclass disease classification from microbial whole-community metagenomes. Pac. Symp. Biocomput. 25, 55–66 (2020).
8. Gupta, V. K. et al. A predictive index for health status using species-level gut microbiome profiling. Nat. Commun. 11, 4635 (2020).
9. Duvallet, C., Gibbons, S. M., Gurry, T., Irizarry, R. A. & Alm, E. J. Meta-analysis of gut microbiome studies identifies disease-specific and shared responses. Nat. Commun. 8, 1784 (2017).
10. Wyres, K. L., Lam, M. M. C. & Holt, K. E. Population genomics of Klebsiella pneumoniae. Nat. Rev. Microbiol. 18, 344–359 (2020).
11. Nie, K. et al. Roseburia intestinalis: a beneficial gut organism from the discoveries in genus and species. Front. Cell. Infect. Microbiol. 11, 757718 (2021).
12. Grandini, M., Bagli, E. & Visani, G. Metrics for multi-class classification: an overview. Preprint at arXiv:2008.05756 (2020).
13. Liu, Q. et al. Gut microbiota dynamics in a prospective cohort of patients with post-acute COVID-19 syndrome. Gut 71, 544–552 (2022).
14. Stojanov, S., Berlec, A. & Štrukelj, B. The influence of probiotics on the Firmicutes/Bacteroidetes ratio in the treatment of obesity and inflammatory bowel disease. Microorganisms 8, 1715 (2020).
15. Xu, J. et al. Alteration of the abundance of Parvimonas micra in the gut along the adenoma-carcinoma sequence. Oncol. Lett. 20, 106 (2020).
16. Lowenmark, T. et al. Parvimonas micra as a putative non-invasive faecal biomarker for colorectal cancer. Sci. Rep. 10, 15250 (2020).
17. Nalbandian, A. et al. Post-acute COVID-19 syndrome. Nat. Med. 27, 601–615 (2021).
18. Chen, Z. et al. Impact of preservation method and 16S rRNA hypervariable region on gut microbiota profiling. mSystems 4, e00271–00218 (2019).
19. Chen, C. et al. Removing batch effects in analysis of expression microarray data: an evaluation of six batch adjustment methods. PLoS ONE 6, e17238 (2011).
20. Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
21. Franzosa, E. A. et al. Gut microbiome structure and metabolic activity in inflammatory bowel disease. Nat. Microbiol. 4, 293–305 (2019).
22. Nielsen, H. B. et al. Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes. Nat. Biotechnol. 32, 822–828 (2014).
23. Weng, Y. J. et al. Correlation of diet, microbiota and metabolite networks in inflammatory bowel disease. J. Dig. Dis. 20, 447–459 (2019).
24. He, Q. et al. Two distinct metacommunities characterize the gut microbiota in Crohn’s disease patients. Gigascience 6, 1–11 (2017).
25. Jie, Z. et al. The gut microbiome in atherosclerotic cardiovascular disease. Nat. Commun. 8, 845 (2017).
26. Yachida, S. et al. Metagenomic and metabolomic analyses reveal distinct stage-specific phenotypes of the gut microbiota in colorectal cancer. Nat. Med. 25, 968–976 (2019).
27. Feng, Q. et al. Gut microbiome development along the colorectal adenoma-carcinoma sequence. Nat. Commun. 6, 6528 (2015).
28. Zeller, G. et al. Potential of fecal microbiota for early-stage detection of colorectal cancer. Mol. Syst. Biol. 10, 766 (2014).
29. Vervier, K. et al. Two microbiota subtypes identified in irritable bowel syndrome with distinct responses to the low FODMAP diet. Gut 71, 1821–1830 (2022).
30. Goll, R. et al. Effects of fecal microbiota transplantation in subjects with irritable bowel syndrome are mirrored by changes in gut microbiome. Gut Microbes 12, 1794263 (2020).
31. Mars, R. A. T. et al. Longitudinal multi-omics reveals subset-specific mechanisms underlying irritable bowel syndrome. Cell 182, 1460–1473 e1417 (2020).
32. Meslier, V. et al. Mediterranean diet intervention in overweight and obese subjects lowers plasma cholesterol and causes changes in the gut microbiome and metabolome independently of energy intake. Gut 69, 1258–1268 (2020).
33. Qin, N. et al. Alterations of the human gut microbiome in liver cirrhosis. Nature 513, 59–64 (2014).
34. Su, Q. Faecal microbiome-based machine learning for multi-class disease diagnosis. GitHub https://doi.org/10.5281/zenodo.7193183 (2022).

Acknowledgements
We thank Gabriel Lee for manuscript proofreading. We thank Anki Miu, Bonaventure YM Ip, Joyce Wing Yan Mak, Paul KS Chan and other clinical research staff/students for their technical contribution to this study, including clinical data and sample collection, inventory and processing. This research has been conducted using the CU-Med Biobank Resource under Request ID ‘R20221008’. Q.S., Q.L., J.Z., Z.X., Y.K.Y., W.T., L.Z., Y.K.Y., C.L., M.Z., C.P.C., H.M.T., F.K.L.C. and S.C.N. are partially or fully supported by InnoHK, The Government of Hong Kong, Special Administrative Region of the People’s Republic of China. S.C.N. is also supported by the Croucher Senior Medical Research Fellowship. R.I.L. received additional support from the Hong Kong Ph.D. Fellowship Scheme (HKPFS).


Author contributions
Q.S. and Q.L. conceived the study, developed algorithms, ran analyses and took responsibility for the integrity of the data and preparation of the manuscript. R.I.L., J.W.Z., Z.X., Y.K.Y., W.T., L.Z. and J.Q.L. contributed to part of the data analysis. Y.K.Y., J.Y.Z., C.L. and M.Z. contributed to metagenomic sequencing. J.Y.Z. contributed to publicly available data management. J.Y.L.C., T.W.H.L. and C.P.C. contributed to participant recruitment, sample collection and biobank management. H.M.T., J.Y. and F.K.L.C. contributed to the study design and data interpretation. S.C.N. contributed to the study design, data analysis and manuscript writing. All authors gave final approval for the version to be published. All authors agree to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

Competing interests
The Chinese University of Hong Kong has filed a provisional patent application in connection with this work on which S.C.N., F.K.L.C., Q.S. and Q.L. are inventors. F.K.L.C. and S.C.N. are the scientific co-founders and sit on the Board of Directors of GenieBiome Ltd. S.C.N. has served as an advisory board member for Pfizer, Ferring, Janssen, and Abbvie and a speaker for Ferring, Tillotts, Menarini, Janssen, Abbvie, and Takeda. She has received research grants from Olympus, Ferring, and Abbvie. F.K.L.C. has served as an advisor and lecture speaker for Eisai Co. Ltd., AstraZeneca, Pfizer Inc., Takeda Pharmaceutical Co., and Takeda (China) Holdings Co. Ltd. Z.X., W.T. and J.Q.Y.L. are part-time employees of GenieBiome Ltd. All other co-authors have no competing interests.

Additional information
Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s41467-022-34405-3.

Correspondence and requests for materials should be addressed to Siew C. Ng.

Peer review information Nature Communications thanks the anonymous reviewers for their contribution to the peer review of this work. Peer reviewer reports are available.

Reprints and permissions information is available at http://www.nature.com/reprints

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

© The Author(s) 2022