Reproducible, end-to-end public TCGA ovarian cancer multi-omics pipeline for network modelling, uncertainty-aware benchmarking, external ovarian immune-context validation, and CAR-product benchmarking.
- Preferred manuscript package: manuscript/submission_package/targets/journal_of_biomedical_informatics/
- Current compliant main manuscript: JBI package with required statement-of-significance table and <=8 combined main-manuscript tables/figures
- Public mirrors:
  - GitHub repository: workflow, source code, manuscript assets, and release notes
  - Hugging Face dataset: derived results, public manuscript bundles, and release documentation
  - Kaggle dataset: derived results and notebook-ready public package
Latest refresh highlights:
- JBI submission package rebuilt as the primary submission-grade manuscript bundle
- JBI main manuscript reduced to a compliant set of 7 combined figure/table objects while retaining the significance table
- in-text table and figure callouts corrected to appear sequentially before captions
- JBI reference first-appearance ordering corrected
- graphical abstract rebuilt in a cleaner, journal-facing layout
- CAR motif benchmark documentation clarified so all-zero heuristic motif rows are not misinterpreted as construct absence
- public release package refreshed to keep GitHub, Hugging Face, and Kaggle outputs aligned
- Cohort: TCGA-OV
- Main layers: RNA + CNA + methylation + mutation
- Outcome: survival risk group (median split)
- Models: MOFA2-like latent factors, DIABLO-like supervised components
- Network: multi-layer graph with centrality ranking + in silico perturbation
- Data policy: real public GDC data only (no synthetic fallback)
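The median-split outcome can be illustrated with a minimal sketch. It assumes that survival time is days_to_death for deceased patients and days_to_last_follow_up otherwise, and that patients at or below the cohort median are labelled "high" risk. The function name, the label strings, and the handling of censored patients are hypothetical; the pipeline's actual rule may differ.

```python
from statistics import median


def assign_risk_groups(records):
    """Label each patient 'high' or 'low' risk by a median split on survival time.

    Assumption: survival time is days_to_death when available, otherwise
    days_to_last_follow_up. Survival at or below the median maps to 'high'.
    """
    times = {}
    for rec in records:
        t = rec.get("days_to_death")
        if t is None:
            t = rec.get("days_to_last_follow_up")
        if t is not None:  # patients without any follow-up time are skipped
            times[rec["patient_id"]] = float(t)

    cutoff = median(times.values())
    return {pid: ("high" if t <= cutoff else "low") for pid, t in times.items()}


# Toy records (illustrative values only, not TCGA data)
toy = [
    {"patient_id": "P1", "days_to_death": 300, "days_to_last_follow_up": None},
    {"patient_id": "P2", "days_to_death": None, "days_to_last_follow_up": 1500},
    {"patient_id": "P3", "days_to_death": 900, "days_to_last_follow_up": None},
    {"patient_id": "P4", "days_to_death": None, "days_to_last_follow_up": 2000},
]
groups = assign_risk_groups(toy)
print(groups)
```

Note that this simple split treats censored follow-up times like event times, which is a simplification.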
multiomics-ov-network/
+-- data/
+-- metadata/
+-- notebooks/
+-- scripts/
+-- results/
+-- configs/
+-- environment/
+-- workflow/
+-- manuscript/
+-- Snakefile
+-- Makefile
+-- README.md
- Python: environment/environment-python.yml
- R: environment/environment-r.yml
Create envs:
conda env create -f environment/environment-python.yml
conda env create -f environment/environment-r.yml
Dry run:
snakemake -n --cores 1
Full run:
snakemake --cores 4 --rerun-incomplete
Make targets:
make dryrun
make run
make report
make immune
make car_t

- Download
  - scripts/01_download/01_gdc_download.py
  - scripts/01_download/03_fetch_gdc_files.py
  - scripts/01_download/02_optional_imports.py
  - scripts/01_download/04_prepare_external_sra_manifest.py (optional external validation/CAR benchmark manifest)
- QC / preprocessing
  - scripts/02_qc/01_qc_preprocess.py
- Harmonize IDs
  - scripts/03_harmonize/01_harmonize_ids.py
- Feature engineering
  - scripts/04_features/01_build_features.py
  - scripts/04_features/02_immune_receptor_proxy.py (optional immune-context branch)
- Integration
  - scripts/05_integration/01_run_mofa.R
  - scripts/05_integration/02_run_diablo.R
- Network
  - scripts/06_network/01_build_network.py
- Perturbation
  - scripts/07_perturbation/01_perturbation.py
- Reporting
  - scripts/08_reporting/01_generate_report.py
  - scripts/08_reporting/09_external_validation_and_cart_benchmark.py (optional external ovarian validation + direct CAR FASTQ benchmark)
- Optional CAR-T extension
  - scripts/09_cart/01_build_car_t_assets.py
  - scripts/09_cart/02_screen_car_raw_reads.py
  - workflow/car_t_raw_read_screening.smk
Layer templates are in metadata/manifests/:
- gdc_query_rna.json
- gdc_query_cna.json
- gdc_query_methylation.json
- gdc_query_mutation.json
Each uses:
- cases.project.project_id = TCGA-OV
- layer-specific data_type
- access = open
- output fields: file_id, file_name, data_type, data_format, cases.case_id, cases.submitter_id
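As a rough illustration of what such a query template might contain, the sketch below builds a payload in the filter syntax of the GDC API files endpoint. The helper name, the page size, and the RNA data_type value are assumptions made for this example; the actual JSON layout of the files in metadata/manifests/ may differ.

```python
import json


def build_gdc_query(data_type, project_id="TCGA-OV"):
    """Return a GDC files-endpoint payload for one omics layer (open access only)."""

    def is_in(field, value):
        # GDC filter clause: field must match one of the listed values
        return {"op": "in", "content": {"field": field, "value": [value]}}

    return {
        "filters": {
            "op": "and",
            "content": [
                is_in("cases.project.project_id", project_id),
                is_in("data_type", data_type),
                is_in("access", "open"),
            ],
        },
        "fields": ",".join([
            "file_id", "file_name", "data_type", "data_format",
            "cases.case_id", "cases.submitter_id",
        ]),
        "format": "JSON",
        "size": 100,  # page size; an arbitrary illustrative value
    }


# "Gene Expression Quantification" is assumed here as the RNA-layer data_type.
rna_query = build_gdc_query("Gene Expression Quantification")
print(json.dumps(rna_query, indent=2))
```

The sketch only constructs the payload; it makes no network request.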
- data/interim/*_matrix.parquet
  - required columns: sample_id, patient_id, feature columns
- metadata/sample_maps/master_sample_sheet.csv
  - per-patient layer presence + clinical labels
- data/processed/outcomes.csv
  - includes patient_id, days_to_death, days_to_last_follow_up, vital_status, survival_risk_group
- results/models/mofa_factors.csv
  - patient_id, LF1..LFn
- results/models/diablo_scores.csv
  - patient_id, comp1, comp2, survival_risk_group
- results/networks/network_centrality.csv
  - node, degree, betweenness, pagerank, rank_score
- results/tables/perturbation_delta.csv
  - perturbation impact metrics per hub
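The column contracts above lend themselves to a simple header check. The following is a minimal sketch using only the standard library; the CONTRACTS mapping and the missing_columns helper are hypothetical names, and the repository may enforce its contracts differently.

```python
import csv
import io

# Required columns per file, taken from the contracts listed above.
CONTRACTS = {
    "data/processed/outcomes.csv": [
        "patient_id", "days_to_death", "days_to_last_follow_up",
        "vital_status", "survival_risk_group",
    ],
    "results/models/diablo_scores.csv": [
        "patient_id", "comp1", "comp2", "survival_risk_group",
    ],
    "results/networks/network_centrality.csv": [
        "node", "degree", "betweenness", "pagerank", "rank_score",
    ],
}


def missing_columns(handle, required):
    """Return the required columns absent from a CSV header (empty list if none)."""
    header = next(csv.reader(handle), [])
    return [col for col in required if col not in header]


# Demo on an in-memory CSV header that lacks 'vital_status'.
demo = io.StringIO(
    "patient_id,days_to_death,days_to_last_follow_up,survival_risk_group\n"
)
missing = missing_columns(demo, CONTRACTS["data/processed/outcomes.csv"])
print(missing)
```

In practice the same function could be called on an opened file handle for each path in CONTRACTS.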
- file existence and stage flags
- per-layer dimensions tracked (results/tables/feature_count_summary.csv)
- matched-sample intersection summary (results/tables/sample_matching_summary.csv)
- missingness filtering in QC
- ID harmonization via TCGA barcode truncation rules
- immutable raw data folders under data/raw/
- stage completion flags
- deterministic random seed in config (project.seed)
- workflow-managed dependencies in Snakemake
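The barcode truncation mentioned in the list above can be sketched as follows. The rule shown is an assumption based on the standard TCGA barcode layout (the first three dash-separated fields identify the patient, and the first two characters of the fourth field give the sample type); the exact truncation used by scripts/03_harmonize/01_harmonize_ids.py is not documented here and may differ.

```python
def truncate_barcode(barcode):
    """Split a TCGA aliquot barcode into patient- and sample-level IDs.

    Assumed rule: patient_id = first three dash-separated fields (12 chars),
    sample_id = first four fields with the vial letter dropped (15 chars).
    """
    parts = barcode.strip().upper().split("-")
    if len(parts) < 4 or parts[0] != "TCGA":
        raise ValueError(f"not a TCGA sample-level barcode: {barcode!r}")
    patient_id = "-".join(parts[:3])
    sample_id = "-".join(parts[:3] + [parts[3][:2]])
    return patient_id, sample_id


# Illustrative barcode in the standard format
patient, sample = truncate_barcode("TCGA-04-1331-01A-01R-1569-13")
print(patient, sample)
```

Truncating every layer to the same level before joining is what makes the matched-sample intersection well defined.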
Tables:
- results/tables/sample_matching_summary.csv
- results/tables/feature_count_summary.csv
- results/tables/perturbation_delta.csv
- results/tables/sensitivity_perturb_fraction_grid.csv
- results/tables/sensitivity_hub_slope_summary.csv
- results/tables/model_benchmark.csv
- results/tables/model_benchmark_protein_matched.csv
- results/tables/pca_summary.csv
- results/tables/advanced_ml_benchmark.csv
- results/tables/input_output_ablation_auc.csv
- results/tables/permutation_test_auc.csv
- results/tables/causal_pathway_strength_summary.csv
Models:
- results/models/mofa_factors.csv
- results/models/diablo_scores.csv
Network:
- results/networks/multilayer_network_edges.csv
- results/networks/network_centrality.csv
- results/networks/network_centrality_stability.csv
- results/networks/dag_pathways.csv
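To make the idea behind the network outputs concrete, here is a toy sketch of hub ranking followed by an in silico knockout. It is not the pipeline's method: it ranks nodes by degree only (the real centrality table also reports betweenness, pagerank, and rank_score), and it uses the loss in largest-connected-component size as a stand-in perturbation metric. All node names and function names are hypothetical.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def largest_component_size(adj, removed=frozenset()):
    """Size of the largest connected component, ignoring removed nodes."""
    seen, best = set(removed), 0
    for start in adj:
        if start in seen:
            continue
        stack, size = [start], 0
        seen.add(start)
        while stack:  # iterative depth-first traversal
            node = stack.pop()
            size += 1
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        best = max(best, size)
    return best


def knockout_delta(edges, top_n=2):
    """Knock out each of the top_n degree hubs and report the component-size loss."""
    adj = build_adjacency(edges)
    baseline = largest_component_size(adj)
    hubs = sorted(adj, key=lambda n: (-len(adj[n]), n))[:top_n]
    return {
        hub: baseline - largest_component_size(adj, removed={hub})
        for hub in hubs
    }


# Toy multi-layer edges (layer:feature labels are made up)
toy_edges = [
    ("cna:G1", "mut:G3"), ("cna:G1", "meth:G2"), ("cna:G1", "rna:G1"),
    ("meth:G2", "rna:G1"), ("rna:G1", "rna:G4"), ("rna:G4", "mut:G5"),
]
deltas = knockout_delta(toy_edges)
print(deltas)
```

A larger delta means that removing the hub fragments the graph more strongly.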
Figures:
- results/figures/mofa_factors.png
- results/figures/diablo_components.png
- results/figures/survival_km.png
- results/figures/model_benchmark_auc_ci.png
- results/figures/model_benchmark_protein_matched_auc_ci.png
- results/figures/model_benchmark_cox_cindex_ci.png
- results/figures/model_benchmark_protein_matched_cox_cindex_ci.png
- results/figures/perturbation_bootstrap_ci.png
- results/figures/multilayer_network_graph.png
- results/figures/dag_pathway_graph.png
- results/figures/sensitivity_perturbation_curves.png
- results/figures/advanced_ml_benchmark_auc_ci.png
- results/figures/input_output_ablation_top_auc.png
Reports:
- results/reports/final_report.html
- manuscript/manuscript_skeleton.md
- results/reports/immune_receptor_proxy.md
- results/reports/car_t_architecture_summary.md
- results/reports/car_t_raw_read_screening_plan.md
Optional extension outputs:
- results/tables/car_t_architecture_metadata.csv
- results/tables/car_t_raw_read_inventory.csv
- results/tables/external_sra_manifest.csv
- results/tables/external_cart_dataset_candidates.csv
- results/tables/external_validation_file_inventory.csv
- results/tables/external_ovarian_immune_scores.csv
- results/tables/external_ovarian_immune_summary.csv
- results/tables/cart_direct_benchmark_qc.csv
- results/tables/immune_receptor_proxy_scores.csv
- results/tables/immune_receptor_proxy_summary.csv
- results/figures/immune_receptor_proxy_heatmap.png
- results/figures/immune_receptor_proxy_by_risk.png
- results/figures/external_ovarian_immune_scores.png
- results/figures/external_ovarian_immune_heatmap.png
- results/reports/external_sra_manifest.md
- results/reports/gsm4877937_suitability.md
- results/reports/external_validation_and_cart_benchmark.md
- scripts/02_qc/01_qc_preprocess.py is strict and parses only downloaded real GDC files; it fails fast when required layers are missing.
- Optional cBioPortal/PDC imports are enabled via data/raw/cbioportal and data/raw/pdc.
- The CAR-T extension is metadata-first. Direct CAR/transgene screening requires BAM/CRAM/FASTQ plus a validated custom reference FASTA.
- The immune-receptor branch provides expression-level proxy scores only. It is not a clonotype reconstruction workflow.
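The fail-fast behaviour described in these notes could look roughly like the sketch below. It assumes one subdirectory per layer under the raw data root, named rna, cna, methylation, and mutation; those directory names and the function name are hypothetical and not taken from the repository.

```python
import tempfile
from pathlib import Path

# Assumed layer directory names; the real layout under data/raw/ may differ.
REQUIRED_LAYERS = ("rna", "cna", "methylation", "mutation")


def check_required_layers(raw_root, layers=REQUIRED_LAYERS):
    """Raise FileNotFoundError if any required layer has no downloaded files."""
    raw_root = Path(raw_root)
    missing = []
    for layer in layers:
        layer_dir = raw_root / layer
        has_files = layer_dir.is_dir() and any(
            p.is_file() for p in layer_dir.rglob("*")
        )
        if not has_files:
            missing.append(layer)
    if missing:
        raise FileNotFoundError(f"missing raw GDC layers: {', '.join(missing)}")


# Demo in a temporary directory with only the RNA layer present.
with tempfile.TemporaryDirectory() as tmp:
    rna_dir = Path(tmp) / "rna"
    rna_dir.mkdir()
    (rna_dir / "example.tsv").write_text("placeholder\n")
    try:
        check_required_layers(tmp)
    except FileNotFoundError as err:
        message = str(err)
        print(message)
```

Raising before any parsing starts keeps a partial download from silently producing a reduced cohort.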
- CI pipeline: .github/workflows/ci.yml
  - Python syntax smoke checks
  - R script parse checks
- Release pipeline: .github/workflows/release.yml
  - Packages results/ + manuscript/ as versioned artifacts on tag (v*)
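A Python syntax smoke check of the kind listed for the CI pipeline can be approximated with the standard library alone. The sketch below parses every .py file under a directory with ast.parse and collects failures; it is one possible approach, and the actual steps in ci.yml are not shown in this README.

```python
import ast
import tempfile
from pathlib import Path


def syntax_errors(root):
    """Parse every .py file under root; return {relative path: message} for failures."""
    root = Path(root)
    failures = {}
    for path in sorted(root.rglob("*.py")):
        try:
            ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        except SyntaxError as err:
            failures[str(path.relative_to(root))] = f"line {err.lineno}: {err.msg}"
    return failures


# Demo: one valid file and one file with a deliberate syntax error
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "ok.py").write_text("x = 1\n")
    (Path(tmp) / "broken.py").write_text("def f(:\n    pass\n")
    failures = syntax_errors(tmp)
    print(failures)
```

A CI job could call syntax_errors("scripts") and exit with a non-zero status when the returned mapping is not empty.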
- Kaggle package folder: public_release/kaggle_dataset/
- Hugging Face package folder: public_release/hf_dataset/
- Both contain derived outputs only (no raw GDC redistribution).
- Landing page: PUBLIC_RELEASE_INDEX.md
- CAR-related public assets are scaffold-only and metadata-first.
- references/car_t/ contains:
  - README.md
  - reference_panel_manifest_template.csv
  - public_car_panel.placeholder.txt
- The workflow can benchmark approved external reference panels when supplied later, but no engineered construct FASTA is bundled in this repository.
- Current readiness outputs are in:
  - results/tables/cart_reference_alignment_readiness.csv
  - results/reports/cart_reference_alignment_plan.md
  - results/reports/car_t_public_panel_scaffold.md
- Notebook entry points:
  - notebooks/tcga_ov_car_panel_scaffold.ipynb
  - notebooks/tcga_ov_host_alignment_car_benchmark.ipynb
Suggested release message for the current public package:
TCGA-OV multi-omics public release refresh:
- JBI manuscript package added
- CAR benchmark workflow refreshed
- metadata-only CAR panel scaffold added for approved future reference validation
- WSL-backed bwa/samtools/minimap2 readiness audited
- duplicate stale public-release copies cleaned