Given the ever-growing volume and variety of biomedical data, principled analyses of these rich datasets offer an exciting opportunity to accelerate the scientific discovery process. Here, we advance our goal of extracting reliable scientific hypotheses from such data through (I) the in-context development of interpretable statistical machine learning methods, (II) the demonstration of responsible data science in practice, and (III) the dissemination of open-source software and data for reliable data science.
Throughout this dissertation, we build heavily upon the Predictability, Computability, and Stability (PCS) framework and documentation for veridical (trustworthy) data science (Yu and Kumbier, 2020) to improve the reliability of our scientific conclusions. This framework advocates for the use of predictability as a reality check, computability as an important consideration in algorithmic design and data collection, and stability as a minimum requirement for reproducibility and interpretability in knowledge-seeking and decision-making. Moreover, it calls for transparent documentation of decisions made throughout the data science pipeline.
In Part I, we highlight two statistical machine learning methods, developed within the context of grounded biomedical problems and guided by the PCS framework. First, in Chapter 2, we investigate genetic and epistatic drivers of cardiac hypertrophy in the hope of obtaining a more complete understanding of the disease architecture. To this end, we develop a data-driven recommendation system, named the low-signal signed iterative random forest (lo-siRF), to identify candidate genes and gene-gene interactions that are both predictive and stable across various model and data perturbations. We then phenotypically validate these genes and gene-gene interactions via gene-silencing experiments and investigate potential mechanistic explanations for the demonstrated epistases. This leads to the hypothesis that the identified genes interact by mediating the variable binding of transcription factors that are essential for cardiac contractile function and metabolism. Second, the practical utility of random forests and interpretability tools, not only in the search for epistasis but in a wide range of scientific problems, motivates the need for reliable tree-based feature importance measures. In Chapter 3, we demonstrate that the mean decrease in impurity (MDI), arguably the most popular random forest feature importance measure, suffers from well-known biases, including biases against highly correlated and low-entropy features. To overcome these drawbacks, we develop a novel feature importance framework, MDI+, which leverages a connection between MDI and the R-squared value from linear regression. We show that MDI+ improves the reliability and stability of feature importance rankings across an extensive range of data-inspired simulations and two real-data case studies on drug response prediction and breast cancer subtype prediction.
In Part II, we further expand on the theme of reliable data science and demonstrate it in practice through two collaborative projects in cancer -omics. In Chapters 4 and 5, we incorporate principles from the PCS framework while working in close collaboration with scientists and clinicians to identify stable and predictive biomarkers in drug response prediction and the early detection of pancreatic cancer, respectively.
Finally, in Part III, we introduce open-source software and data to promote and facilitate the broader adoption of reliable, transparent data science for statisticians and substantive researchers. In particular, we highlight three tools that support our goals: (1) simChef, an R package to simplify the creation of tidy, high-quality simulation studies (Chapter 6); (2) vdocs, an interactive virtual lab notebook in R to seamlessly implement, document, and justify human judgment calls throughout the data science pipeline in accordance with the PCS framework (Chapter 7); and (3) a COVID-19 data repository that aided community-wide data science efforts during the height of the pandemic (Chapter 8).