Scholarly Works (25 results)

Article
Peer Reviewed

Correction to: scBFA: modeling detection patterns to mitigate technical noise in large-scale single-cell genomics data

UC Davis Previously Published Works (2019)

Following publication of the original article [1], the following two errors were found in formulae.

Article
Peer Reviewed

scBFA: modeling detection patterns to mitigate technical noise in large-scale single-cell genomics data

UC Davis Previously Published Works (2019)

Technical variation in feature measurements, such as gene expression and locus accessibility, is a key challenge of large-scale single-cell genomic datasets. We show that this technical variation in both scRNA-seq and scATAC-seq datasets can be mitigated by analyzing feature detection patterns alone and ignoring feature quantification measurements. This result holds when datasets have low detection noise relative to quantification noise. We demonstrate state-of-the-art performance of detection pattern models using our new framework, scBFA, for both cell type identification and trajectory inference. Performance gains can also be realized in one line of R code in existing pipelines.
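As a rough illustration of what "analyzing feature detection patterns alone" could mean in practice, the sketch below converts a count matrix into a binary detected/not-detected matrix and discards the quantification values. It is a minimal Python sketch, not the scBFA model or its R interface; the function names `detection_pattern` and `detection_rate` and the toy counts are hypothetical.

```python
from typing import List


def detection_pattern(counts: List[List[int]]) -> List[List[int]]:
    """Binarize a cells-by-features count matrix: 1 if a feature was detected, else 0."""
    return [[1 if c > 0 else 0 for c in row] for row in counts]


def detection_rate(pattern: List[List[int]]) -> List[float]:
    """Fraction of cells in which each feature is detected."""
    n_cells = len(pattern)
    return [sum(col) / n_cells for col in zip(*pattern)]


# Toy data (hypothetical): 3 cells (rows) by 3 genes (columns).
counts = [
    [0, 3, 12],
    [1, 0, 7],
    [0, 0, 150],
]

# Quantification (3 vs. 150 reads) is dropped; only detection is kept.
pattern = detection_pattern(counts)
rates = detection_rate(pattern)
```

A downstream dimensionality-reduction or clustering step would then take `pattern` as input in place of the raw counts.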

Article
Peer Reviewed

scPair: Boosting single cell multimodal analysis by leveraging implicit feature selection and single cell atlases

UC Davis Previously Published Works (2024)

Multimodal single-cell assays profile multiple sets of features in the same cells and are widely used for identifying and mapping cell states between chromatin and mRNA and linking regulatory elements to target genes. However, the high dimensionality of input features and shallow sequencing depth compared to unimodal assays pose challenges in data analysis. Here we present scPair, a multimodal single-cell data framework that overcomes these challenges by employing an implicit feature selection approach. scPair uses dual encoder-decoder structures trained on paired data to align cell states across modalities and predict features from one modality to another. We demonstrate that scPair outperforms existing methods in accuracy and execution time, and facilitates downstream tasks such as trajectory inference. We further show scPair can augment smaller multimodal datasets with larger unimodal atlases to increase statistical power to identify groups of transcription factors active during different stages of neural differentiation.

Thesis
Peer Reviewed

Representation learning methods developed for single cell genomics analysis

Li, Ruoxin
Advisor(s): Quon, Gerald

UC Davis Electronic Theses and Dissertations (2024)

Advances in high throughput omics technologies allow for assaying an increasing compendium of molecular layers, from genome and epigenome profiling and transcriptomics to proteomics. Such data provide detailed snapshots that characterize the molecular state of a given biological system at very fine resolution. Single cell genomics assays such as scRNA-seq and scATAC-seq specifically capture the landscape of genomic features across large collections of cells and have become one of the most popular molecular profiling techniques for investigating diverse problems related to gene regulation, such as identification of novel cell types and their regulatory signatures, trajectory inference for the analysis of continuous processes such as differentiation, high resolution analysis of transcriptional dynamics, and characterization of transcriptional heterogeneity within populations of cells.

Despite rapidly evolving technologies that can scale up to millions of cells across multiple individuals, one of the most pressing challenges in single cell genomics analysis is to address the amount of technical noise, which can drive approximately 50% of the cell-cell variation in expression measurements. Such technical noise is often associated with high sparsity of the genomic feature measurements. In chapter 2, we focus on alleviating the effect of such technical variation in feature measurements of single cell genomics data, such as gene expression and locus accessibility. We show that this technical variation in both scRNA-seq and scATAC-seq datasets can be mitigated by analyzing feature detection patterns alone and ignoring feature quantification measurements. This result holds when datasets have low detection noise relative to quantification noise. We demonstrate state-of-the-art performance of detection pattern models using our new framework, scBFA, for both cell type identification and trajectory inference.

While single cell genomics assays are inherently high dimensional, the variation among individual cells is often summarized in a low dimensional space reflecting changes in genes' mean expression. Gene co-expression networks, which are often inferred from RNA sequencing data, offer another perspective for studying cell type-specific functional modules and complex regulatory interactions from transcriptomic profiles. The increasing availability of large-scale scRNA-seq datasets is now making it possible to infer many gene networks from diverse cell populations. However, there are no mature tools currently available to visualize and compare large collections of networks across single cell populations, or to identify correlations between variation in gene network structure and cell population-level phenotypes. In chapter 3, we present scMultiAE, an unsupervised framework that enables comparison and visualization of multiple gene networks in a low-dimensional space, with a focus on studying the heterogeneity of iPSCs during differentiation.

Article
Peer Reviewed

scAlign: a tool for alignment, integration, and rare cell identification from scRNA-seq data

UC Davis Previously Published Works (2019)

scRNA-seq dataset integration occurs in different contexts, such as the identification of cell type-specific differences in gene expression across conditions or species, or batch effect correction. We present scAlign, an unsupervised deep learning method for data integration that can incorporate partial, overlapping, or a complete set of cell labels, and estimate per-cell differences in gene expression across datasets. scAlign performance is state-of-the-art and robust to cross-dataset variation in cell type-specific expression and cell type composition. We demonstrate that scAlign reveals gene expression programs for rare populations of malaria parasites. Our framework is widely applicable to integration challenges in other domains.

Thesis
Peer Reviewed

DEEP LEARNING MODELS FOR THE ANALYSIS OF SINGLE CELL GENOMICS

Johansen, Nelson Jamse
Advisor(s): Quon, Gerald

UC Davis Electronic Theses and Dissertations (2022)

Single cell transcriptomic technologies, which capture high dimensional measurements of gene expression in individual cells, have been scaling exponentially in the number of cells that can be sequenced and analyzed simultaneously. Capturing a snapshot of the landscape of possible gene expression measurements from a collection of cells enables researchers to observe the space of molecular variation inherent to specific biological systems, a process termed atlasing. A challenge in building deeply characterized atlases of complex biological systems such as the human brain is the identification and correction of confounding factors that do not relate to the underlying biology but instead arise from technical sources. In this dissertation I present deep learning models applied to single cell genomics that remove unwanted technical variation and contamination and perform novel analyses not previously possible using standard methods. The construction of single cell genomics atlases leverages recent advances in single cell RNA sequencing technologies such as 10X and SmartSeq, which can capture thousands of cells in a single experiment. Sequencing individual cells with different technologies introduces unwanted technical variation (bias) specific to each technology and confounds attempts to merge scRNA-seq experiments into more complete atlases. To address this challenge, we developed scAlign, an scRNA-seq alignment method based on advances in computer vision, to remove the effects of unwanted technical variation on gene expression. scAlign, an unsupervised deep learning method, performs data alignment that can incorporate partial, overlapping, or a complete set of cell labels, and estimates per-cell differences in gene expression across datasets or conditions to characterize specific expression changes due to conditions such as age or disease.

With the recent surge of atlas efforts across complex tissues, conditions, and species, another challenge is how to integrate the deep characterizations of cell state with lower resolution assays of single cell or bulk genomics. Specifically, spatial and multi-omics assays do not collect RNA from a single cell but instead from a spot containing multiple cells or, in the latter case, are contaminated by the unintended collection of additional cells. We developed scProjection to join deeply sequenced atlases with lower resolution genomic assays, to address the unwanted heterogeneity in mixed samples, and to project such samples in a way that recovers the underlying single-cell measurements. scProjection is demonstrated to accurately estimate the abundance of the cell types that compose a mixed RNA sample while simultaneously identifying the gene expression measurements consistent with each cell type in the sample, in order to identify cell type-specific changes due to the spatial location of cells or disease state.

Article
Peer Reviewed

Multiscale and Multimodal Representation Learning for Single-Cell Omics

Hu, Hongru
Advisor(s): Quon, Gerald GQ

UC Davis Electronic Theses and Dissertations (2025)

Understanding how molecular diversity at the single-cell level gives rise to complex, emergent functions and phenotypes, such as developmental progression or disease states, requires computational frameworks that capture cell state specificity, integrate diverse data modalities, bridge resolution gaps, prioritize key cellular programs, and incorporate prior biological knowledge to uncover underlying gene signatures. This dissertation presents a suite of deep learning models designed to meet these challenges in a multiscale and multimodal fashion, enabling interpretable and scalable analysis of single-cell data across complex biological systems.

At the foundation of single-cell profiling lies cell type specificity. Chapter 2 introduces scProjection, a method for resolving cell type-specific signals from mixed or partially observed transcriptomic profiles. By projecting bulk or low-resolution profiles onto high-quality single-cell atlases, scProjection provides cell state-specific gene expression projections and imputes missing genes using learned gene-gene covariation structures through a deep generative model.

Expanding on this, Chapter 3 presents scPair, a framework for enhanced cell state identification using information from multiple molecular modalities. scPair addresses the limitations of shallow multimodal assays by aligning chromatin accessibility and transcriptomic features via dual encoder-decoder architectures with implicit feature selection. This improves cross-modal translation, enables augmentation with larger unimodal atlases, and enhances statistical power for discovering transient or rare cell states. scPair reveals cross-modality relationships and uncovers gene regulatory programs, including key transcription factors active during transitional states.

Chapter 4 transitions from cell-level resolution to the sample level with bioPointNet, a deep multiple instance learning (MIL) model that represents each biological sample as an unordered set of cell instances. By applying attention-based aggregation, bioPointNet predicts emergent phenotypes without relying on cell type annotations and identifies the most informative cell subpopulations predictive of phenotype. This enables interpretable phenotype associations and supports alignment of samples from different sources along developmental or disease trajectories.

Finally, Chapter 5 introduces sciLaMA, a framework for integrating prior biological knowledge into single-cell analysis. By incorporating gene embeddings derived from large language models (LLMs) into a paired variational autoencoder (VAE) structure, sciLaMA learns joint representations of genes and cells, which facilitates the discovery of biologically meaningful gene modules and the identification of key markers driving specific cell states.

Together, these methods establish a set of tools for multiscale and multimodal single-cell analysis, supporting integrative data modeling, interpretable inference, and mechanistic insight into the cellular basis of phenotypic variation and gene network discovery.
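The attention-based aggregation over an unordered set of cells described for Chapter 4 can be sketched generically as follows. This is a simplified illustration of attention pooling in multiple instance learning, assuming a single linear scoring vector `w`. It is not the bioPointNet architecture, and the function name `attention_pool` and the toy values are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def attention_pool(
    cells: Sequence[Sequence[float]], w: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Pool a set of cell embeddings into one sample embedding.

    Each cell gets a score w . h_i; a softmax over the scores gives
    per-cell attention weights; the sample embedding is the weighted sum.
    """
    scores = [sum(wj * xj for wj, xj in zip(w, cell)) for cell in cells]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(cells[0])
    pooled = [sum(a * cell[j] for a, cell in zip(weights, cells)) for j in range(dim)]
    return pooled, weights


# Toy sample (hypothetical): three cells with 2-dimensional embeddings.
cells = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
w = [0.5, -0.5]  # in a trained model this scoring vector would be learned

pooled, weights = attention_pool(cells, w)
# Shuffling the cells leaves the pooled vector unchanged (set semantics).
pooled_rev, _ = attention_pool(list(reversed(cells)), w)
```

The attention weights indicate which cells contribute most to the sample-level representation, which is the kind of signal that could be used to highlight informative subpopulations.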

Article
Peer Reviewed

Microbiome-based classification models for fresh produce safety and quality evaluation.

UC Davis Previously Published Works (2024)

Small sample sizes and loss of sequencing reads during microbiome data preprocessing can limit the statistical power to differentiate fresh produce phenotypes and prevent the detection of important bacterial species associated with produce contamination or quality reduction. Here, we explored a machine learning-based k-mer hash analysis strategy to identify DNA signatures predictive of produce safety (PS) and produce quality (PQ) and compared it against the amplicon sequence variant (ASV) strategy, which uses a typical denoising step, and the ASV-based taxonomy strategy. Random forest-based classifiers for PS and PQ using 7-mer hash data sets had significantly higher classification accuracy than those using the ASV data sets. We also demonstrated that the proposed combination of integrating multiple data sets and leveraging a 7-mer hash strategy leads to better classification performance for PS and PQ compared to the ASV method but presents lower PS classification accuracy compared to the feature-selected ASV-based taxonomy strategy. Due to the current limitation of generating taxonomy using the 7-mer hash strategy, the ASV-based taxonomy strategy, with remarkably less computing time and memory usage, is more efficient for PS and PQ classification and applicable for identifying important taxa. Results generated from this study lay the foundation for future studies that need to incorporate and/or compare different microbiome sequencing data sets for the application of machine learning in the area of microbial safety and quality of food.

IMPORTANCE: Identification of generalizable indicators for produce safety (PS) and produce quality (PQ) improves the detection of produce contamination and quality decline. However, the loss of effective sequencing reads during microbiome data preprocessing and the limited sample size of individual studies restrain the statistical power to identify important features contributing to differentiating PS and PQ phenotypes. We applied machine learning-based models using individual and integrated k-mer hash and amplicon sequence variant (ASV) data sets for PS and PQ classification, evaluated their classification performance, and found that random forest (RF)-based models using integrated 7-mer hash data sets achieved significantly higher PS and PQ classification accuracy. Due to the limitation of taxonomic analysis for the 7-mer hash, we also developed RF-based models using feature-selected ASV-based taxonomic data sets, which achieved better PS classification than those using the integrated 7-mer hash data set. The RF feature selection method identified 480 PS indicators and 263 PQ indicators with a positive contribution to the PS and PQ classification.
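To make the idea of a 7-mer hash feature table concrete, the sketch below turns the reads of one sample into a fixed-length count vector by hashing each 7-mer into a bin. It is a minimal illustration under stated assumptions: the abstract does not describe the study's hashing procedure, so the use of `zlib.crc32`, the bin count `n_bins`, the handling of ambiguous bases, and the function name `kmer_hash_features` are all hypothetical choices.

```python
import zlib
from typing import Iterable, List

VALID_BASES = set("ACGT")


def kmer_hash_features(reads: Iterable[str], k: int = 7, n_bins: int = 1024) -> List[int]:
    """Count k-mers across reads, hashing each k-mer into one of n_bins buckets."""
    features = [0] * n_bins
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Assumption: skip k-mers containing ambiguous bases such as N.
            if set(kmer) <= VALID_BASES:
                # crc32 is deterministic across runs, unlike Python's built-in hash().
                features[zlib.crc32(kmer.encode("ascii")) % n_bins] += 1
    return features


# Toy reads (hypothetical) standing in for one sample's sequencing reads.
sample_reads = ["ACGTACGTAC", "ACGTNCGTACGT"]
vec = kmer_hash_features(sample_reads)
```

One such vector per sample, stacked into a matrix, is the kind of input a random forest classifier could be trained on. Because these features are not tied to taxa, they do not directly support taxonomic interpretation, in line with the limitation noted in the abstract.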

Article
Peer Reviewed

Projecting RNA measurements onto single cell atlases to extract cell type-specific expression profiles using scProjection

UC Davis Previously Published Works (2023)

Multi-modal single cell RNA assays capture RNA content as well as other data modalities, such as spatial cell position or the electrophysiological properties of cells. Compared to dedicated scRNA-seq assays, however, they may unintentionally capture RNA from multiple adjacent cells, exhibit lower RNA sequencing depth, or lack genome-wide RNA measurements. We present scProjection, a method for mapping individual multi-modal RNA measurements to deeply sequenced scRNA-seq atlases to extract cell type-specific, single cell gene expression profiles. We demonstrate several use cases of scProjection, including identifying spatial motifs from spatial transcriptome assays, distinguishing RNA contributions from neighboring cells in both spatial and multi-modal single cell assays, and imputing expression measurements of unmeasured genes from gene markers. scProjection therefore combines the advantages of both multi-modal and scRNA-seq assays to yield precise multi-modal measurements of single cells.

Article
Peer Reviewed

siVAE: interpretable deep generative models for single-cell transcriptomes

UC Davis Previously Published Works (2023)

Neural networks such as variational autoencoders (VAE) perform dimensionality reduction for the visualization and analysis of genomic data, but are limited in their interpretability: it is unknown which data features are represented by each embedding dimension. We present siVAE, a VAE that is interpretable by design, thereby enhancing downstream analysis tasks. Through interpretation, siVAE also identifies gene modules and hubs without explicit gene network inference. We use siVAE to identify gene modules whose connectivity is associated with diverse phenotypes such as iPSC neuronal differentiation efficiency and dementia, showcasing the wide applicability of interpretable generative models for genomic data analysis.
