Understanding how molecular diversity at the single-cell level gives rise to complex, emergent functions and phenotypes, such as developmental progression or disease states, requires computational frameworks that capture cell state specificity, integrate diverse data modalities, bridge resolution gaps, prioritize key cellular programs, and incorporate prior biological knowledge to uncover underlying gene signatures. This dissertation presents a suite of deep learning models designed to meet these challenges in a multiscale and multimodal fashion, enabling interpretable and scalable analysis of single-cell data across complex biological systems.

At the foundation of single-cell profiling lies cell type specificity. Chapter 2 introduces scProjection, a method for resolving cell type-specific signals from mixed or partially observed transcriptomic profiles. By projecting bulk or low-resolution profiles onto high-quality single-cell atlases, scProjection provides cell state-specific gene expression projections and imputes missing genes using learned gene-gene covariation structures through a deep generative model.

Expanding on this, Chapter 3 presents scPair, a framework for enhanced cell state identification using information from multiple molecular modalities. scPair addresses the limitations of shallow multimodal assays by aligning chromatin accessibility and transcriptomic features via dual encoder-decoder architectures with implicit feature selection. This improves cross-modal translation, enables augmentation with larger unimodal atlases, and enhances statistical power for discovering transient or rare cell states. scPair reveals cross-modality relationships and uncovers gene regulatory programs, including key transcription factors active during transitional states.

Chapter 4 transitions from cell-level resolution to the sample level with bioPointNet, a deep multiple instance learning (MIL) model that represents each biological sample as an unordered set of cell instances. By applying attention-based aggregation, bioPointNet predicts emergent phenotypes without relying on cell type annotations and identifies the most informative cell subpopulations predictive of phenotype. This enables interpretable phenotype associations and supports alignment of samples from different sources along developmental or disease trajectories.

Finally, Chapter 5 introduces sciLaMA, a framework for integrating prior biological knowledge into single-cell analysis. By incorporating gene embeddings derived from large language models (LLMs) into a paired variational autoencoder (VAE) structure, sciLaMA learns joint representations of genes and cells, which facilitates the discovery of biologically meaningful gene modules and the identification of key markers driving specific cell states.

Together, these methods establish a set of tools for multiscale and multimodal single-cell analysis, supporting integrative data modeling, interpretable inference, and mechanistic insight into the cellular basis of phenotypic variation and gene network discovery.