Changes in the cost and availability of next-generation sequencing are enabling biological data sets to grow to previously impossible sizes. These new data sets include quantitative phenotypes measured for tens to hundreds of thousands of variants at once. This advance allows robust screening for negative phenotypes and quantification of the relative effect of lethal variants. These techniques can be applied at the level of a whole genome or of a specific gene. Screening allows quantitative mapping of the importance of genes for organismal phenotypes, as well as the importance of amino acid residues for gene phenotypes. These results can be leveraged both to build a deeper understanding of the underlying processes and to find the insights needed to engineer these phenotypes. At the same time, quantitative models of biology have been enhancing our ability to predict the behavior of biological systems. These models help us encode our understanding of, and assumptions about, the systems they describe, and their performance can shed light on the validity of those assumptions. Quantitative models also enable first-principles design of biological systems. In this thesis, we apply quantitative screening and biological modeling to understanding the composition and evolution of the carbon dioxide concentrating mechanism (CCM), designing RNA diagnostics, and mapping the fitness landscape of the enzyme dihydrofolate reductase (DHFR).
The CCM is an adaptation that allows carbon fixation to occur efficiently despite the low levels of CO2 and high levels of O2 in the modern atmosphere. The bacterial carboxysome has the potential to improve the efficiency of rubisco in autotrophs in both agricultural and biotechnological contexts. Additionally, the α-CCM from chemotrophs is a particularly interesting case scientifically. Even though it shares many components with the α-CCM from cyanobacteria, it does not contain the same inorganic carbon (Ci) transporters, and its extrashellular components are unknown. Since these chemotrophs are adapted to acidic pH and Ci equilibria are heavily affected by pH, we expect the Ci transporters of the chemotroph α-CCM to have unusual properties. In order to identify the missing components, we developed a quantitative high-throughput CCM mutant screen. This screen identified 17 CCM genes from operons containing 25 total genes. We then identified a potential Ci transporter, confirmed its transport activity, and interrogated its mechanism. The results are consistent with this protein belonging to a new family of Ci transporters that act by converting CO2 into HCO3- in an energy-coupled fashion.
Under atmospheric conditions, the α-CCM requires all of its components to function. This poses a problem for our understanding of its evolution, because evolution can only operate through sequential steps that are each fitness-positive. We set out to characterize the possible evolutionary paths that could have led to the α-CCM in the context of the atmospheric changes that occurred over the same time period. To do this, we first repeated our CCM gene mapping experiment at a variety of CO2 concentrations. This provided a quantitative measure of the essentiality of each component of the CCM as a function of CO2 concentration. We then used CCM reporter strains to evaluate combinations of components at different CO2 concentrations. We mapped out a feasible evolutionary path starting with acquisition of the Ci transporter and the carbonic anhydrase (CA) in either order, followed by encapsulation of the CA and the rubisco.
RNA diagnostics detect specific RNA sequences in complicated mixtures and are useful for identifying the presence of RNA viruses in clinical samples. Most RNA diagnostics require a reverse transcription step and an amplification step, which slow detection and increase the complexity of the assay. We set out to use CRISPR-associated (Cas) nucleases to circumvent these issues. Cas13 is a nuclease that is activated by the presence of RNA sequences complementary to its guide RNA. When activated, it becomes a nonspecific nuclease. This activity allows direct detection of RNA without a reverse transcription step; however, it does not include any signal amplification, which slows the assay and reduces sensitivity. Csm6 is activated by short poly-A tracts containing cyclic phosphates and also becomes a nonspecific nuclease upon activation. In this thesis, we used kinetic enzyme modeling to design an RNA diagnostic based on Cas13 and Csm6. We designed a short RNA that releases a Csm6 activator upon Cas13 or Csm6 cleavage. The model predicted an amplification effect upon including Csm6 and this short RNA in a Cas13 assay. It also identified a dampening effect due to Csm6's cleavage of its own activator. We then constructed these nuclease assays using a fluoro-modified uncleavable activator and demonstrated improved sensitivity and detection time compared to Cas13 alone on both mock and real clinical samples.
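The qualitative behavior described above (amplification by Csm6, and dampening when Csm6 cleaves its own activator) can be illustrated with a toy mass-action model. The following is a minimal sketch under stated assumptions, not the kinetic model developed in this thesis: the function name `simulate_signal`, the rate constants, the concentrations, and the arbitrary units are all hypothetical and chosen only to show the feedback loop.

```python
def simulate_signal(cas13_active, csm6_total, k_deg, t_end=30.0, dt=0.01):
    """Toy Cas13/Csm6 cascade integrated with forward Euler steps.

    All rate constants and concentrations are hypothetical (arbitrary units).
    k_deg is the rate at which active Csm6 destroys its bound activator;
    k_deg = 0 stands in for an uncleavable activator.
    """
    k_cas13 = 1.0  # assumed cleavage rate constant for active Cas13
    k_csm6 = 1.0   # assumed cleavage rate constant for active Csm6
    k_bind = 1.0   # assumed activator/Csm6 binding rate constant

    precursor = 1.0    # short RNA that releases activator when cleaved
    activator = 0.0    # free Csm6 activator
    csm6_active = 0.0  # activator-bound (active) Csm6
    reporter = 1.0     # uncleaved reporter
    signal = 0.0       # cleaved reporter, i.e. the readout

    for _ in range(int(t_end / dt)):
        # Both active nucleases cleave the precursor and the reporter.
        nuclease = k_cas13 * cas13_active + k_csm6 * csm6_active
        release = nuclease * precursor
        binding = k_bind * activator * (csm6_total - csm6_active)
        decay = k_deg * csm6_active  # activator cleavage switches Csm6 off
        cleaved = nuclease * reporter

        precursor -= release * dt
        activator += (release - binding) * dt
        csm6_active += (binding - decay) * dt
        reporter -= cleaved * dt
        signal += cleaved * dt
    return signal


if __name__ == "__main__":
    scenarios = {
        "Cas13 alone": simulate_signal(0.001, 0.0, 0.0),
        "Cas13 + Csm6, cleavable activator": simulate_signal(0.001, 0.1, 0.5),
        "Cas13 + Csm6, uncleavable activator": simulate_signal(0.001, 0.1, 0.0),
    }
    for name, value in scenarios.items():
        print(f"{name}: signal = {value:.3f}")
```

In this sketch, the released activator switches on Csm6, which in turn releases more activator, so the signal grows faster than with Cas13 alone. Setting `k_deg` above zero removes active Csm6 over time and weakens that feedback, which is the dampening effect that motivates an uncleavable activator.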
Being able to map the fitness landscapes of proteins is important both for understanding how those proteins function mechanistically and for engineering new and improved functions. Protein fitness is determined not only by the individual effects of all the single mutations, but also by the ways mutations interact with each other. Some of these interactions are specific dependencies between residues that directly affect each other, a phenomenon called specific epistasis. However, many non-interacting residues also have apparent dependencies. This is because these residues additively determine underlying physical properties of the protein that nonlinearly affect fitness, a phenomenon called general epistasis. In order to map the fitness landscape of a protein, we must account for both factors. In this thesis, we develop methods to map this landscape. To do this, we generate an error-prone mutagenesis library of the protein DHFR and use a high-throughput quantitative screen to assess the function of each mutant. We then test different models for their ability to predict mutant effects and map the fitness landscape. We identify an effective model that combines specific epistasis fit from a multiple sequence alignment with a regression technique that explicitly accounts for general epistasis. Using this model, we successfully predict the effects of mutations, achieving performance that matches our replicate-replicate correspondence.
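General epistasis can be made concrete with a small numerical example. The sketch below assumes, purely for illustration, that fitness is a logistic function of a single additive latent property; the logistic form, the function names, and the numbers are hypothetical and are not the regression model used in this thesis.

```python
import math


def fitness(latent):
    """Assumed nonlinear map from an additive latent property to fitness."""
    return 1.0 / (1.0 + math.exp(-latent))


def apparent_epistasis(wild_type, effect_a, effect_b):
    """Deviation of the double mutant from additivity on the fitness scale.

    The two mutations are strictly additive in the latent property, so any
    nonzero value returned here comes only from the nonlinear fitness map.
    """
    f_wt = fitness(wild_type)
    f_a = fitness(wild_type + effect_a)
    f_b = fitness(wild_type + effect_b)
    f_ab = fitness(wild_type + effect_a + effect_b)
    return (f_ab - f_wt) - ((f_a - f_wt) + (f_b - f_wt))


if __name__ == "__main__":
    # Hypothetical values: a wild type well above the threshold and two
    # mutations that each lower the latent property by the same amount.
    epsilon = apparent_epistasis(wild_type=3.0, effect_a=-1.5, effect_b=-1.5)
    print(f"apparent pairwise epistasis: {epsilon:.3f}")
```

With these values, each single mutant loses little fitness because the wild type sits on the plateau of the curve, while the double mutant loses much more than the sum of the two single effects. The residues never interact directly, yet they appear dependent on the fitness scale, which is why a model has to separate this effect from specific epistasis.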