vhsvhs/simgenome
SimGenome README
Victor Hanson-Smith
victorhansonsmith@gmail.com
2012

========
OVERVIEW
SimGenome is a quantitative model of the evolution of gene transcription regulatory circuits. SimGenome incorporates three major aspects of this process: (i) DNA-binding thermodynamics, (ii) non-equilibrium gene expression patterns, and (iii) population structure and evolution.

=================
REQUIRED SOFTWARE
1. Python
2. MPI
3. mpi4py
4. Numpy

========
MPI NOTES
SimGenome uses a master-slave architecture with MPI. The master process is rank 0, and the slave processes are ranks 1 through N-1, where N is the total number of MPI processes.

SimGenome has two running modes: genetic algorithm (GA) and knock-out (KO) analysis.

In GA, at the beginning of each generation, each slave self-identifies a subset of the individuals in the population and calculates the fitness of those individuals. When its fitness calculations are complete, each slave sends a hashtable (where key = individual ID, value = fitness score) to the master. The master merges these hashtables into a single hashtable, then uses the merged fitnesses to selectively mate and reproduce the population. The master also injects mutations into the new children, according to the mutation parameters specified by the user. Finally, the master broadcasts the updated population to all the slaves, and the loop iterates to the next generation.

In KO, the analysis focuses on a single genome, specified using --poppath P and --ko G, where P is the population and G is the ID of the target genome within P. Each slave identifies a subset of the regulatory genes in G. Each slave then calculates the fitness of the wild-type (WT) individual, and the fitness of the individual with each of its assigned regulatory genes knocked out one at a time.

==============
LOAD BALANCING
Keep in mind that the master node only dispatches jobs; the slaves do the heavy lifting (i.e. the fitness calculations).
This means that the best load balancing is achieved when a population of M individuals runs on a computational cluster with N MPI processes, where M is a multiple of N-1 (the number of slaves). For example, with a population size of 128, good values for N include 5, 9, 17, 33, 65, and 129.

=================================
VERBOSITY NOTES (use --verbose N)
N= 1:
    . Prints basic notifications about population bu, loading files, etc.
    . Prints generation stats: time to completion, max/min/mean/median/sd fitness.
N= 2:
    . Pickles the population, writes population.gen.X.pickle.
    . Prints ID fitnesses, elites, and F1 crosses.
    . Writes the R script to plot_expression for each individual for each generation.
N= 3:
N= 4:
    . Invokes the method 'print_configuration', which writes config.gen.X.gid.Y.txt
    . Invokes R to plot_expression.
N= 5:
    . Prints timeslice information to stdout.
N= 10:
    . Plots the R script (function plot_expression) for each individual for each generation.
N= 99:
    . Prints some debugging information. This is very verbose, and not recommended.
N= 100:
    . Prints all debugging information. This is outrageously verbose.

=============================================
NOTES ON GROWTH RATE AND DECAY RATE
The rate at which genes reach maximum expression, and the rate at which their expression decays, are governed by three parameters:
    --growth_rate
    --decay_rate
    --pe_scalar

Genes will be turned on faster if you increase the value of growth_rate and/or pe_scalar.
Genes will decay faster if you increase the value of decay_rate and/or pe_scalar.

It is important to avoid binding "saturation," i.e. the situation where a regulatory region bound at X bits reaches maximum expression at the same rate as a gene bound by X+n bits. In this saturation scenario, there is no evolutionary advantage for the regulatory machinery to evolve additional specificity for the URS. To avoid this problem, turn down pe_scalar.

========================
TROUBLESHOOTING & F.A.Q.
* If SimGenome hangs at the end of a generation (usually the 0th generation), this may indicate that a slave node has crashed.
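
The load-balancing rule from the LOAD BALANCING section (M should be a multiple of N-1) can be checked before submitting a job. The sketch below is not part of SimGenome: the function names slave_share and balanced_process_counts are hypothetical, and the round-robin assignment of individuals to slaves is an assumption, since this README does not say how each slave picks its subset.

```python
# Illustrative sketch only; these helpers are NOT part of SimGenome.
# Assumption: individuals are dealt to slaves round-robin by ID.

def slave_share(rank, n_procs, pop_size):
    """Return the individual IDs that slave `rank` would evaluate.

    Rank 0 is the master and evaluates nothing; slaves are ranks 1..n_procs-1.
    """
    if not 1 <= rank < n_procs:
        raise ValueError("rank must be a slave rank in 1..n_procs-1")
    n_slaves = n_procs - 1
    return list(range(rank - 1, pop_size, n_slaves))


def balanced_process_counts(pop_size, max_procs):
    """Return every process count N <= max_procs for which N-1 divides pop_size."""
    return [n for n in range(2, max_procs + 1) if pop_size % (n - 1) == 0]


if __name__ == "__main__":
    # With 128 individuals and 9 processes, each of the 8 slaves gets 16.
    print([len(slave_share(r, 9, 128)) for r in range(1, 9)])
    # With 10 processes (9 slaves) the shares are uneven: 15s and 14s.
    print([len(slave_share(r, 10, 128)) for r in range(1, 10)])
    print(balanced_process_counts(128, 129))
```

With an unbalanced N, the slaves holding the extra individual finish last, and the remaining slaves sit idle until the master has received every fitness hashtable.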