Introduction To NCBI Resources
BIOINFORMATICS
Over the past few decades, major advances in the field of molecular biology, coupled with advances in genomic technologies, have led to an explosive growth in the biological information generated by the scientific community. This deluge of genomic information has, in turn, led to an absolute requirement for computerized databases to store, organize, and index the data and for specialized tools to view and analyze the data.
The completion of a "working draft" of the human genome--an important milestone in the Human Genome Project--was announced in June 2000 at a press conference at the White House and was published in the February 15, 2001 issue of the journal Nature.
The data in GenBank are made available in a variety of ways, each tailored to a particular use, such as data submission or sequence searching.
What Is Bioinformatics?
Bioinformatics is the field of science in which biology, computer science, and information technology merge to form a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned. At the beginning of the "genomic revolution", a bioinformatics concern was the creation and maintenance of a database to store biological information, such as nucleotide and amino acid sequences. Development of this type of database involved not only design issues but also the development of complex interfaces whereby researchers could both access existing data and submit new or revised data.
Biology in the 21st century is being transformed from a purely lab-based science to an information science as well.
Ultimately, however, all of this information must be combined to form a comprehensive picture of normal cellular activities so that researchers may study how these activities are altered in different disease states. Therefore, the field of bioinformatics has evolved such that the most pressing task now involves the analysis and interpretation of various types of data, including nucleotide and amino acid sequences, protein domains, and protein structures. The actual process of analyzing and interpreting data is referred to as computational biology. Important subdisciplines within bioinformatics and computational biology include:
the development and implementation of tools that enable efficient access to, and use and management of, various types of information
the development of new algorithms (step-by-step computational procedures) and statistics with which to assess relationships among members of large data sets, such as methods to locate a gene within a sequence, predict protein structure and/or function, and cluster protein sequences into families of related sequences
Although a human disease may not be found in exactly the same form in animals, there may be sufficient data for an animal model that allow researchers to make inferences about the process in humans.
Evolutionary Biology
New insight into the molecular basis of a disease may come from investigating the function of homologs of a disease gene in model organisms. In this case, homology refers to two genes sharing a common evolutionary history. Scientists also use the term homology, or homologous, to simply mean similar, regardless of the evolutionary relationship. Equally exciting is the potential for uncovering evolutionary relationships and patterns between different forms of life. With the aid of nucleotide and protein sequences, it should be possible to find the ancestral ties between different organisms. Thus far, experience has taught us that closely related organisms have similar sequences and that more distantly related organisms have more dissimilar sequences. Proteins that show a significant sequence conservation, indicating a clear evolutionary relationship, are said to be from the same protein family. By studying protein folds (distinct protein building blocks) and families, scientists are able to reconstruct the evolutionary relationship between two species and to estimate how long ago two organisms diverged from their last common ancestor.
NCBI's COGs database has been designed to simplify evolutionary studies of complete genomes and to improve functional assignment of individual proteins.
Phylogenetics is the field of biology that deals with identifying and understanding the relationships between the different kinds of life on earth.
Protein Modeling
The process of evolution has resulted in the production of DNA sequences that encode proteins with specific functions. In the absence of a protein structure that has been determined by X-ray crystallography or nuclear magnetic resonance (NMR) spectroscopy, researchers can try to predict the three-dimensional structure using protein or molecular modeling. This method uses experimentally determined protein structures (templates) to predict the structure of another protein that has a similar amino acid sequence (target). Although molecular modeling may not be as accurate at determining a protein's structure as experimental methods, it is still extremely helpful in proposing and testing various biological hypotheses. Molecular modeling also provides a starting point for researchers wishing to confirm a structure through X-ray crystallography and NMR spectroscopy. Because the different genome projects are producing more sequences and because novel protein folds and families are being determined, protein modeling will become an increasingly important tool for scientists working to understand normal and disease-related processes in living organisms.
The Four Steps of Protein Modeling
1. Identify the proteins with known three-dimensional structures that are related to the target sequence
2. Align the related three-dimensional structures with the target sequence and determine those structures that will be used as templates
3. Construct a model for the target sequence based on its alignment with the template structure(s)
4. Evaluate the model against a variety of criteria to determine if it is satisfactory
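The four steps above can be sketched as a toy pipeline. Everything here is invented for illustration: the two-entry "database", the sequences, the 30% cutoff, and the per-position identity score. Real modeling pipelines use alignment tools such as BLAST and far more sophisticated scoring and model-building.

```python
def identity(a, b):
    """Fraction of matching positions between two sequences (toy score)."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

# Step 1: identify known structures related to the target (toy "database").
structures = {"tmpl1": "MKVLAT", "tmpl2": "QQWWEE"}
target = "MKVLST"

# Step 2: align and keep sufficiently similar structures as templates.
templates = {name: seq for name, seq in structures.items()
             if identity(target, seq) >= 0.3}

# Step 3: build the model on the best-matching template (here: just pick it).
best = max(templates, key=lambda n: identity(target, templates[n]))

# Step 4: evaluate the model against a criterion (here: the identity score).
score = identity(target, templates[best])
print(best, round(score, 2))  # prints "tmpl1 0.83"
```

The point of the sketch is the flow, not the scoring: each step consumes the previous step's output, which is why a poor template choice in step 2 propagates into the final evaluation.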
Genome Mapping
Genomic maps serve as a scaffold for orienting sequence information. A few years ago, a researcher wanting to localize a gene, or nucleotide sequence, was forced to manually map the genomic region of interest, a time-consuming and often painstaking process. Today, thanks to new technologies and the influx of sequence data, a number of high-quality, genome-wide maps are available to the scientific community for use in their research. Computerized maps make gene hunting faster, cheaper, and more practical for almost any scientist. In a nutshell, scientists would first use a genetic map to assign a gene to a relatively small area of a chromosome. They would then use a physical map to examine the region of interest close up, to determine a gene's precise location. In light of these advances, a researcher's burden has shifted from mapping a genome or genomic region of interest to navigating a vast number of Web sites and databases.
Using Map Viewer, researchers can find answers to questions such as: Where does a particular gene exist within an organism's genome? Which genes are located on a particular chromosome and in what order? What is the corresponding sequence data for a gene that exists in a particular chromosomal region? What is the distance between two genes?
The rapidly emerging field of bioinformatics promises to lead to advances in understanding basic biological processes and, in turn, advances in the diagnosis, treatment, and prevention of many genetic diseases. Bioinformatics has transformed the discipline of biology from a purely lab-based science to an information science as well. Increasingly, biological studies begin with a scientist conducting vast numbers of database and Web site searches to formulate specific hypotheses or to design large-scale experiments. The implications behind this change, for both science and medicine, are staggering.
GENOME MAPPING: A GUIDE TO THE GENETIC HIGHWAY WE CALL THE HUMAN GENOME
Imagine you're in a car driving down the highway to visit an old friend who has just moved to Los Angeles. Your favorite tunes are playing on the radio, and you haven't a care in the world. You stop to check your maps and realize that all you have are interstate highway maps, not a single street map of the area. How will you ever find your friend's house? It's going to be difficult, but eventually, you may stumble across the right house.
This scenario is similar to the situation facing scientists searching for a specific gene somewhere within the vast human genome. They have available to them two broad categories of maps: genetic maps and physical maps. Both genetic and physical maps provide the likely order of items along a chromosome. However, a genetic map, like an interstate highway map, provides an indirect estimate of the distance between two items and is limited to ordering certain items. One could say that genetic maps serve to guide a scientist toward a gene, just like an interstate map guides a driver from city to city. On the other hand, physical maps mark an estimate of the true distance, in measurements called base pairs, between items of interest. To continue our analogy, physical maps would then be similar to street maps, where the distance between two sites of interest may be defined more precisely in terms of city blocks or street addresses. Physical maps, therefore, allow a scientist to more easily home in on the location of a gene. An appreciation of how each of these maps is constructed may be helpful in understanding how scientists use these maps to traverse that genetic highway commonly referred to as the "human genome".
Genetic maps serve to guide a scientist toward a gene, just like an interstate map guides a driver from city to city. Physical maps are more similar to street maps and allow a scientist to more easily home in on a gene's location.
From Linkage Analysis to Genetic Mapping
Early geneticists recognized that genes are located on chromosomes and believed that each individual chromosome was inherited as an intact unit. They hypothesized that if two genes were located on the same chromosome, they were physically linked together and were inherited together. We now know that this is not always the case. Studies conducted around 1910 demonstrated that very few pairs of genes displayed complete linkage. Pairs of genes were either inherited independently or displayed partial linkage; that is, they were inherited together sometimes, but not always. During meiosis, the process whereby gametes (eggs and sperm) are produced, two copies of each chromosome pair become physically close. The chromosome arms can then undergo breakage and exchange segments of DNA, a process referred to as recombination or crossing-over. If recombination occurs, each chromosome found in the gamete will consist of a "mixture" of material from both members of the chromosome pair. Thus, recombination events directly affect the inheritance pattern of those genes involved.
It is the behavior of chromosomes during meiosis that determines whether two genes will remain linked.
Because one cannot physically see crossover events, it is difficult to determine with any degree of certainty how many crossovers have actually occurred. But, using the phenomenon of co-segregation of alleles of nearby markers, researchers can reverse-engineer meiosis and identify markers that lie close to each other. Then, using a statistical technique called genetic linkage analysis, researchers can infer a likely crossover pattern, and from that an order of the markers involved. Researchers can also infer an estimate for the probability that a recombination occurs between each pair of markers.
An allele is one of the variant forms of a DNA sequence at a particular locus, or location, on a chromosome. Co-segregation of alleles refers to the movement of each marker during meiosis. If a marker tends to "travel" with the disease status, the marker and the disease gene are said to co-segregate.
If recombination occurs as a random event, then two markers that are close together should be separated less frequently than two markers that are more distant from one another. The recombination probability between two markers, which can range from 0 to 0.5, increases monotonically as the distance between the two markers increases along a chromosome. Therefore, the recombination probability may be used as a surrogate for ordering genetic markers along a chromosome. If you then determine the recombination frequencies for different pairs of markers, you can construct a map of their relative positions on the chromosome.
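As a toy illustration of that last step, the sketch below orders three hypothetical markers, A, B, and C, by finding the arrangement that minimizes total map length, since closer markers recombine less often. The recombination fractions are made-up values, not real data, and a real linkage analysis works from pedigree likelihoods rather than exhaustive search.

```python
from itertools import permutations

# Hypothetical pairwise recombination fractions between markers A, B, C,
# as might be estimated from pedigree data (illustration values only).
theta = {
    frozenset("AB"): 0.08,
    frozenset("BC"): 0.05,
    frozenset("AC"): 0.12,
}

def map_length(order):
    """Sum of adjacent recombination fractions for a candidate marker order."""
    return sum(theta[frozenset(pair)] for pair in zip(order, order[1:]))

# The most likely order minimizes total map length; with only three
# markers we can simply try every ordering.
best = min(permutations("ABC"), key=map_length)
print("".join(best))  # prints "ABC" (equivalent to its mirror, "CBA")
```

Note that an ordering and its mirror image have the same map length; linkage analysis alone cannot distinguish them, which is one reason maps are anchored to physical landmarks.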
Monotonic functions are functions that tend to move only in one direction.
Linkage maps can tell you where markers are in relation to each other on the chromosome, but the actual "mileage" between those markers may not be so well defined.
But alas, predicting recombination is not so simple. Although crossovers are random, they are not uniformly distributed across the genome or any chromosome. Some chromosomal regions, called recombination hotspots, are more likely to be involved in crossovers than other regions of a chromosome. This means that genetic map distance does not always indicate physical distance between markers. Despite these qualifications, linkage analysis usually correctly deduces marker order, and distance estimates are sufficient to generate genetic maps that can serve as a valuable framework for genome sequencing.
In humans, data for calculating recombination frequencies are obtained by examining the genetic makeup of the members of successive generations of existing families, termed human pedigree analysis. Linkage studies begin by obtaining blood samples from a group of related individuals. For relatively rare diseases, scientists find a few large families that have many cases of the disease and obtain samples from as many family members as possible. For more common diseases where the pattern of disease inheritance is unclear, scientists will identify a large number of affected families and will take samples from four to thirty close relatives. DNA is then harvested from all of the blood samples and screened for the presence, or co-inheritance, of two markers. One marker is usually the gene of interest, generally associated with a physically identifiable characteristic. The other is usually one of the various detectable rearrangements mentioned earlier, such as a microsatellite. A computerized analysis is then performed to determine whether the two markers are linked and approximately how far apart those markers are from one another. In this case, the value of the genetic map is that an inherited disease can be located on the map by following the inheritance of a DNA marker present in affected individuals but absent in unaffected individuals, although the molecular basis of the disease may not yet be understood, nor the gene(s) responsible identified.

In humans, genetic diseases are frequently used as gene markers, with the disease state being one allele and the healthy state the second allele.
Genetic Maps as a Framework for Physical Map Construction Genetic maps are also used to generate the essential backbone, or scaffold, needed for the creation of more detailed human genome maps. These detailed maps, called physical maps, further define the DNA sequence between genetic markers and are essential to the rapid identification of genes.
Types of Physical Maps and What They Measure
Physical maps can be divided into three general types: chromosomal or cytogenetic maps, radiation hybrid (RH) maps, and sequence maps. The different types of maps vary in their degree of resolution, that is, the ability to measure the separation of elements that are close together. The higher the resolution, the better the picture. The lowest-resolution physical map is the chromosomal or cytogenetic map, which is based on the distinctive banding patterns observed by light microscopy of stained chromosomes. As with genetic linkage mapping, chromosomal mapping can be used to locate genetic markers defined by traits observable only in whole organisms. Because chromosomal maps are based on estimates of physical distance, they are considered to be physical maps. Yet, the number of base pairs within a band can only be estimated. RH maps and sequence maps, on the other hand, are more detailed. RH maps are similar to linkage maps in that they show estimates of distance between genetic and physical markers, but that is where the similarity ends. RH maps are able to provide more precise information regarding the distance between markers than can a linkage map. The physical map that provides the most detail is the sequence map. Sequence maps show genetic markers, as well as the sequence between the markers, measured in base pairs.
RH mapping, like linkage mapping, shows an estimated distance between genetic markers. But, rather than relying on natural recombination to separate two markers, scientists use breaks induced by radiation to determine the distance between two markers. In RH mapping, a scientist exposes DNA to measured doses of radiation, and in doing so, controls the average distance between breaks in a chromosome. By varying the degree of radiation exposure to the DNA, a scientist can induce breaks between two markers that are very close together. The ability to separate closely linked markers allows scientists to produce more detailed maps. RH mapping provides a way to localize almost any genetic marker, as well as other genomic fragments, to a defined map position, and RH maps are extremely useful for ordering markers in regions where highly polymorphic genetic markers are scarce.
Polymorphic refers to the existence of two or more forms of the same gene, or genetic marker, with each form being too common in a population to be merely attributable to a new mutation. Polymorphism is a useful genetic marker because it enables researchers to sometimes distinguish which allele was inherited.
Scientists also use RH maps as a bridge between linkage maps and sequence maps. In doing so, they have been able to more easily identify the location(s) of genes involved in diseases such as spinal muscular atrophy and hyperekplexia, more commonly known as "startle disease".
Sequence Mapping
Sequence tagged site (STS) mapping is another physical mapping technique. An STS is a short DNA sequence that has been shown to be unique. To qualify as an STS, the exact location and order of the bases of the sequence must be known, and this sequence must occur only once in the chromosome being studied or in the genome as a whole if the DNA fragment set covers the entire genome.
To map a set of STSs, a collection of overlapping DNA fragments from a chromosome is digested into smaller fragments using restriction enzymes, agents that cut up DNA molecules at defined target points. The data from which the map will be derived are then obtained by noting which fragments contain which STSs. To accomplish this, scientists copy the DNA fragments using a process known as "molecular cloning". Cloning involves the use of a special technology, called recombinant DNA technology, to copy DNA fragments inside a foreign host. First, the fragments are united with a carrier, also called a vector. After introduction into a suitable host, the DNA fragments can then be reproduced along with the host cell DNA, providing unlimited material for experimental study. An unordered set of cloned DNA fragments is called a library. Next, the clones, or copies, are assembled in the order they would be found in the original chromosome by determining which clones contain overlapping DNA fragments. This assembly of overlapping clones is called a clone contig. Once the order of the clones in a chromosome is known, the clones are placed in frozen storage, and the information about the order of the clones is stored in a computer, providing a valuable resource that may be used for further studies. These data are then used as the base material for generating a lengthy, continuous DNA sequence, and the STSs serve to anchor the sequence onto a physical map.
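The contig-building idea can be sketched with a toy example: each clone is represented by the set of STSs it was shown to contain, and overlapping clones are chained greedily by shared STS content. The clone names and STS sets below are hypothetical, and real assembly must cope with the mapping errors discussed later, so treat this as a sketch of the principle only.

```python
# Hypothetical STS content for four clones: each clone maps to the set
# of STS markers its insert was shown to contain.
clones = {
    "c1": {"sts1", "sts2"},
    "c2": {"sts2", "sts3", "sts4"},
    "c3": {"sts4", "sts5"},
    "c4": {"sts5", "sts6"},
}

def order_clones(clones, start):
    """Greedily chain clones by shared STS content to form a contig."""
    order, remaining = [start], set(clones) - {start}
    while remaining:
        last = clones[order[-1]]
        # Pick the unplaced clone sharing the most STSs with the last one.
        nxt = max(remaining, key=lambda c: len(clones[c] & last))
        if not clones[nxt] & last:
            break  # no overlap left: the contig ends here (a coverage gap)
        order.append(nxt)
        remaining.remove(nxt)
    return order

print(order_clones(clones, "c1"))  # prints "['c1', 'c2', 'c3', 'c4']"
```

The shared STSs act exactly as the text describes: they are the evidence that two clones overlap, and the chain of overlaps recovers the clones' order along the chromosome.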
The Need to Integrate Physical and Genetic Maps
As with most complex techniques, STS-based mapping has its limitations. In addition to gaps in clone coverage, DNA fragments may become lost or mistakenly mapped to a wrong position. These errors may occur for a variety of reasons. A DNA fragment may break, resulting in an STS that maps to a different position. DNA fragments may also get deleted from a clone during the replication process, resulting in the absence of an STS that should be present. Sometimes a clone composed of DNA fragments from two distinct genomic regions is replicated, leading to DNA segments that are widely separated in the genome being mistakenly mapped to adjacent positions. Lastly, a DNA fragment may become contaminated with host genetic material, once again leading to an STS that will map to the wrong location. To help overcome these problems, as well as to improve overall mapping accuracy, researchers have begun comparing and integrating STS-based physical maps with genetic, RH, and cytogenetic maps. Cross-referencing different genomic maps enhances the utility of a given map, confirms STS order, and helps order and orient evolving contigs.
NCBI and Map Integration
Comparing the many available genetic and physical maps can be a time-consuming step, especially when trying to pinpoint the location of a new gene. Without the use of computers and special software designed to align the various maps, matching a sequence to a region of a chromosome that corresponds to the gene location would be very difficult. It would be like trying to compare 20 different interstate and street maps to get from a house in Ukiah, California, to a house in Beaver Dam, Wisconsin. You could compare the maps yourself and create your own travel itinerary, but it would probably take a long time. Wouldn't it be easier and faster to have the automobile club create an integrated map for you? That is the goal behind NCBI's Human Genome Map Viewer.
NCBI's Map Viewer: A Tool for Integrating Genetic and Physical Maps
The NCBI Map Viewer provides a graphical display of the available human genome sequence data as well as sequence, cytogenetic, genetic linkage, and RH maps. Map Viewer can simultaneously display up to seven maps, selected from a large set of maps, and allows the user access to detailed information for a selected map region. Map Viewer uses a common sequence numbering system to align sequence maps and shared markers as well as gene names to align other maps. You can use NCBI's Map Viewer to search for a gene in a number of genomes, by choosing an organism from the Map Viewer home page.
information on how you can use Map Viewer
descriptions of the Map Viewer layout
step-by-step information on using the Map Viewer
shortcuts for getting to where you need to go
Proteins can be divided into two general classes based on their tertiary structure. Fibrous proteins have elongated structures, with the polypeptide chains arranged in long strands. This class of proteins serves as major structural components of cells, and therefore their role tends to be static in providing a structural framework. Globular proteins have more compact, often irregular structures. This class of proteins includes most enzymes and most of the proteins involved in gene expression and regulation.
Allosteric Proteins
Under certain conditions, a protein may have a stable alternate conformation, or shape, that enables it to carry out a different biological function. Proteins that exhibit this characteristic are called allosteric. The interaction of an allosteric protein with a specific cofactor, or with another protein, may influence the transition of the protein between shapes. In addition, any change in conformation brought about by an interaction at one site may lead to an alteration in the structure, and thus function, at another site. One should bear in mind, though, that this type of transition affects only the protein's shape, not the primary amino acid sequence. Allosteric proteins play an important role in both metabolic and genetic regulation.
Allosteric proteins can change their shape and function depending on the environmental conditions in which they are found.
X-ray Crystallography
Crystals are a solid form of a substance in which the component molecules are present in an ordered array called a lattice. The basic building block of a crystal is called a unit cell. Each unit cell contains exactly one unique set of the crystal's components, the smallest possible set that is fully representative of the crystal. Crystals of a complex molecule, like a protein, produce a complex pattern of X-ray diffraction, or scattering of X-rays. When the crystal is placed in an X-ray beam, all of the unit cells present the same face to the beam; therefore, many molecules are in the same orientation with respect to the incoming X-rays. The X-ray beam enters the crystal and a number of smaller beams emerge: each one in a different direction, each one with a different intensity. If an X-ray detector, such as a piece of film, is placed on the opposite side of the crystal from the X-ray source, each diffracted ray, called a reflection, will produce a spot on the film. However, because only a few reflections can be detected with any one orientation of the crystal, an important component of any X-ray diffraction instrument is a device for accurately setting and changing the orientation of the crystal. The set of diffracted, emerging beams contains information about the underlying crystal structure. If we could use light instead of X-rays, we could set up a system of lenses to recombine the beams emerging from the crystal and thus bring into focus an enlarged image of the unit cell and the molecules therein. But the molecules do not diffract visible light, and X-rays, unlike light, cannot be focused with lenses. However, the scientific laws that lenses obey are well understood, and it is possible to calculate the molecular image with a computer.

When performing this technique, the molecule under study must first be crystallized, and the crystals must be singular and of perfect quality, a time-consuming and difficult task.
In effect, the computer mimics the action of a lens. The major drawback associated with this technique is that crystallization of the proteins is a difficult task. Crystals are formed by slowly precipitating proteins under conditions that maintain their native conformation or structure. These exact conditions can only be discovered by repeated trials that entail varying certain experimental conditions, one at a time. This is a very time-consuming and tedious process. In some cases, the task of crystallizing a protein borders on the impossible.
The basic phenomenon of NMR spectroscopy was discovered in 1945. In this technique, a sample is immersed in a magnetic field and bombarded with radio waves. These radio waves encourage the nuclei of the molecule to resonate, or spin. As the positively charged nucleus spins, the moving charge creates what is called a magnetic moment. The thermal motion of the molecule, the movement of the molecule associated with the temperature of the material, further creates a torque, or twisting force, that makes the magnetic moment "wobble" like a child's top. When the radio waves hit the spinning nuclei, they tilt even more, sometimes flipping over. These resonating nuclei emit a unique signal that is then picked up on a special radio receiver and translated using a decoder. This decoder is the Fourier Transform, a mathematical operation that translates the language of the nuclei into something a scientist can understand. By measuring the frequencies at which different nuclei flip, scientists can determine molecular structure, as well as many other interesting properties of the molecule. In the past 10 years, NMR has proven to be a powerful alternative to X-ray crystallography for the determination of molecular structure. NMR has the advantage over crystallographic techniques in that experiments are performed in solution as opposed to a crystal lattice. However, the principles that make NMR possible tend to make this technique very time-consuming and limit the application to small- and medium-sized molecules.

Solution NMR is performed on a solution of macromolecules while the molecules tumble and vibrate with thermal motion.
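The decoding step can be illustrated with a toy signal: a plain discrete Fourier transform (the operation that the FFT computes efficiently) recovers the dominant frequency of a simulated decaying oscillation. The signal, its frequency, and the decay rate are invented illustration values, not real NMR data, and real spectra involve many overlapping resonances.

```python
import cmath
import math

# Simulate a single resonating nucleus as a decaying oscillation:
# 40 cycles across 256 samples, fading with an exponential envelope.
N, freq = 256, 40
signal = [math.cos(2 * math.pi * freq * n / N) * math.exp(-3 * n / N)
          for n in range(N)]

def dft(x):
    """Plain discrete Fourier transform (what the FFT computes efficiently)."""
    n_pts = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_pts)
                for n in range(n_pts))
            for k in range(n_pts)]

spectrum = dft(signal)
# The largest peak in the first half of the spectrum marks the
# resonance frequency hidden in the time-domain signal.
peak = max(range(N // 2), key=lambda k: abs(spectrum[k]))
print(peak)  # prints "40"
```

The decay broadens the spectral peak but does not move its center, which is why the transform still reads off the resonance frequency cleanly.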
Folding motifs are independent folding units, or particular structures, that recur in many molecules. Domains are the building blocks of a protein and are considered elementary units of molecular function. Families are groups of proteins that demonstrate sequence homology or have similar sequences. Superfamilies consist of proteins that have similar folding motifs but do not exhibit sequence similarity.
Some Basic Theory It is theorized that proteins that share a similar sequence generally share the same basic structure. Therefore, by experimentally determining the structure for one member of a protein family, called a target, researchers have a model on which to base the structure of other proteins within that family. Moving a step further, by selecting a target from each superfamily, researchers can study the universe of protein folds in a systematic fashion and outline a set of sequences associated with each folding motif. Many of these sequences may not demonstrate a resemblance to one another, but their identification and assignment to a particular fold is essential for predicting future protein structures using homology modeling.
The scientific basis for these theories is that a strong conservation of protein three-dimensional shape across large evolutionary distances, within single species, between species, and in spite of sequence variation, has been demonstrated again and again. Although most scientists choose high-priority structures as their targets, this theory provides the option to choose any one of the proteins within a family as the target, rather than trying to achieve experimental results using a protein that is particularly difficult to work with using crystallographic or NMR techniques.
A computer-generated image of a protein's structure shows the relative locations of most, if not all, of the protein's thousands of atoms. The image also reveals the physical, chemical, and electrical properties of the protein and provides clues about its role in the body.
Steps for Maximizing Results Specific tasks must be carried out to maximize results when determining protein structure using homology modeling. First, protein sequences must be organized in terms of families, preferably in a searchable database, and a target must be selected. Protein families can be identified and organized by comparing protein sequences derived from completely sequenced genomes. Targets may be selected for families that do not exhibit apparent sequence homology to proteins with a known three-dimensional structure. Next, researchers must generate a purified protein for analysis of the chosen target and then experimentally determine the target's structure, either by X-ray crystallography and/or NMR. Target structures determined experimentally may then be further analyzed to evaluate their similarity to other known protein structures and to determine possible evolutionary relationships that are not identifiable from protein sequence alone. The target structure will also serve as a detailed model for determining the structure of other proteins within that family. In favorable cases, just knowing the structure of a particular protein may also provide considerable insight into its possible function.
PDB is supported by funds from the National Science Foundation, the Department of Energy, and two units of the National Institutes of Health: the National Institute of General Medical Sciences and the National Library of Medicine.
MMDB records have value-added information compared to the original PDB entries, including explicit chemical graph information, uniformly derived secondary structure definitions, structure domain information, literature citation matching, and molecule-based assignment of taxonomy to each biologically derived protein or nucleic acid chain.
NCBI has also developed a three-dimensional structure viewer, called Cn3D, for easy interactive visualization of molecular structures from Entrez. Cn3D serves as a visualization tool for sequences and sequence alignments. What sets Cn3D apart from other software is its ability to correlate structure and sequence information. For example, using Cn3D, a scientist can quickly locate the residues in a crystal structure that correspond to known disease mutations or conserved active site residues from a family of sequence homologs, or sequences that share a common ancestor. Cn3D displays structure-structure alignments along with the corresponding structure-based sequence alignments to emphasize those regions within a group of related proteins that are most conserved in
structure and sequence. Cn3D also features custom labeling options, high-quality graphics, and a variety of file exports that together make Cn3D a powerful tool for literature annotation.
PDBeast: Taxonomy in MMDB Taxonomy is the scientific discipline that seeks to catalog and reconstruct the evolutionary history of life on earth. NCBI's Structure Group, in collaboration with NCBI's taxonomists, has undertaken taxonomy annotation for the structure data stored in MMDB. A semi-automated approach has been implemented in which a human expert checks, corrects, and validates automatic taxonomic assignments. The PDBeast software tool was developed by NCBI for this purpose. It pulls text descriptions of "Source Organisms" from either the original PDB entries or user-specified information and looks for matches in the NCBI Taxonomy database to record taxonomy assignments.
Taxonomy provides a vivid picture of the existing organic diversity of the earth. Taxonomy provides much of the information permitting a reconstruction of the phylogeny of life. Taxonomy reveals numerous, interesting evolutionary phenomena. Taxonomy supplies classifications that are of great explanatory value in most branches of biology.
COGs: Phylogenetic Classification of Proteins The database of Clusters of Orthologous Groups of proteins (COGs) represents an attempt at the phylogenetic classification of proteins, a scheme that indicates the evolutionary relationships between organisms from complete genomes. Each COG includes proteins that are thought to be orthologous. Orthologs are genes in different species derived from a common ancestor and carried on through evolution. COGs may be used to detect similarities and differences between species, to identify protein families and predict new protein functions, and to point to potential drug targets in disease-causing species. The database is accompanied by the COGnitor program, which assigns new proteins, typically from newly sequenced genomes, to pre-existing COGs. A Web page containing additional structural and functional information is now associated with each COG. These hyperlinked information pages include: systematic classification of the COG members under the different classification systems; indications of which COG members (if any) have been characterized genetically and biochemically; information on the domain architecture of the proteins constituting the COG and the three-dimensional structure of the domains if known or predictable; a succinct summary of the common structural and functional features of the COG members as well as peculiarities of individual members; and key references.
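The bidirectional-best-hit idea that underlies grouping orthologs can be sketched in a few lines of Python. This is a toy illustration, not the COG construction procedure itself (which requires consistent best hits across at least three genomes); the protein names and similarity scores below are invented stand-ins for BLAST results.

```python
# Minimal sketch of ortholog detection by bidirectional best hits (BBH),
# the core idea behind grouping proteins into orthologous clusters.
# Scores here are made-up similarity values standing in for BLAST bit scores.

def best_hit(query, scores):
    """Return the subject with the highest similarity score for a query."""
    subjects = scores[query]
    return max(subjects, key=subjects.get)

def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where each protein is the other's best hit."""
    pairs = []
    for a in a_vs_b:
        b = best_hit(a, a_vs_b)
        if best_hit(b, b_vs_a) == a:
            pairs.append((a, b))
    return pairs

# Hypothetical all-vs-all similarity scores between two small genomes.
a_vs_b = {"gyrA_ecoli": {"gyrA_bsub": 900, "parC_bsub": 400},
          "recA_ecoli": {"recA_bsub": 850, "gyrA_bsub": 120}}
b_vs_a = {"gyrA_bsub": {"gyrA_ecoli": 890, "recA_ecoli": 110},
          "parC_bsub": {"gyrA_ecoli": 380},
          "recA_bsub": {"recA_ecoli": 840, "gyrA_ecoli": 100}}

print(bidirectional_best_hits(a_vs_b, b_vs_a))
```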
Detecting New Sequence Similarities: BLAST against MMDB Comparison, whether of structural features or protein sequences, lies at the heart of biology. The introduction of BLAST, or the Basic Local Alignment Search Tool, in 1990 made it easier to rapidly scan huge databases for overt homologies, or sequence similarities, and to statistically evaluate the resulting matches. BLAST works by comparing a user's unknown sequence against the database of all known sequences to determine likely matches. Sequence similarities found by BLAST have been critical in several gene discoveries. Hundreds of major sequencing centers and research institutions around the country use this software to transmit a query sequence from their local computer to a BLAST server at the NCBI via the Internet. In a matter of seconds, the BLAST server compares the user's sequence with up to a million known sequences and determines the closest matches. Not all significant homologies are readily and easily detected, however. Some of the most interesting are subtle similarities that do not always rise to statistical significance during a standard BLAST search. Therefore, NCBI has extended the statistical methodology used in the original BLAST to address the problem of detecting weak, yet significant, sequence similarities. PSI-BLAST, or Position-Specific Iterated BLAST, searches sequence databases with a profile constructed using BLAST alignments, from which it then constructs what is called a position-specific score matrix. For protein analysis, the new Pattern Hit Initiated BLAST, or PHI-BLAST, serves to complement the profile-based searching that was previously introduced with PSI-BLAST. PHI-BLAST further incorporates hypotheses as to the biological function of a query sequence and restricts the analysis to a set of protein sequences that is already known to contain a specific pattern or motif.

The journal article describing the original algorithm used in BLAST has since become one of the most frequently cited papers of the decade, with over 10,000 citations.
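A minimal sketch of the position-specific score matrix at the heart of PSI-BLAST, under simplifying assumptions: an ungapped toy alignment, a uniform background frequency, and a simple pseudocount, none of which match PSI-BLAST's actual weighting scheme.

```python
# Sketch of building a position-specific score matrix (PSSM) from a set
# of aligned sequences, the central data structure of PSI-BLAST.
# Real PSI-BLAST uses weighted counts and amino acid background
# frequencies; this toy version uses plain log-odds against a uniform
# background, over a nucleotide alphabet for brevity.

import math

def build_pssm(alignment, alphabet="ACGT", pseudocount=1.0):
    """Per-column log-odds scores from an ungapped alignment."""
    ncols = len(alignment[0])
    nseqs = len(alignment)
    background = 1.0 / len(alphabet)
    pssm = []
    for col in range(ncols):
        column = [seq[col] for seq in alignment]
        scores = {}
        for letter in alphabet:
            freq = (column.count(letter) + pseudocount) / (nseqs + pseudocount * len(alphabet))
            scores[letter] = math.log2(freq / background)
        pssm.append(scores)
    return pssm

def score(seq, pssm):
    """Score a sequence against the profile, position by position."""
    return sum(pssm[i][c] for i, c in enumerate(seq))

pssm = build_pssm(["ACGT", "ACGA", "ACGT"])
# A sequence matching the conserved columns scores higher than a mismatch.
print(score("ACGT", pssm) > score("TTTT", pssm))
```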
BLAST now comes in several varieties in addition to those described above. Specialized BLASTs are also available for human, microbial, and other genomes, as well as for vector contamination, immunoglobulins, and tentative human consensus sequences.
Structure Similarity Searching Using VAST As just noted, a sequence-sequence similarity program provides an alignment of two sets of sequences. A structure-structure similarity program provides a three-dimensional structure superposition. Structure similarity search services are based on the premise that some measure can be computed between two structures to assess their similarities, much the same way a BLAST alignment is predicted. VAST, or the Vector Alignment Search Tool, is a computer algorithm developed at NCBI for use in identifying similar three-dimensional protein structures. VAST is capable of detecting structural similarities between proteins stored in MMDB, even when no sequence similarity is detected.
VAST Search is NCBI's structure-structure similarity search service that compares three-dimensional coordinates of newly determined protein structures to those in the MMDB or PDB databases. VAST Search creates a list of structure neighbors, or related structures, that a user can then browse interactively. VAST Search will retrieve almost all structures with an identical three-dimensional fold, although it may occasionally miss a few structures or report chance similarities.
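The premise that a numeric measure of structural similarity can be computed is easy to illustrate with the simplest such measure, root-mean-square deviation (RMSD) over paired atom coordinates. VAST's actual algorithm, which aligns secondary-structure elements represented as vectors, is considerably more involved; the coordinates below are invented.

```python
# Sketch of the kind of similarity measure a structure-structure search
# rests on: root-mean-square deviation (RMSD) between corresponding atom
# coordinates of two already-superposed structures.

import math

def rmsd(coords_a, coords_b):
    """RMSD over paired (x, y, z) coordinates, same units (e.g. angstroms)."""
    assert len(coords_a) == len(coords_b)
    total = sum((xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
                for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))

# Two hypothetical three-atom backbones that differ only slightly.
a = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
b = [(0.0, 0.0, 0.0), (1.5, 0.1, 0.0), (3.0, 0.0, 0.2)]
print(round(rmsd(a, b), 3))  # small value = similar structures
```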
The detection of structural similarity in the absence of obvious sequence similarity is a powerful tool to study remote homologies and protein evolution.
The Conserved Domain Database The Conserved Domain Database (CDD) is a collection of sequence alignments and profiles representing protein domains conserved in molecular evolution. It includes domains from SMART and Pfam, two popular Web-based tools for studying sequence domains, as well as domains contributed by NCBI researchers. CD-Search, another NCBI search service, can be used to identify conserved domains in a protein query sequence. CD-Search uses RPS-BLAST to compare a query sequence against position-specific score matrices that have been prepared from conserved domain alignments present in CDD. Alignments are also mapped to known three-dimensional structures and can be displayed using Cn3D (see above).
Conserved Domain Architecture Retrieval Tool NCBI's Conserved Domain Architecture Retrieval Tool (CDART) displays the functional domains that make up a protein and lists other proteins with similar domain architectures. CDART determines the domain architecture of a query protein sequence by comparing it to the CDD, a database of conserved domain alignments, using RPS-BLAST. The Conserved Domain Architecture Retrieval Tool then compares the protein's domain architecture to that of other proteins in NCBI's non-redundant sequence database. Related sequences are identified as those proteins that share one or more similar domains. CDART displays these sequences using a graphical summary that depicts the types and locations of domains identified within each sequence. Links to the individual sequences as well as to further information on their domain architectures are also provided. Because protein domains may be considered elementary units of molecular function and proteins related by domain architecture may play similar roles in cellular processes, CDART serves as a useful tool in comparative sequence analysis.
RPS-BLAST is a "reverse" version of PSI-BLAST, which is described above. Both RPS-BLAST and PSI-BLAST use similar methods to derive conserved features of a protein family. However, RPS-BLAST compares a query sequence against a database of profiles prepared from ready-made alignments, whereas PSI-BLAST builds alignments starting from a single protein sequence. The programs also differ in purpose: RPS-BLAST is used to identify conserved domains in a query sequence, whereas PSI-BLAST is used to identify other members of the protein family to which a query sequence belongs.
Application to Biomedicine
Although the information derived from modeling studies is primarily about molecular function, protein structure data also provide a wealth of information on mechanisms linked to the function and the evolutionary history of and relationships between macromolecules. NCBI's goals in adding structure data to its Web site are to make this information easily accessible to the biomedical community worldwide and to facilitate comparative analysis involving three-dimensional structure.
An example of a SNP is the alteration of the DNA segment AAGGTTA to ATGGTTA, where the second "A" in the first snippet is replaced with a "T". To be classified as a SNP, a variation must occur in more than 1 percent of the human population. Because only about 3 to 5 percent of a person's DNA sequence codes for the production of proteins, most SNPs are found outside of "coding sequences". SNPs found within a coding sequence are of particular interest to researchers because they are more likely to alter the biological function of a protein. Because of the recent advances in technology, coupled with the unique ability of these genetic variations to facilitate gene identification, there has been a recent flurry of SNP discovery and detection.
Needles in a Haystack
Finding single nucleotide changes in the human genome seems like a daunting prospect, but over the last 20 years, biomedical researchers have developed a number of techniques that make it possible to do just that. Each technique uses a different method to compare selected regions of a DNA sequence obtained from multiple individuals who share a common trait. In each test, the result shows a physical difference in the DNA samples only when a SNP is detected in one individual and not in the other.
As a result of recent advances in SNPs research, diagnostics for many diseases may improve.
Many common diseases in humans are not caused by a genetic variation within a single gene but are influenced by complex interactions among multiple genes as well as environmental and lifestyle factors. Although both
environmental and lifestyle factors add tremendously to the uncertainty of developing a disease, it is currently difficult to measure and evaluate their overall effect on a disease process. Therefore, we refer here mainly to a person's genetic predisposition, or the potential of an individual to develop a disease based on genes and hereditary factors. Genetic factors may also confer susceptibility or resistance to a disease and determine the severity or progression of disease. Because we do not yet know all of the factors involved in these intricate pathways, researchers have found it difficult to develop screening tests for most diseases and disorders. By studying stretches of DNA that have been found to harbor a SNP associated with a disease trait, researchers may begin to reveal relevant genes associated with a disease. Defining and understanding the role of genetic factors in disease will also allow researchers to better evaluate the role non-genetic factors, such as behavior, diet, lifestyle, and physical activity, have on disease. Because genetic factors also affect a person's response to drug therapy, DNA polymorphisms such as SNPs will be useful in helping researchers determine and understand why individuals differ in their abilities to absorb or clear certain drugs, as well as to determine why an individual may experience an adverse side effect to a particular drug. Therefore, the recent discovery of SNPs promises to revolutionize not only the process of disease detection but the practice of preventative and curative medicine.
Each person's genetic material contains a unique SNP pattern that is made up of many different genetic variations. Researchers have found that most SNPs are not responsible for a disease state. Instead, they serve as biological markers for pinpointing a disease on the human genome map, because they are usually located near a gene found to be associated with a certain disease. Occasionally, a SNP may actually cause a disease and, therefore, can be used to search for and isolate the
disease-causing gene. To create a genetic test that will screen for a disease in which the disease-causing gene has already been identified, scientists collect blood samples from a group of individuals affected by the disease and analyze their DNA for SNP patterns. Next, researchers compare these patterns to patterns obtained by analyzing the DNA from a group of individuals unaffected by the disease. This type of comparison, called an "association study", can detect differences between the SNP patterns of the two groups, thereby indicating which pattern is most likely associated with the disease-causing gene. Eventually, SNP profiles that are characteristic of a variety of diseases will be established. Then, it will only be a matter of time before physicians can screen individuals for susceptibility to a disease just by analyzing their DNA samples for specific SNP patterns.
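The logic of an association study can be sketched as a comparison of allele frequencies between affected and unaffected groups. The genotypes below are invented, and real studies apply formal statistical tests (such as chi-square tests over many markers) rather than a bare frequency comparison.

```python
# Toy sketch of an association study: compare how often a SNP allele
# appears in affected versus unaffected individuals. Each string is one
# person's pair of alleles at the SNP site (two chromosomes per person).

def allele_frequency(genotypes, allele):
    """Fraction of chromosomes carrying `allele` (two per person)."""
    total = 2 * len(genotypes)
    count = sum(g.count(allele) for g in genotypes)
    return count / total

affected   = ["TT", "AT", "TT", "TT", "AT"]  # disease group (hypothetical)
unaffected = ["AA", "AT", "AA", "AA", "AA"]  # control group (hypothetical)

freq_case    = allele_frequency(affected, "T")
freq_control = allele_frequency(unaffected, "T")
print(freq_case, freq_control)  # the "T" allele is enriched in cases
```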
In the future, the most appropriate drug for an individual could be determined in advance of treatment by analyzing a patient's SNP profile. The ability to target a drug to those individuals most likely to benefit, referred to as "personalized medicine", would allow pharmaceutical companies to bring many more drugs to market and allow doctors to prescribe individualized therapies specific to a patient's needs.
Because SNPs occur frequently throughout the genome and tend to be relatively stable genetically, they serve as excellent biological markers. Biological markers are segments of DNA with an identifiable physical location that can be easily tracked and used for constructing a chromosome map that shows the positions of known genes, or other markers, relative to each other. These maps allow researchers to study and pinpoint traits resulting from the interaction of more than one gene. NCBI plays a major role in facilitating the identification and cataloging of SNPs through its creation and maintenance of the public SNP database (dbSNP). This powerful genetic tool may be accessed by the biomedical community worldwide and is intended to stimulate many areas of biological research, including the identification of the genetic components of disease.
Most SNPs are not responsible for a disease state. Instead, they serve as biological markers for pinpointing a disease on the human genome map.
Figure 1. The NCBI Discovery Space. Records in dbSNP are cross-annotated within other internal information resources such as PubMed, genome project sequences, GenBank records, the Entrez Gene database, and the dbSTS database of sequence tagged sites. Users may query dbSNP directly or start a search in any part of the NCBI discovery space to construct a set of dbSNP records that satisfy their search conditions. Records are also integrated with external information resources through hypertext URLs that dbSNP users can follow to explore the detailed information that is beyond the scope of dbSNP curation. Reproduced with permission from Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K."dbSNP: the NCBI database of genetic variation." Nucleic Acids Research. 2001; 29:308-311.
To facilitate research efforts, NCBI's dbSNP is included in the Entrez retrieval system which provides integrated access to a number of software tools and databases that can aid in SNP analysis. For example, each SNP record in the database links to additional resources within NCBI's "Discovery Space", as noted in Figure 1. Resources include: GenBank, NIH's sequence database; Entrez Gene, a focal point for genes and associated information; dbSTS, NCBI's resource containing sequence and mapping data on short genomic landmarks; human genome sequencing data; and PubMed, NCBI's literature search and retrieval system. SNP records also link to various external allied resources. Providing public access to a site for "one-stop SNP shopping" facilitates scientific research in a variety of fields, ranging from population genetics and evolutionary biology to large-scale disease and drug association studies. The long-term investment in such novel and exciting research promises not only to advance human biology but to revolutionize the practice of modern medicine.
While researchers can now obtain a genomic sequence and identify a complete set of genes, the ultimate goal is to gain an understanding of when, where, and how a gene is turned on, a process commonly referred to as gene expression. Once we begin to understand where and how a gene is expressed under normal circumstances, we can then study what happens in an altered state, such as in disease. To accomplish the latter goal, however, researchers must identify and study the protein, or proteins, coded for by a gene. As one can imagine, finding a gene that codes for a protein, or proteins, is not easy. Traditionally, scientists would start their search by defining a biological problem and developing a strategy for researching the problem. Oftentimes, a search of the scientific literature provided various clues about how to proceed. For example, other laboratories may have published data that established a link between a particular protein and a disease of interest. Researchers would then work to isolate that protein, determine its function, and locate the gene that coded for the protein. Alternatively, scientists could conduct what is referred to as linkage studies to determine the chromosomal location of a particular gene. Once the chromosomal location was determined, scientists would use biochemical methods to isolate the gene and its corresponding protein. Either way, these methods took a great deal of time, years in some cases, and yielded the location and description of only a small percentage of the genes found in the human genome. Now, however, the time required to locate and fully describe a gene is rapidly decreasing, thanks to the development of, and access to, a technology used to generate what are called Expressed Sequence Tags, or ESTs. ESTs provide researchers with a quick and inexpensive route for discovering new genes, for obtaining data on gene expression and regulation, and for constructing genome maps.
Today, researchers using ESTs to study the human genome find themselves riding the crest of a wave of scientific discovery the likes of which has never been seen before.
An Expressed Sequence Tag is a tiny portion of an entire gene that can be used to help identify unknown genes and to map their positions within a genome.
Separating the Wheat from the Chaff: Using mRNA to Generate cDNA Gene identification is very difficult in humans, because most of our genome is composed of introns interspersed with relatively few DNA coding sequences, or genes. These genes are expressed as proteins through a complex process composed of two main steps. First, each gene (DNA) must be converted, or transcribed, into messenger RNA (mRNA), the RNA that serves as a template for protein synthesis. The resulting mRNA then guides the synthesis of a protein through a process called translation. Interestingly, mRNAs in a cell do not contain sequences from the regions between genes, nor from the non-coding introns that are present within many genes. Therefore, isolating mRNA is key to finding expressed genes in the vast expanse of the human genome.
Figure 1. An overview of the process of protein synthesis. Protein synthesis is the process whereby DNA codes for the production of amino acids and proteins. The process is divided into two parts: transcription and translation. During transcription, one strand of a DNA double helix is used as a template by mRNA polymerase to synthesize a mRNA. During this step, mRNA passes through various phases, including one called splicing, where the non-coding sequences are eliminated. In the next step, translation, the mRNA guides the synthesis of the protein by adding amino acids, one by one, as dictated by the DNA and represented by the mRNA.
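The two steps in the figure can be sketched in code. The codon table below is deliberately truncated to the four codons this toy example uses, and transcription is shown from the coding strand (same sequence as the mRNA, with T replaced by U), a common simplification.

```python
# Sketch of transcription and translation. Transcription here copies the
# coding strand into mRNA (T -> U); translation reads the mRNA one codon
# (three bases) at a time until a stop codon. The codon table is a tiny
# illustrative subset of the real 64-entry genetic code.

CODONS = {"AUG": "Met", "GCC": "Ala", "AAA": "Lys", "UAA": "STOP"}

def transcribe(coding_strand):
    """mRNA has the coding strand's sequence with U in place of T."""
    return coding_strand.replace("T", "U")

def translate(mrna):
    """Read codons left to right, stopping at a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino = CODONS[mrna[i:i + 3]]
        if amino == "STOP":
            break
        protein.append(amino)
    return protein

mrna = transcribe("ATGGCCAAATAA")  # a hypothetical spliced gene fragment
print(mrna, translate(mrna))
```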
The problem, however, is that mRNA is very unstable outside of a cell; therefore, scientists use special enzymes to convert it to complementary DNA (cDNA). cDNA is a much more stable compound and, importantly, because it was generated from a mRNA in which the introns have been removed, cDNA represents only expressed DNA sequence.
cDNA is a form of DNA prepared in the laboratory using an enzyme called reverse transcriptase. cDNA production is the reverse of the usual process of transcription in cells because the procedure uses mRNA as a template rather than DNA. Unlike genomic DNA, cDNA contains only expressed DNA sequences, or exons.
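At the sequence level, reverse transcription produces the reverse complement of the mRNA written in DNA letters, which can be sketched as follows. This is a toy fragment; real cDNA synthesis also involves primers and second-strand synthesis.

```python
# Sketch of what reverse transcriptase produces at the sequence level:
# cDNA is the reverse complement of the mRNA, in DNA letters (U pairs
# with A, and the complement of A is written as T in the DNA copy).

def reverse_transcribe(mrna):
    """Return the cDNA strand complementary to an mRNA sequence."""
    pair = {"A": "T", "U": "A", "G": "C", "C": "G"}
    return "".join(pair[base] for base in reversed(mrna))

# A short spliced mRNA fragment (introns already removed by the cell).
print(reverse_transcribe("AUGGCC"))
```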
From cDNAs to ESTs Once cDNA representing an expressed gene has been isolated, scientists can then sequence a few hundred nucleotides from either end of the molecule to create two different kinds of ESTs. Sequencing only the beginning portion of the cDNA produces what is called a 5' EST. A 5' EST is obtained from the portion of a transcript that usually codes for a protein. These regions tend to be conserved across species and do not change much within a gene family. Sequencing the ending portion of the cDNA molecule produces what is called a 3' EST. Because these ESTs are generated from the 3' end of a transcript, they are likely to fall within noncoding, or untranslated, regions (UTRs) and therefore tend to exhibit less cross-species conservation than do coding sequences.

A "gene family" is a group of closely related genes that produces similar protein products.
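The relationship between a cDNA and its two ESTs can be sketched as taking one sequencing read from each end. The read length of 8 bases stands in for the few hundred bases of a real sequencing run, and the cDNA sequence is invented.

```python
# Sketch of how 5' and 3' ESTs relate to a cDNA: each EST is a single
# sequencing read taken from one end of the molecule.

def make_ests(cdna, read_length=8):
    """Return (5' EST, 3' EST) taken from the two ends of a cDNA."""
    five_prime = cdna[:read_length]    # start of the coding region
    three_prime = cdna[-read_length:]  # usually in the untranslated region
    return five_prime, three_prime

cdna = "ATGGCCAAATTTGGGCCCAAATAA"  # hypothetical cDNA
print(make_ests(cdna))
```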
Figure 2. An overview of how ESTs are generated. ESTs are generated by sequencing cDNA, which itself is synthesized from the mRNA molecules in a cell. The mRNAs in a cell are copies of the genes that are being expressed. mRNA does not contain sequences from the regions between genes, nor from the non-coding introns that are present within many genes.
Because ESTs represent a copy of just the interesting part of a genome, that which is expressed, they have proven themselves again and again as powerful tools in the hunt for genes involved in hereditary diseases. ESTs also have a number of practical advantages in that their sequences can be generated rapidly and inexpensively, only one sequencing experiment is needed per each cDNA generated, and they do not have to be checked for sequencing errors because mistakes do not prevent identification of the gene from which the EST was derived.
Using ESTs, scientists have rapidly isolated some of the genes involved in Alzheimer's disease and colon cancer.
To find a disease gene using this approach, scientists first use observable biological clues to identify ESTs that may correspond to disease gene candidates. Scientists then examine the DNA of disease patients for mutations in one or more of these candidate genes to confirm gene identity. Using this method, scientists have already isolated genes involved in Alzheimer's disease, colon cancer, and many other diseases. It is easy to see why ESTs will pave the way to new horizons in genetic research.
dbEST: A Descriptive Catalog of ESTs Scientists at NCBI created dbEST to organize, store, and provide access to the great mass of public EST data that has already accumulated and that continues to grow daily. Using dbEST, a scientist can access not only data on human ESTs but information on ESTs from over 300 other organisms as well. Whenever possible, NCBI scientists annotate the EST record with any known information. For example, if an EST matches a DNA sequence that codes for a known gene with a known function, that gene's name and function are placed on the EST record. Annotating EST records allows scientists to use dbEST as an avenue for gene discovery. By using a database search tool, such as NCBI's BLAST, any interested party can conduct sequence similarity searches against dbEST.
Scientists at NCBI annotate EST records with text information regarding DNA and mRNA homologies.
UniGene: A Non-Redundant Set of Gene-oriented Clusters Because a gene can be expressed as mRNA many, many times, ESTs ultimately derived from this mRNA may be redundant. That is, there may be many identical, or similar, copies of the same EST. Such redundancy and overlap means that when someone searches dbEST for a particular EST, they may retrieve a long list of tags, many of which may represent the same gene. Searching through all of these identical ESTs can be very time consuming. To resolve the redundancy and overlap problem, NCBI investigators developed the UniGene database. UniGene automatically partitions GenBank sequences into a non-redundant set of gene-oriented clusters. Although it is widely recognized that the generation of ESTs constitutes an efficient strategy to identify genes, it is important to acknowledge that despite its advantages, there are several limitations associated with the EST approach. One is that it is very difficult to isolate mRNA from some tissues and cell types. This results in a paucity of data on certain genes that may only be found in these tissues or cell types.
Second is that important gene regulatory sequences may be found within an intron. Because ESTs are small segments of cDNA, generated from a mRNA in which the introns have been removed, much valuable information may be lost by focusing only on cDNA sequencing. Despite these limitations, ESTs continue to be invaluable in characterizing the human genome, as well as the genomes of other organisms. They have enabled the mapping of many genes to chromosomal sites and have also assisted in the discovery of many new genes.
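UniGene's redundancy-reduction idea, grouping overlapping ESTs into one gene-oriented cluster, can be sketched with a toy rule: two ESTs join the same cluster if they share a long exact substring. UniGene's real procedure relies on sequence alignment and mRNA anchoring, not this simplification, and the sequences below are invented.

```python
# Toy sketch of gene-oriented clustering: ESTs that share sequence are
# grouped into one cluster, collapsing redundant tags from the same gene.

def overlaps(a, b, min_len=6):
    """True if the two ESTs share an exact run of at least min_len bases."""
    shorter, longer = sorted((a, b), key=len)
    return any(shorter[i:i + min_len] in longer
               for i in range(len(shorter) - min_len + 1))

def cluster_ests(ests):
    """Greedily assign each EST to the first cluster it overlaps."""
    clusters = []
    for est in ests:
        for cluster in clusters:
            if any(overlaps(est, member) for member in cluster):
                cluster.append(est)
                break
        else:
            clusters.append([est])
    return clusters

ests = ["ATGGCCAAAT", "GCCAAATTTG", "CCCGGGTTTA"]  # first two overlap
print(len(cluster_ests(ests)))  # number of distinct gene clusters
```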
The proper and harmonious expression of a large number of genes is a critical component of normal growth and development and the maintenance of proper health. Disruptions or changes in gene expression are responsible for many diseases.
Enabling Technologies Biomedical research evolves and advances not only through the compilation of knowledge but also through the development of new technologies. Using traditional methods to assay gene expression, researchers were able to survey a relatively small number of genes at a time. The emergence of new tools enables researchers to address previously intractable problems and to uncover novel potential targets for therapies. Microarrays allow scientists to analyze expression of many genes in a single experiment quickly and efficiently. They represent a major methodological advance and illustrate how the advent of new technologies provides powerful tools for researchers. Scientists are using microarray technology to try to understand fundamental aspects of growth and development as well as to explore the underlying genetic causes of many human diseases.
A microarray is a tool for analyzing gene expression that consists of a small membrane or glass slide containing samples of many genes arranged in a regular pattern.
An oligonucleotide, or oligo as it is commonly called, is a short fragment of a single-stranded DNA that is typically 5 to 50 nucleotides long.
4. Detect bound cDNA using laser technology and store data in a computer.
After this hybridization step is complete, a researcher will place the microarray in a "reader" or "scanner" that consists of some lasers, a special microscope, and a camera. The fluorescent tags are excited by the laser, and the microscope and camera work together to create a digital image of the array. These data are then stored in a computer, and a special program is used either to calculate the red-to-green fluorescence ratio or to subtract out background data for each microarray spot by analyzing the digital image of the array. If calculating ratios, the program then creates a table that contains the ratios of the intensity of red-to-green fluorescence for every spot on the array. For example, using the scenario outlined above, the computer may conclude that both cell types express gene A at the same level, that cell 1 expresses more of gene B, that cell 2 expresses more of gene C, and that neither cell expresses gene D. But remember, this is a simple example used to demonstrate key points in experimental design. Some microarray experiments can contain up to 30,000 target spots. Therefore, the data generated from a single array can mount up quickly.
Reproduced with permission from the Office of Science Education, the National Institutes of Health.
In this schematic: GREEN represents Control DNA, where either DNA or cDNA derived from normal tissue is hybridized to the target DNA. RED represents Sample DNA, where either DNA or cDNA is derived from diseased tissue hybridized to the target DNA. YELLOW represents a combination of Control and Sample DNA, where both hybridized equally to the target DNA. BLACK represents areas where neither the Control nor Sample DNA hybridized to the target DNA. Each spot on an array is associated with a particular gene. Each color in an array represents either healthy (control) or diseased (sample) tissue. Depending on the type of array used, the location and intensity of a color will tell us whether the gene, or mutation, is present in either the control and/or sample DNA. It will also provide an estimate of the expression level of the gene(s) in the sample and control DNA.
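The ratio calculation described above can be sketched directly. The spot intensities and the flat background estimate are invented; real scanner software estimates background locally for each spot.

```python
# Sketch of the per-spot calculation the scanner software performs:
# subtract a background estimate from the red (sample) and green
# (control) intensities, then take their ratio. Intensities are
# hypothetical values for four genes.

def expression_ratios(spots, background=50):
    """Map gene -> red/green ratio from background-subtracted intensities."""
    ratios = {}
    for gene, (red, green) in spots.items():
        # Clamp at 1 so a blank spot cannot divide by zero.
        red, green = max(red - background, 1), max(green - background, 1)
        ratios[gene] = red / green
    return ratios

spots = {"geneA": (1050, 1050),  # equal expression in both cell types
         "geneB": (2050, 550),   # higher in the sample (red)
         "geneC": (550, 2050),   # higher in the control (green)
         "geneD": (50, 50)}      # not expressed in either
print(expression_ratios(spots))
```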
Types of Microarrays
There are three basic types of samples that can be used to construct DNA microarrays: two are genomic, and the other is "transcriptomic"; that is, it measures mRNA levels. What makes them different from each other is the kind of immobilized DNA used to generate the array and, ultimately, the kind of information that is derived from the chip. The target DNA used will also determine the type of control and sample DNA that is used in the hybridization solution.
I. Changes in Gene Expression Levels Determining the level, or volume, at which a certain gene is expressed is called microarray expression analysis, and the arrays used in this kind of analysis are called "expression chips". The immobilized DNA is cDNA derived from the mRNA of known genes, and once again, at least in some experiments, the control and sample DNA hybridized to the chip is cDNA derived from the mRNA of normal and diseased tissue, respectively. If a gene is overexpressed in a certain disease state, then more sample cDNA, as compared to control cDNA, will hybridize to the spot representing that expressed gene. In turn, the spot will fluoresce red with greater intensity than it will fluoresce green. Once researchers have characterized the expression patterns of various genes involved in many diseases, cDNA derived from diseased tissue from any individual can be hybridized to determine whether the expression pattern of the gene from the individual matches the expression pattern of a known disease. If this is the case, treatment appropriate for that disease can be initiated. In addition to detecting expression patterns, that is, whether a particular gene(s) is being expressed more or less under certain circumstances, expression chips may also be used to examine changes in gene expression over a given period of time, such as within the cell cycle. The cell cycle is a molecular network that determines, in the normal cell, if the cell should pass through its life cycle. There are a variety of genes involved in regulating the stages of the cell cycle. Also built into this network are mechanisms designed to protect the body when this system fails or breaks down because of mutations within one of the "control genes", as is the case with cancerous cell growth. An expression microarray "experiment" could be designed where cell cycle data are generated in multiple arrays and referenced to time "zero".
Analysis of the collected data could further elucidate details of the cell cycle and its "clock", providing much needed data on the points at which gene mutation leads to cancerous growth, as well as targets for therapeutic intervention. In the same way, expression chips can be used to develop new drugs. For instance, if a certain gene is overexpressed in a particular form of cancer, researchers can use expression chips to see whether a new drug will reduce overexpression and force the cancer into remission. Expression chips could also be used in disease diagnosis, e.g., in the identification of new genes involved in environmentally triggered diseases, such as those affecting the immune, nervous, and pulmonary/respiratory systems.
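The red/green comparison described above boils down to a simple ratio computation. The following is a minimal sketch, not any particular analysis package: the intensity values are invented, and the 2-fold cutoff is an arbitrary illustrative threshold.

```python
import math

# Hypothetical red (Cy5, diseased sample) and green (Cy3, normal control)
# fluorescence intensities for a few spots on an expression chip.
intensities = {
    "geneA": (8000.0, 1000.0),   # red >> green: overexpressed in disease
    "geneB": (1200.0, 1150.0),   # red ~ green: unchanged
    "geneC": (300.0, 2400.0),    # green >> red: underexpressed in disease
}

def expression_ratio(red, green):
    """log2(red/green): positive means overexpressed, negative underexpressed."""
    return math.log2(red / green)

def classify(red, green, threshold=1.0):
    """Call a gene over- or under-expressed if the absolute log2 ratio
    exceeds the threshold (a 2-fold change here); otherwise 'unchanged'."""
    ratio = expression_ratio(red, green)
    if ratio > threshold:
        return "overexpressed"
    if ratio < -threshold:
        return "underexpressed"
    return "unchanged"

for gene, (red, green) in intensities.items():
    print(gene, classify(red, green))
```

Real expression analyses add background subtraction and normalization across the whole chip before any ratios are taken; this sketch shows only the final classification step.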
II. Genomic Gains and Losses DNA repair genes are thought to be the body's frontline defense against mutations and, as such, play a major role in cancer. Mutations within these genes often manifest themselves as lost or broken chromosomes. It has been hypothesized that certain chromosomal gains and losses are related to cancer progression and that the patterns of these changes are relevant to clinical prognosis. Using different laboratory methods, researchers can measure gains and losses in the copy number of chromosomal regions in tumor cells. Then, using mathematical models to analyze these data, they can predict which chromosomal regions are most likely to harbor important genes for tumor initiation and disease progression. The results of such an analysis may be depicted as a hierarchical, treelike branching diagram, referred to as a "tree model of tumor progression". Researchers use a technique called microarray Comparative Genomic Hybridization (CGH) to look for genomic gains and losses or for a change in the number of copies of a particular gene involved in a disease state. In microarray CGH, large pieces of genomic DNA serve as the target DNA, and each spot of target DNA in the array has a known chromosomal location. The hybridization mixture will contain fluorescently labeled genomic DNA harvested from both normal (control) and diseased (sample) tissue. Therefore, if the number of copies of a particular target gene has increased, a large amount of sample DNA will hybridize to those spots on the microarray that represent the gene involved in that disease, whereas comparatively small amounts of control DNA will hybridize to those same spots. As a result, those spots containing the disease gene will fluoresce red with greater intensity than they will fluoresce green, indicating that the number of copies of the gene involved in the disease has increased.
III. Mutations in DNA When researchers use microarrays to detect mutations or polymorphisms in a gene sequence, the target, or immobilized, DNA is usually that of a single gene. In this case, though, the target sequence placed on any given spot within the array will differ from that of other spots in the same microarray, sometimes by only one or a few specific nucleotides. One type of sequence commonly used in this type of analysis is called a Single Nucleotide Polymorphism, or SNP, a small genetic change or variation that can occur within a person's DNA sequence. Another difference in mutation microarray analysis, as compared to expression or CGH microarrays, is that this type of experiment requires only genomic DNA derived from a normal sample for use in the hybridization mixture. Once researchers have established that a SNP pattern is associated with a particular disease, they can use SNP microarray technology to test an individual for that pattern to determine whether he or she is susceptible to (at risk of developing) that disease. When genomic DNA from an individual is hybridized to an array loaded with various SNPs, the sample DNA will hybridize with greater frequency only to the specific SNPs associated with that person. Those spots on the microarray will then fluoresce with greater intensity, demonstrating that the individual being tested may have, or is at risk of developing, that disease.
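Conceptually, comparing an individual's chip results against a disease-associated SNP pattern is a set-matching problem. A minimal sketch follows; the SNP identifiers, alleles, and "profile" here are all invented for illustration and do not correspond to any real disease association.

```python
# Hypothetical disease-associated SNP profile: SNP id -> risk allele.
disease_snps = {"rs0001": "A", "rs0002": "T", "rs0003": "G"}

def risk_matches(individual_genotype, profile=disease_snps):
    """Return the profile SNPs at which the individual carries the risk allele.

    individual_genotype maps a SNP id to the allele observed on the chip;
    SNPs absent from the individual's results are simply not counted."""
    return {snp for snp, allele in profile.items()
            if individual_genotype.get(snp) == allele}

# An invented patient genotype, including one SNP not in the profile:
patient = {"rs0001": "A", "rs0002": "C", "rs0003": "G", "rs0099": "T"}
hits = risk_matches(patient)
print(len(hits), "of", len(disease_snps), "risk alleles present")
```

A real test would also handle heterozygous calls (two alleles per SNP) and weight SNPs by the strength of their disease association, but the matching logic is the same.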
What Is GEO?
As we have just discussed, microarray technology is one of the most recent and important experimental breakthroughs in molecular biology. Today, the capacity to generate data is fast outpacing the capacity to store and analyze it. Much of this information is scattered across the Internet or is not available to the public at all. As more laboratories acquire this technology, the problem will only get worse. This avalanche of data requires standardization of storage, sharing, and publishing techniques. To support the public use and dissemination of gene expression data, NCBI has launched the Gene Expression Omnibus, or GEO. GEO represents NCBI's effort to build an expression data repository and online resource for the storage and retrieval of gene expression data from any organism or artificial source. Many types of gene expression data, such as those discussed in this primer, are accepted and archived as public datasets.
A complete description of a microarray experiment covers:
Array design: each array used and each spot on the array
Samples: samples used, the extract preparation, and labeling
Hybridizations: procedures and parameters
Measurements: images, quantitation, and specifications
Controls: types, values, and specifications
MAML is independent of the particular experimental platform and provides a framework for describing experiments done on all types of DNA arrays, including spotted and synthesized arrays, as well as oligo and cDNA arrays. What's more, MAML provides a format for representing microarray data in a flexible way, allowing analysis of data obtained not only from any existing microarray platform but also from many possible future variants, including protein arrays. Although the data in GEO are not currently provided in MAML format, it is NCBI's goal to have the data delivered in a number of formats, including MAML, which is soon to be replaced by a more recent version called MAGE-ML (MicroArray Gene Expression Markup Language).
The Benefits of GEO and MAML
By storing vast amounts of data on gene expression profiles derived from multiple experiments using varied criteria and conditions, GEO will aid in the study of functional genomics (the development and application of global experimental approaches to assess gene function).
GEO will facilitate the cross-validation of data obtained using different techniques and technologies and will help set benchmarks and standards for further gene expression studies.
By making the information stored in GEO publicly available, the fields of bioinformatics and functional genomics will be both promoted and advanced.
That such experimental data should be freely accessible to all is consistent with NCBI's legislative mandate and mission: to develop new information technologies to aid in the understanding of fundamental molecular and genetic processes that control health and disease.
What Is Pharmacogenomics?
The way a person responds to a drug (this includes both positive and negative reactions) is a complex trait that is influenced by many different genes. Without knowing all of the genes involved in drug response, scientists have found it difficult to develop genetic tests that could predict a person's response to a particular drug. Once scientists discovered that people's genes show small variations (or changes) in their nucleotide (DNA base) content, all of that changed: genetic testing for predicting drug response is now possible. Pharmacogenomics is the science that examines the inherited variations in genes that dictate drug response and explores the ways these variations can be used to predict whether a patient will have a good response to a drug, a bad response to a drug, or no response at all.
The distinction between pharmacogenomics and the older term pharmacogenetics is considered arbitrary, however, and the two terms are now used interchangeably.
How Will Gene Variation Be Used in Predicting Drug Response? Right now, there is a race to catalog as many as possible of the genetic variations found within the human genome. These variations, or SNPs (pronounced "snips"), as they are commonly called, can be used as a diagnostic tool to predict a person's drug response. For SNPs to be used in this way, a person's DNA must be examined (sequenced) for the presence of specific SNPs. (DNA sequencing is the determination of the order of nucleotides, the base sequence, in a DNA molecule.) The problem, however, is that traditional gene sequencing technology is very slow and expensive and has therefore impeded the widespread use of SNPs as a diagnostic tool. DNA microarrays (or DNA chips) are an evolving technology that should make it possible for doctors to examine their patients for the presence of specific SNPs quickly and affordably. A single microarray can now be used to screen 100,000 SNPs found in a patient's genome in a matter of hours. As DNA microarray technology is developed further, SNP screening in the doctor's office to determine a patient's response to a drug, prior to drug prescription, will be commonplace.
How Will Drug Development and Testing Benefit from Pharmacogenomics? SNP screening will benefit drug development and testing because pharmaceutical companies could exclude from clinical trials those people whose pharmacogenomic screening shows that the drug being tested would be harmful or ineffective for them. Excluding these people will increase the chance that a drug will prove effective for a particular population group and will thus increase the chance that the same drug will make it into the marketplace. Pre-screening clinical trial subjects should also allow clinical trials to be smaller, faster, and therefore less expensive; the consumer could thus benefit from reduced drug costs. Finally, the ability to assess an individual's reaction to a drug before it is prescribed will increase a physician's confidence in prescribing the drug and the patient's confidence in taking the drug, which in turn should encourage the development of new drugs tested in a like manner.
What Is NCBI's Role in Pharmacogenomics? The explosion in both SNP and microarray data generated from the Human Genome Project has necessitated the development of a means of cataloging and annotating (briefly describing) these data so that scientists can more easily access and use them in their research. NCBI, always at the forefront of bioinformatics research, has developed database repositories for both SNP (dbSNP) and microarray (GEO) data. These databases include both descriptive information about the data within the site itself and links to NCBI and external information resources. Access to these data and information resources will allow scientists to more easily interpret data that will be used not only to help determine drug response but also to study disease susceptibility and conduct basic research in population genetics.
The Promise of Pharmacogenomics Right now, in doctors' offices all over the world, patients are given medications that either don't work or have bad side effects. Often, a patient must return to their doctor over and over again until the doctor can find a drug that is right for them. Pharmacogenomics offers a very appealing alternative. Imagine a day when you go into your doctor's office and, after a simple and rapid test of your DNA, your doctor changes her/his mind about a drug considered for you because your genetic test indicates that you could suffer a severe negative reaction to the medication. However, upon further examination of your test results, your doctor finds that you would benefit greatly from a new drug on the market, and that there would be little likelihood that you would react negatively to it. A day like this will be coming to your doctor's office soon, brought to you by pharmacogenomics.
GETTING STARTED Need help with dbSNP or Map Viewer? How about a quick "how-to" for using other NCBI data-mining tools? Try GETTING STARTED, a new NCBI resource designed to aid the novice user:
brief descriptions of data that can be found/manipulated using a particular NCBI tool
shortcuts for getting to where you need to go
concise explanations of NCBI tool graphics, with insider techniques for conducting database searches
simple examples of tool usage
Taxonomic Classification
Taxonomic ranks approximate evolutionary distances among groups of organisms. For example, species belonging to two different superkingdoms are the most distantly related (their common ancestor diverged in the distant past), with progressively more exclusive groups indicated by phylum, class, and so on, down to infraspecific ranks, or ranks occurring within a species. Infraspecific ranks, such as subspecies, varietas, and forma, denote the closest evolutionary relationships. See the simplified classification of humans below. Taxonomists, scientists who classify living organisms, define a species as any group of closely related organisms that can produce fertile offspring. The more closely "related" two organisms are, the closer they approach the level of species, and the more genes they have in common. The level of species can be further divided into smaller segments. A population is the smallest unit of a species and is made up of organisms of the same species. Sometimes, a population will physically alter over time to suit the needs of its environment; this variation is called a cline and can make members of the same species look different.
Charles Darwin was the first to recognize that the systematic hierarchy represented a rough approximation of evolutionary history. However, it was not until the 1950s that the German entomologist Willi Hennig proposed that systematics should reflect the known evolutionary history of lineages as closely as possible, an approach he called phylogenetic systematics. The followers of Hennig were disparagingly referred to as "cladists" by his opponents because of their emphasis on recognizing only monophyletic groups (a group plus all of its descendants, or clades). However, the cladists quickly adopted the term as a helpful label, and nowadays cladistic approaches to systematics are used routinely.
How Does Genetic Variation Occur? Every organism possesses a genome that contains all of the biological information needed to construct and maintain a living example of that organism. The biological information contained in a genome is encoded in the nucleotide sequence of its DNA or RNA molecules and is divided into discrete units called genes. The information stored in a gene is read by proteins, which attach to the genome and initiate a series of reactions called gene expression. Every time a cell divides, it must make a complete copy of its genome, a process called DNA replication. DNA replication must be extremely accurate to avoid introducing mutations, or changes in the nucleotide sequence of a short region of the genome. Inevitably, some mutations do occur, usually in one of two ways: either from errors in DNA replication or from the damaging effects of chemical agents or radiation that react with DNA and change the
structure of individual nucleotides. Many of these mutations result in a change that has no effect on the functioning of the genome; these are referred to as silent mutations. Silent mutations include virtually all changes that happen in the noncoding components of genes and gene-related sequences. Mutations in the coding regions of genes are much more important. Here we must consider the importance of the same mutation in a somatic cell compared with a germ line cell. A somatic cell is any cell of an organism other than a reproductive cell, such as a sperm or egg cell. A germ cell line is any line of cells that gives rise to gametes and is continuous through the generations. Because a somatic cell does not pass on copies of its genome to the next generation, a somatic cell mutation is important only for the organism in which it occurs and has no potential evolutionary impact. In fact, most somatic mutations have no significant effect because there are many other identical cells in the same tissue. On the other hand, mutations in germ cells can be transmitted to the next generation and will then be present in all of the cells of an individual who inherits that mutation. Even so, mutations within germ line cells may not change the phenotype of the organism in any significant way. Those mutations that do have an evolutionary effect can be divided into two categories: loss-of-function mutations and gain-of-function mutations. A loss-of-function mutation results in reduced or abolished protein function. Gain-of-function mutations, which are much less common, confer an abnormal activity on a protein.
The randomness with which mutations can occur is an important concept in biology and is a requirement of the Darwinian view of evolution, which holds that changes in the characteristics of an organism occur by chance and are not influenced by the environment in which the organism lives. Beneficial changes within an organism are then positively selected for, whereas harmful changes are selected against.
The Drivers of Evolution: Selection, Drift, and Founder Effects We just discussed that new alleles appear in a population because of mutations that occur in the reproductive cells of an organism. This means that many genes are polymorphic, that is, two or more alleles for that gene are present in a population. Each of these alleles has its own allele (or gene) frequency, a measure of how common that allele is in a population. Allele frequencies vary over time because of two processes: natural selection and random drift.
Natural Selection Natural selection is the process whereby one genotype, the hereditary constitution of an individual, leaves more offspring than another genotype because of superior life attributes, termed fitness. Natural selection acts on genetic variation by conferring a survival advantage on those individuals harboring a particular mutation that is favored under a changing environmental condition. These individuals then reproduce and pass on this "new" gene, altering the population's gene pool. Natural selection, therefore, decreases the frequencies of alleles that reduce the fitness of an organism and increases the frequencies of alleles that improve fitness.
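The change in allele frequency under selection can be written down exactly for a single gene with two alleles. The sketch below uses the standard one-locus selection recursion from population genetics; the fitness values (a 10% advantage for one homozygote, 5% for the heterozygote) are invented for illustration.

```python
def next_allele_freq(p, w_AA, w_Aa, w_aa):
    """One generation of selection at a single biallelic locus.

    p is the frequency of allele A; the w's are the relative fitnesses
    of the three genotypes. Returns the frequency of A in the next
    generation: p' = p(p*w_AA + q*w_Aa) / mean fitness."""
    q = 1.0 - p
    mean_w = p * p * w_AA + 2 * p * q * w_Aa + q * q * w_aa
    return p * (p * w_AA + q * w_Aa) / mean_w

# A mildly advantageous allele starting at low frequency rises steadily:
p = 0.01
for _ in range(100):
    p = next_allele_freq(p, 1.10, 1.05, 1.00)
print(round(p, 3))
```

Note that with neutral fitnesses (all w's equal) the formula leaves p unchanged, matching the point in the text: selection only changes allele frequencies when genotypes differ in fitness.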
"Natural Selection" is the principle by which each slight variation, if useful, is preserved. Charles Darwin
It is important to point out that natural selection does not always represent progress, only adaptation to changing surroundings; that is, evolution attributable to natural selection is devoid of intent: something does not evolve to better itself, only to adapt. Because environments are always changing, what was once an advantageous mutation can often become a liability further down the evolutionary line.
Random Drift The term random drift actually encompasses a number of distinct processes, sometimes referred to as outcomes. They include indiscriminate parent sampling, the founder effect, and fluctuations in the rate of evolutionary processes such as selection, migration, and mutation. Parent sampling is the process of determining which organisms of one generation will be the parents of the next generation. Parent sampling may be discriminate, that is, with regard to fitness differences, or indiscriminate, without regard to fitness differences. Discriminate parent sampling is generally considered natural selection, whereas indiscriminate parent sampling is considered random drift.
What Is Sampling?
Suppose a population of red and brown squirrels shares a habitat with a color-blind predator. Although the predator is color-blind, the brown squirrels die in greater numbers than the red squirrels; the brown squirrels simply happen to be unlucky enough to come into contact with the predator more often. As a result, the frequency of brown squirrels in the next generation is reduced. More red squirrels survive to reproduce, or are sampled, but without regard to any differences in fitness between the two groups. The physical differences between the groups do not play a causal role in the differences in reproductive success. Now, let's say that the predator is not color-blind and can see the red squirrels better than the brown squirrels, resulting in a better survival rate for the brown squirrels. This would be a case of discriminate parent sampling, or natural selection.
Founder Effect Another important cause of genetic drift is the founder effect, the difference between the gene pool of a population as a whole and that of a newly isolated population of the same species. The founder effect occurs when populations are started from a small number of pioneer individuals of one original population. Because of small sample size, the new population could have a much different genetic ratio than the original population. An example of the founder effect would be a plant population that results from a single seed. Thus far, we have discussed natural selection and random drift as events that occur in isolation from one another. However, in most populations, the two processes will be occurring at the same time. Furthermore, there is great debate over whether, in particular instances and in general, natural selection is more prevalent than random drift.
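Indiscriminate parent sampling is easy to simulate: each generation, the next generation's allele copies are drawn at random from the current frequency, with no fitness differences at all. The sketch below uses the classic Wright-Fisher sampling scheme; the population sizes, generation count, and random seed are arbitrary choices for illustration. It shows why the founder effect matters: the same starting frequency drifts to loss or fixation far more often in a small population than in a large one.

```python
import random

def wright_fisher(p, pop_size, generations, rng):
    """Pure random drift: each generation, 2N allele copies are drawn
    at random (binomially) from the previous generation's frequency p."""
    copies = 2 * pop_size
    for _ in range(generations):
        p = sum(rng.random() < p for _ in range(copies)) / copies
        if p in (0.0, 1.0):       # allele lost or fixed: drift is over
            break
    return p

rng = random.Random(42)
# 100 replicate populations each, starting at frequency 0.5:
small = [wright_fisher(0.5, 10, 50, rng) for _ in range(100)]    # N = 10
large = [wright_fisher(0.5, 500, 50, rng) for _ in range(100)]   # N = 500
small_fixed = sum(p in (0.0, 1.0) for p in small)
large_fixed = sum(p in (0.0, 1.0) for p in large)
print("fixed or lost:", small_fixed, "of 100 small populations,",
      large_fixed, "of 100 large populations")
```

No allele is "better" here; the frequency changes come entirely from sampling noise, which is exactly what distinguishes drift from selection in the text above.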
A phylogenetic tree is composed of nodes, each representing a taxonomic unit (species, populations, individuals), and branches, which define the relationships between the taxonomic units in terms of descent and ancestry. Only one branch can connect any two adjacent nodes. The branching pattern of the tree is called the topology, and the branch length usually represents the number of changes that have occurred along the branch. This is called a scaled branch. Scaled trees are often calibrated to represent the passage of time. Such trees have a theoretical basis in the particular gene or genes under analysis. Branches can also be unscaled, which means that the branch length is not proportional to the number of changes that have occurred, although the actual number may be indicated numerically somewhere on the branch. Phylogenetic trees may also be either rooted or unrooted. In rooted trees, there is a particular node, called the root, representing a common ancestor, from which a unique path leads to any other node. An unrooted tree only specifies the relationships among species, without identifying a common ancestor or evolutionary path.
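The node-and-branch structure just described maps directly onto a small data structure, and trees of this kind are conventionally exchanged as text in Newick format (the parenthesized notation most tree-building programs read and write). A minimal sketch follows; the node names and branch lengths are invented for illustration.

```python
class Node:
    """A node in a phylogenetic tree. Leaves are taxonomic units;
    internal nodes are inferred ancestors. branch_length is the
    (scaled) length of the branch leading to this node."""
    def __init__(self, name="", branch_length=0.0, children=None):
        self.name = name
        self.branch_length = branch_length
        self.children = children or []

def to_newick(node):
    """Serialize a rooted tree in Newick format: leaves appear as
    name:length, and each internal node wraps its children in parentheses."""
    if not node.children:
        return f"{node.name}:{node.branch_length}"
    inner = ",".join(to_newick(c) for c in node.children)
    return f"({inner}){node.name}:{node.branch_length}"

# A rooted tree: the root has two descendants, leaf A and an
# ancestral node e, which in turn leads to leaves B and C.
tree = Node("root", 0.0, [
    Node("A", 0.3),
    Node("e", 0.1, [Node("B", 0.2), Node("C", 0.25)]),
])
print(to_newick(tree) + ";")
```

An unrooted tree has no distinguished root node, so the same topology can be written in Newick starting from any internal node; the rooted form above fixes one particular ancestry.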
Cladistic Method of Analysis An alternative approach to diagramming relationships between taxa is called cladistics. The basic assumption behind cladistics is that members of a group share a common evolutionary history. Thus, they are more closely related to one another than they are to other groups of organisms. Related groups of organisms are recognized because they share a set of unique features (apomorphies) that were not present in distant ancestors but which are shared by most or all of the organisms within the group. These shared derived characteristics are called synapomorphies. Therefore, in contrast to phenetics, cladistic groupings do not depend on whether organisms share physical traits but on their evolutionary relationships. Indeed, in cladistic analyses, two organisms may share numerous characteristics but still be considered members of different groups. Cladistic analysis entails a number of assumptions. For example, species are assumed to arise primarily by bifurcation, or separation, of the ancestral lineage; species are often considered to become extinct upon hybridization (crossbreeding); and hybridization is assumed to be rare or absent. In addition, cladistic groupings must possess the following characteristics: all species in a grouping must share a common ancestor, and all species derived from that common ancestor must be included in the taxon. The application of these requirements results in the following terms being used to describe the different ways in which groupings can be made: A monophyletic grouping is one in which all species share a common ancestor, and all species derived from that common ancestor are included. This is the only form of grouping accepted as valid by cladists. A paraphyletic grouping is one in which all species share a common ancestor, but not all species derived from that common ancestor are included.
A polyphyletic grouping is one in which species that do not share an immediate common ancestor are lumped together, while excluding other members that would link them.
Molecular Phylogenetic Analysis: Fundamental Elements As we just discussed, macromolecules, especially gene and protein sequences, have surpassed morphological and other organismal characters as the most popular forms of data for phylogenetic analyses. Therefore, this next section will concentrate only on molecular data. It is important to point out that a single, all-purpose recipe does not exist for phylogenetic analysis of molecular data. Although numerous algorithms, procedures, and computer programs have been developed, their reliability and practicality are, in all cases, dependent upon the size and structure of the dataset under analysis. The merits and shortfalls of these various methods are subject to much scientific debate, because the danger of generating incorrect results is greater in computational molecular phylogenetics than in many other fields of science. Occasionally, the limiting factor in such analyses is not so much the computational method used but the users' understanding of what the method is actually doing with the data. Therefore, the goal of this section is to demonstrate to the reader that practical analysis should be thought of both as a search for a correct model (analysis) and as a search for the correct tree (outcome).
Nucleotide and protein sequences can also be used to generate trees. DNA, RNA, and protein sequences can be considered as phenotypic traits. The sequences depict the relationship of genes and usually of the organisms in which the genes are found.
Phylogenetic tree-building methods presume particular evolutionary models. For any given set of data, these models may be violated because of various occurrences, such as the transfer of genetic material between organisms. Therefore, when interpreting a given analysis, a person should always consider the model used and entertain possible explanations for the results obtained. For example, models used in molecular phylogenetic analysis methods make "default" assumptions, including:
The sequence is correct and originates from the specified source.
The sequences are homologous (all descended in some way from a shared ancestral sequence).
Each position in a sequence alignment is homologous with every other in that alignment.
Each of the multiple sequences included in a common analysis has a common phylogenetic history with the other sequences.
The sampling of taxa is adequate to resolve the problem under study.
Sequence variation among the samples is representative of the broader group.
The sequence variability in the sample contains phylogenetic signal adequate to resolve the problem under study.
Molecular phylogenetic analysis can be broken into four steps:
1. Alignment: building the data model and extracting a dataset.
2. Determining the substitution model: considering sequence variation.
3. Tree building.
4. Tree evaluation.
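As a minimal illustration of what the alignment step feeds into, consider the p-distance, the simplest measure of sequence variation: the proportion of aligned sites at which two sequences differ. The sequences below are invented, and real analyses would apply a substitution model to correct these raw distances before tree building.

```python
def p_distance(seq1, seq2):
    """Proportion of aligned sites at which two sequences differ,
    ignoring any position where either sequence has a gap ('-')."""
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != "-" and b != "-"]
    diffs = sum(a != b for a, b in pairs)
    return diffs / len(pairs)

# A toy alignment of three hypothetical sequences:
alignment = {
    "A": "ACGTACGT",
    "B": "ACGTACCT",
    "C": "ACGAAC-T",
}
names = sorted(alignment)
# Pairwise distance matrix, the input to distance-based tree building:
matrix = {(x, y): p_distance(alignment[x], alignment[y])
          for x in names for y in names}
print(matrix[("A", "B")], matrix[("A", "C")])
```

A distance matrix like this is exactly what distance-based tree-building methods (e.g., neighbor-joining) consume in step 3.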
Tree Building: Key Features of DNA-based Phylogenetic Trees Studies of gene and protein evolution often involve the comparison of homologs, sequences that have common origins but may or may not have common activity. Sequences that share an arbitrary level of similarity determined by alignment of matching bases are homologous. These sequences are inherited from a common ancestor that possessed similar structure, although the ancestor may be difficult to determine because it has been modified through descent.
A typical gene-based phylogenetic tree is depicted below. This tree shows the relationship between four homologous genes: A, B, C, and D. The topology of this tree consists of four external nodes (A, B, C, and D), each one representing one of the four genes, and two internal nodes (e and f) representing ancestral genes. The branch lengths indicate the degree of evolutionary difference between the genes. This particular tree is unrooted: it is only an illustration of the relationships between genes A, B, C, and D and does not signify anything about the series of evolutionary events that led to them.
The second panel, below, depicts three rooted trees that can be drawn from the unrooted tree shown above, each representing a different evolutionary pathway possible between these four genes. A rooted tree is often referred to as an inferred tree. This is to emphasize that this type of illustration depicts only the series of evolutionary events that are inferred from the data under study and may not be the same as the true tree, that is, the tree that depicts the actual series of evolutionary events that occurred.
To distinguish between the pathways, the phylogenetic analysis must include at least one outgroup, a gene that is less closely related to A, B, C, and D than these genes are to each other (panel below). Outgroups enable the root of the tree to be located and the correct evolutionary pathway to be identified. Let's say that the four homologous genes used in the previous tree examples come from human, chimpanzee, gorilla, and orangutan. In this case, an outgroup could be a gene from another primate, such as the baboon, which is known to have branched away from the lineage leading to these four species before their common ancestor arose.
Gene Trees Versus Species Trees: Why Are They Different? It is often assumed that a gene tree, because it is based on molecular data, will be a more accurate and less ambiguous representation of the species tree than one obtainable by morphological comparisons. This may indeed be the case, but it does not mean that the gene tree is the same as the species tree. For this to be true, the internal nodes in both trees would have to be precisely equivalent, and they are not. An internal node in a gene tree indicates the divergence of an ancestral gene into two genes with different DNA sequences, usually resulting from a mutation of one sort or another. An internal node in a species tree represents a speciation event, whereby the population of the ancestral species splits into two groups that are no longer able to interbreed. These two events, mutation and speciation, do not always occur at the same time.
BLAST: Detecting New Sequence Similarities Currently, the characters most widely used for phylogenetic analysis are DNA and protein sequences. DNA sequences may be compared directly or, for those regions that code for a known protein, translated into protein sequences. Creating phylogenies from nucleotide or amino acid sequences first requires aligning the bases so that the differences between the sequences being studied are easier to spot. The introduction of NCBI's BLAST, or Basic Local Alignment Search Tool, in 1990 made it easier to rapidly scan huge databases for overt homologies, or sequence similarity, and to statistically evaluate the resulting matches. BLAST works by comparing a user's unknown sequence against the database of all known sequences to determine likely matches. In a matter of seconds, the BLAST server compares the user's sequence with up to a million known sequences and determines the closest matches. Specialized BLASTs are also available for human, mouse, microbial, and many other genomes. A single BLAST search can compare a sequence of interest to all other sequences stored in GenBank, NCBI's nucleotide sequence database. In this step, a researcher has the option of limiting the search to a specific taxonomic group. If the full scientific name or relationship of a species of interest is not known, the user can search for such details using NCBI's Taxonomy Browser, which provides direct links to some of the organisms commonly used in molecular research projects, such as the zebrafish, fruit fly, baker's yeast, nematode, and many more.
BLAST next tallies the differences between sequences and assigns a "score" based on sequence similarity. The scores assigned in a BLAST search have a well-defined statistical interpretation, making real sequence matches easier to distinguish from random background hits. This is because BLAST uses a special algorithm, or mathematical formula, that seeks local as opposed to global alignments and is therefore able to detect relationships among sequences that share only isolated regions of similarity. Taxonomy-related BLAST results are presented in three formats based on the information found in NCBI's Taxonomy database. The Organism Report sorts BLAST comparisons, also called hits, by species such that all hits to a given organism are grouped together. The Lineage Report provides a view of the relationships between the organisms based on NCBI's Taxonomy database. The Taxonomy Report provides in-depth details on the relationship between all the organisms in the BLAST hit list.
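The distinction between local and global alignment is easiest to see in code. Below is a score-only sketch of the classic Smith-Waterman local-alignment dynamic program: because a cell's score is never allowed to drop below zero, the best alignment can "restart" anywhere, so only the most similar region of the two sequences contributes. Note that BLAST itself uses fast heuristics rather than this full dynamic program, and the scoring values here are illustrative:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local-alignment score between sequences a and b.

    The max(0, ...) is what makes the alignment local: a poor prefix
    is discarded instead of dragging down the rest of the alignment.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # score matrix, zero-initialized
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            H[i][j] = max(0, diag, H[i-1][j] + gap, H[i][j-1] + gap)
            best = max(best, H[i][j])
    return best
```

For example, `smith_waterman("ACGTACGT", "TTACGTTT")` scores the shared "TACGT" region (five matches) and ignores the dissimilar flanks, whereas a global alignment would be penalized for them.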
COGs: Phylogenetic Classification of Proteins
The database of Clusters of Orthologous Groups of proteins (COGs) represents an attempt at the phylogenetic classification of proteins, a scheme that indicates the evolutionary relationships between organisms, from complete genomes. Each COG includes proteins that are thought to be orthologous, or connected through vertical evolutionary descent. COGs may be used to detect similarities and differences between species, to identify protein families and predict new protein functions, and to point to potential drug targets in disease-causing species. The database is accompanied by the COGnitor program, which assigns new proteins, typically from newly sequenced genomes, to pre-existing COGs. A Web page containing additional structural and functional information is now associated with each COG. These hyperlinked information pages include: systematic classification of the COG members under the different classification systems; indications of which COG members (if any) have been characterized genetically and biochemically; information on the domain architecture of the proteins constituting the COG and the three-dimensional structure of the domains if known or predictable; a succinct summary of the common structural and functional features of the COG members, as well as peculiarities of individual members; and key references.
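The core idea of assigning a new protein to an existing cluster can be sketched as a best-hit search. This toy version is an assumption-laden simplification: `similarity` stands in for BLAST bit scores, and the real COGnitor is stricter, requiring consistent best hits against COG members from several different genomes before making an assignment:

```python
def assign_to_cog(new_protein, cogs, similarity):
    """Assign a protein to the COG containing its best-scoring member.

    cogs: dict mapping COG id -> list of member sequences.
    similarity: any pairwise scoring function (stand-in for a BLAST
    bit score). Returns the id of the best-matching COG.
    """
    best_cog, best_score = None, float("-inf")
    for cog_id, members in cogs.items():
        score = max(similarity(new_protein, m) for m in members)
        if score > best_score:
            best_cog, best_score = cog_id, score
    return best_cog
```

For instance, with a crude identity count as the similarity function, a protein closest to a member of one cluster is assigned to that cluster.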
HomoloGene
HomoloGene is a database of both curated and calculated orthologs and homologs for the organisms represented in NCBI's UniGene database. Curated orthologs include gene pairs from the Mouse Genome Database (MGD) at the Jackson Laboratory, the Zebrafish Information Network (ZFIN) database at the University of Oregon, and from published reports. Computed orthologs and homologs are identified from BLAST nucleotide sequence comparisons between all UniGene clusters for each pair of organisms. HomoloGene also contains a set of triplet clusters in which orthologous clusters in two organisms are both orthologous to the same cluster in a third organism. HomoloGene can be searched via the Entrez retrieval system.
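A common heuristic for calling computed orthologs from all-against-all comparisons like these is the reciprocal best hit: clusters x and y are paired when x's best hit in the other organism is y and y's best hit is x. The sketch below assumes the scores have already been computed and tabulated; the exact HomoloGene procedure differs in its details:

```python
def reciprocal_best_hits(scores_ab, scores_ba):
    """Pair clusters from two organisms as reciprocal best hits.

    scores_ab[x][y]: similarity of cluster x (organism A) to cluster y
    (organism B); scores_ba is the reverse direction. A pair (x, y) is
    kept only when each is the other's single best hit.
    """
    best_ab = {x: max(hits, key=hits.get) for x, hits in scores_ab.items()}
    best_ba = {y: max(hits, key=hits.get) for y, hits in scores_ba.items()}
    return [(x, y) for x, y in best_ab.items() if best_ba.get(y) == x]
```

The reciprocity requirement filters out one-sided matches, such as a cluster whose best hit in the other organism actually matches some third cluster more closely.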
UniGene is a system for automatically partitioning GenBank sequences into a non-redundant set of gene-oriented clusters. Each UniGene cluster contains sequences that represent a unique gene, as well as related information, such as the tissue types in which the gene has been expressed and map location.
Entrez Genome
The whole genomes of over 1,200 organisms can be found in Entrez Genomes. The genomes represent both completely sequenced organisms and those for which sequencing is in progress. All three main domains of life (bacteria, archaea, and eukaryotes) are represented, as well as many viruses, viroids, plasmids, and eukaryotic organelles. Data can be accessed hierarchically starting from either an alphabetical listing or a phylogenetic tree for complete genomes in each of six principal taxonomic groups. One can follow the hierarchy to a variety of graphical overviews, including that of the whole genome of a single organism, a single chromosome, or even a single gene. At each level, one can access multiple views of the data, pre-computed summaries, and links to analyses appropriate for that level. In addition, any gene product (protein) that is a member of a COG is linked to the COGs database. A summary of COG functional groups is also presented in tabular and graphical formats at the genome level. For complete microbial genomes, pre-computed BLAST neighbors for protein sequences, including their taxonomic distribution and links to 3D structures, are given in TaxTables and PDBTables, respectively. Pairwise sequence alignments are presented graphically and linked to NCBI's Cn3D macromolecular viewer that allows the interactive display of three-dimensional structures and sequence alignments.
PDBeast: Taxonomy in MMDB
NCBI's Structure Group, in collaboration with NCBI taxonomists, has undertaken taxonomy annotation for the three-dimensional structure data stored in the Molecular Modeling Database (MMDB). A semi-automated approach has been implemented in which a human expert checks, corrects, and validates automatic taxonomic assignments in MMDB. The PDBeast software tool was developed by NCBI for this purpose. It pulls text descriptions of "Source Organisms" from either the original entries or user-specified information and looks for matches in the NCBI Taxonomy database to record taxonomy assignments.
The Molecular Modeling Database (MMDB) is a compilation of three-dimensional structures of biomolecules obtained from the Protein Data Bank (PDB). The PDB, managed and maintained by the Research Collaboratory for Structural Bioinformatics, is a collection of all publicly available three-dimensional structures of proteins, nucleic acids, carbohydrates, and a variety of other complexes experimentally determined by X-ray crystallography and NMR. The difference between the two databases is that MMDB records reorganize and validate the information obtained from the PDB in a way that enables cross-referencing between the chemistry and the three-dimensional structure of macromolecules. By integrating chemical, sequence, and structure information, MMDB is designed to serve as a resource for structure-based homology modeling and protein structure prediction.