
Introduction To NCBI Resources

A biological database is a large, organized body of persistent data. It stores, organizes, and indexes data and provides specialized tools to view and analyze it. Bioinformatics is the analysis of biological data using computers and software.

Uploaded by cgonzagaa
Copyright © Attribution Non-Commercial (BY-NC)

A Basic Introduction to the Science Underlying NCBI Resources

BIOINFORMATICS
Over the past few decades, major advances in the field of molecular biology, coupled with advances in genomic technologies, have led to an explosive growth in the biological information generated by the scientific community. This deluge of genomic information has, in turn, led to an absolute requirement for computerized databases to store, organize, and index the data and for specialized tools to view and analyze the data.

The completion of a "working draft" of the human genome--an important milestone in the Human Genome Project--was announced in June 2000 at a press conference at the White House and was published in the February 15, 2001 issue of the journal Nature.

What Is a Biological Database?


A biological database is a large, organized body of persistent data, usually associated with computerized software designed to update, query, and retrieve components of the data stored within the system. A simple database might be a single file containing many records, each of which includes the same set of information. For example, a record in a nucleotide sequence database typically contains information such as a contact name, the input sequence with a description of the type of molecule, the scientific name of the source organism from which it was isolated, and, often, literature citations associated with the sequence.

For researchers to benefit from the data stored in a database, two additional requirements must be met:

- easy access to the information
- a method for extracting only the information needed to answer a specific biological question

At NCBI, many of our databases are linked through a unique search and retrieval system called Entrez. Entrez (pronounced ahn' tray) allows a user not only to access and retrieve specific information from a single database but also to access integrated information from many NCBI databases. For example, the Entrez Protein database is cross-linked to the Entrez Taxonomy database. This allows a researcher to find taxonomic information (taxonomy is the division of the natural sciences that deals with the classification of organisms) for the species from which a protein sequence was derived.
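Entrez can also be queried programmatically through NCBI's E-utilities web service. As a minimal sketch (the accession shown is only an illustrative example), an efetch request URL can be assembled from its documented parameters:

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def efetch_url(db, uid, rettype="fasta", retmode="text"):
    """Build an Entrez E-utilities efetch URL for a given database and record ID."""
    params = urlencode({"db": db, "id": uid, "rettype": rettype, "retmode": retmode})
    return f"{EUTILS_BASE}/efetch.fcgi?{params}"

# Example: request a FASTA record from the Entrez Protein database
url = efetch_url("protein", "NP_000509")
print(url)
```

Fetching the resulting URL returns the record in the requested format; the same pattern works for other Entrez databases by changing the `db` parameter.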

The data in GenBank are made available in a variety of ways, each tailored to a particular use, such as data submission or sequence searching.

What Is Bioinformatics?
Bioinformatics is the field of science in which biology, computer science, and information technology merge to form a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned. At the beginning of the "genomic revolution", a central bioinformatics concern was the creation and maintenance of databases to store biological information, such as nucleotide and amino acid sequences. Development of this type of database involved not only design issues but also the development of complex interfaces whereby researchers could both access existing data and submit new or revised data.
Biology in the 21st century is being transformed from a purely lab-based science to an information science as well.

Ultimately, however, all of this information must be combined to form a comprehensive picture of normal cellular activities so that researchers may study how these activities are altered in different disease states. Therefore, the field of bioinformatics has evolved such that the most pressing task now involves the analysis and interpretation of various types of data, including nucleotide and amino acid sequences, protein domains, and protein structures. The actual process of analyzing and interpreting data is referred to as computational biology. Important subdisciplines within bioinformatics and computational biology include:

- the development and implementation of tools that enable efficient access to, and use and management of, various types of information
- the development of new algorithms (step-by-step computational procedures) and statistics with which to assess relationships among members of large data sets, such as methods to locate a gene within a sequence, predict protein structure and/or function, and cluster protein sequences into families of related sequences
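As a toy illustration of the first kind of algorithm listed above (locating a gene within a sequence), the sketch below scans a DNA string for the longest open reading frame; real gene-finding methods are far more sophisticated:

```python
# Toy "gene locator": find the longest open reading frame (ORF) that starts
# with ATG and ends at a stop codon, scanning the three forward reading frames.

STOP_CODONS = {"TAA", "TAG", "TGA"}

def longest_orf(dna):
    """Return the longest ATG-to-stop ORF found in the three forward frames."""
    best = ""
    for frame in range(3):
        codons = [dna[k:k + 3] for k in range(frame, len(dna) - 2, 3)]
        i = 0
        while i < len(codons):
            if codons[i] == "ATG":
                for j in range(i, len(codons)):
                    if codons[j] in STOP_CODONS:
                        orf = "".join(codons[i:j + 1])
                        if len(orf) > len(best):
                            best = orf
                        i = j  # resume scanning after this ORF
                        break
            i += 1
    return best

print(longest_orf("CCATGAAATGTTTGGGTAGCC"))  # → ATGTTTGGGTAG
```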

Why Is Bioinformatics So Important?


The rationale for applying computational approaches to facilitate the understanding of various biological processes includes:

- a more global perspective in experimental design
- the ability to capitalize on the emerging technology of database mining, the process by which testable hypotheses are generated regarding the function or structure of a gene or protein of interest by identifying similar sequences in better-characterized organisms

Although a human disease may not be found in exactly the same form in animals, there may be sufficient data from an animal model to allow researchers to make inferences about the process in humans.

Evolutionary Biology

New insight into the molecular basis of a disease may come from investigating the function of homologs of a disease gene in model organisms. In this case, homology refers to two genes sharing a common evolutionary history. Scientists also use the term homology, or homologous, to simply mean similar, regardless of the evolutionary relationship. Equally exciting is the potential for uncovering evolutionary relationships and patterns between different forms of life. With the aid of nucleotide and protein sequences, it should be possible to find the ancestral ties between different organisms. Thus far, experience has taught us that closely related organisms have similar sequences and that more distantly related organisms have more dissimilar sequences. Proteins that show a significant sequence conservation, indicating a clear evolutionary relationship, are said to be from the same protein family. By studying protein folds (distinct protein building blocks) and families, scientists are able to reconstruct the evolutionary relationship between two species and to estimate the time since two organisms last shared a common ancestor.
NCBI's COGs database has been designed to simplify evolutionary studies of complete genomes and to improve functional assignment of individual proteins.

Phylogenetics is the field of biology that deals with identifying and understanding the relationships between the different kinds of life on earth.

Protein Modeling

The process of evolution has resulted in the production of DNA sequences that encode proteins with specific functions. In the absence of a protein structure that has been determined by X-ray crystallography or nuclear magnetic resonance (NMR) spectroscopy, researchers can try to predict the three-dimensional structure using protein or molecular modeling. This method uses experimentally determined protein structures (templates) to predict the structure of another protein that has a similar amino acid sequence (target). Although molecular modeling may not be as accurate at determining a protein's structure as experimental methods, it is still extremely helpful in proposing and testing various biological hypotheses. Molecular modeling also provides a starting point for researchers wishing to confirm a structure through X-ray crystallography and NMR spectroscopy. Because the different genome projects are producing more sequences and because novel protein folds and families are being determined, protein modeling will become an increasingly important tool for scientists working to understand normal and disease-related processes in living organisms.

The Four Steps of Protein Modeling

1. Identify the proteins with known three-dimensional structures that are related to the target sequence
2. Align the related three-dimensional structures with the target sequence and determine which structures will be used as templates
3. Construct a model for the target sequence based on its alignment with the template structure(s)
4. Evaluate the model against a variety of criteria to determine whether it is satisfactory
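The first step above can be caricatured in code: a hypothetical ranking of candidate templates by raw percent identity to the target sequence. Real pipelines use alignment and profile searches rather than raw identity, and the PDB-style identifiers and sequences below are invented:

```python
# Hypothetical sketch of template selection for homology modeling:
# score each candidate template by percent identity to the target.

def percent_identity(a, b):
    """Percent identity over the aligned (equal-length) region of two sequences."""
    n = min(len(a), len(b))
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / n

def best_template(target, templates):
    """Return the (name, identity) of the template most similar to the target."""
    scored = {name: percent_identity(target, seq) for name, seq in templates.items()}
    name = max(scored, key=scored.get)
    return name, scored[name]

templates = {"1abc": "MKTAYIAKQR", "2xyz": "MKTAYIVKQW"}  # made-up PDB-style IDs
print(best_template("MKTAYIAKQS", templates))  # → ('1abc', 90.0)
```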

Genome Mapping
Genomic maps serve as a scaffold for orienting sequence information. A few years ago, a researcher wanting to localize a gene, or nucleotide sequence, was forced to manually map the genomic region of interest, a time-consuming and often painstaking process. Today, thanks to new technologies and the influx of sequence data, a number of high-quality, genome-wide maps are available to the scientific community for use in their research. Computerized maps make gene hunting faster, cheaper, and more practical for almost any scientist. In a nutshell, scientists would first use a genetic map to assign a gene to a relatively small area of a chromosome. They would then use a physical map to examine the region of interest close up, to determine a gene's precise location. In light of these advances, a researcher's burden has shifted from mapping a genome or genomic region of interest to navigating a vast number of Web sites and databases.

Map Viewer: A Tool for Visualizing Whole Genomes or Single Chromosomes


NCBI's Map Viewer is a tool that allows a user to view an organism's complete genome, integrated maps for each chromosome (when available), and/or sequence data for a genomic region of interest. When using Map Viewer, a researcher has the option of selecting either a "Whole-Genome View" or a "Chromosome or Map View". The Genome View displays a schematic of all of an organism's chromosomes, whereas the Map View shows one or more detailed maps for a single chromosome. If more than one map exists for a chromosome, Map Viewer can display these maps simultaneously.

Using Map Viewer, researchers can find answers to questions such as:

- Where does a particular gene exist within an organism's genome?
- Which genes are located on a particular chromosome, and in what order?
- What is the corresponding sequence data for a gene that exists in a particular chromosomal region?
- What is the distance between two genes?

The rapidly emerging field of bioinformatics promises to lead to advances in understanding basic biological processes and, in turn, advances in the diagnosis, treatment, and prevention of many genetic diseases. Bioinformatics has transformed the discipline of biology from a purely lab-based science to an information science as well. Increasingly, biological studies begin with a scientist conducting vast numbers of database and Web site searches to formulate specific hypotheses or to design large-scale experiments. The implications behind this change, for both science and medicine, are staggering.

GENOME MAPPING: A GUIDE TO THE GENETIC HIGHWAY WE CALL THE HUMAN GENOME

Imagine you're in a car driving down the highway to visit an old friend who has just moved to Los Angeles. Your favorite tunes are playing on the radio, and you haven't a care in the world. You stop to check your maps and realize that all you have are interstate highway maps, but not a single street map of the area. How will you ever find your friend's house? It's going to be difficult, but eventually, you may stumble across the right house.

This scenario is similar to the situation facing scientists searching for a specific gene somewhere within the vast human genome. They have available to them two broad categories of maps: genetic maps and physical maps. Both genetic and physical maps provide the likely order of items along a chromosome. However, a genetic map, like an interstate highway map, provides an indirect estimate of the distance between two items and is limited to ordering certain items. One could say that genetic maps serve to guide a scientist toward a gene, just like an interstate map guides a driver from city to city. On the other hand, physical maps mark an estimate of the true distance, in measurements called base pairs, between items of interest. To continue our analogy, physical maps would then be similar to street maps, where the distance between two sites of interest may be defined more precisely in terms of city blocks or street addresses. Physical maps, therefore, allow a scientist to more easily home in on the location of a gene. An appreciation of how each of these maps is constructed may be helpful in understanding how scientists use these maps to traverse that genetic highway commonly referred to as the "human genome".
Genetic maps serve to guide a scientist toward a gene, just like an interstate map guides a driver from city to city. Physical maps are more similar to street maps and allow a scientist to more easily home in on a gene's location.

PART I: GENETIC MAPS


Types of Landmarks Found on a Genetic Map

Just as interstate maps have cities and towns that serve as landmarks, genetic maps have landmarks known as genetic markers, or "markers" for short. The term marker is used very broadly to describe any observable variation that results from an alteration, or mutation, at a single genetic locus. A marker may be used as a landmark on a map if, in most cases, that stretch of DNA is inherited from parent to child according to the standard rules of inheritance. Markers can be within genes that code for a noticeable physical characteristic, such as eye color, or a not-so-noticeable trait, such as a disease. DNA-based reagents can also serve as markers. These types of markers are found within the non-coding regions of genes and are used to detect unique regions on a chromosome. DNA markers are especially useful for generating genetic maps when there are occasional, predictable mutations during meiosis (the formation of gametes such as egg and sperm) that, over many generations, lead to a high degree of variability in the DNA content of the marker from individual to individual.

Genetic maps use landmarks called genetic markers to guide researchers on their gene hunt.

Commonly Used DNA Markers


- RFLPs, or restriction fragment length polymorphisms, were among the first DNA markers developed. RFLPs are defined by the presence or absence of a specific site, called a restriction site, for a bacterial restriction enzyme. Such an enzyme breaks apart strands of DNA wherever they contain a certain nucleotide sequence.
- VNTRs, or variable number of tandem repeat polymorphisms, occur in non-coding regions of DNA. This type of marker is defined by the presence of a nucleotide sequence that is repeated several times; the number of times the sequence is repeated may vary.
- Microsatellite polymorphisms are defined by a variable number of repetitions of a very small number of base pairs within a sequence. Often these repeats consist of the nucleotides, or bases, cytosine and adenine. The number of repeats for a given microsatellite may differ between individuals, hence the term polymorphism: the existence of different forms within a population.
- SNPs, or single nucleotide polymorphisms, are individual point mutations, or substitutions of a single nucleotide, that do not change the overall length of the DNA sequence in that region. SNPs occur throughout an individual's genome.
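The allele of a microsatellite marker is, in effect, a repeat count. A small sketch (sequences invented for illustration) of genotyping a (CA)n marker by counting tandem repeats:

```python
# Toy microsatellite "genotyping": count how many times a CA unit repeats
# in tandem at a locus; the repeat count distinguishes alleles.

def count_tandem_repeats(sequence, unit="CA"):
    """Return the maximum number of consecutive copies of `unit` in `sequence`."""
    best = run = 0
    i = 0
    while i <= len(sequence) - len(unit):
        if sequence[i:i + len(unit)] == unit:
            run += 1
            i += len(unit)
        else:
            best = max(best, run)
            run = 0
            i += 1
    return max(best, run)

# Two individuals carrying different alleles of the same (CA)n marker:
print(count_tandem_repeats("GGTCACACACATTG"))      # → 4
print(count_tandem_repeats("GGTCACACACACACATTG"))  # → 6
```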

From Linkage Analysis to Genetic Mapping

Early geneticists recognized that genes are located on chromosomes and believed that each individual chromosome was inherited as an intact unit. They hypothesized that if two genes were located on the same chromosome, they were physically linked together and were inherited together. We now know that this is not always the case. Studies conducted around 1910 demonstrated that very few pairs of genes displayed complete linkage. Pairs of genes were either inherited independently or displayed partial linkage; that is, they were inherited together sometimes, but not always. During meiosis, the process whereby gametes (eggs and sperm) are produced, the two copies of each chromosome pair become physically close. The chromosome arms can then undergo breakage and exchange segments of DNA, a process referred to as recombination or crossing-over. If recombination occurs, each chromosome found in the gamete will consist of a "mixture" of material from both members of the chromosome pair. Thus, recombination events directly affect the inheritance pattern of those genes involved.

It is the behavior of chromosomes during meiosis that determines whether two genes will remain linked.

Because one cannot physically see crossover events, it is difficult to determine with any degree of certainty how many crossovers have actually occurred. But, using the phenomenon of co-segregation of alleles of nearby markers, researchers can reverse-engineer meiosis and identify markers that lie close to each other. Then, using a statistical technique called genetic linkage analysis, researchers can infer a likely crossover pattern, and from that an order of the markers involved. Researchers can also infer an estimate for the probability that a recombination occurs between each pair of markers.

An allele is one of the variant forms of a DNA sequence at a particular locus, or location, on a chromosome. Co-segregation of alleles refers to the movement of each marker during meiosis. If a marker tends to "travel" with the disease status, the markers are said to co-segregate.

If recombination occurs as a random event, then two markers that are close together should be separated less frequently than two markers that are more distant from one another. The recombination probability between two markers, which can range from 0 to 0.5, increases monotonically as the distance between the two markers increases along a chromosome. Therefore, the recombination probability may be used as a surrogate for ordering genetic markers along a chromosome. If you then determine the recombination frequencies for different pairs of markers, you can construct a map of their relative positions on the chromosome.
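This ordering logic can be sketched for three markers: the pair with the largest recombination frequency must sit at the two ends, with the remaining marker between them. Marker names and frequencies below are invented:

```python
# Toy sketch of ordering three genetic markers from pairwise
# recombination frequencies.

def order_three_markers(rf):
    """rf maps frozenset({a, b}) -> recombination frequency; returns (end, middle, end)."""
    markers = set().union(*rf)
    ends = max(rf, key=rf.get)        # the most-separated pair flanks the third
    middle = (markers - ends).pop()
    a, b = sorted(ends)
    return (a, middle, b)

rf = {
    frozenset({"M1", "M2"}): 0.10,
    frozenset({"M2", "M3"}): 0.08,
    frozenset({"M1", "M3"}): 0.17,
}
print(order_three_markers(rf))  # → ('M1', 'M2', 'M3')
```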

Monotonic functions are functions that tend to move only in one direction.

Linkage maps can tell you where markers are in relation to each other on the chromosome, but the actual "mileage" between those markers may not be so well defined.
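One reason the "mileage" is fuzzy is that recombination fractions are not additive, so mapping functions are used to convert them into additive map distances. A sketch using Haldane's mapping function, which assumes crossovers occur independently (an assumption real chromosomes violate):

```python
import math

def haldane_cm(r):
    """Convert a recombination fraction r (0 <= r < 0.5) to centimorgans,
    using Haldane's mapping function d = -(1/2) ln(1 - 2r), scaled to cM."""
    if not 0 <= r < 0.5:
        raise ValueError("recombination fraction must be in [0, 0.5)")
    return -50.0 * math.log(1.0 - 2.0 * r)

# Small r gives distance ~ 100r cM; as r approaches 0.5 the distance diverges.
print(round(haldane_cm(0.01), 2))  # → 1.01
print(round(haldane_cm(0.20), 2))  # → 25.54
```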

But alas, predicting recombination is not so simple. Although crossovers are random, they are not uniformly distributed across the genome or any chromosome. Some chromosomal regions, called recombination hotspots, are more likely to be involved in crossovers than other regions of a chromosome. This means that genetic map distance does not always indicate physical distance between markers. Despite these qualifications, linkage analysis usually correctly deduces marker order, and distance estimates are sufficient to generate genetic maps that can serve as a valuable framework for genome sequencing.

Linkage Studies in Patient Populations: Genetic Maps and Gene Hunting

In humans, data for calculating recombination frequencies are obtained by examining the genetic makeup of the members of successive generations of existing families, termed human pedigree analysis. Linkage studies begin by obtaining blood samples from a group of related individuals. For relatively rare diseases, scientists find a few large families that have many cases of the disease and obtain samples from as many family members as possible. For more common diseases where the pattern of disease inheritance is unclear, scientists will identify a large number of affected families and will take samples from four to thirty close relatives. DNA is then harvested from all of the blood samples and screened for the presence, or co-inheritance, of two markers. One marker is usually the gene of interest, generally associated with a physically identifiable characteristic. The other is usually one of the various detectable rearrangements mentioned earlier, such as a microsatellite. A computerized analysis is then performed to determine whether the two markers are linked and approximately how far apart those markers are from one another. In this case, the value of the genetic map is that an inherited disease can be located on the map by following the inheritance of a DNA marker present in affected individuals but absent in unaffected individuals, although the molecular basis of the disease may not yet be understood, nor the gene(s) responsible identified.

In humans, genetic diseases are frequently used as gene markers, with the disease state being one allele and the healthy state the second allele.
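The computerized analysis referred to above typically rests on a likelihood-ratio statistic, the LOD score. A simplified sketch, assuming the recombinant and non-recombinant meioses can be counted directly (real pedigrees rarely allow this):

```python
import math

def lod_score(nonrecombinants, recombinants, theta):
    """LOD = log10( (1-theta)^NR * theta^R / 0.5^(NR+R) ): the log-likelihood
    ratio of linkage at recombination fraction theta versus no linkage (0.5)."""
    n = nonrecombinants + recombinants
    likelihood_linked = (1 - theta) ** nonrecombinants * theta ** recombinants
    likelihood_unlinked = 0.5 ** n
    return math.log10(likelihood_linked / likelihood_unlinked)

# 18 non-recombinant and 2 recombinant meioses, tested at theta = 0.1;
# by convention, a LOD above 3 is taken as significant evidence of linkage.
print(round(lod_score(18, 2, 0.1), 2))  # → 3.2
```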

Genetic Maps as a Framework for Physical Map Construction

Genetic maps are also used to generate the essential backbone, or scaffold, needed for the creation of more detailed human genome maps. These detailed maps, called physical maps, further define the DNA sequence between genetic markers and are essential to the rapid identification of genes.

PART II: PHYSICAL MAPS

Types of Physical Maps and What They Measure

Physical maps can be divided into three general types: chromosomal or cytogenetic maps, radiation hybrid (RH) maps, and sequence maps. The different types of maps vary in their degree of resolution, that is, the ability to measure the separation of elements that are close together. The higher the resolution, the better the picture.

The lowest-resolution physical map is the chromosomal or cytogenetic map, which is based on the distinctive banding patterns observed by light microscopy of stained chromosomes. As with genetic linkage mapping, chromosomal mapping can be used to locate genetic markers defined by traits observable only in whole organisms. Because chromosomal maps are based on estimates of physical distance, they are considered to be physical maps. Yet the number of base pairs within a band can only be estimated.

RH maps and sequence maps, on the other hand, are more detailed. RH maps are similar to linkage maps in that they show estimates of distance between genetic and physical markers, but that is where the similarity ends. RH maps provide more precise information about the distance between markers than a linkage map can. The physical map that provides the most detail is the sequence map. Sequence maps show genetic markers, as well as the sequence between the markers, measured in base pairs.

How Are Physical Maps Made and Used?


RH Mapping

RH mapping, like linkage mapping, shows an estimated distance between genetic markers. But, rather than relying on natural recombination to separate two markers, scientists use breaks induced by radiation to determine the distance between two markers. In RH mapping, a scientist exposes DNA to measured doses of radiation, and in doing so, controls the average distance between breaks in a chromosome. By varying the degree of radiation exposure to the DNA, a scientist can induce breaks between two markers that are very close together. The ability to separate closely linked markers allows scientists to produce more detailed maps. RH mapping provides a way to localize almost any genetic marker, as well as other genomic fragments, to a defined map position, and RH maps are extremely useful for ordering markers in regions where highly polymorphic genetic markers are scarce.

Polymorphic refers to the existence of two or more forms of the same gene, or genetic marker, with each form being too common in a population to be merely attributable to a new mutation. Polymorphism is a useful genetic marker because it enables researchers to sometimes distinguish which allele was inherited.

Scientists also use RH maps as a bridge between linkage maps and sequence maps. In doing so, they have been able to more easily identify the location(s) of genes involved in diseases such as spinal muscular atrophy and hyperekplexia, more commonly known as "startle disease".

Sequence Mapping

Sequence tagged site (STS) mapping is another physical mapping technique. An STS is a short DNA sequence that has been shown to be unique. To qualify as an STS, the exact location and order of the bases of the sequence must be known, and the sequence must occur only once in the chromosome being studied, or in the genome as a whole if the DNA fragment set covers the entire genome.

Common Sources of STSs


- Expressed sequence tags (ESTs) are short sequences obtained by analysis of complementary DNA (cDNA) clones. Complementary DNA is prepared by converting mRNA into double-stranded DNA and is thought to represent the sequences of the genes being expressed. An EST can be used as an STS if it comes from a unique gene and not from a member of a gene family in which all of the genes have the same, or similar, sequences.
- Simple sequence length polymorphisms (SSLPs) are arrays of repeat sequences that display length variations. SSLPs that are polymorphic and have already been mapped by linkage analysis are particularly valuable because they provide a connection between genetic and physical maps.
- Random genomic sequences are obtained by sequencing random pieces of cloned genomic DNA or by examining sequences already deposited in a database.

To map a set of STSs, a collection of overlapping DNA fragments from a chromosome is digested into smaller fragments using restriction enzymes, agents that cut up DNA molecules at defined target points. The data from which the map will be derived are then obtained by noting which fragments contain which STSs. To accomplish this, scientists copy the DNA fragments using a process known as "molecular cloning". Cloning involves the use of a special technology, called recombinant DNA technology, to copy DNA fragments inside a foreign host. First, the fragments are united with a carrier, also called a vector. After introduction into a suitable host, the DNA fragments can then be reproduced along with the host cell DNA, providing unlimited material for experimental study. An unordered set of cloned DNA fragments is called a library.

Next, the clones, or copies, are assembled in the order they would be found in the original chromosome by determining which clones contain overlapping DNA fragments. This assembly of overlapping clones is called a clone contig. Once the order of the clones in a chromosome is known, the clones are placed in frozen storage, and the information about the order of the clones is stored in a computer, providing a valuable resource that may be used for further studies. These data are then used as the base material for generating a lengthy, continuous DNA sequence, and the STSs serve to anchor the sequence onto a physical map.
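The clone-grouping idea can be sketched in miniature: clones that share an STS must overlap, so linking clones through shared STSs collects them into contigs. All clone and STS names below are invented:

```python
# Toy STS-content mapping: group clones into contigs, where clones sharing
# at least one STS are assumed to overlap on the chromosome.

def build_contigs(clones):
    """Group clone names into contigs (sets of clones linked by shared STSs)."""
    contigs = []
    unassigned = set(clones)
    while unassigned:
        seed = unassigned.pop()
        contig, frontier = {seed}, [seed]
        while frontier:
            current = frontier.pop()
            for other in list(unassigned):
                if clones[current] & clones[other]:  # share at least one STS
                    unassigned.remove(other)
                    contig.add(other)
                    frontier.append(other)
        contigs.append(contig)
    return contigs

clones = {
    "cloneA": {"sts1", "sts2"},
    "cloneB": {"sts2", "sts3"},  # overlaps cloneA via sts2
    "cloneC": {"sts3", "sts4"},  # overlaps cloneB via sts3
    "cloneD": {"sts9"},          # shares nothing: a separate contig (a gap)
}
print(sorted(sorted(c) for c in build_contigs(clones)))
# → [['cloneA', 'cloneB', 'cloneC'], ['cloneD']]
```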

The Need to Integrate Physical and Genetic Maps

As with most complex techniques, STS-based mapping has its limitations. In addition to gaps in clone coverage, DNA fragments may become lost or mistakenly mapped to a wrong position. These errors may occur for a variety of reasons. A DNA fragment may break, resulting in an STS that maps to a different position. DNA fragments may also get deleted from a clone during the replication process, resulting in the absence of an STS that should be present. Sometimes a clone composed of DNA fragments from two distinct genomic regions is replicated, leading to DNA segments that are widely separated in the genome being mistakenly mapped to adjacent positions. Lastly, a DNA fragment may become contaminated with host genetic material, once again leading to an STS that will map to the wrong location. To help overcome these problems, as well as to improve overall mapping accuracy, researchers have begun comparing and integrating STS-based physical maps with genetic, RH, and cytogenetic maps. Cross-referencing different genomic maps enhances the utility of a given map, confirms STS order, and helps order and orient evolving contigs.

NCBI and Map Integration

Comparing the many available genetic and physical maps can be a time-consuming step, especially when trying to pinpoint the location of a new gene. Without the use of computers and special software designed to align the various maps, matching a sequence to a region of a chromosome that corresponds to the gene location would be very difficult. It would be like trying to compare 20 different interstate and street maps to get from a house in Ukiah, California to a house in Beaver Dam, Wisconsin. You could compare the maps yourself and create your own travel itinerary, but it would probably take a long time. Wouldn't it be easier and faster to have the automobile club create an integrated map for you? That is the goal behind NCBI's Human Genome Map Viewer.

NCBI's Map Viewer: A Tool for Integrating Genetic and Physical Maps

The NCBI Map Viewer provides a graphical display of the available human genome sequence data as well as sequence, cytogenetic, genetic linkage, and RH maps. Map Viewer can simultaneously display up to seven maps, selected from a large set of maps, and allows the user access to detailed information for a selected map region. Map Viewer uses a common sequence numbering system to align sequence maps and shared markers as well as gene names to align other maps. You can use NCBI's Map Viewer to search for a gene in a number of genomes, by choosing an organism from the Map Viewer home page.

Map Viewer Getting Started


Need help using the NCBI Map Viewer? Try GETTING STARTED, a quick "how-to guide" on NCBI data mining tools designed for the novice user. GETTING STARTED using Map Viewer provides:

- information on how you can use Map Viewer
- descriptions of the Map Viewer layout
- step-by-step information on using the Map Viewer
- shortcuts for getting to where you need to go

MOLECULAR MODELING: A METHOD FOR UNRAVELING PROTEIN STRUCTURE AND FUNCTION


Proteins are fundamental components of all living cells. They exhibit an enormous amount of chemical and structural diversity, enabling them to carry out an extraordinarily diverse range of biological functions. Proteins help us digest our food, fight infections, control body chemistry, and, in general, keep our bodies functioning smoothly. Scientists know that the critical feature of a protein is its ability to adopt the right shape for carrying out a particular function. But sometimes a protein twists into the wrong shape or has a missing part, preventing it from doing its job. Many diseases, such as Alzheimer's and "mad cow" disease, are now known to result from proteins that have adopted an incorrect structure. Identifying a protein's shape, or structure, is key to understanding its biological function and its role in health and disease. Illuminating a protein's structure also paves the way for the development of new agents and devices to treat a disease. Yet solving the structure of a protein is no easy feat. It often takes scientists working in the laboratory months, sometimes years, to experimentally determine a single structure. Therefore, scientists have begun to turn toward computers to help predict the structure of a protein based on its sequence. The challenge lies in developing methods for accurately and reliably understanding this intricate relationship.

Proteins form our bodies and help direct their many systems.

Levels of Protein Structure


To produce proteins, cellular structures called ribosomes join together long chains of subunits. A set of 20 different subunits, called amino acids, can be arranged in any order to form a polypeptide that can be thousands of amino acids long. These chains can then loop about each other, or fold, in a variety of ways, but only one of these ways allows a protein to function properly. The critical feature of a protein is its ability to fold into a conformation that creates structural features, such as surface grooves, ridges, and pockets, which allow it to fulfill its role in a cell. A protein's conformation is usually described in terms of levels of structure. Traditionally, proteins are looked upon as having four distinct levels of structure, with each level of structure dependent on the one below it. In some proteins, functional diversity may be further amplified by the addition of new chemical groups after synthesis is complete.

The stringing together of the amino acid chain to form a polypeptide is referred to as the primary structure. The secondary structure is generated by the folding of the primary sequence and refers to the path that the polypeptide backbone of the protein follows in space. Certain types of secondary structure are relatively common; two well-described secondary structures are the alpha helix and the beta sheet. In the first case, certain types of bonding between groups located on the same polypeptide chain cause the backbone to twist into a helix, most often in a form known as the alpha helix. Beta sheets are formed when a polypeptide chain bonds with another chain running in the opposite direction. Beta sheets may also be formed between two sections of a single polypeptide chain arranged such that adjacent regions are in reverse orientation.

The tertiary structure describes the organization in three dimensions of all of the atoms in the polypeptide. If a protein consists of only one polypeptide chain, this level describes the complete structure. Multimeric proteins, or proteins that consist of more than one polypeptide chain, require a higher level of organization. The quaternary structure defines the conformation assumed by a multimeric protein; in this case, the individual polypeptide chains that make up the protein are often referred to as its subunits. The four levels of protein structure are hierarchical; that is, each level of the build process depends on the one below it.

Proteins function through their conformation.

Proteins can be divided into two general classes based on their tertiary structure. Fibrous proteins have elongated structures, with the polypeptide chains arranged in long strands. This class of proteins serves as major structural components of cells, and therefore their role tends to be static in providing a structural framework. Globular proteins have more compact, often irregular structures. This class of proteins includes most enzymes and most of the proteins involved in gene expression and regulation.

How Do Proteins Acquire Their Correct Conformations?


A protein's primary amino acid sequence is crucial in determining its final structure. In some cases, the amino acid sequence is the sole determinant, whereas in other cases, additional interactions may be required before a protein can attain its final conformation. For example, some proteins require the presence of a cofactor, or a second molecule that is part of the active protein, before they can attain their final conformation. Multimeric proteins often require one or more subunits to be present for another subunit to adopt the proper higher-order structure. In any case, as we stated earlier, the entire process is cooperative; that is, the formation of one region of secondary structure determines the formation of the next region.

Allosteric Proteins

Under certain conditions, a protein may have a stable alternate conformation, or shape, that enables it to carry out a different biological function. Proteins that exhibit this characteristic are called allosteric. The interaction of an allosteric protein with a specific cofactor, or with another protein, may influence the transition of the protein between shapes. In addition, any change in conformation brought about by an interaction at one site may lead to an alteration in the structure, and thus function, at another site. One should bear in mind, though, that this type of transition affects only the protein's shape, not the primary amino acid sequence. Allosteric proteins play an important role in both metabolic and genetic regulation.
Allosteric proteins can change their shape and function depending on the environmental conditions in which they are found.

Determining Protein Structure


Traditionally, a protein's structure was determined using one of two techniques: X-ray crystallography or nuclear magnetic resonance (NMR) spectroscopy.

X-ray Crystallography

When performing this technique, the molecule under study must first be crystallized, and the crystals must be singular and of perfect quality, a time-consuming and difficult task.

Crystals are a solid form of a substance in which the component molecules are present in an ordered array called a lattice. The basic building block of a crystal is called a unit cell. Each unit cell contains exactly one unique set of the crystal's components, the smallest possible set that is fully representative of the crystal. Crystals of a complex molecule, like a protein, produce a complex pattern of X-ray diffraction, or scattering of X-rays. When the crystal is placed in an X-ray beam, all of the unit cells present the same face to the beam; therefore, many molecules are in the same orientation with respect to the incoming X-rays. The X-ray beam enters the crystal, and a number of smaller beams emerge, each one in a different direction and each one with a different intensity. If an X-ray detector, such as a piece of film, is placed on the opposite side of the crystal from the X-ray source, each diffracted ray, called a reflection, will produce a spot on the film. However, because only a few reflections can be detected with any one orientation of the crystal, an important component of any X-ray diffraction instrument is a device for accurately setting and changing the orientation of the crystal. The set of diffracted, emerging beams contains information about the underlying crystal structure. If we could use light instead of X-rays, we could set up a system of lenses to recombine the beams emerging from the crystal and thus bring into focus an enlarged image of the unit cell and the molecules within it. But the molecules do not diffract visible light, and X-rays, unlike light, cannot be focused with lenses. However, the scientific laws that lenses obey are well understood, and it is possible to calculate the molecular image with a computer. In effect, the computer mimics the action of a lens.

The major drawback associated with this technique is that crystallization of the proteins is a difficult task. Crystals are formed by slowly precipitating proteins under conditions that maintain their native conformation, or structure. These exact conditions can only be discovered by repeated trials that entail varying certain experimental conditions, one at a time. This is a very time-consuming and tedious process; in some cases, the task of crystallizing a protein borders on the impossible.
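The angles at which reflections emerge from the lattice are governed by Bragg's law, n·λ = 2d·sin θ, which relates the X-ray wavelength λ and the lattice spacing d to the diffraction angle θ. A minimal sketch of this relationship; the wavelength and spacing values below are illustrative, not taken from any particular experiment:

```python
import math

def bragg_angle_deg(wavelength, spacing, order=1):
    """Return the Bragg angle (in degrees) for a reflection of the given order.

    Bragg's law: order * wavelength = 2 * spacing * sin(theta)
    """
    sin_theta = order * wavelength / (2 * spacing)
    if not 0 < sin_theta <= 1:
        raise ValueError("no reflection is possible for these parameters")
    return math.degrees(math.asin(sin_theta))

# Copper K-alpha radiation (~1.54 angstroms) against a 3.5-angstrom spacing
theta = bragg_angle_deg(1.54, 3.5)  # roughly 12.7 degrees
```

Measuring many such reflection angles and intensities, over many crystal orientations, supplies the data from which the computer reconstructs the molecular image.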

Nuclear Magnetic Resonance (NMR) Spectroscopy

The basic phenomenon of NMR spectroscopy was discovered in 1945. In this technique, a sample is immersed in a magnetic field and bombarded with radio waves. These radio waves encourage the nuclei of the molecule to resonate, or spin. As the positively charged nucleus spins, the moving charge creates what is called a magnetic moment. The thermal motion of the molecule (the movement of the molecule associated with the temperature of the material) further creates a torque, or twisting force, that makes the magnetic moment "wobble" like a child's top. When the radio waves hit the spinning nuclei, they tilt even more, sometimes flipping over. These resonating nuclei emit a unique signal that is then picked up on a special radio receiver and translated using a decoder. This decoder is the Fourier transform, a mathematical operation that translates the signal from the nuclei into something a scientist can understand. By measuring the frequencies at which different nuclei flip, scientists can determine molecular structure, as well as many other interesting properties of the molecule. In the past 10 years, NMR has proven to be a powerful alternative to X-ray crystallography for the determination of molecular structure. NMR has the advantage over crystallographic techniques in that experiments are performed in solution as opposed to a crystal lattice. However, the principles that make NMR possible tend to make this technique very time consuming and limit its application to small- and medium-sized molecules.

Solution NMR is performed on a solution of macromolecules while the molecules tumble and vibrate with thermal motion.
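The Fourier transform's job here is to convert the time-domain signal picked up by the receiver into a spectrum of resonance frequencies. A toy illustration using a plain discrete Fourier transform on a synthetic single-frequency signal (real NMR processing uses the fast Fourier transform on far larger, decaying signals):

```python
import cmath
import math

def dft_magnitudes(samples):
    """Discrete Fourier transform magnitudes |X[k]| of a real-valued signal."""
    n = len(samples)
    return [abs(sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n)]

# Synthetic received signal: a single resonance at 5 cycles per record
n = 64
signal = [math.sin(2 * math.pi * 5 * t / n) for t in range(n)]
spectrum = dft_magnitudes(signal)

# The dominant peak (ignoring the mirror-image upper half of the spectrum)
# sits at frequency bin 5, recovering the resonance frequency
peak_bin = max(range(1, n // 2), key=lambda k: spectrum[k])
```

In a real experiment, each peak in the spectrum corresponds to a nucleus in a particular chemical environment, and the pattern of peaks constrains the molecular structure.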

The Advent of Computational Modeling


Researchers have been working for decades to develop procedures for predicting protein structure that are not so time consuming and that are not hindered by size and solubility constraints. To do this, researchers have turned to computers for help in predicting protein structure from gene sequences, a concept called homology modeling. The complete genomes of various organisms, including humans, have now been decoded and allow researchers to approach this goal in a logical and organized fashion. Before we go any further, it is important to define some common terminology used in this field.

Common Terminology Used in Homology Modeling

Folding motifs are independent folding units, or particular structures, that recur in many molecules.
Domains are the building blocks of a protein and are considered elementary units of molecular function.
Families are groups of proteins that demonstrate sequence homology, that is, have similar sequences.
Superfamilies consist of proteins that have similar folding motifs but do not exhibit sequence similarity.

Some Basic Theory

It is theorized that proteins that share a similar sequence generally share the same basic structure. Therefore, by experimentally determining the structure for one member of a protein family, called a target, researchers have a model on which to base the structure of other proteins within that family. Moving a step further, by selecting a target from each superfamily, researchers can study the universe of protein folds in a systematic fashion and outline a set of sequences associated with each folding motif. Many of these sequences may not demonstrate a resemblance to one another, but their identification and assignment to a particular fold is essential for predicting future protein structures using homology modeling.

The scientific basis for these theories is that a strong conservation of protein three-dimensional shape across large evolutionary distances (both within single species and between species, and in spite of sequence variation) has been demonstrated again and again. Although most scientists choose high-priority structures as their targets, this theory provides the option to choose any one of the proteins within a family as the target, rather than trying to achieve experimental results using a protein that is particularly difficult to work with using crystallographic or NMR techniques.

A computer-generated image of a protein's structure shows the relative locations of most, if not all, of the protein's thousands of atoms. The image also reveals the physical, chemical, and electrical properties of the protein and provides clues about its role in the body.

Steps for Maximizing Results

Specific tasks must be carried out to maximize results when determining protein structure using homology modeling. First, protein sequences must be organized in terms of families, preferably in a searchable database, and a target must be selected. Protein families can be identified and organized by comparing protein sequences derived from completely sequenced genomes. Targets may be selected from families that do not exhibit apparent sequence homology to proteins with a known three-dimensional structure. Next, researchers must generate a purified protein for analysis of the chosen target and then experimentally determine the target's structure, by X-ray crystallography, NMR, or both. Target structures determined experimentally may then be further analyzed to evaluate their similarity to other known protein structures and to determine possible evolutionary relationships that are not identifiable from protein sequence alone. The target structure will also serve as a detailed model for determining the structure of other proteins within that family. In favorable cases, just knowing the structure of a particular protein may also provide considerable insight into its possible function.

PDB: The Protein Data Bank


The PDB was the first "bioinformatics" database ever built and is designed to store complex three-dimensional data. The PDB was originally developed and housed at the Brookhaven National Laboratories but is now managed and maintained by the Research Collaboratory for Structural Bioinformatics (RCSB). The PDB is a collection of all publicly available three-dimensional structures of proteins, nucleic acids, carbohydrates, and a variety of other complexes experimentally determined by X-ray crystallography and NMR.

PDB is supported by funds from the National Science Foundation, the Department of Energy, and two units of the National Institutes of Health: the National Institute of General Medical Sciences and the National Library of Medicine.

Protein Modeling at NCBI


The Molecular Modeling Database

NCBI's Molecular Modeling Database (MMDB), an integral part of our Entrez information retrieval system, is a compilation of all of the PDB three-dimensional structures of biomolecules. The difference between the two databases is that MMDB records reorganize and validate the information in a way that enables cross-referencing between the chemistry and the three-dimensional structure of macromolecules. By integrating chemical, sequence, and structure information, MMDB is designed to serve as a resource for structure-based homology modeling and protein structure prediction.

MMDB records have value-added information compared to the original PDB entries, including explicit chemical graph information, uniformly derived secondary structure definitions, structure domain information, literature citation matching, and molecule-based assignment of taxonomy to each biologically derived protein or nucleic acid chain.

NCBI has also developed a three-dimensional structure viewer, called Cn3D, for easy interactive visualization of molecular structures from Entrez. Cn3D serves as a visualization tool for sequences and sequence alignments. What sets Cn3D apart from other software is its ability to correlate structure and sequence information. For example, using Cn3D, a scientist can quickly locate the residues in a crystal structure that correspond to known disease mutations or to conserved active-site residues from a family of sequence homologs, or sequences that share a common ancestor. Cn3D displays structure-structure alignments along with the corresponding structure-based sequence alignments to emphasize those regions within a group of related proteins that are most conserved in structure and sequence. Cn3D also features custom labeling options, high-quality graphics, and a variety of file exports that together make it a powerful tool for literature annotation.

PDBeast: Taxonomy in MMDB

Taxonomy is the scientific discipline that seeks to catalog and reconstruct the evolutionary history of life on Earth. NCBI's Structure Group, in collaboration with NCBI's taxonomists, has undertaken taxonomy annotation for the structure data stored in MMDB. A semi-automated approach has been implemented in which a human expert checks, corrects, and validates automatic taxonomic assignments. The PDBeast software tool, developed by NCBI for this purpose, pulls text descriptions of "source organisms" from either the original PDB entries or user-specified information and looks for matches in the NCBI Taxonomy database to record taxonomy assignments.

The Role of Taxonomy

Taxonomy provides a vivid picture of the existing organic diversity of the earth.
Taxonomy provides much of the information permitting a reconstruction of the phylogeny of life.
Taxonomy reveals numerous interesting evolutionary phenomena.
Taxonomy supplies classifications that are of great explanatory value in most branches of biology.

COGs: Phylogenetic Classification of Proteins

The database of Clusters of Orthologous Groups of proteins (COGs) represents an attempt at a phylogenetic classification of the proteins encoded in complete genomes, a scheme that reflects the evolutionary relationships between organisms. Each COG includes proteins that are thought to be orthologous. Orthologs are genes in different species that derive from a common ancestor and have been carried on through evolution. COGs may be used to detect similarities and differences between species, to identify protein families, to predict new protein functions, and to point to potential drug targets in disease-causing species. The database is accompanied by the COGnitor program, which assigns new proteins, typically from newly sequenced genomes, to pre-existing COGs. A Web page containing additional structural and functional information is now associated with each COG. These hyperlinked information pages include:

systematic classification of the COG members under different classification systems
indications of which COG members (if any) have been characterized genetically and biochemically
information on the domain architecture of the proteins constituting the COG, and the three-dimensional structure of the domains if known or predictable
a succinct summary of the common structural and functional features of the COG members, as well as peculiarities of individual members
key references

Detecting New Sequence Similarities: BLAST against MMDB

Comparison, whether of structural features or protein sequences, lies at the heart of biology. The introduction of BLAST, the Basic Local Alignment Search Tool, in 1990 made it easier to rapidly scan huge databases for overt homologies, or sequence similarities, and to statistically evaluate the resulting matches. BLAST works by comparing a user's unknown sequence against the database of all known sequences to determine likely matches. Sequence similarities found by BLAST have been critical in several gene discoveries. Hundreds of major sequencing centers and research institutions around the country use this software to transmit a query sequence from their local computer to a BLAST server at NCBI via the Internet. In a matter of seconds, the BLAST server compares the user's sequence with up to a million known sequences and determines the closest matches.

The journal article describing the original algorithm used in BLAST has since become one of the most frequently cited papers of the decade, with over 10,000 citations.

Not all significant homologies are readily and easily detected, however. Some of the most interesting are subtle similarities that do not always rise to statistical significance during a standard BLAST search. Therefore, NCBI has extended the statistical methodology used in the original BLAST to address the problem of detecting weak, yet significant, sequence similarities. PSI-BLAST, or Position-Specific Iterated BLAST, searches sequence databases with a profile constructed using BLAST alignments, from which it constructs what is called a position-specific score matrix. For protein analysis, the newer Pattern Hit Initiated BLAST, or PHI-BLAST, complements the profile-based searching introduced with PSI-BLAST. PHI-BLAST further incorporates hypotheses as to the biological function of a query sequence and restricts the analysis to a set of protein sequences already known to contain a specific pattern or motif.
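The position-specific score matrix at the heart of PSI-BLAST can be illustrated with a toy version: for each column of an alignment, score each residue by the log-odds of its observed frequency against a background frequency. The uniform background and simple pseudocount below are simplifications for illustration, not PSI-BLAST's actual statistics:

```python
import math
from collections import Counter

def toy_pssm(alignment, background=0.05, pseudocount=1.0):
    """Build a toy position-specific score matrix from aligned sequences.

    Returns one {residue: log-odds score} dict per alignment column.
    """
    nseqs = len(alignment)
    matrix = []
    for col in range(len(alignment[0])):
        counts = Counter(seq[col] for seq in alignment)
        # Log-odds of the (pseudocount-smoothed) column frequency vs background
        scores = {res: math.log2(((count + pseudocount) /
                                  (nseqs + 20 * pseudocount)) / background)
                  for res, count in counts.items()}
        matrix.append(scores)
    return matrix

# A hypothetical four-sequence family alignment, four columns long
alignment = ["ACDE", "ACDF", "ACEE", "GCDE"]
pssm = toy_pssm(alignment)
# Column 1 is an invariant cysteine, so "C" scores highest there
```

Scoring a new sequence against such a matrix, column by column, is what lets profile searches detect family members whose pairwise similarity to the original query is too weak for standard BLAST.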

BLAST now comes in several varieties in addition to those described above. Specialized BLASTs are also available for human, microbial, and other genomes, as well as for vector contamination, immunoglobulins, and tentative human consensus sequences.
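All of these variants share BLAST's core heuristic: find short exact word matches, or "seeds", between the query and a database sequence, then extend the promising ones into alignments. A minimal sketch of the seeding step; the word size and sequences are arbitrary examples:

```python
def find_seeds(query, subject, word_size=3):
    """Return (query_pos, subject_pos) pairs where a word of the query
    matches the subject exactly -- the starting points BLAST would extend."""
    # Index every word of the subject by its starting position
    words = {}
    for j in range(len(subject) - word_size + 1):
        words.setdefault(subject[j:j + word_size], []).append(j)
    # Look up each word of the query in that index
    seeds = []
    for i in range(len(query) - word_size + 1):
        for j in words.get(query[i:i + word_size], []):
            seeds.append((i, j))
    return seeds

seeds = find_seeds("ACGTTGCA", "TTACGTTG")
# The shared run ACGTTG yields a diagonal of overlapping seeds
```

Indexing words first makes each lookup constant time, which is why this strategy scales to databases of millions of sequences; the statistical evaluation then decides which extended seeds count as significant matches.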

Structure Similarity Searching Using VAST

As just noted, a sequence-sequence similarity program provides an alignment of two sequences. A structure-structure similarity program, by contrast, provides a three-dimensional structure superposition. Structure similarity search services are based on the premise that some measure can be computed between two structures to assess their similarity, much the same way a BLAST alignment is scored. VAST, or the Vector Alignment Search Tool, is a computer algorithm developed at NCBI for identifying similar three-dimensional protein structures. VAST is capable of detecting structural similarities between proteins stored in MMDB even when no sequence similarity is detected.
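One widely used measure of how well two superposed structures agree is the root-mean-square deviation (RMSD) between corresponding atom coordinates. A minimal sketch of that calculation; note that VAST itself works on vectors derived from secondary-structure elements rather than this raw per-atom comparison:

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length lists of (x, y, z),
    assumed to be already superposed and in corresponding order."""
    assert len(coords_a) == len(coords_b)
    total = sum((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
                for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))

# Two hypothetical three-residue backbone traces, differing by 0.1 in y
trace_a = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
trace_b = [(0.0, 0.1, 0.0), (1.5, 0.1, 0.0), (3.0, 0.1, 0.0)]
deviation = rmsd(trace_a, trace_b)  # 0.1: nearly identical backbones
```

A low RMSD over a long stretch of aligned residues is the kind of signal a structure neighbor service reports, even when the underlying sequences share no detectable similarity.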

VAST Search is NCBI's structure-structure similarity search service, which compares the three-dimensional coordinates of newly determined protein structures to those in the MMDB or PDB databases. VAST Search creates a list of structure neighbors, or related structures, that a user can then browse interactively. VAST Search will retrieve almost all structures with an identical three-dimensional fold, although it may occasionally miss a few structures or report chance similarities.

The detection of structural similarity in the absence of obvious sequence similarity is a powerful tool to study remote homologies and protein evolution.

The Conserved Domain Database

The Conserved Domain Database (CDD) is a collection of sequence alignments and profiles representing protein domains conserved in molecular evolution. It includes domains from SMART and Pfam, two popular Web-based tools for studying sequence domains, as well as domains contributed by NCBI researchers. CD-Search, another NCBI search service, can be used to identify conserved domains in a protein query sequence. CD-Search uses RPS-BLAST to compare a query sequence against score matrices that have been prepared from the conserved domain alignments present in CDD. Alignments are also mapped to known three-dimensional structures and can be displayed using Cn3D (see above).

Conserved Domain Architecture Retrieval Tool

NCBI's Conserved Domain Architecture Retrieval Tool (CDART) displays the functional domains that make up a protein and lists other proteins with similar domain architectures. CDART determines the domain architecture of a query protein sequence by comparing it to CDD, a database of conserved domain alignments, using RPS-BLAST. CDART then compares the protein's domain architecture to that of other proteins in NCBI's non-redundant sequence database. Related sequences are identified as those proteins that share one or more similar domains. CDART displays these sequences using a graphical summary that depicts the types and locations of the domains identified within each sequence. Links to the individual sequences, as well as to further information on their domain architectures, are also provided. Because protein domains may be considered elementary units of molecular function, and because proteins related by domain architecture may play similar roles in cellular processes, CDART serves as a useful tool in comparative sequence analysis.
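The retrieval logic just described can be sketched as a comparison of domain lists: a database protein is related to the query if the two architectures share at least one domain. The protein names and domain labels below are invented Pfam/SMART-style examples, not real database records:

```python
def shared_domains(arch_a, arch_b):
    """Domains that two architectures (ordered lists of domain names) share."""
    return set(arch_a) & set(arch_b)

def related_architectures(query_arch, database):
    """Names of database proteins sharing at least one domain with the query."""
    return [name for name, arch in database.items()
            if shared_domains(query_arch, arch)]

# A toy "non-redundant database" of proteins with their domain architectures
database = {
    "receptor_kinase":    ["Ig", "Ig", "TM", "Pkinase"],
    "cytoplasmic_kinase": ["SH3", "SH2", "Pkinase"],
    "adhesion_protein":   ["Ig", "Ig", "Fn3"],
    "protease":           ["Trypsin"],
}

# A query with an SH2 domain and a kinase domain hits both kinases
hits = related_architectures(["SH2", "Pkinase"], database)
```

A fuller version would rank hits by how many domains they share and preserve domain order, but the set intersection captures the core idea of architecture-based retrieval.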

RPS-BLAST is a "reverse" version of PSI-BLAST, which is described above. Both RPS-BLAST and PSI-BLAST use similar methods to derive conserved features of a protein family. However, RPS-BLAST compares a query sequence against a database of profiles prepared from ready-made alignments, whereas PSI-BLAST builds alignments starting from a single protein sequence. The programs also differ in purpose: RPS-BLAST is used to identify conserved domains in a query sequence, whereas PSI-BLAST is used to identify other members of the protein family to which a query sequence belongs.

Application to Biomedicine
Although the information derived from modeling studies is primarily about molecular function, protein structure data also provide a wealth of information on mechanisms linked to the function and the evolutionary history of and relationships between macromolecules. NCBI's goals in adding structure data to its Web site are to make this information easily accessible to the biomedical community worldwide and to facilitate comparative analysis involving three-dimensional structure.

SNPs: VARIATIONS ON A THEME


Wouldn't it be wonderful if you knew exactly what measures you could take to stave off, or even prevent, the onset of disease? Wouldn't it be a relief to know that you are not allergic to the drugs your doctor just prescribed? Wouldn't it be a comfort to know that the treatment regimen you are undergoing has a good chance of success because it was designed just for you? With the availability of millions of SNPs, biomedical researchers now believe that such exciting medical advances are not that far away.

What Are SNPs and How Are They Found?


A Single Nucleotide Polymorphism, or SNP (pronounced "snip"), is a small genetic change, or variation, that can occur within a person's DNA sequence. The genetic code is specified by the four nucleotide "letters": A (adenine), C (cytosine), T (thymine), and G (guanine). SNP variation occurs when a single nucleotide, such as an A, replaces one of the other three nucleotide letters: C, G, or T.
Although many SNPs do not produce physical changes in people, scientists believe that other SNPs may predispose people to disease and even influence their response to drug regimens.

An example of a SNP is the alteration of the DNA segment AAGGTTA to ATGGTTA, where the second "A" in the first snippet is replaced with a "T". To be considered a SNP, a variation generally must occur in at least 1 percent of the population. Because only about 3 to 5 percent of a person's DNA sequence codes for the production of proteins, most SNPs are found outside of "coding sequences". SNPs found within a coding sequence are of particular interest to researchers because they are more likely to alter the biological function of a protein. Because of recent advances in technology, coupled with the unique ability of these genetic variations to facilitate gene identification, there has been a recent flurry of SNP discovery and detection.
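Locating the variant position in an example like this is a simple pairwise comparison of aligned sequences:

```python
def snp_positions(seq_a, seq_b):
    """Return the 1-based positions at which two aligned, equal-length
    DNA sequences differ by a single-letter substitution."""
    assert len(seq_a) == len(seq_b)
    return [i + 1 for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]

# The example from the text: AAGGTTA vs. ATGGTTA differ at position 2 (A -> T)
positions = snp_positions("AAGGTTA", "ATGGTTA")  # [2]
```

Real SNP detection must first align reads from many individuals and distinguish true variants from sequencing errors, but the underlying comparison is this letter-by-letter check.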

Needles in a Haystack
Finding single nucleotide changes in the human genome seems like a daunting prospect, but over the last 20 years, biomedical researchers have developed a number of techniques that make it possible to do just that. Each technique uses a different method to compare selected regions of a DNA sequence obtained from multiple individuals who share a common trait. In each test, the result shows a physical difference in the DNA samples only when a SNP is detected in one individual and not in the other.
As a result of recent advances in SNPs research, diagnostics for many diseases may improve.

Many common diseases in humans are not caused by a genetic variation within a single gene but are influenced by complex interactions among multiple genes, as well as environmental and lifestyle factors. Although both environmental and lifestyle factors add tremendously to the uncertainty of developing a disease, it is currently difficult to measure and evaluate their overall effect on a disease process. Therefore, we refer here mainly to a person's genetic predisposition, or the potential of an individual to develop a disease based on genes and hereditary factors. Genetic factors may also confer susceptibility or resistance to a disease and determine the severity or progression of disease. Because we do not yet know all of the factors involved in these intricate pathways, researchers have found it difficult to develop screening tests for most diseases and disorders. By studying stretches of DNA that have been found to harbor a SNP associated with a disease trait, researchers may begin to reveal the relevant genes associated with a disease. Defining and understanding the role of genetic factors in disease will also allow researchers to better evaluate the role that non-genetic factors (such as behavior, diet, lifestyle, and physical activity) have on disease. Because genetic factors also affect a person's response to drug therapy, DNA polymorphisms such as SNPs will be useful in helping researchers determine and understand why individuals differ in their abilities to absorb or clear certain drugs, as well as why an individual may experience an adverse side effect to a particular drug. Therefore, the recent discovery of SNPs promises to revolutionize not only the process of disease detection but also the practice of preventive and curative medicine.

SNPs and Disease Diagnosis


It will only be a matter of time before physicians can screen patients for susceptibility to a disease by analyzing their DNA for specific SNP profiles.

Each person's genetic material contains a unique SNP pattern that is made up of many different genetic variations. Researchers have found that most SNPs are not responsible for a disease state. Instead, they serve as biological markers for pinpointing a disease on the human genome map, because they are usually located near a gene found to be associated with a certain disease. Occasionally, a SNP may actually cause a disease and, therefore, can be used to search for and isolate the disease-causing gene.

To create a genetic test that will screen for a disease in which the disease-causing gene has already been identified, scientists collect blood samples from a group of individuals affected by the disease and analyze their DNA for SNP patterns. Next, researchers compare these patterns to patterns obtained by analyzing the DNA from a group of individuals unaffected by the disease. This type of comparison, called an "association study", can detect differences between the SNP patterns of the two groups, thereby indicating which pattern is most likely associated with the disease-causing gene. Eventually, SNP profiles that are characteristic of a variety of diseases will be established. Then, it will only be a matter of time before physicians can screen individuals for susceptibility to a disease just by analyzing their DNA samples for specific SNP patterns.
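The comparison at the core of an association study can be sketched as a difference in allele frequencies at a SNP site between the affected and unaffected groups. The genotype strings below are invented for illustration, and a real study would apply a formal statistical test (such as chi-square) rather than eyeballing the difference:

```python
def allele_frequency(genotypes, allele):
    """Fraction of observed alleles at a site that match the given allele.

    Each genotype string holds the two alleles an individual carries.
    """
    observed = "".join(genotypes)
    return observed.count(allele) / len(observed)

# One SNP site, two alleles per person; "T" is the candidate risk allele
cases    = ["TT", "TA", "TT", "TA", "TT"]   # individuals affected by the disease
controls = ["AA", "TA", "AA", "AA", "TA"]   # unaffected individuals

freq_cases = allele_frequency(cases, "T")        # 0.8
freq_controls = allele_frequency(controls, "T")  # 0.2
# A large frequency difference suggests the site is associated with the trait
```

When such a difference is statistically significant across many individuals, the SNP (or a disease gene it sits near) becomes a candidate for follow-up study.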

SNPs and Drug Development


As mentioned earlier, SNPs may also be associated with the absorbance and clearance of therapeutic agents. Currently, there is no simple way to determine how a patient will respond to a particular medication. A treatment proven effective in one patient may be ineffective in others. Worse yet, some patients may experience an adverse immunologic reaction to a particular drug. Today, pharmaceutical companies are limited to developing agents to which the "average" patient will respond. As a result, many drugs that might benefit a small number of patients never make it to market.
Using SNPs to study the genetics of drug response will help in the creation of "personalized" medicine.

In the future, the most appropriate drug for an individual could be determined in advance of treatment by analyzing a patient's SNP profile. The ability to target a drug to those individuals most likely to benefit, referred to as "personalized medicine", would allow pharmaceutical companies to bring many more drugs to market and allow doctors to prescribe individualized therapies specific to a patient's needs.

SNPs and NCBI

Because SNPs occur frequently throughout the genome and tend to be relatively stable genetically, they serve as excellent biological markers. Biological markers are segments of DNA with an identifiable physical location that can be easily tracked and used for constructing a chromosome map that shows the positions of known genes, or other markers, relative to each other. These maps allow researchers to study and pinpoint traits resulting from the interaction of more than one gene. NCBI plays a major role in facilitating the identification and cataloging of SNPs through its creation and maintenance of the public SNP database (dbSNP). This powerful genetic tool may be accessed by the biomedical community worldwide and is intended to stimulate many areas of biological research, including the identification of the genetic components of disease.
Most SNPs are not responsible for a disease state. Instead, they serve as biological markers for pinpointing a disease on the human genome map.

NCBI's "Discovery Space" Facilitating SNP Research

Figure 1. The NCBI Discovery Space. Records in dbSNP are cross-annotated within other internal information resources such as PubMed, genome project sequences, GenBank records, the Entrez Gene database, and the dbSTS database of sequence tagged sites. Users may query dbSNP directly or start a search in any part of the NCBI discovery space to construct a set of dbSNP records that satisfy their search conditions. Records are also integrated with external information resources through hypertext URLs that dbSNP users can follow to explore the detailed information that is beyond the scope of dbSNP curation. Reproduced with permission from Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K."dbSNP: the NCBI database of genetic variation." Nucleic Acids Research. 2001; 29:308-311.

To facilitate research efforts, NCBI's dbSNP is included in the Entrez retrieval system which provides integrated access to a number of software tools and databases that can aid in SNP analysis. For example, each SNP record in the database links to additional resources within NCBI's "Discovery Space", as noted in Figure 1. Resources include: GenBank, NIH's sequence database; Entrez Gene, a focal point for genes and associated information; dbSTS, NCBI's resource containing sequence and mapping data on short genomic landmarks; human genome sequencing data; and PubMed, NCBI's literature search and retrieval system. SNP records also link to various external allied resources. Providing public access to a site for "one-stop SNP shopping" facilitates scientific research in a variety of fields, ranging from population genetics and evolutionary biology to large-scale disease and drug association studies. The long-term investment in such novel and exciting research promises not only to advance human biology but to revolutionize the practice of modern medicine.

ESTs: GENE DISCOVERY MADE EASIER


Investigators are working diligently to sequence and assemble the genomes of various organisms, including the mouse and human, for a number of important reasons. Although important goals of any sequencing project may be to obtain a genomic sequence and identify a complete set of genes, the ultimate goal is to gain an understanding of when, where, and how a gene is turned on, a process commonly referred to as gene expression. Once we begin to understand where and how a gene is expressed under normal circumstances, we can then study what happens in an altered state, such as in disease. To accomplish the latter goal, however, researchers must identify and study the protein, or proteins, coded for by a gene.

As one can imagine, finding a gene that codes for a protein, or proteins, is not easy. Traditionally, scientists would start their search by defining a biological problem and developing a strategy for researching it. Oftentimes, a search of the scientific literature provided various clues about how to proceed. For example, other laboratories may have published data that established a link between a particular protein and a disease of interest. Researchers would then work to isolate that protein, determine its function, and locate the gene that coded for it. Alternatively, scientists could conduct what are referred to as linkage studies to determine the chromosomal location of a particular gene. Once the chromosomal location was determined, scientists would use biochemical methods to isolate the gene and its corresponding protein. Either way, these methods took a great deal of time, years in some cases, and yielded the location and description of only a small percentage of the genes found in the human genome.

Now, however, the time required to locate and fully describe a gene is rapidly decreasing, thanks to the development of, and access to, a technology used to generate what are called Expressed Sequence Tags, or ESTs. ESTs provide researchers with a quick and inexpensive route for discovering new genes, for obtaining data on gene expression and regulation, and for constructing genome maps.
Today, researchers using ESTs to study the human genome find themselves riding the crest of a wave of scientific discovery the likes of which has never been seen before.

An Expressed Sequence Tag is a tiny portion of an entire gene that can be used to help identify unknown genes and to map their positions within a genome.

What Are ESTs and How Are They Made?


ESTs are small pieces of DNA sequence (usually 200 to 500 nucleotides long) that are generated by sequencing either one or both ends of an expressed gene. The idea is to sequence bits of DNA that represent genes expressed in certain cells, tissues, or organs from different organisms and use these "tags" to fish a gene out of a portion of chromosomal DNA by matching base pairs. The challenge associated with identifying genes from genomic sequences varies among organisms and is dependent upon genome size as well as the presence or absence of introns, the intervening DNA sequences interrupting the protein coding sequence of a gene.

Separating the Wheat from the Chaff: Using mRNA to Generate cDNA

Gene identification is very difficult in humans, because most of our genome is composed of introns interspersed with relatively few DNA coding sequences, or genes. These genes are expressed as proteins, a complex process composed of two main steps. First, each gene (DNA) must be converted, or transcribed, into messenger RNA (mRNA), the RNA that serves as a template for protein synthesis. The resulting mRNA then guides the synthesis of a protein through a process called translation. Interestingly, mRNAs in a cell do not contain sequences from the regions between genes, nor from the non-coding introns that are present within many genes. Therefore, isolating mRNA is key to finding expressed genes in the vast expanse of the human genome.

Figure 1. An overview of the process of protein synthesis. Protein synthesis is the process whereby DNA codes for the production of amino acids and proteins. The process is divided into two parts: transcription and translation. During transcription, one strand of a DNA double helix is used as a template by mRNA polymerase to synthesize a mRNA. During this step, mRNA passes through various phases, including one called splicing, where the non-coding sequences are eliminated. In the next step, translation, the mRNA guides the synthesis of the protein by adding amino acids, one by one, as dictated by the DNA and represented by the mRNA.
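The two steps in Figure 1 can be illustrated with a toy script. Everything below is invented for illustration: the template strand is hypothetical, and the codon table is deliberately tiny (a real one has 64 entries).

```python
# A toy illustration of transcription and translation (not real biology
# software). Transcription pairs each DNA base of the template strand
# with its RNA complement; translation reads the mRNA three bases at a
# time until a STOP codon.

DNA_TO_MRNA = {"A": "U", "T": "A", "G": "C", "C": "G"}

# Deliberately minimal codon table covering only the codons used below.
CODON_TABLE = {"AUG": "Met", "UUU": "Phe", "GGC": "Gly", "UAA": "STOP"}

def transcribe(template_strand):
    """Build the mRNA complementary to the DNA template strand."""
    return "".join(DNA_TO_MRNA[base] for base in template_strand)

def translate(mrna):
    """Read codons (triplets) until a STOP codon is reached."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "STOP":
            break
        protein.append(amino_acid)
    return protein

template = "TACAAACCGATT"        # hypothetical template strand
mrna = transcribe(template)      # -> "AUGUUUGGCUAA"
print(mrna, translate(mrna))
```

Note that splicing, shown in the figure, is omitted here; the toy template is treated as if its introns were already removed.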

The problem, however, is that mRNA is very unstable outside of a cell; therefore, scientists use special enzymes to convert it to complementary DNA (cDNA). cDNA is a much more stable compound and, importantly, because it was generated from a mRNA in which the introns have been removed, cDNA represents only expressed DNA sequence.

cDNA is a form of DNA prepared in the laboratory using an enzyme called reverse transcriptase. cDNA production is the reverse of the usual process of transcription in cells because the procedure uses mRNA as a template rather than DNA. Unlike genomic DNA, cDNA contains only expressed DNA sequences, or exons.

From cDNAs to ESTs

Once cDNA representing an expressed gene has been isolated, scientists can then sequence a few hundred nucleotides from either end of the molecule to create two different kinds of ESTs. Sequencing only the beginning portion of the cDNA produces what is called a 5' EST. A 5' EST is obtained from the portion of a transcript that usually codes for a protein. These regions tend to be conserved across species and do not change much within a gene family. Sequencing the ending portion of the cDNA molecule produces what is called a 3' EST. Because these ESTs are generated from the 3' end of a transcript, they are likely to fall within noncoding, or untranslated, regions (UTRs), and therefore tend to exhibit less cross-species conservation than do coding sequences.

A "gene family" is a group of closely related genes that produces similar protein products.

A UTR is that part of a gene that is not translated into protein.

Figure 2. An overview of how ESTs are generated. ESTs are generated by sequencing cDNA, which itself is synthesized from the mRNA molecules in a cell. The mRNAs in a cell are copies of the genes that are being expressed. mRNA does not contain sequences from the regions between genes, nor from the non-coding introns that are present within many genes.

ESTs: Tools for Gene Mapping and Discovery


ESTs as Genome Landmarks

Just as a person driving a car may need a map to find a destination, scientists searching for genes also need genome maps to help them navigate through the billions of nucleotides that make up the human genome. For a map to make navigational sense, it must include reliable landmarks, or "markers". Currently, the most powerful mapping technique, and one that has been used to generate many genome maps, relies on Sequence Tagged Site (STS) mapping. An STS is a short DNA sequence that is easily recognizable and occurs only once in a genome (or chromosome). 3' ESTs serve as a common source of STSs because they are likely to be unique to a particular species, and they provide the additional feature of pointing directly to an expressed gene.
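The defining property of an STS, a short sequence that occurs exactly once, can be checked mechanically. The following toy sketch uses a made-up "genome" string; real STS screening works against assembled genome sequence with alignment tools, not exact string search.

```python
# A sketch of the STS uniqueness requirement over a tiny invented genome.

def count_occurrences(genome, tag):
    """Count every (possibly overlapping) occurrence of tag in genome."""
    count, start = 0, 0
    while True:
        pos = genome.find(tag, start)
        if pos == -1:
            return count
        count += 1
        start = pos + 1

def is_sts_candidate(genome, tag):
    """A tag qualifies as an STS-like landmark only if it is unique."""
    return count_occurrences(genome, tag) == 1

genome = "ATGCGTACGTTAGCATGCGT"
print(is_sts_candidate(genome, "TTAGC"))   # unique -> usable landmark
print(is_sts_candidate(genome, "ATGCGT"))  # occurs twice -> not a marker
```

The repeated tag in the example shows why repetitive sequence makes a poor landmark: a map position derived from it would be ambiguous.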

ESTs as Gene Discovery Resources

ESTs are powerful tools in the hunt for known genes because they greatly reduce the time required to locate a gene.

Because ESTs represent a copy of just the interesting part of a genome, that which is expressed, they have proven themselves again and again as powerful tools in the hunt for genes involved in hereditary diseases. ESTs also have a number of practical advantages: their sequences can be generated rapidly and inexpensively, only one sequencing experiment is needed per cDNA generated, and they do not have to be checked for sequencing errors because mistakes do not prevent identification of the gene from which the EST was derived.

Using ESTs, scientists have rapidly isolated some of the genes involved in Alzheimer's disease and colon cancer.

To find a disease gene using this approach, scientists first use observable biological clues to identify ESTs that may correspond to disease gene candidates. Scientists then examine the DNA of disease patients for mutations in one or more of these candidate genes to confirm gene identity. Using this method, scientists have already isolated genes involved in Alzheimer's disease, colon cancer, and many other diseases. It is easy to see why ESTs will pave the way to new horizons in genetic research.

ESTs and NCBI


Because of their utility, the speed with which they may be generated, and the low cost associated with this technology, many individual scientists as well as large genome sequencing centers have been generating hundreds of thousands of ESTs for public use. Once an EST was generated, scientists would submit their tags to GenBank, the NIH sequence database operated by NCBI. With the rapid submission of so many ESTs, it became difficult to identify a sequence that had already been deposited in the database. It was becoming increasingly apparent to NCBI investigators that if ESTs were to be easily accessed and useful as gene discovery tools, they needed to be organized in a searchable database that also provided access to other genome data. Therefore, in 1992, scientists at NCBI developed a new database designed to serve as a collection point for ESTs. Once an EST that was submitted to GenBank had been screened and annotated, it was deposited in this new database, called dbEST.

For ESTs to be easily accessed and useful as gene discovery tools, they must be organized in a searchable database that also provides access to genome data.

dbEST: A Descriptive Catalog of ESTs

Scientists at NCBI created dbEST to organize, store, and provide access to the great mass of public EST data that has already accumulated and that continues to grow daily. Using dbEST, a scientist can access not only data on human ESTs but information on ESTs from over 300 other organisms as well. Whenever possible, NCBI scientists annotate the EST record with any known information. For example, if an EST matches a DNA sequence that codes for a known gene with a known function, that gene's name and function are placed on the EST record. Annotating EST records allows public scientists to use dbEST as an avenue for gene discovery. By using a database search tool, such as NCBI's BLAST, any interested party can conduct sequence similarity searches against dbEST.
Scientists at NCBI annotate EST records with text information regarding DNA and mRNA homologies.

UniGene: A Non-Redundant Set of Gene-oriented Clusters

Because a gene can be expressed as mRNA many, many times, ESTs ultimately derived from this mRNA may be redundant. That is, there may be many identical, or similar, copies of the same EST. Such redundancy and overlap means that when someone searches dbEST for a particular EST, they may retrieve a long list of tags, many of which may represent the same gene. Searching through all of these identical ESTs can be very time consuming. To resolve the redundancy and overlap problem, NCBI investigators developed the UniGene database. UniGene automatically partitions GenBank sequences into a non-redundant set of gene-oriented clusters. Although it is widely recognized that the generation of ESTs constitutes an efficient strategy to identify genes, it is important to acknowledge that despite its advantages, there are several limitations associated with the EST approach. One is that it is very difficult to isolate mRNA from some tissues and cell types. This results in a paucity of data on certain genes that may be found only in these tissues or cell types.

A second limitation is that important gene regulatory sequences may be found within an intron. Because ESTs are small segments of cDNA, generated from a mRNA in which the introns have been removed, much valuable information may be lost by focusing only on cDNA sequencing. Despite these limitations, ESTs continue to be invaluable in characterizing the human genome, as well as the genomes of other organisms. They have enabled the mapping of many genes to chromosomal sites and have also assisted in the discovery of many new genes.
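The clustering idea behind a resource like UniGene can be sketched very roughly as grouping ESTs that share a long exact overlap. The real pipeline uses far more sophisticated alignment and quality rules; the overlap length and sequences below are invented for illustration.

```python
# A highly simplified sketch of gene-oriented clustering: ESTs sharing a
# sufficiently long exact substring are merged into one cluster via a
# union-find structure.

OVERLAP = 8  # minimum shared substring length (an arbitrary choice)

def shares_overlap(a, b, k=OVERLAP):
    """True if any length-k window of a occurs verbatim in b."""
    return any(a[i:i + k] in b for i in range(len(a) - k + 1))

def cluster(ests):
    parent = list(range(len(ests)))

    def find(i):                     # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(ests)):
        for j in range(i + 1, len(ests)):
            if shares_overlap(ests[i], ests[j]):
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(ests)):
        groups.setdefault(find(i), []).append(i)
    return sorted(sorted(g) for g in groups.values())

# Three hypothetical ESTs: the first two overlap, the third is unrelated.
ests = ["ATGGCGTACGATCGA", "TACGATCGATTTACG", "CCCCGGGGAAAATTT"]
print(cluster(ests))  # -> [[0, 1], [2]]
```

The output groups redundant ESTs together, which is the essence of reducing a long list of overlapping tags to one gene-oriented cluster.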

MICROARRAYS: CHIPPING AWAY AT THE MYSTERIES OF SCIENCE AND MEDICINE


With only a few exceptions, every cell of the body contains a full set of chromosomes and identical genes. Only a fraction of these genes are turned on, however, and it is the subset that is "expressed" that confers unique properties to each cell type. "Gene expression" is the term used to describe the transcription of the information contained within the DNA, the repository of genetic information, into messenger RNA (mRNA) molecules that are then translated into the proteins that perform most of the critical functions of cells. Scientists study the kinds and amounts of mRNA produced by a cell to learn which genes are expressed, which in turn provides insights into how the cell responds to its changing needs. Gene expression is a highly complex and tightly regulated process that allows a cell to respond dynamically both to environmental stimuli and to its own changing needs. This mechanism acts as both an "on/off" switch to control which genes are expressed in a cell as well as a "volume control" that increases or decreases the level of expression of particular genes as necessary.

The proper and harmonious expression of a large number of genes is a critical component of normal growth and development and the maintenance of proper health. Disruptions or changes in gene expression are responsible for many diseases.

Enabling Technologies

Biomedical research evolves and advances not only through the compilation of knowledge but also through the development of new technologies. Using traditional methods to assay gene expression, researchers were able to survey a relatively small number of genes at a time. The emergence of new tools enables researchers to address previously intractable problems and to uncover novel potential targets for therapies. Microarrays allow scientists to analyze expression of many genes in a single experiment quickly and efficiently. They represent a major methodological advance and illustrate how the advent of new technologies provides powerful tools for researchers. Scientists are using microarray technology to try to understand fundamental aspects of growth and development as well as to explore the underlying genetic causes of many human diseases.

DNA Microarrays: The Technical Foundations


Two recent complementary advances, one in knowledge and one in technology, are greatly facilitating the study of gene expression and the discovery of the roles played by specific genes in the development of disease. As a result of the Human Genome Project, there has been an explosion in the amount of information available about the DNA sequence of the human genome. Consequently, researchers have identified a large number of novel genes within these previously unknown sequences. The challenge currently facing scientists is to find a way to organize and catalog this vast amount of information into a usable form. Only after the functions of the new genes are discovered will the full impact of the Human Genome Project be realized. The second advance may facilitate the identification and classification of this DNA sequence information and the assignment of functions to these new genes: the emergence of DNA microarray technology. A microarray works by exploiting the ability of a given mRNA molecule to bind specifically to, or hybridize to, the DNA template from which it originated. By using an array containing many DNA samples, scientists can determine, in a single experiment, the expression levels of hundreds or thousands of genes within a cell by measuring the amount of mRNA bound to each site on the array. With the aid of a computer, the amount of mRNA bound to the spots on the microarray is precisely measured, generating a profile of gene expression in the cell.

A microarray is a tool for analyzing gene expression that consists of a small membrane or glass slide containing samples of many genes arranged in a regular pattern.

Why Are Microarrays Important?


Microarrays are a significant advance both because they may contain a very large number of genes and because of their small size. Microarrays are therefore useful when one wants to survey a large number of genes quickly or when the sample to be studied is small. Microarrays may be used to assay gene expression within a single sample or to compare gene expression in two different cell types or tissue samples, such as in healthy and diseased tissue. Because a microarray can be used to examine the expression of hundreds or thousands of genes at once, it promises to revolutionize the way scientists examine gene expression. This technology is still considered to be in its infancy; therefore, many initial studies using microarrays have represented simple surveys of gene expression profiles in a variety of cell types. Nevertheless, these studies represent an important and necessary first step in our understanding and cataloging of the human genome. As more information accumulates, scientists will be able to use microarrays to ask increasingly complex questions and perform more intricate experiments. With new advances, researchers will be able to infer probable functions of new genes based on similarities in expression patterns with those of known genes. Ultimately, these studies promise to expand the size of existing gene families, reveal new patterns of coordinated gene expression across gene families, and uncover entirely new categories of genes. Furthermore, because the product of any one gene usually interacts with those of many others, our understanding of how these genes coordinate will become clearer through such analyses, and precise knowledge of these inter-relationships will emerge. The use of microarrays may also speed the identification of genes involved in the development of various diseases by enabling scientists to examine a much larger number of genes. 
This technology will also aid the examination of the integration of gene expression and function at the cellular level, revealing how multiple gene products work together to produce physical and chemical responses to both static and changing cellular needs.

What Exactly Is a DNA Microarray?


DNA Microarrays are small, solid supports onto which the sequences from thousands of different genes are immobilized, or attached, at fixed locations. The supports themselves are usually glass microscope slides, the size of two side-by-side pinky fingers, but can also be silicon chips or nylon membranes. The DNA is printed, spotted, or actually synthesized directly onto the support. The American Heritage Dictionary defines "array" as "to place in an orderly arrangement". It is important that the gene sequences in a microarray are attached to their support in an orderly or fixed way, because a researcher uses the location of each spot in the array to identify a particular gene sequence. The spots themselves can be DNA, cDNA, or oligonucleotides.

An oligonucleotide, or oligo as it is commonly called, is a short fragment of a single-stranded DNA that is typically 5 to 50 nucleotides long.

Designing a Microarray Experiment: The Basic Steps


One might ask, how does a scientist extract information about a disease condition from a dime-sized glass or silicon chip containing thousands of individual gene sequences? The whole process is based on hybridization probing, a technique that uses fluorescently labeled nucleic acid molecules as "mobile probes" to identify complementary molecules, sequences that are able to base-pair with one another. Each single-stranded DNA fragment is made up of four different nucleotides, adenine (A), thymine (T), guanine (G), and cytosine (C), that are linked end to end. Adenine is the complement of, or will always pair with, thymine, and guanine is the complement of cytosine. Therefore, the complementary sequence to G-T-C-C-T-A will be C-A-G-G-A-T. When two complementary sequences find each other, such as the immobilized target DNA and the mobile probe DNA, cDNA, or mRNA, they will lock together, or hybridize. Now, consider two cells: cell type 1, a healthy cell, and cell type 2, a diseased cell. Both contain an identical set of four genes, A, B, C, and D. Scientists are interested in determining the expression profile of these four genes in the two cell types. To do this, scientists isolate mRNA from each cell type and use this mRNA as a template to generate cDNA with a "fluorescent tag" attached. Different tags (red and green) are used so that the samples can be differentiated in subsequent steps. The two labeled samples are then mixed and incubated with a microarray containing the immobilized genes A, B, C, and D. The labeled molecules bind to the sites on the array corresponding to the genes expressed in each cell.
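The base-pairing rule above can be written directly in code. This sketch reuses the text's own example sequence; the `can_hybridize` check is a simplification, since real probes tolerate some mismatches.

```python
# The complement rule in code: A pairs with T, G pairs with C, so two
# strands hybridize when every position pairs up.

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(sequence):
    """Return the base-by-base complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in sequence)

def can_hybridize(probe, target):
    """In this simplified model, a mobile probe binds an immobilized
    target only when the sequences are exact complements."""
    return complement(probe) == target

print(complement("GTCCTA"))               # -> CAGGAT, as in the text
print(can_hybridize("GTCCTA", "CAGGAT"))  # -> True
```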

A DNA Microarray Experiment


1. Prepare your DNA chip using your chosen target DNAs.
2. Generate a hybridization solution containing a mixture of fluorescently labeled cDNAs.
3. Incubate your hybridization mixture containing the fluorescently labeled cDNAs with your DNA chip.
4. Detect bound cDNA using laser technology and store data in a computer.
5. Analyze data using computational methods.

After this hybridization step is complete, a researcher will place the microarray in a "reader" or "scanner" that consists of some lasers, a special microscope, and a camera. The fluorescent tags are excited by the laser, and the microscope and camera work together to create a digital image of the array. These data are then stored in a computer, and a special program is used either to calculate the red-to-green fluorescence ratio or to subtract out background data for each microarray spot by analyzing the digital image of the array. If calculating ratios, the program then creates a table that contains the ratios of the intensity of red-to-green fluorescence for every spot on the array. For example, using the scenario outlined above, the computer may conclude that both cell types express gene A at the same level, that cell 1 expresses more of gene B, that cell 2 expresses more of gene C, and that neither cell expresses gene D. But remember, this is a simple example used to demonstrate key points in experimental design. Some microarray experiments can contain up to 30,000 target spots. Therefore, the data generated from a single array can mount up quickly.
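The ratio step for the four-gene example can be sketched as follows. All intensity readings, the background level, and the fold-change cutoff are invented; this follows the common convention of labeling the control (healthy cell 1) green and the sample (diseased cell 2) red.

```python
# A sketch of per-spot analysis for the toy four-gene experiment.

BACKGROUND = 50   # assumed constant background signal per spot

spots = {                 # gene: (red intensity, green intensity)
    "A": (850, 850),      # equal signal in both channels
    "B": (250, 900),      # green dominates: higher in cell 1
    "C": (900, 250),      # red dominates: higher in cell 2
    "D": (50, 50),        # background only: not expressed
}

def call_expression(red, green, background=BACKGROUND, fold=2.0):
    """Subtract background, then compare channels by a fold-change cutoff."""
    r, g = max(red - background, 0), max(green - background, 0)
    if r == 0 and g == 0:
        return "not expressed"
    if g == 0 or r / g >= fold:
        return "higher in cell 2"   # red (sample) channel dominates
    if r == 0 or g / r >= fold:
        return "higher in cell 1"   # green (control) channel dominates
    return "expressed equally in both"

for gene, (red, green) in spots.items():
    print(gene, call_expression(red, green))
```

Running this reproduces the conclusions in the text: gene A equal, gene B higher in cell 1, gene C higher in cell 2, gene D unexpressed. Real analysis software additionally normalizes the two channels and models noise, which this sketch omits.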

The Colors of a Microarray

Reproduced with permission from the Office of Science Education, the National Institutes of Health.

In this schematic: GREEN represents Control DNA, where either DNA or cDNA derived from normal tissue is hybridized to the target DNA. RED represents Sample DNA, where either DNA or cDNA is derived from diseased tissue hybridized to the target DNA. YELLOW represents a combination of Control and Sample DNA, where both hybridized equally to the target DNA. BLACK represents areas where neither the Control nor Sample DNA hybridized to the target DNA. Each spot on an array is associated with a particular gene. Each color in an array represents either healthy (control) or diseased (sample) tissue. Depending on the type of array used, the location and intensity of a color will tell us whether the gene, or mutation, is present in either the control and/or sample DNA. It will also provide an estimate of the expression level of the gene(s) in the sample and control DNA.

Types of Microarrays
There are three basic types of samples that can be used to construct DNA microarrays: two are genomic, and the other is "transcriptomic", that is, it measures mRNA levels. What makes them different from each other is the kind of immobilized DNA used to generate the array and, ultimately, the kind of information that is derived from the chip. The target DNA used will also determine the type of control and sample DNA that is used in the hybridization solution.

I. Changes in Gene Expression Levels

Determining the level, or volume, at which a certain gene is expressed is called microarray expression analysis, and the arrays used in this kind of analysis are called "expression chips". The immobilized DNA is cDNA derived from the mRNA of known genes, and once again, at least in some experiments, the control and sample DNA hybridized to the chip is cDNA derived from the mRNA of normal and diseased tissue, respectively. If a gene is overexpressed in a certain disease state, then more sample cDNA, as compared to control cDNA, will hybridize to the spot representing that expressed gene. In turn, the spot will fluoresce red with greater intensity than it will fluoresce green. Once researchers have characterized the expression patterns of various genes involved in many diseases, cDNA derived from diseased tissue from any individual can be hybridized to determine whether the expression pattern of the gene from the individual matches the expression pattern of a known disease. If this is the case, treatment appropriate for that disease can be initiated. Just as researchers use expression chips to detect expression patterns, that is, whether a particular gene(s) is being expressed more or less under certain circumstances, expression chips may also be used to examine changes in gene expression over a given period of time, such as within the cell cycle. The cell cycle is a molecular network that determines, in the normal cell, whether the cell should pass through its life cycle. There are a variety of genes involved in regulating the stages of the cell cycle. Also built into this network are mechanisms designed to protect the body when this system fails or breaks down because of mutations within one of the "control genes", as is the case with cancerous cell growth. An expression microarray "experiment" could be designed where cell cycle data are generated in multiple arrays and referenced to time "zero".
Analysis of the collected data could further elucidate details of the cell cycle and its "clock", providing much needed data on the points at which gene mutation leads to cancerous growth as well as sources of therapeutic intervention. In the same way, expression chips can be used to develop new drugs. For instance, if a certain gene is overexpressed in a particular form of cancer, researchers can use expression chips to see if a new drug will reduce overexpression and force the cancer into remission. Expression chips could be used in disease diagnosis as well, e.g., in the identification of new genes involved in environmentally triggered diseases, such as those affecting the immune, nervous, and pulmonary/respiratory systems.

II. Genomic Gains and Losses

DNA repair genes are thought to be the body's frontline defense against mutations and, as such, play a major role in cancer. Mutations within these genes often manifest themselves as lost or broken chromosomes. It has been hypothesized that certain chromosomal gains and losses are related to cancer progression and that the patterns of these changes are relevant to clinical prognosis. Using different laboratory methods, researchers can measure gains and losses in the copy number of chromosomal regions in tumor cells. Then, using mathematical models to analyze these data, they can predict which chromosomal regions are most likely to harbor important genes for tumor initiation and disease progression. The results of such an analysis may be depicted as a hierarchical treelike branching diagram, referred to as a "tree model of tumor progression". Researchers use a technique called microarray Comparative Genomic Hybridization (CGH) to look for genomic gains and losses or for a change in the number of copies of a particular gene involved in a disease state. In microarray CGH, large pieces of genomic DNA serve as the target DNA, and each spot of target DNA in the array has a known chromosomal location. The hybridization mixture will contain fluorescently labeled genomic DNA harvested from both normal (control) and diseased (sample) tissue. Therefore, if the number of copies of a particular target gene has increased, a large amount of sample DNA will hybridize to those spots on the microarray that represent the gene involved in that disease, whereas comparatively small amounts of control DNA will hybridize to those same spots. As a result, those spots containing the disease gene will fluoresce red with greater intensity than they will fluoresce green, indicating that the number of copies of the gene involved in the disease has gone up.
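The CGH readout logic can be sketched as a ratio of sample to control signal per chromosomal region. The region names, intensities, and thresholds below are all invented for illustration; real CGH analyses normalize intensities and model noise statistically.

```python
# A sketch of calling genomic gains and losses from CGH-style spot data.
# Ratios near 1 suggest normal copy number; high ratios suggest a gain
# (red/sample dominates), low ratios a loss (green/control dominates).

regions = {               # region: (sample intensity, control intensity)
    "region_1": (2400, 800),   # strong sample signal: candidate gain
    "region_2": (300, 900),    # strong control signal: candidate loss
    "region_3": (820, 800),    # balanced: normal copy number
}

def call_copy_number(sample, control, gain=1.5, loss=0.67):
    """Classify a spot by its sample-to-control ratio (arbitrary cutoffs)."""
    ratio = sample / control
    if ratio >= gain:
        return "gain"
    if ratio <= loss:
        return "loss"
    return "normal"

for region, (s, c) in regions.items():
    print(region, call_copy_number(s, c))
```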

III. Mutations in DNA

When researchers use microarrays to detect mutations or polymorphisms in a gene sequence, the target, or immobilized DNA, is usually that of a single gene. In this case, though, the target sequence placed on any given spot within the array will differ from that of other spots in the same microarray, sometimes by only one or a few specific nucleotides. One type of sequence commonly used in this type of analysis is called a Single Nucleotide Polymorphism, or SNP, a small genetic change or variation that can occur within a person's DNA sequence. Another difference in mutation microarray analysis, as compared to expression or CGH microarrays, is that this type of experiment requires only genomic DNA derived from a normal sample for use in the hybridization mixture. Once researchers have established that a SNP pattern is associated with a particular disease, they can use SNP microarray technology to test an individual to determine whether he or she is susceptible to (at risk of developing) that disease. When genomic DNA from an individual is hybridized to an array loaded with various SNPs, the sample DNA will hybridize with greater frequency to the specific SNPs associated with that person. Those spots on the microarray will then fluoresce with greater intensity, demonstrating that the individual being tested may have, or is at risk for developing, that disease.

In Brief: Microarray Applications


Microarray type: Application

CGH: Tumor classification, risk assessment, and prognosis prediction
Expression analysis: Drug development, drug response, and therapy development
Mutation/Polymorphism analysis: Drug development, therapy development, and tracking disease progression

NCBI and Microarray Data Management


Why is it necessary to have a uniform system that will manage and provide a distribution point for microarray data? Consider the amount of data that can potentially be generated using a single microarray chip. Suppose that chip contains 30,000 spots of target DNA. Researchers interpreting the data generated by that chip would need to know the biological identity of each target (what gene is where); the biological properties of the control and sample DNA; the experimental conditions and procedures used in setting up the experiment; and finally, the results. Although experiments such as these will undoubtedly push forward our current understanding of gene expression and regulation, many new challenges are presented in terms of data tracking and analysis.

What Is GEO?
As we have just alluded to, microarray technology is one of the most recent and important experimental breakthroughs in molecular biology. Today, the capacity to generate data is fast overtaking the capacity to store and analyze it. Much of this information is scattered across the Internet or is not even available to the public. As more laboratories acquire this technology, the problem will only get worse. This avalanche of data requires standardization of storage, sharing, and publishing techniques. To support the public use and dissemination of gene expression data, NCBI has launched the Gene Expression Omnibus, or GEO. GEO represents NCBI's effort to build an expression data repository and online resource for the storage and retrieval of gene expression data from any organism or artificial source. Many types of gene expression data, such as those discussed in this primer, are accepted and archived as public datasets.

Developing MAML: Reading Off the Same Platform


Microarray Markup Language (MAML), developed by the "MAML" working group of MGED, the Microarray Gene Expression Database group, is a first attempt to provide a standard platform for submitting and analyzing the enormous amounts of microarray expression data generated by different laboratories around the world. The goal of this group, which includes NCBI investigators, is to facilitate the adoption of standards for DNA-array experiment annotation and data representation, as well as the introduction of standard experimental controls and data normalization methods. The underlying goal is to facilitate the establishment of gene expression data repositories, the comparability of gene expression data from different sources, and the interoperability of different gene expression databases and data analysis software. MAML proposes a framework for describing information about a DNA-array experiment and a data format for communicating this information, including details about:

Experimental design: the set of the hybridization experiments as a whole
Array design: each array used and each spot on the array
Samples: samples used, the extract preparation, and labeling
Hybridizations: procedures and parameters
Measurements: images, quantitation, and specifications
Controls: types, values, and specifications

MAML is independent of the particular experimental platform and provides a framework for describing experiments done on all types of DNA arrays, including spotted and synthesized arrays, as well as oligo and cDNA arrays. What's more, MAML provides a format for representing microarray data in a flexible way, which allows analysis of data obtained not only from any existing microarray platform but also from many possible future variants, including protein arrays. Although the data in GEO are not currently provided in MAML format, it is NCBI's goal to have the data delivered in a number of formats, including MAML, which is soon to be replaced by a more recent version called MAGE-ML (MicroArray Gene Expression Markup Language).

The Benefits of GEO and MAML

By storing vast amounts of data on gene expression profiles derived from multiple experiments using varied criteria and conditions, GEO will aid in the study of functional genomics: the development and application of global experimental approaches to assess gene function.
GEO will facilitate the cross-validation of data obtained using different techniques and technologies and will help set benchmarks and standards for further gene expression studies.
By making the information stored in GEO publicly available, the fields of bioinformatics and functional genomics will be both promoted and advanced.
That such experimental data should be freely accessible to all is consistent with NCBI's legislative mandate and mission: to develop new information technologies to aid in the understanding of fundamental molecular and genetic processes that control health and disease.

The Promise of Microarray Technology in Treating Disease


Now that you understand the concept behind array technology, picture this: a hand-held instrument that a physician could use to quickly diagnose cancer or other diseases during a routine office visit. What if that same instrument could also facilitate a personalized treatment regimen, exactly right for you? Personalized drugs. Molecular diagnostics. Integration of diagnosis and therapeutics. These are the long-term promises of microarray technology. Maybe not today or even tomorrow, but someday. For the first time, arrays offer hope for obtaining global views of biological processes (simultaneous readouts of all the body's components) by providing a systematic way to survey DNA and RNA variation. NCBI, by continuing its efforts to provide a standard format for microarray data and to provide free, universal access to those data, will help the scientific community make those promises realities.

ONE SIZE DOES NOT FIT ALL: THE PROMISE OF PHARMACOGENOMICS


Adverse Drug Reaction. These three simple words convey little of the horror of a severe negative reaction to a prescribed drug. But such negative reactions can nonetheless occur. A 1998 study of hospitalized patients published in the Journal of the American Medical Association reported that in 1994, adverse drug reactions accounted for more than 2.2 million serious cases and over 100,000 deaths, making adverse drug reactions (ADRs) one of the leading causes of hospitalization and death in the United States. Currently, there is no simple way to determine whether people will respond well, badly, or not at all to a medication; therefore, pharmaceutical companies are limited to developing drugs using a "one size fits all" system. This system allows for the development of drugs to which the "average" patient will respond. But, as the statistics above show, one size does NOT fit all, sometimes with devastating results. What is needed is a way to solve the problem of ADRs before they happen. The solution is in sight though, and it is called pharmacogenomics.

What Is Pharmacogenomics?
The way a person responds to a drug (this includes both positive and negative reactions) is a complex trait that is influenced by many different genes. Without knowing all of the genes involved in drug response, scientists have found it difficult to develop genetic tests that could predict a person's response to a particular drug. Once scientists discovered that people's genes show small variations (or changes) in their nucleotide (DNA base) content, all of that changed: genetic testing for predicting drug response is now possible. Pharmacogenomics is the science that examines the inherited variations in genes that dictate drug response and explores the ways these variations can be used to predict whether a patient will have a good response to a drug, a bad response to a drug, or no response at all.

Is there a difference between pharmacogenomics and pharmacogenetics?


Pharmacogenomics refers to the general study of all of the many different genes that determine drug behavior. Pharmacogenetics refers to the study of inherited differences (variation) in drug metabolism and response.

The distinction between the two terms is considered arbitrary, however, and now the two terms are used interchangeably.

How Will Gene Variation Be Used in Predicting Drug Response?

Right now, there is a race to catalog as many of the genetic variations found within the human genome as possible. These variations, or SNPs (pronounced "snips"), as they are commonly called, can be used as a diagnostic tool to predict a person's drug response. For SNPs to be used in this way, a person's DNA must be examined (sequenced) for the presence of specific SNPs. The problem, however, is that traditional gene sequencing technology is very slow and expensive and has therefore impeded the widespread use of SNPs as a diagnostic tool. DNA microarrays (or DNA chips) are an evolving technology that should make it possible for doctors to examine their patients for the presence of specific SNPs quickly and affordably. A single microarray can now be used to screen 100,000 SNPs found in a patient's genome in a matter of hours. As DNA microarray technology is developed further, SNP screening in the doctor's office to determine a patient's response to a drug, prior to drug prescription, will be commonplace.

DNA sequencing is the determination of the order of nucleotides (the base sequence) in a DNA molecule.

How Will Drug Development and Testing Benefit from Pharmacogenomics?

SNP screening will benefit drug development and testing because pharmaceutical companies could exclude from clinical trials those people whose pharmacogenomic screening shows that the drug being tested would be harmful or ineffective for them. Excluding these people will increase the chance that a drug will prove useful to a particular population group and will thus increase the chance that the same drug will make it into the marketplace. Pre-screening clinical trial subjects should also allow clinical trials to be smaller, faster, and therefore less expensive; the consumer, in turn, could benefit through reduced drug costs. Finally, the ability to assess an individual's reaction to a drug before it is prescribed will increase a physician's confidence in prescribing the drug and the patient's confidence in taking the drug, which in turn should encourage the development of new drugs tested in a like manner.

What Is NCBI's Role in Pharmacogenomics?

The explosion in both SNP and microarray data generated from the human genome project has necessitated the development of a means of cataloging and annotating (briefly describing) these data so that scientists can more easily access and use them for their research. NCBI, always at the forefront of bioinformatics research, has developed database repositories for both SNP (dbSNP) and microarray (GEO) data. These databases include both descriptive information about the data within the site itself and links to NCBI and external information resources. Access to these data and information resources will allow scientists to more easily interpret data that will be used not only to help determine drug response but also to study disease susceptibility and conduct basic research in population genetics.

The Promise of Pharmacogenomics

Right now, in doctors' offices all over the world, patients are given medications that either don't work or have bad side effects. Often, a patient must return to the doctor again and again until the doctor can find a drug that is right for him or her. Pharmacogenomics offers a very appealing alternative. Imagine a day when you go into your doctor's office and, after a simple and rapid test of your DNA, your doctor changes her or his mind about a drug considered for you because your genetic test indicates that you could suffer a severe negative reaction to the medication. However, upon further examination of your test results, your doctor finds that you would benefit greatly from a new drug on the market and that there would be little likelihood of your reacting negatively to it. A day like this will be coming to your doctor's office soon, brought to you by pharmacogenomics.

GETTING STARTED

Need help with dbSNP or Map Viewer? How about a quick "how-to" for using other NCBI data-mining tools? Try GETTING STARTED, a new NCBI resource designed to aid the novice user:

brief descriptions of data that can be found and manipulated using a particular NCBI tool
shortcuts for getting to where you need to go
concise explanations of NCBI tool graphics, with insider techniques for conducting database searches
simple examples of tool usage

SYSTEMATICS AND MOLECULAR PHYLOGENETICS

Classifying Organisms


Have you ever noticed that when you see an insect or a bird, there is real satisfaction in giving it a name, and an uncomfortable uncertainty when you can't? Along these same lines, consider the bewildering number and variety of organisms that live, or have lived, on this earth. If we did not know what to call these organisms, how could we communicate ideas about them, let alone the history of life? Thanks to taxonomy, the field of science that classifies life into groups, we can discuss just about any organism, from bacteria to man. Carolus Linnaeus pioneered the grouping of organisms based on Latin scientific names. His system of giving an organism a scientific name of two parts, sometimes more, is called binomial nomenclature, or "two-word naming". His scheme was based on physical similarities and differences, referred to as characters. Today, taxonomic classification is much more complex and takes into account cellular types and organization, biochemical similarities, and genetic similarities. Taxonomy is but one aspect of a much larger field called systematics.

Taxonomic Classification
Taxonomic ranks approximate evolutionary distances among groups of organisms. For example, species belonging to two different superkingdoms are most distantly related (their common ancestor diverged in the distant past), with progressively more exclusive groups indicated by phylum, class, and so on, down to infraspecific ranks, or ranks occurring within a species. Infraspecific ranks, such as subspecies, varietas, and forma, denote the closest evolutionary relationships. See the simplified classification of humans below.

Taxonomists, scientists who classify living organisms, define a species as any group of closely related organisms that can produce fertile offspring. Two organisms are more closely "related" as they approach the level of species; that is, they have more genes in common. The level of species can be further divided into smaller segments. A population is the smallest unit of a species and is made up of organisms of the same species. Sometimes, a population will physically alter over time to suit the needs of its environment. This is called a cline and can make members of the same species look different.

Taxonomic Classification of Man (Homo sapiens)


Superkingdom: Eukaryota
Kingdom: Metazoa
Phylum: Chordata
Class: Mammalia
Order: Primata
Family: Hominidae
Genus: Homo
Species: sapiens

What Is Phylogenetic Systematics?


Carolus Linnaeus is also credited with pioneering systematics, the field of science dealing with the diversity of life and the relationships between life's components. Systematics reaches beyond taxonomy to elucidate new methods and theories that can be used to classify species based on similarity of traits and possible mechanisms of evolution, a change in the gene pool of a population over time. Phylogenetic systematics is the field of biology that deals with identifying and understanding the evolutionary relationships among the many different kinds of life on earth, both living (extant) and extinct. Evolutionary theory states that similarity among individuals or species is attributable to common descent, or inheritance from a common ancestor. Thus, the relationships established by phylogenetic systematics often describe a species' evolutionary history and, hence, its phylogeny: the historical relationships among lineages of organisms or their parts, such as their genes.

Charles Darwin was the first to recognize that the systematic hierarchy represented a rough approximation of evolutionary history. However, it was not until the 1950s that the German entomologist Willi Hennig proposed that systematics should reflect the known evolutionary history of lineages as closely as possible, an approach he called phylogenetic systematics. The followers of Hennig were disparagingly referred to as "cladists" by his opponents because of their emphasis on recognizing only monophyletic groups, or clades: an ancestor plus all of its descendants. However, the cladists quickly adopted the term as a helpful label, and nowadays, cladistic approaches to systematics are used routinely.

Understanding the Evolutionary Process


Genetic Variation: Changes in a Gene Pool

Evolution is not always discrete, with clearly defined boundaries that pinpoint the origin of a new species, nor is it a steady continuum. Evolution requires genetic variation, which results from changes within a gene pool, the genetic make-up of a specific population. A gene pool is the combination of all the alleles (alternative forms of a genetic locus) for all traits that population may exhibit. Changes in a gene pool can result from mutation (variation within a particular gene) or from changes in gene frequency (the proportion of an allele in a given population).
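The idea of gene frequency can be made concrete with a little arithmetic. The sketch below computes allele frequencies in a gene pool from genotype counts; the locus, alleles, and counts are hypothetical.

```python
# Toy example: allele frequencies at one diploid locus with alleles A and a.
# The genotype counts are invented for illustration (100 individuals total).
genotype_counts = {"AA": 36, "Aa": 48, "aa": 16}

def allele_frequencies(counts):
    """Return the frequency of each allele in the gene pool.

    Each diploid individual contributes two allele copies, so the
    gene pool holds 2 * (number of individuals) copies in total.
    """
    total_alleles = 2 * sum(counts.values())
    copies = {}
    for genotype, n in counts.items():
        for allele in genotype:          # "Aa" contributes one A and one a
            copies[allele] = copies.get(allele, 0) + n
    return {a: c / total_alleles for a, c in copies.items()}

freqs = allele_frequencies(genotype_counts)
print(freqs)  # A: (2*36 + 48)/200 = 0.6, a: (48 + 2*16)/200 = 0.4
```

A change in these numbers from one generation to the next, whatever its cause, is exactly what the text means by a change in gene frequency.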

How Does Genetic Variation Occur?

Every organism possesses a genome that contains all of the biological information needed to construct and maintain a living example of that organism. The biological information contained in a genome is encoded in the nucleotide sequence of its DNA or RNA molecules and is divided into discrete units called genes. The information stored in a gene is read by proteins, which attach to the genome and initiate a series of reactions called gene expression. Every time a cell divides, it must make a complete copy of its genome, a process called DNA replication. DNA replication must be extremely accurate to avoid introducing mutations, or changes in the nucleotide sequence of a short region of the genome. Inevitably, some mutations do occur, usually in one of two ways: either from errors in DNA replication or from the damaging effects of chemical agents or radiation that react with DNA and change the structure of individual nucleotides.

Many of these mutations result in a change that has no effect on the functioning of the genome; these are referred to as silent mutations. Silent mutations include virtually all changes that happen in the noncoding components of genes and gene-related sequences. Mutations in the coding regions of genes are much more important. Here we must consider the importance of the same mutation in a somatic cell compared with a germ line cell. A somatic cell is any cell of an organism other than a reproductive cell, such as a sperm or egg cell. A germ cell line is any line of cells that gives rise to gametes and is continuous through the generations. Because a somatic cell does not pass on copies of its genome to the next generation, a somatic cell mutation is important only for the organism in which it occurs and has no potential evolutionary impact. In fact, most somatic mutations have no significant effect because there are many other identical cells in the same tissue. On the other hand, mutations in germ cells can be transmitted to the next generation and will then be present in all of the cells of an individual who inherits that mutation. Even so, mutations within germ line cells may not change the phenotype of the organism in any significant way. Those mutations that do have an evolutionary effect can be divided into two categories: loss-of-function mutations and gain-of-function mutations. A loss-of-function mutation results in reduced or abolished protein function. Gain-of-function mutations, which are much less common, confer an abnormal activity on a protein.

The randomness with which mutations can occur is an important concept in biology and is a requirement of the Darwinian view of evolution, which holds that changes in the characteristics of an organism occur by chance and are not influenced by the environment in which the organism lives. Beneficial changes within an organism are then positively selected for, whereas harmful changes are negatively selected.

The Drivers of Evolution: Selection, Drift, and Founder Effects

We just discussed that new alleles appear in a population because of mutations that occur in the reproductive cells of an organism. This means that many genes are polymorphic; that is, two or more alleles for that gene are present in a population. Each of these alleles has its own allele, or gene, frequency, a measure of how common an allele is in a population. Allele frequencies vary over time because of two processes: natural selection and random drift.

Natural Selection

Natural selection is the process whereby one genotype, the hereditary constitution of an individual, leaves more offspring than another genotype because of superior life attributes, termed fitness. Natural selection acts on genetic variation by conferring a survival advantage on those individuals harboring a particular mutation that happens to be favorable under changing environmental conditions. These individuals then reproduce and pass on this "new" gene, altering their gene pool. Natural selection, therefore, decreases the frequencies of alleles that reduce the fitness of an organism and increases the frequencies of alleles that improve fitness.
"Natural Selection" is the principle by which each slight variation, if useful, is preserved. Charles Darwin

It is important to point out that natural selection does not always represent progress, only adaptation to changing surroundings; that is, evolution attributable to natural selection is devoid of intent: something does not evolve to better itself, only to adapt. Because environments are always changing, what was once an advantageous mutation can often become a liability further down the evolutionary line.

Random Drift

The term random drift actually encompasses a number of distinct processes, sometimes referred to as outcomes. They include indiscriminate parent sampling, the founder effect, and fluctuations in the rate of evolutionary processes such as selection, migration, and mutation. Parent sampling is the process of determining which organisms of one generation will be the parents of the next generation. Parent sampling may be discriminate, that is, with regard to fitness differences, or indiscriminate, without regard to fitness differences. Discriminate parent sampling is generally considered natural selection, whereas indiscriminate parent sampling is considered random drift.

What Is Sampling?
Suppose a population of red and brown squirrels shares a habitat with a color-blind predator. Although the predator is color-blind, the brown squirrels die in greater numbers than the red squirrels; the brown squirrels just seem to be unlucky enough to come into contact with the predator more often. As a result, the frequency of brown squirrels in the next generation is reduced. More red squirrels survive to reproduce, or are sampled, but without regard to any differences in fitness between the two groups: the physical differences between the groups do not play a causal role in the differences in reproductive success. Now, let's say that the predator is not color-blind and can see the red squirrels better than the brown squirrels, resulting in a better survival rate for the brown squirrels. This would be a case of discriminate parent sampling, or natural selection.
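Indiscriminate parent sampling, as in the squirrel scenario above, can be mimicked with a simple simulation. The Python sketch below is a basic Wright-Fisher-style model (a standard textbook model of drift, not something prescribed by this primer): each generation, the next gene pool is drawn at random from the current one, with no fitness differences at all.

```python
import random

def wright_fisher(freq, pop_size, generations, seed=1):
    """Simulate random drift of one allele: indiscriminate parent
    sampling, with no selection acting on the allele."""
    rng = random.Random(seed)
    for _ in range(generations):
        # Each of the 2N allele copies in the next generation is drawn
        # at random from the current gene pool.
        copies = sum(rng.random() < freq for _ in range(2 * pop_size))
        freq = copies / (2 * pop_size)
        if freq in (0.0, 1.0):   # allele lost or fixed by chance alone
            break
    return freq

# Smaller populations drift faster: repeat runs and compare the spread
# of outcomes for a small versus a large population.
for n in (10, 1000):
    outcomes = [wright_fisher(0.5, n, 100, seed=s) for s in range(5)]
    print(n, [round(f, 2) for f in outcomes])
```

Running this repeatedly shows alleles being lost or fixed purely by chance in small populations, even though no variant is fitter than another.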

Founder Effect

Another important cause of genetic drift is the founder effect, the difference between the gene pool of a population as a whole and that of a newly isolated population of the same species. The founder effect occurs when a population is started from a small number of pioneer individuals drawn from one original population. Because of the small sample size, the new population could have a much different genetic ratio than the original population. An example of the founder effect would be a plant population that arises from a single seed.

Thus far, we have discussed natural selection and random drift as events that occur in isolation from one another. However, in most populations, the two processes will be occurring at the same time. Furthermore, there is great debate over whether, in particular instances and in general, natural selection is more prevalent than random drift.

Phylogenetic Trees: Presenting Evolutionary Relationships


Systematics describes the pattern of relationships among taxa and is intended to help us understand the history of all life. But history is not something we can see; it has happened once and leaves only clues as to the actual events. Scientists use these clues to build hypotheses, or models, of life's history. In phylogenetic studies, the most convenient way of visually presenting evolutionary relationships among a group of organisms is through illustrations called phylogenetic trees.
Node: represents a taxonomic unit. This can be either an existing species or an ancestor.
Branch: defines the relationship between the taxa in terms of descent and ancestry.
Topology: the branching pattern of the tree.
Branch length: represents the number of changes that have occurred in the branch.
Root: the common ancestor of all taxa.
Distance scale: a scale that represents the number of differences between organisms or sequences.
Clade: a group of two or more taxa or DNA sequences that includes both their common ancestor and all of their descendants.
Operational Taxonomic Unit (OTU): the taxonomic level of sampling selected by the user for a study, such as individuals, populations, species, genera, or bacterial strains.

A phylogenetic tree is composed of nodes, each representing a taxonomic unit (species, populations, individuals), and branches, which define the relationships between the taxonomic units in terms of descent and ancestry. Only one branch can connect any two adjacent nodes. The branching pattern of the tree is called the topology, and the branch length usually represents the number of changes that have occurred in the branch. This is called a scaled branch. Scaled trees are often calibrated to represent the passage of time. Such trees have a theoretical basis in the particular gene or genes under analysis. Branches can also be unscaled, which means that the branch length is not proportional to the number of changes that have occurred, although the actual number may be indicated numerically somewhere on the branch. Phylogenetic trees may also be either rooted or unrooted. In rooted trees, there is a particular node, called the root, representing a common ancestor, from which a unique path leads to any other node. An unrooted tree specifies only the relationships among species, without identifying a common ancestor or evolutionary path.
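A rooted, scaled tree can be captured in a few lines of code. The sketch below builds a toy tree (the taxa and branch lengths are invented) and serializes it in Newick notation, a widely used text format in which nested parentheses encode the topology and numbers after colons give branch lengths.

```python
# A minimal rooted phylogenetic tree: each node is a taxonomic unit,
# and each branch carries a length (number of changes). Toy data only.
class Node:
    def __init__(self, name=None, children=None):
        self.name = name                # leaf label, or None for an ancestor
        self.children = children or []  # list of (child_node, branch_length)

    def newick(self):
        """Serialize the subtree below this node in Newick notation."""
        if not self.children:
            return self.name
        inner = ",".join(f"{child.newick()}:{length}"
                         for child, length in self.children)
        return f"({inner})"

# ((human:1,chimp:1):2,mouse:4) -- a scaled, rooted toy tree in which
# human and chimp share a more recent common ancestor than either
# shares with mouse.
root = Node(children=[
    (Node(children=[(Node("human"), 1), (Node("chimp"), 1)]), 2),
    (Node("mouse"), 4),
])
print(root.newick() + ";")  # ((human:1,chimp:1):2,mouse:4);
```

The nesting of parentheses is the topology; dropping the numbers after the colons would give the unscaled version of the same tree.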

Figure 1. Possible ways of drawing a tree.


Phylogenetic trees, a convenient way of representing evolutionary relationships among a group of organisms, can be drawn in various ways. Branches on phylogenetic trees may be scaled (top panel), representing the amount of evolutionary change, time, or both when there is a molecular clock, or they may be unscaled (middle panel), with no direct correspondence to either time or amount of evolutionary change. Phylogenetic trees may be rooted (top and middle panels) or unrooted (bottom panels). In the case of unrooted trees, branching relationships between taxa are specified by the way they are connected to each other, but the position of the common ancestor is not. For example, on an unrooted tree with four species, there are five branches (four external, one internal) on which the tree can be rooted. Rooting on each of the five branches has different implications for evolutionary relationships.
Text and figures adapted with permission from A. Vierstraete, University of Ghent, Belgium.

Methods of Phylogenetic Analysis


Two major groups of analyses exist to examine phylogenetic relationships: phenetic methods and cladistic methods. It is important to note that phenetics and cladistics have had an uneasy relationship over the last 40 years or so. Most of today's evolutionary biologists favor cladistics, although a strictly cladistic approach may produce counterintuitive results.

Phenetic Method of Analysis

Phenetics, also known as numerical taxonomy, involves the use of various measures of overall similarity for the ranking of species. There is no restriction on the number or type of characters (data) that can be used, although all data must first be converted to a numerical value, without any character "weighting". Each organism is then compared with every other for all characters measured, and the number of similarities (or differences) is calculated. The organisms are then clustered in such a way that the most similar are grouped close together and the more different ones are linked more distantly. The taxonomic clusters, called phenograms, that result from such an analysis do not necessarily reflect genetic similarity or evolutionary relatedness. The lack of evolutionary significance in phenetics has meant that this system has had little impact on animal classification, and as a consequence, interest in and use of phenetics has been declining in recent years.
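The core of a phenetic analysis, scoring overall similarity from equally weighted numerical characters and clustering the most similar taxa first, can be sketched briefly. The taxa and character codings below are fabricated for illustration.

```python
# Phenetics sketch: overall similarity from numerically coded characters.
# 1 = character present, 0 = absent; taxa and codings are invented.
characters = {
    "taxon_A": [1, 1, 0, 1, 0],
    "taxon_B": [1, 1, 0, 0, 0],
    "taxon_C": [0, 0, 1, 0, 1],
}

def distance(x, y):
    """Simple mismatch count: every character weighted equally,
    as phenetics requires (no character "weighting")."""
    return sum(a != b for a, b in zip(x, y))

# Compare each taxon with every other and find the most similar pair,
# which a clustering method would group together first.
names = sorted(characters)
pairs = [(distance(characters[p], characters[q]), p, q)
         for i, p in enumerate(names) for q in names[i + 1:]]
d, p, q = min(pairs)
print(f"closest pair: {p} and {q} (distance {d})")
```

A full phenetic analysis would repeat this grouping step on the merged clusters (as in UPGMA-style clustering) to build the complete phenogram; this sketch shows only the first, defining step.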

Cladistic Method of Analysis

An alternative approach to diagramming relationships between taxa is called cladistics. The basic assumption behind cladistics is that members of a group share a common evolutionary history. Thus, they are more closely related to one another than they are to other groups of organisms. Related groups of organisms are recognized because they share a set of unique features (apomorphies) that were not present in distant ancestors but which are shared by most or all of the organisms within the group. These shared derived characteristics are called synapomorphies. Therefore, in contrast to phenetic groupings, cladistic groupings do not depend simply on whether organisms share physical traits but on their evolutionary relationships. Indeed, in cladistic analyses, two organisms may share numerous characteristics but still be considered members of different groups.

Cladistic analysis entails a number of assumptions. For example, species are assumed to arise primarily by bifurcation, or separation, of the ancestral lineage; species are often considered to become extinct upon hybridization (crossbreeding); and hybridization is assumed to be rare or absent. In addition, cladistic groupings must possess the following characteristics: all species in a grouping must share a common ancestor, and all species derived from a common ancestor must be included in the taxon. The application of these requirements results in the following terms being used to describe the different ways in which groupings can be made:

A monophyletic grouping is one in which all species share a common ancestor, and all species derived from that common ancestor are included. This is the only form of grouping accepted as valid by cladists.
A paraphyletic grouping is one in which all species share a common ancestor, but not all species derived from that common ancestor are included.
A polyphyletic grouping is one in which species that do not share an immediate common ancestor are lumped together, while other members that would link them are excluded.

The Origins of Molecular Phylogenetics


Macromolecular data, meaning gene (DNA) and protein sequences, are accumulating at an increasing rate because of recent advances in molecular biology. For the evolutionary biologist, the rapid accumulation of sequence data from whole genomes has been a major advance, because the very nature of DNA allows it to be used as a "document" of evolutionary history. Comparisons of the DNA sequences of various genes between different organisms can tell a scientist a lot about the relationships of organisms that cannot otherwise be inferred from morphology, or an organism's outer form and inner structure. Because genomes evolve by the gradual accumulation of mutations, the amount of nucleotide sequence difference between a pair of genomes from different organisms should indicate how recently those two genomes shared a common ancestor. Two genomes that diverged in the recent past should have fewer differences than two genomes whose common ancestor is more ancient. Therefore, by comparing different genomes with each other, it should be possible to derive evolutionary relationships between them, the major objective of molecular phylogenetics. Molecular phylogenetics attempts to determine the rates and patterns of change occurring in DNA and proteins and to reconstruct the evolutionary history of genes and organisms. Two general approaches may be taken to obtain this information. In the first approach, scientists use DNA to study the evolution of an organism. In the second approach, different organisms are used to study the evolution of DNA. Whatever the approach, the general goal is to infer process from pattern: the processes of organismal evolution deduced from patterns of DNA variation and processes of molecular evolution inferred from the patterns of variations in the DNA itself.
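The intuition that fewer sequence differences imply a more recent common ancestor is commonly quantified as the p-distance: the fraction of aligned sites at which two sequences differ. A minimal sketch, using invented sequences:

```python
# p-distance: the proportion of aligned positions at which two
# sequences differ; a simple proxy for time since divergence.
def p_distance(seq1, seq2):
    """Fraction of differing sites between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to equal length")
    diffs = sum(a != b for a, b in zip(seq1, seq2))
    return diffs / len(seq1)

# Invented 10-site alignments: a recently diverged pair versus a pair
# whose common ancestor is more ancient.
recent = p_distance("ACGTACGTAC", "ACGTACGTAT")   # 1 change in 10 sites
ancient = p_distance("ACGTACGTAC", "TCGAACGAAC")  # 3 changes in 10 sites
print(recent, ancient)  # 0.1 0.3
```

Real analyses correct this raw proportion for multiple substitutions at the same site (for example, with the Jukes-Cantor model), but the p-distance is the starting point for distance-based tree building.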

Molecular Phylogenetic Analysis: Fundamental Elements

As we just discussed, macromolecules, especially gene and protein sequences, have surpassed morphological and other organismal characters as the most popular forms of data for phylogenetic analyses. Therefore, this next section will concentrate only on molecular data. It is important to point out that a single, all-purpose recipe does not exist for phylogenetic analysis of molecular data. Although numerous algorithms, procedures, and computer programs have been developed, their reliability and practicality are, in all cases, dependent upon the size and structure of the dataset under analysis. The merits and shortfalls of these various methods are subject to much scientific debate, because the danger of generating incorrect results is greater in computational molecular phylogenetics than in many other fields of science. Occasionally, the limiting factor in such analyses is not so much the computational method used, but the user's understanding of what the method is actually doing with the data. Therefore, the goal of this section is to demonstrate to the reader that practical analysis should be thought of both as a search for a correct model (analysis) and as a search for the correct tree (outcome).

Phylogenetic tree-building methods presume particular evolutionary models. For any given set of data, these models may be violated because of various occurrences, such as the transfer of genetic material between organisms. Therefore, when interpreting a given analysis, a person should always consider the model used and entertain possible explanations for the results obtained. For example, models used in molecular phylogenetic analysis methods make "default" assumptions, including:

The sequence is correct and originates from the specified source.
The sequences are homologous, that is, all descended in some way from a shared ancestral sequence.
Each position in the sequence alignment is homologous with every other position in that alignment.
Each of the multiple sequences included in a common analysis has a common phylogenetic history with the other sequences.
The sampling of taxa is adequate to resolve the problem under study.
Sequence variation among the samples is representative of the broader group.
The sequence variability in the sample contains phylogenetic signal adequate to resolve the problem under study.

Nucleotide and protein sequences can also be used to generate trees. DNA, RNA, and protein sequences can be considered phenotypic traits; the sequences depict the relationships of the genes and, usually, of the organisms in which the genes are found.

The Four Steps of Phylogenetic Analysis


A straightforward phylogenetic analysis consists of four steps:

1. Alignment: building the data model and extracting a dataset.
2. Determining the substitution model: considering sequence variation.
3. Tree building.
4. Tree evaluation.
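The tree-building step can be illustrated with UPGMA, one of the simplest distance-based clustering methods: repeatedly merge the two closest clusters and average the distances to everything else. This is a teaching sketch, not an NCBI tool; the distance values are hypothetical, and real analyses typically prefer more robust methods such as neighbor joining or maximum likelihood.

```python
# Illustrative UPGMA clustering over a hypothetical distance matrix.
# Note: the function mutates `dist` as it merges clusters.
def upgma(labels, dist):
    """Cluster taxa by average linkage; returns a Newick-style topology."""
    clusters = {lab: 1 for lab in labels}  # cluster name -> number of leaves
    while len(clusters) > 1:
        # pick the pair of clusters separated by the smallest distance
        a, b = min(
            ((x, y) for x in clusters for y in clusters if x < y),
            key=lambda p: dist[frozenset(p)],
        )
        merged = f"({a},{b})"
        na, nb = clusters.pop(a), clusters.pop(b)
        # distance from the new cluster to the rest: size-weighted average
        for c in clusters:
            dist[frozenset((merged, c))] = (
                na * dist[frozenset((a, c))] + nb * dist[frozenset((b, c))]
            ) / (na + nb)
        clusters[merged] = na + nb
    return next(iter(clusters))

taxa = ["A", "B", "C", "D"]
d = {frozenset(p): v for p, v in [
    (("A", "B"), 2), (("A", "C"), 6), (("A", "D"), 6),
    (("B", "C"), 6), (("B", "D"), 6), (("C", "D"), 4),
]}
print(upgma(taxa, d))  # ((A,B),(C,D))
```

Because UPGMA merges strictly by distance, it implicitly assumes a constant evolutionary rate across lineages, one of the model assumptions listed above that may be violated by real data.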

Tree Building: Key Features of DNA-based Phylogenetic Trees

Studies of gene and protein evolution often involve the comparison of homologs, sequences that have common origins but may or may not have common activity. Sequences that share a threshold level of similarity, determined by alignment of matching bases, are considered homologous. These sequences are inherited from a common ancestor that possessed a similar structure, although the ancestral sequence may be difficult to determine because it has been modified through descent.

Homologs are most commonly defined as orthologs, paralogs, or xenologs.


Orthologs are homologs produced by speciation: they represent genes derived from a common ancestor that diverged as the organisms diverged. Orthologs tend to have similar functions.

Paralogs are homologs produced by gene duplication: they represent genes derived from a common ancestral gene that duplicated within an organism and then diverged. Paralogs tend to have different functions.

Xenologs are homologs resulting from the horizontal transfer of a gene between two organisms. The function of xenologs can be variable, depending on how significant the change in context was for the horizontally moving gene; in general, though, the function tends to be similar.

A typical gene-based phylogenetic tree is depicted below. This tree shows the relationship between four homologous genes: A, B, C, and D. The topology of this tree consists of four external nodes (A, B, C, and D), each one representing one of the four genes, and two internal nodes (e and f) representing ancestral genes. The branch lengths indicate the degree of evolutionary difference between the genes. This particular tree is unrooted: it is only an illustration of the relationships between genes A, B, C, and D and does not signify anything about the series of evolutionary events that led to these genes.
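Tree topologies like this one are conventionally written as nested parentheses in Newick format, in which the external nodes appear as leaf names and labels after closing parentheses name internal nodes such as e and f. The branch lengths below are hypothetical, and the small parser is an illustrative sketch only.

```python
import re

# Newick notation for a four-gene tree with leaves A-D and internal
# nodes e and f. The branch lengths are hypothetical.
newick = "((A:0.3,B:0.1)e:0.2,(C:0.2,D:0.4)f:0.2);"

def leaf_names(tree):
    """Extract leaf names: names that directly follow '(' or ','."""
    return re.findall(r"[(,]([A-Za-z]\w*):", tree)

print(leaf_names(newick))  # ['A', 'B', 'C', 'D']
```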

The second panel, below, depicts three rooted trees that can be drawn from the unrooted tree shown above, each representing a different possible evolutionary pathway between these four genes. A rooted tree is often referred to as an inferred tree, to emphasize that this type of illustration depicts only the series of evolutionary events inferred from the data under study, which may not be the same as the true tree, the tree depicting the actual series of evolutionary events that occurred.

To distinguish between the pathways, the phylogenetic analysis must include at least one outgroup: a gene that is less closely related to A, B, C, and D than these genes are to each other (panel below). Outgroups enable the root of the tree to be located and the correct evolutionary pathway to be identified. Let's say that the four homologous genes used in the previous tree examples come from human, chimpanzee, gorilla, and orangutan. In this case, an outgroup could be a gene from another primate, such as the baboon, which is known to have branched away from the lineage leading to these four species before their common ancestor.

Gene Trees Versus Species Trees: Why Are They Different?

It is assumed that a gene tree, because it is based on molecular data, will be a more accurate and less ambiguous representation of the species tree than one obtainable by morphological comparisons. This may indeed be the case, but it does not mean that the gene tree is the same as the species tree. For this to be true, the internal nodes in both trees would have to be precisely equivalent, and they are not. An internal node in a gene tree indicates the divergence of an ancestral gene into two genes with different DNA sequences, usually resulting from a mutation of one sort or another. An internal node in a species tree represents what is called a speciation event, whereby the population of the ancestral species splits into two groups that are no longer able to interbreed. These two events, mutation and speciation, do not always occur at the same time.

Molecular Phylogenetics Terminology


Monophyletic: two or more DNA sequences that are derived from a single common ancestral DNA sequence.

Clade: a group of monophyletic DNA sequences, comprising all of the sequences included in the analysis that are descended from a particular common ancestral sequence.

Parsimony: an approach that decides between different tree topologies by identifying the one that involves the shortest evolutionary pathway, that is, the pathway requiring the smallest number of nucleotide changes to go from the ancestral sequence, at the root of the tree, to all of the present-day sequences being compared.

Molecular Clock Hypothesis: states that nucleotide substitutions (or amino acid substitutions, if proteins are being compared) occur at a constant rate, so that the degree of difference between two sequences can be used to assign a date to the time at which their ancestral sequence diverged. The rate of molecular change differs among groups of organisms, among genes, and even among different parts of the same gene. Furthermore, molecular clocks require calibration with fossils to determine the timing of origin of clades, and thus their accuracy is crucially dependent on the fossil record, or lack thereof, for the groups under study. Fossil DNA older than about 25,000–50,000 years is virtually devoid of phylogenetic signal except in rare instances, and therefore traditional morphological studies of extinct and extant organisms remain a crucial component of phylogenetic analysis.
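The molecular clock calculation itself is simple arithmetic: if each lineage accumulates substitutions at a constant rate r per site per year, two diverging sequences move apart at 2r, so the time since their common ancestor is T = d / (2r). The sketch below uses hypothetical numbers chosen for illustration, not measured rates.

```python
# Sketch of a molecular clock date estimate; the rate and distance
# values are hypothetical illustrations.
def divergence_time(d, rate):
    """d: substitutions per site separating the two sequences;
    rate: substitutions per site per year along one lineage.
    Both lineages accumulate changes, hence the factor of 2."""
    return d / (2 * rate)

# e.g. 2% sequence divergence at 1e-9 substitutions/site/year
print(divergence_time(0.02, 1e-9))  # ~10 million years
```

Because real rates vary among lineages and genes, as noted above, such estimates are only as good as the calibration behind the assumed rate.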

Systematics and NCBI


The Taxonomy Project

The purpose of NCBI's Taxonomy Project is to build a consistent phylogenetic taxonomy for the NCBI sequence databases. The Taxonomy Database contains the names and lineages of every organism represented by at least one nucleotide or protein sequence in the NCBI genetic databases. As of February 2003, this total was over 250,000 taxa. For current information, visit NCBI's Taxonomy Statistics Web page. The database is recognized as the standard reference by the international sequence database collaboration (GenBank, EMBL, DDBJ, and Swiss-Prot).

The Taxonomy Browser is an NCBI search tool that allows an individual to search the Taxonomy database. Using the browser, information may be retrieved on the available nucleotide, protein, and structure records for a particular species or higher taxon. The Taxonomy Browser can be used to view the taxonomic position of, or retrieve sequence and structural data for, a particular organism or group of organisms. Searches may be made on the basis of whole, partial, or phonetically spelled organism names, and direct links to organisms commonly used in biological research are also provided. The Entrez Taxonomy system can also display custom taxonomic trees representing user-defined subsets of the full NCBI taxonomy.

TaxPlot, another component of the Taxonomy Project, is a research tool for conducting three-way comparisons of different genomes. Comparisons are based on the sequences of the proteins encoded in each organism's genome. To use TaxPlot, one selects a reference genome to which two other genomes will be compared. The TaxPlot tool then uses pre-computed BLAST results to plot a point for each protein predicted to be encoded in the reference genome.

BLAST: Detecting New Sequence Similarities

Currently, the characters most widely used for phylogenetic analysis are DNA and protein sequences. DNA sequences may be compared directly or, for those regions that code for a known protein, translated into protein sequences. Creating phylogenies from nucleotide or amino acid sequences first requires aligning the bases so that the differences between the sequences being studied are easier to spot. The introduction of NCBI's BLAST, the Basic Local Alignment Search Tool, in 1990 made it easier to rapidly scan huge databases for overt homologies, or sequence similarity, and to statistically evaluate the resulting matches. BLAST works by comparing a user's unknown sequence against the database of all known sequences to determine likely matches. In a matter of seconds, the BLAST server compares the user's sequence with up to a million known sequences and determines the closest matches. Specialized BLASTs are also available for human, mouse, microbial, and many other genomes. A single BLAST search can compare a sequence of interest to all other sequences stored in GenBank, NCBI's nucleotide sequence database. In this step, a researcher has the option of limiting the search to a specific taxonomic group. If the full scientific name or taxonomic relationship of a species of interest is not known, the user can look up such details using NCBI's Taxonomy Browser, which provides direct links to some of the organisms commonly used in molecular research projects, such as the zebrafish, fruit fly, baker's yeast, nematode, and many more.

BLAST next tallies the differences between sequences and assigns a "score" based on sequence similarity. The scores assigned in a BLAST search have a well-defined statistical interpretation, making real sequence matches easier to distinguish from random background hits. This is because BLAST uses a special algorithm, or mathematical formula, that seeks local as opposed to global alignments and is therefore able to detect relationships among sequences that share only isolated regions of similarity. Taxonomy-related BLAST results are presented in three formats based on the information found in NCBI's Taxonomy database. The Organism Report sorts BLAST comparisons, also called hits, by species such that all hits to a given organism are grouped together. The Lineage Report provides a view of the relationships between the organisms based on NCBI's Taxonomy database. The Taxonomy Report provides in-depth details on the relationship between all the organisms in the BLAST hit list.
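The distinction between local and global alignment can be made concrete with the Smith-Waterman dynamic-programming score, the exact method that BLAST's heuristic approximates. This is a minimal sketch: the scoring values (match +2, mismatch -1, gap -2) are arbitrary choices for illustration, and BLAST's actual algorithm, scoring matrices, and statistics differ.

```python
# Minimal Smith-Waterman sketch: best local alignment score between two
# sequences. Scoring parameters are illustrative, not BLAST's defaults.
def local_alignment_score(s, t, match=2, mismatch=-1, gap=-2):
    """Best score of any local alignment between s and t."""
    rows, cols = len(s) + 1, len(t) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            # the floor of 0 is what makes the alignment *local*:
            # a bad region never drags down a good one elsewhere
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

print(local_alignment_score("ACGTT", "TACGT"))  # shared "ACGT" scores 8
```

The zero floor in the recurrence is the key design choice: it lets the algorithm reward an isolated region of similarity regardless of how dissimilar the flanking sequence is, which is exactly why local methods can detect relationships among sequences that share only isolated regions of similarity.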

COGs: Phylogenetic Classification of Proteins

The database of Clusters of Orthologous Groups of proteins (COGs) represents an attempt at the phylogenetic classification of proteins, a scheme that indicates the evolutionary relationships between organisms, derived from complete genomes. Each COG includes proteins that are thought to be orthologous, or connected through vertical evolutionary descent. COGs may be used to detect similarities and differences between species, to identify protein families and predict new protein functions, and to point to potential drug targets in disease-causing species. The database is accompanied by the COGnitor program, which assigns new proteins, typically from newly sequenced genomes, to pre-existing COGs. A Web page containing additional structural and functional information is now associated with each COG. These hyperlinked information pages include: systematic classification of the COG members under the different classification systems; indications of which COG members (if any) have been characterized genetically and biochemically; information on the domain architecture of the proteins constituting the COG and the three-dimensional structure of the domains, if known or predictable; a succinct summary of the common structural and functional features of the COG members, as well as peculiarities of individual members; and key references.

HomoloGene

HomoloGene is a database of both curated and calculated orthologs and homologs for the organisms represented in NCBI's UniGene database. Curated orthologs include gene pairs from the Mouse Genome Database (MGD) at the Jackson Laboratory, the Zebrafish Information Network (ZFIN) database at the University of Oregon, and published reports. Computed orthologs and homologs are identified from BLAST nucleotide sequence comparisons between all UniGene clusters for each pair of organisms. HomoloGene also contains a set of triplet clusters in which orthologous clusters in two organisms are both orthologous to the same cluster in a third organism. HomoloGene can be searched via the Entrez retrieval system.

UniGene is a system for automatically partitioning GenBank sequences into a non-redundant set of gene-oriented clusters. Each UniGene cluster contains sequences that represent a unique gene, as well as related information, such as the tissue types in which the gene has been expressed and map location.

Entrez Genome

The whole genomes of over 1,200 organisms can be found in Entrez Genomes. The genomes represent both completely sequenced organisms and those for which sequencing is in progress. All three main domains of life (bacteria, archaea, and eukaryotes) are represented, as well as many viruses, viroids, plasmids, and eukaryotic organelles. Data can be accessed hierarchically, starting from either an alphabetical listing or a phylogenetic tree for complete genomes in each of six principal taxonomic groups. One can follow the hierarchy to a variety of graphical overviews, including that of the whole genome of a single organism, a single chromosome, or even a single gene. At each level, one can access multiple views of the data, pre-computed summaries, and links to analyses appropriate for that level. In addition, any gene product (protein) that is a member of a COG is linked to the COGs database. A summary of COG functional groups is also presented in tabular and graphical formats at the genome level. For complete microbial genomes, pre-computed BLAST neighbors for protein sequences, including their taxonomic distribution and links to 3D structures, are given in TaxTables and PDBTables, respectively. Pairwise sequence alignments are presented graphically and linked to NCBI's Cn3D macromolecular viewer, which allows the interactive display of three-dimensional structures and sequence alignments.

PDBeast: Taxonomy in MMDB

NCBI's Structure Group, in collaboration with NCBI taxonomists, has undertaken taxonomy annotation for the three-dimensional structure data stored in the Molecular Modeling Database (MMDB). A semi-automated approach has been implemented in which a human expert checks, corrects, and validates automatic taxonomic assignments in MMDB. The PDBeast software tool was developed by NCBI for this purpose. It pulls text descriptions of "Source Organisms" from either the original entries or user-specified information and looks for matches in the NCBI Taxonomy database to record taxonomy assignments.

The Molecular Modeling Database (MMDB) is a compilation of three-dimensional structures of biomolecules obtained from the Protein Data Bank (PDB). The PDB, managed and maintained by the Research Collaboratory for Structural Bioinformatics, is a collection of all publicly available three-dimensional structures of proteins, nucleic acids, carbohydrates, and a variety of other complexes experimentally determined by X-ray crystallography and NMR. The difference between the two databases is that MMDB records reorganize and validate the information stored in the database in a way that enables cross-referencing between the chemistry and the three-dimensional structure of macromolecules. By integrating chemical, sequence, and structure information, MMDB is designed to serve as a resource for structure-based homology modeling and protein structure prediction.

The Importance of Molecular Phylogenetics


The field of molecular phylogenetics has grown, both in size and in importance, since its inception in the early 1990s, attributable mostly to advances in molecular biology and to more rigorous methods for phylogenetic tree building. The importance of phylogenetics has also been greatly enhanced by the successful application of tree reconstruction, as well as other phylogenetic techniques, to more diverse and perplexing issues in biology. Today, a survey of the scientific literature will show that molecular biology, genetics, evolution, development, behavior, epidemiology, ecology, systematics, conservation biology, and forensics are but a few of the many disparate fields conceptually united by the methods and theories of molecular phylogenetics. Phylogenies are used in essentially the same way in all of these fields, by drawing inferences either from the structure of the tree or from the way the character states map onto the tree. Biologists can then use these clues to build hypotheses and models of important events in history. Broadly speaking, the relationships established by phylogenetic trees often describe a species' evolutionary history and, hence, its phylogeny: the historical relationships among lineages of organisms or their parts, such as their genes. Phylogenies may be thought of as a natural and meaningful way to order data, with an enormous amount of evolutionary information contained within their branches. Scientists working in these different areas can then use these phylogenies to study and elucidate the biological processes occurring at many levels of life's hierarchy.
