Nihms 263986
Author Manuscript
Nat Rev Genet. Author manuscript; available in PMC 2011 May 1.
Published in final edited form as:
NIH-PA Author Manuscript
Abstract
In the few years since its initial application, massively parallel cDNA sequencing, or RNA-seq,
has allowed many advances in the characterization and quantification of transcriptomes. Recently,
several developments in RNA-seq methods have provided an even more complete characterization
of RNA transcripts. These developments include improvements in transcription start site mapping,
strand-specific measurements, gene fusion detection, small RNA characterization and detection of
alternative splicing events. Ongoing developments promise further advances in the application of
RNA-seq, particularly direct RNA sequencing and approaches that allow RNA quantification from
Over the past 10 years we have come to appreciate the dynamic state of genomes, including
both DNA modifications and RNA quantitative and qualitative changes, which have been
characterized in species ranging from simple model organisms to humans. This advance has
occurred through the use of various genomic measurements, including comprehensive
NGS platforms sequence as many as billions of DNA strands in parallel, yielding substantially more throughput and minimizing the
need for the fragment-cloning methods that are often used in Sanger sequencing of genomes.
Semisuppressive PCR
A PCR strategy that aims to reduce primer dimer accumulation by preferentially amplifying longer DNA fragments.
Spike pool
Internal controls added to RNA samples, consisting of RNA elements of known sequence and composition.
Paired-end reads
A strategy involving sequencing of two different regions that are located apart from each other on the same DNA fragment. This
strategy provides elevated physical coverage and alleviates several limitations of NGS platforms that arise because of their relatively
short read length.
Laser capture microdissection
(Often abbreviated to LCM.) A method that allows cells of interest, chosen by the operator under a microscope, to be specifically
captured from heterogeneous tissue samples. The isolated cells can be used for various analyses, including those of protein and nucleic acids.
Quantitative real-time polymerase chain reaction
A PCR application that enables the measurement of nucleic acid quantities in samples. The nucleic acid of interest is amplified by PCR,
and the accumulation of amplified product is measured in real time over the PCR cycles. These data are used to infer the starting
nucleic acid quantities.
Circulating extracellular nucleic acid
Extracellular DNA or RNA molecules in plasma and serum.
Ozsolak and Milos Page 2
transcriptomics studies1. We now have a new appreciation for the complexity of the
transcriptome, encompassing a multitude of previously unknown coding and non-coding
RNA species, particularly small RNAs (sRNAs), including microRNAs, promoter-
New methodologies for RNA-seq studies have been providing a progressively fuller
knowledge of both the quantitative and qualitative aspects of transcript biology in both
prokaryotes6 and eukaryotes5. Here we discuss these advances, which have included the
development of approaches to allow a more comprehensive understanding of transcription
initiation sites, the cataloguing of sense and antisense transcripts, improved detection of
alternative splicing events and the detection of gene fusion transcripts, which has become
increasingly important in cancer research — all at a data scale that was unimagined just
several years ago. Recently developed approaches also allow the selection of specific RNA
molecules before RNA-seq, allowing transcriptomics studies with more focused aims. In this
Review, we provide an overview of these methods, touching only briefly on the types of
biological insight that they allow, and focusing on the technologies themselves. We provide
a comparison of the different approaches that are available for each application and discuss
the current limitations and the potential for future improvements. We conclude by discussing
two new developments in RNA-seq technologies: direct RNA sequencing (DRS)7 and
methods for the reliable profiling of minute RNA quantities, which is important for
translational research and clinical applications of RNA-seq.
define RNA products and to identify adjacent promoter regions that regulate the expression
of each transcript. One of the first high-throughput TSS mapping methods was the cap
analysis of gene expression (CAGE) approach, which was initially developed for Sanger
sequencing8,9. This involved sequencing of cloned cDNA products derived from RNAs
with intact 5′ ends (for example, containing a 5′ cap structure). Although useful, the
technology required high quantities of input RNA and generated only short reads (~20
nucleotides) per TSS.
These limitations prompted the adaptation of the CAGE approach for NGS platforms, which
has resulted in the discovery of the unexpected complexity of TSS distribution across
genomes and in the regions surrounding individual promoters. Methods that combine RNA-
seq with CAGE include deep CAGE10, PEAT11, nanoCAGE and CAGEscan12, which
collectively resolve several technical challenges of the initial Sanger sequencing-based
CAGE strategies (TABLE 1). First, nanoCAGE12 now allows TSS mapping from total RNA
examination of the connectivity of TSSs with downstream regions and facilitates the
assignment of identified TSSs to specific transcripts. In addition, paired-end sequencing
partly alleviates the difficulty of aligning single short reads to repeat regions and thus allows
a subset of repeat elements to be at least partially characterized by RNA-seq.
However, these NGS-based approaches have several caveats. One is that no attempt
has been made to examine whether the amplification and other manipulation steps that are
carried out distort the resulting view of how frequently each TSS is used. Spike-in
experiments would be useful to address this issue. In addition, multiple difficulties were
encountered during the development of protocols involving cDNA synthesis and
amplification12. For example, researchers observed artefacts such as primer dimers that
dominated sequencing data sets and reduced effective coverage, prompting the use of
semisuppressive PCR to reduce primer dimer frequency12. Thus, although these methods
may be useful for qualitative applications, establishing and improving their quantitative
capabilities will probably require additional development.
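
One way the spike-in experiment suggested above could be analysed is sketched here. The idea is to compare the known input amount of each spiked RNA with the number of reads it yields: on a log-log scale, a slope near 1 and a high correlation would suggest that the protocol preserves relative abundance. The function name, the choice of a log-log least-squares fit and the example values are all assumptions made for illustration; they are not taken from any published CAGE protocol or data set.

```python
import math

def log_log_fit(known_amounts, observed_reads):
    """Fit log10(reads) against log10(input amount) by least squares.

    Returns (slope, pearson_r). A slope near 1 with r near 1 suggests
    that read counts scale proportionally with the spiked-in amount.
    """
    if len(known_amounts) != len(observed_reads) or len(known_amounts) < 2:
        raise ValueError("need two or more paired measurements")
    xs = [math.log10(a) for a in known_amounts]
    ys = [math.log10(r) for r in observed_reads]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("amounts and read counts must each vary")
    return sxy / sxx, sxy / math.sqrt(sxx * syy)

# Hypothetical spike pool: input amounts (arbitrary units) and read counts.
amounts = [1, 10, 100, 1000]
reads = [40, 520, 4800, 51000]
slope, r = log_log_fit(amounts, reads)
print(f"slope={slope:.2f}, r={r:.3f}")
```

A slope well below 1 in such a fit would hint at compression of the dynamic range, for example through amplification saturation, although a real assessment would need replicate libraries and spikes that span the sequence composition of the sample.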
sequence and structure. In addition, RNA-based TSS mapping is challenging for short-lived
transcripts such as primary microRNAs, which are transcribed generally at high levels but
are scarce owing to their rapid degradation. These limitations may be partly alleviated when
combined with other methods such as chromatin-based TSS prediction, which relies on
detecting histone modifications that are indicative of active transcription13,14. Such
integration may also be useful in light of the recent suggestion that post-transcriptional
processing results in 5′ cap-like structures in RNA fragments15. Thus, relying solely on
CAGE data for TSS mapping may result in difficulties in separating transcription initiation
events from RNA processing events.
Strand-specific RNA-seq
Transcriptomic studies in a range of species have revealed a pervasive presence of antisense
transcription events16. Although these events were once considered to reflect biological or
technical noise, it is now clear that antisense transcripts are functional and have various roles
in both normal physiological states and disease states16. There is therefore an increasing
interest in profiling transcriptomes at greater depths to fully characterize sense and antisense
transcription products. Standard RNA-seq approaches generally require double-stranded
cDNA synthesis, which erases RNA strand information. In addition, during first-strand
cDNA synthesis, spurious second-strand cDNA artefacts can be introduced, owing to the
DNA-dependent DNA polymerase (DDDP) activities of reverse transcriptases17-19, which
can confound sense versus antisense transcript determination20. Actinomycin D has been
suggested as a potential agent to reduce DDDP activities of reverse transcriptases18, but the
extent to which it is effective, and whether or not it introduces additional artefacts, has not
been fully examined. To overcome these difficulties, several strategies for strand-specific
analyses of transcriptomes have been developed.
The strategies that have been developed to generate strand-specific information generally
rely on one of three approaches. The first involves the ligation of adaptors in a
predetermined orientation to the ends of RNAs or to first-strand cDNA molecules21-23. The
known orientations of these adaptors are used as reference points to obtain RNA strand
information. A second approach is the direct sequencing of the first-strand cDNA products
that are generated, either in solution24,25 or on surfaces26. The third approach is the
selective chemical marking of the second-strand cDNA synthesis products or RNA27,28.
These strategies have already begun to contribute to our understanding of transcriptomes,
including mapping of translation states of RNAs (for example, polysome profiling)29 and
identification of novel promoter-associated RNAs22.
A recent study that used the Saccharomyces cerevisiae genome as a reference compared the
performance of several of these strategies, and the authors observed differences in these
methods with respect to their level of strand specificity, evenness of coverage, agreement
with known annotations, library complexity (for example, number of unique read start
positions, which indicates the protocols’ abilities to avoid amplification artefacts such as
duplicate reads) and ability to generate quantitative expression profiles30. However, in-depth
comparative studies that characterize the biases and artefacts that are introduced by each of
these approaches are still lacking, and scientists working with these data sets should be
aware of several issues.
First, given the tendency of reverse transcriptase to generate spurious second-strand cDNA
products during first-strand cDNA synthesis17-19, it is not clear whether the approaches that
rely on sequencing first-strand cDNA products (either directly or by intra- or inter-molecular
ligation) are absolutely strand specific. The strand specificity of such approaches has been
reported by quantifying the ratio of reads that map in the antisense orientation to the known,
well-annotated genes, relative to the reads that map in the sense orientation. This
investigation revealed that a small fraction of reads obtained with these approaches still
align in the antisense orientation; thus, these approaches may not be entirely strand-
specific30. Furthermore, cDNA products that contain both first- and second-strand cDNA
products may not align properly to reference sequences. Given the incomplete annotations of
sense and antisense transcripts in genomes, even in those of well-studied species such as S.
cerevisiae, the true extent of strand specificity of these approaches should be carefully
assessed. Ideally, such assessment should be performed with chemically synthesized RNA
spike pools of defined sequence.
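
The strand-specificity measure described above can be sketched as a simple calculation. For a gene (or a synthetic spike RNA) whose strand is known, reads are split into those that agree with the annotated strand and those that do not, and the antisense share is reported. The function name and input layout are hypothetical; in practice the read strand would be taken from the aligner output, for example the reverse-strand bit of the SAM FLAG field.

```python
def antisense_fraction(read_strands, annotated_strand):
    """Return the fraction of reads mapping opposite to the annotated strand.

    read_strands: sequence of "+" or "-" values, one per read that
                  overlaps the feature.
    annotated_strand: "+" or "-" for the feature itself.
    """
    if annotated_strand not in ("+", "-"):
        raise ValueError("annotated_strand must be '+' or '-'")
    if not read_strands:
        raise ValueError("no reads overlap this feature")
    antisense = sum(1 for s in read_strands if s != annotated_strand)
    return antisense / len(read_strands)

# Hypothetical example: 4 reads over a plus-strand feature, 1 on the minus strand.
print(antisense_fraction(["+", "+", "+", "-"], "+"))
```

For an endogenous gene this value is only an upper bound on the protocol's error, because some of the antisense reads may reflect genuine antisense transcription. For a chemically synthesized spike RNA, no antisense template exists, so the value estimates the error more directly.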
Second, ligation tends to have sequence preferences31,32. Thus, the approaches that rely on
ligation may suffer from various representational biases. Examples of such bias are found in
transcriptome profiling23 and ribosome profiling experiments29, in which extremely uneven
coverage was seen for libraries prepared using ligation, compared with libraries prepared
using enzymatic 3′ polyadenylation29. Third, the in-solution or on-surface amplification step
included in some of these approaches may introduce additional artefacts — for example, in
the form of GC biases and duplicate reads33-35. Examination of such effects revealed a
duplicate read fraction in the range of 6.1% to 94.1% for standard and strand-specific
Illumina RNA-seq strategies, and the existence of GC bias towards RNA templates with
neutral GC content23. It is hoped that many of these limitations will be overcome by the
sequencing technologies that are in development or with modifications and improvements to
existing sequencing technologies4.
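
The two artefact measures mentioned above, duplicate read fraction and GC content, can each be estimated with a few lines of code. The sketch below assumes that a duplicate is any read sharing its chromosome, strand and start position with an earlier read. This is a common simplification and not necessarily the definition used in the cited studies. All names and example values are hypothetical.

```python
def duplicate_fraction(read_starts):
    """Fraction of reads that repeat an already seen start position.

    read_starts: iterable of (chromosome, strand, position) tuples.
    The number of unique tuples is a simple proxy for library complexity.
    """
    starts = list(read_starts)
    if not starts:
        raise ValueError("no reads supplied")
    return 1 - len(set(starts)) / len(starts)

def gc_content(sequence):
    """Fraction of G and C bases in a nucleotide sequence."""
    if not sequence:
        raise ValueError("empty sequence")
    seq = sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

# Hypothetical reads: two of the four share the same start position.
reads = [("chrI", "+", 100), ("chrI", "+", 100),
         ("chrI", "+", 250), ("chrII", "-", 40)]
print(duplicate_fraction(reads))
print(gc_content("GGCCAATT"))
```

Start-position counting overestimates amplification duplicates for highly expressed or very short transcripts, where independent fragments can legitimately begin at the same base. The figure is therefore best compared between protocols applied to the same sample.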
alternative splicing used computational strategies to compensate for this limitation. The
reference sequence used for alignment was supplemented with ‘artificial’ sequences that
surround all possible splice junctions between the annotated exons of genes, allowing the
reads to be aligned38-41. These approaches changed our view of human splicing, as more
than 95% of human multi-exon genes were found to be alternatively spliced, with ~110,000
novel splice sites per tissue42. By counting the number of reads mapping to each exon and
spanning each splice junction, these approaches also allowed the splice efficiency of each
junction to be determined and the levels of distinct isoforms to be quantified43,44.
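
The junction-library idea described above can be illustrated with a minimal sketch. For a gene with annotated exons, every ordered exon pair is joined into an artificial sequence made of the end of the upstream exon and the start of the downstream exon, so that a read crossing that junction has something to align to. Taking read length minus one base from each side is an assumption of this sketch (it forces every aligned read to cross the junction) and is not necessarily the flank length used in the cited studies. The function name and toy sequences are hypothetical.

```python
from itertools import combinations

def junction_library(exons, read_length):
    """Build artificial splice-junction sequences for one gene.

    exons: list of exon sequences in 5'-to-3' transcript order.
    read_length: sequencing read length in bases (must be 2 or more).
    Returns a dict mapping (upstream_index, downstream_index) to the
    junction sequence. Exons shorter than the flank are used whole.
    """
    if read_length < 2:
        raise ValueError("read_length must be 2 or more")
    flank = read_length - 1
    library = {}
    for i, j in combinations(range(len(exons)), 2):
        library[(i, j)] = exons[i][-flank:] + exons[j][:flank]
    return library

# Toy gene with three exons and 4-base reads.
toy_exons = ["AAAAAA", "CCCCCC", "GGGGGG"]
for pair, seq in junction_library(toy_exons, 4).items():
    print(pair, seq)
```

Because the number of exon pairs grows quadratically with exon count, such a library stays manageable only when it is restricted to annotated exons within the same gene, which is also why this strategy cannot find junctions that involve unannotated exons.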
Improvements to current sequencing technologies now enable longer read lengths, allowing
better mapping of the reads to the alternatively spliced exons. This improvement comes
from being able to partition the reads into multiple pieces and to align each piece
independently to the genomes. In addition, approaches that involve paired-end reads now
enable sequence information to be obtained from two points in a transcript with an estimated
distance between the reads. As a result, it is now possible to search for splicing patterns
without a requirement for prior knowledge of transcript annotations45,46 (FIG. 1).
Examination of splicing patterns and transcript connectivity in an unbiased and genome-wide
manner requires full-length transcript sequences to be obtained, which may be enabled in the
future by emerging technologies47,48.
RNA-seq combined with computational analyses analogous to the ones described above for
splice-site detection can also be used to identify gene fusion events in disease tissues, which
has particular importance for cancer research49. Genomic DNA can be analysed with single-
read and paired-end-read strategies for the detection of translocations and other genomic
rearrangements50. However, RNA-seq may be preferable for identifying events that produce
aberrant RNA species and therefore have a higher likelihood of being functional or causal in
biological or disease settings51,52 (FIG. 2). Furthermore, genomic DNA-based approaches
cannot identify fusion events that are due to non-genomic factors, such as trans-splicing53
and read-through events between adjacent transcripts51,54. Paired-end RNA-seq can be
particularly advantageous for fusion identification because of the increased physical
coverage it offers. This approach has led to important biological findings in oncology55,56,
offering potential targets for therapeutic modulation.
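
A minimal sketch of how paired-end reads can nominate fusion candidates follows. It assumes that each mate has already been aligned and assigned to a gene, and it simply counts read pairs whose two mates fall in different genes. The function name, the `min_pairs` threshold and the gene labels are hypothetical, and this is not the method of any particular published pipeline.

```python
from collections import Counter

def fusion_candidates(read_pairs, min_pairs=2):
    """Count read pairs whose mates map to two different genes.

    read_pairs: iterable of (gene_of_mate1, gene_of_mate2); use None
                for a mate that overlaps no annotated gene.
    min_pairs: minimum number of supporting pairs to report a candidate.
    Returns a dict mapping a sorted (gene_a, gene_b) tuple to its count.
    """
    support = Counter()
    for gene1, gene2 in read_pairs:
        if gene1 is None or gene2 is None or gene1 == gene2:
            continue  # unannotated or concordant pair: no fusion evidence
        support[tuple(sorted((gene1, gene2)))] += 1
    return {pair: n for pair, n in support.items() if n >= min_pairs}

# Hypothetical gene assignments for five read pairs.
pairs = [("GENE_A", "GENE_B"), ("GENE_B", "GENE_A"), ("GENE_A", "GENE_A"),
         ("GENE_A", "GENE_C"), (None, "GENE_B")]
print(fusion_candidates(pairs))
```

Candidates found this way still need filtering, because read-through between adjacent genes, mismapping between homologous genes and chimeric cDNA artefacts all produce discordant pairs of the same kind.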
The challenges faced in fusion detection generally parallel those for alternative
splicing detection. In addition, RNA-seq-based analyses cannot detect fusion events that
involve the exchange of the promoter of a gene with the coding sequence of another gene.
Furthermore, RNA-seq data include chimeric cDNA artefacts that are generated by template
switching during reverse transcription and amplification57 (discussed below), leading to
false positives in gene fusion identification. These difficulties may be partly alleviated when
long-read RNA sequencing technologies with sufficient throughput and sequencing
performance become available4.
protein-coding transcriptome. RNA-seq of poly(A)+ RNA species offers a natural route for
exome sequencing without the use of enrichment strategies. The potential suitability of
mRNA-seq data for the identification of nucleotide variations has been demonstrated
recently by several studies59-61. However, these studies also underscored some challenges
— for example, the high sequencing depth required to sufficiently cover low-abundance
transcripts.
Slight modifications of the genomic DNA-enrichment strategies for cDNA applications have
allowed the development of targeted RNA-seq (FIG. 3). Targeted RNA-seq approaches have
been used to detect fusion transcripts, allele-specific expression, mutations and RNA-editing
events in a subset of transcripts62-64. Targeted RNA-seq strategies currently require longer
sample preparation steps and higher input RNA and cDNA quantities than do other RNA-
seq approaches, owing to the additional probe or microarray preparation and target-selection
steps. Furthermore, capture efficiency usually differs between target regions depending on
hybridization efficiency and other factors. Simplification of this process and improvements
in capture efficiency are desirable for better experimental outcomes.
The impact of NGS technologies on sRNA discovery and characterization has been
particularly noteworthy. These studies have been reviewed extensively by others (for
example, see REF. 65), so we do not review this topic in depth here but provide a brief
summary for completeness.
One important limitation of the current RNA-seq-based approaches for studying sRNAs is
their inability to provide an absolutely quantitative view of these transcripts. It has recently
become clear that, although the NGS-based sRNA-profiling approaches can be used for
differential expression analyses, the number of reads obtained per sRNA does not
necessarily correlate with their actual abundance73,74. This discrepancy seems to be due to
biases that are introduced during the sample preparation and sequencing steps. Whether
emerging technologies can improve sRNA quantification remains to be seen.
is being synthesized can sometimes dissociate from the template RNA and re-anneal to a
different stretch of RNA with a sequence similar to the initial template, generating
artefactual chimeric cDNAs. Template switching may cause problems in the identification
of exon–intron boundaries and true chimeric transcripts. Reverse transcriptases can also
synthesize cDNA in a primer-independent manner, which is thought to be caused by self
priming arising from the RNA secondary structure. This results in random
cDNA synthesis. Furthermore, reverse transcriptases have lower fidelity than other
polymerases owing to their lack of proofreading mechanisms78,79, and they have variable
RNA to cDNA conversion efficiency depending on the experimental conditions.
In addition to their requirement for cDNA synthesis, current RNA-seq approaches can
present other difficulties. First, the RNA-seq signal across transcripts tends to show non-
uniformity of coverage, which may be a result of biases introduced during various steps,
such as priming with random hexamers80,81, cDNA synthesis, ligation31,32, amplification35
and sequencing33-35,82. Second, commonly used RNA-seq strategies can result in
transcript-length bias because of the multiple fragmentation and RNA or cDNA size-
selection steps they use83. This bias may result in complications for downstream analyses84.
Third, quantification of transcripts with RNA-seq requires consideration of read mapping
uncertainty (owing to sequencing error rates, repetitive elements, incomplete genome
sequence and inaccuracies in transcript annotations)85 and normalization of the number of
reads mapping to each transcript, based on transcript length. Despite improvements in
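
The length normalization mentioned above can be made concrete with one widely used measure, reads per kilobase of transcript per million mapped reads (RPKM). The text does not name a specific measure, so choosing RPKM here is an assumption made for illustration; the function name and example numbers are hypothetical.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    Dividing by transcript length (in kilobases) corrects for longer
    transcripts yielding more fragments; dividing by library size (in
    millions of mapped reads) makes libraries of different depth comparable.
    """
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and library size must be positive")
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)

# Hypothetical transcript: 500 reads, 2,000 bases long, 10 million mapped reads.
print(rpkm(500, 2000, 10_000_000))
```

This calculation assumes that each read has been assigned to exactly one transcript. The mapping uncertainty discussed above, such as reads from repeats or from exons shared between isoforms, has to be resolved before a count is passed in.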
The first massively parallel DRS approach was recently developed using the Helicos single-
molecule sequencing platform7,90,91 (FIG. 4). It relies on hybridization of several
femtomoles of 3′-polyadenylated RNA templates to single channels of poly(dT)-coated
sequencing surfaces, followed by sequencing by synthesis. This approach can select and
sequence poly(A)+ RNA from total RNA or cellular lysates, with sequence data being
derived from regions immediately upstream of the polyadenylation sites7. Thus, the
technology offers a path to obtain gene expression profiles and map polyadenylation sites in
a quantitative and genome-wide manner. RNA species that lack natural poly(A) tails can be
polyadenylated in vitro and analysed with DRS.
The development of DRS approaches that are free from cDNA synthesis artefacts such as
template switching and spurious second-strand synthesis provides potential improvements
for applications such as the surveying of strand-specific transcription. Furthermore, DRS
requires only femtomole or attomole levels of input RNA, depending on the application, and
involves relatively simple sample preparation. DRS-type technologies may therefore be
advantageous for applications that are challenging for current cDNA-based methodologies,
A key challenge for DRS is to generate the multimillion-level read quantities that are
required for many RNA applications, particularly quantification, and to further reduce error
rates and input RNA quantities through alterations to the sequencing chemistry and
template-capture steps. DRS may also not solve all of the RNA-seq limitations listed above
— including, for example, the issues of degradation products being captured during
poly(A)+ RNA selection. Furthermore, the combination of paired-end approaches with DRS
and longer read lengths is needed for various applications discussed above, including studies
focusing on the identification of 5′ (for example, CAGE-type TSS mapping) and 3′
progress towards personalized medicine. Strategies that can provide a comprehensive and
bias-free view of transcriptomes using picogram quantities of input RNA would therefore
stimulate great advances in a range of areas.
representation of the distinct RNA molecules in the original sample. To what extent the
current methods meet these criteria is not clear. Studies performed with microarray-based
measurements suggest that amplification introduces variability and discrepancies, especially
for middle- and low-abundance transcripts and as input RNA quantity is lowered further98.
Emerging technologies
A number of both hybridization- and sequencing-based technologies are now emerging that
may allow reliable transcriptome profiles to be obtained from minute cell quantities. On the
sequencing side, nanoCAGE12 now allows TSS mapping from 10 nanograms of total RNA
through the use of various amplification strategies. Amplification-free RNA-seq approaches
have recently been developed that minimize the quantity of input RNA required. One
approach involves the sequencing of first-strand cDNA products from as little as ~500
picograms of RNA, with priming carried out in solution with oligo-dT or random
Hybridization-based methodologies also show promise for working with very small
quantities of RNA. The NanoString nCounter System provides an alternative method for
RNA quantification without the requirement for cDNA synthesis, and it relies on the
generation of target-specific probes (FIG. 5b). The probe mixture is hybridized to RNA
samples in solution, followed by the immobilization of probe–RNA duplexes on surfaces
and single-molecule imaging to identify and count individual transcripts99. In principle, the
system can detect up to 16,384 transcripts simultaneously. This approach requires ~100
nanograms of RNA or 2,000–5,000 cells100, but optimization of the probe hybridization and
Fluidigm offers a microfluidics platform that can perform quantitative real-time polymerase
chain reaction (qRT-PCR) experiments on gene panels in a multiplexed manner and has
been used to profile single cells. Commercial kits allowing one-step cDNA synthesis and
amplification are used for cell lysis, cDNA synthesis and PCR amplification of the transcript
region of interest. Pre-amplified cDNAs are then introduced to the Fluidigm Dynamic Array
for qRT-PCR analysis. This approach may be useful for the determination of the expression
levels of a subset of transcripts across cells of interest101,102.
None of the approaches described above is mature, and none so far fully addresses our need
for reliable, genome-wide and in-depth transcriptome profiles from minute cell quantities.
For example, both the Fluidigm and NanoString technologies interrogate only a selected
subset of transcripts and do not provide comprehensive analyses. However, it is hoped that
future advances that will arise from the foundation formed by these technologies will enable
such capabilities.
Future perspectives
Recent advances in RNA-seq have provided researchers with a powerful toolbox for the
characterization and quantification of the transcriptome. Emerging sequencing technologies
promise to at least partly alleviate the difficulties of current RNA-seq methods and equip
scientists with better tools. Using these technological advances, we can build a complete
catalogue of transcripts that are derived from genomes ranging from those of simple
unicellular organisms to complex mammalian cells, as well as in tissues in normal and
disease states. Furthermore, with our increasing ability to work with minute RNA quantities
from fresh and formalin-fixed paraffin-embedded tissues and cells, and to provide
quantification of RNA species from even single cells, we have the opportunity to define
complex biological networks in a wide range of biological specimens. With these networks
in hand, we can use data-driven RNA network models of cells and tissues in an attempt to
fully understand the biological pathways that are active in various physiological conditions.
In addition, these technologies are bringing us closer to the ability to use RNA
measurements for clinical diagnostics. For example, analysis of circulating extracellular
nucleic acid103 and cells, such as fetal RNA and circulating tumour cells, with these new
technologies may allow for earlier assessment of health, disease recurrence or mutational
status. Thus, these technologies will continue to help us realize the full potential of genomic
information as it relates to basic biological questions of differentiation and diversity, as well
as its growing impact on the personalization of healthcare.
Acknowledgments
We apologize to authors whose work could not be cited owing to space constraints. We are grateful to the US
National Human Genome Research Institute for their support (grants R01 HG005230 and R44 HG005279).
References
1. Birney E, et al. Identification and analysis of functional elements in 1% of the human genome by the
ENCODE pilot project. Nature 2007;447:799–816. [PubMed: 17571346]
2. Berretta J, Morillon A. Pervasive transcription constitutes a new level of eukaryotic genome
regulation. EMBO Rep 2009;10:973–982. [PubMed: 19680288]
3. Kapranov P, Willingham AT, Gingeras TR. Genome-wide transcription and the implications for
genomic organization. Nature Rev. Genet 2007;8:413–423. [PubMed: 17486121]
4. Metzker ML. Sequencing technologies — the next generation. Nature Rev. Genet 2010;11:31–46.
[PubMed: 19997069] This Review provides a comprehensive overview of currently available and
10. Valen E, et al. Genome-wide detection and analysis of hippocampus core promoters using
DeepCAGE. Genome Res 2009;19:255–265. [PubMed: 19074369]
11. Ni T, et al. A paired-end sequencing strategy to map the complex landscape of transcription
[PubMed: 17897965]
19. Spiegelman S, et al. DNA-directed DNA polymerase activity in oncogenic RNA viruses. Nature
1970;227:1029–1031. [PubMed: 4317810]
20. Wu JQ, et al. Systematic analysis of transcribed loci in ENCODE regions using RACE sequencing
reveals extensive transcription in the human genome. Genome Biol 2008;9:R3. [PubMed:
18173853]
21. Cloonan N, et al. Stem cell transcriptome profiling via massive-scale mRNA sequencing. Nature
Methods 2008;5:613–619. [PubMed: 18516046]
22. Core LJ, Waterfall JJ, Lis JT. Nascent RNA sequencing reveals widespread pausing and divergent
initiation at human promoters. Science 2008;322:1845–1848. [PubMed: 19056941]
23. Mamanova L, et al. FRT-seq: amplification-free, strand-specific transcriptome sequencing. Nature
Methods 2010;7:130–132. [PubMed: 20081834]
24. Lipson D, et al. Quantification of the yeast transcriptome by single-molecule sequencing. Nature
Biotechnol 2009;27:652–658. [PubMed: 19581875]
25. Ozsolak F, et al. Digital transcriptome profiling from attomole-level RNA samples. Genome Res
2010;20:519–525. [PubMed: 20133332]
26. Ozsolak F, et al. Amplification-free digital gene expression profiling from minute cell quantities.
Nature Methods 2010;7:619–621. [PubMed: 20639869]
27. He Y, Vogelstein B, Velculescu VE, Papadopoulos N, Kinzler KW. The antisense transcriptomes
of human cells. Science 2008;322:1855–1857. [PubMed: 19056939]
28. Parkhomchuk D, et al. Transcriptome analysis by strand-specific sequencing of complementary
DNA. Nucleic Acids Res 2009;37:e123. [PubMed: 19620212]
29. Ingolia NT, Ghaemmaghami S, Newman JR, Weissman JS. Genome-wide analysis in vivo of
translation with nucleotide resolution using ribosome profiling. Science 2009;324:218–223.
[PubMed: 19213877]
30. Levin JZ, et al. Comprehensive comparative analysis of strand-specific RNA sequencing methods.
Nature Methods 2010;7:709–715. [PubMed: 20711195]
31. Faulhammer D, Lipton RJ, Landweber LF. Fidelity of enzymatic ligation for DNA computing. J.
Comput. Biol 2000;7:839–848. [PubMed: 11382365]
32. Housby JN, Southern EM. Fidelity of DNA ligation: a novel experimental approach based on the
polymerisation of libraries of oligonucleotides. Nucleic Acids Res 1998;26:4259–4266. [PubMed:
9722647]
33. Dohm JC, Lottaz C, Borodina T, Himmelbauer H. Substantial biases in ultra-short read data sets
from high-throughput DNA sequencing. Nucleic Acids Res 2008;36:e105. [PubMed: 18660515]
34. Goren A, et al. Chromatin profiling by directly sequencing small quantities of immunoprecipitated
43. Jiang H, Wong WH. Statistical inferences for isoform expression in RNA-seq. Bioinformatics
2009;25:1026–1032. [PubMed: 19244387]
44. Richard H, et al. Prediction of alternative isoforms from exon expression levels in RNA-seq
experiments. Nucleic Acids Res 2010;38:e112. [PubMed: 20150413]
45. Ameur A, Wetterbom A, Feuk L, Gyllensten U. Global and unbiased detection of splice junctions
from RNA-seq data. Genome Biol 2010;11:R34. [PubMed: 20236510]
46. Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-seq.
Bioinformatics 2009;25:1105–1111. [PubMed: 19289445]
47. Eid J, et al. Real-time DNA sequencing from single polymerase molecules. Science 2009;323:133–
138. [PubMed: 19023044]
48. Olasagasti F, et al. Replication of individual DNA molecules under electronic control using a
protein nanopore. Nature Nanotech 2010;5:798–806.
49. Mitelman F, Johansson B, Mertens F. The impact of translocations and gene fusions on cancer
causation. Nature Rev. Cancer 2007;7:233–245. [PubMed: 17361217]
50. Korbel JO, et al. Paired-end mapping reveals extensive structural variation in the human genome.
Science 2007;318:420–426. [PubMed: 17901297]
51. Maher CA, et al. Transcriptome sequencing to detect gene fusions in cancer. Nature 2009;458:97–
101. [PubMed: 19136943]
59. Chepelev I, Wei G, Tang Q, Zhao K. Detection of single nucleotide variations in expressed exons
of the human genome using RNA-seq. Nucleic Acids Res 2009;37:e106. [PubMed: 19528076]
60. Cirulli ET, et al. Screening the human exome: a comparison of whole genome and whole
69. Taft RJ, et al. Tiny RNAs associated with transcription start sites in animals. Nature Genet
2009;41:572–578. [PubMed: 19377478]
70. Berezikov E, et al. Diversity of microRNAs in human and chimpanzee brain. Nature Genet
2006;38:1375–1377. [PubMed: 17072315]
71. Kapranov P, et al. New class of gene-termini-associated human RNAs suggests a novel RNA
copying mechanism. Nature 2010;466:642–646. [PubMed: 20671709]
72. Lau NC, Lim LP, Weinstein EG, Bartel DP. An abundant class of tiny RNAs with probable
regulatory roles in Caenorhabditis elegans. Science 2001;294:858–862. [PubMed: 11679671]
73. Kawaji H, Hayashizaki Y. Exploration of small RNAs. PLoS Genet 2008;4:e22. [PubMed:
18225959]
74. Linsen SE, et al. Limitations and possibilities of small RNA digital gene expression profiling.
Nature Methods 2009;6:474–476. [PubMed: 19564845] The authors describe the difficulties
associated with the analysis and quantification of short RNA species using current NGS platforms.
75. Cocquet J, Chong A, Zhang G, Veitia RA. Reverse transcriptase template switching and false
alternative transcripts. Genomics 2006;88:127–131. [PubMed: 16457984]
76. Mader RM, et al. Reverse transcriptase template switching during reverse transcriptase-polymerase
chain reaction: artificial generation of deletions in ribonucleotide reductase mRNA. J. Lab. Clin.
Med 2001;137:422–428. [PubMed: 11385363]
77. Roy SW, Irimia M. When good transcripts go bad: artifactual RT-PCR ‘splicing’ and genome
analysis. Bioessays 2008;30:601–605. [PubMed: 18478540]
78. Chen D, Patton JT. Reverse transcriptase adds nontemplated nucleotides to cDNAs during 5′-
RACE and primer extension. Biotechniques 2001;30:574–582. [PubMed: 11252793]
79. Roberts JD, et al. Fidelity of two retroviral reverse transcriptases during DNA-dependent DNA
synthesis in vitro. Mol. Cell. Biol 1989;9:469–476. [PubMed: 2469002]
80. Armour CD, et al. Digital transcriptome profiling using selective hexamer priming for cDNA
synthesis. Nature Methods 2009;6:647–649. [PubMed: 19668204]
81. Hansen KD, Brenner SE, Dudoit S. Biases in Illumina transcriptome sequencing caused by random
hexamer priming. Nucleic Acids Res 2010;38:e131. [PubMed: 20395217]
82. Rosenkranz R, Borodina T, Lehrach H, Himmelbauer H. Characterizing the mouse ES cell
transcriptome with Illumina sequencing. Genomics 2008;92:187–194. [PubMed: 18602984]
83. Oshlack A, Wakefield MJ. Transcript length bias in RNA-seq data confounds systems biology.
Biol. Direct 2009;4:14. [PubMed: 19371405]
84. Young MD, Wakefield MJ, Smyth GK, Oshlack A. Gene ontology analysis for RNA-seq:
accounting for selection bias. Genome Biol 2010;11:R14. [PubMed: 20132535]
85. Li B, Ruotti V, Stewart RM, Thomson JA, Dewey CN. RNA-seq gene expression estimation with
uncultivated TM7 microbes from the human mouth. Proc. Natl Acad. Sci. USA 2007;104:11889–
11894. [PubMed: 17620602]
94. Pfeifer GP, Steigerwald SD, Mueller PR, Wold B, Riggs AD. Genomic sequencing and
methylation analysis by ligation mediated PCR. Science 1989;246:810–813. [PubMed: 2814502]
95. Dean FB, et al. Comprehensive human genome amplification using multiple displacement
amplification. Proc. Natl Acad. Sci. USA 2002;99:5261–5266. [PubMed: 11959976]
96. Dafforn A, et al. Linear mRNA amplification from as little as 5 ng total RNA for global gene
expression analysis. Biotechniques 2004;37:854–857. [PubMed: 15560142]
97. Eberwine J, et al. Analysis of gene expression in single live neurons. Proc. Natl Acad. Sci. USA
1992;89:3010–3014. [PubMed: 1557406]
98. Nygaard V, Hovig E. Options available for profiling small samples: a review of sample
amplification technology when combined with microarray profiling. Nucleic Acids Res
2006;34:996–1014. [PubMed: 16473852] This review provides a good overview of the current
low-quantity RNA applications and the complications associated with them.
99. Geiss GK, et al. Direct multiplexed measurement of gene expression with color-coded probe pairs.
Nature Biotech 2008;26:317–325.
100. Amit I, et al. Unbiased reconstruction of a mammalian transcriptional network mediating
pathogen responses. Science 2009;326:257–263. [PubMed: 19729616]
101. Byrne JA, Nguyen HN, Reijo Pera RA. Enhanced generation of induced pluripotent stem cells
from a subpopulation of human fibroblasts. PLoS ONE 2009;4:e7118. [PubMed: 19774082]
102. Helzer KT, et al. Circulating tumor cells are transcriptionally similar to the primary tumor in a
murine prostate model. Cancer Res 2009;69:7860–7866. [PubMed: 19789350]
103. Lo YM, et al. Plasma placental RNA allelic ratio permits noninvasive prenatal chromosomal
aneuploidy detection. Nature Med 2007;13:218–223. [PubMed: 17206148] This paper describes
the quantification of extracellular circulating RNA in mother’s plasma during pregnancy to detect
fetal aneuploidy.
104. Bowers J, et al. Virtual terminator nucleotides for next-generation DNA sequencing. Nature
Methods 2009;6:593–595. [PubMed: 19620973]
nucleotide (C-VT is shown as an example) and a polymerase mixture is carried out. After
this step, imaging is performed to locate the templates that have incorporated the nucleotide.
Chemical cleavage of the dye allows the surface and DNA templates to be ready for the next
nucleotide-addition cycle. Nucleotides are added in the C, T, A, G order for 120 total cycles
(30 additions of each nucleotide).
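The cyclic addition scheme described in this legend can be illustrated with a minimal Python sketch. It is a toy model, not the instrument's software: the function name `sequence_by_synthesis` is hypothetical, the template is assumed to be given in the order in which the polymerase reads it, and the model assumes that at most one nucleotide is incorporated per template per cycle. Only the C, T, A, G order and the 120-cycle total come from the legend.

```python
# Toy model of cyclic single-nucleotide addition (illustrative assumptions only).
FLOW_ORDER = "CTAG"   # addition order stated in the legend
TOTAL_CYCLES = 120    # 30 additions of each of the four nucleotides

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def sequence_by_synthesis(template, flow_order=FLOW_ORDER, total_cycles=TOTAL_CYCLES):
    """Return the bases incorporated opposite `template`, in synthesis order.

    `template` is assumed to be listed in the order the polymerase reads it.
    Assumption: at most one base is incorporated per cycle.
    """
    read = []
    position = 0  # index of the next template base awaiting its complement
    for cycle in range(total_cycles):
        if position >= len(template):
            break  # template fully copied
        offered = flow_order[cycle % len(flow_order)]
        # Incorporation (and hence an imaged signal) occurs only if the offered
        # nucleotide is complementary to the next template base.
        if COMPLEMENT[template[position]] == offered:
            read.append(offered)
            position += 1
        # Dye cleavage would happen here, readying the template for the next cycle.
    return "".join(read)
```

Under these assumptions, a homopolymer stretch advances by only one base per full pass through the four nucleotides, so the read length obtained from 120 cycles depends on the template sequence.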
single-molecule DNA sequencing26. b | nCounter system workflow. Two probes are used for
each target site: the capture probe (shown in red) contains a target-specific sequence and a
modification that allows the immobilization of the molecules on a surface; the reporter probe
contains a different target-specific sequence (shown in blue) and a fluorescent barcode
(shown by a green circle) that is unique to each target being examined. After hybridization
of the capture and reporter probe mixture to RNA samples in solution, excess probes are
removed. The hybridized RNA duplexes are then immobilized on a surface and imaged to
identify and count each transcript with the unique fluorescent signals on the capture and
reporter probes.
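The final counting step of this workflow amounts to tallying imaged barcodes per target. The following Python sketch shows that idea only; the function name `count_transcripts`, the barcode strings and the gene names are hypothetical placeholders, and real image analysis (spot detection, barcode decoding, quality filtering) is not modelled.

```python
from collections import Counter


def count_transcripts(imaged_barcodes, barcode_to_target):
    """Tally one count per imaged molecule whose barcode maps to a known target.

    imaged_barcodes: iterable of decoded barcode strings, one per imaged molecule.
    barcode_to_target: dict mapping each barcode to the transcript it reports on.
    Barcodes that are not in the mapping are ignored.
    """
    counts = Counter()
    for barcode in imaged_barcodes:
        target = barcode_to_target.get(barcode)
        if target is not None:
            counts[target] += 1
    return counts


# Hypothetical example data: colour-code strings and gene names are placeholders.
example_mapping = {"RGBYRG": "geneA", "YBGRYB": "geneB"}
example_image = ["RGBYRG", "YBGRYB", "RGBYRG", "GGGGGG"]
example_counts = count_transcripts(example_image, example_mapping)
```

Because each immobilized molecule contributes exactly one count in this sketch, the tallies are digital counts rather than analogue intensities.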
Table 1
Next-generation sequencing-based approaches for transcription start site mapping
Method | Sequenced region | Input RNA | Ref.
CAGEscan | 5′ end of transcripts and either 3′ end or internal RNA sequence | 10 ng total RNA | 12
PEAT | 5′ end of transcripts paired with random reads along the RNA | 150 μg total RNA | 11
CAGE, cap analysis of gene expression; CAGEscan, paired read to combine 5′ CAGE with downstream sequence; DeepCAGE, high-throughput CAGE sequencing; nanoCAGE, low-quantity CAGE; ng,
nanograms; PEAT, paired end analysis of transcription start sites; TSS, transcription start site.