ARTICLES

2005 Nature Publishing Group http://www.nature.com/naturemethods

Comparative evaluation of mass spectrometry platforms used in large-scale proteomics investigations


Joshua E Elias1, Wilhelm Haas1, Brendan K Faherty2 & Steven P Gygi1,2
Researchers have several options when designing proteomics experiments. Primary among these are choices of experimental method, instrumentation and spectral interpretation software. To evaluate these choices on a proteome scale, we compared triplicate measurements of the yeast proteome by liquid chromatography tandem mass spectrometry (LC-MS/MS) using linear ion trap (LTQ) and hybrid quadrupole time-of-flight (QqTOF; QSTAR) mass spectrometers. Acquired MS/MS spectra were interpreted with Mascot and SEQUEST algorithms with and without the requirement that all returned peptides be tryptic. Using a composite target-decoy database strategy, we selected scoring criteria yielding 1% estimated false positive identifications at maximum sensitivity for all data sets, allowing reasonable comparisons between them. These comparisons indicate that Mascot and SEQUEST yield similar results for LTQ-acquired spectra but less so for QSTAR spectra. Furthermore, low reproducibility between replicate data acquisitions made on one or both instrument platforms can be exploited to increase sensitivity and confidence in large-scale protein identifications.

Increasingly, proteome-scale experiments using mass spectrometry are being used as biological assays1,2, with many studies identifying proteins numbering in the thousands3–5. More than generating simple catalogs of cellular components, such exploratory surveys establish the foundations for constructing protein interaction networks, and indicate signaling pathways involved in pathological and developmental processes. Experiments that maximize confident protein identifications using available instrumentation and computation resources are therefore desirable. We present data here that will help guide researchers' choice of experiment design with regard to two widely used mass spectrometers (LTQ and QSTAR) and MS/MS spectra interpretation software (Mascot and SEQUEST).

Two commonly used types of tandem mass spectrometers are those with QqTOF configurations such as the QSTAR, and quadrupole ion trap (QIT) arrangements like the LTQ. Fundamental differences between these instruments affect the quality of mass measurements, including accuracy, resolution and dynamic range6–8. The LTQ, a recently commercialized linear (two-dimensional) QIT mass spectrometer (2D-QIT), has higher ion capacity and scan rates than traditional three-dimensional ion traps9,10, properties that may increase sensitivity relative to the QSTAR despite the higher mass accuracy and resolution provided by the latter mass spectrometer. Although both instruments are widely used by the mass spectrometry community, a thorough and detailed comparison of their performances on complex peptide mixtures has not been done.

The most widely used search algorithms for interpreting tandem mass spectra are Mascot and SEQUEST11–13. Traditionally, ion trap-acquired MS/MS spectra are interpreted with SEQUEST, whereas Mascot is used to sequence TOF spectra (for example, see refs. 3,14). Based on at least one study15, it would appear that the two algorithms are fairly comparable, at least for ion trap-acquired MS/MS spectra. It has remained unclear whether this holds true for TOF-acquired spectra as well.

These two algorithms apply similar general approaches to assign peptides in a sequence database to measured MS/MS spectra13. But Mascot and SEQUEST use fundamentally different principles in their mathematical operations. Though not explicitly described, Mascot uses a probabilistic metric to assess the likelihood of a fragmented peptide to have given rise to an observed spectrum, whereas SEQUEST uses empirical and correlation measurements to score the alignment between observed and predicted spectra12. Furthermore, Mascot suggests a homology scoring threshold that is similar to a described measurement that uses the distribution of scores returned for peptides matching the observed precursor mass16 (J. Cottrell, personal communication). Accordingly, SEQUEST reports the normalized difference between the score of the top-ranked peptide and the scores of the remaining peptide hits. The magnitude of this difference correlates with high-confidence peptide matches12,17,18. These dissimilarities make it seem unlikely that these two spectrum interpretation tools should be equivalent.

Researchers might consider using complementary analytical platforms to increase peptide and protein identification. This, however, may be insufficient if platform reproducibility is low. As shotgun sequencing approaches usually sample only a fraction of a complex peptide mixture, one might expect to identify different peptide subsets across replicate analyses. We must therefore ask, does one gain more information by using alternative analytical platforms or simply by analyzing a sample multiple times?

To address issues regarding choices of instrument, algorithm and experiment design, we performed a large-scale analysis of yeast whole-cell lysate with LTQ and QSTAR mass spectrometers. The resulting tandem mass spectra were interpreted with Mascot and SEQUEST algorithms with (tryptic search) and without (nonspecific search) the requirement that all matching peptides have two tryptic termini (tryptic), an indication of specific digestion by the protease trypsin. Using a composite target-decoy database search strategy18–20, we effectively estimated the error rates of applied score filter criteria, allowing comparisons of multiple data sets with similar error rates. Based on these comparisons, we reached three primary conclusions: (i) Mascot and SEQUEST results greatly overlapped (>85%) for LTQ-acquired spectra that satisfy filter criteria, but Mascot appeared better suited to interpreting QSTAR MS/MS spectra; (ii) results from replicate analyses overlapped less (~70%), suggesting more peptide and protein identifications can be confidently made by analyzing samples multiple times; and (iii) peptides identified by LTQ or QSTAR mass spectrometers overlapped even less (50–60%), indicating complementarity between the two systems.

1Department of Cell Biology, 2Taplin Biological Mass Spectrometry Facility, 240 Longwood Ave., Harvard Medical School, Boston, Massachusetts 02115, USA. Correspondence should be addressed to S.P.G. (steven_gygi@hms.harvard.edu).

PUBLISHED ONLINE 23 AUGUST 2005; DOI:10.1038/NMETH785

NATURE METHODS | VOL.2 NO.9 | SEPTEMBER 2005 | 667

[Figure 1 image not reproduced. Panel a: workflow from the LTQ and QSTAR through Mascot and SEQUEST, each run as nonspecific and tryptic searches. Panel b: ion accumulation and ion measurement timelines for one MS scan followed by five MS/MS scans on the LTQ and on the QSTAR; x axis, time (ms).]

Figure 1 | Overview of experiment design. (a) Approximately 4 mg of yeast whole-cell lysate were electrophoresed on an SDS-polyacrylamide gel. This gel was divided into ten 0.5-cm slices, five of which were subjected to in-gel trypsin digestion. Four percent of each digest were analyzed three times on QSTAR (QqTOF) and LTQ (QIT) mass spectrometers (Supplementary Methods). Analysis time on the QSTAR was extended to be three times that used on the LTQ so that each instrument would have similar numbers of sequencing attempts on eluting peptides. Resulting MS/MS spectra were searched with Mascot and SEQUEST algorithms with and without enzymatic specification. (b) Typical workflow for automatic data-dependent acquisition methods used with LTQ and QSTAR mass spectrometers.

Table 1 | Summary of run statistics
         Gradient length   Gel regions   Replicates   MS/MS per duty cycle   Average MS/MS per run   Total MS/MS collected
LTQ      30 min            5             3            5                      3,198                   47,977
QSTAR    90 min            5             3            5                      3,062                   45,927

Table 2 | Mascot and SEQUEST search parameters
         Precursor mass tolerance   Fragment ion mass toleranceᵃ   Maximum missed cleavages   Mass type      Dynamic modifications
LTQ      2.0 Da                     0.8                            2                          Average        Oxidized methionine
QSTAR    0.2 Da                     0.2                            2                          Monoisotopic   Oxidized methionine
ᵃDefault setting used for SEQUEST searches.

RESULTS
Dataset standardization
We analyzed five trypsin-digested gel regions representative of the yeast proteome in triplicate by nanoscale microcapillary LC-MS/MS using LTQ (2D-QIT) and QSTAR (QqTOF) mass spectrometers (Fig. 1a). These instruments were optimized such that each would have similar numbers of opportunities to sequence eluting peptides (Fig. 1a and Supplementary Methods online) despite their different acquisition rates (Table 1 and Fig. 1b). Resulting MS/MS spectra were searched with Mascot and SEQUEST algorithms using commonly used parameters (Table 2), including both tryptic and nonspecific search modes (Fig. 1a). After nonspecific searches, we discarded all nontryptic peptide matches, as the vast majority of these are incorrect17,20.

Searches against a composite target-decoy database containing all yeast protein sequences in both forward and reverse orientations18,20 provided a simple and effective way to estimate the false positive rate of peptide-spectral matches (PSMs; Supplementary Discussion online). Estimated algorithm false positive rates were calculated by doubling the number of decoy hits and dividing this by the total number of hits. This composite database strategy has two primary advantages over other proposed error estimation methods17,19. First, a correct top-ranked peptide can be confidently selected even in the presence of a decoy hit with a slightly lower yet still large score. Second, this method removes the requirement for exact a priori knowledge of mixture composition. Most importantly, this estimation method is instrument- and search algorithm-independent, a prerequisite for meaningful comparisons. This searching strategy was applied to each instrument platform and each search algorithm (Fig. 1).
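The estimate described above (twice the number of decoy hits divided by the total number of hits) can be sketched in a few lines. This is a minimal illustration, not the authors' software: the function names, the `REV_` prefix for reversed entries and the toy protein are assumptions made here for the example.

```python
from typing import Dict

DECOY_PREFIX = "REV_"  # hypothetical label for reversed (decoy) entries


def build_target_decoy(proteins: Dict[str, str]) -> Dict[str, str]:
    """Return a composite database with each sequence forward and reversed."""
    composite = dict(proteins)
    for name, sequence in proteins.items():
        composite[DECOY_PREFIX + name] = sequence[::-1]
    return composite


def estimated_false_positive_rate(decoy_hits: int, total_hits: int) -> float:
    """Estimate the false positive rate as 2 * decoy hits / total hits."""
    if total_hits <= 0:
        raise ValueError("total_hits must be positive")
    return 2 * decoy_hits / total_hits


# Toy protein, illustrative only.
database = build_target_decoy({"YFG1": "MKTAYIAK"})

# Example: 5 decoy hits among 1,000 accepted PSMs gives a 1% estimate.
rate = estimated_false_positive_rate(decoy_hits=5, total_hits=1000)
print(f"{rate:.1%}")
```

The factor of two reflects the assumption that an incorrect match is equally likely to fall on a forward or a reversed sequence, so each observed decoy hit stands for one unseen incorrect forward hit.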
We then derived and applied selection criteria insensitive to dataset redundancies to generate lists of unique charged peptides with 1% false positive identifications (99% precision) while maximizing the number of selected PSMs (Supplementary Table 1 online). Application of both a primary score (ion score and XCorr) and a relative score (the homology factor and ΔCn) was necessary to achieve an acceptable error rate at near-maximum sensitivity for both Mascot and SEQUEST (Supplementary Fig. 1 online). The 1% false positive threshold represented a suitable balance between sensitivity and precision, as we have shown previously18. Universal application of this single criterion permitted rigorous comparisons between search algorithms, replicate analyses and analyses on multiple instruments.

Do Mascot and SEQUEST identify the same peptides?
Search algorithm differences and the disparate appearance of MS/MS spectra collected on the LTQ versus the QSTAR (Supplementary Fig. 2 online) suggest Mascot and SEQUEST might identify different subsets of peptides within our 1% error tolerance. We determined that the algorithms similarly interpreted LTQ-acquired MS/MS spectra, but the scoring function of Mascot was better suited than SEQUEST for discriminating between correctly and incorrectly interpreted MS/MS spectra collected on the QSTAR (Table 3 and Fig. 2).

When nonspecific SEQUEST searches were compared with tryptic Mascot searches, similar numbers of LTQ MS/MS spectra were confidently identified, differing by just 0.5% (Table 3 and Fig. 2b). The overlap between these identifications was substantial and noticeably greater than the proportion of selected PSMs in common between Mascot and SEQUEST results when just one search method (tryptic or nonspecific) was used by both algorithms (Fig. 2b). As previously observed15, PSMs confidently identified by both algorithms represented a subpopulation estimated to be almost entirely devoid of false matches. Furthermore, we rarely (0.16%) observed spectra for which both algorithms scored PSMs at or above our selection criteria, yet did not agree on the peptide sequence that gave rise to the MS/MS spectrum (excluding simple isobaric residue substitutions). An average of 5,366 MS/MS spectra were confidently assigned by either algorithm for the triplicate analyses of five gel regions, a 14% improvement over the 4,700 identified by a traditional SEQUEST search of LTQ MS/MS spectra. The majority of all collected spectra, however, were not correctly identified. Of an average of 15,992 spectra acquired per set of five gel regions, only 34% yielded high-confidence PSMs.
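As an arithmetic check on the LTQ figures, the three regions reported in Figure 2b for tryptic Mascot and nonspecific SEQUEST searches (666 Mascot only, 644 SEQUEST only, 4,056 shared) sum to 5,366, the value also shown in Figure 5a. The sketch below reproduces the 14% gain and 34% yield quoted in the text; the function and variable names are illustrative assumptions.

```python
def combined_total(only_first: int, only_second: int, shared: int) -> int:
    """Size of the union of two result sets, given the three Venn regions."""
    return only_first + only_second + shared


# LTQ values from Figure 2b (tryptic Mascot, nonspecific SEQUEST).
ltq_combined = combined_total(only_first=666, only_second=644, shared=4056)

ltq_sequest_alone = 4700   # nonspecific SEQUEST search of LTQ spectra
ltq_acquired = 15992       # average spectra acquired per set of five gel regions

gain_pct = 100 * (ltq_combined - ltq_sequest_alone) / ltq_sequest_alone
yield_pct = 100 * ltq_combined / ltq_acquired

print(ltq_combined, round(gain_pct), round(yield_pct))
```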
While Mascot and SEQUEST algorithms confidently interpreted similar numbers of LTQ-acquired MS/MS spectra, Mascot scoring was better suited to identifying correctly interpreted QSTAR spectra. Of the average 3,477 confidently assigned QSTAR MS/MS spectra, Mascot exclusively selected nearly one-third, in comparison to the 15% selected just by SEQUEST (Fig. 2b). This is not to say SEQUEST assignments that did not meet our criteria were incorrect: 91% of the PSMs that passed Mascot criteria were also identified by SEQUEST within the top ten matches, but only 72% received XCorr and ΔCn scores sufficient to distinguish them from incorrect matches. In comparison, 87% of the LTQ-acquired


Table 3 | Average numbers of interpreted MS/MS spectra identified by Mascot and SEQUEST from triplicate analyses by LTQ and QSTAR
                                                LTQ                                 QSTAR
Class                                           Selected  FPᵃ   TPᵃ     Precisionᵃ  Selected  FPᵃ   TPᵃ     Precisionᵃ
Nonspecific
Pass SEQUESTᵇ                                   4,700     40    4,660   99.1        2,465     19    2,446   99.2
  Also identified by Mascot                     4,350     13    4,337   99.7        2,401     1     2,400   100.0
  Also identified by top Mascot                 4,127     5     4,122   99.9        2,253     1     2,252   100.0
  Also identified by Mascot, passes Mascotᵇ     3,745     3     3,742   99.9        1,846     0     1,846   100.0
Pass Mascotᵇ                                    4,569     55    4,514   98.8        2,657     36    2,621   98.6
  Also identified by SEQUEST                    4,242     14    4,228   99.7        2,460     9     2,451   99.6
  Also identified by top SEQUEST                4,089     7     4,082   99.8        2,098     2     2,096   99.9
  Also identified by SEQUEST, passes SEQUESTᵇ   3,745     3     3,742   99.9        1,846     0     1,846   100.0
Tryptic
Pass SEQUESTᵇ                                   4,035     35    4,000   99.1        2,076     18    2,058   99.1
  Also identified by Mascot                     4,011     25    3,986   99.4        2,053     9     2,044   99.6
  Also identified by top Mascot                 3,975     19    3,956   99.5        2,045     7     2,038   99.7
  Also identified by Mascot, passes Mascotᵇ     3,570     2     3,568   99.9        1,716     1     1,715   99.9
Pass Mascotᵇ                                    4,722     46    4,676   99.0        2,967     34    2,933   98.9
  Also identified by SEQUEST                    4,288     29    4,259   99.3        2,798     25    2,773   99.1
  Also identified by top SEQUEST                4,252     17    4,235   99.6        2,714     17    2,697   99.4
  Also identified by SEQUEST, passes SEQUESTᵇ   3,570     2     3,568   99.9        1,716     1     1,715   99.9
Nonspecific SEQUEST, tryptic Mascot
Pass SEQUESTᵇ                                   4,700     40    4,660   99.1        2,465     19    2,446   99.2
  Also identified by Mascot                     4,665     25    4,640   99.5        2,427     9     2,418   99.6
  Also identified by top Mascot                 4,600     16    4,584   99.7        2,409     2     2,407   99.9
  Also identified by Mascot, passes Mascotᵇ     4,056     3     4,053   99.9        1,955     0     1,955   100.0
Pass Mascotᵇ                                    4,722     46    4,676   99.0        2,967     34    2,933   98.9
  Also identified by SEQUEST                    4,639     18    4,632   99.6        2,724     4     2,720   99.9
  Also identified by top SEQUEST                4,458     9     4,449   99.8        2,262     1     2,261   100.0
  Also identified by SEQUEST, passes SEQUESTᵇ   4,056     3     4,053   99.9        1,955     0     1,955   100.0

ᵃFP, estimated false positive identifications; TP, estimated true positive identifications; Precision, TP ÷ (TP + FP), given as a percentage. ᵇTo pass, PSMs had to receive scores greater than or equal to those indicated in Supplementary Table 1, designed to give 99% precision at maximum sensitivity, as estimated from reverse database hits.
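The precision definition in the Table 3 footnote, TP ÷ (TP + FP), can be checked against a few of the table's own rows. The snippet below is a small worked example; the function name and the row labels used as dictionary keys are assumptions for illustration.

```python
def precision_percent(true_positives: int, false_positives: int) -> float:
    """Precision as a percentage: TP / (TP + FP) * 100."""
    selected = true_positives + false_positives
    if selected == 0:
        raise ValueError("no selected PSMs")
    return 100 * true_positives / selected


# (TP, FP) pairs taken from the nonspecific-search rows of Table 3.
rows = {
    ("LTQ", "Pass SEQUEST"): (4660, 40),
    ("LTQ", "Pass Mascot"): (4514, 55),
    ("QSTAR", "Pass SEQUEST"): (2446, 19),
    ("QSTAR", "Pass Mascot"): (2621, 36),
}

precisions = {key: round(precision_percent(tp, fp), 1) for key, (tp, fp) in rows.items()}
for key, value in precisions.items():
    print(key, value)
```

Rounded to one decimal place, these reproduce the tabulated precision values for the four rows shown.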

[Figure 2 image not reproduced. Panel a: bar chart of spectra passing selection criteria (y axis, 0 to 5,000) for Mascot and SEQUEST under nonspecific and tryptic searches on the LTQ and QSTAR. Panel b: Venn diagrams of Mascot and SEQUEST results for nonspecific searches, tryptic searches, and tryptic Mascot combined with nonspecific SEQUEST, on the LTQ and QSTAR. In the combined panel, the bracketed percentages are 14.1% (Mascot) and 13.7% (SEQUEST) for the LTQ, and 34.1% (Mascot) and 20.7% (SEQUEST) for the QSTAR.]

Figure 2 | Mascot and SEQUEST are complementary analysis tools for interpreting LTQ and QSTAR MS/MS. (a) Comparison of the average number of confidently assigned spectra for each instrument, algorithm and search type combination. Error bars represent the maximum and minimum values for three replicate analyses. (b) Venn diagrams showing the degree of overlap between Mascot and SEQUEST searches. Numbers in parentheses indicate the precision represented by a particular region. Numbers in square brackets indicate the percentage of either search result that lies outside the overlap region. Nonspecific SEQUEST searches and tryptic Mascot searches were combined for further comparisons: for LTQ searches, 666 exclusive Mascot identifications, 644 exclusive SEQUEST identifications and 4,056 joint identifications were combined, yielding an average of 5,366 MS/MS spectra. For QSTAR searches, 1,012 exclusive Mascot identifications, 510 exclusive SEQUEST identifications and 1,955 joint identifications were combined, yielding an average of 3,477 MS/MS spectra. Mascot and SEQUEST results are depicted in red and blue, respectively.
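The Venn regions in Figure 2b amount to set operations on spectrum-to-peptide assignments. The sketch below is one hedged way to express that comparison: `compare_assignments` and the toy spectrum identifiers are hypothetical, and sequences are compared by exact string match, so unlike the analysis in the text it does not treat isobaric residue substitutions as equivalent. The second function recomputes the bracketed percentages for the combined panel from the counts given in the caption.

```python
from typing import Dict, Set, Tuple


def compare_assignments(
    first: Dict[str, str], second: Dict[str, str]
) -> Tuple[Set[str], Set[str], Set[str], Set[str]]:
    """Split spectra into only-first, only-second, agreeing and disagreeing sets."""
    shared = first.keys() & second.keys()
    agree = {spectrum for spectrum in shared if first[spectrum] == second[spectrum]}
    disagree = shared - agree
    return first.keys() - second.keys(), second.keys() - first.keys(), agree, disagree


def percent_outside_overlap(exclusive: int, shared: int) -> float:
    """Share of one algorithm's passing results that the other did not select."""
    return 100 * exclusive / (exclusive + shared)


# Toy assignments (hypothetical spectrum IDs and peptides).
mascot = {"s1": "LVNELTEFAK", "s2": "AEFVEVTK", "s3": "YLYEIAR"}
sequest = {"s2": "AEFVEVTK", "s3": "YIYEIAR", "s4": "DLGEEHFK"}
only_mascot, only_sequest, agree, disagree = compare_assignments(mascot, sequest)

# Counts from the Figure 2 caption (exclusive, shared).
ltq_mascot = round(percent_outside_overlap(666, 4056), 1)
ltq_sequest = round(percent_outside_overlap(644, 4056), 1)
qstar_mascot = round(percent_outside_overlap(1012, 1955), 1)
qstar_sequest = round(percent_outside_overlap(510, 1955), 1)
print(ltq_mascot, ltq_sequest, qstar_mascot, qstar_sequest)
```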

spectra identically interpreted by both Mascot and SEQUEST were given sufficient scores to pass the selection criteria for both algorithms (Table 3). As with the LTQ-acquired spectra, PSMs surpassing both Mascot and SEQUEST criteria were highly enriched for correct matches, with only 0.17% sequence disagreement for spectra that pass selection criteria for both algorithms. SEQUEST increased the total number of selected PSMs by an average of 17% when performed in addition to the traditional Mascot search of QSTAR-acquired MS/MS spectra. As with LTQ searches, most MS/MS spectra were not selected as correct PSMs. Of an average of 15,309 spectra acquired in three replicate runs, only 23% yielded high-confidence PSMs.

Differences between Mascot and SEQUEST
For both LTQ and QSTAR runs, the populations of peptides confidently selected by just one algorithm appeared to be indistinguishable from the peptides selected by the other, based on several measurements including charge state, residue composition and peptide length (data not shown). MS/MS spectrum quality presents a possible explanation for Mascot and SEQUEST performance differences. We observed that the apparent complexity and signal-to-noise ratios of MS/MS spectra have profound and distinct effects on the magnitude of scores assigned by Mascot and SEQUEST (Supplementary Data online and Supplementary Fig. 2).

Trypsin
We occasionally identify peptides derived from nontryptic pathways during single-protein analyses in which we can detect these lower kinetic events (data not shown). When complex mixtures are

examined, however, the overwhelming majority of confidently identified peptides have tryptic ends. One might expect that by requiring a priori for peptide hits to have expected features of correct matches, incorrect matches would necessarily be removed. Whereas this certainly holds true in many cases, it cannot be considered a rule. Programs like SEQUEST and Mascot find peptide matches to MS/MS spectra even when appropriate matches are absent from the sequence database. As nontryptic sequences greatly outnumber tryptic peptides considered by search algorithms under nonspecific search conditions, it is far more likely that nontryptic sequences will be incorrectly assigned to MS/MS spectra when correct sequences are not available for consideration, or when spectrum quality is insufficient to generate confident matches. Consequently, the overwhelming majority of non-fully tryptic PSMs are incorrect.

Regardless of instrument or search algorithm, high-confidence PSMs were enriched in a score-independent fashion by searching nonspecifically and then restricting passing PSMs to be tryptic. For SEQUEST searches, more high-confidence PSMs were selected in this way than by initially requiring all considered peptides to be tryptic. With tryptic searches, however, one must primarily rely on score filter criteria to distinguish correct from incorrect matches. As the score distribution of this larger incorrect PSM population overlaps more extensively with the high-confidence PSM population (Supplementary Fig. 3 online), one must apply elevated filter criteria to maintain acceptable error rates (Supplementary Table 1). Higher criteria necessarily exclude many correct PSMs, thereby lowering sensitivity. We observed 14–16% fewer confidently assigned MS/MS by tryptic than nonspecific SEQUEST searches. This represents a much larger decrease in sensitivity than

[Figure 3 image not reproduced. Panel a: Venn diagrams of peptides and proteins identified in three replicate runs on the LTQ and QSTAR. Panel b: bar chart of the average number of identifications (y axis, 0 to 6,000) for peptides and proteins on the LTQ and QSTAR after one, two or three analyses.]

Figure 3 | Considerably more peptide and protein identifications can be achieved by analyzing a sample multiple times. For each instrument, Mascot tryptic and SEQUEST nonspecific searches were combined to create nonredundant lists of peptides and proteins. PSMs were filtered to remove redundant peptides with identical charge states. (a) Venn diagrams depicting overlap in peptides and proteins identified in three replicate runs. Numbers in parentheses indicate the identification precision for each region. (b) Average number of peptides and proteins identified following one, two or three replicate analyses of the same sample. Error bars indicate the maximum and minimum identifications for each pairwise replicate comparison.
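The quantity plotted in Figure 3b, the average number of nonredundant identifications after one, two or three analyses, can be computed from per-replicate sets. The following is a minimal sketch on toy data; the function name and the single-letter "peptides" are placeholders, not values from the study.

```python
from itertools import combinations
from statistics import mean
from typing import List, Set


def mean_union_size(replicates: List[Set[str]], k: int) -> float:
    """Average number of distinct identifications over all choices of k replicates."""
    return mean(len(set().union(*combo)) for combo in combinations(replicates, k))


# Toy replicate results (placeholders for peptide identifiers).
replicates = [{"A", "B", "C"}, {"B", "C", "D"}, {"C", "D", "E"}]

one = mean_union_size(replicates, 1)
two = mean_union_size(replicates, 2)
three = mean_union_size(replicates, 3)
in_all = set.intersection(*replicates)

print(one, two, three, sorted(in_all))
print(f"gain from a second analysis: {100 * (two - one) / one:.0f}%")
```

Averaging over every pair, rather than taking replicates in acquisition order, mirrors the caption's reference to pairwise replicate comparisons.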

any observed as a result of incorrect, nontryptic peptides receiving higher scores than correct peptides (1–3% of all passing tryptic PSMs, except 7% of PSMs from SEQUEST searches of QSTAR MS/MS spectra).

The benefit of nonspecific searching is less pronounced for Mascot searches. Unlike SEQUEST, Mascot supplies a probability-based threshold score to determine if a PSM can be confidently selected given the score distribution of peptides that could conceivably have given rise to an observed spectrum16. We found this scoring feature to be sufficient for confident PSM selection without further enhancement by nonspecific searching. We acknowledge that probabilistic measurements and evaluations have also been described for SEQUEST results17,21, and that these can add greater confidence to matches. Comparative evaluations of their efficacies, however, are beyond the scope of this work. Probabilistic modeling and spectrum quality alone might not account for the larger disparity between Mascot and SEQUEST search results of QSTAR-acquired spectra, however. Unlike the SEQUEST revision used here, Mascot accounts for the higher mass accuracy (<50 p.p.m.) achievable on the QqTOF, potentially further contributing to this discrepancy.

The greatest average numbers of spectra identified by Mascot were selected from tryptic searches (4,722, LTQ; 2,967, QSTAR). The greatest average numbers of spectra identified by SEQUEST were selected from nonspecific searches (4,700, LTQ; 2,465, QSTAR). By combining results from the two searches, more PSMs could be selected, although we observed a disproportionate increase in estimated false positive identifications. Given that the majority of decoy database hits were selected from just one algorithm's results (Fig. 2) and contained internal tryptic cleavage sites (data not shown), we removed these low-confidence missed cleavage peptides from the sets of nonoverlapping PSMs. Doing so

restored the estimated precision rate of unique identifications to 99%. Search results combined in this way were used as the basis for further comparisons.

Does it matter if a sample is analyzed more than once?
It has been previously noted that subsequent identification by repeated analyses of a sample confers high confidence in peptide and protein identifications15 and may provide clues to protein abundance22. Although it is clear that they can increase the number of protein identifications22, it has remained unclear if nonoverlap identifications are primarily incorrect, or if they enhance protein coverage. Of the 5,284 (LTQ) or 4,357 (QSTAR) nonredundant PSMs confidently identified in the sum of three replicate analyses, only 48% and 35% were selected from all three replicate runs, with any two analyses overlapping by averages of 76% and 67%, respectively (Fig. 3a). Clearly, repeated analyses of a single sample greatly enhanced the number of peptides identified from that sample (Fig. 3b). The 23%–38% increase in confidently identified peptides achieved by analyzing samples twice versus once, and 37%–60% increase from three replicate analyses versus one, considerably eclipses gains achieved through analysis by both Mascot and SEQUEST.

As expected, more identified peptides translated to more identified proteins. At the protein level, however, the percent increase from repeated analyses did not parallel the gains observed at the peptide level. For example, we confidently identified an average of 33% and 52% more peptides by reanalyzing a sample once or twice on the QSTAR, but only recorded 19% and 30% more proteins. This reflects the observations that neither the number nor the identities of peptides that identify a given protein are necessarily consistent from run to run. The reduction in the number of

[Figure 4 image not reproduced. Panel a: Venn diagrams of peptides and proteins for three cases: identified in any replicate, simulated single analysis, and identified in all replicates. Panel b: histogram of count versus peptide length (5 to 40 residues) for QSTAR-only and LTQ-only peptides, with an inset of predicted number versus peptide length.]

Figure 4 | LTQ and QSTAR mass spectrometers exclusively identify many peptides and proteins. (a) Venn diagrams depict the overlap extent for peptides and proteins identified on LTQ (blue) and QSTAR (red) mass spectrometers. Numbers in parentheses indicate the precision of the identifications in each region. Numbers in square brackets indicate the percent of identifications from each instrument that lie outside the overlap region. PSMs were filtered to remove redundant peptides regardless of charge state. (b) Length distribution for peptides identified in at least two of three replicate analyses on one instrument (QSTAR or LTQ) and not on the other. The predicted distribution (inset) is based on an in silico digest of the yeast proteome, assuming digestion by trypsin with up to one missed cleavage. Exclusively QSTAR-identified peptides were shorter (mean length, 12.1 residues; s.d., 3.8) than exclusively LTQ-identified peptides (mean length, 23.0 residues; s.d., 7.5). See Supplementary Methods for mass range optimizations.
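The predicted length distribution in the inset rests on an in silico tryptic digest with up to one missed cleavage. A minimal sketch of such a digest follows. It is not the authors' code: the rule that trypsin does not cleave before proline, the 5 to 40 residue length window (taken from the plotted axis range) and the toy sequence are all assumptions made for illustration.

```python
from collections import Counter
from typing import List


def cleavage_fragments(sequence: str) -> List[str]:
    """Cut after K or R, except when the next residue is P (assumed rule)."""
    fragments, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            fragments.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        fragments.append(sequence[start:])
    return fragments


def tryptic_peptides(sequence: str, max_missed: int = 1,
                     min_len: int = 5, max_len: int = 40) -> List[str]:
    """Peptides with up to max_missed missed cleavages inside the length window."""
    fragments = cleavage_fragments(sequence)
    peptides = []
    for i in range(len(fragments)):
        last = min(i + max_missed, len(fragments) - 1)
        for j in range(i, last + 1):
            peptide = "".join(fragments[i:j + 1])
            if min_len <= len(peptide) <= max_len:
                peptides.append(peptide)
    return peptides


# Toy sequence, not a yeast protein.
peptides = tryptic_peptides("GASPVKTLLDERWYNQHK")
length_counts = Counter(len(p) for p in peptides)
print(peptides)
print(dict(length_counts))
```

Applying the same function to every protein in a sequence database and summing the counters would give a predicted distribution of the kind shown in the inset.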

Predicted number

60,000
QSTAR only

proteins identified relative to the number of peptides appears consistent for both instruments, suggesting that under the conditions used, the two shotgun approaches yield similar surveys of complex protein mixtures. We note that an average of 24% of protein identifications not validated by any other replicate runs were estimated to be incorrect, in comparison to the <2% estimated to be incorrect when proteins were found in multiple replicate analyses.

Do the LTQ and QSTAR yield the same identifications?
Despite the fact the LTQ collected just 4% more MS/MS spectra than the QSTAR (in one-third the acquisition time), 21% more unique PSMs were confidently identified with the LTQ, and the numbers of proteins identified from each instrument were essentially the same (Fig. 3). It seems reasonable to suspect strong agreement on these identifications across platforms as well. We found that this was not the case: less than half of all unique peptide sequences identified in any replicate by either instrument that passed our selection criteria were identified by both instruments (Fig. 4a), regardless of charge. This disparity was diminished somewhat at the protein level, with approximately 60% of the proteins identified by both LTQ and QSTAR mass spectrometers. We attribute this increase in protein overlap to the observation that several peptides identified on just one instrument are derived from proteins identified on both.

We observed roughly similar numbers of acquired MS/MS spectra and confidently identified peptides and proteins for the LTQ and QSTAR. But if the gradient used on the LTQ was extended to match that used on the QSTAR, the number of peptide and protein identifications would increase proportionally (Fig. 5a). This is the effect we observed when a portion of our sample was reacquired in triplicate with a 90-minute gradient on the LTQ (Supplementary Fig. 4 online).

For both peptides and proteins, identification by both LTQ and QSTAR mass spectrometers gave near complete assurance of correct identifications, given peptides that passed the stated selection criteria. Conversely, false positive identifications were concentrated in the nonoverlap regions, particularly when considering proteins. Examination of these nonoverlap proteins revealed that many were identified by just one peptide, a symptom of both low abundance as well as incorrect peptide identification (Supplementary Fig. 5 online).

Comparing the set of proteins identified in any replicate by either instrument to a single analysis on one instrument, we observed that the total number of protein identifications increased by 60%. The number of false positive protein identifications, however, increased nearly fivefold. Because these false identifications were overwhelmingly restricted to the nonoverlapping regions of the Venn diagram, adding the further constraint of being identified in multiple analyses is an effective strategy to enrich for correctly identified proteins. When this strategy was applied to our dataset, 826 proteins were selected as being correctly identified with an estimated false positive rate near 0.0%.

We found 562 peptides identified in at least two replicate QSTAR analyses that were never confidently identified by the LTQ and 1,121 peptides identified in at least two replicate LTQ analyses and never confidently identified by the QSTAR. We estimated

[Figure 4b plot residue: x-axis, peptide length (5 to 40 residues); legend entries "LTQ only" and "Predicted".]

672 | VOL.2 NO.9 | SEPTEMBER 2005 | NATURE METHODS

Figure 5a values, as recovered from the extracted figure text (averages):

                              LTQ       QSTAR
Per data set
  Acquired MS/MS             15,992     15,309
  Selected PSMs               5,366      3,477
  Selected peptides           3,374      2,623
  Selected proteins             650        646
Per minute
  Acquired MS/MS              106.6       34.0
  Selected PSMs                35.8        7.7
  Selected peptides            22.5        5.8
  Selected proteins             4.3        1.4
Per 100 MS/MS
  Acquired MS/MS              100.0      100.0
  Selected PSMs                33.6       22.7
  Selected peptides            21.1       17.1
  Selected proteins             5.2        5.5

[Figure 5b plot: y-axis, average percent increase (0 to 160); x-axis, number of algorithms/replicates/instruments combined (2/1/1, 1/1/2, 1/2/1, 2/1/2, 2/2/1, 1/3/1, 2/2/2, 2/3/2); series: peptides and proteins for LTQ and QSTAR. Bar values are not recoverable from the extracted text.]

Figure 5 | Practical comparison of analytical options. (a) Comparison of throughput and identification rates of LTQ and QSTAR mass spectrometers. (b) Percent increase in peptide (dark) and protein (light) identifications over standard single algorithm, replicate and instrument analyses for LTQ (blue) and QSTAR (red) mass spectrometers. Estimated peptide and protein precision rates were in the ranges of 99.0–99.7% and 94.6–98.8%, respectively.

that >99% of these identifications were correct. Although we found no obvious sequence-related reason why one instrument may show preference for one set of peptides (data not shown), we observed that exclusively LTQ-identified peptides were on average twice the length of QSTAR-identified peptides (Fig. 4b). As both Mascot and SEQUEST were used to generate these lists of identified peptides, we conclude that fundamental differences between the instruments led to differences in the range of identifiable peptide lengths, rather than issues related to spectrum interpretation. The distinct distributions of ions selected for MS/MS by the LTQ and QSTAR correlate with the lengths of identified peptides (Supplementary Fig. 6 online). This suggests the instruments' inherent ion preferences and acquisition ranges influence their abilities to sequence long peptides (Supplementary Methods).

This observed length discrepancy accounts for just half of the peptides uniquely identified by the LTQ, and does not explain the peptide identifications made exclusively by the QSTAR. Because the proportion of confidently-assigned MS/MS spectra was nearly 50% more for LTQ- than QSTAR-acquired data (0.32 versus 0.21), we conclude that either the QSTAR was able to fragment and measure more peptides not represented in the sequence database, or the search algorithms had more difficulty correctly interpreting and assigning distinguishable scores to QSTAR MS/MS spectra. The latter situation appears most likely because LTQ-acquired spectra tend to receive greater scores than QSTAR spectra (Supplementary Fig. 1). Considering only the peptides with lengths less than 17 residues, we found that average scores assigned to LTQ PSMs exceeded QSTAR PSM scores by 17% (nonspecific SEQUEST search) to 100% (tryptic Mascot search). Moreover, the decoy (incorrect) LTQ PSMs were confined more toward the lower-scoring ranges than the QSTAR PSMs. Finally, a greater proportion of QSTAR-acquired spectra have qualities with which both Mascot and SEQUEST have difficulty (Supplementary Fig. 3).
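The multi-analysis constraint described in this section (keeping only identifications seen in more than one replicate or instrument, and checking where decoy hits concentrate) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: the `REV_` prefix marking reversed (decoy) entries, the run labels, the toy identifiers, and the convention of estimating false positives as twice the number of decoy hits. It is not the authors' Perl/SQL pipeline.

```python
from collections import Counter
from typing import Dict, Iterable, Set

DECOY_PREFIX = "REV_"  # hypothetical label for reversed (decoy) entries


def confirmed_in_multiple(runs: Dict[str, Set[str]], min_runs: int = 2) -> Set[str]:
    """Return identifications seen in at least `min_runs` separate analyses."""
    counts = Counter(ident for ids in runs.values() for ident in ids)
    return {ident for ident, n in counts.items() if n >= min_runs}


def estimated_error_rate(ids: Iterable[str]) -> float:
    """Estimate the false positive fraction as 2 x decoys / total.

    The doubling assumes an incorrect match is equally likely to land in the
    target or the decoy half of the database (an assumed convention).
    """
    ids = set(ids)
    if not ids:
        return 0.0
    n_decoy = sum(1 for ident in ids if ident.startswith(DECOY_PREFIX))
    return min(1.0, 2 * n_decoy / len(ids))


# Made-up protein identifications from three analyses (not data from the paper).
runs = {
    "LTQ_rep1": {"A", "B", "C", "D", "REV_X"},
    "LTQ_rep2": {"A", "B", "E"},
    "QSTAR_rep1": {"A", "F", "G", "H"},
}

overlap = confirmed_in_multiple(runs)              # seen in two or more analyses
singletons = set().union(*runs.values()) - overlap  # seen in exactly one analysis

print(sorted(overlap), estimated_error_rate(overlap))
print(len(singletons), round(estimated_error_rate(singletons), 3))
```

In this toy input the only decoy hit falls among the identifications seen once, which mirrors the qualitative observation that false positives concentrate in the nonoverlap regions.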

DISCUSSION
How mass spectrometry can best be used to exhaustively analyze protein samples has remained a logistical and computational challenge. In this report, we demonstrate that both protein and proteome coverage can be dramatically improved (i) when samples are analyzed multiple times; (ii) when samples are measured by complementary instruments, and to a lesser extent, (iii) when resulting MS/MS spectra are interpreted with complementary search algorithms (Fig. 5b). Furthermore, repeated identification by complementary systems presents a scoring-independent means by which correct peptide and protein identifications can be selected. These factors should be particularly effective for validating dubious protein identifications. For example, researchers often restrict their confident protein identifications to those identified by two or more peptides [23], as proteins identified by single peptides exhibit higher false positive rates. In support of this practice, we estimated that all false positive protein identifications were proteins identified by one peptide for both LTQ and QSTAR analyses. Although removal of this peptide class as an additional filtering step would likely bring the estimated dataset precision to near 100%, doing so would remove more than one third of all protein identifications (per replicate analysis), 85–95% of which were estimated to be correct (Supplementary Fig. 5). The validation techniques used here as well as applying more restrictive score criteria should prove useful in rescuing these potentially valuable identifications.

We present a direct comparison of ion trap and TOF instrumentation platforms for the analysis of complex protein mixtures. Several other mass spectrometer configurations have been demonstrated to have utility in defining proteomes [24–27]. Often limiting amounts of biological material, available instruments or computation capabilities place constraints on which approaches can be reasonably followed to achieve the most sensitive and accurate
proteome measurements. We believe a thorough understanding of the relative strengths of these platforms will prove invaluable to the proteome research community; when applied to these other platforms, the strategies explained here will allow researchers to make more rational choices regarding which analytical option will best suit their particular experimental goals. Similarly, many more spectral interpretation [28–31] and selection [17–19,32] options have been described than were used here. With so many data interpretation options, it is crucial that they be benchmarked for various mass spectrometer configurations. We have made available all raw data used here for this purpose (http://gygi.med.harvard.edu/pubs).

METHODS
Sample processing. Log-phase Saccharomyces cerevisiae cultures were lysed as described [33]. Lysate protein concentration was determined using a bicinchoninic acid (BCA) protein assay (Pierce). Roughly 4 mg of protein were reduced, alkylated with iodoacetamide, and separated by SDS-PAGE as described [34]. The resulting gel was divided into ten equal gel slices (Fig. 1). Five alternating 0.5-cm × 10-cm slices were subjected to in-gel tryptic digestion [35]. Following off-line desalting on C18 solid-phase extraction cartridges, 4% of each digest were subjected to analysis by LC-MS/MS on either LTQ (ThermoElectron) or QSTAR-XL (Applied Biosystems/MDS Sciex) mass spectrometers. Detailed description of the conditions used for LC-MS/MS analysis is available in Supplementary Methods online.

Database searching and data processing. All data were converted from raw instrument output to the .dta format using instrument-supplied software: Analyst QS, Build 7051 (MDS SCIEX) for QSTAR-acquired MS/MS spectra or the program ExtractMS version 2.11 (ThermoElectron) for LTQ-acquired MS/MS spectra. The program MascotMerge (Matrix Science) was used to convert .dta files into the Mascot Generic Format prior to launching Mascot searches.
Mascot (version 2.0, Matrix Science) searches were performed using a dual 1.1-GHz processor server at the Harvard Medical School Pathology Functional Proteomics Center (http://pfpc.med.harvard.edu). SEQUEST (version 27, revision 9) searches were performed against the same sequence database using an in-house Linux cluster with fourteen 2.2-GHz dual processor nodes. The protein sequence database used consisted of all translated open reading frame sequences (orf_trans.fasta from Saccharomyces Genome Database (SGD) at Stanford University; downloaded 10 September, 2004 (ftp.yeastgenome.org/yeast/)) in the forward (target) orientation preceding these same sequences in their reversed (decoy) orientations. All cysteine residues were searched as carboxamidomethylcysteine (+57.0215 Da), and methionine residues were allowed to be oxidized (+15.9949 Da). Up to two internal cleavage sites were allowed for tryptic searches. Parameters commonly set for all LTQ searches included use of average atomic masses, and a tolerance of 2.0 Da for precursor ions and 0.8 Da for fragment ions (Mascot). Parameters commonly set for all QSTAR searches included the use of monoisotopic atomic masses and a tolerance of 0.2 Da for both precursor and fragment ions. Specific score cutoffs were empirically determined for each replicate run by varying the scores listed in Supplementary Table 1 to maximize the number of accepted nonredundant PSMs,

keeping the precision rate as close to 99% as possible. The cutoffs determined for each run were averaged, and applied to all three data sets. Scripts written in the Perl programming language were used to import, export and compile search results to and from a PostgreSQL database. Further data manipulations were performed with Microsoft Excel and SigmaPlot (Systat Software, Inc.).
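The target-decoy procedure described above (a forward database followed by its reversed copy, score cutoffs chosen per replicate to hold precision near 99%, then averaged) can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the `REV_` prefix, the `PSM` record, and all function names are hypothetical; a single score stands in for the several score criteria listed in Supplementary Table 1; accepted PSMs are counted without collapsing redundant peptides; and precision is estimated as (total − 2 × decoy hits) / total, a common convention that the text above does not spell out.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional


@dataclass(frozen=True)
class PSM:
    peptide: str
    score: float
    is_decoy: bool  # True if the best match came from a reversed sequence


def composite_database(targets: Dict[str, str], prefix: str = "REV_") -> Dict[str, str]:
    """Return forward (target) entries followed by their reversed (decoy) copies."""
    composite = dict(targets)
    for name, seq in targets.items():
        composite[prefix + name] = seq[::-1]
    return composite


def estimated_precision(n_total: int, n_decoy: int) -> float:
    """Assumed estimate: each decoy hit implies one undetected target false positive."""
    if n_total == 0:
        return 1.0
    return max(0.0, (n_total - 2 * n_decoy) / n_total)


def lowest_passing_cutoff(psms: List[PSM], target: float = 0.99) -> Optional[float]:
    """Lowest score cutoff meeting the precision target.

    The accepted set only shrinks as the cutoff rises, so the lowest passing
    cutoff is also the one that accepts the most PSMs.
    """
    for cutoff in sorted({p.score for p in psms}):
        accepted = [p for p in psms if p.score >= cutoff]
        n_decoy = sum(p.is_decoy for p in accepted)
        if estimated_precision(len(accepted), n_decoy) >= target:
            return cutoff
    return None


# Made-up replicate runs (not data from the paper).
replicate_a = (
    [PSM("PEPTIDEK", 5.0, False)] * 299
    + [PSM("KEDITPEP", 5.0, True)]
    + [PSM("KEDITPEP", 1.0, True)] * 10
)
replicate_b = [PSM("PEPTIDEK", 4.0, False)] * 300 + [PSM("KEDITPEP", 2.0, True)] * 5

per_run = [lowest_passing_cutoff(r) for r in (replicate_a, replicate_b)]
shared_cutoff = mean(c for c in per_run if c is not None)  # averaged across runs
print(per_run, shared_cutoff)
```

Averaging the per-run cutoffs, as the text describes, trades a little per-run optimality for one criterion that can be applied uniformly to all data sets.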
Note: Supplementary information is available on the Nature Methods website.

ACKNOWLEDGMENTS
This work was supported in part by US National Institutes of Health GM67945 and HG00041 (S.P.G.). We thank D. Moazed for yeast lysate and the Pathology Functional Proteomic Center at Harvard Medical School for allowing use of their Mascot server.

COMPETING INTERESTS STATEMENT
The authors declare competing financial interests (see the Nature Methods website for details).

Received 6 April; accepted 26 July 2005
Published online at http://www.nature.com/naturemethods/

1. Aebersold, R. & Mann, M. Mass spectrometry-based proteomics. Nature 422, 198–207 (2003).
2. Aebersold, R. & Goodlett, D.R. Mass spectrometry in proteomics. Chem. Rev. 101, 269–295 (2001).
3. Florens, L. et al. A proteomic view of the Plasmodium falciparum life cycle. Nature 419, 520–526 (2002).
4. Peng, J. et al. A proteomics approach to understanding protein ubiquitination. Nat. Biotechnol. 21, 921–926 (2003).
5. Foster, L.J., De Hoog, C.L. & Mann, M. Unbiased quantitative proteomics of lipid rafts reveals high specificity for signaling factors. Proc. Natl. Acad. Sci. USA 100, 5813–5818 (2003).
6. Louris, J. et al. Instrumentation, applications, and energy deposition in quadrupole ion-trap tandem mass spectrometry. Anal. Chem. 59, 1677–1685 (1987).
7. Jonscher, K.R. & Yates, J.R., III. The quadrupole ion trap mass spectrometer – a small solution to a big challenge. Anal. Biochem. 244, 1–15 (1997).
8. Chernushevich, I.V., Loboda, A.V. & Thomson, B.A. An introduction to quadrupole-time-of-flight mass spectrometry. J. Mass Spectrom. 36, 849–865 (2001).
9. Schwartz, J.C., Senko, M.W. & Syka, J.A. Two-dimensional quadrupole ion trap mass spectrometer. J. Am. Soc. Mass Spectrom. 13, 659–669 (2002).
10. Mayya, V., Rezaul, K., Cong, Y.S. & Han, D. Systematic comparison of a two-dimensional ion trap and a three-dimensional ion trap mass spectrometer in proteomics. Mol. Cell. Proteomics 4, 214–223 (2005).
11. Perkins, D.N., Pappin, D.J., Creasy, D.M. & Cottrell, J.S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 20, 3551–3567 (1999).
12. Eng, J.K., McCormack, A.L. & Yates, J.R., III. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 5, 976–989 (1994).
13. Sadygov, R.G., Cociorva, D. & Yates, J.R., III. Large-scale database searching using tandem mass spectra: Looking up the answer in the back of the book. Nat. Methods 1, 195–202 (2004).
14. Lasonder, E. et al. Analysis of the Plasmodium falciparum proteome by high-accuracy mass spectrometry. Nature 419, 537–542 (2002).
15. Resing, K.A. et al. Improving reproducibility and sensitivity in identifying human proteins by shotgun proteomics. Anal. Chem. 76, 3556–3568 (2004).
16. Fenyo, D. & Beavis, R.C. A method for assessing the statistical significance of mass spectrometry-based protein identifications using general scoring schemes. Anal. Chem. 75, 768–774 (2003).
17. Keller, A., Nesvizhskii, A.I., Kolker, E. & Aebersold, R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal. Chem. 74, 5383–5392 (2002).
18. Elias, J.E., Gibbons, F.D., King, O.D., Roth, F.P. & Gygi, S.P. Intensity-based protein identification by machine learning from a library of tandem mass spectra. Nat. Biotechnol. 22, 214–219 (2004).
19. Moore, R.E., Young, M.K. & Lee, T.D. Qscore: an algorithm for evaluating SEQUEST database search results. J. Am. Soc. Mass Spectrom. 13, 378–386 (2002).
20. Peng, J., Elias, J.E., Thoreen, C.C., Licklider, L.J. & Gygi, S.P. Evaluation of multidimensional chromatography coupled with tandem mass spectrometry (LC/LC-MS/MS) for large-scale protein analysis: the yeast proteome. J. Proteome Res. 2, 43–50 (2003).
21. Sadygov, R.G. & Yates, J.R., III. A hypergeometric probability model for protein identification and validation using tandem mass spectral data and protein sequence databases. Anal. Chem. 75, 3792–3798 (2003).
22. Liu, H., Sadygov, R.G. & Yates, J.R., III. A model for random sampling and estimation of relative protein abundance in shotgun proteomics. Anal. Chem. 76, 4193–4201 (2004).
23. Kratchmarova, I., Blagoev, B., Haack-Sorensen, M., Kassem, M. & Mann, M. Mechanism of divergent growth factor effects in mesenchymal stem cell differentiation. Science 308, 1472–1477 (2005).
24. Medzihradszky, K.F. et al. The characteristics of peptide collision-induced dissociation using a high-performance MALDI-TOF/TOF tandem mass spectrometer. Anal. Chem. 72, 552–558 (2000).
25. Hager, J.W. A new linear ion trap mass spectrometer. Rapid Comm. Mass Spec. 16, 512–526 (2002).
26. Lipton, M.S. et al. Global analysis of the Deinococcus radiodurans proteome by using accurate mass tags. Proc. Natl. Acad. Sci. USA 99, 11049–11054 (2002).
27. Meng, F. et al. Molecular-level description of proteins from Saccharomyces cerevisiae using quadrupole FT hybrid mass spectrometry for top down proteomics. Anal. Chem. 76, 2852–2858 (2004).
28. Zhang, N., Aebersold, R. & Schwikowski, B. ProbID: a probabilistic algorithm to identify peptides through sequence database searching using tandem mass spectral data. Proteomics 2, 1406–1412 (2002).
29. Tabb, D.L., Saraf, A. & Yates, J.R., III. GutenTag: high-throughput sequence tagging via an empirically derived fragmentation model. Anal. Chem. 75, 6415–6421 (2003).
30. LeDuc, R.D. et al. ProSight PTM: an integrated environment for protein identification and characterization by top-down mass spectrometry. Nucleic Acids Res. 32, W340–W345 (2004).
31. Chamrad, D.C. et al. Evaluation of algorithms for protein identification from sequence databases using mass spectrometry data. Proteomics 4, 619–628 (2004).
32. Nesvizhskii, A.I., Keller, A., Kolker, E. & Aebersold, R. A statistical model for identifying proteins by tandem mass spectrometry. Anal. Chem. 75, 4646–4658 (2003).
33. Verdel, A. & Moazed, D. Labeling and characterization of small RNAs associated with the RNA interference effector complex RITS. Methods Enzymol. 392, 297–307 (2005).
34. Beausoleil, S.A. et al. Large-scale characterization of HeLa cell nuclear phosphoproteins. Proc. Natl. Acad. Sci. USA 101, 12130–12135 (2004).
35. Peng, J. & Gygi, S.P. Proteomics: the move to mixtures. J. Mass Spectrom. 36, 1083–1091 (2001).
