GENESHIFT: A Nonparametric Approach for Integrating Microarray Gene Expression Data Based on the Inner Product as a Distance Measure between the Distributions of Genes

Published: 01 March 2013

Abstract

The potential of microarray gene expression (MAGE) data is only partially explored because individual studies contain a limited number of samples. This limitation can be surmounted by merging or integrating data sets that originate from independent MAGE experiments designed to study the same biological problem. However, this process is hindered by batch effects, which are study-dependent and result in random data distortion; numerical transformations are therefore needed to make the integration of different data sets accurate and meaningful. Our contribution in this paper is two-fold. First, we propose GENESHIFT, a new nonparametric batch effect removal method based on two key elements from statistics: empirical density estimation and the inner product as a distance measure between two probability density functions. Second, we introduce a new validation index for batch effect removal methods, based on the observation that samples from two independent studies drawn from the same population should exhibit similar probability density functions. We evaluated and compared the GENESHIFT method with four other state-of-the-art methods for batch effect removal: batch-mean centering, empirical Bayes (COMBAT), distance-weighted discrimination, and cross-platform normalization. Our validation framework employs several validation indices that provide complementary information about the efficiency of batch effect removal methods. The results show that none of the methods clearly outperforms the others. Moreover, most of the methods used for comparison perform very well with respect to some validation indices while performing very poorly with respect to others. GENESHIFT exhibits robust performance, and its average rank is the highest among all methods compared.
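The abstract names two statistical ingredients, empirical density estimation and the inner product between two probability density functions, without giving the algorithm itself. The sketch below only illustrates those two ingredients and is not the GENESHIFT procedure. It assumes Gaussian kernel density estimates with a shared, arbitrarily chosen bandwidth, simulated expression values for a single gene in two batches, and batch-mean centering (one of the comparison methods listed above) as the correction. The function names `kde_inner_product` and `batch_mean_center` are hypothetical.

```python
import numpy as np


def kde_inner_product(x, y, bandwidth=0.5):
    """Inner product <f, g> = integral of f(t) * g(t) dt for two Gaussian KDEs.

    f is the kernel density estimate built from samples x, g the one built
    from samples y, both with the same bandwidth h (an assumption made here).
    For Gaussian kernels the integral has a closed form: the average, over
    all sample pairs (i, j), of a zero-mean normal density with variance
    2 * h**2 evaluated at x_i - y_j.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    diffs = x[:, None] - y[None, :]      # all pairwise differences
    var = 2.0 * bandwidth ** 2           # variance of the convolved kernels
    pair_density = np.exp(-diffs ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)
    return float(pair_density.mean())


def batch_mean_center(x):
    """Batch-mean centering: subtract the batch mean of the gene."""
    x = np.asarray(x, dtype=float)
    return x - x.mean()


# Simulated (not real) expression values of one gene in two batches that
# differ only by an additive offset, mimicking a simple batch effect.
rng = np.random.default_rng(0)
batch_a = rng.normal(loc=7.0, scale=1.0, size=200)
batch_b = rng.normal(loc=9.0, scale=1.0, size=150)

before = kde_inner_product(batch_a, batch_b)
after = kde_inner_product(batch_mean_center(batch_a), batch_mean_center(batch_b))

print(f"inner product before centering: {before:.4f}")
print(f"inner product after centering:  {after:.4f}")
```

A larger inner product means the two estimated densities overlap more, so in this toy setting the value rises once the additive offset between the batches is removed. How the paper turns the inner product into a distance, and how it chooses the transformation, is not stated in the abstract.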

Published In

IEEE/ACM Transactions on Computational Biology and Bioinformatics, Volume 10, Issue 2
March 2013
272 pages

Publisher

IEEE Computer Society Press
Washington, DC, United States

Author Tags

1. Batch effects
2. Data integration
3. Estimation
4. Gene expression
5. Lungs
6. Sociology
7. Statistics
8. density estimation
9. distance measures between probability density functions
10. inner product
11. integrative analysis of gene expression microarrays
12. microarray data integration
13. nonparametric methods

Qualifiers

• Article
