article

Initializing Partition-Optimization Algorithms

Author:

Ranjan MaitraAuthors Info & Claims

IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), Volume 6, Issue 1

Pages 144 - 157

https://doi.org/10.1109/TCBB.2007.70244

Published: 01 January 2009 Publication History

Abstract

Clustering datasets is a challenging problem needed in a wide array of applications. Partition-optimization approaches, such as k-means or expectation-maximization (EM) algorithms, are sub-optimal and find solutions in the vicinity of their initialization. This paper proposes a staged approach to specifying initial values by finding a large number of local modes and then obtaining representatives from the most separated ones. Results on test experiments are excellent. We also provide a detailed comparative assessment of the suggested algorithm with many commonly-used initialization approaches in the literature. Finally, the methodology is applied to two datasets on diurnal microarray gene expressions and industrial releases of mercury.

References

[1]

J.D. Banfield and A.E. Raftery, "Model-Based Gaussian and Non-Gaussian Clustering," Biometrics, vol. 49, pp. 803-821, 1993.

[2]

G. Brossier, "Piece-Wise Hierarchical Clustering," J. Classification, vol. 7, pp. 197-216, 1990.

[3]

G. Celeux and G. Govaert, "Gaussian Parsimonious Clustering Models," Computational Statistics and Data Analysis, vol. 28, pp. 781-793, 1995.

[4]

W.F. Eddy, A. Mockus, and S. Oue, "Approximate Single Linkage Cluster Analysis of Large Datasets in High-Dimensional Spaces," Computational Statistics and Data Analysis, vol. 23, pp. 29-43, 1996.

Digital Library

[5]

B.S. Everitt, "A Finite Mixture Model for the Clustering of Mixed-Mode Data," Statistics and Probability Letters, vol. 6, pp. 305-309, 1988.

[6]

I.J. Good, "The Clustering of Random Variables," J. Statistical Computing and Simulation, vol. 9, pp. 241-248, 1979.

[7]

J. Hartigan, "Statistical Theory in Clustering," J. Classification, vol. 2, pp. 63-76, 1985.

[8]

J.R. Kettenring, "The Practice of Cluster Analysis," J. Classification, vol. 23, pp. 3-30, 2006.

[9]

F. Murtagh, Multi-Dimensional Clustering Algorithms. Springer-Verlag, 1985.

[10]

D.B. Ramey, "Nonparametric Clustering Techniques," Encyclopedia of Statistical Science, vol. 6, pp. 318-319, Wiley, 1985.

[11]

B.D. Ripley, "Classification and Clustering in Spatial and Image Data," Analyzing and Modeling Data and Knowledge. pp. 85-91, Springer-Verlag, 1991.

[12]

M.J. Symons, "Clustering Criteria and Multivariate Normal Mixtures," Biometrics, vol. 37, pp. 35-43, 1981.

[13]

R.J. Tibshirani and G. Walther, "Cluster Validation by Prediction Strength," J. Computational and Graphical Statistics, vol. 14, no. 3, pp. 511-528, 2005.

[14]

H.P. Friedman and J. Rubin, "On Some Invariant Criteria for Grouping Data," J. Am. Statistical Assoc., vol. 62, pp. 1159-1178, 1967.

[15]

A.J. Scott and M.J. Symons, "Clustering Methods Based on Likelihood Ratio Criteria," Biometrics, vol. 27, pp. 387-397, 1971.

[16]

W.D. Fisher, "On Grouping for Maximum Homogeneity," J. Am. Statistical Assoc., vol. 53, pp. 789-798, 1958.

[17]

L. Kaufman and P.J. Rousseuw, Finding Groups in Data. John Wiley and Sons, 1990.

[18]

D. Steinley, "Local Optima in k-means Clustering: What You Don't Know May Hurt You," Psychological Methods, vol. 8, pp. 294-304, 2003.

[19]

S.C. Zeeman, A. Tiessen, E. Pilling, K.L. Kato, A.M. Donald, and A.M. Smith, "Starch Synthesis in Arabidopsis. Granule Synthesis, Composition, and Structure," Plant Physiology, vol. 129, pp. 516- 529, 2002.

[20]

A.M. Smith, S.C. Zeeman, D. Thorneycroft, and S.M. Smith, "Starch Mobilization in Leaves," J. Experimental Botany, vol. 54, pp. 577-583, 2003.

[21]

S.M. Smith, D.C. Fulton, T. Chia, D. Thorneycroft, A. Chapple, H. Dunstan, C. Hylton, S.C. Zeeman, and A.M. Smith, "Diurnal Changes in the Transcriptome Encoding Enzymes of Starch Metabolism Provide Evidence for Both Transcriptional and Posttranscriptional Regulation of Starch Metabolism in Arabidopsis Leaves," Plant Physiology, vol. 136, pp. 2687-2699, 2004.

[22]

P. Grandjean, P. Weihe, R.F. White, F. Debes, S. Araki, K. Yokoyama, K. Murata, N. Sörensen, R. Dahl, and P.J. Jorgensen, "Cognitive Deficit in 7-Year-Old Children with Prenatal Exposure to Methylmercury," Neurotoxicology and Teratology, vol. 19, no. 6, pp. 417-428, 1997.

[23]

T. Kjellstrom, P. Kennedy, S. Wallis, and C. Mantell, Physical and Mental Development of Children with Prenatal Exposure to Mercury from Fish. Stage 1: Preliminary Tests at Age 4. Swedish Nat'l Environmental Protection Board, 1986.

[24]

N. Sörensen, K. Murata, E. Budtz-Jørgensen, P. Weihe, and P. Grandjean, "Prenatal Methylmercury Exposure as a Cardiovascular Risk Factor at Seven Years of Age," Epidemiology, vol. 10, no. 4, pp. 370-375, 1999.

[25]

Toxicological Effects of Methylmercury, N.A. of Sciences, Nat'l Academy Press, 2000.

[26]

America's Children and the Environment: Measures of Contaminants, Body Burdens, and Illnesses. US Environmental Protection Agency, 2003.

[27]

W.F. Fitzgerald, D.R. Engstrom, R.P. Mason, and E.A. Nater, "The Case for Atmospheric Mercury Contamination in Remote Areas," Environmental Science and Technology, vol. 32, pp. 1-7, 1998.

[28]

R: A Language and Environment for Statistical Computing, R Development Core Team, R Foundation for Statistical Computing, ISBN 3-900051-07-0, http://www.R-project.org, 2006.

[29]

J.A. Lozano, J.M.P. Na, and P.L. Naga, "An Empirical Comparison of Four Initialization Methods for the k-Means Algorithm," Pattern Recognition Letters, vol. 20, pp. 1027-1040, 1999.

Digital Library

[30]

P.S. Bradley and U.M. Fayyad, "Refining Initial Points for K-Means Clustering," Proc. 15th Int'l Conf. Machine Learning (ICML '98), pp. 91-99, 1998.

Digital Library

[31]

U. Fayyad, C. Reina, and P. Bradley, "Initialization of Iterative Refinement Clustering Algorithms," Proc. Fourth Int'l Conf. Knowledge Discovery and Data Mining (KDD '98), pp. 194-198, 1998.

[32]

C. Fraley and A.E. Raftery, "How Many Clusters? Which Cluster Method? Answers via Model-Based Cluster Analysis," Computer J., vol. 41, pp. 578-588, 1998.

[33]

G.C. Tseng and W.H. Wong, "Tight Clustering: A Resampling-Based Approach for Identifying Stable and Tight Patterns in Data," Biometrics, vol. 61, pp. 10-16, 2005.

[34]

M.B. Al-Daoud, "A New Algorithm for Cluster Initialization," Trans. Eng., Computing, and Technology, vol. V4, pp. 74-76, 2005.

[35]

C. Biernacki, G.C. Celeux, and G. Govaert, "Choosing Starting Values for the Em Algorithm for Getting the Highest Likelihood in Multivariate Gaussian Mixture Models," Computational Statistics and Data Analysis, vol. 413, pp. 561-575, 2003.

Digital Library

[36]

J.W. Demmel, Applied Numerical Linear Algebra. SIAM, 1977.

Digital Library

[37]

J. Friedman and N. Fisher, "Bump-Hunting in High-Dimensional Data," Statistics and Computing, vol. 9, no. 2, pp. 1-20, 1999.

Digital Library

[38]

L. Hubert and P. Arabie, "Comparing Partitions," J. Classification, vol. 2, pp. 193-218, 1985.

[39]

W.M. Rand, "Objective Criteria for the Evaluation of Clustering Methods," J. Am. Statistical Assoc., vol. 66, pp. 846-850, 1971.

[40]

D. Steinley and M.J. Brusco, "Initializing k-Means Batch Clustering: A Critical Evaluation of Several Techniques," J. Classification, vol. 24, pp. 99-121, 2007.

Digital Library

[41]

J.A. Hartigan and M.A. Wong, "A k-Means Clustering Algorithm," Applied Statistics, vol. 28, pp. 100-108, 1979.

Digital Library

[42]

F.H. Marriott, "Practical Problems in a Method of Cluster Analysis," Biometrics, vol. 27, pp. 501-514, 1971.

[43]

G. Schwarz, "Estimating the Dimensions of a Model," Annals of Statistics, vol. 6, pp. 461-464, 1978.

[44]

S. Dasgupta, "Learning Mixtures of Gaussians," Proc. IEEE Symp. Foundations of Computer Science (FOCS' 99), pp. 633-644, 1999.

Digital Library

[45]

C.B.D.J. Newman, S. Hettich, and C. Merz, "UCI Repository of Machine Learning Databases," http://www.ics.uci.edu/ ~mlearn/MLRepository.html, 1998.

[46]

K. Nakai and M. Kinehasa, "Expert System for Predicting Protein Localization Sites in Gram-Negative Bacteria," PROTEINS: Structure, Function, and Genetics, vol. 11, pp. 95-110, 1991.

[47]

R. Maitra, "A Statistical Perspective to Data Mining," J. Indian Soc. Probability and Statistics, vol. 6, pp. 28-77, 2002.

[48]

P. Horton and K. Nakai, "A Probabilistic Classification System for Predicting the Cellular Localization Sites of Proteins," Intelligent Systems in Molecular Biology, pp. 109-115, 1985.

Digital Library

[49]

Y. Benjamini and Y. Hochberg, "Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing," J. Royal Statistical Soc., vol. 57, pp. 289-300, 1995.

[50]

R.J. Tibshirani, G. Walther, and T.J. Hastie, "Estimating the Number of Clusters in a Dataset via the Gap Statistic," J. Royal Statistical Soc., vol. 63, no. 2, pp. 411-423, 2003.

[51]

S.H. Lin, "Statistical Behavior of Rain Attenuation," Bell System Technical J., vol. 52, no. 4, pp. 557-581, 1973.

Cited By

Vicente SMurua-Sazo A(2022)Determinantal consensus clusteringAdvances in Data Analysis and Classification10.1007/s11634-022-00514-617:4(829-858)Online publication date: 25-Aug-2022
https://dl.acm.org/doi/10.1007/s11634-022-00514-6
Xu YYe TWang XLai YQiu JZhang LZhang X(2021)GMM with parameters initialization based on SVD for network threat detectionJournal of Intelligent & Fuzzy Systems: Applications in Engineering and Technology10.3233/JIFS-20006640:1(477-490)Online publication date: 1-Jan-2021
https://dl.acm.org/doi/10.3233/JIFS-200066
Zhang YMelnykov VZhu X(2021)Model‐based clustering of time‐dependent categorical sequences with application to the analysis of major life event patternsStatistical Analysis and Data Mining10.1002/sam.1150214:3(230-240)Online publication date: 4-May-2021
https://dl.acm.org/doi/10.1002/sam.11502
Show More Cited By

Index Terms

Initializing Partition-Optimization Algorithms
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Supervised learning
        Supervised learning by classification
      2. Unsupervised learning
        Cluster analysis
    2. Machine learning approaches
      1. Classification and regression trees
  2. Symbolic and algebraic manipulation
    1. Symbolic and algebraic algorithms
      1. Linear algebra algorithms
2. Mathematics of computing
  1. Mathematical analysis
    1. Numerical analysis
      1. Computations on matrices
  2. Probability and statistics
    1. Multivariate statistics

Recommendations

Maximal Subspace Coregulated Gene Clustering

Clustering is a popular technique for analyzing microarray datasets, with n genes and m experimental conditions. As explored by biologists, there is a real need to identify co-regulated gene clusters, which includes both positive/negative regulated gene ...
Modeling and Visualizing Uncertainty in Gene Expression Clusters Using Dirichlet Process Mixtures

Although the use of clustering methods has rapidly become one of the standard computational approaches in the literature of microarray gene expression data, little attention has been paid to uncertainty in the results obtained. Dirichlet process mixture ...
Cluster analysis using multivariate normal mixture models to detect differential gene expression with microarray data

DNA microarrays make it possible to study simultaneously the expression of thousands of genes in a biological sample. Univariate clustering techniques have been used to discover target genes with differential expression between two experimental ...

Comments

Information & Contributors

Information

Published In

cover image IEEE/ACM Transactions on Computational Biology and Bioinformatics

IEEE/ACM Transactions on Computational Biology and Bioinformatics Volume 6, Issue 1

January 2009

173 pages

ISSN:1545-5963

Issue’s Table of Contents

Publisher

IEEE Computer Society Press

Washington, DC, United States

Publication History

Published: 01 January 2009

Published in TCBB Volume 6, Issue 1

Author Tags

Qualifiers

Article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

18
Total Citations
View Citations
273
Total Downloads

Downloads (Last 12 months)4
Downloads (Last 6 weeks)2

Reflects downloads up to 25 Jan 2025

Other Metrics

View Author Metrics

Citations

Cited By

Vicente SMurua-Sazo A(2022)Determinantal consensus clusteringAdvances in Data Analysis and Classification10.1007/s11634-022-00514-617:4(829-858)Online publication date: 25-Aug-2022
https://dl.acm.org/doi/10.1007/s11634-022-00514-6
Xu YYe TWang XLai YQiu JZhang LZhang X(2021)GMM with parameters initialization based on SVD for network threat detectionJournal of Intelligent & Fuzzy Systems: Applications in Engineering and Technology10.3233/JIFS-20006640:1(477-490)Online publication date: 1-Jan-2021
https://dl.acm.org/doi/10.3233/JIFS-200066
Zhang YMelnykov VZhu X(2021)Model‐based clustering of time‐dependent categorical sequences with application to the analysis of major life event patternsStatistical Analysis and Data Mining10.1002/sam.1150214:3(230-240)Online publication date: 4-May-2021
https://dl.acm.org/doi/10.1002/sam.11502
Almodóvar-Rivera IMaitra R(2020)Kernel-estimated nonparametric overlap-based syncytial clusteringThe Journal of Machine Learning Research10.5555/3455716.345583821:1(4808-4861)Online publication date: 1-Jan-2020
https://dl.acm.org/doi/10.5555/3455716.3455838
Menardi G(2020)Nonparametric clustering for image segmentationStatistical Analysis and Data Mining10.1002/sam.1144413:1(83-97)Online publication date: 1-Jan-2020
https://dl.acm.org/doi/10.1002/sam.11444
Berry NMaitra R(2019)TiK‐meansStatistical Analysis and Data Mining10.1002/sam.1141612:3(223-233)Online publication date: 20-May-2019
https://dl.acm.org/doi/10.1002/sam.11416
Lithio AMaitra R(2018)An efficient k‐means‐type algorithm for clustering datasets with incomplete recordsStatistical Analysis and Data Mining10.1002/sam.1139211:6(296-311)Online publication date: 16-Nov-2018
https://dl.acm.org/doi/10.1002/sam.11392
Notsu AEguchi S(2016)Robust clustering method in the presence of scattered observationsNeural Computation10.1162/NECO_a_0083328:6(1141-1162)Online publication date: 1-Jun-2016
https://dl.acm.org/doi/10.1162/NECO_a_00833
Covões THruschka EGhosh J(2016)Evolving gaussian mixture models with splitting and merging mutation operatorsEvolutionary Computation10.1162/EVCO_a_0015224:2(293-317)Online publication date: 1-Jun-2016
https://dl.acm.org/doi/10.1162/EVCO_a_00152
Lin TMcLachlan GLee S(2016)Extending mixtures of factor models using the restricted multivariate skew-normal distributionJournal of Multivariate Analysis10.1016/j.jmva.2015.09.025143:C(398-413)Online publication date: 1-Jan-2016
https://dl.acm.org/doi/10.1016/j.jmva.2015.09.025
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Issue’s Table of Contents