skip to main content
article

Initializing Partition-Optimization Algorithms

Published: 01 January 2009 Publication History

Abstract

Clustering datasets is a challenging problem needed in a wide array of applications. Partition-optimization approaches, such as k-means or expectation-maximization (EM) algorithms, are sub-optimal and find solutions in the vicinity of their initialization. This paper proposes a staged approach to specifying initial values by finding a large number of local modes and then obtaining representatives from the most separated ones. Results on test experiments are excellent. We also provide a detailed comparative assessment of the suggested algorithm with many commonly-used initialization approaches in the literature. Finally, the methodology is applied to two datasets on diurnal microarray gene expressions and industrial releases of mercury.

References

[1]
J.D. Banfield and A.E. Raftery, "Model-Based Gaussian and Non-Gaussian Clustering," Biometrics, vol. 49, pp. 803-821, 1993.
[2]
G. Brossier, "Piece-Wise Hierarchical Clustering," J. Classification, vol. 7, pp. 197-216, 1990.
[3]
G. Celeux and G. Govaert, "Gaussian Parsimonious Clustering Models," Computational Statistics and Data Analysis, vol. 28, pp. 781-793, 1995.
[4]
W.F. Eddy, A. Mockus, and S. Oue, "Approximate Single Linkage Cluster Analysis of Large Datasets in High-Dimensional Spaces," Computational Statistics and Data Analysis, vol. 23, pp. 29-43, 1996.
[5]
B.S. Everitt, "A Finite Mixture Model for the Clustering of Mixed-Mode Data," Statistics and Probability Letters, vol. 6, pp. 305-309, 1988.
[6]
I.J. Good, "The Clustering of Random Variables," J. Statistical Computing and Simulation, vol. 9, pp. 241-248, 1979.
[7]
J. Hartigan, "Statistical Theory in Clustering," J. Classification, vol. 2, pp. 63-76, 1985.
[8]
J.R. Kettenring, "The Practice of Cluster Analysis," J. Classification, vol. 23, pp. 3-30, 2006.
[9]
F. Murtagh, Multi-Dimensional Clustering Algorithms. Springer-Verlag, 1985.
[10]
D.B. Ramey, "Nonparametric Clustering Techniques," Encyclopedia of Statistical Science, vol. 6, pp. 318-319, Wiley, 1985.
[11]
B.D. Ripley, "Classification and Clustering in Spatial and Image Data," Analyzing and Modeling Data and Knowledge. pp. 85-91, Springer-Verlag, 1991.
[12]
M.J. Symons, "Clustering Criteria and Multivariate Normal Mixtures," Biometrics, vol. 37, pp. 35-43, 1981.
[13]
R.J. Tibshirani and G. Walther, "Cluster Validation by Prediction Strength," J. Computational and Graphical Statistics, vol. 14, no. 3, pp. 511-528, 2005.
[14]
H.P. Friedman and J. Rubin, "On Some Invariant Criteria for Grouping Data," J. Am. Statistical Assoc., vol. 62, pp. 1159-1178, 1967.
[15]
A.J. Scott and M.J. Symons, "Clustering Methods Based on Likelihood Ratio Criteria," Biometrics, vol. 27, pp. 387-397, 1971.
[16]
W.D. Fisher, "On Grouping for Maximum Homogeneity," J. Am. Statistical Assoc., vol. 53, pp. 789-798, 1958.
[17]
L. Kaufman and P.J. Rousseuw, Finding Groups in Data. John Wiley and Sons, 1990.
[18]
D. Steinley, "Local Optima in k-means Clustering: What You Don't Know May Hurt You," Psychological Methods, vol. 8, pp. 294-304, 2003.
[19]
S.C. Zeeman, A. Tiessen, E. Pilling, K.L. Kato, A.M. Donald, and A.M. Smith, "Starch Synthesis in Arabidopsis. Granule Synthesis, Composition, and Structure," Plant Physiology, vol. 129, pp. 516- 529, 2002.
[20]
A.M. Smith, S.C. Zeeman, D. Thorneycroft, and S.M. Smith, "Starch Mobilization in Leaves," J. Experimental Botany, vol. 54, pp. 577-583, 2003.
[21]
S.M. Smith, D.C. Fulton, T. Chia, D. Thorneycroft, A. Chapple, H. Dunstan, C. Hylton, S.C. Zeeman, and A.M. Smith, "Diurnal Changes in the Transcriptome Encoding Enzymes of Starch Metabolism Provide Evidence for Both Transcriptional and Posttranscriptional Regulation of Starch Metabolism in Arabidopsis Leaves," Plant Physiology, vol. 136, pp. 2687-2699, 2004.
[22]
P. Grandjean, P. Weihe, R.F. White, F. Debes, S. Araki, K. Yokoyama, K. Murata, N. Sörensen, R. Dahl, and P.J. Jorgensen, "Cognitive Deficit in 7-Year-Old Children with Prenatal Exposure to Methylmercury," Neurotoxicology and Teratology, vol. 19, no. 6, pp. 417-428, 1997.
[23]
T. Kjellstrom, P. Kennedy, S. Wallis, and C. Mantell, Physical and Mental Development of Children with Prenatal Exposure to Mercury from Fish. Stage 1: Preliminary Tests at Age 4. Swedish Nat'l Environmental Protection Board, 1986.
[24]
N. Sörensen, K. Murata, E. Budtz-Jørgensen, P. Weihe, and P. Grandjean, "Prenatal Methylmercury Exposure as a Cardiovascular Risk Factor at Seven Years of Age," Epidemiology, vol. 10, no. 4, pp. 370-375, 1999.
[25]
Toxicological Effects of Methylmercury, N.A. of Sciences, Nat'l Academy Press, 2000.
[26]
America's Children and the Environment: Measures of Contaminants, Body Burdens, and Illnesses. US Environmental Protection Agency, 2003.
[27]
W.F. Fitzgerald, D.R. Engstrom, R.P. Mason, and E.A. Nater, "The Case for Atmospheric Mercury Contamination in Remote Areas," Environmental Science and Technology, vol. 32, pp. 1-7, 1998.
[28]
R: A Language and Environment for Statistical Computing, R Development Core Team, R Foundation for Statistical Computing, ISBN 3-900051-07-0, http://www.R-project.org, 2006.
[29]
J.A. Lozano, J.M.P. Na, and P.L. Naga, "An Empirical Comparison of Four Initialization Methods for the k-Means Algorithm," Pattern Recognition Letters, vol. 20, pp. 1027-1040, 1999.
[30]
P.S. Bradley and U.M. Fayyad, "Refining Initial Points for K-Means Clustering," Proc. 15th Int'l Conf. Machine Learning (ICML '98), pp. 91-99, 1998.
[31]
U. Fayyad, C. Reina, and P. Bradley, "Initialization of Iterative Refinement Clustering Algorithms," Proc. Fourth Int'l Conf. Knowledge Discovery and Data Mining (KDD '98), pp. 194-198, 1998.
[32]
C. Fraley and A.E. Raftery, "How Many Clusters? Which Cluster Method? Answers via Model-Based Cluster Analysis," Computer J., vol. 41, pp. 578-588, 1998.
[33]
G.C. Tseng and W.H. Wong, "Tight Clustering: A Resampling-Based Approach for Identifying Stable and Tight Patterns in Data," Biometrics, vol. 61, pp. 10-16, 2005.
[34]
M.B. Al-Daoud, "A New Algorithm for Cluster Initialization," Trans. Eng., Computing, and Technology, vol. V4, pp. 74-76, 2005.
[35]
C. Biernacki, G.C. Celeux, and G. Govaert, "Choosing Starting Values for the Em Algorithm for Getting the Highest Likelihood in Multivariate Gaussian Mixture Models," Computational Statistics and Data Analysis, vol. 413, pp. 561-575, 2003.
[36]
J.W. Demmel, Applied Numerical Linear Algebra. SIAM, 1977.
[37]
J. Friedman and N. Fisher, "Bump-Hunting in High-Dimensional Data," Statistics and Computing, vol. 9, no. 2, pp. 1-20, 1999.
[38]
L. Hubert and P. Arabie, "Comparing Partitions," J. Classification, vol. 2, pp. 193-218, 1985.
[39]
W.M. Rand, "Objective Criteria for the Evaluation of Clustering Methods," J. Am. Statistical Assoc., vol. 66, pp. 846-850, 1971.
[40]
D. Steinley and M.J. Brusco, "Initializing k-Means Batch Clustering: A Critical Evaluation of Several Techniques," J. Classification, vol. 24, pp. 99-121, 2007.
[41]
J.A. Hartigan and M.A. Wong, "A k-Means Clustering Algorithm," Applied Statistics, vol. 28, pp. 100-108, 1979.
[42]
F.H. Marriott, "Practical Problems in a Method of Cluster Analysis," Biometrics, vol. 27, pp. 501-514, 1971.
[43]
G. Schwarz, "Estimating the Dimensions of a Model," Annals of Statistics, vol. 6, pp. 461-464, 1978.
[44]
S. Dasgupta, "Learning Mixtures of Gaussians," Proc. IEEE Symp. Foundations of Computer Science (FOCS' 99), pp. 633-644, 1999.
[45]
C.B.D.J. Newman, S. Hettich, and C. Merz, "UCI Repository of Machine Learning Databases," http://www.ics.uci.edu/ ~mlearn/MLRepository.html, 1998.
[46]
K. Nakai and M. Kinehasa, "Expert System for Predicting Protein Localization Sites in Gram-Negative Bacteria," PROTEINS: Structure, Function, and Genetics, vol. 11, pp. 95-110, 1991.
[47]
R. Maitra, "A Statistical Perspective to Data Mining," J. Indian Soc. Probability and Statistics, vol. 6, pp. 28-77, 2002.
[48]
P. Horton and K. Nakai, "A Probabilistic Classification System for Predicting the Cellular Localization Sites of Proteins," Intelligent Systems in Molecular Biology, pp. 109-115, 1985.
[49]
Y. Benjamini and Y. Hochberg, "Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing," J. Royal Statistical Soc., vol. 57, pp. 289-300, 1995.
[50]
R.J. Tibshirani, G. Walther, and T.J. Hastie, "Estimating the Number of Clusters in a Dataset via the Gap Statistic," J. Royal Statistical Soc., vol. 63, no. 2, pp. 411-423, 2003.
[51]
S.H. Lin, "Statistical Behavior of Rain Attenuation," Bell System Technical J., vol. 52, no. 4, pp. 557-581, 1973.

Cited By

View all
  • (2022)Determinantal consensus clusteringAdvances in Data Analysis and Classification10.1007/s11634-022-00514-617:4(829-858)Online publication date: 25-Aug-2022
  • (2021)GMM with parameters initialization based on SVD for network threat detectionJournal of Intelligent & Fuzzy Systems: Applications in Engineering and Technology10.3233/JIFS-20006640:1(477-490)Online publication date: 1-Jan-2021
  • (2021)Model‐based clustering of time‐dependent categorical sequences with application to the analysis of major life event patternsStatistical Analysis and Data Mining10.1002/sam.1150214:3(230-240)Online publication date: 4-May-2021
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image IEEE/ACM Transactions on Computational Biology and Bioinformatics
IEEE/ACM Transactions on Computational Biology and Bioinformatics  Volume 6, Issue 1
January 2009
173 pages

Publisher

IEEE Computer Society Press

Washington, DC, United States

Publication History

Published: 01 January 2009
Published in TCBB Volume 6, Issue 1

Author Tags

  1. Clustering
  2. Multivariate statistics
  3. Singular value decomposition
  4. Statistical methods
  5. and association rules
  6. classification

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)4
  • Downloads (Last 6 weeks)2
Reflects downloads up to 25 Jan 2025

Other Metrics

Citations

Cited By

View all
  • (2022)Determinantal consensus clusteringAdvances in Data Analysis and Classification10.1007/s11634-022-00514-617:4(829-858)Online publication date: 25-Aug-2022
  • (2021)GMM with parameters initialization based on SVD for network threat detectionJournal of Intelligent & Fuzzy Systems: Applications in Engineering and Technology10.3233/JIFS-20006640:1(477-490)Online publication date: 1-Jan-2021
  • (2021)Model‐based clustering of time‐dependent categorical sequences with application to the analysis of major life event patternsStatistical Analysis and Data Mining10.1002/sam.1150214:3(230-240)Online publication date: 4-May-2021
  • (2020)Kernel-estimated nonparametric overlap-based syncytial clusteringThe Journal of Machine Learning Research10.5555/3455716.345583821:1(4808-4861)Online publication date: 1-Jan-2020
  • (2020)Nonparametric clustering for image segmentationStatistical Analysis and Data Mining10.1002/sam.1144413:1(83-97)Online publication date: 1-Jan-2020
  • (2019)TiK‐meansStatistical Analysis and Data Mining10.1002/sam.1141612:3(223-233)Online publication date: 20-May-2019
  • (2018)An efficient k‐means‐type algorithm for clustering datasets with incomplete recordsStatistical Analysis and Data Mining10.1002/sam.1139211:6(296-311)Online publication date: 16-Nov-2018
  • (2016)Robust clustering method in the presence of scattered observationsNeural Computation10.1162/NECO_a_0083328:6(1141-1162)Online publication date: 1-Jun-2016
  • (2016)Evolving gaussian mixture models with splitting and merging mutation operatorsEvolutionary Computation10.1162/EVCO_a_0015224:2(293-317)Online publication date: 1-Jun-2016
  • (2016)Extending mixtures of factor models using the restricted multivariate skew-normal distributionJournal of Multivariate Analysis10.1016/j.jmva.2015.09.025143:C(398-413)Online publication date: 1-Jan-2016
  • Show More Cited By

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media