k-means clustering in Ruby. Uses NArray under the hood for fast calculations.
Jump to the examples directory to see this in action.
- Runs multiple clustering attempts and keeps the best result (a single run can get stuck in a non-optimal local minimum)
- Initializes centroids via the k-means++ algorithm, for faster convergence
- Calculates silhouette score for evaluation
- Option to scale data before clustering, so that output isn't biased by different feature scales
- Works with high-dimensional data
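To make the k-means++ initialization mentioned in the list above more concrete, here is a minimal plain-Ruby sketch of the seeding step. It is an illustration of the general technique, not the gem's NArray-based implementation; the names `squared_distance` and `kmpp_init` are hypothetical and do not come from this library.

```ruby
# Squared Euclidean distance between two points (arrays of numbers).
def squared_distance(a, b)
  a.zip(b).sum { |x, y| (x - y)**2 }
end

# Hypothetical helper: choose k initial centroids using k-means++ seeding.
# The first centroid is picked uniformly at random; each later one is picked
# with probability proportional to its squared distance from the nearest
# centroid chosen so far.
def kmpp_init(points, k, rng: Random.new)
  centroids = [points[rng.rand(points.size)]]
  while centroids.size < k
    weights = points.map { |p| centroids.map { |c| squared_distance(p, c) }.min }
    total = weights.sum
    if total.zero?
      # Every point coincides with a chosen centroid: fall back to a uniform pick.
      centroids << points[rng.rand(points.size)]
      next
    end
    # Weighted sampling: walk the cumulative weights until they pass the target.
    target = rng.rand * total
    cumulative = 0.0
    index = weights.each_index.find { |i| (cumulative += weights[i]) > target }
    # Guard against floating-point round-off by falling back to the farthest point.
    index ||= weights.each_index.max_by { |i| weights[i] }
    centroids << points[index]
  end
  centroids
end

points = [[0.0, 0.0], [0.1, 0.0], [10.0, 10.0], [10.1, 10.0]]
seeds = kmpp_init(points, 2, rng: Random.new(42))
```

Because points far from the existing centroids are favored, the two seeds here will most likely land in different groups, which is why this kind of seeding tends to need fewer iterations than a purely random start.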
gem install kmeans-clusterer
Simple example:
require 'kmeans-clusterer'
data = [[40.71,-74.01],[34.05,-118.24],[39.29,-76.61],
[45.52,-122.68],[38.9,-77.04],[36.11,-115.17]]
labels = ['New York', 'Los Angeles', 'Baltimore',
'Portland', 'Washington DC', 'Las Vegas']
k = 2 # find 2 clusters in data
kmeans = KMeansClusterer.run k, data, labels: labels, runs: 5
kmeans.clusters.each do |cluster|
puts cluster.id.to_s + '. ' +
cluster.points.map(&:label).join(", ") + "\t" +
cluster.centroid.to_s
end
# Use existing clusters for prediction with new data:
predicted = kmeans.predict [[41.85,-87.65]] # Chicago
puts "\nClosest cluster to Chicago: #{predicted[0]}"
# Clustering quality score. Ranges from -1.0 to 1.0 (1.0 is best)
puts "\nSilhouette score: #{kmeans.silhouette.round(2)}"
Output of simple example:
0. New York, Baltimore, Washington DC [39.63, -75.89]
1. Los Angeles, Portland, Las Vegas [38.56, -118.7]
Closest cluster to Chicago: 0
Silhouette score: 0.91
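For readers unfamiliar with the silhouette score, the following plain-Ruby sketch shows the standard definition: for each point, compare its mean distance to its own cluster (`a`) with its mean distance to the nearest other cluster (`b`), then average `(b - a) / max(a, b)` over all points. The helper names are hypothetical, the sketch assumes Euclidean distance and at least two clusters, and it is not the gem's own code, so its result need not match the figure printed above.

```ruby
# Euclidean distance between two points (arrays of numbers).
def euclidean(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# Mean distance from one point to every point in a group.
def mean_distance(point, group)
  group.sum { |other| euclidean(point, other) } / group.size
end

# Hypothetical helper: mean silhouette score over all points.
# clusters is an array of clusters, each an array of points.
def silhouette(clusters)
  scores = clusters.each_with_index.flat_map do |cluster, i|
    cluster.map do |point|
      next 0.0 if cluster.size == 1 # common convention for singleton clusters
      # a: mean distance to the other members of the point's own cluster.
      rest = cluster.reject { |p| p.equal?(point) }
      a = mean_distance(point, rest)
      # b: smallest mean distance to any other cluster.
      b = clusters.each_with_index
                  .reject { |_, j| j == i }
                  .map { |other, _| mean_distance(point, other) }
                  .min
      denom = [a, b].max
      denom.zero? ? 0.0 : (b - a) / denom
    end
  end
  scores.sum / scores.size
end

# The two clusters found in the simple example.
east = [[40.71, -74.01], [39.29, -76.61], [38.9, -77.04]]
west = [[34.05, -118.24], [45.52, -122.68], [36.11, -115.17]]
score = silhouette([east, west])
puts "Silhouette (sketch): #{score.round(2)}"
```

A score close to 1.0 means points sit much nearer to their own cluster than to any other; a score near 0 or below suggests overlapping or misassigned clusters.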
The following options can be passed in to KMeansClusterer.run:
option | default | description |
---|---|---|
:labels | nil | optional array of Ruby objects to collate with data array |
:runs | 10 | number of times to run kmeans |
:log | false | print stats after each run |
:init | :kmpp | algorithm for picking initial cluster centroids. Accepts :kmpp, :random, or an array of k centroids |
:scale_data | false | scales each feature before clustering using the formula (data - mean) / std |
:float_precision | :double | float precision to use. :double or :single |
:max_iter | 300 | max iterations per run |
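The `(data - mean) / std` formula listed for `:scale_data` can be illustrated with a short plain-Ruby sketch. This is not the gem's implementation: the `standardize` name is hypothetical, and the use of the population standard deviation (dividing by n) is an assumption, since the table does not say which variant is used.

```ruby
# Hypothetical helper: standardize each column (feature) of a data set to
# mean 0 and standard deviation 1, using (x - mean) / std per column.
# Assumes the population standard deviation (divide by n).
def standardize(data)
  stats = data.transpose.map do |col|
    mean = col.sum.to_f / col.size
    variance = col.sum { |x| (x - mean)**2 } / col.size
    [mean, Math.sqrt(variance)]
  end
  data.map do |row|
    row.each_with_index.map do |x, j|
      mean, std = stats[j]
      # A constant column has std 0; map it to 0.0 rather than dividing by zero.
      std.zero? ? 0.0 : (x - mean) / std
    end
  end
end

# Two features on very different scales.
raw = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
scaled = standardize(raw)
```

After scaling, both columns hold the same values, so neither feature dominates the distance calculation simply because its raw numbers are larger.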