Machine Learning for Computer Security
Exercise Sheet 7
Summer term 2024
Exercise 1 Lloyd algorithm
We begin by examining Lloyd’s algorithm as an example of a partition-based clustering
approach. To this end, let us consider a set of points and two centroids, c1 and c2 :
[Figure: scatter plot of the data points together with the two initial centroids c1 and c2.]
1. Which points are assigned to each centroid? Assume that the Euclidean distance is
used to measure the distance between points.
2. What are the new centroids after the first update step?
Solution 1
1.
c1 → (1, 2), (2, 1)
c2 → (4, 3), (7, 3), (6, 2), (7, 1)
2.
c1 = (1.5, 1.5), c2 = (6, 2.25)
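These values follow directly from averaging the coordinates of the points assigned in step 1: c1 = ((1 + 2)/2, (2 + 1)/2) = (1.5, 1.5) and c2 = ((4 + 7 + 6 + 7)/4, (3 + 3 + 2 + 1)/4) = (6, 2.25).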
Exercise 2 K-means clustering
Building on the previous exercise, we now want to implement k-means clustering using
the Lloyd algorithm. On ISIS you can find a template to help you get started. Follow
the steps below (a minimal sketch of the resulting functions is given after the list):
1. First, write a function initialize that returns k 2-dimensional random centroids.
Sample the vectors from the set [xmin, xmax] × [ymin, ymax].
Hint: NumPy provides a function numpy.random.random that you can use.
2. Next, implement a function squared_euclidean_distances that returns an array
distances of squared Euclidean distances from a set of data points Z to a given
centroid c, where distances[i] is the squared distance of data point Zi to the centroid c.
3. To assign each Zi to its closest centroid, implement a function assignment_step.
The function returns a list clusters, in which clusters[i] denotes the cluster
that the data point Zi is assigned to.
Hint: Make use of squared_euclidean_distances.
4. To complete the algorithm, implement a function update_step that updates the
centroid positions by calculating the mean of the vectors of each cluster. The
function returns the centroids as a k × 2 matrix.
5. Finally, run the k-means algorithm for different values of k and N , where k is the
number of centroids and N is the number of Gaussians used in the generation of
the data set.
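The template on ISIS fixes the exact signatures, so the following is only a minimal NumPy sketch of how the four functions could fit together; the parameter names x_min, x_max, y_min, y_max, the convergence check, and the kmeans driver are assumptions rather than part of the template.

import numpy as np


def initialize(k, x_min, x_max, y_min, y_max):
    """Draw k random 2-dimensional centroids from [x_min, x_max] x [y_min, y_max]."""
    c = np.random.random((k, 2))                 # uniform samples in [0, 1)^2
    c[:, 0] = x_min + c[:, 0] * (x_max - x_min)  # rescale x coordinates
    c[:, 1] = y_min + c[:, 1] * (y_max - y_min)  # rescale y coordinates
    return c


def squared_euclidean_distances(Z, c):
    """Squared Euclidean distance of every row of Z (n x 2) to a single centroid c."""
    return np.sum((Z - c) ** 2, axis=1)


def assignment_step(Z, centroids):
    """Index of the closest centroid for every data point."""
    dists = np.stack([squared_euclidean_distances(Z, c) for c in centroids])
    return np.argmin(dists, axis=0)


def update_step(Z, clusters, k):
    """New centroids as the mean of the points in each cluster (k x 2 matrix)."""
    new_centroids = np.empty((k, 2))
    for j in range(k):
        members = Z[clusters == j]
        # An empty cluster would yield a NaN mean; re-drawing a random point is one workaround.
        new_centroids[j] = members.mean(axis=0) if len(members) else Z[np.random.randint(len(Z))]
    return new_centroids


def kmeans(Z, k, n_iter=100):
    """Plain Lloyd iterations: assign, update, repeat until the centroids stop moving."""
    centroids = initialize(k, Z[:, 0].min(), Z[:, 0].max(), Z[:, 1].min(), Z[:, 1].max())
    for _ in range(n_iter):
        clusters = assignment_step(Z, centroids)
        new_centroids = update_step(Z, clusters, k)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, clusters


if __name__ == "__main__":
    # Quick sanity check on the six points from Exercise 1 with k = 2.
    Z = np.array([[1, 2], [2, 1], [4, 3], [7, 3], [6, 2], [7, 1]], dtype=float)
    print(kmeans(Z, k=2))

Squared distances suffice for the assignment step because the closest centroid under the squared Euclidean distance is the same as under the plain Euclidean distance, which saves the square root.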
Exercise 3 Linkage clustering
An alternative to partition-based clustering is hierarchical clustering. Assume that
the following points are given:
[Figure: scatter plot of the points a, b, c, d, e, and f in the plane.]
1. Perform a single-linkage clustering using the Euclidean distance to measure the
distance between the clusters. Draw the corresponding dendrogram.
2. Assume you want to obtain three clusters. Which points would be grouped together?
3. If we use complete-linkage for clustering the points instead, how would this affect
the resulting dendrogram?
Solution 3
1. With single-linkage clustering, points are clustered based on the minimum distance
between points in different clusters:
(a), (b), (c), (d), (e), (f)
→ (a, b), (c), (d), (e), (f)
→ (a, b), (c), (d, e), (f)
→ (a, b), (c, d, e), (f)
→ (a, b, c, d, e), (f)
→ (a, b, c, d, e, f)
Thus, the dendrogram merges the points in exactly this order:
[Dendrogram over the leaves a, b, c, d, e, f.]
2. To obtain three clusters, we cut the dendrogram just below the level at which the
clusters (a, b) and (c, d, e) are merged. This results in the clusters (a, b), (c, d, e), and (f).
3. If we use complete linkage for clustering the points instead, the resulting dendrogram
is no longer unique, since c has the same distance to b as to e and the merge order
therefore depends on how this tie is broken.
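These hand computations can be cross-checked with SciPy's hierarchical clustering routines. The coordinates below are placeholders chosen so that single linkage reproduces the merge order above; they are not taken from the original figure, so substitute the real coordinates before drawing conclusions.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Placeholder coordinates for the points a-f (not the values from the exercise sheet).
labels = ["a", "b", "c", "d", "e", "f"]
points = np.array([[1.0, 2.0], [1.5, 2.5], [3.0, 1.0],
                   [4.0, 1.0], [4.8, 1.0], [8.0, 2.0]])

# Single linkage merges the two clusters with the smallest minimum inter-point distance.
Z_single = linkage(points, method="single", metric="euclidean")

# Question 2: cut the tree into three flat clusters.
print(dict(zip(labels, fcluster(Z_single, t=3, criterion="maxclust"))))

# Question 3: complete linkage uses the maximum inter-point distance instead.
Z_complete = linkage(points, method="complete", metric="euclidean")

# Compare the two dendrograms side by side.
fig, (ax1, ax2) = plt.subplots(1, 2)
dendrogram(Z_single, labels=labels, ax=ax1)
dendrogram(Z_complete, labels=labels, ax=ax2)
plt.show()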
Exercise 4 Malware clustering
In this task, we aim to apply the concepts from the previous exercises to categorize
malicious software based on its behavior. To accomplish this, use the dataset and template
provided on ISIS. Each line of the data file contains the Windows registry keys and mutex
names utilized by a malware program during its execution. Several of these programs
belong to the same malware family and can therefore be grouped together.
You can build upon your k-means implementation developed in the previous task.
Please select a suitable feature representation and justify your choice. You can verify
your solution by plotting the clusters and examining common features. The following
hints may guide you through your analysis (a rough end-to-end sketch follows the hints):
1. Think about how to construct a suitable feature space. For instance, you could
follow a bag-of-words approach.
2. Your implemented clustering method only works on 2-dimensional data. To project
your high-dimensional data into a 2-dimensional space, you can use the Principal
Component Analysis (PCA) algorithm implementation provided in the template.
Note, however, that PCA works best on mean-centered data.
3. The template includes some enhancements that you might want to experiment with
(e.g., token filtering/normalization, scaling).
4. In the case of outliers, consider initializing your k-means implementation with
random samples from the dataset as the initial centroids, instead of random values
drawn from the range between the minimum and maximum coordinates.
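Since neither the data file's exact format nor the template's PCA interface is reproduced here, the following is only a rough sketch of hints 1, 2, and 4: it assumes one malware sample per line with whitespace-separated tokens, uses a small SVD-based stand-in for the template's PCA, and the file name malware_behavior.txt is a placeholder.

import numpy as np


def load_tokens(path):
    """Read one sample per line; tokens are assumed to be separated by whitespace."""
    with open(path) as f:
        return [line.split() for line in f if line.strip()]


def bag_of_words(samples):
    """Map every sample to a binary vector over the global token vocabulary (hint 1)."""
    vocab = sorted({tok for sample in samples for tok in sample})
    index = {tok: j for j, tok in enumerate(vocab)}
    X = np.zeros((len(samples), len(vocab)))
    for i, sample in enumerate(samples):
        for tok in sample:
            X[i, index[tok]] = 1.0
    return X, vocab


def pca_2d(X):
    """Project mean-centered data onto its first two principal components (hint 2).
    This is only a stand-in for the PCA implementation provided in the template."""
    X_centered = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:2].T


if __name__ == "__main__":
    samples = load_tokens("malware_behavior.txt")   # placeholder file name
    X, vocab = bag_of_words(samples)
    Z = pca_2d(X)

    # Hint 4: pick k random data points as initial centroids, which is more robust
    # to outliers; the k-means functions from Exercise 2 can then be run on Z.
    k = 10                                          # number of families is unknown; try several values
    init = Z[np.random.choice(len(Z), size=k, replace=False)]

Using binary presence instead of raw token counts is just one possible bag-of-words variant; token filtering and scaling (hint 3) can be layered on top of this representation.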