


Robino-CK/-Py-Sparkling-Water


Sparkling Water

Short Description

Sparkling Water is a scalable system for detecting, merging, and clustering similar server processes based on interaction logs. Using Apache Spark, MinHash, LSH, and time-series hashing (SSH, BSeSH), it efficiently identifies behavior patterns in large server infrastructures for performance optimization, anomaly detection, and system analysis. A detailed description can be found in the report.


Features

  • Similarity detection between server processes (name + timing)
  • Time-series analysis using SSH and BSeSH
  • Merging of equivalent processes
  • Clustering using k-means++
  • Scalable, distributed log processing via Apache Spark
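To illustrate the similarity-detection idea behind MinHash and LSH, here is a minimal, single-machine sketch in plain Python. It is not the project's Spark implementation: the function names, the character 3-gram shingling, the 128 hash functions, the 32 bands, and the example process names are all assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of character k-grams of a (lower-cased) string."""
    text = text.lower()
    if len(text) < k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def _hash(token, seed):
    """Deterministic 64-bit hash of a token, varied by seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_hashes=128):
    """One minimum hash value per seeded hash function."""
    return [min(_hash(t, seed) for t in tokens) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity of two sets, for comparison."""
    return len(a & b) / len(a | b)


def lsh_candidate_pairs(signatures, bands=32):
    """Banded LSH: names whose signatures agree on a whole band become candidates."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for name, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(name)
    pairs = set()
    for names in buckets.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs


# Hypothetical process names, used only to exercise the sketch.
names = ["auth-service-1", "auth-service-2", "db-backup"]
signatures = {n: minhash_signature(shingles(n)) for n in names}
print(lsh_candidate_pairs(signatures))
```

Banding trades precision for recall: more bands with fewer rows each make it more likely that moderately similar names land in a shared bucket, at the cost of more candidate pairs to verify.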

Installation

Ensure you have Python 3.7+ and Apache Spark installed.

Install the required Python packages:

pip install -r requirements.txt

Run the Pipelines

The project has two main parts. Each has its own script:

1. Similarity Detection

Run:

python pipeline_part1.py

  • Input: res/output.txt
  • Outputs:
    • res/part1Observations.txt – similarity analysis
    • res/part1Output.txt – merged process candidates

2. Merging and Clustering

Run:

python pipeline_part2.py

  • Input: res/part1Output.txt
  • Output:
    • res/part2Observations.txt – final clustering results
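For readers unfamiliar with the k-means++ clustering named in the feature list, the following is a small, self-contained sketch of k-means++ seeding followed by standard Lloyd iterations. It is an illustration under assumptions, not the code in pipeline_part2.py: the function names, the fixed iteration count, and the 2-D example points (standing in for whatever feature vectors the pipeline derives per process) are hypothetical.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: pick each new center with probability proportional
    to its squared distance from the nearest center chosen so far."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(squared_distance(p, c) for c in centers) for p in points]
        if sum(d2) == 0:
            # Every point coincides with an existing center; fall back to uniform.
            centers.append(rng.choice(points))
        else:
            centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


def nearest(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))


def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm with k-means++ initialization; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = kmeans_pp_init(points, k, rng)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        # Recompute each center as the mean of its cluster; keep the old
        # center if a cluster ended up empty.
        centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    labels = [nearest(p, centers) for p in points]
    return centers, labels


# Hypothetical 2-D feature vectors forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(points, k=2)
print(labels)
```

The seeding step is what distinguishes k-means++ from plain k-means: spreading the initial centers apart tends to reduce the chance of converging to a poor local optimum.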

Authors

  • Liva van der Velden — Utrecht University
  • Robin Kollmann — Utrecht University
  • Simon Menke — Utrecht University
