Sparkling Water is a scalable system for detecting, merging, and clustering similar server processes based on interaction logs. Using Apache Spark, MinHash, locality-sensitive hashing (LSH), and time-series hashing (SSH, BSeSH), it efficiently identifies behavior patterns in large server infrastructures for performance optimization, anomaly detection, and system analysis. A detailed description can be found in the report.
- Similarity detection between server processes (name + timing)
- Time-series analysis using SSH and BSeSH
- Merging of equivalent processes
- Clustering using k-means++
- Scalable, distributed log processing via Apache Spark
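To illustrate the similarity-detection idea behind the features above, here is a minimal, single-machine sketch of MinHash with LSH banding applied to process names. It is not the project's Spark implementation: the function names (`shingles`, `minhash_signature`, `estimated_jaccard`, `lsh_buckets`), the 3-character shingles, the 128 hash functions, and the 32 bands are all assumptions chosen for illustration.

```python
import hashlib
from typing import List, Set, Tuple


def shingles(name: str, k: int = 3) -> Set[str]:
    """Return the set of character k-grams of a process name."""
    if len(name) <= k:
        return {name}
    return {name[i:i + k] for i in range(len(name) - k + 1)}


def _seeded_hash(seed: int, token: str) -> int:
    # A salted 64-bit BLAKE2b digest stands in for one of many
    # independent hash functions (one per seed).
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(tokens: Set[str], num_hashes: int = 128) -> List[int]:
    """For each hash function, keep the minimum hash over all tokens."""
    return [min(_seeded_hash(seed, t) for t in tokens) for seed in range(num_hashes)]


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """The fraction of matching signature positions estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def lsh_buckets(signature: List[int], bands: int = 32) -> List[Tuple[int, Tuple[int, ...]]]:
    """Split the signature into bands; two items become candidates
    for comparison if they share at least one (band index, band values) key."""
    rows = len(signature) // bands
    return [(b, tuple(signature[b * rows:(b + 1) * rows])) for b in range(bands)]


if __name__ == "__main__":
    # Hypothetical process names, used only for demonstration.
    a = minhash_signature(shingles("nginx_worker_1"))
    b = minhash_signature(shingles("nginx_worker_2"))
    print("estimated similarity:", estimated_jaccard(a, b))
    print("shared LSH buckets:", len(set(lsh_buckets(a)) & set(lsh_buckets(b))))
```

Banding trades precision for speed: only pairs that collide in at least one band need a full comparison, which is what makes the approach practical on large logs.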
Ensure you have Python 3.7+ and Apache Spark installed.
Install the required Python packages:

    pip install -r requirements.txt

The project has two main parts, each with its own script.
Run part 1:

    python pipeline_part1.py

- Input: res/output.txt
- Outputs:
  - res/part1Observations.txt – similarity analysis
  - res/part1Output.txt – merged process candidates
Run part 2:

    python pipeline_part2.py

- Input: res/part1Output.txt
- Output: res/part2Observations.txt – final clustering results
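The clustering step uses k-means++, whose distinguishing feature is how it picks the initial centres. The sketch below shows only that seeding step in plain Python, under the assumption that each process is represented as a numeric feature vector. The names `squared_distance` and `kmeans_pp_init` are hypothetical and are not taken from `pipeline_part2.py`.

```python
import random
from typing import List, Sequence

Point = Sequence[float]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points: List[Point], k: int, rng: random.Random) -> List[Point]:
    """Pick k initial centres: the first uniformly at random, each further
    one with probability proportional to its squared distance from the
    nearest centre chosen so far."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")

    centers = [rng.choice(points)]
    # nearest_d2[i] is the squared distance from points[i] to its closest centre.
    nearest_d2 = [squared_distance(p, centers[0]) for p in points]

    while len(centers) < k:
        if sum(nearest_d2) == 0:
            # Every point coincides with a centre; fall back to a uniform pick.
            nxt = rng.choice(points)
        else:
            nxt = rng.choices(points, weights=nearest_d2, k=1)[0]
        centers.append(nxt)
        nearest_d2 = [
            min(old, squared_distance(p, nxt)) for old, p in zip(nearest_d2, points)
        ]
    return centers


if __name__ == "__main__":
    # Made-up 2-D feature vectors, for demonstration only.
    data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    print(kmeans_pp_init(data, k=2, rng=random.Random(0)))
```

Weighting by squared distance tends to spread the initial centres across distinct groups, which usually lets the subsequent k-means iterations converge faster than with uniform random seeding.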
- Liva van der Velden — Utrecht University
- Robin Kollmann — Utrecht University
- Simon Menke — Utrecht University