CloudGlide is a Python-based simulation framework for analyzing and benchmarking different data-processing architectures and configurations. It supports:
- Multiple Architectures (e.g., classic data warehouse, autoscaling DW, QaaS, etc.)
- Scheduling Strategies (e.g., FCFS, Shortest Job, Priority)
- Autoscaling Approaches (e.g., reactive, queue-based, predictive)
- Cost Models (spot vs. on-demand, capacity-based pricing, data-scanned pricing)
- Benchmarking (CSV/JSON-based checks of single queries or entire test scenarios)
The framework is designed for research experiments, allowing you to define test scenarios via JSON files, generate performance metrics (latency, cost, percentiles), and compare simulation results against ground-truth or expected outcomes.
- Project Structure
- Installation
- Configuration and Input Files
- Running Simulations
- Benchmarking Mode
- Output Files
- Extending CloudGlide
```
├── cloudglide
│ ├── config.py
│ ├── cost_model.py
│ ├── datasets
│ │ ├── ...
│ │ ├── ...
│ ├── execution_model.py
│ ├── job.py
│ ├── output_simulation
│ │ ├── ...
│ │ ├── ...
│ ├── query_processing_model.py
│ ├── README.md
│ ├── scaling_model.py
│ ├── scheduling_model.py
│ ├── simulation_runner.py
│ ├── simulations
│ │ ├── ...
│ │ ├── ...
│ ├── use_cases.py
│ └── visual_model.py
└── main.py
```
- Clone the repository:

  ```bash
  git clone https://github.com/mikegeo98/cloudglide_olap.git
  cd cloudglide_olap
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
`cloudglide/config.py` contains global constants and parameters:
- `INSTANCE_TYPES`: Preset instance configurations (CPU, memory, bandwidth).
- `DATASET_FILES`: Maps dataset indices to CSV file paths.
- Cost constants such as `COST_PER_RPU_HOUR` and `COST_PER_SLOT_HOUR`.
- Default simulation parameters (e.g., `DEFAULT_MAX_DURATION`).
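To sanity-check these values interactively, you can import the module directly. A minimal sketch, assuming the repository root is on your `PYTHONPATH` and `cloudglide` is importable as a package; the attribute names follow the list above, while the exact values live in the repository:

```python
# Inspect the constants exposed by cloudglide/config.py.
# Attribute names follow the list above; exact values are defined in the repo.
from cloudglide import config

print(config.INSTANCE_TYPES)        # preset instance configurations
print(config.DATASET_FILES)         # dataset index -> CSV path mapping
print(config.COST_PER_RPU_HOUR)     # capacity-based pricing constant
print(config.DEFAULT_MAX_DURATION)  # default simulation cutoff
```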
Dataset CSV files (e.g., `tpch_all_runs.csv`) are stored in the `cloudglide/datasets` directory. Each CSV contains query-related fields such as:
- job_id
- start time
- CPU time
- data scanned
- and more...
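Before pointing a scenario at a dataset, it can help to eyeball its schema. A minimal sketch, assuming pandas is installed; the column names printed are only indicative, so check the CSV header for the exact names:

```python
# Peek at a dataset file's schema and first rows (pandas assumed available).
import pandas as pd

runs = pd.read_csv("cloudglide/datasets/tpch_all_runs.csv")
print(runs.columns.tolist())  # expect fields like job_id, start time, CPU time, data scanned
print(runs.head())
```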
JSON simulation specs (e.g., `tpch.json`) reside in the `cloudglide/simulations` directory. These files define parameter sets for the simulation runner. Below is an example:
```json
{
  "test_case_keyword": {
    "architecture_values": [0],
    "scheduling_values": [1],
    "nodes_values": [1],
    "vpu_values": [0],
    "scaling_values": [1],
    "cold_starts_values": [false],
    "hit_rate_values": [0.9],
    "instance_values": [0],
    "arrival_rate_values": [10.0],
    "network_bandwidth_values": [10000],
    "io_bandwidth_values": [650],
    "memory_bandwidth_values": [40000],
    "dataset_values": [999]
  }
}
```
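Each `*_values` key holds a list, and every parameter combination presumably produces its own run (and output CSV, as noted under Output Files), so multi-element lists fan out into multiple runs. A minimal sketch of that expansion, independent of CloudGlide's internals; the dictionary below is illustrative only:

```python
# Illustrative only: how list-valued scenario parameters presumably expand into
# one run per combination. The real expansion happens in simulation_runner.py.
import itertools

spec = {
    "architecture_values": [0],
    "scheduling_values": [1, 2],   # two schedulers ...
    "nodes_values": [1, 4],        # ... times two cluster sizes = four runs
}

keys = list(spec)
for i, combo in enumerate(itertools.product(*(spec[k] for k in keys)), start=1):
    print(f"run {i}:", dict(zip(keys, combo)))
```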
The primary entry point is `main.py`. Below is the standard usage, focusing on the core parameters:
```bash
python main.py <test_case_keyword> <json_file_path> [--benchmark] [--benchmark_file BENCHMARK_FILE] [--output_prefix PREFIX]
```

- `test_case_keyword`: The key that appears in your JSON scenario file.
- `json_file_path`: Path to your scenario JSON (e.g., `cloudglide/simulations/tpch.json`).
Optional arguments:
- `--benchmark`: Enable benchmarking mode (see the Benchmarking Mode section).
- `--benchmark_file`: Path to a JSON file containing expected metrics for each scenario. Defaults to `benchmark_data.json`.
- `--output_prefix`: Prefix for generated CSV output. Defaults to `cloudglide/output_simulation/simulation`.
For example:

```bash
python main.py tpch_all cloudglide/simulations/tpch.json
```

When `--benchmark` is enabled, `main.py` compares the simulation results against expected metrics (e.g., expected execution time, median, cost) defined in `benchmark_data.json` (or a file you provide via `--benchmark_file`).
- After completion, a JSON report (e.g., `tpch_experiment_run_benchmark_report.json`) is generated.
- Benchmark results are also printed to the terminal with color codes to highlight pass/fail within a specified tolerance.
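The exact comparison logic lives in `main.py`; the following is only a rough, hypothetical sketch of the pass/fail idea. The function name and the 10% tolerance are assumptions, not CloudGlide's actual defaults:

```python
# Hypothetical pass/fail check against an expected metric; the real tolerance
# handling is implemented in main.py and may differ.
def within_tolerance(actual: float, expected: float, rel_tol: float = 0.10) -> bool:
    """True if the simulated metric is within +/- rel_tol of the expected value."""
    return abs(actual - expected) <= rel_tol * expected


print(within_tolerance(actual=112.5, expected=120.0))  # True at a 10% tolerance
```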
Sample `benchmark_data.json`:
```json
{
  "scenario1": {
    "architecture": 0,
    "scheduling": 1,
    "nodes": 1,
    "vpu": 0,
    "scaling": 1,
    "cold_starts": false,
    "hit_rate": 0.9,
    "instance": 0,
    "arrival_rate": 10.0,
    "network_bandwidth": 10000,
    "io_bandwidth": 650,
    "memory_bandwidth": 40000,
    "dataset": 999,
    "expected_execution_time": 120.0,
    "expected_median": 90.0,
    "expected_95th": 150.0,
    "expected_cost": 10.0
  }
  // ... additional scenarios
}
```

After a run, CloudGlide writes simulation results in CSV format under the specified `--output_prefix` (or the default `cloudglide/output_simulation/simulation`).
- Each parameter combination in your JSON scenario creates a new CSV file named `<prefix>_1.csv`, `<prefix>_2.csv`, etc.
- Columns in the CSV include `query_duration`, `query_duration_with_queue`, `Queueing Delay`, `CPU`, `I/O`, etc.
- Logs are written to `simulation.log` in the project root (or wherever configured in `main.py`).
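For quick post-processing of a single run, something like the following works. This is a sketch that assumes pandas is installed and that the column names match those listed above:

```python
# Summarize latency columns from one run's output CSV.
# Assumes pandas is available and column names match the list above.
import pandas as pd

df = pd.read_csv("cloudglide/output_simulation/simulation_1.csv")
lat = df["query_duration_with_queue"]

print("mean latency:       ", lat.mean())
print("median latency:     ", lat.median())
print("95th percentile:    ", lat.quantile(0.95))
print("max queueing delay: ", df["Queueing Delay"].max())
```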
- Adding new architectures: Update `configure_execution_params()` in `simulation_runner.py` and `execution_model.py` to handle a new `architecture` index.
- Custom scheduling strategies: Implement in `scheduling_model.py` (an illustrative sketch follows this list).
- Custom scaling policies: Insert logic in `scaling_model.py` (see the `Autoscaler` class).
- Custom cost models: Modify `cost_model.py` or reference your own cost function in `execution_model.py`.
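The exact hook signatures depend on the code in the repository; the following is only a sketch of what a shortest-job-first ordering could look like, using made-up class and attribute names rather than CloudGlide's actual interfaces:

```python
# Purely illustrative shortest-job-first policy; the real interface lives in
# scheduling_model.py and the names below are assumptions.
from dataclasses import dataclass
from typing import List


@dataclass
class PendingJob:
    job_id: int
    estimated_cpu_time: float  # e.g. derived from the dataset's "CPU time" field


def shortest_job_first(queue: List[PendingJob]) -> List[PendingJob]:
    """Order the queue so the cheapest job runs next."""
    return sorted(queue, key=lambda j: j.estimated_cpu_time)


queue = [PendingJob(1, 12.0), PendingJob(2, 3.5), PendingJob(3, 7.1)]
print([j.job_id for j in shortest_job_first(queue)])  # -> [2, 3, 1]
```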
`use_cases.py` is a helper script that demonstrates how to run predefined simulation scenarios and analyze their results. Each scenario corresponds to a typical research or operational question about scheduling, scaling, caching, or cost modeling in data-processing systems. After running the simulation, `use_cases.py` automatically processes the output CSV files and generates relevant plots or statistics. This allows you to quickly reproduce experiments and gather insights without manually invoking `main.py` multiple times.
Below is a quick overview of each supported use case:
- `scheduling()`: Compares different scheduling strategies (e.g., FCFS vs. priority-based) under various node configurations.
- `scaling_options()`: Evaluates fixed vs. autoscaling node configurations, highlighting how different strategies handle changing workloads.
- `caching()`: Assesses the effect of caching or partial reuse on overall query runtime and resource utilization.
- `scaling_algorithms()`: Investigates multiple autoscaling policies (e.g., reactive or queue-based) for performance and cost trade-offs.
- `spot()`: Explores the use of spot instances, analyzing potential savings versus performance variability.
- `workload_patterns()`: Simulates different workload distributions (e.g., bursty vs. steady) across multiple architectures (DWaaS, EP, QaaS).
- `cold_starts()`: Examines how cold starts in on-demand or serverless environments affect query latencies.
- `tpch()`: Demonstrates a TPC-H run for various scale factors and cluster sizes, commonly used for benchmarking relational query processing. Results are plotted and compared against experimental data.
- `concurrency()`: Demonstrates how isolated or concurrent execution affects query latency on fixed hardware.
By running `python use_cases.py <example_name>`, you can reproduce these scenarios, generate CSV outputs under `cloudglide/output_simulation/`, and view any generated plots or metrics.
For any clarifications, contact: geom@in.tum.de
Happy Simulating!