Benchmarking utilities for schema discovery on property graphs, plus scripts for exporting graph statistics.
This repository does three main things:
- loads Neo4j dump files for multiple datasets
- applies benchmark perturbations such as property noise and label removal
- runs one or more schema discovery approaches on every benchmark case
Main scripts:
- run.sh: installs Python requirements and starts the benchmark
- benchmark.py: main benchmark runner
- evaluation.py: computes evaluation metrics from exported CSVs
- stats.py: exports graph statistics from a running Neo4j database
- cp_metrics.py: computes CP metrics
Main folders:
- datasets/: dataset folders with metadata CSVs and Neo4j dump files
- output/: benchmark logs and method outputs
- stats_*: precomputed statistics for several datasets
Before running the benchmark, download the dataset dump files from Zenodo:
After downloading them, place each dump inside the matching dataset folder under datasets/.
Examples:
datasets/star-wars/star-wars-neo4j-4.4.0.dump
datasets/fib25/fib25-neo4j-4.4.0.dump
datasets/icij/icij-neo4j-4.4.0.dump
datasets/iyp/iyp-neo4j-5.25.1.dump
The benchmark expects the dump to already be in the correct dataset directory before it starts.
Each dataset folder under datasets/ is expected to contain:
- a dump named like *neo4j-X.Y.Z.dump
- node_properties.csv
- edge_properties.csv
- node_labels.csv
- optionally edge_labels.csv
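The folder layout above can be checked before a long run starts. The following is a minimal sketch, not part of the repository: the function name check_dataset_folder and the glob pattern are assumptions chosen for illustration, and edge_labels.csv is not checked because it is optional.

```python
from pathlib import Path

# Files the README lists as required in every dataset folder.
REQUIRED_CSVS = ("node_properties.csv", "edge_properties.csv", "node_labels.csv")


def check_dataset_folder(folder):
    """Return a list of problems found in a dataset folder (empty if none).

    Hypothetical helper: the benchmark itself may validate folders differently.
    """
    folder = Path(folder)
    problems = []
    # The dump is expected to be named like *neo4j-X.Y.Z.dump.
    if not sorted(folder.glob("*neo4j-*.dump")):
        problems.append("no *neo4j-X.Y.Z.dump file found")
    for name in REQUIRED_CSVS:
        if not (folder / name).is_file():
            problems.append(f"missing {name}")
    return problems
```

Calling check_dataset_folder("datasets/fib25") returns an empty list when the dump and the three required CSVs are present.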
The benchmark detects the Neo4j version directly from the dump filename.
Example:
datasets/fib25/fib25-neo4j-4.4.0.dump
datasets/iyp/iyp-neo4j-5.25.1.dump
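Version detection from the filename can be illustrated with a regular expression. This is a sketch under the assumption that the version is always the X.Y.Z part directly before .dump; the function name neo4j_version_from_dump is hypothetical and the benchmark's actual parsing may differ.

```python
import re

# Matches the "neo4j-X.Y.Z.dump" suffix and captures X.Y.Z.
DUMP_VERSION_RE = re.compile(r"neo4j-(\d+\.\d+\.\d+)\.dump$")


def neo4j_version_from_dump(filename):
    """Extract the Neo4j version string from a dump filename or path."""
    match = DUMP_VERSION_RE.search(filename)
    if match is None:
        raise ValueError(f"cannot detect Neo4j version from {filename!r}")
    return match.group(1)
```

With the two example paths above, the function returns "4.4.0" and "5.25.1" respectively; a file that does not follow the naming scheme raises ValueError.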
The benchmark supports two Neo4j runtime modes.
Use neo4j_mode: "community" if you want this repo to manage Neo4j installs directly.
Behavior:
- uses a configured Community install if it already exists
- downloads Neo4j Community automatically if it is missing and neo4j_auto_download is true
- updates conf/neo4j.conf with the configured memory settings
- loads the dump with neo4j-admin
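To make the memory step concrete, the sketch below turns a neo4j_memory block into neo4j.conf lines. It assumes the standard Neo4j setting names (dbms.memory.* in 4.x, server.memory.* in 5.x); how the benchmark actually edits the file is not shown in this README, and memory_conf_lines is a hypothetical name.

```python
def memory_conf_lines(version, memory):
    """Build neo4j.conf memory lines for a given Neo4j version.

    Assumption: Neo4j 5.x uses the "server." prefix for these settings,
    while 4.x uses "dbms.". The keys of `memory` follow config.json.
    """
    major = int(version.split(".")[0])
    prefix = "server" if major >= 5 else "dbms"
    return [
        f"{prefix}.memory.heap.initial_size={memory['heap_initial']}",
        f"{prefix}.memory.heap.max_size={memory['heap_max']}",
        f"{prefix}.memory.pagecache.size={memory['pagecache']}",
    ]
```

For example, memory_conf_lines("4.4.0", {"heap_initial": "2G", "heap_max": "4G", "pagecache": "2G"}) yields three lines starting with dbms.memory.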
Use neo4j_mode: "desktop" if you want to reuse a DBMS managed by Neo4j Desktop.
Important:
- point neo4j_desktop_dirs to the actual DBMS home directory, not the Desktop app
- the benchmark then uses that DBMS's bin/neo4j, bin/neo4j-admin, and bin/cypher-shell
- dump loading is still automated; you do not need to import the dump manually through the Desktop UI each time
The benchmark reads config.json. A template is available in config_template.json.
Current example:
{
  "datasets_dir": "./datasets",
  "output_dir": "./output",
  "commands_file": "./benchmark_commands.json",
  "neo4j_password": "password",
  "neo4j_port": 7687,
  "noise_levels": [0, 10, 20, 30, 40],
  "label_percents": [0.0, 0.5, 1.0],
  "neo4j_mode": "community",
  "neo4j_dirs": {
    "4.4.0": "./neo4j-community-4.4.0",
    "5.1.0": "./neo4j-community-5.1.0",
    "5.25.1": "./neo4j-community-5.25.1"
  },
  "neo4j_desktop_dirs": {},
  "neo4j_auto_download": true,
  "neo4j_download_dir": "./neo4j_runtimes",
  "neo4j_memory": {
    "heap_initial": "2G",
    "heap_max": "4G",
    "pagecache": "2G"
  },
  "dataset_order": [
    "starwars",
    "pole",
    "mb6",
    "het",
    "fib",
    "icij",
    "cord",
    "twitch",
    "ldbc",
    "iyp"
  ],
  "run_external_commands": true,
  "query_batch_size": 10000
}
Main keys:
- datasets_dir: root directory for dataset folders
- output_dir: benchmark outputs and logs
- commands_file: methods to run
- neo4j_password: password used by cypher-shell
- neo4j_port: port used by Neo4j
- noise_levels: property-removal percentages
- label_percents: fraction of nodes that lose labels
- neo4j_mode: community or desktop
- neo4j_dirs: version-to-install-path mapping for Community mode
- neo4j_desktop_dirs: version-to-DBMS-home mapping for Desktop mode
- neo4j_auto_download: auto-download missing Community installs
- neo4j_download_dir: fallback directory for downloaded Community installs
- neo4j_memory: memory settings written into neo4j.conf
- dataset_order: benchmark execution order; aliases such as starwars, het, fib, and cord are normalized
- run_external_commands: whether schema discovery methods should actually run
- query_batch_size: batch size for the heavy Cypher write operations
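The noise_levels and label_percents keys together define the benchmark cases for each dataset. The sketch below enumerates them as a plain cross product; that the benchmark pairs every noise level with every label percentage is an assumption based on the output file naming, and benchmark_cases is a hypothetical helper.

```python
import itertools
import json


def load_config(path):
    """Read a config.json-style file into a dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def benchmark_cases(config):
    """Return (noise_level, label_percent) pairs, noise level varying slowest.

    Assumption: every noise level is combined with every label percentage.
    """
    return list(itertools.product(config["noise_levels"], config["label_percents"]))
```

With the example config above (five noise levels, three label percentages), benchmark_cases(load_config("./config.json")) would produce 15 cases per dataset.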
Schema discovery methods are defined in benchmark_commands.json.
Each command can contain:
- name: method label
- cmd: shell command to run
- cwd: working directory
- repo: optional GitHub repo to clone before running
- clone_dir: local checkout path
- branch: branch to clone or pull
- update_existing: whether to pull an existing checkout
- setup_cmd: optional dependency-install or build command, run once per checkout before dataset processing starts
Current methods:
- PG_HIVE_LSH
- PG_HIVE_MINHASH
Recommended entrypoint:
./run.sh
With evaluation:
./run.sh --eval
You can also run the Python script directly:
python3 benchmark.py --config ./config.json
Execution flow:
- loads the config
- reads benchmark_commands.json
- clones or updates all external method repositories
- runs each method's setup_cmd once
- iterates through the datasets in dataset_order
- detects the dump and required Neo4j version
- resolves the Neo4j runtime
- writes Neo4j memory settings into neo4j.conf
- stops Neo4j if needed
- loads the dump into the neo4j database
- starts Neo4j and waits until cypher-shell "RETURN 1" succeeds
- saves original labels
- removes properties and labels according to the configured benchmark case
- runs all configured methods
- optionally runs evaluation if the expected CSVs exist and --eval was passed
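The readiness wait in the flow above can be sketched as a polling loop around cypher-shell. This is an illustration only: the function names, the timeout and interval defaults, and the neo4j user name are assumptions, and the benchmark's own retry logic may differ. The cypher-shell options used (-a, -u, -p, and a query as positional argument) are standard.

```python
import subprocess
import time


def cypher_shell_ready(cypher_shell, password, port, user="neo4j"):
    """Run `RETURN 1` once; True if cypher-shell exits with status 0."""
    try:
        result = subprocess.run(
            [cypher_shell, "-a", f"bolt://localhost:{port}",
             "-u", user, "-p", password, "RETURN 1"],
            capture_output=True,
            text=True,
        )
    except OSError:
        # Binary missing or not executable: treat as "not ready".
        return False
    return result.returncode == 0


def wait_until_ready(probe, timeout_s=120.0, interval_s=2.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call `probe` until it returns True or `timeout_s` elapses."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A call could look like wait_until_ready(lambda: cypher_shell_ready("./neo4j-community-4.4.0/bin/cypher-shell", "password", 7687)), using the paths and credentials from the example config.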
Notes:
- each noise level starts from a fresh dump load
- heavy write queries are batched to reduce memory usage
- external repos are prepared once at the beginning, not inside every dataset loop
- if run_external_commands is false, the benchmark still loads dumps and applies benchmark transformations, but skips the schema discovery methods
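Batching of heavy writes can be pictured as splitting the affected IDs into chunks of query_batch_size and issuing one write per chunk. The sketch below shows only the chunking; the names batches and apply_in_batches are hypothetical, and the Cypher statements the benchmark actually sends are not reproduced here.

```python
def batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def apply_in_batches(items, batch_size, run_batch):
    """Call `run_batch(chunk)` for each chunk and return the number of calls.

    `run_batch` stands in for whatever executes one write query.
    """
    count = 0
    for chunk in batches(items, batch_size):
        run_batch(chunk)
        count += 1
    return count
```

Lowering the batch size trades more round trips for smaller transactions, which is why reducing query_batch_size can help when large datasets stall.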
Outputs are written under output/<dataset_name>/.
Common files:
- log_noise{noise}_*.txt: benchmark-side logs for the Cypher mutation steps
- output_{DATASET}_noise{noise}_labels{percent}_{METHOD}.txt: stdout/stderr from each external method
- evaluation CSVs and metric outputs when evaluation is enabled
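When collecting results, the method output names can be split back into their parts. The following sketch assumes the name follows the output_{DATASET}_noise{noise}_labels{percent}_{METHOD}.txt pattern literally and that the noise and percent values contain no underscore; parse_method_output_name is a hypothetical helper.

```python
import re

# Mirrors output_{DATASET}_noise{noise}_labels{percent}_{METHOD}.txt
OUTPUT_NAME_RE = re.compile(
    r"^output_(?P<dataset>.+)_noise(?P<noise>[^_]+)"
    r"_labels(?P<percent>[^_]+)_(?P<method>.+)\.txt$"
)


def parse_method_output_name(filename):
    """Return the dataset, noise, percent, and method parts as strings,
    or None if the name does not follow the pattern."""
    match = OUTPUT_NAME_RE.match(filename)
    return match.groupdict() if match else None
```

All values are returned as strings because the exact formatting of the noise and label values in real file names is not specified here.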
Evaluation runs only when:
- you pass --eval
- all four expected CSV files exist for a dataset / noise / label-percent / method combination
Expected files:
- original_nodes_...csv
- predicted_nodes_...csv
- original_edges_...csv
- predicted_edges_...csv
Expected columns:
original_nodes.csv
- _nodeId
- original_label
predicted_nodes.csv
- merged_cluster_id
- sortedLabels
- nodeIdsInCluster
original_edges.csv
- srcId
- dstId
- relationshipType
- srcType
- dstType
predicted_edges.csv
- merged_cluster_id
- relationshipTypes
- srcLabels
- dstLabels
- edgeIdsInCluster
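A method's CSVs can be checked against the expected columns before evaluation. This is a minimal sketch using the column names listed above; the kind keys and the missing_columns function are assumptions for illustration, and evaluation.py may perform its own checks.

```python
import csv

# Expected header columns per CSV kind, as listed in this README.
EXPECTED_COLUMNS = {
    "original_nodes": ["_nodeId", "original_label"],
    "predicted_nodes": ["merged_cluster_id", "sortedLabels", "nodeIdsInCluster"],
    "original_edges": ["srcId", "dstId", "relationshipType", "srcType", "dstType"],
    "predicted_edges": ["merged_cluster_id", "relationshipTypes", "srcLabels",
                        "dstLabels", "edgeIdsInCluster"],
}


def missing_columns(csv_path, kind):
    """Return the expected columns absent from the CSV header (empty if none)."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle), [])
    return [column for column in EXPECTED_COLUMNS[kind] if column not in header]
```

An empty return value means the header contains every expected column; extra columns are ignored.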
stats.py exports a structural profile from a running Neo4j database.
Example:
python3 stats.py \
--uri bolt://localhost:7687 \
--user neo4j \
--password password \
--db neo4j \
--out stats_fib
Important outputs include:
- summary_counts.csv
- summary_counts.json
- cp_operationalization.csv
- cp_operationalization.json
- node_label_counts.csv
- relationship_type_counts.csv
- node_type_counts.csv
- node_patterns.csv
- edge_type_counts.csv
- edge_patterns.csv
- profile.json
If Neo4j does not become ready:
- check neo4j-community-<version>/logs/neo4j.log
- make sure you are using Java 11 for Neo4j 4.4.x
- keep neo4j_memory realistic for your machine; values that are too large cause startup failure
If a method repo fails during setup:
- check whether the method needs extra system tools such as sbt
- check the method log under output/<dataset>/
If large datasets stall:
- lower query_batch_size
- adjust neo4j_memory
For ldbc-snb, you can find the benchmark here: https://github.com/ldbc/ldbc_snb_datagen_hadoop