CausalRivers

Scaling up benchmarking of causal discovery for real-world time-series


This is the official repository of CausalRivers, the largest real-world causal discovery benchmark for time series to date. Also check our website, where we maintain the current leaderboard.

[Maps of the East Germany and Bavaria river networks]

Leaderboard submission

❗We maintain an active leaderboard for CausalRivers.❗ You can submit to it via the submission form, where you report your raw performance on the different datasets. For the format, please check the example submission file and the preparation tutorial. Note that you do not have to submit results for all datasets (as shown in the example file). Further, if you want us to integrate the algorithm that produced your scores, you can submit it along with the predictions and we will include it in the CD zoo and credit it according to your wishes.

Install

For the core benchmarking package, simply run (requires conda, unzip, and wget):

./install.sh
conda activate causalrivers
python 0_generate_datasets.py

Alternatively, you can execute the following commands by hand:

conda env create -f causal_rivers_core.yml
conda activate causal_rivers_core
wget https://github.com/CausalRivers/benchmark/releases/download/First_release/product.zip
unzip product
rm product.zip
python 0_generate_datasets.py

If you want a working environment for the Raven cluster (ELLIS Summer School Jena), run:

conda create -n causalrivers
conda activate causalrivers
pip install -r cluster_env.txt

For DWD data access, please additionally install:

pip install polars-lts-cpu==1.32.2 # This has to be installed after everything else.

Functionality

This is the core benchmarking package; it holds only the core functionality and some tutorials on usage.

Usage

We use Hydra to organize preprocessing and method hyperparameters. Along with this, we provide functions to load data and score results. We keep a single baseline strategy (VAR) here that serves as a placeholder and can be replaced with your own method. To check how the scoring works, simply run:

python 3_benchmark.py

This will run a VAR strategy with the specified preprocessing on the "confounder 3" dataset and reproduce the scoring. For the remaining experimental results, we refer to the experiments repo.
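The scoring compares predicted causal edges against the ground-truth river graph. As a minimal sketch of what such an edge-level comparison involves (an illustrative F1 over directed edges, not the package's exact scoring code):

```python
def edge_f1(true_edges, pred_edges):
    """F1 score over directed edges, treating each edge as a binary prediction."""
    true_edges, pred_edges = set(true_edges), set(pred_edges)
    tp = len(true_edges & pred_edges)   # correctly predicted edges
    fp = len(pred_edges - true_edges)   # predicted edges that do not exist
    fn = len(true_edges - pred_edges)   # existing edges that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Ground truth: 0 -> 1 -> 2; the prediction gets one edge right, one wrong.
print(edge_f1([(0, 1), (1, 2)], [(0, 1), (0, 2)]))  # 0.5
```

The actual benchmark reports several such structural metrics per dataset; see the scoring utilities in the package for the exact variants used on the leaderboard.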

If you want to score your own method on a specific set of graph samples, simply replace the baseline method, configure it with Hydra, and run:

python 3_benchmark.py label_path=datasets/random_3/east.p data_path=product/rivers_ts_east_germany.csv method=var  data_preprocess.normalize=False  data_preprocess.resolution=6H method.var_absolute_values=False method.max_lag=5

Of course, you can also use any routine from the experiments repo, especially for grid searches and result aggregation. There, experiments were conducted on a computation cluster with the Slurm job submission system, also via Hydra configurations. However, the scripts can also be run on a single machine.

CausalRivers Benchmark Dataset Explanation

The dataset consists of three NetworkX graph structures, three metadata tables, and three time series in CSV format. To facilitate matching between these formats, each graph node shares a unique ID with its corresponding time-series column.

Additionally, the metadata table contains information about the individual nodes.

| Column name | Description |
| --- | --- |
| ID | Unique ID |
| R | River name |
| X | X coordinate of the measurement station (longitude) |
| Y | Y coordinate of the measurement station (latitude) |
| D | Distance to the end of the river (distance from the source is encoded as a negative number) |
| H | Elevation of the measurement station |
| QD | Quality marker of the distance |
| QH | Quality marker of the height |
| QX | Quality marker of the X coordinate |
| QY | Quality marker of the Y coordinate |
| QR | Quality marker of the river name |
| O | Origin of the node (data source) |
| original_id | ID of the station in the raw data before unification and reindexing (can be used to find the original station on the data providers' online services) |
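The shared ID is the join key between all three formats. A toy sketch of how graph edges and time-series columns line up (all file names and values here are made up; the real files ship with the release):

```python
import pandas as pd

# Ground truth: directed edges between station IDs. In the release these
# are NetworkX graphs stored as pickles; a plain edge list suffices here.
edges = [(42, 7), (7, 13)]  # water flows 42 -> 7 -> 13

# Time series: one column per station ID (a CSV in the release).
ts = pd.DataFrame(
    {"42": [1.2, 1.3], "7": [0.8, 0.9], "13": [2.1, 2.0]},
    index=pd.to_datetime(["2021-01-01 00:00", "2021-01-01 00:15"]),
)

# Select the series for every node that appears in the graph sample.
nodes = {str(n) for edge in edges for n in edge}
subset = ts[sorted(nodes)]
```

The metadata table can be joined on the same ID to recover coordinates, river names, and quality markers for each station.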

Furthermore, both nodes and edges in the ground-truth graphs hold additional information.

| Node attribute | Description |
| --- | --- |
| p | X, Y coordinates |
| c | Color for consistency, based on origin |
| origin | Origin of the node |
| H | As above |
| R | As above |
| D | As above |
| QD | As above |
| QH | As above |
| QX | As above |
| QY | As above |
| QR | As above |

| Edge attribute | Description |
| --- | --- |
| h_distance | Elevation change between the two nodes |
| geo_distance | Euclidean distance between the two nodes |
| quality_geo | Quality of the distance estimation (depends on the QX and QY of the nodes) |
| quality_h | Quality of the elevation estimation (depends on the QH of the nodes) |
| origin | Strategy used to create this edge (see below for further information) |

Quality Values

The graph construction, particularly the edge determination, involves multiple strategies. To ensure transparency and reliability, we provide quality markers for each piece of information. These quality markers are defined as follows:

| Node value | Description |
| --- | --- |
| -1 | Unknown, as the target value is missing |
| 0 | Original value |
| > 0 | Value that was estimated or looked up by hand (check the construction pipeline for details) |

| Edge value | Description |
| --- | --- |
| origin | The construction step under which the edge was added; e.g., origin 6 refers to edges that were added by hand as river splits. |
| quality_h | Sum of the height quality markers (QH) of the two connected nodes; e.g., 0 means neither height was estimated. |
| quality_km | Sum of the coordinate quality markers (QX, QY) of the two connected nodes; e.g., 0 means neither coordinate pair was estimated. |
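The summing rule above can be stated in a few lines (an illustration of the table, not the actual construction code):

```python
def edge_quality(q_a: int, q_b: int) -> int:
    """Sum of two nodes' quality markers; 0 means both values are original."""
    return q_a + q_b

# Both stations carry original height values (marker 0):
assert edge_quality(0, 0) == 0  # quality_h == 0: neither height estimated
# One station's height was estimated or looked up by hand (marker > 0):
assert edge_quality(0, 2) == 2  # nonzero: treat the edge's elevation change with care
```

In short: a quality value of 0 on an edge means both underlying node values are original measurements, and anything larger signals that at least one was estimated.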

@ELLIS Summer School Jena: here are some resources to get started with causal discovery.

Maintainers

Main: @GideonStein, Code support: @Timozen

Contributors

This project exists thanks to the generous provision of data by several German institutions.

All data sources fall under the Data licence Germany (dl-de).
