This is the official repository of CausalRivers, the largest real-world causal discovery benchmark for time series to date. Also check our website, where we maintain the current leaderboard.
❗We maintain an active leaderboard for CausalRivers.❗ You can submit to this leaderboard via the submission form, reporting your raw performance on the different datasets. For the format, please check the example submission file and the preparation tutorial. Note that you do not have to submit results for all datasets (as shown in the example file). Further, if you want the algorithm that produced your scores to be integrated, you can submit it along with the predictions and we will include it in the CD zoo, referencing it according to your wishes.
For the core benchmarking package, simply run (requires conda, unzip, and wget):

```shell
./install.sh
conda activate causalrivers
python 0_generate_datasets.py
```

Alternatively, you can execute the following commands by hand:
```shell
conda env create -f causal_rivers_core.yml
conda activate causal_rivers_core
wget https://github.com/CausalRivers/benchmark/releases/download/First_release/product.zip
unzip product.zip
rm product.zip
python 0_generate_datasets.py
```

If you want a working environment for the Raven cluster (ELLIS Summer School Jena), run:
```shell
conda create -n causalrivers
conda activate causalrivers
pip install -r cluster_env.txt
```

For DWD data access, please additionally install:

```shell
pip install polars-lts-cpu==1.32.2  # This somehow has to be installed after everything else.
```

This is the core benchmarking package, which only holds the core functionality and some tutorials on usage:
- How to build your graph subset: Custom graph sampling
- How to use the benchmark most efficiently: Usage
- How to sub-select specific temporal windows with certain weather conditions: Temporal selections
- Some general display of dataset properties that might be interesting for users: Data distribution
We use Hydra to organize preprocessing and method hyperparameters. Along with this, we provide functions to load data and score results. We keep a single baseline strategy (VAR) here, which serves as a placeholder and can be replaced with your own method. To check how the scoring works, simply run:
```shell
python 3_benchmark.py
```

This will run a VAR strategy with the specified preprocessing on the "confounder 3" dataset and reproduce the scoring. For the remaining experimental results, we refer to the experiments repo.
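The core idea behind a VAR-style baseline can be sketched as follows: fit lagged linear regressions and read edge scores off the absolute coefficients. This is a minimal illustration on synthetic data, not the packaged baseline; all names and the threshold value are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)
T, max_lag = 500, 2
# synthetic 3-node system: node 0 drives node 1, node 2 is independent
x = rng.normal(size=(T, 3))
for t in range(1, T):
    x[t, 1] += 0.8 * x[t - 1, 0]

def var_scores(x, max_lag):
    """Score each edge i -> j by its largest absolute lagged coefficient."""
    T, d = x.shape
    # stack lagged predictors: columns [x_{t-1}, ..., x_{t-max_lag}]
    X = np.hstack([x[max_lag - l : T - l] for l in range(1, max_lag + 1)])
    Y = x[max_lag:]
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)  # shape (d * max_lag, d)
    coef = coef.reshape(max_lag, d, d)            # (lag, cause, effect)
    return np.abs(coef).max(axis=0)               # (cause, effect)

scores = var_scores(x, max_lag)
adj = scores > 0.3  # threshold the score matrix into a binary adjacency
```

Scoring then reduces to comparing `adj` against the ground-truth graph's adjacency matrix with the usual metrics (e.g., F1).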
If you want to score your own method on a specific set of graph samples, you can simply replace the baseline method, configure it with Hydra, and run:
```shell
python 3_benchmark.py label_path=datasets/random_3/east.p \
    data_path=product/rivers_ts_east_germany.csv \
    method=var \
    data_preprocess.normalize=False \
    data_preprocess.resolution=6H \
    method.var_absolute_values=False \
    method.max_lag=5
```

Of course, you can also use any routine from the experiments repo, especially concerning grid searches and result aggregation. These experiments were conducted on a computation cluster with the SLURM job submission system, also via Hydra configurations. However, the scripts can also be used on a single machine.
The dataset consists of three NetworkX graph structures, three metadata tables, and three time series in CSV file format.
To facilitate matching between these different formats, each graph node shares a unique ID with its corresponding time series.
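The ID-based matching can be illustrated with a toy stand-in for the real files (the station IDs and values below are invented):

```python
import networkx as nx

# ground-truth graph whose node IDs also key the time series
g = nx.DiGraph()
g.add_edge(101, 102)  # water flows from station 101 to station 102
g.add_edge(102, 103)

# time series keyed by the same unique station IDs
series = {
    101: [1.2, 1.3, 1.1],
    102: [0.9, 1.0, 1.2],
    103: [0.5, 0.6, 0.7],
}

# every graph node has a matching time series, so lookups are direct
assert set(g.nodes) <= set(series)
first_node_ts = series[101]
```

In the real dataset, the graphs are stored as pickled NetworkX objects and the time series as CSV columns, but the lookup logic is the same.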
Additionally, the metadata table contains information about the individual nodes.
| Column Name | Description |
|---|---|
| ID | Unique ID |
| R | River name |
| X | X coordinate of measurement station (longitude) |
| Y | Y coordinate of measurement station (latitude) |
| D | Distance to the end of the river (or distance from the source, encoded as negative numbers) |
| H | Elevation of measurement station |
| QD | Quality marker of the distance |
| QH | Quality marker of the elevation |
| QX | Quality marker of the X coordinate |
| QY | Quality marker of the Y coordinate |
| QR | Quality marker of the river name |
| O | Origin of the node (data source) |
| original_id | ID of the station in the raw data before unification and reindexing (can be used to find the original station on the data providers' online services) |
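As a small usage sketch, the quality markers make it easy to restrict an analysis to original (non-estimated) values. The rows below are invented and only mirror the column layout described above:

```python
# hypothetical metadata rows following the schema above
stations = [
    {"ID": 101, "R": "Saale", "X": 11.96, "Y": 50.88, "D": 120.4, "H": 210.0,
     "QD": 0, "QH": 0, "QX": 0, "QY": 0, "QR": 0, "O": "thuringia", "original_id": "H123"},
    {"ID": 102, "R": "Saale", "X": 11.90, "Y": 51.00, "D": 95.1, "H": 180.0,
     "QD": 1, "QH": 2, "QX": 0, "QY": 0, "QR": 0, "O": "thuringia", "original_id": "H456"},
]

# keep only stations whose elevation is an original measurement (QH == 0)
original_h = [s["ID"] for s in stations if s["QH"] == 0]
```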
Furthermore, both the nodes and the edges of the ground-truth graphs hold additional information.
| Node Attribute | Description |
|---|---|
| p | (X, Y) coordinates |
| c | Color, consistent per origin |
| origin | Origin of the node |
| H | As above |
| R | As above |
| D | As above |
| QD | As above |
| QH | As above |
| QX | As above |
| QY | As above |
| QR | As above |
| Edge Attribute | Description |
|---|---|
| h_distance | Elevation change between the two nodes |
| geo_distance | Euclidean distance between the two nodes |
| quality_geo | Quality of the distance estimation (depends on QX and QY of the nodes) |
| quality_h | Quality of the elevation estimation (depends on QH of the nodes) |
| origin | Strategy used to create this edge (see below for further information) |
The graph construction, particularly the edge determination, involves multiple strategies. To ensure transparency and reliability, we provide quality markers for each piece of information. These quality markers are defined as follows:
| Node Value | Description |
|---|---|
| -1 | Unknown, as the target value is missing |
| 0 | Original value |
| > 0 | Value that was estimated or looked up by hand (check the construction pipeline for more details) |
| Edge Value | Description |
|---|---|
| origin | The construction step in which the edge was added; e.g., origin 6 marks edges that were added by hand as river splits. |
| quality_h | Sum of the elevation quality markers (QH) of the two connected nodes; e.g., 0 means neither elevation was estimated. |
| quality_km | Sum of the coordinate quality markers (QX, QY) of the two connected nodes; e.g., 0 means neither coordinate pair was estimated. |
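The edge-level markers are thus simple sums of the node-level markers, which can be sketched as follows (station IDs and marker values are invented; the -1 "missing" case would need separate handling):

```python
# node-level QH markers for three hypothetical stations
# (0 = original value, > 0 = estimated, -1 = missing)
qh = {101: 0, 102: 0, 103: 2}

def edge_quality_h(u, v):
    """Edge-level marker as the sum of the two endpoints' node markers."""
    return qh[u] + qh[v]

q_a = edge_quality_h(101, 102)  # both elevations are original values
q_b = edge_quality_h(102, 103)  # one endpoint's elevation was estimated
```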
@ELLIS Summer School Jena: Here are some resources to get started with causal discovery.
- Basic Python Tutorial
- Text Book
- Causal Discovery in Real World settings
- Granger Causality
- Introduction to Causal Discovery for time series
Main: @GideonStein, Code support: @Timozen
This project exists thanks to the generous provision of data by the following German institutions:
- Thüringer Landesamt für Umwelt, Bergbau und Naturschutz
- Landesbetrieb für Hochwasserschutz und Wasserwirtschaft Sachsen-Anhalt
- Sächsisches Landesamt für Umwelt, Landwirtschaft und Geologie
- Landesamt für Umwelt, Naturschutz und Geologie Mecklenburg-Vorpommern
- Senatsverwaltung für Mobilität, Verkehr, Klimaschutz und Umwelt
- Landesamt für Umwelt Brandenburg
- Generaldirektion Wasserstraßen und Schifffahrt
- Bayerisches Landesamt für Umwelt
All data sources fall under the Data Licence Germany (dl-de).