
MRB-648 Scorecards for evalml #145

Open

adestefani8 wants to merge 9 commits into main from MRB-648-Scorecards-for-evalml

Conversation

adestefani8 (Collaborator) commented May 1, 2026

What this PR adds

This PR adds a new report_scorecard rule that renders a PNG comparing one run against one baseline.

The scorecard has:

  • one row per (variable × metric)
  • one column block per region
  • one column per lead time inside each region block

Each cell encodes the row's metric as a relative difference between the two runs for that region and lead time: (model − baseline) / |baseline| × 100.

Markers:

  • blue → model better
  • red → baseline better
  • grey → |diff| below the neutral threshold (default 5%)
  • grey x → missing or non-finite value

Scores:

  • The default score set uses RMSE and R2 for all variables, plus ETS for U_10M, V_10M, and TOT_PREC.
  • Additional supported scores are available for explicit use: MAE, STDE, CORR, POD, and FAR.
  • Score direction is handled by score type: RMSE, MAE, STDE, and FAR are lower-is-better, while CORR, R2, ETS, and POD are higher-is-better.

Above the neutral threshold, dot area scales linearly with |diff|% and caps at size_cap_pct (default 30%).
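To make the encoding concrete, here is a minimal sketch of the per-cell logic; cell_style, NEUTRAL_PCT, and SIZE_CAP_PCT are illustrative names under the defaults above, not necessarily the code in report_scorecard:

import math

NEUTRAL_PCT = 5.0     # grey below this |diff|% (the neutral threshold)
SIZE_CAP_PCT = 30.0   # dot area stops growing at this |diff|% (size_cap_pct)

LOWER_IS_BETTER = {"RMSE", "MAE", "STDE", "FAR"}  # the rest are higher-is-better


def cell_style(model: float, baseline: float, metric: str):
    """Return (diff_pct, colour, area_fraction) for one scorecard cell."""
    if not (math.isfinite(model) and math.isfinite(baseline)) or baseline == 0.0:
        return None, "grey-x", 0.0  # missing or non-finite value -> grey x
    diff_pct = (model - baseline) / abs(baseline) * 100.0
    if abs(diff_pct) < NEUTRAL_PCT:
        return diff_pct, "grey", 0.0  # inside the neutral band
    # direction is keyed on the score type, so "ETS_gt_0p0" behaves like "ETS"
    base = metric.split("_")[0]
    model_better = (diff_pct < 0) if base in LOWER_IS_BETTER else (diff_pct > 0)
    # dot area grows linearly with |diff|% and saturates at the cap
    area_fraction = min(abs(diff_pct), SIZE_CAP_PCT) / SIZE_CAP_PCT
    return diff_pct, ("blue" if model_better else "red"), area_fraction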

[image: scorecard_ICON-CH2-EPS_demo]

Configuration

Configurable via params on the rule:

  • lead_times: "start/stop/step" in hours
  • regions: regions to include as column blocks
  • variables: "VAR:M1,M2,..." entries; omit :M1,M2,... to use all_metrics for that variable.
    Metric names can also expand by prefix: for example, requesting ETS includes all matching categorical scores, such as ETS_gt_0p0, ETS_gt_0p001, ETS_gt_0p005 (see the parsing sketch below).

Other defaults (season, init_hour, metric settings, plot styling) live in the script's cfg.
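To make the two string formats concrete, here is a minimal sketch of how they could be parsed and expanded; parse_lead_times and expand_metric_prefixes are illustrative names, not necessarily the helpers in report_scorecard, and the lead-time stop is assumed exclusive (Python range semantics):

def parse_lead_times(spec: str) -> list[int]:
    """Expand a "start/stop/step" string into lead times in hours."""
    start, stop, step = (int(p) for p in spec.split("/"))
    return list(range(start, stop, step))  # "6/33/6" -> [6, 12, 18, 24, 30]


def expand_metric_prefixes(requested: list[str], available: list[str]) -> list[str]:
    """Expand each requested score name to every available score it prefixes."""
    expanded = []
    for req in requested:
        hits = [m for m in available if m == req or m.startswith(req + "_")]
        expanded.extend(hits or [req])  # keep unknown names so they fail visibly later
    return expanded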

Plot layout

The plotting script makes a few automatic layout decisions (a small text-measurement sketch follows the list):

  • the longest region label is measured before rendering: col_width grows when necessary to prevent region header overlap, and the top margin/vertical separators adapt to the rendered header height
  • the longest metric label is measured before rendering: variable labels keep a fixed gap from metric labels, and horizontal group separators start from the measured metric-label area
  • the legend is centered on the scorecard area
  • the no-data legend entry only appears when missing values are present
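The measurement itself only needs a throwaway text artist; a minimal sketch of the idea (illustrative names and padding values, not the PR's exact code), assuming the Agg backend so a renderer is available offscreen:

import matplotlib
matplotlib.use("Agg")  # offscreen backend with a renderer for measuring text
import matplotlib.pyplot as plt


def label_width_inches(fig, text: str, fontsize: float = 9.0) -> float:
    """Render a temporary text artist and return its width in inches."""
    artist = fig.text(0, 0, text, fontsize=fontsize)
    width_px = artist.get_window_extent(renderer=fig.canvas.get_renderer()).width
    artist.remove()  # drop the probe text so it never shows in the plot
    return width_px / fig.dpi


fig = plt.figure()
longest_region = max(["all", "mittelland", "innerealpentaeler"], key=len)
col_width = max(0.6, label_width_inches(fig, longest_region) + 0.1)  # pad slightly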

TODOs

  • Expose main scorecard parameters in the evalml config

dnerini marked this pull request as ready for review May 6, 2026 11:23
dnerini requested review from dnerini and teobuz May 6, 2026 11:24
dnerini requested review from frazane and jonasbhend May 6, 2026 15:56
dnerini (Member) commented May 6, 2026

looking good :)

[image attached]

dnerini requested review from Louis-Frey May 6, 2026 16:01
jonasbhend (Contributor) left a comment

Very nice. I really like the scorecards. Great work.

For future PRs, could you please add a short description of the changes (a high-level overview) and, if necessary, also of the goals of the PR? That would be very helpful for the review.

As an additional suggestion, could we include the scorecard in the dashboard? I know we don't always want to produce it, but when it is available it would be nice to include it in a separate tab.

Comment thread workflow/rules/report.smk
Comment on lines +67 to +83
        regions=[
            "all",
            "mittelland",
            "voralpen",
            "alpennordhang",
            "innerealpentaeler",
            "alpensuedseite",
            "jura",
        ],
        variables=[
            "U_10M:RMSE,MAE,STDE,CORR,R2",
            "V_10M:RMSE,MAE,STDE,CORR,R2",
            "T_2M:RMSE,MAE,STDE,CORR,R2",
            "PMSL:RMSE,MAE,STDE,CORR,R2",
            "TD_2M:RMSE,MAE,STDE,CORR,R2",
            "TOT_PREC:RMSE,MAE,STDE,CORR,R2",
        ],
Contributor:

This is hard-coded in the rule. Would it make sense to expose this in the config, in particular given the expected rapid changes with new metrics? @cosunae in #144 has suggested a nice template for config 'sections', e.g.:

runs:
...

experiment:
    dashboard:
        stratification: ...
    scorecard:
        regions:
            - all
            - mittelland
        scores: 
            U_10M: ["RMSE", "R2"]
            TOT_PREC: ["RMSE", "ETS_gt_0p0"]

showcases:
...
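For what it's worth, consuming such a section from the rule could look roughly like this (a sketch against the proposed layout above, not existing evalml code):

    params:
        regions=config["experiment"]["scorecard"]["regions"],
        variables=[
            f"{var}:{','.join(scores)}"
            for var, scores in config["experiment"]["scorecard"]["scores"].items()
        ],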

Contributor:

Also, I would suggest showing only ["RMSE", "R2"] by default. The other scores don't add much, or are even a bit dubious (such as correlation, which cannot generally be converted to a skill score due to its support from -1 to 1). I know that Marco Arpagaus always wants to see STDE, but I think this is also based on a misunderstanding of what STDE is: STDE only differs from RMSE if we have 'systematic' domain-wide biases, which is usually not the case.

Contributor:

With #137 there are now additional scores available for categorical forecasts. ETS would be a candidate to include for TOT_PREC and wind in particular.

adestefani8 (author):

Hi Jonas, thanks for the suggestion. I included ETS in a recent commit; could you please have a look and let me know what you think?

Contributor:

Hi Alberto. I really like the way the new scorecard looks. Super cool.

I was wondering if the referencing of the scores shouldn't be made consistent with the way they are represented in the scores file (verif_aggregated.nc). As such, I would expect the variables section above to look something like this:

        variables=[
            "T_2M:RMSE,MAE",
            "TOT_PREC:RMSE,MAE,ETS_gt_0p0",
        ],

That way, we can specifically include/exclude scores (and thresholds) on a per-variable basis.

adestefani8 (author):

Hi, thanks for the kind words!

This is actually already supported: that syntax in the variables param in report.smk gives:

[image: demoSyntax]

That works thanks to this function in report_scorecard:

def _parse_var_metrics(spec: str):
    """Parse a 'VAR:M1,M2,...' (or 'VAR' alone) item into (var, [metrics])."""
    if ":" in spec:
        var, metrics = spec.split(":", 1)
        return var.strip(), [m.strip() for m in metrics.split(",") if m.strip()]
    return spec.strip(), None  # None → use all scores for this variable

It splits each entry first on ":" and then splits the metrics on ",". So you can write just ETS to include all available thresholds, or list specific ones like ETS_gt_0p0 as in your example.
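For illustration, a couple of calls against the function above (results as comments):

_parse_var_metrics("TOT_PREC:RMSE,ETS_gt_0p0")  # -> ("TOT_PREC", ["RMSE", "ETS_gt_0p0"])
_parse_var_metrics("T_2M")                      # -> ("T_2M", None)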

Comment thread workflow/Snakefile
        ),
        expand(
            rules.report_scorecard.output,
            run_id=CANDIDATES,
Contributor:

Currently we run this for all non-baseline runs against all baselines. It is configurable in the sense that you can copy-paste the config and remove runs / baselines that you don't want to be analysed. Is this enough?

Comment thread workflow/rules/report.smk
/ "results/{experiment}/scorecard_plots/{run_id}/scorecard_{baseline}.png",
),
params:
lead_times="6/33/6",
Contributor:

Here too, I suggest making this configurable.

    ],  # every entry must appear in metric_directions
    "metric_directions": {
        "lower_is_better": ["RMSE", "MAE", "STDE", "FAR"],
        "higher_is_better": ["CORR", "R2", "ETS", "POD"],
Contributor:

As mentioned previously, I would not include CORR as an eligible metric for the scorecard, especially as R2 is virtually the same.

Comment on lines +115 to +122
vars_metrics = {
    "U_10M": ["RMSE", "R2", "ETS"],
    "V_10M": ["RMSE", "R2", "ETS"],
    "T_2M": ["RMSE", "R2"],
    "PMSL": ["RMSE", "R2"],
    "TD_2M": ["RMSE", "R2"],
    "TOT_PREC": ["RMSE", "R2", "ETS"],
}
Contributor:

My earlier comment about also exposing the thresholds is probably better suited here, i.e. I would love to be able to specify which score AND which threshold to include on a per-variable basis.

adestefani8 (author):

Actually, this block is just a hardcoded fallback that only triggers when --variable is not passed at all (e.g., when running the script manually). Since the snakemake rule always provides explicit --variable arguments, this code path is never hit in the workflow.

Honestly, I'm not sure this fallback is that useful now... A cleaner alternative might be to drop it and instead show everything present in the data when no --variable is given (and the same for the regions).
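As a rough sketch of that alternative, assuming verif_aggregated.nc carries a "score" coordinate listing the available metrics and one data variable per forecast variable (the actual file layout may differ):

import xarray as xr


def discover_vars_metrics(path: str) -> dict[str, list[str]]:
    """Fallback: derive variable/score combinations from the data itself."""
    with xr.open_dataset(path) as ds:
        scores = [str(s) for s in ds["score"].values]
        return {str(v): scores for v in ds.data_vars}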

