
MRB-648 Scorecards for evalml #145

Open

adestefani8 wants to merge 9 commits into main from MRB-648-Scorecards-for-evalml

Conversation

adestefani8 (Collaborator) commented May 1, 2026

What this PR adds

This PR adds a new report_scorecard rule that renders a PNG comparing one run against one baseline.

The scorecard has:

  • one row per (variable × metric)
  • one column block per region
  • one column per lead time inside each region block

Each cell encodes the row's metric as a relative difference between the two runs for that region and lead time: (model − baseline) / |baseline| × 100.

Markers:

  • blue → model better
  • red → baseline better
  • grey → |diff| below the neutral threshold (default 5%)
  • grey x → missing or non-finite value

Scores:

  • The default score set uses RMSE and R2 for all variables, plus ETS for U_10M, V_10M, and TOT_PREC.
  • Additional supported scores are available for explicit use: MAE, STDE, CORR, POD, and FAR.
  • Score direction is handled by score type: RMSE, MAE, STDE, and FAR are lower-is-better, while CORR, R2, ETS, and POD are higher-is-better.

Above the neutral threshold, dot area scales linearly with |diff|% and caps at size_cap_pct (default 30%).
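To make the encoding concrete, here is a minimal sketch of the per-cell logic; cell_style, NEUTRAL_PCT, and SIZE_CAP_PCT are illustrative names under the defaults above, not necessarily the code in report_scorecard:

import math

NEUTRAL_PCT = 5.0     # grey below this |diff|% (the neutral threshold)
SIZE_CAP_PCT = 30.0   # dot area stops growing at this |diff|% (size_cap_pct)

LOWER_IS_BETTER = {"RMSE", "MAE", "STDE", "FAR"}  # the rest are higher-is-better


def cell_style(model: float, baseline: float, metric: str):
    """Return (diff_pct, colour, area_fraction) for one scorecard cell."""
    if not (math.isfinite(model) and math.isfinite(baseline)) or baseline == 0.0:
        return None, "grey-x", 0.0  # missing or non-finite value -> grey x
    diff_pct = (model - baseline) / abs(baseline) * 100.0
    if abs(diff_pct) < NEUTRAL_PCT:
        return diff_pct, "grey", 0.0  # inside the neutral band
    # direction is keyed on the score type, so "ETS_gt_0p0" behaves like "ETS"
    base = metric.split("_")[0]
    model_better = (diff_pct < 0) if base in LOWER_IS_BETTER else (diff_pct > 0)
    # dot area grows linearly with |diff|% and saturates at the cap
    area_fraction = min(abs(diff_pct), SIZE_CAP_PCT) / SIZE_CAP_PCT
    return diff_pct, ("blue" if model_better else "red"), area_fraction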

[image: scorecard_ICON-CH2-EPS_demo]

Configuration

Configurable via params on the rule:

  • lead_times: "start/stop/step" in hours
  • regions: regions to include as column blocks
  • variables: "VAR:M1,M2,..." entries; omit :M1,M2,... to use all_metrics for that variable.
    Metric names can also expand by prefix: for example, requesting ETS includes all matching categorical scores, such as ETS_gt_0p0, ETS_gt_0p001, ETS_gt_0p005 (see the parsing sketch below).

Other defaults (season, init_hour, metric settings, plot styling) live in the script's cfg.
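To make the two string formats concrete, here is a minimal sketch of how they could be parsed and expanded; parse_lead_times and expand_metric_prefixes are illustrative names, not necessarily the helpers in report_scorecard, and the lead-time stop is assumed exclusive (Python range semantics):

def parse_lead_times(spec: str) -> list[int]:
    """Expand a "start/stop/step" string into lead times in hours."""
    start, stop, step = (int(p) for p in spec.split("/"))
    return list(range(start, stop, step))  # "6/33/6" -> [6, 12, 18, 24, 30]


def expand_metric_prefixes(requested: list[str], available: list[str]) -> list[str]:
    """Expand each requested score name to every available score it prefixes."""
    expanded = []
    for req in requested:
        hits = [m for m in available if m == req or m.startswith(req + "_")]
        expanded.extend(hits or [req])  # keep unknown names so they fail visibly later
    return expanded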

Plot layout

The plotting script makes a few automatic layout decisions (a small text-measurement sketch follows the list):

  • the longest region label is measured before rendering: col_width grows when necessary to prevent region header overlap, and the top margin/vertical separators adapt to the rendered header height
  • the longest metric label is measured before rendering: variable labels keep a fixed gap from metric labels, and horizontal group separators start from the measured metric-label area
  • the legend is centered on the scorecard area
  • the no-data legend entry only appears when missing values are present
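The measurement itself only needs a throwaway text artist; a minimal sketch of the idea (illustrative names and padding values, not the PR's exact code), assuming the Agg backend so a renderer is available offscreen:

import matplotlib
matplotlib.use("Agg")  # offscreen backend with a renderer for measuring text
import matplotlib.pyplot as plt


def label_width_inches(fig, text: str, fontsize: float = 9.0) -> float:
    """Render a temporary text artist and return its width in inches."""
    artist = fig.text(0, 0, text, fontsize=fontsize)
    width_px = artist.get_window_extent(renderer=fig.canvas.get_renderer()).width
    artist.remove()  # drop the probe text so it never shows in the plot
    return width_px / fig.dpi


fig = plt.figure()
longest_region = max(["all", "mittelland", "innerealpentaeler"], key=len)
col_width = max(0.6, label_width_inches(fig, longest_region) + 0.1)  # pad slightly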

TODOs

  • Expose main scorecard parameters in the evalml config

dnerini marked this pull request as ready for review May 6, 2026 11:23
dnerini requested review from dnerini and teobuz May 6, 2026 11:24
dnerini requested review from frazane and jonasbhend May 6, 2026 15:56
dnerini (Member) commented May 6, 2026

looking good :)

[image attached]

dnerini requested review from Louis-Frey May 6, 2026 16:01
jonasbhend (Contributor) left a comment

Very nice. I really like the scorecards. Great work.

For future PRs, could you please add a short description of the changes (a high-level overview) and, if necessary, also of the goals of the PR? That would be very helpful for the review.

As an additional suggestion, could we include the scorecard in the dashboard? I know we don't always want to produce it, but when it is available it would be nice to include it in a separate tab.

Comment thread workflow/rules/report.smk
Comment on lines +67 to +83
        regions=[
            "all",
            "mittelland",
            "voralpen",
            "alpennordhang",
            "innerealpentaeler",
            "alpensuedseite",
            "jura",
        ],
        variables=[
            "U_10M:RMSE,MAE,STDE,CORR,R2",
            "V_10M:RMSE,MAE,STDE,CORR,R2",
            "T_2M:RMSE,MAE,STDE,CORR,R2",
            "PMSL:RMSE,MAE,STDE,CORR,R2",
            "TD_2M:RMSE,MAE,STDE,CORR,R2",
            "TOT_PREC:RMSE,MAE,STDE,CORR,R2",
        ],
Contributor:

This is hard-coded in the rule. Would it make sense to expose this in the config, in particular given the expected rapid changes with new metrics? @cosunae in #144 has suggested a nice template for config 'sections', e.g.:

runs:
...

experiment:
    dashboard:
        stratification: ...
    scorecard:
        regions:
            - all
            - mittelland
        scores: 
            U_10M: ["RMSE", "R2"]
            TOT_PREC: ["RMSE", "ETS_gt_0p0"]

showcases:
...
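For what it's worth, consuming such a section from the rule could look roughly like this (a sketch against the proposed layout above, not existing evalml code):

    params:
        regions=config["experiment"]["scorecard"]["regions"],
        variables=[
            f"{var}:{','.join(scores)}"
            for var, scores in config["experiment"]["scorecard"]["scores"].items()
        ],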

Contributor:

Also, I would suggest showing only ["RMSE", "R2"] by default. The other scores don't add much, or are even a bit dubious (such as correlation, which cannot generally be converted to a skill score due to its support from -1 to 1). I know that Marco Arpagaus always wants to see STDE, but I think this is also based on a misunderstanding of what STDE is: STDE only differs from RMSE if we have 'systematic' domain-wide biases, which is usually not the case.

Contributor:

With #137 there are now additional scores available for categorical forecasts. ETS would be a candidate to include for TOT_PREC and wind in particular.

adestefani8 (author):

Hi Jonas, thanks for the suggestion. I included ETS in a recent commit; could you please have a look and let me know what you think?

Contributor:

Hi Alberto. I really like the way the new scorecard looks. Super cool.

I was wondering if the referencing of the scores shouldn't be made consistent with the way they are represented in the scores file (verif_aggregated.nc). As such, I would expect the variables section above to look something like this:

        variables=[
            "T_2M:RMSE,MAE",
            "TOT_PREC:RMSE,MAE,ETS_gt_0p0",
        ],

That way, we can specifically include/exclude scores (and thresholds) on a per-variable basis.

adestefani8 (author):

Hi, thanks for the kind words!

This is actually already supported: that syntax in the variables param in report.smk gives:

[image: demoSyntax]

That works thanks to this function in report_scorecard:

def _parse_var_metrics(spec: str):
    """Parse a 'VAR:M1,M2,...' (or 'VAR' alone) item into (var, [metrics])."""
    if ":" in spec:
        var, metrics = spec.split(":", 1)
        return var.strip(), [m.strip() for m in metrics.split(",") if m.strip()]
    return spec.strip(), None  # None → use all scores for this variable

It splits each entry first on ":" and then splits the metrics on ",". So you can write just ETS to include all available thresholds, or list specific ones like ETS_gt_0p0 as in your example.
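For illustration, a couple of calls against the function above (results as comments):

_parse_var_metrics("TOT_PREC:RMSE,ETS_gt_0p0")  # -> ("TOT_PREC", ["RMSE", "ETS_gt_0p0"])
_parse_var_metrics("T_2M")                      # -> ("T_2M", None)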

Comment thread workflow/Snakefile
        ),
        expand(
            rules.report_scorecard.output,
            run_id=CANDIDATES,
Contributor:

Currently we run this for all non-baseline runs against all baselines. It is configurable in the sense that you can copy-paste the config and remove runs / baselines that you don't want to be analysed. Is this enough?

Comment thread workflow/rules/report.smk
/ "results/{experiment}/scorecard_plots/{run_id}/scorecard_{baseline}.png",
),
params:
lead_times="6/33/6",
Contributor:

Here too, I suggest making this configurable.

    ],  # every entry must appear in metric_directions
    "metric_directions": {
        "lower_is_better": ["RMSE", "MAE", "STDE", "FAR"],
        "higher_is_better": ["CORR", "R2", "ETS", "POD"],
Contributor:

As mentioned previously, I would not include CORR as an eligible metric for the scorecard, especially as R2 is virtually the same.

Comment on lines +115 to +122
vars_metrics = {
    "U_10M": ["RMSE", "R2", "ETS"],
    "V_10M": ["RMSE", "R2", "ETS"],
    "T_2M": ["RMSE", "R2"],
    "PMSL": ["RMSE", "R2"],
    "TD_2M": ["RMSE", "R2"],
    "TOT_PREC": ["RMSE", "R2", "ETS"],
}
Contributor:

My earlier comment about also exposing the thresholds is probably better suited here, i.e. I would love to be able to specify which score AND which threshold to include on a per-variable basis.

adestefani8 (author):

Actually, this block is just a hardcoded fallback that only triggers when --variable is not passed at all (e.g., when running the script manually). Since the snakemake rule always provides explicit --variable arguments, this code path is never hit in the workflow.

Honestly, I'm not sure this fallback is that useful now... A cleaner alternative might be to drop it and instead show everything present in the data when no --variable is given (and the same for the regions).
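As a rough sketch of that alternative, assuming verif_aggregated.nc carries a "score" coordinate listing the available metrics and one data variable per forecast variable (the actual file layout may differ):

import xarray as xr


def discover_vars_metrics(path: str) -> dict[str, list[str]]:
    """Fallback: derive variable/score combinations from the data itself."""
    with xr.open_dataset(path) as ds:
        scores = [str(s) for s in ds["score"].values]
        return {str(v): scores for v in ds.data_vars}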

