MRB-648 Scorecards for evalml#145
Conversation
jonasbhend
left a comment
Very nice. I really like the scorecards. Great work.
For future PRs, could you please add a short description of the changes (high-level overview) and - if necessary - also of the goals of the PR? That would be very helpful for the review.
As an additional suggestion, could we include the scorecard in the dashboard? (I know we don't always want to produce it, but when it is available it would be nice to include it in a separate tab.)
regions=[
    "all",
    "mittelland",
    "voralpen",
    "alpennordhang",
    "innerealpentaeler",
    "alpensuedseite",
    "jura",
],
variables=[
    "U_10M:RMSE,MAE,STDE,CORR,R2",
    "V_10M:RMSE,MAE,STDE,CORR,R2",
    "T_2M:RMSE,MAE,STDE,CORR,R2",
    "PMSL:RMSE,MAE,STDE,CORR,R2",
    "TD_2M:RMSE,MAE,STDE,CORR,R2",
    "TOT_PREC:RMSE,MAE,STDE,CORR,R2",
],
This is hard-coded in the rule. Would it make sense to expose this in the config, in particular given the expected rapid changes with new metrics? @cosunae in #144 has suggested a nice template for config 'sections', e.g.
runs:
  ...
experiment:
  dashboard:
    stratification: ...
    scorecard:
      regions:
        - all
        - mittelland
      scores:
        U_10M: ["RMSE", "R2"]
        TOT_PREC: ["RMSE", "ETS_gt_0p0"]
showcases:
  ...
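If the scorecard config were exposed like this, a small helper could translate the scores mapping into the "VAR:M1,M2,..." strings the rule currently expects. A minimal sketch, assuming the YAML has already been parsed into a dict; scores_to_variable_specs is a hypothetical helper name, and the config layout is just the suggestion above, not an existing schema:

```python
def scores_to_variable_specs(scores):
    """Turn a {variable: [metrics]} mapping into 'VAR:M1,M2,...' strings."""
    return [f"{var}:{','.join(metrics)}" for var, metrics in scores.items()]

# Stand-in for the parsed 'scorecard' config section suggested above.
scorecard_cfg = {
    "regions": ["all", "mittelland"],
    "scores": {
        "U_10M": ["RMSE", "R2"],
        "TOT_PREC": ["RMSE", "ETS_gt_0p0"],
    },
}

specs = scores_to_variable_specs(scorecard_cfg["scores"])
# → ["U_10M:RMSE,R2", "TOT_PREC:RMSE,ETS_gt_0p0"]
```

This keeps the rule's existing interface untouched while moving the choices into the config.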
Also, I would suggest showing only ["RMSE", "R2"] by default. The other scores don't add much, or are even a bit dubious (such as correlation, which cannot generally be converted to a skill score due to its support from -1 to 1). I know that Marco Arpagaus always wants to see STDE, but I think this is also based on a misunderstanding of what STDE is. STDE only differs from RMSE if we have 'systematic' domain-wide biases, which is usually not the case.
With #137 there are now additional scores available for categorical forecasts. ETS would be a candidate to include for TOT_PREC and wind in particular.
Hi Jonas, thanks for the suggestion. I included ETS in a recent commit, could you please have a look and let me know what you think?
Hi Alberto. I really like the way the new scorecard looks. Super cool.
I was wondering whether the referencing of the scores shouldn't be made consistent with the way they are represented in the scores file (verif_aggregated.nc). As such, I would expect the variables section above to look something like this:
variables=[
"T_2M:RMSE,MAE",
"TOT_PREC:RMSE,MAE,ETS_gt_0p0",
],
That way, we can specifically include / exclude scores (and thresholds) on a per-variable basis.
Hi, thanks for the kind words!
This is actually already supported: that syntax works as-is in the variables param in report.smk.
That works thanks to this function in report_scorecard:
def _parse_var_metrics(spec: str):
"""Parse a 'VAR:M1,M2,...' (or 'VAR' alone) item into (var, [metrics])."""
if ":" in spec:
var, metrics = spec.split(":", 1)
return var.strip(), [m.strip() for m in metrics.split(",") if m.strip()]
return spec.strip(), None # None → use all scores for this variable
It splits each entry first on : and then splits the metrics on ,. So you can write just ETS to include all available thresholds, or list specific ones like ETS_gt_0p0 as in your example.
),
expand(
    rules.report_scorecard.output,
    run_id=CANDIDATES,
Currently we run this for all non-baseline runs against all baselines. It is configurable in the sense that you can copy-paste the config and remove runs / baselines that you don't want to be analysed. Is this enough?
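For reference, the pairing described here is just a cross product; a sketch with hypothetical run names standing in for the lists the workflow derives from the config:

```python
from itertools import product

# Hypothetical stand-ins for the configured run lists.
CANDIDATES = ["run_a", "run_b"]
BASELINES = ["baseline_x", "baseline_y"]

# One scorecard per (candidate, baseline) pair, mirroring the expand() above.
pairs = list(product(CANDIDATES, BASELINES))
# → [('run_a', 'baseline_x'), ('run_a', 'baseline_y'),
#    ('run_b', 'baseline_x'), ('run_b', 'baseline_y')]
```

Trimming the config lists therefore directly trims the set of scorecards produced.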
/ "results/{experiment}/scorecard_plots/{run_id}/scorecard_{baseline}.png",
),
params:
    lead_times="6/33/6",
Here too, I suggest making this configurable.
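A sketch of how a "start/stop/step" string could map to concrete lead times, assuming the stop is exclusive as in Python's range (I haven't checked the script's actual convention, so treat the boundary handling as an assumption):

```python
def parse_lead_times(spec: str):
    """Expand a 'start/stop/step' spec (hours) into a list of lead times.

    Assumes an exclusive stop, like Python's range().
    """
    start, stop, step = (int(part) for part in spec.split("/"))
    return list(range(start, stop, step))

parse_lead_times("6/33/6")
# → [6, 12, 18, 24, 30]
```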
],  # every entry must appear in metric_directions
"metric_directions": {
    "lower_is_better": ["RMSE", "MAE", "STDE", "FAR"],
    "higher_is_better": ["CORR", "R2", "ETS", "POD"],
As mentioned previously, I would not include CORR as an eligible metric to add to the scorecard. Especially so, as R2 is virtually the same.
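To make the direction lookup concrete, here is one way it might be used to orient the relative difference so that a negative value always means "candidate improves on baseline". A sketch only: oriented_diff_pct is a hypothetical helper, and the formula assumed is the (model − baseline) / |baseline| × 100 encoding described in the PR:

```python
METRIC_DIRECTIONS = {
    "lower_is_better": ["RMSE", "MAE", "STDE", "FAR"],
    "higher_is_better": ["CORR", "R2", "ETS", "POD"],
}

def oriented_diff_pct(metric: str, model: float, baseline: float) -> float:
    """Relative difference in %, sign-flipped so negative always means
    the candidate beats the baseline (hypothetical helper)."""
    diff = (model - baseline) / abs(baseline) * 100.0
    # Prefix match so e.g. ETS_gt_0p0 inherits the direction of ETS.
    higher = any(metric.startswith(m) for m in METRIC_DIRECTIONS["higher_is_better"])
    return -diff if higher else diff

oriented_diff_pct("RMSE", 0.75, 1.0)        # → -25.0 (lower RMSE: improvement)
oriented_diff_pct("ETS_gt_0p0", 0.75, 0.5)  # → -50.0 (higher ETS: improvement)
```

With this orientation, a single colour scale can serve both metric families.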
vars_metrics = {
    "U_10M": ["RMSE", "R2", "ETS"],
    "V_10M": ["RMSE", "R2", "ETS"],
    "T_2M": ["RMSE", "R2"],
    "PMSL": ["RMSE", "R2"],
    "TD_2M": ["RMSE", "R2"],
    "TOT_PREC": ["RMSE", "R2", "ETS"],
}
My earlier comment about also exposing the thresholds is probably better suited here. I.e. I would love to be able to specify what score AND what threshold to include on a per-variable basis.
Actually this block is just a hardcoded fallback that only triggers when --variable is not passed at all (e.g., when running the script manually). If the Snakemake rule always provides explicit --variable arguments, this code path is never hit in the workflow.
Honestly, I'm not sure this fallback is that useful now... A cleaner alternative might be to drop it and instead show everything present in the data when no --variable is given (and the same for the regions).
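Dropping the fallback in favour of "show everything in the data" could look something like this; a minimal sketch, with a plain set standing in for the dataset's variables and resolve_variables as a hypothetical helper name:

```python
def resolve_variables(cli_specs, data_vars):
    """If --variable specs were given, keep them; otherwise fall back to
    every variable present in the data (bare 'VAR' form → all scores)."""
    if cli_specs:
        return cli_specs
    return sorted(data_vars)

# Explicit specs from the Snakemake rule win...
resolve_variables(["T_2M:RMSE,R2"], {"T_2M", "TOT_PREC"})
# → ['T_2M:RMSE,R2']
# ...and with no --variable at all, show everything in the file.
resolve_variables([], {"T_2M", "TOT_PREC"})
# → ['TOT_PREC', 'T_2M'] (plain string sort)
```

The same pattern would apply to the regions argument.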
What this PR adds

This PR adds a new report_scorecard rule that renders a PNG comparing one run against one baseline.

The scorecard has one row per (variable × metric) combination, with regions and lead times as columns. Each cell encodes the row's metric as a relative difference between the two runs for that region and lead time: (model − baseline) / |baseline| × 100.

Markers:

- |diff| below the neutral threshold (default 5%) → shown as neutral
- x → missing or non-finite value

Scores:

- RMSE and R2 for all variables, plus ETS for U_10M, V_10M, and TOT_PREC.
- Also available: MAE, STDE, CORR, POD, and FAR.
- RMSE, MAE, STDE, and FAR are lower-is-better, while CORR, R2, ETS, and POD are higher-is-better.

Above the neutral threshold, dot area scales linearly with |diff| % and caps at size_cap_pct (default 30%).

Configuration
Configurable via params on the rule:

- lead_times: "start/stop/step" in hours
- regions: regions to include as column blocks
- variables: "VAR:M1,M2,..." entries; omit :M1,M2,... to use all_metrics for that variable.

Metric names can also expand by prefix. For example, requesting ETS includes all matching categorical scores, such as ETS_gt_0p0, ETS_gt_0p001, ETS_gt_0p005.

Other defaults (season, init_hour, metric settings, plot styling) live in the script's cfg.

Plot layout
The plotting script makes a few automatic layout decisions:
- col_width grows when necessary to prevent region header overlap, and the top margin and vertical separators adapt to the rendered header height

TODOs