Skip to content

Conversation

@divine7022
Copy link
Collaborator

Description

Add comprehensive documentation for the input_design design matrix used to coordinate parameter draws and input file selection

closes : #3677

Motivation and Context

Review Time Estimate

  • Immediately
  • Within one week
  • When possible

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My change requires a change to the documentation.
  • My name is in the list of CITATION.cff
  • I agree that PEcAn Project may distribute my contribution under any or all of
    • the same license as the existing code,
    • and/or the BSD 3-clause license.
  • I have updated the CHANGELOG.md.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

Multi-site ensembles and sensitivity analyses that sample over input files use
an `input_design` data.frame to keep parameter draws and input files aligned
across runs. The design is created up front (typically via
`generate_joint_ensemble_design()` inside `runModule.run.write.configs()`) and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd drop the "inside" bit. I think the preferred usage would be to generate the design first then pass it in.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dropped and clarified

`write.sa.configs()`. It is not saved automatically to `samples.Rdata`, so keep
your copy if you need to reuse it.

- **Parameter column:** `param` gives the 1-based index of the posterior draw to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unclear what a "1-based index" is. You could just call this an "index" or "index (i.e. row number)"

across runs. The design is created up front (typically via
`generate_joint_ensemble_design()` inside `runModule.run.write.configs()`) and
passed through to `run.write.configs()`, `write.ensemble.configs()`, and
`write.sa.configs()`. It is not saved automatically to `samples.Rdata`, so keep
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK write.sa.configs doesn't currently use the input design, though we've discussed a future direction of pulling out the SA design code into the design function and then merging write.ensemble.configs & write.sa.configs, but that might not happen for months.


- **Parameter column:** `param` gives the 1-based index of the posterior draw to
use for each run. `run.write.configs()` reads it when building ensemble
samples so leave it in the design even though the downstream config writers do
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "even though the downstream config writers do not reference it directly" sure sounds like a bug that should be fixed, not a "feature" that should be documented.

samples so leave it in the design even though the downstream config writers do
not reference it directly.
- **Input columns:** any name that matches a tag under `run/inputs` (for example
`met`, `soil`, `veg`, `poolinitcond`). Values are 1-based indices into that
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drop "1-based" throughout PR

- **Input columns:** any name that matches a tag under `run/inputs` (for example
`met`, `soil`, `veg`, `poolinitcond`). Values are 1-based indices into that
input’s `path` list. Leaving a column out keeps that input fixed across runs.
- **Row count and order:** include at least one row per run you plan to write.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drop the "at least" -- you can't specify multiple rows per run.

- **Row count and order:** include at least one row per run you plan to write.
For ensembles this means `ensemble.size` rows; for sensitivity analysis it
should cover the median run (row 1) plus every trait/quantile combination in
the order they are written. Extra rows are ignored; too few will leave later
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The text "Extra rows are ignored; too few will leave later
runs without input paths and lead to confusing readme/config files, so size
the table to cover every planned run" also sounds like a bug. I'd prefer the code thrown an error before starting any runs if it detects a mismatch between the design size and ensemble size

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added validation in runModule.run.write.configs and run.write.configs

|------:|----:|-----:|
| 1 | 1 | 1 |
| 2 | 2 | 1 |
| 3 | 2 | 2 |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd make the met column be 1,2,1,2 as that's an easier design to understand

Comment on lines 11 to 18
#' @param input_design data.frame design matrix linking parameter draws and any
#' sampled inputs across runs. Include a `param` column whose values select
#' rows from `trait.samples`/`ensemble.samples` plus optional columns named for
#' `settings$run$inputs` tags (e.g. `met`, `soil`) with 1-based indices into
#' each input's `path` list. Provide at least one row per planned run (median +
#' all SA members and/or `ensemble.size`). Usually generated by
#' `runModule.run.write.configs()` via `generate_joint_ensemble_design()`, but
#' custom designs may be supplied.
Copy link
Member

@infotroph infotroph Dec 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since run.write.configs is typically called internally while runModule.rn.write.configs is user-facing, I suggest moving the full details there and having the "as documented in..." here.

Suggested change
#' @param input_design data.frame design matrix linking parameter draws and any
#' sampled inputs across runs. Include a `param` column whose values select
#' rows from `trait.samples`/`ensemble.samples` plus optional columns named for
#' `settings$run$inputs` tags (e.g. `met`, `soil`) with 1-based indices into
#' each input's `path` list. Provide at least one row per planned run (median +
#' all SA members and/or `ensemble.size`). Usually generated by
#' `runModule.run.write.configs()` via `generate_joint_ensemble_design()`, but
#' custom designs may be supplied.
#' @param input_design data frame containing the design matrix describing parameter and input indices, as
#' documented in \code{runModule.run.write.configs()}.

Comment on lines 5 to 7
#' @param input_design design matrix describing parameter and input indices, as
#' documented in \code{run.write.configs()}. Defaults to the object returned by
#' \code{generate_joint_ensemble_design()} when NULL.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#' @param input_design design matrix describing parameter and input indices, as
#' documented in \code{run.write.configs()}. Defaults to the object returned by
#' \code{generate_joint_ensemble_design()} when NULL.
#' @param input_design data.frame design matrix linking parameter draws and any
#' sampled inputs across runs. Include a `param` column whose values select
#' rows from `trait.samples`/`ensemble.samples` plus optional columns named for
#' `settings$run$inputs` tags (e.g. `met`, `soil`) with 1-based indices into
#' each input's `path` list. Provide at least one row per planned run (median +
#' all SA members and/or `ensemble.size`). Usually generated by
#' `generate_joint_ensemble_design()` but custom designs may be supplied.
#' If NULL, `generate_joint_ensemble_design()` will be called internally.

CHANGELOG.md Outdated
Comment on lines 60 to 63
- Added documentation for the `input_design` design matrix that coordinates
parameter draws and input file selections across ensemble runs. The matrix
requires a `param` column with parameter sample indices and one row per run
(#3677).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is covered by the block above

Suggested change
- Added documentation for the `input_design` design matrix that coordinates
parameter draws and input file selections across ensemble runs. The matrix
requires a `param` column with parameter sample indices and one row per run
(#3677).

Comment on lines +644 to +645
and passed to `runModule.run.write.configs()`. It is not saved automatically to
`samples.Rdata`, so keep your copy if you need to reuse it.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not for this PR, but flagging for improvement later: Needing this sentence is a good indicator the current behavior is unintuitive and the design should be saved somewhere by default (though not in samples.Rdata, as established in other threads).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I completely agree. Relying on the user to manually persist the design matrix is fragile.
This reminds me a potential gap fill could be done here -- #3708
Since we are moving toward a runs_manifest.csv to track run metadata (like pfts and traits, ), the natural home for the input design seems to be that same manifest. We could extend the manifest schema to include columns for the inputs (e.g. param_index, met_index, ...); that way runs_manifest.csv becomes the complete 'recipe' .
This would eliminate the need for a separate saved object entirely for that.


# Validate design matrix size for MultiSettings
if (!is.null(settings$ensemble$size) && nrow(input_design) != settings$ensemble$size) {
stop("input_design has ", nrow(input_design), " rows but settings$ensemble$size is ",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@infotroph do you think these need to be switched to PEcAn's logger?

@infotroph infotroph merged commit de5bf23 into PecanProject:release/v1.10.0 Dec 15, 2025
19 of 26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants