A module-grounded framework for Earth System Model analysis. Scientists register analysis tools with typed metadata; any LLM reads the auto-generated tool catalog and composes YAML workflows. The human reviews, the engine executes.
This repository accompanies the manuscript:
Can We Trust LLMs for Complex Earth System Model Analysis? Silent Failure and Evidence from Module-Grounded Benchmarking Tian Zhou, Yun Qian, L. Ruby Leung Pacific Northwest National Laboratory Submitted to Geoscientific Model Development (GMD), 2026
The paper introduces ESFlow and benchmarks it against unconstrained LLM code generation across six contemporary LLMs and seven E3SM land-surface-hydrology analysis tasks, with a focus on silent failures — plausible, well-formatted output that numerically disagrees with hand-crafted references. The Zenodo v2 data release, including the sample data used by the reference workflows, is available at https://zenodo.org/records/20584449.
Correspondence: Tian Zhou (tian.zhou@pnnl.gov).
@article{zhou_2026_esflow,
author = {Zhou, Tian and Qian, Yun and Leung, L. Ruby},
title = {Can We Trust {LLMs} for Complex {Earth} System Model Analysis?
Silent Failure and Evidence from Module-Grounded Benchmarking},
journal = {Geoscientific Model Development},
year = {2026},
note = {Submitted}
}The citation will be updated with volume, pages, and DOI once the paper is accepted.
Scientist writes tool → @esmflow_tool decorator → tool_catalog.yaml (auto)
↓
User describes task → Any LLM reads catalog → workflow.yaml
↓
run_workflow.py → figures, metrics, CSVs
- Tools are Python functions decorated with
@esmflow_tool(ToolSpec(...)). The decorator registers typed inputs and outputs, validates parameters, and rejects unknown parameters. generate_catalog.pyauto-discovers all tools and writestool_catalog.yaml— the sole interface between LLMs and tools.- Any LLM (ChatGPT, Claude, Gemini, Llama, etc.) reads the catalog and generates a YAML workflow from a natural-language request.
run_workflow.pyexecutes the YAML step-by-step, passing outputs between tools via${step_id.outputs.key}references.
LLMs connect building blocks. Tools handle internals. Minimize decisions the LLM must make.
- 24 tools, 92 total input parameters
- Strict validation: unknown parameters are rejected immediately
- No modal behavior: each tool does exactly one thing, no mode/format switches
- Match by column name: no
match_by,x_column,y_columnparameters — tools match gauge_id columns automatically - Standardized file schemas: common CSV/NetCDF contracts are documented in the catalog and schema reference
- Simple observation format: one CSV per gauge (
date,discharge_m3s)
# Clone and install
git clone https://github.com/pnnl/esflow.git
cd esflow
pip install -r requirements.txt
# Validate a workflow (no data needed)
python run_workflow.py reference_workflows/task01_reference.yaml --dry-run
# Download the Zenodo v2 sample data and unpack it into data/ before execution
# https://zenodo.org/records/20584449
python run_workflow.py reference_workflows/task01_reference.yaml
# Reuse existing intermediate files
python run_workflow.py reference_workflows/task01_reference.yaml --reuse- Combine
docs/system_prompt.txt+tools/tool_catalog.yamlas the system message - Describe your analysis task in natural language as the user message
- Save the generated YAML and validate with
--dry-run - Run it
See docs/guide.md for the full protocol, task description examples, and instructions for adding your own tools.
| # | Category | Tool | Parameters | Description |
|---|---|---|---|---|
| 1 | fetchers | fetch_ilamb_data |
2 | Download observation NetCDF files from the ILAMB server |
| 2 | loaders | load_obs_metadata |
1 | Load and validate gauge metadata CSV |
| 3 | matchers | match_to_grid |
4 | Match gauges to E3SM model grid |
| 4 | extractors | extract_basin_mean |
3 | Extract area-weighted basin means from gridded fields |
| 5 | extractors | extract_e3sm_timeseries |
7 | Extract model time series at gauge locations |
| 6 | extractors | extract_obs_timeseries |
4 | Extract observations from per-gauge CSVs |
| 7 | extractors | extract_gridded_field |
6 | Extract 2D fields from E3SM or observation NetCDF files |
| 8 | analyzers | compute_climatology |
1 | Compute monthly climatology from time series |
| 9 | analyzers | compute_spatial_bias |
2 | Compute gridded spatial bias and summary statistics |
| 10 | analyzers | compute_zonal_stats |
3 | Compute area-weighted global or latitude-band means |
| 11 | analyzers | compute_summary_stats |
4 | Compute mean/std/min/max per time-series column |
| 12 | analyzers | compute_basin_budget |
8 | Build per-basin water budget summary tables |
| 13 | analyzers | compute_metrics |
2 | Compute NSE, KGE, PBIAS, RMSE, and correlation |
| 14 | analyzers | compute_fdc_metrics |
2 | Compute flow duration curve distributional metrics |
| 15 | plotters | plot_water_balance_basins |
8 | Plot global residual maps with basin water-balance bars |
| 16 | plotters | plot_scatter |
2 | Plot mean-value scatter comparisons |
| 17 | plotters | plot_gridded_map |
5 | Plot 2D gridded fields on geographic maps |
| 18 | plotters | plot_basin_radar |
1 | Plot basin water-cycle diagnostic radar charts |
| 19 | plotters | plot_timeseries |
2 | Plot simulated vs observed time series panels |
| 20 | plotters | plot_map |
3 | Plot validation metrics at gauge locations |
| 21 | plotters | plot_fdc |
6 | Plot flow duration curve comparisons |
| 22 | plotters | plot_basin_timeseries |
6 | Plot basin maps with discharge time series |
| 23 | plotters | plot_bias_comparison |
9 | Plot observation, simulation, and bias map comparisons |
| 24 | plotters | plot_basin_budget_comparison |
1 | Plot basin water-budget component comparisons |
Tools communicate through typed files. The table below lists common CSV contracts; tool-specific NetCDF, CSV, and figure outputs are documented in tools/tool_catalog.yaml.
| Schema | Index | Columns | Producers | Consumers |
|---|---|---|---|---|
| gauge_metadata | row | gauge_id, lat, lon, area_km2, river_name |
loaders | matchers, extractors, plotters |
| matched_gauges | row | gauge_id, lat, lon, model_lat, model_lon, lat_idx, lon_idx |
matchers | extractors |
| timeseries | time |
one column per gauge_id |
extractors | analyzers, plotters |
| climatology | month (1-12) |
one column per gauge_id |
compute_climatology | plotters |
| metrics | row | gauge_id, nse, kge, pbias, rmse, correlation, n_valid |
compute_metrics | plotters |
| summary_stats | row | column_name, mean, std, min, max (+ optional river_name, area_km2) |
compute_summary_stats | — |
| zonal_stats | row | region, mean, area_weighted_mean |
compute_zonal_stats | — |
| bias_stats | row | mean_bias, rmse, spatial_correlation |
compute_spatial_bias | — |
Sample data is distributed separately to keep this repository lightweight. Download the Zenodo v2 data release from https://zenodo.org/records/20584449 and unpack it into data/.
data/sample/
├── e3sm/ # Sample E3SM output used by reference workflows
└── obs/
├── gauge_metadata.csv # gauge_id, lat, lon, area_km2, river_name
└── streamflow/ # per-gauge CSVs: date, discharge_m3s
| Workflow | Steps | What it does |
|---|---|---|
reference_workflows/task01_reference.yaml |
3 | Observation streamflow summary statistics |
reference_workflows/task02_reference.yaml |
3 | Seasonal runoff field extraction, statistics, and map |
reference_workflows/task03_reference.yaml |
5 | Evapotranspiration benchmark against ILAMB observations |
reference_workflows/task04_reference.yaml |
6 | Streamflow distribution and FDC comparison |
reference_workflows/task05_reference.yaml |
6 | Basin-scale streamflow analysis |
reference_workflows/task06_reference.yaml |
13 | Water-balance diagnostic workflow |
reference_workflows/task07_reference.yaml |
23 | Integrated basin water-cycle evaluation |
ESFlow includes a benchmark system that evaluates LLMs in two modes:
- Protocol mode: LLM generates a YAML workflow using the tool catalog
- Baseline mode: LLM generates free-form Python code (no tools)
The benchmark in the paper is reproduced in five stages. Sample data (E3SM output and GRDC streamflow) is not included in this repository to keep it lightweight. Download the Zenodo v2 data release from https://zenodo.org/records/20584449.
# 1. Download sample data from Zenodo v2 and unpack into data/
# (produces data/sample/e3sm/... and data/sample/obs/...)
# 2. Run the reference workflows to generate ground-truth outputs
for t in 01 02 03 04 05 06 07; do
python run_workflow.py reference_workflows/task${t}_reference.yaml
done
# 3. Run the benchmark for both conditions (6 models x 7 tasks x 4 runs each)
export LLM_API_KEY="your-key-here"
python benchmark/run_benchmark.py --all --runs 4 --mode protocol
python benchmark/run_benchmark.py --all --runs 4 --mode baseline
# 4. Run the self-debug experiment on crashed runs (up to 3 repair rounds)
python benchmark/self_debug_crashes.py --max-rounds 3
# 5. Grade and merge results
python benchmark/structural_grading.py
python benchmark/grade_selfdebug.py
python benchmark/merge_grades.py
python benchmark/merge_grades_selfdebug.pyThe paper's reference run (claude-opus-4-6 protocol run2) is pre-included in benchmark/results/claude-opus-4-6_protocol/ — so Step 2 only needs to be re-run if you change tools or want fresh reference outputs.
| Step | What it checks | Applies to | Auto? |
|---|---|---|---|
| Step 1: Crash | Final deliverable missing (CSV for T1, PNG for T2–T7) | Both modes | Yes |
| Step 2: Success | Key data file matches reference within 1% tolerance | Protocol only | Yes |
| Step 3: Manual review | Human assigns silent failure or obvious failure | Undetermined | No |
Final grades: crash, success, silent failure, obvious failure.
Reference run: claude-opus-4-6 protocol run2. Grading script: benchmark/structural_grading.py. Manual labels: benchmark/results/manual_overrides.json.
# Protocol mode (default)
export LLM_API_KEY="your-key-here"
python benchmark/run_benchmark.py --task benchmark/protocol/task_01_obs_summary.txt
# Baseline mode (free-form Python)
python benchmark/run_benchmark.py --task benchmark/baselines/task_01_obs_summary.txt --baseline
# Full benchmark (all models, 4 runs each, both modes)
python benchmark/run_benchmark.py --all --runs 4
# Run structural grading
python benchmark/structural_grading.py
# Local models via LM Studio
python benchmark/run_benchmark.py --localpython run_workflow.py <workflow.yaml> [options]
Options:
--dry-run, -n Validate workflow without executing
--verbose, -v Print detailed progress
--start-from, -s Jump to a specific step (earlier steps assumed complete)
--reuse, -r Reuse existing intermediate files
See docs/guide.md for a complete walkthrough with examples. The short version:
- Create
tools/<category>/my_tool.pywith aToolSpecand@esmflow_tooldecorator - Run
python tools/generate_catalog.py --overwriteto update the catalog - The LLM sees your tool in the catalog automatically
esflow/
├── run_workflow.py # Workflow engine
├── requirements.txt
├── docs/
│ ├── guide.md # Full protocol guide (adding tools, prompting LLMs)
│ └── system_prompt.txt # System prompt template for LLM workflow generation
├── tools/
│ ├── generate_catalog.py # Auto-generates tool_catalog.yaml
│ ├── tool_catalog.yaml # LLM-readable tool metadata (24 tools, 92 input parameters)
│ ├── core/ # Framework internals
│ │ ├── base.py # @esmflow_tool, Param, ToolSpec, TOOL_REGISTRY
│ │ ├── e3sm.py # ESM utilities (cftime, file discovery, dataset opening)
│ │ ├── schemas.py # CSV schema documentation
│ │ ├── data_io.py # MOSART/ELM data loading
│ │ ├── spatial.py # Grid matching, river tracing
│ │ └── styling.py # Plot style presets
│ ├── fetchers/ # Remote data fetchers (1)
│ ├── loaders/ # Data loading tools (1)
│ ├── matchers/ # Gauge matching tools (1)
│ ├── extractors/ # Time series, basin, and field extraction tools (4)
│ ├── analyzers/ # Metrics, stats, bias, budget, and zonal tools (7)
│ └── plotters/ # Visualization tools (10)
├── reference_workflows/ # Human-reviewed benchmark reference workflows
├── data/sample/ # Populated after downloading Zenodo sample data
├── benchmark/
│ ├── protocol/ # System prompt + task descriptions (protocol mode)
│ ├── baselines/ # Task descriptions for free-form Python (baseline mode)
│ ├── run_benchmark.py # Multi-model, mode-aware benchmark runner
│ ├── structural_grading.py # 3-step reproducible grading script
│ └── results/ # Outputs, scores, manual review labels
├── visualize_workflow.py # Workflow visualization helper
└── workflow_to_dag.py # Workflow DAG conversion helper
This material was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor the United States Department of Energy, nor Battelle, nor any of their employees, nor any jurisdiction or organization that has cooperated in the development of these materials, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness or any information, apparatus, product, software, or process disclosed, or represents that its use would not infringe privately owned rights.
Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof, or Battelle Memorial Institute. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.
PACIFIC NORTHWEST NATIONAL LABORATORY
operated by
BATTELLE
for the
UNITED STATES DEPARTMENT OF ENERGY
under Contract DE-AC05-76RL01830