ESFlow

A module-grounded framework for Earth System Model analysis. Scientists register analysis tools with typed metadata; any LLM reads the auto-generated tool catalog and composes YAML workflows. The human reviews, the engine executes.

Associated Paper

This repository accompanies the manuscript:

Can We Trust LLMs for Complex Earth System Model Analysis? Silent Failure and Evidence from Module-Grounded Benchmarking Tian Zhou, Yun Qian, L. Ruby Leung Pacific Northwest National Laboratory Submitted to Geoscientific Model Development (GMD), 2026

The paper introduces ESFlow and benchmarks it against unconstrained LLM code generation across six contemporary LLMs and seven E3SM land-surface-hydrology analysis tasks, with a focus on silent failures — plausible, well-formatted output that numerically disagrees with hand-crafted references. The Zenodo v2 data release, including the sample data used by the reference workflows, is available at https://zenodo.org/records/20584449.

Correspondence: Tian Zhou (tian.zhou@pnnl.gov).

Cite this work

@article{zhou_2026_esflow,
  author  = {Zhou, Tian and Qian, Yun and Leung, L. Ruby},
  title   = {Can We Trust {LLMs} for Complex {Earth} System Model Analysis?
             Silent Failure and Evidence from Module-Grounded Benchmarking},
  journal = {Geoscientific Model Development},
  year    = {2026},
  note    = {Submitted}
}

The citation will be updated with volume, pages, and DOI once the paper is accepted.

How It Works

Scientist writes tool  →  @esmflow_tool decorator  →  tool_catalog.yaml (auto)
                                                            ↓
User describes task    →  Any LLM reads catalog    →  workflow.yaml
                                                            ↓
                          run_workflow.py           →  figures, metrics, CSVs

Tools are Python functions decorated with @esmflow_tool(ToolSpec(...)). The decorator registers typed inputs and outputs, validates parameters, and rejects unknown parameters.
generate_catalog.py auto-discovers all tools and writes tool_catalog.yaml — the sole interface between LLMs and tools.
Any LLM (ChatGPT, Claude, Gemini, Llama, etc.) reads the catalog and generates a YAML workflow from a natural-language request.
run_workflow.py executes the YAML step-by-step, passing outputs between tools via ${step_id.outputs.key} references.

Design Principles

LLMs connect building blocks. Tools handle internals. Minimize decisions the LLM must make.

24 tools, 92 total input parameters
Strict validation: unknown parameters are rejected immediately
No modal behavior: each tool does exactly one thing, no mode/format switches
Match by column name: no match_by, x_column, y_column parameters — tools match gauge_id columns automatically
Standardized file schemas: common CSV/NetCDF contracts are documented in the catalog and schema reference
Simple observation format: one CSV per gauge (date, discharge_m3s)

Quick Start

# Clone and install
git clone https://github.com/pnnl/esflow.git
cd esflow
pip install -r requirements.txt

# Validate a workflow (no data needed)
python run_workflow.py reference_workflows/task01_reference.yaml --dry-run

# Download the Zenodo v2 sample data and unpack it into data/ before execution
# https://zenodo.org/records/20584449
python run_workflow.py reference_workflows/task01_reference.yaml

# Reuse existing intermediate files
python run_workflow.py reference_workflows/task01_reference.yaml --reuse

Generating Workflows with an LLM

Combine docs/system_prompt.txt + tools/tool_catalog.yaml as the system message
Describe your analysis task in natural language as the user message
Save the generated YAML and validate with --dry-run
Run it

See docs/guide.md for the full protocol, task description examples, and instructions for adding your own tools.

Tools (24)

#	Category	Tool	Parameters	Description
1	fetchers	`fetch_ilamb_data`	2	Download observation NetCDF files from the ILAMB server
2	loaders	`load_obs_metadata`	1	Load and validate gauge metadata CSV
3	matchers	`match_to_grid`	4	Match gauges to E3SM model grid
4	extractors	`extract_basin_mean`	3	Extract area-weighted basin means from gridded fields
5	extractors	`extract_e3sm_timeseries`	7	Extract model time series at gauge locations
6	extractors	`extract_obs_timeseries`	4	Extract observations from per-gauge CSVs
7	extractors	`extract_gridded_field`	6	Extract 2D fields from E3SM or observation NetCDF files
8	analyzers	`compute_climatology`	1	Compute monthly climatology from time series
9	analyzers	`compute_spatial_bias`	2	Compute gridded spatial bias and summary statistics
10	analyzers	`compute_zonal_stats`	3	Compute area-weighted global or latitude-band means
11	analyzers	`compute_summary_stats`	4	Compute mean/std/min/max per time-series column
12	analyzers	`compute_basin_budget`	8	Build per-basin water budget summary tables
13	analyzers	`compute_metrics`	2	Compute NSE, KGE, PBIAS, RMSE, and correlation
14	analyzers	`compute_fdc_metrics`	2	Compute flow duration curve distributional metrics
15	plotters	`plot_water_balance_basins`	8	Plot global residual maps with basin water-balance bars
16	plotters	`plot_scatter`	2	Plot mean-value scatter comparisons
17	plotters	`plot_gridded_map`	5	Plot 2D gridded fields on geographic maps
18	plotters	`plot_basin_radar`	1	Plot basin water-cycle diagnostic radar charts
19	plotters	`plot_timeseries`	2	Plot simulated vs observed time series panels
20	plotters	`plot_map`	3	Plot validation metrics at gauge locations
21	plotters	`plot_fdc`	6	Plot flow duration curve comparisons
22	plotters	`plot_basin_timeseries`	6	Plot basin maps with discharge time series
23	plotters	`plot_bias_comparison`	9	Plot observation, simulation, and bias map comparisons
24	plotters	`plot_basin_budget_comparison`	1	Plot basin water-budget component comparisons

Standard Data Schemas

Tools communicate through typed files. The table below lists common CSV contracts; tool-specific NetCDF, CSV, and figure outputs are documented in tools/tool_catalog.yaml.

Schema	Index	Columns	Producers	Consumers
gauge_metadata	row	`gauge_id`, `lat`, `lon`, `area_km2`, `river_name`	loaders	matchers, extractors, plotters
matched_gauges	row	`gauge_id`, `lat`, `lon`, `model_lat`, `model_lon`, `lat_idx`, `lon_idx`	matchers	extractors
timeseries	`time`	one column per `gauge_id`	extractors	analyzers, plotters
climatology	`month` (1-12)	one column per `gauge_id`	compute_climatology	plotters
metrics	row	`gauge_id`, `nse`, `kge`, `pbias`, `rmse`, `correlation`, `n_valid`	compute_metrics	plotters
summary_stats	row	`column_name`, `mean`, `std`, `min`, `max` (+ optional `river_name`, `area_km2`)	compute_summary_stats	—
zonal_stats	row	`region`, `mean`, `area_weighted_mean`	compute_zonal_stats	—
bias_stats	row	`mean_bias`, `rmse`, `spatial_correlation`	compute_spatial_bias	—

Sample Data

Sample data is distributed separately to keep this repository lightweight. Download the Zenodo v2 data release from https://zenodo.org/records/20584449 and unpack it into data/.

data/sample/
├── e3sm/                           # Sample E3SM output used by reference workflows
└── obs/
    ├── gauge_metadata.csv          # gauge_id, lat, lon, area_km2, river_name
    └── streamflow/                 # per-gauge CSVs: date, discharge_m3s

Example Workflows

Workflow	Steps	What it does
`reference_workflows/task01_reference.yaml`	3	Observation streamflow summary statistics
`reference_workflows/task02_reference.yaml`	3	Seasonal runoff field extraction, statistics, and map
`reference_workflows/task03_reference.yaml`	5	Evapotranspiration benchmark against ILAMB observations
`reference_workflows/task04_reference.yaml`	6	Streamflow distribution and FDC comparison
`reference_workflows/task05_reference.yaml`	6	Basin-scale streamflow analysis
`reference_workflows/task06_reference.yaml`	13	Water-balance diagnostic workflow
`reference_workflows/task07_reference.yaml`	23	Integrated basin water-cycle evaluation

Benchmarking LLMs

ESFlow includes a benchmark system that evaluates LLMs in two modes:

Protocol mode: LLM generates a YAML workflow using the tool catalog
Baseline mode: LLM generates free-form Python code (no tools)

Reproducing the Paper Benchmark

The benchmark in the paper is reproduced in five stages. Sample data (E3SM output and GRDC streamflow) is not included in this repository to keep it lightweight. Download the Zenodo v2 data release from https://zenodo.org/records/20584449.

# 1. Download sample data from Zenodo v2 and unpack into data/
#    (produces data/sample/e3sm/... and data/sample/obs/...)

# 2. Run the reference workflows to generate ground-truth outputs
for t in 01 02 03 04 05 06 07; do
    python run_workflow.py reference_workflows/task${t}_reference.yaml
done

# 3. Run the benchmark for both conditions (6 models x 7 tasks x 4 runs each)
export LLM_API_KEY="your-key-here"
python benchmark/run_benchmark.py --all --runs 4 --mode protocol
python benchmark/run_benchmark.py --all --runs 4 --mode baseline

# 4. Run the self-debug experiment on crashed runs (up to 3 repair rounds)
python benchmark/self_debug_crashes.py --max-rounds 3

# 5. Grade and merge results
python benchmark/structural_grading.py
python benchmark/grade_selfdebug.py
python benchmark/merge_grades.py
python benchmark/merge_grades_selfdebug.py

The paper's reference run (claude-opus-4-6 protocol run2) is pre-included in benchmark/results/claude-opus-4-6_protocol/ — so Step 2 only needs to be re-run if you change tools or want fresh reference outputs.

3-Step Structural Grading

Step	What it checks	Applies to	Auto?
Step 1: Crash	Final deliverable missing (CSV for T1, PNG for T2–T7)	Both modes	Yes
Step 2: Success	Key data file matches reference within 1% tolerance	Protocol only	Yes
Step 3: Manual review	Human assigns silent failure or obvious failure	Undetermined	No

Final grades: crash, success, silent failure, obvious failure.

Reference run: claude-opus-4-6 protocol run2. Grading script: benchmark/structural_grading.py. Manual labels: benchmark/results/manual_overrides.json.

# Protocol mode (default)
export LLM_API_KEY="your-key-here"
python benchmark/run_benchmark.py --task benchmark/protocol/task_01_obs_summary.txt

# Baseline mode (free-form Python)
python benchmark/run_benchmark.py --task benchmark/baselines/task_01_obs_summary.txt --baseline

# Full benchmark (all models, 4 runs each, both modes)
python benchmark/run_benchmark.py --all --runs 4

# Run structural grading
python benchmark/structural_grading.py

# Local models via LM Studio
python benchmark/run_benchmark.py --local

CLI Options

python run_workflow.py <workflow.yaml> [options]

Options:
  --dry-run, -n       Validate workflow without executing
  --verbose, -v       Print detailed progress
  --start-from, -s    Jump to a specific step (earlier steps assumed complete)
  --reuse, -r         Reuse existing intermediate files

Adding a New Tool

See docs/guide.md for a complete walkthrough with examples. The short version:

Create tools/<category>/my_tool.py with a ToolSpec and @esmflow_tool decorator
Run python tools/generate_catalog.py --overwrite to update the catalog
The LLM sees your tool in the catalog automatically

Project Structure

esflow/
├── run_workflow.py              # Workflow engine
├── requirements.txt
├── docs/
│   ├── guide.md                # Full protocol guide (adding tools, prompting LLMs)
│   └── system_prompt.txt       # System prompt template for LLM workflow generation
├── tools/
│   ├── generate_catalog.py      # Auto-generates tool_catalog.yaml
│   ├── tool_catalog.yaml        # LLM-readable tool metadata (24 tools, 92 input parameters)
│   ├── core/                    # Framework internals
│   │   ├── base.py              # @esmflow_tool, Param, ToolSpec, TOOL_REGISTRY
│   │   ├── e3sm.py              # ESM utilities (cftime, file discovery, dataset opening)
│   │   ├── schemas.py           # CSV schema documentation
│   │   ├── data_io.py           # MOSART/ELM data loading
│   │   ├── spatial.py           # Grid matching, river tracing
│   │   └── styling.py           # Plot style presets
│   ├── fetchers/                # Remote data fetchers (1)
│   ├── loaders/                 # Data loading tools (1)
│   ├── matchers/                # Gauge matching tools (1)
│   ├── extractors/              # Time series, basin, and field extraction tools (4)
│   ├── analyzers/               # Metrics, stats, bias, budget, and zonal tools (7)
│   └── plotters/                # Visualization tools (10)
├── reference_workflows/         # Human-reviewed benchmark reference workflows
├── data/sample/                 # Populated after downloading Zenodo sample data
├── benchmark/
│   ├── protocol/                # System prompt + task descriptions (protocol mode)
│   ├── baselines/               # Task descriptions for free-form Python (baseline mode)
│   ├── run_benchmark.py         # Multi-model, mode-aware benchmark runner
│   ├── structural_grading.py    # 3-step reproducible grading script
│   └── results/                 # Outputs, scores, manual review labels
├── visualize_workflow.py        # Workflow visualization helper
└── workflow_to_dag.py           # Workflow DAG conversion helper

Disclaimer

This material was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor the United States Department of Energy, nor Battelle, nor any of their employees, nor any jurisdiction or organization that has cooperated in the development of these materials, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness or any information, apparatus, product, software, or process disclosed, or represents that its use would not infringe privately owned rights.

Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof, or Battelle Memorial Institute. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.

                 PACIFIC NORTHWEST NATIONAL LABORATORY
                              operated by
                                BATTELLE
                                for the
                   UNITED STATES DEPARTMENT OF ENERGY
                    under Contract DE-AC05-76RL01830

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
benchmark		benchmark
docs		docs
reference_workflows		reference_workflows
tools		tools
.env.example		.env.example
.gitattributes		.gitattributes
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt
run_workflow.py		run_workflow.py
tool_catalog_generated.yaml		tool_catalog_generated.yaml
visualize_workflow.py		visualize_workflow.py
workflow_to_dag.py		workflow_to_dag.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ESFlow

Associated Paper

Cite this work

How It Works

Design Principles

Quick Start

Generating Workflows with an LLM

Tools (24)

Standard Data Schemas

Sample Data

Example Workflows

Benchmarking LLMs

Reproducing the Paper Benchmark

3-Step Structural Grading

CLI Options

Adding a New Tool

Project Structure

Disclaimer

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

ESFlow

Associated Paper

Cite this work

How It Works

Design Principles

Quick Start

Generating Workflows with an LLM

Tools (24)

Standard Data Schemas

Sample Data

Example Workflows

Benchmarking LLMs

Reproducing the Paper Benchmark

3-Step Structural Grading

CLI Options

Adding a New Tool

Project Structure

Disclaimer

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages