StreamCat date string handling in nhdplus_derived #98

@rjost1

What happened?

I was following the workflow for computing hydrologic signatures in the River Discharge example of the HyRiver docs, working in a VS Code Jupyter notebook. I used the example code verbatim except for my own bounding box coordinates and dates (the modified lines are shown below; the unchanged code is omitted).

dates = ("2000-10-01", "2011-09-30")
bbox = (-115.63, 43.94, -114.96, 44.35)

qobs = nwis.get_streamflow(stations, dates, mmd=True)
plot.signatures(qobs)

The nwis.get_streamflow() call fails and raises this error: ValueError: invalid literal for int() with base 10: '1990:2017'

What did you expect to happen?

I expected a plot of hydrologic signatures for the specified station and date range.

Minimal Complete Verifiable Example

from pygeohydro import NWIS, plot


dates = ("2000-10-01", "2011-09-30")
bbox = (-115.63, 43.94, -114.96, 44.35)
nwis = NWIS()
query = {
    "bBox": ",".join(f"{b:.06f}" for b in bbox),
    "hasDataTypeCd": "dv",
    "outputDataTypeCd": "dv",
}
info_box = nwis.get_info(query)

stations = info_box[
    (info_box.begin_date <= dates[0]) & (info_box.end_date >= dates[1])
].site_no.tolist()

query = {
    "site": ",".join(stations),
    "hasDataTypeCd": "dv",
    "outputDataTypeCd": "dv",
}
info = nwis.get_info(query, expanded=True)
info.set_index("site_no").hcdn_2009

qobs = nwis.get_streamflow(stations, dates, mmd=True)
plot.signatures(qobs)

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[26], line 1
----> 1 qobs = nwis.get_streamflow(stations, dates, mmd=True)
      2 plot.signatures(qobs)

File ~/miniconda/envs/geos505/lib/python3.11/site-packages/pygeohydro/nwis.py:759, in NWIS.get_streamflow(cls, station_ids, dates, freq, mmd, to_xarray)
    757 siteinfo = siteinfo[siteinfo.site_no.isin(sids)]
    758 if mmd:
--> 759     area_sqm = cls._drainage_area_sqm(siteinfo, freq)
    760     ms2mmd = 1000.0 * 24.0 * 3600.0
    761     try:

File ~/miniconda/envs/geos505/lib/python3.11/site-packages/pygeohydro/nwis.py:537, in NWIS._drainage_area_sqm(cls, siteinfo, freq)
    535 """Get drainage area of the stations."""
    536 if "nhd_areasqkm" not in siteinfo:
--> 537     area = cls._nhd_info(siteinfo["site_no"].to_list())
    538     area = area[["site_no", "nhd_areasqkm"]].copy()
    539 else:

File ~/miniconda/envs/geos505/lib/python3.11/site-packages/pygeohydro/nwis.py:301, in NWIS._nhd_info(site_ids)
    299 except (TypeError, IntCastingNaNError):
    300     area["comid"] = area["comid"].astype("Int32")
--> 301 nhd_area = pynhd.streamcat("fert", comids=area["comid"].dropna().to_list(), area_sqkm=True)
    302 area = area.merge(
    303     nhd_area[["comid", "wsareasqkm"]], left_on="comid", right_on="comid", how="left"
    304 )
    305 area["identifier"] = area["identifier"].str.replace("USGS-", "")

File ~/miniconda/envs/geos505/lib/python3.11/site-packages/pynhd/nhdplus_derived.py:726, in streamcat(metric_names, metric_areas, comids, regions, states, counties, conus, percent_full, area_sqkm, lakes_only)
    724 if metric_names is None:
    725     return StreamCat().metrics_df
--> 726 sc = StreamCatValidator(lakes_only)
    727 names = [metric_names] if isinstance(metric_names, str) else metric_names
    728 sc.validate(name=names)

File ~/miniconda/envs/geos505/lib/python3.11/site-packages/pynhd/nhdplus_derived.py:586, in StreamCatValidator.__init__(self, lakes_only)
    585 def __init__(self, lakes_only: bool = False) -> None:
--> 586     super().__init__(lakes_only)

File ~/miniconda/envs/geos505/lib/python3.11/site-packages/pynhd/nhdplus_derived.py:576, in StreamCat.__init__(self, lakes_only)
    573 self.metrics_df = names
    575 years = names.set_index("METRIC_NAME").YEAR.dropna()
--> 576 self.valid_years = {
    577     str(v): list(range(*(int(y) for y in yrs.split("-"))))
    578     if "-" in yrs
    579     else [int(y) for y in yrs.split(",")]
    580     for v, yrs in years.items()
    581 }

File ~/miniconda/envs/geos505/lib/python3.11/site-packages/pynhd/nhdplus_derived.py:579, in <dictcomp>(.0)
    573 self.metrics_df = names
    575 years = names.set_index("METRIC_NAME").YEAR.dropna()
    576 self.valid_years = {
    577     str(v): list(range(*(int(y) for y in yrs.split("-"))))
    578     if "-" in yrs
--> 579     else [int(y) for y in yrs.split(",")]
    580     for v, yrs in years.items()
    581 }

File ~/miniconda/envs/geos505/lib/python3.11/site-packages/pynhd/nhdplus_derived.py:579, in <listcomp>(.0)
    573 self.metrics_df = names
    575 years = names.set_index("METRIC_NAME").YEAR.dropna()
    576 self.valid_years = {
    577     str(v): list(range(*(int(y) for y in yrs.split("-"))))
    578     if "-" in yrs
--> 579     else [int(y) for y in yrs.split(",")]
    580     for v, yrs in years.items()
    581 }

ValueError: invalid literal for int() with base 10: '1990:2017'
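
The failing branch can be reproduced without pynhd or network access. The sketch below copies the branching of the dict comprehension shown in the traceback into a standalone function; the name `parse_years_original` is made up for illustration and is not part of pynhd.

```python
def parse_years_original(yrs: str) -> list[int]:
    """Standalone copy of the branching shown in the traceback (illustration only)."""
    if "-" in yrs:
        # "1990-2017" -> range(1990, 2017)
        return list(range(*(int(y) for y in yrs.split("-"))))
    # Anything without a dash is treated as a comma-separated list.
    return [int(y) for y in yrs.split(",")]


# A dash-delimited range parses without error.
dash_years = parse_years_original("1990-2017")

# A colon-delimited range has no dash and no comma, so the whole
# string is handed to int(), which raises ValueError.
try:
    parse_years_original("1990:2017")
except ValueError as err:
    error_message = str(err)
    print(error_message)
```

The printed message matches the last line of the traceback, which points at the YEAR string itself rather than at the station list or date range.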

Anything else we need to know?

This is the explanation of the error given by Copilot:

The error ValueError: invalid literal for int() with base 10: '1990:2017' occurs because the code in the pynhd library (specifically in nhdplus_derived.py) is trying to parse a year string from StreamCat data, but the string '1990:2017' contains a colon (:) instead of the expected dash (-) for ranges or comma (,) for lists.

Here's what's happening step-by-step:

When you call nwis.get_streamflow(stations, dates, mmd=True), the mmd=True parameter triggers conversion of the discharge to mm/day (note the ms2mmd factor in the traceback), which requires drainage area data.

To get the drainage area, the code calls pynhd.streamcat("fert", ...) to retrieve StreamCat metrics.

Inside StreamCat.__init__(), the code parses the YEAR column from the metrics data to build valid_years. It expects year ranges to use dashes (e.g., "1990-2017") or comma-separated lists (e.g., "1990,2017").

For the string '1990:2017', it doesn't contain a dash, so it falls back to splitting by commas and converting each part to an integer. Since '1990:2017' splits to ['1990:2017'], int('1990:2017') fails because the colon isn't a valid character for an integer.

This appears to be a data formatting issue in the StreamCat dataset where a year range is incorrectly delimited with a colon instead of a dash. The pynhd library doesn't handle this case.
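
One possible direction for a fix is a parser that accepts both delimiters. This is only a sketch under assumptions: `parse_years` is a hypothetical helper, not an existing pynhd function, and it assumes that a colon means the same thing as a dash in the YEAR column. It keeps the half-open `range(start, end)` behavior visible in the traceback; whether the end year should be included is a separate question for the maintainers.

```python
import re


def parse_years(yrs: str) -> list[int]:
    """Parse a YEAR string into a list of ints (hypothetical helper).

    Accepts ranges delimited by "-" or ":" and comma-separated lists.
    """
    yrs = yrs.strip()
    # Treat "-" and ":" as equivalent range delimiters (an assumption).
    match = re.fullmatch(r"(\d{4})\s*[-:]\s*(\d{4})", yrs)
    if match:
        start, end = (int(g) for g in match.groups())
        # Half-open range, mirroring range(*...) in the traceback.
        return list(range(start, end))
    # Fall back to a comma-separated list, skipping empty pieces.
    return [int(y) for y in yrs.split(",") if y.strip()]


colon_years = parse_years("1990:2017")
dash_years = parse_years("1990-2017")
list_years = parse_years("2001,2006")  # illustrative values
```

With this variant, the colon and dash forms yield the same list, and a string that matches neither pattern still raises ValueError instead of being silently accepted.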

Environment

SYS INFO

commit: None
python: 3.11.14 | packaged by conda-forge | (main, Oct 22 2025, 22:53:07) [Clang 19.1.7 ]
python-bits: 64
OS: Darwin
OS-release: 21.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: (None, 'UTF-8')

PACKAGE VERSION

async-retriever 0.19.3
pygeoogc 0.19.4
pygeoutils 0.19.5
py3dep 0.19.0
pynhd 0.19.4
pygridmet N/A
pydaymet N/A
hydrosignatures 0.19.3
pynldas2 N/A
pygeohydro 0.19.4
tiny-retriever N/A
aiodns 3.0.0
aiofiles 25.1.0
aiohttp 3.13.2
aiohttp-client-cache 0.14.1
aiosqlite 0.21.0
brotli 1.1.0
cytoolz 1.1.0
orjson 3.11.4
numpy 2.3.5
pandas 2.3.3
scipy 1.16.3
xarray 2025.12.0
numba N/A
numbagg N/A
click 8.3.0
geopandas 1.1.1
rasterio 1.4.3
rioxarray 0.19.0
shapely 2.1.2
netcdf4 1.7.3
pyproj 3.7.2
defusedxml 0.7.1
folium 0.20.0
h5netcdf 1.7.2
matplotlib 3.10.8
planetary-computer N/A
pystac-client N/A
joblib 1.5.2
multidict 6.6.3
owslib 0.34.1
requests 2.32.5
requests-cache 1.2.1
typing-extensions 4.15.0
url-normalize 2.2.1
urllib3 2.5.0
yarl 1.22.0
networkx 3.5
pyarrow 21.0.0
py7zr N/A
flox N/A
opt-einsum N/A

Labels: bug (Something isn't working)
