dataset.sel inconsistent results when argument is a list or a slice. #6976

JamiePringle · 2022-09-01T14:40:34Z

What happened?

I am not sure if what I report is a bug; however, it is certainly not what I expect from a careful reading of the documentation, and I wonder if it is leading to some issues I describe below.

I am working with a large dataset produced by merging the output of runs made by different MPI processes. There are two coordinates, ("trajectory","obs"). All of the "obs" in the dataset are in order, but the "trajectory" coordinate is not in order. I made a smaller dataset that illustrates the issue below by reducing the number of "obs" from 250 to 2; this dataset can be found at http://oxbow.sr.unh.edu/data/smallExample.zarr.zip . This dataset looks like:

Dimensions:     (trajectory: 39363539, obs: 2)
Coordinates:
  * obs         (obs) int32 0 1
  * trajectory  (trajectory) int64 100 210 227 ... 39363210 39363255 39363379
Data variables:
    age         (trajectory, obs) float32 dask.array<chunksize=(50000, 2), meta=np.ndarray>
    lat         (trajectory, obs) float64 dask.array<chunksize=(50000, 2), meta=np.ndarray>
    lon         (trajectory, obs) float64 dask.array<chunksize=(50000, 2), meta=np.ndarray>
    time        (trajectory, obs) datetime64[ns] dask.array<chunksize=(40625, 2), meta=np.ndarray>
    z           (trajectory, obs) float64 dask.array<chunksize=(50000, 2), meta=np.ndarray>

Note that the trajectory coordinate is not in order; this is due to how the problem is partitioned into MPI jobs.

If I want an ordered set of trajectories, say trajectories [1,2,3,4,5,6,7,8,9,10], and I select them with subSet=dataIn.sel(trajectory=arange(1,11)), I get what I would expect from the documentation: trajectories 1 through 10, in order:

<xarray.Dataset>
Dimensions:     (trajectory: 10, obs: 2)
Coordinates:
  * obs         (obs) int32 0 1
  * trajectory  (trajectory) int64 1 2 3 4 5 6 7 8 9 10
Data variables:
    age         (trajectory, obs) float32 dask.array<chunksize=(1, 2), meta=np.ndarray>
....

But if I use a slice to specify what I want, I get something very different: dataIn.sel(trajectory=slice(1,11)) returns 2567339 trajectories, starting at the position of trajectory coordinate 1 in dataIn and extending to the position of trajectory coordinate 11 in dataIn:

<xarray.Dataset>
Dimensions:     (trajectory: 2567339, obs: 2)
Coordinates:
  * obs         (obs) int32 0 1
  * trajectory  (trajectory) int64 1 27 57 59 ... 39363486 39363495 39363528 11
Data variables:
    age         (trajectory, obs) float32 dask.array<chunksize=(17944, 2), meta=np.ndarray>
...

This is not what I expect -- as I understand the documentation, .sel should work in coordinate space, and I would expect dataIn.sel(trajectory=slice(1,11)) and subSet=dataIn.sel(trajectory=arange(1,11)) to return the same thing. If I am wrong in this interpretation, perhaps a documentation update would be helpful.

I have had all sorts of issues with the full dataset, including .to_zarr(dataset, compute=False) failing to return a delayedObject because it used all the memory, .sortby(['trajectory']) failing by memory exhaustion, etc. I wonder if the issue reported here can be at the root of many of these issues? On a side note, the failure of .sortby(['trajectory']) makes re-ordering the dataset difficult, and I would be happy to hear any suggestions on that front.

What did you expect to happen?

See above for full descriptions

Minimal Complete Verifiable Example

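# Get the data from http://oxbow.sr.unh.edu/data/smallExample.zarr.zip (unzipped to smallExample.zarr)
import xarray as xr
from numpy import arange

dataIn = xr.open_zarr('smallExample.zarr')
print(dataIn.sel(trajectory=arange(1, 11)))  # array of labels
print(dataIn.sel(trajectory=slice(1, 11)))   # label slice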

MVCE confirmation

Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
Complete example — the example is self-contained, including all data and the text of any traceback.
Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS

commit: None
python: 3.10.5 | packaged by conda-forge | (main, Jun 14 2022, 07:04:59) [GCC 10.3.0]
python-bits: 64
OS: Linux
OS-release: 5.15.0-46-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.8.1

xarray: 2022.6.0
pandas: 1.4.3
numpy: 1.23.2
scipy: 1.9.0
netCDF4: 1.6.0
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.12.0
cftime: 1.6.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2022.8.1
distributed: 2022.8.1
matplotlib: 3.5.3
cartopy: 0.20.3
seaborn: None
numbagg: None
fsspec: 2022.7.1
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 65.2.0
pip: 22.2.2
conda: None
pytest: 7.1.2
IPython: 8.4.0
sphinx: None


mathause · 2022-09-01T15:37:25Z

A smaller repro:

import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(5), dims="x", coords={"x": [2, 1, 0, 3, 5]})

da.sel(x=slice(2, 3))

Returns:

<xarray.DataArray (x: 4)>
array([0, 1, 2, 3])
Coordinates:
  * x        (x) int64 2 1 0 3

da.sel(x=[2, 3])

Returns:

<xarray.DataArray (x: 2)>
array([0, 3])
Coordinates:
  * x        (x) int64 2 3

Yes, good point. What might be possible is to warn if selecting with a slice and the index is not monotonic increasing or decreasing.

JamiePringle · 2022-09-01T15:47:43Z

So is this an expected behavior? I can work around it by explicitly creating the indices with arange() or the like. I do wonder if this is what is causing to_zarr() to fail even with compute=False? But I can work around that.
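
For anyone looking for a workaround on an unsorted index, here is a minimal sketch of two ways to get a true coordinate-range selection. It reuses the toy array from the smaller repro; the names in_range, by_mask and by_sort are made up for illustration. On a dataset as large as the one described in the report, the mask route may be the more practical one, since sortby was reported to run out of memory there.

```python
import numpy as np
import xarray as xr

# Toy array from the smaller repro: the "x" index is not sorted.
da = xr.DataArray(np.arange(5), dims="x", coords={"x": [2, 1, 0, 3, 5]})

# Option 1: build a boolean mask on the coordinate values, then select
# the matching integer positions. Stored order is preserved.
x_values = da["x"].values
in_range = (x_values >= 2) & (x_values <= 3)
by_mask = da.isel(x=np.flatnonzero(in_range))

# Option 2: sort by the coordinate first, so that a label slice
# behaves like a range query on the values.
by_sort = da.sortby("x").sel(x=slice(2, 3))
```

Both results contain only the labels 2 and 3. Note that the label slice is inclusive of its end point, unlike arange.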

benbovy · 2022-09-02T13:53:15Z

Xarray passes the label indexers to the underlying pandas index:

import pandas as pd

# "x" coordinate index
idx = pd.Index([2, 1, 0, 3, 5])

# da.sel(x=slice(2, 3)) does this:
idx.slice_indexer(2, 3)
# which returns slice(0, 4, None)

# da.sel(x=[2, 3]) does this:
idx.get_indexer([2, 3])
# which returns array([0, 3])
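
To make the difference concrete, here is a small sketch that runs the same slice_indexer call on the unsorted index and on a sorted copy. The variable names are illustrative only.

```python
import pandas as pd

unsorted = pd.Index([2, 1, 0, 3, 5])
sorted_idx = unsorted.sort_values()  # labels 0, 1, 2, 3, 5

# Unsorted but unique: pandas looks up the positions of labels 2 and 3
# and returns everything stored between them (positions 0 through 3).
unsorted_slice = unsorted.slice_indexer(2, 3)
labels_unsorted = list(unsorted[unsorted_slice])

# Sorted: the same call acts as a range query on the label values.
sorted_slice = sorted_idx.slice_indexer(2, 3)
labels_sorted = list(sorted_idx[sorted_slice])

print(unsorted.is_monotonic_increasing, labels_unsorted)
print(sorted_idx.is_monotonic_increasing, labels_sorted)
```

On the unsorted index the slice picks up labels 1 and 0 as well, because they happen to be stored between 2 and 3.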

> What might be possible is to warn if selecting with a slice and the index is not monotonic increasing or decreasing.

Is that always desirable? Put differently, are there cases where one intentionally wants to select with a slice on a non-monotonic index? If so, a warning might be annoying.

Maybe this could be clarified in the docs too?
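
If a warning is considered too noisy to be the default, one option is an opt-in check on the user side. The sketch below is a hypothetical helper (sel_checked is not an xarray function) that warns before slicing a non-monotonic index and otherwise defers to .sel unchanged.

```python
import warnings

import numpy as np
import xarray as xr


def sel_checked(obj, dim, start, stop):
    """Label-slice `obj` along `dim`, warning if the index is not monotonic.

    Hypothetical helper for illustration; not part of xarray.
    """
    index = obj.indexes[dim]  # the underlying pandas index
    if not (index.is_monotonic_increasing or index.is_monotonic_decreasing):
        warnings.warn(
            f"index {dim!r} is not monotonic; slice({start}, {stop}) selects "
            "by stored position between the two labels, not by value range",
            UserWarning,
            stacklevel=2,
        )
    return obj.sel({dim: slice(start, stop)})


da = xr.DataArray(np.arange(5), dims="x", coords={"x": [2, 1, 0, 3, 5]})

# Record warnings so the example can show that one was emitted.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = sel_checked(da, "x", 2, 3)

print(len(caught), result["x"].values)
```

The selection itself is unchanged; the helper only makes the positional behaviour visible to the caller.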

mathause · 2022-09-02T14:06:54Z

Jup, that's always the tradeoff - #1613 discusses a similar case.

JamiePringle · 2022-09-02T14:13:27Z

I am happy to close this; it would be lovely if the documentation was more explicit about this issue. I was certainly surprised even after a close reading of the docs. Jamie


JamiePringle added the "bug" and "needs triage" labels on Sep 1, 2022

mathause added the "topic-indexing" and "topic-error reporting" labels and removed the "bug" and "needs triage" labels on Sep 1, 2022

max-sixty closed this as completed Nov 26, 2024
