feat: consolidating metadata for opening zarr files in read mode by lmtroper · Pull Request #82 · polaris-hub/polaris

lmtroper · 2024-03-26T15:32:59Z

Changelogs

Incorporated the Zarr function that consolidates metadata when opening a Zarr group in read mode. This reduces the number of ls calls and speeds up reads from the Zarr group.

Profiling without consolidation

====================================================================================================
Date: 2024-03-28
Time: 16:21:26
Size: 8.0 KB
Repeats: 1
Polaris version: 0.0.2.dev191+g82e7db2
Zarr version: 2.17.1
====================================================================================================
                         Creating the Zarr archive: 0:00:00.002254 ± 0:00:00
         Creating dataset from Source Zarr archive: 0:00:00.003214 ± 0:00:00
                      Uploading dataset to the Hub: 0:00:27.666530 ± 0:00:00
                          Loading dataset from Hub: 0:00:01.549233 ± 0:00:00
                   Iterating over dataset (remote): 0:00:25.385904 ± 0:00:00
                          Caching dataset to local: 0:00:18.538832 ± 0:00:00
                    Iterating over dataset (local): 0:00:00.001493 ± 0:00:00
           Baseline Zarr only upload to Cloudflare: 0:00:00.636373 ± 0:00:00
             Baseline dataset upload to Cloudflare: 0:00:03.207451 ± 0:00:00
       Baseline Zarr only download from Cloudflare: 0:00:00.251135 ± 0:00:00
         Baseline dataset download from Cloudflare: 0:00:00.432873 ± 0:00:00
====================================================================================================
                        Actual / Baseline - Upload: 8.626 ± 0.000
                      Actual / Baseline - Download: 42.827 ± 0.000

Profiling with consolidation

Profiling Report
====================================================================================================
Date: 2024-04-04
Time: 11:15:12
Size: 8.0 KB
Repeats: 2
Polaris version: dev
Zarr version: 2.16.1
====================================================================================================
                         Creating the Zarr archive: 0:00:00.018600 ± 0:00:00.007932
         Creating dataset from Source Zarr archive: 0:00:00.021886 ± 0:00:00.002237
                      Uploading dataset to the Hub: 0:00:35.845067 ± 0:00:02.073791
                          Loading dataset from Hub: 0:00:01.928347 ± 0:00:00.003685
                          Caching dataset to local: 0:00:03.662794 ± 0:00:00.299349
                    Iterating over dataset (local): 0:00:00.005156 ± 0:00:00.000391
           Baseline Zarr only upload to Cloudflare: 0:00:00.973820 ± 0:00:00.110363
             Baseline dataset upload to Cloudflare: 0:00:03.534744 ± 0:00:00.056176
       Baseline Zarr only download from Cloudflare: 0:00:00.447640 ± 0:00:00.131711
         Baseline dataset download from Cloudflare: 0:00:00.386626 ± 0:00:00.024372
====================================================================================================
                        Actual / Baseline - Upload: 10.134 ± 0.426
                      Actual / Baseline - Download: 9.463 ± 0.178
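
To put the two reports side by side, the sketch below recomputes the download ratio and the change in caching time from the timings quoted above. The pairing of rows is an inference from the numbers, not something the PR states: dividing "Caching dataset to local" by "Baseline dataset download from Cloudflare" reproduces the reported download ratio for the single-repeat run. The two runs also differ in Zarr version, Polaris version, and repeat count, so the comparison is indicative only.

```python
from datetime import timedelta


def to_seconds(stamp: str) -> float:
    """Convert an 'H:MM:SS.ffffff' duration from the report into seconds."""
    hours, minutes, seconds = stamp.split(":")
    return timedelta(
        hours=int(hours), minutes=int(minutes), seconds=float(seconds)
    ).total_seconds()


# Mean timings copied from the two profiling reports.
without_consolidation = {
    "cache": to_seconds("0:00:18.538832"),
    "baseline_download": to_seconds("0:00:00.432873"),
}
with_consolidation = {
    "cache": to_seconds("0:00:03.662794"),
    "baseline_download": to_seconds("0:00:00.386626"),
}

# Assumed definition of "Actual / Baseline - Download": caching time
# divided by the baseline dataset download time.
ratio_without = without_consolidation["cache"] / without_consolidation["baseline_download"]
ratio_with = with_consolidation["cache"] / with_consolidation["baseline_download"]

# How much faster caching to local became between the two runs.
caching_speedup = without_consolidation["cache"] / with_consolidation["cache"]

print(f"download ratio without consolidation: {ratio_without:.3f}")
print(f"download ratio with consolidation:    {ratio_with:.3f}")
print(f"caching speedup:                      {caching_speedup:.2f}x")
```

Under these assumptions, caching to local is roughly five times faster in the consolidated run. The recomputed ratio for the two-repeat run differs slightly from the reported 9.463, presumably because the report averages per-repeat ratios instead of dividing the means; that explanation is a guess.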

cwognum

To optimize performance, we don't want to consolidate the metadata every time.

From the Zarr docs:

>>> zarr.consolidate_metadata(store)  
This creates a special key with a copy of all of the metadata from all of the metadata objects in the store.

The key here refers to a single file for us (i.e. by default named .zmetadata, although this can be changed with the metadata_key parameter in the consolidate methods).

I think what we would want to do is:

Consolidate the archive locally.
Then copy over the consolidated archive to the Hub with zarr.convenience.copy_all.
If this doesn't work (i.e. I assume it would copy over the .zmetadata file, but maybe not?), then we should find out a way to copy over this single file manually.

Could you look into the above? I would be curious to know if this is possible!

Not having to consolidate everything on the Hub would make things a lot faster!
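
The idea in the comment above (consolidate locally, then copy the result so the remote side never has to consolidate) can be illustrated with a toy model. This is a simplified sketch using only the standard library, not Zarr's implementation: `ToyStore`, `consolidate`, `copy_all_keys`, and the two read helpers are hypothetical names, and the `.zmetadata` layout is only loosely modelled on Zarr's. The toy copy transfers every key, including `.zmetadata`; whether `zarr.convenience.copy_all` does the same is exactly the open question raised above and is not answered here.

```python
import json

META_SUFFIXES = (".zarray", ".zgroup", ".zattrs")
CONSOLIDATED_KEY = ".zmetadata"


class ToyStore:
    """In-memory key-value store that counts read and listing requests."""

    def __init__(self):
        self._data = {}
        self.reads = 0
        self.lists = 0

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str) -> bytes:
        self.reads += 1
        return self._data[key]

    def keys(self) -> list:
        self.lists += 1
        return list(self._data)


def read_metadata_unconsolidated(store: ToyStore) -> dict:
    """List the store, then fetch every metadata document one by one."""
    return {k: json.loads(store.get(k)) for k in store.keys() if k.endswith(META_SUFFIXES)}


def consolidate(store: ToyStore) -> None:
    """Write a single key holding a copy of all metadata documents."""
    payload = {"zarr_consolidated_format": 1, "metadata": read_metadata_unconsolidated(store)}
    store.put(CONSOLIDATED_KEY, json.dumps(payload).encode())


def read_metadata_consolidated(store: ToyStore) -> dict:
    """Fetch all metadata with one request to the consolidated key."""
    return json.loads(store.get(CONSOLIDATED_KEY))["metadata"]


def copy_all_keys(src: ToyStore, dst: ToyStore) -> None:
    """Copy every key verbatim, including the consolidated metadata key."""
    for key in src.keys():
        dst.put(key, src.get(key))


# Build a small local archive: one group with three chunked arrays.
local = ToyStore()
local.put(".zgroup", b'{"zarr_format": 2}')
for name in ("a", "b", "c"):
    local.put(f"{name}/.zarray", json.dumps({"shape": [10], "chunks": [5]}).encode())
    local.put(f"{name}/0", b"\x00" * 5)
    local.put(f"{name}/1", b"\x00" * 5)

consolidate(local)            # step 1: consolidate locally
remote = ToyStore()
copy_all_keys(local, remote)  # step 2: copy the consolidated archive

slow = read_metadata_unconsolidated(remote)
slow_requests = remote.reads + remote.lists

remote.reads = remote.lists = 0
fast = read_metadata_consolidated(remote)
fast_requests = remote.reads + remote.lists

print(f"requests without consolidation: {slow_requests}")
print(f"requests with consolidation:    {fast_requests}")
```

In this model the unconsolidated read costs one listing plus one request per metadata document, while the consolidated read costs a single request regardless of how many arrays the archive holds.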

cwognum · 2024-03-26T23:11:29Z

Now that #83 is merged, could you actually look into adding the consolidation in the flow for using Zarr datasets from the Hub:

We want to consolidate the Zarr archive just before uploading it to the Hub (here)
We assume that any archive uploaded to the Hub has been consolidated, so when we open it here it should use open_consolidated. Maybe we add an as_consolidated parameter to client.open_zarr_file()?
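
The reader-side policy proposed above can be sketched as follows. This is a standard-library illustration of the control flow only, not the Polaris client code: `load_metadata` and `NotConsolidatedError` are hypothetical names, a plain `dict` stands in for the store, and in the real client the two branches would presumably map to `zarr.open_consolidated` and a regular open.

```python
import json
from typing import Mapping

CONSOLIDATED_KEY = ".zmetadata"  # default consolidated-metadata key
META_SUFFIXES = (".zarray", ".zgroup", ".zattrs")


class NotConsolidatedError(ValueError):
    """Raised when a consolidated archive was expected but not found."""


def load_metadata(store: Mapping[str, bytes], as_consolidated: bool = True) -> dict:
    """Return all metadata documents of an archive.

    With as_consolidated=True (the default, matching the assumption that
    every archive on the Hub has been consolidated) only the single
    consolidated key is read, and a missing key is an error instead of a
    silent fallback.
    """
    if as_consolidated:
        if CONSOLIDATED_KEY not in store:
            raise NotConsolidatedError(
                f"expected consolidated metadata under {CONSOLIDATED_KEY!r}"
            )
        return json.loads(store[CONSOLIDATED_KEY])["metadata"]

    # Fallback path: visit every metadata document individually.
    return {k: json.loads(v) for k, v in store.items() if k.endswith(META_SUFFIXES)}


archive = {
    ".zgroup": b'{"zarr_format": 2}',
    "x/.zarray": b'{"shape": [4]}',
    "x/0": b"\x00\x00\x00\x00",
}

try:
    load_metadata(archive)
except NotConsolidatedError as err:
    print(f"unconsolidated archive rejected: {err}")

# Consolidate just before "upload", then read in consolidated mode.
archive[CONSOLIDATED_KEY] = json.dumps(
    {"zarr_consolidated_format": 1, "metadata": load_metadata(archive, as_consolidated=False)}
).encode()
print(sorted(load_metadata(archive)))
```

Raising on a missing key, instead of falling back quietly, keeps the "everything on the Hub is consolidated" assumption visible when it is violated.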

cwognum

Great work!

Let's assume that any Zarr archive that is loaded for a dataset has been consolidated. This means that we should also change these lines of code to load in consolidated mode!

cwognum

Almost there!

The test cases are failing though because the test Zarr archive is not consolidated! The formatting also fails right now.

consolidating metadata for read mode

05fdbbe

lmtroper added the enhancement New feature or request label Mar 26, 2024

lmtroper requested a review from cwognum March 26, 2024 15:32

cwognum requested changes Mar 26, 2024

Comment thread polaris/hub/client.py

lmtroper added 3 commits March 27, 2024 12:11

Merge branch 'main' into 74-optimizing-polarisfs-for-zarr-files-up-to…

ead6522

…-10gb

consolidate metadata on hub upload

676a09a

formatting

8c4ce75

lmtroper requested a review from cwognum March 28, 2024 14:20

cwognum reviewed Mar 28, 2024

Comment thread polaris/hub/client.py

Comment thread polaris/hub/client.py

lmtroper added 2 commits March 28, 2024 13:23

CR changes

42a7425

open consolidated version in other parts of code

4ddc05c

cwognum reviewed Mar 28, 2024

Comment thread polaris/hub/client.py

Comment thread polaris/dataset/_dataset.py Outdated

fix zarr write consolidate error and try-catch for reading consolidated

1b30557

cwognum reviewed Mar 28, 2024

Comment thread polaris/dataset/_dataset.py Outdated

lmtroper and others added 2 commits March 28, 2024 15:55

throw error for reading unconsolidated files + updated tests

4cbc06f

Formatting and fixing test cases

607fa35

cwognum approved these changes Mar 28, 2024

cwognum merged commit e918577 into main Mar 28, 2024

lmtroper linked an issue Mar 28, 2024 that may be closed by this pull request

Optimizing PolarisFS for Zarr files up to 10GB #74

Closed

3 tasks

cwognum deleted the 74-optimizing-polarisfs-for-zarr-files-up-to-10gb branch March 28, 2024 21:37

cwognum mentioned this pull request Apr 9, 2024

Optimizing PolarisFS for Zarr files up to 10GB #74

Closed

3 tasks

cwognum merged 9 commits into main from 74-optimizing-polarisfs-for-zarr-files-up-to-10gb
