fix: Use the Parquet file size and checksum for the table content metadata when creating a dataset on the Hub #57
…d attribute to use those of the Parquet file to be uploaded
hadim
left a comment
LGTM. Thanks Julien.
I assume we don't have any concerns about the buffer not being garbage collected, right? So there's no need for anything like del buffer?
It'll be GC-able when it goes out of scope at the end of the function call. That being said, a large enough Parquet file generated here will probably blow up the process memory. I don't know Pandas well enough to say if it's also an issue with the dataframe in the […]. In any case, we'll have to be smarter when handling very large datasets, for example by streaming from disk and using multi-part uploads.
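To make the "streaming from disk" idea concrete, here is a minimal sketch of computing a file's size and checksum in fixed-size chunks, so the whole Parquet file never has to sit in memory. The function name `file_size_and_checksum`, the 1 MiB chunk size, and the choice of SHA-256 are assumptions for illustration; the checksum algorithm the Hub actually expects is not stated in this thread.

```python
import hashlib
import os
import tempfile


def file_size_and_checksum(path, chunk_size=1024 * 1024):
    """Return (size_in_bytes, hex_digest) for the file at `path`.

    Hypothetical helper: reads the file chunk by chunk so memory use
    stays bounded by `chunk_size`. SHA-256 is an assumed algorithm.
    """
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return size, digest.hexdigest()


# Usage: a small stand-in file, read with a tiny chunk size to force several reads.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"stand-in bytes for a Parquet file")
    tmp_path = tmp.name

size, checksum = file_size_and_checksum(tmp_path, chunk_size=8)
os.remove(tmp_path)
```

This only covers the metadata side. A multi-part upload would need its own handling and is not sketched here.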
Changelogs
This PR updates the file size and checksum set as metadata on a dataset's table content to use the Parquet file's attributes. This will let the Hub correctly check the size of the uploaded file and set the state of the dataset to ready.
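As a rough illustration of the change described above, the sketch below derives the size and checksum from the serialized bytes in an in-memory buffer rather than from the source data. The helper name `buffer_metadata`, the dictionary keys, and the use of SHA-256 are assumptions, not this project's actual API. In practice the buffer would presumably be filled by something like pandas' `DataFrame.to_parquet(buffer)`; plain stand-in bytes are used here so the example runs with the standard library alone.

```python
import hashlib
import io


def buffer_metadata(buffer: io.BytesIO) -> dict:
    """Return the size and checksum of the bytes held in `buffer`.

    Hypothetical helper: the key names and SHA-256 are assumptions.
    """
    # getvalue() returns the full contents regardless of the stream position.
    payload = buffer.getvalue()
    return {
        "size": len(payload),
        "checksum": hashlib.sha256(payload).hexdigest(),
    }


# Stand-in for a serialized Parquet file (real files start and end with b"PAR1").
buffer = io.BytesIO()
buffer.write(b"PAR1 stand-in bytes PAR1")

metadata = buffer_metadata(buffer)
```

The point is that both values describe exactly the bytes that get uploaded, which is what lets the receiving side verify the file size.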
Links
Checklist:
[ ] Add tests to cover the fixed bug(s) or the newly introduced feature(s) (if appropriate).
[ ] Update the API documentation if a new function is added, or an existing one is deleted.
[ ] … feature, fix or test (or ask a maintainer to do it for you).