Skip to content

feat(python): support file objects for read_json method#9573

Closed
Mause wants to merge 20 commits into
duckdb:mainfrom
Mause:09-27-feat_python_support_file_objects_for_read_json_method
Closed

feat(python): support file objects for read_json method#9573
Mause wants to merge 20 commits into
duckdb:mainfrom
Mause:09-27-feat_python_support_file_objects_for_read_json_method

Conversation

@Mause
Copy link
Copy Markdown
Contributor

@Mause Mause commented Nov 6, 2023

No description provided.

@Mause Mause requested a review from Tishj November 6, 2023 07:41
Copy link
Copy Markdown
Member

@Tishj Tishj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks good to me
Good thinking on the parametrized tests, solid way to easily add coverage for this!

@Mytherin
Copy link
Copy Markdown
Collaborator

Mytherin commented Nov 9, 2023

Thanks for the PR! Could you just have a look at the failing CI?

@Mause Mause force-pushed the 09-27-feat_python_support_file_objects_for_read_json_method branch from 1f9aed9 to dda97af Compare November 15, 2023 04:29
@Mytherin Mytherin changed the base branch from feature to main November 20, 2023 12:42
@github-actions github-actions Bot marked this pull request as draft January 19, 2024 11:17
@Mause
Copy link
Copy Markdown
Contributor Author

Mause commented Jan 19, 2024

I've had to make several changes here to avoid files being automatically closed, and to improve size inference. If I can't infer the size, I default back to -1. @lnkuiper thoughts?

@lnkuiper
Copy link
Copy Markdown
Member

@Mause if defaulting to -1 works then that seems fine, this will only affect the progress bar (I think?). The file size is used to determine the next number of bytes to read when reading files that are "seekable" (and therefore can be read in parallel).

Maybe it would be safer to make it non-seekable if the file size cannot be inferred? Then there shouldn't be issues at all.

@Mytherin Mytherin marked this pull request as ready for review January 19, 2024 22:23
@Mause
Copy link
Copy Markdown
Contributor Author

Mause commented Jan 23, 2024

@Mause if defaulting to -1 works then that seems fine, this will only affect the progress bar (I think?). The file size is used to determine the next number of bytes to read when reading files that are "seekable" (and therefore can be read in parallel).

I actually found that if I returned 0 as the file size, no rows were returned at all

Maybe it would be safer to make it non-seekable if the file size cannot be inferred? Then there shouldn't be issues at all.

We can look at this yeah, what would be the performance implication?

@lnkuiper
Copy link
Copy Markdown
Member

@Mause, yes. We cannot read from non-seekable files in parallel. Threads will need to wait for each other to read raw bytes from the file object, but after they have read a buffer, they can parse/convert independently (in parallel). So there's a performance implication, but it shouldn't be big because the file object is already in memory, not on disk.

I'm also doing some work on the JSON reader in #10300 currently, but I think the behavior for non-seekable files should be the same.

@github-actions github-actions Bot marked this pull request as draft February 1, 2024 16:23
…t_python_support_file_objects_for_read_json_method
@Mause Mause marked this pull request as ready for review February 5, 2024 07:51
@github-actions github-actions Bot marked this pull request as draft March 10, 2024 10:13
@Mause Mause changed the base branch from main to feature March 11, 2024 13:40
@Mause Mause marked this pull request as ready for review March 11, 2024 13:41
@Mytherin
Copy link
Copy Markdown
Collaborator

Mytherin commented Jun 4, 2024

@Mause it looks like there are still several test failures here, but I'm not sure if they are related to the actual PR. Do you want to pick this up again? If so could you resolve the merge conflicts?

@Mytherin Mytherin changed the base branch from feature to main June 21, 2024 12:39
@Mytherin
Copy link
Copy Markdown
Collaborator

Mytherin commented Aug 1, 2024

This has since been implemented in #13040

@Mytherin Mytherin closed this Aug 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants