-
Notifications
You must be signed in to change notification settings - Fork 2.8k
Additional list functions #8907
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
taniabogatsch
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @cryoEncryp. Thanks for the PR! I started to have a look and already added some comments. I'll continue with the other functions and your tests next week.
Also, out of curiosity, where did you take the function names from? Are there any analytical systems with similar functionalities? How do they call these functions? Just wondering about the aliases...
Also, I think that you should use 1-based indexing for grade_up. We decided to use 1-based indexing for lists a while ago to be consistent with SQL. See for example here:
D SELECT ([1, 2])[1];
┌──────────────────────────┐
│ main.list_value(1, 2)[1] │
│ int32 │
├──────────────────────────┤
│ 1 │
└──────────────────────────┘
D SELECT ([1, 2])[0];
┌──────────────────────────┐
│ main.list_value(1, 2)[0] │
│ int32 │
├──────────────────────────┤
│ │
└──────────────────────────┘There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While continuing the review of this, I realized that everything except grade_up can be implemented with list lambdas and their new index functionality. Before merging this PR, can you benchmark its performance gain against list lambdas? If the difference is insignificant, I suggest we add rewrites instead of new code that must be maintained.
This is the PR: #8851.
Here are the rewrites with our list_extract 1-based-indexing behavior:
D CREATE TABLE tbl (l INTEGER[], sel INTEGER[], where_l BOOLEAN[], zip VARCHAR[]);
D INSERT INTO tbl VALUES ([11, 22, NULL, 33], [1, -1, NULL, 1, 3, 4], [true, false, true, false], ['a', 'b', 'c', 'd']); ^
D SELECT list_filter(l, (x, x_i) -> where_l[x_i]) FROM tbl;
┌────────────────────────────────────────────────────┐
│ list_filter(l, (main.row(x, x_i) -> where_l[x_i])) │
│ int32[] │
├────────────────────────────────────────────────────┤
│ [11, NULL] │
└────────────────────────────────────────────────────┘
D SELECT list_transform(sel, (x) -> l[x]) FROM tbl;
┌──────────────────────────────────┐
│ list_transform(sel, (x -> l[x])) │
│ int32[] │
├──────────────────────────────────┤
│ [11, 33, NULL, 11, NULL, 33] │
└──────────────────────────────────┘
D SELECT list_transform(l, (x, x_i) -> (x, zip[x_i])) FROM tbl;
┌────────────────────────────────────────────────────────────────────────────────────┐
│ list_transform(l, (main.row(x, x_i) -> main.row(x, zip[x_i]))) │
│ struct(x integer, v2 varchar)[] │
├────────────────────────────────────────────────────────────────────────────────────┤
│ [{'x': 11, 'v2': a}, {'x': 22, 'v2': b}, {'x': NULL, 'v2': c}, {'x': 33, 'v2': d}] │
└────────────────────────────────────────────────────────────────────────────────────┘Either way (rewrites or new code), we should keep the tests. Can you change all statement error to have an explicit error message check? We introduced explicit error messages recently and used them for all new tests. They work like this:
statement error
MY_FAILING_SQL;
----
my error messageEDIT: I confused list_slice and list_extract.
I tried benchmarking the queries on my machine and got these time measurements. D CREATE TABLE tbl (l INTEGER[], sel INTEGER[], where_l BOOLEAN[], zip VARCHAR[]);
D INSERT INTO tbl VALUES (range(1000), range(1000), [true for x in range(1000)], ['a' for x in range(1000)]);
D SELECT list_filter(l, (x, x_i) -> where_l[x_i]) FROM tbl;
...
Run Time (s): real 0.225 user 0.222456 sys 0.003220
D SELECT list_transform(sel, (x) -> l[x]) FROM tbl;
...
Run Time (s): real 0.227 user 0.431249 sys 0.003793
D SELECT list_transform(l, (x, x_i) -> (x, zip[x_i])) FROM tbl;
...
Run Time (s): real 0.630 user 1.223273 sys 0.008493For the scaling proposal, if we change 1000 to 10000. D SELECT list_filter(l, (x, x_i) -> where_l[x_i]) FROM tb2;
...
Run Time (s): real 25.817 user 25.462564 sys 0.350333
D SELECT list_transform(sel, (x) -> l[x]) FROM tb2;
...
Run Time (s): real 25.932 user 25.539946 sys 0.386648
D SELECT list_transform(l, (x, x_i) -> (x, zip[x_i])) FROM tb2;
...
Run Time (s): real 64.306 user 63.449924 sys 0.834758Implementation in my branch: D SELECT list_where(l, where_l) FROM tbl;
...
Run Time (s): real 0.027 user 0.026074 sys 0.001398
D SELECT list_select(l, sel) FROM tbl;
...
Run Time (s): real 0.028 user 0.027922 sys 0.000636
D SELECT list_zip(l, zip) FROM tbl;
...
Run Time (s): real 0.039 user 0.045414 sys 0.001267Change 1000 to 10000: D SELECT list_where(l, where_l) FROM tbl;
...
Run Time (s): real 0.056 user 0.055299 sys 0.001724
D SELECT list_select(l, sel) FROM tbl;
...
Run Time (s): real 0.055 user 0.055458 sys 0.000807
D SELECT list_zip(l, zip) FROM tbl;
...
Run Time (s): real 0.120 user 0.163200 sys 0.001673Should the functions still be implemented by means of lambdas, or is the implementation in my branch all right?
Edit: Change table header |
|
@cryoEncryp, thanks for running the benchmarks! It's reassuring to see that your implementation executes much faster. 😄 The list lambdas still need a lot of performance refactoring, so even though they might become more compatible in the following months, your implementations are still significantly faster. |
taniabogatsch
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @cryoEncryp. I finally went through all the files and left more comments. This is coming together nicely. I realize I'm requesting many changes that aren't critical to the solution 'just working'. But since we decided to add these functions despite them being possible with lambdas, it's essential to put some effort into making them perform very fast.
Again, thanks for your work!
taniabogatsch
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @cryoEncryp! Thanks for implementing all my requests, this PR looks almost ready now! I've added a few more comments, and I'll ping you once I have a PR up to fix that one failing test.
|
Also, this might interest you in general (or for your thesis). We have a PR up that we'll eventually merge to support fixed-size lists (#8983). Especially for We discussed if adding this optimization would benefit this PR (you'd have to check if all lists are the same length for each row first) but decided it was not worth the overhead for the non-equal lists. |
|
We just merged #9288, so you can now add |
taniabogatsch
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR looks good to me now! Thanks for all the work on it @cryoEncryp.
@Mytherin, we need to rebase this to the feature branch, but then it can go in (from my side).
|
@cryoEncryp, can you please merge |
|
Thanks! |
Merge pull request duckdb/duckdb#9164 from Mause/feature/jdbc-uuid-param Merge pull request duckdb/duckdb#9185 from pdet/adbc_07 Merge pull request duckdb/duckdb#9126 from Maxxen/parquet-kv-metadata Merge pull request duckdb/duckdb#9123 from lnkuiper/parquet_schema Merge pull request duckdb/duckdb#9086 from lnkuiper/json_inconsistent_structure Merge pull request duckdb/duckdb#8977 from Tishj/python_readcsv_multi_v2 Merge pull request duckdb/duckdb#9279 from hawkfish/nsdate-cast Merge pull request duckdb/duckdb#8851 from taniabogatsch/binary_lambdas Merge pull request duckdb/duckdb#8983 from Maxxen/types/fixedsizelist Merge pull request duckdb/duckdb#9318 from Maxxen/fix-unused Merge pull request duckdb/duckdb#9220 from hawkfish/exclude Merge pull request duckdb/duckdb#9230 from Maxxen/json-plan-serialization Merge pull request duckdb/duckdb#9011 from Tmonster/add_create_statement_support_to_fuzzer Merge pull request duckdb/duckdb#9400 from Maxxen/array-fixes Merge pull request duckdb/duckdb#8741 from Tishj/python_import_cache_upgrade Merge fixes Merge pull request duckdb/duckdb#9395 from taniabogatsch/lambda-performance Merge pull request duckdb/duckdb#9427 from Tishj/python_table_support_replacement_scan Merge pull request duckdb/duckdb#9516 from carlopi/fixformat Merge pull request duckdb/duckdb#9485 from Maxxen/fix-parquet-serialization Merge pull request duckdb/duckdb#9388 from chrisiou/issue217 Merge pull request duckdb/duckdb#9565 from Maxxen/fix-array-vector-sizes Merge pull request duckdb/duckdb#9583 from carlopi/feature Merge pull request duckdb/duckdb#8907 from cryoEncryp/new-list-functions Merge pull request duckdb/duckdb#8642 from Virgiel/capi-streaming-arrow Merge pull request duckdb/duckdb#8658 from Tishj/pytype_optional Merge pull request duckdb/duckdb#9040 from Light-City/feature/set_mg
As part of my Master's thesis for database systems chair at the University of Tübingen, I've implemented a few list functions:
list_where,list_select,list_zip, andlist_grade_up.
list_where(ANY[], BOOL[]) -> ANY[]list_select(ANY[], INT[]) -> ANY[]list_zip(ANY[], ..., ANY[]) -> struct(ANY, ..., ANY)[]list_grade_up(ANY[]) -> INT[]
Examples:
To reach me quicker than here on GitHub you can contact me on Discord, my tag is
bambi4.