Skip to content

Conversation

@cryoEncryp
Copy link
Contributor

As part of my Master's thesis for database systems chair at the University of Tübingen, I've implemented a few list functions: list_where, list_select, list_zip, and list_grade_up.

  1. list_where(ANY[], BOOL[]) -> ANY[]
    • With the booleans we can apply a mask to the first list.
  2. list_select(ANY[], INT[]) -> ANY[]
    • With the integers we can pick elements by index from a list.
  3. list_zip(ANY[], ..., ANY[]) -> struct(ANY, ..., ANY)[]
    • Produce a new list of structs where each struct consists of one element from every input list.
  4. list_grade_up(ANY[]) -> INT[]
    • Works like sort, but the results are the indexes instead of the actual values.

      Examples:
SELECT list_where([1,2,3,4], [true, false, false, true]); -- [1,4]
SELECT list_select([1,2,3,4], [0,2,3,1]); -- [1,3,4,2]
SELECT list_zip([1,2,3], ['a', 'b', 'c']); -- [struct(1, 'a'), struct(2, 'b'), struct(3, 'c')]
SELECT list_grade_up([3,2,1]); -- [2,1,0]


To reach me quicker than here on GitHub you can contact me on Discord, my tag is bambi4.

@github-actions github-actions bot marked this pull request as draft September 13, 2023 09:53
@cryoEncryp cryoEncryp marked this pull request as ready for review September 13, 2023 11:23
@github-actions github-actions bot marked this pull request as draft September 14, 2023 08:44
@cryoEncryp cryoEncryp changed the base branch from master to main September 14, 2023 12:05
@cryoEncryp cryoEncryp marked this pull request as ready for review September 14, 2023 14:09
Copy link
Contributor

@taniabogatsch taniabogatsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @cryoEncryp. Thanks for the PR! I started to have a look and already added some comments. I'll continue with the other functions and your tests next week.

Also, out of curiosity, where did you take the function names from? Are there any analytical systems with similar functionalities? How do they call these functions? Just wondering about the aliases...

Also, I think that you should use 1-based indexing for grade_up. We decided to use 1-based indexing for lists a while ago to be consistent with SQL. See for example here:

D SELECT ([1, 2])[1];
┌──────────────────────────┐
│ main.list_value(1, 2)[1] │
│          int32           │
├──────────────────────────┤
│                        1 │
└──────────────────────────┘
D SELECT ([1, 2])[0];
┌──────────────────────────┐
│ main.list_value(1, 2)[0] │
│          int32           │
├──────────────────────────┤
│                          │
└──────────────────────────┘

Copy link
Contributor

@taniabogatsch taniabogatsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While continuing the review of this, I realized that everything except grade_up can be implemented with list lambdas and their new index functionality. Before merging this PR, can you benchmark its performance gain against list lambdas? If the difference is insignificant, I suggest we add rewrites instead of new code that must be maintained.

This is the PR: #8851.
Here are the rewrites with our list_extract 1-based-indexing behavior:

D CREATE TABLE tbl (l INTEGER[], sel INTEGER[], where_l BOOLEAN[], zip VARCHAR[]);
D INSERT INTO tbl VALUES ([11, 22, NULL, 33], [1, -1, NULL, 1, 3, 4], [true, false, true, false], ['a', 'b', 'c', 'd']);               ^
D SELECT list_filter(l, (x, x_i) -> where_l[x_i]) FROM tbl; 
┌────────────────────────────────────────────────────┐
│ list_filter(l, (main.row(x, x_i) -> where_l[x_i])) │
│                      int32[]                       │
├────────────────────────────────────────────────────┤
│ [11, NULL]                                         │
└────────────────────────────────────────────────────┘
D SELECT list_transform(sel, (x) -> l[x]) FROM tbl;
┌──────────────────────────────────┐
│ list_transform(sel, (x -> l[x])) │
│             int32[]              │
├──────────────────────────────────┤
│ [11, 33, NULL, 11, NULL, 33]     │
└──────────────────────────────────┘                               
D SELECT list_transform(l, (x, x_i) -> (x, zip[x_i])) FROM tbl;
┌────────────────────────────────────────────────────────────────────────────────────┐
│           list_transform(l, (main.row(x, x_i) -> main.row(x, zip[x_i])))           │
│                          struct(x integer, v2 varchar)[]                           │
├────────────────────────────────────────────────────────────────────────────────────┤
│ [{'x': 11, 'v2': a}, {'x': 22, 'v2': b}, {'x': NULL, 'v2': c}, {'x': 33, 'v2': d}] │
└────────────────────────────────────────────────────────────────────────────────────┘

Either way (rewrites or new code), we should keep the tests. Can you change all statement error to have an explicit error message check? We introduced explicit error messages recently and used them for all new tests. They work like this:

statement error
MY_FAILING_SQL;
----
my error message

EDIT: I confused list_slice and list_extract.

@cryoEncryp
Copy link
Contributor Author

cryoEncryp commented Sep 18, 2023

While continuing the review of this, I realized that everything except grade_up can be implemented with list lambdas and their new index functionality. Before merging this PR, can you benchmark its performance gain against list lambdas? If the difference is insignificant, I suggest we add rewrites instead of new code that must be maintained.

This is the PR: #8851. Here are the rewrites with our list_extract 1-based-indexing behavior:

D CREATE TABLE tbl (l INTEGER[], sel INTEGER[], where_l BOOLEAN[], zip VARCHAR[]);
D INSERT INTO tbl VALUES ([11, 22, NULL, 33], [1, -1, NULL, 1, 3, 4], [true, false, true, false], ['a', 'b', 'c', 'd']);               ^
D SELECT list_filter(l, (x, x_i) -> where_l[x_i]) FROM tbl; 
┌────────────────────────────────────────────────────┐
│ list_filter(l, (main.row(x, x_i) -> where_l[x_i])) │
│                      int32[]                       │
├────────────────────────────────────────────────────┤
│ [11, NULL]                                         │
└────────────────────────────────────────────────────┘
D SELECT list_transform(sel, (x) -> l[x]) FROM tbl;
┌──────────────────────────────────┐
│ list_transform(sel, (x -> l[x])) │
│             int32[]              │
├──────────────────────────────────┤
│ [11, 33, NULL, 11, NULL, 33]     │
└──────────────────────────────────┘                               
D SELECT list_transform(l, (x, x_i) -> (x, zip[x_i])) FROM tbl;
┌────────────────────────────────────────────────────────────────────────────────────┐
│           list_transform(l, (main.row(x, x_i) -> main.row(x, zip[x_i])))           │
│                          struct(x integer, v2 varchar)[]                           │
├────────────────────────────────────────────────────────────────────────────────────┤
│ [{'x': 11, 'v2': a}, {'x': 22, 'v2': b}, {'x': NULL, 'v2': c}, {'x': 33, 'v2': d}] │
└────────────────────────────────────────────────────────────────────────────────────┘

Either way (rewrites or new code), we should keep the tests. Can you change all statement error to have an explicit error message check? We introduced explicit error messages recently and used them for all new tests. They work like this:

statement error
MY_FAILING_SQL;
----
my error message

EDIT: I confused list_slice and list_extract.

I tried benchmarking the queries on my machine and got these time measurements.
Implementation with Lambdas:

D CREATE TABLE tbl (l INTEGER[], sel INTEGER[], where_l BOOLEAN[], zip VARCHAR[]);
D INSERT INTO tbl VALUES (range(1000), range(1000), [true for x in range(1000)], ['a' for x in range(1000)]);
D SELECT list_filter(l, (x, x_i) -> where_l[x_i]) FROM tbl;
...
Run Time (s): real 0.225 user 0.222456 sys 0.003220
D SELECT list_transform(sel, (x) -> l[x]) FROM tbl;
...
Run Time (s): real 0.227 user 0.431249 sys 0.003793
D SELECT list_transform(l, (x, x_i) -> (x, zip[x_i])) FROM tbl;
...
Run Time (s): real 0.630 user 1.223273 sys 0.008493

For the scaling proposal, if we change 1000 to 10000.

D SELECT list_filter(l, (x, x_i) -> where_l[x_i]) FROM tb2;
...
Run Time (s): real 25.817 user 25.462564 sys 0.350333
D SELECT list_transform(sel, (x) -> l[x]) FROM tb2;
...
Run Time (s): real 25.932 user 25.539946 sys 0.386648
D SELECT list_transform(l, (x, x_i) -> (x, zip[x_i])) FROM tb2;
...
Run Time (s): real 64.306 user 63.449924 sys 0.834758

Implementation in my branch:

D SELECT list_where(l, where_l) FROM tbl;
...
Run Time (s): real 0.027 user 0.026074 sys 0.001398
D SELECT list_select(l, sel) FROM tbl;
...
Run Time (s): real 0.028 user 0.027922 sys 0.000636
D SELECT list_zip(l, zip) FROM tbl;
...
Run Time (s): real 0.039 user 0.045414 sys 0.001267

Change 1000 to 10000:

D SELECT list_where(l, where_l) FROM tbl;
...
Run Time (s): real 0.056 user 0.055299 sys 0.001724
D SELECT list_select(l, sel) FROM tbl;
...
Run Time (s): real 0.055 user 0.055458 sys 0.000807
D SELECT list_zip(l, zip) FROM tbl;
...
Run Time (s): real 0.120 user 0.163200 sys 0.001673

Should the functions still be implemented by means of lambdas, or is the implementation in my branch all right?

10^3+λ 10^4+λ 10^3 10^4
where 0.225 25.817 0.027 0.056
select 0.227 25.932 0.028 0.055
zip 0.630 64.306 0.039 0.120

Edit: Change table header

@taniabogatsch
Copy link
Contributor

@cryoEncryp, thanks for running the benchmarks! It's reassuring to see that your implementation executes much faster. 😄 The list lambdas still need a lot of performance refactoring, so even though they might become more compatible in the following months, your implementations are still significantly faster.

@github-actions github-actions bot marked this pull request as draft September 21, 2023 15:23
Copy link
Contributor

@taniabogatsch taniabogatsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @cryoEncryp. I finally went through all the files and left more comments. This is coming together nicely. I realize I'm requesting many changes that aren't critical to the solution 'just working'. But since we decided to add these functions despite them being possible with lambdas, it's essential to put some effort into making them perform very fast.

Again, thanks for your work!

Copy link
Contributor

@taniabogatsch taniabogatsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @cryoEncryp! Thanks for implementing all my requests, this PR looks almost ready now! I've added a few more comments, and I'll ping you once I have a PR up to fix that one failing test.

@taniabogatsch
Copy link
Contributor

Also, this might interest you in general (or for your thesis). We have a PR up that we'll eventually merge to support fixed-size lists (#8983). Especially for zip, this would allow us to reference the list of child vectors in the resulting struct, making it highly efficient!

We discussed if adding this optimization would benefit this PR (you'd have to check if all lists are the same length for each row first) but decided it was not worth the overhead for the non-equal lists.

@taniabogatsch
Copy link
Contributor

We just merged #9288, so you can now add PRAGMA enable_verification to your list_select tests.

@github-actions github-actions bot marked this pull request as draft October 17, 2023 12:40
@cryoEncryp cryoEncryp marked this pull request as ready for review October 17, 2023 13:05
@github-actions github-actions bot marked this pull request as draft October 17, 2023 13:41
@cryoEncryp cryoEncryp marked this pull request as ready for review October 17, 2023 13:56
@taniabogatsch taniabogatsch added the Needs Documentation Use for issues or PRs that require changes in the documentation label Oct 31, 2023
Copy link
Contributor

@taniabogatsch taniabogatsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR looks good to me now! Thanks for all the work on it @cryoEncryp.

@Mytherin, we need to rebase this to the feature branch, but then it can go in (from my side).

@Mytherin Mytherin changed the base branch from main to feature October 31, 2023 10:25
@github-actions github-actions bot marked this pull request as draft November 1, 2023 08:05
@Mytherin Mytherin marked this pull request as ready for review November 1, 2023 08:05
@taniabogatsch
Copy link
Contributor

@cryoEncryp, can you please merge feature into this PR, as the failing tests are unrelated and fixed on that branch? Then, if CI passes, I believe that we can merge this.

@github-actions github-actions bot marked this pull request as draft November 7, 2023 09:59
@cryoEncryp cryoEncryp marked this pull request as ready for review November 7, 2023 09:59
@Mytherin Mytherin merged commit b0dbd9b into duckdb:feature Nov 8, 2023
@Mytherin
Copy link
Collaborator

Mytherin commented Nov 8, 2023

Thanks!

krlmlr added a commit to duckdb/duckdb-r that referenced this pull request Dec 11, 2023
Merge pull request duckdb/duckdb#9164 from Mause/feature/jdbc-uuid-param
Merge pull request duckdb/duckdb#9185 from pdet/adbc_07
Merge pull request duckdb/duckdb#9126 from Maxxen/parquet-kv-metadata
Merge pull request duckdb/duckdb#9123 from lnkuiper/parquet_schema
Merge pull request duckdb/duckdb#9086 from lnkuiper/json_inconsistent_structure
Merge pull request duckdb/duckdb#8977 from Tishj/python_readcsv_multi_v2
Merge pull request duckdb/duckdb#9279 from hawkfish/nsdate-cast
Merge pull request duckdb/duckdb#8851 from taniabogatsch/binary_lambdas
Merge pull request duckdb/duckdb#8983 from Maxxen/types/fixedsizelist
Merge pull request duckdb/duckdb#9318 from Maxxen/fix-unused
Merge pull request duckdb/duckdb#9220 from hawkfish/exclude
Merge pull request duckdb/duckdb#9230 from Maxxen/json-plan-serialization
Merge pull request duckdb/duckdb#9011 from Tmonster/add_create_statement_support_to_fuzzer
Merge pull request duckdb/duckdb#9400 from Maxxen/array-fixes
Merge pull request duckdb/duckdb#8741 from Tishj/python_import_cache_upgrade
Merge fixes
Merge pull request duckdb/duckdb#9395 from taniabogatsch/lambda-performance
Merge pull request duckdb/duckdb#9427 from Tishj/python_table_support_replacement_scan
Merge pull request duckdb/duckdb#9516 from carlopi/fixformat
Merge pull request duckdb/duckdb#9485 from Maxxen/fix-parquet-serialization
Merge pull request duckdb/duckdb#9388 from chrisiou/issue217
Merge pull request duckdb/duckdb#9565 from Maxxen/fix-array-vector-sizes
Merge pull request duckdb/duckdb#9583 from carlopi/feature
Merge pull request duckdb/duckdb#8907 from cryoEncryp/new-list-functions
Merge pull request duckdb/duckdb#8642 from Virgiel/capi-streaming-arrow
Merge pull request duckdb/duckdb#8658 from Tishj/pytype_optional
Merge pull request duckdb/duckdb#9040 from Light-City/feature/set_mg
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Needs Documentation Use for issues or PRs that require changes in the documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants