Additional list functions #8907

cryoEncryp · 2023-09-13T08:38:44Z

As part of my Master's thesis at the database systems chair of the University of Tübingen, I've implemented a few list functions: list_where, list_select, list_zip, and list_grade_up.

list_where(ANY[], BOOL[]) -> ANY[]
- With the booleans we can apply a mask to the first list.
list_select(ANY[], INT[]) -> ANY[]
- With the integers we can pick elements by index from a list.
list_zip(ANY[], ..., ANY[]) -> struct(ANY, ..., ANY)[]
- Produce a new list of structs where each struct consists of one element from every input list.
list_grade_up(ANY[]) -> INT[]
- Works like sort, but the results are the indexes instead of the actual values.
  
Examples:

SELECT list_where([1,2,3,4], [true, false, false, true]); -- [1,4]
SELECT list_select([1,2,3,4], [0,2,3,1]); -- [1,3,4,2]
SELECT list_zip([1,2,3], ['a', 'b', 'c']); -- [struct(1, 'a'), struct(2, 'b'), struct(3, 'c')]
SELECT list_grade_up([3,2,1]); -- [2,1,0]
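The four descriptions above can be restated as a small pure-Python model. This is only a sketch of the semantics as proposed in this comment (0-based indexes, tuples standing in for structs), not the C++ implementation in the PR. The length check in list_where and the None padding in list_zip are assumptions of this sketch.

```python
from itertools import zip_longest


def list_where(values, mask):
    """Keep values[i] wherever mask[i] is true."""
    # Assumption of this sketch: both lists must have the same length.
    if len(values) != len(mask):
        raise ValueError("values and mask must have the same length")
    return [v for v, keep in zip(values, mask) if keep]


def list_select(values, indexes):
    """Pick elements by 0-based index, as in the examples above."""
    return [values[i] for i in indexes]


def list_zip(*lists):
    """Build one tuple per position from all input lists."""
    # Assumption of this sketch: shorter lists are padded with None.
    return list(zip_longest(*lists))


def list_grade_up(values):
    """Return the 0-based indexes that would sort values ascending."""
    return sorted(range(len(values)), key=lambda i: values[i])


print(list_where([1, 2, 3, 4], [True, False, False, True]))
print(list_select([1, 2, 3, 4], [0, 2, 3, 1]))
print(list_zip([1, 2, 3], ["a", "b", "c"]))
print(list_grade_up([3, 2, 1]))
```

The four print calls mirror the four SQL examples above.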

To reach me quicker than here on GitHub you can contact me on Discord, my tag is bambi4.

taniabogatsch

Hey @cryoEncryp. Thanks for the PR! I started to have a look and already added some comments. I'll continue with the other functions and your tests next week.

Also, out of curiosity, where did you take the function names from? Are there any analytical systems with similar functionality? What do they call these functions? Just wondering about the aliases...

Also, I think that you should use 1-based indexing for grade_up. We decided to use 1-based indexing for lists a while ago to be consistent with SQL. See for example here:

D SELECT ([1, 2])[1];
┌──────────────────────────┐
│ main.list_value(1, 2)[1] │
│          int32           │
├──────────────────────────┤
│                        1 │
└──────────────────────────┘
D SELECT ([1, 2])[0];
┌──────────────────────────┐
│ main.list_value(1, 2)[0] │
│          int32           │
├──────────────────────────┤
│                          │
└──────────────────────────┘
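To make the indexing point concrete, here is a minimal Python sketch of 1-based extraction and of a grade_up that returns 1-based positions. None stands in for SQL NULL. The function names are illustrative, negative indexes are deliberately not modeled here, and this is not DuckDB's code.

```python
def list_extract(values, index):
    """1-based element access; index 0 or out of range yields None (NULL)."""
    if index is None or index < 1 or index > len(values):
        return None
    return values[index - 1]


def list_grade_up(values):
    """Return the 1-based positions that would sort values ascending."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    return [i + 1 for i in order]


print(list_extract([1, 2], 1))
print(list_extract([1, 2], 0))
print(list_grade_up([30, 10, 20]))
```

With 1-based results, every position returned by grade_up can be passed straight back to an extract on the same list.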

src/core_functions/scalar/list/functions.json

src/core_functions/scalar/list/list_sort.cpp

taniabogatsch

While continuing the review of this, I realized that everything except grade_up can be implemented with list lambdas and their new index functionality. Before merging this PR, can you benchmark its performance gain against list lambdas? If the difference is insignificant, I suggest we add rewrites instead of new code that must be maintained.

This is the PR: #8851.
Here are the rewrites with our list_extract 1-based-indexing behavior:

D CREATE TABLE tbl (l INTEGER[], sel INTEGER[], where_l BOOLEAN[], zip VARCHAR[]);
D INSERT INTO tbl VALUES ([11, 22, NULL, 33], [1, -1, NULL, 1, 3, 4], [true, false, true, false], ['a', 'b', 'c', 'd']);
D SELECT list_filter(l, (x, x_i) -> where_l[x_i]) FROM tbl;
┌────────────────────────────────────────────────────┐
│ list_filter(l, (main.row(x, x_i) -> where_l[x_i])) │
│                      int32[]                       │
├────────────────────────────────────────────────────┤
│ [11, NULL]                                         │
└────────────────────────────────────────────────────┘
D SELECT list_transform(sel, (x) -> l[x]) FROM tbl;
┌──────────────────────────────────┐
│ list_transform(sel, (x -> l[x])) │
│             int32[]              │
├──────────────────────────────────┤
│ [11, 33, NULL, 11, NULL, 33]     │
└──────────────────────────────────┘                               
D SELECT list_transform(l, (x, x_i) -> (x, zip[x_i])) FROM tbl;
┌────────────────────────────────────────────────────────────────────────────────────┐
│           list_transform(l, (main.row(x, x_i) -> main.row(x, zip[x_i])))           │
│                          struct(x integer, v2 varchar)[]                           │
├────────────────────────────────────────────────────────────────────────────────────┤
│ [{'x': 11, 'v2': a}, {'x': 22, 'v2': b}, {'x': NULL, 'v2': c}, {'x': 33, 'v2': d}] │
└────────────────────────────────────────────────────────────────────────────────────┘
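The three rewrites can be traced in plain Python to see why they produce the results shown. This is a sketch under the following assumptions: None models NULL, tuples model structs, and the helper name extract is hypothetical. The helper follows the behavior visible in the output above (1-based, with -1 selecting the last element).

```python
def extract(values, index):
    """1-based access; negative counts from the end; None models NULL."""
    if index is None or index == 0:
        return None
    pos = index - 1 if index > 0 else len(values) + index
    return values[pos] if 0 <= pos < len(values) else None


# Same data as the single row inserted into tbl above.
l = [11, 22, None, 33]
sel = [1, -1, None, 1, 3, 4]
where_l = [True, False, True, False]
zip_col = ["a", "b", "c", "d"]

# list_filter(l, (x, x_i) -> where_l[x_i])
filtered = [x for x_i, x in enumerate(l, start=1) if extract(where_l, x_i)]

# list_transform(sel, (x) -> l[x])
selected = [extract(l, x) for x in sel]

# list_transform(l, (x, x_i) -> (x, zip[x_i]))
zipped = [(x, extract(zip_col, x_i)) for x_i, x in enumerate(l, start=1)]

print(filtered)
print(selected)
print(zipped)
```

Note that the second NULL in the list_transform result comes from index 3 pointing at a NULL element of l, not from an invalid index.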

Either way (rewrites or new code), we should keep the tests. Can you change all statement error tests to include an explicit error message check? We introduced explicit error messages recently and use them for all new tests. They work like this:

statement error
MY_FAILING_SQL;
----
my error message

EDIT: I confused list_slice and list_extract.

test/sql/function/list/list_grade_up.test

test/sql/function/list/list_select.test

src/function/scalar/list/list_select.cpp

cryoEncryp · 2023-09-18T15:19:41Z

I tried benchmarking the queries on my machine and got these time measurements.
Implementation with Lambdas:

D CREATE TABLE tbl (l INTEGER[], sel INTEGER[], where_l BOOLEAN[], zip VARCHAR[]);
D INSERT INTO tbl VALUES (range(1000), range(1000), [true for x in range(1000)], ['a' for x in range(1000)]);
D SELECT list_filter(l, (x, x_i) -> where_l[x_i]) FROM tbl;
...
Run Time (s): real 0.225 user 0.222456 sys 0.003220
D SELECT list_transform(sel, (x) -> l[x]) FROM tbl;
...
Run Time (s): real 0.227 user 0.431249 sys 0.003793
D SELECT list_transform(l, (x, x_i) -> (x, zip[x_i])) FROM tbl;
...
Run Time (s): real 0.630 user 1.223273 sys 0.008493

To check how this scales, I changed 1000 to 10000 (table tb2):

D SELECT list_filter(l, (x, x_i) -> where_l[x_i]) FROM tb2;
...
Run Time (s): real 25.817 user 25.462564 sys 0.350333
D SELECT list_transform(sel, (x) -> l[x]) FROM tb2;
...
Run Time (s): real 25.932 user 25.539946 sys 0.386648
D SELECT list_transform(l, (x, x_i) -> (x, zip[x_i])) FROM tb2;
...
Run Time (s): real 64.306 user 63.449924 sys 0.834758

Implementation in my branch:

D SELECT list_where(l, where_l) FROM tbl;
...
Run Time (s): real 0.027 user 0.026074 sys 0.001398
D SELECT list_select(l, sel) FROM tbl;
...
Run Time (s): real 0.028 user 0.027922 sys 0.000636
D SELECT list_zip(l, zip) FROM tbl;
...
Run Time (s): real 0.039 user 0.045414 sys 0.001267

Change 1000 to 10000:

D SELECT list_where(l, where_l) FROM tbl;
...
Run Time (s): real 0.056 user 0.055299 sys 0.001724
D SELECT list_select(l, sel) FROM tbl;
...
Run Time (s): real 0.055 user 0.055458 sys 0.000807
D SELECT list_zip(l, zip) FROM tbl;
...
Run Time (s): real 0.120 user 0.163200 sys 0.001673

Should the functions still be implemented by means of lambdas, or is the implementation in my branch all right?

Real run time in seconds ("+λ" = lambda rewrite, otherwise the implementation in my branch):

          10^3+λ    10^4+λ    10^3     10^4
where     0.225     25.817    0.027    0.056
select    0.227     25.932    0.028    0.055
zip       0.630     64.306    0.039    0.120

Edit: Change table header
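The table is easier to interpret as ratios. The sketch below only does arithmetic on the measured real times. The dictionary layout and the function name summarize are illustrative. Based on these numbers, the branch is about 8x to 16x faster at 10^3 elements and about 460x to 540x faster at 10^4. For ten times the input, the lambda rewrites take roughly 100x to 115x longer, while the functions in the branch take roughly 2x to 3x longer.

```python
# Real run times in seconds, copied from the table above:
# (time at 10^3 elements, time at 10^4 elements)
TIMINGS = {
    "where": {"lambda": (0.225, 25.817), "native": (0.027, 0.056)},
    "select": {"lambda": (0.227, 25.932), "native": (0.028, 0.055)},
    "zip": {"lambda": (0.630, 64.306), "native": (0.039, 0.120)},
}


def summarize(timings):
    """Compute speedups (lambda / native) and growth factors (10^4 / 10^3)."""
    summary = {}
    for name, t in timings.items():
        lam_small, lam_large = t["lambda"]
        nat_small, nat_large = t["native"]
        summary[name] = {
            "speedup_1e3": lam_small / nat_small,
            "speedup_1e4": lam_large / nat_large,
            "lambda_growth": lam_large / lam_small,
            "native_growth": nat_large / nat_small,
        }
    return summary


for name, s in summarize(TIMINGS).items():
    print(
        f"{name:<6} speedup 10^3: {s['speedup_1e3']:.1f}x, "
        f"speedup 10^4: {s['speedup_1e4']:.1f}x, "
        f"lambda growth: {s['lambda_growth']:.1f}x, "
        f"native growth: {s['native_growth']:.1f}x"
    )
```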

taniabogatsch · 2023-09-19T08:36:24Z

@cryoEncryp, thanks for running the benchmarks! It's reassuring to see that your implementation executes much faster. 😄 The list lambdas still need a lot of performance refactoring, so even though they might become more competitive in the following months, your implementations are still significantly faster.

taniabogatsch

Hey @cryoEncryp. I finally went through all the files and left more comments. This is coming together nicely. I realize I'm requesting many changes that aren't critical to the solution 'just working'. But since we decided to add these functions despite them being possible with lambdas, it's essential to put some effort into making them perform very fast.

Again, thanks for your work!

src/core_functions/scalar/list/list_sort.cpp

src/function/scalar/list/list_select.cpp

src/function/scalar/list/list_where.cpp

src/function/scalar/list/list_zip.cpp

test/sql/function/list/list_select.test

src/function/scalar/list/list_zip.cpp

test/sql/function/list/list_where.test

test/sql/function/list/list_zip.test

taniabogatsch

Hi @cryoEncryp! Thanks for implementing all my requests, this PR looks almost ready now! I've added a few more comments, and I'll ping you once I have a PR up to fix that one failing test.

src/function/scalar/list/list_select.cpp

src/function/scalar/list/list_zip.cpp

test/sql/function/list/list_select.test

src/function/scalar/list/list_select.cpp

taniabogatsch · 2023-10-09T11:58:07Z

Also, this might interest you in general (or for your thesis). We have a PR up that we'll eventually merge to support fixed-size lists (#8983). Especially for zip, this would allow us to reference the list of child vectors in the resulting struct, making it highly efficient!

We discussed whether adding this optimization would benefit this PR (you'd have to check first that all lists in each row have the same length) but decided it was not worth the overhead for rows with lists of unequal length.
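The pre-check mentioned here can be pictured as follows. This is only a Python sketch of the idea, with a hypothetical function name and with each column modeled as a list of per-row lists. It says nothing about how DuckDB would implement it on vectors.

```python
def rows_have_equal_lengths(*columns):
    """Return True if, in every row, all input lists have the same length.

    Each column is a list of per-row lists; rows are matched by position.
    """
    return all(
        len({len(lst) for lst in row}) <= 1
        for row in zip(*columns)
    )


col_a = [[1, 2], [3]]
col_b = [["a", "b"], ["c"]]
col_c = [["a", "b"], ["c", "d"]]

print(rows_have_equal_lengths(col_a, col_b))
print(rows_have_equal_lengths(col_a, col_c))
```

The check has to visit every row before any zipping starts, which is the overhead weighed in the comment above.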

taniabogatsch · 2023-10-11T09:28:18Z

We just merged #9288, so you can now add PRAGMA enable_verification to your list_select tests.

taniabogatsch

This PR looks good to me now! Thanks for all the work on it @cryoEncryp.

@Mytherin, we need to rebase this to the feature branch, but then it can go in (from my side).

taniabogatsch · 2023-11-07T09:33:06Z

@cryoEncryp, can you please merge feature into this PR, as the failing tests are unrelated and fixed on that branch? Then, if CI passes, I believe that we can merge this.

Mytherin · 2023-11-08T11:56:36Z

Thanks!

Merge pull request duckdb/duckdb#9164 from Mause/feature/jdbc-uuid-param
Merge pull request duckdb/duckdb#9185 from pdet/adbc_07
Merge pull request duckdb/duckdb#9126 from Maxxen/parquet-kv-metadata
Merge pull request duckdb/duckdb#9123 from lnkuiper/parquet_schema
Merge pull request duckdb/duckdb#9086 from lnkuiper/json_inconsistent_structure
Merge pull request duckdb/duckdb#8977 from Tishj/python_readcsv_multi_v2
Merge pull request duckdb/duckdb#9279 from hawkfish/nsdate-cast
Merge pull request duckdb/duckdb#8851 from taniabogatsch/binary_lambdas
Merge pull request duckdb/duckdb#8983 from Maxxen/types/fixedsizelist
Merge pull request duckdb/duckdb#9318 from Maxxen/fix-unused
Merge pull request duckdb/duckdb#9220 from hawkfish/exclude
Merge pull request duckdb/duckdb#9230 from Maxxen/json-plan-serialization
Merge pull request duckdb/duckdb#9011 from Tmonster/add_create_statement_support_to_fuzzer
Merge pull request duckdb/duckdb#9400 from Maxxen/array-fixes
Merge pull request duckdb/duckdb#8741 from Tishj/python_import_cache_upgrade
Merge fixes
Merge pull request duckdb/duckdb#9395 from taniabogatsch/lambda-performance
Merge pull request duckdb/duckdb#9427 from Tishj/python_table_support_replacement_scan
Merge pull request duckdb/duckdb#9516 from carlopi/fixformat
Merge pull request duckdb/duckdb#9485 from Maxxen/fix-parquet-serialization
Merge pull request duckdb/duckdb#9388 from chrisiou/issue217
Merge pull request duckdb/duckdb#9565 from Maxxen/fix-array-vector-sizes
Merge pull request duckdb/duckdb#9583 from carlopi/feature
Merge pull request duckdb/duckdb#8907 from cryoEncryp/new-list-functions
Merge pull request duckdb/duckdb#8642 from Virgiel/capi-streaming-arrow
Merge pull request duckdb/duckdb#8658 from Tishj/pytype_optional
Merge pull request duckdb/duckdb#9040 from Light-City/feature/set_mg

cryoEncryp added 10 commits September 11, 2023 09:25

Feat: list_zip

e63273b

Feat: list_select

bf80dec

Feat: list_where

2bf928a

Feat: list_grade_up

953f36b

Add: test for grade_up

7e81ece

Add: list_select test

fc9020a

Add: list_where test

b725c00

Add: list_zip tests

eff7696

Merge branch 'master' into new-list-functions

8e03ba2

Fix: Add file extension and change type

2a971e8

github-actions bot marked this pull request as draft September 13, 2023 09:53

cryoEncryp marked this pull request as ready for review September 13, 2023 11:23

Merge remote-tracking branch 'upstream/main' into new-list-functions

36e7a13

github-actions bot marked this pull request as draft September 14, 2023 08:44

cryoEncryp changed the base branch from master to main September 14, 2023 12:05

cryoEncryp and others added 2 commits September 14, 2023 15:15

Set capacity of result vector

9cd8e39

Reserve capacity for zip

f51f120

cryoEncryp marked this pull request as ready for review September 14, 2023 14:09

Mytherin requested a review from taniabogatsch September 14, 2023 14:15

taniabogatsch suggested changes Sep 15, 2023

taniabogatsch suggested changes Sep 18, 2023

Fix: Implement list_grade_up suggestion

8b1d249

github-actions bot marked this pull request as draft September 21, 2023 15:23

taniabogatsch suggested changes Sep 22, 2023

Mytherin added the feature label Sep 22, 2023

Bjoern added 3 commits September 25, 2023 17:51

Add list_select suggestion

715b84a

Add list_sort suggestion

fd3912a

Remove wrong including

13f5749

cryoEncryp requested a review from taniabogatsch October 4, 2023 12:34

taniabogatsch suggested changes Oct 9, 2023

Bjoern added 2 commits October 13, 2023 11:03

Merge remote-tracking branch 'upstream/main' into new-list-functions

a8f6685

Add suggestion

0373234

github-actions bot marked this pull request as draft October 17, 2023 12:40

cryoEncryp requested a review from taniabogatsch October 17, 2023 12:45

cryoEncryp marked this pull request as ready for review October 17, 2023 13:05

Fix compare idx_t to int64_t

7189fee

github-actions bot marked this pull request as draft October 17, 2023 13:41

Bjoern added 2 commits October 17, 2023 15:50

Fix compare fix

7d2bb22

Fix: selections can now be int64

e236bb8

cryoEncryp marked this pull request as ready for review October 17, 2023 13:56

taniabogatsch mentioned this pull request Oct 19, 2023

Lambda performance revamp #9395

Merged

taniabogatsch added the Needs Documentation Use for issues or PRs that require changes in the documentation label Oct 31, 2023

duckdblabs-bot mentioned this pull request Oct 31, 2023

[duckdb/#8907] - Additional list functions needs documentation duckdb/duckdb-web#1460

Closed

taniabogatsch approved these changes Oct 31, 2023

Mytherin changed the base branch from main to feature October 31, 2023 10:25

Merge branch 'feature' into new-list-functions

6834d8d

github-actions bot marked this pull request as draft November 1, 2023 08:05

Mytherin marked this pull request as ready for review November 1, 2023 08:05

Merge remote-tracking branch 'upstream/feature' into new-list-functions

e507628

github-actions bot marked this pull request as draft November 7, 2023 09:59

cryoEncryp marked this pull request as ready for review November 7, 2023 09:59

cryoEncryp requested a review from taniabogatsch November 8, 2023 10:24

Mytherin merged commit b0dbd9b into duckdb:feature Nov 8, 2023
