-
Notifications
You must be signed in to change notification settings - Fork 2.8k
Or filter pushdown into zone maps #14313
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Mytherin
merged 69 commits into
duckdb:feature
from
Tmonster:or_filter_pushdown_zone_map_feature
Oct 15, 2024
Merged
Or filter pushdown into zone maps #14313
Mytherin
merged 69 commits into
duckdb:feature
from
Tmonster:or_filter_pushdown_zone_map_feature
Oct 15, 2024
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
…h a zone map down
Collaborator
|
Thanks! Looks great. |
This was referenced Nov 18, 2024
When reading parquet, filter using IN () on two values much slower than equals on single value
#4295
Closed
1 task
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This PR will create a zone map table filter when a OR/IN filter on an integer column is present above a table scan.
When are OR/IN filters are pushed down?
If the column is an integer type column. Based on some benchmarks, the overhead of checking the min max for integer columns when the values are unordered or distinct values are sparse is very little compared to the benefit you get when they are ordered and/order the distinct values are dense.
See the following results. I test or filter pushdown on the following columsn
lineitem.I test on ordered column values and unordered column values.
I select just two values around the minimum + ~(rowgroup size) and the maximum - ~(row group size). This way at least two row groups are emitted from the scan. I query on tpch datasets at sf1, sf10, sf100. Notice that pushing down the OR filter on an unordered version of lineitem is only ~0.1 second slower than not pushing down. This comes from the min max checks. Inspecting the explain analyze by hand, the zone map filter lets in every row group on the higher scale factors
Foreign Key Data (lineitem.l_orderkey)
Execution times to find 2 values in the
lineitem.l_orderkeycolumn. This benchmark is included in the PR and was performed on a c6id.8xlarge (32 cores, 64 GB memory).select * from {lineitem_ordered/random_sfXXX} where l_orderkey in (99584, 5900006);Primary key data.(orders.o_orderkey)
Execution times to find 2 values in the
orders.o_orderkeycolumn. Here every o_orderkey is distinct. However, the two values selected mean that in the random order case, all values are propagated through to the FILTER.select * from {orders_ordered/random_sfXXX} where o_orderkey in (99584, 5900006);Future work:
TODO: Add more tests