[CSV Sniffer] Tweaking header detection #10714
Edit: I changed a lot of tests to remove now unnecessary
Mytherin
left a comment
Thanks for the PR! Looks good - some minor comments:
  # conversion is skipped if we don't read the value - so even with the incorrect type specified this still works
  query I
- SELECT l_returnflag FROM read_csv('test/sql/copy/csv/data/real/lineitem_sample.csv', delim='|', columns={'l_orderkey': 'INTEGER','l_partkey': 'INTEGER','l_suppkey': 'INTEGER','l_linenumber': 'INTEGER','l_quantity': 'INTEGER','l_extendedprice': 'DECIMAL(15,2)','l_discount': 'DECIMAL(15,2)','l_tax': 'DECIMAL(15,2)','l_returnflag': 'VARCHAR','l_linestatus': 'VARCHAR','l_shipdate': 'DATE','l_commitdate': 'DATE','l_receiptdate': 'DATE','l_shipinstruct': 'DATE','l_shipmode': 'VARCHAR','l_comment': 'VARCHAR'});
+ SELECT l_returnflag FROM read_csv('test/sql/copy/csv/data/real/lineitem_sample.csv', delim='|', columns={'l_orderkey': 'INTEGER','l_partkey': 'INTEGER','l_suppkey': 'INTEGER','l_linenumber': 'INTEGER','l_quantity': 'INTEGER','l_extendedprice': 'DECIMAL(15,2)','l_discount': 'DECIMAL(15,2)','l_tax': 'DECIMAL(15,2)','l_returnflag': 'VARCHAR','l_linestatus': 'VARCHAR','l_shipdate': 'DATE','l_commitdate': 'DATE','l_receiptdate': 'DATE','l_shipinstruct': 'DATE','l_shipmode': 'VARCHAR','l_comment': 'VARCHAR'}, header = 0);
Is this necessary here - shouldn't DetectHeaderWithSetColumn figure this one out?
Not really. The problem with these tests is that the column types are set incorrectly, which only works because the conversion is skipped when we don't read the value. During header detection, however, the cast fails, so the row looks inconsistent with the declared types, and that breaks the header detection.
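To illustrate the failure mode, here is a minimal Python sketch, not DuckDB's actual implementation. It assumes that header detection with user-set column types tries to cast the first row to the declared types and treats a failed cast as a sign that the row is a header. The function names (`casts_to`, `first_row_looks_like_header`), the reduced type set, and the sample row are all hypothetical.

```python
from datetime import date

# Simplified parsers for the few types used in this sketch (assumption).
PARSERS = {"INTEGER": int, "DATE": date.fromisoformat}

def casts_to(value, type_name):
    """Return True if `value` can be parsed as `type_name`."""
    if type_name == "VARCHAR":
        return True
    try:
        PARSERS[type_name](value)
    except ValueError:
        return False
    return True

def first_row_looks_like_header(first_row, declared_types):
    # Assumption: a row that does not fit the declared types is taken for a header.
    return any(not casts_to(v, t) for v, t in zip(first_row, declared_types))

# Hypothetical data row: the third column holds text but is declared as DATE,
# much like l_shipinstruct appears to be in the query above.
row = ["1", "1996-03-13", "DELIVER IN PERSON"]
wrong_types = ["INTEGER", "DATE", "DATE"]
right_types = ["INTEGER", "DATE", "VARCHAR"]

print(first_row_looks_like_header(row, wrong_types))  # True: data row mistaken for a header
print(first_row_looks_like_header(row, right_types))  # False
```

Under these assumptions, a wrongly declared type makes a plain data row look like a header, which is why the test sets `header = 0` explicitly.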
Thanks!
Merge pull request duckdb/duckdb#10714 from pdet/header_default_true
Merge pull request duckdb/duckdb#10840 from ahuarte47/main_add-version-parts
This change affects how header detection works for CSV files. In the previous algorithm, we preferred false negatives over false positives, leading us to miss headers in many different CSV files. For example
The issue here is that most sane CSV files actually do have a header. I then changed the algorithm to always detect these cases correctly.
This change increases our accuracy in many of our tests. I believe that the only situation where our header detection will fail is when we have an all-varchar CSV file where the first row is not a header. For example:
Because all columns of the CSV file are varchar, we will wrongfully detect Pedro;~29 as the header.

cc: @tdoehmen
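
The trade-off described above can be sketched in a few lines of Python. This is a simplified illustration, not DuckDB's sniffer: it assumes that column types are inferred from the rows after the first one, that a first row which fails to cast to those types is a header, and that an all-varchar file falls back to assuming a header. The helper names and the sample rows are hypothetical; only the Pedro;~29 row comes from the description.

```python
def infer_type(values):
    """Return 'BIGINT' if every value parses as an integer, else 'VARCHAR'."""
    try:
        for v in values:
            int(v)
    except ValueError:
        return "VARCHAR"
    return "BIGINT"

def detect_header(rows):
    """Guess whether rows[0] is a header (assumed heuristic, see lead-in)."""
    first, rest = rows[0], rows[1:]
    col_types = [infer_type(col) for col in zip(*rest)]
    if all(t == "VARCHAR" for t in col_types):
        # Nothing distinguishes the first row from the data: assume a header.
        return True
    # Otherwise the first row is a header if it fails to cast in a typed column.
    return any(
        t == "BIGINT" and infer_type([v]) == "VARCHAR"
        for v, t in zip(first, col_types)
    )

with_header = [["name", "age"], ["Pedro", "29"], ["Mark", "31"]]
no_header = [["Pedro", "29"], ["Mark", "31"]]
all_varchar = [["Pedro", "~29"], ["Mark", "~31"]]  # first row is data

print(detect_header(with_header))  # True
print(detect_header(no_header))    # False
print(detect_header(all_varchar))  # True: the false positive described above
```

Defaulting to a header in the all-varchar case trades the rare false positive in the last line for correct detection in the far more common files that do have a header.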