Skip to content

Conversation

@pdet
Copy link
Collaborator

@pdet pdet commented Feb 16, 2024

This change affects how header detection works for CSV files. In the previous algorithm, we preferred false negatives over false positives, leading us to miss headers in many different CSV files. For example

  1. Single Row Files
creationDate, Id
  1. Files with undetected types by auto-detect:
Value
"68,527.00"
  1. Files with borked types:
Date
02/01/2019
08//01/2019
  1. All Varchar
name
Pedro

The issue here is that most sane CSV files actually do have a header. I then changed the algorithm to always detect these cases correctly.

This change increases our accuracy in many of our tests. I believe that the only situation where our header detection will fail is when we have an all-varchar CSV file where the first row is not a header. For example:

Pedro;~29
Mark; >30

Because all columns of the CSV File are varchar, we will wrongfully detect Pedro;~29 as the header.

cc: @tdoehmen

@pdet
Copy link
Collaborator Author

pdet commented Feb 16, 2024

Edit:

I changed a lot of tests to remove now unnecessary header = true options. And added some, now needed header = false options.

@github-actions github-actions bot marked this pull request as draft February 16, 2024 19:15
@pdet pdet marked this pull request as ready for review February 19, 2024 18:19
@github-actions github-actions bot marked this pull request as draft February 20, 2024 14:30
@pdet pdet marked this pull request as ready for review February 20, 2024 16:06
Copy link
Collaborator

@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Looks good - some minor comments:

# conversion is skipped if we don't read the value - so even with the incorrect type specified this still works
query I
SELECT l_returnflag FROM read_csv('test/sql/copy/csv/data/real/lineitem_sample.csv', delim='|', columns={'l_orderkey': 'INTEGER','l_partkey': 'INTEGER','l_suppkey': 'INTEGER','l_linenumber': 'INTEGER','l_quantity': 'INTEGER','l_extendedprice': 'DECIMAL(15,2)','l_discount': 'DECIMAL(15,2)','l_tax': 'DECIMAL(15,2)','l_returnflag': 'VARCHAR','l_linestatus': 'VARCHAR','l_shipdate': 'DATE','l_commitdate': 'DATE','l_receiptdate': 'DATE','l_shipinstruct': 'DATE','l_shipmode': 'VARCHAR','l_comment': 'VARCHAR'});
SELECT l_returnflag FROM read_csv('test/sql/copy/csv/data/real/lineitem_sample.csv', delim='|', columns={'l_orderkey': 'INTEGER','l_partkey': 'INTEGER','l_suppkey': 'INTEGER','l_linenumber': 'INTEGER','l_quantity': 'INTEGER','l_extendedprice': 'DECIMAL(15,2)','l_discount': 'DECIMAL(15,2)','l_tax': 'DECIMAL(15,2)','l_returnflag': 'VARCHAR','l_linestatus': 'VARCHAR','l_shipdate': 'DATE','l_commitdate': 'DATE','l_receiptdate': 'DATE','l_shipinstruct': 'DATE','l_shipmode': 'VARCHAR','l_comment': 'VARCHAR'}, header = 0);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this necessary here - shouldn't DetectHeaderWithSetColumn figure this one out?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really. The problem with these tests is that the types of the columns are set incorrectly, which causes the conversion test to be skipped if we don't read the value. However, this will lead to inconsistency in the row because the cast will fail, and it breaks the header detection.

@github-actions github-actions bot marked this pull request as draft February 23, 2024 21:57
@pdet pdet marked this pull request as ready for review February 23, 2024 22:01
@Mytherin Mytherin merged commit 3f4f1d9 into duckdb:main Feb 27, 2024
@Mytherin
Copy link
Collaborator

Thanks!

github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request Mar 15, 2024
Merge pull request duckdb/duckdb#10714 from pdet/header_default_true
Merge pull request duckdb/duckdb#10840 from ahuarte47/main_add-version-parts
@pdet pdet deleted the header_default_true branch June 25, 2024 09:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants