Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix CSV import for files containing Byte Order Mark (BOM) #23332

Merged
merged 4 commits into from
Aug 14, 2024

Conversation

azrikahar
Copy link
Contributor

Scope

Seems like this turns out to be a UTF-8 byte order mark (BOM) issue. Although it was resolved back in #12970 for csv-parser, csv-parser was then changed to papaparse in #19739, hence BOM isn't being handled again.

Related issues in PapaParse's repo:

With the sample.csv reproduction file provided in #21727 which contains BOM (if you open it with vscode or notepad, the bottom right corner should say UTF-8 with BOM), as well as using the transformHeader config option as a quick test:

transformHeader: (header) => {
    console.log(header);
    return header;
},

That allows us to identify the problem, which is the first column always_ignored actually gets processed as always_ignored (notice the visible leading space):

This causes the import process to attempt to import to a technically non-existing column, hence the missing data for users that are using CSV with BOM for their first column every time.

This PR opted to use a simplistic .trim() as brought up in PapaParse's issues, but other potential alternatives could be:

  • only trimming it when index === 0 to make it clearer we're cleaning up just the first column

  • Only cleaning up BOM specifically which a slightly more elaborate code such as:

    transformHeader: (header) => {
        if (header.charCodeAt(0) === 0xFEFF) {
            return header.substring(1);
        }
        return header;
    },

What's changed:

  • added transformHeader papaparse config for CSV imports to trim the headers

Potential Risks / Drawbacks

  • Maybe a tiny overhead for trimming the headers

Review Notes / Questions

  • Whether the simple trim() is acceptable, or other alternatives mentioned above are more preferred since they are more specific in terms of code readability/maintainability

Fixes #21727

Copy link

changeset-bot bot commented Aug 14, 2024

🦋 Changeset detected

Latest commit: b57a1ca

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 2 packages
Name Type
@directus/api Patch
directus Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

.changeset/good-bats-shake.md Outdated Show resolved Hide resolved
@paescuj paescuj enabled auto-merge (squash) August 14, 2024 12:50
@paescuj paescuj merged commit a41ee74 into directus:main Aug 14, 2024
4 checks passed
@github-actions github-actions bot added this to the Next Release milestone Aug 14, 2024
@azrikahar azrikahar deleted the fix/21727 branch August 14, 2024 13:29
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 14, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CSV import always omits first column
2 participants