@mikelolasagasti

Description:

The old logic skipped every application/* file as binary. That policy would have silently dropped human-readable formats as soon as detection improved: in practice, JSON, YAML, and similar files were scanned only because filetype returned "unknown" for them.

This change:

  • uses mimetype for broader and more accurate detection
  • adds isHumanReadable() to explicitly whitelist text/* and common textual application/* types (json, xml, yaml, toml, js, xhtml)
  • makes the skip policy explicit instead of relying on "unknown" fallback
  • adds regression tests to verify classification
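
As a rough illustration of the explicit skip policy described above, here is a minimal sketch of what an isHumanReadable() check could look like. It is not the code from this patch: the allow-list contents, the map name `textualApplicationTypes`, and the choice to take the MIME type as a string are assumptions made for the example. In the real change the string would presumably come from the detection library, for example `mimetype.Detect(data).String()`; the sketch uses only the standard library so it runs on its own.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// textualApplicationTypes is a hypothetical allow-list of application/*
// types that are plain text in practice. The exact set used by the patch
// may differ.
var textualApplicationTypes = map[string]bool{
	"application/json":       true,
	"application/xml":        true,
	"application/yaml":       true,
	"application/x-yaml":     true,
	"application/toml":       true,
	"application/javascript": true,
	"application/xhtml+xml":  true,
}

// isHumanReadable reports whether a MIME type string denotes content that
// should be scanned as text. Parameters such as "; charset=utf-8" are
// stripped before matching. Anything not explicitly allowed is treated as
// binary, so the skip decision no longer depends on an "unknown" fallback.
func isHumanReadable(mimeType string) bool {
	mediaType, _, err := mime.ParseMediaType(mimeType)
	if err != nil {
		// Unparseable or empty type: be conservative and skip.
		return false
	}
	if strings.HasPrefix(mediaType, "text/") {
		return true
	}
	return textualApplicationTypes[mediaType]
}

func main() {
	samples := []string{
		"text/plain; charset=utf-8",
		"application/json",
		"application/xhtml+xml",
		"application/zip",
		"application/octet-stream",
	}
	for _, m := range samples {
		fmt.Printf("%-28s readable=%v\n", m, isHumanReadable(m))
	}
}
```

A table-driven regression test over pairs of MIME string and expected result would pin this classification down, so a future change in detector output cannot silently alter which files are scanned.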

Additionally, this patch migrates the detection library from h2non/filetype to the more accurate and actively maintained gabriel-vasile/mimetype.

Checklist:

  • Does your PR pass tests?
  • Have you written new tests for your changes?
  • Have you linted your code locally prior to submission?

…le filter

Signed-off-by: Mikel Olasagasti Uranga <mikel@olasagasti.info>