Skip to content

fix!: enhance detection for binary previews#2561

Merged
joelim-work merged 2 commits into
gokcehan:masterfrom
valoq:binary-detection
May 15, 2026
Merged

fix!: enhance detection for binary previews#2561
joelim-work merged 2 commits into
gokcehan:masterfrom
valoq:binary-detection

Conversation

@valoq
Copy link
Copy Markdown
Contributor

@valoq valoq commented Apr 29, 2026

At the moment, small binaries and random files are detected as text with high probability by lf and passed to the internal previewer. This is the most likely cause of artifacts due to character width mismatch in terminal (a terminal bug)
Details are discussed in #2560 and gdamore/tcell#1079

Displaying binary as text preview is ugly on its own too.

This PR improves the detection of valid text files significantly.

Previously, a 100 byte file with random data would be detected as text more then half the time. a common 200 bytes encrypted pgp file would still be detected about 1 in 3 times as text.

This patch reduces the chance of false positive detection to a very low percentage that causes almost no false positives - unless the file is very small, e.g. 10 bytes, which is very rare

Copy link
Copy Markdown
Collaborator

@joelim-work joelim-work left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-printable characters are already replaced in the UI:

lf/ui.go

Lines 111 to 118 in be24843

if gc == "\t" {
w := gOpts.tabstop - (x+off+printLength(b.String()))%gOpts.tabstop
b.WriteString(strings.Repeat(" ", w))
} else if isPrintable(gc) {
b.WriteString(gc)
} else {
b.WriteString("\uFFFD")
}

So this change isn't about safety, it is about trying to enhance the text vs binary detection for previews (i.e. display binary for the latter). The problem is that there is no standard for this and only heuristics can be used - the logic for checking for null bytes comes from 7fe72c1 and I don't recall seeing any complaints about it since.

Currently your patch will now mark FIX protocol message logs as binary, which uses SOH (0x01) as the delimiter - it is a niche topic however. I don't know about other cases though, but it does feel like the logic here is somewhat opinionated.

@valoq
Copy link
Copy Markdown
Contributor Author

valoq commented May 11, 2026

The problem is that there is no standard for this and only heuristics can be used [...] the logic here is somewhat opinionated

Yes, there does not appear to be a universal correct interpretation of text files I could find. even mime detection (which is much more complex) is not nearly reliable. The only purpose for this is to improve the hit rate to where false positive are very rare while using minimal complexity and computation to archive it.

I don't recall seeing any complaints about it since.

Actually the entire approach I attempted in #2560 was motivated by the problem this PR solves.

The tldr; There are many characters, that, while harmless, are misinterpreted by terminals in regards to their size. As a result, random artifacts can appear on the screen which are not automatically cleaned and even mess up other panes when moving.

It looks like this
screenshot_2026-05-11-144625

This image was the result of browsing the intentionally created problem characters in filenames and previews, but I originally encountered a less severe artifact when browsing small encrypted pgp files, which are currently detected as text with high probability.

Attempts to fix the issue in #2560 and gdamore/tcell#1079 by filtering the problem characters and cleaning up the mess turned out to be unreliable and messy. So the better and way more simple solution for lf is this PR, which does not address the issue caused by terminal bugs, but rather reduces the probability of the issue happening to near zero.

Not sure about that FIX protocol and how common it is, but I dont think we can or should cover every edge case. Without this PR, there are a lot more misinterpreted files anyway.

@joelim-work
Copy link
Copy Markdown
Collaborator

I'm probably missing something here, but what does Tcell (and any visual artifacts involved) have to do with this specific PR? My understanding is that win.print will replace any invalid characters with U+FFFD, which Tcell should be able to handle. So really what you are trying to do here is to have your small encrypted files detected as binary instead of text.

@valoq
Copy link
Copy Markdown
Contributor Author

valoq commented May 12, 2026

The issue is not invalid characters, but rather the fact that terminal emulators do not agree with tcell on the size of certain glyphs. This is a bug in various terminals and they all behave differently, with some maintainers refusing to address this. As a result of this terminals sometimes paint glyphs slightly outside the cell tcell tracked them in and leaving stale pixels in neighboring cells. Since they arent tracked, tcells redraw does not clear them either. there is no good fix in eitehr tcell nor lf, but fortunately the issue is quite rare, and the only realistic case where users encounter these is random files that are misinterpreted as text. And showing binaries as text is a bug in itself too.

So this PR simply ensures that the issues is not encountered in normal use cases by reducing the probability of binary files being misidentified as text. If you encrypt a single word with symmetric pgp, it at least 70 bytes, which reduces the chance of being falsely detected as text to near zero. In comparison, the current lf version will show this as text in most cases, with a chance of the glyph mismatch bug painting random garbage on the lf screen. This happened with real files multiple times for me, even though they were usually 200 bytes or larger.

@joelim-work
Copy link
Copy Markdown
Collaborator

Okay so I checked the history, previously all non-printable characters except for ESC and whitespace was treated as binary, but it was changed in 7fe72c1 to only look for null bytes. There is no explanation, and if I had to guess then it would be because it is based on Git's approach.

After thinking about it some more, I think this change is probably fine. There is no 100% foolproof way to distinguish between binary and text files, and the new logic is reasonable and not also too complex. Before I merge it, I just have to understand what glyphs actually cause rendering issues:

the fact that terminal emulators do not agree with tcell on the size of certain glyphs

Do you have an actual minimal reproducible example (a hexdump of the file will do) that contains these problematic glyphs? The point I was trying to make earlier is that distinguishing between binary and text might not matter here. If the file is detected as binary, then binary will be shown, and if the file is detected as text then some gibberish like �(xt�m\���F��qpXX|� will be shown - both render fine for me without leaving behind any artifacts, but that might be terminal-specific.

@valoq
Copy link
Copy Markdown
Contributor Author

valoq commented May 13, 2026

glyph-bug.zip
Here are example files that overwrite the border with artifacts. I detected some other issues previously as well, where some characters like similar to ' where showing in the middle of the preview pane but I cant reproduce this right now.
(this need internal previewer to trigger)

In the end the most significant improvement is still the fact that actual text is showing and binary garbage is not.

@joelim-work
Copy link
Copy Markdown
Collaborator

OK thanks, I managed to reproduce the artifacts using a file containing the following 3 bytes: d6 a3 61. I am quite sure there are many combinations of bytes that produce an artifact but this is just an example. Apparently d6 a3 is valid UTF-8, so I suspect the issue is to do with terminals and how they draw unexpected glyphs.

In any case since d6 a3 61 constitutes valid UTF-8 text content, and I think display artifacts is a separate issue from this PR, which is instead focused on trying improve the accuracy of detecting and displaying binary in the preview window.

So one last thing from me is that the heuristic for detecting binary vs text is actually documented. Could you please update this before I approve and merge?

@valoq
Copy link
Copy Markdown
Contributor Author

valoq commented May 14, 2026

Done

Copy link
Copy Markdown
Collaborator

@joelim-work joelim-work left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, will merge now. Thanks.

@joelim-work joelim-work added this to the r42 milestone May 15, 2026
@joelim-work joelim-work added the breaking Pull requests that introduce breaking changes label May 15, 2026
@joelim-work joelim-work changed the title improve text detection fix!: enhance detection for binary previews May 15, 2026
@joelim-work joelim-work merged commit 87d067c into gokcehan:master May 15, 2026
32 checks passed
joelim-work added a commit that referenced this pull request May 15, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

breaking Pull requests that introduce breaking changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants