fix!: enhance detection for binary previews#2561
Conversation
joelim-work
left a comment
There was a problem hiding this comment.
Non-printable characters are already replaced in the UI:
So this change isn't about safety, it is about trying to enhance the text vs binary detection for previews (i.e. display binary for the latter). The problem is that there is no standard for this and only heuristics can be used - the logic for checking for null bytes comes from 7fe72c1 and I don't recall seeing any complaints about it since.
Currently your patch will now mark FIX protocol message logs as binary, which uses SOH (0x01) as the delimiter - it is a niche topic however. I don't know about other cases though, but it does feel like the logic here is somewhat opinionated.
Yes, there does not appear to be a universal correct interpretation of text files I could find. even mime detection (which is much more complex) is not nearly reliable. The only purpose for this is to improve the hit rate to where false positive are very rare while using minimal complexity and computation to archive it.
Actually the entire approach I attempted in #2560 was motivated by the problem this PR solves. The tldr; There are many characters, that, while harmless, are misinterpreted by terminals in regards to their size. As a result, random artifacts can appear on the screen which are not automatically cleaned and even mess up other panes when moving. This image was the result of browsing the intentionally created problem characters in filenames and previews, but I originally encountered a less severe artifact when browsing small encrypted pgp files, which are currently detected as text with high probability. Attempts to fix the issue in #2560 and gdamore/tcell#1079 by filtering the problem characters and cleaning up the mess turned out to be unreliable and messy. So the better and way more simple solution for lf is this PR, which does not address the issue caused by terminal bugs, but rather reduces the probability of the issue happening to near zero. Not sure about that FIX protocol and how common it is, but I dont think we can or should cover every edge case. Without this PR, there are a lot more misinterpreted files anyway. |
|
I'm probably missing something here, but what does Tcell (and any visual artifacts involved) have to do with this specific PR? My understanding is that |
|
The issue is not invalid characters, but rather the fact that terminal emulators do not agree with tcell on the size of certain glyphs. This is a bug in various terminals and they all behave differently, with some maintainers refusing to address this. As a result of this terminals sometimes paint glyphs slightly outside the cell tcell tracked them in and leaving stale pixels in neighboring cells. Since they arent tracked, tcells redraw does not clear them either. there is no good fix in eitehr tcell nor lf, but fortunately the issue is quite rare, and the only realistic case where users encounter these is random files that are misinterpreted as text. And showing binaries as text is a bug in itself too. So this PR simply ensures that the issues is not encountered in normal use cases by reducing the probability of binary files being misidentified as text. If you encrypt a single word with symmetric pgp, it at least 70 bytes, which reduces the chance of being falsely detected as text to near zero. In comparison, the current lf version will show this as text in most cases, with a chance of the glyph mismatch bug painting random garbage on the lf screen. This happened with real files multiple times for me, even though they were usually 200 bytes or larger. |
|
Okay so I checked the history, previously all non-printable characters except for ESC and whitespace was treated as binary, but it was changed in 7fe72c1 to only look for null bytes. There is no explanation, and if I had to guess then it would be because it is based on Git's approach. After thinking about it some more, I think this change is probably fine. There is no 100% foolproof way to distinguish between binary and text files, and the new logic is reasonable and not also too complex. Before I merge it, I just have to understand what glyphs actually cause rendering issues:
Do you have an actual minimal reproducible example (a hexdump of the file will do) that contains these problematic glyphs? The point I was trying to make earlier is that distinguishing between binary and text might not matter here. If the file is detected as binary, then |
|
glyph-bug.zip In the end the most significant improvement is still the fact that actual text is showing and binary garbage is not. |
|
OK thanks, I managed to reproduce the artifacts using a file containing the following 3 bytes: In any case since So one last thing from me is that the heuristic for detecting binary vs text is actually documented. Could you please update this before I approve and merge? |
|
Done |
joelim-work
left a comment
There was a problem hiding this comment.
Looks good, will merge now. Thanks.
At the moment, small binaries and random files are detected as text with high probability by lf and passed to the internal previewer. This is the most likely cause of artifacts due to character width mismatch in terminal (a terminal bug)
Details are discussed in #2560 and gdamore/tcell#1079
Displaying binary as text preview is ugly on its own too.
This PR improves the detection of valid text files significantly.
Previously, a 100 byte file with random data would be detected as text more then half the time. a common 200 bytes encrypted pgp file would still be detected about 1 in 3 times as text.
This patch reduces the chance of false positive detection to a very low percentage that causes almost no false positives - unless the file is very small, e.g. 10 bytes, which is very rare