@sidmorizon

No description provided.

@jakob
Owner

jakob commented Jul 11, 2017

Thanks for your pull request! I've reviewed it, and it causes an issue for me: Files encoded in Windows Latin 1 encoding are now recognised as GBK. While this is probably what you want for Chinese users, it is inconvenient for European users (like myself).

Is there an easy way to tell files in Latin 1 and files in GBK encoding apart? I assume that a reliable method would have to look at character frequency (most Latin 1 encoded files probably have only very few characters with the high bit set, while GBK text probably has a lot more).
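To make the frequency idea concrete, here is a minimal Python sketch. It is not Table Tool's code: the function names and the 0.2 threshold are assumptions chosen for illustration, and a real detector would need tuning against real files.

```python
def high_bit_ratio(data: bytes) -> float:
    """Return the fraction of bytes that have the high bit set (>= 0x80)."""
    if not data:
        return 0.0
    return sum(1 for b in data if b >= 0x80) / len(data)


def guess_encoding(data: bytes, threshold: float = 0.2) -> str:
    """Guess 'gbk' or 'cp1252' from the share of high-bit bytes.

    The threshold is an illustrative assumption, not a tuned value.
    """
    try:
        # If the bytes are not even valid GBK, fall back to Windows Latin 1.
        data.decode("gbk")
    except UnicodeDecodeError:
        return "cp1252"
    # Valid GBK: decide by how dense the high-bit bytes are.
    return "gbk" if high_bit_ratio(data) >= threshold else "cp1252"
```

Short files with many accented characters can still cross the threshold, so a heuristic like this is a hint rather than a reliable answer.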

Another possible solution would be to consider the locale setting of the computer -- if the user has the computer set to Chinese, prefer GBK; if the user has German, Swedish, etc., prefer Windows Latin 1. That's probably the easiest solution.
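The locale-based rule could look roughly like the following Python sketch. The function names are hypothetical, the fallback "en_US" is an assumption, and treating every Chinese locale as GBK is a simplification made only for this example.

```python
import locale


def preferred_legacy_encoding(language_code: str) -> str:
    """Pick the fallback encoding for a language tag such as 'zh_CN' or 'de_DE'.

    Simplification (assumption): every 'zh' locale maps to GBK,
    everything else maps to Windows Latin 1 (cp1252).
    """
    primary = language_code.replace("-", "_").split("_")[0].lower()
    return "gbk" if primary == "zh" else "cp1252"


def current_language_code() -> str:
    """Return the system language tag, or 'en_US' if it cannot be determined."""
    lang, _encoding = locale.getlocale()
    return lang or "en_US"
```

A caller would combine the two, for example `preferred_legacy_encoding(current_language_code())`. The format of the tag returned by `locale.getlocale()` varies by platform, so the parsing above may need adjusting.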

In any case, it would be nice if you could add some sample files in GBK encoding.

@sidmorizon
Author

Hi jakob, thanks for your kind review. I've posted a sample file below:
Chinese_GBK_Sample.txt

@jakob
Owner

jakob commented Jul 17, 2017

The sample file you provided does not work with the pull request you sent. It seems that your sample text file uses a different comma character (GBK code point A3AC), whereas Table Tool expects the ASCII comma (code point 2C).

What code point do other tools use when they create CSV files in GBK encoding? I'd assume that they would use 2C?

I assume you created the sample file with a text editor. Do you have any "real life" sample files that you can provide? Can you check with a hex editor which code point your CSV files use?
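As an alternative to a hex editor, the check can be scripted. The Python sketch below is an illustration only (the function name and dictionary keys are made up for this example): it decodes the bytes as GBK and counts the ASCII comma (2C) and the full-width comma that GBK stores as A3AC.

```python
def comma_counts(data: bytes) -> dict:
    """Count ASCII and full-width commas in GBK-encoded bytes.

    Decoding first avoids miscounting an A3 AC pair that merely
    straddles two adjacent double-byte characters.
    """
    text = data.decode("gbk")
    return {
        "ascii_comma_2C": text.count(","),
        # U+FF0C (full-width comma) is encoded as A3 AC in GBK.
        "fullwidth_comma_A3AC": text.count("\uff0c"),
    }


# Usage with a file on disk (the path is a placeholder):
# with open("sample.csv", "rb") as f:
#     print(comma_counts(f.read()))
```

A file that reports only full-width commas was most likely typed in a text editor rather than exported by a spreadsheet or database tool.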

@jeffreykjliu

Jakob,
You are right. CSV files have no Chinese-specific variant: the separators are all ASCII, and only the Chinese text is encoded in GBK. I have uploaded a sample file for your reference.
regards,
-jeffrey liu

@jeffreykjliu

demo_GBK_csv.txt

@jakob
Owner

jakob commented Jul 24, 2017

Thank you. I've merged your pull request in commit 7737ecf. I've changed the auto-detection logic to consider the user's language setting, and I've added the sample file from @jeffreykjliu.

Thanks a lot for your contributions! As soon as I've tested everything, I'll release a new version!

@jakob jakob closed this Jul 24, 2017