Property sun.stdout.encoding is undefined in recent Java versions#407
Conversation
| String output = Py.getCommandResultWindows("chcp"); | ||
| String output = Py.getCommandResultWindows("chcp.com"); | ||
| /* | ||
| * The output will be like "Active code page: 850" or maybe "Aktive Codepage: 1252." or |
There was a problem hiding this comment.
I'm explaining why the only reliable bit of the message is the number. ISTR this one is German.
Ja. So stimmt es. https://www.mikrocontroller.net/topic/561065
There was a problem hiding this comment.
Reverted back to Aktive.
|
As I commented on the original issue, testing console handling is hard because our CI doesn't really exercise it. So it's down to conscientious local testing. |
| if (os != null && os.startsWith("Windows")) { | ||
| // Go via the Windows code page built-in command "chcp". | ||
| String output = Py.getCommandResultWindows("chcp"); | ||
| String output = Py.getCommandResultWindows("chcp.com"); |
There was a problem hiding this comment.
When running chcp from Git-Bash on Windows, it cannot find "chcp" and one has to specify the full name which is "chcp.com". I thought Jython might experience the same issue, but it does not. So will revert this change.
There was a problem hiding this comment.
C:\> where chcp
C:\Windows\System32\chcp.com
|
@jeff5 , the following code fragment is unlikely to be executed on Windows given that stdout.encoding and sun.stdout.encoding seems to cover all bases. if (isWindows) {
// Go via the Windows code page built-in command "chcp".
String output = Py.getCommandResultWindows("chcp");
/*
* The output will be like "Active code page: 850" or maybe "Aktive Codepage: 1252." or
* "활성 코드 페이지: 949". Assume the first number with 2 or more digits is the code page.
*/
final Pattern DIGITS_PATTERN = Pattern.compile("[1-9]\\d+");
Matcher matcher = DIGITS_PATTERN.matcher(output);
if (matcher.find()) {
encoding = "cp".concat(output.substring(matcher.start(), matcher.end()));
if (encoding.equals("cp65001")) {
encoding = "utf-8";
}
return encoding;
}
} |
|
@jeff5 , I installed SDKMAN using Git-Bash on Windows, and this makes it a lot easier to do Jython console testing using different versions of Java. With SDKMAN I can easily install and switch between diffrent JDK versions. First install scoop - https://scoop.sh/ Then run these commands to install Git/Bash, curl, zip and unzip
Lastly, install SDKMAN
Now one can start bash.exe from inside the Jython project folder cd IdeaProjects\jython Bash is just a different Windows shell, so when Jython runs it still runs as it normally would as a Windows process. Don't you think this would make a good addition to the README.md file? |
|
Results of tests performanced on Linux, MacOS and Windows. We can see that for Java 21 and higher, property stdout.encoding is always defined. Older versions of Java don't define property stdout.encoding, and may or may not defined property sun.stdout.encoding. Linux
MacOS
Windows
test.cmd test.sh test.kt val javaVersion = System.getProperty("java.version") ?: "null"
val sun_stdout_encoding = System.getProperty("sun.stdout.encoding") ?: "null"
val stdout_encoding = System.getProperty("stdout.encoding") ?: "null"
fun main() {
print("java.version = " + javaVersion)
print(", sun.stdout.encoding = " + sun_stdout_encoding)
println(", stdout.encoding = " + stdout_encoding)
} |
|
@jeff5 , testing has been completed. Please review once more. Could we eventually release this as Jython 2.7.5? |
|
@Stewori . please review and if possible merge this PR next. How these two properties below are managed in newer versions of Java have changed:
The old logic failed with Java 25, but with this PR it now it works again. This code has been extensively tested using different major versions of Java. |
| // From Java 8 onwards, the answer may already be to hand in the registry: | ||
| String encoding = props.getProperty("sun.stdout.encoding"); | ||
| String os = props.getProperty("os.name"); | ||
| boolean isWindows = os != null && os.startsWith("Windows"); |
There was a problem hiding this comment.
Please move these two lines (String os = .... and boolean is Windows....) a bit downwards, so they are created only if stdout.encoding is null.
| // Java 19+ | ||
| String encoding = props.getProperty("stdout.encoding"); | ||
| if (encoding != null) { | ||
| // Windows: cp65001 is automatically mapped to UTF-8, no additional processing is needed |
There was a problem hiding this comment.
I verified this claim by googling, but struggle to find an authoritative source. I do not doubt this fact. However, if you have a link to an authoritative source for this claim, please add it as second line to this comment. E.g. something in the msdn docs or so.
This is not critical, just nice to have for reference.
There was a problem hiding this comment.
(I mean that cp65001 is or used to be Window's name for utf-8)
There was a problem hiding this comment.
The low-level C code in the JVM 19+ is only the documentation for this Windows specific mapping.
static char* getConsoleEncoding()
{
char* buf = malloc(16);
int cp;
if (buf == NULL) {
return NULL;
}
cp = GetConsoleCP();
if (cp >= 874 && cp <= 950)
sprintf(buf, "ms%d", cp);
else if (cp == 65001)
sprintf(buf, "UTF-8");
else
sprintf(buf, "cp%d", cp);
return buf;
}
|
Re-reading #404 it seems that cp65001 might be more subtle than just aliasing it to utf-8. I'll have to do more background research to be able to make an educated decision on this. In any case, please add a line to NEWS in bugs fixed 2.7.5, similar to
|
|
Okay, after a bit research and AI-talk, what about using |
|
Checking more carefully, |
|
I think the main drawback of using cp65001 is that some utf-8 characters can be misinterpreted or can't be displayed. By contrast, the behavior when using one of the common non-utf-8 code pages, having a wrong character in the output may cause an error. This is for Windows 10+. One could still get errors pre-windows 10 depending on the string, the Java version, etc. There's no perfect solution when the shell itself isn't fully utf-8 compatible, especially the cmd.exe shell pre-Windows 10, which is limited by its long history. |
|
Okay, this is about representing cp65001 as utf-8, not about permitting or encouraging its use in the first place. |
|
According to Mr. Chatbot Claude, for whatever it is worth:
If we say, as I think we agreed on, to say the next edition of Jython will only support Windows 10+, we ought to be all right. |
|
Thank you for cross-checking this. Looks like it aligns with my info, more specific though (at least AIs agree here^^). I'm not really sure how official the Windows 10+ agreement is, but I think using Windows <10 became so impracticable nowadays, that we don't need to consider it much. Even then - distinguish "support" from "bug-compatibility". IMHO to support something does not mean to compensate (every single of) its bugs. So I wouldn't even frame this as "support for Win < 10 dropped". An then, the problematic API is not even used by Jython directly. This should be safe enough. |
|
I remember years ago, probably with Python 2+ rather than Jython, I used to monkey patch the code that selected the right encoder/decoder pair so that it would choose cp65001 when the encoding was utf-8. That simplified life quite a bit, encoding-wise. I had forgotten all about that until today. I think that recent versions of CPython 2.7 have changed how they work in that respect. |
That was not actually my intention when I asked for a reference. What I actually meant was an authoritative reference (url or some citation coordinates) to a source (e.g. in MSDN docs) that really states that cp65001 means UTF-8, or at least is supposed to mean. From what I found online, this more seems to be "folklore" and an actual written official statement is hard to find. So if there is one, it would be valuable to preserve that reference in a comment for the future. |
|
As I have posted earlier, CP65001 is not full utf-8. It's a design limitation left over from the old, old design. There are not uncommon edge cases that fail. UTF-16 surrogate pairs are especially likely to cause trouble. Yes, there are bugs, and they apparently vary from one to another versions of Windows. Since WIndows uses utf-16 internally, any code point above U+FFFF must be encoded as a surrogate pair. |
|
Also https://learn.microsoft.com/en-us/windows/win32/intl/code-page-identifiers . When I looked a few years ago there seemed to be questions about how complete or reliable the implementation was, but I understand it as always intended to be UTF-8. I found the reference into OpenJDK useful, but I think it belongs in a discussion like this. It is probably better to link a stable repo/branch. https://github.com/openjdk/jdk25u/blob/master/src/java.base/windows/native/libjava/java_props_md.c#L131 (to the discussion, I mean). |
I wonder if my poor experience with PoSh is a hang-over from that? I have followed this through in the debugger now to the point where UTF-8 bytes disappear into a channel (which I think is our console ... not sure) and they still come out as
I don't think this affects us at the Python level, or my Footnotes
|
|
@jeff5 , thanks for the review and feedback. I'll work on the PR updates in the coming days. |
|
I think this also fixes #384 which I noticed, but forgot. |
|
@wfouche I would be happy to make the restructuring change if you don't mind me pushing to your PR branch, as I'd like to include this in the beta release this month (if I can make that). I don't think you've signed the Contributor Agreement (see instructions). IIRC I didn't press you to do so when it was only on JBang integration but for core code. #424 says it is based partly on your code but I would prefer to merge your original so you get proper credit. |
|
@jeff5 , life intervened and I only now got back to this now. Working on resolving the issues today. |
|
@jeff5 , you commented earlier on UTF8 console support in PowerShell. In powershell you need to add the following line to you $PROFILE file in order to enable UTF8 for the console. |
|
@jeff5 , I applied the two changes you requested. This has simplified the implementation a lot. Thanks. private static String getConsoleEncoding(Properties props) {
// Java 19+
String encoding = props.getProperty("stdout.encoding");
if (encoding != null) {
// Windows: cp65001 is automatically mapped to UTF-8
// https://github.com/openjdk/jdk25u/blob/master/src/java.base/windows/native/libjava/java_props_md.c#L131
// C function: getConsoleEncoding()
encoding = encoding.toLowerCase();
return encoding;
}
// Java 8 to 18
String os = props.getProperty("os.name");
encoding = props.getProperty("sun.stdout.encoding");
if (encoding != null) {
// Windows: some of the older versions of Java return "cp65001" for UTF-8
if (encoding.equals("cp65001")) {
encoding = "utf-8";
}
return encoding;
}
// On Linux and macOS, sun.stdout.encoding and stdout.encoding are undefined for Java 8 and 11
// Try a Unix-like "locale charmap".
String output = Py.getCommandResult("locale", "charmap");
// The result of "locale charmap" is just the charmap name ~ Charset or codec name.
if (output.length() > 0) {
return output;
}
// If we land here it is because we found no answer, and we will assume UTF-8.
return "utf-8";
} |
Nice to see you again. Would you sign the CLA please instructions, and tell us that you have. |
I signed the agreement last year for the Jython CLI contribution. Updated the ACKNOWLEDGEMENTS file just now, in the previous commit. |
Agreed your research shows that on versions you examined, and on Windows, one or other of these is defined satisfactorily. But It is good that it is unlikely to be executed, but there is a risk attached to taking it out that we do not need to take. |
|
Thanks I'll pull that and kick the tyres. Surprisingly, I didn't find a way to undefine |
So I do. I hate Windows encoding. |
@jeff5 , yes, it is a mess. I worked on a PR for JBang to enable UTF8 support automatically on Windows but in the end this effort was not merged, and we decided to just document in the user guide how to enable UTF8 support. Here's the changes I made for PowerShell to enable UTF-8: https://github.com/jbangdev/jbang/pull/2349/changes#diff-aed1494cec0154f2562c2c89da1a4336bbc127dfe013e1c5f055364d55da474b |
|
Although modified a bit since, this has a had a pretty thorough examination. Thanks for your patient work, @wfouche . |
This PR was tested with a simple script that displays a Unicode character.
on Java 8 and 25.
Fixes #404
Resolves #384