-
Notifications
You must be signed in to change notification settings - Fork 3.8k
Fix testsuite fails on s390x (big-endian) #7309
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix testsuite fails on s390x (big-endian) #7309
Conversation
On big-endian systems like s390x, the already existing byteswapping approach
does not work as expected. If the array is accessed directly, it works fine:
print (array[0][0])
-29045=0x8E8B
array = array.view(array.dtype.newbyteorder("<"))
print (array[0][0])
-29810= 0x8B8E
But tobytes() returns the same bytes before and after the existing approach:
print(array.tobytes().hex())
8e8b...
Similar to to_array(), the new approach is using byteswap() and tobytes() is
really returning the swapped bytes.
Furthermore, I've removed _IS_LITTLE_ENDIAN and used sys.byteorder directly as
also done in to_array().
This fixes those tests on s390x:
onnx/test/numpy_helper_test.py::TestNumpyHelper::test_int16 FAILED
onnx/test/numpy_helper_test.py::TestNumpyHelper::test_uint16 FAILED
onnx/test/numpy_helper_test.py::TestNumpyHelper::test_int32 FAILED
onnx/test/numpy_helper_test.py::TestNumpyHelper::test_int64 FAILED
onnx/test/numpy_helper_test.py::TestNumpyHelper::test_complex128 FAILED
onnx/test/numpy_helper_test.py::TestNumpyHelper::test_complex64 FAILED
onnx/test/numpy_helper_test.py::TestNumpyHelper::test_float FAILED
onnx/test/numpy_helper_test.py::TestNumpyHelper::test_float16 FAILED
onnx/test/numpy_helper_test.py::TestNumpyHelper::test_from_dict_values_are_np_arrays_of_float FAILED
onnx/test/numpy_helper_test.py::TestNumpyHelper::test_from_dict_values_are_np_arrays_of_int FAILED
onnx/test/numpy_helper_test.py::TestNumpyHelper::test_from_dict_values_are_np_arrays_of_ints FAILED
onnx/test/numpy_helper_test.py::TestNumpyHelper::test_to_array_from_array_00_FLOAT FAILED
onnx/test/numpy_helper_test.py::TestNumpyHelper::test_to_array_from_array_03_UINT16 FAILED
onnx/test/numpy_helper_test.py::TestNumpyHelper::test_to_array_from_array_04_INT16 FAILED
onnx/test/numpy_helper_test.py::TestNumpyHelper::test_to_array_from_array_05_INT32 FAILED
onnx/test/numpy_helper_test.py::TestNumpyHelper::test_to_array_from_array_06_INT64 FAILED
onnx/test/numpy_helper_test.py::TestNumpyHelper::test_to_array_from_array_08_FLOAT16 FAILED
onnx/test/numpy_helper_test.py::TestNumpyHelper::test_to_array_from_array_09_DOUBLE FAILED
onnx/test/numpy_helper_test.py::TestNumpyHelper::test_to_array_from_array_10_UINT32 FAILED
onnx/test/numpy_helper_test.py::TestNumpyHelper::test_to_array_from_array_11_UINT64 FAILED
onnx/test/numpy_helper_test.py::TestNumpyHelper::test_to_array_from_array_12_COMPLEX64 FAILED
onnx/test/numpy_helper_test.py::TestNumpyHelper::test_to_array_from_array_13_COMPLEX128 FAILED
onnx/test/test_external_data.py::TestLoadExternalData_0_protobuf::test_load_external_data FAILED
onnx/test/test_external_data.py::TestLoadExternalData_0_protobuf::test_load_external_data_for_model FAILED
onnx/test/test_external_data.py::TestLoadExternalData_0_protobuf::test_save_external_data FAILED
onnx/test/test_external_data.py::TestLoadExternalData_1_textproto::test_load_external_data FAILED
onnx/test/test_external_data.py::TestLoadExternalData_1_textproto::test_load_external_data_for_model FAILED
onnx/test/test_external_data.py::TestLoadExternalData_1_textproto::test_save_external_data FAILED
onnx/test/test_external_data.py::TestLoadExternalDataSingleFile_0_protobuf::test_load_external_single_file_data FAILED
onnx/test/test_external_data.py::TestLoadExternalDataSingleFile_0_protobuf::test_save_external_single_file_data FAILED
onnx/test/test_external_data.py::TestLoadExternalDataSingleFile_1_textproto::test_load_external_single_file_data FAILED
onnx/test/test_external_data.py::TestLoadExternalDataSingleFile_1_textproto::test_save_external_single_file_data FAILED
onnx/test/test_external_data.py::TestSaveAllTensorsAsExternalData_0_protobuf::test_convert_model_to_external_data_converts_attribute_values FAILED
onnx/test/test_external_data.py::TestSaveAllTensorsAsExternalData_0_protobuf::test_convert_model_to_external_data_from_one_file_with_location FAILED
onnx/test/test_external_data.py::TestSaveAllTensorsAsExternalData_0_protobuf::test_convert_model_to_external_data_without_size_threshold FAILED
onnx/test/test_external_data.py::TestSaveAllTensorsAsExternalData_0_protobuf::test_save_model_does_convert_and_saves_the_model FAILED
onnx/test/test_external_data.py::TestSaveAllTensorsAsExternalData_0_protobuf::test_save_model_without_loading_external_data FAILED
onnx/test/test_external_data.py::TestSaveAllTensorsAsExternalData_1_textproto::test_convert_model_to_external_data_converts_attribute_values FAILED
onnx/test/test_external_data.py::TestSaveAllTensorsAsExternalData_1_textproto::test_convert_model_to_external_data_from_one_file_with_location FAILED
onnx/test/test_external_data.py::TestSaveAllTensorsAsExternalData_1_textproto::test_convert_model_to_external_data_without_size_threshold FAILED
onnx/test/test_external_data.py::TestSaveAllTensorsAsExternalData_1_textproto::test_save_model_does_convert_and_saves_the_model FAILED
onnx/test/test_external_data.py::TestSaveAllTensorsAsExternalData_1_textproto::test_save_model_without_loading_external_data FAILED
onnx/test/inference_function_test.py::TestInferenceFunctionCall::test_reshape_inference FAILED
onnx/test/model_container_refeval_test.py::TestLargeOnnxReferenceEvaluator::test_large_multi_files FAILED
onnx/test/model_container_refeval_test.py::TestLargeOnnxReferenceEvaluator::test_large_one_weight_file FAILED
onnx/test/model_container_refeval_test.py::TestLargeOnnxReferenceEvaluator::test_large_onnx_no_large_initializer FAILED
onnx/test/reference_evaluator_model_test.py::TestReferenceEvaluatorModel::test_loop_fft FAILED
onnx/test/reference_evaluator_test.py::TestReferenceEvaluator::test_function_attribute_nested_nested_graph FAILED
onnx/test/reference_evaluator_test.py::TestReferenceEvaluator::test_if FAILED
onnx/test/reference_evaluator_test.py::TestReferenceEvaluator::test_if_function FAILED
onnx/test/reference_evaluator_test.py::TestReferenceEvaluator::test_op_reduce_00 FAILED
onnx/test/reference_evaluator_test.py::TestReferenceEvaluator::test_op_reduce_01 FAILED
onnx/test/reference_evaluator_test.py::TestReferenceEvaluator::test_op_reduce_02 FAILED
onnx/test/reference_evaluator_test.py::TestReferenceEvaluator::test_op_reduce_03 FAILED
onnx/test/reference_evaluator_test.py::TestReferenceEvaluator::test_op_reduce_04 FAILED
onnx/test/reference_evaluator_test.py::TestReferenceEvaluator::test_op_reduce_05 FAILED
onnx/test/reference_evaluator_test.py::TestReferenceEvaluator::test_op_reduce_06 FAILED
onnx/test/reference_evaluator_test.py::TestReferenceEvaluator::test_op_reduce_07 FAILED
onnx/test/reference_evaluator_test.py::TestReferenceEvaluator::test_op_reduce_08 FAILED
onnx/test/reference_evaluator_test.py::TestReferenceEvaluator::test_op_reduce_09 FAILED
onnx/test/reference_evaluator_test.py::TestReferenceEvaluator::test_op_reduce_10 FAILED
onnx/test/reference_evaluator_test.py::TestReferenceEvaluator::test_op_reduce_11 FAILED
onnx/test/reference_evaluator_test.py::TestReferenceEvaluator::test_op_reduce_12 FAILED
onnx/test/reference_evaluator_test.py::TestReferenceEvaluator::test_op_reduce_13 FAILED
onnx/test/reference_evaluator_test.py::TestReferenceEvaluator::test_op_reduce_14 FAILED
onnx/test/reference_evaluator_test.py::TestReferenceEvaluator::test_op_reduce_15 FAILED
onnx/test/reference_evaluator_test.py::TestReferenceEvaluator::test_op_reduce_16 FAILED
onnx/test/reference_evaluator_test.py::TestReferenceEvaluator::test_op_reduce_17 FAILED
onnx/test/reference_evaluator_test.py::TestReferenceEvaluator::test_reference_evaluator_lr_clip FAILED
onnx/test/tools_test.py::TestToolsFunctions::test_replace_constant FAILED
onnx/test/tools_test.py::TestToolsFunctions::test_replace_constant_function FAILED
onnx/test/tools_test.py::TestToolsFunctions::test_replace_constant_graph FAILED
onnx/test/tools_test.py::TestToolsFunctions::test_replace_initializer FAILED
onnx/test/tools_test.py::TestToolsFunctions::test_replace_range FAILED
onnx/test/tools_test.py::TestToolsFunctions::test_replace_range_function FAILED
onnx/test/tools_test.py::TestToolsFunctions::test_replace_range_graph FAILED
Signed-off-by: Stefan Liebler <stli@linux.ibm.com>
On big-endian systems like s390x, those tests are failing as creating a numpy-array via numpy_helper.to_array(tensor) assumes that raw_data within the tensor is formatted as little-endian and then performs a byte swap on big-endian systems. This fixes those tests on s390x: onnx/test/helper_test.py::test_make_tensor_raw[TensorProto.INT16] FAILED onnx/test/helper_test.py::test_make_tensor_raw[TensorProto.INT32] FAILED onnx/test/helper_test.py::test_make_tensor_raw[TensorProto.INT64] FAILED onnx/test/helper_test.py::test_make_tensor_raw[TensorProto.FLOAT16] FAILED onnx/test/helper_test.py::test_make_tensor_raw[TensorProto.FLOAT] FAILED onnx/test/helper_test.py::test_make_tensor_raw[TensorProto.DOUBLE] FAILED onnx/test/helper_test.py::test_make_tensor_raw[TensorProto.COMPLEX64] FAILED onnx/test/helper_test.py::test_make_tensor_raw[TensorProto.COMPLEX128] FAILED onnx/test/helper_test.py::test_make_tensor_raw[TensorProto.UINT16] FAILED onnx/test/helper_test.py::test_make_tensor_raw[TensorProto.UINT32] FAILED Note: Some of those tests are also mentioned in the following issue from 2024-01-14: s390x test failures - test_make_tensor_raw onnx#6181 The same applies for test_make_bfloat16_tensor_raw. Note that this test did not fail before as byteswapping ml_dtypes.bfloat16 is currently not byteswapping at all. The test will pass as soon as either ml_dtypes.bfloat16 is fixed or an alternative approach for byteswapping is used in numpy_helper.to_array(). See ml_dtypes issue: "bfloat16: byteswap does not swap bytes onnx#308" jax-ml/ml_dtypes#308 Signed-off-by: Stefan Liebler <stli@linux.ibm.com>
On s390x, I get those "RuntimeWarning: invalid value encountered in cast" in onnx/test/helper_test.py for functions test_make_tensor_vals and test_make_tensor_raw and the UINT8, UINT16, UINT32, UINT64 data types. Fixed those warnings by using only positive floats for those datatypes. See similar commit a90d702 Use randint instead of randn to prevent undefined cast (float to uint) (onnx#4804) Signed-off-by: Stefan Liebler <stli@linux.ibm.com>
…) on s390x On big-endian systems like s390x some external tests are failing because the tensors are created with big instead of little endian bytes. After saving and loading the model, the tensors are converted to a np.ndarray where the bytes are byteswapped on big-endian machines from little to big-endian. Thus this patch creates the tensors in the correct byte order. This fixes those tests on s390x: onnx/test/test_external_data.py::TestExternalDataToArray_0_protobuf::test_save_model_with_external_data_multiple_times FAILED onnx/test/test_external_data.py::TestExternalDataToArray_0_protobuf::test_to_array_with_external_data FAILED onnx/test/test_external_data.py::TestExternalDataToArray_1_textproto::test_save_model_with_external_data_multiple_times FAILED onnx/test/test_external_data.py::TestExternalDataToArray_1_textproto::test_to_array_with_external_data FAILED Signed-off-by: Stefan Liebler <stli@linux.ibm.com>
If a numpy array is passed to helper.py::make_tensor(raw=True), it currently just uses tobytes(). Compared to numpy_helper.py:from_array() this is wrong for the 4bit datatypes as those needs to be packed. Furthermore the bytes needs to be stored in little endian byte order (only needed on big endian machines like s390x). Thus this patch adds the byteswap and the packing. Furthermore the testcase onnx/test/helper_test.py::test_make_tensor_raw() is extended to also check passing numpy arrays. Signed-off-by: Stefan Liebler <stli@linux.ibm.com>
On big endian machines, we have to ensure that the bytes stored in a tensor are converted to little endian. While searching in the sources, those are the remaining places - all testcases - where tobytes() is used. This patch either add a byteswap or is now passing a numpy array to make_tensor, which then performs the byteswap, instead of using tobytes(). Note that all corresponding tests did not fail before. Thus in theory we do not need those changes. But it is more consistent as before. Signed-off-by: Stefan Liebler <stli@linux.ibm.com>
justinchuby
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making the fixes!
Yes we can bump that for s390x systems. |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #7309 +/- ##
==========================================
+ Coverage 54.38% 54.43% +0.04%
==========================================
Files 511 511
Lines 31916 31930 +14
Branches 2870 2873 +3
==========================================
+ Hits 17359 17381 +22
+ Misses 13767 13760 -7
+ Partials 790 789 -1 ☔ View full report in Codecov by Sentry. |
|
@cyyever Maybe we should run the test for s390x (big-endian) in some docker container once a week? So far, we haven't tested anything, neither the build process nor the tests? |
|
@andife I agree |
|
We should fix the serialisation endian and record it in documentation. |
Sounds great. Thus we first have to wait until the pull-request for ml_dtypes is reviewed/merged and a new release is created. |
|
@stliibm do you know what the best of setting up an s390x CI on GitHub is? |
This patch introduces a common function numpy_helper.py::tobytes_little_endian() to convert a numpy array to bytes. On big-endian systems it also swaps the bytes which ensures that the returned bytes are in little-endian order. Signed-off-by: Stefan Liebler <stli@linux.ibm.com>
From a colleague, I've got an example how to use Github action runner on s390x. Of course in the link below, it was requested for other projects: Here is also an onboarding guide: I hope this helps? |
justinchuby
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
This adds a doc-string and the versionadded directive to tobytes_little_endian(). Signed-off-by: Stefan Liebler <stli@linux.ibm.com>
The check when to swap the bytes to little-endian is now more accurate as the byteorder of the dtype is really checked. Furthermore, we avoid using byteswap() but instead use astype() as the latter one also records the correct byteorder. Signed-off-by: Stefan Liebler <stli@linux.ibm.com>
|
nice |
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
|
@justinchuby @cyyever Thanks a lot for the review and merging the branch. I've just left a comment in the ml_dtypes PR Fix bfloat16 byteswap #311. Hopefully it gets reviewed soon. Then also the remaining bfloat16 testcases will pass. |
|
@stliibm Nice job |
In order to get the testsuite clean on s390x (big-endian), we need the already committed onnx-commit: "Fix testsuite fails on s390x (big-endian) (onnx#7309)" onnx@f307a8d AND the ml_dtypes-commit: "Fix bfloat16 byteswap not working" jax-ml/ml_dtypes@fd7fa4a The ml_dtypes one has now landed in ml_dtypes v0.5.4 release: https://github.com/jax-ml/ml_dtypes/releases/tag/v0.5.4 "Fixed bug in byte-swap operation for custom floats (onnx#311)" This patch bumps the requirement for ml_dtypes to v0.5.4 only on s390x as discussed in the original pull-request: onnx#7309 (comment) All other architectures, still require for ml_dtypes v0.5.0 as before.
In order to get the testsuite clean on s390x (big-endian), we need the already committed onnx-commit: "Fix testsuite fails on s390x (big-endian) (onnx#7309)" onnx@f307a8d AND the ml_dtypes-commit: "Fix bfloat16 byteswap not working" jax-ml/ml_dtypes@fd7fa4a The ml_dtypes one has now landed in ml_dtypes v0.5.4 release: https://github.com/jax-ml/ml_dtypes/releases/tag/v0.5.4 "Fixed bug in byte-swap operation for custom floats (onnx#311)" This patch bumps the requirement for ml_dtypes to v0.5.4 only on s390x as discussed in the original pull-request: onnx#7309 (comment) All other architectures, still require for ml_dtypes v0.5.0 as before. Signed-off-by: Stefan Liebler <stli@linux.ibm.com>
In order to get the testsuite clean on s390x (big-endian), we need the already committed onnx-commit: "Fix testsuite fails on s390x (big-endian) (#7309)" f307a8d AND the ml_dtypes-commit: "Fix bfloat16 byteswap not working" jax-ml/ml_dtypes@fd7fa4a The ml_dtypes one has now landed in ml_dtypes v0.5.4 release: https://github.com/jax-ml/ml_dtypes/releases/tag/v0.5.4 "Fixed bug in byte-swap operation for custom floats (#311)" This patch bumps the requirement for ml_dtypes to v0.5.4 only on s390x as discussed in the original pull-request: #7309 (comment) All other architectures, still require for ml_dtypes v0.5.0 as before. Signed-off-by: Stefan Liebler <stli@linux.ibm.com>
On big-endian machines, there are plenty of test fails due to missing/wrong byteswaps. I've created a couple of patches to fix those fails, in numpy_helper.py::from_array() and helper.py::make_tensor() and in various testcases.
Some of the fixed Testcases test_make_tensor_raw are also mentioned in the issue from 2024-07-14 created by tehbone:
s390x test failures - test_make_tensor_raw #6181
Especially the last commit "Adjust remaining tests which are using tobytes()." is not really required as all corresponding tests did not fail before. But it makes the tests more consistent as before.
In the end I'm down to 7 remaining fails:
Those fails happen as the ml_dtypes-bfloat16 datatype does not byteswap. I've already filed an issue:
"bfloat16: byteswap does not swap bytes #308"
There is already a pull-request fixing this issue, but the assigned reviewer has not yet responded:
"Fix bfloat16 byteswap #311"
With a locally fixed ml_dtypes, the remaining tests are also passing.
How are those dependency issues solved in onnx?
Is the required version bumped as soon as fixed and released in ml_dtypes?
Or shall we implement a workaround with a different approach to swap bytes for bfloat16 datatype in onnx-sources?