Add a forward pass nn model with dynamism test. #4256
Conversation
I have a question about my test:
test/test_dynamic_shape_models.py
xm.mark_step()
# TODO: figure out if met.metric_data("CompileTime") indicates
# the number of compilations. Also figure out why the counter now is 3 instead of the expected 1.
np.testing.assert_equal(met.metric_data('CompileTime')[0], 1)
Dump the IR graphs, then you will know what got executed.
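For example, a minimal sketch (using the standard XLA_SAVE_TENSORS_FILE debugging env var; the "## BEGIN_GRAPH" marker is what the text dump writes) to save the graphs and count them:

# Run the test with the IR dump enabled, e.g.
#   XLA_SAVE_TENSORS_FILE=/tmp/ir_dump.txt python test/test_dynamic_shape_models.py
# then count how many graphs were recorded.
import os

dump_path = os.environ.get('XLA_SAVE_TENSORS_FILE', '/tmp/ir_dump.txt')
with open(dump_path) as f:
  num_graphs = sum(1 for line in f if line.startswith('## BEGIN_GRAPH'))
print('graphs recorded:', num_graphs)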
I got the IR dump; there are 30 ## BEGIN_GRAPH sections for the 10 iterations. I think that means for each iteration the IR graph gets compiled into HLO 3 times, right?
Also, does the metric "CompileTime" refer to compiling from IR to HLO or compiling from HLO to LLO?
Plus, with the actual met.metric_data('CompileTime')[0] being 3, I think the dynamic behavior is what we expected, right? The number of compilations doesn't grow with the number of iterations. Is my understanding correct?
Right now the newly added tests succeed on TPU but fail on CPU with an error:
I found a similar issue but it doesn't explain why it fails. I wonder if you have encountered this issue before @miladm @JackCaoG.
Maybe check where the error is from (somewhere in XLA) and see why it failed?
y_pred = model(x_test)
before_train = criterion(y_pred.squeeze(), y_test)
xm.mark_step()
np.testing.assert_equal(met.metric_data('CompileTime')[0], 3)
why is it 3 here?
Does the "CompileTime" here refer to compiling IR graph to HLO graph, or compiling HLO to LLO/executable?
The IR dump shows 3 graphs: 2 for before_train = criterion(y_pred.squeeze(), y_test)
and 1 for xm.mark_step()
. The "CompileTime" doesn't grow linearly with the number of iteration. Does 3
match your expectation?
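To make the check concrete, this is roughly what I want to assert, assuming met.metric_data('CompileTime')[0] really is the number of compilations so far (a sketch only):

import torch_xla.debug.metrics as met

def compile_count():
  # met.metric_data returns (total_samples, accumulated_value, samples);
  # the first element is how many times the metric was recorded.
  data = met.metric_data('CompileTime')
  return data[0] if data is not None else 0

count_after_first_step = compile_count()
# ... run the remaining forward passes and xm.mark_step() calls here ...
assert compile_count() == count_after_first_step, (
    'graph was recompiled even though only the dynamic dimension changed')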
hmm, so the second graph
[ScheduleSyncTensorsGraph]
TensorsGraphInfo:
__bool__ (/home/ptxla/.local/lib/python3.8/site-packages/torch/__init__.py:212)
binary_cross_entropy (/home/ptxla/.local/lib/python3.8/site-packages/torch/nn/functional.py:3087)
forward (/home/ptxla/.local/lib/python3.8/site-packages/torch/nn/modules/loss.py:619)
_call_impl (/home/ptxla/.local/lib/python3.8/site-packages/torch/nn/modules/module.py:1480)
test_forward_pass_dynamic_input_compile_once (pytorch/xla/test/test_dynamic_shape_models.py:71)
_callTestMethod (/usr/local/lib/python3.8/unittest/case.py:633)
run (/usr/local/lib/python3.8/unittest/case.py:676)
__call__ (/usr/local/lib/python3.8/unittest/case.py:736)
run (/usr/local/lib/python3.8/unittest/suite.py:122)
__call__ (/usr/local/lib/python3.8/unittest/suite.py:84)
run (/usr/local/lib/python3.8/unittest/suite.py:122)
__call__ (/usr/local/lib/python3.8/unittest/suite.py:84)
run (/usr/local/lib/python3.8/unittest/runner.py:176)
runTests (/usr/local/lib/python3.8/unittest/main.py:271)
__init__ (/usr/local/lib/python3.8/unittest/main.py:101)
<module> (pytorch/xla/test/test_dynamic_shape_models.py:93)
Hashes: (5c2a92a233f40275064b7ca64d2c16ba)
## BEGIN_GRAPH
IR {
%0 = f32[1]{0} xla::device_data(), location=convert@module.py:1128, device=TPU:0
%1 = f32[1,10]{1,0} xla::device_data(), location=convert@module.py:1128, device=TPU:0
%2 = f32[10,1]{0,1} aten::permute(%1), location=forward@linear.py:114, dims=(1, 0)
%3 = f32[10]{0} xla::device_data(), location=convert@module.py:1128, device=TPU:0
%4 = f32[10,2]{0,1} xla::device_data(), location=convert@module.py:1128, device=TPU:0
%5 = f32[2,10]{1,0} aten::permute(%4), location=forward@linear.py:114, dims=(1, 0)
%6 = f32[5,2]{0,1} xla::device_data(), location=create_dynamic_test_data@test_dynamic_shape_models.py:85, device=TPU:0
%7 = s32[5,2]{0,1} xla::cast(%6), location=create_dynamic_test_data@test_dynamic_shape_models.py:86, type=s32, dtype=Int, stype=Float
%8 = (s32[<=10,2]{1,0}, s32[]) aten::nonzero(%7), num_outputs=2, location=create_dynamic_test_data@test_dynamic_shape_models.py:86
%9 = f32[<=10,2]{1,0} xla::cast(%8.0), location=create_dynamic_test_data@test_dynamic_shape_models.py:86, type=f32, dtype=Float, stype=Int
%10 = f32[<=10,10]{1,0} aten::addmm(%9, %5, %3), location=forward@linear.py:114
%11 = f32[<=10,10]{1,0} aten::relu(%10), location=relu@functional.py:1457
%12 = f32[<=10,1]{1,0} aten::addmm(%11, %2, %0), location=forward@linear.py:114
%13 = f32[<=10,1]{1,0} aten::sigmoid(%12), location=forward@activation.py:294
%14 = f32[<=10]{0} aten::view(%13), location=binary_cross_entropy@functional.py:3087, output_size=(10)
%15 = s32[] aten::size(%14), ROOT=0
}
is a bit concerning; it seems like we materialize the size via a bool operator somewhere. I would like to understand where that happens in a follow-up PR.
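For reference, a rough sketch of the pattern I suspect, based on the binary_cross_entropy -> __bool__ frames in the stack trace above (the exact torch internals here are an assumption on my part):

import torch
import torch_xla.core.xla_model as xm

dev = xm.xla_device()
x = torch.ones(5, 2, device=dev)
y = torch.nonzero(x.int()).float()  # leading dimension becomes the dynamic <=10
target = torch.ones(10, device=dev)

# Roughly what the shape check inside binary_cross_entropy has to do: the
# comparison must produce a concrete Python bool, and that __bool__ call is
# what forces the dynamic size to be materialized (the aten::size root above).
if y.shape[0] != target.shape[0]:
  raise ValueError('target size does not match input size')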
Hmm, I think I know what's going on. The CPU compiler (tensorflow/compiler/xla/service/cpu/cpu_compiler.cc) will fail if it cannot verify at compile time that two shapes are equivalent, and this pretty much blocks the dynamic shape work. On GPU it is being set to kRuntime.
How do you know "In GPU it is being set to kRuntime"? Also, how do we usually follow up on why CPU doesn't support this check? Do we just ask Blake or open a GitHub issue for TensorFlow?
I just searched for the error message in the XLA code base and found where it is set.
test/test_dynamic_shape_models.py
def test_forward_pass_dynamic_input_correctness(self):
  losses = []
  for dev in [torch.device('gpu'), xla_dev]:
You shouldn't need this test; we expect the HLO generation part to be mostly device-independent.
This is done now. I also created #4298 to track it. Can you take another look at the PR?
test/test_dynamic_shape_models.py
@unittest.skipIf(
    xm.get_xla_supported_devices("CPU"),
Can you check whether this test actually gets run on GPU? Check the GPU test log. What could happen is that the GPU CI can also get the CPU device. It would be better if you specifically check whether you can get a GPU or TPU device.
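Something along these lines is what I have in mind (a sketch only; the class name and skip message are placeholders):

import unittest
import torch_xla.core.xla_model as xm

class TestDynamicShapeModels(unittest.TestCase):  # placeholder class name

  # Skip unless a GPU or TPU device is actually available, instead of skipping
  # whenever a CPU device exists (which is also true on the GPU CI).
  @unittest.skipIf(
      not xm.get_xla_supported_devices("GPU") and
      not xm.get_xla_supported_devices("TPU"),
      "The test only works on GPU and TPU; it currently fails on CPU.")
  def test_forward_pass_dynamic_input_compile_once(self):
    ...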
Feel free to merge it after you verify the test actually gets run on GPU. I also think we don't need to compare GPU and CPU HLO in this test.
You are right. I've pushed another commit to fix that. Edit: I verified that the test runs on GPU.