Parallel test with pytest-xdist#518
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/518
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit 4c3c1fd with merge base 4f53882 ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
|
What I had tried messing around with before was using https://docs.pytest.org/en/stable/how-to/mark.html which is useful to mark the 2 gpu and 1 gpu tests independently, and in the github action file you can just have 2 pytest commands to run them one after the other |
|
Yup that's what I have in mind too. Just want to make sure only those 2 tests use 2 GPUs. |
|
We can merge this but let's keep an eye out on whether this makes debugging test failures more tricky to debug in teh future |
|
@msaroufim I keep seeing this test failing Not exactly clear what's causing it. Should we revert or try to fix it? |
|
Revert and ask questions later |
This reverts commit cbaff6c.
Currently this PR creates 2 workers, each with 2 GPUs
I think only FSDP tests require 2 GPUs
ao/test/dtypes/test_nf4.py
Lines 484 to 494 in 4f53882
ao/test/prototype/test_low_bit_optim.py
Lines 161 to 168 in 4f53882
Thus, CI runtime can be further improved by
UPDATE: parallel tests with 4 1-GPU workers + sequential multi-GPU tests is implemented now. Some problems I encountered along the way (for future reference and improvement). pytest-xdist provides multiple methods to distribute work across workers. Refer to here for the full list. I have tried the following
--dist load(send each test to next available worker):TestSaveLoadMetause the same filenametest.pthwhich is problematic when multiple tests write/read the same file. I made a unique name for each test to solve it for now, though the better way is to use pytesttmp_pathfixture (not compatible withunittest.TestCase).test_quant_llm_quantize_ebits_2_mbits_2gets a strange torch.compile warning in PyTorch 2.3 onlytorch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py:436] [0/48] failed to eagerly compile backwards for dynamic, suppressing in case backwards not needed. Looks harmless though.--dist loadscope(send a TestClass to next available worker):TestSaveLoadMetagets a strange error -RuntimeError: Error: accessing tensor output of CUDAGraphs that has been overwritten by a subsequent forward run. Stack trace: [Could not find stack trace]. To prevent overwriting, clone the tensor outside of torch.compile() or call torch.compiler.cudagraph_mark_step_begin() before each model invocation.. No traceback. Since tests within a class are run by the same worker, there shouldn't be filename conflicttest.py. Perhaps some strange interaction between how pytest-xdist sends test to the subprocess worker and CUDA graphs + torch.compile?--dist loadfile(send each file to next available worker): this should be the least error-prone. However, there is little speedup (24min vs 35min) sincetest_integration.pyfile is huge -> load imbalance (probably a good idea to split this into smaller files anyway).The current approach of this PR is using
--dist load. To summarizeI also notice that it takes 3min to pull the Docker image and another 3-4min to download and install packages. So it seems like two 2-GPU workers have pretty good scaling (7 + 14 vs 7 + 28), while four 1-GPU workers is not much of an improvement (7 + 12 vs 7 + 28). I guess the 2-GPU FSDP tests can be run in parallel too for the latest configuration.