add L0 build #2

guizili0 · 2024-03-12T01:57:06Z

Build with below command:

export SYCL_TOOLKIT_PATH=$workspace/oneapi/compiler/2024.0/
export TF_L0_PATH=$workspace/level_zero/usr
python build/build.py --enable_sycl --bazel_options=--override_repository=xla=$workspace/xla

PiperOrigin-RevId: 629867362

…d phase to Initialize() Imported from GitHub PR openxla#12228 The first time that a NormThunk is executed, it will build a cudnn execution plan. This build step can hang if a NCCL collective is running at the same time. To fix this, I've moved the build step to take place during thunk initialization. We only observe this hang when using cudnn 9. Here's a backtrace from the hang that will be fixed: ``` Thread 585 (Thread 0x7fb9391ff640 (LWP 41364) "main.py"): #0 0x00007fd3d17cffd9 in ?? () from /lib/x86_64-linux-gnu/libc.so.6 #1 0x00007fd3d17da24f in pthread_rwlock_wrlock () from /lib/x86_64-linux-gnu/libc.so.6 #2 0x00007fd070967dfe in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1 #3 0x00007fd0709c928a in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1 #4 0x00007f1970d76102 in ?? () from /lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.1.0 openxla#5 0x00007f1970f2c999 in ?? () from /lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.1.0 openxla#6 0x00007f1970a7d4ab in ?? () from /lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.1.0 openxla#7 0x00007f1970d0a9cb in ?? () from /lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.1.0 openxla#8 0x00007fce60b2a98c in cudnn::backend::ExecutionPlan::finalize_internal() () from /lib/x86_64-linux-gnu/libcudnn_graph.so.9.1.0 openxla#9 0x00007fce60aefbb1 in cudnn::backend::Descriptor::finalize() () from /lib/x86_64-linux-gnu/libcudnn_graph.so.9.1.0 openxla#10 0x00007fce60b15bec in cudnnBackendFinalize () from /lib/x86_64-linux-gnu/libcudnn_graph.so.9.1.0 #11 0x00007fd2521b8f39 in cudnn_frontend::ExecutionPlanBuilder_v8::build() () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so #12 0x00007fd2521734ba in stream_executor::gpu::(anonymous namespace)::GetExecPlanFromHeuristics(cudnn_frontend::OperationGraph_v8&&, stream_executor::gpu::(anonymous namespace)::CudnnHandle const&, bool) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so openxla#13 0x00007fd25216ff9b in stream_executor::gpu::CudnnSupport::NormRunnerFromDesc(stream_executor::Stream*, stream_executor::dnn::AlgorithmDesc const&, stream_executor::dnn::NormKind, double, stream_executor::dnn::TensorDescriptor const&, stream_executor::dnn::TensorDescriptor const&, stream_executor::dnn::TensorDescriptor const&, std::optional<stream_executor::dnn::TensorDescriptor>, std::optional<stream_executor::dnn::TensorDescriptor>, std::optional<stream_executor::dnn::TensorDescriptor>, std::optional<stream_executor::dnn::TensorDescriptor>, std::optional<stream_executor::dnn::TensorDescriptor>, std::optional<stream_executor::dnn::TensorDescriptor>) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so #14 0x00007fd24e36b88b in stream_executor::dnn::NormOp::RunnerFromAlgorithmDesc(stream_executor::dnn::AlgorithmDesc const&, stream_executor::dnn::NormOp::Config, stream_executor::Stream*) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so openxla#15 0x00007fd24e36ae37 in stream_executor::dnn::LazyOpRunner<stream_executor::dnn::NormOp>::GetOrCreateRunner(stream_executor::dnn::NormOp::Config, stream_executor::Stream*)::{lambda()#1}::operator()() const () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so openxla#16 0x00007fd24e36adbc in void absl::lts_20230802::base_internal::CallOnceImpl<stream_executor::dnn::LazyOpRunner<stream_executor::dnn::NormOp>::GetOrCreateRunner(stream_executor::dnn::NormOp::Config, stream_executor::Stream*)::{lambda()#1}>(std::atomic<unsigned int>*, absl::lts_20230802::base_internal::SchedulingMode, stream_executor::dnn::LazyOpRunner<stream_executor::dnn::NormOp>::GetOrCreateRunner(stream_executor::dnn::NormOp::Config, stream_executor::Stream*)::{lambda()#1}&&) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so openxla#17 0x00007fd24e36a9bd in stream_executor::dnn::LazyOpRunner<stream_executor::dnn::NormOp>::GetOrCreateRunner(stream_executor::dnn::NormOp::Config, stream_executor::Stream*) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so openxla#18 0x00007fd24e369d29 in xla::gpu::RunGpuNorm(xla::gpu::GpuNormConfig const&, stream_executor::DeviceMemoryBase const&, stream_executor::DeviceMemoryBase const&, stream_executor::DeviceMemoryBase const&, std::optional<stream_executor::DeviceMemoryBase>, std::optional<stream_executor::DeviceMemoryBase>, std::optional<stream_executor::DeviceMemoryBase>, std::optional<stream_executor::DeviceMemoryBase>, std::optional<stream_executor::DeviceMemoryBase>, std::optional<stream_executor::DeviceMemoryBase>, stream_executor::DeviceMemoryBase const&, stream_executor::Stream*, xla::gpu::RunNormOptions) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so openxla#19 0x00007fd24e368be6 in xla::gpu::NormThunk::ExecuteOnStream(xla::gpu::Thunk::ExecuteParams const&) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so ``` Copybara import of the project: -- f535330 by Trevor Morris <tmorris@nvidia.com>: Fix hang with cudnn layer norm by moving cudnn init to Initialize() Merging this change closes openxla#12228 COPYBARA_INTEGRATE_REVIEW=openxla#12228 from trevor-m:tmorris-norm-init f535330 PiperOrigin-RevId: 633220207

Use std::aligned_storage_t trick to avoid default-initializing Node struct on a hot path. name old cpu/op new cpu/op delta BM_SelectAndScatterF32/128/process_time 791µs ± 4% 720µs ± 2% -8.93% BM_SelectAndScatterF32/256/process_time 3.20ms ± 4% 2.96ms ± 2% -7.46% BM_SelectAndScatterF32/512/process_time 13.7ms ± 5% 12.8ms ± 2% -6.80% name old time/op new time/op delta BM_SelectAndScatterF32/128/process_time 790µs ± 5% 719µs ± 1% -9.00% BM_SelectAndScatterF32/256/process_time 3.20ms ± 3% 2.96ms ± 1% -7.58% BM_SelectAndScatterF32/512/process_time 13.2ms ± 4% 12.3ms ± 1% -6.82% PiperOrigin-RevId: 658139935

add L0 build

56845af

github-actions bot added the kokoro:force-run label Mar 12, 2024

add intel build script

38cff03

justkw pushed a commit that referenced this pull request May 2, 2024

[xla:cpu] NFC: Remove deprecated XLA:CPU mlir based codegen part #2

a05ff09

PiperOrigin-RevId: 629867362

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add L0 build #2

add L0 build #2

guizili0 commented Mar 12, 2024 •

edited

Loading

add L0 build #2

Are you sure you want to change the base?

add L0 build #2

Conversation

guizili0 commented Mar 12, 2024 • edited Loading

guizili0 commented Mar 12, 2024 •

edited

Loading