SIMD implementation and examples #2556

Open
mehmetyusufoglu wants to merge 1 commit into alpaka-group:develop from mehmetyusufoglu:simdWithExamples2

@mehmetyusufoglu
Contributor

@mehmetyusufoglu mehmetyusufoglu commented Jul 29, 2025

Alpaka SIMD implementation

A Two-Level SIMD Architecture
Level 1: std::experimental::simd

  • Uses the standardized C++ SIMD library if available
  • Detects and uses the available SIMD instruction set (SSE2, AVX2, AVX-512)
  • Disabled automatically in mixed CPU+GPU builds due to NVCC limitations

Level 2: Array-Based Fallback Backend

  • Fallback implementation used when std::experimental::simd is unavailable or unsupported
  • Uses std::array with compile-time loop unrolling
  • Mixed CPU+GPU builds, and CPU-serial builds with clang + libc++, use this fallback
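
As an illustration of the Level 2 idea, here is a minimal sketch of an array-based pack whose element-wise operations are unrolled at compile time through `std::index_sequence`. The type `ArrayPack` and its member names are hypothetical and are not taken from the PR; the sketch assumes C++17 (fold expressions).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical fallback pack (not alpaka's actual type): N lanes in a std::array.
template<typename T, std::size_t N>
struct ArrayPack
{
    std::array<T, N> lanes{};

    // Load N contiguous elements from memory.
    static ArrayPack load(T const* src)
    {
        assert(src != nullptr);
        return loadImpl(src, std::make_index_sequence<N>{});
    }

    // Store all N lanes back to memory.
    void store(T* dst) const
    {
        assert(dst != nullptr);
        storeImpl(dst, std::make_index_sequence<N>{});
    }

    // Lane-wise addition.
    ArrayPack operator+(ArrayPack const& other) const
    {
        return addImpl(*this, other, std::make_index_sequence<N>{});
    }

    // The fold expressions below expand to N straight-line statements at
    // compile time, which the compiler is free to auto-vectorize.
    template<std::size_t... I>
    static ArrayPack loadImpl(T const* src, std::index_sequence<I...>)
    {
        ArrayPack r;
        ((r.lanes[I] = src[I]), ...);
        return r;
    }

    template<std::size_t... I>
    void storeImpl(T* dst, std::index_sequence<I...>) const
    {
        ((dst[I] = lanes[I]), ...);
    }

    template<std::size_t... I>
    static ArrayPack addImpl(ArrayPack const& a, ArrayPack const& b, std::index_sequence<I...>)
    {
        ArrayPack r;
        ((r.lanes[I] = static_cast<T>(a.lanes[I] + b.lanes[I])), ...);
        return r;
    }
};
```

With this sketch, `(ArrayPack<float, 4>::load(a) + ArrayPack<float, 4>::load(b)).store(c)` adds four floats per call. Whether the result becomes real vector instructions depends on the compiler and its optimization flags.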

Key Features
The SIMD width is determined at compile time from the accelerator type
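
One way such a compile-time width could be expressed is a trait specialized per accelerator tag. This is only a sketch: the tags `CpuSerialTag` and `GpuTag`, the trait `SimdWidth`, and the choice of width 1 for GPU threads are assumptions for illustration, not the PR's actual code.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical accelerator tags standing in for alpaka's accelerator types.
struct CpuSerialTag {};
struct GpuTag {};

// Widest vector register (in bytes) the compiler was told to target.
#if defined(__AVX512F__)
inline constexpr std::size_t kCpuVectorBytes = 64;
#elif defined(__AVX2__)
inline constexpr std::size_t kCpuVectorBytes = 32;
#elif defined(__SSE2__)
inline constexpr std::size_t kCpuVectorBytes = 16;
#else
inline constexpr std::size_t kCpuVectorBytes = 0; // unknown target: treat as scalar
#endif

template<typename Acc, typename T>
struct SimdWidth;

// CPU: as many lanes of T as fit into one vector register, at least 1.
template<typename T>
struct SimdWidth<CpuSerialTag, T>
{
    static constexpr std::size_t value
        = kCpuVectorBytes >= sizeof(T) ? kCpuVectorBytes / sizeof(T) : 1;
};

// GPU (assumption for this sketch): each thread handles one element.
template<typename T>
struct SimdWidth<GpuTag, T>
{
    static constexpr std::size_t value = 1;
};
```

Under this scheme an AVX2 build gives `SimdWidth<CpuSerialTag, std::uint32_t>::value == 8`, which matches the `simdWidth` of 8 reported for packed RGBA in the logs below.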

APIs
Two APIs: low-level explicit SIMD types/operations and the high-level SimdAlgo forEach abstraction.
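
To make the difference between the two styles concrete, here is a sketch of a vector addition written both ways. `Pack4`, `load`, `store`, `vectorAddExplicit`, and `forEachSketch` are all hypothetical names; the actual signatures of the PR's SIMD types and of SimdAlgo forEach are not shown in this description and may look quite different.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical 4-lane pack used only for this sketch.
struct Pack4
{
    float v[4];
};

inline Pack4 load(float const* p)
{
    return Pack4{{p[0], p[1], p[2], p[3]}};
}

inline void store(Pack4 const& x, float* p)
{
    for(int l = 0; l < 4; ++l)
        p[l] = x.v[l];
}

inline Pack4 operator+(Pack4 const& a, Pack4 const& b)
{
    Pack4 r{};
    for(int l = 0; l < 4; ++l)
        r.v[l] = a.v[l] + b.v[l];
    return r;
}

// Style 1 (low-level): the kernel author writes the pack loop and the scalar tail.
inline void vectorAddExplicit(float const* a, float const* b, float* c, std::size_t n)
{
    assert(a && b && c);
    std::size_t i = 0;
    for(; i + 4 <= n; i += 4)
        store(load(a + i) + load(b + i), c + i);
    for(; i < n; ++i)
        c[i] = a[i] + b[i];
}

// Style 2 (high-level): a forEach-like helper owns the chunking and the tail;
// the caller supplies one generic operation that works on packs and scalars.
template<typename Op>
void forEachSketch(float const* a, float const* b, float* c, std::size_t n, Op op)
{
    assert(a && b && c);
    std::size_t i = 0;
    for(; i + 4 <= n; i += 4)
        store(op(load(a + i), load(b + i)), c + i);
    for(; i < n; ++i)
        c[i] = op(a[i], b[i]);
}
```

A call such as `forEachSketch(a, b, c, n, [](auto x, auto y) { return x + y; })` gives the same result as `vectorAddExplicit(a, b, c, n)`, but the caller no longer handles the remainder elements.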

Three SIMD Examples
vectorAddSIMD: Two kernels, one using explicit SIMD types/operators and one using the SimdAlgo forEach approach
reduceRGBSIMD: Real-world RGB-to-grayscale conversion with 8.8x-10.7x speedups
sumOfSquaresSIMD: Demonstrates a SIMD reduction
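
For readers unfamiliar with SIMD reductions, the usual pattern is to keep one partial sum per lane and combine the lanes once at the end. The sketch below shows that pattern in plain C++; the function name `sumOfSquaresSketch` and the default width of 4 are assumptions, and this is not the code of the sumOfSquaresSIMD example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Sum of squares with W independent lane accumulators (W is a hypothetical width).
template<std::size_t W = 4>
double sumOfSquaresSketch(double const* data, std::size_t n)
{
    assert(data != nullptr || n == 0);
    std::array<double, W> acc{}; // one partial sum per lane

    // Main loop: each lane accumulates every W-th element, so the lanes
    // carry no dependency on each other.
    std::size_t i = 0;
    for(; i + W <= n; i += W)
        for(std::size_t l = 0; l < W; ++l)
            acc[l] += data[i + l] * data[i + l];

    // Horizontal reduction: combine the lane sums once.
    double total = 0.0;
    for(double lane : acc)
        total += lane;

    // Scalar tail for the remaining n % W elements.
    for(; i < n; ++i)
        total += data[i] * data[i];
    return total;
}
```

Because the additions are reordered compared with a plain left-to-right loop, floating-point results can differ in the last bits from a scalar reference.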

Performance Results

[Screenshots: "Screenshot from 2025-08-18 13-57-47" and "Screenshot from 2025-08-18 14-48-48"]

TEST 1: Run of the example with the fallback SIMD policy
(CMake setting for the fallback SIMD policy: ALPAKA_FORCE_DISABLE_STD_SIMD ON)

./example/examplesUsingSIMD/reduceRGBSIMD/reduceRGBSIMD 
Elements: 33554432
Accelerator: AccCpuSerial<1,long unsigned int>
Testing packed RGBA SIMD processing...

=== SIMD System Information ===
Compiler: GCC 13.3
Standard Library: libstdc++ 13
SIMD Instructions: AVX2
SIMD Policy: Fallback
Accelerator Setup: CPU only (Serial)
Compiler Environment: libstdc++ stdlib | AVX2 instructions | Release build
================================
simdWidth for packed RGBA (uint32_t) is 8
[WorkDiv CPU-Serial] numElements=33554432 simdWidth=8 packCount=4194304 gridBlocks=524288 threadsPerBlock=1 logicalThreads=524288 hardwareThreads=1
Device buffer aligned: YES
Scalar: 524288 threads (524288 blocks × 1 threads) simdWidth=8
SIMD  : 524288 threads (524288 blocks × 1 threads) simdWidth=8 (in-place)

--- Performance Summary ---
scalar_time_ms = 137.032
simd_time_ms   = 15.647
speedup        = 8.758 (scalar/simd)
simd policy    = Fallback
---------------------------

TEST 2: Run with the default SIMD policy (std::experimental::simd)


./example/examplesUsingSIMD/reduceRGBSIMD/reduceRGBSIMD
Elements: 33554432
Accelerator: AccCpuSerial<1,long unsigned int>
Testing packed RGBA SIMD processing...

=== SIMD System Information ===
Compiler: GCC 13.3
Standard Library: libstdc++ 13
SIMD Instructions: AVX2
SIMD Policy: std::experimental::simd
Accelerator Setup: CPU only (Serial)
Compiler Environment: libstdc++ stdlib | AVX2 instructions | Release build
================================
simdWidth for packed RGBA (uint32_t) is 8
[WorkDiv CPU-Serial] numElements=33554432 simdWidth=8 packCount=4194304 gridBlocks=524288 threadsPerBlock=1 logicalThreads=524288 hardwareThreads=1
Device buffer aligned: YES
Scalar: 524288 threads (524288 blocks × 1 threads) simdWidth=8
SIMD  : 524288 threads (524288 blocks × 1 threads) simdWidth=8 (in-place)

--- Performance Summary ---
scalar_time_ms = 184.336
simd_time_ms   = 14.527
speedup        = 12.689 (scalar/simd)
simd policy    = std::experimental::simd
---------------------------
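
The derived numbers in both logs can be re-checked with a few lines of arithmetic. The helper names below are hypothetical; the formulas are simply those implied by the log labels (`packCount`, `speedup = scalar/simd`).

```cpp
#include <cassert>
#include <cstddef>

// Number of SIMD packs for numElements (assumes exact division, as in the logged runs).
inline std::size_t packCount(std::size_t numElements, std::size_t simdWidth)
{
    assert(simdWidth != 0 && numElements % simdWidth == 0);
    return numElements / simdWidth;
}

// Speedup as reported in the summary: scalar time divided by SIMD time.
inline double speedup(double scalarMs, double simdMs)
{
    assert(simdMs > 0.0);
    return scalarMs / simdMs;
}
```

`packCount(33554432, 8)` gives 4194304, and dividing that by the 524288 grid blocks gives 8 packs (64 elements) per block. `speedup(137.032, 15.647)` is about 8.758 and `speedup(184.336, 14.527)` is about 12.689, matching the summaries. Note that the SIMD times of the two runs are close (15.647 ms vs 14.527 ms), so most of the difference in speedup comes from the scalar baseline (137.032 ms vs 184.336 ms).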

@mehmetyusufoglu mehmetyusufoglu marked this pull request as draft July 29, 2025 08:06
@mehmetyusufoglu mehmetyusufoglu changed the title Simd and examples [wip] Simd and examples Jul 29, 2025
@SimeonEhrig
Member

Did you change anything in the job generator? It looks like only reformatting.

@mehmetyusufoglu mehmetyusufoglu force-pushed the simdWithExamples2 branch 4 times, most recently from 51bbfc1 to 8ca9828 Compare August 4, 2025 18:37
Comment thread example/reduceRGBbySIMD/src/reduceRGBbySIMD.cpp Outdated
@mehmetyusufoglu mehmetyusufoglu force-pushed the simdWithExamples2 branch 5 times, most recently from 9793c02 to 46cedf6 Compare August 13, 2025 16:14
@mehmetyusufoglu mehmetyusufoglu force-pushed the simdWithExamples2 branch 10 times, most recently from 62faf14 to 37d34ab Compare August 18, 2025 12:41
@mehmetyusufoglu
Contributor Author

Did you change anything in the job generator? It looks like only reformatting.

I removed those changes; it was just reformatting.

@mehmetyusufoglu mehmetyusufoglu changed the title [wip] Simd and examples Simd Implementation and examples Aug 18, 2025
@mehmetyusufoglu mehmetyusufoglu marked this pull request as ready for review August 18, 2025 13:12
Member

@psychocoderHPC psychocoderHPC left a comment

I checked only the beginning; this is not a complete review.

Comment thread CMakeLists.txt Outdated

option(alpaka_BUILD_EXAMPLES "Build the examples" OFF)
option(alpaka_BUILD_BENCHMARKS "Build the benchmarks" OFF)
option(ALPAKA_FORCE_DISABLE_STD_SIMD "Force use of fallback PortableSimd implementation instead of std::experimental::simd" OFF)
Member

Please rename this to alpaka_USE_STD_SIMD; setting it to OFF then selects our portable SIMD type.
With "disable" in the name, setting the option to OFF is a double negation, which makes things complicated.
By the way, all variables always start with a lowercase alpaka.

Contributor Author

OK, changed to alpaka_USE_STD_SIMD and corrected the code accordingly. The default is ON.

Comment thread CMakeLists.txt Outdated

# Propagate force-disable SIMD flag if requested
if(ALPAKA_FORCE_DISABLE_STD_SIMD)
add_compile_definitions(ALPAKA_FORCE_DISABLE_STD_SIMD=1)
Member

Take care that somewhere in the code there is a macro that falls back to a useful default in case the define is not set. This is the case, e.g., in standalone headers where CMake is not used.

Contributor Author

@mehmetyusufoglu mehmetyusufoglu Aug 19, 2025

OK, I will add this to the code so that a default is available even when CMake is not used:
#ifndef ALPAKA_USE_STD_SIMD
#define ALPAKA_USE_STD_SIMD 1
#endif

There is also an internal variable, ALPAKA_SIMD_USE_STD_EXPERIMENTAL; I will rename it since the two names are very similar.

Contributor Author

OK, I removed the internal variable ALPAKA_SIMD_USE_STD_EXPERIMENTAL from the code. If ALPAKA_USE_STD_SIMD is ON at CMake configuration and std::experimental::simd is technically available (as indicated by the internal macro ALPAKA_SIMD_ENV_STD_EXP_SIMD_AVAILABLE), then std::experimental::simd will be used.
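
The selection rule described here (user switch AND technical availability) can be sketched with the preprocessor as follows. `ALPAKA_USE_STD_SIMD` and its default of 1 come from this thread; the availability probe via `__has_include` and the names `SKETCH_STD_EXP_SIMD_AVAILABLE`, `kUsesStdSimd`, and `simdPolicyName` are assumptions for illustration, since the real conditions behind ALPAKA_SIMD_ENV_STD_EXP_SIMD_AVAILABLE are not shown in the conversation.

```cpp
// Default for builds without CMake (e.g. standalone header use), as proposed above.
#ifndef ALPAKA_USE_STD_SIMD
#    define ALPAKA_USE_STD_SIMD 1
#endif

// Assumption: availability is probed with __has_include only. The PR's real
// availability macro may also check the compiler, standard library, or backend.
#if defined(__has_include)
#    if __has_include(<experimental/simd>)
#        define SKETCH_STD_EXP_SIMD_AVAILABLE 1
#    endif
#endif
#ifndef SKETCH_STD_EXP_SIMD_AVAILABLE
#    define SKETCH_STD_EXP_SIMD_AVAILABLE 0
#endif

// std::experimental::simd is used only if it is both requested and available.
#if ALPAKA_USE_STD_SIMD && SKETCH_STD_EXP_SIMD_AVAILABLE
inline constexpr bool kUsesStdSimd = true;
#else
inline constexpr bool kUsesStdSimd = false;
#endif

// Policy label in the style of the "SIMD Policy" line of the example output.
constexpr char const* simdPolicyName()
{
    return kUsesStdSimd ? "std::experimental::simd" : "Fallback";
}
```

Compiling with `-DALPAKA_USE_STD_SIMD=0` forces `kUsesStdSimd` to `false` regardless of header availability, which is the behavior the renamed option is meant to give.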

@mehmetyusufoglu mehmetyusufoglu marked this pull request as draft August 19, 2025 10:10
@mehmetyusufoglu mehmetyusufoglu force-pushed the simdWithExamples2 branch 2 times, most recently from bf3fdfb to 2e6f6c6 Compare August 19, 2025 13:11
@mehmetyusufoglu mehmetyusufoglu marked this pull request as ready for review August 20, 2025 09:10
Always use:

* ``-O3`` (Release build)
* ``-march=native`` (CPU-specific optimizations)
Contributor

To make the resulting binaries a bit more generic, rather than -march=native you could use the x86-64 micro-architecture levels:

  • -march=x86-64-v2 to enable only SSE3/SSE4;
  • -march=x86-64-v3 to enable also AVX/AVX2 and FMA;
  • -march=x86-64-v4 to enable also AVX512.
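
To check which level a given flag actually enabled, one can inspect the feature macros that GCC and Clang predefine. The function below is a rough sketch: the name `x86_64LevelHint` is hypothetical, and it tests only a subset of the features each level requires, so treat the result as a hint rather than a conformance check.

```cpp
// Returns an approximate x86-64 micro-architecture level (1 to 4) based on
// GCC/Clang feature macros, or 0 when the target is not recognized as x86-64.
constexpr int x86_64LevelHint()
{
#if defined(__AVX512F__) && defined(__AVX512BW__) && defined(__AVX512CD__) && defined(__AVX512DQ__)              \
    && defined(__AVX512VL__)
    return 4; // AVX-512 subset of x86-64-v4
#elif defined(__AVX2__) && defined(__FMA__) && defined(__BMI2__)
    return 3; // AVX/AVX2 and FMA, as in x86-64-v3
#elif defined(__SSE4_2__) && defined(__POPCNT__)
    return 2; // SSE3/SSE4, as in x86-64-v2
#elif defined(__x86_64__)
    return 1; // baseline x86-64 (SSE2)
#else
    return 0; // not x86-64, or a compiler that does not define these macros
#endif
}
```

Building the same file with `-march=x86-64-v2`, `-march=x86-64-v3`, and `-march=x86-64-v4` and printing `x86_64LevelHint()` is a quick way to confirm that the flag reached the compiler.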

Contributor Author

OK, added, thanks, I did not know about these. (Although I could only test with clang and g++, I also added the corresponding flags for other compilers.)

Comment thread example/CMakeLists.txt
Contributor

I guess the new tests should be added in alphabetical order, like the existing ones?

Contributor Author

ok, doing.

Contributor Author

done, thanks.

@fwyzard fwyzard changed the title Simd Implementation and examples SIMD implementation and examples Oct 7, 2025
@fwyzard fwyzard added this to the 2.2.0 milestone Dec 15, 2025
4 participants