This benchmark program demonstrates the performance impact of CPU cache prefetching on memory access patterns. It compares two approaches for accessing strided memory: one without prefetch instructions and one with explicit prefetch hints to the CPU.
The program measures and compares the execution time of memory access operations with and without software prefetching, demonstrating how proper cache management can improve performance in memory-bound applications with predictable access patterns.
- C++ Standard: C++17 or higher
- Supported Compilers:
- GCC 7.0+
- Clang 5.0+
- Any compiler supporting
__builtin_prefetch
- x86/x86_64 architecture with SSE support
- Minimum 64 MB RAM (for data allocation)
- CPU with hardware prefetch capabilities
- Standard C++ Library
<xmmintrin.h>(SSE intrinsics header)
g++ -std=c++17 -O2 main.cpp -o prefetch_benchmarkg++ -std=c++17 -O3 -march=native main.cpp -o prefetch_benchmarkg++ -std=c++17 -g -O0 main.cpp -o prefetch_benchmark_debug-std=c++17: Enables C++17 standard features-O2/-O3: Optimization levels (O3 for maximum performance)-march=native: Optimizes for the host CPU architecture-g: Includes debug symbols-O0: Disables optimizations for debugging
./prefetch_benchmarkSum 1024
Sum 1024
No prefetch: 1234us
With prefetch: 987us
constexpr size_t SIZE = 16'777'216;- Value: 16,777,216 elements (16 Mi)
- Memory: ~64 MB for integer array
- Purpose: Large enough to exceed L3 cache on most systems, forcing main memory access
constexpr size_t STRIDE = 1024 * 16;- Value: 16,384 elements
- Memory Distance: ~64 KB between accesses
- Purpose: Creates non-sequential memory access pattern that benefits from prefetching
template <typename Func> long long measure_time(Func &&f)Description: Generic template function for high-precision timing of arbitrary operations.
Parameters:
f: Callable object (lambda, function, functor) to be timed
Returns:
long long: Execution time in microseconds
Implementation Details:
- Uses
std::chrono::high_resolution_clockfor precise timing - Perfect forwarding via universal reference
- Template allows zero-overhead abstraction
- Resolution: Microsecond precision
Use Cases:
- Performance benchmarking
- Profiling code sections
- Comparing algorithm implementations
long long t1 = measure_time([&] {
long long sum = 0;
for (size_t i = 0; i < SIZE; i += STRIDE) {
sum += data[i];
}
std::cout << "Sum " << sum << '\n';
});Purpose: Establishes baseline performance for strided memory access without optimization hints.
long long t2 = measure_time([&] {
long long sum = 0;
for (size_t i = 0; i < SIZE; i += STRIDE) {
if (i + STRIDE < SIZE) {
__builtin_prefetch(&data[i + STRIDE], 0, 3);
}
sum += data[i];
}
std::cout << "Sum " << sum << '\n';
});Prefetch Instruction:
__builtin_prefetch(&data[i + STRIDE], 0, 3);Parameters:
&data[i + STRIDE]: Address to prefetch0: Read access3: High temporal locality hint
Bounds Check:
if (i + STRIDE < SIZE)Prevents prefetching beyond array boundaries.
Without Prefetch:
- Cache misses cause main memory access delays
With Prefetch:
- Prefetched data reduces cache miss latency
- Small to moderate performance gains depending on hardware
- Use prefetching for predictable strided access
- Adjust prefetch distance and locality hints
- Always profile on target hardware
- Compiler-specific intrinsics
- Platform dependencies
- Hardware prefetcher capabilities
- Use
g++instead ofgccfor C++ code - Ensure bounds checking for prefetch addresses
- Intel Optimization Manual
- GCC
__builtin_prefetchdocumentation - CPU architecture guides
Educational resource for benchmarking
- v1.0: Initial implementation and benchmarking