Memory Prefetch Performance Benchmark Documentation

Overview

This benchmark measures the effect of software prefetching on strided memory access. It compares two versions of the same loop: one without prefetch instructions and one that issues explicit prefetch hints to the CPU.

Purpose

The program times the same strided read loop with and without software prefetching, to show how prefetch hints can hide memory latency in memory-bound code with predictable access patterns.

System Requirements

Compiler Support

C++ Standard: C++17 or higher
Supported Compilers:
- GCC 7.0+
- Clang 5.0+
- Any compiler supporting __builtin_prefetch
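
For compilers that lack `__builtin_prefetch`, one option is to hide the builtin behind a small wrapper that degrades to a no-op. The sketch below is an assumption about how that could look; `prefetch_read` is a hypothetical name and is not part of the benchmark source.

```cpp
// Hypothetical helper (not in the benchmark source): issues a read prefetch
// with high temporal locality where the GCC/Clang builtin is available,
// and does nothing elsewhere.
inline void prefetch_read(const void* addr) noexcept {
#if defined(__GNUC__) || defined(__clang__)
    __builtin_prefetch(addr, 0, 3);
#else
    (void)addr;  // no builtin assumed here; the hint is simply dropped
#endif
}
```

Because a prefetch is only a hint, dropping it changes timing but not results.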

Hardware Requirements

x86/x86_64 architecture with SSE support
At least 64 MB of free RAM for the data array (16,777,216 4-byte integers)
CPU with hardware prefetch capabilities

Dependencies

Standard C++ Library
<xmmintrin.h> (SSE intrinsics header)

Build Instructions

Standard Build

g++ -std=c++17 -O2 main.cpp -o prefetch_benchmark

Optimized Build (Recommended)

g++ -std=c++17 -O3 -march=native main.cpp -o prefetch_benchmark

Debug Build

g++ -std=c++17 -g -O0 main.cpp -o prefetch_benchmark_debug

Compiler Flags Explained

-std=c++17: Enables C++17 standard features
-O2 / -O3: Optimization levels (O3 for maximum performance)
-march=native: Optimizes for the host CPU architecture
-g: Includes debug symbols
-O0: Disables optimizations for debugging

Usage

Running the Benchmark

./prefetch_benchmark

Expected Output

Sum 1024
Sum 1024
No prefetch: 1234us
With prefetch: 987us

The timings shown are illustrative; actual values vary with hardware, compiler, and system load. The two sums should always match, because both loops read the same elements.

Code Architecture

Constants

`SIZE`

constexpr size_t SIZE = 16'777'216;

Value: 16,777,216 elements (16 Mi)
Memory: 64 MiB for an array of 4-byte integers
Purpose: Large enough to exceed L3 cache on most systems, forcing main memory access

`STRIDE`

constexpr size_t STRIDE = 1024 * 16;

Value: 16,384 elements
Memory Distance: 64 KiB between consecutive accesses (with 4-byte integers)
Purpose: Creates a non-sequential access pattern in which every access lands on a different cache line, so each load can benefit from prefetching
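
The figures quoted for `SIZE` and `STRIDE` can be checked at compile time. The sketch below assumes the array holds `int` elements; the derived names (`ACCESSES`, `ARRAY_BYTES`, `STRIDE_BYTES`) are illustrative and do not appear in the source.

```cpp
#include <cstddef>

constexpr std::size_t SIZE = 16'777'216;   // elements, as in the benchmark
constexpr std::size_t STRIDE = 1024 * 16;  // elements between accesses

// Derived quantities (illustrative names, not from the source).
constexpr std::size_t ACCESSES = SIZE / STRIDE;             // loop iterations
constexpr std::size_t ARRAY_BYTES = SIZE * sizeof(int);     // total footprint
constexpr std::size_t STRIDE_BYTES = STRIDE * sizeof(int);  // gap between loads

static_assert(SIZE % STRIDE == 0, "the stride divides the array evenly");
static_assert(ACCESSES == 1024, "each test touches 1,024 elements");
// The byte figures only hold where int is 4 bytes wide.
static_assert(sizeof(int) != 4 || ARRAY_BYTES == 64u * 1024 * 1024, "64 MiB");
static_assert(sizeof(int) != 4 || STRIDE_BYTES == 64u * 1024, "64 KiB");
```

Each test therefore performs only 1,024 loads. That is consistent with the `Sum 1024` line in the expected output if every element holds 1, which is an assumption since the initialization is not shown here.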

Functions

`measure_time`

template <typename Func> long long measure_time(Func &&f)

Description: Generic template function for high-precision timing of arbitrary operations.

Parameters:

f: Callable object (lambda, function, functor) to be timed

Returns:

long long: Execution time in microseconds

Implementation Details:

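Uses std::chrono::high_resolution_clock to timestamp the start and end of the call
Accepts the callable by forwarding reference (Func &&), so lambdas, function pointers, and functors all work without copying
Being a template, it lets the compiler inline the callable rather than call it through a type-erased wrapper
Resolution: elapsed time is reported in whole microseconds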

Use Cases:

Performance benchmarking
Profiling code sections
Comparing algorithm implementations
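
The description above fixes the signature, the clock, and the unit, but not the body. A minimal sketch consistent with those details might look like this; the body is an assumption, not the author's code.

```cpp
#include <chrono>
#include <utility>

// Sketch of measure_time: times one invocation of f and returns the
// elapsed time in microseconds.
template <typename Func>
long long measure_time(Func&& f) {
    using clock = std::chrono::high_resolution_clock;
    const auto start = clock::now();
    std::forward<Func>(f)();  // invoke with the caller's value category
    const auto stop = clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start)
        .count();
}
```

If the clock must never jump backwards during a measurement, `std::chrono::steady_clock` is a drop-in alternative for the `clock` alias.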

Main Function

Test 1: Baseline (No Prefetch)

long long t1 = measure_time([&] {
    long long sum = 0;
    for (size_t i = 0; i < SIZE; i += STRIDE) {
        sum += data[i];
    }
    std::cout << "Sum " << sum << '\n';
});

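Purpose: Establishes baseline performance for strided memory access without optimization hints. Note that the std::cout call sits inside the timed lambda, so the measurement includes the cost of printing the sum.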
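
One way to keep I/O out of the measurement is to move the loop into a function that returns the sum and to print after timing stops. The helper below is a hypothetical sketch; `strided_sum` is not part of the benchmark source, and it assumes the data lives in a `std::vector<int>`.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: sums every stride-th element with no prefetch hints.
long long strided_sum(const std::vector<int>& data, std::size_t stride) {
    if (stride == 0) return 0;  // avoid an infinite loop on a zero stride
    long long sum = 0;
    for (std::size_t i = 0; i < data.size(); i += stride) {
        sum += data[i];
    }
    return sum;
}

// Usage sketch with the document's measure_time:
//   long long sum = 0;
//   long long t1 = measure_time([&] { sum = strided_sum(data, STRIDE); });
//   std::cout << "Sum " << sum << '\n';  // printed after timing stops
```

Storing the result in a variable that is printed later also keeps the compiler from discarding the loop as dead code.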

Test 2: With Prefetch

long long t2 = measure_time([&] {
    long long sum = 0;
    for (size_t i = 0; i < SIZE; i += STRIDE) {
        if (i + STRIDE < SIZE) {
            __builtin_prefetch(&data[i + STRIDE], 0, 3);
        }
        sum += data[i];
    }
    std::cout << "Sum " << sum << '\n';
});

Prefetch Instruction:

__builtin_prefetch(&data[i + STRIDE], 0, 3);

Parameters:

&data[i + STRIDE]: Address to prefetch
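0: Read access (1 would indicate a write)
3: High temporal locality hint; valid values run from 0 (no need to keep the data cached after use) to 3 (keep it in all cache levels)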

Bounds Check:

if (i + STRIDE < SIZE)

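Keeps the prefetch address inside the array. A prefetch of an invalid address does not fault, but the address expression itself must be valid, so the check avoids forming a pointer beyond the end of the array on the final iteration.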
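
The bounds check does not have to run on every iteration. The sketch below splits the loop so the main part always has a valid successor to prefetch and the final element is handled separately. `strided_sum_prefetch` is a hypothetical name, and the function assumes the data is a `std::vector<int>`.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: same sum as the prefetch test, with the bounds
// check folded into the loop condition.
long long strided_sum_prefetch(const std::vector<int>& data,
                               std::size_t stride) {
    const std::size_t n = data.size();
    if (stride == 0 || n == 0) return 0;
    long long sum = 0;
    std::size_t i = 0;
    // Main loop: i + stride is a valid index on every iteration.
    for (; i + stride < n; i += stride) {
        __builtin_prefetch(&data[i + stride], 0, 3);
        sum += data[i];
    }
    // Tail: the last visited element has no successor to prefetch.
    for (; i < n; i += stride) {
        sum += data[i];
    }
    return sum;
}
```

Whether this is faster than the in-loop check is something to measure; an optimizing compiler may already produce similar code for the original form.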

Performance Characteristics

Memory Access Pattern

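Without Prefetch:

Each access misses the cache, and the loop stalls until the data arrives from main memory

With Prefetch:

The load for the next element starts one iteration early, so part of its latency overlaps with the current iteration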

Expected Results

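Small to moderate performance gains depending on hardware. The loop does very little work between issuing a prefetch and using the data one iteration later, so only part of the memory latency can be hidden.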

Best Practices

Use prefetching for predictable strided access
Adjust prefetch distance and locality hints
Always profile on target hardware
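
To experiment with prefetch distance and locality hints, both can be made parameters of the loop. The sketch below is one possible arrangement; `Locality`, `prefetch_read_hint`, and `strided_sum_tuned` are hypothetical names, and `distance` is counted in strides.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical names throughout; not part of the benchmark source.
enum class Locality { None, Low, Moderate, High };

// __builtin_prefetch requires compile-time constants for its second and
// third arguments, so each hint level gets its own call.
inline void prefetch_read_hint(const void* addr, Locality hint) noexcept {
    switch (hint) {
        case Locality::None:     __builtin_prefetch(addr, 0, 0); break;
        case Locality::Low:      __builtin_prefetch(addr, 0, 1); break;
        case Locality::Moderate: __builtin_prefetch(addr, 0, 2); break;
        case Locality::High:     __builtin_prefetch(addr, 0, 3); break;
    }
}

// Sums every stride-th element, prefetching `distance` strides ahead.
long long strided_sum_tuned(const std::vector<int>& data, std::size_t stride,
                            std::size_t distance, Locality hint) {
    if (stride == 0) return 0;
    const std::size_t n = data.size();
    long long sum = 0;
    for (std::size_t i = 0; i < n; i += stride) {
        const std::size_t ahead = i + distance * stride;
        if (ahead < n) prefetch_read_hint(&data[ahead], hint);
        sum += data[i];
    }
    return sum;
}
```

Timing this function for several `distance` values (for example 1, 2, 4, 8) and each hint level on the target machine shows which combination, if any, helps.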

Limitations and Considerations

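Compiler-specific intrinsics: __builtin_prefetch is a GCC/Clang builtin and is not available in every compiler
Platform dependencies: the effect of a prefetch hint differs between CPU architectures and models
Hardware prefetcher capabilities: the CPU's own prefetcher may already cover some access patterns, which reduces the benefit of software prefetching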

Troubleshooting

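Compile and link with g++ rather than gcc; gcc does not link the C++ standard library by default, which leads to undefined-reference errors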
Ensure bounds checking for prefetch addresses

References

Intel Optimization Manual
GCC __builtin_prefetch documentation
CPU architecture guides

License

Educational resource for benchmarking

Version History

v1.0: Initial implementation and benchmarking
