Bishal-9/PreFetch

Memory Prefetch Performance Benchmark Documentation

Overview

This benchmark program demonstrates the performance impact of CPU cache prefetching on memory access patterns. It compares two approaches for accessing strided memory: one without prefetch instructions and one with explicit prefetch hints to the CPU.

Purpose

The program measures and compares the execution time of memory access operations with and without software prefetching, demonstrating how proper cache management can improve performance in memory-bound applications with predictable access patterns.

System Requirements

Compiler Support

  • C++ Standard: C++17 or higher
  • Supported Compilers:
    • GCC 7.0+
    • Clang 5.0+
    • Any compiler supporting __builtin_prefetch

Hardware Requirements

  • x86/x86_64 architecture with SSE support
  • Minimum 64 MB RAM (for data allocation)
  • CPU with hardware prefetch capabilities

Dependencies

  • Standard C++ Library
  • <xmmintrin.h> (SSE intrinsics header)

Build Instructions

Standard Build

g++ -std=c++17 -O2 main.cpp -o prefetch_benchmark

Optimized Build (Recommended)

g++ -std=c++17 -O3 -march=native main.cpp -o prefetch_benchmark

Debug Build

g++ -std=c++17 -g -O0 main.cpp -o prefetch_benchmark_debug

Compiler Flags Explained

  • -std=c++17: Enables C++17 standard features
  • -O2 / -O3: Optimization levels (O3 for maximum performance)
  • -march=native: Optimizes for the host CPU architecture
  • -g: Includes debug symbols
  • -O0: Disables optimizations for debugging

Usage

Running the Benchmark

./prefetch_benchmark

Expected Output

Sum 1024
Sum 1024
No prefetch: 1234us
With prefetch: 987us

Each test prints the sum once; the timing values above are illustrative only and will vary with hardware, compiler, and system load.

Code Architecture

Constants

SIZE

constexpr size_t SIZE = 16'777'216;
  • Value: 16,777,216 elements (16 Mi)
  • Memory: ~64 MB for integer array
  • Purpose: Large enough to exceed L3 cache on most systems, forcing main memory access

STRIDE

constexpr size_t STRIDE = 1024 * 16;
  • Value: 16,384 elements
  • Memory Distance: ~64 KB between accesses
  • Purpose: Creates non-sequential memory access pattern that benefits from prefetching

Functions

measure_time

template <typename Func> long long measure_time(Func &&f)

Description: Generic template function for high-precision timing of arbitrary operations.

Parameters:

  • f: Callable object (lambda, function, functor) to be timed

Returns:

  • long long: Execution time in microseconds

Implementation Details:

  • Uses std::chrono::high_resolution_clock for precise timing
  • Perfect forwarding via a forwarding reference (Func &&)
  • Template allows zero-overhead abstraction
  • Resolution: Microsecond precision

Use Cases:

  • Performance benchmarking
  • Profiling code sections
  • Comparing algorithm implementations

Main Function

Test 1: Baseline (No Prefetch)

long long t1 = measure_time([&] {
    long long sum = 0;
    for (size_t i = 0; i < SIZE; i += STRIDE) {
        sum += data[i];
    }
    std::cout << "Sum " << sum << '\n';
});

Purpose: Establishes baseline performance for strided memory access without optimization hints.

Test 2: With Prefetch

long long t2 = measure_time([&] {
    long long sum = 0;
    for (size_t i = 0; i < SIZE; i += STRIDE) {
        if (i + STRIDE < SIZE) {
            __builtin_prefetch(&data[i + STRIDE], 0, 3);
        }
        sum += data[i];
    }
    std::cout << "Sum " << sum << '\n';
});

Prefetch Instruction:

__builtin_prefetch(&data[i + STRIDE], 0, 3);

Parameters:

  1. &data[i + STRIDE]: Address to prefetch
  2. 0: Read access
  3. 3: High temporal locality hint

Bounds Check:

if (i + STRIDE < SIZE)

Prevents prefetching beyond array boundaries.

Performance Characteristics

Memory Access Pattern

Without Prefetch:

  • Cache misses cause main memory access delays

With Prefetch:

  • Prefetched data reduces cache miss latency

Expected Results

  • Small to moderate performance gains depending on hardware

Best Practices

  • Use prefetching for predictable strided access
  • Adjust prefetch distance and locality hints
  • Always profile on target hardware

Limitations and Considerations

  • Compiler-specific intrinsics
  • Platform dependencies
  • Hardware prefetcher capabilities

Troubleshooting

  • Use g++ instead of gcc for C++ code
  • Ensure bounds checking for prefetch addresses

References

  • Intel Optimization Manual
  • GCC __builtin_prefetch documentation
  • CPU architecture guides

License

Provided as an educational resource for benchmarking experiments.

Version History

  • v1.0: Initial implementation and benchmarking
