Removing the Global Lock in SiftGPUFeatureMatcher for CUDA backend by yimingc · Pull Request #3561 · colmap/colmap

yimingc · 2025-08-13T18:41:34Z

Key Changes:

1. Removed Global Serialization Lock for CUDA
   - Eliminated the sift_match_gpu_mutexes_ lock from the CUDA path, where it blocked all CUDA matching operations
   - The CUDA version now runs lock-free during compute operations
2. Explicit CUDA Initialization at the Worker Thread Level using cudaSetDevice()
   - Each worker thread sets its own CUDA device and triggers context initialization independently
   - Eliminates the need for complex per-instance initialization logic
3. Improved Variable Naming for Clarity
   - Renamed sift_match_gpu_mutexes_ to sift_opengl_mutexes_
   - Updated comments to clarify OpenGL-specific mutex usage
   - Removed misleading references to "all GPU implementations"
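The change in locking scope can be sketched in plain C++. This is a minimal illustration under stated assumptions, not the code from this PR: `Backend`, `OpenGLMutexForGpu`, and `LockForBackend` are hypothetical names, and the per-GPU table only stands in for the role that `sift_opengl_mutexes_` plays.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <mutex>
#include <optional>

enum class Backend { kCuda, kOpenGL };

// Hypothetical per-GPU mutex table, standing in for sift_opengl_mutexes_.
// The table itself is guarded so that mutexes can be created lazily.
std::mutex& OpenGLMutexForGpu(int gpu_index) {
  assert(gpu_index >= 0);
  static std::mutex table_mutex;
  static std::map<int, std::unique_ptr<std::mutex>> mutexes;
  std::lock_guard<std::mutex> guard(table_mutex);
  std::unique_ptr<std::mutex>& slot = mutexes[gpu_index];
  if (!slot) slot = std::make_unique<std::mutex>();
  return *slot;
}

// Returns a held lock for the OpenGL backend and no lock for CUDA,
// so CUDA matching calls are never serialized by this function.
std::optional<std::unique_lock<std::mutex>> LockForBackend(Backend backend,
                                                           int gpu_index) {
  if (backend == Backend::kOpenGL) {
    return std::unique_lock<std::mutex>(OpenGLMutexForGpu(gpu_index));
  }
  return std::nullopt;
}
```

A caller would keep the returned value alive for the duration of one matching call. With the OpenGL backend, matchers on the same GPU then take turns, while with CUDA they proceed concurrently.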

Why Can We Remove The Global Mutex?

- CUDA Context Lifecycle: The CUDA context lifecycle is managed automatically by the runtime and driver, so multiple threads can safely and independently allocate memory, create streams, and execute kernels on the same device. See https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#initialization
- Thread-Safe CUDA Operations: Common operations including memory allocation (cudaMalloc), data copying (cudaMemcpy), and stream creation (cudaStreamCreate) are thread-safe at the CUDA Runtime API level. See https://forums.developer.nvidia.com/t/cudahostregister-on-multiple-threads/296497?utm_source=chatgpt.com
- Per-Thread Default Stream Isolation: Under PTDS mode (enabled at build time, for example with nvcc --default-stream per-thread), each host thread gets its own default stream. Work issued to it does not implicitly synchronize with the default streams of other threads, so threads need no explicit synchronization with each other to issue independent work.
- No Static Variable Dependencies: Analysis of the SiftGPU/SiftMatchCU code confirmed that the matching operations do not rely on shared static variables that would require serialization protection.

Why Do We Add `SetBestCudaDevice(gpu_index);` in `FeatureMatcherWorker::Run()`?

Short answer: SetBestCudaDevice(gpu_index) calls cudaSetDevice(gpu_index) internally, and we need that call to bind the worker thread to its CUDA device explicitly.

Thread-Local Device Context Requirements:

- Per-Thread Device Binding: CUDA device selection is thread-local state. A thread that never calls cudaSetDevice() uses device 0 by default, so each worker thread must explicitly set its target GPU device before performing any CUDA operations.
- Multi-GPU Environment Support: In systems with multiple GPUs, different worker threads may be assigned to different devices. The cudaSetDevice() call ensures each thread operates on its designated GPU.
- Early Initialization Timing: By setting the device at the beginning of Run(), we guarantee that all subsequent CUDA operations (SiftGPU initialization, memory allocation, kernel execution) occur on the correct device.
- Context Warm-up: The cudaFree(0) call immediately after cudaSetDevice() serves as a context warm-up operation, ensuring the CUDA context is fully initialized before the worker begins processing.
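To illustrate why the device has to be set inside each worker thread, here is a small model in plain C++ rather than real CUDA code. It assumes a `thread_local` variable (`current_device`) as a stand-in for the runtime's per-thread current device. `SetDevice`, `WorkerRun`, and `RunWorkers` are hypothetical helpers, and the real calls (`cudaSetDevice`, `cudaFree(0)`) appear only in comments.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Stand-in for the CUDA runtime's per-thread "current device".
// Like the runtime, every thread starts out on device 0.
thread_local int current_device = 0;

// Models cudaSetDevice(): affects only the calling thread.
void SetDevice(int gpu_index) {
  assert(gpu_index >= 0);
  current_device = gpu_index;
}

// Hypothetical worker body. Real code would call, at the top of Run():
//   cudaSetDevice(gpu_index);
//   cudaFree(0);  // forces context creation before any real work
void WorkerRun(int gpu_index, int* device_seen) {
  SetDevice(gpu_index);
  // Everything after this point in the thread sees the chosen device.
  *device_seen = current_device;
}

// Starts one worker per entry and reports which device each one used.
std::vector<int> RunWorkers(const std::vector<int>& gpu_indices) {
  std::vector<int> seen(gpu_indices.size(), -1);
  std::vector<std::thread> threads;
  for (std::size_t i = 0; i < gpu_indices.size(); ++i) {
    threads.emplace_back(WorkerRun, gpu_indices[i], &seen[i]);
  }
  for (std::thread& thread : threads) thread.join();
  return seen;
}
```

In this model, a device set by one worker is invisible to the others and to the launching thread, which is why a single call in the main thread would not be enough.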

🔒 Thread Safety Guarantees

- Initialization Phase: Protected by per-GPU mutexes during lazy setup
- Compute Phase: Lock-free parallel execution with dedicated CUDA streams
- Instance Isolation: Each thread operates on independent matcher instances
- Stream Isolation: PTDS ensures each thread's CUDA operations are automatically isolated
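The split between a locked initialization phase and a lock-free compute phase can be sketched as follows. This is an illustration under assumptions, not the matcher from this PR: `GpuState`, `Matcher`, and `RunMatchers` are hypothetical, and the addition in `Match` is a placeholder for the real GPU work.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical per-GPU state shared by all matcher instances on one device.
struct GpuState {
  std::mutex init_mutex;    // taken only during one-time setup
  int num_initialized = 0;  // guarded by init_mutex
};

// Hypothetical matcher: one instance per thread, never shared.
class Matcher {
 public:
  explicit Matcher(GpuState* gpu) : gpu_(gpu) { assert(gpu != nullptr); }

  int Match(int a, int b) {
    if (!initialized_) {
      // Initialization phase: serialized across instances on this GPU.
      std::lock_guard<std::mutex> lock(gpu_->init_mutex);
      ++gpu_->num_initialized;
      initialized_ = true;
    }
    // Compute phase: touches only this instance's state, so no lock.
    ++num_matches_;
    return a + b;  // placeholder for the real matching work
  }

  int num_matches() const { return num_matches_; }

 private:
  GpuState* gpu_;
  bool initialized_ = false;
  int num_matches_ = 0;
};

// Runs one matcher per thread and returns each thread's match count.
std::vector<int> RunMatchers(GpuState* gpu, int num_threads,
                             int matches_per_thread) {
  std::vector<int> counts(num_threads, 0);
  std::vector<std::thread> threads;
  for (int t = 0; t < num_threads; ++t) {
    threads.emplace_back([gpu, matches_per_thread, &counts, t] {
      Matcher matcher(gpu);
      for (int i = 0; i < matches_per_thread; ++i) matcher.Match(i, i);
      counts[t] = matcher.num_matches();
    });
  }
  for (std::thread& thread : threads) thread.join();
  return counts;
}
```

Each instance takes the mutex exactly once, on its first call. Every later call skips the lock, which is the property the guarantees above describe.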

🚀 Benefits

1. Eliminates Serialization Bottleneck: Multiple threads can submit kernels concurrently
2. Maximizes GPU Utilization: With PTDS, work submitted by different threads can overlap on the GPU
3. Reduces CPU Overhead: No lock contention in compute-heavy operations
4. Maintains Safety: Thread-safe initialization with no locking on the CUDA compute path
5. Cleaner Architecture: Separation of initialization and compute concerns

Example Use Case

When running under PTDS (Per-Thread Default Stream), each worker thread's default stream is already isolated. Combined with this lock removal, descriptor generation and matching can run asynchronously across multiple threads, which improves both CPU and GPU throughput.

ahojnnes · 2025-08-14T08:10:34Z

Thanks @yimingc for upstreaming these changes.

…olmap#3561)

Co-authored-by: Yiming Chen <yiming@meta.com>
Co-authored-by: Johannes Schönberger <jsch@meta.com>
Co-authored-by: Johannes Schönberger <jsch@demuc.de>

Yiming Chen and others added 3 commits August 11, 2025 17:27

- 5814723: Removing the Global Lock in SiftGPUFeatureMatcher for CUDA backend
- e93d836: d
- 0430bdd: Merge branch 'main' into user/yiming/remove_global_lock_in_sift_cuda_matching

ahojnnes approved these changes Aug 14, 2025

ahojnnes enabled auto-merge (squash) August 14, 2025 08:10

ahojnnes merged commit fc0afc1 into colmap:main Aug 14, 2025
13 checks passed

BrewTestBot mentioned this pull request Nov 7, 2025

colmap 3.13.0 Homebrew/homebrew-core#253538

Merged

ahojnnes merged 3 commits into colmap:main from yimingc:user/yiming/remove_global_lock_in_sift_cuda_matching
