Removing the Global Lock in SiftGPUFeatureMatcher for CUDA backend #3561
Merged
ahojnnes merged 3 commits on Aug 14, 2025
Thanks @yimingc for upstreaming these changes.
ahojnnes approved these changes Aug 14, 2025
tavislocus pushed a commit to tavislocus/colmap_6dof that referenced this pull request Aug 19, 2025
…olmap#3561)

### **Key Changes:**

1. **Removed Global Serialization Lock for CUDA**
   * Eliminated `sift_match_gpu_mutexes_` that blocked all CUDA matching operations
   * CUDA version now runs completely lock-free during compute operations
2. **Explicit CUDA Initialization at the Worker-Thread Level using `cudaSetDevice()`**
   * Each worker thread initializes its own CUDA context independently
   * Eliminates the need for complex per-instance initialization logic
3. **Improved Variable Naming for Clarity**
   * Renamed `sift_match_gpu_mutexes_` to `sift_opengl_mutexes_`
   * Updated comments to clarify OpenGL-specific mutex usage
   * Removed misleading references to "all GPU implementations"

#### Why Can We Remove The Global Mutex?

* **CUDA Context Lifecycle**: The CUDA context lifecycle is managed automatically by the driver layer, allowing multiple threads to safely and independently allocate memory, create streams, and execute kernels on the same device. [link](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#initialization)
* **Thread-Safe CUDA Operations**: Common operations including memory allocation (`cudaMalloc`), data copying (`cudaMemcpy`), and stream creation/switching (`cudaStreamCreate`) are inherently thread-safe at the CUDA Runtime API level. [link](https://forums.developer.nvidia.com/t/cudahostregister-on-multiple-threads/296497?utm_source=chatgpt.com)
* **Per-Thread Default Stream Isolation**: Under PTDS mode, each thread gets its own isolated default CUDA stream, eliminating the need for explicit synchronization between threads.
* **No Static Variable Dependencies**: Analysis of the SiftGPU/SiftMatchCU code confirmed that the matching operations do not rely on shared static variables that would require serialization protection.

#### **Why Do We Add `SetBestCudaDevice(gpu_index);` in `FeatureMatcherWorker::Run()`?**

Short answer: `cudaSetDevice(gpu_index)` is called inside it, and we need that call to bind the CUDA context explicitly.
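The short answer above relies on the current CUDA device being per-thread state. The following is a minimal sketch of that behavior using only the C++ standard library. It makes no real CUDA calls: `FakeSetDevice`, `FakeGetDevice`, `g_current_device`, and `RunWorkers` are hypothetical stand-ins invented for illustration, not COLMAP or CUDA APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for the CUDA runtime's per-thread "current device".
// In real code this state lives inside the CUDA runtime and is changed with
// cudaSetDevice(); nothing in this sketch calls CUDA.
thread_local int g_current_device = 0;

// Hypothetical stand-in for cudaSetDevice(): affects only the calling thread.
void FakeSetDevice(int device) { g_current_device = device; }

// Hypothetical stand-in for cudaGetDevice().
int FakeGetDevice() { return g_current_device; }

// Starts one worker per entry in gpu_indices. Each worker binds its device as
// its first step, then records which device its later work would run on.
std::vector<int> RunWorkers(const std::vector<int>& gpu_indices) {
  std::vector<int> observed(gpu_indices.size(), -1);
  std::vector<std::thread> workers;
  for (std::size_t i = 0; i < gpu_indices.size(); ++i) {
    workers.emplace_back([&observed, &gpu_indices, i]() {
      FakeSetDevice(gpu_indices[i]);  // first statement of the worker body
      observed[i] = FakeGetDevice();  // later work sees this thread's device
    });
  }
  for (auto& worker : workers) worker.join();
  return observed;
}
```

Calling `RunWorkers({0, 1, 1, 0})` leaves the calling thread's own device untouched, which is why a device set in one thread cannot be assumed in another and each worker has to set it itself.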
**Thread-Local Device Context Requirements:**

* **Per-Thread Device Binding**: CUDA device selection is thread-local state. Each worker thread must explicitly set its target GPU device before performing any CUDA operations.
* **Multi-GPU Environment Support**: In systems with multiple GPUs, different worker threads may be assigned to different devices. The `cudaSetDevice()` call ensures each thread operates on its designated GPU.
* **Early Initialization Timing**: By setting the device at the beginning of `Run()`, we guarantee that all subsequent CUDA operations (SiftGPU initialization, memory allocation, kernel execution) occur on the correct device.
* **Context Warm-up**: The `cudaFree(0)` call immediately after `cudaSetDevice()` serves as a context warm-up operation, ensuring the CUDA context is fully initialized before the worker begins processing.

### 🔒 **Thread Safety Guarantees**

* **Initialization Phase**: Protected by per-GPU mutexes during lazy setup
* **Compute Phase**: Lock-free parallel execution with dedicated CUDA streams
* **Instance Isolation**: Each thread operates on independent matcher instances
* **Stream Isolation**: PTDS ensures each thread's CUDA operations are automatically isolated

### 🚀 **Benefits**

1. **Eliminates Serialization Bottleneck**: Multiple threads can submit kernels concurrently
2. **Maximizes GPU Utilization**: True parallelism with PTDS integration
3. **Reduces CPU Overhead**: No lock contention in compute-heavy operations
4. **Maintains Safety**: Thread-safe initialization with no locking cost during compute
5. **Cleaner Architecture**: Separation of initialization and compute concerns

### Example Use Case

When running under PTDS (Per-Thread Default Stream), each worker thread's default stream is already isolated. Combined with this lock removal, descriptor generation and matching can run fully asynchronously across multiple threads, improving both CPU and GPU throughput.
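The split described under Thread Safety Guarantees (setup serialized per GPU, compute without any lock, one matcher instance per thread) can be sketched roughly as follows. This is an illustration under stated assumptions, not the COLMAP implementation: `FakeMatcher`, `GpuInitLocks`, `CreateMatcher`, and `RunMatching` are hypothetical names, and the sketch uses only the C++ standard library with no CUDA or SiftGPU calls.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a per-thread matcher instance.
struct FakeMatcher {
  int gpu_index = -1;
  int num_matched = 0;
  void Match() { ++num_matched; }  // touches only this instance's state
};

// One mutex per GPU, held only while a matcher is being set up.
struct GpuInitLocks {
  explicit GpuInitLocks(int num_gpus)
      : mutexes(num_gpus), init_counts(num_gpus) {}
  std::vector<std::mutex> mutexes;
  std::vector<int> init_counts;  // each entry guarded by the same-index mutex
};

// Initialization phase: serialized per GPU.
FakeMatcher CreateMatcher(GpuInitLocks& locks, int gpu_index) {
  std::lock_guard<std::mutex> lock(locks.mutexes[gpu_index]);
  ++locks.init_counts[gpu_index];
  FakeMatcher matcher;
  matcher.gpu_index = gpu_index;
  return matcher;
}

// Runs threads_per_gpu workers for every GPU and returns the total number of
// Match() calls across all workers.
int RunMatching(GpuInitLocks& locks, int threads_per_gpu,
                int pairs_per_thread) {
  std::atomic<int> total{0};
  std::vector<std::thread> workers;
  const int num_gpus = static_cast<int>(locks.mutexes.size());
  for (int gpu = 0; gpu < num_gpus; ++gpu) {
    for (int t = 0; t < threads_per_gpu; ++t) {
      workers.emplace_back([&locks, &total, gpu, pairs_per_thread]() {
        FakeMatcher matcher = CreateMatcher(locks, gpu);  // takes the lock
        // Compute phase: no lock, because the matcher is private to this
        // thread.
        for (int i = 0; i < pairs_per_thread; ++i) matcher.Match();
        total += matcher.num_matched;
      });
    }
  }
  for (auto& worker : workers) worker.join();
  return total.load();
}
```

With two GPUs, three workers per GPU, and 100 pairs per worker, `RunMatching` returns 600 and each GPU's mutex is taken exactly three times, once per matcher setup. The point of the design is that lock hold time no longer grows with the amount of matching work.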
---------

Co-authored-by: Yiming Chen <yiming@meta.com>
Co-authored-by: Johannes Schönberger <jsch@meta.com>
Co-authored-by: Johannes Schönberger <jsch@demuc.de>