0% found this document useful (0 votes)

25 views22 pages

UCS645 ProjectReport MergeSort

This document presents a project on implementing parallel computing for the merge sort algorithm using NVIDIA's CUDA architecture. It compares the performance of CPU and GPU implementations, highlighting the advantages of GPU acceleration for large datasets, with findings indicating significant speedup and efficiency improvements. The project also discusses methodologies, performance metrics, and optimization opportunities to enhance sorting tasks in high-performance computing environments.

Uploaded by

tanishqshahare2003

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

25 views22 pages

UCS645 ProjectReport MergeSort

Uploaded by

tanishqshahare2003

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 22

Parallel Computing on Merge Sort

A Project Component for

Parallel and Distributing Computing (UCS645)
By

Sr Name Roll No
1 Aishwarya Jain 102203738
2 Alok Priyadashi 102203323
3 Anushka Verma 102203699
4 Samiksha Kak 102203587

Under the guidance of

Dr. Saif Nalband
(Assistant Professor, DCSE)

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

THAPAR INSTITUTE OF ENGINEERING AND TECHNOLOGY
(DEEMED TO BE UNIVERSITY)
PATIALA - 147004

MAY, 2025
Table of Contents

1 Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Introduction to Problem Statement . . . . . . . . . . . . . . . . . . . . . . 1

2 Problem Formulation 4

3 Objectives 5

4 Methodology 6
4.1 Pseudocode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
4.2 Output Screenshots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

5 Performance Analysis 15

6 Results and Discussion 17

List of Figures
1 Parallel computing model showing problem decomposition . . . . . . . . . 2
2 Parallel implementation of merge sort . . . . . . . . . . . . . . . . . . . . . 9
3 Sequential implementation of merge sort . . . . . . . . . . . . . . . . . . . 9
4 Metrics calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
5 CPU vs GPU Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
6 Speedup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
7 Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
8 Load Balancing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
9 Communication Overhead . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
10 Scalabilty . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
11 Granularity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
12 Overhead . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
List of Tables
1 Performance Comparison (CPU vs GPU) . . . . . . . . . . . . . . . . . . . 17
1 Introduction

1.1 Background

This report explores the foundational principles and practical aspects of parallel pro-
gramming, including models, algorithms, and performance metrics. It aims to provide a
structured understanding of how computation can be accelerated by leveraging concur-
rency across multiple processors.

1.2 Introduction to Problem Statement

Efficient sorting of large datasets is a core challenge in high-performance computing

(HPC), particularly as data volumes and processing demands continue to grow. Tra-
ditional merge sort algorithms are effective in sequential environments but struggle to
scale with increasing data size or hardware capabilities. As a result, these CPU-based
solutions often become performance bottlenecks when handling massive datasets.
This project aims to implement and evaluate a parallel version of merge sort using
NVIDIA’s CUDA architecture, which allows general-purpose computing on GPUs. GPUs
offer massive parallelism through thousands of lightweight threads that can execute tasks
concurrently. By offloading the sorting task to the GPU, it is possible to significantly re-
duce execution time and improve scalability. Early implementations using simple element-
wise comparison methods were found to be suboptimal, offering limited speedup and
failing to utilize the full parallel potential of modern GPUs.
To address these limitations, the project moves towards a more structured parallel imple-
mentation based on a bottom-up, multi-pass merge sort strategy. This approach demands
efficient task division, synchronization, and memory usage across CUDA threads to min-
imize overhead and ensure accurate sorting. Performance is assessed using key HPC
metrics such as speedup, efficiency, scalability, granularity, load balancing, and communi-
cation overhead. The ultimate goal is to identify a GPU-based merge sort model that not
only accelerates computation for large arrays but also outperforms traditional CPU imple-
mentations in terms of cost-effectiveness and execution time.

1
Figure 1: Parallel computing model showing problem decomposition

One might ask why there is such a large peak performance gap between many-threaded
GPUs and multicore CPUs. The answer lies in the differences in the fundamental de-
sign philosophies between the two types of processors, as illustrated in Fig. 1. The
design of a CPU, as shown in Fig. 1, is optimized for sequential code performance. The
arithmetic units and operand data delivery logic are designed to minimize the effective
latency of arithmetic operations at the cost of increased use of chip area and power per
unit. Large last-level on-chip caches are designed to capture frequently accessed data
and convert some of the long latency memory accesses into short-latency cache accesses.
Sophisticated branch prediction logic and execution control logic are used to mitigate the
latency of conditional branch instructions. By reducing the latency of operations, the
CPU hardware reduces the execution latency of each individual thread. However, the
low-latency arithmetic units, sophisticated operand delivery logic, large cache memory,
and control logic consume chip area and power that could otherwise be used to provide
more arithmetic execution units and memory access channels.
This design approach is commonly referred to as latency-oriented design. The design
philosophy of the GPUs, on the other hand, has been shaped by the fast-growing video
game industry, which exerts tremendous economic pressure for the ability to perform a
massive number of floating-point calculations and memory accesses per video frame in

2
advanced games. This demand motivates GPU vendors to look for ways to maximize the
chip area and power budget dedicated to floating-point calculations and memory access
throughput.

3
2 Problem Formulation
With the rise in data-intensive applications and big data analytics, the need for high-
performance computing (HPC) solutions has become crucial. Traditional CPU-based
algorithms often fail to deliver the desired efficiency for large-scale computations due to
limited parallelism. Sorting, being a fundamental operation in numerous applications like
databases, scientific simulations, and real-time systems, becomes a natural candidate for
optimization using parallel computing. This project formulates the problem of enhanc-
ing the performance of the merge sort algorithm through GPU acceleration using CUDA
(Compute Unified Device Architecture).

The core objective is to analyze and compare the execution time and performance of merge
sort implemented on both CPU and GPU architectures for arrays of increasing sizes. By
leveraging GPU′ s parallel processing capabilities, the aim is to demonstrate how execution
time can be significantly reduced for large datasets. The experiment involves calculating
various performance metrics such as speedup, efficiency, communication overhead, scala-
bility, granularity, load balancing, and total overhead.

The project first performs sorting on arrays of different sizes using CUDA kernels for
GPU execution and standard recursive methods for CPU. Execution times are recorded
and used to compute the above metrics. The results are visualized through bar charts and
line graphs to better understand the relationship between array size and performance gain.
Additionally, all data is exported to an Excel file titled Performance Metrics for record-
keeping and further analysis.

This comparative analysis highlights how GPU-based implementations can outperform

CPU-based methods for computationally intensive sorting tasks. It serves as an example of
how parallelization strategies can be applied to traditional algorithms to meet modern per-
formance demands, especially in scenarios requiring real-time processing or large dataset
handling.

4
3 Objectives
• To implement the Merge Sort algorithm on both CPU and GPU platforms using
CUDA C/C++ and evaluate their execution for varying array sizes.

• To measure and compare execution times for CPU and GPU implementations of
merge sort to assess the performance benefits of GPU parallelization.

• To compute key performance metrics such as:

– Speedup

– Efficiency

– Load Balancing

– Communication Overhead

– Scalability

– Granularity

• To analyze the effect of array size on the performance of both CPU and GPU merge
sort implementations and determine thresholds where GPU significantly outper-
forms CPU.

• To visualize the comparative performance using bar charts and line graphs that
represent execution times and all performance parameters.

• To document all experimental results in an organized manner by exporting the

collected data into an Excel sheet titled Performance Metrics for easy interpretation
and further analysis.

• To understand and demonstrate the advantages and limitations of parallel comput-

ing, particularly using CUDA, for classic algorithm optimization.

• To formulate conclusions on the efficiency and practicality of using GPU-based

computing for large-scale sorting problems and guide future optimizations in parallel
algorithm design.

5
4 Methodology
1. Algorithm Selection

(a) Chose merge sort for its O(n log n) complexity and parallelisability

(b) Implemented both sequential (CPU) and parallel (GPU) versions

2. Implementation Approach

(a) Sequential version:

i. Recursive divide-and-conquer

ii. Dynamic memory allocation for temp arrays

iii. In-place merging

(b) Parallel version:

i. Iterative bottom-up design for GPU

ii. Each CUDA thread merges two subarrays

iii. Double merge width each pass

3. Performance Measurement

(a) Timed using:

i. cudaEvent for GPU

ii. chrono high-res clock for CPU

(b) Tested with N=1000, 10000, 100000

(c) Same random inputs for both versions

4. Validation

(a) Output saved to file for verification

(b) Same seed ensures identical inputs

(c) Manual checks on small arrays

6
5. Testing

(a) Unit tests for merge operation

(b) Integration tests for full sort

(c) Performance comparison across sizes

6. Environment

(a) NVIDIA GPU with CUDA

(b) Multi-core CPU

(c) C++/CUDA compilation

7. Output

(a) Console display

(b) File output (output.txt)

(c) Timing data for comparison

8. Limitations

(a) Global memory only

(b) Power-of-two sizes

(c) Basic kernel implementation

9. Future Work

(a) Shared memory optimization

(b) Async memory transfers

(c) Non-power-of-two support

(d) Multi-GPU scaling

7
4.1 Pseudocode

Algorithm 1 Parallel and Sequential Merge Sort Comparison

1: Initialize and print GPU properties

2: Warm up GPU using a dummy kernel

3: Define dataset sizes: N1 , N2 , N3
4: for each N ∈ {N1 , N2 , N3 } do
5: Generate N random integers on the host
6: Copy data to GPU (device vector)
7: Synchronize device
8: Start GPU timer
9: Sort using thrust::sort with device execution policy
10: Stop GPU timer
11: Copy sorted data back to host
12: Output sorted GPU results and timing
13: Print GPU memory used
14: end for
15: for each N ∈ {N1 , N2 , N3 } do
16: Generate N random integers on the host
17: Start CPU timer
18: Sort using recursive CPU merge sort
19: Stop CPU timer
20: Output sorted CPU results and timing
21: end for
22: Report successful GPU operations

8
4.2 Output Screenshots

Figure 2: Parallel implementation of merge sort

Figure 3: Sequential implementation of merge sort

9
Figure 4: Metrics calculation

10
Figure 5: CPU vs GPU Performance

Figure 6: Speedup

11
Figure 7: Efficiency

Figure 8: Load Balancing

12
Figure 9: Communication Overhead

Figure 10: Scalabilty

13
Figure 11: Granularity

Figure 12: Overhead

14
5 Performance Analysis
• GPU implementation shows a 2-10x speedup versus CPU for large datasets (N >
10,000)

• Kernel launch overhead speeds up the CPU for small data sets (N < 1,000)

• Best scaling observed at midrange sizes (10,000-100,000 elements).

• Memory Bottlenecks

– Global memory access is primary performance limiter

– No shared memory utilization − > missed optimization opportunity

– Host-device transfers account for 15-20

• Algorithm behavior

– Bottom-up approach enables better parallelism than recursive version.

– Thread imbalance occurs in final merge stages

– The power-of-two requirement limits real-world applicability

• Comparative Insights

– Outperforms CPU but trails behind optimized libraries (e.g., Thrust)

– Lacks dynamic load balancing of commercial solutions

– Memory access patterns could be further optimized

• Implementation Challenges

– Debugging difficulties due to parallel execution

– Verification complexity for large datasets

– Current synchronization model may cause underutilization

15
• Optimization Opportunities

– Hybrid CPU-GPU approach for better small-array handling

– Shared memory utilization − > potential 30-50

– Warp-level primitives for improved SIMD efficiency

– Asynchronous transfers to hide memory latency

• Practical Implications

– Demonstrated viability of GPU sorting for big data

– Highlights need for careful architecture design

– Provides foundation for more advanced implementations

These observations underscore both the promise of GPU-accelerated sorting and the en-
gineering challenges involved in achieving optimal performance across different use cases.

16
6 Results and Discussion

Table 1: Performance Comparison (CPU vs GPU)

Dataset Size CPU Time GPU Time Speedup

1,000 0.45 ms 1.2 ms 0.38x
10,000 5.2 ms 1.8 ms 2.9x
100,000 62 ms 9.5 ms 6.5x

• Key Findings

– Threshold Behavior: GPU becomes faster than CPU at 5,000 elements

– Optimal Scaling: Best performance at N=50,000-200,000 range

– Peak Speedup: 6.5x observed at N=100,000

– Memory Impact: 30 % of GPU time spent in data transfers

• Correctness Verification

– 100% match between CPU and GPU sorted outputs

– Successfully handled edge cases:

∗ Pre-sorted arrays

∗ Reverse-sorted arrays

∗ Random distributions

∗ Duplicate values

• Resource Utilization

– GPU occupancy: 72% (limited by memory bandwidth)

– CPU utilization: Single-core 100% during sort

– Memory usage: 2N temporary storage required

17
• Limitations Performance degradation observed when:

– N > 500,000 (memory pressure)

– Non-power-of-two sizes (+15% overhead)

– Highly non-uniform data distributions

• Comparative Analysis

– std sort: 3–4x faster at N=100,000

– Thrust sort: 1.5–2x slower

– Radix Sort: Slower but more general-purpose

These results demonstrate the GPU implementation’s effectiveness for medium-to-large

datasets while highlighting opportunities for further optimization in memory handling and
load balancing. The consistent speedup in the 10,000–100,000 element range validates the
practical utility of this parallel approach.

IEEE - HiPC 2023
No ratings yet
IEEE - HiPC 2023
2 pages
Designing Efficient Sorting Algorithms For Manycore Gpus: Ntroduction
No ratings yet
Designing Efficient Sorting Algorithms For Manycore Gpus: Ntroduction
10 pages
Synthesis Gpgpu Draft2012 09
No ratings yet
Synthesis Gpgpu Draft2012 09
100 pages
Numerical Methods Implementation On CUDA
No ratings yet
Numerical Methods Implementation On CUDA
73 pages
Designing Efficient Sorting Algorithms For Manycore Gpus: Ntroduction
No ratings yet
Designing Efficient Sorting Algorithms For Manycore Gpus: Ntroduction
10 pages
Owens
No ratings yet
Owens
67 pages
Gpu Parallel Program Development Cuda
100% (2)
Gpu Parallel Program Development Cuda
477 pages
PPL Gpu Sorting Pre Print
No ratings yet
PPL Gpu Sorting Pre Print
28 pages
Fpga vs. Multi-Core Cpus vs. Gpus: Hands-On Experience With A Sorting Application
No ratings yet
Fpga vs. Multi-Core Cpus vs. Gpus: Hands-On Experience With A Sorting Application
12 pages
Parallel Programming With CUDA - Architecture, Analysis
No ratings yet
Parallel Programming With CUDA - Architecture, Analysis
93 pages
YASH HPC Final
No ratings yet
YASH HPC Final
13 pages
CW PD Prog
No ratings yet
CW PD Prog
5 pages
An Efficient Sorting Algorithm With CUDA
No ratings yet
An Efficient Sorting Algorithm With CUDA
8 pages
Efficient Parallel Merge Sort For Fixed and Variable Length Keys
No ratings yet
Efficient Parallel Merge Sort For Fixed and Variable Length Keys
10 pages
GPU Quicksort
No ratings yet
GPU Quicksort
22 pages
A Practical Quicksort Algorithm For Graphics Processors 3hac3qeos3
No ratings yet
A Practical Quicksort Algorithm For Graphics Processors 3hac3qeos3
21 pages
High Performance Computing Labs & Concepts
No ratings yet
High Performance Computing Labs & Concepts
5 pages
Debunking The 100X GPU vs. CPU Myth
No ratings yet
Debunking The 100X GPU vs. CPU Myth
10 pages
Parallel Computing Techniques
No ratings yet
Parallel Computing Techniques
19 pages
GPU Based Parallel Processing Model Proposal
No ratings yet
GPU Based Parallel Processing Model Proposal
4 pages
Design of Parallel Algorithm'S: Faculty Guide: Group Members
No ratings yet
Design of Parallel Algorithm'S: Faculty Guide: Group Members
49 pages
CS Proposal Sample
No ratings yet
CS Proposal Sample
4 pages
CSE5006 Multicore-Architectures ETH 1 AC41
No ratings yet
CSE5006 Multicore-Architectures ETH 1 AC41
9 pages
Multicore02 2
No ratings yet
Multicore02 2
18 pages
2023 CSC14120 Lecture00 CourseIntroduction
No ratings yet
2023 CSC14120 Lecture00 CourseIntroduction
30 pages
Gujarat Technological University: W.E.F. AY 2018-19
No ratings yet
Gujarat Technological University: W.E.F. AY 2018-19
3 pages
Tesi Garofalo Finale PDF
No ratings yet
Tesi Garofalo Finale PDF
170 pages
An Introduction To Parallel Programming Pacheco Peter S Malensek Matthew All Chapters Available
No ratings yet
An Introduction To Parallel Programming Pacheco Peter S Malensek Matthew All Chapters Available
146 pages
AI Image Classification With Neural Network 221002564 222002074
No ratings yet
AI Image Classification With Neural Network 221002564 222002074
17 pages
Concurrency Analysis Report
No ratings yet
Concurrency Analysis Report
42 pages
GPU Based Parallel Processing Model Proposal Expanded
No ratings yet
GPU Based Parallel Processing Model Proposal Expanded
4 pages
Parallel Programming For Modern High Performance Computing Systems Czarnul Latest PDF 2025
No ratings yet
Parallel Programming For Modern High Performance Computing Systems Czarnul Latest PDF 2025
84 pages
Code Generation Compiler For The Openmp 4.0 Accelerator Model Onto Ompss
No ratings yet
Code Generation Compiler For The Openmp 4.0 Accelerator Model Onto Ompss
74 pages
217 Lec1
No ratings yet
217 Lec1
35 pages
Performance Improvement in Operating System: Sundhar Ram P R 20072225 Vignesh A 20072235 Vigneshwarasabarinath S 20072236
No ratings yet
Performance Improvement in Operating System: Sundhar Ram P R 20072225 Vignesh A 20072235 Vigneshwarasabarinath S 20072236
65 pages
ThesisModeling Parallel Processes in Biosystems
No ratings yet
ThesisModeling Parallel Processes in Biosystems
80 pages
COSC 4101 Parallel and Distributed Computing Final
100% (1)
COSC 4101 Parallel and Distributed Computing Final
4 pages
HPC Lectures 1 5
No ratings yet
HPC Lectures 1 5
18 pages
Cuda Review 1
No ratings yet
Cuda Review 1
13 pages
CUDA Algorithms For JAVA
No ratings yet
CUDA Algorithms For JAVA
45 pages
The Comparison of Parallel Sorting
No ratings yet
The Comparison of Parallel Sorting
13 pages
PDC Lecture 02
No ratings yet
PDC Lecture 02
35 pages
GPU Versus FPGA For High Productivity Computing: Imperial College London, Electrical and Electronic Engineering, London
No ratings yet
GPU Versus FPGA For High Productivity Computing: Imperial College London, Electrical and Electronic Engineering, London
6 pages
Full Parallel QuickSort MPI Report
No ratings yet
Full Parallel QuickSort MPI Report
12 pages
Parallel Computing Course Guide
No ratings yet
Parallel Computing Course Guide
2 pages
CC Lab Manual
No ratings yet
CC Lab Manual
39 pages
HPC Lecture (1) Summary
No ratings yet
HPC Lecture (1) Summary
8 pages
Handbook HPC 23-24
No ratings yet
Handbook HPC 23-24
18 pages
GPGPU Tutorial
No ratings yet
GPGPU Tutorial
155 pages
Barnett Haskins
No ratings yet
Barnett Haskins
29 pages
Parallel ProgrammingSyllabus
No ratings yet
Parallel ProgrammingSyllabus
2 pages
Lesson 13 Review For Midterm
No ratings yet
Lesson 13 Review For Midterm
3 pages
Important Short Read For LLM Suboptimal Performance
No ratings yet
Important Short Read For LLM Suboptimal Performance
11 pages
CUDA Class Lecture01
No ratings yet
CUDA Class Lecture01
26 pages
CUDA Optimization Fundamentals
No ratings yet
CUDA Optimization Fundamentals
150 pages
PDC 3
No ratings yet
PDC 3
26 pages
An Efficient O N Comparison-Free Sorting Algorithm
No ratings yet
An Efficient O N Comparison-Free Sorting Algorithm
13 pages
4 2final
No ratings yet
4 2final
34 pages
Benchmarking Final
No ratings yet
Benchmarking Final
124 pages
Final FP501 OSOS
No ratings yet
Final FP501 OSOS
22 pages
Software Testing
No ratings yet
Software Testing
211 pages
ASDM Book 3: Cisco ASA Series VPN ASDM Configuration Guide, 7.14
No ratings yet
ASDM Book 3: Cisco ASA Series VPN ASDM Configuration Guide, 7.14
420 pages
Assignment 01
No ratings yet
Assignment 01
7 pages
Mehul Technical Trainer Profile
No ratings yet
Mehul Technical Trainer Profile
4 pages
TCS National Qualifier Test 2019 Details
No ratings yet
TCS National Qualifier Test 2019 Details
1 page
Unknown
No ratings yet
Unknown
420 pages
POSIX Threads in Linux
No ratings yet
POSIX Threads in Linux
11 pages
ESP32-H2 BLE DTM Test Guide
No ratings yet
ESP32-H2 BLE DTM Test Guide
13 pages
Muhammad Kamran (IT)
No ratings yet
Muhammad Kamran (IT)
3 pages
The Essential Process Engineering Software, Created by Process Engineers For Process Engineers
No ratings yet
The Essential Process Engineering Software, Created by Process Engineers For Process Engineers
4 pages
Speech To Text Conversion
No ratings yet
Speech To Text Conversion
6 pages
CSC-437 Chapter 4
No ratings yet
CSC-437 Chapter 4
65 pages
Dayton Audio WT3 Woofer Tester: by G.R. Koonce and R.O. Wright, JR
100% (1)
Dayton Audio WT3 Woofer Tester: by G.R. Koonce and R.O. Wright, JR
8 pages
Function Approximations Using Matlab
No ratings yet
Function Approximations Using Matlab
11 pages
Syllabus BCA 21
No ratings yet
Syllabus BCA 21
26 pages
UC User ManualV1.1.1
No ratings yet
UC User ManualV1.1.1
53 pages
GST 214 Summary
No ratings yet
GST 214 Summary
6 pages
Data Combo Box
No ratings yet
Data Combo Box
3 pages
Class 9 Computer
No ratings yet
Class 9 Computer
3 pages
Aspiring Data Scientist Profile
No ratings yet
Aspiring Data Scientist Profile
3 pages
AttracThorManual1 65
No ratings yet
AttracThorManual1 65
23 pages
IT Exemplar 2018 MEMO
No ratings yet
IT Exemplar 2018 MEMO
25 pages
PCD Lab Manual - Parsing - C (Programming Language)
No ratings yet
PCD Lab Manual - Parsing - C (Programming Language)
26 pages
Cristallo 400 SP LR 1
No ratings yet
Cristallo 400 SP LR 1
2 pages
ICS - Technical Consultant - Oracle - Ashwin Kumar - 5+ Years
No ratings yet
ICS - Technical Consultant - Oracle - Ashwin Kumar - 5+ Years
3 pages
Text Search & Multimedia Retrieval
No ratings yet
Text Search & Multimedia Retrieval
22 pages
CDL Gateway 10inch
No ratings yet
CDL Gateway 10inch
2 pages
Word Processing Editing Guide
No ratings yet
Word Processing Editing Guide
4 pages
File Management in Operating Systems
No ratings yet
File Management in Operating Systems
40 pages