Computational Complexity Theory
In computer science, computational complexity theory is the branch of the theory of computation that
studies the resources, or cost, of the computation required to solve a given computational problem.
In computational complexity theory, P, NP, NP-hard, and NP-complete are essential classes of problems
that help to categorize the difficulty of computational problems.
P (Polynomial Time):
P refers to the class of problems that can be solved by a deterministic algorithm in polynomial time. This
means the time required to solve the problem grows at a polynomial rate (e.g., n^2, n^3) as the size of the
input increases.
Example: Sorting a list of numbers is in P; for instance, Merge Sort runs in O(n log n) time (and Quick
Sort does so on average).
Key Feature: Problems in P are considered "easy" or efficiently solvable because the algorithm's time
grows at a manageable rate relative to the input size.
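To make this concrete, here is a minimal Merge Sort sketch in Python (one of many possible implementations), illustrating an O(n log n) algorithm and hence a problem in P:

def merge_sort(a):
    # Recursively split the list, sort each half, and merge the two
    # sorted halves; this yields O(n log n) comparisons overall.
    if len(a) <= 1:
        return a
    mid = len(a) // 2
    left, right = merge_sort(a[:mid]), merge_sort(a[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    return merged + left[i:] + right[j:]

print(merge_sort([5, 2, 9, 1, 5, 6]))   # [1, 2, 5, 5, 6, 9]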
NP (Nondeterministic Polynomial Time):
NP (Nondeterministic Polynomial) is the class of decision problems whose solutions can be verified in
polynomial time, or equivalently, problems that can be solved in polynomial time by a non-deterministic
algorithm. Every problem in P is also in NP, but NP additionally contains problems for which no
polynomial-time solving algorithm is known.
Example: Sudoku (a completed grid is easy to check), scheduling problems.
Key Feature: Problems in NP can be "verified" quickly, but solving them may take a long time unless a
polynomial-time solution is found.
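For example, checking a proposed Sudoku solution takes only polynomial time, even though no polynomial-time algorithm is known for producing one. Below is a minimal verifier sketch; the representation (a 9x9 list of lists, with 0 marking an empty cell in the original puzzle) is an assumption made for illustration:

def verify_sudoku(puzzle, solution):
    # A polynomial-time verifier: checks the certificate (a completed
    # grid) against the Sudoku rules and the original clues.
    digits = set(range(1, 10))
    for r in range(9):
        for c in range(9):
            if puzzle[r][c] != 0 and puzzle[r][c] != solution[r][c]:
                return False                      # a given clue was changed
    for i in range(9):
        if set(solution[i]) != digits:            # row i must hold 1..9
            return False
        if {solution[r][i] for r in range(9)} != digits:   # column i
            return False
    for br in range(0, 9, 3):
        for bc in range(0, 9, 3):
            box = {solution[r][c] for r in range(br, br + 3)
                                  for c in range(bc, bc + 3)}
            if box != digits:                     # each 3x3 box
                return False
    return True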
NP-Hard:
A problem is NP-hard if every problem in NP can be reduced to it in polynomial time. NP-hard problems are
at least as hard as the hardest problems in NP, are generally not solvable in polynomial time, and need not
even belong to NP themselves. If you could solve an NP-hard problem efficiently, you could solve all NP
problems efficiently.
Example: In generalized chess (played on an n x n board), deciding whether a given move leads to a forced
win; this problem is NP-hard but is not known to be in NP.
NP-Complete:
NP-complete is the class of problems that are both in NP and NP-hard. This means that an NP-complete
problem is:
1. In NP: A solution can be verified in polynomial time.
2. NP-hard: Every problem in NP can be reduced to it in polynomial time.
If you could find a polynomial-time solution for any NP-complete problem, you could solve all NP
problems in polynomial time. This is because NP-complete problems are the hardest problems in NP.
Example: Finding a shortest solution to a generalized (n x n x n) Rubik's Cube, and the decision version of
the Travelling Salesman Problem.
To solve an NP-complete problem of any nontrivial size, one of the following approaches is generally used:
● Approximation algorithms
● Probabilistic (randomized) algorithms
● Restriction to special cases
● Heuristics
Approximation Algorithms
Approximation algorithms are algorithms used to find near-optimal solutions to optimization problems,
especially when finding an exact solution is too time-consuming or computationally hard (like NP-hard
problems).
Key Concepts:
1. Optimization Problem: A problem where you're looking to maximize or minimize a certain value
(e.g., shortest path, minimum cost, maximum profit).
2. NP-Hard Problems: These are problems for which no polynomial-time algorithm is known that finds
an exact solution (e.g., Traveling Salesman, Knapsack, Vertex Cover).
3. Approximation Ratio: If your approximation algorithm returns a solution that's within a factor ρ of
the optimal, it's called a ρ-approximation.
○ For minimization: Approx ≤ ρ * OPT
○ For maximization: Approx ≥ OPT / ρ
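As a tiny worked illustration of these inequalities (the helper name and the numbers are hypothetical): a 2-approximation for a minimization problem whose optimum costs 3 may return any solution of cost at most 6.

def satisfies_ratio(approx, opt, rho, minimize=True):
    # Check the rho-approximation guarantee for a single instance.
    return approx <= rho * opt if minimize else approx >= opt / rho

print(satisfies_ratio(5, 3, 2))                     # True: 5 <= 2 * 3
print(satisfies_ratio(40, 100, 2, minimize=False))  # False: 40 < 100 / 2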
Examples of Approximation Algorithms:
Some common examples of approximation algorithms include:
● Vertex Cover Problem: The goal is to find the smallest set of vertices such that every edge in the
graph is incident to at least one vertex in the set. A simple edge-picking algorithm (given below) finds
a vertex cover that is at most twice the size of the optimal solution; its approximation ratio is 2.
○ A Vertex Cover of a graph G is a set of vertices such that each edge in G is incident to at least
one of these vertices.
○ The decision version of the vertex-cover problem is known to be NP-complete. Here we want to
solve the optimization version, i.e., find a minimum-size vertex cover of a given graph. We call
such a vertex cover an optimal vertex cover C*.
○ An approximation algorithm for vertex cover (written here as a runnable Python sketch):
def approx_vertex_cover(edges):
    # Greedy edge-picking 2-approximation: repeatedly take any
    # remaining edge, add both endpoints to the cover, and discard
    # every edge incident to either endpoint.
    C = set()
    remaining = set(edges)                  # E' = E
    while remaining:                        # while E' is not empty
        u, v = next(iter(remaining))        # let (u, v) be any edge in E'
        C.update((u, v))                    # add u and v to C
        remaining = {e for e in remaining if u not in e and v not in e}
    return C
○ The idea is to take edges (u, v) one at a time, put both endpoints into C, and remove all the edges
incident to u or v. We continue until all edges have been removed. The resulting C is a vertex
cover. But how good is C? It is at most twice the size of an optimal cover: the chosen edges share
no endpoints, and any vertex cover must contain at least one endpoint of each of them.
○ [Figure omitted: an example graph on which the algorithm returns the vertex cover
VC = {b, c, d, e, f, g}.]
● Traveling Salesman Problem (TSP): The goal is to find the shortest possible route that visits each
city exactly once and returns to the origin city. For the metric case (distances satisfying the triangle
inequality), an approximation algorithm can find a route within a constant factor of the shortest
possible route; a sketch of the classic MST-based 2-approximation appears after this list.
● Set Covering Problem: This problem involves finding the smallest collection of sets whose union
covers all elements of a universe. The standard greedy algorithm achieves a logarithmic
approximation ratio; a sketch also appears after this list.
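The following is a minimal sketch of the MST-based 2-approximation for metric TSP mentioned above. It assumes cities are given as 2-D coordinates (so distances obey the triangle inequality); the function name and input format are illustrative choices, not a prescribed implementation.

import math

def mst_tsp_tour(points):
    # 2-approximation for metric TSP: build a minimum spanning tree
    # with Prim's algorithm, then shortcut a preorder walk of the tree.
    n = len(points)

    def dist(i, j):
        return math.dist(points[i], points[j])

    in_tree = [False] * n
    parent = [0] * n
    best = [math.inf] * n
    best[0] = 0.0
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        for v in range(n):
            if not in_tree[v] and dist(u, v) < best[v]:
                best[v], parent[v] = dist(u, v), u

    children = {i: [] for i in range(n)}
    for v in range(1, n):
        children[parent[v]].append(v)

    tour, stack = [], [0]            # preorder (DFS) walk from the root
    while stack:
        u = stack.pop()
        tour.append(u)
        stack.extend(reversed(children[u]))
    return tour + [0]                # shortcut tour; length is at most 2 * OPT

print(mst_tsp_tour([(0, 0), (0, 1), (1, 1), (1, 0)]))   # e.g., [0, 1, 2, 3, 0]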
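And here is a minimal sketch of the greedy set-cover algorithm referenced above (the input format, a universe of elements plus a list of candidate subsets, is an assumption): picking the set that covers the most uncovered elements at each step gives the well-known logarithmic approximation ratio.

def greedy_set_cover(universe, subsets):
    # Repeatedly choose the subset covering the largest number of
    # still-uncovered elements until everything is covered.
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(subsets, key=lambda s: len(uncovered & set(s)))
        if not uncovered & set(best):
            raise ValueError("the given subsets do not cover the universe")
        chosen.append(best)
        uncovered -= set(best)
    return chosen

print(greedy_set_cover({1, 2, 3, 4, 5},
                       [{1, 2, 3}, {2, 4}, {3, 4}, {4, 5}]))
# [{1, 2, 3}, {4, 5}]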
Types of Approximation:
● Constant-factor: Ratio is a fixed number, e.g., 2.
● Logarithmic-factor: Ratio grows slowly, e.g., log(n).
● PTAS (Polynomial Time Approximation Scheme): for any fixed ε > 0, produces a solution within a
factor (1 + ε) of optimal in time polynomial in the input size.
● FPTAS (Fully Polynomial Time Approximation Scheme): like a PTAS, but the running time is also
polynomial in 1/ε.
Approximation algorithms are often used for NP-hard problems because finding an exact solution is
usually computationally infeasible—especially as the input size grows.
Reasons Approximation Algorithms Are Used for NP-Hard Problems:
1. No Known Polynomial-Time Algorithms
For NP-hard problems, we don’t know any algorithm that can solve them exactly in polynomial time
(and likely never will, unless P = NP).
2. Exact Algorithms Take Exponential Time
Solving them exactly usually means checking an enormous number of possibilities—often
exponential in size—which is not practical for large inputs.
3. Real-World Applications Need Speed
In areas like logistics, scheduling, or network design, we often need good solutions fast, even if
they’re not perfect.
4. Approximation Guarantees
Approximation algorithms offer provable performance bounds (e.g., "This is within 2x of
optimal"), which gives more confidence than just using heuristics.
Approximation algorithms strike a balance between speed and solution quality, making them ideal for
tackling NP-hard problems in the real world.
Data Stream Algorithms
Data stream algorithms are specialized algorithms designed to process large volumes of data in real-time
or in one pass (or a few passes) over the data, typically using limited memory. These are especially useful
when the data is too big to store entirely in memory or on disk.
The core challenge here is to extract meaningful insights and perform computations using limited memory
and processing time, often in a single pass over the data. It's like trying to understand a movie by only seeing
each frame for a split second!
Why are Data Stream Algorithms Important?
● Big Data Processing: They are crucial for analyzing massive, continuously generated datasets where
traditional offline analysis is impractical.
● Real-time Analytics: Enable immediate insights and decision-making based on the latest data.
● Resource Efficiency: Allow for processing large volumes of data with limited computational
resources.
● Network Monitoring: Used for detecting network anomalies, traffic analysis, and quality of service
monitoring.
● Financial Analysis: Employed for fraud detection, high-frequency trading analysis, and risk
management.
● Sensor Networks: Essential for processing data from numerous sensors in real-time.
● Web Analytics: Used for tracking user behavior, identifying trends, and personalized
recommendations.
Examples of Data Streams in Action:
● Network Traffic: Packets flowing through routers and switches.
● Sensor Data: Readings from temperature sensors, GPS devices, industrial machinery.
● Financial Transactions: Stock trades, credit card purchases happening every second.
● Web Server Logs: User requests, clicks, and website interactions.
● Social Media Feeds: Tweets, posts, and updates being generated constantly.
● Clickstreams: User navigation patterns on a website or application.
Key Challenges Of Designing Algorithms For Data Streams:
Here are some of the key hurdles:
1. Memory Constraints:
● The "Too Much Data, Too Little Space" Problem: The sheer volume of data in a stream often far
exceeds the available memory. Algorithms must operate with a memory footprint that is significantly
smaller than the total data seen so far, ideally sublinear or even logarithmic in the stream size.2
● Maintaining Relevant Information: The challenge lies in deciding what information to keep in
memory to answer queries accurately in the future, while discarding the rest. This requires clever
summarization techniques, such as the small-memory frequency summary sketched below.
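As one example of such a summary, here is a minimal sketch of the Misra-Gries algorithm (the function name and parameters are illustrative): using at most k - 1 counters, it guarantees that every item occurring more than n/k times in a stream of n items survives as a candidate (candidates should be re-verified if exact counts are needed).

def misra_gries(stream, k):
    # Keep at most k - 1 counters; any item occurring more than n/k
    # times is guaranteed to remain as a candidate at the end.
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # Decrement every counter; drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

print(misra_gries(["a", "b", "a", "c", "a", "b", "a"], k=3))
# 'a' appears more than 7/3 times, so it is guaranteed to be reported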
2. Real-time Processing Requirements:
● Keeping Up with the Flow: Data arrives continuously and often at a high rate. Algorithms must
process each incoming item quickly to avoid bottlenecks and provide timely results. The processing
time per item needs to be very small.
● Low Latency: For many applications (e.g., fraud detection, network monitoring), there's a strict
requirement for low latency between the arrival of data and the generation of insights.
3. Single-Pass Constraint (or Few Passes):
● Making Decisions on the Fly: Since storing the entire stream is infeasible, algorithms typically need
to make decisions and update their internal state after seeing each data item only once. This limits
the complexity of computations that can be performed on any single data point.
● No Going Back: Unlike offline algorithms that can iterate over the data multiple times, streaming
algorithms have limited opportunities to refine their calculations based on future data.
4. Approximation and Accuracy Guarantees:
● The Inevitability of Error: Due to memory limitations and the single-pass nature, achieving exact
answers to many queries is often impossible. The focus shifts to designing algorithms that provide
good approximations.
● Providing Error Bounds: It's crucial for these approximate algorithms to come with theoretical
guarantees on their accuracy (e.g., the error is within a certain percentage with a certain probability).
This allows users to understand the reliability of the results.
5. Handling Dynamic Data and Concept Drift:
● Evolving Data Distributions: Real-world data streams are often non-stationary, meaning the
underlying data distribution can change over time (concept drift). Algorithms need to be robust to
these changes and potentially adapt their models or summaries.
● Dealing with Insertions, Deletions, and Updates: While many streaming models focus on
insertions, some scenarios involve deletions or updates to previously seen data, adding complexity to
maintaining accurate summaries.
6. Maintaining Order and Time Information:
● Temporal Dependencies: The order in which data arrives can be critical. Algorithms might need to
consider recent data more heavily (e.g., in sliding window models) or detect patterns that occur over
time.
● Handling Out-of-Order Arrivals: In distributed streaming systems, data might not always arrive in
perfect chronological order, requiring mechanisms to handle these inconsistencies.
7. Scalability and Parallelization:
● Dealing with Massive Streams: As data volumes continue to grow, algorithms need to be scalable
and potentially parallelizable to handle the processing load effectively across multiple machines or
cores.
● Distributed Streaming Systems: Designing algorithms that can operate efficiently in distributed
environments introduces challenges related to data partitioning, communication overhead, and
maintaining global state.
In summary, the key challenges in designing data stream algorithms revolve around the fundamental tension
between the need to extract meaningful information from massive, continuous data flows and the severe
constraints on memory, processing time, and the number of times each data point can be examined.
Popular Streaming Algorithms
● Counting distinct elements. Real-world analogy: unique visitors to a website. Algorithms:
Flajolet-Martin, HyperLogLog. What they solve: estimate the number of unique items.
● Frequency estimation. Real-world analogy: top search queries. Algorithm: Count-Min Sketch.
What it solves: approximate count of each item.
● Heavy hitters. Real-world analogy: most popular products sold. Algorithms: Misra-Gries,
Space-Saving. What they solve: identify items that appear frequently.
● Quantiles. Real-world analogy: median temperature readings. Algorithm: Greenwald-Khanna.
What it solves: estimate ranks such as the median or the 90th percentile.
● Reservoir sampling. Real-world analogy: random tweets from a feed. Algorithm: Reservoir
Sampling. What it solves: uniformly sample from the stream.
● Sliding windows. Real-world analogy: last 10 minutes of data. Algorithm:
Datar-Gionis-Indyk-Motwani (DGIM). What it solves: focus on recent data only.
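To make one of these concrete, here is a minimal sketch of reservoir sampling (Algorithm R), which maintains a uniform random sample of k items from a stream of unknown length using only O(k) memory (the function name and parameters are illustrative):

import random

def reservoir_sample(stream, k):
    # Keep the first k items; thereafter, the i-th item (0-based)
    # replaces a uniformly chosen slot with probability k / (i + 1).
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = random.randint(0, i)        # uniform over 0..i inclusive
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), k=10))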
Parallel Algorithms
An algorithm is a sequence of steps that takes input from the user and, after some computation, produces an
output. A parallel algorithm is one that can execute several instructions simultaneously on different
processing devices and then combine all the individual outputs to produce the final result. Parallel algorithms
are highly useful for processing huge volumes of data in a short time.
Types of Parallelism:
● Data Parallelism: The same task is performed on different parts of the data simultaneously. Think of
applying the same image filter to different regions of an image concurrently.
● Task Parallelism: Different, independent tasks are executed concurrently. For example, in a weather
simulation, one processor might handle temperature calculations while another handles wind
velocity.
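As a small illustration of data parallelism (the function names, the chunking scheme, and the toy "brighten" operation are assumptions made for this sketch), the same task is applied to different slices of the data on separate worker processes and the partial results are recombined:

from concurrent.futures import ProcessPoolExecutor

def brighten(chunk):
    # The same operation, applied independently to one slice of the data.
    return [min(255, px + 40) for px in chunk]

def parallel_brighten(pixels, workers=4):
    # Data parallelism: split the input into chunks, process each chunk
    # on its own worker, then concatenate the partial results.
    size = (len(pixels) + workers - 1) // workers
    chunks = [pixels[i:i + size] for i in range(0, len(pixels), size)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = pool.map(brighten, chunks)
    return [px for part in results for px in part]

if __name__ == "__main__":
    print(parallel_brighten(list(range(100))))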
Parallel Computing Architectures:
The design and performance of parallel algorithms are heavily influenced by the underlying hardware
architecture. Some common architectures include:
● Shared Memory Systems (Multicore Processors): Multiple processors share a common memory
space. Communication between processors is typically faster and simpler through shared variables.
However, managing memory access and avoiding conflicts becomes crucial.
● Distributed Memory Systems (Clusters): Multiple independent nodes (each with its own processor
and memory) are interconnected by a network. Communication between nodes requires explicit
message passing, which can introduce latency.
● Hybrid Systems: Combinations of shared and distributed memory architectures, such as clusters of
multi-core nodes.
● Graphics Processing Units (GPUs): Massively parallel architectures originally designed for
graphics rendering but now widely used for general-purpose parallel computing due to their high
throughput for data-parallel tasks.
Key Considerations in Designing Parallel Algorithms:
Designing efficient parallel algorithms introduces a new layer of complexity compared to their sequential
counterparts. The primary goal shifts from just minimizing the number of operations to also minimizing the
overhead associated with parallelism and effectively utilizing multiple computing resources. Here are the
key challenges:
● Problem Decomposition: How to effectively divide the problem into independent subtasks that can
be executed in parallel.
● Communication Overhead: Parallel algorithms often require processors to communicate with
each other, especially when aggregating results or sharing intermediate data. This communication
overhead can slow down performance, especially in distributed systems where data needs to be sent
over a network.
Example: In a distributed parallel computing system, if every worker node needs to constantly
synchronize its local results with others (e.g., in a parallel reduction), the time spent on
communication may become a significant bottleneck.
● Synchronization Overhead: When multiple processors work together, they often need to
synchronize their actions to ensure correctness (e.g., waiting for a dependent task to complete before
proceeding).
Example: If two threads try to increment the same counter in parallel without synchronization, they
may both read the counter's value at the same time and write back the same result, losing updates
and producing an incorrect total (see the lock-based sketch after this list of challenges).
● Load Balancing: In parallel algorithms, the input data or work must be distributed across multiple
processors or machines. Ideally, the work should be evenly distributed to prevent some processors
from being idle while others are overloaded.
○ Sequential algorithms don't have this problem because all work is processed sequentially on
one processor.
○ In parallel algorithms, inefficient load balancing can lead to idle time on some processors
and overload on others, decreasing overall efficiency.
Example: In a parallel sorting algorithm, if one processor handles a large portion of the data while
others only handle small portions, the processor with the larger workload will take much longer,
leading to inefficient execution.
● Granularity: The size of the subtasks. Finer-grained parallelism might offer more potential for
speedup but can also lead to higher communication overhead. Coarser-grained parallelism has lower
communication but might have less opportunity for concurrency.
● Amdahl's Law: This law states that the maximum speedup achievable by parallelizing a program is
limited by the fraction of the program that cannot be parallelized. If a fraction s of the code is serial,
the speedup on N processors is at most 1 / (s + (1 - s)/N), which approaches 1/s as N grows; for
example, a program that is 10% serial (s = 0.1) can never be sped up by more than a factor of 10,
regardless of the number of processors.
● Memory Access and Scalability: Memory access patterns can have a huge impact on the
performance of parallel algorithms. In a parallel system, multiple processors may try to access the
same data simultaneously, leading to contention and bottlenecks in memory.
○ Sequential algorithms can take advantage of the cache efficiently because memory is
accessed in a linear and predictable manner.
○ In parallel algorithms, non-local memory accesses (e.g., in distributed systems) and cache
coherence issues (when multiple processors cache the same data) can slow down the
execution significantly.
Example: In shared-memory parallelism (e.g., multithreading on a single machine), multiple
processors accessing the same variable can cause cache invalidation or false sharing, reducing
efficiency.
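The counter race described under Synchronization Overhead above can be sketched with Python's threading module as follows (the counts and number of threads are illustrative); the lock serializes the read-modify-write cycle so no updates are lost, at the cost of some synchronization overhead:

import threading

counter = 0
lock = threading.Lock()

def safe_increment(times):
    global counter
    for _ in range(times):
        with lock:          # without this lock, increments can be lost
            counter += 1

threads = [threading.Thread(target=safe_increment, args=(100_000,))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)              # 400000 with the lock; often less without it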
Factors Influencing the Design and Complexity of Parallel Algorithms
1. Communication Overhead:
● Influence on Design: The anticipated cost of communication profoundly shapes how a parallel
algorithm is structured. Designers prioritize algorithms and data distributions that minimize the
amount and frequency of data exchange. This might involve favoring local computations, carefully
partitioning data, and optimizing communication patterns (e.g., reducing the number of messages and
increasing their size).
● Influence on Complexity: Communication adds a significant layer to the analysis of parallel
algorithm complexity. Performance models must account for both computational and communication
costs. Scalability is often limited by how communication overhead scales with the number of
processors and problem size. The choice of algorithm might even shift towards one with higher
sequential complexity but lower communication needs in a parallel setting.
2. Problem Decomposition and Parallelizability:
● Influence on Design: The ability to effectively break down a problem into independent or
semi-independent tasks is fundamental to parallel algorithm design. Identifying inherent parallelism,
managing dependencies between tasks, and determining the appropriate granularity of subtasks are
crucial design considerations. The goal is to maximize the amount of work that can be done
concurrently while minimizing dependencies that require synchronization.
● Influence on Complexity: The inherent parallelizability of a problem dictates the theoretical limits
of speedup (Amdahl's Law). Analyzing the work (total operations) and span (longest sequence of
dependent operations) helps understand the potential for parallelism. The complexity of a parallel
algorithm is also affected by how efficiently the decomposition can be achieved and how effectively
dependencies are managed, often requiring more sophisticated data structures and control flow
compared to sequential algorithms.
3. Granularity of Parallelism:
● The granularity of parallelism refers to the size of the individual tasks into which the problem is
divided. If the tasks are too small (fine-grained), the overhead of managing parallelism (e.g., task
creation, synchronization, and communication) can outweigh the benefits of parallel execution. On
the other hand, if the tasks are too large (coarse-grained), there may be long periods where some
processors remain idle or the system fails to exploit parallelism effectively. Balancing the granularity
is crucial for the algorithm’s performance.
4. Load Balancing:
Impact on Design:
● Work Distribution: A parallel algorithm must be designed to partition the data or tasks into
manageable chunks that can be processed independently by different processors. Efficient load
balancing is crucial for ensuring that all processors are utilized evenly and that the overall system
performs optimally.
○ Example: In a parallel matrix multiplication algorithm, the matrix needs to be divided into
submatrices, with each processor working on a submatrix. If one processor is given a much
larger submatrix than others, it may finish late, while others sit idle.
● Dynamic Load Balancing: In some cases, the workload may change during execution, requiring
dynamic adjustments to the distribution of tasks. For example, in adaptive parallel algorithms, the
system might redistribute tasks dynamically based on the workload to handle variations in
processing time.
○ Design Consideration: Algorithms may be designed to incorporate dynamic task migration,
where the system moves tasks around to balance load. However, this can introduce additional
overhead and complexity.
Impact on Complexity:
● Work Distribution Efficiency: Poor load balancing can result in idle processors or processor
bottlenecks, which increases the overall execution time and degrades performance. The total
complexity is affected by how well the work is distributed across processors. If the load distribution
is not optimal, the parallel speedup will be reduced.
○ Scalability: Efficient load balancing impacts the scalability of the algorithm. As the number
of processors increases, the complexity of distributing tasks becomes more challenging. If
load balancing isn’t handled properly, adding more processors might not significantly reduce
execution time due to inefficient task distribution.
○ Parallel Efficiency: The efficiency of the parallel algorithm is measured by how well the
total execution time scales with the number of processors. Load imbalance often leads to
reduced parallel efficiency, meaning that adding more processors might not reduce the
execution time as much as expected.