The human genome consists of approximately 3.2 billion base pairs in a haploid cell (single set of chromosomes). In diploid cells (most human cells), there are approximately 6.4 billion base pairs total, organized into 23 pairs of chromosomes.
- Haploid genome size: ~3.2 billion base pairs (~3.2 Gbp)
- Diploid genome size: ~6.4 billion base pairs (~6.4 Gbp)
- Number of chromosomes: 23 pairs (46 total)
- Coding DNA: Only ~1-2% of the genome codes for proteins
- Non-coding DNA: Includes regulatory sequences, introns, repetitive elements, and other functional elements
- Chromosome 1: ~249 million base pairs (largest)
- Chromosome 21: ~48 million base pairs (smallest autosome)
- Chromosome Y: ~57 million base pairs
- X chromosome: ~155 million base pairs
The human genome was first fully sequenced in 2003 as part of the Human Genome Project, though the reference genome continues to be refined with newer sequencing technologies.
Sequence alignment is fundamental to bioinformatics, enabling researchers to:
- Identify similar regions between sequences
- Find functional elements (genes, regulatory regions)
- Study evolutionary relationships
- Detect mutations and variations
- Map sequencing reads to reference genomes
Description: The simplest alignment technique that searches for exact occurrences of a pattern within a sequence.
Algorithm: Linear scan through the sequence, checking if the pattern matches at each position.
Time Complexity: O(n*m) where n is sequence length and m is pattern length
Space Complexity: O(1)
Use Cases:
- Finding exact gene sequences
- Locating known motifs
- Simple pattern matching when exact matches are expected
Limitations: Cannot handle mismatches, insertions, or deletions.
Description: A brute-force approach that checks all possible positions where the pattern could occur in the text.
Algorithm:
- Slide the pattern over the text one position at a time
- At each position, compare pattern characters with text characters
- If all characters match, record the position
Time Complexity: O(n*m) worst case, O(n) average case
Space Complexity: O(1)
Use Cases:
- Simple pattern matching
- Educational purposes
- When pattern length is small
Limitations: Inefficient for long patterns or large texts. No tolerance for errors.
Description: An approximate string matching algorithm that finds patterns within a specified edit distance threshold, allowing for mismatches, insertions, and deletions.
Algorithm:
- Slide pattern over sequence with sliding window
- Calculate edit distance (Levenshtein distance) between pattern and each window
- Return all positions where distance is within threshold
- Supports both full edit distance and Hamming distance (substitutions only)
Time Complexity: O(nmd) where n is sequence length, m is pattern length, d is max distance
Space Complexity: O(m*d) for bounded edit distance calculation
Distance Metrics:
- Edit Distance (Levenshtein): Minimum operations (insert, delete, substitute) to transform one string to another
- Hamming Distance: Number of positions where characters differ (only for same-length strings)
Use Cases:
- Finding patterns with known mutations
- Searching with sequencing errors
- Approximate pattern matching
- Finding similar motifs with variations
- Handling noisy sequencing data
Advantages:
- Tolerates errors and variations
- Configurable distance threshold
- Supports both substitutions and indels
- Case-insensitive matching
Limitations:
- Slower than exact matching
- May return many false positives with high threshold
- Edit distance calculation is computationally expensive for large sequences
Description: A dynamic programming algorithm for local sequence alignment, finding the best matching subsequences between two sequences.
Algorithm:
- Build a scoring matrix using dynamic programming
- Score each cell based on:
- Match: positive score (typically +2)
- Mismatch: negative score (typically -1)
- Gap (insertion/deletion): gap penalty (typically -1)
- Find the maximum score in the matrix
- Trace back from the maximum to find the optimal local alignment
Time Complexity: O(n*m) where n and m are sequence lengths
Space Complexity: O(n*m) for the scoring matrix
Scoring System:
- Match score: +2 (when bases match)
- Mismatch score: -1 (when bases differ)
- Gap penalty: -1 (for insertions/deletions)
Use Cases:
- Finding similar regions between sequences
- Identifying conserved domains
- Detecting local similarities in otherwise different sequences
- Protein domain identification
Advantages:
- Finds optimal local alignment
- Handles mismatches and gaps
- Suitable for sequences with only partial similarity
Description: A dynamic programming algorithm for global sequence alignment, aligning entire sequences from end to end.
Algorithm:
- Build a scoring matrix using dynamic programming
- Initialize first row and column with gap penalties
- Fill matrix using recurrence relation:
- Match/mismatch from diagonal
- Gap from left (insertion in sequence 1)
- Gap from top (insertion in sequence 2)
- Trace back from bottom-right to top-left to construct alignment
Time Complexity: O(n*m)
Space Complexity: O(n*m) for the scoring matrix
Scoring System: Similar to Smith-Waterman (match, mismatch, gap penalties)
Use Cases:
- Comparing closely related sequences
- Evolutionary analysis
- Full sequence comparison
- When entire sequences need to be aligned
Advantages:
- Finds optimal global alignment
- Handles all types of variations (mismatches, insertions, deletions)
- Provides complete alignment of both sequences
Differences from Smith-Waterman:
- Global vs. Local: Aligns entire sequences vs. finding best local region
- Initialization: Includes gap penalties in first row/column
- Traceback: Starts from bottom-right vs. maximum score position
Description: A heuristic algorithm designed for fast similarity searches in large databases.
Algorithm:
- Break query sequence into short words (k-mers)
- Find exact matches of these words in the database
- Extend matches in both directions
- Score extended alignments
- Return statistically significant matches
Time Complexity: O(n) average case (much faster than Smith-Waterman for large databases)
Space Complexity: O(n) for indexing
Use Cases:
- Searching large sequence databases
- Finding homologous sequences
- Gene identification
- Large-scale sequence comparison
Advantages:
- Extremely fast for database searches
- Heuristic approach scales to billions of sequences
- Statistical significance testing
Limitations:
- May miss some alignments (heuristic, not optimal)
- Less sensitive than Smith-Waterman for very similar sequences
Description: Advanced data structures for efficient substring search and pattern matching.
Suffix Tree:
- Compressed trie containing all suffixes of a string
- Enables O(m) pattern search after O(n) preprocessing
- Space: O(n)
Suffix Array:
- Sorted array of all suffixes
- Enables binary search for pattern matching
- Space: O(n), simpler than suffix trees
Time Complexity:
- Preprocessing: O(n)
- Pattern search: O(m + log n) for suffix arrays, O(m) for suffix trees
Use Cases:
- Multiple pattern searches on the same text
- Longest common substring problems
- Repeat finding
- Genome indexing
Advantages:
- Very efficient for repeated searches
- Enables complex substring queries
| Algorithm | Type | Time Complexity | Space Complexity | Optimal | Use Case |
|---|---|---|---|---|---|
| Exact Match | Exact | O(n*m) | O(1) | Yes | Simple exact search |
| Naive Search | Exact | O(n*m) | O(1) | Yes | Small patterns |
| Fuzzy Search | Approximate | O(nmd) | O(m*d) | Yes | Pattern with errors |
| Smith-Waterman | Local Alignment | O(n*m) | O(n*m) | Yes | Similar regions |
| Needleman-Wunsch | Global Alignment | O(n*m) | O(n*m) | Yes | Full sequence alignment |
| BLAST | Heuristic | O(n) avg | O(n) | No | Database search |
| Suffix Tree/Array | Indexing | O(m) search | O(n) | Yes | Multiple searches |
- Exact Match/Naive Search: When you need exact pattern matching and sequences are small
- Fuzzy Search: When you need to find patterns with known mutations, sequencing errors, or variations (within edit distance threshold)
- Smith-Waterman: When you want to find similar regions within larger sequences (local similarity)
- Needleman-Wunsch: When you need to align entire sequences and they are expected to be similar throughout
- BLAST: When searching large databases or when speed is critical
- Suffix Trees/Arrays: When performing many searches on the same reference sequence
See the implementation files in the src/ directory for C++ implementations of these algorithms.
mkdir build
cd build
cmake ..
make./dna_alignerThe program includes examples demonstrating each algorithm with sample DNA sequences.
- Human Genome Project: https://www.genome.gov/human-genome-project
- Smith, T. F., & Waterman, M. S. (1981). Identification of common molecular subsequences. Journal of molecular biology, 147(1), 195-197.
- Needleman, S. B., & Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of molecular biology, 48(3), 443-453.
- Altschul, S. F., et al. (1990). Basic local alignment search tool. Journal of molecular biology, 215(3), 403-410.
Copyright (C) 2025, Shyamal Suhana Chandra