Skip to content

Sapana-Micro-Software/solve-genomics

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

15 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

DNA Sequence Alignment: Documentation and Implementation

Human DNA Overview

Length and Structure

The human genome consists of approximately 3.2 billion base pairs in a haploid cell (single set of chromosomes). In diploid cells (most human cells), there are approximately 6.4 billion base pairs total, organized into 23 pairs of chromosomes.

Key Facts:

  • Haploid genome size: ~3.2 billion base pairs (~3.2 Gbp)
  • Diploid genome size: ~6.4 billion base pairs (~6.4 Gbp)
  • Number of chromosomes: 23 pairs (46 total)
  • Coding DNA: Only ~1-2% of the genome codes for proteins
  • Non-coding DNA: Includes regulatory sequences, introns, repetitive elements, and other functional elements

Chromosome Organization:

  • Chromosome 1: ~249 million base pairs (largest)
  • Chromosome 21: ~48 million base pairs (smallest autosome)
  • Chromosome Y: ~57 million base pairs
  • X chromosome: ~155 million base pairs

The human genome was first fully sequenced in 2003 as part of the Human Genome Project, though the reference genome continues to be refined with newer sequencing technologies.

Sequence Alignment Techniques

Sequence alignment is fundamental to bioinformatics, enabling researchers to:

  • Identify similar regions between sequences
  • Find functional elements (genes, regulatory regions)
  • Study evolutionary relationships
  • Detect mutations and variations
  • Map sequencing reads to reference genomes

1. Exact Match Search

Description: The simplest alignment technique that searches for exact occurrences of a pattern within a sequence.

Algorithm: Linear scan through the sequence, checking if the pattern matches at each position.

Time Complexity: O(n*m) where n is sequence length and m is pattern length

Space Complexity: O(1)

Use Cases:

  • Finding exact gene sequences
  • Locating known motifs
  • Simple pattern matching when exact matches are expected

Limitations: Cannot handle mismatches, insertions, or deletions.

2. Naive String Matching

Description: A brute-force approach that checks all possible positions where the pattern could occur in the text.

Algorithm:

  1. Slide the pattern over the text one position at a time
  2. At each position, compare pattern characters with text characters
  3. If all characters match, record the position

Time Complexity: O(n*m) worst case, O(n) average case

Space Complexity: O(1)

Use Cases:

  • Simple pattern matching
  • Educational purposes
  • When pattern length is small

Limitations: Inefficient for long patterns or large texts. No tolerance for errors.

2.5. Fuzzy Search (Edit Distance)

Description: An approximate string matching algorithm that finds patterns within a specified edit distance threshold, allowing for mismatches, insertions, and deletions.

Algorithm:

  1. Slide pattern over sequence with sliding window
  2. Calculate edit distance (Levenshtein distance) between pattern and each window
  3. Return all positions where distance is within threshold
  4. Supports both full edit distance and Hamming distance (substitutions only)

Time Complexity: O(nmd) where n is sequence length, m is pattern length, d is max distance

Space Complexity: O(m*d) for bounded edit distance calculation

Distance Metrics:

  • Edit Distance (Levenshtein): Minimum operations (insert, delete, substitute) to transform one string to another
  • Hamming Distance: Number of positions where characters differ (only for same-length strings)

Use Cases:

  • Finding patterns with known mutations
  • Searching with sequencing errors
  • Approximate pattern matching
  • Finding similar motifs with variations
  • Handling noisy sequencing data

Advantages:

  • Tolerates errors and variations
  • Configurable distance threshold
  • Supports both substitutions and indels
  • Case-insensitive matching

Limitations:

  • Slower than exact matching
  • May return many false positives with high threshold
  • Edit distance calculation is computationally expensive for large sequences

3. Smith-Waterman Algorithm

Description: A dynamic programming algorithm for local sequence alignment, finding the best matching subsequences between two sequences.

Algorithm:

  1. Build a scoring matrix using dynamic programming
  2. Score each cell based on:
    • Match: positive score (typically +2)
    • Mismatch: negative score (typically -1)
    • Gap (insertion/deletion): gap penalty (typically -1)
  3. Find the maximum score in the matrix
  4. Trace back from the maximum to find the optimal local alignment

Time Complexity: O(n*m) where n and m are sequence lengths

Space Complexity: O(n*m) for the scoring matrix

Scoring System:

  • Match score: +2 (when bases match)
  • Mismatch score: -1 (when bases differ)
  • Gap penalty: -1 (for insertions/deletions)

Use Cases:

  • Finding similar regions between sequences
  • Identifying conserved domains
  • Detecting local similarities in otherwise different sequences
  • Protein domain identification

Advantages:

  • Finds optimal local alignment
  • Handles mismatches and gaps
  • Suitable for sequences with only partial similarity

4. Needleman-Wunsch Algorithm

Description: A dynamic programming algorithm for global sequence alignment, aligning entire sequences from end to end.

Algorithm:

  1. Build a scoring matrix using dynamic programming
  2. Initialize first row and column with gap penalties
  3. Fill matrix using recurrence relation:
    • Match/mismatch from diagonal
    • Gap from left (insertion in sequence 1)
    • Gap from top (insertion in sequence 2)
  4. Trace back from bottom-right to top-left to construct alignment

Time Complexity: O(n*m)

Space Complexity: O(n*m) for the scoring matrix

Scoring System: Similar to Smith-Waterman (match, mismatch, gap penalties)

Use Cases:

  • Comparing closely related sequences
  • Evolutionary analysis
  • Full sequence comparison
  • When entire sequences need to be aligned

Advantages:

  • Finds optimal global alignment
  • Handles all types of variations (mismatches, insertions, deletions)
  • Provides complete alignment of both sequences

Differences from Smith-Waterman:

  • Global vs. Local: Aligns entire sequences vs. finding best local region
  • Initialization: Includes gap penalties in first row/column
  • Traceback: Starts from bottom-right vs. maximum score position

5. BLAST (Basic Local Alignment Search Tool)

Description: A heuristic algorithm designed for fast similarity searches in large databases.

Algorithm:

  1. Break query sequence into short words (k-mers)
  2. Find exact matches of these words in the database
  3. Extend matches in both directions
  4. Score extended alignments
  5. Return statistically significant matches

Time Complexity: O(n) average case (much faster than Smith-Waterman for large databases)

Space Complexity: O(n) for indexing

Use Cases:

  • Searching large sequence databases
  • Finding homologous sequences
  • Gene identification
  • Large-scale sequence comparison

Advantages:

  • Extremely fast for database searches
  • Heuristic approach scales to billions of sequences
  • Statistical significance testing

Limitations:

  • May miss some alignments (heuristic, not optimal)
  • Less sensitive than Smith-Waterman for very similar sequences

6. Suffix Trees and Suffix Arrays

Description: Advanced data structures for efficient substring search and pattern matching.

Suffix Tree:

  • Compressed trie containing all suffixes of a string
  • Enables O(m) pattern search after O(n) preprocessing
  • Space: O(n)

Suffix Array:

  • Sorted array of all suffixes
  • Enables binary search for pattern matching
  • Space: O(n), simpler than suffix trees

Time Complexity:

  • Preprocessing: O(n)
  • Pattern search: O(m + log n) for suffix arrays, O(m) for suffix trees

Use Cases:

  • Multiple pattern searches on the same text
  • Longest common substring problems
  • Repeat finding
  • Genome indexing

Advantages:

  • Very efficient for repeated searches
  • Enables complex substring queries

Algorithm Comparison

Algorithm Type Time Complexity Space Complexity Optimal Use Case
Exact Match Exact O(n*m) O(1) Yes Simple exact search
Naive Search Exact O(n*m) O(1) Yes Small patterns
Fuzzy Search Approximate O(nmd) O(m*d) Yes Pattern with errors
Smith-Waterman Local Alignment O(n*m) O(n*m) Yes Similar regions
Needleman-Wunsch Global Alignment O(n*m) O(n*m) Yes Full sequence alignment
BLAST Heuristic O(n) avg O(n) No Database search
Suffix Tree/Array Indexing O(m) search O(n) Yes Multiple searches

When to Use Each Technique

  • Exact Match/Naive Search: When you need exact pattern matching and sequences are small
  • Fuzzy Search: When you need to find patterns with known mutations, sequencing errors, or variations (within edit distance threshold)
  • Smith-Waterman: When you want to find similar regions within larger sequences (local similarity)
  • Needleman-Wunsch: When you need to align entire sequences and they are expected to be similar throughout
  • BLAST: When searching large databases or when speed is critical
  • Suffix Trees/Arrays: When performing many searches on the same reference sequence

Building and Running

See the implementation files in the src/ directory for C++ implementations of these algorithms.

Build Instructions

mkdir build
cd build
cmake ..
make

Running Examples

./dna_aligner

The program includes examples demonstrating each algorithm with sample DNA sequences.

References

  • Human Genome Project: https://www.genome.gov/human-genome-project
  • Smith, T. F., & Waterman, M. S. (1981). Identification of common molecular subsequences. Journal of molecular biology, 147(1), 195-197.
  • Needleman, S. B., & Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of molecular biology, 48(3), 443-453.
  • Altschul, S. F., et al. (1990). Basic local alignment search tool. Journal of molecular biology, 215(3), 403-410.

License

Copyright (C) 2025, Shyamal Suhana Chandra

About

Email sapanamicrosoftware@duck.com for licensing copyrighted code.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published