Skip to content

rvrane/GenoMap-hg38

Repository files navigation

1. Introduction

This pipeline automates the preprocessing and analysis of high-throughput sequencing data to evaluate read mappability across the genome. It performs adapter trimming, read merging, genome alignment, and generates the mapping output file. The pipeline supports both single-end and paired-end data, with adaptable scripts based on input format. User-defined adapters and minimum read length thresholds are customizable in the trimming script. Designed for Linux environments, it leverages multi-threading where possible for scalability. The modular script design allows for debugging and isolated execution of any pipeline step. This pipeline provides a comprehensive and reproducible framework for evaluating the mappability of sequencing reads.


2. System Requirements:

- Linux or WSL environment
- ≥4 GB RAM (8 GB recommended)
- ≥50 GB free disk space
- Software requirements: FastQC, MultiQC, Cutadapt, Bowtie2, Samtools

Activate the conda environment before running any scripts:
Command: conda activate mytools


3. Pipeline Steps

3.1 Adapter Trimming and Quality Control

Input: Raw '.fastq.gz' files.
Output: Trimmed FASTQ files, Cutadapt reports, FastQC, and MultiQC summaries.
Operations:
- Removes standard Illumina and custom adapters.
- Trims poly-A, T, C, G sequences.
- Performs quality trimming and filters reads by minimum length.
- Runs `fastqc` on each trimmed file.
- Summarizes QC results with `multiqc`.
Tools: cutadapt, fastqc, multiqc.
Command: bash trim.sh

3.2 Merge (This Step is only for paired-end reads)

Input: Paired-end trimmed files (`*_R1_001.fastq.gz` and `*_R2_001.fastq.gz`).
Output: Concatenated files in `merged_results/`.
Operations:
- Merges each R1 and R2 pair using `cat`.
- Output is named `sample_001.fastq.gz`.
Command: bash merge.sh

3.3 Alignment to the reference genome (sam_bam.sh)

Input: Merged or trimmed single-end FASTQ files.
Output: Sorted and indexed BAM files in `output/`.
Operations:
- Uses `bowtie2` to align reads to the human reference genome (GRCh38).
- Converts SAM to BAM, sorts, and indexes with `samtools`.
- Removes intermediate SAM files.

Key Parameters:
- index_prefix: Path to Bowtie2 index files.
- threads: Number of CPU threads (default: 94).
Command: bash sam_bam.sh

3.4 Alignment Statistics and Mapping (stats.sh)

Input: BAM files from the alignment step.
Output: Per-sample chromosome-level stats (`*_chr_stats.txt`) in `stats/`.
Operations:
- Runs 'samtools idxstats' to extract:
- Chromosome names
- Chromosome lengths
- Number of mapped reads
- Number of unmapped reads
Command: bash stats.sh

3.5 Statistics Compilation (merge_stats.sh)

Input: Individual '*_chr_stats.txt' files.
Output: Summary table `merged_chr_stats.tsv`.
Operations:
- Adds a Sample column to each row.
- Appends all stats into a master TSV file.
Fields: Sample, Chromosome, Chromosome_Length, Mapped_Reads, Unmapped_Reads
Command Example: bash merge_stats.sh


4. Directory Outputs

----------------------------------------------------------------------------------------------------------------I

Directory Contents
trimmed_results/ Trimmed FASTQ files, Cutadapt, FastQC, MultiQC outputs
merged_results/ Merged R1 and R2 FASTQ files
output/ Sorted and indexed BAM files
stats/ Per-sample and merged mapping statistics
---------------------------------------------------------------------------------------------------------------I

5. Dependencies

Ensure the following tools are installed and accessible from the command line:
-cutadapt
-fastqc
-multiqc
-bowtie2
-samtools
-conda

6. Notes

-Adjust adapter sequences in 'trim.sh' according to your sequencing kit.
-Confirm Bowtie2 index path ('index_prefix') in `sam_bam.sh` matches your system.
-Scripts assume a flat directory structure with all input FASTQ files present.


About

Modular shell pipeline for adapter trimming, genome alignment to GRCh38, and chromosome-level read mappability profiling using Bowtie2 and SAMtools

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages