This pipeline automates the preprocessing and analysis of high-throughput sequencing data to evaluate read mappability across the genome. It performs adapter trimming, read merging, genome alignment, and generates the mapping output file. The pipeline supports both single-end and paired-end data, with adaptable scripts based on input format. User-defined adapters and minimum read length thresholds are customizable in the trimming script. Designed for Linux environments, it leverages multi-threading where possible for scalability. The modular script design allows for debugging and isolated execution of any pipeline step. This pipeline provides a comprehensive and reproducible framework for evaluating the mappability of sequencing reads.
- Linux or WSL environment
- ≥4 GB RAM (8 GB recommended)
- ≥50 GB free disk space
- Software requirements: FastQC, MultiQC, Cutadapt, Bowtie2, Samtools
Activate the conda environment before running any scripts:
Command: conda activate mytools
Input: Raw '.fastq.gz' files.
Output: Trimmed FASTQ files, Cutadapt reports, FastQC, and MultiQC summaries.
Operations:
- Removes standard Illumina and custom adapters.
- Trims poly-A, T, C, G sequences.
- Performs quality trimming and filters reads by minimum length.
- Runs `fastqc` on each trimmed file.
- Summarizes QC results with `multiqc`.
Tools: cutadapt, fastqc, multiqc.
Command: bash trim.sh
Input: Paired-end trimmed files (`*_R1_001.fastq.gz` and `*_R2_001.fastq.gz`).
Output: Concatenated files in `merged_results/`.
Operations:
- Merges each R1 and R2 pair using `cat`.
- Output is named `sample_001.fastq.gz`.
Command: bash merge.sh
Input: Merged or trimmed single-end FASTQ files.
Output: Sorted and indexed BAM files in `output/`.
Operations:
- Uses `bowtie2` to align reads to the human reference genome (GRCh38).
- Converts SAM to BAM, sorts, and indexes with `samtools`.
- Removes intermediate SAM files.
Key Parameters:
- index_prefix: Path to Bowtie2 index files.
- threads: Number of CPU threads (default: 94).
Command: bash sam_bam.sh
Input: BAM files from the alignment step.
Output: Per-sample chromosome-level stats (`*_chr_stats.txt`) in `stats/`.
Operations:
- Runs 'samtools idxstats' to extract:
- Chromosome names
- Chromosome lengths
- Number of mapped reads
- Number of unmapped reads
Command: bash stats.sh
Input: Individual '*_chr_stats.txt' files.
Output: Summary table `merged_chr_stats.tsv`.
Operations:
- Adds a Sample column to each row.
- Appends all stats into a master TSV file.
Fields: Sample, Chromosome, Chromosome_Length, Mapped_Reads, Unmapped_Reads
Command Example: bash merge_stats.sh
----------------------------------------------------------------------------------------------------------------I
| Directory | Contents |
|---|---|
| trimmed_results/ | Trimmed FASTQ files, Cutadapt, FastQC, MultiQC outputs |
| merged_results/ | Merged R1 and R2 FASTQ files |
| output/ | Sorted and indexed BAM files |
| stats/ | Per-sample and merged mapping statistics |
| ---------------------------------------------------------------------------------------------------------------I |
Ensure the following tools are installed and accessible from the command line:
-cutadapt
-fastqc
-multiqc
-bowtie2
-samtools
-conda
-Adjust adapter sequences in 'trim.sh' according to your sequencing kit.
-Confirm Bowtie2 index path ('index_prefix') in `sam_bam.sh` matches your system.
-Scripts assume a flat directory structure with all input FASTQ files present.