This Nextflow pipeline performs automated plasmid assembly from PacBio long-read sequencing data using the Flye assembler as the core assembly engine. The pipeline implements a comprehensive workflow that includes read filtering, downsampling, assembly, consensus generation, circularization, annotation, and quality assessment.
The workflow follows a modular design with the following key components:
- Primary Assembler: Flye - optimized for long-read assembly of circular DNA elements
- Consensus Generation: Trycycler - creates high-quality consensus sequences from multiple assemblies
- Circularization: Circlator minimus2 - ensures proper circular contig formation
- Quality Control: Multi-stage filtering and validation
Length Range Filtering
- Reads length ranges from the length_ranges.csv configuration file
- Applies size-based filtering to retain reads within specified ranges
- Helps remove very short fragments and excessively long reads that may impact assembly quality
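The filtering idea can be sketched in Python as follows. This is a minimal illustration, not the pipeline's actual implementation: the column names `sample`, `min_len`, and `max_len` are assumptions, since the layout of length_ranges.csv is not described here, and both function names are hypothetical.

```python
import csv
import io


def load_length_ranges(csv_text):
    """Parse CSV text into {sample: (min_len, max_len)}.

    The column names (sample, min_len, max_len) are assumed for illustration.
    """
    ranges = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        ranges[row["sample"]] = (int(row["min_len"]), int(row["max_len"]))
    return ranges


def filter_reads_by_length(fastq_lines, min_len, max_len):
    """Yield 4-line FASTQ records whose sequence length is in [min_len, max_len]."""
    record = []
    for line in fastq_lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            # record[1] is the sequence line of the FASTQ record
            if min_len <= len(record[1]) <= max_len:
                yield tuple(record)
            record = []
```

For compressed input, the lines could be read with `gzip.open(path, "rt")` and passed to `filter_reads_by_length` unchanged.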
Input Format
- Expects compressed FASTQ files (*fastq.gz) in the specified input directory
- Sample identification based on filename parsing
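The exact filename parsing rule is not spelled out here. One plausible convention, shown below as an assumption, is to treat everything before the `.fastq.gz` suffix as the sample identifier. The function name is hypothetical.

```python
from pathlib import Path


def sample_id_from_filename(path):
    """Return the file name with a trailing '.fastq.gz' removed.

    Assumed convention: the sample ID is the part before '.fastq.gz'.
    """
    name = Path(path).name
    suffix = ".fastq.gz"
    if not name.endswith(suffix):
        raise ValueError(f"not a compressed FASTQ file: {name}")
    return name[: -len(suffix)]
```

Rejecting unexpected file names early makes it harder for a stray file in the input directory to be silently treated as a sample.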
Downsampling
- Reduces read coverage to optimal levels for assembly
- Prevents computational bottlenecks while maintaining assembly quality
- Configurable coverage targets
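The arithmetic behind coverage-based downsampling can be sketched as follows. This is an illustration under stated assumptions: the pipeline's actual downsampling tool and parameters are not named here, and `expected_size`, `target_coverage`, and both function names are hypothetical.

```python
import random


def downsample_fraction(total_bases, expected_size, target_coverage):
    """Fraction of reads to keep so mean coverage is about target_coverage.

    Coverage is estimated as total_bases / expected_size; the fraction is
    capped at 1.0 when the data is already at or below the target.
    """
    if total_bases <= 0 or expected_size <= 0:
        raise ValueError("total_bases and expected_size must be positive")
    current_coverage = total_bases / expected_size
    return min(1.0, target_coverage / current_coverage)


def downsample_reads(reads, fraction, seed=0):
    """Keep each read independently with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed keeps the subsample reproducible
    return [read for read in reads if rng.random() < fraction]
```

For example, 1,000,000 sequenced bases over an expected 10,000 bp plasmid is about 100x coverage, so a 50x target corresponds to keeping roughly half of the reads.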
Flye Assembly
- Utilizes Flye assembler specifically configured for plasmid assembly
- Optimized for circular DNA elements and repetitive sequences
- Generates initial contigs with associated assembly graphs
Contig Selection
- First selection step (SELECT_FA1) filters contigs based on length criteria
- Focuses on longer contigs likely to represent complete or near-complete plasmids
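A length-based contig filter of this kind might look like the sketch below. It is not the SELECT_FA1 implementation: the `min_length` threshold and both function names are assumptions made for illustration.

```python
def parse_fasta(text):
    """Parse FASTA text into {contig_name: sequence}."""
    parts = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name
            name = line[1:].split()[0]
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {contig: "".join(chunks) for contig, chunks in parts.items()}


def select_contigs(contigs, min_length):
    """Keep only contigs at least min_length bases long."""
    return {name: seq for name, seq in contigs.items() if len(seq) >= min_length}
```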
Circlator Processing
- Applies Circlator minimus2 to improve circular contig formation
- Corrects potential assembly breaks at circular junction points
- Essential for proper plasmid topology representation
Trycycler Integration
- Groups related contigs by sample identifier
- Generates high-quality consensus sequences
- Reconciles differences between multiple assembly attempts
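The grouping step can be illustrated with a small Python sketch. It assumes each contig file arrives paired with its sample identifier; the function name and the example paths are hypothetical. In a Nextflow workflow, this kind of grouping is typically expressed with the `groupTuple` channel operator.

```python
from collections import defaultdict


def group_contigs_by_sample(contig_files):
    """Group (sample_id, path) pairs into {sample_id: [paths]}."""
    groups = defaultdict(list)
    for sample_id, path in contig_files:
        groups[sample_id].append(path)
    # Sort paths so the grouping is deterministic regardless of arrival order
    return {sample_id: sorted(paths) for sample_id, paths in groups.items()}
```

Each resulting group would then be handed to consensus generation as one unit per sample.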
Consensus Filtering
- Second selection step (SELECT_FA2) ensures quality consensus sequences
- Filters out samples with excessive contig numbers that may indicate assembly issues
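The contig-count check can be sketched as below. The `max_contigs` limit and the function name are assumptions; the actual threshold used by SELECT_FA2 is not stated here.

```python
def split_by_contig_count(contig_counts, max_contigs):
    """Split samples into (kept, rejected) by number of consensus contigs.

    A sample is kept when it has at least one contig and no more than
    max_contigs; everything else is rejected for manual review.
    """
    kept, rejected = [], []
    for sample, count in sorted(contig_counts.items()):
        if 1 <= count <= max_contigs:
            kept.append(sample)
        else:
            rejected.append(sample)
    return kept, rejected
```

Returning the rejected samples alongside the kept ones makes it straightforward to report which samples were dropped and why.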
Plasmid Annotation
- Plannotate: Provides comprehensive plasmid-specific gene annotation
- Identifies resistance genes, replication origins, and other functional elements
- Generates GenBank format output for downstream analysis
Plasmid Mapping
- PlasmidMap: Creates visual representations of annotated plasmids
- Generates publication-ready circular maps
Assembly Metrics
- QUAST: Provides detailed assembly statistics and quality metrics
- Evaluates assembly completeness and accuracy
Read Alignment Analysis
- Minimap2: Aligns original reads back to assembled sequences
- PysamStats: Generates detailed alignment statistics and coverage analysis
- Replaces the earlier BAM read-count analysis for improved accuracy
Comprehensive Reporting
- Summarize: Aggregates all metrics and statistics into final reports
- Combines alignment metrics with contig length information
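As an illustration of combining alignment metrics with contig lengths, the sketch below derives a mean depth per contig. The input shapes, field names, and the mean-depth calculation are assumptions chosen for the example, not the actual format of the Summarize step.

```python
def summarize(contig_lengths, aligned_bases):
    """Build one summary row per contig.

    contig_lengths: {contig: length in bp}
    aligned_bases:  {contig: total bases aligned to that contig}
    Mean depth is estimated as aligned bases divided by contig length.
    """
    rows = []
    for contig, length in sorted(contig_lengths.items()):
        bases = aligned_bases.get(contig, 0)  # contigs with no alignments get 0
        depth = bases / length if length else 0.0
        rows.append(
            {
                "contig": contig,
                "length": length,
                "aligned_bases": bases,
                "mean_depth": depth,
            }
        )
    return rows
```

A list of dictionaries like this can be written out directly with `csv.DictWriter` to produce a tabular report.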
- FASTQ Files: PacBio long-read sequencing data in compressed format
- Length Ranges: CSV file defining acceptable read length ranges
- params.input: Directory containing input FASTQ files
- Length filtering ranges defined in length_ranges.csv
The pipeline generates several categories of output:
- High-quality consensus plasmid sequences (FASTA)
- Circularized and polished contigs
- GenBank files with comprehensive gene annotations
- Functional element identification
- Assembly statistics and metrics
- Read alignment coverage analysis
- Comprehensive summary reports
- Circular plasmid maps
- Quality assessment plots
- Designed for parallel processing of multiple samples
- Memory requirements scale with genome size and read coverage
- Flye assembly is the most computationally intensive step
- Multi-stage filtering prevents low-quality assemblies from propagating
- Consensus generation improves final sequence accuracy
- Comprehensive quality metrics enable result validation
- Modular design allows for component replacement or modification
- Configurable parameters for different experimental conditions
- Supports batch processing of multiple plasmid samples
- Accuracy: Combination of Flye assembly with Trycycler consensus generation
- Completeness: Specific optimizations for circular plasmid assembly
- Annotation: Integrated plasmid-specific gene annotation
- Quality Control: Multiple validation and filtering steps
- Automation: Fully automated workflow from raw reads to annotated assemblies
- Scalability: Designed for high-throughput plasmid characterization
This pipeline provides a robust, automated solution for high-quality plasmid assembly from PacBio sequencing data, suitable for both research applications and routine plasmid characterization workflows.