GCUA (General Codon Usage Analysis) is a comprehensive Python tool for analyzing codon usage patterns in DNA sequences. This improved implementation builds upon James McInerney's original C program from 1997, adding modern analytical capabilities, interactive visualizations, sequence optimization features, and scalable processing for large datasets.
The program allows researchers to:
- Analyze codon usage bias in DNA sequences from small to genome-scale datasets
- Perform multivariate analysis (Correspondence Analysis and Principal Component Analysis)
- Calculate various codon usage metrics (ENC, CAI, Fop, SCUO)
- Visualize results with interactive plots
- Optimize gene sequences based on identified codon usage patterns
- Process millions of genes with intelligent memory management and parallel processing
If you use GCUA in your research, please cite:
McInerney JO. GCUA: general codon usage analysis.
Bioinformatics. 1998;14(4):372-3.
doi: 10.1093/bioinformatics/14.4.372. PMID: 9632833.
GCUA requires Python 3.8 or higher and several third-party packages:
- numpy
- pandas
- biopython
- scipy
- matplotlib
- seaborn
- scikit-learn
- plotly
Option 1: Install from requirements.txt
- Clone the repository:
    git clone https://github.com/username/gcua.git
    cd gcua
- Install the required packages using the provided requirements.txt:
    pip install -r requirements.txt
- Verify the installation:
    python gcua.py --version
Option 2: Install packages manually
- Clone the repository:
    git clone https://github.com/username/gcua.git
    cd gcua
- Install the required packages manually:
    pip install numpy pandas biopython scipy matplotlib seaborn scikit-learn plotly
Option 3: Install with conda
- Clone the repository:
    git clone https://github.com/username/gcua.git
    cd gcua
- Create a conda environment and install dependencies:
    conda create -n gcua python=3.9
    conda activate gcua
    conda install numpy pandas biopython scipy matplotlib seaborn scikit-learn plotly
GCUA features an interactive command-line interface with menu-driven navigation:
    python gcua.py
The main menu now features a workflow-based structure:
- Data Management - Load and manage sequence data
- Quick Analysis (Guided Workflow) - Automated standard analysis
- Custom Analysis - Detailed analysis options by category
- Visualization & Export - Unified menu for outputs
- Advanced Tools - Clustering, optimization, outlier detection
- Settings & Preferences - Configure analysis parameters
- Help & Documentation - Comprehensive help system
- Q. Quit program
GCUA accepts DNA sequences in FASTA format. The sequences should be protein-coding DNA sequences (CDS) with lengths divisible by 3.
Example FASTA format:
>gene1
ATGGCGTACTTCGATATCGATCGATCGTAGCTAGCTGATCGATCGAT
>gene2
ATGACTGACTAGCTAGCTACGATCGATCGATCGTACGTAGCTAGCAT
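The input requirements above can be checked before loading a file into GCUA. The following is a minimal sketch using only the standard library; `read_fasta` and `check_cds` are hypothetical helper names and are not part of GCUA, and the two example sequences are made up for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into {identifier: sequence}, upper-casing the bases."""
    records: Dict[str, List[str]] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the identifier.
            fields = line[1:].split()
            current = fields[0] if fields else ""
            records[current] = []
        elif current is not None:
            records[current].append(line.upper())
    return {name: "".join(parts) for name, parts in records.items()}


def check_cds(records: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (identifier, problem) pairs for sequences that fail basic CDS checks."""
    problems = []
    for name, seq in records.items():
        if len(seq) % 3 != 0:
            problems.append((name, f"length {len(seq)} is not divisible by 3"))
        if set(seq) - set("ACGT"):
            problems.append((name, "contains characters other than A, C, G, T"))
    return problems


example = """>gene1
ATGGCGTACTTCGATATCGATTAA
>gene2
ATGACTGACTAGCTAGCTACGAT
"""
records = read_fasta(example.splitlines())
issues = check_cds(records)
for name, problem in issues:
    print(f"{name}: {problem}")
```

Running a check like this first makes "invalid sequence length" problems visible per gene rather than partway through an analysis.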
A typical workflow might include:
Option 1: Quick Analysis (Recommended for new users)
- Data Management → Load FASTA file
- Quick Analysis → Follow guided workflow
  - Automatically calculates all metrics
  - Performs multivariate analysis
  - Creates standard visualizations
  - Exports results
Option 2: Custom Analysis
- Data Management → Load FASTA file
- Custom Analysis → Choose specific analyses:
  - Basic metrics (codon usage, amino acids, base composition)
  - Advanced analysis (multivariate CA/PCA)
  - Codon bias metrics (ENC, CAI, Fop, SCUO)
- Visualization & Export → Create plots and export data
- Advanced Tools → Optimize sequences or detect clusters
- Codon Counting: Calculates codon frequencies for each gene
- RSCU Calculation: Computes Relative Synonymous Codon Usage
- Amino Acid Usage: Analyzes amino acid usage patterns
- Base Composition: Provides GC content analysis (overall, by position, GC3s)
- Extreme Genes Export: Extract genes from distribution extremes based on any metric
- Correspondence Analysis (CA): Identifies patterns in RSCU or amino acid usage
  - Automatically excludes single-codon amino acids (e.g., Met, Trp) from RSCU analysis
- Principal Component Analysis (PCA): Alternative dimensionality reduction technique
  - For amino acid data: Defaults to CLR transformation for proper compositional data analysis
  - Automatically converts to frequencies and offers preprocessing options
- Reference Gene Identification: Automatically identifies potential highly expressed genes
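The CLR (centred log-ratio) transformation mentioned for amino acid PCA can be illustrated with a short sketch. This is not GCUA's code: `clr_transform` is a hypothetical name, and the pseudocount of 0.5 used to avoid taking the log of zero is an assumption chosen for the example.

```python
import math
from typing import Dict


def clr_transform(counts: Dict[str, float], pseudocount: float = 0.5) -> Dict[str, float]:
    """Centred log-ratio transform of one gene's amino acid counts.

    Each value is log(frequency) minus the mean of the log frequencies,
    so the transformed values always sum to zero.
    """
    # Assumed pseudocount: keeps zero counts from producing log(0).
    adjusted = {aa: c + pseudocount for aa, c in counts.items()}
    total = sum(adjusted.values())
    logs = {aa: math.log(v / total) for aa, v in adjusted.items()}
    mean_log = sum(logs.values()) / len(logs)
    return {aa: lv - mean_log for aa, lv in logs.items()}


gene_counts = {"A": 12, "K": 7, "L": 20, "W": 0}
clr_values = clr_transform(gene_counts)
```

Because amino acid frequencies within a gene sum to one, they are compositional data; CLR removes that constraint before PCA is applied.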
GCUA offers interactive visualizations using Plotly:
- Multivariate Analysis Plots: Interactive scatter plots of CA or PCA results
  - CA plots now show biplots with both genes and codons (RSCU) or amino acids (AA)
- GC Content Plots: GC vs GC3 content visualization
- ENC vs GC3s Plot (Wright's Plot): Visualizes relationship between ENC and GC3s
- RSCU Heatmaps: Color-coded visualization of RSCU values
  - For small datasets (<1000 genes): Traditional gene-by-codon heatmap
  - For large datasets (≥1000 genes): Summary statistics heatmap showing mean RSCU, standard deviation, and percentiles
- CAI Distribution Plots: Histograms of CAI values
- Custom Scatter Plots: User-defined plots of any calculated metrics
Export Formats: Visualizations can be exported in multiple formats:
- HTML (default): Interactive plots that open in web browser
- SVG: Vector graphics for publication-quality figures
- PNG: Raster images for presentations
- PDF: Document format for reports
Note: Static image formats (SVG, PNG, PDF) require the kaleido package (pip install kaleido)
- Optimal Codon Identification: Multiple methods for identifying optimal codons:
  - Frequency-based
  - Multivariate analysis-based
  - RSCU-based
  - Reference gene-based
- Gene Optimization: Replace codons with optimal alternatives
- Comparative Analysis: Analyze differences between original and optimized sequences
- Supports 33 different genetic codes (NCBI translation tables)
  - Including standard code, mitochondrial codes, and various alternative nuclear codes
  - Essential for analyzing sequences from diverse organisms
- Hybrid Memory Management:
  - Automatically adapts to file size
  - Small files: Keep sequences in memory for speed
  - Large files: Load sequences on demand to conserve memory
  - Configurable threshold (default: 100 MB)
- Parallel Processing:
  - Multi-core support for faster analysis
  - Configurable number of processes
  - Batch processing for optimal performance
  - Scales linearly with available CPU cores
- Progress Tracking and Checkpointing:
  - Automatic checkpoint saving for long-running analyses
  - Resume interrupted processing from last checkpoint
  - Configurable checkpoint intervals
  - No lost work due to interruptions
- Scalable to Genome-Scale Data:
  - Tested with millions of genes
  - Efficient memory usage even for 1M+ gene datasets
  - Fast sequence indexing for quick access
- Smart Visualizations:
  - Automatic adaptation for large datasets
  - RSCU heatmap switches to summary statistics for ≥1000 genes
  - Prevents browser crashes with massive data matrices
  - Shows meaningful patterns instead of overwhelming detail
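The hybrid memory strategy and sequence indexing described above can be pictured with the sketch below. It is an illustration under stated assumptions, not GCUA's implementation: the function names are hypothetical, and only the 100 MB default threshold comes from this README.

```python
import os
from typing import Dict

THRESHOLD_BYTES = 100 * 1024 * 1024  # 100 MB, the documented default threshold


def choose_mode(path: str, threshold: int = THRESHOLD_BYTES) -> str:
    """Pick in-memory loading for small files and on-demand loading for large ones."""
    return "in_memory" if os.path.getsize(path) <= threshold else "on_demand"


def build_index(path: str) -> Dict[str, int]:
    """Map each sequence identifier to the byte offset of its header line."""
    index: Dict[str, int] = {}
    with open(path, "rb") as handle:
        offset = handle.tell()
        line = handle.readline()
        while line:
            if line.startswith(b">"):
                fields = line[1:].split()
                if fields:
                    index[fields[0].decode()] = offset
            offset = handle.tell()
            line = handle.readline()
    return index


def fetch_sequence(path: str, index: Dict[str, int], name: str) -> str:
    """Read a single sequence from disk by seeking to its indexed offset."""
    parts = []
    with open(path, "rb") as handle:
        handle.seek(index[name])
        handle.readline()  # skip the header line
        for line in handle:
            if line.startswith(b">"):
                break  # reached the next record
            parts.append(line.strip().decode())
    return "".join(parts)
```

With an index like this, only the identifiers and offsets stay in memory, and each sequence is read when an analysis step asks for it.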
GCUA saves all analysis results and visualizations in a gcua_outputs directory. Output formats include:
- TSV Files: Tab-separated values for data analysis
- HTML Files: Interactive Plotly visualizations
- FASTA Files: Optimized sequences
- JSON Files: Complex data structures like optimal codon definitions
You can identify optimal codons based on a set of reference genes, which can be:
- Manually selected
- Automatically identified through multivariate analysis
- Loaded from a text file (one gene name per line)
Example reference gene file:
gene1
gene3
gene7
GCUA can compare codon usage between genes at opposite ends of the primary multivariate axis, which often separates genes by expression level:
Main Menu > Analysis > Compare codon usage between axis cohorts
This analysis helps identify codons that are statistically overrepresented in potentially highly expressed genes.
GCUA can export genes from the extremes of any calculated metric distribution, which is useful for:
- Identifying candidate highly/lowly expressed genes
- Finding genes with unusual codon usage patterns
- Selecting outliers for experimental validation
- Creating gene sets for downstream analysis
To export extreme genes:
Main Menu > Analysis > Export extreme genes (E)
Features:
- Flexible metric selection: Export based on GC, GC3s, ENC, CAI, Fop, SCUO, or multivariate axes
- Customizable thresholds: Choose any percentage (e.g., top/bottom 10%, 20%)
- Multiple export options: Get top only, bottom only, or both extremes
- Comprehensive output: Includes all calculated metrics for exported genes
- Metadata headers: Files include selection criteria and threshold values
Example use cases:
- Export top 10% of genes by CAI to identify highly adapted genes
- Export bottom 20% by ENC to find genes with extreme codon bias
- Export genes at multivariate axis extremes to study expression-related patterns
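The percentage-based selection behind these use cases can be sketched in a few lines. The function name `extreme_genes`, the rounding rule, and the example CAI values are assumptions made for illustration; GCUA's own selection logic may differ in detail.

```python
from typing import Dict, List, Tuple


def extreme_genes(metric: Dict[str, float], percent: float = 10.0) -> Tuple[List[str], List[str]]:
    """Return (bottom, top) gene lists, each holding `percent` percent of the genes."""
    if not 0 < percent <= 50:
        raise ValueError("percent must be greater than 0 and at most 50")
    ranked = sorted(metric, key=metric.get)  # ascending by metric value
    # Assumed rule: round to the nearest whole gene, but keep at least one.
    k = max(1, round(len(ranked) * percent / 100))
    return ranked[:k], ranked[-k:]


# Hypothetical CAI values for ten genes.
cai_values = {f"gene{i}": v for i, v in enumerate(
    [0.31, 0.78, 0.45, 0.92, 0.27, 0.66, 0.58, 0.83, 0.39, 0.51], start=1)}
bottom, top = extreme_genes(cai_values, percent=20)
```

For a metric such as ENC, where lower values mean stronger bias, the bottom list is the one of interest; for CAI it is the top list.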
You can set the default visualization format in Preferences:
Main Menu > Preferences > Visualization Format
This is particularly useful when:
- Working on HPC systems without GUI/browser access (use SVG or PNG)
- Preparing publication-quality figures (use SVG for vector graphics)
- Creating presentations (use PNG for easy insertion)
- Generating reports (use PDF format)
GCUA provides multiple methods for identifying optimal codons:
- Frequency-based: Uses the most frequent codon for each amino acid
- Multivariate: Uses cohorts identified by multivariate analysis
- RSCU-based: Uses codons with highest RSCU values
- Raw count: Uses the most common codons
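As an illustration of the frequency-based method, the sketch below picks the most frequent codon for each amino acid and then rewrites a gene with those codons. It assumes the standard genetic code (NCBI translation table 1); the function names are hypothetical and the tie-breaking rule (first codon seen wins) is an arbitrary choice for the example.

```python
from collections import Counter
from itertools import product
from typing import Dict, Iterable, List

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated in TCAG order, '*' marks stop codons.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons(seq: str) -> List[str]:
    """Split a sequence into complete codons; a trailing partial codon is dropped."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def optimal_codons(sequences: Iterable[str]) -> Dict[str, str]:
    """Frequency-based choice: the most frequent codon for each amino acid."""
    counts = Counter(c for s in sequences for c in codons(s) if c in CODON_TABLE)
    best: Dict[str, str] = {}
    for codon, n in counts.items():
        aa = CODON_TABLE[codon]
        if aa != "*" and (aa not in best or n > counts[best[aa]]):
            best[aa] = codon
    return best


def optimize(seq: str, best: Dict[str, str]) -> str:
    """Replace each sense codon with the chosen codon for the same amino acid."""
    out = []
    for codon in codons(seq):
        aa = CODON_TABLE.get(codon)
        # Stop codons and unrecognized codons (e.g. containing N) are left unchanged.
        out.append(best.get(aa, codon) if aa and aa != "*" else codon)
    return "".join(out)


best = optimal_codons(["ATGGCTGCTGCCAAAAAGAAATAA"])
optimized = optimize("ATGGCCAAGGCATAA", best)
```

The encoded protein is unchanged because every replacement is synonymous; only the codon choice differs.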
Access performance settings through the Preferences menu:
Main Menu > Preferences > Performance Settings
Available options:
- Parallel Processing: Enable/disable and set number of CPU cores to use
- Batch Size: Adjust the number of sequences processed together (default: 1000)
- Progress Saving: Enable automatic checkpointing for long analyses
- Checkpoint Interval: How often to save progress (default: every 5000 sequences)
For large datasets (>100,000 genes), recommended settings:
- Enable parallel processing with all available cores
- Set batch size to 5000-10000 sequences
- Enable progress saving with checkpoint interval of 10000 sequences
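The interaction between batch size and the number of processes can be sketched with Python's standard `multiprocessing` module. This is a simplified picture rather than GCUA's code: `gc_content` stands in for any per-gene metric, and all function names are hypothetical.

```python
from multiprocessing import Pool
from typing import Iterator, List, Sequence


def gc_content(seq: str) -> float:
    """Stand-in per-gene metric: fraction of G and C bases."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def batches(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def process_batch(batch: Sequence[str]) -> List[float]:
    """Compute the metric for every sequence in one batch."""
    return [gc_content(s) for s in batch]


def process_all(seqs: Sequence[str], batch_size: int = 1000, workers: int = 1) -> List[float]:
    """Process sequences batch by batch, optionally across several processes."""
    chunks = list(batches(seqs, batch_size))
    if workers > 1:
        # Each worker receives whole batches, which keeps per-task overhead low.
        with Pool(processes=workers) as pool:
            results = pool.map(process_batch, chunks)
    else:
        results = [process_batch(chunk) for chunk in chunks]
    # pool.map preserves input order, so results line up with the input sequences.
    return [value for chunk in results for value in chunk]
```

A call such as `process_all(genes, batch_size=5000, workers=8)` should be made from under an `if __name__ == "__main__":` guard on platforms that start worker processes by spawning. Larger batches reduce scheduling overhead, while smaller ones balance load more evenly across cores.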
- Import errors: Make sure all dependencies are installed:
    pip install -r requirements.txt
- Memory issues with large datasets:
  - GCUA automatically switches to low-memory mode for files >100MB
  - Adjust the threshold in Preferences > Memory Management Settings
  - For genome-scale data (>1M genes), ensure at least 8GB RAM available
- Visualization not displaying: If HTML visualizations don't open automatically, try:
    import webbrowser
    webbrowser.open('file:///path/to/visualization.html')
- Invalid sequence length errors: Ensure all sequences are coding sequences with lengths divisible by 3.
- Processing interrupted:
  - If analysis is interrupted, GCUA will automatically resume from the last checkpoint
  - Checkpoint files are named [filename]_checkpoint.pkl
  - Delete checkpoint files to restart analysis from the beginning
- Slow processing on large files:
  - Enable parallel processing in Preferences > Performance Settings
  - Increase batch size for better throughput
  - Check that all CPU cores are being utilized
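The checkpoint-and-resume behaviour described under "Processing interrupted" can be pictured with the following sketch. It is an assumption-laden illustration, not GCUA's implementation: the function names are hypothetical, the checkpoint is assumed to be a pickled dictionary of finished results, and the file name is assumed to be the input name without its extension followed by `_checkpoint.pkl`.

```python
import os
import pickle
from typing import Callable, Dict


def checkpoint_path(fasta_path: str) -> str:
    """Assumed naming: input path without its extension plus '_checkpoint.pkl'."""
    return os.path.splitext(fasta_path)[0] + "_checkpoint.pkl"


def save_checkpoint(results: Dict[str, float], path: str) -> None:
    """Write to a temporary file first so an interruption cannot corrupt the checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as handle:
        pickle.dump(results, handle)
    os.replace(tmp, path)


def run_with_checkpoints(sequences: Dict[str, str],
                         metric: Callable[[str], float],
                         ckpt: str,
                         interval: int = 5000) -> Dict[str, float]:
    """Compute `metric` per gene, saving progress every `interval` genes."""
    results: Dict[str, float] = {}
    if os.path.exists(ckpt):
        # Only load checkpoint files you created yourself; pickle is not safe for untrusted data.
        with open(ckpt, "rb") as handle:
            results = pickle.load(handle)
    since_save = 0
    for name, seq in sequences.items():
        if name in results:
            continue  # already finished in an earlier run
        results[name] = metric(seq)
        since_save += 1
        if since_save >= interval:
            save_checkpoint(results, ckpt)
            since_save = 0
    if os.path.exists(ckpt):
        os.remove(ckpt)  # the run completed, so the checkpoint is no longer needed
    return results
```

Deleting the checkpoint file by hand has the same effect as in GCUA's troubleshooting advice: the next run finds nothing to resume from and starts at the first gene.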
Contributions to GCUA are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
ENC (Effective Number of Codons) measures the degree of codon usage bias, ranging from 20 (extreme bias, only one codon used per amino acid) to 61 (no bias, all synonymous codons used equally). Lower values indicate stronger codon bias.
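For reference, the usual definition of ENC is Wright's (1990) formulation for the standard genetic code, shown below. Whether GCUA implements exactly this variant, or how it handles amino acids absent from a gene, is not stated here and should be treated as an assumption.

```latex
F_a = \frac{n_a \sum_{i=1}^{k_a} p_i^{2} - 1}{n_a - 1},
\qquad
N_c = 2 + \frac{9}{\bar{F}_2} + \frac{1}{\bar{F}_3} + \frac{5}{\bar{F}_4} + \frac{3}{\bar{F}_6}
```

Here $n_a$ is the number of codons observed for amino acid $a$, $k_a$ is its number of synonymous codons, $p_i$ is the frequency of the $i$-th synonymous codon, and $\bar{F}_k$ is the mean of $F_a$ over amino acids with $k$ synonymous codons. The end points quoted above follow directly: with equal usage $F_a = 1/k_a$, giving $2 + 18 + 3 + 20 + 18 = 61$, and with a single codon per amino acid $F_a = 1$, giving $2 + 9 + 1 + 5 + 3 = 20$.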
CAI (Codon Adaptation Index) measures how closely a gene's codon usage matches that of a reference set of highly expressed genes. Values range from 0 to 1, with higher values indicating stronger adaptation to the reference set.
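A minimal sketch of the CAI calculation follows, using the common definition as the geometric mean of relative adaptiveness values. It assumes the standard genetic code and replaces zero reference counts with 0.5, a widely used convention; the function names are hypothetical and GCUA's own handling may differ.

```python
import math
from collections import Counter, defaultdict
from itertools import product
from typing import Dict, Iterable, List

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated in TCAG order, '*' marks stop codons.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons(seq: str) -> List[str]:
    """Split a sequence into complete codons."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def relative_adaptiveness(reference_seqs: Iterable[str]) -> Dict[str, float]:
    """Weight of each codon: its reference count divided by the highest count in its family."""
    counts = Counter(c for s in reference_seqs for c in codons(s))
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)
    weights: Dict[str, float] = {}
    for family in families.values():
        if len(family) == 1:
            continue  # Met and Trp have a single codon and carry no information
        # Assumed convention: a codon never seen in the reference set counts as 0.5.
        adjusted = {c: counts[c] if counts[c] > 0 else 0.5 for c in family}
        top = max(adjusted.values())
        for codon, n in adjusted.items():
            weights[codon] = n / top
    return weights


def cai(seq: str, weights: Dict[str, float]) -> float:
    """Geometric mean of the weights of the gene's informative codons."""
    logs = [math.log(weights[c]) for c in codons(seq) if c in weights]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")


weights = relative_adaptiveness(["GCTGCTGCTGCC"])
score = cai("GCTGCC", weights)
```

In the toy reference set GCT is three times as frequent as GCC, so a gene using one of each scores the geometric mean of 1 and 1/3.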
Fop (Frequency of Optimal Codons) is the ratio of optimal codons to all synonymous codons in a gene. Values range from 0 to 1, with higher values indicating more frequent use of optimal codons.
SCUO (Synonymous Codon Usage Order) quantifies the degree of order in synonymous codon usage based on information theory. Values range from 0 to 1, with higher values indicating more ordered (non-random) codon usage.
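One common entropy-based formulation of SCUO is sketched below. It is given as an aid to interpretation under the assumption that GCUA follows this form; the exact weighting used by the program is not documented here.

```latex
p_{ij} = \frac{x_{ij}}{\sum_{j=1}^{k_i} x_{ij}},
\qquad
H_i = -\sum_{j=1}^{k_i} p_{ij} \log_2 p_{ij},
\qquad
O_i = \frac{\log_2 k_i - H_i}{\log_2 k_i},
\qquad
\mathrm{SCUO} = \sum_i F_i \, O_i
```

Here $x_{ij}$ is the count of the $j$-th synonymous codon of amino acid $i$, $k_i$ is the number of synonymous codons for that amino acid (only amino acids with $k_i > 1$ are included), and $F_i$ is the share of the gene's codons that encode amino acid $i$. Equal usage of synonymous codons gives $H_i = \log_2 k_i$ and $O_i = 0$; exclusive use of one codon gives $H_i = 0$ and $O_i = 1$.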
RSCU (Relative Synonymous Codon Usage) is the observed frequency of a codon divided by the frequency expected if all synonymous codons for that amino acid were used equally. Values above 1 indicate codons used more often than expected under equal usage, and values below 1 indicate codons used less often.
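The RSCU definition translates directly into code. The sketch below assumes the standard genetic code and skips single-codon amino acids (Met, Trp) as well as amino acids absent from the sequence; `rscu` is a hypothetical function name, not GCUA's API.

```python
from collections import Counter, defaultdict
from itertools import product
from typing import Dict, List

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated in TCAG order, '*' marks stop codons.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons(seq: str) -> List[str]:
    """Split a sequence into complete codons."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def rscu(seq: str) -> Dict[str, float]:
    """RSCU per codon: observed count divided by the count expected under equal usage."""
    counts = Counter(codons(seq))
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)
    values: Dict[str, float] = {}
    for family in families.values():
        total = sum(counts[c] for c in family)
        if len(family) == 1 or total == 0:
            continue  # nothing to compare for single-codon or unobserved amino acids
        expected = total / len(family)
        for codon in family:
            values[codon] = counts[codon] / expected
    return values


values = rscu("ATGGCTGCTGCTGCCTAA")
```

In this example alanine is encoded four times, so each of its four codons is expected once; GCT is observed three times and receives an RSCU of 3.0, GCC receives 1.0, and the unused GCA and GCG receive 0.0.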