gffy - GFF3 Genomic Statistics Calculator

A fast and efficient tool for computing comprehensive statistics from GFF3 genomic annotation files.

gffy processes GFF3 files (including compressed .gz files) and generates detailed statistics about genes, transcripts, exons, introns, and CDS features, organized by gene categories (coding genes, non-coding genes, and pseudogenes).

Features

🚀 High Performance: Streaming parser that processes large GFF3 files efficiently without loading entire files into memory
📊 Comprehensive Statistics: Computes detailed metrics for genes, transcripts, exons, introns, and CDS features
🧬 Smart Categorization: Automatically categorizes genes into coding, non-coding, and pseudogenes
🌐 Remote & Local File Support: Can directly process GFF3 files from URLs or local file paths (including .gz compressed files)
💻 Command Line Interface: Easy-to-use CLI tool for processing files from the command line
📈 Detailed Metrics: Provides count, min, max, mean, and median statistics for various genomic features
🔄 Transcript Type Analysis: Breaks down statistics by transcript types (mRNA, lnc_RNA, miRNA, etc.)

Installation

Installing gffy provides both the Python library and the gffy command-line tool.

From Source

git clone https://github.com/emilior/gffy.git
cd gffy
pip install -e .

Installing from source also provides the gffy command-line interface.

Quick Start

Python API

from gffy import compute_gff_stats

# Process a GFF3 file from a URL
url = "https://ftp.ensembl.org/pub/release-110/gff3/homo_sapiens/Homo_sapiens.GRCh38.110.gff3.gz"
stats = compute_gff_stats(url)

# Or process a local GFF3 file
local_file = "/path/to/your/annotation.gff3.gz"
stats = compute_gff_stats(local_file)

# Access statistics
print(f"Coding genes: {stats['coding_genes']['count']}")
print(f"Non-coding genes: {stats['non_coding_genes']['count']}")
print(f"Pseudogenes: {stats['pseudogenes']['count']}")

# Access detailed transcript statistics
for transcript_type, data in stats['coding_genes']['transcripts']['types'].items():
    print(f"{transcript_type}: {data['count']} transcripts")

Command Line Interface

After installation, you can use the gffy command to process GFF3 files directly from the command line:

# Process a local GFF3 file
gffy /path/to/your/annotation.gff3

# Process a compressed GFF3 file
gffy annotation.gff3.gz

# Process from a URL
gffy https://ftp.ensembl.org/pub/release-110/gff3/homo_sapiens/Homo_sapiens.GRCh38.110.gff3.gz

# Save output to a file with pretty printing
gffy annotation.gff3 --output stats.json --pretty

# Get help
gffy --help

The CLI outputs JSON statistics to stdout and progress messages to stderr, making it easy to pipe into other tools:

# Extract coding gene count
gffy annotation.gff3 | jq '.coding_genes.count'

# Save to file and view with jq
gffy annotation.gff3 --output stats.json && jq '.coding_genes.transcripts.count' stats.json

As a Module

You can also run the included script to process annotations from an API:

from gffy.compute_stats import compute_gff_stats
from gffy.tools.api import get_annotations_without_stats, update_annotation

# Fetch annotations from your API
annotations = get_annotations_without_stats()

# Process each annotation
for annotation in annotations:
    annotation_id = annotation['annotation_id']
    gff_url = annotation['source_file_info']['url_path']
    
    stats = compute_gff_stats(gff_url)
    update_annotation(annotation_id, stats)

Output Structure

The statistics are returned as a JSON-compatible dictionary with the following structure:

{
  "coding_genes": {
    "count": 22178,
    "length_stats": {
      "min": 10,
      "max": 2960899,
      "mean": 48306.64,
      "median": 17025.5
    },
    "transcripts": {
      "count": 102504,
      "per_gene": 4.62,
      "types": {
        "mRNA": {
          "count": 66153,
          "per_gene": 3.05,
          "exons_per_transcript": 9.08,
          "length_stats": { ... },
          "spliced_length_stats": { ... },
          "exon_length_stats": { ... }
        },
        ...
      }
    },
    "features": {
      "exons": { "count": 759800, "length_stats": { ... } },
      "introns": { "count": 657296, "length_stats": { ... } },
      "cds": { "count": 527234, "length_stats": { ... } }
    }
  },
  "non_coding_genes": { ... },
  "pseudogenes": { ... }
}
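
Because the result is a plain nested dictionary, it can be flattened into rows for a spreadsheet or a cross-species comparison table. The sketch below runs on a hand-written sample that mirrors the structure above (values copied from the example, length_stats blocks left out). The helper name transcript_type_rows is hypothetical and not part of gffy.

```python
# Hand-written sample shaped like the structure above; not real gffy output.
sample_stats = {
    "coding_genes": {
        "count": 22178,
        "transcripts": {
            "count": 102504,
            "per_gene": 4.62,
            "types": {
                "mRNA": {"count": 66153, "per_gene": 3.05, "exons_per_transcript": 9.08},
            },
        },
    },
    # An empty category, to show that missing keys are tolerated.
    "pseudogenes": {},
}


def transcript_type_rows(stats):
    """Yield one (category, transcript_type, count, per_gene) tuple per transcript type."""
    for category, block in stats.items():
        # .get() keeps the helper tolerant of categories that lack some keys.
        types = block.get("transcripts", {}).get("types", {})
        for transcript_type, data in types.items():
            yield category, transcript_type, data.get("count"), data.get("per_gene")


rows = list(transcript_type_rows(sample_stats))
for row in rows:
    print(*row, sep="\t")
```

The same loop should work on the dictionary returned by compute_gff_stats, provided the output follows the structure shown above.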

Gene Categories

coding_genes: Genes with CDS features or protein_coding biotype
non_coding_genes: Genes with exons but no CDS features
pseudogenes: Genes with feature type "pseudogene"
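
The three rules above can be read as a small decision procedure. The following is a minimal sketch, not gffy's implementation: the function name categorize_gene, its parameters, and the order in which the rules are checked (pseudogene first) are all assumptions made for illustration. See STATS_CALCULATION_GUIDE.md for the actual behavior.

```python
from typing import Optional


def categorize_gene(feature_type: str, biotype: Optional[str],
                    has_cds: bool, has_exons: bool) -> Optional[str]:
    """Return the category name for one gene, or None if no rule matches.

    Hypothetical helper: the rule order below is an assumption, since the
    category descriptions do not say which rule wins when several apply.
    """
    if feature_type == "pseudogene":
        return "pseudogenes"
    if has_cds or biotype == "protein_coding":
        return "coding_genes"
    if has_exons:
        return "non_coding_genes"
    return None


print(categorize_gene("gene", "protein_coding", has_cds=True, has_exons=True))
print(categorize_gene("gene", "lncRNA", has_cds=False, has_exons=True))
print(categorize_gene("pseudogene", None, has_cds=False, has_exons=True))
```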

Statistics Computed

For each category, gffy computes:

Gene counts and length statistics (min, max, mean, median)
Transcript counts (total and per-gene average)
Per-type transcript statistics:
- Count and per-gene ratio
- Exons per transcript
- Genomic span length (start to end)
- Spliced length (sum of exon lengths)
- Exon length statistics
Feature statistics:
- Exon counts and lengths
- Intron counts and lengths
- CDS counts and lengths (for coding genes)
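
To make the length metrics concrete, here is a small worked example on a single toy transcript. GFF3 coordinates are 1-based and inclusive, so a feature's length is end - start + 1. Treating introns as the gaps between consecutive exons is an assumption about how they are derived, and the helper length_stats is hypothetical rather than part of gffy.

```python
from statistics import mean, median

# Toy transcript: exon (start, end) pairs, 1-based inclusive as in GFF3.
exons = [(100, 199), (300, 349), (500, 699)]


def length_stats(lengths):
    """Summarise a list of lengths as min, max, mean and median."""
    return {
        "min": min(lengths),
        "max": max(lengths),
        "mean": round(mean(lengths), 2),
        "median": median(lengths),
    }


exons.sort()
exon_lengths = [end - start + 1 for start, end in exons]

# Introns taken as the gaps between consecutive exons (an assumption).
intron_lengths = [
    next_start - prev_end - 1
    for (_, prev_end), (next_start, _) in zip(exons, exons[1:])
]

genomic_span = exons[-1][1] - exons[0][0] + 1  # start to end
spliced_length = sum(exon_lengths)             # sum of exon lengths

print(genomic_span, spliced_length, intron_lengths)
print(length_stats(exon_lengths))
```

For this transcript the genomic span (600) equals the spliced length (350) plus the total intron length (250), which is a quick sanity check on the arithmetic.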

Configuration

The package can be configured using environment variables when working with the API integration:

export API_URL="http://your-api-server:5002"
export AUTH_KEY="your-auth-key"
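
When driving the API integration from your own script, it can help to check these variables up front so that a missing value fails early instead of partway through a batch. The sketch below is an assumption-laden illustration: load_api_config is a hypothetical helper, and how gffy.tools.api itself reads API_URL and AUTH_KEY is not documented here.

```python
import os


def load_api_config(environ=None):
    """Read the API settings from the environment; raise if either is missing.

    Hypothetical helper, not part of gffy. Pass a dict to test without
    touching the real environment.
    """
    env = os.environ if environ is None else environ
    missing = [name for name in ("API_URL", "AUTH_KEY") if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    # Strip a trailing slash so endpoint paths can be appended consistently.
    return {"api_url": env["API_URL"].rstrip("/"), "auth_key": env["AUTH_KEY"]}


config = load_api_config(
    {"API_URL": "http://your-api-server:5002/", "AUTH_KEY": "your-auth-key"}
)
print(config["api_url"])
```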

Use Cases

Genome Annotation QC: Validate and assess the quality of genome annotations
Comparative Genomics: Compare gene structure statistics across different species or assemblies
Annotation Pipelines: Integrate into automated annotation workflows
Research: Analyze transcript diversity, gene structure, and feature distributions

Performance

gffy is optimized for performance:

Streaming parser (low memory footprint)
Efficient data structures using array.array for coordinates
String interning for repeated values
Single-pass processing with orphan resolution
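
The first three techniques can be illustrated with a short standard-library sketch. This is not gffy's parser: the function names open_gff3, stream_features, and collect_exon_coords are hypothetical, gzip detection by magic bytes is one possible approach, and orphan resolution is left out entirely.

```python
import gzip
import sys
from array import array


def open_gff3(path):
    """Open a GFF3 file as text, detecting gzip by its two magic bytes."""
    with open(path, "rb") as handle:
        is_gzip = handle.read(2) == b"\x1f\x8b"
    return gzip.open(path, "rt") if is_gzip else open(path, "rt")


def stream_features(path):
    """Yield (seqid, type, start, end) one line at a time.

    Only the current line is held in memory, never the whole file.
    """
    with open_gff3(path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # skip directives, comments, and blank lines
            cols = line.rstrip("\n").split("\t")
            if len(cols) < 9:
                continue  # not a well-formed nine-column feature line
            # Interning stores each repeated seqid/type string only once.
            yield sys.intern(cols[0]), sys.intern(cols[2]), int(cols[3]), int(cols[4])


def collect_exon_coords(path):
    """Pack exon starts and ends into compact typed arrays."""
    starts, ends = array("q"), array("q")
    for _seqid, feature_type, start, end in stream_features(path):
        if feature_type == "exon":
            starts.append(start)
            ends.append(end)
    return starts, ends
```

An array("q") stores each coordinate as a raw 8-byte integer, whereas a list holds a pointer to a separate Python int object per element, so typed arrays are considerably smaller when an annotation has hundreds of thousands of exons.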

Typical processing time for a complete human genome annotation (GRCh38): ~2-5 minutes on standard hardware.

Requirements

Python 3.9+
requests >= 2.25.0

JSON Schema

A JSON schema for validating the output is included in schema.json. This can be used to validate the statistics output programmatically.

Documentation

For more detailed information about how statistics are calculated, see STATS_CALCULATION_GUIDE.md.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use gffy in your research, please cite:

Emilio R. (2025). gffy: Fast GFF3 Genomic Statistics Calculator.
https://github.com/emilior/gffy

Changelog

Version 0.0.1 (2025-11-05)

Initial release
Support for GFF3 file processing from URLs and local file paths
Comprehensive statistics for genes, transcripts, and features
Categorization by gene type (coding, non-coding, pseudogene)
Per-transcript-type statistics
Command-line interface (gffy command)
JSON schema for output validation
Automatic gzip compression detection

Support

For issues, questions, or contributions, please visit the GitHub repository.
