GFF Converter

Introduction

GFF Converter converts NCBI or ENSEMBL style gff3 files into a more concise UCSC style gff file using the script gff_converter.py.

NCBI style input data:

NC_000001.10	BestRefSeq	gene	65419	71585	.	+	.	ID=gene-OR4F5;Dbxref=GeneID:79501,HGNC:HGNC:14825;Name=OR4F5;description=olfactory receptor family 4 subfamily F member 5;gbkey=Gene;gene=OR4F5;gene_biotype=protein_coding
NC_000001.10	BestRefSeq	mRNA	65419	71585	.	+	.	ID=rna-NM_001005484.2;Parent=gene-OR4F5;Dbxref=GeneID:79501,Genbank:NM_001005484.2,HGNC:HGNC:14825;Name=NM_001005484.2;gbkey=mRNA;gene=OR4F5;product=olfactory receptor family 4 subfamily F member 5;tag=RefSeq Select;transcript_id=NM_001005484.2
NC_000001.10	BestRefSeq	exon	65419	65433	.	+	.	ID=exon-NM_001005484.2-1;Parent=rna-NM_001005484.2;Dbxref=GeneID:79501,Genbank:NM_001005484.2,HGNC:HGNC:14825;gbkey=mRNA;gene=OR4F5;product=olfactory receptor family 4 subfamily F member 5;tag=RefSeq Select;transcript_id=NM_001005484.2

ENSEMBL style input data:

1	ensembl_havana	gene	69091	70008	.	+	.	ID=gene:ENSG00000186092;Name=OR4F5;biotype=protein_coding;description=olfactory receptor%2C family 4%2C subfamily F%2C member 5 [Source:HGNC Symbol%3BAcc:14825];gene_id=ENSG00000186092;logic_name=ensembl_havana_gene;version=4
1	ensembl_havana	mRNA	69091	70008	.	+	.	ID=transcript:ENST00000335137;Parent=gene:ENSG00000186092;Name=OR4F5-001;biotype=protein_coding;ccdsid=CCDS30547.1;havana_transcript=OTTHUMT00000003223;havana_version=1;tag=basic;transcript_id=ENST00000335137;version=3
1	ensembl_havana	exon	69091	70008	.	+	.	Parent=transcript:ENST00000335137;Name=ENSE00002319515;constitutive=1;ensembl_end_phase=-1;ensembl_phase=0;exon_id=ENSE00002319515;rank=1;version=1

UCSC style output data:

chr1	NCBI	mRNA	65419	71585	.	+	.	ID=NM_001005484; name=OR4F5;
chr1	NCBI	5_UTR	65419	65433	.	+	.	Parent=NM_001005484;
chr1	NCBI	intron	65434	65519	.	+	.	Parent=NM_001005484;
chr1	NCBI	5_UTR	65520	65564	.	+	.	Parent=NM_001005484;
chr1	NCBI	CDS	65565	65573	.	+	0	Parent=NM_001005484;
chr1	NCBI	intron	65574	69036	.	+	.	Parent=NM_001005484;
chr1	NCBI	CDS	69037	70008	.	+	0	Parent=NM_001005484;
chr1	NCBI	3_UTR	70009	71585	.	+	.	Parent=NM_001005484;
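The intron rows in the output above fill the gaps between consecutive exons. A minimal sketch of that derivation follows, assuming 1-based inclusive coordinates as in GFF; the function name introns_from_exons is hypothetical and this is not necessarily how gff_converter.py implements --add_intron. The second and third exon intervals are inferred by merging the adjacent 5_UTR/CDS and CDS/3_UTR rows of the output.

```python
def introns_from_exons(exons):
    """Return the gaps between consecutive exons as 1-based inclusive intervals.

    Hypothetical helper for illustration; not taken from gff_converter.py.
    """
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        # Only emit an intron when there is a real gap between the two exons.
        if next_start > prev_end + 1:
            introns.append((prev_end + 1, next_start - 1))
    return introns


# Exons of NM_001005484 as implied by the output rows above.
exons = [(65419, 65433), (65520, 65573), (69037, 71585)]
print(introns_from_exons(exons))  # [(65434, 65519), (65574, 69036)]
```

The two intervals printed match the two intron rows in the example output.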

The converter also generates a table of (name, ncbi_id, ensembl_id) for each mRNA/transcript in the input file. A missing value is represented as 'None' and can be looked up using table_translate.py or org_Hs_eg_db_translate.R.

The format of the output table is:

Name	NCBI	ENSEMBL
DDX11L1	NR_046018	ENST00000456328
WASH7P	NR_024540	None
OR4F5	NM_001005484	ENST00000641515
OR4F29	NM_001005221	ENST00000426406
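To see which entries still need a translation, the table can be scanned for 'None' values. The sketch below assumes the file is tab-separated and starts with the header row shown above; the function name missing_ids is hypothetical.

```python
import csv
import io


def missing_ids(handle):
    """Return (name, column) pairs for every 'None' entry in a name/id table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        (row["Name"], column)
        for row in reader
        for column in ("NCBI", "ENSEMBL")
        if row[column] == "None"
    ]


# In-memory copy of the first rows of the example table.
table = io.StringIO(
    "Name\tNCBI\tENSEMBL\n"
    "DDX11L1\tNR_046018\tENST00000456328\n"
    "WASH7P\tNR_024540\tNone\n"
)
print(missing_ids(table))  # [('WASH7P', 'ENSEMBL')]
```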

check_gff.py checks whether the generated gff file is valid:

Does every seq have its own id and name?
Does every subseq have a parent, and is it placed directly under its parent?
When the coordinates of the subseqs are joined end to end, do they equal the coordinates of the seq?
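The third check can be sketched as a tiling test: the sub-features, sorted by start, must begin at the parent's start, end at the parent's end, and leave no gap or overlap in between. This is an illustration under the assumption of 1-based inclusive coordinates; tiles_parent is a hypothetical name and not the function used in check_gff.py.

```python
def tiles_parent(parent, children):
    """Return True if the children cover the parent end to end, with no gaps or overlaps."""
    if not children:
        return False
    ordered = sorted(children)
    # The first child must start and the last child must end with the parent.
    if ordered[0][0] != parent[0] or ordered[-1][1] != parent[1]:
        return False
    # Each child must start exactly one base after the previous one ends.
    return all(
        next_start == prev_end + 1
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    )


# mRNA row and its sub-feature rows from the UCSC style example output.
mrna = (65419, 71585)
parts = [
    (65419, 65433), (65434, 65519), (65520, 65564), (65565, 65573),
    (65574, 69036), (69037, 70008), (70009, 71585),
]
print(tiles_parent(mrna, parts))  # True
```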

Data Preprocess and Preparation

The input file should meet the gff3 file specification
An assembly report file should be provided for the seqid conversion; it can be downloaded from the NCBI FTP site
You can translate ensembl transcript ids to ncbi refseq ids, or vice versa, based on the table gene2ensembl provided by NCBI
A translation table for Homo sapiens can be generated with the following command: awk 'BEGIN { FS = "\t" } ; {if(NR == 1 || $1=="9606") print $0}' gene2ensembl > homo_sapiens_gene2ensembl
You can also do the translation using the R library org.Hs.eg.db
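As an illustration of the seqid conversion, the sketch below builds a RefSeq-to-UCSC name map from an assembly report. It assumes the usual NCBI assembly report layout: comment lines start with '#', fields are tab-separated, the RefSeq accession is the 7th column and the UCSC-style name the 10th, with 'na' marking a missing value. The function name seqid_map is hypothetical, the sample row is written for illustration, and the converter's own parsing may differ.

```python
import io


def seqid_map(handle):
    """Map RefSeq accessions to UCSC-style names from an assembly report."""
    mapping = {}
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip comment/header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # not a full data row
        refseq, ucsc = fields[6], fields[9]
        if refseq != "na" and ucsc != "na":
            mapping[refseq] = ucsc
    return mapping


# Illustrative row following the assumed column layout.
report = io.StringIO(
    "# Sequence-Name\tSequence-Role\tAssigned-Molecule\tAssigned-Molecule-Location/Type\t"
    "GenBank-Accn\tRelationship\tRefSeq-Accn\tAssembly-Unit\tSequence-Length\tUCSC-style-name\n"
    "1\tassembled-molecule\t1\tChromosome\tCM000663.1\t=\tNC_000001.10\t"
    "Primary Assembly\t249250621\tchr1\n"
)
print(seqid_map(report))  # {'NC_000001.10': 'chr1'}
```

For Ensembl-style input the first column (for example 1) would be the lookup key instead of the RefSeq accession.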

Install

You should install python3 to use gff_converter.py, coord_translate.py and table_translate.py. You should install R and the R library org.Hs.eg.db to use org_Hs_eg_db_translate.R. A yaml file is provided to build a conda environment named gfftools by conda/mamba using the following command:

mamba env create -f envs/environment.yml

Usage

The scripts can be used with the following commands:

gff_converter.py [-h] -i input_gff -o output_gff [-a assembly_report] -t output_name_id_table -s gff_style [--add_intron] [--add_utr]

table_translate.py [-h] -i input.tsv -o output.tsv -r translation_table [-t trans_type]

Rscript org_Hs_eg_db_translate.R input_file output_file trans_type

coord_translate.py [-h] -n ncbi.gff -e ensembl.gff -a assembly_report -o output.tsv

check_gff.py [-h] -i input.gff [-s gff_style]
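To make the role of the translation table passed with -r more concrete, here is a rough sketch of an NCBI-to-Ensembl lookup (the n2e direction used in the test commands) built from a gene2ensembl-style file. It assumes the NCBI column order (tax_id, GeneID, Ensembl gene id, RNA accession.version, Ensembl RNA id, protein accession.version, Ensembl protein id), '-' for missing values, and ids compared without version suffixes. The function name build_n2e and the sample row are illustrative only; table_translate.py may work differently.

```python
import io


def build_n2e(handle):
    """Map RefSeq RNA accessions (version stripped) to Ensembl transcript ids."""
    table = {}
    for line in handle:
        if line.startswith("#"):
            continue  # header line kept by the awk filter
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue
        refseq_rna, ensembl_rna = fields[3], fields[4]
        if refseq_rna == "-" or ensembl_rna == "-":
            continue
        # Keep the first Ensembl id seen for each RefSeq accession.
        table.setdefault(refseq_rna.split(".")[0], ensembl_rna.split(".")[0])
    return table


# Illustrative row following the assumed column layout.
gene2ensembl = io.StringIO(
    "#tax_id\tGeneID\tEnsembl_gene_identifier\tRNA_nucleotide_accession.version\t"
    "Ensembl_rna_identifier\tprotein_accession.version\tEnsembl_protein_identifier\n"
    "9606\t79501\tENSG00000186092\tNM_001005484.2\tENST00000641515.2\t"
    "NP_001005484.2\tENSP00000493376.2\n"
)
n2e = build_n2e(gene2ensembl)
print(n2e.get("NM_001005484", "None"))  # ENST00000641515
print(n2e.get("NR_024540", "None"))     # None
```

Swapping the two columns in the same loop would give the e2n direction.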

Test Data

A small test dataset is provided in the folder test. You can use the following commands to test the scripts:

python3 gff_converter.py -i test/ncbi_test.gff -o test/ncbi_test_output.gff -a data/GCF_000001405.25_GRCh37.p13_assembly_report.txt -t test/ncbi_name_id.tsv -s NCBI  --add_intron --add_utr
python3 gff_converter.py -i test/ensembl_test.gff3 -o test/ensembl_test_output.gff -a data/GCF_000001405.25_GRCh37.p13_assembly_report.txt -t test/ensembl_name_id.tsv -s ENSEMBL  --add_intron --add_utr

python3 table_translate.py -i test/ncbi_name_id.tsv -o test/trans_ncbi_name_id.tsv  -r data/homo_sapiens_gene2ensembl -t n2e
python3 table_translate.py -i test/ensembl_name_id.tsv -o test/trans_ensembl_name_id.tsv -r data/homo_sapiens_gene2ensembl -t e2n 

Rscript org_Hs_eg_db_translate.R test/ncbi_name_id.tsv test/R_ncbi_name_id.tsv n2e
Rscript org_Hs_eg_db_translate.R test/ensembl_name_id.tsv test/R_ensembl_name_id.tsv e2n

python3 coord_translate.py -n test/ncbi_test.gff -e test/ensembl_test.gff3 -a data/GCF_000001405.25_GRCh37.p13_assembly_report.txt -o test/trans_coord_id.tsv

python3 check_gff.py -i test/ncbi_test_output.gff -s NCBI
python3 check_gff.py -i test/ensembl_test_output.gff -s ENSEMBL

Report

The scripts have been tested using the following files:

NCBI style gff file : GCF_000001405.25_GRCh37.p13_genomic.gff
ENSEMBL style gff file: Homo_sapiens.GRCh37.87.Ensembl.gff3

Script for the testing:

python3 gff_converter.py -i gff_data/GCF_000001405.25_GRCh37.p13_genomic.gff -o gff_out/NCBI_output.gff -a data/GCF_000001405.25_GRCh37.p13_assembly_report.txt -t  gff_out/ncbi_name_id.tsv -s NCBI --add_intron --add_utr
python3 gff_converter.py -i gff_data/Homo_sapiens.GRCh37.87.Ensembl.gff3  -o gff_out/ENSEMBL_output.gff -a data/GCF_000001405.25_GRCh37.p13_assembly_report.txt -t  gff_out/ensembl_name_id.tsv -s ENSEMBL --add_intron --add_utr

python3 table_translate.py -i gff_out/ncbi_name_id.tsv -o gff_out/trans_ncbi_name_id.tsv  -r data/homo_sapiens_gene2ensembl -t n2e 
python3 table_translate.py -i gff_out/ensembl_name_id.tsv -o gff_out/trans_ensembl_name_id.tsv -r data/homo_sapiens_gene2ensembl -t e2n

Rscript org_Hs_eg_db_translate.R gff_out/ncbi_name_id.tsv gff_out/R_trans_ncbi_name_id.tsv n2e
Rscript org_Hs_eg_db_translate.R gff_out/ensembl_name_id.tsv gff_out/R_trans_ensembl_name_id.tsv e2n

python3 check_gff.py -i gff_out/NCBI_output.gff -s NCBI
python3 check_gff.py -i gff_out/ENSEMBL_output.gff -s ENSEMBL

The following table lists the number of queries (Queries) and of translated ids (TransId) produced by table_translate.py and org_Hs_eg_db_translate.R:

Method	TransType	Queries	TransId
table_translate.py	n2e	68879	43056
table_translate.py	e2n	95160	40702
org_Hs_eg_db_translate.R	n2e	68879	12638
org_Hs_eg_db_translate.R	e2n	95160	9438
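The counts are easier to compare as hit rates. The short sketch below computes TransId divided by the number of queries for each row, assuming TransId counts the queries that received a translated id; the helper name hit_rate is hypothetical.

```python
def hit_rate(queries, translated):
    """Fraction of queries that received a translated id."""
    return translated / queries


# Rows copied from the table above: (method, trans_type, queries, trans_id).
results = [
    ("table_translate.py", "n2e", 68879, 43056),
    ("table_translate.py", "e2n", 95160, 40702),
    ("org_Hs_eg_db_translate.R", "n2e", 68879, 12638),
    ("org_Hs_eg_db_translate.R", "e2n", 95160, 9438),
]

for method, trans_type, queries, translated in results:
    print(f"{method}\t{trans_type}\t{hit_rate(queries, translated):.1%}")
```

On these figures the gene2ensembl-based table translates a considerably larger share of the queries than the org.Hs.eg.db route in both directions.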
