Hi everyone,
I'm running an all-vs-all BLASTP as part of my analysis. I merged the protein FASTA files from several species into a single file (merge.fast) and used it as both the query and the database. The BLASTP run itself completed without errors.
However, I noticed that gene naming for Arabidopsis thaliana differs between the protein FASTA file and the annotation GFF file, both downloaded from NCBI. A subset of each file is shown below for reference.
FASTA File Sample:
>NP_001030613.1 hypothetical protein 1 [Arabidopsis thaliana]
...
>NP_001030614.1 Phosphoglycerate mutase-like family protein [Arabidopsis thaliana]
...
>NP_001030615.2 ECA1-like gametogenesis related family protein [Arabidopsis thaliana]
...
GFF File Sample:
NC_003070.9 RefSeq gene 3631 5899 . + . ID=gene-AT1G01010;Dbxref=Araport:AT1G01010,TAIR:AT1G01010,GeneID:839580
NC_003070.9 RefSeq mRNA 3631 5899 . + . ID=rna-NM_099983.2;Parent=gene-AT1G01010;Dbxref=Araport:AT1G01010,GenBank:NM_099983.2
...
I also converted the GFF file into a 4-column format as required for MCScanX:
at003070.9 AT1G01010 3631 5899
at003070.9 AT1G01020 6788 9130
...
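For reference, the conversion above can be sketched roughly as follows. This is a minimal sketch, not the exact script used: the function name gff_genes_to_mcscanx is hypothetical, the real GFF is assumed to be tab-separated, the locus ID is taken from the Araport entry in Dbxref, and the chromosome label is assumed to be built by replacing the "NC_" prefix with "at", as in the sample rows.

```python
import re


def gff_genes_to_mcscanx(gff_lines, prefix="at"):
    """Yield (chrom, locus, start, end) for each 'gene' row of an NCBI GFF3.

    Assumptions: tab-separated columns, and a Dbxref entry of the form
    'Araport:ATxGxxxxx' on every gene row of interest.
    """
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "gene":
            continue  # keep gene rows only
        match = re.search(r"Araport:([^,;]+)", cols[8])
        if match is None:
            continue  # no locus ID available for this row
        # Assumed renaming: NC_003070.9 -> at003070.9
        seqid = cols[0][3:] if cols[0].startswith("NC_") else cols[0]
        yield prefix + seqid, match.group(1), int(cols[3]), int(cols[4])
```

Each yielded tuple can be joined with tabs and written out as one line of the 4-column file.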
Issue:
When I ran MCScanX using:
./MCScanX ../synteny/ortho_mc/blast.tsv
I got the following result:
Reading BLAST file and pre-processing
Generating BLAST list
0 matches imported (0 discarded)
0 pairwise comparisons
0 alignments generated
Pairwise collinear blocks written to /synteny/ortho_mc/.collinearity [0.000 seconds elapsed]
Writing multiple syntenic blocks to HTML files
Done! [0.000 seconds elapsed]
It seems like no matches were imported.
Possible Cause & Question:
I suspect that the discrepancy in gene naming conventions between the protein FASTA file (NP_ accessions) and the GFF file (ATxGxxxxx locus IDs) might be the issue.
Does anyone know of a method, tool, or reference file to map NCBI protein accessions (NP_) to TAIR/Araport gene locus IDs (ATxGxxxxx)?
Or is there a better way to resolve this issue for MCScanX?
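One approach I am considering is to build the mapping from the GFF itself. The following is only a sketch under several assumptions that I have not verified against the full file: that CDS rows carry a protein_id=NP_... attribute alongside the Araport entry in Dbxref, that the GFF is tab-separated, and that the BLAST output is tabular with query and subject IDs in the first two columns. The function names protein_to_locus and rename_blast_line are hypothetical.

```python
import re


def protein_to_locus(gff_lines):
    """Map protein accessions to locus IDs using CDS rows of an NCBI GFF3.

    Assumption: each CDS row has 'protein_id=...' and a Dbxref entry
    of the form 'Araport:ATxGxxxxx' in its attribute column.
    """
    mapping = {}
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "CDS":
            continue
        protein = re.search(r"(?:^|;)protein_id=([^;]+)", cols[8])
        locus = re.search(r"Araport:([^,;]+)", cols[8])
        if protein and locus:
            mapping[protein.group(1)] = locus.group(1)
    return mapping


def rename_blast_line(line, mapping):
    """Replace query/subject IDs in one tabular BLAST line where a mapping exists."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        return line.rstrip("\n")  # leave malformed lines unchanged
    fields[0] = mapping.get(fields[0], fields[0])
    fields[1] = mapping.get(fields[1], fields[1])
    return "\t".join(fields)
```

IDs from the other species are left untouched because they are absent from the mapping. Several NP_ accessions can map to the same locus when a gene has multiple isoforms, so the renamed BLAST file may contain repeated gene pairs.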
Any help or pointers would be greatly appreciated!
Thanks in advance!