A scientific Named Entity Recognition (NER) program
- Validates that the PMC ID and email are correctly formatted.
- Uses Biopython to fetch the PMC article XML.
- Parses the XML and stores the title and text in the Document object.
- Downloads the scispaCy `en_ner_bc5cdr_md` model and runs Named Entity Recognition (NER) on sentences that contain an HGNC ID to find the disease(s) associated with that gene.
- Saves the HGNC ID and its associated disease(s).
- Fetches gene metadata from genenames and the HGNC REST API.
- Saves metadata in a tab-separated file.
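
The validation and HGNC ID matching steps above could be sketched roughly as follows. This is a minimal standard-library illustration, not the project's actual code: the function names (`is_valid_pmc_id`, `is_valid_email`, `sentences_with_hgnc_ids`) and all three regular expressions are assumptions. The PMC pattern is inferred from the example ID `PMC11123321`, and the HGNC pattern from the `HGNC:#####` format.

```python
import re

# Assumed formats (not taken from the project's source):
PMC_ID_RE = re.compile(r"PMC\d+")                    # "PMC" followed by digits
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")   # deliberately loose check
HGNC_ID_RE = re.compile(r"HGNC:\d+")                 # e.g. "HGNC:618"


def is_valid_pmc_id(pmc_id: str) -> bool:
    """Return True if the whole string looks like a PMC ID."""
    return PMC_ID_RE.fullmatch(pmc_id) is not None


def is_valid_email(email: str) -> bool:
    """Return True if the whole string looks like name@host.tld."""
    return EMAIL_RE.fullmatch(email) is not None


def sentences_with_hgnc_ids(sentences):
    """Map each HGNC ID to the sentences that mention it.

    Only these sentences would then be passed to the NER model.
    """
    hits = {}
    for sentence in sentences:
        # dict.fromkeys de-duplicates IDs repeated within one sentence.
        for hgnc_id in dict.fromkeys(HGNC_ID_RE.findall(sentence)):
            hits.setdefault(hgnc_id, []).append(sentence)
    return hits
```

Filtering sentences by HGNC ID before running the model keeps the NER step limited to text where a gene and a candidate disease co-occur.
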
Example head of saved file:
| chrom | start | end | strand | assembly | hgnc_id | symbol | name | alias | ensembl | disease |
|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 227164624 | 227314792 | 1 | hg38 | HGNC:2204 | COL4A3 | collagen type IV alpha 3 chain | | ENSG00000169031 | MIM 203780 |
| 2 | 228029281 | 228179508 | 1 | hg19 | HGNC:2204 | COL4A3 | collagen type IV alpha 3 chain | | ENSG00000169031 | MIM 203780 |
| 22 | 36253071 | 36267530 | 1 | hg38 | HGNC:618 | APOL1 | apolipoprotein L1 | | ENSG00000100342 | kidney disease |
| 22 | 36649056 | 36663576 | 1 | hg19 | HGNC:618 | APOL1 | apolipoprotein L1 | | ENSG00000100342 | kidney disease |
- chrom: STRING - Chromosome number/identifier
- start: INTEGER - Genomic start position
- end: INTEGER - Genomic end position
- strand: INTEGER - Strand orientation (1 for forward, -1 for reverse)
- assembly: STRING - Genome assembly version (hg38/hg19)
- hgnc_id: STRING - HGNC identifier in format "HGNC:#####"
- symbol: STRING - Official HGNC gene symbol
- name: STRING - Full gene name/description
- alias: STRING - Alternative gene name/symbol
- ensembl: STRING - Ensembl gene identifier
- disease: STRING - Associated disease
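
As a rough illustration of the column layout above, the following sketch writes and reads a tab-separated file with Python's `csv` module. The helper names `write_genes_tsv` and `read_genes_tsv` are hypothetical, and the program's real writer may differ; the column order and the integer columns are taken from the list above.

```python
import csv
import io

# Column order as documented above.
COLUMNS = [
    "chrom", "start", "end", "strand", "assembly",
    "hgnc_id", "symbol", "name", "alias", "ensembl", "disease",
]
INTEGER_COLUMNS = ("start", "end", "strand")


def write_genes_tsv(rows, handle):
    """Write dict rows as TSV; missing keys become empty fields."""
    writer = csv.DictWriter(
        handle, fieldnames=COLUMNS, delimiter="\t",
        restval="", lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)


def read_genes_tsv(handle):
    """Read the TSV back, converting the INTEGER columns to int."""
    records = []
    for row in csv.DictReader(handle, delimiter="\t"):
        for key in INTEGER_COLUMNS:
            row[key] = int(row[key])
        records.append(row)
    return records


# Round-trip one row from the example table through an in-memory buffer.
buffer = io.StringIO()
write_genes_tsv(
    [{
        "chrom": "22", "start": 36253071, "end": 36267530, "strand": 1,
        "assembly": "hg38", "hgnc_id": "HGNC:618", "symbol": "APOL1",
        "name": "apolipoprotein L1", "ensembl": "ENSG00000100342",
        "disease": "kidney disease",
    }],
    buffer,
)
buffer.seek(0)
records = read_genes_tsv(buffer)
```

Keeping `chrom` as a string rather than an integer allows non-numeric chromosome identifiers such as X or Y.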
Requires Python 3.11+.
- Clone the repo.

      git clone https://github.com/jonathjd/felix.git && cd felix

- Make a virtual environment and install dependencies.

      python -m venv .venv
      source .venv/bin/activate
      pip install -r requirements.txt

  If using uv:

      uv sync

- Run the program. Use the following command, replacing the arguments as needed:

      python main.py --pmc_id PMC####### --email your.email@domain.com --output genes.tsv

  - `--pmc_id` or `-pid`: The PMC article ID (e.g. PMC11123321)
  - `--email` or `-e`: Your email address (required by NCBI)
  - `--output` or `-o`: Output TSV file path
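
For reference, a command-line interface with these flags could be defined with `argparse` roughly as shown below. This is a sketch, not the contents of `main.py`: the `build_parser` function, the help strings, and the `genes.tsv` default for `--output` are assumptions, since the README does not say whether `--output` is required.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the three documented options."""
    parser = argparse.ArgumentParser(
        description="Extract gene-disease associations from a PMC article."
    )
    parser.add_argument(
        "-pid", "--pmc_id", required=True,
        help="PMC article ID, e.g. PMC11123321",
    )
    parser.add_argument(
        "-e", "--email", required=True,
        help="Email address (required by NCBI)",
    )
    parser.add_argument(
        "-o", "--output", default="genes.tsv",  # assumed default
        help="Output TSV file path",
    )
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(
    ["-pid", "PMC11123321", "-e", "your.email@domain.com", "-o", "genes.tsv"]
)
```

The parsed values are then available as `args.pmc_id`, `args.email`, and `args.output`.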