Felix

A scientific Named Entity Recognition (NER) program

Validates that the PMC ID and email are correctly formatted.
Uses Biopython to fetch the PMC article XML.
Parses the XML and stores the title and text in the Document object.
Downloads the SciSpacy en_ner_bc5cdr_md NLP model and performs Named Entity Recognition (NER) on each sentence that contains an HGNC ID to find the disease(s) associated with that gene.
Saves the HGNC ID and its associated disease(s).
Fetches gene metadata from genenames and the HGNC REST API.
Saves metadata in a tab-separated file.
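
The first step above (input validation) can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names and both regular expressions are assumptions. It assumes a PMC ID is "PMC" followed by digits and checks only the rough shape of an email address.

```python
import re

# Assumed format: the literal prefix "PMC" followed by one or more digits.
PMC_ID_RE = re.compile(r"PMC\d+")

# Assumed, deliberately loose email shape: local@domain.tld with no spaces.
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def is_valid_pmc_id(value: str) -> bool:
    """Return True if the whole string looks like a PMC ID (hypothetical helper)."""
    return PMC_ID_RE.fullmatch(value) is not None


def is_valid_email(value: str) -> bool:
    """Return True if the whole string has a plausible email shape (hypothetical helper)."""
    return EMAIL_RE.fullmatch(value) is not None
```

Rejecting malformed input before any network call keeps failures early and cheap; a stricter email check may be preferable in practice.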

Example head of saved file:

chrom	start	end	strand	assembly	hgnc_id	symbol	name	ensembl	disease
2	227164624	227314792	1	hg38	HGNC:2204	COL4A3	collagen type IV alpha 3 chain	ENSG00000169031	MIM 203780
2	228029281	228179508	1	hg19	HGNC:2204	COL4A3	collagen type IV alpha 3 chain	ENSG00000169031	MIM 203780
22	36253071	36267530	1	hg38	HGNC:618	APOL1	apolipoprotein L1	ENSG00000100342	kidney disease
22	36649056	36663576	1	hg19	HGNC:618	APOL1	apolipoprotein L1	ENSG00000100342	kidney disease
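
A saved file with this layout can be read back using Python's standard csv module. The sketch below is only an illustration: load_rows is a hypothetical helper, and the sample text reuses two rows from the example above in place of a real genes.tsv.

```python
import csv
import io

# Two rows copied from the example output above, used in place of a file on disk.
SAMPLE = (
    "chrom\tstart\tend\tstrand\tassembly\thgnc_id\tsymbol\tname\tensembl\tdisease\n"
    "2\t227164624\t227314792\t1\thg38\tHGNC:2204\tCOL4A3\t"
    "collagen type IV alpha 3 chain\tENSG00000169031\tMIM 203780\n"
    "22\t36649056\t36663576\t1\thg19\tHGNC:618\tAPOL1\t"
    "apolipoprotein L1\tENSG00000100342\tkidney disease\n"
)


def load_rows(handle):
    """Parse tab-separated text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(handle, delimiter="\t"))


rows = load_rows(io.StringIO(SAMPLE))

# Keep only the rows reported against the hg38 assembly.
hg38_rows = [row for row in rows if row["assembly"] == "hg38"]
```

To read an actual output file, pass an open file object instead, for example open("genes.tsv", newline="").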

How to Run

Requires Python 3.11+.

git clone https://github.com/jonathjd/felix.git && cd felix

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Or, if using uv:

uv sync

python main.py --pmc_id PMC####### --email your.email@domain.com --output genes.tsv
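
For orientation, the three flags in the command above could be parsed with argparse roughly as follows. This is a sketch and not the contents of main.py: build_parser is a hypothetical name, and treating every flag as required is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the three flags shown in the run command (hypothetical)."""
    parser = argparse.ArgumentParser(description="Felix: scientific NER on a PMC article")
    # Assumption: all three flags are required; the real CLI may define defaults.
    parser.add_argument("--pmc_id", required=True, help="PMC article ID")
    parser.add_argument("--email", required=True, help="contact email address")
    parser.add_argument("--output", required=True, help="path of the TSV file to write")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["--pmc_id", "PMC1234567", "--email", "your.email@domain.com", "--output", "genes.tsv"]
)
```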
