hp-extractor

Host-Pathogen Relation Extraction

Dataset

The underlying dataset comes from the CLOVERT database (a more inclusive version of CLOVER).

I have pulled abstracts for as many of the underlying papers as possible.

Because of the way this database was built, we can assume that each abstract describes some evidence of an interaction between the Host and Parasite taxa (whose names are given in the metadata).

Name issues

For a subset of the entries, the names in the Host and Parasite columns match the abstract verbatim and can be converted to labels for training, testing, and validation of models.

However, the CLOVERT metadata has undergone some name harmonization (essentially, different versions of a species name are converted to a single accepted form), so a harmonized name may not appear verbatim in its abstract. If I remember correctly, this harmonization is more common for host names than for parasite names. Therefore, one task could be to identify the host and parasite names as they appear in each abstract, along with their positions. This would greatly increase the amount of labelled training data we have for downstream tasks.
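As a rough illustration of the verbatim-match case, the sketch below looks up a host and a parasite name in an abstract and returns character-offset labels. The function names, the HOST/PARASITE label strings, and the example abstract are hypothetical and are not taken from the repository's scripts. Matching is case-insensitive and tolerates line breaks between the words of a name, which is an assumption about how the abstracts are formatted.

```python
import re


def find_name_spans(abstract, name):
    """Return (start, end) offsets of every verbatim match of `name` in `abstract`."""
    parts = name.split()
    if not parts:
        return []
    # Words of the name may be separated by any whitespace run (e.g. a line break).
    body = r"\s+".join(re.escape(part) for part in parts)
    # Lookarounds instead of \b so names ending in punctuation ("sp.") still match.
    pattern = r"(?<!\w)" + body + r"(?!\w)"
    return [m.span() for m in re.finditer(pattern, abstract, flags=re.IGNORECASE)]


def label_abstract(abstract, host, parasite):
    """Build span labels for the host and parasite names found in one abstract."""
    labels = []
    for role, name in (("HOST", host), ("PARASITE", parasite)):
        for start, end in find_name_spans(abstract, name):
            labels.append(
                {"start": start, "end": end, "label": role, "text": abstract[start:end]}
            )
    return sorted(labels, key=lambda item: item["start"])


# Hypothetical example record (not from the dataset).
example_abstract = "We detected Borrelia burgdorferi in Peromyscus leucopus."
example_labels = label_abstract(
    example_abstract, host="Peromyscus leucopus", parasite="Borrelia burgdorferi"
)
```

An abstract for which `label_abstract` returns no HOST or no PARASITE span is a candidate for the harmonized-name case, where the name has to be recovered by other means.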

Existing Language Models for Biodiversity

TaxoNERD is a model that can recognize species names (including common names and Latin binomials) and may be good for identifying host and parasite names out of the box.

BiodivBERT is a model that can do both Named Entity Recognition and Relation Extraction, so it may be a good foundation model for identifying host-parasite interactions.

maxfarrell/hp-extractor