feat!: refactor annotator as ABC, add NDJSON annotation + support optional side effects #502
Merged
ehclark
approved these changes
Feb 14, 2025
korikuzma
approved these changes
Feb 17, 2025
quinnwai
reviewed
Feb 22, 2025
close #474
Refactor the Annotator class to better support configurable outputs and side effects beyond an annotated VCF. Specifically, the class is now an Abstract Base Class, `AbstractVcfAnnotator`, and implementations must define three methods:

- `on_vrs_object`: perform filtering/transformation/side effects on VRS objects that have been translated from VCF coords. For example, attach additional extensions or mappings, or upload to a DB. Called every time the translator produces a successful VRS allele.
- `on_vrs_object_collection`: do something with the aggregation of all VRS alleles collected during VCF annotation, such as dump them to a file. Called once after VCF ingestion is complete, but only if the class variable `collect_alleles` is `True`.
- `raise_for_output_args`: double-check that some kind of output has been declared in `annotate()`. This is here because there was a similar check in `annotate()` previously. The idea is to force a fast failure if you aren't going to be producing any kind of output. I don't feel particularly tied to keeping this, though.

The existing pickle file dump is modified slightly to use VRS IDs as keys (unsure why the other thing was being used previously, could change back if necessary). It's refactored to be an optional add-on, and an additional option to output an NDJSON dump is added. A basic implementation incorporating all of this is defined in the class `VcfAnnotator`. The CLI is updated to use this class.

See https://github.com/biocommons/anyvar/blob/b57f400c1fe4b821828847d906ac0a5246308d36/src/anyvar/extras/vcf.py for an external implementation.
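For readers who want a feel for the hook structure described above, here is a minimal, self-contained sketch. It is not the code from this PR: `ToyAnnotator`, `CollectingAnnotator`, `_translate`, the `output_path` keyword, and all method signatures are hypothetical stand-ins, and the sketch assumes that returning `None` from `on_vrs_object` drops an allele from the collection. Only the three hook names, `annotate()`, and `collect_alleles` come from the description.

```python
from abc import ABC, abstractmethod
from typing import Optional


class ToyAnnotator(ABC):
    """Toy stand-in for the abstract annotator (signatures are assumptions)."""

    # When True, alleles are retained and passed to on_vrs_object_collection.
    collect_alleles = False

    @abstractmethod
    def on_vrs_object(self, vrs_object: dict, **kwargs) -> Optional[dict]:
        """Per-allele hook: filter, transform, or trigger a side effect."""

    @abstractmethod
    def on_vrs_object_collection(self, collection: list, **kwargs) -> None:
        """Called once after ingestion, only if collect_alleles is True."""

    @abstractmethod
    def raise_for_output_args(self, **kwargs) -> None:
        """Fail fast if the caller declared no output at all."""

    def annotate(self, records, **kwargs) -> None:
        self.raise_for_output_args(**kwargs)
        collection = [] if self.collect_alleles else None
        for record in records:
            allele = self._translate(record)
            if allele is None:
                continue  # translation failed, so the hook is not called
            allele = self.on_vrs_object(allele, **kwargs)
            if allele is not None and collection is not None:
                collection.append(allele)
        if collection is not None:
            self.on_vrs_object_collection(collection, **kwargs)

    @staticmethod
    def _translate(record: dict) -> Optional[dict]:
        # Hypothetical placeholder for VCF-to-VRS translation.
        if record.get("alt") in (None, "."):
            return None
        allele_id = f"toy:{record['chrom']}-{record['pos']}-{record['alt']}"
        return {"id": allele_id, "type": "Allele"}


class CollectingAnnotator(ToyAnnotator):
    """Example implementation that keys collected alleles by ID."""

    collect_alleles = True

    def __init__(self) -> None:
        self.seen = []
        self.dumped = None

    def on_vrs_object(self, vrs_object, **kwargs):
        self.seen.append(vrs_object["id"])
        return vrs_object

    def on_vrs_object_collection(self, collection, **kwargs):
        self.dumped = {allele["id"]: allele for allele in collection}

    def raise_for_output_args(self, **kwargs):
        if not kwargs.get("output_path"):
            raise ValueError("no output declared; nothing would be produced")
```

Passing everything through `**kwargs` keeps the base class ignorant of what each implementation needs, which is also why it is harder to document: the accepted keywords are only visible in the subclass.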
Some potential issues:
- `kwargs` to pass arguments to the child class methods -- it's a little clunky, obviously, both with use and documentation.
- The allele collection (`vrs_data` before, I'm trying a name like `allele_collection`?) was retaining stringified dict dumps of alleles (i.e. `str(allele.model_dump(exclude_none=True))`). I'm not totally sure why this was the case, so I changed it to just hold onto the pydantic objects and defer decisions about serialization etc. to `on_vrs_object_collection`. I am not sure if this causes memory issues with extremely large VCFs.
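To illustrate why deferring serialization can help, the sketch below writes retained objects as NDJSON (one JSON document per line) and contrasts that with a stringified dict, which is not valid JSON. It uses a standard-library dataclass, `ToyAllele`, as a hypothetical stand-in for the pydantic allele model; `dump_ndjson` and `load_ndjson` are illustrative helpers, not functions from this PR.

```python
import io
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ToyAllele:
    """Hypothetical stand-in for a pydantic VRS allele model."""

    id: str
    start: int
    end: int
    state: str
    label: Optional[str] = None


def dump_ndjson(alleles, handle) -> None:
    """Write one JSON object per line, dropping None-valued fields."""
    for allele in alleles:
        record = {k: v for k, v in asdict(allele).items() if v is not None}
        handle.write(json.dumps(record) + "\n")


def load_ndjson(handle) -> list:
    """Read the records back, skipping blank lines."""
    return [json.loads(line) for line in handle if line.strip()]


def is_valid_json(text: str) -> bool:
    """Return True if text parses as JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


alleles = [
    ToyAllele(id="toy:1", start=99, end=100, state="T"),
    ToyAllele(id="toy:2", start=199, end=200, state="G", label="example"),
]

buffer = io.StringIO()
dump_ndjson(alleles, buffer)
buffer.seek(0)
round_tripped = load_ndjson(buffer)

# A stringified dict uses Python repr quoting, so JSON parsers reject it.
stringified = str({"id": "toy:1", "state": "T"})
stringified_is_json = is_valid_json(stringified)
```

Because each line is independent, the same format could also be written incrementally from the per-object hook rather than from a retained collection, which may be worth considering if memory use on very large VCFs turns out to be a problem.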