@jsstevenson jsstevenson commented Feb 7, 2025

close #474

Refactor the Annotator class to better support configurable outputs and side effects beyond an annotated VCF. Specifically, the class is now an Abstract Base Class, AbstractVcfAnnotator, and implementations must define three methods:

  • on_vrs_object: perform filtering/transformation/side effects on VRS objects that have been translated from VCF coords. For example, attach additional extensions or mappings, or upload to a DB. Called every time the translator produces a successful VRS allele.
    • Eugene had originally proposed this solely for producing side effects (like DB uploads). I thought it might be helpful to also enable transformations or additional annotations (maybe do something with the expression added by the translator, or skip VCF annotation for an allele if it meets certain conditions). I don't have any specific applications in mind, though, so if it doesn't seem well thought out, that's because it isn't.
  • on_vrs_object_collection: do something with the aggregation of all VRS alleles collected during VCF annotation, such as dump them to a file. Called once after VCF ingestion is complete, but only if the class variable collect_alleles is True.
  • raise_for_output_args: double-check that some kind of output has been declared in annotate(). This is here because there was a similar check in annotate() previously. The idea is to force a fast failure if you aren't going to be producing any kind of output. I don't feel particularly tied to keeping this, though.
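To make the shape of the three hooks concrete, here is a minimal, simplified sketch. It is not the code in this PR: `SketchVcfAnnotator`, `CountingAnnotator`, the `(vcf_coords, allele)` record tuples, the plain-dict alleles, the `output_path` kwarg, and the "return None to skip" convention are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any, ClassVar, Optional


class SketchVcfAnnotator(ABC):
    """Hypothetical stand-in for the abstract annotator described above."""

    # Retaining every allele can be costly, so collection is opt-in.
    collect_alleles: ClassVar[bool] = False

    @abstractmethod
    def on_vrs_object(self, vcf_coords: str, vrs_allele: dict, **kwargs: Any) -> Optional[dict]:
        """Filter, transform, or act on one allele. None means 'skip' (assumed convention)."""

    @abstractmethod
    def on_vrs_object_collection(self, vrs_alleles: list, **kwargs: Any) -> None:
        """Act on all retained alleles once ingestion is complete."""

    @abstractmethod
    def raise_for_output_args(self, **kwargs: Any) -> None:
        """Fail fast if no output of any kind was requested."""

    def annotate(self, records: list, **kwargs: Any) -> list:
        self.raise_for_output_args(**kwargs)
        collection = [] if self.collect_alleles else None
        annotated = []
        for vcf_coords, allele in records:
            result = self.on_vrs_object(vcf_coords, allele, **kwargs)
            if result is None:
                continue  # the hook asked for this allele to be dropped
            annotated.append(result)
            if collection is not None:
                collection.append(result)
        # Called once, and only when the class opted in to collection.
        if collection is not None:
            self.on_vrs_object_collection(collection, **kwargs)
        return annotated


class CountingAnnotator(SketchVcfAnnotator):
    """Toy implementation: counts alleles, tags them, and keeps the collection."""

    collect_alleles = True

    def __init__(self) -> None:
        self.seen = 0
        self.collected: list = []

    def on_vrs_object(self, vcf_coords, vrs_allele, **kwargs):
        self.seen += 1
        if vrs_allele.get("skip"):
            return None
        return {**vrs_allele, "source": vcf_coords}

    def on_vrs_object_collection(self, vrs_alleles, **kwargs):
        self.collected = list(vrs_alleles)

    def raise_for_output_args(self, **kwargs):
        if not kwargs.get("output_path"):
            raise ValueError("no output was declared")
```

Passing everything through `**kwargs` keeps the base class ignorant of implementation-specific options, which is also where the clunkiness around use and documentation comes from.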

The existing pickle file dump is modified slightly to use VRS IDs as keys (unsure why the other thing was being used previously, could change back if necessary). It's refactored to be an optional add-on, and an additional option to output an NDJSON dump is added. A basic implementation incorporating all of this is defined in the class VcfAnnotator. The CLI is updated to use this class.
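As a rough illustration of the two dump formats, assuming each allele is a plain dict with an `id` field holding its VRS ID (the real objects are pydantic models, which would be serialized first, e.g. with `model_dump(exclude_none=True)`), the helpers might look something like this. `dump_pickle` and `dump_ndjson` are hypothetical names, not functions from this PR.

```python
import json
import pickle
from pathlib import Path


def dump_pickle(alleles: list, path: Path) -> None:
    """Write one pickled dict that maps each VRS ID to its allele."""
    keyed = {allele["id"]: allele for allele in alleles}
    with path.open("wb") as f:
        pickle.dump(keyed, f)


def dump_ndjson(alleles: list, path: Path) -> None:
    """Write one JSON document per line (newline-delimited JSON)."""
    with path.open("w", encoding="utf-8") as f:
        for allele in alleles:
            f.write(json.dumps(allele) + "\n")
```

NDJSON can be read back line by line without loading the whole file, which may matter for large VCFs; the pickle variant trades that for direct lookup by VRS ID.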

See https://github.com/biocommons/anyvar/blob/b57f400c1fe4b821828847d906ac0a5246308d36/src/anyvar/extras/vcf.py for an external implementation.

Some potential issues:

  • In general, the reliance on kwargs to pass arguments to the child class methods is a little clunky, both to use and to document.
  • Since retaining all constructed alleles might be costly for speed and memory, an implementation can enable or disable it with the class variable "collect_alleles". It's disabled by default, so if you added functionality like dumping to a file but forgot to set the class variable, nothing would happen. Ideally there would be some way to raise an ABC-style error in that case, but I'm not sure how.
  • Previously, the VRS allele collection that's retained while ingesting VCFs (it was named vrs_data before, I'm trying a name like allele_collection?) was retaining stringified dict dumps of alleles (i.e. str(allele.model_dump(exclude_none=True))). I'm not totally sure why this was the case, so I changed it to just hold onto the pydantic objects and defer decisions about serialization etc to on_vrs_object_collection. I am not sure if this causes memory issues with extremely large VCFs.
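On the collect_alleles point, one possible direction is a check at class-definition time using `__init_subclass__`. This is only a sketch under a different assumption from this PR: here on_vrs_object_collection is a non-abstract no-op by default, so that overriding it signals intent. `GuardedAnnotator` is a hypothetical name.

```python
from typing import Any, ClassVar


class GuardedAnnotator:
    """Hypothetical base class that rejects a collection hook which could never run."""

    collect_alleles: ClassVar[bool] = False

    def __init_subclass__(cls, **kwargs: Any) -> None:
        super().__init_subclass__(**kwargs)
        # Runs once per subclass, after its class body has been evaluated.
        overrides_hook = "on_vrs_object_collection" in cls.__dict__
        if overrides_hook and not cls.collect_alleles:
            raise TypeError(
                f"{cls.__name__} overrides on_vrs_object_collection but leaves "
                "collect_alleles False, so the hook would never be called"
            )

    def on_vrs_object_collection(self, vrs_alleles: list, **kwargs: Any) -> None:
        """Default no-op (an assumption; in this PR the method is abstract)."""
```

With the method kept abstract as in this PR, every implementation has to define it, so this particular check could not tell a deliberate no-op from a forgotten flag.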

@jsstevenson jsstevenson added the priority:medium Medium priority label Feb 7, 2025
@jsstevenson
@quinnwai

@jsstevenson jsstevenson merged commit 6fdff9c into main Feb 24, 2025
14 checks passed
@jsstevenson jsstevenson deleted the annotator-side-effect branch February 24, 2025 20:27

Development

Successfully merging this pull request may close these issues.

Add callback support for VCF annotator
