Add view -X flag to drop all aux tags#871

Open
EvanTheB wants to merge 1 commit into
samtools:developfrom
EvanTheB:develop

@EvanTheB

Contributor

I do not know if you want this feature, and it is implemented in the wrong place.

I added -X flag to drop all aux tags. I use this for compression when I just want to save the fastq-ish data.

I think I should have added the code to htslib. If you want it, I am happy to modify it so it works like that.

@jkbonfield

Contributor

I don't think this belongs in htslib really, but it could make use of the htslib feature to give hints to the decoder. On BAM there's little that can be done (except maybe not bothering to do validation?), but with CRAM it's possible to tell the decoder to ignore blocks in the file - don't bother decompressing them and no need to serialise all the tags together.

Eg:

    if (hts_set_opt(state->fp, CRAM_OPT_REQUIRED_FIELDS,
                    SAM_QNAME|SAM_FLAG|SAM_RNAME|SAM_POS|SAM_MAPQ|
                    SAM_CIGAR|SAM_RNEXT|SAM_PNEXT|SAM_TLEN|
                    SAM_SEQ|SAM_QUAL) < 0)
        error...

(Clearly we ought to add SAM_ALL so we can do SAM_ALL & ~SAM_AUX.)
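
As a rough sketch of the SAM_ALL & ~SAM_AUX idea: the field bits below are copied from what I believe htslib's enum sam_fields defines, so that the snippet compiles without htslib. EXAMPLE_SAM_ALL and fields_without_aux() are made-up names for illustration, not htslib API. The sketch also assumes that RG is tracked by a separate SAM_RGAUX bit, which has to be cleared too if no tags at all are wanted.

```c
#include <stdint.h>

/* Field bits as I understand htslib's enum sam_fields; copied here so
 * the sketch builds without htslib. Treat the values as an assumption. */
enum {
    SAM_QNAME = 0x001, SAM_FLAG  = 0x002, SAM_RNAME = 0x004,
    SAM_POS   = 0x008, SAM_MAPQ  = 0x010, SAM_CIGAR = 0x020,
    SAM_RNEXT = 0x040, SAM_PNEXT = 0x080, SAM_TLEN  = 0x100,
    SAM_SEQ   = 0x200, SAM_QUAL  = 0x400, SAM_AUX   = 0x800,
    SAM_RGAUX = 0x1000
};

/* Hypothetical "all fields" mask; no such constant exists in this thread. */
#define EXAMPLE_SAM_ALL 0x1fffu

/* Every field except the aux tags. The RG bit is cleared as well,
 * so that no tag of any kind is requested. */
static uint32_t fields_without_aux(void) {
    return EXAMPLE_SAM_ALL & ~(uint32_t)(SAM_AUX | SAM_RGAUX);
}
```

Under these assumed values the result is 0x7ff, and a name/sequence/quality-only mask (SAM_QNAME|SAM_SEQ|SAM_QUAL) is 0x601.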

@jkbonfield

Contributor

Maybe there's also an argument for adapting how the current --input-fmt-option required_fields=0x7ff implementation works (that example on CRAM does what you want, by the way, and is implemented using the hts_set_opt call above). If you wanted to go further and drop other fields like TLEN, RNEXT and PNEXT, and keep only primary read1/read2 records, you could use, say, 0x601 together with view -F 0xF00 to drop all the secondary, supplementary, etc. reads.
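
For reference, here is a small sketch of what a -F style exclusion mask tests. The FLAG bit values come from the SAM specification; the constant names and passes_exclude_filter() are hypothetical, chosen for this example rather than taken from samtools.

```c
#include <stdbool.h>
#include <stdint.h>

/* The four SAM FLAG bits that make up 0xF00 (values per the SAM spec;
 * the names here are illustrative only). */
enum {
    FLAG_SECONDARY     = 0x100,
    FLAG_QCFAIL        = 0x200,
    FLAG_DUPLICATE     = 0x400,
    FLAG_SUPPLEMENTARY = 0x800
};

/* Mirrors the -F test: keep a read only if none of the excluded
 * bits are set in its FLAG. */
static bool passes_exclude_filter(uint16_t flag, uint16_t exclude_mask) {
    return (flag & exclude_mask) == 0;
}
```

So 0xF00 excludes QC-failed and duplicate reads as well as secondary and supplementary ones; a plain paired read1 (for example flag 0x43) passes.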

However input-fmt-option is a hint. When reading a BAM record we'll have read all the data so the fields are already there. When reading CRAM, if the data is necessary for decoding of other fields (eg we must know POS and RNAME to decode SEQ) then it'll be in the structures, but otherwise it'll be given a place-holder value (*, 0, etc). Perhaps what we want though is a required_fields equivalent for output-fmt-option which goes beyond an optimisation hint to become a statement of what will be stored. At this point it's essentially a crude columnar filter. (Crude because it's all or nothing as far as tags go, barring RG.)
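
To make the hint-versus-guarantee distinction concrete, here is a minimal sketch of an output-side filter. Everything in it is hypothetical: struct example_record is a heavily simplified stand-in, not htslib's bam1_t, and the COL_* bits merely reuse the values I assume for the matching htslib field bits. The point is only that the writer overwrites unlisted columns regardless of what the reader decoded.

```c
#include <stdint.h>

/* Hypothetical, heavily simplified record; not htslib's bam1_t. */
struct example_record {
    const char *qname;
    int32_t     pos;
    uint8_t     mapq;
    const char *seq;
    const char *qual;
    const char *aux;   /* all tags as one string; "" when absent */
};

/* Hypothetical column bits, reusing the assumed htslib field values. */
enum {
    COL_QNAME = 0x001, COL_POS  = 0x008, COL_MAPQ = 0x010,
    COL_SEQ   = 0x200, COL_QUAL = 0x400, COL_AUX  = 0x800
};

/* Overwrite every column not named in keep with a placeholder, so the
 * mask states what is stored instead of hinting at what to decode. */
static void apply_output_fields(struct example_record *r, uint32_t keep) {
    if (!(keep & COL_QNAME)) r->qname = "*";
    if (!(keep & COL_POS))   r->pos   = 0;
    if (!(keep & COL_MAPQ))  r->mapq  = 0;
    if (!(keep & COL_SEQ))   r->seq   = "*";
    if (!(keep & COL_QUAL))  r->qual  = "*";
    if (!(keep & COL_AUX))   r->aux   = "";   /* all or nothing for tags */
}
```

Called with 0x601, this keeps the name, sequence and qualities and blanks the rest, whether the input was BAM or CRAM.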

Thoughts anyone?

@EvanTheB

Contributor Author

Is there documentation somewhere for what the --input-* and --output-* args take? I keep finding random examples scattered around but no exhaustive doc.

@EvanTheB

Contributor Author

Does #516 with an empty whitelist do the same?

@jkbonfield

jkbonfield commented Jun 20, 2018

Contributor

Thanks @EvanTheB - I'd totally forgotten about that aging PR! We should discuss it and make a decision as it's plenty mature by now. :-)

As for the option arguments, they're in the samtools man page under "GLOBAL OPTIONS". Quite a lot are CRAM only, or only apply on input or output, but this is described in the text. We just added level (compression level) there too, so I'll update the man page.

2 participants