Skip to content

gzip/bzip2 output options#747

Merged
ctb merged 30 commits into
masterfrom
feature/gzip505
Aug 1, 2015
Merged

gzip/bzip2 output options#747
ctb merged 30 commits into
masterfrom
feature/gzip505

Conversation

@b-wyss

@b-wyss b-wyss commented Jan 30, 2015

Copy link
Copy Markdown
Contributor

Feature creation in response to #505

@mr-c

mr-c commented Feb 10, 2015

Copy link
Copy Markdown
Contributor

Jenkins, retest this please

Comment thread setup.py Outdated

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not needed

@mr-c

mr-c commented Feb 10, 2015

Copy link
Copy Markdown
Contributor

Jenkins, test this please

1 similar comment
@mr-c

mr-c commented Feb 11, 2015

Copy link
Copy Markdown
Contributor

Jenkins, test this please

@mr-c

mr-c commented Feb 11, 2015

Copy link
Copy Markdown
Contributor

@b-wyss Looks like you have some PEP8 violations. make autopep8 should clear them up.

Keep on going, you've made good progress!

@b-wyss

b-wyss commented Feb 16, 2015

Copy link
Copy Markdown
Contributor Author

Scripts to update:

  • normalize-by-median (multiple files);
  • count-median;
  • extract-partitions;
  • extract-long-sequences;
  • extract-paired-reads;
  • fastq-to-fasta;
  • filter-abund;
  • filter-abund-single;
  • sample-reads-randomly;
  • split paired-reads;
  • interleave-reads;
  • trim-low-abund;

@ctb

ctb commented Feb 17, 2015

Copy link
Copy Markdown
Member

Looking at http://khmer.readthedocs.org/en/v1.3/user/scripts.html, I think:

filter-abund and filter-abund-single;
count-median;
extract-partitions;
extract-long-sequences;
extract-paired-reads;
fastq-to-fasta
interleave-reads
sample-reads-randomly
split-paired-reads

and eventually trim-low-abund when #759 is merged :)

Comment thread khmer/kfile.py Outdated

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be generalized a bit - right now the output file has to be in args.output, but for some of the scripts I think there will be a need for different output names/files, especially in the scripts that have multiple output files.

So maybe enable_output_compression can take a file handle, and return an fp (what gzip.GzipFile/bz2file.open give) for it?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(note: open for discussion. just a thought.)

@mr-c

mr-c commented Feb 17, 2015

Copy link
Copy Markdown
Contributor

@ctb First pass is for single outputs ( -o or similar)

@mr-c mr-c added this to the 1.4+ milestone May 13, 2015
@ctb ctb modified the milestones: 2.0, 1.4+ Jun 12, 2015
@SensibleSalmon

Copy link
Copy Markdown
Contributor

@b-wyss Mind if I vulture this?

@mr-c

mr-c commented Jul 21, 2015

Copy link
Copy Markdown
Contributor

Go for it

On Tue, Jul 21, 2015 at 10:52 AM Jake Fenton notifications@github.com
wrote:

@b-wyss https://github.com/b-wyss Mind if I vulture this?


Reply to this email directly or view it on GitHub
#747 (comment).

Michael R. Crusoe: Programmer & Bioinformatician crusoe@ucdavis.edu
The lab for Data Intensive Biology; University of California, Davis
https://impactstory.org/MichaelRCrusoe http://twitter.com/biocrusoe

@b-wyss

b-wyss commented Jul 21, 2015

Copy link
Copy Markdown
Contributor Author

Please do! Sorry I fell off the face of the earth, things got really crazy
towards the end of the spring, centered around a bit of a family situation.
I meant to officially leave and try to make it on good terms, but it never
really happened. You guys deserved to at least know what was going on with
me, and I didn't let anyone know. Seems like the lab has transferred over
pretty successfully - congratulations!

On Tue, Jul 21, 2015 at 1:57 PM Michael R. Crusoe notifications@github.com
wrote:

Go for it

On Tue, Jul 21, 2015 at 10:52 AM Jake Fenton notifications@github.com
wrote:

@b-wyss https://github.com/b-wyss Mind if I vulture this?


Reply to this email directly or view it on GitHub
#747 (comment).

Michael R. Crusoe: Programmer & Bioinformatician crusoe@ucdavis.edu
The lab for Data Intensive Biology; University of California, Davis
https://impactstory.org/MichaelRCrusoe http://twitter.com/biocrusoe


Reply to this email directly or view it on GitHub
#747 (comment).

@SensibleSalmon

Copy link
Copy Markdown
Contributor

All good! Stuff happens. Hope everything is okay and you're doing well!

@SensibleSalmon SensibleSalmon self-assigned this Jul 21, 2015
@SensibleSalmon

Copy link
Copy Markdown
Contributor

TO-DO:

  • Generalize enable compression: currently it takes args and converts args.output to a gz/bz writer. as per @ctb reccomendation, will shift to a func that takes a file handle and returns an FP. This will enable use in scripts a la normalize-by-median with multiple files.
  • Enable compression in normalize-by-median
    • determine which outputs in normalize-by-median should be compressed
  • figure out what happens if we do -o - --gzip in normalize-by-median (probably nothing good)

@SensibleSalmon

Copy link
Copy Markdown
Contributor

question: should the func that adds the compression args be moved to khmer_args?

https://github.com/dib-lab/khmer/pull/747/files#diff-46d242f8c7f4ca46fd5c784bd24184c7R165

@SensibleSalmon

Copy link
Copy Markdown
Contributor

normalize-by-median's stdout 3 test is leaking output

(It's my doing, I just need to remember to fi x it)

@SensibleSalmon

Copy link
Copy Markdown
Contributor

"revert" to argparse handling opening files:

  • normalize-by-median (multiple files);
  • count-median;
  • extract-partitions;
  • extract-long-sequences;
  • extract-paired-reads;
  • fastq-to-fasta;
  • filter-abund;
  • filter-abund-single;
  • sample-reads-randomly;
  • split paired-reads;
  • interleave-reads;
  • trim-low-abund;

@SensibleSalmon

Copy link
Copy Markdown
Contributor

split-paired-reads has this:

parser.add_argument('infile', nargs='?', default='/dev/stdin')

we should make than an argparse handled file open

Comment thread khmer/kfile.py Outdated

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is primarily a debugging thing and will be removed

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shrug it's ok to leave it in, with a brief explanation.

@SensibleSalmon

Copy link
Copy Markdown
Contributor

retest this, please

Comment thread scripts/fastq-to-fasta.py Outdated

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be args.output.name, no?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fails if we pass sys.stdout as output--it doesn't have a name.

That said, my str solution doesn't really help. I should check to see if it is in fact a block device or whathaveyou.

@mr-c reccomended making a function to do checking on file handles to see if they are fifos, block devices, etc. and factoring those checks out of the rest of the codebase.

@ctb ctb Jul 23, 2015 via email

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ctb

ctb commented Jul 30, 2015

Copy link
Copy Markdown
Member

Good question. No for now, but file an issue.

Titus Brown, ctbrown@ucdavis.edu

On Jul 30, 2015, at 11:22 AM, Jake Fenton notifications@github.com wrote:

question re trim-low-abund: we set aside stuff in a file, then go back to that file, pull things from it and then write to output. Currently, I don't compress the aside_file--Should I?


Reply to this email directly or view it on GitHub.

@SensibleSalmon

Copy link
Copy Markdown
Contributor

Right. Everything is in place except for more extensive test coverage.

Probably gonna make a test_output_compression.py or summuch

@ctb

ctb commented Jul 30, 2015

Copy link
Copy Markdown
Member

I would suggest adding streaming tests to the new test_streaming_io.

Titus Brown, ctbrown@ucdavis.edu

On Jul 30, 2015, at 11:31 AM, Jake Fenton notifications@github.com wrote:

Right. Everything is in place except for more extensive test coverage.

Probably gonna make a test_output_compression.py or summuch


Reply to this email directly or view it on GitHub.

@SensibleSalmon

Copy link
Copy Markdown
Contributor

Agreed, but my concern is that for nearly every case where we have an output file we could have a compressed output file, and we should be testing all of those--but that's like, 50 tests.

@SensibleSalmon

Copy link
Copy Markdown
Contributor

from high-bandwidth conversation; We'll employ stupidity-driven-testing here. No point in going through the combinatorial matrix (Well, there is a point, but the cost/benefit isn't there). So if somebody finds something that's super borked we'll fix it and add a test for it but as we stand we're probably on solid footing.

@SensibleSalmon

Copy link
Copy Markdown
Contributor

On that note, @ctb Merge?

Comment thread scripts/fastq-to-fasta.py

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lol, ok ;)

@ctb

ctb commented Jul 31, 2015

Copy link
Copy Markdown
Member
  • The number of changed files and the number of files mentioned in the ChangeLog entry are different;
    what's up?
  • Is is_block really only used in one place in the codebase? Is there nowhere else it belongs?
  • More generally, in the 'Wrote output to" section of fastq-to-fasta - is there a need for a general way to
    provide a human-readable output name for things that may-or-may-not-be-files? I would prefer that
    logic like if output_is_block be placed somewhere central and reusable (kfile.py?) - perhaps a
    function named 'describe_filehandle` or some such that returns a string?

@ctb

ctb commented Jul 31, 2015

Copy link
Copy Markdown
Member

Overall looks good. Double-check your diff-cover, fix issues above, ask for re-review :)

@SensibleSalmon

Copy link
Copy Markdown
Contributor

@ctb Cleaned up, ready for merge.

@ctb

ctb commented Aug 1, 2015

Copy link
Copy Markdown
Member

LGTM.

ctb added a commit that referenced this pull request Aug 1, 2015
@ctb ctb merged commit 0709ff8 into master Aug 1, 2015
@ctb ctb deleted the feature/gzip505 branch August 1, 2015 16:00
@SensibleSalmon

Copy link
Copy Markdown
Contributor

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants