Skip to content

Explore a fix for broken casava format handling in a number of scripts#818

Merged
mr-c merged 38 commits into
masterfrom
fix/casava_formats
Feb 23, 2015
Merged

Explore a fix for broken casava format handling in a number of scripts#818
mr-c merged 38 commits into
masterfrom
fix/casava_formats

Conversation

@ctb

@ctb ctb commented Feb 21, 2015

Copy link
Copy Markdown
Member

See #817, among others.

@ctb

ctb commented Feb 21, 2015

Copy link
Copy Markdown
Member Author

Confronting #817 head-on, we have a few different problems:

It looks to me like the right thing to do is make sure that parse_description is always set to false
in screed iterators. This will necessitate changing about a dozen things in scripts/. @mr-c thoughts on this wrt semantic versioning?

cc @standage

@ctb

ctb commented Feb 21, 2015

Copy link
Copy Markdown
Member Author

Addresses #763

@standage

Copy link
Copy Markdown
Member

This fixes the splitting/interleaving issues with Casava 1.8 AFAICS.

@mr-c

mr-c commented Feb 22, 2015

Copy link
Copy Markdown
Contributor

The changes might not be fun but if the tests pass without modifications then we are violating the semantic versioning.

@ctb

ctb commented Feb 22, 2015

Copy link
Copy Markdown
Member Author

Sorry, clarify: yes, fix, and to heck with semver?

On Feb 21, 2015, at 8:26 PM, Michael R. Crusoe notifications@github.com wrote:

The changes might not be fun but if the tests pass without modifications then we are violating the semantic versioning.


Reply to this email directly or view it on GitHub.

@mr-c

mr-c commented Feb 22, 2015

Copy link
Copy Markdown
Contributor

mea culpa; ----> Then we are not violating semantic versioning if the
tests continue to pass unmodified.

On Sat Feb 21 2015 at 8:34:43 PM C. Titus Brown notifications@github.com
wrote:

Sorry, clarify: yes, fix, and to heck with semver?

On Feb 21, 2015, at 8:26 PM, Michael R. Crusoe notifications@github.com
wrote:

The changes might not be fun but if the tests pass without modifications
then we are violating the semantic versioning.


Reply to this email directly or view it on GitHub.


Reply to this email directly or view it on GitHub
#818 (comment).

@mr-c

mr-c commented Feb 22, 2015

Copy link
Copy Markdown
Contributor

To summarize my understanding: khmer has always mangled sequence record names: either by dropping descriptions or appending different pairing notation on the end.

This PR aims to bring khmer in line with expected practice: both old and new style pair naming are accepted everywhere; descriptions will no longer be dropped.

With respect to semantic versioning we have three options:

A) We declare the old behavior a horrible bug and widely publicize the correction and thus we don't increment the major version number.

or

B) We accept that the old behavior was present for such a long time and likely required workarounds that will break upon its fixing. Therefore we increment the major version to indicate this, thus following our commitment to semantic (and not 'marketing') versioning.

or

C) we hid the corrected behavior behind a switch and wait for the 2.0 release to make it the default behavior

As Titus privately pointed out to me, seeing what changes will be required to update the protocols (if anything) will inform the wisdom of option B or C.

@ctb

ctb commented Feb 22, 2015

Copy link
Copy Markdown
Member Author

While additional tests are needed, and it's not yet ready for a full CR, it's mostly done, I think? Thoughts welcome. I'll finish sanding down the corners tomorrow or later today, as time permits.

@ctb

ctb commented Feb 22, 2015

Copy link
Copy Markdown
Member Author
  • add tests for filter-abund
  • add tests for filter-abund-single
  • resolve issues with count-median now allowing spaces in output
  • add tests for do-partition
  • add tests for annotate-partition and extract-partition
  • add tests for filter-stoptags
  • add tests for normalize-by-median
  • add tests for extract-long-sequences
  • add tests for fastq-to-fasta
  • add tests for sample-reads-randomly

OK, so it's a slightly bigger issue than I thought :).

@ctb

ctb commented Feb 23, 2015

Copy link
Copy Markdown
Member Author
  • file issue to update file formats to proper CSV, with headers, for khmer 2.0.

@ctb

ctb commented Feb 23, 2015

Copy link
Copy Markdown
Member Author
  • Is it mergable?
  • Did it pass the tests?
  • If it introduces new functionality in scripts/ is it tested?
    Check for code coverage.
  • Is it well formatted? Look at make pep8, make diff_pylint_report,
    make cppcheck, and make doc output. Use make format and manual
    fixing as needed.
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Is it documented in the ChangeLog?
  • Was a spellchecker run on the source code and documentation after
    changes were made?

@ctb

ctb commented Feb 23, 2015

Copy link
Copy Markdown
Member Author

Ready for review, w00t.

@ctb

ctb commented Feb 23, 2015

Copy link
Copy Markdown
Member Author

Additional issues/bugs found & fixed:

  • removed no longer needed test-data files;
  • interleave-reads.py used to happily interleave the same file twice; now it errors out.

Comment thread scripts/count-median.py Outdated

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'For khmer 1.x count-median.py will split sequence names at the first space which means that some ..'

Comment thread tests/test_scripts.py

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nifty if Casava 1.8 formatted test sequence files had 'c18' in their name.

@mr-c

mr-c commented Feb 23, 2015

Copy link
Copy Markdown
Contributor

LGTM!

mr-c added a commit that referenced this pull request Feb 23, 2015
Explore a fix for broken casava format handling in a number of scripts
@mr-c mr-c merged commit f938e15 into master Feb 23, 2015
@mr-c mr-c deleted the fix/casava_formats branch February 23, 2015 23:51
@ctb

ctb commented Feb 23, 2015

Copy link
Copy Markdown
Member Author

w00t!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants