first pass seqan impl by mr-c · Pull Request #642 · dib-lab/khmer

mr-c · 2014-10-29T04:22:21Z

Not for review; some tests fail for tested specific implementation quirks, others fail as seqan is very forgiving

mr-c · 2014-10-29T04:46:46Z

This does pass Camille's new tests.test_read_parsers.test_with_multiple_threads_big from #641 (not included for brevity)

Now to see if it is speedy.

mr-c · 2014-10-29T16:01:43Z

Implementation notes: Seqan has two variable names in conflict with defines
in our code and some IO related Linux headers. Hence the shuffling of the
#include preprocessor directives. This should be patched in our copy of
their headers and sent upstream.

To give thread safety a mutex was chosen for simplicity. @camillescott or @luizirber
can find a better solution here if needed. Seqan's use of the (deprecated
in C++11) autoptr does produce a lot of warnings. This could also be
patched via preprocessor directives to use the appropriate successors to
autoptr if compiling in c++11 mode.

A basic test with filter-abund shows comparable speed to the existing code.
Head-to-head comparisons with v1.1 using other code paths and larger and
more realistic data sets will be useful.

At this time the data suggests that the Seqan code is more correct and its
performance is of the same order of magnitude as the current code.

On Wed, Oct 29, 2014, 01:02 ged-jenkins notifications@github.com wrote:

Test FAILed.

—
Reply to this email directly or view it on GitHub
#642 (comment).

mr-c · 2014-10-29T17:46:30Z

Fixes #355, #77 (autodetect sequence type)

Confirmed to fix #249 (FASTQ parsing). Special thanks to @sjackman for his testcase which made confirming the fix super easy.

Will address parts of #274, #68

mr-c · 2014-10-30T09:41:57Z

I know this is a big PR, but it reduces the number of conditional branches by 635, the number of lines of code by 650 (not to mention all the code that was hiding behind preprocessor directives), and the number of classes by three.

Is it mergable?
Did it pass the tests?
If it introduces new functionality in scripts/ is it tested?
Check for code coverage.
Is it well formatted? Look at make pep8, make diff_pylint_report,
make cppcheck, and make doc output. Use make format and manual fixing as needed.
Is it documented in the ChangeLog?
Was a spellchecker run on the source code and documentation after
changes were made?

Comments are welcome by all, especially @luizirber @camillescott and @ctb

mr-c · 2014-10-30T10:07:43Z

Note: the bulk of the added lines is @camillescott 's test file of 200,000 lines that caught the missing reads error. (Thanks again to @camillescott !)

ctb · 2014-10-30T10:14:47Z

OK, so to try to summarize --

this is adding in the seqan library for sequence parsing;

seqan adds no new dependencies and is license compatible;

it fixes some rather serious issues with sequence parsing in our current code;

it doesn't seem to introduce any performance regressions;

so far, so good.

Given our good test coverage, the magnitude of the changes to code don't concern me too much. However, two questions --

is the plan to put this in for 1.2? It seems like a big change. OTOH the bug that @camillescott found is pretty problematic too :)

How do we check this against a really large-scale data set? We can at least be fairly certain that the old code didn't barf (e.g. memory leak) on billions of reads. We should ask @jiarong or @qingpeng to run it on one of their big files.

And I guess a third question - was there a discussion somewhere about including seqan? At this point I'm inferring this is something that was discussed offline, which is fine for starters but maybe bears further examination?

mr-c · 2014-10-30T10:57:21Z

My vote is for a v1.2 or v1.3 release; the FASTQ multithreading bug is pretty bad and very long standing.

I'm happy to do a larger run with whatever data. I think I'll do the acceptance test next and then @jiarong & @qingpeng 's big data.

Coverity scan comes up good.

Discussion was initially offline. I did the mockup to see if the performance was relatively okay and went from there.

Other libraries are either license incompatible (GATB is Affero GPL); not up to the task (Heng Li's code was raw + zlib only, no bzip2); or not oriented for use as a library (SeqDB as suggested by @luizirber)

See also: https://www.biostars.org/p/486/

Seqan already supports SAM/BAM, which will be useful for #523

mr-c · 2014-10-30T11:05:29Z

Given the magnitude of these changes I would like a sooner merge then later, to make less unpleasant the merge of @luizirber and @camillescott 's works.

ctb · 2014-10-30T11:06:09Z

On Thu, Oct 30, 2014 at 03:57:21AM -0700, Michael R. Crusoe wrote:

My vote is for a v1.2 or v1.3 release; the FASTQ multithreading bug is pretty bad and very long standing.

Agreed.

I'm happy to do a larger run with whatever data. I think I'll do the acceptance test next and then @jiarong & @qingpeng 's big data.

OK.

Coverity scan comes up good.

Great!

Discussion was initially offline. I did the mockup to see if the performance was relatively okay and went from there.

Other libraries are either license incompatible (GATB is Affero GPL); not up to the task (Heng Li's code was raw + zlib only, no bzip2); or not oriented for use as a library (SeqDB as suggested by @luizirber)

See also: https://www.biostars.org/p/486/

Seqan already supports SAM/BAM, which will be useful for #523

OK. Would appreciate a greater emphasis on making use of the issue tracker for
this kind of discussion; not only does it help to keep everyone on the team in
the loop (incuding, say, me), but we can bring in the broader community
(already tied in via biostars, admittedly) and it's part of being a true
open source project...

In this case, could you create an issue with the above paragraph and link it to
this PR? The issue can be closed immediately. I just want to have something
to refer people to that's not buried in the middle of a PR :).

thanks,

--titus

C. Titus Brown, ctb@msu.edu

ctb · 2014-10-30T11:11:02Z

On Thu, Oct 30, 2014 at 04:05:29AM -0700, Michael R. Crusoe wrote:

Given the magnitude of these changes I would like a sooner merge then later, to make less unpleasant the merge of @luizirber and @camillescott 's works.

We need to make damned sure we're not buying ourselves trouble with seqan
first, because I am confident that ripping this stuff back out is going
to be more painful than the merge problems. I think full release tests + large
file test (1 bn reads +) need to be run and results understood before a merge.

For the large file test, would suggest bugging @qingpeng for the soil data
and simply running diginorm -C 5 on a couple of the files. The fp rate
doesn't matter too much so you could use low memory - the thing to be sure
of are that memory doesn't go up through the run, and that seqan robustly
handles multiple very large files. (Call me paranoid, or maybe just
experienced.)

--t

C. Titus Brown, ctb@msu.edu

mr-c · 2014-10-30T11:19:48Z

On Thu Oct 30 2014 at 7:06:12 AM C. Titus Brown notifications@github.com
wrote:

On Thu, Oct 30, 2014 at 03:57:21AM -0700, Michael R. Crusoe wrote:

Discussion was initially offline. I did the mockup to see if the
performance was relatively okay and went from there.

Other libraries are either license incompatible (GATB is Affero GPL);
not up to the task (Heng Li's code was raw + zlib only, no bzip2); or not
oriented for use as a library (SeqDB as suggested by @luizirber)

See also: https://www.biostars.org/p/486/

Seqan already supports SAM/BAM, which will be useful for #523

OK. Would appreciate a greater emphasis on making use of the issue tracker
for
this kind of discussion; not only does it help to keep everyone on the
team in
the loop (incuding, say, me), but we can bring in the broader community
(already tied in via biostars, admittedly) and it's part of being a true
open source project...

Fair enough. The PR came about within 48 hours of in-person discussions.

In this case, could you create an issue with the above paragraph and link
it to
this PR? The issue can be closed immediately. I just want to have something
to refer people to that's not buried in the middle of a PR :).

Sure, done: see #643.

mr-c · 2014-10-30T11:20:39Z

On Thu Oct 30 2014 at 7:11:05 AM C. Titus Brown notifications@github.com
wrote:

On Thu, Oct 30, 2014 at 04:05:29AM -0700, Michael R. Crusoe wrote:

Given the magnitude of these changes I would like a sooner merge then
later, to make less unpleasant the merge of @luizirber and @camillescott 's
works.

We need to make damned sure we're not buying ourselves trouble with seqan
first, because I am confident that ripping this stuff back out is going
to be more painful than the merge problems. I think full release tests +
large
file test (1 bn reads +) need to be run and results understood before a
merge.

I agree wholeheartedly.

For the large file test, would suggest bugging @qingpeng for the soil data
and simply running diginorm -C 5 on a couple of the files. The fp rate
doesn't matter too much so you could use low memory - the thing to be sure
of are that memory doesn't go up through the run, and that seqan robustly
handles multiple very large files. (Call me paranoid, or maybe just
experienced.)

Sounds like a plan, thanks.

mr-c · 2014-10-30T18:12:25Z

Results from acceptance test

Dies at rsem step

ValueError: to_rgba: Invalid rgba arg "(0.0, 0.0, 1.0, '0.1')"
number in rbga sequence outside 0-1 range

Trinity assembly looks a bit off

root@seqantest:~/khmer-protocols/mrnaseq#   /usr/local/share/khmer/sandbox/assemstats3.py 500 /mnt/work/trinity_out_dir/Trinity.fasta
** cutoff: 500
N       sum     max     filename
35      61429   4449    /mnt/work/trinity_out_dir/Trinity.fasta

But the mouse-like genes are found

root@seqantest:~/khmer-protocols/mrnaseq# grep "zinc transporter" /mnt/blast/trinity.x.mouse | wc -l
24

I'm repeating now with khmer v1.1 to see if this is due to version skew on Ubuntu 14.04

mr-c · 2014-10-30T19:20:55Z

khmer v1.1 fails the acceptance test the same way. The assembly stats are pretty close to the seqan version and the same number of matches to mouse zinc transporter are found. Success?

7 of 14 total genes diff expressed at P > 0.950000
Writing differentially expressed genes list to 0-vs-6-hour.changed.csv
Got sample spec: 5 from condition 1, 5 from condition 2
Loading differentially expressed gene names from 0-vs-6-hour.changed.csv
Loading gene matrix from 0-vs-6-hour.matrix
plotting...
Output figure to: 0-vs-6-hour.matrix.png
Traceback (most recent call last):
  File "/usr/local/share/eel-pond/plot-expression.py", line 89, in <module>
    main()
  File "/usr/local/share/eel-pond/plot-expression.py", line 86, in main
    savefig(filename)
  File "/usr/lib/pymodules/python2.7/matplotlib/pyplot.py", line 561, in savefig
    return fig.savefig(*args, **kwargs)
  File "/usr/lib/pymodules/python2.7/matplotlib/figure.py", line 1421, in savefig
    self.canvas.print_figure(*args, **kwargs)
  File "/usr/lib/pymodules/python2.7/matplotlib/backend_bases.py", line 2220, in print_figure
    **kwargs)
  File "/usr/lib/pymodules/python2.7/matplotlib/backends/backend_agg.py", line 505, in print_png
    FigureCanvasAgg.draw(self)
  File "/usr/lib/pymodules/python2.7/matplotlib/backends/backend_agg.py", line 451, in draw
    self.figure.draw(self.renderer)
  File "/usr/lib/pymodules/python2.7/matplotlib/artist.py", line 55, in draw_wrapper
    draw(artist, renderer, *args, **kwargs)
  File "/usr/lib/pymodules/python2.7/matplotlib/figure.py", line 1034, in draw
    func(*args)
  File "/usr/lib/pymodules/python2.7/matplotlib/artist.py", line 55, in draw_wrapper
    draw(artist, renderer, *args, **kwargs)
  File "/usr/lib/pymodules/python2.7/matplotlib/axes.py", line 2086, in draw
    a.draw(renderer)
  File "/usr/lib/pymodules/python2.7/matplotlib/artist.py", line 55, in draw_wrapper
    draw(artist, renderer, *args, **kwargs)
  File "/usr/lib/pymodules/python2.7/matplotlib/lines.py", line 530, in draw
    gc.set_foreground(ln_color_rgba)
  File "/usr/lib/pymodules/python2.7/matplotlib/backend_bases.py", line 921, in set_foreground
    self._rgb = colors.colorConverter.to_rgba(fg)
  File "/usr/lib/pymodules/python2.7/matplotlib/colors.py", line 365, in to_rgba
    'to_rgba: Invalid rgba arg "%s"\n%s' % (str(arg), exc))
ValueError: to_rgba: Invalid rgba arg "(0.0, 0.0, 1.0, '0.1')"
number in rbga sequence outside 0-1 range

root@seqantest:~/khmer-protocols/mrnaseq#    /usr/local/share/khmer/sandbox/assemstats3.py 500 /mnt/work/trinity_out_dir/Trinity.fasta
** cutoff: 500
N       sum     max     filename
35      61355   4449    /mnt/work/trinity_out_dir/Trinity.fasta

root@seqantest:~/khmer-protocols/mrnaseq# grep "zinc transporter"  /mnt/blast/trinity.x.mouse | wc -l
24

luizirber · 2014-10-30T19:25:33Z

It looks like the ValueError comes from
https://github.com/ctb/eel-pond/blob/master/plot-expression.py#L74

Note the '0.1' value for alpha. Probably matplotlib became more stringent on its inputs?

Also, on the FAQ they use without quotes:
http://matplotlib.org/faq/howto_faq.html#save-transparent-figures

ctb · 2014-10-31T09:12:40Z

On Thu, Oct 30, 2014 at 12:25:33PM -0700, Luiz Irber wrote:

It looks like the ValueError comes from
https://github.com/ctb/eel-pond/blob/master/plot-expression.py#L74

Note the '0.1' value for alpha. Probably matplotlib became more stringent on its inputs?

Fixed.

--t

C. Titus Brown, ctb@msu.edu

ctb · 2014-10-31T10:32:53Z

On Thu, Oct 30, 2014 at 11:12:26AM -0700, Michael R. Crusoe wrote:

Trinity assembly looks a bit off

root@seqantest:~/khmer-protocols/mrnaseq#   /usr/local/share/khmer/sandbox/assemstats3.py 500 /mnt/work/trinity_out_dir/Trinity.fasta
** cutoff: 500
N       sum     max     filename
35      61429   4449    /mnt/work/trinity_out_dir/Trinity.fasta

Well, it's a small subset. So it should look "off" :)

ctb · 2014-10-31T10:37:03Z

On Thu, Oct 30, 2014 at 12:20:55PM -0700, Michael R. Crusoe wrote:

khmer v1.1 fails the acceptance test the same way. The assembly stats are pretty close to the seqan version and the same number of matches to mouse zinc transporter are found. Success?

success!

--titus

brtaylor92 · 2014-11-01T10:00:23Z

I cleaned up all the warnings I could find from seqan - see #646

mr-c · 2014-11-03T04:38:02Z

confirm that this is the correct paper to cite and do so http://www.biomedcentral.com/1471-2105/9/11

mr-c · 2014-11-04T19:49:44Z

Beginning an acceptance test with the full eel-pond data set while @qingpeng and I continue to search for the right data + workflow to stress test the seqan-backed ReadParser

ctb · 2014-11-04T21:57:58Z

On Nov 2, 2014, at 11:38 PM, Michael R. Crusoe notifications@github.com wrote:

• confirm that this is the correct paper to cite and do so http://www.biomedcentral.com/1471-2105/9/11

Let’s be sure to introduce ourselves to the project and verify this via e-mail, if/when we merge. I’m sure they’ll appreciate it :)

mr-c · 2014-11-05T00:27:30Z

In the interest of finding a previously-run 'very big data' analysis that used a multi-threaded ReadParser I've built on @bocajnotnef's list in #644 (comment)

Multi-threaded ReadParser:
abundance-dist-single.py
filter-abund-single.py
load-into-counting.py

Single-threaded ReadParser:
abundance-dist.py, indirectly
annotate-partitions.py, indirectly
count-overlap.py: indirectly
do-partition.py: indirectly
load-graph.py: says it is multithreaded but that is a lie

Screed:
count-median.py
extract-long-sequences.py
extract-partitions.py
filter-stoptags.py
interleave-reads.py
normalize-by-median.py
sample-reads-randomly.py
split-paired-reads.py

None:
find-knots.py
make-initial-stoptags.py
merge-partitions.py
partition-graph.py

mr-c · 2014-11-05T01:12:36Z

BaTLab results are coming in: builds & tests pass on Fedora 19, Fedora 20, RedHat 7 so far

mr-c · 2014-11-05T01:29:01Z

Debian 6, Ubuntu 10, Ubuntu 14, Scientific Linux 6

mr-c · 2014-11-05T01:58:17Z

Debian 7, RedHat 6, and Ubuntu 12.

to confirm: OS X 7 & 8. They need the OS X 9 specific compiler directive taken out.

mr-c · 2014-11-05T22:51:24Z

Turns out that -stdlib=c++ is not needed on OS X 10.10; the code compiles and the tests pass on OS X 10.7, 10.8, 10.9, and 10.10

mr-c · 2014-11-05T23:04:48Z

I've copied the changes to SeqAn back to a full release copy and I ran their tests:

100% tests passed, 0 tests failed out of 77

Total Test time (real) = 396.76 sec

mr-c · 2014-12-11T16:45:06Z

@ctb Shall I merge or do we want to wait for a review from @camillescott and/or you?

ctb · 2014-12-11T16:54:57Z

I'd like to take a quick look, thanks. Mostly to make sure I admire it :)

camillescott · 2014-12-11T17:49:28Z

Gimme an hour or two and I'll give it a final look as well.

On Thu, Dec 11, 2014 at 11:54 AM, C. Titus Brown notifications@github.com
wrote:

I'd like to take a quick look, thanks. Mostly to make sure I admire it :)

—
Reply to this email directly or view it on GitHub
#642 (comment).

Camille Scott

Department of Computer Science and Engineering
Lab for Genomics, Evolution, and Development
Michigan State University

camille.scott.w@gmail.com

ctb · 2014-12-14T15:46:58Z

Did a skim; LGTM. Much simplification in read_parsers ++.

Only remaining concern is solid threadsafety testing, but I'm guessing this is well covered by @camillescott's work on #655.

Merge?

mr-c · 2014-12-14T18:00:22Z

Thanks.

Threadsafety is guarenteed by the mutex inside SeqAnParser::imprint_next_read() https://github.com/ged-lab/khmer/pull/683/files#diff-be310f56bb83cef45967cfbe13f0adeeR43

camillescott · 2014-12-15T03:56:12Z

I can confirm that it is thread-safe. However, I suggest replacing the mutex with a spinlock. The pthread mutex requires a system call, which is very slow. A plain old atomic CAS (like is used for access to the bigcount map, for example) is much faster for a function being called millions of times.

camillescott · 2014-12-15T04:14:54Z

Other than that, I give it my stamp of approval, and I feel pretty confident after having used it with the multiprocessing work for a while now.

mr-c · 2014-12-15T05:12:00Z

Thanks @camillescott.

Here are the results of a benchmark I did with a simple spinlock implementation 38a22a5

(env)mcrusoe@athyra:~/khmer/gl-master$ for threads in {1..12}; do echo Threads: ${threads}; sudo su -c "echo 3 > /proc/sys/vm/drop_caches"; /usr/bin/time load-into-counting.py --threads ${threads} --report-total-kmers /dev/null /home/mcrusoe/data/kalamazoo/subreads-A-R1.fastq; done |& tee load-log-spinlock
[...]
(env)mcrusoe@athyra:~/khmer/gl-master$ grep user load-log-spinlock
13.02user 0.18system 0:17.31elapsed 76%CPU (0avgtext+0avgdata 16188maxresident)k
14.55user 0.16system 0:09.94elapsed 148%CPU (0avgtext+0avgdata 16216maxresident)k
16.29user 0.17system 0:07.48elapsed 219%CPU (0avgtext+0avgdata 16228maxresident)k
19.38user 0.19system 0:06.49elapsed 301%CPU (0avgtext+0avgdata 18276maxresident)k
28.26user 0.23system 0:07.18elapsed 396%CPU (0avgtext+0avgdata 18360maxresident)k
40.36user 0.25system 0:08.08elapsed 502%CPU (0avgtext+0avgdata 16364maxresident)k
43.39user 0.19system 0:07.65elapsed 569%CPU (0avgtext+0avgdata 18328maxresident)k
58.42user 0.23system 0:08.80elapsed 666%CPU (0avgtext+0avgdata 18444maxresident)k
76.96user 0.32system 0:10.26elapsed 753%CPU (0avgtext+0avgdata 18472maxresident)k
103.05user 0.35system 0:11.81elapsed 875%CPU (0avgtext+0avgdata 18476maxresident)k
135.60user 0.32system 0:13.88elapsed 978%CPU (0avgtext+0avgdata 20552maxresident)k
172.31user 0.46system 0:15.94elapsed 1083%CPU (0avgtext+0avgdata 20476maxresident)k
(env)mcrusoe@athyra:~/khmer/gl-master$ for threads in {1..12}; do echo Threads: ${threads}; sudo su -c "echo 3 > /proc/sys/vm/drop_caches"; /usr/bin/time load-into-counting.py --threads ${threads} --report-total-kmers /dev/null /home/mcrusoe/data/kalamazoo/subreads-A-R1.fastq; done |& tee load-log-mutex
[...]
(env)mcrusoe@athyra:~/khmer/gl-master$ grep user load-log-mutex
13.05user 0.18system 0:16.89elapsed 78%CPU (0avgtext+0avgdata 16188maxresident)k
14.43user 0.41system 0:09.50elapsed 156%CPU (0avgtext+0avgdata 18108maxresident)k
24.51user 5.10system 0:13.67elapsed 216%CPU (0avgtext+0avgdata 18176maxresident)k
31.91user 6.47system 0:16.59elapsed 231%CPU (0avgtext+0avgdata 18216maxresident)k
37.84user 7.62system 0:19.30elapsed 235%CPU (0avgtext+0avgdata 16328maxresident)k
38.42user 7.30system 0:19.25elapsed 237%CPU (0avgtext+0avgdata 18276maxresident)k
38.00user 7.51system 0:19.05elapsed 238%CPU (0avgtext+0avgdata 18384maxresident)k
38.69user 7.45system 0:19.15elapsed 240%CPU (0avgtext+0avgdata 18444maxresident)k
39.41user 7.85system 0:19.65elapsed 240%CPU (0avgtext+0avgdata 18432maxresident)k
39.56user 7.58system 0:19.59elapsed 240%CPU (0avgtext+0avgdata 20380maxresident)k
34.62user 6.95system 0:17.50elapsed 237%CPU (0avgtext+0avgdata 18452maxresident)k
39.81user 7.79system 0:19.78elapsed 240%CPU (0avgtext+0avgdata 18536maxresident)k

mr-c · 2014-12-15T05:12:49Z

(this are only 6 real cores on Athyra)
Looks like the mutex handles the higher load better. Both have disappointing scaling with more threads.

camillescott · 2014-12-15T05:23:52Z

Hrmm. This is... counterintuitive. What was the test?

mr-c · 2014-12-15T05:29:08Z

Oh, that got cut off.

for threads in {1..12}
        do echo Threads: ${threads}
        sudo su -c "echo 3 > /proc/sys/vm/drop_caches"
        /usr/bin/time load-into-counting.py --threads ${threads} \
                --report-total-kmers /dev/null \
                /home/mcrusoe/data/kalamazoo/subreads-A-R1.fastq;
done |& tee load-log-spinlock

mr-c · 2014-12-15T05:41:22Z

I also tried a better spinlock in the style of https://github.com/majek/dump/blob/master/msqueue/pthread_spin_lock_shim.h and got no improvement

ctb · 2014-12-15T11:09:58Z

@camillescott - your threading work in #655 renders all of this moot anyway, right? So even if it's not as fast as it could be now, it will be much faster soon?

mr-c · 2014-12-15T15:52:35Z

Heh, I tried to replicate the speed test with v1.2 and I get our old friend: "IOError: InvalidFASTQFileFormat: sequence and quality scores length mismatch"

mr-c · 2014-12-15T16:03:45Z

Same fail for ecoli_ref-5m.fastq & the old readparse. Good grief, did this ever work?

mr-c · 2014-12-15T16:09:57Z

and load-graph.py doesn't actually do multithreaded reading despite having an argument for that. Explains why more users didn't run into the read dropping/FASTQ parsing problems.

mr-c · 2014-12-15T17:58:05Z

Whoops, I was reading the output from /usr/bin/time wrong (the 3rd number is walltime). The spinlocks are indeed faster.

To cross check with v1.2 here is a different benchmark using a FASTQ file that the old readparser can handle in multithreaded mode (data/25k-casava1_8.fq.bz1)

Summary:

ReadParser shows a surprising amount of variability with the number of unique k-mers
Seqan+mutex is faster than ReadParser for two or more threads. Peak mutex performance is at four threads (though 12 (hyper)threads does a smidge better)
But Seqan+spinlocks are even faster than mutex or ReadParser for two or more threads. Peak performance is at 5 threads (Athyra is a 6 core system)

I ran these on OS X as well

ReadParser's best performance is with two threads (unsurprising given that iMac has two cores). There is a similar variability issue with the number of unique k-mers reported under Linux
Seqan+mutex AND Seqan+spinlocks are slower than ReadParser under OS X for all numbers of threads (0.43X - 2.30X slower)

Since OS X isn't the primary platform I am leaning towards merging this with spinlocks and revisiting the performance issue later.

Data and scripts @ https://gist.github.com/mr-c/106bfca65d1f385152d2

ctb · 2014-12-15T18:00:17Z

+1 - merge for cleanliness, revisit later.

sourcegraphbot · 2014-12-15T18:03:52Z

first pass seqan impl

mr-c · 2014-12-15T18:04:52Z

Done!

mr-c mentioned this pull request Oct 30, 2014

replacement C++ FAST[AQ] parser #643

Closed

mr-c mentioned this pull request Oct 30, 2014

added screed and read_parser streaming testing #644

Merged

mr-c added 2 commits December 11, 2014 11:37

restore header orde

0b78246

remove dropped reads known issue

1839ecd

mr-c force-pushed the test/simplify branch from 0d38898 to 1839ecd Compare December 11, 2014 16:40

switch to spinlocks

38a22a5

try direct clearing of the lock

d31cc7b

mr-c added a commit that referenced this pull request Dec 15, 2014

Merge pull request #642 from ged-lab/test/simplify

ed97e8a

first pass seqan impl

mr-c merged commit ed97e8a into master Dec 15, 2014

This was referenced Dec 15, 2014

limited automatic detection of sequence file type by load-into-counting.py #355

Closed

Determine sequence format by sniffing, not by filename extension #77

Closed

InvalidFASTQFileFormat: sequence and quality scores length mismatch #249

Closed

Conversation

mr-c commented Oct 29, 2014

Uh oh!

mr-c commented Oct 29, 2014

Uh oh!

mr-c commented Oct 29, 2014

Uh oh!

mr-c commented Oct 29, 2014

Uh oh!

mr-c commented Oct 30, 2014

Uh oh!

mr-c commented Oct 30, 2014

Uh oh!

ctb commented Oct 30, 2014

Uh oh!

mr-c commented Oct 30, 2014

Uh oh!

mr-c commented Oct 30, 2014

Uh oh!

ctb commented Oct 30, 2014

--titus

Uh oh!

ctb commented Oct 30, 2014

--t

Uh oh!

mr-c commented Oct 30, 2014

Uh oh!

mr-c commented Oct 30, 2014

Uh oh!

mr-c commented Oct 30, 2014

Uh oh!

mr-c commented Oct 30, 2014

Uh oh!

luizirber commented Oct 30, 2014

Uh oh!

ctb commented Oct 31, 2014

--t

Uh oh!

ctb commented Oct 31, 2014

Uh oh!

ctb commented Oct 31, 2014

Uh oh!

brtaylor92 commented Nov 1, 2014

Uh oh!

mr-c commented Nov 3, 2014

Uh oh!

mr-c commented Nov 4, 2014

Uh oh!

ctb commented Nov 4, 2014

Uh oh!

mr-c commented Nov 5, 2014

Uh oh!

mr-c commented Nov 5, 2014

Uh oh!

mr-c commented Nov 5, 2014

Uh oh!

mr-c commented Nov 5, 2014

Uh oh!

mr-c commented Nov 5, 2014

Uh oh!

mr-c commented Nov 5, 2014

Uh oh!

mr-c commented Dec 11, 2014

Uh oh!

ctb commented Dec 11, 2014

Uh oh!

camillescott commented Dec 11, 2014

Uh oh!

ctb commented Dec 14, 2014

Uh oh!

mr-c commented Dec 14, 2014

Uh oh!

camillescott commented Dec 15, 2014

Uh oh!

camillescott commented Dec 15, 2014

Uh oh!

mr-c commented Dec 15, 2014

Uh oh!

mr-c commented Dec 15, 2014

Uh oh!