Skip to content

first pass seqan impl#642

Merged
mr-c merged 13 commits into
masterfrom
test/simplify
Dec 15, 2014
Merged

first pass seqan impl#642
mr-c merged 13 commits into
masterfrom
test/simplify

Conversation

@mr-c

@mr-c mr-c commented Oct 29, 2014

Copy link
Copy Markdown
Contributor

Not for review; some tests fail for tested specific implementation quirks, others fail as seqan is very forgiving

@mr-c

mr-c commented Oct 29, 2014

Copy link
Copy Markdown
Contributor Author

This does pass Camille's new tests.test_read_parsers.test_with_multiple_threads_big from #641 (not included for brevity)

Now to see if it is speedy.

@mr-c

mr-c commented Oct 29, 2014

Copy link
Copy Markdown
Contributor Author

Implementation notes: Seqan has two variable names in conflict with defines
in our code and some IO related Linux headers. Hence the shuffling of the
#include preprocessor directives. This should be patched in our copy of
their headers and sent upstream.

To give thread safety a mutex was chosen for simplicity. @camillescott or @luizirber
can find a better solution here if needed. Seqan's use of the (deprecated
in C++11) autoptr does produce a lot of warnings. This could also be
patched via preprocessor directives to use the appropriate successors to
autoptr if compiling in c++11 mode.

A basic test with filter-abund shows comparable speed to the existing code.
Head-to-head comparisons with v1.1 using other code paths and larger and
more realistic data sets will be useful.

At this time the data suggests that the Seqan code is more correct and its
performance is of the same order of magnitude as the current code.

On Wed, Oct 29, 2014, 01:02 ged-jenkins notifications@github.com wrote:

Test FAILed.


Reply to this email directly or view it on GitHub
#642 (comment).

@mr-c

mr-c commented Oct 29, 2014

Copy link
Copy Markdown
Contributor Author

Fixes #355, #77 (autodetect sequence type)

Confirmed to fix #249 (FASTQ parsing). Special thanks to @sjackman for his testcase which made confirming the fix super easy.

Will address parts of #274, #68

@mr-c

mr-c commented Oct 30, 2014

Copy link
Copy Markdown
Contributor Author

I know this is a big PR, but it reduces the number of conditional branches by 635, the number of lines of code by 650 (not to mention all the code that was hiding behind preprocessor directives), and the number of classes by three.

  • Is it mergable?
  • Did it pass the tests?
  • If it introduces new functionality in scripts/ is it tested?
    Check for code coverage.
  • Is it well formatted? Look at make pep8, make diff_pylint_report,
    make cppcheck, and make doc output. Use make format and manual fixing as needed.
  • Is it documented in the ChangeLog?
  • Was a spellchecker run on the source code and documentation after
    changes were made?

Comments are welcome by all, especially @luizirber @camillescott and @ctb

@mr-c

mr-c commented Oct 30, 2014

Copy link
Copy Markdown
Contributor Author

Note: the bulk of the added lines is @camillescott 's test file of 200,000 lines that caught the missing reads error. (Thanks again to @camillescott !)

@ctb

ctb commented Oct 30, 2014

Copy link
Copy Markdown
Member

OK, so to try to summarize --

this is adding in the seqan library for sequence parsing;

seqan adds no new dependencies and is license compatible;

it fixes some rather serious issues with sequence parsing in our current code;

it doesn't seem to introduce any performance regressions;

so far, so good.


Given our good test coverage, the magnitude of the changes to code don't concern me too much. However, two questions --

is the plan to put this in for 1.2? It seems like a big change. OTOH the bug that @camillescott found is pretty problematic too :)

How do we check this against a really large-scale data set? We can at least be fairly certain that the old code didn't barf (e.g. memory leak) on billions of reads. We should ask @jiarong or @qingpeng to run it on one of their big files.

And I guess a third question - was there a discussion somewhere about including seqan? At this point I'm inferring this is something that was discussed offline, which is fine for starters but maybe bears further examination?

@mr-c

mr-c commented Oct 30, 2014

Copy link
Copy Markdown
Contributor Author

My vote is for a v1.2 or v1.3 release; the FASTQ multithreading bug is pretty bad and very long standing.

I'm happy to do a larger run with whatever data. I think I'll do the acceptance test next and then @jiarong & @qingpeng 's big data.

Coverity scan comes up good.

Discussion was initially offline. I did the mockup to see if the performance was relatively okay and went from there.

Other libraries are either license incompatible (GATB is Affero GPL); not up to the task (Heng Li's code was raw + zlib only, no bzip2); or not oriented for use as a library (SeqDB as suggested by @luizirber)

See also: https://www.biostars.org/p/486/

Seqan already supports SAM/BAM, which will be useful for #523

@mr-c

mr-c commented Oct 30, 2014

Copy link
Copy Markdown
Contributor Author

Given the magnitude of these changes I would like a sooner merge then later, to make less unpleasant the merge of @luizirber and @camillescott 's works.

@ctb

ctb commented Oct 30, 2014

Copy link
Copy Markdown
Member

On Thu, Oct 30, 2014 at 03:57:21AM -0700, Michael R. Crusoe wrote:

My vote is for a v1.2 or v1.3 release; the FASTQ multithreading bug is pretty bad and very long standing.

Agreed.

I'm happy to do a larger run with whatever data. I think I'll do the acceptance test next and then @jiarong & @qingpeng 's big data.

OK.

Coverity scan comes up good.

Great!

Discussion was initially offline. I did the mockup to see if the performance was relatively okay and went from there.

Other libraries are either license incompatible (GATB is Affero GPL); not up to the task (Heng Li's code was raw + zlib only, no bzip2); or not oriented for use as a library (SeqDB as suggested by @luizirber)

See also: https://www.biostars.org/p/486/

Seqan already supports SAM/BAM, which will be useful for #523

OK. Would appreciate a greater emphasis on making use of the issue tracker for
this kind of discussion; not only does it help to keep everyone on the team in
the loop (incuding, say, me), but we can bring in the broader community
(already tied in via biostars, admittedly) and it's part of being a true
open source project...

In this case, could you create an issue with the above paragraph and link it to
this PR? The issue can be closed immediately. I just want to have something
to refer people to that's not buried in the middle of a PR :).

thanks,

--titus

C. Titus Brown, ctb@msu.edu

@ctb

ctb commented Oct 30, 2014

Copy link
Copy Markdown
Member

On Thu, Oct 30, 2014 at 04:05:29AM -0700, Michael R. Crusoe wrote:

Given the magnitude of these changes I would like a sooner merge then later, to make less unpleasant the merge of @luizirber and @camillescott 's works.

We need to make damned sure we're not buying ourselves trouble with seqan
first, because I am confident that ripping this stuff back out is going
to be more painful than the merge problems. I think full release tests + large
file test (1 bn reads +) need to be run and results understood before a merge.

For the large file test, would suggest bugging @qingpeng for the soil data
and simply running diginorm -C 5 on a couple of the files. The fp rate
doesn't matter too much so you could use low memory - the thing to be sure
of are that memory doesn't go up through the run, and that seqan robustly
handles multiple very large files. (Call me paranoid, or maybe just
experienced.)

--t

C. Titus Brown, ctb@msu.edu

@mr-c

mr-c commented Oct 30, 2014

Copy link
Copy Markdown
Contributor Author

On Thu Oct 30 2014 at 7:06:12 AM C. Titus Brown notifications@github.com
wrote:

On Thu, Oct 30, 2014 at 03:57:21AM -0700, Michael R. Crusoe wrote:

Discussion was initially offline. I did the mockup to see if the
performance was relatively okay and went from there.

Other libraries are either license incompatible (GATB is Affero GPL);
not up to the task (Heng Li's code was raw + zlib only, no bzip2); or not
oriented for use as a library (SeqDB as suggested by @luizirber)

See also: https://www.biostars.org/p/486/

Seqan already supports SAM/BAM, which will be useful for #523

OK. Would appreciate a greater emphasis on making use of the issue tracker
for
this kind of discussion; not only does it help to keep everyone on the
team in
the loop (incuding, say, me), but we can bring in the broader community
(already tied in via biostars, admittedly) and it's part of being a true
open source project...

Fair enough. The PR came about within 48 hours of in-person discussions.

In this case, could you create an issue with the above paragraph and link
it to
this PR? The issue can be closed immediately. I just want to have something
to refer people to that's not buried in the middle of a PR :).

Sure, done: see #643.

@mr-c

mr-c commented Oct 30, 2014

Copy link
Copy Markdown
Contributor Author

On Thu Oct 30 2014 at 7:11:05 AM C. Titus Brown notifications@github.com
wrote:

On Thu, Oct 30, 2014 at 04:05:29AM -0700, Michael R. Crusoe wrote:

Given the magnitude of these changes I would like a sooner merge then
later, to make less unpleasant the merge of @luizirber and @camillescott 's
works.

We need to make damned sure we're not buying ourselves trouble with seqan
first, because I am confident that ripping this stuff back out is going
to be more painful than the merge problems. I think full release tests +
large
file test (1 bn reads +) need to be run and results understood before a
merge.

I agree wholeheartedly.

For the large file test, would suggest bugging @qingpeng for the soil data
and simply running diginorm -C 5 on a couple of the files. The fp rate
doesn't matter too much so you could use low memory - the thing to be sure
of are that memory doesn't go up through the run, and that seqan robustly
handles multiple very large files. (Call me paranoid, or maybe just
experienced.)

Sounds like a plan, thanks.

@mr-c

mr-c commented Oct 30, 2014

Copy link
Copy Markdown
Contributor Author

Results from acceptance test

Dies at rsem step

ValueError: to_rgba: Invalid rgba arg "(0.0, 0.0, 1.0, '0.1')"
number in rbga sequence outside 0-1 range

Trinity assembly looks a bit off

root@seqantest:~/khmer-protocols/mrnaseq#   /usr/local/share/khmer/sandbox/assemstats3.py 500 /mnt/work/trinity_out_dir/Trinity.fasta
** cutoff: 500
N       sum     max     filename
35      61429   4449    /mnt/work/trinity_out_dir/Trinity.fasta

But the mouse-like genes are found

root@seqantest:~/khmer-protocols/mrnaseq# grep "zinc transporter" /mnt/blast/trinity.x.mouse | wc -l
24

I'm repeating now with khmer v1.1 to see if this is due to version skew on Ubuntu 14.04

@mr-c

mr-c commented Oct 30, 2014

Copy link
Copy Markdown
Contributor Author

khmer v1.1 fails the acceptance test the same way. The assembly stats are pretty close to the seqan version and the same number of matches to mouse zinc transporter are found. Success?

7 of 14 total genes diff expressed at P > 0.950000
Writing differentially expressed genes list to 0-vs-6-hour.changed.csv
Got sample spec: 5 from condition 1, 5 from condition 2
Loading differentially expressed gene names from 0-vs-6-hour.changed.csv
Loading gene matrix from 0-vs-6-hour.matrix
plotting...
Output figure to: 0-vs-6-hour.matrix.png
Traceback (most recent call last):
  File "/usr/local/share/eel-pond/plot-expression.py", line 89, in <module>
    main()
  File "/usr/local/share/eel-pond/plot-expression.py", line 86, in main
    savefig(filename)
  File "/usr/lib/pymodules/python2.7/matplotlib/pyplot.py", line 561, in savefig
    return fig.savefig(*args, **kwargs)
  File "/usr/lib/pymodules/python2.7/matplotlib/figure.py", line 1421, in savefig
    self.canvas.print_figure(*args, **kwargs)
  File "/usr/lib/pymodules/python2.7/matplotlib/backend_bases.py", line 2220, in print_figure
    **kwargs)
  File "/usr/lib/pymodules/python2.7/matplotlib/backends/backend_agg.py", line 505, in print_png
    FigureCanvasAgg.draw(self)
  File "/usr/lib/pymodules/python2.7/matplotlib/backends/backend_agg.py", line 451, in draw
    self.figure.draw(self.renderer)
  File "/usr/lib/pymodules/python2.7/matplotlib/artist.py", line 55, in draw_wrapper
    draw(artist, renderer, *args, **kwargs)
  File "/usr/lib/pymodules/python2.7/matplotlib/figure.py", line 1034, in draw
    func(*args)
  File "/usr/lib/pymodules/python2.7/matplotlib/artist.py", line 55, in draw_wrapper
    draw(artist, renderer, *args, **kwargs)
  File "/usr/lib/pymodules/python2.7/matplotlib/axes.py", line 2086, in draw
    a.draw(renderer)
  File "/usr/lib/pymodules/python2.7/matplotlib/artist.py", line 55, in draw_wrapper
    draw(artist, renderer, *args, **kwargs)
  File "/usr/lib/pymodules/python2.7/matplotlib/lines.py", line 530, in draw
    gc.set_foreground(ln_color_rgba)
  File "/usr/lib/pymodules/python2.7/matplotlib/backend_bases.py", line 921, in set_foreground
    self._rgb = colors.colorConverter.to_rgba(fg)
  File "/usr/lib/pymodules/python2.7/matplotlib/colors.py", line 365, in to_rgba
    'to_rgba: Invalid rgba arg "%s"\n%s' % (str(arg), exc))
ValueError: to_rgba: Invalid rgba arg "(0.0, 0.0, 1.0, '0.1')"
number in rbga sequence outside 0-1 range
root@seqantest:~/khmer-protocols/mrnaseq#    /usr/local/share/khmer/sandbox/assemstats3.py 500 /mnt/work/trinity_out_dir/Trinity.fasta
** cutoff: 500
N       sum     max     filename
35      61355   4449    /mnt/work/trinity_out_dir/Trinity.fasta
root@seqantest:~/khmer-protocols/mrnaseq# grep "zinc transporter"  /mnt/blast/trinity.x.mouse | wc -l
24

@luizirber

Copy link
Copy Markdown
Member

It looks like the ValueError comes from
https://github.com/ctb/eel-pond/blob/master/plot-expression.py#L74

Note the '0.1' value for alpha. Probably matplotlib became more stringent on its inputs?

Also, on the FAQ they use without quotes:
http://matplotlib.org/faq/howto_faq.html#save-transparent-figures

@ctb

ctb commented Oct 31, 2014

Copy link
Copy Markdown
Member

On Thu, Oct 30, 2014 at 12:25:33PM -0700, Luiz Irber wrote:

It looks like the ValueError comes from
https://github.com/ctb/eel-pond/blob/master/plot-expression.py#L74

Note the '0.1' value for alpha. Probably matplotlib became more stringent on its inputs?

Fixed.

--t

C. Titus Brown, ctb@msu.edu

@ctb

ctb commented Oct 31, 2014

Copy link
Copy Markdown
Member

On Thu, Oct 30, 2014 at 11:12:26AM -0700, Michael R. Crusoe wrote:

Trinity assembly looks a bit off

root@seqantest:~/khmer-protocols/mrnaseq#   /usr/local/share/khmer/sandbox/assemstats3.py 500 /mnt/work/trinity_out_dir/Trinity.fasta
** cutoff: 500
N       sum     max     filename
35      61429   4449    /mnt/work/trinity_out_dir/Trinity.fasta

Well, it's a small subset. So it should look "off" :)

@ctb

ctb commented Oct 31, 2014

Copy link
Copy Markdown
Member

On Thu, Oct 30, 2014 at 12:20:55PM -0700, Michael R. Crusoe wrote:

khmer v1.1 fails the acceptance test the same way. The assembly stats are pretty close to the seqan version and the same number of matches to mouse zinc transporter are found. Success?

success!

--titus

@brtaylor92

Copy link
Copy Markdown
Contributor

I cleaned up all the warnings I could find from seqan - see #646

@mr-c

mr-c commented Nov 3, 2014

Copy link
Copy Markdown
Contributor Author

@mr-c

mr-c commented Nov 4, 2014

Copy link
Copy Markdown
Contributor Author

Beginning an acceptance test with the full eel-pond data set while @qingpeng and I continue to search for the right data + workflow to stress test the seqan-backed ReadParser

@ctb

ctb commented Nov 4, 2014

Copy link
Copy Markdown
Member

On Nov 2, 2014, at 11:38 PM, Michael R. Crusoe notifications@github.com wrote:

• confirm that this is the correct paper to cite and do so http://www.biomedcentral.com/1471-2105/9/11

Let’s be sure to introduce ourselves to the project and verify this via e-mail, if/when we merge. I’m sure they’ll appreciate it :)

@mr-c

mr-c commented Nov 5, 2014

Copy link
Copy Markdown
Contributor Author

In the interest of finding a previously-run 'very big data' analysis that used a multi-threaded ReadParser I've built on @bocajnotnef's list in #644 (comment)

Multi-threaded ReadParser:
abundance-dist-single.py
filter-abund-single.py
load-into-counting.py

Single-threaded ReadParser:
abundance-dist.py, indirectly
annotate-partitions.py, indirectly
count-overlap.py: indirectly
do-partition.py: indirectly
load-graph.py: says it is multithreaded but that is a lie

Screed:
count-median.py
extract-long-sequences.py
extract-partitions.py
filter-stoptags.py
interleave-reads.py
normalize-by-median.py
sample-reads-randomly.py
split-paired-reads.py

None:
find-knots.py
make-initial-stoptags.py
merge-partitions.py
partition-graph.py

@mr-c

mr-c commented Nov 5, 2014

Copy link
Copy Markdown
Contributor Author

BaTLab results are coming in: builds & tests pass on Fedora 19, Fedora 20, RedHat 7 so far

@mr-c

mr-c commented Nov 5, 2014

Copy link
Copy Markdown
Contributor Author
  • Debian 6, Ubuntu 10, Ubuntu 14, Scientific Linux 6

@mr-c

mr-c commented Nov 5, 2014

Copy link
Copy Markdown
Contributor Author
  • Debian 7, RedHat 6, and Ubuntu 12.

to confirm: OS X 7 & 8. They need the OS X 9 specific compiler directive taken out.

@mr-c

mr-c commented Nov 5, 2014

Copy link
Copy Markdown
Contributor Author

Turns out that -stdlib=c++ is not needed on OS X 10.10; the code compiles and the tests pass on OS X 10.7, 10.8, 10.9, and 10.10

@mr-c

mr-c commented Nov 5, 2014

Copy link
Copy Markdown
Contributor Author

I've copied the changes to SeqAn back to a full release copy and I ran their tests:

100% tests passed, 0 tests failed out of 77

Total Test time (real) = 396.76 sec

@mr-c

mr-c commented Dec 11, 2014

Copy link
Copy Markdown
Contributor Author

@ctb Shall I merge or do we want to wait for a review from @camillescott and/or you?

@ctb

ctb commented Dec 11, 2014

Copy link
Copy Markdown
Member

I'd like to take a quick look, thanks. Mostly to make sure I admire it :)

@camillescott

Copy link
Copy Markdown
Member

Gimme an hour or two and I'll give it a final look as well.

On Thu, Dec 11, 2014 at 11:54 AM, C. Titus Brown notifications@github.com
wrote:

I'd like to take a quick look, thanks. Mostly to make sure I admire it :)


Reply to this email directly or view it on GitHub
#642 (comment).

Camille Scott

Department of Computer Science and Engineering
Lab for Genomics, Evolution, and Development
Michigan State University

camille.scott.w@gmail.com

@ctb

ctb commented Dec 14, 2014

Copy link
Copy Markdown
Member

Did a skim; LGTM. Much simplification in read_parsers ++.

Only remaining concern is solid threadsafety testing, but I'm guessing this is well covered by @camillescott's work on #655.

Merge?

@mr-c

mr-c commented Dec 14, 2014

Copy link
Copy Markdown
Contributor Author

Thanks.

Threadsafety is guarenteed by the mutex inside SeqAnParser::imprint_next_read() https://github.com/ged-lab/khmer/pull/683/files#diff-be310f56bb83cef45967cfbe13f0adeeR43

@camillescott

Copy link
Copy Markdown
Member

I can confirm that it is thread-safe. However, I suggest replacing the mutex with a spinlock. The pthread mutex requires a system call, which is very slow. A plain old atomic CAS (like is used for access to the bigcount map, for example) is much faster for a function being called millions of times.

@camillescott

Copy link
Copy Markdown
Member

Other than that, I give it my stamp of approval, and I feel pretty confident after having used it with the multiprocessing work for a while now.

@mr-c

mr-c commented Dec 15, 2014

Copy link
Copy Markdown
Contributor Author

Thanks @camillescott.

Here are the results of a benchmark I did with a simple spinlock implementation 38a22a5

(env)mcrusoe@athyra:~/khmer/gl-master$ for threads in {1..12}; do echo Threads: ${threads}; sudo su -c "echo 3 > /proc/sys/vm/drop_caches"; /usr/bin/time load-into-counting.py --threads ${threads} --report-total-kmers /dev/null /home/mcrusoe/data/kalamazoo/subreads-A-R1.fastq; done |& tee load-log-spinlock
[...]
(env)mcrusoe@athyra:~/khmer/gl-master$ grep user load-log-spinlock
13.02user 0.18system 0:17.31elapsed 76%CPU (0avgtext+0avgdata 16188maxresident)k
14.55user 0.16system 0:09.94elapsed 148%CPU (0avgtext+0avgdata 16216maxresident)k
16.29user 0.17system 0:07.48elapsed 219%CPU (0avgtext+0avgdata 16228maxresident)k
19.38user 0.19system 0:06.49elapsed 301%CPU (0avgtext+0avgdata 18276maxresident)k
28.26user 0.23system 0:07.18elapsed 396%CPU (0avgtext+0avgdata 18360maxresident)k
40.36user 0.25system 0:08.08elapsed 502%CPU (0avgtext+0avgdata 16364maxresident)k
43.39user 0.19system 0:07.65elapsed 569%CPU (0avgtext+0avgdata 18328maxresident)k
58.42user 0.23system 0:08.80elapsed 666%CPU (0avgtext+0avgdata 18444maxresident)k
76.96user 0.32system 0:10.26elapsed 753%CPU (0avgtext+0avgdata 18472maxresident)k
103.05user 0.35system 0:11.81elapsed 875%CPU (0avgtext+0avgdata 18476maxresident)k
135.60user 0.32system 0:13.88elapsed 978%CPU (0avgtext+0avgdata 20552maxresident)k
172.31user 0.46system 0:15.94elapsed 1083%CPU (0avgtext+0avgdata 20476maxresident)k
(env)mcrusoe@athyra:~/khmer/gl-master$ for threads in {1..12}; do echo Threads: ${threads}; sudo su -c "echo 3 > /proc/sys/vm/drop_caches"; /usr/bin/time load-into-counting.py --threads ${threads} --report-total-kmers /dev/null /home/mcrusoe/data/kalamazoo/subreads-A-R1.fastq; done |& tee load-log-mutex
[...]
(env)mcrusoe@athyra:~/khmer/gl-master$ grep user load-log-mutex
13.05user 0.18system 0:16.89elapsed 78%CPU (0avgtext+0avgdata 16188maxresident)k
14.43user 0.41system 0:09.50elapsed 156%CPU (0avgtext+0avgdata 18108maxresident)k
24.51user 5.10system 0:13.67elapsed 216%CPU (0avgtext+0avgdata 18176maxresident)k
31.91user 6.47system 0:16.59elapsed 231%CPU (0avgtext+0avgdata 18216maxresident)k
37.84user 7.62system 0:19.30elapsed 235%CPU (0avgtext+0avgdata 16328maxresident)k
38.42user 7.30system 0:19.25elapsed 237%CPU (0avgtext+0avgdata 18276maxresident)k
38.00user 7.51system 0:19.05elapsed 238%CPU (0avgtext+0avgdata 18384maxresident)k
38.69user 7.45system 0:19.15elapsed 240%CPU (0avgtext+0avgdata 18444maxresident)k
39.41user 7.85system 0:19.65elapsed 240%CPU (0avgtext+0avgdata 18432maxresident)k
39.56user 7.58system 0:19.59elapsed 240%CPU (0avgtext+0avgdata 20380maxresident)k
34.62user 6.95system 0:17.50elapsed 237%CPU (0avgtext+0avgdata 18452maxresident)k
39.81user 7.79system 0:19.78elapsed 240%CPU (0avgtext+0avgdata 18536maxresident)k

@mr-c

mr-c commented Dec 15, 2014

Copy link
Copy Markdown
Contributor Author

(this are only 6 real cores on Athyra)
Looks like the mutex handles the higher load better. Both have disappointing scaling with more threads.

@camillescott

Copy link
Copy Markdown
Member

Hrmm. This is... counterintuitive. What was the test?

@mr-c

mr-c commented Dec 15, 2014

Copy link
Copy Markdown
Contributor Author

Oh, that got cut off.

for threads in {1..12}
        do echo Threads: ${threads}
        sudo su -c "echo 3 > /proc/sys/vm/drop_caches"
        /usr/bin/time load-into-counting.py --threads ${threads} \
                --report-total-kmers /dev/null \
                /home/mcrusoe/data/kalamazoo/subreads-A-R1.fastq;
done |& tee load-log-spinlock

@mr-c

mr-c commented Dec 15, 2014

Copy link
Copy Markdown
Contributor Author

I also tried a better spinlock in the style of https://github.com/majek/dump/blob/master/msqueue/pthread_spin_lock_shim.h and got no improvement

@ctb

ctb commented Dec 15, 2014

Copy link
Copy Markdown
Member

@camillescott - your threading work in #655 renders all of this moot anyway, right? So even if it's not as fast as it could be now, it will be much faster soon?

@mr-c

mr-c commented Dec 15, 2014

Copy link
Copy Markdown
Contributor Author

Heh, I tried to replicate the speed test with v1.2 and I get our old friend: "IOError: InvalidFASTQFileFormat: sequence and quality scores length mismatch"

@mr-c

mr-c commented Dec 15, 2014

Copy link
Copy Markdown
Contributor Author

Same fail for ecoli_ref-5m.fastq & the old readparse. Good grief, did this ever work?

@mr-c

mr-c commented Dec 15, 2014

Copy link
Copy Markdown
Contributor Author

and load-graph.py doesn't actually do multithreaded reading despite having an argument for that. Explains why more users didn't run into the read dropping/FASTQ parsing problems.

@mr-c

mr-c commented Dec 15, 2014

Copy link
Copy Markdown
Contributor Author

Whoops, I was reading the output from /usr/bin/time wrong (the 3rd number is walltime). The spinlocks are indeed faster.

To cross check with v1.2 here is a different benchmark using a FASTQ file that the old readparser can handle in multithreaded mode (data/25k-casava1_8.fq.bz1)

Summary:

  • ReadParser shows a surprising amount of variability with the number of unique k-mers
  • Seqan+mutex is faster than ReadParser for two or more threads. Peak mutex performance is at four threads (though 12 (hyper)threads does a smidge better)
  • But Seqan+spinlocks are even faster than mutex or ReadParser for two or more threads. Peak performance is at 5 threads (Athyra is a 6 core system)

I ran these on OS X as well

  • ReadParser's best performance is with two threads (unsurprising given that iMac has two cores). There is a similar variability issue with the number of unique k-mers reported under Linux
  • Seqan+mutex AND Seqan+spinlocks are slower than ReadParser under OS X for all numbers of threads (0.43X - 2.30X slower)

Since OS X isn't the primary platform I am leaning towards merging this with spinlocks and revisiting the performance issue later.

Data and scripts @ https://gist.github.com/mr-c/106bfca65d1f385152d2

@ctb

ctb commented Dec 15, 2014

Copy link
Copy Markdown
Member

+1 - merge for cleanliness, revisit later.

mr-c added a commit that referenced this pull request Dec 15, 2014
@mr-c mr-c merged commit ed97e8a into master Dec 15, 2014
@mr-c

mr-c commented Dec 15, 2014

Copy link
Copy Markdown
Contributor Author

Done!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants