first pass seqan impl#642
Conversation
|
This does pass Camille's new tests.test_read_parsers.test_with_multiple_threads_big from #641 (not included for brevity) Now to see if it is speedy. |
|
Implementation notes: Seqan has two variable names in conflict with defines To give thread safety a mutex was chosen for simplicity. @camillescott or @luizirber A basic test with filter-abund shows comparable speed to the existing code. At this time the data suggests that the Seqan code is more correct and its On Wed, Oct 29, 2014, 01:02 ged-jenkins notifications@github.com wrote:
|
|
I know this is a big PR, but it reduces the number of conditional branches by 635, the number of lines of code by 650 (not to mention all the code that was hiding behind preprocessor directives), and the number of classes by three.
Comments are welcome by all, especially @luizirber @camillescott and @ctb |
|
Note: the bulk of the added lines is @camillescott 's test file of 200,000 lines that caught the missing reads error. (Thanks again to @camillescott !) |
|
OK, so to try to summarize -- this is adding in the seqan library for sequence parsing; seqan adds no new dependencies and is license compatible; it fixes some rather serious issues with sequence parsing in our current code; it doesn't seem to introduce any performance regressions; so far, so good. Given our good test coverage, the magnitude of the changes to code don't concern me too much. However, two questions -- is the plan to put this in for 1.2? It seems like a big change. OTOH the bug that @camillescott found is pretty problematic too :) How do we check this against a really large-scale data set? We can at least be fairly certain that the old code didn't barf (e.g. memory leak) on billions of reads. We should ask @jiarong or @qingpeng to run it on one of their big files. And I guess a third question - was there a discussion somewhere about including seqan? At this point I'm inferring this is something that was discussed offline, which is fine for starters but maybe bears further examination? |
|
My vote is for a v1.2 or v1.3 release; the FASTQ multithreading bug is pretty bad and very long standing. I'm happy to do a larger run with whatever data. I think I'll do the acceptance test next and then @jiarong & @qingpeng 's big data. Coverity scan comes up good. Discussion was initially offline. I did the mockup to see if the performance was relatively okay and went from there. Other libraries are either license incompatible (GATB is Affero GPL); not up to the task (Heng Li's code was raw + zlib only, no bzip2); or not oriented for use as a library (SeqDB as suggested by @luizirber) See also: https://www.biostars.org/p/486/ Seqan already supports SAM/BAM, which will be useful for #523 |
|
Given the magnitude of these changes I would like a sooner merge then later, to make less unpleasant the merge of @luizirber and @camillescott 's works. |
|
On Thu, Oct 30, 2014 at 03:57:21AM -0700, Michael R. Crusoe wrote:
Agreed.
OK.
Great!
OK. Would appreciate a greater emphasis on making use of the issue tracker for In this case, could you create an issue with the above paragraph and link it to thanks, --titusC. Titus Brown, ctb@msu.edu |
|
On Thu, Oct 30, 2014 at 04:05:29AM -0700, Michael R. Crusoe wrote:
We need to make damned sure we're not buying ourselves trouble with seqan For the large file test, would suggest bugging @qingpeng for the soil data --tC. Titus Brown, ctb@msu.edu |
|
On Thu Oct 30 2014 at 7:06:12 AM C. Titus Brown notifications@github.com
Fair enough. The PR came about within 48 hours of in-person discussions.
Sure, done: see #643. |
|
On Thu Oct 30 2014 at 7:11:05 AM C. Titus Brown notifications@github.com
I agree wholeheartedly.
|
|
Results from acceptance test Dies at rsem step Trinity assembly looks a bit off But the mouse-like genes are found I'm repeating now with khmer v1.1 to see if this is due to version skew on Ubuntu 14.04 |
|
khmer v1.1 fails the acceptance test the same way. The assembly stats are pretty close to the seqan version and the same number of matches to mouse zinc transporter are found. Success? |
|
It looks like the ValueError comes from Note the '0.1' value for alpha. Probably matplotlib became more stringent on its inputs? Also, on the FAQ they use without quotes: |
|
On Thu, Oct 30, 2014 at 12:25:33PM -0700, Luiz Irber wrote:
Fixed. --tC. Titus Brown, ctb@msu.edu |
|
On Thu, Oct 30, 2014 at 11:12:26AM -0700, Michael R. Crusoe wrote:
Well, it's a small subset. So it should look "off" :) |
|
On Thu, Oct 30, 2014 at 12:20:55PM -0700, Michael R. Crusoe wrote:
success! --titus |
|
I cleaned up all the warnings I could find from seqan - see #646 |
|
|
Beginning an acceptance test with the full eel-pond data set while @qingpeng and I continue to search for the right data + workflow to stress test the seqan-backed ReadParser |
|
On Nov 2, 2014, at 11:38 PM, Michael R. Crusoe notifications@github.com wrote:
Let’s be sure to introduce ourselves to the project and verify this via e-mail, if/when we merge. I’m sure they’ll appreciate it :) |
|
In the interest of finding a previously-run 'very big data' analysis that used a multi-threaded ReadParser I've built on @bocajnotnef's list in #644 (comment) Multi-threaded ReadParser: Single-threaded ReadParser: Screed: None: |
|
BaTLab results are coming in: builds & tests pass on Fedora 19, Fedora 20, RedHat 7 so far |
|
to confirm: OS X 7 & 8. They need the OS X 9 specific compiler directive taken out. |
|
Turns out that |
|
I've copied the changes to SeqAn back to a full release copy and I ran their tests: |
|
@ctb Shall I merge or do we want to wait for a review from @camillescott and/or you? |
|
I'd like to take a quick look, thanks. Mostly to make sure I admire it :) |
|
Gimme an hour or two and I'll give it a final look as well. On Thu, Dec 11, 2014 at 11:54 AM, C. Titus Brown notifications@github.com
Camille Scott Department of Computer Science and Engineering |
|
Did a skim; LGTM. Much simplification in read_parsers ++. Only remaining concern is solid threadsafety testing, but I'm guessing this is well covered by @camillescott's work on #655. Merge? |
|
Thanks. Threadsafety is guarenteed by the mutex inside |
|
I can confirm that it is thread-safe. However, I suggest replacing the mutex with a spinlock. The pthread mutex requires a system call, which is very slow. A plain old atomic CAS (like is used for access to the bigcount map, for example) is much faster for a function being called millions of times. |
|
Other than that, I give it my stamp of approval, and I feel pretty confident after having used it with the multiprocessing work for a while now. |
|
Thanks @camillescott. Here are the results of a benchmark I did with a simple spinlock implementation 38a22a5 |
|
(this are only 6 real cores on Athyra) |
|
Hrmm. This is... counterintuitive. What was the test? |
|
Oh, that got cut off. for threads in {1..12}
do echo Threads: ${threads}
sudo su -c "echo 3 > /proc/sys/vm/drop_caches"
/usr/bin/time load-into-counting.py --threads ${threads} \
--report-total-kmers /dev/null \
/home/mcrusoe/data/kalamazoo/subreads-A-R1.fastq;
done |& tee load-log-spinlock |
|
I also tried a better spinlock in the style of https://github.com/majek/dump/blob/master/msqueue/pthread_spin_lock_shim.h and got no improvement |
|
@camillescott - your threading work in #655 renders all of this moot anyway, right? So even if it's not as fast as it could be now, it will be much faster soon? |
|
Heh, I tried to replicate the speed test with v1.2 and I get our old friend: "IOError: InvalidFASTQFileFormat: sequence and quality scores length mismatch" |
|
Same fail for ecoli_ref-5m.fastq & the old readparse. Good grief, did this ever work? |
|
and |
|
Whoops, I was reading the output from To cross check with v1.2 here is a different benchmark using a FASTQ file that the old readparser can handle in multithreaded mode ( Summary:
I ran these on OS X as well
Since OS X isn't the primary platform I am leaning towards merging this with spinlocks and revisiting the performance issue later. Data and scripts @ https://gist.github.com/mr-c/106bfca65d1f385152d2 |
|
+1 - merge for cleanliness, revisit later. |
|
Done! |
Not for review; some tests fail for tested specific implementation quirks, others fail as seqan is very forgiving