add estimate_optimal_hash.py by qingpeng · Pull Request #621 · dib-lab/khmer

qingpeng · 2014-09-28T01:52:00Z

initial effort to fix #390

Now I wrote a separate script to calculate optimal parameters of hash tables. There are two functions inside - estimate_optimal_with_N_and_M(N,M) and estimate_optimal_with_N_and_f(N,f).

To calculate minimum required memory to get desired false positiver rate, I noticed the formula described in khmer-counting paper was not accurate, not wrong though. Because optimal number of hash tables - Z must be integer. Practically The corresponding size of hash tables can not be calculated by log_0.6185(f)*N/Z, as described in khmer-counting paper. That is just the theoretical calculation on the premise that Z(number of hash tables) can be any number(including non-integer).

As to integrating the functions into khmer.khmer_args, my concern is that the number of distinct k-mers is difficult for end-user to estimate beforehand. One potential improvement is that we can supply a table of number of distinct k-mers in some typical sample datasets to give the end-user an idea. Like the number of distinct k-mers in e.coli genome, in human genome, in human gut metagenome, etc.

mr-c · 2014-09-29T20:56:16Z

retest this please

mr-c · 2014-10-01T22:23:38Z

This is great! To re-iterate my verbal comments: lets move these functions into the khmer module/folder as their own file. They also need their variable names expanded to something more meaningful than single characters.

Later someone can wrap these in helper functions for use in the scripts (see #347 (comment))

ctb · 2014-10-02T10:27:56Z

Looks great! My main "ask" is to have a few tests added as well - places where @qingpeng has calculated the right numbers without using this code (other code is fine :). I would suggest doing so with some fairly small data sets and numbers, so that we can include the tests with khmer.

… tables

qingpeng · 2015-01-30T22:14:04Z

Added two scripts to estimate optimal arguments for hash tables.
One is by hashbits counter, the other is by HLL counter.

For smaller data set, HLL counter is not that faster than hashbits, but using less memory.

Using hashbits still needs to set the argument for hash table beforehand, while using HLL does not need.

The output contains a table for different general scenarios.
For specific scenario not includd in the table, after getting the number of unique k-mers, use estimate_optimal_hash.py to get the optimal arguments.

example of output:

number of unique k-mers: 8336792
false positive rate: 0.010

If you have expected false positive rate to achieve:
expected_fp number_hashtable(Z) size_hashtable(H) expected_memory_usage
0.100 3 1.336201e+07 4.008602e+07
0.200 2 1.406380e+07 2.812761e+07
0.300 1 2.337364e+07 2.337364e+07
0.400 1 1.632023e+07 1.632023e+07
0.500 1 1.202745e+07 1.202745e+07
0.600 1 9.098413e+06 9.098413e+06
0.700 1 6.924402e+06 6.924402e+06
0.800 1 5.179940e+06 5.179940e+06
0.900 1 3.620622e+06 3.620622e+06

If you have expected memory to use:
expected_memory_usage number_hashtable(Z) size_hashtable(H) expected_fp
9.999999e+08 83 1.204819e+07 0.000
5.000000e+09 415 1.204819e+07 0.000
1.000000e+10 831 1.203369e+07 0.000
2.000000e+10 1662 1.203369e+07 0.000
5.000000e+10 4157 1.202790e+07 0.000
9.999999e+10 8314 1.202790e+07 0.000
2.000000e+11 16628 1.202790e+07 0.000
3.000000e+11 24942 1.202790e+07 0.000
4.000000e+11 33257 1.202754e+07 0.000
5.000000e+11 41571 1.202762e+07 0.000
9.999999e+11 83143 1.202747e+07 0.000
2.000000e+12 166286 1.202747e+07 0.000
5.000000e+12 415715 1.202747e+07 0.000

ctb · 2015-02-25T15:04:56Z

@qingpeng, is this something you'd like reviewed? If so, please copy/paste in the checklist here: http://khmer.readthedocs.org/en/latest/dev/coding-guidelines-and-review.html

Also see sandbox requirements here:

http://khmer.readthedocs.org/en/latest/dev/scripts-and-sandbox.html#sandbox-script-requirements-and-suggestions

plus one that I'd like to add:

directly executable script names should be all lowercase (no CamelCase), and use '-' instead of '_'.

qingpeng · 2015-02-25T15:14:43Z

I'd love to! I will be more focused on integrating HLL counter into arguments optimization.

SensibleSalmon · 2015-06-19T14:06:03Z

retest this please, Jenkins.

ged-jenkins · 2015-06-19T14:06:04Z

Can one of the admins verify this patch?

SensibleSalmon · 2015-06-19T14:24:36Z

ok to test

Vulture of #621: "add estimate_optimal_hash.py"

ctb · 2015-06-28T17:34:49Z

See #1106.

add estimate_optimal_hash.py

bf4556c

mr-c added this to the 1.2+ milestone Sep 29, 2014

mr-c mentioned this pull request Oct 1, 2014

Choosing hash table size and numbers according to number of unique k-mers #347

Closed

Qingpeng Zhang added 3 commits January 29, 2015 17:16

Merge branch 'master' of https://github.com/ged-lab/khmer into fix/390

90d2f80

add optimal_args_hashbits(HLL) to estimate optimal arguments for hash…

99466e4

… tables

Merge branch 'master' of https://github.com/ged-lab/khmer into fix/390

25da284

mr-c modified the milestones: 1.4+, 1.4 May 13, 2015

SensibleSalmon mentioned this pull request Jun 19, 2015

Vulture of #621: "add estimate_optimal_hash.py" #1106

Merged

luizirber added a commit that referenced this pull request Jun 28, 2015

Merge pull request #1106 from dib-lab/fix/390

fdd8f7a

Vulture of #621: "add estimate_optimal_hash.py"

ctb closed this Jun 28, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add estimate_optimal_hash.py#621

add estimate_optimal_hash.py#621
qingpeng wants to merge 4 commits into
dib-lab:masterfrom
qingpeng:fix/390

qingpeng commented Sep 28, 2014

Uh oh!

mr-c commented Sep 29, 2014

Uh oh!

mr-c commented Oct 1, 2014

Uh oh!

ctb commented Oct 2, 2014

Uh oh!

qingpeng commented Jan 30, 2015

Uh oh!

ctb commented Feb 25, 2015

Uh oh!

qingpeng commented Feb 25, 2015

Uh oh!

SensibleSalmon commented Jun 19, 2015

Uh oh!

ged-jenkins commented Jun 19, 2015

Uh oh!

SensibleSalmon commented Jun 19, 2015

Uh oh!

ctb commented Jun 28, 2015

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

qingpeng commented Sep 28, 2014

Uh oh!

mr-c commented Sep 29, 2014

Uh oh!

mr-c commented Oct 1, 2014

Uh oh!

ctb commented Oct 2, 2014

Uh oh!

qingpeng commented Jan 30, 2015

Uh oh!

ctb commented Feb 25, 2015

Uh oh!

qingpeng commented Feb 25, 2015

Uh oh!

SensibleSalmon commented Jun 19, 2015

Uh oh!

ged-jenkins commented Jun 19, 2015

Uh oh!

SensibleSalmon commented Jun 19, 2015

Uh oh!

ctb commented Jun 28, 2015

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants