Skip to content

add estimate_optimal_hash.py#621

Closed
qingpeng wants to merge 4 commits into
dib-lab:masterfrom
qingpeng:fix/390
Closed

add estimate_optimal_hash.py#621
qingpeng wants to merge 4 commits into
dib-lab:masterfrom
qingpeng:fix/390

Conversation

@qingpeng

Copy link
Copy Markdown
Contributor

initial effort to fix #390

Now I wrote a separate script to calculate optimal parameters of hash tables. There are two functions inside - estimate_optimal_with_N_and_M(N,M) and estimate_optimal_with_N_and_f(N,f).

To calculate minimum required memory to get desired false positiver rate, I noticed the formula described in khmer-counting paper was not accurate, not wrong though. Because optimal number of hash tables - Z must be integer. Practically The corresponding size of hash tables can not be calculated by log_0.6185(f)*N/Z, as described in khmer-counting paper. That is just the theoretical calculation on the premise that Z(number of hash tables) can be any number(including non-integer).

As to integrating the functions into khmer.khmer_args, my concern is that the number of distinct k-mers is difficult for end-user to estimate beforehand. One potential improvement is that we can supply a table of number of distinct k-mers in some typical sample datasets to give the end-user an idea. Like the number of distinct k-mers in e.coli genome, in human genome, in human gut metagenome, etc.

@mr-c

mr-c commented Sep 29, 2014

Copy link
Copy Markdown
Contributor

retest this please

@mr-c

mr-c commented Oct 1, 2014

Copy link
Copy Markdown
Contributor

This is great! To re-iterate my verbal comments: lets move these functions into the khmer module/folder as their own file. They also need their variable names expanded to something more meaningful than single characters.

Later someone can wrap these in helper functions for use in the scripts (see #347 (comment))

@ctb

ctb commented Oct 2, 2014

Copy link
Copy Markdown
Member

Looks great! My main "ask" is to have a few tests added as well - places where @qingpeng has calculated the right numbers without using this code (other code is fine :). I would suggest doing so with some fairly small data sets and numbers, so that we can include the tests with khmer.

@qingpeng

Copy link
Copy Markdown
Contributor Author

Added two scripts to estimate optimal arguments for hash tables.
One is by hashbits counter, the other is by HLL counter.

For smaller data set, HLL counter is not that faster than hashbits, but using less memory.

Using hashbits still needs to set the argument for hash table beforehand, while using HLL does not need.

The output contains a table for different general scenarios.
For specific scenario not includd in the table, after getting the number of unique k-mers, use estimate_optimal_hash.py to get the optimal arguments.

example of output:

number of unique k-mers: 8336792
false positive rate: 0.010

If you have expected false positive rate to achieve:
expected_fp number_hashtable(Z) size_hashtable(H) expected_memory_usage
0.100 3 1.336201e+07 4.008602e+07
0.200 2 1.406380e+07 2.812761e+07
0.300 1 2.337364e+07 2.337364e+07
0.400 1 1.632023e+07 1.632023e+07
0.500 1 1.202745e+07 1.202745e+07
0.600 1 9.098413e+06 9.098413e+06
0.700 1 6.924402e+06 6.924402e+06
0.800 1 5.179940e+06 5.179940e+06
0.900 1 3.620622e+06 3.620622e+06

If you have expected memory to use:
expected_memory_usage number_hashtable(Z) size_hashtable(H) expected_fp
9.999999e+08 83 1.204819e+07 0.000
5.000000e+09 415 1.204819e+07 0.000
1.000000e+10 831 1.203369e+07 0.000
2.000000e+10 1662 1.203369e+07 0.000
5.000000e+10 4157 1.202790e+07 0.000
9.999999e+10 8314 1.202790e+07 0.000
2.000000e+11 16628 1.202790e+07 0.000
3.000000e+11 24942 1.202790e+07 0.000
4.000000e+11 33257 1.202754e+07 0.000
5.000000e+11 41571 1.202762e+07 0.000
9.999999e+11 83143 1.202747e+07 0.000
2.000000e+12 166286 1.202747e+07 0.000
5.000000e+12 415715 1.202747e+07 0.000

@ctb

ctb commented Feb 25, 2015

Copy link
Copy Markdown
Member

@qingpeng, is this something you'd like reviewed? If so, please copy/paste in the checklist here: http://khmer.readthedocs.org/en/latest/dev/coding-guidelines-and-review.html

Also see sandbox requirements here:

http://khmer.readthedocs.org/en/latest/dev/scripts-and-sandbox.html#sandbox-script-requirements-and-suggestions

plus one that I'd like to add:

  • directly executable script names should be all lowercase (no CamelCase), and use '-' instead of '_'.

@qingpeng

Copy link
Copy Markdown
Contributor Author

I'd love to! I will be more focused on integrating HLL counter into arguments optimization.

@mr-c mr-c modified the milestones: 1.4+, 1.4 May 13, 2015
@SensibleSalmon

Copy link
Copy Markdown
Contributor

retest this please, Jenkins.

@ged-jenkins

Copy link
Copy Markdown

Can one of the admins verify this patch?

@SensibleSalmon

Copy link
Copy Markdown
Contributor

ok to test

luizirber added a commit that referenced this pull request Jun 28, 2015
Vulture of #621: "add estimate_optimal_hash.py"
@ctb

ctb commented Jun 28, 2015

Copy link
Copy Markdown
Member

See #1106.

@ctb ctb closed this Jun 28, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add implementation of optimal hash size calculation to khmer.khmer_args

5 participants