add estimate_optimal_hash.py#621
Conversation
|
retest this please |
|
This is great! To re-iterate my verbal comments: lets move these functions into the khmer module/folder as their own file. They also need their variable names expanded to something more meaningful than single characters. Later someone can wrap these in helper functions for use in the scripts (see #347 (comment)) |
|
Looks great! My main "ask" is to have a few tests added as well - places where @qingpeng has calculated the right numbers without using this code (other code is fine :). I would suggest doing so with some fairly small data sets and numbers, so that we can include the tests with khmer. |
|
Added two scripts to estimate optimal arguments for hash tables. For smaller data set, HLL counter is not that faster than hashbits, but using less memory. Using hashbits still needs to set the argument for hash table beforehand, while using HLL does not need. The output contains a table for different general scenarios. example of output: number of unique k-mers: 8336792 If you have expected false positive rate to achieve: If you have expected memory to use: |
|
@qingpeng, is this something you'd like reviewed? If so, please copy/paste in the checklist here: http://khmer.readthedocs.org/en/latest/dev/coding-guidelines-and-review.html Also see sandbox requirements here: plus one that I'd like to add:
|
|
I'd love to! I will be more focused on integrating HLL counter into arguments optimization. |
|
retest this please, Jenkins. |
|
Can one of the admins verify this patch? |
|
ok to test |
Vulture of #621: "add estimate_optimal_hash.py"
|
See #1106. |
initial effort to fix #390
Now I wrote a separate script to calculate optimal parameters of hash tables. There are two functions inside - estimate_optimal_with_N_and_M(N,M) and estimate_optimal_with_N_and_f(N,f).
To calculate minimum required memory to get desired false positiver rate, I noticed the formula described in khmer-counting paper was not accurate, not wrong though. Because optimal number of hash tables - Z must be integer. Practically The corresponding size of hash tables can not be calculated by log_0.6185(f)*N/Z, as described in khmer-counting paper. That is just the theoretical calculation on the premise that Z(number of hash tables) can be any number(including non-integer).
As to integrating the functions into khmer.khmer_args, my concern is that the number of distinct k-mers is difficult for end-user to estimate beforehand. One potential improvement is that we can supply a table of number of distinct k-mers in some typical sample datasets to give the end-user an idea. Like the number of distinct k-mers in e.coli genome, in human genome, in human gut metagenome, etc.