Skip to content

Conversation

@charlesgregory
Copy link

Added khashl and made some changes to the benchmark and dub file.
Khashl is hash table that performs deletions without tombstones allowing it to have a smaller
memory footprint for some sacrifices in speed. It uses fibonacci hashing. Described here.

attractivechaos says this about khashl:

TL;DR: With linear probing, we can delete elements from an open addressing hash table without tombstones.

Deletion speed vs bytes per key

Deletion speed vs bytes per key

Interestingly, khashl is slower than my older khash.h. This may be caused by a combination of two effects. First, due to the presence of tombstones, khash.h has to double the number of buckets, resulting in fewer collisions. It implicitly trades memory for speed. Second, khashl may need to move multiple elements upon a deletion. Copying buckets can be slow. That said, the new deletion algorithm is only a little slower than khash.h and is faster than many other libraries for this particular task. It might also become faster than khash under a different payload (e.g. large batch of deletions).

Most notably:

In addition, khashl has simpler insertion. It is faster than khash even if no deletions are involved.

Considering the clear advantage in memory, I think the new deletion algorithm without tombstones is overall better than traditional algorithms. It should become the preferred way to implement hash tables.

Time, in msec, for n=500,000 operations benchmarked on WSL (Ryzen 1700); ldc2 -release

Operation HashMap khash khashl khashl (cached)
Insert 931 398 261 142
Retrieve (Serial) 14 1 18 --
Retrieve (Random) 50 25 36 --

There are no retrieval times for the cached version as that benchmark uses integer keys and my khashl template prevents that at compile time as there should be no real benefit for integers.

attractivechaos says this about hash-caching:

When we use long strings as keys, comparing two keys may take significant time. This comparison is often unnecessary. Note that the hash of a string is a good summary of the string. If two strings are different, their hashes are often different. We can cache the hash and only compare two keys when their hashes are equal.

charlesgregory and others added 4 commits April 21, 2020 19:04
Added more comments to khashl
Updated unittests to actually work for khashl
Added khashl to package.d
@jblachly
Copy link
Member

@charlesgregory How much of khashl is identical to khash? A bit it seems. Would it be sensible to factor out (e.g. the hash functions), or is it more important that we retain source-level homology for future updates?

@jblachly jblachly self-assigned this Aug 24, 2020
@jblachly jblachly added the enhancement New feature or request label Aug 24, 2020
dub.json Outdated
},
{
"name": "bench",
"targetType": "executable",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like what you are trying to do here but note the header of the _benchmark file; it is an embedded dub.sdl and can be run directly from cmdline like a script.

It was my mistake to put it under source/dklib. Please instead move it to a tests/ or examples/ (probably better) and document that it can be run from command line like script

@charlesgregory
Copy link
Author

@charlesgregory How much of khashl is identical to khash? A bit it seems. Would it be sensible to factor out (e.g. the hash functions), or is it more important that we retain source-level homology for future updates?

It appears that the uint and ulong hashing functions may be different for khash vs khashl. I think a factored-out pool of shared hashing functions is not out of the question. We can use aliases to keep decent source-level homology. Up to you though.
https://github.com/charlesgregory/dklib/blob/bca6747bbeb4e1ff55063118a14b834152bb9716/source/dklib/khashl.d#L413-L436
https://github.com/charlesgregory/dklib/blob/bca6747bbeb4e1ff55063118a14b834152bb9716/source/dklib/khash.d#L456-L472

@jblachly
Copy link
Member

Interesting, didn't notice that. I was also thinking of the _put and _resize functions

I am happy to accept the code as is , if you'll move the benchmark to examples/ and revert the dubfile

n8sh and others added 2 commits August 26, 2020 09:13
Added a build type to dub.json that can be used to test this:
  dub test --compiler=dmd --build=unittest-inline
Fix: khash with string key not compiling with dmd if inlining enabled (Nathan Sashihara)
@jblachly
Copy link
Member

@charlesgregory ping

@jblachly
Copy link
Member

@charlesgregory You'll also need to rebase on current master since we updated dubfile

@charlesgregory
Copy link
Author

charlesgregory commented Aug 26, 2020

As a side note I have found that the free statements in the dtors of both khash and khashl can cause SEGSEVs so in normal use I usually comment this out. I notice this most when making a hashmap of hashmaps.

kfree(cast(void*) this.keys);
kfree(cast(void*) this.flags);
kfree(cast(void*) this.vals);

Edit: I forgot I actually opened an issue for this #2

@charlesgregory
Copy link
Author

Did a rebase but I think I should've just done a merge. Either way I would do a squash merge to get rid of the commit history mess.

@jblachly
Copy link
Member

Also do you need to make the changes to khasl that @n8sh made to khash? (regarding inlineing, which I accepted, and parameter in qualifier which is in the pending PR) ?

@charlesgregory
Copy link
Author

I messed up the rebase/merge so we are closing this pull request and opening a new one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants