Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create broad-leaved red-black tree version of FildeshKV #114

Open
6 of 8 tasks
Tracked by #24
grencez opened this issue Jan 29, 2023 · 9 comments
Open
6 of 8 tasks
Tracked by #24

Create broad-leaved red-black tree version of FildeshKV #114

grencez opened this issue Jan 29, 2023 · 9 comments
Assignees
Labels

Comments

@grencez
Copy link
Member

grencez commented Jan 29, 2023

The "broadleaf" tree can contain 2 elements in its leaf nodes. The keys have to be small enough for us to pack both into a single size_t. The actual 2nd key and value will occupy the left and right child node pointers, which leaves don't use by definition.

  • Create broadleaf red-black tree that supports insertion but not removal.
  • Fuzz test insertion code.
  • Benchmark. (It's not quite as fast.)
  • Test to verify 25% compression factor.
  • Test with varying key sizes. Must be unique to avoid actual byte comparisons, which would segfault.
  • Test to verify that all leaf nodes that can be fused have been.
  • Support removal.
  • Switch default FildeshKV to be the broadleaf red-black tree.
@grencez grencez added this to the v0.1.8 milestone Jan 29, 2023
@grencez grencez self-assigned this Jan 29, 2023
@grencez grencez mentioned this issue Jan 29, 2023
8 tasks
@grencez
Copy link
Member Author

grencez commented Jan 30, 2023

The tree is only 1/3 smaller in the best case (perfectly balanced, full leaves). The improvement does show up in benchmarks. I need to analyze a bit to be sure that tree rotations aren't breaking it too badly.

I have a feeling that we want new rules in the insertion rotations to allow for fusing. Otherwise, leaf nodes will only be doubled up when insertion happens in the right place.

@grencez
Copy link
Member Author

grencez commented Feb 1, 2023

Turns out that the tree is guaranteeing a 25% compression factor as long as keys are small enough and we don't delete any.

This probably comes from the fact that a full tree has 1 more leaf node than inner nodes. If that intuition is accurate, and the savings isn't related to how well balanced the tree is, then it can probably work for deletions too!

@grencez
Copy link
Member Author

grencez commented Feb 9, 2023

Hmm it doesn't actually guarantee 25% node reduction yet. But I think it could.

The proof would come from a structural invariant that all subtrees have a node/entry ratio less than 75%.

  • Let the structural invariant be that subtrees >25% fewer nodes than entries.
  • In other words, x nodes with n entries has x/n < 3/4.
  • More accurately written as 4x < 3n.
  • Let y and m describe another arbitrary subtree matching the structural invariant (4y < 3m).
  • Combining the subtrees would yield (1+x+y) nodes and (1+n+m) entries.
  • Need to show the combined subtrees match the structural invariant.
    • In other words, prove that 4*(1+x+y) < 3*(1+n+m).

This is Presburger arithmetic, so a proof by SMT solver is probably cleanest. Edit: Yep, SMT solver says this is correct.

@grencez
Copy link
Member Author

grencez commented Feb 9, 2023

Due to how we can actually combine entries into leaves of a binary tree, another part of the structural invariant is that the number of entries is greater than 1 and not equal to 4.

Practically, what that means is if we create a single-entry leaf via insertion, rotation, or removal, we have to consider restructuring some neighboring nodes too. In terms of red-black tree, I think the largest neighborhood is the last black node level. It probably makes sense to special-case a bunch:

5 entries in 3 nodes (3/5).
      *
     / \
    ## ##

6 entries in 4 nodes (4/6).
      *            *
     / \          / \
    ##  #        #  ##
         \        \
         ++       ++

7 entries in 5 nodes (5/7).
      *
     / \
    #   #
     \   \
     ++  ++

8 entries in 5 nodes (5/8).
      *            *
     / \          / \
    ##  #        #  ##
       / \      / \
      ++ ++    ++ ++

9 entries in 6 nodes (6/9).
     _*_          _*_
    /   \        /   \
   #     #      #     #
    \   / \    / \     \
    ++ ++ ++  ++ ++    ++

10 entries in 7 nodes (7/10).   AVOID THIS ONE.
     _*_          _*_
    /   \        /   \
   #     #      #     #
  / \   / \    / \   / \
 +  ++ ++ ++  ++ ++ +  ++

11 entries in 7 or 8 nodes (7/11).
     _*_
    /   \
   #     #
  / \   / \
 ++ ++ ++ ++

To maintain that structure, our adds and removes need to shift a few entries around, but those are also the cases where we don't have to bubble up & rotate, so it's not that bad.

These special cases have some good properties:

  • The "root" of the subtree can be red or black, so we can perform normal red-black tree reasoning of inner nodes starting from there.
  • The rightmost node in the subtree is a broad leaf, so we know that fusing elements of the subtree will always work.
  • Structure of half of the subtree can be can be mixed as long as we avoid the 10-entry case.
    • Useful for tree rotations.
  • Node colors hint at the position in the subtree.
    • Red leaves are bottom level.
    • Black leaves are middle level.
    • Black leaves never have a single-entry leaf sibling.

@grencez
Copy link
Member Author

grencez commented Feb 9, 2023

We can think of those in terms of the subtree halves:

  • 2 entries in 1 node (1/2).
  • 3 entries in 2 nodes (2/3).
  • 5 entries in 3 nodes (3/5). Not using this one. See next comment.
    2     3        5
    ##    #        #
           \      / \
           ++    ++ ++

Dealing with those 3 elementary subtrees seems easiest. Just remember that we have to coordinate with the sibling subtree for lots of operations.

@grencez
Copy link
Member Author

grencez commented Feb 11, 2023

Actually, we can just deal with the first 2 elementary subtrees. Excluding the 3rd restricts us to the {5,6,7}-entry cases above, which keeps all the nice properties and avoids the {8,9,10,11}-entry special cases. Seems like less shuffling too.

@grencez
Copy link
Member Author

grencez commented Feb 12, 2023

Insert rule:

  • If double-black...
    • Change to a black node with double-red on the right.
  • If double-red or different side of black node...
    • If other side has only a double-black...
      • Add new node there.
    • Else...
      • Add new subtree root and do rotations.

Deletion rule:

  • If double-red or single-black...
    • Steal and merge to double-black.
  • If double-black...
    • If other side has single-black...
      • Steal from that side and make it double-black. Shift over. Steal from own side.
      • Else...
        • Steal from sister subtree if possible. May have to restructure more.

Leaves that are too big to merge can follow the same rules.

  • Double-black case is single-black + single-red.
  • Double-red case is single-black + 2 red leaves.

Elementary subtree operations:

  • maybe_take(blacknode, side)
    • returns node index
  • maybe_give(blacknode, side, node)
    • returns bool

grencez added a commit that referenced this issue Feb 12, 2023
grencez added a commit that referenced this issue Feb 12, 2023
Only the SINGLE_LIST kind of FildeshKV cares at the moment,
but oversized keys will be very important for testing "broadleaf" trees.

This commit includes some other changes related to broadleaf trees
without including a broadleaf tree implementation itself.

Issue #114
@grencez
Copy link
Member Author

grencez commented Feb 12, 2023

Oh boy. The insertion cases are getting a little wild. To keep some amount of sanity, I'm enumerating them as separate functions. Other than node creation for the 7-entry -> 8-entry transformation, a dispatch function can just call the following special cases to do the hard part. Well... I guess it's pretty hard to call the right function too.

/*
 * Root cases are only for 1 entry.
 * insert_0_root_sblack(): Standard. Might become oversized.
 * insert_1_root_sblack(): Standard. Might become oversized.
 *
 * 2-entry subtree.
 * insert_0_d2_bblack(): Standard. Left.
 * insert_0_d2_sblack(): Oversized. Left.
 * insert_1_d2_bblack(): Standard. Middle.
 * insert_1_d2_sred(): Oversized. Middle.
 * insert_2_d2_bblack(): Standard. Right. May become oversized.
 * insert_2_d2_sred(): Oversized. Right.
 *
 * rshins_d2(): Insert on left. Shift to pop right.
 * lshins_d2(): Insert on right. Shift to pop left.
 *
 * 3-entry subtree.
 * lshins_0_d3_sblack(): Standard. Left. No-op.
 * lshins_0_d3_sred(): Oversized. Left. No-op.
 * rshins_0_d3_sblack(): Standard. Left. Shift everything.
 * rshins_0_d3_sred(): Oversized. Left. Shift everything.
 * lshins_1_d3_rred(): Standard. LMiddle.
 * lshins_1_d3_sred(): Oversized. LMiddle.
 * rshins_1_d3_rred(): Standard. LMiddle.
 * rshins_1_d3_sred(): Oversized. LMiddle.
 * lshins_2_d3_rred(): Standard. RMiddle.
 * lshins_2_d3_sred(): Oversized. RMiddle.
 * rshins_2_d3_rred(): Standard. RMiddle.
 * rshins_2_d3_sred(): Oversized. RMiddle.
 * lshins_3_d3_rred(): Standard. Right. May become oversized. Shift everything.
 * lshins_3_d3_sred(): Oversized. Right. Shift everything.
 * rshins_3_d3_rred(): Standard. Right. No-op.
 * rshins_3_d3_sred(): Oversized. Right. No-op.
 *
 * lsh_d3(): Create a d2 by shifting to pop left.
 * rsh_d3(): Create a d2 by shifting to pop right.
 */

grencez added a commit that referenced this issue Feb 26, 2023
The remove operation will come later.
It took a lot to get insert working.

Issue #114
grencez added a commit that referenced this issue Feb 26, 2023
grencez added a commit that referenced this issue Feb 26, 2023
@grencez
Copy link
Member Author

grencez commented Feb 26, 2023

See comments in src/lib/kv/brbtree.c for the final description of how insertion works. The layout I described in #114 (comment) is mostly accurate, but there are a few more special cases for small trees.

As for removal, I guess it'll just happen in reverse and leverage the generic red-black tree rebalancing operations. It'll be roughly the same amount of complexity, but should be easier to implement because we already have a model for how all these subtrees should work.

However, the implementation is slightly slower than normal red-black trees, so I'm pushing this to a later release where it can be paired with a hash table where space-savings matters more.

@grencez grencez modified the milestones: v0.1.8, v0.1.9 Feb 26, 2023
@grencez grencez removed this from the v0.1.9 milestone Oct 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant