Skip to content

[NetKAT] Intern decision nodes via per-field unique tables keyed by node index#100

Draft
smolkaj wants to merge 1 commit into
google:mainfrom
smolkaj:per-field-unique-tables
Draft

[NetKAT] Intern decision nodes via per-field unique tables keyed by node index#100
smolkaj wants to merge 1 commit into
google:mainfrom
smolkaj:per-field-unique-tables

Conversation

@smolkaj

@smolkaj smolkaj commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

What

Cuts node-related memory roughly in half and reorganizes the unique tables (hash-cons tables) of PacketSetManager and PacketTransformerManager by packet field, as a first step toward level-based node storage (the standard organization in BDD packages, where it improves locality and enables variable reordering).

Before: packet_by_node_ / transformer_by_node_ were flat_hash_map<DecisionNode, Handle> — every interned node was stored twice: once in nodes_, once as the map key. For PacketTransformerManager the key copy was especially costly, since each node owns heap-allocated btree maps.

After: one flat_hash_set<uint32_t> per packet field, storing node indices; hashing and equality dereference into nodes_. Each node is stored exactly once. Per-field tables also keep probe sequences short and drop the field from hashing entirely (it's implied by the table).

Design notes

  • Interning probes before appending. NodeToPacket/NodeToTransformer look up the candidate node via heterogeneous lookup (is_transparent functors), so the common already-interned case does no copies and no allocations. Only genuinely new nodes touch nodes_.
  • Manager moves. The table functors must reference nodes_, which moves with the manager. They do so through a heap-allocated location slot owned by the manager; the (now manual) move operations repoint the slot. This keeps moves O(1) — the alternative of rebuilding tables on move would be O(#nodes).

Benchmarks

Flat overall: first-time compilation unchanged, recompile microbenchmarks +~50ns (~8%) from the added indirection in the unique tables — in exchange for the memory halving.

Benchmark main this PR
BM_FirstTimeCompileNonOverlappingPredicate 44.7µs 44.6µs
BM_ReCompileNonOverlappingPredicate 649ns 703ns
BM_FirstTimeCompileOverlappingPredicate 10.9µs 10.5µs
BM_ReCompileOverlappingPredicate 638ns 691ns
BM_FirstTimeCompileNonOverlappingPolicy 285.8µs 277.9µs
BM_ReCompileNonOverlappingPolicy 4.19µs 4.16µs
BM_FirstTimeCompileOverlappingPolicy 45.5µs 43.4µs
BM_ReCompileOverlappingPolicy 4.07µs 4.14µs

Testing

All 17 bazel test targets pass, including the golden diff tests (node numbering is unchanged by this PR) and the fuzz tests comparing against reference implementations.

Two related PRs: the arena page-size fix is split out as an independent one-line win, and the originally-stacked level-packed-handles work (#102) is deferred until operation memoization lands, per the scaling data on that PR.

🤖 Generated with Claude Code

…ode index.

Previously, the unique tables (packet_by_node_, transformer_by_node_) stored a
full copy of every decision node as the map key, duplicating all node storage.
This was especially costly for PacketTransformerManager, whose nodes own
heap-allocated btree maps. Now the tables store 4-byte node indices and
hash/compare by dereferencing into nodes_, so each node is stored exactly once.

Splitting the tables by packet field keeps each table - and thus its probe
sequences - small, and lays the groundwork for storing the nodes themselves by
field/level (standard practice in BDD packages, where it improves locality and
enables variable reordering). A follow-up change builds on this.

Implementation notes:
* Interning probes the table with the candidate node via heterogeneous lookup
  (no node is added to nodes_ unless it is new), keeping the common
  already-interned case copy- and allocation-free.
* The table functors reference nodes_ through a heap-allocated location slot
  so that they survive manager moves; move operations repoint the slot.

Benchmarks are flat overall: first-time compilation is unchanged, recompile
microbenchmarks regress by ~8% (~50ns) due to the indirection in the unique
tables, in exchange for halving node-related memory.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@google-cla

google-cla Bot commented Jun 10, 2026

Copy link
Copy Markdown

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@smolkaj

smolkaj commented Jun 10, 2026

Copy link
Copy Markdown
Contributor Author

Measured this PR stack on the new scale benchmark from #103 (realistically large inputs; medians of 3, -c opt, same machine, back-to-back):

Benchmark main #100 #102 (stack)
CompileForwardingTable/64 513µs 427µs 408µs (−20%)
CompileForwardingTable/512 34.4ms 27.9ms 27.7ms (−19%)
CompileForwardingTable/4096 2.98s 2.76s 2.72s (−9%)
CompileAclAsPacketSet/16 45.3µs 37.4µs 36.9µs (−19%)
CompileAclAsPacketSet/64 427µs 383µs 381µs (−11%)
CompileAclAsPacketSet/256 5.60ms 5.42ms 5.53ms (−1%)
CompileAclDifference/16 89.4µs 70.1µs 70.2µs (−21%)
CompileAclDifference/256 11.0ms 11.1ms 11.3ms (+2%)
CompileRingReachability/2 39.9µs 36.6µs 23.1µs (−42%)
CompileRingReachability/8 2.12ms 2.07ms 2.04ms (−4%)
CompileRingReachability/32 414ms 411ms 409ms (−1%)

Two takeaways:

  1. The stack helps consistently at small/medium scale (−10–42%), and [NetKAT] Intern decision nodes via per-field unique tables keyed by node index #100's node-copy elimination shows up much more here than in the micro-benchmarks — interning wide forwarding-table nodes previously copied a heap-backed branch array per duplicate probe.
  2. At the largest sizes the gains wash out (−1–9%): runtime is dominated by the known algorithmic issues (no operation memoization, b/382379263; map copies in the transformer combinators). That's the expected division of labor — this stack is the memory/locality/constant-factor layer; the asymptotic wins need the memoization work, which [NetKAT] Add scale benchmark with realistically large inputs #103 now makes measurable.

@smolkaj

smolkaj commented Jun 10, 2026

Copy link
Copy Markdown
Contributor Author

Folding this into #102 per offline discussion — the two changes are tightly coupled (the follow-up rewrites the regions this PR introduces), and GitHub can't express fork-based stacking cleanly, so #102 already showed the combined diff. #102 now contains both commits and the combined description.

@smolkaj smolkaj closed this Jun 10, 2026
@smolkaj

smolkaj commented Jun 10, 2026

Copy link
Copy Markdown
Contributor Author

Reopening: we measured the full stack (#102) on the scale benchmark (#103) and decided to land this layer separately after all — see the re-scope rationale on #102. This PR is unchanged and independently valuable (halves node-related memory, −7–20% on table-heavy compiles at scale).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant