@WoutLegiest
Collaborator

We can make the LUT CGGI pipeline more efficient by using 4-input LUTs instead of 3-input LUTs. Since each LUT produces only a single output bit, and the whole program consists only of 4-to-1 LUTs, we cannot overflow a 4-bit ciphertext.
With TFHE-rs as the backend, this leads to a more efficient use of the PBS (the impact on the Jaxite backend still needs to be checked).
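To make the "cannot overflow" argument concrete, here is a minimal cleartext sketch of a 4-to-1 LUT. It is not HEIR or TFHE-rs code: `eval_lut`, the truth-table packing, and the input bit order are assumptions chosen for illustration. It only shows that four input bits pack into an index in the range 0..15 (so the index fits a 4-bit message) and that the result is a single bit.

```python
def eval_lut(truth_table: int, inputs: list[int]) -> int:
    """Evaluate a k-input LUT stored as a 2**k-bit integer truth table.

    Assumed convention (illustrative only): the first input is the most
    significant bit of the lookup index, and bit i of `truth_table` is
    the output for index i.
    """
    index = 0
    for bit in inputs:
        if bit not in (0, 1):
            raise ValueError("inputs must be single bits")
        index = (index << 1) | bit
    # For k inputs the index is always in [0, 2**k - 1].
    assert 0 <= index < (1 << len(inputs))
    # The output is exactly one bit, whatever the LUT contents are.
    return (truth_table >> index) & 1


# 4-input AND: only index 15 (all inputs set) maps to 1.
AND4 = 1 << 15
# 4-input XOR (parity): bit i is the parity of i.
XOR4 = 0x6996

and_result = eval_lut(AND4, [1, 1, 1, 1])
xor_result = eval_lut(XOR4, [1, 0, 1, 1])
```

In the encrypted setting the packed index would live in the ciphertext's message space and the table lookup would be done by the PBS; the sketch only mirrors the arithmetic of the index and output widths.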

For now, would we like to support both lut3 and lut4, for the Jaxite and TFHE-rs backends respectively?

@j2kun
Collaborator

j2kun commented Dec 9, 2025

Lut4 by default sounds great, but please make the lut width configurable on the yosys optimizer sub-pipeline via a pass option. Maybe it makes sense to just extend the mode pass option to split LUT into LUT3 and LUT4?

IIUC if we can limit the yosys optimizer pipeline to generate luts of a given size by configuration, then the rest of the pipeline doesn't need any special configuration and can just lower whatever luts it sees/supports at that stage. I.e., you don't need to touch the jaxite emitter and just let it fail on lut4, and secret-to-cggi can have both lut3 and lut4 patterns added without conflict.
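A rough sketch of that design, in plain Python rather than MLIR: the LUT width is chosen once at the optimizer stage, and each later stage simply handles the LUT sizes it supports and fails on the rest. Everything here is hypothetical and for illustration only (`Lut`, `MODE_TO_LUT_WIDTH`, `optimizer_lut_width`, `Emitter`, and the backend names are not HEIR or jaxite APIs); only the `LUT3`/`LUT4` mode names come from the suggestion above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lut:
    arity: int        # number of input bits
    truth_table: int  # 2**arity-bit table


# Hypothetical mapping from an optimizer mode string to a LUT width.
MODE_TO_LUT_WIDTH = {"LUT3": 3, "LUT4": 4}


def optimizer_lut_width(mode: str) -> int:
    """Pick the LUT width from the (hypothetical) mode option."""
    if mode not in MODE_TO_LUT_WIDTH:
        raise ValueError(f"unknown mode: {mode!r}")
    return MODE_TO_LUT_WIDTH[mode]


class Emitter:
    """Hypothetical backend emitter that needs no width configuration."""

    def __init__(self, name: str, supported_arities: set[int]):
        self.name = name
        self.supported_arities = supported_arities

    def emit(self, lut: Lut) -> str:
        # Later stages just lower what they see and fail otherwise.
        if lut.arity not in self.supported_arities:
            raise NotImplementedError(
                f"{self.name} does not support lut{lut.arity}"
            )
        return f"{self.name}.lut{lut.arity}(0x{lut.truth_table:X})"


lut3_only = Emitter("backend_a", {3})
lut3_and_lut4 = Emitter("backend_b", {3, 4})
emitted = lut3_and_lut4.emit(Lut(optimizer_lut_width("LUT4"), 0x8000))
```

With this shape, a backend that lacks lut4 support needs no changes: it keeps working for `LUT3` output and raises a clear error if it is handed a lut4.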

I'm not too worried about having stuff in HEIR that jaxite_bool doesn't support (though adding support in jaxite_bool should be easy). But I know there are other groups (e.g., Cornami) that use the CGGI pipeline and may have limited support for certain LUT sizes.

@j2kun
Collaborator

j2kun commented Dec 9, 2025

Also, a long time ago when I was working on lut support in the yosys optimizer, I did some tests to see what was the best tradeoff of lut size vs runtime, and I recall lut5/lut6 being a net negative because you needed too large crypto parameters to support them, and the incremental reduction in circuit size didn't outweigh the added cost of the PBS. But lut3/lut4 were basically the same: lut4 took longer per PBS, but had a smaller circuit, and for the circuits I tried the runtime was similar.

At the time, my tfhe-rs CPU experiments put 3-bit PBS at 15.8ms, up to lut6 at 145ms, while circuit size reduction looked like this (for one example circuit), so 10x runtime for < 0.5x circuit reduction was pretty bad. I don't seem to have the exact lut4 numbers recorded from that experiment, though.

image
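As a back-of-the-envelope check of that tradeoff, the sketch below assumes a deliberately simple cost model: total runtime is the number of LUTs times the per-PBS latency, with no parallelism. The model and the function names are assumptions; the only measured inputs are the 15.8 ms and 145 ms figures quoted above (treating the 3-bit PBS figure as the lut3 cost), and the circuit sizes in the example are made up.

```python
PBS_MS_LUT3 = 15.8   # quoted above for the 3-bit PBS
PBS_MS_LUT6 = 145.0  # quoted above for lut6


def estimated_runtime_ms(num_luts: int, pbs_ms: float) -> float:
    """Assumed model: one sequential PBS per LUT, nothing else counted."""
    return num_luts * pbs_ms


def break_even_size_ratio(pbs_small_ms: float, pbs_large_ms: float) -> float:
    """Circuit-size ratio below which the wider LUT wins under this model."""
    return pbs_small_ms / pbs_large_ms


# Per-PBS slowdown of lut6 relative to lut3 (a bit over 9x).
slowdown = PBS_MS_LUT6 / PBS_MS_LUT3

# The lut6 circuit would need to shrink to roughly 11% of the lut3
# circuit's LUT count just to break even.
ratio = break_even_size_ratio(PBS_MS_LUT3, PBS_MS_LUT6)

# Illustrative sizes only: halving the circuit is nowhere near enough.
lut3_total = estimated_runtime_ms(1000, PBS_MS_LUT3)
lut6_total = estimated_runtime_ms(500, PBS_MS_LUT6)
```

Under this model a circuit reduction to about half the size still leaves the lut6 version several times slower, which matches the conclusion above. The same comparison for lut4 would need the lut4 PBS latency, which was not recorded.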

Collaborator

I'd recommend grepping the entire repo (including the .github/workflows) for instances of techmap.v to ensure that this new techmap file is included in all places.

For example, it's included in the release workflows like https://github.com/google/heir/blob/nightly/.github/workflows/nightly.yml
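One way to do that search, sketched in Python so that dot-directories such as `.github/workflows` are not skipped by accident. `find_mentions` is a hypothetical helper, not something in the repo, and the choice to skip only `.git` is an assumption.

```python
import os


def find_mentions(root: str, needle: str = "techmap.v"):
    """Return (relative_path, line_number, line) for each line containing needle."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip VCS internals, but keep other dot-directories like .github.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # errors="ignore" lets binary files pass through harmlessly.
                with open(path, encoding="utf-8", errors="ignore") as f:
                    for lineno, line in enumerate(f, start=1):
                        if needle in line:
                            rel = os.path.relpath(path, root)
                            hits.append((rel, lineno, line.strip()))
            except OSError:
                continue  # unreadable file; move on
    return sorted(hits)


if __name__ == "__main__":
    for rel, lineno, line in find_mentions("."):
        print(f"{rel}:{lineno}: {line}")
```

Run from the repository root, this lists every file and line that still references the old techmap file name, including workflow files.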

Collaborator Author

I've refactored the whole repo. Now, by default the LUT3 optimisation will be chosen.

@WoutLegiest WoutLegiest force-pushed the lut4 branch 3 times, most recently from b3aadc6 to f8d3a8a Compare December 19, 2025 13:54
Working 4lut impl

Updated tests

Update Emitter

Update Emitter

Working mode Lut3, Lut4 selector

Correct generation of LWE Types
@WoutLegiest
Collaborator Author

Ooh, very cool to see the past research! I just thought about the ability to run on FPGAs; it would be a 'nice to have' to write an appendix somewhere about the FPGA performance of the current CGGI pipelines (mostly to show how much improvement we can make in the future).

@WoutLegiest WoutLegiest marked this pull request as ready for review December 22, 2025 10:26