Perf improvements for floating point math#852
Perf improvements for floating point math#852isaacbrodsky merged 3 commits intouber:masterfrom heshpdx:master
Conversation
- Convert all the remaining "long double" literals to "double". - Define new literals for some inverse values, and use them to change divide operations into multiply operations, since that is generally faster for most CPUs.
dfellis
left a comment
There was a problem hiding this comment.
I am very surprised that modern C compilers aren't making these optimizations by default with the performance impact you mentioned, but very excited at improving the performance of key functions in H3. :)
|
I've ported the Edit: cannot repro with the benchmark of this repo either. Must be HW dependent then. |
|
I wasn't able to reproduce quite the reported performance improvements on Linux x64 w/ GCC, but I'm happy to retest on ARM later. edit: I see performance improving by more around 10~15% Before After |
|
@isaacbrodsky your benchmark does show an improvement on |
Sorry, I was imprecise. I did see performance improvements in many benchmarks, but more on the order of 10~15% rather than the 30% reported. |
|
The benefit is definitely microarchitecture specific based on how the FPU is implemented, and latency and throughput of individual operations. Also, most CPUs implement "early-out" divides, so if the computation is like {N/1, 0/N, N/N, N<<2, etc} then it doesn't incur the full latency (e.g. if unit tests have zero dividend there will be no perf benefit) . I just ran "make benchmarks" and pulled a few which looked significant: That's {1.4x, 2.7x, 1.5x}, as measured on my Ampere AltraMax. The 1.3x I cited was from our SPEC CPU input. Thanks for considering this PR. |
|
I get similar or even better (40% on cellToLatLng) performance improvements when I test on Linux ARM: Before After |
|
@heshpdx Thanks for improving the performance here! |
| // first do a reverse conversion | ||
| x2 = a2 / M_SIN60; | ||
| x2 = a2 * M_RSIN60; | ||
| x1 = a1 + x2 / 2.0; |
There was a problem hiding this comment.
I just spotted this. I'm not sure if it matters since powers of two are quick anyway, but I figured I would document that we could change it to x1 = a1 + x2 * 0.5;
Or since this is the only usage of M_RSIN60, just craft a M_RSIN60_DIV_BY_2
There was a problem hiding this comment.
I confirmed that the same assembly is produced.
There was a problem hiding this comment.
If the same assembly is produced it sounds like it's fine to leave as-is because compiler optimizations take care of it for us?
There was a problem hiding this comment.
Yes, any compiler at -O1 or higher opt figures it out.
* Further performance improvements for FP math More FDIV->FMUL opportunities unlocked, following in the spirit of #852 * Formatting fix * Update src/h3lib/lib/localij.c Co-authored-by: Nick Rabinowitz <public@nickrabinowitz.com> * Add #905 to CHANGELOG.md * Save one fdiv and maybe a cosine --------- Co-authored-by: Nick Rabinowitz <public@nickrabinowitz.com>
This completes the work from #790, where we started the removal of "long double" types.
Additionally, there is a easy performance improvement opportunity through changing some FDIV's into FMUL's. In modern CPUs, divides usually takes 3 to 4 times as long to complete compared to multiply, so we can convert the high impact divide operations by defining literals where the inverse is pre-computed. Removing divides from loops has a big impact. I measured a 30% speedup in
cellToLatLngandcellToBoundaryon my machine. Please see what you can achieve on yours. Thank you!