Perf improvements for floating point math by heshpdx · Pull Request #852 · uber/h3

heshpdx · 2024-07-13T02:18:06Z

This completes the work from #790, where we started the removal of "long double" types.

Additionally, there is an easy performance improvement opportunity in changing some FDIVs into FMULs. On modern CPUs, a divide usually takes 3 to 4 times as long to complete as a multiply, so we can convert the high-impact divide operations by defining literals where the inverse is pre-computed. Removing divides from loops has a big impact. I measured a 30% speedup in cellToLatLng and cellToBoundary on my machine. Please see what you can achieve on yours. Thank you!

- Convert all the remaining "long double" literals to "double".
- Define new literals for some inverse values, and use them to change divide operations into multiply operations, since that is generally faster for most CPUs.
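As a minimal sketch of the divide-to-multiply rewrite described in this PR (not the actual diff): the names `EX_SIN60`, `EX_RSIN60`, `scale_div`, `scale_mul`, and `max_rel_diff` are hypothetical and exist only for illustration. The idea is that a constant divisor gets a companion literal holding its pre-computed inverse, and the two forms should agree to within a few units in the last place.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical constants for illustration; the PR defines its own literals. */
#define EX_SIN60 0.8660254037844386  /* sin(60 degrees) = sqrt(3)/2 */
#define EX_RSIN60 1.1547005383792515 /* pre-computed 1 / sin(60 degrees) */

/* Original form: one floating-point divide per call. */
double scale_div(double a) { return a / EX_SIN60; }

/* Rewritten form: one floating-point multiply per call. */
double scale_mul(double a) { return a * EX_RSIN60; }

/* Largest relative difference between the two forms over n sample inputs. */
double max_rel_diff(int n) {
    double worst = 0.0;
    for (int i = 1; i <= n; i++) {
        double a = 0.37 * i; /* arbitrary non-trivial sample values */
        double d = scale_div(a);
        double m = scale_mul(a);
        double rel = fabs(d - m) / fabs(d);
        if (rel > worst) worst = rel;
    }
    assert(worst < 1e-15); /* the two forms differ by rounding only */
    return worst;
}
```

The results are not guaranteed to be bit-identical, only very close, which is worth keeping in mind wherever exact reproducibility of outputs matters.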

CLAassistant · 2024-07-13T02:18:11Z

All committers have signed the CLA.

dfellis

I am very surprised that modern C compilers aren't making these optimizations by default with the performance impact you mentioned, but very excited at improving the performance of key functions in H3. :)
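One likely reason compilers leave these divides alone, assuming strict IEEE 754 semantics (that is, without options such as GCC's `-ffast-math` or `-freciprocal-math`): `x / c` and `x * (1.0 / c)` can round to different doubles when `1.0 / c` is not exactly representable, so the substitution would change results. The sketch below counts such mismatches; `count_mismatches` is a hypothetical helper, not part of H3.

```c
#include <assert.h>

/* Count inputs x in 1..n where x / c and x * (1.0 / c) give different doubles. */
int count_mismatches(double c, int n) {
    assert(c != 0.0);
    const double inv = 1.0 / c; /* rounded once, like a pre-computed literal */
    int mismatches = 0;
    for (int i = 1; i <= n; i++) {
        double x = (double)i;
        if (x / c != x * inv) mismatches++;
    }
    return mismatches;
}
```

With `c = 3.0` at least one input in `1..100` differs (for example `x = 5.0`), whereas with `c = 2.0` the count is zero because `0.5` is exact.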

coveralls · 2024-07-13T11:55:59Z

coverage: 98.826%. remained the same
when pulling e570b03 on heshpdx:master
into ecc0d25 on uber:master.

src/h3lib/include/constants.h

grim7reaper · 2024-07-13T17:39:15Z

I've ported the use-mul-instead-of-div changes to h3o because the 30% speedup was very attractive, but I haven't seen any noticeable performance improvement.
Maybe M1 CPUs already have fast division, or LLVM is already doing this optimization under the hood for Rust.

Edit: cannot repro with the benchmark of this repo either. Must be HW dependent then.
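To check how hardware-dependent the effect is on a given machine, a rough micro-benchmark along these lines could be used. This is only a sketch under stated assumptions: it is not the repository's benchmark harness, the names `EX_N`, `EX_DIVISOR`, `sum_div`, `sum_mul`, and `run_bench` are hypothetical, and a single `clock()` measurement is noisy, so the printed timings are indicative at best.

```c
#include <assert.h>
#include <math.h>
#include <stdio.h>
#include <time.h>

#define EX_N 100000
#define EX_DIVISOR 0.8660254037844386 /* arbitrary non-power-of-two divisor */

/* Accumulate x / d over n varying inputs (one divide per iteration). */
double sum_div(const double *xs, int n, double d) {
    double acc = 0.0;
    for (int i = 0; i < n; i++) acc += xs[i] / d;
    return acc;
}

/* Same loop with the divide replaced by a multiply by a pre-computed inverse. */
double sum_mul(const double *xs, int n, double inv_d) {
    double acc = 0.0;
    for (int i = 0; i < n; i++) acc += xs[i] * inv_d;
    return acc;
}

/* Time both loops, print the timings, and return the relative difference of the sums. */
double run_bench(void) {
    static double xs[EX_N];
    for (int i = 0; i < EX_N; i++) xs[i] = 1.0 + 0.001 * i; /* varying operands */

    clock_t t0 = clock();
    double d = sum_div(xs, EX_N, EX_DIVISOR);
    clock_t t1 = clock();
    double m = sum_mul(xs, EX_N, 1.0 / EX_DIVISOR);
    clock_t t2 = clock();

    printf("div: %.6f s, mul: %.6f s\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC);
    assert(d > 0.0);
    return fabs(d - m) / fabs(d);
}
```

The returned value is a sanity check that both loops compute essentially the same sum; the timings themselves will vary by CPU, compiler, and optimization level.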

src/h3lib/lib/faceijk.c

isaacbrodsky · 2024-07-14T15:04:01Z

I wasn't able to reproduce quite the reported performance improvements on Linux x64 w/ GCC, but I'm happy to retest on ARM later.

Edit: I see performance improving by more like 10~15%.

Before

build-master-jul14$ make benchmarks
[  0%] Formatting sources
[  0%] Built target format
[ 27%] Built target h3
[ 36%] Built target benchmarkPolygon
	-- pointInsideGeoLoopSmall: 0.165765 microseconds per iteration (100000 iterations)
	-- pointInsideGeoLoopLarge: 1.832082 microseconds per iteration (100000 iterations)
	-- bboxFromGeoLoopSmall: 0.128193 microseconds per iteration (100000 iterations)
	-- bboxFromGeoLoopLarge: 1.945774 microseconds per iteration (100000 iterations)
[ 36%] Built target bench_benchmarkPolygon
[ 45%] Built target benchmarkH3Api
	-- latLngToCell: 2.400742 microseconds per iteration (10000 iterations)
	-- cellToLatLng: 1.018848 microseconds per iteration (10000 iterations)
	-- cellToBoundary: 5.000979 microseconds per iteration (10000 iterations)
[ 45%] Built target bench_benchmarkH3Api
[ 54%] Built target benchmarkGridDiskCells
	-- gridDisk10: 30.648170 microseconds per iteration (10000 iterations)
	-- gridDisk20: 116.188511 microseconds per iteration (10000 iterations)
	-- gridDisk30: 274.647540 microseconds per iteration (10000 iterations)
	-- gridDisk40: 441.203441 microseconds per iteration (10000 iterations)
	-- gridDiskPentagon10: 613.105132 microseconds per iteration (500 iterations)
	-- gridDiskPentagon20: 5084.334198 microseconds per iteration (500 iterations)
	-- gridDiskPentagon30: 17323.867540 microseconds per iteration (50 iterations)
	-- gridDiskPentagon40: 40797.638900 microseconds per iteration (10 iterations)
[ 54%] Built target bench_benchmarkGridDiskCells
[ 54%] Built target benchmarkGridPathCells
	-- gridPathCellsNear: 58.487380 microseconds per iteration (10000 iterations)
	-- gridPathCellsFar: 2616.719411 microseconds per iteration (1000 iterations)
[ 54%] Built target bench_benchmarkGridPathCells
[ 63%] Built target benchmarkDirectedEdge
	-- directedEdgeToBoundary: 14.005060 microseconds per iteration (10000 iterations)
[ 63%] Built target bench_benchmarkDirectedEdge
[ 72%] Built target benchmarkVertex
	-- cellToVertexes: 10.162646 microseconds per iteration (10000 iterations)
	-- cellToVertexesPent: 0.217632 microseconds per iteration (10000 iterations)
	-- cellToVertexesRing: 157.010829 microseconds per iteration (10000 iterations)
	-- cellToVertexesRingPent: 154.470410 microseconds per iteration (10000 iterations)
[ 72%] Built target bench_benchmarkVertex
[ 81%] Built target benchmarkIsValidCell
	-- pentagonChildren_2_8: 7074.462316 microseconds per iteration (1000 iterations)
	-- pentagonChildren_8_14: 8923.350511 microseconds per iteration (1000 iterations)
	-- pentagonChildren_8_14_null_2: 5023.494634 microseconds per iteration (1000 iterations)
	-- pentagonChildren_8_14_null_10: 8218.255006 microseconds per iteration (1000 iterations)
	-- pentagonChildren_8_14_null_100: 8942.472348 microseconds per iteration (1000 iterations)
[ 81%] Built target bench_benchmarkIsValidCell
[ 90%] Built target benchmarkCellsToLinkedMultiPolygon
	-- cellsToLinkedMultiPolygonRing2: 108.960790 microseconds per iteration (10000 iterations)
	-- cellsToLinkedMultiPolygonDonut: 38.634417 microseconds per iteration (10000 iterations)
	-- cellsToLinkedMultiPolygonNestedDonuts: 158.458785 microseconds per iteration (10000 iterations)
[ 90%] Built target bench_benchmarkCellsToLinkedMultiPolygon
[ 90%] Built target benchmarkCellToChildren
	-- cellToChildren1: 0.241202 microseconds per iteration (10000 iterations)
	-- cellToChildren2: 1.332053 microseconds per iteration (10000 iterations)
	-- cellToChildren3: 7.849704 microseconds per iteration (10000 iterations)
	-- cellToChildren4: 52.471268 microseconds per iteration (10000 iterations)
	-- cellToChildren5: 369.739713 microseconds per iteration (10000 iterations)
[ 90%] Built target bench_benchmarkCellToChildren
[100%] Built target benchmarkPolygonToCells
	-- polygonToCellsSF: 4029.634296 microseconds per iteration (500 iterations)
	-- polygonToCellsAlameda: 6255.191586 microseconds per iteration (500 iterations)
	-- polygonToCellsSouthernExpansion: 188593.924100 microseconds per iteration (10 iterations)
[100%] Built target bench_benchmarkPolygonToCells
[100%] Built target benchmarkPolygonToCellsExperimental
	-- polygonToCellsSF_Center: 2265.643132 microseconds per iteration (500 iterations)
	-- polygonToCellsSF_Full: 7476.944652 microseconds per iteration (500 iterations)
	-- polygonToCellsSF_Overlapping: 8589.903528 microseconds per iteration (500 iterations)
	-- polygonToCellsAlameda_Center: 5523.648154 microseconds per iteration (500 iterations)
	-- polygonToCellsAlameda_Full: 15981.319740 microseconds per iteration (500 iterations)
	-- polygonToCellsAlameda_Overlapping: 20323.545974 microseconds per iteration (500 iterations)
	-- polygonToCellsSouthernExpansion_Center: 116890.366500 microseconds per iteration (10 iterations)
	-- polygonToCellsSouthernExpansion_Full: 379016.690500 microseconds per iteration (10 iterations)
	-- polygonToCellsSouthernExpansion_Overlapping: 590245.006200 microseconds per iteration (10 iterations)
[100%] Built target bench_benchmarkPolygonToCellsExperimental
[100%] Built target benchmarks

After

build-branch-jul14$ make benchmarks
[  0%] Formatting sources
[  0%] Built target format
[ 27%] Built target h3
[ 36%] Built target benchmarkPolygon
	-- pointInsideGeoLoopSmall: 0.174684 microseconds per iteration (100000 iterations)
	-- pointInsideGeoLoopLarge: 1.706215 microseconds per iteration (100000 iterations)
	-- bboxFromGeoLoopSmall: 0.113044 microseconds per iteration (100000 iterations)
	-- bboxFromGeoLoopLarge: 1.853511 microseconds per iteration (100000 iterations)
[ 36%] Built target bench_benchmarkPolygon
[ 45%] Built target benchmarkH3Api
	-- latLngToCell: 2.095765 microseconds per iteration (10000 iterations)
	-- cellToLatLng: 1.015881 microseconds per iteration (10000 iterations)
	-- cellToBoundary: 4.406268 microseconds per iteration (10000 iterations)
[ 45%] Built target bench_benchmarkH3Api
[ 54%] Built target benchmarkGridDiskCells
	-- gridDisk10: 31.002723 microseconds per iteration (10000 iterations)
	-- gridDisk20: 115.963878 microseconds per iteration (10000 iterations)
	-- gridDisk30: 255.184783 microseconds per iteration (10000 iterations)
	-- gridDisk40: 446.646353 microseconds per iteration (10000 iterations)
	-- gridDiskPentagon10: 620.174954 microseconds per iteration (500 iterations)
	-- gridDiskPentagon20: 5127.692764 microseconds per iteration (500 iterations)
	-- gridDiskPentagon30: 17360.673460 microseconds per iteration (50 iterations)
	-- gridDiskPentagon40: 41154.405900 microseconds per iteration (10 iterations)
[ 54%] Built target bench_benchmarkGridDiskCells
[ 54%] Built target benchmarkGridPathCells
	-- gridPathCellsNear: 59.351578 microseconds per iteration (10000 iterations)
	-- gridPathCellsFar: 2677.547189 microseconds per iteration (1000 iterations)
[ 54%] Built target bench_benchmarkGridPathCells
[ 63%] Built target benchmarkDirectedEdge
	-- directedEdgeToBoundary: 14.106074 microseconds per iteration (10000 iterations)
[ 63%] Built target bench_benchmarkDirectedEdge
[ 72%] Built target benchmarkVertex
	-- cellToVertexes: 9.734607 microseconds per iteration (10000 iterations)
	-- cellToVertexesPent: 0.215882 microseconds per iteration (10000 iterations)
	-- cellToVertexesRing: 160.913600 microseconds per iteration (10000 iterations)
	-- cellToVertexesRingPent: 156.779922 microseconds per iteration (10000 iterations)
[ 72%] Built target bench_benchmarkVertex
[ 81%] Built target benchmarkIsValidCell
	-- pentagonChildren_2_8: 7027.019166 microseconds per iteration (1000 iterations)
	-- pentagonChildren_8_14: 8806.731603 microseconds per iteration (1000 iterations)
	-- pentagonChildren_8_14_null_2: 4965.449012 microseconds per iteration (1000 iterations)
	-- pentagonChildren_8_14_null_10: 8126.078029 microseconds per iteration (1000 iterations)
	-- pentagonChildren_8_14_null_100: 8706.736355 microseconds per iteration (1000 iterations)
[ 81%] Built target bench_benchmarkIsValidCell
[ 90%] Built target benchmarkCellsToLinkedMultiPolygon
	-- cellsToLinkedMultiPolygonRing2: 110.695771 microseconds per iteration (10000 iterations)
	-- cellsToLinkedMultiPolygonDonut: 39.187226 microseconds per iteration (10000 iterations)
	-- cellsToLinkedMultiPolygonNestedDonuts: 160.627655 microseconds per iteration (10000 iterations)
[ 90%] Built target bench_benchmarkCellsToLinkedMultiPolygon
[ 90%] Built target benchmarkCellToChildren
	-- cellToChildren1: 0.211110 microseconds per iteration (10000 iterations)
	-- cellToChildren2: 1.388388 microseconds per iteration (10000 iterations)
	-- cellToChildren3: 8.871911 microseconds per iteration (10000 iterations)
	-- cellToChildren4: 56.922808 microseconds per iteration (10000 iterations)
	-- cellToChildren5: 391.073105 microseconds per iteration (10000 iterations)
[ 90%] Built target bench_benchmarkCellToChildren
[100%] Built target benchmarkPolygonToCells
	-- polygonToCellsSF: 3899.409916 microseconds per iteration (500 iterations)
	-- polygonToCellsAlameda: 6277.127410 microseconds per iteration (500 iterations)
	-- polygonToCellsSouthernExpansion: 188710.784900 microseconds per iteration (10 iterations)
[100%] Built target bench_benchmarkPolygonToCells
[100%] Built target benchmarkPolygonToCellsExperimental
	-- polygonToCellsSF_Center: 2175.312946 microseconds per iteration (500 iterations)
	-- polygonToCellsSF_Full: 7408.483802 microseconds per iteration (500 iterations)
	-- polygonToCellsSF_Overlapping: 8448.251498 microseconds per iteration (500 iterations)
	-- polygonToCellsAlameda_Center: 5296.558980 microseconds per iteration (500 iterations)
	-- polygonToCellsAlameda_Full: 15343.415832 microseconds per iteration (500 iterations)
	-- polygonToCellsAlameda_Overlapping: 19566.347054 microseconds per iteration (500 iterations)
	-- polygonToCellsSouthernExpansion_Center: 113208.269200 microseconds per iteration (10 iterations)
	-- polygonToCellsSouthernExpansion_Full: 363013.989700 microseconds per iteration (10 iterations)
	-- polygonToCellsSouthernExpansion_Overlapping: 559297.645200 microseconds per iteration (10 iterations)
[100%] Built target bench_benchmarkPolygonToCellsExperimental
[100%] Built target benchmarks

dfellis · 2024-07-14T17:04:30Z

@isaacbrodsky your benchmark does show an improvement on latLngToCell from 2.4us to 2.1us. Assuming that's significant and reproducible, it's a 14% perf boost.

isaacbrodsky · 2024-07-14T17:12:08Z

> @isaacbrodsky your benchmark does show an improvement on latlngToCell from 2.4us to 2.1us. Assuming that's significant and reproducible, it's a 14% perf boost.

Sorry, I was imprecise. I did see performance improvements in many benchmarks, but more on the order of 10~15% rather than the 30% reported.

heshpdx · 2024-07-14T22:55:43Z

The benefit is definitely microarchitecture-specific: it depends on how the FPU is implemented and on the latency and throughput of individual operations. Also, most CPUs implement "early-out" divides, so if the computation is like {N/1, 0/N, N/N, N<<2, etc} then it doesn't incur the full latency (e.g. if unit tests have a zero dividend there will be no perf benefit). I just ran "make benchmarks" and pulled a few which looked significant:

old  -- latLngToCell: 2.366658 microseconds per iteration (10000 iterations)
new  -- latLngToCell: 1.635445 microseconds per iteration (10000 iterations)
    
old  -- cellToChildren1: 0.404193 microseconds per iteration (10000 iterations)
new  -- cellToChildren1: 0.147156 microseconds per iteration (10000 iterations)

old  -- cellToChildren2: 1.099871 microseconds per iteration (10000 iterations)
new  -- cellToChildren2: 0.750266 microseconds per iteration (10000 iterations)

That's {1.4x, 2.7x, 1.5x}, as measured on my Ampere AltraMax. The 1.3x I cited was from our SPEC CPU input. Thanks for considering this PR.
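For reference, the speedup factors quoted above are simply the old timing divided by the new one. A small sketch of that arithmetic follows; `speedup` and `report` are hypothetical helper names.

```c
#include <assert.h>
#include <stdio.h>

/* Speedup factor: how many times faster the new timing is than the old one. */
double speedup(double old_us, double new_us) {
    assert(new_us > 0.0);
    return old_us / new_us;
}

/* Print the speedup for one benchmark, rounded to one decimal place. */
void report(const char *name, double old_us, double new_us) {
    printf("%-16s %.1fx\n", name, speedup(old_us, new_us));
}
```

Calling `report("latLngToCell", 2.366658, 1.635445)`, `report("cellToChildren1", 0.404193, 0.147156)`, and `report("cellToChildren2", 1.099871, 0.750266)` prints 1.4x, 2.7x, and 1.5x respectively, matching the figures in the comment.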

isaacbrodsky · 2024-07-15T02:27:03Z

I get similar or even better (40% on cellToLatLng) performance improvements when I test on Linux ARM:

Before

~/oss/h3/build $ make benchmarks
[  0%] Formatting sources
[  0%] Built target format
[ 27%] Built target h3
[ 36%] Built target benchmarkPolygon
	-- pointInsideGeoLoopSmall: 0.237791 microseconds per iteration (100000 iterations)
	-- pointInsideGeoLoopLarge: 1.953805 microseconds per iteration (100000 iterations)
	-- bboxFromGeoLoopSmall: 0.221790 microseconds per iteration (100000 iterations)
	-- bboxFromGeoLoopLarge: 2.608292 microseconds per iteration (100000 iterations)
[ 36%] Built target bench_benchmarkPolygon
[ 45%] Built target benchmarkH3Api
	-- latLngToCell: 6.158289 microseconds per iteration (10000 iterations)
	-- cellToLatLng: 3.538159 microseconds per iteration (10000 iterations)
	-- cellToBoundary: 16.000204 microseconds per iteration (10000 iterations)
[ 45%] Built target bench_benchmarkH3Api
[ 54%] Built target benchmarkGridDiskCells
	-- gridDisk10: 46.712590 microseconds per iteration (10000 iterations)
	-- gridDisk20: 172.776119 microseconds per iteration (10000 iterations)
	-- gridDisk30: 379.537284 microseconds per iteration (10000 iterations)
	-- gridDisk40: 665.536855 microseconds per iteration (10000 iterations)
	-- gridDiskPentagon10: 974.917548 microseconds per iteration (500 iterations)
	-- gridDiskPentagon20: 7932.902812 microseconds per iteration (500 iterations)
	-- gridDiskPentagon30: 27031.574120 microseconds per iteration (50 iterations)
	-- gridDiskPentagon40: 65397.877600 microseconds per iteration (10 iterations)
[ 54%] Built target bench_benchmarkGridDiskCells
[ 54%] Built target benchmarkGridPathCells
	-- gridPathCellsNear: 67.016416 microseconds per iteration (10000 iterations)
	-- gridPathCellsFar: 3043.141366 microseconds per iteration (1000 iterations)
[ 54%] Built target bench_benchmarkGridPathCells
[ 63%] Built target benchmarkDirectedEdge
	-- directedEdgeToBoundary: 40.614495 microseconds per iteration (10000 iterations)
[ 63%] Built target bench_benchmarkDirectedEdge
[ 72%] Built target benchmarkVertex
	-- cellToVertexes: 13.928412 microseconds per iteration (10000 iterations)
	-- cellToVertexesPent: 0.383176 microseconds per iteration (10000 iterations)
	-- cellToVertexesRing: 216.126529 microseconds per iteration (10000 iterations)
	-- cellToVertexesRingPent: 224.302782 microseconds per iteration (10000 iterations)
[ 72%] Built target bench_benchmarkVertex
[ 81%] Built target benchmarkIsValidCell
	-- pentagonChildren_2_8: 13482.154379 microseconds per iteration (1000 iterations)
	-- pentagonChildren_8_14: 13888.525799 microseconds per iteration (1000 iterations)
	-- pentagonChildren_8_14_null_2: 7786.916335 microseconds per iteration (1000 iterations)
	-- pentagonChildren_8_14_null_10: 12766.925168 microseconds per iteration (1000 iterations)
	-- pentagonChildren_8_14_null_100: 13777.683675 microseconds per iteration (1000 iterations)
[ 81%] Built target bench_benchmarkIsValidCell
[ 90%] Built target benchmarkCellsToLinkedMultiPolygon
	-- cellsToLinkedMultiPolygonRing2: 423.303284 microseconds per iteration (10000 iterations)
	-- cellsToLinkedMultiPolygonDonut: 157.237177 microseconds per iteration (10000 iterations)
	-- cellsToLinkedMultiPolygonNestedDonuts: 625.338030 microseconds per iteration (10000 iterations)
[ 90%] Built target bench_benchmarkCellsToLinkedMultiPolygon
[ 90%] Built target benchmarkCellToChildren
	-- cellToChildren1: 0.244395 microseconds per iteration (10000 iterations)
	-- cellToChildren2: 1.357393 microseconds per iteration (10000 iterations)
	-- cellToChildren3: 9.080074 microseconds per iteration (10000 iterations)
	-- cellToChildren4: 63.147554 microseconds per iteration (10000 iterations)
	-- cellToChildren5: 441.493719 microseconds per iteration (10000 iterations)
[ 90%] Built target bench_benchmarkCellToChildren
[100%] Built target benchmarkPolygonToCells
	-- polygonToCellsSF: 10539.029034 microseconds per iteration (500 iterations)
	-- polygonToCellsAlameda: 14892.152532 microseconds per iteration (500 iterations)
	-- polygonToCellsSouthernExpansion: 455600.007400 microseconds per iteration (10 iterations)
[100%] Built target bench_benchmarkPolygonToCells
[100%] Built target benchmarkPolygonToCellsExperimental
	-- polygonToCellsSF_Center: 7021.455078 microseconds per iteration (500 iterations)
	-- polygonToCellsSF_Full: 26996.973598 microseconds per iteration (500 iterations)
	-- polygonToCellsSF_Overlapping: 28265.139666 microseconds per iteration (500 iterations)
	-- polygonToCellsAlameda_Center: 13734.053836 microseconds per iteration (500 iterations)
	-- polygonToCellsAlameda_Full: 51138.265554 microseconds per iteration (500 iterations)
	-- polygonToCellsAlameda_Overlapping: 58866.366632 microseconds per iteration (500 iterations)
	-- polygonToCellsSouthernExpansion_Center: 304419.850000 microseconds per iteration (10 iterations)
	-- polygonToCellsSouthernExpansion_Full: 1275601.226200 microseconds per iteration (10 iterations)
	-- polygonToCellsSouthernExpansion_Overlapping: 1790633.328600 microseconds per iteration (10 iterations)
[100%] Built target bench_benchmarkPolygonToCellsExperimental
[100%] Built target benchmarks

After

~/oss/h3-copy/build $ make benchmarks
[  0%] Formatting sources
[  0%] Built target format
[ 27%] Built target h3
[ 36%] Built target benchmarkPolygon
	-- pointInsideGeoLoopSmall: 0.242731 microseconds per iteration (100000 iterations)
	-- pointInsideGeoLoopLarge: 1.989570 microseconds per iteration (100000 iterations)
	-- bboxFromGeoLoopSmall: 0.223000 microseconds per iteration (100000 iterations)
	-- bboxFromGeoLoopLarge: 2.658519 microseconds per iteration (100000 iterations)
[ 36%] Built target bench_benchmarkPolygon
[ 45%] Built target benchmarkH3Api
	-- latLngToCell: 3.780628 microseconds per iteration (10000 iterations)
	-- cellToLatLng: 2.141569 microseconds per iteration (10000 iterations)
	-- cellToBoundary: 10.879162 microseconds per iteration (10000 iterations)
[ 45%] Built target bench_benchmarkH3Api
[ 54%] Built target benchmarkGridDiskCells
	-- gridDisk10: 46.536392 microseconds per iteration (10000 iterations)
	-- gridDisk20: 173.230969 microseconds per iteration (10000 iterations)
	-- gridDisk30: 380.076526 microseconds per iteration (10000 iterations)
	-- gridDisk40: 666.374863 microseconds per iteration (10000 iterations)
	-- gridDiskPentagon10: 980.303592 microseconds per iteration (500 iterations)
	-- gridDiskPentagon20: 7948.988960 microseconds per iteration (500 iterations)
	-- gridDiskPentagon30: 27231.112900 microseconds per iteration (50 iterations)
	-- gridDiskPentagon40: 66191.866500 microseconds per iteration (10 iterations)
[ 54%] Built target bench_benchmarkGridDiskCells
[ 54%] Built target benchmarkGridPathCells
	-- gridPathCellsNear: 67.183286 microseconds per iteration (10000 iterations)
	-- gridPathCellsFar: 3054.412760 microseconds per iteration (1000 iterations)
[ 54%] Built target bench_benchmarkGridPathCells
[ 63%] Built target benchmarkDirectedEdge
	-- directedEdgeToBoundary: 30.176533 microseconds per iteration (10000 iterations)
[ 63%] Built target bench_benchmarkDirectedEdge
[ 72%] Built target benchmarkVertex
	-- cellToVertexes: 13.611636 microseconds per iteration (10000 iterations)
	-- cellToVertexesPent: 0.385624 microseconds per iteration (10000 iterations)
	-- cellToVertexesRing: 212.934427 microseconds per iteration (10000 iterations)
	-- cellToVertexesRingPent: 224.648723 microseconds per iteration (10000 iterations)
[ 72%] Built target bench_benchmarkVertex
[ 81%] Built target benchmarkIsValidCell
	-- pentagonChildren_2_8: 13472.980062 microseconds per iteration (1000 iterations)
	-- pentagonChildren_8_14: 13887.771011 microseconds per iteration (1000 iterations)
	-- pentagonChildren_8_14_null_2: 7781.522597 microseconds per iteration (1000 iterations)
	-- pentagonChildren_8_14_null_10: 12761.149156 microseconds per iteration (1000 iterations)
	-- pentagonChildren_8_14_null_100: 13773.922437 microseconds per iteration (1000 iterations)
[ 81%] Built target bench_benchmarkIsValidCell
[ 90%] Built target benchmarkCellsToLinkedMultiPolygon
	-- cellsToLinkedMultiPolygonRing2: 320.794363 microseconds per iteration (10000 iterations)
	-- cellsToLinkedMultiPolygonDonut: 124.114011 microseconds per iteration (10000 iterations)
	-- cellsToLinkedMultiPolygonNestedDonuts: 492.473339 microseconds per iteration (10000 iterations)
[ 90%] Built target bench_benchmarkCellsToLinkedMultiPolygon
[ 90%] Built target benchmarkCellToChildren
	-- cellToChildren1: 0.255307 microseconds per iteration (10000 iterations)
	-- cellToChildren2: 1.386753 microseconds per iteration (10000 iterations)
	-- cellToChildren3: 9.292348 microseconds per iteration (10000 iterations)
	-- cellToChildren4: 64.225439 microseconds per iteration (10000 iterations)
	-- cellToChildren5: 443.989882 microseconds per iteration (10000 iterations)
[ 90%] Built target bench_benchmarkCellToChildren
[100%] Built target benchmarkPolygonToCells
	-- polygonToCellsSF: 7519.098276 microseconds per iteration (500 iterations)
	-- polygonToCellsAlameda: 11145.530170 microseconds per iteration (500 iterations)
	-- polygonToCellsSouthernExpansion: 351837.750500 microseconds per iteration (10 iterations)
[100%] Built target bench_benchmarkPolygonToCells
[100%] Built target benchmarkPolygonToCellsExperimental
	-- polygonToCellsSF_Center: 4643.820966 microseconds per iteration (500 iterations)
	-- polygonToCellsSF_Full: 17948.688888 microseconds per iteration (500 iterations)
	-- polygonToCellsSF_Overlapping: 18913.791116 microseconds per iteration (500 iterations)
	-- polygonToCellsAlameda_Center: 9732.431998 microseconds per iteration (500 iterations)
	-- polygonToCellsAlameda_Full: 34826.282658 microseconds per iteration (500 iterations)
	-- polygonToCellsAlameda_Overlapping: 40562.522346 microseconds per iteration (500 iterations)
	-- polygonToCellsSouthernExpansion_Center: 209794.639100 microseconds per iteration (10 iterations)
	-- polygonToCellsSouthernExpansion_Full: 855543.199300 microseconds per iteration (10 iterations)
	-- polygonToCellsSouthernExpansion_Overlapping: 1222980.075300 microseconds per iteration (10 iterations)
[100%] Built target bench_benchmarkPolygonToCellsExperimental
[100%] Built target benchmarks

isaacbrodsky · 2024-07-15T02:27:41Z

@heshpdx Thanks for improving the performance here!

heshpdx · 2024-07-17T21:24:28Z

src/h3lib/lib/coordijk.c

    // first do a reverse conversion
-    x2 = a2 / M_SIN60;
+    x2 = a2 * M_RSIN60;
    x1 = a1 + x2 / 2.0;


I just spotted this. I'm not sure if it matters since powers of two are quick anyway, but I figured I would document that we could change it to x1 = a1 + x2 * 0.5;
Or, since this is the only usage of M_RSIN60, just craft an M_RSIN60_DIV_BY_2

heshpdx · 2024-07-17

I confirmed that the same assembly is produced.

isaacbrodsky · 2024-07-17

If the same assembly is produced it sounds like it's fine to leave as-is because compiler optimizations take care of it for us?

heshpdx · 2024-07-18

Yes, any compiler at -O1 or higher opt figures it out.
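The reason this case is safe for the compiler is that 0.5 is a power of two, so `x2 / 2.0` and `x2 * 0.5` are the same exactly rounded value and the substitution cannot change the result. A minimal sketch that checks this bit for bit is below; the function names are hypothetical and the sample inputs are arbitrary.

```c
#include <assert.h>
#include <string.h>

/* The expression as written in the diff above: divide by 2.0. */
double half_div(double a1, double x2) { return a1 + x2 / 2.0; }

/* The suggested alternative: multiply by 0.5. */
double half_mul(double a1, double x2) { return a1 + x2 * 0.5; }

/* Return 1 if both forms produce bit-identical doubles for the given inputs. */
int same_bits(double a1, double x2) {
    double d = half_div(a1, x2);
    double m = half_mul(a1, x2);
    return memcmp(&d, &m, sizeof d) == 0;
}

/* Check a range of sample inputs; aborts if any pair differs. */
void check_samples(void) {
    for (int i = -50; i <= 50; i++) {
        assert(same_bits(0.1 * i, 0.37 * i + 1e-3));
    }
}
```

This contrasts with a divisor such as sin(60 degrees), whose inverse is not exactly representable, which is why that rewrite has to be done by hand in the source.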

* Further performance improvements for FP math: more FDIV->FMUL opportunities unlocked, following in the spirit of #852
* Formatting fix
* Update src/h3lib/lib/localij.c
* Add #905 to CHANGELOG.md
* Save one fdiv and maybe a cosine

Co-authored-by: Nick Rabinowitz <public@nickrabinowitz.com>

Perf improvements for floating point math

555ca7e

dfellis approved these changes Jul 13, 2024

heshpdx commented Jul 13, 2024

src/h3lib/include/constants.h

Fix formatting in constants.h

dc62ea9

isaacbrodsky reviewed Jul 14, 2024

src/h3lib/lib/faceijk.c (outdated)

isaacbrodsky approved these changes Jul 14, 2024

Formatting fix

e570b03

isaacbrodsky merged commit a7845a7 into uber:master Jul 15, 2024

isaacbrodsky added a commit to isaacbrodsky/h3 that referenced this pull request Jul 15, 2024

add uber#852 to changelog

2c3ff14

isaacbrodsky mentioned this pull request Jul 15, 2024

add #852 to changelog #890

Merged

grim7reaper added a commit to HydroniumLabs/h3o that referenced this pull request Jul 15, 2024

coord: replace some FDIV by FMUL

07af804

Port of uber/h3#852

heshpdx commented Jul 17, 2024

cmuellner mentioned this pull request Jul 31, 2024

FP exception handler triggered on both x86 and aarch64 #891

Closed

isaacbrodsky added a commit that referenced this pull request Aug 25, 2024

add #852 to changelog (#890)

ebf4501

* add #852 to changelog * others

heshpdx mentioned this pull request Sep 9, 2024

Further performance improvements for FP math #905

Merged

iverase mentioned this pull request Sep 23, 2024

Small performance improvement in h3 library elastic/elasticsearch#113385

Merged
