Optimize LEB128 data reading #795
Conversation
Micro-optimize unsigned LEB128 reading by unrolling parts of the first loop to improve code generation for the statistically likely case of single-byte values. This improves performance by a small amount as per my experimentation:

Before:
test bench_parsing_debug_info ... bench: 1,854,730.00 ns/iter (+/- 2,237.98)
test bench_reading_leb128_unsigned_small ... bench: 649.41 ns/iter (+/- 3.82)
test bench_reading_leb128_unsigned_large ... bench: 203.58 ns/iter (+/- 1.51)

After:
test bench_parsing_debug_info ... bench: 1,809,890.00 ns/iter (+/- 2,453.82)
test bench_reading_leb128_unsigned_small ... bench: 607.84 ns/iter (+/- 0.63)
test bench_reading_leb128_unsigned_large ... bench: 198.43 ns/iter (+/- 0.20)

Adjust the u16 variant so that the first check is symmetrical, though this change does not appear to have any measurable performance impact.

Also add a test for a LEB128-encoded value that we want to make sure we don't regress on by parsing it differently.

Signed-off-by: Daniel Müller <deso@posteo.net>
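The unrolling the commit message describes can be sketched roughly as follows. This is an illustrative helper, not gimli's actual `read_unsigned` API; the slice-based signature, return type, and overflow handling are assumptions made for the sketch:

```rust
// Sketch of the optimization: peel the first byte out of the loop so the
// statistically common single-byte case needs no shift state or loop setup.
// Names and the Option-based error handling are illustrative, not gimli's API.
fn read_unsigned(buf: &[u8]) -> Option<(u64, usize)> {
    let first = *buf.first()?;
    // Fast path: continuation bit clear means the value fits in one byte.
    if first & 0x80 == 0 {
        return Some((u64::from(first), 1));
    }
    let mut result = u64::from(first & 0x7f);
    let mut shift = 7;
    for (i, &byte) in buf[1..].iter().enumerate() {
        // At shift 63 only one payload bit remains in a u64; anything larger
        // (including a further continuation bit) is an overflow.
        if shift == 63 && byte > 1 {
            return None;
        }
        result |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return Some((result, i + 2));
        }
        shift += 7;
    }
    None // ran out of input mid-value
}

fn main() {
    // Single-byte fast path and a classic multi-byte example.
    assert_eq!(read_unsigned(&[0x7f]), Some((127, 1)));
    assert_eq!(read_unsigned(&[0xe5, 0x8e, 0x26]), Some((624_485, 3)));
    // Truncated input.
    assert_eq!(read_unsigned(&[0x80]), None);
}
```

The point of the peeled first check is that for one-byte values the generated code is just a load, a test, and a return, with no loop-carried shift state.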
(force-pushed from 291361c to bf51236)
Have you benchmarked this on LEB128 integers obtained from actual debuginfo? They almost certainly have a different distribution from what rustc sees. And how does this compare outside of micro-benchmarks (e.g. reading debuginfo for an entire executable)?
They are heavily skewed towards single-byte values. See the lebs.txt attachment, which was created based on symbolization of an address in a sizable debug file.

$ cat /tmp/lebs.txt | sort -n | awk '{ for (i=1; i<=NF; i++) if ($i+0 > 255) print $i }' | wc -l
3633
$ wc -l lebs.txt
852168 lebs.txt

Benchmarking with these numbers specifically we see

As per my experiments it seems to be within noise levels. Most reads are from abbrevs, which are read as LEB128 u16, from what I recall.

Edit: Corrected sort logic.
This changes the behaviour, and now we incorrectly parse the following as 0:
&[0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 2][..],
We should add a regression test for this.
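The ten-byte sequence above encodes 2 << 63, which does not fit in a u64. A minimal sketch of why a naive decoder yields 0 for it, and what the fix checks (illustrative decoders written for this comment, not gimli's code):

```rust
// The overflowing input from the review comment: nine continuation bytes
// with zero payload, then a final byte of 2 landing at bit position 63.
const INPUT: &[u8] = &[0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 2];

// Buggy shape: shift-and-OR with no overflow check. At shift 63 the payload
// 2 becomes 2u64 << 63, whose set bit falls off the top, leaving 0.
fn naive(buf: &[u8]) -> u64 {
    let mut result = 0;
    let mut shift = 0;
    for &b in buf {
        result |= u64::from(b & 0x7f) << shift; // high bits silently lost
        if b & 0x80 == 0 {
            break;
        }
        shift += 7;
    }
    result
}

// Fixed shape: at shift 63 only payloads 0 and 1 still fit in a u64.
fn checked(buf: &[u8]) -> Option<u64> {
    let mut result = 0;
    let mut shift = 0;
    for &b in buf {
        if shift == 63 && b > 1 {
            return None; // payload would overflow u64
        }
        result |= u64::from(b & 0x7f) << shift;
        if b & 0x80 == 0 {
            return Some(result);
        }
        shift += 7;
    }
    None
}

fn main() {
    assert_eq!(naive(INPUT), 0); // the incorrect result flagged above
    assert_eq!(checked(INPUT), None); // correct: reject the overflow
    assert_eq!(checked(&[0x02]), Some(2));
}
```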
If we're going to change unsigned, then we should change signed too.
Does this show any change to the existing benchmarks that operate on real data?
Also, don't use …
I tried running the existing benchmark on my system, and this actually makes the most relevant one slower, which is surprising:
That is surprising indeed. I am not sure where that comes from. I compared the assembly but I am none the wiser, really. The only explanation I can muster would be that, because the compiler seems to generate more code (not sure if it unrolls more iterations or so), we may end up putting more pressure on the L1i cache, perhaps exaggerated by inlining. That could explain why microbenchmarks consistently show improvements, I'd think. In any event, performance counters aren't really conclusive, as the numbers seem to be all over the place and not really correlated with reported wall clock numbers, as far as I can tell.

I can't spend much more time on the investigation here, so I kept the main loop body as-is now and just moved the first check out of the loop. That does seem to improve performance, but it's not much anymore. Take it or leave it.
Why can't we optimize one without the other...?
(force-pushed from bf51236 to 3c56f86)
In the commit message for 3c56f86, are the After and Before labels wrong? This change looks good to me now. I see a performance improvement on my system for
We can, but I meant that if it is worth doing for one then it is likely to be worth doing for the other. However, looking into it, the sign extension does complicate things, so it's not so clear. For reference, rust-lang/rust#92604 is now the most relevant rustc PR here; it didn't make a similar change for signed, but makes no mention of considering it.
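The sign extension mentioned above is what makes a peeled single-byte fast path less clean for signed values: even a one-byte encoding needs the sign bit (0x40) inspected and propagated. A minimal sketch of signed LEB128 decoding, written for illustration (not gimli's API, and it omits strict rejection of over-long encodings):

```rust
// Illustrative signed LEB128 decoder. Bit 6 (0x40) of the final byte is the
// sign bit and must be extended across the remaining high bits of the i64.
fn read_signed(buf: &[u8]) -> Option<(i64, usize)> {
    let mut result: i64 = 0;
    let mut shift = 0u32;
    for (i, &byte) in buf.iter().enumerate() {
        result |= i64::from(byte & 0x7f) << shift;
        shift += 7;
        if byte & 0x80 == 0 {
            // Sign-extend if the last byte's sign bit is set and the value
            // does not already span the full i64.
            if shift < 64 && (byte & 0x40) != 0 {
                result |= -1i64 << shift;
            }
            return Some((result, i + 1));
        }
        if shift >= 64 {
            return None; // too many bytes for an i64
        }
    }
    None // truncated input
}

fn main() {
    assert_eq!(read_signed(&[0x7f]), Some((-1, 1)));
    assert_eq!(read_signed(&[0x3f]), Some((63, 1)));
    assert_eq!(read_signed(&[0x40]), Some((-64, 1)));
    // 64 needs two bytes: one byte would be read as negative (sign bit set).
    assert_eq!(read_signed(&[0xc0, 0x00]), Some((64, 2)));
}
```

Note how even the single-byte cases (127 vs. -1, 63 vs. -64) branch on the sign bit, which is why the unsigned fast path doesn't carry over directly.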
(force-pushed from 3c56f86 to 1bdf905)
Ehm, yes. I fixed that.
👍
Yeah, that was also the sticking point for me when I looked at it. If there are no further comments, feel free to merge.
Thanks!
As it turns out, the Rust compiler uses variable-length LEB128-encoded integers internally. It so happens that they spent a fair amount of effort micro-optimizing the decoding functionality [0] [1], as it's in the hot path.

With this change we replace our decoding routines with these optimized ones. As a result, we see a respectable speedup:
System #1:
System #2:
[0] rust-lang/rust#69050
[1] rust-lang/rust#69157
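The decoders referenced in [0] and [1] roughly take the following shape: a macro generates one function per integer width, with the first byte handled outside the loop. This is a simplified sketch, not a verbatim copy of rustc's or gimli's code, and it omits bounds and overflow handling for brevity:

```rust
// Sketch of the rustc-style unsigned LEB128 decoder: the macro stamps out
// one decoder per integer width, and the peeled first byte makes the common
// single-byte case a load, a test, and a return.
macro_rules! impl_read_unsigned_leb128 {
    ($fn_name:ident, $int_ty:ty) => {
        fn $fn_name(slice: &[u8], position: &mut usize) -> $int_ty {
            // Fast path: continuation bit clear on the first byte.
            let byte = slice[*position];
            *position += 1;
            if (byte & 0x80) == 0 {
                return byte as $int_ty;
            }
            let mut result = (byte & 0x7f) as $int_ty;
            let mut shift = 7;
            loop {
                let byte = slice[*position];
                *position += 1;
                if (byte & 0x80) == 0 {
                    result |= (byte as $int_ty) << shift;
                    return result;
                } else {
                    result |= ((byte & 0x7f) as $int_ty) << shift;
                }
                shift += 7;
            }
        }
    };
}

impl_read_unsigned_leb128!(read_u16_leb128, u16);
impl_read_unsigned_leb128!(read_u64_leb128, u64);

fn main() {
    let mut pos = 0;
    assert_eq!(read_u64_leb128(&[0xe5, 0x8e, 0x26], &mut pos), 624_485);
    assert_eq!(pos, 3);
    let mut pos = 0;
    assert_eq!(read_u16_leb128(&[0x7f], &mut pos), 127);
}
```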