Conversation

@d-e-s-o (Contributor) commented Sep 10, 2025

As it turns out, the Rust compiler uses variable-length LEB128-encoded integers internally. It so happens that they spent a fair amount of effort micro-optimizing the decoding functionality [0] [1], as it's in the hot path.
With this change we replace our decoding routines with these optimized ones. As a result, we see a respectable speed-up:

System #1:

  Before:
    test bench_reading_leb128_unsigned  ... bench:  235.83 ns/iter (+/- 32.53)
  After:
    test bench_reading_leb128_unsigned  ... bench:  157.38 ns/iter (+/- 17.09)

System #2:

  Before:
    test bench_reading_leb128_unsigned  ... bench:  183.70 ns/iter (+/- 2.72)
  After:
    test bench_reading_leb128_unsigned  ... bench:  103.83 ns/iter (+/- 3.28)

[0] rust-lang/rust#69050
[1] rust-lang/rust#69157

Micro-optimize unsigned LEB128 reading by unrolling parts of the first
loop to improve code generation for the statistically likely case of
single-byte values. This improves performance by a small amount as per
my experimentation:

Before:
  test bench_parsing_debug_info            ... bench:  1,854,730.00 ns/iter (+/- 2,237.98)
  test bench_reading_leb128_unsigned_small ... bench:        649.41 ns/iter (+/- 3.82)
  test bench_reading_leb128_unsigned_large ... bench:        203.58 ns/iter (+/- 1.51)

After:
  test bench_parsing_debug_info            ... bench:  1,809,890.00 ns/iter (+/- 2,453.82)
  test bench_reading_leb128_unsigned_small ... bench:        607.84 ns/iter (+/- 0.63)
  test bench_reading_leb128_unsigned_large ... bench:        198.43 ns/iter (+/- 0.20)

Adjust the u16 reading in a way that makes the first check
symmetrical, though this change does not appear to have any measurable
performance impact. Also add a test for a LEB128-encoded value to make
sure we don't regress by parsing it differently.

Signed-off-by: Daniel Müller <deso@posteo.net>
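
For illustration, a minimal sketch of the shape of this optimization, assuming a plain byte-slice interface (the `read_uleb128` helper below is hypothetical; gimli's actual implementation operates on its `Reader` trait and returns its own error types). The check for the common single-byte case is hoisted out of the loop so the compiler can emit a short fast path:

```rust
/// Minimal sketch: decode an unsigned LEB128 value from a byte slice,
/// returning the value and the byte count consumed, or `None` on
/// overflow or truncated input. Hypothetical helper, not gimli's API.
fn read_uleb128(buf: &[u8]) -> Option<(u64, usize)> {
    // Fast path, peeled out of the loop: statistically most values
    // fit in a single byte (continuation bit 0x80 clear).
    let first = *buf.first()?;
    if first & 0x80 == 0 {
        return Some((u64::from(first), 1));
    }

    let mut result = u64::from(first & 0x7f);
    let mut shift = 7;
    for (i, &byte) in buf.iter().enumerate().skip(1) {
        // At shift 63 only a payload of 0 or 1 still fits in a u64;
        // any other payload bit or a set continuation bit overflows.
        if shift == 63 && byte & 0xfe != 0 {
            return None;
        }
        result |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return Some((result, i + 1));
        }
        shift += 7;
    }
    None // input ended in the middle of a value
}
```

The point of the peeled first iteration is that the common case returns without entering the loop at all, which is where the micro-benchmark gains come from.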
@d-e-s-o force-pushed the topic/optimize-leb128 branch from 291361c to bf51236 on September 10, 2025 at 16:27
@bjorn3 (Contributor) commented Sep 10, 2025

Have you benchmarked this on LEB128 integers obtained from actual debuginfo? They almost certainly have a different distribution to what rustc sees. And how does this compare outside of micro-benchmarks (eg reading debuginfo for an entire executable)?

@d-e-s-o (Contributor, Author) commented Sep 10, 2025

Have you benchmarked this on LEB128 integers obtained from actual debuginfo? They almost certainly have a different distribution to what rustc sees.

They are heavily skewed towards single-byte values. See the lebs.txt attachment, which was created by symbolizing an address in a sizable debug file.

$ cat /tmp/lebs.txt | sort -n | awk '{ for (i=1; i<=NF; i++) if ($i+0 > 255) print $i }' | wc -l
3633
$ wc -l lebs.txt
852168 lebs.txt

Benchmarking with these numbers specifically we see

Before:
  test bench_reading_leb128_live  ... bench:   3,620,225.00 ns/iter (+/- 126,073.05)
After:
  test bench_reading_leb128_live  ... bench:   2,885,162.70 ns/iter (+/- 80,306.12)

And how does this compare outside of micro-benchmarks (eg reading debuginfo for an entire executable)?

As per my experiments it seems to be within noise levels. Most reads are from abbrevs, which are read as LEB128 u16 values, from what I recall.

Edit: Corrected sort logic.

@philipc (Collaborator) left a comment

This changes the behaviour, and now we incorrectly parse the following as 0:

&[0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 2][..],

We should add a regression test for this.
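
For concreteness, a sketch of such a regression test, written against the hypothetical `read_uleb128` helper sketched earlier rather than gimli's actual `Reader` API. The final byte's payload starts at bit position 63, so the 2 contributes a bit at position 64 and the decoder must report overflow instead of returning 0:

```rust
#[test]
fn uleb128_overflow_is_not_parsed_as_zero() {
    // Nine continuation bytes with zero payload, then a final byte of
    // 2: its payload bit would land at position 64, which does not fit
    // in a u64. This must be rejected, not decoded as 0.
    let input = [0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 2];
    assert_eq!(read_uleb128(&input), None);
}
```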

If we're going to change unsigned, then we should change signed too.

Does this show any change to the existing benchmarks that operate on real data?

@philipc (Collaborator) commented Sep 11, 2025

Also, don't use # for numbers in the commit message and description, since GitHub likes to convert them to links.

@philipc (Collaborator) commented Sep 11, 2025

I tried running the existing benchmark on my system, and this actually makes the most relevant one slower, which is surprising:

Before:
test bench_parsing_debug_info ... bench:   1,097,840.80 ns/iter (+/- 35,017.39)
After:
test bench_parsing_debug_info ... bench:   1,238,792.27 ns/iter (+/- 23,549.80)

@d-e-s-o (Contributor, Author) commented Sep 15, 2025

I tried running the existing benchmark on my system, and this actually makes the most relevant one slower, which is surprising

That is surprising indeed. I am not sure where that comes from. I compared the assembly but I am none the wiser, really. The only explanation I can muster would be that, because the compiler seems to generate more code (not sure if it unrolls more iterations or so), we may end up putting more pressure on the L1i cache, perhaps exaggerated by inlining. That could explain why microbenchmarks consistently show improvements, I'd think. In any event, performance counters aren't really conclusive, as numbers seem to be all over the place and not really correlated with reported wall clock numbers, as far as I can tell.

I can't spend much more time on the investigation here, so I kept the main loop body as-is now and just moved the first check out of the loop. That does seem to improve performance, but it's not much anymore. Take it or leave it.

If we're going to change unsigned, then we should change signed too.

Why can't we optimize one without the other...?

@d-e-s-o force-pushed the topic/optimize-leb128 branch from bf51236 to 3c56f86 on September 15, 2025 at 20:48
@philipc (Collaborator) commented Sep 16, 2025

In the commit message for 3c56f86, are the After and Before labels wrong?

This change looks good to me now. I see a performance improvement on my system for bench_parsing_debug_info, and the addr2line benchmarks also show an improvement:

context_new             time:   [488.19 µs 488.65 µs 489.15 µs]
                        change: [−2.5561% −1.7932% −1.1502%] (p = 0.00 < 0.05)
                        Performance has improved.

context_new_parse_lines time:   [5.1523 ms 5.1623 ms 5.1731 ms]
                        change: [−1.1673% −0.9021% −0.6303%] (p = 0.00 < 0.05)
                        Change within noise threshold.

context_new_parse_functions
                        time:   [7.5898 ms 7.5999 ms 7.6105 ms]
                        change: [−4.9737% −4.7767% −4.5780%] (p = 0.00 < 0.05)
                        Performance has improved.

context_new_parse_inlined_functions
                        time:   [26.468 ms 26.505 ms 26.544 ms]
                        change: [−2.5977% −2.2868% −2.0112%] (p = 0.00 < 0.05)
                        Performance has improved.

Why can't we optimize one without the other...?

We can, but I meant that if it is worth doing for one then it is likely to be worth doing for the other. However, looking into it, the sign extension does complicate things, so it's not so clear.

For reference, rust-lang/rust#92604 is now the most relevant rust PR here, and it didn't do a similar change for signed, but makes no mention of considering it.
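
To illustrate the complication mentioned above: in signed LEB128 the 0x40 bit of the final byte is the sign bit, so even a single-byte value may need sign extension, which makes a peeled one-byte fast path less of a clear win. A minimal sketch of a plain signed decoder, in the same style as the unsigned sketch earlier (hypothetical helper; strict validation of excess payload bits in a tenth byte is omitted for brevity):

```rust
/// Minimal sketch: decode a signed LEB128 value from a byte slice.
/// Hypothetical helper, not gimli's implementation.
fn read_sleb128(buf: &[u8]) -> Option<(i64, usize)> {
    let mut result: i64 = 0;
    let mut shift = 0u32;
    for (i, &byte) in buf.iter().enumerate() {
        if shift >= 64 {
            return None; // more than ten bytes: overflow
        }
        result |= i64::from(byte & 0x7f) << shift;
        shift += 7;
        if byte & 0x80 == 0 {
            // This is what complicates a single-byte fast path: if the
            // sign bit (0x40) of the final byte is set, the high bits
            // of the result must be filled with ones.
            if shift < 64 && byte & 0x40 != 0 {
                result |= -1i64 << shift;
            }
            return Some((result, i + 1));
        }
    }
    None // input ended in the middle of a value
}
```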

@d-e-s-o force-pushed the topic/optimize-leb128 branch from 3c56f86 to 1bdf905 on September 16, 2025 at 15:23
@d-e-s-o (Contributor, Author) commented Sep 16, 2025

In the commit message for 3c56f86, are the After and Before labels wrong?

Ehm, yes. I fixed that.

This change looks good to me now. I see a performance improvement on my system for bench_parsing_debug_info, and the addr2line benchmarks also show an improvement:

👍

Why can't we optimize one without the other...?

We can, but I meant that if it is worth doing for one then it is likely to be worth doing for the other. However, looking into it, the sign extension does complicate things, so it's not so clear.

Yeah, that was also the sticking point for me when I looked at it.

If there are no further comments, feel free to merge.

@philipc (Collaborator) left a comment

Thanks!

@philipc merged commit 6e090ae into gimli-rs:master on Sep 17, 2025
20 checks passed
@d-e-s-o deleted the topic/optimize-leb128 branch on September 17, 2025 at 03:59