Optimize LEB128 data reading #795
Conversation
Micro-optimize unsigned LEB128 reading by unrolling parts of the first loop to improve code generation for the statistically likely case of single-byte values. This improves performance by a small amount as per my experimentation:

Before:
test bench_parsing_debug_info ... bench: 1,854,730.00 ns/iter (+/- 2,237.98)
test bench_reading_leb128_unsigned_small ... bench: 649.41 ns/iter (+/- 3.82)
test bench_reading_leb128_unsigned_large ... bench: 203.58 ns/iter (+/- 1.51)

After:
test bench_parsing_debug_info ... bench: 1,809,890.00 ns/iter (+/- 2,453.82)
test bench_reading_leb128_unsigned_small ... bench: 607.84 ns/iter (+/- 0.63)
test bench_reading_leb128_unsigned_large ... bench: 198.43 ns/iter (+/- 0.20)

Adjust the u16 variant so that the first check is symmetrical, though this change does not appear to have any measurable performance impact.

Also add a test for a LEB128-encoded value that we want to make sure we don't regress on by parsing it differently.

Signed-off-by: Daniel Müller <deso@posteo.net>
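The unrolling the commit message describes can be sketched roughly as follows. This is an illustrative helper, not gimli's actual `read_unsigned` API; the slice-based signature, return type, and overflow handling are assumptions made for the sketch:

```rust
// Sketch of the optimization: peel the first byte out of the loop so the
// statistically common single-byte case needs no shift state or loop setup.
// Names and the Option-based error handling are illustrative, not gimli's API.
fn read_unsigned(buf: &[u8]) -> Option<(u64, usize)> {
    let first = *buf.first()?;
    // Fast path: continuation bit clear means the value fits in one byte.
    if first & 0x80 == 0 {
        return Some((u64::from(first), 1));
    }
    let mut result = u64::from(first & 0x7f);
    let mut shift = 7;
    for (i, &byte) in buf[1..].iter().enumerate() {
        // At shift 63 only one payload bit remains in a u64; anything larger
        // (including a further continuation bit) is an overflow.
        if shift == 63 && byte > 1 {
            return None;
        }
        result |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return Some((result, i + 2));
        }
        shift += 7;
    }
    None // ran out of input mid-value
}

fn main() {
    // Single-byte fast path and a classic multi-byte example.
    assert_eq!(read_unsigned(&[0x7f]), Some((127, 1)));
    assert_eq!(read_unsigned(&[0xe5, 0x8e, 0x26]), Some((624_485, 3)));
    // Truncated input.
    assert_eq!(read_unsigned(&[0x80]), None);
}
```

The point of the peeled first check is that for one-byte values the generated code is just a load, a test, and a return, with no loop-carried shift state.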
(force-pushed from 291361c to bf51236)
Have you benchmarked this on LEB128 integers obtained from actual debuginfo? They almost certainly have a different distribution from what rustc sees. And how does this compare outside of micro-benchmarks (e.g. reading debuginfo for an entire executable)?
They are heavily skewed towards single-byte values. See the lebs.txt attachment, which was created based on symbolization of an address in a sizable debug file.

$ cat /tmp/lebs.txt | sort -n | awk '{ for (i=1; i<=NF; i++) if ($i+0 > 255) print $i }' | wc -l
3633
$ wc -l lebs.txt
852168 lebs.txt

Benchmarking with these numbers specifically we see

As per my experiments it seems to be within noise levels. Most reads are from abbrevs, which are read as LEB128 u16, from what I recall.

Edit: Corrected sort logic.
This changes the behaviour, and now we incorrectly parse the following as 0:
&[0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 2][..],
We should add a regression test for this.
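The ten-byte sequence above encodes 2 << 63, which does not fit in a u64. A minimal sketch of why a naive decoder yields 0 for it, and what the fix checks (illustrative decoders written for this comment, not gimli's code):

```rust
// The overflowing input from the review comment: nine continuation bytes
// with zero payload, then a final byte of 2 landing at bit position 63.
const INPUT: &[u8] = &[0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 2];

// Buggy shape: shift-and-OR with no overflow check. At shift 63 the payload
// 2 becomes 2u64 << 63, whose set bit falls off the top, leaving 0.
fn naive(buf: &[u8]) -> u64 {
    let mut result = 0;
    let mut shift = 0;
    for &b in buf {
        result |= u64::from(b & 0x7f) << shift; // high bits silently lost
        if b & 0x80 == 0 {
            break;
        }
        shift += 7;
    }
    result
}

// Fixed shape: at shift 63 only payloads 0 and 1 still fit in a u64.
fn checked(buf: &[u8]) -> Option<u64> {
    let mut result = 0;
    let mut shift = 0;
    for &b in buf {
        if shift == 63 && b > 1 {
            return None; // payload would overflow u64
        }
        result |= u64::from(b & 0x7f) << shift;
        if b & 0x80 == 0 {
            return Some(result);
        }
        shift += 7;
    }
    None
}

fn main() {
    assert_eq!(naive(INPUT), 0); // the incorrect result flagged above
    assert_eq!(checked(INPUT), None); // correct: reject the overflow
    assert_eq!(checked(&[0x02]), Some(2));
}
```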
If we're going to change unsigned, then we should change signed too.
Does this show any change to the existing benchmarks that operate on real data?
Also, don't use …
I tried running the existing benchmark on my system, and this actually makes the most relevant one slower, which is surprising:
That is surprising indeed. I am not sure where that comes from. I compared the assembly but I am none the wiser, really. The only explanation I can muster would be that, because the compiler seems to generate more code (not sure if it unrolls more iterations or so), we may end up putting more pressure on the L1i cache, perhaps exaggerated by inlining. That could explain why microbenchmarks consistently show improvements, I'd think. In any event, performance counters aren't really conclusive, as the numbers seem to be all over the place and not really correlated with reported wall clock numbers, as far as I can tell.

I can't spend much more time on the investigation here, so I kept the main loop body as-is now and just moved the first check out of the loop. That does seem to improve performance, but it's not much anymore. Take it or leave it.
Why can't we optimize one without the other...?
(force-pushed from bf51236 to 3c56f86)
In the commit message for 3c56f86, are the After and Before labels wrong? This change looks good to me now. I see a performance improvement on my system for
We can, but I meant that if it is worth doing for one then it is likely to be worth doing for the other. However, looking into it, the sign extension does complicate things, so it's not so clear. For reference, rust-lang/rust#92604 is now the most relevant rustc PR here; it didn't make a similar change for signed, but makes no mention of considering it.
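The sign extension mentioned above is what makes a peeled single-byte fast path less clean for signed values: even a one-byte encoding needs the sign bit (0x40) inspected and propagated. A minimal sketch of signed LEB128 decoding, written for illustration (not gimli's API, and it omits strict rejection of over-long encodings):

```rust
// Illustrative signed LEB128 decoder. Bit 6 (0x40) of the final byte is the
// sign bit and must be extended across the remaining high bits of the i64.
fn read_signed(buf: &[u8]) -> Option<(i64, usize)> {
    let mut result: i64 = 0;
    let mut shift = 0u32;
    for (i, &byte) in buf.iter().enumerate() {
        result |= i64::from(byte & 0x7f) << shift;
        shift += 7;
        if byte & 0x80 == 0 {
            // Sign-extend if the last byte's sign bit is set and the value
            // does not already span the full i64.
            if shift < 64 && (byte & 0x40) != 0 {
                result |= -1i64 << shift;
            }
            return Some((result, i + 1));
        }
        if shift >= 64 {
            return None; // too many bytes for an i64
        }
    }
    None // truncated input
}

fn main() {
    assert_eq!(read_signed(&[0x7f]), Some((-1, 1)));
    assert_eq!(read_signed(&[0x3f]), Some((63, 1)));
    assert_eq!(read_signed(&[0x40]), Some((-64, 1)));
    // 64 needs two bytes: one byte would be read as negative (sign bit set).
    assert_eq!(read_signed(&[0xc0, 0x00]), Some((64, 2)));
}
```

Note how even the single-byte cases (127 vs. -1, 63 vs. -64) branch on the sign bit, which is why the unsigned fast path doesn't carry over directly.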
(force-pushed from 3c56f86 to 1bdf905)
Ehm, yes. I fixed that.
👍
Yeah, that was also the sticking point for me when I looked at it. If there are no further comments, feel free to merge.
Thanks!
As it turns out, the Rust compiler uses variable-length LEB128-encoded integers internally. It so happens that they spent a fair amount of effort micro-optimizing the decoding functionality [0] [1], as it's in the hot path.

With this change we replace our decoding routines with these optimized ones. As a result, we see a respectable speedup:
System #1:
System #2:
[0] rust-lang/rust#69050
[1] rust-lang/rust#69157
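The decoders referenced in [0] and [1] roughly take the following shape: a macro generates one function per integer width, with the first byte handled outside the loop. This is a simplified sketch, not a verbatim copy of rustc's or gimli's code, and it omits bounds and overflow handling for brevity:

```rust
// Sketch of the rustc-style unsigned LEB128 decoder: the macro stamps out
// one decoder per integer width, and the peeled first byte makes the common
// single-byte case a load, a test, and a return.
macro_rules! impl_read_unsigned_leb128 {
    ($fn_name:ident, $int_ty:ty) => {
        fn $fn_name(slice: &[u8], position: &mut usize) -> $int_ty {
            // Fast path: continuation bit clear on the first byte.
            let byte = slice[*position];
            *position += 1;
            if (byte & 0x80) == 0 {
                return byte as $int_ty;
            }
            let mut result = (byte & 0x7f) as $int_ty;
            let mut shift = 7;
            loop {
                let byte = slice[*position];
                *position += 1;
                if (byte & 0x80) == 0 {
                    result |= (byte as $int_ty) << shift;
                    return result;
                } else {
                    result |= ((byte & 0x7f) as $int_ty) << shift;
                }
                shift += 7;
            }
        }
    };
}

impl_read_unsigned_leb128!(read_u16_leb128, u16);
impl_read_unsigned_leb128!(read_u64_leb128, u64);

fn main() {
    let mut pos = 0;
    assert_eq!(read_u64_leb128(&[0xe5, 0x8e, 0x26], &mut pos), 624_485);
    assert_eq!(pos, 3);
    let mut pos = 0;
    assert_eq!(read_u16_leb128(&[0x7f], &mut pos), 127);
}
```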