feat: Add base to track byte reduction due to relaxation #1527
I don't think the CI failure is caused by this change: #1528
Note: I tried something similar before in #1317, but it didn’t work out. This PR is another attempt (and it took some time to implement, as you can see from the gap). Based on this, I'm currently working on implementing call relaxation for RISC-V locally, and it seems to be working quite well so far.
2712f60 to 782cbc4
6f32cae to 7d32bf6
davidlattimore left a comment
Unfortunately the performance loss is larger:
Benchmark 1 (671 runs): /home/david/save/zed/run-with env-rand /home/d/wild-builds/2026-02-13.cg1 --no-fork
  measurement         mean ± σ           min … max     outliers   delta
  wall_time          537ms ± 13.2ms    494ms … 565ms     1 ( 0%)  0%
  peak_rss          3.47GB ± 1.21MB   3.47GB … 3.47GB    4 ( 1%)  0%
  cpu_cycles         21.7G ± 182M      21.0G … 22.4G     8 ( 1%)  0%
  instructions       19.4G ± 51.6M     19.3G … 19.6G    18 ( 3%)  0%
  cache_references    473M ± 2.89M      466M … 485M     14 ( 2%)  0%
  cache_misses        124M ± 627K       123M … 126M      2 ( 0%)  0%
  branch_misses      38.3M ± 183K      37.1M … 38.8M    23 ( 3%)  0%
Benchmark 2 (639 runs): /home/david/save/zed/run-with env-rand /home/d/wild-builds/2026-02-13.cg1.relaxation-tracking --no-fork
  measurement         mean ± σ           min … max     outliers   delta
  wall_time          564ms ± 14.6ms    516ms … 592ms     1 ( 0%)  💩+ 5.0% ± 0.3%
  peak_rss          3.59GB ± 1.06MB   3.59GB … 3.59GB    8 ( 1%)  💩+ 3.5% ± 0.0%
  cpu_cycles         22.2G ± 172M      21.5G … 22.8G    11 ( 2%)  💩+ 2.1% ± 0.1%
  instructions       19.6G ± 48.9M     19.5G … 19.9G     7 ( 1%)  💩+ 1.3% ± 0.0%
  cache_references    496M ± 2.83M      489M … 507M      8 ( 1%)  💩+ 4.8% ± 0.1%
  cache_misses        128M ± 629K       126M … 130M      4 ( 1%)  💩+ 2.7% ± 0.1%
  branch_misses      39.5M ± 172K      38.6M … 40.1M    16 ( 3%)  💩+ 3.2% ± 0.1%
Memory consumption is also up. I assume this is because we're now storing an Option<Vec<..>> for every input section.
17eed65 to 2bae671
Implemented sparse maps for each object file to track byte reduction. While this is clearly more memory-efficient than using
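To make the memory trade-off discussed above concrete, here is a minimal sketch contrasting a dense per-section slot with a sparse map. All type and method names (`Deletion`, `DenseDeltas`, `SparseDeltas`) are hypothetical and are not taken from the PR. On a typical 64-bit target an `Option<Vec<T>>` occupies three machine words even when it is `None`, so the dense layout pays a fixed cost for every input section, while the sparse layout only pays for sections that were actually relaxed.

```rust
use std::collections::HashMap;
use std::mem::size_of;

/// Hypothetical record: `bytes` bytes are removed at `offset` in an input section.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Deletion {
    offset: u32,
    bytes: u32,
}

/// Dense layout: one slot per input section, even for sections never relaxed.
struct DenseDeltas {
    per_section: Vec<Option<Vec<Deletion>>>,
}

impl DenseDeltas {
    fn new(num_sections: usize) -> Self {
        Self { per_section: vec![None; num_sections] }
    }

    /// Bytes spent on slots alone, before any deletion is recorded.
    fn fixed_overhead_bytes(&self) -> usize {
        self.per_section.len() * size_of::<Option<Vec<Deletion>>>()
    }
}

/// Sparse layout: an entry exists only for sections that lost bytes.
#[derive(Default)]
struct SparseDeltas {
    per_section: HashMap<u32, Vec<Deletion>>,
}

impl SparseDeltas {
    fn record(&mut self, section: u32, deletion: Deletion) {
        self.per_section.entry(section).or_default().push(deletion);
    }

    /// Deletions for `section`, or an empty slice if it was never relaxed.
    fn get(&self, section: u32) -> &[Deletion] {
        self.per_section.get(&section).map(|v| v.as_slice()).unwrap_or(&[])
    }

    fn tracked_sections(&self) -> usize {
        self.per_section.len()
    }
}
```

With, say, 1000 input sections of which only one is relaxed, `DenseDeltas::new(1000)` still allocates 1000 slots, whereas `SparseDeltas` holds a single map entry.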
bfc2dda to 7a983ef
if let Some(hdr_out) = table_writer.take_eh_frame_hdr_entry() {
    let frame_ptr = (section_address + offset_in_section) as i64
    // When relaxation has deleted bytes from the target section, the

let n = self.deltas.len();
let mut lo = 0usize;
let mut hi = n;
while lo < hi {
Would it work to use binary_search_by_key here? If not, could you add a comment to the code saying why.
output_pos is strictly monotonically increasing, because the input offsets are strictly ascending and the deletion ranges don't overlap, so we didn't actually need a binary search here in the first place. The loop was a remnant of the first approach's implementation.
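For reference, the reviewer's suggestion can be sketched as follows. This is only an illustration under assumptions: the `Delta` struct and its fields are hypothetical, and the hand-rolled loop is a guess at what a `lo`/`hi` lower-bound search over `self.deltas` might have looked like. Because the keys are strictly increasing there are no duplicates, so `binary_search_by_key` returns the same index as the manual lower bound in both its `Ok` and `Err` cases.

```rust
/// Hypothetical delta record: `output_pos` is the position in the relaxed
/// output where a deletion takes effect, `cumulative` is the total number of
/// bytes removed up to and including this deletion.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Delta {
    output_pos: u64,
    cumulative: u64,
}

/// Hand-rolled lower bound: index of the first delta with `output_pos >= key`.
fn lower_bound_manual(deltas: &[Delta], key: u64) -> usize {
    let mut lo = 0usize;
    let mut hi = deltas.len();
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        if deltas[mid].output_pos < key {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    lo
}

/// The same lower bound via the standard library. `Ok(i)` is an exact match,
/// `Err(i)` is the insertion point; with unique keys both are the lower bound.
fn lower_bound_std(deltas: &[Delta], key: u64) -> usize {
    match deltas.binary_search_by_key(&key, |d| d.output_pos) {
        Ok(i) | Err(i) => i,
    }
}

/// Bytes removed by all deltas that take effect strictly before `key`.
fn removed_before(deltas: &[Delta], key: u64) -> u64 {
    match lower_bound_std(deltas, key) {
        0 => 0,
        i => deltas[i - 1].cumulative,
    }
}
```

Both functions rely on `deltas` being sorted by `output_pos`, which is the property the reply above describes.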
Performance looks good now. Thanks :)
7a983ef to 7cacb5a
Just wanted to mention we can reduce the binary size, but we must obey the align relocations, e.g. |
Since I've merged #1552, I'll begin addressing |
part of #874
Some relaxations on architectures such as RISC-V and LoongArch actually reduce the generated code size. Similar relaxations exist for x86_64 and AArch64, but those implementations pad the shortened instructions with NOPs, so the overall symbol sizes remain unchanged. In contrast, the RISC-V and LoongArch relaxations truly shrink the symbols themselves, which means we must track how much each symbol’s size is reduced during the layout phase. A naive implementation would run the layout process twice, with the second pass accounting for size reductions in other symbols, which would clearly hurt performance.
Therefore, while this PR does not introduce any specific relaxation, it lays the groundwork for tracking size reductions caused by relaxations without requiring a second layout pass.
SectionRelaxDeltas maintains a list of offsets at which each section should remove a specific number of bytes. This information is used during the write phase to calculate offsets individually. The relaxation_deleted field within Section records the number of bytes removed from each section. From this data, we can calculate the actual size of each section, which can be used for the layout phase.
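The idea described above can be sketched roughly as follows. This is a simplified illustration, not the PR's implementation: the field layout, the method names (`push`, `total_deleted`, `output_offset`) and the free function `relaxed_size` are all assumptions made for the example. It shows how a sorted list of `(input_offset, bytes_removed)` pairs is enough both to translate an input offset into an output offset at write time and to compute a section's relaxed size for layout.

```rust
/// Hypothetical sketch; names and layout are assumptions, not wild's API.
#[derive(Debug, Default)]
struct SectionRelaxDeltas {
    /// `(input_offset, bytes_removed)` pairs, sorted by ascending
    /// `input_offset`, with non-overlapping deletion ranges.
    deltas: Vec<(u64, u64)>,
}

impl SectionRelaxDeltas {
    /// Records that `bytes` bytes starting at `input_offset` are deleted.
    fn push(&mut self, input_offset: u64, bytes: u64) {
        if let Some(&(prev_off, prev_len)) = self.deltas.last() {
            assert!(
                input_offset >= prev_off + prev_len,
                "deletions must be ascending and must not overlap"
            );
        }
        self.deltas.push((input_offset, bytes));
    }

    /// Total number of bytes removed from the section.
    fn total_deleted(&self) -> u64 {
        self.deltas.iter().map(|&(_, len)| len).sum()
    }

    /// Maps an offset in the original section to its offset after deletions.
    /// An offset inside a deleted range maps to the start of that range.
    fn output_offset(&self, input_offset: u64) -> u64 {
        let mut removed = 0;
        for &(off, len) in &self.deltas {
            if input_offset <= off {
                break;
            }
            // Count only the deleted bytes that lie before `input_offset`.
            removed += len.min(input_offset - off);
        }
        input_offset - removed
    }
}

/// Section size as seen by layout once deletions are accounted for.
fn relaxed_size(original_size: u64, deltas: &SectionRelaxDeltas) -> u64 {
    original_size - deltas.total_deleted()
}
```

For a 24-byte section with 4 bytes deleted at offset 4 and 2 bytes deleted at offset 16, input offset 16 maps to output offset 12 and the relaxed size is 18 bytes. The lookup here is a linear scan for clarity; a real implementation would likely precompute cumulative totals.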