perf: parallelize copying data sections by karolzwolak · Pull Request #1277 · wild-linker/wild

karolzwolak · 2025-11-06T19:18:30Z

Part of #1194.

karolzwolak · 2025-11-06T19:21:30Z

Is this what you meant by the initial low-hanging fruit, @davidlattimore?

Marking this as a draft because some additional benchmarking is needed. Should the threshold be configurable in some way?

karolzwolak · 2025-11-06T21:23:17Z

On my system wild already seems to be faster than ld.lld on the include-blob benchmark; however, after this change writing is a lot faster.
I'm benchmarking https://github.com/rust-lang/rustc-perf/tree/master/collector/compile-benchmarks/include-blob built in release mode.

before:

┌───    1.38 Open input files
├───    0.00 Process linker scripts
├───    0.65 Parse input files
├───    0.49 Group files
│ ┌───    1.81 Read symbols
│ ├───    0.59 Populate symbol map
├─┴─    2.75 Build symbol DB
│ ┌───    0.71 Resolve symbols
│ ├───    1.09 Resolve sections
│ ├───    0.02 Assign section IDs
│ ├───    0.54 Merge strings
│ ├───    0.02 Canonicalise undefined symbols
│ ├───    0.08 Resolve alternative symbol definitions
├─┴─    2.55 Symbol resolution
│ ┌───    2.69 Find required sections
│ ├───    0.04 Finalise copy relocations
│ ├───    0.00 Merge dynamic symbol definitions
│ ├───    0.00 Merge GNU property notes
│ ├───    0.00 Merge e_flags
│ ├───    0.02 Merge .riscv.attributes sections
│ ├───    0.11 Finalise per-object sizes
│ ├───    0.00 Apply non-addressable indexes
│ ├───    0.01 Propagate section attributes
│ ├───    0.01 Compute output order
│ ├───    0.03 Compute total section sizes
│ ├───    0.00 Compute segment layouts
│ ├───    0.00 Compute per-alignment offsets
│ ├───    0.10 Compute per-group start offsets
│ ├───    0.00 Compute merged string section start addresses
│ ├───    0.17 Assign symbol addresses
│ ├───    0.00 Update dynamic symbol resolutions
├─┴─    3.24 Layout
│ ┌───    0.00 Wait for output file creation
│ │ ┌───    0.15 Split output buffers by group
│ ├─┴─   14.16 Write data to file
│ ├───    0.01 Sort .eh_frame_hdr
│ ├───    2.26 Compute build ID
│ ├───    1.12 Unmap output file
├─┴─   17.59 Write output file
│ ┌───    0.10 Verify inputs unchanged
│ ├───    0.10 Drop layout
│ ├───    1.05 Drop inputs
├─┴─    1.28 Shutdown
└─   30.05 Link

hyperfine --warmup 2 './run-with ld.lld' './run-with wild --no-fork'
Benchmark 1: ./run-with ld.lld
  Time (mean ± σ):      32.9 ms ±   2.6 ms    [User: 37.2 ms, System: 34.3 ms]
  Range (min … max):    30.3 ms …  44.6 ms    88 runs
 
Benchmark 2: ./run-with wild --no-fork
  Time (mean ± σ):      22.7 ms ±   1.0 ms    [User: 36.9 ms, System: 20.5 ms]
  Range (min … max):    21.4 ms …  26.6 ms    115 runs
 
Summary
  ./run-with wild --no-fork ran
    1.45 ± 0.13 times faster than ./run-with ld.lld

after:

┌───    2.65 Open input files
├───    0.01 Process linker scripts
├───    0.73 Parse input files
├───    0.47 Group files
│ ┌───    1.62 Read symbols
│ ├───    0.69 Populate symbol map
├─┴─    2.66 Build symbol DB
│ ┌───    0.79 Resolve symbols
│ ├───    1.35 Resolve sections
│ ├───    0.02 Assign section IDs
│ ├───    0.50 Merge strings
│ ├───    0.01 Canonicalise undefined symbols
│ ├───    0.08 Resolve alternative symbol definitions
├─┴─    2.83 Symbol resolution
│ ┌───    2.38 Find required sections
│ ├───    0.06 Finalise copy relocations
│ ├───    0.00 Merge dynamic symbol definitions
│ ├───    0.00 Merge GNU property notes
│ ├───    0.00 Merge e_flags
│ ├───    0.03 Merge .riscv.attributes sections
│ ├───    0.11 Finalise per-object sizes
│ ├───    0.00 Apply non-addressable indexes
│ ├───    0.01 Propagate section attributes
│ ├───    0.01 Compute output order
│ ├───    0.03 Compute total section sizes
│ ├───    0.00 Compute segment layouts
│ ├───    0.00 Compute per-alignment offsets
│ ├───    0.11 Compute per-group start offsets
│ ├───    0.00 Compute merged string section start addresses
│ ├───    0.28 Assign symbol addresses
│ ├───    0.00 Update dynamic symbol resolutions
├─┴─    3.10 Layout
│ ┌───    0.00 Wait for output file creation
│ │ ┌───    0.15 Split output buffers by group
│ ├─┴─    4.91 Write data to file
│ ├───    0.01 Sort .eh_frame_hdr
│ ├───    2.14 Compute build ID
│ ├───    1.26 Unmap output file
├─┴─    8.36 Write output file
│ ┌───    0.04 Verify inputs unchanged
│ ├───    0.15 Drop layout
│ ├───    1.05 Drop inputs
├─┴─    1.26 Shutdown
└─   22.26 Link

hyperfine --warmup 2 './run-with ld.lld' './run-with wild --no-fork' 
Benchmark 1: ./run-with ld.lld
  Time (mean ± σ):      31.8 ms ±   1.3 ms    [User: 37.7 ms, System: 33.2 ms]
  Range (min … max):    30.4 ms …  38.1 ms    83 runs
 
Benchmark 2: ./run-with wild --no-fork
  Time (mean ± σ):      14.3 ms ±   0.5 ms    [User: 53.3 ms, System: 38.5 ms]
  Range (min … max):    13.4 ms …  16.9 ms    174 runs
 
Summary
  ./run-with wild --no-fork ran
    2.22 ± 0.12 times faster than ./run-with ld.lld

That's over a 50% improvement in speed (22.7 ms down to 14.3 ms mean, about 1.6x), and writing the output is about twice as fast (17.59 down to 8.36), in this scenario on my system!
I haven't benchmarked anything else yet, but this simple change looks really promising.
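
For anyone checking the arithmetic, here is a small sketch that derives the ratios from the figures quoted above. The helper names `speedup` and `time_saved_percent` are made up for illustration and are not part of wild or hyperfine. With these inputs the overall speedup comes out at roughly 1.59x (about 37% less wall time), and the "Write output file" phase at roughly 2.10x (about 52% less time).

```rust
/// Ratio of old time to new time; 2.0 means "twice as fast".
/// (Hypothetical helper, for illustration only.)
fn speedup(before: f64, after: f64) -> f64 {
    before / after
}

/// Share of the original time that was saved, as a percentage.
/// (Hypothetical helper, for illustration only.)
fn time_saved_percent(before: f64, after: f64) -> f64 {
    (before - after) / before * 100.0
}

fn main() {
    // Mean wall times for `wild --no-fork` from the two hyperfine runs.
    let (wall_before, wall_after) = (22.7, 14.3);
    // "Write output file" entries from the two timing trees.
    let (write_before, write_after) = (17.59, 8.36);

    println!(
        "overall: {:.2}x faster, {:.1}% less time",
        speedup(wall_before, wall_after),
        time_saved_percent(wall_before, wall_after)
    );
    println!(
        "write output: {:.2}x faster, {:.1}% less time",
        speedup(write_before, write_after),
        time_saved_percent(write_before, write_after)
    );
}
```

The two measures answer different questions: "x times faster" compares throughput, while "percent less time" compares duration, which is why a 1.59x speedup corresponds to only about 37% less wall time.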

mati865 · 2025-11-06T22:00:25Z

We have picked up a lot of speed along the way in some conditions. If you hardcode https://github.com/davidlattimore/wild/blob/11c291f5e47d073a86fd71a390c4153764bccf06/libwild/src/file_writer.rs#L228 to return false, you might be able to reproduce the original lack of performance. Alternatively, you could modify the benchmark to output a cdylib instead of an executable.

davidlattimore · 2025-11-06T22:43:08Z

Nice! Yep, I see a good improvement on the include-blob benchmark too:

OUT=/run/user/1000/ttt bench poop -d 30000 "/home/david/tmp/inc-blob-save/1/run-with /home/david/wild-builds/2025-11-07" "/home/david/tmp/inc-blob-save/1/run-with target/release/wild"
  Temperature: 61.1 C
Benchmark 1 (840 runs): /home/david/tmp/inc-blob-save/1/run-with /home/david/wild-builds/2025-11-07
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          35.2ms ±  949us    33.1ms … 39.1ms          2 ( 0%)        0%
  peak_rss           7.55MB ± 70.8KB    7.03MB … 7.57MB         69 ( 8%)        0%
  cpu_cycles          335M  ± 12.3M      280M  …  405M           8 ( 1%)        0%
  instructions        245M  ± 13.5M      218M  …  277M           0 ( 0%)        0%
  cache_references   10.6M  ±  426K     9.74M  … 11.6M           0 ( 0%)        0%
  cache_misses       2.59M  ± 55.3K     2.46M  … 2.73M           0 ( 0%)        0%
  branch_misses       504K  ± 26.4K      451K  …  569K           0 ( 0%)        0%
Benchmark 2 (1177 runs): /home/david/tmp/inc-blob-save/1/run-with target/release/wild
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          25.0ms ±  549us    23.5ms … 27.3ms         17 ( 1%)        ⚡- 29.1% ±  0.2%
  peak_rss           7.54MB ± 87.4KB    7.24MB … 7.57MB        205 (17%)          -  0.2% ±  0.1%
  cpu_cycles          473M  ± 10.9M      355M  …  513M          23 ( 2%)        💩+ 41.0% ±  0.3%
  instructions        240M  ± 9.04M      221M  …  274M          44 ( 4%)        ⚡-  2.0% ±  0.4%
  cache_references   9.45M  ±  274K     8.92M  … 10.4M          54 ( 5%)        ⚡- 10.5% ±  0.3%
  cache_misses       4.02M  ± 38.9K     3.69M  … 4.18M          22 ( 2%)        💩+ 55.4% ±  0.2%
  branch_misses       497K  ± 17.8K      459K  …  565K          40 ( 3%)          -  1.3% ±  0.4%
  Temperature: 66.5 C

So I'm happy to merge this once it's marked as not a draft.

edit: I reran and amended the above benchmark results. The original baseline was wrong, so I was diffing two performance-sensitive changes, not just this PR.

davidlattimore · 2025-11-06T23:43:02Z

> Should the threshold be configurable in some way?

I don't think it's necessary. We could experiment with different thresholds and see what difference they make, although we'd need to use a benchmark other than include-blob. But I think the current threshold is likely a reasonable starting point, so it's fine to just go with that for now.
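
For readers who have not looked at the diff, here is a minimal sketch of the general idea under discussion: copy small sections with a plain copy, and split large ones across threads once they pass a size threshold. This is not the PR's code. The constant `PARALLEL_COPY_THRESHOLD`, its value, the function `copy_section`, and the use of `std::thread::scope` are all assumptions made for illustration; the actual threshold and threading mechanism in wild are not shown in this thread.

```rust
use std::thread;

/// Hypothetical cut-off in bytes; the real value used by the PR is not
/// stated in this conversation.
const PARALLEL_COPY_THRESHOLD: usize = 1 << 20;

/// Copies `src` into `dst`, splitting the work across up to `threads`
/// scoped threads when the section is at least the threshold in size.
fn copy_section(dst: &mut [u8], src: &[u8], threads: usize) {
    assert_eq!(dst.len(), src.len());
    if src.len() < PARALLEL_COPY_THRESHOLD || threads <= 1 {
        // Small sections: one plain copy is cheaper than dispatching work.
        dst.copy_from_slice(src);
        return;
    }
    // Round up so that at most `threads` chunks are produced.
    let chunk_len = (src.len() + threads - 1) / threads;
    thread::scope(|scope| {
        // Each thread gets a disjoint destination chunk and the matching
        // source chunk, so no synchronisation is needed.
        for (d, s) in dst.chunks_mut(chunk_len).zip(src.chunks(chunk_len)) {
            scope.spawn(move || d.copy_from_slice(s));
        }
    });
}

fn main() {
    // A section large enough to take the parallel path.
    let src: Vec<u8> = (0..4 * PARALLEL_COPY_THRESHOLD)
        .map(|i| (i % 251) as u8)
        .collect();
    let mut dst = vec![0u8; src.len()];
    copy_section(&mut dst, &src, 4);
    assert_eq!(dst, src);
}
```

A fixed threshold keeps the common case (many small sections) on the cheap path. Trying other values would mean changing the constant and re-running a benchmark with a wider mix of section sizes, as noted above.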

karolzwolak · 2025-11-07T07:12:31Z

> So I'm happy to merge this once it's marked as not a draft.

Okay, great. I've marked this as ready for review.

perf: parallelize copying data sections

bece8dc

karolzwolak marked this pull request as ready for review November 7, 2025 07:09

davidlattimore approved these changes Nov 7, 2025

davidlattimore merged commit 276be30 into wild-linker:main Nov 7, 2025
20 checks passed

karolzwolak deleted the parallelize-copying-section-data branch November 7, 2025 18:50

karolzwolak mentioned this pull request Nov 9, 2025

include-blob Rust's benchmark with Wild is slower compared to LLD without forking #1082

Closed
