perf: parallelize copying data sections #1277

Merged
davidlattimore merged 1 commit into wild-linker:main from karolzwolak:parallelize-copying-section-data on Nov 7, 2025

@karolzwolak

Contributor

Part of #1194.

@karolzwolak

Contributor Author

Is this what you meant for the initial low-hanging fruit, @davidlattimore?

Marking this as a draft because some additional benchmarking is needed. Should the threshold be configurable in some way?
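
For readers without the diff open, the general idea of a size threshold for parallel copying can be sketched as follows. This is an illustration only, not the code from this PR: the function name `copy_section_data`, the constant `PARALLEL_COPY_THRESHOLD`, and its value are all hypothetical. The sketch uses scoped threads from the standard library so that it stays self-contained, which may differ from what wild actually does.

```rust
use std::thread;

/// Hypothetical cut-off (not wild's actual value): sections smaller than
/// this are copied on the calling thread, since spawning work costs more
/// than a small copy saves.
const PARALLEL_COPY_THRESHOLD: usize = 1024 * 1024;

/// Copies `src` into `dst`. Copies at or above the threshold are split
/// into roughly equal chunks, one per available core, and the chunks are
/// copied concurrently.
fn copy_section_data(dst: &mut [u8], src: &[u8]) {
    assert_eq!(dst.len(), src.len());

    if src.len() < PARALLEL_COPY_THRESHOLD {
        dst.copy_from_slice(src);
        return;
    }

    // Fall back to a single worker if the core count cannot be queried.
    let workers = thread::available_parallelism().map_or(1, |n| n.get());
    // Round up so that at most `workers` chunks are produced.
    let chunk_len = (src.len() + workers - 1) / workers;

    thread::scope(|scope| {
        // Both slices have the same length and chunk size, so the chunk
        // pairs line up and have equal lengths.
        for (d, s) in dst.chunks_mut(chunk_len).zip(src.chunks(chunk_len)) {
            scope.spawn(move || d.copy_from_slice(s));
        }
    });
}
```

Whether a fixed constant like this is good enough, or should be tunable, is the question being asked here. In the sketch, making it a function parameter would be the smallest step toward configurability.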

@karolzwolak

karolzwolak commented Nov 6, 2025

Contributor Author

On my system, wild already seems to be faster than ld.lld on the include-blob benchmark. Even so, writing is a lot faster after this change.
I'm benchmarking https://github.com/rust-lang/rustc-perf/tree/master/collector/compile-benchmarks/include-blob built in release mode.

before:

┌───    1.38 Open input files
├───    0.00 Process linker scripts
├───    0.65 Parse input files
├───    0.49 Group files
│ ┌───    1.81 Read symbols
│ ├───    0.59 Populate symbol map
├─┴─    2.75 Build symbol DB
│ ┌───    0.71 Resolve symbols
│ ├───    1.09 Resolve sections
│ ├───    0.02 Assign section IDs
│ ├───    0.54 Merge strings
│ ├───    0.02 Canonicalise undefined symbols
│ ├───    0.08 Resolve alternative symbol definitions
├─┴─    2.55 Symbol resolution
│ ┌───    2.69 Find required sections
│ ├───    0.04 Finalise copy relocations
│ ├───    0.00 Merge dynamic symbol definitions
│ ├───    0.00 Merge GNU property notes
│ ├───    0.00 Merge e_flags
│ ├───    0.02 Merge .riscv.attributes sections
│ ├───    0.11 Finalise per-object sizes
│ ├───    0.00 Apply non-addressable indexes
│ ├───    0.01 Propagate section attributes
│ ├───    0.01 Compute output order
│ ├───    0.03 Compute total section sizes
│ ├───    0.00 Compute segment layouts
│ ├───    0.00 Compute per-alignment offsets
│ ├───    0.10 Compute per-group start offsets
│ ├───    0.00 Compute merged string section start addresses
│ ├───    0.17 Assign symbol addresses
│ ├───    0.00 Update dynamic symbol resolutions
├─┴─    3.24 Layout
│ ┌───    0.00 Wait for output file creation
│ │ ┌───    0.15 Split output buffers by group
│ ├─┴─   14.16 Write data to file
│ ├───    0.01 Sort .eh_frame_hdr
│ ├───    2.26 Compute build ID
│ ├───    1.12 Unmap output file
├─┴─   17.59 Write output file
│ ┌───    0.10 Verify inputs unchanged
│ ├───    0.10 Drop layout
│ ├───    1.05 Drop inputs
├─┴─    1.28 Shutdown
└─   30.05 Link
hyperfine --warmup 2 './run-with ld.lld' './run-with wild --no-fork'
Benchmark 1: ./run-with ld.lld
  Time (mean ± σ):      32.9 ms ±   2.6 ms    [User: 37.2 ms, System: 34.3 ms]
  Range (min … max):    30.3 ms …  44.6 ms    88 runs
 
Benchmark 2: ./run-with wild --no-fork
  Time (mean ± σ):      22.7 ms ±   1.0 ms    [User: 36.9 ms, System: 20.5 ms]
  Range (min … max):    21.4 ms …  26.6 ms    115 runs
 
Summary
  ./run-with wild --no-fork ran
    1.45 ± 0.13 times faster than ./run-with ld.lld

after:

┌───    2.65 Open input files
├───    0.01 Process linker scripts
├───    0.73 Parse input files
├───    0.47 Group files
│ ┌───    1.62 Read symbols
│ ├───    0.69 Populate symbol map
├─┴─    2.66 Build symbol DB
│ ┌───    0.79 Resolve symbols
│ ├───    1.35 Resolve sections
│ ├───    0.02 Assign section IDs
│ ├───    0.50 Merge strings
│ ├───    0.01 Canonicalise undefined symbols
│ ├───    0.08 Resolve alternative symbol definitions
├─┴─    2.83 Symbol resolution
│ ┌───    2.38 Find required sections
│ ├───    0.06 Finalise copy relocations
│ ├───    0.00 Merge dynamic symbol definitions
│ ├───    0.00 Merge GNU property notes
│ ├───    0.00 Merge e_flags
│ ├───    0.03 Merge .riscv.attributes sections
│ ├───    0.11 Finalise per-object sizes
│ ├───    0.00 Apply non-addressable indexes
│ ├───    0.01 Propagate section attributes
│ ├───    0.01 Compute output order
│ ├───    0.03 Compute total section sizes
│ ├───    0.00 Compute segment layouts
│ ├───    0.00 Compute per-alignment offsets
│ ├───    0.11 Compute per-group start offsets
│ ├───    0.00 Compute merged string section start addresses
│ ├───    0.28 Assign symbol addresses
│ ├───    0.00 Update dynamic symbol resolutions
├─┴─    3.10 Layout
│ ┌───    0.00 Wait for output file creation
│ │ ┌───    0.15 Split output buffers by group
│ ├─┴─    4.91 Write data to file
│ ├───    0.01 Sort .eh_frame_hdr
│ ├───    2.14 Compute build ID
│ ├───    1.26 Unmap output file
├─┴─    8.36 Write output file
│ ┌───    0.04 Verify inputs unchanged
│ ├───    0.15 Drop layout
│ ├───    1.05 Drop inputs
├─┴─    1.26 Shutdown
└─   22.26 Link
hyperfine --warmup 2 './run-with ld.lld' './run-with wild --no-fork' 
Benchmark 1: ./run-with ld.lld
  Time (mean ± σ):      31.8 ms ±   1.3 ms    [User: 37.7 ms, System: 33.2 ms]
  Range (min … max):    30.4 ms …  38.1 ms    83 runs
 
Benchmark 2: ./run-with wild --no-fork
  Time (mean ± σ):      14.3 ms ±   0.5 ms    [User: 53.3 ms, System: 38.5 ms]
  Range (min … max):    13.4 ms …  16.9 ms    174 runs
 
Summary
  ./run-with wild --no-fork ran
    2.22 ± 0.12 times faster than ./run-with ld.lld

That's over a 50% speedup overall (and writing the output is about twice as fast) in this scenario on my system!
I haven't benchmarked anything else yet, but this simple change looks really promising.
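
The figures above can be checked with a little arithmetic. The sketch below is not part of the PR, and the helper names `speedup` and `time_saved_percent` are made up for illustration. It separates two readings of "improvement" that are easy to mix up: how many times faster the run is, and what fraction of the original time was saved.

```rust
/// How many times faster `after` is than `before` (before / after).
fn speedup(before: f64, after: f64) -> f64 {
    before / after
}

/// Percentage of the original time that was saved.
fn time_saved_percent(before: f64, after: f64) -> f64 {
    (before - after) / before * 100.0
}

/// Prints both measures for the numbers quoted in the comment above.
fn report() {
    // hyperfine mean wall time for wild: 22.7 ms before, 14.3 ms after.
    // speedup is about 1.59x, time saved is about 37%.
    println!(
        "overall: {:.2}x faster, {:.1}% less time",
        speedup(22.7, 14.3),
        time_saved_percent(22.7, 14.3)
    );
    // "Write output file" phase from the timing tree: 17.59 before, 8.36 after.
    // speedup is about 2.10x, time saved is about 52%.
    println!(
        "write output: {:.2}x faster, {:.1}% less time",
        speedup(17.59, 8.36),
        time_saved_percent(17.59, 8.36)
    );
}
```

Read as a speedup factor, the overall figure is consistent with "over 50%" (about 1.59x), while the wall-clock time saved is closer to 37%.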

@mati865

mati865 commented Nov 6, 2025

Member

We have picked up a lot of speed along the way in some conditions. If you hardcode https://github.com/davidlattimore/wild/blob/11c291f5e47d073a86fd71a390c4153764bccf06/libwild/src/file_writer.rs#L228 to return false, you might be able to reproduce the original lack of performance. Alternatively, you could modify the benchmark to output a cdylib instead of an executable.

@davidlattimore

davidlattimore commented Nov 6, 2025

Member

Nice! Yep, I see a good improvement on the include-blob benchmark too:

OUT=/run/user/1000/ttt bench poop -d 30000 "/home/david/tmp/inc-blob-save/1/run-with /home/david/wild-builds/2025-11-07" "/home/david/tmp/inc-blob-save/1/run-with target/release/wild"
  Temperature: 61.1 C
Benchmark 1 (840 runs): /home/david/tmp/inc-blob-save/1/run-with /home/david/wild-builds/2025-11-07
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          35.2ms ±  949us    33.1ms … 39.1ms          2 ( 0%)        0%
  peak_rss           7.55MB ± 70.8KB    7.03MB … 7.57MB         69 ( 8%)        0%
  cpu_cycles          335M  ± 12.3M      280M  …  405M           8 ( 1%)        0%
  instructions        245M  ± 13.5M      218M  …  277M           0 ( 0%)        0%
  cache_references   10.6M  ±  426K     9.74M  … 11.6M           0 ( 0%)        0%
  cache_misses       2.59M  ± 55.3K     2.46M  … 2.73M           0 ( 0%)        0%
  branch_misses       504K  ± 26.4K      451K  …  569K           0 ( 0%)        0%
Benchmark 2 (1177 runs): /home/david/tmp/inc-blob-save/1/run-with target/release/wild
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          25.0ms ±  549us    23.5ms … 27.3ms         17 ( 1%)        ⚡- 29.1% ±  0.2%
  peak_rss           7.54MB ± 87.4KB    7.24MB … 7.57MB        205 (17%)          -  0.2% ±  0.1%
  cpu_cycles          473M  ± 10.9M      355M  …  513M          23 ( 2%)        💩+ 41.0% ±  0.3%
  instructions        240M  ± 9.04M      221M  …  274M          44 ( 4%)        ⚡-  2.0% ±  0.4%
  cache_references   9.45M  ±  274K     8.92M  … 10.4M          54 ( 5%)        ⚡- 10.5% ±  0.3%
  cache_misses       4.02M  ± 38.9K     3.69M  … 4.18M          22 ( 2%)        💩+ 55.4% ±  0.2%
  branch_misses       497K  ± 17.8K      459K  …  565K          40 ( 3%)          -  1.3% ±  0.4%
  Temperature: 66.5 C

So I'm happy to merge this once it's marked as not a draft.

Edit: I reran and amended the above benchmark results. The original baseline was wrong, so I was diffing two performance-sensitive changes, not just this PR.

@davidlattimore

Member

> Should the threshold be configurable in some way?

I don't think it's necessary. We could experiment with different thresholds and see what difference they make, although we'd need a benchmark other than include-blob for that. But I think the current threshold is likely a reasonable starting point, so it's fine to just go with that for now.

@karolzwolak karolzwolak marked this pull request as ready for review November 7, 2025 07:09
@karolzwolak

Contributor Author

> So I'm happy to merge this once it's marked as not a draft.

Okay, great. I've marked this as ready for review.

@davidlattimore davidlattimore merged commit 276be30 into wild-linker:main Nov 7, 2025
20 checks passed
@karolzwolak karolzwolak deleted the parallelize-copying-section-data branch November 7, 2025 18:50