
Implement UnmanagedBuffer and switch TaskDeque to use it.#41

Open
saeta wants to merge 2 commits into main from unmanaged-buffer

Conversation


@saeta saeta commented May 24, 2020

Previously, `TaskDeque` was implemented in terms of `ManagedBuffer`. While
`ManagedBuffer` implements the semantics we'd like, it is implemented as a
class. This can induce a significant amount of reference counting traffic,
especially when stored in `Array`s.

`UnmanagedBuffer` implements a similar interface to `ManagedBuffer`, but
instead uses manual pointer allocation and management. This allows us to
avoid all reference counting traffic, at the cost of requiring explicit
destruction.
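To make the trade-off concrete, here is a hedged sketch of what an `UnmanagedBuffer`-like type could look like. The names and API below are illustrative assumptions, not the PR's actual implementation: like `ManagedBuffer` it stores a header followed by tail-allocated elements, but as a struct wrapping a raw allocation, so copies move a pointer with no retain/release, and `deallocate()` must be called explicitly.

```swift
// Hypothetical sketch of an UnmanagedBuffer-like type (illustrative, not
// the PR's actual code). A header plus `capacity` tail-allocated elements
// live in one manually managed raw allocation; no ARC is involved.
struct UnmanagedBuffer<Header, Element> {
  private let ptr: UnsafeMutableRawPointer
  let capacity: Int

  // Offset of the element region, rounded up to the element alignment.
  private static var elementOffset: Int {
    let align = MemoryLayout<Element>.alignment
    return (MemoryLayout<Header>.stride + align - 1) & ~(align - 1)
  }

  init(capacity: Int, header: Header) {
    self.capacity = capacity
    ptr = UnsafeMutableRawPointer.allocate(
      byteCount: Self.elementOffset + MemoryLayout<Element>.stride * capacity,
      alignment: max(MemoryLayout<Header>.alignment,
                     MemoryLayout<Element>.alignment))
    ptr.initializeMemory(as: Header.self, repeating: header, count: 1)
  }

  var header: Header {
    get { ptr.assumingMemoryBound(to: Header.self).pointee }
    nonmutating set { ptr.assumingMemoryBound(to: Header.self).pointee = newValue }
  }

  private var elements: UnsafeMutablePointer<Element> {
    (ptr + Self.elementOffset).assumingMemoryBound(to: Element.self)
  }

  // Assumes trivial elements; non-trivial elements would need explicit
  // initialize/deinitialize by the caller.
  subscript(i: Int) -> Element {
    get { elements[i] }
    nonmutating set { elements[i] = newValue }
  }

  // Explicit destruction: unlike a class, nothing runs automatically.
  func deallocate() {
    ptr.assumingMemoryBound(to: Header.self).deinitialize(count: 1)
    ptr.deallocate()
  }
}
```

Because the type is a struct holding only a pointer and an `Int`, storing it in an `Array` or passing it between functions generates no retain/release calls, which is the source of the speedups measured below.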

Switching `TaskDeque` from `ManagedBuffer` to `UnmanagedBuffer` yields
between a 2x and 6x performance improvement for key workloads that stress
the `TaskDeque` data structure within the `NonBlockingThreadPool`.

Below are performance numbers from two machines and operating systems
demonstrating the improvements. Note: because this change has been
extracted from a stack of related performance improvements, benchmarking
this PR by itself will not show the expected speedups. Instead, this PR
has been separated out to facilitate easier reviewing.

Benchmark numbers on machine A:
-------------------------------

Before (from previous commit):

```
name                                                                   time         std                   iterations
--------------------------------------------------------------------------------------------------------------------
NonBlockingThreadPool: join, one level                                 700.0 ns     ± 70289.84998218225   127457
NonBlockingThreadPool: join, two levels                                2107.0 ns    ± 131041.5070696377   31115
NonBlockingThreadPool: join, three levels                              4960.0 ns    ± 178122.9562964306   15849
NonBlockingThreadPool: join, four levels, three on thread pool thread  5893.0 ns    ± 224021.47900401088  13763
NonBlockingThreadPool: parallel for, one level                         22420.0 ns   ± 203689.69689780468  7581
NonBlockingThreadPool: parallel for, two levels                        500985.5 ns  ± 642136.0139757036   1390
```

After:

```
name                                                                   time         std                   iterations
--------------------------------------------------------------------------------------------------------------------
NonBlockingThreadPool: join, one level                                 429.0 ns     ± 59095.36128834737   223050
NonBlockingThreadPool: join, two levels                                1270.0 ns    ± 101587.48601579959  64903
NonBlockingThreadPool: join, three levels                              3098.0 ns    ± 165407.1669656578   28572
NonBlockingThreadPool: join, four levels, three on thread pool thread  3990.5 ns    ± 227217.34017343252  10000
NonBlockingThreadPool: parallel for, one level                         16853.0 ns   ± 260015.39296821563  8660
NonBlockingThreadPool: parallel for, two levels                        563926.0 ns  ± 609298.6358076902   2189
```

Benchmark numbers from machine B:
---------------------------------

Before:

```
name                                                                   time          std                   iterations
---------------------------------------------------------------------------------------------------------------------
NonBlockingThreadPool: join, one level                                 3022.0 ns     ± 366686.3050127019   21717
NonBlockingThreadPool: join, two levels                                13313.5 ns    ± 550429.476815564    5970
NonBlockingThreadPool: join, three levels                              39009.5 ns    ± 716172.9687807652   3546
NonBlockingThreadPool: join, four levels, three on thread pool thread  341631.0 ns   ± 767483.9227743072   2367
NonBlockingThreadPool: parallel for, one level                         404375.0 ns   ± 590178.6724299589   3123
NonBlockingThreadPool: parallel for, two levels                        1000872.0 ns  ± 1592704.2766365155  805
```

After:

```
name                                                                   time          std                   iterations
---------------------------------------------------------------------------------------------------------------------
NonBlockingThreadPool: join, one level                                 749.0 ns      ± 174096.69284101788  91247
NonBlockingThreadPool: join, two levels                                12046.5 ns    ± 414670.5686344325   5920
NonBlockingThreadPool: join, three levels                              46975.0 ns    ± 543858.2306554643   3561
NonBlockingThreadPool: join, four levels, three on thread pool thread  559837.0 ns   ± 591477.1893574063   2795
NonBlockingThreadPool: parallel for, one level                         66446.0 ns    ± 627245.5098742851   2236
NonBlockingThreadPool: parallel for, two levels                        1668739.0 ns  ± 1536323.375783659   765
```

@saeta saeta requested a review from dabrahams May 24, 2020 23:36
@saeta saeta force-pushed the unmanaged-buffer branch from 88e8df8 to 9e31153 on May 24, 2020 23:38
saeta added a commit that referenced this pull request May 25, 2020
Note: this change (by itself) does not reduce ARC traffic, but in concert
with `UnmanagedBuffer` (#41), we see a ~15x performance improvement in
the parallelFor benchmark.

saeta commented May 25, 2020

#33


@dabrahams dabrahams left a comment


If possible I'd like to better understand where the ARC traffic was coming from. The rationale in the description doesn't quite compute for me: storing multiple refcounted things in an array doesn't increase ARC traffic; if anything, compared with storing them in something like a struct, it reduces ARC traffic from N to 1. It should (generally) be possible to use `ManagedBuffer` in such a way that ARC is minimized by relying on unsafe (buffer) pointer APIs.
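For comparison, the alternative the review alludes to can be sketched as follows: subclass `ManagedBuffer` and touch its storage only through the unsafe-pointer APIs, so element accesses incur no per-element retain/release. The `IntBuffer` name and its methods are illustrative assumptions, not code from this repository.

```swift
// Illustrative only: a ManagedBuffer subclass whose storage is accessed
// exclusively through its unsafe pointer APIs. ARC still manages the
// buffer object itself, but element reads and writes use raw pointers.
final class IntBuffer: ManagedBuffer<Int, Int> {  // header = element count
  static func make(count: Int) -> IntBuffer {
    let buffer = IntBuffer.create(minimumCapacity: count) { _ in count }
    buffer.withUnsafeMutablePointerToElements { elements in
      elements.initialize(repeating: 0, count: count)
    }
    return buffer as! IntBuffer
  }

  func sum() -> Int {
    withUnsafeMutablePointers { header, elements in
      // No retain/release in this loop: we operate on raw storage.
      var total = 0
      for i in 0..<header.pointee { total += elements[i] }
      return total
    }
  }
}
```

Under this design the remaining ARC traffic is the retain/release on the buffer class instance itself, e.g. whenever it is copied into or out of an `Array`, which is the cost the PR's `UnmanagedBuffer` is intended to eliminate.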

saeta added a commit that referenced this pull request Jun 10, 2020
Note: this change (by itself) does not reduce ARC traffic, but in concert
with `UnmanagedBuffer` (#41), we see a ~15x performance improvement in
the parallelFor benchmark.
@texasmichelle texasmichelle changed the base branch from master to main December 8, 2020 01:29
