
Implement UnmanagedBuffer and switch TaskDeque to use it.#41

Open
saeta wants to merge 2 commits into main from unmanaged-buffer

Conversation


@saeta saeta commented May 24, 2020

Previously, `TaskDeque` was implemented in terms of `ManagedBuffer`. While
`ManagedBuffer` implements the semantics we'd like, it is implemented as a
class. This can induce a significant amount of reference counting traffic,
especially when stored in `Array`s.

`UnmanagedBuffer` implements a similar interface to `ManagedBuffer`, but
instead uses manual pointer allocation and management. This allows us to
avoid all reference counting traffic, at the cost of requiring explicit
destruction.
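To make the trade-off concrete, here is a hedged sketch of what an `UnmanagedBuffer`-like type could look like. The names and API below are illustrative assumptions, not the PR's actual implementation: like `ManagedBuffer` it stores a header followed by tail-allocated elements, but as a struct wrapping a raw allocation, so copies move a pointer with no retain/release, and `deallocate()` must be called explicitly.

```swift
// Hypothetical sketch of an UnmanagedBuffer-like type (illustrative, not
// the PR's actual code). A header plus `capacity` tail-allocated elements
// live in one manually managed raw allocation; no ARC is involved.
struct UnmanagedBuffer<Header, Element> {
  private let ptr: UnsafeMutableRawPointer
  let capacity: Int

  // Offset of the element region, rounded up to the element alignment.
  private static var elementOffset: Int {
    let align = MemoryLayout<Element>.alignment
    return (MemoryLayout<Header>.stride + align - 1) & ~(align - 1)
  }

  init(capacity: Int, header: Header) {
    self.capacity = capacity
    ptr = UnsafeMutableRawPointer.allocate(
      byteCount: Self.elementOffset + MemoryLayout<Element>.stride * capacity,
      alignment: max(MemoryLayout<Header>.alignment,
                     MemoryLayout<Element>.alignment))
    ptr.initializeMemory(as: Header.self, repeating: header, count: 1)
  }

  var header: Header {
    get { ptr.assumingMemoryBound(to: Header.self).pointee }
    nonmutating set { ptr.assumingMemoryBound(to: Header.self).pointee = newValue }
  }

  private var elements: UnsafeMutablePointer<Element> {
    (ptr + Self.elementOffset).assumingMemoryBound(to: Element.self)
  }

  // Assumes trivial elements; non-trivial elements would need explicit
  // initialize/deinitialize by the caller.
  subscript(i: Int) -> Element {
    get { elements[i] }
    nonmutating set { elements[i] = newValue }
  }

  // Explicit destruction: unlike a class, nothing runs automatically.
  func deallocate() {
    ptr.assumingMemoryBound(to: Header.self).deinitialize(count: 1)
    ptr.deallocate()
  }
}
```

Because the type is a struct holding only a pointer and an `Int`, storing it in an `Array` or passing it between functions generates no retain/release calls, which is the source of the speedups measured below.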

Switching `TaskDeque` from `ManagedBuffer` to `UnmanagedBuffer` yields
between a 2x and 6x performance improvement for key workloads that stress
the `TaskDeque` data structure within the `NonBlockingThreadPool`.

Below are performance numbers from two machines and operating systems
demonstrating the improvements. Note: because this change has been
extracted from a stack of related performance improvements, benchmarking
this PR by itself will not show the expected speedups. Instead, this PR
has been separated out to facilitate easier reviewing.

Benchmark numbers on machine A:
-------------------------------

Before (from previous commit):

```
name                                                                   time         std                   iterations
--------------------------------------------------------------------------------------------------------------------
NonBlockingThreadPool: join, one level                                 700.0 ns     ± 70289.84998218225   127457
NonBlockingThreadPool: join, two levels                                2107.0 ns    ± 131041.5070696377   31115
NonBlockingThreadPool: join, three levels                              4960.0 ns    ± 178122.9562964306   15849
NonBlockingThreadPool: join, four levels, three on thread pool thread  5893.0 ns    ± 224021.47900401088  13763
NonBlockingThreadPool: parallel for, one level                         22420.0 ns   ± 203689.69689780468  7581
NonBlockingThreadPool: parallel for, two levels                        500985.5 ns  ± 642136.0139757036   1390
```

After:

```
name                                                                   time         std                   iterations
--------------------------------------------------------------------------------------------------------------------
NonBlockingThreadPool: join, one level                                 429.0 ns     ± 59095.36128834737   223050
NonBlockingThreadPool: join, two levels                                1270.0 ns    ± 101587.48601579959  64903
NonBlockingThreadPool: join, three levels                              3098.0 ns    ± 165407.1669656578   28572
NonBlockingThreadPool: join, four levels, three on thread pool thread  3990.5 ns    ± 227217.34017343252  10000
NonBlockingThreadPool: parallel for, one level                         16853.0 ns   ± 260015.39296821563  8660
NonBlockingThreadPool: parallel for, two levels                        563926.0 ns  ± 609298.6358076902   2189
```

Benchmark numbers from machine B:
---------------------------------

Before:

```
name                                                                   time          std                   iterations
---------------------------------------------------------------------------------------------------------------------
NonBlockingThreadPool: join, one level                                 3022.0 ns     ± 366686.3050127019   21717
NonBlockingThreadPool: join, two levels                                13313.5 ns    ± 550429.476815564    5970
NonBlockingThreadPool: join, three levels                              39009.5 ns    ± 716172.9687807652   3546
NonBlockingThreadPool: join, four levels, three on thread pool thread  341631.0 ns   ± 767483.9227743072   2367
NonBlockingThreadPool: parallel for, one level                         404375.0 ns   ± 590178.6724299589   3123
NonBlockingThreadPool: parallel for, two levels                        1000872.0 ns  ± 1592704.2766365155  805
```

After:

```
name                                                                   time          std                   iterations
---------------------------------------------------------------------------------------------------------------------
NonBlockingThreadPool: join, one level                                 749.0 ns      ± 174096.69284101788  91247
NonBlockingThreadPool: join, two levels                                12046.5 ns    ± 414670.5686344325   5920
NonBlockingThreadPool: join, three levels                              46975.0 ns    ± 543858.2306554643   3561
NonBlockingThreadPool: join, four levels, three on thread pool thread  559837.0 ns   ± 591477.1893574063   2795
NonBlockingThreadPool: parallel for, one level                         66446.0 ns    ± 627245.5098742851   2236
NonBlockingThreadPool: parallel for, two levels                        1668739.0 ns  ± 1536323.375783659   765
```

@saeta saeta requested a review from dabrahams May 24, 2020 23:36
@saeta saeta force-pushed the unmanaged-buffer branch from 88e8df8 to 9e31153 on May 24, 2020 23:38
saeta added a commit that referenced this pull request May 25, 2020
Note: this change (by itself) does not reduce ARC traffic, but in concert
with `UnmanagedBuffer` (#41), we see a ~15x performance improvement in
the parallelFor benchmark.

saeta commented May 25, 2020

#33


@dabrahams dabrahams left a comment


If possible I'd like to better understand where the ARC traffic was coming from. The rationale in the description doesn't quite compute for me: storing multiple refcounted things in an array doesn't increase ARC traffic; if anything, compared with storing them in something like a struct, it reduces ARC traffic from N to 1. It should (generally) be possible to use `ManagedBuffer` in such a way that ARC is minimized by relying on unsafe (buffer) pointer APIs.
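For comparison, the alternative the review alludes to can be sketched as follows: subclass `ManagedBuffer` and touch its storage only through the unsafe-pointer APIs, so element accesses incur no per-element retain/release. The `IntBuffer` name and its methods are illustrative assumptions, not code from this repository.

```swift
// Illustrative only: a ManagedBuffer subclass whose storage is accessed
// exclusively through its unsafe pointer APIs. ARC still manages the
// buffer object itself, but element reads and writes use raw pointers.
final class IntBuffer: ManagedBuffer<Int, Int> {  // header = element count
  static func make(count: Int) -> IntBuffer {
    let buffer = IntBuffer.create(minimumCapacity: count) { _ in count }
    buffer.withUnsafeMutablePointerToElements { elements in
      elements.initialize(repeating: 0, count: count)
    }
    return buffer as! IntBuffer
  }

  func sum() -> Int {
    withUnsafeMutablePointers { header, elements in
      // No retain/release in this loop: we operate on raw storage.
      var total = 0
      for i in 0..<header.pointee { total += elements[i] }
      return total
    }
  }
}
```

Under this design the remaining ARC traffic is the retain/release on the buffer class instance itself, e.g. whenever it is copied into or out of an `Array`, which is the cost the PR's `UnmanagedBuffer` is intended to eliminate.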

saeta added a commit that referenced this pull request Jun 10, 2020
Note: this change (by itself) does not reduce ARC traffic, but in concert
with `UnmanagedBuffer` (#41), we see a ~15x performance improvement in
the parallelFor benchmark.
@texasmichelle texasmichelle changed the base branch from master to main December 8, 2020 01:29
