Implement UnmanagedBuffer and switch TaskDeque to use it.#41
Open
Implement UnmanagedBuffer and switch TaskDeque to use it.#41
UnmanagedBuffer and switch TaskDeque to use it.#41Conversation
Previously, `TaskDeque` was implemented in terms of `ManagedBuffer`. While `ManagedBuffer` implements the semantics we'd like, it is implemented as a class. This can induce a significant amount of reference couting traffic, especially when stored in `Array`s. `UnmanagedBuffer` implements a similar interface to `ManagedBuffer`, but instead uses manual pointer allocation and management. This allows us to avoid all reference counting traffic, at the cost of requiring explicit destruction. Switching `TaskDeque` from `ManagedBuffer` to `UnmanagedBuffer` yields between a 2x and 6x performance improvemenet for key workloads that stress the `TaskDeque` data structure within the `NonBlockingThreadPool`. Below are performance numbers across 2 machines & operating systems demonstrating performance improvements. Note: because this change has been extracted from a stack of related performance improvements, if you benchmark this PR itself, you will not see the expected performance improvements. Instead, this PR has been separated out to facilitate easier reviewing. Benchmark numbers on machine A: ------------------------------- Before (from previous commit): ``` name time std iterations -------------------------------------------------------------------------------------------------------------------- NonBlockingThreadPool: join, one level 700.0 ns ± 70289.84998218225 127457 NonBlockingThreadPool: join, two levels 2107.0 ns ± 131041.5070696377 31115 NonBlockingThreadPool: join, three levels 4960.0 ns ± 178122.9562964306 15849 NonBlockingThreadPool: join, four levels, three on thread pool thread 5893.0 ns ± 224021.47900401088 13763 NonBlockingThreadPool: parallel for, one level 22420.0 ns ± 203689.69689780468 7581 NonBlockingThreadPool: parallel for, two levels 500985.5 ns ± 642136.0139757036 1390 ``` After: ``` name time std iterations -------------------------------------------------------------------------------------------------------------------- NonBlockingThreadPool: join, one level 429.0 ns ± 59095.36128834737 223050 NonBlockingThreadPool: join, two levels 1270.0 ns ± 101587.48601579959 64903 NonBlockingThreadPool: join, three levels 3098.0 ns ± 165407.1669656578 28572 NonBlockingThreadPool: join, four levels, three on thread pool thread 3990.5 ns ± 227217.34017343252 10000 NonBlockingThreadPool: parallel for, one level 16853.0 ns ± 260015.39296821563 8660 NonBlockingThreadPool: parallel for, two levels 563926.0 ns ± 609298.6358076902 2189 ``` Benchmark numbers from machine B: --------------------------------- Before: ``` name time std iterations --------------------------------------------------------------------------------------------------------------------- NonBlockingThreadPool: join, one level 3022.0 ns ± 366686.3050127019 21717 NonBlockingThreadPool: join, two levels 13313.5 ns ± 550429.476815564 5970 NonBlockingThreadPool: join, three levels 39009.5 ns ± 716172.9687807652 3546 NonBlockingThreadPool: join, four levels, three on thread pool thread 341631.0 ns ± 767483.9227743072 2367 NonBlockingThreadPool: parallel for, one level 404375.0 ns ± 590178.6724299589 3123 NonBlockingThreadPool: parallel for, two levels 1000872.0 ns ± 1592704.2766365155 805 ``` After: ``` name time std iterations --------------------------------------------------------------------------------------------------------------------- NonBlockingThreadPool: join, one level 749.0 ns ± 174096.69284101788 91247 NonBlockingThreadPool: join, two levels 12046.5 ns ± 414670.5686344325 5920 NonBlockingThreadPool: join, three levels 46975.0 ns ± 543858.2306554643 3561 NonBlockingThreadPool: join, four levels, three on thread pool thread 559837.0 ns ± 591477.1893574063 2795 NonBlockingThreadPool: parallel for, one level 66446.0 ns ± 627245.5098742851 2236 NonBlockingThreadPool: parallel for, two levels 1668739.0 ns ± 1536323.375783659 765 ```
saeta
added a commit
that referenced
this pull request
May 25, 2020
Note: this change (by itself) does not reduce ARC traffic, but in concert with `UnmanagedBuffer` (#41), we see a ~15x performance improvement in the parallelFor benchmark.
Owner
Author
dabrahams
reviewed
May 25, 2020
Collaborator
dabrahams
left a comment
There was a problem hiding this comment.
If possible I'd like to better understand where the ARC traffic was coming from. The rationale in the description doesn't quite compute for me, because storing multiple refcounted things in an array doesn't increase ARC traffic; if anything, by comparison with storing them in something like a struct it reduces ARC traffic from N to 1. It should (generally) be possible to use ManagedBuffer in such a way that ARC is minimized by relying on unsafe (buffer) pointer APIs.
saeta
added a commit
that referenced
this pull request
Jun 10, 2020
Note: this change (by itself) does not reduce ARC traffic, but in concert with `UnmanagedBuffer` (#41), we see a ~15x performance improvement in the parallelFor benchmark.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Previously,
TaskDequewas implemented in terms ofManagedBuffer. WhileManagedBufferimplements the semantics we'd like, it is implemented as aclass. This can induce a significant amount of reference couting traffic,
especially when stored in
Arrays.UnmanagedBufferimplements a similar interface toManagedBuffer, butinstead uses manual pointer allocation and management. This allows us to
avoid all reference counting traffic, at the cost of requiring explicit
destruction.
Switching
TaskDequefromManagedBuffertoUnmanagedBufferyieldsbetween a 2x and 6x performance improvemenet for key workloads that stress
the
TaskDequedata structure within theNonBlockingThreadPool.Below are performance numbers across 2 machines & operating systems
demonstrating performance improvements. Note: because this change has been
extracted from a stack of related performance improvements, if you benchmark
this PR itself, you will not see the expected performance improvements.
Instead, this PR has been separated out to facilitate easier reviewing.
Benchmark numbers on machine A:
Before (from previous commit):
After:
Benchmark numbers from machine B:
Before:
After: