In-place guarantees for sort by gevtushenko · Pull Request #499 · NVIDIA/cub

gevtushenko · 2022-05-31T07:29:47Z

This PR clarified documentation on in-place guarantees for our sort facilities. I'll briefly mention the motivation behind these guarantees:

segmented sort:
1.1. in/out: this overload allocates an intermediate storage so that input data is only read once. Nonetheless, sort facilities for small segments bypass this storage. This facilities use LOAD_LDG, which prohibits reading and writing into the same memory. Potentially, we might permit this overload to work in-place, but it'd take significant effort for re-tuning. It'd also create a false impression that the double buffer overload can alias data as well. I suggest we postpone this until there's a request. If there's one, we could create an overload that takes one set of values to emphasize in-place guarantees.
1.2. double buffer: inherits this limitation from segmented radix sort.
segmented radix sort:
2.1. in/out: this overload allocates an intermediate storage so that input data is only read once. Nonetheless, short-cut single tile kernel exist. This kernel bypasses the intermediate storage and uses LOAD_LDG. Motivation is the same as for the segmented sort. Besides that, we have a plan to rewrite this kernel. New kernel would be restricted by the LOAD_LDG as well.
2.2. double buffer: Main algorithm consists of three repeating steps. At the upsweep step, CTA loads keys and converts them into bin ids. These ids are used to calculate a private histogram of keys assigned to it. At the scan step, private histograms
are converted into a prefix sum that represents offsets of each private bin. At the final downsweep step, keys are loaded again to compute the local bin id as well as the offset within the local bin. This process is repeated until all bytes of keys are covered. Even in the serial implementation, radix sort version that is used in cub doesn't allow in-place execution. There's a data race at the downsweep step if the input and output arrays are aliased. Some CTA might overwrite k1 by a k2 before k1 is read. In this case, k1 will be lost and k2 will be stored elsewhere, overwritting some k3.
radix sort:
3.1 in/out: same motivation as in segmented version
3.2 double buffer: same motivation as in segmented version. Besides that, there's a one-sweep version of the algorithm that's based on the decoupled lookback. It's still unsafe to alias input and output arrays there because successor CTA might read overwritten data.

canonizer · 2022-05-31T19:48:06Z

To clarify why there's a problem in radix-sort (both upsweep/downsweep and onesweep).

@senior-zero As you've mentioned, if for a binning iteration (the downsweep or onesweep kernel), the input and output buffers are the same, there would be a race condition. Therefore, those buffers should be different.

Regarding whether it's possible for the input and output buffers to be the same (in the in/out version), it really depends on the number of binning iterations. If it's even, it's possible without extra overhead (the input data will be overwritten, of course). If it's odd, then it's only possible at the cost of an extra copy (again, the input data is overwritten). This is because the in/out version allocates extra space, the size of the input buffer.

The real problem, however, is that it's hard to guarantee that the number of binning iterations would be even. It depends on begin_bit, end_bit and the digit size (RADIX_BITS), which, in turn, depends on the GPU used. And if in-place sorting is used with the odd number of binning iterations, there's the copy overhead, which is non-negligible.

So if in-place-like behavior is desired, it's always best to use the double-buffer version. There, it doesn't really matter in which of the buffers the end result ends up, so it doesn't matter if the number of binning iterations is odd or even. In addition, it doesn't allocate any temporary storage that's equal in size to the input buffer, and so uses less memory.

gevtushenko added 4 commits May 30, 2022 20:55

In-place guarantees for segmented sort

66ea19b

Reformat segmented radix sort docs

5a66f1f

Add in-place guarantees for seg rad sort

27312ac

Reformat radix sort docs

fe39cd3

gevtushenko requested a review from alliepiper May 31, 2022 07:29

gevtushenko added only: docs Documentation changes only. Doesn't need code CI. area: docs Related to documentation. labels May 31, 2022

gevtushenko mentioned this pull request Jun 3, 2022

What operations can be done in-place? #238

Closed

alliepiper added type: enhancement New feature or request. P1: should have Necessary, but not critical. labels Jun 3, 2022

alliepiper added this to the 2.0.0 milestone Jun 3, 2022

alliepiper approved these changes Jun 24, 2022

View reviewed changes

gevtushenko merged commit dca1f11 into NVIDIA:main Jun 25, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

In-place guarantees for sort#499

In-place guarantees for sort#499
gevtushenko merged 4 commits into
NVIDIA:mainfrom
gevtushenko:enh-main/github/in_place_sort

gevtushenko commented May 31, 2022

Uh oh!

canonizer commented May 31, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

gevtushenko commented May 31, 2022

Uh oh!

canonizer commented May 31, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants