In-place guarantees for sort#499
Conversation
|
To clarify why there's a problem in radix-sort (both upsweep/downsweep and onesweep). @senior-zero As you've mentioned, if for a binning iteration (the downsweep or onesweep kernel), the input and output buffers are the same, there would be a race condition. Therefore, those buffers should be different. Regarding whether it's possible for the input and output buffers to be the same (in the in/out version), it really depends on the number of binning iterations. If it's even, it's possible without extra overhead (the input data will be overwritten, of course). If it's odd, then it's only possible at the cost of an extra copy (again, the input data is overwritten). This is because the in/out version allocates extra space, the size of the input buffer. The real problem, however, is that it's hard to guarantee that the number of binning iterations would be even. It depends on So if in-place-like behavior is desired, it's always best to use the double-buffer version. There, it doesn't really matter in which of the buffers the end result ends up, so it doesn't matter if the number of binning iterations is odd or even. In addition, it doesn't allocate any temporary storage that's equal in size to the input buffer, and so uses less memory. |
This PR clarified documentation on in-place guarantees for our sort facilities. I'll briefly mention the motivation behind these guarantees:
1.1. in/out: this overload allocates an intermediate storage so that input data is only read once. Nonetheless, sort facilities for small segments bypass this storage. This facilities use
LOAD_LDG, which prohibits reading and writing into the same memory. Potentially, we might permit this overload to work in-place, but it'd take significant effort for re-tuning. It'd also create a false impression that the double buffer overload can alias data as well. I suggest we postpone this until there's a request. If there's one, we could create an overload that takes one set of values to emphasize in-place guarantees.1.2. double buffer: inherits this limitation from segmented radix sort.
2.1. in/out: this overload allocates an intermediate storage so that input data is only read once. Nonetheless, short-cut single tile kernel exist. This kernel bypasses the intermediate storage and uses
LOAD_LDG. Motivation is the same as for the segmented sort. Besides that, we have a plan to rewrite this kernel. New kernel would be restricted by theLOAD_LDGas well.2.2. double buffer: Main algorithm consists of three repeating steps. At the upsweep step, CTA loads keys and converts them into bin ids. These ids are used to calculate a private histogram of keys assigned to it. At the scan step, private histograms
are converted into a prefix sum that represents offsets of each private bin. At the final downsweep step, keys are loaded again to compute the local bin id as well as the offset within the local bin. This process is repeated until all bytes of keys are covered. Even in the serial implementation, radix sort version that is used in cub doesn't allow in-place execution. There's a data race at the downsweep step if the input and output arrays are aliased. Some CTA might overwrite
k1by ak2beforek1is read. In this case,k1will be lost andk2will be stored elsewhere, overwritting somek3.3.1 in/out: same motivation as in segmented version
3.2 double buffer: same motivation as in segmented version. Besides that, there's a one-sweep version of the algorithm that's based on the decoupled lookback. It's still unsafe to alias input and output arrays there because successor CTA might read overwritten data.