-
Notifications
You must be signed in to change notification settings - Fork 338
Description
[AllReduce]:
Assume,
- total rank number: n
- each rank with an array of length N (i.e., N elements in the array)
For AllReduce with ring algorithm, take the case of N >= n && N % n == 0,
e.g., N=n=4 ,
- In ReduceScatter phase : each rank sends (N / n) * (n - 1)
- In AllGather phase: each rank sends (N / n) * (n - 1)
Hence each rank sends 2 * (N / n ) * (n - 1 ) for one AllReduce operation.
Then all the ranks send total 2 * ( N / n ) * (n - 1) * n = 2 * N * (n - 1) for one AllReduce operation.
Given “each rank has a bandwidth to the outside world of B”, total TX bandwidth is n * B.
Hence , t = 2 * N * (n -1 ) * sizeof (element’s datatype) / ( n * B) = ( N * sizeof (element’s datatype) / B ) * ( 2 * (n -1 ) / n )
Is above deduction correct for NCCL AllReduce? Some questions as below. Could please help clarify/correct?
[Questions]:
Compared to “ t = (S/B) * (2*(n-1)/n) ”, it looks S = N * sizeof (element’s datatype). But it is not the expected value in PERFORMANCE.md.
It is saying " Note that here, S is the size in bytes of the total array, which for NCCL is equal to recvcount*sizeof(datatype)*n as the recvcount argument is the count per rank." It seems that “the count per rank” refers to the length N (of the array). Then S = N * sizeof(datatype) * n, which includes all the elements from all the arrays from all ranks.
With such “S” , the “t” will be n times of above t.
According to “algbw = S/t” , it also implies S is equal to all the bytes from all the ranks?