cuEmbed is an open-source, header-only CUDA kernel library that accelerates embedding lookups. It aims for high memory-bandwidth utilization by maximizing the number of loads in flight when accessing embedding rows, and it makes extensive use of C++ templates and compile-time specialization so that a small number of kernels optimized for memory-level parallelism can cover a wide variety of lookup configurations. All of this is intended to make it easy for developers to achieve high performance on embedding lookups in their CUDA programs.
Supported Operations:
- Forward propagation (fixed-hotness or CSR index formats; see the illustration after this list).
- Backward propagation (COO index format, full or compressed gradients).
- Index transformations (e.g., transpose).
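The two forward index formats differ in how a batch of lookups is described: with fixed hotness, every sample supplies the same number of indices, while CSR uses an offsets array to mark where each sample's indices begin. A small illustration with made-up data (assuming the common CSR convention of batch_size + 1 offsets):

```cpp
// Fixed hotness: every sample looks up exactly `hotness` rows.
// batch_size = 3, hotness = 2
int fixed_indices[] = {4, 7, /* sample 0 */ 1, 9, /* sample 1 */ 3, 3 /* sample 2 */};

// CSR: samples may look up different numbers of rows. offsets[i] marks where
// sample i's indices start; the final entry is the total index count.
// batch_size = 3
int csr_indices[] = {4, 7, 1, 9, 5, 3};
int csr_offsets[] = {0, 2, 5, 6};  // sample 0: {4, 7}, sample 1: {1, 9, 5}, sample 2: {3}
```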
cuEmbed is still under active development. We aim to keep the host API stable, but users should expect changes to the kernel API and the corresponding operation abstractions.
The core components of cuEmbed are the kernel headers in the cuembed/include directory. These files have minimal dependencies on third-party libraries and are safe to copy into other projects.
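Because the library is header-only, using it is a matter of adding the repository to your include path and including the kernel headers you need. A minimal sketch (the exact header name below is an assumption; check cuembed/include for the actual files):

```cpp
// Hypothetical header path -- consult cuembed/include for the actual file names.
#include "cuembed/include/embedding_lookup.cuh"
```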
We recommend using CMake Package Manager (CPM) to fetch cuEmbed into your project. With CPM, getting cuEmbed is easy:
```cmake
CPMAddPackage(
  NAME cuembed
  GIT_REPOSITORY https://github.com/NVIDIA/cuEmbed.git
  GIT_TAG main
  OPTIONS
    "BUILD_TESTS OFF"
    "BUILD_BENCHMARKS OFF"
)

target_link_libraries(my_library cuembed)
```
The following example from utils/src/embedding_allocation.cu covers the basic usage of the host API for running forward propagation:
```cpp
template <typename ElemT, typename IndexT, typename OffsetT, bool fp16_math>
void RunForward(const utils::AllocationOptions& options,
                const thrust::device_vector<ElemT>& embedding,
                const thrust::device_vector<IndexT>& indices,
                const thrust::device_vector<OffsetT>& offsets,
                const thrust::device_vector<ElemT>& weights,
                thrust::device_vector<ElemT>* result) {
  // CSR inputs pass an offsets array and set hotness to 0; fixed-hotness
  // inputs pass a null offsets pointer and a nonzero hotness.
  const OffsetT* offsets_ptr = nullptr;
  int hotness = options.hotness();
  if (options.is_csr()) {
    offsets_ptr = offsets.data().get();
    hotness = 0;
  }
  // Per-sample weights are optional; pass nullptr for an unweighted lookup.
  const ElemT* weight_ptr = nullptr;
  if (options.is_weighted()) {
    weight_ptr = weights.data().get();
  }
  using InputT = ElemT;
  using OutputT = ElemT;
  EmbeddingForward<InputT, OutputT, IndexT, OffsetT, fp16_math>(
      embedding.data().get(),
      options.embed_width(),
      indices.data().get(),
      offsets_ptr,
      weight_ptr,
      options.batch_size(),
      hotness,
      options.combine_mode(),
      result->data().get());
}
```

In the example above, we call EmbeddingForward with the relevant device pointers: the embedding table (embedding), the embedding row indices (indices), the offsets indicating the starting position of each sample's indices (offsets), the optional per-sample weights (weights), and the output buffer (result), along with the workload description (embed_width, hotness, batch_size, and combine_mode, unwrapped from options). The result of the embedding lookup is written into result.
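To make the call's semantics concrete, the following CPU reference computes a fixed-hotness, optionally weighted sum lookup. This is an illustrative sketch of the math only, not cuEmbed code; the function name and plain-array interface are ours:

```cpp
#include <vector>

// Reference semantics: for each sample b, result[b] = sum over its `hotness`
// indices of weight * embedding_row. Illustration only, not part of cuEmbed.
template <typename ElemT, typename IndexT>
void ReferenceForward(const std::vector<ElemT>& embedding,  // num_rows x embed_width
                      int embed_width,
                      const std::vector<IndexT>& indices,   // batch_size x hotness
                      const ElemT* weights,                 // optional, batch_size x hotness
                      int batch_size,
                      int hotness,
                      std::vector<ElemT>* result) {         // batch_size x embed_width
  result->assign(static_cast<size_t>(batch_size) * embed_width, ElemT(0));
  for (int b = 0; b < batch_size; ++b) {
    for (int h = 0; h < hotness; ++h) {
      const size_t row = static_cast<size_t>(indices[b * hotness + h]);
      const ElemT w = weights ? weights[b * hotness + h] : ElemT(1);
      for (int e = 0; e < embed_width; ++e) {
        (*result)[static_cast<size_t>(b) * embed_width + e] +=
            w * embedding[row * embed_width + e];
      }
    }
  }
}
```

A mean reduction would additionally divide each sum by the sample's index count, and concatenation would write each looked-up row to the output unreduced.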
Please refer to utils/src/embedding_allocation.cu for more examples, including index transposition and backward propagation.
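For intuition on the transpose step: the forward pass consumes indices grouped by sample, while the backward pass wants gradient contributions grouped by embedding row, so the transpose produces a COO layout sorted by row. A rough host-side sketch of the idea (our own illustrative code, not cuEmbed's kernel):

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Illustrative index transpose for a fixed-hotness batch: emit (row, sample)
// COO pairs and sort by row so duplicate rows become adjacent, which is what
// lets the backward pass accumulate per-row (or compressed) gradients.
std::vector<std::pair<int, int>> TransposeIndices(const std::vector<int>& indices,
                                                  int batch_size,
                                                  int hotness) {
  std::vector<std::pair<int, int>> coo;
  coo.reserve(indices.size());
  for (int b = 0; b < batch_size; ++b) {
    for (int h = 0; h < hotness; ++h) {
      coo.emplace_back(indices[b * hotness + h], b);
    }
  }
  std::sort(coo.begin(), coo.end());
  return coo;
}
```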
Detailed descriptions of the full API and parameters can be found in cuembed/README.md.
Since cuEmbed is header-only, there is nothing to build to use it. To build the tests and benchmarks:
```bash
git clone --recursive https://github.com/NVIDIA/cuEmbed.git
cd cuembed
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make
```

Binaries will be built into:
- build/tests
- build/benchmarks
To run benchmarks locally:
```bash
cd benchmarks/
./sweep_parameters.sh
```

Manual benchmarking can be done with the manual_benchmark binary in the benchmarks folder. This will run the forward, transpose, and backward stages.
Example:
```bash
./bin/benchmarks/manual_benchmark --num_categories 10000000 --embed_width 256 --batch_size 65536 --alpha=1.15 --hotness=64 --csr_input=false --half_embedding_type=true --weighted_sum=false --compressed_grad=true
```

| Feature | Supported In Current Release | Future Release |
|---|---|---|
| Embedding table size | single table, single GPU | multiple tables, multiple devices |
| Embedding cache integration | no | yes |
| Embedding & Output types | fp32, fp16 | bf16 |
| Lookup Index types | int32_t, int64_t | |
| Lookup Index Layout (fwd) | fixed hotness, CSR | COO |
| Lookup Index Layout (bwd) | COO | |
| Reduction type (fwd) | weighted sum, concat, mean | |
| Reduction type (bwd) | weighted sum, concat | mean |
| Reduction precision | fp32, fp16 | bf16 |
| Kernel type | fwd, bwd, transpose | optimizer |
Requirements:
- nvcc 12.0+
- C++17
- Volta+ GPU