cuEmbed is an open-source, header-only CUDA kernel library that accelerates embedding lookups. It aims for high memory-bandwidth utilization by maximizing the number of loads in flight when accessing embedding rows, and it makes extensive use of C++ templates and compile-time specialization so that a small number of kernels optimized for memory-level parallelism can cover a wide variety of lookup configurations. All of this is intended to make it easy for developers to achieve high performance on embedding lookups in their CUDA programs.
Supported Operations:
- Forward propagation (fixed-hotness or CSR index formats; see the illustration after this list).
- Backward propagation (COO index format, full or compressed gradients).
- Index transformations (e.g., transpose).
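The two forward index formats differ in how a batch of lookups is described: with fixed hotness, every sample supplies the same number of indices, while CSR uses an offsets array to mark where each sample's indices begin. A small illustration with made-up data (assuming the common CSR convention of batch_size + 1 offsets):

```cpp
// Fixed hotness: every sample looks up exactly `hotness` rows.
// batch_size = 3, hotness = 2
int fixed_indices[] = {4, 7, /* sample 0 */ 1, 9, /* sample 1 */ 3, 3 /* sample 2 */};

// CSR: samples may look up different numbers of rows. offsets[i] marks where
// sample i's indices start; the final entry is the total index count.
// batch_size = 3
int csr_indices[] = {4, 7, 1, 9, 5, 3};
int csr_offsets[] = {0, 2, 5, 6};  // sample 0: {4, 7}, sample 1: {1, 9, 5}, sample 2: {3}
```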
cuEmbed is still under active development. We aim to keep the host API stable, but users should expect changes to the kernel API and the corresponding operation abstractions.
The core components of cuEmbed are the kernel headers in the cuembed/include directory. These files have minimal dependencies on third-party libraries and are safe to copy into other projects.
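Because the library is header-only, using it is a matter of adding the repository to your include path and including the kernel headers you need. A minimal sketch (the exact header name below is an assumption; check cuembed/include for the actual files):

```cpp
// Hypothetical header path -- consult cuembed/include for the actual file names.
#include "cuembed/include/embedding_lookup.cuh"
```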
We recommend using CMake Package Manager (CPM) to fetch cuEmbed into your project. With CPM, getting cuEmbed is easy:
```cmake
CPMAddPackage(
  NAME cuembed
  GIT_REPOSITORY https://github.com/NVIDIA/cuEmbed.git
  GIT_TAG main
  OPTIONS
    "BUILD_TESTS OFF"
    "BUILD_BENCHMARKS OFF"
)

target_link_libraries(my_library cuembed)
```
The following example from utils/src/embedding_allocation.cu covers the basic usage of the host API for running forward propagation:
```cpp
template <typename ElemT, typename IndexT, typename OffsetT, bool fp16_math>
void RunForward(const utils::AllocationOptions& options,
                const thrust::device_vector<ElemT>& embedding,
                const thrust::device_vector<IndexT>& indices,
                const thrust::device_vector<OffsetT>& offsets,
                const thrust::device_vector<ElemT>& weights,
                thrust::device_vector<ElemT>* result) {
  // CSR inputs pass an offsets array and set hotness to 0; fixed-hotness
  // inputs pass a null offsets pointer and a nonzero hotness.
  const OffsetT* offsets_ptr = nullptr;
  int hotness = options.hotness();
  if (options.is_csr()) {
    offsets_ptr = offsets.data().get();
    hotness = 0;
  }
  // Per-sample weights are optional; pass nullptr for an unweighted lookup.
  const ElemT* weight_ptr = nullptr;
  if (options.is_weighted()) {
    weight_ptr = weights.data().get();
  }
  using InputT = ElemT;
  using OutputT = ElemT;
  EmbeddingForward<InputT, OutputT, IndexT, OffsetT, fp16_math>(
      embedding.data().get(),
      options.embed_width(),
      indices.data().get(),
      offsets_ptr,
      weight_ptr,
      options.batch_size(),
      hotness,
      options.combine_mode(),
      result->data().get());
}
```

In the example above, we call EmbeddingForward with the relevant device pointers: the embedding table (embedding), the embedding row indices (indices), the offsets indicating the starting position of each sample's indices (offsets), the optional per-sample weights (weights), and the output buffer (result), along with the workload description (embed_width, hotness, batch_size, and combine_mode, unwrapped from options). The result of the embedding lookup is written into result.
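To make the call's semantics concrete, the following CPU reference computes a fixed-hotness, optionally weighted sum lookup. This is an illustrative sketch of the math only, not cuEmbed code; the function name and plain-array interface are ours:

```cpp
#include <vector>

// Reference semantics: for each sample b, result[b] = sum over its `hotness`
// indices of weight * embedding_row. Illustration only, not part of cuEmbed.
template <typename ElemT, typename IndexT>
void ReferenceForward(const std::vector<ElemT>& embedding,  // num_rows x embed_width
                      int embed_width,
                      const std::vector<IndexT>& indices,   // batch_size x hotness
                      const ElemT* weights,                 // optional, batch_size x hotness
                      int batch_size,
                      int hotness,
                      std::vector<ElemT>* result) {         // batch_size x embed_width
  result->assign(static_cast<size_t>(batch_size) * embed_width, ElemT(0));
  for (int b = 0; b < batch_size; ++b) {
    for (int h = 0; h < hotness; ++h) {
      const size_t row = static_cast<size_t>(indices[b * hotness + h]);
      const ElemT w = weights ? weights[b * hotness + h] : ElemT(1);
      for (int e = 0; e < embed_width; ++e) {
        (*result)[static_cast<size_t>(b) * embed_width + e] +=
            w * embedding[row * embed_width + e];
      }
    }
  }
}
```

A mean reduction would additionally divide each sum by the sample's index count, and concatenation would write each looked-up row to the output unreduced.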
Please refer to utils/src/embedding_allocation.cu for more examples, including index transposition and backward propagation.
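For intuition on the transpose step: the forward pass consumes indices grouped by sample, while the backward pass wants gradient contributions grouped by embedding row, so the transpose produces a COO layout sorted by row. A rough host-side sketch of the idea (our own illustrative code, not cuEmbed's kernel):

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Illustrative index transpose for a fixed-hotness batch: emit (row, sample)
// COO pairs and sort by row so duplicate rows become adjacent, which is what
// lets the backward pass accumulate per-row (or compressed) gradients.
std::vector<std::pair<int, int>> TransposeIndices(const std::vector<int>& indices,
                                                  int batch_size,
                                                  int hotness) {
  std::vector<std::pair<int, int>> coo;
  coo.reserve(indices.size());
  for (int b = 0; b < batch_size; ++b) {
    for (int h = 0; h < hotness; ++h) {
      coo.emplace_back(indices[b * hotness + h], b);
    }
  }
  std::sort(coo.begin(), coo.end());
  return coo;
}
```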
Detailed descriptions of the full API and parameters can be found in cuembed/README.md.
Since cuEmbed is header-only, there is nothing to build to use it. To build the tests and benchmarks:
```bash
git clone --recursive https://github.com/NVIDIA/cuEmbed.git
cd cuembed
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make
```

Binaries will be built into:
- build/tests
- build/benchmarks
To run benchmarks locally:
```bash
cd benchmarks/
./sweep_parameters.sh
```

Manual benchmarking can be done with the manual_benchmark binary in the benchmarks folder. This will run the forward, transpose, and backward stages.
Example:
```bash
./bin/benchmarks/manual_benchmark --num_categories 10000000 --embed_width 256 --batch_size 65536 --alpha=1.15 --hotness=64 --csr_input=false --half_embedding_type=true --weighted_sum=false --compressed_grad=true
```

| Feature | Supported In Current Release | Future Release |
|---|---|---|
| Embedding table size | single table, single GPU | multiple tables, multiple devices |
| Embedding cache integration | no | yes |
| Embedding & Output types | fp32, fp16 | bf16 |
| Lookup Index types | int32_t, int64_t | |
| Lookup Index Layout (fwd) | fixed hotness, CSR | COO |
| Lookup Index Layout (bwd) | COO | |
| Reduction type (fwd) | weighted sum, concat, mean | |
| Reduction type (bwd) | weighted sum, concat | mean |
| Reduction precision | fp32, fp16 | bf16 |
| Kernel type | fwd, bwd, transpose | optimizer |
Requirements:
- nvcc 12.0+
- C++17
- Volta+ GPU