Support parallel encoding #1261

Closed
Senthil455 wants to merge 2 commits into google:master from Senthil455:feature/parallel-encoding

Senthil455 commented Jun 5, 2026

Support parallel encoding to handle super-long text. When input exceeds 10KB, the text is split into chunks and encoded concurrently using std::async, then the results are stitched back together.

Changes

src/sentencepiece_processor.cc

  • EncodeParallel(input, pieces, num_threads) — new overload returning vector<string>
  • EncodeParallel(input, ids, num_threads) — new overload returning vector<int>
  • EncodeParallel(input, spt, num_threads) — new overload returning SentencePieceText (core implementation)
    • Falls back to serial Encode() for inputs <10KB or when num_threads=1
    • Normalizes the entire input first, then splits into chunks via SplitInputIntoChunks
    • Encodes each chunk with std::async and combines results
    • Calls PopulateSentencePieceText to build the final output with correct byte offsets
  • SplitInputIntoChunks(input, num_chunks) — new helper that splits text at UTF-8 character boundaries (avoids splitting multi-byte sequences)
  • Added #include <future> and #include <thread>
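
The encode-and-combine step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: `Pieces`, `EncodeFn`, and `EncodeChunksInParallel` are hypothetical names, the per-chunk encoder is passed in as a callable rather than being a real SentencePiece call, and the short-input fallback and normalization steps are left out.

```cpp
#include <cassert>
#include <functional>
#include <future>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical aliases for this sketch; not part of the SentencePiece API.
using Pieces = std::vector<std::string>;
using EncodeFn = std::function<Pieces(const std::string&)>;

// Encodes every chunk on its own std::async task and concatenates the
// per-chunk results in input order.
Pieces EncodeChunksInParallel(const std::vector<std::string>& chunks,
                              const EncodeFn& encode) {
  assert(encode && "encode callable must be set");

  std::vector<std::future<Pieces>> futures;
  futures.reserve(chunks.size());
  for (const std::string& chunk : chunks) {
    // std::launch::async forces a separate thread of execution per chunk.
    // The references stay valid because every future is joined below,
    // before this function returns.
    futures.push_back(std::async(std::launch::async,
                                 [&encode, &chunk] { return encode(chunk); }));
  }

  Pieces merged;
  for (auto& f : futures) {
    Pieces part = f.get();  // Blocks until this chunk is done.
    merged.insert(merged.end(),
                  std::make_move_iterator(part.begin()),
                  std::make_move_iterator(part.end()));
  }
  return merged;
}
```

Collecting futures in submission order and calling `get()` in that same order keeps the stitched output deterministic regardless of which chunk finishes first. The callable must be safe to invoke concurrently from several threads.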

src/sentencepiece_processor.h

  • Added EncodeParallel virtual methods (3 overloads) to the public API
  • Added convenience wrappers:
    • EncodeAsPiecesParallel(input, num_threads)
    • EncodeAsIdsParallel(input, num_threads)
    • EncodeAsImmutableProtoParallel(input, num_threads)
  • Added SplitInputIntoChunks as a private helper

src/sentencepiece_processor_test.cc

Added 5 test cases:

| Test | Description |
| --- | --- |
| EncodeParallelShortInputTest | Input <10KB falls back to serial; tests all 3 overloads + convenience wrappers |
| EncodeParallelSingleThreadTest | num_threads=1 matches serial Encode() |
| EncodeParallelInvalidNumThreadsTest | num_threads=0 returns error |
| EncodeParallelConsistentWithSerialTest | 15KB input produces identical results from parallel (4 threads) vs serial |
| EncodeParallelMultiByteUtf8Test | 12KB input with 3-byte UTF-8 chars (あ) across chunk boundaries; parallel matches serial |

Also added a CharByCharModel test stub that tokenizes character-by-character without input verification, enabling parallel-mode testing.

Implementation Details

  • Chunk boundaries are adjusted backward to avoid splitting UTF-8 continuation bytes (string_util::IsTrailByte)
  • Uses std::async(std::launch::async, ...) for thread-safe parallel execution
  • Normalization is done once upfront, then chunks work on the normalized text
  • Final PopulateSentencePieceText maps byte offsets back to original (unnormalized) input

Fixes #1229 (Support parallel encoding)
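
The boundary adjustment from the first bullet above could look roughly like this. It is a minimal sketch, not the PR's `SplitInputIntoChunks`: the function name `SplitAtUtf8Boundaries` is hypothetical, and `IsTrailByte` is a local stand-in for `string_util::IsTrailByte`, assumed here to test for the UTF-8 continuation pattern `10xxxxxx`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Local stand-in (assumption): true for UTF-8 continuation bytes, 10xxxxxx.
inline bool IsTrailByte(char c) {
  return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}

// Hypothetical helper: splits `input` into roughly `num_chunks` pieces
// without cutting through a multi-byte UTF-8 sequence.
std::vector<std::string> SplitAtUtf8Boundaries(const std::string& input,
                                               std::size_t num_chunks) {
  assert(num_chunks > 0);
  std::vector<std::string> chunks;
  // Target chunk size in bytes, rounded up.
  const std::size_t target = (input.size() + num_chunks - 1) / num_chunks;

  std::size_t begin = 0;
  while (begin < input.size()) {
    std::size_t end = std::min(begin + target, input.size());
    // Move the cut backward while it points at a continuation byte.
    while (end > begin && end < input.size() && IsTrailByte(input[end])) {
      --end;
    }
    if (end == begin) {
      // The target is smaller than one character: extend forward instead
      // so that the chunk holds at least one whole character.
      end = begin + 1;
      while (end < input.size() && IsTrailByte(input[end])) ++end;
    }
    chunks.push_back(input.substr(begin, end - begin));
    begin = end;
  }
  return chunks;
}
```

Because cuts only ever move backward from the target position, this sketch can return more chunks than requested; concatenating the chunks always reproduces the input.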

- Add EncodeParallel() methods to SentencePieceProcessor for multi-threaded encoding
- Implement text splitting at UTF-8 boundaries to ensure correctness
- Normalize entire input first, then encode chunks in parallel
- Maintain backward compatibility with existing API
- Add convenient wrapper methods like EncodeAsPiecesParallel()
- Include fallback to single-threaded encoding for short texts
taku910 commented Jun 6, 2026

Collaborator

ParallelEncode has already been implemented. See the head.

There are some comments on the parallel encoding.

  • Simply splitting the input is insufficient. If split at token boundaries, the tokenization result of the divided text may differ from the result of tokenizing the entire text all at once.
  • Passing the number of threads lacks extensibility and prevents coordination with other threads. Therefore, the implementation now passes a ThreadPool. (Internally, passing num_threads has been removed.)
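
The first point can be illustrated with a toy example. The sketch below uses a greedy longest-match tokenizer, which is an assumption chosen for brevity and is not SentencePiece's algorithm; `GreedyTokenize`, `TokenizeInChunks`, and the vocabulary are all hypothetical. It shows that a cut placed at a position that looks like a valid token boundary can still change the result.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Toy greedy longest-match tokenizer (illustration only). At each position
// it takes the longest vocabulary entry; single bytes are always accepted.
std::vector<std::string> GreedyTokenize(const std::string& text,
                                        const std::set<std::string>& vocab) {
  std::vector<std::string> pieces;
  std::size_t pos = 0;
  while (pos < text.size()) {
    std::size_t len = text.size() - pos;
    while (len > 1 && vocab.count(text.substr(pos, len)) == 0) --len;
    pieces.push_back(text.substr(pos, len));
    pos += len;
  }
  return pieces;
}

// Tokenizes the two halves of `text` independently and concatenates them,
// mimicking a naive chunked encoder with a single cut at byte `split`.
std::vector<std::string> TokenizeInChunks(const std::string& text,
                                          std::size_t split,
                                          const std::set<std::string>& vocab) {
  assert(split <= text.size());
  std::vector<std::string> left = GreedyTokenize(text.substr(0, split), vocab);
  std::vector<std::string> right = GreedyTokenize(text.substr(split), vocab);
  left.insert(left.end(), right.begin(), right.end());
  return left;
}
```

With the vocabulary {"ab", "abc", "cd"}, tokenizing "abcd" as a whole yields "abc", "d", while cutting after "ab" yields "ab", "cd". Both are valid segmentations, but they differ, so a chunked encoder needs more than a clean cut position to match the serial result.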

@taku910 taku910 closed this Jun 6, 2026