Conversation

cebtenzzre (Member) commented Jul 30, 2024

The main issue being fixed here is a CUDA crash with certain sizes of long inputs (ggml-org/llama.cpp#8798). For now there is just a workaround, but upstream is working on a proper fix. I cherry-picked the change from ggml-org/llama.cpp#8800 since it seems to work.

Other things I fixed while working on improving context "recalculation":

  • Removed unnecessary logic guarding BOS insertion, now that llama.cpp is better at knowing when BOS is needed
  • Tried to make the token cache match n_past more often (there are still some issues involving interrupted recalculations)
  • Removed the now-unused "logits" and "logits_size" fields from the python binding's prompt context
  • Fixed cases where we started incorrectly inserting leading spaces again after "backend: rebase llama.cpp submodule on latest upstream" (#2694)

Since upstream commit 1b67731e1 ("BERT tokenizer fixes (#6498)"),
llama_tokenize will not add BOS for tokenizers that should not have it.

Since upstream commit 37bef8943 ("tokenizer : BPE fixes (#7530)"),
llama_add_bos_token can be used to confidently determine whether BOS
will be added by llama_tokenize.

The upstream logic to determine whether to add BOS has grown as
tokenizers have been added and improved, so this could fix problems with
a missing BOS, or context recalculation preserving the first token when
it shouldn't.
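
As an illustration of how this is used on our side, here is a minimal sketch. The helper names are placeholders, and the return convention of llama_add_bos_token has varied across llama.cpp versions; only the llama.cpp calls themselves are real API.

```cpp
#include <llama.h>

// Placeholder helper: ask the tokenizer itself whether it will insert BOS.
static bool wantsBOS(const llama_model *model)
{
    // Older llama.cpp returned an int (-1 = unknown); newer versions return a bool.
    // Treat any nonzero value as "BOS will be added".
    return llama_add_bos_token(model) != 0;
}

// Placeholder helper: during context recalculation, preserve the leading token
// only if it is a BOS that llama_tokenize would have inserted anyway.
static size_t tokensToKeepAtFront(const llama_model *model)
{
    return wantsBOS(model) ? 1 : 0;
}
```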

Signed-off-by: Jared Van Bortel <jared@nomic.ai>

The size of the token cache is expected to match n_past during the
decode phase of llmodel_prompt. We should make sure they match at entry,
and never do anything that could cause them to desync.
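
A minimal sketch of the invariant, with placeholder struct and function names (not the backend's actual ones):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Placeholder for the prompt state tracked per conversation.
struct PromptState {
    std::vector<int32_t> tokens; // token cache: what we believe is in the KV cache
    int32_t n_past = 0;          // how many of those tokens the model has already evaluated
};

// Re-establish tokens.size() == n_past at entry to the decode phase, so an
// interrupted recalculation cannot leave the two out of sync.
static void syncTokenCache(PromptState &st)
{
    assert(st.n_past >= 0 && (size_t) st.n_past <= st.tokens.size());
    st.tokens.resize(st.n_past); // drop any cached tokens the model never evaluated
}
```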

Signed-off-by: Jared Van Bortel <jared@nomic.ai>

`logits` does nothing now that GPT-J is removed, so remove the unused
fields.

Signed-off-by: Jared Van Bortel <jared@nomic.ai>

When llama.cpp was updated, I removed the space removal logic, but it
turns out it is still needed. It is now controlled by a proper
parameter, since we want to disable only the *leading* space, and only
when tokenizing input that comes after a normal token.

This fixes a regression in commit 290c629 ("backend: rebase llama.cpp
submodule on latest upstream (#2694)").

Signed-off-by: Jared Van Bortel <jared@nomic.ai>

llama.cpp commit e3c337d87 ("llama : support negative ith in llama_get_
API (#6519)") added a simpler way to get the logits for the last token
in the batch, so use that instead. This also fixes potential issues with
not serializing this value with the rest of the prompt context, although
in practice we should always call evalTokens before
llama_sample_top_p_top_k.
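
A minimal sketch of the new call pattern; the wrapper is illustrative, and only llama_get_logits_ith is the actual llama.cpp API:

```cpp
#include <llama.h>

// Logits of the last token from the most recent llama_decode call.
// Negative indices count from the end of the batch, so -1 is always the last
// token regardless of how the batch was built or what state was serialized.
static const float *lastTokenLogits(llama_context *ctx)
{
    return llama_get_logits_ith(ctx, -1);
}
```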

Signed-off-by: Jared Van Bortel <jared@nomic.ai>

cebtenzzre changed the title from "WIP: llamamodel fixes" to "backend: fix extra spaces and work around CUDA bug" on Jul 31, 2024
cebtenzzre changed the title from "backend: fix extra spaces and work around CUDA bug" to "backend: fix extra spaces in tokenization and a CUDA crash" on Jul 31, 2024
cebtenzzre marked this pull request as ready for review on July 31, 2024 21:24
cebtenzzre requested a review from manyoso on July 31, 2024 21:24