Conversation

cebtenzzre (Member) commented Feb 15, 2024

Here is an example of the kind of broken output this PR attempts to fix:

[Screenshot: Mistral OpenOrca output containing a literal <|im_end|>]

PR #1935 switched Mistral OpenOrca to the official prompt template (ChatML). But users have been seeing <|im_end|> in the output since this change. There are two problems:

  • The current GGUF file was converted 5 months ago by TheBloke. It is out-of-date and does not contain the special tokens.
  • We were still using llama_tokenize with special=false. This was changed to true (which has some caveats).

Now the model is actually seeing tokens 32001 (<|im_start|>) and 32000 (<|im_end|>):

Token output from llama_tokenize

Before:

523:  <
28766: |
321: im
28730: _
2521: start
28766: |
28767: >
1838: user
13:

28708: a
28789: <
28766: |
321: im
28730: _
416: end
28766: |
3409: ><
28766: |
321: im
28730: _
2521: start
28766: |
28767: >
489: ass
11143: istant
13:

After:

Token debug:
32001: <|im_start|>
1838: user
13:

28708: a
32000: <|im_end|>
32001: <|im_start|>
489: ass
11143: istant
13:

One limitation of calling llama_tokenize with special=true is that we can have false-positive special tokens, e.g. if the user input (or LocalDocs context) contains the strings <s> or </s>, which will be interpreted as BOS and EOS, respectively. The proper fix for this is to be able to specify raw token IDs in the prompt template so we can use special=false, but I don't know what the UI for this would look like.
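For illustration, here is a minimal sketch of the tokenization call involved (not code from this PR), assuming the llama.cpp C API as it looked in early 2024, where the final boolean controls whether special tokens are parsed out of the text:

```cpp
#include <llama.h>

#include <algorithm>
#include <string>
#include <vector>

// Tokenize arbitrary text. With `special` = true, literal strings such as "<s>"
// or "</s>" in the text are mapped to the BOS/EOS special tokens; with it set
// to false they are split into ordinary pieces like "<", "/", "s", ">".
static std::vector<llama_token> tokenize(const llama_model *model,
                                         const std::string &text, bool special)
{
    std::vector<llama_token> tokens(text.size() + 8);
    int32_t n = llama_tokenize(model, text.c_str(), (int32_t)text.size(),
                               tokens.data(), (int32_t)tokens.size(),
                               /*add_bos*/ false, /*special*/ special);
    tokens.resize(std::max<int32_t>(n, 0));
    return tokens;
}
```

The false positives all come from the `special = true` path: any substring of the text that happens to match a special token's text is turned into that token's ID.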

Also, I haven't checked what happens to the EOS token (<|im_end|>) in history - for ChatML to work properly, this must come after the assistant's responses.

ThiloteE (Collaborator) commented:

LM Studio allows users to enter stop tokens in the GUI like this:

[Screenshot: LM Studio's stop-token setting in the GUI]

apage43 (Member) commented Feb 16, 2024

> The proper fix for this is to be able to specify raw token IDs in the prompt template so we can use special=false, but I don't know what the UI for this would look like.

What about allowing special tokens as-is in the template but not the input? That way if someone adapts a prompt template from the original HF readme things just work, but typing special tokens into the input box won't cause problems.

That is, instead of filling the template with string substitution and tokenizing the result:

"<special>%1</special>" -> "<special>user input here</special>" -> tokenize("<special>user input here</special>")

you would instead split the template at the marker and do:

concat(tokenize("<special>", allow_special=true), tokenize("user input here", allow_special=false), tokenize("</special>", allow_special=true))
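A rough sketch of that approach (not the final implementation), reusing the tokenize() helper from the earlier sketch and assuming the template contains exactly one %1 placeholder:

```cpp
#include <llama.h>

#include <string>
#include <vector>

// tokenize() as sketched earlier: wraps llama_tokenize with the given `special` flag.
std::vector<llama_token> tokenize(const llama_model *model, const std::string &text, bool special);

// Fill a template such as "<|im_start|>user\n%1<|im_end|>\n<|im_start|>assistant\n"
// at the token level: special tokens are honored in the template halves but not in
// the user's input, so typing "<|im_end|>" or "</s>" into the chat box cannot
// inject real special tokens.
std::vector<llama_token> buildPrompt(const llama_model *model,
                                     const std::string &tmpl,
                                     const std::string &userInput)
{
    auto pos  = tmpl.find("%1");                      // assumed present
    auto out  = tokenize(model, tmpl.substr(0, pos),  /*special*/ true);
    auto user = tokenize(model, userInput,            /*special*/ false);
    auto tail = tokenize(model, tmpl.substr(pos + 2), /*special*/ true);

    out.insert(out.end(), user.begin(), user.end());
    out.insert(out.end(), tail.begin(), tail.end());
    return out;
}
```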

cebtenzzre (Member, Author) commented:

> What about allowing special tokens as-is in the template but not the input? That way if someone adapts a prompt template from the original HF readme things just work, but typing special tokens into the input box won't cause problems.

Implemented. This does seem like the most robust option that is still user-friendly. The one casualty here is the documented ability to override self._format_chat_prompt_template - it is now deprecated (detected by checking the identity of the method) because overriding it causes the prompt and prompt template to be merged into a single string.

cebtenzzre (Member, Author) commented:

I can confirm that <|im_end|> is indeed missing after the model's reply (as well as the newline that would follow it):

pos=86 2526 ' looking'
decode(n_past=87):
pos=87 354 ' for'
decode(n_past=88):
pos=88 28723 '.'
decode(n_past=89):
pos=89 32000 '<|im_end|>'
<snip>
decode(n_past=89):
pos=89 32001 '<|im_start|>'
pos=90 1838 'user'
pos=91 13 '
'
pos=92 287 ' b'

I propose modifying the prompt template so it looks like:

<|im_start|>user
%1<|im_end|>
<|im_start|>assistant
%2<|im_end|>

And then modifying LLModel to understand this format.
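As a hypothetical illustration of that extension (the names below are made up for the sketch, not the actual LLModel API): the template is split at %2, the first half is filled with the user's input and sent to the model as before, and the second half is what gets appended to the chat history once the assistant's reply ends.

```cpp
#include <string>

// Hypothetical helper, not real LLModel code: split a chat-style template like
// "<|im_start|>user\n%1<|im_end|>\n<|im_start|>assistant\n%2<|im_end|>\n"
// into the part used to build the prompt and the suffix re-inserted after the
// assistant's response.
struct SplitTemplate {
    std::string promptPart; // everything up to (and excluding) %2
    std::string suffixPart; // everything after %2, e.g. "<|im_end|>\n"
};

SplitTemplate splitAtResponse(const std::string &tmpl)
{
    auto pos = tmpl.find("%2");
    if (pos == std::string::npos)
        return { tmpl, "" }; // old-style template: nothing appended after the reply
    return { tmpl.substr(0, pos), tmpl.substr(pos + 2) };
}
```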

Thoughts? @manyoso @apage43

apage43 (Member) commented Feb 20, 2024

> I propose modifying the prompt template so it looks like: [...]

I'm cool with this change - I assume this means everything up to %2 is what we pass in to the model and the bit after %2 is treated as a stop-signal?

I do wonder, if we are going to make a breaking change to the templating anyway, whether we could use more descriptive markers like {input} / {response} or similar (not a big deal if this makes things extra complicated, though)

cebtenzzre (Member, Author) commented:

> > I propose modifying the prompt template so it looks like: [...]

> I'm cool with this change - I assume this means everything up to %2 is what we pass in to the model and the bit after %2 is treated as a stop-signal?

In the implementation I have right now, the model's EOS token is still honored, but GPT4All throws out the EOS token (which is the right thing to do when the prompt format does not contain <|im_end|>), so we use the part after %2 to put the <|im_end|> and a newline back. (AFAIK we somehow don't already enforce a newline between the response and the next prompt?)

> I do wonder, if we are going to make a breaking change to the templating anyway, whether we could use more descriptive markers like {input} / {response} or similar (not a big deal if this makes things extra complicated, though)

I'm not intending to make a breaking change, just an extension to the format - if %2 is not seen, nothing is inserted into the chat history after the EOS token (which is fine for e.g. Alpaca-style prompting).

manyoso requested review from apage43 and removed the request for manyoso on February 21, 2024
manyoso (Collaborator) commented Feb 21, 2024

@apage43 please R+ if you are good with this and merge

ThiloteE (Collaborator) commented Feb 21, 2024

Tried this PR.

[Screenshots: the <|im_end|> pattern re-emerging in the output]

The pattern re-emerged at the 8th instruction/question I posed.

This reverts commit 31841d0.

This shouldn't be true anymore anyway because we fixed the BOS getting dropped when we shift the context.
cebtenzzre (Member, Author) commented:

This PR is good enough for now - ChatML and similar templates work 100% correctly with these changes, plus my current local diff:

diff --git a/gpt4all-backend/llamamodel.cpp b/gpt4all-backend/llamamodel.cpp
index e8d2ccb..b3b619e 100644
--- a/gpt4all-backend/llamamodel.cpp
+++ b/gpt4all-backend/llamamodel.cpp
@@ -412,6 +412,10 @@ bool LLamaModel::evalTokens(PromptContext &ctx, const std::vector<int32_t> &toke
     // llama_decode will output logits only for the last token of the prompt
     batch.logits[batch.n_tokens - 1] = true;
 
+    std::cerr << "decode(n_past=" << ctx.n_past << "):\n";
+    for (int i = 0; i < batch.n_tokens; i++) {
+        std::cerr << "pos=" << ctx.n_past + i << " " << tokens[i] << " '" << llama_token_to_piece(d_ptr->ctx, tokens[i]) << "'\n";
+    }
     int res = llama_decode(d_ptr->ctx, batch);
     llama_batch_free(batch);
     return res == 0;
diff --git a/gpt4all-chat/metadata/models2.json b/gpt4all-chat/metadata/models2.json
index 903e7ad..5e33ca0 100644
--- a/gpt4all-chat/metadata/models2.json
+++ b/gpt4all-chat/metadata/models2.json
@@ -12,7 +12,7 @@
     "type": "Gemma",
     "description": "<strong>A state-of-the-art open model from Google</strong><br><ul><li>Fast responses</li><li>Chat based model</li><li>Trained by Google</li><li>Licensed for commercial use</li><li>Gemma is provided under and subject to the Gemma Terms of Use found at <a href=\"https://ai.google.dev/gemma/terms\">ai.google.dev/gemma/terms</a></li></ul>",
     "url": "https://gpt4all.io/models/gguf/gemma-7b-it.Q4_0.gguf",
-    "promptTemplate": "<start_of_turn>user\n%1<end_of_turn>\n<start_of_turn>model\n",
+    "promptTemplate": "<start_of_turn>user\n%1<end_of_turn>\n<start_of_turn>model\n%2<end_of_turn>\n",
     "systemPrompt": ""
   },
   {
@@ -28,7 +28,7 @@
     "type": "Mistral",
     "description": "<strong>Best overall fast chat model</strong><br><ul><li>Fast responses</li><li>Chat based model</li><li>Trained by Mistral AI<li>Finetuned on OpenOrca dataset curated via <a href=\"https://atlas.nomic.ai/\">Nomic Atlas</a><li>Licensed for commercial use</ul>",
     "url": "https://gpt4all.io/models/gguf/mistral-7b-openorca.Q4_0.gguf",
-    "promptTemplate": "<|im_start|>user\n%1<|im_end|>\n<|im_start|>assistant\n",
+    "promptTemplate": "<|im_start|>user\n%1<|im_end|>\n<|im_start|>assistant\n%2<|im_end|>\n",
     "systemPrompt": "<|im_start|>system\nYou are MistralOrca, a large language model trained by Alignment Lab AI. For multi-step problems, write out your reasoning for each step.\n<|im_end|>"
   },
   {
@@ -152,7 +152,7 @@
     "type": "MPT",
     "description": "<strong>Good model with novel architecture</strong><br><ul><li>Fast responses<li>Chat based<li>Trained by Mosaic ML<li>Cannot be used commercially</ul>",
     "url": "https://gpt4all.io/models/gguf/mpt-7b-chat-newbpe-q4_0.gguf",
-    "promptTemplate": "<|im_start|>user\n%1<|im_end|>\n<|im_start|>assistant\n",
+    "promptTemplate": "<|im_start|>user\n%1<|im_end|>\n<|im_start|>assistant\n%2<|im_end|>\n",
     "systemPrompt": "<|im_start|>system\n- You are a helpful assistant chatbot trained by MosaicML.\n- You answer questions.\n- You are excited to be able to help the user, but will refuse to do anything that could be considered harmful to the user.\n- You are more than just an information source, you are also able to write poetry, short stories, and make jokes.<|im_end|>"
   },
   {
diff --git a/gpt4all-chat/modellist.cpp b/gpt4all-chat/modellist.cpp
index 7d07e4c..bea4b85 100644
--- a/gpt4all-chat/modellist.cpp
+++ b/gpt4all-chat/modellist.cpp
@@ -7,7 +7,7 @@
 #include <QStandardPaths>
 #include <algorithm>
 
-//#define USE_LOCAL_MODELSJSON
+#define USE_LOCAL_MODELSJSON
 
 #define DEFAULT_EMBEDDING_MODEL "all-MiniLM-L6-v2-f16.gguf"
 #define NOMIC_EMBEDDING_MODEL "nomic-embed-text-v1.txt"

cebtenzzre merged commit 4fc4d94 into main on Feb 21, 2024
cebtenzzre changed the title from "Fix incorrect prompting of Mistral OpenOrca" to "fix chat-style prompt templates (and Mistral OpenOrca)" on Feb 21, 2024
cebtenzzre added a commit that referenced this pull request on Jul 1, 2024:
This fixes a regression in commit 4fc4d94 ("fix chat-style prompt templates (#1970)"), which moved some return statements into a new function (LLModel::decodePrompt) without making them return from the parent as well.
cebtenzzre deleted the chatml-fix branch on February 10, 2025