llama: extend for small granite models #7481
Conversation
llama.cpp
Outdated
if (model.arch == LLM_ARCH_LLAMA) {
    vocab.add_space_prefix = false;
}
Is this needed - looks wrong?
sorry, it should be LLM_ARCH_GRANITE_SMALL
Force-pushed from 1fb9186 to cd8d590.
@compilade thanks, addressed the issues and pushed a new version
Adding the
Add optional MLP bias for ARCH_LLAMA to support Granite models. Partially addresses ggerganov/issues/7116. Still needs some more changes to properly support Granite.
I've simplified the implementation and it is using the existing Llama model. I've added a way to override the default rope type. Now the only Granite-specific code in llama.cpp is to detect
convert-hf-to-gguf.py
Outdated
# Skip for granite models
if self.hparams.get("vocab_size", 32000) != 49152:
    if name.endswith("q_proj.weight"):
        data_torch = LlamaModel.permute(data_torch, n_head, n_head)
    if name.endswith("k_proj.weight"):
        data_torch = LlamaModel.permute(data_torch, n_head, n_kv_head)
I think we can avoid adding the rope type parameter altogether by permuting the Q, K attention tensors in the correct way here. I don't have example code unfortunately, so we need to figure out how to do it. The only difference between RoPE NORM and NEOX is that in the former we rotate the pairs (x[2*i + 0], x[2*i + 1]), while in the latter we rotate (x[i], x[i + n_rot/2]). So it's a matter of reordering the rows in each head in the correct way to make the RoPE type NORM, as in all other LLaMA-based models.
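A minimal sketch of the index mapping this pairing difference implies (not from this PR; plain NumPy, a single head of hypothetical even size head_dim):

```python
import numpy as np

# Illustration only: RoPE NORM rotates adjacent pairs (x[2*i], x[2*i + 1]),
# RoPE NEOX rotates split pairs (x[i], x[i + head_dim // 2]).
head_dim = 8  # assumed even head size, for illustration
norm_pairs = [(2 * i, 2 * i + 1) for i in range(head_dim // 2)]
neox_pairs = [(i, i + head_dim // 2) for i in range(head_dim // 2)]
print(norm_pairs)  # [(0, 1), (2, 3), (4, 5), (6, 7)]
print(neox_pairs)  # [(0, 4), (1, 5), (2, 6), (3, 7)]

# Interleaving the two halves of each head maps one layout onto the other:
# after this permutation, the NORM pair (2*i, 2*i + 1) holds the elements
# that NEOX would have rotated as (i, i + head_dim // 2).
perm = np.arange(head_dim).reshape(2, head_dim // 2).T.reshape(-1)
print(perm)        # [0 4 1 5 2 6 3 7]
```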
Thanks for the suggestion. I had a look at it and I am not sure it is possible to do just by rearranging the Q, K weights without changing their values too.
If I understand it correctly, given:
const float * const src = (float *)((char *) src0->data + i3*nb03 + i2*nb02 + i1*nb01 + i0*nb00);
float * dst_data = (float *)((char *) dst->data + i3*nb3 + i2*nb2 + i1*nb1 + i0*nb0);
we would like to shuffle the positions of x0 and x1 around, so that (RoPE NORM):
const float x0 = src[0];
const float x1 = src[1];
dst_data[0] = x0*cos_theta*zeta - x1*sin_theta;
dst_data[1] = x0*sin_theta*zeta + x1*cos_theta;
can be used instead of (RoPE NEOX):
const float x0 = src[0];
const float x1 = src[n_dims/2];
dst_data[0] = x0*cos_theta - x1*sin_theta;
dst_data[n_dims/2] = x0*sin_theta + x1*cos_theta;
So not only do we want to re-arrange the elements in a way that RoPE NORM can find them (this would probably be easy), but we also need to ensure that after the RoPE operation the output is written out with the same layout RoPE NEOX would have produced, since the rest of the model expects that output.
Am I missing something?
Even though the output of Q = rope(q) and K = rope(k) would not be in the same order, it should still work because we compute KQ = K @ Q, which is invariant to how the data in the heads is reordered, as long as it is reordered in the same way in both K and Q.
I could be missing something though - not 100% confident in this. If you think it won't work, we can probably do the rope type thing, but I really prefer to find a way to avoid it
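A minimal sketch of the invariance argument (not from this PR; NumPy stand-ins for Q and K, a single head, a random permutation within the head):

```python
import numpy as np

# Illustration only: permuting the head dimensions of Q and K in the same
# way leaves Q @ K^T unchanged, because the dot product just sums over
# those dimensions in whatever order they appear.
rng = np.random.default_rng(0)
n_tokens, head_dim = 5, 8
q = rng.standard_normal((n_tokens, head_dim))
k = rng.standard_normal((n_tokens, head_dim))

perm = rng.permutation(head_dim)        # any reordering within the head

scores          = q @ k.T               # original attention scores
scores_permuted = q[:, perm] @ k[:, perm].T

assert np.allclose(scores, scores_permuted)
```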
Is it something that could be changed later?
I am not confident that it is impossible either; I've spent a few hours on it and have not been successful so far.
I gave this a try. Only the first n_dims elements of each row should be re-ordered.
Line 14418 in 95f84d5:
if (ic < n_dims) {
@staticmethod
def permute_neox_rope(weights: Tensor, rot_dim: int) -> Tensor:
    orig_shape = weights.shape
    assert orig_shape[-1] % rot_dim == 0
    # view each row as chunks of rot_dim; only the first chunk gets roped
    weights = weights.reshape((-1, weights.shape[-1] // rot_dim, rot_dim))
    # interleave the two halves of the first chunk so that the NORM pairs
    # (2*i, 2*i + 1) line up with the NEOX pairs (i, i + rot_dim/2)
    weights[:, 0, :] = weights[:, 0, :].reshape((-1, 2, rot_dim // 2)).mT.contiguous().reshape((-1, rot_dim))
    return weights.reshape(orig_shape)
It seems to partially work, but the output is still wrong, because in RoPE NEOX it's only the first rot_dim elements per row that are roped, while in RoPE NORM all of them are.
So it's not simply a re-ordering of elements that is necessary, unfortunately. The rope type is needed, it seems.
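A minimal sketch of that mismatch (not from this PR; a hand-rolled toy rotate_pairs helper in NumPy with made-up angles, only to show which elements each variant touches when rot_dim < head_dim):

```python
import numpy as np

# Illustration only: with rot_dim < head_dim, NEOX rotates only the first
# rot_dim elements of each row, while NORM rotates every adjacent pair, so
# no weight permutation alone can make the two produce the same output.
def rotate_pairs(x, pairs, angles):
    out = x.astype(float)
    for (i, j), t in zip(pairs, angles):
        out[i] = x[i] * np.cos(t) - x[j] * np.sin(t)
        out[j] = x[i] * np.sin(t) + x[j] * np.cos(t)
    return out

head_dim, rot_dim = 8, 4
x = np.arange(head_dim)
neox_pairs = [(i, i + rot_dim // 2) for i in range(rot_dim // 2)]   # first rot_dim elements only
norm_pairs = [(2 * i, 2 * i + 1) for i in range(head_dim // 2)]     # every element

print(rotate_pairs(x, neox_pairs, [0.5, 1.0])[rot_dim:])            # tail untouched: [4. 5. 6. 7.]
print(rotate_pairs(x, norm_pairs, [0.5, 1.0, 1.5, 2.0])[rot_dim:])  # tail rotated
```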
Cool!
Could you put this diff into a patch I can cherry-pick, so that I can update my PR?
@giuseppe Put this in a file (say, permute-bias.patch), then use git apply permute-bias.patch from the repo's top directory.
Patch content:
diff --git a/convert-hf-to-gguf.py b/convert-hf-to-gguf.py
index 99c1fdb4..63d50f8f 100755
--- a/convert-hf-to-gguf.py
+++ b/convert-hf-to-gguf.py
@@ -1325,8 +1325,6 @@ class LlamaModel(Model):
# Apply to granite small models only
if self.hparams.get("vocab_size", 32000) == 49152:
self.gguf_writer.add_add_bos_token(False)
- self.gguf_writer.add_rope_type(gguf.RopeType.NEOX)
- self.gguf_writer.add_rope_scaling_type(gguf.RopeScalingType.NONE)
@staticmethod
def permute(weights: Tensor, n_head: int, n_head_kv: int | None):
@@ -1342,12 +1340,10 @@ class LlamaModel(Model):
n_head = self.hparams["num_attention_heads"]
n_kv_head = self.hparams.get("num_key_value_heads")
- # Skip for granite models
- if self.hparams.get("vocab_size", 32000) != 49152:
- if name.endswith("q_proj.weight"):
- data_torch = LlamaModel.permute(data_torch, n_head, n_head)
- if name.endswith("k_proj.weight"):
- data_torch = LlamaModel.permute(data_torch, n_head, n_kv_head)
+ if name.endswith(("q_proj.weight", "q_proj.bias")):
+ data_torch = LlamaModel.permute(data_torch, n_head, n_head)
+ if name.endswith(("k_proj.weight", "k_proj.bias")):
+ data_torch = LlamaModel.permute(data_torch, n_head, n_kv_head)
# process the experts separately
if name.find("block_sparse_moe.experts") != -1:
diff --git a/gguf-py/gguf/constants.py b/gguf-py/gguf/constants.py
index d5c3d7b5..c9ae259e 100644
--- a/gguf-py/gguf/constants.py
+++ b/gguf-py/gguf/constants.py
@@ -57,7 +57,6 @@ class Keys:
CAUSAL = "{arch}.attention.causal"
class Rope:
- TYPE = "{arch}.rope.type"
DIMENSION_COUNT = "{arch}.rope.dimension_count"
FREQ_BASE = "{arch}.rope.freq_base"
SCALING_TYPE = "{arch}.rope.scaling.type"
@@ -807,13 +806,6 @@ class TokenType(IntEnum):
BYTE = 6
-class RopeType(Enum):
- NONE = 'none'
- NORM = 'norm'
- NEOX = 'neox'
- GLM = 'glm'
-
-
class RopeScalingType(Enum):
NONE = 'none'
LINEAR = 'linear'
@@ -1006,7 +998,6 @@ KEY_ATTENTION_LAYERNORM_EPS = Keys.Attention.LAYERNORM_EPS
KEY_ATTENTION_LAYERNORM_RMS_EPS = Keys.Attention.LAYERNORM_RMS_EPS
# RoPE
-KEY_ROPE_TYPE = Keys.Rope.TYPE
KEY_ROPE_DIMENSION_COUNT = Keys.Rope.DIMENSION_COUNT
KEY_ROPE_FREQ_BASE = Keys.Rope.FREQ_BASE
KEY_ROPE_SCALING_TYPE = Keys.Rope.SCALING_TYPE
diff --git a/gguf-py/gguf/gguf_writer.py b/gguf-py/gguf/gguf_writer.py
index ebfd15fd..8b41b54e 100644
--- a/gguf-py/gguf/gguf_writer.py
+++ b/gguf-py/gguf/gguf_writer.py
@@ -427,9 +427,6 @@ class GGUFWriter:
def add_rope_freq_base(self, value: float) -> None:
self.add_float32(Keys.Rope.FREQ_BASE.format(arch=self.arch), value)
- def add_rope_type(self, value: RopeType) -> None:
- self.add_string(Keys.Rope.TYPE.format(arch=self.arch), value.value)
-
def add_rope_scaling_type(self, value: RopeScalingType) -> None:
self.add_string(Keys.Rope.SCALING_TYPE.format(arch=self.arch), value.value)
diff --git a/llama.cpp b/llama.cpp
index 16c11d43..f970c175 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -297,7 +297,6 @@ enum llm_kv {
LLM_KV_ATTENTION_LAYERNORM_RMS_EPS,
LLM_KV_ATTENTION_CAUSAL,
- LLM_KV_ROPE_TYPE,
LLM_KV_ROPE_DIMENSION_COUNT,
LLM_KV_ROPE_FREQ_BASE,
LLM_KV_ROPE_SCALE_LINEAR,
@@ -376,7 +375,6 @@ static const std::map<llm_kv, const char *> LLM_KV_NAMES = {
{ LLM_KV_ATTENTION_LAYERNORM_RMS_EPS, "%s.attention.layer_norm_rms_epsilon" },
{ LLM_KV_ATTENTION_CAUSAL, "%s.attention.causal" },
- { LLM_KV_ROPE_TYPE, "%s.rope.type" },
{ LLM_KV_ROPE_DIMENSION_COUNT, "%s.rope.dimension_count" },
{ LLM_KV_ROPE_FREQ_BASE, "%s.rope.freq_base" },
{ LLM_KV_ROPE_SCALE_LINEAR, "%s.rope.scale_linear" },
@@ -1131,29 +1129,12 @@ struct LLM_TN {
// gguf helpers
//
-static const std::map<enum llama_rope_type, const char *> LLAMA_ROPE_TYPES = {
- { LLAMA_ROPE_TYPE_NONE, "none" },
- { LLAMA_ROPE_TYPE_NORM, "norm" },
- { LLAMA_ROPE_TYPE_NEOX, "neox" },
- { LLAMA_ROPE_TYPE_GLM, "glm" },
-};
-
static const std::map<llama_rope_scaling_type, const char *> LLAMA_ROPE_SCALING_TYPES = {
{ LLAMA_ROPE_SCALING_TYPE_NONE, "none" },
{ LLAMA_ROPE_SCALING_TYPE_LINEAR, "linear" },
{ LLAMA_ROPE_SCALING_TYPE_YARN, "yarn" },
};
-static enum llama_rope_type llama_rope_type_from_string(const std::string & name) {
- for (const auto & kv : LLAMA_ROPE_TYPES) {
- if (kv.second == name) {
- return (enum llama_rope_type) kv.first;
- }
- }
-
- return LLAMA_ROPE_TYPE_NONE;
-}
-
static llama_rope_scaling_type llama_rope_scaling_type_from_string(const std::string & name) {
for (const auto & kv : LLAMA_ROPE_SCALING_TYPES) {
if (kv.second == name) {
@@ -4417,15 +4398,7 @@ static void llm_load_hparams(
hparams.use_alibi = true;
}
- hparams.rope_type = llama_default_rope_type(&model);
-
- const auto kv = LLM_KV(model.arch);
- const int rope_type_keyidx = gguf_find_key(ctx, kv(LLM_KV_ROPE_TYPE).c_str());
- if (rope_type_keyidx != -1) {
- std::string rope_type("none");
- ml.get_key(LLM_KV_ROPE_TYPE, rope_type);
- hparams.rope_type = llama_rope_type_from_string(rope_type);
- }
+ hparams.rope_type = llama_rope_type(&model);
}
// TODO: This should probably be in llama.h
@@ -16252,7 +16225,7 @@ enum llama_vocab_type llama_vocab_type(const struct llama_model * model) {
return model->vocab.type;
}
-enum llama_rope_type llama_default_rope_type(const struct llama_model * model) {
+enum llama_rope_type llama_rope_type(const struct llama_model * model) {
switch (model->arch) {
// these models do not use RoPE
case LLM_ARCH_GPT2:
diff --git a/llama.h b/llama.h
index 632136ca..16cece5d 100644
--- a/llama.h
+++ b/llama.h
@@ -422,7 +422,7 @@ extern "C" {
LLAMA_API enum llama_pooling_type llama_pooling_type(const struct llama_context * ctx);
LLAMA_API enum llama_vocab_type llama_vocab_type (const struct llama_model * model);
- LLAMA_API enum llama_rope_type llama_default_rope_type (const struct llama_model * model);
+ LLAMA_API enum llama_rope_type llama_rope_type (const struct llama_model * model);
LLAMA_API int32_t llama_n_vocab (const struct llama_model * model);
LLAMA_API int32_t llama_n_ctx_train(const struct llama_model * model);
If by "patch" you meant a commit, then... I think I can directly push it here if "Maintainers are allowed to edit this pull request." works as I think it does? (I never tried pushing on someone else's fork, though)
Ah, if you are fine with me applying it directly on top of my patch, then I can do that.
I was thinking of you keeping ownership of the commit, since you came up with the code.
PR updated
Nice work, thanks for looking into this!
propagate the add_space_prefix configuration from the HF model configuration to the gguf file and honor it with the gpt2 tokenizer.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
it works only for the small models, 3b and 8b. The convert-hf-to-gguf.py script uses the vocabulary size of the granite models to detect granite and set the correct configuration.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
it works only for the small models, 3b and 8b. The bigger models work fine with the existing GPTBigCodeForCausalLM architecture.
For the small models there are enough differences from the base llama arch that it is worth defining a new architecture.
To create the .gguf files, it is necessary to specify GraniteSmallForCausalLM in the architectures for the hf model.
Closes: #7116
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>