
llama : add reranking support #9510

Merged (25 commits) on Sep 28, 2024

Conversation

ggerganov (Owner) commented Sep 16, 2024

ref #8555

This adds initial support for reranking to libllama, llama-embedding and llama-server. I've tested mainly with the following 2 models:

  • https://huggingface.co/BAAI/bge-reranker-v2-m3
  • https://huggingface.co/jinaai/jina-reranker-v1-tiny-en

The reranking is implemented as a pooling layer of type LLAMA_POOLING_TYPE_RANK. When used, libllama will attach a classification head at the end of the graph:

llama.cpp/src/llama.cpp

Lines 10246 to 10266 in 4d45775

        case LLAMA_POOLING_TYPE_RANK:
            {
                struct ggml_tensor * inp_cls = build_inp_cls();
                inp = ggml_get_rows(ctx0, inp, inp_cls);

                // classification head
                // https://github.com/huggingface/transformers/blob/5af7d41e49bbfc8319f462eb45253dcb3863dfb7/src/transformers/models/roberta/modeling_roberta.py#L1566
                GGML_ASSERT(model.cls   != nullptr);
                GGML_ASSERT(model.cls_b != nullptr);

                cur = ggml_add (ctx0, ggml_mul_mat(ctx0, model.cls, inp), model.cls_b);
                cur = ggml_tanh(ctx0, cur);

                if (model.cls_out) {
                    // this path is taken for example by the https://huggingface.co/jinaai/jina-reranker-v1-tiny-en
                    // https://huggingface.co/jinaai/jina-reranker-v1-tiny-en/blob/cb5347e43979c3084a890e3f99491952603ae1b7/modeling_bert.py#L884-L896
                    GGML_ASSERT(model.cls_out_b != nullptr);

                    cur = ggml_add (ctx0, ggml_mul_mat(ctx0, model.cls_out, cur), model.cls_out_b);
                }
            } break;
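
For reference, here is a rough scalar sketch of what this head computes per sequence on the CLS/first-token embedding (an illustration based on the graph above and the linked RoBERTa head, not code from this PR):

#include <cmath>
#include <vector>

// dense (cls, cls_b) -> tanh -> projection (cls_out, cls_out_b) to a single raw score
static float rank_head(const std::vector<float> & h_cls,                  // [n_embd] CLS token embedding
                       const std::vector<std::vector<float>> & cls,       // [n_embd][n_embd] dense weights
                       const std::vector<float> & cls_b,                  // [n_embd] dense bias
                       const std::vector<float> & cls_out,                // [n_embd] projection to one score
                       float cls_out_b) {
    const size_t n_embd = h_cls.size();

    std::vector<float> t(n_embd);
    for (size_t i = 0; i < n_embd; ++i) {
        float acc = cls_b[i];
        for (size_t j = 0; j < n_embd; ++j) {
            acc += cls[i][j]*h_cls[j];
        }
        t[i] = std::tanh(acc);
    }

    float score = cls_out_b;
    for (size_t i = 0; i < n_embd; ++i) {
        score += cls_out[i]*t[i];
    }
    return score; // raw (unnormalized) relevance score
}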

The current implementation likely does not cover all types of rerankers, so updates would be necessary in the future to support other types of classifications on a case-by-case basis.

The computed rank scores for each sequence can be accessed via the llama_get_embeddings_seq() call:

llama.cpp/include/llama.h

Lines 873 to 878 in 4d45775

// Get the embeddings for a sequence id
// Returns NULL if pooling_type is LLAMA_POOLING_TYPE_NONE
// when pooling_type == LLAMA_POOLING_TYPE_RANK, returns float[1] with the rank of the sequence
// otherwise: float[n_embd] (1-dimensional)
LLAMA_API float * llama_get_embeddings_seq(struct llama_context * ctx, llama_seq_id seq_id);

The rank score is stored as a single float.
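
As a usage illustration (a minimal sketch that assumes the reranker model is already loaded, not code from this PR), a client requests the new pooling type when creating the context and then reads one float per sequence after decoding:

#include <cstdio>
#include "llama.h"

// sketch: compute rank scores for n_seq query/document pairs, one pair per seq_id
static void rerank_scores_example(struct llama_model * model, int n_seq) {
    llama_context_params cparams = llama_context_default_params();
    cparams.embeddings   = true;
    cparams.pooling_type = LLAMA_POOLING_TYPE_RANK;

    struct llama_context * ctx = llama_new_context_with_model(model, cparams);

    // ... tokenize each pair into its own seq_id and llama_decode() the batch ...

    for (llama_seq_id s = 0; s < n_seq; ++s) {
        const float * score = llama_get_embeddings_seq(ctx, s);
        if (score != NULL) {
            // with LLAMA_POOLING_TYPE_RANK the returned buffer holds a single float
            printf("rerank score %d: %8.3f\n", s, score[0]);
        }
    }

    llama_free(ctx);
}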

The server endpoint is designed mostly after https://jina.ai/reranker/, but it is not fully complete. Again, I think it's better to update it on a case-by-case basis + the API is not ideal (e.g. what is the purpose of top_n? why are the document contents returned in the response?).

I started to add server tests, but it will take me more time to write the python code, so I'll create a separate issue for people to help with that:

# TODO: implement some tests
# https://github.com/ggerganov/llama.cpp/pull/9510
# Scenario: Rerank
# Given a prompt:
# """
# What is panda?
# """

TODO:

  • tests (left for follow-up PR)
  • clean-up
  • optimize classification head

Model: https://huggingface.co/BAAI/bge-reranker-v2-m3

Testing:

python3 convert_hf_to_gguf.py \
	~/Data/huggingface/bge-reranker-v2-m3/ \
    --outfile models/bge-reranker-v2-m3/ggml-model-f16.gguf \
    --outtype f16

Classifier:

https://github.com/huggingface/transformers/blob/5af7d41e49bbfc8319f462eb45253dcb3863dfb7/src/transformers/models/roberta/modeling_roberta.py#L1566

Testing (CLI)

Rank responses:

# first
Q: what is panda?
A: hi

# second
Q: what is panda?
A: it's a bear

# third
Q: what is panda?
A: The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.
./llama-embedding \
    -m models/bge-reranker-v2-m3/ggml-model-f16.gguf \
    -p "what is panda?</s><s>hi\nwhat is panda?</s><s>it's a bear\nwhat is panda?</s><s>The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China." \
    --pooling rank --embd-normalize -1 --verbose-prompt 
0.00.830.859 I main: prompt 0: 'what is panda?</s><s>hi'
0.00.830.865 I main: number of tokens in prompt = 10
     0 -> '<s>'
  2367 -> ' what'
    83 -> ' is'
     6 -> ' '
 85407 -> 'panda'
    32 -> '?'
     2 -> '</s>'
     0 -> '<s>'
  1274 -> ' hi'
     2 -> '</s>'


0.00.830.871 I main: prompt 1: 'what is panda?</s><s>it's a bear'
0.00.830.871 I main: number of tokens in prompt = 14
     0 -> '<s>'
  2367 -> ' what'
    83 -> ' is'
     6 -> ' '
 85407 -> 'panda'
    32 -> '?'
     2 -> '</s>'
     0 -> '<s>'
   442 -> ' it'
    25 -> '''
     7 -> 's'
    10 -> ' a'
 81148 -> ' bear'
     2 -> '</s>'


0.00.830.875 I main: prompt 2: 'what is panda?</s><s>The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.'
0.00.830.875 I main: number of tokens in prompt = 43
     0 -> '<s>'
  2367 -> ' what'
    83 -> ' is'
     6 -> ' '
 85407 -> 'panda'
    32 -> '?'
     2 -> '</s>'
     0 -> '<s>'
   581 -> ' The'
  6051 -> ' gian'
    18 -> 't'
     6 -> ' '
 85407 -> 'panda'
    15 -> ' ('
   284 -> 'A'
 12175 -> 'ilu'
 28437 -> 'rop'
 19165 -> 'oda'
 54159 -> ' melan'
 16836 -> 'ole'
 29808 -> 'uca'
   247 -> '),'
 68018 -> ' sometimes'
 35839 -> ' called'
    10 -> ' a'
     6 -> ' '
 85407 -> 'panda'
 81148 -> ' bear'
   707 -> ' or'
 42856 -> ' simply'
     6 -> ' '
 85407 -> 'panda'
     4 -> ','
    83 -> ' is'
    10 -> ' a'
 81148 -> ' bear'
114149 -> ' species'
 28117 -> ' ende'
 21068 -> 'mic'
    47 -> ' to'
  9098 -> ' China'
     5 -> '.'
     2 -> '</s>'


0.00.837.762 I batch_decode: n_tokens = 67, n_seq = 3

rerank score 0:   -6.851
rerank score 1:   -3.917
rerank score 2:    4.641

0.00.872.293 I llama_perf_context_print:        load time =     513.42 ms
0.00.872.294 I llama_perf_context_print: prompt eval time =      34.51 ms /    67 tokens (    0.52 ms per token,  1941.24 tokens per second)
0.00.872.296 I llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
0.00.872.298 I llama_perf_context_print:       total time =     358.22 ms /    68 tokens
0.00.872.433 I ggml_metal_free: deallocating

Testing (server)

./llama-server \
    -m ./models/bge-reranker-v2-m3/ggml-model-f16.gguf \
    -c 65536 -np 8 -b 8192 -ub 8192 -fa \
    --host 127.0.0.1 --port 8012 -lv 1 \
    --embedding --pooling rank
curl http://127.0.0.1:8012/v1/rerank \
	-H "Content-Type: application/json" \
	-d '{
      "model": "some-model",
      "query": "Organic skincare products for sensitive skin",
      "top_n": 3,
      "documents": [
		"Organic skincare for sensitive skin with aloe vera and chamomile: Imagine the soothing embrace of nature with our organic skincare range, crafted specifically for sensitive skin. Infused with the calming properties of aloe vera and chamomile, each product provides gentle nourishment and protection. Say goodbye to irritation and hello to a glowing, healthy complexion.",
		"New makeup trends focus on bold colors and innovative techniques: Step into the world of cutting-edge beauty with this seasons makeup trends. Bold, vibrant colors and groundbreaking techniques are redefining the art of makeup. From neon eyeliners to holographic highlighters, unleash your creativity and make a statement with every look.",
		"Bio-Hautpflege für empfindliche Haut mit Aloe Vera und Kamille: Erleben Sie die wohltuende Wirkung unserer Bio-Hautpflege, speziell für empfindliche Haut entwickelt. Mit den beruhigenden Eigenschaften von Aloe Vera und Kamille pflegen und schützen unsere Produkte Ihre Haut auf natürliche Weise. Verabschieden Sie sich von Hautirritationen und genießen Sie einen strahlenden Teint.",
		"Neue Make-up-Trends setzen auf kräftige Farben und innovative Techniken: Tauchen Sie ein in die Welt der modernen Schönheit mit den neuesten Make-up-Trends. Kräftige, lebendige Farben und innovative Techniken setzen neue Maßstäbe. Von auffälligen Eyelinern bis hin zu holografischen Highlightern – lassen Sie Ihrer Kreativität freien Lauf und setzen Sie jedes Mal ein Statement.",
		"Cuidado de la piel orgánico para piel sensible con aloe vera y manzanilla: Descubre el poder de la naturaleza con nuestra línea de cuidado de la piel orgánico, diseñada especialmente para pieles sensibles. Enriquecidos con aloe vera y manzanilla, estos productos ofrecen una hidratación y protección suave. Despídete de las irritaciones y saluda a una piel radiante y saludable.",
		"Las nuevas tendencias de maquillaje se centran en colores vivos y técnicas innovadoras: Entra en el fascinante mundo del maquillaje con las tendencias más actuales. Colores vivos y técnicas innovadoras están revolucionando el arte del maquillaje. Desde delineadores neón hasta iluminadores holográficos, desata tu creatividad y destaca en cada look.",
		"针对敏感肌专门设计的天然有机护肤产品:体验由芦荟和洋甘菊提取物带来的自然呵护。我们的护肤产品特别为敏感肌设计,温和滋润,保护您的肌肤不受刺激。让您的肌肤告别不适,迎来健康光彩。",
		"新的化妆趋势注重鲜艳的颜色和创新的技巧:进入化妆艺术的新纪元,本季的化妆趋势以大胆的颜色和创新的技巧为主。无论是霓虹眼线还是全息高光,每一款妆容都能让您脱颖而出,展现独特魅力。",
		"敏感肌のために特別に設計された天然有機スキンケア製品: アロエベラとカモミールのやさしい力で、自然の抱擁を感じてください。敏感肌用に特別に設計された私たちのスキンケア製品は、肌に優しく栄養を与え、保護します。肌トラブルにさようなら、輝く健康な肌にこんにちは。",
		"新しいメイクのトレンドは鮮やかな色と革新的な技術に焦点を当てています: 今シーズンのメイクアップトレンドは、大胆な色彩と革新的な技術に注目しています。ネオンアイライナーからホログラフィックハイライターまで、クリエイティビティを解き放ち、毎回ユニークなルックを演出しましょう。"
      ]
    }' | jq

result:

{
  "model": "some-model",
  "object": "list",
  "usage": {
    "prompt_tokens": 0,
    "total_tokens": 0
  },
  "results": [
    {
      "index": 0,
      "relevance_score": 5.9729323387146
    },
    {
      "index": 1,
      "relevance_score": -11.031712532043457
    },
    {
      "index": 2,
      "relevance_score": 1.6428840160369873
    },
    {
      "index": 3,
      "relevance_score": -11.016538619995117
    },
    {
      "index": 4,
      "relevance_score": 4.703317642211914
    },
    {
      "index": 5,
      "relevance_score": -11.042262077331543
    },
    {
      "index": 6,
      "relevance_score": 5.164690017700195
    },
    {
      "index": 7,
      "relevance_score": -11.041573524475098
    },
    {
      "index": 8,
      "relevance_score": 5.811273097991943
    },
    {
      "index": 9,
      "relevance_score": -11.044628143310547
    }
  ]
}

github-actions bot added the python (python script changes) label on Sep 16, 2024
HanClinto (Collaborator)

!!

I'm a big fan of the BGE embedding models (they're incredibly user-friendly to fine-tune on one's own datasets) -- I'm really happy to see support being added for this! I'll definitely take a look and review.

ggerganov (Owner, Author) commented Sep 19, 2024

Initial working version using the llama-embedding example is now available. For now, I'm thinking about implementing this as a new type of pooling - "rank". It's very similar to the existing "cls" pooling. I'm not very sure if this is a good approach - efficiency-wise it's not optimal atm. The classification head can be optimized to not perform the large matrix multiplication with the entire batch. Open to alternative ideas.

In the meantime, will add a llama-server integration. (edit: done - see op)

donguyen32

@ggerganov
I am very interested in the reranking support feature. Could you please let me know when it will be released?

ggerganov (Owner, Author)

Hopefully sometime this week

QuintinShaw

Dear @ggerganov, I highly appreciate the reranking support. I would like to suggest adding support for the gte-multilingual-reranker-base model, which employs the 'NewForSequenceClassification' architecture. This addition would significantly enhance multilingual processing. Thank you!

ExtReMLapin (Contributor) commented Sep 24, 2024

Benchmarking different rerankers

lower index = better

| Model | English average index | French average index | German average index | Spanish average index | Average index | Total chunks | Time spent (s) | VRAM used (MB) |
|---|---|---|---|---|---|---|---|---|
| BAAI/bge-reranker-base | 13.6 | 6.3 | 13.5 | 3.5 | 9.2 | 102 | 83.1980035305023 | 1068.8564453125 |
| BAAI/bge-reranker-v2-m3 | 1.9 | 0.5 | 0.4 | 0.5 | 0.8 | 102 | 880.672662258148 | 2174.068359375 |
| Alibaba-NLP/gte-multilingual-reranker-base | 1.7 | 0.5 | 0.5 | 0.4 | 0.8 | 102 | 46.7268528938294 | 613.9794921875 |
| jinaai/jina-reranker-v2-base-multilingual | 1.8 | 0.5 | 0.5 | 0.8 | 0.9 | 102 | 28.4067573547363 | 546.23681640625 |
| mixedbread-ai/mxbai-rerank-large-v1 | 8.8 | 13.1 | 9.5 | 4.7 | 9 | 102 | 216.711384534836 | 838.1962890625 |

https://github.com/user-attachments/files/17119411/results_1727199777.4010012.xlsx

ggerganov (Owner, Author)

@ExtReMLapin How is this benchmark performed?

ExtReMLapin (Contributor)

You're right, I forgot to specify how we use these models and what benchmark we ran.

At the office we only need RAG, so we use embeddings and rerankers purely for a "needle in the haystack" challenge.

  1. We take a bunch of PDFs and a bunch of question-answer pairs.
  2. For each question-answer pair, we hide the answer at a random location in the PDF text (pre-divided into chunks of 4092 characters).
  3. As we know where the needle (answer) was hidden in the text, we ask the reranker to find the "closest" chunk to the question. We expect the chunk "hosting" the hidden answer to be returned as the top result by similarity score.
  4. The score of a reranker model for that question is the rank of that chunk. For example, if we hid the answer in chunk number 46, we expect the reranked chunks to come back as [46, ...], so the score is zero. But if we get, for example, [12, 64, 54, 46, ...], the score is 3.

The only thing we care about is answering "can this question be answered with this text chunk?". From our point of view, rerankers are just upgraded embedding models (it's technically different, I'm just talking about the final purpose).

Please note the question-answer pairs are VERY similar, which is why we expect the rerankers to have a final score close to ZERO. We consider the other rerankers to be unusable.

Again, this is a benchmark we built in two hours just to fit our needs.

If anyone wants to take a look at the code, here it is: evaluate_rerankers.py.txt
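
In C++-ish pseudocode, the metric boils down to something like this (a simplified sketch, not the actual script):

#include <algorithm>
#include <cstdio>
#include <vector>

// score for one question: position of the needle chunk after sorting all
// chunks by reranker score (0 = the needle was ranked first)
static int needle_index(const std::vector<float> & scores, int needle_chunk) {
    std::vector<int> order(scores.size());
    for (size_t i = 0; i < order.size(); ++i) {
        order[i] = (int) i;
    }
    std::sort(order.begin(), order.end(), [&](int a, int b) { return scores[a] > scores[b]; });
    return (int) (std::find(order.begin(), order.end(), needle_chunk) - order.begin());
}

int main() {
    // e.g. 4 chunks, the needle is in chunk 3 and the reranker ranked it 3rd -> index 2
    std::vector<float> scores = { 0.1f, 0.9f, 0.5f, 0.3f };
    printf("index = %d\n", needle_index(scores, 3));
    return 0;
}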

rujialiu commented Sep 25, 2024

@ExtReMLapin Thanks for the benchmark. I wonder if you could add bce reranker (https://huggingface.co/maidalun1020/bce-reranker-base_v1). It's an interesting one because it claims to have "meaningful rerank score". See their repo for more information: https://github.com/netease-youdao/BCEmbedding

Edit: oops, I didn't realize it only supports English out of the four languages in your table :(

ExtReMLapin (Contributor) commented Sep 25, 2024

> @ExtReMLapin Thanks for the benchmark. I wonder if you could add bce reranker (https://huggingface.co/maidalun1020/bce-reranker-base_v1). It's an interesting one because it claims to have "meaningful rerank score". See their repo for more information: https://github.com/netease-youdao/BCEmbedding
>
> Edit: oops, I didn't realize it only supports English out of the four languages in your table :(

If a reranker doesn't even succeed at the needle in the haystack challenge better than bge-m3 embeddings, it's literally not worth our disk space.

[screenshot: benchmark results]

For reference, in the EXACT same benchmark but using embedding models instead, the average score of BGE-M3 is ~4.5.

So a reranker here is doing pretty much WORSE than embeddings 🤡

I'm re-running another test; I added 3 fan-fiction PDFs and one question-answer set where the link between question and answer is much more subtle.

ExtReMLapin (Contributor)

@rujialiu @QuintinShaw

I finished running more tests using the new dataset

For reference, in the first needle in the haystack challenge, the searched text (needle/answer) was very similar to the question (query).

Question-answer pairs were like:

    {
        "question": "When was Peter Donkey Born ?",
        "needles": [
            "Peter Donkey was born in november in 1996",
            "P. Donkey was born in 1996",
            "Peter Donkey est né en novembre 1996",
            "Peter Donkey ese nacio en 1996",
        ],
    },
    {
        "question": "What is the height of Mount Everest?",
        "needles": [
            "Mount Everest measures 8,848 meters above sea level.",
            "The tallest mountain is 8,848 meters high.",
            "La montagne la plus haute mesure 8 848 mètres, c'est l'Everest.",
            "La montaña más alta mide 8,848 metros.",
            "Der höchste Berg der Welt ist 8.848 Meter hoch.",
        ],
    },
    {
        "question": "Who invented the telephone?",
        "needles": [
            "Alexander Graham Bell is credited with the invention of the telephone.",
            "The telephone was first patented by Bell in 1876.",
            "Le téléphone a été inventé par Alexander Graham Bell.",
            "El teléfono fue inventado por Alexander Graham Bell.",
            "Das Telefon wurde von Alexander Graham Bell erfunden.",
        ],
    },

You didn't even need a reranker to find the answer as it had similar words

I added a new dataset called "subtle", which is more like this:

{
        "question": "When did Peter eat a fruit ?",
        "needles": [ #link is fruit -> apple
            "Right after he went to the gym, he ate an apple.",
        ],
},
{
        "question": "What did the criminal do to get in jail ?",
        "needles": [ # link is jail -> emprisoned
            "He's emprisoned because he stole a car.",
        ],
},
{
        "question": "What did the doctor prescribe to the patient ?",
        "needles": [ #link is doctor/patient -> hospital
            "Back from the hospital, he got penicilin.",
        ],
},
{
        "question": "What did the teacher give to the student ?",
        "needles": [ #link is teacher/student -> school
            "At school, he received a book.",
        ],
},
{
        "question": "What is used to quench thirst?",
        "needles": [ #link is thirst -> drink
            "After the long walk, he drank a glass of water.",
        ],
},

[screenshot: benchmark results on the subtle dataset]

BAAI/bge-reranker-v2-m3 is still top 1 and others are far behind.

ggerganov (Owner, Author) commented Sep 30, 2024

Thanks a lot for these tests and analysis. I'm not sure why the scores are different - likely I'm missing something in the final classification layer. I'm looking at this code here, but maybe this is the wrong place:

https://github.com/huggingface/transformers/blob/5af7d41e49bbfc8319f462eb45253dcb3863dfb7/src/transformers/models/roberta/modeling_roberta.py#L1566-L1585

> How can I minimize peak VRAM usage for embedder/reranker if I want to use CPU inference?

To switch to CPU-only inference add -ngl 0 -nkvo.

You can also adjust the context size. In the example, I've used 65536 in order to be able to fit 8 queries of size 8192 in parallel. But if your use case is different, consider adjusting the parameters. Let me know if it is not clear and I will explain further.

> How can we reduce the internal VRAM buffer of the server if we know it'll be idle for a long time? (It's ok that the next access's response will be considerably longer than usual)

Maybe spin down the GPU-based instance when it is not used and spin it up when there is high load? Not sure.

rujialiu commented Sep 30, 2024

> Thanks a lot for these tests and analysis. I'm not sure why the scores are different - likely I'm missing something in the final classification layer. I'm looking at this code here, but maybe this is the wrong place:

Unfortunately, currently my LLM/reranker knowledge is not enough, but I'm learning :)

> To switch to CPU-only inference add -ngl 0 -nkvo.

Thanks! I was not aware of -nkvo. But it doesn't seem to help (I reduced context size according to your suggestion):

C:\llama.cpp>llama-server -m bge-reranker-v2-m3-q4_0.gguf -c 1024 -ngl 0 -nkvo --rerank --port 8092
build: 0 (unknown) with MSVC 19.34.31937.0 for x64
system info: n_threads = 16, n_threads_batch = 16, total_threads = 32

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |

main: HTTP server is listening, hostname: 127.0.0.1, port: 8092, http threads: 31
main: loading model
llama_model_loader: loaded meta data with 35 key-value pairs and 393 tensors from bge-reranker-v2-m3-q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = bert
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Bge M3
llama_model_loader: - kv   3:                       general.organization str              = BAAI
llama_model_loader: - kv   4:                         general.size_label str              = 568M
llama_model_loader: - kv   5:                            general.license str              = apache-2.0
llama_model_loader: - kv   6:                               general.tags arr[str,3]       = ["transformers", "sentence-transforme...
llama_model_loader: - kv   7:                          general.languages arr[str,1]       = ["multilingual"]
llama_model_loader: - kv   8:                           bert.block_count u32              = 24
llama_model_loader: - kv   9:                        bert.context_length u32              = 8192
llama_model_loader: - kv  10:                      bert.embedding_length u32              = 1024
llama_model_loader: - kv  11:                   bert.feed_forward_length u32              = 4096
llama_model_loader: - kv  12:                  bert.attention.head_count u32              = 16
llama_model_loader: - kv  13:          bert.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv  14:                          general.file_type u32              = 2
llama_model_loader: - kv  15:                      bert.attention.causal bool             = false
llama_model_loader: - kv  16:                       tokenizer.ggml.model str              = t5
llama_model_loader: - kv  17:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  18:                      tokenizer.ggml.tokens arr[str,250002]  = ["<s>", "<pad>", "</s>", "<unk>", ","...
llama_model_loader: - kv  19:                      tokenizer.ggml.scores arr[f32,250002]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  20:                  tokenizer.ggml.token_type arr[i32,250002]  = [3, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  21:            tokenizer.ggml.add_space_prefix bool             = true
llama_model_loader: - kv  22:            tokenizer.ggml.token_type_count u32              = 1
llama_model_loader: - kv  23:    tokenizer.ggml.remove_extra_whitespaces bool             = true
llama_model_loader: - kv  24:        tokenizer.ggml.precompiled_charsmap arr[u8,237539]   = [0, 180, 2, 0, 0, 132, 0, 0, 0, 0, 0,...
llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 0
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  27:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  28:          tokenizer.ggml.seperator_token_id u32              = 2
llama_model_loader: - kv  29:            tokenizer.ggml.padding_token_id u32              = 1
llama_model_loader: - kv  30:                tokenizer.ggml.cls_token_id u32              = 0
llama_model_loader: - kv  31:               tokenizer.ggml.mask_token_id u32              = 250001
llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  33:               tokenizer.ggml.add_eos_token bool             = true
llama_model_loader: - kv  34:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  247 tensors
llama_model_loader: - type q4_0:  145 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 4
llm_load_vocab: token to piece cache size = 2.1668 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = bert
llm_load_print_meta: vocab type       = UGM
llm_load_print_meta: n_vocab          = 250002
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 1024
llm_load_print_meta: n_layer          = 24
llm_load_print_meta: n_head           = 16
llm_load_print_meta: n_head_kv        = 16
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 1.0e-05
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 4096
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 0
llm_load_print_meta: pooling type     = -1
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 335M
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 567.75 M
llm_load_print_meta: model size       = 396.07 MiB (5.85 BPW)
llm_load_print_meta: general.name     = Bge M3
llm_load_print_meta: BOS token        = 0 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 3 '<unk>'
llm_load_print_meta: SEP token        = 2 '</s>'
llm_load_print_meta: PAD token        = 1 '<pad>'
llm_load_print_meta: CLS token        = 0 '<s>'
llm_load_print_meta: MASK token       = 250001 '[PAD250000]'
llm_load_print_meta: LF token         = 6 '▁'
llm_load_print_meta: EOG token        = 2 '</s>'
llm_load_print_meta: max token length = 48
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3070 Ti, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.16 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/25 layers to GPU
llm_load_tensors:        CPU buffer size =   396.07 MiB
............................................
llama_new_context_with_model: n_ctx      = 1024
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =    96.00 MiB
llama_new_context_with_model: KV self size  =   96.00 MiB, K (f16):   48.00 MiB, V (f16):   48.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.00 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =    25.56 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =     5.01 MiB
llama_new_context_with_model: graph nodes  = 854
llama_new_context_with_model: graph splits = 416
llama_init_from_gpt_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 1024
main: model loaded
main: chat template, built_in: 1, chat_example: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
main: server is listening on 127.0.0.1:8092 - starting the main loop
srv  update_slots: all slots are idle

According to Task Manager (Win10), llama-server.exe eats 433MB RAM and 188MB VRAM. It's almost the same for the f16 model; the VRAM usage is slightly higher (190MB). I'm expecting 0KB VRAM usage.

Any strange information in the log? You could also point me to some code to look at. I'm not bad at debugging 😆

Temporarily shutting down the server is not a bad idea. I will use that until I find a better solution.

foldl (Contributor) commented Sep 30, 2024

@rujialiu have you compared the results of ChatLLM against infinity and tei?

ggerganov (Owner, Author)

@foldl You can test the results using the following input:

<s>A red apple</s><s>A llama in the garden</s>
<s>A red apple</s><s>I want some fruit</s>

Infinity and tei produce [-8.7734375, -0.10394287], while llama.cpp produces [-6.902970790863037, -1.4558653831481934].

ggerganov (Owner, Author) commented Sep 30, 2024

> Any strange information in the log? You could also point me to some code to look at. I'm not bad at debugging

No, I think it's expected to have some small VRAM usage. You probably want to deploy 2 llama-server instances - one without CUDA for CPU-only inference and one with CUDA. You can add some logic on top to decide which instance to run when.

For the CUDA instance, make sure to offload all layers and enable flash attention: -ngl 99 -fa

slaren (Collaborator) commented Sep 30, 2024

You can also set the CUDA_VISIBLE_DEVICES environment variable to an empty string to fully disable CUDA.

rujialiu

> @rujialiu have you compared the results of ChatLLM against infinity and tei?

I tried, but I can't find a trivial way to separate the reranker part of ChatLLM so I don't really know how to get the result to compare. Will try again when I have more time.

foldl (Contributor) commented Sep 30, 2024

@ggerganov sadly, after disabling the sigmoid, the results from chatllm are [-5.97335529, -3.10583854], although I remember that I compared the results when implementing the BCE re-ranker, which shares the same arch as BGE. I will look into it later.

foldl (Contributor) commented Sep 30, 2024

@rujialiu, there are easy-to-use Python bindings:

https://github.com/foldl/chatllm.cpp/blob/d458264ff8f6b70994f9e7cfd1de45bb656b1875/bindings/chatllm.py#L255

rujialiu commented Oct 1, 2024

> @rujialiu, there are easy-to-use Python bindings:
>
> https://github.com/foldl/chatllm.cpp/blob/d458264ff8f6b70994f9e7cfd1de45bb656b1875/bindings/chatllm.py#L255

Oops. I forgot to check the Python bindings. Thanks! And a (hopefully) easier way to debug is to trace through the official code of bge:

from FlagEmbedding import FlagReranker
reranker = FlagReranker('BAAI/bge-reranker-v2-m3', use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation

score = reranker.compute_score([['A red apple', 'A llama in the garden'], ['A red apple', 'I want some fruit']])
print(score) # [-8.7265625, -0.10467529296875]

rujialiu commented Oct 2, 2024

I looked deeper into this. After printing the tokenization result, I noticed a difference.

llama.cpp:

main: prompt 0: 'A red apple</s><s>A llama in the garden'
main: number of tokens in prompt = 12
     0 -> '<s>'
    62 -> ' A'
  4842 -> ' red'
108787 -> ' apple'
     2 -> '</s>'
     0 -> '<s>'
    62 -> ' A'
 67140 -> ' llama'
    23 -> ' in'
    70 -> ' the'
 80583 -> ' garden'
     2 -> '</s>'

However, here is FlagEmbedding's output (I modified its source code to print the tokenizer result):

tensor([[     0,     62,   4842, 108787,      2,      2,     62,  67140,     23,
             70,  80583,      2],

So the BOS of the second sequence is 2 instead of 0. I don't know why, but if I change llama.cpp's code to force it to 2 like this:

    for (int k = 0; k < n_prompts; k++) {
        // clamp to n_batch tokens
        auto & inp = inputs[k];
        inp[5] = 2; // <--- just for troubleshooting: token index 5 is the second '<s>' in this test prompt

Then I get a very similar result: -8.797 (vs tei's -8.773)! I've also tested some other equal-length pairs like this:

score = reranker.compute_score([['A red apple', 'A llama in the garden'], ['A red apple', 'I want some fruit!'], ['A red apple', 'Another nice looking big apple']])
print(score) # [-8.7265625, -1.53125, 2.203125]

llama.cpp's output:

rerank score 0:   -8.797
rerank score 1:   -1.500
rerank score 2:    2.239

That's very close!

EDIT: I made a mistake in my original post. After changing the second sequence's BOS to 2, everything is actually correct now. So we probably just need to:

  • Add an EOS instead of a BOS at the beginning of the second sentence
  • Add a sigmoid in the server (we can also add a boolean flag raw_scores just like in tei)

@ggerganov

ggerganov (Owner, Author)

Nice find! So it really depends on what the correct formatting of the input is. The approach that we have implemented in llama.cpp (which was suggested by @foldl in #8555 (comment)) makes sense to me. On the other hand, using a double EOS token does not make sense, but probably this is the correct way to do it, since it's unlikely that tei has a bug in this regard. Still, it's a good idea to verify and find some reference for how to structure the input exactly.

Adding sigmoid option is easy.

rujialiu commented Oct 2, 2024

> On the other hand, using a double EOS token does not make sense, but probably this is the correct way to do it

I'm curious too. So I searched the transformers source and found models/xlm_roberta/tokenization_xlm_roberta_fast.py:

class XLMRobertaTokenizerFast(PreTrainedTokenizerFast):
...
    def __init__(
        self,
        vocab_file=None,
        tokenizer_file=None,
        bos_token="<s>",
        eos_token="</s>",
        sep_token="</s>",
        cls_token="<s>",
        unk_token="<unk>",
        pad_token="<pad>",
        mask_token="<mask>",
        **kwargs,
    ):

So they're adding sep_token instead of eos_token, which sounds reasonable. It's just that sep_token happens to be the same as EOS 😆

rujialiu commented Oct 2, 2024

Also, tei uses huggingface's tokenizers library, in which RobertaProcessing's default sep is </s>:

impl Default for RobertaProcessing {
    fn default() -> Self {
        Self {
            sep: ("</s>".into(), 2),
            cls: ("<s>".into(), 0),
            trim_offsets: true,
            add_prefix_space: true,
        }
    }
}

see https://docs.rs/tokenizers/latest/src/tokenizers/processors/roberta.rs.html#16-25

ggerganov (Owner, Author)

Ok, it's clear now - we need to use:

<bos>query</eos><sep>answer</eos>
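
For bge-reranker-v2-m3 (bos "<s>" = 0, eos = sep = "</s>" = 2, per the metadata in the server log above) that renders as follows (an illustrative sketch, not code from the PR):

#include <cstdio>
#include <string>

// illustrative only: for XLM-RoBERTa based rerankers sep happens to equal eos,
// so the rendered pair is "<s>query</s></s>document</s>" and the second
// sequence starts with token id 2 rather than 0
static std::string format_rerank_pair(const std::string & query, const std::string & doc) {
    return "<s>" + query + "</s></s>" + doc + "</s>";
}

int main() {
    printf("%s\n", format_rerank_pair("A red apple", "I want some fruit").c_str());
    // expected token ids: 0 ... 2 2 ... 2
    return 0;
}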

rujialiu commented Oct 3, 2024

After reading more code in transformers and tokenizers, I feel it's still a bit of a mess. For example, in transformers' tokenization_xlm_roberta_fast.py:

    def build_inputs_with_special_tokens(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        """
        Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
        adding special tokens. An XLM-RoBERTa sequence has the following format:

        - single sequence: `<s> X </s>`
        - pair of sequences: `<s> A </s></s> B </s>`

        Args:
            token_ids_0 (`List[int]`):
                List of IDs to which the special tokens will be added.
            token_ids_1 (`List[int]`, *optional*):
                Optional second list of IDs for sequence pairs.

        Returns:
            `List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
        """

        if token_ids_1 is None:
            return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
        cls = [self.cls_token_id]
        sep = [self.sep_token_id]
        return cls + token_ids_0 + sep + sep + token_ids_1 + sep

So it's 3 seps but no eos (tokenizers has similar code: 3 seps). But maybe I'm just nitpicking. Never mind. Let's do it! Thanks!

foldl (Contributor) commented Oct 10, 2024

> Nice find! So it really depends on what the correct formatting of the input is. The approach that we have implemented in llama.cpp (which was suggested by @foldl in #8555 (comment)) makes sense to me. On the other hand, using a double EOS token does not make sense, but probably this is the correct way to do it, since it's unlikely that tei has a bug in this regard. Still, it's a good idea to verify and find some reference for how to structure the input exactly.
>
> Adding sigmoid option is easy.

It's my fault: the pattern in #8555 (comment) is wrong. Fixed now.

After the tokenizer fix, I think the results of chatllm and infinity/tei match now.

rujialiu

> Adding sigmoid option is easy.

Gently ping @ggerganov for the sigmoid option

ggerganov (Owner, Author)

Seems like a good first issue for new contributors. Feel free to create one.

rujialiu

Ok, I'll try next week

rujialiu

> Seems like a good first issue for new contributors. Feel free to create one.

I ended up adding a sigmoid in the send_rerank function (changing two lines), thus changing the default behavior. Adding a raw_score flag to the REST API seems unnecessary (it's only supported by tei). Is it ok to create a PR from this minimal change?

ggerganov (Owner, Author)

We can leave it up to the client to apply the sigmoid if they need to. It's a trivial operation and I don't think there is any need to do it server-side. But we can add this option nevertheless. We just have to keep the existing behavior as default and optionally enable the sigmoid per request.
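
For reference, the client-side conversion is just the standard logistic function; a minimal sketch (not part of the server code):

#include <cmath>
#include <cstdio>

// map a raw relevance_score from /v1/rerank into (0, 1)
static float sigmoid(float x) {
    return 1.0f / (1.0f + std::exp(-x));
}

int main() {
    const float raw[] = { -6.851f, -3.917f, 4.641f }; // raw scores from the CLI test above
    for (float s : raw) {
        printf("%8.3f -> %.4f\n", s, sigmoid(s));
    }
    return 0;
}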

rujialiu

You're right. I changed that because I want it to be a painless (almost drop-in) replacement for tei, since it's already used in production in several places. But generally it doesn't need to be done server-side.

ExtReMLapin (Contributor)

@QuintinShaw we ran more tests (real-life tests) at the office.

We took the second Harry Potter book and used a sigmoid to get a 0-1 score from the reranker.

The Alibaba reranker is terrible compared to the BGE reranker.

On the left, you can see the sorted scores of each of the 80 chunks against the query "Which animal lives in the Chamber of Secrets?"

On the right, the same but for BGE M3 V2.

[screenshots: sorted chunk scores, Alibaba reranker (left) vs BGE reranker (right)]

The Alibaba reranker doesn't highlight any chunk, while the BGE reranker does.

dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024
* py : add XLMRobertaForSequenceClassification [no ci]

* py : fix scalar-tensor conversion [no ci]

* py : fix position embeddings chop [no ci]

* llama : read new cls tensors [no ci]

* llama : add classigication head (wip) [no ci]

* llama : add "rank" pooling type

ggml-ci

* server : add rerank endpoint

ggml-ci

* llama : aboud ggml_repeat during classification

* rerank : cleanup + comments

* server : accept /rerank endpoint in addition to /v1/rerank [no ci]

* embedding : parse special tokens

* jina : support v1 reranker

* vocab : minor style

ggml-ci

* server : initiate tests for later

ggml-ci

* server : add docs

* llama : add comment [no ci]

* llama : fix uninitialized tensors

* ci : add rerank tests

ggml-ci

* add reranking test

* change test data

* Update examples/server/server.cpp

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

* add `--reranking` argument

* update server docs

* llama : fix comment [no ci]

ggml-ci

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>