llama : add reranking support #9510
Conversation
I'm a big fan of the BGE embedding models (they're incredibly user-friendly to fine-tune on one's own datasets) -- I'm really happy to see support being added for this! I'll definitely take a look and review. |
Initial working version using the In the meantime, will add a |
@ggerganov |
Hopefully sometime this week |
Dear @ggerganov, I highly appreciate the reranking support. I would like to suggest adding support for the gte-multilingual-reranker-base model, which employs the 'NewForSequenceClassification' architecture. This addition would significantly enhance multilingual processing. Thank you! |
Benchmarking different rerankers (lower index = better):
https://github.com/user-attachments/files/17119411/results_1727199777.4010012.xlsx |
@ExtReMLapin How is this benchmark performed? |
You're right, I forgot to specify how we use it, and therefore what benchmark we used. At the office we only need RAG, so we need embeddings and rerankers only to perform the "needle in the haystack" challenge.
The only thing we care about is answering "can this question be answered with this text chunk?". From our point of view, rerankers are just upgraded embedding models (it's technically different, I'm just talking about the final purpose). Please note the question-answer pairs are VERY similar, which is why we expect the rerankers to produce a final score close to ZERO. We consider the other rerankers to be unusable. Again, this is a benchmark we built in two hours just to fit our needs. If anyone wants to take a look at the code, here it is: evaluate_rerankers.py.txt |
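The "needle in the haystack" procedure described above can be sketched as follows. This is only an illustration, not the actual evaluate_rerankers.py: `score_pair` here is a trivial word-overlap stand-in, where a real run would call a reranker model, and `needle_rank` is my own naming.

```python
def score_pair(question: str, chunk: str) -> float:
    # stand-in scorer: word overlap between question and chunk;
    # a real benchmark run would query a reranker model here
    q, c = set(question.lower().split()), set(chunk.lower().split())
    return len(q & c) / max(len(q), 1)

def needle_rank(question: str, needle: str, haystack: list[str]) -> int:
    # rank all chunks (needle included) by score, best first;
    # return the needle's index: 0 means the answer was ranked first
    chunks = haystack + [needle]
    ranked = sorted(chunks, key=lambda c: score_pair(question, c), reverse=True)
    return ranked.index(needle)

haystack = ["The weather was nice in Paris.", "Cats sleep most of the day."]
print(needle_rank("When was Peter Donkey born ?",
                  "Peter Donkey was born in november in 1996", haystack))  # 0
```

The benchmark's "lower index = better" metric is then just this index averaged over all question/needle pairs.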
@ExtReMLapin Thanks for the benchmark. I wonder if you could add the bce reranker (https://huggingface.co/maidalun1020/bce-reranker-base_v1). It's an interesting one because it claims to have a "meaningful rerank score". See their repo for more information: https://github.com/netease-youdao/BCEmbedding Edit: oops, I didn't realize it doesn't support the three languages other than English in your table :( |
If a reranker is not even succeeding at the needle in the haystack challenge better than bge-m3 embeddings, it's literally not worth our disk space. For reference, in the EXACT same benchmark but using embedding models instead, the average score of BGE-M3 is ~4.5. So a reranker here is doing pretty much WORSE than embeddings 🤡 I'm re-running another test; I added 3 fan-fiction PDFs and one question-pair set where the link between question and answer is much more subtle. |
I finished running more tests using the new dataset. For reference, in the first needle in the haystack challenge, the searched text (needle/answer) was very similar to the question (query). The question/needle pairs were like:

```python
{
    "question": "When was Peter Donkey Born ?",
    "needles": [
        "Peter Donkey was born in november in 1996",
        "P. Donkey was born in 1996",
        "Peter Donkey est né en novembre 1996",
        "Peter Donkey ese nacio en 1996",
    ],
},
{
    "question": "What is the height of Mount Everest?",
    "needles": [
        "Mount Everest measures 8,848 meters above sea level.",
        "The tallest mountain is 8,848 meters high.",
        "La montagne la plus haute mesure 8 848 mètres, c'est l'Everest.",
        "La montaña más alta mide 8,848 metros.",
        "Der höchste Berg der Welt ist 8.848 Meter hoch.",
    ],
},
{
    "question": "Who invented the telephone?",
    "needles": [
        "Alexander Graham Bell is credited with the invention of the telephone.",
        "The telephone was first patented by Bell in 1876.",
        "Le téléphone a été inventé par Alexander Graham Bell.",
        "El teléfono fue inventado por Alexander Graham Bell.",
        "Das Telefon wurde von Alexander Graham Bell erfunden.",
    ],
},
```

You didn't even need a reranker to find the answer, as it shared similar words with the question. I added a new dataset called "subtle", which is more like this:

```python
{
    "question": "When did Peter eat a fruit ?",
    "needles": [  # link is fruit -> apple
        "Right after he went to the gym, he ate an apple.",
    ],
},
{
    "question": "What did the criminal do to get in jail ?",
    "needles": [  # link is jail -> emprisoned
        "He's emprisoned because he stole a car.",
    ],
},
{
    "question": "What did the doctor prescribe to the patient ?",
    "needles": [  # link is doctor/patient -> hospital
        "Back from the hospital, he got penicilin.",
    ],
},
{
    "question": "What did the teacher give to the student ?",
    "needles": [  # link is teacher/student -> school
        "At school, he received a book.",
    ],
},
{
    "question": "What is used to quench thirst?",
    "needles": [  # link is thirst -> drink
        "After the long walk, he drank a glass of water.",
    ],
},
```
|
Thanks a lot for these tests and analysis. I'm not sure why the scores are different - likely I'm missing something in the final classification layer. I'm looking at this code here, but maybe this is the wrong place:
To switch to CPU-only inference, add the corresponding flag. You can also adjust the context size. In the example, I've used 65536 in order to fit 8 queries of size 8192 in parallel at the same time. But if your use case is different, consider adjusting the parameters. Let me know if it is not clear and I will explain further.
Maybe spin down the GPU-based instance when it is not used and spin it up when there is high load? Not sure. |
Unfortunately, my LLM/reranker knowledge is currently not enough, but I'm learning :)
Thanks! I was not aware of
According to Task Manager (Win10), Any strange information in the log? You could also point me to some code to look at. I'm not bad at debugging 😆 Temporarily shutting down the server is not a bad idea. I will use it until I find a better solution. |
@rujialiu have you compared the results of ChatLLM against infinity and tei? |
@foldl You can test the results using the following input:

```text
<s>A red apple</s><s>A llama in the garden</s>
<s>A red apple</s><s>I want some fruit</s>
```

Infinity and tei produce |
No, I think it's expected to have some small VRAM usage. You probably want to deploy 2 separate instances. For the CUDA instance, make sure to offload all layers and enable flash attention: |
You can also set the |
I tried, but I can't find a trivial way to separate the reranker part of ChatLLM so I don't really know how to get the result to compare. Will try again when I have more time. |
@ggerganov sadly, after disabling |
@rujialiu, there are easy-to-use Python bindings: |
Oops. I forgot to check the Python bindings. Thanks! And a (hopefully) easier way to debug is to trace through the official code of bge:

```python
from FlagEmbedding import FlagReranker

reranker = FlagReranker('BAAI/bge-reranker-v2-m3', use_fp16=True)  # use_fp16=True speeds up computation with a slight performance degradation
score = reranker.compute_score([['A red apple', 'A llama in the garden'], ['A red apple', 'I want some fruit']])
print(score)  # [-8.7265625, -0.10467529296875]
```
|
I looked deeper into this. After printing the tokenization result, I noticed a difference. llama.cpp:

However, FlagEmbedding's output (I modified its source code to print the tokenizer result):

So the BOS of the second sequence is 2 instead of 0. I don't know why, but if I change llama.cpp's code to enforce it to be 2 like this:

```cpp
for (int k = 0; k < n_prompts; k++) {
    // clamp to n_batch tokens
    auto & inp = inputs[k];
    inp[5] = 2; // <--- just for troubleshooting
```

Then it can get a very similar result: -8.797 (vs tei's -8.773)! I've also tested some other equal-length pairs like this:

```python
score = reranker.compute_score([['A red apple', 'A llama in the garden'], ['A red apple', 'I want some fruit!'], ['A red apple', 'Another nice looking big apple']])
print(score)  # [-8.7265625, -1.53125, 2.203125]
```

llama.cpp's output:

That's very close! EDIT: I made a mistake in my original post. After changing the second sentence's BOS to 2, actually everything is correct now. So we probably just need:
|
Nice find! So it really depends on what the correct formatting of the input is. The approach that we have implemented in Adding sigmoid option is easy. |
I'm curious too. So I searched in

```python
class XLMRobertaTokenizerFast(PreTrainedTokenizerFast):
    ...
    def __init__(
        self,
        vocab_file=None,
        tokenizer_file=None,
        bos_token="<s>",
        eos_token="</s>",
        sep_token="</s>",
        cls_token="<s>",
        unk_token="<unk>",
        pad_token="<pad>",
        mask_token="<mask>",
        **kwargs,
    ):
```

So they're adding |
Also, for

```rust
impl Default for RobertaProcessing {
    fn default() -> Self {
        Self {
            sep: ("</s>".into(), 2),
            cls: ("<s>".into(), 0),
            trim_offsets: true,
            add_prefix_space: true,
        }
    }
}
```

see https://docs.rs/tokenizers/latest/src/tokenizers/processors/roberta.rs.html#16-25 |
Ok, it's clear now - we need to use:
|
After reading more code in

```python
def build_inputs_with_special_tokens(
    self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
    """
    Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and
    adding special tokens. An XLM-RoBERTa sequence has the following format:

    - single sequence: `<s> X </s>`
    - pair of sequences: `<s> A </s></s> B </s>`

    Args:
        token_ids_0 (`List[int]`):
            List of IDs to which the special tokens will be added.
        token_ids_1 (`List[int]`, *optional*):
            Optional second list of IDs for sequence pairs.

    Returns:
        `List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
    """
    if token_ids_1 is None:
        return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
    cls = [self.cls_token_id]
    sep = [self.sep_token_id]
    return cls + token_ids_0 + sep + sep + token_ids_1 + sep
```

So it's 3 `</s>` in total for a pair. |
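Putting the pieces together, a minimal sketch of the pair layout in raw token IDs, assuming `cls_token_id = 0` and `sep_token_id = 2` as in the `RobertaProcessing` defaults quoted above (the `build_pair` helper and the query/document IDs are illustrative, not real code from any of the libraries discussed):

```python
def build_pair(token_ids_0, token_ids_1, cls_id=0, sep_id=2):
    # <s> A </s></s> B </s>  ->  cls + A + sep + sep + B + sep
    return [cls_id] + token_ids_0 + [sep_id, sep_id] + token_ids_1 + [sep_id]

# hypothetical token IDs, for illustration only
query_ids, doc_ids = [101, 102], [201, 202, 203]
print(build_pair(query_ids, doc_ids))  # [0, 101, 102, 2, 2, 201, 202, 203, 2]
```

This matches the earlier observation: the second sequence is introduced by token 2 (`</s></s>`), not by another token 0 (`<s>`).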
It's my fault: the pattern in #8555 (comment) is wrong. Fixed now. After the tokenizer fix, I think the results of chatllm and infinity/tei match now. |
Gentle ping @ggerganov for the sigmoid option |
Seems like a good first issue for new contributors. Feel free to create one. |
Ok, I'll try next week |
I ended up adding sigmoid in |
We can leave it up to the client to apply sigmoid if they need to. It's a trivial operation and I don't think there is any need to do it server-side. But we can add this option nevertheless. We just have to keep the existing behavior as default and optionally enable the sigmoid per-request. |
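Applying the sigmoid client-side is indeed a one-liner; a minimal sketch using the raw rank scores quoted earlier in the thread:

```python
import math

def sigmoid(score: float) -> float:
    # map a raw reranker logit to a 0-1 relevance score
    return 1.0 / (1.0 + math.exp(-score))

# raw scores from the bge-reranker-v2-m3 example above
scores = [-8.7265625, -0.10467529296875, 2.203125]
normalized = [sigmoid(s) for s in scores]
print([round(p, 4) for p in normalized])
```

The relative ordering of documents is unchanged since the sigmoid is monotonic; only the scale moves into (0, 1).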
You're right. I changed that because I want it to be a painless (almost drop-in) replacement of |
@QuintinShaw we ran more tests (real-life tests) at the office. We took the second Harry Potter book and used a sigmoid to give us a 0-1 score from the reranker. The Alibaba reranker is terrible compared to the BGE reranker. On the left, you can see the sorted scores of each chunk (80) for the query "Which animal lives in the Chamber of Secrets ?". On the right, the same for BGE M3 V2. Alibaba doesn't highlight any chunk, while the BGE reranker does. |
* py : add XLMRobertaForSequenceClassification [no ci]
* py : fix scalar-tensor conversion [no ci]
* py : fix position embeddings chop [no ci]
* llama : read new cls tensors [no ci]
* llama : add classification head (wip) [no ci]
* llama : add "rank" pooling type (ggml-ci)
* server : add rerank endpoint (ggml-ci)
* llama : avoid ggml_repeat during classification
* rerank : cleanup + comments
* server : accept /rerank endpoint in addition to /v1/rerank [no ci]
* embedding : parse special tokens
* jina : support v1 reranker
* vocab : minor style (ggml-ci)
* server : initiate tests for later (ggml-ci)
* server : add docs
* llama : add comment [no ci]
* llama : fix uninitialized tensors
* ci : add rerank tests (ggml-ci)
* add reranking test
* change test data
* Update examples/server/server.cpp (Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>)
* add `--reranking` argument
* update server docs
* llama : fix comment [no ci] (ggml-ci)

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
ref #8555
This adds initial support for reranking to `libllama`, `llama-embeddings` and `llama-server`. I've tested mainly with the following 2 models:

The reranking is implemented as a pooling layer of type `LLAMA_POOLING_TYPE_RANK`. When used, `libllama` will attach a classification head at the end of the graph:

llama.cpp/src/llama.cpp, lines 10246 to 10266 in 4d45775
The current implementation likely does not cover all types of rerankers, so updates would be necessary in the future to support other types of classifications on a case-by-case basis.
The computed rank scores for each sequence can be accessed via the `llama_get_embeddings_seq()` call:

llama.cpp/include/llama.h, lines 873 to 878 in 4d45775
The rank score is stored as a single float.
The server endpoint is designed mostly after https://jina.ai/reranker/, but it is not fully complete. Again, I think it's better to update it on a case-by-case basis, plus the API is not ideal (e.g. what is the purpose of `top_n`? why are the document contents returned in the response?).

I started to add server tests, but it will take me more time to write the Python code, so I'll create a separate issue for people to help with that:
llama.cpp/examples/server/tests/features/rerank.feature, lines 19 to 25 in 4d45775
TODO:

* tests (left for follow-up PR)

Model: https://huggingface.co/BAAI/bge-reranker-v2-m3
Testing:
```sh
python3 convert_hf_to_gguf.py \
    ~/Data/huggingface/bge-reranker-v2-m3/ \
    --outfile models/bge-reranker-v2-m3/ggml-model-f16.gguf \
    --outtype f16
```
Classifier:
https://github.com/huggingface/transformers/blob/5af7d41e49bbfc8319f462eb45253dcb3863dfb7/src/transformers/models/roberta/modeling_roberta.py#L1566
Testing (CLI)
Rank responses:
```sh
./llama-embedding \
    -m models/bge-reranker-v2-m3/ggml-model-f16.gguf \
    -p "what is panda?</s><s>hi\nwhat is panda?</s><s>it's a bear\nwhat is panda?</s><s>The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China." \
    --pooling rank --embd-normalize -1 --verbose-prompt
```
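The `-p` string above packs each query/document pair as `query</s><s>document`, one pair per `\n`-separated line. A small helper to build such a prompt (the `pack_rerank_prompt` name is my own, not part of llama.cpp):

```python
def pack_rerank_prompt(query: str, documents: list[str]) -> str:
    # each line is one (query, document) pair in the "q</s><s>d" format
    # expected by the rank pooling example above
    return "\n".join(f"{query}</s><s>{doc}" for doc in documents)

print(pack_rerank_prompt("what is panda?", ["hi", "it's a bear"]))
```

With rank pooling, `llama-embedding` then produces one score per line/sequence rather than an embedding vector.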
Testing (server)
result: