DIVER: A Multi-Stage Approach for Reasoning-intensive Information Retrieval

【 📄 中文 | 📖 arXiv | 🤗 HF Papers | 🚀 Diver Models| Wechat 】

While retrieval-augmented generation (RAG) excels at direct knowledge retrieval, it falters on complex queries that require abstract or multi-step reasoning. To bridge this gap, we developed DIVER, a retrieval pipeline engineered for these reasoning-intensive tasks. DIVER integrates four stages: document pre-processing, iterative LLM-driven query expansion, a specialized retriever fine-tuned on complex synthetic data, and a novel reranker that merges retrieval scores with LLM-generated helpfulness ratings. On the BRIGHT benchmark, DIVER sets a new state-of-the-art, significantly outperforming other reasoning-aware models (NDCG 45.8). These results underscore the effectiveness of integrating deep reasoning into retrieval for solving complex, real-world problems. More details can be seen at Diver paper.

Key Features

1.LLM-Driven Query Expansion: Intelligently refines the search query.

2.Reasoning-Enhanced Retriever: A fine-tuned model that understands complex relationships.

3.Merged Reranker: Combines traditional search scores with LLM-based "helpfulness" scores for superior ranking.

🎉 Update

[2025-11-20] 🚀 We released our GroupRank reranking model Diver-GroupRank-7B and Diver-GroupRank-32B. The inference and SFT training code can be found at ./Reranker/rerank_groupwise.py and ./Reranker/train_sft_groupwise_reranker.sh. Our Diver-GroupRank-32B achieves 46.8 at BRIGHT via test-time scaling. The details can be found at GroupRank paper.
[2025-11-11] Environment installation guide is provided in ./env_requirements/README.md for reproduction. The code for merging pointwise and listwise rerankers is at ./Reranker/rerank_merge_point_and_list.py.
[2025-10-20] 🚀 We released DIVER-Retriever-4B-1020 model at ModelScope and Hugging Face, which achieve 31.9 at BRIGHT.
[2025-10-14] 🚀 We released DIVER-Retriever-1.7B model at ModelScope and Hugging Face, which achieve 27.3 at BRIGHT.
[2025-09-27] 🎉 Our Diver-Retriever-4B model have achieved monthly 2.64k+ downloads at 🤗 HuggingFace !
[2025-09-12] 🚀 We released the code for listwise reranking using Gemini; it can be found at ./Retriever/rerank_listwise.py, and it achieved a score of 43.9 on BRIGHT.
[2025-09-05] 🚀 We released DIVER-Retriever-0.6B model at ModelScope and Hugging Face, which achieve 25.2 at BRIGHT.
[2025-08-28] 🚀 We released our DIVER-Retriever-4B model at ModelScope.
[2025-08-24] 🏆 We released our Diver V2, which reaches 45.8 on Bright Leaderboard.
[2025-08-18] 🚀 We released our full codebase, including inference and SFT training.

TODO List

✅ Release DIVER-Reranker: Release source code and models

Model Downloads

You can download the following table to see the various parameters for your use case. If you are located in mainland China, we also provide the model on ModelScope.cn to speed up the download process.

Model	#Total Params	Context Length	Download	BRIGHT
Diver-GroupRank-7B	7B	32K	[🤗 HuggingFace]https://huggingface.co/AQ-MedAI/Diver-GroupRank-7B [🤖 ModelScope]https://www.modelscope.cn/models/AQ-MedAI/Diver-GroupRank-7B
Diver-GroupRank-32B	32B	32K	[🤗 HuggingFace]https://huggingface.co/AQ-MedAI/Diver-GroupRank-32B [🤖 ModelScope]https://www.modelscope.cn/models/AQ-MedAI/Diver-GroupRank-32B	46.8
DIVER-Retriever-4B-1020	4B	40K	[🤗 HuggingFace]https://huggingface.co/AQ-MedAI/Diver-Retriever-4B-1020 [🤖 ModelScope]https://www.modelscope.cn/models/AQ-MedAI/Diver-Retriever-4B-1020	31.9
DIVER-Retriever-4B	4B	40K	[🤗 HuggingFace]https://huggingface.co/AQ-MedAI/Diver-Retriever-4B [🤖 ModelScope]https://www.modelscope.cn/models/AQ-MedAI/Diver-Retriever-4B	28.9
DIVER-Retriever-1.7B	1.7B	40K	[🤗 HuggingFace]https://huggingface.co/AQ-MedAI/Diver-Retriever-1.7B [🤖 ModelScope]https://www.modelscope.cn/models/AQ-MedAI/Diver-Retriever-1.7B	27.3
DIVER-Retriever-0.6B	0.6B	32K	[🤗 HuggingFace]https://huggingface.co/AQ-MedAI/Diver-Retriever-0.6B [🤖 ModelScope]https://www.modelscope.cn/models/AQ-MedAI/Diver-Retriever-0.6B	25.2

Evaluation

Overall Evaluation

Performance comparisons with competitive baselines on the BRIGHT leaderboard. The best result for each dataset is highlighted in bold.

Method	Avg.	Bio.	Earth.	Econ.	Psy.	Rob.	Stack.	Sus.	Leet.	Pony	AoPS	TheoQ.	TheoT.
Rank-R1-14B	20.5	31.2	38.5	21.2	26.4	22.6	18.9	27.5	9.2	20.2	9.7	11.9	9.2
Qwen1.5-7B with InteRank-3B	27.4	51.2	51.4	22.4	31.9	17.3	26.6	22.4	24.5	23.1	13.5	19.3	25.5
GPT4 with Rank1-32B	29.4	49.7	35.8	22.0	37.5	22.5	21.7	35.0	18.8	32.5	10.8	22.9	43.7
ReasonIR with QwenRerank	36.9	58.2	53.2	32.0	43.6	28.8	37.6	36.0	33.2	34.8	7.9	32.6	45.0
ReasonIR with Rank-R1-32B	38.8	59.5	55.1	37.9	52.7	30.0	39.3	45.1	32.1	17.1	10.7	40.4	45.6
RaDeR with QwenRerank	39.2	58.0	59.2	33.0	49.4	31.8	39.0	36.4	33.5	33.3	10.8	34.2	51.6
XRR2	40.3	63.1	55.4	38.5	52.9	37.1	38.2	44.6	21.9	35.0	15.7	34.4	46.2
ReasonRank	40.8	62.72	55.53	36.7	54.64	35.69	38.03	44.81	29.46	25.56	14.38	41.99	50.06
DIVER	41.6	62.2	58.7	34.4	52.9	35.6	36.5	42.9	38.9	25.4	18.3	40.0	53.1
BGE Reasoner	45.2	66.5	63.7	39.4	50.3	37	42.9	43.7	35.1	44.3	17.2	44.2	58.5
DIVER V2	45.8	68	62.5	42.0	58.2	41.5	44.3	49.2	34.8	32.9	19.1	44.3	52.6

Diver Retriever Evaluation

Method	Avg.	Bio.	Earth.	Econ.	Psy.	Rob.	Stack.	Sus.	Leet.	Pony	AoPS	TheoQ.	TheoT.
Evaluate Retriever with Original Query
BM25	14.5	18.9	27.2	14.9	12.5	13.6	18.4	15.0	24.4	7.9	6.2	10.4	4.9
SBERT	14.9	15.1	20.4	16.6	22.7	8.2	11.0	15.3	26.4	7.0	5.3	20.0	10.8
gte-Qwen1.5-7B	22.5	30.6	36.4	17.8	24.6	13.2	22.2	14.8	25.5	9.9	14.4	27.8	32.9
Qwen3-4B	5.6	3.5	8.0	2.3	2.0	1.6	1.0	4.4	2.1	0.1	4.9	18.0	19.2
OpenAI	17.9	23.3	26.7	19.5	27.6	12.8	14.3	20.5	23.6	2.4	8.5	23.5	11.7
Google	20.0	22.7	34.8	19.6	27.8	15.7	20.1	17.1	29.6	3.6	9.3	23.8	15.9
ReasonIR-8B	24.4	26.2	31.4	23.3	30.0	18.0	23.9	20.5	35.0	10.5	14.7	31.9	27.2
RaDeR-7B	25.5	34.6	38.9	22.1	33.0	14.8	22.5	23.7	37.3	5.0	10.2	28.4	35.1
Seed1.5-Embedding	27.2	34.8	46.9	23.4	31.6	19.1	25.4	21.0	43.2	4.9	12.2	33.3	30.5
DIVER-Retriever-0.6B	25.2	36.4	41.9	29.0	31.0	21.2	24.6	23.2	15.6	6.8	8.4	33.2	31.7
DIVER-Retriever-4B	28.9	41.8	43.7	21.7	35.3	21.0	21.2	25.1	37.6	13.2	10.7	38.4	37.3
Evaluate Retriever with GPT-4 REASON-query
BM25	27.0	53.6	54.1	24.3	38.7	18.9	27.7	26.3	19.3	17.6	3.9	19.2	20.8
SBERT	17.8	18.5	26.3	17.5	27.2	8.8	11.8	17.5	24.3	10.3	5.0	22.3	23.5
gte-Qwen1.5-7B	24.8	35.5	43.1	24.3	34.3	15.4	22.9	23.9	25.4	5.2	4.6	28.7	34.6
Qwen3-4B	5.5	1.3	17.3	2.5	6.2	1.0	4.8	4.5	3.0	5.9	0.0	7.2	12.5
OpenAI	23.3	35.2	40.1	25.1	38.0	13.6	18.2	24.2	24.5	6.5	7.7	22.9	23.8
Google	26.2	36.4	45.6	25.6	38.2	18.7	29.5	17.9	31.1	3.7	10.0	27.8	30.4
ReasonIR-8B	29.9	43.6	42.9	32.7	38.8	20.9	25.8	27.5	31.5	19.6	7.4	33.1	35.7
RaDeR-7B	29.2	36.1	42.9	25.2	37.9	16.6	27.4	25.0	34.8	11.9	12.0	37.7	43.4
DIVER-Retriever-4B	32.1	51.9	53.5	29.5	41.2	21.4	27.5	26.1	33.5	11.7	9.5	39.3	39.7
Evaluate retriever with DIVER-QExpand query
ReasonIR-8B	32.6	49.4	44.7	32.4	44.0	26.6	31.8	29.0	32.3	12.8	9.1	40.7	38.4
+BM25 (Hybrid)	35.7	56.8	53.5	33.0	48.5	29.4	34.2	32.0	35.2	16.8	12.9	39.3	36.8
DIVER-Retriever	33.9	54.5	52.7	28.8	44.9	25.1	27.4	29.5	34.5	10.0	14.5	40.7	44.7
+BM25 (Hybrid)	37.2	60.0	55.9	31.8	47.9	27.1	33.9	31.9	35.1	23.1	16.8	36.9	46.6

Quickstart

Inference

Reproduction Our Results on BRIGHT benchmark

One-click reproduction：

sh run_all.sh

or step-by-step reproduction：

# 0.1 Download BRIGHT dataset
cd Diver
git clone https://huggingface.co/datasets/xlangai/BRIGHT ./data/BRIGHT
# or modelscope download --dataset xlangai/BRIGHT --local_dir ./data/BRIGHT  # more faster in China

# 0.2 Download models
mkdir models && cd models
git clone https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B  # for DIVER-QExpand
git clone https://huggingface.co/AQ-MedAI/Diver-Retriever-4B  # for DIVER-Qexpand and DIVER-Retriever
cd ..

# 1. DIVER-QExpand
cd ./QExpand
bash run_qexpand.sh

# 2. DIVER-Retriever, achiving 33.9 NDCG@10, as reported in Table 3 of our paper: https://arxiv.org/pdf/2508.07995
cd ../Retriever
bash retriever_script.sh
# merge BM25 and DIVER-Retriever scores, achiving 37.2 NDCG@10 in Table 3 of our paper
python merge_score.py

# 3. DIVER-Reranker (v1 version with only pointwise reranker), achieving 41.6 NDCG@10 as shown in Table 2 of our paper
# cd ./Retriever
bash reranker_script.sh

Different Inference Methods for Diver Retriever

If you only want to do a simple test, below are examples of how to use the retriever with different frameworks:

Sentence Transformers Usage

# Requires transformers>=4.51.0
# Requires sentence-transformers>=2.7.0

from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("AQ-MedAI/Diver-Retriever-4B")


# The queries and documents to embed
queries = [
    "What is the capital of China?",
    "Explain gravity",
]
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
]

# Encode the queries and documents. Note that queries benefit from using a prompt
# Here we use the prompt called "query" stored under `model.prompts`, but you can
# also pass your own prompt via the `prompt` argument
query_embeddings = model.encode(queries, prompt_name="query")
document_embeddings = model.encode(documents)

# Compute the (cosine) similarity between the query and document embeddings
similarity = model.similarity(query_embeddings, document_embeddings)
print(similarity)

Transformers Usage

# Requires transformers>=4.51.0
import torch
import torch.nn.functional as F

from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def last_token_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]


def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery:{query}'

# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'

queries = [
    get_detailed_instruct(task, 'What is the capital of China?'),
    get_detailed_instruct(task, 'Explain gravity')
]
# No need to add instruction for retrieval documents
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
]
input_texts = queries + documents

tokenizer = AutoTokenizer.from_pretrained('AQ-MedAI/Diver-Retriever-4B', padding_side='left')
model = AutoModel.from_pretrained('AQ-MedAI/Diver-Retriever-4B')


max_length = 8192

# Tokenize the input texts
batch_dict = tokenizer(
    input_texts,
    padding=True,
    truncation=True,
    max_length=max_length,
    return_tensors="pt",
)
batch_dict.to(model.device)
outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T)
print(scores.tolist())
# [[0.9319270849227905, 0.5878604054450989], [0.639923095703125, 0.7950234413146973]]

vLLM usage

# Requires vllm>=0.8.5
import torch
import vllm
from vllm import LLM

def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery:{query}'

# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'

queries = [
    get_detailed_instruct(task, 'What is the capital of China?'),
    get_detailed_instruct(task, 'Explain gravity')
]
# No need to add instruction for retrieval documents
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
]
input_texts = queries + documents

model = LLM(model="AQ-MedAI/Diver-Retriever-4B", task="embed")

outputs = model.embed(input_texts)
embeddings = torch.tensor([o.outputs.embedding for o in outputs])
scores = (embeddings[:2] @ embeddings[2:].T)

Finetuning

We recommend you to use swift to finetune our DIVER-Retriever-4B with infonce.

Before starting training, please ensure your environment is properly configured.

pip install ms-swift -U
# Install from source
pip install git+https://github.com/modelscope/ms-swift.git

pip install transformers -U

# Optional packages
pip install deepspeed # multi-GPU training
pip install liger-kernel # save GPU memory resources
pip install flash-attn --no-build-isolation

Training Data Preparation

# LLM
{"query": "sentence1", "response":  "sentence2"}
# MLLM
{"query": "<image>", "response":  "sentence", "images": "/some/images.jpg"}
{"query": "<image>sentence1", "response":  "<image>sentence2", "rejected_response": ["<image>sentence1", "<image>sentence2"], "images": ["/some/images.jpg", "/some/images.jpg", "/some/images.jpg", "/some/images.jpg"]}

Training Command

Using the infonce loss as an example, the complete training command is as follows:

nproc_per_node=8
NPROC_PER_NODE=$nproc_per_node \
swift sft \
    --model DIVER/DIVER-Retriever-4B \
    --task_type embedding \
    --model_type qwen3_emb \
    --train_type full \
    --dataset your_dataset \
    --split_dataset_ratio 0.05 \
    --eval_strategy steps \
    --output_dir output \
    --eval_steps 20 \
    --num_train_epochs 5 \
    --save_steps 20 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --learning_rate 6e-6 \
    --loss_type infonce \
    --label_names labels \
    --dataloader_drop_last true \
    --deepspeed zero3

Citation

If you think our work is helpful, please feel free to give us a cite.

@misc{DIVER,
      title={DIVER: A Multi-Stage Approach for Reasoning-intensive Information Retrieval}, 
      author={Meixiu Long and Duolin Sun and Dan Yang and Junjie Wang and Yue Shen and Jian Wang and Peng Wei and Jinjie Gu and Jiahai Wang},
      year={2025},
      eprint={2508.07995},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2508.07995}, 
}

Acknowledgement

We thank prior works and their open-source repositories: BRIGHT, ReasonIR, RaDer, ThinkQE, Qwen3-Embedding, ReasonRank.

Name		Name	Last commit message	Last commit date
Latest commit History 64 Commits
QExpand		QExpand
Reranker		Reranker
Retriever		Retriever
data/BRIGHT		data/BRIGHT
env_requirements		env_requirements
pic		pic
utils		utils
.DS_Store		.DS_Store
LICENSE.txt		LICENSE.txt
README.md		README.md
README_CN.md		README_CN.md
run_all.sh		run_all.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

DIVER: A Multi-Stage Approach for Reasoning-intensive Information Retrieval

Key Features

🎉 Update

TODO List

Model Downloads

Evaluation

Overall Evaluation

Diver Retriever Evaluation

Quickstart

Inference

Reproduction Our Results on BRIGHT benchmark

Different Inference Methods for Diver Retriever

Finetuning

Training Data Preparation

Training Command

Citation

Acknowledgement

Star History

About

Uh oh!

Releases 1

Packages

Contributors 2

Languages

License

AQ-MedAI/Diver

Folders and files

Latest commit

History

Repository files navigation

DIVER: A Multi-Stage Approach for Reasoning-intensive Information Retrieval

Key Features

🎉 Update

TODO List

Model Downloads

Evaluation

Overall Evaluation

Diver Retriever Evaluation

Quickstart

Inference

Reproduction Our Results on BRIGHT benchmark

Different Inference Methods for Diver Retriever

Finetuning

Training Data Preparation

Training Command

Citation

Acknowledgement

Star History

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Contributors 2

Languages

Packages