RAG Benchmarking: Multi-Strategy Retrieval for Multi-Hop QA


A complete RAG system that achieves 72.89% Recall@10 on MultiHop-RAG, surpassing the ~70% RAPTOR baseline. This repository includes:

  • 🔧 Full RAG Implementation (ultimate_rag/) - RAPTOR + Graph + HyDE + BM25 + Neural Reranking
  • 📊 Benchmark Suite (adapters/, scripts/) - Evaluation harness for MultiHop-RAG, CRAG
  • 📝 Documentation (docs/) - Blog post, technical report, architecture

Quick Start

1. Install Dependencies

# Clone the repo
git clone https://github.com/incidentfox/OpenRag.git
cd OpenRag

# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install requirements
pip install -r requirements.txt

2. Set API Keys

export OPENAI_API_KEY="sk-..."
export COHERE_API_KEY="..."  # Optional but recommended for best performance

3. Start the RAG Server

cd ultimate_rag
python -m api.server

Server runs at http://localhost:8000. Check health: curl http://localhost:8000/health

4. Run Benchmark

# MultiHop-RAG (2556 queries)
python scripts/run_multihop_eval.py --queries 100  # Quick test

# Full benchmark
python scripts/run_multihop_eval.py
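The evaluation scores each query with Recall@10. A minimal sketch of that metric (function name and scoring convention are illustrative, not the script's exact code):

```python
def recall_at_k(retrieved_ids, gold_ids, k=10):
    """Fraction of a query's gold evidence documents found in the top-k retrieved."""
    gold = set(gold_ids)
    if not gold:
        return 0.0
    return len(set(retrieved_ids[:k]) & gold) / len(gold)

# Example: one of two gold documents retrieved in the top 10
score = recall_at_k(["d3", "d1", "d7"], ["d1", "d9"], k=10)  # 0.5
```

The benchmark figure is then the mean of this per-query score over all queries.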

Results

| Benchmark    | Queries Tested  | Our Result | SOTA    | Notes                         |
|--------------|-----------------|------------|---------|-------------------------------|
| MultiHop-RAG | 2,556 (full)    | 72.89%     | ~70%    | Beats RAPTOR baseline         |
| SQuAD        | 200+ (ongoing)  | 99.0%      | ~85-90% | Full benchmark running on EC2 |
| CRAG         | 10 (sample)     | 70%        | ~50-60% | Per-query corpus test         |

Note on SQuAD: The full 10,570-query benchmark is still running on EC2; after the first 200 queries, Recall@10 is 99.0%.

Note on CRAG: Tested 10 queries, using each query's provided search results as the corpus. Scaling further requires per-query ingestion, which is compute-intensive. CRAG is designed for API-augmented RAG rather than static document retrieval.

Ablation Study

| Component              | Recall@10 | Δ from baseline (pp) |
|------------------------|-----------|----------------------|
| Semantic only          | 55.2%     | -                    |
| + RAPTOR hierarchy     | 62.5%     | +7.3                 |
| + Cohere reranking     | 71.8%     | +16.6                |
| + BM25 hybrid          | 72.4%     | +17.2                |
| + HyDE + Query decomp  | 72.89%    | +17.7                |

Key insight: Cohere's neural reranker alone adds +9.3 percentage points.


Architecture

┌──────────────────────────────────────────────────────────────────┐
│                         Query Input                              │
└──────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────┐
│                   Parallel Retrieval Strategies                  │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐             │
│  │ Semantic │ │   HyDE   │ │   BM25   │ │  Query   │             │
│  │  Search  │ │ Expansion│ │  Hybrid  │ │  Decomp  │             │
│  └──────────┘ └──────────┘ └──────────┘ └──────────┘             │
└──────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────┐
│                    Cohere Neural Reranking                       │
│                     (rerank-english-v3.0)                        │
└──────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────┐
│                         Top-K Results                            │
└──────────────────────────────────────────────────────────────────┘
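The flow above can be sketched in a few lines of Python (names are illustrative; the real orchestration lives in ultimate_rag/retrieval/retriever.py):

```python
from concurrent.futures import ThreadPoolExecutor

def retrieve(query, strategies, reranker, top_k=10):
    """Run all retrieval strategies in parallel, merge candidates, then rerank."""
    with ThreadPoolExecutor() as pool:
        batches = pool.map(lambda strategy: strategy(query), strategies)
    # Deduplicate candidates across strategies, preserving first occurrence
    seen, candidates = set(), []
    for batch in batches:
        for doc in batch:
            if doc not in seen:
                seen.add(doc)
                candidates.append(doc)
    # The neural reranker scores (query, doc) pairs; keep the top_k
    ranked = sorted(candidates, key=lambda d: reranker(query, d), reverse=True)
    return ranked[:top_k]

# Toy strategies and a stand-in reranker, purely for illustration
semantic = lambda q: ["doc_a", "doc_b"]
bm25 = lambda q: ["doc_b", "doc_c"]
score = lambda q, d: len(d)  # stand-in for a cross-encoder relevance score
top = retrieve("merger outcome", [semantic, bm25], score, top_k=2)
```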

Repository Structure

rag_benchmarking/
├── ultimate_rag/              # 🔧 Full RAG implementation
│   ├── api/
│   │   └── server.py          # FastAPI server
│   ├── retrieval/
│   │   ├── retriever.py       # Main orchestration
│   │   ├── strategies.py      # HyDE, BM25, decomposition
│   │   └── reranker.py        # Cohere + cross-encoder
│   ├── raptor/
│   │   └── tree_building.py   # RAPTOR hierarchy
│   ├── graph/
│   │   └── graph.py           # Knowledge graph
│   ├── core/
│   │   └── node.py            # Tree/forest data structures
│   └── agents/
│       └── teaching.py        # Knowledge teaching interface
│
├── knowledge_base/            # 📚 RAPTOR core library
│   └── raptor/
│       ├── cluster_tree_builder.py
│       ├── EmbeddingModels.py
│       └── ...
│
├── adapters/                  # 🔌 Benchmark adapters
│   └── ultimate_rag_adapter.py
│
├── scripts/                   # 🚀 Evaluation scripts
│   ├── run_multihop_eval.py
│   └── run_crag_eval.py
│
├── docs/                      # 📝 Documentation
│   ├── blog_post.md           # Practitioner-friendly writeup
│   ├── technical_report.md    # Academic-style report
│   └── README.md
│
├── multihop_rag/              # 📊 MultiHop-RAG dataset
│   └── dataset/
│       ├── corpus.json        # 609 news articles
│       └── MultiHopRAG.json   # 2,556 queries
│
├── crag/                      # 📊 CRAG dataset
│   └── ...
│
└── requirements.txt           # Dependencies

API Endpoints

Health Check

curl http://localhost:8000/health

Query (Retrieval)

curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"query": "What was the outcome of the merger?", "top_k": 10}'

Ingest Documents

curl -X POST http://localhost:8000/ingest/batch \
  -H "Content-Type: application/json" \
  -d '{
    "tree": "default",
    "documents": [{"content": "Document text here..."}],
    "build_hierarchy": true
  }'
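For larger corpora it helps to split ingestion into multiple requests. A small sketch that builds /ingest/batch request bodies (helper name and batch size are illustrative):

```python
import json

def batch_payloads(texts, tree="default", batch_size=50, build_hierarchy=True):
    """Yield JSON bodies for POST /ingest/batch, batch_size documents at a time."""
    for i in range(0, len(texts), batch_size):
        yield json.dumps({
            "tree": tree,
            "documents": [{"content": t} for t in texts[i:i + batch_size]],
            "build_hierarchy": build_hierarchy,
        })

# Three documents with batch_size=2 produce two request bodies
payloads = list(batch_payloads(["a", "b", "c"], batch_size=2))
```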

Save/Load Tree

# Save
curl -X POST http://localhost:8000/persist/save \
  -H "Content-Type: application/json" \
  -d '{"tree": "default"}'

# Load
curl -X POST http://localhost:8000/persist/load \
  -H "Content-Type: application/json" \
  -d '{"tree": "default", "path": "trees/default.pkl"}'

Configuration

Retrieval Modes

| Mode     | Strategies                     | Use Case                    |
|----------|--------------------------------|-----------------------------|
| fast     | Semantic only                  | Low latency, simple queries |
| standard | Semantic + HyDE + BM25 + Decomp| Balanced (default)          |
| thorough | All strategies                 | Maximum recall, high latency|
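A sketch of how a mode could map to enabled strategies. The fast and standard lists follow the table above; the thorough list is an assumption (the table only says "all strategies"), and the actual wiring lives in the retriever:

```python
# Hypothetical RETRIEVAL_MODE -> strategy mapping, for illustration only
MODES = {
    "fast": ["semantic"],
    "standard": ["semantic", "hyde", "bm25", "decomposition"],
    "thorough": ["semantic", "hyde", "bm25", "decomposition", "raptor", "graph"],
}

def strategies_for(mode):
    """Return the strategy names enabled for a retrieval mode."""
    if mode not in MODES:
        raise ValueError(f"unknown retrieval mode: {mode!r}")
    return MODES[mode]
```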

Environment Variables

OPENAI_API_KEY=sk-...          # Required for embeddings
COHERE_API_KEY=...             # Recommended for reranking (see privacy note below)
RETRIEVAL_MODE=standard        # fast|standard|thorough
DEFAULT_TOP_K=10               # Number of results
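Reading these variables with their documented defaults might look like the following (a sketch; the server's actual config parsing may differ):

```python
import os

# Documented defaults: RETRIEVAL_MODE=standard, DEFAULT_TOP_K=10
config = {
    "openai_api_key": os.environ.get("OPENAI_API_KEY"),  # required for embeddings
    "cohere_api_key": os.environ.get("COHERE_API_KEY"),  # optional, for reranking
    "retrieval_mode": os.environ.get("RETRIEVAL_MODE", "standard"),
    "top_k": int(os.environ.get("DEFAULT_TOP_K", "10")),
}
```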

Privacy Notice: Cohere Reranker

This system uses Cohere's rerank API for neural reranking, which gives the best benchmark results (+9.3 percentage points in our ablation). Please be aware:

  • Data logging: By default, Cohere logs prompts and outputs on their SaaS platform (retained for 30 days)
  • Training opt-out: You can disable data usage for training in your Cohere dashboard under "Data Controls"
  • Zero retention: Enterprise customers can request zero data retention
  • Cloud deployments: If using Cohere via AWS/GCP/Azure, Cohere does not receive your data

For privacy-sensitive use cases, consider these alternatives:

  1. Local cross-encoder: The system includes CrossEncoderReranker using BAAI/bge-reranker-base (runs locally, no external API)
  2. Remove Cohere: Leave COHERE_API_KEY unset and the system falls back to local reranking
  3. LLM-as-reranker: Use a local/GDPR-compliant LLM for reranking
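The fallback in option 2 can be sketched as follows (illustrative; the repo's actual logic lives in ultimate_rag/retrieval/reranker.py):

```python
import os

def pick_reranker(cohere_rerank=None, local_rerank=None):
    """Prefer the Cohere reranker when COHERE_API_KEY is set; else fall back
    to a local one (e.g. a cross-encoder such as BAAI/bge-reranker-base)."""
    if os.environ.get("COHERE_API_KEY") and cohere_rerank is not None:
        return cohere_rerank
    return local_rerank

# Stand-in local reranker for illustration: rank documents by length
local = lambda query, docs: sorted(docs, key=len)
rerank = pick_reranker(local_rerank=local)  # no Cohere callable -> local path
ranked = rerank("q", ["longer document", "short"])
```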

See Cohere's privacy policy and enterprise data commitments for details.


Cost Analysis

| Component           | Cost per Query |
|---------------------|----------------|
| OpenAI embeddings   | $0.000007      |
| HyDE generation     | $0.00018       |
| Query decomposition | $0.00027       |
| Cohere reranking    | $0.002         |
| Total               | ~$0.0025       |

Full benchmark (2556 queries): ~$6
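The totals check out arithmetically:

```python
# Per-query costs from the table above (USD)
costs = {
    "embeddings": 0.000007,
    "hyde": 0.00018,
    "decomposition": 0.00027,
    "rerank": 0.002,
}
per_query = sum(costs.values())    # 0.002457, i.e. ~$0.0025
full_benchmark = per_query * 2556  # ~$6.28, roughly the ~$6 quoted above
```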


Documentation

  • docs/blog_post.md - Practitioner-friendly writeup
  • docs/technical_report.md - Academic-style report

Citation

If you use this code, please cite:

@software{rag_benchmarking_2026,
  title = {Multi-Strategy RAG for Multi-Hop Question Answering},
  author = {Anonymous},
  year = {2026},
  url = {https://github.com/incidentfox/OpenRag}
}

License

MIT License - see LICENSE for details.

