RAG Benchmarking: Multi-Strategy Retrieval for Multi-Hop QA


A complete RAG system that achieves 72.89% Recall@10 on MultiHop-RAG, surpassing the ~70% RAPTOR baseline. This repository includes:

  • 🔧 Full RAG Implementation (ultimate_rag/) - RAPTOR + Graph + HyDE + BM25 + Neural Reranking
  • 📊 Benchmark Suite (adapters/, scripts/) - Evaluation harness for MultiHop-RAG, CRAG
  • 📝 Documentation (docs/) - Blog post, technical report, architecture

Quick Start

1. Install Dependencies

# Clone the repo
git clone https://github.com/incidentfox/OpenRag.git
cd OpenRag

# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install requirements
pip install -r requirements.txt

2. Set API Keys

export OPENAI_API_KEY="sk-..."
export COHERE_API_KEY="..."  # Optional but recommended for best performance

3. Start the RAG Server

cd ultimate_rag
python -m api.server

Server runs at http://localhost:8000. Check health: curl http://localhost:8000/health

4. Run Benchmark

# MultiHop-RAG (2556 queries)
python scripts/run_multihop_eval.py --queries 100  # Quick test

# Full benchmark
python scripts/run_multihop_eval.py
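The evaluation scores each query with Recall@10. A minimal sketch of that metric (function name and scoring convention are illustrative, not the script's exact code):

```python
def recall_at_k(retrieved_ids, gold_ids, k=10):
    """Fraction of a query's gold evidence documents found in the top-k retrieved."""
    gold = set(gold_ids)
    if not gold:
        return 0.0
    return len(set(retrieved_ids[:k]) & gold) / len(gold)

# Example: one of two gold documents retrieved in the top 10
score = recall_at_k(["d3", "d1", "d7"], ["d1", "d9"], k=10)  # 0.5
```

The benchmark figure is then the mean of this per-query score over all queries.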

Results

| Benchmark    | Queries Tested  | Our Result | SOTA    | Notes                         |
|--------------|-----------------|------------|---------|-------------------------------|
| MultiHop-RAG | 2,556 (full)    | 72.89%     | ~70%    | Beats RAPTOR baseline         |
| SQuAD        | 200+ (ongoing)  | 99.0%      | ~85-90% | Full benchmark running on EC2 |
| CRAG         | 10 (sample)     | 70%        | ~50-60% | Per-query corpus test         |

Note on SQuAD: The full 10,570-query benchmark is still running on EC2; after the first 200 queries, Recall@10 is 99.0%.

Note on CRAG: Tested 10 queries, using each query's provided search results as the corpus. Scaling further requires per-query ingestion, which is compute-intensive. CRAG is designed for API-augmented RAG rather than static document retrieval.

Ablation Study

| Component              | Recall@10 | Δ from baseline (pp) |
|------------------------|-----------|----------------------|
| Semantic only          | 55.2%     | -                    |
| + RAPTOR hierarchy     | 62.5%     | +7.3                 |
| + Cohere reranking     | 71.8%     | +16.6                |
| + BM25 hybrid          | 72.4%     | +17.2                |
| + HyDE + Query decomp  | 72.89%    | +17.7                |

Key insight: Cohere's neural reranker alone adds +9.3 percentage points.


Architecture

┌──────────────────────────────────────────────────────────────────┐
│                         Query Input                              │
└──────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────┐
│                   Parallel Retrieval Strategies                  │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐             │
│  │ Semantic │ │   HyDE   │ │   BM25   │ │  Query   │             │
│  │  Search  │ │ Expansion│ │  Hybrid  │ │  Decomp  │             │
│  └──────────┘ └──────────┘ └──────────┘ └──────────┘             │
└──────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────┐
│                    Cohere Neural Reranking                       │
│                     (rerank-english-v3.0)                        │
└──────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────┐
│                         Top-K Results                            │
└──────────────────────────────────────────────────────────────────┘
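The flow above can be sketched in a few lines of Python (names are illustrative; the real orchestration lives in ultimate_rag/retrieval/retriever.py):

```python
from concurrent.futures import ThreadPoolExecutor

def retrieve(query, strategies, reranker, top_k=10):
    """Run all retrieval strategies in parallel, merge candidates, then rerank."""
    with ThreadPoolExecutor() as pool:
        batches = pool.map(lambda strategy: strategy(query), strategies)
    # Deduplicate candidates across strategies, preserving first occurrence
    seen, candidates = set(), []
    for batch in batches:
        for doc in batch:
            if doc not in seen:
                seen.add(doc)
                candidates.append(doc)
    # The neural reranker scores (query, doc) pairs; keep the top_k
    ranked = sorted(candidates, key=lambda d: reranker(query, d), reverse=True)
    return ranked[:top_k]

# Toy strategies and a stand-in reranker, purely for illustration
semantic = lambda q: ["doc_a", "doc_b"]
bm25 = lambda q: ["doc_b", "doc_c"]
score = lambda q, d: len(d)  # stand-in for a cross-encoder relevance score
top = retrieve("merger outcome", [semantic, bm25], score, top_k=2)
```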

Repository Structure

rag_benchmarking/
├── ultimate_rag/              # 🔧 Full RAG implementation
│   ├── api/
│   │   └── server.py          # FastAPI server
│   ├── retrieval/
│   │   ├── retriever.py       # Main orchestration
│   │   ├── strategies.py      # HyDE, BM25, decomposition
│   │   └── reranker.py        # Cohere + cross-encoder
│   ├── raptor/
│   │   └── tree_building.py   # RAPTOR hierarchy
│   ├── graph/
│   │   └── graph.py           # Knowledge graph
│   ├── core/
│   │   └── node.py            # Tree/forest data structures
│   └── agents/
│       └── teaching.py        # Knowledge teaching interface
│
├── knowledge_base/            # 📚 RAPTOR core library
│   └── raptor/
│       ├── cluster_tree_builder.py
│       ├── EmbeddingModels.py
│       └── ...
│
├── adapters/                  # 🔌 Benchmark adapters
│   └── ultimate_rag_adapter.py
│
├── scripts/                   # 🚀 Evaluation scripts
│   ├── run_multihop_eval.py
│   └── run_crag_eval.py
│
├── docs/                      # 📝 Documentation
│   ├── blog_post.md           # Practitioner-friendly writeup
│   ├── technical_report.md    # Academic-style report
│   └── README.md
│
├── multihop_rag/              # 📊 MultiHop-RAG dataset
│   └── dataset/
│       ├── corpus.json        # 609 news articles
│       └── MultiHopRAG.json   # 2,556 queries
│
├── crag/                      # 📊 CRAG dataset
│   └── ...
│
└── requirements.txt           # Dependencies

API Endpoints

Health Check

curl http://localhost:8000/health

Query (Retrieval)

curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"query": "What was the outcome of the merger?", "top_k": 10}'

Ingest Documents

curl -X POST http://localhost:8000/ingest/batch \
  -H "Content-Type: application/json" \
  -d '{
    "tree": "default",
    "documents": [{"content": "Document text here..."}],
    "build_hierarchy": true
  }'
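For larger corpora it helps to split ingestion into multiple requests. A small sketch that builds /ingest/batch request bodies (helper name and batch size are illustrative):

```python
import json

def batch_payloads(texts, tree="default", batch_size=50, build_hierarchy=True):
    """Yield JSON bodies for POST /ingest/batch, batch_size documents at a time."""
    for i in range(0, len(texts), batch_size):
        yield json.dumps({
            "tree": tree,
            "documents": [{"content": t} for t in texts[i:i + batch_size]],
            "build_hierarchy": build_hierarchy,
        })

# Three documents with batch_size=2 produce two request bodies
payloads = list(batch_payloads(["a", "b", "c"], batch_size=2))
```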

Save/Load Tree

# Save
curl -X POST http://localhost:8000/persist/save \
  -H "Content-Type: application/json" \
  -d '{"tree": "default"}'

# Load
curl -X POST http://localhost:8000/persist/load \
  -H "Content-Type: application/json" \
  -d '{"tree": "default", "path": "trees/default.pkl"}'

Configuration

Retrieval Modes

| Mode     | Strategies                     | Use Case                    |
|----------|--------------------------------|-----------------------------|
| fast     | Semantic only                  | Low latency, simple queries |
| standard | Semantic + HyDE + BM25 + Decomp| Balanced (default)          |
| thorough | All strategies                 | Maximum recall, high latency|
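A sketch of how a mode could map to enabled strategies. The fast and standard lists follow the table above; the thorough list is an assumption (the table only says "all strategies"), and the actual wiring lives in the retriever:

```python
# Hypothetical RETRIEVAL_MODE -> strategy mapping, for illustration only
MODES = {
    "fast": ["semantic"],
    "standard": ["semantic", "hyde", "bm25", "decomposition"],
    "thorough": ["semantic", "hyde", "bm25", "decomposition", "raptor", "graph"],
}

def strategies_for(mode):
    """Return the strategy names enabled for a retrieval mode."""
    if mode not in MODES:
        raise ValueError(f"unknown retrieval mode: {mode!r}")
    return MODES[mode]
```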

Environment Variables

OPENAI_API_KEY=sk-...          # Required for embeddings
COHERE_API_KEY=...             # Recommended for reranking (see privacy note below)
RETRIEVAL_MODE=standard        # fast|standard|thorough
DEFAULT_TOP_K=10               # Number of results
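Reading these variables with their documented defaults might look like the following (a sketch; the server's actual config parsing may differ):

```python
import os

# Documented defaults: RETRIEVAL_MODE=standard, DEFAULT_TOP_K=10
config = {
    "openai_api_key": os.environ.get("OPENAI_API_KEY"),  # required for embeddings
    "cohere_api_key": os.environ.get("COHERE_API_KEY"),  # optional, for reranking
    "retrieval_mode": os.environ.get("RETRIEVAL_MODE", "standard"),
    "top_k": int(os.environ.get("DEFAULT_TOP_K", "10")),
}
```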

Privacy Notice: Cohere Reranker

This system uses Cohere's rerank API for neural reranking, which gives the best benchmark results (+9.3 percentage points in our ablation). Please be aware:

  • Data logging: By default, Cohere logs prompts and outputs on their SaaS platform (retained for 30 days)
  • Training opt-out: You can disable data usage for training in your Cohere dashboard under "Data Controls"
  • Zero retention: Enterprise customers can request zero data retention
  • Cloud deployments: If using Cohere via AWS/GCP/Azure, Cohere does not receive your data

For privacy-sensitive use cases, consider these alternatives:

  1. Local cross-encoder: The system includes CrossEncoderReranker using BAAI/bge-reranker-base (runs locally, no external API)
  2. Remove Cohere: Leave COHERE_API_KEY unset and the system falls back to local reranking
  3. LLM-as-reranker: Use a local/GDPR-compliant LLM for reranking
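The fallback in option 2 can be sketched as follows (illustrative; the repo's actual logic lives in ultimate_rag/retrieval/reranker.py):

```python
import os

def pick_reranker(cohere_rerank=None, local_rerank=None):
    """Prefer the Cohere reranker when COHERE_API_KEY is set; else fall back
    to a local one (e.g. a cross-encoder such as BAAI/bge-reranker-base)."""
    if os.environ.get("COHERE_API_KEY") and cohere_rerank is not None:
        return cohere_rerank
    return local_rerank

# Stand-in local reranker for illustration: rank documents by length
local = lambda query, docs: sorted(docs, key=len)
rerank = pick_reranker(local_rerank=local)  # no Cohere callable -> local path
ranked = rerank("q", ["longer document", "short"])
```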

See Cohere's privacy policy and enterprise data commitments for details.


Cost Analysis

| Component           | Cost per Query |
|---------------------|----------------|
| OpenAI embeddings   | $0.000007      |
| HyDE generation     | $0.00018       |
| Query decomposition | $0.00027       |
| Cohere reranking    | $0.002         |
| Total               | ~$0.0025       |

Full benchmark (2556 queries): ~$6
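The totals check out arithmetically:

```python
# Per-query costs from the table above (USD)
costs = {
    "embeddings": 0.000007,
    "hyde": 0.00018,
    "decomposition": 0.00027,
    "rerank": 0.002,
}
per_query = sum(costs.values())    # 0.002457, i.e. ~$0.0025
full_benchmark = per_query * 2556  # ~$6.28, roughly the ~$6 quoted above
```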


Documentation

  • docs/blog_post.md - Practitioner-friendly writeup
  • docs/technical_report.md - Academic-style report

Citation

If you use this code, please cite:

@software{rag_benchmarking_2026,
  title = {Multi-Strategy RAG for Multi-Hop Question Answering},
  author = {Anonymous},
  year = {2026},
  url = {https://github.com/incidentfox/OpenRag}
}

License

MIT License - see LICENSE for details.

