Final project for the Information Retrieval and Generation class
git clone https://github.com/olaghattas/irg_final_project.git
cd irg_final_project
- (Recommended) create/activate a conda environment:
conda create -n rag python=3.11
conda activate rag
- Install Python packages:
pip install -r requirements.txt
cd src/getting_started
python download_dataset.py
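For reference, this step fetches the LitSearch queries and corpus into dataset/. A minimal sketch of what such a download script can look like is below, assuming the data is pulled from the Hugging Face Hub; the repository id and config names are assumptions and may differ from what download_dataset.py actually does.

```python
# Hypothetical sketch of the dataset download step (not the repo's exact code).
# The dataset id and config names below are assumptions inferred from the
# directory names used later in this README.
from datasets import load_dataset

queries = load_dataset("princeton-nlp/LitSearch", "query")
corpus = load_dataset("princeton-nlp/LitSearch", "corpus_clean")

queries.save_to_disk("dataset/LitSearch_query")
corpus.save_to_disk("dataset/LitSearch_corpus_clean")
```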
Make sure you have Ollama installed on your system. To install it, follow the instructions at https://ollama.com/download.
After installing Ollama, run the following to pull the base model, build the 16k-context variant, and start the server:
cd <project_root>
ollama pull llama3.1:8b-instruct-q8_0
ollama create llama3.1:8b-instruct-q8_0-16k -f src/getting_started/Modelfile
ollama serve # Starts the Ollama server; keep it running (e.g. in a separate terminal) while using the methods below.
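To verify that the server and the 16k-context model are reachable before running any method, you can send a quick test request to Ollama's local REST API. This snippet is illustrative and not part of the repository:

```python
# Optional sanity check: query the local Ollama server via its REST API to
# confirm the custom model built above is available.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1:8b-instruct-q8_0-16k",
        "prompt": "Reply with OK.",
        "stream": False,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```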
cd irg_final_project
python3 src/methods/run_scratch.py \
--corpus dataset/LitSearch_corpus_clean \
--query dataset/LitSearch_query \
--topk 50
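The run-file name suggests SMART ltc.nnn weighting: documents use log-tf times idf with cosine normalization, and queries use raw term counts with no idf and no normalization. A rough sketch of that scheme is below; the actual run_scratch.py implementation may differ in details such as the log base or tokenization.

```python
# Illustrative ltc.nnn scoring in SMART notation (not the repo's exact code).
import math
from collections import Counter

def ltc_doc_vector(doc_terms, df, n_docs):
    # l: 1 + log(tf), t: log(N / df), c: cosine (L2) normalization
    tf = Counter(doc_terms)
    weights = {t: (1 + math.log(c)) * math.log(n_docs / df[t]) for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
    return {t: w / norm for t, w in weights.items()}

def nnn_query_score(query_terms, doc_vec):
    # n.n.n on the query side: raw tf, no idf, no normalization
    q_tf = Counter(query_terms)
    return sum(c * doc_vec.get(t, 0.0) for t, c in q_tf.items())
```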
python3 src/methods/run_ce.py \
--run_file run_files/ltc_nnn_scratch_topk_1000.run \
--corpus dataset/LitSearch_corpus_clean \
--query dataset/LitSearch_query \
--topk 50
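As the script name suggests, run_ce.py reranks the top-k candidates from an existing run file. If the reranker is a sentence-transformers cross-encoder, the core step looks roughly like the sketch below; the model name is illustrative, not necessarily the one used by the script.

```python
# Rough sketch of cross-encoder reranking (illustrative model name).
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidates):
    """candidates: list of (doc_id, doc_text) taken from the first-stage run file."""
    scores = model.predict([(query, text) for _, text in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [(doc_id, float(score)) for (doc_id, _), score in ranked]
```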
Dense retrieval dependencies conflict with the LLM reranker, so this method must be run in an environment created using requirements_dense.txt (not requirements.txt).
Run from the project root:
cd irg_final_project
python3 src/methods/BM25_LLMExp_DenseRetrieval.py
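For intuition, the dense-retrieval stage of this method can be sketched with a sentence-transformers bi-encoder; the encoder name and helper below are assumptions, not the script's actual code.

```python
# Illustrative dense-retrieval step: encode query and documents, rank by cosine similarity.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # assumed model

def dense_topk(query, doc_ids, doc_texts, k=50):
    q_emb = encoder.encode(query, convert_to_tensor=True, normalize_embeddings=True)
    d_emb = encoder.encode(doc_texts, convert_to_tensor=True, normalize_embeddings=True)
    scores = util.cos_sim(q_emb, d_emb)[0]            # one similarity per document
    top = scores.topk(min(k, len(doc_texts)))
    return [(doc_ids[i], float(s)) for s, i in zip(top.values, top.indices)]
```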
This also requires the requirements_dense.txt environment.
Run from the project root:
cd irg_final_project
python3 src/methods/tfidf_DenseRetrieval.py --runfile tfidf_runfile.run
tfidf_runfile.run must be generated before running the script (see Step 6: TF-IDF (lnc.nnn)). Ensure the file is located inside the run_files directory.
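Run files use the standard TREC format, one line per retrieved document: query_id Q0 doc_id rank score tag. A minimal parser, should you want to inspect one (illustrative, not part of the repo):

```python
# Minimal TREC run-file parser (illustrative).
from collections import defaultdict

def read_trec_run(path):
    run = defaultdict(list)
    with open(path) as f:
        for line in f:
            qid, _, docid, rank, score, _tag = line.split()
            run[qid].append((docid, int(rank), float(score)))
    return run

run = read_trec_run("run_files/tfidf_runfile.run")
```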
This notebook must be run in the environment created with requirements.txt.
- First generate the dense-retrieval runfile using Step 2.
- Update the Jupyter notebook with the correct path to the generated runfile.
- Run the Jupyter notebook BM25_LLMExp_DenseRetrieval_LLMRe-rank.ipynb
Run the Jupyter notebook bm25_LLMExp_LLMRerank.ipynb
Run the Jupyter notebook tfidf_lnc_nnn.ipynb
Run the Jupyter notebook tfidf_lnc-nnn_ThesaurusExp_LLMRerank.ipynb
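The LLM-reranking notebooks prompt the local Ollama model to score query-document pairs. Conceptually, the step looks something like the sketch below; the prompt wording, model name, and score parsing are illustrative only, not taken from the notebooks.

```python
# Conceptual sketch of LLM reranking via the local Ollama server (illustrative).
import re
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def llm_relevance_score(query, doc_text, model="llama3.1:8b-instruct-q8_0-16k"):
    prompt = (
        "On a scale from 0 to 10, how relevant is this paper to the query?\n"
        f"Query: {query}\nPaper: {doc_text}\n"
        "Answer with a single number only."
    )
    r = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    r.raise_for_status()
    match = re.search(r"\d+(\.\d+)?", r.json()["response"])
    return float(match.group()) if match else 0.0
```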
python evaluate.py --qrels <qrels_path> --runs <run1> <run2> ... --metric <metric_name> --output <output_dir>
- --qrels: Path to qrels file (TREC format)
- --runs: One or more run files (TREC format)
- --metric: Metric to compute (ndcg@K, p@K, p@R, ap, map)
- --output: Directory where result files are written
The script will generate:
- Per-query CSV results
- A summary CSV file
- Printed summary statistics
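As an independent sanity check, the same kinds of metrics can be computed for a single run with pytrec_eval. The example below is not the repo's evaluate.py, and the qrels path is illustrative:

```python
# Independent check with pytrec_eval: MAP, nDCG@10, and P@10 for one TREC run.
import pytrec_eval

def load_qrels(path):
    qrels = {}
    with open(path) as f:
        for line in f:
            qid, _, docid, rel = line.split()
            qrels.setdefault(qid, {})[docid] = int(rel)
    return qrels

def load_run(path):
    run = {}
    with open(path) as f:
        for line in f:
            qid, _, docid, _rank, score, _tag = line.split()
            run.setdefault(qid, {})[docid] = float(score)
    return run

qrels = load_qrels("dataset/qrels.txt")  # illustrative path
run = load_run("run_files/ltc_nnn_scratch_topk_1000.run")
evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"map", "ndcg_cut_10", "P_10"})
per_query = evaluator.evaluate(run)
print("MAP:", sum(m["map"] for m in per_query.values()) / len(per_query))
```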