This repository provides a lightweight, modular Retrieval-Augmented Generation (RAG) pipeline designed to help you understand and interact with scientific literature using Large Language Models (LLMs).
Finetuning scripts and configurations can be found in the os_train_data_finetune directory.
This pipeline connects a local LLM with a retrieval component to enable question-answering over scientific texts. It is optimized for minimal resource usage and fast setup.
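Conceptually, the pipeline embeds a document corpus, retrieves the passages most similar to a user question, and prompts the LLM with those passages as grounding context. The sketch below illustrates this retrieve-then-generate loop; the encoder model and toy corpus are illustrative placeholders, not what the repository scripts actually use:

```python
from sentence_transformers import SentenceTransformer, util

# Toy corpus standing in for the indexed unarXive passages.
corpus = [
    "Transformers pretrained on biomedical text improve downstream NLP tasks.",
    "Retrieval-augmented generation grounds LLM answers in external documents.",
]

# Embed corpus and question with a general-purpose encoder (placeholder model).
embedder = SentenceTransformer("all-MiniLM-L6-v2")
corpus_emb = embedder.encode(corpus, convert_to_tensor=True)

question = "How does RAG keep LLM answers grounded?"
query_emb = embedder.encode(question, convert_to_tensor=True)

# Retrieve the most similar passage and build a grounded prompt for the LLM.
best = util.cos_sim(query_emb, corpus_emb).argmax().item()
prompt = f"Context: {corpus[best]}\n\nQuestion: {question}\nAnswer:"
print(prompt)  # This prompt would be sent to the local LLM server.
```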
Ensure the following libraries are installed in your virtual environment:
```bash
pip install langchain-community transformers sentence-transformers "fastapi[standard]" peft
```

Note: The `[standard]` option for `fastapi` is required to include all necessary dependencies. (The quotes around `fastapi[standard]` keep some shells, such as zsh, from interpreting the brackets.)
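To confirm the environment is set up, a quick import check can help (illustrative; run it inside the same virtual environment):

```python
# Print each package's name and version to confirm the installation.
import fastapi
import langchain_community
import peft
import sentence_transformers
import transformers

for pkg in (langchain_community, transformers, sentence_transformers, fastapi, peft):
    print(pkg.__name__, getattr(pkg, "__version__", "unknown"))
```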
The demonstration runs on unarXive datasets. All pipeline scripts can be found in the rag_unarxive directory.
You'll need two terminals to start the backend servers and an optional third terminal to run the query interface.
In the first terminal, activate your virtual environment and run:
```bash
./rag_unarxive/start_llama_server.sh
```

In the second terminal, activate your virtual environment and run:

```bash
./rag_unarxive/start_rag_server.sh
```

Wait until both terminals show:

```
INFO: Application startup complete.
```
In a third terminal, activate the same virtual environment and run:
```bash
python rag_unarxive/RAG_openscholar_format.py
```

You’ll be prompted to enter a question. Type your query and interact with the system.
Ask domain-specific questions like:
> What are the recent advances in transformer-based models for biomedical NLP?

and get responses grounded in your scientific corpus.
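To query the RAG server programmatically rather than through the interactive script, an HTTP request along these lines should work; the port and endpoint path below are assumptions, so check `start_rag_server.sh` and the FastAPI app for the actual values:

```python
import requests

# Placeholder URL: confirm the real port and route in start_rag_server.sh
# and the FastAPI app before relying on this.
RAG_URL = "http://localhost:8000/query"

resp = requests.post(
    RAG_URL,
    json={"question": "What are the recent advances in transformer-based "
                      "models for biomedical NLP?"},
    timeout=120,
)
resp.raise_for_status()
print(resp.json())
```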
```
├── os_train_data_finetune/        # Finetuning scripts and data handling
├── rag_unarxive/                  # RAG pipeline and server scripts
│   ├── start_llama_server.sh      # Script to launch the LLM server
│   ├── start_rag_server.sh        # Script to launch the RAG server
│   ├── RAG_openscholar_format.py  # Entry point for querying the pipeline
│   ├── ....
├── README.md                      # This file
```
- Make sure your model and data paths are correctly configured in the scripts, for example in `llama_pipeline.py` (see the loading sketch below):

```python
# Change this path to your local model location
model_path = "/data/horse/ws/s9650707-llm_workspace/scholaryllm_prot/os_train_data_finetune/model_checkpoints/..."
```

- The pipeline assumes local or pre-finetuned models and indexing.