A simple collection of LLM snippets and utilities.
- Download or clone the repo:
    git clone git@github.com:pavdwest/llm_docsearch.git
    cd llm_docsearch
- Create virtual env:
    python -m venv .venv
- Activate venv:
    source ./.venv/bin/activate
- Install requirements:
    pip install -r requirements.txt
- Copy .env.example to .env and fill out your OpenAI key
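To confirm the key was picked up before running anything, a small check like the following can help. It is only a sketch: the variable name `OPENAI_API_KEY` is an assumption (check `.env.example` for the name this repo actually uses), and `parse_env_text` is a hypothetical helper, not part of the repo.

```python
from pathlib import Path


def parse_env_text(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def has_openai_key(env_path: str = ".env") -> bool:
    """Return True if the file exists and holds a non-empty key.

    OPENAI_API_KEY is an assumed name; adjust it to match .env.example.
    """
    path = Path(env_path)
    if not path.is_file():
        return False
    return bool(parse_env_text(path.read_text(encoding="utf-8")).get("OPENAI_API_KEY"))


# Demo on an in-memory sample, so nothing on disk is touched.
sample = "# local settings\nOPENAI_API_KEY='sk-example'\n"
print(parse_env_text(sample))
```

Calling `has_openai_key()` from the repo root returns `False` if `.env` is missing or the key is still blank.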
The docs folder already contains some example data saved as PDFs and raw text, attributed to the following sources:
- https://www.touropia.com/famous-cathedrals-in-the-world/
- https://www.veranda.com/travel/g33234419/beautiful-cathedrals-in-the-world/
- https://www.rivieratravel.co.uk/blog/2019/05/23/the-10-most-famous-cathedrals-and-basilicas-across-europe/
- https://www.thecollector.com/greatest-gothic-cathedrals/
Run 'training' on the example docs:
python ./train.py
It should output something like the following:
Delete existing db...
Loading 4 documents...
Creating vector db...
Done!
You can rerun the training at any time. Each run deletes the existing db and reloads only the files currently in the docs directory.
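The delete-then-reload behaviour can be pictured roughly as below. This is a sketch of the pattern only, not the code in `train.py`: it assumes the db is a directory on disk, the function names are hypothetical, and the embedding and indexing step is left out because it depends on the libraries the repo uses.

```python
import shutil
import tempfile
from pathlib import Path


def reset_db_dir(db_dir: Path) -> None:
    """Remove the old db directory, if any, and recreate it empty."""
    if db_dir.exists():
        print("Delete existing db...")
        shutil.rmtree(db_dir)
    db_dir.mkdir(parents=True)


def list_documents(docs_dir: Path) -> list[Path]:
    """Return every regular file under docs_dir, sorted for a stable order."""
    return sorted(p for p in docs_dir.rglob("*") if p.is_file())


def rebuild(docs_dir: Path, db_dir: Path) -> int:
    """Start from an empty db and report how many documents would be loaded."""
    reset_db_dir(db_dir)
    docs = list_documents(docs_dir)
    print(f"Loading {len(docs)} documents...")
    # Embedding the documents and writing the vector db would happen here.
    return len(docs)


if __name__ == "__main__":
    # Demo in a throwaway directory so no real docs or db are touched.
    with tempfile.TemporaryDirectory() as tmp:
        docs = Path(tmp) / "docs"
        docs.mkdir()
        (docs / "example.txt").write_text("hello", encoding="utf-8")
        rebuild(docs, Path(tmp) / "db")
```

Because the db is rebuilt from scratch, a file removed from the docs directory no longer contributes to the db after the next run.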
Run the example query:
python ./run_query.py
It should output something like the following:
Running query: 'When was the Cologne Cathedral built and how tall is it?'
Response: ' The Cologne Cathedral was built in 1248 and is 157 metres tall. This information can be found in the Riviera Travel Blog and Touropia sources.'
It might be worth making a backup of the example docs if you'd like to use them again in the future.
Delete everything in the docs directory.
Copy all of your source documents into the docs folder.
Explicitly supported file types are *.html and *.pdf. Any other file type is loaded as plain text on a best-effort basis, so your mileage may vary.
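The file type rule amounts to a lookup on the extension with a plain-text fallback. The sketch below illustrates that rule only. The names are hypothetical, and the actual HTML and PDF parsing is not shown because it depends on the loaders the repo uses.

```python
from pathlib import Path

# Extensions named above as explicitly supported.
EXPLICIT_LOADERS = {".html": "html", ".pdf": "pdf"}


def pick_loader(path: str) -> str:
    """Return 'html' or 'pdf' for supported extensions, 'text' for anything else."""
    suffix = Path(path).suffix.lower()
    return EXPLICIT_LOADERS.get(suffix, "text")


def read_as_text(path: str) -> str:
    """Fallback reader: decode as UTF-8, replacing bytes that cannot be decoded."""
    return Path(path).read_bytes().decode("utf-8", errors="replace")


for name in ["cathedrals.pdf", "page.HTML", "notes.md", "README"]:
    print(name, "->", pick_loader(name))
```

Binary formats that fall through to the text path, such as *.docx, will mostly decode to noise, which is one reason results for unsupported types can be poor.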
Run 'training':
python ./train.py
Run a query by passing it as basic text via the command line:
python ./run_query.py Find all the details about 'SomeTopic' in my documents
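It helps to see what the script actually receives from the shell for a command like the one above. The sketch below assumes `run_query.py` joins its arguments into one string and falls back to the example query when none are given. Both are guesses based on the behaviour shown above, and `build_query` is a hypothetical name.

```python
import shlex

# The query reported when run_query.py is started without arguments.
DEFAULT_QUERY = "When was the Cologne Cathedral built and how tall is it?"


def build_query(args: list[str], default: str = DEFAULT_QUERY) -> str:
    """Join command-line words into one query, or use the default if empty."""
    query = " ".join(args).strip()
    return query or default


command = "python ./run_query.py Find all the details about 'SomeTopic' in my documents"
# shlex.split mimics POSIX shell word splitting; drop 'python' and the script name.
args = shlex.split(command)[2:]
print(build_query(args))
# prints: Find all the details about SomeTopic in my documents
```

The shell strips the single quotes around SomeTopic before the script sees them. Characters such as `?`, `*` or `&` are also interpreted by the shell unless quoted, so wrapping the whole query in double quotes is the safer habit.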