A basic implementation of a locally-running LLM with RAG.
The code is copyright (c) 2024 AlertAvert.com. All rights reserved.
The code is released under the Apache 2.0 License, see LICENSE for details.
- PDF Ingestion and Chunking: Use `unstructured.io` to extract text and split it into manageable chunks (a sketch of this and the embedding step follows this list).
- Embedding Generation: Use a suitable embedding model (from `sentence-transformers`).
- Vector Store (Qdrant): Store embeddings and associated metadata in a local Qdrant instance (running via Docker).
- Query Execution: Retrieve relevant chunks from Qdrant based on user input.
- LLM Querying (Ollama): Use Ollama to generate context-aware answers using retrieved chunks.
- Frontend (Streamlit): Provide a user interface for querying the knowledge base and displaying results.
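As an illustration of the ingestion and embedding steps, here is a minimal sketch assuming `unstructured`'s PDF partitioning and a `sentence-transformers` model; the file path and the `all-MiniLM-L6-v2` model name are placeholders rather than the exact choices made in this repo:

```python
from unstructured.partition.pdf import partition_pdf
from unstructured.chunking.title import chunk_by_title
from sentence_transformers import SentenceTransformer

# Extract the PDF into elements, then group them into coherent chunks.
elements = partition_pdf(filename="docs/sample.pdf")  # placeholder path
chunks = chunk_by_title(elements)
texts = [chunk.text for chunk in chunks]

# Encode each chunk into a dense vector (all-MiniLM-L6-v2 yields 384-dim embeddings).
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(texts)

print(f"{len(texts)} chunks, embedding dimension {embeddings.shape[1]}")
```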
We run a local copy of Meta's Llama 3.2 3B model using `ollama`:

```shell
ollama run llama3.2:latest
```

However, even if the LLM is not running (use `ollama ps` to confirm), it will be started automatically when calling the `ollama.generate()` function.
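For reference, the corresponding call through the `ollama` Python client looks roughly like this; the prompt shown is only illustrative, and the actual prompt template lives in the app code:

```python
import ollama

# If the model is not already loaded, Ollama starts it on demand.
response = ollama.generate(
    model="llama3.2",
    prompt=(
        "Using only the context below, answer the question.\n\n"
        "Context: <retrieved chunks>\n\nQuestion: <user question>"
    ),
)
print(response["response"])
```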
See also this course on YouTube for more details.
Update
Given all the furore around the DeepSeek R1 model, I have also added support for it, using the Ollama model name `deepseek-r1`.
Simply change the value of `LLM_MODEL` in `constants/__init__.py` to `deepseek-r1`:

```python
LLM_MODEL: str = "deepseek-r1"
```

This will use the 7B model, but you can also use `deepseek-r1:14b` if your hardware supports it.
Qdrant is a vector database designed for real-time, high-performance similarity search. It offers a simple API and runs efficiently on local machines.

Key Features:
- Lightweight and fast.
- REST API with gRPC support.
- Compatible with embeddings from most AI frameworks.

Installation: use Docker, or pip for the Python bindings.

```shell
docker run -p 6333:6333 qdrant/qdrant
```

Why Qdrant?
- Ease of Use: simple setup with REST and Python APIs.
- Local Storage: supports persistent storage, so you don't need to worry about data loss.
- Low Complexity: lightweight, no complex dependencies.
- Incremental Updates: adding new embeddings (from additional PDFs) is straightforward.
- Performance: optimized for local environments and fast similarity searches.

Recommended Setup:
- Run Qdrant locally via Docker or as a standalone binary.
- Use Qdrant's Python client for embedding storage and querying (sketched below).
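Here is a rough, self-contained sketch of how the Python client (`qdrant-client`) can be used for storage and querying; the collection name, vector size, payload fields, and example texts are illustrative assumptions, not necessarily the choices made in this repo:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model, 384-dim vectors
texts = ["Example chunk one.", "Example chunk two."]  # chunks from the ingestion step
vectors = encoder.encode(texts)

client = QdrantClient(url="http://localhost:6333")

# Create (or re-create) a collection sized to the embedding dimension.
client.recreate_collection(
    collection_name="knowledge_base",
    vectors_config=VectorParams(size=vectors.shape[1], distance=Distance.COSINE),
)

# Store one point per chunk, keeping the original text as payload metadata.
client.upsert(
    collection_name="knowledge_base",
    points=[
        PointStruct(id=i, vector=vec.tolist(), payload={"text": text})
        for i, (vec, text) in enumerate(zip(vectors, texts))
    ],
)

# Retrieve the chunks most similar to a user query.
hits = client.search(
    collection_name="knowledge_base",
    query_vector=encoder.encode("example query").tolist(),
    limit=3,
)
for hit in hits:
    print(hit.score, hit.payload["text"])
```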
```shell
docker run -d -p 6333:6333 qdrant/qdrant
```

Created the `bertie` virtualenv (see `requirements.txt`).
As with every Python project, the recommended way is to use `virtualenvwrapper` to create a new venv, then install the dependencies there:

```shell
mkvirtualenv -p $(which python3) bertie
pip install -r requirements.txt
```

This will also install, and put on the `PATH`, the necessary scripts (`ollama` and `streamlit`).
The easiest way to run the app is to use docker-compose:
```shell
docker compose up -d
```

Until integrated with compose, the Streamlit app needs to be run manually¹:
```shell
streamlit run app.py [debug]
```

Adding the `debug` flag will generate DEBUG logs.
This should open a browser window showing the UI, at http://localhost:8501.
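For context, a Streamlit front-end for this kind of flow can be as small as the sketch below; this is illustrative only, and the actual `app.py` in this repo will differ:

```python
import streamlit as st

def answer_question(question: str) -> str:
    # Placeholder for the RAG pipeline: embed the question, retrieve chunks
    # from Qdrant, then generate an answer via ollama.generate() (see above).
    return f"(answer for: {question})"

st.title("Knowledge Base")
question = st.text_input("Ask a question about the ingested documents")
if question:
    st.write(answer_question(question))
```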
¹ See Issue #2