This small project leverages Large Language Models (LLMs) to automatically extract structured data from a set of scholarly articles in PDF format. It uses Mistral, lightweight Retrieval-Augmented Generation (RAG) and LangChain to process the input documents and identify key details specified by the user. The main script returns a JSON file storing the key information retrieved from one or more articles.
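The retrieval step of a lightweight RAG pipeline can be illustrated without any external library. The sketch below is not the project's implementation, which relies on Mistral and LangChain. It is a toy stand-in that splits text into fixed-size word chunks and ranks them by keyword overlap with a query. The names `chunk_text`, `tokenize` and `top_chunks`, as well as the chunk size, are assumptions made for illustration.

```python
import re


def chunk_text(text: str, chunk_size: int = 50) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [
        " ".join(words[i : i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_chunks(chunks: list[str], query: str, k: int = 2) -> list[str]:
    """Return the k chunks sharing the most tokens with the query."""
    query_tokens = tokenize(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: len(tokenize(chunk) & query_tokens),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    article = (
        "We collected survey data from 120 participants. "
        "The sample size was chosen for statistical power. "
        "Results show a moderate effect of treatment."
    )
    chunks = chunk_text(article, chunk_size=8)
    # Only the best-matching chunk would be passed to the LLM as context.
    print(top_chunks(chunks, "What was the sample size?", k=1))
```

In a real pipeline the overlap score would typically be replaced by embedding similarity, but the flow is the same: chunk, rank, then send only the most relevant chunks to the model.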
- Edit config.py to add your own INPUT_PATH and OUTPUT_PATH.
- Edit config.py to add your own Mistral model under MODEL_NAME.
- Set PARSER_USAGE in config.py to True if you would like to use a specific parser.
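A config.py matching these settings might look like the following sketch. Only the names `INPUT_PATH`, `OUTPUT_PATH`, `MODEL_NAME` and `PARSER_USAGE` come from this README. The values, including the model identifier and the use of `pathlib.Path` rather than plain strings, are placeholders assumed for illustration and should be replaced with your own.

```python
# config.py (illustrative sketch; all values are placeholder assumptions)
from pathlib import Path

# Folder containing the PDF articles to process (assumed location).
INPUT_PATH = Path("pdfs")

# File where the extracted information is written as JSON (assumed location).
OUTPUT_PATH = Path("outputs") / "output.json"

# Name of the Mistral model to query; replace with a model you have access to.
MODEL_NAME = "mistral-small-latest"

# Set to True to use a specific parser.
PARSER_USAGE = False
```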
To install and run the project locally, follow the steps below:
- A Mistral API key is needed (see the steps for getting an API key further down).
- Install Python. Version 3.12.3 was used for this development.
- Clone the repository from a terminal (git must be installed):
  git clone https://github.com/carobs9/llm-pdf-retrieval.git
- Navigate to the project directory:
  cd [YOUR PROJECT DIRECTORY]
- Create a virtual environment:
  python3.12 -m venv <env_name>
- Activate the virtual environment:
  - Mac:
    source <env_name>/bin/activate
  - Windows:
    ./<env_name>/Scripts/activate
  - Linux:
    source <env_name>/bin/activate
- Install the dependencies:
  pip install -r requirements.txt
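After activating the environment, one way to confirm the setup is to ask the interpreter itself. The snippet below is a generic check and not part of the project; `in_virtualenv` is a hypothetical helper name. It relies on the fact that `sys.prefix` differs from `sys.base_prefix` when Python runs inside a venv.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv."""
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    version = ".".join(str(part) for part in sys.version_info[:3])
    print(f"Python version: {version}")
    print(f"Virtual environment active: {in_virtualenv()}")
```

If the second line reports False, the activation step did not take effect in the current terminal.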
This project uses a Mistral LLM to obtain results.
- Go to the official Mistral website and click on "Try the API".
- Create an account and click on API Keys.
- Click on "Create new key" and store it in your environment file.
- In a PowerShell terminal, run:
  $env:MISTRAL_API_KEY = "[your_api_key]"
  On Mac or Linux, the equivalent command is:
  export MISTRAL_API_KEY="[your_api_key]"
- In the same terminal, run:
  python main.py
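Because the key is passed through the `MISTRAL_API_KEY` environment variable, a script can verify that it is present before making any request. The sketch below shows one way to do this with the standard library. `get_mistral_api_key` is a hypothetical helper and is not claimed to exist in main.py.

```python
import os


def get_mistral_api_key() -> str:
    """Read the API key from the environment or fail with a clear message."""
    api_key = os.environ.get("MISTRAL_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError(
            "MISTRAL_API_KEY is not set; set it in the terminal before running main.py."
        )
    return api_key
```

Failing early with an explicit message is usually easier to debug than an authentication error returned later by the API.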
llm-pdf-retrieval
| |___ config.py
| |___ main.py
| |___ README.md
| |___ requirements.txt
|
|___ outputs/
| |___ output.json
|
|___ pdfs/
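
Once main.py has run, the results can be inspected from Python. The sketch below assumes the output is written to outputs/output.json, as in the tree above. The internal schema of the JSON is not documented here, so the code only loads the file and reports its top-level shape. `load_results` and `summarize` are hypothetical helper names.

```python
import json
from pathlib import Path


def load_results(path: Path):
    """Load the extracted information from a JSON file."""
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


def summarize(results) -> str:
    """Describe the top-level shape of the loaded JSON without assuming a schema."""
    if isinstance(results, dict):
        return f"object with keys: {sorted(results)}"
    if isinstance(results, list):
        return f"list with {len(results)} entries"
    return f"single value of type {type(results).__name__}"


if __name__ == "__main__":
    output_file = Path("outputs") / "output.json"
    if output_file.exists():
        print(summarize(load_results(output_file)))
    else:
        print(f"{output_file} not found; run main.py first.")
```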