llm-pdf-retrieval

This small project uses Large Language Models (LLMs) to automatically extract structured data from a set of scholarly articles in PDF format. It combines a Mistral model, lightweight Retrieval-Augmented Generation (RAG), and LangChain to process the input documents and identify the key details specified by the user. The main script writes a JSON file containing the key information retrieved from one or more articles.
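To make the retrieval idea concrete, here is a minimal sketch of a "lightweight RAG" step in plain Python. It is not the project's implementation: the function names are hypothetical, and simple word overlap stands in for the embedding similarity that a LangChain retriever would normally use. It only shows the general flow of chunking a document, picking the most relevant chunks for a question, and assembling a prompt for the model.

```python
import re


def split_into_chunks(text, size=50, overlap=10):
    """Split text into overlapping chunks of `size` words."""
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]


def score(chunk, query):
    """Count how many distinct query words appear in the chunk."""
    chunk_words = set(re.findall(r"\w+", chunk.lower()))
    query_words = set(re.findall(r"\w+", query.lower()))
    return len(chunk_words & query_words)


def retrieve(chunks, query, k=2):
    """Return the k chunks that share the most words with the query."""
    return sorted(chunks, key=lambda c: score(c, query), reverse=True)[:k]


def build_prompt(context_chunks, field):
    """Assemble a prompt that asks the model for one field as JSON."""
    context = "\n---\n".join(context_chunks)
    return (
        f"Using only the context below, return JSON with the key '{field}'.\n"
        f"Context:\n{context}"
    )
```

The resulting prompt would then be sent to the Mistral model; that call is left out here because it depends on the client library and model you configure.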

Configuration

  • Edit config.py to set your own INPUT_PATH and OUTPUT_PATH.
  • Edit config.py to set your Mistral model under MODEL_NAME.
  • Set PARSER_USAGE to True in config.py if you would like to use a specific parser.
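As a rough illustration, a config.py with these four settings might look like the sketch below. The variable names come from the list above; every value is an assumption chosen for the example (the paths mirror the folders shown under Structure, and the model identifier is just one possible Mistral model name).

```python
# config.py (illustrative sketch; all values below are assumptions)
INPUT_PATH = "pdfs/"                  # folder holding the PDF articles to process
OUTPUT_PATH = "outputs/output.json"   # where the extracted data is written
MODEL_NAME = "mistral-small-latest"   # example Mistral model identifier; use your own
PARSER_USAGE = False                  # set to True to use a specific parser
```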

Installation

To install and run the project locally, follow the steps below:

  1. Get a Mistral API key (see Mistral API Configuration below).

  2. Install Python. Version 3.12.3 was used for this development.

  3. Clone the repository from a terminal (Git must be installed):

    git clone https://github.com/carobs9/llm-pdf-retrieval.git
  4. Navigate to the project directory:

    cd llm-pdf-retrieval
  5. Create a virtual environment:

    python3.12 -m venv <env_name>
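  6. Activate the virtual environment. On Linux or macOS, run source <env_name>/bin/activate; in PowerShell, run <env_name>\Scripts\Activate.ps1.

  7. Install the dependencies:
    pip install -r requirements.txt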

Mistral API Configuration

This project uses a Mistral LLM to obtain results.

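  1. Go to the official Mistral website and click on "Try the API".

  2. Create an account and click on API Keys.

  3. Click on "Create new key" and store the key in your environment file.

  4. In a PowerShell terminal, set the key as an environment variable (in bash or zsh, use export MISTRAL_API_KEY="[your_api_key]" instead):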

    $env:MISTRAL_API_KEY = "[your_api_key]"
  5. In the same terminal, run:

    python main.py
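If the key is missing, the run will most likely fail at the first API call. A small check like the sketch below can surface that earlier with a clearer message. The helper name require_api_key is hypothetical and not part of the project; only the MISTRAL_API_KEY variable name is taken from the steps above.

```python
import os


def require_api_key(env=os.environ, name="MISTRAL_API_KEY"):
    """Return the API key from the environment, or raise with a clear message."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; set it before running main.py")
    return key
```

Calling require_api_key() at the top of a script stops it immediately when the variable was never set in the current terminal session.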

Structure

llm-pdf-retrieval
|___ config.py
|___ main.py
|___ README.md
|___ requirements.txt
|
|___ outputs/
|     |___ output.json
|
|___ pdfs/
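Given this layout, the input and output handling can be pictured with the short sketch below. It is an assumption about how such a script could work, not the code in main.py: list_pdfs and write_results are hypothetical helpers that collect the PDF files from an input folder and write a results dictionary to a JSON file.

```python
import json
from pathlib import Path


def list_pdfs(input_dir):
    """Return the PDF files in input_dir, sorted by name."""
    return sorted(Path(input_dir).glob("*.pdf"))


def write_results(results, output_path):
    """Write results as JSON, creating the output folder if needed."""
    output_path = Path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text(json.dumps(results, indent=2), encoding="utf-8")
    return output_path
```

For example, list_pdfs("pdfs") would pick up the articles placed in pdfs/, and write_results(results, "outputs/output.json") would create outputs/ if it does not exist yet.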
