GitHub - grlazo/AIRA: AI Research Assistant for Ollama LLM : PDF collection interactions

AI Research Assistant (AIRA)

Your AI Research Assistant (AIRA) needs a few things to get started: the Ollama AI interface (ollama.ai) installed, and Anaconda (or miniconda3) to set up a Python environment.

Ollama can be installed on Linux, Windows, or Apple (check your requirements).

Once you have Ollama installed, you need an embedding model and a large language model (LLM). You can visit the website to discover all that is available. It's best to install your ollama instance under the conda environment to make sure both are working under the same environment.

For instance, on a Windows machine you would start your Anaconda PowerShell prompt and run the following commands:

$ ollama pull nomic-embed-text
$ ollama pull llama3.1

The above will get you started; you may change these later.

You can check if ollama is running on your local machine by opening http://localhost:11434/ in a web browser. It should display: Ollama is running
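The same check can be scripted. The following is a minimal sketch and is not part of the AIRA repository: the function name `ollama_is_running` is hypothetical, and it assumes only that the server answers at the address above with a short status message.

```python
import urllib.error
import urllib.request


def ollama_is_running(url="http://localhost:11434/", timeout=3.0):
    """Return True if an Ollama server answers at `url` (hypothetical helper)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or DNS failure: treat as "not running".
        return False
    # Compare case-insensitively, since only the wording is assumed here.
    return "ollama is running" in body.lower()


if __name__ == "__main__":
    print("Ollama reachable:", ollama_is_running())
```

If this prints False, start Ollama first and try again before moving on.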

Next you will need to set up your conda environment; start by naming your environment:

$ conda create --name AIRA python==3.12
$ conda activate AIRA
$ conda list
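To confirm from inside Python that the right environment is active, something like the sketch below may help. It is not part of the repository; `environment_summary` is a hypothetical name, and it relies on conda setting the `CONDA_DEFAULT_ENV` variable when an environment is activated.

```python
import os
import sys


def environment_summary(environ=None, version_info=None):
    """Report the active conda environment name and the Python version."""
    environ = os.environ if environ is None else environ
    version_info = sys.version_info if version_info is None else version_info
    return {
        # conda sets CONDA_DEFAULT_ENV when an environment is activated.
        "conda_env": environ.get("CONDA_DEFAULT_ENV", "(none)"),
        "python": f"{version_info[0]}.{version_info[1]}",
    }


if __name__ == "__main__":
    summary = environment_summary()
    print("conda environment:", summary["conda_env"])
    print("python version:", summary["python"])
```

With the environment above activated, the first line should name AIRA and the second should report 3.12.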

Create a working directory for the AIRA files somewhere. Eventually you can change the name of the AIRA directory to your favorite SUBJECT matter for your prompt topics and papers.

You will install the needed packages in this environment. Package updates occur often, so validated packages and versions are listed in the requirements.txt file. I like to install each one individually to make sure everything loads properly.

$ pip install pypdf
$ pip install pytest
$ pip install boto3
$ pip install langchain-community
$ pip install langchain-ollama
$ pip install langchain-chroma
$ pip install streamlit
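After installing, a quick way to see whether anything is missing is to ask Python whether each package can be located. This is a rough sketch, not a file from the repository; `missing_modules` and `AIRA_MODULES` are hypothetical names. It only checks that each package can be found, not that it imports without errors.

```python
import importlib.util

# Import names for the packages installed above (pip names use hyphens,
# import names use underscores).
AIRA_MODULES = [
    "pypdf",
    "pytest",
    "boto3",
    "langchain_community",
    "langchain_ollama",
    "langchain_chroma",
    "streamlit",
]


def missing_modules(names):
    """Return the subset of `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(AIRA_MODULES)
    print("Missing modules:", missing if missing else "none")
```

Any name it reports can be installed again with pip, using the hyphenated package name.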

The original source is a tutorial at github.com/pixegami/rag-tutorial-v2; modifications were made to reflect updates in the available Python packages.

There should be no issues under a Linux environment; the peculiar steps suggested above arose when walking someone through this process on a Windows machine (not sure about Apple). It seems to be a permission issue between user and administrator accounts (software installs require administrator permissions).

Now you're ready to create the vector store for your collection of PDF documents. Place your PDF files in a directory called 'data'. Start with one, or a few, until you're comfortable with the limits of your machine. There can be a hard-limit overflow which will cause populating the vector store to fail.
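One common way to stay under a limit like this is to hand chunks to the vector store in fixed-size batches. The sketch below is illustrative only: `batched` and `batch_size` are hypothetical names, the repository's `populate_database.py` may work differently, and it assumes the limit applies to a single add operation rather than to the store as a whole.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


if __name__ == "__main__":
    # Stand-in for document chunks; real chunks would come from the PDFs.
    chunks = [f"chunk-{i}" for i in range(12)]
    for number, batch in enumerate(batched(chunks, batch_size=5), start=1):
        # In a real script, this is where one batch would be added to the store.
        print(f"batch {number}: {len(batch)} chunks")
```

With 12 stand-in chunks and a batch size of 5, this produces batches of 5, 5, and 2.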

Once the PDFs are in place the directory structure should look like:

./AIRA/data/file01.pdf
./AIRA/data/file02.pdf
./AIRA/data/file03.pdf
./AIRA/get_embedding_function.py
./AIRA/populate_database.py
./AIRA/query_data.py
./AIRA/requirements.txt
./AIRA/RunningAIRA.txt
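Before building the vector store, it can be worth confirming which PDFs will actually be picked up. This is a small sketch under the layout shown above; `list_pdfs` is a hypothetical helper, not a function in the repository, and it should be run from inside the AIRA directory.

```python
from pathlib import Path


def list_pdfs(data_dir="data"):
    """Return the PDF files directly inside `data_dir`, sorted by name."""
    folder = Path(data_dir)
    if not folder.is_dir():
        return []
    return sorted(p for p in folder.iterdir()
                  if p.is_file() and p.suffix.lower() == ".pdf")


if __name__ == "__main__":
    pdfs = list_pdfs()
    print(f"Found {len(pdfs)} PDF file(s) in ./data")
    for pdf in pdfs:
        print(" ", pdf.name)
```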

Enable the embedding function:

$ python get_embedding_function.py

Load documents into vector store:

$ python populate_database.py

You are ready to begin Retrieval Augmented Generation (RAG) prompts:

$ python query_data.py "Can you summarize the contents of the documents provided?"

If it replies you may be on a new road to discovery. Have fun!
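For readers new to RAG, the idea is that the passages best matching your question are pasted into the prompt before it is sent to the LLM. The sketch below shows only that assembly step, assuming the matching chunks have already been retrieved. `build_prompt` and the template wording are hypothetical and are not taken from `query_data.py`.

```python
PROMPT_TEMPLATE = """Answer the question based only on the following context:

{context}

---

Question: {question}
"""


def build_prompt(question, matched_chunks):
    """Combine retrieved text chunks and the question into one prompt string."""
    context = "\n\n---\n\n".join(matched_chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)


if __name__ == "__main__":
    # Placeholder chunks; in practice these come from the vector store search.
    chunks = ["First matching passage.", "Second matching passage."]
    question = "Can you summarize the contents of the documents provided?"
    print(build_prompt(question, chunks))
```

Limiting the context to a handful of top matches keeps the prompt short enough for the model to handle.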

As your collection of information grows, try adding new documents to the './data' directory and re-issue the embedding and populate python commands. If new documents are detected they will be added to your vector database (remember not to exceed the vector store's capacity to read your documents; a limit of 41500 chunks was present in the chromadb version used).
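Detecting new documents usually comes down to giving every chunk a stable ID and skipping IDs the store already holds. The following is a sketch of that idea, not the repository's code: `select_new_chunks` and the ID scheme shown are assumptions made for illustration.

```python
def select_new_chunks(chunks, existing_ids):
    """Return the chunks whose "id" is not already in `existing_ids`."""
    seen = set(existing_ids)
    new_chunks = []
    for chunk in chunks:
        if chunk["id"] not in seen:
            new_chunks.append(chunk)
            seen.add(chunk["id"])  # also skip duplicates within this run
    return new_chunks


if __name__ == "__main__":
    # Hypothetical ID scheme: "<file path>:<page>:<chunk index on that page>".
    stored = {"data/file01.pdf:0:0", "data/file01.pdf:0:1"}
    candidates = [
        {"id": "data/file01.pdf:0:0", "text": "already stored"},
        {"id": "data/file02.pdf:0:0", "text": "from a new document"},
    ]
    for chunk in select_new_chunks(candidates, stored):
        print("would add:", chunk["id"])
```

Because only unseen IDs are kept, re-running the populate step on an unchanged data directory would add nothing.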

You can unpack multiple subject directories, renaming the AIRA directory to different topics such as biology, chemistry, fruits, vegetables, etc.

Optional browser use:

Streamlit can be used to view your results in a browser interface. Start it with:

$ streamlit run app.py

Then open a browser with the link provided (usually http://localhost:8501/). The top matches are still displayed in the console, but the browser lets you select and view the PDFs in the data directory. The command-line process is still needed to build the vector store if you add new documents. The streamlit option is for common use scenarios.

Note: If you're in a Linux command shell, run the following to see the commands to issue after installing ollama:

$ grep '\$' README.md
