
Intelligent Document Search for the Staatsarchiv Zurich

Enhance access to the central collections of the Staatsarchiv Zürich with an intelligent hybrid search application.


Usage

This repository provides the production-ready code for our search app, which is available online here.

To set up the app:

  • Create a Conda environment: conda create -n search python=3.9
  • Activate the environment: conda activate search
  • Clone this repository.
  • Change into the project directory: cd ai-search_staatsarchiv/
  • Install the required packages: pip install -r requirements.txt

Run the Notebooks to Prepare the Data

  • Run the notebooks. Open them in an IDE like Visual Studio Code, or use Jupyter Notebook or JupyterLab.
  • Use the final notebook to create the Weaviate search index. By default, the index data is stored in ~/.local/share/weaviate/. If you deploy the app on a remote machine, copy the index data to the same path on that machine, or change the path in the app like so: client = weaviate.connect_to_embedded(persistence_data_path="/your_data_path_on_your_vm/") (see the sketch below).
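
The following is a minimal sketch of how an embedded Weaviate index can be created with the v4 Python client. The collection name, the properties, and the choice of a «none» vectorizer (i.e. supplying your own embedding vectors) are illustrative assumptions and do not necessarily mirror the schema used in the notebooks:

```python
import weaviate
from weaviate.classes.config import Configure, DataType, Property

# Connect to embedded Weaviate with an explicit persistence path.
client = weaviate.connect_to_embedded(persistence_data_path="/your_data_path_on_your_vm/")

# Collection with externally supplied vectors (the texts are embedded separately).
documents = client.collections.create(
    name="Documents",
    vectorizer_config=Configure.Vectorizer.none(),
    properties=[
        Property(name="text", data_type=DataType.TEXT),
        Property(name="signature", data_type=DataType.TEXT),
    ],
)

# Insert one text chunk together with its embedding vector.
documents.data.insert(
    properties={"text": "Beschluss des Regierungsrates ...", "signature": "RRB 1804/1"},
    vector=[0.01, 0.02, 0.03],  # in practice: the embedding of the text chunk
)

client.close()
```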

Run the Search App

  • Change into app directory: cd _streamlit_app/
  • Start the app: streamlit run hybrid_search_stazh.py

Note

The app logs user interactions locally to a file named app.log. If you prefer not to collect analytics, simply comment out the relevant function call in the code.

Embedding Model

For the embeddings we use Jina AI's jina-embeddings-v2-base-de. It is a German/English bilingual text embedding model that supports sequence lengths of up to 8,192 tokens. According to the model card it is designed for «high performance in mono-lingual & cross-lingual applications and trained … specifically to support mixed German-English input without bias». Technical report here.
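
As a minimal illustration (assuming the sentence-transformers library; the notebooks may load the model differently), the model can be used like this:

```python
from sentence_transformers import SentenceTransformer

# trust_remote_code is required because the Jina v2 models ship custom modeling code.
model = SentenceTransformer("jinaai/jina-embeddings-v2-base-de", trust_remote_code=True)

texts = [
    "Beschluss des Regierungsrates betreffend den Bau einer Eisenbahnlinie.",
    "Resolution of the Government Council concerning the construction of a railway line.",
]
embeddings = model.encode(texts, normalize_embeddings=True)
print(embeddings.shape)  # (2, 768)
```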

We tested several other models as well. PM-AI's bi-encoder_msmarco_bert-base_german model proved to be comparable and an excellent choice too. Jina's model offers more flexibility in terms of context (8,192 tokens vs. 350) and provides bilingual capabilities. Both models are relatively lightweight, with PM-AI's model at 440 MB and Jina's at 326 MB.

Note that we chunk all text on a sentence basis to a maximum of 500 tokens with a 100-token overlap.
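
For illustration only, a chunking routine along these lines could look as follows. This is a hedged sketch under the stated parameters, not the notebooks' actual implementation; the sentence splitter and helper names are hypothetical, and very long single sentences are kept whole here:

```python
import re
from transformers import AutoTokenizer

# Tokenizer of the embedding model, used here only to count tokens.
tokenizer = AutoTokenizer.from_pretrained("jinaai/jina-embeddings-v2-base-de")

MAX_TOKENS = 500      # maximum tokens per chunk
OVERLAP_TOKENS = 100  # approximate overlap between consecutive chunks

def n_tokens(text: str) -> int:
    return len(tokenizer.encode(text, add_special_tokens=False))

def chunk_text(text: str) -> list[str]:
    """Greedily pack whole sentences into chunks of at most MAX_TOKENS,
    starting each new chunk with trailing sentences worth roughly OVERLAP_TOKENS."""
    # Naive sentence split on end-of-sentence punctuation.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], []
    for sentence in sentences:
        if current and n_tokens(" ".join(current + [sentence])) > MAX_TOKENS:
            chunks.append(" ".join(current))
            # Carry over the last sentences as overlap for the next chunk.
            overlap, total = [], 0
            for prev in reversed(current):
                total += n_tokens(prev)
                if total > OVERLAP_TOKENS:
                    break
                overlap.insert(0, prev)
            current = overlap
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```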

Project Information

The Staatsarchiv Zürich manages and catalogs the «Zentralen Serien des Kantons Zürich 19. und 20. Jahrhundert» (the central series of the Canton of Zurich from the 19th and 20th centuries), which include important historical documents such as minutes of the Cantonal Council, Government Council resolutions, collections of laws, and the Official Gazette. These records span from 1803 to the present, making them linguistically and thematically diverse.

We (the Staatsarchiv and the Statistical Office) developed an intelligent search application that enhances access to these extensive archives.

What Does the App Do?

This app allows users to search through these extensive archives using both lexical and semantic search methods. Unlike a traditional lexical search that looks for exact keywords, semantic search identifies words, sentences, or paragraphs with similar meanings, even if they don't exactly match the search term. For example, a search for «technology» might return documents containing related concepts like «digitalization», «artificial intelligence», «software development», or «computer science» even if «technology» isn't mentioned directly.

Additionally, semantic search can retrieve documents related to a reference text. For instance, entering a document reference like RRB 1804/1 will return documents with similar themes.

Semantic search leverages statistical methods and machine learning to analyze large text corpora. This allows models to learn word and sentence similarities and enables more nuanced document retrieval. While semantic search offers significant benefits, results can sometimes be incomplete or include irrelevant matches.
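
Under the hood, the app combines both approaches in a single hybrid query. A minimal sketch with the Weaviate v4 Python client is shown below; the collection name «Documents», the alpha weighting, and the query are illustrative assumptions, not the app's actual configuration:

```python
import weaviate
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("jinaai/jina-embeddings-v2-base-de", trust_remote_code=True)
client = weaviate.connect_to_embedded()
documents = client.collections.get("Documents")  # hypothetical collection name

query = "Technologie"
results = documents.query.hybrid(
    query=query,                          # lexical (BM25) component
    vector=model.encode(query).tolist(),  # semantic component
    alpha=0.5,                            # 0 = purely lexical, 1 = purely semantic
    limit=10,
)
for obj in results.objects:
    print(obj.properties)

client.close()
```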

Findings

  • Hybrid search significantly improves search results compared to traditional lexical search, especially for complex or fuzzy queries and for large corpora spanning over two centuries.
  • The embedding models we tested (and the one we use in the app) are astonishingly agnostic to the historical language used in the documents. Based on our observations, these models can capture the semantic meaning of very old texts too.
  • Weaviate has proven to be a reliable and efficient tool for semantic search. It is easy to use and integrates well with Python.
  • The app is inexpensive to run and maintain. It can be deployed on a local machine or a virtual machine with moderate resources. At the moment we use a VM with 8 CPUs and 32 GB RAM.

Project Team

Rebekka Plüss (Staatsarchiv) and Patrick Arnecke (Statistisches Amt, Team Data). A big thanks goes to Sarah Murer and Dominik Frefel too!

Feedback and Contributing

We welcome your feedback! Please share your thoughts and let us know how you use the app in your institution. You can email us or contribute by opening an issue or a pull request.

Please note that we use Ruff for linting and code formatting with default settings.