Kreuzberg is a library for simplified text extraction from PDF files and other documents. It's meant to offer simple, hassle-free text extraction.
Why?
I am building, like many do now, a RAG-focused service (check out https://grantflow.ai). I have text extraction needs. There are quite a lot of commercial options out there, and several open-source + paid options. But I wanted something simple, which does not require expensive round trips to an external API. Furthermore, I wanted something that is easy to run locally, is not very heavy, and does not require a GPU.
Hence, this library.
- Extract text from PDFs, images, office documents and more (see supported formats below)
- Use modern Python with async (via anyio) and proper type hints
- Extensive error handling for easy debugging
- Begin by installing the Python package:
  pip install kreuzberg
- Install the system dependencies:
  - pandoc (non-PDF text extraction, GPL v2.0 licensed but used via CLI only)
  - tesseract-ocr (for image/PDF OCR, Apache License)
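Since both system dependencies are command-line tools, it can help to confirm they are reachable before running any extraction. Below is a minimal sketch using only the standard library. The executable names `pandoc` and `tesseract` are assumptions based on the usual package defaults, and `find_missing_tools` is a hypothetical helper, not part of Kreuzberg.

```python
import shutil

# Assumed executable names; your distribution may install them differently.
REQUIRED_TOOLS = ("pandoc", "tesseract")


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = find_missing_tools()
if missing:
    print("Missing system dependencies:", ", ".join(missing))
else:
    print("All system dependencies found.")
```

Running this once at startup gives a clearer message than a failure in the middle of an extraction.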
This library is built to be minimalist and simple. It also aims to utilize OSS tools for the job. It's fundamentally a high-order async abstraction on top of other tools. Think of it as the library you would otherwise bake into your own code base, but polished and well maintained.
- PDFs are processed using pdfium2 for searchable PDFs + Tesseract OCR for scanned documents
- Images are processed using Tesseract OCR
- Office documents and other formats are processed using Pandoc
- PPTX files are converted using python-pptx
- HTML files are converted using html-to-markdown
- Plain text files are read directly with appropriate encoding detection
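The routing described above can be pictured as a lookup from file type to tool. The sketch below is only an illustration of that idea. The table, the `pick_backend` helper, and the choice to key on file extensions are all assumptions for the example, not Kreuzberg's actual internals.

```python
from pathlib import Path

# Hypothetical routing table derived from the list above.
BACKENDS = {
    ".pdf": "pdfium2 (+ Tesseract OCR for scans)",
    ".png": "Tesseract OCR",
    ".jpg": "Tesseract OCR",
    ".pptx": "python-pptx",
    ".html": "html-to-markdown",
    ".txt": "direct read",
}


def pick_backend(path: str) -> str:
    """Return the tool that would handle the given file in this sketch."""
    suffix = Path(path).suffix.lower()
    # Office documents and other formats fall through to Pandoc here.
    return BACKENDS.get(suffix, "Pandoc")


for name in ("report.PDF", "slides.pptx", "notes.docx"):
    print(name, "->", pick_backend(name))
```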
V1:
- html file text extraction
- better PDF table extraction
- TBD
V2:
- extra install groups (to make dependencies optional)
- metadata extraction (possible breaking change)
- TBD
Feel free to open a discussion or an issue on GitHub if you have any feature requests.
Contributions are welcome! Read the guidelines below.
Kreuzberg supports a wide range of file formats:
- PDF (.pdf) - both searchable and scanned documents
- Word Documents (.docx, .doc)
- Power Point Presentations (.pptx)
- OpenDocument Text (.odt)
- Rich Text Format (.rtf)
- JPEG, JPG (.jpg, .jpeg, .pjpeg)
- PNG (.png)
- TIFF (.tiff, .tif)
- BMP (.bmp)
- GIF (.gif)
- WebP (.webp)
- JPEG 2000 (.jp2, .jpx, .jpm, .mj2)
- Portable Anymap (.pnm)
- Portable Bitmap (.pbm)
- Portable Graymap (.pgm)
- Portable Pixmap (.ppm)
- HTML (.html, .htm)
- Plain Text (.txt)
- Markdown (.md)
- reStructuredText (.rst)
- LaTeX (.tex)
- Comma-Separated Values (.csv)
- Tab-Separated Values (.tsv)
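When a document arrives as raw bytes (an upload, for instance) there is no file extension to inspect, so the caller needs a mime type string such as "application/pdf". If the original file name is still known, the standard library can usually derive it. This is a small sketch and `guess_mime_type` is a hypothetical helper, not something Kreuzberg exports. Results for less common extensions depend on the platform's mime database.

```python
import mimetypes
from typing import Optional


def guess_mime_type(filename: str) -> Optional[str]:
    """Guess a mime type from a file name, or return None if unknown."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type


for name in ("report.pdf", "scan.png", "photo.jpg", "page.html"):
    print(name, "->", guess_mime_type(name))
```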
Kreuzberg exports two async functions:
- Extract text from a file (string path or pathlib.Path) using extract_file()
- Extract text from a byte-string using extract_bytes()
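Because both functions are coroutines, they need an event loop to run, and several files can be processed concurrently. The sketch below shows one way to do that with plain asyncio and a concurrency cap. To keep it runnable without any documents on disk, it uses a stand-in `fake_extractor`. The names `extract_many`, `fake_extractor`, and the `limit` parameter are all illustrative assumptions.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable, Tuple

# Any async callable that maps a path to extracted text.
Extractor = Callable[[str], Awaitable[str]]


async def extract_many(
    paths: Iterable[str], extractor: Extractor, limit: int = 4
) -> Dict[str, str]:
    """Run the extractor over all paths, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def worker(path: str) -> Tuple[str, str]:
        async with semaphore:
            return path, await extractor(path)

    pairs = await asyncio.gather(*(worker(path) for path in paths))
    return dict(pairs)


async def fake_extractor(path: str) -> str:
    """Stand-in for a real extraction call."""
    await asyncio.sleep(0)
    return f"text of {path}"


results = asyncio.run(extract_many(["a.pdf", "b.png"], fake_extractor))
print(results)
```

With Kreuzberg installed, the stand-in could be replaced by a small wrapper that awaits `extract_file(path)` and returns `result.content`.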
from pathlib import Path
from kreuzberg import extract_file

# Extract text from a PDF file
async def extract_pdf():
    result = await extract_file("document.pdf")
    print(f"Extracted text: {result.content}")
    print(f"Output mime type: {result.mime_type}")

# Extract text from an image
async def extract_image():
    result = await extract_file("scan.png")
    print(f"Extracted text: {result.content}")

# or use Path
async def extract_pdf_from_path():
    result = await extract_file(Path("document.pdf"))
    print(f"Extracted text: {result.content}")
    print(f"Output mime type: {result.mime_type}")

from kreuzberg import extract_bytes

# Extract text from PDF bytes
async def process_uploaded_pdf(pdf_content: bytes):
    result = await extract_bytes(pdf_content, mime_type="application/pdf")
    return result.content

# Extract text from image bytes
async def process_uploaded_image(image_content: bytes):
    result = await extract_bytes(image_content, mime_type="image/jpeg")
    return result.content

When extracting a PDF file or bytes, you might want to force OCR, for example if the PDF includes images containing text that should be extracted.
You can do this by passing force_ocr=True:
from kreuzberg import extract_bytes

# Extract text from PDF bytes and force OCR
async def process_uploaded_pdf(pdf_content: bytes):
    result = await extract_bytes(pdf_content, mime_type="application/pdf", force_ocr=True)
    return result.content

Kreuzberg raises two exception types:
ValidationError is raised when there are issues with input validation:
- Unsupported mime types
- Undetectable mime types
- Path doesn't point to an existing file
ParsingError is raised when there are issues during the text extraction process:
- PDF parsing failures
- OCR errors
- Pandoc conversion errors
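For logging, it can be handy to flatten either exception into a plain dictionary. The sketch below assumes only that the exception exposes `message` and `context` attributes, as this README's own examples use them, and falls back gracefully when they are absent. `describe_error` and `FakeParsingError` are hypothetical names used so the example runs on its own.

```python
from typing import Any, Dict


def describe_error(exc: Exception) -> Dict[str, Any]:
    """Turn an exception into a dictionary suitable for structured logs."""
    return {
        "type": type(exc).__name__,
        "message": getattr(exc, "message", str(exc)),
        "context": getattr(exc, "context", None),
    }


class FakeParsingError(Exception):
    """Stand-in exception carrying the same two attributes."""

    def __init__(self, message: str, context: Dict[str, Any]) -> None:
        super().__init__(message)
        self.message = message
        self.context = context


report = describe_error(FakeParsingError("OCR failed", {"file_path": "scanned.pdf"}))
print(report)
```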
from kreuzberg import extract_file
from kreuzberg.exceptions import ValidationError, ParsingError

async def safe_extract():
    try:
        result = await extract_file("document.doc")
        return result.content
    except ValidationError as e:
        print(f"Validation error: {e.message}")
        print(f"Context: {e.context}")
    except ParsingError as e:
        print(f"Parsing error: {e.message}")
        print(f"Context: {e.context}")  # Contains detailed error information

Both error types include helpful context information for debugging:
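from kreuzberg import extract_file
from kreuzberg.exceptions import ParsingError

async def extract_scanned():
    try:
        result = await extract_file("scanned.pdf")
        return result.content
    except ParsingError as e:
        # e.context might contain:
        # {
        #     "file_path": "scanned.pdf",
        #     "error": "Tesseract OCR failed: Unable to process image"
        # }
        print(f"Context: {e.context}")

All extraction functions return an ExtractionResult named tuple containing: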
- content: The extracted text as a string
- mime_type: The mime type of the output (either "text/plain" or, if pandoc is used, "text/markdown")
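One practical use of the mime_type field is choosing a file extension when saving the extracted text. Here is a minimal sketch of that idea. `FakeResult` is a stand-in with the same two fields so the example runs without the library, and `output_name` and the extension table are illustrative assumptions.

```python
from typing import NamedTuple


class FakeResult(NamedTuple):
    """Stand-in mirroring the two fields described above."""

    content: str
    mime_type: str


# The two output mime types named above, mapped to file extensions.
EXTENSIONS = {"text/plain": ".txt", "text/markdown": ".md"}


def output_name(stem: str, result: FakeResult) -> str:
    """Build an output file name whose extension matches the mime type."""
    return stem + EXTENSIONS.get(result.mime_type, ".txt")


print(output_name("report", FakeResult("# Title", "text/markdown")))
print(output_name("scan", FakeResult("plain words", "text/plain")))
```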
from kreuzberg import ExtractionResult, extract_file

async def process_document(path: str) -> str:
    result: ExtractionResult = await extract_file(path)
    return result.content

# or access the result as a tuple
async def process_document_as_tuple(path: str) -> str:
    content, mime_type = await extract_file(path)
    # do something with mime_type
    return content

This library is open to contribution. Feel free to open issues or submit PRs. It's better to discuss issues before submitting PRs to avoid disappointment.
- Clone the repo
- Install the system dependencies
- Install the full dependencies with uv sync
- Install the pre-commit hooks with: pre-commit install && pre-commit install --hook-type commit-msg
- Make your changes and submit a PR
This library uses the MIT license.