Summary
Add a PDF community source so Coral can query text, tables, images, links, annotations, form fields, and document metadata from local PDF documents through SQL. The source uses a PyMuPDF converter script to extract page-level and document-level data into JSONL files.
Why this source
PDF is one of the most common document formats for resumes, reports, contracts, and research papers. Coral does not currently include a PDF source under sources/core or sources/community. Adding PDF support enables users to search across document collections, extract structured data from tables, find hyperlinks and annotations, and analyze form fields—all through Coral SQL.
Provider docs
Source shape
| Table | Description | Columns |
|---|---|---|
| pages | Text, tables, images, links, annotations, and form fields extracted from PDF documents, one row per page. | 12 columns |
| documents | Per-document summary metadata including table of contents, embedded files, page labels, and file-level metadata. | 8 columns |
Source scope
- Uses a Python converter script (`scripts/pdf-to-jsonl.py`) that reads PDFs via PyMuPDF and writes JSONL files.
- The manifest reads from `~/.coral/pdf/pages.jsonl` and `~/.coral/pdf/documents.jsonl`.
- Provides read-only access to extracted PDF content and metadata.
- No required filters — all columns are queryable directly.
- Includes a converter script with support for OCR (via `ocrmypdf`), recursive directory scanning, and custom output paths.
- Binary image data is not stored — only references (xref, dimensions).
- Table extraction uses PyMuPDF's built-in table detection.
- Scanned/image-only PDFs will have empty `text` without the `--ocr` flag.
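The extraction flow described above can be sketched roughly as follows. This is a minimal illustration, not the contents of `scripts/pdf-to-jsonl.py`: the function names (`iter_pdfs`, `page_rows`, `write_jsonl`) and the row field names are assumptions made for the example, and the actual script and manifest columns may differ. It assumes a PyMuPDF version recent enough to provide `Page.find_tables()`.

```python
import json
from pathlib import Path


def iter_pdfs(root, recursive=False):
    """Yield PDF paths under root, optionally descending into subdirectories."""
    root = Path(root)
    pattern = "**/*.pdf" if recursive else "*.pdf"
    yield from sorted(root.glob(pattern))


def page_rows(pdf_path):
    """Yield one dict per page. Field names are illustrative assumptions."""
    import fitz  # PyMuPDF, imported lazily so the other helpers work without it

    with fitz.open(pdf_path) as doc:
        for page in doc:
            yield {
                "file": str(pdf_path),
                "page": page.number + 1,  # page.number is zero-based
                "text": page.get_text(),
                "links": [
                    link["uri"] for link in page.get_links() if link.get("uri")
                ],
                # Store references only (xref and dimensions), never image bytes.
                "images": [
                    {"xref": img[0], "width": img[2], "height": img[3]}
                    for img in page.get_images(full=True)
                ],
                # Built-in table detection; each table becomes a list of rows.
                "tables": [t.extract() for t in page.find_tables().tables],
            }


def write_jsonl(rows, out_path):
    """Write rows as one JSON object per line and return the row count."""
    out_path = Path(out_path).expanduser()
    out_path.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with out_path.open("w", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps(row, ensure_ascii=False) + "\n")
            count += 1
    return count
```

A caller could chain these as `write_jsonl((r for p in iter_pdfs(src, recursive=True) for r in page_rows(p)), "~/.coral/pdf/pages.jsonl")`, where `src` is a hypothetical input directory. OCR, annotations, and form fields are left out of the sketch to keep it short.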
Validation plan
- `coral source lint`
- `coral source add --file`
- `coral source test`
- `coral.tables` introspection
- `coral.columns` introspection for both tables
- Live pages query
- Live documents query
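Before running the Coral checks, it may help to confirm that the generated JSONL files are well formed. The sketch below is an optional pre-check, not part of the Coral CLI or the converter script; `check_jsonl` is a hypothetical helper that verifies every line parses as a JSON object and that all rows share the same set of keys.

```python
import json
from pathlib import Path


def check_jsonl(path):
    """Return (row_count, sorted column names); raise ValueError on a bad line."""
    path = Path(path).expanduser()
    columns = None
    count = 0
    with path.open(encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                row = json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{lineno}: invalid JSON ({exc.msg})") from exc
            if not isinstance(row, dict):
                raise ValueError(f"{path}:{lineno}: expected a JSON object")
            keys = sorted(row)
            if columns is None:
                columns = keys  # the first row defines the expected columns
            elif keys != columns:
                raise ValueError(f"{path}:{lineno}: columns differ from the first row")
            count += 1
    return count, columns or []
```

Given the source shape above, `check_jsonl("~/.coral/pdf/pages.jsonl")` would be expected to report 12 column names and `check_jsonl("~/.coral/pdf/documents.jsonl")` 8, assuming the converter writes every column on every row.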