Add PDF community source #1259

@FiscalMindset

Summary

Add a PDF community source so Coral can query text, tables, images, links, annotations, form fields, and document metadata from local PDF documents through SQL. The source uses a PyMuPDF converter script to extract page-level and document-level data into JSONL files.

Why this source

PDF is one of the most common document formats for resumes, reports, contracts, and research papers. Coral does not currently include a PDF source under sources/core or sources/community. Adding PDF support enables users to search across document collections, extract structured data from tables, find hyperlinks and annotations, and analyze form fields—all through Coral SQL.

Provider docs

Source shape

| Table | Description | Columns |
| --- | --- | --- |
| pages | Text, tables, images, links, annotations, and form fields extracted from PDF documents, one row per page. | 12 columns |
| documents | Per-document summary metadata including table of contents, embedded files, page labels, and file-level metadata. | 8 columns |

Source scope

  • Uses a Python converter script (scripts/pdf-to-jsonl.py) that reads PDFs via PyMuPDF and writes JSONL files.
  • The manifest reads from ~/.coral/pdf/pages.jsonl and ~/.coral/pdf/documents.jsonl.
  • Provides read-only access to extracted PDF content and metadata.
  • No required filters — all columns are queryable directly.
  • Includes a converter script with support for OCR (via ocrmypdf), recursive directory scanning, and custom output paths.
  • Binary image data is not stored — only references (xref, dimensions).
  • Table extraction uses PyMuPDF's built-in table detection.
  • Scanned/image-only PDFs will have empty text without the --ocr flag.
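
The extraction flow described in the list above could look roughly like the following sketch. It is not the contents of scripts/pdf-to-jsonl.py, which is not shown in this issue. The function names and the record fields (`file_path`, `page_number`, `text`, `link_count`, `image_xrefs`) are assumptions chosen for illustration and do not reflect the actual 12-column schema. PyMuPDF is imported lazily so that the JSONL writer can be used without it installed.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator


def iter_page_records(pdf_path: Path) -> Iterator[dict]:
    """Yield one dict per page of a PDF (field names are hypothetical)."""
    import fitz  # PyMuPDF, imported lazily; only needed for real extraction

    with fitz.open(str(pdf_path)) as doc:
        for page in doc:
            yield {
                "file_path": str(pdf_path),
                "page_number": page.number + 1,  # page.number is 0-based
                "text": page.get_text(),
                "link_count": len(page.get_links()),
                # Keep only image references (xref), never the binary data.
                "image_xrefs": [img[0] for img in page.get_images(full=True)],
            }


def write_jsonl(records: Iterable[dict], out_path: Path) -> int:
    """Write records as one JSON object per line and return the row count."""
    out_path.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with out_path.open("w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

With PyMuPDF installed, a call such as `write_jsonl(iter_page_records(Path("report.pdf")), Path.home() / ".coral" / "pdf" / "pages.jsonl")` would produce a file at the path the manifest reads from; `report.pdf` is a placeholder file name.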

Validation plan

  • coral source lint
  • coral source add --file
  • coral source test
  • coral.tables introspection
  • coral.columns introspection for both tables
  • Live pages query
  • Live documents query
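
Before running the live queries in the plan above, it may help to confirm that the converter output is well-formed JSONL. The helper below is a minimal sketch using only the standard library; `check_jsonl` is a hypothetical name and is not part of Coral or the converter script.

```python
import json
from pathlib import Path


def check_jsonl(path: Path) -> tuple[int, set[str]]:
    """Return (row count, union of keys); raise ValueError on a bad line."""
    rows = 0
    keys: set[str] = set()
    with path.open(encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{lineno}: invalid JSON ({exc.msg})") from exc
            if not isinstance(record, dict):
                raise ValueError(f"{path}:{lineno}: expected a JSON object")
            keys.update(record)  # collect the column names seen so far
            rows += 1
    return rows, keys
```

Running it against `~/.coral/pdf/pages.jsonl` and `~/.coral/pdf/documents.jsonl` gives a row count and the set of keys, which can be compared by eye with what the coral.columns introspection reports for each table.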
