Add PDF community source #1259

## Summary

Add a PDF community source so Coral can query text, tables, images, links, annotations, form fields, and document metadata from local PDF documents through SQL. The source uses a PyMuPDF converter script to extract page-level and document-level data into JSONL files.

## Why this source

PDF is one of the most common document formats for resumes, reports, contracts, and research papers. Coral does not currently include a PDF source under `sources/core` or `sources/community`. Adding PDF support lets users search across document collections, extract structured data from tables, find hyperlinks and annotations, and analyze form fields, all through Coral SQL.

## Provider docs

- PyMuPDF: https://pymupdf.readthedocs.io/
- OCRmyPDF (optional, for scanned PDFs): https://ocrmypdf.readthedocs.io/

## Source shape

| Table | Description | Columns |
| --- | --- | --- |
| `pages` | Text, tables, images, links, annotations, and form fields extracted from PDF documents, one row per page. | 12 columns |
| `documents` | Per-document summary metadata including table of contents, embedded files, page labels, and file-level metadata. | 8 columns |

## Source scope

- Uses a Python converter script (`scripts/pdf-to-jsonl.py`) that reads PDFs via PyMuPDF and writes JSONL files.
- The manifest reads from `~/.coral/pdf/pages.jsonl` and `~/.coral/pdf/documents.jsonl`.
- Provides read-only access to extracted PDF content and metadata.
- No required filters: all columns are queryable directly.
- The converter script also supports OCR (via `ocrmypdf`), recursive directory scanning, and custom output paths.
- Binary image data is not stored; only references (xref, dimensions) are kept.
- Table extraction uses PyMuPDF's built-in table detection.
- Scanned/image-only PDFs will have empty `text` without the `--ocr` flag.
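
As a rough illustration of the extraction flow described above, the sketch below reads a PDF with PyMuPDF and writes one JSON object per page. It is not the proposed `scripts/pdf-to-jsonl.py`: the function names `iter_page_rows` and `write_jsonl` and the row keys (`file_path`, `page_number`, `text`, `links`, `images`, `tables`) are assumptions made for illustration, and the real `pages` table has 12 columns that are not listed here.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator


def iter_page_rows(pdf_path: Path) -> Iterator[dict]:
    """Yield one dict per page. Key names are hypothetical, not the real schema."""
    # PyMuPDF is imported lazily so write_jsonl below works without it installed.
    try:
        import pymupdf  # module name in recent PyMuPDF releases
    except ImportError:
        import fitz as pymupdf  # legacy module name for PyMuPDF

    with pymupdf.open(pdf_path) as doc:
        for page in doc:
            yield {
                "file_path": str(pdf_path),
                "page_number": page.number + 1,  # page.number is 0-based
                # Empty for scanned/image-only pages unless the PDF was OCR'd first.
                "text": page.get_text(),
                "links": [l["uri"] for l in page.get_links() if l.get("uri")],
                # Store references only (xref, dimensions), never image bytes.
                "images": [
                    {"xref": img[0], "width": img[2], "height": img[3]}
                    for img in page.get_images(full=True)
                ],
                # Built-in table detection; needs a PyMuPDF version with find_tables().
                "tables": [t.extract() for t in page.find_tables().tables],
            }


def write_jsonl(rows: Iterable[dict], out_path: Path) -> int:
    """Write rows as JSON Lines (one object per line) and return the row count."""
    out_path.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with out_path.open("w", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps(row, ensure_ascii=False) + "\n")
            count += 1
    return count


# Example call (paths are placeholders):
# write_jsonl(iter_page_rows(Path("report.pdf")),
#             Path.home() / ".coral" / "pdf" / "pages.jsonl")
```

Keeping extraction and writing separate means the same writer could serve both `pages.jsonl` and `documents.jsonl`, with only the row generator differing.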

## Validation plan

- `coral source lint`
- `coral source add --file`
- `coral source test`
- `coral.tables` introspection
- `coral.columns` introspection for both tables
- Live pages query
- Live documents query
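
Before running the live queries, it may help to confirm that the converter output is well-formed JSONL. The helper below is a minimal sketch of such a check: `summarize_jsonl` is a hypothetical name, and the check only confirms that every non-empty line parses as a JSON object; it does not validate the manifest's column definitions.

```python
import json
from pathlib import Path
from typing import Set, Tuple


def summarize_jsonl(path: Path) -> Tuple[int, Set[str]]:
    """Return (row_count, union_of_keys); raise ValueError on a malformed line."""
    count = 0
    keys: Set[str] = set()
    with path.open(encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                row = json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_no}: invalid JSON ({exc.msg})") from exc
            if not isinstance(row, dict):
                raise ValueError(f"{path}:{line_no}: expected a JSON object")
            keys.update(row)
            count += 1
    return count, keys


# Example call against the paths the manifest reads from:
# for name in ("pages.jsonl", "documents.jsonl"):
#     print(name, summarize_jsonl(Path.home() / ".coral" / "pdf" / name))
```

The number of distinct keys reported for each file can then be compared by hand with the column counts that `coral.columns` shows for `pages` and `documents`.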

