Detect if PDF is scan or native #2755

Us3r369 · 2025-12-08T16:43:58Z

Us3r369
Dec 8, 2025

Hi all,

I was wondering whether docling supports a check for whether a PDF file is native or a scanned document. I'm thinking about using this in my pipeline to determine if OCR is needed or if regular parsing suffices. From what I've read so far, the typical approach is to calculate the ratio of images to text in a document (possibly also considering word count, number of different fonts, etc.), then heuristically determine whether it's native or scanned. However, if I can avoid custom code by using some docling functionality, that would of course be ideal!

Cheers
Jan

Answered by dolfim-ibm

Dec 8, 2025

It is not possible yet (as a simple high-level API) but it is on the wish list. There could indeed be some useful automation and simplification which will be unlocked by having this feature.

@Us3r369 · 2025-12-08T16:47:46Z

dosubot[bot]
bot Dec 8, 2025

Hi @Us3r369! I'm Dosu and I’m helping the docling team.

Docling doesn't have a built-in flag or API to directly tell you if a PDF is native (text-based) or scanned (image-based), but it does expose all the signals you need to implement this check in your pipeline. Docling's OCR pipeline uses bitmap (image) coverage per page: if a page is mostly image, OCR is triggered; if there's enough selectable text, OCR is skipped. This is controlled by options like bitmap_area_threshold (default 0.75) and force_full_page_ocr [source].

You can access the number of text cells and bitmap rectangles per page using Docling's PDF backends (e.g., PyPdfium2, DoclingParseV4), which lets you implement your own scan/native heuristic. For example, if a page has no text cells but large bitmap coverage, it's almost certainly scanned; if it has many text cells and little bitmap area, it's native [source].
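A minimal sketch of such a heuristic, assuming the per-page text cell count and bitmap coverage have already been collected from whichever backend is in use. `PageStats`, `classify_page`, `classify_document`, and the threshold values are hypothetical names and numbers chosen for illustration. They are not part of Docling's API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageStats:
    text_cells: int         # number of text cells the PDF parser found on the page
    bitmap_coverage: float  # fraction of the page area covered by bitmaps (0.0 to 1.0)


def classify_page(stats: PageStats,
                  min_text_cells: int = 5,
                  bitmap_threshold: float = 0.75) -> str:
    """Label a single page as 'scanned', 'native', or 'unclear'.

    Both thresholds are illustrative assumptions and should be tuned on real data.
    """
    has_text = stats.text_cells >= min_text_cells
    mostly_image = stats.bitmap_coverage >= bitmap_threshold
    if not has_text and mostly_image:
        return "scanned"
    if has_text and not mostly_image:
        return "native"
    # e.g. a scan that already carries an OCR text layer, or a nearly empty page
    return "unclear"


def classify_document(pages: List[PageStats], majority: float = 0.5) -> str:
    """Label a document by the label that covers more than `majority` of its pages."""
    if not pages:
        return "unclear"
    labels = [classify_page(p) for p in pages]
    for label in ("scanned", "native"):
        if labels.count(label) / len(labels) > majority:
            return label
    return "mixed"
```

A page with many text cells and high bitmap coverage is reported as "unclear" rather than forced into either class, since a scan with an embedded OCR layer looks exactly like that.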

If you want stricter or more explicit control, you can pre-process PDFs with a library like PyMuPDF (fitz) to check for text layers before running Docling, or analyze the DoclingDocument after conversion to infer the document type [source].
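The PyMuPDF pre-check could look roughly like the sketch below. It only counts extracted characters per page, so it will treat a scan with an embedded OCR text layer as having text. `page_char_counts`, `needs_ocr`, and the default thresholds are hypothetical and chosen for illustration.

```python
from typing import List


def page_char_counts(pdf_path: str) -> List[int]:
    """Return the number of extractable text characters on each page."""
    import fitz  # PyMuPDF; imported lazily so needs_ocr() works without it

    with fitz.open(pdf_path) as doc:
        return [len(page.get_text().strip()) for page in doc]


def needs_ocr(char_counts: List[int],
              min_chars_per_page: int = 20,
              max_empty_fraction: float = 0.5) -> bool:
    """Decide whether a document should be sent through OCR.

    A page counts as 'empty' when it has fewer than `min_chars_per_page`
    extractable characters. Both thresholds are assumptions to tune.
    """
    if not char_counts:
        return True  # nothing extractable at all, so fall back to OCR
    empty = sum(1 for n in char_counts if n < min_chars_per_page)
    return empty / len(char_counts) > max_empty_fraction
```

Usage would be along the lines of `needs_ocr(page_char_counts("input.pdf"))`, with the result deciding whether OCR is enabled for that file in the Docling pipeline options.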

In summary: Docling doesn't provide a one-liner for this, but you can use its API to check for text cells and bitmap coverage per page and apply your own threshold or logic to decide if a PDF is scanned or native. This approach matches the heuristics you described and is how Docling internally decides when to apply OCR.

0 replies

dolfim-ibm · 2025-12-08T16:55:57Z

dolfim-ibm
Dec 8, 2025
Maintainer

It is not possible yet (as a simple high-level API) but it is on the wish list. There could indeed be some useful automation and simplification which will be unlocked by having this feature.

1 reply

dolfim-ibm Dec 8, 2025
Maintainer

Something we are also exploring is a swap in the pipeline: first run the (cheap) layout structure model, then check whether the parser has content in the detected text regions, and run OCR only in the regions where it does not.
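To make the idea concrete, here is a rough sketch of the "check if the parser has content in the text regions" step, assuming both the layout model's text regions and the parser's text cells are available as axis-aligned boxes in the same coordinate system. This is not Docling's implementation. `Box`, `regions_needing_ocr`, and the `min_overlap` value are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom)


def intersection_area(a: Box, b: Box) -> float:
    """Area of the overlap between two boxes, or 0.0 if they do not overlap."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)


def regions_needing_ocr(text_regions: List[Box],
                        parsed_cells: List[Box],
                        min_overlap: float = 0.1) -> List[Box]:
    """Return the layout text regions that the parser left (almost) empty.

    A region is kept when parsed text cells cover less than `min_overlap`
    of its area. The threshold is an assumption, not a Docling default.
    """
    missing = []
    for region in text_regions:
        area = (region[2] - region[0]) * (region[3] - region[1])
        if area <= 0:
            continue  # skip degenerate boxes
        covered = sum(intersection_area(region, cell) for cell in parsed_cells)
        if covered / area < min_overlap:
            missing.append(region)
    return missing
```

Only the returned regions would then be cropped and passed to OCR, so a native page with full parser coverage never triggers OCR at all.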
