Detect if PDF is scan or native #2755
Hi all, I was wondering whether docling supports checking if a PDF file is native or a scanned document. I'm thinking about using this in my pipeline to determine whether OCR is needed or regular parsing suffices. From what I've read so far, the typical approach is to calculate the ratio of images to text in a document (possibly also considering word count, number of different fonts, etc.), then heuristically determine whether it's native or scanned. However, if I can avoid custom code by using some docling functionality, that would of course be ideal! Cheers
Replies: 2 comments 1 reply
Hi @Us3r369! I'm Dosu and I'm helping the docling team.

Docling doesn't have a built-in flag or API that directly tells you whether a PDF is native (text-based) or scanned (image-based), but it does expose the signals you need to implement this check in your pipeline. Docling's OCR pipeline uses bitmap (image) coverage per page: if a page is mostly image, OCR is triggered; if there's enough selectable text, OCR is skipped. This behavior is controlled by the OCR options.

You can access the number of text cells and bitmap rectangles per page using Docling's PDF backends (e.g., PyPdfium2, DoclingParseV4), which lets you implement your own scan/native heuristic. For example, if a page has no text cells but large bitmap coverage, it's almost certainly scanned; if it has many text cells and little bitmap area, it's native [source].

If you want stricter or more explicit control, you can pre-process PDFs with a library like PyMuPDF (fitz) to check for text layers before running Docling, or analyze the DoclingDocument after conversion to infer the document type [source].

In summary: Docling doesn't provide a one-liner for this, but you can use its API to check for text cells and bitmap coverage per page and apply your own threshold or logic to decide whether a PDF is scanned or native. This approach matches the heuristics you described and is how Docling internally decides when to apply OCR.
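To make the heuristic above concrete, here is a minimal sketch in plain Python. It is not Docling's API: `PageStats`, `page_is_scanned`, `classify_document`, and both thresholds are hypothetical names and values chosen for illustration, and you would need to fill in the per-page numbers yourself from whichever backend or library you use.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class PageStats:
    """Per-page signals (hypothetical container, not a Docling type)."""
    text_cells: int         # number of selectable text cells found on the page
    bitmap_coverage: float  # fraction of the page area covered by bitmaps, 0.0 to 1.0


# Assumed thresholds for illustration only; tune them on your own documents.
MIN_TEXT_CELLS = 5
MAX_BITMAP_COVERAGE = 0.5


def page_is_scanned(
    stats: PageStats,
    min_text_cells: int = MIN_TEXT_CELLS,
    max_bitmap_coverage: float = MAX_BITMAP_COVERAGE,
) -> bool:
    """A page counts as scanned if it has (almost) no text and is mostly image."""
    return (
        stats.text_cells < min_text_cells
        and stats.bitmap_coverage > max_bitmap_coverage
    )


def classify_document(pages: Iterable[PageStats]) -> str:
    """Return 'scanned', 'native', or 'mixed' based on the per-page verdicts."""
    verdicts = [page_is_scanned(p) for p in pages]
    if not verdicts:
        raise ValueError("document has no pages")
    if all(verdicts):
        return "scanned"
    if not any(verdicts):
        return "native"
    return "mixed"


if __name__ == "__main__":
    doc = [PageStats(text_cells=0, bitmap_coverage=0.95),
           PageStats(text_cells=200, bitmap_coverage=0.0)]
    print(classify_document(doc))
```

Returning "mixed" separately is a design choice: documents that combine scanned and native pages are common, and in that case you may prefer to decide about OCR per page rather than per file. Note that a blank page (no text, no bitmap) is treated as not scanned under these assumptions.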
It is not possible yet (as a simple high-level API), but it is on the wish list. Having this feature would indeed unlock some useful automation and simplification.