Description
Basic Information
tesseract 5.3.0-19-ga3b9ac, compiled with --disable-legacy
Operating System
macOS 13 Ventura
Compiler
clang 14.0
Current Behavior
When Tesseract is compiled with --disable-legacy, hOCR output reports each line as being upside-down (textangle 180) and omits baseline information.
Steps to reproduce:
./configure --disable-legacy
./tesseract some-image.jpg output hocr
In the generated output.hocr file, ocr_line entries look like this:
<span class='ocr_line' id='line_1_142' title="bbox 1334 3054 2119 3088; textangle 180; x_size 34; x_descenders 8; x_ascenders 9">
Expected Behavior
If orientation information isn't available I'd expect the image to always be treated as if it were page-up. So entries should look like this:
<span class='ocr_line' id='line_1_142' title="bbox 1334 3054 2119 3088; baseline 0 -8; x_size 34; x_descenders 8; x_ascenders 9">
Suggested Fix
Internally, it looks like the issue is that:
- ColumnFinder::text_rotation_ is initialized to a null vector. When the legacy engine is disabled, the ColumnFinder::CorrectOrientation function does not get called, and so this vector remains null.
- This null vector gets propagated to PageIterator::Orientation, which does not handle this case correctly, as it converts this null vector to ORIENTATION_PAGE_DOWN (tesseract/src/ccmain/pageiterator.cpp, line 585 in a3b9acf):
  if (up_in_image.y() > 0.0F) {
- The HOCR renderer then maps this orientation value to textangle 180 and omits baseline info.
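To illustrate the chain described above, here is a minimal, self-contained sketch of how a null rotation vector can end up classified as page-down. It is a simplified model, not Tesseract's actual code: Vec2, Rotate, Orientation and Classify are hypothetical stand-ins, and the branch structure around the quoted `up_in_image.y() > 0.0F` test is an assumption.

```cpp
// Simplified model of the orientation check; NOT Tesseract's actual code.
// Vec2 is a hypothetical stand-in for a 2D float coordinate type.
struct Vec2 {
  float x;
  float y;
};

// Rotate v by the direction vector rot (complex multiplication).
// rot = (1, 0) is the identity rotation; rot = (0, 0) collapses v to (0, 0).
Vec2 Rotate(Vec2 v, Vec2 rot) {
  return {v.x * rot.x - v.y * rot.y, v.y * rot.x + v.x * rot.y};
}

enum class Orientation { kPageUp, kPageRight, kPageDown, kPageLeft };

// Assumed shape of the classification: only y > 0 counts as page-up,
// so a null vector (x == 0, y == 0) falls through to page-down.
Orientation Classify(Vec2 up_in_image) {
  if (up_in_image.x == 0.0F) {
    return up_in_image.y > 0.0F ? Orientation::kPageUp
                                : Orientation::kPageDown;
  }
  return up_in_image.x > 0.0F ? Orientation::kPageRight
                              : Orientation::kPageLeft;
}
```

In this model, rotating the "up" vector (0, 1) by a null rotation gives (0, 0), which Classify reports as page-down, while rotating by (1, 0) leaves it at (0, 1) and gives page-up.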
I tested two fixes locally: changing the initialization of ColumnFinder::text_rotation_ to match the norotation value in ColumnFinder::CorrectOrientation, or changing the logic in PageIterator::Orientation to handle null rotation vectors by mapping them to ORIENTATION_PAGE_UP. I'm happy to submit a PR, but I'm not sure which approach is preferred.
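For discussion, the second option (treating a null vector as page-up) could look roughly like the sketch below. This is not a patch against Tesseract: Vec2, Orientation, IsNull and ClassifyWithFallback are hypothetical names used only to show the guard.

```cpp
// Sketch of a null-vector guard; hypothetical types, not Tesseract's API.
struct Vec2 {
  float x;
  float y;
};

enum class Orientation { kPageUp, kPageRight, kPageDown, kPageLeft };

// True when no rotation information was ever written into the vector.
bool IsNull(Vec2 v) { return v.x == 0.0F && v.y == 0.0F; }

Orientation ClassifyWithFallback(Vec2 up_in_image) {
  // With no orientation information, assume the page is upright.
  if (IsNull(up_in_image)) {
    return Orientation::kPageUp;
  }
  if (up_in_image.x == 0.0F) {
    return up_in_image.y > 0.0F ? Orientation::kPageUp
                                : Orientation::kPageDown;
  }
  return up_in_image.x > 0.0F ? Orientation::kPageRight
                              : Orientation::kPageLeft;
}
```

The first option (initializing the rotation to the identity) would avoid the null vector in the first place, so a guard like this would only act as a safety net for other callers.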