@jronallo (Contributor) commented Aug 18, 2016

This is a potential fix for #30, where rawtext gets referenced without having been defined. If innerword.text is None, then rawtext never gets defined. This patch simply continues to the next word if innerword.text is also None. I'm not certain this is the best fix, but it works for me.

The issue arises because Tesseract sometimes outputs hOCR in which an ocrx_word contains just a single space. The cases I've seen are images of text that Tesseract has a lot of trouble with.

<div class='ocr_carea' id='block_1_1' title="bbox 118 3884 122 5088">
    <p class='ocr_par' dir='ltr' id='par_1_1' title="bbox 118 3884 122 5088">
     <span class='ocr_line' id='line_1_1' title="bbox 118 3884 122 5088; baseline 0 0"><span class='ocrx_word' id='word_1_1' title='bbox 118 3884 122 5088; x_wconf 95' lang='eng' dir='ltr'><strong><em> </em></strong></span> 
     </span>
    </p>
   </div>
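
For illustration, here is a minimal sketch of the guard described above. It is not the patched code itself: it uses the standard library's `xml.etree.ElementTree`, and the names `word_text`, `collect_words`, and `SAMPLE` are hypothetical, as is the simplified markup. In the first word, both the `ocrx_word` span and its first child have `text` set to None, because the lone space sits one level deeper inside `<em>`.

```python
import xml.etree.ElementTree as ET


def word_text(word):
    """Return the stripped text of an ocrx_word element, or None if unavailable."""
    if word.text is not None:
        return word.text.strip()
    if len(word) == 0:
        # No direct text and no child element to look into.
        return None
    innerword = word[0]
    if innerword.text is None:
        # The case from this report: <strong><em> </em></strong>
        return None
    return innerword.text.strip()


def collect_words(hocr_fragment):
    """Collect the text of every ocrx_word span, skipping words without text."""
    root = ET.fromstring(hocr_fragment)
    texts = []
    for word in root.iter("span"):
        if word.get("class") != "ocrx_word":
            continue
        rawtext = word_text(word)
        if rawtext is None:
            continue  # skip instead of using an undefined rawtext
        texts.append(rawtext)
    return texts


# Simplified, hypothetical markup modelled on the hOCR shown above.
SAMPLE = """<p>
<span class='ocrx_word' id='word_1_1'><strong><em> </em></strong></span>
<span class='ocrx_word' id='word_1_2'><strong>hello</strong></span>
<span class='ocrx_word' id='word_1_3'>world</span>
</p>"""

print(collect_words(SAMPLE))  # ['hello', 'world']
```

Returning None from a helper makes the "no usable text" case explicit, so the caller cannot reach a later use of `rawtext` without it having been assigned.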

@stweil added the bug label Aug 24, 2016
@stweil self-assigned this Aug 24, 2016
@stweil merged commit 50c8a9e into ocropus:master Aug 24, 2016
@stweil (Collaborator) commented Aug 24, 2016

Thank you for this pull request. I can confirm that it fixes the reported bug.
