Tess4J Activity

Brought to you by: nguyenq

9 months ago
Quan Nguyen posted a comment on discussion Open Discussion

Yes, it can be. https://github.com/java-native-access/jna/blob/master/www/DirectMapping.md
9 months ago
George posted a comment on discussion Open Discussion

Thanks, I will try it out. Is direct mapping more efficient than using the interface?
9 months ago
Quan Nguyen posted a comment on discussion Open Discussion

The Leptonica library has many dependencies for opening various image file types, such as TIFF, PNG, JPEG, etc., which in turn have other dependencies, as you've seen. On Windows, we were able to embed all the image library dependencies inside libleptonica.dll. We don't know how to generate a similar static library liblept.so on Linux. Installing Tesseract ensures that all the required dependency libraries are installed.
9 months ago
Quan Nguyen posted a comment on discussion Open Discussion

Please use the JNA Direct Mapping API — Leptonica1. https://tess4j.sourceforge.net/docs/index.html
9 months ago
George modified a comment on discussion Open Discussion

I encountered jdk.internal.org.objectweb.asm.MethodTooLargeException when I tried to load lept4j 1.16.1 using OpenJDK Runtime Environment Corretto-21.0.6.7.1 (build 21.0.6+7-LTS). I assume this is caused by the ASM library used by the JDK during invocation of Native.loadLibrary(), when the bytecode size exceeds the JDK's method size limit (64 KB). A similar issue is reported here: https://bugs.openjdk.org/browse/JDK-8314528 Is there any workaround to load lept4j 1.16.1 on JDK 21 without requiring to create...
9 months ago
George modified a comment on discussion Open Discussion

I encountered jdk.internal.org.objectweb.asm.MethodTooLargeException when I tried to load lept4j 1.16.1 using OpenJDK Runtime Environment Corretto-21.0.6.7.1 (build 21.0.6+7-LTS). I assume this is caused by the ASM library used by the JDK during invocation of Native.loadLibrary(), and it appears to be similar to the issue reported here: https://bugs.openjdk.org/browse/JDK-8314528 Is there any workaround to load lept4j 1.16.1 on JDK 21? In testing, tess4j 5.0.0 seems to load without issue on JDK 21.
9 months ago
George posted a comment on discussion Open Discussion

I encountered jdk.internal.org.objectweb.asm.MethodTooLargeException when I tried to load tess4j 5.0.0 and lept4j 1.16.1 using OpenJDK Runtime Environment Corretto-21.0.6.7.1 (build 21.0.6+7-LTS). I assume this is caused by the ASM library used by the JDK during invocation of Native.loadLibrary(), and it appears to be similar to the issue reported here: https://bugs.openjdk.org/browse/JDK-8314528 Is there any workaround to load tess4j 5.0.0 and lept4j 1.16.1 on JDK 21?
12 months ago
Srinivas Arava posted a comment on discussion Open Discussion

I was able to run tess4j on a Windows machine without actually installing the software. It picks up the required DLLs from the jars or the path. I am not able to do the same on Linux. I tried copying the .so files one by one until I hit a blocker: java.lang.UnsatisfiedLinkError: /lib64/libm.so.6: version 'GLIBC_2.29' not found (required by libpng15.so.15) My goal is to be able to run tess4j without explicitly installing Tesseract, simply by packaging the .so files. Can someone please guide...
1 year ago
Jian Wang modified a comment on discussion Open Discussion

I have solved my question.
1 year ago
Jian Wang modified a comment on discussion Open Discussion
1 year ago
Jian Wang posted a comment on discussion Open Discussion

Hello, Is Tess4J an open-source project? Where is the source code please? Thank you.
2 years ago
Quan Nguyen posted a comment on discussion Open Discussion

For Tesseract non-Windows binary, you'll have to install or compile it yourself. https://tesseract-ocr.github.io/tessdoc/#compiling-and-installation
2 years ago
Angelo Schneider posted a comment on discussion Open Discussion

Hello, for Macs the binary of the lib is missing: darwin/libtesseract.dylib Best regards, Angelo
2 years ago
Anonymous posted a comment on ticket #19

Thanks for your support. I ended up not using the path returned by the method. I let tess4j do its thing and that works fine. If I ever end up needing the path, I'll ensure that my registry value works or that I do it another way
2 years ago
Quan Nguyen modified a comment on ticket #19

On my Win11 machine, java.io.tmpdir is resolved to C:\Users\<username>\AppData\Local\Temp\tess4j. As you may have correctly assessed, this seems to be due to a legacy DOS setting in Windows on your machine. You might try setting the Windows Registry value as suggested in the second article you mentioned.
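To see what a given machine reports, a small check along the following lines may help. It is only a sketch: the class name `TempDirCheck` is made up for illustration, and it assumes the extraction folder is the `tess4j` subdirectory of `java.io.tmpdir`, as the path quoted above suggests. Whether the canonical form expands DOS-style 8.3 short names depends on the JDK and Windows configuration, so treat the second line of output as a diagnostic, not a guarantee.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical helper for inspecting the temp folder; not part of Tess4J.
final class TempDirCheck {

    /** Builds the tess4j subfolder of java.io.tmpdir (assumed extraction location). */
    static File extractionDir() {
        return new File(System.getProperty("java.io.tmpdir"), "tess4j");
    }

    public static void main(String[] args) throws IOException {
        File dir = extractionDir();
        // Path exactly as derived from the system property.
        System.out.println("as reported: " + dir.getPath());
        // Canonical form; on some Windows setups this differs from the line above.
        System.out.println("canonical:   " + dir.getCanonicalPath());
    }
}
```

If the two printed paths differ, the short-name form is coming from the environment rather than from the library.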
2 years ago
Quan Nguyen posted a comment on ticket #19

On my Win11 machine, java.io.tmpdir is resolved to C:\Users\<username>\AppData\Local\Temp\tess4j. As you may have correctly assessed, this seems to be due to a legacy DOS setting in Windows on your machine.
2 years ago
Anonymous posted a comment on ticket #19

I'm using Win11 Pro (64 Bit).
2 years ago
Quan Nguyen modified ticket #18

Tesseract upgrade missing text when extracting
2 years ago
Quan Nguyen posted a comment on ticket #19

I remember the 8.3 filename limitation in old DOS or Windows 95 era, but all modern OSes should be able to handle the long filenames. Which Windows version are you seeing the issue in?
3 years ago
Anonymous created ticket #19

LoadLibs.extractTessResources() returns wrong dos style filenames
3 years ago
Quan Nguyen posted a comment on discussion Open Discussion

@Praveen Anand Please use the Lept4J version compatible with your Leptonica installation.
3 years ago
Praveen Anand posted a comment on discussion Open Discussion

@ShawnChen Did this issue get resolved? I'm facing the exact same error.
3 years ago
Synergi posted a comment on discussion Open Discussion

Fixed it. It was an issue with the JNA dependency. I had JNA loaded in another linked project. As a result, it was using the older version instead of the one below. <dependency> <groupId>net.java.dev.jna</groupId> <artifactId>jna</artifactId> <version>5.12.1</version> </dependency>
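A quick way to confirm which jar a class is actually loaded from is to ask its protection domain, which can expose a stale JNA copy pulled in by another project. The following is a minimal sketch using only the JDK; the class name `WhichJar` and the placeholder strings it returns are made up for illustration.

```java
import java.net.URL;
import java.security.CodeSource;

// Hypothetical diagnostic helper; not part of Tess4J or JNA.
final class WhichJar {

    /** Returns the jar or directory a class was loaded from, without initializing it. */
    static String locationOf(String className) {
        try {
            // initialize=false avoids running static initializers (e.g. native loading).
            Class<?> c = Class.forName(className, false, WhichJar.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            if (src == null) {
                return "(JDK/bootstrap class)";
            }
            URL location = src.getLocation();
            return String.valueOf(location);
        } catch (ClassNotFoundException e) {
            return "(not on classpath)";
        }
    }

    public static void main(String[] args) {
        // com.sun.jna.Native is the JNA entry point named in the stack traces in this feed.
        System.out.println(locationOf("com.sun.jna.Native"));
    }
}
```

If the printed location is not the 5.12.1 jar, another dependency is supplying JNA first; `mvn dependency:tree` should show where it comes from.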
3 years ago
Synergi modified a comment on discussion Open Discussion

So, as it appears to me, LoadLibs wants to copy the contents of a folder named linux-x86-64 in the jar file into /tmp/tess4j/linux-x86-64. The issue I see is that the folder linux-x86-64 doesn't appear to exist in the jar file (tess4j-5.5.0.jar). Now, as it's a Linux system, I am guessing it doesn't need this tmp folder... but regardless of this, the code seems to crash. FYI, it seems to execute a similar process with Lept4J and copies over a DLL from a Windows directory in the jar file. I don't think...
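Whether a jar really contains a given platform folder can be checked with the JDK alone. This is a minimal sketch; the class name `JarFolderCheck` is hypothetical, and the jar path and folder name are supplied by the caller, for example tess4j-5.5.0.jar and linux-x86-64 as named above.

```java
import java.io.File;
import java.io.IOException;
import java.util.jar.JarFile;

// Hypothetical helper for inspecting a jar; not part of Tess4J.
final class JarFolderCheck {

    /** Returns true if at least one jar entry lives under the given folder. */
    static boolean hasFolder(File jar, String folder) throws IOException {
        String prefix = folder.endsWith("/") ? folder : folder + "/";
        try (JarFile jf = new JarFile(jar)) {
            return jf.stream().anyMatch(entry -> entry.getName().startsWith(prefix));
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java JarFolderCheck.java <path-to-jar> <folder-name>
        System.out.println(hasFolder(new File(args[0]), args[1]));
    }
}
```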
3 years ago
Synergi modified a comment on discussion Open Discussion

So, as it appears to me, LoadLibs wants to copy the contents of a folder named linux-x86-64 in the jar file into /tmp/tess4j/linux-x86-64. The issue I see is that the folder linux-x86-64 doesn't appear to exist in the jar file (tess4j-5.5.0.jar). Now, as it's a Linux system, I am guessing it doesn't need this tmp folder... but the code seems to crash. FYI, it seems to execute a similar process with Lept4J and copies over a DLL from a Windows directory in the jar file. I don't think it's used, but it allows...
3 years ago
Synergi modified a comment on discussion Open Discussion

I am using tess4j v 5.5.0 (which is supposed to work with Tesseract 5.0.3) via Maven in Java on Linux Ubuntu 20.04.3 LTS (Focal Fossa). The application I am using worked previously using Tess4J with Tesseract 4.1.1. I keep getting errors now when I run the following code :- TessAPI.TessBaseAPI handle = TessAPI.INSTANCE.TessBaseAPICreate(); This always worked in the past but now I get the following error :- Exception in thread "pool-23-thread-1" java.lang.NoClassDefFoundError: Could not initialize...
3 years ago
giuseppe coniglio posted a comment on discussion Help

Hi all, I have implemented a Spring Boot microservice that uses tess4j 4.3.1 and pdfbox 2.0.22 on my Oracle Linux Server (example code: https://colwil.com/how-to-extract-text-from-a-scanned-pdf-using-ocr-in-java/). When I run the code from my IDE on a Windows PC and invoke the local service, execution is fast: "Tesseract.doOcr" takes 8 seconds. But when I call the API that invokes the microservice on the server, "Tesseract.doOcr" is slow, taking 40-50 seconds, with the same PDF file as the parameter. Any idea? Thanks :-)
3 years ago
Quan Nguyen posted a comment on discussion Help

If it was properly installed after being built, a libtesseract.dylib symbolic link would have been created. If not, you can create it manually. This link is what JNA looks for when loading the native library.
3 years ago
Tevž Selčan modified a comment on discussion Help

Does this apply to Mac M1? I compiled tesseract as described here (https://tesseract-ocr.github.io/tessdoc/Compiling.html#macos) and downloaded Tess4J, but I can't find the libtesseract.dylib file in either of them.
3 years ago
Tevž Selčan posted a comment on discussion Help

Does this apply to Mac M1? I once compiled tesseract as described here (https://tesseract-ocr.github.io/tessdoc/Compiling.html#macos) and downloaded Tess4J, but I can't find the libtesseract.dylib file in either of them.
3 years ago
Quan Nguyen posted a comment on discussion Open Discussion

You mean separate physical copies of the training data files? I've seen instances of Tesseract running in multithreaded applications using the same set of training data files.
3 years ago
George posted a comment on discussion Open Discussion

Is it necessary to have separate copies of the Tesseract training data when running multiple instances of Tess4J in separate JVMs?
3 years ago
Quan Nguyen modified a comment on discussion Help

No need to modify the .jar file. Just need to set jna.library.path property to the location of libtesseract.dylib file during launch. https://tess4j.sourceforge.net/tutorial/
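The property can be passed at launch with `-Djna.library.path=<dir>`, or set in code before any Tess4J or JNA class loads the native library. The following is a rough sketch of the in-code variant; the class name `JnaLibraryPath` and the default directory are placeholders for illustration, not values taken from this thread.

```java
import java.io.File;

// Hypothetical helper; the directory is whatever folder holds libtesseract.dylib.
final class JnaLibraryPath {

    /** Prepends dir to jna.library.path and returns the resulting value. */
    static String prepend(String dir) {
        String current = System.getProperty("jna.library.path");
        String updated = (current == null || current.isEmpty())
                ? dir
                : dir + File.pathSeparator + current;
        System.setProperty("jna.library.path", updated);
        return updated;
    }

    public static void main(String[] args) {
        // Call this first, before touching any Tess4J class.
        // "/usr/local/lib" is only a placeholder default.
        System.out.println(prepend(args.length > 0 ? args[0] : "/usr/local/lib"));
    }
}
```

Setting the property after the native library has already been loaded has no effect, so the command-line form is usually the safer choice.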
3 years ago
Tevž Selčan posted a comment on discussion Help

Issue solved, https://stackoverflow.com/questions/21394537/tess4j-unsatisfied-link-error-on-mac-os-x
3 years ago
Tevž Selčan posted a comment on discussion Help

Hello, I have a problem while trying to use Tess4J with Maven. I get this error : Exception in thread "main" java.lang.UnsatisfiedLinkError: Can't load library: /Users/tevzselcan/Library/Caches/JNA/temp/jna1926430164363992306.tmp at java.base/java.lang.ClassLoader.loadLibrary(ClassLoader.java:2393) at java.base/java.lang.Runtime.load0(Runtime.java:755) at java.base/java.lang.System.load(System.java:1953) at com.sun.jna.Native.loadNativeDispatchLibraryFromClasspath(Native.java:1018) at com.sun.jna.Native.loadNativeDispatchLibrary(Native.java:988)...
3 years ago
Quan Nguyen posted a comment on discussion Open Discussion

The bug was fixed in tess4j-5.4.0 release.
3 years ago
Quan Nguyen modified a comment on discussion Open Discussion

The latest source is being hosted at https://github.com/nguyenq/tess4j.
3 years ago
Ralph Cook posted a comment on discussion Open Discussion

I want to put the source for Tess4J into eclipse so I can debug a problem I'm having. The current version of the library appears to be 5.4.0; if I put a dependency for net.sourceforge.tess4j:tess4j:5.4.0 in a Maven pom.xml file and update the project, I get a tess4j-5.4.0.jar. I cannot find source labelled for that version -- the latest I can find after rooting around on tess4j.sourceforge.net is labelled 3.4.8; the sources themselves do not have version numbers in them, so I cannot tell whether...
3 years ago
L Evans posted a comment on discussion Open Discussion

The output OCR documents look good. So, the 1 word count is really misleading. We have conditional logic that follows the createDocumentsWithResults() call that relies on the size of the Words list in the OCRResult.
3 years ago
Quan Nguyen posted a comment on discussion Open Discussion

What about the output documents (files) themselves? Can you put in a new issue at https://github.com/nguyenq/tess4j/issues ? Thanks.
3 years ago
L Evans posted a comment on discussion Open Discussion

We've encountered a bug when calling createDocumentsWithResults() from Tesseract/tess4j 4.5.5. The Tiff scanned by the method call, has 32 pages, and ~3100 words. Yet, the result produced by the Java call only contains the result of the last page scanned. The OCRResult, in Java, is an empty string in this bounding box: [ [Confidence: 95.000000 Bounding box: 313 434 938 822]], which is the same result when scanning the last page of the Tiff file. Can the Tess4j team investigate this bug ?
3 years ago
Xunnozza Vlinx Xenx posted a comment on discussion Open Discussion

Hello, can you please remove the unnecessary dependency jai-imageio-core:1.4.0? Its last update was over 4 years ago. Also, I see (sorry if I missed something) that this library is only used for TIFF metadata, which is also possible with the Java 11 API. Therefore I recommend removing this dependency and using the new Java API.
3 years ago
Quan Nguyen posted a comment on discussion Open Discussion

https://github.com/nguyenq/tess4j/issues/230
3 years ago
Peter Kronenberg posted a comment on discussion Open Discussion

I'm seeing an error in the ImageDeskew routine. The sample code below shows a rotation of -6.8 (the unredacted version shows -10) on the attached file, even though it should be 0. Any idea why it's not calculating correctly? It seems to happen on somewhat sparse images like this, which understandably makes it harder to figure out the orientation. I'm wondering if anything can be done to make it more accurate. public class GetAngle { private static double getAngle(Path sourceFile) throws IOException {...
4 years ago
Moritz Weibold posted a comment on discussion Help

Hey there, I am using Tess4J to extract the sum of a bill. My Maven Quarkus server works perfectly fine on localhost in IntelliJ. After running the following command, I always push the target/quarkus-app/ folder onto my Oracle VM: mvn clean build And as soon as the folder is uploaded, I run: java -jar server/quarkus-run.jar & The issue is that on my Oracle VM the server suddenly stops at the tesseract.doOCR(tempFile) function. There is no error or any hint as to why it is not working. The server also...
4 years ago
Quan Nguyen posted a comment on ticket #18

You may want to put in a ticket at https://github.com/tesseract-ocr/tesseract/issues site. Thanks.
4 years ago
Anonymous posted a comment on ticket #17

Thanks a lot
4 years ago
Anonymous created ticket #18

Tesseract upgrade missing text when extracting
4 years ago
Quan Nguyen posted a comment on discussion Help

JNA is looking for a libtesseract.dylib to load. Do you have it in the system path? Several developers were able to use the library on macOS. Please search through the forum posts.
4 years ago
Ben posted a comment on discussion Help

Hi, I have tried to get it to work so many times but it still is not working. I added the dependency to my maven and then wrote the code following instructions. I'm not sure why it is not working. Could someone help? Thanks!
4 years ago
Quan Nguyen posted a comment on ticket #17

Yes, tess4j-4.6.1.
4 years ago
Anonymous posted a comment on ticket #17

Do we have this fixed for Tess4J that will work with Tesseract 4.1.1?
4 years ago
Anantha posted a comment on ticket #17

Thank you for the fix!
4 years ago
Quan Nguyen modified ticket #17

Security - log4j2 vulnerability - Tess4J using old version(1.2.17) of log4j which needs upgrade to 2.17.1
4 years ago
Quan Nguyen modified a comment on ticket #17

5.1.1 has been released with ghost4j dependency removed. Thank you for bringing this issue to our attention.
4 years ago
Quan Nguyen posted a comment on ticket #17

5.1.1 has been released with ghost4j dependency removed.
4 years ago
Quan Nguyen posted a comment on ticket #17

If vulnerabilities exist in ghost4j library, that's beyond our control. We can elect to remove ghost4j dependency from tess4j.
4 years ago
Anantha posted a comment on ticket #17

Upon upgrading tess4j to latest version(5.1.0) , we could still see log4j 1.2.17 dependency coming from ghost4j, could you please check Attached is the screenshot for reference
4 years ago
Quan Nguyen posted a comment on ticket #17

According to Apache Log4j Security Vulnerabilities, Log4j 1.x is not impacted by this vulnerability. Latest versions of tess4j do not have log4j dependency.
4 years ago
Quan Nguyen posted a comment on discussion Help

I suggest that you clone the github repository, switch to tess4j-3 branch, study and execute the unit tests in your IDE, and go from there. You may want to start out with the simple example first to ensure that the library and its dependencies are set up correctly before going further with more complicated codes.
4 years ago
Anantha created ticket #17

Security - log4j2 vulnerability - Tess4J using old version(1.2.17) of log4j which needs upgrade to 2.17.1
4 years ago
Kehinde Adeoya posted a comment on discussion Help

Thanks. I have switched to tesseract 3.0.5 but I'm still getting the same error. Could you help figure this out by scheduling a Zoom call? Please let me know when it's convenient by you. I am using tesseract 3.0.5.2 and Tess4j- 3.5.0, lept4j-1.13.0, and jna-5.10.0 This is the error I got this morning # # A fatal error has been detected by the Java Runtime Environment: # # SIGSEGV (0xb) at pc=0x000000012f453b6f, pid=33004, tid=9987 # # JRE version: OpenJDK Runtime Environment Homebrew (11.0.12) (build...
4 years ago
Quan Nguyen posted a comment on discussion Help

Your dependency versions look correct. https://mvnrepository.com/artifact/net.sourceforge.tess4j/tess4j/5.0.0 If the simple example works right, that means jna/tess4j/lept4j are working properly with your tesseract/leptonica installation. That suggests something is not working correctly in your application code. Look at the test cases in tess4j project for examples: https://github.com/nguyenq/tess4j As mentioned in Issue 1074, the font info was only available in tesseract 3.x.
4 years ago
Kehinde Adeoya modified a comment on discussion Help

Thanks for your support. The simple app works fine. The font info is what I want to obtain at the moment; that is the reason for all this hassle. I am presently using this combination of libraries. I see your point after reading the GitHub link on this issue, but I think it has been 5 years since that was published. Is there any current update from Tesseract on it? tesseract 5.0.0-29-g727796 leptonica-1.82.0 libgif 5.2.1 : libjpeg 9d : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.1 : libopenjp2 2.4.0 Found...
4 years ago
Quan Nguyen posted a comment on discussion Help

What's the output of executing tesseract -v in the terminal? Make sure you use the Java library versions that match your native ones. I suggest you try a simple example first. http://tess4j.sourceforge.net/codesample.html If you want to obtain font info, I don't think the feature is available in Tesseract 4 and 5. https://github.com/tesseract-ocr/tesseract/issues/1074
4 years ago
Quan Nguyen posted a comment on discussion Open Discussion

No. The program will convert the input PDF to a multi-page TIFF image. What you can do is process the PDF before the OCR step, probably use PDFBox to extract a specified page, then convert that page to an image, and send it to tesseract engine.
4 years ago
Kehinde Adeoya posted a comment on discussion Help

This is the error I'm getting # # A fatal error has been detected by the Java Runtime Environment: # # SIGSEGV (0xb) at pc=0x000000010c005e9d, pid=66743, tid=9475 # # JRE version: Java(TM) SE Runtime Environment (17.0.1+12) (build 17.0.1+12-LTS-39) # Java VM: Java HotSpot(TM) 64-Bit Server VM (17.0.1+12-LTS-39, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, bsd-amd64) # Problematic frame: # C [libtesseract.dylib+0x5e9d] tesseract::TessBaseAPI::Init(char const*, int, char...
4 years ago
Kehinde Adeoya posted a comment on discussion Help

@nguyenq Can I have an answer to this, please? I'm worn out from trying to resolve a single problem for over a week.
4 years ago
Kehinde Adeoya modified a comment on discussion Help

I posted earlier in the wrong forum. I have tried to repost in the HELP forum, but it seems there's no way to edit and switch forums once a post has been submitted. This is a cry for help. I am worn out from trying to resolve this problem for over a week. It's a simple installation and setup of Tesseract and Tess4J on macOS Monterey. I have followed all the docs available, but none could resolve the issue. I hope I can find the right help here. I am trying to get the text/font/style properties of an image. i...
4 years ago
Alfonso Vizcaino modified a comment on discussion Open Discussion

Hello, when using PDF files with multiple pages, is there a way to specify which page I want to OCR? Thanks
4 years ago
Quan Nguyen posted a comment on discussion Help

In Tess4J, PDF documents are converted to grayscale images by Ghostscript or PDFBox before feeding to Tesseract OCR engine. You can do your own conversion of PDF files before the OCR processing.
4 years ago
John Mc.Queide Clemente modified a comment on discussion Help

I am doing OCR on a PDF file, but the resulting PDF file loses its color. Am I doing something wrong? That doesn't happen when my input file is a PNG file. This is my code snippet: public class OcrServiceImpl implements OcrService { @Override public void doOcr(String inputPath, String outputPath) { try { List<ITesseract.RenderedFormat> renderList = new ArrayList<>(); renderList.add(ITesseract.RenderedFormat.PDF); Tesseract tesseract = new Tesseract(); tesseract.setOcrEngineMode(0); tesseract.setDatapath("C:\\Program...
4 years ago
Quan Nguyen modified a comment on discussion Help

MS Document formats are not supported. The library can only produce the output formats that Tesseract supports.
4 years ago
Quan Nguyen posted a comment on discussion Help

MS Word format is not supported. The library can only produce the output formats that Tesseract supports.
4 years ago
Quan Nguyen posted a comment on ticket #4

Please continue the discussion either in the Discussion section or over on GitHub site rather than on this old, closed ticket. Thanks.
4 years ago
Peter Kronenberg modified a comment on ticket #4

I see TessBaseAPIAllWordConfidences, which says that it returns the same number of values as that returned by GetUTF8. But TessBaseAPIGetUTF8Text returns a single string, not an array. Can you provide an example? I've read the Javadoc, but it's not always clear without an example. Is there an efficient way to process multiple images, but one at a time, without sending them all in as an array. TessBaseAPIAllWordConfidences() doesn't seem to work with doOCR(), because doOCR() closes everything down...
4 years ago
Peter Kronenberg posted a comment on ticket #4

I see TessBaseAPIAllWordConfidences, which says that it returns the same number of values as that returned by GetUTF8. But TessBaseAPIGetUTF8Text returns a single string, not an array. Can you provide an example? I've read the Javadoc, but it's not always clear without an example. Is there an efficient way to process multiple images, but one at a time, without sending them all in as an array
4 years ago
Quan Nguyen posted a comment on ticket #4

Documentation: http://tess4j.sourceforge.net/docs/docs-4.4/ You can pass in a List<IIOImage> to doOCR method. There are other methods in Tesseract class that returns confidence values. JNA Direct Mapping: https://github.com/java-native-access/jna/blob/master/www/DirectMapping.md
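Building the `List<IIOImage>` mentioned above needs only JDK classes. The following is a small sketch; the class name `PageList` is made up, and how the resulting list is handed to `doOCR` (including any extra arguments) should be checked against the Javadoc linked above for the Tess4J version in use.

```java
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.List;
import javax.imageio.IIOImage;

// Hypothetical helper; only wraps images, it does not call Tess4J itself.
final class PageList {

    /** Wraps each page image in an IIOImage with no thumbnails and no metadata. */
    static List<IIOImage> wrap(List<BufferedImage> pages) {
        List<IIOImage> out = new ArrayList<>(pages.size());
        for (BufferedImage page : pages) {
            out.add(new IIOImage(page, null, null));
        }
        return out;
    }
}
```

The returned list keeps the pages in their original order, so per-page results can be matched back by index.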
4 years ago
Anonymous posted a comment on ticket #4

I know this issue is years old, but I'm wondering what the current 'best' way to get the confidences is. Like others, I am also confused by the difference between Tesseract vs Tesseract1 and TessAPI vs TessAPI1. I see what you said about doOcr() being intended for a single image because it shuts down after processing. What is the best way to process multiple images? Is there any documentation on the best way to do this (as well as getting the confidences)? Thank you
4 years ago
Peter Kronenberg posted a comment on ticket #4

I just entered that last post, but I wasn't logged in.
4 years ago
sriKrishnaKumar posted a comment on discussion Help

Hello Team, I am looking to develop an internal application to convert an image format to searchable PDF and then searchable PDF to a Microsoft document format, or directly from an image format to a Microsoft document format. Does Tess4J, along with other libraries, support this requirement? I know we can use Tess4J to convert an image to searchable PDF. Any suggestions are welcome.
4 years ago
Quan Nguyen posted a comment on discussion Help

The Leptonica API method seems to have changed over the years after several versions. http://tess4j.sourceforge.net/docs/lept4j-docs-1.10.0/net/sourceforge/lept4j/Leptonica1.html http://tess4j.sourceforge.net/docs/lept4j-docs-1.14.0/net/sourceforge/lept4j/Leptonica1.html#pixaaDisplayByPixa(net.sourceforge.lept4j.Pixaa,int,float,int,int,int)
4 years ago
Jeremy Young posted a comment on discussion Help

IntelliJ is telling me that the parameters for pixaaDisplayByPixa are different from the documentation. Have I done something wrong? If not, is there a workaround? Thx