Package ‘tesseract’

March 23, 2025

Type Package
Title Open Source OCR Engine
Version 5.2.3
Description Bindings to 'Tesseract':
a powerful optical character recognition (OCR) engine that supports over 100 languages.
The engine is highly configurable in order to tune the detection algorithms and
obtain the best possible results.
License Apache License 2.0

URL https://docs.ropensci.org/tesseract/
https://ropensci.r-universe.dev/tesseract

BugReports https://github.com/ropensci/tesseract/issues
SystemRequirements Tesseract >= 3.03 (libtesseract-dev /
tesseract-devel) and Leptonica (libleptonica-dev /
leptonica-devel). On Debian you need to install the English
training data separately (tesseract-ocr-eng)
Imports Rcpp (>= 0.12.12), pdftools (>= 1.5), curl, rappdirs, digest
LinkingTo Rcpp
RoxygenNote 7.3.2
Suggests magick (>= 1.7), spelling, knitr, tibble, rmarkdown
Encoding UTF-8
VignetteBuilder knitr
Language en-US
NeedsCompilation yes
Author Jeroen Ooms [aut, cre] (<https://orcid.org/0000-0002-4035-0289>)
Maintainer Jeroen Ooms <jeroenooms@gmail.com>
Repository CRAN
Date/Publication 2025-03-23 14:50:01 UTC


Contents
ocr
tesseract
tesseract_download

ocr Tesseract OCR

Description

Extract text from an image. Requires that you have training data for the language you are reading.
Works best for images with high contrast, little noise and horizontal text. See tesseract wiki and our
package vignette for image preprocessing tips.

Usage

ocr(image, engine = tesseract("eng"), HOCR = FALSE)

ocr_data(image, engine = tesseract("eng"))

Arguments

image file path, url, or raw vector to image (png, tiff, jpeg, etc)
engine a tesseract engine created with tesseract(). Alternatively a language string
which will be passed to tesseract().
HOCR if TRUE return results as HOCR xml instead of plain text

Details

The ocr() function returns plain text by default, or hOCR XML if HOCR is set to TRUE. The
ocr_data() function returns a data frame with a confidence rate and bounding box for each word
in the text.
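
As a rough illustration of working with the result of ocr_data(), the sketch below splits the bounding box into numeric columns and keeps only high-confidence words. It assumes the data frame has the columns word, confidence and bbox, with bbox stored as a comma-separated "x1,y1,x2,y2" string; the sample rows, the threshold of 90 and the helper name split_bbox are made up for illustration and are not part of the package.

```r
# Hypothetical stand-in for the result of ocr_data(); all values are invented.
words <- data.frame(
  word = c("This", "is", "a", "test"),
  confidence = c(96.6, 95.1, 42.0, 91.3),
  bbox = c("36,92,96,116", "109,92,129,116", "141,98,155,116", "169,92,201,116"),
  stringsAsFactors = FALSE
)

# Split "x1,y1,x2,y2" strings into four numeric columns (hypothetical helper).
split_bbox <- function(df) {
  parts <- strsplit(df$bbox, ",", fixed = TRUE)
  coords <- do.call(rbind, lapply(parts, as.numeric))
  colnames(coords) <- c("x1", "y1", "x2", "y2")
  cbind(df[c("word", "confidence")], as.data.frame(coords))
}

boxes <- split_bbox(words)

# Keep only words at or above an arbitrary confidence threshold.
confident <- boxes[boxes$confidence >= 90, ]
print(confident)
```

With real input, words would be replaced by the value returned from ocr_data(image); a suitable threshold depends on the image quality.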

References

Tesseract: Improving Quality

See Also

Other tesseract: tesseract(), tesseract_download()


Examples

# Simple example
text <- ocr("https://jeroen.github.io/images/testocr.png")
cat(text)

# Return the result as hOCR XML instead of plain text
xml <- ocr("https://jeroen.github.io/images/testocr.png", HOCR = TRUE)
cat(xml)

# Get a data frame with a confidence rate and bounding box for each word
df <- ocr_data("https://jeroen.github.io/images/testocr.png")
print(df)

# Full roundtrip test: render PDF to image and OCR it back to text
curl::curl_download("https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf", "R-intro.pdf")
orig <- pdftools::pdf_text("R-intro.pdf")[1]

# Render the first page of the PDF to a TIFF image
img_file <- pdftools::pdf_convert("R-intro.pdf", format = 'tiff', pages = 1, dpi = 400)
unlink("R-intro.pdf")

# Extract text from the TIFF image
text <- ocr(img_file)
unlink(img_file)
cat(text)

# Create an engine restricted to the digits 0-9
engine <- tesseract(options = list(tessedit_char_whitelist = "0123456789"))

tesseract Tesseract Engine

Description
Create an OCR engine for a given language and control parameters. This can be used by the ocr
and ocr_data functions to recognize text.

Usage
tesseract(
  language = "eng",
  datapath = NULL,
  configs = NULL,
  options = NULL,
  cache = TRUE
)

tesseract_params(filter = "")

tesseract_info()

Arguments
language string with language for training data. Usually defaults to eng
datapath path with the training data for this language. Default uses the system library.
configs character vector with files, each containing one or more parameter values. These
config files can exist in the current directory or one of the standard tesseract
config files that live in the tessdata directory. See details.
options a named list with tesseract parameters. See details.
cache speed things up by caching engines
filter only list parameters containing a particular string

Details
Tesseract control parameters can be set either via a named list in the options parameter, or in a
plain-text config file which contains the parameter name followed by a space and then the value, one
per line. Use tesseract_params() to list or find parameters. Note that some parameters are
only supported in certain versions of libtesseract, and that invalid parameters can sometimes cause
libtesseract to crash.
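
The two ways of setting parameters might look roughly like the sketch below. The parameter tessedit_char_whitelist is taken from the ocr examples; the file name digits.conf and its location in a temporary directory are hypothetical, and passing a full path in configs is an assumption rather than documented behaviour. The engines are only created if the tesseract package is installed, and any error is reported instead of stopping the script.

```r
# Way 1: a named list, to be passed as tesseract(options = ...).
opts <- list(tessedit_char_whitelist = "0123456789")

# Way 2: the same setting as a config file, one "name value" pair per line.
# The file name and temporary location are hypothetical.
config_file <- file.path(tempdir(), "digits.conf")
config_lines <- paste(names(opts), unlist(opts))
writeLines(config_lines, config_file)

# Show what was written to the config file.
cat(readLines(config_file), sep = "\n")

# Build the engines only when the package (and its training data) is available.
if (requireNamespace("tesseract", quietly = TRUE)) {
  engines <- tryCatch(
    list(
      from_list = tesseract::tesseract(options = opts),
      from_file = tesseract::tesseract(configs = config_file)
    ),
    error = function(e) message("Could not create engine: ", conditionMessage(e))
  )
}
```

A named list is convenient for one-off settings, while a config file is easier to share between scripts.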

See Also
Other tesseract: ocr(), tesseract_download()

Examples
tesseract_params('debug')

tesseract_download Tesseract Training Data

Description
Helper function to download training data from the official tessdata repository. On Linux, the fast
training data can be installed directly with yum or apt-get.

Usage
tesseract_download(
  lang,
  datapath = NULL,
  model = c("fast", "best"),
  progress = interactive()
)

Arguments
lang three letter code for language, see tessdata repository.
datapath destination directory in which to store the downloaded file
model either fast or best is currently supported. The latter downloads more accurate
(but slower) trained models for Tesseract 4.0 or higher
progress print progress while downloading

Details
Tesseract uses training data to perform OCR. Most systems default to English training data. To
improve OCR performance for other languages you can install the training data from your
distribution. For example, to install the Spanish training data:

• tesseract-ocr-spa (Debian, Ubuntu)

• tesseract-langpack-spa (Fedora, EPEL)

On Windows and macOS you can install languages using the tesseract_download function, which
downloads training data directly from GitHub and stores it in the path on disk given by the
TESSDATA_PREFIX variable.

References
tesseract wiki: training data

See Also
Other tesseract: ocr(), tesseract()

Examples
## Not run:
if(is.na(match("fra", tesseract_info()$available)))
  tesseract_download("fra", model = 'best')
french <- tesseract("fra")
text <- ocr("https://jeroen.github.io/images/french_text.png", engine = french)
cat(text)

## End(Not run)
