Development of an End-to-End Pipeline for Custom Key-Value Extraction from Commercial Invoices

by

Abhishek Mohan

Department of Electrical Engineering and Computer Science
January 18, 2023

Certified by: Amar Gupta, Research Scientist, Thesis Supervisor
Accepted by: Katrina LaCurts, Chair, Master of Engineering Thesis Committee
Abstract
Inefficiencies in manual extraction of information from business documents have resulted in the development of automated processing solutions. Within the scope of business documents, commercial invoices present additional complexities due to the diversity of document layouts and the variation in quality of scanned documents. Commercially available solutions have been built to perform invoice extraction, yet they do not provide flexibility in accomplishing tasks unique to a particular dataset and its associated complications. Using sample documents provided by a leading electronic component distributor, we researched different approaches capable of extracting key-value information from a complex dataset of invoices. The thesis provides a detailed look into the development of a highly accurate, end-to-end data pipeline accomplishing this task. A multi-module approach integrating image processing, optical character recognition, custom algorithms, and machine learning-based matching was built and compartmentalized into continuous stages, allowing for effective and efficient key-value extraction of information from invoice documents. In conjunction with an intuitive web interface, the custom pipeline provides a solution with strong performance and the flexibility to be generalized for extraction of additional business documents in future efforts.
Acknowledgments

First and foremost, I would like to thank my thesis advisor, Dr. Amar Gupta, for providing me with the opportunity to join his research group and work within such an interesting space. His support and mentorship throughout this project have been an invaluable part of my MEng experience, and I look forward to taking what I have learned from him into my future endeavors.

Next, I would like to thank all of the student members of the Gupta Research Group who have supported the project in some capacity from the beginning: Pierce Lai, Steve Kim, Samuel Lee, Victor Chu, Prabhakar Kafle, and Haimoshri Dali. Their contributions all helped move the project forward, and I am very grateful for the experience of leading our project team.

I would also like to thank Justin Mintz and Bert Love, the main representatives from Arrow Electronics with whom my project team worked closely. Their correspondence and cooperation ensured the rapid progress that was made on the project.

Last but not least, I would like to thank my family and friends, who have been a critical part of my successful completion of the MEng program. Their love and support continue to be a great source of inspiration.
Contents

1 Introduction
1.1 Document Processing Overview
1.2 Invoice Documents Overview
1.3 A Need for Customized Extraction
1.4 Identified Objectives

2 Related Work
2.1 Document-based Datasets
2.2 Image Processing
2.3 Document Layout Models

3 General Methodology
3.1 Invoice Dataset
3.2 Pipeline Architecture
3.3 Web Application Interface
3.4 Internal Dependencies

5 Postprocessing Module
5.1 General Overview
5.2 Levenshtein Distance Overview
5.3 Levenshtein Distance Customization
5.4 Word Bank Construction
5.5 Additional Methods

7 Document Tabulation
7.1 General Overview
7.2 Algorithm Development
7.3 Algorithm Improvement

8 Final Evaluation
8.1 Preprocessing, OCR, and Postprocessing Evaluation
8.2 Machine Learning Model Evaluation
8.3 Comprehensive Pipeline Evaluation
8.4 Alternative Pipeline Comparison

9 Conclusion
9.1 Summary of Contributions
9.2 Future Work
List of Figures

6-1 Example of key-value pairs following results from the Key-Value Matching Module.
6-2 Diagram of the pipeline’s ML Model component.
6-3 Architecture of the LayoutLM model.
6-4 Architecture of the DONUT model.
List of Tables

6.1 Summarized results of the DONUT model on the Arrow dataset.
6.2 Individual results of the DONUT model on the Arrow dataset.
8.1 Summarized results from the Preprocessing, OCR, and Postprocessing Evaluation.
8.2 Individual results from the Preprocessing, OCR, and Postprocessing Evaluation.
Chapter 1
Introduction
While automated processing offers higher efficiency compared to traditional hand processing, many difficulties make automation challenging [6, 7]; these include noise, varying locations of important information from document to document, and unclearly printed characters. Hence, there is a need for custom pipeline solutions to be built around particular datasets on an as-needed basis.
The central need was the development of a pipeline that can process and extract key-value information from their provided collection of invoices; in other words, one that identifies structured pairs within their unstructured data.
Since large companies handle many document types, the desired solution for extracting relevant information would require compatibility with Arrow’s diverse dataset. Most processing techniques require document customization, and the algorithms themselves must be tuned for each document format [13]. In the domain of tables as found in invoices, while creating a system to extract key-value pairs from a specific format usually does not pose significant difficulty, creating a single approach that achieves high accuracy on many different tabular layouts is quite challenging [14].
Despite the existence of commercially available pipelines to extract key-value information, most approaches are tailored to a specific type of document or a certain document format [15]. Components such as dark background colors and light foreground colors, as well as shading and background images, increase document complexity [16]. This further elevates the difficulty of using ready-made techniques to read, process, and match extracted information. The task of building a custom pipeline for key-value extraction, both efficiently and accurately, was accordingly shown to be one requiring deliberate preparation.
Preprocessing techniques would need to be selected from options such as binarization, thresholding, deskewing, and denoising. The challenge with this process is knowing which combination of techniques to use for each document, and the degree to which they should be applied. Then, an OCR engine for the pipeline would also need to be identified. The OCR component would allow for the extraction of texts and their relative locations within each of the documents.
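For illustration, the following is a minimal sketch (not the thesis’s implementation) of how an OCR engine can return both texts and their relative locations, here using the open-source Pytesseract wrapper that is evaluated later in the thesis; the image path is a placeholder.

```python
import pytesseract
from PIL import Image

# Run OCR on a (preprocessed) invoice page and collect each detected
# word together with its bounding box. "invoice_page.png" is a
# placeholder path.
page = Image.open("invoice_page.png")
data = pytesseract.image_to_data(page, output_type=pytesseract.Output.DICT)

words = []
for text, x, y, w, h in zip(data["text"], data["left"], data["top"],
                            data["width"], data["height"]):
    if text.strip():  # skip empty detections
        words.append((text, (x, y, x + w, y + h)))
```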
We intended to complete additional work to make the pipeline easily usable, primarily designing and creating a web interface that would allow users to run the pipeline on invoice documents from their local system. This work could also include developing a tabulation method that could convert the matched key-value results into a format that can be visually identified on a processed invoice document, after which it could be displayed on the developed web interface.
Chapter 2
Related Work
Each of the following topics is discussed in more detail throughout the thesis, but a variety of previous research studies and background information adjacent to the topics is presented here. An initial understanding of these items better prepares us for how the pipeline can be built, in addition to unique approaches that can be utilized.
A document-based dataset could help explore fine-tuning the model(s) applied in the pipeline, with advanced models having shown effectiveness. Additionally, use of such a dataset could improve the selected model’s ability to handle dynamic tabular data in the future: key-value pairs in which certain keys may not have been seen within the current Arrow dataset.

The “Scanned Receipts OCR and Information Extraction” (SROIE) dataset provides scanned receipts that are generally low quality [20]. It consists of six hundred receipts in the training set and four hundred receipts in the test set, with four possible keys: company, address, date, and total. The dataset was used for three competition tasks in the study (text localization, OCR, key-information extraction), making it a useful point of comparison for the Arrow dataset.
Optical character recognition (OCR) identifies printed or handwritten text within an image file or scanned document, after which the extracted material is converted into a machine-readable format [27]. The extracted results can be used for a variety of data processing applications, and OCR is primarily used to improve the efficiency of processing documents, eliminating the need for manual human entry.

We accordingly investigated studies discussing previously applied OCR engines. Tesseract is an engine shown to take less than a second to extract information from an image [28]. On a dataset of license plate numbers, Tesseract had an accuracy of about 70% and performed better on grayscale compared to color images, which is important to note as the Arrow dataset primarily consisted of images with no color. Google Cloud Vision is another noteworthy engine that is mainly used for image classification, for which a previous study found that it is not robust to noise; adding random noise to images (e.g., about 20% random colored dots) altered image classifications [29]. However, the type of noise added is not the same as the typical noise in the Arrow dataset, which contained less randomly distributed noise and more specks, stray lines, and skewing.
2.3 Document Layout Models
When matching extracted words (values) to their corresponding keys, there are two approaches: the use of deterministic algorithms or the application of a model trained on a dataset [30]. Custom algorithms have shown applicability for text extraction purposes, and can be built for the pipeline based upon keys whose expected content is known exactly [31, 32]. However, more advanced methods must be used for the majority of keys in the Arrow dataset. Machine learning models, specifically document layout models, provide the ability to match key-value pairs from a complex set of documents, and hence should also be incorporated into the pipeline [33].
LayoutLM is an example of a model that can jointly learn text with document layout information, and it achieves state-of-the-art results on multiple datasets such as SROIE and FUNSD [33]. Built off of the BERT model, which uses text and position embeddings, LayoutLM integrates 2-D position embeddings and image embeddings. The model outperforms other powerful models for objectives such as Masked Visual-Language Modeling (MVLM) and Multi-label Document Classification (MDC). Next-generation models such as BROS and Docformer also indicate some potential for use within the pipeline [34, 35].
Some available solutions provide an existing end-to-end model that would eliminate the need for building individual modules for the pipeline. Many methods outsource the job of OCR to off-the-shelf engines, which can be costly and inflexible, and which propagate OCR errors throughout the pipeline [36]. Models such as DONUT eliminate the need for this outsourcing while achieving strong results on datasets such as CORD and DocVQA, and they are also faster, demonstrating potential for use within the pipeline.
Chapter 3
General Methodology
Figure 3-1: Example of an invoice document from the Arrow Electronics dataset.
An example of a document is shown in Figure 3-1 (some information is redacted for confidentiality reasons). From the tabular format, visual assessment of the example reveals the presence of desired keys such as the invoice date, the company providing the invoice, and the waybill number. The pipeline, in its final form, should be able to automatically and efficiently extract such keys and their corresponding values. Each of the documents includes a variety of possible keys, from which a comprehensive set of keys for extraction was identified.
Unlike standardized datasets such as the SROIE dataset, however, the Arrow dataset lacked quality control [20]. Hence, techniques to improve data quality within the pipeline’s preprocessing component, such as deskewing and denoising, were deemed necessary [42]. Additionally, when provided to us, less than one third of the initial dataset corresponded to any lines in the master ground-truth spreadsheet, and of all the lines in the spreadsheets, only about 6.6% corresponded to any invoice PDFs. Since training via the pipeline would require being able to link invoice documents with ground-truth data, the data on which the pipeline could be trained was only a small fraction of the full dataset.
To ensure that the dataset would be compatible with the pipeline, instances of error with respect to matching against the ground truth were corrected through correspondence with Arrow Electronics. The corrected dataset consisted of ∼440 invoice documents from ∼50 distinct companies, corresponding to ∼2,200 lines within the ground-truth spreadsheet. This dataset continued to grow throughout the development process, with similar inconsistencies between the invoice documents and ground-truth spreadsheets being identified and corrected as the dataset grew. Ultimately, the comprehensive dataset used during the development process contained ∼18,000 identified key-value pairs.
This structure is visually detailed in Figure 3-2, where each small box represents an internal component, script, or data file: black boxes represent data files; blue boxes represent regular document processing components; green boxes represent components associated with key-value matching; and purple boxes represent pipeline evaluation scripts.

The architecture of the proposed pipeline is not fully linear. The second half of the pipeline, containing the machine learning and algorithmic extraction components, forms two distinct branches, after which their outputs are combined. This ensured that the pipeline would not have any circular dependencies, and hence no cycles would form and cause disruption, a challenge found when managing data pipelines [41].
The web application displayed each processed document with its extracted key-value pairs presented on the right (some information is redacted for confidentiality reasons). Some of the UI’s features were:

• The web application could run both halves of the pipeline on one or multiple input PDFs
• Just as a progress bar shows progress within a terminal, we integrated a progress bar into the web application so users could see approximately how close the pipeline was to completing its run
Future modifications could allow us to automatically extract key-value pairs from commercial documents other than invoices.

To facilitate such changes, we documented the pipeline’s internal dependency structure by developing architecture diagrams and an extensive documentation report. The report is not included in the thesis, but it provides internal implementation details of the many individual modules, scripts, and files forming the pipeline. Figures 3-4, 3-5, and 3-6 are diagrams illustrating the internal dependency relationships responsible for image processing, the machine learning-related components, and the web application. Using these diagrams, we can easily and visually identify which file or component to modify when adding new features in the future, and they provide greater clarity as to what the internal dependencies of the pipeline look like.
Figure 3-4: Diagram illustrating the dependencies of internal files responsible for the
image processing of invoice documents.
Figure 3-5: Diagram illustrating the dependencies of internal files responsible for the
machine learning components.
Figure 3-6: Diagram illustrating the dependencies of internal files responsible for the
web application.
Chapter 4
The preprocessing module considered six potential techniques (a sketch follows the list):

• Resize: Rescales an image, as images with a low DPI (dots per inch) tend to result in decreased readability
• Binarize: Converts an image to consist of only black and white pixels, increasing contrast within images and allowing for text to better stand out from the background
• Denoise: Removes minor pepper noise (specks and small impurities from the scanning process)
• Sharpen: Sharpens edges and text within an image, which allows for text to better stand out
• Dilate: Increases the area taken up by elements within an image, which can help fill in degraded or missing parts of individual characters
• Deskew: Corrects an image’s orientation by rotating it so that text lines are horizontal
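The sketch below illustrates these operations with OpenCV. It is a minimal example under assumed parameter values (kernel sizes, scale factor), not the pipeline’s exact implementation.

```python
import cv2
import numpy as np

def preprocess(path, scale=2.0):
    """Apply the candidate preprocessing techniques to a scanned page."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    # Resize: upscale low-DPI scans to improve readability
    img = cv2.resize(img, None, fx=scale, fy=scale,
                     interpolation=cv2.INTER_CUBIC)
    # Denoise: remove pepper noise (specks from the scanning process)
    img = cv2.medianBlur(img, 3)
    # Sharpen: emphasize edges so text stands out from the background
    kernel = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]])
    img = cv2.filter2D(img, -1, kernel)
    # Binarize: Otsu thresholding to pure black-and-white pixels
    _, img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Dilate dark strokes: since text is dark on a light background,
    # eroding the white regions thickens degraded characters
    img = cv2.erode(img, np.ones((2, 2), np.uint8), iterations=1)
    return img
```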
Figure 4-2: Example of bounding boxes returned from an OCR API.
Hence, we decided to test between our two considered options. Pytesseract was
quite slow when we ran it on a large number of documents, as shown in Figure 4-3.
Even with multithreading, it took ∼60 minutes to run Pytesseract on a subset of ∼230
invoice PDFs. Running the same dataset using GCV’s service with multithreading
took ∼9 minutes, providing a ∼6x increase in speed.
Figure 4-3: Runtime comparison between Pytesseract and GCV OCR engines.
The original preprocessing module with Pytesseract was constructed with three of the six potential preprocessing techniques: denoising (removing impurities from an image), sharpening (sharpening an image’s edges to improve “reading” of text), and deskewing (fixing an image’s orientation). Other preprocessing methods, such as binarization and dilation, were also tested, but this set of three worked best in the pipeline. GCV was much more robust than Pytesseract, meaning that some of these preprocessing techniques could be removed, simplifying the preprocessing step.
With GCV, the pipeline’s preprocessing module consisted of only the deskewing technique, as denoising and sharpening did not improve GCV’s overall performance and would hence be unnecessary. Deskewing, however, remained necessary, as the provided documents varied in orientation and skew. Additionally, we found that GCV was overall more accurate than Pytesseract for our dataset, as it detected more words and made fewer mistakes. Based on these observations, and GCV being recognized as one of the most accurate and robust options available, it was selected as the pipeline’s OCR engine [48].
With regard to the use of multithreading within the pipeline, integrating the Python package “multiprocessing” improved performance. This package allowed the pipeline to utilize multiple cores and improved processing speed by ∼5x (depending on the number of cores used). As of now, only the first half of the pipeline is multithreaded, while the second half is not.
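A minimal sketch of this parallelization pattern follows; process_invoice is a hypothetical stand-in for the pipeline’s per-document entry point, not its actual name.

```python
from multiprocessing import Pool
from pathlib import Path

def process_invoice(pdf_path):
    # Placeholder for the per-document work of the pipeline's first half:
    # preprocess -> OCR -> postprocess for a single PDF.
    ...
    return pdf_path

if __name__ == "__main__":
    pdfs = sorted(Path("invoices").glob("*.pdf"))
    with Pool() as pool:  # defaults to one worker process per CPU core
        results = pool.map(process_invoice, pdfs)
```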
Chapter 5
Postprocessing Module
Figure 5-1: Complete equation for calculating the true Levenshtein distance.
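For reference, the standard unit-cost Levenshtein recurrence that Figure 5-1 presents can be written as follows; the customized weights discussed in Section 5.3 replace the unit costs.

```latex
\operatorname{lev}_{a,b}(i,j) =
\begin{cases}
  \max(i,j) & \text{if } \min(i,j) = 0,\\
  \min \begin{cases}
    \operatorname{lev}_{a,b}(i-1,\,j) + 1\\
    \operatorname{lev}_{a,b}(i,\,j-1) + 1\\
    \operatorname{lev}_{a,b}(i-1,\,j-1) + \mathbf{1}_{(a_i \neq b_j)}
  \end{cases} & \text{otherwise.}
\end{cases}
```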
Following previous work, this method accepted input words, each of which was then compared to its closest neighbor within the word bank [52]. The method would swap a word with its word-bank counterpart if the calculated Levenshtein distance was below a set threshold. A word bank was used because the words of interest in the dataset consisted of more than just words within the English lexicon, as commercial invoices can also contain foreign and industry-specific words. Levenshtein distance does not discriminate between English and non-English words, and hence was generalizable for our objectives.
After experimenting with these weights, and with inspiration from the results of a prior study, the distance thresholds for “fixing” words were set as follows: 0.99 for words of length 3 or less, 1.99 for lengths 4 to 8, and 2.99 for all other word lengths [55].
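A minimal sketch of this threshold-based correction follows, using the plain unweighted distance from the python-Levenshtein package (an assumed dependency) for illustration; the pipeline’s customized weights would replace it.

```python
import Levenshtein  # python-Levenshtein; assumed dependency for this sketch

def threshold(word):
    # Length-dependent thresholds from Section 5.3
    if len(word) <= 3:
        return 0.99
    if len(word) <= 8:
        return 1.99
    return 2.99

def correct(word, word_bank):
    """Swap a word with its closest word-bank neighbor when the distance
    falls below the length-dependent threshold."""
    best = min(word_bank, key=lambda w: Levenshtein.distance(word, w))
    if Levenshtein.distance(word, best) < threshold(word):
        return best
    return word
```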
• To remove redundant inputs to the word bank, we checked that a word did not include numbers, and we stripped special characters from it
• Dates were converted into the ISO format, which would make dates easier to
detect (for example, “July 17, 2021” would become “2021-07-17”)
• Unit prices would often have “/M” attached to them, meaning the unit price was per thousand units; the module would detect this and divide the price by 1,000 to identify the correct price per unit
• “FACTURA” numbers occasionally had “-1” or “/1” attached at the end, which
would be removed
• The given list of bounding boxes would sometimes be modified in-place to combine words surrounding hyphens and slashes
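A minimal sketch of these normalization rules follows; the exact input formats handled and the helper names are illustrative assumptions.

```python
import re
from datetime import datetime

def normalize_date(text):
    """Convert e.g. 'July 17, 2021' into ISO format: '2021-07-17'."""
    return datetime.strptime(text, "%B %d, %Y").date().isoformat()

def normalize_unit_price(text):
    """'/M' marks a price per thousand units; divide to get the price per unit."""
    if text.endswith("/M"):
        return float(text[:-2].replace(",", "")) / 1000
    return float(text.replace(",", ""))

def normalize_factura(text):
    """Strip a trailing '-1' or '/1' from FACTURA numbers."""
    return re.sub(r"[-/]1$", "", text)
```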
Chapter 6
Figure 6-1: Example of key-value pairs following results from the Key-Value Matching
Module.
6.2 Deterministic Algorithms Overview
We identified that the keys corresponding to waybill number, invoice date, and ship-
ping method could be determined following a specific structure for every document.
This deterministic quality made them less suitable to be extracted by a ML model, and
hence we developed and used custom algorithms for them. Specifically, the waybill
number was often parsed as multiple words, and so we used a rule-checking method to
determine if the extracted value corresponded to a possible waybill number; the date
was already preprocessed into the ISO format, and so a simple format comparison
was needed; the shipping method was always from the company DHL (either “DHL
EXPRESS” or “DHL 3RD PARTY INT’L BILL”) and hence was also easy to extract.
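A minimal sketch of such deterministic matchers follows; the specific waybill rule shown is an assumed placeholder, as the exact check is not detailed here.

```python
import re

def is_waybill(tokens):
    """Rule check for a candidate waybill number parsed as multiple words."""
    candidate = "".join(tokens)
    return candidate.isalnum() and 8 <= len(candidate) <= 12  # assumed rule

def is_iso_date(text):
    """Dates were already normalized to ISO format upstream."""
    return re.fullmatch(r"\d{4}-\d{2}-\d{2}", text) is not None

def extract_ship_method(words):
    """The shipping method is always one of two DHL strings."""
    page_text = " ".join(words)
    for method in ("DHL 3RD PARTY INT'L BILL", "DHL EXPRESS"):
        if method in page_text:
            return method
    return None
```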
Figure 6-2: Diagram of the pipeline’s ML Model component.
The text and position embeddings are then passed through a multi-layer bidirectional transformer that is able to generate contextualized representations with an adaptive attention mechanism.

Document layouts contain visually rich information that can also be aligned with input texts, and this idea serves as the foundation of the LayoutLM model. Document layout information captures the relative position of words within the invoice documents, which can be embedded as 2-D position representations. Visual information primarily indicates which document segments are important and should accordingly be prioritized, and it can be represented as image features. Combining these two types of information allows for a more nuanced semantic representation of a document [57].
LayoutLM does exactly this by applying the BERT architecture and adding two additional input embeddings: a 2-D text position embedding and an image embedding. The 2-D position embedding is a way through which the relative spatial position of elements in a document can be represented. The spatial position of an element (via its bounding box) is represented as (x0, y0, x1, y1), where (x0, y0) corresponds to the position of the bounding box’s upper left corner and (x1, y1) corresponds to the position of its lower right corner. For the image embedding, with each word’s bounding box from the OCR results, the image is split into several pieces, all of which have a one-to-one correspondence with the words. These image region features are then converted into token image embeddings. As shown in Figure 6-3, the downstream task (in our case, key-value matching) is accomplished by combining the image and LayoutLM embeddings after passing through the model.
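A minimal sketch of this input preparation follows, using the Hugging Face transformers implementation of LayoutLM for token classification. The words, pixel boxes, page size, and label count are illustrative assumptions, and the image embedding branch is omitted for brevity.

```python
import torch
from transformers import LayoutLMForTokenClassification, LayoutLMTokenizer

tokenizer = LayoutLMTokenizer.from_pretrained("microsoft/layoutlm-base-uncased")
model = LayoutLMForTokenClassification.from_pretrained(
    "microsoft/layoutlm-base-uncased", num_labels=13)  # e.g. 12 keys + "_OTHER"

words = ["INVOICE", "DATE:", "2021-07-17"]            # placeholder OCR output
pixel_boxes = [(82, 40, 165, 58), (170, 40, 215, 58), (220, 40, 310, 58)]
page_w, page_h = 1700, 2200                           # page size in pixels

def normalize(box):
    # LayoutLM expects (x0, y0, x1, y1) scaled onto a 0-1000 grid
    x0, y0, x1, y1 = box
    return [1000 * x0 // page_w, 1000 * y0 // page_h,
            1000 * x1 // page_w, 1000 * y1 // page_h]

tokens, boxes = [], []
for word, box in zip(words, pixel_boxes):
    pieces = tokenizer.tokenize(word)   # wordpieces share their word's box
    tokens += pieces
    boxes += [normalize(box)] * len(pieces)

input_ids = tokenizer.convert_tokens_to_ids(
    [tokenizer.cls_token] + tokens + [tokenizer.sep_token])
bbox = [[0, 0, 0, 0]] + boxes + [[1000, 1000, 1000, 1000]]

outputs = model(input_ids=torch.tensor([input_ids]),
                bbox=torch.tensor([bbox]))
predicted = outputs.logits.argmax(-1)   # one key label per token
```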
A powerful property of the LayoutLM model was its independence from the preprocessing, OCR, and postprocessing methods, meaning that its sole purpose in the pipeline was to identify key-value pairs. With the first two modules separated from the third module containing the model, LayoutLM accordingly allowed us to more easily identify and remedy errors if they appeared earlier on, improving the pipeline’s final results.
In other words, it is an end-to-end model that handles the entire process of taking in processed input images and matching the key-value pairs.
The use of the DONUT model was appealing for several reasons. Unlike with other models, we would not need to externally identify effective OCR engines, errors from an OCR component would not propagate through the rest of the model, and no OCR postprocessing module would be needed. This could in theory allow the pipeline to be simpler and faster while also attaining higher accuracy. As further shown in Figure 6-4, DONUT provides a full system with no outsourcing of processing approaches or OCR engines, allowing for focus on the objective of key-value extraction from a provided document [36].
To obtain a baseline understanding of the DONUT model, we trained and tested it using the SROIE dataset. After 30 epochs of training, the results were modest, with an accuracy score of 0.679 and an F1 score of 0.574. Next, to evaluate whether integrating the DONUT model would be a better approach, we trained the model on the Arrow dataset, first for 10 epochs to verify proper training, and then for another 20 epochs, for a total of 30 epochs.
A set of 266 invoice documents from the Arrow dataset was used, and DONUT did not perform well on it. Both the summarized and full accuracy results are shown in Tables 6.1 and 6.2. Deterministically matched keys such as “GUIA” and “SHIP_METHOD” had high accuracies, as expected, but all other keys involved within the ML component greatly underperformed. We did note that the model learned the output representation extremely quickly, within 10 epochs. However, even after 30 epochs, the information output ultimately did not match the ground truth, and it did not show much improvement with additional epochs.
Table 6.1: Summarized results of the DONUT model on the Arrow dataset.
Table 6.2: Individual results of the DONUT model on the Arrow dataset.
We soon realized that the results were poor not because the model was inferior, but because DONUT expected an input image of size 1280 x 720, while images of sizes such as (5 * 1280) x 720 were being passed in, meaning each image had to be compressed vertically or in some other manner. To remedy this, we developed a script that modified the dataset so that the images were rotated correctly (as needed) and then made as large as possible, instead of being shrunk down to 12 pages, making the 88% of documents with 6 or fewer pages more legible.
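A minimal sketch of such an image-fitting step follows, using Pillow; the target resolution matches the 1280 x 720 input size noted above, while the rotation, resampling, and canvas choices are assumptions.

```python
from PIL import Image

TARGET_W, TARGET_H = 1280, 720  # DONUT's expected input size

def fit_to_canvas(img: Image.Image) -> Image.Image:
    # Rotate portrait pages so the long edge matches the wide canvas
    if img.height > img.width:
        img = img.rotate(90, expand=True)
    # Scale as large as possible while preserving the aspect ratio
    scale = min(TARGET_W / img.width, TARGET_H / img.height)
    img = img.resize((int(img.width * scale), int(img.height * scale)),
                     Image.LANCZOS)
    # Paste onto a fixed-size white canvas
    canvas = Image.new("RGB", (TARGET_W, TARGET_H), "white")
    canvas.paste(img, (0, 0))
    return canvas
```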
This, however, came with several caveats. The features in each document now varied considerably in size, and since we would need to apply the same process for future datasets, GCV would also be required as a preprocessing step, increasing cost and processing time. Converting the Arrow dataset into a compatible format and testing the DONUT model in the pipeline produced an output with accuracies that were much lower than expected. As this model presented many inefficiencies and difficulties, we ultimately decided not to move forward with fully integrating the DONUT model into the pipeline.
The LayoutLMv2 and LayoutLMv3 models, which provide further improved versions of the LayoutLM model, were also considered [60, 61]. However, these newer versions are not licensed for commercial use (a stipulation of the project), and hence LayoutLM was deemed the final choice for the machine learning component.
Chapter 7
Document Tabulation
A custom, rule-based algorithm was developed to convert the output from the
Key-Value Matching Module into a visual table form overlaid on top of the input
document. This was critical as our dataset could have multiple rows of keys (such as
multiple rows containing part numbers and quantities) within a single document, and
hence the final result needed to be properly organized by row. The output was visual key-value classification on the original invoice document processed by the pipeline, as shown in Figure 7-1 (some information is redacted for confidentiality reasons).
To initially identify the rows, the algorithm took in the key classifications from the previous ML component and divided the bounding boxes by classification. Then, to handle each key separately, it converted each box into a set of (x, y) coordinate pairs. Specifically, since words could be left-aligned, right-aligned, or center-aligned, the algorithm considered three separate sets of points for every bounding box: the top left corners, the top right corners, and the top middle points.
Next, the algorithm looked for a pattern of vertically-spaced, horizontally-aligned
coordinate pairs, which became the guess for that key’s row position. To obtain the
position of the rows for the entire page, the algorithm combined the guess for each
key, determined the most frequent row height, and selected the positions obtained by
the highest key with that row height.
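A minimal sketch of this row-detection idea follows; the box format and alignment tolerance are illustrative assumptions rather than the algorithm’s exact parameters.

```python
from collections import Counter

def row_positions(boxes, x_tol=5):
    """boxes: (x0, y0, x1, y1) tuples for words classified under one key.
    Returns candidate row y-positions from the best-aligned anchor set."""
    if not boxes:
        return []
    # Three candidate anchor sets: top-left, top-right, and top-middle points
    anchor_sets = [
        [(x0, y0) for x0, y0, x1, y1 in boxes],
        [(x1, y0) for x0, y0, x1, y1 in boxes],
        [((x0 + x1) / 2, y0) for x0, y0, x1, y1 in boxes],
    ]
    best_rows = []
    for points in anchor_sets:
        # Bucket x-coordinates so horizontally aligned points group together
        columns = Counter(round(x / x_tol) for x, _ in points)
        column, _ = columns.most_common(1)[0]
        rows = sorted(y for x, y in points if round(x / x_tol) == column)
        if len(rows) > len(best_rows):
            best_rows = rows  # keep the best-aligned candidate set
    return best_rows
```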
After determining the rows, the correct boxes within each row were selected. This was necessary because the model may have classified additional words per key, so the algorithm had to filter for the correct word. This was accomplished via a combination of individual formatting checks and matching information across keys. For the formatting check, an example is checking whether a word classified as a numeric-valued key (such as “QTY”) was actually a numeric value. For matching information, the information should match up across keys; in particular, the quantity multiplied by the unit price should equal the total price for each row.
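For illustration, the cross-key check can be sketched as follows, with an assumed rounding tolerance.

```python
def row_consistent(qty, unit_price, total, tol=0.01):
    # Quantity times unit price should equal the row's total price,
    # within a small tolerance for rounding.
    return abs(qty * unit_price - total) <= tol * max(abs(total), 1.0)

# e.g. row_consistent(250, 0.42, 105.00) -> True
```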
Finally, after obtaining the preliminary values for each row, the algorithm postprocessed the rows to improve certain cases. For example, an invoice may have the
PO number written at the top of the page instead of a PO number in each row, so the
algorithm would detect that PO number and append it to each row. After methods
such as this, we obtained the final result of the pipeline: the extracted key-value
information as a classified table.
Chapter 8
Final Evaluation
We developed three evaluation scripts separate from the three main modules, each of which was used to measure the performance of specific parts of the pipeline. The first evaluation script measured the performance of the first half of the pipeline (preprocessing, OCR, postprocessing). The model evaluation script measured the performance of the trained LayoutLM model, while the final evaluation script evaluated the accuracy of the comprehensive pipeline from beginning to end. A comparison of the pipeline to a leading commercially available solution is also presented.
Per-key accuracy was also determined; for example, 91.67% of the “GUIA” (or waybill number) words from our ground-truth data were present within the information output from the pipeline’s preprocessing, OCR, and postprocessing components. This showed that the processing half was able to correctly identify a significant majority of the information present in the invoice documents, a crucial first step toward the objective of key-value extraction.
Table 8.1: Summarized results from the Preprocessing, OCR, and Postprocessing
Evaluation.
Table 8.2: Individual results from the Preprocessing, OCR, and Postprocessing Evaluation.
A macro F1 score was used as opposed to a micro F1 score because micro F1 scores often do not return an objective measure of model performance when classes are imbalanced, while the macro F1 score does.
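A small, made-up illustration of this distinction using scikit-learn:

```python
from sklearn.metrics import f1_score

# With imbalanced classes, a classifier that ignores the rare key still
# scores well on micro F1 but is penalized by macro F1.
y_true = ["QTY"] * 95 + ["GUIA"] * 5
y_pred = ["QTY"] * 100  # never predicts the rare key

print(f1_score(y_true, y_pred, average="micro"))                   # 0.95
print(f1_score(y_true, y_pred, average="macro", zero_division=0))  # ~0.49
```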
The model’s accuracy was found to be higher than its F1 score. We deduced that this was because the model sometimes classified extra words under each key. For example, looking at the “QTY” column in Figure 8-1, we observe there are 3,124 words which were classified as “QTY” and were indeed a “QTY,” but there were also 631 words classified as “QTY” that were actually “_OTHER,” i.e., not associated with any key. For our pipeline’s identified objectives, classifying extra keys was preferred over missing keys, as the former could be handled within the tabulation algorithm.
The model’s output confusion matrix for the test set is presented in Figure 8-1; a confusion matrix is an evaluation technique used to summarize the performance of a classification algorithm. In the matrix, the diagonal elements represent the number of datapoints for which the predicted label equals the true label, while the other elements are those that have been mislabeled by the classifier. The higher the diagonal values of the confusion matrix, the better, as they indicate more correct predictions. The bottom labels represent the classes given by the model, while the left labels represent the classes in the ground truth. Visual assessment of our model’s confusion matrix along the diagonal accordingly indicated effective classification.
It is also important to note that although the model performed accurately, its input data was based on the output from the OCR and postprocessing modules, meaning that errors within those stages could have propagated into the model as well. For example, the training data within the Arrow dataset was sometimes missing classifications, which could be attributed to the training data constructor being unable to match the tokens up to the ground truth in certain cases. This is something that can be improved upon as the pipeline continues to be developed.
# of Correct Keys # of Total Keys Total Accuracy Equally-weighted Accuracy
7,984 9,540 83.69% 87.85%
Any other key-value pairs which Textract detected that did not fall under one of these fields were classified as “OTHER,” and Textract returned the keys it detected as a separate parameter.
To compare the final pipeline with Textract, we used six possible keys: the four keys shared with the standard fields, as well as “GUIA” and “PUNIT.” These keys were chosen as they were the easiest to convert from the Textract output, after which we randomly selected a dataset of about 100 documents. As shown in Figure 8-2, our pipeline extracted all of the specified key-value pairs much more accurately than Textract did on the Arrow dataset.
Figure 8-2: Accuracy comparison of AWS Textract and the developed pipeline.
We can partially attribute this to the fact that we defined a specific methodology to extract the proper key-value pairs, while Textract is not fully positioned to handle the range of inputs in our dataset. For example, the waybill number (“GUIA”) is not a standard field that Textract extracts, yet by checking the labels Textract returned for certain keywords, we were able to obtain an accuracy score over 70% for it.
From this we concluded that the Textract pipeline could be improved, but only with a significant investment of time. However, since the objective of this comparison was to evaluate our complete pipeline against the commercial AWS Textract pipeline as a standalone solution, and since adding custom rules to Textract would mean it is no longer standalone, we concluded that our current pipeline is more effective for key-value extraction from the Arrow dataset.
Chapter 9
Conclusion
We also built evaluation scripts to measure performance, both on individual modules and on the comprehensive pipeline. To supplement the pipeline, we developed a document tabulation algorithm to provide visual key-value classification on processed invoice documents, and an intuitive web interface to easily run the pipeline and examine results on a local system.
Bibliography
[2] Selwyn, Neil. “Data Entry: Towards the Critical Study of Digital Data and Education.” Learning, Media and Technology, vol. 40, no. 1, Informa UK Limited, 28 May 2014, pp. 64–82.
[3] Han, Jiang, et al. “Improving the Efficacy of the Data Entry Process for Clinical Research With a Natural Language Processing–Driven Medical Information Extraction System: Quantitative Field Research.” JMIR Medical Informatics, vol. 7, no. 3, JMIR Publications Inc., 16 July 2019, p. e13331.
[4] Hegghammer, Thomas. “OCR with Tesseract, Amazon Textract, and Google Document AI: A Benchmarking Experiment.” Journal of Computational Social Science, vol. 5, no. 1, Springer Science and Business Media LLC, 22 Nov. 2021, pp. 861–882.
[5] Singh, Himanshu. "Practical Machine Learning with AWS." Apress, 2021.
[6] Soysal, Ergin, et al. “CLAMP – a Toolkit for Efficiently Building Customized
Clinical Natural Language Processing Pipelines.” Journal of the American Medical
Informatics Association, vol. 25, no. 3, Oxford University Press (OUP), 24 Nov.
2017, pp. 331–336.
[7] Priya, K. “Customized Data Extraction and Effective Text Data Preprocessing Technique for Hydroxychloroquin Related Twitter Data.” Bioscience Biotechnology Research Communications, vol. 13, no. 13, Society for Science and Nature, 25 Dec. 2020, pp. 150–158.
[9] Sage, Clement, et al. “Recurrent Neural Network Approach for Table Field Extraction in Business Documents.” 2019 International Conference on Document Analysis and Recognition (ICDAR), IEEE, Sept. 2019.
[11] Palacios, Rafael, and Gupta, Amar. “A System for Processing Handwritten Bank
Checks Automatically.” Image and Vision Computing, vol. 26, no. 10, Elsevier BV,
Oct. 2008, pp. 1297–1313.
[12] Gupta, Amar, et al. "Automatic Processing of Brazilian Bank Checks." 2016.
[13] Kim, Donghwa, et al. "Multi-co-training for document classification using various
document representations: TF–IDF, LDA, and Doc2Vec." Information Sciences
477, 2019, pp. 15-29.
[15] Mulfari, Davide, et al. "Using Google Cloud Vision in assistive technology scenarios." 2016 IEEE Symposium on Computers and Communication (ISCC), IEEE, 2016.
[16] Q. Ye and D. Doermann. "Text Detection and Recognition in Imagery: A Survey." IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 7, July 2015, pp. 1480-1500.
[18] H. Zhou, L. Shao and H. Zhang. "SRRNet: A Transformer Structure with Adaptive 2D Spatial Attention Mechanism for Cell Phone-Captured Shopping Receipt Recognition." IEEE Transactions on Consumer Electronics, 2022.
[19] Park, Seunghyun, et al. "CORD: a consolidated receipt dataset for post-OCR
parsing." Workshop on Document Intelligence at NeurIPS 2019, 2019.
[20] Huang, Zheng et al. “ICDAR2019 Competition on Scanned Receipt OCR and
Information Extraction.” 2019 International Conference on Document Analysis
and Recognition (ICDAR), 2019, pp. 1516-1520.
[21] Ghosh, Aindrila, et al. "A comprehensive review of tools for exploratory analysis
of tabular industrial datasets." Visual Informatics 2.4, 2018, pp. 235-253.
[23] Rosid, Mochamad Alfan, et al. "Improving text preprocessing for student complaint document classification using sastrawi." IOP Conference Series: Materials Science and Engineering, vol. 874, no. 1, IOP Publishing, 2020.
[24] Shobha Rani, N., A. Sajan Jain, and H. R. Kiran. "A unified preprocessing technique for enhancement of degraded document images." International Conference on ISMAC in Computational Vision and Bio-Engineering, Springer, Cham, 2019.
[25] Binmakhashen, Galal M., and Sabri A. Mahmoud. "Document layout analysis:
a comprehensive survey." ACM Computing Surveys (CSUR), 2019, pp. 1-36.
[26] Huang, Yilun, et al. "A YOLO-based table detection method." 2019 International Conference on Document Analysis and Recognition (ICDAR), IEEE, 2019.
[28] Patel, Chirag et al. "Optical Character Recognition by Open source OCR Tool
Tesseract: A Case Study." International Journal of Computer Applications, 2012,
pp. 50-56.
[29] H. Hosseini, B. Xiao and R. Poovendran. "Google’s Cloud Vision API is Not
Robust to Noise." 2017 16th IEEE International Conference on Machine Learning
and Applications (ICMLA), 2017, pp. 101-105.
[30] Chakraborty, Sunandan, et al. "Extraction of (key, value) pairs from unstructured ads." 2014 AAAI Fall Symposium Series, 2014.
[31] Salloum, Said A., et al. "Using text mining techniques for extracting information from research articles." Intelligent Natural Language Processing: Trends and Applications, Springer, Cham, 2018, pp. 373-397.
[32] Karthikeyan, T., et al. "Personalized content extraction and text classification
using effective web scraping techniques." International Journal of Web Portals
(IJWP) 11.2, 2019, pp. 41-52.
[33] Xu, Y., et al. "LayoutLM: Pre-training of Text and Layout for Document Image
Understanding." Proceedings of the 26th ACM SIGKDD International Conference
on Knowledge Discovery & Data Mining, 2020.
[34] Hong, Teakgyu, et al. "Bros: A pre-trained language model focusing on text and
layout for better key information extraction from documents." Proceedings of the
AAAI Conference on Artificial Intelligence Vol. 36, 2022.
[35] Appalaraju, Srikar, et al. "Docformer: End-to-end transformer for document understanding." Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.
[37] Morris, David, Peichen Tang, and Ralph Ewerth. "A neural approach for text
extraction from scholarly figures." 2019 International Conference on Document
Analysis and Recognition (ICDAR), IEEE, 2019.
[40] Böschen, Falk, and Ansgar Scherp. "Formalization and preliminary evaluation of
a pipeline for text extraction from infographics." CEUR Workshop Proceedings
Vol. 1458, 2015.
[41] Munappy, A., Bosch, J., et al. "Modelling Data Pipelines." 46th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), 2020, pp. 13-20.
[42] Shen, Mande and Lei, Hansheng. "Improving OCR performance with background
image elimination." 2015 12th International Conference on Fuzzy Systems and
Knowledge Discovery (FSKD), 2015, pp. 1566-1570.
[43] Bradski, G. The OpenCV Library. Dr. Dobb’s Journal of Software Tools. 2000.
[44] “Detect Text in Images | Cloud Vision API |.” Google Cloud, January 2022.
Available: cloud.google.com/vision/docs/ocr.
[45] Paliwal, S., et al. "TableNet: Deep Learning Model for end-to-end table detection
and tabular data extraction from scanned document images." 2019 International
Conference on Document Analysis and Recognition (ICDAR), 2019.
[46] Chen, SH., and Chen, YH. "A Content-Based Image Retrieval Method Based on the Google Cloud Vision API and WordNet." Intelligent Information and Database Systems, ACIIDS 2017, Lecture Notes in Computer Science, vol. 10191, Springer, Cham, 2017.
[48] Malkadi, Abdulkarim, Mohammad Alahmadi, and Sonia Haiduc. "A study on the accuracy of OCR engines for source code transcription from programming screencasts." Proceedings of the 17th International Conference on Mining Software Repositories, 2020.
[49] Nguyen, T., et al. "Survey of Post-OCR Processing Approaches." ACM Comput.
Surv., 54(6), 2021.
[50] X. Qiu, et al. "A Post-Processing Method for Text Detection Based on Geometric
Features." IEEE Access, vol. 9, 2021, pp. 36620-36633.
[51] Berger, Bonnie, Michael S. Waterman, and Yun William Yu. "Levenshtein distance, sequence comparison and biological database search." IEEE Transactions on Information Theory 67.6, 2020, pp. 3287-3294.
[52] C. Yao, X. Bai and W. Liu. "A Unified Framework for Multioriented Text Detection and Recognition." IEEE Transactions on Image Processing, vol. 23, no. 11, Nov. 2014, pp. 4737-4749.
[53] Haldar, Rishin, and Debajyoti Mukhopadhyay. "Levenshtein Distance Technique
in Dictionary Lookup Methods: An Improved Approach." 1, arXiv, 2011.
[55] Hicham, Gueddah. "Introduction of the Weight Edition Errors in the Levenshtein
Distance." 1, arXiv, 2012.
[56] Devlin, Jacob, et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186.
[57] Nguyen, TA.D., Vu, H.M., Son, N.H., and Nguyen, MT. "A Span Extraction Approach for Information Extraction on Visually-Rich Documents." Document Analysis and Recognition – ICDAR 2021 Workshops, ICDAR 2021, Lecture Notes in Computer Science, vol. 12917, Springer, 2021.
[58] H. Guo, et al. "EATEN: Entity-Aware Attention for Single Shot Visual Text Extraction." 2019 International Conference on Document Analysis and Recognition (ICDAR), 2019, pp. 254-259.
[59] Park, S., et al. "CORD: a consolidated receipt dataset for post-OCR parsing."
Workshop on Document Intelligence at NeurIPS 2019, 2019.
[61] Huang, Yupan, et al. "LayoutLMv3: Pre-Training for Document AI with Unified
Text and Image Masking." 3, arXiv, 2022.
[62] Xu, Ting, et al. "Intelligent Document Processing: Automate Business with
Fluid Workflow." Konica Minolta technology report 18, 2021, pp. 89-94.
[63] Rehman, Amjad, and Tanzila Saba. "Neural networks for document image preprocessing: state of the art." Artificial Intelligence Review 42.2, 2014, pp. 253-273.
[66] Ding, Pan, et al. “Textual Information Extraction Model of Financial Reports.”
Proceedings of the 2019 7th International Conference on Information Technology:
IoT and Smart City, ACM, 20 Dec. 2019.
[67] Liu, Xiaojing, et al. "Graph Convolution for Multimodal Information Extraction
from Visually Rich Documents." 1, arXiv, 2019.