DocEng '22: Proceedings of the 22nd ACM Symposium on Document Engineering
ACM 2022 Proceeding
Publisher:
  • Association for Computing Machinery
  • New York
  • NY
  • United States
Conference:
DocEng '22: ACM Symposium on Document Engineering 2022, San Jose, California, September 20-23, 2022
ISBN:
978-1-4503-9544-1
Published:
18 November 2022

Abstract

The symposium brings together experts in all areas of document engineering, across academia and industry, with the intention of presenting and discussing the most recent advances in the field of Document Engineering.

tutorial
Binarization of photographed documents image quality, processing time and size assessment
Article No.: 1, Pages 1–10 https://doi.org/10.1145/3558100.3564159

Today, over eighty percent of the world's population owns a smart-phone with an in-built camera, and they are very often used to photograph documents. Document binarization is a key process in many document processing platforms. This competition on ...

keynote
How did Dennis Ritchie produce his PhD thesis?: a typographical mystery
Article No.: 2, Pages 1–10 https://doi.org/10.1145/3558100.3563839

Dennis Ritchie, the creator of the C programming language and, with Ken Thompson, the co-creator of the Unix operating system, completed his Harvard PhD thesis on recursive function theory in early 1968. But for unknown reasons, he never officially ...

research-article
Graphical document representation for French newsletters analysis
Article No.: 3, Pages 1–8 https://doi.org/10.1145/3558100.3563856

Document analysis is essential in many industrial applications. However, engineering natural language resources to represent entire documents is still challenging. Besides, available resources in French are scarce and do not cover all possible tasks, ...

short-paper
A cascaded approach for page-object detection in scientific papers
Article No.: 4, Pages 1–4 https://doi.org/10.1145/3558100.3563851

In recent years, Page Object Detection (POD) has become a popular document understanding task, proving to be a non-trivial task given the potential complexity of documents. The rise of neural networks facilitated a more general learning approach to this ...

short-paper
From print to online newspapers on small displays: a layout generation approach aimed at preserving entry points
Article No.: 5, Pages 1–4 https://doi.org/10.1145/3558100.3563847

Simply transposing the print newspapers into digital media can not be satisfactory because they were not designed for small displays. One key feature lost is the notion of entry points that are essential for navigation. By focusing on headlines as entry ...

research-article
Long-term lifecycle-related management of digital building documents: towards a holistic and standard-based concept for a technical and organizational solution in building authorities
Article No.: 6, Pages 1–10 https://doi.org/10.1145/3558100.3563842

The long-term lifecycle-related management of digital building information is essential to improve the overall quality of public built assets. However, this management task still poses great challenges for building authorities, as they are usually ...

short-paper
Open Access
Theory entity extraction for social and behavioral sciences papers using distant supervision
Article No.: 7, Pages 1–4 https://doi.org/10.1145/3558100.3563855

Theories and models, which are common in scientific papers in almost all domains, usually provide the foundations of theoretical analysis and experiments. Understanding the use of theories and models can shed light on the credibility and reproducibility ...

research-article
Open Access
Best Paper
Tab this folder of documents: page stream segmentation of business documents
Article No.: 8, Pages 1–10 https://doi.org/10.1145/3558100.3563852

In the midst of digital transformation, automatically understanding the structure and composition of scanned documents is important in order to allow correct indexing, archiving, and processing. In many organizations, different types of documents are ...

short-paper
Modifying PDF sewing patterns for use with projectors
Article No.: 9, Pages 1–4 https://doi.org/10.1145/3558100.3563853

Print-at-home PDF sewing patterns have gained popularity over the last decade and now represent a significant proportion of the home sewing pattern market. Recently, an all-digital workflow has emerged through the use of ceiling-mounted projectors, ...

short-paper
SeNMFk-SPLIT: large corpora topic modeling by semantic non-negative matrix factorization with automatic model selection
Article No.: 10, Pages 1–4 https://doi.org/10.1145/3558100.3563844

As the amount of text data continues to grow, topic modeling is serving an important role in understanding the content hidden by the overwhelming quantity of documents. One popular topic modeling approach is non-negative matrix factorization (NMF), an ...

research-article
Best Student Paper
Downstream transformer generation of question-answer pairs with preprocessing and postprocessing pipelines
Article No.: 11, Pages 1–8 https://doi.org/10.1145/3558100.3563846

We present a method to perform a downstream task of transformers on generating question-answer pairs (QAPs) from a given article. We first finetune pretrained transformers on QAP datasets. We then use a preprocessing pipeline to select appropriate ...

short-paper
Open Access
Academic writing and publishing beyond documents
Article No.: 12, Pages 1–4 https://doi.org/10.1145/3558100.3563840

Research on writing tools stopped in the late 1980s when Microsoft Word had achieved monopoly status. However, the development of the Web and the advent of mobile devices are increasingly rendering static print-like documents obsolete. In this vision ...

short-paper
Optical character recognition with transformers and CTC
Article No.: 13, Pages 1–4 https://doi.org/10.1145/3558100.3563845

Text recognition tasks are commonly solved by using a deep learning pipeline called CRNN. The classical CRNN is a sequence of a convolutional network, followed by a bidirectional LSTM and a CTC layer. In this paper, we perform an extensive analysis of ...

short-paper
Open Access
Optical character recognition guided image super resolution
Article No.: 14, Pages 1–4 https://doi.org/10.1145/3558100.3563841

Recognizing disturbed text in real-life images is a difficult problem, as information that is missing due to low resolution or out-of-focus text has to be recreated. Combining text super-resolution and optical character recognition deep learning models ...

short-paper
Anonymizing and obfuscating PDF content while preserving document structure
Article No.: 15, Pages 1–4 https://doi.org/10.1145/3558100.3563849

The portable document format (PDF) is both versatile and complex, with a specification exceeding well over a thousand pages. For independent developers writing software that reads, displays, or transforms PDFs, it is difficult to comprehensively account ...

short-paper
Open Access
Scholarly big data quality assessment: a case study of document linking and conflation with S2ORC
Article No.: 16, Pages 1–4 https://doi.org/10.1145/3558100.3563850

Recently, the Allen Institute for Artificial Intelligence released the Semantic Scholar Open Research Corpus (S2ORC), one of the largest open-access scholarly big datasets with more than 130 million scholarly paper records. S2ORC contains a significant ...

short-paper
Detecting malware using text documents extracted from spam email through machine learning
Article No.: 17, Pages 1–4 https://doi.org/10.1145/3558100.3563854

Spam has become an effective way for cybercriminals to spread malware. Although cybersecurity agencies and companies develop products and organise courses for people to detect malicious spam email patterns, spam attacks are not totally avoided yet. In ...

short-paper
Open Access
Triplet transformer network for multi-label document classification
Article No.: 18, Pages 1–4 https://doi.org/10.1145/3558100.3563843

Multi-label document classification is the task of assigning one or more labels to a document, and has become a common task in various businesses. Typically, current state-of-the-art models based on pretrained language models tackle this task without ...

short-paper
Chinese public procurement document harvesting pipeline
Article No.: 19, Pages 1–4 https://doi.org/10.1145/3558100.3563848

We present a processing pipeline for Chinese public procurement document harvesting, with the aim of producing strategic data with greater added value. It consists of three micro-modules: data collection, information extraction, database indexing. The ...

Contributors
  • Adobe Inc.
  • University of Nottingham
  • Colorado State University

Acceptance Rates

Overall Acceptance Rate 194 of 564 submissions, 34%
Year         Submitted  Accepted  Rate
DocEng '24          27        16   59%
DocEng '23          27         9   33%
DocEng '19          77        30   39%
DocEng '17          71        13   18%
DocEng '16          35        11   31%
DocEng '15          31        11   35%
DocEng '14          41        15   37%
DocEng '13          50        16   32%
DocEng '10          42        13   31%
DocEng '08          62        21   34%
DocEng '02          46        21   46%
DocEng '01          55        18   33%
Overall            564       194   34%