AdaParse

AdaParse is part of the AuroraGPT Initiative

AdaParse (Adaptive Parallel PDF Parsing and Resource Scaling Engine) enables scalable, high-accuracy PDF parsing. AdaParse is a data-driven strategy that assigns an appropriate parser to each document, offering high accuracy for any computational budget. Moreover, it offers a workflow that integrates various PDF parsing software.

Version 2 Updates

  • Full Aurora-support: all parsers have been ported and configs optimized
  • AdaParse can route a document's pages to different parsers (in prediction mode by_page via pagewise inference of text-quality)
    • new prediction models to infer text quality from pages and documents
    • higher accuracy due to adaptive fill-in (inception-parsing)
    • faster pre-processing pipeline (free of albumentations and cv2)
  • Nougat dependencies disentangled from the source repo to ensure continued support
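
The routing idea behind these updates can be illustrated with a small sketch. It is not AdaParse's implementation: the `Page` type, the `predicted_quality` scores, the `threshold`, and the `budget` are hypothetical stand-ins for the learned text-quality models. The sketch assumes every page is first handled by a cheap parser and that only the pages with the lowest predicted quality are re-routed to an expensive parser, up to a fixed budget.

```python
from dataclasses import dataclass


@dataclass
class Page:
    doc_id: str
    page_no: int
    predicted_quality: float  # hypothetical score in [0, 1] for the cheap parser's text


def route_pages(pages, threshold=0.8, budget=2):
    """Assign each page to 'cheap' or 'expensive' parsing.

    Pages whose predicted quality falls below `threshold` are candidates for
    the expensive parser; at most `budget` of them (worst first) receive it.
    """
    candidates = sorted(
        (p for p in pages if p.predicted_quality < threshold),
        key=lambda p: p.predicted_quality,
    )
    expensive = {(p.doc_id, p.page_no) for p in candidates[:budget]}
    return {
        (p.doc_id, p.page_no): "expensive" if (p.doc_id, p.page_no) in expensive else "cheap"
        for p in pages
    }


pages = [
    Page("a.pdf", 0, 0.95),
    Page("a.pdf", 1, 0.40),
    Page("b.pdf", 0, 0.70),
    Page("b.pdf", 1, 0.10),
]
routing = route_pages(pages, threshold=0.8, budget=2)
for key, parser in sorted(routing.items()):
    print(key, parser)
```

Raising `budget` sends more low-quality pages to the expensive parser, which is one simple way to trade compute for accuracy.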

AdaParse is designed to run on HPC systems and has parsed millions of (scientific) PDFs. It uses Parsl to submit jobs to the scheduler. While AdaParse is agnostic to the specific system, the instructions below are tailored to the Polaris supercomputer at Argonne National Laboratory (ANL). Regardless, AdaParse can run on any system (large or small) by adding an appropriate Parsl configuration.

Citation

AdaParse has been accepted to MLSys 2025 🎉

This work was presented at MLSys on May 13th, 2025 (Video: https://mlsys.org/virtual/2025/poster/3229) 🎥

The MLSys proceedings are not yet up to date. Here is the arXiv citation:

@inproceedings{siebenschuhadaparse,
  title={AdaParse: An Adaptive Parallel PDF Parsing and Resource Scaling Engine},
  author={Siebenschuh, Carlo and Hippe, Kyle and Gokdemir, Ozan and Brace, Alexander and Khan, Arham Mushtaq and Hossain, Khalid and Babuji, Yadu and Chia, Nicholas and Vishwanath, Venkatram and Ramanathan, Arvind and others},
  booktitle={Eighth Conference on Machine Learning and Systems}
}

Installation

Polaris

The steps below enable any of the parsers.

# conda env (machine-specific)
module use /soft/modulefiles; module load conda/2024-04-29 # Polaris
conda create -n adaparse python=3.12 -y
conda activate adaparse

# git repo (machine-agnostic)
git clone git@github.com:7shoe/AdaParse.git
cd AdaParse
pip install --upgrade pip setuptools wheel
pip install -e '.[transformers]' # pull transformers too

If you plan on using Tesseract, additional installation steps are required.

Aurora

The steps below enable any of the parsers.

# one-time
module load frameworks
conda create -n adaparse --clone /opt/aurora/25.190.0/frameworks/aurora_frameworks-2025.2.0
git clone git@github.com:7shoe/AdaParse.git
cd AdaParse

# use
module load frameworks
conda activate adaparse
pip install --upgrade pip setuptools wheel
pip install -e .
export PATH="$HOME/.local/aurora/frameworks/2024.2.1_u1/bin:$PATH"

Usage

The adaparse workflow can be run at scale using Parsl:

> python -m adaparse.convert --help
usage: convert.py [-h] --config CONFIG

PDF conversion workflow

optional arguments:
  -h, --help       show this help message and exit
  --config CONFIG  Path to workflow configuration file

A single command triggers the embarrassingly parallel PDF parsing engine:

python -m adaparse.convert --config <your-config.yaml>

Data preparation

PDF files (zipped or unzipped) reside in pdf_dir (see the configuration file below). AdaParse requires the PDFs to be zipped and will ignore unzipped .pdf files in that directory. Zipped input is optional for the other parsers. This repository provides a CLI to zip PDFs.

adaparse zip-pdfs --input_dir path/to/pdf_directory --output_dir path/to/destination_directory
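
Where the CLI is not at hand, a similar preparation step can be approximated with Python's standard `zipfile` module. This is a sketch under assumptions, not the `adaparse zip-pdfs` implementation: the function name `zip_pdfs`, the archive naming scheme (`chunk_0000.zip`), and the `pdfs_per_zip` parameter are hypothetical, and whether AdaParse expects a particular archive layout is not confirmed here.

```python
import zipfile
from pathlib import Path


def zip_pdfs(input_dir, output_dir, pdfs_per_zip=10):
    """Bundle the PDFs found in `input_dir` into zip archives in `output_dir`."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    pdfs = sorted(input_dir.glob("*.pdf"))
    archives = []
    for i in range(0, len(pdfs), pdfs_per_zip):
        # Hypothetical naming scheme: chunk_0000.zip, chunk_0001.zip, ...
        archive = output_dir / f"chunk_{i // pdfs_per_zip:04d}.zip"
        with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
            for pdf in pdfs[i : i + pdfs_per_zip]:
                zf.write(pdf, arcname=pdf.name)  # store flat, without parent directories
        archives.append(archive)
    return archives


if __name__ == "__main__":
    for path in zip_pdfs("path/to/pdf_directory", "path/to/destination_directory"):
        print(path)
```

The resulting directory of archives would then be the one that `pdf_dir` points to in the configuration.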

Configuration

The YAML configuration file specifies all aspects of the chosen parser, virtual environment and computing platform it is run on.

A sample configuration YAML file is provided below.

# The directory containing the PDFs to be parsed
pdf_dir: /lus/eagle/projects/argonne_tpc/siebenschuh/small-pdf-dataset

# The directory to store the JSONLs
out_dir: runs/output-dir

# AdaParse *requires* zipped input (optional for other parsers)
iszip: true

# The number of PDFs per parsl task
chunk_size: 1

# Parser settings
parser_settings:
  # The name of the parser to use
  name: adaparse

# Compute settings (e.g., ANL's Polaris)
compute_settings:
  # The name of the compute platform to use
  name: polaris
  # The number of compute nodes to use
  num_nodes: 1
  # Activate conda environment and set HF cache path
  worker_init: "module use /soft/modulefiles; module load conda/2024-04-29; conda activate adaparse; export HF_HOME=<path-to-your-HF-cache-dir>"
  # Scheduler options
  scheduler_options: "#PBS -l filesystems=home:eagle"
  # Your account/project that will be charged
  account: <your-account-name-to-charge>
  # The HPC queue to submit to
  queue: debug
  # The amount of runtime requested for this job
  walltime: 00:60:00

Example configuration files for each parser can be found in:

Output

Once you've updated the YAML file and run the AdaParse command, the textual output will be written to out_dir. The subdirectory <out_dir>/parsed_pdfs contains the parsed PDF output in JSON Lines format. Each line of the JSONL file contains a path field with the PDF source file, a text field containing the parsed text, and a metadata field with information such as the author and title. Please note that the specific metadata stored depends on the parser. Moreover, some attributes may not be provided by the PDF file, resulting in an empty string (''). Hence, a typical line in the JSONL file may look like this:

{"path": "/path/to/1.pdf",
 "text": "Text of the 1st PDF.",
 "metadata" : {
    "title" : "Ising string beyond the Nambu-Goto action",
    "authors" : "",
    "format" : "PDF 1.4",
    "creationdate" : "",
    "keywords" : "",
    "doi" : "",
    "first_page" : "One of the most promising approaches ...",
    "abstract" : "A major result of ...",
    "page_char_idx" : [0, 2961, 7407, 11735, 13927]
 },
"parser" : "pymupdf"
}
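
Records like the one above can be consumed with the standard `json` module. The sketch below reads a JSONL file line by line and splits `text` into pages. It assumes that each entry of `page_char_idx` is the character offset in `text` at which a page starts; that reading is inferred from the example and not confirmed here. The helper names `iter_records` and `split_pages` and the toy record are hypothetical.

```python
import json


def iter_records(jsonl_path):
    """Yield one parsed record (dict) per non-empty line of a JSONL file."""
    with open(jsonl_path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def split_pages(record):
    """Split `text` into pages using `metadata.page_char_idx`.

    Assumption: each index is the character offset where a page starts.
    Records without the field are returned as a single page.
    """
    text = record["text"]
    starts = record.get("metadata", {}).get("page_char_idx") or [0]
    ends = list(starts[1:]) + [len(text)]
    return [text[s:e] for s, e in zip(starts, ends)]


# Toy record with the same fields as the example above.
record = {
    "path": "/path/to/1.pdf",
    "text": "page one|page two|page three",
    "metadata": {"page_char_idx": [0, 9, 18]},
    "parser": "pymupdf",
}
for page_no, page_text in enumerate(split_pages(record)):
    print(page_no, repr(page_text))
```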

Note: If the parser fails to parse a PDF, the JSONL file will not contain an entry for that PDF.
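
Because failed PDFs simply have no entry, one way to find them is to compare the expected file names against the `path` fields in the output. The sketch below is a hypothetical helper, not part of AdaParse. It assumes the output files carry a `.jsonl` extension and that the base name of `path` identifies the source PDF.

```python
import json
import tempfile
from pathlib import Path


def parsed_names(jsonl_dir):
    """Collect the file names of all PDFs that have an entry in any JSONL file."""
    names = set()
    for jsonl_file in Path(jsonl_dir).glob("*.jsonl"):
        with open(jsonl_file, encoding="utf-8") as fh:
            for line in fh:
                if line.strip():
                    names.add(Path(json.loads(line)["path"]).name)
    return names


def missing_pdfs(expected_names, jsonl_dir):
    """Return the expected PDF names that have no entry in the output."""
    return sorted(set(expected_names) - parsed_names(jsonl_dir))


# Small self-contained demonstration with a temporary output directory.
with tempfile.TemporaryDirectory() as tmp:
    entry = {"path": "/data/a.pdf", "text": "some text", "metadata": {}}
    (Path(tmp) / "0.jsonl").write_text(json.dumps(entry) + "\n", encoding="utf-8")
    demo_missing = missing_pdfs(["a.pdf", "b.pdf"], tmp)
    print(demo_missing)
```

In a real run, `jsonl_dir` would be `<out_dir>/parsed_pdfs` and the expected names would come from the input archives.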

See the Monitoring the Workflow section for a description of the other log files that are generated during the workflow.

Development

It is recommended to use a virtual environment for development. The following commands will create a virtual environment, install the package in editable mode, and install the pre-commit hooks.

python3.10 -m venv venv
source venv/bin/activate
pip install -U pip setuptools wheel
pip install -e '.[dev,docs]'
pre-commit install
