
FinePDFs

FinePDFs: code and pipelines for the FinePDFs dataset

This repository accompanies the FinePDFs dataset release and contains the end‑to‑end code to filter, extract, OCR, postprocess, deduplicate, classify, and package large‑scale PDF text data.

Dataset card: HuggingFaceFW/finepdfs

Installation

This project uses a workspace setup with vendored Docling and a specific datatrove branch. We recommend uv:

pip install uv
uv venv -p 3.12
source .venv/bin/activate
uv sync --no-build-isolation-package flash-attn

Requirements are in pyproject.toml (notable: torch==2.6.0, vllm>=0.8.5.post1, pymupdf==1.26.1). A GPU is needed for vLLM steps.

Quickstart

All steps are orchestrated by the pipeline script.

Example run:

python run_finepdfs_pipeline.py \
  --crawl-ids CC-MAIN-2019-43 \
  --languages eng_Latn

Notes:

  • The pipeline is a rough reproduction of our production pipeline. Some parts are optimized for clarity rather than optimal resource allocation or speed.
  • In the production pipeline we used LMDeploy instead of vLLM. Since we decided not to support LMDeploy here, this repository uses vLLM. For production use cases we recommend LMDeploy, which we found to be the fastest of the three (SGLang, vLLM, LMDeploy).

OCR vs no‑OCR classifier

The trained OCR routing classifier is in the models/xgb_ocr_classifier/ directory. Training code and features are in the training notebook. The manually annotated training dataset is released at HuggingFaceFW/ocr-annotations.

OpenVINO‑quantized Docling layout model

We provide the code used to quantize ds4sd/docling-layout-heron to INT8 with OpenVINO. We provide only the quantization code, not the conversion code (which is straightforward to reproduce). We do not provide the source PDFs/images used, so consider this a reference implementation rather than fully runnable code. Evaluation code is provided in a similar form.

Language filtering thresholds

We first used google/gemma-3-27b-it as a classifier on a random subsample of 20k samples per language, prompting the LLM to check whether a sample is in the target language (any portion counts, to account for code‑switching). Using these annotations, we selected thresholds that maximize the F‑beta score with β = 0.1 (heavily prioritizing precision), subject to a minimum recall of 0.1, a minimum precision of 0.9, and a minimum score cutoff to ensure reasonable thresholds.
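The selection criterion can be sketched in plain Python. This is a minimal illustration, not the actual thresholds/find_th.py implementation; the constraint values mirror the flags shown below, and the `(threshold, precision, recall)` triples are assumed to come from evaluating each candidate cutoff against the Gemma annotations:

```python
def f_beta(precision: float, recall: float, beta: float = 0.1) -> float:
    """Weighted harmonic mean of precision and recall.

    beta < 1 weights precision more heavily; beta = 0.1 makes the
    score track precision almost exclusively.
    """
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def pick_threshold(candidates, min_recall=0.1, min_precision=0.9, min_score=0.1):
    """Pick the candidate threshold maximizing F-beta subject to floors.

    `candidates` is a list of (threshold, precision, recall) triples,
    e.g. computed against the Gemma annotations at each candidate cutoff.
    Returns None if no candidate satisfies all constraints.
    """
    best = None
    for th, p, r in candidates:
        if th < min_score or p < min_precision or r < min_recall:
            continue
        score = f_beta(p, r)
        if best is None or score > best[1]:
            best = (th, score)
    return best[0] if best else None
```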

During filtering, we pick the first (language, score) pair with score > language_threshold. If none qualifies and we have a threshold for the top language, we mark the sample for removal; otherwise we route it to the top language.
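The routing rule above can be sketched as follows (a minimal illustration under the stated rule, not the pipeline's actual code; `lang_scores` is assumed to be sorted by descending classifier score):

```python
def route_language(lang_scores, thresholds):
    """Route a document given ordered (language, score) pairs.

    lang_scores: (language, score) pairs, sorted by descending score.
    thresholds: per-language score thresholds (may lack some languages).
    Returns the chosen language, or None to mark the sample for removal.
    """
    # Pick the first pair whose score clears its language's threshold.
    for lang, score in lang_scores:
        th = thresholds.get(lang)
        if th is not None and score > th:
            return lang
    # No pair qualified: drop if we do have a threshold for the top
    # language, otherwise fall back to routing to the top language.
    top_lang, _ = lang_scores[0]
    return None if top_lang in thresholds else top_lang
```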

Gemma‑labeled dataset: HuggingFaceFW/finepdfs_lang_classification

python thresholds/find_th.py --min-recall=0.1 --min-precision=0.9 --min-score=0.1 --workers=11 --th_file=th_values.json

Use the Gemma classification script to run classification.

Quality classifier training

We evaluated the following filters/classifiers:

  • dclm
  • ocr quality
  • edu
  • edu v2 (beyond undergraduate‑level textbooks)

Prompts are in the labeling code. We found that only edu and dclm yield meaningful gains, with edu clearly leading. For non‑English languages we provide a BERT‑based edu classifier. For English, we provide a multi‑head classifier on top of ModernBERT for efficient multi‑task inference. The threshold is selected by taking the top 10% of scores.
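Picking a cutoff that keeps the top 10% of scores can be sketched as below (an illustrative helper, not the pipeline's actual code; the function name and signature are assumptions):

```python
def top_fraction_threshold(scores, fraction=0.10):
    """Score cutoff such that ~`fraction` of documents score at or above it.

    Sorts scores descending and returns the score at the boundary rank,
    so `score >= threshold` keeps roughly the top `fraction` of documents.
    """
    ranked = sorted(scores, reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return ranked[k - 1]
```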

For labeling we used Qwen3-235B-A22B-Instruct-2507, which most closely matched Claude Sonnet 3.7 among open‑source LLMs.

Code:

Datasets/models:

  • HuggingFaceFW/finepdfs_fw_edu_labeled (edu labeling for all languages)
  • HuggingFaceFW/finepdfs_edu_classifier_{language} (distilled edu classifier)
  • HuggingFaceFW/finepdfs_{dclm|ocr_quality}_classifier_eng_Latn (distilled dclm/OCR‑quality classifier)
  • HuggingFaceFW/finepdfs_edu_classifier_v2_eng_Latn (distilled edu v2 classifier)

Repository structure (where things live)

Limitations (high‑level)

  • Docling: extracts embedded text only; content in images is missed; tables/equations may misalign; possible paragraph order issues.
  • OCR: may hallucinate or miss text, especially for low‑resource languages; page‑level failures may occur.
  • Filtering: we minimize ML‑based filtering to avoid systematic content biases; harmful content may remain.

For context and trade‑offs, see the dataset card: HuggingFaceFW/finepdfs

License

  • Dataset: ODC‑By 1.0; subject also to CommonCrawl terms.
  • Code: Top‑level code in this repository is licensed under AGPL‑3.0. Vendored components under docling_code/* retain their original licenses (Docling and Docling‑Core under MIT; Docling‑IBM‑Models per upstream). The evaluation tooling is licensed under MIT (same as Docling). PyMuPDF is used for rendering and is AGPL‑3.0. See third party notices for details.

Citation

@misc{kydlicek2025finepdfs,
  title        = {FinePDFs},
  author       = {Hynek Kydl{\'\i}{\v{c}}ek and Guilherme Penedo and Leandro von Werra},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/datasets/HuggingFaceFW/finepdfs}}
}
