Document parsing has become a significant challenge in the era of large language models: mining important textual information from large amounts of specialized, domain-specific data, most of it in PDF form. No open-source document parsing framework offers a complete workflow from raw document processing to the final JSONL output. Building on several open-source projects, this project provides a comprehensive parsing framework for PDF documents. The process begins with layout recognition, which produces structured JSON output. Post-processing then updates the structured JSON for challenging elements such as formulas, images, and tables. Finally, the data is assembled into a Markdown file.
Key features of the project include:
- Multi-process and multi-threaded execution for parallel acceleration (see the sketch after this list)
- Integration with open-source multimodal large models for image and table processing
- Easy to extend with additional functionality for secondary development
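As a rough illustration of the parallel design (a minimal sketch: parse_one_pdf is a hypothetical stand-in for the per-file parsing step, and the input directory follows the commands used later in this README):

import os
from concurrent.futures import ProcessPoolExecutor

def parse_one_pdf(pdf_path):
    # Hypothetical stand-in for the per-file layout-recognition step.
    print(f"parsing {pdf_path}")

if __name__ == "__main__":
    pdf_dir = "./data/input"
    pdfs = [os.path.join(pdf_dir, f) for f in os.listdir(pdf_dir) if f.endswith(".pdf")]
    # One worker process per CPU-bound parsing task.
    with ProcessPoolExecutor(max_workers=2) as pool:
        list(pool.map(parse_one_pdf, pdfs))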
The framework for parsing PDF documents in this project is as follows:
For parts of the environment setup, you can refer to the PaddleOCR Installation Guide, or follow the steps below:
conda create --name llmpro python=3.9
conda activate llmpro
python3 -m pip install "paddlepaddle-gpu" -i https://mirror.baidu.com/pypi/simple
cd llm-pdf-parsing/PaddleOCR
python3 -m pip install -r ppstructure/recovery/requirements.txt
# Install PaddleOCR; version 2.6 or later is recommended.
pip3 install "paddleocr>=2.6"
# Required for Markdown conversion
pip install markdownify
# Formula support
pip install --upgrade unimernet
pip install transformers==4.40.0 # make sure this exact transformers version is used
sudo apt-get install libmagickwand-dev
# InternVL support: refer to the guide below (a separate environment can be created for it)
# https://github.com/OpenGVLab/InternVL/blob/main/INSTALLATION.md
# Note the transformers version required here
pip install transformers==4.33.0
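After installation, you can sanity-check the GPU build of PaddlePaddle; paddle.utils.run_check() is part of PaddlePaddle's public API:

import paddle
# Verifies the installation and reports whether PaddlePaddle can use the GPU.
paddle.utils.run_check()
print(paddle.device.get_device())  # e.g. "gpu:0" when CUDA is available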
The directory structure of the llm-pdf-parsing project is as follows:
.
├── assets
├── data                # Folder for archived data
├── LICENSE
├── logs                # Folder for logs
├── paddleocr           # PaddleOCR's main directory, containing its various functions, including model training
├── pipline             # Main program folder for PDF processing
│   ├── clean-jsonl.py
│   ├── json2markdown.py
│   ├── markdown2jsonl.py
│   ├── pdf-structure-mgpu.py
│   ├── pdf-structure.py
│   ├── pos-process-figure-mgpu.py
│   ├── pos-process-mgpu.py
│   ├── pos-process.py
│   ├── pos-process-single-cnocr.py
│   ├── pos-process-single.py
│   ├── pos-process-table-mgpu.py
│   └── update-ppstru-json.py
├── README.md
├── tools               # Folder for utility functions
└── weights             # Folder where the model weights are stored
    ├── pix2text-mfr
    ├── ppocr_weights
    ├── readme.txt
    └── unimernet
Run the shell scripts directly from the run directory:
conda activate llmpro
# Without handling of tables and figures
bash run/run_wo_fig_tab.sh
# With handling of both tables and figures (via InternVL)
bash run/run_both_fig_tab.sh
# With handling of tables only (via InternVL)
bash run/run_wo_fig.sh
Alternatively, the pipeline can be run step by step. Open the pipline directory and follow the sequence of steps below:
1. Run pdf-structure.py to parse the structure of the PDF and generate a structured JSON file:
python pdf-structure.py
# Multi-GPU, multi-process version (to fully utilize the GPUs)
python pdf-structure-mgpu.py --input_directory ./data/input/ --output_directory ./data/output/ --num_processes 2
To customize the run, simply edit the args parameter list:
args = {
    '--image_dir': '../demo/demo.pdf',
    '--det_model_dir': '../weights/ppocr_weights/det/ch/ch_PP-OCRv4_det_infer',
    '--rec_model_dir': '../weights/ppocr_weights/rec/ch/ch_PP-OCRv4_rec_infer',
    '--rec_char_dict_path': '../paddleocr/ppocr/utils/ppocr_keys_v1.txt',
    '--table_model_dir': '../weights/ppocr_weights/table/ch_ppstructure_mobile_v2.0_SLANet_infer',
    '--table_char_dict_path': '../paddleocr/ppocr/utils/dict/table_structure_dict_ch.txt',
    '--layout_model_dir': '../weights/ppocr_weights/layout/picodet_lcnet_x1_0_fgd_layout_cdla_infer',
    '--layout_dict_path': '../paddleocr/ppocr/utils/dict/layout_dict/layout_cdla_dict.txt',
    '--recovery': 'True',
    '--output': '../data/demo/',
    '--use_pdf2docx_api': 'False',
    '--mode': 'structure',
    '--return_word_box': 'False',
    '--use_gpu': 'True'
}
Or just run the following command:
python ../paddleocr/ppstructure/predict_system.py --image_dir /path/to/pdf/ --det_model_dir ./weights/ppocr_weights/det/ch/ch_PP-OCRv4_det_infer --rec_model_dir ./weights/ppocr_weights/rec/ch/ch_PP-OCRv4_rec_infer --rec_char_dict_path ./PaddleOCR/ppocr/utils/ppocr_keys_v1.txt --table_model_dir ./weights/ppocr_weights/table/ch_ppstructure_mobile_v2.0_SLANet_infer --table_char_dict_path ./PaddleOCR/ppocr/utils/dict/table_structure_dict_ch.txt --layout_model_dir ./weights/ppocr_weights/layout/picodet_lcnet_x1_0_fgd_layout_cdla_infer --layout_dict_path ./PaddleOCR/ppocr/utils/dict/layout_dict/layout_cdla_dict.txt --recovery True --output /path/to/out/ --use_pdf2docx_api False --mode structure --return_word_box False --use_gpu True
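A rough sketch of how such a multi-GPU dispatch can work (illustrative only; the actual pdf-structure-mgpu.py may differ, and the file names and GPU ids here are assumptions):

import os
import subprocess
from multiprocessing import Pool

GPUS = [0, 1]  # assumed GPU ids

def run_on_gpu(task):
    gpu_id, pdf_path = task
    # Pin each worker to one GPU via CUDA_VISIBLE_DEVICES, then run the single-file parser.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
    subprocess.run(["python", "pdf-structure.py", "--image_dir", pdf_path], env=env, check=True)

if __name__ == "__main__":
    pdfs = ["a.pdf", "b.pdf", "c.pdf"]  # hypothetical inputs
    tasks = [(GPUS[i % len(GPUS)], p) for i, p in enumerate(pdfs)]
    with Pool(processes=len(GPUS)) as pool:
        pool.map(run_on_gpu, tasks)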
2. Run pos-process.py to post-process formulas, images, and tables, updating the JSON files generated in the previous step (it processes the JSON files in all subdirectories of the specified directory):
python pos-process.py --input_directory /path/to/out/structure
# Multi-GPU, multi-process version; it calls the pos-process-single function multiple times (to process the JSON files in the specified directory)
python pos-process-mgpu.py --input_directory ../data/output/structure --config_path ./weights/unimernet/demo.yaml --num_processes 2
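To locate the files this step operates on, one can walk the output tree for the *_ocr.json results (a minimal sketch; the directory path follows the commands above):

import json
from pathlib import Path

def iter_structure_jsons(root):
    # The structured results end in _ocr.json (see the format section below).
    for path in Path(root).rglob("*_ocr.json"):
        with open(path, encoding="utf-8") as f:
            yield path, json.load(f)

for path, data in iter_structure_jsons("../data/output/structure"):
    print(path, "pages:", len(data.get("pdf_info", [])))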
3. Run json2markdown.py to convert the JSON files to Markdown format:
python json2markdown.py --input_directory ../data/output/structure
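For intuition, a minimal sketch of this kind of conversion, based on the structured JSON format documented below (the heading level and block handling are assumptions, not the actual json2markdown.py logic):

def block_text(block):
    # Concatenate the span contents of a block (see the JSON format section below).
    return "".join(
        span.get("content", "")
        for line in block.get("lines", [])
        for span in line.get("spans", [])
    )

def page_to_markdown(page_info):
    parts = []
    for block in page_info.get("para_blocks", []):
        text = block_text(block)
        if block.get("type") == "title":
            parts.append("#### " + text)      # heading level is an assumption
        elif block.get("type") == "interline_equation":
            parts.append("$$" + text + "$$")  # the content is already LaTeX
        else:
            parts.append(text)
    return "\n\n".join(parts)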
4. Run markdown2jsonl.py to convert the Markdown files to JSONL format:
python markdown2jsonl.py --input_directory ../data/output/structure
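Conceptually, this step writes one record per Markdown file, matching the {"id", "content"} layout shown at the end of this README (a sketch; the random id scheme is an assumption):

import json
import secrets
from pathlib import Path

def markdown_to_jsonl(md_dir, out_file):
    with open(out_file, "w", encoding="utf-8") as out:
        for md in sorted(Path(md_dir).rglob("*.md")):
            record = {
                "id": secrets.token_urlsafe(15),   # assumed id scheme
                "content": md.read_text(encoding="utf-8"),
            }
            out.write(json.dumps(record, ensure_ascii=False) + "\n")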
5. (Optional) Process figures and tables using a large multimodal model (InternVL); see the sketch after this step.
- InternVL processing
python pos-process-figure-mgpu.py --input_directory ../data/output/structure --gpus 0,1,2,3,5,6,7 --model_path ../../HFs/InternVL-Chat-V1-5
python pos-process-table-mgpu.py --input_directory ../data/output/structure --gpus 0,1,2,3,5,6,7 --model_path ../../HFs/InternVL-Chat-V1-5
- Update the structured JSON files
python update-ppstru-json.py --input_directory ../data/output/structure --process figures  # choose one of: figures, tables, both
Then repeat steps 3 and 4 above to obtain the final training format.
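The two scripts in step 5 crop figure/table regions and ask InternVL to describe or transcribe them. A heavily simplified sketch of that interaction, following the usage pattern from the InternVL repository (the preprocessing here is a minimal stand-in for InternVL's own dynamic tiling, and the prompt and file name are assumptions):

import torch
import torchvision.transforms as T
from PIL import Image
from transformers import AutoModel, AutoTokenizer

path = "../../HFs/InternVL-Chat-V1-5"  # model path from the commands above
model = AutoModel.from_pretrained(path, torch_dtype=torch.bfloat16,
                                  trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

# Minimal preprocessing: one 448x448 tile normalized with ImageNet statistics.
# InternVL's own examples use a more elaborate dynamic tiling scheme.
transform = T.Compose([
    T.Resize((448, 448)),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
image = Image.open("table_crop.jpg").convert("RGB")  # hypothetical cropped region
pixel_values = transform(image).unsqueeze(0).to(torch.bfloat16).cuda()

question = "Convert this table to Markdown."  # illustrative prompt
response = model.chat(tokenizer, pixel_values, question,
                      dict(max_new_tokens=1024, do_sample=False))
print(response)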
6. Clean the JSONL data, applying desensitization and regex-based rules to remove outliers:
python clean-jsonl.py --input_file /path/to/jsonl --delete_strs "key1" "key2"
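A minimal sketch of this kind of cleaning (the e-mail masking rule is illustrative; the actual clean-jsonl.py rules may differ):

import json
import re

def clean_jsonl(input_file, delete_strs):
    # Keep records that contain none of the given strings, and mask
    # e-mail addresses as an example desensitization rule.
    kept = []
    with open(input_file, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            content = record.get("content", "")
            if any(s in content for s in delete_strs):
                continue
            record["content"] = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL]", content)
            kept.append(record)
    return kept

cleaned = clean_jsonl("/path/to/jsonl", ["key1", "key2"])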
1. Structured JSON file (JSON files ending in _ocr.json)
The JSON data structure is as follows:
# Parsing is aimed mainly at PDFs
pdf_info = {
    "pdf_info": [page_info1, page_info2, ...],
    "_parse_type": "ocr"
}
# Information extracted from each page
page_info = {
    "preproc_blocks": [],      # in reading order (contains all required elements)
    "page_idx": page_idx,      # needs to be updated
    "images": [],              # images, including image captions, go here
    "tables": [],              # tables, including table captions, go here
    "interline_equations": [], # interline (display) equations go here
    "discarded_blocks": [],    # blocks to be discarded go here
    "para_blocks": []          # in paragraph order; holds the post-processed form of preproc_blocks
}
################ Image and image caption handling ################
# Structure of "images" (three levels)
{
    "type": "image",
    "bbox": [],
    "blocks": []  # contains image_body and image_caption
}
## image_body and image_caption both have the structure
{
    "bbox": [],
    "type": "image_body",  # or "image_caption"
    "lines": []
}
### contents of "lines"
#### image_body
{
    "bbox": [],
    "spans": [
        {
            "bbox": [],
            "type": "image",
            "image_path": "XXX.jpg"
        }
    ]
}
#### image_caption
{
    "bbox": [],
    "spans": [
        {
            "bbox": [],
            "content": "XXXX",
            "type": "inline_equation"  # or "text"
        }
    ]
}
################ Table and table caption handling ################
(same format as above)
################ Text and text title handling ################
# text/title
{
    "type": "text",  # or "title"
    "bbox": [],
    "lines": []  # contains the text lines
}
The "lines" contain:
{
    "bbox": [],
    "spans": [
        {
            "bbox": [],
            "content": "XXX",
            "type": "inline_equation"  # or "text"
        }
    ]
}
################ Formula handling ################
# interline_equations
{
    "type": "interline_equation",
    "bbox": [],
    "lines": []  # contains the equation spans
}
The "lines" contain:
{
    "bbox": [],
    "spans": [
        {
            "bbox": [],
            "content": "XXX",
            "type": "interline_equation"  # converted to LaTeX format, so post-processing is required
        }
    ]
}
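As an example of consuming this structure, the following sketch pulls each image's path and caption text out of a page_info dict (field names follow the format above; the traversal itself is illustrative):

def iter_images(page_info):
    # Walk the three-level structure: images -> blocks -> lines -> spans.
    for image in page_info.get("images", []):
        path, caption = None, []
        for block in image.get("blocks", []):
            for line in block.get("lines", []):
                for span in line.get("spans", []):
                    if block.get("type") == "image_body":
                        path = span.get("image_path")
                    elif block.get("type") == "image_caption":
                        caption.append(span.get("content", ""))
        yield path, "".join(caption)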
2. Large-model training JSONL format (each Markdown file becomes one record in the JSONL file):
{
"id": "BkORdv3xK7IA0HG7pccr",
"content": "\\*诗作[222]\n录自索菲娅·马克思的笔记本\n#### 人生\n时光倏忽即逝,\n宛如滔滔流水;\n时光带走的一切,\n永远都不会返回。\n生就是死,\n生就是不断死亡的过程;\n人们奋斗不息,\n却难以摆脱困顿;\n人走完生命的路,\n最后化为乌有;\n他的事业和追求\n湮没于时光的潮流。\n对于人的事业,\n精灵们投以嘲讽的目光;\n因为人的渴望是那样强烈,\n而人生道路是那样狭窄迷茫;\n人在沾沾自喜之后,\n便感到无穷的懊丧;\n那绵绵不尽的悔恨\n深藏在自己的心房;\n人贪婪追求的目标\n其实十分渺小;\n人生内容局限于此,\n那便是空虚的游戏。\n有人自命不凡,\n其实并不伟大;\n这种人的命运,\n就是自我丑化。\n卡尔·马克思\n#### 查理大帝\n使一个高贵心灵深受感动的一切,\n使所有美好心灵欢欣鼓舞的一切,\n如今已蒙上漆黑的阴影,\n野蛮人的手亵渎了圣洁光明。\n巍巍格拉亚山的崇高诗人,\n曾满怀激情把那一切歌颂,\n激越的歌声使那一切永不磨灭,\n诗人自己也沉浸在幸福欢乐之中。\n高贵的狄摩西尼热情奔放,\n曾把那一切滔滔宣讲,\n面对人山人海的广场,\n演讲者大胆嘲讽高傲的菲力浦国王。\n那一切就是崇高和美,\n那一切笼罩着缪斯的神圣光辉,\n那一切使缪斯的子孙激动陶醉,\n如今却被野蛮人无情地摧毁。\n这时查理大帝挥动崇高魔杖,\n呼唤缪斯重见天光;\n他使美离开了幽深的墓穴,\n他让一切艺术重放光芒。\n他改变陈规陋习,\n他发挥教育的神奇力量;\n民众得以安居乐业,\n因为可靠的法律成了安全的保障。\n他进行过多次战争,\n杀得尸横遍野血染疆场;\n他雄才大略英勇顽强,\n但辉煌的胜利中也隐含祸殃;\n他为善良的人类赢得美丽花冠,\n这花冠比一切战功都更有分量;\n他战胜了那个时代的蒙昧,\n这就是他获得的崇高奖赏。\n在无穷无尽的世界历史上,\n他将永远不会被人遗忘,\n历史将为他编织一顶桂冠,\n这桂冠决不会淹没于时代的激浪。\n卡尔·马克思于1833年\n#### 莱茵河女神\n**叙事诗**\n(见本卷第885—889页)\n#### 盲女\n**叙事诗**\n(见本卷第852—858页)\n#### 两重天\n**乘马车赴柏林途中**\n(见本卷第475—478页)\n#### 父亲诞辰献诗。1836年\n**(见本卷第845—846页)**\n#### 席勒\n**十四行诗两首**\n(见本卷第846—847页)\n#### 歌德\n**十四行诗两首**\n(见本卷第848—849页)\n#### 女儿\n**叙事诗**\n(见本卷第838—841页)\n#### 凄惨的女郎\n**叙事诗**\n(见本卷第533—537页)\n卡·马克思写于1833年一大约1837年\n第一次用原文发表于《马克思恩格斯全集》1975年历史考证版第1部分第1卷\n并用俄文发表于《马克思恩格斯全集》1975年莫斯科版第40卷\n原文是德文\n中文根据《马克思恩格斯全集》1975年历史考证版第1部分第1卷翻译\n---\n**注释:**\n[222]马克思的这些诗作是他的姐姐索菲娅抄录在一个笔记本里的。除了马克思的诗作外,笔记本里还有其他人的诗作以及索菲娅自己和她的亲友的个人记事。马克思的这些诗作,除了《人生》和《查理大帝》外都在马克思的几本诗集和索菲娅的纪念册里出现过。《查理大帝》一诗注明写作日期是1833年,可见马克思早在中学时代就已开始写诗了。《盲女》注明写作日期是1835年。为祝贺父亲生日而献给亨利希·马克思的诗作的写作日期应该不晚于1836年初。——913。"
}
- UniMERNet: A Universal Network for Real-World Mathematical Expression Recognition.
- PaddleOCR: Awesome multilingual OCR toolkits based on PaddlePaddle.
- PPStructure: An intelligent document analysis system.
- InternVL: InternVL Family: A Pioneering Open-Source Alternative to GPT-4V.
- CnSTD: A Python 3 package for Chinese/English scene text detection (STD), mathematical formula detection (MFD), and layout analysis, based on PyTorch/MXNet.
- Miner-PDF-Benchmark: MPB (Miner-PDF-Benchmark) is an end-to-end PDF document comprehension evaluation suite designed for large-scale model data scenarios.
- MinerU: MinerU is a one-stop, open-source data extraction tool that supports PDF/webpage/e-book extraction.