momoxia/MinerU

 
 

MinerU Modified (3.1.5)

A fork of MinerU (upstream 3.1.5) with enhanced content ordering to ensure chunk order matches PDF layout.

Introduction

This modified version ensures content list chunk order is consistent with PDF layout order. The key improvement is that extracted content blocks are ordered by their physical position on the page (top-to-bottom, left-to-right), making it easier to reconstruct document structure for downstream tasks.
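As a rough illustration of the top-to-bottom, left-to-right ordering described above, the sketch below sorts blocks by page, then by the top edge of the bounding box, then by the left edge. It is a naive approximation for intuition only and not the fork's actual ordering logic; the sample blocks and the `reading_order_key` helper are hypothetical.

```python
from typing import Any


def reading_order_key(item: dict[str, Any]) -> tuple[int, float, float]:
    """Sort key: page first, then top edge (y0), then left edge (x0)."""
    x0, y0, _x1, _y1 = item["bbox"]
    return (item["page_idx"], y0, x0)


# Hypothetical blocks, deliberately listed out of layout order.
blocks = [
    {"text": "right column", "bbox": [520, 100, 950, 300], "page_idx": 0},
    {"text": "title", "bbox": [100, 40, 900, 80], "page_idx": 0},
    {"text": "next page", "bbox": [100, 50, 900, 90], "page_idx": 1},
    {"text": "left column", "bbox": [50, 100, 480, 300], "page_idx": 0},
]

ordered = sorted(blocks, key=reading_order_key)
print([b["text"] for b in ordered])
```

A plain coordinate sort like this breaks down on multi-column pages where columns start at different heights, which is one reason real layout ordering needs more than a sort key.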

This fork is based on upstream MinerU 3.1.5. That release builds on 3.0.9's SEAL/CHART recognition, DOCX/PPTX/XLSX parsing, and mineru-router multi-GPU routing, and adds further Office document fidelity improvements and multi-process API refinements.

Content List Data Structure

The extraction output is saved as *_content_list.json. Each item in the list has the following structure:

{
    "type": "text",           // Content type: "text", "image", "table", "chart", "seal", "code"
    "text": "...",            // Text content (for type="text")
    "text_level": 1,          // Heading level: 1=h1, 2=h2, 3=h3 (optional)
    "bbox": [x0, y0, x1, y1], // Bounding box coordinates (normalized to 1000)
    "page_idx": 0,            // Page index (0-based)
    "id": 1,                  // Sequential content ID, consistent with PDF layout order

    // Image-specific fields (for type="image"):
    "img_path": "images/xxx.jpg",
    "image_caption": [],
    "image_footnote": []
}
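Because `bbox` values are normalized to 1000, mapping a block back onto a rendered page requires scaling by the page size. The sketch below assumes x values are normalized by page width and y values by page height; that convention is an assumption, not something confirmed here, and `denormalize_bbox` is a hypothetical helper.

```python
from typing import Sequence


def denormalize_bbox(
    bbox: Sequence[float],
    page_width: float,
    page_height: float,
    scale: float = 1000.0,
) -> tuple[float, float, float, float]:
    """Map a bbox normalized to `scale` back to page units.

    Assumption: x is normalized by page width and y by page height.
    """
    x0, y0, x1, y1 = bbox
    return (
        x0 * page_width / scale,
        y0 * page_height / scale,
        x1 * page_width / scale,
        y1 * page_height / scale,
    )


# Example: an A4 page measured in PDF points (595 x 842).
box_pt = denormalize_bbox([100, 50, 900, 100], page_width=595, page_height=842)
print(box_pt)
```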

ID Conventions

  • Numeric IDs (1, 2, 3...): Main content blocks, ordered by layout position
  • D-prefixed IDs ("D1", "D2"...): Discarded/auxiliary blocks (headers, footers, page numbers, etc.)
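Given these conventions, a downstream consumer can separate main content from discarded blocks by inspecting the `id` field. The following is a minimal sketch that assumes every item carries an `id` and that numeric IDs are stored as JSON numbers (as in the structure above); the sample data and the helper names are hypothetical.

```python
import json
from typing import Any


def is_discarded(item_id: Any) -> bool:
    """D-prefixed string IDs mark discarded/auxiliary blocks."""
    return isinstance(item_id, str) and item_id.startswith("D")


def split_content_list(
    items: list[dict[str, Any]],
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Return (main blocks sorted by numeric ID, discarded blocks)."""
    main = [it for it in items if not is_discarded(it["id"])]
    discarded = [it for it in items if is_discarded(it["id"])]
    main.sort(key=lambda it: int(it["id"]))
    return main, discarded


# Hypothetical sample; in practice this would be the parsed
# contents of a *_content_list.json file.
sample = json.loads("""
[
  {"type": "text", "text": "Body",   "page_idx": 0, "id": 2},
  {"type": "text", "text": "Page 1", "page_idx": 0, "id": "D1"},
  {"type": "text", "text": "Title",  "page_idx": 0, "id": 1, "text_level": 1},
  {"type": "image", "img_path": "images/xxx.jpg", "page_idx": 0, "id": 3}
]
""")

main, discarded = split_content_list(sample)
print([it["id"] for it in main], [it["id"] for it in discarded])
```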

Deploy

Quick Start (Local Build)

# Build and start API service
docker compose -f docker/compose.yml --profile api up -d

# Build and start Gradio UI
docker compose -f docker/compose.yml --profile gradio up -d

# Build and start OpenAI-compatible VLM server
docker compose -f docker/compose.yml --profile openai-server up -d

Multi-GPU with mineru-router (New in 3.0.9)

mineru-router is a load-balancing layer that manages multiple mineru-api workers across GPUs:

# Auto-detect all GPUs, one worker per card
mineru-router --host 0.0.0.0 --port 8002 --local-gpus auto

# Specify GPUs
mineru-router --host 0.0.0.0 --port 8002 --local-gpus 0,1,2

# Aggregate existing mineru-api instances
mineru-router --host 0.0.0.0 --port 8002 \
  --local-gpus none \
  --upstream-url http://api1:8000 \
  --upstream-url http://api2:8000
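When the router is launched from a script, the same command lines can be assembled programmatically. The sketch below uses only the flags shown above (`--host`, `--port`, `--local-gpus`, `--upstream-url`); `build_router_cmd` is a hypothetical helper, and the block only builds and prints the command without running it.

```python
import shlex
from typing import Sequence


def build_router_cmd(
    host: str = "0.0.0.0",
    port: int = 8002,
    local_gpus: str = "auto",
    upstream_urls: Sequence[str] = (),
) -> list[str]:
    """Build an argv list for mineru-router from the flags shown above."""
    cmd = [
        "mineru-router",
        "--host", host,
        "--port", str(port),
        "--local-gpus", local_gpus,
    ]
    # One --upstream-url flag per existing mineru-api instance.
    for url in upstream_urls:
        cmd += ["--upstream-url", url]
    return cmd


cmd = build_router_cmd(
    local_gpus="none",
    upstream_urls=["http://api1:8000", "http://api2:8000"],
)
print(shlex.join(cmd))
```

The resulting list can be passed directly to `subprocess.run(cmd)`, which avoids shell quoting issues.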

Available Services

Service         Command                 Default Port  Description
API Server      mineru-api              8000          FastAPI REST service for PDF parsing
OpenAI Server   mineru-openai-server    30000         vLLM OpenAI-compatible inference server
Router          mineru-router           8002          Multi-GPU load balancer (new in 3.0.9)
Gradio UI       mineru-gradio           7860          Web UI for interactive use
Model Download  mineru-models-download  -             Download required models

Architecture

Single GPU:
  User -> mineru-api (GPU 0) -> vllm engine

Multi GPU (via router):
                          ┌─ mineru-api (GPU 0) -> vllm engine
  User -> mineru-router --├─ mineru-api (GPU 1) -> vllm engine
                          └─ mineru-api (GPU 2) -> vllm engine

Configuration

Environment Variables

Variable                            Default    Description
MINERU_MODEL_SOURCE                 -          Set to local to use local model files
MINERU_TABLE_MERGE_ENABLE           true       Set to false to disable cross-page table merging (important for layout tracking)
MINERU_API_MAX_CONCURRENT_REQUESTS  3 (Mac=1)  Max concurrent requests per mineru-api instance
MINERU_PROCESSING_WINDOW_SIZE       64         Max pages processed per task
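A wrapper script may want to read these variables with the same defaults as the table. The sketch below shows one way to do that; the `MineruEnv` dataclass and `read_mineru_env` helper are hypothetical, and the boolean parsing rule (anything other than "false" counts as enabled) is an assumption rather than MinerU's own parsing logic.

```python
import os
import platform
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class MineruEnv:
    model_source: Optional[str]
    table_merge_enable: bool
    api_max_concurrent_requests: int
    processing_window_size: int


def read_mineru_env(
    env: Mapping[str, str] = os.environ,
    is_mac: Optional[bool] = None,
) -> MineruEnv:
    """Read the documented variables, falling back to the table defaults."""
    if is_mac is None:
        is_mac = platform.system() == "Darwin"
    default_concurrency = 1 if is_mac else 3  # "3 (Mac=1)" in the table
    merge_raw = env.get("MINERU_TABLE_MERGE_ENABLE", "true")
    return MineruEnv(
        model_source=env.get("MINERU_MODEL_SOURCE"),
        # Assumption: only the literal "false" disables merging.
        table_merge_enable=merge_raw.strip().lower() != "false",
        api_max_concurrent_requests=int(
            env.get("MINERU_API_MAX_CONCURRENT_REQUESTS", default_concurrency)
        ),
        processing_window_size=int(env.get("MINERU_PROCESSING_WINDOW_SIZE", 64)),
    )


cfg = read_mineru_env({"MINERU_TABLE_MERGE_ENABLE": "false"}, is_mac=False)
print(cfg)
```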

Concurrency

  • Single mineru-api: controlled by MINERU_API_MAX_CONCURRENT_REQUESTS (default 3)
  • mineru-openai-server: uses vLLM's native batching; achievable concurrency depends on GPU VRAM
  • For higher throughput: use mineru-router to scale across multiple GPUs
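To make the per-instance limit concrete, the sketch below caps the number of in-flight jobs with a semaphore, mirroring the default of 3. It illustrates the general idea of a concurrency cap on the client side and is not how mineru-api enforces its limit; `fake_parse` is a hypothetical stand-in for a real parsing request.

```python
import asyncio
from typing import Awaitable, Callable, Sequence


async def run_bounded(
    jobs: Sequence[Callable[[], Awaitable[int]]],
    limit: int = 3,
) -> tuple[list[int], int]:
    """Run all jobs with at most `limit` in flight; also report the peak."""
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def worker(job: Callable[[], Awaitable[int]]) -> int:
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await job()
            finally:
                in_flight -= 1

    results = await asyncio.gather(*(worker(job) for job in jobs))
    return list(results), peak


async def fake_parse(i: int) -> int:
    """Hypothetical stand-in for one parsing request."""
    await asyncio.sleep(0.01)
    return i * 2


jobs = [lambda i=i: fake_parse(i) for i in range(10)]
results, peak = asyncio.run(run_bounded(jobs, limit=3))
print(results, peak)
```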

What's New in 3.1.5 (vs 3.0.9)

  • Office document parsing: chart rendering via cached HTML / Excel-bytes fallback; DOCX/PPTX OMML→LaTeX with extended Unicode mapping; PPTX shape-type caching; DOCX broken-link sanitization
  • Async PDF image loading and Windows process termination support
  • API hardening: async model retrieval, configurable health-failure restart threshold, local API launch modes, timeout handling for result downloads
  • VLM: chart image content extraction, embedded table HTML formatting
  • Misumi fix: content IDs produced by make_page_to_content_list now strictly track draw_bbox numbering for IMAGE/TABLE/CHART/CODE composite blocks. Previously, IDs were misaligned for figures with a caption below, and CHART blocks were silently dropped in vlm.

What's New in 3.0.9 (vs 2.7)

  • SEAL recognition: Stamp/seal detection and content extraction
  • CHART recognition: Separate chart type (previously grouped with images)
  • DOCX/PPTX parsing: Direct Office document support
  • mineru-router: Multi-GPU load balancing
  • CONTENT_LIST_V2: Span-level structured output format
  • VLM preload: Faster cold start
  • vLLM v0.11.2: Updated inference engine
  • Improved OCR: Dynamic batch sizing, better VRAM management
