A collection of Jupyter notebooks for converting PDF documents to accessible, structured HTML using Mistral AI's OCR and vision capabilities.
This project provides tools to transform PDF documents into well-structured, WCAG-compliant HTML that preserves the original document's layout, formatting, and content. It uses Mistral's OCR API to extract text and images, then processes them to create accessible web content.
This repository contains the following notebooks:
The original example notebook from Mistral that demonstrates basic OCR extraction and structured data formatting.
A simplified notebook for PDF-to-HTML conversion with a focus on Google Colab compatibility. Features include:
- PDF upload via Google Colab's file upload mechanism
- OCR processing with Mistral OCR
- Basic HTML conversion with preserved layout
- Image extraction with basic alt text
- Download capability for the generated HTML
An enhanced version with advanced accessibility features:
- Full WCAG 2.1 compliance
- Screen reader optimized content
- Semantic HTML5 structure
- AI-generated descriptive alt text for images using Pixtral 12B
- Enhanced table accessibility with proper ARIA attributes
- Proper document structure and heading hierarchy
- Google Colab account (for running the notebooks in the cloud)
- Mistral API key from Mistral Platform
- Open the desired notebook in Google Colab
- Enter your Mistral API key in the designated cell
- Run the cells in sequence
- Upload your PDF when prompted
- The notebook will process your document and create the HTML output
- Download the generated HTML file
- Extracts text while preserving formatting
- Identifies and extracts images with their positions
- Maintains the document's structure and layout
- Semantic HTML5 elements (
<main>,<section>,<figure>, etc.) - ARIA landmarks and regions
- Proper heading hierarchy
- Skip links for keyboard navigation
- High contrast text (WCAG AA 4.5:1 ratio)
- Proper image alt text
- Accessible tables with proper markup
- Print-friendly styling
- AI-generated descriptive alt text
- Special handling for charts and graphs
- Figure captions for complex images
This project uses:
- Mistral OCR API (
mistral-ocr-latest) for text and image extraction - Pixtral 12B (
pixtral-12b-latest) for image understanding and alt text generation
Note that API usage incurs costs according to your Mistral AI account plan.
Feel free to fork this repository and submit pull requests with improvements or features.
This project is provided as-is for educational and demonstrative purposes.
- Mistral AI for providing the OCR and language model APIs
- WCAG guidelines for accessibility best practices