Mistral OCR PDF-to-HTML Converter

A collection of Jupyter notebooks for converting PDF documents to accessible, structured HTML using Mistral AI's OCR and vision capabilities.

Overview

This project provides tools to transform PDF documents into well-structured, WCAG-compliant HTML that preserves the original document's layout, formatting, and content. It uses Mistral's OCR API to extract text and images, then processes them to create accessible web content.

Notebooks

This repository contains the following notebooks:

1. `structured_ocr.ipynb`

The original example notebook from Mistral that demonstrates basic OCR extraction and structured data formatting.

2. `custom-structured-ocr.ipynb`

A simplified notebook for PDF-to-HTML conversion with a focus on Google Colab compatibility. Features include:

PDF upload via Google Colab's file upload mechanism
OCR processing with Mistral OCR
Basic HTML conversion with preserved layout
Image extraction with basic alt text
Download capability for the generated HTML

3. `custom-structured-ocr-v2.ipynb`

An enhanced version with advanced accessibility features:

Full WCAG 2.1 compliance
Screen reader optimized content
Semantic HTML5 structure
AI-generated descriptive alt text for images using Pixtral 12B
Enhanced table accessibility with proper ARIA attributes
Proper document structure and heading hierarchy

Getting Started

Prerequisites

Google Colab account (for running the notebooks in the cloud)
Mistral API key from Mistral Platform

Running the Notebooks

Open the desired notebook in Google Colab
Enter your Mistral API key in the designated cell
Run the cells in sequence
Upload your PDF when prompted
The notebook will process your document and create the HTML output
Download the generated HTML file

Features

OCR Processing

Extracts text while preserving formatting
Identifies and extracts images with their positions
Maintains the document's structure and layout

Accessibility Features

Semantic HTML5 elements (<main>, <section>, <figure>, etc.)
ARIA landmarks and regions
Proper heading hierarchy
Skip links for keyboard navigation
High contrast text (WCAG AA 4.5:1 ratio)
Proper image alt text
Accessible tables with proper markup
Print-friendly styling

Image Processing

AI-generated descriptive alt text
Special handling for charts and graphs
Figure captions for complex images

API Usage Notes

This project uses:

Mistral OCR API (mistral-ocr-latest) for text and image extraction
Pixtral 12B (pixtral-12b-latest) for image understanding and alt text generation

Note that API usage incurs costs according to your Mistral AI account plan.

Contributing

Feel free to fork this repository and submit pull requests with improvements or features.

License

This project is provided as-is for educational and demonstrative purposes.

Acknowledgments

Mistral AI for providing the OCR and language model APIs
WCAG guidelines for accessibility best practices

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
README.md		README.md
custom-structured-ocr-v2.ipynb		custom-structured-ocr-v2.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Mistral OCR PDF-to-HTML Converter

Overview

Notebooks

1. `structured_ocr.ipynb`

2. `custom-structured-ocr.ipynb`

3. `custom-structured-ocr-v2.ipynb`

Getting Started

Prerequisites

Running the Notebooks

Features

OCR Processing

Accessibility Features

Image Processing

API Usage Notes

Contributing

License

Acknowledgments

About

Uh oh!

Releases

Packages

Languages

coldplazma/Accessible-OCR-Mistral-

Folders and files

Latest commit

History

Repository files navigation

Mistral OCR PDF-to-HTML Converter

Overview

Notebooks

1. structured_ocr.ipynb

2. custom-structured-ocr.ipynb

3. custom-structured-ocr-v2.ipynb

Getting Started

Prerequisites

Running the Notebooks

Features

OCR Processing

Accessibility Features

Image Processing

API Usage Notes

Contributing

License

Acknowledgments

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

1. `structured_ocr.ipynb`

2. `custom-structured-ocr.ipynb`

3. `custom-structured-ocr-v2.ipynb`

Packages