Collection of scripts and stylesheets for conversion between various OCR formats.
You may also want to check out the excellent ocr-fileformat by @UB-Mannheim.
- abbyy2hocr.xsl - ABBYY FineReader XML to hOCR converter by @Rod Page
- abbyy2hocr.xsl - ABBYY FineReader XML to hOCR converter by @Rod Page - updated by @OCR-D
- abbyy-to-hocr - ABBYY FineReader XML to hOCR converter by @merlijn
- teip5-v5.xsl - Transform ABBYY FineReader XML into TEI @UPEI
- ABBYY_to_TEI_by_XMLReader.php - Convert ABBYY XML to TEI using PHP's XMLReader @able-project
- ocr_to_teifacsimile.xsl - Generate page-level TEI facsimile from ABBYY OCR XML or METS/ALTO @readux
- AbbyyToAlto.php - PHP5 to convert ABBYY FineReader XML into ALTO XML @ironymark
- AbbyyToAltoConverter.java - Java library to convert abbyy.xml (v10) to alto.xml (v2) @abbyy-to-alto
- alto2tei.xsl - Output TEI from ALTO input format @OpenConvert
- AltoToTeiA.xsl - For Gale OCR XML or 18thConnect Typewright XML files @typewright
- ocr_to_teifacsimile.xsl - Generate page-level TEI facsimile from ABBYY OCR XML or METS/ALTO @readux
- alto2hocr.xsl - Convert ALTO 2.0 / ALTO 2.1 to hOCR @filak
- alto2text.xsl - Convert ALTO 2.0 / ALTO 2.1 to plain text @filak
- alto_ocr_text.py - Extracts the text from an ALTO file and writes it to stdout @cneud
- ALTO2HTML.bat - Batch script to convert ALTO files to HTML @altomator
- dinglehopper-extract - Extracts the text from ALTO and PAGE XML files @qurator-spk
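
To illustrate what "extracting the text from an ALTO file" involves, here is a minimal sketch using only the Python standard library. It is not the code of any tool listed above. It assumes the common ALTO layout in which each `TextLine` element holds `String` children carrying the word in a `CONTENT` attribute. The function name `alto_to_text` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace so any ALTO schema version matches."""
    return tag.rsplit("}", 1)[-1]


def alto_to_text(xml_string):
    """Return the text of an ALTO document, one output line per TextLine."""
    root = ET.fromstring(xml_string)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) == "TextLine":
            # Assumption: words are String children with a CONTENT attribute;
            # SP (space) and HYP (hyphen) elements are skipped here.
            words = [
                child.get("CONTENT", "")
                for child in elem
                if local_name(child.tag) == "String"
            ]
            lines.append(" ".join(words))
    return "\n".join(lines)


# Hypothetical two-line sample document, for illustration only.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine><String CONTENT="Hello"/><SP/><String CONTENT="world"/></TextLine>
    <TextLine><String CONTENT="Second"/><SP/><String CONTENT="line"/></TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

if __name__ == "__main__":
    print(alto_to_text(SAMPLE))
```

Matching on the local element name rather than a fixed namespace keeps the sketch usable across ALTO versions, at the cost of ignoring hyphenation and reading-order details that the dedicated tools may handle.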
- hOCR2ALTO.xsl - Utilities to process and handle hOCR @ONB-RD
- hocr2alto2.0.xsl - Convert hOCR to ALTO 2.0 @filak
- hocr2alto2.1.xsl - Convert hOCR to ALTO 2.1 @filak
- hocr2tei.xsl - Convert hOCR from Tesseract to basic TEI output @DH2015
- hocr2tei.xsl - Convert hOCR from Tesseract to basic TEI output from @DH2015 - updated by @OCR-D
- hocr2text.xsl - Convert hOCR to plain text @filak
- HocrConverter.py - Create a PDF from an hOCR file and an image @jbrinley
- PageConverter.java - Convert ALTO XML, FineReader XML, Google CV, and hOCR to the latest PAGE XML format @prima
- xml_to_box.xsl - Convert PAGE XML to Tesseract box file @eMOP
- page_to_text.py - Extracts the text from a PAGE file and writes it to stdout @cneud
- PageToPdfConverter.java - Convert PAGE XML files with layout and text content to PDF @prima
- page2tei-0.xsl - Convert PAGE XML to TEI @dariok
- PageToAlto.xsl - Convert PAGE XML to ALTO @Transkribus
- page-to-alto - Convert PAGE XML to ALTO (all versions) @kba
- dinglehopper-extract - Extracts the text from ALTO and PAGE XML files @qurator-spk
- tei2txt.xsl - Convert DTA TEI-P5 to plain text @haoess
- tei2hocr.xsl - Convert DTA TEI-P5 to hOCR @jbaiter
- iw2alto.xsl - Convert ImageWare MyBib eL OCR to ALTO @karkraeg
- transkribus-xslt - Various stylesheets from Transkribus @readcoop
- transkribus-to-prima - Convert Transkribus dialect to official PAGE XML format @kba
- textract2page - Convert Amazon AWS Textract to PAGE XML @slub
- gcv2hocr - Convert Google Cloud Vision to hOCR @dinosauria123