Krishyam Techlabs, Delhi
OCR
Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.
Tesseract Open Source OCR Engine (main repository)
OCR, layout analysis, reading order, and table recognition in 90+ languages
Extract and convert data from any document (images, PDFs, Word documents, PPT files, or URLs) into multiple formats (Markdown, JSON, CSV, HTML), with intelligent structured data extraction and advanced OCR.
PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
A polyglot document intelligence framework with a Rust core. Extract text, metadata, and structured information from PDFs, Office documents, images, and 50+ formats. Available for Rust, Python, Rub…
A Comprehensive Toolkit for High-Quality PDF Content Extraction
Get your documents ready for gen AI
Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website …
Toolkit for linearizing PDFs for LLM datasets/training
Official code implementation of General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
An OCR model that handles complex tables, forms, and handwriting with full layout.
An on-premises, OCR-free unstructured data extraction, markdown conversion and benchmarking toolkit. (https://idp-leaderboard.org/)
No-code LLM Platform to launch APIs and ETL Pipelines to structure unstructured documents
A Python library to extract tabular data from PDFs