Anuj Aggarwal anujaggarwal

7 followers · 9 following

Krishyam Techlabs
Delhi

Stars

OCR

16 repositories

PaddlePaddle / PaddleOCR

Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.

Python · 66,596 stars · 9,527 forks · Updated Dec 16, 2025

tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)

C++ · 71,493 stars · 10,430 forks · Updated Dec 15, 2025

datalab-to / surya

OCR, layout analysis, reading order, table recognition in 90+ languages

Python · 18,993 stars · 1,299 forks · Updated Oct 21, 2025

NanoNets / docstrange

Extract and convert data from any document, image, PDF, Word doc, PPT, or URL into multiple formats (Markdown, JSON, CSV, HTML) with intelligent structured data extraction and advanced OCR.

Python · 1,096 stars · 103 forks · Updated Oct 31, 2025

pymupdf / PyMuPDF

PyMuPDF is a high-performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

Python · 8,706 stars · 678 forks · Updated Dec 18, 2025

kreuzberg-dev / kreuzberg

A polyglot document intelligence framework with a Rust core. Extract text, metadata, and structured information from PDFs, Office documents, images, and 50+ formats. Available for Rust, Python, Rub…

HTML · 3,019 stars · 127 forks · Updated Dec 20, 2025

opendatalab / PDF-Extract-Kit

A Comprehensive Toolkit for High-Quality PDF Content Extraction

Python · 9,022 stars · 678 forks · Updated Jan 3, 2025

docling-project / docling

Get your documents ready for gen AI

Python · 47,319 stars · 3,326 forks · Updated Dec 19, 2025

Unstructured-IO / unstructured

Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website …

HTML · 13,452 stars · 1,110 forks · Updated Dec 19, 2025