anujaggarwal
  • Krishyam Techlabs
  • Delhi

Stars

OCR

16 repositories (each entry: description, then language, stars, forks, last update)

Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.

Python 66,596 9,527 Updated Dec 16, 2025

Tesseract Open Source OCR Engine (main repository)

C++ 71,493 10,430 Updated Dec 15, 2025

OCR, layout analysis, reading order, table recognition in 90+ languages

Python 18,993 1,299 Updated Oct 21, 2025

Extract and convert data from any document, image, PDF, Word document, PowerPoint file, or URL into multiple formats (Markdown, JSON, CSV, HTML) with intelligent structured data extraction and advanced OCR.

Python 1,096 103 Updated Oct 31, 2025

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

Python 8,706 678 Updated Dec 18, 2025

A polyglot document intelligence framework with a Rust core. Extract text, metadata, and structured information from PDFs, Office documents, images, and 50+ formats. Available for Rust, Python, Rub…

HTML 3,019 127 Updated Dec 20, 2025

A Comprehensive Toolkit for High-Quality PDF Content Extraction

Python 9,022 678 Updated Jan 3, 2025

Get your documents ready for gen AI

Python 47,319 3,326 Updated Dec 19, 2025

Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website …

HTML 13,452 1,110 Updated Dec 19, 2025

Toolkit for linearizing PDFs for LLM datasets/training

Python 16,251 1,255 Updated Dec 20, 2025

Official code implementation of General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model

Python 8,037 700 Updated Feb 10, 2025

OCR Benchmark

TypeScript 601 48 Updated Oct 21, 2025

OCR model that handles complex tables, forms, handwriting with full layout.

Python 3,817 430 Updated Dec 19, 2025

An on-premises, OCR-free unstructured data extraction, markdown conversion and benchmarking toolkit. (https://idp-leaderboard.org/)

Python 1,822 136 Updated Aug 25, 2025

No-code LLM Platform to launch APIs and ETL Pipelines to structure unstructured documents

Python 6,018 576 Updated Dec 20, 2025

A Python library to extract tabular data from PDFs

Python 3,551 523 Updated Nov 12, 2025