🛠️ Build a multimodal ETL pipeline to automate the extraction, transformation, and loading of web content into structured storage for analysis.
Maxcavator 2.0 is an intelligent, AI-native PDF Data Extraction and Retrieval-Augmented Generation (RAG) system. It changes how you understand and interact with your PDF documents by extracting complex structures (sections, tables, images) and generating robust RAG indices.
Analyze all GoTo links, URI hyperlinks, URI file links, and TOC entries in a target PDF using a CLI and GUI wrapper for PyMuPDF. PyPI: https://pypi.org/project/pdflinkcheck Microsoft Store: https://apps.microsoft.com/detail/9n11hxvls1wg
ResumeGenie is an AI-based tool that analyzes resumes against job descriptions and provides fit scores, feedback, and skill improvement recommendations.
An AI-powered invoice and receipt analyzer that extracts structured invoice data from images (JPG/PNG) and PDF documents using a Large Language Model (LLM).
Automatic PDF note-taking (fichamento): a Python application that automatically extracts and formats highlighted passages in PDF documents, following Brazilian academic fichamento standards.
IntelliDoc is an intelligent document understanding system that helps users extract, analyze, and query information from PDFs, scanned documents, images, and multilingual reports using OCR, AI, and Retrieval-Augmented Generation (RAG).
Enterprise-grade AI banking chatbot built with FastAPI, Streamlit, OpenAI LLMs, pgvector RAG, secure PII masking, conversational memory, and real-time streaming.
A robust Python-based ETL pipeline designed to ingest, rasterize, and extract structured data from complex PDF documents. Unlike standard text scrapers, this engine uses a "Vision-First" approach to handle layouts, charts, and non-selectable text, preparing assets for Multimodal AI analysis.
This application provides efficient, intelligent search over the text content of PDF documents. It uses semantic search techniques to help users locate information quickly and accurately within large documents.
A multimodal RAG system that extracts text + images from PDFs, generates CLIP embeddings, stores them in FAISS, and answers queries using LangChain and a local LLM.
A Python tool for extracting text regions from PDF files, visualizing them as bounding boxes, and exporting structured data in JSON format.
The AI Tutor Platform is an intelligent educational application built with FastAPI, Streamlit, LangChain, and Groq. It provides users with an AI-powered conversational tutor, auto-generated quizzes, and a file-based doubt solver. The platform includes user authentication and progress tracking, with all data persistently stored in a PostgreSQL DB.
Python CLI & library for automated journal vetting — GPT‑4.1 summarization, YAML configuration, reproducible analysis.
Solution for the Adobe India Hackathon 2025, Team - Codient (Team Leader - Gopal Ranjan, Team Members - Rishav Kumar Sinha)
File organization for my Obsidian
ChatPDF is a web application that lets users upload PDFs and ask questions about their content.
This Python-based tool allows for efficient comparison of two or more PDF documents, highlighting the differences between them. It extracts and compares the words in the PDFs, ignoring whitespace differences, and highlights the changed, added, or missing words.
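The word-level comparison described in the entry above could be sketched roughly as follows. This is only an illustration built on Python's standard `difflib` module, not that tool's actual implementation. `diff_words` is a hypothetical helper name, and the sketch assumes the text of each PDF has already been extracted into a plain string (for example with PyMuPDF), so no PDF handling is shown.

```python
import difflib


def diff_words(old_text: str, new_text: str):
    """Return (tag, old_words, new_words) for every span that differs.

    Hypothetical helper for illustration only.
    """
    # str.split() with no arguments splits on any run of whitespace,
    # so extra spaces and line breaks never register as differences.
    old_words = old_text.split()
    new_words = new_text.split()

    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)

    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # tag is one of "equal", "replace", "delete", "insert".
        if tag != "equal":
            changes.append((tag, old_words[i1:i2], new_words[j1:j2]))
    return changes


changes = diff_words(
    "The quick  brown fox\njumps",
    "The quick red fox jumps high",
)
for tag, old, new in changes:
    print(tag, old, new)
```

Here `replace` corresponds to changed words, `insert` to added words, and `delete` to missing words. Mapping these spans back to page coordinates for highlighting would need the word positions from the PDF library, which this sketch leaves out.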