Processing and hashing Slack communication to enable language modelling
Updated Jun 7, 2023 - Python
A local GPU-accelerated Retrieval-Augmented Generation (RAG) pipeline for PDF question-answering with multi-LLM support and modular NLP components. Process documents locally with privacy-focused information retrieval.
A fast and easy-to-use Python toolkit for image processing with CLI tools for resizing, cropping, OCR, and optimization, including batch processing support.
PDF Liberation MCP Server - Break large PDFs into digestible chunks for Claude
This web application uses OCR to recognize text in uploaded images and applies spelling correction and word-level improvement to the result. Users upload images containing text and receive the recognized, corrected text.
Web application to extract text from images.
Turn Old City Directory scans into searchable data. Automated pipeline handles column detection, OCR processing, and accuracy evaluation for historical document digitization.
Extract clean plain-text from subtitle files.
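As an illustration of what this kind of extraction involves, here is a minimal sketch that assumes the common SubRip (.srt) layout of index line, timing line, then text. The function name `srt_to_text` and the regular expressions are hypothetical and are not taken from the listed repository.

```python
import re

# Timing lines look like "00:00:01,000 --> 00:00:02,500" (assumed SRT layout).
TIMING = re.compile(r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}")
# Simple inline markup such as <i>...</i> or {\an8} style overrides.
TAG = re.compile(r"<[^>]+>|\{\\[^}]*\}")


def srt_to_text(srt: str) -> str:
    """Return only the spoken text of an SRT document, one cue line per line."""
    out = []
    # Cues are separated by blank lines.
    for block in re.split(r"\n\s*\n", srt.strip()):
        lines = [ln.strip() for ln in block.splitlines()]
        # Drop the numeric cue index, but only in first position,
        # so subtitle text that happens to be a number survives.
        if lines and lines[0].isdigit():
            lines = lines[1:]
        for line in lines:
            if not line or TIMING.match(line):
                continue
            cleaned = TAG.sub("", line).strip()
            if cleaned:
                out.append(cleaned)
    return "\n".join(out)
```

Other subtitle formats (for example WebVTT or ASS) use different headers and timing syntax and would need their own handling.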
FastSnip - Free OCR screen capture tool for Windows. Extract text from anywhere on your screen with Ctrl+Shift+T. Perfect TextSniper alternative with multi-language support.
Package for ML training on GCP.
Python tool for converting PDF files to text. Simplify your document processing tasks.
A command-line tool in Go that extracts meaningful text from web pages, filters out unwanted elements, and outputs clean text for easy integration with AI applications, data mining, and web scraping.
A privacy-focused, client-side web application that extracts clean, readable content from any webpage and converts it to PDF format. Built with pure HTML, CSS, and JavaScript—no backend required, no tracking, complete privacy.
This repository contains my experiments with RAKE and its variants. RAKE is one of the most popular unsupervised approaches for automatically extracting key phrases and keywords from unstructured data sources such as reviews, news, articles, and documents.
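For readers unfamiliar with RAKE, the sketch below shows the basic idea: split text into candidate phrases at stopwords and punctuation, score each word as degree divided by frequency, and score each phrase as the sum of its word scores. It is a simplified illustration, not the repository's code; the tiny `STOPWORDS` set and the function names are assumptions made for the example.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption; real use needs a fuller list).
STOPWORDS = {
    "a", "an", "and", "are", "as", "at", "by", "for", "from",
    "in", "is", "it", "of", "on", "or", "the", "to", "with",
}


def candidate_phrases(text):
    """Split text into candidate phrases at punctuation and stopwords."""
    phrases = []
    # Punctuation ends a phrase, so cut the text into fragments first.
    for fragment in re.split(r"[^\w\s-]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake(text):
    """Return (phrase, score) pairs, highest score first."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            # Degree counts co-occurring words in the phrase, including the word itself.
            degree[word] += len(phrase)
    word_score = {w: degree[w] / freq[w] for w in freq}
    scored = {" ".join(p): sum(word_score[w] for w in p) for p in set(phrases)}
    # Sort by descending score, then alphabetically for stable output.
    return sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))
```

Because the score favors words that appear in long phrases, multi-word candidates tend to rank above single words.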
DocuParse is a high-performance tool for converting PDF documents into clean, structured Markdown files. Designed for speed and accuracy, it extracts and formats content while minimizing errors like hallucinations and repetitions.
This repository contains a Python script to extract text from images using OpenAI's GPT-4 API. The script supports text extraction from both online image URLs and locally stored images (converted to base64). It ensures accurate and structured text extraction, making it a powerful tool for OCR-like tasks. The extracted text is saved to a file
This assistant tool (WIP) helps you search, browse, and summarize answers to your questions from an uploaded PDF using text analytics, semantic search, and a Large Language Model (LLM).
The objective is to analyze text content from a list of URLs. This involves extracting article titles and text, then performing natural language processing to generate metrics like sentiment, readability, and word usage. Finally, the results are stored for further analysis or visualization.
This Python script automates the extraction of text from images using Tesseract OCR. It processes all images in the test_images/ folder and saves the extracted text as .txt files in the extracted_texts/ directory, maintaining the original image filenames.
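The workflow described above could look roughly like the following sketch. It assumes the `pytesseract` and Pillow packages and a local Tesseract install; the function names, the `IMAGE_SUFFIXES` set, and the injectable `ocr` parameter are hypothetical choices for this example, not the repository's actual code.

```python
from pathlib import Path

# File types treated as images (an assumption for this sketch).
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def tesseract_ocr(image_path):
    """OCR one image with Tesseract through pytesseract (imported lazily)."""
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img)


def extract_folder(src="test_images", dst="extracted_texts", ocr=tesseract_ocr):
    """Write one .txt per image in src into dst, keeping the image's base name."""
    src_dir, dst_dir = Path(src), Path(dst)
    dst_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for image_path in sorted(src_dir.iterdir()):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip non-image files
        out_path = dst_dir / (image_path.stem + ".txt")
        out_path.write_text(ocr(image_path), encoding="utf-8")
        written.append(out_path)
    return written
```

Passing the OCR function as a parameter keeps the folder loop testable without Tesseract installed; calling `extract_folder()` with no arguments uses the folder names from the description.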