data-mining
A pure-Python utility to extract text and images from docx files.
Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.
A Python 3 module that converts a bs4 Tag into a JSON object (dict).
Apply different text recognition services to images of handwritten documents.
Python wrapper for the Tesseract OCR engine. The module is based on OpenCV.
Python package, installable with pip, for retrieving Google Search results via SERP API.
A Python module for use with Elsevier's APIs: Scopus, ScienceDirect, others.
Extract data from all Google Scholar pages from a single Python module.
Zotero is a free, easy-to-use tool to help you collect, organize, annotate, cite, and share your research sources.
Python PDF Parser (Not actively maintained). Check out pdfminer.six.
Community-maintained fork of pdfminer: we fathom PDF.
pdfrw is a pure Python library that reads and writes PDFs
Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
smoothscan is a tool to convert scanned text into a vectorized output form.
Extract text and its region from an image using OpenCV.
GUI for cropping a large number of images quickly.
A list of awesome resources and links related to digital repositories (especially open access)
A versatile Python library for EPUB2/EPUB3 manipulation and processing.
Scrapes dictionary definitions from the Oxford and Collins COBUILD dictionaries and stores them in a SQLite database.
Python code based on the Scrapy package that crawls well-known online dictionaries such as Oxford, Longman, Cambridge, Webster, and Collins to build a dataset.
Wrapper for the pycatastro library: simplified queries to the API of the Spanish Cadastre (Catastro).
A Python library to extract tabular data from PDFs