Stars
Free universal database tool and SQL client
A browser automation framework and ecosystem.
OpenRefine is a free, open source power tool for working with messy data and improving it
CoreNLP: A Java suite of core NLP tools for tokenization, sentence segmentation, NER, parsing, coreference, sentiment analysis, etc.
Machine learning software for extracting information from scholarly documents
Cyberduck is a libre FTP, SFTP, WebDAV, Amazon S3, Backblaze B2, Microsoft Azure & OneDrive and OpenStack Swift file transfer client for Mac and Windows.
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
TrackerControl Android: monitor and control trackers and ads.
Converts a PDF file into a text file while keeping the layout of the original PDF. Useful for extracting the content of a table in a PDF file, for instance. This is a subclass of PDFTextStripper class…
MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to t…
A scalable, mature and versatile web crawler based on Apache Storm
Content management platform to build modern business applications
Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
News crawling with StormCrawler - stores content as WARC
Latent Dirichlet Allocation (LDA) model for microblogs (Twitter, Weibo, etc.)
Tools to work with the big Reddit JSON data dump.
Code samples to help you get started with the Amazon Mechanical Turk Requester API
Index Common Crawl archives in tabular format
We introduce TACIT: An Open-Source Text Analysis, Crawling and Interpretation Tool. TACIT's plugin architecture has three main components: 1. Crawling plugins 2. Corpus management 3. Analysis plugi…
Unsupervised method for extracting quotation-speaker pairs from large news corpora.