Cyberduck is a libre FTP, SFTP, WebDAV, Amazon S3, Backblaze B2, Microsoft Azure & OneDrive and OpenStack Swift file transfer client for Mac and Windows.

Java · 4,132 stars · 326 forks · Updated Dec 17, 2025

internetarchive / heritrix3

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.

Java · 3,110 stars · 780 forks · Updated Dec 11, 2025

TrackerControl / tracker-control-android

TrackerControl Android: monitor and control trackers and ads.

Java · 2,289 stars · 97 forks · Updated Dec 10, 2025

JonathanLink / PDFLayoutTextStripper

Converts a PDF file into a text file while keeping the layout of the original PDF. Useful, for instance, for extracting the content of a table in a PDF file. This is a subclass of PDFTextStripper class…

Java · 1,596 stars · 214 forks · Updated Dec 17, 2023

mimno / Mallet

MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to t…

Java · 1,019 stars · 352 forks · Updated Dec 9, 2025

apache / stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm

Java · 953 stars · 268 forks · Updated Dec 15, 2025

nuxeo / nuxeo

Content management platform to build modern business applications

Java · 686 stars · 390 forks · Updated Dec 17, 2025

brendano / ark-tweet-nlp

CMU ARK Twitter Part-of-Speech Tagger

Java · 575 stars · 196 forks · Updated Dec 17, 2023

VIDA-NYU / ache

ACHE is a web crawler for domain-specific search.

Java · 475 stars · 135 forks · Updated Aug 31, 2025

USCDataScience / sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.

Java · 419 stars · 139 forks · Updated Mar 30, 2023

commoncrawl / news-crawl

News crawling with StormCrawler; stores content as WARC

Java · 360 stars · 40 forks · Updated Feb 19, 2025

minghui / Twitter-LDA

Latent Dirichlet Allocation (LDA) model for microblogs (Twitter, Weibo, etc.)

Java · 319 stars · 108 forks · Updated May 4, 2018

dewarim / data-tools-for-reddit

Tools to work with the big Reddit JSON data dump.

Java · 255 stars · 31 forks · Updated Jul 6, 2024

aws-samples / mturk-code-samples

Code samples to help you get started with the Amazon Mechanical Turk Requester API

Java · 170 stars · 58 forks · Updated Aug 2, 2024

leifeld / dna

Discourse Network Analyzer (DNA)

Java · 146 stars · 44 forks · Updated Jun 4, 2025

commoncrawl / cc-index-table

Index Common Crawl archives in tabular format

Java · 124 stars · 14 forks · Updated Dec 4, 2025

USC-CSSL / TACIT

We introduce TACIT: An Open-Source Text Analysis, Crawling and Interpretation Tool. TACIT's plugin architecture has three main components: 1. Crawling plugins 2. Corpus management 3. Analysis plugi…

Java · 109 stars · 16 forks · Updated Mar 27, 2019

epfl-dlab / quootstrap

Unsupervised method for extracting quotation-speaker pairs from large news corpora.

Java · 29 stars · 3 forks · Updated Jul 4, 2018

jg-bernard

Stars

dbeaver / dbeaver

SeleniumHQ / selenium

neo4j / neo4j

OpenRefine / OpenRefine

stanfordnlp / CoreNLP

vespa-engine / vespa

gephi / gephi

kermitt2 / grobid

iterate-ch / cyberduck