Modular R web-scraping framework that crawls sitemaps, aggregates links by date range, and extracts target HTML fields using the paperboy package (German newspapers)
Updated Oct 11, 2025 - HTML
Creating an index measuring the "rurality" of counties in the contiguous United States
Augmented synthetic dataset for deep learning in C++
A utility for generating VOC image annotations
The Font Image Generator App creates diverse character images using various fonts, aiding in dataset creation for machine learning and analysis.
An AI scraper using crawl4ai and firecrawl.
Easily create training or fine-tuning data for OpenAI's ChatGPT models by chatting with yourself, then export it to JSONL.
Lack of alumni tracking and poor alumni interaction among students who have graduated from educational institutions across Odisha.
Dataset builder is an application that lets users build datasets of up to 5 dimensions visually through an intuitive MS-Paint-like interface and store them in a MySQL database for effective data wrangling. The application can also export the datasets to a CSV file.
Synthetic HR data creator
Benchmark datasets containing both normal background traffic and worm traffic
A Python command-line tool to create a geocoded dataset of Russian small and medium-sized enterprises (SMEs) from open data published by the Federal Tax Service
UCL Geographical Information Systems (GIS) project, building up an open-source way of extracting the location of diplomatic outputs across the world. We argue that this new level of detail helps in the study of diplomatic interactions.
A collection of scripts and tools that track the availability of Helium Mobile Wi-Fi networks in the wild, using the WiGLE dataset and the Helium API. Updates every 24 hours.
Snippets for data set generation and analyses with ParlGov · 🗳️🧑🏻💻📊
An open-source software for synthetic web-based user interface and content dataset generation.
Code to generate the Inv3D dataset from our paper "Inv3D: a high-resolution 3D invoice dataset for template-guided single-image document unwarping" (ICDAR 2023).
"Enhancing LLM Factual Accuracy with RAG to Counter Hallucinations: A Case Study on Domain-Specific Queries in Private Knowledge-Bases" by Jiarui Li and Ye Yuan and Zehua Zhang
A Unity-based simulator for RoboMaster. You can use it to generate labeled datasets for deep learning.