miniDeGenTWeb

miniDeGenTWeb is a small public demo of the DeGenTWeb detection pipeline. It takes page body text, applies lightweight quality filters, optionally scores the filtered pages with Binoculars, and combines page scores into a site-level prediction for whether a site appears LLM-dominant.

This repository is intentionally simplified. The full DeGenTWeb system is a larger internal research pipeline for collecting website content, extracting main-body text, applying quality and deduplication filters, scoring pages with LLM-detection models, and aggregating those signals to study LLM-generated content across sites. This public version keeps only the local, file-based pieces needed to demonstrate the core filtering and site-level aggregation workflow.

The miniature pipeline is:

1. Start with one JSONL row per page.
2. Filter the page's main body text.
3. Run Binoculars on each filtered page.
4. Group page scores by site and apply the frozen site-level SVM.

What this version includes

A main-body-text filter based on a standalone subset of the DeGenTWeb page quality rules.
Optional Binoculars scoring for filtered pages.
A packaged site-level linear SVM that aggregates page scores by site.
Small example inputs and tests that run locally.

What this version does not include

Website crawling, browser automation, or HTML extraction.
The original DeGenTWeb database or full training data.
The full duplicate-content pipeline used by the larger system.
A production service, hosted dashboard, or claim that a single site prediction is definitive.

Use this code for local experiments on page-body-text files. Treat its outputs as research signals that depend on input quality, sample size, model choice, and the limits of the simplified pipeline.

Input schema

Minimum page JSONL fields:

{"site_id": "example.com", "page_id": "https://example.com/a", "text": "main body text"}

Field meanings:

site_id: stable site or subdomain key used for grouping pages
page_id: stable page key or URL used only for traceability
text: extracted main body content, not full HTML
pcent_relative_dupe: optional percentage of this page's text already seen on earlier pages from the same site
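
The fields above can be checked before running the filter. The following is a minimal sketch, not part of minidw: `validate_page_row` is a hypothetical helper, and the 0 to 100 range for pcent_relative_dupe is an assumption inferred from the word "percentage".

```python
import json

REQUIRED_FIELDS = ("site_id", "page_id", "text")


def validate_page_row(line: str) -> dict:
    """Parse one JSONL line and check the minimum page fields.

    Hypothetical helper for illustration; minidw may validate differently.
    """
    row = json.loads(line)
    for field in REQUIRED_FIELDS:
        if not isinstance(row.get(field), str):
            raise ValueError(f"missing or non-string field: {field}")
    dupe = row.get("pcent_relative_dupe")
    # Assumption: the optional duplicate percentage is a number in [0, 100].
    if dupe is not None:
        if not isinstance(dupe, (int, float)) or not 0 <= dupe <= 100:
            raise ValueError("pcent_relative_dupe must be a number from 0 to 100")
    return row


row = validate_page_row(
    '{"site_id": "example.com", "page_id": "https://example.com/a", "text": "main body text"}'
)
print(row["site_id"])
```

Running a check like this over every line of the input surfaces malformed rows before any model time is spent.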

If pcent_relative_dupe is absent, minidw-filter can estimate it with a portable fixed-chunk duplicate tracker. The full DeGenTWeb system can use richer duplicate tracking. If your input already has a better within-site duplicate percentage, pass it in the input rows.
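
To make the idea of a fixed-chunk duplicate tracker concrete, here is a toy sketch. It is not minidw's implementation: the class name, the character-based chunking, and the 50-character chunk size are all assumptions chosen for illustration.

```python
from collections import defaultdict

CHUNK_CHARS = 50  # assumption: illustrative chunk size, not minidw's setting


class SiteDuplicateTracker:
    """Toy per-site tracker of previously seen fixed-size text chunks."""

    def __init__(self) -> None:
        self._seen = defaultdict(set)  # site_id -> chunks seen on earlier pages

    def pcent_relative_dupe(self, site_id: str, text: str) -> float:
        chunks = [text[i : i + CHUNK_CHARS] for i in range(0, len(text), CHUNK_CHARS)]
        if not chunks:
            return 0.0
        seen = self._seen[site_id]
        # Count only chunks seen on earlier pages, not repeats within this page.
        dupe_chars = sum(len(chunk) for chunk in chunks if chunk in seen)
        seen.update(chunks)  # later pages on this site are compared against these
        return 100.0 * dupe_chars / len(text)


tracker = SiteDuplicateTracker()
page = "Shared footer text repeated on every page of the site. " * 4
print(tracker.pcent_relative_dupe("example.com", page))  # first page of the site
print(tracker.pcent_relative_dupe("example.com", page))  # identical second page
```

Fixed-offset chunks are cheap and portable, but a small insertion near the top of a page shifts every later chunk, which is one reason a richer tracker can give a better estimate.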

Setup

Create an environment:

python3 -m venv .venv
. .venv/bin/activate
pip install -e '.[dev]'

For Binoculars scoring, install the GPU dependencies too:

pip install -e '.[binoculars]'

The default models are:

observer: SichangHe/falcon-7b-FP8-Dynamic
performer: SichangHe/falcon-7b-instruct-FP8-Dynamic

These are the defaults used by the exported command line tools. They require Hugging Face model access and a machine with enough GPU memory for two Falcon 7B models, or enough memory for device_map=auto to place them.

Quick demo

Run the local demo without downloading model weights:

minidw-filter examples/pages.jsonl /tmp/minidw_filtered.jsonl --whitespace-tokens
minidw-svm examples/scored_pages.csv /tmp/minidw_site_predictions.csv --format csv

This run checks the text filters and site SVM with local files only. It gives useful evidence about input quality, minimum sample size, score aggregation, and classification margins. It does not test Binoculars model loading or GPU throughput; run Step 2 when those limits matter.

Step 1: filter pages

minidw-filter pages.jsonl filtered.jsonl --use-site-duplicate-tracker

The output preserves the input fields and adds:

passes_filter: whether the page should be scored
problem: reason for filtering, or null
cleaned_text: Dolma-style cleaned text to pass to Binoculars
cleaned_n_tokens: token count of cleaned_text
pcent_relative_dupe: supplied or estimated duplicate percentage
metrics: detailed filter metrics

The filter logic matches the standalone subset used before Binoculars scoring:

Dolma-style quality rules after dropping lines without final punctuation
at least 200 cleaned tokens
at most 50% within-site duplicate text

The full HTML extraction metrics for link text, code text, large blocks, and list/table text require HTML and the original extraction tree. miniDeGenTWeb is for the lightweight main-body-content input path, so those HTML-only filters are not recomputed here.
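
The three listed checks can be sketched as follows. This is an illustration under stated assumptions, not the minidw-filter code: the terminal punctuation set, the whitespace token count, the `sketch_filter` name, and the problem labels are hypothetical, and the Dolma-style quality rules themselves are omitted.

```python
TERMINAL_PUNCTUATION = (".", "!", "?", '"')  # assumption: C4-style set
MIN_TOKENS = 200
MAX_PCENT_DUPE = 50.0


def sketch_filter(text: str, pcent_relative_dupe: float = 0.0) -> dict:
    """Apply the three listed checks; the full Dolma-style rules are omitted."""
    kept = [
        line.strip()
        for line in text.splitlines()
        if line.strip().endswith(TERMINAL_PUNCTUATION)
    ]
    cleaned_text = "\n".join(kept)
    n_tokens = len(cleaned_text.split())  # assumption: whitespace tokens
    problem = None
    if n_tokens < MIN_TOKENS:
        problem = "too_few_tokens"  # hypothetical label
    elif pcent_relative_dupe > MAX_PCENT_DUPE:
        problem = "too_much_duplicate_text"  # hypothetical label
    return {
        "passes_filter": problem is None,
        "problem": problem,
        "cleaned_text": cleaned_text,
        "cleaned_n_tokens": n_tokens,
    }


body = "Navigation menu\n" + "This sentence has exactly six words.\n" * 40
result = sketch_filter(body, pcent_relative_dupe=10.0)
print(result["passes_filter"], result["cleaned_n_tokens"])
```

The unpunctuated "Navigation menu" line is dropped before counting, so menu and button text does not help a page reach the token minimum.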

Step 2: run Binoculars

minidw-binoculars filtered.jsonl scored.jsonl --batch-size 8

Only rows with passes_filter=true are scored. The command adds bino_score to each scored row.

Useful options:

minidw-binoculars filtered.jsonl scored.jsonl \
  --observer SichangHe/falcon-7b-FP8-Dynamic \
  --performer SichangHe/falcon-7b-instruct-FP8-Dynamic \
  --batch-size 4 \
  --dtype auto \
  --max-token-observed 2048

Interpretation: lower Binoculars scores are more LLM-like for this setup.
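
Because lower scores are more LLM-like, sorting scored rows in ascending order is a quick way to inspect the pages that drive a site's prediction. The sketch below assumes the scored JSONL carries passes_filter and bino_score as described above; `most_llm_like` is a hypothetical helper and the sample rows are made up.

```python
import json


def most_llm_like(jsonl_lines, limit=3):
    """Return scored rows sorted by ascending bino_score (lowest first)."""
    rows = [json.loads(line) for line in jsonl_lines if line.strip()]
    scored = [
        row
        for row in rows
        if row.get("passes_filter") and row.get("bino_score") is not None
    ]
    return sorted(scored, key=lambda row: row["bino_score"])[:limit]


# Made-up rows for illustration; real scores come from minidw-binoculars.
sample = [
    '{"site_id": "a.example", "page_id": "p1", "passes_filter": true, "bino_score": 0.95}',
    '{"site_id": "a.example", "page_id": "p2", "passes_filter": true, "bino_score": 0.78}',
    '{"site_id": "a.example", "page_id": "p3", "passes_filter": false}',
]
for row in most_llm_like(sample):
    print(row["page_id"], row["bino_score"])
```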

Step 3: run the site SVM

minidw-svm scored.jsonl site_predictions.csv

The SVM groups filtered page scores by site_id. Sites with fewer than 15 filtered scored pages are skipped. For each retained site, it computes the 10th, 20th, ..., 90th percentiles of bino_score, then applies:

svm_distance = dot(weights, deciles) + intercept

svm_distance > 0 means is_llm_dominant=true.

The frozen model is packaged at src/minidw/models/dw_dolma_linear_svm.json. It was exported on 2026-06-11 from a DeGenTWeb Dolma-cleaned full-site SVM.
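
The decile and distance computation can be sketched in plain Python. This is an illustration only: the weights and intercept below are placeholders rather than the packaged model's values, the linear-interpolation percentile is an assumption about how deciles are computed, and `site_svm_distance` is a hypothetical name.

```python
import math

MIN_PAGES = 15


def percentile(sorted_values, q):
    """Percentile with linear interpolation between the two closest ranks."""
    rank = (len(sorted_values) - 1) * q / 100.0
    lo = math.floor(rank)
    hi = math.ceil(rank)
    frac = rank - lo
    return sorted_values[lo] + frac * (sorted_values[hi] - sorted_values[lo])


def site_svm_distance(scores, weights, intercept):
    """Return (deciles, distance), or None if the site has too few pages."""
    if len(scores) < MIN_PAGES:
        return None
    ordered = sorted(scores)
    deciles = [percentile(ordered, q) for q in range(10, 100, 10)]
    distance = sum(w * d for w, d in zip(weights, deciles)) + intercept
    return deciles, distance


# Placeholder parameters, NOT the values in dw_dolma_linear_svm.json.
weights = [-1.0] * 9
intercept = 8.0
scores = [0.70 + 0.01 * i for i in range(20)]  # made-up page scores

deciles, distance = site_svm_distance(scores, weights, intercept)
print(distance > 0)  # a positive distance maps to is_llm_dominant=true
```

Using nine deciles instead of a single mean lets the classifier respond to the shape of a site's score distribution, for example a site where only part of the pages score low.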

CSV input for SVM only

If you already have Binoculars scores, run only the SVM:

minidw-svm scored_pages.csv site_predictions.csv --format csv

The CSV must contain:

site_id
bino_score
passes_filter if unfiltered rows are present
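
If the scores live in some other format, a CSV with the required columns can be produced with the standard library. The sketch below is one hedged way to do it: `scored_rows_to_csv` is a hypothetical helper, and it writes only rows that passed the filter so that the passes_filter column can be left out.

```python
import csv
import io
import json


def scored_rows_to_csv(jsonl_lines, out_file):
    """Write site_id and bino_score for rows that passed the filter."""
    writer = csv.writer(out_file)
    writer.writerow(["site_id", "bino_score"])
    n_written = 0
    for line in jsonl_lines:
        if not line.strip():
            continue
        row = json.loads(line)
        if row.get("passes_filter") and row.get("bino_score") is not None:
            writer.writerow([row["site_id"], row["bino_score"]])
            n_written += 1
    return n_written


# Made-up rows for illustration.
sample = [
    '{"site_id": "a.example", "passes_filter": true, "bino_score": 0.81}',
    '{"site_id": "a.example", "passes_filter": false}',
]
buffer = io.StringIO()
print(scored_rows_to_csv(sample, buffer))
print(buffer.getvalue())
```

In practice the `io.StringIO` buffer would be replaced by a file opened with `newline=""`, as the csv module documentation recommends.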

Output schema

minidw-svm writes one row per retained site:

site_id
n_filtered_pages
svm_distance
is_llm_dominant
deciles

Development checks

ruff check .
pytest
