miniDeGenTWeb is a small public demo of the DeGenTWeb detection pipeline. It takes page body text, applies lightweight quality filters, optionally scores the filtered pages with Binoculars, and combines page scores into a site-level prediction for whether a site appears LLM-dominant.
This repository is intentionally simplified. The full DeGenTWeb system is a larger internal research pipeline for collecting website content, extracting main-body text, applying quality and deduplication filters, scoring pages with LLM-detection models, and aggregating those signals to study LLM-generated content across sites. This public version keeps only the local, file-based pieces needed to demonstrate the core filtering and site-level aggregation workflow.
The miniature pipeline is:
- Start with one JSONL row per page.
- Filter the page's main body text.
- Run Binoculars on each filtered page.
- Group page scores by site and apply the frozen site-level SVM.
This repository includes:
- A main-body-text filter based on a standalone subset of the DeGenTWeb page quality rules.
- Optional Binoculars scoring for filtered pages.
- A packaged site-level linear SVM that aggregates page scores by site.
- Small example inputs and tests that run locally.
It does not include:
- Website crawling, browser automation, or HTML extraction.
- The original DeGenTWeb database or full training data.
- The full duplicate-content pipeline used by the larger system.
- A production service, hosted dashboard, or claim that a single site prediction is definitive.
Use this code for local experiments on page-body-text files. Treat its outputs as research signals that depend on input quality, sample size, model choice, and the limits of the simplified pipeline.
Minimum page JSONL fields:
{"site_id": "example.com", "page_id": "https://example.com/a", "text": "main body text"}
Field meanings:
- site_id: stable site or subdomain key used for grouping pages
- page_id: stable page key or URL used only for traceability
- text: extracted main body content, not full HTML
- pcent_relative_dupe: optional percentage of this page's text already seen on earlier pages from the same site
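As a rough illustration of the row format above, the following sketch checks rows before they are written to a page JSONL file. The helper names `validate_page_row` and `dumps_jsonl` are hypothetical and not part of miniDeGenTWeb, and the 0 to 100 range for pcent_relative_dupe is an assumption based on the field being described as a percentage.

```python
import json

# Required fields, taken from the minimum page JSONL row shown above.
REQUIRED_FIELDS = ("site_id", "page_id", "text")


def validate_page_row(row: dict) -> list[str]:
    """Return a list of problems; an empty list means the row looks usable."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in row]
    for name in REQUIRED_FIELDS:
        if name in row and not isinstance(row[name], str):
            problems.append(f"{name} must be a string")
    dupe = row.get("pcent_relative_dupe")
    # Assumption: the optional duplicate percentage is a number from 0 to 100.
    if dupe is not None and not (isinstance(dupe, (int, float)) and 0 <= dupe <= 100):
        problems.append("pcent_relative_dupe must be a number in [0, 100]")
    return problems


def dumps_jsonl(rows) -> str:
    """Serialize rows as JSONL: one JSON object per line."""
    return "".join(json.dumps(row, ensure_ascii=False) + "\n" for row in rows)
```

Writing the string returned by `dumps_jsonl` to a file gives one page per line, which is the shape minidw-filter expects as input.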
If pcent_relative_dupe is absent, minidw-filter can estimate it with
a portable fixed-chunk duplicate tracker. The full DeGenTWeb system can use
richer duplicate tracking. If your input already has a better within-site
duplicate percentage, pass it in the input rows.
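To make the idea of a fixed-chunk duplicate tracker concrete, here is a minimal sketch. It is not the tracker shipped with minidw-filter: the class name, the character-based chunking, the chunk size of 200, and the use of SHA-1 are all assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict


class FixedChunkDupeTracker:
    """Hypothetical per-site tracker of fixed-size text chunks."""

    def __init__(self, chunk_chars: int = 200):
        # Assumption: chunks are fixed-length character windows.
        self.chunk_chars = chunk_chars
        self._seen = defaultdict(set)  # site_id -> hashes of chunks seen so far

    def percent_duplicate(self, site_id: str, text: str) -> float:
        """Percentage of this page's chunks already seen on earlier pages of the site."""
        chunks = [
            text[i : i + self.chunk_chars]
            for i in range(0, len(text), self.chunk_chars)
        ]
        if not chunks:
            return 0.0
        seen = self._seen[site_id]
        hashes = [hashlib.sha1(c.encode("utf-8")).hexdigest() for c in chunks]
        # Compare against earlier pages only, then record this page's chunks.
        n_dupe = sum(h in seen for h in hashes)
        seen.update(hashes)
        return 100.0 * n_dupe / len(chunks)
```

Because the result depends on the order in which pages are visited, rows for one site should be processed in a stable order if the estimate needs to be reproducible.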
Create an environment:
python3 -m venv .venv
. .venv/bin/activate
pip install -e '.[dev]'
For Binoculars scoring, install the GPU dependencies too:
pip install -e '.[binoculars]'
The default models are:
- observer: SichangHe/falcon-7b-FP8-Dynamic
- performer: SichangHe/falcon-7b-instruct-FP8-Dynamic
These are the defaults used by the exported command line tools. They require
Hugging Face model access and a machine with enough GPU memory for two Falcon
7B models, or enough memory for device_map=auto to place them.
Run the local demo without downloading model weights:
minidw-filter examples/pages.jsonl /tmp/minidw_filtered.jsonl --whitespace-tokens
minidw-svm examples/scored_pages.csv /tmp/minidw_site_predictions.csv --format csv
This run checks the text filters and site SVM with local files only. It gives useful evidence about input quality, minimum sample size, score aggregation, and classification margins. It does not test Binoculars model loading or GPU throughput; run Step 2 when those limits matter.
Step 1: filter pages
minidw-filter pages.jsonl filtered.jsonl --use-site-duplicate-tracker
The output preserves the input fields and adds:
- passes_filter: whether the page should be scored
- problem: reason for filtering, or null
- cleaned_text: Dolma-style cleaned text to pass to Binoculars
- cleaned_n_tokens: token count of cleaned_text
- pcent_relative_dupe: supplied or estimated duplicate percentage
- metrics: detailed filter metrics
The filter logic matches the standalone subset used before Binoculars scoring:
- Dolma-style quality rules after dropping lines without final punctuation
- at least 200 cleaned tokens
- at most 50% within-site duplicate text
The full HTML extraction metrics for link text, code text, large blocks, and list/table text require HTML and the original extraction tree. miniDeGenTWeb is for the lightweight main-body-content input path, so those HTML-only filters are not recomputed here.
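The two numeric thresholds listed above can be sketched as follows. This is an illustration under assumptions, not the packaged filter: it skips the Dolma-style quality rules entirely, counts whitespace-separated tokens, guesses which characters count as final punctuation, and uses made-up values for problem.

```python
# Assumption: these characters count as "final punctuation" at the end of a line.
TERMINAL_PUNCTUATION = (".", "!", "?", '"')
MIN_CLEANED_TOKENS = 200          # "at least 200 cleaned tokens"
MAX_PCENT_RELATIVE_DUPE = 50.0    # "at most 50% within-site duplicate text"


def drop_unpunctuated_lines(text: str) -> str:
    """Keep only lines that end with terminal punctuation."""
    kept = [
        line
        for line in text.splitlines()
        if line.rstrip().endswith(TERMINAL_PUNCTUATION)
    ]
    return "\n".join(kept)


def threshold_check(text: str, pcent_relative_dupe: float) -> dict:
    """Apply only the token-count and duplicate-percentage thresholds."""
    cleaned = drop_unpunctuated_lines(text)
    n_tokens = len(cleaned.split())  # whitespace tokens, for illustration
    if n_tokens < MIN_CLEANED_TOKENS:
        problem = "too_few_tokens"            # hypothetical label
    elif pcent_relative_dupe > MAX_PCENT_RELATIVE_DUPE:
        problem = "too_much_duplicate_text"   # hypothetical label
    else:
        problem = None
    return {
        "passes_filter": problem is None,
        "problem": problem,
        "cleaned_text": cleaned,
        "cleaned_n_tokens": n_tokens,
    }
```

A page made mostly of navigation labels or headings loses those lines first, so it can fall below the token threshold even when the raw text looks long.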
Step 2: score filtered pages with Binoculars
minidw-binoculars filtered.jsonl scored.jsonl --batch-size 8
Only rows with passes_filter=true are scored. The command adds bino_score to each scored row.
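To preview how many rows a scoring run would touch before loading any model, a filtered JSONL file can be inspected with a few lines of Python. The helper name `scorable_rows` is hypothetical; the sketch only assumes that passes_filter is stored as a JSON boolean.

```python
import json


def scorable_rows(jsonl_text: str) -> list[dict]:
    """Return the rows whose passes_filter field is true."""
    rows = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        row = json.loads(line)
        if row.get("passes_filter") is True:
            rows.append(row)
    return rows
```

Counting these rows per site_id gives an early indication of whether a site is likely to have enough scored pages for the site-level step.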
Useful options:
minidw-binoculars filtered.jsonl scored.jsonl \
--observer SichangHe/falcon-7b-FP8-Dynamic \
--performer SichangHe/falcon-7b-instruct-FP8-Dynamic \
--batch-size 4 \
--dtype auto \
--max-token-observed 2048
Interpretation: lower Binoculars scores are more LLM-like for this setup.
Step 3: apply the site-level SVM
minidw-svm scored.jsonl site_predictions.csv
The SVM groups filtered page scores by site_id. Sites with fewer than 15 filtered scored pages are skipped. For each retained site, it computes the 10th, 20th, ..., 90th percentiles of bino_score, then applies:
svm_distance = dot(weights, deciles) + intercept
svm_distance > 0 means is_llm_dominant=true.
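The aggregation described above can be sketched in plain Python. This is an illustration, not the packaged implementation: the percentile interpolation method is an assumption (linear interpolation between neighboring values), the function names are hypothetical, and the weights and intercept must be supplied by the caller because the layout of the frozen model file is not reproduced here.

```python
from collections import defaultdict

MIN_PAGES_PER_SITE = 15  # sites with fewer filtered scored pages are skipped


def percentile(sorted_values, q):
    """Percentile q (0 to 100) of a sorted list, using linear interpolation."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)


def site_deciles(scores):
    """The 10th, 20th, ..., 90th percentiles of a site's page scores."""
    ordered = sorted(scores)
    return [percentile(ordered, q) for q in range(10, 100, 10)]


def predict_sites(rows, weights, intercept):
    """Group page scores by site_id and apply a linear decision function."""
    by_site = defaultdict(list)
    for row in rows:
        # Rows without passes_filter are treated as already filtered.
        if row.get("passes_filter", True):
            by_site[row["site_id"]].append(float(row["bino_score"]))
    predictions = {}
    for site_id, scores in by_site.items():
        if len(scores) < MIN_PAGES_PER_SITE:
            continue
        deciles = site_deciles(scores)
        distance = sum(w * d for w, d in zip(weights, deciles)) + intercept
        predictions[site_id] = {
            "n_filtered_pages": len(scores),
            "svm_distance": distance,
            "is_llm_dominant": distance > 0,
            "deciles": deciles,
        }
    return predictions
```

Using nine deciles rather than a single mean lets the linear model weigh the low end of a site's score distribution differently from the high end, which matters when only part of a site looks LLM-like.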
The frozen model is packaged at src/minidw/models/dw_dolma_linear_svm.json.
It was exported on 2026-06-11 from a DeGenTWeb Dolma-cleaned full-site SVM.
If you already have Binoculars scores, run only the SVM:
minidw-svm scored_pages.csv site_predictions.csv --format csv
The CSV must contain:
- site_id
- bino_score
- passes_filter, if unfiltered rows are present
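Before handing a CSV of existing scores to minidw-svm, it may help to confirm that the required columns are present. The sketch below uses only the standard library; the helper name `missing_columns` is hypothetical, and it does not check how passes_filter values are encoded, since that is not specified here.

```python
import csv
import io

# Columns that must always be present, per the list above.
REQUIRED_COLUMNS = {"site_id", "bino_score"}


def missing_columns(csv_text: str) -> list[str]:
    """Return the required column names absent from the CSV header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sorted(REQUIRED_COLUMNS - set(reader.fieldnames or []))
```

An empty result means the header is sufficient when every row has already been filtered; files that still contain unfiltered rows also need a passes_filter column.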
minidw-svm writes one row per retained site:
- site_id
- n_filtered_pages
- svm_distance
- is_llm_dominant
- deciles
Run the linter and tests:
ruff check .
pytest