USC-NSL/miniDeGenTWeb

miniDeGenTWeb

miniDeGenTWeb is a small public demo of the DeGenTWeb detection pipeline. It takes page body text, applies lightweight quality filters, optionally scores the filtered pages with Binoculars, and combines page scores into a site-level prediction for whether a site appears LLM-dominant.

This repository is intentionally simplified. The full DeGenTWeb system is a larger internal research pipeline for collecting website content, extracting main-body text, applying quality and deduplication filters, scoring pages with LLM-detection models, and aggregating those signals to study LLM-generated content across sites. This public version keeps only the local, file-based pieces needed to demonstrate the core filtering and site-level aggregation workflow.

The miniature pipeline is:

  1. Start with one JSONL row per page.
  2. Filter the page's main body text.
  3. Run Binoculars on each filtered page.
  4. Group page scores by site and apply the frozen site-level SVM.

What this version includes

  • A main-body-text filter based on a standalone subset of the DeGenTWeb page quality rules.
  • Optional Binoculars scoring for filtered pages.
  • A packaged site-level linear SVM that aggregates page scores by site.
  • Small example inputs and tests that run locally.

What this version does not include

  • Website crawling, browser automation, or HTML extraction.
  • The original DeGenTWeb database or full training data.
  • The full duplicate-content pipeline used by the larger system.
  • A production service, hosted dashboard, or claim that a single site prediction is definitive.

Use this code for local experiments on page-body-text files. Treat its outputs as research signals that depend on input quality, sample size, model choice, and the limits of the simplified pipeline.

Input schema

Minimum page JSONL fields:

{"site_id": "example.com", "page_id": "https://example.com/a", "text": "main body text"}

Field meanings:

  • site_id: stable site or subdomain key used for grouping pages
  • page_id: stable page key or URL used only for traceability
  • text: extracted main body content, not full HTML
  • pcent_relative_dupe: optional percentage of this page's text already seen on earlier pages from the same site
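Before running the pipeline, it can help to check that each input row carries the documented fields. The following is a minimal sketch, not part of miniDeGenTWeb: `validate_page_row` is a hypothetical helper, and the 0 to 100 range for `pcent_relative_dupe` is an assumption based on the field being described as a percentage.

```python
import json

# Required fields, taken from the schema above.
REQUIRED_FIELDS = ("site_id", "page_id", "text")


def validate_page_row(line: str) -> dict:
    """Parse one JSONL line and check the documented fields.

    Hypothetical helper for illustration; not part of miniDeGenTWeb.
    """
    row = json.loads(line)
    for field in REQUIRED_FIELDS:
        if not isinstance(row.get(field), str):
            raise ValueError(f"missing or non-string field: {field}")
    dupe = row.get("pcent_relative_dupe")
    # Assumption: the optional duplicate value is a percentage in [0, 100].
    if dupe is not None and not 0 <= float(dupe) <= 100:
        raise ValueError("pcent_relative_dupe should be between 0 and 100")
    return row


example = '{"site_id": "example.com", "page_id": "https://example.com/a", "text": "main body text"}'
row = validate_page_row(example)
```

Rows that fail this check are probably worth fixing upstream, since the filter and SVM steps both group pages by `site_id`.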

If pcent_relative_dupe is absent, minidw-filter can estimate it with a portable fixed-chunk duplicate tracker. The full DeGenTWeb system can use richer duplicate tracking. If your input already has a better within-site duplicate percentage, pass it in the input rows.
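To make the idea of a fixed-chunk duplicate tracker concrete, here is a rough sketch of one way such a tracker could work. It is not the implementation used by minidw-filter: the class name, the character-based chunking, the chunk size, and the choice to count chunks rather than characters are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical chunk size; the value used by minidw-filter is not documented here.
CHUNK_CHARS = 50


class SiteDuplicateTracker:
    """Illustrative tracker: share of a page's chunks seen on earlier same-site pages."""

    def __init__(self, chunk_chars: int = CHUNK_CHARS):
        self.chunk_chars = chunk_chars
        self.seen = defaultdict(set)  # site_id -> chunks from earlier pages

    def pcent_relative_dupe(self, site_id: str, text: str) -> float:
        step = self.chunk_chars
        chunks = [text[i:i + step] for i in range(0, len(text), step)]
        if not chunks:
            return 0.0
        seen = self.seen[site_id]
        # Count against earlier pages only, then record this page's chunks.
        n_dupe = sum(1 for chunk in chunks if chunk in seen)
        seen.update(chunks)
        return 100.0 * n_dupe / len(chunks)


tracker = SiteDuplicateTracker(chunk_chars=4)
first = tracker.pcent_relative_dupe("a.example", "abcdefgh")
second = tracker.pcent_relative_dupe("a.example", "abcdXXXX")
other_site = tracker.pcent_relative_dupe("b.example", "abcdefgh")
```

In this sketch the result depends on page order within a site, which matches the description of duplicates as text "already seen on earlier pages". Fixed chunks are also sensitive to small offsets in otherwise repeated text, which is one reason a richer tracker may give a better estimate.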

Setup

Create an environment:

python3 -m venv .venv
. .venv/bin/activate
pip install -e '.[dev]'

For Binoculars scoring, install the GPU dependencies too:

pip install -e '.[binoculars]'

The default models are:

  • observer: SichangHe/falcon-7b-FP8-Dynamic
  • performer: SichangHe/falcon-7b-instruct-FP8-Dynamic

These are the defaults used by the exported command line tools. They require Hugging Face model access and enough GPU memory to hold both Falcon 7B models at once, or enough memory across the available devices for device_map=auto to place them.

Quick demo

Run the local demo without downloading model weights:

minidw-filter examples/pages.jsonl /tmp/minidw_filtered.jsonl --whitespace-tokens
minidw-svm examples/scored_pages.csv /tmp/minidw_site_predictions.csv --format csv

This run exercises the text filters and the site SVM using local files only. It shows how input quality, the minimum page count per site, score aggregation, and classification margins affect the result. It does not test Binoculars model loading or GPU throughput; run Step 2 when those limits matter.

Step 1: filter pages

minidw-filter pages.jsonl filtered.jsonl --use-site-duplicate-tracker

The output preserves the input fields and adds:

  • passes_filter: whether the page should be scored
  • problem: reason for filtering, or null
  • cleaned_text: Dolma-style cleaned text to pass to Binoculars
  • cleaned_n_tokens: token count of cleaned_text
  • pcent_relative_dupe: supplied or estimated duplicate percentage
  • metrics: detailed filter metrics
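A quick way to inspect the filter output is to count how many pages pass and why the others were rejected. The sketch below reads rows with the `passes_filter` and `problem` fields listed above. The helper name is hypothetical, and the reason string in the sample rows is a placeholder, not a value the tool is known to emit.

```python
import json
from collections import Counter


def summarize_filter_output(lines):
    """Return (number of passing pages, Counter of problem reasons)."""
    n_pass = 0
    problems = Counter()
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        row = json.loads(line)
        if row["passes_filter"]:
            n_pass += 1
        else:
            problems[row["problem"]] += 1
    return n_pass, problems


# Made-up rows; "placeholder_reason" stands in for whatever the tool reports.
sample_lines = [
    json.dumps({"page_id": "p1", "passes_filter": True, "problem": None}),
    json.dumps({"page_id": "p2", "passes_filter": False, "problem": "placeholder_reason"}),
    json.dumps({"page_id": "p3", "passes_filter": False, "problem": "placeholder_reason"}),
]
n_pass, problems = summarize_filter_output(sample_lines)
```

With a real file, pass an open file handle instead of `sample_lines`, for example `summarize_filter_output(open("filtered.jsonl"))`.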

The filter logic matches the standalone subset used before Binoculars scoring:

  • Dolma-style quality rules after dropping lines without final punctuation
  • at least 200 cleaned tokens
  • at most 50% within-site duplicate text
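The three rules above can be sketched roughly as follows. This is a simplified illustration, not the filter shipped in this repository: it omits the Dolma-style quality rules entirely, counts tokens by whitespace splitting, and assumes a particular set of final punctuation marks. The function names and `problem` strings are hypothetical.

```python
# Assumption: which characters count as final punctuation is not stated in this README.
TERMINAL_PUNCTUATION = (".", "!", "?", '"')
MIN_TOKENS = 200          # "at least 200 cleaned tokens"
MAX_PCENT_DUPE = 50.0     # "at most 50% within-site duplicate text"


def drop_unpunctuated_lines(text: str) -> str:
    """Keep only lines that end with final punctuation."""
    kept = [
        line for line in text.splitlines()
        if line.rstrip().endswith(TERMINAL_PUNCTUATION)
    ]
    return "\n".join(kept)


def check_page(text: str, pcent_relative_dupe: float) -> dict:
    cleaned = drop_unpunctuated_lines(text)
    n_tokens = len(cleaned.split())  # whitespace tokens, a simplification
    if n_tokens < MIN_TOKENS:
        problem = "too_few_tokens"  # hypothetical reason string
    elif pcent_relative_dupe > MAX_PCENT_DUPE:
        problem = "too_much_duplicate_text"  # hypothetical reason string
    else:
        problem = None
    return {
        "passes_filter": problem is None,
        "problem": problem,
        "cleaned_text": cleaned,
        "cleaned_n_tokens": n_tokens,
    }


page_text = "Menu\n" + "This is a full sentence. " * 50
result = check_page(page_text, pcent_relative_dupe=10.0)
```

The navigation-like line "Menu" is dropped because it lacks final punctuation, and the remaining 250 whitespace tokens clear the length threshold.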

The full HTML extraction metrics for link text, code text, large blocks, and list/table text require HTML and the original extraction tree. miniDeGenTWeb is for the lightweight main-body-content input path, so those HTML-only filters are not recomputed here.

Step 2: run Binoculars

minidw-binoculars filtered.jsonl scored.jsonl --batch-size 8

Only rows with passes_filter=true are scored. The command adds bino_score to each scored row.

Useful options:

minidw-binoculars filtered.jsonl scored.jsonl \
  --observer SichangHe/falcon-7b-FP8-Dynamic \
  --performer SichangHe/falcon-7b-instruct-FP8-Dynamic \
  --batch-size 4 \
  --dtype auto \
  --max-token-observed 2048

Interpretation: lower Binoculars scores are more LLM-like for this setup.
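Because lower scores are more LLM-like in this setup, sorting scored rows in ascending order of `bino_score` surfaces the pages worth inspecting first. A minimal sketch, assuming the `passes_filter` and `bino_score` fields described above; the helper name and the sample scores are made up for illustration.

```python
import json


def most_llm_like(lines, top_n=3):
    """Return the scored rows with the lowest bino_score values."""
    rows = [json.loads(line) for line in lines if line.strip()]
    scored = [
        row for row in rows
        if row.get("passes_filter") and row.get("bino_score") is not None
    ]
    return sorted(scored, key=lambda row: row["bino_score"])[:top_n]


# Made-up scores, chosen only to show the ordering.
sample_lines = [
    json.dumps({"page_id": "p1", "passes_filter": True, "bino_score": 0.95}),
    json.dumps({"page_id": "p2", "passes_filter": True, "bino_score": 0.72}),
    json.dumps({"page_id": "p3", "passes_filter": False}),
    json.dumps({"page_id": "p4", "passes_filter": True, "bino_score": 0.81}),
]
top_pages = most_llm_like(sample_lines, top_n=2)
top_ids = [row["page_id"] for row in top_pages]
```

A single low page score is weak evidence on its own; the site-level step aggregates many page scores before making a prediction.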

Step 3: run the site SVM

minidw-svm scored.jsonl site_predictions.csv

The SVM groups page scores by site_id, using only pages that passed the filter and received a bino_score. Sites with fewer than 15 such pages are skipped. For each retained site, it computes the nine deciles of bino_score (the 10th, 20th, ..., 90th percentiles), then applies:

svm_distance = dot(weights, deciles) + intercept

svm_distance > 0 means is_llm_dominant=true.
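The decile and dot-product steps described above can be sketched with the standard library alone. The weights and intercept here are placeholders chosen for illustration; the real values are in the packaged model file. The percentile interpolation method is also an assumption, since this README does not say which one minidw-svm uses.

```python
import statistics
from collections import defaultdict

MIN_PAGES = 15  # sites with fewer scored pages are skipped

# Placeholder parameters for illustration only, not the frozen model's values.
WEIGHTS = [-1.0] * 9
INTERCEPT = 8.0


def site_deciles(scores):
    """10th, 20th, ..., 90th percentiles (linear interpolation assumed)."""
    return statistics.quantiles(scores, n=10, method="inclusive")


def predict_sites(page_scores):
    """page_scores: iterable of (site_id, bino_score) for filtered pages."""
    by_site = defaultdict(list)
    for site_id, score in page_scores:
        by_site[site_id].append(score)

    predictions = {}
    for site_id, scores in by_site.items():
        if len(scores) < MIN_PAGES:
            continue
        deciles = site_deciles(scores)
        distance = sum(w * d for w, d in zip(WEIGHTS, deciles)) + INTERCEPT
        predictions[site_id] = {
            "n_filtered_pages": len(scores),
            "svm_distance": distance,
            "is_llm_dominant": distance > 0,
            "deciles": deciles,
        }
    return predictions


# Made-up scores: one low-scoring site, one high-scoring site, one too small.
pages = [("low.example", 0.70 + 0.01 * i) for i in range(21)]
pages += [("high.example", 1.00 + 0.01 * i) for i in range(21)]
pages += [("few.example", 0.50)] * 14
predictions = predict_sites(pages)
```

Using nine deciles rather than a single mean lets a linear model weigh different parts of a site's score distribution separately, which is presumably why the aggregation is defined this way.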

The frozen model is packaged at src/minidw/models/dw_dolma_linear_svm.json. It was exported on 2026-06-11 from a DeGenTWeb Dolma-cleaned full-site SVM.

CSV input for SVM only

If you already have Binoculars scores, run only the SVM:

minidw-svm scored_pages.csv site_predictions.csv --format csv

The CSV must contain:

  • site_id
  • bino_score
  • passes_filter if unfiltered rows are present
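A small pre-flight check on the CSV header can catch a missing column before running the SVM. This sketch only inspects the column names listed above; `check_scored_csv` is a hypothetical helper, and the sample scores are made up.

```python
import csv
import io

REQUIRED_COLUMNS = {"site_id", "bino_score"}


def check_scored_csv(handle) -> dict:
    """Verify required columns exist and report basic facts about the file."""
    reader = csv.DictReader(handle)
    columns = set(reader.fieldnames or [])
    missing = REQUIRED_COLUMNS - columns
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    n_rows = sum(1 for _ in reader)
    return {"n_rows": n_rows, "has_passes_filter": "passes_filter" in columns}


# Made-up rows for illustration.
sample_csv = io.StringIO(
    "site_id,bino_score\n"
    "example.com,0.91\n"
    "example.com,0.84\n"
)
report = check_scored_csv(sample_csv)
```

With a real file, open it with `newline=""` as the `csv` module documentation recommends, for example `check_scored_csv(open("scored_pages.csv", newline=""))`.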

Output schema

minidw-svm writes one row per retained site:

  • site_id
  • n_filtered_pages
  • svm_distance
  • is_llm_dominant
  • deciles

Development checks

ruff check .
pytest

About

Local miniature DeGenTWeb pipeline for text filtering, Binoculars scoring, and site-level SVM experiments
