Problem
Open data files often contain duplicate rows caused by export errors or data entry mistakes. This is currently not checked.
Proposed check
Phase 3 — Content, new check: phase3_duplicate_rows
- Detect exact duplicate rows (all columns match)
- Report: count of duplicates, percentage of total rows, example rows
- Severity: MAJOR (duplicate rows distort aggregations and statistics)
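A minimal sketch of what the check could report, using only the Python standard library. The function name `find_duplicate_rows`, the report keys, and the `max_examples` parameter are hypothetical and not part of any existing code. It assumes the first line is a header and counts every occurrence beyond the first as a duplicate, which matches "total rows minus distinct rows".

```python
import csv
import io
from collections import Counter


def find_duplicate_rows(csv_text, max_examples=3):
    """Summarize exact duplicate rows (all columns match) in CSV text."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # assumption: the first line is a header, so skip it
    counts = Counter(tuple(row) for row in reader)
    total = sum(counts.values())
    # Every occurrence beyond the first one counts as a duplicate.
    duplicates = sum(n - 1 for n in counts.values() if n > 1)
    examples = [list(row) for row, n in counts.items() if n > 1][:max_examples]
    return {
        "check": "phase3_duplicate_rows",
        "severity": "MAJOR" if duplicates else None,  # hypothetical report shape
        "total_rows": total,
        "duplicate_count": duplicates,
        "duplicate_pct": round(100 * duplicates / total, 2) if total else 0.0,
        "examples": examples,
    }


sample = "id,city\n1,Rome\n2,Milan\n1,Rome\n1,Rome\n"
report = find_duplicate_rows(sample)
```

This version holds all distinct rows in memory, so it is only meant to illustrate the report contents; large files would likely be better served by the DuckDB approach.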
Inspiration
Article: "5 Useful Python Scripts for Automated Data Quality Checks" (KDnuggets, Feb 2026), script 3 (duplicate record detector).
Implementation hint
DuckDB can detect exact duplicates efficiently:
WITH t AS (SELECT * FROM read_csv_auto('data.csv'))
SELECT (SELECT COUNT(*) FROM t) - (SELECT COUNT(*) FROM (SELECT DISTINCT * FROM t)) AS duplicate_count;
Out of scope (for now)
Near-duplicate rows (fuzzy matching across all columns) — too expensive and domain-specific.