feat(open-data-quality): detect duplicate rows in odq-csv #12

@aborruso

Problem

Open data files often contain duplicate rows caused by export errors or data entry mistakes. odq-csv does not currently check for them.

Proposed check

Phase 3 — Content, new check: phase3_duplicate_rows

  • Detect exact duplicate rows (all columns match)
  • Report: count of duplicates, percentage over total rows, example rows
  • Severity: MAJOR (duplicate rows distort aggregations and statistics)
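As a rough illustration of what the check could report, here is a minimal standard-library sketch. It is not the project's implementation: the function signature, the `max_examples` parameter, and the keys of the returned dict are assumptions made for this example. It also assumes the first row is a header and compares cell values as raw strings, so `1` and `1.0` count as different values.

```python
import csv
from collections import Counter


def phase3_duplicate_rows(csv_path, max_examples=3):
    """Count exact duplicate rows (all columns match) in a CSV file.

    Hypothetical sketch: the result keys below are illustrative only.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader, None)  # assumption: the first row is a header
        # Each row becomes a hashable tuple so identical rows collapse together.
        counts = Counter(tuple(row) for row in reader)

    total_rows = sum(counts.values())
    # Every occurrence beyond the first one is counted as a duplicate,
    # i.e. total rows minus distinct rows.
    duplicate_count = total_rows - len(counts)
    percentage = 100.0 * duplicate_count / total_rows if total_rows else 0.0
    # Keep a few rows that occur more than once as examples for the report.
    examples = [list(row) for row, n in counts.items() if n > 1][:max_examples]

    return {
        "check": "phase3_duplicate_rows",
        "severity": "MAJOR" if duplicate_count else None,
        "total_rows": total_rows,
        "duplicate_count": duplicate_count,
        "percentage": percentage,
        "examples": examples,
    }
```

This version keeps every distinct row in memory, so it is only meant to pin down the report fields; for large files a database engine would be the more practical route.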

Inspiration

Article: "5 Useful Python Scripts for Automated Data Quality Checks" (KDnuggets, Feb 2026), script 3 (duplicate record detector).

Implementation hint

DuckDB can detect exact duplicates efficiently:

WITH t AS (SELECT * FROM read_csv_auto('data.csv'))
SELECT (SELECT COUNT(*) FROM t) - (SELECT COUNT(*) FROM (SELECT DISTINCT * FROM t) AS d) AS duplicate_count;

Out of scope (for now)

Near-duplicate rows (fuzzy matching across all columns) — too expensive and domain-specific.

Labels: enhancement (New feature or request)