provero-org/provero

Provero

provero (Esperanto): to test, to put to proof.

A vendor-neutral, declarative data quality engine.

Quick Start

pip install provero
provero init

Edit provero.yaml:

source:
  type: duckdb
  table: orders

checks:
  - not_null: [order_id, customer_id, amount]
  - unique: order_id
  - accepted_values:
      column: status
      values: [pending, shipped, delivered, cancelled]
  - range:
      column: amount
      min: 0
      max: 100000
  - row_count:
      min: 1

Run:

provero run
┌─────────────────┬──────────────┬──────────┬──────────────────┬──────────────────┐
│ Check           │ Column       │ Status   │ Observed         │ Expected         │
├─────────────────┼──────────────┼──────────┼──────────────────┼──────────────────┤
│ not_null        │ order_id     │ ✓ PASS   │ 0 nulls          │ 0 nulls          │
│ not_null        │ customer_id  │ ✓ PASS   │ 0 nulls          │ 0 nulls          │
│ not_null        │ amount       │ ✓ PASS   │ 0 nulls          │ 0 nulls          │
│ unique          │ order_id     │ ✓ PASS   │ 0 duplicates     │ 0 duplicates     │
│ accepted_values │ status       │ ✓ PASS   │ 0 invalid values │ only [pending..] │
│ range           │ amount       │ ✓ PASS   │ min=45, max=999  │ min=0, max=100k  │
│ row_count       │ -            │ ✓ PASS   │ 5                │ >= 1             │
└─────────────────┴──────────────┴──────────┴──────────────────┴──────────────────┘

Score: 100/100 | 7 passed, 0 failed | 22ms

Features

  • 20 check types: not_null, unique, unique_combination, completeness, accepted_values, range, regex, email_validation, type, freshness, latency, row_count, row_count_change, anomaly, custom_sql, referential_integrity, distribution, cardinality, drift, cross_table_count
  • 3 stable connectors: DuckDB (files + in-memory), PostgreSQL, Pandas/Polars DataFrame; beta connectors for Snowflake, BigQuery, MySQL, and Redshift
  • SQL batch optimizer: compiles N checks into 1 query
  • Data contracts: schema validation, SLA enforcement, contract diff
  • Anomaly detection: Z-Score, MAD, IQR (stdlib only, no scipy needed)
  • HTML reports: provero run --report html
  • Webhook alerts: notify Slack, PagerDuty, or any HTTP endpoint on failure
  • Result store: SQLite with time-series metrics and provero history
  • Data profiling: provero profile --suggest auto-generates checks
  • Configurable severity: info, warning, critical, blocker per check
  • JSON Schema validation for provero.yaml
  • Statistical checks: distribution (mean/stddev bounds), cardinality (distinct count/ratio), drift (PSI vs a discrete baseline), cross_table_count (row-count parity/ratio between two tables)
  • Connection pooling and retry: per-source pool sizing, connect timeouts, and bounded retry-with-backoff for SQLAlchemy connectors
  • Observability: structured JSON audit log, OpenTelemetry spans, and Prometheus metrics on provero run, with secret redaction in audit output
  • Server mode: provero serve exposes a FastAPI REST API (health, suites, runs, /metrics), a stdlib interval scheduler, and X-API-Key authentication
  • CI output formats: provero run --format sarif and --format junit for code-scanning and test-report integrations
  • Contract versioning: version-aware provero contract diff flags breaking changes and missing version bumps with severity policies
  • Airflow provider: ProveroCheckOperator + @provero_check decorator
  • SodaCL migration: provero import soda converts configs in one command
  • dbt interop: provero export dbt generates schema.yml test definitions
  • Continuous monitoring: provero watch polls checks on interval
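
The "compiles N checks into 1 query" idea behind the SQL batch optimizer can be illustrated with a small sketch. This is not Provero's actual compiler: the function names are hypothetical, only not_null checks are handled, and sqlite3 stands in for DuckDB so the example runs on the standard library alone.

```python
import sqlite3


def compile_not_null_batch(table: str, columns: list[str]) -> str:
    """Build one SELECT that counts nulls for every column in a single scan."""
    # COUNT(*) counts all rows and COUNT(col) skips NULLs,
    # so their difference is the number of NULLs in that column.
    # Identifiers are quoted naively; a real compiler would validate them.
    parts = [f'COUNT(*) - COUNT("{col}") AS "{col}_nulls"' for col in columns]
    return f'SELECT COUNT(*) AS row_count, {", ".join(parts)} FROM "{table}"'


def run_batch(conn: sqlite3.Connection, table: str, columns: list[str]) -> dict:
    """Execute the batched query and return {metric_name: value}."""
    cur = conn.execute(compile_not_null_batch(table, columns))
    row = cur.fetchone()
    names = [d[0] for d in cur.description]
    return dict(zip(names, row))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10, 45.0), (2, None, 999.0), (3, 12, None)],
)

# Three not_null checks plus a row count, answered by one query.
metrics = run_batch(conn, "orders", ["order_id", "customer_id", "amount"])
print(metrics)
```

Each check then reads its own metric from the shared result row instead of issuing a separate query.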

Check Types

| Check | Description | Example |
| --- | --- | --- |
| not_null | Column has no null values | not_null: order_id |
| unique | Column has no duplicate values | unique: order_id |
| unique_combination | Composite uniqueness across columns | unique_combination: [date, store_id] |
| completeness | Minimum percentage of non-null values | completeness: { column: email, min: 95% } |
| accepted_values | Column values are within allowed set | accepted_values: { column: status, values: [a, b] } |
| range | Numeric values within min/max bounds | range: { column: amount, min: 0, max: 100000 } |
| regex | Values match a regular expression | regex: { column: email, pattern: ".+@.+" } |
| email_validation | Values are valid email addresses | email_validation: { column: email } |
| type | Column data type matches expected | type: { column: amount, expected: numeric } |
| freshness | Most recent timestamp within threshold | freshness: { column: updated_at, max_age: 24h } |
| latency | Time between two timestamp columns | latency: { source_column: created_at, target_column: processed_at, max_latency: 1h } |
| row_count | Table row count within bounds | row_count: { min: 1, max: 1000000 } |
| row_count_change | Row count change vs previous run | row_count_change: { max_decrease: 10% } |
| anomaly | Statistical anomaly detection | anomaly: { column: amount, method: zscore } |
| custom_sql | Custom SQL query returns truthy value | custom_sql: "SELECT COUNT(*) > 0 FROM orders" |
| referential_integrity | FK values exist in reference table | referential_integrity: { column: customer_id, reference_table: customers, reference_column: id } |
| distribution | Column mean/stddev within bounds | distribution: { column: amount, mean: 100, mean_tolerance: 5, stddev_max: 50 } |
| cardinality | Distinct count or ratio within bounds | cardinality: { column: country_code, min: 2, max: 250 } |
| drift | PSI of a column vs a discrete baseline | drift: { column: segment, baseline: { A: 0.5, B: 0.3, C: 0.2 }, threshold: 0.25 } |
| cross_table_count | Row-count parity/ratio between two tables | cross_table_count: { other_table: staging.orders, tolerance: 0 } |

Configuration

A provero.yaml file defines your data source, checks, alerts, and contracts:

# Source configuration
source:
  type: duckdb                    # duckdb, postgres, dataframe
  table: orders                   # table name or file expression
  # connection: postgres://...    # connection string for databases

# Quality checks
checks:
  - not_null: [order_id, customer_id]
  - unique: order_id
  - range:
      column: amount
      min: 0
      max: 100000
  - freshness:
      column: updated_at
      max_age: 24h
  - anomaly:
      column: amount
      method: zscore               # zscore, mad, iqr
      threshold: 3.0
      window: 30                   # lookback window in days
  - referential_integrity:
      column: customer_id
      reference_table: customers
      reference_column: id

# Severity levels: info, warning, critical, blocker
# Blocker checks cause a non-zero exit code

# Alert notifications
alerts:
  - type: webhook
    url: https://hooks.slack.com/services/YOUR/WEBHOOK
    trigger: on_failure            # on_failure, on_success, always

# Data contracts (optional)
contracts:
  - name: orders_contract
    owner: data-team
    table: orders
    schema:
      columns:
        - name: order_id
          type: integer
          checks: [not_null, unique]
    sla:
      freshness: 24h

Anomaly Detection

Provero includes built-in statistical anomaly detection that works without external dependencies (no scipy needed).

Supported methods:

| Method | Description | Best for |
| --- | --- | --- |
| zscore | Standard Z-Score | Normally distributed metrics |
| mad | Median Absolute Deviation | Robust to outliers |
| iqr | Interquartile Range | Skewed distributions |

checks:
  - anomaly:
      column: daily_revenue
      method: mad
      threshold: 3.5
      window: 30

Anomaly detection uses the result store to compare current values against historical data. Run provero run regularly to build up the baseline.
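
To make the three methods concrete, here is a minimal standard-library sketch of how each one might flag a new value against a historical window. The formulas are the textbook ones and are an assumption, not a copy of Provero's implementation: the 0.6745 scaling for MAD, the 1.5 multiplier for IQR, and the function names are all illustrative, and how the threshold option maps onto the IQR method is not documented here.

```python
import statistics


def zscore_outlier(history, value, threshold=3.0):
    """Flag value if it lies more than `threshold` std devs from the mean."""
    mean = statistics.fmean(history)
    std = statistics.pstdev(history)
    if std == 0:
        return value != mean
    return abs(value - mean) / std > threshold


def mad_outlier(history, value, threshold=3.5):
    """Flag value using the modified z-score based on the median absolute deviation."""
    med = statistics.median(history)
    mad = statistics.median([abs(x - med) for x in history])
    if mad == 0:
        return value != med
    # 0.6745 rescales MAD so it is comparable to a standard deviation
    # for normally distributed data (assumed convention).
    return 0.6745 * abs(value - med) / mad > threshold


def iqr_outlier(history, value, k=1.5):
    """Flag value if it falls outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(history, n=4)
    iqr = q3 - q1
    return value < q1 - k * iqr or value > q3 + k * iqr


# Hypothetical 8-run history of a daily metric, then two candidate values.
history = [100, 102, 98, 101, 99, 103, 97, 100]
for candidate in (101, 150):
    flags = {
        "zscore": zscore_outlier(history, candidate),
        "mad": mad_outlier(history, candidate),
        "iqr": iqr_outlier(history, candidate),
    }
    print(candidate, flags)
```

The median-based methods keep working when the history already contains a few extreme values, which is why mad is listed as robust to outliers.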

CLI Commands

| Command | Description |
| --- | --- |
| provero init | Create a new provero.yaml template |
| provero run | Execute quality checks |
| provero validate | Validate config syntax without running |
| provero profile | Profile a data source |
| provero history | Show historical check results |
| provero contract validate | Validate data contracts against live data |
| provero contract diff | Compare two contract versions |
| provero watch | Continuously run checks on interval |
| provero import soda | Convert SodaCL config to Provero format |
| provero export dbt | Generate dbt schema.yml from checks |
| provero serve | Run the REST API + scheduler server |
| provero version | Show version |

provero run accepts --format table|json|csv|sarif|junit. The sarif and junit formats emit a single whole-run document for CI code-scanning and test-report integrations.

Alerts

Send webhook notifications when checks fail:

source:
  type: duckdb
  table: orders

checks:
  - not_null: order_id
  - row_count:
      min: 1

alerts:
  - type: webhook
    url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
    trigger: on_failure
  - type: webhook
    url: ${PAGERDUTY_WEBHOOK}
    headers:
      Authorization: "Bearer ${PD_TOKEN}"

Triggers: on_failure (default), on_success, always.

Observability

provero run can emit governance and telemetry signals through three optional observer flags. All three are off by default. The audit log uses only the standard library; the OpenTelemetry and Prometheus observers degrade gracefully when their optional dependency is absent.

# Append a structured JSON audit record of the run (pure stdlib, always available)
provero run --audit-log audit.jsonl

# Emit OpenTelemetry spans for the suite and each check
provero run --otel

# Write a Prometheus text exposition of run metrics to a file
provero run --metrics-file metrics.prom

OpenTelemetry and Prometheus require the observability extra:

pip install 'provero[observability]'

Exposed Prometheus metrics: provero_checks_total (by status), provero_check_duration_seconds, and provero_suite_score. The audit log records run id, suite name, a config hash, and per-check outcomes; connection strings and secrets are redacted before they are written.
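
As an illustration of what redacting a connection string before logging can look like, here is a small standard-library sketch. It is an assumption about the general technique, not Provero's redaction code; the function name and the *** mask are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def redact_connection_string(url: str, mask: str = "***") -> str:
    """Replace the password part of a URL-style connection string with a mask."""
    parts = urlsplit(url)
    if parts.password is None:
        return url  # nothing secret in the authority part
    # Rebuild the authority without the password.
    # Note: this simple sketch does not handle IPv6 literal hosts.
    host = parts.hostname or ""
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    user = parts.username or ""
    return urlunsplit(parts._replace(netloc=f"{user}:{mask}@{host}"))


redacted = redact_connection_string("postgres://app:s3cret@db.example.com:5432/orders")
print(redacted)
```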

Server Mode

provero serve starts a FastAPI application that exposes the engine over HTTP and can run suites on a schedule. It requires the server extra (FastAPI + Uvicorn):

pip install 'provero[server]'

provero serve                                   # 127.0.0.1:8000, auth disabled
provero serve -c production.yaml --host 0.0.0.0 --port 9000
provero serve --api-key secret1 --api-key secret2

Endpoints:

| Method & path | Auth | Description |
| --- | --- | --- |
| GET /health | no | Liveness probe |
| GET /ready | no | Readiness probe |
| GET /suites | yes | List configured suites |
| POST /suites/{name}/run | yes | Run a suite on demand |
| GET /runs | yes | List historical runs |
| GET /runs/{run_id} | yes | Run detail |
| GET /metrics | no | Prometheus exposition |

Authentication is via the X-API-Key header. Allowed keys come from --api-key (repeatable) or the PROVERO_API_KEYS environment variable; if neither is set, auth is disabled. The bundled scheduler (SuiteScheduler) runs a suite on a fixed interval using the standard library only (no extra dependency) and persists every result to the store.
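
A client for the endpoints above could look like the following standard-library sketch. The paths and the X-API-Key header come from this section; the helper names, the suite name orders_daily, the key secret1, and the assumption that responses are JSON are illustrative only.

```python
import json
import urllib.request


def build_request(base_url, path, api_key=None, method="GET"):
    """Build a request for the Provero server, attaching X-API-Key when given."""
    headers = {"Accept": "application/json"}
    if api_key:
        headers["X-API-Key"] = api_key
    return urllib.request.Request(
        base_url.rstrip("/") + path, headers=headers, method=method
    )


def call(base_url, path, api_key=None, method="GET", timeout=10):
    """Send the request and decode the body, assuming the server returns JSON."""
    req = build_request(base_url, path, api_key=api_key, method=method)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Building a request needs no running server; call() does.
req = build_request(
    "http://127.0.0.1:8000", "/suites/orders_daily/run", api_key="secret1", method="POST"
)
print(req.get_method(), req.full_url)
```

With a server started via provero serve --api-key secret1, call("http://127.0.0.1:8000", "/suites", api_key="secret1") would list the configured suites, while /health and /ready need no key.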

Statistical Checks

Four statistical checks extend the engine for distributional and cross-table validation:

checks:
  # Mean within tolerance and an upper bound on stddev (population statistics)
  - distribution:
      column: amount
      mean: 100.0
      mean_tolerance: 5.0
      stddev_max: 50.0

  # Distinct-value count and/or ratio bounds (ratio = distinct / non_null)
  - cardinality:
      column: country_code
      min: 2
      max: 250
      min_ratio: 0.0

  # Population Stability Index against a discrete baseline distribution
  - drift:
      column: segment
      baseline: { A: 0.5, B: 0.3, C: 0.2 }
      threshold: 0.25         # PSI above this fails
      warn_threshold: 0.1     # PSI above this warns

  # Row-count parity (or ratio) between two tables on the same source
  - cross_table_count:
      other_table: staging.orders
      tolerance: 0

drift is advisory by default (default severity warning): PSI above threshold fails, above warn_threshold warns, otherwise passes. cross_table_count also supports a ratio mode with min_ratio/max_ratio bounds.
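
For readers unfamiliar with PSI, the sketch below computes the standard Population Stability Index, the sum over categories of (actual - expected) * ln(actual / expected), and applies the fail/warn/pass rule described above. The epsilon floor for empty categories and the function names are assumptions; Provero's handling of unseen categories may differ.

```python
import math
from collections import Counter


def psi(baseline, observed_values, epsilon=1e-6):
    """Population Stability Index of observed categories vs a baseline distribution."""
    counts = Counter(observed_values)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no observed values")
    score = 0.0
    for category in set(baseline) | set(counts):
        # Floor both proportions so the logarithm is defined for empty categories.
        expected = max(baseline.get(category, 0.0), epsilon)
        actual = max(counts.get(category, 0) / total, epsilon)
        score += (actual - expected) * math.log(actual / expected)
    return score


def classify(score, threshold=0.25, warn_threshold=0.1):
    """Map a PSI score to fail / warn / pass using the two thresholds."""
    if score > threshold:
        return "fail"
    if score > warn_threshold:
        return "warn"
    return "pass"


baseline = {"A": 0.5, "B": 0.3, "C": 0.2}
stable = ["A"] * 50 + ["B"] * 30 + ["C"] * 20    # matches the baseline
shifted = ["A"] * 35 + ["B"] * 30 + ["C"] * 35   # moderate shift from A to C
flipped = ["A"] * 20 + ["B"] * 30 + ["C"] * 50   # A and C swap shares

for name, sample in [("stable", stable), ("shifted", shifted), ("flipped", flipped)]:
    score = psi(baseline, sample)
    print(name, round(score, 4), classify(score))
```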

Data Contracts

Define and enforce schema contracts:

contracts:
  - name: orders_contract
    owner: data-team
    table: orders
    on_violation: warn
    schema:
      columns:
        - name: order_id
          type: integer
          checks: [not_null, unique]
        - name: status
          type: varchar
    sla:
      freshness: 24h
      completeness: "95%"

Contracts carry a version field (default 1.0). provero contract diff is version-aware: it classifies each change as breaking or non-breaking, and warns when a breaking change ships without a major version bump.

provero contract validate
provero contract diff old.yaml new.yaml
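
The version-aware behaviour can be pictured with a short sketch. The classification rules used here (removed column or changed type is breaking, added column is non-breaking) and all function names are assumptions for illustration; the actual rules and severity policies of provero contract diff are not spelled out in this README.

```python
def diff_columns(old, new):
    """Compare {column: type} mappings and return (column, change, is_breaking) tuples."""
    changes = []
    for name, old_type in old.items():
        if name not in new:
            changes.append((name, "removed", True))
        elif new[name] != old_type:
            changes.append((name, f"type {old_type} -> {new[name]}", True))
    for name in new:
        if name not in old:
            changes.append((name, "added", False))
    return changes


def major(version):
    """Extract the major component of a version such as '1.0'."""
    return int(str(version).split(".")[0])


def missing_major_bump(old_version, new_version, changes):
    """True when a breaking change ships without a major version increase."""
    has_breaking = any(is_breaking for _, _, is_breaking in changes)
    return has_breaking and major(new_version) <= major(old_version)


old_schema = {"order_id": "integer", "status": "varchar"}
new_schema = {"order_id": "bigint", "status": "varchar", "currency": "varchar"}

changes = diff_columns(old_schema, new_schema)
for column, change, is_breaking in changes:
    print(column, change, "BREAKING" if is_breaking else "non-breaking")
print("missing major bump:", missing_major_bump("1.0", "1.1", changes))
```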

Connectors

| Connector | Status | Install |
| --- | --- | --- |
| DuckDB | Stable | included |
| PostgreSQL | Stable | pip install 'provero[postgres]' |
| DataFrame | Stable | pip install 'provero[dataframe]' |
| Snowflake | Beta | pip install 'provero[snowflake]' |
| BigQuery | Beta | pip install 'provero[bigquery]' |
| MySQL | Beta | pip install 'provero[mysql]' |
| Redshift | Beta | pip install 'provero[redshift]' |

DuckDB supports file expressions: read_csv('data.csv'), read_parquet('*.parquet').

Pooling and retry

SQLAlchemy-backed connectors (PostgreSQL and the beta connectors) accept optional connection-pool and retry tuning per source. Every key is optional; when omitted, the connector behaves exactly as before.

source:
  type: postgres
  table: orders
  connection: ${POSTGRES_URL}
  # Connection pool (forwarded to SQLAlchemy create_engine)
  pool_size: 5
  max_overflow: 10
  pool_pre_ping: true
  pool_recycle: 1800
  pool_timeout: 30
  connect_timeout: 10
  # Bounded retry-with-backoff on transient connection errors
  retry_attempts: 3
  retry_base_delay: 0.1
  retry_max_delay: 5.0
  retry_jitter: true

Only transient failures (dropped connections, backend restarts, deadlocks) are retried. Programming errors such as a missing table or bad SQL fail immediately. Backoff is exponential with full jitter.
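
The retry behaviour described above follows a well-known pattern, sketched here in plain Python. This is not the connector's code: TransientError and the function names are hypothetical, and treating retry_attempts as the total number of tries (rather than retries after the first) is an assumption.

```python
import random
import time


class TransientError(Exception):
    """Stand-in for a dropped connection or backend restart (hypothetical name)."""


def backoff_delays(attempts, base_delay=0.1, max_delay=5.0, jitter=True, rng=random.random):
    """Yield one delay per attempt: exponential growth, capped, optionally jittered."""
    for attempt in range(attempts):
        cap = min(max_delay, base_delay * (2 ** attempt))
        # Full jitter: sleep a uniform random time in [0, cap) instead of cap itself.
        yield rng() * cap if jitter else cap


def call_with_retry(operation, attempts=3, base_delay=0.1, max_delay=5.0,
                    jitter=True, sleep=time.sleep):
    """Run operation, retrying only TransientError; other exceptions propagate at once."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delays = backoff_delays(attempts, base_delay, max_delay, jitter)
    for attempt, delay in enumerate(delays, start=1):
        try:
            return operation()
        except TransientError:
            if attempt == attempts:
                raise  # out of attempts: surface the last transient error
            sleep(delay)


# Demo: an operation that fails twice, then succeeds on the third try.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("connection dropped")
    return "ok"


slept = []  # record delays instead of actually sleeping
result = call_with_retry(flaky, attempts=3, jitter=False, sleep=slept.append)
print(result, slept)
```

Jitter spreads retries from many clients over time, so a restarted backend is not hit by all of them at the same instant.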

API

Python API

from provero.core.engine import Engine

engine = Engine("provero.yaml")
results = engine.run()

for result in results:
    print(f"{result.check_name}: {result.status}")

Programmatic Configuration

from provero.core.engine import Engine

engine = Engine.from_dict({
    "source": {"type": "duckdb", "table": "orders"},
    "checks": [
        {"not_null": "order_id"},
        {"row_count": {"min": 1}},
    ],
})
results = engine.run()

Airflow Integration

pip install provero-airflow

from provero.airflow.operators import ProveroCheckOperator

check_orders = ProveroCheckOperator(
    task_id="check_orders",
    config_path="dags/provero.yaml",
    suite="orders_daily",
)

Documentation

Full documentation is available on GitHub Pages.

License

Apache License 2.0. See LICENSE.