Provero

provero (Esperanto): to test, to put to proof.

A vendor-neutral, declarative data quality engine.

Quick Start

pip install provero
provero init

Edit provero.yaml:

source:
  type: duckdb
  table: orders

checks:
  - not_null: [order_id, customer_id, amount]
  - unique: order_id
  - accepted_values:
      column: status
      values: [pending, shipped, delivered, cancelled]
  - range:
      column: amount
      min: 0
      max: 100000
  - row_count:
      min: 1

Run:

provero run

┌─────────────────┬──────────────┬──────────┬──────────────────┬──────────────────┐
│ Check           │ Column       │ Status   │ Observed         │ Expected         │
├─────────────────┼──────────────┼──────────┼──────────────────┼──────────────────┤
│ not_null        │ order_id     │ ✓ PASS   │ 0 nulls          │ 0 nulls          │
│ not_null        │ customer_id  │ ✓ PASS   │ 0 nulls          │ 0 nulls          │
│ not_null        │ amount       │ ✓ PASS   │ 0 nulls          │ 0 nulls          │
│ unique          │ order_id     │ ✓ PASS   │ 0 duplicates     │ 0 duplicates     │
│ accepted_values │ status       │ ✓ PASS   │ 0 invalid values │ only [pending..] │
│ range           │ amount       │ ✓ PASS   │ min=45, max=999  │ min=0, max=100k  │
│ row_count       │ -            │ ✓ PASS   │ 5                │ >= 1             │
└─────────────────┴──────────────┴──────────┴──────────────────┴──────────────────┘

Score: 100/100 | 7 passed, 0 failed | 22ms

Features

20 check types: not_null, unique, unique_combination, completeness, accepted_values, range, regex, email_validation, type, freshness, latency, row_count, row_count_change, anomaly, custom_sql, referential_integrity, distribution, cardinality, drift, cross_table_count
3 stable connectors: DuckDB (files + in-memory), PostgreSQL, Pandas/Polars DataFrame; Snowflake, BigQuery, MySQL, and Redshift are in beta
SQL batch optimizer: compiles N checks into 1 query
Data contracts: schema validation, SLA enforcement, contract diff
Anomaly detection: Z-Score, MAD, IQR (stdlib only, no scipy needed)
HTML reports: provero run --report html
Webhook alerts: notify Slack, PagerDuty, or any HTTP endpoint on failure
Result store: SQLite with time-series metrics and provero history
Data profiling: provero profile --suggest auto-generates checks
Configurable severity: info, warning, critical, blocker per check
JSON Schema validation for provero.yaml
Statistical checks: distribution (mean/stddev bounds), cardinality (distinct count/ratio), drift (PSI vs a discrete baseline), cross_table_count (row-count parity/ratio between two tables)
Connection pooling and retry: per-source pool sizing, connect timeouts, and bounded retry-with-backoff for SQLAlchemy connectors
Observability: structured JSON audit log, OpenTelemetry spans, and Prometheus metrics on provero run, with secret redaction in audit output
Server mode: provero serve exposes a FastAPI REST API (health, suites, runs, /metrics), a stdlib interval scheduler, and X-API-Key authentication
CI output formats: provero run --format sarif and --format junit for code-scanning and test-report integrations
Contract versioning: version-aware provero contract diff flags breaking changes and missing version bumps with severity policies
Airflow provider: ProveroCheckOperator + @provero_check decorator
SodaCL migration: provero import soda converts configs in one command
dbt interop: provero export dbt generates schema.yml test definitions
Continuous monitoring: provero watch polls checks on interval
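
To make the batch optimizer idea concrete, here is a minimal sketch of how several checks can be folded into one aggregate query. It is not Provero's actual compiler: the function name compile_batch and the column aliases are hypothetical, and it uses the stdlib sqlite3 module instead of DuckDB so it runs without extra installs.

```python
import sqlite3


def compile_batch(table, not_null_cols, unique_cols):
    """Build one SELECT that computes every metric in a single table scan.

    Identifiers are interpolated directly, so this sketch assumes they come
    from a trusted config file, not from user input.
    """
    exprs = ["COUNT(*) AS row_count"]
    for col in not_null_cols:
        # COUNT(col) skips NULLs, so the difference is the null count.
        exprs.append(f"COUNT(*) - COUNT({col}) AS nulls_{col}")
    for col in unique_cols:
        # Non-null values minus distinct values = surplus duplicate rows.
        exprs.append(f"COUNT({col}) - COUNT(DISTINCT {col}) AS dupes_{col}")
    return f"SELECT {', '.join(exprs)} FROM {table}"


conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10, 45.0), (2, 11, 999.0), (2, None, 120.0)],
)

sql = compile_batch("orders", ["order_id", "customer_id"], ["order_id"])
metrics = dict(conn.execute(sql).fetchone())
print(metrics)
```

Three logical checks (row_count, two not_null, one unique) cost one round trip here; each check then only has to compare its own metric against its expected value.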

Check Types

Check	Description	Example
`not_null`	Column has no null values	`not_null: order_id`
`unique`	Column has no duplicate values	`unique: order_id`
`unique_combination`	Composite uniqueness across columns	`unique_combination: [date, store_id]`
`completeness`	Minimum percentage of non-null values	`completeness: { column: email, min: 95% }`
`accepted_values`	Column values are within allowed set	`accepted_values: { column: status, values: [a, b] }`
`range`	Numeric values within min/max bounds	`range: { column: amount, min: 0, max: 100000 }`
`regex`	Values match a regular expression	`regex: { column: email, pattern: ".+@.+" }`
`email_validation`	Values are valid email addresses	`email_validation: { column: email }`
`type`	Column data type matches expected	`type: { column: amount, expected: numeric }`
`freshness`	Most recent timestamp within threshold	`freshness: { column: updated_at, max_age: 24h }`
`latency`	Time between two timestamp columns	`latency: { source_column: created_at, target_column: processed_at, max_latency: 1h }`
`row_count`	Table row count within bounds	`row_count: { min: 1, max: 1000000 }`
`row_count_change`	Row count change vs previous run	`row_count_change: { max_decrease: 10% }`
`anomaly`	Statistical anomaly detection	`anomaly: { column: amount, method: zscore }`
`custom_sql`	Custom SQL query returns truthy value	`custom_sql: "SELECT COUNT(*) > 0 FROM orders"`
`referential_integrity`	FK values exist in reference table	`referential_integrity: { column: customer_id, reference_table: customers, reference_column: id }`
`distribution`	Column mean/stddev within bounds	`distribution: { column: amount, mean: 100, mean_tolerance: 5, stddev_max: 50 }`
`cardinality`	Distinct count or ratio within bounds	`cardinality: { column: country_code, min: 2, max: 250 }`
`drift`	PSI of a column vs a discrete baseline	`drift: { column: segment, baseline: { A: 0.5, B: 0.3, C: 0.2 }, threshold: 0.25 }`
`cross_table_count`	Row-count parity/ratio between two tables	`cross_table_count: { other_table: staging.orders, tolerance: 0 }`

Configuration

A provero.yaml file defines your data source, checks, alerts, and contracts:

# Source configuration
source:
  type: duckdb                    # duckdb, postgres, dataframe
  table: orders                   # table name or file expression
  # connection: postgres://...    # connection string for databases

# Quality checks
checks:
  - not_null: [order_id, customer_id]
  - unique: order_id
  - range:
      column: amount
      min: 0
      max: 100000
  - freshness:
      column: updated_at
      max_age: 24h
  - anomaly:
      column: amount
      method: zscore               # zscore, mad, iqr
      threshold: 3.0
      window: 30                   # lookback window in days
  - referential_integrity:
      column: customer_id
      reference_table: customers
      reference_column: id

# Severity levels: info, warning, critical, blocker
# Blocker checks cause a non-zero exit code

# Alert notifications
alerts:
  - type: webhook
    url: https://hooks.slack.com/services/YOUR/WEBHOOK
    trigger: on_failure            # on_failure, on_success, always

# Data contracts (optional)
contracts:
  - name: orders_contract
    owner: data-team
    table: orders
    schema:
      columns:
        - name: order_id
          type: integer
          checks: [not_null, unique]
    sla:
      freshness: 24h

Anomaly Detection

Provero includes built-in statistical anomaly detection that works without external dependencies (no scipy needed).

Supported methods:

Method	Description	Best for
`zscore`	Standard Z-Score	Normally distributed metrics
`mad`	Median Absolute Deviation	Robust to outliers
`iqr`	Interquartile Range	Skewed distributions

checks:
  - anomaly:
      column: daily_revenue
      method: mad
      threshold: 3.5
      window: 30

Anomaly detection uses the result store to compare current values against historical data. Run provero run regularly to build up the baseline.
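
The three methods can be sketched with the standard library alone. This is an illustration of the general techniques, not Provero's implementation: the function names are hypothetical, the 0.6745 scaling in the MAD score is the common "modified Z-score" convention (assumed here), and the IQR multiplier k=1.5 is the usual Tukey default.

```python
import statistics


def zscore_outlier(history, value, threshold=3.0):
    """Flag value if it lies more than `threshold` stddevs from the mean."""
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return value != mean
    return abs(value - mean) / stdev > threshold


def mad_outlier(history, value, threshold=3.5):
    """Flag value using the median absolute deviation (robust to outliers)."""
    median = statistics.median(history)
    mad = statistics.median([abs(x - median) for x in history])
    if mad == 0:
        return value != median
    # 0.6745 rescales the score to be comparable to a Z-score
    # when the data is normally distributed.
    return 0.6745 * abs(value - median) / mad > threshold


def iqr_outlier(history, value, k=1.5):
    """Flag value if it falls outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(history, n=4)
    iqr = q3 - q1
    return value < q1 - k * iqr or value > q3 + k * iqr


history = [100, 102, 98, 101, 99, 103, 97, 100]
for check in (zscore_outlier, mad_outlier, iqr_outlier):
    print(check.__name__, check(history, 150), check(history, 101))
```

In each case `history` stands in for the values a result store would return for the lookback window.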

CLI Commands

Command	Description
`provero init`	Create a new provero.yaml template
`provero run`	Execute quality checks
`provero validate`	Validate config syntax without running
`provero profile`	Profile a data source
`provero history`	Show historical check results
`provero contract validate`	Validate data contracts against live data
`provero contract diff`	Compare two contract versions
`provero watch`	Continuously run checks on interval
`provero import soda`	Convert SodaCL config to Provero format
`provero export dbt`	Generate dbt schema.yml from checks
`provero serve`	Run the REST API + scheduler server
`provero version`	Show version

provero run accepts --format table|json|csv|sarif|junit. The sarif and junit formats emit a single whole-run document for CI code-scanning and test-report integrations.

Alerts

Send webhook notifications when checks fail:

source:
  type: duckdb
  table: orders

checks:
  - not_null: order_id
  - row_count:
      min: 1

alerts:
  - type: webhook
    url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
    trigger: on_failure
  - type: webhook
    url: ${PAGERDUTY_WEBHOOK}
    headers:
      Authorization: "Bearer ${PD_TOKEN}"

Triggers: on_failure (default), on_success, always.
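
The trigger semantics and the ${VAR} placeholders in the example above can be sketched as follows. Both helpers are hypothetical illustrations; in particular, raising an error for an unset variable is an assumption, and Provero may handle that case differently.

```python
import os
import re

_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value, env=None):
    """Replace ${NAME} placeholders with values from the environment."""
    env = os.environ if env is None else env

    def replace(match):
        name = match.group(1)
        if name not in env:
            # Assumption: fail loudly instead of sending a broken header.
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _VAR.sub(replace, value)


def should_fire(trigger, any_failed):
    """Decide whether an alert fires for a finished run."""
    if trigger == "always":
        return True
    if trigger == "on_success":
        return not any_failed
    if trigger == "on_failure":
        return any_failed
    raise ValueError(f"unknown trigger: {trigger!r}")


print(expand_env("Bearer ${PD_TOKEN}", {"PD_TOKEN": "abc123"}))
print(should_fire("on_failure", any_failed=True))
```

Keeping secrets in environment variables means provero.yaml can be committed without leaking webhook URLs or tokens.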

Observability

provero run can emit governance and telemetry signals through optional observer flags. All three are off by default. The audit log uses only the standard library; the OpenTelemetry and Prometheus observers degrade gracefully when the observability extra is not installed.

# Append a structured JSON audit record of the run (pure stdlib, always available)
provero run --audit-log audit.jsonl

# Emit OpenTelemetry spans for the suite and each check
provero run --otel

# Write a Prometheus text exposition of run metrics to a file
provero run --metrics-file metrics.prom

OpenTelemetry and Prometheus require the observability extra:

pip install 'provero[observability]'

Exposed Prometheus metrics: provero_checks_total (by status), provero_check_duration_seconds, and provero_suite_score. The audit log records run id, suite name, a config hash, and per-check outcomes; connection strings and secrets are redacted before they are written.
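
A file written with --metrics-file can be read back in a CI step, for example to gate a deploy on the suite score. The sketch below parses the Prometheus text exposition format with the standard library. The sample content is invented for illustration: the metric names come from the list above, but the status label values and the HELP text are assumptions, and the parser assumes samples carry no trailing timestamp.

```python
SAMPLE = """\
# HELP provero_suite_score Suite quality score
# TYPE provero_suite_score gauge
provero_suite_score 100.0
# TYPE provero_checks_total counter
provero_checks_total{status="pass"} 7
provero_checks_total{status="fail"} 0
"""


def read_metric(text, name):
    """Return samples for `name` as {label_string: value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        metric, _, value = line.rpartition(" ")
        base, _, labels = metric.partition("{")
        if base == name:
            samples[labels.rstrip("}")] = float(value)
    return samples


score = read_metric(SAMPLE, "provero_suite_score")[""]
totals = read_metric(SAMPLE, "provero_checks_total")
print(score, totals)
```

In practice the text would come from open("metrics.prom").read() instead of the inline SAMPLE string.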

Server Mode

provero serve starts a FastAPI application that exposes the engine over HTTP and can run suites on a schedule. It requires the server extra (FastAPI + Uvicorn):

pip install 'provero[server]'

provero serve                                   # 127.0.0.1:8000, auth disabled
provero serve -c production.yaml --host 0.0.0.0 --port 9000
provero serve --api-key secret1 --api-key secret2

Endpoints:

Method & path	Auth	Description
`GET /health`	no	Liveness probe
`GET /ready`	no	Readiness probe
`GET /suites`	yes	List configured suites
`POST /suites/{name}/run`	yes	Run a suite on demand
`GET /runs`	yes	List historical runs
`GET /runs/{run_id}`	yes	Run detail
`GET /metrics`	no	Prometheus exposition

Authentication is via the X-API-Key header. Allowed keys come from --api-key (repeatable) or the PROVERO_API_KEYS environment variable; if neither is set, auth is disabled. The bundled scheduler (SuiteScheduler) runs a suite on a fixed interval using the standard library only (no extra dependency) and persists every result to the store.
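
A client call against the endpoints above might look like the following sketch, which uses only urllib from the standard library. The helper name build_run_request is hypothetical, the suite name orders_daily is borrowed from the Airflow example, and the empty request body and the shape of the response are assumptions.

```python
import urllib.parse
import urllib.request


def build_run_request(base_url, suite, api_key):
    """Build a POST /suites/{name}/run request authenticated via X-API-Key."""
    path = f"/suites/{urllib.parse.quote(suite, safe='')}/run"
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=b"",  # assumption: the endpoint needs no request body
        method="POST",
        headers={"X-API-Key": api_key},
    )


req = build_run_request("http://127.0.0.1:8000", "orders_daily", "secret1")
print(req.get_method(), req.full_url)

# To send it against a running `provero serve` instance:
# with urllib.request.urlopen(req, timeout=30) as resp:
#     print(resp.status, resp.read().decode())
```

Requests to /health, /ready, and /metrics need no key, so probes and Prometheus scrapers can stay unauthenticated.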

Statistical Checks

Four statistical checks extend the engine for distributional and cross-table validation:

checks:
  # Mean within tolerance and an upper bound on stddev (population statistics)
  - distribution:
      column: amount
      mean: 100.0
      mean_tolerance: 5.0
      stddev_max: 50.0

  # Distinct-value count and/or ratio bounds (ratio = distinct / non_null)
  - cardinality:
      column: country_code
      min: 2
      max: 250
      min_ratio: 0.0

  # Population Stability Index against a discrete baseline distribution
  - drift:
      column: segment
      baseline: { A: 0.5, B: 0.3, C: 0.2 }
      threshold: 0.25         # PSI above this fails
      warn_threshold: 0.1     # PSI above this warns

  # Row-count parity (or ratio) between two tables on the same source
  - cross_table_count:
      other_table: staging.orders
      tolerance: 0

drift is advisory by default (default severity warning): PSI above threshold fails, above warn_threshold warns, otherwise passes. cross_table_count also supports a ratio mode with min_ratio/max_ratio bounds.
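
For readers unfamiliar with PSI, the sketch below computes it for a categorical column against a discrete baseline and maps the score to pass/warn/fail using the thresholds from the example. It shows the standard formula, sum of (actual - expected) * ln(actual / expected); the function names and the small eps used to avoid log(0) for unseen categories are assumptions, not Provero's documented behavior.

```python
import math
from collections import Counter


def psi(values, baseline, eps=1e-6):
    """Population Stability Index of observed values vs baseline proportions."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot compute PSI on an empty column")
    score = 0.0
    for category in set(baseline) | set(counts):
        # Clamp to eps so categories missing on either side stay finite.
        expected = max(baseline.get(category, 0.0), eps)
        actual = max(counts.get(category, 0) / total, eps)
        score += (actual - expected) * math.log(actual / expected)
    return score


def drift_status(score, threshold=0.25, warn_threshold=0.1):
    if score > threshold:
        return "fail"
    if score > warn_threshold:
        return "warn"
    return "pass"


baseline = {"A": 0.5, "B": 0.3, "C": 0.2}
stable = ["A"] * 50 + ["B"] * 30 + ["C"] * 20
mild = ["A"] * 35 + ["B"] * 30 + ["C"] * 35
shifted = ["A"] * 20 + ["B"] * 30 + ["C"] * 50

for name, column in [("stable", stable), ("mild", mild), ("shifted", shifted)]:
    print(name, round(psi(column, baseline), 4), drift_status(psi(column, baseline)))
```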

Data Contracts

Define and enforce schema contracts:

contracts:
  - name: orders_contract
    owner: data-team
    table: orders
    on_violation: warn
    schema:
      columns:
        - name: order_id
          type: integer
          checks: [not_null, unique]
        - name: status
          type: varchar
    sla:
      freshness: 24h
      completeness: "95%"

Contracts carry a version field (default 1.0). provero contract diff is version-aware: it classifies each change as breaking or non-breaking, and warns when a breaking change ships without a major version bump.

provero contract validate
provero contract diff old.yaml new.yaml
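
The idea behind a version-aware diff can be sketched in a few lines. The classification rules here are assumptions chosen for illustration (removing a column or changing its type is breaking, adding a column is not); Provero's actual rules and severity policies may differ, and both function names are hypothetical.

```python
def classify_changes(old_cols, new_cols):
    """Compare {column: type} maps; return (kind, column, is_breaking) tuples."""
    changes = []
    for name, old_type in old_cols.items():
        if name not in new_cols:
            changes.append(("removed", name, True))
        elif new_cols[name] != old_type:
            changes.append(("type_changed", name, True))
    for name in new_cols:
        if name not in old_cols:
            changes.append(("added", name, False))
    return changes


def missing_major_bump(old_version, new_version, changes):
    """True when a breaking change ships without a higher major version."""
    old_major = int(str(old_version).split(".")[0])
    new_major = int(str(new_version).split(".")[0])
    has_breaking = any(breaking for _, _, breaking in changes)
    return has_breaking and new_major <= old_major


old = {"order_id": "integer", "status": "varchar"}
new = {"order_id": "bigint", "created_at": "timestamp"}
changes = classify_changes(old, new)
print(changes)
print(missing_major_bump("1.0", "1.1", changes))
```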

Connectors

Connector	Status	Install
DuckDB	Stable	included
PostgreSQL	Stable	`pip install 'provero[postgres]'`
DataFrame	Stable	`pip install 'provero[dataframe]'`
Snowflake	Beta	`pip install 'provero[snowflake]'`
BigQuery	Beta	`pip install 'provero[bigquery]'`
MySQL	Beta	`pip install 'provero[mysql]'`
Redshift	Beta	`pip install 'provero[redshift]'`

DuckDB supports file expressions: read_csv('data.csv'), read_parquet('*.parquet').

Pooling and retry

SQLAlchemy-backed connectors (PostgreSQL and the beta connectors) accept optional connection-pool and retry tuning per source. Every key is optional; when a key is omitted, the connector keeps its default behavior.

source:
  type: postgres
  table: orders
  connection: ${POSTGRES_URL}
  # Connection pool (forwarded to SQLAlchemy create_engine)
  pool_size: 5
  max_overflow: 10
  pool_pre_ping: true
  pool_recycle: 1800
  pool_timeout: 30
  connect_timeout: 10
  # Bounded retry-with-backoff on transient connection errors
  retry_attempts: 3
  retry_base_delay: 0.1
  retry_max_delay: 5.0
  retry_jitter: true

Only transient failures (dropped connections, backend restarts, deadlocks) are retried. Programming errors such as a missing table or bad SQL fail immediately. Backoff is exponential with full jitter.
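
Exponential backoff with full jitter can be sketched as below. This illustrates the general technique using the same parameter values as the example config; the function names are hypothetical, and treating retry_attempts as the total number of tries (not retries after the first) is an assumption.

```python
import random
import time


def backoff_delay(attempt, base_delay=0.1, max_delay=5.0, jitter=True):
    """Delay before the retry that follows failed attempt `attempt` (0-based)."""
    cap = min(max_delay, base_delay * (2 ** attempt))
    # Full jitter: pick uniformly in [0, cap] so clients do not retry in lockstep.
    return random.uniform(0, cap) if jitter else cap


def call_with_retry(fn, is_transient, attempts=3, sleep=time.sleep, **delay_opts):
    """Call fn, retrying only errors that is_transient() accepts."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:
            out_of_tries = attempt == attempts - 1
            if out_of_tries or not is_transient(exc):
                raise  # non-transient errors and the final failure propagate
            sleep(backoff_delay(attempt, **delay_opts))


state = {"calls": 0}


def flaky_query():
    # Simulates a connection that drops twice before succeeding.
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("connection dropped")
    return "ok"


result = call_with_retry(flaky_query, lambda e: isinstance(e, ConnectionError))
print(result, state["calls"])
```

The cap on the delay bounds the worst-case wait, while the jitter spreads retries from many workers over time so a recovering database is not hit by all of them at once.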

API

Python API

from provero.core.engine import Engine

engine = Engine("provero.yaml")
results = engine.run()

for result in results:
    print(f"{result.check_name}: {result.status}")

Programmatic Configuration

from provero.core.engine import Engine

engine = Engine.from_dict({
    "source": {"type": "duckdb", "table": "orders"},
    "checks": [
        {"not_null": "order_id"},
        {"row_count": {"min": 1}},
    ],
})
results = engine.run()

Airflow Integration

pip install provero-airflow

from provero.airflow.operators import ProveroCheckOperator

check_orders = ProveroCheckOperator(
    task_id="check_orders",
    config_path="dags/provero.yaml",
    suite="orders_daily",
)

Documentation

Full documentation is available on GitHub Pages.

License

Apache License 2.0. See LICENSE.
