provero (Esperanto): to test, to put to proof.
A vendor-neutral, declarative data quality engine.

    pip install provero
    provero init

Edit provero.yaml:

    source:
      type: duckdb
      table: orders
    checks:
      - not_null: [order_id, customer_id, amount]
      - unique: order_id
      - accepted_values:
          column: status
          values: [pending, shipped, delivered, cancelled]
      - range:
          column: amount
          min: 0
          max: 100000
      - row_count:
          min: 1

Run:

    provero run

    ┌─────────────────┬──────────────┬──────────┬──────────────────┬──────────────────┐
    │ Check           │ Column       │ Status   │ Observed         │ Expected         │
    ├─────────────────┼──────────────┼──────────┼──────────────────┼──────────────────┤
    │ not_null        │ order_id     │ ✓ PASS   │ 0 nulls          │ 0 nulls          │
    │ not_null        │ customer_id  │ ✓ PASS   │ 0 nulls          │ 0 nulls          │
    │ not_null        │ amount       │ ✓ PASS   │ 0 nulls          │ 0 nulls          │
    │ unique          │ order_id     │ ✓ PASS   │ 0 duplicates     │ 0 duplicates     │
    │ accepted_values │ status       │ ✓ PASS   │ 0 invalid values │ only [pending..] │
    │ range           │ amount       │ ✓ PASS   │ min=45, max=999  │ min=0, max=100k  │
    │ row_count       │ -            │ ✓ PASS   │ 5                │ >= 1             │
    └─────────────────┴──────────────┴──────────┴──────────────────┴──────────────────┘
    Score: 100/100 | 7 passed, 0 failed | 22ms

- 20 check types: not_null, unique, unique_combination, completeness, accepted_values, range, regex, email_validation, type, freshness, latency, row_count, row_count_change, anomaly, custom_sql, referential_integrity, distribution, cardinality, drift, cross_table_count
- 3 connectors: DuckDB (files + in-memory), PostgreSQL, Pandas/Polars DataFrame
- SQL batch optimizer: compiles N checks into 1 query
- Data contracts: schema validation, SLA enforcement, contract diff
- Anomaly detection: Z-Score, MAD, IQR (stdlib only, no scipy needed)
- HTML reports: provero run --report html
- Webhook alerts: notify Slack, PagerDuty, or any HTTP endpoint on failure
- Result store: SQLite with time-series metrics and provero history
- Data profiling: provero profile --suggest auto-generates checks
- Configurable severity: info, warning, critical, blocker per check
- JSON Schema validation for provero.yaml
- Statistical checks: distribution (mean/stddev bounds), cardinality (distinct count/ratio), drift (PSI vs a discrete baseline), cross_table_count (row-count parity/ratio between two tables)
- Connection pooling and retry: per-source pool sizing, connect timeouts, and bounded retry-with-backoff for SQLAlchemy connectors
- Observability: structured JSON audit log, OpenTelemetry spans, and Prometheus metrics on provero run, with secret redaction in audit output
- Server mode: provero serve exposes a FastAPI REST API (health, suites, runs, /metrics), a stdlib interval scheduler, and X-API-Key authentication
- CI output formats: provero run --format sarif and --format junit for code-scanning and test-report integrations
- Contract versioning: version-aware provero contract diff flags breaking changes and missing version bumps with severity policies
- Airflow provider: ProveroCheckOperator + @provero_check decorator
- SodaCL migration: provero import soda converts configs in one command
- dbt interop: provero export dbt generates schema.yml test definitions
- Continuous monitoring: provero watch polls checks on interval
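The "compiles N checks into 1 query" idea from the feature list can be illustrated with a small sketch. This is not Provero's optimizer: the function name compile_batch, the generated SQL, and the use of the standard-library sqlite3 module in place of DuckDB are all assumptions made for illustration.

```python
import sqlite3

def compile_batch(table, not_null_cols, unique_cols):
    """Build one SELECT that computes every check metric in a single table scan.

    Identifiers are interpolated directly, so they must come from trusted config.
    """
    parts = ["COUNT(*) AS row_count"]
    for col in not_null_cols:
        parts.append(f"SUM(CASE WHEN {col} IS NULL THEN 1 ELSE 0 END) AS {col}_nulls")
    for col in unique_cols:
        # Duplicates = non-null rows minus distinct non-null values.
        parts.append(f"COUNT({col}) - COUNT(DISTINCT {col}) AS {col}_dupes")
    return f"SELECT {', '.join(parts)} FROM {table}"

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10, 45.0), (2, 11, 999.0), (2, None, 120.0)],
)

sql = compile_batch("orders", ["order_id", "customer_id", "amount"], ["order_id"])
metrics = dict(conn.execute(sql).fetchone())
print(metrics)
```

Four checks (three not_null, one unique) plus the row count are answered by one round trip instead of five.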
| Check | Description | Example |
|---|---|---|
| not_null | Column has no null values | not_null: order_id |
| unique | Column has no duplicate values | unique: order_id |
| unique_combination | Composite uniqueness across columns | unique_combination: [date, store_id] |
| completeness | Minimum percentage of non-null values | completeness: { column: email, min: 95% } |
| accepted_values | Column values are within allowed set | accepted_values: { column: status, values: [a, b] } |
| range | Numeric values within min/max bounds | range: { column: amount, min: 0, max: 100000 } |
| regex | Values match a regular expression | regex: { column: email, pattern: ".+@.+" } |
| email_validation | Values are valid email addresses | email_validation: { column: email } |
| type | Column data type matches expected | type: { column: amount, expected: numeric } |
| freshness | Most recent timestamp within threshold | freshness: { column: updated_at, max_age: 24h } |
| latency | Time between two timestamp columns | latency: { source_column: created_at, target_column: processed_at, max_latency: 1h } |
| row_count | Table row count within bounds | row_count: { min: 1, max: 1000000 } |
| row_count_change | Row count change vs previous run | row_count_change: { max_decrease: 10% } |
| anomaly | Statistical anomaly detection | anomaly: { column: amount, method: zscore } |
| custom_sql | Custom SQL query returns truthy value | custom_sql: "SELECT COUNT(*) > 0 FROM orders" |
| referential_integrity | FK values exist in reference table | referential_integrity: { column: customer_id, reference_table: customers, reference_column: id } |
| distribution | Column mean/stddev within bounds | distribution: { column: amount, mean: 100, mean_tolerance: 5, stddev_max: 50 } |
| cardinality | Distinct count or ratio within bounds | cardinality: { column: country_code, min: 2, max: 250 } |
| drift | PSI of a column vs a discrete baseline | drift: { column: segment, baseline: { A: 0.5, B: 0.3, C: 0.2 }, threshold: 0.25 } |
| cross_table_count | Row-count parity/ratio between two tables | cross_table_count: { other_table: staging.orders, tolerance: 0 } |

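Thresholds such as max_age: 24h and max_latency: 1h are written as short duration strings. The sketch below shows one way such a value could be parsed and applied to a freshness test. The helper names parse_duration and is_fresh are hypothetical, and only the h suffix appears in this README; the s, m, and d suffixes are assumptions.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed unit suffixes; this README only shows "h".
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}

def parse_duration(text):
    """Turn a string like '24h' into a timedelta."""
    match = re.fullmatch(r"(\d+)([smhd])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})

def is_fresh(latest, max_age, now=None):
    """True when the most recent timestamp is no older than max_age."""
    now = now or datetime.now(timezone.utc)
    return now - latest <= parse_duration(max_age)

now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
print(is_fresh(datetime(2024, 1, 2, 0, 0, tzinfo=timezone.utc), "24h", now=now))
print(is_fresh(datetime(2023, 12, 31, 0, 0, tzinfo=timezone.utc), "24h", now=now))
```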
A provero.yaml file defines your data source, checks, alerts, and contracts:

    # Source configuration
    source:
      type: duckdb              # duckdb, postgres, dataframe
      table: orders             # table name or file expression
      # connection: postgres://...  # connection string for databases

    # Quality checks
    checks:
      - not_null: [order_id, customer_id]
      - unique: order_id
      - range:
          column: amount
          min: 0
          max: 100000
      - freshness:
          column: updated_at
          max_age: 24h
      - anomaly:
          column: amount
          method: zscore        # zscore, mad, iqr
          threshold: 3.0
          window: 30            # lookback window in days
      - referential_integrity:
          column: customer_id
          reference_table: customers
          reference_column: id

    # Severity levels: info, warning, critical, blocker
    # Blocker checks cause a non-zero exit code

    # Alert notifications
    alerts:
      - type: webhook
        url: https://hooks.slack.com/services/YOUR/WEBHOOK
        trigger: on_failure     # on_failure, on_success, always

    # Data contracts (optional)
    contracts:
      - name: orders_contract
        owner: data-team
        table: orders
        schema:
          columns:
            - name: order_id
              type: integer
              checks: [not_null, unique]
        sla:
          freshness: 24h

Provero includes built-in statistical anomaly detection that works without external dependencies (no scipy needed).
Supported methods:
| Method | Description | Best for |
|---|---|---|
| zscore | Standard Z-Score | Normally distributed metrics |
| mad | Median Absolute Deviation | Robust to outliers |
| iqr | Interquartile Range | Skewed distributions |

    checks:
      - anomaly:
          column: daily_revenue
          method: mad
          threshold: 3.5
          window: 30

Anomaly detection uses the result store to compare current values against historical data. Run provero run regularly to build up the baseline.
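To make the three methods concrete, here is a minimal standard-library sketch of each test applied to a history of past values. It is an illustration under assumptions, not Provero's implementation: the function names, the 0.6745 scaling of the MAD score, and the 1.5 IQR multiplier are conventional choices that this README does not confirm.

```python
import statistics

def zscore_outlier(history, value, threshold=3.0):
    """Flag value when it lies more than `threshold` std devs from the mean."""
    mean = statistics.fmean(history)
    std = statistics.pstdev(history)
    if std == 0:
        return value != mean
    return abs(value - mean) / std > threshold

def mad_outlier(history, value, threshold=3.5):
    """Flag value using the median absolute deviation (robust to outliers)."""
    med = statistics.median(history)
    mad = statistics.median([abs(x - med) for x in history])
    if mad == 0:
        return value != med
    # 0.6745 rescales the score so it is comparable to a z-score for normal data.
    return 0.6745 * abs(value - med) / mad > threshold

def iqr_outlier(history, value, k=1.5):
    """Flag value outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(history, n=4)
    iqr = q3 - q1
    return value < q1 - k * iqr or value > q3 + k * iqr

history = [100, 102, 98, 101, 99, 103, 97, 100]  # e.g. the last 8 daily values
for check in (zscore_outlier, mad_outlier, iqr_outlier):
    print(check.__name__, check(history, 150), check(history, 101))
```

Each function needs a non-trivial history to be meaningful, which is why the baseline has to be built up from earlier runs first.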
| Command | Description |
|---|---|
| provero init | Create a new provero.yaml template |
| provero run | Execute quality checks |
| provero validate | Validate config syntax without running |
| provero profile | Profile a data source |
| provero history | Show historical check results |
| provero contract validate | Validate data contracts against live data |
| provero contract diff | Compare two contract versions |
| provero watch | Continuously run checks on interval |
| provero import soda | Convert SodaCL config to Provero format |
| provero export dbt | Generate dbt schema.yml from checks |
| provero serve | Run the REST API + scheduler server |
| provero version | Show version |

provero run accepts --format table|json|csv|sarif|junit. The sarif and junit
formats emit a single whole-run document for CI code-scanning and test-report
integrations.
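As a rough picture of what a single whole-run JUnit document looks like, the sketch below renders a list of check results as one testsuite element with the standard library. The result dictionary keys and the mapping of one check to one testcase are assumptions; the actual layout of Provero's junit output may differ.

```python
import xml.etree.ElementTree as ET

def to_junit(suite_name, results):
    """Render check results as one JUnit <testsuite> document (hypothetical mapping)."""
    failed = [r for r in results if r["status"] != "pass"]
    suite = ET.Element(
        "testsuite",
        name=suite_name,
        tests=str(len(results)),
        failures=str(len(failed)),
    )
    for r in results:
        case = ET.SubElement(suite, "testcase", classname=suite_name, name=r["check"])
        if r["status"] != "pass":
            failure = ET.SubElement(case, "failure", message=r["observed"])
            failure.text = f"expected {r['expected']}, observed {r['observed']}"
    return ET.tostring(suite, encoding="unicode")

results = [
    {"check": "not_null:order_id", "status": "pass", "observed": "0 nulls", "expected": "0 nulls"},
    {"check": "row_count", "status": "fail", "observed": "0", "expected": ">= 1"},
]
xml_doc = to_junit("orders", results)
print(xml_doc)
```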
Send webhook notifications when checks fail:

    source:
      type: duckdb
      table: orders
    checks:
      - not_null: order_id
      - row_count:
          min: 1
    alerts:
      - type: webhook
        url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
        trigger: on_failure
      - type: webhook
        url: ${PAGERDUTY_WEBHOOK}
        headers:
          Authorization: "Bearer ${PD_TOKEN}"

Triggers: on_failure (default), on_success, always.
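The ${PAGERDUTY_WEBHOOK} and ${PD_TOKEN} placeholders keep secrets out of the config file. A minimal sketch of how such placeholders can be resolved from the environment follows. The helper expand_env is hypothetical, and raising an error on an unset variable is an assumption about the desired behavior rather than a statement of what Provero does.

```python
import os
import re

_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")

def expand_env(value, env=os.environ):
    """Replace every ${NAME} in value with env[NAME]; fail loudly if NAME is unset."""
    def replace(match):
        name = match.group(1)
        if name not in env:
            raise KeyError(f"environment variable {name} is not set")
        return env[name]
    return _PLACEHOLDER.sub(replace, value)

fake_env = {"PD_TOKEN": "abc123", "PAGERDUTY_WEBHOOK": "https://example.invalid/hook"}
print(expand_env("Bearer ${PD_TOKEN}", fake_env))
print(expand_env("${PAGERDUTY_WEBHOOK}", fake_env))
```

Failing on a missing variable surfaces a misconfigured alert at load time instead of at the moment a check fails.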
provero run can emit governance and telemetry signals through optional observer flags.
All three are off by default and degrade gracefully when their optional dependency is
absent.

    # Append a structured JSON audit record of the run (pure stdlib, always available)
    provero run --audit-log audit.jsonl
    # Emit OpenTelemetry spans for the suite and each check
    provero run --otel
    # Write a Prometheus text exposition of run metrics to a file
    provero run --metrics-file metrics.prom

OpenTelemetry and Prometheus require the observability extra:

    pip install 'provero[observability]'

Exposed Prometheus metrics: provero_checks_total (by status), provero_check_duration_seconds,
and provero_suite_score. The audit log records run id, suite name, a config hash, and
per-check outcomes; connection strings and secrets are redacted before they are written.
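The shape of such an audit record can be sketched as follows. The field names, the SHA-256 config hash, and the password-masking regular expression are assumptions chosen for illustration; the README does not specify the actual record schema or redaction rules.

```python
import hashlib
import json
import re

# Matches the password in scheme://user:password@host (illustrative rule only).
_PASSWORD = re.compile(r"(://[^:/@\s]+:)[^@/\s\"]+(@)")

def redact(text):
    """Mask the password part of any connection URL embedded in text."""
    return _PASSWORD.sub(r"\1***\2", text)

def audit_line(run_id, suite, config_text, outcomes):
    """Build one JSON line describing a run, with secrets masked before writing."""
    record = {
        "run_id": run_id,
        "suite": suite,
        "config_hash": hashlib.sha256(config_text.encode("utf-8")).hexdigest(),
        "checks": outcomes,
    }
    return redact(json.dumps(record, sort_keys=True))

line = audit_line(
    "run-001",
    "orders",
    "source:\n  type: postgres\n",
    [{"check": "not_null", "status": "pass", "source": "postgres://app:s3cret@db:5432/shop"}],
)
print(line)
```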
provero serve starts a FastAPI application that exposes the engine over HTTP and can
run suites on a schedule. It requires the server extra (FastAPI + Uvicorn):

    pip install 'provero[server]'
    provero serve                                          # 127.0.0.1:8000, auth disabled
    provero serve -c production.yaml --host 0.0.0.0 --port 9000
    provero serve --api-key secret1 --api-key secret2

Endpoints:

| Method & path | Auth | Description |
|---|---|---|
| GET /health | no | Liveness probe |
| GET /ready | no | Readiness probe |
| GET /suites | yes | List configured suites |
| POST /suites/{name}/run | yes | Run a suite on demand |
| GET /runs | yes | List historical runs |
| GET /runs/{run_id} | yes | Run detail |
| GET /metrics | no | Prometheus exposition |

Authentication is via the X-API-Key header. Allowed keys come from --api-key
(repeatable) or the PROVERO_API_KEYS environment variable; if neither is set, auth is
disabled. The bundled scheduler (SuiteScheduler) runs a suite on a fixed interval using
the standard library only (no extra dependency) and persists every result to the store.
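The key-handling rules above (keys from the flag or the environment, auth disabled when none are set) can be expressed in a few lines. This is a sketch, not the server's code: the function names are hypothetical, and treating PROVERO_API_KEYS as a comma-separated list is an assumption.

```python
import hmac
import os

def load_api_keys(cli_keys=(), env=os.environ):
    """Merge keys from repeated --api-key flags and PROVERO_API_KEYS."""
    keys = set(cli_keys)
    raw = env.get("PROVERO_API_KEYS", "")  # assumed to be comma-separated
    keys.update(part.strip() for part in raw.split(",") if part.strip())
    return keys

def is_authorized(headers, allowed_keys):
    """Check the X-API-Key header; an empty key set means auth is disabled."""
    if not allowed_keys:
        return True
    presented = headers.get("X-API-Key", "").encode("utf-8")
    # compare_digest avoids leaking key contents through timing differences.
    return any(hmac.compare_digest(presented, key.encode("utf-8")) for key in allowed_keys)

keys = load_api_keys(["secret1"], {"PROVERO_API_KEYS": "secret2, secret3"})
print(is_authorized({"X-API-Key": "secret2"}, keys))
print(is_authorized({"X-API-Key": "wrong"}, keys))
```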
Four statistical checks extend the engine for distributional and cross-table validation:

    checks:
      # Mean within tolerance and an upper bound on stddev (population statistics)
      - distribution:
          column: amount
          mean: 100.0
          mean_tolerance: 5.0
          stddev_max: 50.0
      # Distinct-value count and/or ratio bounds (ratio = distinct / non_null)
      - cardinality:
          column: country_code
          min: 2
          max: 250
          min_ratio: 0.0
      # Population Stability Index against a discrete baseline distribution
      - drift:
          column: segment
          baseline: { A: 0.5, B: 0.3, C: 0.2 }
          threshold: 0.25       # PSI above this fails
          warn_threshold: 0.1   # PSI above this warns
      # Row-count parity (or ratio) between two tables on the same source
      - cross_table_count:
          other_table: staging.orders
          tolerance: 0

drift is advisory by default (default severity warning): PSI above threshold fails,
above warn_threshold warns, otherwise passes. cross_table_count also supports a
ratio mode with min_ratio/max_ratio bounds.
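For readers unfamiliar with PSI, the sketch below computes it for a categorical column against a discrete baseline and applies the fail/warn/pass rule described above. The function names and the small epsilon used to avoid a logarithm of zero are assumptions; Provero's handling of unseen or empty categories may differ.

```python
import math
from collections import Counter

def psi(values, baseline, eps=1e-6):
    """Population Stability Index: sum((actual - expected) * ln(actual / expected))."""
    counts = Counter(values)
    total = len(values)
    score = 0.0
    for category in set(baseline) | set(counts):
        # eps keeps the logarithm defined for categories missing on one side.
        expected = max(baseline.get(category, 0.0), eps)
        actual = max(counts.get(category, 0) / total, eps)
        score += (actual - expected) * math.log(actual / expected)
    return score

def classify(score, threshold=0.25, warn_threshold=0.1):
    """Map a PSI score to fail / warn / pass."""
    if score > threshold:
        return "fail"
    if score > warn_threshold:
        return "warn"
    return "pass"

baseline = {"A": 0.5, "B": 0.3, "C": 0.2}
stable = ["A"] * 50 + ["B"] * 30 + ["C"] * 20
shifted = ["A"] * 35 + ["B"] * 30 + ["C"] * 35
drifted = ["A"] * 20 + ["B"] * 30 + ["C"] * 50
for sample in (stable, shifted, drifted):
    print(round(psi(sample, baseline), 4), classify(psi(sample, baseline)))
```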
Define and enforce schema contracts:

    contracts:
      - name: orders_contract
        owner: data-team
        table: orders
        on_violation: warn
        schema:
          columns:
            - name: order_id
              type: integer
              checks: [not_null, unique]
            - name: status
              type: varchar
        sla:
          freshness: 24h
          completeness: "95%"

Contracts carry a version field (default 1.0). provero contract diff is
version-aware: it classifies each change as breaking or non-breaking, and warns when a
breaking change ships without a major version bump.
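The version-aware diff can be pictured with a small sketch. The classification rules used here (a removed column or a changed type is breaking, an added column is not) and the function names are assumptions for illustration; the README does not list the actual rules or severity policies.

```python
def diff_columns(old, new):
    """Compare {column: type} maps and return (change, column, is_breaking) tuples."""
    changes = []
    for name, col_type in old.items():
        if name not in new:
            changes.append(("removed", name, True))
        elif new[name] != col_type:
            changes.append(("type_changed", name, True))
    for name in new:
        if name not in old:
            changes.append(("added", name, False))
    return changes

def needs_major_bump(old_version, new_version, changes):
    """True when a breaking change ships without a higher major version."""
    old_major = int(str(old_version).split(".")[0])
    new_major = int(str(new_version).split(".")[0])
    has_breaking = any(is_breaking for _, _, is_breaking in changes)
    return has_breaking and new_major <= old_major

old = {"order_id": "integer", "status": "varchar"}
new = {"order_id": "bigint", "status": "varchar", "note": "varchar"}
changes = diff_columns(old, new)
print(changes)
print(needs_major_bump("1.0", "1.1", changes))
print(needs_major_bump("1.0", "2.0", changes))
```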

    provero contract validate
    provero contract diff old.yaml new.yaml

| Connector | Status | Install |
|---|---|---|
| DuckDB | Stable | included |
| PostgreSQL | Stable | pip install provero[postgres] |
| DataFrame | Stable | pip install provero[dataframe] |
| Snowflake | Beta | pip install provero[snowflake] |
| BigQuery | Beta | pip install provero[bigquery] |
| MySQL | Beta | pip install provero[mysql] |
| Redshift | Beta | pip install provero[redshift] |
DuckDB supports file expressions: read_csv('data.csv'), read_parquet('*.parquet').
SQLAlchemy-backed connectors (PostgreSQL and the beta connectors) accept optional connection-pool and retry tuning per source. Every key is optional; when omitted, the connector behaves exactly as before.

    source:
      type: postgres
      table: orders
      connection: ${POSTGRES_URL}
      # Connection pool (forwarded to SQLAlchemy create_engine)
      pool_size: 5
      max_overflow: 10
      pool_pre_ping: true
      pool_recycle: 1800
      pool_timeout: 30
      connect_timeout: 10
      # Bounded retry-with-backoff on transient connection errors
      retry_attempts: 3
      retry_base_delay: 0.1
      retry_max_delay: 5.0
      retry_jitter: true

Only transient failures (dropped connections, backend restarts, deadlocks) are retried. Programming errors such as a missing table or bad SQL fail immediately. Backoff is exponential with full jitter.
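Exponential backoff with full jitter means each wait is drawn uniformly between zero and a cap that doubles per attempt, up to the maximum delay. The sketch below shows the pattern with the defaults from the config above. TransientError and run_with_retry are hypothetical names, and reading retry_attempts as the total number of tries (not extra retries) is an assumption.

```python
import random
import time

class TransientError(Exception):
    """Stand-in for a dropped connection, backend restart, or deadlock."""

def run_with_retry(func, attempts=3, base_delay=0.1, max_delay=5.0,
                   jitter=True, sleep=time.sleep):
    """Call func, retrying only TransientError with capped exponential backoff."""
    for attempt in range(attempts):
        try:
            return func()
        except TransientError:
            if attempt == attempts - 1:
                raise  # retry budget exhausted
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: wait a uniform random time in [0, cap].
            sleep(random.uniform(0, cap) if jitter else cap)

calls = {"n": 0}

def flaky_query():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("connection dropped")
    return "ok"

waits = []  # record delays instead of sleeping, to keep the demo instant
print(run_with_retry(flaky_query, sleep=waits.append), waits)
```

Any other exception type, such as one raised for a missing table, is not caught and so propagates on the first attempt.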
Run checks from Python using a config file:

    from provero.core.engine import Engine

    engine = Engine("provero.yaml")
    results = engine.run()
    for result in results:
        print(f"{result.check_name}: {result.status}")

Or build the configuration from a dict:

    from provero.core.engine import Engine

    engine = Engine.from_dict({
        "source": {"type": "duckdb", "table": "orders"},
        "checks": [
            {"not_null": "order_id"},
            {"row_count": {"min": 1}},
        ],
    })
    results = engine.run()

For Airflow, install the provider package:

    pip install provero-airflow

Then add the operator to a DAG:

    from provero.airflow.operators import ProveroCheckOperator

    check_orders = ProveroCheckOperator(
        task_id="check_orders",
        config_path="dags/provero.yaml",
        suite="orders_daily",
    )

Full documentation is available on GitHub Pages.
- Getting Started
- Configuration
- Check Types
- Connectors
- CLI Reference
- Architecture
- Contributing
- Governance
- Security Policy
- Support
Apache License 2.0. See LICENSE.