Nessi is an open-source data quality and Delta Lake management tool that helps organizations maintain high-quality data and efficiently manage their Delta Lake tables.
- **Delta Lake Support**
  - Delta Lake table management
  - Transaction log parsing and analysis
  - Schema evolution tracking
  - Partition management
  - Version control and time travel
  - Multi-format support (Delta, Parquet, CSV)
  - Schema inference and validation
  - Cloud integration with AWS, Azure, and GCP
  - Data catalog integration with AWS Glue, Azure Purview, Google Cloud Data Catalog, and Databricks Unity Catalog
- **Data Quality Intelligence**
  - Data quality checks
  - Custom quality rules
  - Data profiling and statistics
  - Anomaly detection
  - Schema validation
  - Freshness monitoring
  - Consistency checks
  - Quality reporting
- **Monitoring & Alerting**
  - File-based metrics collection
  - Email alerting for failed checks
  - Configurable alert thresholds
  - Alert history in log files
- **Reporting**
  - CLI-generated reports
  - Schema tree visualization
  - Data quality metrics
  - Trend analysis via report comparison
  - Quality scoring and metrics
  - dbt integration: validate and profile models directly in your dbt workflow
Requirements:

- Go 1.18 or higher
- Python 3.8 or higher (for Python integrations)
- Access to Delta Lake tables
Install with `go install`:

```bash
go install github.com/nessi-dev/nessi/cmd/nessi@latest
```

Alternatively, download the latest release from the GitHub Releases page:

```bash
# Linux/macOS
chmod +x nessi
./nessi --version

# Add to your PATH for easier access
mv nessi /usr/local/bin/
```

Or use the Docker image:

```bash
# Pull the latest image
docker pull nessi/nessi:latest

# Run Nessi
docker run -v $(pwd):/data nessi/nessi:latest scan /data/table
```

Basic usage:

```bash
# Scan a Delta Lake table
nessi scan s3://my-bucket/my-table
# Generate a data quality report
nessi report s3://my-bucket/my-table --format html --output report.html
# Validate a table against quality rules
nessi validate s3://my-bucket/my-table --rules rules.yaml
# Time travel to a specific version
nessi timetravel s3://my-bucket/my-table --version 5
# View schema evolution
nessi schema s3://my-bucket/my-table --history
```

Create a `nessi.yaml` configuration file:
```yaml
storage:
  type: s3
  region: us-west-2
  bucket: my-delta-tables

monitoring:
  enabled: true
  metrics_retention: 30d

quality:
  default_rules: path/to/rules.yaml
  threshold: 0.95
```
Nessi provides a Go API for programmatic access:

```go
package main

import (
    "fmt"

    "github.com/nessi-dev/nessi/pkg/datalake"
)

func main() {
    // Open a Delta Lake table
    table, err := datalake.OpenTable("s3://my-bucket/my-table")
    if err != nil {
        panic(err)
    }

    // Get table metadata
    meta, err := table.Metadata()
    if err != nil {
        panic(err)
    }
    fmt.Printf("Table: %s, Version: %d\n", meta.Name, meta.Version)

    // Run quality checks
    results, err := table.ValidateQuality(nil)
    if err != nil {
        panic(err)
    }
    fmt.Printf("Quality Score: %.2f%%\n", results.OverallScore*100)
}
```
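As a usage sketch built only from the calls shown above, the same API can gate a pipeline step on the overall quality score (the 0.95 threshold mirrors the configuration example; the exit-code handling is our own addition, not part of Nessi):

```go
package main

import (
    "fmt"
    "os"

    "github.com/nessi-dev/nessi/pkg/datalake"
)

func main() {
    table, err := datalake.OpenTable("s3://my-bucket/my-table")
    if err != nil {
        fmt.Fprintln(os.Stderr, "open table:", err)
        os.Exit(1)
    }

    // nil selects the default rule set, as in the example above.
    results, err := table.ValidateQuality(nil)
    if err != nil {
        fmt.Fprintln(os.Stderr, "quality validation:", err)
        os.Exit(1)
    }

    // Fail the pipeline step when the score drops below the threshold.
    const threshold = 0.95
    if results.OverallScore < threshold {
        fmt.Fprintf(os.Stderr, "quality score %.2f below threshold %.2f\n",
            results.OverallScore, threshold)
        os.Exit(1)
    }
    fmt.Println("quality gate passed")
}
```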
For detailed documentation, visit our User Guide.

Additional documentation:
- **Monitoring and Alerting (OSS + Optional)**
  - CLI-based metrics reporting
  - Email notifications for failed checks (OSS)
  - Performance tracking
  - Resource utilization monitoring
  - Health checks
  - Freshness SLA monitoring: track data freshness with configurable SLAs and compliance tracking
- **Report Generation**
  - Customizable report templates
  - Multiple output formats (CSV, JSON)
  - Report generation via CLI
  - Report archiving and retention
- **Security**
  - API key management
  - TLS encryption for secure connections
  - File-based credential storage
We welcome contributions from the community! Please see our Contributing Guide for details on how to get started.
```bash
# Clone the repository
git clone git@github.com:nessi-dev/nessi.git
cd nessi

# Install dependencies
make deps

# Build the project
make build

# Run tests
make test
```

We aim for high test coverage. Run the test suite with:

```bash
./generate-test-coverage.sh
```

This will generate a coverage report in the `./coverage` directory.
- GitHub Discussions - Ask questions and share ideas
- Issue Tracker - Report bugs or request features
- Slack Community - Join our Slack for real-time discussions
See our project roadmap for upcoming features and enhancements.
Nessi is licensed under the Apache License 2.0.
- The Delta Lake community for their excellent work on the Delta Lake format
- All our contributors and users who provide valuable feedback
- **Testing Framework**
  - Comprehensive unit and integration tests
  - In-memory testing for performance-critical components
  - Mock connectors for external dependencies
  - Test data generation utilities
  - Configurable test runners
- **CLI-First Approach**
  - One-line validation commands
  - Batch processing scripts
  - Comprehensive configuration management
  - Intuitive command structure
- **Developer Experience**
  - Intuitive CLI interface
  - Comprehensive documentation
  - Python API for programmatic access
  - Extensible plugin architecture with custom extension support
  - Embeddable in Airflow, GitHub Actions, Kubernetes, and custom scripts (see the CI sketch after this feature list)
  - Interactive examples and integration guides
  - Observability features with metrics collection, structured logging, and distributed tracing
  - Authentication enhancements with OAuth 2.0 and token refresh capabilities
  - Environment variable-based configuration for secure deployment
- **Integration Capabilities**
  - File-based metrics collection
  - CLI-based reporting
  - Secure API key access
  - Email notifications
  - Cloud provider integration (AWS, Azure, GCP)
  - dbt integration: opt-in plugin for executing data quality rules against dbt models and generating data profiles
  - Workflow orchestration: integration with Apache Airflow, Prefect, and Dagster for incorporating quality checks into data pipelines
  - Error handling and retries: configurable retry mechanisms and detailed error reporting
  - Observability: structured logging and file-based metrics for monitoring
- **Documentation**
  - Comprehensive user guides
  - API reference
  - Example scripts
  - Best practices
  - Integration guides for dbt, Airflow, and other tools
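For example, embedding Nessi in CI amounts to installing the binary and running the documented commands. A minimal GitHub Actions sketch (the workflow layout, bucket, rules path, and secret name are our assumptions, not part of Nessi):

```yaml
# Hypothetical workflow; adjust paths and secrets to your setup.
name: data-quality
on: [push]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4        # brings rules.yaml into the workspace
      - uses: actions/setup-go@v5
        with:
          go-version: '1.21'
      - name: Install Nessi
        run: go install github.com/nessi-dev/nessi/cmd/nessi@latest
      - name: Validate table against quality rules
        run: nessi validate s3://my-bucket/my-table --rules rules.yaml
        env:
          # Cloud credentials for the s3 bucket are omitted for brevity.
          NESSI_API_KEY: ${{ secrets.NESSI_API_KEY }}
```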
Development prerequisites:

- Go 1.21 or later
- Docker (optional)
- Make
- OpenSSL (for certificate generation)
- Clone the repository:

  ```bash
  git clone git@github.com:nessi-dev/nessi.git
  cd nessi
  ```

- Install development tools:

  ```bash
  make install-tools
  ```

- Initialize the development environment:

  ```bash
  make init
  ```

- Build the application:

  ```bash
  make build
  ```
The application is configured using a YAML file located at `config/config.yaml`. The configuration includes settings for:
- Security settings (API keys, TLS for CLI tools)
- Delta Lake paths and properties
- Data quality rules and thresholds
- Monitoring and metrics export
- Email notifications
- Report generation
- Logging
See `config/config.yaml` for detailed configuration options.
- Run the CLI tool:

  ```bash
  nessi --help
  ```

- Or run with Docker:

  ```bash
  docker run -v $(pwd):/data nessi/nessi:latest --help
  ```
The application provides the following CLI commands:
- `nessi tables list` - List Delta tables
- `nessi tables info <n>` - Get table details
- `nessi quality check <n>` - Run quality checks
- `nessi quality metrics <n>` - Get quality metrics
- `nessi reports list <n>` - List available reports
- `nessi reports generate <n>` - Generate a new report
- `nessi catalog list --type <catalog_type>` - List available catalogs
- `nessi catalog tables --type <catalog_type> --database <db>` - List tables in a catalog
- `nessi catalog describe --type <catalog_type> --database <db> --table <table>` - Get table details from a catalog
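For instance, a quick inspection session might chain a few of these (the table name below is a placeholder):

```bash
# Substitute a table name reported by `nessi tables list`.
nessi tables list
nessi tables info my_table
nessi quality check my_table
nessi reports generate my_table
```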
Nessi supports integration with Databricks, allowing you to work with Databricks Unity Catalog and Delta Lake tables.
To use the Databricks integration, set the following environment variables:

```bash
export DATABRICKS_HOST="https://your-workspace.cloud.databricks.com"
export DATABRICKS_TOKEN="your-personal-access-token"
export DATABRICKS_WORKSPACE_ID="your-workspace-id" # Optional, defaults to "0"
export DATABRICKS_DEFAULT_SCHEMA="default" # Optional, defaults to "default"
export DATABRICKS_DEFAULT_CATALOG="hive_metastore" # Optional, defaults to "hive_metastore"
```

Then query Databricks from the CLI:

```bash
# List catalogs in Databricks
nessi catalog list --type databricks
# List tables in a specific catalog and schema
nessi catalog tables --type databricks --database main.default
# Get details of a specific table
nessi catalog describe --type databricks --database main.default --table customers
# Work with Delta Lake tables
nessi tables info /path/to/delta/table
```
- Run tests:

  ```bash
  make test
  ```

- Run linters:

  ```bash
  make lint
  ```

- Clean build artifacts:

  ```bash
  make clean
  ```

- Generate self-signed certificates (see the OpenSSL sketch below):

  ```bash
  make generate-certs
  ```

- Update the configuration in `config/config.yaml`:

  ```yaml
  security:
    tls:
      enabled: true
      cert_file: "certs/server.crt"
      key_file: "certs/server.key"
  ```
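If you prefer not to use Make, a self-signed pair can also be produced directly with OpenSSL (roughly what `make generate-certs` presumably automates; the paths match the TLS config above):

```bash
# Illustrative only - check the Makefile for the project's actual recipe.
mkdir -p certs
openssl req -x509 -newkey rsa:4096 -nodes \
  -keyout certs/server.key -out certs/server.crt \
  -days 365 -subj "/CN=localhost"
```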
The application uses API keys for CLI authentication. Configure the API key settings in `config/config.yaml`:

```yaml
security:
  api_key:
    enabled: true
    key_file: "config/api_keys.yaml"
```

Set the API key as an environment variable for CLI commands:

```bash
export NESSI_API_KEY="your-api-key"
nessi tables list
```

The application provides file-based metrics export for monitoring and observability. Configure the metrics export settings in `config/config.yaml`:
```yaml
monitoring:
  metrics:
    enabled: true
    collection_interval: 15s
    export_path: "./data/metrics"
    export_format: "json"

logging:
  structured: true
  format: "json"
  level: "info"
```

The application includes configurable retry mechanisms for API failures and external service interactions. Configure the retry settings in `config/config.yaml`:
```yaml
api:
  retries:
    max_attempts: 3
    initial_backoff: 1s
    max_backoff: 30s
    backoff_factor: 2.0

error_handling:
  categorize: true
  detailed_logging: true
```
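With these values, the wait between attempts grows geometrically (1s, then 2s) and would be capped at 30s on longer retry budgets. A small self-contained sketch of that schedule (our illustration of the semantics, not Nessi's internals):

```go
package main

import (
    "fmt"
    "time"
)

// backoffBefore returns the wait before attempt n (n >= 2), mirroring the
// initial_backoff, backoff_factor, and max_backoff settings above.
func backoffBefore(n int, initial, max time.Duration, factor float64) time.Duration {
    d := initial
    for i := 2; i < n; i++ {
        d = time.Duration(float64(d) * factor)
    }
    if d > max {
        d = max
    }
    return d
}

func main() {
    // max_attempts: 3 allows two retries after the first attempt.
    for n := 2; n <= 3; n++ {
        fmt.Printf("wait %v before attempt %d\n",
            backoffBefore(n, time.Second, 30*time.Second, 2.0), n)
    }
    // Output:
    // wait 1s before attempt 2
    // wait 2s before attempt 3
}
```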
We use a unified test script that supports multiple test modes.

Run locally for rapid feedback:

```bash
./scripts/run_tests.sh --fast
```

This skips all integration/long-running tests.

Run specific test files:

```bash
./scripts/run_tests.sh --short
```

Run tests that complete in under 30 seconds (default):

```bash
./scripts/run_tests.sh
```

Run all tests including long-running ones:

```bash
./scripts/run_tests.sh --all
```

All integration/slow tests use the Go convention:

```go
if testing.Short() {
    t.Skip("Skipping integration test in short mode")
}
```

For more options, see the help:

```bash
./scripts/run_tests.sh --help
```

- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
For support, please open an issue in the GitHub repository or contact the maintainers.