Open-source Delta Lake data quality and management tool. Go-first, dbt-compatible, CLI-friendly. Supports profiling, validation, lineage, and alerts.


nessi-dev/nessi-old


Nessi

Nessi is an open-source data quality and Delta Lake management tool that helps organizations maintain high-quality data and efficiently manage their Delta Lake tables.

Features

  • Delta Lake Support

    • Delta Lake table management
    • Transaction log parsing and analysis
    • Schema evolution tracking
    • Partition management
    • Version control and time travel
    • Multi-format support (Delta, Parquet, CSV)
    • Schema inference and validation
    • Cloud integration with AWS, Azure, and GCP
    • Data Catalog Integration: Integration with data catalogs like AWS Glue, Azure Purview, Google Cloud Data Catalog, and Databricks Unity Catalog
  • Data Quality Intelligence

    • Data quality checks
    • Custom quality rules
    • Data profiling and statistics
    • Anomaly detection
    • Schema validation
    • Freshness monitoring
    • Consistency checks
    • Quality reporting
  • Monitoring & Alerting

    • File-based metrics collection
    • Email alerting for failed checks
    • Configurable alert thresholds
    • Alert history in log files
  • Reporting

    • CLI-generated reports
    • Schema tree visualization
    • Data quality metrics
    • Trend analysis via report comparison
    • Quality scoring and metrics
    • dbt Integration: Seamless integration with dbt for validating and profiling models directly in your dbt workflow

Installation

Prerequisites

  • Go 1.18 or higher
  • Python 3.8 or higher (for Python integrations)
  • Access to Delta Lake tables

Binary Installation

Using Go

go install github.com/nessi-dev/nessi/cmd/nessi@latest

Using Prebuilt Binaries

Download the latest release from the GitHub Releases page.

# Linux/macOS
chmod +x nessi
./nessi --version

# Move the binary onto your PATH for easier access (may require sudo)
sudo mv nessi /usr/local/bin/

Docker Installation

# Pull the latest image
docker pull nessi/nessi:latest

# Run Nessi
docker run -v "$(pwd)":/data nessi/nessi:latest scan /data/table

Quick Start

Basic Usage

# Scan a Delta Lake table
nessi scan s3://my-bucket/my-table

# Generate a data quality report
nessi report s3://my-bucket/my-table --format html --output report.html

# Validate a table against quality rules
nessi validate s3://my-bucket/my-table --rules rules.yaml

# Time travel to a specific version
nessi timetravel s3://my-bucket/my-table --version 5

# View schema evolution
nessi schema s3://my-bucket/my-table --history
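
The --version 5 argument above refers to a Delta table version. In the Delta Lake protocol, each committed version is recorded as a JSON file under the table's _delta_log directory, named with the version number zero-padded to 20 digits. How Nessi resolves versions internally is not documented here; the following is only a minimal sketch of that naming rule, and commitFilePath is a hypothetical helper, not part of Nessi's API.

```go
package main

import (
	"fmt"
	"strings"
)

// commitFilePath is a hypothetical helper (not part of Nessi's API).
// It returns the location of the JSON commit file for a table version,
// following the Delta Lake naming rule: the version number zero-padded
// to 20 digits inside the _delta_log directory.
func commitFilePath(tableRoot string, version int64) (string, error) {
	if version < 0 {
		return "", fmt.Errorf("invalid version %d: must be >= 0", version)
	}
	// Plain string concatenation keeps URI schemes such as s3:// intact.
	root := strings.TrimRight(tableRoot, "/")
	return fmt.Sprintf("%s/_delta_log/%020d.json", root, version), nil
}

func main() {
	p, err := commitFilePath("s3://my-bucket/my-table", 5)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(p)
}
```

Reading a table as of version 5 then means replaying the commit files up to and including that one, possibly starting from a checkpoint.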

Configuration

Create a nessi.yaml configuration file:

storage:
  type: s3
  region: us-west-2
  bucket: my-delta-tables

monitoring:
  enabled: true
  metrics_retention: 30d

quality:
  default_rules: path/to/rules.yaml
  threshold: 0.95

Using the API

Nessi provides a Go API for programmatic access:

package main

import (
	"fmt"
	"github.com/nessi-dev/nessi/pkg/datalake"
)

func main() {
	// Open a Delta Lake table
	table, err := datalake.OpenTable("s3://my-bucket/my-table")
	if err != nil {
		panic(err)
	}
	
	// Get table metadata
	meta, err := table.Metadata()
	if err != nil {
		panic(err)
	}
	
	fmt.Printf("Table: %s, Version: %d\n", meta.Name, meta.Version)
	
	// Run quality checks
	results, err := table.ValidateQuality(nil)
	if err != nil {
		panic(err)
	}
	
	fmt.Printf("Quality Score: %.2f%%\n", results.OverallScore*100)
}

Documentation

For detailed documentation, visit our User Guide.

Additional documentation:

  • CLI Reference

  • Configuration Options

  • Quality Rules

  • Report Generation

  • Monitoring and Alerting (OSS + Optional)

    • CLI-based metrics reporting
    • Email notifications for failed checks (OSS)
    • Performance tracking
    • Resource utilization monitoring
    • Health checks
    • Freshness SLA Monitoring: Track data freshness with configurable SLAs and compliance tracking
  • Report Generation

    • Customizable report templates
    • Multiple output formats (CSV, JSON)
    • Report generation via CLI
    • Report archiving and retention
  • Security

    • API key management
    • TLS encryption for secure connections
    • File-based credential storage

Contributing

We welcome contributions from the community! Please see our Contributing Guide for details on how to get started.

Development Setup

# Clone the repository
git clone git@github.com:nessi-dev/nessi.git
cd nessi

# Install dependencies
make deps

# Build the project
make build

# Run tests
make test

Testing

We aim for high test coverage. Run the test suite with:

./generate-test-coverage.sh

This will generate a coverage report in the ./coverage directory.

Community

Roadmap

See our project roadmap for upcoming features and enhancements.

License

Nessi is licensed under the MIT License; see the LICENSE file for details.

Acknowledgments

  • The Delta Lake community for their excellent work on the Delta Lake format

  • All our contributors and users who provide valuable feedback

  • Testing Framework

    • Comprehensive unit and integration tests
    • In-memory testing for performance-critical components
    • Mock connectors for external dependencies
    • Test data generation utilities
    • Configurable test runners

Developer Experience

  • CLI-First Approach

    • One-line validation commands
    • Batch processing scripts
    • Comprehensive configuration management
    • Intuitive command structure
  • Developer Experience

    • Intuitive CLI interface
    • Comprehensive documentation
    • Python API for programmatic access
    • Extensible architecture with a plugin system for custom extensions
    • Embeddable in Airflow, GitHub Actions, Kubernetes, and custom scripts
    • Interactive examples and integration guides
    • Observability features with metrics collection, structured logging, and distributed tracing
    • Authentication enhancements with OAuth 2.0 and token refresh capabilities
    • Environment variable-based configuration for secure deployment
  • Integration Capabilities

    • File-based metrics collection
    • CLI-based reporting
    • Secure API key access
    • Email notifications
    • Cloud provider integration (AWS, Azure, GCP)
    • dbt Integration: Opt-in plugin for executing data quality rules against dbt models and generating data profiles
    • Workflow Orchestration Integration: Integration with Apache Airflow, Prefect, and Dagster for incorporating data quality checks into data pipelines
    • Error Handling and Retries: Configurable retry mechanisms for operations and detailed error reporting
    • Observability: Structured logging and file-based metrics for monitoring
  • Documentation

    • Comprehensive user guides
    • API reference
    • Example scripts
    • Best practices
    • Integration guides for dbt, Airflow, and other tools

Requirements

  • Go 1.21 or later
  • Docker (optional)
  • Make
  • OpenSSL (for certificate generation)

Installation

  1. Clone the repository:

    git clone git@github.com:nessi-dev/nessi.git
    cd nessi
  2. Install development tools:

    make install-tools
  3. Initialize the development environment:

    make init
  4. Build the application:

    make build

Configuration

The application is configured using a YAML file located at config/config.yaml. The configuration includes settings for:

  • Security settings (API keys, TLS for CLI tools)
  • Delta Lake paths and properties
  • Data quality rules and thresholds
  • Monitoring and metrics export
  • Email notifications
  • Report generation
  • Logging

See config/config.yaml for detailed configuration options.

Usage

Running the CLI

  1. Run the CLI tool:

    nessi --help
  2. Or run with Docker:

    docker run -v "$(pwd)":/data nessi/nessi:latest --help

CLI Commands

The application provides the following CLI commands:

  • nessi tables list - List Delta tables
  • nessi tables info <n> - Get table details
  • nessi quality check <n> - Run quality checks
  • nessi quality metrics <n> - Get quality metrics
  • nessi reports list <n> - List available reports
  • nessi reports generate <n> - Generate new report
  • nessi catalog list --type <catalog_type> - List available catalogs
  • nessi catalog tables --type <catalog_type> --database <db> - List tables in a catalog
  • nessi catalog describe --type <catalog_type> --database <db> --table <table> - Get table details from a catalog

Databricks Integration

Nessi supports integration with Databricks, allowing you to work with Databricks Unity Catalog and Delta Lake tables.

Configuration

To use the Databricks integration, set the following environment variables:

export DATABRICKS_HOST="https://your-workspace.cloud.databricks.com"
export DATABRICKS_TOKEN="your-personal-access-token"
export DATABRICKS_WORKSPACE_ID="your-workspace-id"  # Optional, defaults to "0"
export DATABRICKS_DEFAULT_SCHEMA="default"          # Optional, defaults to "default"
export DATABRICKS_DEFAULT_CATALOG="hive_metastore"  # Optional, defaults to "hive_metastore"

Example Commands

# List catalogs in Databricks
nessi catalog list --type databricks

# List tables in a specific catalog and schema
nessi catalog tables --type databricks --database main.default

# Get details of a specific table
nessi catalog describe --type databricks --database main.default --table customers

# Work with Delta Lake tables
nessi tables info /path/to/delta/table

Development

  1. Run tests:

    make test
  2. Run linters:

    make lint
  3. Clean build artifacts:

    make clean

Security

TLS Configuration

  1. Generate self-signed certificates:

    make generate-certs
  2. Update the configuration in config/config.yaml:

    security:
      tls:
        enabled: true
        cert_file: "certs/server.crt"
        key_file: "certs/server.key"

Authentication

The application uses API keys for CLI authentication. Configure the API key settings in config/config.yaml:

security:
  api_key:
    enabled: true
    key_file: "config/api_keys.yaml"

Set the API key as an environment variable for CLI commands:

export NESSI_API_KEY="your-api-key"
nessi tables list

Monitoring and Metrics Export

The application provides file-based metrics export for monitoring and observability. Configure the metrics export settings in config/config.yaml:

monitoring:
  metrics:
    enabled: true
    collection_interval: 15s
    export_path: "./data/metrics"
    export_format: "json"
  logging:
    structured: true
    format: "json"
    level: "info"

Error Handling and Retries

The application includes configurable retry mechanisms for API failures and external service interactions. Configure the retry settings in config/config.yaml:

api:
  retries:
    max_attempts: 3
    initial_backoff: 1s
    max_backoff: 30s
    backoff_factor: 2.0
  error_handling:
    categorize: true
    detailed_logging: true

Test Strategy

We use a unified test script that supports multiple test modes:

Fast/unit tests

Run locally for rapid feedback:

./scripts/run_tests.sh --fast

This skips all integration/long-running tests.

Short tests

Run specific test files:

./scripts/run_tests.sh --short

Standard tests

Run tests that complete in under 30 seconds (default):

./scripts/run_tests.sh

Integration/slow tests

Run all tests including long-running ones:

./scripts/run_tests.sh --all

All integration/slow tests follow the Go convention of skipping themselves when the -short flag is passed to go test:

if testing.Short() {
    t.Skip("Skipping integration test in short mode")
}

For more options, see the help:

./scripts/run_tests.sh --help

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Commit your changes
  4. Push to the branch
  5. Create a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support

For support, please open an issue in the GitHub repository or contact the maintainers.
