Skip to content
/ minidb Public

MiniDB is a distributed MPP (Massively Parallel Processing) database system built in Go, designed for analytical workloads. Currently implemented as a single-node prototype with vectorized execution and cost-based optimization, it provides the foundation for future distributed parallel processing capabilities.

Notifications You must be signed in to change notification settings

yyun543/minidb

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

45 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

MiniDB

MiniDB is a modern analytical database built in Go with Lakehouse architecture (v2.0). It combines Parquet-based columnar storage with Delta Lake transaction management, powered by Apache Arrow for vectorized query execution. The system uses cost-based optimization for intelligent query planning across dual execution engines.

Current Implementation (v2.0 Lakehouse)

Lakehouse Storage Layer โœ…

  • Parquet Storage Engine: Apache Arrow-based Parquet file format for columnar data storage
  • Delta Lake Integration: Transaction log with ACID properties, time-travel queries, and snapshot isolation
  • Arrow IPC Serialization: Efficient schema serialization using Arrow Inter-Process Communication format
  • Multi-Record Merging: Automatic concatenation of multiple Parquet record batches for efficient scanning
  • Predicate Pushdown: Filter pushdown to Parquet files for optimized data skipping
  • Comprehensive Statistics: Min/max values and null counts for all Arrow data types (INT8/16/32/64, FLOAT32/64, STRING, BOOLEAN, DATE, TIMESTAMP)

Core Database Engine โœ…

  • Multi-Client Support: TCP server with session management and concurrent connection handling
  • SQL Parser: ANTLR4-based parser supporting DDL, DML, qualified table names (database.table), and analytical query operations
  • SQL Self-Bootstrapping: Virtual system tables in sys database for metadata queries (schemata, table_catalog, columns, index_metadata, delta_log, table_files)
  • Type System: Complete Arrow type system with automatic type conversions
  • Enterprise Logging: Structured logging with zap library, daily log rotation, and environment-aware configuration

Query Processing โœ…

  • Dual Execution Engines: Vectorized (Apache Arrow) and regular execution engines with automatic selection
  • Cost-Based Optimization: Statistics-driven query plan selection using table row counts and cardinalities
  • Vectorized Operations: Batch processing with Arrow for SELECT, GROUP BY, aggregations
  • Aggregation Functions: COUNT, SUM, AVG, MIN, MAX with GROUP BY and HAVING support

MPP Architecture Goals ๐Ÿšง

Distributed Processing (Planned)

  • Query Coordinator: Distributed query planning and execution coordination
  • Compute Nodes: Parallel execution across multiple compute nodes
  • Data Distribution: Automatic data partitioning and distribution strategies
  • Inter-Node Communication: Efficient data transfer protocols between nodes

Lakehouse Enhancements (Planned)

  • Object Storage: S3, GCS, Azure Blob storage connectors for cloud-native deployments
  • Multi-Format Support: ORC and Iceberg table format readers (Parquet and Delta Lake โœ…)
  • Schema Evolution: ALTER TABLE support for column additions and type changes
  • Z-Ordering: Advanced data clustering for improved query performance
  • Compaction: Automatic file compaction and optimization for Delta tables

Supported SQL Features โœ…

Currently Working

  • DDL: CREATE/DROP DATABASE, CREATE/DROP TABLE, CREATE/DROP INDEX
  • DML: INSERT, SELECT, UPDATE, DELETE
  • Queries: WHERE clauses (=, >, <, >=, <=, AND, OR)
  • Aggregation: GROUP BY, HAVING with COUNT, SUM, AVG, MIN, MAX
  • Indexes: CREATE INDEX, CREATE UNIQUE INDEX, DROP INDEX, SHOW INDEXES
  • System Tables: SQL self-bootstrapping with virtual system tables in sys database
    • sys.schemata - Database catalog
    • sys.table_catalog - Table catalog
    • sys.columns - Column metadata
    • sys.index_metadata - Index information
    • sys.delta_log - Delta Log transaction history
    • sys.table_files - Active Parquet file list
  • Utilities: USE database, SHOW TABLES/DATABASES/INDEXES, EXPLAIN

Limited Support โš ๏ธ

  • JOIN operations (basic implementation)
  • WHERE operators: LIKE, IN, BETWEEN (fallback to regular engine)
  • ORDER BY (basic sorting)

Planned Enhancements ๐Ÿ”„

  • Advanced JOINs: Hash join, sort-merge join algorithms
  • Window Functions: ROW_NUMBER, RANK, analytical functions
  • Complex Expressions: Nested queries, CTEs, advanced operators

Architecture & Performance

Current Architecture (v2.0 Lakehouse) โœ…

  1. Lakehouse Storage: Parquet + Delta Lake for ACID transactions and time-travel
  2. Arrow-Native Processing: Vectorized execution using Apache Arrow columnar format
  3. Dual Execution Engines: Cost-optimizer selects between vectorized and regular engines
  4. Delta Transaction Log: Version control with snapshot isolation and checkpoint management
  5. Predicate Pushdown: Filter evaluation at storage layer for data skipping
  6. Statistics-Driven Optimization: Min/max/null statistics for intelligent query planning
  7. Enterprise Logging: Comprehensive structured logging across all modules with performance monitoring

MPP Design Principles ๐ŸŽฏ

  1. Distributed-First: Architecture designed for horizontal scaling
  2. Compute-Storage Separation: Independent scaling of processing and storage
  3. Parallel Processing: Query parallelization across multiple nodes
  4. Elastic Compute: Dynamic resource allocation based on workload

Performance Characteristics

  • Current Prototype: Single-node analytical query processing
  • Vectorized Operations: 10-100x speedup for compatible analytical queries
  • Session Management: Support for multiple concurrent connections
  • Memory Efficiency: Arrow-based columnar processing with efficient allocators

Project Structure

minidb/
โ”œโ”€โ”€ cmd/
โ”‚   โ””โ”€โ”€ server/                    # Application entry point
โ”‚       โ”œโ”€โ”€ main.go                # Server startup with CLI flags and signal handling
โ”‚       โ””โ”€โ”€ handler.go             # Enhanced query handling with dual execution engines
โ”œโ”€โ”€ internal/
โ”‚   โ”œโ”€โ”€ catalog/                   # Metadata management & SQL self-bootstrapping
โ”‚   โ”‚   โ”œโ”€โ”€ catalog.go             # Database/table management with type system
โ”‚   โ”‚   โ””โ”€โ”€ simple_sql_catalog.go  # SQL self-bootstrapping catalog (virtual system tables)
โ”‚   โ”œโ”€โ”€ delta/                     # Delta Lake transaction log (v2.0)
โ”‚   โ”‚   โ”œโ”€โ”€ log.go                 # Delta Log manager with Arrow IPC serialization
โ”‚   โ”‚   โ””โ”€โ”€ types.go               # Delta Log entry types and operations
โ”‚   โ”œโ”€โ”€ executor/                  # Dual execution engines
โ”‚   โ”‚   โ”œโ”€โ”€ executor.go            # Regular execution engine
โ”‚   โ”‚   โ”œโ”€โ”€ vectorized_executor.go # Apache Arrow vectorized execution engine
โ”‚   โ”‚   โ”œโ”€โ”€ cost_optimizer.go      # Cost-based query optimization
โ”‚   โ”‚   โ”œโ”€โ”€ data_manager.go        # Data access layer with system table support
โ”‚   โ”‚   โ”œโ”€โ”€ context.go             # Execution context management
โ”‚   โ”‚   โ”œโ”€โ”€ interface.go           # Executor interfaces
โ”‚   โ”‚   โ””โ”€โ”€ operators/             # Execution operators
โ”‚   โ”‚       โ”œโ”€โ”€ table_scan.go      # Optimized table scanning
โ”‚   โ”‚       โ”œโ”€โ”€ filter.go          # Vectorized filtering
โ”‚   โ”‚       โ”œโ”€โ”€ join.go            # Cost-optimized joins
โ”‚   โ”‚       โ”œโ”€โ”€ aggregate.go       # Vectorized aggregations
โ”‚   โ”‚       โ”œโ”€โ”€ group_by.go        # GROUP BY operations
โ”‚   โ”‚       โ”œโ”€โ”€ order_by.go        # ORDER BY operations
โ”‚   โ”‚       โ”œโ”€โ”€ projection.go      # Column projection
โ”‚   โ”‚       โ””โ”€โ”€ operator.go        # Base operator interfaces
โ”‚   โ”œโ”€โ”€ logger/                    # Enterprise logging system
โ”‚   โ”‚   โ”œโ”€โ”€ logger.go              # Structured logging with zap
โ”‚   โ”‚   โ”œโ”€โ”€ config.go              # Environment-aware configuration
โ”‚   โ”‚   โ””โ”€โ”€ middleware.go          # Request/response logging middleware
โ”‚   โ”œโ”€โ”€ objectstore/               # Object storage abstraction layer
โ”‚   โ”‚   โ””โ”€โ”€ local.go               # Local filesystem storage implementation
โ”‚   โ”œโ”€โ”€ optimizer/                 # Advanced query optimizer
โ”‚   โ”‚   โ”œโ”€โ”€ optimizer.go           # Rule-based and cost-based optimization
โ”‚   โ”‚   โ”œโ”€โ”€ plan.go                # Enhanced query plan representation
โ”‚   โ”‚   โ”œโ”€โ”€ rule.go                # Base optimization rule interface
โ”‚   โ”‚   โ”œโ”€โ”€ predicate_push_down_rule.go   # Predicate pushdown optimization
โ”‚   โ”‚   โ”œโ”€โ”€ projection_pruning_rule.go    # Projection pruning optimization
โ”‚   โ”‚   โ””โ”€โ”€ join_reorder_rule.go   # Join reordering optimization
โ”‚   โ”œโ”€โ”€ parquet/                   # Parquet storage layer (v2.0)
โ”‚   โ”‚   โ”œโ”€โ”€ reader.go              # Parquet reader with predicate pushdown
โ”‚   โ”‚   โ””โ”€โ”€ writer.go              # Parquet writer with comprehensive statistics
โ”‚   โ”œโ”€โ”€ parser/                    # SQL parser with ANTLR4
โ”‚   โ”‚   โ”œโ”€โ”€ MiniQL.g4              # Comprehensive ANTLR4 grammar (supports qualified table names)
โ”‚   โ”‚   โ”œโ”€โ”€ miniql_lexer.go        # ANTLR-generated lexer
โ”‚   โ”‚   โ”œโ”€โ”€ miniql_parser.go       # ANTLR-generated parser
โ”‚   โ”‚   โ”œโ”€โ”€ miniql_visitor.go      # ANTLR-generated visitor interface
โ”‚   โ”‚   โ”œโ”€โ”€ miniql_base_visitor.go # ANTLR-generated base visitor
โ”‚   โ”‚   โ”œโ”€โ”€ parser.go              # SQL parsing with enhanced error handling
โ”‚   โ”‚   โ””โ”€โ”€ ast.go                 # Complete AST node definitions
โ”‚   โ”œโ”€โ”€ session/                   # Session management
โ”‚   โ”‚   โ””โ”€โ”€ session.go             # Session lifecycle and cleanup
โ”‚   โ”œโ”€โ”€ statistics/                # Statistics collection system
โ”‚   โ”‚   โ””โ”€โ”€ statistics.go          # Table and column statistics management
โ”‚   โ”œโ”€โ”€ storage/                   # Lakehouse storage engine (v2.0)
โ”‚   โ”‚   โ”œโ”€โ”€ parquet_engine.go      # Parquet-based storage engine with Delta Log
โ”‚   โ”‚   โ”œโ”€โ”€ parquet_iterator.go    # Parquet record iterator implementation
โ”‚   โ”‚   โ””โ”€โ”€ interface.go           # Storage engine interfaces
โ”‚   โ”œโ”€โ”€ types/                     # Enhanced type system
โ”‚   โ”‚   โ”œโ”€โ”€ schema.go              # Strong type system with Arrow integration
โ”‚   โ”‚   โ”œโ”€โ”€ partition.go           # Partitioning strategies for distribution
โ”‚   โ”‚   โ”œโ”€โ”€ vectorized.go          # Vectorized batch processing
โ”‚   โ”‚   โ””โ”€โ”€ types.go               # Data type definitions and conversions
โ”‚   โ””โ”€โ”€ utils/                     # Utility functions
โ”‚       โ””โ”€โ”€ utils.go               # Common utilities
โ”œโ”€โ”€ docs/                          # Project documentation
โ”‚   โ””โ”€โ”€ architecture/              # Architecture documentation
โ”‚       โ”œโ”€โ”€ README.md              # Architecture overview
โ”‚       โ”œโ”€โ”€ QUICKSTART.md          # Quick start guide
โ”‚       โ”œโ”€โ”€ minidb-v2-architecture.md        # V2.0 Lakehouse architecture
โ”‚       โ”œโ”€โ”€ delta-log-sql-bootstrap.md       # Delta Log and SQL bootstrapping
โ”‚       โ”œโ”€โ”€ sql-bootstrap-implementation.md  # SQL self-bootstrapping details
โ”‚       โ”œโ”€โ”€ implementation-roadmap.md        # Implementation roadmap
โ”‚       โ””โ”€โ”€ comparison-summary.md            # Architecture comparison
โ”œโ”€โ”€ logs/                          # Log files directory
โ”‚   โ””โ”€โ”€ minidb.log                 # Application logs with rotation
โ””โ”€โ”€ test/                          # Comprehensive test suite
    โ”œโ”€โ”€ arrow_ipc_test.go          # Arrow IPC serialization tests
    โ”œโ”€โ”€ comprehensive_plan_test.go # Comprehensive plan execution tests
    โ”œโ”€โ”€ delta_acid_test.go         # Delta Lake ACID transaction tests
    โ”œโ”€โ”€ executor_test.go           # Execution engine tests
    โ”œโ”€โ”€ group_by_test.go           # GROUP BY and aggregation tests
    โ”œโ”€โ”€ index_test.go              # Index operations tests
    โ”œโ”€โ”€ insert_fix_test.go         # INSERT operation tests
    โ”œโ”€โ”€ optimizer_test.go          # Query optimization tests
    โ”œโ”€โ”€ parquet_statistics_test.go # Parquet statistics tests
    โ”œโ”€โ”€ parser_test.go             # SQL parsing tests
    โ”œโ”€โ”€ predicate_pushdown_test.go # Predicate pushdown tests
    โ”œโ”€โ”€ readme_sql_comprehensive_test.go  # README SQL examples validation
    โ”œโ”€โ”€ show_tables_test.go        # SHOW TABLES/DATABASES tests
    โ”œโ”€โ”€ show_tables_integration_test.go   # SHOW TABLES integration tests
    โ”œโ”€โ”€ time_travel_test.go        # Time-travel query tests
    โ”œโ”€โ”€ unknown_plan_type_test.go  # Unknown plan type handling tests
    โ”œโ”€โ”€ update_delete_test.go      # UPDATE/DELETE operation tests
    โ”œโ”€โ”€ update_debug_test.go       # UPDATE debugging tests
    โ””โ”€โ”€ update_standalone_test.go  # UPDATE standalone tests

Current Performance Status

Prototype Metrics โœ…

  • Test Coverage: ~77% integration test success rate
  • Vectorized Execution: Automatic selection for compatible analytical queries
  • Connection Handling: Multi-client TCP server with session isolation
  • Query Processing: Basic analytical operations (GROUP BY, aggregations)

Target MPP Performance ๐ŸŽฏ

  • Distributed Processing: Linear scalability across compute clusters
  • Query Throughput: Thousands of concurrent analytical queries
  • Data Volume: Petabyte-scale data processing capabilities
  • Fault Tolerance: Automatic failure recovery and query restart

Logging & Observability

Enterprise Logging System โœ…

MiniDB includes a comprehensive logging system built with industry best practices:

  • Structured Logging: Uses Uber's zap library for high-performance structured logging
  • Environment-Aware Configuration:
    • Development: Debug-level logging for detailed troubleshooting
    • Production: Info-level logging to minimize log volume
    • Test: Error-level logging for clean test output
  • Daily Log Rotation: Automatic log rotation with configurable retention policies
  • Performance Monitoring: Detailed timing measurements for all database operations
  • Component-Based Logging: Easy identification of log sources across all modules
  • Error Tracking: Comprehensive error logging with context and stack traces

Logging Configuration

The logging system automatically configures based on the ENVIRONMENT variable:

# Development environment (detailed logs)
ENVIRONMENT=development ./minidb

# Production environment (optimized logs)  
ENVIRONMENT=production ./minidb

# Test environment (minimal logs)
ENVIRONMENT=test ./minidb

Log Output Examples

# Server startup
2024-08-31T10:15:30.123Z INFO server/main.go:45 Starting MiniDB server {"version": "2.0", "port": 7205, "environment": "development"}

# Query execution with timing
2024-08-31T10:15:45.456Z INFO executor/executor.go:89 Query executed successfully {"sql": "SELECT * FROM users", "execution_time": "2.5ms", "rows_returned": 150}

# Parser operations
2024-08-31T10:15:46.789Z INFO parser/parser.go:73 SQL parsing completed successfully {"sql": "INSERT INTO users VALUES (1, 'John')", "node_type": "*parser.InsertStmt", "total_parsing_time": "0.8ms"}

# Storage operations
2024-08-31T10:15:47.012Z INFO storage/wal.go:67 WAL entry written successfully {"operation": "INSERT", "table": "users", "write_duration": "0.3ms"}

Installation & Usage

Building MiniDB

# Clone the repository
git clone <repository-url>
cd minidb

# Install dependencies (zap logging, lumberjack rotation)
go mod download

# Build the optimized server
go build -o minidb ./cmd/server

# Run tests to verify installation
go test ./test/... -v

Starting the Server

# Start single-node prototype (localhost:7205)
./minidb

# Start with custom configuration  
./minidb -host 0.0.0.0 -port 8080

# Show available options
./minidb -h

Current Server Output

=== MiniDB Server ===
Version: 2.0 (Lakehouse Architecture)
Listening on: localhost:7205
Storage: Parquet + Delta Lake with Arrow IPC serialization
Features: Vectorized Execution, Predicate Pushdown, Cost-based Optimization, Enterprise Logging
Logging: Structured logging enabled with daily rotation (logs/minidb.log)
Ready for connections...

Future MPP Cluster (Planned)

# Start coordinator node
./minidb coordinator --port 7205

# Start compute nodes
./minidb compute --coordinator localhost:7205 --port 8001
./minidb compute --coordinator localhost:7205 --port 8002

SQL Usage Examples

Database Operations

-- Create and manage databases
CREATE DATABASE ecommerce;
USE ecommerce;
SHOW DATABASES;

Basic DDL Operations โœ…

-- Create tables with optimized type system
CREATE TABLE users (
    id INT,
    name VARCHAR,
    email VARCHAR,
    age INT,
    created_at VARCHAR
);

CREATE TABLE orders (
    id INT,
    user_id INT,
    amount INT,
    order_date VARCHAR
);

-- Show tables in current database
SHOW TABLES;

-- Create indexes for query optimization
CREATE INDEX idx_users_email ON users (email);
CREATE UNIQUE INDEX idx_users_id ON users (id);
CREATE INDEX idx_orders_user_id ON orders (user_id);
CREATE INDEX idx_composite ON users (name, email);

-- Show all indexes on a table
SHOW INDEXES ON users;
SHOW INDEXES FROM orders;

-- Drop indexes
DROP INDEX idx_users_email ON users;

Working DML Examples โœ…

-- Insert data (triggers automatic statistics updates)
INSERT INTO users VALUES (1, 'John Doe', 'john@example.com', 25, '2024-01-01');
INSERT INTO users VALUES (2, 'Jane Smith', 'jane@example.com', 30, '2024-01-02');
INSERT INTO users VALUES (3, 'Bob Wilson', 'bob@example.com', 35, '2024-01-03');

INSERT INTO orders VALUES (1, 1, 100, '2024-01-05');
INSERT INTO orders VALUES (2, 2, 250, '2024-01-06');
INSERT INTO orders VALUES (3, 1, 150, '2024-01-07');

-- Vectorized SELECT operations
SELECT * FROM users;
SELECT name, email FROM users WHERE age > 25;
SELECT * FROM orders;

-- Cost-optimized JOIN operations
SELECT u.name, o.amount, o.order_date
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE u.age > 25;

-- Vectorized aggregations
SELECT age, COUNT(*) as user_count, AVG(age) as avg_age
FROM users
GROUP BY age
HAVING user_count > 0;

-- Advanced WHERE clauses
SELECT * FROM users WHERE age >= 25 AND age <= 35;
SELECT * FROM users WHERE name LIKE 'J%';
SELECT * FROM orders WHERE amount IN (100, 250);

System Tables & Metadata Queries โœ…

-- Query all databases (SQL self-bootstrapping)
SELECT * FROM sys.schemata;
-- Returns: sys, default, ecommerce, ...

-- Query all tables in the system
SELECT * FROM sys.table_catalog;
-- Returns: database_name, table_name for all tables

-- View column metadata for specific tables
SELECT column_name, data_type, is_nullable
FROM sys.columns
WHERE table_schema = 'ecommerce' AND table_name = 'users';

-- Check index information
SELECT index_name, column_name, is_unique, index_type
FROM sys.index_metadata
WHERE table_schema = 'ecommerce' AND table_name = 'users';

-- View Delta Log transaction history
SELECT version, operation, table_name, file_path, row_count
FROM sys.delta_log
WHERE table_schema = 'ecommerce'
ORDER BY version DESC
LIMIT 10;

-- View active Parquet files for a table
SELECT file_path, file_size, row_count, status
FROM sys.table_files
WHERE table_schema = 'ecommerce' AND table_name = 'orders';

Query Optimization โœ…

-- Visualize optimized query execution plans
EXPLAIN SELECT u.name, SUM(o.amount) as total_spent
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE u.age > 25
GROUP BY u.name
ORDER BY total_spent DESC;

-- Output shows:
-- Query Execution Plan:
--------------------
-- Select
--   OrderBy
--     GroupBy
--       Filter
--         Join
--           TableScan
--           TableScan

Limited Features โš ๏ธ

-- Complex analytical queries (uses vectorized execution)
SELECT 
    u.name,
    COUNT(o.id) as order_count,
    SUM(o.amount) as total_amount,
    AVG(o.amount) as avg_amount
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
GROUP BY u.name
HAVING order_count > 1
ORDER BY total_amount DESC;

-- Update operations with statistics maintenance
UPDATE users 
SET email = 'john.doe@newdomain.com' 
WHERE name = 'John Doe';

-- Efficient delete operations
DELETE FROM orders WHERE amount < 50;

Future MPP Features ๐Ÿ”ฎ

-- Planned: Advanced analytical queries
SELECT 
    region,
    amount,
    SUM(amount) OVER (PARTITION BY region ORDER BY amount) as running_total,
    ROW_NUMBER() OVER (PARTITION BY region ORDER BY amount DESC) as rank
FROM sales;

-- Planned: Complex multi-table operations with distributed execution
SELECT 
    region,
    COUNT(DISTINCT product) as product_variety,
    AVG(amount) as avg_sale,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount) as median_sale
FROM sales s
JOIN product_catalog p ON s.product = p.name
WHERE s.date >= '2024-01-01'
GROUP BY region
ORDER BY avg_sale DESC;

Result Formatting

-- Formatted table output with row counts
SELECT name, age FROM users WHERE age > 25;

| name            | age            |
|-----------------+----------------|
| Jane Smith      | 30             |
| Bob Wilson      | 35             |
|-----------------+----------------|
2 rows in set

-- Empty result handling
SELECT * FROM users WHERE age > 100;
Empty set

Error Handling

-- Comprehensive error messages
CREATE TABLE users (...);
Error: table users already exists

SELECT nonexistent_column FROM users;
Error: column nonexistent_column does not exist

SELECT FROM users WHERE;
Error: parsing error: syntax error near 'WHERE'

Connection Examples

Using Network Clients

# Connect using netcat
nc localhost 7205

# Connect using telnet
telnet localhost 7205

# Example session
Welcome to MiniDB v1.0!
Session ID: 1234567890
Type 'exit;' or 'quit;' to disconnect
------------------------------------
minidb> CREATE TABLE test (id INT, name VARCHAR);
OK

minidb> INSERT INTO test VALUES (1, 'Hello');
OK

minidb> SELECT * FROM test;
| id              | name           |
|-----------------+----------------|
| 1               | Hello          |
|-----------------+----------------|
1 rows in set

minidb> exit;
Goodbye!

Development Status & Advantages

Current Lakehouse Benefits โœ…

  1. Lakehouse Architecture: Combines data lake flexibility with data warehouse performance
  2. ACID Transactions: Delta Lake ensures consistency with snapshot isolation
  3. Time Travel: Query historical data using version numbers or timestamps
  4. SQL Self-Bootstrapping: Virtual system tables (sys.*) for metadata queries without circular dependencies
  5. Vectorized Analytics: 10-100x speedup for GROUP BY, aggregations using Apache Arrow
  6. Predicate Pushdown: Filter evaluation at storage layer reduces data read
  7. Arrow IPC Serialization: Efficient binary schema serialization with full type fidelity
  8. Comprehensive Statistics: Min/max/null tracking for all data types enables data skipping
  9. Enterprise Logging: Comprehensive structured logging with performance monitoring and error tracking

MPP Architecture Advantages ๐ŸŽฏ

  1. Linear Scalability: Designed for horizontal scaling across compute clusters
  2. Compute-Storage Separation: Independent scaling of processing and storage resources
  3. Fault Tolerance: Automatic failure recovery and query restart capabilities
  4. Elastic Resource Management: Dynamic compute allocation based on workload patterns

Developer Experience

  1. Simple Deployment: Single binary with no external dependencies (current)
  2. Comprehensive Testing: Integration test framework with ~77% success rate
  3. Clear Documentation: Honest status reporting of working vs planned features
  4. MPP-Ready Design: Minimal changes needed for distributed deployment
  5. Production-Ready Logging: Enterprise-grade observability and debugging capabilities

MPP Roadmap

Phase 1: MPP Foundation ๐Ÿšง

  • Distributed Query Coordinator: Central query planning and execution coordination
  • Compute Node Management: Automatic node discovery and health monitoring
  • Inter-Node Communication: Efficient data transfer protocols between nodes
  • Query Distribution: Automatic query parallelization across compute clusters
  • Resource Management: Intelligent workload scheduling and resource allocation

Phase 2: Lakehouse Integration ๐Ÿ”ฎ

  • Object Storage Connectors: S3, GCS, Azure Blob storage integration
  • Multi-Format Support: Native Parquet, ORC, Delta Lake, Iceberg readers
  • Distributed Metadata Service: Schema evolution and transaction coordination
  • Data Distribution: Automatic partitioning and pruning for optimal performance
  • Elastic Compute: Dynamic scaling based on workload demands

Phase 3: Advanced Analytics ๐ŸŒŸ

  • Window Functions: ROW_NUMBER, RANK, advanced analytical functions
  • Machine Learning Integration: SQL-based ML algorithms
  • Real-time Streaming: Live data ingestion and processing
  • Advanced Optimization: Adaptive query execution and auto-tuning
  • Multi-tenant Support: Resource isolation and security

Contributing

We welcome contributions! Please follow these guidelines:

  1. Ensure all tests pass: go test ./test/... -v
  2. Follow the existing code architecture and patterns
  3. Add appropriate unit tests for new features
  4. Update documentation for user-facing changes

Testing & Validation

Current Testing Status

  • Integration Tests: ~77% success rate across test framework
  • Working Features: Basic DDL, DML, GROUP BY, aggregations
  • Vectorized Queries: Functional for compatible analytical operations
  • Connection Handling: Multi-client TCP server with session management

Target MPP Benchmarks ๐ŸŽฏ

  • Distributed Processing: Linear scalability across compute clusters
  • Query Throughput: Support for thousands of concurrent analytical queries
  • Data Volume: Petabyte-scale processing capabilities
  • Fault Tolerance: Sub-second failure detection and recovery

License

This project is licensed under the GPL License - see the LICENSE file for details.


MiniDB v2.0 - Lakehouse Architecture with Parquet + Delta Lake, Apache Arrow Vectorization, and Predicate Pushdown

About

MiniDB is a distributed MPP (Massively Parallel Processing) database system built in Go, designed for analytical workloads. Currently implemented as a single-node prototype with vectorized execution and cost-based optimization, it provides the foundation for future distributed parallel processing capabilities.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •  

Languages