A distributed key-value store built from scratch using the Raft consensus algorithm. This system provides strong consistency guarantees, fault tolerance, and high availability through replicated state machine (RSM) architecture.
- Overview
- Architecture
- Features
- System Components
- Getting Started
- Usage
- Configuration
- Performance Benchmarking
- Testing
- Technical Details
- Project Structure
RaftKV is a distributed key-value store that implements the Raft consensus algorithm to ensure strong consistency across a cluster of nodes. The system provides three core operations: Put, Get, and Delete, all of which are replicated through the Raft log to guarantee linearizability.
- Strong Consistency: All operations are linearizable through Raft log replication
- Fault Tolerance: Can tolerate up to (N-1)/2 node failures in an N-node cluster
- Durability: Raft log entries are persisted to disk using Badger DB
- Leader Election: Automatic leader election and failover
- Stateless Client: Client library with automatic leader discovery and retry logic
- Production Ready: Comprehensive error handling, logging, and monitoring
The system follows a layered architecture:
┌─────────────────────────────────────────────────────────┐
│ Client Layer │
│ (gRPC Client with Leader Discovery & Retry Logic) │
└────────────────────┬────────────────────────────────────┘
│
┌────────────────────▼────────────────────────────────────┐
│ gRPC Service Layer │
│ (KV Service: Put/Get/Delete Handlers) │
└────────────────────┬────────────────────────────────────┘
│
┌────────────────────▼────────────────────────────────────┐
│ Replicated State Machine (RSM) │
│ (Proposal Queue, State Management, Result Routing) │
└────────────────────┬────────────────────────────────────┘
│
┌────────────────────▼────────────────────────────────────┐
│ Raft Layer │
│ (Leader Election, Log Replication, Persistence) │
└────────────────────┬────────────────────────────────────┘
│
┌────────────────────▼────────────────────────────────────┐
│ Transport Layer (gRPC) │
│ (AppendEntries, RequestVote RPCs) │
└─────────────────────────────────────────────────────────┘
- Client Request: Client sends gRPC request (Put/Get/Delete) to any server
- gRPC Handler: Server handler generates unique command ID and enqueues to RSM
- RSM Proposal: RSM checks leadership, encodes operation to protobuf, and proposes to Raft
- Raft Replication: Leader appends entry to log and broadcasts
AppendEntriesto followers - Follower Processing: Followers validate consistency and append entries, respond to leader
- Commit: Leader advances commit index when majority replicates
- Apply: Committed entries delivered to RSM via apply channel
- State Update: RSM applies operation to in-memory state map
- Result Routing: RSM routes result back to waiting gRPC handler via command ID
- Client Response: Handler returns result to client
- Three Operations:
Put,Get,Deletewith full Raft replication - Linearizable Reads: All reads go through Raft log for consistency
- Automatic Leader Election: Nodes automatically elect a leader when needed
- Log Replication: Leader replicates entries to all followers with consistency checks
- Persistent Log: Raft log persisted to Badger DB for durability
- State Recovery: Nodes recover state from disk on restart
- Leader Discovery: Automatically discovers and caches the current leader
- Retry Logic: Exponential backoff with jitter for failed requests
- Size Validation: Enforces 1.5 MiB limit on keys and values
- Stateless Design: No persistent client state, works with any server configuration
- CLI Tool: Command-line interface for testing and operations
- Fault Tolerance: Handles node failures and network partitions
- Consistency Guarantees: Log Matching Property ensures log consistency
- Durability: All committed entries persisted before acknowledgment
- Error Handling: Comprehensive error handling with leader information propagation
The core Raft implementation providing:
- Leader Election: Randomized election timeouts (150-300ms) prevent split votes
- Log Replication: Batched
AppendEntriesRPCs (up to 256 entries) - State Persistence: Metadata and log entries saved to Badger DB
- State Machine: Manages term, votedFor, log entries, commit index, and match indices
Key Files:
raft.go: Core Raft state machine and logicmessage.go: RPC message handling (AppendEntries, RequestVote)service.go: gRPC service implementationnetwork.go: Transport layer for peer communication
Coordinates between client requests and Raft commits:
- Single Goroutine Design: Ensures sequential processing and eliminates race conditions
- Proposal Queue: Serializes client requests for backpressure handling
- Result Routing: Correlates proposals with results using command IDs
- State Management: In-memory key-value map with thread-safe access
Key Files:
rsm.go: Main RSM implementationtypes.go: Operation and result types
gRPC service layer exposing the key-value store API:
- Request Handling: Generates unique command IDs for each operation
- Admission Control: Semaphore-based admission control (5120 concurrent requests)
- Error Propagation: Returns leader information on non-leader errors
- Timeout Handling: 1-second timeout per request
Key Files:
kv_service.go: gRPC service implementationerrors.go: Error definitions and handling
Go client library with leader discovery and retry logic:
- Connection Pooling: Maintains gRPC connections to all servers
- Leader Caching: Caches leader index and term for efficient routing
- Exponential Backoff: Binary exponential backoff with jitter (100ms base, 5s max)
- Random Server Selection: When no cached leader exists
Key Files:
pkg/client/client.go: Core client implementationcmd/cli/main.go: CLI tool
Performance benchmarking and testing tool:
- Workload Generation: Write-heavy, read-heavy, mixed, and custom workloads
- Metrics Collection: Latency, throughput, error rates
- CSV Reporting: Time-series and summary metrics export
- Warmup Phase: Optional warmup to stabilize performance
Key Files:
main.go: Benchmark runnerworkload.go: Workload generation logicstats.go: Metrics collectionreporter.go: Report generation
- Go 1.25.2 or later
- Docker and Docker Compose (for cluster deployment)
- Protocol Buffers compiler (
protoc) - Go plugins for protobuf and gRPC
-
Clone the repository:
git clone <repository-url> cd sharded-kv
-
Install dependencies:
go mod download
-
Generate protobuf code:
export PATH=$PATH:$(go env GOPATH)/bin protoc --go_out=. --go-grpc_out=. proto/kvstore/v1/kvstore.proto protoc --go_out=. --go-grpc_out=. proto/raft/v1/raft.proto protoc --go_out=. proto/rsm/v1/rsm.proto
-
Build the server:
go build -o main ./cmd/main.go
-
Build the client CLI:
cd kv_client go build -o cli ./cmd/cli/
The easiest way to run a cluster is using Docker Compose:
docker compose up --build --watchThis starts a 5-node cluster with:
- Nodes:
raft1throughraft5 - Network:
172.28.0.0/16subnet - Ports:
50051for gRPC - Auto-rebuild on code changes
To stop/start individual nodes:
docker compose stop raft1
docker compose start raft1# Put a key-value pair
./kv_client/cli --servers=172.28.0.11:50051,172.28.0.12:50051 put mykey myvalue
# Get a value
./kv_client/cli --servers=172.28.0.11:50051,172.28.0.12:50051 get mykey
# Delete a key
./kv_client/cli --servers=172.28.0.11:50051,172.28.0.12:50051 delete mykeypackage main
import (
"fmt"
"raftkv-client/pkg/client"
)
func main() {
addresses := []string{
"172.28.0.11:50051",
"172.28.0.12:50051",
"172.28.0.13:50051",
}
kvClient := client.NewClient(addresses)
defer kvClient.Close()
// Put operation
err := kvClient.Put([]byte("key"), []byte("value"))
if err != nil {
fmt.Printf("Error: %v\n", err)
return
}
// Get operation
value, err := kvClient.Get([]byte("key"))
if err != nil {
fmt.Printf("Error: %v\n", err)
return
}
if value == nil {
fmt.Println("Key not found")
} else {
fmt.Printf("Value: %s\n", string(value))
}
// Delete operation
err = kvClient.Delete([]byte("key"))
if err != nil {
fmt.Printf("Error: %v\n", err)
}
}Configuration is managed through TOML files. See sample_config/dev.toml:
host = 0
address = [
"172.28.0.11:50051",
"172.28.0.12:50051",
"172.28.0.13:50051",
"172.28.0.14:50051",
"172.28.0.15:50051"
]
debug_level = "info" # Options: debug, info, warning, error, fatalCurrently hardcoded (configurable in future versions):
- Election Timeout: 150-300ms (randomized)
- Heartbeat Interval: 10-20ms
- gRPC Reconnection: 10-50ms (exponential backoff with jitter)
- Max Batch Size: 256 entries per AppendEntries RPC
Configured in cmd/main.go:
- Num Memtables: 8
- Num Level Zero Tables: 8
- Num Compactors: 4
- Compression: None (for performance)
- Sync Writes: Enabled (for durability)
The benchmark tool provides comprehensive performance testing:
cd benchmark
go build -o benchmark .# Write-heavy workload (300 seconds, 1 concurrent client)
./benchmark --duration 300s --concurrency 1 --workload write-heavy
# Read-heavy workload
./benchmark --duration 300s --concurrency 1 --workload read-heavy
# Mixed workload
./benchmark --duration 300s --concurrency 1 --workload mixed
# Custom workload with specified ratios
./benchmark --duration 300s --concurrency 1 --workload custom \
--put-ratio 0.5 --get-ratio 0.3 --delete-ratio 0.2
# With warmup phase
./benchmark --duration 300s --concurrency 1 --workload write-heavy --warmupThe tool generates:
- Console Summary: Throughput, latency percentiles, error rates
- CSV Summary:
benchmark_summary.csvwith aggregate metrics - Time-Series CSV:
benchmark_timeseries.csvwith per-second metrics
- Throughput: Operations per second (total and per operation type)
- Latency: P50, P95, P99, P99.9 percentiles
- Error Rates: Per operation type
- Time-Series Data: Per-second metrics for visualization
# Test Raft implementation
go test ./raft/ -v
# Test RSM
go test ./rsm/ -v
# Test client library
go test ./kv_client/pkg/client/ -v# Run client integration tests (requires running cluster)
KV_INTEGRATION=1 \
KV_SERVERS=172.28.0.11:50051,172.28.0.12:50051,172.28.0.13:50051 \
go test ./kv_client/pkg/client/ -v -run Live-
Start the cluster:
docker compose up --build
-
Test basic operations:
./kv_client/cli --servers=172.28.0.11:50051 put testkey testvalue ./kv_client/cli --servers=172.28.0.11:50051 get testkey
-
Test fault tolerance:
# Stop the leader docker compose stop raft1 # Operations should continue (new leader elected) ./kv_client/cli --servers=172.28.0.12:50051 get testkey # Restart the node docker compose start raft1
- Election Timeout: Randomized between 150-300ms to prevent split votes
- Vote Granting: Followers grant votes based on:
- No vote cast in current term OR candidate is the same as previously voted
- Candidate's log is at least as up-to-date (Log Matching Property)
- Majority Rule: Candidate needs majority of votes (> N/2) to become leader
- AppendEntries RPC: Leader sends batches of up to 256 entries
- Consistency Check: Followers validate previous log index and term match
- Log Matching Property: Ensures logs are consistent across all nodes
- Commit Advancement: Leader commits entries when majority replicates
- Term Safety: Only entries from current term can be committed
- Metadata: Term, votedFor, lastLogIdx persisted on every change
- Log Entries: Persisted with prefix-based keys (
log-00000000000000000001) - Recovery: On startup, nodes load metadata and reconstruct log from entries
- Atomic Writes: Badger transactions ensure atomic persistence
The RSM uses a single goroutine that selects on:
- Proposal Queue: Client requests waiting to be proposed
- Apply Channel: Committed entries from Raft ready to be applied
This ensures:
- Sequential processing eliminates race conditions
- No locks needed for state access
- Predictable ordering of operations
Each client operation gets a unique command ID:
- Generated by gRPC handler (8 random bytes, hex encoded)
- Included in protobuf log entry
- Used to route results back to waiting handler
- Cleaned up after result delivery
- In-Memory Map:
map[string][]bytefor key-value storage - Thread Safety: Single goroutine owns all state mutations
- Copy on Read: Values copied before returning to prevent aliasing
- Caching: Client caches leader index and term
- Invalidation: Cache invalidated on:
- Higher term received
- Timeout (after 1 second)
- Fallback: Random server selection when no cached leader
- Per-Server Retries: 3 attempts per server before trying next
- Exponential Backoff: Binary exponential backoff with jitter
- Base delay: 100ms
- Max delay: 5 seconds
- Jitter: ±25% random variation
- No Backoff: Immediate retry with new leader for leader discovery
- Size Validation: Keys and values limited to 1.5 MiB
- gRPC Errors: Properly handled with retry logic
- Timeout Handling: 1-second timeout per request
- Leader Information: Non-leader errors include leader index
- Protocol: gRPC over HTTP/2
- Credentials: Insecure (for development)
- Timeouts: 50ms for Raft RPCs, 1s for KV operations
- Connection Pooling: Client maintains connections to all servers
- RequestVote: Used during leader election
- AppendEntries: Used for log replication and heartbeats
- KV Operations: Put, Get, Delete requests/responses
sharded-kv/
├── cmd/
│ └── main.go # Server entry point
├── raft/ # Raft consensus implementation
│ ├── raft.go # Core Raft state machine
│ ├── message.go # RPC message handling
│ ├── service.go # gRPC service implementation
│ ├── network.go # Transport layer
│ └── utility.go # Helper functions
├── rsm/ # Replicated State Machine
│ ├── rsm.go # RSM implementation
│ └── types.go # Operation and result types
├── server/ # gRPC service layer
│ ├── kv_service.go # KV service implementation
│ └── errors.go # Error definitions
├── kv_client/ # Client library
│ ├── pkg/client/ # Client implementation
│ │ ├── client.go
│ │ └── client_test.go
│ └── cmd/cli/ # CLI tool
│ └── main.go
├── benchmark/ # Benchmark tool
│ ├── main.go # Benchmark runner
│ ├── workload.go # Workload generation
│ ├── stats.go # Metrics collection
│ └── reporter.go # Report generation
├── proto/ # Protocol buffer definitions
│ ├── kvstore/v1/ # KV service protos
│ ├── raft/v1/ # Raft RPC protos
│ └── rsm/v1/ # RSM log entry protos
├── config/ # Configuration handling
│ └── config.go
├── sample_config/ # Example configurations
│ └── dev.toml
├── docker/ # Docker configuration
│ └── Dockerfile
├── docker-compose.yml # Cluster orchestration
├── go.mod # Go module definition
└── README.md # This file
See LICENSE file for details.
This implementation follows the Raft consensus algorithm as described in "In Search of an Understandable Consensus Algorithm" by Diego Ongaro and John Ousterhout.