SQLFlow: DuckDB for Streaming Data.

Quickstart | Tutorials | Documentation

SQLFlow is a high-performance stream processing engine that simplifies building data pipelines by letting you define them using just SQL. Think of SQLFlow as a lightweight, modern Flink.

Key Features:

Process data from Kafka, WebSockets, and more.
Write outputs to PostgreSQL, Kafka topics, or cloud storage (such as S3) in a variety of formats, including Parquet and Iceberg.
Built on DuckDB and Apache Arrow for high-speed processing.

Quick Start (Getting Started in 5 Minutes)

Pull the SQLFlow docker image

docker pull turbolytics/sql-flow:latest

Set up the local development environment

make setup-dev

Validate config against test data:

docker run -v $(pwd)/dev:/tmp/conf -v /tmp/sqlflow:/tmp/sqlflow turbolytics/sql-flow:latest dev invoke /tmp/conf/config/examples/basic.agg.mem.yml /tmp/conf/fixtures/simple.json

['{"city":"New York","city_count":28672}', '{"city":"Baltimore","city_count":28672}']

Start Kafka locally using Docker:

docker-compose -f dev/kafka-single.yml up -d

Publish test messages to Kafka:

python3 cmd/publish-test-data.py --num-messages=10000 --topic="input-simple-agg-mem"

Start a Kafka consumer inside the docker-compose container to verify SQLFlow output:

docker exec -it kafka1 kafka-console-consumer --bootstrap-server=kafka1:9092 --topic=output-simple-agg-mem

Start SQLFlow in docker:

docker run -v $(pwd)/dev:/tmp/conf -v /tmp/sqlflow:/tmp/sqlflow -e SQLFLOW_KAFKA_BROKERS=host.docker.internal:29092 turbolytics/sql-flow:latest run /tmp/conf/config/examples/basic.agg.mem.yml --max-msgs-to-process=10000

Verify the output in the Kafka consumer:

...
...
{"city":"San Francisco504","city_count":1}
{"city":"San Francisco735","city_count":1}
{"city":"San Francisco533","city_count":1}
{"city":"San Francisco556","city_count":1}

You just ran SQLFlow against a stream of Kafka data!

How SQLFlow Works

SQLFlow is a stream processing engine written in Python. It embeds DuckDB and Apache Arrow for high performance and consists of three core components:

Core Components

Input Source

SQLFlow ingests data from a variety of input sources, including Kafka and webhooks. SQLFlow models the input as a stream of data.

Handler

SQLFlow uses DuckDB and Apache Arrow to execute SQL against the input source. Handlers contain the stream processing logic: they filter, aggregate, enrich, or drop data.

Output Sink

SQLFlow writes the results of the SQL to output sinks, including Kafka, Postgres, the local filesystem, and blob storage.
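
The relationship between the three components can be pictured with a small, self-contained Python sketch. This is not SQLFlow's code: it uses Python's built-in sqlite3 module as a stand-in for DuckDB so that it runs without extra dependencies, and the function names `source`, `handler`, and `sink`, as well as the message shape, are illustrative assumptions.

```python
import json
import sqlite3


def source():
    """Yield raw JSON messages, standing in for a Kafka topic (made-up data)."""
    for city in ["New York", "Baltimore", "New York"]:
        yield json.dumps({"city": city})


def handler(messages):
    """Load one batch into an in-memory table and run SQL against it."""
    conn = sqlite3.connect(":memory:")  # stand-in for an embedded DuckDB
    conn.execute("CREATE TABLE batch (city TEXT)")
    conn.executemany(
        "INSERT INTO batch (city) VALUES (?)",
        [(json.loads(m)["city"],) for m in messages],
    )
    rows = conn.execute(
        "SELECT city, COUNT(*) AS city_count "
        "FROM batch GROUP BY city ORDER BY city"
    ).fetchall()
    conn.close()
    return [{"city": city, "city_count": count} for city, count in rows]


def sink(results):
    """Serialize results as compact JSON lines, standing in for an output topic."""
    return [json.dumps(r, separators=(",", ":")) for r in results]


output = sink(handler(source()))
for line in output:
    print(line)
```

The handler sees only a batch of rows and a SQL statement, which is why the same query could, in principle, be pointed at a different source or sink without changing the processing logic.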

The following image shows an example SQLFlow configuration file:

The file explicitly contains a pipeline configuration with a source, handler and sink section. This configuration file also contains commands to be executed prior to the pipeline running. These commands support things like attaching databases to the pipeline execution context.

SQLFlow Use-Cases

Streaming Data Transformations: Clean data and types and publish the new data (example config).
Stream Enrichment: Add data to an input stream and publish the new data (example config).
Data aggregation: Aggregate input data batches to decrease data volume (example config).
Tumbling Window Aggregation: Bucket data into arbitrary time windows (such as "hour" or "10 minutes") (example config).
Run SQL against the Bluesky Firehose: Execute SQL against any websocket source, such as the Bluesky firehose (example config).
Stream Data to Iceberg: Stream writes to an Iceberg Catalog.
Enrich Streams with Postgres Data: Query postgres during stream processing to enrich stream data.
Sink Kafka to Postgres: Insert stream processing outputs into postgres.
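
To make the tumbling window use case concrete: a tumbling window assigns each event to exactly one fixed-size, non-overlapping time bucket. The following is a minimal sketch in plain Python, not SQLFlow's table manager; the function names and the `(timestamp, city)` event shape are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def window_start(ts, size):
    """Return the start of the fixed-size window that contains ts."""
    buckets = (ts - EPOCH) // size  # whole windows elapsed since the epoch
    return EPOCH + buckets * size


def tumbling_counts(events, size):
    """Count events per (window start, city) pair."""
    counts = Counter()
    for ts, city in events:
        counts[(window_start(ts, size), city)] += 1
    return dict(counts)


# Made-up events: two fall in the 10:00 window, one in the 10:10 window.
events = [
    (datetime(2024, 1, 1, 10, 3, tzinfo=timezone.utc), "New York"),
    (datetime(2024, 1, 1, 10, 9, tzinfo=timezone.utc), "New York"),
    (datetime(2024, 1, 1, 10, 12, tzinfo=timezone.utc), "Baltimore"),
]
result = tumbling_counts(events, timedelta(minutes=10))
for (start, city), count in sorted(result.items()):
    print(start.isoformat(), city, count)
```

Because windows never overlap, each event contributes to exactly one bucket, and a closed window can be emitted and discarded once its time range has passed.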

SQLFlow Features & Roadmap

Sources
- Kafka Consumer using consumer groups
- Websocket input (for consuming bluesky firehose)
- HTTP (for webhooks)
Sinks
- Kafka Producer
- Stdout
- Local Disk
- Postgres
- S3
- Any output DuckDB Supports!
Serialization
- JSON Input
- JSON Output
- Parquet Output
- Iceberg Output (using pyiceberg)
Handlers
- Memory Persistence
- Pipeline-scoped SQL such as defining views, or attaching to databases.
- User Defined Functions (UDF)
- Dynamic Schema Inference
- Disk Persistence
- Static Schema Definition
Table Managers
- Tumbling Window Aggregations
- Buffered Table
Operations
- Observability Metrics (Prometheus)

Examples

Additional examples are available in the wiki: Tutorials

Consume Bluesky Firehose

SQLFlow supports DuckDB over websocket. Running SQL against the Bluesky firehose requires only a simple configuration file.

The following command starts a bluesky consumer and prints every post to stdout:

docker run -v $(pwd)/dev/config/examples:/examples turbolytics/sql-flow:latest run /examples/bluesky/bluesky.raw.stdout.yml

Check out the configuration files here

Stream Kafka to Iceberg

SQLFlow supports writing to Iceberg tables using pyiceberg.

The following configuration writes to an Iceberg table using a local SQLite catalog:

Initialize the SQLite iceberg catalog and test table

python3 cmd/setup-iceberg-local.py setup
created default.city_events
created default.bluesky_post_events
Catalog setup complete.

Start Kafka Locally

docker-compose -f dev/kafka-single.yml up -d

Publish Test Messages to Kafka

python3 cmd/publish-test-data.py --num-messages=5000 --topic="input-kafka-mem-iceberg"

Run SQLFlow, which will read from Kafka and write to the local Iceberg table

docker run \
  -e SQLFLOW_KAFKA_BROKERS=host.docker.internal:29092 \
  -e PYICEBERG_HOME=/tmp/iceberg/ \
  -v $(pwd)/dev/config/iceberg/.pyiceberg.yaml:/tmp/iceberg/.pyiceberg.yaml \
  -v /tmp/sqlflow/warehouse:/tmp/sqlflow/warehouse \
  -v $(pwd)/dev/config/examples:/examples \
  turbolytics/sql-flow:latest run /examples/kafka.mem.iceberg.yml --max-msgs-to-process=5000

Verify the Iceberg data was written by querying it with DuckDB

% duckdb
v1.1.3 19864453f7
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D select count(*) from '/tmp/sqlflow/warehouse/default.db/city_events/data/*.parquet';
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│         5000 │
└──────────────┘

Running SQLFlow

Coming soon! Until then, check out:

Tutorials

If you need support, please open an issue or contact us directly (danny [AT] turbolytics.io).

Development

Install Python dependencies

pip install -r requirements.txt
pip install -r requirements.dev.txt

C_INCLUDE_PATH=/opt/homebrew/Cellar/librdkafka/2.3.0/include LIBRARY_PATH=/opt/homebrew/Cellar/librdkafka/2.3.0/lib pip install confluent-kafka

Run tests

make test-unit

Benchmarks

The following table shows the performance of different test scenarios:

Name	Throughput	Max RSS Memory	Peak Memory Usage
Simple Aggregation Memory	45,000 msgs / sec	230 MiB	130 MiB
Simple Aggregation Disk	36,000 msgs / sec	256 MiB	102 MiB
Enrichment	13,000 msgs / sec	368 MiB	124 MiB
CSV Disk Join	11,500 msgs / sec	312 MiB	152 MiB
CSV Memory Join	33,200 msgs / sec	300 MiB	107 MiB
In Memory Tumbling Window	44,000 msgs / sec	198 MiB	96 MiB

More information about the benchmarks is available in the wiki.
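
As a back-of-the-envelope reading of the table, throughput can be converted into the time needed to drain a backlog. The sketch below copies the throughput figures from the table above; the backlog size is an arbitrary assumption, and the estimate ignores startup cost, broker latency, and any difference between the benchmark hardware and yours.

```python
# Throughput in messages per second, copied from the benchmark table.
THROUGHPUT = {
    "Simple Aggregation Memory": 45_000,
    "Simple Aggregation Disk": 36_000,
    "Enrichment": 13_000,
    "CSV Disk Join": 11_500,
    "CSV Memory Join": 33_200,
    "In Memory Tumbling Window": 44_000,
}


def seconds_to_process(num_messages, msgs_per_sec):
    """Estimate processing time assuming constant throughput."""
    return num_messages / msgs_per_sec


backlog = 1_000_000  # hypothetical backlog size, not from the benchmarks
for name, rate in THROUGHPUT.items():
    print(f"{name}: {seconds_to_process(backlog, rate):.1f} s")
```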

Contact Us

Like SQLFlow? Use SQLFlow? Feature Requests? Please let us know! danny@turbolytics.io
