
duckdb_protobuf

a duckdb extension for parsing sequences of protobuf messages encoded in either the standard varint-delimited format or a big-endian u32 length-prefixed format.

quick start

ensure you're using duckdb 1.1.0 for support with the latest features. if you need new features on an older version, please open an issue.

$ duckdb -version
v1.1.0 fa5c2fe15f

start duckdb with the -unsigned flag to allow loading unsigned extensions.

$ duckdb -unsigned

or, if you're using the jdbc connector, set the allow_unsigned_extensions jdbc connection property.

now install the extension:

INSTALL protobuf FROM 'https://duckdb.0xcaff.xyz';

next, load it (you'll need to do this once for every session in which you want to use the extension):

LOAD protobuf;

and start shredding up your protobufs!

SELECT *
FROM protobuf(
    descriptors = './descriptor.pb',
    files = './scrape/data/SceneVersion/**/*.bin',
    message_type = 'test_server.v1.GetUserSceneVersionResponse',
    delimiter = 'BigEndianFixed'
)
LIMIT 10;

if you want builds for a platform or duckdb version which doesn't currently have them, please open an issue.

install from file

download the latest version from releases. if you're on macOS, blow away the quarantine attribute with the following so the file can be loaded:

$ xattr -d com.apple.quarantine /Users/martin/Downloads/protobuf.duckdb_extension

next, load the extension:

LOAD '/Users/martin/Downloads/protobuf.duckdb_extension';

why

sometimes you want to land your raw primary data in a format with a well-defined structure and pretty good decode performance, then poke around without a load step. maybe you're scraping an endpoint which returns protobuf responses, you're figuring out the schema as you go, and iteration speed matters much more than query performance.

duckdb_protobuf offers a new choice along the flexibility-performance tradeoff continuum: fast exploration of protobuf streams with little upfront load complexity or time.

configuration

  • descriptors: path to the protobuf descriptor file. Generated using something like protoc --descriptor_set_out=descriptor.pb ...
  • files: glob pattern for the files to read. Uses the glob crate for evaluating globs.
  • message_type: the fully qualified message type to parse.
  • delimiter: specifies how consecutive messages are delimited
    • BigEndianFixed: every message is prefixed with a big-endian u32 value specifying its length; files are a sequence of such messages
    • Varint: every message is prefixed with a protobuf varint value specifying its length; files are a sequence of such messages
    • SingleMessagePerFile: each file contains a single message
  • filename, position and size: boolean flags enabling columns which record where each message originated
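for example, the source-info columns can be enabled alongside the options from the quick start (paths and the message type here follow that example; the output column names are assumed to match the option names):

```sql
-- enable all three source-info columns; each output row then carries
-- the file it came from, its byte offset within that file and its
-- encoded size
SELECT filename, position, size
FROM protobuf(
    descriptors = './descriptor.pb',
    files = './scrape/data/SceneVersion/**/*.bin',
    message_type = 'test_server.v1.GetUserSceneVersionResponse',
    delimiter = 'BigEndianFixed',
    filename = true,
    position = true,
    size = true
)
LIMIT 10;
```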

features

  • converts google.protobuf.Timestamp messages to duckdb timestamp
  • supports nested messages with repeating fields
  • scales decoding across as many threads as duckdb allows
  • supports projection pushdown (for the first level of columns), ensuring only necessary columns are decoded
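as a sketch of what projection pushdown buys you, a query selecting a single top-level column only decodes that field from each message (the `id` column here is hypothetical):

```sql
-- only the top-level `id` field is decoded from each message;
-- the rest of each protobuf payload is skipped
SELECT id
FROM protobuf(
    descriptors = './descriptor.pb',
    files = './scrape/data/SceneVersion/**/*.bin',
    message_type = 'test_server.v1.GetUserSceneVersionResponse',
    delimiter = 'BigEndianFixed'
);
```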

limitations

  • doesn't support a few field types (bytes, maps, {s,}fixed{32,64}, sint{32,64}); contributions are welcome, as is feedback that these types are in use!

i'm releasing this to understand how other folks are using protobuf streams and duckdb. i'm open to PRs, issues and other feedback.