Skip to content

Latest commit

 

History

History
54 lines (31 loc) · 3.19 KB

File metadata and controls

54 lines (31 loc) · 3.19 KB

1. What is it for? This library is used for seamlessly serializing and deserializing Avro records into and off Kafka topics in Spark jobs.

Deserialization capabilities are used in conjunction with Spark structured streams, i.e., Spark jobs can create structured streams capable of consuming data from Kafka and use the library to convert the incoming binary data into Spark SQL Rows which can than be queried (it "feels" like running SQL queries on binary data).

Serialization capabilities allows Dataframes to be written into Kafka as binary Avro records, which improves on the amount of data sent through the network.

2. Why is it important? Avro is Confluent's recommended payload for Kafka (see https://www.confluent.io/blog/avro-kafka-data/) but there's not support for using it in streaming Spark jobs.

Databricks' library only provides support for batch jobs (https://github.com/databricks/spark-avro), thus, the burden of performing conversions in streaming jobs is all on the user side.

Plus, Avro is a very efficient and fast serialization library and works on top of a very robust schema (i.e. allows for schema changes).

This library can be used as a "bridge" between two Spark applications where Kafka is the broker.

3. How does it work? It is implemented as a "bridge" between binary Avro data and Spark Rows.

It uses Avro's original Java API implementations to do must of the job. However, as those implementations rely on specific Java formats that cannot be understood by Catalyst (e.g. Avro has its own implementation of arrays, which extends java.util.Collection), this library overwrote some of the methods of the original library to convert between Java and Catalyst types during Avro records reading and writing.

It also uses Databricks Avro Spark library to perform the conversions between Avro and Spark SQL schemas, and also to convert Avro records into Rows.

4. Does it support all Avro formats? Yes, except for null as it is not originally supported by Databricks. Although adding the support is trivial, we decided to keep full compatibility with Databricks. Moreover, nullable fields can easily be redefined as options.

5. Why not using existing libraries? Because existing libraries do not support streaming. Also, every available open solution that tries to solve this problem ignore does not cope with nested types.

6. Can it read batches instead of streams? No. For that, Databricks library could be used instead.

7. Can it write streams instead of Dataframes? No. But if you can transform your input stream into Dataframes, then yes.

8. Why is it distributed as an uber jar? Because we wanted it to be a self-contained library without any external dependencies to be resolved by users. Otherwise, if we decided to exclude dependencies directly in the library POM, then we would be making guesses about users' environments, which does not sound appropriate. If users want their environment to provide any dependency, all the need to do is to specify this in their own project POM.

9. Does it support data sources other than Kafka? For writing, yes. For reading, the current API works with DataInputStreams, thus, theoretically, any stream should work, although only Kafka has been tested so far.