
Data model for Big Data: Illustration

This chapter covers

■ Apache Thrift
■ Implementing a graph schema using Apache Thrift
■ Limitations of serialization frameworks

In the last chapter you saw the principles of forming a data model—the value of
raw data, dealing with semantic normalization, and the critical importance of
immutability. You saw how a graph schema can satisfy all these properties and saw
what the graph schema looks like for SuperWebAnalytics.com.
This is the first of the illustration chapters, in which we demonstrate the concepts of the previous chapter using real-world tools. You can read just the theory chapters of the book and learn the whole Lambda Architecture, but the illustration chapters show you the nuances of mapping the theory to real code. In this chapter we'll implement the SuperWebAnalytics.com data model using Apache Thrift, a serialization framework. You'll see that even in a task as straightforward as writing a schema, there is friction between the idealized theory and what you can achieve in practice.


3.1 Why a serialization framework?


Many developers go down the path of writing their raw data in a schemaless format like JSON. This is appealing because of how easy it is to get started, but this approach quickly leads to problems. Whether due to bugs or misunderstandings between different developers, data corruption inevitably occurs. It's our experience that data corruption errors are some of the most time-consuming to debug.

Data corruption issues are hard to debug because you have very little context on how the corruption occurred. Typically you'll only notice there's a problem when there's an error downstream in the processing—long after the corrupt data was written. For example, you might get a null pointer exception due to a mandatory field being missing. You'll quickly realize that the problem is a missing field, but you'll have absolutely no information about how that data got there in the first place.

When you create an enforceable schema, you get errors at the time of writing the data—giving you full context as to how and why the data became invalid (like a stack trace). In addition, the error prevents the program from corrupting the master dataset by writing that data.

Serialization frameworks are an easy approach to making an enforceable schema. If you've ever used an object-oriented, statically typed language, using a serialization framework will be immediately familiar. Serialization frameworks generate code for whatever languages you wish to use for reading, writing, and validating objects that match your schema.

However, serialization frameworks are limited when it comes to achieving a fully rigorous schema. After discussing how to apply a serialization framework to the SuperWebAnalytics.com data model, we'll discuss these limitations and how to work around them.

3.2 Apache Thrift


Apache Thrift (http://thrift.apache.org/) is a tool that can be used to define statically typed, enforceable schemas. It provides an interface definition language to describe the schema in terms of generic data types, and this description can later be used to automatically generate the actual implementation in multiple programming languages.

OUR USE OF APACHE THRIFT Thrift was initially developed at Facebook for
building cross-language services. It can be used for many purposes, but we’ll
limit our discussion to its usage as a serialization framework.

Other serialization frameworks


There are other tools similar to Apache Thrift, such as Protocol Buffers and Avro. Remember, the purpose of this book is not to provide a survey of all possible tools for every situation, but to use an appropriate tool to illustrate the fundamental concepts. As a serialization framework, Thrift is practical, thoroughly tested, and widely used.


The workhorses of Thrift are the struct and union type definitions. They're composed of other fields, such as

■ Primitive data types (strings, integers, longs, and doubles)
■ Collections of other types (lists, maps, and sets)
■ Other structs and unions

In general, unions are useful for representing nodes, structs are natural representations of edges, and properties use a combination of both. This will become evident from the type definitions needed to represent the SuperWebAnalytics.com schema components.

3.2.1 Nodes
For our SuperWebAnalytics.com user nodes, an individual is identified either by a
user ID or a browser cookie, but not both. This pattern is common for nodes, and it
matches exactly with a union data type—a single value that may have any of several
representations.
In Thrift, unions are defined by listing all possible representations. The following
code defines the SuperWebAnalytics.com nodes using Thrift unions:
union PersonID {
  1: string cookie;
  2: i64 user_id;
}

union PageID {
  1: string url;
}

Note that unions can also be used for nodes with a single representation. Unions
allow the schema to evolve as the data evolves—we’ll discuss this further later in this
section.
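
To make this concrete, here is a minimal sketch of what these unions look like from Java, assuming you've run the Thrift compiler with its Java generator (generated union classes expose a static factory method per field; exact names can vary across Thrift versions):

PersonID viaCookie = PersonID.cookie("abc123");
PersonID viaUserId = PersonID.user_id(12345L);

// Exactly one field of a union is set at a time.
if (viaUserId.getSetField() == PersonID._Fields.USER_ID) {
    long id = viaUserId.getUser_id();
}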

3.2.2 Edges
Each edge can be represented as a struct containing two nodes. The name of an edge
struct indicates the relationship it represents, and the fields in the edge struct contain
the entities involved in the relationship.
The schema definition is very simple:
struct EquivEdge {
  1: required PersonID id1;
  2: required PersonID id2;
}

struct PageViewEdge {
  1: required PersonID person;
  2: required PageID page;
  3: required i64 nonce;
}


The fields of a Thrift struct can be denoted as required or optional. If a field is defined as required, then a value for that field must be provided, or else Thrift will give an error upon serialization or deserialization. Because each edge in a graph schema must have two nodes, they are required fields in this example.
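
You can see this enforcement directly from the generated code. The following is a hedged Java sketch (assuming the Java generator and Thrift's binary protocol): serializing an EquivEdge whose second node was never set fails at write time, before the bad record can reach the master dataset.

import org.apache.thrift.TException;
import org.apache.thrift.TSerializer;
import org.apache.thrift.protocol.TBinaryProtocol;

public class RequiredFieldDemo {
  public static void main(String[] args) throws TException {
    EquivEdge edge = new EquivEdge();
    edge.setId1(PersonID.cookie("abc123"));
    // id2 is required but never set

    try {
      new TSerializer(new TBinaryProtocol.Factory()).serialize(edge);
    } catch (TException e) {
      // Thrift rejects the write, with full context still available
      System.err.println("Invalid edge: " + e.getMessage());
    }
  }
}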

3.2.3 Properties
Last, let’s define the properties. A property contains a node and a value for the property.
The value can be one of many types, so it’s best represented using a union structure.
Let’s start by defining the schema for page properties. There’s only one property
for pages, so it’s really simple:
union PagePropertyValue {
  1: i32 page_views;
}

struct PageProperty {
  1: required PageID id;
  2: required PagePropertyValue property;
}

Next let’s define the properties for people. As you can see, the location property is
more complex and requires another struct to be defined:
struct Location {
  1: optional string city;
  2: optional string state;
  3: optional string country;
}

enum GenderType {
  MALE = 1,
  FEMALE = 2
}

union PersonPropertyValue {
  1: string full_name;
  2: GenderType gender;
  3: Location location;
}

struct PersonProperty {
  1: required PersonID id;
  2: required PersonPropertyValue property;
}

The location struct is interesting because the city, state, and country fields could have
been stored as separate pieces of data. In this case, they’re so closely related it makes
sense to put them all into one struct as optional fields. When consuming location
information, you’ll almost always want all of those fields.


3.2.4 Tying everything together into data objects


At this point, the edges and properties are defined as separate types. Ideally you'd want to store all of the data together to provide a single interface to access your information. Furthermore, it also makes your data easier to manage if it's stored in a single dataset. This is accomplished by wrapping every property and edge type into a DataUnit union—see the following code listing.

Listing 3.1 Completing the SuperWebAnalytics.com schema

union DataUnit {
  1: PersonProperty person_property;
  2: PageProperty page_property;
  3: EquivEdge equiv;
  4: PageViewEdge page_view;
}

struct Pedigree {
  1: required i32 true_as_of_secs;
}

struct Data {
  1: required Pedigree pedigree;
  2: required DataUnit dataunit;
}

Each DataUnit is paired with its metadata, which is kept in a Pedigree struct. The pedigree contains the timestamp for the information, but could also potentially contain debugging information or the source of the data. The final Data struct corresponds to a fact from the fact-based model.
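
To tie the pieces together, here is a hedged sketch of a full round trip through this schema using the generated Java code and Thrift's binary protocol. It records a single fact (a page view) and restores it from its serialized form:

import org.apache.thrift.TException;
import org.apache.thrift.TSerializer;
import org.apache.thrift.TDeserializer;
import org.apache.thrift.protocol.TBinaryProtocol;

public class DataRoundTrip {
  public static void main(String[] args) throws TException {
    // A single fact: person 123 viewed a page at the current time
    PageViewEdge view = new PageViewEdge()
        .setPerson(PersonID.user_id(123L))
        .setPage(PageID.url("http://mysite.com/"))
        .setNonce(System.nanoTime());

    Data fact = new Data()
        .setPedigree(new Pedigree()
            .setTrue_as_of_secs((int) (System.currentTimeMillis() / 1000)))
        .setDataunit(DataUnit.page_view(view));

    byte[] bytes = new TSerializer(new TBinaryProtocol.Factory()).serialize(fact);

    Data restored = new Data();
    new TDeserializer(new TBinaryProtocol.Factory()).deserialize(restored, bytes);
  }
}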

3.2.5 Evolving your schema


Thrift is designed so that schemas can evolve over time. This is a crucial property,
because as your business requirements change you’ll need to add new kinds of data,
and you’ll want to do so as effortlessly as possible.
The key to evolving Thrift schemas is the numeric identifiers associated with each
field. Those IDs are used to identify fields in their serialized form. When you want to
change the schema but still be backward compatible with existing data, you must obey
the following rules:
■ Fields may be renamed. This is because the serialized form of an object uses the
field IDs, not the names, to identify fields.
■ A field may be removed, but you must never reuse that field ID. When deserializing
existing data, Thrift will ignore all fields with field IDs not included in the
schema. If you were to reuse a previously removed field ID, Thrift would try to
deserialize that old data into the new field, which will lead to either invalid or
incorrect data.


■ Only optional fields can be added to existing structs. You can’t add required fields
because existing data won’t have those fields and thus won’t be deserializable.
(Note that this doesn’t apply to unions, because unions have no notion of
required and optional fields.)
As an example, should you want to change the SuperWebAnalytics.com schema to store a person's age and the links between web pages, you'd make the following changes to your Thrift definition file (the additions are the new age field and the LinkedEdge struct).

Listing 3.2 Extending the SuperWebAnalytics.com schema

union PersonPropertyValue {
  1: string full_name;
  2: GenderType gender;
  3: Location location;
  4: i16 age;
}

struct LinkedEdge {
  1: required PageID source;
  2: required PageID target;
}

union DataUnit {
  1: PersonProperty person_property;
  2: PageProperty page_property;
  3: EquivEdge equiv;
  4: PageViewEdge page_view;
  5: LinkedEdge page_link;
}

Notice that adding a new age property is done by adding it to the corresponding
union structure, and a new edge is incorporated by adding it into the DataUnit union.

3.3 Limitations of serialization frameworks


Serialization frameworks only check that all required fields are present and are of the expected type. They're unable to check richer properties like "Ages should be non-negative" or "true-as-of timestamps should not be in the future." Data not matching these properties would indicate a problem in your system, and you wouldn't want them written to your master dataset.

This may not seem like a limitation because serialization frameworks seem somewhat similar to how schemas work in relational databases. In fact, you may have found relational database schemas a pain to work with and worry that making schemas even stricter would be even more painful. But we urge you not to confuse the incidental complexities of working with relational database schemas with the value of schemas themselves. The difficulties of representing nested objects and doing schema migrations with relational databases are non-existent when applying serialization frameworks to represent immutable objects using graph schemas.


The right way to think about a schema is as a function that takes in a piece of data and returns whether it's valid or not. The schema language for Apache Thrift lets you represent a subset of these functions where only field existence and field types are checked. The ideal tool would let you implement any possible schema function.

Such an ideal tool—particularly one that is language neutral—doesn't exist, but there are two approaches you can take to work around these limitations with a serialization framework like Apache Thrift:
■ Wrap your generated code in additional code that checks the additional properties you care about, like ages being non-negative (see the sketch following this list). This approach works well as long as you're only reading/writing data from/to a single language—if you use multiple languages, you have to duplicate the logic in many languages.
■ Check the extra properties at the very beginning of your batch-processing workflow. This step would split your dataset into "valid data" and "invalid data" and send a notification if any invalid data was found. This approach makes it easier to implement the rest of your workflow, because anything getting past the validity check can be assumed to have the stricter properties you care about. But this approach doesn't prevent the invalid data from being written to the master dataset and doesn't help with determining the context in which the corruption happened.
Neither approach is ideal, but it's hard to see how you can do better if your organization reads/writes data in multiple languages. You have to decide whether you'd rather maintain the same logic in multiple languages or lose the context in which corruption was introduced. The only approach that would be perfect would be a serialization framework that is also a general-purpose programming language that translates itself into whatever languages it's targeting. Such a tool doesn't exist, though it's theoretically possible.

3.4 Summary
For the most part, implementing the enforceable graph schema for SuperWebAnalytics.com was straightforward. You saw the friction that appears when using a serialization framework for this purpose—namely, the inability to enforce every property you care about. The tooling will rarely capture your requirements perfectly, but it's important to know what would be possible with ideal tools. That way you're cognizant of the trade-offs you're making and can keep an eye out for better tools (or make your own). This will be a common theme as we go through the theory and illustration chapters.

In the next chapter you'll learn how to physically store a master dataset in the batch layer so that it can be processed easily and efficiently.



Features of Apache Thrift
By Randy Abernethy

In this article, excerpted from The Programmer's Guide to Apache Thrift, we'll discuss the key features of Apache Thrift.

There are several key benefits associated with using Apache Thrift to develop network services or perform cross-language serialization tasks.

■ Full SOA implementation - Apache Thrift supplies a complete SOA solution
■ Modularity - Apache Thrift supports plug-in serialization protocols and transports
■ Performance - Apache Thrift is fast and efficient
■ Reach - Apache Thrift supports a wide range of languages and platforms
■ Flexibility - Apache Thrift supports interface evolution

Let’s take a look at each of these features in turn.

Service Implementation
Services are modular application components that provide interfaces accessible over a
network. Service interfaces are described in Apache Thrift using Interface Definition Language
(IDL) (see Listing 1). The IDL can be compiled to generate stub code used to connect clients
and servers in a wide range of languages.

For example, imagine you have a C++ module in a GUI application that tracks and computes sailing team statistics for the America's Cup. As it happens, your company's web development team would like to use the sail stats module to enhance a client-facing web application, but the web site is written in PHP. To provide the sail stats features to the web dev team, the sail stats module can be deployed as a network service.

Figure 1 - Converting a module from a monolithic application (above dotted line) into a network service for a distributed application (below dotted line)

Microservices and Service-Oriented Architecture (SOA)

The microservices and SOA approaches to distributed application design break applications down into services, which are remotely accessible autonomous modules composed of a set of closely related functions. SOA-based systems generally provide their features over language-agnostic interfaces, allowing clients to be constructed in the most appropriate language and on the most appropriate platform, independent of the service implementation. SOA services are typically stateless and loosely coupled, communicating with clients through a formal interface contract. SOA services may be internal to an organization or support clients across business boundaries.

Encapsulating the SailStats module in a SOA-style service will make it easy for any part of the company's enterprise to access it. There are several common ways to build SOA services using web-oriented technologies. However, many of these would require the installation of web or application servers, possibly a material amount of additional coding, and the use of HTTP communication schemes and text-based data formats, which are broadly supported but not famous for being fast or compact.

Apache Thrift offers a compelling alternative. Using Apache Thrift IDL, we can define a service interface with the functions we want to expose. We can then use the Apache Thrift compiler to generate RPC code for our SailStats service in PHP and C++ (and most other commercially viable languages). The web team can now use code generated in their language of choice to call the functions offered by the SailStats service, exactly as if the functions were defined locally (see Figure 1).

Apache Thrift also supplies a complete library of RPC servers. This means that you can use
one of the powerful multithreaded servers provided by Apache Thrift to handle all of the
server RPC processing and concurrency matters. Apache Thrift RPC servers are not only fast
but they also have a much smaller footprint than most web application servers, making them
suitable for many embedded systems.
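
For example, here is a minimal Java sketch of hosting a service with one of the prebuilt multithreaded servers. SailStatsHandler is a hypothetical class implementing SailStats.Iface, the interface the Thrift compiler generates from the IDL in Listing 1 below; the port is a placeholder:

import org.apache.thrift.server.TServer;
import org.apache.thrift.server.TThreadPoolServer;
import org.apache.thrift.transport.TServerSocket;

public class SailStatsServer {
  public static void main(String[] args) throws Exception {
    SailStats.Processor<SailStats.Iface> processor =
        new SailStats.Processor<>(new SailStatsHandler());

    // One of the prebuilt multithreaded RPC servers
    TServerSocket socket = new TServerSocket(9090);
    TServer server = new TThreadPoolServer(
        new TThreadPoolServer.Args(socket).processor(processor));
    server.serve();
  }
}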

Listing 1

service SailStats {
  double GetSailorRating(1: string SailorName)
  double GetTeamRating(1: string TeamName)
  double GetBoatRating(1: i64 BoatSerialNumber)
  list<string> GetSailorsOnTeam(1: string TeamName)
  list<string> GetSailorsRatedBetween(1: double MinRating, 2: double MaxRating)
  string GetTeamCaptain(1: string TeamName)
}

In summary, to turn a code library or module into a high-performance RPC service with Apache Thrift, all we need to do is:

1. Define the service interface in IDL

2. Compile the IDL to generate client and server RPC stub code in the desired languages

3. On the client side, call the remote functions as if they were local using the client stubs

4. On the server side, connect the server stubs to the desired functionality

5. Choose one of the prebuilt Apache Thrift servers to host the service

In exchange for a fairly small amount of work, we can turn almost any set of existing functions into a high-performance Apache Thrift service, accessible from a broad range of client languages.
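
Step 3 is worth seeing in code. Here is a minimal Java client sketch against the generated SailStats.Client stub (the host, port, and sailor name are placeholders):

import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class SailStatsClient {
  public static void main(String[] args) throws Exception {
    TTransport transport = new TSocket("localhost", 9090);
    transport.open();

    SailStats.Client client = new SailStats.Client(new TBinaryProtocol(transport));
    // A remote call that reads exactly like a local one
    double rating = client.GetSailorRating("Jane Doe");
    System.out.println("Rating: " + rating);

    transport.close();
  }
}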

Modular Serialization
To make a function call from a client to a server, both client and server must agree on the
representation of data exchanged. The typical approach to solving this problem is to select an
interchange format and then to transform all data to be exchanged into this interchange
format. The process of transforming data to and from an interchange format is called
serialization.

The Apache Thrift framework provides a complete, modular, cross-language serialization layer which supports RPC and standalone serialization. Serialization frameworks make it easy to store data to disk for later retrieval by another application. For example, a service written in C that captures live earthquake data in a C struct could serialize this data to disk using Apache Thrift (see figure 2). The serialization process converts the C struct into a generic Apache Thrift serialized object. At a later time, a Ruby earthquake analysis application could use Apache Thrift to restore the serialized object. The serialization layer takes care of the various differences in data representation between the languages automatically.

Figure 2 - Apache Thrift serialization protocols enable different programming languages to share abstract data types

A distinctive feature of the Apache Thrift serialization framework is that it is not hard-wired to a single serialization protocol. The serialization layer provided by Apache Thrift is modular, making it possible to choose from an assortment of serialization protocols, or even to create custom serialization protocols. Out of the box, Apache Thrift supports an efficient binary serialization protocol, a compact protocol that reduces the size of serialized objects, and a JSON protocol which provides broad interoperability with JavaScript and the web. A ZLib layer can also be added to provide high-ratio compression in some languages.
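
A sketch of what this modularity looks like in Java: the object and the serialization call stay the same, and only the protocol factory changes. Quake is a hypothetical generated struct standing in for any Thrift type:

import org.apache.thrift.TException;
import org.apache.thrift.TSerializer;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.protocol.TCompactProtocol;
import org.apache.thrift.protocol.TJSONProtocol;

public class ProtocolChoice {
  public static void main(String[] args) throws TException {
    Quake quake = new Quake().setMagnitude(6.1).setTimeSecs(1394069000L);

    // Same object, three wire formats: only the protocol factory changes
    byte[] binary  = new TSerializer(new TBinaryProtocol.Factory()).serialize(quake);
    byte[] compact = new TSerializer(new TCompactProtocol.Factory()).serialize(quake);
    byte[] json    = new TSerializer(new TJSONProtocol.Factory()).serialize(quake);

    System.out.printf("binary=%d compact=%d json=%d bytes%n",
        binary.length, compact.length, json.length);
  }
}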

Performance
Apache Thrift is a good fit in many distributed computing settings, but it excels in the area of high-performance backend services. The choice of prebuilt and custom serialization protocols allows the application designer to choose the most appropriate protocol for the needs of the application, balancing transmission size, speed, portability, and human readability.

Figure 3 - Apache Thrift balances performance with reach and flexibility: a spectrum running from custom protocols (extreme performance) through Apache Thrift (high performance, broad reach) to REST (extreme reach)

Apache Thrift supports compiled languages such as C, C++, Java, and C#, which generally have a performance edge over interpreted languages. This allows performance-critical services to be built in the appropriate language while still providing interoperability with highly productive front-end development languages.

Apache Thrift RPC servers are lightweight, performing only the task of hosting Apache
Thrift services. A selection of servers is available in various languages giving application
designers the flexibility to choose a concurrency model well suited to their application
requirements. These servers are easy to deploy and load balance as standalone processes or
within virtual machines or containers.

Apache Thrift covers a wide range of performance requirements in the spectrum between
custom communications development on one end and REST on the other (see figure 3). The
lightweight nature of Apache Thrift combined with a choice of efficient serialization protocols
allows Apache Thrift to meet demanding performance requirements while offering support for
an impressive breadth of languages and platforms.

Reach
The Apache Thrift framework supports a number of programming languages, operating systems, and hardware platforms in both serialization and service capacities. Companies that are growing and changing rapidly need solutions that give teams the flexibility to integrate with new languages and platforms rapidly and with low friction. Apache Thrift can be a significant business advantage in such settings. Figure 4 illustrates the broad scope of environments within which Apache Thrift is often found.

Figure 4 - Apache Thrift is an effective solution in embedded, enterprise, and web technology environments

The table below provides a list of the languages currently supported directly by Apache Thrift. Note that support for C# enables other .NET/CLR languages, such as F#, Visual Basic, and IronPython. By the same token, support for Java enables most JVM-based languages to interoperate with Apache Thrift, including Scala, Clojure, and Groovy. JavaScript support is provided for browser-based applications and Node.js. Other projects found on the web expand this list further.

Table 1 - Languages supported by Apache Thrift

C             C++           C#            D
Delphi        Erlang        Go            Haskell
Haxe          Java          JavaScript    Lua
Objective-C   OCaml         Perl          PHP
Python        Ruby          Smalltalk     TypeScript

Apache Thrift supports these languages on a range of platforms including Windows, iOS, OS X, Linux, Android, and many other Unix-like systems. Because Apache Thrift is compact and supports C/C++ and Java ME, it is often appropriate for embedded systems. Apache Thrift also supports HTTP[S], WebSocket, and an array of web-technology languages, including Perl, PHP, Python, Ruby, and JavaScript, making it viable in web-oriented environments. Few frameworks can supply the breadth of reach in languages and platforms offered by Apache Thrift.

Interface Evolution
Interface evolution is the process of changing the elements of an interface gradually over time. Modern IDL-based systems like Apache Thrift make it possible to evolve interfaces without breaking interoperability with modules built around older versions of the interface.

For example, consider the previously described earthquake application where a C language program writes a C language struct to disk each time a tremor is reported. Let's assume that the earthquake struct contains fields for the date, time, position, and magnitude. The interface evolution features of Apache Thrift allow new fields, say the earthquake's nearest city and state, to be added to the earthquake struct without breaking other applications reading the serialized data. The Ruby reporting program will continue to read old and new earthquake files, simply ignoring fields it does not recognize. Should the Ruby programmers require the new fields, they may add support for them at their leisure, using default values when old files without the new fields are read.
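
This tolerance is visible directly in generated code. Here is a hedged Java sketch (the Quake struct and its nearest_city field are hypothetical names following the earthquake example):

import org.apache.thrift.TException;
import org.apache.thrift.TDeserializer;
import org.apache.thrift.protocol.TBinaryProtocol;

public class EvolvedReader {
  // oldBytes were written before nearest_city was added to the schema
  public static String nearestCity(byte[] oldBytes) throws TException {
    Quake quake = new Quake();
    new TDeserializer(new TBinaryProtocol.Factory()).deserialize(quake, oldBytes);
    // The new optional field is simply unset in old records
    return quake.isSetNearest_city() ? quake.getNearest_city() : "unknown";
  }
}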

Early RPC systems like SunRPC, DCE RPC, CORBA, and MSRPC supplied little or no support for interface evolution. As platforms grow and requirements change, rigid interfaces can make it hard to extend and maintain RPC-based services. Modern RPC systems such as Apache Thrift provide a number of features which allow interfaces to evolve over time without breaking compatibility with existing systems. Functions can be extended with new parameters, old parameters can be removed, and default values can be supplied. Properly applied, these changes can be made without impacting peers using older versions of the interface.

Modern engineering sensibilities such as Microservices, Continuous Integration (CI), and Continuous Delivery (CD) require systems to support incremental improvements without impacting the rest of the platform. Systems that do not supply some form of interface evolution tend to "break the world" when changed. In such systems, changing an interface means that all of the clients and servers using that interface must be rewritten and/or recompiled, then redeployed in a big bang. Apache Thrift interface evolution features allow multiple interface versions to coexist, making incremental updates simple and natural.

