HBASE
Non-relational, scalable database
built on HDFS
Based on Google’s BigTable
CRUD
■ Create
■ Read
■ Update
■ Delete
■ There is no query language, only CRUD APIs!
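For a concrete picture of what "CRUD only" means, here is a minimal sketch of the four operations against HBase's REST gateway using Python's requests library. The gateway address, table name ('users'), and column family ('info') are assumptions for the example; in the REST API, row keys, column names, and values travel base64-encoded inside JSON.

import base64
import json
import requests

BASE = "http://localhost:8000"   # assumed address of the HBase REST gateway

def b64(s):
    # The REST API expects keys, column names, and values base64-encoded
    return base64.b64encode(s.encode("utf-8")).decode("ascii")

def put_cell(table, row, column, value):
    # Create / Update: writing a cell again simply adds a newer version
    body = {"Row": [{"key": b64(row),
                     "Cell": [{"column": b64(column), "$": b64(value)}]}]}
    requests.put(f"{BASE}/{table}/{row}/{column}",
                 data=json.dumps(body),
                 headers={"Content-Type": "application/json"}).raise_for_status()

def get_row(table, row):
    # Read: fetch every cell of a row (values come back base64-encoded)
    r = requests.get(f"{BASE}/{table}/{row}", headers={"Accept": "application/json"})
    r.raise_for_status()
    return r.json()

def delete_row(table, row):
    # Delete: remove the entire row
    requests.delete(f"{BASE}/{table}/{row}").raise_for_status()

put_cell("users", "user1", "info:name", "Alice")   # Create
print(get_row("users", "user1"))                   # Read
put_cell("users", "user1", "info:name", "Bob")     # Update
delete_row("users", "user1")                       # Delete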
HBase architecture
[Architecture diagram: ZooKeeper and the HMaster coordinate a set of Region Servers; tables are auto-sharded into regions across the Region Servers, and all data is stored on HDFS.]
HBase data model
■ Fast access to any given ROW
■ A ROW is referenced by a unique KEY
■ Each ROW has some small number of COLUMN FAMILIES
■ A COLUMN FAMILY may contain arbitrary COLUMNS
■ You can have a very large number of COLUMNS in a COLUMN FAMILY
■ Each CELL can have many VERSIONS, each tagged with a timestamp
■ Sparse data is A-OK – missing columns in a row consume no storage.
Example: One row of a web table
Key: com.cnn.www
■ Contents column family
– contents: = “<html><head>CNN…” (three timestamped versions of the page)
■ Anchor column family
– anchor:cnnsi.com = “CNN”
– anchor:my.look.ca = “CNN.com”
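Conceptually, an HBase table is a sparse, sorted, multi-dimensional map: row key → column family → column qualifier → timestamp → value. A small Python sketch of the row above makes the nesting explicit (the timestamps are made up for illustration):

# Row key -> column family -> column qualifier -> timestamp -> value
webtable = {
    "com.cnn.www": {
        "contents": {
            "": {                      # unnamed qualifier in the contents family
                1300000003: "<html><head>CNN...",
                1300000002: "<html><head>CNN...",
                1300000001: "<html><head>CNN...",
            },
        },
        "anchor": {
            "cnnsi.com":  {1300000002: "CNN"},
            "my.look.ca": {1300000001: "CNN.com"},
        },
    },
}

# Fast access to any given row by its key:
row = webtable["com.cnn.www"]
# Newest version of a cell:
latest = max(row["anchor"]["cnnsi.com"].items())[1]   # -> "CNN"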
Some ways to access HBase
■ HBase shell
■ Java API
– Wrappers for Python, Scala, etc.
■ Spark, Hive, Pig
■ REST service
■ Thrift service
■ Avro service
LET’S PLAY WITH
HBASE
Creating an HBase table with Python via REST
What are we doing?
■ Create an HBase table for movie ratings by user
■ Then show we can quickly query it for individual users
■ Good example of sparse data
Column family: rating (one column per rated movie)

Key          rating:50    rating:33    rating:223
UserID 1     5            5            (no rating stored: sparse)
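A minimal sketch of creating and querying that table through the REST service, using the third-party starbase Python package (the package choice, gateway address, and port are assumptions; any HTTP client would work):

from starbase import Connection      # pip install starbase: a thin HBase REST client

c = Connection("127.0.0.1", "8000")  # assumed HBase REST gateway host/port
ratings = c.table("ratings")

if not ratings.exists():
    ratings.create("rating")         # single column family: rating

# Row key = user ID; one sparse column per movie that user has rated
ratings.insert("1", {"rating": {"50": "5", "33": "5"}})

print(ratings.fetch("1"))            # fast point lookup of one user's ratings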
How are we doing it?
Python client → REST service → HBase → HDFS
Let’s do this
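Roughly what the demo does end to end, again with starbase. The input path and its whitespace-separated userID / movieID / rating / timestamp layout are assumptions for illustration:

from starbase import Connection

c = Connection("127.0.0.1", "8000")          # assumed HBase REST gateway
ratings = c.table("ratings")

if ratings.exists():
    ratings.drop()                           # start the demo from a clean slate
ratings.create("rating")

batch = ratings.batch()                      # buffer many updates per round trip
with open("ml-100k/u.data") as f:            # assumed ratings file
    for line in f:
        user_id, movie_id, rating, _ = line.split()
        batch.update(user_id, {"rating": {movie_id: rating}})
batch.commit(finalize=True)

# Individual users can now be pulled back quickly by row key:
print(ratings.fetch("33"))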
HBASE / PIG
INTEGRATION
Populating HBase at scale
Integrating Pig with HBase
■ Must create the HBase table ahead of time
■ Your relation must have a unique key as its first column, followed by the remaining columns in the order you want them stored in HBase
■ The USING clause lets you STORE into an HBase table (see the sketch below)
■ Can work at scale – HBase is transactional on rows
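For reference, a hedged sketch of what that STORE looks like in Pig Latin. The input file, its field layout, and the 'users' table / 'userinfo' column family names are assumptions; the HBase table must already exist with that column family:

-- Load a delimited user file; the first field becomes the HBase row key
users = LOAD '/user/maria_dev/ml-100k/u.user' USING PigStorage('|')
        AS (userID:int, age:int, gender:chararray, occupation:chararray, zip:chararray);

-- Store into the pre-created 'users' table, mapping each remaining field
-- to a column in the 'userinfo' column family
STORE users INTO 'hbase://users'
      USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
            'userinfo:age userinfo:gender userinfo:occupation userinfo:zip');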
Let’s do this