Importing Relational Data with Sqoop
Chapter 4
201509
Course Chapters

Introduction
  1  Course Introduction
Introduction to Hadoop and the Hadoop Ecosystem
  2  Introduction to Hadoop
  3  Hadoop Architecture and HDFS
Importing and Modeling Structured Data
  4  Importing Relational Data with Apache Sqoop
  5  Introduction to Impala and Hive
  6  Modeling and Managing Data with Impala and Hive
  7  Data Formats
  8  Data File Partitioning
Ingesting Streaming Data
  9  Capturing Data with Apache Flume
Distributed Data Processing with Spark
  10  Spark Basics
  11  Working with RDDs in Spark
  12  Aggregating Data with Pair RDDs
  13  Writing and Deploying Spark Applications
  14  Parallel Processing in Spark
  15  Spark RDD Persistence
  16  Common Patterns in Spark Data Processing
  17  Spark SQL and DataFrames
Conclusion
  18  Course Conclusion
© Copyright 2010-2015 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera.
Importing Relational Data with Apache Sqoop

In this chapter you will learn
§ How to import tables from an RDBMS into your Hadoop cluster
§ How to change the delimiter and file format of imported tables
§ How to control which columns and rows are imported
§ What techniques you can use to improve Sqoop's performance
§ How the next-generation version of Sqoop compares to the original
Chapter Topics

Course section: Importing and Modeling Structured Data
Chapter: Importing Relational Data with Apache Sqoop

§ Sqoop Overview (current section)
§ Basic Imports and Exports
§ Limiting Results
§ Improving Sqoop's Performance
§ Sqoop 2
§ Conclusion
§ Homework: Import Data from MySQL Using Sqoop
What is Apache Sqoop?

§ Open source Apache project originally developed by Cloudera
  – The name is a contraction of "SQL-to-Hadoop"
§ Sqoop exchanges data between a database and HDFS
  – Can import all tables, a single table, or a partial table into HDFS
  – Data can be imported in a variety of formats
  – Sqoop can also export data from HDFS to a database

(Diagram: data flows in both directions between a database and the Hadoop cluster)
How Does Sqoop Work?

§ Sqoop is a client-side application that imports data using Hadoop MapReduce
§ A basic import involves three steps, orchestrated by Sqoop
  1. Examine table details
  2. Create and submit job to cluster
  3. Fetch records from table and write this data to HDFS

(Diagram: the user runs Sqoop, which examines the table on the database server, submits a job to the Hadoop cluster, and the job writes the fetched records to HDFS)
Basic Syntax

§ Sqoop is a command-line utility with several subcommands, called tools
  – There are tools for import, export, listing database contents, and more
  – Run sqoop help to see a list of all tools
  – Run sqoop help tool-name for help on using a specific tool
§ Basic syntax of a Sqoop invocation

$ sqoop tool-name [tool-options]

§ This command will list all tables in the loudacre database in MySQL

$ sqoop list-tables \
    --connect jdbc:mysql://dbhost/loudacre \
    --username dbuser \
    --password pw
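The tool-name [tool-options] pattern composes mechanically, which can be handy when scripting many imports. A minimal Python sketch (the sqoop_command helper and its keyword-to-flag convention are illustrative, not part of Sqoop):

```python
def sqoop_command(tool, **options):
    """Compose a Sqoop invocation as an argument list:
    sqoop tool-name [tool-options]."""
    args = ["sqoop", tool]
    for name, value in options.items():
        # keyword foo_bar becomes the long option --foo-bar
        args.append("--" + name.replace("_", "-"))
        if value is not True:  # boolean flags take no value
            args.append(str(value))
    return args

cmd = sqoop_command(
    "list-tables",
    connect="jdbc:mysql://dbhost/loudacre",
    username="dbuser",
    password="pw",
)
print(" ".join(cmd))
# sqoop list-tables --connect jdbc:mysql://dbhost/loudacre --username dbuser --password pw
```

The resulting list could be handed to a process runner; the point here is only the invocation shape.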
Chapter Topics

Course section: Importing and Modeling Structured Data
Chapter: Importing Relational Data with Apache Sqoop

§ Sqoop Overview
§ Basic Imports and Exports (current section)
§ Limiting Results
§ Improving Sqoop's Performance
§ Sqoop 2
§ Conclusion
§ Homework: Import Data from MySQL Using Sqoop
Overview of the Import Process

§ Imports are performed using Hadoop MapReduce jobs
§ Sqoop begins by examining the table to be imported
  – Determines the primary key, if possible
  – Runs a boundary query to see how many records will be imported
  – Divides the result of the boundary query by the number of tasks (mappers)
  – Uses this to configure the tasks so that they will have equal loads
§ Sqoop also generates a Java source file for each table being imported
  – It compiles and uses this during the import process
  – The file remains after import, but can be safely deleted
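The division of the boundary-query result among mappers can be sketched roughly as follows. This is a simplification for intuition only; Sqoop's real splitter also handles non-numeric types, NULLs, and skewed ranges:

```python
def compute_splits(min_val, max_val, num_mappers):
    """Divide the [min_val, max_val] range of the split column into one
    contiguous range per mapper, mimicking how Sqoop turns a boundary
    query (SELECT MIN(col), MAX(col) ...) into per-task WHERE clauses.
    Each range is [lo, hi); the last range is closed at max_val."""
    step = (max_val - min_val) // num_mappers
    splits = []
    lo = min_val
    for i in range(num_mappers):
        hi = max_val if i == num_mappers - 1 else lo + step
        splits.append((lo, hi))
        lo = hi
    return splits

# e.g. the boundary query returned MIN(id)=1, MAX(id)=100, with 4 mappers
print(compute_splits(1, 100, 4))
# [(1, 25), (25, 49), (49, 73), (73, 100)]
```

Each tuple would become one mapper's WHERE clause on the split column, which is why an evenly distributed key yields evenly loaded tasks.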
Importing an Entire Database with Sqoop

§ The import-all-tables tool imports an entire database
  – Stored as comma-delimited files
  – Default base location is your HDFS home directory
  – Data will be in subdirectories corresponding to the name of each table

$ sqoop import-all-tables \
    --connect jdbc:mysql://dbhost/loudacre \
    --username dbuser --password pw

§ Use the --warehouse-dir option to specify a different base directory

$ sqoop import-all-tables \
    --connect jdbc:mysql://dbhost/loudacre \
    --username dbuser --password pw \
    --warehouse-dir /loudacre
Importing a Single Table with Sqoop

§ The import tool imports a single table
§ This example imports the accounts table
  – It stores the data in HDFS as comma-delimited fields

$ sqoop import --table accounts \
    --connect jdbc:mysql://dbhost/loudacre \
    --username dbuser --password pw

§ This variation writes tab-delimited fields instead

$ sqoop import --table accounts \
    --connect jdbc:mysql://dbhost/loudacre \
    --username dbuser --password pw \
    --fields-terminated-by "\t"
Incremental Imports (1)

§ What if records have changed since the last import?
  – Could re-import all records, but this is inefficient
§ Sqoop's incremental lastmodified mode imports new and modified records
  – Based on a timestamp in a specified column
  – You must ensure timestamps are updated when records are added or changed in the database

$ sqoop import --table invoices \
    --connect jdbc:mysql://dbhost/loudacre \
    --username dbuser --password pw \
    --incremental lastmodified \
    --check-column mod_dt \
    --last-value '2015-09-30 16:00:00'
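The selection logic behind lastmodified mode is just a timestamp comparison against the check column. A toy sketch with in-memory rows (the data and helper name are illustrative; Sqoop expresses this as a WHERE clause in the generated query, and the exact boundary handling is Sqoop's concern):

```python
from datetime import datetime

# toy rows standing in for the invoices table (illustrative data)
rows = [
    {"id": 1, "mod_dt": datetime(2015, 9, 30, 12, 0)},
    {"id": 2, "mod_dt": datetime(2015, 9, 30, 17, 30)},
    {"id": 3, "mod_dt": datetime(2015, 10, 1, 9, 15)},
]

def incremental_lastmodified(rows, check_column, last_value):
    """Select rows modified more recently than last_value, as Sqoop's
    lastmodified mode does via the --check-column timestamp."""
    return [r for r in rows if r[check_column] > last_value]

changed = incremental_lastmodified(rows, "mod_dt",
                                   datetime(2015, 9, 30, 16, 0))
print([r["id"] for r in changed])  # [2, 3]
```

Row 1 is skipped because its timestamp predates the supplied --last-value; rows 2 and 3 are picked up.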
Incremental Imports (2)

§ Or use Sqoop's incremental append mode to import only new records
  – Based on the value of the last record in a specified column

$ sqoop import --table invoices \
    --connect jdbc:mysql://dbhost/loudacre \
    --username dbuser --password pw \
    --incremental append \
    --check-column id \
    --last-value 9478306
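Append mode is the same idea with a monotonically growing column: take rows above the previous maximum, then remember the new maximum for the next run. A minimal sketch (helper name and data are illustrative):

```python
def incremental_append(rows, check_column, last_value):
    """Select only rows whose check column exceeds the previously
    imported maximum, as Sqoop's append mode does, and return the
    new maximum to use as --last-value on the next run."""
    new_rows = [r for r in rows if r[check_column] > last_value]
    next_last = max((r[check_column] for r in new_rows), default=last_value)
    return new_rows, next_last

rows = [{"id": 9478306}, {"id": 9478307}, {"id": 9478310}]
new_rows, next_last = incremental_append(rows, "id", 9478306)
print([r["id"] for r in new_rows], next_last)
# [9478307, 9478310] 9478310
```

Saving next_last between runs is what makes repeated imports pick up only records added since the previous one.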
Exporting Data from Hadoop to RDBMS with Sqoop

§ Sqoop's import tool pulls records from an RDBMS into HDFS
§ It is sometimes necessary to push data in HDFS back to an RDBMS
  – A good solution when you must do batch processing on large data sets
  – Export results to a relational database for access by other systems
§ Sqoop supports this via the export tool
  – The RDBMS table must already exist prior to export

$ sqoop export \
    --connect jdbc:mysql://dbhost/loudacre \
    --username dbuser --password pw \
    --export-dir /loudacre/recommender_output \
    --update-mode allowinsert \
    --table product_recommendations
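Conceptually, export walks the delimited files under --export-dir and turns each record into a row-insertion statement against the target table. A rough sketch of that translation (the statement shape is illustrative only; real Sqoop batches and parameterizes its statements, and allowinsert mode falls back to updates for existing keys):

```python
def export_line(line, table, columns):
    """Turn one comma-delimited HDFS record into an INSERT statement,
    roughly what Sqoop's export tool does for each row it reads."""
    values = ", ".join("'%s'" % v for v in line.strip().split(","))
    return "INSERT INTO %s (%s) VALUES (%s)" % (
        table, ", ".join(columns), values)

# hypothetical record from /loudacre/recommender_output and column names
stmt = export_line("1001,product_42,0.97",
                   "product_recommendations",
                   ["account_id", "product", "score"])
print(stmt)
# INSERT INTO product_recommendations (account_id, product, score) VALUES ('1001', 'product_42', '0.97')
```

This also makes clear why the target table must exist first: the generated statements name its columns.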
Chapter Topics

Course section: Importing and Modeling Structured Data
Chapter: Importing Relational Data with Apache Sqoop

§ Sqoop Overview
§ Basic Imports and Exports
§ Limiting Results (current section)
§ Improving Sqoop's Performance
§ Sqoop 2
§ Conclusion
§ Homework: Import Data from MySQL Using Sqoop
Importing Partial Tables with Sqoop

§ Import only specified columns from the accounts table

$ sqoop import --table accounts \
    --connect jdbc:mysql://dbhost/loudacre \
    --username dbuser --password pw \
    --columns "id,first_name,last_name,state"

§ Import only matching rows from the accounts table

$ sqoop import --table accounts \
    --connect jdbc:mysql://dbhost/loudacre \
    --username dbuser --password pw \
    --where "state='CA'"
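Both options simply narrow the SELECT that the import issues against the table. A sketch of that narrowing (the build_select helper approximates the query shape; Sqoop additionally appends its split-range conditions):

```python
def build_select(table, columns=None, where=None):
    """Approximate the SELECT statement an import issues for a table,
    narrowed by the --columns and --where options."""
    cols = ", ".join(columns) if columns else "*"
    sql = "SELECT %s FROM %s" % (cols, table)
    if where:
        sql += " WHERE %s" % where
    return sql

print(build_select("accounts",
                   columns=["id", "first_name", "last_name", "state"]))
# SELECT id, first_name, last_name, state FROM accounts
print(build_select("accounts", where="state='CA'"))
# SELECT * FROM accounts WHERE state='CA'
```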
Using a Free-Form Query

§ You can also import the results of a query, rather than a single table
§ Supply a complete SQL query using the --query option
  – You must add the literal WHERE $CONDITIONS token
  – Use --split-by to identify the field used to divide work among mappers
  – The --target-dir option is required for free-form queries

$ sqoop import \
    --connect jdbc:mysql://dbhost/loudacre \
    --username dbuser --password pw \
    --target-dir /data/loudacre/payable \
    --split-by accounts.id \
    --query 'SELECT accounts.id, first_name,
    last_name, bill_amount FROM accounts JOIN invoices ON
    (accounts.id = invoices.cust_id) WHERE $CONDITIONS'
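The $CONDITIONS token is the hook where each mapper inserts its own split-range predicate, which is why the literal token is mandatory. Roughly (a sketch; the helper name and range predicates are illustrative of the mechanism, not Sqoop's exact generated SQL):

```python
def per_mapper_queries(query, split_by, splits):
    """Substitute the $CONDITIONS token with each mapper's range on the
    --split-by column, roughly as free-form query imports do."""
    queries = []
    for lo, hi in splits:
        cond = "%s >= %d AND %s < %d" % (split_by, lo, split_by, hi)
        queries.append(query.replace("$CONDITIONS", cond))
    return queries

q = "SELECT id, bill_amount FROM invoices WHERE $CONDITIONS"
for sql in per_mapper_queries(q, "id", [(1, 50), (50, 100)]):
    print(sql)
# SELECT id, bill_amount FROM invoices WHERE id >= 1 AND id < 50
# SELECT id, bill_amount FROM invoices WHERE id >= 50 AND id < 100
```

Each mapper then runs its own variant of the query, fetching a disjoint slice of the result set.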
Using a Free-Form Query with WHERE Criteria

§ The --where option is ignored in a free-form query
  – You must specify your criteria using AND following the WHERE clause

$ sqoop import \
    --connect jdbc:mysql://dbhost/loudacre \
    --username dbuser --password pw \
    --target-dir /data/loudacre/payable \
    --split-by accounts.id \
    --query 'SELECT accounts.id, first_name,
    last_name, bill_amount FROM accounts JOIN invoices ON
    (accounts.id = invoices.cust_id) WHERE $CONDITIONS AND
    bill_amount >= 40'
Chapter Topics

Course section: Importing and Modeling Structured Data
Chapter: Importing Relational Data with Apache Sqoop

§ Sqoop Overview
§ Basic Imports and Exports
§ Limiting Results
§ Improving Sqoop's Performance (current section)
§ Sqoop 2
§ Conclusion
§ Homework: Import Data from MySQL Using Sqoop
Options for Database Connectivity

§ Generic (JDBC)
  – Compatible with nearly any database
  – Overhead imposed by JDBC can limit performance
§ Direct mode
  – Can improve performance through the use of database-specific utilities
  – Currently supports MySQL and Postgres (use the --direct option)
  – Not all Sqoop features are available in direct mode
§ Cloudera and partners offer high-performance Sqoop connectors
  – These use native database protocols rather than JDBC
  – Connectors available for Netezza, Teradata, and Oracle
  – Download these from Cloudera's Web site
  – Not open source due to licensing issues, but free to use
Controlling Parallelism

§ By default, Sqoop typically imports data using four parallel tasks (called mappers)
  – Increasing the number of tasks might improve import speed
  – Caution: each task adds load to your database server
§ You can influence the number of tasks using the -m option
  – Sqoop views this only as a hint and might not honor it

$ sqoop import --table accounts \
    --connect jdbc:mysql://dbhost/loudacre \
    --username dbuser --password pw \
    -m 8

§ Sqoop assumes all tables have an evenly-distributed numeric primary key
  – Sqoop uses this column to divide work among the tasks
  – You can use a different column with the --split-by option
Chapter Topics

Course section: Importing and Modeling Structured Data
Chapter: Importing Relational Data with Apache Sqoop

§ Sqoop Overview
§ Basic Imports and Exports
§ Limiting Results
§ Improving Sqoop's Performance
§ Sqoop 2 (current section)
§ Conclusion
§ Homework: Import Data from MySQL Using Sqoop
Limitations of Sqoop

§ Sqoop is stable and has been used successfully in production for years
§ However, its client-side architecture does impose some limitations
  – Requires connectivity to the RDBMS from the client (the client must have JDBC drivers installed)
  – Requires connectivity to the cluster from the client
  – Requires the user to specify the RDBMS username and password
  – Difficult to integrate a CLI within external applications
§ Also tightly coupled to JDBC semantics
  – A problem for NoSQL databases

(Diagram: the Sqoop client connects directly to both the database server and the Hadoop cluster)
Sqoop 2 Architecture

§ Sqoop 2 is the next-generation version of Sqoop
  – Client-server design addresses the limitations described earlier
  – API changes also simplify development of other Sqoop connectors
§ The client requires connectivity only to the Sqoop server
  – DB connections are configured on the server by a system administrator
  – End users no longer need to possess database credentials
  – Centralized audit trail
  – Better resource management
  – The Sqoop server is accessible via CLI, REST API, and Web UI

(Diagram: the user's Sqoop client talks only to the Sqoop server, which in turn connects to the database server and the Hadoop cluster)
Sqoop 2 Status

§ Sqoop 2 is being actively developed
  – It began shipping (alongside Sqoop) starting in CDH 4.2
§ Sqoop 2 is not yet at feature parity with Sqoop
  – Implemented features are regarded as stable
  – Consider using Sqoop 2 unless you require a feature it lacks
§ We use Sqoop, rather than Sqoop 2, in this class
  – Primarily due to memory constraints in the VM
Chapter Topics

Course section: Importing and Modeling Structured Data
Chapter: Importing Relational Data with Apache Sqoop

§ Sqoop Overview
§ Basic Imports and Exports
§ Limiting Results
§ Improving Sqoop's Performance
§ Sqoop 2
§ Conclusion (current section)
§ Homework: Import Data from MySQL Using Sqoop
Essential Points

§ Sqoop exchanges data between a database and the Hadoop cluster
  – Provides subcommands (tools) for importing, exporting, and more
§ Tables are imported using MapReduce jobs
  – These are written as comma-delimited text by default
  – You can specify alternate delimiters or file formats
  – Uncompressed by default, but you can specify a codec to use
§ Sqoop provides many options to control imports
  – You can select only certain columns or limit rows
  – Supports using joins in free-form queries
§ Sqoop 2 is the next-generation version of Sqoop
  – Client-server design improves administration and resource management
Bibliography

The following offer more information on topics discussed in this chapter
§ Sqoop User Guide
  – http://tiny.cloudera.com/sqoopuserguide
§ Apache Sqoop Cookbook (published by O'Reilly)
  – http://tiny.cloudera.com/sqoopcookbook
§ A New Generation of Data Transfer Tools for Hadoop: Sqoop 2
  – http://tiny.cloudera.com/adcc05c
Chapter Topics

Course section: Importing and Modeling Structured Data
Chapter: Importing Relational Data with Apache Sqoop

§ Sqoop Overview
§ Basic Imports and Exports
§ Limiting Results
§ Improving Sqoop's Performance
§ Sqoop 2
§ Conclusion
§ Homework: Import Data from MySQL Using Sqoop (current section)
Homework: Import Data from MySQL Using Sqoop

§ In this homework, you will
  – Use Sqoop to import web page and customer account data from an RDBMS to HDFS
  – Perform incremental imports of new and updated account data
§ Please refer to the Homework description