04 Sqoop

Chapter 4 focuses on importing relational data into a Hadoop cluster using Apache Sqoop. It covers the basics of Sqoop, including how to import tables, change delimiters, control imported data, and improve performance. The chapter also introduces Sqoop 2 and provides practical exercises for importing data from MySQL.

Importing Relational Data with Sqoop

Chapter 4

201509
Course Chapters

Course Introduction
  1  Introduction

Introduction to Hadoop
  2  Introduction to Hadoop and the Hadoop Ecosystem
  3  Hadoop Architecture and HDFS

Importing and Modeling Structured Data
  4  Importing Relational Data with Apache Sqoop
  5  Introduction to Impala and Hive
  6  Modeling and Managing Data with Impala and Hive
  7  Data Formats
  8  Data File Partitioning

Ingesting Streaming Data
  9  Capturing Data with Apache Flume

Distributed Data Processing with Spark
  10  Spark Basics
  11  Working with RDDs in Spark
  12  Aggregating Data with Pair RDDs
  13  Writing and Deploying Spark Applications
  14  Parallel Processing in Spark
  15  Spark RDD Persistence
  16  Common Patterns in Spark Data Processing
  17  Spark SQL and DataFrames

Course Conclusion
  18  Conclusion
© Copyright 2010-2015 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera.
Importing Relational Data with Apache Sqoop

In this chapter you will learn

§ How to import tables from an RDBMS into your Hadoop cluster
§ How to change the delimiter and file format of imported tables
§ How to control which columns and rows are imported
§ What techniques you can use to improve Sqoop's performance
§ How the next-generation version of Sqoop compares to the original
Chapter Topics

Importing Relational Data with Apache Sqoop / Importing and Modeling Structured Data

§ Sqoop Overview
§ Basic Imports and Exports
§ Limiting Results
§ Improving Sqoop's Performance
§ Sqoop 2
§ Conclusion
§ Homework: Import Data from MySQL Using Sqoop
What is Apache Sqoop?

§ Open source Apache project originally developed by Cloudera
  – The name is a contraction of "SQL-to-Hadoop"
§ Sqoop exchanges data between a database and HDFS
  – Can import all tables, a single table, or a partial table into HDFS
  – Data can be imported in a variety of formats
  – Sqoop can also export data from HDFS to a database

(Diagram: data flows in both directions between a database and a Hadoop cluster)
How Does Sqoop Work?

§ Sqoop is a client-side application that imports data using Hadoop MapReduce
§ A basic import involves three steps orchestrated by Sqoop
  1. Examine table details
  2. Create and submit job to cluster
  3. Fetch records from table and write this data to HDFS

(Diagram: the user runs Sqoop, which examines the table on the database server, submits a job to the Hadoop cluster, and the job fetches records into HDFS)
Basic Syntax

§ Sqoop is a command-line utility with several subcommands, called tools
  – There are tools for import, export, listing database contents, and more
  – Run sqoop help to see a list of all tools
  – Run sqoop help tool-name for help on using a specific tool
§ Basic syntax of a Sqoop invocation

$ sqoop tool-name [tool-options]

§ This command will list all tables in the loudacre database in MySQL

$ sqoop list-tables \
    --connect jdbc:mysql://dbhost/loudacre \
    --username dbuser \
    --password pw
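Note that passing --password on the command line exposes the password in shell history and process listings. Sqoop's -P flag, which prompts interactively, is a safer variation of the same command (shown here as a sketch; --password-file is another standard alternative):

```shell
# Prompt for the password interactively instead of passing it as an argument
$ sqoop list-tables \
    --connect jdbc:mysql://dbhost/loudacre \
    --username dbuser \
    -P
```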
Chapter Topics

Importing Relational Data with Apache Sqoop / Importing and Modeling Structured Data

§ Sqoop Overview
§ Basic Imports and Exports
§ Limiting Results
§ Improving Sqoop's Performance
§ Sqoop 2
§ Conclusion
§ Homework: Import Data from MySQL Using Sqoop
Overview of the Import Process

§ Imports are performed using Hadoop MapReduce jobs
§ Sqoop begins by examining the table to be imported
  – Determines the primary key, if possible
  – Runs a boundary query to see how many records will be imported
  – Divides the result of the boundary query by the number of tasks (mappers)
  – Uses this to configure tasks so that they will have equal loads
§ Sqoop also generates a Java source file for each table being imported
  – It compiles and uses this during the import process
  – The file remains after import, but can be safely deleted
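The split computation can be sketched as follows. This is illustrative only, not Sqoop's exact internal logic: it assumes a table with a numeric primary key id, a boundary query of the form SELECT MIN(id), MAX(id), and example bounds of 1 and 100000:

```shell
# Illustrative sketch of how a key range is divided among mappers
MIN=1; MAX=100000; MAPPERS=4
STEP=$(( (MAX - MIN + 1) / MAPPERS ))
for i in $(seq 0 $((MAPPERS - 1))); do
  LO=$(( MIN + i * STEP ))
  HI=$(( i == MAPPERS - 1 ? MAX + 1 : LO + STEP ))
  # Each mapper imports one slice of the key range
  echo "mapper $((i+1)): WHERE id >= $LO AND id < $HI"
done
# First line printed: mapper 1: WHERE id >= 1 AND id < 25001
```

Each mapper then runs its own SELECT restricted to its slice, which is why an evenly distributed key yields evenly loaded tasks.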
Importing an Entire Database with Sqoop

§ The import-all-tables tool imports an entire database
  – Stored as comma-delimited files
  – Default base location is your HDFS home directory
  – Data will be in subdirectories corresponding to the name of each table

$ sqoop import-all-tables \
    --connect jdbc:mysql://dbhost/loudacre \
    --username dbuser --password pw

§ Use the --warehouse-dir option to specify a different base directory

$ sqoop import-all-tables \
    --connect jdbc:mysql://dbhost/loudacre \
    --username dbuser --password pw \
    --warehouse-dir /loudacre
Importing a Single Table with Sqoop

§ The import tool imports a single table
§ This example imports the accounts table
  – It stores the data in HDFS as comma-delimited fields

$ sqoop import --table accounts \
    --connect jdbc:mysql://dbhost/loudacre \
    --username dbuser --password pw

§ This variation writes tab-delimited fields instead

$ sqoop import --table accounts \
    --connect jdbc:mysql://dbhost/loudacre \
    --username dbuser --password pw \
    --fields-terminated-by "\t"
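File format and compression can be changed the same way as the delimiter. A sketch using standard Sqoop import options (the exact set of supported formats and codecs depends on your Sqoop and Hadoop versions):

```shell
# Import as an Avro data file, compressed with the Snappy codec
$ sqoop import --table accounts \
    --connect jdbc:mysql://dbhost/loudacre \
    --username dbuser --password pw \
    --as-avrodatafile \
    --compress \
    --compression-codec org.apache.hadoop.io.compress.SnappyCodec
```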
Incremental Imports (1)

§ What if records have changed since the last import?
  – Could re-import all records, but this is inefficient
§ Sqoop's incremental lastmodified mode imports new and modified records
  – Based on a timestamp in a specified column
  – You must ensure timestamps are updated when records are added or changed in the database

$ sqoop import --table invoices \
    --connect jdbc:mysql://dbhost/loudacre \
    --username dbuser --password pw \
    --incremental lastmodified \
    --check-column mod_dt \
    --last-value '2015-09-30 16:00:00'
Incremental Imports (2)

§ Or use Sqoop's incremental append mode to import only new records
  – Based on the value of the last record in a specified column

$ sqoop import --table invoices \
    --connect jdbc:mysql://dbhost/loudacre \
    --username dbuser --password pw \
    --incremental append \
    --check-column id \
    --last-value 9478306
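Tracking --last-value by hand between runs is error-prone. Sqoop's saved jobs (the sqoop job tool) record the last imported value automatically in Sqoop's metastore. A minimal sketch; the job name import-invoices is illustrative:

```shell
# Create a saved job; note the lone "--" separating job options from tool options
$ sqoop job --create import-invoices \
    -- import --table invoices \
    --connect jdbc:mysql://dbhost/loudacre \
    --username dbuser --password pw \
    --incremental append \
    --check-column id \
    --last-value 0

# Each execution resumes from the previously recorded last value
$ sqoop job --exec import-invoices
```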
Exporting Data from Hadoop to RDBMS with Sqoop

§ Sqoop's import tool pulls records from an RDBMS into HDFS
§ It is sometimes necessary to push data in HDFS back to an RDBMS
  – A good solution when you must do batch processing on large data sets
  – Export results to a relational database for access by other systems
§ Sqoop supports this via the export tool
  – The RDBMS table must already exist prior to export

$ sqoop export \
    --connect jdbc:mysql://dbhost/loudacre \
    --username dbuser --password pw \
    --export-dir /loudacre/recommender_output \
    --update-mode allowinsert \
    --table product_recommendations
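Because the target table must exist before the export runs, you would create it first. The column names and types below are assumptions about the recommender output, shown only as a sketch:

```shell
# Hypothetical schema for the export target; adjust to match your exported data
$ mysql -h dbhost -u dbuser -p loudacre -e \
  "CREATE TABLE product_recommendations (
     cust_id    INT,
     product_id INT,
     score      DOUBLE
   )"
```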
Chapter Topics

Importing Relational Data with Apache Sqoop / Importing and Modeling Structured Data

§ Sqoop Overview
§ Basic Imports and Exports
§ Limiting Results
§ Improving Sqoop's Performance
§ Sqoop 2
§ Conclusion
§ Homework: Import Data from MySQL Using Sqoop
Importing Partial Tables with Sqoop

§ Import only specified columns from the accounts table

$ sqoop import --table accounts \
    --connect jdbc:mysql://dbhost/loudacre \
    --username dbuser --password pw \
    --columns "id,first_name,last_name,state"

§ Import only matching rows from the accounts table

$ sqoop import --table accounts \
    --connect jdbc:mysql://dbhost/loudacre \
    --username dbuser --password pw \
    --where "state='CA'"
Using a Free-Form Query

§ You can also import the results of a query, rather than a single table
§ Supply a complete SQL query using the --query option
  – You must add the literal WHERE $CONDITIONS token
  – Use --split-by to identify the field used to divide work among mappers
  – The --target-dir option is required for free-form queries

$ sqoop import \
    --connect jdbc:mysql://dbhost/loudacre \
    --username dbuser --password pw \
    --target-dir /data/loudacre/payable \
    --split-by accounts.id \
    --query 'SELECT accounts.id, first_name,
      last_name, bill_amount FROM accounts JOIN invoices ON
      (accounts.id = invoices.cust_id) WHERE $CONDITIONS'
Using a Free-Form Query with WHERE Criteria

§ The --where option is ignored in a free-form query
  – You must specify your criteria using AND following the WHERE clause

$ sqoop import \
    --connect jdbc:mysql://dbhost/loudacre \
    --username dbuser --password pw \
    --target-dir /data/loudacre/payable \
    --split-by accounts.id \
    --query 'SELECT accounts.id, first_name,
      last_name, bill_amount FROM accounts JOIN invoices ON
      (accounts.id = invoices.cust_id) WHERE $CONDITIONS AND
      bill_amount >= 40'
Chapter Topics

Importing Relational Data with Apache Sqoop / Importing and Modeling Structured Data

§ Sqoop Overview
§ Basic Imports and Exports
§ Limiting Results
§ Improving Sqoop's Performance
§ Sqoop 2
§ Conclusion
§ Homework: Import Data from MySQL Using Sqoop
Options for Database Connectivity

§ Generic (JDBC)
  – Compatible with nearly any database
  – Overhead imposed by JDBC can limit performance
§ Direct Mode
  – Can improve performance through use of database-specific utilities
  – Currently supports MySQL and Postgres (use the --direct option)
  – Not all Sqoop features are available in direct mode
§ Cloudera and partners offer high-performance Sqoop connectors
  – These use native database protocols rather than JDBC
  – Connectors available for Netezza, Teradata, and Oracle
  – Download these from Cloudera's Web site
  – Not open source due to licensing issues, but free to use
Controlling Parallelism

§ By default, Sqoop typically imports data using four parallel tasks (called mappers)
  – Increasing the number of tasks might improve import speed
  – Caution: each task adds load to your database server
§ You can influence the number of tasks using the -m option
  – Sqoop views this only as a hint and might not honor it

$ sqoop import --table accounts \
    --connect jdbc:mysql://dbhost/loudacre \
    --username dbuser --password pw \
    -m 8

§ Sqoop assumes all tables have an evenly-distributed numeric primary key
  – Sqoop uses this column to divide work among the tasks
  – You can use a different column with the --split-by option
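If the primary key is unevenly distributed, splitting on it leaves some mappers with far more rows than others. Pointing --split-by at a better-distributed numeric column evens the load; acct_num here is an assumed column name, used only for illustration:

```shell
# Split work on a (hypothetical) evenly-distributed numeric column
$ sqoop import --table accounts \
    --connect jdbc:mysql://dbhost/loudacre \
    --username dbuser --password pw \
    --split-by acct_num \
    -m 8
```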
Chapter Topics

Importing Relational Data with Apache Sqoop / Importing and Modeling Structured Data

§ Sqoop Overview
§ Basic Imports and Exports
§ Limiting Results
§ Improving Sqoop's Performance
§ Sqoop 2
§ Conclusion
§ Homework: Import Data from MySQL Using Sqoop
Limitations of Sqoop

§ Sqoop is stable and has been used successfully in production for years
§ However, its client-side architecture does impose some limitations
  – Requires connectivity to the RDBMS from the client (client must have JDBC drivers installed)
  – Requires connectivity to the cluster from the client
  – Requires the user to specify the RDBMS username and password
  – Difficult to integrate a CLI within external applications
§ Also tightly coupled to JDBC semantics
  – A problem for NoSQL databases

(Diagram: the Sqoop client connects directly to both the database server and the Hadoop cluster)
Sqoop 2 Architecture

§ Sqoop 2 is the next-generation version of Sqoop
  – Client-server design addresses the limitations described earlier
  – API changes also simplify development of other Sqoop connectors
§ The client requires connectivity only to the Sqoop server
  – DB connections are configured on the server by a system administrator
  – End users no longer need to possess database credentials
  – Centralized audit trail
  – Better resource management
  – The Sqoop server is accessible via CLI, REST API, and Web UI

(Diagram: the Sqoop client talks only to the Sqoop server, which in turn connects to the database server and the Hadoop cluster)
Sqoop 2 Status

§ Sqoop 2 is being actively developed
  – It began shipping (alongside Sqoop) starting in CDH 4.2
§ Sqoop 2 is not yet at feature parity with Sqoop
  – Implemented features are regarded as stable
  – Consider using Sqoop 2 unless you require a feature it lacks
§ We use Sqoop, rather than Sqoop 2, in this class
  – Primarily due to memory constraints in the VM
Chapter Topics

Importing Relational Data with Apache Sqoop / Importing and Modeling Structured Data

§ Sqoop Overview
§ Basic Imports and Exports
§ Limiting Results
§ Improving Sqoop's Performance
§ Sqoop 2
§ Conclusion
§ Homework: Import Data from MySQL Using Sqoop
Essential Points

§ Sqoop exchanges data between a database and the Hadoop cluster
  – Provides subcommands (tools) for importing, exporting, and more
§ Tables are imported using MapReduce jobs
  – These are written as comma-delimited text by default
  – You can specify alternate delimiters or file formats
  – Uncompressed by default, but you can specify a codec to use
§ Sqoop provides many options to control imports
  – You can select only certain columns or limit rows
  – Supports using joins in free-form queries
§ Sqoop 2 is the next-generation version of Sqoop
  – Client-server design improves administration and resource management
Bibliography

The following offer more information on topics discussed in this chapter
§ Sqoop User Guide
  – http://tiny.cloudera.com/sqoopuserguide
§ Apache Sqoop Cookbook (published by O'Reilly)
  – http://tiny.cloudera.com/sqoopcookbook
§ A New Generation of Data Transfer Tools for Hadoop: Sqoop 2
  – http://tiny.cloudera.com/adcc05c
Chapter Topics

Importing Relational Data with Apache Sqoop / Importing and Modeling Structured Data

§ Sqoop Overview
§ Basic Imports and Exports
§ Limiting Results
§ Improving Sqoop's Performance
§ Sqoop 2
§ Conclusion
§ Homework: Import Data from MySQL Using Sqoop
Homework: Import Data from MySQL Using Sqoop

§ In this homework, you will
  – Use Sqoop to import web page and customer account data from an RDBMS to HDFS
  – Perform incremental imports of new and updated account data
§ Please refer to the Homework description
