Apache Pig
What is Apache Pig?
 An abstraction over MapReduce
 Used to analyze large data sets by representing them as data flows
 Performs all the data manipulation operations in Hadoop
 Provides a high-level language known as Pig Latin
 Programmers can develop their own functions for reading, writing, and
  processing data
 Scripts are internally converted to Map and Reduce tasks by the Pig
  Engine
Why Do We Need Apache Pig?
 Programmers can perform MapReduce tasks easily without having to write
  complex Java code
 Uses a multi-query approach, thereby reducing code length
 SQL-like language
 Provides many built-in operators and Data Types
Features of Pig
Rich set of operators
It provides many operators to perform operations like join,
   sort, filter, etc.
Ease of programming
Pig Latin is similar to SQL and it is easy to write a Pig script if
  you are good at SQL.
Optimization opportunities
Apache Pig optimizes execution automatically, so
  programmers need to focus only on the semantics of the
  language.
Extensibility
Using the existing operators, users can develop their own
  functions to read, process, and write data.
User-defined Functions
Users can create their own functions (e.g. in Java) and
  invoke or embed them in Pig scripts
Handles all kinds of data
Structured as well as unstructured.
Apache Pig Vs MapReduce
   Apache Pig                          MapReduce
 Data flow language                  Data processing paradigm
 High-level language                 Low-level and rigid
 Performing a join operation is      Performing a join operation
  simple                               between datasets is difficult
 Knowledge of SQL is sufficient      Exposure to Java is mandatory
 No need for compilation; every      Has a long compilation process
  Apache Pig operator is converted
  internally into a MapReduce job
Apache Pig Vs SQL
   Pig                                  SQL
 Procedural language                  Declarative language
 Schema is optional                   Schema is mandatory
 Limited opportunity for query        More opportunity for query
  optimization                          optimization
 Allows splits in the pipeline
 Allows developers to store data
  anywhere in the pipeline
 Provides operators to perform ETL
  (Extract, Transform, and Load)
  functions
Apache Pig Vs Hive
   Pig                                  Hive
 Language: Pig Latin                  Language: HiveQL
 Created at Yahoo                     Created at Facebook
 Data flow language                   Query processing language
 A procedural language that fits      A declarative language
  the pipeline paradigm
 Handles structured, unstructured,    Used mostly for structured data
  and semi-structured data
Applications of Apache Pig
 Tasks involving ad-hoc processing and quick prototyping
    To process huge data sources such as web logs
    To perform data processing for search platforms
    To process time-sensitive data loads
History
 2006 – Developed as a research project at Yahoo
 2007 – Open sourced via the Apache Incubator
 2008 – The first release of Apache Pig came
 2010 – Graduated as an Apache top-level project
Architecture
Parser
 Checks the syntax of the script and does type checking
 The output of the parser is a DAG (directed acyclic graph)
  representing the Pig Latin statements and logical operators
Optimizer
 The logical plan (DAG) is passed to the logical optimizer,
  which carries out logical optimizations such as
  projection pushdown.
Compiler and Execution engine
 The compiler compiles the optimized logical plan into a
  series of MapReduce jobs.
 Finally, the MapReduce jobs are submitted to Hadoop in
  sorted order for execution to produce the desired results.
Data Model
 Atom
    Any single value, irrespective of its data type, is known as an Atom.
    It is stored as a string and can be used as a string or a number.
    int, long, float, double, chararray, and bytearray are the atomic
     values of Pig.
    A piece of data or a simple atomic value is known as a field.
    Example: 'raja' or '30'
 Tuple
    A record formed by an ordered set of fields is known as a tuple;
     the fields can be of any type.
    A tuple is similar to a row in a table of an RDBMS.
    Example: (Raja, 30)
 Bag
    An unordered set (collection) of tuples.
    Each tuple can have any number of fields (flexible schema).
    Represented by '{}'.
    Similar to a table in an RDBMS, but it is not necessary that the
     tuples contain the same number of fields or that fields in the
     same position are of the same type.
    Example: {(Raja, 30), (Mohammad, 45)}
    A bag can be a field in a relation; it is then known as an inner bag.
    Example: {Raja, 30, {9848022338, raja@gmail.com}}
 Relation
    A bag of tuples.
    Unordered – there is no guarantee that tuples are processed in any
     particular order.
 Map
    A map (or data map) is a set of key-value pairs.
    The key needs to be of type chararray and should be unique.
    The value can be of any type. It is represented by '[]'.
    Example: [name#Raja, age#30]
Execution Modes
 Local Mode
    Runs on your local host using the local file system.
    Used for testing purposes.
 MapReduce Mode
    Loads or processes data that exists in the Hadoop File
     System (HDFS).
    A MapReduce job is invoked in the back end to perform a
     particular operation on the data.
Execution Mechanisms
Interactive Mode (Grunt shell)
Batch Mode (Script)
Embedded Mode (UDF)
  Defining our own functions (User Defined Functions) in
   programming languages such as Java, and using
   them in our script.
Invoking the Grunt Shell
$ ./pig -x local
$ ./pig -x mapreduce
  Either of these commands gives you the Grunt shell
   prompt as shown below.
  grunt>
You can exit the Grunt shell using ‘ctrl + d’.
Batch Mode
Write an entire Pig Latin script in a file and
 execute it using the -x option.
  $ pig -x local Sample_script.pig
  $ pig -x mapreduce Sample_script.pig
Shell & Utility commands
sh Command
  Invoke any shell commands
  grunt> sh shell_command parameters
  grunt> sh ls
     pig
     pig_1444799121955.log
     pig.cmd
     pig.py
fs Command
 Invoke any Hadoop File system Shell commands
 grunt> fs file_system_command parameters
 grunt> fs -ls
    Found 3 items
    drwxrwxrwx - Hadoop supergroup 0 2015-09-08 14:13 Hbase
    drwxr-xr-x - Hadoop supergroup 0 2015-09-09 14:52 seqgen_data
    drwxr-xr-x - Hadoop supergroup 0 2015-09-08 11:30 twitter_data
Utility Commands
 clear : clear the screen
    grunt> clear
 help : Provides help about the commands.
 history : Displays a list of statements executed/used so
  far since the Grunt shell was invoked.
 set : Used to show/assign values to keys used in Pig.
 quit : You can quit from the Grunt shell.
 exec/run: Can execute Pig scripts
    grunt> exec [-param param_name = param_value] [-param_file
     file_name] script
 kill : Kills a job from the Grunt shell: grunt> kill JobId
Pig Latin
 A relation is the outermost structure of the Pig Latin data model.
  It is a bag, where –
    A bag is a collection of tuples.
    A tuple is an ordered set of fields.
    A field is a piece of data.
 Processing Data:
    Statements are the basic constructs
    Statements work with relations
    Statements include operators, expressions and schemas
    Statements take a relation as input and produce another relation as
     output (except LOAD and STORE)

     student_data = LOAD 'student_data.txt' USING PigStorage(',')
         AS (id:int, firstname:chararray, lastname:chararray,
             phone:chararray, city:chararray);
 Values of all data types can be NULL; Pig treats null values in a
  similar way as SQL does
Operators
Category            Operators                           Example
Arithmetic          +, -, *, /, %                       b = (a == 1 ? 20 : 30);
                    ?: (bincond operator)
                    CASE WHEN THEN ELSE END             CASE f2 % 2
                                                          WHEN 0 THEN 'even'
                                                          WHEN 1 THEN 'odd'
                                                        END
Comparison          ==, !=, >, <, >=, <=                f1 matches '.*tutorial.*'
                    matches (pattern matching)
Type construction   Tuple construction operator: ()     (Raju, 30)
                    Bag construction operator: {}       {(Raju, 30), (Mohammad, 45)}
                    Map construction operator: []       [name#Raja, age#30]
Relational operators
Preparing Data
 In MapReduce mode, Pig reads (loads) data from HDFS and stores the
  results back in HDFS. Therefore, let us start HDFS and create the following
  sample data in HDFS.
  Load Operator
    The load statement consists of two parts divided by the "=" operator.
    On the left-hand side, we mention the name of the relation where
     we want to store the data, and on the right-hand side, we define
     how we load the data.
   Given below is the syntax of the Load operator.
        Relation_name = LOAD 'Input file path' USING function AS schema;
    Component                                    Description
Relation_name          The relation in which we want to store the data.
Input file path        Mention the HDFS directory where the file is stored
function               A function from the set of load functions provided by
                       Apache Pig (BinStorage, JsonLoader, PigStorage,
                       TextLoader).
schema                 Define the schema of the data
 We can define the required schema as follows
    (column1 : data type, column2 : data type, column3 : data type);
 Note: We can also load the data without specifying the schema.
  In that case, the columns will be addressed as $0, $1,
  etc.
 grunt> student = LOAD
  'hdfs://localhost:9000/pig_data/student_data.txt' USING
  PigStorage(',') AS (id:int, firstname:chararray,
  lastname:chararray, phone:chararray, city:chararray);
 The PigStorage() function:
    It loads and stores data as structured text files. It takes, as a
     parameter, the delimiter by which each entity of a tuple is
     separated. By default, the delimiter is '\t' (tab).
Store operator
 STORE Relation_name INTO 'required_directory_path'
  [USING function];
 Ex:
    grunt> STORE student INTO 'hdfs://localhost:9000/pig_Output/'
     USING PigStorage(',');
Diagnostic Operators
 Dump Operator
   The Dump operator is used to run the Pig Latin statements and
    display the results on the screen. It is generally used for
    debugging purposes.
 grunt> Dump Relation_Name;
  Ex: Dump student;
   Once you execute the above Pig Latin statement, it will start a
    MapReduce job to read data from HDFS.
 Describe: Used to view the schema of a relation
 grunt> Describe Relation_name;
    Ex: grunt> describe student;
   Output: student: {id: int, firstname: chararray, lastname: chararray,
   phone: chararray, city: chararray}
 Explain: Used to display the logical, physical, and MapReduce
  execution plans of a relation.
 grunt> explain Relation_name;
    Ex: grunt> explain student;
 Illustrate: Gives you the step-by-step execution of a sequence of
  statements
 grunt> illustrate Relation_name;
    grunt> illustrate student;
Group Operator
 The group operator is used to group the data in one or more relations. It
  collects the data having the same key.
 Group_data = GROUP Relation_name BY age;
    grunt> group_data = GROUP student_details by age;
    grunt> Dump group_data;
    The output contains two columns: one is age, by which we have grouped
     the relation, and the other is a bag, which contains the group of
     tuples (student records) with the respective age.
 You can see the schema of the table after grouping the data using the
  describe command as shown below.
Cogroup Operator
 The group operator is normally used with one relation, while the cogroup
  operator is used in statements involving two or more relations.
The cogroup operator groups the tuples from
 each relation according to age, where each
 group depicts a particular age value.
For example, if we consider the 1st tuple of the
 result, it is grouped by age 21. And it contains
 two bags –
   the first bag holds all the tuples from the first relation
    (student_details in this case) having age 21, and
   the second bag contains all the tuples from the
    second relation (employee_details in this case)
    having age 21.
   In case a relation doesn't have tuples with the
    age value 21, it returns an empty bag.
Join Operator
 The join operator is used to combine records from two or more relations.
 While performing a join operation, we declare one (or a group of) field(s)
  from each relation as keys.
 When these keys match, the two particular tuples are matched; otherwise,
  the records are dropped.
 Joins can be of the following types:
     Self-join
     Inner-join
     Outer-join : left join, right join, and full join
 Self-join is used to join a table with itself, as if the table
  were two relations, temporarily renaming at least one
  relation.
    Generally, in Apache Pig, to perform a self-join we
     load the same data multiple times, under different
     aliases (names), as in the sketch below.
Outer Join
 Returns all the rows from at least one of the relations. An outer join
  operation is carried out in three ways – left, right, and full.
 A left outer join returns all rows from the left relation, even if there
  are no matches in the right relation.
 A right outer join returns all rows from the right relation, even if there
  are no matches in the left relation.
 A full outer join returns rows when there is a match in either of the
  relations.
Cross Operator
 Computes the cross-product of two or more relations.
Combining and Splitting
 Union Operator :
    The UNION operator of Pig Latin is used to merge the content of two relations.
    To perform UNION operation on two relations, their columns and domains must
     be identical.
 Split : Used to split a relation into two or more relations.
Filter Operator
 Used to select the required tuples from a relation based on a condition.
Distinct Operator
 Used to remove redundant (duplicate) tuples from a relation
Foreach Operator
 Used to generate specified data transformations based on the column
  data.
Order By
 Used to display the contents of a relation in a sorted order based on one or
  more fields.
Limit Operator
 Used to get a limited number of tuples from a relation
Built-in Functions – EVAL Functions
 AVG: Used to compute the average of the numeric values within a bag,
  ignoring NULL values (see the sketch at the end of this list).
     To get the global average value, we need to perform a Group All
      operation and calculate the average value using the AVG function.
     To get the average value of a group, we group the relation using the
      Group By operator and then apply the AVG function.
 Max - Used to calculate the highest value for a column (numeric values or
  chararrays) in a single-column bag and ignores the NULL values.
 COUNT:
     Used to get the number of elements in a bag.
     While counting the number of tuples in a bag, the COUNT() function
      ignores (does not count) tuples having a NULL value in the first field.
 COUNT_STAR:
     Similar to COUNT(), but it includes the NULL values.
 Sum: to get the total of the numeric values of a column in a single-column
  bag and ignores the null values.
 DIFF:
    Used to compare two bags (fields) in a tuple.
    It takes two fields of a tuple as input and matches them.
    If they match, it returns an empty bag.
    If they do not match, it finds the elements that exist in one field (bag)
     but not in the other, and returns these elements wrapped in a bag.
    Generally, the DIFF() function compares two bags in a tuple.
 SUBTRACT :
    Used to subtract two bags.
    It takes two bags as inputs and returns a bag which contains the tuples of the first
     bag that are not in the second bag.
 IsEmpty : Used to check if a bag or map is empty.
 Size : Used to compute the number of elements based on any Pig data
  type.
 BagToString :
    Used to concatenate the elements of a bag into a string.
    While concatenating, we can place a delimiter between these values (optional).
 Concat : Used to concatenate two or more expressions of the same type.
 Tokenize :
    Used to split a string (which contains a group of words) in a single tuple and
     return a bag which contains the output of the split operation.
    As a delimiter to the TOKENIZE function, we can pass space [ ], double
     quote [" "], comma [ , ], parentheses [ () ], or star [ * ].
 Word Count Example:
    lines = LOAD 'data' AS (line:chararray);
    words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    grouped = GROUP words BY word;
    wordcount = FOREACH grouped GENERATE group, COUNT(words);
    DUMP wordcount;
Load and Store functions
 Used to determine how the data goes into and comes out of Pig.
 TextLoader() loads unstructured data; it cannot be used for the store
  operation.
 BinStorage() in Pig is generally used to load and store temporary data
  generated between MapReduce jobs.
 Handling Compression: compressed files can be read using the PigStorage
  and TextLoader functions.
Bag and Tuple Functions
 TOBAG :
    Converts one or more expressions to individual tuples. And these tuples are
     placed in a bag.
 TOP : Used to get the top N tuples of a bag.
      To this function, as inputs, we pass a relation, the number of tuples
       we want, and the column whose values are being compared.
      This function returns a bag containing the required tuples.
 TOTUPLE : Used to convert one or more expressions to the data type tuple.
 TOMAP : Used to convert the key-value pairs into a Map.
String Functions
   Operator                                Description
ENDSWITH           ENDSWITH(string, testAgainst)
                   To verify whether a given string ends with a particular
                   substring
STARTSWITH         STARTSWITH(string, substring)
                   Accepts two string parameters and verifies whether the
                   first string starts with the second.
SUBSTRING          SUBSTRING(string, startIndex, stopIndex)
                   Returns a substring from a given string.
EqualsIgnoreCase   EqualsIgnoreCase(string1, string2)
                   To compare two strings, ignoring case.
INDEXOF            INDEXOF(string, 'character', startIndex)
                   Returns the first occurrence of a character in a string,
                   searching forward from a start index.
    Operator                                    Description
LAST_INDEX_OF      LAST_INDEX_OF(expression)
                   Returns the index of the last occurrence of a character in a
                   string, searching backward from the end of the string.
LCFIRST /UCFIRST   LCFIRST(expression) /UCFIRST(expression)
                   Converts the first character in a string to lower case /Upper
                   case.
REPLACE            REPLACE(string, 'oldChar', 'newChar')
                   To replace existing characters in a string with new characters.
UPPER / LOWER      UPPER(expression) / LOWER(expression)
                   Returns a string converted to upper/lower case.
STRSPLIT           STRSPLIT(string, regex, limit)
                   To split a string around matches of a given regular expression.
SPLITTOBAG         SPLITTOBAG(string, regex, limit)
                   Similar to the STRSPLIT() function, it splits the string by given
                   delimiter and returns the result in a bag.
TRIM/LTRIM/RTRIM   TRIM(expression) / LTRIM(expression) / RTRIM(expression)
                   Returns a copy of a string with both leading and trailing /
                   leading / trailing whitespace removed.
Date and Time functions
 ToDate : Used to generate a DateTime object according to the given
  parameters.
    ToDate(milliseconds)
    ToDate(userstring, format)
    ToDate(userstring, format, timezone)
Math functions
 ABS, ACOS, ATAN, ASIN, CBRT, CEIL, COS, COSH, EXP, FLOOR, LOG, LOG10,
  RANDOM, ROUND, SIN, SINH, SQRT, TAN, TANH
Running Scripts
 How to run Apache Pig scripts in batch mode:
 Comments in Pig Script :
      /* ... */ for multi-line comments; -- for single-line comments
 Executing Pig Script in Batch mode
Step 1
Write all the required Pig Latin statements in a single file. We can write all
the Pig Latin statements and commands in a single file and save it as a .pig file.
Step 2
Execute the Apache Pig script. You can execute the Pig script from the shell
(Linux) as shown below.
   $ pig -x local Sample_script.pig
 You can execute it from the Grunt shell as well using the exec command as
  shown below.
    grunt> exec /sample_script.pig
 Executing a Pig Script from HDFS :
     Suppose there is a Pig script with the name Sample_script.pig in the HDFS directory
      named /pig_data/.
     $ pig -x mapreduce hdfs://localhost:9000/pig_data/Sample_script.pig