Department of Computer Science and Engineering
10212CS210 – Big Data Analytics
Course Category : Program Elective
Credits : 4
Slot : S1 & S5
Semester : Summer
Academic Year : 2024-2025
Faculty Name : Dr. S. Jagan
School of Computing
Vel Tech Rangarajan Dr. Sagunthala R&D Institute of Science and Technology
Unit 4 Big Data Visualization and Prediction
Pig: Introduction to Pig, Execution Modes of Pig, Comparison of Pig with Databases, Grunt, Pig Latin, User Defined Functions, Data Processing Operators. Hive: Hive Shell, Hive Services, Hive Metastore, Comparison with Traditional Databases, HiveQL, Tables, Querying Data and User Defined Functions. NoSQL Databases: Schema-less Models: Increasing Flexibility for Data Manipulation - Key-Value Stores - Document Stores - Tabular Stores - Object Data Stores - Graph Databases - Hive - Sharding - HBase - Analyzing Big Data with Twitter - Big Data for E-Commerce - Big Data for Blogs.
Introduction to PIG
• Developed by Yahoo! and now a top-level Apache project
• Makes data on a cluster immediately available to non-Java programmers via Pig Latin, a dataflow language
• Interprets Pig Latin and generates MapReduce jobs that run on the cluster
• Enables easy data summarization, ad-hoc reporting and querying, and analysis of large volumes of data
• The Pig interpreter runs on a client machine, so no administrative overhead is required on the cluster
Pig Terms
• All data in Pig is one of four types:
• An Atom is a simple data value, stored as a string but usable as either a string or a number
• A Tuple is a data record consisting of a sequence of "fields"
• Each field is a piece of data of any type (atom, tuple or bag)
• A Bag is a collection of tuples (also referred to as a 'Relation')
• Conceptually, a bag is "kind of" a table
• A Map is a mapping from keys (string literals / chararrays) to values of any data type
• Conceptually, a map is like a hash map
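• A minimal sketch of how all four types can appear together in a LOAD schema; the file name and field names here are hypothetical, not part of the original examples.
students = LOAD 'students.txt'
    AS (name:chararray,                      -- atom: a simple value
        marks:tuple(m1:int, m2:int),         -- tuple: an ordered record of fields
        courses:bag{c:(course:chararray)},   -- bag: a collection of tuples
        details:map[]);                      -- map: chararray keys to values of any type
DUMP students;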
Pig Capabilities
• Support for
• Grouping
• Joins
• Filtering
• Aggregation
• Extensibility
• Support for User Defined Functions (UDFs)
• Leverages the same massive parallelism as native
MapReduce
Pig Basics
• Pig is a client application
• No cluster software is required
• Interprets Pig Latin scripts to MapReduce jobs
• Parses Pig Latin scripts
• Performs optimization
• Creates execution plan
• Submits MapReduce jobs to the cluster
Execution Modes
• Pig has two execution modes
• Local Mode - all files are installed and run using your local host
and file system
• MapReduce Mode - all files are installed and run on a Hadoop
cluster and HDFS installation
• Interactive
• By using the Grunt shell by invoking Pig on the command line
$ pig
grunt>
• Batch
• Run Pig in batch mode using Pig Scripts and the "pig" command
$ pig -f id.pig -p <param>=<value> ...
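• The mode can also be selected explicitly with the -x flag; the script name below is hypothetical.
$ pig -x local myscript.pig       # Local Mode: local host and file system
$ pig -x mapreduce myscript.pig   # MapReduce Mode: Hadoop cluster and HDFS (the default)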
Pig Latin
• Pig Latin scripts are generally organized as follows
• A LOAD statement reads data
• A series of “transformation” statements process the data
• A STORE statement writes the output to the filesystem
• A DUMP statement displays output on the screen
• Logical vs. physical plans:
• All statements are stored and validated as a logical plan
• Once a STORE or DUMP statement is encountered, the logical plan is executed
Example Pig Script
-- Load the content of a file into a pig bag named 'input_lines'
input_lines = LOAD 'CHANGES.txt' AS (line:chararray);
-- Extract words from each line and put them into a pig bag named 'words'
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- keep only tokens made up of word characters (drops whitespace-only tokens)
filtered_words = FILTER words BY word MATCHES '\\w+';
-- create a group for each word
word_groups = GROUP filtered_words BY word;
-- count the entries in each group
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
-- order the records by count
ordered_word_count = ORDER word_count BY count DESC;
-- Store the results (this triggers execution of the script)
STORE ordered_word_count INTO 'output';
Basic “grunt” Shell Commands
• Help is available
$ pig -h
• Pig supports HDFS commands
grunt> pwd
• put, get, cp, ls, mkdir, rm, mv, etc.
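• A short grunt session using a few of these commands; the paths are hypothetical.
grunt> pwd
grunt> ls /user/training
grunt> mkdir /user/training/demo
grunt> cp /user/training/data.txt /user/training/demo/
grunt> cat /user/training/demo/data.txt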
About Pig Scripts
• Pig Latin statements grouped together in a file
• Can be run from the command line or the shell
• Support parameter passing
• Comments are supported
• Inline comments '--'
• Block comments /* */
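• A small sketch combining both comment styles with a passed parameter; the script, file, and parameter names are hypothetical.
/* filter_lines.pig : keep lines longer than $minlen characters */
lines = LOAD '$input' AS (line:chararray);         -- $input is substituted at run time
long_lines = FILTER lines BY SIZE(line) > $minlen;
STORE long_lines INTO 'long_lines_out';
• Run it in batch mode with:
$ pig -f filter_lines.pig -p input=CHANGES.txt -p minlen=80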
Simple Data Types
Type Description
int 4-byte integer
long 8-byte integer
float 4-byte (single precision) floating point
double 8-byte (double precision) floating point
bytearray Array of bytes; blob
chararray String (“hello world”)
boolean True/False (case insensitive)
datetime A date and time
biginteger Java BigInteger
bigdecimal Java BigDecimal
Complex Data Types
Type Description
Tuple Ordered set of fields (a “row / record”)
Bag Collection of tuples (a “resultset / table”)
Map A set of key-value pairs
Keys must be of type chararray
Pig Data Formats
• BinStorage
• Loads and stores data in machine-readable (binary) format
• PigStorage
• Loads and stores data as structured, field delimited text
files
• TextLoader
• Loads unstructured data in UTF-8 format
• PigDump
• Stores data in UTF-8 format
• YourOwnFormat!
• via UDFs
Loading Data Into Pig
• Loads data from an HDFS file
var = LOAD 'employees.txt';
var = LOAD 'employees.txt' AS (id, name, salary);
var = LOAD 'employees.txt' USING PigStorage() AS (id, name, salary);
• Each LOAD statement defines a new bag
• Each bag can have multiple elements (atoms)
• Each element can be referenced by name or position ($n)
• A bag is immutable
• A bag can be aliased and referenced later
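• A sketch tying these points together, assuming a hypothetical comma-delimited file employees.csv:
emps = LOAD 'employees.csv' USING PigStorage(',')
       AS (id:int, name:chararray, salary:float);
names = FOREACH emps GENERATE name;          -- reference a field by name
first_two = FOREACH emps GENERATE $0, $1;    -- reference fields by position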
Storing Data Into Pig
• STORE
• Writes output to an HDFS file in a specified directory
grunt> STORE processed INTO 'processed_txt';
• Fails if directory exists
• Writes output files, part-[m|r]-xxxxx, to the directory
• PigStorage can be used to specify a field delimiter
• DUMP
• Write output to screen
grunt> DUMP processed;
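• For example, comma-delimited output can be written like this (the output directory name is hypothetical and must not already exist):
grunt> STORE processed INTO 'processed_csv' USING PigStorage(',');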
Relational Operators
• FOREACH
• Applies expressions to every record in a bag
• FILTER
• Filters by expression
• GROUP
• Collect records with the same key
• ORDER BY
• Sorting
• DISTINCT
• Removes duplicates
Relational Operators
• Use the FOREACH …GENERATE operator to work with
rows of data, call functions, etc.
• Basic syntax:
alias2 = FOREACH alias1 GENERATE expression;
• Example:
DUMP alias1;
(1,2,3) (4,2,1) (8,3,4) (4,3,3) (7,2,5)
(8,4,3)
alias2 = FOREACH alias1 GENERATE col1, col2;
DUMP alias2;
(1,2) (4,2) (8,3) (4,3) (7,2) (8,4)
Relational Operators
• Use the FILTER operator to restrict tuples or rows of
data
• Basic syntax:
alias2 = FILTER alias1 BY expression;
• Example:
DUMP alias1;
(1,2,3) (4,2,1) (8,3,4) (4,3,3) (7,2,5)
(8,4,3)
alias2 = FILTER alias1 BY (col1 == 8) OR (NOT (col2 + col3 > col1));
DUMP alias2;
(4,2,1) (8,3,4) (7,2,5) (8,4,3)
Relational Operators
• Use the GROUP…ALL operator to group data
• Use GROUP when only one relation is involved
• Use COGROUP when multiple relations are involved
• Basic syntax:
alias2 = GROUP alias1 ALL;
• Example:
DUMP alias1;
(John,18,4.0F) (Mary,19,3.8F)
(Bill,20,3.9F) (Joe,18,3.8F)
alias2 = GROUP alias1 BY col2;
DUMP alias2;
(18,{(John,18,4.0F),(Joe,18,3.8F)})
(19,{(Mary,19,3.8F)})
(20,{(Bill,20,3.9F)})
Relational Operators
• Use the ORDER…BY operator to sort a relation based
on one or more fields
• Basic syntax:
alias = ORDER alias BY field_alias [ASC|DESC];
• Example:
DUMP alias1;
(1,2,3) (4,2,1) (8,3,4) (4,3,3) (7,2,5)
(8,4,3)
alias2 = ORDER alias1 BY col3 DESC;
DUMP alias2;
(7,2,5) (8,3,4) (1,2,3) (4,3,3) (8,4,3)
(4,2,1)
Relational Operators
• Use the DISTINCT operator to remove duplicate tuples
in a relation.
• Basic syntax:
alias2 = DISTINCT alias1;
• Example:
DUMP alias1;
(8,3,4) (1,2,3) (4,3,3) (4,3,3) (1,2,3)
alias2= DISTINCT alias1;
DUMP alias2;
(8,3,4) (1,2,3) (4,3,3)
Relational Operators
• FLATTEN
• Used to un-nest tuples as well as bags
• INNER JOIN
• Used to perform an inner join of two or more relations based on
common field values
• OUTER JOIN
• Used to perform left, right or full outer joins
• SPLIT
• Used to partition the contents of a relation into two or more
relations
• SAMPLE
• Used to select a random data sample with the stated sample size
Relational Operators
• Use the JOIN operator to perform an inner, equi-join of two or more relations based on common field values
• The JOIN operator always performs an inner join
• Inner joins ignore null keys
• Filter null keys before the join
• JOIN and COGROUP operators perform similar
functions
• JOIN creates a flat set of output records
• COGROUP creates a nested set of output records
Relational Operators
DUMP Alias1;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)

DUMP Alias2;
(2,4)
(8,9)
(1,3)
(2,7)
(2,9)
(4,6)
(4,9)

-- Join Alias1 and Alias2 on Col1
Alias3 = JOIN Alias1 BY Col1, Alias2 BY Col1;

DUMP Alias3;
(1,2,3,1,3)
(4,2,1,4,6)
(4,3,3,4,6)
(4,2,1,4,9)
(4,3,3,4,9)
(8,3,4,8,9)
(8,4,3,8,9)
Relational Operators
• Use the OUTER JOIN operator to perform left, right, or full
outer joins
• Pig Latin syntax closely adheres to the SQL standard
• The keyword OUTER is optional
• keywords LEFT, RIGHT and FULL will imply left outer, right outer
and full outer joins respectively
• Outer joins will only work provided the relations which
need to produce nulls (in the case of non-matching keys)
have schemas
• Outer joins will only work for two-way joins
• To perform a multi-way outer join perform multiple two-way
outer join statements
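• A minimal sketch of the three outer-join forms, assuming two relations A and B that both declare a schema containing an id field:
C = JOIN A BY id LEFT OUTER,  B BY id;    -- keep all tuples of A
D = JOIN A BY id RIGHT OUTER, B BY id;    -- keep all tuples of B
E = JOIN A BY id FULL OUTER,  B BY id;    -- keep all tuples of both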
User-Defined Functions
• Natively written in Java, packaged as a jar file
• Other languages include JavaScript, Ruby, Groovy, and
Python
• Register the jar with the REGISTER statement
• Optionally, alias it with the DEFINE statement
REGISTER /src/myfunc.jar;
A = LOAD 'students';
B = FOREACH A GENERATE myfunc.MyEvalFunc($0);
DEFINE
• DEFINE can be used to work with UDFs and also
streaming commands
• Useful when dealing with complex input/output formats
/* read and write comma-delimited data */
DEFINE Y 'stream.pl' INPUT(stdin USING PigStreaming(',')) OUTPUT(stdout USING PigStreaming(','));
A = STREAM X THROUGH Y;

/* Define UDFs to a more readable format */
DEFINE MAXNUM org.apache.pig.piggybank.evaluation.math.MAX;
A = LOAD 'student_data' AS (name:chararray, gpa1:float, gpa2:double);
B = FOREACH A GENERATE name, MAXNUM(gpa1, gpa2);
DUMP B;
Hive: Data Warehousing package built on top of Hadoop
Hive Background
• Started at Facebook
• Data was collected and stored in an Oracle database
• Data grew from tens of GB (2006) to about 1 TB of new data per day (2007)
• By around 2020, data was being generated at a rate of roughly 1,024 TB per minute
Hive use case @ Facebook
What is Hive
• Data warehousing package built on top of Hadoop.
• Used for data analysis.
• Targeted towards users comfortable with SQL.
• Its query language is similar to SQL and is called HiveQL.
• For managing and querying structured data.
• No need to learn Java or the Hadoop APIs.
• Developed by Facebook and contributed to the community.
• Facebook analyzed several terabytes of data every day using Hive.
Features of Hive
• Hive is fast and scalable.
• It provides SQL-like queries (HiveQL) that are implicitly transformed into MapReduce or Spark jobs.
• It is capable of analyzing large datasets stored in HDFS.
• It allows different storage types such as plain text, RCFile, and HBase.
• It uses indexing to accelerate queries.
• It can operate on compressed data stored in the Hadoop ecosystem.
• It supports user-defined functions (UDFs), so users can plug in their own functionality.
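• A minimal HiveQL sketch of this SQL-like workflow; the table and column names are hypothetical.
hive> CREATE TABLE IF NOT EXISTS page_views (user_id BIGINT, url STRING, view_time STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      STORED AS TEXTFILE;
hive> SELECT url, COUNT(*) AS views
      FROM page_views
      GROUP BY url
      ORDER BY views DESC
      LIMIT 10;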
What is Hive
ETL: Extract, Transform, Load
Hive Architecture and components
Why go for Hive when Pig is there?

Pig Latin:
• Procedural data-flow language
• Example: A = LOAD 'mydata'; DUMP A;
• Used by programmers and researchers

Hive QL:
• Declarative SQL-like language
• Example: SELECT * FROM mytable;
• Used by analysts generating daily reports
Pig vs Hive
Features                       Hive                Pig
Language                       SQL-like            Pig Latin
Schemas/Types                  Yes (explicit)      Yes (implicit)
Partitions                     Yes                 No
Server                         Optional (Thrift)   No
User Defined Functions (UDF)   Yes (Java)          Yes (Java)
DFS Direct Access              Yes                 Yes
Join/Order/Sort                Yes                 Yes
Shell                          Yes                 Yes
Web Interface                  Yes                 No
JDBC/ODBC                      Yes                 No
Differences between Hive and Pig
Hive:
• Commonly used by data analysts.
• Follows SQL-like queries.
• Handles structured data.
• Works on the server side of an HDFS cluster.
• Slower than Pig.

Pig:
• Commonly used by programmers.
• Follows a data-flow language.
• Handles semi-structured data.
• Works on the client side of an HDFS cluster.
• Comparatively faster than Hive.
Hive Architecture
Apache Hive Installation
• Java installation: check whether Java is installed using the following command.
$ java -version
• Hadoop installation: check whether Hadoop is installed using the following command.
$ hadoop version
Steps to install Apache Hive:
• Download the Apache Hive tar file:
http://mirrors.estointernet.in/apache/hive/hive-1.2.2/
• Unzip the downloaded tar file:
tar -xvf apache-hive-1.2.2-bin.tar.gz
• Open the .bashrc file:
$ sudo nano ~/.bashrc
• Now provide the following HIVE_HOME path:
export HIVE_HOME=/home/codegyani/apache-hive-1.2.2-bin
export PATH=$PATH:/home/codegyani/apache-hive-1.2.2-bin/bin
• Update the environment variables:
$ source ~/.bashrc
• Start Hive with the following command:
$ hive
Hive Components
Metastore
Limitations of Hive
Abilities of Hive Query Language
Hive Data Models
Partitioning
Partitioning in Hive
• The partitioning in Hive means dividing the table into some parts based
on the values of a particular column like date, course, city or country.
• The advantage of partitioning is that since the data is stored in slices, the
query response time becomes faster.
• Since Hadoop is used to handle huge amounts of data, it is always important to use the best approach to deal with it.
• Partitioning in Hive is a good example of such an approach.
Partitioning in Hive
• Let's assume we have data on 10 million students studying in an institute.
• Now we have to fetch the students of a particular course.
• With a traditional approach, we would have to scan the entire dataset, which degrades performance.
• In such a case, the better approach is partitioning in Hive: divide the data into smaller datasets based on particular columns.
The partitioning in Hive can be executed in two ways -
•Static partitioning
•Dynamic partitioning
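• A sketch of both approaches, assuming hypothetical student and student_staging tables:
-- the partition column is declared separately from the data columns
hive> CREATE TABLE student (id INT, name STRING) PARTITIONED BY (course STRING);
-- static partitioning: the partition value is supplied explicitly
hive> LOAD DATA LOCAL INPATH '/tmp/java_students.txt'
      INTO TABLE student PARTITION (course = 'java');
-- dynamic partitioning: partition values are taken from the query result
hive> SET hive.exec.dynamic.partition = true;
hive> SET hive.exec.dynamic.partition.mode = nonstrict;
hive> INSERT INTO TABLE student PARTITION (course)
      SELECT id, name, course FROM student_staging;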
Bucketing
• The bucket for a record is determined by (hash of the bucketing column value) mod (total number of buckets)
Bucketing in Hive
• The bucketing in Hive is a data organizing technique.
• It is similar to partitioning in Hive with an added functionality that it divides
large datasets into more manageable parts known as buckets.
• So, we can use bucketing in Hive when the implementation of partitioning
becomes difficult.
• However, we can also divide partitions further in buckets.
Bucketing in Hive
• The concept of bucketing is based on the hashing technique.
• The modulus of the (hashed) column value and the required number of buckets is calculated, e.g. F(x) % 3.
• Based on the resulting value, the row is stored in the corresponding bucket.
Example of Bucketing in Hive
•First, select the database in which we want to create a table.
hive> use showbucket;
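• A sketch of how the example might continue, with hypothetical table names and the id column hashed into 3 buckets:
hive> CREATE TABLE emp_bucketed (id INT, name STRING, salary FLOAT)
      CLUSTERED BY (id) INTO 3 BUCKETS
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
hive> SET hive.enforce.bucketing = true;      -- needed on older Hive versions
hive> INSERT OVERWRITE TABLE emp_bucketed SELECT * FROM emp_staging;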
SerDe - Serialization and Deserialization
Introduction to Hive SerDe
• For the purpose of I/O, Apache Hive uses the Hive SerDe interface; it handles both serialization and deserialization in Hive.
• The deserializer interprets the stored bytes as individual fields for processing.
• In addition, a SerDe allows Hive to read data from a table and to write it back out to HDFS in any custom format.
• Anyone can write their own SerDe for their own data formats.
SerDe
• HDFS files –> InputFileFormat –> <key, value> –>
Deserializer –> Row object
• Row object –> Serializer –> <key, value> –>
OutputFileFormat –> HDFS files
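• For example, a table can declare an explicit SerDe; the table name below is hypothetical, and OpenCSVSerde ships with Hive.
hive> CREATE TABLE csv_events (event_id STRING, payload STRING)
      ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
      STORED AS TEXTFILE;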
UDF
• User Defined Functions, also known as UDF, allow you to
create custom functions to process records or groups of
records.
• Hive comes with a comprehensive library of functions.
• There are however some omissions, and some specific cases
for which UDFs are the solution.
UDF
A UDF processes one or several columns of one row and outputs one
value. For example :
•SELECT lower(str) from table
For each row in "table," the "lower" UDF takes one argument, the value
of "str", and outputs one value, the lowercase representation of "str".
•SELECT datediff(date_begin, date_end) from table
UDF
For each row in "table," the "datediff" UDF takes two arguments, the value of
"date_begin" and "date_end", and outputs one value, the difference in time
between these two dates.
Each argument of a UDF can be:
•A column of the table.
•A constant value.
•The result of another UDF.
•The result of an arithmetic computation.
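• A sketch of how a custom Java UDF is typically registered and called from HiveQL; the jar path, class name, and table name are hypothetical.
hive> ADD JAR /tmp/my_udfs.jar;
hive> CREATE TEMPORARY FUNCTION to_title AS 'com.example.hive.ToTitleCase';
hive> SELECT to_title(name) FROM employees;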
Types of Built-in Functions in HIVE
• Collection Functions.
• Date Functions.
• Mathematical Functions.
• Conditional Functions.
• String Functions.
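• One built-in function from each category in a single hedged example, assuming a hypothetical employees table:
hive> SELECT size(phone_numbers),                  -- collection function
             year(hire_date),                      -- date function
             round(salary, 2),                     -- mathematical function
             if(salary > 50000, 'high', 'low'),    -- conditional function
             concat(first_name, ' ', last_name)    -- string function
      FROM employees;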
NoSQL – Not Only SQL
• Lightweight and open source.
• NoSQL databases are used in:
  • Big data
  • Real-time web applications
  • Log analysis
  • Social networking feeds
• Non-relational database.
• Distributed.
• No support for ACID properties.
• No fixed table schema.
NoSQL - Types
• NoSQL
• Key-value or big hash table: Dynamo, Redis, Riak
• Document: MongoDB, Apache CouchDB, MarkLogic
• Columnar: Cassandra, HBase
• Graph: Neo4j, HyperGraphDB, InfiniteGraph
NoSQL - Types
What is it?
• NoSQL databases are not relational.
• Data is stored as key-value pairs, documents, columns, or graphs.

Key-value or big hash table:
  Key         Value
  Firstname   Rahul
  Lastname    Dravid

Document oriented:
• Maintains data in collections made up of documents.
• Examples: MongoDB, Apache CouchDB, Couchbase, MarkLogic.
{
  "Book Name": "BDA",
  "Publisher": "Wiley India",
  "Year of publication": 2011
}
Column
• Column – each storage block has data from only one column.
Graph
• Graph databases are also called network databases; a graph stores data in nodes and in the relationships (edges) between them (e.g., nodes with IDs 1001, 1002 and 1003 connected by edges).
NoSQL – Types & Tools
Advantages of NoSQL
• Can easily scale up and down
• Does not require a predefined schema
• Cheap and easy to implement
• Relaxes the data consistency requirement
• Data can be replicated to multiple nodes and can be partitioned
SQL vs NoSQL
NoSQL Vendors
Company Product Most widely used by
Amazon DynamoDB LinkedIn, Mozilla
Facebook Cassandra Netflix, Twitter, Ebay
Google Big Table Adobe Photoshop
HBase
HBase is an open-source, distributed, column-oriented database built on top of HDFS, based on Google's BigTable.
HBase
• A distributed data store that can scale horizontally to thousands of commodity servers and petabytes of indexed storage.
• Designed to operate on top of the Hadoop Distributed File System (HDFS) or the Kosmos File System (KFS, aka CloudStore) for scalability, fault tolerance, and high availability.
HBase
• Distributed storage
• Table-like in data structure
• multi-dimensional map
• High scalability
• High availability
• High performance
HBase
• Development started by Chad Walters and Jim Kellerman
• 2006.11: Google releases its paper on BigTable
• 2007.2: Initial HBase prototype created as a Hadoop contrib module
• 2007.10: First usable HBase
• 2008.1: Hadoop becomes an Apache top-level project and HBase becomes a subproject
• 2008.10~: HBase 0.18 and 0.19 released
HBase
• Tables have one primary index, the row key.
• No join operators.
• Scans and queries can select a subset of available
columns, perhaps by using a wildcard.
• There are three types of lookups:
• Fast lookup using row key and optional timestamp.
• Full table scan
• Range scan from region start to end.
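• A sketch of the three lookup styles in the HBase shell; the table, column family, and row keys are hypothetical.
hbase> create 'users', 'info'
hbase> put 'users', 'u1', 'info:name', 'Rahul'
hbase> get 'users', 'u1'                                      # fast lookup by row key
hbase> scan 'users'                                           # full table scan
hbase> scan 'users', {STARTROW => 'u1', STOPROW => 'u5'}      # range scan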
HBase
• HBase is a BigTable clone.
• It is open source
• It has a good community and promise for the future
• It is developed on top of, and integrates well with, the Hadoop platform, which is convenient if you are already using Hadoop.
• It has a Cascading connector.
Analyzing Big Data with Twitter
Big Data for E-Commerce
Big Data for Blogs