Big Data Analysis
Mrs. Sneha Mhatre
In today’s discussion…
● Introduction to Big data
● Characteristics and Types
● Current trend
● Data and Big data
● Traditional vs Big data
● Tools and techniques
● Hadoop
● MapReduce
● NoSQL
Introduction to data
● Example:
10, 25, …, Kharagpur, 10CS3002, namo@gov.in
Anything else?
● Data vs. Information
100.0, 0.0, 250.0, 150.0, 220.0, 300.0, 110.0
Is there any information?
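The question above can be made concrete with a small sketch. The numbers are the ones on the slide; everything else is an assumption added for illustration: the slides do not say what the values measure, so the sketch pretends they are daily sales for one week. The point is only that labels and a summary turn bare data into information.

```python
from statistics import mean

# Raw data from the slide: bare numbers with no context.
values = [100.0, 0.0, 250.0, 150.0, 220.0, 300.0, 110.0]

# ASSUMPTION (not in the slides): the values are daily sales for one week.
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
sales = dict(zip(days, values))


def summarize(sales_by_day):
    """Turn labelled data into a few statements a person can act on."""
    return {
        "total": sum(sales_by_day.values()),
        "average": mean(sales_by_day.values()),
        "best_day": max(sales_by_day, key=sales_by_day.get),
        "worst_day": min(sales_by_day, key=sales_by_day.get),
    }


info = summarize(sales)
print(info)
```

The list `values` alone answers nothing; `info` answers questions such as "which day sold the most?", which is the difference between data and information.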
How large is your data?
● What is the maximum file size you have dealt with so far?
○ Movies/files/streaming video that you have used?
● What is the maximum download speed you get?
○ To retrieve data stored in distant locations?
● How fast is your computation?
○ How much time does it take to transfer the data, process it, and get the result?
Growth of data
Sources of data
● “Every day, we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone.”
○ The data come from several sources:
■ sensors used to gather climate information
■ posts to social media sites
■ digital pictures and videos
■ purchase transaction records
■ cell phone GPS signals
■ etc., to name a few!
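To get a feel for the quoted figure, the short calculation below converts 2.5 quintillion bytes per day into more familiar units. It assumes decimal (SI) prefixes, so 1 EB = 10^18 bytes and 1 TB = 10^12 bytes; the variable names are illustrative only.

```python
BYTES_PER_DAY = 25 * 10**17        # 2.5 quintillion bytes, the figure quoted above
SECONDS_PER_DAY = 24 * 60 * 60     # 86,400 seconds

EXABYTE = 10**18                   # decimal (SI) units, an assumption of this sketch
TERABYTE = 10**12

exabytes_per_day = BYTES_PER_DAY / EXABYTE
terabytes_per_second = BYTES_PER_DAY / SECONDS_PER_DAY / TERABYTE

print(f"{exabytes_per_day:.1f} EB per day")
print(f"{terabytes_per_second:.1f} TB per second")
```

Under these assumptions the quoted rate is 2.5 EB per day, or roughly 28.9 TB every second.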
Examples
● Social media and networks (All of us are generating data)
● Scientific instruments (Collecting all sorts of data)
● Mobile devices (Tracking all objects all the time)
● Sensor technology and networks (Measuring all kinds of data)
Now data is Big data!
● No single standard definition!
● ‘Big data’ is similar to ‘small data’, but bigger
○ …but having bigger data consequently requires different approaches
■ techniques, tools and architectures
○ …to solve new problems
○ …and, of course, in a better way
● Big data is data whose scale, diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it…
● Big data is data that cannot be processed by
currently used traditional databases and
software infrastructure.
● It is a field dedicated to the analysis,
processing and storage of large collections of
data which cannot be handled by current
technology infrastructure.
Characteristics of Big data: V3
V3: V for Volume
● The volume of data that needs to be processed is increasing rapidly
○ More storage capacity
○ More computation
○ More tools and techniques
V3: V for Variety
● Various formats, types, and
structures
○ Text, numerical, images, audio, video,
sequences, time series, social media
data, multi-dimensional arrays, etc…
● Static data vs. streaming data
● A single application can be
generating/collecting many types of
data
● To extract knowledge, all these types of data need to be linked together
V3: V for Velocity
● Data is being generated fast and needs to be processed fast
○ For time-sensitive processes such as catching fraud, big data
must be used as it streams into your enterprise in order to
maximize its value
○ Scrutinize 5 million trade events created each day to
identify potential fraud
○ Analyze 500 million daily call detail records in real-time to predict
customer churn faster
● Sometimes, 2 minutes is too late!
○ The latest we have heard is that a delay of 10 ns (nanoseconds) is too much
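Processing data "as it streams in" can be sketched as a loop that inspects each event on arrival instead of waiting for a nightly batch. The example below is a minimal illustration, not a real fraud detector: the rule (more than `max_events` events for one account inside a trailing time window), the function name `flag_bursts`, and the sample stream are all assumptions made for this sketch.

```python
from collections import defaultdict, deque


def flag_bursts(events, window_seconds=60, max_events=3):
    """Yield (timestamp, account) when an account has more than
    max_events events inside the trailing window.

    events: iterable of (timestamp_in_seconds, account) pairs,
    assumed to arrive in time order.
    """
    recent = defaultdict(deque)  # account -> timestamps still inside the window
    for timestamp, account in events:
        window = recent[account]
        window.append(timestamp)
        # Drop timestamps that have fallen out of the trailing window.
        while timestamp - window[0] >= window_seconds:
            window.popleft()
        if len(window) > max_events:
            yield timestamp, account


# Hypothetical event stream: (seconds since start, account id).
stream = [(0, "A"), (5, "B"), (10, "A"), (20, "A"),
          (30, "A"), (200, "A"), (210, "B")]
alerts = list(flag_bursts(stream))
print(alerts)
```

Because `flag_bursts` is a generator, an alert is produced at the moment the offending event arrives, which is the property that time-sensitive applications need.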
● Scientific process of transforming data into insight for better decisions
● Having the knowledge you need
● Making better and faster decisions
● Optimizing business performance
● Uncovering new business opportunities
● More data leads to more accurate analysis
● More accurate analysis leads to better decision making
● Better decisions mean greater operational efficiencies, cost reduction, and reduced risk
Big data vs. small data
● Big data
○ Optimizations and predictive analytics
○ Complex statistical analysis
○ All types of data, and many sources
○ Very large datasets
○ More real-time in nature
● Small data
○ Ad-hoc querying and reporting
○ Data mining techniques
○ Structured data, typical sources
○ Small to mid-size datasets
Big data vs. small data
● Big data is more real-time in
nature than traditional
applications
● Big data architecture
○ Traditional architectures are not well-suited for big data applications (e.g. Exadata, Teradata)
○ Massively parallel processing
Traditional Data | Big Data
Traditional data is generated at the enterprise level. | Big data is generated outside the enterprise level.
Its volume ranges from gigabytes to terabytes. | Its volume ranges from petabytes to exabytes or zettabytes.
Traditional database systems deal with structured data. | Big data systems deal with structured, semi-structured, and unstructured data.
Traditional data is generated per hour, per day, or less often. | Big data is generated much more frequently, mainly per second.
The traditional data source is centralized and is managed in centralized form. | The big data source is distributed and is managed in distributed form.
Data integration is very easy. | Data integration is very difficult.
A normal system configuration is capable of processing traditional data. | A high-end system configuration is required to process big data.
Traditional Data | Big Data
The size of the data is very small. | The size is larger than the traditional data size.
Traditional database tools are required to perform any database operation. | Special kinds of database tools are required to perform any database schema-based operation.
Normal functions can manipulate the data. | Special kinds of functions are needed to manipulate the data.
Its data model is strict-schema based and static. | Its data model is flat-schema based and dynamic.
Traditional data is stable, with known inter-relationships. | Big data is not stable, with unknown relationships.
Traditional data is of manageable volume. | Big data is of huge volume, which becomes unmanageable.
It is easy to manage and manipulate the data. | It is difficult to manage and manipulate the data.
Its data sources include ERP transaction data, CRM transaction data, financial data, organizational data, web transaction data, etc. | Its data sources include social media, device data, sensor data, video, images, audio, etc.
Challenges ahead…
● The bottleneck is in technology
○ New architecture, algorithms, techniques are needed
● Also in technical skills
○ Experts in using the new technology and dealing with Big data
Who are the major players in the
world of Big data?
Big data players
Major players…
● Google
● Hadoop
● MapReduce
● Mahout
● Apache HBase
● Cassandra
Tools available
● NoSQL databases
○ MongoDB, CouchDB, Cassandra, Redis, BigTable, HBase, Hypertable, Voldemort, Riak, ZooKeeper
● MapReduce
○ Hadoop, Hive, Pig, Cascading, Cascalog, mrjob, Caffeine, S4, MapR, Acunu, Flume, Kafka, Azkaban, Oozie,
Greenplum
● Storage
○ S3, HDFS, GDFS
● Servers
○ EC2, Google App Engine, Elastic Beanstalk, Heroku
● Processing
○ R, Yahoo! Pipes, Mechanical Turk, Solr/Lucene, ElasticSearch, Datameer, BigSheets, Tinkerpop
● Clickstream analysis for retail websites: Walmart, Flipkart
● IBM predictive analysis for Bharti Airtel
● Traffic analysis: London road transport
MapReduce
● Computation framework
● Provides a parallel processing model
● Implemented to process huge amounts of data
● Similar to the divide-and-conquer technique
● Map step: queries are split, distributed across parallel nodes, and processed in parallel
● Reduce step: the results are then combined and delivered
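The map and reduce steps described above can be sketched with the classic word-count example. This is a single-process illustration in plain Python, not Hadoop code: the function names `map_step`, `shuffle`, and `reduce_step` are chosen for this sketch, and the loop over chunks stands in for work that a real cluster would spread across nodes.

```python
from collections import defaultdict


def map_step(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_pairs):
    """Group all emitted values by key so each key can be reduced on its own."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups


def reduce_step(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    mapped = []
    for chunk in chunks:  # on a cluster, each chunk would go to a different node
        mapped.extend(map_step(chunk))
    grouped = shuffle(mapped)
    return dict(reduce_step(key, values) for key, values in grouped.items())


# Two chunks of a (hypothetical) larger input file.
chunks = ["big data is big", "data is everywhere"]
counts = word_count(chunks)
print(counts)
```

Each call to `map_step` depends only on its own chunk, and each call to `reduce_step` depends only on one key, which is what allows both steps to run in parallel.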
Features of Hadoop
● Commodity hardware
● Data locality
● Huge Storage Capacity
● Data Replication
● Faster access and faster processing speed
● Write once read many times
● Parallel Processing
● Moves computation to data instead of data to computation
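Data replication and data locality from the list above can be illustrated with a toy model. The sketch below is an assumption-laden simplification: the round-robin placement rule, the node names, and the functions `split_into_blocks`, `place_replicas`, and `pick_node` are invented for illustration and do not reproduce the actual HDFS block placement policy.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut a file into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Toy rule for illustration only; requires replication <= len(nodes).
    """
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def pick_node(block_id, placement, failed=()):
    """Choose a live node that already holds the block, so the computation
    runs where the data is instead of copying the data elsewhere."""
    for node in placement[block_id]:
        if node not in failed:
            return node
    raise RuntimeError(f"all replicas of block {block_id} are unavailable")


blocks = split_into_blocks(b"abcdefghij", block_size=4)
placement = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"])
print(placement)
print(pick_node(0, placement, failed={"n1"}))
```

With three replicas per block, losing one commodity machine does not lose data: `pick_node` simply falls back to another node that holds the same block.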
NoSQL
● NoSQL is a set of concepts that allows the rapid and efficient processing of data sets with a focus on performance, reliability, and agility.
● It is more than rows in tables, is free of joins, and is schema-free.
● Works on many processors
● Uses shared-nothing commodity computers
● Supports linear scalability
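"Schema-free" and "free of joins" can be pictured with a toy in-memory document store. This is a sketch in plain Python dictionaries, not the API of MongoDB or any other NoSQL product; the functions `put` and `find` and the sample records are hypothetical.

```python
# A toy document store: each record is a free-form dict looked up by key.
store = {}


def put(key, document):
    """Store a document; no schema is declared or enforced."""
    store[key] = document


def find(predicate):
    """Return every document for which predicate(document) is true."""
    return [doc for doc in store.values() if predicate(doc)]


# Two documents with different fields can live side by side.
put("u1", {"name": "Asha", "email": "asha@example.com"})
put("u2", {
    "name": "Ravi",
    "phones": ["555-0100"],
    # Related data is nested inside the document, so no join is needed.
    "orders": [{"item": "book", "qty": 2}],
})

with_orders = find(lambda doc: "orders" in doc)
print(with_orders)
```

A relational design would place users and orders in separate tables and join them at query time; here the order list travels with the user document.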
Hadoop
● Open-source software framework for storage and large-scale processing of data sets on clusters of commodity hardware.
● Created by Doug Cutting and Mike Cafarella.
● Written in Java.
● Inspired by Google’s MapReduce algorithm
Question of the day…
What type of data are involved in the following applications?
1. Weather forecasting
2. Mobile usage of all customers of a service provider
3. Anomaly (e.g. fraud) detection in a bank organization
4. Person categorization, that is, identifying a human
5. Air traffic control in an airport
6. Streaming data from all flying Boeing aircraft
The importance of big data analytics
• Big data analytics through specialized systems and software can lead to positive business-related outcomes:
• New revenue opportunities
• More effective marketing
• Better customer service
• Improved operational efficiency
• Competitive advantages over rivals
• Preparing campaign strategies
How big data analysis works
Once the data is ready, it can be analyzed with the software
commonly used for advanced analysis processes. That
includes tools for:
• data mining, which sifts through data sets in search of patterns and relationships;
• predictive analytics, which builds models to forecast customer behavior and other future developments;
• machine learning, which taps algorithms to analyze large data sets; and
• deep learning, a more advanced offshoot of machine learning.
Applications and key data sources for big data
and business analytics
Use cases for Big data analytics