Big Data Analytics (MU-Sem.8-IT)
Table of Contents
Syllabus
Introduction to Big Data, Big Data characteristics, types of Big Data, Traditional vs. Big
Data business approach, Big Data Challenges, Examples of Big Data in Real Life, Big
Data Applications.
Self-learning Topics : Identification of Big Data applications and their solutions.
1.1 Introduction to Big Data and Hadoop
1.1.1 What is Big Data ?
1.1.2 Sources of Big Data
1.2 Big Data Characteristics
1.2.1 Volume
1.2.2 Variety
1.2.3 Veracity
1.2.4 Velocity
1.2.5 Value
1.2.6 Visualization
1.2.7 Virality
1.3 Types of Big Data
1.3.1 Type #1 : Unstructured
1.3.1(A) Characteristics of Unstructured Data
1.3.1(B) Sources of Unstructured Data
1.3.1(C) Advantages and Disadvantages of Unstructured Data
1.3.2 Type #2 : Structured
1.3.2(A) Characteristics of Structured Data
1.3.2(B) Sources of Structured Data
1.3.2(C) Advantages of Structured Data
1.3.3 Type #3 : Semi-structured
1.3.3(A) Characteristics of Semi-structured Data
1.3.3(B) Sources of Semi-structured Data
1.4
1.5 Traditional vs. Big Data business approach
1.6 Examples of Big Data Applications
1.7 Big Data Challenges
1.8 Examples of Big Data in Real Life
(MU- 22-23) (M8-131) Tech-Neo Publications
Introduction to Big Data
1.1 Introduction to Big Data and Hadoop
(1) Nowadays, the amount of data created by technologies such as social networking sites and e-commerce is very large. It is difficult to store such huge volumes of data using traditional data storage facilities.
(2) Until 2003, the total amount of data produced was about 5 billion gigabytes. Stored on disks, this data could fill an entire football field. In 2011, the same amount of data was created every two days, and in 2013 it was created every ten minutes. This is a tremendous rate of growth.
(3) In this topic, we will discuss big data at a fundamental level and define common concepts related to it. We will also look in depth at some of the processes and technologies currently being used in this field.
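To get a feel for the growth rates quoted in point (2), the figures can be converted into gigabytes per second. The short Python sketch below does only this arithmetic; the function and variable names are illustrative and the result is an average rate, assuming the quoted figures are taken at face value.

```python
# Figures quoted in the text: 5 billion gigabytes produced in total until 2003,
# then the same amount every two days (2011) and every ten minutes (2013).
TOTAL_GB = 5_000_000_000
SECONDS_PER_DAY = 24 * 60 * 60


def rate_gb_per_second(total_gb: float, period_seconds: float) -> float:
    """Average production rate if total_gb is created once per period."""
    return total_gb / period_seconds


rate_2011 = rate_gb_per_second(TOTAL_GB, 2 * SECONDS_PER_DAY)
rate_2013 = rate_gb_per_second(TOTAL_GB, 10 * 60)

# Two days contain 288 ten-minute intervals, so the rate grew 288-fold.
speedup = rate_2013 / rate_2011

print(f"2011: {rate_2011:,.0f} GB/s")
print(f"2013: {rate_2013:,.0f} GB/s")
print(f"Increase between 2011 and 2013: {speedup:.0f}x")
```

In other words, going from "every two days" to "every ten minutes" means the average rate of data creation increased by a factor of 288 in about two years.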
1.1.1 What is Big Data ?
Big Data is a massive collection of data that continues to grow dramatically over time.
1.1.2 Sources of Big Data
3. Video sharing portals : Video sharing portals like YouTube, Vimeo etc. contain millions of videos, each of which requires a lot of storage.
Fig. 1.1.1 : Sources of big data (3. Video sharing portals, 4. Search Engine Data, 5. Transport Data, 6. Banking Data)
4. Search Engine Data : Search engines like Google and Yahoo hold a large amount of metadata regarding various sites.
5. Transport Data : Transport data contains information about the model, capacity, distance and availability of various vehicles.
6. Banking Data : Big giants in the banking domain like SBI or ICICI hold large amounts of data regarding the transactions of account holders.
1.2 Big Data Characteristics
(1) Volume refers to the amount of data, which is growing at a high rate, i.e. data volume in petabytes.
(2) Value refers to turning data into value. By turning accessed big data into value, businesses may generate revenue.
(3) Veracity refers to the uncertainty of available data. It arises because the high volume of data brings incompleteness and inconsistency.
(4) Visualization is the process of displaying data in charts, graphs, maps, and other visual forms.
(5) Variety refers to the different data types, i.e. various data formats like text, audio, video, etc.
(6) Velocity is the rate at which data grows. Social media plays a major role in the velocity of growing data.
(7) Virality describes how quickly information spreads across people-to-people (P2P) networks.
1.2.1 Volume
• As the name suggests, big data refers to enormous amounts of information.
• We are talking not about gigabytes but about terabytes and petabytes of data.
• The IoT (Internet of Things) is creating exponential growth in data.
• The volume of data is projected to change significantly in the coming years.
• Hence, 'Volume' is one characteristic which needs to be considered while dealing with Big Data.
Volume [Data at Rest] : Terabytes, Petabytes, Records/Arch, Table/Files, Distributed
1.2.2 Variety
• Variety refers to heterogeneous sources and the nature of data, both structured and unstructured.
• Data comes in different formats, from structured, numeric data in traditional databases to unstructured text documents, emails, videos, audios, stock ticker data and financial transactions.
• This variety of unstructured data poses certain issues for storing, mining and analysing data.
• Organizing the data in a meaningful way is no simple task, especially when the data itself changes rapidly.
• The challenge of Big Data processing therefore lies not only in the massive volumes and increasing velocities of data but also in manipulating the enormous variety of these data.
Variety [Data in many Forms] : Structured, Unstructured, Text, Multimedia
1.2.3 Veracity
• Veracity describes whether the data can be trusted. Veracity refers to the uncertainty of available data.
• Veracity arises due to the high volume of data that brings incompleteness and inconsistency.
• Hygiene of data in analytics is important because otherwise you cannot guarantee the accuracy of your results.
• Because data comes from so many different sources, it is difficult to link, match, cleanse and transform data across systems.
• However, analysis is useless if the data being analysed is inaccurate or incomplete.
• Veracity is all about making sure the data is accurate, which requires processes to keep bad data from accumulating in your systems.
Veracity [Data in Doubt] : Trustworthiness, Authenticity, Accurate, Availability
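The kind of process that keeps bad data from accumulating can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: the records, the field names and the validity rules (required fields, an age range of 0 to 120, unique ids) are hypothetical and are not taken from any particular system.

```python
# Hypothetical customer records merged from two source systems.
records = [
    {"id": 1, "name": "Asha", "city": "Mumbai", "age": 34},
    {"id": 2, "name": "Ravi", "city": None, "age": 41},      # incomplete: city missing
    {"id": 3, "name": "Meena", "city": "Pune", "age": -5},   # inconsistent: impossible age
    {"id": 1, "name": "Asha", "city": "Mumbai", "age": 34},  # duplicate of id 1
]

REQUIRED_FIELDS = ("id", "name", "city", "age")


def find_problems(record):
    """Return a list of veracity problems found in one record."""
    problems = []
    for field in REQUIRED_FIELDS:
        if record.get(field) is None:
            problems.append(f"missing {field}")
    age = record.get("age")
    if age is not None and not (0 <= age <= 120):
        problems.append("age out of range")
    return problems


def cleanse(rows):
    """Split rows into clean records and rejected (record, problems) pairs."""
    clean, rejected, seen_ids = [], [], set()
    for row in rows:
        problems = find_problems(row)
        if row.get("id") in seen_ids:
            problems.append("duplicate id")
        if problems:
            rejected.append((row, problems))
        else:
            seen_ids.add(row["id"])
            clean.append(row)
    return clean, rejected


clean, rejected = cleanse(records)
for row, problems in rejected:
    print(row["id"], problems)
```

Rejected records are kept together with the reason for rejection rather than silently dropped, so that the source of incomplete or inconsistent data can be traced and corrected.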
1.2.4 Velocity
• Velocity is the speed at which data grows, is processed and becomes accessible.
• Data flows in from sources like business processes, application logs, networks, social media sites, sensors, mobile devices, etc.
• The flow of data is massive and continuous.
• Although most data is warehoused before analysis, there is an increasing need for real-time processing of these enormous volumes.
• Real-time processing reduces storage requirements while providing more responsive, accurate and profitable responses.
• Data should be processed quickly, in batches or in a stream-like manner, because it keeps growing every year.
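The difference between batch and stream-like processing, and the reason why real-time processing can reduce storage requirements, can be illustrated with a small Python sketch. It is a simplified illustration, not a description of any specific Big Data platform; the class name `RunningMean` and the sensor readings are hypothetical.

```python
def batch_mean(values):
    """Batch style: wait until all values are stored, then process them together."""
    return sum(values) / len(values)


class RunningMean:
    """Stream style: update the result as each value arrives, keeping no history."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        # Incremental update: only the count and the current mean are stored.
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean


readings = [20.0, 22.0, 21.0, 25.0]  # hypothetical sensor readings

stream = RunningMean()
for reading in readings:
    stream.update(reading)

print(batch_mean(readings), stream.mean)
```

Both approaches arrive at the same average, but the streaming version never needs to hold the full list of readings in storage and has an up-to-date answer after every new value.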
Velocity [Data in Motion] : Streaming, Batch, Real / Near Time, Processes
1.2.5 Value
• Value refers to turning data into value. By turning accessed big data into value, businesses may generate revenue.
• Value is the end game. After addressing volume, velocity, variety, variability, veracity, and visualization, which takes a lot of time, effort and resources, you want to be sure your organization is getting value from the data.
• For example, data that can be used to analyze consumer behavior is valuable for your company because you can use the research results to make individualized offers.
Value [Data into Money] : Statistical, Events, Correlations
1.2.6 Visualization
• Big data visualization is the process of displaying data in charts, graphs, maps, and other visual forms.
• It is used to help people easily understand and interpret their data at a glance, and to clearly show trends and patterns that arise from this data.
• Raw data comes in different formats, so creating data visualizations is a process of gathering, managing, and transforming data into a format that is most usable and meaningful.
• Big Data visualization makes your data as accessible as possible to everyone within your organization, whether they have technical data skills or not.
Visualization [Data Readable] : Readable, Accessible, Presentation, Visual Forms
1.2.7 Virality
• Virality describes how quickly information spreads across people-to-people (P2P) networks.
• It measures how quickly data is spread and shared to each unique node.
• Time is a determining factor, along with the rate of spread.
Virality [Data Spread] : Shared, Rate of Spread
1.3 Types of Big Data
There are three types of Big Data :
1. Unstructured 2. Structured 3. Semi-structured
1.3.1 Type #1 : Unstructured
• Any data with unknown form or structure is classified as unstructured data. In addition to its size being huge, unstructured data poses multiple challenges in terms of processing it to derive value.
• A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos etc., such as the results of a Google search.
• Nowadays organizations have a wealth of data available to them but, unfortunately, they don't know how to derive value out of it since this data is in its raw form or unstructured format.
Unstructured - Example : The output returned by 'Google Search' (figure labels: Human Generated Data, Machine Generated Data)
1.3.1(A) Characteristics of Unstructured Data
(1) Data neither conforms to a data model nor has any structure.
(2) Data can not be stored in the form of rows and columns as in databases.
(3) Data does not follow any semantics or rules.
(4) Data lacks any particular format or sequence.
(5) Data has no easily identifiable structure.
(6) Due to the lack of an identifiable structure, it can not be used by computer programs easily.
1.3.1(B) Sources of Unstructured Data
(1) Web pages
(2) Images (JPEG, GIF, PNG, etc.)
(3) Videos
(4) Memos
(5) Reports
(6) Word documents and PowerPoint presentations
(7) Surveys
1.3.1(C) Advantages and Disadvantages of Unstructured Data
Advantages
1. It supports data which lacks a proper format or sequence.
2. The data is not constrained by a fixed schema.
3. It is very flexible due to the absence of a schema.
4. Data is portable.
5. It is very scalable.
6. It can deal easily with the heterogeneity of sources.
7. This type of data has a variety of business intelligence and analytics applications.
Disadvantages
1. It is difficult to store and manage unstructured data due to the lack of schema and structure.
2. Indexing the data is difficult and error prone due to the unclear structure and the absence of pre-defined attributes. As a result, search results are not very accurate.
3. Ensuring the security of data is a difficult task.
1.3.2 Type #2 : Structured
• Any data that can be stored, accessed and processed in a fixed format is termed "structured" data.
• Over time, computer science has achieved great success in developing techniques for working with such data (where the format is well known in advance) and in deriving value out of it.
• The size of such data can grow to a huge extent; typical sizes are in the range of multiple zettabytes. Data stored in a relational database management system is one example of structured data.
• Structured data is the data which conforms to a data model, has a well-defined structure, follows a consistent order and can be easily accessed and used by a person or a computer program.
• Structured data is usually stored in well-defined schemas such as databases. It is generally tabular, with columns and rows that clearly define its attributes.
• SQL (Structured Query Language) is often used to manage structured data stored in databases.
1.3.2(A) Characteristics of Structured Data
• Data conforms to a data model and has an easily identifiable structure.
• Data is stored in the form of rows and columns. Example : Database
• Data is well organised, so the definition, format and meaning of data are explicitly known.
• Data resides in fixed fields within a record or file.
• Similar entities are grouped together to form relations or classes.
• Entities in the same group have the same attributes.
• It is easy to access and query, so data can be easily used by other programs.
• Data elements are addressable, so the data is efficient to analyse and process.
1.3.2(B) Sources of Structured Data
(1) SQL Databases
(2) Spreadsheets such as Excel
(3) OLTP Systems
(4) Online forms
(5) Sensors such as GPS or RFID tags
(6) Network and Web server logs
(7) Medical devices
1.3.2(C) Advantages of Structured Data
1. Structured data has a well-defined structure that helps in easy storage and access of data.
2. Data can be indexed based on text strings as well as attributes. This makes search operations hassle-free.
3. Data mining is easy, i.e. knowledge can be easily extracted from data.
4. Operations such as updating and deleting are easy due to the well-structured form of data.
5. Business Intelligence operations such as data warehousing can be easily undertaken.
6. It is easily scalable in case there is an increment of data.
7. Ensuring the security of data is easy.
Structured - Example
Employee Table
1 | XYX | MALE | FINANCE |
2 | ABC | MALE | ADMIN | 250000
3 | PQR | [illegible] | SALES | 350000
4 | MNR | FEMALE | FINANCE | 600000
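How a fixed schema makes such data easy to query and update can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration: the column names (`emp_id`, `name`, `gender`, `department`, `salary`) are assumptions, since the header row of the Employee Table is not given, and only the two fully legible rows are loaded.

```python
import sqlite3

# In-memory database; column names are assumed for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee ("
    " emp_id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " gender TEXT,"
    " department TEXT,"
    " salary INTEGER)"
)

# The two rows of the Employee Table that are complete in the text.
rows = [
    (2, "ABC", "MALE", "ADMIN", 250000),
    (4, "MNR", "FEMALE", "FINANCE", 600000),
]
conn.executemany("INSERT INTO employee VALUES (?, ?, ?, ?, ?)", rows)

# Because every record has the same fields, a query can address them by name.
finance = conn.execute(
    "SELECT name, salary FROM employee WHERE department = ? ORDER BY emp_id",
    ("FINANCE",),
).fetchall()

# Updating is equally direct thanks to the well-defined structure.
conn.execute("UPDATE employee SET salary = ? WHERE emp_id = ?", (300000, 2))
updated = conn.execute("SELECT salary FROM employee WHERE emp_id = 2").fetchone()[0]

print(finance, updated)
conn.close()
```

The query and the update both rely on the data residing in fixed, named fields, which is exactly the property that unstructured data lacks.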
1.3.3 Type #3 : Semi-structured
• Semi-structured data is the third type of big data. It pertains to data containing both of the forms mentioned above, that is, structured and unstructured data. To be precise, it refers to data that, although it has not been classified under a particular repository (database), still contains vital information or tags that segregate individual elements within the data.
• Web application data, which is unstructured, consists of log files, transaction history files etc.
• Online transaction processing systems are built to work with structured data, wherein data is stored in relations (tables).
• Semi-structured data is data that does not conform to a data model but has some structure. It lacks a fixed or rigid schema. It is data that does not reside in a relational database but has some organizational properties that make it easier to analyze. With some processing, it can be stored in a relational database.
1.3.3(A) Characteristics of Semi-structured Data
• Data does not conform to a data model but has some structure.
• Data can not be stored in the form of rows and columns as in databases.
• Semi-structured data contains tags and elements (metadata) which are used to group data and describe how the data is stored.
• Similar entities are grouped together and organized in a hierarchy.
• Entities in the same group may or may not have the same attributes or properties.
• It does not contain sufficient metadata, which makes automation and management of data difficult.
• The size and type of the same attributes in a group may differ.
• Due to the lack of a well-defined structure, it can not be used by computer programs easily.
1.3.3(B) Sources of Semi-structured Data
(1) E-mails
(2) XML and other markup languages
(3) Binary executables
(4) TCP/IP packets
(5) Zipped files
(6) Integration of data from different sources
(7) Web pages
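XML, one of the sources listed above, shows the typical properties of semi-structured data: tags describe the data, entities are grouped in a hierarchy, and entities in the same group need not have the same attributes. The Python sketch below parses a small, made-up XML fragment with the standard `xml.etree.ElementTree` module; the element names and values are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML fragment: both entries are <person> elements,
# but they do not carry the same set of child elements.
xml_text = """
<people>
  <person id="1">
    <name>ABC</name>
    <email>abc@example.com</email>
  </person>
  <person id="2">
    <name>PQR</name>
    <phone>555-0100</phone>
    <city>Mumbai</city>
  </person>
</people>
"""

root = ET.fromstring(xml_text)

people = []
for person in root.findall("person"):
    # The tags themselves act as metadata describing each value.
    record = {"id": person.get("id")}
    for child in person:
        record[child.tag] = child.text
    people.append(record)

# Union of all fields seen; a fixed table would need every one as a column.
all_fields = sorted({field for record in people for field in record})

print(people)
print(all_fields)
```

Storing these records in a relational table would require deciding on a column for every field that any record might have, which is the "some processing" needed before semi-structured data fits into rows and columns.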
1.3.3(C) Advantages and Disadvantages of Semi-structured Data
Advantages
1. The data is not constrained by a fixed schema.
2. It is flexible, i.e. the schema can be easily changed.
3. Data is portable.
4. It is possible to view structured data as semi-structured data.
5. It supports users who can not express their need in SQL.
6. It can deal easily with the heterogeneity of sources.
Disadvantages
1. The lack of a fixed, rigid schema makes it difficult to store the data.
2. Interpreting the relationship between data is difficult as there is no separation of the schema and the data.
3. Queries are less efficient as compared to structured data.
Semi-structured - Example
A user can view semi-structured data as structured in form, but it is actually not defined with, e.g., a table definition as in a relational DBMS.

Properties | Structured data | Semi-structured data | Unstructured data
Technology | It is based on relational database table | It is based on XML/RDF (Resource Description Framework) | It is based on character and binary data
Transaction management | Matured transaction and various concurrency techniques | Transaction is adapted from DBMS, not matured | No transaction management and no concurrency
Version management | Versioning over tuples, rows, tables | Versioning over tuples or graph is possible | Versioned as a whole
Flexibility | It is schema dependent and less flexible | It is more flexible than structured data but less flexible than unstructured data | It is more flexible and there is absence of schema
Scalability | It is very difficult to scale DB schema | Its scaling is simpler than structured data | ... scalable
Robustness | Very robust | New technology, not very widespread |
Query performance | Structured query allows complex joining | Queries over anonymous nodes are possible | Only textual queries are possible
1.5 Traditional vs. Big Data business approach
1. Traditional Data
• Traditional data is the structured data which is being majorly maintained by all types of businesses, starting from very small to big organizations.
• For managing and accessing the data, Structured Query Language (SQL) is used.
2. Big Data
• It deals with large volumes of structured, semi-structured and unstructured data. Volume, Velocity, Variety, Veracity and Value are referred to as the 5 V's of big data.
• Big data refers not only to a large amount of data but also to extracting meaningful information by analyzing huge, complex data sets.
Sr. No. | Traditional Data | Big Data
1. | Traditional data is generated at enterprise level. | Big data is generated outside as well as at enterprise level.
2. | Its volume ranges from Gigabytes to Terabytes. | Its volume ranges from Petabytes to Zettabytes or Exabytes.
3. | Traditional database system deals with structured data. | Big data system deals with structured, semi-structured and unstructured data.
4. | Traditional data is generated per hour or per day or more. | Big data is generated more frequently, mainly per second.
5. | Traditional data source is centralized and it is managed in centralized form. | Big data source is distributed and it is managed in distributed form.
6. | Data integration is very easy. | Data integration is very difficult.
7. | Normal system configuration is capable of processing traditional data. | High system configuration is required to process big data.
8. | The size of the data is very small. | The size is more than the traditional data size.
9. | Traditional database tools are required to perform any database operation. | Special kinds of database tools are required to perform any database operation.
10. | Normal functions can manipulate data. | Special kinds of functions can manipulate data.
11. | Its data model is strict schema based and it is static. | Its data model is flat schema based and it is dynamic.
12. | Traditional data is stable and its inter-relationships are known. | Big data is not stable and its relationships are unknown.
13. | Traditional data is in manageable volume. | Big data is in huge volume which becomes unmanageable.
14. | It is easy to manage and manipulate the data. | It is difficult to manage and manipulate the data.
15. | Its data sources include ERP transaction data, CRM transaction data, financial data, organizational data, Web transaction data etc. | Its data sources include social media, device data, sensor data, video, images, audio etc.
1.6 Examples of Big Data Applications
1. Fraud detection
• Fraud detection is a Big Data application example for businesses whose operations involve any type of claims or transaction processing.
• Often, fraud is detected long after the fact. At that point the damage has already been done, and all that is left is to minimize the harm and revise policies to prevent it in future.
Fig. 1.6.1 : Big data applications (1. Fraud detection, 2. IT log analytics, 3. Call center analytics, 4. Social media analysis)
• Big Data platforms can analyze the claims and transactions of businesses. They identify large-scale patterns across many transactions or detect the anomalous behaviour of an individual user. This helps to prevent fraud.
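Detecting anomalous behaviour can be illustrated with a very simple statistical check. The Python sketch below flags transaction amounts that lie far from the typical value, using the median and the median absolute deviation (the "modified z-score" with the commonly used constant 0.6745 and cut-off 3.5). This is one basic technique chosen for illustration; it is an assumption, not the method used by any particular Big Data platform, and the transaction amounts are made up.

```python
from statistics import median


def flag_anomalies(amounts, threshold=3.5):
    """Return amounts whose modified z-score exceeds the threshold.

    Uses the median and the median absolute deviation (MAD), which are less
    distorted by extreme values than the mean and standard deviation.
    """
    med = median(amounts)
    mad = median(abs(a - med) for a in amounts)
    if mad == 0:
        # All values (or most of them) are identical; nothing can be scored.
        return []
    return [a for a in amounts if 0.6745 * abs(a - med) / mad > threshold]


# Hypothetical transaction amounts for one account.
transactions = [120, 95, 130, 110, 105, 9800, 115, 100]
suspicious = flag_anomalies(transactions)

print(suspicious)
```

A real system would combine many such signals (amount, location, time, frequency) across a very large number of transactions, but the principle of comparing each event with the typical pattern is the same.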
2. IT log analytics
• An enormous quantity of logs and trace data is generated in IT solutions and IT departments. Many times such data goes unexamined: organizations simply don't have the manpower or resources to go through all such information.
• Big data has the ability to quickly identify large-scale patterns to help in diagnosing and preventing problems. This helps organizations with a large IT department.
3. Call center analytics
• Now we turn to the customer-facing Big Data application examples, of which call center analytics are particularly powerful. Without a Big Data solution, much of the insight that a call center can provide will be ignored or come to light only later.
• By making sense of time/quality resolution metrics, Big Data solutions are able to identify recurring problems or customer and staff behaviour patterns. Big data can also capture and process call content itself.
4. Social media analysis
• With the help of social media, we can observe real-time insights into how the market is responding to products and campaigns.
• With the help of these insights, it is possible for companies to adjust their pricing, promotion, and campaign placement to get optimal results.
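Of the applications above, IT log analytics is the easiest to sketch in code. The Python example below counts recurring error patterns in a handful of log lines. The log format (date, time, level, host, message) and the sample lines are hypothetical; real log formats vary widely, so the regular expression would have to be adapted.

```python
from collections import Counter
import re

# Hypothetical log lines in an assumed "date time LEVEL host message" format.
log_lines = [
    "2023-01-10 10:00:01 ERROR db-01 connection timeout",
    "2023-01-10 10:00:05 INFO web-02 request served",
    "2023-01-10 10:00:09 ERROR db-01 connection timeout",
    "2023-01-10 10:00:12 WARN web-02 slow response",
    "2023-01-10 10:00:15 ERROR web-03 disk full",
]

LINE_PATTERN = re.compile(
    r"^\S+ \S+ (?P<level>[A-Z]+) (?P<host>\S+) (?P<message>.+)$"
)


def count_errors(lines):
    """Count ERROR entries per (host, message) pair."""
    counts = Counter()
    for line in lines:
        match = LINE_PATTERN.match(line)
        if match and match.group("level") == "ERROR":
            counts[(match.group("host"), match.group("message"))] += 1
    return counts


error_counts = count_errors(log_lines)
most_common = error_counts.most_common(1)

print(most_common)
```

On a few lines this is trivial, but the same grouping and counting, applied to the full volume of logs an IT department produces, is what reveals which problem recurs most often and on which machine.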
1.7 Big Data Challenges
1. Sharing and Accessing Data
• Perhaps the most frequent challenge in big data efforts is the inaccessibility of data sets from external sources.
• Sharing data can cause substantial challenges. These include the need for inter- and intra-institutional legal documents.
• Accessing data from public repositories leads to multiple difficulties.
• Data must be available in an accurate, complete and timely manner if the data in a company's information system is to be used to make accurate decisions in time.
2. Privacy and Security
• This is another very important challenge with Big Data. It has sensitive, conceptual, technical as well as legal significance.
• Most organizations are unable to maintain regular checks due to the large amounts of data generated. However, security checks and monitoring should be performed in real time, because that is when they are most beneficial.
• Some information about a person, when combined with large external data sets, may reveal facts that the person considers private and would not want the data owner to know.
• Some organizations collect information about people in order to add value to their business. This is done by deriving insights into their lives that they are unaware of.
3. Analytical Challenges
• There are some huge analytical challenges in big data, which raise questions such as: how to deal with a problem if the data volume gets too large?
• Or how to find out the important data points?
• Or how to use data to the best advantage?
• The large amounts of data on which this type of analysis is to be done can be structured (organized data), semi-structured (semi-organized data) or unstructured (unorganized data).
• There are two techniques through which decision making can be done :
1. Either incorporate massive data volumes in the analysis,
2. Or determine upfront which Big data is relevant.
4. Technical challenges
Quality of data
1. When there is a collection of a large amount of data and storage of this data, it comes at a cost. Big companies, business leaders and IT leaders always want large data storage.
2. For better results and conclusions, big data focuses on storing quality data rather than irrelevant data.
3. This further raises the questions of how it can be ensured that the data is relevant, how much data would be enough for decision making, and whether the stored data is accurate or not.
Fault tolerance
1. Fault tolerance is another technical challenge, and fault-tolerant computing is extremely hard, involving intricate algorithms.
2. New technologies like cloud computing and big data are intended to ensure that whenever a failure occurs, the damage done stays within an acceptable threshold, that is, the whole task should not have to begin from scratch.
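One simple way to avoid restarting a whole task from scratch is checkpointing: progress is saved at intervals so that a rerun can continue from the last saved point. The Python sketch below illustrates the idea on a toy job that sums a list. The function name, the JSON checkpoint file and the `fail_at` parameter used to simulate a crash are all assumptions made for this illustration.

```python
import json
import os
import tempfile


def process_with_checkpoint(items, checkpoint_path, fail_at=None):
    """Sum the items, saving progress after each one so a rerun can resume."""
    state = {"next_index": 0, "total": 0}
    if os.path.exists(checkpoint_path):
        # A previous run left a checkpoint: continue from where it stopped.
        with open(checkpoint_path) as f:
            state = json.load(f)

    for index in range(state["next_index"], len(items)):
        if fail_at is not None and index == fail_at:
            raise RuntimeError(f"simulated failure at item {index}")
        state["total"] += items[index]
        state["next_index"] = index + 1
        with open(checkpoint_path, "w") as f:
            json.dump(state, f)
    return state["total"]


items = [10, 20, 30, 40, 50]
checkpoint = os.path.join(tempfile.mkdtemp(), "checkpoint.json")

# First run crashes part-way through the job.
try:
    process_with_checkpoint(items, checkpoint, fail_at=3)
except RuntimeError:
    pass

with open(checkpoint) as f:
    resumed_from = json.load(f)["next_index"]

# Second run resumes from the checkpoint instead of starting again.
total = process_with_checkpoint(items, checkpoint)
print(resumed_from, total)
```

The damage caused by the failure is limited to the work done since the last checkpoint. A production system would additionally write the checkpoint atomically and replicate it across machines, which this sketch does not attempt.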
Scalability
1. Big data projects can grow and evolve rapidly. The scalability issue of Big Data has led towards cloud computing.
2. It leads to various challenges, like how the goal of each workload can be achieved effectively.
3. It also requires dealing with system failures in an efficient manner. This again leads to the big question of what kinds of storage devices are to be used.
1.8 Examples of Big Data in Real Life
(1) In the Education Industry
The University of Alabama has more than 38,000 students and an ocean of data. In the past, when there were no real solutions to analyze that much data, some of it seemed useless. Now, administrators can use analytics and data visualizations on this data to draw out patterns about students, revolutionizing the university's operations, recruitment, and retention efforts.
(2) In Healthcare
Wearable devices and sensors have been introduced in the healthcare industry which can provide a real-time feed to the electronic health record of a patient. One such technology comes from Apple.
Apple has come up with Apple HealthKit, CareKit, and ResearchKit. The main goal is to empower iPhone users to store and access their real-time health records on their phones.
(3) In Government Sector
The Food and Drug Administration (FDA), which runs under the jurisdiction of the Federal Government of the USA, leverages the analysis of big data to discover patterns and associations in order to identify and examine the expected or unexpected occurrences of food-based infections.
(4) In Media and Entertainment Industry
Spotify, an on-demand music-providing platform, uses Big Data Analytics, collects data from all its users around the globe, and then uses the analyzed data to give informed music recommendations and suggestions to every individual user.
Amazon Prime, which offers videos, music etc. in a one-stop shop, is also big on using big data.
(5) In Weather Patterns
IBM Deep Thunder, which is a research project by IBM, provides weather forecasting through high-performance computing of big data. IBM is also assisting Tokyo with improved weather forecasting for natural disasters or predicting the probability of damaged power lines.
(6) In Transportation Industry
customer for more than 25 years.
(8) In Marketing
New offers and advertisements,
(9) In Business Sights
en iS using Big Data to understand the user behavior, the
0 e ;
ie ®Y like, popular movies on the Website, simila
can . k :
they invest in, mERESt to the user, and which series or movies should
(10) In Space Sector
NASA is collecting data from different satellites and rovers about the geography, atmospheric conditions, and other factors of Mars for its upcoming mission. It uses big data to manage all that data and analyze it to run simulations.
Chapter Ends...