Big Data Analytics
Government Arts and Science College
Tittagudi-606106
Department of Computer Science
DECS 43A – Big Data Analysis
II Year IV Semester
Unit -1
Introduction to Big Data
Dr. S. P. Ponnusamy
Assistant Professor and Head
Unit -1
Introduction to Big Data
Data
Characteristics of data
Types of digital data: Unstructured, Semi-structured and Structured
Sources of data
Working with unstructured data
Evolution and Definition of big data
Characteristics and Need of big data
Challenges of big data
Data environment versus big data environment
Data
• Data are the quantities, characters, or symbols on which a computer performs
operations. They may be stored and transmitted as electrical signals and
recorded on magnetic, optical, or mechanical recording media.
Big Data
• Big Data is a collection of data that is huge in volume and still growing
exponentially with time.
• Its size and complexity are so great that traditional data management tools
cannot store or process it efficiently.
• Working with it spans data mining, data storage, data analysis, data sharing, and
data visualization.
Data vs Big Data
Data Growth
• 1,024 bytes = 1 kilobyte (KB)
• 1,024 KB = 1 megabyte (MB)
• 1,024 MB = 1 gigabyte (GB)
• 1,024 GB = 1 terabyte (TB)
• 1,024 TB = 1 petabyte (PB)
• 1,024 PB = 1 exabyte (EB)
• 1,024 EB = 1 zettabyte (ZB)
• 1,024 ZB = 1 yottabyte (YB)
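The unit ladder above can be checked with a short Python sketch. The names UNITS, to_bytes and humanize are illustrative assumptions, not part of the original slides, and the sketch follows the binary convention (factor of 1,024) used in the list.

```python
# Binary storage units in ascending order; each is 1,024 times the previous one.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def to_bytes(value, unit):
    """Convert a value in the given unit to bytes (1 KB = 1,024 bytes)."""
    return value * 1024 ** UNITS.index(unit)

def humanize(num_bytes):
    """Express a byte count in the largest unit that keeps the value below 1,024."""
    size = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit that fits, or at YB if nothing larger exists.
        if size < 1024 or unit == UNITS[-1]:
            return f"{size:.2f} {unit}"
        size /= 1024

print(to_bytes(1, "TB"))        # 1099511627776
print(humanize(5 * 1024 ** 5))  # 5.00 PB
```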
Types of Data
Structured Data
• This is data in an organized form (e.g., in rows and columns) that can be easily
used by a computer program.
• Relationships exist between entities of data, such as classes and their objects.
• Data stored in relational databases is an example of structured data.
• Structured data is often called relational data.
• It is split into multiple tables to enhance the integrity of the data, with a single
record depicting each entity.
• Structured Query Language (SQL) is used to bring the data together.
• Structured data is easy to enter, query, and analyze.
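As a minimal sketch of how SQL brings split tables back together, the following uses Python's built-in sqlite3 module with an in-memory database. The department and student tables, their columns, and the sample rows are hypothetical and chosen only for illustration.

```python
import sqlite3

# In-memory database; table names, column names and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE department (dept_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE student (
        student_id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        dept_id INTEGER REFERENCES department(dept_id)
    );
    INSERT INTO department VALUES (1, 'Computer Science'), (2, 'Mathematics');
    INSERT INTO student VALUES (101, 'Asha', 1), (102, 'Ravi', 1), (103, 'Meena', 2);
""")

# Each entity lives in its own table; a JOIN brings the data together.
rows = conn.execute("""
    SELECT s.name, d.name
    FROM student AS s
    JOIN department AS d ON s.dept_id = d.dept_id
    ORDER BY s.student_id
""").fetchall()

for student, dept in rows:
    print(student, "-", dept)
conn.close()
```

Storing the department name once and referring to it by dept_id is what keeps the data consistent: renaming a department changes one row, not every student record.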
Structured Data - Sources
Ease with Structured Data
Semi-Structured Data
• This is data which does not conform to a data model but has some structure.
• However, it is not in a form which can be used easily by a computer program.
• Examples: emails, XML, markup languages such as HTML, JSON documents, etc.
• Metadata for this data is available but is not sufficient.
• It is often stored in NoSQL databases and is loosely called NoSQL data.
Semi-Structured Data - Sources
Semi-Structured Data – XML Example
<ProgrammerDetails>
  <FirstName>Jane</FirstName>
  <LastName>Doe</LastName>
  <CodingPlatforms>
    <CodingPlatform Type="Fav">GeeksforGeeks</CodingPlatform>
    <CodingPlatform Type="2ndFav">Code4Eva!</CodingPlatform>
    <CodingPlatform Type="3rdFav">CodeisLife</CodingPlatform>
  </CodingPlatforms>
</ProgrammerDetails>
<!--The 2ndFav and 3rdFav Coding Platforms are imaginative because GeeksforGeeks is
the best!-->
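The tags in the XML example carry enough structure for a program to pull out individual fields. A minimal sketch using Python's standard xml.etree.ElementTree module follows; the variable names are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# The XML document from the example, embedded as a string.
xml_text = """
<ProgrammerDetails>
  <FirstName>Jane</FirstName>
  <LastName>Doe</LastName>
  <CodingPlatforms>
    <CodingPlatform Type="Fav">GeeksforGeeks</CodingPlatform>
    <CodingPlatform Type="2ndFav">Code4Eva!</CodingPlatform>
    <CodingPlatform Type="3rdFav">CodeisLife</CodingPlatform>
  </CodingPlatforms>
</ProgrammerDetails>
""".strip()

root = ET.fromstring(xml_text)

# Child elements are looked up by tag name; attributes by key.
full_name = f"{root.findtext('FirstName')} {root.findtext('LastName')}"
platforms = {p.get("Type"): p.text for p in root.iter("CodingPlatform")}

print(full_name)         # Jane Doe
print(platforms["Fav"])  # GeeksforGeeks
```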
Semi-Structured Data – JSON Example
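A minimal sketch of a semi-structured JSON document, read with Python's standard json module. The document below is a hypothetical record that mirrors the XML example; its key names (firstName, codingPlatforms, and so on) are assumptions made for illustration.

```python
import json

# Hypothetical JSON document mirroring the XML example.
json_text = """
{
  "firstName": "Jane",
  "lastName": "Doe",
  "codingPlatforms": [
    {"type": "Fav", "name": "GeeksforGeeks"},
    {"type": "2ndFav", "name": "Code4Eva!"},
    {"type": "3rdFav", "name": "CodeisLife"}
  ]
}
"""

record = json.loads(json_text)
names = [p["name"] for p in record["codingPlatforms"]]

# Semi-structured records need not share the same fields,
# so optional keys are read with a default value.
email = record.get("email", "not provided")

print(names)
print(email)
```

Unlike a table row, the record carries its own field names, and a field such as email may simply be absent from some documents.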
Characteristics of Semi-Structured Data
Unstructured Data
• This is data which does not conform to a data model or is not in a form which can be
used easily by a computer program.
• Data cannot be stored in the form of rows and columns as in databases.
• Data does not follow any semantics or rules.
• Data lacks any particular format or sequence.
• Data has no easily identifiable structure.
• Due to the lack of identifiable structure, it cannot be used easily by computer programs.
• About 80–90% of an organization's data is in this format.
• Examples: memos, chat rooms, PowerPoint presentations, images, videos, letters, research
reports, white papers, the body of an email, etc.
Unstructured Data – Example
Unstructured Data – Sources
Unstructured Data – Issues
Dealing with Unstructured Data
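One common way of dealing with unstructured text is to extract structured fields from it by pattern matching. A minimal sketch follows, assuming a hypothetical memo and two deliberately simple regular expressions; real email addresses and date formats vary more than these patterns allow.

```python
import re

# Hypothetical free-text memo with no fixed format.
memo = (
    "Meeting moved to 2024-03-15. Please confirm with priya@example.com "
    "or arun@example.org before 2024-03-10."
)

# Simplified patterns, for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

# The extracted fields form a structured record that can be stored in a table.
record = {"emails": EMAIL.findall(memo), "dates": DATE.findall(memo)}
print(record)
```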
Definition of Big Data
Big Data is high-volume, high-velocity, and high-variety information assets that demand
cost-effective, innovative forms of information processing for enhanced insight and
decision making.
Source: Gartner IT Glossary
• High-volume, high-velocity, high-variety
• Cost-effective, innovative forms of information processing
• Enhanced insight & decision making
Characteristics of Data
1. Composition: The composition of data deals with the structure of data, that is,
the sources of data, the granularity, the types, and the nature of data as to
whether it is static or real-time streaming.
2. Condition: The condition of data deals with the state of data, that is, "Can one
use this data as is for analysis?" or "Does it require cleansing for further
enhancement and enrichment?"
3. Context: The context of data deals with "Where has this data been generated?"
"Why was this data generated?" "How sensitive is this data?" "What are the
events associated with this data?" and so on.
Evolution of Big Data
Why Big Data?
Need of Big Data
Characteristics of Big Data/What is Big Data?
• Volume: the size and amounts of big data that companies manage and
analyse.
• Variety: the diversity and range of different data types, including
unstructured data, semi-structured data and structured data.
• Velocity: the speed at which companies receive, store and manage data
– e.g., the specific number of social media posts or search queries
received within a day, hour or other unit of time.
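Velocity can be made concrete by counting events per unit of time. A minimal sketch follows, assuming a hypothetical list of post timestamps and an hourly window; the data and variable names are illustrative only.

```python
from collections import Counter
from datetime import datetime

# Hypothetical arrival times of social media posts.
timestamps = [
    "2024-01-01 09:05:00",
    "2024-01-01 09:40:00",
    "2024-01-01 09:59:59",
    "2024-01-01 10:15:00",
    "2024-01-01 11:30:00",
    "2024-01-01 11:45:00",
]

# Truncate each timestamp to its hour and count the posts in each hourly bucket.
per_hour = Counter(
    datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").strftime("%Y-%m-%d %H:00")
    for ts in timestamps
)

peak_hour, peak_count = per_hour.most_common(1)[0]
print(dict(per_hour))
print(peak_hour, peak_count)  # 2024-01-01 09:00 3
```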
Characteristics of Big Data
Characteristics of Big Data – other V’s
• Value: refers to the value that big data can
provide, and it relates directly to what
organizations can do with that collected
data.
• Veracity: the “truth” or accuracy of data
and information assets, which often
determines executive-level confidence in them.
Challenges of Big Data
• Capture
• Storage
• Curation
• Search
• Analysis
• Transfer
• Visualization
• Privacy violations
Sources of Big Data
Traditional Business Intelligence (BI) versus Big Data
• In a traditional BI environment, data resides on a central server, whereas
in a big data environment, data resides in a distributed file system.
• Traditional BI: move data to code.
• Big data environment: move code to data.
• In a traditional BI environment, data is analyzed in offline mode,
whereas in a big data environment, data is analyzed in both real-time and
offline modes.
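The contrast between moving data to code and moving code to data can be sketched in plain Python. This is an illustration only: the partitions dictionary stands in for data blocks held on separate nodes, the function names are hypothetical, and no real distributed framework is involved.

```python
# Each key stands for a node; each list for the records stored on that node.
partitions = {
    "node-1": [12, 7, 30],
    "node-2": [5, 18],
    "node-3": [22, 9, 14, 3],
}

def move_data_to_code(parts):
    """Traditional style: ship every record to one central place, then compute."""
    central = [x for records in parts.values() for x in records]
    return sum(central), len(central)  # (total, items transferred)

def move_code_to_data(parts):
    """Big data style: run the computation where the data lives, ship only results."""
    local_sums = [sum(records) for records in parts.values()]
    return sum(local_sums), len(local_sums)  # (total, items transferred)

print(move_data_to_code(partitions))  # (120, 9)
print(move_code_to_data(partitions))  # (120, 3)
```

Both approaches give the same total, but the second moves one partial result per node instead of every record, which is the reason for the design when the data is very large.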
A Typical Data Warehouse Environment
• In a typical DW environment, data is collected from multiple disparate sources,
integrated, cleansed and transformed before it is loaded into a data warehouse.
• A host of market-leading BI tools can then be used on top of the data warehouse for
reporting/dashboarding, ad hoc querying and modelling.
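The collect, integrate, cleanse, transform and load steps can be sketched as a tiny pipeline. This is a simplified illustration: the two source lists, their field names and the sales table are hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import sqlite3

# Two hypothetical sources with inconsistent field names and formats.
crm_rows = [{"customer": " Asha ", "amount": "1200"}, {"customer": "Ravi", "amount": "n/a"}]
web_rows = [{"user": "asha", "total": 300.0}, {"user": "MEENA", "total": 450.5}]

def extract():
    """Collect and integrate: map both sources onto one (name, amount) shape."""
    for r in crm_rows:
        yield r["customer"], r["amount"]
    for r in web_rows:
        yield r["user"], r["total"]

def transform(records):
    """Cleanse and transform: drop bad amounts, normalize customer names."""
    for name, amount in records:
        try:
            value = float(amount)
        except (TypeError, ValueError):
            continue  # amount is not numeric, so the row is discarded
        yield name.strip().title(), value

# Load: write the cleaned rows into the stand-in warehouse table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
warehouse.executemany("INSERT INTO sales VALUES (?, ?)", transform(extract()))

# Reporting query of the kind a BI tool would run on top of the warehouse.
report = warehouse.execute(
    "SELECT customer, SUM(amount) FROM sales GROUP BY customer ORDER BY customer"
).fetchall()
print(report)
```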
A Typical Hadoop Environment
Hadoop takes care of storage and processing using the following:
a) HDFS (Hadoop Distributed File System): distributed storage
b) MapReduce: distributed processing
ODS: Operational Data Store
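The MapReduce processing model can be illustrated with a word count in plain Python. This is a single-process sketch of the map, shuffle and reduce phases, not Hadoop's actual API; the function names and input lines are assumptions made for illustration.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

# Hypothetical input; in Hadoop each line could come from a different HDFS block.
lines = ["big data is big", "data is everywhere"]

mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

Because each map call looks at one line only and each reduce call at one key only, both phases can run in parallel on different machines, which is what makes the model suitable for distributed processing.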
Co-existence of Big Data and Data Warehouse