BDA (Unit 1)
Syllabus:
Introduction to Big Data Platform – Challenges of Conventional Systems – Intelligent Data
Analysis – Nature of Data – Analytic Processes and Tools – Analysis vs Reporting
Table of Contents
1 Introduction to Big Data Platform
2 Challenges of Conventional Systems
3 Intelligent Data Analysis
4 Nature of Data
5 Analytic Processes and Tools
6 Analysis vs Reporting

Growth of Big Data
The enterprise version of GridGain can be purchased from the official GridGain website,
while the free version can be downloaded from its GitHub repository.
Website: https://www.gridgain.com/
iv) HPCC Systems
HPCC Systems stands for "high performance computing cluster"; the system is developed by
LexisNexis Risk Solutions.
According to the company, this software is much faster than Hadoop and can be used in a
cloud environment.
HPCC Systems is developed in C++ and compiled into binary code for distribution.
HPCC Systems is an open-source, massively parallel processing system that is installed on a
cluster to process data in real time.
It requires the Linux operating system and runs on commodity servers connected by a
high-speed network.
It scales from a single node to thousands of nodes to provide performance and scalability.
Website: https://hpccsystems.com/
v) Apache Storm
Apache Storm is software for real-time computation and distributed processing.
It is free and open-source software developed at the Apache Software Foundation, and it is a
real-time, parallel processing engine.
Apache Storm is highly scalable and fault-tolerant, and it supports almost all
programming languages.
Apache Storm can be used for:
Realtime analytics
Online machine learning
Continuous computation
Distributed RPC
ETL
And all other places where real-time processing is required.
Apache Storm is used by Yahoo, Twitter, Spotify, Yelp, Flipboard, and many other data giants.
Website: http://storm.apache.org/
vi) Apache Spark
Apache Spark is software that runs on top of Hadoop and provides APIs for real-time, in-
memory processing and analysis of large data sets stored in HDFS.
It stores data in memory for faster processing.
Apache Spark runs programs up to 100 times faster in memory and 10 times faster on disk
compared to MapReduce.
Apache Spark exists to speed up the processing and analysis of large data sets in a Big Data
environment.
Apache Spark is being adopted very quickly by businesses to analyze their data sets and get
real value from their data.
Website: http://spark.apache.org/
vii) SAMOA
SAMOA stands for Scalable Advanced Massive Online Analysis.
It is a system for mining Big Data streams.
SAMOA is open-source software distributed on GitHub, and it can also be used as a
distributed machine learning framework.
Website: https://github.com/yahoo/samoa
Thus, the Big Data industry is growing very fast in 2017, and companies are quickly moving their
data to Big Data platforms. There is a huge demand for Big Data skills in the job market, and many
companies provide training and certifications in Big Data technologies.
1.2. CHALLENGES OF CONVENTIONAL SYSTEMS
1.2.1 Introduction to Conventional Systems
What is a Conventional System?
A conventional system consists of one or more zones, each having either manually operated call
points or automatic detection devices, or a combination of both.
Big data is a huge amount of data that is beyond the processing capacity of conventional
database systems to manage and analyze within a specific time interval.
Difference between conventional computing and intelligent computing
Conventional computing functions logically with a set of rules and calculations, while
neural computing can function via images, pictures, and concepts.
Conventional computing is often unable to manage the variability of data obtained in the
real world.
Neural computing, on the other hand, like our own brains, is well suited to situations that
have no clear algorithmic solution and can manage noisy, imprecise data. This
allows it to excel in the areas that conventional computing often finds difficult.
1.2.2 Comparison of Big Data with Conventional Data

Big Data: Used for reporting, basic analysis, and text mining; advanced analytics is only at a
starting stage in big data.
Conventional Data: Used for reporting, advanced analysis, and predictive modeling.

Big Data: Big data analysis needs both programming skills (such as Java) and analytical skills
to perform analysis.
Conventional Data: Analytical skills are sufficient for conventional data; advanced analysis
tools do not require expert programming skills.

Big Data: Generated by big financial institutions, Facebook, Google, Amazon, eBay, Walmart,
and so on.
Conventional Data: Generated by small enterprises and small banks.
1.2.3 List of Challenges of Conventional Systems
The following challenges dominate in the case of conventional systems in real-time
scenarios:
1) Uncertainty of Data Management Landscape
2) The Big Data Talent Gap
3) Getting data into the big data platform
4) Need for synchronization across data sources
5) Getting important insights through the use of Big data analytics
1) Uncertainty of Data Management Landscape:
Because big data is continuously expanding, new companies and technologies
are being developed every day.
A big challenge for companies is to find out which technology works best for them
without introducing new risks and problems.
2) The Big Data Talent Gap:
While Big Data is a growing field, there are very few experts available in this field.
This is because Big Data is a complex field, and people who understand its complexity and
intricate nature are few and far between.
3) Getting data into the big data platform:
Data is increasing every single day. This means that companies have to tackle limitless
amounts of data on a regular basis.
The scale and variety of data available today can overwhelm any data practitioner, which is
why it is important to make data accessibility simple and convenient for managers and
owners.
4) Need for synchronization across data sources:
As data sets become more diverse, there is a need to incorporate them into an analytical
platform.
If this is ignored, it can create gaps and lead to wrong insights and messages.
5) Getting important insights (understanding a situation) through the use of Big Data
analytics:
It is important that companies gain proper insights from big data analytics, and it is important
that the correct department has access to this information.
A major challenge in big data analytics is bridging this gap in an effective fashion.
Other Challenges of Conventional Systems
Three challenges that big data faces:
1. Data
2. Process
3. Management
1. Data Challenges
Volume
The volume of data, especially machine-generated data, is exploding, and it grows faster every
year as new sources of data emerge.
For example, in the year 2000, 800,000 petabytes (PB) of data were stored in the world, and this is
expected to reach 35 zettabytes (ZB) by 2020 (according to IBM).
• Social media plays a key role: Twitter generates 7+ terabytes (TB) of data every day, and
Facebook 10 TB.
• Mobile devices play a key role as well; there were an estimated 6 billion mobile phones in
2011.
• The challenge is how to deal with the size of Big Data.
Variety: Combining Multiple Data Sets
• More than 80% of today's information is unstructured, and it is typically too big to manage
effectively.
• Today, companies are looking to leverage a lot more data from a wider variety of sources,
both inside and outside the organization.
• This includes documents, contracts, machine data, sensor data, social media, health records,
emails, etc. The list is endless.
• A lot of this data is unstructured, or has a complex structure that is hard to represent in rows
and columns.
2. Processing
More than 80% of today’s information is unstructured and it is typically too big to
manage effectively.
Today, companies are looking to leverage a lot more data from a wider variety of
sources both inside and outside the organization.
Things like documents, contracts, machine data, sensor data, social media, health
records, emails, etc. The list is endless really.
3. Management
A lot of this data is unstructured, or has a complex structure that is hard to represent in
rows and columns.
Big Data Challenges
• The challenges include capture, curation, storage, search, sharing, transfer, analysis, and
visualization.
• Big Data is the trend toward larger data sets, due to the additional information derivable
from the analysis of a single large set of related data, as compared to separate smaller sets
with the same total amount of data.
• This allows correlations to be found to "spot business trends, determine quality of research,
prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic
conditions."
Challenges of Big Data
The following are some of the most important challenges of Big Data:
a) Meeting the need for speed
In today's hypercompetitive business environment, companies not only have to find and
analyze the relevant data they need, they must find it quickly.
Visualization helps organizations perform analyses and make decisions much more
rapidly, but the challenge is going through the sheer volumes of data and accessing the
level of detail needed, all at a high speed.
The challenge only grows as the degree of granularity (the level of detail) increases. One
possible solution is hardware: some vendors are using increased memory and powerful
parallel processing to crunch large volumes of data extremely quickly.
b) Understanding the data
It takes a lot of understanding to get data into the right shape so that you can use
visualization as part of data analysis.
c) Addressing data quality
Even if you can find and analyze data quickly and put it in the proper context for the
audience that will be consuming the information, the value of the data will be undermined
if it is not accurate or timely.
d) Displaying meaningful results
Plotting points on a graph for analysis becomes difficult when dealing with extremely
large amounts of information or a variety of categories of information.
For example, imagine you have 10 billion rows of retail SKU data that you are trying to
compare. A user trying to view 10 billion plots on the screen will have a hard time
seeing so many data points.
By grouping the data together, or "binning," you can visualize the data more effectively,
as the short sketch below illustrates.
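As a rough illustration of binning, the Python/pandas sketch below groups a large column into ten ranges so that only the bin counts need to be plotted. The column name, distribution, and number of bins are invented for the example and are not part of the original SKU scenario.

    # Minimal binning sketch (Python + pandas); the values are randomly generated
    # stand-ins for a very large column such as SKU sales.
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(42)
    sales = pd.Series(rng.exponential(scale=50, size=1_000_000), name="sku_sales")

    # Group ("bin") the raw values into 10 ranges and count how many rows fall in each.
    bins = pd.cut(sales, bins=10)
    binned_counts = bins.value_counts().sort_index()

    print(binned_counts)   # 10 summary rows instead of 1,000,000 individual points to plot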
1.3. INTELLIGENT DATA ANALYSIS
1.3.1 INTRODUCTION TO INTELLIGENT DATA ANALYSIS (IDA)
Intelligent Data Analysis (IDA) is one of the hot issues in the field of artificial
intelligence and information.
Intelligent data analysis reveals implicit, previously unknown and potentially valuable
information or knowledge from large amounts of data.
Intelligent data analysis is also a kind of decision support process.
Based mainly on artificial intelligence, machine learning, pattern recognition, statistics, database,
and visualization technology, IDA automatically extracts useful information, necessary
knowledge, and interesting models from large amounts of online data in order to help decision
makers make the right choices.
The process of IDA generally consists of the following three stages:
(1) data preparation
(2) rule finding or data mining
(3) result validation and explanation.
Data preparation involves selecting the required data from the relevant data source and
integrating this into a data set to be used for data mining.
Rule finding is working out rules contained in the data set by means of certain methods
or algorithms.
Result validation requires examining these rules, and result explanation is giving intuitive,
reasonable and understandable descriptions using logical reasoning.
As the goal of intelligent data analysis is to extract useful knowledge, the process
demands a combination of extraction, analysis, conversion, classification, organization,
reasoning, and so on.
It is challenging and fun working out how to choose appropriate methods to resolve the
difficulties encountered in the process.
Intelligent data analysis methods and tools, as well as the authenticity of the obtained results,
pose continuing challenges.
1.3.2 Uses / Benefits of IDA
Intelligent Data Analysis provides a forum for examining issues related to the research and
application of Artificial Intelligence techniques in data analysis across a variety of disciplines.
The techniques include (but are not limited to) the following benefit areas:
Data Visualization
Data pre-processing (fusion, editing, transformation, filtering, sampling)
Data Engineering
Database mining techniques, tools and applications
Use of domain knowledge in data analysis
Big Data applications
Evolutionary algorithms
Machine Learning(ML)
Neural nets
Fuzzy logic
Statistical pattern recognition
Knowledge Filtering and Post-processing
1.3.4 Intelligent Data Examples:
Example of IDA:
An epidemiological study (1970-1990) with a sample of examinees who died from
cardiovascular diseases during the period.
Evaluation of IDA results
Absolute & relative accuracy
Sensitivity & specificity
False positive & false negative
Error rate
Reliability of rules
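To make the relationship between these evaluation measures concrete, the short Python sketch below computes them from a small, entirely made-up confusion matrix for a hypothetical rule predicting cardiovascular risk; the counts are illustrative only.

    # Illustration of the evaluation measures listed above, using invented counts.
    true_positive = 80    # rule says "at risk", examinee actually was
    false_positive = 20   # rule says "at risk", examinee was not
    true_negative = 870   # rule says "not at risk", examinee was not
    false_negative = 30   # rule says "not at risk", examinee actually was

    total = true_positive + false_positive + true_negative + false_negative

    accuracy    = (true_positive + true_negative) / total          # absolute accuracy
    error_rate  = 1 - accuracy
    sensitivity = true_positive / (true_positive + false_negative) # true positive rate
    specificity = true_negative / (true_negative + false_positive) # true negative rate

    print(f"accuracy={accuracy:.3f}, error rate={error_rate:.3f}")
    print(f"sensitivity={sensitivity:.3f}, specificity={specificity:.3f}")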
1.4 NATURE OF DATA
1.4.1 INTRODUCTION
Data
Data is a set of values of qualitative or quantitative variables; restated, pieces of
data are individual pieces of information.
Data is measured, collected, reported, and analyzed, whereupon it can be visualized
using graphs or images.
Properties of Data
To examine the properties of data, we can refer to the various definitions of data.
These definitions reveal the following properties:
a) Amenability of use (willing to conform)
b) Clarity
c) Accuracy
d) Essence
e) Aggregation
f) Compression
g) Refinement
a) Amenability of use: From the dictionary meaning of data it is learnt that data are facts used
in deciding something. In short, data are meant to be used as a base for arriving at definitive
conclusions.
b) Clarity: Data are a crystallized presentation. Without clarity, the meaning desired to be
communicated will remain hidden.
c) Accuracy: Data should be real, complete and accurate. Accuracy is thus, an essential
property of data.
d) Essence: Large quantities of data are collected, and they have to be compressed and
refined. Data so refined can present the essence, or derived qualitative value, of the matter.
e) Aggregation: Aggregation is cumulating or adding up.
f) Compression: Large amounts of data are always compressed to make them more
meaningful. Compress data to a manageable size. Graphs and charts are some examples of
compressed data.
g) Refinement: Data require processing or refinement. When refined, they are capable of
leading to conclusions or even generalizations. Conclusions can be drawn only when data
are processed or refined.
1.4.2 TYPES OF DATA
In order to understand the nature of data it is necessary to categorize them into various
types.
Different categorizations of data are possible.
The first such categorization may be on the basis of disciplines, e.g., Sciences, Social
Sciences, etc. in which they are generated.
Within each of these fields, there may be several ways in which data can be categorized into
types.
There are four types of data:
Nominal
Ordinal
Interval
Ratio
Each offers a unique set of characteristics, which impacts the type of analysis that can be
performed.
The distinction between the four types of scales centers on three different characteristics:
1. The order of responses – whether it matters or not
2. The distance between observations – whether it matters or is interpretable (explainable)
3. The presence or inclusion of a true zero (the absence of what is being measured, e.g., zero
objects; negative numbers are not accepted)
1.4.2.1 Nominal Scales
Nominal scales measure categories and have the following characteristics:
Order: The order of the responses or observations does not matter.
Distance: Nominal scales do not hold distance. The distance between a 1 and a 2 is not the
same as a 2 and 3.
True Zero: There is no true or real zero. In a nominal scale, zero is uninterpretable.
Appropriate statistics for nominal scales: mode, count, frequencies
Displays: histograms or bar charts
1.4.2.2 Ordinal Scales
At the risk of providing a tautological definition, ordinal scales measure, well, order. So, our
characteristics for ordinal scales are:
Order: The order of the responses or observations matters.
Distance: Ordinal scales do not hold distance. The distance between first and second is
unknown, as is the distance between first and third, along with all other observations.
True Zero: There is no true or real zero. An item, observation, or category cannot finish in
zeroth place.
Appropriate statistics for ordinal scales: count, frequencies, mode
Displays: histograms or bar charts
1.4.2.3 Interval Scales
Interval scales provide insight into the variability of the observations or data.
Classic interval scales are Likert scales (e.g., 1 - strongly agree and 9 - strongly disagree) and
Semantic Differential scales (e.g., 1 - dark and 9 - light).
The characteristics of interval scales are:
Order: The order of the responses or observations does matter.
Distance: Interval scales do offer distance. That is, the distance from 1 to 2 appears the
same as the distance from 4 to 5, so differences between values are comparable and we can
perform arithmetic operations such as addition and subtraction on the data.
True Zero: There is no true zero with interval scales. However, data can be rescaled in a
manner that contains zero. An interval scale measure from 1 to 9 remains the same as 11 to
19 if we add 10 to all values. Similarly, a 1 to 9 interval scale is the same as a -4 to 4
scale if we subtract 5 from all values.
Appropriate statistics for interval scales: count, frequencies, mode, median, mean, standard
deviation (and variance), skewness, and kurtosis.
Displays: histograms or bar charts, line charts, and scatter plots.
1.4.2.4 Ratio Scales
Ratio scales appear as interval scales with a true zero.
They have the following characteristics:
Order: The order of the responses or observations matters.
Distance: Ratio scales do have an interpretable distance.
True Zero: There is a true zero. Income is a classic example of a ratio scale:
Order is established: we would all prefer $100 to $1!
Zero dollars means we have no income (or, in accounting terms, our revenue exactly
equals our expenses!).
Distance is interpretable, in that $20 is twice $10 and $50 is half of $100.
For the web analyst, the statistics for ratio scales are the same as for interval scales.
Appropriate statistics for ratio scales: count, frequencies, mode, median, mean, standard
deviation (and variance), skewness, and kurtosis.
Displays: histograms or bar charts, line charts, and scatter plots.
The table below summarizes the characteristics of all four types of scales.

Scale     | Order matters | Distance interpretable | True zero
Nominal   | No            | No                     | No
Ordinal   | Yes           | No                     | No
Interval  | Yes           | Yes                    | No
Ratio     | Yes           | Yes                    | Yes
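As an informal illustration of how the scale type constrains the statistics you can compute, the Python/pandas sketch below builds one tiny, invented sample for each scale and applies only the operations appropriate to it.

    # Sketch of the four scale types and the statistics that make sense for each;
    # the small data sets are invented for illustration.
    import pandas as pd

    nominal  = pd.Series(["red", "blue", "red", "green"], dtype="category")   # categories, no order
    ordinal  = pd.Series(pd.Categorical(["low", "high", "medium", "low"],
                                        categories=["low", "medium", "high"],
                                        ordered=True))                        # ordered, distances unknown
    interval = pd.Series([1, 5, 9, 7])          # e.g. 1-9 Likert responses: comparable distances, no true zero
    ratio    = pd.Series([0.0, 100.0, 250.0, 50.0])   # e.g. income in dollars: true zero, ratios meaningful

    print(nominal.mode().tolist())              # nominal: mode, counts, frequencies only
    print(ordinal.value_counts())               # ordinal: counts, frequencies, mode
    print(interval.mean(), interval.std())      # interval: mean, standard deviation, etc.
    print(ratio.mean(), ratio.max() / ratio.iloc[1])  # ratio: "250 is 2.5 times 100" is meaningful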
1.5. ANALYTIC PROCESSES AND TOOLS
• There are six phases in the analytic process:
1. Deployment
2. Business Understanding
3. Data Exploration
4. Data Preparation
5. Data Modeling
6. Data Evaluation
Step 1: Deployment
• Here we need to:
– plan the deployment, monitoring, and maintenance, and
– produce a final report and review the project.
• In this phase, we deploy the results of the analysis.
• This is also known as reviewing the project.
Step 2: Business Understanding
• This step consists of business understanding.
– Whenever any requirement occurs, we first need to determine the business
objective,
– assess the situation,
– determine the data mining goals, and then
– produce the project plan as per the requirement.
• Business objectives are defined in this phase.
Step 3: Data Exploration
• This step consists of data understanding.
– For the further process, we need to gather initial data, describe and explore the data,
and verify data quality to ensure it contains the data we require.
– Data collected from the various sources is described in terms of its application and
the need for the project in this phase.
– This is also known as data exploration.
• This is necessary to verify the quality of data collected.
Data exploration helps identify:
Patterns and relationships
Anomalies (irregularities)
Trends
Errors or outliers
Step 4: Data Preparation
• From the data collected in the last step,
– we need to select data as per the need, clean it, construct it to get useful
information and
– then integrate it all.
• Finally, we need to format the data to get the appropriate data.
• Data is selected, cleaned, and integrated into the format finalized for the analysis in this
phase.
Step 5: Data Modeling
• We need to:
– select a modeling technique, generate a test design, build a model, and assess the
model built.
• The data model is built to
– analyze relationships between the various selected objects in the data;
– test cases are built for assessing the model, and the model is tested and implemented on
the data in this phase.
• Where is processing hosted?
– Distributed Servers / Cloud (e.g. Amazon EC2)
• Where is data stored?
– Distributed Storage (e.g. Amazon S3)
• What is the programming model?
– Distributed Processing (e.g. MapReduce)
• How is data stored and indexed?
– High-performance schema-free databases (e.g. MongoDB)
• What operations are performed on the data?
– Analytic / Semantic Processing
Step 6: Data Evaluation
Here, we evaluate the results from the last step, review the scope of error, and determine the next
steps to perform. We evaluate the results of the test cases and review the scope of errors in this
phase.
Understanding these processes is essential, but the right tools can make or break the big data
analytics journey. The right tools can simplify and enhance your big data analytics process. The top
tools that are shaping the world of big data analytics are:
1. Hadoop - This open-source software framework is a powerhouse for storing and processing large
data sets across clusters of computers. It is designed to scale up from a single server to
thousands of machines, each offering local computation and storage.
2. Spark - This open-source, distributed computing system excels at real-time processing. It is
lightning fast, can handle both batch and streaming workloads, and is compatible with
Hadoop, making it a versatile tool in big data analytics.
3. Flink - Stream processing framework provides high throughput, low latency and exactly-
once semantics, making it ideal for event-driven applications. It's excellent for real time
analytics and complex event processing.
4. Hive - Data warehouse software facilitates reading, writing and managing large datasets
residing in distributed storage. It's fantastic for ad hoc querying and analysis of structured
and semi-structured data.
5. Tableau - Business Intelligence software excels at data visualization. It helps to turn raw data
into easily understandable visuals, making the analysis process more intuitive and accessible.
1. Hadoop
Apache Hadoop is the most prominent and widely used tool in the big data industry, with its
enormous capability for large-scale data processing. It is a 100% open-source framework that runs
on commodity hardware in an existing data center. Furthermore, it can run on a cloud infrastructure.
Hadoop consists of four parts:
Hadoop Distributed File System: Commonly known as HDFS, it is a distributed file
system that provides very high aggregate bandwidth across the cluster.
MapReduce: A programming model for processing big data (a small local sketch of the
model appears after this list).
YARN: A platform used for managing and scheduling Hadoop's resources in the
Hadoop infrastructure.
Libraries: Common utilities that help the other modules work with Hadoop.
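Since MapReduce is listed above as Hadoop's programming model, the following minimal Python sketch imitates the map, shuffle/group, and reduce phases of a word count locally. It only illustrates the model; on a real cluster the same logic would be executed in a distributed fashion (for example via Hadoop Streaming), which is not shown here.

    # Minimal local sketch of the MapReduce programming model (word count).
    from collections import defaultdict

    def map_phase(line):
        # Emit (word, 1) for every word in one input line.
        for word in line.lower().split():
            yield word, 1

    def reduce_phase(word, counts):
        # Sum all the counts that were emitted for one word.
        return word, sum(counts)

    lines = ["big data needs big tools", "hadoop processes big data"]

    # Shuffle: group the mapper output by key (the word).
    grouped = defaultdict(list)
    for line in lines:
        for word, count in map_phase(line):
            grouped[word].append(count)

    results = [reduce_phase(word, counts) for word, counts in grouped.items()]
    print(sorted(results))   # [('big', 3), ('data', 2), ('hadoop', 1), ...]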
2. Apache Spark
Apache Spark is the next big thing in the industry among big data tools. The key point of this
open-source big data tool is that it fills the gaps of Apache Hadoop concerning data processing.
Interestingly, Spark can handle both batch data and real-time data. As Spark does in-memory data
processing, it processes data much faster than traditional disk-based processing. This is indeed a plus
point for data analysts handling certain types of data to achieve a faster outcome.
Apache Spark is flexible to work with HDFS as well as with other data stores, for
example with OpenStack Swift or Apache Cassandra. It’s also quite easy to run Spark on a single
local system to make development and testing easier. Spark Core is the heart of the project,
and it facilitates many things like
distributed task transmission
scheduling
I/O functionality
Spark is an alternative to Hadoop’s MapReduce. Spark can run jobs 100 times faster
than Hadoop’s MapReduce.
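A minimal PySpark sketch of the in-memory processing described above is shown below. It assumes a working Spark installation, and the hdfs:// path is a placeholder rather than a real data set.

    # Minimal PySpark sketch: load a text file, cache it in memory, and run a word count.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("word-count-sketch").getOrCreate()

    lines = spark.read.text("hdfs:///data/sample.txt")   # placeholder path in HDFS
    lines.cache()                                        # keep the data in memory for reuse

    words = (lines.rdd
             .flatMap(lambda row: row.value.split())
             .map(lambda w: (w, 1))
             .reduceByKey(lambda a, b: a + b))

    print(words.take(10))
    spark.stop()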
3. Apache Storm
Apache Storm is a distributed real-time framework for reliably processing unbounded data
streams. The framework supports any programming language. The unique
features of Apache Storm are:
Massive scalability
Fault-tolerance
“fail fast, auto restart” approach
Guaranteed processing of every tuple
Written in Clojure
Runs on the JVM
Supports multiple languages
Supports protocols like JSON
Storm topologies can be considered similar to a MapReduce job. However, in the case of
Storm, it performs real-time stream processing instead of batch processing. Based on the
topology configuration, the Storm scheduler distributes the workloads to nodes. Storm can
interoperate with Hadoop's HDFS through adapters if needed, which is another point that makes it
useful as an open-source big data tool.
4. Cassandra
Apache Cassandra is a distributed database for managing large sets of data across
servers. It is one of the best big data tools for processing mainly structured data sets. It provides
a highly available service with no single point of failure. Additionally, it has certain capabilities
that no other relational or NoSQL database provides. These capabilities are:
Continuous availability as a data source
Linear scalable performance
Simple operations
Across the data centers easy distribution of data
Cloud availability points
Scalability
Performance
The Apache Cassandra architecture does not follow a master-slave model; all nodes play
the same role. It can handle numerous concurrent users across data centers. Hence, a new
node can be added to an existing cluster without downtime.
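As a hedged illustration of working with Cassandra from code, the sketch below uses the DataStax Python driver (cassandra-driver) against a locally running node; the keyspace, table, and sample row are invented for the example.

    # Sketch using the DataStax Python driver against a local Cassandra node.
    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])          # contact point for the local node
    session = cluster.connect()

    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS demo
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
    """)
    session.set_keyspace("demo")
    session.execute("CREATE TABLE IF NOT EXISTS users (id int PRIMARY KEY, name text)")

    session.execute("INSERT INTO users (id, name) VALUES (%s, %s)", (1, "Alice"))
    for row in session.execute("SELECT id, name FROM users"):
        print(row.id, row.name)

    cluster.shutdown()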
6. MongoDB
MongoDB is an open-source, cross-platform NoSQL database with many built-in features. It is
ideal for businesses that need fast, real-time data for instant decisions and for users who want
data-driven experiences. It works with the MEAN software stack, .NET applications, and the
Java platform.
Some notable features of MongoDB are:
It can store any type of data, such as integer, string, array, object, Boolean, date, etc.
It provides flexibility in cloud-based infrastructure.
It is flexible and easily partitions data across servers in a cloud structure.
MongoDB uses dynamic schemas, so you can prepare data on the fly and quickly.
This is another way of saving cost (see the short sketch after this list).
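The pymongo sketch below illustrates the dynamic-schema point from the list above. It assumes a local MongoDB instance, and the database, collection, and documents are invented; the two inserted documents deliberately carry different fields, yet no schema change is required.

    # Sketch using pymongo against a local MongoDB instance.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017/")
    orders = client["demo_db"]["orders"]

    orders.insert_one({"customer": "Alice", "items": ["laptop"], "total": 999.0})
    orders.insert_one({"customer": "Bob", "coupon": "SAVE10", "paid": True})  # different fields, no schema change

    for doc in orders.find({"customer": "Alice"}):
        print(doc)

    client.close()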
7. R Programming Tool
This is one of the most widely used open-source big data tools in the industry for
statistical analysis of data. The most positive part of this tool is that, although it is used for
statistical analysis, as a user you do not have to be a statistics expert. R has its own public library,
CRAN (the Comprehensive R Archive Network), which consists of more than 9,000 modules and
algorithms for statistical analysis of data.
R can run on Windows and Linux servers, as well as inside SQL Server. It also supports Hadoop
and Spark. Using R, one can work on discrete data and try out new analytical algorithms for
analysis. It is a portable language; hence, an R model built and tested on a local data source can
easily be deployed on other servers or even against a Hadoop data lake.
1.6 ANALYSIS VS REPORTING
1.6.1 INTRODUCTION TO ANALYSIS AND REPORTING
What is Analysis?
• The process of exploring data and reports
– in order to extract meaningful insights,
– which can be used to better understand and improve business performance.
• What is Reporting ?
• Reporting is
– “the process of organizing data
– into informational summaries
– in order to monitor how different areas of a business are performing.”
1.6.2 COMPARING ANALYSIS WITH REPORTING
• Reporting is “the process of organizing data in to informational summaries in order to
monitor how different areas of a business are performing.”
• Measuring core metrics and presenting them — whether in an email, a slidedeck, or
online dashboard — falls under this category.
• Analytics is “the process of exploring data and reports in order to extract meaningful
insights, which can be used to better understand and improve business performance.”
• Reporting helps companies to monitor their online business and be alerted when data
falls outside of expected ranges.
• Good reporting
• should raise questions about the business from its end users.
• The goal of analysis is
• to answer questions by interpreting the data at a deeper level and providing
actionable recommendations.
• A firm may be focused on the general area of analytics (strategy, implementation,
reporting, etc.)
– but not necessarily on the specific aspect of analysis.
• It is almost as if some organizations run out of gas after the initial set-up-related
activities and never make it to the analysis stage.