
Bda Unit 4 PPT 2

Uploaded by

PRIANSHU KHALDE
Copyright
© All Rights Reserved

Nutan College of

Engineering and Research

Big Data Analytics


UNIT-4: Big Data Applications 6 Hr
► Overview of Big Data Machine Learning,

► Mahout,

► Big Data Machine learning Algorithms in Mahout- kmeans, Naive Bayes etc.

► Machine learning with Spark,

► Machine Learning Algorithms in Spark,

► Spark MLlib,

► Deep Learning for Big Data,

► Graph Processing: Pregel, Giraph, Spark GraphX


Overview of Big Data Machine Learning
► In the past few years, more data has been produced than in the millennia of human
history before.

► This data represents a gold mine in terms of commercial value and also important
reference material for policy makers.

► But much of this value will stay untapped — or, worse, be misinterpreted — as long as
the tools necessary for processing the staggering amount of information remain
unavailable.
What is Machine Learning?

► The core of machine learning consists of self-learning algorithms that evolve by
continuously improving at their assigned task.

► When structured correctly and fed proper data, these algorithms eventually produce
results in the contexts of pattern recognition and predictive modeling.
► For machine-learning algorithms, data is like exercise: the more the better.

► Algorithms fine-tune themselves with the data they train on in the same way Olympic
athletes hone their bodies and skills by training every day.

► Many programming languages work with machine learning, including Python,
R, Java, JavaScript and Scala.

► Python is the preferred choice for many developers, in part because of libraries such
as TensorFlow, which offers a comprehensive ecosystem of machine-learning tools.
What is Big Data?

► Data consists of numbers, words, measurements and observations formatted in ways
computers can process. Big data refers to vast sets of that data, either structured or
unstructured.

► The digital era presents a challenge for traditional data-processing software: information
becomes available in such volume, velocity and variety that it ends up outpacing
human-centered computation.
► Good data analysis requires someone with business acumen, programming knowledge
and a comprehensive skill set of math and analytic techniques.

► But how can a professional armed with traditional techniques sort through millions of
credit card scores, or billions of social media interactions? That’s where machine
learning comes in.
Big Data Meets Machine Learning

► Machine-learning algorithms become more effective as the size of training datasets
grows.

► So when combining big data with machine learning, we benefit twice: the algorithms
help us keep up with the continuous influx of data, while the volume and variety of the
same data feed the algorithms and help them grow.

► By feeding big data to a machine-learning algorithm, we might expect to see defined
and analyzed results, like hidden patterns and analytics, that can assist in predictive
modeling.
Machine Learning Applications for Big Data

1. Cloud Networks

► A research firm has a large amount of medical data it wants to study, but in order to do
so on-premises it needs servers, online storage, networking and security assets, all of
which add up to an unreasonable expense.

► Instead, the firm decides to invest in Amazon EMR, a cloud service that offers
data-analysis models within a managed framework.

► Machine-learning models of this sort include GPU-accelerated image recognition and
text classification.

► These algorithms don’t learn once they are deployed, so they can be distributed and
supported by a content-delivery network (CDN).
2. Web Scraping

► Let’s imagine that a manufacturer of kitchen appliances learns about market tendencies
and customer-satisfaction trends from a retailer’s quarterly reports.

► In their desire to find out what the reports might have left out, the manufacturer decides
to web-scrape the enormous amount of existing data that pertains to online customer
feedback and product reviews.
► By aggregating this data and feeding it to a deep-learning model, the manufacturer
learns how to improve and better describe its products, resulting in increased sales.

► While web scraping generates a huge amount of data, it’s worthwhile to note that
choosing the sources for this data is the most important part of the process.
3. Mixed-Initiative Systems

► The recommendation system that suggests titles on your Netflix homepage employs
collaborative filtering: It uses big data to track your history (and everyone else’s) and
machine-learning algorithms to decide what it should recommend next.

► Smart-car manufacturers implement big data and machine learning in the
predictive-analytics systems that run their products. Tesla cars, for example,
communicate with their drivers and respond to external stimuli by using data to make
algorithm-based decisions.
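
The collaborative filtering idea mentioned for the recommendation system above can be sketched in a few lines. This is only an illustration under stated assumptions: the users, titles, ratings, and the function names `cosine_similarity` and `recommend` are all invented here, and real services use far more elaborate models. Unseen titles are scored by summing other users' ratings, weighted by how similar each user is to the target user.

```python
from math import sqrt

# Hypothetical viewing history: user -> {title: rating on a 1-5 scale}.
ratings = {
    "asha":  {"Title A": 5, "Title B": 4, "Title C": 1},
    "ben":   {"Title A": 4, "Title B": 5, "Title D": 4},
    "chloe": {"Title C": 5, "Title D": 2, "Title E": 4},
}

def cosine_similarity(a, b):
    """Cosine similarity of two rating dicts; unrated titles count as 0."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def recommend(user, ratings, top_n=2):
    """Rank titles the user has not rated by similarity-weighted ratings."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        for title, rating in other_ratings.items():
            if title not in ratings[user]:
                scores[title] = scores.get(title, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend("asha", ratings))  # ['Title D', 'Title E']
```

"Title D" ranks first because it was rated highly by "ben", whose history is closest to that of "asha".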
Mahout
► We are living in a day and age where information is available in abundance.

► The information overload has scaled to such heights that sometimes it becomes difficult
to manage our little mailboxes! Imagine the volume of data and records some of the
popular websites (the likes of Facebook, Twitter, and YouTube) have to collect and
manage on a daily basis.

► Normally we fall back on data mining algorithms to analyze bulk data to identify trends
and draw conclusions.

► However, no data mining algorithm can be efficient enough to process very large
datasets and provide outcomes quickly, unless the computational tasks are run on
multiple machines distributed over the cloud.

► We now have new frameworks that allow us to break down a computation task into
multiple segments and run each segment on a different machine.

► Mahout is one such data mining framework. It normally runs on top of the Hadoop
infrastructure to manage huge volumes of data.
► Data mining is the process of automatically discovering patterns,
relationships, and insights from large amounts of data. This
involves using various techniques, such as machine learning,
statistics, and database systems, to identify valuable patterns and
relationships that can help organizations make better decisions.
► Data mining is often used in a wide range of fields, including
business, finance, healthcare, and marketing, to name a few.
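
The idea of breaking a computation into segments and combining the partial results can be sketched as follows. This is a toy illustration, not Hadoop's or Mahout's API: threads on one machine stand in for the machines of a cluster, and the function names (`count_words`, `split_into_segments`, `distributed_word_count`) are invented for this example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_words(segment):
    """Work done on one segment: count the words in its lines."""
    return Counter(word.lower() for line in segment for word in line.split())

def split_into_segments(lines, n_segments):
    """Cut the input into roughly equal segments, one per worker."""
    size = max(1, -(-len(lines) // n_segments))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]

def distributed_word_count(lines, n_segments=3):
    """Process each segment on its own worker, then merge the partial counts."""
    segments = split_into_segments(lines, n_segments)
    with ThreadPoolExecutor(max_workers=n_segments) as pool:
        partial_counts = list(pool.map(count_words, segments))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)  # combine step: add up the partial results
    return total

lines = ["big data", "Big Data analytics", "data mining"]
print(distributed_word_count(lines)["data"])  # 3
```

The merged result is the same as counting all lines in one place; only the work is divided.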
What is Apache Mahout?

► A mahout is one who drives an elephant as its master. The name comes from its close
association with Apache Hadoop which uses an elephant as its logo.

► Hadoop is an open-source framework from Apache that allows users to store and process big
data in a distributed environment across clusters of computers using simple
programming models.

► Apache Mahout is an open source project that is primarily used for creating scalable
machine learning algorithms. It implements popular machine learning techniques such
as:

• Recommendation
• Classification
• Clustering
► Apache Mahout started as a sub-project of Apache’s Lucene in 2008. In 2010, Mahout
became a top level project of Apache.
Features of Mahout:

The primary features of Apache Mahout are listed below.

• The algorithms of Mahout are written on top of Hadoop, so it works well in a
distributed environment. Mahout uses the Apache Hadoop library to scale
effectively in the cloud.

• Mahout offers the coder a ready-to-use framework for doing data mining tasks on
large volumes of data.
• Mahout lets applications analyze large sets of data effectively and quickly.

• Includes several MapReduce-enabled clustering implementations such as
k-means, fuzzy k-means, Canopy, Dirichlet, and Mean-Shift.

• Supports Distributed Naive Bayes and Complementary Naive Bayes
classification implementations.

• Comes with distributed fitness function capabilities for evolutionary programming.

• Includes matrix and vector libraries.


Applications of Mahout:

• Companies such as Adobe, Facebook, LinkedIn, Foursquare, Twitter, and Yahoo use Mahout
internally.

• Foursquare helps you find places, food, and entertainment available in a particular
area. It uses the recommender engine of Mahout.

• Twitter uses Mahout for user interest modelling.

• Yahoo! uses Mahout for pattern mining.


Big Data Machine learning Algorithms in Mahout- kmeans
► A cluster is a group of similar objects.

► Clustering in Mahout means grouping data of any form into sets with similar
characteristics.

► Clustering is dividing data points into homogeneous classes or clusters, such that the points in
the same group are as similar as possible, while those in different groups are as dissimilar as
possible.
► When a collection of objects is given, they are divided into groups based on similarity.

K-Means Clustering:

► K-means clustering, introduced by MacQueen in 1967, is one of the simplest unsupervised
learning algorithms that solve the well-known clustering problem.

► K-means clustering is a method of vector quantization that originated in signal processing
and is now a popular technique for cluster analysis in data mining.

Given k, the k-means algorithm can be executed in the following steps:

• Partition the objects into k non-empty subsets.

• Identify the cluster centroid (mean point) of each subset in the current partition.

• Compute the distance of each point from every centroid and assign the point to the cluster
whose centroid is nearest.

• After re-allocating the points, recompute the centroid of each new cluster.

• Repeat the assignment and recomputation until the assignments no longer change.
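
The steps above can be sketched in plain Python. This is a minimal single-machine illustration, not Mahout's implementation. It makes several assumptions of its own: the function name `kmeans`, Euclidean distance, a fixed random `seed`, and starting from k randomly chosen data points as centroids (a common variant of the initial partition step).

```python
import random
from math import dist  # Euclidean distance, Python 3.8+

def kmeans(points, k, max_iter=100, seed=0):
    """Return (centroids, clusters) for a list of coordinate tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumed initialisation: k data points
    for _ in range(max_iter):
        # Assignment: put each point in the cluster with the nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # nothing moved, so we have converged
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (9, 9), (9, 10), (10, 9)]
centroids, clusters = kmeans(points, k=2)
for centroid, cluster in zip(centroids, clusters):
    print(centroid, cluster)
```

On this toy input the two well-separated groups end up in separate clusters. On real data the result can depend on the starting centroids, which is why the seed is exposed as a parameter.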
K-Means: Pizza Hut Clustering Example:

Let’s consider an example that groups Pizza Hut delivery points.

We can solve this by using K-Means clustering, which is one of the algorithms
under the umbrella of clustering.

The algorithm makes a centroid and from there it calculates the distance between the centroid and
the points.
It then finds the minimal distance and groups together the points closest to each centroid.

Given the delivery locations for pizza, the first task is to group them.

If we need three delivery locations, that is, three clusters or groups of the records we have
acquired, we find the distance between each centroid and the delivery points.
If the grouping is not sufficient or does not give the closest results, we re-position each centroid
nearer to its points and group them again, so as to minimize the distance between the
cluster centroids and the data points.

Then we find the distances again.

None of this needs to be done manually, as everything is done by the algorithm.

The only thing one has to do is study the inferential statistics: the outcome of the Mahout
algorithm must be interpreted to judge whether the grouping we are getting is right or wrong.

Once we have established this, we group the sets of data that lie a very small distance apart and
share similar characteristics.

This way, clustering brings together similar kinds of data or common sets of information.
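
The quantity being optimized when centroids are re-positioned can be written down explicitly. The notation below is introduced here for illustration and is not from the slides: C_j is the j-th cluster, μ_j its centroid, and x_i a data point (for example, a delivery location). K-means tries to make the total squared distance J as small as possible.

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i
```

A lower J for the same k indicates tighter groups, which gives one way to judge whether a grouping is acceptable.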
Big Data Machine learning Algorithms in Mahout- Naive Bayes
Naïve Bayes:

► The Naïve Bayes algorithm is a supervised learning algorithm that is based on Bayes'
theorem and used for solving classification problems.

► It is mainly used in text classification, which involves high-dimensional training datasets.

► The Naïve Bayes classifier is one of the simplest and most effective classification algorithms. It
helps in building fast machine learning models that can make quick predictions.

► It is a probabilistic classifier, which means it predicts on the basis of the probability of an
object belonging to each class.

► Some popular applications of the Naïve Bayes algorithm are spam filtering, sentiment analysis,
and classifying articles.
Why is it called Naïve Bayes?

The name Naïve Bayes is made up of two words, Naïve and Bayes, which can be described
as:

► Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features. For example, if a fruit is identified on the basis of
color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Hence
each feature individually contributes to identifying it as an apple, without depending on the
others.

► Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.

Bayes' Theorem:

Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the
probability of a hypothesis with prior knowledge. It depends on the conditional probability.
The formula for Bayes' theorem is given as:

P(A|B) = P(B|A) · P(A) / P(B)

Where,
► P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.

► P(B|A) is Likelihood probability: Probability of the evidence given that the
hypothesis is true.
► P(A) is Prior Probability: Probability of hypothesis before observing the evidence.
► P(B) is Marginal Probability: Probability of Evidence.
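
Combining Bayes' theorem with the naive independence assumption gives the rule a Naïve Bayes classifier typically applies. As a sketch, suppose the evidence B consists of n feature values b_1, ..., b_n (this notation is introduced here and is not from the slides). Because the features are assumed independent given the hypothesis A, the likelihood factorizes into a product:

```latex
P(A \mid b_1, \dots, b_n) = \frac{P(A)\,\prod_{i=1}^{n} P(b_i \mid A)}{P(B)},
\qquad
\hat{A} = \arg\max_{A} \; P(A) \prod_{i=1}^{n} P(b_i \mid A)
```

Since P(B) is the same for every hypothesis, it can be dropped when only the most probable hypothesis is needed.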
Working of Naïve Bayes' Classifier:

Suppose we have a dataset of weather conditions and corresponding target variable "Play".

So, using this dataset, we need to decide whether or not to play on a particular day
according to the weather conditions.

To solve this problem, we follow the steps below:

1. Convert the given dataset into frequency tables.

2. Generate Likelihood table by finding the probabilities of given features.

3. Now, use Bayes theorem to calculate the posterior probability.
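
The three steps can be traced on a small example. The observations below are invented for illustration (the slides do not give the dataset), and only one feature, the outlook, is used. With a single feature the calculation reduces to a direct application of Bayes' theorem. With several features, the per-feature likelihoods would be multiplied together. The names `posterior` and `predict` are likewise hypothetical.

```python
from collections import Counter

# Hypothetical observations of (outlook, play), invented for this sketch.
data = (
    [("Overcast", "Yes")] * 5
    + [("Rainy", "Yes")] * 2 + [("Rainy", "No")] * 2
    + [("Sunny", "Yes")] * 3 + [("Sunny", "No")] * 2
)
total = len(data)

# Step 1: frequency tables.
joint_counts = Counter(data)                              # (outlook, play) -> count
class_counts = Counter(play for _, play in data)          # play -> count
outlook_counts = Counter(outlook for outlook, _ in data)  # outlook -> count

# Step 2: likelihood table P(outlook | play).
likelihood = {
    (outlook, play): joint_counts[(outlook, play)] / class_counts[play]
    for outlook in outlook_counts
    for play in class_counts
}

# Step 3: posterior P(play | outlook) from Bayes' theorem.
def posterior(play, outlook):
    prior = class_counts[play] / total          # P(play)
    evidence = outlook_counts[outlook] / total  # P(outlook)
    return likelihood[(outlook, play)] * prior / evidence

def predict(outlook):
    """Pick the class with the highest posterior probability."""
    return max(class_counts, key=lambda play: posterior(play, outlook))

print(round(posterior("Yes", "Sunny"), 2))  # 0.6
print(predict("Sunny"))                     # Yes
```

Here P(Yes | Sunny) = (3/10 · 10/14) / (5/14) = 0.6, which is higher than P(No | Sunny) = 0.4, so the prediction for a sunny day is to play.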


Machine learning with Spark
The Apache Spark machine learning library (MLlib) allows data scientists to focus on their data
problems and models instead of solving the complexities surrounding distributed data (such as
infrastructure, configurations, and so on).
Deep learning in big data Analytics
Deep learning methods are extensively applied to various fields of science and
engineering such as speech recognition, image classifications, and learning methods in
language processing.
THANK YOU
