Module-4
7(a) Explain Map Reduce Execution steps with neat diagram.
MapReduce is a framework for processing large datasets in a distributed manner across a
cluster. It breaks down the processing into two main phases: Map and Reduce.
1. Input Splitting
The input data is split into smaller, manageable chunks, called input splits.
Each split is processed independently by the Mapper.
The underlying input data is stored in the Hadoop Distributed File System (HDFS); the splits are logical divisions of these files.
2. Mapping Phase
The Mapper function processes each input split and generates intermediate key-value
pairs.
For example, in a word count application, each word is emitted as a key with the value 1 (see the sketch after these steps).
3. Shuffling and Sorting
The intermediate key-value pairs from all Mappers are grouped by keys (shuffling).
Within each key group, the values are sorted to facilitate aggregation.
This step ensures that all values associated with a key are sent to the same Reducer.
4. Reducing Phase
The Reducer processes each group of intermediate key-value pairs.
It applies a reducing function (e.g., summation) to produce the final output for each
key.
For the word count example, the Reducer aggregates the counts for each word.
5. Output Phase
The final key-value pairs generated by the Reducer are written to HDFS.
This output can be used for further analysis or visualization.
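A minimal Python sketch of this flow (not the Hadoop API, just the same map → shuffle → reduce logic on made-up input lines standing in for input splits):

```python
from collections import defaultdict

# --- Map phase: emit (word, 1) for every word in every input split ---
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

# --- Shuffle and sort: group all values by key, sorted by key ---
def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

# --- Reduce phase: aggregate the values for each key ---
def reduce_phase(grouped):
    for key, values in grouped:
        yield (key, sum(values))

if __name__ == "__main__":
    splits = ["big data is big", "data is everywhere"]   # made-up input splits
    output = list(reduce_phase(shuffle(map_phase(splits))))
    print(output)   # [('big', 2), ('data', 2), ('everywhere', 1), ('is', 2)]
```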
7(b) What is HIVE? Explain HIVE Architecture.
Apache Hive is a data warehousing framework built on top of Hadoop. It is designed to
process structured and semi-structured data by providing a SQL-like interface, called
HiveQL, which translates SQL queries into MapReduce, Tez, or Spark jobs for execution.
Hive is particularly useful for querying and analyzing large datasets stored in the Hadoop
Distributed File System (HDFS).
1. User Interface (UI)
Provides an interface for users to interact with Hive.
Supports different modes of interaction:
o Hive CLI: Command-line interface for submitting queries.
o Web Interface: Browser-based interface for submitting queries and
managing Hive metadata.
o JDBC/ODBC Drivers: Connects Hive with external tools like Tableau or BI
systems.
2. Driver
Manages the lifecycle of a HiveQL query.
Responsibilities:
o Parsing: Converts SQL queries into Abstract Syntax Trees (ASTs).
o Logical Plan Generation: Creates an execution plan based on the query.
o Query Execution: Coordinates with the execution engine to run the query.
3. Compiler
Converts the logical query plan into a physical plan.
Optimizes the query for efficient execution.
Translates HiveQL queries into MapReduce, Tez, or Spark jobs.
4. Metastore
Stores metadata about the tables, columns, partitions, and the location of data files in
HDFS.
Acts as the central repository for all Hive-related metadata.
Uses an RDBMS (e.g., MySQL or Derby) to store this information.
5. Execution Engine
Executes the physical plan generated by the Compiler.
Works with Hadoop's MapReduce, Apache Tez, or Apache Spark to process data.
Coordinates with the Driver for task execution and progress updates.
6. HDFS
Stores the actual data processed by Hive.
Hive reads and writes data from HDFS using input and output formats compatible
with Hadoop.
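A hedged sketch of how an external client could submit a HiveQL query through the JDBC/ODBC-style path described above, assuming the PyHive client library is installed and a HiveServer2 endpoint is running at localhost:10000; the table name sales and the credentials are hypothetical:

```python
from pyhive import hive   # assumes the PyHive package is installed

# Connect to a HiveServer2 endpoint (host, port, and username are assumptions).
conn = hive.Connection(host="localhost", port=10000, username="hadoop")
cursor = conn.cursor()

# The query is parsed by the Driver, compiled into MapReduce/Tez/Spark jobs,
# and executed over data stored in HDFS; 'sales' is a hypothetical table.
cursor.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
for region, total in cursor.fetchall():
    print(region, total)

cursor.close()
conn.close()
```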
8(a) Explain Pig architecture for scripts dataflow and processing.
Apache Pig is a high-level platform built on top of Hadoop for processing large datasets. It
simplifies the complexity of writing MapReduce programs by providing a scripting language
called Pig Latin. Pig is designed for data transformation tasks such as ETL (Extract,
Transform, Load) operations and processing of large-scale data in Hadoop. Pig scripts can
run on large datasets across a Hadoop cluster, and Pig Latin scripts are automatically
translated into MapReduce jobs for execution.
1. Pig Latin Script
Pig Latin is a scripting language for expressing data transformations.
It abstracts complex MapReduce code into simpler commands.
Pig Latin allows the user to specify how data should be processed (e.g., filtering,
joining, and grouping) without dealing with low-level MapReduce code.
2. Grunt Shell
The Grunt Shell is the interactive shell for executing Pig Latin scripts.
It acts as the front-end interface where users can run scripts, query data, and interact
with the system.
3. Parser
The parser takes a Pig Latin script as input and performs parsing to generate a logical
plan.
It checks the syntax of the script and translates it into a format that can be processed
by the Pig execution engine.
The logical plan contains the steps and operations to be performed on the data.
4. Optimizer
The optimizer improves the performance of the logical plan by applying various
optimization techniques.
It makes adjustments to the logical plan, such as combining operations and pushing filters and projections earlier in the dataflow, so that less data is carried through later stages.
5. Compiler
The compiler converts the logical plan into a physical plan.
The physical plan specifies how the logical plan will be executed, defining the
sequence of operations, including MapReduce jobs.
The compiler ensures that the data flow and processing tasks are optimized for
efficient execution.
6. Execution Engine
The execution engine is responsible for running the physical plan.
It coordinates the execution of the jobs, sending tasks to the Hadoop cluster, where
MapReduce jobs are executed.
The engine handles job scheduling and resource management.
7. HDFS (Hadoop Distributed File System)
Pig operates over HDFS, storing and retrieving input and output data.
All data processed by Pig is read from and written to HDFS, leveraging Hadoop's
distributed storage capabilities.
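The stages above are easiest to see on a concrete dataflow. The sketch below is plain Python, not Pig Latin; it only mimics the kind of FILTER → GROUP → COUNT pipeline a Pig Latin script would express, using made-up log records as input:

```python
from itertools import groupby
from operator import itemgetter

# Made-up input records: (user, page, HTTP status)
records = [
    ("alice", "/home", 200), ("bob", "/home", 404),
    ("alice", "/cart", 200), ("carol", "/home", 200),
]

# FILTER: keep only successful requests (status == 200)
ok = [r for r in records if r[2] == 200]

# GROUP BY page, then COUNT the records in each group
ok.sort(key=itemgetter(1))
counts = {page: len(list(group)) for page, group in groupby(ok, key=itemgetter(1))}

print(counts)   # {'/cart': 1, '/home': 2}
```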
8(b) Explain Key Value pairing in Map Reduce.
In the MapReduce programming model, key-value pairing is a fundamental concept that
helps organize, process, and aggregate data efficiently. The MapReduce framework processes
data in the form of key-value pairs, where each unit of data is represented by a key and an
associated value. These key-value pairs are used throughout the different stages of the
MapReduce process: Map phase, Shuffle and Sort phase, and Reduce phase.
Key-value pairing enables the MapReduce system to divide, process, and aggregate large
datasets in a distributed manner, making it suitable for handling vast amounts of data in
parallel across a cluster.
For example, in word count the Map phase emits (word, 1) pairs, the Shuffle and Sort phase groups them into (word, [1, 1, ...]), and the Reduce phase emits (word, total count).
(The phase-by-phase flow is the same as the MapReduce execution steps described in 7(a).)
Module-5
9(a) What is Machine Learning? Explain different types of Regression Analysis.
Machine Learning (ML) is a subset of Artificial Intelligence (AI) that involves developing algorithms
that enable computers to learn patterns from data and make decisions or predictions without explicit
programming. Instead of being hard-coded with specific instructions, ML systems improve their
performance on a task over time by analyzing data. It is widely used in areas like recommendation
systems, speech recognition, and financial forecasting.
Simple Linear Regression: the basic form of regression analysis, modeling the
relationship between a single dependent variable and one independent variable. The
relationship is represented by a straight line, and both variables are continuous.
Equation: y = mx + c, where y is the predicted value, x is the independent variable,
m is the slope, and c is the intercept.
Multiple Linear Regression: an extension of simple linear regression that models the
relationship between a dependent variable and two or more independent variables.
Equation: Y = a + b1X1 + b2X2 + … + bnXn + e, where e is the error term.
Polynomial Regression: used when the relationship between the independent
variable and the dependent variable is not linear but can be modeled by a polynomial
equation. The relationship is represented by a curve.
Equation: f(x) = b0 + b1x + b2x^2 + …
Logistic Regression: This is used when the dependent variable is binary (e.g., 0 or
1, true or false). It models the probability of the outcome occurring as a function of
the independent variables.
Ridge Regression: includes a regularization term that penalizes the size of the
coefficients. It is used to prevent overfitting, especially when dealing with
multicollinearity. Penalty term added to the least-squares loss: λ∑β^2
Lasso Regression: similar to ridge regression, lasso also includes a regularization
term; however, it can shrink some coefficients to exactly zero, effectively
performing variable selection. Penalty term added to the least-squares loss: λ∑|β|
Nonlinear Regression: used when the relationship between the dependent and
independent variables is not linear. It can model more complex relationships than a
straight line (a code sketch of some of these models follows this list).
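A hedged sketch comparing three of these regressors with scikit-learn (assuming scikit-learn and NumPy are installed; the data is synthetic):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Synthetic data: y is roughly 3x + 2 with a little noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 3 * X.ravel() + 2 + rng.normal(0, 0.5, size=50)

for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)):
    model.fit(X, y)
    # coef_ is the slope (m) and intercept_ is c in y = mx + c
    print(type(model).__name__, model.coef_[0], model.intercept_)
```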
9(b) Explain with neat diagram K-means clustering.
K-Means Clustering is an unsupervised machine learning algorithm used to partition a dataset
into K distinct groups (clusters) based on similarities. It minimizes the distance between
data points within the same cluster while maximizing the distance between data points
in different clusters. Each cluster is represented by its centroid (the average of all points in
the cluster).
1. Choose the number of clusters k
The first step in k-means is to pick the number of clusters, k.
2. Select k random points from the data as initial centroids
Next, we randomly select a centroid for each cluster. For example, if we want 2
clusters, k equals 2 and we pick two random points as the initial centroids.
3. Assign all the points to the closest cluster centroid
Once we have initialized the centroids, we assign each point to the closest cluster
centroid.
4. Recompute the centroids of newly formed clusters
Once we have assigned all of the points to a cluster, the next step is to compute the
centroids of the newly formed clusters.
5. Repeat steps 3 and 4
We then repeat steps 3 and 4 until one of the stopping criteria below is met (a code
sketch follows the criteria).
There are essentially three stopping criteria that can be adopted to stop the K-means
algorithm:
1. Centroids of newly formed clusters do not change
2. Points remain in the same cluster
3. Maximum number of iterations is reached
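A minimal NumPy sketch of these steps (random initialization, assignment, centroid update, repeat until the centroids stop changing); the 2-D points are made up:

```python
import numpy as np

def kmeans(points, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick k random data points as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: assign each point to its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute the centroids of the newly formed clusters
        new_centroids = np.array([points[labels == i].mean(axis=0) for i in range(k)])
        # Stopping criterion 1: centroids of newly formed clusters do not change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

if __name__ == "__main__":
    pts = np.array([[1, 1], [1.5, 2], [1, 0], [8, 8], [9, 9], [8, 9]], dtype=float)
    labels, centroids = kmeans(pts, k=2)
    print(labels)      # two clear clusters: one near (1, 1), one near (8, 9)
    print(centroids)
```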
9(c) Explain Naïve Bayes Theorem with example.
Naïve Bayes is a probabilistic machine learning algorithm based on Bayes' Theorem. It is
used for classification tasks and assumes that features are independent of each other (the
"naïve" assumption). Despite this simplification, it performs well for many real-world
applications such as spam detection and sentiment analysis.
Bayes' Theorem
Bayes' Theorem calculates the probability of a hypothesis (H) given evidence (E). It is
expressed as:
P(H∣E)=P(E∣H)⋅P(H)/P(E)
Where:
P(H∣E): Posterior probability (probability of hypothesis H given evidence E).
P(E∣H): Likelihood (probability of evidence E given H).
P(H): Prior probability (initial probability of H).
P(E): Marginal probability (total probability of evidence E).
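A small worked example (all counts are made up): classify a message containing the word "free" as spam or not, by plugging the terms above into Bayes' Theorem.

```python
# Made-up statistics from a hypothetical email corpus
p_spam = 0.4                 # P(Spam): prior
p_ham = 0.6                  # P(Not spam)
p_free_given_spam = 0.30     # P("free" | Spam): likelihood
p_free_given_ham = 0.05      # P("free" | Not spam)

# Marginal probability of the evidence: P("free")
p_free = p_free_given_spam * p_spam + p_free_given_ham * p_ham

# Posterior: P(Spam | "free") = P("free" | Spam) * P(Spam) / P("free")
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(round(p_spam_given_free, 3))   # 0.8 -> the message is classified as spam
```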
10(a) Explain five phases in a process pipeline text mining.
1. Text Preprocessing-This phase cleans and prepares the raw text data for analysis.
Steps:
o Tokenization: Splitting the text into individual words or phrases (tokens).
o Stopword Removal: Eliminating common words (e.g., "the," "and") that
do not contribute to the meaning.
o Stemming/Lemmatization: Reducing words to their root form (e.g.,
"running" → "run").
o Lowercasing: Converting text to lowercase to maintain uniformity.
o Removing Noise: Eliminating punctuation, numbers, and special
characters.
2. Feature Generation-This phase transforms the pre-processed text into a structured format that can
be analyzed by machine learning algorithms.
Methods:
o Bag-of-Words (BoW): Representing text as word frequency vectors.
o TF-IDF: Assigning weights to words based on their importance in a
document relative to a corpus.
o Word Embeddings: Using techniques like Word2Vec or GloVe to represent
words as vectors in a continuous space.
3. Feature Selection-This phase involves applying techniques to standardize or enhance the
text representation for specific tasks.
Techniques:
o Part-of-Speech (POS) Tagging: Labeling words with their grammatical
roles (e.g., noun, verb).
o Named Entity Recognition (NER): Identifying entities like names,
locations, or organizations.
o Topic Modeling: Identifying underlying themes or topics in the text using
methods like LDA (Latent Dirichlet Allocation).
4. Data Mining-This phase analyzes the structured data to uncover patterns, relationships, or trends.
Techniques:
o Clustering: Grouping similar documents or sentences (e.g., K-Means or
DBSCAN).
o Classification: Categorizing text into predefined labels (e.g., spam vs. non-
spam).
o Sentiment Analysis: Determining the emotional tone of the text (positive,
negative, neutral).
5. Analyzing Results-The final phase involves evaluating the results and interpreting the
findings.
Steps:
o Accuracy Metrics: Use precision, recall, F1-score, or confusion matrices to
measure model performance.
o Visualizations: Use graphs or word clouds to represent insights effectively.
o Actionable Insights: Extract useful conclusions that support decision-
making.
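A hedged sketch of phases 1 and 2 using scikit-learn's TfidfVectorizer (assuming scikit-learn is installed; the two documents are made up). It lowercases, tokenizes, removes English stopwords, and produces a TF-IDF matrix ready for the data mining phase:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The cats are running in the garden.",
    "A cat ran across the garden yesterday.",
]

# Preprocessing + feature generation: lowercasing, tokenization,
# stopword removal, and TF-IDF weighting in one step.
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
tfidf = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())   # vocabulary after cleaning
print(tfidf.toarray().round(2))             # TF-IDF matrix (documents x terms)
```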
10(b) Explain Web Usage Mining.
Web usage mining discovers and analyses the patterns in click streams. Web usage mining
also includes associated data generated and collected as a consequence of user interactions
with web resources.
The phases are:
1. Pre-processing - Converts the usage information collected from the various data sources
into the data abstractions necessary for pattern discovery.
2. Pattern discovery - Exploits methods and algorithms developed from fields, such as
statistics, data mining, ML and pattern recognition.
3. Pattern analysis - Filters out uninteresting rules or patterns from the set found during the
pattern discovery phase.
Preprocessing-The raw data is cleaned and prepared for analysis.
Tasks:
o Removing irrelevant records (e.g., failed HTTP requests).
o Identifying unique users and sessions.
o Data formatting and transformation.
Example: Converting timestamps into session durations or categorizing URLs.
Pattern Discovery-This step applies data mining techniques to identify patterns in user behaviour.
Techniques:
o Clustering: Grouping users with similar browsing behavior.
o Association Rules: Identifying frequently accessed page sequences.
o Sequential Pattern Mining: Finding the order of page visits.
o Classification: Categorizing users based on behavior (e.g., new vs. returning users).
Pattern Analysis-Interpreting the patterns discovered to derive actionable insights.
Tools: Visualization tools, statistical analysis, and query languages.
Example: Identifying that users often navigate from "Homepage → Products → Checkout."
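A minimal Python sketch of the preprocessing and pattern discovery ideas above: parse made-up clickstream records, drop failed requests, and count page-to-page transitions per user.

```python
from collections import Counter, defaultdict

# Made-up clickstream: (user, page, HTTP status), already ordered by time
clicks = [
    ("u1", "/home", 200), ("u1", "/products", 200), ("u1", "/checkout", 200),
    ("u2", "/home", 200), ("u2", "/products", 404), ("u2", "/home", 200),
]

# Preprocessing: remove failed requests and group pages into per-user sessions
sessions = defaultdict(list)
for user, page, status in clicks:
    if status == 200:
        sessions[user].append(page)

# Pattern discovery: count page-to-page transitions across all sessions
transitions = Counter()
for pages in sessions.values():
    transitions.update(zip(pages, pages[1:]))

print(transitions.most_common())   # e.g. ('/home', '/products') appears once
```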
EXTRA
Outliers -Outliers are data points that differ significantly from the majority of the data in a
dataset. They are values that fall far outside the normal range, either much higher or much
lower than most other values, potentially indicating variability in the data, errors, or rare
events.
Variance -Variance is a statistical measure that represents the degree of spread or dispersion
of a set of values. It is the average of the squared differences between each data point and the
mean of the dataset. High variance indicates that the data points are widely spread out, while
low variance means they are closely clustered around the mean.
Probability Distribution -A probability distribution is a mathematical function that describes
the likelihood of different outcomes in an experiment. It assigns probabilities to each possible
outcome in a sample space, and it can be discrete (for countable outcomes) or continuous (for
a range of outcomes).
Correlation -Correlation is a statistical measure that indicates the strength and direction of the
relationship between two variables. It can be positive (both variables increase or decrease
together), negative (one variable increases while the other decreases), or zero (no linear
relationship). The correlation coefficient quantifies this relationship on a scale from -1 to +1.
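A short NumPy sketch of these measures on a made-up sample: variance, a simple 1.5 × IQR rule for flagging outliers, and the correlation coefficient between two variables.

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9, 42], dtype=float)   # 42 is an outlier

# Variance: average of the squared differences from the mean
print("variance:", data.var())

# Outliers via the 1.5 * IQR rule
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
print("outliers:", outliers)                      # [42.]

# Correlation between two variables (close to +1 means a strong positive relationship)
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 6], dtype=float)
print("correlation:", np.corrcoef(x, y)[0, 1])
```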