Module-4
7(a) Explain Map Reduce Execution steps with neat diagram.
MapReduce is a framework for processing large datasets in a distributed manner across a
cluster. It breaks down the processing into two main phases: Map and Reduce.
1. Input Splitting
The input data is split into smaller, manageable chunks, called input splits.
Each split is processed independently by the Mapper.
The underlying input data is stored in the Hadoop Distributed File System (HDFS); the splits are logical divisions of these files.
2. Mapping Phase
The Mapper function processes each input split and generates intermediate key-value
pairs.
For example, in a word count application, each word is emitted as a key with the value 1 (see the sketch after these steps).
3. Shuffling and Sorting
The intermediate key-value pairs from all Mappers are grouped by keys (shuffling).
Within each key group, the values are sorted to facilitate aggregation.
This step ensures that all values associated with a key are sent to the same Reducer.
4. Reducing Phase
The Reducer processes each group of intermediate key-value pairs.
It applies a reducing function (e.g., summation) to produce the final output for each
key.
For the word count example, the Reducer aggregates the counts for each word.
5. Output Phase
The final key-value pairs generated by the Reducer are written to HDFS.
This output can be used for further analysis or visualization.
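A minimal Python sketch of this flow (not the Hadoop API, just the same map → shuffle → reduce logic on made-up input lines standing in for input splits):

```python
from collections import defaultdict

# --- Map phase: emit (word, 1) for every word in every input split ---
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

# --- Shuffle and sort: group all values by key, sorted by key ---
def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

# --- Reduce phase: aggregate the values for each key ---
def reduce_phase(grouped):
    for key, values in grouped:
        yield (key, sum(values))

if __name__ == "__main__":
    splits = ["big data is big", "data is everywhere"]   # made-up input splits
    output = list(reduce_phase(shuffle(map_phase(splits))))
    print(output)   # [('big', 2), ('data', 2), ('everywhere', 1), ('is', 2)]
```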
7(b) What is HIVE? Explain HIVE Architecture.
Apache Hive is a data warehousing framework built on top of Hadoop. It is designed to
process structured and semi-structured data by providing a SQL-like interface, called
HiveQL, which translates SQL queries into MapReduce, Tez, or Spark jobs for execution.
Hive is particularly useful for querying and analyzing large datasets stored in the Hadoop
Distributed File System (HDFS).
1. User Interface (UI)
Provides an interface for users to interact with Hive.
Supports different modes of interaction:
o Hive CLI: Command-line interface for submitting queries.
o Web Interface: Browser-based interface for submitting queries and
managing Hive metadata.
o JDBC/ODBC Drivers: Connects Hive with external tools like Tableau or BI
systems.
2. Driver
Manages the lifecycle of a HiveQL query.
Responsibilities:
o Parsing: Converts SQL queries into Abstract Syntax Trees (ASTs).
o Logical Plan Generation: Creates an execution plan based on the query.
o Query Execution: Coordinates with the execution engine to run the query.
3. Compiler
Converts the logical query plan into a physical plan.
Optimizes the query for efficient execution.
Translates HiveQL queries into MapReduce, Tez, or Spark jobs.
4. Metastore
Stores metadata about the tables, columns, partitions, and the location of data files in
HDFS.
Acts as the central repository for all Hive-related metadata.
Uses an RDBMS (e.g., MySQL or Derby) to store this information.
5. Execution Engine
Executes the physical plan generated by the Compiler.
Works with Hadoop's MapReduce, Apache Tez, or Apache Spark to process data.
Coordinates with the Driver for task execution and progress updates.
6. HDFS
Stores the actual data processed by Hive.
Hive reads and writes data from HDFS using input and output formats compatible
with Hadoop.
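A hedged sketch of how an external client could submit a HiveQL query through the JDBC/ODBC-style path described above, assuming the PyHive client library is installed and a HiveServer2 endpoint is running at localhost:10000; the table name sales and the credentials are hypothetical:

```python
from pyhive import hive   # assumes the PyHive package is installed

# Connect to a HiveServer2 endpoint (host, port, and username are assumptions).
conn = hive.Connection(host="localhost", port=10000, username="hadoop")
cursor = conn.cursor()

# The query is parsed by the Driver, compiled into MapReduce/Tez/Spark jobs,
# and executed over data stored in HDFS; 'sales' is a hypothetical table.
cursor.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
for region, total in cursor.fetchall():
    print(region, total)

cursor.close()
conn.close()
```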
8(a) Explain Pig architecture for scripts dataflow and processing.
Apache Pig is a high-level platform built on top of Hadoop for processing large datasets. It
simplifies the complexity of writing MapReduce programs by providing a scripting language
called Pig Latin. Pig is designed for data transformation tasks such as ETL (Extract,
Transform, Load) operations and processing of large-scale data in Hadoop. Pig scripts can
run on large datasets across a Hadoop cluster, and Pig Latin scripts are automatically
translated into MapReduce jobs for execution.
1. Pig Latin Script
Pig Latin is a scripting language for expressing data transformations.
It abstracts complex MapReduce code into simpler commands.
Pig Latin allows the user to specify how data should be processed (e.g., filtering,
joining, and grouping) without dealing with low-level MapReduce code.
2. Grunt Shell
The Grunt Shell is the interactive shell for executing Pig Latin scripts.
It acts as the front-end interface where users can run scripts, query data, and interact
with the system.
3. Parser
The parser takes a Pig Latin script as input and performs parsing to generate a logical
plan.
It checks the syntax of the script and translates it into a format that can be processed
by the Pig execution engine.
The logical plan contains the steps and operations to be performed on the data.
4. Optimizer
The optimizer improves the performance of the logical plan by applying various
optimization techniques.
It makes adjustments to the logical plan, such as combining operations and pushing filters and projections earlier in the dataflow, so that less data is carried through later stages.
5. Compiler
The compiler converts the logical plan into a physical plan.
The physical plan specifies how the logical plan will be executed, defining the
sequence of operations, including MapReduce jobs.
The compiler ensures that the data flow and processing tasks are optimized for
efficient execution.
6. Execution Engine
The execution engine is responsible for running the physical plan.
It coordinates the execution of the jobs, sending tasks to the Hadoop cluster, where
MapReduce jobs are executed.
The engine handles job scheduling and resource management.
7. HDFS (Hadoop Distributed File System)
Pig operates over HDFS, storing and retrieving input and output data.
All data processed by Pig is read from and written to HDFS, leveraging Hadoop's
distributed storage capabilities.
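The stages above are easiest to see on a concrete dataflow. The sketch below is plain Python, not Pig Latin; it only mimics the kind of FILTER → GROUP → COUNT pipeline a Pig Latin script would express, using made-up log records as input:

```python
from itertools import groupby
from operator import itemgetter

# Made-up input records: (user, page, HTTP status)
records = [
    ("alice", "/home", 200), ("bob", "/home", 404),
    ("alice", "/cart", 200), ("carol", "/home", 200),
]

# FILTER: keep only successful requests (status == 200)
ok = [r for r in records if r[2] == 200]

# GROUP BY page, then COUNT the records in each group
ok.sort(key=itemgetter(1))
counts = {page: len(list(group)) for page, group in groupby(ok, key=itemgetter(1))}

print(counts)   # {'/cart': 1, '/home': 2}
```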
8(b) Explain Key Value pairing in Map Reduce.
In the MapReduce programming model, key-value pairing is a fundamental concept that
helps organize, process, and aggregate data efficiently. The MapReduce framework processes
data in the form of key-value pairs, where each unit of data is represented by a key and an
associated value. These key-value pairs are used throughout the different stages of the
MapReduce process: Map phase, Shuffle and Sort phase, and Reduce phase.
Key-value pairing enables the MapReduce system to divide, process, and aggregate large
datasets in a distributed manner, making it suitable for handling vast amounts of data in
parallel across a cluster.
For example, in word count the Map phase emits (word, 1) pairs, the Shuffle and Sort phase groups them into (word, [1, 1, ...]), and the Reduce phase emits (word, total count).
(The phase-by-phase flow is the same as the MapReduce execution steps described in 7(a).)
Module-5
9(a) What is Machine Learning? Explain different types of Regression Analysis.
Machine Learning (ML) is a subset of Artificial Intelligence (AI) that involves developing algorithms
that enable computers to learn patterns from data and make decisions or predictions without explicit
programming. Instead of being hard-coded with specific instructions, ML systems improve their
performance on a task over time by analyzing data. It is widely used in areas like recommendation
systems, speech recognition, and financial forecasting.
Simple Linear Regression: the basic form of regression analysis, modeling the
relationship between a single dependent variable and one independent variable. The
relationship is represented by a straight line, and both variables are continuous.
Equation: y = mx + c, where y is the predicted value, x is the independent variable,
m is the slope, and c is the intercept.
Multiple Linear Regression: an extension of simple linear regression that models the
relationship between a dependent variable and two or more independent variables.
Equation: Y = a + b1X1 + b2X2 + … + bnXn + e, where e is the error term.
Polynomial Regression: used when the relationship between the independent
variable and the dependent variable is not linear but can be modeled by a polynomial
equation. The relationship is represented by a curve.
Equation: f(x) = b0 + b1x + b2x^2 + …
Logistic Regression: This is used when the dependent variable is binary (e.g., 0 or
1, true or false). It models the probability of the outcome occurring as a function of
the independent variables.
Ridge Regression: includes a regularization term that penalizes the size of the
coefficients. It is used to prevent overfitting, especially when dealing with
multicollinearity. Penalty term added to the least-squares loss: λ∑β^2
Lasso Regression: similar to ridge regression, lasso also includes a regularization
term; however, it can shrink some coefficients to exactly zero, effectively
performing variable selection. Penalty term added to the least-squares loss: λ∑|β|
Nonlinear Regression: used when the relationship between the dependent and
independent variables is not linear. It can model more complex relationships than a
straight line (a code sketch of some of these models follows this list).
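A hedged sketch comparing three of these regressors with scikit-learn (assuming scikit-learn and NumPy are installed; the data is synthetic):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Synthetic data: y is roughly 3x + 2 with a little noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 3 * X.ravel() + 2 + rng.normal(0, 0.5, size=50)

for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)):
    model.fit(X, y)
    # coef_ is the slope (m) and intercept_ is c in y = mx + c
    print(type(model).__name__, model.coef_[0], model.intercept_)
```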
9(b) Explain with neat diagram K-means clustering.
K-Means Clustering is an unsupervised machine learning algorithm used to partition a dataset
into K distinct groups (clusters) based on similarities. It minimizes the distance between
data points within the same cluster while maximizing the distance between data points
in different clusters. Each cluster is represented by its centroid (the average of all points in
the cluster).
1. Choose the number of clusters k
The first step in k-means is to pick the number of clusters, k.
2. Select k random points from the data as initial centroids
Next, we randomly select a centroid for each cluster. For example, if we want 2
clusters, k equals 2 and we pick two random points as the initial centroids.
3. Assign all the points to the closest cluster centroid
Once we have initialized the centroids, we assign each point to the closest cluster
centroid.
4. Recompute the centroids of newly formed clusters
Once we have assigned all of the points to a cluster, the next step is to compute the
centroids of the newly formed clusters.
5. Repeat steps 3 and 4
We then repeat steps 3 and 4 until one of the stopping criteria below is met (a code
sketch follows the criteria).
There are essentially three stopping criteria that can be adopted to stop the K-means
algorithm:
1. Centroids of newly formed clusters do not change
2. Points remain in the same cluster
3. Maximum number of iterations is reached
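A minimal NumPy sketch of these steps (random initialization, assignment, centroid update, repeat until the centroids stop changing); the 2-D points are made up:

```python
import numpy as np

def kmeans(points, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick k random data points as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: assign each point to its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute the centroids of the newly formed clusters
        new_centroids = np.array([points[labels == i].mean(axis=0) for i in range(k)])
        # Stopping criterion 1: centroids of newly formed clusters do not change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

if __name__ == "__main__":
    pts = np.array([[1, 1], [1.5, 2], [1, 0], [8, 8], [9, 9], [8, 9]], dtype=float)
    labels, centroids = kmeans(pts, k=2)
    print(labels)      # two clear clusters: one near (1, 1), one near (8, 9)
    print(centroids)
```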
9(c) Explain Naïve Bayes Theorem with example.
Naïve Bayes is a probabilistic machine learning algorithm based on Bayes' Theorem. It is
used for classification tasks and assumes that features are independent of each other (the
"naïve" assumption). Despite this simplification, it performs well for many real-world
applications such as spam detection and sentiment analysis.
Bayes' Theorem
Bayes' Theorem calculates the probability of a hypothesis (H) given evidence (E). It is
expressed as:
P(H∣E)=P(E∣H)⋅P(H)/P(E)
Where:
P(H∣E): Posterior probability (probability of hypothesis H given evidence E).
P(E∣H): Likelihood (probability of evidence E given H).
P(H): Prior probability (initial probability of H).
P(E): Marginal probability (total probability of evidence E).
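A small worked example (all counts are made up): classify a message containing the word "free" as spam or not, by plugging the terms above into Bayes' Theorem.

```python
# Made-up statistics from a hypothetical email corpus
p_spam = 0.4                 # P(Spam): prior
p_ham = 0.6                  # P(Not spam)
p_free_given_spam = 0.30     # P("free" | Spam): likelihood
p_free_given_ham = 0.05      # P("free" | Not spam)

# Marginal probability of the evidence: P("free")
p_free = p_free_given_spam * p_spam + p_free_given_ham * p_ham

# Posterior: P(Spam | "free") = P("free" | Spam) * P(Spam) / P("free")
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(round(p_spam_given_free, 3))   # 0.8 -> the message is classified as spam
```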
10(a) Explain five phases in a process pipeline text mining.
1. Text Preprocessing-This phase cleans and prepares the raw text data for analysis.
Steps:
o Tokenization: Splitting the text into individual words or phrases (tokens).
o Stopword Removal: Eliminating common words (e.g., "the," "and") that
do not contribute to the meaning.
o Stemming/Lemmatization: Reducing words to their root form (e.g.,
"running" → "run").
o Lowercasing: Converting text to lowercase to maintain uniformity.
o Removing Noise: Eliminating punctuation, numbers, and special
characters.
2. Feature Generation-This phase transforms the pre-processed text into a structured format that can
be analyzed by machine learning algorithms.
Methods:
o Bag-of-Words (BoW): Representing text as word frequency vectors.
o TF-IDF: Assigning weights to words based on their importance in a
document relative to a corpus.
o Word Embeddings: Using techniques like Word2Vec or GloVe to represent
words as vectors in a continuous space.
3. Feature Selection-This phase involves applying techniques to standardize or enhance the
text representation for specific tasks.
Techniques:
o Part-of-Speech (POS) Tagging: Labeling words with their grammatical
roles (e.g., noun, verb).
o Named Entity Recognition (NER): Identifying entities like names,
locations, or organizations.
o Topic Modeling: Identifying underlying themes or topics in the text using
methods like LDA (Latent Dirichlet Allocation).
4. Data Mining-This phase analyzes the structured data to uncover patterns, relationships, or trends.
Techniques:
o Clustering: Grouping similar documents or sentences (e.g., K-Means or
DBSCAN).
o Classification: Categorizing text into predefined labels (e.g., spam vs. non-
spam).
o Sentiment Analysis: Determining the emotional tone of the text (positive,
negative, neutral).
5. Analyzing Results-The final phase involves evaluating the results and interpreting the
findings.
Steps:
o Accuracy Metrics: Use precision, recall, F1-score, or confusion matrices to
measure model performance.
o Visualizations: Use graphs or word clouds to represent insights effectively.
o Actionable Insights: Extract useful conclusions that support decision-
making.
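A hedged sketch of phases 1 and 2 using scikit-learn's TfidfVectorizer (assuming scikit-learn is installed; the two documents are made up). It lowercases, tokenizes, removes English stopwords, and produces a TF-IDF matrix ready for the data mining phase:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The cats are running in the garden.",
    "A cat ran across the garden yesterday.",
]

# Preprocessing + feature generation: lowercasing, tokenization,
# stopword removal, and TF-IDF weighting in one step.
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
tfidf = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())   # vocabulary after cleaning
print(tfidf.toarray().round(2))             # TF-IDF matrix (documents x terms)
```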
10(b) Explain Web Usage Mining.
Web usage mining discovers and analyses the patterns in click streams. Web usage mining
also includes associated data generated and collected as a consequence of user interactions
with web resources.
The phases are:
1. Pre-processing - Converts the usage information collected from the various data sources
into the data abstractions necessary for pattern discovery.
2. Pattern discovery - Exploits methods and algorithms developed from fields, such as
statistics, data mining, ML and pattern recognition.
3. Pattern analysis - Filters out uninteresting rules or patterns from the set found during the
pattern discovery phase.
Preprocessing-The raw data is cleaned and prepared for analysis.
Tasks:
o Removing irrelevant records (e.g., failed HTTP requests).
o Identifying unique users and sessions.
o Data formatting and transformation.
Example: Converting timestamps into session durations or categorizing URLs.
Pattern Discovery-This step applies data mining techniques to identify patterns in user behaviour.
Techniques:
o Clustering: Grouping users with similar browsing behavior.
o Association Rules: Identifying frequently accessed page sequences.
o Sequential Pattern Mining: Finding the order of page visits.
o Classification: Categorizing users based on behavior (e.g., new vs. returning users).
Pattern Analysis-Interpreting the patterns discovered to derive actionable insights.
Tools: Visualization tools, statistical analysis, and query languages.
Example: Identifying that users often navigate from "Homepage → Products → Checkout."
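A minimal Python sketch of the preprocessing and pattern discovery ideas above: parse made-up clickstream records, drop failed requests, and count page-to-page transitions per user.

```python
from collections import Counter, defaultdict

# Made-up clickstream: (user, page, HTTP status), already ordered by time
clicks = [
    ("u1", "/home", 200), ("u1", "/products", 200), ("u1", "/checkout", 200),
    ("u2", "/home", 200), ("u2", "/products", 404), ("u2", "/home", 200),
]

# Preprocessing: remove failed requests and group pages into per-user sessions
sessions = defaultdict(list)
for user, page, status in clicks:
    if status == 200:
        sessions[user].append(page)

# Pattern discovery: count page-to-page transitions across all sessions
transitions = Counter()
for pages in sessions.values():
    transitions.update(zip(pages, pages[1:]))

print(transitions.most_common())   # e.g. ('/home', '/products') appears once
```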
EXTRA
Outliers -Outliers are data points that differ significantly from the majority of the data in a
dataset. They are values that fall far outside the normal range, either much higher or much
lower than most other values, potentially indicating variability in the data, errors, or rare
events.
Variance -Variance is a statistical measure that represents the degree of spread or dispersion
of a set of values. It is the average of the squared differences between each data point and the
mean of the dataset. High variance indicates that the data points are widely spread out, while
low variance means they are closely clustered around the mean.
Probability Distribution -A probability distribution is a mathematical function that describes
the likelihood of different outcomes in an experiment. It assigns probabilities to each possible
outcome in a sample space, and it can be discrete (for countable outcomes) or continuous (for
a range of outcomes).
Correlation -Correlation is a statistical measure that indicates the strength and direction of the
relationship between two variables. It can be positive (both variables increase or decrease
together), negative (one variable increases while the other decreases), or zero (no linear
relationship). The correlation coefficient quantifies this relationship on a scale from -1 to +1.
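A short NumPy sketch of these measures on a made-up sample: variance, a simple 1.5 × IQR rule for flagging outliers, and the correlation coefficient between two variables.

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9, 42], dtype=float)   # 42 is an outlier

# Variance: average of the squared differences from the mean
print("variance:", data.var())

# Outliers via the 1.5 * IQR rule
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
print("outliers:", outliers)                      # [42.]

# Correlation between two variables (close to +1 means a strong positive relationship)
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 6], dtype=float)
print("correlation:", np.corrcoef(x, y)[0, 1])
```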