Question 105A

The document discusses various concepts related to neural networks, including the advantages of hidden layers, the calculation of net inputs in neurons, the steps in a Kohonen network, and the description of McCulloch-Pitts neurons. It also covers the training algorithm for Multilayer Perceptrons using backpropagation, the architecture of Radial Basis Function networks, the significance of weights and learning factors in ANNs, and Widrow's Adaline model. Additionally, it explains the importance of kernel functions in Support Vector Machines and provides examples of activation functions.

Uploaded by

Kumar Kaushik


51. Question:
Explain the key advantage of having the hidden layer of computational elements (as opposed to
having the input nodes connect directly to the output layer). (5 Marks)
Answer:
The key advantage of having a hidden layer in a neural network is that it allows the model to learn and
represent complex patterns and relationships in the data. When input nodes connect directly to the output
layer, the network can only represent linear mappings (or, with a threshold output, linearly separable
decision boundaries); XOR is the classic function it cannot learn. With a hidden layer of non-linear units,
the network can capture non-linear interactions between features, enabling it to solve more complex problems.
Hidden layers act as feature detectors. For instance, in image recognition tasks, the hidden layers can
identify edges, shapes, and eventually more abstract features like objects. This layered learning process
improves the model's ability to generalize from the training data, leading to better performance on unseen
data. Without hidden layers, the model would struggle with tasks that require understanding intricate data
patterns.
52. Question:
A neuron j receives inputs from four other neurons whose activity levels are 10, -20, 4, and -2. The
respective synaptic weights of neuron j are 0.8, -0.2, -1, and 0.6. Calculate the net input to neuron j. (5
Marks)
Answer:
To calculate the net input to neuron j, we multiply each input by its corresponding synaptic weight and
sum up the results. Here’s the step-by-step calculation:
Given inputs: 10, -20, 4, -2
Respective weights: 0.8, -0.2, -1, 0.6
Now, we calculate each weighted input:
• 10 × 0.8 = 8
• (−20) × (−0.2) = 4
• 4 × (−1) = −4
• (−2) × 0.6 = −1.2
Summing these results gives the net input: 8 + 4 − 4 − 1.2 = 6.8
So, the net input to neuron j is 6.8.
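The same calculation can be checked with a few lines of Python. This is only a sketch of the weighted-sum arithmetic above; the variable names are illustrative.

```python
# Net input = sum of (input * weight) over all incoming connections.
inputs = [10, -20, 4, -2]
weights = [0.8, -0.2, -1, 0.6]

net_input = sum(x * w for x, w in zip(inputs, weights))
print(round(net_input, 2))
```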
53. Question:
How many steps are there in a Kohonen network (Self Organizing Map) and what do they do? (5 Marks)
Answer:
A Kohonen network, also known as a Self-Organizing Map (SOM), typically involves three main steps:
1. Initialization:
o The weight vectors of the network are initialized, usually with small random values or by
sampling from the input data. This step sets up the initial state of the map.
2. Competition:
o For each input vector, the network identifies the neuron (or node) with the weight vector
most similar to the input. This is often done using a distance metric like Euclidean distance.
The neuron that is closest is called the Best Matching Unit (BMU).
3. Adaptation (or Learning):
o The BMU and its neighbouring neurons update their weights to become more similar to the
input vector. This step is crucial for the self-organizing aspect of the map. The update is done
using a learning rate that decreases over time, along with a neighbourhood function that
ensures nearby neurons are also updated, though to a lesser extent.
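The three steps can be sketched in NumPy as follows. This is a minimal illustration, assuming a one-dimensional map, a Gaussian neighbourhood function, and exponential decay of the learning rate and neighbourhood width; the function name `train_som` and all default parameter values are hypothetical choices, not part of the original answer.

```python
import numpy as np

def train_som(data, n_units=5, epochs=50, lr0=0.5, sigma0=2.0, seed=0):
    """Train a 1-D self-organizing map; returns the (n_units, dim) weight matrix."""
    rng = np.random.default_rng(seed)
    # 1. Initialization: sample the initial weight vectors from the input data.
    weights = data[rng.choice(len(data), size=n_units)].astype(float)
    positions = np.arange(n_units)  # grid coordinate of each unit on the 1-D map

    for epoch in range(epochs):
        # Learning rate and neighbourhood width both shrink over time (assumed schedule).
        decay = np.exp(-epoch / epochs)
        lr, sigma = lr0 * decay, sigma0 * decay
        for x in data[rng.permutation(len(data))]:
            # 2. Competition: the BMU is the unit closest to x in Euclidean distance.
            bmu = np.argmin(np.linalg.norm(weights - x, axis=1))
            # 3. Adaptation: pull the BMU and its grid neighbours toward x,
            #    with units farther from the BMU moving less.
            h = np.exp(-((positions - bmu) ** 2) / (2 * sigma ** 2))
            weights += lr * h[:, None] * (x - weights)
    return weights
```

Because every update is a convex combination of the old weight and the input, the weight vectors always stay inside the range spanned by the data.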
54. Question:
Describe a McCulloch-Pitts neuron. (5 Marks)
Answer:
The McCulloch-Pitts neuron is a simple mathematical model of a biological neuron, introduced by Warren
McCulloch and Walter Pitts in 1943. It serves as the foundation for modern neural networks. Here are the
key characteristics:
1. Inputs: The neuron receives multiple binary inputs (either 0 or 1), representing signals from other
neurons.
2. Weights: Each input has an associated weight, which represents the strength of the connection.
These weights can be positive or negative.
3. Summation: The neuron calculates the weighted sum of the inputs. This is done by multiplying each
input by its respective weight and summing all the products.
4. Threshold Function: The neuron uses a threshold (or activation function) to decide whether to fire
(output a 1) or not (output a 0). If the weighted sum is greater than or equal to a predefined
threshold, the neuron outputs 1; otherwise, it outputs 0.
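As an illustration of the description above, the sketch below implements the threshold rule and uses it as a logical AND gate. The weights of 1 and the threshold of 2 are example values chosen for this demonstration.

```python
def mcculloch_pitts(inputs, weights, threshold):
    """Return 1 if the weighted sum of the binary inputs reaches the threshold, else 0."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return 1 if weighted_sum >= threshold else 0

# Logical AND: both inputs must be 1 for the weighted sum to reach the threshold of 2.
and_outputs = [mcculloch_pitts([a, b], [1, 1], threshold=2)
               for a in (0, 1) for b in (0, 1)]
print(and_outputs)
```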
55. Question:
Describe the algorithm for training using a Multilayer Perceptron (MLP) with Backpropagation. (10
Marks)
Answer:
Training a Multilayer Perceptron (MLP) using the backpropagation algorithm involves several key steps,
which can be summarized as follows:
1. Initialization:
o Initialize the weights and biases of the network with small random values. This step sets the
initial state of the network before training begins.
2. Forward Propagation:
o Input the training data into the network.
o For each layer, calculate the weighted sum of inputs plus the bias and apply an activation
function (such as Sigmoid, ReLU, or Tanh) to produce the output of that layer.
o Continue propagating the outputs forward through each layer until the final output layer is
reached.
3. Error Calculation:
o Calculate the error at the output layer by comparing the predicted output to the actual
target value using a loss function (commonly Mean Squared Error for regression or Cross-
Entropy Loss for classification).
4. Backward Propagation:
o Compute the gradient of the loss function with respect to the weights and biases using the
chain rule of calculus. This involves:
▪ Calculating the gradient of the error with respect to the output of the network
(output layer error).
▪ Propagating this error backward through the network, layer by layer, adjusting the
gradients for each layer’s weights and biases.
5. Weight and Bias Updates:
o Update the weights and biases using the gradients computed during backpropagation. This is
typically done using Gradient Descent or one of its variants (like Stochastic Gradient Descent
or Adam). The update rule is: $w_{\text{new}} = w_{\text{old}} - \eta \cdot \partial \text{Loss} / \partial w$,
where η is the learning rate, a hyperparameter that controls the step size during the update.
6. Iteration:
o Repeat the forward propagation, error calculation, backward propagation, and weight
update steps for many epochs (iterations over the entire training dataset) until the model
converges to an acceptable level of accuracy or the error no longer decreases significantly.
7. Stopping Criteria:
o The training process stops when a pre-defined number of epochs is reached, or the model
achieves a desired level of accuracy or a sufficiently low error rate.
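The steps above can be sketched as a small NumPy program. This is a minimal illustration rather than a reference implementation: it assumes one hidden layer, sigmoid activations, a mean squared error loss, full-batch gradient descent, and the XOR problem as training data. The function name `train_mlp` and the hyperparameter values are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_mlp(X, y, hidden=4, lr=0.5, epochs=5000, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialization: small random weights, zero biases.
    W1 = rng.normal(0, 0.5, (X.shape[1], hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0, 0.5, (hidden, 1))
    b2 = np.zeros(1)
    losses = []
    for _ in range(epochs):  # 6./7. Iterate for a fixed number of epochs.
        # 2. Forward propagation through the hidden and output layers.
        a1 = sigmoid(X @ W1 + b1)
        out = sigmoid(a1 @ W2 + b2)
        # 3. Error calculation (mean squared error).
        losses.append(float(np.mean((out - y) ** 2)))
        # 4. Backward propagation via the chain rule. The constant factor 2/N
        #    of the MSE gradient is folded into the learning rate.
        d_out = (out - y) * out * (1 - out)
        d_hidden = (d_out @ W2.T) * a1 * (1 - a1)
        # 5. Weight and bias updates (gradient descent).
        W2 -= lr * (a1.T @ d_out)
        b2 -= lr * d_out.sum(axis=0)
        W1 -= lr * (X.T @ d_hidden)
        b1 -= lr * d_hidden.sum(axis=0)
    return (W1, b1, W2, b2), losses

# XOR: a problem that cannot be solved without a hidden layer.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)
params, losses = train_mlp(X, y)
print(losses[0], losses[-1])
```

The stopping criterion here is simply the epoch count; a practical version would also monitor the loss on held-out data.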
Question:
Describe the architecture of a Radial Basis Function (RBF) network with D input units and K output
units, and explain what is computed at each layer. (10 Marks)
Answer:
The architecture of a Radial Basis Function (RBF) network consists of three layers: the input layer, the
hidden layer with radial basis functions, and the output layer. Here's a detailed description of each layer:
1. Input Layer:
o This layer has D input units, corresponding to the dimensionality of the input data. Each
input unit simply passes the input data to the next layer without any transformation.
2. Hidden Layer:
o The hidden layer consists of neurons that use radial basis functions (typically Gaussian
functions) as their activation functions.
o Each hidden neuron computes the distance between the input vector and a center (or
prototype) vector specific to that neuron. The output of the j-th hidden neuron is given by:
$\phi_j(x) = \exp\left(-\frac{\|x - c_j\|^2}{2\sigma^2}\right)$
o where x is the input vector, c_j is the center vector for the j-th hidden neuron, σ is the
width of the Gaussian function, and ∥⋅∥ denotes the Euclidean norm.
o This layer transforms the input space into a new space where the distance from the input to
the centers is used to compute the activations.
3. Output Layer:
o The output layer has K units, corresponding to the number of output classes or target
values.
o Each output unit computes a linear combination of the activations from the hidden layer,
typically using weights w_{jk} that connect the j-th hidden neuron to the k-th output unit:
$y_k = \sum_j w_{jk} \phi_j(x)$
where $\phi_j(x)$ is the output of the j-th hidden neuron and y_k is the output of the k-th unit.
Computation at Each Layer:
• Input Layer: Receives and forwards the raw input data to the hidden layer.
• Hidden Layer: Computes the activation of each neuron based on the distance between the input
and the neuron’s center. This represents the similarity between the input and the center.
• Output Layer: Combines the activations from the hidden layer using weighted sums to produce the
final output, which could be used for classification or regression tasks.
The RBF network is particularly effective for tasks that require capturing local features of the data, as the
hidden neurons focus on regions around their respective centers.
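The forward pass through the three layers can be sketched as follows. The sketch assumes a single shared width σ for all hidden neurons and no output bias, matching the formulas above; the function name `rbf_forward` and the example centers and weights are hypothetical.

```python
import numpy as np

def rbf_forward(x, centers, sigma, W):
    """Forward pass of an RBF network.

    x: input of shape (D,), centers: (J, D), W: (J, K). Returns (y, phi).
    """
    # Hidden layer: Gaussian of the squared distance to each center.
    sq_dist = np.sum((centers - x) ** 2, axis=1)
    phi = np.exp(-sq_dist / (2 * sigma ** 2))
    # Output layer: linear combination of the hidden activations.
    return phi @ W, phi

centers = np.array([[0.0, 0.0], [1.0, 1.0]])  # J = 2 centers in D = 2 dimensions
W = np.eye(2)                                  # K = 2 outputs, identity weights for clarity
y_out, phi = rbf_forward(np.array([0.0, 0.0]), centers, sigma=1.0, W=W)
print(phi)
```

An input that coincides with a center gives that hidden neuron an activation of exactly 1, and the activation falls toward 0 as the input moves away.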
Question:
What is the significance of weights and learning factor used in Artificial Neural Networks (ANN), explain
with an example. (10 Marks)
Answer:
1. Weights in ANN:
• Significance: Weights are crucial in an ANN as they determine the importance of each input in the
network. Each connection between neurons is assigned a weight, which is adjusted during training
to minimize the error between the predicted output and the actual output.
• Function: Weights control the strength of the signal that flows from one neuron to another. By
adjusting these weights, the network learns to make better predictions or classifications.
2. Learning Factor (Learning Rate):
• Significance: The learning factor (or learning rate) is a hyperparameter that determines the step size
at which the weights are updated during the training process. It controls how quickly or slowly the
network learns.
• Function: A small learning rate ensures gradual and stable convergence but may require more
iterations. A large learning rate speeds up the learning but risks overshooting the optimal solution
or causing the model to become unstable.
Example Scenario:
Imagine training an ANN to recognize handwritten digits. Initially, the weights are random, so the
predictions are poor. As training progresses:
• Weights: Adjustments in weights help the network to focus on important features (like specific
edges or shapes of digits).
• Learning Rate: A carefully chosen learning rate ensures that the network learns effectively without
making abrupt changes, leading to better accuracy over time.
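A small numeric experiment makes the effect of the learning rate concrete. The sketch below minimises the toy loss L(w) = w² with gradient descent; the loss, the starting weight, and the three learning rates are illustrative assumptions, not values from the text.

```python
def descend_quadratic(lr, w0=5.0, steps=50):
    """Minimise L(w) = w**2 (gradient 2*w) starting from w0 with a fixed learning rate."""
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w  # weight update: w <- w - lr * dL/dw
    return w

w_small = descend_quadratic(lr=0.01)  # stable but slow: still far from 0 after 50 steps
w_good = descend_quadratic(lr=0.1)    # converges quickly toward the minimum at 0
w_large = descend_quadratic(lr=1.1)   # overshoots on every step, so |w| keeps growing
print(w_small, w_good, w_large)
```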
Question:
Give the Widrow's Adaline neuron model. (5 Marks)
Answer:
Widrow's Adaline (Adaptive Linear Neuron) model is a type of single-layer neural network and is an
extension of the perceptron model. Here’s a detailed description:
1. Structure:
o Inputs: The Adaline model takes multiple input signals, denoted as x_1, x_2, …, x_n.
o Weights: Each input x_i is associated with a weight w_i.
o Summation: The weighted sum of the inputs is calculated, plus a bias term b:
$y = \sum_{i=1}^{n} w_i x_i + b$
2. Activation Function:
o Unlike the perceptron, which uses a step function for activation, Adaline uses a linear
activation function. This means the output y is a continuous value, not just 0 or 1.
o The output is directly the weighted sum of inputs.
3. Learning Rule:
o Adaline uses the Least Mean Squares (LMS) algorithm to update the weights. The error e is
the difference between the desired output d and the actual output y:
$e = d - y$
o The weights are updated using the formula: $w_i^{\text{new}} = w_i^{\text{old}} + \eta \cdot e \cdot x_i$, where η is the
learning rate.
Key Features:
• Linear Output: Adaline outputs a continuous value, which makes it suitable for regression tasks.
• Learning Process: The model minimizes the mean squared error (MSE) between the predicted and
actual outputs, leading to an optimal set of weights for the given data.
Example:
For a simple two-input Adaline model:
• Inputs: x_1 = 1, x_2 = 2
• Weights: w_1 = 0.5, w_2 = -0.3
• Bias: b = 0.1
The output y would be:
y = (0.5 × 1) + (−0.3 × 2) + 0.1 = 0.5 − 0.6 + 0.1 = 0.0
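The example can be reproduced in code, together with one LMS update step. The desired output d = 1.0 and learning rate η = 0.1 are hypothetical values added for illustration, and updating the bias with the same rule (treating it as a weight on a constant input of 1) is an assumption that the text does not state.

```python
import numpy as np

def adaline_output(w, b, x):
    """Linear activation: the output is the weighted sum plus the bias."""
    return float(np.dot(w, x) + b)

def lms_update(w, b, x, d, eta):
    """One LMS step: move the weights in proportion to the error e = d - y."""
    e = d - adaline_output(w, b, x)
    return w + eta * e * x, b + eta * e, e

w = np.array([0.5, -0.3])
b = 0.1
x = np.array([1.0, 2.0])

y = adaline_output(w, b, x)  # 0.5 - 0.6 + 0.1
w_new, b_new, e = lms_update(w, b, x, d=1.0, eta=0.1)  # d and eta are assumed values
print(y, w_new, b_new)
```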
Question:
What is the significance of kernel functions in Support Vector Machines (SVM)? Give two kernel functions
used in SVM. (10 Marks)
Answer:
Significance of Kernel Functions in SVM:
Kernel functions in Support Vector Machines (SVM) play a crucial role in enabling the algorithm to work in
high-dimensional or even infinite-dimensional spaces without explicitly calculating the coordinates of the
data points in that space. The primary purpose of kernel functions is to transform the data into a higher-
dimensional space, where a linear decision boundary can be used to separate the classes that may not be
linearly separable in the original input space.
In simple terms:
• Non-linearity Handling: SVM is a linear classifier, but many real-world problems are non-linear. By
using kernel functions, we can implicitly map the input data into a higher-dimensional space where
it becomes easier to find a hyperplane that separates the data.
• Efficient Computation: Directly mapping data points to a higher-dimensional space can be
computationally expensive. However, kernel functions enable us to compute the inner product
between data points in the higher-dimensional space without ever explicitly transforming the data,
thus saving computational resources. This approach is known as the "kernel trick."
By using kernel functions, SVM can create complex decision boundaries while maintaining its optimization
properties (maximizing the margin between classes), making it a powerful tool for classification and
regression tasks.
Two Common Kernel Functions Used in SVM:
1. Linear Kernel:
o The linear kernel is the simplest type of kernel function. It computes the inner product of the
input vectors directly without any transformation, thus representing a linear decision
boundary.
o Formula: K(x,y) = x^T y
o Use case: The linear kernel is used when the data is already linearly separable or when we
expect the decision boundary to be linear.
2. Gaussian Radial Basis Function (RBF) Kernel:
o The RBF kernel is a popular choice for non-linear SVM problems. It maps the input data into
an infinite-dimensional space and computes the similarity between two data points based
on their distance. The transformation is done implicitly through the kernel function.
o Formula: $K(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right)$, where $\|x - y\|^2$ is the squared Euclidean distance
between the two points x and y, and σ is a parameter that controls the spread of the
kernel.
o Use case: The RBF kernel is effective when the data is not linearly separable and there is a
need to create non-linear decision boundaries.
In summary, kernel functions enable SVM to handle complex datasets with non-linear relationships, making
it a versatile and powerful tool for classification and regression tasks. The linear and RBF kernels are two
widely used options, depending on the nature of the data.
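Both kernels are short enough to write out directly. The sketch below follows the two formulas given above; the function names and the default σ = 1.0 are illustrative choices.

```python
import numpy as np

def linear_kernel(x, y):
    """K(x, y) = x^T y: the inner product in the original input space."""
    return float(np.dot(x, y))

def rbf_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 sigma^2)): similarity that decays with distance."""
    return float(np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2)))

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])
print(linear_kernel(a, b))  # 1*3 + 2*4
print(rbf_kernel(a, a))     # identical points have maximum similarity
```

Either function returns the inner product in the implicit feature space without ever constructing that space, which is the kernel trick described above.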
Question:
What are activation functions? Give two examples with necessary graphical and mathematical
representation. (10 Marks)
Answer:
Activation Functions:
Activation functions are mathematical functions used in Artificial Neural Networks (ANNs) to introduce
non-linearity into the network. They determine the output of a neural network neuron based on its input.
Without activation functions, a neural network would behave like a linear regression model, regardless of
the number of layers, because a composition of linear transformations is itself linear. Activation functions
help the model to learn complex patterns and relationships
by transforming the weighted sum of inputs into a non-linear output.
In simple terms, activation functions are the "gatekeepers" that decide whether a neuron should be
activated or not based on the input signals it receives.

Two Examples of Activation Functions:

1. Sigmoid Activation Function:
o Mathematical Representation:
$f(x) = \frac{1}{1 + e^{-x}}$
o The sigmoid function maps any input value into a range between 0 and 1. This makes it
particularly useful for binary classification problems.
o Graphical Representation:
▪ The sigmoid curve has an S-shape and approaches 0 as x→−∞ and approaches 1 as
x→+∞
o Properties:
▪ Output Range: (0,1)
▪ Derivative: f′(x)=f(x)×(1−f(x))
▪ Use Case: Sigmoid is used in the output layer for binary classification problems,
where the output can be interpreted as a probability.
2. ReLU (Rectified Linear Unit) Activation Function:
o Mathematical Representation:
f(x)=max⁡(0,x)
o The ReLU function outputs the input directly if it is positive, otherwise, it outputs zero. This
function is widely used because of its simplicity and effectiveness.
o Graphical Representation:
▪ The ReLU function has a linear behavior for positive values and flat behavior for
negative values.

o Properties:
▪ Output Range: [0,∞)
▪ Derivative: f'(x) = 1 for x > 0, and f'(x) = 0 for x≤0
▪ Use Case: ReLU is commonly used in hidden layers of neural networks, especially for
deep learning models due to its ability to reduce the likelihood of vanishing gradients
and speed up convergence.
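The two functions and their derivatives can be written as follows. This is a minimal NumPy sketch; the ReLU derivative at x = 0 is set to 0, which matches the convention stated above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # f'(x) = f(x) * (1 - f(x))

def relu(x):
    return np.maximum(0.0, np.asarray(x, dtype=float))

def relu_grad(x):
    return (np.asarray(x, dtype=float) > 0).astype(float)  # 1 for x > 0, else 0

xs = np.array([-1.0, 0.0, 2.0])
print(sigmoid(xs), relu(xs))
```

Evaluating these functions on a grid such as `np.linspace(-6, 6, 100)` and plotting the result reproduces the S-shaped sigmoid curve and the hinge-shaped ReLU curve described above.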
Question:
Explain Gradient Descent and name its types. (10 Marks)
Answer:
Gradient Descent:
Gradient Descent is an optimization algorithm used in machine learning and deep learning to minimize the
loss function by iteratively adjusting the model's parameters (weights) in the direction of the steepest
descent of the loss. The loss function measures how far the model's predictions are from the actual values.
By minimizing this function, we improve the model’s accuracy.
The basic idea is to start with an initial set of parameters and iteratively update them to reduce the error.
The updates are made in small steps based on the gradient of the loss function with respect to the model
parameters. The gradient indicates the direction of the steepest increase in the loss function, and by
moving in the opposite direction (steepest descent), we minimize the error.
The update rule in gradient descent for a parameter w is:
$w \leftarrow w - \eta \cdot \frac{\partial L}{\partial w}$
where:
• w is the model parameter (weight),
• η is the learning rate (step size),
• L is the loss function, and
• ∂L/∂w is the gradient (partial derivative of the loss with respect to w).
Steps of Gradient Descent:
1. Initialization: Start with random or predefined values for the model parameters.
2. Compute the Gradient: Calculate the gradient (or derivative) of the loss function with respect to each
parameter.
3. Update the Parameters: Adjust the parameters in the opposite direction of the gradient.
4. Repeat: Repeat the process until convergence, i.e., until the loss function reaches its minimum or a
predefined stopping criterion is met.

Types of Gradient Descent:

1. Batch Gradient Descent (BGD):
o In Batch Gradient Descent, the model parameters are updated using the gradient calculated
over the entire dataset. It computes the exact gradient of the loss function with respect to
each parameter based on all the data points.
o Pros: Converges to the global minimum for convex loss functions and ensures smooth
convergence.
o Cons: Can be very slow and computationally expensive, especially for large datasets.
Update Rule:
$w \leftarrow w - \eta \cdot \frac{1}{m} \sum_{i=1}^{m} \frac{\partial L(x_i, y_i)}{\partial w}$
where m is the number of data points in the dataset.
2. Stochastic Gradient Descent (SGD):
o In Stochastic Gradient Descent, instead of computing the gradient over the entire dataset,
the model parameters are updated for each training example. In other words, the gradient is
computed and the parameters are updated after each individual data point.
o Pros: Much faster than batch gradient descent since it updates parameters more frequently,
leading to faster learning.
o Cons: The updates are noisy and can cause the algorithm to fluctuate around the minimum,
potentially leading to suboptimal solutions. However, with a properly tuned learning rate, it
can still converge to a good solution.
Update Rule:
$w \leftarrow w - \eta \cdot \frac{\partial L(x_i, y_i)}{\partial w}$
where (x_i, y_i) is a single data point and its corresponding label.
3. Mini-Batch Gradient Descent:
o Mini-Batch Gradient Descent is a compromise between Batch and Stochastic Gradient
Descent. In this method, the dataset is divided into small batches (mini-batches), and the
gradient is computed and the model parameters are updated for each mini-batch, instead of
the entire dataset or a single point.
o Pros: It reduces the variance of the parameter updates, leading to more stable convergence
compared to SGD. It also takes advantage of vectorized operations, which makes it more
efficient than BGD for large datasets.
o Cons: Requires tuning the mini-batch size to get the best results.
Update Rule:
$w \leftarrow w - \eta \cdot \frac{1}{B} \sum_{i=1}^{B} \frac{\partial L(x_i, y_i)}{\partial w}$
where B is the batch size.
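All three variants differ only in how many examples contribute to each update, so a single function with a batch-size parameter can illustrate them. The sketch below assumes a linear model trained with mean squared error on synthetic, noise-free data; the function name, data, and hyperparameters are hypothetical.

```python
import numpy as np

def gradient_descent(X, y, batch_size, lr=0.1, epochs=200, seed=0):
    """Fit y ~ X @ w by minimising mean squared error.

    batch_size == len(X) gives batch GD, 1 gives SGD, anything between is mini-batch.
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(X))  # reshuffle the data every epoch
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            # Average gradient of the squared error over the current batch.
            grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w -= lr * grad
    return w

rng = np.random.default_rng(42)
X = rng.uniform(-1, 1, size=(64, 2))
true_w = np.array([2.0, -3.0])  # assumed ground-truth weights for the synthetic data
y = X @ true_w

w_batch = gradient_descent(X, y, batch_size=len(X))
w_sgd = gradient_descent(X, y, batch_size=1)
w_mini = gradient_descent(X, y, batch_size=16)
print(w_batch, w_sgd, w_mini)
```

For the same number of epochs, the SGD run performs 64 updates per pass over the data while the batch run performs one.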

Summary of Differences:

| Type | Pros | Cons | Use Case |
|---|---|---|---|
| Batch Gradient Descent | Stable, exact gradients, smooth convergence | Slow for large datasets, high memory usage | Small to medium datasets, convex loss functions |
| Stochastic Gradient Descent | Fast, updates after every data point | Noisy, can oscillate around the minimum | Large datasets, online learning |
| Mini-Batch Gradient Descent | Efficient, balance between speed and stability | Requires tuning mini-batch size | Large datasets, deep learning |

Question:
What do you mean by Boltzmann Machine? (10 Marks)
Answer:
Boltzmann Machine:
A Boltzmann Machine (BM) is a type of recurrent artificial neural network that is stochastic and
probabilistic in nature. It is inspired by the physical system in thermodynamics and is used primarily for
unsupervised learning tasks, such as pattern recognition, optimization problems, and feature learning. The
Boltzmann Machine is a network of symmetrically connected neurons (or nodes), where each connection
has a weight that determines the relationship between the neurons.
The Boltzmann Machine uses the principles of statistical mechanics to model a system of neurons that
reaches a state of equilibrium in which the system's energy is minimized. It aims to learn patterns and
represent data by adjusting its weights in such a way that the system’s energy is minimized for the given
dataset.
The Boltzmann Machine can loosely be seen as a probabilistic counterpart of an autoencoder, where the neurons of the
network have binary values (0 or 1), and their values are determined based on probabilities.

Key Concepts:
1. Neurons and States:
o In a Boltzmann Machine, each neuron has a binary state: either 0 or 1. These states are
probabilistically determined.
o The state of a neuron i, denoted as s_i, depends on the inputs it receives from other
neurons and the weight of the connection between them.
2. Energy Function:
o The Boltzmann Machine has an energy function E that represents the state of the network.
The energy function is used to define how "good" or "bad" the current state of the network
is.
o The goal of the network is to adjust the weights such that the energy is minimized, which
corresponds to learning a useful representation of the data.
$E(v, h) = -\sum_i \sum_j w_{ij} v_i h_j$
where v and h are the visible and hidden units, respectively, and w_{ij} represents the weight between
visible unit i and hidden unit j.
3. Probability Distribution:
o The Boltzmann Machine uses the concept of a Boltzmann distribution to model the
probabilities of a neuron being in state 1 or 0. The probability that a unit i is in state 1
depends on the weighted sum of the inputs from other units.
o The probability is given by:
$P(s_i = 1 \mid \text{input}) = \frac{1}{1 + \exp\left(-\sum_j w_{ij} s_j\right)}$
where the sum is taken over the neighbouring neurons j connected to neuron i.
4. Training a Boltzmann Machine:
o The goal of training a Boltzmann Machine is to learn the weights w_ij such that the
probability distribution of the network's states matches the distribution of the input data.
o The Contrastive Divergence (CD) algorithm is commonly used for training, particularly for
Restricted Boltzmann Machines. It updates each weight in proportion to the difference between
the visible-hidden correlations measured with the visible units clamped to the data and the
correlations measured after a few steps of Gibbs sampling.
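The energy function and the activation probability can be sketched for a small network with visible and hidden units. Like the formulas above, the sketch omits bias terms; the function names and the example weight matrix are hypothetical.

```python
import numpy as np

def energy(v, h, W):
    """E(v, h) = -sum_i sum_j w_ij * v_i * h_j (no bias terms)."""
    return float(-(v @ W @ h))

def hidden_on_prob(v, W):
    """P(h_j = 1 | v) = sigmoid(sum_i w_ij * v_i), computed for every hidden unit j."""
    return 1.0 / (1.0 + np.exp(-(v @ W)))

W = np.array([[1.0, -1.0],
              [0.5, 0.0]])   # rows index visible units, columns index hidden units
v = np.array([1.0, 1.0])
h = np.array([1.0, 0.0])

print(energy(v, h, W))
print(hidden_on_prob(v, W))

# Sampling binary hidden states from those probabilities (one half of a Gibbs step).
rng = np.random.default_rng(0)
h_sample = (rng.random(2) < hidden_on_prob(v, W)).astype(float)
```

A hidden unit with a positive total input is more likely to be on than off, and turning it on lowers the energy of the joint state.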

Types of Boltzmann Machines:

1. Restricted Boltzmann Machine (RBM):
o The Restricted Boltzmann Machine is a simplified version of the Boltzmann Machine, where
the network has two layers: a visible layer and a hidden layer. The key restriction is that
there are no connections between the neurons within the same layer, meaning that the
visible and hidden units are only connected to each other.
o RBMs are easier to train and are widely used in deep learning models, especially for
dimensionality reduction, collaborative filtering, and feature learning. They serve as the
building blocks for deep networks like Deep Belief Networks (DBN).
2. Deep Boltzmann Machine (DBM):
o A Deep Boltzmann Machine consists of multiple layers of hidden units, which are connected
to each other and the visible units. Unlike the RBM, DBMs allow connections between the
hidden layers, providing a more complex model for learning hierarchical representations.
o DBMs can capture more intricate patterns in the data, but they are harder to train compared
to RBMs due to the increased complexity.

Applications of Boltzmann Machines:

• Unsupervised Learning: Boltzmann Machines can learn the distribution of data without labeled
examples, making them useful for unsupervised learning tasks.
• Optimization Problems: They can be used in optimization problems, where the goal is to find the
configuration of variables that minimizes a certain objective function.
• Dimensionality Reduction: Boltzmann Machines, especially RBMs, are effective for learning low-
dimensional representations of high-dimensional data.
• Recommendation Systems: Boltzmann Machines can model user preferences in collaborative
filtering-based recommendation systems, such as in movie or product recommendation.

Question:
With a supervised learning algorithm, we can specify target output values, but we may never get close to
those targets at the end of learning. Give two reasons. (10 Marks)
Answer:
In supervised learning, we aim to learn a model that maps inputs to target outputs based on labeled
training data. However, even after training, the model may never perfectly match the target output values
for various reasons. Below are two key reasons why this happens:
1. Limited Model Complexity or Capacity
Reason: A model may not have enough complexity (capacity) to capture the underlying patterns in the
data, especially if the data is highly non-linear or complex.
Explanation:
• Supervised learning models like linear regression, decision trees, or simple neural networks may not
be capable of learning the true relationship between inputs and outputs if the data exhibits more
complex patterns.
• For example, a linear model will struggle to approximate a non-linear relationship between input
and output. Similarly, a shallow neural network may not have enough layers to learn complex
features from the data.
Example:
• If you try to fit a linear regression model to a dataset that exhibits a non-linear relationship, the
model will only be able to capture a linear approximation of the data, leading to a poor fit and an
inability to closely approximate the target outputs.
Impact:
• The model’s limited capacity to learn complex patterns will prevent it from ever getting close to the
target values, no matter how much training is done.
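The capacity limit can be demonstrated numerically. The sketch below fits a straight line and a quadratic to noise-free quadratic data using least squares; the data and the use of `np.polyfit` are illustrative assumptions.

```python
import numpy as np

x = np.linspace(-2, 2, 41)
y = x ** 2  # noise-free quadratic targets

# Least-squares fits: the best possible linear model versus a quadratic model.
linear_pred = np.polyval(np.polyfit(x, y, deg=1), x)
quadratic_pred = np.polyval(np.polyfit(x, y, deg=2), x)

mse_linear = float(np.mean((linear_pred - y) ** 2))
mse_quadratic = float(np.mean((quadratic_pred - y) ** 2))
print(mse_linear, mse_quadratic)
```

The linear fit here is already the optimum for its model class, so additional training cannot reduce its error; only a model with enough capacity (the quadratic) reaches the targets.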

2. Noise and Inaccuracies in the Data

Reason: Real-world datasets often contain noise or inaccuracies, which can make it impossible for a
supervised learning model to perfectly match the target outputs.
Explanation:
• Noise refers to random variations in the data that do not represent meaningful patterns or
relationships. This noise can come from various sources, such as errors in measurement, data entry
mistakes, or unaccounted-for variables that influence the outputs.
• If the training data contains noisy or inconsistent examples, the model may be forced to fit the data
as best as it can, which may lead to deviations from the target values.
Example:
• Consider a dataset where you're trying to predict house prices. If there are errors in the listing (e.g.,
incorrect square footage or a typo in the price), these discrepancies will introduce noise, which will
prevent the model from perfectly matching the true target values.
Impact:
• The presence of noise in the data leads to variations that prevent the model from achieving perfect
accuracy, and as a result, the model may never be able to get close to the target outputs, no matter
how much it learns.

Perceptron for Learning Between Sweet and Sour:

A Perceptron is a simple type of artificial neural network used for binary classification. It takes multiple
inputs, applies a set of weights to each input, computes the weighted sum, and then passes this sum
through an activation function to produce an output.
Steps for the Perceptron:
1. Inputs and Weights:
o The perceptron receives inputs. In this case, let’s assume the inputs represent certain
features that determine whether a sample is sweet or sour (e.g., sourness level, sugar
content, etc.).
o There is also a bias term (represented as w_0): a learnable weight attached to a constant input (usually 1).
2. Weighted Sum:
o The perceptron computes a weighted sum of all the inputs, including the bias term:
$\text{sum} = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + w_0$
where x_1, x_2, …, x_n are the input values, and w_1, w_2, …, w_n are their corresponding weights.
The term w_0 is the bias.
3. Activation Function:
o The perceptron uses an activation function (typically a step function) to decide the output:
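$\text{output} = \begin{cases} 1 & \text{if sum} > 0 \\ 0 & \text{if sum} \le 0 \end{cases}$
This function outputs either 1 (for "sweet") or 0 (for "sour"), depending on whether the
weighted sum exceeds the threshold (here 0, because the bias w_0 absorbs any other threshold value).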
4. Learning:
o During training, the perceptron adjusts the weights w_1, w_2, …, w_n based on the error in its
predictions. This is done using the Perceptron Learning Rule:
$w_i \leftarrow w_i + \Delta w_i$
where the change in weight $\Delta w_i$ is calculated as:
$\Delta w_i = \eta \cdot (y - \hat{y}) \cdot x_i$
Here:
▪ η is the learning rate (a small constant),
▪ y is the true label (0 or 1),
▪ $\hat{y}$ is the predicted label, and
▪ x_i is the input feature.
5. Convergence:
o The perceptron continues adjusting its weights until it successfully classifies all training
examples or reaches a predefined number of iterations.
Example:
Let's assume the perceptron is trained to classify between sweet and sour based on two features:
• x1: Sugar content (higher values may correspond to sweet).
• x2: Sourness level (higher values may correspond to sour).
The perceptron will be given input pairs (x1, x2) with known outputs, such as:
• (x1 = 0.8, x2 = 0.2) → Label: Sweet (1)
• (x1 = 0.1, x2 = 0.9) → Label: Sour (0)
Based on these inputs, the perceptron will update its weights to minimize errors and learn the boundary
between "sweet" and "sour."
Summary:
A perceptron learns to classify inputs based on the weighted sum of the inputs, applying an activation
function to make decisions. It adjusts its weights using the perceptron learning rule to reduce the
classification error over time, enabling it to differentiate between categories (like sweet and sour) based on
the provided features.
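The perceptron steps can be sketched in plain Python using the two labelled samples from the example. The zero initial weights, the learning rate of 0.1, and the epoch limit are assumed values; the bias is updated like a weight on a constant input of 1.

```python
def predict(weights, bias, x):
    """Step activation: 1 ("sweet") if the weighted sum is positive, else 0 ("sour")."""
    total = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if total > 0 else 0

def train_perceptron(samples, eta=0.1, max_epochs=100):
    """Apply the perceptron learning rule until an epoch produces no errors."""
    weights, bias = [0.0] * len(samples[0][0]), 0.0
    for _ in range(max_epochs):
        errors = 0
        for x, y in samples:
            delta = eta * (y - predict(weights, bias, x))
            if delta != 0:
                errors += 1
                weights = [w + delta * xi for w, xi in zip(weights, x)]
                bias += delta  # the bias input is fixed at 1
        if errors == 0:  # every training example is classified correctly
            break
    return weights, bias

# (sugar content, sourness level) -> 1 for sweet, 0 for sour
samples = [((0.8, 0.2), 1), ((0.1, 0.9), 0)]
weights, bias = train_perceptron(samples)
predictions = [predict(weights, bias, x) for x, _ in samples]
print(weights, bias, predictions)
```

With these two samples the rule settles on a positive weight for sugar content and a negative weight for sourness.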
Here is the comparison between the computational model (artificial neuron) and the biological equivalent
(biological neuron) in a tabular format:

| Aspect | Computational Model (Artificial Neuron) | Biological Neuron |
|---|---|---|
| Structure | Inputs, weights, bias, summation, activation function | Dendrites, soma, axon, synapses, action potentials |
| Processing | Linear transformation and discrete output | Complex signal integration, action potential firing |
| Learning Mechanism | Supervised learning, gradient descent, backpropagation | Hebbian learning, neuroplasticity, unsupervised learning |
| Processing Speed | Fast computation | Slower signal transmission |
| Energy Efficiency | High energy consumption | Highly efficient |
| Scalability | Scalable to millions of neurons | Limited scalability |
| Robustness | Sensitive to initialization and data quality | Highly robust and fault-tolerant |
| Parallelism | Limited by hardware capabilities | Massive parallelism in the brain |

This table provides a clear comparison between the key characteristics of artificial neurons and biological
neurons.
The output of a McCulloch-Pitts neuron can be mathematically described as follows:
Equation for the Output of a McCulloch-Pitts Neuron:
$$y = \begin{cases} 1 & \text{if } \sum_{i=1}^{n} w_i x_i \ge \theta \\ 0 & \text{if } \sum_{i=1}^{n} w_i x_i < \theta \end{cases}$$
Where:
• y is the output of the neuron (either 0 or 1).
• $w_i$ is the weight associated with the input $x_i$, where i is the input index.
• $x_i$ represents the input values, which are typically either 0 or 1.
• n is the number of inputs to the neuron.
• θ is the threshold value, which is the cutoff that determines whether the neuron fires or not.
• The summation $\sum_{i=1}^{n} w_i x_i$ calculates the weighted sum of the inputs.

Explanation:
• The neuron fires (output = 1) when the weighted sum of the inputs is greater than or equal to the
threshold θ.
• If the weighted sum is less than the threshold, the neuron does not fire (output = 0).
This model is a very simple representation of how biological neurons might behave in a binary manner,
where they either "fire" or "do not fire" based on the inputs and the threshold value.
Demerits of Backpropagation Network:
1. Local Minima:
o Backpropagation can get stuck in local minima or saddle points of the error surface. This
prevents the network from reaching the global minimum, which can lead to suboptimal
performance.
2. Slow Convergence:
o The training process using gradient descent is computationally expensive and may take a
long time to converge, especially for large networks. This is particularly an issue when the
network has many layers or neurons.
3. Overfitting:
o If the model is too complex (e.g., too many layers or neurons), it may fit the noise in the
training data, leading to overfitting. Overfitting reduces the model's generalization capability
to new, unseen data.
4. Requires Large Data Sets:
o Backpropagation requires large amounts of labeled data for training to prevent overfitting
and ensure good generalization. This can be a challenge when data is limited or expensive to
obtain.
5. Gradient Vanishing and Exploding:
o In deep networks, gradients may become too small (vanishing gradients) or too large
(exploding gradients), which makes training difficult or impossible.
6. Computationally Intensive:
o For large networks, the computational cost can be high due to the need to compute
gradients for each parameter and propagate them back through each layer during training.

Applications of Backpropagation Network:

1. Image Recognition:
o Backpropagation networks are widely used for classifying images, detecting objects, and
performing tasks such as handwritten digit recognition (e.g., MNIST dataset). It helps in
learning complex features in images for accurate predictions.
2. Speech Recognition:
o Backpropagation is used in training deep neural networks for speech-to-text applications. It
enables the model to map acoustic signals to phonetic representations, improving the
accuracy of speech recognition systems.
These applications demonstrate the versatility of Backpropagation Networks in handling various complex
tasks across different domains.
Gradient Descent Learning:
Gradient Descent is an optimization algorithm used to minimize the loss function (or error function) in
machine learning and neural networks. The goal is to find the minimum value of the loss function, which
represents the best fit for the model parameters (weights and biases).
In simpler terms, Gradient Descent helps the model adjust its weights to minimize the difference between
the predicted output and the actual output, effectively improving its performance.
How Gradient Descent Works:
1. Start with Initial Parameters:
o Initialize the weights and biases of the model randomly or with some heuristic values. These
parameters will be adjusted through the learning process.
2. Compute the Gradient:
o For each parameter (weight or bias), compute the gradient of the loss function with respect
to that parameter. The gradient represents the direction of the steepest ascent, meaning the
direction where the loss function increases the most.
o To minimize the loss, we move in the opposite direction of the gradient, which is called the
negative gradient.
$$G = \frac{\partial\,\text{Loss Function}}{\partial\,\text{Parameter}}$$
3. Update Parameters:
o Update the parameters using the learning rate (α) and the computed gradients. The learning
rate determines how large a step we take in the opposite direction of the gradient.
$$\theta = \theta - \alpha \cdot \frac{\partial L}{\partial \theta}$$
Where:
o θ represents the model parameters (weights or biases).
o L is the loss function.
o α is the learning rate.
o $\frac{\partial L}{\partial \theta}$ is the gradient of the loss function with respect to θ.

4. Repeat Until Convergence:

o Repeat the above steps (compute gradients and update parameters) for a specified number
of iterations or until the loss function converges to a minimum (i.e., the loss stops decreasing
significantly).
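The four steps above can be sketched on a one-parameter problem. The quadratic loss L(θ) = (θ − 3)², the starting point, the learning rate, and the stopping tolerance are all assumptions chosen for illustration.

```python
def gradient_descent(grad, theta0, alpha=0.1, tol=1e-8, max_iters=10_000):
    """Repeat theta = theta - alpha * dL/dtheta until the gradient is tiny."""
    theta = theta0
    for _ in range(max_iters):
        g = grad(theta)            # step 2: compute the gradient
        if abs(g) < tol:           # step 4: stop once the loss is flat
            break
        theta = theta - alpha * g  # step 3: move against the gradient
    return theta


# Assumed example loss L(theta) = (theta - 3)^2, so dL/dtheta = 2 * (theta - 3).
theta_min = gradient_descent(lambda t: 2 * (t - 3), theta0=0.0)
print(theta_min)  # close to 3, the minimizer of the assumed loss
```

Because this loss is convex, the iterate approaches the single global minimum; on a non-convex loss the same loop may stop at a local minimum or saddle point.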
Types of Gradient Descent:
1. Batch Gradient Descent:
o In this approach, the gradient is computed using the entire training dataset.
o It is computationally expensive, especially for large datasets, but with a suitably small learning
rate it converges to the global minimum for convex problems and to a local minimum (or another
stationary point) for non-convex problems.
2. Stochastic Gradient Descent (SGD):
o In SGD, the gradient is computed using only one data point at a time, making each update
much cheaper than in batch gradient descent.
o The updates are noisier, but the noise can help the model escape shallow local minima and, in
some cases, reach a better minimum.
3. Mini-Batch Gradient Descent:
o This is a compromise between batch gradient descent and SGD. The dataset is divided into
small mini-batches (e.g., 32 or 64 samples), and the gradient is computed for each mini-
batch.
o It provides faster convergence than batch gradient descent and reduces the variance
compared to SGD.
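The three variants differ only in how many samples feed each update, which a single batch-size parameter can express. The sketch below assumes NumPy, a noise-free toy dataset y = 2x + 1, a mean-squared-error loss, and a hypothetical helper `minibatch_gd`; none of these come from the original answer.

```python
import numpy as np


def minibatch_gd(X, y, batch_size, alpha=0.2, epochs=500, seed=0):
    """Fit y ~ w * x + b by gradient descent on the mean squared error."""
    rng = np.random.default_rng(seed)
    w, b = 0.0, 0.0
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle the data every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            err = (w * X[idx] + b) - y[idx]
            w -= alpha * 2 * np.mean(err * X[idx])  # d(MSE)/dw on this batch
            b -= alpha * 2 * np.mean(err)           # d(MSE)/db on this batch
    return w, b


X = np.linspace(0.0, 1.0, 64)
y = 2.0 * X + 1.0  # assumed noise-free target

results = {
    "batch": minibatch_gd(X, y, batch_size=len(X)),  # whole dataset per update
    "sgd": minibatch_gd(X, y, batch_size=1),         # one sample per update
    "mini-batch": minibatch_gd(X, y, batch_size=16), # small batches per update
}
for name, (w, b) in results.items():
    print(name, round(float(w), 4), round(float(b), 4))
```

On this small, noise-free problem all three settings recover roughly the same line; the trade-offs described above show up mainly on large or noisy datasets.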
Key Components of Gradient Descent:
• Loss Function:
o Measures the difference between the predicted output and the true value. Examples include
Mean Squared Error (MSE) for regression and Cross-Entropy Loss for classification.
• Learning Rate:
o A hyperparameter that controls how big the steps are when updating the parameters. A
small learning rate leads to slow convergence, while a large learning rate may cause the
algorithm to overshoot the optimal solution.
• Convergence:
o The algorithm converges when the loss function stops decreasing significantly, indicating
that the model has found the optimal or near-optimal parameters.
Pros and Cons of Gradient Descent:
Pros:
• Simple and easy to implement.
• Works well with large datasets.
• Can be used with various types of machine learning models (linear regression, neural networks,
etc.).
Cons:
• Can get stuck in local minima or saddle points, especially with non-convex loss functions.
• The choice of the learning rate is crucial and may require experimentation.
• Computationally expensive for very large datasets (especially in batch gradient descent).
Graphical Example:
Imagine the loss function as a 3D surface, where the X and Y axes represent the parameters (weights and
biases) and the Z axis represents the loss. Gradient descent starts at a random point on this surface and
moves downhill (in the direction of the negative gradient) to find the lowest point (global or local
minimum).
RBF Network for Approximation (5 Marks)
Radial Basis Function (RBF) networks are used for function approximation by leveraging their architecture:
1. Input Layer: Passes the input data.
2. Hidden Layer: Contains neurons with radial basis functions (like Gaussian), which compute
activations based on the distance between the input and each neuron's center.
3. Output Layer: Produces the final output by combining weighted activations from the hidden layer.
Process:
• Training: Centers and spreads of the radial basis functions are determined, often using clustering
(like k-means), and weights are adjusted using methods like least squares.
• Approximation: For any input, the network calculates activations and outputs a weighted sum,
approximating the target function.
Example: Approximating sin(x) by learning from input-output pairs, where the network captures non-linear
relationships through localized responses of Gaussian functions.
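A minimal sketch of the sin(x) example follows, assuming NumPy. For brevity the centers are spaced evenly instead of being found by k-means, and the spread of 0.7 and the count of 10 hidden units are arbitrary illustrative choices.

```python
import numpy as np


def rbf_design(x, centers, sigma):
    """Gaussian activation of every hidden unit for every input value."""
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))


# Training pairs sampled from the target function.
x_train = np.linspace(0.0, 2 * np.pi, 50)
y_train = np.sin(x_train)

# Assumption: evenly spaced centers stand in for a clustering step.
centers = np.linspace(0.0, 2 * np.pi, 10)
sigma = 0.7

# Output weights by least squares on the hidden-layer activations.
Phi = rbf_design(x_train, centers, sigma)
weights, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)

# Approximation: weighted sum of activations at unseen inputs.
x_test = np.linspace(0.0, 2 * np.pi, 200)
y_pred = rbf_design(x_test, centers, sigma) @ weights
max_error = float(np.max(np.abs(y_pred - np.sin(x_test))))
print(max_error)
```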
To calculate the weight matrix W for an auto associative network (such as a Hopfield network) that stores
the pattern p=[1,−1,1,−1], we use the Hebbian learning rule.
Hebbian Learning Rule:
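The weight matrix W is calculated as:
$$W = p^T \cdot p$$
Where:
• p is the pattern vector (written as a row vector).
• $p^T$ is the transpose of p.
• The diagonal elements of W are typically set to zero to avoid self-feedback.
Steps:
1. Pattern Vector: p=[1,−1,1,−1].
2. Outer Product: Calculate the outer product $p^T \cdot p$:
$$p^T p = \begin{bmatrix} 1 & -1 & 1 & -1 \\ -1 & 1 & -1 & 1 \\ 1 & -1 & 1 & -1 \\ -1 & 1 & -1 & 1 \end{bmatrix}$$
3. Set Diagonal Elements to Zero:
Final Weight Matrix W:
$$W = \begin{bmatrix} 0 & -1 & 1 & -1 \\ -1 & 0 & -1 & 1 \\ 1 & -1 & 0 & -1 \\ -1 & 1 & -1 & 0 \end{bmatrix}$$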

This matrix can be used to recall the pattern p=[1,−1,1,−1] in the auto associative network.
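The calculation can be checked numerically. The sketch below assumes NumPy and also verifies that one pass of the sign function over W·p returns the stored pattern.

```python
import numpy as np

p = np.array([1, -1, 1, -1])

# Hebbian rule: outer product of the pattern with itself.
W = np.outer(p, p)

# Remove self-connections by zeroing the diagonal.
np.fill_diagonal(W, 0)

# Recall check: the sign of W @ p should reproduce p.
recalled = np.sign(W @ p)

print(W)
print(recalled)
```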
The architecture of a Hopfield Network:
A Hopfield network is a type of recurrent neural network used for associative memory. It stores patterns
and retrieves them even when presented with noisy or incomplete input.
Key Features:
1. Fully Connected Neurons:
o Each neuron is connected to every other neuron in the network.
o There are no self-connections; each neuron does not connect to itself (i.e., $w_{ii} = 0$).
2. Symmetric Weights:
o The weight matrix W is symmetric, meaning $w_{ij} = w_{ji}$.

3. Binary States:
o Neurons have binary states, typically +1 or −1 (sometimes 1 or 0).
4. Update Rule:
o The network updates neuron states asynchronously or synchronously using the activation
function:
$$S_i = \text{sgn}\left(\sum_{j} w_{ji} S_j - \theta_i\right)$$
o $S_i$ is the state of neuron i.
o $w_{ji}$ is the weight between neuron i and neuron j.
o $\theta_i$ is the threshold for neuron i.
o sgn is the sign function, returning +1 if the input is positive and −1 otherwise.
Components:
1. Neurons:
o Represented by nodes in the network.
o Each neuron acts as both an input and output neuron.
2. Weights:
o Connections between neurons are represented by weights.
o The weights are calculated using Hebbian learning or other learning rules to store patterns.
3. State Update:
o The state of each neuron is updated based on the weighted sum of inputs from other
neurons.
o The network stabilizes at a state corresponding to a stored pattern.
Operation:
• Training Phase: Patterns are stored by setting the weight matrix using a learning rule (e.g., Hebbian
learning).
• Recall Phase: A noisy or partial pattern is presented, and the network iteratively updates until it
converges to a stored pattern.
Example:
• For a network storing patterns p1=[1,−1,1] and p2=[−1,1,−1]; the weight matrix is calculated, and
the network can recall these patterns even from noisy inputs.
Hopfield networks are used in associative memory, optimization problems, and pattern recognition tasks.
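The storage and recall phases for the two example patterns can be sketched as follows, assuming NumPy, zero thresholds, and asynchronous updates in index order. One further assumption differs from the sign convention stated above: when a neuron's net input is exactly zero, its current state is kept, which avoids arbitrary flips in such a small network. The helper names are hypothetical.

```python
import numpy as np


def train_hopfield(patterns):
    """Hebbian storage: sum of outer products with a zeroed diagonal."""
    n = patterns.shape[1]
    W = np.zeros((n, n))
    for p in patterns:
        W += np.outer(p, p)
    np.fill_diagonal(W, 0)
    return W


def recall(W, state, max_sweeps=10):
    """Asynchronous updates until a full sweep changes no neuron."""
    s = state.copy()
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(s)):
            h = W[i] @ s  # net input to neuron i (thresholds assumed zero)
            # Assumption: keep the current state on a tie (h == 0).
            new_state = s[i] if h == 0 else (1 if h > 0 else -1)
            if new_state != s[i]:
                s[i] = new_state
                changed = True
        if not changed:
            break
    return s


patterns = np.array([[1, -1, 1], [-1, 1, -1]])
W = train_hopfield(patterns)

noisy = np.array([1, 1, 1])  # first pattern with its middle bit flipped
restored = recall(W, noisy)
print(restored)
```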
RBF Network for Classification:
Radial Basis Function (RBF) networks are used for classification by mapping input data into a higher-
dimensional space using radial basis functions (typically Gaussian functions) and then performing a linear
separation.
Steps in RBF Classification:
1. Input Layer: Passes the input data to the hidden layer.
2. Hidden Layer (RBF Layer):
o Contains neurons with radial basis functions.
o Each neuron computes the distance between the input and its center.
o Outputs are computed using a Gaussian function:
$$\phi(x) = \exp\left(-\frac{\|x - c\|^2}{2\sigma^2}\right)$$
where c is the center and σ is the spread.

3. Output Layer:
o Performs a weighted sum of the hidden layer outputs.
o Uses a softmax function or other techniques for classification into different classes.
Comparison of RBF Networks with MLPs (Multilayer Perceptrons):

Feature | RBF Networks | MLPs (Multilayer Perceptrons)
Architecture | 3 layers: input, hidden (RBF), and output | Multiple layers: input, hidden (fully connected), and output
Hidden Layer | Uses radial basis functions (Gaussian) | Uses neurons with activation functions (sigmoid, ReLU)
Training | Typically two-phase: unsupervised (clustering) for RBF centers, then supervised for weights | Supervised learning with backpropagation
Decision Boundaries | Typically local and spherical | Global, complex, and non-linear
Learning Speed | Faster due to localized learning | Slower, requires gradient descent and backpropagation
Complexity | Simpler, easier to train for small datasets | More complex, can model more intricate patterns
Applications | Function approximation, classification with fewer samples | Deep learning, complex classification tasks

Topographic Map:
A Topographic Map in neural networks refers to an ordered mapping of input data into a spatial
arrangement of neurons, where:
• Neighboring Neurons: Neurons that are spatially close on the map respond to similar input
patterns.
• Preservation of Input Structure: The spatial relationships of input data are preserved in the neuron
arrangement, meaning similar inputs are mapped to nearby neurons.
Example:
• Self-Organizing Maps (SOM): A type of topographic map where the network learns to organize
neurons based on input similarity. This helps visualize and cluster high-dimensional data into a 2D or
3D map for easy interpretation.
Topographic maps are useful in data visualization, clustering, and dimensionality reduction.
Neuron Inhibition and Activation Functions:
Neuron inhibition refers to the reduction or suppression of a neuron's activity, which depends on the type
of activation function used. Different activation functions influence how input signals are transformed into
output, and hence, how inhibition is manifested.
Justification with Different Activation Functions:
1. Step Function (Threshold Function):
o Function:
$$f(x) = \begin{cases} 1 & \text{if } x \ge \theta \\ 0 & \text{if } x < \theta \end{cases}$$
o Inhibition: If the weighted sum of inputs is below the threshold θ, the output is zero,
effectively inhibiting the neuron from firing.
2. Sigmoid Function:
o Function:
$$f(x) = \frac{1}{1 + e^{-x}}$$
o Inhibition: The sigmoid function squashes the input to a range between 0 and 1. For
negative inputs, the output approaches zero, representing inhibition as it diminishes the
neuron's response.
3. ReLU (Rectified Linear Unit):
o Function:
$$f(x) = \max(0, x)$$
o Inhibition: For negative inputs, the output is zero, effectively inhibiting the neuron. This
allows only positive activations to pass through.
4. Tanh Function:
o Function:
$$f(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$
o Inhibition: Outputs range between -1 and 1. Negative inputs produce negative outputs,
which can represent a form of inhibition depending on the context (e.g., if outputs are
expected to be positive for activation).
In summary, the choice of activation function determines how neurons handle inhibitory inputs, with some
functions (like ReLU and step functions) explicitly zeroing out negative inputs, while others (like sigmoid and
tanh) reduce the output magnitude or allow negative outputs to represent inhibition.
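The four functions are short enough to compare directly. The sketch below uses only the Python standard library; the default threshold of 0 for the step function is an assumption.

```python
import math


def step(x, theta=0.0):
    """Threshold unit: silent (0) below theta, firing (1) otherwise."""
    return 1.0 if x >= theta else 0.0


def sigmoid(x):
    """Squashes any input into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    """Zeroes out negative inputs and passes positive ones unchanged."""
    return max(0.0, x)


def tanh(x):
    """Squashes any input into (-1, 1); negative inputs stay negative."""
    return math.tanh(x)


for x in (-2.0, 0.0, 2.0):
    print(x, step(x), round(sigmoid(x), 3), relu(x), round(tanh(x), 3))
```

For the negative input, step and ReLU return exactly zero, sigmoid returns a small positive value, and tanh returns a negative value, matching the four inhibition behaviours described above.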

Delta Learning Rule:

The Delta Learning Rule (also known as the Widrow-Hoff rule or the Least Mean Squares (LMS) rule) is a
method for updating the weights of a neural network to minimize the error between the predicted and
actual outputs. It is primarily used in simple perceptrons and linear models.
Steps:
1. Initialization: Initialize weights 𝑤i to small random values.
2. Compute Output: Calculate the network's output for a given input:
$$y = \sum_{i} w_i x_i$$
3. Compute Error: Calculate the error e between the desired output d and the actual output y:
$$e = d - y$$
4. Update Weights: Adjust the weights to reduce the error:
$$\Delta w_i = \eta\, e\, x_i$$
o $\eta$ is the learning rate.
o $x_i$ is the input value.
o $\Delta w_i$ is the change in weight for input i.
5. Iteration: Repeat the process for multiple epochs or until the error converges to a minimum
threshold.
Significance:
• The delta learning rule helps the network learn suitable weights by iteratively reducing the error;
for a linear unit with a sufficiently small learning rate, the weights move toward the least-mean-squares solution.
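The steps of the delta rule can be sketched for a single linear unit. The toy targets (generated by the weights 2 and −1), the zero initial weights used for reproducibility instead of small random values, and the helper name `delta_rule` are assumptions for illustration.

```python
def delta_rule(samples, eta=0.1, epochs=200):
    """Train a linear unit y = sum(w_i * x_i) with the Widrow-Hoff update."""
    w = [0.0] * len(samples[0][0])  # assumed zero start for reproducibility
    for _ in range(epochs):
        for x, d in samples:
            y = sum(wi * xi for wi, xi in zip(w, x))         # compute output
            e = d - y                                        # compute error
            w = [wi + eta * e * xi for wi, xi in zip(w, x)]  # update weights
    return w


# Assumed toy data produced by the target weights (2, -1).
samples = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0)]
w = delta_rule(samples)
print(w)  # approaches [2.0, -1.0]
```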
Operations Implemented by Perceptron:
A perceptron is a simple type of artificial neuron that can implement basic linear classification tasks. The
perceptron can learn to perform operations like:
• AND
• OR
• NAND
• NOR
These operations are linearly separable, meaning they can be separated by a straight line in a 2D input
space.
Why Perceptron Cannot Implement XOR Function:
The Exclusive OR (XOR) function is not linearly separable, which means it cannot be represented by a single
straight line in the input space. Here's why:
XOR Truth Table:

Input 1 | Input 2 | Output (XOR)
0 | 0 | 0
0 | 1 | 1
1 | 0 | 1
1 | 1 | 0

Graphical Representation:
• Points (0,0) and (1,1) belong to one class (output 0).
• Points (0,1) and (1,0) belong to another class (output 1).
These points cannot be separated by a single straight line, which is a limitation of the perceptron.
Illustration of XOR Non-linearity:
1. Graph:
o Plot the points (0,0), (1,1) as one class (output 0) and (0,1), (1,0) as another class (output 1).
o No straight line can separate these two classes.
2. Explanation:
o The perceptron computes the weighted sum of inputs and applies a step function.
o Since XOR requires separating non-linearly arranged points, the perceptron's linear
boundary fails.
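The limitation can be observed by running the same training loop on AND and on XOR. In this sketch the learning rate of 1, the zero initial weights, the epoch cap of 1000, and the helper name `train` are assumptions; failing to converge within the cap illustrates the problem but is not by itself a proof.

```python
def train(samples, eta=1.0, max_epochs=1000):
    """Perceptron training; reports whether an error-free epoch was reached."""
    w1 = w2 = b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for (x1, x2), d in samples:
            y = 1 if w1 * x1 + w2 * x2 + b >= 0 else 0
            if y != d:
                w1 += eta * (d - y) * x1
                w2 += eta * (d - y) * x2
                b += eta * (d - y)
                mistakes += 1
        if mistakes == 0:
            return True, (w1, w2, b)
    return False, (w1, w2, b)


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_ok, and_params = train(AND)
xor_ok, _ = train(XOR)
print("AND converged:", and_ok, and_params)
print("XOR converged:", xor_ok)
```

AND reaches an error-free epoch after a handful of passes, while XOR keeps cycling through weight values because no single line separates its two classes.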
Working Principle of Perceptron:
1. Initialization:
o Initialize weights $w_1, w_2, \dots, w_n$ and bias b to small random values.
2. Weighted Sum:
o Compute the weighted sum of the inputs:
$$z = \sum_{i=1}^{n} w_i x_i + b$$
3. Activation Function:
o Apply a step function to decide the output:
$$y = \begin{cases} 1 & \text{if } z \ge 0 \\ 0 & \text{otherwise} \end{cases}$$
4. Learning Rule:
o Adjust the weights using the perceptron learning rule:
$$w_i \leftarrow w_i + \eta\,(d - y)\,x_i$$
where d is the desired output, y is the actual output, and η is the learning rate.
5. Iteration:
o Repeat the process for all inputs until the weights converge or after a set number of
iterations.
Summary:
• Operations Implementable: AND, OR, NAND, NOR.
• Limitation: Cannot implement XOR, because XOR is not linearly separable and a single perceptron can only form a linear decision boundary.
• Perceptron Principle: It learns to adjust weights to minimize the difference between actual and
desired outputs, working well for linearly separable problems.
Supervised Learning:
Supervised learning is a type of machine learning where the model is trained on a labeled dataset. Each
input comes with a corresponding output label, and the goal is for the model to learn the mapping from
inputs to outputs.
Key Characteristics:
• Labeled Data: The training data includes input-output pairs.
• Objective: The model learns to predict the output for new inputs based on the learned mapping.
• Examples: Classification (e.g., spam detection) and regression (e.g., predicting house prices).
Process:
1. Input Data: Provide input features and corresponding output labels.
2. Model Training: The model learns by minimizing the error between predicted and actual outputs.
3. Prediction: Once trained, the model can predict labels for new, unseen inputs.
Examples:
• Classification: Email spam detection (spam or not spam).
• Regression: Predicting temperature based on historical weather data.

Unsupervised Learning:
Unsupervised learning involves training a model on data that does not have labeled outputs. The goal is to
find patterns, structures, or clusters within the data.
Key Characteristics:
• Unlabeled Data: No explicit output labels are provided.
• Objective: Discover hidden patterns or intrinsic structures in the data.
• Examples: Clustering (e.g., customer segmentation) and dimensionality reduction (e.g., PCA).
Process:
1. Input Data: Provide the model with input features without labels.
2. Pattern Discovery: The model identifies patterns or groupings in the data.
3. Application: Use the discovered patterns for insights, data compression, or as pre-processing for
other tasks.
Examples:
• Clustering: Grouping customers based on purchasing behavior.
• Dimensionality Reduction: Reducing the number of features in a dataset while retaining essential
information.

Comparison:

Feature | Supervised Learning | Unsupervised Learning
Data Type | Labelled data (input-output pairs) | Unlabelled data
Goal | Predict outputs from inputs | Find hidden patterns or structures
Examples | Classification, Regression | Clustering, Dimensionality Reduction
Output | Known during training | Unknown, model identifies patterns
Complexity | Requires labelled data, more complex setup | Easier data collection, harder to evaluate

In summary, supervised learning is ideal for tasks where labeled data is available and the goal is prediction,
while unsupervised learning is useful for discovering underlying patterns in data without predefined labels.
Implementing AND Function using McCulloch-Pitts Neuron:
The McCulloch-Pitts neuron is a simple model of a biological neuron used to simulate basic logical functions
such as AND, OR, and NOT. Here's how you can implement the AND function:
Step-by-Step Implementation:
1. Inputs and Output:
o The AND function has two binary inputs X_1 and X_2.
o The output Y is 1 if both inputs are 1; otherwise, the output is 0.
2. Truth Table for AND:

X_1 | X_2 | Output (AND)
0 | 0 | 0
0 | 1 | 0
1 | 0 | 0
1 | 1 | 1
3. Weights and Threshold:
o Assign weights w_1 = 1 and w_2 = 1 for inputs X_1 and X_2.
o Set a threshold θ=1.5.
4. Net Input Calculation:
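o The net input z is calculated as:
$$z = w_1 X_1 + w_2 X_2$$
o Apply the step function to determine the output:
$$Y = \begin{cases} 1 & \text{if } z \ge \theta \\ 0 & \text{if } z < \theta \end{cases}$$
Calculations for Each Input Pair:
1. For X_1 = 0, X_2 = 0: $z = 1 \cdot 0 + 1 \cdot 0 = 0 < 1.5$, so Y = 0.
2. For X_1 = 0, X_2 = 1: $z = 1 \cdot 0 + 1 \cdot 1 = 1 < 1.5$, so Y = 0.
3. For X_1 = 1, X_2 = 0: $z = 1 \cdot 1 + 1 \cdot 0 = 1 < 1.5$, so Y = 0.
4. For X_1 = 1, X_2 = 1: $z = 1 \cdot 1 + 1 \cdot 1 = 2 \ge 1.5$, so Y = 1.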

Summary:
• The McCulloch-Pitts neuron correctly implements the AND function by using weights w_1 = 1, w_2 =
1, and a threshold θ = 1.5.
• The neuron outputs 1 only when both inputs are 1, and 0 otherwise.
This simple implementation shows how logical operations can be modeled using basic neural networks.
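The same neuron can be written as a short function and evaluated over the whole truth table. The function name `mp_neuron` is hypothetical; the weights and threshold are the ones chosen above.

```python
def mp_neuron(inputs, weights, theta):
    """McCulloch-Pitts unit: fire when the weighted sum reaches the threshold."""
    z = sum(w * x for w, x in zip(weights, inputs))
    return 1 if z >= theta else 0


# Weights w_1 = w_2 = 1 and threshold 1.5, evaluated on all four input pairs.
outputs = {
    (x1, x2): mp_neuron((x1, x2), weights=(1, 1), theta=1.5)
    for x1 in (0, 1)
    for x2 in (0, 1)
}
print(outputs)
```

Lowering the threshold to 0.5 with the same weights would turn the unit into an OR gate, since a single active input would then be enough to fire.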
Bayesian Neural Network:
A Bayesian Neural Network is a type of neural network that incorporates Bayesian inference into the
training process. In contrast to standard neural networks, which learn deterministic weights during training,
Bayesian neural networks model uncertainty by treating weights as distributions rather than fixed values.
This allows the network to quantify the uncertainty in predictions, which is useful for tasks where
uncertainty estimation is important (e.g., classification with uncertain data).
How Bayesian Neural Networks Work:
1. Prior Distribution: Before training, we define prior distributions over the weights of the neural
network, typically using Gaussian distributions. These priors encode our beliefs about the weights
before observing the data.
2. Likelihood: The likelihood function represents the probability of observing the data given the
weights of the network.
3. Posterior Distribution: Using Bayes' theorem, the posterior distribution of the weights is computed
based on the data. This represents the updated belief about the weights after observing the training
data.
4. Prediction: Predictions are made by averaging over the posterior distribution of weights, taking into
account the uncertainty in the model parameters.

Bayesian Neural Network Algorithm:

1. Define Prior Distribution: Choose prior distributions for the weights (e.g., Gaussian prior).
2. Calculate Likelihood: Compute the likelihood of the observed data given the current weights.
3. Compute Posterior: Use Bayes' rule to compute the posterior distribution of the weights.
4. Prediction: Use the posterior distribution to make predictions, usually by averaging over all possible
sets of weights.
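The four steps can be illustrated on the smallest possible case: a single linear "neuron" y = w·x with one weight, where the posterior has a closed form. This is a deliberately simplified stand-in, not a full Bayesian neural network; real networks need approximate inference such as variational methods or sampling. The toy data, prior variance, and noise variance below are assumptions, and NumPy is assumed to be available.

```python
import numpy as np

# Hypothetical toy data in which y is exactly 2 * x.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

prior_var = 10.0  # step 1: assumed Gaussian prior w ~ N(0, prior_var)
noise_var = 1.0   # step 2: assumed Gaussian likelihood with this noise variance

# Step 3: with a Gaussian prior and likelihood the posterior is also Gaussian.
post_precision = 1.0 / prior_var + np.sum(x * x) / noise_var
post_var = 1.0 / post_precision
post_mean = post_var * np.sum(x * y) / noise_var

# Step 4: predict by averaging the model output over sampled weights.
rng = np.random.default_rng(0)
w_samples = rng.normal(post_mean, np.sqrt(post_var), size=10_000)
x_new = 4.0
pred_samples = w_samples * x_new
pred_mean = float(pred_samples.mean())
pred_std = float(pred_samples.std())  # spread reflects weight uncertainty

print(post_mean, pred_mean, pred_std)
```

The spread of `pred_samples` is what a deterministic network cannot provide: it widens for inputs far from the training data because uncertainty in w is multiplied by a larger x.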

Regularization Theory:
Regularization is a technique used in machine learning to prevent overfitting by adding a penalty term to
the loss function. Overfitting occurs when a model learns the noise or irrelevant details in the training data
rather than the underlying trend. Regularization helps to improve the model's ability to generalize to
unseen data by making the model simpler.
Types of Regularization:
1. L2 Regularization (Ridge Regularization): Adds the squared magnitude of the coefficients as a
penalty term to the loss function.
$$\text{Loss} = L(y, \hat{y}) + \lambda \sum_{i} \theta_i^2$$
Where λ is the regularization parameter, $\theta_i$ are the model parameters, and $L(y, \hat{y})$ is the error term.
2. L1 Regularization (Lasso Regularization): Adds the absolute value of the coefficients as a penalty
term.
3. Elastic Net: A combination of L1 and L2 regularization that balances between both.
Benefits of Regularization:
• Prevents the model from fitting the noise in the data (overfitting).
• Encourages simpler models, improving generalization.
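The effect of the L2 penalty can be seen on a linear model, where the regularized solution has a closed form. The sketch assumes NumPy, a sum-of-squared-errors loss, synthetic data, and a hypothetical helper `ridge_fit`; it shows that a larger λ shrinks the parameter vector.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Minimize ||X @ theta - y||^2 + lam * ||theta||^2 in closed form."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


# Assumed synthetic data: 20 samples, 5 features, a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
true_theta = np.array([3.0, -2.0, 0.0, 0.0, 1.0])
y = X @ true_theta + 0.1 * rng.normal(size=20)

# Norm of the fitted parameters for increasing penalty strength.
norms = {lam: float(np.linalg.norm(ridge_fit(X, y, lam))) for lam in (0.0, 1.0, 100.0)}
print(norms)
```

A very large λ drives the parameters toward zero and underfits, so λ is normally tuned on held-out data.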

Regularization in Radial Basis Function (RBF) Networks:

In Radial Basis Function (RBF) networks, regularization is used to improve the generalization capability of
the model, especially when the network is trained on small or noisy datasets. RBF networks consist of three
layers: input, hidden (RBF), and output. Each hidden node applies a radial basis function, typically Gaussian,
to the input data.
How Regularization is Used in RBF Networks for Function Approximation:
1. Goal of Regularization in RBF Networks: The goal is to prevent overfitting by controlling the
complexity of the network. In RBF networks, regularization helps by controlling the centers of the
radial basis functions and the widths of the Gaussian functions.
2. Regularization Approach in RBF Networks:
o Centers: Regularization can constrain the number of centers (hidden nodes) used in the
network. Too many centers may result in overfitting, while too few may cause underfitting.
o Widths of Gaussian Functions: Regularization ensures that the widths (or spreads) of the
Gaussian functions are not too small, which would lead to overfitting, or too large, which
could lead to underfitting.
A penalty term can be added to the error function to penalize large centers or overly narrow Gaussian
functions. This helps to avoid overfitting by preventing the model from becoming too sensitive to noise in
the data.
3. Regularized Loss Function for RBF Networks: The loss function in an RBF network with
regularization can be written as:

Where:
o 𝑦𝑖 is the true output.
o 𝑦̂𝑖 is the predicted output.
o λ is the regularization parameter.
o M is the number of hidden units (centers).
4. Benefits of Regularization in RBF Networks:
o Reduces the risk of overfitting, especially when the number of centers is large or when the
dataset is noisy.
o Helps to find a balance between underfitting and overfitting by penalizing overly complex
models.
o Improves the ability of the RBF network to generalize to new data.
Conclusion:
• Bayesian Neural Networks provide uncertainty estimates by treating model parameters
probabilistically, making them more robust in situations where data is uncertain or noisy.
• Regularization Theory is a key concept in preventing overfitting by introducing a penalty term to the
loss function, improving model generalization.
• In RBF Networks, regularization helps in controlling the complexity of the model, ensuring the
network doesn't overfit or underfit the data by adjusting the number of centers and the width of
the radial basis functions.
RBF Networks for Pattern Classification:
Radial Basis Function (RBF) networks are a type of artificial neural network used for pattern classification
and regression tasks. They consist of three layers: input, hidden (RBF), and output layers.
Architecture of RBF Networks:
1. Input Layer: The input layer contains nodes that represent the features of the data.
2. Hidden Layer (RBF Layer): This layer applies a radial basis function (typically a Gaussian function) to
the input data. Each node in this layer represents a radial basis function with a center, spread, and
weight. The output of each hidden node depends on the distance between the input data and the
center of the RBF.
o The radial basis function computes the similarity between the input vector and a predefined
center vector.
o The typical RBF used is Gaussian:
$$\phi(x) = \exp\left(-\frac{\|x - c\|^2}{2\sigma^2}\right)$$
where x is the input, c is the center of the radial function, and σ is the spread (standard
deviation) of the Gaussian.
3. Output Layer: The output layer is typically a linear combination of the outputs from the RBFs. For
classification, a softmax function or thresholding can be used to produce class labels.
Working of RBF Networks in Pattern Classification:
1. Training the RBF Network:
o Centers (c): Centers are typically chosen using clustering techniques like k-means, which find
representative points in the data.
o Spreads (σ): The spread determines how broad the influence of each center is. It is usually
set based on the variance of the input data or the distance between centers.
o Weights (w): The weights between the hidden layer and output layer are learned by using a
least-squares or another optimization technique.
2. Classification Process:
o Once the centers, spreads, and weights are learned, the network can classify new inputs.
The input is passed through the RBF layer, where the distances from each center are
computed. These distances are passed through the Gaussian function to calculate
activations.
o The output layer then combines these activations, and the final result is classified into a
particular category using an activation function (like a softmax for multi-class classification).
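A small RBF classifier for the XOR patterns illustrates the process. In this sketch the two centers are fixed by hand at (0,0) and (1,1) instead of being found by k-means, the spread is chosen so that 2σ² = 1, the output weights come from least squares, and a 0.5 threshold replaces softmax; all of these are assumptions, and NumPy is assumed to be available.

```python
import numpy as np


def rbf_features(X, centers, sigma):
    """Gaussian activations from squared distances to each center."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma ** 2))


X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
labels = np.array([0, 1, 1, 0])  # XOR targets

# Assumption: hand-picked centers and spread (no clustering step).
centers = np.array([[0.0, 0.0], [1.0, 1.0]])
sigma = 1 / np.sqrt(2)

# Hidden-layer outputs plus a constant column for the bias.
Phi = np.hstack([rbf_features(X, centers, sigma), np.ones((len(X), 1))])

# Output weights by least squares, then threshold the linear output.
w, *_ = np.linalg.lstsq(Phi, labels, rcond=None)
predicted = (Phi @ w >= 0.5).astype(int)
print(predicted)
```

In the hidden-layer space the two XOR classes become linearly separable, which is why a purely linear output layer is enough here.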
Advantages of RBF Networks for Classification:
• Non-linear mapping: RBF networks can map input patterns to high-dimensional spaces, making
them effective in capturing complex, non-linear relationships in the data.
• Local sensitivity: Since RBFs are sensitive to local regions of input space, they can handle data with
varying degrees of complexity.

Relevance Vector Machine (RVM) for Classification and Regression:

Relevance Vector Machines (RVM) are a probabilistic model for classification and regression that is closely
related to Support Vector Machines (SVMs) but with a key difference: RVMs provide probabilistic
predictions, which allow for uncertainty estimation. RVMs are based on Bayesian inference and use a
sparse model with fewer relevance vectors than support vectors in SVMs.
Key Characteristics of RVM:
• Sparsity: RVMs are designed to use only a small number of "relevance vectors" from the training
data, which leads to a sparse model. This is achieved by placing a prior distribution on the model
weights and using marginal likelihood maximization to determine which data points are most
relevant.
• Probabilistic Output: Unlike SVMs, which provide deterministic output, RVMs generate probabilistic
predictions, making them more flexible in terms of handling uncertainty.
RVM for Classification:
1. Model Representation: In RVM, the model for classification is defined as:
$$y(x) = \sum_{i} w_i\, \phi(x, c_i) + b$$
where $w_i$ are the weights, $\phi(x, c_i)$ is the radial basis function (e.g., Gaussian) between the input
vector x and the center $c_i$, and b is the bias.
2. Training RVM:
o The parameters w_i and b are learned by maximizing the marginal likelihood of the data
under the Bayesian framework. The likelihood is obtained using a Gaussian likelihood
function for each data point.
o A sparsity-inducing prior is applied to the weights w_i, which leads to only a small subset of
the data points being "relevant" for the model.
3. Prediction for Classification: For a new test input $x_{test}$, the RVM predicts the class label
by computing the posterior distribution over the weights:

where $\hat{y}_{test}$ is the predicted value and $\sigma^2$ is the variance of the
prediction.
The RVM provides a probabilistic prediction, which can be interpreted as the confidence level in the
predicted class.
RVM for Regression:
1. Model Representation: In regression, RVM works similarly to classification but with continuous
output. The model for regression is:
$$y(x) = \sum_{i} w_i\, \phi(x, c_i) + b$$
where y(x) is the predicted continuous value.

2. Training RVM for Regression: The training process is similar to classification, but instead of using
class labels, we use continuous target values for the data. The model is learned by maximizing the
marginal likelihood of the continuous target values under the Bayesian framework.
3. Prediction for Regression: The output is a continuous value, and the prediction uncertainty is also
estimated. The RVM provides a mean prediction and a confidence interval for each prediction.
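The mean prediction and its uncertainty can be sketched in a few lines of Python. This assumes training has already produced a Gaussian posterior over the weights; the posterior mean `mean_w`, covariance `cov_w`, noise variance `noise_var`, and the centers below are hypothetical values chosen for illustration, and the bias is handled as a constant basis function equal to 1:

```python
import math

def gaussian_rbf(x, c, spread=1.0):
    """Gaussian radial basis function for scalar input x and center c."""
    return math.exp(-((x - c) ** 2) / (2.0 * spread ** 2))

def rvm_predict(x, centers, mean_w, cov_w, noise_var, spread=1.0):
    """Return (predictive mean, predictive variance) at scalar input x."""
    # Basis vector: constant 1 for the bias, then one RBF per relevance vector.
    phi = [1.0] + [gaussian_rbf(x, c, spread) for c in centers]
    mean = sum(m * p for m, p in zip(mean_w, phi))
    # Uncertainty from the weights: phi^T * cov_w * phi.
    n = len(phi)
    weight_var = sum(phi[i] * cov_w[i][j] * phi[j] for i in range(n) for j in range(n))
    return mean, noise_var + weight_var

# Hypothetical posterior over [bias, w_1, w_2] after training.
centers = [0.0, 3.0]
mean_w = [0.5, 1.0, 2.0]
cov_w = [
    [0.01, 0.0, 0.0],
    [0.0, 0.04, 0.0],
    [0.0, 0.0, 0.09],
]
noise_var = 0.01

mean, var = rvm_predict(0.0, centers, mean_w, cov_w, noise_var)
half_width = 1.96 * math.sqrt(var)          # approximate 95% interval
interval = (mean - half_width, mean + half_width)
print(mean, var, interval)
```

The predictive variance is the noise variance plus a term that grows with the posterior uncertainty of the weights active at x, so the interval widens where the model is less sure of its weights.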
Advantages of RVM:
• Sparsity: RVMs typically use fewer relevance vectors than SVMs use support vectors, leading to a
more efficient and less complex model.
• Probabilistic Outputs: RVMs provide uncertainty estimates for predictions, making them more
suitable for applications where confidence in predictions is important.
• Flexibility: RVMs can handle both classification and regression tasks using a similar approach.
Comparison of RVM and SVM:
• Sparsity: RVMs tend to retain fewer basis vectors (relevance vectors) than SVMs retain support
vectors, making them more efficient in terms of memory and computation at prediction time.
• Probabilistic vs. Deterministic: RVMs provide probabilistic outputs, while SVMs provide
deterministic outputs (i.e., the predicted class label or continuous value).
• Computational Complexity: RVMs involve more complex optimization due to the need for Bayesian
inference, while SVMs typically use quadratic programming.

