CH 1
Unit-1
How AI works
An AI system accepts data input in the form of speech, text, images, etc. The system then processes the data by applying various rules and algorithms, interpreting, predicting, and acting on the input data. Upon processing, the system produces an outcome, i.e., success or failure, for the input data. The result is then assessed through analysis, discovery, and feedback. Lastly, the system uses its assessments to adjust the input data, rules and algorithms, and target outcomes. This loop continues until the desired result is achieved.
History of Machine Learning
1. Probabilistic modeling
Probabilistic modeling is the application of the principles of statistics to data analysis. It was one of the earliest forms of machine learning, and it's still widely used to this day. One of the best-known algorithms in this category is the Naive Bayes algorithm.
Naive Bayes is a type of machine-learning classifier based on applying Bayes' theorem while assuming that the features in the input data are all independent (a strong, or "naive," assumption, which is where the name comes from).
A closely related model is logistic regression (logreg for short), which is sometimes considered to be the "hello world" of modern machine learning. Logreg is a classification algorithm rather than a regression algorithm. Much like Naive Bayes, logreg predates computing by a long time, yet it's still useful to this day because of its simple and versatile nature.
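As a quick illustration, here is a minimal sketch, assuming scikit-learn is available, that fits a Naive Bayes classifier and a logistic regression classifier on a small synthetic dataset; the dataset, sizes, and random seeds are illustrative choices, not from the text.

```python
# Illustrative sketch (scikit-learn assumed): Naive Bayes vs. logistic regression.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

# Synthetic two-class dataset (illustrative only).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

logreg = LogisticRegression().fit(X_train, y_train)
nb = GaussianNB().fit(X_train, y_train)

print("logistic regression accuracy:", logreg.score(X_test, y_test))
print("naive Bayes accuracy:", nb.score(X_test, y_test))
```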
Early neural networks
Early iterations of neural networks have been completely supplanted by the modern variants, but it's helpful to be aware of how deep learning originated. Although the core ideas of neural networks were investigated in toy forms as early as the 1950s, the approach took decades to get started. This changed in the mid-1980s, when multiple people independently rediscovered the backpropagation algorithm, a way to train chains of parametric operations using gradient-descent optimization.
The first successful practical application of neural nets came in 1989 from Bell Labs, when
Yann LeCun combined the earlier ideas of convolutional neural networks and
backpropagation, and applied them to the problem of classifying handwritten digits. The
resulting network, dubbed LeNet, was used by the United States Postal Service in the 1990s
to automate the reading of ZIP codes on mail envelopes.
Kernel methods
Kernel methods are a group of classification algorithms, the best known of which is the support vector machine (SVM). The modern formulation of the SVM was developed by Vladimir Vapnik and Corinna Cortes in the early 1990s at Bell Labs and published in 1995, although an older linear formulation was published by Vapnik and Alexey Chervonenkis as early as 1963. SVMs aim at solving classification problems by finding good decision boundaries between two sets of points belonging to two different categories.
A decision boundary can be thought of as a line or surface separating your training data into two spaces corresponding to two categories. To classify new data points, we just need to check which side of the decision boundary they fall on.
SVMs find these boundaries in two steps:
1. The data is mapped to a new high-dimensional representation where the decision boundary can be expressed as a hyperplane (if the data were two-dimensional, a hyperplane would be a straight line).
2. A good decision boundary (a separating hyperplane) is computed by trying to maximize the distance between the hyperplane and the closest data points from each class, a step called maximizing the margin.
Kernel methods are used to convert the input data into a high-dimensional feature space, which makes it simpler to distinguish between classes or to generate predictions. Kernel methods employ a kernel function to implicitly map the data into the feature space, as opposed to manually computing the feature space.
The most popular kind of kernel approach is the Support Vector Machine (SVM), a binary
classifier that determines the best hyperplane that most effectively divides the two groups. In
order to efficiently locate the ideal hyperplane, SVMs map the input into a higher-
dimensional space using a kernel function.
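A minimal sketch, assuming scikit-learn is available, of training an SVM with an RBF kernel on data that is not linearly separable in its original space; the dataset and kernel parameters are illustrative assumptions.

```python
# Sketch: an SVM with a non-linear RBF kernel (scikit-learn assumed available).
from sklearn.datasets import make_moons
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Two interleaving half-moons: not linearly separable in the original 2-D space.
X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# The RBF kernel implicitly maps the points into a higher-dimensional space
# where a separating hyperplane can be found.
svm = SVC(kernel="rbf", C=1.0, gamma="scale")
svm.fit(X_train, y_train)
print("test accuracy:", svm.score(X_test, y_test))
```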
Decision trees and random forests
Decision trees are flowchart-like structures that let you classify input data points or predict output values given inputs. They're easy to visualize and interpret.
The Random Forest algorithm introduced a robust, practical take on decision-tree learning that involves building a large number of specialized decision trees and then ensembling their outputs. Random forests are applicable to a wide range of problems and are often described as the second-best algorithm for almost any shallow machine-learning task.
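As a quick illustration (scikit-learn assumed), the sketch below ensembles many decision trees with a random forest; the number of trees and the dataset are illustrative choices.

```python
# Sketch: a random forest ensembling many decision trees (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 200 decision trees, each trained on a bootstrap sample with random feature subsets;
# their predictions are averaged (voted) at prediction time.
forest = RandomForestClassifier(n_estimators=200, random_state=42)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
```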
Definition
Machine learning enables a machine to automatically learn from data, improve performance
from experiences, and predict things without being explicitly programmed.
With the help of sample historical data, which is known as training data, machine learning
algorithms build a mathematical model that helps in making predictions or decisions
without being explicitly programmed. Machine learning brings computer science and
statistics together for creating predictive models. Machine learning constructs or uses the
algorithms that learn from historical data. The more information we provide, the better the performance will be.
A Machine Learning system learns from historical data, builds the prediction models,
and whenever it receives new data, predicts the output for it. The accuracy of predicted
output depends upon the amount of data, as the huge amount of data helps to build a better
model which predicts the output more accurately.
Suppose we have a complex problem where we need to perform some predictions. Instead of writing code for it, we just need to feed the data to generic algorithms, and with the help of these algorithms, the machine builds the logic from the data and predicts the output. Machine learning has changed our way of thinking about such problems.
Machine learning is similar to data mining, as it also deals with huge amounts of data.
Four Forms of Machine Learning
Supervised Learning
Unsupervised Learning
Self-Supervised Learning
Reinforcement Learning
1) Supervised Learning
The system creates a model using labeled data to understand the datasets and learn about each data point. Once training and processing are done, we test the model by providing sample data to check whether it predicts the correct output.
The goal of supervised learning is to map input data to output data. Supervised learning is based on supervision, just as a student learns things under the supervision of a teacher. An example of supervised learning is spam filtering. Supervised learning can be further divided into two categories of algorithms:
o Classification
o Regression
Generally, almost all applications of deep learning that are in the spotlight these days belong
in this category, such as optical character recognition, speech recognition, image
classification, and language translation.
Although supervised learning mostly consists of classification and regression, there are more exotic variants as well, such as sequence generation, object detection, and image segmentation.
2) Unsupervised Learning
The training is provided to the machine with the set of data that has not been labeled,
classified, or categorized, and the algorithm needs to act on that data without any supervision.
The goal of unsupervised learning is to restructure the input data into new features or a group
of objects with similar patterns.
In unsupervised learning, we don't have a predetermined result. The machine tries to find useful insights from the huge amount of data. It can be further classified into two categories of algorithms:
o Clustering
o Association
Unsupervised learning is the bread and butter of data analytics, and it’s often a
necessary step in better understanding a dataset before attempting to solve a
supervised-learning problem. Dimensionality reduction and clustering are well-known
categories of unsupervised learning.
3) Self-Supervised Learning
This is a specific instance of supervised learning, but it's different enough that it deserves its own category. Self-supervised learning is supervised learning without human-annotated labels. There are still labels involved (because the learning has to be supervised by something), but they're generated from the input data, typically using a heuristic algorithm.
For example, autoencoders are a well-known instance of self-supervised learning, where the
generated targets are the input, unmodified. In the same way, trying to predict the next frame
in a video, given past frames, or the next word in a text, given previous words, are instances
of self-supervised learning.
4) Reinforcement Learning
In reinforcement learning, an agent receives information about its environment and learns to choose actions that maximize some reward. A robotic dog that automatically learns the movement of its arms is an example of reinforcement learning.
Evaluating Machine-Learning Models
1. Simple hold-out validation
It's important to use new data when evaluating our model, to reduce the likelihood of overfitting to the training set.
To evaluate the model while still building and tuning it, we create a third subset of the data known as the validation set. A typical train/validation/test split would be to use 60% of the data for training, 20% of the data for validation, and 20% of the data for testing.
The reason for choosing a validation set is that developing a model always involves tuning its
configuration:
for example, choosing the number of layers or the size of the layers (called the
hyperparameters of the model, to distinguish them from the parameters, which are the
network’s weights).
We do this tuning by using as a feedback signal the performance of the model on the
validation data. In essence, this tuning is a form of learning: a search for a good configuration
in some parameter space. As a result, tuning the configuration of the model based on its
performance on the validation set can quickly result in overfitting to the validation set, even
though your model is never directly trained on it.
We need to use a completely different, never-before-seen dataset to evaluate the model: the test dataset. Our model shouldn't have had access to any information about the test set, even indirectly.
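A minimal sketch of a simple hold-out split, following the 60/20/20 proportions above; NumPy is assumed, and the dataset is illustrative.

```python
# Sketch: simple hold-out validation split (60% train / 20% validation / 20% test).
import numpy as np

def holdout_split(data, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle the samples and split them into train / validation / test subsets."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(data))
    n_train = int(train_frac * len(data))
    n_val = int(val_frac * len(data))
    train = data[indices[:n_train]]
    val = data[indices[n_train:n_train + n_val]]
    test = data[indices[n_train + n_val:]]   # remaining ~20% held out for final evaluation
    return train, val, test

data = np.arange(1000)
train, val, test = holdout_split(data)
print(len(train), len(val), len(test))  # 600 200 200
```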
2. K-fold validation
Refer textbook2
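Since the details are deferred to the textbook, here is only a rough NumPy sketch of the idea: the data is split into K folds, the model is trained K times, each time validating on a different fold, and the K scores are averaged. The helper train_and_evaluate is a hypothetical user-supplied function, not from the text.

```python
# Sketch of K-fold validation: train K times, each time validating on a different fold.
import numpy as np

def k_fold_score(data, k, train_and_evaluate):
    """train_and_evaluate(train_data, val_data) -> validation score (user-supplied)."""
    folds = np.array_split(np.random.permutation(data), k)
    scores = []
    for i in range(k):
        val_data = folds[i]
        train_data = np.concatenate(folds[:i] + folds[i + 1:])
        scores.append(train_and_evaluate(train_data, val_data))
    return np.mean(scores)  # final score: the average over the K validation scores
```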
Overfitting and Underfitting
In a nutshell, underfitting refers to a model that neither performs well on the training data nor generalizes to new data.
Reasons for Underfitting
1. High bias and low variance.
2. The size of the training dataset used is not enough.
3. The model is too simple.
4. Training data is not cleaned and also contains noise in it.
Techniques to Reduce Underfitting
1. Increase model complexity.
2. Increase the number of features, performing feature engineering.
3. Remove noise from the data.
4. Increase the number of epochs or increase the duration of training to get better results.
The fundamental issue in machine learning is the tension between optimization and generalization. Optimization refers to the process of adjusting a model to get the best performance possible on the training data, while generalization refers to how well the trained model performs on data it has never seen before.
Regularization techniques
The simplest way to prevent overfitting is to reduce the size of the model: the number of
learnable parameters in the model (which is determined by the number of layers and the
number of units per layer). In deep learning, the number of learnable parameters in a model is
often referred to as the model’s capacity. Intuitively, a model with more parameters has more
memorization capacity and therefore can easily learn a perfect dictionary-like mapping
between training samples and their targets—a mapping without any generalization power.
Refer Textbook2
1. L1 Regularization
2. L2 Regularization
A regression model that uses the L1 regularization technique is called Lasso Regression, while one that uses L2 regularization is called Ridge Regression.
Ridge Regression adds the "squared magnitude" of the coefficients as a penalty term to the loss (cost) function:
Cost = Loss + λ Σ wᵢ²
Lasso Regression (Least Absolute Shrinkage and Selection Operator) adds the "absolute value of the magnitude" of the coefficients as a penalty term to the loss (cost) function:
Cost = Loss + λ Σ |wᵢ|
Refer Textbook2
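A minimal sketch, assuming TensorFlow/Keras is available, of adding L2 weight regularization to a network's layers; the layer sizes, input size, and regularization factor (0.001) are illustrative assumptions.

```python
# Sketch: adding L2 weight regularization to Dense layers (TensorFlow/Keras assumed).
from tensorflow.keras import layers, models, regularizers

model = models.Sequential([
    layers.Input(shape=(10000,)),
    # Each weight in this layer adds 0.001 * weight**2 to the model's total loss.
    layers.Dense(16, activation="relu", kernel_regularizer=regularizers.l2(0.001)),
    layers.Dense(16, activation="relu", kernel_regularizer=regularizers.l2(0.001)),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
```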
Adding dropout
Dropout is one of the most effective and most commonly used regularization techniques for neural networks, developed by Geoff Hinton and his students at the University of Toronto.
Dropout, applied to a layer, consists of randomly dropping out (setting to zero) a number of
output features of the layer during training.
During training, some layer outputs are ignored or dropped at random. This makes the layer appear, and be treated, as if it had a different number of nodes and a different connectivity to the preceding layer.
Dropout makes the training process noisy, requiring nodes within a layer to take on more or less responsibility for the inputs on a probabilistic basis.
Dropout may break up situations in which network layers co-adapt to correct mistakes made by prior layers, making the model more robust.
If neurons are randomly dropped out of the network during training, other neurons will have to step in and handle the representation required to make predictions for the missing neurons. This is believed to result in multiple independent internal representations being learned by the network.
The effect is that the network becomes less sensitive to the specific weights of neurons. This,
in turn, results in a network capable of better generalization and less likely to overfit the
training data.
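A minimal sketch, assuming Keras, of inserting dropout layers into a small network; the 0.5 dropout rate, input size, and layer widths are illustrative assumptions.

```python
# Sketch: dropout layers that randomly zero 50% of the preceding layer's outputs
# during training (TensorFlow/Keras assumed available).
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(10000,)),
    layers.Dense(16, activation="relu"),
    layers.Dropout(0.5),   # 50% of this layer's output features are dropped at training time
    layers.Dense(16, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),
])
```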
A neural network consists of interconnected nodes, called neurons, organized into layers.
Each neuron receives input signals, performs a computation on them using an activation
function, and produces an output signal that may be passed to other neurons in the network.
An activation function determines the output of a neuron given its input. These functions
introduce nonlinearity into the network, enabling it to learn complex patterns in data.
The network is typically organized into layers, starting with the input layer, where data is introduced, followed by hidden layers, where computations are performed, and finally the output layer, where predictions or decisions are made.
Neurons in adjacent layers are connected by weighted connections, which transmit signals
from one layer to the next. The strength of these connections, represented by weights,
determines how much influence one neuron's output has on another neuron's input. During
the training process, the network learns to adjust its weights based on examples provided in a
training dataset. Additionally, each neuron typically has an associated bias, which allows the
neuron to adjust its output threshold.
The goal of training a neural network is to minimize a loss function, which measures the difference between the network's predictions and the true targets, by adjusting the weights and biases. The adjustments are guided by an optimization algorithm, such as gradient descent.
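To make the weighted-sum-plus-bias computation described above concrete, here is a small NumPy sketch of a single neuron's forward pass; the input values, weights, bias, and sigmoid activation are all illustrative choices.

```python
# Sketch: forward pass of one artificial neuron (all values illustrative).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

inputs = np.array([0.5, -1.2, 3.0])    # signals arriving from the previous layer
weights = np.array([0.8, 0.1, -0.4])   # one weight per incoming connection
bias = 0.2                             # learned offset added to the weighted sum

z = np.dot(inputs, weights) + bias     # weighted sum of inputs plus bias
output = sigmoid(z)                    # activation function introduces non-linearity
print(output)
```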
The simplest ANN is the perceptron. It consists of a single layer, which is the input layer, with multiple neurons with their own weights; there are no hidden layers. The perceptron algorithm learns the weights for the input signals in order to draw a linear decision boundary.
However, to solve more complicated, non-linear problems related to image processing, computer
vision, and natural language processing tasks, we work with deep neural networks.
Feedforward Neural Networks (FNN)
These are the simplest form of ANNs, where information flows in one direction, from input to output. There are no cycles or loops in the network architecture. Multilayer perceptrons (MLP) are a type of feedforward neural network.
Recurrent Neural Networks (RNN)
In RNNs, connections between nodes form directed cycles, allowing information to persist
over time. This makes them suitable for tasks involving sequential data, such as time series
prediction, natural language processing, and speech recognition.
Convolutional Neural Networks (CNN)
CNNs are designed to effectively process grid-like data, such as images. They consist of layers of convolutional filters that learn hierarchical representations of features within the input data. CNNs are widely used in tasks like image classification, object detection, and image segmentation.
Long Short-Term Memory Networks (LSTM) and Gated Recurrent Units (GRU)
These are specialized types of recurrent neural networks designed to address the vanishing
gradient problem in traditional RNN. LSTMs and GRUs incorporate gated mechanisms to
better capture long-range dependencies in sequential data, making them particularly effective
for tasks like speech recognition, machine translation, and sentiment analysis.
Autoencoder
It is designed for unsupervised learning and consists of an encoder network that compresses
the input data into a lower-dimensional latent space, and a decoder network that reconstructs
the original input from the latent representation. Autoencoders are often used for
dimensionality reduction, data denoising, and generative modeling.
Input Layer: This layer consists of neurons that receive inputs and pass them on
to the next layer. The number of neurons in the input layer is determined by the
dimensions of the input data.
Hidden Layers: These layers are not exposed to the input or output and can be
considered as the computational engine of the neural network. Each hidden layer's
neurons take the weighted sum of the outputs from the previous layer, apply
an activation function, and pass the result to the next layer. The network can have
zero or more hidden layers.
Output Layer: The final layer that produces the output for the given inputs. The
number of neurons in the output layer depends on the number of possible outputs
the network is designed to produce.
Artificial Neuron
Anatomy of an artificial neuron.
Perceptron
A perceptron is a type of neural network that performs binary classification: it maps input features to an output decision, usually classifying data into one of two categories, such as 0 or 1.
A perceptron consists of a single layer of input nodes that are fully connected to a layer of output nodes.
Types of Perceptron
1. Single-Layer Perceptron: this type of perceptron is limited to learning linearly separable patterns. It is effective for tasks where the data can be divided into distinct categories by a straight line. While powerful in its simplicity, it struggles with more complex problems where the relationship between inputs and outputs is non-linear.
A single-layer perceptron (SLP) is the simplest type of artificial neural network. It
consists of a single layer of neurons that directly connect to the input features. It's a
foundational concept in understanding neural networks and is limited to learning
linearly separable patterns.
Key Characteristics:
Single Layer:
The defining feature is the single layer of neurons. There are no hidden layers between the
input and output.
Linear Separability:
SLPs can only classify data that can be divided by a straight line (in 2D) or a hyperplane (in
higher dimensions). They struggle with non-linearly separable data like XOR, circles, or
spirals.
Input Connections:
Each neuron in the SLP is directly connected to all input features.
Activation Function:
A threshold or step function is typically used to determine the output of each neuron based
on a weighted sum of the inputs.
How it works:
1. Weighted Sum: The input features are multiplied by corresponding weights, and these weighted values are summed up.
2. Activation Function: The sum is then passed through an activation function, which determines the neuron's output. The most common activation function for SLPs is a step function that outputs 1 if the sum is above a threshold and 0 otherwise.
3. Output: The output of the activation function is the final prediction of the perceptron.
Output: The final output is determined by the activation function, often used
for binary classification tasks.
Bias: The bias term helps the perceptron make adjustments independent of the
input, improving its flexibility in learning.
Biases are constants added to the weighted sum of inputs before an activation function is
applied.
In the linear equation (y = mx + b), the bias (b) represents the y-intercept and determines the output when x is zero.
Like weights, biases are also learned during training.
The output of a neuron is expressed by the formula:
output = activation(sum(inputs × weights) + bias)
Importance:
Flexibility:
Weights and biases provide the neural network with the flexibility to learn complex
relationships in data.
Non-linearity:
They enable the network to model non-linear relationships, which is crucial for solving
many real-world problems.
Pattern Recognition:
By adjusting weights and biases, the network can learn to recognize specific patterns and
make accurate predictions
The output of the activation function becomes the input for the next layer, and the process
repeats.
In essence, weights and biases are the knobs that a neural network uses to fine-tune
its behavior and learn from data
Learning Algorithm: The perceptron adjusts its weights and bias using a learning algorithm, such as the Perceptron Learning Rule, to minimize prediction errors.
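Below is a minimal NumPy sketch of the perceptron learning rule on a toy linearly separable problem (the AND function); the learning rate, epoch count, and dataset are illustrative assumptions, not from the text.

```python
# Sketch: training a single perceptron with the perceptron learning rule (NumPy assumed).
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])  # inputs for the AND function
y = np.array([0, 0, 0, 1])                      # target labels

weights = np.zeros(2)
bias = 0.0
lr = 0.1                                        # learning rate (illustrative)

for epoch in range(20):
    for xi, target in zip(X, y):
        # Step activation: output 1 if the weighted sum plus bias exceeds 0.
        prediction = 1 if np.dot(xi, weights) + bias > 0 else 0
        error = target - prediction
        # Perceptron learning rule: nudge weights and bias by lr * error * input.
        weights += lr * error * xi
        bias += lr * error

print(weights, bias)
```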
Bias
Once a neuron receives inputs from all the other neurons connected to it, a bias is added: a constant value added on top of the weighted sum computed from those inputs.
Multilayer Perceptron (MLP)
Key Features:
Fully Connected: Each node in a layer is connected to every node in the adjacent layers.
Feedforward: Information flows in one direction, from input to output, without cycles.
Non-linear Activation Functions: Introduce non-linearity, allowing the network to model
complex relationships.
Multiple Layers: At least one hidden layer is present, enabling the network to learn
hierarchical representations of data.
Backpropagation: The learning algorithm used to adjust the network's weights and biases
based on the error between predicted and actual outputs.
How it works:
1. Input Layer: Receives the initial data or features.
2. Hidden Layers: Perform computations and feature extraction.
3. Output Layer: Produces the final prediction or classification.
4. Forward Pass: Data flows through the network, with each layer applying weights, biases,
and activation functions.
5. Error Calculation: The difference between the predicted and actual output is calculated.
6. Backward Pass (Backpropagation): The error is used to adjust the weights and biases,
minimizing the error.
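A minimal Keras sketch of such a multilayer feedforward network for a 10-class classification problem; the input size, layer widths, optimizer, and loss are illustrative assumptions.

```python
# Sketch: a small multilayer feedforward network for 10-class classification
# (TensorFlow/Keras assumed available).
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(784,)),                 # input layer: e.g. flattened 28x28 images
    layers.Dense(64, activation="relu"),        # hidden layer 1
    layers.Dense(64, activation="relu"),        # hidden layer 2
    layers.Dense(10, activation="softmax"),     # output layer: probabilities over 10 classes
])

# Backpropagation (driven by the optimizer) adjusts weights and biases to minimize the loss.
model.compile(optimizer="rmsprop",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```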
Applications:
Image Recognition: Classifying images into different categories.
Natural Language Processing: Analyzing and understanding text.
Speech Recognition: Converting spoken language into text.
Regression Problems: Predicting continuous values.
Cross-Entropy Loss
Cross-entropy loss, also known as log loss, is a measure of the difference between
two probability distributions. In machine learning, particularly in classification
problems, it quantifies the dissimilarity between a model's predicted probability
distribution and the true distribution (often a one-hot encoded vector representing
the correct class). Lower cross-entropy loss indicates a better model fit, with 0
representing a perfect prediction.
Measures Dissimilarity:
Cross-entropy loss quantifies how different the predicted probability distribution is from
the actual distribution.
Common in Classification:
It's widely used as a loss function in classification tasks, including binary and multi-class
scenarios.
Interpreting the Value:
The loss value ranges from 0 upward (it has no upper bound), with 0 being the best possible score (perfect prediction).
Optimization Goal:
During training, optimization algorithms aim to minimize the cross-entropy loss, pushing
the model's predictions closer to the true labels.
Mathematical Formulation:
In a binary classification problem with true label y (either 0 or 1) and predicted probability p, the cross-entropy loss is calculated as:
Loss = -(y * log(p) + (1 - y) * log(1 - p))
For multi-class classification, where the model predicts probabilities for multiple classes, the formula becomes:
Loss = -Σ (yᵢ * log(pᵢ)), where i ranges over all classes
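A small NumPy sketch computing both forms of the loss; the example labels and probabilities are illustrative.

```python
# Sketch: computing cross-entropy loss with NumPy (illustrative values).
import numpy as np

def binary_cross_entropy(y, p):
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def categorical_cross_entropy(y_true, p_pred):
    # y_true is a one-hot vector, p_pred a predicted probability distribution.
    return -np.sum(y_true * np.log(p_pred))

print(binary_cross_entropy(1, 0.8))                          # ~0.223
print(categorical_cross_entropy(np.array([0, 1, 0]),
                                np.array([0.2, 0.7, 0.1])))  # ~0.357
```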
Practical Applications:
Image Classification:
Models predict the probability of an image belonging to different classes (e.g., cat, dog,
bird), and cross-entropy loss helps evaluate the accuracy of these predictions.
Language Modeling:
In language models, cross-entropy loss measures how well the model predicts the next
word in a sequence.
Recommender Systems:
It can be used to assess how well a model predicts user preferences or item
recommendations
Entropy calculates the degree of randomness or disorder within a system. In the context of
information theory, the entropy of a random variable is the average uncertainty, surprise, or
information inherent to the possible outcomes. To put things simply, it measures the
uncertainty of an event.
Cross-entropy, also known as logarithmic loss or log loss, is a popular loss function used in machine
learning to measure the performance of a classification model.
The cross-entropy loss function is used to find the optimal solution by adjusting the weights
of a machine learning model during training. The objective is to minimize the error between
the actual and predicted outcomes. A lower cross-entropy value indicates better performance.
Cross-entropy measures the average number of bits needed to encode an event from one probability distribution, P, using the optimal code for another probability distribution, Q. It is typically used in machine learning to evaluate the performance of a model where the objective is to minimize the error between the predicted probability distribution and the true distribution.
The measure of error from a loss function also serves as a guide during
the optimization process by providing feedback to the model on how well it fits the data.
Hence, most machine learning models implement a loss function during the optimization
phase, where the model parameters are chosen to help the model minimize the error and
arrive at an optimal solution – the smaller the error, the better the model.
We can measure the error between two probability distributions using the cross-entropy loss
function. For example, let’s assume we’re conducting a binary classification task (a
classification task with two classes, 0 and 1).
Binary cross-entropy formula (averaged over N samples):
BCE = -(1/N) Σᵢ [ yᵢ log(pᵢ) + (1 - yᵢ) log(1 - pᵢ) ]
The cross-entropy loss is a scalar value that quantifies how far off the model's
predictions are from the true labels. For each sample in the dataset, the cross-
entropy loss reflects how well the model's prediction matches the true label. A
lower loss for a sample indicates a more
accurate prediction, while a higher loss suggests a larger discrepancy.
Interpretability with Binary Classification:
o In binary classification, since there are two classes (0 and 1), it is straightforward to interpret the loss value.
o If the true label is 1, the loss is primarily influenced by how close the predicted probability for class 1 is to 1.0.
o If the true label is 0, the loss is influenced by how close the predicted probability for class 1 is to 0.0.
Entropy:
Entropy, also known as Shannon entropy, was formally introduced in 1948 by Claude Shannon.
Worked example: suppose a model predicts a probability of 0.8 that an image contains a cat, and the image really does contain a cat. For a single sample, the binary cross-entropy loss is
Loss = -(y * log(p) + (1 - y) * log(1 - p))
where:
y is the true label (0 or 1)
p is the predicted probability for the true class (e.g., if y = 1, then p is the probability of the cat being present)
Applying the formula:
In our example: y = 1 (cat) and p = 0.8 (predicted probability of cat).
Therefore, the loss for this single sample is:
Loss = -(1 * log(0.8) + (1 - 1) * log(1 - 0.8))
Loss = -(log(0.8))
Loss ≈ -(-0.223)
Loss ≈ 0.223
Interpretation: a loss of about 0.223 is relatively low, reflecting that the model assigned a high probability (0.8) to the correct class; a lower predicted probability would have produced a larger loss.
Softmax Layer
Purpose:
Probability Distribution:
The softmax function transforms raw output scores (logits) from the previous layer into
probabilities.
Multi-class Classification:
It's particularly useful when an input can belong to one of several classes, allowing the
network to predict the most likely class.
Decision Making:
By providing probabilities, the softmax layer facilitates making decisions about which class
the input belongs to.
How it works:
1. Exponentiation: Each logit (output from the previous layer) is exponentiated, ensuring all values are positive.
2. Normalization: The exponentiated values are then divided by the sum of all exponentiated values. This normalization step ensures that the outputs sum up to 1, forming a probability distribution.
Formula:
The softmax function is often represented by the following formula:
σ(z)_i = e^(z_i) / Σⱼ e^(z_j), where j runs over the K classes
Where:
σ(z)_i is the i-th element of the softmax output vector.
z_i is the i-th logit (input to the softmax layer).
K is the number of classes.
e is the exponential function.
E.g., a neural network trained to classify images into three categories: cat, dog, and
bird. The softmax layer would take the raw output scores from the previous layer
and convert them into probabilities for each class. For example, the output might
be [0.2, 0.7, 0.1], indicating a 20% probability of being a cat, 70% probability of
being a dog, and 10% probability of being a bird.
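A small NumPy sketch of the softmax computation on illustrative logits:

```python
# Sketch: softmax over raw logits (NumPy assumed).
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)       # subtract the max for numerical stability
    exp_z = np.exp(z)                 # exponentiation: all values become positive
    return exp_z / exp_z.sum()        # normalization: outputs sum to 1

logits = np.array([1.0, 2.3, 0.2])    # illustrative raw scores for cat, dog, bird
print(softmax(logits))                # a probability distribution over the 3 classes
```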
Gradient Descent
The cost function (or loss function) measures how well the model’s predictions
match the actual data. By iteratively adjusting the parameters in the direction that
reduces the cost function, the model improves its accuracy.
1. Initialize Parameters – Start with random values for the parameters (weights and biases).
2. Compute the Gradient – Calculate the derivative (gradient) of the loss function with respect to each parameter.
3. Update Parameters – Adjust the parameters by moving in the opposite direction of the gradient:
θ = θ - η ∇J(θ)
Where:
θ represents the model parameters, η is the learning rate, and ∇J(θ) is the gradient of the cost function J with respect to θ.
4. Repeat – Repeat steps 2 and 3 until the cost function converges.
The learning rate is a hyperparameter that determines the size of the step taken in the
weight update. A small learning rate results in a slow convergence, while a large learning
rate can lead to overshooting the minimum and oscillating around the minimum. It’s
important to choose an appropriate learning rate that balances the speed of convergence and
the stability of the optimization.
1) Batch Gradient Descent:
In batch gradient descent, the gradient of the loss function is computed with respect to the
weights for the entire training dataset, and the weights are updated after each iteration. This
provides a more accurate estimate of the gradient, but it can be computationally expensive
for large datasets.
2) Stochastic Gradient Descent (SGD):
In SGD, the gradient of the loss function is computed with respect to a single training
example, and the weights are updated after each example. SGD has a lower computational
cost per iteration compared to batch gradient descent, but it can be less stable and may not
converge to the optimal solution.
3) Mini-Batch Gradient Descent:
Mini-batch gradient descent is a compromise between batch gradient descent and SGD. The
gradient of the loss function is computed with respect to a small randomly selected subset
of the training examples (called a mini-batch), and the weights are updated after each mini-
batch. Mini-batch gradient descent provides a balance between the stability of batch
gradient descent and the computational efficiency of SGD.
4) Momentum:
Momentum is a variant of gradient descent that incorporates information from the previous
weight updates to help the algorithm converge more quickly to the optimal solution.
Momentum adds a term to the weight update that is proportional to the running average of
the past gradients, allowing the algorithm to move more quickly in the direction of the
optimal solution
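To make these variants concrete, here is a small NumPy sketch of mini-batch stochastic gradient descent with momentum, fitting a single weight to synthetic data; the data, learning rate, momentum factor, and batch size are illustrative assumptions.

```python
# Sketch: mini-batch stochastic gradient descent with momentum (NumPy assumed).
# Fits a single weight w to minimize mean squared error on synthetic data y = 3x + noise.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 3.0 * x + rng.normal(scale=0.1, size=1000)

w = 0.0                 # parameter to learn
velocity = 0.0          # running average of past gradients (momentum term)
lr, momentum = 0.1, 0.9
batch_size = 32

for epoch in range(20):
    idx = rng.permutation(len(x))
    for start in range(0, len(x), batch_size):
        batch = idx[start:start + batch_size]
        xb, yb = x[batch], y[batch]
        grad = 2 * np.mean((w * xb - yb) * xb)      # gradient of MSE w.r.t. w on the mini-batch
        velocity = momentum * velocity - lr * grad  # momentum update
        w += velocity                               # move the parameter along the velocity
print(w)  # should approach 3.0
```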
Derivatives and Stochastic Gradient Descent
Neural Network Implementation Issues
Data Dependency:
Neural networks, especially deep learning models, require large amounts of high-quality
data for effective training. Gathering and cleaning sufficient data can be time-consuming,
expensive, and sometimes impractical.
Computational Costs:
Training large neural networks can be computationally expensive, demanding significant
processing power, often requiring specialized hardware like GPUs or TPUs.
Overfitting:
Neural networks can memorize the training data instead of generalizing, leading to poor
performance on new data.
Interpretability:
Neural networks are often described as "black boxes" due to the difficulty in understanding
how they arrive at their predictions. This lack of transparency can be problematic in fields
where explainability is crucial.
Optimization Challenges:
Training neural networks involves finding optimal model parameters, which can be difficult
due to issues like vanishing or exploding gradients, local minima, and the need for proper
hyperparameter tuning.
Hyperparameter Tuning:
Choosing the right hyperparameters (e.g., learning rate, batch size, network architecture) is
crucial for optimal performance, but this process can be time-consuming and complex.
Vanishing and Exploding Gradients:
In deep neural networks, gradients can become very small (vanishing) or very large
(exploding) during backpropagation, hindering the learning process.
Limited Data:
Many real-world applications lack the vast amounts of labeled data needed for effective
neural network training.
Bias and Fairness:
Neural networks can inherit biases from the training data, leading to unfair or
discriminatory outcomes.
Continual Learning:
Training neural networks on continuously arriving data can be challenging, as models may
struggle with "catastrophic forgetting" or interference.
Scalability:
Scaling neural networks to handle large problem instances and real-world applications can
be a significant challenge.
Data Independence
Data independence in the context of neural networks refers to the ability to modify
the underlying data representation or structure without requiring changes to the
network's architecture or training process. This concept is analogous to data
independence in databases, where changes to the physical storage or logical
structure of data can be made without affecting the applications that use it. In
essence, it allows for greater flexibility and maintainability of neural networks by
decoupling the data from the model's implementation
Data augmentation:
The application of data augmentation techniques (e.g., random rotations, crops) should not
require modifications to the core network architecture or training loop.
Data source changes:
Switching to a different dataset or source of data should not require significant changes to
the network's structure or training strategy.
Feature engineering:
Modifications to feature selection, extraction, or engineering should be possible without
requiring changes to the network's architecture.
Benefits
Increased flexibility:
Neural networks can be adapted to new data sources or tasks without requiring extensive
re-engineering.
Improved maintainability:
Changes to the data pipeline or preprocessing steps can be made without affecting the core
network, simplifying maintenance and updates.
Enhanced reusability:
Networks trained on one dataset can be more easily adapted to other datasets or tasks,
promoting reusability of trained models.
Reduced development time:
Data independence allows for faster experimentation with different data representations and
preprocessing techniques.
Image classification:
A convolutional neural network trained on a dataset of images can be adapted to a new
dataset with different image resolutions or color spaces without requiring changes to the
convolutional layers or the overall architecture.
Natural language processing:
A recurrent neural network trained on text data can be adapted to a new language or corpus
without requiring changes to the core RNN architecture or the training procedure.
Time series analysis:
A neural network trained on time series data can be adapted to handle different time scales
or sampling frequencies without requiring changes to the network's architecture or training
loop.