How do Neural Networks solve the XOR problem?
ARCHITECTURE → 1 input layer, x1 and x2
               1 middle (hidden) layer, h1 and h2
               1 output layer, y1
# Middle layer consists of 2 ReLU-based units
[Diagram]
x1  x2  XOR output
0   0   0
0   1   1
1   0   1
1   1   0
When the input (0, 0) reaches the middle-layer unit h1, the weights [1, 1] are applied to the input values to obtain:
0(1) + 0(1) = 0
Adding the bias term (bias weight -1 applied to the bias unit +1):
0 + (-1)(+1) = -1
However, since ReLU-based units produce an output of 0 for all negative values, the output value for h1 = 0.
Similarly, h2 also yields a value of 0.
→ The middle layer therefore yields [0, 0]. Applying the output weights [-2, 1] and the bias term (0)(+1) = 0, the output of the entire neural network is (-2)(0) + (1)(0) + 0 = 0.
When this process is repeated for all of the input pairs in the table, the values produced match those of the XOR operation.
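As a quick sanity check, here is a minimal NumPy sketch of this two-layer ReLU network. The weights and biases follow the walk-through above (h1: weights [1, 1], bias -1; h2: weights [1, 1], bias 0; output weights [-2, 1], output bias 0); the exact assignment of the two biases to h1 and h2 is an assumption made for illustration.

import numpy as np

def relu(z):
    return np.maximum(0, z)

# Hidden layer: two ReLU units, each with input weights [1, 1].
# h1 has bias -1, h2 has bias 0 (assumed assignment, see note above).
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])
b = np.array([-1.0, 0.0])

# Output layer: weights [-2, 1] on [h1, h2], bias 0.
U = np.array([-2.0, 1.0])
c = 0.0

def xor_net(x):
    h = relu(W @ x + b)   # hidden-layer activations
    return U @ h + c      # network output

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, xor_net(np.array(x)))   # prints 0.0, 1.0, 1.0, 0.0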
PREREQUISITES
=> Concept of Neural Networks:
A neural network is simply a network of neural computing units, each of which takes in a vector of
inputs and produces a single output value.
[Diagram]
A neural computing unit is the fundamental building block of a neural network.
It takes in a vector of input values, performs some computation on them and produces an output
value.
When neural computing units receive a vector of input values, they perform a weighted sum on
these input values and then they add a bias to the result of this weighted sum.
The result is then passed into some non-linear function, known as the activation function, to produce an output value.
Eg: w = [0.2, 0.2, 0.2, 0.1], b = 0.5
x = [5.0, 4.0, 1.0, 2.0]
weighted sum = 0.2(5) + 0.2(4) + 0.2(1) + 0.1(2)
                = 2.2
adding bias = 2.2 + 0.5 = 2.7 → value supplied to the activation function g
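A small sketch of this single computing unit, using the numbers above (a sigmoid is assumed for the activation function g, purely for illustration; sigmoid is introduced just below):

import numpy as np

def unit_output(x, w, b, activation):
    # weighted sum of the inputs, plus the bias, passed through the activation
    z = np.dot(w, x) + b
    return activation(z)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

w = np.array([0.2, 0.2, 0.2, 0.1])
b = 0.5
x = np.array([5.0, 4.0, 1.0, 2.0])

print(unit_output(x, w, b, sigmoid))   # z = 2.7, sigmoid(2.7) ≈ 0.937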
=> Activation Functions → a function added to an ANN in order to help the network learn complex patterns in the data.
Compared with the neuron-based model in our brain, the activation function is what ultimately decides what is to be fired to the next neuron.
In an ANN, the activation function of a node defines the output of that node given an input or set of inputs.
They are simply non-linear functions that convert the values they receive into output values for the
neural computational units.
1. Sigmoid Activation Function
f(z) = 1 / (1 + e^-z)
[Graph of sigmoid]
2. Tanh or Hyperbolic Tangent Activation Function (Sigmoidal)
f(x) = tanh(x) = (2 / (1 + e^-2x)) - 1
[Graph of tanh]
→ Why is tanh better than the sigmoid activation function?
- Like sigmoid, when the input is very large or very small, the output is almost flat and the gradient is small, which is not conducive to weight updates.
The difference is the output interval: tanh's output interval is [-1, 1] and the function is zero-centered, which is better than sigmoid.
- A major advantage is that negative inputs are mapped strongly negative and an input of 0 is mapped near 0 in the tanh graph.
→ In binary classification problems, tanh is typically used for the hidden layer and the sigmoid function for the output layer.
3. ReLU (Rectified Linear Unit) activation function:
f(x) = max(0, x), i.e. f(x) = x for x ≥ 0
                       f(x) = 0 for x < 0
[Graph of ReLU]
Range: [0, ∞)
→ Better than tanh and sigmoid:
- When the input is positive, there is no gradient saturation problem.
- The calculation speed is much faster: ReLU involves only a linear relationship, so whether forward or backward it is faster than the other two (sigmoid and tanh need to compute an exponential, which is slower).
(* Drawback: the dead ReLU problem, where a unit whose input is always negative outputs 0 and its weights stop being updated.)
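A small sketch of the three activation functions discussed above, evaluated on a few arbitrary sample inputs:

import numpy as np

def sigmoid(z):
    # squashes any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # zero-centered sigmoidal function with range (-1, 1)
    return np.tanh(z)

def relu(z):
    # 0 for negative inputs, identity for non-negative inputs; range [0, ∞)
    return np.maximum(0, z)

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(sigmoid(z))   # ≈ [0.047,  0.378,  0.5,  0.622, 0.953]
print(tanh(z))      # ≈ [-0.995, -0.462, 0.0,  0.462, 0.995]
print(relu(z))      #   [0.0,    0.0,    0.0,  0.5,   3.0]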
=> Feed-Forward Neural Networks
A multilayer network of neural units in which the outputs from the units in each layer are passed to
the units in the higher layer.
These networks don't have any cycles within them, i.e., the outputs from these units don't flow back in a cyclical manner.
[Diagram]
x = (x1, x2, ..., xn) → input values
y = (y1, y2, ..., ym) → output values
x1 to xn → the n input values of the network reside on the first layer (Layer 0)
y1 to ym → the m output values reside on the last layer (Layer 2)
W → matrix containing the weights to be applied to the input values
U → matrix containing the weights to be applied to the output values of the hidden layer
b → vector containing the bias terms added to the weighted input values
Mathematical Representation:
Multinomial classification:
h = g(W·x + b), where g is the activation function
z = U·h
y = softmax(z)
For multinomial classification, a softmax function is used to normalize the vector of real values z obtained from the matrix multiplication of U and h.
The normalization transforms this vector of real values into a vector that represents a probability distribution.
softmax(zi) = e^(zi) / Σj e^(zj),  for 1 ≤ i ≤ d, with the sum running over j = 1 .. d
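A minimal sketch of this forward pass (the layer sizes, the weight values, and the choice of tanh for the hidden activation below are made up purely for illustration):

import numpy as np

def softmax(z):
    # subtract the max for numerical stability; the result sums to 1
    e = np.exp(z - np.max(z))
    return e / e.sum()

def forward(x, W, b, U):
    h = np.tanh(W @ x + b)   # hidden layer: h = g(W·x + b)
    z = U @ h                # raw output scores: z = U·h
    return softmax(z)        # y = softmax(z), a probability distribution

# Made-up dimensions: 3 inputs, 4 hidden units, 2 output classes.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
b = np.zeros(4)
U = rng.normal(size=(2, 4))

x = np.array([0.5, -1.0, 2.0])
y = forward(x, W, b, U)
print(y, y.sum())   # two probabilities that sum to 1.0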
A feed-forward neural network is a supervised ML algorithm.
To train a neural network means to figure out the right values of W and U for each layer in the neural
network to enable it to predict accurate values of y when given input values of x.