Basic
I. One Node
II. Four Nodes
III. One Hidden Layer
IV. Three Inputs
V. Seven Layers
Advanced
1. Mixture of Experts (MoE)
2. Recurrent Neural Network (RNN)
3. Mamba
4. Matrix Multiplication
5. LLM Sampling
6. MLP in PyTorch
7. Backpropagation
8. Transformer
9. Batch Normalization
10. Generative Adversarial Network (GAN)
11. Self Attention
12. Dropout
13. Autoencoder
14. Vector Database
15. CLIP
16. Residual Network (ResNet)
17. Graph Convolution Network (GCN)
18. SORA’s Diffusion Transformer (DiT)
19. Gemini 1.5's Switch Transformer
20. Reinforcement Learning with Human Feedback (RLHF)
© 2024 Tom Yeh
Each exercise page includes a link to my original LinkedIn post (with animation and explanation) and the date it was originally posted.
I. One Node (406)
Originally posted 12.5.23
II. Four Nodes (148)
Originally posted 12.6.23
III. One Hidden Layer (82)
Originally posted 12.7.23
IV. Three Inputs (105)
Originally posted 12.13.23
V. Seven Layers (197)
Originally posted 12.13.23
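The Basic exercises build up from a single neuron to a small multi-layer network, computed entirely by hand. As a minimal sketch of what Exercise I asks you to do, here is a single node (weights, bias, ReLU) in plain Python; the weights and inputs below are illustrative, not the worksheet's values.

# One node: y = ReLU(w . x + b)
# Illustrative values only; the worksheet supplies its own weights and input.
def relu(z):
    return max(0.0, z)

def one_node(x, w, b):
    return relu(sum(wi * xi for wi, xi in zip(w, x)) + b)

print(one_node(x=[2, 1], w=[1, -1], b=1))   # ReLU(2 - 1 + 1) = 2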
1. Mixture of Experts (683)
[Worksheet: inputs X1, X2 are scored by a gate network; a max over the gate values selects which expert (Expert 1 or Expert 2, each a small ReLU network) computes the outputs Y1, Y2.]
Originally posted 12.15.23
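A minimal sketch of the same idea in PyTorch, assuming hard (top-1) gating and two small ReLU experts; the sizes and weights are illustrative, not the worksheet's.

import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    # Hard-gated mixture of experts: the gate picks one expert per input.
    def __init__(self, d_in=2, d_out=2, n_experts=2):
        super().__init__()
        self.gate = nn.Linear(d_in, n_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_in, d_out), nn.ReLU()) for _ in range(n_experts)]
        )

    def forward(self, x):
        expert_id = self.gate(x).argmax(dim=-1)           # max over gate values
        out = torch.stack([e(x) for e in self.experts])   # (n_experts, batch, d_out)
        return out[expert_id, torch.arange(x.shape[0])]   # pick the chosen expert's output

x = torch.tensor([[2.0, 3.0], [1.0, 1.0]])
print(TinyMoE()(x))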
2. Recurrent Neural Network (RNN) (406)
[Worksheet: process the input sequence X = 3, 4, 5, 6 with parameters A, B, C, activation function ɸ = ReLU, and initial hidden state H0 = 0, producing the output sequence Y step by step.]
Originally posted 12.18.23
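A minimal sketch of the same recurrence in NumPy, assuming the standard form h_t = ReLU(A h_{t-1} + B x_t) and y_t = C h_t; the matrices below are illustrative, not the worksheet's parameters.

import numpy as np

def relu(z):
    return np.maximum(z, 0)

# Illustrative parameters; the worksheet supplies its own A, B, C.
A = np.array([[1, -1], [0, 1]])   # hidden-to-hidden
B = np.array([[1], [1]])          # input-to-hidden
C = np.array([[1, 2]])            # hidden-to-output

h = np.zeros((2, 1))              # H0
for x in [3, 4, 5, 6]:
    h = relu(A @ h + B * x)       # ɸ = ReLU
    y = C @ h
    print(float(y))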
3. Mamba's S6 Model (263)
[Worksheet: run the input sequence 3, 4, 5, 6 through a selective structured state-space (S6) scan, using the given parameter matrices to update the state at each step and produce the output sequence.]
Originally posted 12.19.23
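As a rough sketch of the recurrence at the heart of the exercise, here is a plain state-space scan h_t = A h_{t-1} + B x_t, y_t = C h_t in NumPy. This deliberately omits what makes S6 selective (input-dependent B, C, and step size), and all values are illustrative.

import numpy as np

# Illustrative state-space parameters (not the worksheet's).
A = np.array([[1, 0], [0, -1]])
B = np.array([[1], [0]])
C = np.array([[1, -1]])

h = np.zeros((2, 1))
for x in [3, 4, 5, 6]:
    h = A @ h + B * x      # state update (one scan step)
    y = C @ h              # readout
    print(float(y))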
4. Matrix Multiplication (127)
[Worksheet: multiply two small integer matrices by hand and fill in the result.]
Originally posted 1.5.24
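The residue of this page suggests a 2x2 matrix times a 2x3 matrix, but the exact grouping is unclear, so this NumPy check uses assumed matrices purely to illustrate the row-times-column rule.

import numpy as np

# Assumed matrices for illustration; the worksheet defines its own.
A = np.array([[1, 1],
              [-1, 1]])
B = np.array([[1, 5, 2],
              [2, 4, 2]])
print(A @ B)   # each entry is a row of A dotted with a column of B: [[3 9 4] [1 -1 0]]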
5. How does an LLM sample a sentence? (1123)
[Worksheet: given the input embeddings, the LLM outputs a probability distribution over a ten-word vocabulary (I, you, they, are, am, how, why, where, who, what) at each step; a random number is drawn per step and used to pick the next word from that distribution, filling in the blanks of the sentence.]
Originally posted 1.6.24
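A minimal sketch of sampling one word from a distribution with a uniform random number, the way the worksheet does it by hand: walk down the cumulative probabilities until the random number is exceeded. The distribution and random number below are illustrative.

import itertools

vocab = ["I", "you", "they", "are", "am", "how", "why", "where", "who", "what"]
probs = [.01, .01, .01, .01, .01, .50, .10, .10, .15, .10]   # illustrative distribution
r = 0.92                                                     # illustrative random number

def sample(vocab, probs, r):
    # Return the first word whose cumulative probability exceeds r.
    for word, cum in zip(vocab, itertools.accumulate(probs)):
        if r < cum:
            return word
    return vocab[-1]

print(sample(vocab, probs, r))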
6. Multi-Layer Perceptron in PyTorch (337)
[Worksheet: trace an input through three Linear layers with ReLU, ReLU, and Sigmoid activations by hand, then fill in the blanks of the matching nn.Sequential definition.]

mlp_model = nn.Sequential(
    nn._______( ___, ___, bias = ___ ),
    nn._______(),
    nn._______( ___, ___, bias = ___ ),
    nn._______(),
    nn._______( ___, ___, bias = ___ ),
    nn._______()
)

Hints:
Linear Layer: { Identity | Linear | Bilinear }
Activation Function: { ReLU | Tanh | Sigmoid }
in_features: { int }
out_features: { int }
bias: { T | F }
Originally posted 1.8.24
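One possible completion, with layer sizes assumed rather than taken from the hand-drawn weights (the actual in_features and out_features are whatever the worksheet's matrices imply); a sketch, not the answer key.

import torch
import torch.nn as nn

# Assumed sizes for illustration: 4 inputs -> 4 -> 2 -> 1 output.
mlp_model = nn.Sequential(
    nn.Linear(4, 4, bias=True),
    nn.ReLU(),
    nn.Linear(4, 2, bias=True),
    nn.ReLU(),
    nn.Linear(2, 1, bias=True),
    nn.Sigmoid(),
)

x = torch.tensor([[2.0, 1.0, 3.0, 1.0]])   # the worksheet's input column as a row vector
print(mlp_model(x))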
7. Backpropagation (1260)
[Worksheet: forward the input X = 2, 1, 3, 1 through Layer 1 (Linear + ReLU), Layer 2 (Linear + ReLU), and Layer 3 (Linear + Softmax), compare YPred with YTarget under the cross-entropy loss L, and backpropagate the gradients by hand.]
Originally posted 1.9.24
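A minimal sketch of the same computation with autograd, useful as a check on hand-derived gradients; the weights and target below are illustrative, not the worksheet's.

import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(
    nn.Linear(4, 4), nn.ReLU(),
    nn.Linear(4, 2), nn.ReLU(),
    nn.Linear(2, 2),              # logits; CrossEntropyLoss applies softmax internally
)
x = torch.tensor([[2.0, 1.0, 3.0, 1.0]])
target = torch.tensor([1])        # illustrative target class

loss = nn.CrossEntropyLoss()(model(x), target)
loss.backward()                   # backpropagation
print(model[0].weight.grad)       # gradient with respect to Layer 1 weights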
8. Transformer (1281)
[Worksheet: take features X1..X5 from the previous block, compute Q and K, form the attention weight matrix (A), apply it to get the attention-weighted features Z1..Z5, then push them through the position-wise feed-forward network (FFN, Linear + ReLU + Linear) and on to the next block.]
Originally posted 1.11.24
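A minimal sketch of the block's two stages in PyTorch, assuming single-head attention and a two-layer ReLU FFN; residual connections and layer norm are left out because the worksheet focuses on the attention-then-FFN flow, and all sizes are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

d = 4                                       # assumed feature size
wq, wk, wv = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
ffn = nn.Sequential(nn.Linear(d, 8), nn.ReLU(), nn.Linear(8, d))

x = torch.randn(5, d)                       # features X1..X5 from the previous block
q, k, v = wq(x), wk(x), wv(x)
a = F.softmax(q @ k.T / d ** 0.5, dim=-1)   # attention weight matrix (A)
z = a @ v                                   # attention-weighted features Z1..Z5
out = ffn(z)                                # position-wise FFN, then on to the next block
print(out.shape)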
9. Batch Normalization (504)
[Worksheet: push a mini-batch X1..X4 through a linear layer and ReLU, compute the batch statistics per feature (sum Σ, mean µ, variance σ², standard deviation σ), normalize by subtracting µ and dividing by σ, then scale and shift with the trainable parameters before the next layer.]
Originally posted 1.14.24
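A minimal sketch of the normalize-then-scale-and-shift arithmetic in NumPy; the mini-batch and the trainable gamma/beta values are illustrative.

import numpy as np

h = np.array([[1., 0., 3., 0.],     # illustrative post-ReLU activations,
              [0., 3., 1., 1.],     # one row per feature, one column per example
              [2., 1., 0., 2.]])
gamma = np.array([1., 2., 1.])      # trainable scale
beta  = np.array([0., 0., 1.])      # trainable shift

mu  = h.mean(axis=1, keepdims=True)              # mean µ per feature across the batch
var = h.var(axis=1, keepdims=True)               # variance σ²
h_norm = (h - mu) / np.sqrt(var + 1e-5)          # normalize
out = gamma[:, None] * h_norm + beta[:, None]    # scale & shift
print(out)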
10. Generative Adversarial Network (GAN) (1004)
[Worksheet: the generator maps noise vectors N1..N4 through Linear + ReLU layers into fake samples F1..F4; the discriminator scores the fakes alongside real samples X1..X4 and ends in a sigmoid, producing predictions Y. To train the discriminator, the targets YD are 0 for fakes and 1 for reals; to train the generator, the targets YG are 1 for its fakes. The loss gradients ∂L_D/∂Z and ∂L_G/∂Z are then computed by hand.]
Originally posted 1.15.24
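A minimal sketch of the two target assignments in PyTorch, assuming tiny MLPs for both networks; architectures and sizes are illustrative, and optimizer steps and zero_grad are omitted.

import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(3, 4), nn.ReLU(), nn.Linear(4, 4))                  # generator
D = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 1), nn.Sigmoid())    # discriminator
bce = nn.BCELoss()

noise = torch.randn(4, 3)
real  = torch.randn(4, 4)
fake  = G(noise)

# Train the discriminator: fakes -> 0, reals -> 1.
d_loss = bce(D(fake.detach()), torch.zeros(4, 1)) + bce(D(real), torch.ones(4, 1))
d_loss.backward()

# Train the generator: it wants its fakes scored as 1.
g_loss = bce(D(fake), torch.ones(4, 1))
g_loss.backward()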
11. Self Attention (1376)
[Worksheet: project the features x1..x4 with WQ, WK, WV into queries, keys, and values; matrix-multiply (KᵀQ), scale, and apply softmax (exponentiate, then divide by the column sum) to get the attention weight matrix (A); matrix-multiply again with V to get the attention-weighted features z1..z4, which feed the FFN.]
Originally posted 1.16.24
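A minimal sketch of those steps in NumPy, keeping the worksheet's column-per-token layout and its KᵀQ orientation (softmax over each column); the projection matrices and features are illustrative.

import numpy as np

def softmax_cols(s):
    e = np.exp(s - s.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

d, n, dk = 3, 4, 2                                    # assumed sizes
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(d, n)).astype(float)     # features x1..x4 as columns
WQ, WK, WV = (rng.integers(-1, 2, size=(dk, d)).astype(float) for _ in range(3))

Q, K, V = WQ @ X, WK @ X, WV @ X
A = softmax_cols(K.T @ Q / np.sqrt(dk))   # attention weight matrix (keys x queries)
Z = V @ A                                 # attention-weighted features z1..z4
print(Z)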
12. Dropout (557)
[Worksheet: run the training inputs X1, X2 and unseen inputs through Linear + ReLU layers; during training, a random sequence decides which activations are zeroed by Dropout (p=0.5, then p=0.33), while at inference dropout is disabled; compare the outputs Y against the targets Y' with the MSE loss and compute the training gradient ∂L/∂Y, which for squared error is 2(Y − Y').]
Originally posted 1.19.24
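A minimal sketch of the train-versus-inference difference in PyTorch; the layer sizes are assumptions, and the point is that model.train() applies the random mask (PyTorch also rescales survivors by 1/(1-p)) while model.eval() passes activations through unchanged.

import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Sequential(nn.Linear(2, 4), nn.ReLU(), nn.Dropout(p=0.5))
x = torch.tensor([[3.0, 5.0]])

layer.train()
print(layer(x))   # roughly half the activations zeroed, survivors scaled by 1/(1-p)

layer.eval()
print(layer(x))   # dropout disabled at inference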
13. Autoencoder (849)
[Worksheet: encode the inputs X1..X4 with Linear + ReLU layers down to a low-dimensional bottleneck, decode back up with Linear + ReLU layers to outputs Y, compare against the targets Y' (the inputs themselves) with the reconstruction MSE loss, and compute the gradients ∂L/∂Y.]
Originally posted 1.22.24
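A minimal sketch in PyTorch with assumed sizes; the targets are the inputs themselves, which is what makes the loss a reconstruction loss.

import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 2), nn.ReLU())   # bottleneck of 2
decoder = nn.Sequential(nn.Linear(2, 3), nn.ReLU(), nn.Linear(3, 4))

x = torch.tensor([[1.0, 2.0, 3.0, 1.0]])
y = decoder(encoder(x))
loss = nn.MSELoss()(y, x)     # reconstruction loss: the target is the input
loss.backward()
print(loss.item())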
14. Vector Database (2224)
[Worksheet: embed the stored texts ("how are you", "who are you", "who am I", "am I you") and the query by looking up word embeddings, encoding with Linear + ReLU, mean pooling, and projecting; index the resulting vectors into storage, then retrieve by taking dot products between the query vector and the stored vectors and returning the nearest neighbor (argmax).]
Originally posted 2.1.24
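A minimal sketch of the index-then-retrieve step in NumPy, assuming the embedding function is just a mean of random per-word vectors; a real system would use a learned encoder and projection as the worksheet does.

import numpy as np

rng = np.random.default_rng(0)
vocab = {w: rng.normal(size=4) for w in "how are you who am I".split()}

def embed(text):
    # Mean-pool the word vectors (stand-in for the worksheet's encoder + projection).
    return np.mean([vocab[w] for w in text.split()], axis=0)

storage = ["how are you", "who are you", "who am I", "am I you"]
index = np.stack([embed(t) for t in storage])    # indexing: vector storage

query = embed("who are you")
scores = index @ query                           # retrieval: dot products
print(storage[int(np.argmax(scores))])           # nearest neighbor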
15. CLIP (885)
[Worksheet: a mini-batch of text-image pairs (e.g. "big table", "mini chair", "top hat"; CLIP itself trains on 400 million such pairs and more). The text encoder embeds the words (word2vec), mean-pools, and projects to T1, T2, T3; the image encoder flattens the patches, embeds, mean-pools, and projects, all into a shared embedding space. The image-to-text and text-to-image similarity matrices go through a softmax and are compared against identity targets with the cross-entropy loss, whose gradients are computed by hand.]
Originally posted 2.10.24
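A minimal sketch of the symmetric contrastive objective in PyTorch, with random stand-ins for the two encoders' outputs; the batch size, embedding size, and temperature are assumptions.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, d = 3, 4                                           # 3 text-image pairs, embedding size 4
text_emb  = F.normalize(torch.randn(n, d), dim=-1)    # stand-in for the text encoder
image_emb = F.normalize(torch.randn(n, d), dim=-1)    # stand-in for the image encoder

logits = image_emb @ text_emb.T / 0.07                # similarity matrix (image -> text)
targets = torch.arange(n)                             # matching pairs lie on the diagonal
loss = (F.cross_entropy(logits, targets) +            # image -> text
        F.cross_entropy(logits.T, targets)) / 2       # text -> image
print(loss.item())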
16. Residual Network (625)
[Worksheet, part 1: push X through two weight layers with ReLU, then add the skip connection (the input itself) before the final ReLU. Part 2: the same add-and-skip idea inside a Transformer encoder block: input embeddings go through Q/K attention, Add & Norm, a feed-forward layer, Add & Norm, and on to the next block.]
Originally posted 2.15.24
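A minimal sketch of the residual connection itself in PyTorch; the two weight layers are assumed to be plain Linear layers of matching size so the skip addition is shape-compatible.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, d=3):
        super().__init__()
        self.layer1 = nn.Linear(d, d)
        self.layer2 = nn.Linear(d, d)

    def forward(self, x):
        out = self.layer2(torch.relu(self.layer1(x)))
        return torch.relu(out + x)       # add the skip connection, then ReLU

x = torch.tensor([[2.0, 1.0, 0.0]])
print(ResidualBlock()(x))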
17. Graph Convolutional Network (573)
[Worksheet: for a graph over nodes A..E with its adjacency matrix, each layer multiplies the node features by the adjacency matrix (message passing), applies a weight matrix and ReLU, and repeats; the final node features go through a fully connected network to produce the output.]
Originally posted 2.17.24
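A minimal sketch of one message-passing layer in NumPy, using an assumed adjacency matrix with self-loops; the adjacency normalization used in Kipf and Welling's GCN is omitted to stay close to the worksheet's by-hand arithmetic.

import numpy as np

# Assumed graph over nodes A..E, with self-loops on the diagonal.
A = np.array([[1, 1, 0, 0, 1],
              [1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0],
              [0, 0, 1, 1, 1],
              [1, 0, 0, 1, 1]], dtype=float)
X = np.eye(5)                      # illustrative one-hot node features
W = np.random.default_rng(0).integers(-1, 2, size=(5, 3)).astype(float)

H = np.maximum(A @ X @ W, 0)       # messages, then weights, then ReLU
print(H)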
18. SORA's Diffusion Transformer (2118)
[Worksheet: a training video is cut into spacetime patches (pixels) and mapped by the visual encoder to a latent; the prompt ("sora is sky") goes through the text encoder. At diffusion step t = 3, sampled noise is added to the latent, and the noised latent passes through self-attention (Q, K), adaptive layer norm, and a pointwise FFN to predict the noise. The MSE loss between the predicted and sampled noise gives the training gradients; at generation time the predicted noise is subtracted to recover a noise-free latent, which the visual decoder turns back into video.]
Originally posted 2.19.24
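A minimal sketch of the noise-prediction objective in PyTorch, with a single Linear layer standing in for the whole diffusion transformer (attention, adaptive layer norm, and text conditioning are omitted); everything here is an assumption meant only to show the add-noise, predict-noise, MSE pattern.

import torch
import torch.nn as nn

torch.manual_seed(0)
denoiser = nn.Linear(8, 8)                # stand-in for the diffusion transformer

latent = torch.randn(1, 8)                # latent from the visual encoder
noise = torch.randn_like(latent)          # sampled noise at step t
noised = latent + noise                   # noised latent

pred_noise = denoiser(noised)             # the model predicts the added noise
loss = nn.MSELoss()(pred_noise, noise)    # MSE between predicted and sampled noise
loss.backward()
print(loss.item())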
19. Switch Transformer (576)
(Gemini 1.5's Sparse Mixture of Experts)
[Worksheet: features X1..X5 from the previous block go through Q/K attention to produce the attention-weighted features Z1..Z5; the switch computes gate values for experts A, B, and C, an argmax picks one expert ID per position, and only that expert's position-wise FFN processes the token before the next block.]
Originally posted 2.24.24
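A minimal sketch of the top-1 routing step in PyTorch, with three small expert FFNs and assumed sizes; load-balancing losses and capacity limits are omitted.

import torch
import torch.nn as nn

torch.manual_seed(0)
d, n_experts = 4, 3
switch = nn.Linear(d, n_experts)          # gate values for experts A, B, C
experts = nn.ModuleList(
    [nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d)) for _ in range(n_experts)]
)

z = torch.randn(5, d)                              # attention-weighted features Z1..Z5
expert_ids = switch(z).argmax(dim=-1)              # argmax: one expert ID per position
out = torch.stack([experts[int(i)](zi) for i, zi in zip(expert_ids, z)])
print(expert_ids, out.shape)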
20. Reinforcement Learning with Human Feedback (RLHF) (550)
[Worksheet: the LLM continues the prompt "[S] CEO is" over a small vocabulary (him, her, them, is/are, doc, CEO) by sampling the max; human preferences label one continuation the winner and one the loser (e.g. "doc is him" vs. "doc is them"). Each continuation is embedded, mean-pooled, and scored by the reward model (RM); the RM is trained so that σ(winner reward − loser reward) matches the target, and its loss gradient (predicted − target) and reward are then used to align the LLM.]
Originally posted 3.4.24
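A minimal sketch of the reward-model training step in PyTorch, with a Linear layer standing in for the RM over mean-pooled embeddings; the pairwise loss shown is the standard -log σ(r_winner − r_loser), and all sizes and inputs are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
reward_model = nn.Linear(4, 1)        # stand-in RM over pooled embeddings

winner = torch.randn(1, 4)            # pooled embedding of the preferred continuation (illustrative)
loser  = torch.randn(1, 4)            # pooled embedding of the rejected continuation (illustrative)

r_w, r_l = reward_model(winner), reward_model(loser)
loss = -F.logsigmoid(r_w - r_l).mean()    # push the winner's reward above the loser's
loss.backward()
print(loss.item())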