NLP & LLM - 11
Deep Learning
IIT ROPAR Minor In AI
21 March, 2025
Contents
1 Introduction to AI, Machine Learning, and Deep Learning
1.1 AI: Mimicking Human Intelligence
1.2 Machine Learning: Learning from Data without Explicit Coding
1.3 Deep Learning: Inspired by Human Brain, Uses Neural Networks
6 Prompt Engineering
6.1 Types: Zero-shot, Few-shot, Chain of Thought
6.1.1 Zero-shot Learning
6.1.2 Few-shot Learning
6.1.3 Chain of Thought (CoT)
6.2 Components: Instruction, Context, Input Data, Output Indicator
7 Future Developments
7.1 Agentic AI and Autonomous Agents
7.2 Debates on AI Capabilities and Potential Risks
7.2.1 Alignment and Safety
7.2.2 Scaling Laws and Emergent Abilities
8 Evaluation of LLMs
8.1 Code-based, Human Evaluation, LLM as Judge
8.1.1 Code-based Evaluation
8.1.2 Human Evaluation
8.1.3 LLM as Judge
8.2 Concepts like Distillation and Mixture of Experts
8.2.1 Knowledge Distillation
8.2.2 Mixture of Experts (MoE)
1 Introduction to AI, Machine Learning, and Deep Learning
1.1 AI: Mimicking Human Intelligence
Artificial Intelligence (AI) refers to systems designed to perform tasks that
typically require human intelligence, such as visual perception, speech recognition,
decision-making, and language translation.
Historical Context
The term "Artificial Intelligence" was coined by John McCarthy in 1956
at the Dartmouth Conference, which is considered the founding event
of AI as a field. Early AI systems were predominantly rule-based and
focused on symbolic reasoning.
AI System = {Task Performance, Learning Capability, Adaptability, Reasoning Mechanisms}    (1)

f : X → Y    (2)
• Reinforcement Learning: Learning through interaction with an environment
Case Study: Netflix Recommendation System
Netflix employs machine learning algorithms to analyze user viewing history,
ratings, and preferences to recommend content. This system processes billions
of data points to learn patterns that predict which shows a user might enjoy,
demonstrating how ML can create personalized experiences at scale. The
recommendation system combines collaborative filtering (comparing user behavior
with similar users) and content-based methods (analyzing show attributes).
y = σ(w · x + b) (3)
Where σ is an activation function, w represents weights, x is the input, and
b is a bias term.
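As a minimal sketch of equation (3), assuming NumPy and made-up weights, the snippet below computes one prediction; none of the values come from the notes:

import numpy as np

def sigmoid(z):
    # Logistic activation squashing z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative weights, bias, and input (hypothetical values)
w = np.array([0.4, -0.2, 0.1])
x = np.array([1.0, 2.0, 3.0])
b = 0.5

y = sigmoid(np.dot(w, x) + b)   # y = sigma(w . x + b)
print(round(float(y), 3))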
Case Study: AlphaFold
DeepMind’s AlphaFold represents a breakthrough application of deep
learning in protein structure prediction. Prior to AlphaFold, determining protein
structures was an enormously time-consuming laboratory process. AlphaFold
uses deep neural networks trained on known protein structures to predict the
three-dimensional structure of proteins from their amino acid sequences with
unprecedented accuracy, revolutionizing molecular biology and drug discovery.
% Rules
grandparent(X, Z) :- parent(X, Y), parent(Y, Z).
Case Study: MYCIN
MYCIN was an early expert system developed at Stanford University in
the 1970s to diagnose infectious blood diseases and recommend antibiotics. It
contained approximately 600 rules that encoded the knowledge of infectious
disease experts. When tested, MYCIN performed at a level comparable to
specialists, demonstrating how explicit rules could capture expert knowledge.
Agentic AI Framework
z = Σ_{i=1}^{n} w_i x_i + b    (9)

a = σ(z)    (10)
Where w_i are the weights, x_i the inputs, b the bias, and σ an activation function.
Multi-layer networks stack these units to create more complex architectures:
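A minimal NumPy sketch of composing two such layers (all sizes and weights are illustrative placeholders):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer(x, W, b):
    # One unit per row of W: z = W x + b, then a = sigma(z)
    return sigmoid(W @ x + b)

rng = np.random.default_rng(0)
x = rng.normal(size=4)                          # input vector
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)   # hidden layer: 4 -> 5 units
W2, b2 = rng.normal(size=(2, 5)), np.zeros(2)   # output layer: 5 -> 2 units

h = layer(x, W1, b1)    # hidden activations
y = layer(h, W2, b2)    # network output
print(y.shape)          # (2,)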
(f ∗ g)(x, y) = Σ_m Σ_n f(m, n) g(x − m, y − n)    (15)
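A direct, unoptimized NumPy sketch of this operation on a toy image; the kernel and sizes are illustrative, and the comment notes the kernel-flip difference between textbook convolution and what most libraries compute:

import numpy as np

def conv2d(image, kernel):
    # Valid sliding-window sum of products (cross-correlation, which is what
    # most deep learning libraries call convolution). Flipping the kernel with
    # kernel[::-1, ::-1] gives the textbook convolution of equation (15).
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 "image"
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])       # toy 2x2 edge-like filter
print(conv2d(image, kernel).shape)                  # (4, 4)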
Key Components:
h_t = σ(W_xh x_t + W_hh h_{t−1} + b_h)    (16)

y_t = σ(W_hy h_t + b_y)    (17)
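A minimal NumPy sketch of one recurrent step of equations (16)-(17); tanh stands in for σ, the output activation is left as the identity, and all dimensions are illustrative:

import numpy as np

def rnn_step(x_t, h_prev, Wxh, Whh, Why, bh, by):
    # h_t = sigma(Wxh x_t + Whh h_{t-1} + b_h); tanh used as the activation here
    h_t = np.tanh(Wxh @ x_t + Whh @ h_prev + bh)
    # y_t = sigma(Why h_t + b_y); identity output used for simplicity
    y_t = Why @ h_t + by
    return h_t, y_t

rng = np.random.default_rng(1)
d_in, d_h, d_out = 3, 4, 2
Wxh = rng.normal(size=(d_h, d_in))
Whh = rng.normal(size=(d_h, d_h))
Why = rng.normal(size=(d_out, d_h))
bh, by = np.zeros(d_h), np.zeros(d_out)

h = np.zeros(d_h)
for x_t in rng.normal(size=(5, d_in)):   # a toy sequence of 5 steps
    h, y = rnn_step(x_t, h, Wxh, Whh, Why, bh, by)
print(y.shape)   # (2,)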
Transformer Architecture
Key innovations:
• Self-attention mechanism
• Positional encoding
• Multi-head attention
• Feed-forward networks in each layer
4.2 Self-attention and Parallel Processing Capabilities
Self-attention computes relationships between all positions in a sequence:
Attention(Q, K, V) = softmax(Q K^T / √d_k) V    (24)
Where Q (queries), K (keys), and V (values) are derived from the input
sequence.
Multi-head attention computes attention multiple times in parallel:
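A NumPy sketch of equation (24), plus a simple way to run several heads in parallel; the input/output projection matrices of a real Transformer are omitted and all shapes are illustrative:

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ V

def multi_head(Q, K, V, n_heads):
    # Split the model dimension into n_heads smaller heads, attend in parallel,
    # then concatenate the results (output projection omitted here).
    heads = [attention(q, k, v) for q, k, v in zip(
        np.split(Q, n_heads, axis=-1),
        np.split(K, n_heads, axis=-1),
        np.split(V, n_heads, axis=-1))]
    return np.concatenate(heads, axis=-1)

rng = np.random.default_rng(2)
seq_len, d_model = 6, 8
X = rng.normal(size=(seq_len, d_model))       # stand-in for projected Q, K, V
print(multi_head(X, X, X, n_heads=2).shape)   # (6, 8)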
Pre-training Scale
L_RLHF = E_{x∼D}[ r_ϕ(x, y) − β log( p_θ(y|x) / p_ref(y|x) ) ]    (28)
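A tiny sketch of the per-sample quantity inside the expectation in equation (28), with made-up reward and log-probability values:

def rlhf_objective(reward, logp_policy, logp_ref, beta=0.1):
    # r_phi(x, y) - beta * log( p_theta(y|x) / p_ref(y|x) )
    # The log-ratio term penalizes drifting too far from the reference model.
    return reward - beta * (logp_policy - logp_ref)

# Hypothetical numbers for one sampled response
print(rlhf_objective(reward=1.8, logp_policy=-12.3, logp_ref=-13.0))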
5.2.2 Limitations
Hallucination: LLMs can generate plausible-sounding but factually incorrect
information.
Example of Hallucination
When asked about obscure topics, LLMs may confidently generate
fictional information, such as inventing non-existent research papers or
creating false historical events.
Knowledge Cutoff: LLMs cannot know about events after their training
data ends.
Knowledge Access = Comprehensive for t < t_cutoff,  None for t > t_cutoff    (29)
Context Length: LLMs have a finite window of text they can process at
once.
6 Prompt Engineering
6.1 Types: Zero-shot, Few-shot, Chain of Thought
6.1.1 Zero-shot Learning
The model performs tasks without specific examples:
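For example, a zero-shot prompt simply states the task (this prompt is illustrative, not taken from the notes):

Classify the sentiment of this review as Positive, Negative, or Neutral.
Review: "The battery lasts two days, but the camera is disappointing."
Sentiment: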
6.1.2 Few-shot Learning
Providing examples helps the model understand the desired pattern:
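An illustrative few-shot prompt for a similar task (the examples are invented for illustration):

Classify the sentiment as Positive or Negative.
Review: "Great value for the price." -> Positive
Review: "Stopped working after a week." -> Negative
Review: "The screen is bright and the setup was easy." ->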
6.1.3 Chain of Thought (CoT)

Chain-of-thought prompting asks the model to write out intermediate reasoning steps before giving the final answer, as in this worked example:

Let's think through this step by step:
1. The store starts with 10 apples.
2. It sells 3 apples to customer A, leaving 10 - 3 = 7 apples.
3. It sells 4 apples to customer B, leaving 7 - 4 = 3 apples.
4. It buys 5 more apples, giving it 3 + 5 = 8 apples total.
Therefore, the store has 8 apples now.
Case Study: GSM8K Math Problems
Research on the GSM8K benchmark (grade school math problems)
demonstrates the dramatic improvement in performance achieved through
chain-of-thought prompting. Without CoT, even large language models struggle with
multi-step reasoning problems. With CoT prompting, performance improved by
20-40 percentage points across various model sizes, highlighting how the right
prompting strategy can unlock capabilities already present in the model.
Prompt Components
# CONTEXT
This is for a patient education website. The audience has no medical background.
# INPUT DATA
[Research abstract text here]
# OUTPUT INDICATOR
Your summary should be 3-5 short paragraphs. Include a one-sentence "Key Takeaway".
Case Study: Legal Document Analysis
Law firms use structured prompts to extract specific information from
contracts. By providing clear instructions (e.g., "Identify all payment terms and
obligations"), relevant context (e.g., "This is for a procurement contract
review"), specific input data (the contract text), and output indicators (e.g.,
"Format as a table with clause references"), they achieve consistent, structured
outputs that can be incorporated directly into legal workflows. This demonstrates
how well-crafted prompts can turn LLMs into specialized information-extraction
tools.
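A minimal Python sketch of assembling the four components into one prompt string; the build_prompt helper and the section markers are hypothetical, not a required format:

def build_prompt(instruction, context, input_data, output_indicator):
    # Combine the four prompt components into one structured prompt.
    return (
        f"# INSTRUCTION\n{instruction}\n\n"
        f"# CONTEXT\n{context}\n\n"
        f"# INPUT DATA\n{input_data}\n\n"
        f"# OUTPUT INDICATOR\n{output_indicator}\n"
    )

prompt = build_prompt(
    instruction="Identify all payment terms and obligations.",
    context="This is for a procurement contract review.",
    input_data="[Contract text here]",
    output_indicator="Format as a table with clause references.",
)
print(prompt)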
7 Future Developments
7.1 Agentic AI and Autonomous Agents
Agentic AI systems combine LLMs with:
Agent Architecture = {Perception Module, Memory System, Planning Engine, Action Execution, Learning Mechanism}    (31)
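A highly simplified sketch of how these five components could be wired into an agent loop; every class and method name here is a hypothetical placeholder rather than an established framework:

from dataclasses import dataclass, field

@dataclass
class Agent:
    memory: list = field(default_factory=list)     # Memory System

    def perceive(self, observation):               # Perception Module
        self.memory.append(("obs", observation))
        return observation

    def plan(self, goal, observation):             # Planning Engine
        # In a real system an LLM would draft the plan; here it is a stub.
        return [f"step toward '{goal}' given '{observation}'"]

    def act(self, step):                           # Action Execution
        result = f"executed: {step}"
        self.memory.append(("act", result))
        return result

    def learn(self):                               # Learning Mechanism
        # Placeholder: a real agent would update its strategy from memory.
        return len(self.memory)

agent = Agent()
obs = agent.perceive("user asks for a weather report")
for step in agent.plan(goal="answer the user", observation=obs):
    agent.act(step)
agent.learn()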
8 Evaluation of LLMs
8.1 Code-based, Human Evaluation, LLM as Judge
8.1.1 Code-based Evaluation
Automated metrics provide objective but limited assessment:
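For instance, exact-match accuracy against reference answers can be scored automatically; this small sketch uses made-up predictions:

def exact_match_accuracy(predictions, references):
    # Fraction of model outputs that match the reference answer exactly
    # (after trivial whitespace/case normalization).
    matches = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return matches / len(references)

preds = ["Paris", "8 apples", "42"]
refs = ["Paris", "7 apples", "42"]
print(exact_match_accuracy(preds, refs))   # 0.666...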
8.1.2 Human Evaluation

Human raters judge model outputs along dimensions such as:

Human Evaluation = {Helpfulness, Accuracy, Safety, Quality, Bias}    (34)
8.1.3 LLM as Judge

A strong LLM can be prompted to grade another model's output, for example:

Score from 1-10 on:
- Relevance to query
- Factual accuracy
- Completeness
- Clarity
- Helpfulness
Provide justification for each score.
Case Study: MMLU Benchmark
The Massive Multitask Language Understanding (MMLU) benchmark
evaluates models across 57 subjects ranging from elementary mathematics to
professional medicine. This comprehensive evaluation reveals both strengths and
weaknesses in model capabilities across different domains of knowledge. Recent
models achieve human expert-level performance in some categories while
still struggling in others, providing a nuanced picture of progress and remaining
challenges in language model development.
8.2 Concepts like Distillation and Mixture of Experts

8.2.1 Knowledge Distillation

In knowledge distillation, a smaller student model is trained to imitate a larger
teacher model, with a loss term L_KD that measures the divergence between
student and teacher model outputs.
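A small sketch of one common choice for L_KD: the KL divergence between temperature-softened teacher and student distributions (the logits below are made up):

import numpy as np

def softmax(logits, T=1.0):
    z = np.asarray(logits) / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kd_loss(student_logits, teacher_logits, T=2.0):
    # KL( teacher || student ) over temperature-softened distributions
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    return float(np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student))))

print(kd_loss([2.0, 1.0, 0.1], [2.5, 0.8, 0.2]))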
8.2.2 Mixture of Experts (MoE)

A mixture-of-experts layer combines n expert networks f_i, weighted by a gating function g:

y = Σ_{i=1}^{n} g(x, i) · f_i(x)    (36)
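A minimal sketch of equation (36) with three linear "experts" and a softmax gate; all weights are random placeholders:

import numpy as np

rng = np.random.default_rng(3)
d_in, d_out, n_experts = 4, 2, 3

experts = [rng.normal(size=(d_out, d_in)) for _ in range(n_experts)]  # f_i
W_gate = rng.normal(size=(n_experts, d_in))                           # gate weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def moe_forward(x):
    # y = sum_i g(x, i) * f_i(x), where g is a softmax over gate scores
    g = softmax(W_gate @ x)
    return sum(g[i] * (experts[i] @ x) for i in range(n_experts))

x = rng.normal(size=d_in)
print(moe_forward(x).shape)   # (2,)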