NLP & LLM - 11
Deep Learning
IIT ROPAR Minor In AI
21 March, 2025
Contents
1 Introduction to AI, Machine Learning, and Deep Learning
1.1 AI: Mimicking Human Intelligence
1.2 Machine Learning: Learning from Data without Explicit Coding
1.3 Deep Learning: Inspired by Human Brain, Uses Neural Networks
6 Prompt Engineering
6.1 Types: Zero-shot, Few-shot, Chain of Thought
6.1.1 Zero-shot Learning
6.1.2 Few-shot Learning
6.1.3 Chain of Thought (CoT)
6.2 Components: Instruction, Context, Input Data, Output Indicator
7 Future Developments
7.1 Agentic AI and Autonomous Agents
7.2 Debates on AI Capabilities and Potential Risks
7.2.1 Alignment and Safety
7.2.2 Scaling Laws and Emergent Abilities
8 Evaluation of LLMs
8.1 Code-based, Human Evaluation, LLM as Judge
8.1.1 Code-based Evaluation
8.1.2 Human Evaluation
8.1.3 LLM as Judge
8.2 Concepts like Distillation and Mixture of Experts
8.2.1 Knowledge Distillation
8.2.2 Mixture of Experts (MoE)
1 Introduction to AI, Machine Learning, and Deep Learning
1.1 AI: Mimicking Human Intelligence
Artificial Intelligence (AI) refers to systems designed to perform tasks that
typically require human intelligence, such as visual perception, speech recognition,
decision-making, and language translation.
Historical Context
The term "Artificial Intelligence" was coined by John McCarthy in 1956
at the Dartmouth Conference, which is considered the founding event
of AI as a field. Early AI systems were predominantly rule-based and
focused on symbolic reasoning.
AI System = {Task Performance, Learning Capability, Adaptability, Reasoning Mechanisms}    (1)

f : X → Y    (2)
• Reinforcement Learning: Learning through interaction with an environment
Case Study: Netflix Recommendation System
Netflix employs machine learning algorithms to analyze user viewing history,
ratings, and preferences to recommend content. This system processes billions
of data points to learn patterns that predict which shows a user might enjoy,
demonstrating how ML can create personalized experiences at scale. The
recommendation system combines collaborative filtering (comparing user behavior
with similar users) and content-based methods (analyzing show attributes).
y = σ(w · x + b) (3)
Where σ is an activation function, w represents weights, x is the input, and
b is a bias term.
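As a minimal sketch of equation (3), assuming NumPy and made-up weights, the snippet below computes one prediction; none of the values come from the notes:

import numpy as np

def sigmoid(z):
    # Logistic activation squashing z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative weights, bias, and input (hypothetical values)
w = np.array([0.4, -0.2, 0.1])
x = np.array([1.0, 2.0, 3.0])
b = 0.5

y = sigmoid(np.dot(w, x) + b)   # y = sigma(w . x + b)
print(round(float(y), 3))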
Case Study: AlphaFold
DeepMind’s AlphaFold represents a breakthrough application of deep
learning in protein structure prediction. Prior to AlphaFold, determining protein
structures was an enormously time-consuming laboratory process. AlphaFold
uses deep neural networks trained on known protein structures to predict the
three-dimensional structure of proteins from their amino acid sequences with
unprecedented accuracy, revolutionizing molecular biology and drug discovery.
% Rules
grandparent(X, Z) :- parent(X, Y), parent(Y, Z).
Case Study: MYCIN
MYCIN was an early expert system developed at Stanford University in
the 1970s to diagnose infectious blood diseases and recommend antibiotics. It
contained approximately 600 rules that encoded the knowledge of infectious
disease experts. When tested, MYCIN performed at a level comparable to
specialists, demonstrating how explicit rules could capture expert knowledge.
Agentic AI Framework
z = Σ_{i=1}^{n} w_i x_i + b    (9)

a = σ(z)    (10)
Where w_i are the weights, x_i the inputs, b the bias, and σ an activation function.
Multi-layer networks stack these units to create more complex architectures:
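A minimal NumPy sketch of composing two such layers (all sizes and weights are illustrative placeholders):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer(x, W, b):
    # One unit per row of W: z = W x + b, then a = sigma(z)
    return sigmoid(W @ x + b)

rng = np.random.default_rng(0)
x = rng.normal(size=4)                          # input vector
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)   # hidden layer: 4 -> 5 units
W2, b2 = rng.normal(size=(2, 5)), np.zeros(2)   # output layer: 5 -> 2 units

h = layer(x, W1, b1)    # hidden activations
y = layer(h, W2, b2)    # network output
print(y.shape)          # (2,)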
(f ∗ g)(x, y) = Σ_m Σ_n f(m, n) g(x − m, y − n)    (15)
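A direct, unoptimized NumPy sketch of this operation on a toy image; the kernel and sizes are illustrative, and the comment notes the kernel-flip difference between textbook convolution and what most libraries compute:

import numpy as np

def conv2d(image, kernel):
    # Valid sliding-window sum of products (cross-correlation, which is what
    # most deep learning libraries call convolution). Flipping the kernel with
    # kernel[::-1, ::-1] gives the textbook convolution of equation (15).
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 "image"
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])       # toy 2x2 edge-like filter
print(conv2d(image, kernel).shape)                  # (4, 4)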
Key Components:
h_t = σ(W_xh x_t + W_hh h_{t−1} + b_h)    (16)

y_t = σ(W_hy h_t + b_y)    (17)
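A minimal NumPy sketch of one recurrent step of equations (16)-(17); tanh stands in for σ, the output activation is left as the identity, and all dimensions are illustrative:

import numpy as np

def rnn_step(x_t, h_prev, Wxh, Whh, Why, bh, by):
    # h_t = sigma(Wxh x_t + Whh h_{t-1} + b_h); tanh used as the activation here
    h_t = np.tanh(Wxh @ x_t + Whh @ h_prev + bh)
    # y_t = sigma(Why h_t + b_y); identity output used for simplicity
    y_t = Why @ h_t + by
    return h_t, y_t

rng = np.random.default_rng(1)
d_in, d_h, d_out = 3, 4, 2
Wxh = rng.normal(size=(d_h, d_in))
Whh = rng.normal(size=(d_h, d_h))
Why = rng.normal(size=(d_out, d_h))
bh, by = np.zeros(d_h), np.zeros(d_out)

h = np.zeros(d_h)
for x_t in rng.normal(size=(5, d_in)):   # a toy sequence of 5 steps
    h, y = rnn_step(x_t, h, Wxh, Whh, Why, bh, by)
print(y.shape)   # (2,)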
Transformer Architecture
Key innovations:
• Self-attention mechanism
• Positional encoding
• Multi-head attention
• Feed-forward networks in each layer
4.2 Self-attention and Parallel Processing Capabilities
Self-attention computes relationships between all positions in a sequence:
Attention(Q, K, V) = softmax(Q K^T / √d_k) V    (24)
Where Q (queries), K (keys), and V (values) are derived from the input
sequence.
Multi-head attention computes attention multiple times in parallel:
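A NumPy sketch of equation (24), plus a simple way to run several heads in parallel; the input/output projection matrices of a real Transformer are omitted and all shapes are illustrative:

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ V

def multi_head(Q, K, V, n_heads):
    # Split the model dimension into n_heads smaller heads, attend in parallel,
    # then concatenate the results (output projection omitted here).
    heads = [attention(q, k, v) for q, k, v in zip(
        np.split(Q, n_heads, axis=-1),
        np.split(K, n_heads, axis=-1),
        np.split(V, n_heads, axis=-1))]
    return np.concatenate(heads, axis=-1)

rng = np.random.default_rng(2)
seq_len, d_model = 6, 8
X = rng.normal(size=(seq_len, d_model))       # stand-in for projected Q, K, V
print(multi_head(X, X, X, n_heads=2).shape)   # (6, 8)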
Pre-training Scale
L_RLHF = E_{x∼D}[ r_ϕ(x, y) − β log( p_θ(y|x) / p_ref(y|x) ) ]    (28)
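A tiny sketch of the per-sample quantity inside the expectation in equation (28), with made-up reward and log-probability values:

def rlhf_objective(reward, logp_policy, logp_ref, beta=0.1):
    # r_phi(x, y) - beta * log( p_theta(y|x) / p_ref(y|x) )
    # The log-ratio term penalizes drifting too far from the reference model.
    return reward - beta * (logp_policy - logp_ref)

# Hypothetical numbers for one sampled response
print(rlhf_objective(reward=1.8, logp_policy=-12.3, logp_ref=-13.0))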
5.2.2 Limitations
Hallucination: LLMs can generate plausible-sounding but factually incorrect
information.
Example of Hallucination
When asked about obscure topics, LLMs may confidently generate
fictional information, such as inventing non-existent research papers or
creating false historical events.
Knowledge Cutoff: LLMs cannot know about events after their training
data ends.
Knowledge Access = Comprehensive for t < t_cutoff,  None for t > t_cutoff    (29)
Context Length: LLMs have a finite window of text they can process at
once.
6 Prompt Engineering
6.1 Types: Zero-shot, Few-shot, Chain of Thought
6.1.1 Zero-shot Learning
The model performs tasks without specific examples:
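For example, a zero-shot prompt simply states the task (this prompt is illustrative, not taken from the notes):

Classify the sentiment of this review as Positive, Negative, or Neutral.
Review: "The battery lasts two days, but the camera is disappointing."
Sentiment: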
6.1.2 Few-shot Learning
Providing examples helps the model understand the desired pattern:
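An illustrative few-shot prompt for a similar task (the examples are invented for illustration):

Classify the sentiment as Positive or Negative.
Review: "Great value for the price." -> Positive
Review: "Stopped working after a week." -> Negative
Review: "The screen is bright and the setup was easy." ->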
6.1.3 Chain of Thought (CoT)

Chain-of-thought prompting asks the model to write out intermediate reasoning steps before giving the final answer, as in this worked example:

Let's think through this step by step:
1. The store starts with 10 apples.
2. It sells 3 apples to customer A, leaving 10 - 3 = 7 apples.
3. It sells 4 apples to customer B, leaving 7 - 4 = 3 apples.
4. It buys 5 more apples, giving it 3 + 5 = 8 apples total.
Therefore, the store has 8 apples now.
Case Study: GSM8K Math Problems
Research on the GSM8K benchmark (grade school math problems)
demonstrates the dramatic improvement in performance achieved through
chain-of-thought prompting. Without CoT, even large language models struggle with
multi-step reasoning problems. With CoT prompting, performance improved by
20-40 percentage points across various model sizes, highlighting how the right
prompting strategy can unlock capabilities already present in the model.
Prompt Components
# CONTEXT
This is for a patient education website. The audience has no medical background.
# INPUT DATA
[Research abstract text here]
# OUTPUT INDICATOR
Your summary should be 3-5 short paragraphs. Include a one-sentence "Key Takeaway".
Case Study: Legal Document Analysis
Law firms use structured prompts to extract specific information from
contracts. By providing clear instructions (e.g., "Identify all payment terms and
obligations"), relevant context (e.g., "This is for a procurement contract
review"), specific input data (the contract text), and output indicators (e.g.,
"Format as a table with clause references"), they achieve consistent, structured
outputs that can be incorporated directly into legal workflows. This demonstrates
how well-crafted prompts can turn LLMs into specialized information-extraction
tools.
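A minimal Python sketch of assembling the four components into one prompt string; the build_prompt helper and the section markers are hypothetical, not a required format:

def build_prompt(instruction, context, input_data, output_indicator):
    # Combine the four prompt components into one structured prompt.
    return (
        f"# INSTRUCTION\n{instruction}\n\n"
        f"# CONTEXT\n{context}\n\n"
        f"# INPUT DATA\n{input_data}\n\n"
        f"# OUTPUT INDICATOR\n{output_indicator}\n"
    )

prompt = build_prompt(
    instruction="Identify all payment terms and obligations.",
    context="This is for a procurement contract review.",
    input_data="[Contract text here]",
    output_indicator="Format as a table with clause references.",
)
print(prompt)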
7 Future Developments
7.1 Agentic AI and Autonomous Agents
Agentic AI systems combine LLMs with:
Agent Architecture = {Perception Module, Memory System, Planning Engine, Action Execution, Learning Mechanism}    (31)
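A highly simplified sketch of how these five components could be wired into an agent loop; every class and method name here is a hypothetical placeholder rather than an established framework:

from dataclasses import dataclass, field

@dataclass
class Agent:
    memory: list = field(default_factory=list)     # Memory System

    def perceive(self, observation):               # Perception Module
        self.memory.append(("obs", observation))
        return observation

    def plan(self, goal, observation):             # Planning Engine
        # In a real system an LLM would draft the plan; here it is a stub.
        return [f"step toward '{goal}' given '{observation}'"]

    def act(self, step):                           # Action Execution
        result = f"executed: {step}"
        self.memory.append(("act", result))
        return result

    def learn(self):                               # Learning Mechanism
        # Placeholder: a real agent would update its strategy from memory.
        return len(self.memory)

agent = Agent()
obs = agent.perceive("user asks for a weather report")
for step in agent.plan(goal="answer the user", observation=obs):
    agent.act(step)
agent.learn()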
8 Evaluation of LLMs
8.1 Code-based, Human Evaluation, LLM as Judge
8.1.1 Code-based Evaluation
Automated metrics provide objective but limited assessment:
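For instance, exact-match accuracy against reference answers can be scored automatically; this small sketch uses made-up predictions:

def exact_match_accuracy(predictions, references):
    # Fraction of model outputs that match the reference answer exactly
    # (after trivial whitespace/case normalization).
    matches = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return matches / len(references)

preds = ["Paris", "8 apples", "42"]
refs = ["Paris", "7 apples", "42"]
print(exact_match_accuracy(preds, refs))   # 0.666...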
8.1.2 Human Evaluation

Human raters judge model outputs along dimensions such as:

Human Evaluation = {Helpfulness, Accuracy, Safety, Quality, Bias}    (34)
8.1.3 LLM as Judge

A strong LLM can be prompted to grade another model's output, for example:

Score from 1-10 on:
- Relevance to query
- Factual accuracy
- Completeness
- Clarity
- Helpfulness
Provide justification for each score.
Case Study: MMLU Benchmark
The Massive Multitask Language Understanding (MMLU) benchmark
evaluates models across 57 subjects ranging from elementary mathematics to
professional medicine. This comprehensive evaluation reveals both strengths and
weaknesses in model capabilities across different domains of knowledge. Recent
models achieve human expert-level performance in some categories while
still struggling in others, providing a nuanced picture of progress and remaining
challenges in language model development.
8.2 Concepts like Distillation and Mixture of Experts

8.2.1 Knowledge Distillation

In knowledge distillation, a smaller student model is trained to imitate a larger
teacher model, with a loss term L_KD that measures the divergence between
student and teacher model outputs.
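A small sketch of one common choice for L_KD: the KL divergence between temperature-softened teacher and student distributions (the logits below are made up):

import numpy as np

def softmax(logits, T=1.0):
    z = np.asarray(logits) / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kd_loss(student_logits, teacher_logits, T=2.0):
    # KL( teacher || student ) over temperature-softened distributions
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    return float(np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student))))

print(kd_loss([2.0, 1.0, 0.1], [2.5, 0.8, 0.2]))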
8.2.2 Mixture of Experts (MoE)

A mixture-of-experts layer combines n expert networks f_i, weighted by a gating function g:

y = Σ_{i=1}^{n} g(x, i) · f_i(x)    (36)
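A minimal sketch of equation (36) with three linear "experts" and a softmax gate; all weights are random placeholders:

import numpy as np

rng = np.random.default_rng(3)
d_in, d_out, n_experts = 4, 2, 3

experts = [rng.normal(size=(d_out, d_in)) for _ in range(n_experts)]  # f_i
W_gate = rng.normal(size=(n_experts, d_in))                           # gate weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def moe_forward(x):
    # y = sum_i g(x, i) * f_i(x), where g is a softmax over gate scores
    g = softmax(W_gate @ x)
    return sum(g[i] * (experts[i] @ x) for i in range(n_experts))

x = rng.normal(size=d_in)
print(moe_forward(x).shape)   # (2,)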