
The Atlas of 50 Common AI Models
A Comprehensive Guide to Language, Vision, Speech, and Multimodal Systems

Nabil EL MAHYAOUI
January 3, 2025
Contents

1 Introduction

2 Language Models (LLMs)
   2.1 GPT-4 (OpenAI)
   2.2 BERT (Google)
   2.3 T5 (Google)
   2.4 LaMDA (Google)
   2.5 Claude (Anthropic)
   2.6 LLaMA 2 (Meta)
   2.7 PaLM 2 (Google)
   2.8 RedPajama (Together)
   2.9 Falcon (TII)
   2.10 BioGPT (Microsoft)

3 Vision Models
   3.1 Vision Transformer (ViT, Google)
   3.2 CLIP (OpenAI)
   3.3 SAM (Meta)
   3.4 DALL·E 3 (OpenAI)
   3.5 DeepLabv3+ (Google)
   3.6 YOLOv7 (Academia Sinica)
   3.7 EfficientNet (Google)
   3.8 StyleGAN3 (NVIDIA)
   3.9 Flamingo (DeepMind)
   3.10 Imagine (Meta)

4 Text-to-Speech (TTS)
   4.1 WaveNet (DeepMind)
   4.2 Tacotron 2 (Google)
   4.3 FastSpeech 2 (Microsoft)
   4.4 VITS (Kakao Enterprise)
   4.5 Glow-TTS
   4.6 DeepVoice 3 (Baidu)
   4.7 Speechify
   4.8 Google TTS API
   4.9 Sonantic
   4.10 Respeecher

5 Speech-to-Text (STT)
   5.1 DeepSpeech (Mozilla)
   5.2 wav2vec 2.0 (Meta)
   5.3 Whisper (OpenAI)
   5.4 Conformer (Google)
   5.5 Jasper (NVIDIA)
   5.6 Kaldi (Open-Source)
   5.7 AssemblyAI
   5.8 Speechmatics
   5.9 Rev AI
   5.10 AWS Transcribe

6 Multimodal Models
   6.1 GPT-4 (OpenAI)
   6.2 CLIP (OpenAI)
   6.3 Flamingo (DeepMind)
   6.4 DALL·E 3 (OpenAI)
   6.5 PaLM-E (Google)
   6.6 LLaVA (UW-Madison & Microsoft)
   6.7 ImageBind (Meta)
   6.8 BLIP-2
   6.9 Perceiver IO (DeepMind)
   6.10 Muse (Google)

7 Conclusion

References and Further Reading
Chapter 1

Introduction

Artificial Intelligence has evolved into a vast ecosystem of specialized models that serve different
modalities:

• Language Models (LLMs) for text-based tasks

• Vision Models for image recognition, segmentation, and generation

• Text-to-Speech (TTS) engines for producing human-like spoken audio

• Speech-to-Text (STT) systems for converting human speech into text

• Multimodal Models that combine data from multiple modalities (e.g., text & vision)

The following chapters cover 50 notable AI models, 10 for each of the five categories. We include
details on their development, key innovations, parameter sizes (when known), architectures,
typical use cases, and references or relevant links. This work aims to serve as both a quick-start
guide for newcomers and a deeper reference for seasoned practitioners.

Chapter 2

Language Models (LLMs)

Overview

Language Models are at the forefront of modern AI research, excelling at tasks such as text
generation, sentiment analysis, summarization, translation, and more. Below are ten notable
LLMs that have significantly influenced the NLP landscape.

2.1 GPT-4 (OpenAI)

Release & Developer

• Release: 2023

• Developer: OpenAI

Parameter Count

Not publicly disclosed; widely believed to be on the order of hundreds of billions of parameters.

Architecture

• Transformer-based, improved upon GPT-3’s decoder-only style.

• Incorporates multimodal inputs (image + text) at certain access levels.

Key Innovations

• Enhanced reasoning and steerability via system messages.

• Supports limited image input understanding (vision).

• Improved factual correctness and decreased tendency to generate disallowed content.

Use Cases

• Advanced writing assistance and content generation.

• Code generation and debugging.

• Summarization, Q&A, creative storytelling.
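
As a concrete illustration, the sketch below calls GPT-4 through OpenAI's official Python SDK (openai>=1.0). The prompt and system message are placeholders, and the code assumes an OPENAI_API_KEY environment variable is set; check OpenAI's documentation for current model names.

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a concise technical assistant."},
            {"role": "user", "content": "Summarize the Transformer architecture in three sentences."},
        ],
    )
    print(response.choices[0].message.content)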

References

• OpenAI Official GPT-4 Page: https://openai.com/product/gpt-4


2.2 BERT (Google)

Release & Developer

• Release: 2018

• Developer: Google

Parameter Count

• BERT-Base: 110M

• BERT-Large: 340M

Architecture

• Transformer encoder architecture (bidirectional).

• Trained with Masked Language Modeling (MLM) and Next Sentence Prediction (NSP).

Key Innovations

• Bidirectional context representation (as opposed to left-to-right or right-to-left).

• Strong for tasks like Q&A (SQuAD) and classification (GLUE).

Use Cases

• Text classification, sentiment analysis, entity recognition.

• Fine-tuning for a wide array of NLP tasks (Q&A, classification, etc.).
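
Because BERT is trained with masked language modeling, the quickest way to see it in action is a fill-mask query via the Hugging Face transformers library. A minimal sketch, assuming transformers and a PyTorch backend are installed:

    from transformers import pipeline

    # bert-base-uncased predicts the most likely tokens for the [MASK] position
    unmasker = pipeline("fill-mask", model="bert-base-uncased")
    for prediction in unmasker("The capital of France is [MASK]."):
        print(prediction["token_str"], round(prediction["score"], 3))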

References

• Devlin, J., et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Under-
standing.” (2018)
2.3 T5 (Google)

Release & Developer

• Release: 2019

• Developer: Google

Parameter Count

Multiple variants ranging from 220M to 11B+.

Architecture

• Sequence-to-sequence (encoder-decoder) Transformer.

• All tasks cast as text-to-text, simplifying the training objective.

Key Innovations

• Unified framework that reframes all NLP tasks as text generation.

• Performance improvements on summarization, translation, classification.

Use Cases

• Summarization, QA, translation, classification.

• Multi-task learning with text-based input/output.
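
The text-to-text framing means every task is just a text prefix plus the input. A minimal sketch with the transformers library, using the public t5-small checkpoint and a "summarize:" prefix as an example task (the sentencepiece package is required by the T5 tokenizer):

    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained("t5-small")   # requires the sentencepiece package
    model = T5ForConditionalGeneration.from_pretrained("t5-small")

    # The task is selected purely by the prefix ("summarize:", "translate English to German:", ...)
    text = ("summarize: The Transformer replaced recurrence with self-attention, "
            "enabling parallel training and better long-range context modeling.")
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_new_tokens=40)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))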

References

• Raffel, C., et al. “Exploring the Limits of Transfer Learning with a Unified Text-to-Text
Transformer.” (2019)
2.4 LaMDA (Google)

Release & Developer

• Release: 2021

• Developer: Google

Parameter Count

The largest LaMDA model described in the 2022 paper has about 137B parameters.

Architecture

• Transformer-based with a focus on dialogue optimization.

• Specialized training on conversational data to improve coherence.

Key Innovations

• Fine-tuned for multi-turn dialogue and open-ended conversation.

• Emphasizes safe and context-aware responses.

Use Cases

• Chatbots for customer support or entertainment.

• Prototyping advanced conversational agents.

References

• Google AI Blog, “LaMDA: Towards Safe, Grounded, and High-Quality Dialog Models for Everything”: https://ai.googleblog.com/2022/01/lamda-towards-safe-grounded-and-high.html
2.5 Claude (Anthropic)

Release & Developer

• Release: 2022

• Developer: Anthropic

Parameter Count

Exact size undisclosed.

Architecture

• Transformer-based large language model.

• Emphasis on “constitutional AI” principles for safer text generation.

Key Innovations

• Focuses on interpretable, controllable, and safe text outputs.

• Uses curated “rules” or “constitution” to guide responses.

Use Cases

• Ethical chatbots, safer content moderation systems.

• Research on model alignment and interpretability.
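
A minimal sketch of calling Claude through Anthropic's official Python SDK. The model identifier shown is an assumption and should be replaced with whatever Anthropic's current documentation lists, and an ANTHROPIC_API_KEY must be set in the environment.

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # assumed model id; check Anthropic's model list
        max_tokens=256,
        messages=[{"role": "user", "content": "Explain constitutional AI in two sentences."}],
    )
    print(message.content[0].text)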

References

• Anthropic’s Official Site: https://www.anthropic.com/index/claude


2.6 LLaMA 2 (Meta)

Release & Developer

• Release: 2023

• Developer: Meta (Facebook AI)

Parameter Count

Released in 7B, 13B, and 70B parameter variants.

Architecture

• Transformer-based decoder architecture.

• Released with open weights for research and commercial use under specific licenses.

Key Innovations

• Leaner approach to large language modeling with strong performance.

• Encourages community-driven experimentation and fine-tuning.

Use Cases

• NLP tasks in academia and industry (chatbots, summarization, etc.).

• Fine-tuning to specialized domains or smaller footprints.
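
Because the weights are openly distributed (after accepting Meta's license on Hugging Face), LLaMA 2 loads with the standard transformers causal-LM API. The sketch below assumes access to the gated meta-llama/Llama-2-7b-chat-hf repository, a GPU, and the accelerate package for device_map.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/Llama-2-7b-chat-hf"  # gated repo: requires accepting Meta's license
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )

    prompt = "Explain the difference between supervised and unsupervised learning in two sentences."
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=120)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))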

References

• Meta AI Official LLaMA 2 Announcement: https://ai.meta.com/llama/


2.7 PaLM 2 (Google)

Release & Developer

• Release: 2023

• Developer: Google

Parameter Count

Exact size undisclosed, but indicated to be smaller and more efficient than the original PaLM
(540B).

Architecture

• Next-generation large language model from Google’s Pathways approach.

• Improved multilingual and multimodal capabilities.

Key Innovations

• Enhanced code understanding and translation tasks.

• Stronger reasoning performance, better data efficiency.

Use Cases

• Complex reasoning, code generation, language translation.

• Multilingual applications and advanced Q&A.

References

• Google I/O 2023 Keynote: https://blog.google/technology/ai/


2.8 RedPajama (Together)

Release & Developer

• Release: 2023

• Developer: Together, various open-source contributors

Parameter Count

The RedPajama dataset reproduces LLaMA's training corpus (roughly 1.2T tokens); the accompanying RedPajama-INCITE models were released at 3B and 7B parameters, with open data and training code.

Architecture

• Transformer-based, aims to replicate LLaMA’s design as open-source.

• Emphasizes transparent data pipelines.

Key Innovations

• Open replication of LLaMA's training data and recipe, with publicly available datasets and code.

• Facilitates community-driven fine-tuning and model improvement.

Use Cases

• Research on scaling laws, model interpretability, and open benchmarking.

• Fine-tuning for domain-specific tasks.

References

• RedPajama GitHub: https://github.com/togethercomputer/RedPajama-Data


2.9 Falcon (TII)

Release & Developer

• Release: 2023

• Developer: Technology Innovation Institute (TII)

Parameter Count

Available in 7B and 40B variants; a 180B model followed later in 2023.

Architecture

• Transformer-based, similar to GPT-style decoder architectures.

• Apache 2.0 License, encouraging open use.

Key Innovations

• High-performance on benchmarks relative to parameter size.

• Focus on efficiency and strong baseline performance.

Use Cases

• Chatbots, language understanding, text generation for research and commercial apps.

• Fine-tuning with relatively smaller resource footprints.
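
Falcon's instruction-tuned checkpoints are hosted openly on Hugging Face, so a text-generation pipeline is enough for a quick test. The sketch below assumes the tiiuae/falcon-7b-instruct checkpoint, a GPU, and the accelerate package.

    import torch
    from transformers import pipeline

    generator = pipeline(
        "text-generation",
        model="tiiuae/falcon-7b-instruct",
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )
    result = generator("Write a haiku about open-source AI.", max_new_tokens=40)
    print(result[0]["generated_text"])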

References

• Falcon LLM (TII): https://falconllm.tii.ae/ (model cards at https://huggingface.co/tiiuae)


2.10 BioGPT (Microsoft)

Release & Developer

• Release: 2022

• Developer: Microsoft Research

Parameter Count

Varies; base models typically in the range of hundreds of millions of parameters, specialized for
biomedical text.

Architecture

• Transformer-based, fine-tuned on biomedical corpora.

• Focuses on specialized domain vocabulary, terminologies, and references.

Key Innovations

• High accuracy on biomedical NLP tasks like entity recognition, relation extraction.

• Enhanced understanding of domain-specific language and abbreviations.

Use Cases

• Scientific literature analysis, biomedical knowledge discovery.

• Drug discovery, clinical text interpretation, and medical Q&A.
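
BioGPT checkpoints are available on the Hugging Face Hub under microsoft/biogpt, so biomedical text can be generated with a standard pipeline. This is a minimal sketch rather than a clinically validated workflow; the sacremoses package is needed by BioGPT's tokenizer.

    from transformers import pipeline

    # requires the sacremoses package for BioGPT's tokenizer
    generator = pipeline("text-generation", model="microsoft/biogpt")
    result = generator("COVID-19 is", max_new_tokens=30, num_return_sequences=1)
    print(result[0]["generated_text"])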

References

• Microsoft BioGPT Paper: https://arxiv.org/abs/2210.10341


Chapter 3

Vision Models

Overview

Vision Models are primarily focused on understanding and generating images (and sometimes
video). They handle tasks such as classification, segmentation, object detection, and image
synthesis. Below are ten influential vision models.

3.1 Vision Transformer (ViT, Google)

Release & Developer

• Release: 2020

• Developer: Google

Parameter Count

ViT-Base: 86M, ViT-Large: 307M; larger variants (e.g., ViT-Huge at 632M) go beyond that.

Architecture

• Uses Transformer encoders on image patches (16x16 or 32x32).

• Positional embeddings handle patch order.

Key Innovations

• Demonstrated that Transformers can outperform CNNs on large-scale image datasets.

• Simplified architecture by reusing standard NLP Transformer blocks.

Use Cases

• Image classification, object detection (with minor adaptations).

• Transfer learning for specialized tasks (medical imaging, etc.).
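
Pretrained ViT classifiers are a one-liner with the transformers image-classification pipeline. The sketch assumes the google/vit-base-patch16-224 checkpoint and a local image file named cat.jpg (a placeholder).

    from transformers import pipeline

    classifier = pipeline("image-classification", model="google/vit-base-patch16-224")
    for prediction in classifier("cat.jpg")[:3]:        # top-3 ImageNet labels
        print(prediction["label"], round(prediction["score"], 3))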

References

• Dosovitskiy, A., et al. “An Image is Worth 16x16 Words: Transformers for Image Recognition
at Scale.” (2020)
3.2 CLIP (OpenAI)

Release & Developer

• Release: 2021

• Developer: OpenAI

Parameter Count

Varies with the image backbone (ResNet or ViT); total model sizes range from roughly 100M (RN50, ViT-B/32) to about 430M (ViT-L/14) parameters.

Architecture

• Two-tower model: one for images, one for text, joined by a shared embedding space.

• Trained with contrastive learning over (image, text) pairs.

Key Innovations

• Enables zero-shot classification of images based on textual prompts.

• Learns general-purpose representations applicable to diverse downstream tasks.

Use Cases

• Image search, labeling, content moderation.

• Zero-shot classification without explicit labeled training data.
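
Zero-shot classification follows directly from the shared embedding space: encode an image and a set of candidate captions, then softmax the image-text similarities. A minimal sketch with the openai/clip-vit-base-patch32 checkpoint and a placeholder photo.jpg:

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
    inputs = processor(text=labels, images=Image.open("photo.jpg"),
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits_per_image = model(**inputs).logits_per_image   # image-text similarity scores
    probs = logits_per_image.softmax(dim=1)[0]
    print(dict(zip(labels, probs.tolist())))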

References

• Radford, A., et al. “Learning Transferable Visual Models From Natural Language Supervision
(CLIP).” (2021)
3.3 SAM (Meta)

Release & Developer

• Release: 2023

• Developer: Meta (Facebook AI)

Parameter Count

Publicly released in three sizes keyed to the image encoder: ViT-B (~91M), ViT-L (~308M), and ViT-H (~636M) parameters, plus a lightweight prompt encoder and mask decoder.

Architecture

• Promptable model with an image encoder and flexible mask decoder.

• Designed to segment “anything” within an image upon prompt (points, boxes, text).

Key Innovations

• General-purpose segmentation that does not require specific training on particular classes.

• Powerful zero-shot or one-shot segmentation performance across varied domains.

Use Cases

• Quick image segmentation for design, medical imaging, autonomous vehicles.

• Interactive image editing (selecting regions to modify).
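
A minimal promptable-segmentation sketch using the transformers port of SAM (facebook/sam-vit-base). The click coordinate and image path are placeholders, and the raw predicted masks are printed without post-processing.

    import torch
    from PIL import Image
    from transformers import SamModel, SamProcessor

    model = SamModel.from_pretrained("facebook/sam-vit-base")
    processor = SamProcessor.from_pretrained("facebook/sam-vit-base")

    image = Image.open("photo.jpg").convert("RGB")
    input_points = [[[450, 600]]]                 # one (x, y) click prompt on the object of interest
    inputs = processor(image, input_points=input_points, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    print(outputs.pred_masks.shape, outputs.iou_scores)   # candidate masks and their predicted quality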

References

• Kirillov, A., et al. “Segment Anything.” (2023), https://segment-anything.com/


3.4 DALL·E 3 (OpenAI)

Release & Developer

• Release: 2023

• Developer: OpenAI

Parameter Count

Not explicitly stated; building on DALL·E 2’s large model architecture.

Architecture

• Diffusion or autoregressive approach (exact architectural details not fully disclosed).

• Integrates CLIP-like text encoders to parse complex prompts.

Key Innovations

• More coherent text-to-image generation with improved text rendering in images.

• Greater control over style and scene composition.

Use Cases

• Rapid prototyping for design, concept art, creative illustrations.

• Marketing materials, user-generated content for visual storytelling.
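
Generation is exposed through OpenAI's Images API. The sketch below uses the official Python SDK with an illustrative prompt and assumes an OPENAI_API_KEY in the environment.

    from openai import OpenAI

    client = OpenAI()
    result = client.images.generate(
        model="dall-e-3",
        prompt="A watercolor illustration of a lighthouse at dawn with the word 'ATLAS' on a wooden sign",
        size="1024x1024",
        n=1,
    )
    print(result.data[0].url)   # temporary URL of the generated image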

References

• OpenAI DALL·E 3 Announcement: https://openai.com/product/dall-e-3


3.5 DeepLabv3+ (Google)

Release & Developer

• Release: 2018

• Developer: Google

Parameter Count

Typically in the range of 40M–60M, depending on backbone (e.g., ResNet, Xception).

Architecture

• Uses atrous convolution for multi-scale context.

• Encoder-decoder structure for semantic segmentation.

Key Innovations

• Atrous Spatial Pyramid Pooling (ASPP) captures rich contextual information.

• Effective for pixel-accurate segmentation tasks.

Use Cases

• Autonomous driving scene analysis (lane detection, obstacle segmentation).

• Medical imaging segmentation.

References

• Chen, L.-C., et al. “Encoder-Decoder with Atrous Separable Convolution for Semantic Image
Segmentation.” (2018)
3.6 YOLOv7 (Academia Sinica)

Release & Developer

• Release: 2022

• Developer: Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao (Institute of Information Science, Academia Sinica)

Parameter Count

Ranges from about 6M (YOLOv7-tiny) to roughly 150M (YOLOv7-E6E) across the released variants.

Architecture

• One-stage detector with real-time inference speed.

• Builds on YOLO’s design with CSP (Cross Stage Partial) improvements.

Key Innovations

• SOTA object detection with a balance of speed and accuracy.

• Modular design for easy scaling and fine-tuning.

Use Cases

• Real-time object detection in robotics or drone applications.

• Surveillance, traffic monitoring, and video analytics.

References

• Wang, C.-Y., et al. “YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors.” (2022); official code: https://github.com/WongKinYiu/yolov7


3.7 EfficientNet (Google)

Release & Developer

• Release: 2019

• Developer: Google

Parameter Count

Ranges from 5M to 66M across EfficientNet-B0 through B7.

Architecture

• Compound scaling of depth, width, and resolution in a principled manner.

• Mobile inverted bottleneck MBConv with squeeze-excitation.

Key Innovations

• SOTA accuracy on ImageNet with relatively fewer parameters.

• Simple scaling rules reduce trial-and-error for architectural design.

Use Cases

• Mobile and embedded vision applications.

• Large-scale classification with limited compute resources.
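
Pretrained EfficientNet weights ship with torchvision, including the matching preprocessing transforms. The sketch below classifies a placeholder cat.jpg with EfficientNet-B0 (torchvision >= 0.13 assumed for the weights API).

    import torch
    from PIL import Image
    from torchvision.models import efficientnet_b0, EfficientNet_B0_Weights

    weights = EfficientNet_B0_Weights.DEFAULT
    model = efficientnet_b0(weights=weights).eval()
    preprocess = weights.transforms()              # resizing, cropping, and normalization

    batch = preprocess(Image.open("cat.jpg")).unsqueeze(0)
    with torch.no_grad():
        probs = model(batch).softmax(dim=1)
    top = probs.topk(3)
    for score, idx in zip(top.values[0], top.indices[0]):
        print(weights.meta["categories"][int(idx)], round(score.item(), 3))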

References

• Tan, M., and Le, Q. “EfficientNet: Rethinking Model Scaling for Convolutional Neural
Networks.” (2019)
3.8 StyleGAN3 (NVIDIA)

Release & Developer

• Release: 2021

• Developer: NVIDIA

Parameter Count

Typically in the tens of millions, depending on configuration for image resolution.

Architecture

• Generative Adversarial Network (GAN) with “style” layers controlling generation at multiple
scales.

• Removes aliasing artifacts from previous iterations (StyleGAN2).

Key Innovations

• Improved temporal stability for video and motion consistency.

• Enhanced disentanglement of latent representation for image editing.

Use Cases

• High-fidelity face generation, art creation, data augmentation.

• Research in GAN interpretability and generative design.

References

• Karras, T., et al. “Alias-Free Generative Adversarial Networks (StyleGAN3).” (2021)


3.9 Flamingo (DeepMind)

Release & Developer

• Release: 2022

• Developer: DeepMind

Parameter Count

Built on top of a frozen Chinchilla language model; the largest Flamingo variant has roughly 80B parameters.

Architecture

• Combines a frozen LLM (e.g., Chinchilla) with adapter layers for visual input.

• Processes sequences of images and text tokens for cross-modal reasoning.

Key Innovations

• Few-shot adaptation: strong performance on varied vision+language tasks with limited extra
training.

• Memory-efficient approach by keeping the LLM “frozen.”

Use Cases

• Image captioning, visual question answering.

• Embodied AI for robotics with vision-language interactions.

References

• Alayrac, J-B., et al. “Flamingo: a Visual Language Model for Few-Shot Learning.” (2022)
3.10 Imagine (Meta)

Release & Developer

• Release: 2023

• Developer: Meta (Facebook AI)

Parameter Count

Details not publicly disclosed; presumably large-scale generative model.

Architecture

• Likely using diffusion or GAN-based approach for image creation.

• Accepts textual descriptions to generate images.

Key Innovations

• Focus on generating high-quality, diverse images from textual prompts.

• Potential synergy with LLaMA or other large language models for improved prompt under-
standing.

Use Cases

• Creative art, concept design, marketing material generation.

• Rapid prototyping of visual ideas in AR/VR contexts.

References

• Meta AI announcements and demos (internal developer notes; public info limited).
Chapter 4

Text-to-Speech (TTS)

Overview

Text-to-Speech systems convert textual input into synthetic speech. Modern TTS leverages
neural networks that can produce surprisingly human-like prosody and clarity. Here are ten
TTS systems making significant impact.

4.1 WaveNet (DeepMind)

Release & Developer

• Release: 2016

• Developer: DeepMind (Google)

Parameter Count

Initial WaveNet had about 3–5 million parameters, though later variants vary in size.

Architecture

• Autoregressive convolutional model that generates raw waveforms sample by sample.

• Dilated convolutions to capture long-range dependencies in audio.

Key Innovations

• Dramatically improved naturalness of synthetic speech compared to older vocoder methods.

• Also adapted for music generation and conditional audio tasks.

Use Cases

• Google Assistant TTS, voice assistants, audiobooks, any domain needing high-fidelity speech.
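
The core idea (causal dilated convolutions whose receptive field doubles at each layer, combined with gated activations and residual connections) can be sketched in a few lines of PyTorch. This is a conceptual toy, not DeepMind's implementation; it omits skip connections, conditioning, and the mu-law output distribution.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CausalDilatedBlock(nn.Module):
        """One WaveNet-style block: causal dilated conv, gated activation, residual connection."""

        def __init__(self, channels: int, dilation: int):
            super().__init__()
            self.left_pad = dilation          # (kernel_size - 1) * dilation with kernel_size = 2
            self.filter = nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)
            self.gate = nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)
            self.residual = nn.Conv1d(channels, channels, kernel_size=1)

        def forward(self, x):
            h = F.pad(x, (self.left_pad, 0))  # pad on the left so no output depends on future samples
            out = torch.tanh(self.filter(h)) * torch.sigmoid(self.gate(h))
            return x + self.residual(out)

    # Dilations 1, 2, 4, ... double the receptive field with each extra layer.
    net = nn.Sequential(*[CausalDilatedBlock(16, 2 ** i) for i in range(6)])
    x = torch.randn(1, 16, 16000)             # (batch, channels, samples): 1 s of 16 kHz features
    print(net(x).shape)                       # same length as the input: torch.Size([1, 16, 16000])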

References

• van den Oord, A., et al. “WaveNet: A Generative Model for Raw Audio.” (2016)
4.2 Tacotron 2 (Google)

Release & Developer

• Release: 2017 (initial Tacotron); 2018 (Tacotron 2)

• Developer: Google

Parameter Count

Tacotron 2 has roughly 28 million parameters in the seq2seq network, plus the WaveNet (or similar) vocoder.

Architecture

• Sequence-to-sequence model that predicts mel-spectrograms from text.

• Vocoder (like WaveNet) synthesizes final waveform from spectrogram.

Key Innovations

• End-to-end TTS pipeline with state-of-the-art naturalness.

• Replaced older “Griffin-Lim” vocoding with neural vocoding for better quality.

Use Cases

• Google Cloud TTS services, accessibility tools, voice apps.

References

• Shen, J., et al. “Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram
Predictions.” (2018)
4.3 FastSpeech 2 (Microsoft)

Release & Developer

• Release: 2019 (FastSpeech); 2020 (FastSpeech 2, published at ICLR 2021)

• Developer: Microsoft

Parameter Count

Model sizes vary; generally on the order of tens of millions of parameters.

Architecture

• Non-autoregressive TTS that predicts durations for each phoneme/token.

• Uses alignment and variance adapters for pitch, energy, and duration.

Key Innovations

• Faster inference than autoregressive models.

• High-quality output comparable to Tacotron-class systems.

Use Cases

• Real-time TTS applications with low-latency requirements.

• Systems that need controllable prosody (pitch, duration).

References

• Ren, Y., et al. “FastSpeech 2: Fast and High-Quality End-to-End Text to Speech.” (2021)
4.4 VITS (Kakao Enterprise)

Release & Developer

• Release: 2021

• Developer: Kakao Enterprise (Kim, Kong, and Son)

Parameter Count

Typically tens of millions, though may vary in forks.

Architecture

• End-to-end TTS with normalizing flows and adversarial training.

• Learns to generate waveforms directly, bypassing a separate vocoder.

Key Innovations

• One-stage generation: text to raw audio in a single model.

• High fidelity and natural prosody, with adjustable speed and style.

Use Cases

• Voice user interfaces, synthetic podcasts, real-time TTS in gaming.
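
Beyond the authors' research repository, the VITS architecture is also available through Hugging Face transformers via Meta's MMS-TTS checkpoints. The sketch below uses facebook/mms-tts-eng (an MMS checkpoint built on the VITS architecture) and scipy to write the waveform; treat it as an illustration of the one-stage text-to-waveform design rather than the original VITS release.

    import torch
    import scipy.io.wavfile
    from transformers import VitsModel, AutoTokenizer

    model = VitsModel.from_pretrained("facebook/mms-tts-eng")   # MMS English voice, VITS architecture
    tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-eng")

    inputs = tokenizer("The atlas covers fifty common AI models.", return_tensors="pt")
    with torch.no_grad():
        waveform = model(**inputs).waveform                     # (batch, samples)

    scipy.io.wavfile.write("vits_demo.wav",
                           rate=model.config.sampling_rate,
                           data=waveform[0].numpy())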

References

• Kim, J., et al. “Conditional Variational Autoencoder with Adversarial Learning for End-to-End
Text-to-Speech.” (2021)
4.5 Glow-TTS

Release & Developer

• Release: 2020

• Developer: Kakao Enterprise and Seoul National University researchers

Parameter Count

Tens of millions in the base version.

Architecture

• Flow-based generative model for mel-spectrograms.

• Non-autoregressive approach for faster synthesis.

Key Innovations

• Uses invertible flow steps for better likelihood-based training.

• Competitive quality and speed with fewer model complexities.

Use Cases

• High-speed TTS with decent quality for multi-lingual systems.

• Flexible style transfer or voice cloning in research contexts.

References

• Kim, J., et al. “Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment
Search.” (2020)
4.6 DeepVoice 3 (Baidu)

Release & Developer

• Release: 2017

• Developer: Baidu Research

Parameter Count

Around 20–30 million, depending on model configuration.

Architecture

• Fully convolutional seq2seq architecture for TTS.

• Supports multi-speaker and multi-lingual data.

Key Innovations

• GPU-friendly convolutional layers.

• Faster training than RNN-based seq2seq TTS.

Use Cases

• Multi-language TTS platforms, voice-based AI assistants, large-scale production TTS.

References

• Ping, W., et al. “Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning.”
(2017)
4.7 Speechify

Release & Developer

• Release: 2017 (initial) - Present

• Developer: Speechify Inc.

Parameter Count

Proprietary. Uses multiple TTS engines under the hood.

Architecture

• Cloud-based TTS platform with user-friendly API.

• Focus on multi-voice and accent variety.

Key Innovations

• Aimed at accessibility and reading assistance (especially for dyslexia).

• Device-agnostic, supporting apps, browser extensions, and mobile usage.

Use Cases

• Education technology, reading-assist apps, corporate eLearning.

References

• Official Site: https://speechify.com/


4.8 Google TTS API

Release & Developer

• Release: Ongoing expansions since 2018

• Developer: Google Cloud

Parameter Count

Multiple voices with undisclosed parameters; many are built off Tacotron and WaveNet deriva-
tives.

Architecture

• Cloud-based TTS service, harnessing neural net backends like WaveNet.

• Variety of voices, languages, and dialects.

Key Innovations

• Scalable, low-latency TTS at global scale.

• Large library of predefined voices, SSML customization.

Use Cases

• Integrations into contact centers, mobile apps, accessibility tools.
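
A minimal synthesis request with the google-cloud-texttospeech client library. It assumes Google Cloud credentials are configured in the environment and that the chosen WaveNet voice name is still available.

    from google.cloud import texttospeech

    client = texttospeech.TextToSpeechClient()
    synthesis_input = texttospeech.SynthesisInput(text="Hello from the Atlas of AI models.")
    voice = texttospeech.VoiceSelectionParams(
        language_code="en-US", name="en-US-Wavenet-D"
    )
    audio_config = texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3
    )

    response = client.synthesize_speech(
        input=synthesis_input, voice=voice, audio_config=audio_config
    )
    with open("output.mp3", "wb") as f:
        f.write(response.audio_content)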

References

• Google Cloud TTS Documentation: https://cloud.google.com/text-to-speech


4.9 Sonantic

Release & Developer

• Release: 2019

• Developer: Sonantic (acquired by Spotify)

Parameter Count

Proprietary. Focus on expressive “cinematic” TTS.

Architecture

• Likely seq2seq with a neural vocoder, specialized in emotional prosody.

• Web-based studio for adjusting pitch, tone, emphasis.

Key Innovations

• Highly expressive voices for gaming, entertainment, dubbing.

• Fine control of emotions (happy, sad, serious, etc.).

Use Cases

• Voice acting for video games, film post-production, personalized narrations.

References

• Sonantic Official Site: https://sonantic.io/


4.10 Respeecher

Release & Developer

• Release: 2020

• Developer: Respeecher

Parameter Count

Proprietary, but typically includes high-end deep networks for voice cloning.

Architecture

• Source-to-target voice conversion using deep neural nets.

• Maintains emotional resonance and intonation of original speech.

Key Innovations

• “Voice cloning” that closely mimics timbre, accent, and style.

• Used in film production for recreating voices of actors.

Use Cases

• Filmmaking, gaming, personalized assistants, archiving historical voices.

References

• Respeecher Official Site: https://www.respeecher.com/


Chapter 5

Speech-to-Text (STT)

Overview

Speech-to-Text (STT) systems transform spoken language into written text, powering voice
assistants, dictation software, and transcription services. Here are ten notable STT solutions.

5.1 DeepSpeech (Mozilla)

Release & Developer

• Release: 2017

• Developer: Mozilla (inspired by Baidu’s Deep Speech)

Parameter Count

Range of 50M–100M in typical configurations.

Architecture

• End-to-end RNN-based (later versions used CNN layers).

• Uses CTC (Connectionist Temporal Classification) for alignment.

Key Innovations

• Fully open-source, easy to fine-tune for various languages.

• Good baseline for academic and hobby projects.

Use Cases

• Voice-enabled applications, real-time transcriptions.

• Low-resource or specialized domain STT by custom training.

References

• Mozilla DeepSpeech GitHub: https://github.com/mozilla/DeepSpeech


5.2 wav2vec 2.0 (Meta)

Release & Developer

• Release: 2020

• Developer: Meta (Facebook AI)

Parameter Count

About 95M for Base; larger variants for multilingual tasks can go beyond.

Architecture

• Self-supervised model on raw audio waveforms, learns robust speech representations.

• Fine-tuned with CTC or seq2seq for final transcription.

Key Innovations

• Significant WER reduction with much less labeled data than prior methods.

• Works well in noisy conditions, beneficial for low-resource languages.

Use Cases

• Real-time STT for social media platforms, voice-driven apps, research on low-resource lan-
guages.
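
After CTC fine-tuning, transcription is a straightforward forward pass plus greedy decoding. The sketch below uses the facebook/wav2vec2-base-960h checkpoint and assumes speech.wav is 16 kHz mono audio.

    import torch
    import torchaudio
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
    model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

    waveform, sample_rate = torchaudio.load("speech.wav")   # expected: 16 kHz, mono
    inputs = processor(waveform.squeeze().numpy(), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    predicted_ids = torch.argmax(logits, dim=-1)             # greedy CTC decoding
    print(processor.batch_decode(predicted_ids)[0])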

References

• Baevski, A., et al. “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech
Representations.” (2020)
5.3 Whisper (OpenAI)

Release & Developer

• Release: 2022

• Developer: OpenAI

Parameter Count

Models range from 39M (Tiny) up to 1.5B (Large).

Architecture

• Encoder-decoder Transformer trained on 680k hours of multilingual data.

• Handles multiple languages and robust to background noise.

Key Innovations

• Excellent zero-shot performance across multiple languages.

• End-to-end approach that handles speech recognition and language ID.

Use Cases

• Live transcription, multi-lingual speech recognition, podcast or meeting transcription.
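
The open-source openai-whisper package wraps model download, audio loading (via ffmpeg), and decoding in two calls; the audio filename below is a placeholder.

    import whisper

    model = whisper.load_model("base")            # tiny / base / small / medium / large
    result = model.transcribe("meeting.mp3")      # language is detected automatically
    print(result["language"], result["text"])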

References

• Radford, A., et al. “Robust Speech Recognition via Large-Scale Weak Supervision (Whisper).”
(2022)
5.4 Conformer (Google)

Release & Developer

• Release: 2020

• Developer: Google

Parameter Count

Typically tens of millions, but can scale up.

Architecture

• Combines convolution and Transformer blocks, preserving local and global context.

• Often used with CTC or RNN-T for transcription output.

Key Innovations

• Achieves high accuracy with a small footprint.

• Lower computational cost compared to pure Transformer stacks.

Use Cases

• On-device speech recognition for mobile or embedded devices.

• Large-scale STT for call centers, voice assistants.

References

• Gulati, A., et al. “Conformer: Convolution-augmented Transformer for Speech Recognition.” (2020)
5.5 Jasper (NVIDIA)

Release & Developer

• Release: 2019

• Developer: NVIDIA

Parameter Count

Ranges from 10M to 300M+ for deeper variants.

Architecture

• Deep 1D convolutional layers with skip connections.

• Uses CTC for alignment and decoding.

Key Innovations

• Modular block design for easy scaling.

• Highly parallelizable on GPUs, leading to faster training.

Use Cases

• High-throughput transcription tasks.

• Real-time speech recognition systems.

References

• Li, J., et al. “Jasper: An End-to-End Convolutional Neural Acoustic Model.” (2019)
5.6 Kaldi (Open-Source)

Release & Developer

• Release: 2011, ongoing updates

• Developer: Community-driven, started by Daniel Povey

Architecture

• Toolkit offering DNN/HMM or end-to-end approaches.

• Very modular, highly customizable pipelines for acoustic modeling.

Key Innovations

• Gold standard in academic speech recognition research for many years.

• Strong support for multi-language, lattice generation, and advanced acoustic modeling.

Use Cases

• Research labs, open-source STT solutions, building custom ASR systems.

References

• Kaldi: https://kaldi-asr.org/
5.7 AssemblyAI

Release & Developer

• Release: 2017

• Developer: AssemblyAI Inc.

Parameter Count

Proprietary. Model details undisclosed.

Architecture

• Cloud-based API, often with deep Transformer or RNN-based core.

• Real-time streaming endpoints and post-processing features (topic detection, sentiment).

Key Innovations

• High accuracy and easy integration into enterprise pipelines.

• Additional AI tasks like speaker diarization, content filtering.

Use Cases

• Media transcription, call center analytics, meeting transcription.

References

• Official Site: https://www.assemblyai.com/


5.8 Speechmatics

Release & Developer

• Release: Ongoing; the company established in 2006

• Developer: Speechmatics Ltd.

Architecture

• Proprietary neural-based acoustic modeling.

• Strong multi-language support with cloud or on-prem solutions.

Key Innovations

• High-accuracy transcriptions in many global languages.

• Customizable language models for domain-specific jargon.

Use Cases

• Enterprise-level transcription for broadcasting, call centers, legal domains.

References

• Speechmatics Official Site: https://www.speechmatics.com/


5.9 Rev AI

Release & Developer

• Release: 2019 for API

• Developer: Rev.com

Architecture

• Cloud-based STT with deep learning backends.

• Additional text cleanup and speaker identification.

Key Innovations

• Seamless integration with Rev’s human transcription platform for hybrid workflows.

• High accuracy in English with domain customization.

Use Cases

• Podcast or video transcription, enterprise software integrations, real-time call analysis.

References

• Rev AI: https://www.rev.ai/


5.10 AWS Transcribe

Release & Developer

• Release: 2017 (preview), ongoing updates

• Developer: Amazon Web Services

Architecture

• Proprietary deep learning-based STT service, integrated into AWS ecosystem.

• Offers custom vocabularies and language models.

Key Innovations

• Domain adaptation for call centers, specialized jargon.

• Streaming or batch transcription, integrated with Amazon S3 and other AWS services.

Use Cases

• Automated call center analytics, meeting transcription, video subtitles.
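
A batch job is started against an audio file already in S3 using boto3; the bucket name, job name, and region below are placeholders, and the job completes asynchronously.

    import boto3

    transcribe = boto3.client("transcribe", region_name="us-east-1")
    transcribe.start_transcription_job(
        TranscriptionJobName="atlas-demo-job",
        Media={"MediaFileUri": "s3://my-bucket/meeting.mp3"},   # placeholder S3 object
        MediaFormat="mp3",
        LanguageCode="en-US",
    )

    job = transcribe.get_transcription_job(TranscriptionJobName="atlas-demo-job")
    print(job["TranscriptionJob"]["TranscriptionJobStatus"])    # IN_PROGRESS until the job finishes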

References

• AWS Transcribe Documentation: https://aws.amazon.com/transcribe/


Chapter 6

Multimodal Models

Overview

Multimodal models handle more than one data type (e.g., text+image, text+audio). They can
integrate linguistic, visual, or auditory cues, enabling tasks like visual Q&A, image captioning,
or text-driven image generation. Here are ten pivotal models.

6.1 GPT-4 (OpenAI)

Multimodal Context

Although GPT-4 is primarily text-based, it has a multimodal variant capable of interpreting images.

Key Innovations

• Accepts image inputs (through specific partner integrations).

• Can perform visual reasoning (describe images, interpret humor in memes).

Use Cases

• Visual question answering, combined text+image search queries.

• Document parsing (images of documents).
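
Image inputs are passed as part of the chat message content. The sketch below uses OpenAI's Python SDK with an assumed vision-capable model name and a placeholder image URL, so check the current model list before running it.

    from openai import OpenAI

    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed vision-capable model id; verify against OpenAI's current list
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what is shown in this chart."},
                {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
            ],
        }],
    )
    print(response.choices[0].message.content)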

References

• OpenAI GPT-4: https://openai.com/product/gpt-4


6.2 CLIP (OpenAI)

Multimodal Context

Originally introduced for vision-text alignment; used in many generative and retrieval pipelines.

Key Innovations

• Learns a shared embedding space for text and images.

• Powers models like DALL·E for prompt-image coherence.

Use Cases

• Zero-shot image classification, cross-modal retrieval, content moderation.

• Foundation for controlling image generation with textual prompts.

References

• Radford, A., et al. “CLIP.” (2021)


6.3 Flamingo (DeepMind)

Multimodal Context

Flamingo extends large language models to handle sequences of images + text.

Key Innovations

• Few-shot learning: can adapt to new tasks with minimal examples.

• Freezes the LLM and learns cross-attention adapter layers for images.

Use Cases

• Visual Q&A, image description generation, story generation using images as prompts.

References

• Alayrac, J-B., et al. “Flamingo: a Visual Language Model for Few-Shot Learning.” (2022)
6.4 DALL·E 3 (OpenAI)

Multimodal Context

Accepts text prompts and outputs images; pairs vision with linguistic understanding.

Key Innovations

• Improved text rendering in images (e.g., signage, typed words).

• Better alignment between textual prompt and generated imagery.

Use Cases

• Concept art, storyboarding, and product design ideation.

• Generating custom visuals for marketing or social media content.

References

• OpenAI DALL·E 3: https://openai.com/product/dall-e-3


6.5 PaLM-E (Google)

Multimodal Context

• Extension of PaLM for embodied tasks (robotics, sensor data).

Key Innovations

• Integrates text, vision, and sensor input for advanced robotic control.

• Allows instruction-following in real-world tasks: “Pick up that object from the table.”

Use Cases

• Robotics, real-world assistant tasks, autonomous navigation.

References

• PaLM-E Paper/Google AI Blog: https://ai.googleblog.com/search?q=palm-e


6.6 LLaVA (UW-Madison & Microsoft)

Multimodal Context

Vision-augmented model built on Meta's LLaMA/Vicuna (developed by researchers at UW-Madison and Microsoft) for visual Q&A and image understanding.

Key Innovations

• Fine-tunes LLaMA with a visual encoder bridging images to text tokens.

• Maintains strong language capabilities with added visual insight.

Use Cases

• Automated tutoring with diagrams, visual-based Q&A (e.g., “What is in this picture?”).

References

• LLaVA GitHub: https://github.com/haotian-liu/LLaVA


6.7 ImageBind (Meta)

Multimodal Context

• Binds images, text, audio, 3D, and more into a shared embedding space.

Key Innovations

• Joint representation for multiple sensory modalities (speech, text, images, depth, etc.).

• Potential for retrieval tasks across multiple data types (e.g., find matching audio for an image).

Use Cases

• Multimodal search, creative design (matching audio tracks to visual scenes), AR/VR input.

References

• Girdhar, R., et al. “ImageBind: One Embedding Space To Bind Them All.” (2023)
6.8 BLIP-2

Multimodal Context

• Vision-language model for tasks like image captioning and VQA (Visual Question Answering).

Key Innovations

• Integrates a frozen image encoder with a frozen language model, using a lightweight Q-Former to bridge them.

• Supports instruction tuning for more general visual QA tasks.

Use Cases

• Interactive chat about images, dynamic scene descriptions, image-based search.
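
A minimal visual question answering sketch with the Salesforce/blip2-opt-2.7b checkpoint from transformers; it assumes a GPU with enough memory, the accelerate package, and a placeholder photo.jpg.

    import torch
    from PIL import Image
    from transformers import Blip2ForConditionalGeneration, Blip2Processor

    processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
    model = Blip2ForConditionalGeneration.from_pretrained(
        "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16, device_map="auto"
    )

    image = Image.open("photo.jpg").convert("RGB")
    prompt = "Question: what is shown in the picture? Answer:"
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
    output_ids = model.generate(**inputs, max_new_tokens=30)
    print(processor.decode(output_ids[0], skip_special_tokens=True))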

References

• Li, J., et al. “BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders
and Large Language Models.” (2023)
6.9 Perceiver IO (DeepMind)

Multimodal Context

• Generalized Transformer architecture that accepts multiple input modalities.

Key Innovations

• Can handle 1D, 2D, or 3D inputs (audio, images, point clouds).

• Latent bottleneck approach scales well with large input sizes.

Use Cases

• Video classification, audio-visual tasks, and general neural data fusion.

References

• Jaegle, A., et al. “Perceiver IO: A General Architecture for Structured Inputs & Outputs.”
(2021)
6.10 Muse (Google)

Multimodal Context

• Text-conditioned image generation using a masked generative Transformer (a token-based, non-diffusion approach).

Key Innovations

• Predicts masked image tokens conditioned on embeddings from a frozen large language model, enabling faster sampling than typical diffusion models.

• Potentially better text alignment and fine-grained detail than some prior models.

Use Cases

• Artwork generation, concept design, or integrated marketing campaigns.

References

• Chang, H., et al. “Muse: Text-to-Image Generation via Masked Generative Transformers.” (2023)
Chapter 7

Conclusion

This Atlas provided a comprehensive tour of 50 notable AI models, sorted by their primary
modality:

1. Language Models (LLMs) – from GPT-4 to domain-specific BioGPT

2. Vision Models – covering classification, segmentation, and generation

3. Text-to-Speech (TTS) – from the groundbreaking WaveNet to specialized voice cloning solutions

4. Speech-to-Text (STT) – spanning open-source (DeepSpeech, Kaldi) to enterprise APIs (AWS Transcribe)

5. Multimodal Models – bridging text, images, and even audio for complex tasks

As AI continues to evolve, we expect further convergence of modalities, leading to more generalist “foundation models.” Open-source communities and research labs will continue
releasing new variations, expanding upon these baseline technologies. The references linked
throughout are excellent entry points for deeper dives into specific models, training approaches,
and potential applications.

References and Further Reading

Below is a consolidated reference list for many of the papers and resources mentioned:

• BERT: Devlin, J., et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language
Understanding.” (2018)

• T5: Raffel, C., et al. “Exploring the Limits of Transfer Learning with a Unified Text-to-Text
Transformer.” (2019)

• WaveNet: van den Oord, A., et al. “WaveNet: A Generative Model for Raw Audio.” (2016)

• Tacotron 2: Shen, J., et al. “Natural TTS Synthesis by Conditioning WaveNet on Mel Spectro-
gram Predictions.” (2018)

• ViT: Dosovitskiy, A., et al. “An Image is Worth 16x16 Words: Transformers for Image
Recognition at Scale.” (2020)

• CLIP: Radford, A., et al. “Learning Transferable Visual Models From Natural Language
Supervision.” (2021)

• wav2vec 2.0: Baevski, A., et al. “wav2vec 2.0: A Framework for Self-Supervised Learning of
Speech Representations.” (2020)

• Flamingo: Alayrac, J-B., et al. “Flamingo: a Visual Language Model for Few-Shot Learning.”
(2022)

• StyleGAN3: Karras, T., et al. “Alias-Free Generative Adversarial Networks (StyleGAN3).” (2021)

• Conformer: Gulati, A., et al. “Conformer: Convolution-augmented Transformer for Speech Recognition.” (2020)

• FastSpeech 2: Ren, Y., et al. “FastSpeech 2: Fast and High-Quality End-to-End Text to Speech.”
(2021)

• Glow-TTS: Kim, J., et al. “Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic
Alignment Search.” (2020)

• DeepVoice 3: Ping, W., et al. “Deep Voice 3: Scaling Text-to-Speech with Convolutional
Sequence Learning.” (2017)

• Jasper: Li, J., et al. “Jasper: An End-to-End Convolutional Neural Acoustic Model.” (2019)

• DeepLabv3+: Chen, L.-C., et al. “Encoder-Decoder with Atrous Separable Convolution for
Semantic Image Segmentation.” (2018)


• ImageBind: Girdhar, R., et al. “ImageBind: One Embedding Space To Bind Them All.” (2023)

• BLIP-2: Li, J., et al. “BLIP-2: Bootstrapping Language-Image Pre-training.” (2023)

• Perceiver IO: Jaegle, A., et al. “Perceiver IO: A General Architecture for Structured Inputs &
Outputs.” (2021)
