The Atlas of 50 Common AI Models
A Comprehensive Guide to Language, Vision, Speech, and Multimodal Systems
Nabil EL MAHYAOUI
January 3, 2025
Contents

1 Introduction

2 Language Models (LLMs)
2.1 GPT-4 (OpenAI)
2.2 BERT (Google)
2.3 T5 (Google)
2.4 LaMDA (Google)
2.5 Claude (Anthropic)
2.6 LLaMA 2 (Meta)
2.7 PaLM 2 (Google)
2.8 RedPajama (Together)
2.9 Falcon (TII)
2.10 BioGPT (Microsoft)

3 Vision Models
3.1 Vision Transformer (ViT, Google)
3.2 CLIP (OpenAI)
3.3 SAM (Meta)
3.4 DALL·E 3 (OpenAI)
3.5 DeepLabv3+ (Google)
3.6 YOLOv7 (Wang et al.)
3.7 EfficientNet (Google)
3.8 StyleGAN3 (NVIDIA)
3.9 Flamingo (DeepMind)
3.10 Imagine (Meta)

4 Text-to-Speech (TTS)
4.1 WaveNet (DeepMind)
4.2 Tacotron 2 (Google)
4.3 FastSpeech 2 (Microsoft)
4.4 VITS (Kakao Enterprise)
4.5 Glow-TTS
4.6 DeepVoice 3 (Baidu)
4.7 Speechify
4.8 Google TTS API
4.9 Sonantic
4.10 Respeecher

5 Speech-to-Text (STT)
5.1 DeepSpeech (Mozilla)
5.2 wav2vec 2.0 (Meta)
5.3 Whisper (OpenAI)
5.4 Conformer (Google)
5.5 Jasper (NVIDIA)
5.6 Kaldi (Open-Source)
5.7 AssemblyAI
5.8 Speechmatics
5.9 Rev AI
5.10 AWS Transcribe

6 Multimodal Models
6.1 GPT-4 (OpenAI)
6.2 CLIP (OpenAI)
6.3 Flamingo (DeepMind)
6.4 DALL·E 3 (OpenAI)
6.5 PaLM-E (Google)
6.6 LLaVA (Liu et al.)
6.7 ImageBind (Meta)
6.8 BLIP-2 (Salesforce)
6.9 Perceiver IO (DeepMind)
6.10 Muse (Google)

7 Conclusion

References and Further Reading
Chapter 1
Introduction
Artificial Intelligence has evolved into a vast ecosystem of specialized models that serve different
modalities:
• Language Models (LLMs) for text-based tasks
• Vision Models for image recognition, segmentation, and generation
• Text-to-Speech (TTS) engines for producing human-like spoken audio
• Speech-to-Text (STT) systems for converting human speech into text
• Multimodal Models that combine data from multiple modalities (e.g., text & vision)
The following chapters cover 50 notable AI models, 10 for each of the five categories. We include
details on their development, key innovations, parameter sizes (when known), architectures,
typical use cases, and references or relevant links. This work aims to serve as both a quick-start
guide for newcomers and a deeper reference for seasoned practitioners.
Chapter 2
Language Models (LLMs)
Overview
Language Models are at the forefront of modern AI research, excelling at tasks such as text
generation, sentiment analysis, summarization, translation, and more. Below are ten notable
LLMs that have significantly influenced the NLP landscape.
GPT-4 (OpenAI)
Release & Developer
• Release: 2023
• Developer: OpenAI
Parameter Count
Not publicly disclosed; widely believed to be on the order of hundreds of billions of parameters.
Architecture
• Transformer-based, improved upon GPT-3’s decoder-only style.
• Incorporates multimodal inputs (image + text) at certain access levels.
Key Innovations
• Enhanced reasoning and steerability via system messages.
• Supports limited image input understanding (vision).
• Improved factual correctness and decreased tendency to generate disallowed content.
Use Cases
• Advanced writing assistance and content generation.
• Code generation and debugging.
• Summarization, Q&A, creative storytelling.
References
• OpenAI Official GPT-4 Page: https://openai.com/product/gpt-4
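Example (Python)
A minimal sketch of calling GPT-4 through the OpenAI Chat Completions API, assuming the openai Python package (v1 or later) and an OPENAI_API_KEY environment variable; the exact model name available to you may differ.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4",  # assumed model name; availability depends on your account
    messages=[
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user", "content": "Summarize the Transformer architecture in two sentences."},
    ],
    temperature=0.2,
)
print(response.choices[0].message.content)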
BERT (Google)
Release & Developer
• Release: 2018
• Developer: Google
Parameter Count
• BERT-Base: 110M
• BERT-Large: 340M
Architecture
• Transformer encoder architecture (bidirectional).
• Trained with Masked Language Modeling (MLM) and Next Sentence Prediction (NSP).
Key Innovations
• Bidirectional context representation (as opposed to left-to-right or right-to-left).
• Strong for tasks like Q&A (SQuAD) and classification (GLUE).
Use Cases
• Text classification, sentiment analysis, entity recognition.
• Fine-tuning for a wide array of NLP tasks (Q&A, classification, etc.).
References
• Devlin, J., et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” (2018)
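Example (Python)
A minimal sketch of BERT's masked-language-modeling interface using the Hugging Face transformers pipeline; the bert-base-uncased checkpoint is downloaded on first use.

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the token hidden behind [MASK] using context from both directions.
for prediction in fill_mask("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))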
T5 (Google)
Release & Developer
• Release: 2019
• Developer: Google
Parameter Count
Multiple variants, from 60M (T5-Small) and 220M (T5-Base) up to 11B (T5-XXL).
Architecture
• Sequence-to-sequence (encoder-decoder) Transformer.
• All tasks cast as text-to-text, simplifying the training objective.
Key Innovations
• Unified framework that reframes all NLP tasks as text generation.
• Performance improvements on summarization, translation, classification.
Use Cases
• Summarization, QA, translation, classification.
• Multi-task learning with text-based input/output.
References
• Raffel, C., et al. “Exploring the Limits of Transfer Learning with a Unified Text-to-Text
Transformer.” (2019)
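Example (Python)
A short sketch of the text-to-text interface: the task is selected purely by a text prefix. It uses the small public t5-small checkpoint via Hugging Face transformers; the inputs are illustrative.

from transformers import pipeline

t5 = pipeline("text2text-generation", model="t5-small")

# The task prefix tells T5 what to do; input and output are both plain text.
print(t5("translate English to German: The house is wonderful.")[0]["generated_text"])
print(t5("summarize: The quick brown fox jumped over the lazy dog while the "
         "farmer watched from the porch and the sun set behind the hills.")[0]["generated_text"])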
LaMDA (Google)
Release & Developer
• Release: 2021
• Developer: Google
Parameter Count
Not officially marketed with a size, but the LaMDA paper reports models of up to 137B parameters.
Architecture
• Transformer-based with a focus on dialogue optimization.
• Specialized training on conversational data to improve coherence.
Key Innovations
• Fine-tuned for multi-turn dialogue and open-ended conversation.
• Emphasizes safe and context-aware responses.
Use Cases
• Chatbots for customer support or entertainment.
• Prototyping advanced conversational agents.
References
• Google AI Blog: https://ai.googleblog.com/2021/05/lamda-towards-safe-grounded-and-high.html
Claude (Anthropic)
Release & Developer
• Release: 2022
• Developer: Anthropic
Parameter Count
Exact size undisclosed.
Architecture
• Transformer-based large language model.
• Emphasis on “constitutional AI” principles for safer text generation.
Key Innovations
• Focuses on interpretable, controllable, and safe text outputs.
• Uses curated “rules” or “constitution” to guide responses.
Use Cases
• Ethical chatbots, safer content moderation systems.
• Research on model alignment and interpretability.
References
• Anthropic’s Official Site: https://www.anthropic.com/index/claude
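Example (Python)
A hypothetical call to Anthropic's Messages API using the anthropic Python package; the model name shown is an assumption and should be replaced with a currently available Claude model.

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

message = client.messages.create(
    model="claude-3-sonnet-20240229",  # assumed model name; substitute a current Claude model
    max_tokens=256,
    messages=[{"role": "user", "content": "Explain constitutional AI in one short paragraph."}],
)
print(message.content[0].text)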
LLaMA 2 (Meta)
Release & Developer
• Release: 2023
• Developer: Meta (Facebook AI)
Parameter Count
Released in 7B, 13B, and 70B parameter variants.
Architecture
• Transformer-based decoder architecture.
• Released with open weights for research and commercial use under specific licenses.
Key Innovations
• Leaner approach to large language modeling with strong performance.
• Encourages community-driven experimentation and fine-tuning.
Use Cases
• NLP tasks in academia and industry (chatbots, summarization, etc.).
• Fine-tuning to specialized domains or smaller footprints.
References
• Meta AI Official LLaMA 2 Announcement: https://ai.meta.com/llama/
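Example (Python)
A sketch of loading a LLaMA 2 chat checkpoint with Hugging Face transformers, assuming you have accepted Meta's license for the gated meta-llama weights and have the accelerate package installed for device placement.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # gated repository; access must be granted
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"  # device_map requires `accelerate`
)

prompt = "Briefly explain what a decoder-only transformer does."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))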
PaLM 2 (Google)
Release & Developer
• Release: 2023
• Developer: Google
Parameter Count
Exact size undisclosed, but indicated to be smaller and more efficient than the original PaLM
(540B).
Architecture
• Next-generation large language model from Google’s Pathways approach.
• Improved multilingual and multimodal capabilities.
Key Innovations
• Enhanced code understanding and translation tasks.
• Stronger reasoning performance, better data efficiency.
Use Cases
• Complex reasoning, code generation, language translation.
• Multilingual applications and advanced Q&A.
References
• Google I/O 2023 Keynote: https://blog.google/technology/ai/
RedPajama (Together)
Release & Developer
• Release: 2023
• Developer: Together, various open-source contributors
Parameter Count
The open RedPajama-INCITE models were released at 3B and 7B parameters; the project’s main artifact is an open, roughly 1.2-trillion-token reproduction of LLaMA’s training dataset, with the data and training code fully public.
Architecture
• Transformer-based, aims to replicate LLaMA’s design as open-source.
• Emphasizes transparent data pipelines.
Key Innovations
• Open replication of LLaMA’s training data recipe using publicly available datasets and code.
• Facilitates community-driven fine-tuning and model improvement.
Use Cases
• Research on scaling laws, model interpretability, and open benchmarking.
• Fine-tuning for domain-specific tasks.
References
• RedPajama GitHub: https://github.com/togethercomputer/RedPajama-Data
Falcon (TII)
Release & Developer
• Release: 2023
• Developer: Technology Innovation Institute (TII)
Parameter Count
Available in variants of 7B and 40B.
Architecture
• Transformer-based, similar to GPT-style decoder architectures.
• Apache 2.0 License, encouraging open use.
Key Innovations
• High-performance on benchmarks relative to parameter size.
• Focus on efficiency and strong baseline performance.
Use Cases
• Chatbots, language understanding, text generation for research and commercial apps.
• Fine-tuning with relatively smaller resource footprints.
References
• Falcon GitHub: https://github.com/Technology-Innovation-Institute/falcon
BioGPT (Microsoft)
Release & Developer
• Release: 2022
• Developer: Microsoft Research
Parameter Count
Varies; base models typically in the range of hundreds of millions of parameters, specialized for
biomedical text.
Architecture
• Transformer-based, fine-tuned on biomedical corpora.
• Focuses on specialized domain vocabulary, terminologies, and references.
Key Innovations
• High accuracy on biomedical NLP tasks like entity recognition, relation extraction.
• Enhanced understanding of domain-specific language and abbreviations.
Use Cases
• Scientific literature analysis, biomedical knowledge discovery.
• Drug discovery, clinical text interpretation, and medical Q&A.
References
• Microsoft BioGPT Paper: https://arxiv.org/abs/2210.10341
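Example (Python)
A minimal sketch of biomedical text generation with the microsoft/biogpt checkpoint available through Hugging Face transformers; the prompt is illustrative only.

from transformers import pipeline

generator = pipeline("text-generation", model="microsoft/biogpt")

# Domain-specific continuation of a biomedical prompt.
result = generator("COVID-19 is", max_new_tokens=30, num_return_sequences=1)
print(result[0]["generated_text"])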
Chapter 3
Vision Models
Overview
Vision Models are primarily focused on understanding and generating images (and sometimes
video). They handle tasks such as classification, segmentation, object detection, and image
synthesis. Below are ten influential vision models.
Vision Transformer (ViT, Google)
Release & Developer
• Release: 2020
• Developer: Google
Parameter Count
ViT-Base: 86M; ViT-Large: 307M; larger variants (e.g., ViT-Huge at roughly 632M) go beyond that.
Architecture
• Uses Transformer encoders on image patches (16x16 or 32x32).
• Positional embeddings handle patch order.
Key Innovations
• Demonstrated that Transformers can match or outperform CNNs when pre-trained on sufficiently large image datasets.
• Simplified architecture by reusing standard NLP Transformer blocks.
Use Cases
• Image classification, object detection (with minor adaptations).
• Transfer learning for specialized tasks (medical imaging, etc.).
References
• Dosovitskiy, A., et al. “An Image is Worth 16x16 Words: Transformers for Image Recognition
at Scale.” (2020)
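Example (Python)
A sketch of ImageNet-style classification with the pretrained google/vit-base-patch16-224 checkpoint; the image URL is a standard COCO sample used purely for illustration.

import requests
import torch
from PIL import Image
from transformers import ViTForImageClassification, ViTImageProcessor

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # sample COCO image (two cats)
image = Image.open(requests.get(url, stream=True).raw)

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

inputs = processor(images=image, return_tensors="pt")  # patching and normalization handled here
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])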
CLIP (OpenAI)
Release & Developer
• Release: 2021
• Developer: OpenAI
Parameter Count
Varies depending on the backbone (ResNet or ViT-based). Typically in the range of 63M–300M.
Architecture
• Two-tower model: one for images, one for text, joined by a shared embedding space.
• Trained with contrastive learning over (image, text) pairs.
Key Innovations
• Enables zero-shot classification of images based on textual prompts.
• Learns general-purpose representations applicable to diverse downstream tasks.
Use Cases
• Image search, labeling, content moderation.
• Zero-shot classification without explicit labeled training data.
References
• Radford, A., et al. “Learning Transferable Visual Models From Natural Language Supervision
(CLIP).” (2021)
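Example (Python)
A sketch of zero-shot classification: candidate labels are written as text prompts and scored against the image in CLIP's shared embedding space. The label set and image URL are illustrative assumptions.

import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # sample COCO image
image = Image.open(requests.get(url, stream=True).raw)
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
for label, p in zip(labels, probs):
    print(f"{label}: {p:.3f}")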
SAM (Meta)
Release & Developer
• Release: 2023
• Developer: Meta (Facebook AI)
Parameter Count
Openly released in ViT-B, ViT-L, and ViT-H variants; the largest image encoder (ViT-H) has roughly 636M parameters, paired with a lightweight prompt encoder and mask decoder.
Architecture
• Promptable model with an image encoder and flexible mask decoder.
• Designed to segment “anything” within an image upon prompt (points, boxes, text).
Key Innovations
• General-purpose segmentation that does not require specific training on particular classes.
• Powerful zero-shot or one-shot segmentation performance across varied domains.
Use Cases
• Quick image segmentation for design, medical imaging, autonomous vehicles.
• Interactive image editing (selecting regions to modify).
References
• Kirillov, A., et al. “Segment Anything.” (2023), https://segment-anything.com/
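Example (Python)
A sketch of point-prompted segmentation with Meta's segment-anything package, assuming a downloaded ViT-B checkpoint file and a local image; the point coordinates are placeholders.

import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)  # placeholder image path

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")  # downloaded checkpoint
predictor = SamPredictor(sam)
predictor.set_image(image)

# One foreground point prompt (x, y); label 1 marks foreground.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,
)
print(masks.shape, scores)  # several candidate masks with confidence scores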
DALL·E 3 (OpenAI)
Release & Developer
• Release: 2023
• Developer: OpenAI
Parameter Count
Not explicitly stated; building on DALL·E 2’s large model architecture.
Architecture
• Diffusion or autoregressive approach (details vary).
• Integrates CLIP-like text encoders to parse complex prompts.
Key Innovations
• More coherent text-to-image generation with improved text rendering in images.
• Greater control over style and scene composition.
Use Cases
• Rapid prototyping for design, concept art, creative illustrations.
• Marketing materials, user-generated content for visual storytelling.
References
• OpenAI DALL·E 3 Announcement: https://openai.com/product/dall-e-3
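Example (Python)
A hypothetical image-generation request through the OpenAI Images API (openai v1+ SDK); the prompt and size are illustrative.

from openai import OpenAI

client = OpenAI()
result = client.images.generate(
    model="dall-e-3",
    prompt="A watercolor lighthouse at dawn with the word 'ATLAS' painted on a signpost",
    size="1024x1024",
    n=1,
)
print(result.data[0].url)  # temporary URL of the generated image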
DeepLabv3+ (Google)
Release & Developer
• Release: 2018
• Developer: Google
Parameter Count
Typically in the range of 40M–60M, depending on backbone (e.g., ResNet, Xception).
Architecture
• Uses atrous convolution for multi-scale context.
• Encoder-decoder structure for semantic segmentation.
Key Innovations
• Atrous Spatial Pyramid Pooling (ASPP) captures rich contextual information.
• Effective for pixel-accurate segmentation tasks.
Use Cases
• Autonomous driving scene analysis (lane detection, obstacle segmentation).
• Medical imaging segmentation.
References
• Chen, L.-C., et al. “Encoder-Decoder with Atrous Separable Convolution for Semantic Image
Segmentation.” (2018)
YOLOv7 (Wang et al.)
Release & Developer
• Release: 2022
• Developer: Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao (Academia Sinica); distinct from the Ultralytics YOLO line
Parameter Count
Ranges from about 7M to 70M for various YOLOv7 models.
Architecture
• One-stage detector with real-time inference speed.
• Builds on YOLO’s design with CSP (Cross Stage Partial) improvements.
Key Innovations
• SOTA object detection with a balance of speed and accuracy.
• Modular design for easy scaling and fine-tuning.
Use Cases
• Real-time object detection in robotics or drone applications.
• Surveillance, traffic monitoring, and video analytics.
References
• Official YOLOv7 GitHub: https://github.com/WongKinYiu/yolov7
EfficientNet (Google)
Release & Developer
• Release: 2019
• Developer: Google
Parameter Count
Ranges from 5M to 66M across EfficientNet-B0 through B7.
Architecture
• Compound scaling of depth, width, and resolution in a principled manner.
• Mobile inverted bottleneck MBConv with squeeze-excitation.
Key Innovations
• SOTA accuracy on ImageNet with relatively fewer parameters.
• Simple scaling rules reduce trial-and-error for architectural design.
Use Cases
• Mobile and embedded vision applications.
• Large-scale classification with limited compute resources.
References
• Tan, M., and Le, Q. “EfficientNet: Rethinking Model Scaling for Convolutional Neural
Networks.” (2019)
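Example (Python)
A sketch of loading a pretrained EfficientNet-B0 with torchvision (0.13 or later), where the weights object also supplies the matching preprocessing transforms; photo.jpg is a placeholder path.

import torch
from PIL import Image
from torchvision.models import EfficientNet_B0_Weights, efficientnet_b0

weights = EfficientNet_B0_Weights.DEFAULT
model = efficientnet_b0(weights=weights).eval()
preprocess = weights.transforms()  # resizing/normalization matched to the pretrained weights

batch = preprocess(Image.open("photo.jpg")).unsqueeze(0)  # placeholder image path
with torch.no_grad():
    probs = model(batch).softmax(dim=-1)
top = probs.argmax(-1).item()
print(weights.meta["categories"][top], float(probs[0, top]))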
StyleGAN3 (NVIDIA)
Release & Developer
• Release: 2021
• Developer: NVIDIA
Parameter Count
Typically in the tens of millions, depending on configuration for image resolution.
Architecture
• Generative Adversarial Network (GAN) with “style” layers controlling generation at multiple
scales.
• Removes aliasing artifacts from previous iterations (StyleGAN2).
Key Innovations
• Improved temporal stability for video and motion consistency.
• Enhanced disentanglement of latent representation for image editing.
Use Cases
• High-fidelity face generation, art creation, data augmentation.
• Research in GAN interpretability and generative design.
References
• Karras, T., et al. “Alias-Free Generative Adversarial Networks (StyleGAN3).” (2021)
Flamingo (DeepMind)
Release & Developer
• Release: 2022
• Developer: DeepMind
Parameter Count
Built on top of large language models; parameter count can be in the billions.
Architecture
• Combines a frozen LLM (e.g., Chinchilla) with adapter layers for visual input.
• Processes sequences of images and text tokens for cross-modal reasoning.
Key Innovations
• Few-shot adaptation: strong performance on varied vision+language tasks with limited extra
training.
• Memory-efficient approach by keeping the LLM “frozen.”
Use Cases
• Image captioning, visual question answering.
• Embodied AI for robotics with vision-language interactions.
References
• Alayrac, J-B., et al. “Flamingo: a Visual Language Model for Few-Shot Learning.” (2022)
Imagine (Meta)
Release & Developer
• Release: 2023
• Developer: Meta (Facebook AI)
Parameter Count
Details not publicly disclosed; presumably large-scale generative model.
Architecture
• Likely using diffusion or GAN-based approach for image creation.
• Accepts textual descriptions to generate images.
Key Innovations
• Focus on generating high-quality, diverse images from textual prompts.
• Potential synergy with LLaMA or other large language models for improved prompt understanding.
Use Cases
• Creative art, concept design, marketing material generation.
• Rapid prototyping of visual ideas in AR/VR contexts.
References
• Meta AI announcements and public demos (detailed technical documentation is limited).
Chapter 4
Text-to-Speech (TTS)
Overview
Text-to-Speech systems convert textual input into synthetic speech. Modern TTS leverages
neural networks that can produce surprisingly human-like prosody and clarity. Here are ten
TTS systems making significant impact.
WaveNet (DeepMind)
Release & Developer
• Release: 2016
• Developer: DeepMind (Google)
Parameter Count
Initial WaveNet had about 3–5 million parameters, though later variants vary in size.
Architecture
• Autoregressive convolutional model that generates raw waveforms sample by sample.
• Dilated convolutions to capture long-range dependencies in audio.
Key Innovations
• Dramatically improved naturalness of synthetic speech compared to older vocoder methods.
• Also adapted for music generation and conditional audio tasks.
Use Cases
• Google Assistant TTS, voice assistants, audiobooks, any domain needing high-fidelity speech.
References
• van den Oord, A., et al. “WaveNet: A Generative Model for Raw Audio.” (2016)
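Example (Python)
A toy PyTorch sketch of WaveNet's central idea, a stack of causal dilated 1-D convolutions whose receptive field doubles at each layer. It omits WaveNet's gated activations, residual/skip connections, and mu-law output, so it illustrates the dilation pattern rather than a working vocoder.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedCausalStack(nn.Module):
    def __init__(self, channels: int = 32, layers: int = 8):
        super().__init__()
        # Dilation doubles each layer (1, 2, 4, ...), so the receptive field grows exponentially.
        self.convs = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=2, dilation=2 ** i) for i in range(layers)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for conv in self.convs:
            x = F.pad(x, (conv.dilation[0], 0))  # left-pad only, keeping the model causal
            x = torch.relu(conv(x))
        return x

audio = torch.randn(1, 32, 16000)          # (batch, channels, samples) of dummy features
print(DilatedCausalStack()(audio).shape)   # torch.Size([1, 32, 16000])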
Tacotron 2 (Google)
Release & Developer
• Release: 2017 (initial Tacotron); 2018 (Tacotron 2)
• Developer: Google
Parameter Count
Tacotron 2’s sequence-to-sequence network has roughly 28 million parameters, plus the WaveNet (or similar) neural vocoder.
Architecture
• Sequence-to-sequence model that predicts mel-spectrograms from text.
• Vocoder (like WaveNet) synthesizes final waveform from spectrogram.
Key Innovations
• End-to-end TTS pipeline with state-of-the-art naturalness.
• Replaced older “Griffin-Lim” vocoding with neural vocoding for better quality.
Use Cases
• Google Cloud TTS services, accessibility tools, voice apps.
References
• Shen, J., et al. “Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram
Predictions.” (2018)
FastSpeech 2 (Microsoft)
Release & Developer
• Release: 2019 (FastSpeech); 2020 (FastSpeech 2)
• Developer: Microsoft
Parameter Count
Model sizes vary; generally on the order of tens of millions of parameters.
Architecture
• Non-autoregressive TTS that predicts durations for each phoneme/token.
• Uses alignment and variance adapters for pitch, energy, and duration.
Key Innovations
• Faster inference than autoregressive models.
• High-quality output comparable to Tacotron-class systems.
Use Cases
• Real-time TTS applications with low-latency requirements.
• Systems that need controllable prosody (pitch, duration).
References
• Ren, Y., et al. “FastSpeech 2: Fast and High-Quality End-to-End Text to Speech.” (2021)
VITS (Kakao Enterprise)
Release & Developer
• Release: 2021
• Developer: Kakao Enterprise (Kim et al.)
Parameter Count
Typically tens of millions, though may vary in forks.
Architecture
• End-to-end TTS with normalizing flows and adversarial training.
• Learns to generate waveforms directly, bypassing a separate vocoder.
Key Innovations
• One-stage generation: text to raw audio in a single model.
• High fidelity and natural prosody, with adjustable speed and style.
Use Cases
• Voice user interfaces, synthetic podcasts, real-time TTS in gaming.
References
• Kim, J., et al. “Conditional Variational Autoencoder with Adversarial Learning for End-to-End
Text-to-Speech.” (2021)
Glow-TTS
Release & Developer
• Release: 2020
• Developer: Kakao Enterprise and Seoul National University (Kim et al.)
Parameter Count
Tens of millions in the base version.
Architecture
• Flow-based generative model for mel-spectrograms.
• Non-autoregressive approach for faster synthesis.
Key Innovations
• Uses invertible flow steps for better likelihood-based training.
• Competitive quality and speed with fewer model complexities.
Use Cases
• High-speed TTS with decent quality for multi-lingual systems.
• Flexible style transfer or voice cloning in research contexts.
References
• Kim, J., et al. “Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment
Search.” (2020)
DeepVoice 3 (Baidu)
Release & Developer
• Release: 2017
• Developer: Baidu Research
Parameter Count
Around 20–30 million, depending on model configuration.
Architecture
• Fully convolutional seq2seq architecture for TTS.
• Supports multi-speaker and multi-lingual data.
Key Innovations
• GPU-friendly convolutional layers.
• Faster training than RNN-based seq2seq TTS.
Use Cases
• Multi-language TTS platforms, voice-based AI assistants, large-scale production TTS.
References
• Ping, W., et al. “Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning.”
(2017)
Speechify
Release & Developer
• Release: 2017 (initial) - Present
• Developer: Speechify Inc.
Parameter Count
Proprietary. Uses multiple TTS engines under the hood.
Architecture
• Cloud-based TTS platform with user-friendly API.
• Focus on multi-voice and accent variety.
Key Innovations
• Aimed at accessibility and reading assistance (especially for dyslexia).
• Device-agnostic, supporting apps, browser extensions, and mobile usage.
Use Cases
• Education technology, reading-assist apps, corporate eLearning.
References
• Official Site: https://speechify.com/
Google TTS API
Release & Developer
• Release: Ongoing expansions since 2018
• Developer: Google Cloud
Parameter Count
Multiple voices with undisclosed parameter counts; many are built off Tacotron and WaveNet derivatives.
Architecture
• Cloud-based TTS service, harnessing neural net backends like WaveNet.
• Variety of voices, languages, and dialects.
Key Innovations
• Scalable, low-latency TTS at global scale.
• Large library of predefined voices, SSML customization.
Use Cases
• Integrations into contact centers, mobile apps, accessibility tools.
References
• Google Cloud TTS Documentation: https://cloud.google.com/text-to-speech
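Example (Python)
A sketch of a synthesis request with the google-cloud-texttospeech client library, assuming Google Cloud credentials are configured; the voice name is one of the documented WaveNet voices and can be swapped.

from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()

response = client.synthesize_speech(
    input=texttospeech.SynthesisInput(text="Hello from the Atlas of AI models."),
    voice=texttospeech.VoiceSelectionParams(language_code="en-US", name="en-US-Wavenet-D"),
    audio_config=texttospeech.AudioConfig(audio_encoding=texttospeech.AudioEncoding.MP3),
)
with open("output.mp3", "wb") as out:
    out.write(response.audio_content)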
Sonantic
Release & Developer
• Release: 2019
• Developer: Sonantic (acquired by Spotify)
Parameter Count
Proprietary. Focus on expressive “cinematic” TTS.
Architecture
• Likely seq2seq with a neural vocoder, specialized in emotional prosody.
• Web-based studio for adjusting pitch, tone, emphasis.
Key Innovations
• Highly expressive voices for gaming, entertainment, dubbing.
• Fine control of emotions (happy, sad, serious, etc.).
Use Cases
• Voice acting for video games, film post-production, personalized narrations.
References
• Sonantic Official Site: https://sonantic.io/
Respeecher
Release & Developer
• Release: 2020
• Developer: Respeecher
Parameter Count
Proprietary, but typically includes high-end deep networks for voice cloning.
Architecture
• Source-to-target voice conversion using deep neural nets.
• Maintains emotional resonance and intonation of original speech.
Key Innovations
• “Voice cloning” that closely mimics timbre, accent, and style.
• Used in film production for recreating voices of actors.
Use Cases
• Filmmaking, gaming, personalized assistants, archiving historical voices.
References
• Respeecher Official Site: https://www.respeecher.com/
Chapter 5
Speech-to-Text (STT)
Overview
Speech-to-Text (STT) systems transform spoken language into written text, powering voice
assistants, dictation software, and transcription services. Here are ten notable STT solutions.
DeepSpeech (Mozilla)
Release & Developer
• Release: 2017
• Developer: Mozilla (inspired by Baidu’s Deep Speech)
Parameter Count
Range of 50M–100M in typical configurations.
Architecture
• End-to-end RNN-based (later versions used CNN layers).
• Uses CTC (Connectionist Temporal Classification) for alignment.
Key Innovations
• Fully open-source, easy to fine-tune for various languages.
• Good baseline for academic and hobby projects.
Use Cases
• Voice-enabled applications, real-time transcriptions.
• Low-resource or specialized domain STT by custom training.
References
• Mozilla DeepSpeech GitHub: https://github.com/mozilla/DeepSpeech
wav2vec 2.0 (Meta)
Release & Developer
• Release: 2020
• Developer: Meta (Facebook AI)
Parameter Count
About 95M for Base; larger variants for multilingual tasks can go beyond.
Architecture
• Self-supervised model on raw audio waveforms, learns robust speech representations.
• Fine-tuned with CTC or seq2seq for final transcription.
Key Innovations
• Significant WER reduction with much less labeled data than prior methods.
• Works well in noisy conditions, beneficial for low-resource languages.
Use Cases
• Real-time STT for social media platforms, voice-driven apps, and research on low-resource languages.
References
• Baevski, A., et al. “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech
Representations.” (2020)
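Example (Python)
A sketch of CTC transcription with the fine-tuned facebook/wav2vec2-base-960h checkpoint; it assumes a 16 kHz mono recording, with librosa used only to load and resample the placeholder file.

import librosa
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

speech, _ = librosa.load("sample.wav", sr=16000)  # placeholder file, resampled to 16 kHz
inputs = processor(speech, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])  # greedy CTC decoding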
Whisper (OpenAI)
Release & Developer
• Release: 2022
• Developer: OpenAI
Parameter Count
Models range from 39M (Tiny) up to 1.5B (Large).
Architecture
• Encoder-decoder Transformer trained on 680k hours of multilingual data.
• Handles multiple languages and robust to background noise.
Key Innovations
• Excellent zero-shot performance across multiple languages.
• End-to-end approach that handles speech recognition and language ID.
Use Cases
• Live transcription, multi-lingual speech recognition, podcast or meeting transcription.
References
• Radford, A., et al. “Robust Speech Recognition via Large-Scale Weak Supervision (Whisper).”
(2022)
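Example (Python)
A minimal sketch using the open-source openai-whisper package, which runs the model locally; the audio filename is a placeholder, and larger checkpoints trade speed for accuracy.

import whisper

model = whisper.load_model("base")        # tiny/base/small/medium/large
result = model.transcribe("meeting.mp3")  # placeholder file; language is auto-detected
print(result["language"])
print(result["text"])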
Conformer (Google)
Release & Developer
• Release: 2020
• Developer: Google
Parameter Count
Typically tens of millions, but can scale up.
Architecture
• Combines convolution and Transformer blocks, preserving local and global context.
• Often used with CTC or RNN-T for transcription output.
Key Innovations
• Achieves high accuracy with a small footprint.
• Lower computational cost compared to pure Transformer stacks.
Use Cases
• On-device speech recognition for mobile or embedded devices.
• Large-scale STT for call centers, voice assistants.
References
• Gulati, A., et al. “Conformer: Convolution-augmented Transformer for Speech Recognition.”
(2020)
Jasper (NVIDIA)
Release & Developer
• Release: 2019
• Developer: NVIDIA
Parameter Count
Ranges from 10M to 300M+ for deeper variants.
Architecture
• Deep 1D convolutional layers with skip connections.
• Uses CTC for alignment and decoding.
Key Innovations
• Modular block design for easy scaling.
• Highly parallelizable on GPUs, leading to faster training.
Use Cases
• High-throughput transcription tasks.
• Real-time speech recognition systems.
References
• Li, J., et al. “Jasper: An End-to-End Convolutional Neural Acoustic Model.” (2019)
Kaldi (Open-Source)
Release & Developer
• Release: 2011, ongoing updates
• Developer: Community-driven, started by Daniel Povey
Architecture
• Toolkit offering DNN/HMM or end-to-end approaches.
• Very modular, highly customizable pipelines for acoustic modeling.
Key Innovations
• Gold standard in academic speech recognition research for many years.
• Strong support for multi-language, lattice generation, and advanced acoustic modeling.
Use Cases
• Research labs, open-source STT solutions, building custom ASR systems.
References
• Kaldi: https://kaldi-asr.org/
AssemblyAI
Release & Developer
• Release: 2017
• Developer: AssemblyAI Inc.
Parameter Count
Proprietary. Model details undisclosed.
Architecture
• Cloud-based API, often with deep Transformer or RNN-based core.
• Real-time streaming endpoints and post-processing features (topic detection, sentiment).
Key Innovations
• High accuracy and easy integration into enterprise pipelines.
• Additional AI tasks like speaker diarization, content filtering.
Use Cases
• Media transcription, call center analytics, meeting transcription.
References
• Official Site: https://www.assemblyai.com/
Speechmatics
Release & Developer
• Release: Ongoing; the company established in 2006
• Developer: Speechmatics Ltd.
Architecture
• Proprietary neural-based acoustic modeling.
• Strong multi-language support with cloud or on-prem solutions.
Key Innovations
• High-accuracy transcriptions in many global languages.
• Customizable language models for domain-specific jargon.
Use Cases
• Enterprise-level transcription for broadcasting, call centers, legal domains.
References
• Speechmatics Official Site: https://www.speechmatics.com/
Rev AI
Release & Developer
• Release: 2019 for API
• Developer: Rev.com
Architecture
• Cloud-based STT with deep learning backends.
• Additional text cleanup and speaker identification.
Key Innovations
• Seamless integration with Rev’s human transcription platform for hybrid workflows.
• High accuracy in English with domain customization.
Use Cases
• Podcast or video transcription, enterprise software integrations, real-time call analysis.
References
• Rev AI: https://www.rev.ai/
AWS Transcribe
Release & Developer
• Release: 2017 (preview), ongoing updates
• Developer: Amazon Web Services
Architecture
• Proprietary deep learning-based STT service, integrated into AWS ecosystem.
• Offers custom vocabularies and language models.
Key Innovations
• Domain adaptation for call centers, specialized jargon.
• Streaming or batch transcription, integrated with Amazon S3 and other AWS services.
Use Cases
• Automated call center analytics, meeting transcription, video subtitles.
References
• AWS Transcribe Documentation: https://aws.amazon.com/transcribe/
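Example (Python)
A sketch of starting an asynchronous transcription job with boto3, assuming AWS credentials are configured and the audio file is already in S3; the bucket, key, and job name are placeholders.

import boto3

transcribe = boto3.client("transcribe", region_name="us-east-1")

transcribe.start_transcription_job(
    TranscriptionJobName="atlas-demo-job",                  # placeholder job name
    Media={"MediaFileUri": "s3://my-bucket/meeting.mp3"},   # placeholder S3 object
    MediaFormat="mp3",
    LanguageCode="en-US",
)

status = transcribe.get_transcription_job(TranscriptionJobName="atlas-demo-job")
print(status["TranscriptionJob"]["TranscriptionJobStatus"])  # IN_PROGRESS, then COMPLETED or FAILED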
Chapter 6
Multimodal Models
Overview
Multimodal models handle more than one data type (e.g., text+image, text+audio). They can
integrate linguistic, visual, or auditory cues, enabling tasks like visual Q&A, image captioning,
or text-driven image generation. Here are ten pivotal models.
GPT-4 (OpenAI)
Multimodal Context
Although GPT-4 is primarily text-based, it has a multimodal variant capable of interpreting
images.
Key Innovations
• Accepts image inputs alongside text (initially limited to select partners, later generally available via ChatGPT and the API).
• Can perform visual reasoning (describe images, interpret humor in memes).
Use Cases
• Visual question answering, combined text+image search queries.
• Document parsing (images of documents).
References
• OpenAI GPT-4: https://openai.com/product/gpt-4
CLIP (OpenAI)
Multimodal Context
Originally introduced for vision-text alignment; used in many generative and retrieval pipelines.
Key Innovations
• Learns a shared embedding space for text and images.
• Powers models like DALL·E for prompt-image coherence.
Use Cases
• Zero-shot image classification, cross-modal retrieval, content moderation.
• Foundation for controlling image generation with textual prompts.
References
• Radford, A., et al. “CLIP.” (2021)
Flamingo (DeepMind)
Multimodal Context
Flamingo extends large language models to handle sequences of images + text.
Key Innovations
• Few-shot learning: can adapt to new tasks with minimal examples.
• Freezes the LLM and learns cross-attention adapter layers for images.
Use Cases
• Visual Q&A, image description generation, story generation using images as prompts.
References
• Alayrac, J-B., et al. “Flamingo: a Visual Language Model for Few-Shot Learning.” (2022)
DALL·E 3 (OpenAI)
Multimodal Context
Accepts text prompts and outputs images; pairs vision with linguistic understanding.
Key Innovations
• Improved text rendering in images (e.g., signage, typed words).
• Better alignment between textual prompt and generated imagery.
Use Cases
• Concept art, storyboarding, and product design ideation.
• Generating custom visuals for marketing or social media content.
References
• OpenAI DALL·E 3: https://openai.com/product/dall-e-3
PaLM-E (Google)
Multimodal Context
• Extension of PaLM for embodied tasks (robotics, sensor data).
Key Innovations
• Integrates text, vision, and sensor input for advanced robotic control.
• Allows instruction-following in real-world tasks: “Pick up that object from the table.”
Use Cases
• Robotics, real-world assistant tasks, autonomous navigation.
References
• PaLM-E Paper/Google AI Blog: https://ai.googleblog.com/search?q=palm-e
LLaVA (Liu et al.)
Multimodal Context
Vision-augmented model built on LLaMA/Vicuna by Haotian Liu et al. (UW-Madison and Microsoft Research) for visual Q&A and image understanding.
Key Innovations
• Fine-tunes LLaMA with a visual encoder bridging images to text tokens.
• Maintains strong language capabilities with added visual insight.
Use Cases
• Automated tutoring with diagrams, visual-based Q&A (e.g., “What is in this picture?”).
References
• LLaVA GitHub: https://github.com/haotian-liu/LLaVA
ImageBind (Meta)
Multimodal Context
• Binds images, text, audio, 3D, and more into a shared embedding space.
Key Innovations
• Joint representation for multiple sensory modalities (speech, text, images, depth, etc.).
• Potential for retrieval tasks across multiple data types (e.g., find matching audio for an image).
Use Cases
• Multimodal search, creative design (matching audio tracks to visual scenes), AR/VR input.
References
• Girdhar, R., et al. “ImageBind: One Embedding Space To Bind Them All.” (2023)
BLIP-2 (Salesforce)
Multimodal Context
• Vision-language model from Salesforce Research for tasks like image captioning and Visual Question Answering (VQA).
Key Innovations
• Integrates an image encoder with a language decoder, uses Q-Former to bridge them.
• Supports instruction tuning for more general visual QA tasks.
Use Cases
• Interactive chat about images, dynamic scene descriptions, image-based search.
References
• Li, J., et al. “BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders
and Large Language Models.” (2023)
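Example (Python)
A sketch of image captioning with the Salesforce/blip2-opt-2.7b checkpoint via Hugging Face transformers; it assumes a CUDA GPU with enough memory, and the image URL is a standard COCO sample used for illustration.

import requests
import torch
from PIL import Image
from transformers import Blip2ForConditionalGeneration, Blip2Processor

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # sample COCO image
image = Image.open(requests.get(url, stream=True).raw)

# With no text prompt the model produces a caption; prefix "Question: ... Answer:" for VQA.
inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
output_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip())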
Perceiver IO (DeepMind)
Multimodal Context
• Generalized Transformer architecture that accepts multiple input modalities.
Key Innovations
• Can handle 1D, 2D, or 3D inputs (audio, images, point clouds).
• Latent bottleneck approach scales well with large input sizes.
Use Cases
• Video classification, audio-visual tasks, and general neural data fusion.
References
• Jaegle, A., et al. “Perceiver IO: A General Architecture for Structured Inputs & Outputs.”
(2021)
Muse (Google)
Multimodal Context
• Text-conditioned image generation using masked generative Transformers over discrete image tokens (rather than diffusion).
Key Innovations
• Predicts masked image tokens in parallel, conditioned on text embeddings from a frozen pre-trained T5 encoder, which makes sampling faster than typical diffusion models.
• Strong text alignment and fine-grained detail relative to many earlier text-to-image models.
Use Cases
• Artwork generation, concept design, or integrated marketing campaigns.
References
• Chang, H., et al. “Muse: Text-to-Image Generation via Masked Generative Transformers.” (2023)
Chapter 7
Conclusion
This Atlas provided a comprehensive tour of 50 notable AI models, sorted by their primary
modality:
1. Language Models (LLMs) – from GPT-4 to domain-specific BioGPT
2. Vision Models – covering classification, segmentation, and generation
3. Text-to-Speech (TTS) – from the groundbreaking WaveNet to specialized voice cloning
solutions
4. Speech-to-Text (STT) – spanning open-source (DeepSpeech, Kaldi) to enterprise APIs (AWS
Transcribe)
5. Multimodal Models – bridging text, images, and even audio for complex tasks
As AI continues to evolve, we expect further convergence of modalities—leading to more
generalist “foundation models.” Open-source communities and research labs will continue
releasing new variations, expanding upon these baseline technologies. The references linked
throughout are excellent entry points for deeper dives into specific models, training approaches,
and potential applications.
References and Further Reading
Below is a consolidated reference list for many of the papers and resources mentioned:
• BERT: Devlin, J., et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language
Understanding.” (2018)
• T5: Raffel, C., et al. “Exploring the Limits of Transfer Learning with a Unified Text-to-Text
Transformer.” (2019)
• WaveNet: van den Oord, A., et al. “WaveNet: A Generative Model for Raw Audio.” (2016)
• Tacotron 2: Shen, J., et al. “Natural TTS Synthesis by Conditioning WaveNet on Mel Spectro-
gram Predictions.” (2018)
• ViT: Dosovitskiy, A., et al. “An Image is Worth 16x16 Words: Transformers for Image
Recognition at Scale.” (2020)
• CLIP: Radford, A., et al. “Learning Transferable Visual Models From Natural Language
Supervision.” (2021)
• wav2vec 2.0: Baevski, A., et al. “wav2vec 2.0: A Framework for Self-Supervised Learning of
Speech Representations.” (2020)
• Flamingo: Alayrac, J-B., et al. “Flamingo: a Visual Language Model for Few-Shot Learning.”
(2022)
• StyleGAN3: Karras, T., et al. “Alias-Free Generative Adversarial Networks (StyleGAN3).”
(2021)
• Conformer: Gulati, A., et al. “Conformer: Convolution-augmented Transformer for Speech
Recognition.” (2020)
• FastSpeech 2: Ren, Y., et al. “FastSpeech 2: Fast and High-Quality End-to-End Text to Speech.”
(2021)
• Glow-TTS: Kim, J., et al. “Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic
Alignment Search.” (2020)
• DeepVoice 3: Ping, W., et al. “Deep Voice 3: Scaling Text-to-Speech with Convolutional
Sequence Learning.” (2017)
• Jasper: Li, J., et al. “Jasper: An End-to-End Convolutional Neural Acoustic Model.” (2019)
• DeepLabv3+: Chen, L.-C., et al. “Encoder-Decoder with Atrous Separable Convolution for
Semantic Image Segmentation.” (2018)
• ImageBind: Girdhar, R., et al. “ImageBind: One Embedding Space To Bind Them All.” (2023)
• BLIP-2: Li, J., et al. “BLIP-2: Bootstrapping Language-Image Pre-training.” (2023)
• Perceiver IO: Jaegle, A., et al. “Perceiver IO: A General Architecture for Structured Inputs &
Outputs.” (2021)