Gen AI: Creating Images from Text Descriptions with AI: build a system that
can generate high-quality images from textual prompts.
1. Abstract:
The proposed project focuses on the development of an AI-driven system capable of
generating high-quality images from textual descriptions. Leveraging advancements in
natural language processing (NLP) and generative models, the system will interpret user
prompts and create visually accurate images. The system will be trained on large datasets
comprising both text and corresponding images, ensuring it can understand a wide variety of
descriptions, ranging from simple objects to complex scenes. The model will be designed to
handle various artistic styles, photorealism, and abstract visuals, ensuring flexibility and
creativity in image generation.
The core technology behind this project is a combination of transformer-based NLP models,
such as GPT, and deep generative models like Generative Adversarial Networks (GANs) or
diffusion models. By synthesizing textual input with visual elements, the system will
progressively enhance the realism and quality of generated images. Key features include
customizable style parameters, the ability to refine outputs, and scalability to handle diverse
requests. The system will also focus on efficiency, ensuring high-quality output with
optimized computational resources.
This project has potential applications across various industries such as media, design,
marketing, and education, where creating visual content quickly and accurately is crucial. By
streamlining the creative process, this AI image generation system will empower
professionals and hobbyists alike to produce stunning visuals with minimal effort, lowering
the barriers to high-quality content creation.
2. System Requirements for AI Image Generation from Textual Prompts
2.1. Hardware Requirements:
● GPU (Graphics Processing Unit):
A powerful GPU is essential for training and running deep learning models.
Recommended: NVIDIA A100, V100, or RTX 3080 with at least 12 GB of VRAM.
● CPU (Central Processing Unit):
A high-performance multi-core processor is required for handling complex operations
and data preprocessing.
Recommended: Intel Core i7/i9 or AMD Ryzen 7/9.
● RAM:
Sufficient memory is needed for handling large datasets and processing image
generation efficiently.
Minimum: 32 GB
Recommended: 64 GB or more.
● Storage:
High-speed SSDs are required for storing datasets, models, and generated images.
Minimum: 1 TB SSD
Recommended: 2 TB SSD with additional external storage for backups.
● Power Supply and Cooling:
A robust power supply and efficient cooling system are essential, especially for
extended model training sessions.
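As a quick check that a machine meets these hardware requirements, a short script along the following lines (assuming PyTorch is already installed) reports the visible GPU and its VRAM:

import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GB")
else:
    print("No CUDA-capable GPU detected; training on CPU is impractical.")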
2.2. Software Requirements:
● Operating System:
Ubuntu 20.04 or later (for compatibility with machine learning frameworks) or
Windows 10/11.
● Python Environment:
Python 3.8 or later, with virtual environment support to isolate dependencies.
● Libraries and Frameworks:
o PyTorch or TensorFlow (for building and training generative models)
o Hugging Face Transformers (for handling NLP tasks and textual prompt
processing)
o CUDA (for GPU acceleration on NVIDIA hardware)
o cuDNN (for optimizing deep learning performance)
o OpenCV or Pillow (for image handling and preprocessing)
● Text-to-Image Models:
Pre-trained models such as DALL-E or Stable Diffusion can be fine-tuned for
improved performance, and CLIP can be used to align text and image representations.
● Development Tools:
o Jupyter Notebook or VS Code for interactive development and debugging.
o Git for version control.
o Docker (optional) for containerized environments and easy deployment.
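A minimal way to confirm that the stack listed above is installed correctly (a sketch, assuming the packages were installed with pip into the Python environment) is to import each library and print its version:

import torch
import diffusers
import transformers
import PIL

print("PyTorch:", torch.__version__)
print("Diffusers:", diffusers.__version__)
print("Transformers:", transformers.__version__)
print("Pillow:", PIL.__version__)
print("CUDA available:", torch.cuda.is_available())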
2.3. Additional Requirements:
● Datasets:
Access to large image-text paired datasets like MS-COCO, OpenAI’s WebImageText,
or other publicly available datasets for training the model.
● Cloud Support (Optional):
For large-scale deployments and training, services like AWS, Google Cloud, or Azure
for GPU/TPU instances may be used for scalability.
● API Integration:
Optional API integration for generating images via web interfaces, which requires
setting up a RESTful API with Flask or FastAPI (a FastAPI sketch is given below).
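A minimal sketch of such an API using FastAPI is shown below. The endpoint name, request schema, and model ID are illustrative assumptions, and fastapi, uvicorn, and diffusers are assumed to be installed:

import io

import torch
from diffusers import StableDiffusionPipeline
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

# Load the pipeline once at startup so each request only runs inference
pipeline = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

class GenerationRequest(BaseModel):
    prompt: str

@app.post("/generate")
def generate(request: GenerationRequest):
    # Run the diffusion pipeline and return the result as a PNG
    image = pipeline(prompt=request.prompt).images[0]
    buffer = io.BytesIO()
    image.save(buffer, format="PNG")
    buffer.seek(0)
    return StreamingResponse(buffer, media_type="image/png")

If this file were saved as main.py (an example name), the server could be started with uvicorn main:app and called with a JSON body containing the prompt.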
3. Flow chart:
4. Code implementation for the text-to-image generation project:
!pip install diffusers transformers accelerate torch datasets
from huggingface_hub import notebook_login
# Log in to Hugging Face to access the dataset
notebook_login()
from datasets import load_dataset
# Load the dataset from Hugging Face Hub with the correct split
dataset = load_dataset("anjunhu/naively_captioned_CUB2002011_test", split="train")
import numpy as np
import torch
from diffusers import StableDiffusionPipeline, DDPMScheduler
from transformers import CLIPTokenizer
from torch.optim import AdamW
from accelerate import Accelerator
# Load the Stable Diffusion model
model_id = "CompVis/stable-diffusion-v1-4"
pipeline = StableDiffusionPipeline.from_pretrained(model_id,
torch_dtype=torch.float16).to("cuda")
# Load the scheduler for Stable Diffusion
noise_scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")
# Initialize the optimizer on the UNet part of the pipeline
optimizer = AdamW(pipeline.unet.parameters(), lr=5e-5)
# Accelerator handles device placement and the backward pass; the pipeline weights are
# already float16, so no extra gradient scaling is applied (pure fp16 fine-tuning is
# numerically fragile, which contributes to the rough early outputs)
accelerator = Accelerator()
# Load the tokenizer that matches Stable Diffusion's text encoder
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
# Tokenize the text prompts in the dataset
def preprocess_data(batch):
    text_inputs = tokenizer(batch['text'], padding='max_length', truncation=True,
                            return_tensors='pt')
    return text_inputs
# Apply the preprocessing to the dataset
dataset = dataset.map(preprocess_data, batched=True)
# Set the number of training epochs
num_epochs = 3
# Put the U-Net into training mode; only its parameters are optimized
pipeline.unet.train()
# Training loop
for epoch in range(num_epochs):
    for batch in dataset:
        optimizer.zero_grad()
        # Convert the tokenized caption (a list of ids) to a long tensor and add a batch dimension
        captions = torch.tensor(batch["input_ids"]).long().unsqueeze(0).to(accelerator.device)
        # Load the training image and scale it to [-1, 1] as a (1, 3, 512, 512) tensor
        # (assumes the dataset stores the picture as a PIL image in an "image" column)
        image = np.array(batch["image"].convert("RGB").resize((512, 512)), dtype=np.float32)
        pixel_values = torch.from_numpy(image).permute(2, 0, 1).unsqueeze(0) / 127.5 - 1.0
        pixel_values = pixel_values.to(accelerator.device, dtype=torch.float16)
        # Encode the image into latent space with the frozen VAE
        with torch.no_grad():
            latents = pipeline.vae.encode(pixel_values).latent_dist.sample()
            latents = latents * pipeline.vae.config.scaling_factor
        # Sample random noise and a random diffusion timestep
        noise = torch.randn_like(latents)
        timesteps = torch.randint(0, noise_scheduler.config.num_train_timesteps, (1,),
                                  device=accelerator.device).long()
        # Add noise to the image latents according to the sampled timestep
        noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
        # Use the frozen text encoder to get embeddings for the caption
        with torch.no_grad():
            text_embeddings = pipeline.text_encoder(captions).last_hidden_state
        # Forward pass: the U-Net predicts the noise that was added
        model_output = pipeline.unet(noisy_latents, timesteps, text_embeddings).sample
        # Compute the loss (MSE between predicted and true noise)
        loss = torch.nn.functional.mse_loss(model_output.float(), noise.float())
        # Backward pass and parameter update
        accelerator.backward(loss)
        optimizer.step()
    print(f"Epoch {epoch + 1} | Loss: {loss.item()}")
from PIL import Image
# Switch the U-Net back to evaluation mode before sampling
pipeline.unet.eval()
# Generate an image from a caption
text_prompt = "A photo of a crow."
# Use the pipeline to generate an image from the text
image = pipeline(prompt=text_prompt).images[0]
# Display the generated image (in a notebook, display() renders it inline;
# image.show() would try to open an external viewer instead)
display(image)
# Save the generated image to a file
image.save("generated_image.png")
print("Image saved as generated_image.png")
Project Hurdles
Because training was done on a free GPU, only a very small number of epochs could be
tested on the dataset, so the outputs generated so far are not of proper quality. We hope
to optimize this in phase three.
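One option for phase three, to fit more training steps onto a free GPU, is to reduce memory usage with standard diffusers settings; the two calls below are a sketch that has not yet been validated on this project:

# Trade some speed for a much smaller activation-memory footprint during training
pipeline.unet.enable_gradient_checkpointing()
# Compute attention in slices so that inference fits on GPUs with limited VRAM
pipeline.enable_attention_slicing()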
5. OUTPUT