
Gen AI: Creating Images from Text Descriptions with AI: Build a system that
can generate high-quality images based on textual prompts.

1. Abstract:

The proposed project focuses on the development of an AI-driven system capable of


generating high-quality images from textual descriptions. Leveraging advancements in
natural language processing (NLP) and generative models, the system will interpret user
prompts and create visually accurate images. The system will be trained on large datasets
comprising both text and corresponding images, ensuring it can understand a wide variety of
descriptions, ranging from simple objects to complex scenes. The model will be designed to
handle various artistic styles, photorealism, and abstract visuals, ensuring flexibility and
creativity in image generation.

The core technology behind this project is a combination of transformer-based NLP models,
such as GPT, and deep generative models like Generative Adversarial Networks (GANs) or
diffusion models. By synthesizing textual input with visual elements, the system will
progressively enhance the realism and quality of generated images. Key features include
customizable style parameters, the ability to refine outputs, and scalability to handle diverse
requests. The system will also focus on efficiency, ensuring high-quality output with
optimized computational resources.

This project has potential applications across various industries such as media, design,
marketing, and education, where creating visual content quickly and accurately is crucial. By
streamlining the creative process, this AI image generation system will empower
professionals and hobbyists alike to produce stunning visuals with minimal effort, lowering
the barriers to high-quality content creation.

2. System Requirements for AI Image Generation from Textual Prompts

2.1. Hardware Requirements:

● GPU (Graphics Processing Unit):


A powerful GPU is essential for training and running deep learning models.
Recommended: NVIDIA A100, V100, or RTX 3080 with at least 12 GB of VRAM.
● CPU (Central Processing Unit):
A high-performance multi-core processor is required for handling complex operations
and data preprocessing.
Recommended: Intel Core i7/i9 or AMD Ryzen 7/9.
● RAM:
For handling large datasets and processing image generation efficiently.
Minimum: 32 GB
Recommended: 64 GB or more.
● Storage:
High-speed SSDs are required for storing datasets, models, and generated images.
Minimum: 1 TB SSD
Recommended: 2 TB SSD with additional external storage for backups.
● Power Supply and Cooling:
A robust power supply and efficient cooling system are essential, especially for
extended model training sessions.

2.2. Software Requirements:

● Operating System:
Ubuntu 20.04 or later (for compatibility with machine learning frameworks) or
Windows 10/11.
● Python Environment:
Python 3.8 or later, with virtual environment support to isolate dependencies.
● Libraries and Frameworks:
o PyTorch or TensorFlow (for building and training generative models)
o Hugging Face Transformers (for handling NLP tasks and textual prompt
processing)
o CUDA (for GPU acceleration on NVIDIA hardware; a quick environment check is sketched after this list)
o cuDNN (for optimizing deep learning performance)
o OpenCV or Pillow (for image handling and preprocessing)
● Text-to-Image Models:
Pre-trained models such as DALL-E or Stable Diffusion can be fine-tuned for improved
performance, with CLIP used for text encoding and text-image alignment.
● Development Tools:
o Jupyter Notebook or VS Code for interactive development and debugging.
o Git for version control.
o Docker (optional) for containerized environments and easy deployment.
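Before installing the heavier stack, a quick check (a minimal sketch, assuming only Python and PyTorch are already installed) can confirm the interpreter version and whether CUDA GPU acceleration is available:

# Minimal environment check (illustrative only)
import sys
import torch

print("Python:", sys.version.split()[0])          # expect 3.8 or later
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("VRAM (GB):", round(torch.cuda.get_device_properties(0).total_memory / 1024**3, 1))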

2.3. Additional Requirements:

● Datasets:
Access to large image-text paired datasets like MS-COCO, OpenAI’s WebImageText,
or other publicly available datasets for training the model.
● Cloud Support (Optional):
For large-scale deployments and training, services like AWS, Google Cloud, or Azure
for GPU/TPU instances may be used for scalability.
● API Integration:
Optional API integration for generating images via web interfaces, which requires
setting up RESTful APIs with Flask or FastAPI for seamless integration (a minimal
sketch follows below).
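Below is a minimal sketch of such a REST endpoint using FastAPI and a pre-trained Stable Diffusion pipeline; the model ID, endpoint path, and port are illustrative assumptions rather than project decisions.

# Minimal FastAPI sketch for serving text-to-image generation (illustrative only)
import io
import torch
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from diffusers import StableDiffusionPipeline

app = FastAPI()

# Load the pre-trained pipeline once at startup (model ID is an example)
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

@app.get("/generate")
def generate(prompt: str):
    # Run the diffusion pipeline and return the image as a PNG stream
    image = pipe(prompt).images[0]
    buffer = io.BytesIO()
    image.save(buffer, format="PNG")
    buffer.seek(0)
    return StreamingResponse(buffer, media_type="image/png")

# Run with: uvicorn main:app --host 0.0.0.0 --port 8000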
3. Flow Chart:

4. Code Implementation for the Text-to-Image Project:

!pip install diffusers transformers accelerate torch datasets


from huggingface_hub import notebook_login

# Log in to Hugging Face to access the dataset
notebook_login()

import torch
from diffusers import StableDiffusionPipeline, DDPMScheduler
from datasets import load_dataset
from torch.optim import AdamW
from accelerate import Accelerator

# Load the dataset from Hugging Face Hub with the correct split
dataset = load_dataset("anjunhu/naively_captioned_CUB2002011_test", split="train")

# Load the Stable Diffusion model
model_id = "CompVis/stable-diffusion-v1-4"
pipeline = StableDiffusionPipeline.from_pretrained(model_id,
                                                   torch_dtype=torch.float16).to("cuda")

# Load the scheduler for Stable Diffusion
noise_scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")

# Initialize the optimizer on the UNet part of the pipeline
optimizer = AdamW(pipeline.unet.parameters(), lr=5e-5)

# Accelerator for mixed-precision training
accelerator = Accelerator(mixed_precision="fp16")

# Initialize the tokenizer for the text prompts
# (use the pipeline's own tokenizer so it matches the pipeline's CLIP text encoder)
tokenizer = pipeline.tokenizer

# Tokenize the text prompts in the dataset
def preprocess_data(batch):
    text_inputs = tokenizer(batch['text'], padding='max_length', truncation=True,
                            return_tensors='pt')
    return text_inputs

# Apply the preprocessing to the dataset
dataset = dataset.map(preprocess_data, batched=True)

# Set the number of training epochs
num_epochs = 3

# Training loop
# Note: this is a simplified demonstration. It conditions the U-Net on the tokenized
# captions, but it does not encode the dataset images into latents with pipeline.vae,
# so the U-Net only learns to predict noise from random latents. A full fine-tune would
# call add_noise() on image latents instead.
for epoch in range(num_epochs):
    for batch in dataset:
        optimizer.zero_grad()

        # Convert the tokenized caption (a list of ints) to a long tensor and add a batch dimension
        captions = torch.tensor(batch["input_ids"]).long().unsqueeze(0).to(accelerator.device)

        # Sample a random latent and a random timestep for the diffusion step (in float16)
        noise = torch.randn((1, 4, 64, 64), dtype=torch.float16).to(accelerator.device)
        timesteps = torch.randint(0, noise_scheduler.config.num_train_timesteps, (1,),
                                  device=accelerator.device).long()

        # Use the text encoder to get embeddings for the caption
        # (the text encoder is not being fine-tuned, so no gradients are needed here)
        with torch.no_grad():
            text_embeddings = pipeline.text_encoder(captions).last_hidden_state

        # Create a noisy latent for the sampled timestep
        noisy_latents = noise_scheduler.add_noise(noise, noise, timesteps)

        # Predict the noise with the U-Net, conditioned on the text embeddings
        model_output = pipeline.unet(noisy_latents, timesteps, text_embeddings).sample

        # Compute the loss (MSE between the prediction and the true noise)
        loss = torch.nn.functional.mse_loss(model_output, noise)

        # Backward pass and optimizer step
        accelerator.backward(loss)
        optimizer.step()

    print(f"Epoch {epoch + 1} | Loss: {loss.item()}")

from PIL import Image

# Generate an image from a caption
text_prompt = "A photo of a crow."

# Use the pipeline to generate an image from the text
image = pipeline(prompt=text_prompt).images[0]

# Display the generated image
image.show()

# Save the generated image to a file
image.save("generated_image.png")
print("Image saved as generated_image.png")


Project Hurdles

Because we rely on a free GPU for training on the dataset, the number of epochs we could
test is very small, and as a result the generated output is not yet of proper quality. We hope
to optimize this in phase three; a few possible optimizations are sketched below.
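Standard memory and compute savings supported by diffusers and Accelerate could help within the free-GPU budget; the snippet below is a sketch of options we have not yet applied:

# Possible optimizations for a limited GPU budget (a sketch, not yet applied in this project)
pipeline.unet.enable_gradient_checkpointing()   # trade extra compute for lower memory use
pipeline.enable_attention_slicing()             # reduce peak memory during attention

# Accumulate gradients over several small steps to emulate a larger batch size
# (used together with "with accelerator.accumulate(pipeline.unet):" inside the training loop)
accelerator = Accelerator(mixed_precision="fp16", gradient_accumulation_steps=4)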

5. OUTPUT
