
VISVESVARAYA TECHNOLOGICAL UNIVERSITY

“JNANA SANGAMA”, BELAGAVI – 590 018

A Technical Seminar Report on

“Image Caption Generator”

Submitted in partial fulfillment of the requirements for the award of the degree of
BACHELOR OF ENGINEERING
IN
COMPUTER SCIENCE AND ENGINEERING
(Accredited by NBA, New Delhi, validity up to 30.06.2026)

SUBMITTED BY
Sahana S.H 4JD22CS406

UNDER THE GUIDANCE OF

Mrs. Chaithra B M, B.E., M.Tech.


Assistant Professor,
Dept. of CS&E,
Jain Institute of Technology, Davangere

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


(Accredited by NBA, New Delhi, validity up to 30.06.2026)

JAIN INSTITUTE OF TECHNOLOGY


DAVANGERE – 577003

2023 - 2024
ABSTRACT

Image captioning bridges computer vision and natural language processing by enabling machines to describe visual content in natural language. This report presents an Image Caption Generator built on an encoder-decoder architecture: a pre-trained VGG16 Convolutional Neural Network (CNN) extracts image features, and a Long Short-Term Memory (LSTM) network with an attention mechanism generates descriptive captions word by word. The model is trained on the Flickr8k dataset and evaluated with BLEU scores that compare predicted captions against ground-truth captions. Applications include assisting visually impaired users through real-time image description, automating the captioning and indexing of social media content, and supporting domains such as autonomous vehicles and surveillance.
ACKNOWLEDGEMENT

Although a single sentence hardly suffices, we would like to thank Almighty God for blessing us with His grace and taking our endeavor to a successful culmination.

We express our gratitude to our guide, Prof. Prashantha G. R, Dept. of CS&E, JIT, Davangere, for his valuable guidance, continual encouragement, and assistance throughout the seminar. We greatly appreciate the freedom and collegial respect he extended to us, and we are grateful to him for the discussions on technical matters and his suggestions concerning our topic.

We extend our sense of gratitude to Prof. Sameer B, Internship Coordinator, Dept. of CS&E, JIT, Davangere, for the support and cooperation that helped us in the completion of the internship.

We extend our sense of gratitude to Dr. Mouneshachari S, Professor & Head, Department of CS&E, JIT, Davangere, for the support and cooperation that helped us in the completion of the project.

We express our sincere thanks to Dr. Ganesh D B, Principal and Director, JIT, Davangere, for the support and cooperation that helped us in the completion of the project.

We would like to extend our gratitude to all the staff of the Department of Computer Science and Engineering for their help and support; we have benefited greatly from their feedback and suggestions.

Finally, we would like to extend our gratitude to our family members and friends for their advice and moral support.

ANJAN KUMAR T G (4JD21CS005)

CHANKRIKA S (4JD21CS015)

DRASHAN P K (4JD21CS016)

SAHANA S H (4JD22CS406)

CONTENTS

ABSTRACT
ACKNOWLEDGEMENT
CONTENTS
CHAPTER 1: INTRODUCTION
1.1 Overview of the Project
1.2 Objectives
1.3 Scope

CHAPTER 2: LITERATURE SURVEY

CHAPTER 3: METHODOLOGY
3.1 Introduction
3.2 System Requirement Specification
3.3 Working Explanation
3.4 Algorithms
3.5 Overview of CNN
3.6 Overview of VGG16
3.7 Overview of LSTM
3.8 CNN-LSTM Architecture Model
3.9 Methodology

CHAPTER 4: IMPLEMENTATION
4.1 Description of Implementation
4.2 Source Code

CHAPTER 5: RESULT

SNAPSHOTS

CONCLUSION

REFERENCES

CHAPTER 1:
INTRODUCTION

Every day, we are bombarded with photos in our surroundings, on social media, and in the news.
Humans can recognize photographs without their assigned captions, but machines must first be
trained on images. The encoder-decoder architecture of Image Caption Generator models uses input
vectors to generate valid and acceptable captions. This paradigm connects the worlds of natural
language processing and computer vision: the task is to recognize and evaluate an image's context
and then describe it in a natural language such as English.

Our approach is based on two basic models: CNN (Convolutional Neural Network) and LSTM
(Long Short-Term Memory). The CNN is used as an encoder to extract features from the snapshot
or image, and the LSTM is used as a decoder to organize the words and generate captions. Image
captioning can help with a variety of tasks, such as assisting visually impaired users with
text-to-speech descriptions produced in real time from a camera feed, and increasing social media
engagement by generating captions for photos in social feeds that can also be delivered as spoken
messages.

Helping children recognize objects in images is a step toward language learning. Captions for
every photograph on the internet can enable faster and more accurate image search and indexing.
Image captioning is used in a variety of sectors, including biology, business, and the internet, and
in applications such as self-driving cars, where it could describe the scene around the car, and
CCTV cameras, where alarms could be raised if any malicious activity is observed. The main
purpose of this report is to provide a basic understanding of deep learning methodologies.


1.1 Overview of the Project

The project focuses on developing an Image Caption Generator using deep learning methodologies,
specifically employing Convolutional Neural Networks (CNNs) as encoders and Long Short-Term
Memory networks (LSTMs) as decoders. This approach aims to enable machines to interpret visual
content and generate descriptive captions in natural language. By extracting features from images
with CNNs and generating coherent captions with LSTMs, the model bridges the gap between
computer vision and natural language processing. Applications range from enhancing accessibility
for visually impaired individuals through real-time image description to improving social media
engagement by automatically captioning and indexing visual content. Moreover, the technology
finds practical uses in sectors such as biology, business, and automotive industries, demonstrating
its broad impact on enhancing human-computer interaction and advancing technological capabilities
in understanding and processing visual information.

1.2 Objectives

1. To explore a way of describing a photograph in simple English sentences using Deep Learning (DL).
2. To apply CNN and LSTM models to extract image features and generate captions.

1.3 Scope

1. Model Development: Creating a robust deep learning architecture using CNNs for image
feature extraction and LSTMs for caption generation, optimizing for accuracy and efficiency.

2. Application: Deploying the model in practical scenarios such as aiding visually impaired
individuals, automating social media content, and integrating into industries like autonomous
vehicles and surveillance systems.

3. Ethical Considerations: Addressing issues like data privacy, bias mitigation in captioning, and
ensuring ethical deployment of AI technologies to promote fairness and societal benefit.


CHAPTER 2:
LITERATURE SURVEY

Sl. No. 1
Paper details: Implementing Automatic Image Caption Generator using Recurrent Neural Network over Long Short-Term Memory. Sai Teja N. R., Rashmitha Khilar, 27-09-2022.
Methodology used: Qualitative analysis of classification complexity.
Results obtained: The classifier was created by combining RNN with the LSTM algorithm, finally using RNN to make top-quality decisions on the classification problem.
Future work / Conclusion: The aim of this research was to increase classification accuracy by adding RNN and comparing its performance to that of LSTM using encoder-decoder models.

Sl. No. 2
Paper details: Experimental Assessment of Beam Search Algorithm for Improvement in Image Caption Generation. Chirani Lal Chowdhary, Aman Goyal, Bhavesh Kumar Vasnani, 01-12-2019.
Methodology used: Case study.
Results obtained: The model generates the basic caption with the aid of the LSTM and RNN implementation with InceptionV3.
Future work / Conclusion: This paper is aimed at a beam search algorithm for improvement in image caption generation.


CHAPTER 3:
METHODOLOGY

3.1 Introduction

The image caption generation project is built on CNN and LSTM models, which together act as the
platform for generating sentences from a given image. The approach is general enough to be applied
across a wide range of applications.

3.2 System Requirement Specification

3.2.1 Hardware Requirements


• System: i3 processor
• Hard disk: 500 GB
• RAM: 4 GB
• Monitor: 15" LED
• Input devices: keyboard, mouse

3.2.2 Software Requirements


• Platform: Google Colab/Jupyter Notebook
• Coding Language: Python

3.3 Working Explanation

1. A user uploads an image for which a caption is to be generated.
2. A gray-scale version of the image is processed through the CNN to identify the objects in it.
3. The CNN scans the image left to right and top to bottom and extracts the important image features.
4. By applying layers such as Convolutional, Pooling, and Fully Connected, together with an
activation function, the features of the image are extracted.
5. The extracted features are then passed to the LSTM.
6. Using the LSTM layer, the model predicts what the next word could be.
7. The application then generates a sentence describing the image (a short sketch of this decoding
loop follows the list).
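The decoding loop in steps 5-7 can be sketched as follows. This is a minimal illustrative sketch, not the project's exact code (which is given in Chapter 4): it assumes an already trained model, a fitted tokenizer, an image-feature vector img_feat of shape (1, 4096), and an index_to_word dictionary mapping token ids back to words.

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def greedy_caption(model, tokenizer, img_feat, index_to_word, max_len=35):
    # Start from 'startseq' and repeatedly append the most likely next word
    # until 'endseq' is produced or the length limit is reached.
    caption = 'startseq'
    for _ in range(max_len):
        seq = tokenizer.texts_to_sequences([caption])[0]
        seq = pad_sequences([seq], maxlen=max_len)
        probs = model.predict([img_feat, seq], verbose=0)   # next-word distribution
        word = index_to_word.get(int(np.argmax(probs)))
        if word is None or word == 'endseq':
            break
        caption += ' ' + word
    return caption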


3.4 Algorithms

1. Convolutional Neural Network


2. Long Short-Term Memory

3.5 Overview of CNN

Convolutional Neural Network (CNN) is a type of deep learning model for processing data that has
a grid pattern, such as images.
• To train and test deep-learning CNN models, each input image is passed through a series of
convolution layers with filters (kernels), pooling layers, and fully connected (FC) layers, and a
softmax function is applied to classify the object with probability values between 0 and 1.
• CNNs have unique layers, called convolutional layers, which distinguish them from RNNs and
other neural networks.
• Within a convolutional layer, the input is transformed before being passed to the next layer; a
CNN transforms the data by using filters. A minimal Keras sketch of this layer stack appears at
the end of this subsection.

Fig 3.5.1 CNN

Some advantages of CNN are:


• It works well for both supervised and unsupervised learning.
• It is easy to understand and fast to implement.
• It achieves high accuracy on image-recognition tasks.
• It has little dependence on pre-processing, decreasing the need for human effort in feature
engineering.
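The following is a minimal sketch of the convolution-pooling-fully-connected-softmax stack described above, written with the Keras functional API used elsewhere in this report. The input size (224x224x3) and the number of classes (10) are illustrative assumptions, not values taken from the project.

from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense
from tensorflow.keras.models import Model

# Convolution + pooling layers extract features; the dense (FC) layers and the
# final softmax turn them into class probabilities between 0 and 1.
inputs = Input(shape=(224, 224, 3))
x = Conv2D(32, (3, 3), activation='relu')(inputs)
x = MaxPooling2D((2, 2))(x)
x = Conv2D(64, (3, 3), activation='relu')(x)
x = MaxPooling2D((2, 2))(x)
x = Flatten()(x)
x = Dense(128, activation='relu')(x)
outputs = Dense(10, activation='softmax')(x)

toy_cnn = Model(inputs, outputs)
toy_cnn.summary()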


3.6 Overview of VGG16

Fig 3.6.1 VGG16


1. Architecture:

• Layers: VGG16 has 16 layers, including 13 convolutional layers followed by 3 fully


connected layers.
• Filters: It uses 3x3 convolutional filters with a stride of 1, maintaining spatial resolution.
• Pooling: Max-pooling layers with 2x2 filters and a stride of 2 are used for downsampling.
• Activation: ReLU is used throughout, with softmax on the output layer for classification.

2. Features:

• Depth: VGG16 is one of the first deep CNNs, deeper than previous models like AlexNet.
• Simplicity: It follows a simple and uniform architecture, using the same filter size (3x3)
throughout.
• Performance: Achieved state-of-the-art results on ImageNet in 2014, showcasing the
effectiveness of deep networks for image classification.

3. Usage:

• Pre-Trained Model: Often used for transfer learning, leveraging pre-trained weights learned on
ImageNet; a minimal feature-extraction sketch follows this list.
• Limitations: High computational cost due to its depth and parameter count; not ideal for
resource-constrained devices.
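The sketch below shows how VGG16 can be reused as a feature extractor by dropping its final classification layer and reading the 4096-dimensional output of the penultimate fully connected layer, which is how image features are obtained in Chapter 4. The image path 'example.jpg' is an assumed placeholder.

import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.models import Model

base = VGG16(weights='imagenet')                      # full ImageNet classifier
feature_extractor = Model(inputs=base.inputs,
                          outputs=base.layers[-2].output)  # drop the softmax layer

img = load_img('example.jpg', target_size=(224, 224))     # VGG16 expects 224x224 RGB input
x = preprocess_input(np.expand_dims(img_to_array(img), axis=0))
features = feature_extractor.predict(x)
print(features.shape)                                 # (1, 4096) image feature vector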


3.7 Overview of LSTM

LSTM networks are a type of recurrent neural network capable of learning order dependence in
sequence prediction problems. This behavior is required in complex problem domains such as
machine translation and speech recognition. LSTMs are a complex area of deep learning; a minimal
next-word prediction sketch appears at the end of this subsection.

Fig 3.7.1 LSTM


Some advantages of LSTM are:
• It provides a large range of parameters such as learning rates and input and output biases.
• The complexity of updating each weight is reduced to O(1) with LSTMs.
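As a simple illustration of sequence prediction with an LSTM, the sketch below builds a toy next-word model: an Embedding layer turns token ids into vectors, an LSTM summarizes the sequence, and a softmax layer outputs a probability distribution over the vocabulary. The vocabulary size and sequence length are assumed values for illustration only.

import numpy as np
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
from tensorflow.keras.models import Model

vocab_size, max_len = 5000, 35                            # assumed sizes
inputs = Input(shape=(max_len,))
x = Embedding(vocab_size, 256, mask_zero=True)(inputs)    # token ids -> dense vectors
x = LSTM(256)(x)                                          # summarize the word sequence
outputs = Dense(vocab_size, activation='softmax')(x)      # next-word probabilities
model = Model(inputs, outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam')

dummy = np.random.randint(1, vocab_size, size=(2, max_len))
print(model.predict(dummy).shape)                         # (2, 5000): one distribution per input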

3.8 CNN - LSTM Architecture Model

The CNN-LSTM architecture involves using Convolutional Neural Network (CNN) layers for
feature extraction on the input data, combined with LSTMs to support sequence prediction.
CNN-LSTMs were developed for visual time-series prediction problems and for generating textual
descriptions from sequences of images (e.g., videos). Specifically, they address problems such as:


• Activity Recognition: generating a textual description of an activity demonstrated in a sequence
of images.
• Image Description: generating a textual description of a single image.
• Video Description: generating a textual description of a sequence of images.

This architecture was originally referred to as a Long-term Recurrent Convolutional Network
(LRCN) model, although we will use the more generic name "CNN-LSTM".
• The CNN is used for extracting features from the image; in this project the pre-trained VGG16
model is used.
• The LSTM uses the information from the CNN to help generate a description of the image.
A simplified sketch of this merged architecture follows below.
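The sketch below shows a minimal merge-style CNN-LSTM captioning model: a 4096-dimensional image feature vector (as produced by VGG16) and a partial caption are combined to predict the next word. The sizes are illustrative assumptions; the full model in Chapter 4 additionally uses bidirectional LSTMs and an attention mechanism.

from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size, max_len = 5000, 35                     # assumed values

img_in = Input(shape=(4096,))                      # CNN image features
img_vec = Dense(256, activation='relu')(Dropout(0.5)(img_in))

cap_in = Input(shape=(max_len,))                   # partial caption as token ids
cap_vec = LSTM(256)(Embedding(vocab_size, 256, mask_zero=True)(cap_in))

merged = add([img_vec, cap_vec])                   # fuse image and text context
hidden = Dense(256, activation='relu')(merged)
next_word = Dense(vocab_size, activation='softmax')(hidden)

model = Model(inputs=[img_in, cap_in], outputs=next_word)
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.summary()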

Fig 3.8.1 CNN - LSTM MODEL

3.9 Methodology

1. Import Libraries.
2. Upload Flickr8k Dataset. (Data Preprocessing).
3. Apply CNN to identify the objects in the image.
4. Preprocess and tokenize the captions (a short tokenization sketch follows this list).
5. Use LSTM to predict the next word of the sentence.
6. Make a Data Generator.
7. View Images with caption.
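Step 4 can be illustrated with the Keras Tokenizer used in Chapter 4. The sketch below uses two made-up example captions; the real project fits the tokenizer on all cleaned Flickr8k captions wrapped with startseq and endseq tokens.

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

captions = [
    'startseq a dog runs through the grass endseq',
    'startseq two children play on the beach endseq',
]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(captions)                  # build the word -> integer vocabulary
vocab_size = len(tokenizer.word_index) + 1

seqs = tokenizer.texts_to_sequences(captions)     # captions as integer sequences
max_len = max(len(s) for s in seqs)
padded = pad_sequences(seqs, maxlen=max_len)      # pad to a common length

print(vocab_size, max_len)
print(padded)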


Fig 3.9.1 SYSTEM ARCHITECTURE

The system architecture for the image captioning model comprises several key components.
Initially, it extracts image features using a pre-trained VGG16 model, which outputs high-level
image representations. These features are then used as inputs for a custom captioning model. The
captioning model consists of a dense layer with dropout to process image features, followed by an
LSTM-based sequence processing layer for the caption input. An attention mechanism is employed
to focus on relevant parts of the image features while generating captions. The model is trained
using categorical cross-entropy loss and the Adam optimizer, with data provided through a
generator that batches image features and tokenized captions.

After training, the model predicts captions for new images by generating one word at a time based
on the image features and previously generated words. The quality of generated captions is
evaluated using BLEU scores, which measure how closely the predicted captions match the ground
truth. The system also includes functions for visualizing the results by displaying the images along
with actual and predicted captions. This end-to-end pipeline enables automated generation and
evaluation of image descriptions.
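The BLEU-based evaluation mentioned above can be illustrated as follows. This is a minimal sketch with made-up captions: each predicted caption is scored against its set of reference captions using NLTK's corpus_bleu, as in the evaluation step of Chapter 4.

from nltk.translate.bleu_score import corpus_bleu

# One hypothesis with two reference captions (illustrative example data).
references = [[['a', 'dog', 'runs', 'through', 'the', 'grass'],
               ['a', 'brown', 'dog', 'is', 'running', 'outside']]]
predictions = [['a', 'dog', 'runs', 'on', 'the', 'grass']]

bleu1 = corpus_bleu(references, predictions, weights=(1.0, 0, 0, 0))
bleu2 = corpus_bleu(references, predictions, weights=(0.5, 0.5, 0, 0))
print(f"BLEU-1: {bleu1:.3f}, BLEU-2: {bleu2:.3f}")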


Fig 3.9.2 WORKFLOW DIAGRAM

The image captioning system begins with data collection and preparation, where images and
captions are gathered and preprocessed. Image features are extracted using a pre-trained VGG16
model, and captions are tokenized and cleaned. During model training, a data generator creates
batches of features and caption sequences for the LSTM-based captioning model, which includes an
attention mechanism. The trained model is then used to generate captions for new images, with
performance evaluated using BLEU scores. Finally, results are visualized by displaying images with
their actual and predicted captions, and the trained model along with the tokenizer is saved for
future use.


CHAPTER 4:
IMPLEMENTATION

4.1 Description of Implementation

The code implements an image captioning system using a combination of deep learning techniques
and natural language processing.

1. Setup and Libraries

The code begins by importing necessary libraries for image processing (`PIL`, `numpy`,
`matplotlib`), deep learning (`tensorflow`), and text processing (`nltk`). It sets up directories for
input (`INPUT_DIR`) and output (`OUTPUT_DIR`) data.

2. Image Feature Extraction

• VGG16 Model: Utilizes the VGG16 model pretrained on ImageNet for extracting image
features. The model's classification layer is removed to access the penultimate layer's output,
which serves as image features.
• ‘extract_image_features’ Function: Loads each image from the directory (`flickr8k/
Images`), preprocesses it to fit the VGG16 model requirements (`224x224` pixels,
preprocess_input), and extracts features using the truncated VGG16 model. Extracted features
are stored in a dictionary (`image_features`) with image IDs as keys.
• Storage: Extracted image features are stored using `pickle` in `img_features.pkl` for later use.

3. Caption Data Handling

• Captions Loading: Reads captions from `captions.txt`, where each line contains an image ID
and its associated captions.
• Cleaning Captions: Preprocesses captions by converting to lowercase, removing non-
alphabetical characters, trimming extra spaces, and adding start (`startseq`) and end (`endseq`)
tokens. Cleaned captions are stored in a dictionary (`image_to_captions_mapping`).


4. Tokenization

• Tokenizer: Tokenizes cleaned captions to convert words into integer tokens and build a
vocabulary (`tokenizer`). The tokenizer is saved in `tokenizer.pkl` for later use.
• Vocabulary Size: Determines the size of the vocabulary (`vocab_size`) and calculates the
maximum caption length (`max_caption_length`) among all captions.

5. Data Splitting

• Training and Testing Sets: Splits image IDs into training (`train_ids`) and testing (`test_ids`)
sets for model evaluation. By default, 90% of the data is used for training.

6. Data Generator

• ‘data_generator’ Function: Generates batches of data for model training and validation. It
iterates through image IDs, extracts image features and corresponding token sequences, and
yields batches (`X1_batch`, `X2_batch`, `y_batch`) for training the model.

7. Model Architecture

• Encoder-Decoder Architecture: Defines a deep learning model using Keras functional API:
- Encoder: Processes image features (`inputs1`) through dense and LSTM layers to extract
context (`fe2_projected`).
- Decoder: Takes token sequences (`inputs2`) through embedding and LSTM layers to
generate captions.

8. Model Training

• Training Loop: Trains the defined model over multiple epochs (`epochs`) using
`data_generator` for both training and validation data. It compiles the model with categorical
cross-entropy loss and Adam optimizer.
• Model Saving: Saves the trained model (`mymodel.h5`) in the `output_dir`.


9. Model Evaluation

• BLEU Score Calculation: Evaluates model performance on test data using BLEU-1 and
BLEU-2 scores (`corpus_bleu` from `nltk`). Compares actual vs. predicted captions to assess
captioning quality.

10. Caption Generation

• ‘generate_caption’ Function: Takes an image name (`image_name`), loads the corresponding


image and its actual captions, predicts captions using the trained model (`predict_caption`),
and displays both actual and predicted captions alongside the image.

11. Example Usage

• Example: Demonstrates how to use `generate_caption` to generate and display captions for a
sample image (“101669240_b2d3e7f17b.jpg”) after training the model.


4.2 Source Code

# Basic libraries
import os
import re  # used below to collapse extra whitespace while cleaning captions
import pickle
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
import warnings
warnings.filterwarnings('ignore')
from math import ceil
from collections import defaultdict
from tqdm.notebook import tqdm # Progress bar library for Jupyter Notebook

# Deep learning framework for building and training models


import tensorflow as tf
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Model
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.layers import Input, Dense, LSTM, Embedding, Dropout, concatenate,
Bidirectional, Dot, Activation, RepeatVector, Lambda

# For checking score


from nltk.translate.bleu_score import corpus_bleu

# Setting the input and output directory


INPUT_DIR = 'flickr8k'
OUTPUT_DIR = 'output_dir'
os.makedirs(OUTPUT_DIR, exist_ok=True)  # make sure the output directory exists


# Load the VGG16 model without the top classification layer


model = VGG16(weights='imagenet')
model = Model(inputs=model.inputs, outputs=model.layers[-2].output)

# Function to extract image features using VGG16


def extract_image_features(img_dir):
image_features = {}
for img_name in tqdm(os.listdir(img_dir)):
if img_name.startswith("."):
continue
img_path = os.path.join(img_dir, img_name)
image = load_img(img_path, target_size=(224, 224))
image = img_to_array(image)
image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
image = preprocess_input(image)
image_feature = model.predict(image, verbose=0)
image_id = img_name.split('.')[0]
image_features[image_id] = image_feature
return image_features

# Extract image features and save them using pickle


image_features = extract_image_features(os.path.join(INPUT_DIR, 'Images'))
pickle.dump(image_features, open(os.path.join(OUTPUT_DIR, 'img_features.pkl'), 'wb'))

# Load captions from file


with open(os.path.join(INPUT_DIR, 'captions.txt'), 'r') as file:
next(file)
captions_doc = file.read()

# Create a mapping of image IDs to captions


image_to_captions_mapping = defaultdict(list)
for line in tqdm(captions_doc.split('\n')):


tokens = line.split(',')
if len(tokens) < 2:
continue
image_id, *captions = tokens
image_id = image_id.split('.')[0]
caption = " ".join(captions)
image_to_captions_mapping[image_id].append(caption)

# Preprocess captions
def clean_captions(mapping):
for key, captions in mapping.items():
for i in range(len(captions)):
caption = captions[i]
caption = caption.lower()
caption = ''.join(char for char in caption if char.isalpha() or char.isspace())
caption = re.sub(r'\s+', ' ', caption).strip()  # collapse repeated whitespace
caption = 'startseq ' + ' '.join([word for word in caption.split() if len(word) > 1]) + ' endseq'
captions[i] = caption
return mapping

# Clean captions
clean_captions(image_to_captions_mapping)

# Create a list of all captions


all_captions = [caption for captions in image_to_captions_mapping.values() for caption in captions]

# Tokenize captions
tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_captions)

# Save tokenizer
with open('tokenizer.pkl', 'wb') as tokenizer_file:


pickle.dump(tokenizer, tokenizer_file)

# Calculate maximum caption length and vocabulary size


max_caption_length = max(len(tokenizer.texts_to_sequences([caption])[0]) for caption in
all_captions)
vocab_size = len(tokenizer.word_index) + 1
print("Vocabulary Size:", vocab_size)
print("Maximum Caption Length:", max_caption_length)

# Create lists of image IDs for training and testing


image_ids = list(image_to_captions_mapping.keys())
split = int(len(image_ids) * 0.90)
train_ids = image_ids[:split]
test_ids = image_ids[split:]

# Data generator function for model training


def data_generator( data_keys, image_to_captions_mapping, features, tokenizer,
max_caption_length, vocab_size, batch_size):
X1_batch, X2_batch, y_batch = [], [], []
batch_count = 0
while True:
for image_id in data_keys:
try:
captions = image_to_captions_mapping[image_id]
for caption in captions:
caption_seq = tokenizer.texts_to_sequences([caption])[0]
for i in range(1, len(caption_seq)):
in_seq, out_seq = caption_seq[:i], caption_seq[i]
in_seq = pad_sequences([in_seq], maxlen=max_caption_length)[0]
out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
X1_batch.append(features[image_id][0])
X2_batch.append(in_seq)


y_batch.append(out_seq)
batch_count += 1

if batch_count == batch_size:
yield ([np.array(X1_batch), np.array(X2_batch)], np.array(y_batch))
X1_batch, X2_batch, y_batch = [], [], []
batch_count = 0
except Exception as e:
print(f"Exception occurred: {e}")

# Define the model architecture


inputs1 = Input(shape=(4096,))
fe1 = Dropout(0.5)(inputs1)
fe2 = Dense(256, activation='relu')(fe1)
fe2_projected = RepeatVector(max_caption_length)(fe2)
fe2_projected = Bidirectional(LSTM(256, return_sequences=True))(fe2_projected)

inputs2 = Input(shape=(max_caption_length,))
se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
se2 = Dropout(0.5)(se1)
se3 = Bidirectional(LSTM(256, return_sequences=True))(se2)

attention = Dot(axes=[2, 2])([fe2_projected, se3])


attention_scores = Activation('softmax')(attention)
attention_context = Lambda(lambda x: tf.einsum('ijk,ijl->ikl', x[0], x[1]))([attention_scores, se3])
context_vector = Lambda(lambda x: tf.reduce_sum(x, axis=1))(attention_context)

decoder_input = concatenate([context_vector, fe2], axis=-1)


decoder1 = Dense(256, activation='relu')(decoder_input)
outputs = Dense(vocab_size, activation='softmax')(decoder1)
model = Model(inputs=[inputs1, inputs2], outputs=outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam')


# Train the model


epochs = 50
batch_size = 32
steps_per_epoch = ceil(len(train_ids) / batch_size)
validation_steps = ceil(len(test_ids) / batch_size)

for epoch in range(epochs):


print(f"Epoch {epoch+1}/{epochs}")
train_generator = data_generator(train_ids, image_to_captions_mapping, image_features,
tokenizer, max_caption_length, vocab_size, batch_size)
test_generator = data_generator(test_ids, image_to_captions_mapping, image_features,
tokenizer, max_caption_length, vocab_size, batch_size)
model.fit(train_generator, epochs=1, steps_per_epoch=steps_per_epoch,
validation_data=test_generator, validation_steps=validation_steps, verbose=1)

# Save the model


model.save(os.path.join(OUTPUT_DIR, 'mymodel.h5'))

# Function to predict captions


def predict_caption(model, image_features, tokenizer, max_caption_length):
caption = 'startseq'
for _ in range(max_caption_length):
sequence = tokenizer.texts_to_sequences([caption])[0]
sequence = pad_sequences([sequence], maxlen=max_caption_length)
yhat = model.predict([image_features, sequence], verbose=0)  # image_features already has shape (1, 4096)
predicted_index = np.argmax(yhat)
predicted_word = get_word_from_index(predicted_index, tokenizer)
if predicted_word is None or predicted_word == 'endseq':
    break
caption += " " + predicted_word
return caption


# Function to get word from tokenizer index


def get_word_from_index(index, tokenizer):
return next((word for word, idx in tokenizer.word_index.items() if idx == index), None)

# Evaluate the model using BLEU scores


actual_captions_list = []
predicted_captions_list = []
for key in tqdm(test_ids):
actual_captions = image_to_captions_mapping[key]
predicted_caption = predict_caption(model, image_features[key], tokenizer,
max_caption_length)
actual_captions_words = [caption.split() for caption in actual_captions]
predicted_caption_words = predicted_caption.split()
actual_captions_list.append(actual_captions_words)
predicted_captions_list.append(predicted_caption_words)

print("BLEU-1: %f" % corpus_bleu(actual_captions_list, predicted_captions_list, weights=(1.0, 0,


0, 0)))
print("BLEU-2: %f" % corpus_bleu(actual_captions_list, predicted_captions_list, weights=(0.5,
0.5, 0, 0)))

# Function to generate and display captions for an image


def generate_caption(image_name):
image_id = image_name.split('.')[0]
img_path = os.path.join(INPUT_DIR, "Images", image_name)
image = Image.open(img_path)
captions = image_to_captions_mapping[image_id]
print('---------------------Actual --------------------- ')
for caption in captions:
print(caption)
y_pred = predict_caption(model, image_features[image_id], tokenizer, max_caption_length)
print('--------------------Predicted -------------------- ')


print(y_pred)
plt.imshow(image)
plt.axis('off')
plt.show()

# Example usage
generate_caption("101669240_b2d3e7f17b.jpg")


CHAPTER 5:
RESULT

The project aims to develop an Image Caption Generator using Convolutional Neural Networks
(CNNs) for image feature extraction and Long Short-Term Memory networks (LSTMs) for
generating descriptive captions. This fusion of computer vision and natural language processing
enables machines to interpret visual content and produce coherent sentences in natural language.
Applications include enhancing accessibility for visually impaired individuals through real-time
image description, automating social media content by automatically captioning visual posts, and
supporting industries such as autonomous vehicles and surveillance systems.

The methodology involves utilizing CNNs to extract meaningful features from images and
employing LSTMs to predict and generate captions. The CNN layers process images to identify
objects, while LSTM sequences these features into understandable sentences. The project's scope
covers model development, application deployment, and ethical considerations like data privacy
and bias mitigation in AI technologies. By leveraging deep learning techniques like CNNs and
LSTMs, the project aims to advance human-computer interaction capabilities in understanding and
processing visual information effectively.

Fig 5.1 REPRESENTATION OF IMAGE CAPTIONING


SNAPSHOTS
The output with Actual and Predicted Captions is as follows:

Figure 1: Output 1

Figure 2: Output 2


Figure 3: Output 3

Figure 4: Output 4


CONCLUSION

The integration of CNNs and LSTMs in Image Caption Generator models marks a significant
advancement at the intersection of computer vision and natural language processing. CNNs excel in
extracting detailed features from images, while LSTMs effectively organize these features into
coherent sentences, enabling machines to describe visual content accurately in natural language.

This technology offers practical benefits across various fields. It enhances accessibility by
providing real-time image descriptions for visually impaired individuals and automates social media
interactions through meaningful image captions. Industries such as autonomous vehicles and
surveillance systems also leverage its ability to interpret and articulate visual scenes with precision.

However, ethical considerations, such as safeguarding data privacy, mitigating biases in captioning,
and ensuring responsible AI deployment, are critical for fostering trust and ensuring equitable
outcomes in society.

In summary, this project highlights the transformative potential of deep learning in improving
human-computer interaction and advancing capabilities in visual information processing. Future
research should focus on enhancing model efficiency, refining performance metrics, and addressing
ethical implications to broaden the application and societal acceptance of AI technologies.


REFERENCES

[1] R. Subash (November 2019): Automatic Image Captioning Using Convolution Neural Networks and LSTM.

[2] Seung-Ho Han, Ho-Jin Choi (2020): Domain-Specific Image Caption Generator with Semantic Ontology.

[3] Pranay Mathur, Aman Gill, Aayush Yadav, Anurag Mishra and Nand Kumar Bansode (2017): Camera2Caption: A Real-Time Image Caption Generator.

[4] Simao Herdade, Armin Kappeler, Kofi Boakye, Joao Soares (June 2019): Image Captioning: Transforming Objects into Words.

[5] Manish Raypurkar, Abhishek Supe, Pratik Bhumkar, Pravin Borse, Dr. Shabnam Sayyad (March 2021): Deep Learning-Based Image Caption Generator.

[6] Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan (2015): Show and Tell: A Neural Image Caption Generator.

[7] Jianhui Chen, Wenqiang Dong, Minchen Li (2015): Image Caption Generator Based on Deep Neural Networks.

[8] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang (2017): Bottom-Up and Top-Down Attention for Image Captioning.

[9] Jyoti Aneja, Aditya Deshpande, and Alexander G. Schwing (2018): Convolutional Image Captioning.

[10] Shuang Bai and Shan An (2018): A Survey on Automatic Image Caption Generation.

