
VISVESVARAYA TECHNOLOGICAL UNIVERSITY

“JNANA SANGAMA”, BELAGAVI – 590 018

A Technical Seminar Report on

“Image Caption Generator”

Submitted in partial fulfillment of the requirements for the award of the degree of
BACHELOR OF ENGINEERING
IN
COMPUTER SCIENCE AND ENGINEERING
(Accredited by NBA, New Delhi, validity up to 30.06.2026)

SUBMITTED BY
Sahana S.H 4JD22CS406

UNDER THE GUIDANCE OF

Mrs. Chaithra B M, B.E., M.Tech.


Assistant Professor,
Dept. of CS&E,
Jain Institute of Technology, Davangere

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


(Accredited by NBA, New Delhi, validity up to 30.06.2026)

JAIN INSTITUTE OF TECHNOLOGY


DAVANGERE – 577003

2023 - 2024
ABSTRACT

Image captioning bridges computer vision and natural language processing by enabling machines to describe visual content in natural language. This report presents an Image Caption Generator built on an encoder-decoder architecture: a pre-trained VGG16 Convolutional Neural Network (CNN) extracts image features, and a Long Short-Term Memory (LSTM) network with an attention mechanism generates descriptive captions word by word. The model is trained on the Flickr8k dataset and evaluated with BLEU scores that compare predicted captions against ground-truth captions. Applications include assisting visually impaired users through real-time image description, automating the captioning and indexing of social media content, and supporting domains such as autonomous vehicles and surveillance.
ACKNOWLEDGEMENT

Although a single sentence hardly suffices, we would like to thank Almighty God for blessing us with His grace and taking our endeavor to a successful culmination.

We express our gratitude to our guide, Prof. Prashantha G. R, Dept. of CS&E, JIT, Davangere, for his valuable guidance, continual encouragement, and assistance throughout the seminar. We greatly appreciate the freedom and collegial respect he extended to us, and we are grateful to him for the discussions on technical matters and his suggestions concerning our topic.

We extend our sense of gratitude to Prof. Sameer B, Internship Coordinator, Dept. of CS&E, JIT, Davangere, for the support and cooperation that helped us in the completion of the internship.

We extend our sense of gratitude to Dr. Mouneshachari S, Professor & Head, Department of CS&E, JIT, Davangere, for the support and cooperation that helped us in the completion of the project.

We express our sincere thanks to Dr. Ganesh D B, Principal and Director, JIT, Davangere, for the support and cooperation that helped us in the completion of the project.

We would like to extend our gratitude to all the staff of the Department of Computer Science and Engineering for their help and support; we have benefited greatly from their feedback and suggestions.

Finally, we would like to extend our gratitude to our family members and friends for their advice and moral support.

ANJAN KUMAR T G (4JD21CS005)

CHANKRIKA S (4JD21CS015)

DRASHAN P K (4JD21CS016)

SAHANA S H (4JD22CS406)

CONTENTS

ABSTRACT
ACKNOWLEDGEMENT
CONTENTS
CHAPTER 1: INTRODUCTION
1.1 Overview of the Project
1.2 Objectives
1.3 Scope

CHAPTER 2: LITERATURE SURVEY

CHAPTER 3: METHODOLOGY
3.1 Introduction
3.2 System Requirement Specification
3.3 Working Explanation
3.4 Algorithms
3.5 Overview of CNN
3.6 Overview of VGG16
3.7 Overview of LSTM
3.8 CNN-LSTM Architecture Model
3.9 Methodology

CHAPTER 4: IMPLEMENTATION
4.1 Description of Implementation
4.2 Source Code

CHAPTER 5: RESULT

SNAPSHOTS

CONCLUSION

REFERENCES

CHAPTER 1:
INTRODUCTION

Every day, we are bombarded with photos in our surroundings, on social media, and in the news.
Humans can recognize photographs without their assigned captions, but machines must first be
trained on images. The encoder-decoder architecture of Image Caption Generator models uses input
vectors to generate valid and acceptable captions. This paradigm connects the worlds of natural
language processing and computer vision: the task is to recognize and evaluate an image's context
and then describe it in a natural language such as English.

Our approach is based on two basic models: CNN (Convolutional Neural Network) and LSTM
(Long Short-Term Memory). The CNN is used as an encoder to extract features from the snapshot
or image, and the LSTM is used as a decoder to organize the words and generate captions. Image
captioning can help with a variety of tasks, such as assisting visually impaired users with
text-to-speech descriptions produced in real time from a camera feed, and increasing social media
engagement by generating captions for photos in social feeds that can also be delivered as spoken
messages.

Helping children recognize objects in images is a step toward language learning. Captions for
every photograph on the internet can enable faster and more accurate image search and indexing.
Image captioning is used in a variety of sectors, including biology, business, and the internet, and
in applications such as self-driving cars, where it could describe the scene around the car, and
CCTV cameras, where alarms could be raised if any malicious activity is observed. The main
purpose of this report is to provide a basic understanding of deep learning methodologies.


1.1 Overview of the Project

The project focuses on developing an Image Caption Generator using deep learning methodologies,
specifically employing Convolutional Neural Networks (CNNs) as encoders and Long Short-Term
Memory networks (LSTMs) as decoders. This approach aims to enable machines to interpret visual
content and generate descriptive captions in natural language. By extracting features from images
with CNNs and generating coherent captions with LSTMs, the model bridges the gap between
computer vision and natural language processing. Applications range from enhancing accessibility
for visually impaired individuals through real-time image description to improving social media
engagement by automatically captioning and indexing visual content. Moreover, the technology
finds practical uses in sectors such as biology, business, and automotive industries, demonstrating
its broad impact on enhancing human-computer interaction and advancing technological capabilities
in understanding and processing visual information.

1.2 Objectives

1. To explore a way of describing a photograph in simple English sentences using Deep Learning (DL).
2. To apply CNN and LSTM models to extract image features and generate captions.

1.3 Scope

1. Model Development: Creating a robust deep learning architecture using CNNs for image
feature extraction and LSTMs for caption generation, optimizing for accuracy and efficiency.

2. Application: Deploying the model in practical scenarios such as aiding visually impaired
individuals, automating social media content, and integrating into industries like autonomous
vehicles and surveillance systems.

3. Ethical Considerations: Addressing issues like data privacy, bias mitigation in captioning, and
ensuring ethical deployment of AI technologies to promote fairness and societal benefit.


CHAPTER 2:
LITERATURE SURVEY

Sl. No. 1
Paper details: Implementing Automatic Image Caption Generator using Recurrent Neural Network over Long Short-Term Memory. Sai Teja N. R., Rashmitha Khilar, 27-09-2022.
Methodology used: Qualitative analysis of classification complexity.
Results obtained: The classifier was created by combining RNN with the LSTM algorithm, finally using RNN to make top-quality decisions on the classification problem.
Future work / Conclusion: The aim of this research was to increase classification accuracy by adding RNN and comparing its performance to that of LSTM using encoder-decoder models.

Sl. No. 2
Paper details: Experimental Assessment of Beam Search Algorithm for Improvement in Image Caption Generation. Chirani Lal Chowdhary, Aman Goyal, Bhavesh Kumar Vasnani, 01-12-2019.
Methodology used: Case study.
Results obtained: The model generates the basic caption with the aid of the LSTM and RNN implementation with InceptionV3.
Future work / Conclusion: This paper is aimed at a beam search algorithm for improvement in image caption generation.


CHAPTER 3:
METHODOLOGY

3.1 Introduction

The image caption generation project is built on CNN and LSTM models, which together act as the
platform for generating sentences from a given image. The approach is general enough to be applied
across a wide range of applications.

3.2 System Requirement Specification

3.2.1 Hardware Requirements


• System: i3 processor
• Hard disk: 500 GB
• RAM: 4 GB
• Monitor: 15" LED
• Input devices: keyboard, mouse

3.2.2 Software Requirements


• Platform: Google Colab/Jupyter Notebook
• Coding Language: Python

3.3 Working Explanation

1. A user uploads an image for which a caption is to be generated.
2. A gray-scale version of the image is processed through the CNN to identify the objects in it.
3. The CNN scans the image left to right and top to bottom and extracts the important image features.
4. By applying layers such as Convolutional, Pooling, and Fully Connected, together with an
activation function, the features of the image are extracted.
5. The extracted features are then passed to the LSTM.
6. Using the LSTM layer, the model predicts what the next word could be.
7. The application then generates a sentence describing the image (a short sketch of this decoding
loop follows the list).
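The decoding loop in steps 5-7 can be sketched as follows. This is a minimal illustrative sketch, not the project's exact code (which is given in Chapter 4): it assumes an already trained model, a fitted tokenizer, an image-feature vector img_feat of shape (1, 4096), and an index_to_word dictionary mapping token ids back to words.

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def greedy_caption(model, tokenizer, img_feat, index_to_word, max_len=35):
    # Start from 'startseq' and repeatedly append the most likely next word
    # until 'endseq' is produced or the length limit is reached.
    caption = 'startseq'
    for _ in range(max_len):
        seq = tokenizer.texts_to_sequences([caption])[0]
        seq = pad_sequences([seq], maxlen=max_len)
        probs = model.predict([img_feat, seq], verbose=0)   # next-word distribution
        word = index_to_word.get(int(np.argmax(probs)))
        if word is None or word == 'endseq':
            break
        caption += ' ' + word
    return caption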


3.4 Algorithms

1. Convolutional Neural Network


2. Long Short-Term Memory

3.5 Overview of CNN

Convolutional Neural Network (CNN) is a type of deep learning model for processing data that has
a grid pattern, such as images.
• To train and test deep-learning CNN models, each input image is passed through a series of
convolution layers with filters (kernels), pooling layers, and fully connected (FC) layers, and a
softmax function is applied to classify the object with probability values between 0 and 1.
• CNNs have unique layers, called convolutional layers, which distinguish them from RNNs and
other neural networks.
• Within a convolutional layer, the input is transformed before being passed to the next layer; a
CNN transforms the data by using filters. A minimal Keras sketch of this layer stack appears at
the end of this subsection.

Fig 3.5.1 CNN

Some advantages of CNN are:


• It works well for both supervised and unsupervised learning.
• It is easy to understand and fast to implement.
• It achieves high accuracy on image-recognition tasks.
• It has little dependence on pre-processing, decreasing the need for human effort in feature
engineering.
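The following is a minimal sketch of the convolution-pooling-fully-connected-softmax stack described above, written with the Keras functional API used elsewhere in this report. The input size (224x224x3) and the number of classes (10) are illustrative assumptions, not values taken from the project.

from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense
from tensorflow.keras.models import Model

# Convolution + pooling layers extract features; the dense (FC) layers and the
# final softmax turn them into class probabilities between 0 and 1.
inputs = Input(shape=(224, 224, 3))
x = Conv2D(32, (3, 3), activation='relu')(inputs)
x = MaxPooling2D((2, 2))(x)
x = Conv2D(64, (3, 3), activation='relu')(x)
x = MaxPooling2D((2, 2))(x)
x = Flatten()(x)
x = Dense(128, activation='relu')(x)
outputs = Dense(10, activation='softmax')(x)

toy_cnn = Model(inputs, outputs)
toy_cnn.summary()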


3.6 Overview of VGG16

Fig 3.6.1 VGG16


1. Architecture:

• Layers: VGG16 has 16 layers, including 13 convolutional layers followed by 3 fully


connected layers.
• Filters: It uses 3x3 convolutional filters with a stride of 1, maintaining spatial resolution.
• Pooling: Max-pooling layers with 2x2 filters and a stride of 2 are used for downsampling.
• Activation: ReLU is used throughout, with softmax on the output layer for classification.

2. Features:

• Depth: VGG16 is one of the first deep CNNs, deeper than previous models like AlexNet.
• Simplicity: It follows a simple and uniform architecture, using the same filter size (3x3)
throughout.
• Performance: Achieved state-of-the-art results on ImageNet in 2014, showcasing the
effectiveness of deep networks for image classification.

3. Usage:

• Pre-Trained Model: Often used for transfer learning, leveraging pre-trained weights learned on
ImageNet; a minimal feature-extraction sketch follows this list.
• Limitations: High computational cost due to its depth and parameter count; not ideal for
resource-constrained devices.
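The sketch below shows how VGG16 can be reused as a feature extractor by dropping its final classification layer and reading the 4096-dimensional output of the penultimate fully connected layer, which is how image features are obtained in Chapter 4. The image path 'example.jpg' is an assumed placeholder.

import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.models import Model

base = VGG16(weights='imagenet')                      # full ImageNet classifier
feature_extractor = Model(inputs=base.inputs,
                          outputs=base.layers[-2].output)  # drop the softmax layer

img = load_img('example.jpg', target_size=(224, 224))     # VGG16 expects 224x224 RGB input
x = preprocess_input(np.expand_dims(img_to_array(img), axis=0))
features = feature_extractor.predict(x)
print(features.shape)                                 # (1, 4096) image feature vector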


3.7 Overview of LSTM

LSTM networks are a type of recurrent neural network capable of learning order dependence in
sequence prediction problems. This behavior is required in complex problem domains such as
machine translation and speech recognition. LSTMs are a complex area of deep learning; a minimal
next-word prediction sketch appears at the end of this subsection.

Fig 3.7.1 LSTM


Some advantages of LSTM are:
• It provides a large range of parameters such as learning rates and input and output biases.
• The complexity of updating each weight is reduced to O(1) with LSTMs.
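As a simple illustration of sequence prediction with an LSTM, the sketch below builds a toy next-word model: an Embedding layer turns token ids into vectors, an LSTM summarizes the sequence, and a softmax layer outputs a probability distribution over the vocabulary. The vocabulary size and sequence length are assumed values for illustration only.

import numpy as np
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
from tensorflow.keras.models import Model

vocab_size, max_len = 5000, 35                            # assumed sizes
inputs = Input(shape=(max_len,))
x = Embedding(vocab_size, 256, mask_zero=True)(inputs)    # token ids -> dense vectors
x = LSTM(256)(x)                                          # summarize the word sequence
outputs = Dense(vocab_size, activation='softmax')(x)      # next-word probabilities
model = Model(inputs, outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam')

dummy = np.random.randint(1, vocab_size, size=(2, max_len))
print(model.predict(dummy).shape)                         # (2, 5000): one distribution per input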

3.8 CNN - LSTM Architecture Model

The CNN-LSTM architecture involves using Convolutional Neural Network (CNN) layers for
feature extraction on the input data, combined with LSTMs to support sequence prediction.
CNN-LSTMs were developed for visual time-series prediction problems and for generating textual
descriptions from sequences of images (e.g., videos). Specifically, they address problems such as:


• Activity Recognition: generating a textual description of an activity demonstrated in a sequence
of images.
• Image Description: generating a textual description of a single image.
• Video Description: generating a textual description of a sequence of images.

This architecture was originally referred to as a Long-term Recurrent Convolutional Network
(LRCN) model, although we will use the more generic name "CNN-LSTM".
• The CNN is used for extracting features from the image; in this project the pre-trained VGG16
model is used.
• The LSTM uses the information from the CNN to help generate a description of the image.
A simplified sketch of this merged architecture follows below.
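The sketch below shows a minimal merge-style CNN-LSTM captioning model: a 4096-dimensional image feature vector (as produced by VGG16) and a partial caption are combined to predict the next word. The sizes are illustrative assumptions; the full model in Chapter 4 additionally uses bidirectional LSTMs and an attention mechanism.

from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size, max_len = 5000, 35                     # assumed values

img_in = Input(shape=(4096,))                      # CNN image features
img_vec = Dense(256, activation='relu')(Dropout(0.5)(img_in))

cap_in = Input(shape=(max_len,))                   # partial caption as token ids
cap_vec = LSTM(256)(Embedding(vocab_size, 256, mask_zero=True)(cap_in))

merged = add([img_vec, cap_vec])                   # fuse image and text context
hidden = Dense(256, activation='relu')(merged)
next_word = Dense(vocab_size, activation='softmax')(hidden)

model = Model(inputs=[img_in, cap_in], outputs=next_word)
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.summary()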

Fig 3.8.1 CNN - LSTM MODEL

3.9 Methodology

1. Import Libraries.
2. Upload Flickr8k Dataset. (Data Preprocessing).
3. Apply CNN to identify the objects in the image.
4. Preprocess and tokenize the captions (a short tokenization sketch follows this list).
5. Use LSTM to predict the next word of the sentence.
6. Make a Data Generator.
7. View Images with caption.
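Step 4 can be illustrated with the Keras Tokenizer used in Chapter 4. The sketch below uses two made-up example captions; the real project fits the tokenizer on all cleaned Flickr8k captions wrapped with startseq and endseq tokens.

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

captions = [
    'startseq a dog runs through the grass endseq',
    'startseq two children play on the beach endseq',
]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(captions)                  # build the word -> integer vocabulary
vocab_size = len(tokenizer.word_index) + 1

seqs = tokenizer.texts_to_sequences(captions)     # captions as integer sequences
max_len = max(len(s) for s in seqs)
padded = pad_sequences(seqs, maxlen=max_len)      # pad to a common length

print(vocab_size, max_len)
print(padded)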


Fig 3.9.1 SYSTEM ARCHITECTURE

The system architecture for the image captioning model comprises several key components.
Initially, it extracts image features using a pre-trained VGG16 model, which outputs high-level
image representations. These features are then used as inputs for a custom captioning model. The
captioning model consists of a dense layer with dropout to process image features, followed by an
LSTM-based sequence processing layer for the caption input. An attention mechanism is employed
to focus on relevant parts of the image features while generating captions. The model is trained
using categorical cross-entropy loss and the Adam optimizer, with data provided through a
generator that batches image features and tokenized captions.

After training, the model predicts captions for new images by generating one word at a time based
on the image features and previously generated words. The quality of generated captions is
evaluated using BLEU scores, which measure how closely the predicted captions match the ground
truth. The system also includes functions for visualizing the results by displaying the images along
with actual and predicted captions. This end-to-end pipeline enables automated generation and
evaluation of image descriptions.
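The BLEU-based evaluation mentioned above can be illustrated as follows. This is a minimal sketch with made-up captions: each predicted caption is scored against its set of reference captions using NLTK's corpus_bleu, as in the evaluation step of Chapter 4.

from nltk.translate.bleu_score import corpus_bleu

# One hypothesis with two reference captions (illustrative example data).
references = [[['a', 'dog', 'runs', 'through', 'the', 'grass'],
               ['a', 'brown', 'dog', 'is', 'running', 'outside']]]
predictions = [['a', 'dog', 'runs', 'on', 'the', 'grass']]

bleu1 = corpus_bleu(references, predictions, weights=(1.0, 0, 0, 0))
bleu2 = corpus_bleu(references, predictions, weights=(0.5, 0.5, 0, 0))
print(f"BLEU-1: {bleu1:.3f}, BLEU-2: {bleu2:.3f}")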


Fig 3.9.2 WORKFLOW DIAGRAM

The image captioning system begins with data collection and preparation, where images and
captions are gathered and preprocessed. Image features are extracted using a pre-trained VGG16
model, and captions are tokenized and cleaned. During model training, a data generator creates
batches of features and caption sequences for the LSTM-based captioning model, which includes an
attention mechanism. The trained model is then used to generate captions for new images, with
performance evaluated using BLEU scores. Finally, results are visualized by displaying images with
their actual and predicted captions, and the trained model along with the tokenizer is saved for
future use.


CHAPTER 4:
IMPLEMENTATION

4.1 Description of Implementation

The code implements an image captioning system using a combination of deep learning techniques
and natural language processing.

1. Setup and Libraries

The code begins by importing necessary libraries for image processing (`PIL`, `numpy`,
`matplotlib`), deep learning (`tensorflow`), and text processing (`nltk`). It sets up directories for
input (`INPUT_DIR`) and output (`OUTPUT_DIR`) data.

2. Image Feature Extraction

• VGG16 Model: Utilizes the VGG16 model pretrained on ImageNet for extracting image
features. The model's classification layer is removed to access the penultimate layer's output,
which serves as image features.
• ‘extract_image_features’ Function: Loads each image from the directory (`flickr8k/
Images`), preprocesses it to fit the VGG16 model requirements (`224x224` pixels,
preprocess_input), and extracts features using the truncated VGG16 model. Extracted features
are stored in a dictionary (`image_features`) with image IDs as keys.
• Storage: Extracted image features are stored using `pickle` in `img_features.pkl` for later use.

3. Caption Data Handling

• Captions Loading: Reads captions from `captions.txt`, where each line contains an image ID
and its associated captions.
• Cleaning Captions: Preprocesses captions by converting to lowercase, removing non-
alphabetical characters, trimming extra spaces, and adding start (`startseq`) and end (`endseq`)
tokens. Cleaned captions are stored in a dictionary (`image_to_captions_mapping`).


4. Tokenization

• Tokenizer: Tokenizes cleaned captions to convert words into integer tokens and build a
vocabulary (`tokenizer`). The tokenizer is saved in `tokenizer.pkl` for later use.
• Vocabulary Size: Determines the size of the vocabulary (`vocab_size`) and calculates the
maximum caption length (`max_caption_length`) among all captions.

5. Data Splitting

• Training and Testing Sets: Splits image IDs into training (`train_ids`) and testing (`test_ids`)
sets for model evaluation. By default, 90% of the data is used for training.

6. Data Generator

• ‘data_generator’ Function: Generates batches of data for model training and validation. It
iterates through image IDs, extracts image features and corresponding token sequences, and
yields batches (`X1_batch`, `X2_batch`, `y_batch`) for training the model.

7. Model Architecture

• Encoder-Decoder Architecture: Defines a deep learning model using Keras functional API:
- Encoder: Processes image features (`inputs1`) through dense and LSTM layers to extract
context (`fe2_projected`).
- Decoder: Takes token sequences (`inputs2`) through embedding and LSTM layers to
generate captions.

8. Model Training

• Training Loop: Trains the defined model over multiple epochs (`epochs`) using
`data_generator` for both training and validation data. It compiles the model with categorical
cross-entropy loss and Adam optimizer.
• Model Saving: Saves the trained model (`mymodel.h5`) in the `output_dir`.


9. Model Evaluation

• BLEU Score Calculation: Evaluates model performance on test data using BLEU-1 and
BLEU-2 scores (`corpus_bleu` from `nltk`). Compares actual vs. predicted captions to assess
captioning quality.

10. Caption Generation

• ‘generate_caption’ Function: Takes an image name (`image_name`), loads the corresponding


image and its actual captions, predicts captions using the trained model (`predict_caption`),
and displays both actual and predicted captions alongside the image.

11. Example Usage

• Example: Demonstrates how to use `generate_caption` to generate and display captions for a
sample image (“101669240_b2d3e7f17b.jpg”) after training the model.


4.2 Source Code

# Basic libraries
import os
import re  # used below to collapse extra whitespace while cleaning captions
import pickle
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
import warnings
warnings.filterwarnings('ignore')
from math import ceil
from collections import defaultdict
from tqdm.notebook import tqdm # Progress bar library for Jupyter Notebook

# Deep learning framework for building and training models


import tensorflow as tf
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Model
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.layers import Input, Dense, LSTM, Embedding, Dropout, concatenate,
Bidirectional, Dot, Activation, RepeatVector, Lambda

# For checking score


from nltk.translate.bleu_score import corpus_bleu

# Setting the input and output directory


INPUT_DIR = 'flickr8k'
OUTPUT_DIR = 'output_dir'
os.makedirs(OUTPUT_DIR, exist_ok=True)  # make sure the output directory exists


# Load the VGG16 model without the top classification layer


model = VGG16(weights='imagenet')
model = Model(inputs=model.inputs, outputs=model.layers[-2].output)

# Function to extract image features using VGG16


def extract_image_features(img_dir):
image_features = {}
for img_name in tqdm(os.listdir(img_dir)):
if img_name.startswith("."):
continue
img_path = os.path.join(img_dir, img_name)
image = load_img(img_path, target_size=(224, 224))
image = img_to_array(image)
image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
image = preprocess_input(image)
image_feature = model.predict(image, verbose=0)
image_id = img_name.split('.')[0]
image_features[image_id] = image_feature
return image_features

# Extract image features and save them using pickle


image_features = extract_image_features(os.path.join(INPUT_DIR, 'Images'))
pickle.dump(image_features, open(os.path.join(OUTPUT_DIR, 'img_features.pkl'), 'wb'))

# Load captions from file


with open(os.path.join(INPUT_DIR, 'captions.txt'), 'r') as file:
next(file)
captions_doc = file.read()

# Create a mapping of image IDs to captions


image_to_captions_mapping = defaultdict(list)
for line in tqdm(captions_doc.split('\n')):


tokens = line.split(',')
if len(tokens) < 2:
continue
image_id, *captions = tokens
image_id = image_id.split('.')[0]
caption = " ".join(captions)
image_to_captions_mapping[image_id].append(caption)

# Preprocess captions
def clean_captions(mapping):
for key, captions in mapping.items():
for i in range(len(captions)):
caption = captions[i]
caption = caption.lower()
caption = ''.join(char for char in caption if char.isalpha() or char.isspace())
caption = re.sub(r'\s+', ' ', caption).strip()  # collapse repeated whitespace
caption = 'startseq ' + ' '.join([word for word in caption.split() if len(word) > 1]) + ' endseq'
captions[i] = caption
return mapping

# Clean captions
clean_captions(image_to_captions_mapping)

# Create a list of all captions


all_captions = [caption for captions in image_to_captions_mapping.values() for caption in captions]

# Tokenize captions
tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_captions)

# Save tokenizer
with open('tokenizer.pkl', 'wb') as tokenizer_file:


pickle.dump(tokenizer, tokenizer_file)

# Calculate maximum caption length and vocabulary size


max_caption_length = max(len(tokenizer.texts_to_sequences([caption])[0]) for caption in
all_captions)
vocab_size = len(tokenizer.word_index) + 1
print("Vocabulary Size:", vocab_size)
print("Maximum Caption Length:", max_caption_length)

# Create lists of image IDs for training and testing


image_ids = list(image_to_captions_mapping.keys())
split = int(len(image_ids) * 0.90)
train_ids = image_ids[:split]
test_ids = image_ids[split:]

# Data generator function for model training


def data_generator( data_keys, image_to_captions_mapping, features, tokenizer,
max_caption_length, vocab_size, batch_size):
X1_batch, X2_batch, y_batch = [], [], []
batch_count = 0
while True:
for image_id in data_keys:
try:
captions = image_to_captions_mapping[image_id]
for caption in captions:
caption_seq = tokenizer.texts_to_sequences([caption])[0]
for i in range(1, len(caption_seq)):
in_seq, out_seq = caption_seq[:i], caption_seq[i]
in_seq = pad_sequences([in_seq], maxlen=max_caption_length)[0]
out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
X1_batch.append(features[image_id][0])
X2_batch.append(in_seq)


y_batch.append(out_seq)
batch_count += 1

if batch_count == batch_size:
yield ([np.array(X1_batch), np.array(X2_batch)], np.array(y_batch))
X1_batch, X2_batch, y_batch = [], [], []
batch_count = 0
except Exception as e:
print(f"Exception occurred: {e}")

# Define the model architecture


inputs1 = Input(shape=(4096,))
fe1 = Dropout(0.5)(inputs1)
fe2 = Dense(256, activation='relu')(fe1)
fe2_projected = RepeatVector(max_caption_length)(fe2)
fe2_projected = Bidirectional(LSTM(256, return_sequences=True))(fe2_projected)

inputs2 = Input(shape=(max_caption_length,))
se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
se2 = Dropout(0.5)(se1)
se3 = Bidirectional(LSTM(256, return_sequences=True))(se2)

attention = Dot(axes=[2, 2])([fe2_projected, se3])


attention_scores = Activation('softmax')(attention)
attention_context = Lambda(lambda x: tf.einsum('ijk,ijl->ikl', x[0], x[1]))([attention_scores, se3])
context_vector = Lambda(lambda x: tf.reduce_sum(x, axis=1))(attention_context)

decoder_input = concatenate([context_vector, fe2], axis=-1)


decoder1 = Dense(256, activation='relu')(decoder_input)
outputs = Dense(vocab_size, activation='softmax')(decoder1)
model = Model(inputs=[inputs1, inputs2], outputs=outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam')


# Train the model


epochs = 50
batch_size = 32
steps_per_epoch = ceil(len(train_ids) / batch_size)
validation_steps = ceil(len(test_ids) / batch_size)

for epoch in range(epochs):


print(f"Epoch {epoch+1}/{epochs}")
train_generator = data_generator(train_ids, image_to_captions_mapping, image_features,
tokenizer, max_caption_length, vocab_size, batch_size)
test_generator = data_generator(test_ids, image_to_captions_mapping, image_features,
tokenizer, max_caption_length, vocab_size, batch_size)
model.fit(train_generator, epochs=1, steps_per_epoch=steps_per_epoch,
validation_data=test_generator, validation_steps=validation_steps, verbose=1)

# Save the model


model.save(os.path.join(OUTPUT_DIR, 'mymodel.h5'))

# Function to predict captions


def predict_caption(model, image_features, tokenizer, max_caption_length):
caption = 'startseq'
for _ in range(max_caption_length):
sequence = tokenizer.texts_to_sequences([caption])[0]
sequence = pad_sequences([sequence], maxlen=max_caption_length)
yhat = model.predict([image_features, sequence], verbose=0)  # image_features already has shape (1, 4096)
predicted_index = np.argmax(yhat)
predicted_word = get_word_from_index(predicted_index, tokenizer)
if predicted_word is None or predicted_word == 'endseq':
    break
caption += " " + predicted_word
return caption


# Function to get word from tokenizer index


def get_word_from_index(index, tokenizer):
return next((word for word, idx in tokenizer.word_index.items() if idx == index), None)

# Evaluate the model using BLEU scores


actual_captions_list = []
predicted_captions_list = []
for key in tqdm(test_ids):
actual_captions = image_to_captions_mapping[key]
predicted_caption = predict_caption(model, image_features[key], tokenizer,
max_caption_length)
actual_captions_words = [caption.split() for caption in actual_captions]
predicted_caption_words = predicted_caption.split()
actual_captions_list.append(actual_captions_words)
predicted_captions_list.append(predicted_caption_words)

print("BLEU-1: %f" % corpus_bleu(actual_captions_list, predicted_captions_list, weights=(1.0, 0,


0, 0)))
print("BLEU-2: %f" % corpus_bleu(actual_captions_list, predicted_captions_list, weights=(0.5,
0.5, 0, 0)))

# Function to generate and display captions for an image


def generate_caption(image_name):
image_id = image_name.split('.')[0]
img_path = os.path.join(INPUT_DIR, "Images", image_name)
image = Image.open(img_path)
captions = image_to_captions_mapping[image_id]
print('---------------------Actual --------------------- ')
for caption in captions:
print(caption)
y_pred = predict_caption(model, image_features[image_id], tokenizer, max_caption_length)
print('--------------------Predicted -------------------- ')


print(y_pred)
plt.imshow(image)
plt.axis('off')
plt.show()

# Example usage
generate_caption("101669240_b2d3e7f17b.jpg")


CHAPTER 5:
RESULT

The project aims to develop an Image Caption Generator using Convolutional Neural Networks
(CNNs) for image feature extraction and Long Short-Term Memory networks (LSTMs) for
generating descriptive captions. This fusion of computer vision and natural language processing
enables machines to interpret visual content and produce coherent sentences in natural language.
Applications include enhancing accessibility for visually impaired individuals through real-time
image description, automating social media content by automatically captioning visual posts, and
supporting industries such as autonomous vehicles and surveillance systems.

The methodology involves utilizing CNNs to extract meaningful features from images and
employing LSTMs to predict and generate captions. The CNN layers process images to identify
objects, while LSTM sequences these features into understandable sentences. The project's scope
covers model development, application deployment, and ethical considerations like data privacy
and bias mitigation in AI technologies. By leveraging deep learning techniques like CNNs and
LSTMs, the project aims to advance human-computer interaction capabilities in understanding and
processing visual information effectively.

Fig 5.1 REPRESENTATION OF IMAGE CAPTIONING


SNAPSHOTS
The output with Actual and Predicted Captions is as follows:

Figure 1: Output 1

Figure 2: Output 2


Figure 3: Output 3

Figure 4: Output 4


CONCLUSION

The integration of CNNs and LSTMs in Image Caption Generator models marks a significant
advancement at the intersection of computer vision and natural language processing. CNNs excel in
extracting detailed features from images, while LSTMs effectively organize these features into
coherent sentences, enabling machines to describe visual content accurately in natural language.

This technology offers practical benefits across various fields. It enhances accessibility by
providing real-time image descriptions for visually impaired individuals and automates social media
interactions through meaningful image captions. Industries such as autonomous vehicles and
surveillance systems also leverage its ability to interpret and articulate visual scenes with precision.

However, ethical considerations, such as safeguarding data privacy, mitigating biases in captioning,
and ensuring responsible AI deployment, are critical for fostering trust and ensuring equitable
outcomes in society.

In summary, this project highlights the transformative potential of deep learning in improving
human-computer interaction and advancing capabilities in visual information processing. Future
research should focus on enhancing model efficiency, refining performance metrics, and addressing
ethical implications to broaden the application and societal acceptance of AI technologies.


REFERENCES

[1] R. Subash (November 2019): Automatic Image Captioning Using Convolution Neural Networks and LSTM.

[2] Seung-Ho Han, Ho-Jin Choi (2020): Domain-Specific Image Caption Generator with Semantic Ontology.

[3] Pranay Mathur, Aman Gill, Aayush Yadav, Anurag Mishra and Nand Kumar Bansode (2017): Camera2Caption: A Real-Time Image Caption Generator.

[4] Simao Herdade, Armin Kappeler, Kofi Boakye, Joao Soares (June 2019): Image Captioning: Transforming Objects into Words.

[5] Manish Raypurkar, Abhishek Supe, Pratik Bhumkar, Pravin Borse, Dr. Shabnam Sayyad (March 2021): Deep Learning-Based Image Caption Generator.

[6] Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan (2015): Show and Tell: A Neural Image Caption Generator.

[7] Jianhui Chen, Wenqiang Dong, Minchen Li (2015): Image Caption Generator Based on Deep Neural Networks.

[8] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang (2017): Bottom-Up and Top-Down Attention for Image Captioning.

[9] Jyoti Aneja, Aditya Deshpande, and Alexander G. Schwing (2018): Convolutional Image Captioning.

[10] Shuang Bai and Shan An (2018): A Survey on Automatic Image Caption Generation.

